Google Cloud Storage
About
Google Cloud Storage (GCS) is a cloud storage service provided by Google Cloud Platform (GCP) that offers object storage for live or archived data. It is highly scalable and secure, making it suitable for a wide range of applications including storing large unstructured data sets, archival and disaster recovery, and serving website content.
Features
- High Durability: GCS offers high data durability through redundancy and replication.
- Scalability: It seamlessly scales to handle large amounts of data.
- Security: Provides robust security features including fine-grained access controls and encryption at rest and in transit.
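As an illustrative (not authoritative) sketch of the access-control side, the snippet below uses the google-cloud-storage Python client to enforce uniform bucket-level access and grant a read-only IAM role. The bucket name, key path, and service-account email are placeholders, not values from this page.

# Sketch only: bucket name, key path, and member email are assumed placeholders.
from google.cloud import storage

client = storage.Client.from_service_account_json("account-key/storage-key.json")
bucket = client.get_bucket("example-bucket")  # Hypothetical bucket name.

# Enforce bucket-level (rather than per-object ACL) access control.
bucket.iam_configuration.uniform_bucket_level_access_enabled = True
bucket.patch()

# Grant read-only access to a (hypothetical) service account via IAM.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({
    "role": "roles/storage.objectViewer",
    "members": {"serviceAccount:reader@example-project.iam.gserviceaccount.com"},
})
bucket.set_iam_policy(policy)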
Use Cases
- Data Storage: For storing files, backups, and large datasets.
- Data Archiving: Long-term archival of data, including integration with Google's Coldline storage for cost-effectiveness.
- Static Website Hosting: Hosting static websites directly from storage buckets.
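Both the archiving and website-hosting cases can be configured from the Python client. The sketch below is illustrative only; the bucket name, object path, and key path are assumed placeholders.

# Sketch only: bucket name, object path, and key path are assumed placeholders.
from google.cloud import storage

client = storage.Client.from_service_account_json("account-key/storage-key.json")
bucket = client.get_bucket("example-bucket")

# Data archiving: move an existing object to the cheaper Coldline storage class.
blob = bucket.blob("backups/2023-archive.tar.gz")
blob.update_storage_class("COLDLINE")

# Static website hosting: serve index.html / 404.html for web requests
# (the bucket's objects must also be made publicly readable).
bucket.configure_website(main_page_suffix="index.html", not_found_page="404.html")
bucket.patch()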
Interacting with GCS using Python
You can interact with GCS programmatically in many languages, including python.
- Google Cloud Client Library for Python: Use the `google-cloud-storage` library to interact with GCS.
- Authentication: Typically done via service accounts. Securely manage and use credentials for GCP.
- Operations: Common operations include creating and managing storage buckets, uploading and downloading files, and setting file metadata.
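As a rough sketch of those operations (the key path, bucket name, and file names below are assumed placeholders, not values from this page):

# Sketch only: key path, bucket name, and file names are assumed placeholders.
from google.cloud import storage

client = storage.Client.from_service_account_json("account-key/storage-key.json")

# Create a bucket (bucket names must be globally unique).
bucket = client.create_bucket("example-unique-bucket-name")

# Upload a local file as an object ("blob") and set custom metadata on it.
blob = bucket.blob("docs/example.txt")
blob.upload_from_filename("example.txt")
blob.metadata = {"source": "wiki-example"}
blob.patch()

# Download the object back and list everything in the bucket.
blob.download_to_filename("example-copy.txt")
for b in client.list_blobs("example-unique-bucket-name"):
    print(b.name, b.size)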
Installation and Setup
- Install the library using pip: `pip install google-cloud-storage`.
- Set up authentication by creating a service account in GCP and downloading the JSON key file.
- Use the client library in Python scripts to interact with GCS.
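A minimal sketch of the authentication step, assuming the JSON key was saved to a hypothetical "account-key/storage-key.json" path: either point the standard GOOGLE_APPLICATION_CREDENTIALS environment variable at the key, or pass the path explicitly.

# Sketch only: the key path is an assumed placeholder.
import os
from google.cloud import storage

# Option 1: use the standard environment variable and the default constructor.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "account-key/storage-key.json"
client = storage.Client()

# Option 2: pass the service account key file explicitly.
client = storage.Client.from_service_account_json("account-key/storage-key.json")

# Quick sanity check: list the buckets visible to this service account.
for bucket in client.list_buckets():
    print(bucket.name)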
Code Examples: Downloading a File from GCS (Python)
See: Download files from Google Storage with Python script
Simple download.
# Simple script to download an object (file) from Google Cloud Storage (GCS).
#
# NOTE: To get this working you'll want to set up your own GCS account
# with a file to download and a service key to access it. Once set up, it
# will download to a "tmp/" folder on your local/running machine.
#
# INSTRUCTIONS:
# (1) To create a file to download:
#   * Sign into Google Cloud Console (https://console.cloud.google.com/)
#     (WARNING: You may need to use a credit card to sign up for the free trial.)
# * Create a "New Project".
# * On the left-hand side menu go to "Cloud Storage" > "Buckets".
# * Click "+ Create" (bucket) and give it a name (eg: "noske-test-datasets").
# * Click "Upload Files" and drag in any file to upload it (eg: "train-00000-of-00001.parquet").
#
# (2) To create a service account key:
# * Sign into Google Cloud Console (https://console.cloud.google.com/)
# * Choose the GCP project (drop-down list at the top of the console).
# * On the left-hand side menu, go to "IAM & Admin" > "Service Accounts".
#   * Click "Create Service Account", give it a name and description, and click "Create".
# * Assign the necessary roles to the service account (e.g., "Storage Object Viewer"
# or "Cloud Storage Admin" for accessing GCS objects). Click "Continue".
#   * In the "Keys" tab, click "Add Key", choose "JSON", and it will download a .json key file.
#   * Move the .json into a code subdir (eg: "account-key/storage-key.json").
#
# (3) Update the global constants below (BUCKET_NAME, OBJECT_NAME)
#
# (4) Set up the project as:
#       account-key/     (copy your access key here)
#       tmp/             (starts empty)
#       gcs_loader.py    (this file)
BUCKET_NAME = 'noske-test-datasets' # (1) Set to the name of your GCS bucket.
OBJECT_NAME = 'train-00000-of-00001.parquet' # (1) Set to the name of your file in the bucket.
LOCAL_FILE_DIR = 'tmp'
LOCAL_FILE_PATH = LOCAL_FILE_DIR + '/' + OBJECT_NAME
SERVICE_KEY_ACCOUNT = 'account-key/noske-test-gcp-storage-account-key.json' # (2) Set to your downloaded service key.
from google.cloud import storage
def download_gcs_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    storage_client = storage.Client.from_service_account_json(SERVICE_KEY_ACCOUNT)
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)
    print(f'Blob {source_blob_name} downloaded to {destination_file_name}.')
print(f'Attempting to download {BUCKET_NAME}/{OBJECT_NAME} from GCS to {LOCAL_FILE_PATH}... \n')
download_gcs_blob(BUCKET_NAME, OBJECT_NAME, LOCAL_FILE_PATH)
Downloading a HuggingFace Dataset from GCS.
# Simple script to download a dataset from a GCS path.
#
# For instance the Google Cloud Storage (GCS) path is:
# 'gs://noske-test-datasets/subfolder/train.csv'
#
# Instructions: See `gcs_loader.py`.
GCS_PATH = 'gs://noske-test-datasets/subfolder/train.csv' # Set to "gs://bucket-name/path/to/file".
FILETYPE = 'csv' # Set to "csv" or "parquet" as appropriate.
TEMP_FILE_PATH = 'tmp/downloaded_file'
SERVICE_KEY_ACCOUNT = 'account-key/noske-test-gcp-storage-account-key.json' # Set to your downloaded service key.
from datasets import DatasetDict, load_dataset
from google.cloud import storage
def load_dataset_from_gcs(path: str, filetype: str) -> DatasetDict:
    """Downloads a file from GCS and loads it as a dataset.
    NOTE: The `path` must be in "GCS path" format
    (eg: "gs://noske-test-datasets/subfolder/train.csv");
    some basic error checking is done on the path.
    """
    # Check the path is as expected.
    assert filetype in ["csv", "parquet"], f"Filetype must be csv or parquet. Received: {filetype}"
    assert path.startswith("gs://"), f"Path must start with 'gs://'. Path must be in 'GCS path' format (eg: 'gs://bucket-name/path/to/file'). Received: {path}"
    end_bucket_idx = path.find('/', 5)
    assert end_bucket_idx > 0, f"Path must include a bucket-name. Path must be in 'GCS path' format (eg: 'gs://bucket-name/path/to/file'). Received: {path}"
    gcs_bucket_name = path[5:end_bucket_idx]
    gcs_file_path = path[end_bucket_idx+1:]
    print(f"gcs_bucket_name = '{gcs_bucket_name}', gcs_file_path = '{gcs_file_path}'")
    # Connect to the client.
    storage_client = storage.Client.from_service_account_json(SERVICE_KEY_ACCOUNT)
    # Download to a temp file.
    bucket = storage_client.bucket(gcs_bucket_name)
    blob = bucket.blob(gcs_file_path)
    blob.download_to_filename(TEMP_FILE_PATH)
    print(f"GCS file '{gcs_file_path}' in bucket '{gcs_bucket_name}' copied SUCCESSFULLY to '{TEMP_FILE_PATH}'.")
    # Load the dataset.
    return load_dataset(filetype, data_files=TEMP_FILE_PATH)
# Download dataset:
print(f'Attempting to download {FILETYPE} dataset from {GCS_PATH}... \n')
dataset = load_dataset_from_gcs(GCS_PATH, FILETYPE)
print(dataset) # Print format of dataset (eg: "DatasetDict({ train: Dataset({ features: ['id', 'review', ...], num_rows: 7 })})").
print(dataset['train'][-1]) # Print last row of dataset (eg: "{'id': 123, 'review': 'TEST 1', ...}")
Code Examples: Saving a File to GCS (Python)
Simple save of a parquet file to GCS.
# Demo script to save an object (file) to a GCS path.
#
# NOTE: In this case we save a .parquet file but it could be anything.
# NOTE: As long as the bucket-name is right, it should create subfolders
# as needed.
#
# For instance, the resulting Google Cloud Storage (GCS) path would be:
#   'gs://noske-test-datasets/savefolder/new-train.parquet'
#
# Instructions: See `gcs_loader.py`.
BUCKET_NAME = 'noske-test-datasets' # Set to the name of your GCS bucket.
SAVE_OBJECT_PATH = 'savefolder/new-train.parquet' # Set to the name of your desired file path in the bucket.
TEMP_LOCAL_FILEPATH = 'tmp/newdata'
SERVICE_KEY_ACCOUNT = 'account-key/noske-test-gcp-storage-account-key.json' # Set to your downloaded service key.
from google.cloud import storage
from datasets import Dataset
# Function to create a sample dataset and save it as a parquet file.
def save_sample_dataset_to_local(local_path):
    data = {"column1": [1, 2, 3], "column2": ["one", "two", "three"]}
    dataset = Dataset.from_dict(data)
    dataset.to_parquet(local_path)
    print(f"Temp local file created at '{local_path}'.")
# Function to upload a file to GCS.
def upload_to_gcs(local_path, bucket_name, save_path):
    """Uploads a file to the bucket."""
    # SERVICE_KEY_ACCOUNT must point to your downloaded GCS service account key.
    storage_client = storage.Client.from_service_account_json(SERVICE_KEY_ACCOUNT)
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(save_path)
    if blob.exists():
        print(f"WARNING: The file {save_path} already exists in the bucket and will be "
              "overwritten - add logic here if you want to allow or deny overwrites.")
    blob.upload_from_filename(local_path)
    print(f"File '{local_path}' uploaded to '{save_path}'.")
# Create a sample dataset and save as a parquet file
save_sample_dataset_to_local(TEMP_LOCAL_FILEPATH)
# Upload the file to GCS
upload_to_gcs(TEMP_LOCAL_FILEPATH, BUCKET_NAME, SAVE_OBJECT_PATH)