Ctrl K

BigQuery to Cloud Storage Archive

Archive selected BigQuery tables to Cloud Storage as Parquet, then validate that the exports can be downloaded and read back.

Archive BigQuery tables to a Cloud Storage bucket as Parquet files. This is the flow used before deleting tables from BigQuery: export, confirm the objects exist, read them back locally, and only then remove the original tables. Assumes gcloud, the project venv, and ADC auth are already set up. See the gcloud and BigQuery setup page if not.

Placeholders used below

Replace these before running any cell. The archive path shown here is the convention used throughout the rest of the page.

  • YOUR_PROJECT_ID: GCP project ID
  • YOUR_DATASET_ID: BigQuery dataset to archive
  • YOUR_ARCHIVE_BUCKET: Cloud Storage bucket for archives
  • TABLE_NAME_1, TABLE_NAME_2: tables to export
  • YYYY-MM-DD: snapshot date for the archive path
# variables to construct cloud storage paths:
gs://YOUR_ARCHIVE_BUCKET/bigquery/YOUR_DATASET_ID/TABLE_NAME/snapshot_date=YYYY-MM-DD/data.parquet

Install Cloud Storage client

The BigQuery client alone is enough to submit an export job to Cloud Storage, but the Storage client is needed to create the bucket and to download the files back for validation.

pip install google-cloud-storage

Verify Python auth and project

Confirm the notebook kernel sees the right project and ADC credentials before running any export.

import os, sys
from google.auth import default

print(sys.executable)
print("GOOGLE_CLOUD_PROJECT:", os.getenv("GOOGLE_CLOUD_PROJECT"))
print("CLOUDSDK_CONFIG:", os.getenv("CLOUDSDK_CONFIG"))

creds, project = default()
print("ADC project:", project)
print("Credentials type:", type(creds).__name__)

Create or reuse the archive bucket

Bucket names are globally unique across all of GCP. The bucket location should match the BigQuery dataset location so export, load, and external-table operations stay in the same region. Common values: EU, US, europe-west1.

from google.cloud import storage

project_id = "YOUR_PROJECT_ID"
bucket_name = "YOUR_ARCHIVE_BUCKET"
bucket_location = "EU"

storage_client = storage.Client(project=project_id)
bucket = storage_client.bucket(bucket_name)

print("Exists before create:", bucket.exists())

if not bucket.exists():
    bucket = storage_client.create_bucket(bucket_name, location=bucket_location)
    print("Created bucket:", bucket.name)
else:
    print("Bucket already exists:", bucket.name)

Check the dataset location

Export jobs require the source dataset and the destination bucket to be in compatible locations. Read the dataset location so the bucket region can be matched to it.

from google.cloud import bigquery

project_id = "YOUR_PROJECT_ID"
dataset_id = "YOUR_DATASET_ID"

bq_client = bigquery.Client(project=project_id)
dataset = bq_client.get_dataset(f"{project_id}.{dataset_id}")

print("Dataset location:", dataset.location)

Export tables to Cloud Storage

Iterate over the tables and submit one extract job each. Each job writes a single Parquet file to the snapshot-dated path. For large tables, replace data.parquet with a wildcard pattern like data*.parquet to let BigQuery shard the output.

project_id = "YOUR_PROJECT_ID"
dataset_id = "YOUR_DATASET_ID"
bucket_name = "YOUR_ARCHIVE_BUCKET"
snapshot_date = "YYYY-MM-DD"

tables = [
    "TABLE_NAME_1",
    "TABLE_NAME_2",
]

bq_client = bigquery.Client(project=project_id)

for table_id in tables:
    source_table = f"{project_id}.{dataset_id}.{table_id}"
    destination_uri = (
        f"gs://{bucket_name}/bigquery/{dataset_id}/{table_id}"
        f"/snapshot_date={snapshot_date}/data.parquet"
    )

    extract_job = bq_client.extract_table(
        source_table,
        destination_uri,
        job_config=bigquery.job.ExtractJobConfig(destination_format="PARQUET"),
    )
    extract_job.result()

    print("Export completed")
    print(source_table)
    print(destination_uri)
    print()

Validate archived Parquet files

Download each archived Parquet back to /tmp, read it with pandas, and print its shape and first rows. This is a debug/validation step only.

from google.cloud import storage
from pathlib import Path
import pandas as pd

project_id = "YOUR_PROJECT_ID"
dataset_id = "YOUR_DATASET_ID"
bucket_name = "YOUR_ARCHIVE_BUCKET"
snapshot_date = "YYYY-MM-DD"
download_root = Path(f"/tmp/{dataset_id}_archive_validation")

tables = [
    "TABLE_NAME_1",
    "TABLE_NAME_2",
]

storage_client = storage.Client(project=project_id)
bucket = storage_client.bucket(bucket_name)
download_root.mkdir(parents=True, exist_ok=True)

for table_id in tables:
    object_name = (
        f"bigquery/{dataset_id}/{table_id}"
        f"/snapshot_date={snapshot_date}/data.parquet"
    )
    local_path = download_root / f"{table_id}.parquet"

    blob = bucket.blob(object_name)
    blob.download_to_filename(str(local_path))

    df = pd.read_parquet(local_path)

    print("=" * 80)
    print(f"TABLE: {table_id}")
    print(f"OBJECT: gs://{bucket_name}/{object_name}")
    print(f"LOCAL FILE: {local_path}")
    print(f"SHAPE: {df.shape}")
    print(df.head(3))

Deletion checklist

Only delete the original BigQuery tables after every item below is confirmed. Deletion itself can be done from the BigQuery UI.

  • Every export job completed without error.
  • The expected objects exist in Cloud Storage at the snapshot_date path.
  • Each Parquet file downloaded and read back with pandas.
  • Shape and sample rows look correct.
  • No dashboard, scheduled query, app, or notebook still depends on the original BigQuery tables.

Resources