Hive-Partitioned Parquet Datasets

Sometimes you want to work with a large set of data from EarthScope’s APIs and store it locally so you don’t have to download it repeatedly.

One powerful way to download and store that data is to use PyArrow Datasets.

In this example, we use the SDK to download larger-than-memory data and store it in hive-partitioned parquet on our local filesystem. At that point, we can use myriad tools to interact with the “dataset.”

import datetime as dt

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds

from earthscope_sdk import AsyncEarthScopeClient

es = AsyncEarthScopeClient()

First, plan the entire desired dataset and (optionally) group and/or sort the query plan to control execution order.

plan = await es.data.gnss_observations(
    start_datetime=dt.datetime(2025, 7, 20),
    end_datetime=dt.datetime(2025, 7, 27),
    network_name="PERM:Alaska",
    session_name="A",
    field=[
        "phase",
        "range",
        "snr",
    ],
    meta_fields=["4charid"],
).plan()

# (optional) sort & group the plan to control the download order
plan.group_by_day()
plan
AsyncGnssObservationsQueryPlan(requests=42, groups=7)

Next, define your desired partitioning schema.

The following partition scheme will produce a directory structure like:

  • /year=<year>/month=<month>/day=<day>/000000-0.parquet

part = ds.partitioning(
    pa.schema(
        [
            ("year", pa.int16()),
            ("month", pa.int16()),
            ("day", pa.int16()),
        ],
    ),
    flavor="hive",
)
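
As an optional sanity check, the partitioning object can translate a path segment back into the filter expression PyArrow uses for partition pruning. This is a standard Partitioning.parse call; the exact printed form of the expression may vary by PyArrow version.

print(part.parse("/year=2025/month=7/day=20"))
# e.g. ((year == 2025) and (month == 7)) and (day == 20)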

Finally, we are ready to iterate over our QueryPlan in smaller-than-memory batches, writing each batch to disk before retrieving the next batch.

output_path = "data/my_dataset"

# Iterate over the plan in batches to avoid loading the entire result set into memory at once.
idx = 0
async for table in plan:
    print(
        f"Table size: {len(table):,} rows, uncompressed size {table.get_total_buffer_size() / 1024**2:.2f} MB"
    )

    # Add the derived columns we will use for partitioning
    table = table.append_column("year", pc.year(table["timestamp"]))
    table = table.append_column("month", pc.month(table["timestamp"]))
    table = table.append_column("day", pc.day(table["timestamp"]))

    # (optional) Sort the table in desired order
    table = table.sort_by(
        [
            ("4charid", "ascending"),
            ("system", "ascending"),
            ("satellite", "ascending"),
            ("obs_code", "ascending"),
            ("timestamp", "ascending"),
        ]
    )

    # Write this table to the dataset. The batch index in
    # basename_template keeps files from different batches from
    # colliding within the same partition directory.
    ds.write_dataset(
        table,
        output_path,
        format="parquet",
        partitioning=part,
        existing_data_behavior="overwrite_or_ignore",
        basename_template=f"{idx:06d}-{{i}}.parquet",
    )

    idx += 1
Table size: 5,086,146 rows, uncompressed size 240.42 MB
Table size: 5,085,276 rows, uncompressed size 237.39 MB
Table size: 5,048,954 rows, uncompressed size 238.21 MB
Table size: 5,070,092 rows, uncompressed size 236.06 MB
Table size: 5,099,502 rows, uncompressed size 240.94 MB
Table size: 5,041,887 rows, uncompressed size 237.13 MB
Table size: 5,016,814 rows, uncompressed size 235.71 MB
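
Before inspecting files on disk, we can open the dataset and count rows as a quick integrity check; Dataset.count_rows reads only Parquet footer metadata, so it is cheap. (This verification step is an addition to the workflow above, not part of the download itself.)

dataset = ds.dataset(output_path, format="parquet", partitioning=part)

# Should match the total rows downloaded above (35,448,671 for this run)
print(f"{dataset.count_rows():,} rows")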

Finally, listing the contents of the output directory confirms the partition layout.

import os

paths = []
for root, _, files in os.walk(output_path):
    for f in files:
        if f.endswith(".parquet"):
            paths.append(os.path.join(root, f))

for path in sorted(paths):
    size = os.path.getsize(path)
    print(f"{path} - {size / 1024**2:.2f} MB")
data/my_dataset/year=2025/month=7/day=20/000000-0.parquet - 99.62 MB
data/my_dataset/year=2025/month=7/day=21/000001-0.parquet - 99.60 MB
data/my_dataset/year=2025/month=7/day=22/000002-0.parquet - 98.54 MB
data/my_dataset/year=2025/month=7/day=23/000003-0.parquet - 99.26 MB
data/my_dataset/year=2025/month=7/day=24/000004-0.parquet - 99.86 MB
data/my_dataset/year=2025/month=7/day=25/000005-0.parquet - 98.55 MB
data/my_dataset/year=2025/month=7/day=26/000006-0.parquet - 98.04 MB
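
From here, any tool that understands Hive-partitioned Parquet (PyArrow, pandas, Polars, DuckDB, Spark, etc.) can query the dataset. As a minimal PyArrow sketch, filtering on a partition column prunes the scan to the matching directories:

dataset = ds.dataset(output_path, format="parquet", partitioning=part)

# Only year=2025/month=7/day=21 is read thanks to partition pruning
day_table = dataset.to_table(filter=(pc.field("day") == 21))
print(f"{len(day_table):,} rows for 2025-07-21")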