
The EarthScope SDK returns data as Apache Arrow tables for high performance and easy integration with data science tools.

What is Apache Arrow?

Apache Arrow is a cross-language development platform for in-memory columnar data. It provides:

  - A standardized columnar memory format that is identical across languages (Python, R, Java, C++, and more)
  - Zero-copy data sharing between libraries and processes
  - Fast analytical operations on large datasets
Why Does the SDK Use Arrow?

The EarthScope SDK returns data as PyArrow Tables because:

  1. Efficient streaming: APIs can stream Arrow data directly without serialization overhead

  2. Memory efficient: Columnar format is compact and cache-friendly

  3. Fast: No parsing required (unlike CSV or JSON)

  4. Flexible: Easily convert to your preferred format (Pandas, Polars, DuckDB, etc.)

Converting Arrow Tables

The SDK query plans return PyArrow Table objects, which you can easily convert to your preferred data science library:

Pandas

import pandas as pd

table = client.data.gnss_observations(...).fetch()
df = table.to_pandas()  # Creates a pandas DataFrame

Polars

import polars as pl

table = client.data.gnss_observations(...).fetch()
df = pl.from_arrow(table)  # Creates a polars DataFrame

DuckDB

import duckdb

table = client.data.gnss_observations(...).fetch()

# Query the Arrow table directly with SQL. DuckDB resolves the name to the
# Python variable `table`; quote it, since TABLE is a SQL keyword.
result = duckdb.query("""
    SELECT satellite, COUNT(*) AS obs_count
    FROM "table"
    GROUP BY satellite
    ORDER BY obs_count DESC
""").to_df()

NumPy

import numpy as np

table = client.data.gnss_observations(...).fetch()

# Convert specific columns to numpy arrays
timestamps = table['timestamp'].to_numpy()
snr_values = table['snr'].to_numpy()

Direct Arrow Operations

You can also work directly with Arrow tables using PyArrow’s compute functions:

import pyarrow.compute as pc

table = client.data.gnss_observations(...).fetch()

# Filter data
filtered = table.filter(pc.field('snr') > 40)

# Compute statistics
mean_snr = pc.mean(table['snr'])

Working with Multiple Batches

When iterating through query plans, each batch is an Arrow table:

import polars as pl

# Process each day as a Polars DataFrame
for daily_table in plan.group_by_day():
    # Convert to Polars
    df = pl.from_arrow(daily_table)

    # Process with Polars
    result = df.group_by('satellite').agg([
        pl.col('snr').mean().alias('avg_snr'),
        pl.col('snr').std().alias('std_snr'),
    ])

    print(result)

Best Practices

Filter early

Reduce data size before conversion:

# GOOD: Filter at API level
table = client.data.gnss_observations(
    system="G",        # Only GPS
    field="snr",       # Only SNR
    ...
).fetch()

# LESS EFFICIENT: Fetch everything, filter later
table = client.data.gnss_observations(...).fetch()
df = pl.from_arrow(table).filter(pl.col("system") == "G").select("snr")

Avoid unnecessary conversions

Work directly with Arrow when you can:

# GOOD: Direct Arrow operations
mean = pc.mean(table['snr'])

# LESS EFFICIENT: Convert just for one calculation
df = table.to_pandas()
mean = df['snr'].mean()

Conversion Reference

Quick reference for common conversions:

import polars as pl
import pandas as pd
import duckdb
import pyarrow.compute as pc

# Get Arrow table
table = client.data.gnss_observations(...).fetch()

# Convert to different formats
df_pandas = table.to_pandas()                 # Pandas
df_polars = pl.from_arrow(table)              # Polars
result = duckdb.query('SELECT * FROM "table"')  # DuckDB (quote the SQL keyword)
array = table['snr'].to_numpy()               # NumPy

# Direct Arrow operations
filtered = table.filter(pc.field('snr') > 40)
mean_snr = pc.mean(table['snr'])

Next Steps