Working with Apache Arrow#

The EarthScope SDK returns data as Apache Arrow tables for high performance and easy integration with data science tools.

What is Apache Arrow?#

Apache Arrow is a cross-language development platform for in-memory columnar data. It provides:

  • Standardized format: A common memory representation that works across languages (Python, R, Java, etc.)

  • Zero-copy reads: Data can be shared between libraries without copying

  • High performance: Optimized for modern CPU architectures

  • Rich ecosystem: Native integration with data science tools

Why Does the SDK Use Arrow?#

The EarthScope SDK returns data as PyArrow Tables because:

  1. Efficient streaming: APIs can stream Arrow data directly, because the wire format matches the in-memory layout and adds little serialization overhead

  2. Memory efficient: Columnar format is compact and cache-friendly

  3. Fast: No parsing required (unlike CSV or JSON)

  4. Flexible: Easily convert to your preferred format (Pandas, Polars, DuckDB, etc.)

Converting Arrow Tables#

SDK query plans return PyArrow Table objects, which you can convert for use with your preferred data science library:

Pandas#

import pandas as pd

table = client.data.gnss_observations(...).fetch()
df = table.to_pandas()  # Creates a pandas DataFrame

Note

Learn more: See Pandas docs for more information and usage details.

Polars#

import polars as pl

table = client.data.gnss_observations(...).fetch()
df = pl.from_arrow(table)  # Creates a polars DataFrame

Note

Learn more: See Polars docs for more information and usage details.

DuckDB#

import duckdb

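# "table" is a reserved word in SQL, so bind the result to a different name
observations = client.data.gnss_observations(...).fetch()

# Query the Arrow table directly with SQL; DuckDB looks up the Python variable by name
result = duckdb.query("""
    SELECT satellite, COUNT(*) AS obs_count
    FROM observations
    GROUP BY satellite
    ORDER BY obs_count DESC
""").to_df()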

Note

Learn more: See DuckDB docs for more information and usage details.

NumPy#

import numpy as np

table = client.data.gnss_observations(...).fetch()

# Convert specific columns to numpy arrays
timestamps = table['timestamp'].to_numpy()
snr_values = table['snr'].to_numpy()

Note

Learn more: See NumPy docs for more information and usage details.

Direct Arrow Operations#

You can also work directly with Arrow tables using PyArrow’s compute functions:

import pyarrow.compute as pc

table = client.data.gnss_observations(...).fetch()

# Filter data
filtered = table.filter(pc.field('snr') > 40)

# Compute statistics
mean_snr = pc.mean(table['snr'])

Working with Multiple Batches#

When iterating through query plans, each batch is an Arrow table:

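import polars as pl

# Build the query plan without fetching (the same call as above, minus .fetch())
plan = client.data.gnss_observations(...)

# Process each day as a Polars DataFrame
for daily_table in plan.group_by_day():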
    # Convert to Polars
    df = pl.from_arrow(daily_table)

    # Process with Polars
    result = df.group_by('satellite').agg([
        pl.col('snr').mean().alias('avg_snr'),
        pl.col('snr').std().alias('std_snr'),
    ])

    print(result)

Best Practices#

Filter early#

Reduce data size before conversion:

# GOOD: Filter at API level
table = client.data.gnss_observations(
    system="G",        # Only GPS
    field="snr",       # Only SNR
    ...
).fetch()

# LESS EFFICIENT: Fetch everything, filter later
table = client.data.gnss_observations(...).fetch()
df = pl.from_arrow(table).filter(pl.col("system") == "G").select("snr")

Avoid unnecessary conversions#

Work directly with Arrow when you can:

import pyarrow.compute as pc

# GOOD: Direct Arrow operations
mean = pc.mean(table['snr']).as_py()  # .as_py() unwraps the Arrow scalar to a Python float

# LESS EFFICIENT: Convert just for one calculation
df = table.to_pandas()
mean = df['snr'].mean()

Conversion Reference#

Quick reference for common conversions:

import polars as pl
import pandas as pd
import duckdb
import pyarrow.compute as pc

# Get Arrow table
table = client.data.gnss_observations(...).fetch()

# Convert to different formats
df_pandas = table.to_pandas()                 # Pandas
df_polars = pl.from_arrow(table)              # Polars
result = duckdb.from_arrow(table)             # DuckDB relation ("FROM table" fails in SQL: reserved word)
array = table['snr'].to_numpy()               # NumPy

# Direct Arrow operations
filtered = table.filter(pc.field('snr') > 40)
mean_snr = pc.mean(table['snr'])

Next Steps#