
The EarthScope SDK returns data as Apache Arrow tables for high performance and easy integration with data science tools.

What is Apache Arrow?

Apache Arrow is a cross-language development platform for in-memory columnar data. It provides:

  - A standardized columnar memory format that is identical across languages (Python, R, Java, C++, and more)
  - Zero-copy data sharing between libraries and processes
  - Fast analytical operations on large datasets
Why Does the SDK Use Arrow?

The EarthScope SDK returns data as PyArrow Tables because:

  1. Efficient streaming: APIs can stream Arrow data directly without serialization overhead

  2. Memory efficient: Columnar format is compact and cache-friendly

  3. Fast: No parsing required (unlike CSV or JSON)

  4. Flexible: Easily convert to your preferred format (Pandas, Polars, DuckDB, etc.)

Converting Arrow Tables

The SDK query plans return PyArrow Table objects, which you can easily convert to your preferred data science library:

Pandas

import pandas as pd

table = client.data.gnss_observations(...).fetch()
df = table.to_pandas()  # Creates a pandas DataFrame

Polars

import polars as pl

table = client.data.gnss_observations(...).fetch()
df = pl.from_arrow(table)  # Creates a polars DataFrame

DuckDB

import duckdb

table = client.data.gnss_observations(...).fetch()

# Query the Arrow table directly with SQL. DuckDB resolves the name to the
# Python variable `table`; quote it, since TABLE is a SQL keyword.
result = duckdb.query("""
    SELECT satellite, COUNT(*) AS obs_count
    FROM "table"
    GROUP BY satellite
    ORDER BY obs_count DESC
""").to_df()

NumPy

import numpy as np

table = client.data.gnss_observations(...).fetch()

# Convert specific columns to numpy arrays
timestamps = table['timestamp'].to_numpy()
snr_values = table['snr'].to_numpy()

Direct Arrow Operations

You can also work directly with Arrow tables using PyArrow’s compute functions:

import pyarrow.compute as pc

table = client.data.gnss_observations(...).fetch()

# Filter data
filtered = table.filter(pc.field('snr') > 40)

# Compute statistics
mean_snr = pc.mean(table['snr'])

Working with Multiple Batches

When iterating through query plans, each batch is an Arrow table:

import polars as pl

# Process each day as a Polars DataFrame
for daily_table in plan.group_by_day():
    # Convert to Polars
    df = pl.from_arrow(daily_table)

    # Process with Polars
    result = df.group_by('satellite').agg([
        pl.col('snr').mean().alias('avg_snr'),
        pl.col('snr').std().alias('std_snr'),
    ])

    print(result)

Best Practices

Filter early

Reduce data size before conversion:

# GOOD: Filter at API level
table = client.data.gnss_observations(
    system="G",        # Only GPS
    field="snr",       # Only SNR
    ...
).fetch()

# LESS EFFICIENT: Fetch everything, filter later
table = client.data.gnss_observations(...).fetch()
df = pl.from_arrow(table).filter(pl.col("system") == "G").select("snr")

Avoid unnecessary conversions

Work directly with Arrow when you can:

# GOOD: Direct Arrow operations
mean = pc.mean(table['snr'])

# LESS EFFICIENT: Convert just for one calculation
df = table.to_pandas()
mean = df['snr'].mean()

Conversion Reference

Quick reference for common conversions:

import polars as pl
import pandas as pd
import duckdb
import pyarrow.compute as pc

# Get Arrow table
table = client.data.gnss_observations(...).fetch()

# Convert to different formats
df_pandas = table.to_pandas()                 # Pandas
df_polars = pl.from_arrow(table)              # Polars
result = duckdb.query('SELECT * FROM "table"')  # DuckDB (quote the SQL keyword)
array = table['snr'].to_numpy()               # NumPy

# Direct Arrow operations
filtered = table.filter(pc.field('snr') > 40)
mean_snr = pc.mean(table['snr'])

Next Steps