Working with Apache Arrow#
The EarthScope SDK returns data as Apache Arrow tables for high performance and easy integration with data science tools.
What is Apache Arrow?#
Apache Arrow is a cross-language development platform for in-memory columnar data. It provides:
- **Standardized format**: a common memory representation that works across languages (Python, R, Java, etc.)
- **Zero-copy reads**: data can be shared between libraries without copying
- **High performance**: optimized for modern CPU architectures
- **Rich ecosystem**: native integration with data science tools
Why Does the SDK Use Arrow?#
The EarthScope SDK returns data as PyArrow Tables because:
- **Efficient streaming**: APIs can stream Arrow data directly without serialization overhead
- **Memory efficient**: the columnar format is compact and cache-friendly
- **Fast**: no parsing required (unlike CSV or JSON)
- **Flexible**: easily convert to your preferred format (Pandas, Polars, DuckDB, etc.)
Converting Arrow Tables#
The SDK query plans return PyArrow Table objects, which you can easily convert to your preferred data science library:
Pandas#
```python
import pandas as pd

table = client.data.gnss_observations(...).fetch()
df = table.to_pandas()  # creates a pandas DataFrame
```
Note: see the Pandas documentation for more information and usage details.
Polars#
```python
import polars as pl

table = client.data.gnss_observations(...).fetch()
df = pl.from_arrow(table)  # creates a Polars DataFrame
```
Note: see the Polars documentation for more information and usage details.
DuckDB#
```python
import duckdb

obs = client.data.gnss_observations(...).fetch()

# Query the Arrow table directly with SQL; DuckDB's replacement scan
# resolves the Python variable name `obs` (note that "table" itself is
# a reserved word in SQL, so avoid it as a variable name here)
result = duckdb.query("""
    SELECT satellite, COUNT(*) AS obs_count
    FROM obs
    GROUP BY satellite
    ORDER BY obs_count DESC
""").to_df()
```
Note: see the DuckDB documentation for more information and usage details.
NumPy#
```python
table = client.data.gnss_observations(...).fetch()

# Convert specific columns to NumPy arrays
timestamps = table['timestamp'].to_numpy()
snr_values = table['snr'].to_numpy()
```
Note: see the NumPy documentation for more information and usage details.
Direct Arrow Operations#
You can also work directly with Arrow tables using PyArrow’s compute functions:
```python
import pyarrow.compute as pc

table = client.data.gnss_observations(...).fetch()

# Filter rows with an Arrow compute expression
filtered = table.filter(pc.field('snr') > 40)

# Compute statistics without leaving Arrow
mean_snr = pc.mean(table['snr'])
```
Working with Multiple Batches#
When iterating through query plans, each batch is an Arrow table:
```python
import polars as pl

plan = client.data.gnss_observations(...)

# Process each day as a Polars DataFrame
for daily_table in plan.group_by_day():
    # Convert the day's Arrow table to Polars
    df = pl.from_arrow(daily_table)

    # Aggregate with Polars
    result = df.group_by('satellite').agg([
        pl.col('snr').mean().alias('avg_snr'),
        pl.col('snr').std().alias('std_snr'),
    ])
    print(result)
```
Best Practices#
Filter early#
Reduce data size before conversion:
```python
# GOOD: filter at the API level
table = client.data.gnss_observations(
    system="G",  # only GPS
    field="snr",  # only SNR
    ...
).fetch()

# LESS EFFICIENT: fetch everything, filter later
import polars as pl

table = client.data.gnss_observations(...).fetch()
df = pl.from_arrow(table).filter(pl.col("system") == "G").select("snr")
```
Avoid unnecessary conversions#
Work directly with Arrow when you can:
```python
import pyarrow.compute as pc

# GOOD: direct Arrow operations
mean = pc.mean(table['snr'])

# LESS EFFICIENT: convert just for one calculation
df = table.to_pandas()
mean = df['snr'].mean()
```
Conversion Reference#
Quick reference for common conversions:
```python
import duckdb
import pandas as pd
import polars as pl
import pyarrow.compute as pc

# Get an Arrow table
table = client.data.gnss_observations(...).fetch()

# Convert to different formats
df_pandas = table.to_pandas()                       # Pandas
df_polars = pl.from_arrow(table)                    # Polars
obs = table  # alias: "table" is a SQL reserved word
result = duckdb.query("SELECT * FROM obs").to_df()  # DuckDB
array = table['snr'].to_numpy()                     # NumPy

# Direct Arrow operations
filtered = table.filter(pc.field('snr') > 40)
mean_snr = pc.mean(table['snr'])
```
Next Steps#
- Learn about Query Plans & Memory Management for efficient iteration
- See the GNSS Observations tutorial for practical examples
- Explore saving to Parquet for local storage