Direct S3 Access to SAGE miniSEED data repository#

The SAGE miniSEED repository can be accessed directly from S3 with EarthScope-vended AWS credentials.

In this example workflow, a researcher with an EarthScope user account exchanges their EarthScope credentials for temporary AWS credentials. These temporary credentials are then used to initialize a boto3 session that reads data from the S3 bucket.

NOTE: This functionality is still in beta, and access is restricted while we monitor costs and scalability. If you would like to execute a project leveraging S3 direct access, please email data-help@earthscope.org to request account authorization.

Prerequisites:#

  • The user has registered for an EarthScope user account at https://www.earthscope.org/user/login

  • The user’s account has been approved/enabled for S3 direct access

  • The user’s workflow is configured to run in the AWS us-east-2 region

import boto3

from earthscope_sdk import EarthScopeClient

Get Temporary AWS Credentials#

To talk directly to S3, the user needs AWS credentials.

The EarthScope API has an endpoint that vends temporary (short-lived) AWS credentials with narrowly scoped permission to read from our S3 Access Point, which serves both Restricted and Unrestricted Data. See the API documentation.

To use this endpoint (as with all endpoints at api.earthscope.org), the user needs to pass their OAuth2 access token. The SDK retrieves this token for you.

STEP 1: Create an EarthScopeClient instance#

Creating an instance of this class will automatically load EarthScope credentials from your machine following the settings loading chain.

client = EarthScopeClient()

STEP 2: Retrieve temporary AWS Credentials using the client#

creds = client.user.get_aws_credentials()

Create a boto3 Session with the temporary credentials#

boto3 is the AWS SDK for Python.

The following cell:

  1. creates a new boto3 Session with the retrieved temporary credentials

  2. creates an S3 client from the session (refer to the AWS documentation for the available methods)

session = boto3.Session(
    aws_access_key_id=creds.aws_access_key_id,
    aws_secret_access_key=creds.aws_secret_access_key,
    aws_session_token=creds.aws_session_token,
)

s3_client = session.client("s3")
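
Because the vended credentials are short-lived, a long-running job may need to rebuild its S3 client from time to time. The sketch below is one possible way to do that, not part of the EarthScope SDK: `RefreshingClient`, `build_client`, and `max_age_seconds` are hypothetical names, and the actual credential lifetime is not stated here, so any value passed for `max_age_seconds` is an assumption.

```python
import time
from typing import Any, Callable


class RefreshingClient:
    """Cache a client and rebuild it once it is older than max_age_seconds."""

    def __init__(
        self,
        build_client: Callable[[], Any],
        max_age_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._build_client = build_client
        self._max_age_seconds = max_age_seconds
        self._clock = clock
        self._client: Any = None
        self._created_at = 0.0

    def get(self) -> Any:
        """Return the cached client, rebuilding it first if it is too old."""
        now = self._clock()
        expired = now - self._created_at >= self._max_age_seconds
        if self._client is None or expired:
            self._client = self._build_client()
            self._created_at = now
        return self._client


# Hypothetical usage with the objects from this notebook:
# def build_s3_client():
#     creds = client.user.get_aws_credentials()
#     session = boto3.Session(
#         aws_access_key_id=creds.aws_access_key_id,
#         aws_secret_access_key=creds.aws_secret_access_key,
#         aws_session_token=creds.aws_session_token,
#     )
#     return session.client("s3")
#
# provider = RefreshingClient(build_s3_client, max_age_seconds=900)  # 900 is a placeholder
# s3_client = provider.get()
```

Calling `provider.get()` before each batch of requests, rather than holding on to one client, keeps the refresh logic in a single place.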

Define constants for data location#

The data is exposed via an S3 Access Point (S3AP). This S3AP has a unique, generated alias (that we do not control). This alias is used in place of the bucket name in all S3 operations.

EarthScope’s miniSEED data lives in the miniseed/ prefix of our bucket, and thus of our S3AP.

S3_ACCESS_POINT = "earthscope-mseed-res-na3mtd4fq5kz7pntcyr1uh46use2a--ol-s3"

BUCKET = S3_ACCESS_POINT
PREFIX = "miniseed/"

List contents of miniseed prefix#

Users are allowed to list the miniseed/ prefix of this access point, including all “subdirectories” within it. Networks with restricted data can be listed too, although their objects cannot be downloaded without authorization.

Listing is permitted from any machine, as long as the user has valid credentials that they obtained from https://api.earthscope.org.

list_resp = s3_client.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX, Delimiter="/")
nets = [c["Prefix"].split("/", 1)[1] for c in list_resp["CommonPrefixes"]]
print(nets)
['1A/', '1B/', '1C/', '1D/', '1E/', '1F/', '1G/', '1H/', '1J/', '1K/', '1L/', '1M/', '1P/', '1Q/', '1T/', '1U/', '1V/', '2A/', '2B/', '2C/', '2D/', '2E/', '2F/', '2G/', '2H/', '2J/', '2K/', '2L/', '2M/', '2O/', '2P/', '2Q/', '2U/', '2V/', '3A/', '3B/', '3C/', '3D/', '3E/', '3F/', '3H/', '3J/', '3K/', '3L/', '3R/', '3U/', '3Y/', '4A/', '4B/', '4E/', '4F/', '4H/', '4J/', '4N/', '4P/', '4Q/', '4S/', '4T/', '4U/', '4Y/', '5A/', '5B/', '5C/', '5E/', '5F/', '5G/', '5H/', '5K/', '5L/', '5O/', '5P/', '5S/', '5W/', '6A/', '6C/', '6D/', '6E/', '6F/', '6G/', '6H/', '6I/', '6J/', '6K/', '6L/', '6M/', '6O/', '6Q/', '6R/', '6W/', '7A/', '7B/', '7C/', '7D/', '7E/', '7F/', '7G/', '7I/', '7J/', '7K/', '7L/', '7O/', '7P/', '7Q/', '7S/', '7T/', '8A/', '8B/', '8E/', '8F/', '8G/', '8H/', '8J/', '8L/', '8P/', '8Q/', '8S/', '8U/', '8W/', '9A/', '9B/', '9C/', '9D/', '9F/', '9G/', '9H/', '9K/', '9L/', '9M/', '9P/', '9R/', 'A2/', 'A7/', 'AB/', 'AC/', 'AE/', 'AF/', 'AG/', 'AI/', 'AK/', 'AL/', 'AM/', 'AO/', 'AP/', 'AR/', 'AS/', 'AT/', 'AU/', 'AV/', 'AX/', 'AY/', 'AZ/', 'B6/', 'BC/', 'BE/', 'BF/', 'BI/', 'BK/', 'BL/', 'BV/', 'BX/', 'C/', 'C0/', 'C1/', 'C8/', 'CA/', 'CB/', 'CC/', 'CD/', 'CH/', 'CI/', 'CK/', 'CM/', 'CN/', 'CO/', 'CS/', 'CT/', 'CU/', 'CW/', 'CY/', 'CZ/', 'DE/', 'DK/', 'DR/', 'DT/', 'DU/', 'DW/', 'EC/', 'EI/', 'EM/', 'EO/', 'EP/', 'ER/', 'ET/', 'FA/', 'G/', 'GB/', 'GD/', 'GE/', 'GF/', 'GG/', 'GH/', 'GI/', 'GM/', 'GO/', 'GR/', 'GS/', 'GT/', 'GY/', 'H2/', 'HG/', 'HK/', 'HL/', 'HT/', 'HV/', 'HW/', 'I0/', 'IC/', 'ID/', 'IE/', 'II/', 'IL/', 'IM/', 'IN/', 'IO/', 'IP/', 'IU/', 'IV/', 'IW/', 'JM/', 'JP/', 'K5/', 'KC/', 'KG/', 'KN/', 'KO/', 'KP/', 'KR/', 'KS/', 'KW/', 'KY/', 'KZ/', 'LB/', 'LD/', 'LH/', 'LI/', 'LM/', 'LO/', 'LX/', 'M8/', 'MB/', 'MC/', 'MG/', 'MH/', 'MI/', 'MM/', 'MN/', 'MP/', 'MR/', 'MS/', 'MU/', 'MX/', 'MY/', 'MZ/', 'N4/', 'NA/', 'NB/', 'NC/', 'NE/', 'NI/', 'NJ/', 'NK/', 'NL/', 'NM/', 'NN/', 'NO/', 'NP/', 'NQ/', 'NR/', 'NT/', 'NU/', 'NV/', 'NW/', 'NX/', 'NY/', 'NZ/', 
'O2/', 'OC/', 'OE/', 'OH/', 'OI/', 'OK/', 'ON/', 'OO/', 'OQ/', 'OV/', 'OW/', 'OX/', 'OZ/', 'PA/', 'PB/', 'PE/', 'PI/', 'PL/', 'PM/', 'PN/', 'PO/', 'PQ/', 'PR/', 'PS/', 'PT/', 'PY/', 'QC/', 'QZ/', 'RC/', 'RE/', 'RI/', 'RM/', 'RO/', 'RS/', 'RV/', 'S1/', 'S8/', 'SB/', 'SC/', 'SE/', 'SF/', 'SG/', 'SH/', 'SN/', 'SP/', 'SR/', 'SS/', 'SV/', 'SY/', 'TA/', 'TC/', 'TD/', 'TF/', 'TJ/', 'TM/', 'TO/', 'TR/', 'TT/', 'TW/', 'TX/', 'TZ/', 'UF/', 'UH/', 'UI/', 'UK/', 'UM/', 'UO/', 'US/', 'UT/', 'UU/', 'UW/', 'VD/', 'VE/', 'VU/', 'WA/', 'WC/', 'WF/', 'WI/', 'WM/', 'WU/', 'WW/', 'WY/', 'X1/', 'X2/', 'X3/', 'X4/', 'X5/', 'X6/', 'X7/', 'X8/', 'X9/', 'XA/', 'XB/', 'XC/', 'XD/', 'XE/', 'XF/', 'XG/', 'XH/', 'XI/', 'XJ/', 'XK/', 'XL/', 'XM/', 'XN/', 'XO/', 'XP/', 'XQ/', 'XR/', 'XS/', 'XT/', 'XU/', 'XV/', 'XW/', 'XX/', 'XY/', 'XZ/', 'Y1/', 'Y2/', 'Y3/', 'Y4/', 'Y5/', 'Y6/', 'Y7/', 'Y8/', 'Y9/', 'YA/', 'YB/', 'YC/', 'YD/', 'YE/', 'YF/', 'YG/', 'YH/', 'YI/', 'YJ/', 'YK/', 'YL/', 'YM/', 'YN/', 'YO/', 'YP/', 'YQ/', 'YR/', 'YS/', 'YT/', 'YU/', 'YV/', 'YW/', 'YX/', 'YY/', 'YZ/', 'Z1/', 'Z2/', 'Z3/', 'Z4/', 'Z5/', 'Z6/', 'Z7/', 'Z8/', 'Z9/', 'ZA/', 'ZB/', 'ZC/', 'ZD/', 'ZE/', 'ZF/', 'ZG/', 'ZH/', 'ZI/', 'ZJ/', 'ZK/', 'ZL/', 'ZM/', 'ZN/', 'ZO/', 'ZP/', 'ZQ/', 'ZR/', 'ZS/', 'ZT/', 'ZU/', 'ZV/', 'ZW/', 'ZX/', 'ZY/', 'ZZ/']
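
A single `list_objects_v2` call returns at most 1,000 entries, so listings deeper in the hierarchy may be truncated even though the network listing above fits in one response. The following is a minimal sketch that follows pagination with boto3's built-in paginator; the helper name `iter_common_prefixes` is hypothetical and not part of this notebook.

```python
from typing import Any, Iterator


def iter_common_prefixes(s3_client: Any, bucket: str, prefix: str) -> Iterator[str]:
    """Yield every "subdirectory" under prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    pages = paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/")
    for page in pages:
        # Pages without subdirectories omit the "CommonPrefixes" key.
        for common_prefix in page.get("CommonPrefixes", []):
            yield common_prefix["Prefix"]


# Usage with the names defined in this notebook (requires valid credentials):
# nets = [p[len(PREFIX):] for p in iter_common_prefixes(s3_client, BUCKET, PREFIX)]
```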

Consumer function#

This function is a no-op that simply drains the entire response read from S3, counting the number of bytes read. Replace this function with your actual business logic.

CHUNK_SIZE = 20_000_000


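def read_in_chunks(s3_object: dict) -> int:
    """Drain an S3 GetObject response in chunks and return the number of bytes read."""
    # NOTE: _raw_stream is a private attribute of botocore's StreamingBody;
    # s3_object["Body"].read(CHUNK_SIZE) is the public equivalent.
    stream = s3_object["Body"]._raw_stream
    ct = 0
    while True:
        data = stream.read(CHUNK_SIZE)

        if not data:
            break

        # Insert your data processing here, operating on `data`.
        ct += len(data)

    return ct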
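
As an illustration of replacing the no-op with real logic, the sketch below computes a SHA-256 checksum while draining the stream. It is only an example under stated assumptions: `hash_stream` is a hypothetical helper, and it accepts any file-like object with a `read(size)` method, which includes the `Body` of a `get_object` response.

```python
import hashlib
from typing import BinaryIO, Tuple


def hash_stream(stream: BinaryIO, chunk_size: int = 20_000_000) -> Tuple[int, str]:
    """Drain a file-like object, returning (bytes read, SHA-256 hex digest)."""
    digest = hashlib.sha256()
    total = 0
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        # Only one chunk is held in memory at a time.
        digest.update(data)
        total += len(data)
    return total, digest.hexdigest()


# Usage with a response from this notebook:
# size, checksum = hash_stream(get_resp["Body"])
```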

Trying to access Restricted Data#

Even though all of the miniSEED data is visible via listing, some of the data is Restricted.

The following cell shows what you see when you try to download an object that you do not have permission to read.

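from botocore.exceptions import ClientError

try:
    get_resp = s3_client.get_object(
        Bucket=BUCKET,
        Key=f"{PREFIX}BV/2024/090/SOEH.BV.2024.090",
    )
except ClientError as e:
    # Expected path: S3 rejects the request for restricted data.
    print(
        "Successfully failed to get restricted data. The following is the error message a user would see:"
    )
    print(e)
else:
    # Raised outside the try block so the except clause cannot swallow it.
    raise RuntimeError("Should not reach this line")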
Successfully failed to get restricted data. The following is the error message a user would see:
An error occurred (FgaAccessDenied) when calling the GetObject operation: You are not authorized to get this object.
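
In a batch job it can be useful to tell this denial apart from other failures, such as a missing key or expired credentials. A minimal sketch, assuming the error is a botocore `ClientError` (which carries the parsed service response in its `response` attribute) and that the code `FgaAccessDenied` shown above is the one returned for restricted data; the helper names are hypothetical.

```python
from typing import Optional


def aws_error_code(error: Exception) -> Optional[str]:
    """Return the AWS error code carried by a botocore ClientError, if any."""
    # ClientError exposes the parsed service response as a dict on .response.
    response = getattr(error, "response", None)
    if not isinstance(response, dict):
        return None
    return response.get("Error", {}).get("Code")


def is_access_denied(error: Exception) -> bool:
    """True when the error looks like the restricted-data denial shown above."""
    return aws_error_code(error) == "FgaAccessDenied"


# Usage inside an except block:
# except ClientError as e:
#     if is_access_denied(e):
#         ...  # skip this object and continue
#     else:
#         raise
```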

Access Unrestricted Data (or Restricted Data the user can access)#

If the user tries to download either:

  • any Unrestricted Data

  • any Restricted Data that the user has been granted access to

then the user will not see an error message and will instead download the object directly from S3.

The following cell hardcodes a few known Unrestricted Data objects of different sizes so that download times can be compared.

# %%timeit

get_resp = s3_client.get_object(
    Bucket=BUCKET,
    Key=f"{PREFIX}UW/2024/300/MBW.UW.2024.300#2",  # ~6 MB
    # Key=f"{PREFIX}UW/2024/300/MPO.UW.2024.300#2",  # ~85 MB
    # Key=f"{PREFIX}UW/2024/300/SLA.UW.2024.300#2",  # ~400 MB
)
sz = read_in_chunks(get_resp)
print(f"Successfully read object from S3 ({sz} bytes)")
Successfully read object from S3 (6139904 bytes)
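
The object keys used in this notebook appear to follow the layout `miniseed/NET/YEAR/DOY/STA.NET.YEAR.DOY`, where DOY is the zero-padded day of the year. The sketch below builds such a key from a calendar date. The layout is inferred only from the examples above, `build_key` is a hypothetical helper, and the `#2` suffix seen on some keys is not explained here, so it is left for the caller to append.

```python
from datetime import date

PREFIX = "miniseed/"


def build_key(network: str, station: str, day: date, prefix: str = PREFIX) -> str:
    """Build an object key following the layout inferred from this notebook."""
    year = day.year
    doy = day.timetuple().tm_yday  # day of year, 1-366
    return f"{prefix}{network}/{year}/{doy:03d}/{station}.{network}.{year}.{doy:03d}"


print(build_key("BV", "SOEH", date(2024, 3, 30)))
# prints: miniseed/BV/2024/090/SOEH.BV.2024.090
```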