Data Provider#

The icesat2db.IceSat2Provider class is the core interface for accessing structured ICESat-2 data and metadata stored in a TileDB database. It lets you run spatial and temporal queries against the database, retrieve only the variables you need, and receive the results in a form ready for geospatial analysis.

Key capabilities#

Spatial Queries: Query ICESat-2 data based on specific spatial boundaries, enabling analyses within defined regions.
Temporal Queries: Filter data by date range to focus on specific time periods.
Variable Selection: Retrieve only the data variables needed for your analysis to optimize performance.
Quality Filters: Apply additional quality filters to refine data retrieval based on specific conditions.
Reference Point Query: Query ICESat-2 data based on a reference point and get the nearest shots within a defined radius.
Flexible Output Formats: Export results as either xarray.Dataset for multi-dimensional data or pandas.DataFrame for tabular data.

Potentially available variables#

The database includes a wide range of variables from the ATL08 land and vegetation product, covering terrain elevation, canopy height metrics, quality flags, and ancillary data. Below is a table of commonly used variables:

Variable Descriptions#
Variable Name	Description	Units	Category
h_canopy	98th percentile of relative canopy heights above estimated terrain	meters	Canopy
h_max_canopy	Maximum relative canopy height within segment (equivalent to RH100)	meters	Canopy
h_mean_canopy	Mean relative canopy height within segment	meters	Canopy
h_te_best_fit	Best-fit terrain elevation at the mid-point of each 100 m segment	meters	Terrain
h_te_mean	Mean terrain photon height above WGS84 Ellipsoid within segment	meters	Terrain
canopy_h_metrics	Canopy height metrics at 10–95th percentiles (18 values per segment)	meters	Canopy
snr	Signal-to-noise ratio of geolocated photons	dimensionless	Land Segment
night_flag	Day/night flag derived from solar elevation (0=day, 1=night)	dimensionless	Land Segment
layer_flag	Consolidated cloud/blowing snow flag (0=absent, 1=likely present)	dimensionless	Land Segment
segment_snowcover	Daily snow/ice cover flag (0=ice-free water; 1=snow-free; 2=snow; 3=ice)	dimensionless	Land Segment

For the complete list of available variables, see TileDB Global Database for ICESat-2 ATL08 Data or call provider.get_available_variables().
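
Before building a query, it can help to confirm that every requested name is actually present in the database. The sketch below uses a hypothetical helper, missing_variables, which is not part of icesat2db. It assumes that the result of provider.get_available_variables() can be reduced to a collection of variable-name strings; this page does not specify the return type, so treat that as an assumption.

```python
def missing_variables(requested, available):
    """Return the requested names that are absent from `available`.

    A ":label" suffix (e.g. "canopy_h_metrics:50") is ignored for the check.
    """
    available = set(available)
    return [name for name in requested
            if name.split(":", 1)[0] not in available]


# Stand-in for the names reported by the provider (assumed to be strings).
available = ["h_canopy", "h_te_best_fit", "canopy_h_metrics", "snr"]

# "h_foo" is a deliberately invalid name used only for illustration.
requested = ["h_canopy", "canopy_h_metrics:50", "h_foo"]
print(missing_variables(requested, available))  # ['h_foo']
```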

Retrieving ICESat-2 data with the ICESat-2 provider#

The icesat2db.IceSat2Provider class is the main tool for querying ICESat-2 data from the TileDB database. The following example shows how to configure the provider and retrieve data for a region and time range; additional quality filters are covered further below.

Basic query example#

import geopandas as gpd
import icesat2db as isdb

# Load region of interest
region_of_interest = gpd.read_file('./data/geojson/region.geojson')

# Instantiate the IceSat2Provider
provider = isdb.IceSat2Provider(storage_type='local',
                                local_path="/path/to/your/database")

# Define the variables to query
variables = ["h_canopy", "h_te_best_fit"]

dataset = provider.get_data(variables=variables,
                            geometry=region_of_interest,
                            start_time="2019-01-01",
                            end_time="2024-12-31",
                            return_type='xarray')

Parameters for `get_data()`#

variables: List of variables (columns) to retrieve from the database. Profile and sub-segment variables (e.g. canopy_h_metrics, h_canopy_20m) return all values per segment by default. To fetch a single element by label and save bandwidth, use the "variable:label" syntax, e.g. "canopy_h_metrics:50" (50th-percentile only) or "h_canopy_20m:50" (centre 20 m bin only).

geometry: (Optional) GeoPandas geometry for spatial filtering.

start_time: (Optional) Start date for temporal filtering (format: “YYYY-MM-DD”).

end_time: (Optional) End date for temporal filtering (format: “YYYY-MM-DD”).

return_type: Format of the returned data, either xarray.Dataset (“xarray”) or pandas.DataFrame (“dataframe”). The default is “xarray”.

query_type: (Optional) Type of query to execute, either “nearest” or “bounding_box” (default: “bounding_box”). A “nearest” query also requires the point parameter.

point: (Optional) Reference point for nearest query, required if query_type is “nearest” (format: Tuple[longitude, latitude]).

num_shots: (Optional) Number of shots to retrieve if the query_type is “nearest” (default: 10).

radius: (Optional) Radius in degrees around the point if the query_type is “nearest” (default: 0.1).

quality_filters: (Optional) Additional quality filters to apply to the query, passed as keyword arguments (for example by unpacking a dictionary with **quality_filters).

The returned data is formatted according to the return_type parameter, making it ready for further analysis.
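
The parameters above can be combined into a reference-point query. The following is a minimal sketch: nearest_shots is a hypothetical wrapper (not part of icesat2db) that only forwards the documented get_data() parameters, and the coordinates in the usage comment are placeholders.

```python
def nearest_shots(provider, lon, lat, variables, num_shots=10, radius=0.1):
    """Fetch the shots closest to (lon, lat) as a pandas.DataFrame.

    `provider` is expected to be an icesat2db.IceSat2Provider instance.
    """
    return provider.get_data(
        variables=variables,
        query_type="nearest",
        point=(lon, lat),      # (longitude, latitude) order
        num_shots=num_shots,   # number of shots to retrieve
        radius=radius,         # search radius in degrees
        return_type="dataframe",
    )


# Usage (requires a configured provider; placeholder coordinates):
# shots = nearest_shots(provider, 10.45, 51.23,
#                       ["h_canopy", "canopy_h_metrics:50"])
```

Requesting "canopy_h_metrics:50" rather than "canopy_h_metrics" uses the "variable:label" syntax described above to fetch only the 50th-percentile value.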

Applying additional quality filters#

You can further refine the data retrieval by specifying additional quality filters. This customization allows filtering based on specific conditions for selected variables. The filters are added as keyword arguments in the form of field-value conditions.

Example with additional quality filters#

In the following example, we filter for night-time acquisitions with no cloud/blowing snow contamination:

import geopandas as gpd
import icesat2db as isdb

# Instantiate the IceSat2Provider
provider = isdb.IceSat2Provider(storage_type='local',
                                local_path="/path/to/your/database")

# Load region of interest
region_of_interest = gpd.read_file('./data/geojson/region.geojson')

# Define variables and quality filters
variables = ["h_canopy", "h_te_best_fit", "snr"]
quality_filters = {
    'night_flag': "== 1",
    'layer_flag': "== 0",
}

icesat2_data = provider.get_data(variables=variables,
                                 geometry=region_of_interest,
                                 start_time="2019-01-01",
                                 end_time="2024-12-31",
                                 return_type='xarray',
                                 **quality_filters)

Quality filters are passed as key-value pairs where the key is the variable name and the value is the condition string (e.g., 'night_flag': "== 1"). This adds flexibility to refine the query based on specific criteria, improving the relevance of the retrieved data.
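
To make the meaning of a condition string concrete, the sketch below applies the same filters to a few in-memory records. It is an illustration only, not the icesat2db implementation: the set of supported operators and the rule that a record must satisfy every condition are assumptions, and parse_condition and passes are hypothetical helpers.

```python
import operator

# Comparison operators recognised by this sketch (an assumption; the
# operators accepted by icesat2db are not listed on this page).
_OPS = {"==": operator.eq, "!=": operator.ne, ">=": operator.ge,
        "<=": operator.le, ">": operator.gt, "<": operator.lt}


def parse_condition(condition):
    """Split a string such as '== 1' into (comparison function, number)."""
    text = condition.strip()
    # Try two-character operators first so '>=' is not read as '>'.
    for symbol in sorted(_OPS, key=len, reverse=True):
        if text.startswith(symbol):
            return _OPS[symbol], float(text[len(symbol):])
    raise ValueError(f"Unsupported condition: {condition!r}")


def passes(record, filters):
    """Return True if `record` (a dict) satisfies every filter condition."""
    for name, condition in filters.items():
        compare, threshold = parse_condition(condition)
        if not compare(record[name], threshold):
            return False
    return True


filters = {"night_flag": "== 1", "layer_flag": "== 0"}
records = [
    {"night_flag": 1, "layer_flag": 0},  # night, no cloud flag: kept
    {"night_flag": 0, "layer_flag": 0},  # daytime: dropped
    {"night_flag": 1, "layer_flag": 1},  # cloud/blowing snow likely: dropped
]
kept = [r for r in records if passes(r, filters)]
```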

Supported output formats#

The icesat2db.IceSat2Provider supports the following output formats, allowing you to choose the structure that best suits your analysis:

xarray.Dataset: Ideal for multi-dimensional data that includes labeled dimensions, suitable for advanced numerical and geospatial analysis.
pandas.DataFrame: Perfect for tabular data and smaller datasets, allowing for quick manipulation and export to CSV or other formats.

Below is an example of how the dataset looks in the xarray.Dataset format:

<xarray.Dataset>
Dimensions:          (segment_id: 284305, percentile: 18)
Coordinates:
  * segment_id       (segment_id) int64 2MB 131271604800 ... 131271952640
  * percentile       (percentile) int32 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95
    latitude         (segment_id) float32 1MB 51.23 51.24 ... 47.89 47.90
    longitude        (segment_id) float32 1MB 10.45 10.45 ... 14.12 14.13
    time             (segment_id) datetime64[ns] 2MB 2021-06-15 ... 2021-06-15
Data variables:
    h_canopy         (segment_id) float32 1MB 18.4 22.1 ... 5.3 7.8
    h_te_best_fit    (segment_id) float32 1MB 312.1 315.6 ... 198.4 201.2
    canopy_h_metrics (segment_id, percentile) float32 20MB 4.2 ... 17.9

The dataset includes multiple dimensions and variables:

Dimensions: segment_id (unique ID for each 100 m land segment) and percentile (coordinate axis for profile variables such as canopy_h_metrics, with values 10–95). Sub-segment variables use along_track_offset_m (values 10, 30, 50, 70, 90 m).
Coordinates: time, latitude, and longitude describing each segment’s spatial and temporal context.
Data Variables: Variables such as h_canopy (98th percentile canopy height above terrain) and h_te_best_fit (best-fit terrain elevation).
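
The labeled dimensions make it straightforward to select along them with standard xarray calls. The sketch below builds a tiny stand-in dataset with the same layout; the segment IDs and values are made up for illustration and do not come from the database.

```python
import numpy as np
import xarray as xr

percentiles = np.arange(10, 100, 5)   # 10, 15, ..., 95 (18 values)
segment_ids = np.array([1, 2, 3])     # made-up segment IDs

ds = xr.Dataset(
    data_vars={
        "h_canopy": ("segment_id",
                     np.array([18.4, 22.1, 9.8], dtype="float32")),
        "canopy_h_metrics": (
            ("segment_id", "percentile"),
            # Same made-up profile repeated for each of the 3 segments.
            np.tile(np.linspace(2.0, 19.0, 18, dtype="float32"), (3, 1)),
        ),
    },
    coords={"segment_id": segment_ids, "percentile": percentiles},
)

# Pick one percentile by label; the result has only the segment_id dimension.
median_height = ds["canopy_h_metrics"].sel(percentile=50)

# Keep only the segments whose h_canopy exceeds 15 m.
tall = ds.where(ds["h_canopy"] > 15, drop=True)
```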

Below is an example of how the dataset looks in the pandas.DataFrame format:

           latitude  longitude        time  h_canopy  h_te_best_fit
0         51.231842  10.453218  2021-06-15     18.40         312.10
1         51.240115  10.453501  2021-06-15     22.10         315.60
2         51.248388  10.453784  2021-06-15      9.80         318.30
3         51.256661  10.454067  2021-06-15     15.60         320.90
4         51.264934  10.454350  2021-06-15      7.20         323.40
...              ...        ...         ...       ...            ...
284302    47.898234  14.121456  2021-06-15      3.10         195.20
284303    47.890011  14.121739  2021-06-15      5.30         198.40
284304    47.881788  14.122022  2021-06-15      7.80         201.20

[284305 rows x 5 columns]
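
Once the data is in tabular form, ordinary pandas operations apply. The sketch below rebuilds the first three rows of the table above by hand and shows a simple filter, a summary statistic, and a CSV export; the 10 m threshold is an arbitrary choice for illustration.

```python
import pandas as pd

# First three rows of the example table, entered manually.
df = pd.DataFrame({
    "latitude": [51.231842, 51.240115, 51.248388],
    "longitude": [10.453218, 10.453501, 10.453784],
    "time": pd.to_datetime(["2021-06-15"] * 3),
    "h_canopy": [18.40, 22.10, 9.80],
    "h_te_best_fit": [312.10, 315.60, 318.30],
})

# Keep segments with canopy taller than 10 m (arbitrary threshold).
tall = df[df["h_canopy"] > 10]

# Mean canopy height of the remaining segments, in meters.
mean_height = tall["h_canopy"].mean()

# With no path argument, to_csv() returns the CSV content as a string.
csv_text = tall.to_csv(index=False)
```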
