Geospatial data storage is often messy and, at scale, creates inefficiencies, errors, and costs. That’s why data management is an important topic for us. We’ve been transforming UP42 storage into a seamless solution for interacting with large volumes of data from different geospatial providers. The goal? To enable interoperability and discoverability. To achieve this, we adopted the STAC (SpatioTemporal Asset Catalog) specification in UP42 storage earlier this year. We’ve talked extensively about the benefits, so in this article we want to cover some of the challenges we faced during the implementation.
Data modeling challenges at UP42
Our journey with STAC started when we realized we had to adapt UP42 storage to accommodate some exciting product developments: the UP42 data platform and our new tasking interface. This was the status back then:
- Thousands of assets (individual data deliveries) in UP42 storage from thousands of customers
- Dozens of different data formats and structures (e.g., GeoTIFF, DIMAP, etc.)
- Lack of a well-defined schema to derive metadata from some providers
- In some cases, missing metadata for the delivered files
- Inconsistencies in the way the same entity is described (e.g., geometry, location, bound, aoi, and aoiInGeoJson all describe the same entity)
- No option to retrieve geospatial metadata without downloading the asset
- Lack of tools to manage data in storage, and no functionality to search, filter, or sort assets for a specific project or objective
- Data ingestion into the UP42 analytics platform complicated by differing file systems and metadata logic
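To make the naming inconsistency concrete, here is a minimal sketch of how differently named footprint fields could be folded into a single `geometry` key. The alias names come from the list above; the function name, the assumption that every alias holds a GeoJSON geometry, and the conflict handling are illustrative only and do not describe UP42's actual code.

```python
from typing import Any

# Alias names taken from the list above. Treating them all as GeoJSON
# geometries is an assumption made for this sketch.
GEOMETRY_ALIASES = ("geometry", "location", "bound", "aoi", "aoiInGeoJson")


def normalize_geometry(metadata: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `metadata` with its footprint stored under 'geometry'."""
    # Keep every field that is not one of the known footprint aliases.
    normalized = {k: v for k, v in metadata.items() if k not in GEOMETRY_ALIASES}
    found = [metadata[k] for k in GEOMETRY_ALIASES if metadata.get(k) is not None]
    if not found:
        # No footprint at all: record that explicitly instead of guessing.
        normalized["geometry"] = None
        return normalized
    # If several aliases are present, they must agree; otherwise refuse to pick.
    if any(candidate != found[0] for candidate in found[1:]):
        raise ValueError("conflicting geometry fields in metadata")
    normalized["geometry"] = found[0]
    return normalized
```

Called with `{"aoi": {...}, "name": "x"}`, the function returns the same record with the footprint moved to `geometry`, so downstream code only has to look in one place.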
It sounds complicated, right? Well, with STAC, you have one standard for describing geospatial data, independent of the provider.
Data management solution design with STAC
Below are some of the questions we asked ourselves when we started.
- How do we extract geospatial metadata from providers who do not plan to implement STAC?
- How do we make metadata extraction consistently reliable?
- How can we combine UP42 platform-related metadata with STAC (e.g., UP42 asset or order ID)?
- How do we deal with tiled assets and (tri-)stereo pairs?
- How do we embed user authentication and authorization in our STAC service?
- How do we split the implementation into iterations so users can benefit immediately?
To answer the questions above, we created our data management solution design. Data arriving in UP42 storage can come from any archive or tasking order and, in the future, from user uploads. Every time new data is uploaded, it goes to our asset service, which holds the asset and its metadata. Once an asset is created, metadata is extracted, irrespective of the upstream provider, and a STAC collection, STAC items, and STAC assets are created. The information is then passed to our spatial asset service, where the STAC objects are validated and entered into the UP42 database.
There are four main steps in the process of registering a new provider for metadata extraction:
- Define a clear schema to derive the geospatial asset structure: file naming, files shipped along with the asset, etc.
- Map fields from the metadata files (e.g., XML, JSON, etc.) received from the provider to must-have STAC fields (e.g., geometry and datetime)
- Add the additional data required to make the STAC collection, item, and asset usable; besides must-have data, providers ship additional data (e.g., constellation name) that is mapped to other STAC fields
- Register this driver in our internal library (metaIO), where metadata is derived
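The steps above can be sketched roughly as follows. This is not metaIO's real interface, which is internal; the registry, the decorator, the provider name `example-provider`, and the raw field names (`acquisitionDate`, `footprint`, `constellationName`, `cloudCoverPercent`) are all hypothetical and only illustrate the mapping and registration idea.

```python
from datetime import datetime, timezone
from typing import Any, Callable

Driver = Callable[[dict[str, Any]], dict[str, Any]]

# Hypothetical in-memory registry: provider name -> driver function.
DRIVERS: dict[str, Driver] = {}


def register_driver(provider: str) -> Callable[[Driver], Driver]:
    """Decorator that registers a driver under a provider name."""
    def decorator(func: Driver) -> Driver:
        DRIVERS[provider] = func
        return func
    return decorator


@register_driver("example-provider")
def example_driver(raw: dict[str, Any]) -> dict[str, Any]:
    """Map one made-up provider layout to STAC fields."""
    acquired = datetime.fromisoformat(raw["acquisitionDate"])
    if acquired.tzinfo is None:
        # Assumption: this provider reports naive timestamps in UTC.
        acquired = acquired.replace(tzinfo=timezone.utc)
    properties = {"datetime": acquired.astimezone(timezone.utc).isoformat()}
    # Optional extras are mapped to other STAC fields when they are shipped.
    if "constellationName" in raw:
        properties["constellation"] = raw["constellationName"]
    if "cloudCoverPercent" in raw:
        properties["eo:cloud_cover"] = float(raw["cloudCoverPercent"])
    return {"geometry": raw.get("footprint"), "properties": properties}


def extract_stac_fields(provider: str, raw: dict[str, Any]) -> dict[str, Any]:
    """Run the registered driver and check the must-have fields."""
    if provider not in DRIVERS:
        raise LookupError(f"no metadata driver registered for {provider!r}")
    fields = DRIVERS[provider](raw)
    if fields.get("geometry") is None or not fields["properties"].get("datetime"):
        raise ValueError("driver did not produce geometry and datetime")
    return fields
```

Keeping the must-have check outside the individual drivers means a provider format change surfaces as an explicit error rather than as an incomplete STAC object further downstream.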
However, even with this process, we may still face challenges. For example, a provider might update their delivery format, which requires changes on our side. In other cases, metadata might not be shipped in a separate file, so we need to read it directly from the image headers. And in some rare cases, metadata might be missing altogether.
Solution design of UP42’s STAC service
Now, let’s talk about our spatial asset service in more detail. UP42 creates one STAC catalog to search through all assets while taking authentication into account (we map each user to an individual STAC catalog). Under the STAC catalog, each UP42 asset is mapped to a STAC collection; an UP42 asset is what you receive in storage as a result of a completed tasking or catalog order. A STAC collection contains STAC items, each an individual scene with a unique spatiotemporal extent (e.g., tiled images, images with different acquisition times, or stereo or tri-stereo pairs). A STAC item contains STAC assets: a geospatial feature of the STAC item, its quicklook, or its metadata file (e.g., multispectral and panchromatic products of an image acquired by an optical sensor). To ensure consistency across providers, we also add STAC extensions, either pre-defined (e.g., SAR, EO Extension, Projection Extension, View Geometry) or UP42-specific ones created in the process (e.g., UP42 Order Extension, UP42 Product Extension, UP42 System Extension). The STAC specification provides the flexibility to do this. Below is an example of a STAC collection and item at UP42.
STAC collection example
{
"assets": {},
"links": [
{
"href": "https://api.up42.dev/v2/internal/assets/stac/collections/76d8aa65-952e-4843-9420-9093775f2a16",
"rel": "self",
"type": "application/json"
},
{
"href": "https://api.up42.dev/v2/internal/assets/stac/",
"rel": "parent",
"type": "application/json"
},
{
"href": "https://api.up42.dev/v2/internal/assets/stac/collections/76d8aa65-952e-4843-9420-9093775f2a16/items",
"rel": "items",
"type": "application/geo+json"
},
{
"href": "https://api.up42.dev/v2/internal/assets/stac/",
"rel": "root",
"type": "application/json"
}
],
"stac_extensions": [
"https://api.up42.com/stac-extensions/up42-system/v1.0.0/schema.json",
"https://api.up42.com/stac-extensions/up42-product/v1.0.0/schema.json"
],
"title": "ORT_SPOT7_20180504_094339900_000",
"description": "High-resolution 1.5m SPOT images acquired daily on a global basis. The datasets are available starting from 2012.",
"keywords": [
"Airbus",
"SPOT"
],
"license": "proprietary",
"providers": [
{
"name": "AIRBUS DS",
"roles": [
"producer"
],
"url": "http://www.geo-airbusds.com"
},
{
"name": "AIRBUS DS",
"roles": [
"processor"
],
"url": "http://www.geo-airbusds.com"
}
],
"extent": {
"spatial": {
"bbox": [
[
13.405284660895354,
52.48483722634566,
13.438730113555541,
52.505261378042256
]
]
},
"temporal": {
"interval": [
[
"2018-05-04T09:43:39.700000Z",
"2018-05-04T09:43:39.700000Z"
]
]
}
},
"stac_version": "1.0.0",
"id": "76d8aa65-952e-4843-9420-9093775f2a16",
"type": "Collection",
"up42-system:workspace_id": "3746c3fa-5f89-406e-a674-6c1e3ffbac3f",
"summaries": {},
"up42-system:asset_id": "035d41f1-058f-4494-b074-ea413f3f4bb1",
"up42-product:data_type": "raster",
"up42-system:account_id": "8cd5de7b-82e2-4625-b094-d5392f1cf780",
"up42-system:metadata_version": "0.0.17",
"up42-product:modality": "multispectral"
}
STAC item example
{
"links": [
{
"href": "https://api.up42.dev/v2/internal/assets/stac/",
"rel": "root",
"type": "application/json"
},
{
"href": "https://api.up42.dev/v2/internal/assets/stac/collections/76d8aa65-952e-4843-9420-9093775f2a16",
"rel": "parent",
"type": "application/json"
},
{
"href": "https://api.up42.dev/v2/internal/assets/stac/collections/76d8aa65-952e-4843-9420-9093775f2a16/items",
"rel": "self",
"type": "application/json"
}
],
"type": "FeatureCollection",
"features": [
{
"assets": {
"data": {
"href": "https://api.up42.dev/v2/assets/035d41f1-058f-4494-b074-ea413f3f4bb1",
"title": "Data",
"description": "Storage Data",
"type": "application/gzip",
"roles": [
"data"
]
}
},
"links": [
{
"href": "https://api.up42.dev/v2/internal/assets/stac/collections/76d8aa65-952e-4843-9420-9093775f2a16/items/64f9f02e-b8e0-482c-9b79-a350a6891937",
"rel": "self",
"type": "application/geo+json"
},
{
"href": "https://api.up42.dev/v2/internal/assets/stac/collections/76d8aa65-952e-4843-9420-9093775f2a16",
"rel": "parent",
"type": "application/json"
},
{
"href": "https://api.up42.dev/v2/internal/assets/stac/collections/76d8aa65-952e-4843-9420-9093775f2a16",
"rel": "collection",
"type": "application/json"
},
{
"href": "https://api.up42.dev/v2/internal/assets/stac/",
"rel": "root",
"type": "application/json"
}
],
"stac_extensions": [
"https://stac-extensions.github.io/projection/v1.0.0/schema.json",
"https://stac-extensions.github.io/view/v1.0.0/schema.json",
"https://api.up42.com/stac-extensions/up42-product/v1.0.0/schema.json",
"https://stac-extensions.github.io/eo/v1.0.0/schema.json",
"https://api.up42.com/stac-extensions/up42-system/v1.0.0/schema.json"
],
"geometry": {
"type": "Polygon",
"coordinates": [
[
[
13.405284660895354,
52.505261378042256
],
[
13.438730113555541,
52.505261378042256
],
[
13.438730113555541,
52.48483722634566
],
[
13.405284660895354,
52.48483722634566
],
[
13.405284660895354,
52.505261378042256
]
]
]
},
"bbox": [
13.405284660895354,
52.48483722634566,
13.438730113555541,
52.505261378042256
],
"properties": {
"gsd": 2.352509087916183,
"title": "ORT_SPOT7_20180504_094339900_000_R1C1",
"datetime": "2018-05-04T09:43:39.700000+00:00",
"platform": "SPOT-7",
"proj:epsg": 4326,
"end_datetime": "2018-05-04T09:43:39.700000+00:00",
"view:azimuth": 154.19004879325968,
"constellation": "SPOT",
"eo:cloud_cover": 0.0,
"start_datetime": "2018-05-04T09:43:39.700000+00:00",
"view:sun_azimuth": 149.3071839333792,
"view:sun_elevation": 50.29211361251306,
"up42-system:asset_id": "035d41f1-058f-4494-b074-ea413f3f4bb1",
"view:incidence_angle": 16.720128621987143,
"up42-product:modality": "multispectral",
"up42-product:data_type": "raster",
"up42-system:account_id": "8cd5de7b-82e2-4625-b094-d5392f1cf780",
"up42-system:workspace_id": "3746c3fa-5f89-406e-a674-6c1e3ffbac3f",
"up42-system:metadata_version": "0.0.17"
},
"type": "Feature",
"stac_version": "1.0.0",
"id": "64f9f02e-b8e0-482c-9b79-a350a6891937",
"collection": "76d8aa65-952e-4843-9420-9093775f2a16"
}
]
}
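In the two examples above, the collection's `extent` matches the `bbox` and the start and end datetimes of its single item. One plausible way to derive a collection extent from its items is sketched below; whether UP42 computes it this way is an assumption, and the function name is made up. The sketch handles 2D bounding boxes only, ignores antimeridian crossing, and expects timestamps with an explicit offset such as `+00:00`, as in the item example.

```python
from datetime import datetime
from typing import Any


def collection_extent(items: list[dict[str, Any]]) -> dict[str, Any]:
    """Build a STAC collection `extent` object covering all given items."""
    if not items:
        raise ValueError("cannot derive an extent from zero items")

    # Union of the item bounding boxes: [west, south, east, north].
    bboxes = [item["bbox"] for item in items]
    bbox = [
        min(b[0] for b in bboxes),
        min(b[1] for b in bboxes),
        max(b[2] for b in bboxes),
        max(b[3] for b in bboxes),
    ]

    # Prefer start_datetime/end_datetime and fall back to datetime.
    starts, ends = [], []
    for item in items:
        props = item["properties"]
        starts.append(datetime.fromisoformat(props.get("start_datetime") or props["datetime"]))
        ends.append(datetime.fromisoformat(props.get("end_datetime") or props["datetime"]))

    return {
        "spatial": {"bbox": [bbox]},
        "temporal": {"interval": [[min(starts).isoformat(), max(ends).isoformat()]]},
    }
```

With several tiles or stereo pairs in one collection, the same function yields the envelope of all scenes and the earliest and latest acquisition times.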
Lessons learned
1. STAC provides flexibility
While STAC can provide a lot of flexibility, it can be hard to find the right solution design, in our case for a data storage solution. Usually, a STAC catalog is mapped to a data product (e.g., Sentinel L2A, a processing level combined with a satellite constellation). At UP42, due to the nature of our storage, we implemented STAC on the collection level: for example, each delivery of SPOT imagery gets its own STAC collection, and each user is mapped to a STAC catalog. Onboarding can also be challenging, and education is necessary, even for geospatial experts and downstream customers.
2. Data provider deliveries have their intricacies
Schema files, example files, and documentation are sometimes not enough and delivery can vary based on provider automation. We rely heavily on provider files but there are still some cases of missing metadata. Also, extracting information from assets accumulated in the last three years of UP42 storage led to some interesting findings (we did have a good laugh when we came across tomato.png in our database).
3. Technical components need to be chosen carefully
The mapping of each UP42 delivery to a STAC collection, which we introduced initially, needed better handling in the core database queries. This meant we had to change the database implementation from an open-source one to our own version of an SQL-based backend.
Current state and way forward
Next on our journey is transforming the data in storage into cloud-native asset formats, starting with Cloud Optimized GeoTIFFs (COGs). We will also continue to introduce additional STAC extensions and add more metadata. Our goal is to give users a single interface to search for data with one set of parameters using STAC, regardless of whether the results are tasking orders, existing data in storage, or suggestions for archive data similar to what you are looking for.
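As an illustration of what "one set of parameters" could look like, the sketch below assembles a request body using the standard STAC API item search fields (`bbox`, `datetime`, `collections`, `limit`). The helper names are made up, only 2D bounding boxes are handled, and nothing here describes an actual UP42 search endpoint.

```python
from datetime import datetime, timezone
from typing import Any, Optional, Sequence


def _rfc3339(value: Optional[datetime]) -> str:
    """Format a timezone-aware datetime as UTC; None becomes an open end ('..')."""
    if value is None:
        return ".."
    return value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_search_body(
    bbox: Optional[Sequence[float]] = None,
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
    collections: Optional[Sequence[str]] = None,
    limit: int = 10,
) -> dict[str, Any]:
    """Assemble a JSON-serializable body for a STAC API item search."""
    body: dict[str, Any] = {"limit": limit}
    if bbox is not None:
        if len(bbox) != 4:
            raise ValueError("bbox must be [west, south, east, north]")
        body["bbox"] = list(bbox)
    if start is not None or end is not None:
        # Interval syntax: "start/end", with ".." marking an open end.
        body["datetime"] = f"{_rfc3339(start)}/{_rfc3339(end)}"
    if collections:
        body["collections"] = list(collections)
    return body
```

The same body could in principle be sent to any service that implements the STAC API item search, which is what makes a shared parameter set attractive.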