EO Data Access (R)evolution
Earth Observation (EO) produces a massive amount of data (34 PB available on CREODIAS) that needs to be archived and made accessible to end-users. The operational costs of maintaining Immediately Available Data (IAD) became a burden for many data platforms operating at a large scale, such as the Copernicus Open Access Hub. Consequently, a rolling data policy was implemented whereby, after a certain time, EO products are moved from the costly IAD to the less expensive Long Term Archive (LTA). Within the LTA, data is stored on magnetic tape, which is the cheapest storage solution but requires additional cartridge handling (either manual or, more efficiently, automatic). Either way, restoring data from the tape LTA to IAD takes a long time, so a user may have to wait hours, days, or even weeks before the data becomes accessible again.
Furthermore, the IAD retention policy may cause another problem: a user may not manage to process the restored data before it is moved back to the LTA. LTA performance has improved significantly thanks to advances in HDD technology, which keeps getting cheaper while offering more capacity and lower power demand. Nevertheless, HDD/tape LTAs are not suitable for bulk data processing or for data streaming, e.g. by OGC services (WMS/WFS). In this respect, only the IAD can provide sufficient performance to publish and visualize data on the Internet.
Another advantage of the IAD is related to partial reads of large data files, provided they are stored in an optimized chunked format such as Cloud Optimized GeoTIFF (COG) or Zarr for rasters, and GeoParquet for vectors. Partial reads, for instance via HTTP “range requests”, are indispensable for the “embarrassingly parallel workloads” that are the backbone of libraries such as Dask for Python. They allow small chunks of data to be processed in parallel, so there is no need to wait for the complete dataset to be ingested before processing starts. This significantly shortens the time spent on I/O operations. Another interesting feature of partial reads is “lazy” processing, in which a data chunk is processed only when a subsequent step asks for it. This is commonly used by services such as Google Earth Engine (GEE), which processes data on-the-fly for a limited area of interest (AOI), e.g. for the current display extent and zoom level.
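To make the idea of partial, lazy reads more concrete, here is a minimal Python sketch. It assumes a deliberately simplified layout: a hypothetical 1024 x 1024 pixel raster, one byte per pixel, stored as uncompressed fixed-size tiles in row-major order. Real COG or Zarr files locate their chunks through an index or metadata rather than by arithmetic, so the function names and the layout below are illustrative assumptions only. The sketch finds the tiles that intersect an AOI and builds the value of the HTTP Range header that would fetch each of them.

```python
import math
from typing import Iterator, Tuple

Tile = Tuple[int, int]  # (row, col) index of a tile in the tile grid


def tiles_for_aoi(aoi: Tuple[int, int, int, int], tile_size: int) -> Iterator[Tile]:
    """Lazily yield tiles intersecting a pixel window (x0, y0, x1, y1); x1 and y1 are exclusive."""
    x0, y0, x1, y1 = aoi
    for row in range(y0 // tile_size, math.ceil(y1 / tile_size)):
        for col in range(x0 // tile_size, math.ceil(x1 / tile_size)):
            yield (row, col)  # nothing is computed until the caller asks for the next tile


def range_header(tile: Tile, tiles_per_row: int, tile_bytes: int, data_offset: int = 0) -> str:
    """Build an HTTP Range header value for one tile (assumes fixed-size, row-major tiles)."""
    row, col = tile
    start = data_offset + (row * tiles_per_row + col) * tile_bytes
    end = start + tile_bytes - 1  # HTTP byte ranges are inclusive on both ends
    return f"bytes={start}-{end}"


# Hypothetical raster: 1024 x 1024 pixels, 256-pixel tiles, 1 byte per pixel.
TILE_SIZE = 256
TILES_PER_ROW = 1024 // TILE_SIZE
TILE_BYTES = TILE_SIZE * TILE_SIZE

aoi = (200, 0, 520, 256)  # pixel window that touches three tiles in the first tile row
headers = [range_header(t, TILES_PER_ROW, TILE_BYTES) for t in tiles_for_aoi(aoi, TILE_SIZE)]
```

In this toy case only 3 of the 16 tiles are requested, and because `tiles_for_aoi` is a generator, a consumer that stops early never triggers the remaining requests.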
All the aforementioned features require IAD, which is being implemented on an unprecedented scale in the Copernicus Data Space Ecosystem (CDSE) project that will be the (r)evolution of the Copernicus Open Access Hub. Within the CDSE, the entire archive of unpacked/unzipped Sentinel satellite data and much more (e.g. Landsat, ENVISAT, SMOS, COP DEM) will be publicly available as IAD for download. Data processing will be possible using the CREODIAS cloud, a commercial component of the CDSE. Having all Copernicus satellite and EO ancillary data instantly accessible will take the capabilities of EO analytics to the next level.
But how can petabytes of data be accessed efficiently? The key is to have a high-performance, redundant storage system combined with an efficient protocol for data access and management. The CREODIAS platform is based on the open-source CEPH technology, built using more than 11,000 HDD drives, and provides 3-in-1 interfaces for object-, block- and file-level storage. Data access to CEPH is based on the Amazon S3 protocol, which allows for more detailed metadata assignment, leading to better data organization and discovery. A single data object stored in CEPH is distributed among many HDD drives. Thus, when data is read in parallel (e.g. using the s5cmd command), the data throughput can exceed the speed of a high-performance NVMe disk (~1 GB/s). High I/O throughput is a key requirement when processing large EO datasets.
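The benefit of parallel reads can be illustrated with a short Python sketch. It does not talk to CEPH or S3: an in-memory `bytes` object stands in for a stored object, and the hypothetical `read_range` function stands in for one ranged GET request. The point is only the pattern, which is to split an object into byte ranges, fetch the ranges concurrently, and reassemble them in order.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def split_ranges(size: int, part_size: int) -> List[Tuple[int, int]]:
    """Split [0, size) into inclusive (start, end) byte ranges of at most part_size bytes."""
    return [(start, min(start + part_size, size) - 1) for start in range(0, size, part_size)]


def read_range(blob: bytes, start: int, end: int) -> bytes:
    """Stand-in for one ranged GET; a real client would send 'Range: bytes=start-end'."""
    return blob[start:end + 1]


def parallel_read(blob: bytes, part_size: int, workers: int = 8) -> bytes:
    """Fetch all parts concurrently and reassemble them in their original order."""
    ranges = split_ranges(len(blob), part_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in input order, whatever order the reads finish in.
        parts = pool.map(lambda r: read_range(blob, r[0], r[1]), ranges)
        return b"".join(parts)


fake_object = bytes(range(256)) * 40  # 10,240 bytes standing in for a stored object
restored = parallel_read(fake_object, part_size=1000)
```

With a real object store, each worker would be served by a different subset of drives, which is why the aggregate throughput can grow with the number of concurrent requests until the network or the client becomes the bottleneck.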
The EO Data Access (R)evolution is not limited to the provisioning of a high-performance IAD. The volume of available EO data long ago exceeded the capacity of a single data center. Thus, there is a strong demand for the federation of data centers and for data duplication as Deferred Available Data (DAD). This ensures that data survives and can be recovered if a single data center malfunctions. Federating multiple data centers is very challenging, as it requires a Harmonized Data Access (HDA) protocol that exposes a single API to the user in the frontend while connecting to multiple endpoints featuring different APIs. Such a solution is being implemented as a part of the CREODIAS platform, which will be the backbone of the DestinE Data Lake (DEDL) service. The IAD and a fast HDA are required to model environmental processes by the Digital Twin Earths, which are the future of monitoring and forecasting the state of our Planet.
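The "single API in front of many different APIs" idea can be sketched with the adapter pattern in Python. Everything here is hypothetical: the two data centers, their native method names (`list_products`, `query`), the product identifiers, and the `HarmonizedAccess` class are invented for illustration and do not describe the actual HDA protocol.

```python
from typing import Dict, List, Protocol


class Backend(Protocol):
    """Common interface that every adapter exposes to the frontend."""
    def search(self, collection: str) -> List[str]: ...


class CenterA:
    """Hypothetical data center whose native API is list_products(dataset)."""
    def __init__(self, holdings: Dict[str, List[str]]) -> None:
        self._holdings = holdings

    def list_products(self, dataset: str) -> List[str]:
        return list(self._holdings.get(dataset, []))


class CenterB:
    """Hypothetical data center whose native API is query(params) returning records."""
    def __init__(self, records: List[Dict[str, str]]) -> None:
        self._records = records

    def query(self, params: Dict[str, str]) -> List[Dict[str, str]]:
        return [r for r in self._records if r["collection"] == params["collection"]]


class CenterAAdapter:
    def __init__(self, center: CenterA) -> None:
        self._center = center

    def search(self, collection: str) -> List[str]:
        return self._center.list_products(collection)


class CenterBAdapter:
    def __init__(self, center: CenterB) -> None:
        self._center = center

    def search(self, collection: str) -> List[str]:
        return [r["id"] for r in self._center.query({"collection": collection})]


class HarmonizedAccess:
    """Single frontend API: fans a search out to all backends and merges the results."""
    def __init__(self, backends: List[Backend]) -> None:
        self._backends = backends

    def search(self, collection: str) -> List[str]:
        found = set()  # a set removes products duplicated across data centers
        for backend in self._backends:
            found.update(backend.search(collection))
        return sorted(found)


center_a = CenterA({"SENTINEL-2": ["S2_0001", "S2_0002"]})
center_b = CenterB([{"id": "S2_0002", "collection": "SENTINEL-2"},
                    {"id": "S2_0003", "collection": "SENTINEL-2"}])
hda = HarmonizedAccess([CenterAAdapter(center_a), CenterBAdapter(center_b)])
products = hda.search("SENTINEL-2")
```

The user only ever calls `hda.search`; adding another data center means writing one more adapter, without changing the frontend.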
To conclude, it must be emphasized that access to EO data has never been easier or cheaper, thanks to the CDSE and DEDL projects that will benefit from the IAD and the massive computing capabilities of the CREODIAS platform.
Author: Dr Jan Musiał, Senior Data Scientist at CloudFerro