As a result of a research collaboration between CloudFerro and Φ-lab, a research laboratory of the European Space Agency (ESA), the first global embedding dataset for Earth Observation (EO) has been introduced. This groundbreaking publication integrates cutting-edge AI technologies to enhance EO capabilities and provide more precise and scalable analysis of satellite data. The AI community can use global embeddings for EO in their research and application development.
Derived from advanced AI models, these embeddings transform vast amounts of satellite imagery into efficient, high-dimensional vector data representations. This innovation marks a significant milestone, enabling smarter and faster analysis of EO data at an unprecedented scale. The global embeddings were computed using the CREODIAS, cloud service platform operated by CloudFerro, powered by GPU-accelerated instances.
“The traditional interaction of users with EO data is going to change dramatically with the wide introduction of embedding products at the scale of full Sentinel data archives. This prototype we built here is the first step towards understanding the value brought by this approach,” says Dr. Mikolaj Czerkawski at ESA Φ lab who led the development of MajorTOM and the technical collaboration with CloudFerro. ”By developing and releasing in a fully open-source setting, we demonstrate how the open data programmes like Copernicus, once again, can deliver unprecedented benefits to the wide community,” adds the expert.
What are embeddings and how do they work?
Embeddings are high-dimensional vectors that transform complex data, such as images or documents into numerical representations. This structured format captures relationships and semantic meaning within the data, allowing AI models to process and analyse it with remarkable context-awareness and precision. This enables machines to identify patterns, similarities, and connections that might otherwise be challenging to detect.
“We’re proud to be at the forefront of such innovation and to realize this ambitious project with ESA AI experts. The Sentinel data embeddings generated with Major TOM and hosted on our CREODIAS platform will bring new capabilities to the geospatial community by making high-quality, AI-ready data accessible globally,” says Dr. Jędrzej Bojanowski, Data Science Manager at CloudFerro. “This collaboration highlights our commitment to embrace the AI revolution and introduce it to EO data ecosystem at large, including Copernicus,” adds the expert.
Embeddings transform raw data into a structured format that can be meaningfully interpreted, allowing AI models to extract deeper insights and relationships. By capturing the underlying patterns and connections within the data, embeddings enable more accurate and context-aware analysis. This approach not only enhances the ability to process complex information but also drives progress in areas such as machine learning, natural language understanding, and computer vision. Embeddings provide the foundation for scalable, versatile AI solutions, unlocking new possibilities across a wide range of applications, from predictive modeling to advanced decision-making systems.
“With this release ESA is adding more momentum to these efforts to help secure a strong position of the European entities in this area,” says Anna Burzykowska, Copernicus Innovation Officer at ESA. “We are keen to continue our collaboration with our industrial and research partners and work diligently to lay the key foundation needed to grow the core of this technology here in Europe, especially for the Copernicus Programme,” adds the expert.
The role of embeddings for EO
Embeddings are increasingly valuable in the field of Earth Observation (EO), offering a range of applications for professionals across this sector. Embeddings can be leveraged by a wide range of professionals across the Earth Observation sector. These include remote sensing scientists, geospatial analysts, and environmental researchers who work with satellite imagery and geospatial data.
How they were calculated
Using Copernicus satellite data, we have generated over 170 million embeddings from 62 TB of raw data, representing 9.368 trillion pixels. By processing more than 8 million images, we condensed this massive amount of information into just 1 TB of optimized data. These streamlined datasets capture essential insights, making it simpler for researchers and analysts to work with the data, fine-tune AI models, and gain valuable insights—without needing to handle the complexity of large, raw datasets.
Available embedding models
This work is part of an expanded standard for releasing Major TOM (https://huggingface.co/Major-TOM) Embedding expansions, now available through open datasets on HuggingFace, including:
- Sentinel-2 Multispectral SSL4EO Model: Core-S2L1C-SSL4EO
- Sentinel-1 RTC SSL4EO Model: Core-S1RTC-SSL4EO
- Sentinel-2 RGB DINOv2 Model: Core-S2RGB-DINOv2
- Sentinel-2 RGB SigLIP Model: Core-S2RGB-SigLIP
Computing environment
Powered by CloudFerro’s GPU-accelerated cloud infrastructure on the CREODIAS platform and guided by ESA’s Φ-lab expertise, this effort demonstrates the potential of AI-driven solutions in EO. The embeddings leverage state-of-the-art vision models like SigLIP, DINOv2, and SSL4EO, unlocking new possibilities for advanced EO tasks.
Plans and development
The next phase involves assessing the performance of these embeddings across diverse Earth Observation (EO) tasks, such as detecting patterns and building predictive models. We will also explore additional foundation models, including MMEarth and DeCUR, to refine their capabilities and ensure seamless integration. Furthermore, the MajorTOM dataset, enriched with embeddings, will be hosted on the CREODIAS repository, providing open access to researchers and fostering collaboration across the EO community.
We proudly introduce Sherlock, an innovative platform providing access to advanced AI models. This groundbreaking development opens new possibilities for companies and organizations seeking to leverage the potential of artificial intelligence.
Sherlock is an innovative Generative AI service created by CloudFerro. The platform is designed for organizations aiming to utilize AI capabilities without the need to manage complex infrastructure. By using endpoints compatible with OpenAI libraries, Sherlock ensures seamless integration with existing solutions, allowing businesses to focus on innovation instead of technical challenges.
Sherlock – security and data privacy
One of Sherlock's core pillars is data security. The models operate within a Polish data center, which ensures that data is always fully protected and remains under the user’s control. Moreover, users’s data inputted into the platform are neither stored nor used for further model training. This approach makes Sherlock an ideal solution for sectors demanding the highest standards of privacy, such as healthcare, public administration, and banking.
A rich variety of advanced AI models
Sherlock offers language models customized for various applications. Bielik 11B v2.3 Instruct is the first advanced language model developed specifically for Polish-language data, created in collaboration with the open-science initiative SpeakLeash (Spichlerz) and the Academic Computer Centre Cyfronet AGH. This model delivers a unique ability to process Polish-language documents and datasets, capturing local nuances and cultural specifics in communication. Meanwhile, Llama 3.1 70B is the largest and most advanced model in the lineup, perfect for complex multilingual tasks involving in-depth analysis and content generation. Both models are easily integrated via a standard API compatible with OpenAI.
The Sherlock platform also provides access to two embedding models: BGE-Multilingual-Gemma2 and e5-mistral-7b-instruct, which enable advanced semantic search systems and large-scale data analysis. For instance, Sherlock can be employed to develop systems that identify key similarities between texts, making it especially valuable in fields such as law, medicine, and science.
Seamless Integration
Sherlock is designed with maximum integration simplicity in mind. You only need to replace a few lines of code in the popular OpenAI library:
from openai import OpenAI
import os
client = OpenAI(
api_key=os.environ['SHERLOCK_API_KEY'],
base_url="https://api-sherlock.cloudferro.com/openai/v1"
)
model = "Llama-3.1-70B-Instruct"
This simplicity means that organizations can quickly start using advanced AI models without complex integration processes or changes to their existing infrastructure.
Versatile applications
The platform offers models tailored to various use cases:
- Conversational and dialogue systems enabling natural interactions with users.
- Advanced document analysis and content generation.
- Automation of complex workflows.
- Advanced RAG (Retrieval-Augmented Generation) implementations, allowing the creation of intelligent systems that answer questions based on proprietary knowledge bases.
As Sherlock is a high-performance cloud-based platform equipped with powerful GPU processors, the user does not need to invest in and manage their own IT infrastructure. Organizations can immediately start using advanced AI models, focusing on developing their solutions rather than managing technical infrastructure.
The development of the AI ecosystem and the future of the Sherlock platform
Sherlock is part of a larger technological ecosystem created by experienced specialists and innovative companies. CloudFerro is building a network of business connections and opportunities, fostering the development of advanced AI solutions. This pragmatic approach lets us create tangible business value based on the latest advancements in artificial intelligence.
Sherlock shows that local, independent AI solutions can not only match global players, but also offer value that models developed with an international audience in mind will not provide. This is a step towards decentralised AI development based on local knowledge, resources and needs.
The launch of the platform is just the beginning of ambitious development plans. In the near future, Sherlock will be systematically expanded to include:
- Additional open AI models.
- Capabilities for image interpretation and generation.
- Dedicated models for Earth Observation (planned for Q1 2025).
As an experienced provider of cloud services for the most demanding sectors, CloudFerro guarantees not only high-quality services but also dedicated technical support from a team of experts. Our knowledge and experience in handling advanced projects for the space and climate research sectors allow us to deliver reliable and scalable AI solutions.
AI embeddings for Earth Observation
As a result of CloudFerro’s cooperation with Φ-lab, a research laboratory of the European Space Agency (ESA),global AI embeddings for Earth Observation have been introduced. These embeddings transform vast satellite data collections into easily analysable vector-based numerical representations. Embeddings, i.e. vector image descriptions processed by artificial intelligence models, are gaining increasing importance in Earth Observation. They can be utilised by remote sensing scientists, GIS analysts, and environmental researchers working with satellite imagery and geospatial data. By processing over 62 TB of data from the Copernicus program and leveraging advanced AI models like DINOv2, 170 million embeddings have been created, significantly simplifying and accelerating work in remote sensing and environmental management.
The creation of embeddings for Earth Observation stems from CloudFerro's specialisation in the space sector. The company provides innovative cloud services for processing and storing multi-petabyte satellite data collections for Earth Observation and is the prime contractor for key projects for the European Space Agency. Together with consortium partners, CloudFerro develops and operates the Copernicus Data Space Ecosystem platform, the primary access point to data from the Copernicus programme.
The HMAD virtual machine family is now generally available in the FRA1-2 cloud region. These virtual machines offer the same pricing structure as the HMD family while incorporating advanced AMD processors that deliver notable performance enhancements. Users can expect improved CPU and RAM performance, making HMAD VMs a suitable choice for demanding workloads.
The inclusion of local ephemeral storage makes HMAD VMs particularly well-suited for real-time processing tasks, especially those related to Earth Observation (EO) data.
This type of storage is also highly advantageous for applications that require built-in data redundancy, including non-relational databases. By leveraging state-of-the-art computing capabilities alongside local storage, HMAD VMs present a robust solution for users seeking high-speed and efficient processing capabilities.
If you wish to learn more, please contact us at support@creodias.eu.
CloudFerro is excited to announce the launch of the WAW4-1 cloud, a new addition to our services within the WAW4 region, deployed in October 2024. The WAW4-1 cloud represents a significant advancement in our commitment to providing state-of-the-art cloud solutions.
WAW4-1 Cloud Specifications
- Launch Date: October 2024
- Location: Orange Łazy Data Center, WAW4 Region
- Capacity: Doubling our previous capacity
- OpenStack Version: Caracal
- CEPH Version: Quincy
Available Virtual Machines
The WAW4-1 cloud offers a variety of VM flavors tailored to different requirements:
- eo1
- eo2a
- hmda
- hma
These VMs are optimized for applications ranging from Earth observation data parallel processing to high-performance computing tasks.
Access and Management
- Cloud Management Panel: Manage your resources seamlessly via horizon.cloudferro.com.
- S3 Storage Endpoint: Access scalable object storage at s3.waw4-1.cloudferro.com.
- EODATA Endpoint: The endpoint eodata.cloudferro.com is accessible both from VMs and over the internet, ensuring uninterrupted access to essential datasets.
Benefits of the WAW4-1 Cloud
- Advanced Technologies: Leveraging OpenStack Caracal and CEPH Quincy for improved cloud orchestration and storage solutions.
- Enhanced Performance: Greater processing power and reduced latency for efficient operations.
- Scalability: Flexibility to scale resources according to workload demands.
- Reliability: High availability ensured by the robust infrastructure of the Orange Łazy Data Center.
About the Orange Łazy Data Center
Hosting the WAW4-1 cloud, the Orange Łazy Data Center offers:
- High Security Standards: Advanced measures to protect your data.
- Robust Infrastructure: State-of-the-art facilities for reliable service delivery.
- Sustainable Operations: Energy-efficient practices aligning with environmental sustainability.
- Strategic Location: Proximity to major network hubs for optimal connectivity.
Looking Forward
During the first month of operation, WAW4-1 will undergo stress testing, with the possibility of brief interruptions in API availability as we fine-tune resources. In the coming months, we will add more VM families and GPU offerings to this cloud, making it the best in our public portfolio.