At the heart of the EXTRACT framework is data. The EXTRACT project, designed to tackle extreme data scenarios, leverages Nuvla.io and NuvlaEdge to create an efficient data ecosystem. Let’s dive into how the Nuvla data catalogue is shaping real-world applications and driving strategic advancements within EXTRACT.
Designed specifically for distributed environments, this catalogue offers a robust framework for registering, organizing, and managing datasets. Whether you’re working across edge computing setups or hybrid cloud infrastructures, Nuvla brings order and efficiency to your data operations.
Main features of the Nuvla data catalogue:
- Metadata-Driven Indexing: Nuvla’s catalogue leverages rich metadata to help users efficiently search, correlate, and discover datasets (see the search sketch after this list).
- Version Tracking & Reusability: It automatically logs versions and usage history, making reproducibility and data governance straightforward.
- Automated Notifications: Whenever new data lands in your catalogue, Nuvla can send out instant notifications—using MQTT or similar protocols—to keep your workflows in sync.
- Seamless Storage Integration: The platform works natively with MinIO and S3-compatible storage systems, enabling rapid data ingestion and access.
- Dynamic Collections: Users can assemble datasets from various data objects and records, applying filters to create collections tailored to specific needs.
- Application Binding: Datasets are easily linked to compatible applications, so data can be processed directly within Nuvla.io—streamlining analysis and deployment.
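To make the metadata-driven search concrete, here is a minimal sketch using the nuvla-api Python client. The endpoint, credentials, filter expression, and field names (such as `content-type` and the `taska` tag) are illustrative assumptions, not part of any actual EXTRACT configuration.

```python
# Minimal sketch: querying data-records by metadata with the nuvla-api
# Python client. The filter and field names below are illustrative
# assumptions; adapt them to your own record schema.
from nuvla.api import Api

api = Api(endpoint="https://nuvla.io")
api.login_apikey("credential/your-api-key", "your-secret")  # placeholder credentials

# CIMI-style filter: recent records of a given content type, tagged for a project.
records = api.search(
    "data-record",
    filter='content-type="application/fits" and tags="taska"',
    orderby="created:desc",
    last=20,
)

for record in records.resources:
    print(record.id, record.data.get("name"), record.data.get("bucket"))
```

The same filter syntax is what a dynamic collection relies on: instead of listing objects one by one, a data-set can carry a filter and always resolve to the records that currently match it.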
When it comes to managing data at scale, the Nuvla data catalogue stands out with technical features designed for both performance and flexibility. At its core, the platform follows a WORM (Write Once, Read Many) model, ensuring that data is not only highly accessible but also inherently reproducible, which is crucial for scenarios where data integrity is paramount.
The lifecycle of your information is managed through distinct resources: data-objects represent S3 assets, data-records allow you to enrich those assets with user-defined metadata, and data-sets offer dynamic grouping to streamline project organization and sharing. Automation and integration are seamless thanks to a feature-rich web interface, a comprehensive REST API, and built-in event notifications via MQTT brokers, all of which empower teams to construct powerful, event-driven workflows.
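As a rough illustration of how these three resources relate, the sketch below registers a data-record that enriches an S3 asset with user-defined metadata and then defines a data-set that groups matching records through a filter. It assumes the nuvla-api Python client; the payload field names (`infrastructure-service`, `bucket`, `object`, `data-record-filter`) and identifiers are assumptions to be checked against the Nuvla API documentation.

```python
from nuvla.api import Api

api = Api(endpoint="https://nuvla.io")
api.login_apikey("credential/your-api-key", "your-secret")  # placeholder credentials

# A data-record enriches an S3 asset (the data-object) with user-defined metadata.
record = api.add("data-record", {
    "infrastructure-service": "infrastructure-service/your-s3-service",  # placeholder id
    "name": "nenufar-burst-observation",
    "bucket": "extract-taska",                # assumed bucket name
    "object": "solar/burst-observation.fits", # assumed object key
    "content-type": "application/fits",
    "tags": ["taska", "solar-burst"],
})
print("created:", record.data["resource-id"])

# A data-set groups records dynamically through a filter instead of a static list.
dataset = api.add("data-set", {
    "name": "TASKA solar bursts",
    "data-record-filter": 'tags="solar-burst"',
})
print("created:", dataset.data["resource-id"])
```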
Together, these technical capabilities make Nuvla a go-to solution for data-driven teams navigating the complexities of edge, cloud, and hybrid environments.
Within the EXTRACT project, Nuvla and NuvlaEdge serve as pivotal components for managing and cataloguing new data, while also providing automated notifications. Each dataset generated is systematically recorded in the Nuvla data catalogue as both a data-object and a data-record. The EXTRACT infrastructure subsequently requests an S3 storage link via this data-object. Upon confirmation of registered S3 credentials, the Nuvla data catalogue issues a storage link, enabling the EXTRACT infrastructure to store the data accordingly. Users receive immediate notifications of new S3 object creation through a dedicated MQTT channel.
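On the consuming side, these notifications can drive workflows directly. Below is a minimal sketch, using paho-mqtt (version 2.0 or later), of a subscriber reacting to new-object events; the broker address, topic name, and JSON payload shape are assumptions made for illustration, since the actual channel is configured per deployment.

```python
import json
import paho.mqtt.client as mqtt

BROKER = "mqtt.example.org"          # placeholder: deployment-specific broker
TOPIC = "extract/data/new-object"    # placeholder: the dedicated channel name

def on_connect(client, userdata, flags, reason_code, properties):
    # Subscribe once the connection is established.
    client.subscribe(TOPIC)

def on_message(client, userdata, msg):
    # Assumed payload shape: a JSON document describing the new S3 object.
    event = json.loads(msg.payload)
    print("new object:", event.get("bucket"), event.get("object"))
    # ...trigger downstream processing here, e.g. queue an analysis job

client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
client.on_connect = on_connect
client.on_message = on_message
client.connect(BROKER, 1883)
client.loop_forever()
```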
This data catalogue capability is engineered to accommodate large, high-volume datasets with real-time demands, ensuring efficient data registration and event handling.
It is particularly relevant in the TASKA use case (Transient Astrophysics), which deals with massive, high-velocity data streams from radio telescopes like NenuFAR during solar activity bursts. These streams produce large volumes of interferometric data that need to be processed quickly for scientific analysis. Furthermore, it fits well in the data architecture of the PER use case thanks to the MQTT broker integration, allowing new datasets to quickly feed the multi-agent reinforcement learning models that optimize evacuation routes.
Nuvla’s data catalogue indexes datasets, tracks their history, and enables seamless automation in response to data events. These features collectively accelerate data ingestion and foster improved interoperability, benefiting both research and industrial teams that rely on streamlined and reliable workflows.
