The EXTRACT monitoring platform, developed by Ikerlan, aims to ensure that all nodes in the Edge-to-Cloud continuum operate according to defined functional and non-functional requirements while providing up-to-date information on their availability, performance, and resource usage to support intelligent scheduling, deployment, and infrastructure management.
The main challenge of the EXTRACT monitoring platform is managing and monitoring a highly heterogeneous and distributed set of nodes across the Edge-to-Cloud continuum, ranging from powerful cloud servers to resource-constrained edge devices, while ensuring low overhead, real-time data collection, and integration of diverse metrics to support efficient orchestration and system reliability.
Given this context, monitoring the network state is essential to ensure optimal performance and Quality of Service (QoS) for distributed applications in Edge-to-Cloud systems. Traditional network state monitoring often relies on centralized methods that infer metrics from small samples to improve scalability. However, this introduces two major limitations: reduced precision and a centralization bottleneck. In highly heterogeneous Edge-to-Cloud environments, where network conditions vary frequently and unpredictably, these limitations can severely impact application performance and orchestration decisions. Central nodes become scalability choke points because the number of links to monitor grows quadratically with the number of nodes (O(N²), where N is the number of nodes), and sampled inferences often fail to represent the true, dynamic state of the network [1].
As part of the activities to evolve the EXTRACT monitoring platform, SWIM-NSM has been implemented: a network state monitoring module that measures latencies and failure rates between the nodes of the architecture. It follows a distributed architecture, with a Prometheus exporter on each node. This solution builds on the SWIM protocol (Scalable Weakly-consistent Infection-style process group Membership), which was originally designed for efficient failure detection and state dissemination in large distributed systems.
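To make the per-node exporter idea concrete, the following minimal sketch renders link metrics in the Prometheus text exposition format using only the Python standard library. The metric names (`swim_nsm_link_rtt_seconds`, `swim_nsm_link_loss_ratio`), the label names, and the `render_metrics` helper are hypothetical and chosen for illustration; they are not taken from the SWIM-NSM implementation.

```python
def render_metrics(node: str, links: dict) -> str:
    """Render per-link gauges in the Prometheus text exposition format.

    `links` maps a peer name to a dict of measurements for the link
    node -> peer. All metric and label names below are hypothetical.
    """
    specs = [
        ("swim_nsm_link_rtt_seconds", "Mean round-trip time to the peer", "rtt_seconds"),
        ("swim_nsm_link_loss_ratio", "Fraction of probes to the peer that were lost", "loss_ratio"),
    ]
    lines = []
    for name, help_text, key in specs:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        # One sample per monitored link, labelled with both endpoints.
        for peer, measurements in sorted(links.items()):
            lines.append(f'{name}{{source="{node}",target="{peer}"}} {measurements[key]}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metrics("edge-1", {"cloud-1": {"rtt_seconds": 0.012, "loss_ratio": 0.2}}))
```

In a real deployment, a Prometheus server would scrape text of this shape from an HTTP endpoint on every node, so no single node has to measure all links itself.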
SWIM uses ping and indirect ping mechanisms for failure detection and a gossip-based approach for disseminating state changes. Its design reduces network interactions from quadratic to linear complexity, trading off failure detection speed for greater scalability, which is an acceptable compromise for network state monitoring. Figure 1 compares the message exchanges required by SWIM-NSM with those of a typical heartbeat-based protocol. Even with a small number of nodes, SWIM-NSM requires fewer message exchanges, and this advantage becomes increasingly significant as the number of nodes grows.

Fig. 1 Number of messages required by SWIM-NSM.
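The quadratic versus linear growth can be illustrated with a simple counting model. This is a sketch under stated assumptions, not the exact accounting behind Figure 1: it assumes an all-to-all heartbeat scheme in which every node sends one message to every other node per round, and a SWIM-style scheme in which every node probes one peer per protocol period with a ping and an acknowledgement. Indirect pings and the additional SWIM-NSM messages are ignored, and the function names are hypothetical.

```python
def heartbeat_messages(n: int) -> int:
    """Messages per round when each of n nodes sends a heartbeat to every other node."""
    return n * (n - 1)


def swim_messages(n: int, msgs_per_probe: int = 2) -> int:
    """Messages per protocol period when each of n nodes probes a single peer.

    msgs_per_probe=2 models one ping plus one acknowledgement (an assumption).
    """
    return n * msgs_per_probe


if __name__ == "__main__":
    for n in (4, 16, 64, 256):
        print(f"N={n:>3}  heartbeat={heartbeat_messages(n):>6}  swim={swim_messages(n):>4}")
```

Under these assumptions the heartbeat count grows as N(N-1) while the SWIM-style count grows as 2N, so the gap widens quickly as nodes are added.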
To adapt SWIM for monitoring purposes, SWIM-NSM introduces a third mechanism: metric collection. It adds a “forward acknowledge” message to distinguish between failed pings and successful indirect pings (Figure 2). This enables nodes to calculate round-trip times and derive link-level metrics such as latency, jitter, and packet-loss rate. These are three of the four key IETF-defined metrics for network characterization; bandwidth is excluded because measuring it is resource-intensive.

Fig. 2 SWIM-NSM ping interaction.
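As an illustration of how link-level metrics can be derived from probe results, the sketch below summarises a window of round-trip-time samples for one link. It rests on several assumptions that are not taken from the SWIM-NSM implementation: latency is reported as the mean round-trip time, jitter as the mean absolute difference between consecutive received samples, and a lost probe is represented as `None`. The `link_metrics` helper is hypothetical.

```python
from statistics import mean
from typing import Optional, Sequence


def link_metrics(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarise one link from a window of probe results.

    Each entry is a round-trip time in milliseconds, or None for a lost probe.
    """
    received = [r for r in rtts_ms if r is not None]
    lost = len(rtts_ms) - len(received)
    loss_rate = lost / len(rtts_ms) if rtts_ms else 0.0
    # Latency (assumption): mean round-trip time over the received probes.
    latency = mean(received) if received else None
    # Jitter (assumption): mean absolute difference between consecutive received RTTs.
    diffs = [abs(b - a) for a, b in zip(received, received[1:])]
    jitter = mean(diffs) if diffs else None
    return {"latency_ms": latency, "jitter_ms": jitter, "loss_rate": loss_rate}


if __name__ == "__main__":
    # Five probes, the third one lost.
    print(link_metrics([10.0, 12.0, None, 11.0, 15.0]))
```

Other definitions of jitter exist (for example, smoothed estimators), so the formula here should be read as one reasonable choice rather than the one SWIM-NSM uses.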
In summary, SWIM-NSM fulfills the critical requirements of Edge-to-Cloud network monitoring: distributed architecture, scalability, minimal resource consumption, and accurate metric collection. It provides a robust alternative to centralized approaches and enables informed orchestration decisions based on real-time network state.
[1] A. Orive, A. Agirre, J. Bilbao and M. Marcos, “Passive Network State Monitoring for Dynamic Resource Management in Industry 4.0 Fog Architectures,” 2018 IEEE 14th International Conference on Automation Science and Engineering (CASE), Munich, Germany, 2018, pp. 1414-1419, doi: 10.1109/COASE.2018.8560475.