

Barcelona Supercomputing Center Centro Nacional de Supercomputación



# Task-based Parallel Programming Models: The Convergence of High-Performance and Edge Computing Domains

Eduardo Quiñones {eduardo.quinones@bsc.es}



www.risingstars-project.eu

AMLE Summer School 2023 September 20, 2023



A distributed data-mining software platform for extreme data across the compute continuum

- PhD on Computer Science at Technical University of Catalonia (UPC) in 2009
- Team Leader of the Predictable Parallel Computing Research Group at Barcelona Supercomputing Center
  - Job positions available!! ;)
- Founder and CTO of TalpTech, a Startup company that provides edge computing solutions to precision agriculture



- 1. The need of parallel programming models: OpenMP
- 2. Modelling a real-time system with OpenMP

Agenda

- 3. Main Factors Impacting Parallel Execution
- 4. Runtime optimizacion for real-time systems

### **Heterogeneous and Parallel Computing**



Heterogeneous and Parallel computing becomes key to cope with performance requirements

#### HPC Domain (~300W)



**NVIDIA A100** (GPU-based)

(GPU-based)

Massively parallel systems that operate as fast as possible



New plot and data collected for 2010-2017 by K. Rupp











Intel<sup>®</sup> Xeon<sup>®</sup> Series (40-core)

AMD Instinct<sup>™</sup> MI



AMD EPYC<sup>™</sup> Series (up to 64-core)







Avionics

Automotive

Network of HW/SW components that **must** operate **correctly** in response to its inputs from both functional and non-functional perspectives

**Embedded Domain** (~10-20W)



**NVIDIA Jetson Family** (GPU-based)



Kalray MPPA Coolidge (80-core fabric)



Xilinx Versal (FPGA-based with DFX)

#### AMLE Summer School 2023

### **Heterogeneous and Parallel Computing**

# **Host-centric paradigm:** The parallel computation is orchestrated by the general-purpose multi-core



## The example of the collision detection



Used in Adavanced Driving Assistant System (ADAS) and autonomous vehicles to identify objects (perception) and detect potential collisions





## Heterogeneous and Parallel Computing

Performance: complex computations at high speed



**Real-time:** end-to-end response time within budget



**Power/Thermal:** energy/temperature within budget

**Safety:** guarantee correctness and integrity of operation



**Security:** prevent external elements from affecting correctness and integrity

### The SW Productivity Gap

- *Efficiently exploit parallelism* and achieve the required performance 1.
- *Reason* about the functional and non-functional correctness 2.



Source: ITRS & Hardware-dependent Software, Ecker et al., Springer

#### HPC Domain (~300W)



**NVIDIA A100** (GPU-based)



AMD Instinct<sup>™</sup> MI (GPU-based)

Intel<sup>®</sup> Xeon<sup>®</sup> Series (40-core)



AMD EPYC<sup>™</sup> Series (up to 64-core)



Xilinx Versal (FPGA-based with DFX)

#### Parallel programming models are key!



Like a duck to water!

#### Embedded Domain (~10-20W)



**NVIDIA Jetson Family** (GPU-based)

Kalray MPPA Coolidge

(80-core fabric)



#### **The Importance of Compute Continuum**

- Set of computing resources to handle the complete lifecycle chain of data collected across highly distributed and heterogeneous sources
  - Edge Computing to reduce data transmission latencies, minimize security risks and provide data privacy, and reduce energy consumption
  - HPC to support massive parallel processing capabilities and acceleration features
  - Cloud Computing to provide highly scalable storage systems and on-demand analytics technologies



Parallel Programming Model Support



A distributed data-mining software platform for extreme data across the compute continuum

www.extract-project.eu

#### The Importance of Compute Continuum

- Set of computing resources to handle the complete lifecycle chain of data collected across highly distributed and heterogeneous sources
  - Edge Computing to reduce data transmission latencies, minimize security risks and provide data privacy, and reduce energy consumption
  - HPC to support massive parallel processing capabilities and acceleration features
  - Cloud Computing to provide highly scalable storage systems and on-demand analytics technologies
     Nenufar radio telescopes



10

#### **Parallel Programming Models for productivity**

Parallel programming models ...

- expose parallelism in an easy way,
- *abstract* the complexities of the platform.

The objective is to provide **productivity**:

- <u>Programmability</u>. Simple yet flexible to define parallelism without considering architectural details
- <u>Portability</u>. Code is valid in different platforms
- <u>Performance</u>. Compiler and runtime mechanisms that exploit the performance of the platform



**Parallel Programming Models** 



### **Parallel Programming Models for productivity**



| ning Community Index<br>: www.tiobe.com                  | Model    | Base<br>Language    | Type of<br>PPM       | Type of<br>architect | Type of<br>Parallelism     |
|----------------------------------------------------------|----------|---------------------|----------------------|----------------------|----------------------------|
| <u>A</u>                                                 | CUDA     | C/C++,<br>Python    | HW-<br>centric       | NVIDIA GPU           | Struct/<br>Unstruct        |
| 1 Magoam                                                 | OpenCL   | C/C++               | App-<br>centric      | GPU/<br>FPGAs        | Struct                     |
|                                                          | OpenMP   | C/C++               | Parallel-<br>centric | Shared<br>mem        | Struct/<br><u>Unstruct</u> |
|                                                          | Pthreads | C/C++               | Parallel-<br>centric | Shared<br>mem        | Unstruct                   |
|                                                          | ΜΡΙ      | C/C++,<br>Python    | Parallel-<br>centric | Distributed mem      | Unstruct                   |
| a — C++ — C# — Visual Basic<br>— Assembly language — SQL | COMPSs   | C++, Java<br>Python | Parallel-<br>centric | Distributed mem      | <u>Unstruct</u>            |
|                                                          | Spark    | Java,<br>Python     | Parallel-<br>centric | Distributed mem      | Struct                     |
| Parallel programming models<br>supporting tasking        | Ray      | C++,Java<br>Python  | Parallel-<br>centric | Distributed<br>mem   | Unstruct                   |

AMLE Summer School 2023

#### Our proposal: OpenMP

- **Mature language** constantly reviewed (last release Nov 2021, v5.2)
  - De-facto industrial *standard* in HPC for shared-memory systems.
  - Active research community with an **increasing interest** on the *embedded domain*.
- Productivity
  - Performance
    - Support for different types of in-node parallelism and accelerator devices.
    - Performance analysis tools.
  - Portability
    - Supported by many chip vendors (Intel, IBM, ARM, NVIDIA, TI, Gaisler, Kalray).
  - Programmability
    - Interoperability with other programming models (e.g., CUDA, OpenCL).
    - Allows incremental parallelization and can be easily compiled sequentially.

### **OpenMP tasking model**



#### **OpenMP tasking model**

#### **Expressiveness**:

- Exposes *what* to do in parallel rather than *how* to do it
- The parallel framework orchestrates the execution

Support for different *types of parallelism*:

- regular patterns in the form of parallel loops
  taskloop construct
- Structured
- *Unstructured* irregular patterns that may change
   task construct and depend clauses

Computation is not fully controlled by the programmer but by the parallel framework

AMLE Summer School 2023



"I'm a software engineer, so I can confirm it works by magic."

# Modeling a RT system with OpenMP tasking



# Modeling a RT system with OpenMP tasking

• (Sub)system: set of concurrent tasks



- RT-Task (T<sub>x</sub>)
  - Recurrent: periodic (deadline/period), sporadic ---- Not supported
  - Priority
  - Preemption (non/limited/fully preemptive)
  - Fine-grain parallelism/heterogenous computation

### Time and event-based OpenMP tasks

#### Application-based control loop No OpenMP runtime support needed

```
#pragma omp parallel
#pragma omp single nowait
while(1)
  if(get time()%100) {
    #pragma omp task ...
    rt task 1();
  if(get time()%200) {
    #pragma omp task ...
    rt task N();
```

#### Runtime-based control loop\* OpenMP runtime support

```
#pragma omp parallel
#pragma omp single nowait
{
    #pragma omp task event(periodic:100)
    rt_task_1();
    #pragma omp task event(sporadic:event1)
    rt_task_N();
    #pragma omp task event(sporadic:event2)
    rt_task_N();
}
```

### Time and event-based OpenMP tasks

OpenMP runtime



#### Runtime-based control loop\* OpenMP runtime support

```
#pragma omp parallel
#pragma omp single nowait
{
    #pragma omp task event(periodic:100)
    rt_task_1();
    #pragma omp task event(sporadic:event1)
    rt_task_N();
    #pragma omp task event(sporadic:event2)
    rt_task_N();
```

### Automotive example



```
#pragma omp task event(periodic:33) New instance of the (persistent) task every 33ms
camera_processing();
omp_fulfill_event(camera_event);
... // Processing all sensors
#pragma omp task event(sporadic:camera_event) depend(out:objs)
data_association();
#pragma omp task depend(in:objs, out:obstacles) New instance of the tasks when objs and
tracking();
pragma omp task depend(in:obstacles)
collision checker();
```

## Modeling a real-time system

- (Sub)system: set of concurrent tasks
- RT-Task (T<sub>x</sub>)
  - Recurrent: periodic (deadline/period), sporadic
  - Priority
  - Preemption (non/limited/fully preemptive)
  - Fine-grain parallelism/heterogenous computation (nested parallelism)
    - Described as functionalities (Rx)
    - Execution time (WCET)
    - Accesses labels



Task Dependency Graph (TDG)

### Main Factors Impacting Parallel Execution: TDG

- Parallel structure of the application (including data usage): <u>Task</u>
   <u>Dependency Graph</u> (TDG)
- 2. The execution and memory model: The <u>Runtime Scheduler</u> responsible of mapping task to parallel units

```
#pragma omp task event(periodic:33)
{
    int x,y;
    #pragma omp task depend(out:x,y)
    f1(&x,&y);
    #pragma omp task depend(in:x)
    f2(x);
    #pragma omp target map(to:y) depend(in:y)
    f3(y);
    #pragma omp taskwait
}
```



#pragma omp task event(sporadic:object\_event)

...

Task Dependency Graph (TDG)

A representation of the parallel nature of a given OpenMP region, extracted by means of compilation and runtime methods <sup>1</sup>

- Includes all the information for functional and non-funcional correctness
  - Parallel units and synchronization dependencies
  - Liveness analysis of variables and datasharings involved in the parallel execution
- Independent from the targeted parallel platform (but can include HW dependent information)
  - Execution characterisation of parallel units (e.g., time, energy, memory behaviour)

```
#pragma omp task event(periodic:33)
```

```
int x,y;
#pragma omp task depend(out:x,y) // T1
f1(&x,&y);
#pragma omp task depend(in:x) // T2
f2(x);
#pragma omp target map(to:y) depend(in:y) //T3
f3(y);
#pragma omp taskwait
```



<sup>1</sup>Supported by LLVM

#### Time behavior of OpenMP tasks

# Timing behaviour depends on the **mapping** between **parallel units** to **computing resources**

In the scope of OpenMP:

- Parallel structure of the application
   ✓ TDG
- 2. Scheduler(s) responsible of mapping OpenMP tasks to cores/accelerators
  - ✓ Fix OpenMP threads to HW threads: OMP\_PLACES, OMP\_PROC\_BIND
  - ✓ Fix tasks to threads: *tied* tasks



scheduling decisions

#### Time predictability

The execution time of a TDG is determined by:

- 1. Execution of OpenMP tasks within the critical path
- 2. Interferences of the rest of OpenMP tasks on the critical path
- 3. Interferences on HW/SW resources with other applications





interference critical path tasks



#### **Preemption strategy**

There exist three preemption strategies:



OpenMP tasks define a limited preemption scheduling strategy

- Preemptions occur at predefined points (a.k.a. *task scheduling points*)
- A priority level can be set to OpenMP tasks

The implementation of the runtime scheduler is implementation-defined

### **OpenMP Tasking Model Overhead**



## Leveraging the Task Dependency Graph

#pragma omp taskgraph
for (int i=0; i<nTasks; ++i) {
 #pragma omp task depend(out:a[i%nCores])
 cpu\_bound\_fn();</pre>



#### The taskgraph

- Reduces/eliminates overhead due to task orchestration and dependency resolution
- Used to instantiate and orchestrate tasks



### **TDG-driven framework**



Chenle, Y, Royuela, S, and Quiñones, E. Enhancing OpenMP Tasking Model: Performance and Portability, In International Workshop on OpenMP (IWOMP). 2021. Chenle, Y, Royuela, S, and Quiñones, E. Taskgraph: A Low Contention OpenMP Tasking Framework, In <a href="https://arxiv.org/abs/2212.04771">https://arxiv.org/abs/2212.04771</a>. 2022.

AIVILE JUITITIEL JUITUUL ZUZJ

#### **TDG-driven framework**



Chenle, Y, Royuela, S, and Quiñones, E. Enhancing OpenMP Tasking Model: Performance and Portability, In International Workshop on OpenMP (IWOMP). 2021. Chenle, Y, Royuela, S, and Quiñones, E. Taskgraph: A Low Contention OpenMP Tasking Framework, In <a href="https://arxiv.org/abs/2212.04771">https://arxiv.org/abs/2212.04771</a>. 2022.

AIVILE JUITITIEL JUITUUL ZUZJ

#### **Optimizing Parallel Execution using the TDG**



AMLE Summer School 2023

#### **Can tasking replace threading?**







Overhead of Taskgraph when replacing original for with taskloop, with OMP\_NUM\_THREADS=48 (lower is better).

 $\rightarrow$  Penalization when number of iterations is low (small problem size)

ightarrow Speedup close to optimal when recording is amortized

#### Interoperability with low-level libraries



Chenle, Y, Royuela, S, and Quiñones, E. OpenMP to CUDA graphs: a compiler-based transformation to enhance the programmability of NVIDIA devices, In SCOPES. 2020.

#### Interoperability with low-level libraries



34

#### Goals:

✓ Enhance *performance* (NVIDIA devices)

saxpy

✓ Maintain programmability

#### **Results:**

 OpenMP synchronizations take longer than CUDA graphs



cholesky

# Taskgraph for Adaptive Optics



#### Simplified code:





**Cyril Cetre BSC-THALES RISING Stars Secondment** 

AMLE Summer School 2023

### What happens with the OpenMP tasking model?

The specification clearly moves from threading to tasking (host and device):

| Parallel<br>loops      | taskyield<br>final, mergeable |             | <b>taskloor</b><br>priority    | inoutset<br>grainsize/num_tasks<br>taskwait nowait |                                                        |             |             |
|------------------------|-------------------------------|-------------|--------------------------------|----------------------------------------------------|--------------------------------------------------------|-------------|-------------|
|                        | Tasking                       | ta          | Accelerators<br>askgroup, depe |                                                    | task reductions<br>affinity, detach<br>taskwait depend |             | ??          |
| <sup>1.0</sup><br>1997 | 3.0<br>2008                   | 3.1<br>2011 | 4.0<br>2013                    | 4.5<br><b>2015</b>                                 | 5.0<br><b>2018</b>                                     | 5.1<br>2020 | 6.0<br>2023 |

**Proposal**: a new directive to expose a region that can be represented as a TDG and event-driven model

#pragma omp taskgraph
#pragma omp task event...

#### **Interoperability with DSMLs**



#### Model Driven Engineering (MDE)

- 1. Construction of complex systems
- 2. Formal verification of FR and NFR with composability
- 3. Correct-by-construction paradigm thourgh code generation
  - Suitable for single-core execution or very limited multi-core support

Gap between the MDE used for CPS and the PPM supported by parallel platforms



#### **Parallel Programming Models**

- 1. Mandatory for SW productivity
  - Programmability: Parallel abstraction hiding HW complexities
  - Portability: Compatibility multiple HW platforms
  - Performance: Efficient exploitation of HW parallel capabilities
- 2. Efficient offloading to HW acceleration devices

### Interoperability with DSMLs: Automotive

- 1. Exploit parallelism within OpenMP (host and target) tasks
- 2. Exploit heterogeneity through specializations



#### Home-take message

- 1. Real-time systems <u>requires parallel computation</u> to cope with the performance requirements of the most advanced functionalities, and...
- ... current task-based parallel programming models allows to <u>reasoning</u> <u>about functional correctness and time predictability</u> while removing from developers the responsibility of managing the complexity of parallel execution
- **3.** <u>**OpenMP**</u> provides the level of productivity required while allowing reasoning about the functional and non-functional requirements across the <u>compute continuum</u>



Barcelona Supercomputing Center Centro Nacional de Supercomputación



# OpenMP in the scope of embedded systems

Eduardo Quiñones Head of the Predictable Parallel Computing Research Group {eduardo.quinones@bsc.es}

> Journée thématique GdR SOC2 – IRT Saint Exupéry : « Calcul haute performance pour les systèmes embarqués »

> > March 14, 2023