Multiprocessing
Tutorial by Jonas Wilfert, Tobias Thummerer
🚧 This tutorial is under revision and will be replaced by an up-to-date version soon 🚧
License
# Copyright (c) 2021 Tobias Thummerer, Lars Mikelsons, Josef Kircher, Johannes Stoljar, Jonas Wilfert
# Licensed under the MIT license.
# See LICENSE (https://github.com/thummeto/FMI.jl/blob/main/LICENSE) file in the project root for details.
Motivation
FMI.jl is a Julia package for working with simulation models in Julia. It implements the FMI specification. FMI (Functional Mock-up Interface) is a free standard (fmi-standard.org) that defines a container and an interface to exchange dynamic models using a combination of XML files, binaries and C code zipped into a single file. Users can thus work with simulation models in the form of an FMU (Functional Mock-up Unit). Besides loading the FMU, users can also set values for parameters and states and simulate the FMU in both co-simulation and model-exchange mode.
Introduction to the example
This example shows how to parallelize the computation of an FMU in FMI.jl. A batch of FMU evaluations with different initial settings can be computed in parallel, using either multithreading or multiprocessing. This example covers multiprocessing; see multithreading.ipynb for multithreading. The advantages of multithreading are a lower communication overhead and lower RAM usage. However, in some cases multiprocessing can be faster because the garbage collector is not shared between processes.
The model used is a one-dimensional spring pendulum with friction. The object-oriented structure of the SpringFrictionPendulum1D can be seen in the following graphic.
Target group
The example is primarily intended for users who work in the field of simulation. It aims to show how simple it is to use FMUs in Julia.
Other formats
Besides this Jupyter Notebook, there is also a Julia file with the same name, which contains only the code cells. For the documentation, there is a Markdown file corresponding to the notebook.
Getting started
Installation prerequisites
| | Description | Command | Alternative |
|---|---|---|---|
| 1. | Enter Package Manager via | ] | |
| 2. | Install FMI via | add FMI | add "https://github.com/ThummeTo/FMI.jl" |
| 3. | Install FMIZoo via | add FMIZoo | add "https://github.com/ThummeTo/FMIZoo.jl" |
| 4. | Install FMICore via | add FMICore | add "https://github.com/ThummeTo/FMICore.jl" |
| 5. | Install BenchmarkTools via | add BenchmarkTools | |
| 6. | Install DifferentialEquations via | add DifferentialEquations | |
Code section
Adding the desired number of processes:
using Distributed
n_procs = 2
addprocs(n_procs; exeflags=`--project=$(Base.active_project()) --threads=auto`, restrict=false)
2-element Vector{Int64}:
2
3
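When the cell above is re-run in the same session, each call to addprocs starts additional workers. The following is a minimal sketch of a guard against that, assuming that two workers are wanted in total. It uses plain addprocs(n_procs) for brevity; in the tutorial setting, the exeflags shown above should still be passed so that the workers use the same project.

```julia
using Distributed

n_procs = 2  # same value as in the tutorial

# nprocs() == 1 means that only the main process exists, i.e. no workers yet.
if nprocs() == 1
    addprocs(n_procs)
end

println("processes: ", nprocs(), ", workers: ", nworkers())
```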
To run the example, the previously installed packages must be loaded on all processes.
# imports
@everywhere using FMI
@everywhere using FMIZoo
@everywhere using DifferentialEquations
@everywhere using BenchmarkTools
Checking that the workers have been correctly initialized:
workers()
@everywhere println("Hello World!")
# The following lines can be uncommented for more advanced information about the subprocesses
# @everywhere println(pwd())
# @everywhere println(Base.active_project())
# @everywhere println(gethostname())
# @everywhere println(VERSION)
# @everywhere println(Threads.nthreads())
Hello World!
From worker 2: Hello World!
From worker 3: Hello World!
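Instead of printing on each worker, the same information can be fetched back to the main process. The sketch below is one possible way to do this with remotecall_fetch; the variable name info is chosen for illustration only, and plain addprocs(2) is an assumption made to keep the snippet self-contained.

```julia
using Distributed

# Start two workers if none exist yet (assumption for a self-contained example).
nprocs() == 1 && addprocs(2)

# Ask every worker for its process id and its number of threads.
info = Dict(w => remotecall_fetch(() -> (myid(), Threads.nthreads()), w) for w in workers())

for (w, (id, nthreads)) in info
    println("worker ", w, " reports id ", id, " with ", nthreads, " thread(s)")
end
```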
Simulation setup
Next, the batch size and input values are defined.
# Best if batchSize is a multiple of the threads/cores
batchSize = 16
# Define an array of arrays randomly
input_values = collect(collect.(eachrow(rand(batchSize,2))))
16-element Vector{Vector{Float64}}:
[0.8851952085263832, 0.7179804161486827]
[0.21425729251828196, 0.7576128182635197]
[0.4724078792226447, 0.8182634736074463]
[0.010492763911591374, 0.21866745899772355]
[0.0319997665874483, 0.9603455116034896]
[0.2830691148013105, 0.3182558905150308]
[0.7637920693656779, 0.8840179208204414]
[0.18523320899257345, 0.2812224712629118]
[0.323359803003088, 0.23924536562916765]
[0.054141162143514165, 0.04931045779091259]
[0.4405929299883178, 0.9010355502374826]
[0.48223847814394183, 0.7783966395253015]
[0.09262116112179952, 0.9583484202189898]
[0.6434037978280975, 0.4674983425354752]
[0.20085456576492988, 0.6861333179540255]
[0.5388425380447254, 0.6133056808971772]
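Because rand draws new numbers on every run, the values above differ between sessions. If reproducible inputs are wanted, a seeded random number generator can be used. This is a sketch under that assumption; the seed 1234 is arbitrary. Each inner vector has two entries and is later passed as the initial state x0 of one simulation.

```julia
using Random

batchSize = 16
rng = MersenneTwister(1234)  # arbitrary seed, chosen for illustration

# One row per batch element, converted into a vector of 2-element vectors.
input_values = collect(collect.(eachrow(rand(rng, batchSize, 2))))

println(length(input_values), " input vectors of length ", length(first(input_values)))
```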
Shared Module
For Distributed we need to embed the FMU into its own module. This prevents Distributed from trying to serialize and send the FMU over the network, as this can cause issues. This module needs to be made available on all processes using @everywhere.
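@everywhere module SharedModule
    using FMIZoo
    using FMI
    # simulation time span and the points in time at which results are saved
    t_start = 0.0
    t_step = 0.1
    t_stop = 10.0
    tspan = (t_start, t_stop)
    tData = collect(t_start:t_step:t_stop)
    # every process loads its own FMU (model exchange) when the module is evaluated
    model_fmu = loadFMU("SpringPendulum1D", "Dymola", "2022x"; type=:ME)
end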
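The pattern behind SharedModule can be illustrated without an FMU. In the sketch below, the module DemoShared and the function whoOwns are hypothetical stand-ins: the module-level value owner is created separately on every process, just as each process loads its own FMU above, and nothing but the small input and result values travels between processes.

```julia
using Distributed

# Start two workers if none exist yet (assumption for a self-contained example).
nprocs() == 1 && addprocs(2)

@everywhere module DemoShared
    using Distributed
    # Stand-in for a resource that should not be serialized:
    # evaluated once per process, so each process holds its own value.
    owner = myid()
end

# The function only refers to the module by name; the resource itself is never sent.
@everywhere whoOwns(x) = (DemoShared.owner, x)

results = pmap(whoOwns, 1:4)
println(results)
```

The first entry of each tuple is the id of the worker that handled the input, which shows that every call used the copy that lives on its own process.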
We define a helper function that simulates the FMU and combines the resulting states into a matrix.
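@everywhere function runCalcFormatted(fmu, x0, recordValues=["mass.s", "mass.v"])
    # simulate in model-exchange mode, starting from the initial state x0
    data = simulateME(fmu, SharedModule.tspan; recordValues=recordValues, saveat=SharedModule.tData, x0=x0, showProgress=false, dtmax=1e-4)
    # concatenate the saved state vectors column-wise into a matrix
    return reduce(hcat, data.states.u)
end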
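The last line of the helper turns a vector of state vectors into a matrix with one column per saved time point. A minimal sketch with made-up numbers (the values are illustrative and not simulation output):

```julia
# Three saved time points, each with two states (e.g. position and velocity).
states = [[1.0, 0.0], [0.9, -0.5], [0.6, -0.8]]

# hcat places each vector in its own column, giving a 2×3 matrix.
M = reduce(hcat, states)

println(size(M))   # (2, 3)
println(M[1, :])   # first state over time: [1.0, 0.9, 0.6]
```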
A single evaluation is comparatively short, so its run time is best measured with BenchmarkTools.
@benchmark data = runCalcFormatted(SharedModule.model_fmu, rand(2))
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  3.213 s …  3.222 s  ┊ GC (min … max): 0.37% … 0.74%
 Time  (median):     3.218 s             ┊ GC (median):    0.56%
 Time  (mean ± σ):   3.218 s ± 6.245 ms  ┊ GC (mean ± σ):  0.56% ± 0.27%
  █                                                       █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  3.21 s         Histogram: frequency by time         3.22 s <
 Memory estimate: 300.75 MiB, allocs estimate: 7602418.
Single Threaded Batch Execution
To compute a batch, we collect multiple evaluations. In a single-threaded context, the same FMU can be used for every call.
println("Single Threaded")
@benchmark collect(runCalcFormatted(SharedModule.model_fmu, i) for i in input_values)
Single Threaded
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 51.376 s (0.43% GC) to evaluate,
 with a memory estimate of 4.70 GiB, over 121638676 allocations.
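Multiprocessing Batch Execution
In a multiprocessing context, each process has to work with its own FMU, as FMUs are not thread-safe and should not be sent between processes. SharedModule takes care of this, because it is loaded on every process. To distribute the execution of a function over multiple processes, the function pmap can be used.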
println("Multi Threaded")
@benchmark pmap(i -> runCalcFormatted(SharedModule.model_fmu, i), input_values)
Multi Threaded
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 30.595 s (0.00% GC) to evaluate,
 with a memory estimate of 99.89 KiB, over 1606 allocations.
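The behaviour of pmap can be tried out without an FMU. In the sketch below, slowSquare is a hypothetical stand-in for an expensive simulation call. It shows two points: the function has to be defined on all processes with @everywhere, and pmap returns the results in the order of the inputs, regardless of which worker finished first.

```julia
using Distributed

# Start two workers if none exist yet (assumption for a self-contained example).
nprocs() == 1 && addprocs(2)

# Stand-in for an expensive call; must be known on every process.
@everywhere slowSquare(x) = (sleep(0.1); x^2)

inputs = collect(1:8)
serial = map(slowSquare, inputs)     # runs on the main process
parallel = pmap(slowSquare, inputs)  # distributed over the workers

println(parallel == serial)
```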
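As you can see, there is a noticeable speed-up in execution time: the batch took 51.4 s on a single process and 30.6 s with pmap on two workers, a factor of about 1.7. Note that the memory estimate reported for pmap only covers the main process; allocations on the workers are not counted. However, the speed-up is often much smaller than n_procs (or the number of physical cores of your CPU), for several reasons. As a rule of thumb, the speed-up should be around n/2 on an n-core processor with n Julia processes.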
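The speed-up and the parallel efficiency for the run above can be computed directly from the two reported timings. The variable names in this sketch are chosen for illustration; the numbers are the benchmark results shown in this tutorial.

```julia
t_serial = 51.376    # seconds, single-process batch
t_parallel = 30.595  # seconds, pmap with two workers
n_workers = 2

speedup = t_serial / t_parallel
efficiency = speedup / n_workers  # 1.0 would be ideal scaling

println("speed-up:   ", round(speedup; digits=2))     # 1.68
println("efficiency: ", round(efficiency; digits=2))  # 0.84
```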
Unload FMU
After calculating the data, the FMU is unloaded on every process and all unpacked data on disk is removed.
@everywhere unloadFMU(SharedModule.model_fmu)
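Once the FMUs are unloaded, the worker processes themselves can be shut down as well if they are no longer needed. This step is not part of the original example; it is a sketch using rmprocs from Distributed, with plain addprocs(2) added only to keep the snippet self-contained.

```julia
using Distributed

# Start two workers if none exist yet (assumption for a self-contained example).
nprocs() == 1 && addprocs(2)

# Remove all workers; afterwards only the main process remains.
rmprocs(workers())

println("remaining processes: ", nprocs())
```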
Summary
This tutorial showed how multiprocessing with Distributed.jl can be used to improve the performance of calculating a batch of FMU simulations.