Multiprocessing

Tutorial by Jonas Wilfert, Tobias Thummerer

🚧 This tutorial is under revision and will be replaced by an up-to-date version soon 🚧

License

# Copyright (c) 2021 Tobias Thummerer, Lars Mikelsons, Josef Kircher, Johannes Stoljar, Jonas Wilfert
# Licensed under the MIT license. 
# See LICENSE (https://github.com/thummeto/FMI.jl/blob/main/LICENSE) file in the project root for details.

FMI.jl is a Julia package motivated by the use of simulation models in Julia; it implements the FMI specification. FMI (Functional Mock-up Interface) is a free standard (fmi-standard.org) that defines a container and an interface to exchange dynamic models using a combination of XML files, binaries and C code zipped into a single file. Users can thus work with simulation models in the form of an FMU (Functional Mock-up Unit). Besides loading the FMU, the user can also set values for parameters and states and simulate the FMU both as co-simulation and as model exchange.

Introduction to the example

This example shows how to parallelize the computation of an FMU in FMI.jl. A batch of FMU evaluations with different initial settings can be computed in parallel, using either multithreading or multiprocessing. This example shows multiprocessing; see multithreading.ipynb for multithreading. The advantages of multithreading are a lower communication overhead and lower RAM usage. However, in some cases multiprocessing can be faster because the garbage collector is not shared between processes.

The model used is a one-dimensional spring pendulum with friction. The object-oriented structure of the SpringFrictionPendulum1D can be seen in the following graphic.

[Figure: structure of the SpringFrictionPendulum1D model]

Target group

The example is primarily intended for users who work in the field of simulation. It aims to show how simple it is to use FMUs in Julia.

Other formats

Besides this Jupyter Notebook, there is also a Julia file with the same name that contains only the code cells. For the documentation, there is a Markdown file corresponding to the notebook.

Getting started

Installation prerequisites

	Description	Command	Alternative
1.	Enter Package Manager via	]
2.	Install FMI via	add FMI	add "https://github.com/ThummeTo/FMI.jl"
3.	Install FMIZoo via	add FMIZoo	add "https://github.com/ThummeTo/FMIZoo.jl"
4.	Install FMICore via	add FMICore	add "https://github.com/ThummeTo/FMICore.jl"
5.	Install BenchmarkTools via	add BenchmarkTools
6.	Install DifferentialEquations via	add DifferentialEquations

Code section

Add the desired number of worker processes:

using Distributed
n_procs = 2
addprocs(n_procs; exeflags=`--project=$(Base.active_project()) --threads=auto`, restrict=false)

2-element Vector{Int64}:
 2
 3
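When a notebook cell that calls addprocs is run repeatedly, every run starts additional workers. The following standalone sketch shows one way to top up the worker count only when needed. It uses only the Distributed standard library and leaves out the exeflags shown above for brevity; the variable name missing_workers is chosen here for illustration.

```julia
using Distributed

n_procs = 2   # desired number of worker processes

# nprocs() counts the main process too, so a fresh session reports 1.
missing_workers = n_procs - (nprocs() - 1)
if missing_workers > 0
    addprocs(missing_workers)
end

println("worker ids: ", workers())
println("processes in total: ", nprocs())
```

Running the block a second time in the same session leaves the number of workers unchanged.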

To run the example, the previously installed packages must be loaded on all processes.

# imports
@everywhere using FMI
@everywhere using FMIZoo
@everywhere using DifferentialEquations
@everywhere using BenchmarkTools

Check that the workers have been initialized correctly:

workers()

@everywhere println("Hello World!")

# The following lines can be uncommented for more advanced information about the subprocesses
# @everywhere println(pwd())
# @everywhere println(Base.active_project())
# @everywhere println(gethostname())
# @everywhere println(VERSION)
# @everywhere println(Threads.nthreads())

Hello World!


      From worker 2:	Hello World!
      From worker 3:	Hello World!
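Instead of printing from every worker, the same information can be fetched back to the main process. The sketch below is self-contained and assumes that it may start two workers itself if none exist yet; it queries each worker for its id and thread count with remotecall_fetch.

```julia
using Distributed

# Assumption: start two workers if this session has none yet.
nprocs() == 1 && addprocs(2)

for w in workers()
    id = remotecall_fetch(myid, w)                     # runs myid() on worker w
    nthreads = remotecall_fetch(Threads.nthreads, w)   # thread count of worker w
    println("worker ", id, " runs with ", nthreads, " thread(s)")
end
```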

Simulation setup

Next, the batch size and input values are defined.


# Best if batchSize is a multiple of the threads/cores
batchSize = 16

# Define an array of arrays randomly
input_values = collect(collect.(eachrow(rand(batchSize,2))))

16-element Vector{Vector{Float64}}:
 [0.3488158160031857, 0.8863705259285821]
 [0.170547352997095, 0.3281963331996949]
 [0.1779082928828457, 0.7380807247238945]
 [0.8667426969591703, 0.6343504086309357]
 [0.6644077432159878, 0.7558507882310342]
 [0.6368908597953518, 0.18847597374357894]
 [0.07292435433880418, 0.9149624556569568]
 [0.7416541040647278, 0.4212059752411341]
 [0.3831847871996975, 0.46817375137726114]
 [0.22133782357965537, 0.7416324041960253]
 [0.1165839496664921, 0.33324714259730304]
 [0.5090345283774405, 0.28660977963691836]
 [0.3901250846672749, 0.41795654954932615]
 [0.9692257350505836, 0.5507960512326558]
 [0.7494671079286187, 0.8064768886896903]
 [0.5157010701675325, 0.39515754938769065]
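To make the shape of input_values easier to see, here is a small sketch with a fixed matrix in place of rand(batchSize, 2). Each row becomes its own vector, which later serves as one initial state x0.

```julia
# A small, fixed matrix stands in for rand(batchSize, 2).
m = [1.0 2.0; 3.0 4.0; 5.0 6.0]

# eachrow yields row views; collect turns each view into its own Vector{Float64}.
rows = collect(collect.(eachrow(m)))

println(rows)           # [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
println(length(rows))   # 3, one entry per row of m
```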

Shared Module

For Distributed, we need to embed the FMU into its own module. This prevents Distributed from trying to serialize the FMU and send it over the network, which can cause issues. The module must be made available on all processes using @everywhere.

@everywhere module SharedModule
    using FMIZoo
    using FMI

    t_start = 0.0
    t_step = 0.1
    t_stop = 10.0
    tspan = (t_start, t_stop)
    tData = collect(t_start:t_step:t_stop)

    model_fmu = loadFMU("SpringPendulum1D", "Dymola", "2022x"; type=:ME)
end
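The same pattern can be tried without an FMU. In the following sketch, a hypothetical module SharedDemo holds a lookup table as a stand-in for the loaded FMU. Because the module is defined on every process, the function passed to pmap only refers to it by name, and each worker reads its own local copy.

```julia
using Distributed

# Assumption: start two workers if this session has none yet.
nprocs() == 1 && addprocs(2)

# Hypothetical stand-in for the FMU: a table that each process builds locally.
@everywhere module SharedDemo
    const table = [i^2 for i in 1:10]
    lookup(i) = table[i]
end

# Only the index i travels to the workers; the table itself is never sent.
results = pmap(i -> SharedDemo.lookup(i), 1:10)
println(results)
```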

We define a helper function that simulates the FMU and combines the resulting states into a Matrix.

@everywhere function runCalcFormatted(fmu, x0, recordValues=["mass.s", "mass.v"])
    data = simulateME(fmu, SharedModule.tspan; recordValues=recordValues, saveat=SharedModule.tData, x0=x0, showProgress=false, dtmax=1e-4)
    return reduce(hcat, data.states.u)
end
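The last line of the helper turns a vector of state vectors into a matrix. A minimal sketch of that step, with three made-up state vectors standing in for data.states.u:

```julia
# Hypothetical saved states [position, velocity] at three time steps.
u = [[0.5, 0.0], [0.4, -1.0], [0.1, -2.0]]

# hcat places each state vector in its own column:
# one row per state, one column per time step.
states = reduce(hcat, u)

println(size(states))    # (2, 3)
println(states[1, :])    # [0.5, 0.4, 0.1], the positions over time
```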

A single evaluation is fairly quick, so its run time is best measured with BenchmarkTools.

@benchmark data = runCalcFormatted(SharedModule.model_fmu, rand(2))

BenchmarkTools.Trial: 3 samples with 1 evaluation per sample.
 Range (min … max):  2.483 s …  2.495 s  ┊ GC (min … max): 0.58% … 0.97%
 Time  (median):     2.491 s             ┊ GC (median):    0.57%
 Time  (mean ± σ):   2.490 s ± 6.406 ms  ┊ GC (mean ± σ):  0.68% ± 0.26%

  █                                     █                █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2.48 s        Histogram: frequency by time         2.5 s <

 Memory estimate: 300.74 MiB, allocs estimate: 7802418.

Single Threaded Batch Execution

To compute a batch, we collect multiple evaluations. In a single-threaded context, we can use the same FMU for every call.

println("Single Threaded")
@benchmark collect(runCalcFormatted(SharedModule.model_fmu, i) for i in input_values)

Single Threaded

BenchmarkTools.Trial: 1 sample with 1 evaluation per sample.
 Single result which took 39.684 s (0.59% GC) to evaluate,
 with a memory estimate of 4.70 GiB, over 124838676 allocations.

Multiprocessing Batch Execution

In a multiprocessing context, every process needs its own FMU, because an FMU cannot be shared between processes; the shared module defined above provides it. To spread the execution of a function across multiple processes, the function pmap can be used.

println("Multi Threaded")
@benchmark pmap(i -> runCalcFormatted(SharedModule.model_fmu, i), input_values)

Multi Threaded

BenchmarkTools.Trial: 1 sample with 1 evaluation per sample.
 Single result which took 23.573 s (0.00% GC) to evaluate,
 with a memory estimate of 96.11 KiB, over 1472 allocations.
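The difference between map and pmap can be reproduced without an FMU. In the sketch below, a hypothetical function fake_simulation sleeps briefly in place of a real simulation run. The timings depend on the machine and include one-time compilation on the workers, so no expected values are given.

```julia
using Distributed

# Assumption: start two workers if this session has none yet.
nprocs() == 1 && addprocs(2)

# Hypothetical stand-in for one FMU simulation: wait briefly, then return a value.
@everywhere function fake_simulation(x0)
    sleep(0.2)
    return sum(x0)
end

inputs = [[i, 2.0 * i] for i in 1:8]

t0 = time()
serial = map(fake_simulation, inputs)      # all evaluations on the main process
t_serial = time() - t0

t0 = time()
parallel = pmap(fake_simulation, inputs)   # evaluations distributed over the workers
t_parallel = time() - t0

println("results identical: ", serial == parallel)
println("serial:   ", round(t_serial; digits=2), " s")
println("parallel: ", round(t_parallel; digits=2), " s")
```

pmap returns the results in the order of the inputs, so the two result vectors can be compared directly.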

As you can see, the batch runs significantly faster (23.6 s instead of 39.7 s with two worker processes). However, the speed-up is often much smaller than n_procs (or the number of physical cores of your CPU). There are several reasons for this, such as the communication overhead between processes. As a rule of thumb, expect a speed-up of around n/2 on an n-core processor with n Julia processes.
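For the two benchmark results shown above, the speed-up and the parallel efficiency (speed-up divided by the number of worker processes) can be computed as follows. The numbers come from a single sample each, so they are only a rough indication.

```julia
t_single = 39.684   # seconds, single-process batch from the benchmark above
t_multi  = 23.573   # seconds, pmap batch from the benchmark above
n_procs  = 2        # number of worker processes used

speedup = t_single / t_multi
efficiency = speedup / n_procs

println("speed-up:   ", round(speedup; digits=2))      # 1.68
println("efficiency: ", round(efficiency; digits=2))   # 0.84
```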

Unload FMU

After calculating the data, the FMU is unloaded on every process and all unpacked data on disk is removed.

@everywhere unloadFMU(SharedModule.model_fmu)
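Once the FMU is unloaded, the worker processes themselves can be released as well. This step is not part of the original example; the following is a minimal sketch using rmprocs from the Distributed standard library. It starts its own workers so that it can run on its own.

```julia
using Distributed

# Assumption: start two workers if this session has none yet.
nprocs() == 1 && addprocs(2)

rmprocs(workers())   # shut down all worker processes and wait for them to exit

println("processes left: ", nprocs())   # only the main process remains
```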

Summary

This tutorial showed how multiprocessing with Distributed.jl can be used to improve the performance of evaluating a batch of FMU simulations.