
11c. Task-Based Parallelism: @spawn


Introduction

The previous section explained the basics of multithreading. In particular, we showed that operations can be computed either sequentially (Julia's default) or concurrently. The latter approach enables multiple operations to be processed simultaneously, with each operation starting as soon as a thread becomes available. When a Julia session is initialized with more than one thread, this implies that computations can be executed in parallel.

This section will focus on Julia's native multithreading mechanisms, a topic that will span several sections. Our primary goal here is to demonstrate how to write multithreaded code, rather than exploring how and when to apply the technique.

We've deliberately structured the explanation in this way to smooth subsequent discussions. However, a crucial caveat is necessary at this point: while multithreading can offer significant performance advantages, it's not applicable in all scenarios. In particular, multithreading demands extreme caution in handling dependencies between operations, as mismanagement can lead to silent but catastrophic bugs. We'll defer the topic of thread-unsafe operations for now, as identifying them presupposes a basic understanding of parallelism techniques.

Enabling Multithreading

Julia initializes every session with a fixed pool of threads. Each of these threads can execute its own stream of instructions, so the number of threads caps how many operations the CPU can handle simultaneously.

By default, Julia operates with a single thread, so enabling multithreading requires explicitly setting a different number of threads. In VSCode or VSCodium, you can do this by going to File > Preferences > Settings and then searching for the keyword threads, which brings up the corresponding setting.

After pressing Edit in settings.json, you should add the line "julia.NumThreads": "auto". The value "auto" automatically determines the number of threads based on your computer's hardware (the logical or physical threads available). Notice that the change won't take effect in the current session.

To check whether the change has taken effect, use the command Threads.nthreads(), which displays the number of threads available in the session. Any number greater than one indicates that multithreading is enabled. Notice also that the change is persistent, so every new Julia session will start with the number of threads specified.
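If you launch Julia from a terminal instead of VSCode, the same result can be obtained with the --threads (or -t) startup flag; this is just an alternative entry point, with "auto" again delegating the choice to Julia.

julia --threads 4       # start a session with 4 threads
julia --threads auto    # let Julia pick a number based on your hardware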

Once we have a session with more than one thread, there are several packages for performing multithreaded computations. The focus of this section will be on the built-in module Threads, which is automatically loaded when you start Julia.

# module Threads is automatically loaded when you start Julia

Threads.nthreads()
Output in REPL
2
using Base.Threads      # or `using .Threads`

nthreads()
Output in REPL
2

Warning! - Loaded Package
All the scripts below assume that you've executed the line using Base.Threads. Furthermore, all the examples are based on a session started with two threads.

Task-Based Parallelism: @spawn

The first approach we'll cover is implemented through the macro @spawn, which streamlines the application of the previous section's techniques. Specifically, by prepending any operation with @spawn, we create a (non-sticky) task that's immediately scheduled for execution. Recall that, once a task is scheduled, it'll immediately start its computation if a thread is available.
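As a minimal sketch of this behavior, the snippet below spawns a single task and inspects it (the exact moment the task finishes depends on thread availability):

using Base.Threads

task = @spawn sum(rand(1_000_000))   # a non-sticky task, scheduled immediately

typeof(task)         # Task: @spawn returns the task itself, not its result
istaskdone(task)     # false while the computation runs, true once it finishes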

Unlike other approaches we'll present, @spawn requires explicitly instructing Julia to wait for the task to complete. How to do this depends on the nature of the output. For tasks that perform a computation and return an output, we have the function fetch, which waits for the task's calculations to finish and then returns its result. Since parallel computation involves spawning multiple tasks, fetch is typically broadcast over a collection containing all the spawned tasks.

In the following, we illustrate fetch with two spawned tasks that return vectors as their output.

x = rand(10)

function foo(x)
    a = x .* -2
    b = x .*  2

    a, b
end
x = rand(10)

function foo(x)
    task_a = @spawn x .* -2   # schedule each computation as a separate task
    task_b = @spawn x .*  2

    a, b = fetch.((task_a, task_b))   # wait for both tasks and collect their outputs
end

It's important to distinguish between task_a and a: while a refers to the vector created (i.e., the task's output), task_a denotes the task that creates it. The distinction is essential since fetch must be applied to the task, not the result: it's the task that fetch waits on and whose output it returns.
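To make the distinction concrete, the following minimal sketch applies fetch to a single spawned task:

task_a = @spawn rand(5) .* -2    # task_a is a Task, not a vector

typeof(task_a)          # Task
typeof(fetch(task_a))   # Vector{Float64}: fetch waits and returns the task's output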

Alternatively, for operations that don't return any output, we can use either the function wait or the macro @sync. The function wait is applied just like fetch. The macro @sync, instead, requires wrapping all the operations to be synchronized in a block delimited by the keywords begin and end: it then waits for every task spawned inside that block.

For the demonstration, let's consider a mutating function. Mutating functions are a natural example, since they modify the values of a collection in place, without returning any meaningful output.

x = rand(10); y = rand(10)

function foo!(x,y)
    @. x = -x
    @. y = -y
end
x = rand(10); y = rand(10)

function foo!(x,y)
    task_a = @spawn (@. x = -x)
    task_b = @spawn (@. y = -y)

    wait.((task_a, task_b))   # wait for both tasks; no output is returned
end
x = rand(10); y = rand(10)

function foo!(x,y)
    @sync begin               # waits for all tasks spawned inside the block
        @spawn (@. x = -x)
        @spawn (@. y = -y)
    end
end
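As a quick sanity check, valid for any of the three versions of foo! above, we can verify that the vectors are negated in place:

x = rand(10); y = rand(10)
x_before, y_before = copy(x), copy(y)

foo!(x,y)                            # mutates both vectors

x == -x_before && y == -y_before     # true: each vector was negated in place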

Multithreading Overhead

To see the advantages of @spawn in action, let's compute the sum and the maximum of a vector x, with both a sequential and a multithreaded approach. To clearly shed light on the benefits of parallelization, we also include the time to execute each operation in isolation. The results show that the time of the sequential procedure equals the sum of the times of the two computations. By contrast, thanks to parallelism, the execution time under multithreading is roughly the maximum time required by either computation.

x = rand(10_000_000)

function non_threaded(x)    
    a           = maximum(x)
    b           = sum(x)
    
    all_outputs = (a,b)
end
Output in REPL
julia> @btime maximum($x)
  7.705 ms (0 allocations: 0 bytes)

julia> @btime sum($x)
  3.131 ms (0 allocations: 0 bytes)

julia> @btime non_threaded($x)
  10.917 ms (0 allocations: 0 bytes)
x = rand(10_000_000)

function multithreaded(x)    
    task_a      = @spawn maximum(x)
    task_b      = @spawn sum(x)
    
    all_tasks   = (task_a, task_b)
    all_outputs = fetch.(all_tasks)
end
Output in REPL
julia> @btime maximum($x)
  7.705 ms (0 allocations: 0 bytes)

julia> @btime sum($x)
  3.131 ms (0 allocations: 0 bytes)

julia> @btime multithreaded($x)
  7.741 ms (21 allocations: 1.250 KiB)

As we can see, the execution time under multithreading is roughly the maximum time for a single operation to complete in isolation. Nonetheless, this equivalence isn't exact: multithreading has a non-negligible overhead, stemming from the creation and scheduling of tasks. As a consequence, multithreading isn't beneficial for operations involving small objects, as the added overhead negates any potential gains.
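To get a rough sense of this fixed cost, we can benchmark a spawned task whose computation is negligible. The exact figure is machine-dependent, but the point is the order of magnitude: microseconds of pure task-creation and scheduling overhead, compared to the nanoseconds the raw operation takes.

using BenchmarkTools, Base.Threads

overhead() = fetch(@spawn 1 + 1)     # the work itself is negligible

@btime overhead()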

To see how this plays out, let's compare the execution times of a sequential and a multithreaded approach for different sizes of x. In the case considered, the single-threaded approach dominates for vectors with fewer than 100,000 elements.

x_small  = rand(    1_000)
x_medium = rand(  100_000)
x_big    = rand(1_000_000)

function foo(x)    
    a           = maximum(x)
    b           = sum(x)
    
    all_outputs = (a,b)
end
Output in REPL
julia> @btime foo($x_small)
  866.758 ns (0 allocations: 0 bytes)

julia> @btime foo($x_medium)
  59.934 μs (0 allocations: 0 bytes)

julia> @btime foo($x_big)
  620.869 μs (0 allocations: 0 bytes)
x_small  = rand(    1_000)
x_medium = rand(  100_000)
x_big    = rand(1_000_000)

function foo(x)    
    task_a      = @spawn maximum(x)
    task_b      = @spawn sum(x)
    
    all_tasks   = (task_a, task_b)
    all_outputs = fetch.(all_tasks)
end
Output in REPL
julia> @btime foo($x_small)
  3.245 μs (14.33 allocations: 1.068 KiB)

julia> @btime foo($x_medium)
  55.853 μs (21 allocations: 1.250 KiB)

julia> @btime foo($x_big)
  549.445 μs (21 allocations: 1.250 KiB)