Home
Chapters

Generic expressions are represented by <something> (e.g., <function> or <operator>). This is just notation, and the symbols < and > should not be misconstrued as Julia's syntax.

Action	Keyboard Shortcut
Previous Section	`Ctrl + 🠘`
Next Section	`Ctrl + 🠚`
List of Sections	`Ctrl + z`
List of Subsections	`Ctrl + x`
Close Any Popped Up Window (like this one)	`Esc`
Open All Codes and Outputs in a Post	`Alt + 🠛`
Close All Codes and Outputs in a Post	`Alt + 🠙`

When benchmarking, the equivalence of time measures is as follows.

Unit	Acronym	Measure in Seconds
Seconds	s	1
Milliseconds	ms	10^-3
Microseconds	μs	10^-6
Nanoseconds	ns	10^-9

Links
Dark Mode

Personal

Website

9g. Static Vectors for Small Collections

Martin Alfaro

PhD in Economics

PART II: HIGH PERFORMANCE
7. Introduction to Performance

a. Overview and Goals

b. When To Optimize Code?

c. Benchmarking Execution Time

d. Preliminaries on Types

e. Functions: Type Inference and Multiple Dispatch

8. Type Stability

a. Overview and Goals

b. Defining Type Stability

c. Type Stability with Scalars and Vectors

d. Type Stability with Global Variables

e. Barrier Functions

f. Type Stability with Tuples

g. Type Stability with Higher-Order Functions

h. Gotchas for Type Stability

9. Reducing Memory Allocations

a. Overview and Goals

b. Stack vs Heap

c. Objects Allocating Memory

d. Slice Views to Reduce Allocations

e. Pre-Allocations

f. Reductions

g. Static Vectors for Small Collections

h. Lazy Operations

i. Lazy Broadcasting and Loop Fusion
10. Vectorization (SIMD)

a. Overview and Goals

b. Macros as a Means for Optimizations

c. Introduction to SIMD

d. SIMD: Independence of Iterations

e. SIMD: Unit Strides

f. SIMD: Branchless Code

g. Packages For SIMD

11. Multithreading

a. Overview and Goals

b. Introduction to Multithreading

c. Task-Based Parallelism: @spawn

d. Thread-Safe Operations

e. Parallel For-Loops: @threads

f. Applying Parallelization

g. Packages for Multithreading

Code Script

The scripts of this section are available here under the name allCode.jl. They've been tested under Julia 1.11.3.

Introduction

Due to the memory-allocation overhead involved, the creation of vectors can rapidly become a performance bottleneck. The issue has far-reaching implications, as vector creation not only occurs when we explicitly define a variable holding a new vector. It also happens internally in various scenarios, such as when referencing a slice like x[1:2] or computing intermediate results on the fly like x .* y in the operation sum(x .* y).

This section introduces a way to address this limitation, while still preserving the use of vectors for collections. The solution leverages the so-called static vectors, provided by the StaticArrays package. Unlike built-in vectors, which are allocated on the heap, static vectors are stack-allocated.

Internally, static vectors are build on top of tuples. This feature determines that static vectors are only suitable for collections comprising a few elements. As a rule of thumb, consider using static vectors for collections with up to 75 elements. Exceeding this threshold can lead to increased overhead during creation and access, potentially offsetting any performance benefits or even resulting in a fatal error. [note] The recommended number of elements I provide is actually lower than the documentation's suggested (100 elements). The reason for this discrepancy is that as you approach the upper limit, the performance benefits of static vectors compared to regular vectors decrease significantly. As a result, the time spent benchmarking with collections of 100 elements will likely offset any potential advantage.

Static vectors offer additional benefits relative to tuples. Firstly, they maintain their performance benefits even at sizes where tuples would typically lose their advantages. Secondly, they're more convenient to work with, as they are manipulated similarly to regular vectors. In fact, they support any array type, including matrices. Finally, the StaticArrays package provides mutable variants. This makes static vectors more flexible than tuples, which are only available in an immutable form. It's worth indicating tough that, while the mutable version provides performance benefits relative to regular vectors, the immutable option still offers the best performance.

Warning!

To avoid repetition, the entire section assumes all collections are small. Taking this into account, all the benefits highlighted are contingent upon this assumption. We also suppose that the StaticArrays package is already available in the workspace, so the command using StaticArrays is omitted from each code snippet.

Creating Static Vectors

The package StaticArrays include several variants of static vectors. Our focus in particular we'll be on the type SVector, whose objects will be simply referred to as SVectors.

There are two approaches to creating an SVector, each serving a distinct purpose. The first one creates an SVector through literal values, while the other option included converts a standard vector into an SVector. The several implementations of each approach are illustrated below.

# all 'sx' define the same static vector '[3,4,5]'


sx = SVector(3,4,5)
sx = SVector{3, Int64}(3,4,5)
sx = SA[3,4,5]
sx = @SVector [i for i in 3:5]

Output in REPL

julia>

sx

3-element SVector{3, Int64} with indices SOneTo(3):
 3
 4
 5

# all 'sx' define a static vector with same elements as 'x'
x = collect(1:10)

sx = SVector(x...)
sx = SVector{length(x), eltype(x)}(x)
sx = SA[x...]
sx = @SVector [a for a in x]

Output in REPL

julia>

sx

10-element SVector{10, Int64} with indices SOneTo(10):
  1
  2
  3
  ⋮
  9
 10

Of these approaches, we'll primarily rely on the function SVector, occasionally employing SA for indexing purposes. [note] The approach based on the macro @SVector requires some caveats. For instance, it doesn't support definitions based on local variables, thus precluding the use of eachindex(x) in the array comprehension, unless x is a global variable. Note the use of the splatting operator ... to turn a regular vector into an SVector. This operator is necessary for the function SVector to transform a collection into a sequence of arguments. [note] For instance, foo(x...) is equivalent to foo(x[1], x[2], x[3]) given a vector or tuple x with 3 elements.

Regarding slices of SVectors, the approach used for their creation could result in either a regular vector or an SVector. This depends on the indexing employed: a slice remains an SVector when indices are given as SVectors, whereas the slice becomes a regular vector for indices provided by ranges or regular vectors. The sole exception to this rule is when the slice references the whole object (i.e., sx[:]), in which case an SVector is returned.

x = collect(3:10) ; sx = SVector(x...)

# both define static vectors
slice1 = sx[:]
slice2 = sx[SA[1,2]]        # or slice2 = sx[SVector(1,2)]

Output in REPL

julia>

slice1

8-element SVector{8, Int64} with indices SOneTo(8):
  3
  4
  ⋮
  9
 10

julia>

slice2

2-element SVector{2, Int64} with indices SOneTo(2):
 3
 4

x = collect(3:10) ; sx = SVector(x...)

# both define and ordinary vector
slice2 = sx[1:2]
slice2 = sx[[1,2]]

Output in REPL

julia>

slice

2-element Vector{Int64}:
 3
 4

SVectors Don't Allocate Memory and Are Faster

One of the key advantages of SVectors is that they're internally built on top of tuples. Consequently, SVectors don't allocate memory.

x = rand(10)

function foo(x)
    a = x[1:2]              # 1 allocation (copy of slice)
    b = [3,4]               # 1 allocation (vector creation)

    sum(a) * sum(b)         # 0 allocation (scalars don't allocate)
end

Output in REPL

julia>

@btime foo($x)

  29.819 ns (2 allocations: 160 bytes)

x = rand(10)

@views function foo(x)
    a = x[1:2]              # 0 allocation (view of slice)
    b = [3,4]               # 1 allocation (vector creation) 

    sum(a) * sum(b)         # 0 allocation (scalars don't allocate)
end

Output in REPL

julia>

@btime foo($x)

  15.015 ns (1 allocation: 80 bytes)

x = rand(10);   tup = Tuple(x)

function foo(x)
    a = x[1:2]              # 0 allocation (slice of tuple)
    b = (3,4)               # 0 allocation (tuple creation)

    sum(a) * sum(b)         # 0 allocation (scalars don't allocate)
end

Output in REPL

julia>

@btime foo($tup)

  1.400 ns (0 allocations: 0 bytes)

x = rand(10);   sx = SA[x...]

function foo(x)
    a = x[SA[1,2]]          # 0 allocation (slice of static array)
    b = SA[3,4]             # 0 allocation (static array creation)

    sum(a) * sum(b)         # 0 allocation (scalars don't allocate)
end

Output in REPL

julia>

@btime foo($sx)

  1.600 ns (0 allocations: 0 bytes)

The decrease in memory allocations from SVectors is especially relevant for operations that result in temporary vectors, such as broadcasting.

x = rand(10)

foo(x) = sum(2 .* x)

Output in REPL

julia>

@btime foo($x)

  17.936 ns (1 allocation: 144 bytes)

x = rand(10);   sx = SVector(x...)

foo(x) = sum(2 .* x)

Output in REPL

julia>

@btime foo($sx)

  1.800 ns (0 allocations: 0 bytes)

Interestingly, the performance benefits of SVectors extend beyond memory allocation. This entails that, even when operations on regular vectors don't allocate memory, SVectors can still provide a speed boost. Below, we demonstrate this through the function sum(f, <vector>), which sums the elements of <vector> after they're transformed via f. The example shows that the implementation with SVectorsyields faster execution times, even though regular vectors already don't incur memory allocations.

x  = rand(10)

foo(x) = sum(a -> 10 + 2a +  3a^2, x)

Output in REPL

julia>

@btime foo($x)

  4.400 ns (0 allocations: 0 bytes)

x  = rand(10);  sx = SVector(x...);

foo(x) = sum(a -> 10 + 2a +  3a^2, x)

Output in REPL

julia>

@btime foo($sx)

  2.900 ns (0 allocations: 0 bytes)

SVector Type and Its Mutable Variant

Similar to tuples, SVectors are immutable, meaning that its elements can't be added, removed, or modified. Nevertheless, when mutable collections are needed, the package StaticArrays provides a variant given by the type MVector. The creation of MVector instances and their slices follow the same syntax as SVectors, with the only difference being the substitution of the function Svector with MVector. This is illustrated below.

x  = [1,2,3]
sx = SVector(x...)

sx[1] = 0

Output in REPL

ERROR: setindex!(::SVector{3, Int64}, value, ::Int) is not defined

x  = [1,2,3]
mx = MVector(x...)

mx[1] = 0

Output in REPL

julia>

mx

3-element MVector{3, Int64} with indices SOneTo(3):
 0
 2
 3

The mutability of MVectors makes them ideal for initializing a vector that will eventually be filled via a for-loop. In fact, executing similar(sx) when sx is an SVector automatically returns an MVector.

'similar' for SVectors

sx = SVector(1,2,3)

mx = similar(sx)        # it defines an MVector with undef elements

Output in REPL

3-element MVector{3, Int64} with indices SOneTo(3):
        2073873450800
 -1152921504606846976
                    0

Type Stability: Size is Part of the Static Vectors' Type

SVectors are formally defined as objects with type SVector{N,T}, where N specifies the number of elements and T denotes the element's type. For instance, SVector(4,5,6) has type SVector{3,Int64}, indicating that it comprises 3 elements with type Int64. Importantly, this implies that the number of elements is part of the SVector type. This feature, which is shared with MVectors and inherited from tuples, can readily introduce type instabilities if not handled carefully.

The approaches to ensuring type stability are similar to those employed for tuples. This means that we should either pass SVectors and MVectors as function arguments, or dispatch by the number of elements through the Valtechnique.

x = rand(50)

function foo(x)
    output = SVector{length(x), eltype(x)}(undef)
    output = MVector{length(x), eltype(x)}(undef)

    for i in eachindex(x)
        temp      = x .> x[i]
        output[i] = sum(temp)
    end

    return output
end

@code_warntype foo(x)                     # type unstable

x = rand(50);   sx = SVector(x...)

function foo(x)
    
    output = MVector{length(x), eltype(x)}(undef)
    
    for i in eachindex(x)
        temp      = x .> x[i]
        output[i] = sum(temp)
    end

    return output
end

@code_warntype foo(sx)                    # type stable

x = rand(50)

function foo(x, ::Val{N}) where N
    sx     = SVector{N, eltype(x)}(x)
    output = MVector{N, eltype(x)}(undef)

    for i in eachindex(x)
        temp      = x .> x[i]
        output[i] = sum(temp)
    end

    return output
end

@code_warntype foo(x, Val(length(x)))     # type stable

Performance Comparison

MVectors offer performance benefits over regular vectors. However, you should bear in mind that they're never more performant than SVectors and may additionally result in memory allocations. Considering this, it's recommended to use SVectors if you're certain that the collection won't be mutated.

Below, we illustrate the performance of SVectors and MVectors. The examples demonstrate that they may exhibit similar performance, with the possibility of SVectors being more performant. Furthermore, SVectors and MVectors consistently outperform regular vectors for small collections, regardless of their relative performance. To emphasize this point, we include the benchmark time for the same operation using a regular vector.

x  = rand(10)
sx = SVector(x...);  mx = MVector(x...)

foo(x) = sum(a -> 10 + 2a +  3a^2, x)

Output in REPL

julia>

@btime foo($x)

  4.400 ns (0 allocations: 0 bytes)

julia>

@btime foo($sx)

  2.800 ns (0 allocations: 0 bytes)

julia>

@btime foo($mx)

  2.900 ns (0 allocations: 0 bytes)

x  = rand(10)
sx = SVector(x...);  mx = MVector(x...)

foo(x) = 10 + 2x +  3x^2

Output in REPL

julia>

@btime foo.($x)

  19.739 ns (1 allocation: 144 bytes)

julia>

@btime foo.($sx)

  1.600 ns (0 allocations: 0 bytes)

julia>

@btime foo.($mx)

  6.600 ns (1 allocation: 96 bytes)

Static Vectors vs Pre-Allocations

Considering the advantages of static vectors over regular vectors, we can now compare their performance to other strategies that reduce memory allocations. In particular, we'll examine how they stack up against pre-allocating memory for intermediate outputs. Our examples demonstrate that static vectors can efficiently store intermediate results, making pre-allocation techniques unnecessary. Moreover, they reveal that storing the final output in an MVector can lead to performance gains over using a regular vector.

For the illustration, we consider a for-loop that requires an intermediate result during each iteration i. This involves counting the number of elements in x that are greater than x[i], which can be formally implemented as sum(x .> x[i]). To make the comparison starker, we isolate the computation of the intermediate step x .> x[i]. Note that every implementation below requires pre-allocating the vector output, leaving us with only one decision to make: whether to pre-allocate the temporary vector temp. This also explains why all implementations involve at least one memory allocation.

x = rand(50)

function foo(x; output = similar(x))
    for i in eachindex(x)
        temp      = x .> x[i]
        output[i] = sum(temp)
    end

    return output
end

Output in REPL

julia>

@btime foo($x)

  3.188 μs (101 allocations: 5.17 KiB)

x = rand(50)

function foo(x; output = similar(x), temp = similar(x))
    for i in eachindex(x)
        @. temp      = x > x[i]
           output[i] = sum(temp)
    end

    return output
end

Output in REPL

julia>

@btime foo($x)

  695.745 ns (2 allocations: 992 bytes)

x = rand(50);   sx = SVector(x...)

function foo(x; output = Vector{Float64}(undef, length(x)))
    for i in eachindex(x)
        temp      = x .> x[i]
        output[i] = sum(temp)
    end

    return output
end

Output in REPL

julia>

@btime foo($sx)

  183.661 ns (1 allocation: 496 bytes)

x = rand(50);   sx = SVector(x...)

function foo(x; output = similar(x))
    for i in eachindex(x)
        temp      = x .> x[i]
        output[i] = sum(temp)
    end

    return output
end

Output in REPL

julia>

@btime foo($sx)

  148.817 ns (1 allocation: 448 bytes)

x = rand(50);   sx = SVector(x...)

function foo(x; output = MVector{length(x),eltype(x)}(undef))
    for i in eachindex(x)
        temp      = x .> x[i]
        output[i] = sum(temp)
    end

    return output
end

Output in REPL

julia>

@btime foo($sx)

  148.975 ns (1 allocation: 448 bytes)

The "No-Preallocation" tab serves as our baseline, providing a reference point for the other methods. As for the "Pre-allocating" tab, it reuses a regular vector to compute temp. In contrast, the "SVector" tab converts x to an SVector sx without pre-allocating temp. The benchmarks reveal that latter approach is more performant, thanks to the memory allocation avoidance and additional optimizations provided by SVectors.

As for the last two tabs, they continue defining x as an SVector, but additionally treat output as an MVector. The last tab in particular does this by using similar(sx) to initialize output, whereas the other tab explicitly specifies an MVector. Comparing these cases, the benchmarks demonstrate that integrating MVectors into the operation yields further performance gains.

NOTATION

PAGE LAYOUT

LINKS TO SECTIONS

KEYBOARD SHORTCUTS

TIME MEASUREMENT

Dark Mode

PART II: HIGH PERFORMANCE

Introduction

Creating Static Vectors

SVectors Don't Allocate Memory and Are Faster

SVector Type and Its Mutable Variant

Type Stability: Size is Part of the Static Vectors' Type

Performance Comparison

Static Vectors vs Pre-Allocations