pic
Personal
Website

Performance Improvements in Julia (Part 3): Reducing Allocations from Slices

Martin Alfaro - PhD in Economics
June 25, 2023
Remark
The script in this post is available here under the name allCode.jl. It's been tested under Julia 1.9.

For more accurate benchmarks, each variable is wrapped in ref(x) = (Ref(x))[] and interpolated through $.

Introduction

This is the 3rd part on fixes to enhance performance in Julia. The focus has been in particular on methods that are relatively easy to implement. The previous two parts covered lazy broadcast and static vectors. This part discusses how to handle slices of vectors.

Given a vector x, a slice of a vector is defined as x[<some index of elements>], allowing the user to access a subset of x's elements. Remarkably, accessing only one element, such as x[3], doesn't allocate memory. On the contrary, if the slice contains more than one element, Julia creates a copy of it by default, resulting in memory allocation.

The simplest and most general solution to address this is to work with views of slices. A view is a lightweight object that references the original object, avoiding the creation of a new object to store the subset of elements. Due to this, it doesn't allocate memory.

It's worth remarking that reducing memory allocation is neither necessary nor sufficient for increasing speed—it's not uncommon to have a function allocating more memory, yet being faster. The trade-off between speed and memory allocation is complex. Broadly speaking, copying data incurs additional CPU instructions to locate a new object in memory. However, it also stores elements in a contiguous memory block, allowing the CPU to access elements faster.

From a practical standpoint, excessive memory allocation is a warning sign, hinting at potential issues with the code.

Slices of Vectors and their Indices

Let's consider the following example to create a slice of some vector x.

x       = rand(10) # testing vector
foo1(x) = x[[1,2]]

@btime foo1(ref($x))
Output in REPL
  30.050 ns (2 allocations: 160 bytes)
x       = rand(10) # testing vector
foo2(x) = x[1:2]

@btime foo2(ref($x))
Output in REPL
  16.333 ns (1 allocation: 80 bytes)
x       = rand(10) # testing vector
foo3(x) = view(x, 1:2)

@btime foo3(ref($x))
Output in REPL
  1.700 ns (0 allocations: 0 bytes)

The first tab shows that indexing a slice through a vector results in two allocations: one for the slice and another for the index. Instead, the code in the second tab eliminates the latter allocation, by using a range as the slice's index. [note] The same is true for for-loops, where using a range or eachindex prevents allocating the object through which we iterate. Nonetheless, there's still one allocation involved, which corresponds to the slice itself. To avoid it, we can combine a range for the slice's index with the function view. By doing so, the slice references the original elements of x, resulting in no memory allocations.

It's important to note that the application of these different approaches may not always yield a noticeable difference. If we're only using that code once, adding view may reduce code readability without providing significant benefits. However, if that code snippet is eventually called within a for-loop, the impact can be substantial. To see this, consider the following example, where each of the functions in the example is repeatedly used.

function example1(x)
    for _ in 1:10_000
        foo1(x)
    end
end

@btime example1(ref($x))
Output in REPL
  298.800 μs (20000 allocations: 1.53 MiB)
function example2(x)
    for _ in 1:10_000
        foo2(x)
    end
end

@btime example2(ref($x))
Output in REPL
  147.900 μs (10000 allocations: 781.25 KiB)
function example3(x)
    for _ in 1:10_000
        foo3(x)
    end
end

@btime example3(ref($x))
Output in REPL
  1.970 μs (0 allocations: 0 bytes)

Macros for Views

Adding view to the code can quickly become tedious, especially when dealing with functions that involve multiple slices. Due to this, Julia provides some macros to facilitate its usage. One of these macros is @view, which creates a view of the object immediately following it. The following code snippet demonstrates its use.

'@view' macro
# macro '@view'
foo(x) = @view(x[1:2]) + @view(x[3:4])

I don't find this macro particularly useful, and it even requires more typing compared to the function view. On the contrary, there's another macro that I frequently used, which is called @views. This creates a view for each slice within an expression. This feature is particularly useful for functions, as the following example demonstrates.

x = rand(10) # testing vector

function example1(x)
    output = x[1:2] .+ x[3:4] .* x[5:6]
    
    return output .+ x[7:8] .* x[9:10]
end

@btime example1(ref($x))
Output in REPL
  101.065 ns (7 allocations: 560 bytes)
x = rand(10) # testing vector

@views function example2(x)
    output = x[1:2] .+ x[3:4] .* x[5:6]
    
    return output .+ x[7:8] .* x[9:10]
end

@btime example2(ref($x))
Output in REPL
  29.305 ns (2 allocations: 160 bytes)

The function without @views has 7 allocations: One for output, another for the vector in the return statement, and 5 from the slices. Instead, by adding @views before the definition of the function, each slice is called through a view. This results in the elimination of the 5 allocations from the slices, reducing the total allocations to 2.

Notice that the use of @views can be applied to any expression, not just functions. For instance, the following code creates views of each slice in a for-loop.

@views In For-Loops
x = rand(10) # testing vector
function example(x)
    output = 0.0

    @views for i in 1:100
        intermediate_result = x[1:2] .+ x[3:4] .* x[5:6] ./ i
        output              = sum(intermediate_result)
    end

    return output
end

@btime example(ref($x))
Output in REPL
  1.380 μs (100 allocations: 7.81 KiB)

Why not Static Vectors?

In the last code snippet, we eliminated the allocations from the slices. However, we still have 100 allocations due to the computation of intermediate_result in each iteration. Could we eliminate these allocations as well? Answering this question allows us to showcase the methods discussed in this blog so far. Furthermore, it highlights when the use of views is advantageous compared to other alternatives, such as static vectors.

First, the 100 allocations can be avoided by converting intermediate_result into a lazy array. This technique exploits that intermediate_result itself is not of interest, but rather serves as an input for a sum. Therefore, we can delay the computation of intermediate_result until we perform the sum. The following code snippet demonstrates this approach, where recall that 1 μs is 1000 ns.

Views + Lazy Arrays
using LazyArrays
x = rand(10) # testing vector

function example1(x)
    output = 0.0

    @views for i in 1:100
        intermediate_result = @~ x[1:2] .+ x[3:4] .* x[5:6] ./ i
        output              = sum(intermediate_result)
    end

    return output
end

@btime example1(ref($x))
Output in REPL
  541.176 ns (0 allocations: 0 bytes)

One question you might be asking yourself is why we're not using tuples or static vectors. After all, we showed in a previous post that tuples/static vectors allow for faster computations and do not allocate. More generally, this raises the question of when views should be applied. The key to understanding this is that tuples/static vectors are performant only when the object comprises a small number of elements. Instead, the performance of view is not impacted by the number of elements.

In the particular example we're considering, it's possible to use tuples or static vectors: we're working with an array with 10 elements, making it ideal for its application. This not only would speed up the calculations, but also eliminate the need for views (tuples/static vectors do not allocate). Notice, nonetheless, that working with slices of static vectors can be somewhat cumbersome, as it requires specifying the indexes through a static vector. If you don't do this, the static vector would be converted to an ordinary vector.

using LazyArrays
x = Tuple(rand(10)) # testing vector, converted to a tuple

function example2(x)
    output = 0.0

    for i in 1:100
        intermediate_result = x[1:2] .+ x[3:4] .* x[5:6] ./ i
        output              = sum(intermediate_result)
    end

    return output
end

@btime example2(ref($x))
Output in REPL
  1.400 ns (0 allocations: 0 bytes)
using LazyArrays, StaticArrays
x = @SVector rand(10) # testing vector, converted to a static vector

function example3(x)
    output = 0.0

    for i in 1:100
        intermediate_result = x[SVector(1,2)] .+ x[SVector(3,4)] .* x[SVector(5,6)] ./ i
        output              = sum(intermediate_result)
    end

    return output
end

@btime example3(ref($x))
Output in REPL
  1.600 ns (0 allocations: 0 bytes)