allCode.jl. They've been tested under Julia 1.11.9.allCode.jl. They've been tested under Julia 1.11.9.Certain operations involving for-loops may trigger the creation of new vectors during each iteration, leading to repeated memory allocation. In many cases, this dynamic allocation is unnecessary. Particularly, when the vectors only store temporary intermediate results that are discarded after the current iteration.
A comparable issue arises when the same operations must be executed repeatedly with varying inputs, as in simulations or Monte Carlo experiments. If only the final output is required, generating new intermediate vectors for every run results in avoidable allocations and wasted resources.
In these situations, performance can be improved through pre-allocation. This technique consists of initializing a vector before execution begins, then reusing it throughout each computation. By allocating memory upfront and updating the vector in place, the program avoids the repeated overhead of memory allocation.
Remarkably, the technique isn't exclusive to Julia, but rather represents an optimization strategy applicable across programming languages. Its effectiveness ultimately stems from favoring mutations over the creation of new objects, thus minimizing allocations on the heap.
Our presentation will include a review of methods for initializing vectors. This is a prerequisite for implementing a pre-allocation strategy. We then present some scenarios where pre-allocation proves advantageous, with special emphasis on its advantages within nested for-loops.
The benefits of pre-allocation stem from the broader advantages of in-place operations. This strategy encourages reusing existing vectors rather than creating new objects. Because this principle is valuable beyond pre-allocation, we begin by illustrating the performance gains that come from reusing the same memory address.
To make this concrete, consider the operation of sorting a vector x. In Julia, many built-in functions have an in-place function counterpart. Sorting is one such case: sort(x) allocates a new sorted vector, while sort!(x) rearranges the elements of x in place. When preserving the original ordering isn’t necessary, creating a separate copy provides no benefit. In those cases, the mutating version sort!(x) is preferable, as it avoids unnecessary allocations by reusing the memory associated with x. The gains from this approach are presented below.
x = rand(1_000_000)@btime sort($x) 43.514 ms (9 allocations: 15.267 MiB)
x = rand(1_000_000)@btime sort!($x) 1.750 ms (0 allocations: 0 bytes)
As we can see, the differences in runtime are substantial. If the same operation were executed repeatedly inside a for-loop, the performance gaps would become even more pronounced.
Pre-allocation builds on the same underlying idea, but applies to situations where a vector like x doesn’t yet exist. Considering this, we begin with a review of how to initialize vectors efficiently.
Vector initialization refers to the process of creating a vector that subsequently will be filled with values. The process typically involves two steps: reserving space in memory and populating that space with some initial values. An efficient approach to initializing a vector involves performing only the first step, keeping whatever content is in the memory address. These values held in memory are referred to as undef (undefined). Although Julia will display them as numerical values, they're essentially arbitrary and meaningless.
There are two methods for initializing a vector with undef values. The first one is through a constructor, requiring the specification of the length and element types. The second one is based on the function similar, which creates a vector with the same type and dimension as another existing vector. This approach is particularly useful when the output sought matches the structure of an input.
Below, we compare the performance of several approaches to initializing vectors. In particular, we establish that working with undef values is faster than populating vectors with specific values. To starkly show the differences in execution time, we repeat the process of vector creation 100,000 times.
x = collect(1:100)
repetitions = 100_000 # repetitions in a for-loop
function foo(x, repetitions)
for _ in 1:repetitions
Vector{Int64}(undef, length(x))
end
end@btime foo($x, $repetitions) 3.455 ms (100000 allocations: 85.449 MiB)
x = collect(1:100)
repetitions = 100_000 # repetitions in a for-loop
function foo(x, repetitions)
for _ in 1:repetitions
similar(x)
end
end@btime foo($x, $repetitions) 2.977 ms (100000 allocations: 85.449 MiB)
x = collect(1:100)
repetitions = 100_000 # repetitions in a for-loop
function foo(x, repetitions)
for _ in 1:repetitions
zeros(Int64, length(x))
end
end@btime foo($x, $repetitions) 10.797 ms (100000 allocations: 85.449 MiB)
x = collect(1:100)
repetitions = 100_000 # repetitions in a for-loop
function foo(x, repetitions)
for _ in 1:repetitions
ones(Int64, length(x))
end
end@btime foo($x, $repetitions) 11.003 ms (100000 allocations: 85.449 MiB)
x = collect(1:100)
repetitions = 100_000 # repetitions in a for-loop
function foo(x, repetitions)
for _ in 1:repetitions
fill(2, length(x)) # vector filled with integer 2
end
end@btime foo($x, $repetitions) 10.170 ms (100000 allocations: 85.449 MiB)
_ isn't a keyword, but rather a convention widely adopted by programmers to denote dummy variables. These are variables included solely to meet syntactical requirements, but are never actually used or referenced within the code. In our example, the inclusion of _ is because for-loops must always include a variable to iterate over. While any other symbol could be used, _ signals programmers that its value can be safely ignored.We can initialize a vector by passing it to the function as a keyword argument. This strategy even enables the use of similar(x), where x is a previous function argument. Considering this, the following two implementations are equivalent.
function foo(x)
output = similar(x)
# <some calculations using 'output'>
endfunction foo(x; output = similar(x))
# <some calculations using 'output'>
endWhen multiple variables of the same type need to be initialized, array comprehensions offer a concise solution. A more efficient alternative is based on the so-called generators, which are the lazy counterpart of array comprehensions. This reduces memory allocations by initializing and assigning the vectors element-wise. The behavior differs from array comprehensions, which first create a vector that holds all the initialized vectors.
Below we illustrate these strategies. Although the example is based on similar(x), note that the same principle applies a constructor like Vector{Float64}(undef, length(x)).
x = [1,2,3]
function foo(x)
a,b,c = [similar(x) for _ in 1:3]
# <some calculations using a,b,c>
end@btime foo($x) 101.973 ns (8 allocations: 320 bytes)
x = [1,2,3]
function foo(x)
a,b,c = (similar(x) for _ in 1:3)
# <some calculations using a,b,c>
end@btime foo($x) 40.751 ns (3 allocations: 144 bytes)
To describe how pre-allocation works, we'll consider a scenario where the strategy proves to be advantageous. The setting involves a nested for-loop, in which the output of a for-loop serves as an intermediate input for another for-loop. The examples are based not only on explicit for-loops, but also on operations that are internally computed via a for-loop (e.g., broadcasting).
Let's first describe the inner for-loop that will be eventually embedded in an outer for-loop. Suppose we're assessing a worker's performance over a 30-day period, with daily scores recorded on a scale from 0 to 1. Defining successful performance as any score above 0.5, the following code snippets generate vectors indicating the days on which the goal was achieved.
nr_days = 30
score = rand(nr_days)
performance(score) = score .> 0.5@btime performance($score) 49.875 ns (3 allocations: 96 bytes)
nr_days = 30
score = rand(nr_days)
function performance(score)
target = similar(score)
for i in eachindex(score)
target[i] = score[i] > 0.5
end
return target
end@btime performance($score) 45.300 ns (2 allocations: 304 bytes)
nr_days = 30
score = rand(nr_days)
function performance(score; target=similar(score))
for i in eachindex(score)
target[i] = score[i] > 0.5
end
return target
end@btime performance($score) 44.777 ns (2 allocations: 304 bytes)
Consider now that the output of this operation (namely, the vector of days in which the target was met) represents an intermediate step in another for-loop. For instance, suppose we have multiple workers, each with recorded performance scores. The goal is to summarize each worker's information through a summary statistic. In particular, the ratio of the standard deviation to the mean, which expresses variability in units of the average.
In practice, this statistic requires computing both the mean and the standard deviation. Below, we show that computing each statistic via reductions isn't efficient. The reason is that, while reductions avoid memory allocations, they require computing the days on which the target was met twice. Instead, it's more efficient to compute and store this vector beforehand, which we then reuse as an input for computing the mean and the standard deviation.
The result highlights a more general principle: when the same intermediate input is required for multiple outputs, the cost of allocating memory for the intermediate vector tends to be lower than its repeated computation. In the example, the reduction is implemented via lazy broadcasting, which internally relies on reduction techniques. The rest implement the result via explicit or implicit for-loops, with the intermediate input stored in a vector.
nr_days = 30
scores = [rand(nr_days), rand(nr_days), rand(nr_days)] # 3 workers
function repeated_call(scores)
stats = Vector{Float64}(undef, length(scores))
for col in eachindex(scores)
stats[col] = std(@~ scores[col] .> 0.5) / mean(@~ scores[col] .> 0.5)
end
return stats
end@btime repeated_call($scores) 685.195 ns (2 allocations: 80 bytes)
nr_days = 30
scores = [rand(nr_days), rand(nr_days), rand(nr_days)] # 3 workers
performance(score) = score .> 0.5
function repeated_call(scores)
stats = Vector{Float64}(undef, length(scores))
for col in eachindex(scores)
target = performance(scores[col])
stats[col] = std(target) / mean(target)
end
return stats
end@btime repeated_call($scores) 495.696 ns (11 allocations: 368 bytes)
nr_days = 30
scores = [rand(nr_days), rand(nr_days), rand(nr_days)] # 3 workers
function performance(score)
target = similar(score)
for i in eachindex(score)
target[i] = score[i] > 0.5
end
return target
end
function repeated_call(scores)
stats = Vector{Float64}(undef, length(scores))
for col in eachindex(scores)
target = performance(scores[col])
stats[col] = std(target) / mean(target)
end
return stats
end@btime repeated_call($scores) 362.846 ns (8 allocations: 992 bytes)
nr_days = 30
scores = [rand(nr_days), rand(nr_days), rand(nr_days)] # 3 workers
function performance(score; target = similar(score))
for i in eachindex(score)
target[i] = score[i] > 0.5
end
return target
end
function repeated_call(scores)
stats = Vector{Float64}(undef, length(scores))
for col in eachindex(scores)
target = performance(scores[col])
stats[col] = std(target) / mean(target)
end
return stats
end@btime repeated_call($scores) 376.145 ns (8 allocations: 992 bytes)
Once we establish that storing the intermediate vector is more performant, our approach will necessarily involve memory allocations. The methods described above, however, allocate more than required: during each iteration, a new vector is created for each worker to store the days in which the target was met. This intermediate vector doesn't need to be preserved for future use. Indeed, because it's stored in a local variable, it'll be discarded once the iteration concludes.
Such cases are strong candidates for pre-allocating intermediate results. By defining a single vector target, we can reuse it across iterations to compute the days when the target was met for each worker. This strategy incurs the overhead of creating the vector only once, after which it's reused for every worker.
The implementation requires defining the vector target before the execution of the outer for-loop. During each iteration, we then mutate target using an in-place function, which we call performance!. This function can update the contents either with a standard for-loop or by applying the broadcasting operator .=.
nr_days = 30
scores = [rand(nr_days), rand(nr_days), rand(nr_days)]
performance(score) = score .> 0.5
function repeated_call(scores)
stats = Vector{Float64}(undef, length(scores))
for col in eachindex(scores)
target = performance(scores[col])
stats[col] = std(target) / mean(target)
end
return stats
end@btime repeated_call($scores) 524.400 ns (11 allocations: 368 bytes)
nr_days = 30
scores = [rand(nr_days), rand(nr_days), rand(nr_days)]
target = similar(scores[1])
performance!(target, score) = (@. target = score > 0.5)
function repeated_call!(target, scores)
stats = Vector{Float64}(undef, length(scores))
for col in eachindex(scores)
performance!(target, scores[col])
stats[col] = std(target) / mean(target)
end
return stats
end@btime repeated_call!($target, $scores) 358.638 ns (2 allocations: 80 bytes)
nr_days = 30
scores = [rand(nr_days), rand(nr_days), rand(nr_days)]
target = similar(scores[1])
function performance!(target, score)
for i in eachindex(score)
target[i] = score[i] > 0.5
end
end
function repeated_call!(target, scores)
stats = Vector{Float64}(undef, length(scores))
for col in eachindex(scores)
performance!(target, scores[col])
stats[col] = std(target) / mean(target)
end
return stats
end@btime repeated_call!($target, $scores) 339.744 ns (2 allocations: 80 bytes)
@. has to be placed at the beginning of the statement.# the following are equivalent and define a new variable
output = @. 2 * x
output = 2 .* x# the following are equivalent and update 'output'
@. output = 2 * x
output .= 2 .* xCompared to a for-loop, the method using .= provides a simpler syntax. This is why, when the function performance! is simple enough as in our example, it's common to avoid its definition and directly insert it into the inner for-loop. The possibility also enables implementing the update via a built-in in-place function. For instance, the function map! can be used with this purpose.
nr_days = 30
scores = [rand(nr_days), rand(nr_days), rand(nr_days)]
function repeated_call(scores)
stats = Vector{Float64}(undef, length(scores))
for col in eachindex(scores)
target = @. score > 0.5
stats[col] = std(target) / mean(target)
end
return stats
end@btime repeated_call($scores) 822.471 ns (23 allocations: 608 bytes)
nr_days = 30
scores = [rand(nr_days), rand(nr_days), rand(nr_days)]
target = similar(scores[1])
function repeated_call!(target, scores)
stats = Vector{Float64}(undef, length(scores))
for col in eachindex(scores)
@. target = scores[col] > 0.5
stats[col] = std(target) / mean(target)
end
return stats
end@btime repeated_call!($target, $scores) 336.372 ns (2 allocations: 80 bytes)
nr_days = 30
scores = [rand(nr_days), rand(nr_days), rand(nr_days)]
target = similar(scores[1])
function repeated_call!(target, scores)
stats = Vector{Float64}(undef, length(scores))
for col in eachindex(scores)
map!(a -> a > 0.5, target, scores[col])
stats[col] = std(target) / mean(target)
end
return stats
end@btime repeated_call!($target, $scores) 329.814 ns (2 allocations: 80 bytes)