The code in this post (collected in allCode.jl) has been tested under Julia 1.9. For more accurate benchmarks, each variable is wrapped in
ref(x) = (Ref(x))[]
and interpolated through $.
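As a minimal sketch of how this convention is applied (sum is just a placeholder function used for illustration, not one of the examples below):
using BenchmarkTools
ref(x) = (Ref(x))[]
x = rand(100)
@btime sum(ref($x));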
This is the 4th part on performance improvements in Julia that are easy to implement. The previous parts covered lazy broadcasting, tuples/static vectors, and views. This part discusses pre-allocation of outputs.
The technique addresses the memory allocation that comes with creating new objects. It leverages the fact that mutating the contents of an existing object does not allocate, enabling us to avoid creating a new object in every iteration of a loop. Although we'll illustrate the concepts through vectors, the same principle applies to other objects. [note] Notice that pre-allocating outputs brings no benefit if we only need to compute a result once, as we always need to create a new object to start with.
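As a minimal standalone illustration of this point (the vectors here are placeholders, separate from those used below):
x = rand(100)
y = similar(x)
y = x .* 2     # binds y to a newly allocated vector
y .= x .* 2    # overwrites the contents of the existing y instead of creating a new vector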
The main takeaway from the post is that pre-allocations can greatly enhance performance. However, they can also hinder code readability. Therefore, you should decide whether implementing them pays off by carefully weighing the trade-off between performance and code complexity.
The packages used in this post are the following.
using BenchmarkTools, Random, Statistics
Random.seed!(1234)
x = rand(100)
We'll suppose that our ultimate goal is to store the result of a computation in a vector output, and that this requires computing a vector temp multiple times. Notice that pre-allocations entail no advantage when temp only operates on single elements, even if the operations are performed multiple times. The reason is that single elements do not allocate memory. For example, temp in the following code only involves manipulating single elements of x, and so no allocation arises.
function example1(x)
    output = similar(x)
    for i in eachindex(x)
        temp = x[1] / x[i]
        output[i] = mean(temp)
    end
    return output
end
@btime example1(ref($x));
80.864 ns (1 allocation: 896 bytes)
The only allocation arising in the example is due to the creation of the vector output. Since we assume that the user wants to store output, this allocation is unavoidable.
Based on the above observations, our focus will be on the allocations caused by a vector temp. To keep the code simple, we also initialize output by introducing it as a keyword argument. [note] The keyword argument's default similar(x) uses the function's positional argument x. This is possible in Julia because a function's arguments are evaluated sequentially, allowing us to define a keyword argument based on any argument defined before it.
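As a minimal sketch of this evaluation order (f is a hypothetical function, not one of the examples below):
f(x; n=length(x)) = n
f(rand(5))    # returns 5, since the default for n is computed from the argument x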
In particular, we'll consider the following baseline example.
function example2(x; output=similar(x))
    for i in eachindex(x)
        temp = x ./ x[i]
        output[i] = mean(temp)
    end
    return output
end
@btime example2(ref($x));
5.540 μs (101 allocations: 88.38 KiB)
Out of the 101 allocations, one corresponds to the creation of output and hence will always occur. The remaining 100 allocations arise because a new vector temp is created in each iteration of the for-loop (recall that the length of x is 100, which defines the number of iterations).
We can avoid these 100 allocations by pre-allocating temp. This requires creating temp before the loop and then updating its contents within the loop. The first example below demonstrates an incorrect use of pre-allocation, where we don't use .=, so that temp is re-created rather than updated. The other two examples represent equivalent ways to perform the pre-allocation.
function compute_output1(x; output=similar(x))
    temp = similar(x)
    for i in eachindex(x)
        temp = x ./ x[i]    # rebinds temp to a newly allocated vector, discarding the pre-allocated one
        output[i] = mean(temp)
    end
    return output
end
@btime compute_output1(ref($x));
5.640 μs (102 allocations: 89.25 KiB)
function compute_output2(x; output=similar(x))
    temp = similar(x)
    for i in eachindex(x)
        temp .= x ./ x[i]
        output[i] = mean(temp)
    end
    return output
end
@btime compute_output2(ref($x));
4.143 μs (2 allocations: 1.75 KiB)
function compute_output3(x; output=similar(x))
    temp = similar(x)
    for i in eachindex(x)
        @. temp = x / x[i]
        output[i] = mean(temp)
    end
    return output
end
@btime compute_output3(ref($x));
4.057 μs (2 allocations: 1.75 KiB)
The use of the macro @. in compute_output3 is the easiest way to perform the pre-allocation of temp. This macro adds a dot to each operator in the expression, including =, thereby ensuring we're not omitting any dot unintentionally.
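To check what the macro produces, we can inspect its expansion (this check is not part of the post's examples):
@macroexpand @. temp = x / x[i]    # expands to the fully dotted form, equivalent to temp .= x ./ x[i]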
Updating temp with @. as in compute_output3 is equivalent to doing it through an in-place operation. The following example shows this equivalence, where the content of temp is updated in each iteration by executing preallocated!.
function preallocated!(temp, x, i)
    for j in eachindex(x)
        temp[j] = x[j] / x[i]
    end
end
function compute_output4(x; output=similar(x))
    temp = similar(x)
    for i in eachindex(x)
        preallocated!(temp, x, i)
        output[i] = mean(temp)
    end
    return output
end
@btime compute_output4(ref($x));
Relative to compute_output1, the functions compute_output3 and compute_output4 reduce the number of allocations from 102 to 2. These 2 allocations correspond to the creation of the vectors output and temp.
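As a quick check of this claim (not shown in the original benchmarks), we can supply a pre-allocated output through the keyword argument, leaving the creation of temp as the only allocation inside the function:
out = similar(x)
@btime compute_output3(ref($x); output=$out);    # temp is now the only allocation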
The approach is useful if output is the final result we're interested in. However, output could eventually be used as an input in another loop, in which case the 2 allocations would be repeated multiple times. To show this, consider the following example.
function intermediate_result(x; output=similar(x))
    temp = similar(x)
    for i in eachindex(x)
        @. temp = x / x[i]
        output[i] = mean(temp)
    end
    return output
end
final_result(x) = [intermediate_result(x) for _ in 1:100]
@btime final_result(ref($x));
410.400 μs (201 allocations: 175.88 KiB)
The example shows that the 2 allocations are repeated 100 times, making a total of 201 allocations (200, plus 1 allocation from creating the vector in final_result itself). However, we can further reduce the number of allocations by creating temp and output outside the loop in final_result.
function intermediate_result(x, output, temp)
    for i in eachindex(x)
        @. temp = x / x[i]
        output[i] = mean(temp)
    end
    return output
end
function final_result(x)
    output = similar(x)
    temp = similar(x)
    # note that every element of the returned vector refers to the same pre-allocated output
    [intermediate_result(x, output, temp) for _ in 1:100]
end
@btime final_result(ref($x));
403.800 μs (3 allocations: 2.62 KiB)
The last example reveals that pre-allocating outputs can make the code quite cumbersome. This means that you should assess whether improving performance is worth the extra complexity. To clearly see this, compare the code above with the following one, which provides an identical result:
intermediate_result(x) = mean.(x ./ x[i] for i in eachindex(x))
final_result(x) = [intermediate_result(x) for _ in 1:100]
@btime final_result(ref($x));
543.900 μs (10201 allocations: 8.72 MiB)
Consequently, you should first consider alternatives to pre-allocation when possible. For example, depending on the operation to perform, the use of lazy broadcasting or static vectors may keep the code simpler while achieving similar performance.
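As a rough sketch of one such alternative (not from the original code), a lazy generator lets mean iterate over the element-wise divisions without materializing temp at all:
intermediate_result(x) = [mean(xj / x[i] for xj in x) for i in eachindex(x)]
final_result(x) = [intermediate_result(x) for _ in 1:100]
Each call to intermediate_result should then only allocate the vector it returns, while keeping the code close to the one-liner above.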
Finally, notice that we considered examples where the vector's length is 100, which explains why the differences in time between the approaches are not large. The performance gains from pre-allocation only become substantial when the vector's length is larger. Indeed, when we increase the vector's length to 10,000, the differences in time become evident, as the following example demonstrates.
Random.seed!(1234)
x = rand(10_000)
intermediate_result(x) = mean.(x ./ x[i] for i in eachindex(x))
final_result(x) = [intermediate_result(x) for _ in 1:100]
@btime final_result(ref($x));
71.684 s (2000401 allocations: 74.57 GiB)
Random.seed!(1234)
x = rand(10_000)
function intermediate_result(x, output, temp)
    for i in eachindex(x)
        @. temp = x / x[i]
        output[i] = mean(temp)
    end
    return output
end
function final_result(x)
    output = similar(x)
    temp = similar(x)
    [intermediate_result(x, output, temp) for _ in 1:100]
end
@btime final_result(ref($x));
4.723 s (5 allocations: 157.22 KiB)