Throughout this section, we make use of the macros `@inbounds` and `@simd`. Additionally, generic code elements are sometimes described through placeholders like `<function>` or `<operator>`. This is just notation, and the symbols `<` and `>` should not be misconstrued as Julia's syntax.
All benchmark times reported below are expressed in one of the following units.

Unit | Acronym | Measure in Seconds |
---|---|---|
Seconds | s | 1 |
Milliseconds | ms | \(10^{-3}\) |
Microseconds | μs | \(10^{-6}\) |
Nanoseconds | ns | \(10^{-9}\) |
SIMD relies on the ability to implement the same set of instructions on multiple data. This parallel processing allows operations like scalar additions to be performed on several elements at once, significantly boosting performance. However, conditional statements pose a particular challenge for SIMD execution. When code branches into different paths and each requires distinct instructions, it becomes difficult to maintain the "same instruction" requirement that SIMD depends on. And, although compilers can sometimes transform such code to make it SIMD-compatible, these adaptations often come at the cost of reduced efficiency.
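If you want to check whether the compiler actually vectorized a loop, one option is to inspect the generated LLVM code. The sketch below is a minimal illustration, where `sum_all` is a hypothetical function defined just for this check; `@code_llvm` comes from the InteractiveUtils standard library, which is loaded by default in the REPL.

using InteractiveUtils   # provides @code_llvm (already loaded in the REPL)
function sum_all(x)
    s = 0.0
    @inbounds @simd for i in eachindex(x)
        s += x[i]
    end
    return s
end
@code_llvm sum_all(rand(100))   # vector types like <4 x double> indicate SIMD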
This section focuses on the handling of conditional operations. Firstly, some conditional statements may be inherent to the nature of our program, and hence unavoidable. This occurs when different operations must be executed depending on the state of the program. In these cases, we'll show that the different approaches to implementing conditionals are handled differently internally. Depending on the workflow at hand, this results in advantages and disadvantages for each approach, with some being more SIMD-friendly than others.
Secondly, we'll consider cases where conditional statements can be eliminated entirely. This reflects scenarios where our code introduces conditions implicitly, which can be bypassed by writing the code in an alternative way. In other scenarios, we can avoid branches by transforming a conditional statement into an algebraic operation.
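As a minimal preview of this last idea, the sketch below replaces a branch with a multiplication, relying on the fact that `Bool` values promote to `0` and `1` in arithmetic (the variable names are just for illustration).

v = 3.0
c = v > 1.0        # a Bool condition

s = 0.0
s += c ? v : 0.0   # branched version
s += c * v         # branchless version: true promotes to 1, false to 0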
We start by covering two scenarios that internally result in conditional statements. These conditions are implicit, so their connection to conditionals isn't immediately apparent.
The first one arises when functions are type unstable. Internally, this gives rise to multiple branches, one for each possible type. Recall that type instability is a major performance bottleneck. In fact, achieving high performance is nearly impossible without addressing type stability, which makes any concerns about SIMD optimizations secondary. Ultimately, we should strive for type stability, regardless of any SIMD consideration.
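For a minimal illustration of these hidden branches, consider the hypothetical function below, whose output type depends on a runtime value. The macro `@code_warntype` from the InteractiveUtils standard library flags the problem.

using InteractiveUtils   # provides @code_warntype
unstable(flag) = flag ? 1 : 1.0   # returns Int or Float64 depending on flag
@code_warntype unstable(true)     # Body::Union{Float64, Int64} reveals the instability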
Another form of conditional computation arises in for-loops, which perform bounds checking by default. This operation represents a form of conditional execution, as each access only proceeds if the index is within bounds.
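Conceptually, each checked access behaves roughly like the hypothetical helper below, where the comparison constitutes the hidden branch (this is only a sketch, not Julia's actual implementation).

function checked_get(x, i)
    1 <= i <= length(x) || throw(BoundsError(x, i))   # the hidden branch
    return @inbounds x[i]                             # the access itself
end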
The following example shows that the compiler already finds the implementation of SIMD instructions beneficial by merely adding `@inbounds`. This explains why adding `@simd` on top of `@inbounds` has no impact on the execution times. [note] Recall that the compiler may disable bounds checking on its own in some cases, especially straightforward ones. For instance, this would be the case in our example if only `x` had been indexed and `eachindex(x)` were employed as the iteration range. This is in contrast to the case considered below, where we're also indexing the vector `output`. Furthermore, the example also demonstrates the converse: adding `@inbounds` tends to be a necessary condition for the application of SIMD. In the example, only adding `@simd` doesn't trigger SIMD instructions.
using BenchmarkTools   # provides @btime, used for all timings below

x = rand(1_000_000)
function foo(x)
    output = similar(x)
    for i in eachindex(x)
        output[i] = 2/x[i]
    end
    return output
end
@btime foo($x)
775.469 μs (2 allocations: 7.629 MiB)
x = rand(1_000_000)
function foo(x)
    output = similar(x)
    @simd for i in eachindex(x)
        output[i] = 2/x[i]
    end
    return output
end
@btime foo($x)
792.687 μs (2 allocations: 7.629 MiB)
x = rand(1_000_000)
function foo(x)
    output = similar(x)
    @inbounds for i in eachindex(x)
        output[i] = 2/x[i]
    end
    return output
end
@btime foo($x)
474.154 μs (2 allocations: 7.629 MiB)
x = rand(1_000_000)
function foo(x)
    output = similar(x)
    @inbounds @simd for i in eachindex(x)
        output[i] = 2/x[i]
    end
    return output
end
@btime foo($x)
452.921 μs (2 allocations: 7.629 MiB)
Broadcasting internally applies `@inbounds` and `@simd`. Indeed, the example below shows that a for-loop exhibits roughly the same performance as broadcasting when `@inbounds` and `@simd` are applied.
x = rand(1_000_000)
foo(x) = 2 ./ x
@btime foo($x)
431.487 μs (2 allocations: 7.629 MiB)
x = rand(1_000_000)
function foo(x)
    output = similar(x)
    @inbounds @simd for i in eachindex(x)
        output[i] = 2/x[i]
    end
    return output
end
@btime foo($x)
435.973 μs (2 allocations: 7.629 MiB)
x = rand(1_000_000)
function foo(x)
    output = similar(x)
    for i in eachindex(x)
        output[i] = 2/x[i]
    end
    return output
end
@btime foo($x)
809.359 μs (2 allocations: 7.629 MiB)
Conditional statements can be evaluated eagerly or lazily. To illustrate this distinction, let's consider a simple example where we need to compute the result `1 + 1`, but only if some condition `C` is met. A lazy approach first evaluates whether `C` holds true, before proceeding with the computation of `1 + 1`. Thus, the operation is deferred until it's confirmed that `C` is satisfied. In contrast, an eager approach computes `1 + 1` regardless of whether `C` is satisfied or not. After this, if `C` turns out to be false, the computation simply remains unused.
When the conditional statement is applied only once, a lazy approach is always more performant, as it avoids unnecessary computations. However, neither approach consistently outperforms the other when the conditional statement is part of a for-loop. This is largely due to the way modern CPUs handle simultaneous computations through SIMD. If the CPU can compute multiple operations simultaneously, computing all conditions and branches upfront, and only selecting the relevant results afterwards, could actually result in performance gains. This is especially true when the branches consist of simple operations.
In Julia, whether a conditional statement is evaluated lazily or eagerly depends on the method used to compute it. Next, we explore this topic in more detail.
The `ifelse` function in Julia follows an eager approach, where both the condition and the possible outcomes are evaluated before the result is determined. In contrast, expressions using `if` or `&&` favor lazy computations, where only the necessary parts are evaluated. The following example demonstrates this difference in behavior through a reduction operation subject to a condition. [note] Note that `ifelse` requires specifying an operation both when the condition is true and when it's not. For a sum reduction, we can handle this by returning zero when the condition isn't met.
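The eagerness of `ifelse` can already be seen on scalars: being a regular function, its arguments are evaluated before the call. A minimal sketch, where `loud` is a hypothetical helper that prints when executed:

loud() = (println("evaluated!"); 2)

true || loud()            # lazy: prints nothing
ifelse(true, 1, loud())   # eager: prints "evaluated!", even though 1 is returned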
x = rand(1_000_000)
function foo(x)
    output = 0.0
    for i in eachindex(x)
        if x[i] > 0.5
            output += x[i]/2
        end
    end
    return output
end
@btime foo($x)
415.373 μs (0 allocations: 0 bytes)
x = rand(1_000_000)
function foo(x)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        if x[i] > 0.5
            output += x[i]/2
        end
    end
    return output
end
@btime foo($x)
414.155 μs (0 allocations: 0 bytes)
x = rand(1_000_000)
function foo(x)
    output = 0.0
    for i in eachindex(x)
        output += ifelse(x[i] > 0.5, x[i]/2, 0)
    end
    return output
end
@btime foo($x)
393.046 μs (0 allocations: 0 bytes)
x = rand(1_000_000)
function foo(x)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        output += ifelse(x[i] > 0.5, x[i]/2, 0)
    end
    return output
end
@btime foo($x)
87.192 μs (0 allocations: 0 bytes)
As the example reveals, an eager computation doesn't automatically imply the application of SIMD. This explains why the macro `@simd` is included, which hints to the compiler that vectorizing the operation is worth considering. In fact, we'll see that when conditions comprise multiple statements, adding `@simd` could prompt the compiler to vectorize the conditions, while still relying on a lazy evaluation for executing the operations.
Furthermore, it's worth remarking that applying SIMD instructions doesn't necessarily increase performance. The example below demonstrates this point, where the compiler adopts a SIMD approach through `ifelse`.
x = rand(5_000_000)
output = similar(x)
function foo!(output,x)
    for i in eachindex(x)
        output[i] = ifelse(x[i] > 0.5, x[i]/2, 0)
    end
end
@btime foo!($output,$x)
2.806 ms (0 allocations: 0 bytes)
x = rand(5_000_000)
output = similar(x)
function foo!(output,x)
    @inbounds for i in eachindex(x)
        output[i] = ifelse(x[i] > 0.5, x[i]/2, 0)
    end
end
@btime foo!($output,$x)
2.888 ms (0 allocations: 0 bytes)
x = rand(5_000_000)
output = similar(x)
function foo!(output,x)
    @inbounds @simd for i in eachindex(x)
        output[i] = ifelse(x[i] > 0.5, x[i]/2, 0)
    end
end
@btime foo!($output,$x)
16.026 ms (0 allocations: 0 bytes)
Ternary operators are an alternative approach for conditional statements, taking the form `<condition> ? <action if true> : <action if false>`. Unlike the previous methods, this form relies on compiler heuristics to determine whether an eager or lazy approach should be used, depending on which would more likely be faster in the application considered.
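For reference, a minimal scalar example of the syntax, with values chosen arbitrarily:

x = 0.7
y = x > 0.5 ? x/2 : 0.0   # reads: if x > 0.5 then x/2, otherwise 0.0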
For the illustrations below, we'll consider examples where we directly add `@inbounds` and `@simd` to each approach.
Starting with the same example as above, we show that the ternary operator could choose an eager approach.
x = rand(1_000_000)
function foo(x)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        if x[i] > 0.5
            output += x[i]/2
        end
    end
    return output
end
@btime foo($x)
422.480 μs (0 allocations: 0 bytes)
x = rand(1_000_000)
function foo(x)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        output += ifelse(x[i]>0.5, x[i]/2, 0)
    end
    return output
end
@btime foo($x)
85.895 μs (0 allocations: 0 bytes)
x = rand(1_000_000)
function foo(x)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        output += x[i]>0.5 ? x[i]/2 : 0
    end
    return output
end
@btime foo($x)
87.881 μs (0 allocations: 0 bytes)
Instead, the ternary operator implements a lazy approach in the following example.
x = rand(1_000_000)
function foo(x)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        if x[i] > 0.99
            output += log(x[i])
        end
    end
    return output
end
@btime foo($x)
405.304 μs (0 allocations: 0 bytes)
x = rand(1_000_000)
function foo(x)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        output += ifelse(x[i] > 0.99, log(x[i]), 0)
    end
    return output
end
@btime foo($x)
3.470 ms (0 allocations: 0 bytes)
x = rand(1_000_000)
function foo(x)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        output += x[i]>0.99 ? log(x[i]) : 0
    end
    return output
end
@btime foo($x)
421.493 μs (0 allocations: 0 bytes)
It's worth remarking that the method chosen by the ternary operator isn't foolproof. In the following scenario, it actually chooses the slowest approach.
x = rand(5_000_000)
output = similar(x)
function foo!(output,x)
    @inbounds @simd for i in eachindex(x)
        if x[i] > 0.5
            output[i] = log(x[i])
        end
    end
end
@btime foo!($output,$x)
26.620 ms (0 allocations: 0 bytes)
x = rand(5_000_000)
output = similar(x)
function foo!(output,x)
    @inbounds @simd for i in eachindex(x)
        output[i] = ifelse(x[i] > 0.5, log(x[i]), 0)
    end
end
@btime foo!($output,$x)
16.864 ms (0 allocations: 0 bytes)
x = rand(5_000_000)
output = similar(x)
function foo!(output,x)
    @inbounds @simd for i in eachindex(x)
        output[i] = x[i]>0.5 ? log(x[i]) : 0
    end
end
@btime foo!($output,$x)
26.517 ms (0 allocations: 0 bytes)
As a rule of thumb, an eager approach is potentially more performant when branches comprise simple algebraic computations. On the contrary, conditional statements with computationally demanding operations will more likely benefit from a lazy implementation. In fact, this is a heuristic that ternary operators commonly follow.
To demonstrate this, the following example considers a conditional statement where only one branch has a computation, which in turn is straightforward. An eager approach with SIMD is faster, and coincides with the approach chosen by the ternary operator.
x = rand(1_000_000)
condition(a) = a > 0.5
computation(a) = a * 2
function foo(x)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        if condition(x[i])
            output += computation(x[i])
        end
    end
    return output
end
@btime foo($x)
416.681 μs (0 allocations: 0 bytes)
x = rand(1_000_000)
condition(a) = a > 0.5
computation(a) = a * 2
function foo(x)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        output += ifelse(condition(x[i]), computation(x[i]), 0)
    end
    return output
end
@btime foo($x)
86.242 μs (0 allocations: 0 bytes)
x = rand(1_000_000)
condition(a) = a > 0.5
computation(a) = a * 2
function foo(x)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        output += condition(x[i]) ? computation(x[i]) : 0
    end
    return output
end
@btime foo($x)
86.370 μs (0 allocations: 0 bytes)
Instead, the following scenario considers a branch with more computationally intensive calculations. In this case, a lazy approach is faster, which is the approach implemented by the ternary operator.
x = rand(2_000_000)
condition(a) = a > 0.5
computation(a) = exp(a)/3 - log(a)/2
function foo(x)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        if condition(x[i])
            output += computation(x[i])
        end
    end
    return output
end
@btime foo($x)
12.346 ms (0 allocations: 0 bytes)
x = rand(2_000_000)
condition(a) = a > 0.5
computation(a) = exp(a)/3 - log(a)/2
function foo(x)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        output += ifelse(condition(x[i]), computation(x[i]), 0)
    end
    return output
end
@btime foo($x)
16.552 ms (0 allocations: 0 bytes)
x = rand(2_000_000)
condition(a) = a > 0.5
computation(a) = exp(a)/3 - log(a)/2
function foo(x)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        output += condition(x[i]) ? computation(x[i]) : 0
    end
    return output
end
@btime foo($x)
12.492 ms (0 allocations: 0 bytes)
Next, we consider scenarios where you already have a vector holding conditions. This could occur either because the vector is already part of your dataset, or because the conditions will be reused multiple times throughout your code, in which case storing them is worthwhile.
Storing conditions in a vector could be done through an object with type `Vector{Bool}` or `BitVector`. The latter is the default type returned by Julia, as when you define objects like `x .> 0`. Although this type offers certain performance advantages, it can also hinder the application of SIMD. In cases like this, transforming the `BitVector` into a `Vector{Bool}` could speed up computations.
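To make the two representations concrete, the following snippet (with names chosen just for illustration) shows both types and the conversion used below.

x = rand(5)
bitvector = x .> 0.5                   # BitVector: conditions packed 8 per byte
boolvector = Vector{Bool}(bitvector)   # Vector{Bool}: one byte per condition
typeof(bitvector), typeof(boolvector)  # (BitVector, Vector{Bool})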
The following benchmark demonstrates the speed-up, which holds even after accounting for the cost of the vector transformation.
x = rand(1_000_000)
bitvector = x .> 0.5
function foo(x,bitvector)
    output = similar(x)
    @inbounds @simd for i in eachindex(x)
        output[i] = ifelse(bitvector[i], x[i]/i, x[i]*i)
    end
    return output
end
@btime foo($x,$bitvector)
3.393 ms (2 allocations: 7.629 MiB)
x = rand(1_000_000)
bitvector = x .> 0.5
function foo(x,bitvector)
    output = similar(x)
    boolvector = Vector(bitvector)
    @inbounds @simd for i in eachindex(x)
        output[i] = ifelse(boolvector[i], x[i]/i, x[i]*i)
    end
    return output
end
@btime foo($x,$bitvector)
862.798 μs (4 allocations: 8.583 MiB)
When the conditions don't need to be stored, it's best to compute them on the fly within `ifelse`, without a vector of conditions. This allows you to avoid memory allocations, while still applying SIMD effectively. The following example illustrates this point. [note] Note that the approach for `Vector{Bool}` is somewhat different from the examples we considered above. As we don't have a vector of conditions already defined, it's optimal to create the `Vector{Bool}` directly, rather than defining it as a transformation of a `BitVector`. In this way, we avoid unnecessary memory allocations too.
x = rand(1_000_000)
function foo(x)
    output = similar(x)
    bitvector = x .> 0.5
    @inbounds @simd for i in eachindex(x)
        output[i] = ifelse(bitvector[i], x[i]/i, x[i]*i)
    end
    return output
end
@btime foo($x)
3.628 ms (6 allocations: 7.753 MiB)
x = rand(1_000_000)
function foo(x)
    output = similar(x)
    boolvector = Vector{Bool}(undef,length(x))
    boolvector .= x .> 0.5
    @inbounds @simd for i in eachindex(x)
        output[i] = ifelse(boolvector[i], x[i]/i, x[i]*i)
    end
    return output
end
@btime foo($x)
774.952 μs (4 allocations: 8.583 MiB)
x = rand(1_000_000)
function foo(x)
    output = similar(x)
    @inbounds @simd for i in eachindex(x)
        output[i] = ifelse(x[i]>0.5, x[i]/i, x[i]*i)
    end
    return output
end
@btime foo($x)
501.114 μs (2 allocations: 7.629 MiB)
We leverage algebraic equivalences to express conditions in ways that avoid the creation of branches. Mathematically, given a set \(\left\{ b_{i}\right\} _{i=1}^{n}\) where \(b_{i}\in\left\{ 0,1\right\}\):

- all conditions are satisfied when \(\prod_{i=1}^{n}b_{i}=1\),
- at least one condition is satisfied when \(1-\prod_{i=1}^{n}\left(1-b_{i}\right)=1\).

In terms of Julia, given two Boolean scalars `c1` and `c2`, these equivalences become:

- `c1 && c2` is `Bool(c1 * c2)`
- `c1 || c2` is `Bool(1 - !c1 * !c2)`
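As a quick sanity check of these identities on scalars:

c1, c2 = true, false

(c1 && c2) == Bool(c1 * c2)         # true
(c1 || c2) == Bool(1 - !c1 * !c2)   # true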
For instance, with for-loops:
x = rand(1_000_000)
y = rand(1_000_000)
function foo(x,y)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        if (x[i] > 0.3) && (y[i] < 0.6) && (x[i] > y[i])
            output += x[i]
        end
    end
    return output
end
@btime foo($x,$y)
2.116 ms (0 allocations: 0 bytes)
x = rand(1_000_000)
y = rand(1_000_000)
function foo(x,y)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        if (x[i] > 0.3) * (y[i] < 0.6) * (x[i] > y[i])
            output += x[i]
        end
    end
    return output
end
@btime foo($x,$y)
905.078 μs (0 allocations: 0 bytes)
x = rand(1_000_000)
y = rand(1_000_000)
function foo(x,y)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        if (x[i] > 0.3) || (y[i] < 0.6) || (x[i] > y[i])
            output += x[i]
        end
    end
    return output
end
@btime foo($x,$y)
2.724 ms (0 allocations: 0 bytes)
x = rand(1_000_000)
y = rand(1_000_000)
function foo(x,y)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        if Bool(1 - !(x[i] > 0.3) * !(y[i] < 0.6) * !(x[i] > y[i]))
            output += x[i]
        end
    end
    return output
end
@btime foo($x,$y)
889.879 μs (0 allocations: 0 bytes)
While with broadcasting:
x = rand(1_000_000)
y = rand(1_000_000)
foo(x,y) = @. ifelse((x>0.3) && (y<0.6) && (x>y), x,y)
@btime foo($x,$y)
5.356 ms (2 allocations: 7.629 MiB)
x = rand(1_000_000)
y = rand(1_000_000)
foo(x,y) = @. ifelse((x>0.3) * (y<0.6) * (x>y), x,y)
@btime foo($x,$y)
541.621 μs (2 allocations: 7.629 MiB)
x = rand(1_000_000)
y = rand(1_000_000)
foo(x,y) = @. ifelse((x>0.3) || (y<0.6) || (x>y), x,y)
@btime foo($x,$y)
3.354 ms (2 allocations: 7.629 MiB)
x = rand(1_000_000)
y = rand(1_000_000)
foo(x,y) = @. ifelse(Bool(1 - !(x>0.3) * !(y<0.6) * !(x>y)), x,y)
@btime foo($x,$y)
536.276 μs (2 allocations: 7.629 MiB)