allCode.jl. They've been tested under Julia 1.11.9.allCode.jl. They've been tested under Julia 1.11.9.SIMD accelerates workloads by applying a single instruction to many elements at once. Critically, its effectiveness hinges on all elements following the same control flow. Yet, this pattern can be disrupted when conditionals are introduced, either explicitly in the source code or implicitly by the compiler.
Once data elements evaluate a condition differently, their execution paths diverge and the hardware can no longer maintain a unified instruction stream. While compilers attempt to mitigate this issue by transforming code into SIMD-friendly forms, these adaptations tend to introduce additional work. This added overhead reduces the net benefit of SIMD.
The current section explores strategies for efficiently applying SIMD in the presence of conditional behavior. In particular, we'll show how to restructure computations to make control-flow branches disappear, allowing the program to preserve a uniform instruction stream.
We'll first examine scenarios in which the compiler introduces branches as a byproduct of its internal computation techniques. In these situations, users may be unaware that conditional control flow has been generated. For this reason, it's essential to recognize when such hidden branches occur.
After this, we consider explicit conditional statements written directly by the user. These conditions are intrinsic to program logic, ensuring that certain computations occur only when specific criteria are met. Because they can't simply be eliminated, we revisit common approaches to expressing conditions and analyze their internal implementation. The aim is to compare their relative strengths and limitations, indicating which ones interact more favorably with SIMD execution.
Finally, we show that conditional statements can often be reformulated as algebraic operations. This transformation removes branching logic and restores SIMD’s ability to operate on a uniform instruction stream. In fact, compilers may implement these transformations automatically when it's likely beneficial.
Certain workflows require the introduction of conditional statements. When this occurs, conditions are integral to a program's logical flow and can't be avoided. It's therefore important to consider the most effective way to express them.
In this book, we've examined several ways to introduce conditional statements. In the context of SIMD, a relevant distinction among these approaches is whether they evaluate conditional statements eagerly or lazily.
To illustrate, let's consider computing 1 + 1 only when certain condition C is met. A lazy strategy checks C first, performing the addition 1 + 1 only if the condition is true. In other words, the operation is deferred until it's actually needed. By contrast, an eager approach performs 1 + 1 immediately, regardless of whether C is satisfied. If the condition eventually evaluates to false, the result of 1 + 1 is simply discarded.
When a conditional statement is applied only once, a lazy approach is more performant because it avoids unnecessary work. Instead, a trade-off arises when the conditional appears inside a for-loop, where its evaluation may repeatedly apply. In this case, SIMD can apply the same operation across many elements at once, making it advantageous to compute all branches upfront and then select the relevant results only afterward. This is especially true when the branches involve inexpensive computations.
In Julia, whether a conditional expression is evaluated lazily or eagerly depends on how the code is written. Next, we'll examine this aspect in more detail.
The ifelse function in Julia follows an eager evaluation strategy. This means that both the condition and all possible outcomes are computed before deciding which result to return. In contrast, the if construct favors lazy computations, only evaluating the branch corresponding to the condition's truth value.
The following example demonstrates the computational difference between if and ifelse. The example is based on a reduction operation, where the elements to be summed are contingent on a condition. Because ifelse requires specifying an operation for both the true and false cases, we return zero when the condition isn't met.
x = rand(1_000_000)
function foo(x)
output = 0.0
for i in eachindex(x)
if x[i] > 0.5
output += x[i]/2
end
end
return output
end@btime foo($x) 1.312 ms (0 allocations: 0 bytes)
x = rand(1_000_000)
function foo(x)
output = 0.0
@inbounds @simd for i in eachindex(x)
if x[i] > 0.5
output += x[i]/2
end
end
return output
end@btime foo($x) 1.321 ms (0 allocations: 0 bytes)
x = rand(1_000_000)
function foo(x)
output = 0.0
for i in eachindex(x)
output += ifelse(x[i] > 0.5, x[i]/2, 0)
end
return output
end@btime foo($x) 1.221 ms (0 allocations: 0 bytes)
x = rand(1_000_000)
function foo(x)
output = 0.0
@inbounds @simd for i in eachindex(x)
output += ifelse(x[i] > 0.5, x[i]/2, 0)
end
return output
end@btime foo($x) 349.996 μs (0 allocations: 0 bytes)
As the example reveals, the fact that ifelse is eager doesn't automatically trigger the application of SIMD. This is precisely why @inbounds @simd had to be included.
Note that applying SIMD instructions with ifelse doesn't necessarily increase performance. The example below demonstrates this point.
x = rand(5_000_000)
output = similar(x)
function foo!(output,x)
for i in eachindex(x)
output[i] = ifelse(x[i] > 0.5, x[i]/2, 0)
end
endfoo!($output,$x) 56.122 ms (0 allocations: 0 bytes)
x = rand(5_000_000)
output = similar(x)
function foo!(output,x)
@inbounds @simd for i in eachindex(x)
output[i] = ifelse(x[i] > 0.5, x[i]/2, 0)
end
endfoo!($output,$x) 50.983 ms (0 allocations: 0 bytes)
Ternary operator is an alternative approach to expressing conditional logic. It follows the syntax <condition> ? <action if true> : <action if false>. Unlike the previous methods, this form doesn't enforce an eager or a lazy approach. Instead, it relies on heuristics to determine which approach should be implemented. The decision depends on the compiler's assessment about which strategy will likely be faster in the specific application.
To illustrate this behavior, we revisit the same reduction example used earlier, now applying @inbounds @simd directly within each construct. In the following case, the ternary operator chooses an eager strategy, thus evaluating both expressions before selecting the relevant branch.
x = rand(1_000_000)
function foo(x)
output = 0.0
@inbounds @simd for i in eachindex(x)
if x[i] > 0.5
output += x[i]/2
end
end
return output
endfoo!($x) 1.318 ms (0 allocations: 0 bytes)
x = rand(1_000_000)
function foo(x)
output = 0.0
@inbounds @simd for i in eachindex(x)
output += ifelse(x[i]>0.5, x[i]/2, 0)
end
return output
endfoo!($x) 386.313 μs (0 allocations: 0 bytes)
x = rand(1_000_000)
function foo(x)
output = 0.0
@inbounds @simd for i in eachindex(x)
output += x[i]>0.5 ? x[i]/2 : 0
end
return output
endfoo!($x) 392.661 μs (0 allocations: 0 bytes)
Instead, the ternary operator opts for a lazy approach in the following example.
x = rand(1_000_000)
function foo(x)
output = 0.0
@inbounds @simd for i in eachindex(x)
if x[i] > 0.99
output += log(x[i])
end
end
return output
endfoo!($x) 1.267 ms (0 allocations: 0 bytes)
x = rand(1_000_000)
function foo(x)
output = 0.0
@inbounds @simd for i in eachindex(x)
output += ifelse(x[i] > 0.99, log(x[i]), 0)
end
return output
endfoo!($x) 11.496 ms (0 allocations: 0 bytes)
x = rand(1_000_000)
function foo(x)
output = 0.0
@inbounds @simd for i in eachindex(x)
output += x[i]>0.99 ? log(x[i]) : 0
end
return output
endfoo!($x) 1.358 ms (0 allocations: 0 bytes)
It's worth noting that the method chosen by the ternary operator isn't foolproof: it could choose a less performant approach. In the following example, the ternary operator ends up implementing the slower approach.
x = rand(5_000_000)
output = similar(x)
function foo!(output,x)
@inbounds @simd for i in eachindex(x)
if x[i] > 0.5
output[i] = log(x[i])
end
end
endfoo!($output,$x) 79.400 ms (0 allocations: 0 bytes)
x = rand(5_000_000)
output = similar(x)
function foo!(output,x)
@inbounds @simd for i in eachindex(x)
output[i] = ifelse(x[i] > 0.5, log(x[i]), 0)
end
endfoo!($output,$x) 54.072 ms (0 allocations: 0 bytes)
x = rand(5_000_000)
output = similar(x)
function foo!(output,x)
@inbounds @simd for i in eachindex(x)
output[i] = x[i]>0.5 ? log(x[i]) : 0
end
endfoo!($output,$x) 80.596 ms (0 allocations: 0 bytes)
As a general guideline, conditional statements with computationally-demanding operations will more likely benefit from a lazy implementation. In contrast, branches comprising simple algebraic computations tend to benefit from an eager approach. These heuristics tend to drive the decision adopted by the ternary operator.
To make this concrete, consider a case where only one branch performs a computation and that computation is straightforward. Under these conditions, an eager evaluation combined with SIMD typically outperforms a lazy alternative. In fact, this is precisely the strategy adopted when using a ternary operator.
x = rand(1_000_000)
condition(a) = a > 0.5
computation(a) = a * 2
function foo(x)
output = 0.0
@inbounds @simd for i in eachindex(x)
if condition(x[i])
output += computation(x[i])
end
end
return output
endfoo!($x) 1.296 ms (0 allocations: 0 bytes)
x = rand(1_000_000)
condition(a) = a > 0.5
computation(a) = a * 2
function foo(x)
output = 0.0
@inbounds @simd for i in eachindex(x)
output += ifelse(condition(x[i]), computation(x[i]), 0)
end
return output
endfoo!($x) 412.153 μs (0 allocations: 0 bytes)
x = rand(1_000_000)
condition(a) = a > 0.5
computation(a) = a * 2
function foo(x)
output = 0.0
@inbounds @simd for i in eachindex(x)
output += condition(x[i]) ? computation(x[i]) : 0
end
return output
endfoo!($x) 312.681 μs (0 allocations: 0 bytes)
The situation reverses when the conditional branch involves a computationally intensive operation. Here, a lazy strategy avoids performing expensive work unnecessarily, leading to better performance overall. Below, we illustrate a scenario like this, showing that the ternary operator also chooses a lazy evaluation.
x = rand(2_000_000)
condition(a) = a > 0.5
computation(a) = exp(a)/3 - log(a)/2
function foo(x)
output = 0.0
@inbounds @simd for i in eachindex(x)
if condition(x[i])
output += computation(x[i])
end
end
return output
endfoo!($x) 39.390 ms (0 allocations: 0 bytes)
x = rand(2_000_000)
condition(a) = a > 0.5
computation(a) = exp(a)/3 - log(a)/2
function foo(x)
output = 0.0
@inbounds @simd for i in eachindex(x)
output += ifelse(condition(x[i]), computation(x[i]), 0)
end
return output
endfoo!($x) 40.469 ms (0 allocations: 0 bytes)
x = rand(2_000_000)
condition(a) = a > 0.5
computation(a) = exp(a)/3 - log(a)/2
function foo(x)
output = 0.0
@inbounds @simd for i in eachindex(x)
output += condition(x[i]) ? computation(x[i]) : 0
end
return output
endfoo!($x) 40.580 ms (0 allocations: 0 bytes)
In many situations, you may already have a vector of Boolean conditions defined. This can happen because the condition vector is part of the dataset itself, or because the same conditions will be reused throughout the code, making it worthwhile to compute and store them in advance.
Julia represents such vectors using either Vector{Bool} or BitVector. The latter is the default output for broadcast expressions like x .> 0. While BitVector is memory-efficient and can be faster in some contexts, it can also inhibit SIMD optimizations. When this occurs, converting the BitVector to a Vector{Bool} can improve performance. The example below illustrates this effect: even after accounting for the cost of converting the type, the version using Vector{Bool} achieves a shorter execution time.
x = rand(1_000_000)
bitvector = x .> 0.5
function foo(x,bitvector)
output = similar(x)
@inbounds @simd for i in eachindex(x)
output[i] = ifelse(bitvector[i], x[i]/i, x[i]*i)
end
return output
endfoo($x,$bitvector) 13.156 ms (3 allocations: 7.629 MiB)
x = rand(1_000_000)
bitvector = x .> 0.5
function foo(x,bitvector)
output = similar(x)
boolvector = Vector(bitvector)
@inbounds @simd for i in eachindex(x)
output[i] = ifelse(boolvector[i], x[i]/i, x[i]*i)
end
return output
endfoo($x,$bitvector) 2.588 ms (5 allocations: 8.583 MiB)
ifelse directly to scalar values inside the for-loop. This approach eliminates unnecessary memory allocations, while still allowing the compiler to generate SIMD instructions effectively. The following example demonstrates this point.x = rand(1_000_000)
function foo(x)
output = similar(x)
bitvector = x .> 0.5
@inbounds @simd for i in eachindex(x)
output[i] = ifelse(bitvector[i], x[i]/i, x[i]*i)
end
return output
endfoo($x) 9.641 ms (7 allocations: 7.749 MiB)
x = rand(1_000_000)
function foo(x)
output = similar(x)
boolvector = Vector{Bool}(undef,length(x))
boolvector .= x .> 0.5
@inbounds @simd for i in eachindex(x)
output[i] = ifelse(boolvector[i], x[i]/i, x[i]*i)
end
return output
endfoo($x) 2.314 ms (5 allocations: 8.583 MiB)
x = rand(1_000_000)
function foo(x)
output = similar(x)
@inbounds @simd for i in eachindex(x)
output[i] = ifelse(x[i]>0.5, x[i]/i, x[i]*i)
end
return output
endfoo($x) 1.436 ms (3 allocations: 7.629 MiB)
Branches can often be eliminated by rewriting logical conditions in terms of simple algebraic identities. By doing so, we're eliminating the hurdle that impedes a direct application of SIMD.
To see how this works, let's first formalize Boolean logic in mathematical terms. Given a set \(\left\{ b_{i}\right\} _{i=1}^{n}\) where \(b_{i}\in\left\{ 0,1\right\}\):
All conditions are satisfied when
At least one condition is satisfied when
These identities translate directly into Boolean expressions. Specifically, given two Boolean scalars c1 and c2:
c1 && c2 can be expressed as Bool(c1 * c2),
c1 || c2 can be expressed as Bool(1 - !c1 * !c2).
Once this strategy is adopted, control flow can be rewritten in a manner that's friendlier to vectorization. The following example illustrates this approach by replacing branching logic in a for-loop.
x = rand(1_000_000)
y = rand(1_000_000)
function foo(x,y)
output = 0.0
@inbounds @simd for i in eachindex(x)
if (x[i] > 0.3) && (y[i] < 0.6) && (x[i] > y[i])
output += x[i]
end
end
return output
endfoo($x,$y) 6.708 ms (0 allocations: 0 bytes)
x = rand(1_000_000)
y = rand(1_000_000)
function foo(x,y)
output = 0.0
@inbounds @simd for i in eachindex(x)
if (x[i] > 0.3) * (y[i] < 0.6) * (x[i] > y[i])
output += x[i]
end
end
return output
endfoo($x,$y) 2.924 ms (0 allocations: 0 bytes)
x = rand(1_000_000)
y = rand(1_000_000)
function foo(x,y)
output = 0.0
@inbounds @simd for i in eachindex(x)
if (x[i] > 0.3) || (y[i] < 0.6) || (x[i] > y[i])
output += x[i]
end
end
return output
endfoo($x,$y) 9.961 ms (0 allocations: 0 bytes)
x = rand(1_000_000)
y = rand(1_000_000)
function foo(x,y)
output = 0.0
@inbounds @simd for i in eachindex(x)
if Bool(1 - !(x[i] > 0.3) * !(y[i] < 0.6) * !(x[i] > y[i]))
output += x[i]
end
end
return output
endfoo($x,$y) 2.806 ms (0 allocations: 0 bytes)
The following example does the same, but for the case of broadcasting.
x = rand(1_000_000)
y = rand(1_000_000)
foo(x,y) = @. ifelse((x>0.3) && (y<0.6) && (x>y), x, y)foo($x,$y) 15.562 ms (3 allocations: 7.629 MiB)
x = rand(1_000_000)
y = rand(1_000_000)
foo(x,y) = @. ifelse((x>0.3) * (y<0.6) * (x>y), x, y)foo($x,$y) 1.748 ms (3 allocations: 7.629 MiB)
x = rand(1_000_000)
y = rand(1_000_000)
foo(x,y) = @. ifelse((x>0.3) || (y<0.6) || (x>y), x, y)foo($x,$y) 11.352 ms (3 allocations: 7.629 MiB)
x = rand(1_000_000)
y = rand(1_000_000)
foo(x,y) = @. ifelse(Bool(1 - !(x>0.3) * !(y<0.6) * !(x>y)), x, y)foo($x,$y) 1.459 ms (3 allocations: 7.629 MiB)
Notice that such algebraic transformations may be performed automatically by the compiler. Whether this occurs depends on how the code is written. For example, revisiting the case of for-loops, the compiler performs the transformation automatically when expressing the condition through a separate function.
x = rand(1_000_000)
y = rand(1_000_000)
function foo(x,y)
output = 0.0
@inbounds @simd for i in eachindex(x)
if (x[i] > 0.3) && (y[i] < 0.6) && (x[i] > y[i])
output += x[i]
end
end
return output
endfoo($x,$y) 6.732 ms (0 allocations: 0 bytes)
x = rand(1_000_000)
y = rand(1_000_000)
function foo(x,y)
output = 0.0
@inbounds @simd for i in eachindex(x)
if (x[i] > 0.3) * (y[i] < 0.6) * (x[i] > y[i])
output += x[i]
end
end
return output
endfoo($x,$y) 2.727 ms (0 allocations: 0 bytes)
x = rand(1_000_000)
y = rand(1_000_000)
condition(a,b) = (a > 0.3) * (b < 0.6) * (a > b)
function foo(x,y)
output = 0.0
@inbounds @simd for i in eachindex(x)
if condition(x[i],y[i])
output += x[i]
end
end
return output
endfoo($x,$y) 2.791 ms (0 allocations: 0 bytes)