allCode.jl. They've been tested under Julia 1.11.8.

SIMD accelerates workloads by applying a single instruction to many elements at once. Its effectiveness hinges heavily on all elements following the same control flow. Yet this pattern can be disrupted when conditionals are introduced, either explicitly in the source code or implicitly by the compiler. Once data elements evaluate a condition differently, their execution paths diverge and the hardware can no longer maintain a unified instruction stream. While compilers attempt to mitigate this issue by transforming code into SIMD-friendly forms, these adaptations tend to introduce additional work that reduces the overall benefit of SIMD.
This section explores strategies for efficiently applying SIMD in the presence of conditional behavior. We'll first examine scenarios in which the compiler introduces branches as a byproduct of its internal computation techniques. By employing alternative coding strategies, we'll show how these branches can be avoided altogether, thereby preserving a uniform instruction stream.
After this, we'll explore conditional statements that are intrinsic to the program's logic and therefore unavoidable. These are the explicit conditions that users introduce to ensure that certain operations execute only under specific circumstances. Here, we'll revisit the usual approaches to expressing conditions, focusing on their internal implementation. The goal is to compare the relative strengths and limitations of each approach, indicating which ones interact more favorably with SIMD execution.
Finally, we'll show that conditional statements can be equivalently recast as algebraic operations, effectively removing the branching logic that disrupts SIMD execution. In several scenarios, the compiler may implement these transformations automatically when it's likely beneficial.
Certain coding patterns lead to hidden branching. These are conditional execution paths that the user doesn't write explicitly, but the compiler adds internally to carry out the computation.
One particular source of hidden branching is type instability. When a function can return values of different types, Julia must generate separate execution paths to handle each possibility. These extra branches remain invisible in the source code, yet they still break the uniform instruction flow that SIMD depends on. The solution is simply to restore type stability, for which we can apply all the techniques explored in Chapter 8 of this book. It's worth remarking, though, that type stability should always be a priority, regardless of any SIMD consideration: any attempt to achieve high performance is nearly impossible without ensuring type stability. Considering this, we'll proceed throughout the section assuming that type stability was guaranteed.
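To make this concrete, the following sketch contrasts a type-unstable function with a type-stable variant (the names `unstable` and `stable` are illustrative, not from the text):

```julia
# Type-unstable: the false branch returns the Int 0 while the true branch
# returns a Float64, so the return type depends on a run-time value and the
# compiler must emit hidden branches to handle both possibilities.
unstable(a) = a > 0.5 ? a : 0

# Type-stable: both branches return a Float64, preserving the uniform
# instruction stream that SIMD depends on.
stable(a) = a > 0.5 ? a : 0.0
```

Mapping `stable` over a vector lets the compiler keep a single concrete element type, whereas the unstable variant may force it to branch on the type of each result.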
A second source of hidden branching arises in for-loops due to bounds checking. This operation represents a subtle form of conditional execution, where each iteration proceeds only when indices remain within valid bounds. Bound checks interfere with the application of SIMD, as they prevent the compiler from treating the for-loop as a uniform sequence of operations.
A consequence of this behavior is that adding `@inbounds` is a prerequisite for the application of SIMD. If bound checks remain in place, inserting `@simd` alone will rarely lead the compiler to emit SIMD instructions. In fact, merely adding `@inbounds` could be enough to enable SIMD on its own, rendering `@simd` annotations redundant. In those cases, you'll notice that using `@inbounds @simd` rather than `@inbounds` won't yield any additional speedup. Both of these points are demonstrated below. [note] Recall that the compiler may automatically disable bounds checking in simple scenarios. For instance, this is the case when only x is indexed and eachindex(x) is employed as the iteration range. Instead, explicit use of `@inbounds` becomes necessary when both x and another vector are indexed, as in the examples we introduce below.
x = rand(1_000_000)
function foo(x)
    output = similar(x)
    for i in eachindex(x)
        output[i] = 2/x[i]
    end
    return output
end

@btime foo($x)
  824.276 μs (3 allocations: 7.629 MiB)
x = rand(1_000_000)
function foo(x)
    output = similar(x)
    @simd for i in eachindex(x)
        output[i] = 2/x[i]
    end
    return output
end

@btime foo($x)
  1.109 ms (3 allocations: 7.629 MiB)
x = rand(1_000_000)
function foo(x)
    output = similar(x)
    @inbounds for i in eachindex(x)
        output[i] = 2/x[i]
    end
    return output
end

@btime foo($x)
  390.190 μs (3 allocations: 7.629 MiB)
x = rand(1_000_000)
function foo(x)
    output = similar(x)
    @inbounds @simd for i in eachindex(x)
        output[i] = 2/x[i]
    end
    return output
end

@btime foo($x)
  393.380 μs (3 allocations: 7.629 MiB)
A practical lesson from these examples is that, when signaling to the compiler that SIMD is appropriate, for-loops should be annotated with @inbounds @simd rather than using @simd alone.
@inbounds @simd usually achieves a similar performance to its broadcast variant, as demonstrated below.

x = rand(1_000_000)
foo(x) = 2 ./ x

@btime foo($x)
  452.692 μs (3 allocations: 7.629 MiB)
x = rand(1_000_000)
function foo(x)
    output = similar(x)
    for i in eachindex(x)
        output[i] = 2/x[i]
    end
    return output
end

@btime foo($x)
  802.428 μs (3 allocations: 7.629 MiB)
x = rand(1_000_000)
function foo(x)
    output = similar(x)
    @inbounds @simd for i in eachindex(x)
        output[i] = 2/x[i]
    end
    return output
end

@btime foo($x)
  399.047 μs (3 allocations: 7.629 MiB)
Certain workflows require the user to introduce conditional statements. When this occurs, conditions are integral to a program's logical flow and can't be avoided. This is why it's important to consider the most effective way to introduce conditions.
Regarding SIMD, a relevant distinction for conditional statements is whether they're evaluated eagerly or lazily. To illustrate, let's consider computing 1 + 1 only when a certain condition C is met. A lazy strategy checks C first, performing the addition 1 + 1 only if the condition is true. Expressed differently, the operation is deferred until it's actually needed. By contrast, an eager approach performs 1 + 1 immediately, regardless of whether C is satisfied. If the condition eventually evaluates to false, the result of 1 + 1 is simply discarded.
When conditional statements are applied only once, a lazy approach is more performant as it avoids unnecessary work. A trade-off arises, however, when the conditional appears inside a for-loop, which may involve many potential computations. In this case, as SIMD can evaluate multiple operations simultaneously, it may be advantageous to compute all branches upfront, selecting the relevant results only afterward. This is especially true when the branches involve inexpensive computations.
In Julia, whether a conditional expression is evaluated lazily or eagerly depends on how the code is written. Next, we'll examine this aspect in more detail.
The ifelse function in Julia follows an eager evaluation strategy. This means that both the condition and all possible outcomes are computed before deciding which result to return. In contrast, the if construct favors lazy computations, only evaluating the branch corresponding to the condition's truth value.
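One way to observe this difference directly is to place a side effect inside a branch. The counter-based sketch below is illustrative; `branch`, `calls`, and `cond` are hypothetical names introduced here.

```julia
calls = Ref(0)                    # counts how often the branch runs
branch() = (calls[] += 1; 1 + 1)

cond = false

# Lazy: with cond false, the if construct never evaluates branch().
lazy = cond ? branch() : 0
n_lazy = calls[]                  # still 0

# Eager: ifelse is an ordinary function, so branch() is evaluated as an
# argument before the selection happens, even though its result is discarded.
eager = ifelse(cond, branch(), 0)
n_eager = calls[]                 # now 1
```

Both calls return 0 here, but only the eager form pays the cost of computing the discarded branch.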
The following example demonstrates the computational difference between if and ifelse. The example is based on a reduction operation, where the elements to be summed are contingent on a condition. Note that ifelse requires specifying an operation for both the true and false cases; here, we return zero when the condition isn't met.
x = rand(1_000_000)
function foo(x)
    output = 0.0
    for i in eachindex(x)
        if x[i] > 0.5
            output += x[i]/2
        end
    end
    return output
end

@btime foo($x)
  409.692 μs (0 allocations: 0 bytes)
x = rand(1_000_000)
function foo(x)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        if x[i] > 0.5
            output += x[i]/2
        end
    end
    return output
end

@btime foo($x)
  406.976 μs (0 allocations: 0 bytes)
x = rand(1_000_000)
function foo(x)
    output = 0.0
    for i in eachindex(x)
        output += ifelse(x[i] > 0.5, x[i]/2, 0)
    end
    return output
end

@btime foo($x)
  385.429 μs (0 allocations: 0 bytes)
x = rand(1_000_000)
function foo(x)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        output += ifelse(x[i] > 0.5, x[i]/2, 0)
    end
    return output
end

@btime foo($x)
  89.666 μs (0 allocations: 0 bytes)
As the example reveals, the fact that ifelse is eager doesn't automatically trigger the application of SIMD. This is precisely why @inbounds @simd had to be included.
It's also worth noting that applying SIMD instructions with ifelse doesn't necessarily increase performance. The example below demonstrates this point.
x = rand(5_000_000)
output = similar(x)
function foo!(output,x)
    for i in eachindex(x)
        output[i] = ifelse(x[i] > 0.5, x[i]/2, 0)
    end
end

@btime foo!($output,$x)
  18.720 ms (0 allocations: 0 bytes)
x = rand(5_000_000)
output = similar(x)
function foo!(output,x)
    @inbounds @simd for i in eachindex(x)
        output[i] = ifelse(x[i] > 0.5, x[i]/2, 0)
    end
end

@btime foo!($output,$x)
  16.518 ms (0 allocations: 0 bytes)
Ternary operators are an alternative approach to expressing conditional logic. They follow the syntax <condition> ? <action if true> : <action if false>. Unlike the previous methods, this form doesn't enforce an eager or a lazy approach. Instead, it relies on heuristics to determine which approach should be implemented. The decision depends on the compiler's assessment about which strategy will likely be faster in the application considered.
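For reference, the syntax in a scalar setting looks as follows; `clamp01` is an illustrative helper, not part of the benchmarks in this section.

```julia
# Nested ternary operators: <condition> ? <action if true> : <action if false>
clamp01(a) = a < 0 ? 0.0 : (a > 1 ? 1.0 : a)
```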
To illustrate this behavior, we revisit the same reduction example used earlier, now applying @inbounds @simd directly within each construct. In the following case, the ternary operator chooses an eager strategy, thus evaluating both expressions before selecting the appropriate result.
x = rand(1_000_000)
function foo(x)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        if x[i] > 0.5
            output += x[i]/2
        end
    end
    return output
end

@btime foo($x)
  407.000 μs (0 allocations: 0 bytes)
x = rand(1_000_000)
function foo(x)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        output += ifelse(x[i]>0.5, x[i]/2, 0)
    end
    return output
end

@btime foo($x)
  89.095 μs (0 allocations: 0 bytes)
x = rand(1_000_000)
function foo(x)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        output += x[i]>0.5 ? x[i]/2 : 0
    end
    return output
end

@btime foo($x)
  88.226 μs (0 allocations: 0 bytes)
Instead, the ternary operator opts for a lazy approach in the following example.
x = rand(1_000_000)
function foo(x)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        if x[i] > 0.99
            output += log(x[i])
        end
    end
    return output
end

@btime foo($x)
  393.501 μs (0 allocations: 0 bytes)
x = rand(1_000_000)
function foo(x)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        output += ifelse(x[i] > 0.99, log(x[i]), 0)
    end
    return output
end

@btime foo($x)
  3.609 ms (0 allocations: 0 bytes)
x = rand(1_000_000)
function foo(x)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        output += x[i]>0.99 ? log(x[i]) : 0
    end
    return output
end

@btime foo($x)
  413.706 μs (0 allocations: 0 bytes)
It's worth noting that the method chosen by the ternary operator isn't foolproof. In the following example, the ternary operator ends up implementing the slower approach.
x = rand(5_000_000)
output = similar(x)
function foo!(output,x)
    @inbounds @simd for i in eachindex(x)
        if x[i] > 0.5
            output[i] = log(x[i])
        end
    end
end

@btime foo!($output,$x)
  26.299 ms (0 allocations: 0 bytes)
x = rand(5_000_000)
output = similar(x)
function foo!(output,x)
    @inbounds @simd for i in eachindex(x)
        output[i] = ifelse(x[i] > 0.5, log(x[i]), 0)
    end
end

@btime foo!($output,$x)
  17.272 ms (0 allocations: 0 bytes)
x = rand(5_000_000)
output = similar(x)
function foo!(output,x)
    @inbounds @simd for i in eachindex(x)
        output[i] = x[i]>0.5 ? log(x[i]) : 0
    end
end

@btime foo!($output,$x)
  25.943 ms (0 allocations: 0 bytes)
As a general guideline, conditional statements with computationally-demanding operations will more likely benefit from a lazy implementation. In contrast, an eager approach is usually more performant when branches comprise simple algebraic computations. These heuristics tend to drive the decision adopted by the ternary operator.
To make this concrete, consider a case where only one branch performs a computation and that computation is straightforward. Under these conditions, an eager evaluation combined with SIMD typically outperforms a lazy alternative. In fact, this is precisely the strategy adopted when using a ternary operator.
x = rand(1_000_000)
condition(a) = a > 0.5
computation(a) = a * 2
function foo(x)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        if condition(x[i])
            output += computation(x[i])
        end
    end
    return output
end

@btime foo($x)
  399.915 μs (0 allocations: 0 bytes)
x = rand(1_000_000)
condition(a) = a > 0.5
computation(a) = a * 2
function foo(x)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        output += ifelse(condition(x[i]), computation(x[i]), 0)
    end
    return output
end

@btime foo($x)
  88.727 μs (0 allocations: 0 bytes)
x = rand(1_000_000)
condition(a) = a > 0.5
computation(a) = a * 2
function foo(x)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        output += condition(x[i]) ? computation(x[i]) : 0
    end
    return output
end

@btime foo($x)
  89.035 μs (0 allocations: 0 bytes)
When the conditional branch instead involves computationally intensive calculations, the situation reverses. Here, a lazy strategy avoids performing expensive work unnecessarily, which tends to lead to better performance overall. Below, we illustrate such a scenario, showing that the ternary operator chooses a lazy evaluation.
x = rand(2_000_000)
condition(a) = a > 0.5
computation(a) = exp(a)/3 - log(a)/2
function foo(x)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        if condition(x[i])
            output += computation(x[i])
        end
    end
    return output
end

@btime foo($x)
  12.655 ms (0 allocations: 0 bytes)
x = rand(2_000_000)
condition(a) = a > 0.5
computation(a) = exp(a)/3 - log(a)/2
function foo(x)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        output += ifelse(condition(x[i]), computation(x[i]), 0)
    end
    return output
end

@btime foo($x)
  12.023 ms (0 allocations: 0 bytes)
x = rand(2_000_000)
condition(a) = a > 0.5
computation(a) = exp(a)/3 - log(a)/2
function foo(x)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        output += condition(x[i]) ? computation(x[i]) : 0
    end
    return output
end

@btime foo($x)
  13.124 ms (0 allocations: 0 bytes)
In many situations, you may already have a vector of Boolean conditions available. This can happen because the condition vector is part of the dataset itself, or because the same conditions will be reused throughout the code, making it worthwhile to compute and store them in advance.
Julia represents such condition vectors using either Vector{Bool} or BitVector. The latter is Julia's default for broadcasted comparisons, as in expressions like x .> 0. While BitVector is memory-efficient and can be faster in some contexts, it can also inhibit SIMD optimizations. When this occurs, converting the BitVector to a Vector{Bool} can improve performance. The example below illustrates this effect: even after accounting for the cost of converting the type, the version using Vector{Bool} achieves a shorter execution time.
x = rand(1_000_000)
bitvector = x .> 0.5
function foo(x,bitvector)
    output = similar(x)
    @inbounds @simd for i in eachindex(x)
        output[i] = ifelse(bitvector[i], x[i]/i, x[i]*i)
    end
    return output
end

@btime foo($x,$bitvector)
  3.098 ms (3 allocations: 7.629 MiB)
x = rand(1_000_000)
bitvector = x .> 0.5
function foo(x,bitvector)
    output = similar(x)
    boolvector = Vector(bitvector)
    @inbounds @simd for i in eachindex(x)
        output[i] = ifelse(boolvector[i], x[i]/i, x[i]*i)
    end
    return output
end

@btime foo($x,$bitvector)
  848.068 μs (5 allocations: 8.583 MiB)
When the conditions don't need to be stored for reuse, an even better option is to skip the condition vector altogether, with ifelse applied directly to scalar values inside the for-loop. This approach eliminates unnecessary memory allocations, while still allowing the compiler to generate SIMD instructions effectively. The following example demonstrates this point.

x = rand(1_000_000)
function foo(x)
    output = similar(x)
    bitvector = x .> 0.5
    @inbounds @simd for i in eachindex(x)
        output[i] = ifelse(bitvector[i], x[i]/i, x[i]*i)
    end
    return output
end

@btime foo($x)
  3.254 ms (7 allocations: 7.749 MiB)
x = rand(1_000_000)
function foo(x)
    output = similar(x)
    boolvector = Vector{Bool}(undef,length(x))
    boolvector .= x .> 0.5
    @inbounds @simd for i in eachindex(x)
        output[i] = ifelse(boolvector[i], x[i]/i, x[i]*i)
    end
    return output
end

@btime foo($x)
  704.903 μs (5 allocations: 8.583 MiB)
x = rand(1_000_000)
function foo(x)
    output = similar(x)
    @inbounds @simd for i in eachindex(x)
        output[i] = ifelse(x[i]>0.5, x[i]/i, x[i]*i)
    end
    return output
end

@btime foo($x)
  467.963 μs (3 allocations: 7.629 MiB)
We can often eliminate branches by rewriting logical conditions using simple algebraic identities. By doing so, we remove the branching hurdle that impedes a direct application of SIMD.
To see how this works, let's first express the logic in mathematical terms. Given a set \(\left\{ b_{i}\right\} _{i=1}^{n}\) where \(b_{i}\in\left\{ 0,1\right\}\), two frequent constructs can be expressed:

all conditions are satisfied when \(\prod_{i=1}^{n}b_{i}=1\),

at least one condition is satisfied when \(1-\prod_{i=1}^{n}\left(1-b_{i}\right)=1\).
These identities translate directly into Boolean expressions in Julia. Given two Boolean scalars c1 and c2:
c1 && c2 becomes Bool(c1 * c2),
c1 || c2 becomes Bool(1 - !c1 * !c2).
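As a sanity check, both rewrites can be verified exhaustively over the four Boolean combinations:

```julia
# Compare the short-circuit operators with their algebraic counterparts.
for c1 in (false, true), c2 in (false, true)
    @assert (c1 && c2) == Bool(c1 * c2)        # conjunction as a product
    @assert (c1 || c2) == Bool(1 - !c1 * !c2)  # disjunction via De Morgan's law
end
```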
Once you adopt this perspective, you can rewrite control flow in a way that's friendlier to vectorization. For example, you can replace branching logic in a for-loop as follows.
x = rand(1_000_000)
y = rand(1_000_000)
function foo(x,y)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        if (x[i] > 0.3) && (y[i] < 0.6) && (x[i] > y[i])
            output += x[i]
        end
    end
    return output
end

@btime foo($x,$y)
  2.063 ms (0 allocations: 0 bytes)
x = rand(1_000_000)
y = rand(1_000_000)
function foo(x,y)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        if (x[i] > 0.3) * (y[i] < 0.6) * (x[i] > y[i])
            output += x[i]
        end
    end
    return output
end

@btime foo($x,$y)
  865.019 μs (0 allocations: 0 bytes)
x = rand(1_000_000)
y = rand(1_000_000)
function foo(x,y)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        if (x[i] > 0.3) || (y[i] < 0.6) || (x[i] > y[i])
            output += x[i]
        end
    end
    return output
end

@btime foo($x,$y)
  2.917 ms (0 allocations: 0 bytes)
x = rand(1_000_000)
y = rand(1_000_000)
function foo(x,y)
    output = 0.0
    @inbounds @simd for i in eachindex(x)
        if Bool(1 - !(x[i] > 0.3) * !(y[i] < 0.6) * !(x[i] > y[i]))
            output += x[i]
        end
    end
    return output
end

@btime foo($x,$y)
  868.655 μs (0 allocations: 0 bytes)
With broadcasting, we can reexpress the conditional logic as follows.
x = rand(1_000_000)
y = rand(1_000_000)
foo(x,y) = @. ifelse((x>0.3) && (y<0.6) && (x>y), x, y)

@btime foo($x,$y)
  5.223 ms (3 allocations: 7.629 MiB)
x = rand(1_000_000)
y = rand(1_000_000)
foo(x,y) = @. ifelse((x>0.3) * (y<0.6) * (x>y), x, y)

@btime foo($x,$y)
  537.296 μs (3 allocations: 7.629 MiB)
x = rand(1_000_000)
y = rand(1_000_000)
foo(x,y) = @. ifelse((x>0.3) || (y<0.6) || (x>y), x, y)

@btime foo($x,$y)
  3.248 ms (3 allocations: 7.629 MiB)
x = rand(1_000_000)
y = rand(1_000_000)
foo(x,y) = @. ifelse(Bool(1 - !(x>0.3) * !(y<0.6) * !(x>y)), x, y)

@btime foo($x,$y)
  497.927 μs (3 allocations: 7.629 MiB)