pic
Personal
Website

10f. SIMD: Branchless Code

PhD in Economics

Branches

SIMD relies on the ability to implement the same set of instructions on multiple data. This parallel processing allows operations like scalar additions to be performed on several elements at once, significantly boosting performance. However, conditional statements pose a particular challenge for SIMD execution. When code branches into different paths and each requires distinct instructions, it becomes difficult to maintain the "same instruction" requirement that SIMD depends on. And, although compilers can sometimes transform such code to make it SIMD-compatible, these adaptations often come at the cost of reduced efficiency.

This section will focus on the handle of conditional operations. Firstly, some conditional statements could be inherent to the nature of our program, and hence unavoidable. This occurs when different operations are executed based on the state of the program. In these cases, we'll show that the approaches to implementing conditionals are dealt different internally. Depending on the workflow at hand, this results in advantages and disadvantages for each approach, with some being more SIMD-friendly than others.

Secondly, we'll consider cases where conditional statements can be eliminated entirely. This reflects scenarios where our code introduces conditions implicitly, which can be bypassed by writing code in an alternative way. In other scenarios, we can avoid conditional statements by transforming a conditional statement into an algebraic operation.

Type Stability and Bounds Checking as Avoidable Conditions

We start by covering two scenarios that internally result in conditional statements. These conditions are implicit, making their connection to conditionals not apparent.

The first one arises when functions are type unstable. Internally, this gives rise to multiple branches, one for each possible type. Recall that type instability is a major performance bottleneck. In fact, achieving high performance is nearly impossible without addressing type stability, which makes any concerns about SIMD optimizations secondary. Ultimately, we should strive for type stability, regardless of any SIMD consideration.

Another form of conditional computation arises in for-loops, which perform bounds checking by default. This operation represents a form of conditional execution, as each iteration is executed as long as the indices are within bounds.

The following example shows that the compiler already finds the implementation of SIMD instructions beneficial by merely adding @inbounds. This explains why adding @simd on top of @inbounds has no impact on the execution times. [note] Recall that the compiler may disable bounds checking in some cases, especially in straightforward cases. For instance, this would be the case in our example if only x had been indexed and eachindex(x) were employed as the iteration range. This is contrast to the case considered below, where we're also indexing the vector output. Furthermore, it also demonstrates the converse: adding `@inbounds` tends to be a necessary condition for the application of SIMD. In the example, only adding @simd doesn't trigger SIMD instructions.

x = rand(1_000_000)

function foo(x)
    output = similar(x)

    for i in eachindex(x)
        output[i] = 2/x[i]
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  775.469 μs (2 allocations: 7.629 MiB)
x = rand(1_000_000)

function foo(x)
    output = similar(x)

    @simd for i in eachindex(x)
        output[i] = 2/x[i]
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  792.687 μs (2 allocations: 7.629 MiB)
x = rand(1_000_000)

function foo(x)
    output = similar(x)

    @inbounds for i in eachindex(x)
        output[i] = 2/x[i]
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  474.154 μs (2 allocations: 7.629 MiB)
x = rand(1_000_000)

function foo(x)
    output = similar(x)

    @inbounds @simd for i in eachindex(x)
        output[i] = 2/x[i]
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  452.921 μs (2 allocations: 7.629 MiB)

Warning! - SIMD and `@inbounds`
Disabling bound checking commonly acts as a necessary condition for the application of SIMD. Therefore, all for-loops that aim to employ SIMD should be implemented prepending both @inbounds and @simd.
Broadcasting and For-Loops
Broadcasting disables inbounds checking and strongly favors SIMD. This is why it can sometimes appear more performant than a simple for-loop. However, broadcasting essentially implements a for-loop, but offering a concise notation. Considering this, applying targeted optimizations to a for-loop can always match and possibly surpass the performance of broadcasting.

Indeed, the example below shows that a for-loop exhibits roughly the same performance as broadcasting when @inbounds and @simd are applied.

x      = rand(1_000_000)
foo(x) = 2 ./ x
Output in REPL
julia>
@btime foo($x)
  431.487 μs (2 allocations: 7.629 MiB)
x      = rand(1_000_000)

function foo(x)
    output = similar(x)

    @inbounds @simd for i in eachindex(x)
        output[i] = 2/x[i]
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  435.973 μs (2 allocations: 7.629 MiB)
x      = rand(1_000_000)

function foo(x)
    output = similar(x)

    for i in eachindex(x)
        output[i] = 2/x[i]
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  809.359 μs (2 allocations: 7.629 MiB)

Approaches to Conditional Statements

Conditional statements can be evaluated eagerly or lazily. To illustrate this distinction, let's consider a simple example where we need to compute the result 1 + 1, but only if some condition C is met. A lazy approach first evaluates whether C holds true, before proceeding with the computation of 1 + 1. Thus, the operation is deferred until it's confirmed that C is satisfied. In contrast, an eager approach computes 1 + 1, regardless of whether C is satisfied or not. After this, if C turns out to be false, the computation remains unused.

When the conditional statement is applied only once, a lazy approach is always more performant as avoids unnecessary computations. However, neither approach consistently outperforms the other when the conditional statement is part of a for-loop. This is largely due the way modern CPUs handle simultaneous computations through SIMD. If the CPU can compute multiple operations simultaneously, it's possible that computing all conditions and branches upfront, selecting the relevant branches only afterwards, could actually result in performance gains. This is especially true when the branches consist of simple operations.

In Julia, whether a conditional statement is evaluated lazily or eagerly depends on the method used to compute it. Next, we explore this topic in more detail.

ifelse vs if

The ifelse function in Julia follows an eager approach, where both the condition and the possible outcomes are evaluated before the result is determined. In contrast, expressions using if or && favor lazy computations, where only the necessary parts are evaluated. The following example demonstrates this difference in behavior, through a reduction operation subject to a condition. [note] Note that ifelse requires specifying an operation when the condition is true and when it's not. For a sum reduction, we can handle this by returning zero when the condition isn't met.

x = rand(1_000_000)

function foo(x)
    output = 0.0

    for i in eachindex(x)
        if x[i] > 0.5
            output += x[i]/2
        end
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  415.373 μs (0 allocations: 0 bytes)
x = rand(1_000_000)

function foo(x)
    output = 0.0

    @inbounds @simd for i in eachindex(x)
        if x[i] > 0.5
            output += x[i]/2
        end
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  414.155 μs (0 allocations: 0 bytes)
x = rand(1_000_000)

function foo(x)
    output = 0.0

    for i in eachindex(x)
        output += ifelse(x[i] > 0.5, x[i]/2, 0)
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  393.046 μs (0 allocations: 0 bytes)
x = rand(1_000_000)

function foo(x)
    output = 0.0

    @inbounds @simd for i in eachindex(x)
        output += ifelse(x[i] > 0.5, x[i]/2, 0)
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  87.192 μs (0 allocations: 0 bytes)

As the example reveals, note that an eager computation doesn't automatically imply the application of SIMD. This explains why the macro @simd is included, which hints the compiler that vectorizing the operation is worth considering. In fact, we'll see that when conditions comprise multiple statements, adding @simd could prompt the compiler to vectorize conditions, while still relying on a lazy evaluation for executing operations.

Furthermore, it's worth remarking that applying SIMD instructions doesn't necessarily increase performance. The example below demonstrates this point, where the compiler adopts a SIMD approach through ifelse.

x      = rand(5_000_000)
output = similar(x)

function foo!(output,x)
    for i in eachindex(x)
        output[i] = ifelse(x[i] > 0.5, x[i]/2, 0)
    end
end
Output in REPL
julia>
foo!($output,$x)
  2.806 ms (0 allocations: 0 bytes)
x      = rand(5_000_000)
output = similar(x)

function foo!(output,x)
    @inbounds for i in eachindex(x)
        output[i] = ifelse(x[i] > 0.5, x[i]/2, 0)
    end
end
Output in REPL
julia>
foo!($output,$x)
  2.888 ms (0 allocations: 0 bytes)
x      = rand(5_000_000)
output = similar(x)

function foo!(output,x)
    @inbounds @simd for i in eachindex(x)
        output[i] = ifelse(x[i] > 0.5, x[i]/2, 0)
    end
end
Output in REPL
julia>
foo!($output,$x)
  16.026 ms (0 allocations: 0 bytes)

Ternary Operators

Ternary operators are an alternative approach for conditional statements, consisting of the form <condition> ? <action if true> : <action if false>. Unlike the previous methods, this form relies on heuristics to determine whether an eager or lazy approach should be used, depending on which would more likely be faster in the application considered.

For the illustrations, we'll consider examples where we directly add @inbounds and @simd to each approach.

Different Choices

Starting with the same example as above, we show that the ternary operator could choose an eager approach.

x = rand(1_000_000)

function foo(x)
    output = 0.0

    @inbounds @simd for i in eachindex(x)
        if x[i] > 0.5
            output += x[i]/2
        end
    end

    return output
end
Output in REPL
julia>
foo!($output,$x)
  422.480 μs (0 allocations: 0 bytes)
x = rand(1_000_000)

function foo(x)
    output = 0.0

    @inbounds @simd for i in eachindex(x)
        output += ifelse(x[i]>0.5, x[i]/2, 0)
    end

    return output
end
Output in REPL
julia>
foo!($output,$x)
  85.895 μs (0 allocations: 0 bytes)
x = rand(1_000_000)

function foo(x)
    output = 0.0

    @inbounds @simd for i in eachindex(x)
        output += x[i]>0.5 ? x[i]/2 : 0
    end

    return output
end
Output in REPL
julia>
foo!($output,$x)
  87.881 μs (0 allocations: 0 bytes)

Instead, the ternary operator implements a lazy approach in the following example.

x = rand(1_000_000)

function foo(x)
    output = 0.0

    @inbounds @simd for i in eachindex(x)
        if x[i] > 0.99
            output += log(x[i])
        end
    end

    return output
end
Output in REPL
julia>
foo!($output,$x)
  405.304 μs (0 allocations: 0 bytes)
x = rand(1_000_000)

function foo(x)
    output = 0.0

    @inbounds @simd for i in eachindex(x)
        output += ifelse(x[i] > 0.99, log(x[i]), 0)
    end

    return output
end
Output in REPL
julia>
foo!($output,$x)
  3.470 ms (0 allocations: 0 bytes)
x = rand(1_000_000)

function foo(x)
    output = 0.0

    @inbounds @simd for i in eachindex(x)
        output += x[i]>0.99 ? log(x[i]) : 0
    end

    return output
end
Output in REPL
julia>
foo!($output,$x)
  421.493 μs (0 allocations: 0 bytes)

Ternary Operator Could Choose A Less Performant Approach

It's worth remarking that the method chosen by the ternary operator isn't foolproof. In the following scenario, it actually chooses the slowest approach.

x      = rand(5_000_000)
output = similar(x)

function foo!(output,x)
    @inbounds @simd for i in eachindex(x)
        if x[i] > 0.5
            output[i] = log(x[i])
        end
    end
end
Output in REPL
julia>
foo!($output,$x)
  26.620 ms (0 allocations: 0 bytes)
x      = rand(5_000_000)
output = similar(x)

function foo!(output,x)
    @inbounds @simd for i in eachindex(x)
        output[i] = ifelse(x[i] > 0.5, log(x[i]), 0)
    end
end
Output in REPL
julia>
foo!($output,$x)
  16.864 ms (0 allocations: 0 bytes)
x      = rand(5_000_000)
output = similar(x)

function foo!(output,x)
    @inbounds @simd for i in eachindex(x)
        output[i] = x[i]>0.5 ? log(x[i]) : 0
    end
end
Output in REPL
julia>
foo!($output,$x)
  26.517 ms (0 allocations: 0 bytes)

Scenarios Under Which Each Approach is Better

As a rule of thumb, an eager approach is potentially more performant when branches comprise simple algebraic computations. On the contrary, conditional statements with computational-demanding operations will more likely benefit from a lazy implementation. In fact, this is a heuristic that ternary operators commonly follow.

To demonstrate this, the following example considers a conditional statement where only one branch has a computation, which in turn is straightforward. An eager approach with SIMD is faster, and coincides with the approach chosen when a ternary operator is chosen.

x              = rand(1_000_000)
condition(a)   = a > 0.5
computation(a) = a * 2

function foo(x)
    output = 0.0

    @inbounds @simd for i in eachindex(x)
        if condition(x[i])
            output += computation(x[i])
        end
    end

    return output
end
Output in REPL
julia>
foo!($output,$x)
  416.681 μs (0 allocations: 0 bytes)
x              = rand(1_000_000)
condition(a)   = a > 0.5
computation(a) = a * 2

function foo(x)
    output = 0.0

    @inbounds @simd for i in eachindex(x)
        output += ifelse(condition(x[i]), computation(x[i]), 0)
    end

    return output
end
Output in REPL
julia>
foo!($output,$x)
  86.242 μs (0 allocations: 0 bytes)
x              = rand(1_000_000)
condition(a)   = a > 0.5
computation(a) = a * 2

function foo(x)
    output = 0.0

    @inbounds @simd for i in eachindex(x)
        output += condition(x[i]) ? computation(x[i]) : 0
    end

    return output
end
Output in REPL
julia>
foo!($output,$x)
  86.370 μs (0 allocations: 0 bytes)

Instead, the following scenario considers a branch with more computational-intensive calculations. In this case, a lazy approach is faster, which is the approach implemented by the ternary operator.

x              = rand(2_000_000)
condition(a)   = a > 0.5
computation(a) = exp(a)/3 - log(a)/2

function foo(x)
    output = 0.0

    @inbounds @simd for i in eachindex(x)
        if condition(x[i])
            output += computation(x[i])
        end
    end

    return output
end
Output in REPL
julia>
foo!($output,$x)
  12.346 ms (0 allocations: 0 bytes)
x              = rand(2_000_000)
condition(a)   = a > 0.5
computation(a) = exp(a)/3 - log(a)/2

function foo(x)
    output = 0.0

    @inbounds @simd for i in eachindex(x)
        output += ifelse(condition(x[i]), computation(x[i]), 0)
    end

    return output
end
Output in REPL
julia>
foo!($output,$x)
  16.552 ms (0 allocations: 0 bytes)
x              = rand(2_000_000)
condition(a)   = a > 0.5
computation(a) = exp(a)/3 - log(a)/2

function foo(x)
    output = 0.0

    @inbounds @simd for i in eachindex(x)
        output += condition(x[i]) ? computation(x[i]) : 0
    end

    return output
end
Output in REPL
julia>
foo!($output,$x)
  12.492 ms (0 allocations: 0 bytes)

Vector of Conditions

Next, we consider scenarios where you already have a vector holding conditions. This could occur either because the vector is already part of your dataset, or because the conditions will be reused multiple times over you code, in which case storing the conditions is worthy.

Storing conditions in a vector could be done through an object with type Vector{Bool} or BitVector. The latter is the default type returned by Julia, as when you define objects like x .> 0. Although this type offer certain performance advantages, it can also hinder the application of SIMD. In cases like this, transforming BitVector to Vector{Bool} could speed up computations.

The following example demonstrates this, where the execution time is faster even considering the vector transformation.

x         = rand(1_000_000)
bitvector = x .> 0.5

function foo(x,bitvector)
    output     = similar(x)
        

    @inbounds @simd for i in eachindex(x)
        output[i] = ifelse(bitvector[i], x[i]/i, x[i]*i)
    end

    return output
end
Output in REPL
julia>
foo($x,$bitvector)
  3.393 ms (2 allocations: 7.629 MiB)
x         = rand(1_000_000)
bitvector = x .> 0.5

function foo(x,bitvector)
    output     = similar(x)
    boolvector = Vector(bitvector)
    
    @inbounds @simd for i in eachindex(x)
        output[i] = ifelse(boolvector[i], x[i]/i, x[i]*i)
    end

    return output
end
Output in REPL
julia>
foo($x,$bitvector)
  862.798 μs (4 allocations: 8.583 MiB)

No Vector of Conditions
The conclusions stated here assumes that you already have the vector holding the conditions. If this isn't the case and you want to apply SIMD instructions, you should implement ifelse without a vector of conditions. This allows you to avoid memory allocations, while still applying SIMD effectively. The following example illustrates this point. [note] Note that the approach for Vector{Bool} is somewhat different to the examples we considered above. As we don't have a vector of conditions already defined, it's optimal to create Vector{Bool} directly, rather than defining it as a transformation of the BitVector. In this way, we avoid unnecessary memory allocations too.

x = rand(1_000_000)

function foo(x)
    output     = similar(x)
    bitvector  = x .> 0.5
    
    
    @inbounds @simd for i in eachindex(x)
        output[i] = ifelse(bitvector[i], x[i]/i, x[i]*i)
    end

    return output
end
Output in REPL
julia>
foo($x)
  3.628 ms (6 allocations: 7.753 MiB)
x = rand(1_000_000)

function foo(x)
    output     = similar(x)
    boolvector = Vector{Bool}(undef,length(x))
        boolvector .= x .> 0.5

    @inbounds @simd for i in eachindex(x)
        output[i] = ifelse(boolvector[i], x[i]/i, x[i]*i)
    end

    return output
end
Output in REPL
julia>
foo($x)
  774.952 μs (4 allocations: 8.583 MiB)
x = rand(1_000_000)

function foo(x)
    output     = similar(x)
    
    

    @inbounds @simd for i in eachindex(x)
        output[i] = ifelse(x[i]>0.5, x[i]/i, x[i]*i)
    end

    return output
end
Output in REPL
julia>
foo($x)
  501.114 μs (2 allocations: 7.629 MiB)

Algebraic Operations as Compound Conditions

We leverage algebraic equivalences to express conditions in ways that allow us to avoid the creation of branches. Mathematically, given a set \(\left\{ b_{i}\right\} _{i=1}^{n}\) where \(b_{i}\in\left\{ 0,1\right\}\):

  • all conditions are satisfied when

\[ \prod_{i=1}^{n}c_{i} = 1 \]
  • at least one condition is satisfied when

\[ 1-\prod_{i=1}^{n}(1-c_{i}) = 1 \]

In terms of Julia, given two Boolean scalars c1 and c2, these equivalences become

  • c1 && c2 is Bool(c1 * c2)

  • c1 || c2 is Bool(1 - !c1 * !c2)

For instance, with for-loops:

x                 = rand(1_000_000)
y                 = rand(1_000_000)


function foo(x,y)
    output = 0.0

    @inbounds @simd for i in eachindex(x)
        if (x[i] > 0.3) && (y[i] < 0.6) && (x[i] > y[i])
            output += x[i]
        end       
    end

    return output
end
Output in REPL
julia>
foo($x)
  2.116 ms (0 allocations: 0 bytes)
x                 = rand(1_000_000)
y                 = rand(1_000_000)


function foo(x,y)
    output = 0.0

    @inbounds @simd for i in eachindex(x)
        if (x[i] > 0.3)  * (y[i] < 0.6)  * (x[i] > y[i])
            output += x[i]
        end       
    end

    return output
end
Output in REPL
julia>
foo($x)
  905.078 μs (0 allocations: 0 bytes)
x                 = rand(1_000_000)
y                 = rand(1_000_000)


function foo(x,y)
    output = 0.0

    @inbounds @simd for i in eachindex(x)
        if (x[i] > 0.3) || (y[i] < 0.6) || (x[i] > y[i])
            output += x[i]
        end       
    end

    return output
end
Output in REPL
julia>
foo($x)
  2.724 ms (0 allocations: 0 bytes)
x                 = rand(1_000_000)
y                 = rand(1_000_000)


function foo(x,y)
    output = 0.0

    @inbounds @simd for i in eachindex(x)
        if Bool(1 - !(x[i] > 0.3)  * !(y[i] < 0.6)  * !(x[i] > y[i]))
            output += x[i]
        end       
    end

    return output
end
Output in REPL
julia>
foo($x)
  889.879 μs (0 allocations: 0 bytes)

While with broadcasting:

x              = rand(1_000_000)
y              = rand(1_000_000)


foo(x,y)       = @. ifelse((x>0.3) && (y<0.6) && (x>y), x,y)
Output in REPL
julia>
foo($x)
  5.356 ms (2 allocations: 7.629 MiB)
x              = rand(1_000_000)
y              = rand(1_000_000)


foo(x,y)       = @. ifelse((x>0.3) *  (y<0.6) *  (x>y), x,y)
Output in REPL
julia>
foo($x)
  541.621 μs (2 allocations: 7.629 MiB)
x              = rand(1_000_000)
y              = rand(1_000_000)


foo(x,y)       = @. ifelse((x>0.3) || (y<0.6) || (x>y), x,y)
Output in REPL
julia>
foo($x)
  3.354 ms (2 allocations: 7.629 MiB)
x              = rand(1_000_000)
y              = rand(1_000_000)


foo(x,y)       = @. ifelse(Bool(1 - !(x>0.3) * !(y<0.6) * !(x>y)), x,y)
Output in REPL
julia>
foo($x)
  536.276 μs (2 allocations: 7.629 MiB)