
10g. SIMD Packages

Code Script
This section's scripts are available here, under the name allCode.jl. They've been tested under Julia 1.11.8.

Introduction

Up to this point, we’ve leaned on the built-in @simd macro to encourage vectorization in for-loops. This approach, nonetheless, exhibits several limitations.

The first one is control. The @simd macro acts as a suggestion, rather than a strict command: it hints to the compiler that SIMD optimizations might improve performance, but the implementation decision is ultimately up to the compiler's discretion. Second, @simd is designed to be conservative, prioritizing code safety over speed. In practice, this means it won't implement many transformations required to unlock SIMD's full capabilities.

To overcome these shortcomings, we'll introduce the @turbo macro from the LoopVectorization package. Unlike @simd, this macro can be used in both for-loops and broadcast operations.

Rather than merely suggesting vectorization, @turbo rewrites for-loops and broadcast expressions into a form that's explicitly structured for SIMD execution. Furthermore, it applies more aggressive optimizations and integrates with the SLEEFPirates package. The latter enables SIMD-accelerated implementations of common mathematical functions, including logarithms, exponentials, powers, and trigonometric operations.

This extra power comes with an important shift in responsibility: @turbo assumes that your code satisfies the conditions necessary for these transformations, placing the burden of safety and correctness on the user.
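The examples in this section assume that the relevant packages are already loaded. A minimal setup (package names as in the discussion above, with BenchmarkTools providing the `@btime` macro used in the benchmarks):

```julia
using LoopVectorization   # provides the @turbo macro
using BenchmarkTools      # provides the @btime macro used in the benchmarks
```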

Caveats About Improper Use of @turbo

In contrast to @simd, applying @turbo demands deliberate care, as an incorrect application can silently produce wrong results. The risk stems from the additional assumptions that @turbo makes to enable more aggressive optimizations. In particular, @turbo operates under two key assumptions:

  • No out-of-bounds access: @turbo removes bounds checks entirely, so all indices must be valid.

  • Iteration-order invariance: @turbo assumes that the result doesn't depend on the order in which iterations execute. Reductions, though, are still allowed.
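To see what the first assumption protects against, consider a hypothetical loop that reads one index past the end of a vector. With bounds checks active, Julia raises a BoundsError; under @turbo the check is removed, so the same access would silently read invalid memory. A sketch (the function name unsafe_sum is illustrative, and the loop should not actually be annotated with @turbo):

```julia
x = [0.1, 0.2, 0.3]

function unsafe_sum(x)
    s = 0.0
    for i in 1:4        # 4 is one index past the end of x
        s += x[i]
    end
    return s
end

# With bounds checks in place, the invalid access is caught:
try
    unsafe_sum(x)
catch e
    println(typeof(e))   # prints BoundsError
end
# Annotating this loop with @turbo would delete the check,
# turning the error into a silent out-of-bounds read.
```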

The second assumption becomes especially important, since it may conflict with for-loops that have dependent iterations. The example below illustrates this issue: because each iteration consumes the result produced by the previous one, applying @turbo breaks the required iteration-order invariance and therefore produces incorrect results.

x = [0.1, 0.2, 0.3]

function foo!(x)
    for i in 2:length(x)
        x[i] = x[i-1] + x[i]
    end
end
Output in REPL
julia>
foo!(x)

julia>
x
3-element Vector{Float64}:
 0.1
 0.3
 0.6
x = [0.1, 0.2, 0.3]

function foo!(x)
    @inbounds @simd for i in 2:length(x)
        x[i] = x[i-1] + x[i]
    end
end
Output in REPL
julia>
foo!(x)

julia>
x
3-element Vector{Float64}:
 0.1
 0.3
 0.6
x = [0.1, 0.2, 0.3]

function foo!(x)
    @turbo for i in 2:length(x)
        x[i] = x[i-1] + x[i]
    end
end
Output in REPL
julia>
foo!(x)

julia>
x
3-element Vector{Float64}:
 0.1
 0.3
 0.5

Even when a for-loop has no explicit loop-carried dependency, floating-point computations remain sensitive to reordering. In particular, because floating-point addition isn’t associative, reordering a sequence of additions can change the final result. Note that integer arithmetic, by contrast, is unaffected by such reordering.
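A quick way to see this non-associativity, using nothing beyond base Julia:

```julia
a = (0.1 + 0.2) + 0.3   # 0.6000000000000001
b = 0.1 + (0.2 + 0.3)   # 0.6
println(a == b)          # prints false: the two groupings round differently

# Integer addition, in contrast, is exactly associative:
println((1 + 2) + 3 == 1 + (2 + 3))   # prints true
```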

Considering that @turbo isn't suitable for all operations, we next present cases where the macro can be safely applied.

Safe Applications of @turbo

There are two categories of operations that represent safe uses of @turbo: embarrassingly parallel problems and reductions.

In the first category, each iteration is completely independent, making execution order irrelevant to the final outcome. The example below illustrates this scenario by performing an independent transformation on each element of a vector.

x              = rand(1_000_000)
calculation(a) = a * 0.1 + a^2 * 0.2 - a^3 * 0.3 - a^4 * 0.4

function foo(x)
    output = similar(x)
    
    for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  3.721 ms (3 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = a * 0.1 + a^2 * 0.2 - a^3 * 0.3 - a^4 * 0.4

function foo(x)
    output = similar(x)
    
    @inbounds @simd for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  3.940 ms (3 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = a * 0.1 + a^2 * 0.2 - a^3 * 0.3 - a^4 * 0.4

function foo(x)
    output = similar(x)
    
    @turbo for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  279.068 μs (3 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = a * 0.1 + a^2 * 0.2 - a^3 * 0.3 - a^4 * 0.4

foo(x)         = @turbo calculation.(x)
Output in REPL
julia>
@btime foo($x)
  272.868 μs (3 allocations: 7.629 MiB)

The second safe application is reductions. Although a reduction accumulates into a single variable across iterations, its result doesn't depend on the order in which elements are combined (up to floating-point rounding). This satisfies the only condition @turbo imposes: that iterations can be reordered arbitrarily.

x              = rand(1_000_000)
calculation(a) = a * 0.1 + a^2 * 0.2 - a^3 * 0.3 - a^4 * 0.4

function foo(x)
    output = 0.0
    
    for i in eachindex(x)
        output += calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  3.424 ms (0 allocations: 0 bytes)
x              = rand(1_000_000)
calculation(a) = a * 0.1 + a^2 * 0.2 - a^3 * 0.3 - a^4 * 0.4

function foo(x)
    output = 0.0
    
    @inbounds @simd for i in eachindex(x)
        output += calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  3.758 ms (0 allocations: 0 bytes)
x              = rand(1_000_000)
calculation(a) = a * 0.1 + a^2 * 0.2 - a^3 * 0.3 - a^4 * 0.4

function foo(x)
    output = 0.0
    
    @turbo for i in eachindex(x)
        output += calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  169.195 μs (0 allocations: 0 bytes)

Accelerated Versions of Certain Functions

Another advantage of LoopVectorization comes from its integration with the SLEEF library (an acronym for "SIMD Library for Evaluating Elementary Functions"). This feature accelerates the evaluation of several mathematical functions, including exponentials, logarithms, powers, and trigonometric functions. SLEEF is exposed in LoopVectorization via the SLEEFPirates package.

Below, we illustrate the use of @turbo with these functions. For a complete list of supported functions, consult the SLEEFPirates documentation. All examples apply an element-wise transformation of x via the function calculation, whose definition changes with the function being illustrated.

Logarithm

x              = rand(1_000_000)
calculation(a) = log(a)

function foo(x)
    output = similar(x)
    
    for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  3.511 ms (3 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = log(a)

function foo(x)
    output = similar(x)
    
    @inbounds @simd for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  3.526 ms (3 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = log(a)

function foo(x)
    output = similar(x)
    
    @turbo for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  1.556 ms (3 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = log(a)

foo(x)         = @turbo calculation.(x)
Output in REPL
julia>
@btime foo($x)
  1.615 ms (3 allocations: 7.629 MiB)

Exponential Function

x              = rand(1_000_000)
calculation(a) = exp(a)

function foo(x)
    output = similar(x)
    
    for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  2.577 ms (3 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = exp(a)

function foo(x)
    output = similar(x)
    
    @inbounds @simd for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  2.719 ms (3 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = exp(a)

function foo(x)
    output = similar(x)
    
    @turbo for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  581.658 μs (3 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = exp(a)

foo(x)         = @turbo calculation.(x)
Output in REPL
julia>
@btime foo($x)
  552.402 μs (3 allocations: 7.629 MiB)

Power Functions

This covers any operation involving terms of the form \(x^{y}\).

x              = rand(1_000_000)
calculation(a) = a^4

function foo(x)
    output = similar(x)
    
    for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  3.695 ms (3 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = a^4

function foo(x)
    output = similar(x)
    
    @inbounds @simd for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  3.751 ms (3 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = a^4

function foo(x)
    output = similar(x)
    
    @turbo for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  272.640 μs (3 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = a^4

foo(x)         = @turbo calculation.(x)
Output in REPL
julia>
@btime foo($x)
  297.336 μs (3 allocations: 7.629 MiB)

The SIMD implementation of power functions also relies on other accelerated routines, such as the one used for computing square roots.

x              = rand(1_000_000)
calculation(a) = sqrt(a)

function foo(x)
    output = similar(x)
    
    for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  1.209 ms (3 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = sqrt(a)

function foo(x)
    output = similar(x)
    
    @inbounds @simd for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  1.230 ms (3 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = sqrt(a)

function foo(x)
    output = similar(x)
    
    @turbo for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  601.689 μs (3 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = sqrt(a)

foo(x)         = @turbo calculation.(x)
Output in REPL
julia>
@btime foo($x)
  601.851 μs (3 allocations: 7.629 MiB)

Trigonometric Functions

Among others, @turbo can handle the functions sin, cos, and tan. Below, we demonstrate its use with sin.

x              = rand(1_000_000)
calculation(a) = sin(a)

function foo(x)
    output     = similar(x)
    
    for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  4.127 ms (3 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = sin(a)

function foo(x)
    output     = similar(x)
    
    @inbounds @simd for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  4.165 ms (3 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = sin(a)

function foo(x)
    output     = similar(x)
    
    @turbo for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  1.337 ms (3 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = sin(a)

foo(x)         = @turbo calculation.(x)
Output in REPL
julia>
@btime foo($x)
  1.307 ms (3 allocations: 7.629 MiB)