pic
Personal
Website

10g. Packages For SIMD

PhD in Economics

Introduction

So far, we've been using the built-in macro @simd to apply SIMD instructions. This approach merely suggests the potential advantages of SIMD, leaving the final decision on its implementation to the compiler. Moreover, the @simd macro only implements basic features of SIMD, and prioritizes code safety over performance.

Next, we consider the macro @turbo from the package LoopVectorization. This macro consistently implements SIMD optimizations when invoked, and applies more aggressive optimizations compared to @simd. Additionally, unlike @simd, @turbo can be utilized in both for-loops and broadcasting operations.

Caveats About Improper Use of @turbo

Unlike @simd, @turbo could return wrong results if used for certain operations. This is because the package makes some additional assumptions about operations with the goal of applying optimizations more aggressively. In particular, @turbo

  • never checks index bounds

  • may provide wrong results if the outcome depends on a specific execution order of the iterations (except for reductions).

Regarding the latter, computing a vector with cumulative sums of a vector will return incorrect results. This can be observed below, where we check the resulting vector by summing all its values.

x = rand(1_000_000)

function foo(x)
    output = copy(x)

    for i in 2:length(x)
        output[i] = output[i-1] + x[i]
    end

    return output
end
Output in REPL
julia>
sum(foo(x))
2.50038e11
x = rand(1_000_000)

function foo(x)
    output = copy(x)

    @inbounds @simd for i in 2:length(x)
        output[i] = output[i-1] + x[i]
    end

    return output
end
Output in REPL
julia>
sum(foo(x))
2.50038e11
x = rand(1_000_000)

function foo(x)
    output = copy(x)

    @turbo for i in 2:length(x)
        output[i] = output[i-1] + x[i]
    end

    return output
end
Output in REPL
julia>
sum(foo(x))
1.03169e6

Cases Covered

Considering that @turbo isn't suitable for all operations, let's present two of its primary applications. The first one arises when iterations are completely independent of each other, making the execution order of iterations irrelevant.

For instance, the following code snippet transforms each element of a vector via a polynomial function.

x              = rand(1_000_000)
calculation(a) = a * 0.1 + a^2 * 0.2 - a^3 * 0.3 - a^4 * 0.4

function foo(x)
    output     = similar(x)
    
    for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@ctime foo($x)
  3.098 ms (2 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = a * 0.1 + a^2 * 0.2 - a^3 * 0.3 - a^4 * 0.4

function foo(x)
    output     = similar(x)
    
    @inbounds @simd for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@ctime foo($x)
  5.070 ms (2 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = a * 0.1 + a^2 * 0.2 - a^3 * 0.3 - a^4 * 0.4

function foo(x)
    output     = similar(x)
    
    @turbo for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@ctime foo($x)
  492.031 μs (2 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = a * 0.1 + a^2 * 0.2 - a^3 * 0.3 - a^4 * 0.4

foo(x)         = @turbo calculation.(x)
Output in REPL
julia>
@ctime foo($x)
  406.089 μs (2 allocations: 7.629 MiB)

The second application is reductions. Although they make iterations dependent, reductions represent an exception that @turbo can handle properly.

x              = rand(1_000_000)
calculation(a) = a * 0.1 + a^2 * 0.2 - a^3 * 0.3 - a^4 * 0.4

function foo(x)
    output     = 0.0
    
    for i in eachindex(x)
        output += calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@ctime foo($x)
  3.083 ms (0 allocations: 0 bytes)
x              = rand(1_000_000)
calculation(a) = a * 0.1 + a^2 * 0.2 - a^3 * 0.3 - a^4 * 0.4

function foo(x)
    output     = 0.0
    
    @inbounds @simd for i in eachindex(x)
        output += calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@ctime foo($x)
  3.299 ms (0 allocations: 0 bytes)
x              = rand(1_000_000)
calculation(a) = a * 0.1 + a^2 * 0.2 - a^3 * 0.3 - a^4 * 0.4

function foo(x)
    output     = 0.0
    
    @turbo for i in eachindex(x)
        output += calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@ctime foo($x)
  179.722 μs (0 allocations: 0 bytes)

Special Functions

The package LoopVectorization leverages the library SLEEF, acronym for "SIMD Library for Evaluating Elementary Functions". This is available in Julia through the package SLEEFPirates, and is designed to boost the performance of some functions by utilizing SIMD instructions. In particular, it speeds up the computations of the exponential, logarithmic, power, and trigonometric functions.

Below, we illustrate each. See here for a list of all the functions supported.

Logarithm

x              = rand(1_000_000)
calculation(a) = log(a)

function foo(x)
    output     = similar(x)
    
    for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@ctime foo($x)
  3.395 ms (2 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = log(a)

function foo(x)
    output     = similar(x)
    
    @inbounds @simd for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@ctime foo($x)
  3.414 ms (2 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = log(a)

function foo(x)
    output     = similar(x)
    
    @turbo for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@ctime foo($x)
  1.229 ms (2 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = log(a)

foo(x) = @turbo calculation.(x)
Output in REPL
julia>
@ctime foo($x)
  1.237 ms (2 allocations: 7.629 MiB)

Exponential Function

x              = rand(1_000_000)
calculation(a) = exp(a)

function foo(x)
    output     = similar(x)
    
    for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@ctime foo($x)
  1.962 ms (2 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = exp(a)

function foo(x)
    output     = similar(x)
    
    @inbounds @simd for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@ctime foo($x)
  1.950 ms (2 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = exp(a)

function foo(x)
    output     = similar(x)
    
    @turbo for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@ctime foo($x)
  600.413 μs (2 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = exp(a)

foo(x) = @turbo calculation.(x)
Output in REPL
julia>
@ctime foo($x)
  596.388 μs (2 allocations: 7.629 MiB)

Power Functions

x              = rand(1_000_000)
calculation(a) = a^4

function foo(x)
    output     = similar(x)
    
    for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@ctime foo($x)
  3.100 ms (2 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = a^4

function foo(x)
    output     = similar(x)
    
    @inbounds @simd for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@ctime foo($x)
  3.305 ms (2 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = a^4

function foo(x)
    output     = similar(x)
    
    @turbo for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@ctime foo($x)
  498.762 μs (2 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = a^4

foo(x) = @turbo calculation.(x)
Output in REPL
julia>
@ctime foo($x)
  434.407 μs (2 allocations: 7.629 MiB)

The implementation of power functions includes square roots.

x              = rand(1_000_000)
calculation(a) = sqrt(a)

function foo(x)
    output     = similar(x)
    
    for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@ctime foo($x)
  1.232 ms (2 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = sqrt(a)

function foo(x)
    output     = similar(x)
    
    @inbounds @simd for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@ctime foo($x)
  1.231 ms (2 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = sqrt(a)

function foo(x)
    output     = similar(x)
    
    @turbo for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@ctime foo($x)
  614.203 μs (2 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = sqrt(a)

foo(x) = @turbo calculation.(x)
Output in REPL
julia>
@ctime foo($x)
  614.241 μs (2 allocations: 7.629 MiB)

Trigonometric Functions

Among others, it includes sin, cos, and tan. Below, we demonstrate it by sin.

x              = rand(1_000_000)
calculation(a) = sin(a)

function foo(x)
    output     = similar(x)
    
    for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@ctime foo($x)
  3.841 ms (2 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = sin(a)

function foo(x)
    output     = similar(x)
    
    @inbounds @simd for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@ctime foo($x)
  3.767 ms (2 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = sin(a)

function foo(x)
    output     = similar(x)
    
    @turbo for i in eachindex(x)
        output[i] = calculation(x[i])
    end

    return output
end
Output in REPL
julia>
@ctime foo($x)
  1.386 ms (2 allocations: 7.629 MiB)
x              = rand(1_000_000)
calculation(a) = sin(a)

foo(x) = @turbo calculation.(x)
Output in REPL
julia>
@ctime foo($x)
  1.384 ms (2 allocations: 7.629 MiB)