<function>
or <operator>
).
This is just notation, and the symbols <
and >
should not be misconstrued as Julia's syntax.
Action | Keyboard Shortcut |
---|---|
Previous Section | Ctrl + 🠘 |
Next Section | Ctrl + 🠚 |
List of Sections | Ctrl + z |
List of Subsections | Ctrl + x |
Close Any Popped Up Window (like this one) | Esc |
Open All Codes and Outputs in a Post | Alt + 🠛 |
Close All Codes and Outputs in a Post | Alt + 🠙 |
Unit | Acronym | Measure in Seconds |
---|---|---|
Seconds | s | 1 |
Milliseconds | ms | 10-3 |
Microseconds | μs | 10-6 |
Nanoseconds | ns | 10-9 |
So far, we've been using the built-in macro @simd
to apply SIMD instructions. This approach merely suggests the potential advantages of SIMD, leaving the final decision on its implementation to the compiler. Moreover, the @simd
macro only implements basic features of SIMD, and prioritizes code safety over performance.
Next, we consider the macro @turbo
from the package LoopVectorization
. This macro consistently implements SIMD optimizations when invoked, and applies more aggressive optimizations compared to @simd
. Additionally, unlike @simd
, @turbo
can be utilized in both for-loops and broadcasting operations.
Unlike @simd
, @turbo
could return wrong results if used for certain operations. This is because the package makes some additional assumptions about operations with the goal of applying optimizations more aggressively. In particular, @turbo
never checks index bounds
may provide wrong results if the outcome depends on a specific execution order of the iterations (except for reductions).
Regarding the latter, computing a vector with cumulative sums of a vector will return incorrect results. This can be observed below, where we check the resulting vector by summing all its values.
x = rand(1_000_000)
function foo(x)
output = copy(x)
for i in 2:length(x)
output[i] = output[i-1] + x[i]
end
return output
end
sum(foo(x))
2.50038e11
x = rand(1_000_000)
function foo(x)
output = copy(x)
@inbounds @simd for i in 2:length(x)
output[i] = output[i-1] + x[i]
end
return output
end
sum(foo(x))
2.50038e11
x = rand(1_000_000)
function foo(x)
output = copy(x)
@turbo for i in 2:length(x)
output[i] = output[i-1] + x[i]
end
return output
end
sum(foo(x))
1.03169e6
Considering that @turbo
isn't suitable for all operations, let's present two of its primary applications. The first one arises when iterations are completely independent of each other, making the execution order of iterations irrelevant.
For instance, the following code snippet transforms each element of a vector via a polynomial function.
x = rand(1_000_000)
calculation(a) = a * 0.1 + a^2 * 0.2 - a^3 * 0.3 - a^4 * 0.4
function foo(x)
output = similar(x)
for i in eachindex(x)
output[i] = calculation(x[i])
end
return output
end
@ctime foo($x)
3.098 ms (2 allocations: 7.629 MiB)
x = rand(1_000_000)
calculation(a) = a * 0.1 + a^2 * 0.2 - a^3 * 0.3 - a^4 * 0.4
function foo(x)
output = similar(x)
@inbounds @simd for i in eachindex(x)
output[i] = calculation(x[i])
end
return output
end
@ctime foo($x)
5.070 ms (2 allocations: 7.629 MiB)
x = rand(1_000_000)
calculation(a) = a * 0.1 + a^2 * 0.2 - a^3 * 0.3 - a^4 * 0.4
function foo(x)
output = similar(x)
@turbo for i in eachindex(x)
output[i] = calculation(x[i])
end
return output
end
@ctime foo($x)
492.031 μs (2 allocations: 7.629 MiB)
x = rand(1_000_000)
calculation(a) = a * 0.1 + a^2 * 0.2 - a^3 * 0.3 - a^4 * 0.4
foo(x) = @turbo calculation.(x)
@ctime foo($x)
406.089 μs (2 allocations: 7.629 MiB)
The second application is reductions. Although they make iterations dependent, reductions represent an exception that @turbo
can handle properly.
x = rand(1_000_000)
calculation(a) = a * 0.1 + a^2 * 0.2 - a^3 * 0.3 - a^4 * 0.4
function foo(x)
output = 0.0
for i in eachindex(x)
output += calculation(x[i])
end
return output
end
@ctime foo($x)
3.083 ms (0 allocations: 0 bytes)
x = rand(1_000_000)
calculation(a) = a * 0.1 + a^2 * 0.2 - a^3 * 0.3 - a^4 * 0.4
function foo(x)
output = 0.0
@inbounds @simd for i in eachindex(x)
output += calculation(x[i])
end
return output
end
@ctime foo($x)
3.299 ms (0 allocations: 0 bytes)
x = rand(1_000_000)
calculation(a) = a * 0.1 + a^2 * 0.2 - a^3 * 0.3 - a^4 * 0.4
function foo(x)
output = 0.0
@turbo for i in eachindex(x)
output += calculation(x[i])
end
return output
end
@ctime foo($x)
179.722 μs (0 allocations: 0 bytes)
The package LoopVectorization
leverages the library SLEEF, acronym for "SIMD Library for Evaluating Elementary Functions". This is available in Julia through the package SLEEFPirates
, and is designed to boost the performance of some functions by utilizing SIMD instructions. In particular, it speeds up the computations of the exponential, logarithmic, power, and trigonometric functions.
Below, we illustrate each. See here for a list of all the functions supported.
x = rand(1_000_000)
calculation(a) = log(a)
function foo(x)
output = similar(x)
for i in eachindex(x)
output[i] = calculation(x[i])
end
return output
end
@ctime foo($x)
3.395 ms (2 allocations: 7.629 MiB)
x = rand(1_000_000)
calculation(a) = log(a)
function foo(x)
output = similar(x)
@inbounds @simd for i in eachindex(x)
output[i] = calculation(x[i])
end
return output
end
@ctime foo($x)
3.414 ms (2 allocations: 7.629 MiB)
x = rand(1_000_000)
calculation(a) = log(a)
function foo(x)
output = similar(x)
@turbo for i in eachindex(x)
output[i] = calculation(x[i])
end
return output
end
@ctime foo($x)
1.229 ms (2 allocations: 7.629 MiB)
x = rand(1_000_000)
calculation(a) = log(a)
foo(x) = @turbo calculation.(x)
@ctime foo($x)
1.237 ms (2 allocations: 7.629 MiB)
x = rand(1_000_000)
calculation(a) = exp(a)
function foo(x)
output = similar(x)
for i in eachindex(x)
output[i] = calculation(x[i])
end
return output
end
@ctime foo($x)
1.962 ms (2 allocations: 7.629 MiB)
x = rand(1_000_000)
calculation(a) = exp(a)
function foo(x)
output = similar(x)
@inbounds @simd for i in eachindex(x)
output[i] = calculation(x[i])
end
return output
end
@ctime foo($x)
1.950 ms (2 allocations: 7.629 MiB)
x = rand(1_000_000)
calculation(a) = exp(a)
function foo(x)
output = similar(x)
@turbo for i in eachindex(x)
output[i] = calculation(x[i])
end
return output
end
@ctime foo($x)
600.413 μs (2 allocations: 7.629 MiB)
x = rand(1_000_000)
calculation(a) = exp(a)
foo(x) = @turbo calculation.(x)
@ctime foo($x)
596.388 μs (2 allocations: 7.629 MiB)
x = rand(1_000_000)
calculation(a) = a^4
function foo(x)
output = similar(x)
for i in eachindex(x)
output[i] = calculation(x[i])
end
return output
end
@ctime foo($x)
3.100 ms (2 allocations: 7.629 MiB)
x = rand(1_000_000)
calculation(a) = a^4
function foo(x)
output = similar(x)
@inbounds @simd for i in eachindex(x)
output[i] = calculation(x[i])
end
return output
end
@ctime foo($x)
3.305 ms (2 allocations: 7.629 MiB)
x = rand(1_000_000)
calculation(a) = a^4
function foo(x)
output = similar(x)
@turbo for i in eachindex(x)
output[i] = calculation(x[i])
end
return output
end
@ctime foo($x)
498.762 μs (2 allocations: 7.629 MiB)
x = rand(1_000_000)
calculation(a) = a^4
foo(x) = @turbo calculation.(x)
@ctime foo($x)
434.407 μs (2 allocations: 7.629 MiB)
The implementation of power functions includes square roots.
x = rand(1_000_000)
calculation(a) = sqrt(a)
function foo(x)
output = similar(x)
for i in eachindex(x)
output[i] = calculation(x[i])
end
return output
end
@ctime foo($x)
1.232 ms (2 allocations: 7.629 MiB)
x = rand(1_000_000)
calculation(a) = sqrt(a)
function foo(x)
output = similar(x)
@inbounds @simd for i in eachindex(x)
output[i] = calculation(x[i])
end
return output
end
@ctime foo($x)
1.231 ms (2 allocations: 7.629 MiB)
x = rand(1_000_000)
calculation(a) = sqrt(a)
function foo(x)
output = similar(x)
@turbo for i in eachindex(x)
output[i] = calculation(x[i])
end
return output
end
@ctime foo($x)
614.203 μs (2 allocations: 7.629 MiB)
x = rand(1_000_000)
calculation(a) = sqrt(a)
foo(x) = @turbo calculation.(x)
@ctime foo($x)
614.241 μs (2 allocations: 7.629 MiB)
sin
, cos
, and tan
. Below, we demonstrate it by sin
.
x = rand(1_000_000)
calculation(a) = sin(a)
function foo(x)
output = similar(x)
for i in eachindex(x)
output[i] = calculation(x[i])
end
return output
end
@ctime foo($x)
3.841 ms (2 allocations: 7.629 MiB)
x = rand(1_000_000)
calculation(a) = sin(a)
function foo(x)
output = similar(x)
@inbounds @simd for i in eachindex(x)
output[i] = calculation(x[i])
end
return output
end
@ctime foo($x)
3.767 ms (2 allocations: 7.629 MiB)
x = rand(1_000_000)
calculation(a) = sin(a)
function foo(x)
output = similar(x)
@turbo for i in eachindex(x)
output[i] = calculation(x[i])
end
return output
end
@ctime foo($x)
1.386 ms (2 allocations: 7.629 MiB)
x = rand(1_000_000)
calculation(a) = sin(a)
foo(x) = @turbo calculation.(x)
@ctime foo($x)
1.384 ms (2 allocations: 7.629 MiB)