allCode.jl. They've been tested under Julia 1.11.8.

This section examines the performance differences between copies and views in the context of SIMD instructions. While views may initially appear advantageous because they avoid memory allocations, we'll demonstrate that copies have a memory layout better aligned with SIMD requirements.
To understand why this occurs, let's first revisit why SIMD improves performance in the first place. SIMD accelerates computation by applying the same operation simultaneously to multiple data elements. This is achieved through specialized vector registers that can hold several values at once (e.g., 4 floating-point numbers or 4 integers). With data packed into these registers, operations like additions or multiplications can complete with a single instruction instead of many.
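As a minimal sketch of what this looks like in code, consider the same reduction written with and without the `@simd` macro. The macro doesn't force vectorization; it permits the compiler to reassociate the additions so they can be packed into vector registers.

```julia
# The same reduction, with and without @simd.
function sum_plain(v)
    output = 0.0
    for a in v
        output += a
    end
    return output
end

function sum_simd(v)
    output = 0.0
    @simd for a in v   # lets the compiler reassociate additions into vector registers
        output += a
    end
    return output
end

x = rand(1_000)
sum_plain(x) ≈ sum_simd(x)   # true: same result up to floating-point rounding
```

Because `@simd` reorders floating-point additions, the two results may differ in the last bits, which is why they're compared with `≈` rather than `==`.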
For SIMD to fully leverage this vector-based processing, data must adhere to specific rules regarding memory organization. Two fundamental conditions emerge: contiguous memory layout and unit-stride access.
Contiguous Memory Layout: data elements must reside in adjacent memory addresses, one after another, with no gaps between them. Freshly allocated arrays automatically meet this requirement, enabling entire segments to be loaded directly into vector registers. In contrast, array views don't guarantee contiguity. By referencing the original data structure, views could result in highly irregular memory access patterns depending on the elements selected.
Unit Strides: strides refer to the step size between consecutive memory accesses, with unit stride meaning that elements are accessed sequentially. For example, iterating through a freshly allocated vector x using eachindex(x) ensures unit stride: each access moves to the next adjacent address in memory. This contrasts with non-unit strides, such as accessing every other element with 1:2:length(x), or unpredictable access patterns resulting from an index list like [1, 5, 3, 4].
Because SIMD performs best when both conditions hold, choosing between views and copies involves trade-offs. On one hand, using views reduces memory allocations, but at the expense of making SIMD less effective if views violate contiguity or unit stride. On the other hand, copies create contiguous arrays that are amenable to SIMD, but at the cost of increased memory usage. This section examines how these trade-offs play out in practice.
The performance trade-offs between copies and views are highly dependent on the slicing method employed. For this reason, we begin with a brief overview of three common slicing techniques: vector indexing, ranges, and Boolean indexing. We illustrate each below.
x = [10, 20, 30]
indices = [3, 2, 1]
elements = x[indices]

indices
3-element Vector{Int64}:
 3
 2
 1

elements
3-element Vector{Int64}:
 30
 20
 10

x = [2, 3, 4, 5, 6]
indices_1 = 1:length(x)   # unit stride, equivalent to 1:1:length(x)
indices_2 = 1:2:length(x) # stride of 2

collect(indices_1)
5-element Vector{Int64}:
 1
 2
 3
 4
 5

collect(indices_2)
3-element Vector{Int64}:
 1
 3
 5

x = [20, 10, 30]
indices = x .> 15
elements = x[indices]

indices
3-element BitVector:
 1
 0
 1

elements
2-element Vector{Int64}:
 20
 30

Vector Indexing selects elements by explicitly listing their indices. This approach offers the greatest flexibility in choosing arbitrary elements. However, it can also result in memory access patterns that are highly non-contiguous.
Ranges are a structured form of indexing that selects elements according to a given stride. Their general syntax is <first index>:<stride>:<last index>, where omitting <stride> defaults to a step size of 1. Ranges with unit stride maintain contiguous access patterns, while larger strides produce predictable but non-contiguous access patterns.
Finally, Boolean indexing uses a Boolean vector, with true entries indicating which elements to retain. This technique is commonly used when broadcasting conditions to filter data. Because the resulting pattern of selected elements often lacks regularity, Boolean indexing typically leads to irregular and fragmented memory access.
In a previous section, we highlighted the benefits of using views over copies when working with slices: by maintaining references to the original data, views avoid memory allocation. Yet, we also remarked that copies could outperform views in some scenarios. We're now in a position to explain in more depth why this occurs.
Creating a copy of a data structure involves writing the selected elements into a new contiguous block of memory. This layout ensures that elements sit next to each other, offering two key advantages: faster element retrieval and more effective use of SIMD instructions. Views can't guarantee this property. If the referenced elements are scattered throughout memory, views will necessarily introduce irregular access patterns.
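This difference can be inspected directly: slicing with a copy materializes a fresh Vector, while @view returns a SubArray that keeps reading the parent array through the supplied indices.

```julia
x = rand(10)
c = x[[4, 1, 7]]        # copy: elements gathered into a new contiguous Vector
v = @view x[[4, 1, 7]]  # view: a SubArray that still reads x through the index vector

c isa Vector{Float64}   # true
v isa SubArray          # true
c == v                  # true: same values, different memory layout
```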
An analogy may help clarify this distinction. Imagine retrieving books from a library. If every book you need is neatly arranged on a single shelf, collecting them is straightforward: you move once, grab each, and proceed. This mirrors contiguous memory access. Conversely, if the books are dispersed across different floors and sections, each retrieval demands additional time and effort. This is akin to non-contiguous access.
The analogy extends further by introducing a cart that lets you carry several books at once. The cart plays the role of vector registers, which enable SIMD instructions to process multiple adjacent data elements in a single operation.
The following examples illustrate the potential advantages of copies compared to views. They're all based on the computation of a reduction operation.
To isolate the benefits from copying, we construct the slices outside the benchmarked function. In this way, performance differences aren't influenced by the memory allocations associated with copying.
To begin with, let's consider access to elements that follow a regular pattern, but don't exhibit unit stride.
x = rand(5_000_000)
y = @view x[1:2:length(x)]

function foo(y)
    output = 0.0
    for a in y
        output += a
    end
    return output
end

@btime foo($y)
  1.540 ms (0 allocations: 0 bytes)
x = rand(5_000_000)
y = @view x[1:2:length(x)]

function foo(y)
    output = 0.0
    @simd for a in y
        output += a
    end
    return output
end

@btime foo($y)
  1.831 ms (0 allocations: 0 bytes)
x = rand(5_000_000)
y = x[1:2:length(x)]

function foo(y)
    output = 0.0
    for a in y
        output += a
    end
    return output
end

@btime foo($y)
  1.094 ms (0 allocations: 0 bytes)
x = rand(5_000_000)
y = x[1:2:length(x)]

function foo(y)
    output = 0.0
    @simd for a in y
        output += a
    end
    return output
end

@btime foo($y)
  644.124 μs (0 allocations: 0 bytes)
The example brings several points into focus. First, when comparing the two view-based variants, non-unit strides prevent the compiler from applying SIMD efficiently. As a result, the extra work required to rearrange elements for vectorization ends up slowing down the computation.
Second, when comparing the view and copy implementations without SIMD, the copy performs better. Since neither version benefits from SIMD, the performance gain stems entirely from contiguous memory access.
Finally, when the computation is based on a copy and applies SIMD instructions, the implementation combines the benefits of contiguous memory access and vectorization. This yields the fastest execution among all four variants.
When the access of elements doesn't follow any predictable pattern, the benefits of copying become even more pronounced. To illustrate, let's access elements of some vector x via the sortperm function, which returns the indices that would sort the vector. For instance, given x = [5, 4, 6], executing sort(x) returns [4, 5, 6], while sortperm(x) returns [2, 1, 3]. Assuming that x hasn't been sorted previously, views that access elements according to the indices of sortperm(x) will result in unpredictable memory access patterns.
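A quick check of the small example from the text:

```julia
x = [5, 4, 6]
indices = sortperm(x)  # returns [2, 1, 3]
x[indices]             # returns [4, 5, 6], the same as sort(x)
```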
x = rand(5_000_000)
indices = sortperm(x)
y = @view x[indices]

function foo(y)
    output = 0.0
    for a in y
        output += a
    end
    return output
end

@btime foo($y)
  29.691 ms (0 allocations: 0 bytes)
x = rand(5_000_000)
indices = sortperm(x)
y = @view x[indices]

function foo(y)
    output = 0.0
    @simd for a in y
        output += a
    end
    return output
end

@btime foo($y)
  31.315 ms (0 allocations: 0 bytes)
x = rand(5_000_000)
indices = sortperm(x)
y = x[indices]

function foo(y)
    output = 0.0
    for a in y
        output += a
    end
    return output
end

@btime foo($y)
  2.276 ms (0 allocations: 0 bytes)
x = rand(5_000_000)
indices = sortperm(x)
y = x[indices]

function foo(y)
    output = 0.0
    @simd for a in y
        output += a
    end
    return output
end

@btime foo($y)
  1.634 ms (0 allocations: 0 bytes)
A similar conclusion can be obtained when slices are created with Boolean indexing.
x = rand(5_000_000)
indices = x .> 0.5
y = @view x[indices]

function foo(y)
    output = 0.0
    for a in y
        output += a
    end
    return output
end

@btime foo($y)
  2.465 ms (0 allocations: 0 bytes)
x = rand(5_000_000)
indices = x .> 0.5
y = @view x[indices]

function foo(y)
    output = 0.0
    @simd for a in y
        output += a
    end
    return output
end

@btime foo($y)
  2.326 ms (0 allocations: 0 bytes)
x = rand(5_000_000)
indices = x .> 0.5
y = x[indices]

function foo(y)
    output = 0.0
    for a in y
        output += a
    end
    return output
end

@btime foo($y)
  1.095 ms (0 allocations: 0 bytes)
x = rand(5_000_000)
indices = x .> 0.5
y = x[indices]

function foo(y)
    output = 0.0
    @simd for a in y
        output += a
    end
    return output
end

@btime foo($y)
  622.719 μs (0 allocations: 0 bytes)
The previous examples defined slices outside the benchmarked functions, thereby avoiding the memory allocations associated with copying. This was done intentionally to isolate the benefits of sequential memory access and SIMD.
In practice, however, the choice between copies and views must ultimately account for the overhead of memory allocations. Once this effect is incorporated, the trade-off becomes far less obvious: whether copies or views yield better performance is highly case-specific. In the end, benchmarking remains the only reliable way to determine which approach delivers superior performance in a given context.
In what follows, we present cases that account for all these effects simultaneously. We begin with an example where views prove more performant. This scenario is characterized by significant memory allocation costs, while there are only marginal benefits from SIMD due to the simplicity of the operation being executed.
x = rand(5_000_000)
indices = sortperm(x)

function foo(x, indices)
    y = @view x[indices]
    output = 0.0
    @simd for a in y
        output += a
    end
    return output
end

@btime foo($x, $indices)
  33.449 ms (0 allocations: 0 bytes)
x = rand(5_000_000)
indices = sortperm(x)

function foo(x, indices)
    y = x[indices]
    output = 0.0
    @simd for a in y
        output += a
    end
    return output
end

@btime foo($x, $indices)
  49.233 ms (2 allocations: 38.147 MiB)
In contrast, the following operation is more computationally intensive and benefits substantially from SIMD instructions. In this case, a copy outperforms a view.
x = rand(5_000_000)
indices = sortperm(x)

function foo(x, indices)
    y = @view x[indices]
    output = 0.0
    @simd for a in y
        output += a^(3/2)
    end
    return output
end

@btime foo($x, $indices)
  407.670 ms (0 allocations: 0 bytes)
x = rand(5_000_000)
indices = sortperm(x)

function foo(x, indices)
    y = x[indices]
    output = 0.0
    @simd for a in y
        output += a^(3/2)
    end
    return output
end

@btime foo($x, $indices)
  159.115 ms (2 allocations: 38.147 MiB)
Certain patterns allow us to predict the faster approach without the need for benchmarking.

One scenario where views always outperform copies is referencing sequential elements. These scenarios call for views, as they provide the same benefits as copies without incurring memory allocations.
x = rand(1_000_000)
indices = 1:length(x)

function foo(x, indices)
    y = @view x[indices]
    output = 0.0
    for a in y
        output += a
    end
    return output
end

@btime foo($x, $indices)
  481.124 μs (0 allocations: 0 bytes)
x = rand(1_000_000)
indices = 1:length(x)

function foo(x, indices)
    y = @view x[indices]
    output = 0.0
    @simd for a in y
        output += a
    end
    return output
end

@btime foo($x, $indices)
  121.409 μs (0 allocations: 0 bytes)
x = rand(1_000_000)
indices = 1:length(x)

function foo(x, indices)
    y = x[indices]
    output = 0.0
    @simd for a in y
        output += a
    end
    return output
end

@btime foo($x, $indices)
  677.714 μs (2 allocations: 7.629 MiB)
In contrast, a common scenario where copies tend to outpace views is when performing multiple operations on the same slice. In this case, the cost of additional memory allocations is usually offset by the speed gains from contiguous memory access and SIMD instructions.
x = rand(5_000_000)
indices = sortperm(x)

function foo(x, indices)
    y = @view x[indices]
    output1, output2, output3 = (0.0 for _ in 1:3)
    for a in y
        output1 += a^(3/2)
        output2 += a / 3
        output3 += a * 2.5
    end
    return output1, output2, output3
end

@btime foo($x, $indices)
  404.078 ms (0 allocations: 0 bytes)
x = rand(5_000_000)
indices = sortperm(x)

function foo(x, indices)
    y = @view x[indices]
    output1, output2, output3 = (0.0 for _ in 1:3)
    @simd for a in y
        output1 += a^(3/2)
        output2 += a / 3
        output3 += a * 2.5
    end
    return output1, output2, output3
end

@btime foo($x, $indices)
  422.823 ms (0 allocations: 0 bytes)
x = rand(5_000_000)
indices = sortperm(x)

function foo(x, indices)
    y = x[indices]
    output1, output2, output3 = (0.0 for _ in 1:3)
    @simd for a in y
        output1 += a^(3/2)
        output2 += a / 3
        output3 += a * 2.5
    end
    return output1, output2, output3
end

@btime foo($x, $indices)
  155.528 ms (2 allocations: 38.147 MiB)