9g. Lazy Broadcasting and Loop Fusion

Introduction

This section continues our analysis of lazy and eager operations as tools for reducing memory allocations. The focus now shifts to broadcasting operations, which strike a balance between code readability and performance.

In Julia, broadcasting is eager by default, meaning that outputs are computed immediately upon execution. When working with vectors, this implies that broadcasting allocates memory. The current section presents several approaches to avoiding the unnecessary allocations this can create, or eliminating them altogether.

We'll start by highlighting the notion of loop fusion, whereby multiple broadcasting operations are combined into a single one. Guaranteeing this property prevents each broadcasting operation from allocating memory independently. After this, we'll explore the LazyArrays package, which evaluates broadcasting operations lazily.

What Does Broadcasting Do?

Before exploring memory allocations, let's briefly analyze the internal execution mechanism of broadcasting. Essentially, broadcasting is syntactic sugar for a for-loop, sparing us the verbosity of writing one explicitly. The following code snippets illustrate how broadcasting (roughly) executes internally, establishing an equivalence with for-loops.

x      = rand(1_000)

foo(x) = 2 .* x
Output in REPL
julia>
@btime foo($x)
  220.408 ns (1 allocation: 7.94 KiB)
x      = rand(1_000)

function foo(x)
    output = similar(x)

    for i in eachindex(x)
        output[i] = 2 * x[i]
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  317.588 ns (1 allocation: 7.94 KiB)

Despite their equivalence, you'll often notice performance differences between broadcasting and for-loops. These discrepancies arise because the compiler applies distinct optimizations to each case.

It's also worth remarking that broadcasting isn't inherently faster than a for-loop, despite what the examples may suggest. On the contrary, for-loops in Julia are flexible enough to outperform broadcasting, although this requires manually optimizing the code. Considering this, broadcasting in Julia is more about code readability, without sacrificing much potential performance. Beyond this, the example's key takeaway is that broadcasting ultimately executes a for-loop.
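To give a flavor of such manual optimization, the sketch below annotates the loop with the standard @inbounds and @simd macros, which remove bounds checks and allow the compiler to vectorize the loop body. The function names are our own, and exact timings vary across machines.

```julia
x = rand(1_000)

# broadcasted version, for reference
foo_broadcast(x) = 2 .* x

# hand-optimized loop: @inbounds removes bounds checks, while @simd
# lets the compiler vectorize the loop body
function foo_loop(x)
    output = similar(x)
    @inbounds @simd for i in eachindex(x)
        output[i] = 2 * x[i]
    end
    return output
end

foo_loop(x) == foo_broadcast(x)   # true: both compute the same result
```

Note that @inbounds is only safe here because the loop runs over eachindex(x) and output has the same shape as x.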

Broadcasting: Loop Fusion

In Julia, broadcasting is eager by default, which means that expressions are immediately evaluated and their results materialized. This inevitably leads to memory allocations when broadcasting vectors. In the example above, the allocation in the for-loop version corresponds to the vector output that stores the result. Importantly, memory allocations arise regardless of whether the result is explicitly stored. Such is the case when sum(2 .* x) is run: the intermediate result 2 .* x is temporarily stored, even though only the final scalar is returned.
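The hidden temporary can be confirmed with Base's @allocated macro, which reports the bytes allocated by an expression. A minimal sketch:

```julia
x = rand(1_000)

foo(x) = sum(2 .* x)   # 2 .* x is materialized before summing

foo(x)                 # warm up, so @allocated excludes compilation
@allocated foo(x)      # roughly 8 KiB: the temporary vector of 1_000 Float64s
```

Even though foo returns a single scalar, the allocation matches the size of the intermediate vector 2 .* x.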

Taking this into account, we must be mindful of how expressions with multiple broadcasting operations are handled: splitting them improperly can result in unnecessary allocations. Essentially, when chaining multiple broadcasting operations together, we need to ensure that Julia's compiler can fuse all of them into a single optimized loop. This allows Julia to perform all operations in a single pass over the data, avoiding the creation of intermediate vectors. A single loop additionally gives the compiler a comprehensive view of all the operations to be computed, enabling further optimizations.
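Under the hood, fusion works because a dotted expression first builds a lazy Broadcasted object and only then materializes it in one pass. The sketch below uses Base.Broadcast.broadcasted and Base.Broadcast.materialize to mimic what an expression like exp.(2 .* x) .+ 1 roughly lowers to; this is an illustration of the lowering, not code you'd normally write.

```julia
x = [1.0, 2.0, 3.0]

# build the lazy tree that exp.(2 .* x) .+ 1 (roughly) lowers to;
# no computation happens at this point
bc = Base.Broadcast.broadcasted(+,
         Base.Broadcast.broadcasted(exp,
             Base.Broadcast.broadcasted(*, 2, x)),
         1)

bc isa Base.Broadcast.Broadcasted   # true: still unevaluated

# materialize then runs a single fused loop over x
Base.Broadcast.materialize(bc) == exp.(2 .* x) .+ 1   # true
```

Because the whole tree is materialized at once, only the final output vector is allocated.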

One common scenario where we could inadvertently prevent loop fusion is when scalars and vectors are combined. The issue arises because certain operations produce the same result whether they're broadcasted or not. This can be observed in the following example.

x = [1, 2, 3]
β = 2

foo(x,β) = x .* β
Output in REPL
julia>
foo(x,β)
3-element Vector{Int64}:
 2
 4
 6
x = [1, 2, 3]
β = 2

foo(x,β) = x * β
Output in REPL
julia>
foo(x,β)
3-element Vector{Int64}:
 2
 4
 6

In a long expression, this feature means we can obtain the same output even if we omit the dot for certain operations. Unfortunately, doing so leads Julia to compute each broadcasting operation separately, thus creating multiple temporary vectors. To appreciate this, consider the following code snippet, which deliberately omits several dots. To make the point clearer, we also provide an equivalent implementation, laying bare what is internally executed.

x = rand(100)

foo(x) = exp.(2 * x) + (3 * x) * 5
Output in REPL
julia>
@btime foo($x)
  334.222 ns (5 allocations: 4.38 KiB)
x = rand(100)

function foo(x)
    aux1  = 2 .* x
    term1 = exp.(aux1)
    aux2  = 3 .* x
    term2 = aux2 .* 5

    return term1 + term2
end
Output in REPL
julia>
@btime foo($x)
  358.500 ns (5 allocations: 4.38 KiB)

To guarantee loop fusion, it's then essential to append a dot to every operator and function to be broadcasted. Note that this process can be error-prone when dealing with multiple operations, as the omission of a single dot creates unnecessary allocations. Fortunately, the @. macro provides a convenient solution that reduces the risk of oversight. By prefixing the expression with @., you ensure that all operators and functions are broadcasted.

x = rand(100)

foo(x) = exp.(2 .* x) .+ (3 .* x) .* 5
Output in REPL
julia>
@btime foo($x)
  236.451 ns (1 allocation: 896 bytes)
x = rand(100)

foo(x) = @. exp(2 * x) + (3 * x) * 5
Output in REPL
julia>
@btime foo($x)
  233.094 ns (1 allocation: 896 bytes)
x = rand(100)

foo(a) = exp(2 * a) + (3 * a) * 5
Output in REPL
julia>
@btime foo.($x)
  230.184 ns (1 allocation: 896 bytes)

The previous example only considered relatively simple expressions. When more operations are added and the expression becomes longer, splitting it up becomes inevitable. When this occurs, we can still achieve loop fusion by defining each operation as a separate function. This approach leverages the fact that function definitions are essentially lazy: we only specify the code to be executed, without performing any actual computation. The following example illustrates this approach in two ways: by defining a function whose final operation is broadcasted, and an alternative where the outer function itself is broadcasted. In both cases, only the final output allocates memory.

x = rand(100)

aux1(a) = exp(2 * a)
aux2(a) = (3 * a) * 5

foo(x)  = @. aux1(x) + aux2(x)
Output in REPL
julia>
@btime foo($x)
  234.118 ns (1 allocation: 896 bytes)
x = rand(100)

aux1(a) = exp(2 * a)
aux2(a) = (3 * a) * 5

foo(a)  = aux1(a) + aux2(a)
Output in REPL
julia>
@btime foo.($x)
  227.932 ns (1 allocation: 896 bytes)

Lazy Broadcasting

So far, all the cases considered have assumed that the final output is a collection. In this context, the goal has been to reduce memory allocations to a single one, corresponding to the final result. Next, we consider cases where broadcasting serves as an intermediate step in computations that ultimately return a scalar value.

These types of operations require turning broadcasting into a lazy operation. Such functionality is provided by the LazyArrays package, which requires prepending the @~ macro to the broadcasting operation. The technique fuses the broadcasting and reduction operations, allowing the former to be computed on the fly. An example is provided below.

# eager broadcasting (default)
x      = rand(100)

foo(x) = sum(2 .* x)
Output in REPL
julia>
@btime foo($x)
  45.584 ns (1 allocation: 896 bytes)
using LazyArrays
x      = rand(100)

foo(x) = sum(@~ 2 .* x)
Output in REPL
julia>
@btime foo($x)
  8.208 ns (0 allocations: 0 bytes)

Broadcasting lazily with @~ also allows for splitting operations, as shown below.

x = rand(100) ; y = rand(100)

function foo(x,y) 
    term1(a)  = 2 * a + 1
    term2(b)  = 3 * b - 1
    temp(a,b) = term1(a) * term2(b)

    sum(temp.(x,y))   
end
Output in REPL
julia>
@btime foo($x,$y)
  47.047 ns (1 allocation: 896 bytes)
x = rand(100) ; y = rand(100)

function foo(x,y) 
    term1(a)  = 2 * a + 1
    term2(b)  = 3 * b - 1
    temp(a,b) = term1(a) * term2(b)
    
    sum(@~ temp.(x,y))
end
Output in REPL
julia>
@btime foo($x,$y)
  15.030 ns (0 allocations: 0 bytes)
x = rand(100) ; y = rand(100)

function foo(x,y) 
    term1     = @~ @. 2 * x + 1
    term2     = @~ @. 3 * y - 1
    temp      = @~ @. term1 * term2
    
    sum(temp)
end
Output in REPL
julia>
@btime foo($x,$y)
  16.132 ns (0 allocations: 0 bytes)

Remark
An additional advantage of @~ is that it implements some optimizations for broadcasted operations when possible. As a result, @~ is typically faster than a lazy map, even though neither allocates memory. This performance benefit can be appreciated in the following comparison.

x = rand(100) ; y = rand(100)

function foo(x,y) 
    term1(a)  = 2 * a + 1
    term2(b)  = 3 * b - 1
    temp(a,b) = term1(a) * term2(b)
    
    sum(@~ temp.(x,y))
end
Output in REPL
julia>
@btime foo($x,$y)
  15.030 ns (0 allocations: 0 bytes)
x = rand(100) ; y = rand(100)

function foo(x,y) 
    term1(a)  = 2 * a + 1
    term2(b)  = 3 * b - 1
    temp(a,b) = term1(a) * term2(b)
    
    sum(Iterators.map(temp, x,y))
end
Output in REPL
julia>
@btime foo($x,$y)
  41.171 ns (0 allocations: 0 bytes)