9g. Lazy Broadcasting and Loop Fusion

Introduction

This section continues our analysis of lazy and eager operations as tools for reducing memory allocations. The focus now shifts to broadcasting operations, which strike a balance between code readability and performance.

In Julia, broadcasting is eager by default, meaning that outputs are computed immediately upon execution. When working with vectors, this implies that broadcasting allocates memory. The current section presents several approaches to avoiding the unnecessary allocations this can create, or eliminating them altogether.

We'll start by highlighting the notion of loop fusion, whereby multiple broadcasting operations are combined into a single one. Guaranteeing this property prevents each broadcasting operation from allocating memory independently. After this, we'll explore the LazyArrays package, which evaluates broadcasting operations lazily.

What Does Broadcasting Do?

Before exploring memory allocations, let's briefly analyze the internal execution mechanism of broadcasting. Essentially, broadcasting is syntactic sugar for a for-loop, sparing us the verbosity of writing one explicitly. The following code snippets illustrate how broadcasting (roughly) executes internally, establishing an equivalence with for-loops.

x      = rand(1_000)

foo(x) = 2 .* x
Output in REPL
julia>
@btime foo($x)
  220.408 ns (1 allocation: 7.94 KiB)
x      = rand(1_000)

function foo(x)
    output = similar(x)

    for i in eachindex(x)
        output[i] = 2 * x[i]
    end

    return output
end
Output in REPL
julia>
@btime foo($x)
  317.588 ns (1 allocation: 7.94 KiB)

Despite their equivalence, you'll often notice performance differences between broadcasting and for-loops. These discrepancies arise because the compiler applies distinct optimizations to each case.

It's also worth remarking that broadcasting isn't inherently faster than a for-loop, despite what the examples may suggest. On the contrary, for-loops in Julia are flexible enough to outperform broadcasting, although this requires manually optimizing the code. Considering this, broadcasting in Julia is more about code readability, without sacrificing much potential performance. Beyond this, the example's key takeaway is that broadcasting ultimately executes a for-loop.
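To give a flavor of such manual optimization, the sketch below annotates the loop with the standard @inbounds and @simd macros, which remove bounds checks and allow the compiler to vectorize the loop body. The function names are our own, and exact timings vary across machines.

```julia
x = rand(1_000)

# broadcasted version, for reference
foo_broadcast(x) = 2 .* x

# hand-optimized loop: @inbounds removes bounds checks, while @simd
# lets the compiler vectorize the loop body
function foo_loop(x)
    output = similar(x)
    @inbounds @simd for i in eachindex(x)
        output[i] = 2 * x[i]
    end
    return output
end

foo_loop(x) == foo_broadcast(x)   # true: both compute the same result
```

Note that @inbounds is only safe here because the loop runs over eachindex(x) and output has the same shape as x.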

Broadcasting: Loop Fusion

In Julia, broadcasting is eager by default, which means that expressions are immediately evaluated and their results materialized. This inevitably leads to memory allocations when broadcasting vectors. In the example above, the allocation in the for-loop version corresponds to the vector output that stores the result. Importantly, memory allocations arise regardless of whether the result is explicitly stored. Such is the case when sum(2 .* x) is run: the intermediate result 2 .* x is temporarily stored, even though only the final scalar is returned.
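The hidden temporary can be confirmed with Base's @allocated macro, which reports the bytes allocated by an expression. A minimal sketch:

```julia
x = rand(1_000)

foo(x) = sum(2 .* x)   # 2 .* x is materialized before summing

foo(x)                 # warm up, so @allocated excludes compilation
@allocated foo(x)      # roughly 8 KiB: the temporary vector of 1_000 Float64s
```

Even though foo returns a single scalar, the allocation matches the size of the intermediate vector 2 .* x.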

Taking this into account, we must be mindful of how expressions with multiple broadcasting operations are handled: splitting them improperly can result in unnecessary allocations. Essentially, when chaining multiple broadcasting operations together, we need to ensure that Julia's compiler can fuse all of them into a single optimized loop. This allows Julia to perform all operations in a single pass over the data, avoiding the creation of intermediate vectors. A single loop additionally gives the compiler a comprehensive view of all the operations to be computed, enabling further optimizations.
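Under the hood, fusion works because a dotted expression first builds a lazy Broadcasted object and only then materializes it in one pass. The sketch below uses Base.Broadcast.broadcasted and Base.Broadcast.materialize to mimic what an expression like exp.(2 .* x) .+ 1 roughly lowers to; this is an illustration of the lowering, not code you'd normally write.

```julia
x = [1.0, 2.0, 3.0]

# build the lazy tree that exp.(2 .* x) .+ 1 (roughly) lowers to;
# no computation happens at this point
bc = Base.Broadcast.broadcasted(+,
         Base.Broadcast.broadcasted(exp,
             Base.Broadcast.broadcasted(*, 2, x)),
         1)

bc isa Base.Broadcast.Broadcasted   # true: still unevaluated

# materialize then runs a single fused loop over x
Base.Broadcast.materialize(bc) == exp.(2 .* x) .+ 1   # true
```

Because the whole tree is materialized at once, only the final output vector is allocated.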

One common scenario where we could inadvertently prevent loop fusion is when scalars and vectors are combined. The issue arises because certain operations produce the same result whether they're broadcasted or not. This can be observed in the following example.

x = [1, 2, 3]
β = 2

foo(x,β) = x .* β
Output in REPL
julia>
foo(x,β)
3-element Vector{Int64}:
 2
 4
 6
x = [1, 2, 3]
β = 2

foo(x,β) = x * β
Output in REPL
julia>
foo(x,β)
3-element Vector{Int64}:
 2
 4
 6

In a long expression, this feature means we can obtain the same output even if we omit the dot for certain operations. Unfortunately, doing so leads Julia to compute each broadcasting operation separately, thus creating multiple temporary vectors. To appreciate this, consider the following code snippet, which deliberately omits several dots. To make the point clearer, we also provide an equivalent implementation, laying bare what is internally executed.

x = rand(100)

foo(x) = exp.(2 * x) + (3 * x) * 5
Output in REPL
julia>
@btime foo($x)
  334.222 ns (5 allocations: 4.38 KiB)
x = rand(100)

function foo(x)
    aux1  = 2 .* x
    term1 = exp.(aux1)
    aux2  = 3 .* x
    term2 = aux2 .* 5

    return term1 + term2
end
Output in REPL
julia>
@btime foo($x)
  358.500 ns (5 allocations: 4.38 KiB)

To guarantee loop fusion, it's then essential to append a dot to every operator and function to be broadcasted. Note that this process can be error-prone when dealing with multiple operations, as the omission of a single dot creates unnecessary allocations. Fortunately, the @. macro provides a convenient solution that reduces the risk of oversight. By prefixing the expression with @., you ensure that all operators and functions are broadcasted.

x = rand(100)

foo(x) = exp.(2 .* x) .+ (3 .* x) .* 5
Output in REPL
julia>
@btime foo($x)
  236.451 ns (1 allocation: 896 bytes)
x = rand(100)

foo(x) = @. exp(2 * x) + (3 * x) * 5
Output in REPL
julia>
@btime foo($x)
  233.094 ns (1 allocation: 896 bytes)
x = rand(100)

foo(a) = exp(2 * a) + (3 * a) * 5
Output in REPL
julia>
@btime foo.($x)
  230.184 ns (1 allocation: 896 bytes)

The previous example only considered relatively simple expressions. When more operations are added and the expression becomes longer, splitting it up becomes inevitable. When this occurs, we can still achieve loop fusion by defining each operation as a separate function. This approach leverages the fact that function definitions are essentially lazy: we only specify the code to be executed, without performing any actual computation. The following example illustrates this approach in two ways: by defining a function whose final operation is broadcasted, and an alternative where the outer function itself is broadcasted. In both cases, only the final output allocates memory.

x = rand(100)

aux1(a) = exp(2 * a)
aux2(a) = (3 * a) * 5

foo(x)  = @. aux1(x) + aux2(x)
Output in REPL
julia>
@btime foo($x)
  234.118 ns (1 allocation: 896 bytes)
x = rand(100)

aux1(a) = exp(2 * a)
aux2(a) = (3 * a) * 5

foo(a)  = aux1(a) + aux2(a)
Output in REPL
julia>
@btime foo.($x)
  227.932 ns (1 allocation: 896 bytes)

Lazy Broadcasting

So far, all the cases considered have assumed that the final output is a collection. In this context, the goal has been to reduce memory allocations to a single one, corresponding to the final result. Next, we consider cases where broadcasting serves as an intermediate step in computations that ultimately return a scalar value.

These types of operations require turning broadcasting into a lazy operation. Such functionality is provided by the LazyArrays package, which requires prepending the @~ macro to the broadcasting operation. The technique fuses the broadcasting and reduction operations, allowing the former to be computed on the fly. An example is provided below.

# eager broadcasting (default)
x      = rand(100)

foo(x) = sum(2 .* x)
Output in REPL
julia>
@btime foo($x)
  45.584 ns (1 allocation: 896 bytes)
using LazyArrays
x      = rand(100)

foo(x) = sum(@~ 2 .* x)
Output in REPL
julia>
@btime foo($x)
  8.208 ns (0 allocations: 0 bytes)

Broadcasting lazily with @~ also allows for splitting operations, as shown below.

x = rand(100) ; y = rand(100)

function foo(x,y) 
    term1(a)  = 2 * a + 1
    term2(b)  = 3 * b - 1
    temp(a,b) = term1(a) * term2(b)

    sum(temp.(x,y))   
end
Output in REPL
julia>
@btime foo($x,$y)
  47.047 ns (1 allocation: 896 bytes)
x = rand(100) ; y = rand(100)

function foo(x,y) 
    term1(a)  = 2 * a + 1
    term2(b)  = 3 * b - 1
    temp(a,b) = term1(a) * term2(b)
    
    sum(@~ temp.(x,y))
end
Output in REPL
julia>
@btime foo($x,$y)
  15.030 ns (0 allocations: 0 bytes)
x = rand(100) ; y = rand(100)

function foo(x,y) 
    term1     = @~ @. 2 * x + 1
    term2     = @~ @. 3 * y - 1
    temp      = @~ @. term1 * term2
    
    sum(temp)
end
Output in REPL
julia>
@btime foo($x,$y)
  16.132 ns (0 allocations: 0 bytes)

Remark
An additional advantage of @~ is that it implements some optimizations for broadcasted operations when possible. As a result, @~ is typically faster than a lazy map, even though neither allocates memory. This performance benefit can be appreciated in the following comparison.

x = rand(100) ; y = rand(100)

function foo(x,y) 
    term1(a)  = 2 * a + 1
    term2(b)  = 3 * b - 1
    temp(a,b) = term1(a) * term2(b)
    
    sum(@~ temp.(x,y))
end
Output in REPL
julia>
@btime foo($x,$y)
  15.030 ns (0 allocations: 0 bytes)
x = rand(100) ; y = rand(100)

function foo(x,y) 
    term1(a)  = 2 * a + 1
    term2(b)  = 3 * b - 1
    temp(a,b) = term1(a) * term2(b)
    
    sum(Iterators.map(temp, x,y))
end
Output in REPL
julia>
@btime foo($x,$y)
  41.171 ns (0 allocations: 0 bytes)