In the context of operations involving 2 (or more) arrays, “broadcasting” refers to recycling array dimensions without allocating additional memory, which is considerably faster and more memory-efficient than R’s regular mechanism of replicating array dimensions.
Before the emergence of the ‘broadcast’ package, users who wished to employ broadcasting essentially had to rely on broadcasting as implemented in a different programming language. For example, they might use the broadcasting available in the ‘Python’ module ‘Numpy’ (perhaps via the ‘reticulate’ package), or the ‘C++’ library ‘xtensor’ via the R package of the same name (or an extension thereof, like ‘rray’).
With the emergence of the ‘broadcast’ package, users can now call broadcasted implementations without using external libraries, which spares the computing power needed for translating between object structures of different languages.
The “broadcasting” implementation in the ‘broadcast’ package is conceptually (though not programmatically) inspired by the broadcasting employed by the ‘Numpy’ module for ‘Python’, which might well be the first implementation of broadcasting. More importantly, ‘Numpy’ is remarkably fast, and the ‘broadcast’ package aims to be at least somewhat comparable in speed.
This page presents comparisons of the speed of broadcasted operations between ‘broadcast’ and ‘Numpy’. The operation compared is a simple, element-wise, broadcasted addition, given by the code x + y in the ‘Numpy’ module for ‘Python’, and bc.num(x, y, "+") in the ‘broadcast’ package.
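To make the compared operation concrete, here is a minimal sketch of the ‘broadcast’ side of the comparison, using two small, fully orthogonal matrices. The array sizes here are purely illustrative; the sizes actually benchmarked are described in the Methodology section below.

    library(broadcast)

    x <- array(rnorm(5), dim = c(5, 1))   # column: dimensions (5, 1)
    y <- array(rnorm(4), dim = c(1, 4))   # row:    dimensions (1, 4)

    # Broadcasted element-wise addition; the result has dimensions (5, 4).
    out <- bc.num(x, y, "+")
    dim(out)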
Please bear in mind that these are rough comparisons of speed. Since the comparisons involve 2 separate programming languages, ‘R’ and ‘Python’, a “proper” speed comparison is rather difficult to do fairly.
Methodology
Difficulties in comparing with ‘Python’
Benchmarking a ‘Python’ code snippet in ‘Python’ using a ‘Python’ module, and benchmarking an ‘R’ code snippet in ‘R’ using an ‘R’ package, means that mechanisms from different modules/packages are used for the benchmarking, and those 2 benchmarks might not use the exact same approach to measuring the execution time.
‘Python’ and ‘R’ are both languages that use garbage collection (GC). But GC really does mess up benchmarking. The way to circumvent this issue differs between ‘Python’ and ‘R’. In ‘Python’, GC can temporarily be disabled. ‘R’ does not support this, so instead, for ‘R’, benchmarks with heavy GC calls simply had to be filtered out.
Due to the above (and other) considerations, any form of benchmarks between ‘R’ and ‘Python’ - including the ones given on this page - should be taken with a grain of salt.
The Set-Up
The operation that was benchmarked in this study is the operation x + y in ‘Numpy’ and the equivalent bc.num(x, y, "+") in ‘broadcast’.
Here x and y were both decimal numeric arrays (64-bit doubles in ‘R’ and 64-bit floats in ‘Python’), and had the same number of dimensions.
This operation was run for pairs of arrays with different numbers of dimensions, going from 2-dimensional to 7-dimensional.
So we have x + y where both arrays were 2-dimensional (i.e. matrices), and x + y where both arrays were 3-dimensional, and so on up to 7-dimensional arrays.
The pairs of arrays are fully orthogonal (“orthogonal” in the sense explained here), thus the maximum amount of broadcasting is employed.
Given, for example, 4-dimensional arrays, the dimensions of x are (n, 1, n, 1) and the dimensions of y are (1, n, 1, n).
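As an illustration, a fully orthogonal pair of 4-dimensional arrays could be constructed as follows. This is a sketch of the set-up, not the verbatim benchmark code; the value of n is chosen here only for illustration.

    library(broadcast)

    n <- 9L  # the smallest n used for the 4-dimensional case

    # x has dimensions (n, 1, n, 1); y has dimensions (1, n, 1, n).
    # The two arrays share no non-unit dimension, so they are fully orthogonal.
    x <- array(rnorm(n * n), dim = c(n, 1L, n, 1L))
    y <- array(rnorm(n * n), dim = c(1L, n, 1L, n))

    # Broadcasted addition; the result has dimensions (n, n, n, n).
    out <- bc.num(x, y, "+")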
The value of n, i.e. the size of each dimension, varied as follows:
For 2-dimensional arrays, n goes from 1250 to 9500, with step size 750.
For 3-dimensional arrays, n goes from 65 to 450, with step size 35.
For 4-dimensional arrays, n goes from 9 to 99, with step size 10.
For 5-dimensional arrays, n goes from 6 to 39, with step size 3.
For 6-dimensional arrays, n goes from 3 to 21, with step size 2.
For 7-dimensional arrays, n goes from 2 to 14, with step size 1.
These values of n were chosen as follows. The maximum n was specified such that the broadcasted element-wise addition of x and y resulted in an array with between 90 and 100 million elements. The minimum n was chosen to be (approximately) one-seventh of the maximum n. And the step size was set such that the sequence had a length between 10 and 15.
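As a quick sanity check of this criterion, the sketch below computes the number of elements of the broadcasted output for the maximum n of each dimensionality, using the maxima listed above:

    # Maximum n per number of dimensions, as listed above.
    max_n <- c("2" = 9500, "3" = 450, "4" = 99, "5" = 39, "6" = 21, "7" = 14)
    ndims <- as.integer(names(max_n))

    # The broadcasted result of an (n,1,n,1,...) + (1,n,1,n,...) pair has n^d elements.
    out_size <- max_n ^ ndims
    out_size / 1e6  # number of elements of the broadcasted output, in millions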
For each pair of arrays, the element-wise addition was computed using ‘broadcast’ and ‘Numpy’. This computation was repeated 100 times (though see some technical details about this in the next sub-section). The first 10 repetitions were treated as warm-up iterations and thus discarded, leaving 90 benchmarks. From these 90 benchmarks, the median, first quartile, and third quartile were computed. After the benchmarks for a particular number of dimensions, the garbage collector was called manually to clean up any lingering memory.
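A stripped-down sketch of this procedure for a single pair of arrays is given below. The run_one() helper and the use of Sys.time() are illustrative simplifications; the actual benchmark code (linked at the bottom of this page) used a dedicated benchmarking package, as described under ‘Resources and Code’.

    library(broadcast)

    # Small illustrative pair of orthogonal matrices (the real sizes were much larger).
    x <- array(rnorm(100), dim = c(100L, 1L))
    y <- array(rnorm(100), dim = c(1L, 100L))

    # Hypothetical helper: time one execution of the broadcasted addition, in seconds.
    run_one <- function(x, y) {
      t0 <- Sys.time()
      invisible(bc.num(x, y, "+"))
      as.numeric(difftime(Sys.time(), t0, units = "secs"))
    }

    times <- replicate(100, run_one(x, y))  # 100 repetitions ...
    times <- times[-(1:10)]                 # ... of which the first 10 are warm-up

    # Median, first quartile, and third quartile of the remaining 90 measurements.
    quantile(times, probs = c(0.25, 0.5, 0.75))

    gc()  # manually call the garbage collector before the next set of benchmarks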
There are some caveats here, in order to keep the comparisons between ‘broadcast’ and ‘Numpy’ fair, and these caveats are explained in the next sub-section.
Keeping comparisons (somewhat) fair
To keep the comparisons between ‘broadcast’ and ‘Numpy’ fair, a number of measures have been taken.
Distributions of benchmark timings tend to be heavily skewed. Therefore, the median (together with the quartiles) was taken. The median is also more stable than the mean in the face of outliers.
Garbage collection was disabled in ‘Python’. In ‘R’, only benchmarks with no garbage collection calls, or with only level 0 garbage collection calls, were used. I feel this keeps the comparisons relatively fair (but it’s not perfect).
Since only benchmarks with no or only level 0 garbage collection calls were used for ‘R’, the benchmarks were run 200 times, and a check was performed that at least 100 benchmark measurements remained before discarding the warm-up (i.e. at least 90 after discarding the warm-up). If there were fewer than 90 (= 100 - 10) benchmarks for a particular computation, the benchmarks would be thrown away and another attempt would be made at benchmarking (but this never happened). A rough sketch of this check is given at the end of this sub-section.
‘R’ has more support for missing values than ‘Numpy’, which also leads to a difference in speed. But both ‘R’ and ‘Numpy’ handle missing values in decimal numbers (64-bit floats in ‘Numpy’ and 64-bit doubles in ‘R’) equally, through the NaN construct. Therefore, only operations on decimal numbers are compared.
Operations like power (^) and division (/) need to handle special cases (like when the right-hand side of the operation is 0). The plus (+) operator, however, has no such special cases. Therefore, to keep things fair, the comparisons on this page only involve summation.
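The sketch below illustrates the filtering and validity check described above for the ‘R’ benchmarks. The times and gc_levels vectors are hypothetical placeholders for the per-iteration output of the benchmarking package, filled with simulated values purely so the snippet runs on its own.

    # Hypothetical placeholders for the benchmarking package's per-iteration output:
    # 200 timings (in seconds) and the heaviest garbage-collection level per iteration.
    times     <- runif(200, min = 0.05, max = 0.06)
    gc_levels <- sample(c("none", "level0", "level1", "level2"), 200,
                        replace = TRUE, prob = c(0.7, 0.15, 0.1, 0.05))

    # Keep only iterations without GC calls, or with only level 0 GC calls.
    keep <- gc_levels %in% c("none", "level0")
    clean_times <- times[keep]

    # At least 100 clean measurements must remain before discarding the warm-up
    # (i.e. at least 90 afterwards); otherwise the whole benchmark would be re-run.
    if (length(clean_times) < 100) {
      stop("not enough GC-free benchmark iterations; re-run required")
    }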
Resources and Code
The ‘benchmark’ package was used for measuring speed in ‘R’, as this package can also be used to check and filter for garbage collector calls.
In ‘Python’, the time.perf_counter() function was used to accurately measure the time an operation takes. To ensure no time was wasted on printing the result in ‘Python’, the operation a + b was wrapped inside a function without a return statement.
The figures in the Results section were created using the ‘tinyplot’ package, and display the median, first quartile, and third quartile of the computation times.
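The actual figures were produced with ‘tinyplot’; the base-graphics sketch below is only a generic illustration of what each figure shows, namely a median line with a ribbon spanning the first and third quartiles, using an entirely made-up data frame d in place of the real benchmark summaries.

    # d: entirely made-up stand-in for the summarised benchmark results of one
    #    implementation (times in ms); NOT actual benchmark data.
    d <- data.frame(n = seq(1250, 9500, by = 750))
    d$med <- d$n / 100     # placeholder median times
    d$q1  <- d$med * 0.9   # placeholder first quartiles
    d$q3  <- d$med * 1.1   # placeholder third quartiles

    # Median line with a ribbon spanning the first and third quartiles.
    plot(d$n, d$med, type = "n", xlab = "n", ylab = "time (ms)",
         ylim = range(d$q1, d$q3))
    polygon(c(d$n, rev(d$n)), c(d$q1, rev(d$q3)),
            col = adjustcolor("steelblue", alpha.f = 0.3), border = NA)
    lines(d$n, d$med, col = "steelblue", lwd = 2)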
The benchmarks were all run on the same computer (processor: 12th Gen Intel(R) Core(TM) i5-12500H @ 2.50 GHz) with 32GB of RAM and running the Windows 11 OS (64 bit).
The code used to run the benchmarks can be found at the bottom of this page. ‘R’ version 4.4.0 was used to run the ‘R’ code, and ‘Python’ version 3.12.0 with ‘Numpy’ version 2.2.1 was used to run the ‘Python’ code.
The ‘R’ code was sourced from ‘RStudio’ using source(), from a freshly started computer. The ‘Python’ code was run via ‘Jupyter’, also from a freshly started computer.
Figure 1: Benchmarks of the element-wise broadcasted addition of 2 decimal numeric arrays, comparing the code x + y in the ‘Numpy’ ‘Python’-module against the code bc.num(x, y, "+") in the ‘broadcast’ ‘R’-package. Both arrays are 2-dimensional. The dimensions of x are (n, 1); the dimensions of y are (1, n). Here, n is shown on the x-axis. The y-axis shows the time (in ms) it took to run the code. The solid line gives the median time, and the shaded ribbon gives the first and third quartiles. The higher a value is on the y-axis, the more time it took to run the code, and thus the slower the code.
Figure 2: Benchmarks of the element-wise broadcasted addition of 2 decimal numeric arrays, comparing the code x + y in the ‘Numpy’ ‘Python’-module against the code bc.num(x, y, "+") in the ‘broadcast’ ‘R’-package. Both arrays are 3-dimensional. The dimensions of x are (n, 1, n); the dimensions of y are (1, n, 1). Here, n is shown on the x-axis. The y-axis shows the time (in ms) it took to run the code. The solid line gives the median time, and the shaded ribbon gives the first and third quartiles. The higher a value is on the y-axis, the more time it took to run the code, and thus the slower the code.
Figure 3: Benchmarks of the element-wise broadcasted addition of 2 decimal numeric arrays, comparing the code x + y in the ‘Numpy’ ‘Python’-module against the code bc.num(x, y, "+") in the ‘broadcast’ ‘R’-package. Both arrays are 4-dimensional. The dimensions of x are (n, 1, n, 1); the dimensions of y are (1, n, 1, n). Here, n is shown on the x-axis. The y-axis shows the time (in ms) it took to run the code. The solid line gives the median time, and the shaded ribbon gives the first and third quartiles. The higher a value is on the y-axis, the more time it took to run the code, and thus the slower the code.
Figure 4: Benchmarks of the element-wise broadcasted addition of 2 decimal numeric arrays, comparing the code x + y in the ‘Numpy’ ‘Python’-module against the code bc.num(x, y, "+") in the ‘broadcast’ ‘R’-package. Both arrays are 5-dimensional. The dimensions of x are (n, 1, n, 1, n); the dimensions of y are (1, n, 1, n, 1). Here, n is shown on the x-axis. The y-axis shows the time (in ms) it took to run the code. The solid line gives the median time, and the shaded ribbon gives the first and third quartiles. The higher a value is on the y-axis, the more time it took to run the code, and thus the slower the code.
Figure 5: Benchmarks of the element-wise broadcasted addition of 2 decimal numeric arrays, comparing the code x + y in the ‘Numpy’ ‘Python’-module against the code bc.num(x, y, "+") in the ‘broadcast’ ‘R’-package. Both arrays are 6-dimensional. The dimensions of x are (n, 1, n, 1, n, 1); the dimensions of y are (1, n, 1, n, 1, n). Here, n is shown on the x-axis. The y-axis shows the time (in ms) it took to run the code. The solid line gives the median time, and the shaded ribbon gives the first and third quartiles. The higher a value is on the y-axis, the more time it took to run the code, and thus the slower the code.
Figure 6: Benchmarks of the element-wise broadcasted addition of 2 decimal numeric arrays, comparing the code x + y in the ‘Numpy’ ‘Python’-module against the code bc.num(x, y, "+") in the ‘broadcast’ ‘R’-package. Both arrays are 7-dimensional. The dimensions of x are (n, 1, n, 1, n, 1, n); the dimensions of y are (1, n, 1, n, 1, n, 1). Here, n is shown on the x-axis. The y-axis shows the time (in ms) it took to run the code. The solid line gives the median time, and the shaded ribbon gives the first and third quartiles. The higher a value is on the y-axis, the more time it took to run the code, and thus the slower the code.
Conclusion & Discussion
It appears that ‘broadcast’ is slightly faster than ‘Numpy’, though the differences in the computation times between ‘broadcast’ and ‘Numpy’ are rather small. It seems reasonable to conclude that, in general, ‘broadcast’ and ‘Numpy’ have somewhat similar speeds. It can also be observed that ‘broadcast’ has a bit more spread in its computation time than ‘Numpy’.
As stated earlier, benchmarks comparing ‘R’ and ‘Python’ should be taken with a grain of salt. I am open to suggestions on how to improve the computation time comparisons between ‘broadcast’ and ‘Numpy’ and make them fairer (so feel free to post suggestions on the GitHub Discussions page).