library(squarebrackets)
#> Run `?squarebrackets::squarebrackets_help` to open the introduction help page of 'squarebrackets'.
set.seed(1L)
Introduction
‘squarebrackets’ provides subset methods that may be more convenient
alternatives to the [ and [<- operators,
whilst maintaining similar performance.
The goal of this vignette is to present some problems that arise when sub-setting objects programmatically in ‘R’, and to show how the ‘squarebrackets’ package solves them.
The vignette starts off with sub-setting problems for (simple) vectors. It then moves on to sub-setting arrays with an arbitrary number of dimensions. The many types of data.frames are next. It ends with index-less sub-setting of long vectors.
In order to follow this vignette, the reader needs to be familiar
with the square-brackets operators ([, [<-,
[[, [[<-), the dimensional structures of
base ‘R’ (vectors, arrays, data.frames), and atomic and
recursive types.
This vignette does not provide a complete and thorough explanation of all methods, functions, and options available in the ‘squarebrackets’ package. The vignette merely gives you, the reader, a glimpse of what this package is about. So don’t worry if you don’t understand everything immediately. A more complete explanation of all the available functionality can be found in the package documentation itself.
Vectors: Improved Index Specification
‘squarebrackets’ provides a set of methods that work on both atomic and recursive vectors:
- ii_x to extract subsets
- ii_mod to return a copy with modified subsets
- ii_set to modify an object by reference.
Base ‘R’ supports specifying indices for sub-set operations through
logical, integer, and character vectors.
‘squarebrackets’ enhances these capabilities, and adds more
possibilities.
The following sub-sections show some of these capabilities; a
more exhaustive list of the possibilities can be found in the package
documentation.
Specify Indices by Names
Base ‘R’ selects only the first matching name when indices are specified through a character vector. ‘squarebrackets’ selects all matching names.
For example:
nms <- c("a", sample(letters[1:4], 9, replace = TRUE))
x <- sample(1:10)
names(x) <- nms
print(x) # `x` has multiple elements with the name "a"
#> a a d c a b a c c b
#> 2 3 1 5 7 10 6 4 9 8
x["a"] # selects only the first index with name "a"
#> a
#> 2
ii_x(x, "a") # selects all indices with the name "a"
#> a a a a
#> 2 3 7 6
x[c("a", "a")] # repeats only the first index with name "a"
#> a a
#> 2 2
ii_x(x, c("a", "a")) # repeats all indices with the name "a"
#> a a a a a a a a
#> 2 3 7 6 2 3 7 6
To select the indices c("a", "a", "b"), whilst ensuring
all indices with those names get selected, one needs to
do the following in base ‘R’:
x[lapply(c("a", "a", "b"), \(i)which(names(x) == i)) |> unlist()]
#> a a a a a a a a b b
#> 2 3 7 6 2 3 7 6 10 8
With ‘squarebrackets’ this is much easier: the same selection is simply ii_x(x, c("a", "a", "b")).
It is not just shorter; ‘squarebrackets’ is also
faster, as it does not rely on lapply()
(or friends) to do this, but uses compiled ‘C’ code (partly from the
‘collapse’ package).
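For readers who want to see the base ‘R’ workaround in isolation, it can be wrapped in a small helper. This is only a sketch: the name match_all_names() is hypothetical and not part of ‘squarebrackets’, and the approach is the plain lapply() idiom, not the compiled code the package uses.

```r
# Hypothetical helper (not part of 'squarebrackets'): for each requested name,
# collect the positions of ALL elements carrying that name, in request order.
match_all_names <- function(x, nms) {
  unlist(lapply(nms, function(nm) which(names(x) == nm)))
}

x <- c(a = 2L, a = 3L, d = 1L, b = 10L, a = 7L)

first_only <- x[c("a", "b")]                     # base R: first match per name
all_hits   <- x[match_all_names(x, c("a", "b"))] # every match per name

print(first_only)
print(all_hits)
```

The helper makes the difference visible: character indexing stops at the first match, whereas the explicit which() loop keeps every match.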
Inverting Index Specification
Inverting indices in base ‘R’ is done in different ways: negative numbers for numeric indexing, negation for logical indexing, and manual un-matching for character vectors.
‘squarebrackets’ provides a (somewhat) consistent syntax to invert
indices, namely through the use argument. Setting
use to a negative value will invert the indices.
As a consequence, removing sub-sets has the same syntax as extracting them.
For example:
x <- sample(1:10)
names(x) <- letters[1:10]
x["a"] # extract element "a" in base R
#> a
#> 9
x[!names(x) %in% "a"] # but removing has different syntax
#> b c d e f g h i j
#> 5 10 1 7 8 6 2 3 4
ii_x(x, "a") # extract element "a" with 'squarebrackets'
#> a
#> 9
ii_x(x, "a", -1) # extract all elements except "a", with 'squarebrackets'
#> b c d e f g h i j
#> 5 10 1 7 8 6 2 3 4
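To make the inconsistency in base ‘R’ concrete, here is a small sketch using only base ‘R’ (the vector and its names are made up for illustration). Each index type needs its own way of saying "everything except":

```r
x <- c(a = 9L, b = 5L, c = 10L, d = 1L)

by_number  <- x[-1]                   # negative integer drops position 1
by_logical <- x[!(names(x) == "a")]   # negated logical mask
by_name    <- x[!names(x) %in% "a"]   # names must be un-matched manually

# all three remove the same element, but each needs its own syntax
print(identical(by_number, by_logical) && identical(by_logical, by_name))
print(by_name)
```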
Arrays: sub-setting unknown number of dimensions
Introduction
In order to perform subset operations on some array x
with the square brackets operator ([, [<-),
one needs to know how many dimensions it has.
For example:
# if x has 3 dimensions:
x[i, j, k, drop = FALSE]
x[i, j, k] <- value
# if x has 4 dimensions:
x[i, j, k, l, drop = FALSE]
x[i, j, k, l] <- value
Using x[i, j, k] on an array with 4 dimensions produces
an error, since the number of indices or empty arguments does not
conform to the number of dimensions.
But suppose that the number of dimensions of an array x
is unknown, for example when iterating through many arrays which may all
have a different number of dimensions. How would one then use the
[ and [<- operators in such a situation?
It’s not strictly impossible, but it is very convoluted.
The methods provided by ‘squarebrackets’ do not use position-based arguments, and as such work on arrays with any number of dimensions, without requiring prior knowledge of that number.
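For comparison, this is roughly what the convoluted base ‘R’ route looks like. It is a sketch only: first_two_everywhere() is a hypothetical helper, and the trick is to build the argument list for [ dynamically and pass it through do.call().

```r
# Hypothetical helper: take the first two indices of every dimension,
# whatever the number of dimensions turns out to be.
first_two_everywhere <- function(x) {
  idx <- lapply(dim(x), function(n) seq_len(min(2L, n)))
  # builds the call x[idx[[1]], idx[[2]], ..., drop = FALSE] dynamically
  do.call("[", c(list(x), idx, list(drop = FALSE)))
}

lst <- list(
  array(1:25, c(5, 5)),        # matrix / 2d array
  array(1:48, c(4, 4, 3)),     # 3d array
  array(1:144, c(4, 3, 4, 3))  # 4d array
)

res <- lapply(lst, first_two_everywhere)
print(lapply(res, dim))
```

It works, but the intent (which indices, on which dimensions) is buried in list manipulation.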
The s, use argument pair
‘squarebrackets’ provides a set of methods that work on arrays of any number of dimensions:
- ss_x to extract subsets
- ss_mod to return a copy with modified subsets
- ss_set to modify an object by reference.
These methods use the s, use argument pair to specify
indices for subset operations. This argument form requires no prior
knowledge of the number of dimensions an object has.
s and use must be specified as follows:
- The s argument must be a list, specifying the subscripts (i.e. dimensional indices).
- The use argument must be an integer vector, specifying the dimensions for which s holds. Negative integers will invert indices (i.e. select all indices for that dimension EXCEPT the specified ones).
- If the subscripts are the same for all dimensions specified in use, s can also be given as an atomic vector, or as a list of length 1.
To minimize keystrokes, ‘squarebrackets’ provides the
n() function, which is short-hand for list();
n() nests multiple objects together, just
like c() concatenates multiple objects together.
For example:
- To specify rows 1:10, specify s = 1:10 and use = 1.
- To specify layers (the third dimension) 4:9, specify s = 4:9 and use = 3.
- To specify rows 1:10 and remove columns 2:5, specify s = n(1:10, 2:5) and use = c(1, -2).
- To specify both rows and columns 1:5, one can specify s = 1:5 and use = 1:2.
The use argument has the default specification
1:ndim(x), where ndim(x) = length(dim(x)).
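The rules above can be mimicked in a few lines of base ‘R’, which may help to build intuition. This is a sketch under stated assumptions, not the package's implementation: ss_sketch() is a hypothetical function, it only handles positive integer subscripts, and it translates the s, use pair into an ordinary [ call.

```r
# Hypothetical base-R emulation of the `s`, `use` idea; NOT the package's code.
# Assumes the subscripts in `s` are positive integers.
ss_sketch <- function(x, s, use = seq_along(dim(x))) {
  if (!is.list(s)) s <- list(s)          # atomic `s`: same subscripts everywhere
  s <- rep_len(s, length(use))           # recycle a length-1 list over `use`
  idx <- rep(list(TRUE), length(dim(x))) # TRUE keeps a whole dimension
  for (k in seq_along(use)) {
    d <- abs(use[k])
    keep <- s[[k]]
    if (use[k] < 0) keep <- setdiff(seq_len(dim(x)[d]), keep)  # invert
    idx[[d]] <- keep
  }
  do.call("[", c(list(x), idx, list(drop = FALSE)))
}

x <- array(1:60, c(5, 4, 3))
a <- ss_sketch(x, list(1:3, 2:3), c(1, -2))  # rows 1:3, all columns except 2:3
b <- ss_sketch(x, 1:2)                       # first two of every dimension
print(dim(a))
print(dim(b))
```

Dimensions that are not named in use are left untouched, which is what makes the argument pair independent of the number of dimensions.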
Examples
Consider the following example - Given a set of atomic arrays with different dimensions, select the first 2 indices of every available dimension:
lst <- list(
array(1:25, c(5, 5)), # matrix / 2d array
array(1:48, c(4, 4, 3)), # 3d array
array(1:144, c(4, 3, 4, 3)) # 4d array
)
for(i in seq_along(lst)) {
x <- lst[[i]]
ss_x(x, s = 1:2, use = 1:ndim(x))
ss_x(x, 1:2) # the same (by default, use = 1:ndim(x))
}
The s and use arguments are used to perform
sub-setting. Since this system, unlike base ‘R’, is not position-based,
it works for matrices and arrays with any number of dimensions.
Another example - select the first 3 indices for the first dimension, the first 2 indices for the last available dimension, and select all indices for the other dimensions.
lst <- list(
array(1:25, c(5, 5)), # matrix / 2d array
array(1:48, c(4, 4, 3)), # 3d array
array(1:144, c(4, 3, 4, 3)) # 4d array
)
for(i in seq_along(lst)) {
x <- lst[[i]]
ss_x(x, n(1:3, 1:2), c(1, ndim(x)))
ss_x(x, s = n(1:3, 1:2), use = c(1, ndim(x))) # the same
}
So ‘squarebrackets’ allows the user to perform easy sub-set
operations on arrays, even if the dimensions are not known a-priori,
without ridiculously convoluted fiddling with do.call(),
non-standard evaluation, or other ugly programming tricks. It just
works.
Data.frame: different types, different rules
There are several types of data.frame-like objects available in ‘R’: data.frames, data.tables, tibbles, tidytables; and they all have their own rules regarding sub-set operations.
Consider the following example, where values of the column “a” are being replaced with “XXX”, but only in the rows for which holds that column “b” is larger than 10:
tinycodet::import_as(~ dpr., "dplyr", dependencies = "tibble")
x <- data.frame(a = month.abb, b = 1:12)
y <- dpr.$tibble(a = month.abb, b = 1:12)
z <- data.table::data.table(a = month.abb, b = 1:12)
x[with(x, b > 10), "a"] <- "XXX" # data.frame with base
y <- dpr.$mutate(y, a = ifelse(b > 10, "XXX", a)) # tibble with tidyverse
z[b > 10, a := "XXX"] # data.table with fastverse/tinyverse
Note that the syntax is different for each type of data.frame.
‘squarebrackets’ provides a set of methods that work consistently on all
manner of tabular types (data.frames and matrices), with the exact same
syntax:
- tt_x to extract subsets
- tt_mod to return a copy with modified subsets
- tt_set to modify an object by reference.
So let’s do the same operation as above, but now using ‘squarebrackets’:
x <- data.frame(a = month.abb, b = 1:12)
y <- tibble::tibble(a = month.abb, b = 1:12)
z <- data.table::data.table(a = month.abb, b = 1:12)
tt_mod(x, with(x, b > 10), "a", rp = "XXX")
tt_mod(y, with(y, b > 10), "a", rp = "XXX")
tt_mod(z, with(z, b > 10), "a", rp = "XXX")
Notice that the syntax is exactly the same for all classes.
The original attributes are also preserved when using
tt_mod(); i.e. nothing is forced to become a tibble,
data.table, or something else. Input class = output class.
For data.tables specifically, the user can also use
tt_set() to modify the object by reference, which is
considerably faster and more memory efficient:
z <- data.table::data.table(a = month.abb, b = 1:12)
tt_set(z, with(z, b > 10), "a", rp = "XXX")
print(z)
#> a b
#> <char> <int>
#> 1: Jan 1
#> 2: Feb 2
#> 3: Mar 3
#> 4: Apr 4
#> 5: May 5
#> 6: Jun 6
#> 7: Jul 7
#> 8: Aug 8
#> 9: Sep 9
#> 10: Oct 10
#> 11: XXX 11
#> 12: XXX 12
This is all powered by the class-agnostic ‘C’ code from the fantastic ‘collapse’ and ‘data.table’ packages.
Complex Indexing with Keywords
Consider the following array:
x <- array(1:(prod(5:3)), 5:3, list(letters[1:5], LETTERS[1:4], month.abb[1:3]))
print(x)
#> , , Jan
#>
#> A B C D
#> a 1 6 11 16
#> b 2 7 12 17
#> c 3 8 13 18
#> d 4 9 14 19
#> e 5 10 15 20
#>
#> , , Feb
#>
#> A B C D
#> a 21 26 31 36
#> b 22 27 32 37
#> c 23 28 33 38
#> d 24 29 34 39
#> e 25 30 35 40
#>
#> , , Mar
#>
#> A B C D
#> a 41 46 51 56
#> b 42 47 52 57
#> c 43 48 53 58
#> d 44 49 54 59
#> e 45 50 55 60
Extracting the first 2 elements of each dimension of this array is relatively easy in base ‘R’:
x[1:2, 1:2, 1:2]
#> , , Jan
#>
#> A B
#> a 1 6
#> b 2 7
#>
#> , , Feb
#>
#> A B
#> a 21 26
#> b 22 27
But suppose you wish to extract the last 2 elements of each dimension. In base ‘R’, you would have to do something like this:
x[c(dim(x)[1] - 1, dim(x)[1]), c(dim(x)[2] - 1, dim(x)[2]), c(dim(x)[3] - 1, dim(x)[3])]
#> , , Feb
#>
#> C D
#> d 34 39
#> e 35 40
#>
#> , , Mar
#>
#> C D
#> d 54 59
#> e 55 60
Horrible. Can’t we do any better?
‘squarebrackets’ allows indexing by keywords via a formula, which allows one to do more complex sub-setting operations. We can do the above operation using keywords in several ways:
ss_x(x, ~ (.N-1):.N)
#> , , Feb
#>
#> C D
#> d 34 39
#> e 35 40
#>
#> , , Mar
#>
#> C D
#> d 54 59
#> e 55 60
ss_x(x, ~ .bi(-2:-1))
#> , , Feb
#>
#> C D
#> d 34 39
#> e 35 40
#>
#> , , Mar
#>
#> C D
#> d 54 59
#> e 55 60
Isn’t that better?
‘squarebrackets’ allows users to specify indices by using keywords in a formula, like just shown; the following keywords are available:
- .M: the given margin/dimension; 0 if not relevant.
- .Nms: the (dim)names at the given margin.
- .N: the size of a given dimension (if .M is not 0) or else the length of x.
- .I: equal to 1:.N.
- .bi(...): a function to specify bilateral indices.
- .x: the input variable x itself.
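As a conceptual aid, the role of .N can be imitated in base ‘R’ by evaluating one expression per margin. This sketch is only an analogy and makes no claim about how ‘squarebrackets’ evaluates its formulas; the variable N below is an ordinary function argument standing in for the keyword.

```r
x <- array(1:60, 5:3, list(letters[1:5], LETTERS[1:4], month.abb[1:3]))

# for each margin, N plays the role of `.N`: the size of that dimension
idx <- lapply(dim(x), function(N) (N - 1):N)
last_two <- do.call("[", c(list(x), idx, list(drop = FALSE)))

print(dimnames(last_two))
```

The per-margin evaluation is what lets a single expression adapt to dimensions of different sizes.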
Let’s use keywords to select all sub-sets whose dimnames contain an “a”, “A”, “e” or “E”, and compare it to how to do it in base ‘R’:
library(stringi)
p <- "a|A|e|E"
# in base R:
x[
stri_detect(dimnames(x)[[1]], regex = p),
stri_detect(dimnames(x)[[2]], regex = p),
stri_detect(dimnames(x)[[3]], regex = p),
drop = FALSE
]
#> , , Jan
#>
#> A
#> a 1
#> e 5
#>
#> , , Feb
#>
#> A
#> a 21
#> e 25
#>
#> , , Mar
#>
#> A
#> a 41
#> e 45
# using 'squarebrackets':
ss_x(x, ~ stri_detect(.Nms, regex = p))
#> , , Jan
#>
#> A
#> a 1
#> e 5
#>
#> , , Feb
#>
#> A
#> a 21
#> e 25
#>
#> , , Mar
#>
#> A
#> a 41
#> e 45
Keywords are available for vectors, arrays, and also data.frame-like objects.
Pass by Reference or Pass By Value?
R’s [<- and [[<- sometimes make a
copy of an object, and sometimes they do not. This raises 2
issues:
- Making unnecessary copies wastes memory (and time);
- On a technical level, it may be difficult to predict whether a copy is made.
Data.tables from the ‘data.table’ package natively use pass-by-reference semantics, meaning no copy is made. Tibbles from the ‘tidyverse’ often return a (very wasteful) copy.
‘squarebrackets’ provides the user the ability to explicitly
choose whether to modify an object by reference (like
data.table), or to return an explicit copy. The
*_mod methods return a modified copy. The
*_set methods modify an object by reference. The
*_set methods are only available for the mutable classes
data.table and mutatomic;
mutatomic is a class of mutable atomic object provided by
‘squarebrackets’ for the explicit purpose of being able to modify atomic
objects by reference, and doing so safely.
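The difference between the two semantics can be illustrated with base ‘R’ alone. The sketch below uses an environment as a stand-in for a mutable object; this is purely an analogy chosen for illustration and says nothing about how mutatomic is implemented. The function names are hypothetical.

```r
# Copy semantics: modifying inside a function leaves the caller's object alone.
bump_copy <- function(v) {
  v[1] <- 100L
  v
}
x <- 1:5
y <- bump_copy(x)

# Reference semantics, illustrated with an environment (mutable in base R).
bump_ref <- function(env) {
  env$v[1] <- 100L
  invisible(NULL)
}
e <- new.env()
e$v <- 1:5
bump_ref(e)

print(x[1])    # the original is unchanged
print(y[1])    # the returned copy is modified
print(e$v[1])  # modified in place through the environment
```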
Long Vectors: So much memory usage
Sub-set operations without indices
Long vectors take up quite a bit of memory. Performing a sub-set operation in base ‘R’ on a vector requires an indexing vector, which - for a long vector - may itself also be a long vector. This is a lot of memory usage. We can do better.
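To get a feeling for that overhead, the following base ‘R’ sketch measures two common kinds of indexing vectors for a vector of one million elements (the size is an arbitrary choice for illustration):

```r
n <- 1000000L
x <- seq_len(n)

# selecting every second element needs an index vector of n / 2 elements
idx <- seq.int(1L, n, by = 2L)

# a logical mask is larger still: one element per element of x
mask <- rep_len(c(TRUE, FALSE), n)

print(length(idx))
print(object.size(idx))
print(object.size(mask))
```

Both objects exist only to describe which elements to take; their memory is pure overhead on top of x and the result.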
‘squarebrackets’ provides the long_x() and
long_set() methods to perform sub-set operations on the
interior of a vector, without any indexing vector at all. Instead of an
indexing vector, they use a stride object. There are 3
types of stride objects that can be used:
- stride_pv(): Use this stride type to specify subsets based on property values, like p == v, where p is an atomic vector of properties (for example names(x)), and v is a value (or range of values) p might contain.
- stride_seq(): Use this stride type to specify a sequence in the form of seq(from, to, by), without actually allocating a sequence indexing vector.
- stride_ptrn(): Use this stride type to specify a patterned sequence in the form of (start:end)[pattern], where start and end are natural scalars and pattern is a logical vector. stride_ptrn() specifies this sequence without actually allocating an indexing vector.
An example using stride_pv():
nms <- c(letters, LETTERS, month.abb, month.name) |> rep_len(1e6)
x <- mutatomic(1:1e6, names = nms)
head(x)
#> a b c d e f
#> 1 2 3 4 5 6
#> mutatomic
#> typeof: integer
# extract all elements of x with the name "a":
stride <- stride_pv(names(x), v = "a")
long_x(x, stride) |> head()
#> a a a a a a
#> 1 77 153 229 305 381
#> mutatomic
#> typeof: integer
An example using stride_seq():
x <- 1:50
long_x(x, stride_seq(1, 10, 2)) # equivalent to x[seq(1, 10, 2)]
#> [1] 1 3 5 7 9
# the above can also be specified as a formula:
long_x(x, ~ 1:10:2)
#> [1] 1 3 5 7 9
An example using stride_ptrn():
x <- 1:50
ptrn <- c(TRUE, FALSE, FALSE, TRUE)
long_x(x, stride_ptrn(1, 20, ptrn)) # equivalent to x[(1:20)[ptrn]]
#> [1] 1 4 5 8 9 12 13 16 17 20
# the above can also be specified as a formula:
long_x(x, ~ 1:20:ptrn)
#> [1] 1 4 5 8 9 12 13 16 17 20
Extraction (long_x()) and pass-by-reference modification (long_set()) of sub-sets are both available with these stride objects.
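For reference, these are the base ‘R’ expressions that the comments above describe as equivalent. The sketch uses only base ‘R’; unlike the stride objects, both expressions allocate the full indexing vector before sub-setting.

```r
x <- 1:50

# seq(from, to, by) materialises the index vector before sub-setting
by_seq <- x[seq(1, 10, 2)]

# a logical pattern is recycled along start:end
ptrn <- c(TRUE, FALSE, FALSE, TRUE)
by_ptrn <- x[(1:20)[ptrn]]

print(by_seq)
print(by_ptrn)
```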
Sub-set Modifications without Copies
R’s [<- operator (sometimes) makes copies of objects;
making copies of long vectors, however, is an enormous waste of
memory.
To reduce memory usage, ‘squarebrackets’ provides a class of mutable
atomic objects that can be modified without making
copies, similar to how the ‘data.table’ package works. This new class of
mutable atomic objects is called mutatomic, and can be
created with ease, for example with mutatomic(1:1e6, names = nms) as in the stride_pv() example above.
We can modify this vector by reference using the various methods that
end with _set.
For example like so:
You can still use regular indices, for example using
ii_set():