Introduction to squarebrackets • squarebrackets

library(squarebrackets)
#> Run `?squarebrackets::squarebrackets_help` to open the introduction help page of 'squarebrackets'.

Introduction

‘squarebrackets’ provides subset methods that may be more convenient alternatives to the [ and [<- operators, whilst maintaining similar performance.

This vignette uses simple examples to show some of the nice properties of these methods. Familiarity with the square-brackets operators ([, [<-) in relation to vectors, arrays, and data.frames is essential to follow this article.

‘squarebrackets’ supports the following structures:

basic atomic structures
(atomic vectors, matrices, and arrays).
mutatomic structures (mutable atomic vectors, matrices, and arrays).
factor.
basic list structures
(recursive vectors, matrices, and arrays).
data.frames
(including the classes tibble, sf-data.frame and sf-tibble).
data.table
(including the classes tidytable, sf-data.table, and sf-tidytable).

Improved Index Specification

base ‘R’ supports specifying indices for sub-set operations through logical, integer, and character vectors.

‘squarebrackets’ enhances these capabilities, and adds more possibilities.

Specify Indices by Names

Base ‘R’ only selects the first matching names when selecting indices through a character vector. ‘squarebrackets’ selects all matching names.

For example:

nms <- c("a", sample(letters[1:4], 9, replace = TRUE))
x <- sample(1:10)
names(x) <- nms
print(x) # `x` has multiple elements with the name "a"
#>  a  a  c  d  c  a  b  c  c  c 
#>  4  6  1  5  7  9  2 10  8  3

x["a"] # only selects only the first index with name "a"
#> a 
#> 4
i_x(x, "a") # selects all indices with the name "a"
#> a a a 
#> 4 6 9

x[c("a", "a")] # repeats only the first index with name "a"
#> a a 
#> 4 4
i_x(x, c("a", "a")) # repeats all indices with the name "a"
#> a a a a a a 
#> 4 6 9 4 6 9

To select the indices c("a", "a", "b"), whilst ensuring all indices with those names get selected, one needs to do the following in base ‘R’:

x[lapply(c("a", "a", "b"), \(i)which(names(x) == i)) |> unlist()]
#> a a a a a a b 
#> 4 6 9 4 6 9 2

See how much easier it is with ‘squarebrackets’!:

i_x(x, c("a", "a", "b"))
#> a a a a a a b 
#> 4 6 9 4 6 9 2

This syntax becomes especially advantageous for arrays;
For example, let’s select all layers (i.e. the 3rd dimension) with the name “a”, twice:

x <- array(1:27, c(3,3,3))
dimnames(x) <- list(letters[1:3], letters[1:3], c("a", "a", "b"))

# in base 'R':
x[,, lapply(c("a", "a"), \(i)which(dimnames(x)[[3L]] == i)) |> unlist()]

# using 'squarebrackets' (shorter, more readable, and FASTER):
ss_x(x, c("a", "a"), 3L)

It’s not just shorter by the way, ‘squarebrackets’ is faster, as it does not rely on lapply() (or friends) to do this, but uses compiled ‘C’ code.

Specify Indices by Imaginary Numbers

‘squarebrackets’ introduces a new way to specify indices:
through imaginary numbers.
Positive imaginary numbers (1i, 2i, etc.) works the same as regular indices. Negative imaginary numbers (-1i, -2i, etc.) starts from the end, counting backwards.

For example:

x <- sample(1:10)
print(x)
#>  [1]  5 10  4  9  2  8  7  6  1  3


i_x(x, 1:3 * -1i) # select last 3 indices
#> [1] 3 1 6
i_x(x, 3:1 * -1i) # select last 3 indices in tail()-like order
#> [1] 6 1 3

This syntax becomes especially advantageous for arrays:

x <- array(1:27, c(3,3,3))

# select last 2 layers using base 'R':
x[,, seq(dim(x)[3L] - 1, dim(x)[3L])]

# select last 2 layers using 'squarebrackets':
ss_x(x, 2:1 * -1i, 3L)

Inverting Index Specification

Inverting indices in base ‘R’ is done in different ways. (negative numbers for numeric indexing, negation for logical indexing, manually un-matching for character vectors).

‘squarebrackets’ provides a (somewhat) consistent syntax to invert indices:

The methods whose names end with _x (like the i_x() shown before) perform extraction;
to invert extraction, i.e. return the object without the specified subset, use the methods whose names end with _wo.
In the modification methods (those whose names end with _mod or _set) one can set the argument inv = TRUE to invert indices.

As a consequence, removing sub-sets has the same syntax as extracting indices.

For example:

x <- sample(1:10)
names(x) <- letters[1:10]

x["a"] # extract element "a" in base R
#> a 
#> 6
x[!names(x) %in% "a"] # but removing has different syntax
#>  b  c  d  e  f  g  h  i  j 
#>  7  8  5 10  1  2  3  4  9

i_x(x, "a") # extract element "a" with 'squarebrackets'
#> a 
#> 6
i_wo(x, "a") # remove element "a" with 'squarebrackets'; same syntax
#>  b  c  d  e  f  g  h  i  j 
#>  7  8  5 10  1  2  3  4  9

Provided Methods

In the previous section about the improved forms of indexing, we’ve already seen some of the methods provided by ‘squarebrackets’; this section gives a more formal introduction to the methods.

The main methods of ‘squarebrackets’ use the naming convention A_B: A tells you on what kind of object and what kind of indices the method operates on; B tells you what operation is performed.

For the A part, the following is available:

i_: operates on subsets of atomic objects by (flat/linear) indices.
i2_: operates on subsets of recursive objects by (flat/linear) indices.
ss_: operates on subsets of atomic objects by (dimensional) subscripts.
ss2_: operates on subsets of recursive objects by (dimensional) subscripts.
slice_: uses index-less, sequence-based, and efficient operations on mutatomic objects.
slicev_: uses uses index-less, value-based and efficient operations on mutatomic objects.

For the B part, the following is available:

_x⁠: extract, exchange, or duplicate (if applicable) subsets.
_wo⁠: returns the original object without the selected subsets.
_mod⁠: modify subsets and return copy.
_set⁠: modify subsets using pass-by-reference semantics.

To illustrate, let’s take the methods used for extracting subsets (∗⁠_x⁠):

When y is atomic, the following holds (roughly speaking):

i_x(y, i) corresponds to y[i]
ss_x(y, n(i, k), c(1, 3)) corresponds to y[i, , k]

When y is a list (i.e. recursive), the following holds (roughly speaking):

i2_x(y, i) corresponds to y[i] or y[[i]] (depending on the arguments given in i2_x())
ss2_x(y, n(i, k), c(1, 3)) corresponds to y[i, , k] or y[[i, , k]] (depending on the arguments given in ss2_x())

Arrays with unknown number of dimensions

Introduction

In order to perform subset operations on some array x with the square brackets operator ([, [<-), one needs to know how many dimensions it has.

For example:

# if x has 3 dimensions:
x[i, j, k, drop = FALSE]
x[i, j, k] <- value

# if x has 4 dimensions:
x[i, j, k, l, drop = FALSE]
x[i, j, k, l] <- value

Using x[i, j, k] on an array with 4 dimensions produces an error, since the number of indices or empty arguments does not conform to the number of dimensions.

But suppose that the number of dimensions of an array x is unknown, for example when iterating through many arrays which all may have different number dimensions. How would one the use the [ and [<- operators in such a situation? It’s not strictly impossible, but it is very convoluted.

The methods provided by ‘squarebrackets’ do not use position-based arguments, and as such work on any arbitrary dimensions without requiring prior knowledge.

The s, d argument pair

The s, d argument pair is the primary manner to specify indices for subset operations in all dimensional objects supported by ‘squarebrackets’ (matrices, arrays, data.frame-like objects). This argument form requires no prior knowledge on the number of dimensions an object has.

s and d must be specified as follows:

The s argument must be a list, specifying the subscripts (i.e. dimensional indices).
The d argument must be an integer vector, specifying the dimensions for which s holds.
If the subscripts are the same for all dimensions specified in d, s can also be given as an atomic vector, or as a list of length 1.

To minimize keystrokes, ‘squarebrackets’ provides the n() function, which is short-hand for list(); n() nests multiple objects together, just like c() concatenates multiple objects together.

I.e. :

To specify rows 1:10, use s = 1:10, and d = 1.
To specify layers (the third dimension) 4:9, use s = 4:9 and d = 3.
To specify rows 1:10 and columns 2:5, use s = n(1:10, 2:5) and d = 1:2.
To specify both rows and columns 1:5, one can use s = 1:5 and d = 1:2.

The d argument has the default specification 1:ndim(x), where ndim(x) = length(dim(x)).

Examples

Consider the following example - Given a set of atomic arrays with different dimensions, select the first 2 indices of every available dimension:


lst <- list(
  array(1:25, c(5, 5)), # matrix / 2d array
  array(1:48, c(4, 4, 3)), # 3d array
  array(1:240, c(4, 3, 4, 3)) # 4d array
)

for(i in seq_along(lst)) {
  x <- lst[[i]]
  ss_x(x, s = 1:2, d = 1:ndim(x))
  ss_x(x, 1:2) # the same (by default, d = 1:ndim(x))
}

The s and d argument are used to perform sub-setting. Since this is not a position-based system, like base ‘R’, it works for matrices and arrays of any arbitrary dimension.

Another example - select the first 3 indices for the first dimension, the first 2 indices for the last available dimension, and select all indices for the other dimensions.


lst <- list(
  array(1:25, c(5, 5)), # matrix / 2d array
  array(1:48, c(4, 4, 3)), # 3d array
  array(1:240, c(4, 3, 4, 3)) #4d array
)

for(i in seq_along(lst)) {
  x <- lst[[i]]
  ss_x(x, n(1:3, 1:2), c(1, ndim(x)))
  ss_x(x, s = n(1:3, 1:2), d = c(1, ndim(x))) # the same
}

So ‘squarebrackets’ allows the user to perform easy sub-set operations on arrays, even if the dimensions are not known a-priori, without ridiculously convoluted fiddling with do.call(), non-standard evaluation, or other ugly programming tricks. It just works.

Different data.frame types

There are several types of data.frame-like objects available in ‘R’: data.frames, data.tables, tibbles, tidytables; and they all have their own rules regarding sub-set operations.

Consider the following example, where values of the column “a” are being replaced with “XXX”, but only in the rows for which holds that column “b” is larger than 10:

tinycodet::import_as(~ dpr., "dplyr", dependencies = "tibble")

x <- data.frame(a = month.abb, b = 1:12)
y <- dpr.$tibble(a = month.abb, b = 1:12)
z <- data.table::data.table(a = month.abb, b = 1:12)

x[with(x, b > 10), "a"] <- "XXX" # data.frame with base
y <- dpr.$mutate(y, a = ifelse(b > 10, "XXX", b)) # tibble with tidyverse
z[b > 10, a:= "XXX"] # data.table with fastverse/tinyverse

Note the following:

The syntax is different
data.frames use copy-on modify. ‘dplyr’ + ‘tibble’ almost always uses explicit copy, and data.table almost always uses pass-by-reference.
There’s a lot of non-standard evaluation going on.

On point 1): ‘squarebrackets’ uses the exact same methods and syntax for all data.frame types.

On point 2): ‘squarebrackets’ always allows the user to use explicitly return a modified copy (only necessary parts are copied, so no unnecessary copies), through the *_mod methods. For mutable classes, such as data.tables, ‘squarebrackets’ additionally provides the *_set methods, for pass-by-reference semantics.

On point 3): ‘squarebrackets’ will never use non-standard evaluation. All syntax in ‘squarebrackets’ is 100% programmatically friendly, and all input can be stored in a variable for later use. In this particular situation, the obs argument with formula input can be used.

So let’s do the same operation as above, but now using ‘squarebrackets’. Since data.frames and tibbles are not mutable types, for this demonstration I’ll stick to using ss2_mod():


x <- data.frame(a = month.abb, b = 1:12)
y <- tibble::tibble(a = month.abb, b = 1:12)
z <- data.table::data.table(a = month.abb, b = 1:12)

ss2_mod(x, obs = ~ b > 10, vars = "a", rp = "XXX")
ss2_mod(y, obs = ~ b > 10, vars = "a", rp = "XXX")
ss2_mod(z, obs = ~ b > 10, vars = "a", rp = "XXX")

Notice that the syntax is exactly the same for all classes.

The original attributes are also preserved when using ss2_mod(); i.e. nothing is forced to become a tibble, data.table, or something else. Input class = output class.

For data.tables specifically, the user can also use ss2_set(), to perform pass-by-reference semantics, which is considerably faster and more memory efficient:


z <- data.table::data.table(a = month.abb, b = 1:12)
ss2_set(z, obs = ~ b > 10, vars = "a", rp = "XXX")
print(z)
#>          a     b
#>     <char> <int>
#>  1:    Jan     1
#>  2:    Feb     2
#>  3:    Mar     3
#>  4:    Apr     4
#>  5:    May     5
#>  6:    Jun     6
#>  7:    Jul     7
#>  8:    Aug     8
#>  9:    Sep     9
#> 10:    Oct    10
#> 11:    XXX    11
#> 12:    XXX    12

Mutability

As shown in the previous section, ‘squarebrackets’ supports pass-by-reference semantics (i.e. modification without any copying) for data.tables, and it is also supported for the mutatomic class (a class of mutable atomic objects).

Long Vectors

Long Vectors take in quite a bit of memory. Performing a sub-set operation on a vector requires an indexing vector, which - for a long vector - may itself also be a long vector. This is a lot of memory usage. We can do better.

‘squarebrackets’ provides 2 sets of methods to perform sub-set operations without any indexing vector at all:

The slice_ - methods:
To perform sequence-based sub-set operations.
For example:

x <- 1:50
slice_x(x, 1, 10, 2) # equivalent to x[seq(1, 10, 2)]
#> [1] 1 3 5 7 9

The slicev_ - methods:
To perform value-based sub-set operations. For example:


x <- 1:50
slicev_x(x, v = 1L, r = FALSE) # equivalent to x[x != 1L]
#>  [1]  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
#> [26] 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

Both extracting sub-sets and pass-by-reference modification of sub-sets, is available for both methods.

Closing Remarks

If this introductory article has piqued your interest, I kindly invite you to read the rest of the (admittedly rather extensive) documentation, and perhaps try out the package for yourself.