library(squarebrackets)
#> Run `?squarebrackets::squarebrackets_help` to open the introduction help page of 'squarebrackets'.
set.seed(1L)

 

Introduction

‘squarebrackets’ provides subset methods that may be more convenient alternatives to the [ and [<- operators, whilst maintaining similar performance.

The goal of this Vignette is to present some problems in sub-setting objects programmatically in ‘R’, and how the ‘squarebrackets’ package solves these problems.

The Vignette starts off by solving sub-setting problems for (simple) vectors. It then moves on to the issue of sub-setting arrays with an arbitrary number of dimensions. The many types of data.frames are next. It ends with index-less sub-setting of long vectors.

In order to follow this Vignette, the reader needs to be familiar with the square-brackets operators ([, [<-, [[, [[<-), the dimensional structures of base ‘R’ (vectors, arrays, data.frames), and atomic and recursive types.

This vignette does not provide a complete and thorough explanation of all methods, functions, and options available in the ‘squarebrackets’ package. The vignette merely gives you, the reader, a glimpse of what this package is about. So don’t worry if you don’t understand everything immediately. A more complete explanation of all the available functionality can be found in the package documentation itself.

 

Vectors: Improved Index Specification

‘squarebrackets’ provides a set of methods that work on both atomic and recursive vectors:

  • ii_x to extract subsets
  • ii_wo to return object without the selected subset
  • ii_mod to return a copy with modified subsets
  • ii_set to modify an object by reference.

Base ‘R’ supports specifying indices for sub-set operations through logical, integer, and character vectors.
‘squarebrackets’ enhances these capabilities, and adds more possibilities.
The following sub-sections show some of these capabilities; a more exhaustive list of the possibilities can be found in the package documentation.

 

Specify Indices by Names

Base ‘R’ selects only the first matching name when selecting indices through a character vector. ‘squarebrackets’ selects all matching names.

For example:

nms <- c("a", sample(letters[1:4], 9, replace = TRUE))
x <- sample(1:10)
names(x) <- nms
print(x) # `x` has multiple elements with the name "a"
#>  a  a  d  c  a  b  a  c  c  b 
#>  2  3  1  5  7 10  6  4  9  8

x["a"] # selects only the first index with name "a"
#> a 
#> 2
ii_x(x, "a") # selects all indices with the name "a"
#> a a a a 
#> 2 3 7 6

x[c("a", "a")] # repeats only the first index with name "a"
#> a a 
#> 2 2
ii_x(x, c("a", "a")) # repeats all indices with the name "a"
#> a a a a a a a a 
#> 2 3 7 6 2 3 7 6

To select the indices c("a", "a", "b"), whilst ensuring all indices with those names get selected, one needs to do the following in base ‘R’:

x[lapply(c("a", "a", "b"), \(i)which(names(x) == i)) |> unlist()]
#>  a  a  a  a  a  a  a  a  b  b 
#>  2  3  7  6  2  3  7  6 10  8

See how much easier it is with ‘squarebrackets’:

ii_x(x, c("a", "a", "b"))
#>  a  a  a  a  a  a  a  a  b  b 
#>  2  3  7  6  2  3  7  6 10  8

It is not just shorter: ‘squarebrackets’ is also faster, as it does not rely on lapply() (or friends) to do this, but uses compiled ‘C’ code.

 

Inverting Index Specification

Inverting indices in base ‘R’ is done in different ways: negative numbers for numeric indexing, negation for logical indexing, and manually un-matching for character vectors.
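As a minimal base ‘R’ sketch of these three approaches (the vector `v` and the result names are made up for this illustration), removing the element named "a" requires a different expression for each index type:

```r
v <- c(a = 10L, b = 20L, c = 30L)

# Numeric indexing: invert with a negative position.
by_position <- v[-1]

# Logical indexing: invert with the negation operator.
by_logical <- v[!(names(v) == "a")]

# Character indexing: there is no inversion operator, so the
# names to keep have to be worked out manually.
by_name <- v[setdiff(names(v), "a")]
```

All three results hold the same elements, yet each needed its own syntax.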

‘squarebrackets’ provides a (somewhat) consistent syntax to invert indices:

  • The methods whose names end with _x (like the ii_x() shown before) perform extraction;
    to invert extraction, i.e. return the object without the specified subset, use the methods whose names end with _wo.
  • In the modification methods (those whose names end with _mod or _set) one can set the argument inv = TRUE to invert indices.

As a consequence, removing sub-sets has the same syntax as extracting sub-sets.

For example:

x <- sample(1:10)
names(x) <- letters[1:10]

x["a"] # extract element "a" in base R
#> a 
#> 9
x[!names(x) %in% "a"] # but removing has different syntax
#>  b  c  d  e  f  g  h  i  j 
#>  5 10  1  7  8  6  2  3  4

ii_x(x, "a") # extract element "a" with 'squarebrackets'
#> a 
#> 9
ii_wo(x, "a") # remove element "a" with 'squarebrackets'; same syntax
#>  b  c  d  e  f  g  h  i  j 
#>  5 10  1  7  8  6  2  3  4

 

Not Just Vectors

The given enhanced indexing is not just available for regular vectors, but for all types supported by ‘squarebrackets’.

 

Arrays: sub-setting unknown number of dimensions

Introduction

In order to perform subset operations on some array x with the square brackets operator ([, [<-), one needs to know how many dimensions it has.

For example:

# if x has 3 dimensions:
x[i, j, k, drop = FALSE]
x[i, j, k] <- value

# if x has 4 dimensions:
x[i, j, k, l, drop = FALSE]
x[i, j, k, l] <- value

Using x[i, j, k] on an array with 4 dimensions produces an error, since the number of indices or empty arguments does not conform to the number of dimensions.

But suppose that the number of dimensions of an array x is unknown, for example when iterating through many arrays which may all have a different number of dimensions. How would one then use the [ and [<- operators in such a situation? It’s not strictly impossible, but it is very convoluted.
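To give an impression of what such a convoluted solution can look like, here is one possible base ‘R’ sketch that builds the call to [ programmatically with do.call(). The helper name `first_n_base` is hypothetical and not part of any package:

```r
# Hypothetical helper: keep the first `n` indices along every dimension
# of an array, whatever its number of dimensions.
first_n_base <- function(x, n = 2L) {
  # One index vector per dimension.
  subs <- lapply(dim(x), function(len) seq_len(min(n, len)))
  # Assemble the equivalent of x[i, j, ..., drop = FALSE].
  do.call(`[`, c(list(x), subs, list(drop = FALSE)))
}

m <- array(1:25, c(5, 5))        # 2 dimensions
a <- array(1:240, c(4, 3, 4, 3)) # 4 dimensions
dim(first_n_base(m))
dim(first_n_base(a))
```

It works, but the intent (take the first two indices of each dimension) is buried under list manipulation.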

The methods provided by ‘squarebrackets’ do not use position-based arguments, and as such work on arrays with any number of dimensions without requiring prior knowledge.

 

The s, d argument pair

‘squarebrackets’ provides a set of methods that work on arrays of any number of dimensions:

  • ss_x to extract subsets
  • ss_wo to return object without the selected subset
  • ss_mod to return a copy with modified subsets
  • ss_set to modify an object by reference.

These methods use the s, d argument pair to specify indices for subset operations. This argument form requires no prior knowledge of the number of dimensions an object has.

s and d must be specified as follows:

  • The s argument must be a list, specifying the subscripts (i.e. dimensional indices).
  • The d argument must be an integer vector, specifying the dimensions for which s holds.
  • If the subscripts are the same for all dimensions specified in d, s can also be given as an atomic vector, or as a list of length 1.

To minimize keystrokes, ‘squarebrackets’ provides the n() function, which is short-hand for list(); n() nests multiple objects together, just like c() concatenates multiple objects together.

For example:

  • To specify rows 1:10, use s = 1:10, and d = 1.
  • To specify layers (the third dimension) 4:9, use s = 4:9 and d = 3.
  • To specify rows 1:10 and columns 2:5, use s = n(1:10, 2:5) and d = 1:2.
  • To specify both rows and columns 1:5, one can use s = 1:5 and d = 1:2.

The d argument has the default specification 1:ndim(x), where ndim(x) = length(dim(x)).
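The rules above can be mimicked in a few lines of base ‘R’. The sketch below is only a conceptual illustration under the assumption that dimensions not listed in d keep all their indices; the function `subset_sd` is hypothetical and is not how ‘squarebrackets’ is implemented:

```r
# Hypothetical base R illustration of the (s, d) idea.
subset_sd <- function(x, s, d = seq_along(dim(x))) {
  # An atomic `s` (or a list of length 1) applies to every dimension in `d`.
  if (!is.list(s)) s <- list(s)
  if (length(s) == 1L) s <- rep(s, length(d))
  stopifnot(length(s) == length(d))
  # Start by selecting everything along every dimension ...
  subs <- lapply(dim(x), seq_len)
  # ... then overwrite the subscripts of the dimensions listed in `d`.
  subs[d] <- s
  do.call(`[`, c(list(x), subs, list(drop = FALSE)))
}

x3 <- array(1:48, c(4, 4, 3))
r1 <- subset_sd(x3, 1:2)                       # first 2 indices of every dimension
r2 <- subset_sd(x3, list(1:3, 1:2), c(1, 3))   # rows 1:3, layers 1:2, all columns
dim(r1)
dim(r2)
```

The point of the design is that the caller names the dimensions instead of placing indices in fixed argument positions, so the same call works for any number of dimensions.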

 

Examples

Consider the following example - Given a set of atomic arrays with different dimensions, select the first 2 indices of every available dimension:


lst <- list(
  array(1:25, c(5, 5)), # matrix / 2d array
  array(1:48, c(4, 4, 3)), # 3d array
  array(1:240, c(4, 3, 4, 3)) # 4d array
)

for(i in seq_along(lst)) {
  x <- lst[[i]]
  ss_x(x, s = 1:2, d = 1:ndim(x))
  ss_x(x, 1:2) # the same (by default, d = 1:ndim(x))
}

The s and d arguments are used to perform sub-setting. Since this is not a position-based system like that of base ‘R’, it works for matrices and arrays with any number of dimensions.

 

Another example - select the first 3 indices for the first dimension, the first 2 indices for the last available dimension, and select all indices for the other dimensions.


lst <- list(
  array(1:25, c(5, 5)), # matrix / 2d array
  array(1:48, c(4, 4, 3)), # 3d array
  array(1:240, c(4, 3, 4, 3)) #4d array
)

for(i in seq_along(lst)) {
  x <- lst[[i]]
  ss_x(x, n(1:3, 1:2), c(1, ndim(x)))
  ss_x(x, s = n(1:3, 1:2), d = c(1, ndim(x))) # the same
}

So ‘squarebrackets’ allows the user to perform easy sub-set operations on arrays, even if the dimensions are not known a-priori, without ridiculously convoluted fiddling with do.call(), non-standard evaluation, or other ugly programming tricks. It just works.

 

Data.frame: different types, different rules

There are several types of data.frame-like objects available in ‘R’: data.frames, data.tables, tibbles, tidytables; and they all have their own rules regarding sub-set operations.

Consider the following example, where the values of column “a” are replaced with “XXX”, but only in the rows where column “b” is larger than 10:

tinycodet::import_as(~ dpr., "dplyr", dependencies = "tibble")

x <- data.frame(a = month.abb, b = 1:12)
y <- dpr.$tibble(a = month.abb, b = 1:12)
z <- data.table::data.table(a = month.abb, b = 1:12)

x[with(x, b > 10), "a"] <- "XXX" # data.frame with base
y <- dpr.$mutate(y, a = ifelse(b > 10, "XXX", a)) # tibble with tidyverse
z[b > 10, a:= "XXX"] # data.table with fastverse/tinyverse

Note that the syntax is different for each type of data.frame.
‘squarebrackets’ provides a set of methods that work consistently on all manner of tabular types (data.frames and matrices), with the exact same syntax:

  • sbt_x to extract subsets
  • sbt_wo to return object without the selected subset
  • sbt_mod to return a copy with modified subsets
  • sbt_set to modify an object by reference.

So let’s do the same operation as above, but now using ‘squarebrackets’:


x <- data.frame(a = month.abb, b = 1:12)
y <- tibble::tibble(a = month.abb, b = 1:12)
z <- data.table::data.table(a = month.abb, b = 1:12)

sbt_mod(x, ~ b > 10, "a", rp = "XXX")
sbt_mod(y, ~ b > 10, "a", rp = "XXX")
sbt_mod(z, ~ b > 10, "a", rp = "XXX")

Notice that the syntax is exactly the same for all classes.

The original attributes are also preserved when using sbt_mod(); i.e. nothing is forced to become a tibble, data.table, or something else. Input class = output class.

For data.tables specifically, the user can also use sbt_set() to modify the object by reference, which is considerably faster and more memory efficient:


z <- data.table::data.table(a = month.abb, b = 1:12)
sbt_set(z, ~ b > 10, "a", rp = "XXX")
print(z)
#>          a     b
#>     <char> <int>
#>  1:    Jan     1
#>  2:    Feb     2
#>  3:    Mar     3
#>  4:    Apr     4
#>  5:    May     5
#>  6:    Jun     6
#>  7:    Jul     7
#>  8:    Aug     8
#>  9:    Sep     9
#> 10:    Oct    10
#> 11:    XXX    11
#> 12:    XXX    12

This is all powered by the class-agnostic ‘C’ code from the fantastic ‘collapse’ and ‘data.table’ packages.

 

Pass by Reference or Pass By Value?

R’s [<- and [[<- sometimes make a copy of an object, and sometimes they don’t. This brings 2 issues:

  • Making unnecessary copies wastes memory (and speed);
  • On a technical level, it may be difficult to predict if a copy is made or not.

Data.tables from the ‘data.table’ package natively use pass-by-reference semantics, meaning no copy is made. Tibbles from the ‘tidyverse’ often return a (very wasteful) copy.

‘squarebrackets’ provides the user the ability to explicitly choose whether to modify an object by reference (like data.table), or to return an explicit copy. The *_mod methods return a modified copy. The *_set methods modify an object by reference. The *_set methods are only available for the mutable classes data.table and mutatomic; mutatomic is a class of mutable atomic object provided by ‘squarebrackets’ for the explicit purpose of being able to modify atomic objects by reference, and doing so safely.
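The difference between the two semantics can be illustrated with base ‘R’ alone. The following sketch does not use ‘squarebrackets’; it contrasts an ordinary atomic vector (value semantics) with an environment, which is one of the few natively mutable objects in base ‘R’. All object names are made up for the illustration:

```r
# Value semantics: modifying `v2` leaves `v1` untouched,
# so `v2` must have become a separate copy.
v1 <- c(1L, 2L, 3L)
v2 <- v1
v2[1] <- 100L
v1

# Reference semantics: `e2` and `e1` are two names for the same
# environment; the environment itself is not copied on assignment,
# so a change made through `e2` is visible through `e1`.
e1 <- new.env()
e1$v <- c(1L, 2L, 3L)
e2 <- e1
e2$v[1] <- 100L
e1$v
```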

 

Long Vectors: So much memory usage

Sub-set operations without indices

Long Vectors take up quite a bit of memory. Performing a sub-set operation in base ‘R’ on a vector requires an indexing vector, which - for a long vector - may itself also be a long vector. This is a lot of memory usage. We can do better.
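The cost of the indexing vector can be inspected in base ‘R’ with object.size(). The sketch below uses a vector of one million elements as a scaled-down stand-in for a truly long vector; the object names are made up for the illustration:

```r
big <- rnorm(1e6)     # a double vector: 8 bytes per element
keep <- big > 0       # logical index: one 4-byte element per element of `big`
idx <- which(keep)    # integer index: one 4-byte element per selected element

object.size(big)
object.size(keep)
object.size(idx)

# Either index vector has to exist in memory before `[` can use it.
pos <- big[idx]
```

In other words, the logical index alone occupies about half as much memory as the data it selects from.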

‘squarebrackets’ provides 2 sets of methods to perform sub-set operations without any indexing vector at all:

The slice_ - methods:
To perform sequence-based sub-set operations.
For example:

x <- 1:50
slice_x(x, 1, 10, 2) # equivalent to x[seq(1, 10, 2)]
#> [1] 1 3 5 7 9

The slicev_ - methods:
To perform value-based sub-set operations. For example:


x <- 1:50
slicev_x(x, v = 1L, r = FALSE) # equivalent to x[x != 1L]
#>  [1]  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
#> [26] 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

Both extraction of sub-sets and pass-by-reference modification of sub-sets are available for both sets of methods.

 

Sub-set Modifications without Copies

R’s [<- operator (sometimes) makes copies of objects; making copies of long vectors, however, is an enormous waste of memory.

To reduce memory usage, ‘squarebrackets’ provides a class of mutable atomic objects that can be modified without making copies, similar to how the ‘data.table’ package works. This new class of mutable atomic objects is called mutatomic, and can be created with ease:

x <- mutatomic(seq(1L, 1e6L, 2L))
head(x)
#> [1]  1  3  5  7  9 11
#> mutatomic 
#> typeof:  integer

We can modify this vector by reference using the various methods that end with _set.

For example:

slice_set(x, 2, 4, rp = -1L)
head(x)
#> [1]  1 -1 -1 -1  9 11
#> mutatomic 
#> typeof:  integer

You can still use regular indices, for example using ii_set():

ii_set(x, 1, rp = -1000L)
head(x)
#> [1] -1000    -1    -1    -1     9    11
#> mutatomic 
#> typeof:  integer

 

Closing Remarks

If this introductory article has piqued your interest, I kindly invite you to read the rest of the (admittedly rather extensive) documentation, and perhaps try out the package for yourself.