library(squarebrackets)
#> Run `?squarebrackets::squarebrackets_help` to open the introduction help page of 'squarebrackets'.
Introduction
‘squarebrackets’ provides subset methods that may be more convenient alternatives to the [ and [<- operators, whilst maintaining similar performance.
This vignette uses simple examples to show some of the nice properties of these methods. Familiarity with the square-brackets operators ([, [<-) in relation to vectors, arrays, and data.frames is essential to follow this article.
‘squarebrackets’ supports the following structures:
- basic atomic structures (atomic vectors, matrices, and arrays).
- mutatomic structures (mutable atomic vectors, matrices, and arrays).
- factor.
- basic list structures (recursive vectors, matrices, and arrays).
- data.frames (including the classes tibble, sf-data.frame and sf-tibble).
- data.table (including the classes tidytable, sf-data.table, and sf-tidytable).
Improved Index Specification
Base ‘R’ supports specifying indices for sub-set operations through logical, integer, and character vectors.
‘squarebrackets’ enhances these capabilities, and adds more possibilities.
Specify Indices by Names
Base ‘R’ only selects the first matching names when selecting indices through a character vector. ‘squarebrackets’ selects all matching names.
For example:
nms <- c("a", sample(letters[1:4], 9, replace = TRUE))
x <- sample(1:10)
names(x) <- nms
print(x) # `x` has multiple elements with the name "a"
#> a a c d c a b c c c
#> 4 6 1 5 7 9 2 10 8 3
x["a"] # selects only the first index with name "a"
#> a
#> 4
i_x(x, "a") # selects all indices with the name "a"
#> a a a
#> 4 6 9
x[c("a", "a")] # repeats only the first index with name "a"
#> a a
#> 4 4
i_x(x, c("a", "a")) # repeats all indices with the name "a"
#> a a a a a a
#> 4 6 9 4 6 9
To select the indices c("a", "a", "b"), whilst ensuring all indices with those names get selected, one needs to do the following in base ‘R’:
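The base ‘R’ approach can be sketched as follows. This is a minimal illustration that reuses the values printed for `x` above (hard-coded here, since the original was sampled at random); the variable names `wanted`, `idx`, and `res` are chosen for this sketch only.

```r
# Rebuild the named vector printed earlier (values hard-coded for reproducibility)
nms <- c("a", "a", "c", "d", "c", "a", "b", "c", "c", "c")
x <- setNames(c(4L, 6L, 1L, 5L, 7L, 9L, 2L, 10L, 8L, 3L), nms)

# For every requested name, look up ALL matching positions, then flatten
wanted <- c("a", "a", "b")
idx <- unlist(lapply(wanted, function(nm) which(names(x) == nm)))

res <- x[idx]
print(res)
```

Each requested name has to be matched against `names(x)` separately, because `x[wanted]` would return only the first match per name.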
See how much easier it is with ‘squarebrackets’!:
This syntax becomes especially advantageous for arrays. For example, let’s select all layers (i.e. the 3rd dimension) with the name “a”, twice:
x <- array(1:27, c(3,3,3))
dimnames(x) <- list(letters[1:3], letters[1:3], c("a", "a", "b"))
# in base 'R':
x[,, lapply(c("a", "a"), \(i)which(dimnames(x)[[3L]] == i)) |> unlist()]
# using 'squarebrackets' (shorter, more readable, and FASTER):
ss_x(x, c("a", "a"), 3L)
It’s not just shorter, by the way: ‘squarebrackets’ is also faster, as it does not rely on lapply() (or friends) to do this, but uses compiled ‘C’ code.
Specify Indices by Imaginary Numbers
‘squarebrackets’ introduces a new way to specify indices:
through imaginary numbers.
Positive imaginary numbers (1i, 2i, etc.) work the same as regular indices. Negative imaginary numbers (-1i, -2i, etc.) start from the end, counting backwards.
For example:
x <- sample(1:10)
print(x)
#> [1] 5 10 4 9 2 8 7 6 1 3
i_x(x, 1:3 * -1i) # select last 3 indices
#> [1] 3 1 6
i_x(x, 3:1 * -1i) # select last 3 indices in tail()-like order
#> [1] 6 1 3
This syntax becomes especially advantageous for arrays:
x <- array(1:27, c(3,3,3))
# select last 2 layers using base 'R':
x[,, seq(dim(x)[3L] - 1, dim(x)[3L])]
# select last 2 layers using 'squarebrackets':
ss_x(x, 2:1 * -1i, 3L)
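To make the counting rule concrete, here is a small base ‘R’ sketch that translates imaginary indices into ordinary positions. The helper `im_to_pos()` is hypothetical and written only for this illustration; it is not part of ‘squarebrackets’ and says nothing about how the package implements this internally.

```r
# Hypothetical helper: map imaginary indices onto regular positions.
# Positive imaginary parts count from the start, negative ones from the end.
im_to_pos <- function(i, n) {
  k <- Im(i)
  ifelse(k < 0, n + k + 1, k)
}

x <- c(5, 10, 4, 9, 2, 8, 7, 6, 1, 3)  # the values printed in the example above

last3  <- x[im_to_pos(1:3 * -1i, length(x))]  # last 3, counting backwards
tail3  <- x[im_to_pos(3:1 * -1i, length(x))]  # last 3, in tail()-like order
first2 <- x[im_to_pos(1:2 * 1i, length(x))]   # same as x[1:2]
```

Under this reading, -1i refers to position `length(x)`, -2i to `length(x) - 1`, and so on, which agrees with the output shown for i_x() above.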
Inverting Index Specification
Inverting indices in base ‘R’ is done in different ways, depending on the index type: negative numbers for numeric indexing, negation for logical indexing, and manually un-matching for character vectors.
‘squarebrackets’ provides a (somewhat) consistent syntax to invert indices:
- The methods whose names end with _x (like the i_x() shown before) perform extraction; to invert extraction, i.e. return the object without the specified subset, use the methods whose names end with _wo.
- In the modification methods (those whose names end with _mod or _set) one can set the argument inv = TRUE to invert indices.
As a consequence, removing sub-sets has the same syntax as extracting indices.
For example:
x <- sample(1:10)
names(x) <- letters[1:10]
x["a"] # extract element "a" in base R
#> a
#> 6
x[!names(x) %in% "a"] # but removing has different syntax
#> b c d e f g h i j
#> 7 8 5 10 1 2 3 4 9
i_x(x, "a") # extract element "a" with 'squarebrackets'
#> a
#> 6
i_wo(x, "a") # remove element "a" with 'squarebrackets'; same syntax
#> b c d e f g h i j
#> 7 8 5 10 1 2 3 4 9
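For reference, the three different base ‘R’ idioms for inversion mentioned at the start of this section can be put side by side. This is a plain base ‘R’ sketch using the values printed above; the variable names are chosen for illustration only.

```r
x <- setNames(c(6L, 7L, 8L, 5L, 10L, 1L, 2L, 3L, 4L, 9L), letters[1:10])

by_num  <- x[-1]                       # numeric index: negative number
by_lgl  <- x[!(x == 6L)]               # logical index: negation
by_name <- x[setdiff(names(x), "a")]   # character index: manual un-matching

# All three remove the same element, but each needs its own syntax
same <- identical(by_num, by_lgl) && identical(by_num, by_name)
```

Note that the setdiff() idiom only works this cleanly because the names in this example are unique.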
Provided Methods
In the previous section about the improved forms of indexing, we’ve already seen some of the methods provided by ‘squarebrackets’; this section gives a more formal introduction to the methods.
The main methods of ‘squarebrackets’ use the naming convention A_B: A tells you on what kind of object and what kind of indices the method operates on; B tells you what operation is performed.
For the A part, the following is available:
- i_: operates on subsets of atomic objects by (flat/linear) indices.
- i2_: operates on subsets of recursive objects by (flat/linear) indices.
- ss_: operates on subsets of atomic objects by (dimensional) subscripts.
- ss2_: operates on subsets of recursive objects by (dimensional) subscripts.
- slice_: uses index-less, sequence-based, and efficient operations on mutatomic objects.
- slicev_: uses index-less, value-based, and efficient operations on mutatomic objects.
For the B part, the following is available:
- _x: extract, exchange, or duplicate (if applicable) subsets.
- _wo: returns the original object without the selected subsets.
- _mod: modify subsets and return copy.
- _set: modify subsets using pass-by-reference semantics.
To illustrate, let’s take the methods used for extracting subsets (*_x):
When y is atomic, the following holds (roughly speaking):
- i_x(y, i) corresponds to y[i]
- ss_x(y, n(i, k), c(1, 3)) corresponds to y[i, , k]
When y is a list (i.e. recursive), the following holds (roughly speaking):
- i2_x(y, i) corresponds to y[i] or y[[i]] (depending on the arguments given in i2_x())
- ss2_x(y, n(i, k), c(1, 3)) corresponds to y[i, , k] or y[[i, , k]] (depending on the arguments given in ss2_x())
Arrays with unknown number of dimensions
Introduction
In order to perform subset operations on some array x with the square brackets operator ([, [<-), one needs to know how many dimensions it has.
For example:
# if x has 3 dimensions:
x[i, j, k, drop = FALSE]
x[i, j, k] <- value
# if x has 4 dimensions:
x[i, j, k, l, drop = FALSE]
x[i, j, k, l] <- value
Using x[i, j, k] on an array with 4 dimensions produces an error, since the number of indices or empty arguments does not conform to the number of dimensions.
But suppose that the number of dimensions of an array x is unknown, for example when iterating through many arrays which may all have different numbers of dimensions. How would one then use the [ and [<- operators in such a situation?
It’s not strictly impossible, but it is very convoluted.
The methods provided by ‘squarebrackets’ do not use position-based arguments, and as such work on any arbitrary dimensions without requiring prior knowledge.
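To show what the convoluted base ‘R’ route looks like, here is one possible sketch built on do.call(). The function `subset_any_dim()` and its arguments are hypothetical names invented for this illustration; it only mimics the idea of pairing subscripts with dimensions and is not how ‘squarebrackets’ is implemented.

```r
# Hypothetical helper: extract from an array with any number of dimensions.
# `s` is a list of subscripts (or a single vector, recycled over `d`);
# `d` gives the dimensions those subscripts apply to.
subset_any_dim <- function(x, s, d = seq_along(dim(x))) {
  if (!is.list(s)) s <- list(s)
  s <- rep(s, length.out = length(d))
  idx <- lapply(dim(x), seq_len)  # start by selecting everything in every dimension
  idx[d] <- s                     # overwrite only the requested dimensions
  do.call(`[`, c(list(x), idx, list(drop = FALSE)))
}

lst <- list(
  array(1:25, c(5, 5)),        # matrix / 2d array
  array(1:48, c(4, 4, 3)),     # 3d array
  array(1:240, c(4, 3, 4, 3))  # 4d array
)

# first 2 indices of every available dimension, whatever the dimensionality
out <- lapply(lst, function(x) subset_any_dim(x, 1:2))
```

The call to `[` has to be assembled as a list of arguments at run time, which is exactly the kind of fiddling a non-positional interface avoids.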
The s, d argument pair
The s, d argument pair is the primary manner to specify indices for subset operations in all dimensional objects supported by ‘squarebrackets’ (matrices, arrays, data.frame-like objects). This argument form requires no prior knowledge of the number of dimensions an object has.
s and d must be specified as follows:
- The s argument must be a list, specifying the subscripts (i.e. dimensional indices).
- The d argument must be an integer vector, specifying the dimensions for which s holds.
- If the subscripts are the same for all dimensions specified in d, s can also be given as an atomic vector, or as a list of length 1.
To minimize keystrokes, ‘squarebrackets’ provides the n() function, which is short-hand for list(); n() nests multiple objects together, just like c() concatenates multiple objects together.
For example:
- To specify rows 1:10, use s = 1:10 and d = 1.
- To specify layers (the third dimension) 4:9, use s = 4:9 and d = 3.
- To specify rows 1:10 and columns 2:5, use s = n(1:10, 2:5) and d = 1:2.
- To specify both rows and columns 1:5, one can use s = 1:5 and d = 1:2.
The d argument has the default specification 1:ndim(x), where ndim(x) = length(dim(x)).
Examples
Consider the following example - Given a set of atomic arrays with different dimensions, select the first 2 indices of every available dimension:
lst <- list(
array(1:25, c(5, 5)), # matrix / 2d array
array(1:48, c(4, 4, 3)), # 3d array
array(1:240, c(4, 3, 4, 3)) # 4d array
)
for(i in seq_along(lst)) {
x <- lst[[i]]
ss_x(x, s = 1:2, d = 1:ndim(x))
ss_x(x, 1:2) # the same (by default, d = 1:ndim(x))
}
The s and d arguments are used to perform sub-setting. Since this is not a position-based system like that of base ‘R’, it works for matrices and arrays with any number of dimensions.
Another example - select the first 3 indices for the first dimension, the first 2 indices for the last available dimension, and select all indices for the other dimensions.
lst <- list(
array(1:25, c(5, 5)), # matrix / 2d array
array(1:48, c(4, 4, 3)), # 3d array
array(1:240, c(4, 3, 4, 3)) # 4d array
)
for(i in seq_along(lst)) {
x <- lst[[i]]
ss_x(x, n(1:3, 1:2), c(1, ndim(x)))
ss_x(x, s = n(1:3, 1:2), d = c(1, ndim(x))) # the same
}
So ‘squarebrackets’ allows the user to perform easy sub-set operations on arrays, even if the dimensions are not known a priori, without ridiculously convoluted fiddling with do.call(), non-standard evaluation, or other ugly programming tricks. It just works.
Different data.frame types
There are several types of data.frame-like objects available in ‘R’: data.frames, data.tables, tibbles, tidytables; and they all have their own rules regarding sub-set operations.
Consider the following example, where values of the column “a” are being replaced with “XXX”, but only in the rows where column “b” is larger than 10:
tinycodet::import_as(~ dpr., "dplyr", dependencies = "tibble")
x <- data.frame(a = month.abb, b = 1:12)
y <- dpr.$tibble(a = month.abb, b = 1:12)
z <- data.table::data.table(a = month.abb, b = 1:12)
x[with(x, b > 10), "a"] <- "XXX" # data.frame with base
y <- dpr.$mutate(y, a = ifelse(b > 10, "XXX", a)) # tibble with tidyverse
z[b > 10, a:= "XXX"] # data.table with fastverse/tinyverse
Note the following:
1) The syntax is different for each type.
2) data.frames use copy-on-modify; ‘dplyr’ + ‘tibble’ almost always use explicit copies; and data.table almost always uses pass-by-reference.
3) There’s a lot of non-standard evaluation going on.
On point 1): ‘squarebrackets’ uses the exact same methods and syntax for all data.frame types.
On point 2): ‘squarebrackets’ always allows the user to explicitly return a modified copy through the *_mod methods (only the necessary parts are copied, so there are no unnecessary copies). For mutable classes, such as data.tables, ‘squarebrackets’ additionally provides the *_set methods, for pass-by-reference semantics.
On point 3): ‘squarebrackets’ will never use non-standard evaluation. All syntax in ‘squarebrackets’ is 100% programmatically friendly, and all input can be stored in a variable for later use. In this particular situation, the obs argument with formula input can be used.
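To illustrate why a formula is programmatically friendly, the following base ‘R’ sketch stores a row condition in a variable and evaluates it later against a data.frame. It is only meant to show the general idea of formula input; the names `cond` and `rows` are chosen for this sketch, and it makes no claim about how the obs argument is processed inside ‘squarebrackets’.

```r
x <- data.frame(a = month.abb, b = 1:12, stringsAsFactors = FALSE)

# A one-sided formula is an ordinary object: it can be stored, passed around,
# and evaluated later, without any non-standard evaluation at the call site
cond <- ~ b > 10

# Evaluate the right-hand side of the formula with the columns of `x` in scope
rows <- eval(cond[[2L]], envir = x, enclos = environment(cond))

x[rows, "a"] <- "XXX"
```

Because `cond` is just a value, the same condition could be reused for several data sets or built up by other code.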
So let’s do the same operation as above, but now using ‘squarebrackets’. Since data.frames and tibbles are not mutable types, for this demonstration I’ll stick to using ss2_mod():
x <- data.frame(a = month.abb, b = 1:12)
y <- tibble::tibble(a = month.abb, b = 1:12)
z <- data.table::data.table(a = month.abb, b = 1:12)
ss2_mod(x, obs = ~ b > 10, vars = "a", rp = "XXX")
ss2_mod(y, obs = ~ b > 10, vars = "a", rp = "XXX")
ss2_mod(z, obs = ~ b > 10, vars = "a", rp = "XXX")
Notice that the syntax is exactly the same for all classes.
The original attributes are also preserved when using ss2_mod(); i.e. nothing is forced to become a tibble, data.table, or something else. Input class = output class.
For data.tables specifically, the user can also use ss2_set() to modify by reference, which is considerably faster and more memory efficient:
z <- data.table::data.table(a = month.abb, b = 1:12)
ss2_set(z, obs = ~ b > 10, vars = "a", rp = "XXX")
print(z)
#> a b
#> <char> <int>
#> 1: Jan 1
#> 2: Feb 2
#> 3: Mar 3
#> 4: Apr 4
#> 5: May 5
#> 6: Jun 6
#> 7: Jul 7
#> 8: Aug 8
#> 9: Sep 9
#> 10: Oct 10
#> 11: XXX 11
#> 12: XXX 12
Mutability
As shown in the previous section, ‘squarebrackets’ supports pass-by-reference semantics (i.e. modification without any copying) for data.tables, and it is also supported for the mutatomic class (a class of mutable atomic objects).
Long Vectors
Long vectors take up quite a bit of memory. Performing a sub-set operation on a vector requires an indexing vector, which, for a long vector, may itself also be a long vector. This is a lot of memory usage. We can do better.
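The cost of the indexing vector can be made visible with a small base ‘R’ sketch. The vector length used here is far below that of a true long vector and was picked only to keep the illustration cheap; the variable names are likewise chosen for this sketch.

```r
n <- 1e6
x <- numeric(n)           # the data itself
idx <- seq(1, n, by = 2)  # index vector needed for x[idx]

size_x   <- as.numeric(object.size(x))    # bytes used by the data
size_idx <- as.numeric(object.size(idx))  # bytes used by the index alone

print(c(data = size_x, index = size_idx))
```

The index is allocated in addition to the data and to the extracted result, so its size matters as vectors grow.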
‘squarebrackets’ provides 2 sets of methods to perform sub-set operations without any indexing vector at all:
The slice_ methods perform sequence-based sub-set operations. For example:
x <- 1:50
slice_x(x, 1, 10, 2) # equivalent to x[seq(1, 10, 2)]
#> [1] 1 3 5 7 9
The slicev_ methods perform value-based sub-set operations. For example:
x <- 1:50
slicev_x(x, v = 1L, r = FALSE) # equivalent to x[x != 1L]
#> [1] 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
#> [26] 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
Both extraction of sub-sets and pass-by-reference modification of sub-sets are available for both sets of methods.