library(squarebrackets)
#> Run `?squarebrackets::squarebrackets_help` to open the introduction help page of 'squarebrackets'.
set.seed(1L)
Introduction
‘squarebrackets’ provides subset methods that may be more convenient
alternatives to the [ and [<- operators,
whilst maintaining similar performance.
The goal of this Vignette is to present some problems in sub-setting objects programmatically in ‘R’, and to show how the ‘squarebrackets’ package solves them.
The Vignette starts off by solving sub-setting problems for (simple) vectors. It then moves on to sub-setting arrays with an arbitrary number of dimensions. The many types of data.frames come next. It ends with index-less sub-setting of long vectors.
In order to follow this Vignette, the reader needs to be familiar
with the square-brackets operators ([, [<-,
[[, [[<-), the dimensional structures of
base ‘R’ (vectors, arrays, data.frames), and atomic and
recursive types.
This vignette does not provide a complete and thorough explanation of all methods, functions, and options available in the ‘squarebrackets’ package. The vignette merely gives you, the reader, a glimpse of what this package is about. So don’t worry if you don’t understand everything immediately. A more complete explanation of all the available functionality can be found in the package documentation itself.
Vectors: Improved Index Specification
‘squarebrackets’ provides a set of methods that work on both atomic and recursive vectors:
- ii_x to extract subsets
- ii_wo to return object without the selected subset
- ii_mod to return a copy with modified subsets
- ii_set to modify an object by reference.
Base ‘R’ supports specifying indices for sub-set operations through
logical, integer, and character vectors.
‘squarebrackets’ enhances these capabilities, and adds more
possibilities.
The following sub-sections show some of these capabilities; a
more exhaustive list of the possibilities can be found in the package
documentation.
Specify Indices by Names
Base ‘R’ selects only the first matching name when selecting indices through a character vector. ‘squarebrackets’ selects all matching names.
For example:
nms <- c("a", sample(letters[1:4], 9, replace = TRUE))
x <- sample(1:10)
names(x) <- nms
print(x) # `x` has multiple elements with the name "a"
#> a a d c a b a c c b
#> 2 3 1 5 7 10 6 4 9 8
x["a"] # selects only the first index with name "a"
#> a
#> 2
ii_x(x, "a") # selects all indices with the name "a"
#> a a a a
#> 2 3 7 6
x[c("a", "a")] # repeats only the first index with name "a"
#> a a
#> 2 2
ii_x(x, c("a", "a")) # repeats all indices with the name "a"
#> a a a a a a a a
#> 2 3 7 6 2 3 7 6
To select the indices c("a", "a", "b"), whilst ensuring
all indices with those names get selected, one needs to
do the following in base ‘R’:
x[lapply(c("a", "a", "b"), \(i)which(names(x) == i)) |> unlist()]
#> a a a a a a a a b b
#> 2 3 7 6 2 3 7 6 10 8
With ‘squarebrackets’ the same selection is a single call, ii_x(x, c("a", "a", "b")), which is much easier.
It is not just shorter: ‘squarebrackets’ is also
faster, as it does not rely on lapply()
(or friends) to do this, but uses compiled ‘C’ code.
Inverting Index Specification
Inverting indices in base ‘R’ is done in different ways: negative numbers for numeric indexing, negation for logical indexing, and manually un-matching names for character indexing.
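As a small base ‘R’ sketch of those three styles (the vector x and its values below are made up for illustration), each of the following drops the first element, yet each needs its own spelling:

```r
# Illustrative data; not taken from the vignette's own examples
x <- c(a = 10L, b = 20L, c = 30L)

by_number  <- x[-1]                       # numeric index: negative position
by_logical <- x[!c(TRUE, FALSE, FALSE)]   # logical index: negate the mask
by_name    <- x[!names(x) %in% "a"]       # character index: un-match names manually

print(by_number)
```

All three results hold the same elements, "b" and "c"; only the way of asking for them differs.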
‘squarebrackets’ provides a (somewhat) consistent syntax to invert indices:
- The methods whose names end with _x (like the ii_x() shown before) perform extraction; to invert extraction, i.e. return the object without the specified subset, use the methods whose names end with _wo.
- In the modification methods (those whose names end with _mod or _set) one can set the argument inv = TRUE to invert indices.
As a consequence, removing sub-sets has the same syntax as extracting indices.
For example:
x <- sample(1:10)
names(x) <- letters[1:10]
x["a"] # extract element "a" in base R
#> a
#> 9
x[!names(x) %in% "a"] # but removing has different syntax
#> b c d e f g h i j
#> 5 10 1 7 8 6 2 3 4
ii_x(x, "a") # extract element "a" with 'squarebrackets'
#> a
#> 9
ii_wo(x, "a") # remove element "a" with 'squarebrackets'; same syntax
#> b c d e f g h i j
#> 5 10 1 7 8 6 2 3 4
Arrays: sub-setting unknown number of dimensions
Introduction
In order to perform subset operations on some array x
with the square brackets operator ([, [<-),
one needs to know how many dimensions it has.
For example:
# if x has 3 dimensions:
x[i, j, k, drop = FALSE]
x[i, j, k] <- value
# if x has 4 dimensions:
x[i, j, k, l, drop = FALSE]
x[i, j, k, l] <- value
Using x[i, j, k] on an array with 4 dimensions produces
an error, since the number of indices or empty arguments does not
conform to the number of dimensions.
But suppose that the number of dimensions of an array x
is unknown, for example when iterating through many arrays which may all
have a different number of dimensions. How would one then use the
[ and [<- operators in such a situation?
It’s not strictly impossible, but it is very convoluted.
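To give an idea of what "convoluted" means here, the following base ‘R’ sketch extracts the first 2 indices of every dimension without knowing the number of dimensions in advance. The helper name first_two() is hypothetical and only serves this illustration; it is not part of any package.

```r
# Hypothetical helper: first 2 indices of every dimension, in base R only
first_two <- function(x) {
  # one index vector per dimension, built at run time
  idx <- lapply(dim(x), function(n) seq_len(min(2L, n)))
  # assemble the call x[idx[[1]], idx[[2]], ..., drop = FALSE]
  do.call(`[`, c(list(x), idx, list(drop = FALSE)))
}

x3 <- array(1:48, c(4, 4, 3))       # 3d array
x4 <- array(1:240, c(4, 3, 4, 3))   # 4d array
dim(first_two(x3))
dim(first_two(x4))
```

The index list has to be built by hand and spliced into a call to the `[` function, which is hard to read and easy to get wrong.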
The methods provided by ‘squarebrackets’ do not use position-based arguments, and as such work on arrays with any number of dimensions, without requiring prior knowledge of that number.
The s, d argument pair
‘squarebrackets’ provides a set of methods that work on arrays of any number of dimensions:
- ss_x to extract subsets
- ss_wo to return object without the selected subset
- ss_mod to return a copy with modified subsets
- ss_set to modify an object by reference.
These methods use the s, d argument pair to specify
indices for subset operations. This argument form requires no prior
knowledge of the number of dimensions an object has.
s and d must be specified as follows:
- The s argument must be a list, specifying the subscripts (i.e. dimensional indices).
- The d argument must be an integer vector, specifying the dimensions for which s holds.
- If the subscripts are the same for all dimensions specified in d, s can also be given as an atomic vector, or as a list of length 1.
To minimize keystrokes, ‘squarebrackets’ provides the
n() function, which is short-hand for list();
n() nests multiple objects together, just
like c() concatenates multiple objects together.
For example:
- To specify rows 1:10, use s = 1:10 and d = 1.
- To specify layers (the third dimension) 4:9, use s = 4:9 and d = 3.
- To specify rows 1:10 and columns 2:5, use s = n(1:10, 2:5) and d = 1:2.
- To specify both rows and columns 1:5, one can use s = 1:5 and d = 1:2.
The d argument has the default specification
1:ndim(x), where ndim(x) = length(dim(x)).
Examples
Consider the following example - Given a set of atomic arrays with different dimensions, select the first 2 indices of every available dimension:
lst <- list(
array(1:25, c(5, 5)), # matrix / 2d array
array(1:48, c(4, 4, 3)), # 3d array
array(1:240, c(4, 3, 4, 3)) # 4d array
)
for (i in seq_along(lst)) {
  x <- lst[[i]]
  ss_x(x, s = 1:2, d = 1:ndim(x))
  ss_x(x, 1:2) # the same (by default, d = 1:ndim(x))
}
The s and d arguments are used to perform
sub-setting. Since this is not a position-based system like that of base ‘R’,
it works for matrices and arrays with any number of dimensions.
Another example - select the first 3 indices for the first dimension, the first 2 indices for the last available dimension, and select all indices for the other dimensions.
lst <- list(
array(1:25, c(5, 5)), # matrix / 2d array
array(1:48, c(4, 4, 3)), # 3d array
array(1:240, c(4, 3, 4, 3)) # 4d array
)
for (i in seq_along(lst)) {
  x <- lst[[i]]
  ss_x(x, n(1:3, 1:2), c(1, ndim(x)))
  ss_x(x, s = n(1:3, 1:2), d = c(1, ndim(x))) # the same
}
So ‘squarebrackets’ allows the user to perform easy sub-set
operations on arrays, even if the dimensions are not known a priori,
without ridiculously convoluted fiddling with do.call(),
non-standard evaluation, or other ugly programming tricks. It just
works.
Data.frame: different types, different rules
There are several types of data.frame-like objects available in ‘R’: data.frames, data.tables, tibbles, tidytables; and they all have their own rules regarding sub-set operations.
Consider the following example, where values of the column “a” are being replaced with “XXX”, but only in the rows where column “b” is larger than 10:
tinycodet::import_as(~ dpr., "dplyr", dependencies = "tibble")
x <- data.frame(a = month.abb, b = 1:12)
y <- dpr.$tibble(a = month.abb, b = 1:12)
z <- data.table::data.table(a = month.abb, b = 1:12)
x[with(x, b > 10), "a"] <- "XXX" # data.frame with base
y <- dpr.$mutate(y, a = ifelse(b > 10, "XXX", a)) # tibble with tidyverse; keep `a` elsewhere
z[b > 10, a := "XXX"] # data.table with fastverse/tinyverse
Note that the syntax is different for each type of data.frame.
‘squarebrackets’ provides a set of methods that work consistently on all
manner of tabular types (data.frames and matrices), with the exact same
syntax:
- sbt_x to extract subsets
- sbt_wo to return object without the selected subset
- sbt_mod to return a copy with modified subsets
- sbt_set to modify an object by reference.
So let’s do the same operation as above, but now using ‘squarebrackets’:
x <- data.frame(a = month.abb, b = 1:12)
y <- tibble::tibble(a = month.abb, b = 1:12)
z <- data.table::data.table(a = month.abb, b = 1:12)
sbt_mod(x, ~ b > 10, "a", rp = "XXX")
sbt_mod(y, ~ b > 10, "a", rp = "XXX")
sbt_mod(z, ~ b > 10, "a", rp = "XXX")
Notice that the syntax is exactly the same for all classes.
The original attributes are also preserved when using
sbt_mod(); i.e. nothing is forced to become a tibble,
data.table, or something else. Input class = output class.
For data.tables specifically, the user can also use
sbt_set() to modify the object by reference, which is
considerably faster and more memory efficient:
z <- data.table::data.table(a = month.abb, b = 1:12)
sbt_set(z, ~ b > 10, "a", rp = "XXX")
print(z)
#> a b
#> <char> <int>
#> 1: Jan 1
#> 2: Feb 2
#> 3: Mar 3
#> 4: Apr 4
#> 5: May 5
#> 6: Jun 6
#> 7: Jul 7
#> 8: Aug 8
#> 9: Sep 9
#> 10: Oct 10
#> 11: XXX 11
#> 12: XXX 12
This is all powered by the class-agnostic ‘C’ code from the fantastic ‘collapse’ and ‘data.table’ packages.
Pass by Reference or Pass By Value?
R’s [<- and [[<- sometimes make a
copy of an object, and sometimes do not. This brings 2
issues:
- Making unnecessary copies wastes memory (and time);
- On a technical level, it may be difficult to predict if a copy is made or not.
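The copy-on-modify behaviour behind the first issue can be seen in a minimal base ‘R’ sketch (the variable names are illustrative): modifying y leaves x untouched, which means ‘R’ had to copy the data at some point.

```r
x <- c(1, 2, 3)
y <- x          # y initially refers to the same data as x
y[1] <- 100     # modifying y forces a copy; x is not affected

print(x)
print(y)
```

For three numbers the copy is negligible, but the same mechanism applies to objects of any size.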
Data.tables from the ‘data.table’ package natively use pass-by-reference semantics, meaning no copy is made. Tibbles from the ‘tidyverse’ often return a (very wasteful) copy.
‘squarebrackets’ provides the user the ability to explicitly
choose whether to modify an object by reference (like
data.table), or to return an explicit copy. The
*_mod methods return a modified copy. The
*_set methods modify an object by reference. The
*_set methods are only available for the mutable classes
data.table and mutatomic;
mutatomic is a class of mutable atomic object provided by
‘squarebrackets’ for the explicit purpose of being able to modify atomic
objects by reference, and doing so safely.
Long Vectors: So much memory usage
Sub-set operations without indices
Long Vectors take up quite a bit of memory. Performing a sub-set operation in base ‘R’ on a vector requires an indexing vector, which - for a long vector - may itself also be a long vector. This is a lot of memory usage. We can do better.
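A rough base ‘R’ illustration of that overhead (the vector is kept modest in size here, and exact byte counts depend on the platform and ‘R’ version): selecting every second element requires an index vector that is itself half as long as the data.

```r
x <- numeric(1e6)                 # the data
i <- seq(1, length(x), by = 2)    # index vector, held in memory alongside x

length(i)                         # number of indices needed
print(object.size(i), units = "MB")
y <- x[i]                         # the extracted sub-set needs memory on top of that
```

The index vector i exists only to describe a regular sequence, yet it occupies memory in proportion to the number of selected elements.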
‘squarebrackets’ provides 2 sets of methods to perform sub-set operations without any indexing vector at all:
The slice_ - methods:
To perform sequence-based sub-set operations.
For example:
x <- 1:50
slice_x(x, 1, 10, 2) # equivalent to x[seq(1, 10, 2)]
#> [1] 1 3 5 7 9
The slicev_ - methods:
To perform value-based sub-set operations. For example:
x <- 1:50
slicev_x(x, v = 1L, r = FALSE) # equivalent to x[x != 1L]
#> [1] 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
#> [26] 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
Both extracting sub-sets and pass-by-reference modification of sub-sets are available for both sets of methods.
Sub-set Modifications without Copies
R’s [<- operator (sometimes) makes copies of objects;
making copies of long vectors, however, is an enormous waste of
memory.
To reduce memory usage, ‘squarebrackets’ provides a class of mutable
atomic objects that can be modified without making
copies, similar to how the ‘data.table’ package works. This new class of
mutable atomic objects is called mutatomic, and can be
created with ease.
A mutatomic vector can be modified by reference using the various methods that
end with _set.
You can still use regular indices, for example using
ii_set():