String related infix operators
Source: ../vignettes/articles/e_strings_inops.Rmd
library(tinycodet)
#> Run `?tinycodet::tinycodet` to open the introduction help page of 'tinycodet'.
Overview
‘tinycodet’ adds 3 sets of string-related operators.
First, sub-setting operators:
- x %s><% ss returns the first n1 and last n2 characters from each string in character vector x, where ss = c(n1, n2).
- x %s<>% ss trims away the first n1 and last n2 characters from each string in character vector x.
Second, ‘stringi’ already has the %s+%, %s*%, and %s$% operators, and ‘tinycodet’ adds some additional string arithmetic operators to complete the set:
- x %s-% p removes pattern p from each string in character vector x.
- x %s/% p counts how often pattern p occurs in each string of character vector x.
- x %s//% brk counts how often the text boundary specified in list brk occurs in each string of character vector x.
- x %ss% p splits the strings in x by a delimiter character/pattern defined in p, and removes p in the process.
And finally, string search operators:
- x %s{}% p checks, for every string in character vector x, whether the pattern defined in p is present. It can also be used to check whether the strings specifically start or end with pattern p.
- x %s!{}% p checks, for every string in character vector x, whether the pattern defined in p is not present. It can also be used to check whether the strings specifically do not start or end with pattern p.
- strfind()<- locates, extracts, or replaces found patterns.
The x %s-% p and x %s/% p operators, and the string search operators (%s{}%, %s!{}%, strfind()<-), perform pattern matching for various purposes. When a string or character vector is given on the right-hand side, it is interpreted as case-sensitive regex patterns from stringi.
But, of course, sometimes one wants to change this. For example, one may want it to be case insensitive. Or perhaps one wants to use fixed expressions, or something else.
Instead of giving a string or character vector of regex patterns, one can also supply a list to the right-hand side, to specify exactly how the pattern should be interpreted. The list should use the exact same naming convention as stringi. For example:
list(regex=p, case_insensitive=FALSE, ...)
list(fixed=p, ...)
list(coll=p, ...)
list(charclass=p, ...)
For convenience, ‘tinycodet’ adds the following functions for this purpose:
- s_regex(p, ...) is equivalent to list(regex = p, ...)
- s_fixed(p, ...) is equivalent to list(fixed = p, ...)
- s_coll(p, ...) is equivalent to list(coll = p, ...)
- s_chrcls(p, ...) is equivalent to list(charclass = p, ...)
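As a rough illustration of this naming convention, the sketch below forwards such a list to stringi::stri_count() with do.call(). It uses only 'stringi'. Whether 'tinycodet' forwards the list in exactly this way internally is an assumption made for the example, not something stated in this article.

```r
library(stringi)

x <- c("Hello World", "goodbye world")

# A pattern specification in the list form described above
# (the kind of list that s_fixed("world", case_insensitive = TRUE) stands for)
p <- list(fixed = "world", case_insensitive = TRUE)

# Forward the list elements as named arguments to a 'stringi' function
n_list <- do.call(stri_count, c(list(str = x), p))
print(n_list)
#> [1] 1 1

# The same call written out by hand
n_direct <- stri_count(x, fixed = "world", case_insensitive = TRUE)

# For contrast: a plain case-sensitive regex pattern misses "World"
n_regex <- stri_count(x, regex = "world")
print(n_regex)
#> [1] 0 1
```

The list names (regex, fixed, coll, charclass) select the matching engine, and any further elements are passed along as options for that engine.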
The next sections will give more details on the given overview.
String subsetting operators
The x %s><% ss operator returns a subset of each string in character vector x. Here ss is a vector of length 2, or a matrix with nrow(ss) = length(x) and 2 columns. The object ss should consist entirely of non-negative integers (thus 0, 1, 2, etc. are valid, but -1, -2, -3, etc. are not). The first element/column of ss gives the number of characters to extract from x counting from the left side. The second element/column of ss gives the number of characters to extract from x counting from the right side.
Here are 2 examples:
x <- c(paste0(letters[1:13], collapse=""), paste0(letters[14:26], collapse=""))
print(x)
#> [1] "abcdefghijklm" "nopqrstuvwxyz"
ss <- c(2,3)
x %s><% ss
#> [1] "abklm" "noxyz"
x <- c(paste0(letters[1:13], collapse=""), paste0(letters[14:26], collapse=""))
print(x)
#> [1] "abcdefghijklm" "nopqrstuvwxyz"
ss <- c(1,0)
x %s><% ss
#> [1] "a" "n"
Thus x %s><% ss “gets” or extracts the given number of characters from the left and the right, and removes the rest.
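To make the semantics concrete, here is a minimal base-R sketch that emulates this behaviour. The helper keep_ends() is hypothetical and is not part of 'tinycodet'; it only mirrors the description above and assumes that n1 + n2 never exceeds the number of characters in a string. It also shows the matrix form of ss, where each string gets its own pair of counts.

```r
# Hypothetical helper, not part of 'tinycodet': base-R emulation of x %s><% ss.
# Assumes n1 + n2 does not exceed the number of characters in each string.
keep_ends <- function(x, ss) {
  # Expand a length-2 vector to one (n1, n2) row per string
  if (is.null(dim(ss))) {
    ss <- matrix(ss, nrow = length(x), ncol = 2, byrow = TRUE)
  }
  n <- nchar(x)
  left  <- substr(x, 1, ss[, 1])          # first n1 characters
  right <- substr(x, n - ss[, 2] + 1, n)  # last n2 characters ("" when n2 is 0)
  paste0(left, right)
}

x <- c("abcdefghijklm", "nopqrstuvwxyz")

keep_vec <- keep_ends(x, c(2, 3))
print(keep_vec)
#> [1] "abklm" "noxyz"

# Matrix form: one row per string
ss <- rbind(c(2, 3), c(1, 0))
keep_mat <- keep_ends(x, ss)
print(keep_mat)
#> [1] "abklm" "n"
```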
There is also x %s<>% ss, which is the opposite: it trims away the number of characters from the left and right as defined in ss, leaving you with whatever remains.
Here are again 2 examples:
x <- c(paste0(letters[1:13], collapse=""), paste0(letters[14:26], collapse=""))
print(x)
#> [1] "abcdefghijklm" "nopqrstuvwxyz"
ss <- c(2,3)
x %s<>% ss
#> [1] "cdefghij" "pqrstuvw"
x <- c(paste0(letters[1:13], collapse=""), paste0(letters[14:26], collapse=""))
print(x)
#> [1] "abcdefghijklm" "nopqrstuvwxyz"
ss <- c(1,0)
x %s<>% ss
#> [1] "bcdefghijklm" "opqrstuvwxyz"
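The trimming behaviour can be sketched in base R in the same spirit. The helper trim_ends() below is hypothetical (it is not a 'tinycodet' function) and assumes that n1 + n2 is smaller than the number of characters in each string.

```r
# Hypothetical helper, not part of 'tinycodet': base-R emulation of x %s<>% ss.
trim_ends <- function(x, ss) {
  # Expand a length-2 vector to one (n1, n2) row per string
  if (is.null(dim(ss))) {
    ss <- matrix(ss, nrow = length(x), ncol = 2, byrow = TRUE)
  }
  # Keep everything after the first n1 and before the last n2 characters
  substr(x, ss[, 1] + 1, nchar(x) - ss[, 2])
}

x <- c("abcdefghijklm", "nopqrstuvwxyz")

trim_vec <- trim_ends(x, c(2, 3))
print(trim_vec)
#> [1] "cdefghij" "pqrstuvw"

# Matrix form: a different trim for each string
trim_mat <- trim_ends(x, rbind(c(2, 3), c(1, 0)))
print(trim_mat)
#> [1] "cdefghij"     "opqrstuvwxyz"
```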
String arithmetic
The tinycodet package includes 7 string arithmetic operators (3 of which are re-exported from ‘stringi’):
- x %s+% y concatenates x and y (re-exported from ‘stringi’).
- x %s-% p removes pattern p from each string in character vector x.
- x %s*% n repeats each string in character vector x n times (re-exported from ‘stringi’).
- x %s/% p counts how often pattern p occurs in each string of character vector x.
- x %s//% brk counts how often the text boundary specified in list brk occurs in each string of character vector x.
- e1 %s$% e2 provides access to stri_sprintf (re-exported from ‘stringi’).
- x %ss% p splits the strings in x by a delimiter character/pattern defined in p, and removes p in the process.
For example:
"Hello " %s+% " world"
#> [1] "Hello  world"
c("Hello world", "Goodbye world") %s-% " world"
#> [1] "Hello" "Goodbye"
c("Hello world", "Goodbye world") %s-% s_fixed(" world")
#> [1] "Hello" "Goodbye"
c("Ha", "Ho", "Hi", "Hu", "He", "Ha") %s*% 2:7
#> [1] "HaHa" "HoHoHo" "HiHiHiHi" "HuHuHuHuHu"
#> [5] "HeHeHeHeHeHe" "HaHaHaHaHaHaHa"
c("hello World & goodbye world", "world domination!") %s/% s_fixed("world", case_insensitive = TRUE)
#> [1] 2 1
c("hello world & goodbye world", "world domination!") %s//% list(type = "word")
#> [1] 9 4
The right-side arguments y and n can be a single value, or a vector of the same length as x. The right-side argument p can be a string or character vector, or a list as described in the Overview section.
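The difference between a single right-side value and a vector of the same length as x can be seen with the two operators that come straight from 'stringi'. This small sketch loads only 'stringi', so it does not depend on anything specific to 'tinycodet':

```r
library(stringi)

x <- c("Ha", "Ho", "Hi")

# Single value: applied to every element of x
rep_single <- x %s*% 2
print(rep_single)
#> [1] "HaHa" "HoHo" "HiHi"

# Vector of the same length as x: applied element by element
rep_vector <- x %s*% 1:3
print(rep_vector)
#> [1] "Ha"     "HoHo"   "HiHiHi"

# The same holds for concatenation
cat_single <- x %s+% "!"
cat_vector <- x %s+% c(".", "?", "!")
print(cat_vector)
#> [1] "Ha." "Ho?" "Hi!"
```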
Detect Patterns
Detect
The x %s{}% p operator checks, for every string in character vector x, whether the pattern defined in p is present. The x %s!{}% p operator checks, for every string in character vector x, whether the pattern defined in p is NOT present.
Examples:
x <- c(paste0(letters[1:13], collapse=""), paste0(letters[14:26], collapse=""))
print(x)
#> [1] "abcdefghijklm" "nopqrstuvwxyz"
x %s{}% "a"
#> [1] TRUE FALSE
x %s!{}% "a"
#> [1] FALSE TRUE
which(x %s{}% "a")
#> [1] 1
which(x %s!{}% "a")
#> [1] 2
x[x %s{}% "a"]
#> [1] "abcdefghijklm"
x[x %s!{}% "a"]
#> [1] "nopqrstuvwxyz"
Detect - start or end with pattern
When supplying a list on the right-hand side (see the Overview section above), one can include the list element at = "start" or at = "end":
- Supplying at = "start" will check if strings start with the patterns (see stringi::stri_startswith).
- Supplying at = "end" will check if strings end with the patterns (see stringi::stri_endswith).
Examples:
x <- c(paste0(letters, collapse=""), paste0(rev(letters), collapse=""), NA)
p <- s_fixed("abc", at = "start")
x %s{}% p
#> [1] TRUE FALSE NA
stringi::stri_startswith(x, fixed = "abc") # same as above
#> [1] TRUE FALSE NA
p <- s_fixed("xyz", at = "end")
x %s{}% p
#> [1] TRUE FALSE NA
stringi::stri_endswith(x, fixed = "xyz") # same as above
#> [1] TRUE FALSE NA
p <- s_fixed("cba", at = "end")
x %s{}% p
#> [1] FALSE TRUE NA
stringi::stri_endswith(x, fixed = "cba") # same as above
#> [1] FALSE TRUE NA
p <- s_fixed("zyx", at = "start")
x %s{}% p
#> [1] FALSE TRUE NA
stringi::stri_startswith(x, fixed = "zyx") # same as above
#> [1] FALSE TRUE NA
Locate, Extract, or Replace Patterns
strfind()<- locates, extracts, or replaces found patterns. Like the other operators, the argument p can be a string or character vector, or a list as described in the Overview section above.
It can be used in several different ways.
Extract
strfind() finds all pattern matches, and returns the extractions of the findings in a list, just like stringi::stri_extract_all():
x <- rep('The quick brown fox jumped over the lazy dog.', 3)
p <- s_fixed(c('quick', 'brown', 'fox'))
strfind(x, p)
#> [[1]]
#> [1] "quick"
#>
#> [[2]]
#> [1] "brown"
#>
#> [[3]]
#> [1] "fox"
Locate
strfind(..., i = "all") finds all pattern matches, like stringi::stri_locate_all(). And strfind(..., i = i), where i is an integer vector, locates the ith occurrence of a pattern and reports the locations in a matrix, just like stri_locate_ith():
p <- s_fixed("the", case_insensitive = TRUE)
strfind(x, p, i = "all")
#> [[1]]
#> start end
#> [1,] 1 3
#> [2,] 33 35
#>
#> [[2]]
#> start end
#> [1,] 1 3
#> [2,] 33 35
#>
#> [[3]]
#> start end
#> [1,] 1 3
#> [2,] 33 35
strfind(x, p, i = c(1, -1, 2))
#> start end
#> [1,] 1 3
#> [2,] 33 35
#> [3,] 33 35
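The output above suggests that a negative i counts from the last match (i = -1 selected the final "the"). The sketch below reproduces that selection by hand from stringi::stri_locate_all_fixed(). The helper pick_ith() is hypothetical, handles only in-range values of i, and is not how 'tinycodet' necessarily implements this.

```r
library(stringi)

x <- rep("The quick brown fox jumped over the lazy dog.", 3)
i <- c(1, -1, 2)

# All case-insensitive matches of "the": one start/end matrix per string
locs <- stri_locate_all_fixed(x, "the", case_insensitive = TRUE)

# Hypothetical helper: pick row i of a match matrix;
# a negative i counts backwards from the last match
pick_ith <- function(m, i) {
  k <- if (i < 0) nrow(m) + i + 1 else i
  m[k, , drop = FALSE]
}

# One selected match per string, stacked into a single matrix
ith <- do.call(rbind, Map(pick_ith, locs, i))
print(ith)
#>      start end
#> [1,]     1   3
#> [2,]    33  35
#> [3,]    33  35
```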
Replace
strfind() <- value finds pattern matches in variable x, replaces the pattern matches with the character vector specified in value, and assigns the transformed character vector back to x. This is somewhat similar to stringi::stri_replace(), though the replacement is done in-place. It supports vectorized, dictionary, first, and last replacement:
# vectorized replacement:
x <- rep('The quick brown fox jumped over the lazy dog.', 3)
p <- c('quick', 'brown', 'fox')
rp <- c('SLOW', 'BLACK', 'BEAR')
strfind(x, p) <- rp
print(x)
#> [1] "The SLOW brown fox jumped over the lazy dog."
#> [2] "The quick BLACK fox jumped over the lazy dog."
#> [3] "The quick brown BEAR jumped over the lazy dog."
# dictionary replacement:
# quick => SLOW; brown => BLACK; fox => BEAR
x <- rep('The quick brown fox jumped over the lazy dog.', 3)
p <- c('quick', 'brown', 'fox')
rp <- c('SLOW', 'BLACK', 'BEAR')
strfind(x, p, rt = "dict") <- rp
print(x)
#> [1] "The SLOW BLACK BEAR jumped over the lazy dog."
#> [2] "The SLOW BLACK BEAR jumped over the lazy dog."
#> [3] "The SLOW BLACK BEAR jumped over the lazy dog."
# first replacement:
x <- rep('The quick brown fox jumped over the lazy dog.', 3)
p <- s_fixed("the", case_insensitive = TRUE)
rp <- c('ONE')
strfind(x, p, rt = "first") <- rp
print(x)
#> [1] "ONE quick brown fox jumped over the lazy dog."
#> [2] "ONE quick brown fox jumped over the lazy dog."
#> [3] "ONE quick brown fox jumped over the lazy dog."
# last replacement:
x <- rep('The quick brown fox jumped over the lazy dog.', 3)
p <- s_fixed("the", case_insensitive = TRUE)
rp <- c('ONE')
strfind(x, p, rt = "last") <- rp
print(x)
#> [1] "The quick brown fox jumped over ONE lazy dog."
#> [2] "The quick brown fox jumped over ONE lazy dog."
#> [3] "The quick brown fox jumped over ONE lazy dog."
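For comparison, the four replacement modes can be approximated with plain 'stringi' calls. The mapping below is an assumption inferred from the outputs shown above rather than a statement about how strfind()<- is implemented. Note that these calls return new vectors and leave x untouched, whereas strfind()<- assigns the result back to x.

```r
library(stringi)

x  <- rep("The quick brown fox jumped over the lazy dog.", 3)
p  <- c("quick", "brown", "fox")
rp <- c("SLOW", "BLACK", "BEAR")

# Vectorized: x[i] is searched for p[i] and gets rp[i]
vec <- stri_replace_all_regex(x, p, rp)

# Dictionary: every string gets every pattern/replacement pair
dict <- stri_replace_all_regex(x, p, rp, vectorize_all = FALSE)

# First and last match only, case-insensitive fixed pattern
first <- stri_replace_first_fixed(x, "the", "ONE", case_insensitive = TRUE)
last  <- stri_replace_last_fixed(x, "the", "ONE", case_insensitive = TRUE)

print(vec[1])
#> [1] "The SLOW brown fox jumped over the lazy dog."
print(dict[1])
#> [1] "The SLOW BLACK BEAR jumped over the lazy dog."
print(first[1])
#> [1] "ONE quick brown fox jumped over the lazy dog."
print(last[1])
#> [1] "The quick brown fox jumped over ONE lazy dog."
```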