String related functions
library(tinycodet)
#> Run `?tinycodet::tinycodet` to open the introduction help page of 'tinycodet'.
Introduction
Virtually every programming language, even those primarily focused on mathematics, will at some point have to deal with strings. R’s atomic classes basically boil down to some form of either numbers or characters. R’s numerical functions are generally very fast. But R’s native string functions are somewhat slow, do not have a unified naming scheme, and are not as comprehensive as R’s impressive numerical functions.
The primary R package that fixes this is ‘stringi’. ‘stringi’ is the fastest and most comprehensive string manipulation package available at the time of writing. Many string-related packages depend fully on ‘stringi’. The ‘stringr’ package, for example, is largely a wrapper around ‘stringi’.
The ‘tinycodet’ package adds a bit more functionality to the ‘stringi’ string manipulation capabilities.
stri_locate_ith
Suppose one wants to transform the first vowel in each string of a character vector x, such that an upper-case vowel becomes lower case, and vice versa. One can do that completely in stringi + base R as follows:
x <- c("HELLO WORLD", "goodbye world")
loc <- stringi::stri_locate_first(x, regex="a|e|i|o|u", case_insensitive=TRUE)
extr <- stringi::stri_sub(x, from=loc)
repl <- chartr(extr, old = "a-zA-Z", new = "A-Za-z")
stringi::stri_sub_replace(x, loc, replacement=repl)
#> [1] "HeLLO WORLD" "gOodbye world"
But now suppose one wants to transform the second-last vowel instead. This is not impossible with stringi alone, but it is not straightforward either. For clear code, stringi really needs some kind of “stri_locate_ith” function, and the tinycodet package provides just that.
The stri_locate_ith(str, i, ...) function locates, for every element/string in character vector str, the \(i^\textrm{th}\) occurrence of some (regex/fixed/etc) pattern. When i is positive, the occurrence is counted from left to right. Negative values for i are also allowed, in which case the occurrence is counted from right to left. i=0 is not allowed. Thus, to get the second occurrence of some pattern, use i=2, and to get the second-last occurrence, use i=-2.
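The counting rule can be illustrated with stringi alone. The following is a minimal sketch, not tinycodet's implementation: locate_ith_sketch() is a hypothetical helper that collects all matches with stringi::stri_locate_all_regex() and then picks the i-th one, counting from the right when i is negative. It assumes a single scalar i, and it returns NA when a string has fewer than abs(i) matches, which may differ from what stri_locate_ith() does in that case.

```r
library(stringi)

# Hypothetical helper for illustration only; not part of tinycodet or stringi.
locate_ith_sketch <- function(str, i, regex) {
  stopifnot(length(i) == 1, i != 0)  # i = 0 has no meaning
  all_locs <- stri_locate_all_regex(str, regex)  # one 2-column matrix per string
  out <- vapply(all_locs, function(m) {
    n <- nrow(m)
    idx <- if (i > 0) i else n + i + 1  # negative i counts from the right
    if (anyNA(m) || idx < 1 || idx > n) {
      return(c(NA_real_, NA_real_))     # no match, or too few matches
    }
    as.numeric(m[idx, ])
  }, numeric(2))
  out <- matrix(out, ncol = 2, byrow = TRUE)
  colnames(out) <- c("start", "end")
  out
}

x <- c("HELLO WORLD", "goodbye world")
locate_ith_sketch(x, 2, "[aeiouAEIOU]")   # second vowel of each string
locate_ith_sketch(x, -2, "[aeiouAEIOU]")  # second-last vowel of each string
```

Both calls return a two-column matrix of start and end positions, with one row per string.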
The stri_locate_ith(str, i, ...) function uses exactly the same argument and naming conventions as stringi, to keep your code consistent. And just like stringi::stri_locate_first/last, stri_locate_ith(str, i, ...) is a vectorized function: str and i, as well as the pattern (regex, fixed, coll, charclass), can all be different-valued vectors.
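As a brief sketch of this vectorization (assuming tinycodet is installed and behaves as described above), a different i can be supplied for each string. The input vector and pattern are the same as in the earlier example; only the choice of i values is new here.

```r
library(tinycodet)

x <- c("HELLO WORLD", "goodbye world")

# i is matched element-wise with x:
# the first vowel of x[1], and the last vowel of x[2]
loc <- stri_locate_ith(
  x, c(1, -1), regex = "a|e|i|o|u", case_insensitive = TRUE
)

# extract the located vowels, one per string
vowels <- stringi::stri_sub(x, from = loc)
print(vowels)
```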
To transform the second-last occurrence, one can now use stri_locate_ith() in much the same way as stri_locate_first/last was used above:
x <- c("HELLO WORLD", "goodbye world")
loc <- stri_locate_ith( # this part is the key-difference
  x, -2, regex="a|e|i|o|u", case_insensitive=TRUE
)
extr <- stringi::stri_sub(x, from=loc)
repl <- chartr(extr, old = "a-zA-Z", new = "A-Za-z")
stringi::stri_sub_replace(x, loc, replacement=repl)
#> [1] "HELLo WORLD" "goodbyE world"
Notice that the code is virtually equivalent. We only need to change the locate function.
There is also the stri_locate_ith_boundaries() function, which locates the \(i^\textrm{th}\) text boundary.
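The same idea can be sketched with stringi's own boundary functions. The snippet below only illustrates what "the i-th text boundary" means and is not the tinycodet implementation: it uses stringi::stri_locate_all_boundaries() to find all word segments and then picks the second-last one by hand. The example string and the choice of word boundaries (with skip_word_none = TRUE, so that spaces and punctuation are skipped) are assumptions made for this example.

```r
library(stringi)

x <- "The quick brown fox"

# all word segments of x, as a 2-column (start, end) matrix
words <- stri_locate_all_boundaries(
  x, type = "word", skip_word_none = TRUE
)[[1]]

# pick the second-last segment, i.e. what i = -2 would refer to
second_last <- words[nrow(words) - 1, ]
found <- stri_sub(x, second_last[1], second_last[2])
print(found)
```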
strcut_ - functions
The tinycodet R package adds 2 strcut functions: strcut_loc() and strcut_brk().
The strcut_loc() function cuts every string in a character vector around a location range loc, such that every string is cut into the following parts:
- the sub-string before loc;
- the sub-string at loc itself;
- the sub-string after loc.
The location range loc would usually be a matrix with 2 columns, giving the start and end points of some pattern match.
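The three parts can be reproduced by hand with stringi::stri_sub(). This is a rough sketch of the idea only, not the actual strcut_loc() code; cut_around_sketch() is a hypothetical helper, and it assumes that loc is a two-column matrix without missing values.

```r
library(stringi)

# Hypothetical helper for illustration only; not part of tinycodet.
cut_around_sketch <- function(x, loc) {
  start <- loc[, 1]
  end <- loc[, 2]
  cbind(
    prepart  = stri_sub(x, 1, start - 1),  # everything before the range
    mainpart = stri_sub(x, start, end),    # the range itself
    postpart = stri_sub(x, end + 1, -1)    # everything after the range
  )
}

x <- c("12345678910", "12345678910")
loc <- rbind(c(5, 5), c(10, 11))  # one (start, end) row per string
parts <- cut_around_sketch(x, loc)
print(parts)
```

When the range starts at the first character or ends at the last one, the corresponding pre- or post-part is simply an empty string.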
The strcut_brk() function is basically a wrapper around stringi::stri_split_boundaries(..., simplify=NA), with some more conveniently named arguments.
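Since strcut_brk() is described as a wrapper, the underlying stringi call can also be used directly. A minimal sketch, in which the break type is passed through the type argument of stringi::stri_split_boundaries() and the example strings are made up for illustration:

```r
library(stringi)

x <- c("One sentence. Another sentence.", "Just one.")

# simplify = NA returns a character matrix, padding shorter rows with NA
mat <- stri_split_boundaries(x, type = "sentence", simplify = NA)
print(mat)
```

Each row of the result corresponds to one input string, and each column to one sentence.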
Examples:
x <- rep(paste0(1:10, collapse=""), 10)
print(x)
#> [1] "12345678910" "12345678910" "12345678910" "12345678910" "12345678910"
#> [6] "12345678910" "12345678910" "12345678910" "12345678910" "12345678910"
loc <- stri_locate_ith(x, 1:10, fixed = as.character(1:10))
strcut_loc(x, loc)
#> prepart mainpart postpart
#> [1,] "" "1" "2345678910"
#> [2,] "1" "2" "345678910"
#> [3,] "12" "3" "45678910"
#> [4,] "123" "4" "5678910"
#> [5,] "1234" "5" "678910"
#> [6,] "12345" "6" "78910"
#> [7,] "123456" "7" "8910"
#> [8,] "1234567" "8" "910"
#> [9,] "12345678" "9" "10"
#> [10,] "123456789" "10" ""
strcut_loc(x, c(5,5))
#> prepart mainpart postpart
#> [1,] "1234" "5" "678910"
#> [2,] "1234" "5" "678910"
#> [3,] "1234" "5" "678910"
#> [4,] "1234" "5" "678910"
#> [5,] "1234" "5" "678910"
#> [6,] "1234" "5" "678910"
#> [7,] "1234" "5" "678910"
#> [8,] "1234" "5" "678910"
#> [9,] "1234" "5" "678910"
#> [10,] "1234" "5" "678910"
test <- c("The above-mentioned    features are very useful. ",
          "Spam, spam, eggs, bacon, and spam. 123 456 789")
strcut_brk(test, "line")
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> [1,] "The " "above-" "mentioned " "features " "are " "very " "useful. "
#> [2,] "Spam, " "spam, " "eggs, " "bacon, " "and " "spam. " "123 "
#> [,8] [,9]
#> [1,] NA NA
#> [2,] "456 " "789"
strcut_brk(test, "word")
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
#> [1,] "The" " " "above" "-" "mentioned" " " "features" " " "are"
#> [2,] "Spam" "," " " "spam" "," " " "eggs" "," " "
#> [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
#> [1,] " " "very" " " "useful" "." " " NA NA NA NA NA
#> [2,] "bacon" "," " " "and" " " "spam" "." " " "123" " " "456"
#> [,21] [,22]
#> [1,] NA NA
#> [2,] " " "789"
strcut_brk(test, "sentence")
#> [,1] [,2]
#> [1,] "The above-mentioned features are very useful. " NA
#> [2,] "Spam, spam, eggs, bacon, and spam. " "123 456 789"
strcut_brk(test, "character")
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
#> [1,] "T" "h" "e" " " "a" "b" "o" "v" "e" "-" "m" "e" "n" "t"
#> [2,] "S" "p" "a" "m" "," " " "s" "p" "a" "m" "," " " "e" "g"
#> [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
#> [1,] "i" "o" "n" "e" "d" " " " " " " " " "f" "e" "a"
#> [2,] "g" "s" "," " " "b" "a" "c" "o" "n" "," " " "a"
#> [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37] [,38]
#> [1,] "t" "u" "r" "e" "s" " " "a" "r" "e" " " "v" "e"
#> [2,] "n" "d" " " "s" "p" "a" "m" "." " " "1" "2" "3"
#> [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48] [,49]
#> [1,] "r" "y" " " "u" "s" "e" "f" "u" "l" "." " "
#> [2,] " " "4" "5" "6" " " "7" "8" "9" NA NA NA
Matrix re-ordering operators
The matrix re-ordering operators are quite handy for re-ordering strings, since the strcut_ - functions return matrices, and stri_join_mat() and its aliases concatenate matrices.
See the documentation on matrix operators: ?`%row~%` and ?`%col~%`.
See also the “Miscellaneous functionality” article.
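To give a rough idea of what such a row-wise re-ordering does, here is a base R sketch. It rests on the assumption, inferred from the examples further down, that %row~% re-orders the elements of each row independently, according to the ranks in the corresponding row of the second matrix. reorder_rows_sketch() is a hypothetical helper and not part of tinycodet.

```r
# Hypothetical helper for illustration only; not part of tinycodet.
reorder_rows_sketch <- function(x, rnk) {
  out <- vapply(seq_len(nrow(x)), function(r) {
    x[r, order(rnk[r, ])]  # re-order row r by its own ranks
  }, character(ncol(x)))
  # vapply() stores each re-ordered row as a column, so fill back by row
  matrix(out, nrow = nrow(x), byrow = TRUE)
}

mat <- matrix(c("c", "a", "b",
                "z", "y", "x"), nrow = 2, byrow = TRUE)
rnk <- matrix(c(3, 1, 2,
                3, 2, 1), nrow = 2, byrow = TRUE)
sorted <- reorder_rows_sketch(mat, rnk)
print(sorted)
```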
Matrix joining
The tinycodet package adds a tiny additional function to stringi: stri_join_mat (and its aliases stri_c_mat and stri_paste_mat).
As the name suggests, these functions perform row-wise (margin=1; the default) or column-wise (margin=2) joining of a matrix of strings, thereby transforming it to a vector of strings. You can already do this without tinycodet, but it requires converting the matrix to a data.frame or list, and then calling stri_join inside do.call(), which to me just seems like too much trouble for something so simple.
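For comparison, the do.call() route mentioned above might look roughly like this. It is a sketch with a small made-up matrix, using only stringi and base R:

```r
library(stringi)

mat <- matrix(c("a", "b", "c", "d"), nrow = 2)  # rows: ("a", "c") and ("b", "d")

# row-wise joining: pass the columns as separate arguments to stri_join()
cols <- lapply(seq_len(ncol(mat)), function(j) mat[, j])
rowwise <- do.call(stri_join, cols)

# column-wise joining: pass the rows as separate arguments instead
rows <- lapply(seq_len(nrow(mat)), function(i) mat[i, ])
colwise <- do.call(stri_join, rows)

print(rowwise)
print(colwise)
```

With stri_join_mat(), the same two results should be obtainable in one call each, using margin=1 and margin=2 respectively.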
Here is an example of their usage when re-ordering strings, words, or sentences:
# sorting characters in strings:
x <- c(paste(sample(letters), collapse = ""), paste(sample(letters), collapse = ""))
print(x)
#> [1] "mwlefoudviyphzbgtjsqxnkrac" "vkgufijbqazscrwoehmdnxtpyl"
mat <- strcut_brk(x)
rank <- stringi::stri_rank(as.vector(mat)) |> matrix(ncol=ncol(mat))
sorted <- mat %row~% rank
print(sorted)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
#> [1,] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n"
#> [2,] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n"
#> [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
#> [1,] "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
#> [2,] "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
stri_join_mat(sorted, margin=1)
#> [1] "abcdefghijklmnopqrstuvwxyz" "abcdefghijklmnopqrstuvwxyz"
stri_join_mat(sorted, margin=2)
#> [1] "aa" "bb" "cc" "dd" "ee" "ff" "gg" "hh" "ii" "jj" "kk" "ll" "mm" "nn" "oo"
#> [16] "pp" "qq" "rr" "ss" "tt" "uu" "vv" "ww" "xx" "yy" "zz"
# sorting words:
x <- c("2nd 3rd 1st", "Goodbye everyone")
print(x)
#> [1] "2nd 3rd 1st" "Goodbye everyone"
mat <- strcut_brk(x, "word")
rank <- stringi::stri_rank(as.vector(mat)) |> matrix(ncol=ncol(mat))
sorted <- mat %row~% rank
sorted[is.na(sorted)] <- ""
stri_c_mat(sorted, margin=1, sep = " ") # <- alias for stri_join_mat
#> [1] " 1st 2nd 3rd" " everyone Goodbye "
stri_c_mat(sorted, margin=2, sep = " ")
#> [1] " " " everyone" "1st Goodbye" "2nd " "3rd "
# randomly shuffle sentences:
x <- c("Hello, who are you? Oh, really?! Cool!", "I don't care. But I really don't.")
print(x)
#> [1] "Hello, who are you? Oh, really?! Cool!"
#> [2] "I don't care. But I really don't."
mat <- strcut_brk(x, "sentence")
rank <- sample(1:length(mat)) |> matrix(ncol = ncol(mat))
sorted <- mat %row~% rank
sorted[is.na(sorted)] <- ""
stri_paste_mat(sorted, margin=1) # <- another alias for stri_join_mat
#> [1] "Hello, who are you? Oh, really?! Cool!"
#> [2] "But I really don't.I don't care. "
stri_paste_mat(sorted, margin=2)
#> [1] "Hello, who are you? " "Oh, really?! But I really don't."
#> [3] "Cool!I don't care. "