String related functions
library(tinycodet)
#> Run `?tinycodet::tinycodet` to open the introduction help page of 'tinycodet'.
Introduction
R’s numerical functions are generally very fast. But R’s native string functions are somewhat slow, do not have a unified naming scheme, and are not as comprehensive as R’s impressive numerical functions. The primary R package that fixes this is ‘stringi’, which many, if not most, string-related packages depend on (see the list of reverse dependencies on CRAN). The ‘stringr’ package, for example, is merely a thin wrapper around ‘stringi’. As string manipulation is important to programming languages, even those primarily focused on mathematics, ‘tinycodet’ adds a little bit of new functionality to ‘stringi’.
stri_locate_ith
Suppose one wants to transform the first vowel in each string of a character vector x, such that an upper case vowel becomes lower case, and vice versa. One can do that completely in stringi + base R as follows:
x <- c("HELLO WORLD", "goodbye world")
loc <- stringi::stri_locate_first(x, regex="a|e|i|o|u", case_insensitive=TRUE)
extr <- stringi::stri_sub(x, from=loc)
repl <- chartr(extr, old = "a-zA-Z", new = "A-Za-z")
stringi::stri_sub_replace(x, loc, replacement=repl)
#> [1] "HeLLO WORLD" "gOodbye world"
But now suppose one wants to transform the second-last vowel instead. How would one do that efficiently? For clear and efficient code, stringi really needs some kind of “stri_locate_ith” function. And, of course, the tinycodet package provides just that.
The stri_locate_ith(str, i, ...) function locates, for every element/string in character vector str, the \(i^\textrm{th}\) occurrence of some (regex/fixed/etc.) pattern. When i is positive, the occurrence is counted from left to right. Negative values for i are also allowed, in which case the occurrence is counted from right to left. Note that i=0 is not allowed.
Thus, to get the second occurrence of some pattern, use i=2, and to get the second-last occurrence, use i=-2.
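To make the counting rule concrete, here is a minimal base R sketch of the idea. The helper locate_ith_sketch() is hypothetical and is not how tinycodet implements stri_locate_ith(); in particular, returning NA when a string has fewer than |i| matches is a choice made for this sketch and may differ from the package's behaviour.

```r
# Hypothetical helper (NOT part of 'tinycodet' or 'stringi'):
# locate the i-th regex match in every string, using base R only.
locate_ith_sketch <- function(x, i, pattern, ignore.case = FALSE) {
  stopifnot(all(i != 0))                 # i = 0 is not allowed
  i <- rep_len(i, length(x))             # recycle i over the strings
  m <- gregexpr(pattern, x, ignore.case = ignore.case)
  out <- matrix(NA_integer_, nrow = length(x), ncol = 2,
                dimnames = list(NULL, c("start", "end")))
  for (k in seq_along(x)) {
    starts <- as.integer(m[[k]])
    if (is.na(starts[1]) || starts[1] == -1L) next  # no match: keep NA
    n <- length(starts)
    # positive i counts from the left, negative i from the right
    idx <- if (i[k] > 0) i[k] else n + i[k] + 1
    if (idx < 1 || idx > n) next                    # too few matches: keep NA (sketch's choice)
    len <- attr(m[[k]], "match.length")[idx]
    out[k, ] <- c(starts[idx], starts[idx] + len - 1L)
  }
  out
}

x <- c("HELLO WORLD", "goodbye world")
locate_ith_sketch(x, 2, "[aeiou]", ignore.case = TRUE)   # second vowel
locate_ith_sketch(x, -2, "[aeiou]", ignore.case = TRUE)  # second-last vowel
```

The only difference between the two calls is the sign of i; the rest of the lookup is identical.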
The stri_locate_ith(str, i, ...) function uses exactly the same argument and naming conventions as stringi, to keep your code consistent. And just like stringi::stri_locate_first/last, it is vectorized: str and i, as well as the pattern (regex, fixed, coll, charclass), can all be different-valued vectors.
To transform the second-last occurrence, one can now use stri_locate_ith() in much the same way as stri_locate_first/last was used above:
x <- c("HELLO WORLD", "goodbye world")
loc <- stri_locate_ith( # this part is the key-difference
  x, -2, regex="a|e|i|o|u", case_insensitive=TRUE
)
extr <- stringi::stri_sub(x, from=loc)
repl <- chartr(extr, old = "a-zA-Z", new = "A-Za-z")
stringi::stri_sub_replace(x, loc, replacement=repl)
#> [1] "HELLo WORLD" "goodbyE world"
Notice that the code is virtually equivalent. We only need to change the locate function.
There is also the stri_locate_ith_boundaries() function, which locates the \(i^\textrm{th}\) text boundary.
strcut_ - functions
The tinycodet R package adds two strcut functions: strcut_loc() and strcut_brk().
The strcut_loc() function cuts every string in a character vector around a location range loc, such that every string is cut into the following parts:
- the sub-string before loc;
- the sub-string at loc itself;
- the sub-string after loc.
The location range loc would usually be a matrix with 2 columns, giving the start and end points of some pattern match.
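The three-way cut can be sketched in base R with substr(). The helper cut_around_sketch() below is hypothetical and only illustrates the idea for a two-column loc matrix with one row per string; it is not the tinycodet implementation, and the column names are borrowed from the strcut_loc() output shown in this article.

```r
# Hypothetical helper (NOT the 'tinycodet' implementation):
# cut each string into the parts before, at, and after a location range.
cut_around_sketch <- function(x, loc) {
  # loc: two-column matrix (start, end), one row per string
  stopifnot(is.matrix(loc), ncol(loc) == 2, nrow(loc) == length(x))
  cbind(
    prepart  = substr(x, 1, loc[, 1] - 1),          # everything before the range
    mainpart = substr(x, loc[, 1], loc[, 2]),       # the range itself
    postpart = substr(x, loc[, 2] + 1, nchar(x))    # everything after the range
  )
}

x <- c("HELLO WORLD", "goodbye world")
loc <- cbind(start = c(5, 7), end = c(5, 7))  # positions of the second-last vowels
cut_around_sketch(x, loc)
```

Pasting the three columns of a row back together reproduces the original string, which is what makes this layout convenient for modifying only the middle part.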
The strcut_brk() function is basically a wrapper around stringi::stri_split_boundaries(..., simplify=NA), with some more conveniently named arguments.
Examples:
x <- rep(paste0(1:10, collapse=""), 10)
print(x)
#> [1] "12345678910" "12345678910" "12345678910" "12345678910" "12345678910"
#> [6] "12345678910" "12345678910" "12345678910" "12345678910" "12345678910"
loc <- stri_locate_ith(x, 1:10, fixed = as.character(1:10))
strcut_loc(x, loc)
#> prepart mainpart postpart
#> [1,] "" "1" "2345678910"
#> [2,] "1" "2" "345678910"
#> [3,] "12" "3" "45678910"
#> [4,] "123" "4" "5678910"
#> [5,] "1234" "5" "678910"
#> [6,] "12345" "6" "78910"
#> [7,] "123456" "7" "8910"
#> [8,] "1234567" "8" "910"
#> [9,] "12345678" "9" "10"
#> [10,] "123456789" "10" ""
strcut_loc(x, c(5,5))
#> prepart mainpart postpart
#> [1,] "1234" "5" "678910"
#> [2,] "1234" "5" "678910"
#> [3,] "1234" "5" "678910"
#> [4,] "1234" "5" "678910"
#> [5,] "1234" "5" "678910"
#> [6,] "1234" "5" "678910"
#> [7,] "1234" "5" "678910"
#> [8,] "1234" "5" "678910"
#> [9,] "1234" "5" "678910"
#> [10,] "1234" "5" "678910"
test <- c("The above-mentioned features are very useful. ",
"Spam, spam, eggs, bacon, and spam. 123 456 789")
strcut_brk(test, "line")
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> [1,] "The " "above-" "mentioned " "features " "are " "very " "useful. "
#> [2,] "Spam, " "spam, " "eggs, " "bacon, " "and " "spam. " "123 "
#> [,8] [,9]
#> [1,] NA NA
#> [2,] "456 " "789"
strcut_brk(test, "word")
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
#> [1,] "The" " " "above" "-" "mentioned" " " "features" " " "are"
#> [2,] "Spam" "," " " "spam" "," " " "eggs" "," " "
#> [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
#> [1,] " " "very" " " "useful" "." " " NA NA NA NA NA
#> [2,] "bacon" "," " " "and" " " "spam" "." " " "123" " " "456"
#> [,21] [,22]
#> [1,] NA NA
#> [2,] " " "789"
strcut_brk(test, "sentence")
#> [,1] [,2]
#> [1,] "The above-mentioned features are very useful. " NA
#> [2,] "Spam, spam, eggs, bacon, and spam. " "123 456 789"
strcut_brk(test, "character")
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
#> [1,] "T" "h" "e" " " "a" "b" "o" "v" "e" "-" "m" "e" "n" "t"
#> [2,] "S" "p" "a" "m" "," " " "s" "p" "a" "m" "," " " "e" "g"
#> [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
#> [1,] "i" "o" "n" "e" "d" " " " " " " " " "f" "e" "a"
#> [2,] "g" "s" "," " " "b" "a" "c" "o" "n" "," " " "a"
#> [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37] [,38]
#> [1,] "t" "u" "r" "e" "s" " " "a" "r" "e" " " "v" "e"
#> [2,] "n" "d" " " "s" "p" "a" "m" "." " " "1" "2" "3"
#> [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48] [,49]
#> [1,] "r" "y" " " "u" "s" "e" "f" "u" "l" "." " "
#> [2,] " " "4" "5" "6" " " "7" "8" "9" NA NA NA
Matrix re-ordering operators
The matrix re-ordering operators are quite handy for re-ordering strings, since the strcut_ functions return matrices, and stri_join_mat() and its aliases concatenate matrices.
See the documentation on matrix operators: ?`%row~%` and ?`%col~%`. See also the “Miscellaneous functionality” article.
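As a rough illustration of what row-wise re-ordering means, here is a base R sketch. It assumes that the row-wise operator re-orders each row independently by the supplied rank matrix, as the sorting examples later in this article suggest; the helper reorder_rows_sketch() is hypothetical and is not the operator's actual implementation.

```r
# Hypothetical helper: re-order every row of `mat` independently,
# using the ranks in the same-shaped matrix `rnk`.
reorder_rows_sketch <- function(mat, rnk) {
  stopifnot(identical(dim(mat), dim(rnk)))
  out <- mat
  for (r in seq_len(nrow(mat))) {
    out[r, ] <- mat[r, order(rnk[r, ])]  # order this row by its own ranks
  }
  out
}

mat <- rbind(c("c", "a", "b"), c("f", "d", "e"))
rnk <- matrix(rank(mat), ncol = ncol(mat))  # global ranks, reshaped like `mat`
reorder_rows_sketch(mat, rnk)
```

Because only the relative order of the ranks within a row matters, ranks computed over the whole matrix work just as well as per-row ranks.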
Matrix joining
The tinycodet package adds a tiny additional function to stringi: stri_join_mat (and its aliases stri_c_mat and stri_paste_mat).
As the name suggests, these functions perform row-wise (margin=1; the default) or column-wise (margin=2) joining of a matrix of strings, thereby transforming it to a vector of strings. You can already do this with base R and stringi, but it requires converting the matrix to a data.frame or list, and then calling stri_join inside do.call(), which to me just seems too much trouble for something so abysmally simple.
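For comparison, the do.call() route might look roughly like this. It is only a sketch: base paste0() stands in for stri_join so that it runs without extra packages, and the small matrix is made up for illustration.

```r
mat <- rbind(c("a", "b", "c"), c("d", "e", "f"))

# Row-wise joining: every column of the matrix becomes one argument to paste0()
rowwise <- do.call(paste0, as.data.frame(mat))

# Column-wise joining: transpose first, so every original row becomes one argument
colwise <- do.call(paste0, as.data.frame(t(mat)))

print(rowwise)
print(colwise)
```

The conversion step and the transpose needed for the column-wise case are exactly the kind of ceremony that a margin argument avoids.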
Here are some examples of their usage when re-ordering characters, words, or sentences:
# sorting characters in strings:
x <- c(paste(sample(letters), collapse = ""), paste(sample(letters), collapse = ""))
print(x)
#> [1] "mwlefoudviyphzbgtjsqxnkrac" "vkgufijbqazscrwoehmdnxtpyl"
mat <- strcut_brk(x)
rank <- stringi::stri_rank(as.vector(mat)) |> matrix(ncol=ncol(mat))
sorted <- mat %row~% rank
print(sorted)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
#> [1,] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n"
#> [2,] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n"
#> [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
#> [1,] "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
#> [2,] "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
stri_join_mat(sorted, margin=1)
#> [1] "abcdefghijklmnopqrstuvwxyz" "abcdefghijklmnopqrstuvwxyz"
stri_join_mat(sorted, margin=2)
#> [1] "aa" "bb" "cc" "dd" "ee" "ff" "gg" "hh" "ii" "jj" "kk" "ll" "mm" "nn" "oo"
#> [16] "pp" "qq" "rr" "ss" "tt" "uu" "vv" "ww" "xx" "yy" "zz"
# sorting words:
x <- c("2nd 3rd 1st", "Goodbye everyone")
print(x)
#> [1] "2nd 3rd 1st" "Goodbye everyone"
mat <- strcut_brk(x, "word")
rank <- stringi::stri_rank(as.vector(mat)) |> matrix(ncol=ncol(mat))
sorted <- mat %row~% rank
sorted[is.na(sorted)] <- ""
stri_c_mat(sorted, margin=1, sep = " ") # <- alias for stri_join_mat
#> [1] " 1st 2nd 3rd" " everyone Goodbye "
stri_c_mat(sorted, margin=2, sep = " ")
#> [1] " " " everyone" "1st Goodbye" "2nd " "3rd "
# randomly shuffle sentences:
x <- c("Hello, who are you? Oh, really?! Cool!", "I don't care. But I really don't.")
print(x)
#> [1] "Hello, who are you? Oh, really?! Cool!"
#> [2] "I don't care. But I really don't."
mat <- strcut_brk(x, "sentence")
rank <- sample(seq_along(mat)) |> matrix(ncol = ncol(mat))
sorted <- mat %row~% rank
sorted[is.na(sorted)] <- ""
stri_paste_mat(sorted, margin=1) # <- another alias for stri_join_mat
#> [1] "Hello, who are you? Oh, really?! Cool!"
#> [2] "But I really don't.I don't care. "
stri_paste_mat(sorted, margin=2)
#> [1] "Hello, who are you? " "Oh, really?! But I really don't."
#> [3] "Cool!I don't care. "