library(tinycodet)
#> Run `?tinycodet::tinycodet` to open the introduction help page of 'tinycodet'.

 

Introduction

Virtually every programming language, even those primarily focused on mathematics, will at some point have to deal with strings. R’s atomic classes basically boil down to some form of either numbers or characters. R’s numerical functions are generally very fast. But R’s native string functions are somewhat slow, do not have a unified naming scheme, and are not as comprehensive as R’s impressive numerical functions.

The primary R package that fixes this is ‘stringi’. ‘stringi’ is the fastest and most comprehensive string manipulation package available at the time of writing. Many string-related packages depend entirely on ‘stringi’. The ‘stringr’ package, for example, is merely a thin wrapper around ‘stringi’.

The ‘tinycodet’ package adds a bit more functionality to the ‘stringi’ string manipulation capabilities.

 

stri_locate_ith

Suppose one wants to transform the first vowel in each string of a character vector x, such that an upper case vowel becomes lower case, and vice-versa. One can do that entirely with stringi and base R as follows:


x <- c("HELLO WORLD", "goodbye world")
loc <- stringi::stri_locate_first(x, regex="a|e|i|o|u", case_insensitive=TRUE)
extr <- stringi::stri_sub(x, from=loc)
repl <- chartr(extr, old = "a-zA-Z", new = "A-Za-z")
stringi::stri_sub_replace(x, loc, replacement=repl)
#> [1] "HeLLO WORLD"   "gOodbye world"

But now suppose one wants to transform the second-last vowel instead. That is not impossible with stringi alone, but it is not straightforward either. For clear code, stringi really needs some kind of “stri_locate_ith” function. And, of course, the tinycodet package provides just that.

The stri_locate_ith(str, i, ...) function locates, for every element/string in character vector str, the \(i^\textrm{th}\) occurrence of some (regex/fixed/etc.) pattern. When i is positive, the occurrence is counted from left to right. Negative values for i are also allowed, in which case the occurrence is counted from right to left. i=0 is not allowed. Thus, to get the second occurrence of some pattern, use i=2, and to get the second-last occurrence, use i=-2.
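To make the counting convention concrete, here is a minimal sketch that uses only stringi and base R. The helper locate_ith_manual() is hypothetical and written purely for illustration; it is not part of tinycodet or stringi, and it is not how stri_locate_ith() is implemented. In particular, it returns NA when a string has fewer matches than requested, which may differ from what the package function does.

```r
library(stringi)

# Hypothetical helper, for illustration only (not part of tinycodet or stringi):
# returns the start/end of the i-th regex match in each string of x.
locate_ith_manual <- function(x, i, pattern, ...) {
  stopifnot(length(i) == 1L, i != 0)
  # one 2-column (start, end) matrix per string
  all_loc <- stri_locate_all_regex(x, pattern, ...)
  pick <- function(m) {
    n <- nrow(m)
    # i > 0 counts from the left, i < 0 counts from the right
    k <- if (i > 0) i else n + i + 1L
    # assumption of this sketch: NA when there is no such match
    if (anyNA(m) || k < 1L || k > n) {
      return(c(start = NA_integer_, end = NA_integer_))
    }
    m[k, ]
  }
  do.call(rbind, lapply(all_loc, pick))
}

x <- c("HELLO WORLD", "goodbye world")
loc_2nd      <- locate_ith_manual(x, 2, "a|e|i|o|u", case_insensitive = TRUE)
loc_2nd_last <- locate_ith_manual(x, -2, "a|e|i|o|u", case_insensitive = TRUE)

stri_sub(x, loc_2nd)       # second vowel of each string
stri_sub(x, loc_2nd_last)  # second-last vowel of each string
```

Even this simplified version needs a list of matrices, an index calculation, and an rbind() step, which is the boilerplate that a dedicated locate function removes.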

The stri_locate_ith(str, i, ...) function uses exactly the same argument names and conventions as stringi, to keep your code consistent. And just like stringi::stri_locate_first/last, the stri_locate_ith(str, i, ...) function is a vectorized function: str and i as well as the pattern (regex, fixed, coll, charclass) can all be different-valued vectors.

 

To transform the second-last occurrence, one can now use stri_locate_ith() in a very similar way as was done with stri_locate_first/last:

x <- c("HELLO WORLD", "goodbye world")

loc <- stri_locate_ith( # this part is the key-difference
  x, -2, regex="a|e|i|o|u", case_insensitive=TRUE
)

extr <- stringi::stri_sub(x, from=loc)
repl <- chartr(extr, old = "a-zA-Z", new = "A-Za-z")
stringi::stri_sub_replace(x, loc, replacement=repl)
#> [1] "HELLo WORLD"   "goodbyE world"

Notice that the code is virtually identical to the first example. Only the locate function needs to change.

 

There is also the stri_locate_ith_boundaries() function, which of course locates the \(i^\textrm{th}\) text boundary.
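As a rough illustration of what locating the \(i^\textrm{th}\) text boundary means, the following sketch finds the second-last word of a string using stringi alone. It only demonstrates the idea and is not the implementation of stri_locate_ith_boundaries(); the choice of type = "word" with skip_word_none = TRUE is an assumption made for this example.

```r
library(stringi)

x <- "Spam, spam, eggs, bacon, and spam."

# locate all word segments, skipping whitespace and punctuation
words <- stri_locate_all_boundaries(x, type = "word", skip_word_none = TRUE)[[1]]

# second-last word: the row just above the final one
loc <- words[nrow(words) - 1L, , drop = FALSE]
stri_sub(x, loc)
```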

 

strcut_ - functions

The tinycodet R package adds 2 strcut functions: strcut_loc() and strcut_brk().

The strcut_loc() function cuts every string in a character vector around a location range loc, such that every string is cut into the following parts:

  • the sub-string before loc;
  • the sub-string at loc itself;
  • the sub-string after loc.

The location range loc would usually be a matrix with 2 columns, giving the start and end points of some pattern match.
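The three parts can be pictured with plain stringi calls. The sketch below is an illustration of the idea under simple assumptions, not the implementation of strcut_loc(): it borrows the column names prepart, mainpart and postpart from the output shown on this page, and it does not treat strings without a match, or matches at the very start or end of a string, specially.

```r
library(stringi)

x <- c("HELLO WORLD", "goodbye world")

# 2-column matrix with the start and end of the first vowel in each string
loc <- stri_locate_first_regex(x, "a|e|i|o|u", case_insensitive = TRUE)
start <- loc[, "start"]
end   <- loc[, "end"]

parts <- cbind(
  prepart  = stri_sub(x, from = 1L, to = start - 1L),  # before loc
  mainpart = stri_sub(x, from = start, to = end),      # at loc itself
  postpart = stri_sub(x, from = end + 1L)              # after loc
)
parts
```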

The strcut_brk() function is basically a wrapper around stringi::stri_split_boundaries(..., simplify=NA), with some more conveniently named arguments.
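For reference, this is roughly what the underlying stringi call looks like when used directly. The sketch only shows the stringi side; how strcut_brk() maps its own arguments onto it is not shown here.

```r
library(stringi)

test <- c("The above-mentioned    features are very useful. ",
          "Spam, spam, eggs, bacon, and spam. 123 456 789")

# simplify = NA returns a character matrix with one row per input string;
# rows with fewer pieces are padded with NA
mat <- stri_split_boundaries(test, type = "sentence", simplify = NA)
dim(mat)
```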

Examples:


x <- rep(paste0(1:10, collapse=""), 10)
print(x)
#>  [1] "12345678910" "12345678910" "12345678910" "12345678910" "12345678910"
#>  [6] "12345678910" "12345678910" "12345678910" "12345678910" "12345678910"
loc <- stri_locate_ith(x, 1:10, fixed = as.character(1:10))
strcut_loc(x, loc)
#>       prepart     mainpart postpart    
#>  [1,] ""          "1"      "2345678910"
#>  [2,] "1"         "2"      "345678910" 
#>  [3,] "12"        "3"      "45678910"  
#>  [4,] "123"       "4"      "5678910"   
#>  [5,] "1234"      "5"      "678910"    
#>  [6,] "12345"     "6"      "78910"     
#>  [7,] "123456"    "7"      "8910"      
#>  [8,] "1234567"   "8"      "910"       
#>  [9,] "12345678"  "9"      "10"        
#> [10,] "123456789" "10"     ""
strcut_loc(x, c(5,5))
#>       prepart mainpart postpart
#>  [1,] "1234"  "5"      "678910"
#>  [2,] "1234"  "5"      "678910"
#>  [3,] "1234"  "5"      "678910"
#>  [4,] "1234"  "5"      "678910"
#>  [5,] "1234"  "5"      "678910"
#>  [6,] "1234"  "5"      "678910"
#>  [7,] "1234"  "5"      "678910"
#>  [8,] "1234"  "5"      "678910"
#>  [9,] "1234"  "5"      "678910"
#> [10,] "1234"  "5"      "678910"


test <- c("The above-mentioned    features are very useful. ",
"Spam, spam, eggs, bacon, and spam. 123 456 789")
strcut_brk(test, "line")
#>      [,1]     [,2]     [,3]            [,4]        [,5]   [,6]     [,7]      
#> [1,] "The "   "above-" "mentioned    " "features " "are " "very "  "useful. "
#> [2,] "Spam, " "spam, " "eggs, "        "bacon, "   "and " "spam. " "123 "    
#>      [,8]   [,9] 
#> [1,] NA     NA   
#> [2,] "456 " "789"
strcut_brk(test, "word")
#>      [,1]   [,2] [,3]    [,4]   [,5]        [,6]   [,7]       [,8] [,9] 
#> [1,] "The"  " "  "above" "-"    "mentioned" "    " "features" " "  "are"
#> [2,] "Spam" ","  " "     "spam" ","         " "    "eggs"     ","  " "  
#>      [,10]   [,11]  [,12] [,13]    [,14] [,15]  [,16] [,17] [,18] [,19] [,20]
#> [1,] " "     "very" " "   "useful" "."   " "    NA    NA    NA    NA    NA   
#> [2,] "bacon" ","    " "   "and"    " "   "spam" "."   " "   "123" " "   "456"
#>      [,21] [,22]
#> [1,] NA    NA   
#> [2,] " "   "789"
strcut_brk(test, "sentence")
#>      [,1]                                                [,2]         
#> [1,] "The above-mentioned    features are very useful. " NA           
#> [2,] "Spam, spam, eggs, bacon, and spam. "               "123 456 789"
strcut_brk(test, "character")
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
#> [1,] "T"  "h"  "e"  " "  "a"  "b"  "o"  "v"  "e"  "-"   "m"   "e"   "n"   "t"  
#> [2,] "S"  "p"  "a"  "m"  ","  " "  "s"  "p"  "a"  "m"   ","   " "   "e"   "g"  
#>      [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
#> [1,] "i"   "o"   "n"   "e"   "d"   " "   " "   " "   " "   "f"   "e"   "a"  
#> [2,] "g"   "s"   ","   " "   "b"   "a"   "c"   "o"   "n"   ","   " "   "a"  
#>      [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37] [,38]
#> [1,] "t"   "u"   "r"   "e"   "s"   " "   "a"   "r"   "e"   " "   "v"   "e"  
#> [2,] "n"   "d"   " "   "s"   "p"   "a"   "m"   "."   " "   "1"   "2"   "3"  
#>      [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48] [,49]
#> [1,] "r"   "y"   " "   "u"   "s"   "e"   "f"   "u"   "l"   "."   " "  
#> [2,] " "   "4"   "5"   "6"   " "   "7"   "8"   "9"   NA    NA    NA

 

Matrix re-ordering operators

The matrix re-ordering operators are quite handy for re-ordering strings, since the strcut_ - functions return matrices, and stri_join_mat() and its aliases join matrices back into strings.

See the documentation on matrix operators: ?`%row~%` and ?`%col~%`
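As a rough mental model of row-wise re-ordering, the base R sketch below re-orders every row of a matrix independently, using the corresponding row of a second matrix of ranks. The helper reorder_rows() is hypothetical and written only for illustration; it assumes the behaviour suggested by the examples on this page and is not the package's implementation of `%row~%`.

```r
# Hypothetical base-R helper, for illustration only:
# re-orders each row of x by the ordering of the same row in ord_mat.
reorder_rows <- function(x, ord_mat) {
  stopifnot(is.matrix(x), identical(dim(x), dim(ord_mat)))
  out <- x
  for (r in seq_len(nrow(x))) {
    out[r, ] <- x[r, order(ord_mat[r, ])]
  }
  out
}

mat <- matrix(c("c", "a", "b",
                "z", "y", "x"), nrow = 2, byrow = TRUE)

# rank all elements at once, then reshape to the dimensions of mat
rk <- matrix(rank(as.vector(mat)), ncol = ncol(mat))

reorder_rows(mat, rk)
```

Because order() places NA last, padded NA cells in a ragged matrix would end up at the end of each row in this sketch.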

See also the “Miscellaneous functionality” article.

 

Matrix joining

The tinycodet package adds a tiny additional function to stringi:

stri_join_mat (and its aliases stri_c_mat and stri_paste_mat).

As the name suggests, these functions join a matrix of strings row-wise (margin=1; the default) or column-wise (margin=2), thereby transforming it into a vector of strings. This can already be done with stringi and base R, but it requires converting the matrix to a data.frame or list and then calling stri_join inside do.call(), which to me seems like too much trouble for something so simple.
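For comparison, the do.call() route could look something like the following sketch. It converts the matrix to a list of columns (or rows) and passes them as separate arguments to stringi::stri_join(); the small example matrix is made up for illustration.

```r
library(stringi)

mat <- matrix(c("a", "b", "c",
                "d", "e", "f"), nrow = 2, byrow = TRUE)

# row-wise joining: pass each column as a separate argument to stri_join
cols <- lapply(seq_len(ncol(mat)), function(j) mat[, j])
rowwise <- do.call(stri_join, c(cols, list(sep = "")))

# column-wise joining: pass each row as a separate argument instead
rows <- lapply(seq_len(nrow(mat)), function(r) mat[r, ])
colwise <- do.call(stri_join, c(rows, list(sep = "")))

rowwise
colwise
```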

Here is an example of their usage when re-ordering characters, words, or sentences:


# sorting characters in strings:
x <- c(paste(sample(letters), collapse = ""), paste(sample(letters), collapse = ""))
print(x)
#> [1] "mwlefoudviyphzbgtjsqxnkrac" "vkgufijbqazscrwoehmdnxtpyl"
mat <- strcut_brk(x)
rank <- stringi::stri_rank(as.vector(mat)) |> matrix(ncol=ncol(mat))
sorted <- mat %row~% rank
print(sorted)
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
#> [1,] "a"  "b"  "c"  "d"  "e"  "f"  "g"  "h"  "i"  "j"   "k"   "l"   "m"   "n"  
#> [2,] "a"  "b"  "c"  "d"  "e"  "f"  "g"  "h"  "i"  "j"   "k"   "l"   "m"   "n"  
#>      [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
#> [1,] "o"   "p"   "q"   "r"   "s"   "t"   "u"   "v"   "w"   "x"   "y"   "z"  
#> [2,] "o"   "p"   "q"   "r"   "s"   "t"   "u"   "v"   "w"   "x"   "y"   "z"
stri_join_mat(sorted, margin=1)
#> [1] "abcdefghijklmnopqrstuvwxyz" "abcdefghijklmnopqrstuvwxyz"
stri_join_mat(sorted, margin=2)
#>  [1] "aa" "bb" "cc" "dd" "ee" "ff" "gg" "hh" "ii" "jj" "kk" "ll" "mm" "nn" "oo"
#> [16] "pp" "qq" "rr" "ss" "tt" "uu" "vv" "ww" "xx" "yy" "zz"

# sorting words:
x <- c("2nd 3rd 1st", "Goodbye everyone")
print(x)
#> [1] "2nd 3rd 1st"      "Goodbye everyone"
mat <- strcut_brk(x, "word")
rank <- stringi::stri_rank(as.vector(mat)) |> matrix(ncol=ncol(mat))
sorted <- mat %row~% rank
sorted[is.na(sorted)] <- ""
stri_c_mat(sorted, margin=1, sep = " ") # <- alias for stri_join_mat
#> [1] "    1st 2nd 3rd"      "  everyone Goodbye  "
stri_c_mat(sorted, margin=2, sep = " ")
#> [1] "   "         "  everyone"  "1st Goodbye" "2nd "        "3rd "

# randomly shuffle sentences:
x <- c("Hello, who are you? Oh, really?! Cool!", "I don't care. But I really don't.")
print(x)
#> [1] "Hello, who are you? Oh, really?! Cool!"
#> [2] "I don't care. But I really don't."
mat <- strcut_brk(x, "sentence")
rank <- sample(seq_along(mat)) |> matrix(ncol = ncol(mat))
sorted <- mat %row~% rank
sorted[is.na(sorted)] <- ""
stri_paste_mat(sorted, margin=1) # <- another alias for stri_join_mat
#> [1] "Hello, who are you? Oh, really?! Cool!"
#> [2] "But I really don't.I don't care. "
stri_paste_mat(sorted, margin=2)
#> [1] "Hello, who are you? "             "Oh, really?! But I really don't."
#> [3] "Cool!I don't care. "