Reputation: 2611
I want to retrieve the lengths of the runs of consecutive blanks in a string. For example:
mystring="lalalal  lalalal lalala   lalalala "
retrieve_sequence_of_consecutive_blanks(mystring)
[1] 2 1 3 1
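I already have a solution:
sequence_of_blanks <- function(vectors_of_strings) {
  # Split each string into single characters
  tokens <- strsplit(x = vectors_of_strings, split = "", fixed = TRUE)
  # Run-length encode each character vector
  sequence <- lapply(X = tokens, FUN = rle)
  # Keep only the lengths of the runs of spaces (one result per input string)
  lapply(sequence, function(item) item$lengths[item$values == " "])
}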
My question is about performance: is there a better way to do this? What about a regex solution? What about a Python solution?
Upvotes: 2
Views: 64
Reputation: 33488
If you want a bit more performance using simple base R (this splits on "[a-z]+", so it assumes the non-blank characters are all lowercase letters):
length_seq_blanks <- function(string) {
x <- nchar(unlist(strsplit(string, "[a-z]+")))
x[x > 0]
}
length_seq_blanks(mystring)
[1] 2 1 3 1
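Since the question also asks about Python, here is a minimal sketch of the same split-based idea. It splits on runs of non-space characters rather than on "[a-z]+", so it does not depend on the letters being lowercase. The function name blank_run_lengths_split is made up for illustration, and the example string assumes runs of 2, 1, 3 and 1 blanks, as the expected output indicates.

```python
import re

def blank_run_lengths_split(text):
    """Return the lengths of the runs of spaces in text (hypothetical helper)."""
    # Splitting on runs of non-space characters leaves only the runs of spaces,
    # plus empty strings at the edges when text starts or ends with a non-space.
    pieces = re.split(r"[^ ]+", text)
    # Drop the empty edge pieces and measure what is left.
    return [len(piece) for piece in pieces if piece]

mystring = "lalalal  lalalal lalala   lalalala "
print(blank_run_lengths_split(mystring))  # [2, 1, 3, 1]
```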
Benchmark
microbenchmark::microbenchmark(
snoram = {
length_seq_blanks <- function(string) {
x <- nchar(unlist(strsplit(string, "[a-z]+")))
x[x > 0]
}
length_seq_blanks(mystring)
},
fprive = {
myrle <- rle(charToRaw(mystring) == charToRaw(" "))
myrle$lengths[myrle$values]
},
unit = "relative"
)
Unit: relative
expr min lq mean median uq max neval
snoram 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 100
fprive 1.866597 1.818247 1.734015 1.684211 1.634093 1.20812 100
Upvotes: 0
Reputation: 11728
You could use
myrle <- rle(charToRaw(mystring) == charToRaw(" "))
myrle$lengths[myrle$values]
which is a bit faster:
microbenchmark::microbenchmark(
OP = sequence_of_blanks(mystring),
akrun = tabulate(cumsum(c(TRUE, diff(str_locate_all(mystring, " ")[[1]][,2]) !=1))),
wiktor = nchar(unlist(str_extract_all(mystring, " +"))),
# charToRaw(mystring) == charToRaw(" "),
fprive = { myrle <- rle(charToRaw(mystring) == charToRaw(" ")); myrle$lengths[myrle$values] }
)
Unit: microseconds
expr min lq mean median uq max neval
OP 32.826 37.680 42.97734 42.3940 46.3405 115.239 100
akrun 40.718 44.874 48.40903 48.4360 50.7050 78.991 100
wiktor 24.166 29.753 34.73199 35.0955 36.7370 129.626 100
fprive 23.258 25.877 29.50010 28.6000 31.6730 43.721 100
If you really need performance, writing an Rcpp function for this specific task, taking charToRaw(mystring) and charToRaw(" ") as arguments, should be faster still.
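For the Python part of the question, the run-length idea used above can be sketched with itertools.groupby, which plays the role of rle here. This is only an illustration under the assumption that a "blank" is a single space character; the function name blank_run_lengths_rle is hypothetical, and no claim is made about its speed relative to the R versions.

```python
from itertools import groupby

def blank_run_lengths_rle(text, blank=" "):
    """Run-length encode text as blank / non-blank and keep the blank runs."""
    # groupby yields one (key, run) pair per maximal run of equal keys;
    # the key is True for blanks and False for every other character.
    return [
        sum(1 for _ in run)
        for is_blank, run in groupby(text, key=lambda ch: ch == blank)
        if is_blank
    ]

mystring = "lalalal  lalalal lalala   lalalala "
print(blank_run_lengths_rle(mystring))  # [2, 1, 3, 1]
```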
Upvotes: 2
Reputation: 626926
You may match all space chunks and get their lengths, e.g.
library(stringr)
nchar(unlist(str_extract_all(mystring, " +")))
Or the base R equivalent:
nchar(unlist(regmatches(mystring, gregexpr(" +", mystring))))
Both yield
[1] 2 1 3 1
In Python, you may use
import re
[x.count(" ") for x in re.findall(" +", mystring)]
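If you plan to count any whitespace, replace the literal space with \s (in the Python version, also use len(x) instead of x.count(" "), since a match may then contain tabs or newlines). Tweak as per your further requirements.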
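A slightly fuller Python sketch of the regex approach, handling a list of strings the way the function in the question does. The name blank_run_lengths and its pattern parameter are assumptions made for this example; measuring the match with len means the same code also works when the pattern is switched to \s+.

```python
import re

def blank_run_lengths(strings, pattern=r" +"):
    """For each string, return the lengths of all matches of pattern."""
    compiled = re.compile(pattern)  # compile once, reuse for every string
    return [[len(m.group()) for m in compiled.finditer(s)] for s in strings]

mystring = "lalalal  lalalal lalala   lalalala "
print(blank_run_lengths([mystring, "a b"]))            # [[2, 1, 3, 1], [1]]
print(blank_run_lengths(["a \t b"], pattern=r"\s+"))   # [[3]]
```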
Upvotes: 2