hans glick

Reputation: 2611

Count sequence of blanks in a string with R

I want to retrieve the lengths of the runs of consecutive blanks in a string. For example:

mystring="lalalal  lalalal lalala   lalalala "
retrieve_sequence_of_consecutive_blanks(mystring)
[1] 2 1 3 1

I already have a solution:

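sequence_of_blanks=function(vectors_of_strings){
  # split each string into single characters
  tokens=strsplit(x = vectors_of_strings,split = "",fixed = TRUE)
  # run-length encode each character vector
  sequence=lapply(X = tokens,FUN = rle)
  # keep only the lengths of the runs of blanks (one list element per input string)
  lapply(sequence, function(item){
    item$lengths[which(item$values==" ")]
  })
}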

My question is about performance: is there a better way to do this? What about a regex solution? What about a Python solution?

Upvotes: 2

Views: 64

Answers (3)

s_baldur

Reputation: 33488

If you want a bit more performance using only base R:

length_seq_blanks <- function(string) {
  # splitting on "[a-z]+" assumes the non-blank text is lowercase letters only
  x <- nchar(unlist(strsplit(string, "[a-z]+")))
  x[x > 0]
}

length_seq_blanks(mystring)
[1] 2 1 3 1

Benchmark

microbenchmark::microbenchmark(
  snoram = {
    length_seq_blanks <- function(string) {
      x <- nchar(unlist(strsplit(string, "[a-z]+")))
      x[x > 0]
    }
    length_seq_blanks(mystring)
  },
  fprive = {
    myrle <- rle(charToRaw(mystring) == charToRaw(" "))
    myrle$lengths[myrle$values]
  },
  unit = "relative"
)
Unit: relative
   expr      min       lq     mean   median       uq     max neval
 snoram 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000   100
 fprive 1.866597 1.818247 1.734015 1.684211 1.634093 1.20812   100

Upvotes: 0

F. Privé

Reputation: 11728

You could use

myrle <- rle(charToRaw(mystring) == charToRaw(" "))
myrle$lengths[myrle$values]

which is a bit faster:

library(stringr)  # str_locate_all() and str_extract_all() come from stringr
microbenchmark::microbenchmark(
  OP = sequence_of_blanks(mystring),
  akrun = tabulate(cumsum(c(TRUE, diff(str_locate_all(mystring, " ")[[1]][,2]) !=1))),
  wiktor = nchar(unlist(str_extract_all(mystring, " +"))),
  # charToRaw(mystring) == charToRaw(" "),
  fprive = { myrle <- rle(charToRaw(mystring) == charToRaw(" ")); myrle$lengths[myrle$values] }
)

Unit: microseconds
   expr    min     lq     mean  median      uq     max neval
     OP 32.826 37.680 42.97734 42.3940 46.3405 115.239   100
  akrun 40.718 44.874 48.40903 48.4360 50.7050  78.991   100
 wiktor 24.166 29.753 34.73199 35.0955 36.7370 129.626   100
 fprive 23.258 25.877 29.50010 28.6000 31.6730  43.721   100

If you really need performance, writing a dedicated Rcpp function for your particular use case, taking charToRaw(mystring) and charToRaw(" ") as arguments, should be faster still.
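Since the question also asks about Python, the same run-length idea can be sketched there with itertools.groupby. This is only an illustration: the function name blank_runs is made up for this example, and it has not been benchmarked against the R versions above.

```python
from itertools import groupby


def blank_runs(text, blank=" "):
    """Return the lengths of the consecutive runs of `blank` in `text`."""
    # groupby yields one (key, run) pair per maximal run of equal keys,
    # which plays the role of rle() in the R code.
    return [
        sum(1 for _ in run)
        for is_blank, run in groupby(text, key=lambda ch: ch == blank)
        if is_blank
    ]


mystring = "lalalal  lalalal lalala   lalalala "
print(blank_runs(mystring))  # [2, 1, 3, 1]
```

Like the rle() version, this makes a single pass over the characters and needs no regex.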

Upvotes: 2

Wiktor Stribiżew

Reputation: 626926

You may match all space chunks and get their lengths, e.g.

library(stringr)
nchar(unlist(str_extract_all(mystring, " +")))

Or the base R equivalent:

nchar(unlist(regmatches(mystring, gregexpr(" +", mystring))))

Both yield

[1] 2 1 3 1

In Python, you may use

import re
[x.count(" ") for x in re.findall(" +", mystring)]


If you plan to count any whitespace, replace the literal space with \s. Tweak as per your further requirements.
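A self-contained version of the Python approach, with the whitespace variant as an option, might look like the sketch below. The function name count_blank_runs and its any_whitespace flag are hypothetical, introduced only for this example.

```python
import re


def count_blank_runs(text, any_whitespace=False):
    # r"\s+" matches runs of any whitespace; " +" matches literal spaces only.
    pattern = r"\s+" if any_whitespace else " +"
    return [len(run) for run in re.findall(pattern, text)]


mystring = "lalalal  lalalal lalala   lalalala "
print(count_blank_runs(mystring))                          # [2, 1, 3, 1]
print(count_blank_runs("a \t b\n", any_whitespace=True))  # [3, 1]
```

With the default pattern, the tab in the second string would split the run, giving [1, 1] instead.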

Upvotes: 2
