user4275591

Count characters in a string (excluding spaces) in R?

I want to count the number of characters in a string (excluding spaces) and I'd like to know if my approach can be improved.

Suppose I have:

x <- "hello to you"

I know nchar() will give me the number of characters in a string (including spaces):

> nchar(x)
[1] 12

But I'd like to return the following (excluding spaces):

[1] 10

To this end, I've done the following:

> nchar(gsub(" ", "", x))
[1] 10

My worry is that gsub() will be slow when applied to many strings. Is this the right approach, or is there an nchar()-like function that returns the number of characters without counting spaces?

Thanks in advance.

Upvotes: 4

Views: 7288

Answers (2)

Alexey Ferapontov

Reputation: 5169

Indeed, stringi seems most appropriate here. Try this:

library(stringi)
x <- "hello to you"
stri_stats_latex(x)

Result:

CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds        Envirs 
       10             0             2             3             0             0

If you need a single value in a variable, index the result by position or by name, e.g. stri_stats_latex(x)[1] or stri_stats_latex(x)["CharsWord"].
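
One thing to watch: as far as I can tell, stri_stats_latex() reports totals for the whole character vector rather than one count per element. If you need per-string counts, a minimal sketch is to call it once per string. The `strings` vector and the `chars_per_string` name below are illustrative only, and I have not checked how punctuation or LaTeX-style input (backslashes, braces) affects CharsWord.

```r
library(stringi)

strings <- c("hello to you", "a b")  # illustrative input, not from the question

# Call stri_stats_latex() once per string and keep only the CharsWord entry.
# numeric(1) is used as the template so the result is accepted whether the
# counts come back as integer or double.
chars_per_string <- vapply(
  strings,
  function(s) stri_stats_latex(s)[["CharsWord"]],
  numeric(1),
  USE.NAMES = FALSE
)
chars_per_string
```

This loops in R, so for many strings it will likely be slower than a fully vectorised function.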

Upvotes: 2

A5C1D2H2I1M1N2O1R2T1

Reputation: 193517

Building on Richard's comment, the "stringi" package is a good option here.

One approach is to compute the total string length and subtract the number of spaces.

Compare the two approaches:

library(stringi)
library(microbenchmark)

x <- "hello to you"
x
# [1] "hello to you"
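# stringi: total length minus the number of literal spaces
fun1 <- function(x) stri_length(x) - stri_count_fixed(x, " ")
# base R: remove the spaces, then count what is left
fun2 <- function(x) nchar(gsub(" ", "", x))
# One long string: 1,000,000 copies of x joined by five spaces
y <- paste(rep(x, 1000000), collapse = "     ")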

microbenchmark(fun1(x), fun2(x))
# Unit: microseconds
#     expr   min    lq     mean median      uq    max neval
#  fun1(x) 5.560 5.988  8.65163  7.270  8.1255 44.047   100
#  fun2(x) 9.408 9.837 12.84670 10.691 12.4020 57.732   100
microbenchmark(fun1(y), fun2(y), times = 10)
# Unit: milliseconds
#     expr        min         lq      mean     median         uq        max neval
#  fun1(y)   68.22904   68.50273   69.6419   68.63914   70.47284   75.17682    10
#  fun2(y) 2009.14710 2011.05178 2042.8123 2030.10502 2079.87224 2090.09142    10
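
If installing stringi is not an option, the same length-minus-spaces idea can be sketched in base R. The helper name `count_non_space` is made up for this example, and I have not benchmarked it against the functions above, so treat it as an illustration of the technique rather than a speed recommendation. Like the code above, it only treats the literal " " as a space (not tabs or newlines).

```r
# Hypothetical helper; base R only, no stringi required.
count_non_space <- function(x) {
  # Positions of every literal space in each string (-1 means "no match")
  hits <- gregexpr(" ", x, fixed = TRUE)
  # Number of spaces per string: count only the real (positive) positions
  n_spaces <- vapply(hits, function(h) sum(h > 0L), integer(1))
  # Total length minus the number of spaces
  nchar(x) - n_spaces
}

count_non_space(c("hello to you", "nospaces", ""))
# [1] 10  8  0
```

It is vectorised over x, so it can be applied to a whole column of strings in one call.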

Upvotes: 8
