Linens

Reputation: 7942

Fast counting of characters in character vector

I have a very long vector of single characters, e.g. somechars<-c("A","B","C","A"...) (its length is somewhere in the millions).

What is the fastest way to count the total occurrences of, say, "A" and "B" in this vector? I have tried grep and lapply, but both take too long to execute.

My current solution is:

tmp<-table(somechars)
sum(tmp["A"],tmp["B"])

But this still takes a while to compute. Is there a faster way to do this? Or is there a package that already does this faster? I've looked into the stringr package, but it uses a simple grep.

Upvotes: 3

Views: 3617

Answers (4)

Joshua Ulrich

Reputation: 176668

Regular expressions are expensive. You can get the result in your question with exact comparison.

> somechars <- sample(LETTERS, 5e6, TRUE)
> sum(c(somechars=="A",somechars=="B"))
[1] 385675
> system.time(sum(c(somechars=="A",somechars=="B")))
   user  system elapsed 
  0.416   0.072   0.487 
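To make the exact-comparison idea concrete, here is a minimal sketch on a tiny hand-written vector (the seven-element input and the names is_a, is_b and count_ab are only illustrative, not from the question). Each == produces a logical vector, and sum() counts its TRUE values.

```r
# Tiny illustrative input; the real vector has millions of elements.
somechars <- c("A", "B", "C", "A", "D", "B", "A")

# Each comparison yields a logical vector the same length as the input.
is_a <- somechars == "A"
is_b <- somechars == "B"

# sum() treats TRUE as 1 and FALSE as 0, so this counts the matches.
count_ab <- sum(is_a) + sum(is_b)
print(count_ab)  # 3 "A"s plus 2 "B"s: prints 5
```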

UPDATED to include timings from the OP and other answers. I also included a test with more than two characters.

> library(rbenchmark)
> benchmark( replications=5, order="relative",
+   grep = sum(grepl("A|B",somechars)),
+   table = sum(table(somechars)[c("A","B")]),
+   c = sum(c(somechars=="A",somechars=="B")),
+   OR = sum(somechars=="A"|somechars=="B"),
+   IN = sum(somechars %in% c("A","B")),
+   plus = sum(somechars=="A")+sum(somechars=="B") )
   test replications elapsed relative user.self sys.self user.child sys.child
6  plus            5   4.289 1.000000     3.836    0.436          0         0
3     c            5   4.991 1.163675     4.156    0.804          0         0
5    IN            5   5.480 1.277687     4.549    0.880          0         0
4    OR            5   5.574 1.299604     5.000    0.544          0         0
1  grep            5  16.426 3.829797    16.205    0.172          0         0
2 table            5  17.834 4.158079    12.793    4.884          0         0
> 
> benchmark( replications=5, order="relative",
+   grep = sum(grepl("A|B|C|D",somechars)),
+   table = sum(table(somechars)[c("A","B","C","D")]),
+   c = sum(c(somechars=="A",somechars=="B",
+             somechars=="C",somechars=="D")),
+   OR = sum(somechars=="A"|somechars=="B"|
+            somechars=="C"|somechars=="D"),
+   IN = sum(somechars %in% c("A","B","C","D")),
+   plus = sum(somechars=="A")+sum(somechars=="B")+
+          sum(somechars=="C")+sum(somechars=="D") )
   test replications elapsed relative user.self sys.self user.child sys.child
5    IN            5   5.513 1.000000     4.464    1.004          0         0
6  plus            5   8.603 1.560493     7.705    0.860          0         0
3     c            5  10.283 1.865228     8.648    1.560          0         0
4    OR            5  12.348 2.239797    10.849    1.464          0         0
2 table            5  17.960 3.257754    12.877    4.921          0         0
1  grep            5  21.692 3.934700    21.405    0.192          0         0
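Before comparing timings, it may be worth confirming that all six expressions return the same count. The following is a rough sanity check, assuming a smaller vector and a fixed seed (both chosen here only so the check runs quickly and reproducibly); it says nothing about speed.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
somechars <- sample(LETTERS, 1e5, TRUE)

# Same six expressions as in the two-character benchmark above.
counts <- c(
  grep  = sum(grepl("A|B", somechars)),
  table = sum(table(somechars)[c("A", "B")]),
  c     = sum(c(somechars == "A", somechars == "B")),
  OR    = sum(somechars == "A" | somechars == "B"),
  IN    = sum(somechars %in% c("A", "B")),
  plus  = sum(somechars == "A") + sum(somechars == "B")
)

# All approaches should agree on the count.
stopifnot(length(unique(counts)) == 1)
```

Note that the grep version only agrees because every element is a single character; on longer strings "A|B" would also match elements that merely contain an A or a B.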

Upvotes: 8

John

Reputation: 23758

I thought that this would be fastest...

sum(somechars %in% c('A', 'B'))

And, it is faster than...

sum(c(somechars=="A",somechars=="B"))

But not faster than...

sum(somechars=="A"|somechars=="B")

But this depends on how many comparisons you make, which brings me back to my first guess: once you want to sum more than 2 letters, the %in% version is the fastest.
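One practical point in favour of %in% is that the set of letters can be passed in as a variable, so the expression does not grow as more letters are added. A small sketch, where count_in_set is a hypothetical helper name and the input vector is made up for illustration:

```r
# Hypothetical helper: count elements of x that match any value in targets.
count_in_set <- function(x, targets) {
  sum(x %in% targets)
}

somechars <- c("A", "B", "C", "A", "D", "B", "A")

two  <- count_in_set(somechars, c("A", "B"))            # 3 + 2
four <- count_in_set(somechars, c("A", "B", "C", "D"))  # every element matches
print(c(two = two, four = four))
```

Unlike ==, %in% never returns NA, so missing values in the input are simply counted as non-matches.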

Upvotes: 9

Carl Witthoft

Reputation: 21502

My favorite tool, tho' I didn't time-check it against Tomas' solutions, is

rle(sort(your_vector)) 

It's certainly the simplest solution :-) .
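For completeness, here is one way the rle() result could be turned into the count asked for in the question. This is only a sketch on a made-up vector; the names runs and counts are illustrative. After sorting, each distinct value forms a single run, so the run lengths are the per-value counts.

```r
your_vector <- c("A", "B", "C", "A", "D", "B", "A")

# After sorting, equal values are adjacent, so each run is one distinct value.
runs <- rle(sort(your_vector))

# Name the run lengths by their values to get a lookup table of counts.
counts <- setNames(runs$lengths, runs$values)

total_ab <- sum(counts[c("A", "B")])
print(total_ab)  # prints 5
```

Like table(), this counts every distinct value, and sorting a vector with millions of elements is likely to cost more than a couple of direct comparisons.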

Upvotes: 0

Tomas

Reputation: 59515

As I expected, sum(x=='A') + sum(x=='B') is the fastest.

Unlike the other solutions proposed here, it doesn't perform any unnecessary operation such as combining the intermediate results with c(..) or |. It does just the counting, which is the only thing that is really needed!

R 2.13.1:

> x <- sample(LETTERS, 1e7, TRUE)
> system.time(sum(x=='A') + sum(x=='B'))
   user  system elapsed 
   1.75    0.16    1.98 
> system.time(sum(c(x=='A', x=='B')))
   user  system elapsed 
   2.40    0.23    4.27 
> system.time(sum(x=='A' | x=='B'))
   user  system elapsed 
   2.25    0.19    2.54 

What is really interesting is the comparison of sum(x %in% c('A','B')) with the first, fastest solution. In R 2.13.1 it takes the same time; in R 2.11.1 it is much slower (the same result John reported)! So I'd recommend using the first solution: sum(x=='A')+sum(x=='B').
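If more than a couple of letters are needed, the same "one sum per letter" idea could be written as a loop over the targets instead of a long chain of + terms. A minimal sketch, assuming no NA values in x; count_each is a hypothetical helper name and the input is made up for illustration:

```r
# Hypothetical helper: one sum(x == ch) per target, as in the "plus" approach.
count_each <- function(x, targets) {
  vapply(targets, function(ch) sum(x == ch), integer(1))
}

x <- c("A", "B", "C", "A", "D", "B", "A")

per_letter <- count_each(x, c("A", "B"))  # named vector: A = 3, B = 2
total <- sum(per_letter)
print(total)  # prints 5
```

This keeps the per-letter counts available as well as the total; whether it is as fast as the hand-written version would need to be timed on the real data.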

Upvotes: 2
