Reputation: 7942
I have a very long vector of single characters, e.g. somechars <- c("A","B","C","A"...) (its length is somewhere in the millions). What is the fastest way to count the total occurrences of, say, "A" and "B" in this vector?
I have tried using grep and lapply, but both take a long time to execute.
My current solution is:
tmp<-table(somechars)
sum(tmp["A"],tmp["B"])
But this still takes a while to compute. Is there a faster way to do this? Or is there a package that already does this faster? I've looked into the stringr package, but it uses a simple grep.
Upvotes: 3
Views: 3617
Reputation: 176668
Regular expressions are expensive. You can get the result in your question with exact comparison.
> somechars <- sample(LETTERS, 5e6, TRUE)
> sum(c(somechars=="A",somechars=="B"))
[1] 385675
> system.time(sum(c(somechars=="A",somechars=="B")))
user system elapsed
0.416 0.072 0.487
UPDATED to include timings from the OP and other answers. Also included a test larger than the 2-character case.
> library(rbenchmark)
> benchmark( replications=5, order="relative",
+ grep = sum(grepl("A|B",somechars)),
+ table = sum(table(somechars)[c("A","B")]),
+ c = sum(c(somechars=="A",somechars=="B")),
+ OR = sum(somechars=="A"|somechars=="B"),
+ IN = sum(somechars %in% c("A","B")),
+ plus = sum(somechars=="A")+sum(somechars=="B") )
    test replications elapsed relative user.self sys.self user.child sys.child
6   plus            5   4.289 1.000000     3.836    0.436          0         0
3      c            5   4.991 1.163675     4.156    0.804          0         0
5     IN            5   5.480 1.277687     4.549    0.880          0         0
4     OR            5   5.574 1.299604     5.000    0.544          0         0
1   grep            5  16.426 3.829797    16.205    0.172          0         0
2  table            5  17.834 4.158079    12.793    4.884          0         0
>
> benchmark( replications=5, order="relative",
+ grep = sum(grepl("A|B|C|D",somechars)),
+ table = sum(table(somechars)[c("A","B","C","D")]),
+ c = sum(c(somechars=="A",somechars=="B",
+ somechars=="C",somechars=="D")),
+ OR = sum(somechars=="A"|somechars=="B"|
+ somechars=="C"|somechars=="D"),
+ IN = sum(somechars %in% c("A","B","C","D")),
+ plus = sum(somechars=="A")+sum(somechars=="B")+
+ sum(somechars=="C")+sum(somechars=="D") )
    test replications elapsed relative user.self sys.self user.child sys.child
5     IN            5   5.513 1.000000     4.464    1.004          0         0
6   plus            5   8.603 1.560493     7.705    0.860          0         0
3      c            5  10.283 1.865228     8.648    1.560          0         0
4     OR            5  12.348 2.239797    10.849    1.464          0         0
2  table            5  17.960 3.257754    12.877    4.921          0         0
1   grep            5  21.692 3.934700    21.405    0.192          0         0
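Before comparing timings, it is worth confirming that the six expressions actually return the same count. The following is a minimal sketch of such a check; the seed and the smaller vector size are arbitrary choices for illustration and are not part of the benchmark above.

```r
# Small reproducible input; the seed and size are illustrative assumptions.
set.seed(42)
somechars <- sample(LETTERS, 1e5, TRUE)

# The same six expressions used in the two-character benchmark.
counts <- c(
  grep  = sum(grepl("A|B", somechars)),
  table = sum(table(somechars)[c("A", "B")]),
  c     = sum(c(somechars == "A", somechars == "B")),
  OR    = sum(somechars == "A" | somechars == "B"),
  IN    = sum(somechars %in% c("A", "B")),
  plus  = sum(somechars == "A") + sum(somechars == "B")
)

# All six approaches should agree on the total.
stopifnot(length(unique(counts)) == 1)
print(counts)
```

Note that the regular expression only agrees with the exact comparisons because every element is a single character; with longer strings, "A|B" would also match any string that merely contains one of the letters.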
Upvotes: 8
Reputation: 23758
I thought that this would be fastest...
sum(somechars %in% c('A', 'B'))
And, it is faster than...
sum(c(somechars=="A",somechars=="B"))
But not faster than...
sum(somechars=="A"|somechars=="B")
But this depends on how many comparisons you make, which brings me back to my first guess: once you want to count more than 2 letters, the %in% version is the fastest.
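As a rough illustration of why the %in% version scales well, here is a small sketch that wraps it in a helper. The name count_any is hypothetical and not from the thread. Because %in% is built on match(), the vector is scanned once however many target letters are supplied, whereas each extra == adds another full pass.

```r
# Hypothetical helper (not from the thread): count how many elements of `x`
# are equal to any of the values in `targets`.
count_any <- function(x, targets) {
  sum(x %in% targets)
}

x <- c("A", "B", "C", "A", "D", "B", "A")
count_any(x, c("A", "B"))            # 5
count_any(x, c("A", "B", "C", "D"))  # 7
```

The call looks the same for two letters or twenty, so the set of letters can be passed in as data instead of being written out as a chain of comparisons.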
Upvotes: 9
Reputation: 21502
My favorite tool, though I haven't timed it against Tomas' solutions, is
rle(sort(your_vector))
It's certainly the simplest solution :-) .
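For reference, a short sketch of how the counts for "A" and "B" could be pulled out of the rle() result. The sample vector is made up for illustration. Sorting the whole vector is extra work, so this is probably better suited to convenience than to speed on millions of elements.

```r
# Made-up sample data for illustration.
your_vector <- c("A", "B", "C", "A", "D", "B", "A")

runs <- rle(sort(your_vector))
# runs$values holds each distinct character, runs$lengths how often it occurs.
counts <- setNames(runs$lengths, runs$values)

counts[c("A", "B")]       # per-letter counts: A = 3, B = 2
sum(counts[c("A", "B")])  # 5
```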
Upvotes: 0
Reputation: 59515
sum(x=='A') + sum(x=='B')
is the fastest. Unlike the other solutions proposed here, it doesn't perform any unnecessary extra operation such as combining the intermediate results with c(..) or |. It does just the counting, which is the only thing really needed!
R 2.13.1:
> x <- sample(letters, 1e7, TRUE)
> system.time(sum(x=='A') + sum(x=='B'))
user system elapsed
1.75 0.16 1.98
> system.time(sum(c(x=='A', x=='B')))
user system elapsed
2.40 0.23 4.27
> system.time(sum(x=='A' | x=='B'))
user system elapsed
2.25 0.19 2.54
What is really interesting is the comparison of sum(x %in% c('A','B')) with the first, fastest solution. In R 2.13.1 it takes the same time; in R 2.11.1 it is much slower (the same result John reported)! So I'd recommend using the first solution: sum(x=='A')+sum(x=='B').
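If the list of letters grows, the same "one sum per letter" idea can be written without repeating the expression by hand. This is only a sketch; count_each is a hypothetical name, and it has not been timed against the versions above.

```r
# Hypothetical helper (not from the thread): one sum(x == ch) per target,
# with no intermediate concatenation or logical OR.
count_each <- function(x, targets) {
  vapply(targets, function(ch) sum(x == ch), integer(1))
}

x <- c("A", "B", "C", "A", "D", "B", "A")
per_letter <- count_each(x, c("A", "B"))
per_letter       # A = 3, B = 2
sum(per_letter)  # 5
```

Each target still costs one full pass over x, so for many targets the %in% form may well win, as the other answers suggest.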
Upvotes: 2