neversaint

Reputation: 64074

How to count rows with conditional after grouping in data.table

I have the following data frame:

library(readr)

dat <- read_csv(
  "s1,s2,v1,v2
   a,b,10,20
   a,b,22,NA
   a,b,13,33
   c,d,3,NA
   c,d,4.5,NA
   c,d,10,20"
)

dat
#> # A tibble: 6 x 4
#>      s1    s2    v1    v2
#>   <chr> <chr> <dbl> <int>
#> 1     a     b  10.0    20
#> 2     a     b  22.0    NA
#> 3     a     b  13.0    33
#> 4     c     d   3.0    NA
#> 5     c     d   4.5    NA
#> 6     c     d  10.0    20

What I want to do is:

  1. Filter rows based on v1 values
  2. Group by s1 and s2
  3. Count the total lines in every group
  4. Count the lines in every group where v2 is not NA

For example, with v1_filter >= 0 we get this:

s1 s2 total_line non_na_line
a  b     3          2
c  d     3          1

And with v1_filter >= 10 we get this:

s1 s2 total_line non_na_line
a  b     3          2
c  d     1          1

How can I achieve that with data.table or dplyr? In reality, dat has ~31M rows, so we need a fast method.

I'm stuck here:

library(data.table)
dat <- data.table(dat)

v1_filter <- 0
dat[, v1 >= v1_filter,
    by = list(s1, s2)]

Upvotes: 2

Views: 4115

Answers (2)

Brian

Reputation: 190

Using sum should help. When sum operates on a logical vector it treats each TRUE as 1 and each FALSE as 0, so you can easily do this:

library(dplyr)

dat %>%
    group_by(s1, s2) %>%
    summarise(total_lines = n(),
              non_na_line = sum(!is.na(v2)))

# A tibble: 2 x 4
# Groups:   s1 [?]
     s1    s2 total_lines non_na_line
  <chr> <chr>       <int>       <int>
1     a     b           3           2
2     c     d           3           1
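
To see that logical-to-integer coercion on its own (a quick illustration, using the v2 values from the question's first group):

sum(!is.na(c(20, NA, 33)))
[1] 2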

You can easily add a filter between group_by and summarise to get what you want; see the example below. Keep in mind that summarise will only retain the columns that you group by.
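
For instance, reusing the small dat from the question with a v1 >= 10 cutoff (a quick sketch; swap in whatever threshold you need):

dat %>%
    group_by(s1, s2) %>%
    filter(v1 >= 10) %>%
    summarise(total_lines = n(),
              non_na_line = sum(!is.na(v2)))

# A tibble: 2 x 4
# Groups:   s1 [?]
     s1    s2 total_lines non_na_line
  <chr> <chr>       <int>       <int>
1     a     b           3           2
2     c     d           1           1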

Benchmark

For what it's worth, I ran a quick benchmark with some test data of similar size to yours.

# grouping keys: shuffled mixes of letters a-j and k-t
s1charMix <- rep(letters[seq(from = 1, to = 10)], length.out = 30000000)
s2charMix <- rep(letters[seq(from = 11, to = 20)], length.out = 30000000)
s1chars <- sample(s1charMix, 30000000)
s2chars <- sample(s2charMix, 30000000)

# v1: uniform doubles in [0, 20]; v2: integers with a random number of NAs mixed in
v1Nums <- runif(30000000, min = 0, max = 20)
nomissing <- sample(1:200000, 1)
int.mix <- rbinom(30000000 - nomissing, 30, 0.3)
nalist <- rep(NA, nomissing)
v2NumsNA <- sample(x = c(int.mix, nalist), 30000000)

df <- data_frame(s1 = s1chars, s2 = s2chars, v1 = v1Nums, v2 = v2NumsNA)

This should roughly replicate the size and type of the data you suggest:

df

# A tibble: 30,000,000 x 4
      s1    s2         v1    v2
   <chr> <chr>      <dbl> <int>
 1     d     s  9.2123603     7
 2     b     q 16.6638639    11
 3     g     o 18.3682028    11
 4     g     s  0.8779067     9
 5     a     s  0.0719127    10
 6     b     q 16.8809193    12
 7     h     q 15.4382455     6
 8     e     k  2.3565489    11
 9     h     p 16.4508811     9
10     d     n  2.7283823    11
# ... with 29,999,990 more rows

df %>%
    filter(is.na(v2))

# A tibble: 116,924 x 4
      s1    s2         v1    v2
   <chr> <chr>      <dbl> <int>
 1     d     r 13.1448988    NA
 2     b     o  0.2703848    NA
 3     b     t 18.8319385    NA
 4     a     s 11.6448437    NA
 5     j     m  0.5388760    NA
 6     i     k  8.7098427    NA
 7     d     s  6.1149735    NA
 8     h     p  2.5552694    NA
 9     g     r  0.9057442    NA
10     b     s 19.8886830    NA
# ... with 116,914 more rows

Now, let's benchmark dplyr operations vs data.table:

### dplyr
df %>%
    filter(v1 > 10) %>%
    group_by(s1, s2) %>%
    summarise(total_lines = n(),
              non_na_line = sum(!is.na(v2)))

# A tibble: 100 x 4
# Groups:   s1 [?]
      s1    s2 total_lines non_na_line
   <chr> <chr>       <int>       <int>
 1     a     k      150327      149734
 2     a     l      149655      149062
 3     a     m      149794      149200
 4     a     n      149771      149197
 5     a     o      149495      148942
...
> system.time(df %>% filter(v1 > 10) %>% group_by(s1, s2) %>% summarise(total_lines = n(), non_na_line = sum(!is.na(v2))))
   user  system elapsed 
  1.848   0.420   2.290
> system.time(for (i in 1:100) df %>% filter(v1 > 10) %>% group_by(s1, s2) %>% summarise(total_lines = n(), non_na_line = sum(!is.na(v2))))
   user  system elapsed 
187.657  55.878 245.528 

### data.table
library(data.table)
dat <- data.table(df)
> dat[v1 > 10, .N, by = .(s1, s2)][dat[v1 > 10 & !is.na(v2), .N, by = .(s1, s2)], on = c("s1", "s2"), nomatch = 0]
    s1 s2      N    i.N
 1:  b  q 149968 149348
 2:  g  o 150411 149831
 3:  h  q 150132 149563
 4:  h  p 150786 150224
 5:  e  o 149951 149353
...
> system.time(dat[v1 > 10, .N, by = .(s1, s2)][dat[v1 > 10 & !is.na(v2), .N, by = .(s1, s2)], on = c("s1", "s2"), nomatch = 0])
   user  system elapsed 
  2.027   0.228   2.271
> system.time(for (i in 1:100) dat[v1 > 10, .N, by = .(s1, s2)][dat[v1 > 10 & !is.na(v2), .N, by = .(s1, s2)], on = c("s1", "s2"), nomatch = 0])
   user  system elapsed 
213.281  43.949 261.664

TL;DR: dplyr and data.table are similarly fast here; if anything, dplyr is slightly faster.
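
As an aside, the join can be avoided by computing both counts in one grouped call instead of two aggregations plus a join. This sketch wasn't part of the benchmark above, but it is plain data.table syntax:

dat[v1 > 10,
    .(total_lines = .N, non_na_line = sum(!is.na(v2))),
    by = .(s1, s2)]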

Upvotes: 4

Ajay Ohri

Reputation: 3492

> library(readr)
> dat <- read_csv(
+   "s1,s2,v1,v2
+    a,b,10,20
+    a,b,22,NA
+    a,b,13,33
+    c,d,3,NA
+    c,d,4.5,NA
+    c,d,10,20"
+ )
> 
> dat
# A tibble: 6 x 4
     s1    s2    v1    v2
  <chr> <chr> <dbl> <int>
1     a     b  10.0    20
2     a     b  22.0    NA
3     a     b  13.0    33
4     c     d   3.0    NA
5     c     d   4.5    NA
6     c     d  10.0    20

Using data.table, since you have big data:

> library(data.table)
> dat <- data.table(dat)

Without removing NAs, keeping the v1 filter at 0.1:

> dat1 <- dat[v1 > 0.1, .N, .(s1, s2)]
> dat1
   s1 s2 N
1:  a  b 3
2:  c  d 3

Removing rows where v2 is NA, keeping the v1 filter at 0.1:

> dat2 <- dat[v1 > 0.1 & !is.na(v2), .N, .(s1, s2)]
> dat2
   s1 s2 N
1:  a  b 2
2:  c  d 1

Merging the two, keeping the v1 filter at 0:

> dat[v1 > 0, .N, by = .(s1, s2)][dat[v1 > 0 & !is.na(v2), .N, by = .(s1, s2)], on = c("s1", "s2"), nomatch = 0]
   s1 s2 N i.N
1:  a  b 3   2
2:  c  d 3   1
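
To make the cutoff a parameter, as the question's v1_filter suggests, both counts can also be computed in one grouped call (a minimal sketch; count_lines is a hypothetical helper name, not from the question):

count_lines <- function(dat, v1_filter) {
  # one pass per group: total rows, and rows where v2 is not NA
  dat[v1 >= v1_filter,
      .(total_line = .N, non_na_line = sum(!is.na(v2))),
      by = .(s1, s2)]
}

count_lines(dat, 0)
   s1 s2 total_line non_na_line
1:  a  b          3           2
2:  c  d          3           1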

Upvotes: 1
