neversaint

Reputation: 64074

How to count rows with conditional after grouping in data.table

I have the following data frame:

library(readr)

dat <- read_csv(
  "s1,s2,v1,v2
   a,b,10,20
   a,b,22,NA
   a,b,13,33
   c,d,3,NA
   c,d,4.5,NA
   c,d,10,20"
)

dat
#> # A tibble: 6 x 4
#>      s1    s2    v1    v2
#>   <chr> <chr> <dbl> <int>
#> 1     a     b  10.0    20
#> 2     a     b  22.0    NA
#> 3     a     b  13.0    33
#> 4     c     d   3.0    NA
#> 5     c     d   4.5    NA
#> 6     c     d  10.0    20

What I want to do is:

  1. Filter rows based on v1 values
  2. Group by s1 and s2
  3. Count the total lines in every group
  4. Count the lines in every group where v2 is not NA

For example, with v1_filter >= 0 we get this:

s1 s2 total_line non_na_line
a  b     3          2
c  d     3          1

And with v1_filter >= 10 we get this:

s1 s2 total_line non_na_line
a  b     3          2
c  d     1          1

How can I achieve that with data.table or dplyr? In reality, dat has ~31M rows, so we need a fast method.

I'm stuck here:

library(data.table)
dat <- data.table(dat)

v1_filter <- 0
dat[, v1 >= v1_filter,
    by = list(s1, s2)]

Upvotes: 2

Views: 4115

Answers (2)

Brian

Reputation: 190

Using sum should help. When sum operates on a logical vector it treats each TRUE as 1 and each FALSE as 0, so you can easily do this:

library(dplyr)

dat %>%
    group_by(s1, s2) %>%
    summarise(total_lines = n(),
              non_na_line = sum(!is.na(v2)))

# A tibble: 2 x 4
# Groups:   s1 [?]
     s1    s2 total_lines non_na_line
  <chr> <chr>       <int>       <int>
1     a     b           3           2
2     c     d           3           1
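
To see that logical-to-integer coercion on its own (a quick illustration, using the v2 values from the question's first group):

sum(!is.na(c(20, NA, 33)))
[1] 2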

You can easily add a filter between group_by and summarise to get what you want; see the example below. Keep in mind that summarise will only retain the columns that you group by.
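
For instance, reusing the small dat from the question with a v1 >= 10 cutoff (a quick sketch; swap in whatever threshold you need):

dat %>%
    group_by(s1, s2) %>%
    filter(v1 >= 10) %>%
    summarise(total_lines = n(),
              non_na_line = sum(!is.na(v2)))

# A tibble: 2 x 4
# Groups:   s1 [?]
     s1    s2 total_lines non_na_line
  <chr> <chr>       <int>       <int>
1     a     b           3           2
2     c     d           1           1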

Benchmark

For what it's worth, I ran a quick benchmark with some test data of similar size to yours.

# grouping keys: shuffled mixes of letters a-j and k-t
s1charMix <- rep(letters[seq(from = 1, to = 10)], length.out = 30000000)
s2charMix <- rep(letters[seq(from = 11, to = 20)], length.out = 30000000)
s1chars <- sample(s1charMix, 30000000)
s2chars <- sample(s2charMix, 30000000)

# v1: uniform doubles in [0, 20]; v2: integers with a random number of NAs mixed in
v1Nums <- runif(30000000, min = 0, max = 20)
nomissing <- sample(1:200000, 1)
int.mix <- rbinom(30000000 - nomissing, 30, 0.3)
nalist <- rep(NA, nomissing)
v2NumsNA <- sample(x = c(int.mix, nalist), 30000000)

df <- data_frame(s1 = s1chars, s2 = s2chars, v1 = v1Nums, v2 = v2NumsNA)

This should roughly replicate the size and type of the data you suggest:

df

# A tibble: 30,000,000 x 4
      s1    s2         v1    v2
   <chr> <chr>      <dbl> <int>
 1     d     s  9.2123603     7
 2     b     q 16.6638639    11
 3     g     o 18.3682028    11
 4     g     s  0.8779067     9
 5     a     s  0.0719127    10
 6     b     q 16.8809193    12
 7     h     q 15.4382455     6
 8     e     k  2.3565489    11
 9     h     p 16.4508811     9
10     d     n  2.7283823    11
# ... with 29,999,990 more rows

df %>%
    filter(is.na(v2))

# A tibble: 116,924 x 4
      s1    s2         v1    v2
   <chr> <chr>      <dbl> <int>
 1     d     r 13.1448988    NA
 2     b     o  0.2703848    NA
 3     b     t 18.8319385    NA
 4     a     s 11.6448437    NA
 5     j     m  0.5388760    NA
 6     i     k  8.7098427    NA
 7     d     s  6.1149735    NA
 8     h     p  2.5552694    NA
 9     g     r  0.9057442    NA
10     b     s 19.8886830    NA
# ... with 116,914 more rows

Now, let's benchmark dplyr operations vs data.table:

### dplyr
df %>%
    filter(v1 > 10) %>%
    group_by(s1, s2) %>%
    summarise(total_lines = n(),
              non_na_line = sum(!is.na(v2)))

# A tibble: 100 x 4
# Groups:   s1 [?]
      s1    s2 total_lines non_na_line
   <chr> <chr>       <int>       <int>
 1     a     k      150327      149734
 2     a     l      149655      149062
 3     a     m      149794      149200
 4     a     n      149771      149197
 5     a     o      149495      148942
...
> system.time(df %>% filter(v1 > 10) %>% group_by(s1, s2) %>% summarise(total_lines = n(), non_na_line = sum(!is.na(v2))))
   user  system elapsed 
  1.848   0.420   2.290
> system.time(for (i in 1:100) df %>% filter(v1 > 10) %>% group_by(s1, s2) %>% summarise(total_lines = n(), non_na_line = sum(!is.na(v2))))
   user  system elapsed 
187.657  55.878 245.528 

### data.table
library(data.table)
dat <- data.table(df)
> dat[v1 > 10, .N, by = .(s1, s2)][dat[v1 > 10 & !is.na(v2), .N, by = .(s1, s2)], on = c("s1", "s2"), nomatch = 0]
    s1 s2      N    i.N
 1:  b  q 149968 149348
 2:  g  o 150411 149831
 3:  h  q 150132 149563
 4:  h  p 150786 150224
 5:  e  o 149951 149353
...
> system.time(dat[v1 > 10, .N, by = .(s1, s2)][dat[v1 > 10 & !is.na(v2), .N, by = .(s1, s2)], on = c("s1", "s2"), nomatch = 0])
   user  system elapsed 
  2.027   0.228   2.271
> system.time(for (i in 1:100) dat[v1 > 10, .N, by = .(s1, s2)][dat[v1 > 10 & !is.na(v2), .N, by = .(s1, s2)], on = c("s1", "s2"), nomatch = 0])
   user  system elapsed 
213.281  43.949 261.664

TL;DR: dplyr and data.table are similarly fast here; if anything, dplyr is slightly faster.
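
As an aside, the join can be avoided by computing both counts in one grouped call instead of two aggregations plus a join. This sketch wasn't part of the benchmark above, but it is plain data.table syntax:

dat[v1 > 10,
    .(total_lines = .N, non_na_line = sum(!is.na(v2))),
    by = .(s1, s2)]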

Upvotes: 4

Ajay Ohri

Reputation: 3492

> library(readr)
> dat <- read_csv(
+   "s1,s2,v1,v2
+    a,b,10,20
+    a,b,22,NA
+    a,b,13,33
+    c,d,3,NA
+    c,d,4.5,NA
+    c,d,10,20"
+ )
> 
> dat
# A tibble: 6 x 4
     s1    s2    v1    v2
  <chr> <chr> <dbl> <int>
1     a     b  10.0    20
2     a     b  22.0    NA
3     a     b  13.0    33
4     c     d   3.0    NA
5     c     d   4.5    NA
6     c     d  10.0    20

Using data.table, since you have big data:

> library(data.table)
> dat <- data.table(dat)

Without removing NAs, keeping the v1 filter at 0.1:

> dat1 <- dat[v1 > 0.1, .N, .(s1, s2)]
> dat1
   s1 s2 N
1:  a  b 3
2:  c  d 3

Removing rows where v2 is NA, keeping the v1 filter at 0.1:

> dat2 <- dat[v1 > 0.1 & !is.na(v2), .N, .(s1, s2)]
> dat2
   s1 s2 N
1:  a  b 2
2:  c  d 1

Merging the two, keeping the v1 filter at 0:

> dat[v1 > 0, .N, by = .(s1, s2)][dat[v1 > 0 & !is.na(v2), .N, by = .(s1, s2)], on = c("s1", "s2"), nomatch = 0]
   s1 s2 N i.N
1:  a  b 3   2
2:  c  d 3   1
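
To make the cutoff a parameter, as the question's v1_filter suggests, both counts can also be computed in one grouped call (a minimal sketch; count_lines is a hypothetical helper name, not from the question):

count_lines <- function(dat, v1_filter) {
  # one pass per group: total rows, and rows where v2 is not NA
  dat[v1 >= v1_filter,
      .(total_line = .N, non_na_line = sum(!is.na(v2))),
      by = .(s1, s2)]
}

count_lines(dat, 0)
   s1 s2 total_line non_na_line
1:  a  b          3           2
2:  c  d          3           1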

Upvotes: 1
