Reputation: 64074
I have the following data frame:
library(readr)
dat <- read_csv(
"s1,s2,v1,v2
a,b,10,20
a,b,22,NA
a,b,13,33
c,d,3,NA
c,d,4.5,NA
c,d,10,20"
)
dat
#> # A tibble: 6 x 4
#> s1 s2 v1 v2
#> <chr> <chr> <dbl> <int>
#> 1 a b 10.0 20
#> 2 a b 22.0 NA
#> 3 a b 13.0 33
#> 4 c d 3.0 NA
#> 5 c d 4.5 NA
#> 6 c d 10.0 20
What I want to do is to filter on v1, group by s1 and s2, and then count both the total number of lines per group and the number of lines where v2 is not NA.
For example, with v1_filter >= 0 we get this:
s1 s2 total_line non_na_line
a b 3 2
c d 3 1
And with v1_filter >= 10
we get this:
s1 s2 total_line non_na_line
a b 2 1
c d 1 1
How can I achieve that with data.table or dplyr?
In reality we have ~31M rows in dat, so we need a fast method.
I'm stuck with this:
library(data.table)
dat <- data.table(dat)
v1_filter = 0
dat[, v1 >= v1_filter,
    by = list(s1, s2)]
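# this returns one logical value per row within each group, not the per-group counts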
Upvotes: 2
Views: 4115
Reputation: 190
Using sum
should help. Operating on a logical vector, it treats each TRUE
as 1
and FALSE
as 0
, so you can easily do this:
library(dplyr)

dat %>%
  group_by(s1, s2) %>%
  summarise(total_lines = n(),
            non_na_line = sum(!is.na(v2)))
# A tibble: 2 x 4
# Groups: s1 [?]
s1 s2 total_lines non_na_line
<chr> <chr> <int> <int>
1 a b 3 2
2 c d 3 1
You'll easily be able to add a filter between group_by and summarise to get what you want. Keep in mind that summarise will only retain the columns you group by.
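For instance, with the v1_filter from the question set beforehand, a minimal sketch might look like this:
v1_filter <- 0

dat %>%
  group_by(s1, s2) %>%
  filter(v1 >= v1_filter) %>%               # keep only rows passing the v1 threshold
  summarise(total_lines = n(),              # rows per group after the filter
            non_na_line = sum(!is.na(v2)))  # of those, rows with a non-NA v2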
For what it's worth, I ran a quick benchmark, with some test data of similar size as yours.
s1charMix <- rep(letters[seq(from = 1, to = 10)], length.out = 30000000)
s2charMix <- rep(letters[seq(from = 11, to = 20)], length.out = 30000000)
s1chars <- sample(s1charMix, 30000000)
s2chars <- sample(s2charMix, 30000000)
v1Nums <- runif(30000000, min = 0, max = 20)
nomissing <- sample(1:200000,1)
int.mix <- rbinom(30000000 - nomissing, 30, 0.3)
nalist <- rep(NA, nomissing)
v2NumsNA <- sample(x = c(int.mix, nalist), 30000000)
df <- data_frame(s1 = s1chars, s2 = s2chars, v1 = v1Nums, v2 = v2NumsNA)
This should roughly replicate the size and type of the data you suggest:
df
# A tibble: 30,000,000 x 4
s1 s2 v1 v2
<chr> <chr> <dbl> <int>
1 d s 9.2123603 7
2 b q 16.6638639 11
3 g o 18.3682028 11
4 g s 0.8779067 9
5 a s 0.0719127 10
6 b q 16.8809193 12
7 h q 15.4382455 6
8 e k 2.3565489 11
9 h p 16.4508811 9
10 d n 2.7283823 11
# ... with 29,999,990 more rows
df %>%
filter(is.na(v2))
# A tibble: 116,924 x 4
s1 s2 v1 v2
<chr> <chr> <dbl> <int>
1 d r 13.1448988 NA
2 b o 0.2703848 NA
3 b t 18.8319385 NA
4 a s 11.6448437 NA
5 j m 0.5388760 NA
6 i k 8.7098427 NA
7 d s 6.1149735 NA
8 h p 2.5552694 NA
9 g r 0.9057442 NA
10 b s 19.8886830 NA
# ... with 116,914 more rows
Now, let's benchmark dplyr operations vs data.table:
### dplyr
df %>%
filter(v1 > 10) %>%
group_by(s1, s2) %>%
summarise(total_lines = n(),
non_na_line = sum(!is.na(v2)))
# A tibble: 100 x 4
# Groups: s1 [?]
s1 s2 total_lines non_na_line
<chr> <chr> <int> <int>
1 a k 150327 149734
2 a l 149655 149062
3 a m 149794 149200
4 a n 149771 149197
5 a o 149495 148942
...
> system.time(df %>% filter(v1 > 10) %>% group_by(s1, s2) %>% summarise(total_lines = n(), non_na_line = sum(!is.na(v2))))
user system elapsed
1.848 0.420 2.290
> system.time(for (i in 1:100) df %>% filter(v1 > 10) %>% group_by(s1, s2) %>% summarise(total_lines = n(), non_na_line = sum(!is.na(v2))))
user system elapsed
187.657 55.878 245.528
### Data.table
library(data.table)
dat <- data.table(df)
> dat[v1 > 10, .N, by = .(s1, s2)][dat[v1 > 10 & !is.na(v2), .N, by = .(s1, s2)] , on = c("s1", "s2") , nomatch = 0]
s1 s2 N i.N
1: b q 149968 149348
2: g o 150411 149831
3: h q 150132 149563
4: h p 150786 150224
5: e o 149951 149353
...
> system.time(dat[v1 > 10, .N, by = .(s1, s2)][dat[v1 > 10 & !is.na(v2), .N, by = .(s1, s2)] , on = c("s1", "s2") , nomatch = 0])
user system elapsed
2.027 0.228 2.271
> system.time(for (i in 1:100) dat[v1 > 10, .N, by = .(s1, s2)][dat[v1 > 10 & !is.na(v2), .N, by = .(s1, s2)] , on = c("s1", "s2") , nomatch = 0])
user system elapsed
213.281 43.949 261.664
TL;DR: dplyr and data.table are similarly fast; if anything, dplyr is slightly faster.
Upvotes: 4
Reputation: 3492
> library(readr)
> dat <- read_csv(
+ "s1,s2,v1,v2
+ a,b,10,20
+ a,b,22,NA
+ a,b,13,33
+ c,d,3,NA
+ c,d,4.5,NA
+ c,d,10,20"
+ )
>
> dat
# A tibble: 6 x 4
s1 s2 v1 v2
<chr> <chr> <dbl> <int>
1 a b 10.0 20
2 a b 22.0 NA
3 a b 13.0 33
4 c d 3.0 NA
5 c d 4.5 NA
6 c d 10.0 20
Using data.table, since you have big data:
> library(data.table)
data.table 1.10.4
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> dat=data.table(dat)
Without removing NAs, and with the v1 filter at 0.1:
> dat1=dat[v1>0.1,.N,.(s1,s2)]
> dat1
s1 s2 N
1: a b 3
2: c d 3
Removing rows where v2 is NA, with the v1 filter at 0.1:
> dat2=dat[v1>0.1&is.na(v2)==F,.N,.(s1,s2)]
> dat2
s1 s2 N
1: a b 2
2: c d 1
Merging the two, with the v1 filter at 0:
> dat[v1 > 0, .N, by = .(s1, s2)][ dat[v1 > 0 & !is.na(v2), .N, by = .(s1, s2)] , on = c("s1", "s2") , nomatch = 0 ]
s1 s2 N i.N
1: a b 3 2
2: c d 3 1
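The same two counts can also be computed in a single grouped call, without the join; a minimal sketch with the v1 filter at 0:
dat[v1 > 0,
    .(total_line  = .N,                  # rows per group after the filter
      non_na_line = sum(!is.na(v2))),    # of those, rows with a non-NA v2
    by = .(s1, s2)]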
Upvotes: 1