Reputation: 60054
I have the following code:
> dt <- data.table(a=c(rep(3,5),rep(4,5)),b=1:10,c=11:20,d=21:30,key="a")
> dt
a b c d
1: 3 1 11 21
2: 3 2 12 22
3: 3 3 13 23
4: 3 4 14 24
5: 3 5 15 25
6: 4 6 16 26
7: 4 7 17 27
8: 4 8 18 28
9: 4 9 19 29
10: 4 10 20 30
> dt[,lapply(.SD,sum),by="a"]
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
Optimized j from 'lapply(.SD, sum)' to 'list(sum(b), sum(c), sum(d))'
Starting dogroups ... done dogroups in 0 secs
a b c d
1: 3 15 65 115
2: 4 40 90 140
> dt[,c(count=.N,lapply(.SD,sum)),by="a"]
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
Optimization is on but j left unchanged as 'c(count = .N, lapply(.SD, sum))'
Starting dogroups ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
done dogroups in 0 secs
a count b c d
1: 3 5 15 65 115
2: 4 5 40 90 140
How do I avoid the scary "very inefficient" warning?
I can add the count column before aggregating:
> dt$count <- 1
> dt
a b c d count
1: 3 1 11 21 1
2: 3 2 12 22 1
3: 3 3 13 23 1
4: 3 4 14 24 1
5: 3 5 15 25 1
6: 4 6 16 26 1
7: 4 7 17 27 1
8: 4 8 18 28 1
9: 4 9 19 29 1
10: 4 10 20 30 1
> dt[,lapply(.SD,sum),by="a"]
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
Optimized j from 'lapply(.SD, sum)' to 'list(sum(b), sum(c), sum(d), sum(count))'
Starting dogroups ... done dogroups in 0 secs
a b c d count
1: 3 15 65 115 5
2: 4 40 90 140 5
but this does not look too elegant...
Upvotes: 3
Views: 759
Reputation: 5536
This solution avoids the message about named elements, but you have to put the names back afterwards.
require(data.table)
options(datatable.verbose = TRUE)
dt <- data.table(a=c(rep(3,5),rep(4,5)),b=1:10,c=11:20,d=21:30,key="a")
dt[, c(.N, unname(lapply(.SD, sum))), by = "a"]
Output
> dt[, c(.N, unname(lapply(.SD, sum))), by = "a"]
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
Optimization is on but j left unchanged as 'c(.N, unname(lapply(.SD, sum)))'
Starting dogroups ... done dogroups in 0.001 secs
a V1 V2 V3 V4
1: 3 5 15 65 115
2: 4 5 40 90 140
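Putting the names back could look like the following. This is a minimal sketch, not part of the original answer: the result name res is hypothetical, and it assumes the grouping column comes first in the output, followed by the count and then the .SD columns in their original order (as in the output above).

```r
library(data.table)

dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, c = 11:20, d = 21:30, key = "a")

# Aggregate with an unnamed j so no names are rebuilt for every group
res <- dt[, c(.N, unname(lapply(.SD, sum))), by = "a"]

# Restore the names by reference: grouping column, "count", then the
# remaining columns of dt (assumed to match the order of .SD)
setnames(res, c("a", "count", setdiff(names(dt), "a")))
res
```

setnames() changes the names in place, so no copy of res is made.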
Upvotes: 2
Reputation: 118859
One way I could think of is to assign count by reference:
dt.out <- dt[, lapply(.SD,sum), by = a]
dt.out[, count := dt[, .N, by=a][, N]]
# alternatively: count := table(dt$a)
# a b c d count
# 1: 3 15 65 115 5
# 2: 4 40 90 140 5
Edit 1: I still think it's just a message and not a warning. But if you still want to avoid it, just do:
dt.out[, count := as.numeric(dt[, .N, by=a][, N])]
Edit 2: Very interesting. Doing the equivalent of multiple := assignment does not produce the same message.
dt.out[, `:=`(count = dt[, .N, by=a][, N])]
# Detected that j uses these columns: a
# Finding groups (bysameorder=TRUE) ... done in 0.001secs. bysameorder=TRUE and o__ is length 0
# Detected that j uses these columns: <none>
# Optimization is on but j left unchanged as '.N'
# Starting dogroups ... done dogroups in 0 secs
# Detected that j uses these columns: N
# Assigning to all 2 rows
# Direct plonk of unnamed RHS, no copy.
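Note that assigning the counts as a plain vector relies on both aggregations returning the groups in the same order. A variant that matches on a explicitly might look like this. It is only a sketch: the names sums and counts are hypothetical, and it assumes a data.table version that supports the on= argument.

```r
library(data.table)

dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, c = 11:20, d = 21:30, key = "a")

# Aggregate the sums and the counts separately
sums   <- dt[, lapply(.SD, sum), by = a]
counts <- dt[, .(count = .N), by = a]

# Join the two results on the grouping column instead of relying on row order
dt.out <- counts[sums, on = "a"]
dt.out
```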
Upvotes: 3