Reputation: 297
I am learning data.table using examples and I am stuck on my own scenario.
I am using the cars dataset, converted to a data.table, to try out my commands.
library(data.table)
> cars.dt=data.table(cars)
> cars.dt[1:5]
speed dist
1: 4 2
2: 4 10
3: 7 4
4: 7 22
5: 8 16
.
.
I want to calculate the summary statistics of dist for each speed group and store them in separate columns, but the values end up in multiple rows instead, e.g.:
> cars.dt[, summary(dist), by="speed"]
speed V1
1: 4 2
2: 4 4
3: 4 6
4: 4 6
5: 4 8
---
110: 25 85
111: 25 85
112: 25 85
113: 25 85
114: 25 85
I was expecting the below output and I am unable to achieve it.
speed Min. 1st Qu. Median Mean 3rd Qu. Max.
1: 4 2 4 6 6 8 10
2: 7 4.0 8.5 13.0 13.0 17.5 22.0
3: 8 16 16 16 16 16 16
4: 9 10 10 10 10 10 10
5: 10 18 22 26 26 30 34
6: 11 17.00 19.75 22.50 22.50 25.25 28.00
7: 12 14.0 18.5 22.0 21.5 25.0 28.0
8: 13 26 32 34 35 37 46
9: 14 26.0 33.5 48.0 50.5 65.0 80.0
10: 15 20.00 23.00 26.00 33.33 40.00 54.00
11: 16 32 34 36 36 38 40
12: 17 32.00 36.00 40.00 40.67 45.00 50.00
13: 18 42.0 52.5 66.0 64.5 78.0 84.0
14: 19 36 41 46 50 57 68
15: 20 32.0 48.0 52.0 50.4 56.0 64.0
16: 22 66 66 66 66 66 66
17: 23 54 54 54 54 54 54
18: 24 70.00 86.50 92.50 93.75 99.75 120.00
19: 25 85 85 85 85 85 85
I tried the command below, but the output was not a data.table:
> cars.dt[, print(summary(dist)), by="speed"]
Min. 1st Qu. Median Mean 3rd Qu. Max.
2 4 6 6 8 10
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.0 8.5 13.0 13.0 17.5 22.0
...
Min. 1st Qu. Median Mean 3rd Qu. Max.
70.00 86.50 92.50 93.75 99.75 120.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
85 85 85 85 85 85
Empty data.table (0 rows) of 1 col: speed
I am unable to use functions that return multiple values together with the by clause.
If anyone has an idea how to write this, it would be much appreciated.
Also, please let me know whether this is possible in data.table at all.
Upvotes: 11
Views: 8832
Reputation: 886938
Try:
dt1 <- cars.dt[, as.list(summary(dist)), by="speed"]
head(dt1)
# speed Min. 1st Qu. Median Mean 3rd Qu. Max.
#1: 4 2 4.00 6.0 6.0 8.00 10
#2: 7 4 8.50 13.0 13.0 17.50 22
#3: 8 16 16.00 16.0 16.0 16.00 16
#4: 9 10 10.00 10.0 10.0 10.00 10
#5: 10 18 22.00 26.0 26.0 30.00 34
#6: 11 17 19.75 22.5 22.5 25.25 28
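As a rough illustration of why this works: data.table turns each element of a list returned in j into its own column, so naming the statistics explicitly in a list should give the same wide layout. The following is only a sketch under that assumption; the column names min, median, mean and max, and the object name res, are chosen here for illustration and are not from the question.

```r
library(data.table)

# Same setup as in the question: the built-in cars data as a data.table
cars.dt <- data.table(cars)

# Each named element of the list in j becomes one column per speed group
res <- cars.dt[, .(min    = min(dist),
                   median = median(dist),
                   mean   = mean(dist),
                   max    = max(dist)),
               by = speed]

head(res)
```

Compared with as.list(summary(dist)), this form lets you pick exactly which statistics to compute and what to call the resulting columns.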
You could also consider summaryBy from doBy to have some control over which summary functions are applied.
library(doBy)
dt2 <- summaryBy(.~speed, cars.dt, FUN=c(min, median, mean, max))
head(dt2,2)
# speed dist.min dist.median dist.mean dist.max
#1: 4 2 6 6 10
#2: 7 4 13 13 22
I think the difference between as.list and list is as follows.
Without the grouping variable:
list(summary(cars.dt$speed)) #this gets a `list` with one `list element`
#[[1]]
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 4.0 12.0 15.0 15.4 19.0 25.0
as.list(summary(cars.dt$speed)) #whereas this is also a list with multiple elements
# $Min.
#[1] 4
#$`1st Qu.`
#[1] 12
#$Median
#[1] 15
#$Mean
#[1] 15.4
#$`3rd Qu.`
#[1] 19
#$Max.
#[1] 25
This is the same as the difference between list(1:5) and as.list(1:5).
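The contrast between list and as.list can be checked directly in base R. This is a small sketch; the variable names wrapped, split and s are made up for the example.

```r
# list() wraps the whole vector as a single element
wrapped <- list(1:5)
length(wrapped)      # one element, which holds the vector 1:5

# as.list() splits the vector into one element per value
split <- as.list(1:5)
length(split)        # five elements, each of length 1

# For a summary, as.list() keeps the statistic names as element names,
# which is what lets data.table use them as column names
s <- as.list(summary(cars$speed))
names(s)
```

Since every element produced by as.list has length 1 and carries a name, each group yields a single row with one named column per statistic.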
Upvotes: 16