Reputation: 13
I have a data table named dt, which is presented below (somewhat similar).
dt:
| Gender| Year | Length| Weight | Athlete_flag | Age|
|:-- |:---- |:------|:------ |:------------ |:---|
| M | 2009 | 188 | 89 | 1 |17 |
| F | 2007 | 170 | 65 | 1 |19 |
| M |2007 |172 |90 |0 |45 |
|M |2017 |160 |70 |0 |34 |
|F |2017 |160 |70 |0 |24 |
|M |2018 |160 |70 |0 |58 |
|F |2016 |160 |70 |0 |49 |
|F |2017 |160 |70 |0 |37 |
I would like to create a data table named dt_new with descriptive statistics of specific columns in the previous table by year. I would have two group columns, "Variable" and "Year". In other words, the table would look like following:
dt_new:
Variable|Year |Number of observations| Min|Mean |Max|
:------ |:----|:---- |:---|:------ |:--|
Length | 2007|2 |170 |171 |172|
Length | 2009|1 |188 |188 |188|
Length | 2016|1 |160 |160 |162|
Length | 2017|3 |160 |160 |160|
Length | 2018|1 |160 |160 |160|
Weight | 2007|2 |65 |77.5 |90 |
Weight | 2009|1 |89 |89 |89 |
Weight | 2016|1 |70 |70 |70 |
Weight | 2017|3 |70 |70 |70 |
Weight | 2018|1 |70 |70 |70 |
Age | 2007|2 |19 |32 |45 |
Age | 2009|1 |17 |17 |17 |
Age | 2016|1 |49 |49 |49 |
Age | 2017|3 |24 |31.66 |37 |
Age | 2018|1 |58 |58 |58 |
My plan was to add more columns with descriptive statistics such as percentiles, e.g. P99. I have used data.table and would prefer to find a solution with it. I am able to create a table named dt_incorrect as planned, however not by year, see below. My current code is:
dt_incorrect <- dt[,.("Variable" = colnames(dt[,c("Length","Width","Age")]),
"Number of observations" = nrow(dt[,c("Length","Width","Age")]),
"Min" = lapply(dt[,c("Length","Width","Age")], function(x) min(x, na.rm = T)),
"Mean" = lapply(dt[,c("Length","Width","Age")], function(x) mean(x, na.rm = T)),
"Max" = lapply(dt[,c("Length","Width","Age")], function(x) max(x, na.rm = T)))]
dt_incorrect:
Variable |Number of observations| Min |Mean |Max |
:------ |:---- |:------ |:------ |:---|
Length |8 |160 |166.25 |188 |
Weight |8 |65 |74.25 |90 |
Age |8 |17 |35.375 |58 |
Thank you in advance for all suggestions on how to fix this!
Upvotes: 1
Views: 61
Reputation: 160447
Simply-data.table
:
library(data.table)
melt(dt, id.vars = c("Gender", "Year"), variable.name = "Variable"
)[, .(Num = .N, Min = min(value), Mean = mean(value), Max = max(value)),
by = .(Variable, Year)]
# Variable Year Num Min Mean Max
# <fctr> <int> <int> <int> <num> <int>
# 1: Length 2009 1 188 188.00000 188
# 2: Length 2007 2 170 171.00000 172
# 3: Length 2017 3 160 160.00000 160
# 4: Length 2018 1 160 160.00000 160
# 5: Length 2016 1 160 160.00000 160
# 6: Weight 2009 1 89 89.00000 89
# 7: Weight 2007 2 65 77.50000 90
# 8: Weight 2017 3 70 70.00000 70
# 9: Weight 2018 1 70 70.00000 70
# 10: Weight 2016 1 70 70.00000 70
# 11: Athlete_flag 2009 1 1 1.00000 1
# 12: Athlete_flag 2007 2 0 0.50000 1
# 13: Athlete_flag 2017 3 0 0.00000 0
# 14: Athlete_flag 2018 1 0 0.00000 0
# 15: Athlete_flag 2016 1 0 0.00000 0
# 16: Age 2009 1 17 17.00000 17
# 17: Age 2007 2 19 32.00000 45
# 18: Age 2017 3 24 31.66667 37
# 19: Age 2018 1 58 58.00000 58
# 20: Age 2016 1 49 49.00000 49
# Variable Year Num Min Mean Max
If the number of statistics is somewhat variable, you can do effectively the same thing but using lapply
on a list-of-functions:
melt(dt, id.vars = c("Gender", "Year"), variable.name = "Variable"
)[, lapply(list(Num=length, Min=min, Mean=mean, Max=max),
function(f) f(value)),
by = .(Variable, Year)]
This works well as long as all of them take the same arguments; for instance, the last three accept na.rm=TRUE
but length
does not. In this case, one can short-cut it a little with
melt(dt, id.vars = c("Gender", "Year"), variable.name = "Variable"
)[, c(Num = .N, lapply(list(Min=min, Mean=mean, Max=max),
function(f) f(value, na.rm=TRUE))),
by = .(Variable, Year)]
(if na.rm=
were needed). This supports, for instance, arbitrary use of most basic stat functions, including median
and var
/sd
, though quantile
still needs another argument, probs=
. Even this is not hard to do, if needed.
There are dozens of other ways this method (applying across a list of functions instead of a list of data) can be adapted.
Data:
dt <- setDT(structure(list(Gender = c("M", "F", "M", "M", "F", "M", "F", "F"), Year = c(2009L, 2007L, 2007L, 2017L, 2017L, 2018L, 2016L, 2017L), Length = c(188L, 170L, 172L, 160L, 160L, 160L, 160L, 160L), Weight = c(89L, 65L, 90L, 70L, 70L, 70L, 70L, 70L), Athlete_flag = c(1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L), Age = c(17L, 19L, 45L, 34L, 24L, 58L, 49L, 37L)), row.names = c(NA, -8L), class = c("data.table", "data.frame")))
Upvotes: 2
Reputation: 24722
Something like this (assuming srctable
is your starting table above):
# get the metrics for each of the columns of interest, by Year
srctable <- srctable[,
lapply(.SD, function(x) c(.N, min(x, na.rm=T), mean(x, na.rm=T), max(x, na.rm=T))),
.SDcols=c("Length","Weight", "Age"),
by="Year"
]
# Add a column that "labels" the metrics created
srctable[, metric:=c("N", "Min", "Mean", "Max"), by=Year]
# Use a combination of dcast and melt to rearrange
dcast(
melt(srctable, id.vars = c("Year", "metric"), measure.vars = c("Length", "Age", "Weight")),
Year+variable~metric,value.var = "value"
)
Output:
Year variable Max Mean Min N
1: 2007 Length 172 171.00000 170 2
2: 2007 Age 45 32.00000 19 2
3: 2007 Weight 90 77.50000 65 2
4: 2009 Length 188 188.00000 188 1
5: 2009 Age 17 17.00000 17 1
6: 2009 Weight 89 89.00000 89 1
7: 2016 Length 160 160.00000 160 1
8: 2016 Age 49 49.00000 49 1
9: 2016 Weight 70 70.00000 70 1
10: 2017 Length 160 160.00000 160 3
11: 2017 Age 37 31.66667 24 3
12: 2017 Weight 70 70.00000 70 3
13: 2018 Length 160 160.00000 160 1
14: 2018 Age 58 58.00000 58 1
15: 2018 Weight 70 70.00000 70 1
Upvotes: 1