Calle
Calle

Reputation: 13

R: construct data table by transposing and using two group columns

I have a data table named dt, which is presented below (somewhat similar).

dt:

| Gender| Year | Length| Weight | Athlete_flag | Age|
|:--    |:---- |:------|:------ |:------------ |:---|
| M     | 2009 | 188   | 89     | 1            |17  |
| F     | 2007 | 170   | 65     | 1            |19  |
| M     |2007  |172    |90      |0             |45  |
|M      |2017  |160    |70      |0             |34  |
|F      |2017  |160    |70      |0             |24  |
|M      |2018  |160    |70      |0             |58  |
|F      |2016  |160    |70      |0             |49  |
|F      |2017  |160    |70      |0             |37  |

I would like to create a data table named dt_new with descriptive statistics of specific columns in the previous table by year. I would have two group columns, "Variable" and "Year". In other words, the table would look like following:

dt_new:

Variable|Year |Number of observations| Min|Mean    |Max|
:------ |:----|:----                 |:---|:------ |:--|
Length  | 2007|2                     |170 |171     |172|
Length  | 2009|1                     |188 |188     |188|
Length  | 2016|1                     |160 |160     |162|
Length  | 2017|3                     |160 |160     |160|
Length  | 2018|1                     |160 |160     |160|
Weight  | 2007|2                     |65  |77.5    |90 |
Weight  | 2009|1                     |89  |89      |89 |
Weight  | 2016|1                     |70  |70      |70 |
Weight  | 2017|3                     |70  |70      |70 |
Weight  | 2018|1                     |70  |70      |70 |
Age     | 2007|2                     |19  |32      |45 |
Age     | 2009|1                     |17  |17      |17 |
Age     | 2016|1                     |49  |49      |49 |
Age     | 2017|3                     |24  |31.66   |37 |
Age     | 2018|1                     |58  |58      |58 |

My plan was to add more columns with descriptive statistics such as percentiles, e.g. P99. I have used data.table and would prefer to find a solution with it. I am able to create a table named dt_incorrect as planned, however not by year, see below. My current code is:

dt_incorrect <- dt[,.("Variable" = colnames(dt[,c("Length","Width","Age")]),
"Number of observations" = nrow(dt[,c("Length","Width","Age")]),
"Min" = lapply(dt[,c("Length","Width","Age")], function(x) min(x, na.rm = T)),
"Mean" = lapply(dt[,c("Length","Width","Age")], function(x) mean(x, na.rm = T)),
"Max" = lapply(dt[,c("Length","Width","Age")], function(x) max(x, na.rm = T)))]

dt_incorrect:

Variable |Number of observations| Min    |Mean    |Max |
:------  |:----                 |:------ |:------ |:---|
Length   |8                     |160     |166.25  |188 |
Weight   |8                     |65      |74.25   |90  |
Age      |8                     |17      |35.375  |58  |

Thank you in advance for all suggestions on how to fix this!

Upvotes: 1

Views: 61

Answers (2)

r2evans
r2evans

Reputation: 160447

Simply-data.table:

library(data.table)
melt(dt, id.vars = c("Gender", "Year"), variable.name = "Variable"
  )[, .(Num = .N, Min = min(value), Mean = mean(value), Max = max(value)), 
     by = .(Variable, Year)]
#         Variable  Year   Num   Min      Mean   Max
#           <fctr> <int> <int> <int>     <num> <int>
#  1:       Length  2009     1   188 188.00000   188
#  2:       Length  2007     2   170 171.00000   172
#  3:       Length  2017     3   160 160.00000   160
#  4:       Length  2018     1   160 160.00000   160
#  5:       Length  2016     1   160 160.00000   160
#  6:       Weight  2009     1    89  89.00000    89
#  7:       Weight  2007     2    65  77.50000    90
#  8:       Weight  2017     3    70  70.00000    70
#  9:       Weight  2018     1    70  70.00000    70
# 10:       Weight  2016     1    70  70.00000    70
# 11: Athlete_flag  2009     1     1   1.00000     1
# 12: Athlete_flag  2007     2     0   0.50000     1
# 13: Athlete_flag  2017     3     0   0.00000     0
# 14: Athlete_flag  2018     1     0   0.00000     0
# 15: Athlete_flag  2016     1     0   0.00000     0
# 16:          Age  2009     1    17  17.00000    17
# 17:          Age  2007     2    19  32.00000    45
# 18:          Age  2017     3    24  31.66667    37
# 19:          Age  2018     1    58  58.00000    58
# 20:          Age  2016     1    49  49.00000    49
#         Variable  Year   Num   Min      Mean   Max

If the number of statistics is somewhat variable, you can do effectively the same thing but using lapply on a list-of-functions:

melt(dt, id.vars = c("Gender", "Year"), variable.name = "Variable"
  )[, lapply(list(Num=length, Min=min, Mean=mean, Max=max),
             function(f) f(value)),
    by = .(Variable, Year)]

This works well as long as all of them take the same arguments; for instance, the last three accept na.rm=TRUE but length does not. In this case, one can short-cut it a little with

melt(dt, id.vars = c("Gender", "Year"), variable.name = "Variable"
  )[, c(Num = .N, lapply(list(Min=min, Mean=mean, Max=max), 
                         function(f) f(value, na.rm=TRUE))),
     by = .(Variable, Year)]

(if na.rm= were needed). This supports, for instance, arbitrary use of most basic stat functions, including median and var/sd, though quantile still needs another argument, probs=. Even this is not hard to do, if needed.

There are dozens of other ways this method (applying across a list of functions instead of a list of data) can be adapted.


Data:

dt <- setDT(structure(list(Gender = c("M", "F", "M", "M", "F", "M", "F", "F"), Year = c(2009L, 2007L, 2007L, 2017L, 2017L, 2018L, 2016L, 2017L), Length = c(188L, 170L, 172L, 160L, 160L, 160L, 160L, 160L), Weight = c(89L, 65L, 90L, 70L, 70L, 70L, 70L, 70L), Athlete_flag = c(1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L), Age = c(17L, 19L, 45L, 34L, 24L, 58L, 49L, 37L)), row.names = c(NA, -8L), class = c("data.table", "data.frame")))

Upvotes: 2

langtang
langtang

Reputation: 24722

Something like this (assuming srctable is your starting table above):

# get the metrics for each of the columns of interest, by Year
srctable <- srctable[, 
                     lapply(.SD, function(x) c(.N, min(x, na.rm=T), mean(x, na.rm=T), max(x, na.rm=T))),
                     .SDcols=c("Length","Weight", "Age"),
                     by="Year"
                     ]

# Add a column that "labels" the metrics created
srctable[, metric:=c("N", "Min", "Mean", "Max"), by=Year]

# Use a combination of dcast and melt to rearrange
dcast(
  melt(srctable, id.vars = c("Year", "metric"), measure.vars = c("Length", "Age", "Weight")),
  Year+variable~metric,value.var = "value"
)

Output:

    Year variable Max      Mean Min N
 1: 2007   Length 172 171.00000 170 2
 2: 2007      Age  45  32.00000  19 2
 3: 2007   Weight  90  77.50000  65 2
 4: 2009   Length 188 188.00000 188 1
 5: 2009      Age  17  17.00000  17 1
 6: 2009   Weight  89  89.00000  89 1
 7: 2016   Length 160 160.00000 160 1
 8: 2016      Age  49  49.00000  49 1
 9: 2016   Weight  70  70.00000  70 1
10: 2017   Length 160 160.00000 160 3
11: 2017      Age  37  31.66667  24 3
12: 2017   Weight  70  70.00000  70 3
13: 2018   Length 160 160.00000 160 1
14: 2018      Age  58  58.00000  58 1
15: 2018   Weight  70  70.00000  70 1

Upvotes: 1

Related Questions