Christian Lindig
Christian Lindig

Reputation: 1246

R data frame organisation

I'd like to analyse a sequence of rowing races in R where boats with 4 rowers each race pairwise against each other. I wonder about the best way to represent this in a data frame. I currently have 12 timed events, 2 such events constitute a race between two boats.

     time race boat seat1 seat2 seat3 seat4
1  204.98    1    1     2     6     1     5
2  202.49    2    1     4     5     2     7
3  202.27    3    1     2     6     3     7
4  206.48    4    1     1     7     2     8
5  204.85    5    1     4     8     2     6
6  204.93    6    1     2     8     3     5
7  204.91    1    2     3     7     4     8
8  207.40    2    2     1     8     3     6
9  207.62    3    2     1     5     4     8
10 203.41    4    2     3     5     4     6
11 205.04    5    2     3     7     1     5
12 204.96    6    2     4     6     1     7

Here the numbers in the seat columns refer to rowers (so there are 8 of them) but it would be more natural to use names or letters. I need to extract a 12x8 matrix that captures which rower participated in which event.

The code below builds the data frame above:

df <- data.frame ( 
                  time = c(204.98, 202.49, 202.27, 206.48, 204.85, 204.93,
                           204.91, 207.40, 207.62, 203.41, 205.04, 204.96),
                  race = append(1:6, 1:6),
                  boat = append(rep(1,6),rep(2,6)),
                  seat1 = c(2,4,2,1,4,2, 3,1,1,3,3,4),
                  seat2 = c(6,5,6,7,8,8, 7,8,5,5,7,6),
                  seat3 = c(1,2,3,2,2,3, 4,3,4,4,1,1),
                  seat4 = c(5,7,7,8,6,5, 8,6,8,6,5,7))

  1. To extract the relation between rowers and events, would it be better to organise this differently?
  2. Would it be natural to capture additional facts about rowers (like their weight, age) in a separate data frame or is it better (how?) to keep everything in one data frame.

It seems there is a tradeoff between redundancy and convenience. Whereas in a relational database one would use several relations it appears the R community prefers to share data in a single data frame. I am sure there is always a way to make it work but lacking the experience I'd be curious how experienced R users would organise the data.

Addendum: Lots of answers highlight the importance of the questions. Here is one that would benefit from bringing data into matrix form: the total time a rower spent in races: a vector of event times and a {0,1} valued matrix that connects events and rowers mentioned before. The result could be obtained by multiplying them.

Upvotes: 12

Views: 200

Answers (4)

G. Grothendieck
G. Grothendieck

Reputation: 269421

To create a table of events vs. rowers melt the data into long form m and then back into the appropriate wide form. There is no reason you can't have the data in multiple forms so it is really not necessary to choose the best forms. You can always regenerate them if new data comes in. The form of interest really depends on what you want to do with it but the code below gives you three forms:

  1. the original wide form df,
  2. the long form m which could be useful for regression, boxplots, etc. e.g.

    lm(time ~ factor(rower) + 0, m)
    boxplot(time ~ boat, m)
    
  3. the revised wide form df2.

If there exists rower specific attributes then those could be stored in a separate data frame with one row per rower and one column per attribute and depending on what you want to do could be merged with m using merge if you want to use those in a regression, say.

library(data.table)

m <- melt(as.data.table(df), id = 1:3, value.name = "rower")
df2 <- dcast(data = m, time + race + boat ~ rower, value.var = "rower")
setkey(df2, boat, race) # sort
df2

giving:

      time race boat  1  2  3  4  5  6  7  8
 1: 204.98    1    1  1  2 NA NA  5  6 NA NA
 2: 202.49    2    1 NA  2 NA  4  5 NA  7 NA
 3: 202.27    3    1 NA  2  3 NA NA  6  7 NA
 4: 206.48    4    1  1  2 NA NA NA NA  7  8
 5: 204.85    5    1 NA  2 NA  4 NA  6 NA  8
 6: 204.93    6    1 NA  2  3 NA  5 NA NA  8
 7: 204.91    1    2 NA NA  3  4 NA NA  7  8
 8: 207.40    2    2  1 NA  3 NA NA  6 NA  8
 9: 207.62    3    2  1 NA NA  4  5 NA NA  8
10: 203.41    4    2 NA NA  3  4  5  6 NA NA
11: 205.04    5    2  1 NA  3 NA  5 NA  7 NA
12: 204.96    6    2  1 NA NA  4 NA  6  7 NA

Alternately, with dplyr/tidyr:

library(dplyr)
library(tidyr)

m <- df %>%
  pivot_longer(-(1:3), names_to = "seat", values_to = "rower")
df2 <- m %>% 
  pivot_wider(1:3, names_from = rower, values_from = rower, names_sort = TRUE)

Upvotes: 5

Chuck P
Chuck P

Reputation: 3923

No disagreement that it depends on the questions. But I suspect in your case a lot will be answered from long format and that will also make it easy to attach additional rower information when and if needed.

library(dplyr)
library(tidyr)

my_way <- pivot_longer(df, starts_with("seat"), values_to = "rower", names_to = "seat")
my_way
#> # A tibble: 48 x 5
#>     time  race  boat seat  rower
#>    <dbl> <int> <dbl> <chr> <dbl>
#>  1  205.     1     1 seat1     2
#>  2  205.     1     1 seat2     6
#>  3  205.     1     1 seat3     1
#>  4  205.     1     1 seat4     5
#>  5  202.     2     1 seat1     4
#>  6  202.     2     1 seat2     5
#>  7  202.     2     1 seat3     2
#>  8  202.     2     1 seat4     7
#>  9  202.     3     1 seat1     2
#> 10  202.     3     1 seat2     6
#> # … with 38 more rows

my_way %>% group_by(rower) %>% summarise(mean(time))
#> # A tibble: 8 x 2
#>   rower `mean(time)`
#>   <dbl>        <dbl>
#> 1     1         206.
#> 2     2         204.
#> 3     3         205.
#> 4     4         205.
#> 5     5         205.
#> 6     6         205.
#> 7     7         204.
#> 8     8         206.

my_way %>% group_by(rower, seat) %>% summarise()
#> # A tibble: 16 x 2
#> # Groups:   rower [8]
#>    rower seat 
#>    <dbl> <chr>
#>  1     1 seat1
#>  2     1 seat3
#>  3     2 seat1
#>  4     2 seat3
#>  5     3 seat1
#>  6     3 seat3
#>  7     4 seat1
#>  8     4 seat3
#>  9     5 seat2
#> 10     5 seat4
#> 11     6 seat2
#> 12     6 seat4
#> 13     7 seat2
#> 14     7 seat4
#> 15     8 seat2
#> 16     8 seat4

Upvotes: 1

Jan
Jan

Reputation: 5254

This is certainly a matter of opinion (totally agree with @MattB). Data frames are a very convenient way for many statistical analyses but many times you have to transform them to fit your purpose.

Your case shows a data frame in "wide form". I see no convenient way to add more facts about rowers. I would transform it to "long form". In the wide form each rower gets their own row. And since the rowers seem to be your "object of interest" (your cases) that could probably make things easier. The question "which races did rower 4 take part in?" could be answered easily with that form.

Upvotes: 6

MattB
MattB

Reputation: 671

This is going to be a matter of opinion and will depend in part on what sort of questions you will want to ask of this dataset. For example, the question "which races did rower 4 take part in?" is not easily answered with the format above.

For that reason I would lean towards:

  • A table of races, much like you have, but without the seat* columns;
  • A table of rowers, where additional details (name, weight, etc.) can be kept; and
  • A table linking the two, with one row per rower per race.

This would avoid most redundancy and allow most questions (that I can think of!) to be answered relatively straightforwardly. You can always have a function (using, e.g., dcast) to recreate the form you show above for human-readability.

Upvotes: 4

Related Questions