Brian P
Brian P

Reputation: 1516

Reshape data long data in R or aggregate?

I have a data set that is in a long format, and I can’t seem to get it the right shape for analysis. Perhaps this shape is appropriate — my experience has been almost entirely with wide format data, so this data file is not making sense to me. (Reproducible data file at end of the post.)

> head(df,10)
    ID attributes values
1   1         AU    AAA
2   1         AU    BBB
3   1         YR   2014
4   2         AU    CCC
5   2         AU    DDD
6   2         AU    EEE
7   2         AU    FFF
8   2         AU    GGG
9   2         YR   2013
10  3         AU    HHH

The attributes column contain variables of interest to me, and I want to perform a series of aggregation functions. For example, I would like to:

1.Obtain a count of the number of authors (AU) for each ID. For example:

   ID       N.AU
    1           2
    2           5
    3           1
    4           2
    5           5
    6           1
  1. Compute the median number of authors (AU) by year (YR)

    YR           Median.N.AU   
    2013          5.0
    2014          1.5
    

For both of these examples, I have tried dplry with group_by and summaries, but haven’t cracked the code. I have also tried dcast. My hope is to come up with a solution that I can easily generalize to a larger data frame that has many more attributes that take on either a single value or multiple values. Any help or pointers to a similar solution would be greatly appreciated.

attributes = c("AU", "AU", "YR", "AU", "AU", "AU", "AU", "AU", "YR", "AU", "YR",
   "AU", "AU", "YR", "AU", "AU", "AU", "AU", "AU", "YR", "AU", "YR")
ID = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 6)
values = c("AAA", "BBB", "2014", "CCC", "DDD", "EEE", "FFF", "GGG", "2013", "HHH", "2014",
   "III", "JJJ", "2014", "KKK", "LLL", "MMM", "NNN", "OOO", "2013", "PPP", "2014")
df <- data.frame(ID, attributes, values)

Upvotes: 2

Views: 408

Answers (3)

cdeterman
cdeterman

Reputation: 19960

I misunderstood the structure of your dataset initially. Thanks to the comments below I realize your data needs to be restructured.

# split the data out
df1 <- df[df$attributes == "AU",]
df2 <- df[df$attributes == "YR",]

# just keeping the columns with data as opposed to the label
df3 <- merge(df1, df2, by="ID")[,c(1,3,5)]
# set column names for clarification
colnames(df3) <- c("ID","author","year")

# get author counts
num.authors <- count(df3, vars=c("ID","year"))
  ID year freq
1  1 2014    2
2  2 2013    5
3  3 2014    1
4  4 2014    2
5  5 2013    5
6  6 2014    1

summaryBy(freq ~ year, data = num.authors, FUN = list(median))
  year freq.median
1 2013         5.0
2 2014         1.5

The nice thing about summaryBy is that you can add in which ever function has been defined in the list and you will get another column containing the other metric (e.g. mean, sd, etc.)

Upvotes: 1

David Arenburg
David Arenburg

Reputation: 92302

Here's a possible data.table solution

I would also suggest to create some aggregated data set with separated columns. For example:

library(data.table)
(subdf <- as.data.table(df)[, .(N.AU = sum(attributes == "AU"),
                                Year = values[attributes == "YR"]) , ID])
#    ID N.AU Year
# 1:  1    2 2014
# 2:  2    5 2013
# 3:  3    1 2014
# 4:  4    2 2014
# 5:  5    5 2013
# 6:  6    1 2014

Calculating median per year

subdf[, .(Median.N.AU = median(N.AU)), keyby = Year]
#    Year Median.N.AU
# 1: 2013         5.0
# 2: 2014         1.5

Upvotes: 2

hadley
hadley

Reputation: 103928

I think you're getting confused because you actually have two tables of data linked by a common ID:

library(dplyr)
df <- tbl_df(df)

years <- df %>% 
  filter(attributes == "YR") %>% 
  select(id = ID, year = values)
years
#> Source: local data frame [6 x 2]
#> 
#>    id year
#> 1   1 2014
#> 2   2 2013
#> 3   3 2014
#> 4   4 2014
#> 5   5 2013
#> .. ..  ...

authors <- df %>% 
  filter(attributes == "AU") %>% 
  select(id = ID, author = values)
authors
#> Source: local data frame [16 x 2]
#> 
#>    id author
#> 1   1    AAA
#> 2   1    BBB
#> 3   2    CCC
#> 4   2    DDD
#> 5   2    EEE
#> .. ..    ...

Once you have the data in this form, it's easy to answer the questions you're interested in:

  1. Authors per paper:

    n_authors <- authors %>% 
      group_by(id) %>% 
      summarise(n = n())
    

    Or

    n_authors <- authors %>% count(id)
    
  2. Median authors per year:

    n_authors %>%
      left_join(years) %>%
      group_by(year) %>%
      summarise(median(n))
    #> Joining by: "id"
    #> Source: local data frame [2 x 2]
    #> 
    #>   year median(n)
    #> 1 2013       5.0
    #> 2 2014       1.5
    

Upvotes: 3

Related Questions