Reputation: 1516
I have a data set that is in a long format, and I can’t seem to get it the right shape for analysis. Perhaps this shape is appropriate — my experience has been almost entirely with wide format data, so this data file is not making sense to me. (Reproducible data file at end of the post.)
> head(df,10)
ID attributes values
1 1 AU AAA
2 1 AU BBB
3 1 YR 2014
4 2 AU CCC
5 2 AU DDD
6 2 AU EEE
7 2 AU FFF
8 2 AU GGG
9 2 YR 2013
10 3 AU HHH
The attributes column contain variables of interest to me, and I want to perform a series of aggregation functions. For example, I would like to:
1.Obtain a count of the number of authors (AU) for each ID. For example:
ID N.AU
1 2
2 5
3 1
4 2
5 5
6 1
Compute the median number of authors (AU) by year (YR)
YR Median.N.AU
2013 5.0
2014 1.5
For both of these examples, I have tried dplry with group_by and summaries, but haven’t cracked the code. I have also tried dcast. My hope is to come up with a solution that I can easily generalize to a larger data frame that has many more attributes that take on either a single value or multiple values. Any help or pointers to a similar solution would be greatly appreciated.
attributes = c("AU", "AU", "YR", "AU", "AU", "AU", "AU", "AU", "YR", "AU", "YR",
"AU", "AU", "YR", "AU", "AU", "AU", "AU", "AU", "YR", "AU", "YR")
ID = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 6)
values = c("AAA", "BBB", "2014", "CCC", "DDD", "EEE", "FFF", "GGG", "2013", "HHH", "2014",
"III", "JJJ", "2014", "KKK", "LLL", "MMM", "NNN", "OOO", "2013", "PPP", "2014")
df <- data.frame(ID, attributes, values)
Upvotes: 2
Views: 408
Reputation: 19960
I misunderstood the structure of your dataset initially. Thanks to the comments below I realize your data needs to be restructured.
# split the data out
df1 <- df[df$attributes == "AU",]
df2 <- df[df$attributes == "YR",]
# just keeping the columns with data as opposed to the label
df3 <- merge(df1, df2, by="ID")[,c(1,3,5)]
# set column names for clarification
colnames(df3) <- c("ID","author","year")
# get author counts
num.authors <- count(df3, vars=c("ID","year"))
ID year freq
1 1 2014 2
2 2 2013 5
3 3 2014 1
4 4 2014 2
5 5 2013 5
6 6 2014 1
summaryBy(freq ~ year, data = num.authors, FUN = list(median))
year freq.median
1 2013 5.0
2 2014 1.5
The nice thing about summaryBy
is that you can add in which ever function has been defined in the list and you will get another column containing the other metric (e.g. mean, sd, etc.)
Upvotes: 1
Reputation: 92302
Here's a possible data.table
solution
I would also suggest to create some aggregated data set with separated columns. For example:
library(data.table)
(subdf <- as.data.table(df)[, .(N.AU = sum(attributes == "AU"),
Year = values[attributes == "YR"]) , ID])
# ID N.AU Year
# 1: 1 2 2014
# 2: 2 5 2013
# 3: 3 1 2014
# 4: 4 2 2014
# 5: 5 5 2013
# 6: 6 1 2014
Calculating median per year
subdf[, .(Median.N.AU = median(N.AU)), keyby = Year]
# Year Median.N.AU
# 1: 2013 5.0
# 2: 2014 1.5
Upvotes: 2
Reputation: 103928
I think you're getting confused because you actually have two tables of data linked by a common ID:
library(dplyr)
df <- tbl_df(df)
years <- df %>%
filter(attributes == "YR") %>%
select(id = ID, year = values)
years
#> Source: local data frame [6 x 2]
#>
#> id year
#> 1 1 2014
#> 2 2 2013
#> 3 3 2014
#> 4 4 2014
#> 5 5 2013
#> .. .. ...
authors <- df %>%
filter(attributes == "AU") %>%
select(id = ID, author = values)
authors
#> Source: local data frame [16 x 2]
#>
#> id author
#> 1 1 AAA
#> 2 1 BBB
#> 3 2 CCC
#> 4 2 DDD
#> 5 2 EEE
#> .. .. ...
Once you have the data in this form, it's easy to answer the questions you're interested in:
Authors per paper:
n_authors <- authors %>%
group_by(id) %>%
summarise(n = n())
Or
n_authors <- authors %>% count(id)
Median authors per year:
n_authors %>%
left_join(years) %>%
group_by(year) %>%
summarise(median(n))
#> Joining by: "id"
#> Source: local data frame [2 x 2]
#>
#> year median(n)
#> 1 2013 5.0
#> 2 2014 1.5
Upvotes: 3