Wimpel
Wimpel

Reputation: 27792

Create frequency table per group

In R, I have a data.frame with an ID, period, and a couple of variables of which the year is given.

df

df <- structure(list(id = 10001:10010, 
                     period = structure(c(1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L), 
                                        .Label = c("2017-01", "2017-02"), class = "factor"), 
                     year1 = c(2013, 2014, 2014, 2017, 1970, 2014, 1970, 2014, 2014, 1970), 
                     year2 = c(2015, NA, 2015, 2015, 1970, 1970, 2014, 2017, 2014, 2017), 
                     year3 = c(2015, 2014, 2015, NA, NA, 2014, 2014, 2015, 1970, 2014), 
                     year4 = c(1970, 2014, 2013, 2014, 1970, 1970, 2014, 2015, 2015, NA)), 
                .Names = c("id", "period", "year1", "year2", "year3", "year4"), 
                row.names = c(NA, -10L), 
                class = "data.frame")

What I would like to achieve, is to create a data.frame in which, for every period and for everyvariable (year1-year4), a frequency is shown how often a specific year (1970,2013-2017) is shown. The desired result should be something like this:

result

result <- structure(list(period = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), 
                                            .Label = c("2017-01", "2017-02"), 
                                            class = "factor"), 
                         category = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), 
                                              .Label = c("year1", "year2", "year3", "year4"), 
                                              class = "factor"), 
                         X1970 = c(1L, 1L, 0L, 2L, 2L, 1L, 1L, 1L), 
                         X2013 = c(1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L), 
                         X2014 = c(3L, 0L, 2L, 1L, 2L, 2L, 2L, 2L), 
                         X2015 = c(0L, 3L, 3L, 1L, 0L, 0L, 0L, 1L), 
                         X2017 = c(1L, 2L, 0L, 0L, 0L, 0L, 0L, 0L)), 
                    .Names = c("period", "category", "X1970", "X2013", "X2014", "X2015", "X2017"), 
                    class = "data.frame", 
                    row.names = c(NA, -8L))

I've got medium experience in R, and I'm confident this can be achieved. But I'm completely lost on how to approach this problem. Suggestions/ideas?

Upvotes: 1

Views: 1998

Answers (2)

Sotos
Sotos

Reputation: 51612

Here is an idea using tidyverse,

library(tidyverse)

df %>% 
 gather(category, val, -c(id, period)) %>% 
 na.omit() %>% 
 group_by(period, category, val) %>% 
 summarise(new = n()) %>% 
 spread(val, new, fill = 0)

which gives,

# A tibble: 8 x 7
# Groups:   period, category [8]
   period category `1970` `2013` `2014` `2015` `2017`
*  <fctr>    <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 2017-01    year1      1      1      3      0      1
2 2017-01    year2      1      0      0      3      2
3 2017-01    year3      0      0      2      3      0
4 2017-01    year4      2      1      1      1      0
5 2017-02    year1      2      0      2      0      0
6 2017-02    year2      1      0      2      0      0
7 2017-02    year3      1      0      2      0      0
8 2017-02    year4      1      0      2      1      0

The main trick here is to convert to long format, calculate the frequencies, and convert back to wide format.

A bit shorter version, utilizing the count function (as @markdly mentions), can be,

df %>% 
 gather(category, val, -c(id, period)) %>% 
 na.omit() %>% 
 count(period, category, val) %>% 
 spread(val, n, fill = 0)

Upvotes: 4

Manuel Bickel
Manuel Bickel

Reputation: 2206

If I understand your problem correctly you can achieve this with the package reshape2 by first turning the data from "wide" to "long" format via melt and then turning it back via cast while selecting the desired categories - I hope this is what you need.

df <- structure(list(id = 10001:10010, 
                     period = structure(c(1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L), 
                                        .Label = c("2017-01", "2017-02"), class = "factor"), 
                     year1 = c(2013, 2014, 2014, 2017, 1970, 2014, 1970, 2014, 2014, 1970), 
                     year2 = c(2015, NA, 2015, 2015, 1970, 1970, 2014, 2017, 2014, 2017), 
                     year3 = c(2015, 2014, 2015, NA, NA, 2014, 2014, 2015, 1970, 2014), 
                     year4 = c(1970, 2014, 2013, 2014, 1970, 1970, 2014, 2015, 2015, NA)), 
                .Names = c("id", "period", "year1", "year2", "year3", "year4"), 
                row.names = c(NA, -10L), 
                class = "data.frame")



library(reshape2)
dcast(melt(df[,-1]), period + variable ~ value)

#     period variable 1970 2013 2014 2015 2017 NA
# 1 2017-01    year1    1    1    3    0    1  0
# 2 2017-01    year2    1    0    0    3    2  0
# 3 2017-01    year3    0    0    2    3    0  1
# 4 2017-01    year4    2    1    1    1    0  1
# 5 2017-02    year1    2    0    2    0    0  0
# 6 2017-02    year2    1    0    2    0    0  1
# 7 2017-02    year3    1    0    2    0    0  1
# 8 2017-02    year4    1    0    2    1    0  0

Upvotes: 3

Related Questions