Reputation: 27792
In R, I have a data.frame with an ID, period, and a couple of variables of which the year is given.
df <- structure(list(id = 10001:10010,
period = structure(c(1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L),
.Label = c("2017-01", "2017-02"), class = "factor"),
year1 = c(2013, 2014, 2014, 2017, 1970, 2014, 1970, 2014, 2014, 1970),
year2 = c(2015, NA, 2015, 2015, 1970, 1970, 2014, 2017, 2014, 2017),
year3 = c(2015, 2014, 2015, NA, NA, 2014, 2014, 2015, 1970, 2014),
year4 = c(1970, 2014, 2013, 2014, 1970, 1970, 2014, 2015, 2015, NA)),
.Names = c("id", "period", "year1", "year2", "year3", "year4"),
row.names = c(NA, -10L),
class = "data.frame")
What I would like to achieve, is to create a data.frame in which, for every period and for everyvariable (year1-year4), a frequency is shown how often a specific year (1970,2013-2017) is shown. The desired result should be something like this:
result <- structure(list(period = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
.Label = c("2017-01", "2017-02"),
class = "factor"),
category = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L),
.Label = c("year1", "year2", "year3", "year4"),
class = "factor"),
X1970 = c(1L, 1L, 0L, 2L, 2L, 1L, 1L, 1L),
X2013 = c(1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L),
X2014 = c(3L, 0L, 2L, 1L, 2L, 2L, 2L, 2L),
X2015 = c(0L, 3L, 3L, 1L, 0L, 0L, 0L, 1L),
X2017 = c(1L, 2L, 0L, 0L, 0L, 0L, 0L, 0L)),
.Names = c("period", "category", "X1970", "X2013", "X2014", "X2015", "X2017"),
class = "data.frame",
row.names = c(NA, -8L))
I've got medium experience in R, and I'm confident this can be achieved. But I'm completely lost on how to approach this problem. Suggestions/ideas?
Upvotes: 1
Views: 1998
Reputation: 51612
Here is an idea using tidyverse
,
library(tidyverse)
df %>%
gather(category, val, -c(id, period)) %>%
na.omit() %>%
group_by(period, category, val) %>%
summarise(new = n()) %>%
spread(val, new, fill = 0)
which gives,
# A tibble: 8 x 7 # Groups: period, category [8] period category `1970` `2013` `2014` `2015` `2017` * <fctr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> 1 2017-01 year1 1 1 3 0 1 2 2017-01 year2 1 0 0 3 2 3 2017-01 year3 0 0 2 3 0 4 2017-01 year4 2 1 1 1 0 5 2017-02 year1 2 0 2 0 0 6 2017-02 year2 1 0 2 0 0 7 2017-02 year3 1 0 2 0 0 8 2017-02 year4 1 0 2 1 0
The main trick here is to convert to long format, calculate the frequencies, and convert back to wide format.
A bit shorter version, utilizing the count
function (as @markdly mentions), can be,
df %>%
gather(category, val, -c(id, period)) %>%
na.omit() %>%
count(period, category, val) %>%
spread(val, n, fill = 0)
Upvotes: 4
Reputation: 2206
If I understand your problem correctly you can achieve this with the package reshape2
by first turning the data from "wide" to "long" format via melt
and then turning it back via cast
while selecting the desired categories - I hope this is what you need.
df <- structure(list(id = 10001:10010,
period = structure(c(1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L),
.Label = c("2017-01", "2017-02"), class = "factor"),
year1 = c(2013, 2014, 2014, 2017, 1970, 2014, 1970, 2014, 2014, 1970),
year2 = c(2015, NA, 2015, 2015, 1970, 1970, 2014, 2017, 2014, 2017),
year3 = c(2015, 2014, 2015, NA, NA, 2014, 2014, 2015, 1970, 2014),
year4 = c(1970, 2014, 2013, 2014, 1970, 1970, 2014, 2015, 2015, NA)),
.Names = c("id", "period", "year1", "year2", "year3", "year4"),
row.names = c(NA, -10L),
class = "data.frame")
library(reshape2)
dcast(melt(df[,-1]), period + variable ~ value)
# period variable 1970 2013 2014 2015 2017 NA
# 1 2017-01 year1 1 1 3 0 1 0
# 2 2017-01 year2 1 0 0 3 2 0
# 3 2017-01 year3 0 0 2 3 0 1
# 4 2017-01 year4 2 1 1 1 0 1
# 5 2017-02 year1 2 0 2 0 0 0
# 6 2017-02 year2 1 0 2 0 0 1
# 7 2017-02 year3 1 0 2 0 0 1
# 8 2017-02 year4 1 0 2 1 0 0
Upvotes: 3