Reputation: 35
I have a data frame organized like the sample below:
set.seed(661)
raw <- data.frame(
year = 1900:2020,
cat = sample(c("A", "B", "C", "D", "E"), size = 121, replace = TRUE)
)
head(raw)
year cat
1 1900 B
2 1901 A
3 1902 E
4 1903 E
5 1904 B
6 1905 A
However, I would like to transform this into a data frame where I have a cumulative count for every unique value of raw$cat
for each year. This way I can track the total count of each category from 1900 to raw$year==someX
. The desired data frame would like something like the below:
desiredFinal <- data.frame(
year = c(rep(1901, 5), rep(1902, 5)),
cat = c("A", "B", "C", "D", "E", "A", "B", "C", "D", "E"),
count = c(0, 1, 0, 0, 0, 1, 1, 0, 0, 0)
)
desiredFinal
time cat count
1 1900 A 0
2 1900 B 1
3 1900 C 0
4 1900 D 0
5 1900 E 0
6 1901 A 1
7 1901 B 1
8 1901 C 0
9 1901 D 0
10 1901 E 0
I tried the below snippet of code:
test <- raw %>%
group_by(cat) %>%
transmute(year = year,
count = seq(n())
)
But this didn't give me a view of every category for each year.
I also considered creating a series of binary variables for each category in a wide format for each year (ie when raw$cat=="B"
then raw$catB==1
and raw$catA==0
and so) then reshaping it, but wasn't sure if this was the most efficient solution.
I would prefer a solution to this problem in base R, but I would also appreciate a solution in dplyr!
Thanks so much!
Upvotes: 0
Views: 217
Reputation: 270248
This uses base R, the collapse package and magrittr (the last one for the pipes).
In the code below, we first use xtabs
to get a table of counts by cat
and year
(each row is a year and each column is a category). We then compute cumulative sums of the columns giving another table object. Next convert that to a long form data frame using the table method of as.data.frame
, rearrange the order of the columns and sort them in the same order as shown in the question. If the order of the rows and columns is not important the last two lines of code can be omitted.
library(collapse)
library(magrittr)
raw %>%
xtabs(~ year + cat, .) %>%
dapply(cumsum) %>%
as.data.frame %>%
fselect(year, cat, count = Freq) %>%
roworder(year, cat)
The first few rows of the result are
year cat count
1 1900 A 0
2 1900 B 1
3 1900 C 0
4 1900 D 0
5 1900 E 0
6 1901 A 1
7 1901 B 1
8 1901 C 0
9 1901 D 0
10 1901 E 0
11 1902 A 1
12 1902 B 1
Upvotes: 1
Reputation: 389275
Use complete
to create every cat
value in each year, for each cat
get the cumulative sum of count.
library(dplyr)
library(tidyr)
raw %>%
mutate(count = 1) %>%
complete(year, cat, fill = list(count = 0)) %>%
group_by(cat) %>%
mutate(count = cumsum(count)) %>%
ungroup
# A tibble: 605 x 3
# year cat count
# <int> <chr> <dbl>
# 1 1900 A 0
# 2 1900 B 1
# 3 1900 C 0
# 4 1900 D 0
# 5 1900 E 0
# 6 1901 A 1
# 7 1901 B 1
# 8 1901 C 0
# 9 1901 D 0
#10 1901 E 0
# … with 595 more rows
In base R, you can do this as :
dat <- expand.grid(year = unique(raw$year), cat = unique(raw$cat))
raw$count <- 1
dat1 <- merge(dat, raw, all.x = TRUE)
dat1$count[is.na(dat1$count)] <- 0
dat1 <- transform(dat1, count = ave(count, cat, FUN = cumsum))
Upvotes: 0