Brendan
Brendan

Reputation: 35

Convert year data frame to category count data frame in R

I have a data frame organized like the sample below:

set.seed(661)

raw <- data.frame(
    year = 1900:2020,
    cat = sample(c("A", "B", "C", "D", "E"), size = 121, replace = TRUE)
)

head(raw)

  year cat
1 1900   B
2 1901   A
3 1902   E
4 1903   E
5 1904   B
6 1905   A

However, I would like to transform this into a data frame where I have a cumulative count for every unique value of raw$cat for each year. This way I can track the total count of each category from 1900 to raw$year==someX. The desired data frame would like something like the below:

desiredFinal <- data.frame(
    year = c(rep(1901, 5), rep(1902, 5)),
    cat = c("A", "B", "C", "D", "E", "A", "B", "C", "D", "E"),
    count = c(0, 1, 0, 0, 0, 1, 1, 0, 0, 0)
)

desiredFinal

   time cat count
1  1900   A     0
2  1900   B     1
3  1900   C     0
4  1900   D     0
5  1900   E     0
6  1901   A     1
7  1901   B     1
8  1901   C     0
9  1901   D     0
10 1901   E     0

I tried the below snippet of code:

test <- raw %>%
    group_by(cat) %>%
    transmute(year = year, 
              count = seq(n())
              )

But this didn't give me a view of every category for each year.

I also considered creating a series of binary variables for each category in a wide format for each year (ie when raw$cat=="B" then raw$catB==1 and raw$catA==0 and so) then reshaping it, but wasn't sure if this was the most efficient solution.

I would prefer a solution to this problem in base R, but I would also appreciate a solution in dplyr!

Thanks so much!

Upvotes: 0

Views: 217

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 270248

This uses base R, the collapse package and magrittr (the last one for the pipes).

In the code below, we first use xtabs to get a table of counts by cat and year (each row is a year and each column is a category). We then compute cumulative sums of the columns giving another table object. Next convert that to a long form data frame using the table method of as.data.frame, rearrange the order of the columns and sort them in the same order as shown in the question. If the order of the rows and columns is not important the last two lines of code can be omitted.

library(collapse)
library(magrittr)

raw %>%
  xtabs(~ year + cat, .) %>%
  dapply(cumsum) %>%
  as.data.frame %>%
  fselect(year, cat, count = Freq) %>%
  roworder(year, cat)

The first few rows of the result are

   year cat count
1  1900   A     0
2  1900   B     1
3  1900   C     0
4  1900   D     0
5  1900   E     0
6  1901   A     1
7  1901   B     1
8  1901   C     0
9  1901   D     0
10 1901   E     0
11 1902   A     1
12 1902   B     1

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 389275

Use complete to create every cat value in each year, for each cat get the cumulative sum of count.

library(dplyr)
library(tidyr)

raw %>%
  mutate(count = 1) %>%
  complete(year, cat, fill = list(count = 0)) %>%
  group_by(cat) %>%
  mutate(count = cumsum(count)) %>%
  ungroup

# A tibble: 605 x 3
#    year cat   count
#   <int> <chr> <dbl>
# 1  1900 A         0
# 2  1900 B         1
# 3  1900 C         0
# 4  1900 D         0
# 5  1900 E         0
# 6  1901 A         1
# 7  1901 B         1
# 8  1901 C         0
# 9  1901 D         0
#10  1901 E         0
# … with 595 more rows

In base R, you can do this as :

dat <- expand.grid(year = unique(raw$year), cat = unique(raw$cat))
raw$count <- 1
dat1 <- merge(dat, raw, all.x = TRUE)
dat1$count[is.na(dat1$count)] <- 0
dat1 <- transform(dat1, count = ave(count, cat, FUN = cumsum))

Upvotes: 0

Related Questions