Summarize temporal information into a dynamic matrix

Question

I have a data frame like df:

id <- c("A", "A", "A", "A", "B", "B", "B", "C", "C", "D", "D", "E")
year <- c("2005", "2006", "2007", "2008", "2005", "2006", "2007", "2005", "2007", "2006", "2007", "2008")
value <- 1:12
df <- data.frame(id, year, value)

I want to convert df into a matrix id_observed where columns count how many id's are observed for the first time, by year. Rows count how many ids "survive" to the consecutive year:

id_observed <- matrix(c(3,2,3,1,0,1,1,0,0,0,0,0,0,0,0,1), nrow = 4, ncol = 4)
#First observed id's (by columns), consecutive id's observations (by rows)
colnames(id_observed) <- c("2005", "2006", "2007", "2008")
rownames(id_observed) <- c("2005", "2006", "2007", "2008")
id_observed

The same idea applies to generate matrix value_observed taking the information from value. Where columns count the aggregated value of id's that are observed for the first time, by year. Rows count the aggregated value of the ids that "survived" to the consecutive year:

value_observed <- matrix(c(14,8,19,4,0,10,11,0,0,0,0,0,0,0,0,12), nrow = 4, ncol = 4)
#First observed value (by columns), consecutive value's observations (by rows)
colnames(value_observed) <- c("2005", "2006", "2007", "2008")
rownames(value_observed) <- c("2005", "2006", "2007", "2008")
value_observed

Any clue on how to build matrices id_observed, and value_observed in an automatic way?

langtang · Accepted Answer

You can create this function, get_matrix(), which leverages tidyverse approach to loop over unique years, creating the data for each year, binding the rows, and then pivoting wider

library(tidyverse)

get_matrix <- function(df, type=c("value","id")) {
  res = lapply(unique(df$year), function(y) {
    d = df %>% group_by(id) %>% filter(min(year)==y) %>% group_by(year)
    if(type == "value") d = summarize(d,n=sum(value))
    else d = summarize(d,n=n())
    d = mutate(d,y=y)
    if(nrow(d)==0) return(tibble(year=y, n=0, y=y)) else return(d)
  })
  bind_rows(res) %>% 
    pivot_wider(id_cols = year,names_from = y,values_from = n,values_fill = 0)    
}

Usage

get_matrix(df, type="value")

Output

  year  `2005` `2006` `2007` `2008`
          
1 2005      14      0      0      0
2 2006       8     10      0      0
3 2007      19     11      0      0
4 2008       4      0      0     12

Usage

get_matrix(df, type="id")

Output

  year  `2005` `2006` `2007` `2008`
          
1 2005       3      0      0      0
2 2006       2      1      0      0
3 2007       3      1      0      0
4 2008       1      0      0      1

Update:

data.table option

setDT(df)[, year:=as.integer(year)]
syears = unique(df$year)
df = df[, y:=min(year), by = id][, .SD[,.N, year], by=y]
dcast(
  rbind(df,data.table(y=setdiff(syears, unique(df$y)))[,`:=`(year=y,N=0)]),
  year~y, value.var="N"
)

Output:

    year  2005  2006  2007  2008
       
1:  2005     3    NA    NA    NA
2:  2006     2     1    NA    NA
3:  2007     3     1     0    NA
4:  2008     1    NA    NA     1

Summarize temporal information into a dynamic matrix

Answers (1)

Update:

Related Questions