X.Jun
X.Jun

Reputation: 390

Sum the number of firms in one year

Suppose I have a dataframe as follows:

dt=structure(list(id = c(1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 4L, 4L, 
4L, 4L, 5L, 5L, 6L, 6L), year = c(2001L, 2002L, 2003L, 2004L, 
2002L, 2002L, 2003L, 2004L, 2002L, 2003L, 2004L, 2005L, 2001L, 
2002L, 2001L, 2002L)), .Names = c("firm", "year"), row.names = c(NA, 
-16L), class = "data.frame")

dt
 firm year
1   1 2001
2   1 2002
3   1 2003
4   1 2004
5   2 2002
6   3 2002
7   3 2003
8   3 2004
9   4 2002
10  4 2003
11  4 2004
12  4 2005
13  5 2001
14  5 2002
15  6 2001
16  6 2002

Now, I hope to sum the number of firms which have exited the market in one year. For example, I want a table like this:

 resulttable
     All 2001 2002 2003 2004 2005
2001   3    0    2    0    1    0
2002   3    0    1    0    1    1

The first row of resulttable means that 3 firms entered the market in year 2001 and 2 firms exited in 2003, 1 firm exited in year 2004. Thanks!

Upvotes: 3

Views: 110

Answers (4)

Uwe
Uwe

Reputation: 42564

Here is a slightly different approach using data.table which creates the totals before reshaping from long to wide format:

library(data.table)
setDT(dt)[, .(entry = min(year), exit = max(year)), by = firm][, All := .N, by = entry][
  , dcast(.SD, entry + All ~ exit, length, value.var = "firm")]
   entry All 2002 2004 2005
1:  2001   3    2    1    0
2:  2002   3    1    1    1

This is already conveying all essential results which the OP verbally described in the question.


However, OP's expected result includes columns for the years 2001 and 2003 although they contain only 0. If it is required to show years without entries or exits, this can be accomplished by completing the missing years before computing the totals All and reshaping:

setDT(dt)[, .(entry = min(year), exit = max(year)), by = firm][
  CJ(entry = dt$year, exit = dt$year, unique = TRUE), on = .(entry, exit)][
    , All := sum(!is.na(firm)), by = entry][][
      , dcast(.SD, entry + All ~ exit, function(x) (sum(!is.na(x))), value.var = "firm")]
   entry All 2001 2002 2003 2004 2005
1:  2001   3    0    2    0    1    0
2:  2002   3    0    1    0    1    1
3:  2003   0    0    0    0    0    0
4:  2004   0    0    0    0    0    0
5:  2005   0    0    0    0    0    0

The missing years are completed by joining with a table of all available combinations of year which is created by a cross join CJ(). The completion introduces a lot of NA values in firm, therefore length(firm) has to be replaced by sum(!is.na(firm)) as aggregation function.


The extension of the resulting wide format can be controlled by the range of years given in CJ(). E.g., the empty entry years 2003 to 2005 can be dropped by

setDT(dt)[, .(entry = min(year), exit = max(year)), by = firm][
  CJ(entry = min(entry):max(entry), exit = dt$year, unique = TRUE), on = .(entry, exit)][
    , All := sum(!is.na(firm)), by = entry][][
      , dcast(.SD, entry + All ~ exit, function(x) (sum(!is.na(x))), value.var = "firm")]
   entry All 2001 2002 2003 2004 2005
1:  2001   3    0    2    0    1    0
2:  2002   3    0    1    0    1    1

which reproduces exactly OP's expected resulttable.

Or, drop also the empty exit years 2001 and 2005 by

setDT(dt)[, .(entry = min(year), exit = max(year)), by = firm][
  CJ(entry = min(entry):max(entry), exit = min(exit):max(exit)), on = .(entry, exit)][
    , All := sum(!is.na(firm)), by = entry][][
      , dcast(.SD, entry + All ~ exit, function(x) (sum(!is.na(x))), value.var = "firm")]
   entry All 2002 2003 2004 2005
1:  2001   3    2    0    1    0
2:  2002   3    1    0    1    1

Upvotes: 1

mt1022
mt1022

Reputation: 17299

You can tabulate "enter" years and "exit" years with table:

res <- table(
    dt$year[!duplicated(dt$firm)],
    factor(dt$year[!duplicated(dt$firm, fromLast = TRUE)], levels = unique(dt$year))
)
res <- as.data.frame.matrix(res)
res$All <- rowSums(res)

# > res
#      2001 2002 2003 2004 2005 All
# 2001    0    2    0    1    0   3
# 2002    0    1    0    1    1   3

I assumed the dt is sorted as provided. If not, one has to sort by year first.


Here is the suggested way from thelatemail in comment, the results look like this:

addmargins(table(
    dt$year[!duplicated(dt$firm)],
    factor(dt$year[!duplicated(dt$firm, fromLast = TRUE)], levels = unique(dt$year))
), 2)

#      2001 2002 2003 2004 2005 Sum
# 2001    0    2    0    1    0   3
# 2002    0    1    0    1    1   3

Upvotes: 4

akrun
akrun

Reputation: 887571

Here is an option using dcast from data.table. Convert the 'data.frame' to 'data.table' (setDT(dt)), grouped by 'firm' get the range of 'year' into two columns, dcast into 'wide' with drop = FALSE to avoid removing the unused levels and then sum the values in the row with Reduce

library(data.table)
dcast(setDT(dt)[, as.list(range(year)), firm], V1 ~ factor(V2, levels = unique(dt$year)), 
          drop =FALSE)[, All := Reduce(`+` , .SD), .SDcols = -1][]
#      V1 2001 2002 2003 2004 2005 All
#1: 2001    0    2    0    1    0   3
#2: 2002    0    1    0    1    1   3

Upvotes: 4

Mark
Mark

Reputation: 4537

This isn't a complete solution as the results do not include the missing 'exited' years. Including them is possible, but a lot of extra steps. Using two libraries, dplyr and tidyr we can walk through the process.

library(dplyr)
library(tidyr)
dt %>% 
  group_by(firm) %>% 
  summarise(entered=min(year),exited=max(year),count=1) %>% 
  group_by(entered,exited) %>% 
  summarise(count=sum(count)) %>%
  mutate(All = sum(count)) %>% 
  ungroup() %>% 
  spread(exited,count,fill=0)

> # A tibble: 2 x 5
>   entered   All `2002` `2004` `2005`
> *   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
> 1    2001     3      2      1      0
> 2    2002     3      1      1      1
  • group_by indicates we want to operate within firm
  • summarise will calculate values once per group, here we get entered, exited and we produce a counting variable count
  • Now we group by entered and exited (the order is important), so we're grouping on the cross of both years
  • We now sum our counting variable by the combination of years. summarise here drops the right most level of grouping
  • mutate creates a new variable, All in this case which looks just like the summarise but instead of collapsing rows, it computes it within our group, duplicating for duplicate rows.
  • ungroup removes residual grouping
  • spread creates a column for each value in our key, populating it with the specified value column, filling in missing values with 0

Upvotes: 3

Related Questions