Reputation: 390
Suppose I have a dataframe as follows:
dt=structure(list(id = c(1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 4L, 4L,
4L, 4L, 5L, 5L, 6L, 6L), year = c(2001L, 2002L, 2003L, 2004L,
2002L, 2002L, 2003L, 2004L, 2002L, 2003L, 2004L, 2005L, 2001L,
2002L, 2001L, 2002L)), .Names = c("firm", "year"), row.names = c(NA,
-16L), class = "data.frame")
dt
firm year
1 1 2001
2 1 2002
3 1 2003
4 1 2004
5 2 2002
6 3 2002
7 3 2003
8 3 2004
9 4 2002
10 4 2003
11 4 2004
12 4 2005
13 5 2001
14 5 2002
15 6 2001
16 6 2002
Now, I hope to sum the number of firms which have exited the market in one year. For example, I want a table like this:
resulttable
All 2001 2002 2003 2004 2005
2001 3 0 2 0 1 0
2002 3 0 1 0 1 1
The first row of resulttable means that 3 firms entered the market in year 2001 and 2 firms exited in 2003, 1 firm exited in year 2004. Thanks!
Upvotes: 3
Views: 110
Reputation: 42564
Here is a slightly different approach using data.table
which creates the totals before reshaping from long to wide format:
library(data.table)
setDT(dt)[, .(entry = min(year), exit = max(year)), by = firm][, All := .N, by = entry][
, dcast(.SD, entry + All ~ exit, length, value.var = "firm")]
entry All 2002 2004 2005 1: 2001 3 2 1 0 2: 2002 3 1 1 1
This is already conveying all essential results which the OP verbally described in the question.
However, OP's expected result includes columns for the years 2001
and 2003
although they contain only 0. If it is required to show years without entries or exits, this can be accomplished by completing the missing years before computing the totals All
and reshaping:
setDT(dt)[, .(entry = min(year), exit = max(year)), by = firm][
CJ(entry = dt$year, exit = dt$year, unique = TRUE), on = .(entry, exit)][
, All := sum(!is.na(firm)), by = entry][][
, dcast(.SD, entry + All ~ exit, function(x) (sum(!is.na(x))), value.var = "firm")]
entry All 2001 2002 2003 2004 2005 1: 2001 3 0 2 0 1 0 2: 2002 3 0 1 0 1 1 3: 2003 0 0 0 0 0 0 4: 2004 0 0 0 0 0 0 5: 2005 0 0 0 0 0 0
The missing years are completed by joining with a table of all available combinations of year
which is created by a cross join CJ()
. The completion introduces a lot of NA
values in firm
, therefore length(firm)
has to be replaced by sum(!is.na(firm))
as aggregation function.
The extension of the resulting wide format can be controlled by the range of years given in CJ()
. E.g., the empty entry years 2003 to 2005 can be dropped by
setDT(dt)[, .(entry = min(year), exit = max(year)), by = firm][
CJ(entry = min(entry):max(entry), exit = dt$year, unique = TRUE), on = .(entry, exit)][
, All := sum(!is.na(firm)), by = entry][][
, dcast(.SD, entry + All ~ exit, function(x) (sum(!is.na(x))), value.var = "firm")]
entry All 2001 2002 2003 2004 2005 1: 2001 3 0 2 0 1 0 2: 2002 3 0 1 0 1 1
which reproduces exactly OP's expected resulttable
.
Or, drop also the empty exit years 2001 and 2005 by
setDT(dt)[, .(entry = min(year), exit = max(year)), by = firm][
CJ(entry = min(entry):max(entry), exit = min(exit):max(exit)), on = .(entry, exit)][
, All := sum(!is.na(firm)), by = entry][][
, dcast(.SD, entry + All ~ exit, function(x) (sum(!is.na(x))), value.var = "firm")]
entry All 2002 2003 2004 2005 1: 2001 3 2 0 1 0 2: 2002 3 1 0 1 1
Upvotes: 1
Reputation: 17299
You can tabulate "enter" years and "exit" years with table
:
res <- table(
dt$year[!duplicated(dt$firm)],
factor(dt$year[!duplicated(dt$firm, fromLast = TRUE)], levels = unique(dt$year))
)
res <- as.data.frame.matrix(res)
res$All <- rowSums(res)
# > res
# 2001 2002 2003 2004 2005 All
# 2001 0 2 0 1 0 3
# 2002 0 1 0 1 1 3
I assumed the dt
is sorted as provided. If not, one has to sort by year first.
addmargins(table(
dt$year[!duplicated(dt$firm)],
factor(dt$year[!duplicated(dt$firm, fromLast = TRUE)], levels = unique(dt$year))
), 2)
# 2001 2002 2003 2004 2005 Sum
# 2001 0 2 0 1 0 3
# 2002 0 1 0 1 1 3
Upvotes: 4
Reputation: 887571
Here is an option using dcast
from data.table
. Convert the 'data.frame' to 'data.table' (setDT(dt)
), grouped by 'firm' get the range
of 'year' into two columns, dcast
into 'wide' with drop = FALSE
to avoid removing the unused levels and then sum the values in the row with Reduce
library(data.table)
dcast(setDT(dt)[, as.list(range(year)), firm], V1 ~ factor(V2, levels = unique(dt$year)),
drop =FALSE)[, All := Reduce(`+` , .SD), .SDcols = -1][]
# V1 2001 2002 2003 2004 2005 All
#1: 2001 0 2 0 1 0 3
#2: 2002 0 1 0 1 1 3
Upvotes: 4
Reputation: 4537
This isn't a complete solution as the results do not include the missing 'exited' years. Including them is possible, but a lot of extra steps. Using two libraries, dplyr
and tidyr
we can walk through the process.
library(dplyr)
library(tidyr)
dt %>%
group_by(firm) %>%
summarise(entered=min(year),exited=max(year),count=1) %>%
group_by(entered,exited) %>%
summarise(count=sum(count)) %>%
mutate(All = sum(count)) %>%
ungroup() %>%
spread(exited,count,fill=0)
> # A tibble: 2 x 5
> entered All `2002` `2004` `2005`
> * <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 2001 3 2 1 0
> 2 2002 3 1 1 1
group_by
indicates we want to operate within firm
summarise
will calculate values once per group, here we get entered
, exited
and we produce a counting variable count
entered
and exited
(the order is important), so we're grouping on the cross of both yearssummarise
here drops the right most level of groupingmutate
creates a new variable, All
in this case which looks just like the summarise
but instead of collapsing rows, it computes it within our group, duplicating for duplicate rows.ungroup
removes residual groupingspread
creates a column for each value in our key, populating it with the specified value column, filling in missing values with 0Upvotes: 3