Reputation: 2096
I am struggling with a data request from a data.frame in R, which contains a table with clubs, their IDs, the year they have been founded, the ID of the district they belong to, and the year they have been closed. The data.frame dat
looks like
club_id founded district_id closed
1 2012 1 NA
2 2012 2 2014
3 2013 2 NA
4 2013 3 2014
...
NA
in the closed
column means that the club still exists in the current year (2016).
What I want is the number of still existing clubs for each year, e.g., this would give me for the data snippet from above:
2012 2013 2014 ...
2 4 2 ...
I tried something using dplyr
dat %>%
group_by(founded) %>%
summarise(clubs_per_year = n_distinct(club_id))
However, this gives me only the number of new clubs founded in each year and not the total number of existing clubs per year.
Upvotes: 1
Views: 438
Reputation: 5532
Here is a dplyr
solution with a different approach
Note: I came up with this a number of days ago, but I found a bug with n_distinct
and reported it; it is now fixed in the latest development version. In an older version of dplyr
I had to use dplyr::n_distinct
but the current version only requires n_distinct
.
This approach creates a data.frame
with year column and modifies it based on the dat
data.frame
(the data from OP)
library(dplyr)
yrdf <- data.frame(year = 2012:2015) # "dat" could be used to create this as well.
## For each year calculate the count based on the data in 'dat'
yrdf %>%
group_by(year) %>%
mutate(count = n_distinct(
dat$club_id[ (is.na(dat$closed) | (dat$closed > year)) & dat$founded <= year]
)
) %>%
ungroup
## year count
## (int) (int)
## 1 2012 2
## 2 2013 4
## 3 2014 2
## 4 2015 2
Upvotes: 0
Reputation: 3964
I tried a full-dplyr solution. The plan of attack is to generate a sequence of active years for each club, then count the club-ids in each active year.
First, we figure out the last active year for each club.
max_year <- 2015
years <- data_frame(
club_id = 1:4,
founded = c(2012, 2012, 2013, 2013),
closed = c(NA, 2014, NA, 2014))
years <- years %>%
mutate(last_active = ifelse(is.na(closed), max_year, closed - 1))
years
#> Source: local data frame [4 x 4]
#>
#> club_id founded closed last_active
#> (int) (dbl) (dbl) (dbl)
#> 1 1 2012 NA 2015
#> 2 2 2012 2014 2013
#> 3 3 2013 NA 2015
#> 4 4 2013 2014 2013
Next, we create another data-frame that contains one row for each active year in each founded
-last_active
range in the data. We do this by using the do
function. do
let's us compute arbitrary functions on data-frames; the only rule is that the function needs to return a data-frame. do
respects grouping variables, so those grouping columns are returned as well.
# Create a single-column data-frame with a sequence of values
seq_df <- function(col_name, min, max) {
data.frame(seq(min, max)) %>% setNames(col_name)
}
year_scheme <- years %>%
# Find each found-last_active pairings
select(founded, last_active) %>%
distinct %>%
# Create a sequence of rows for each of those pairings
group_by(founded, last_active) %>%
do(seq_df("active_year", .$founded, .$last_active)) %>%
ungroup
year_scheme
#> Source: local data frame [10 x 3]
#>
#> founded last_active active_year
#> (dbl) (dbl) (int)
#> 1 2012 2013 2012
#> 2 2012 2013 2013
#> 3 2012 2015 2012
#> 4 2012 2015 2013
#> 5 2012 2015 2014
#> 6 2012 2015 2015
#> 7 2013 2013 2013
#> 8 2013 2015 2013
#> 9 2013 2015 2014
#> 10 2013 2015 2015
Finally, we can join the tables and count the group ids.
full_years <- left_join(years, year_scheme)
#> Joining by: c("founded", "last_active")
full_years
#> Source: local data frame [10 x 5]
#>
#> club_id founded closed last_active active_year
#> (int) (dbl) (dbl) (dbl) (int)
#> 1 1 2012 NA 2015 2012
#> 2 1 2012 NA 2015 2013
#> 3 1 2012 NA 2015 2014
#> 4 1 2012 NA 2015 2015
#> 5 2 2012 2014 2014 2012
#> 6 2 2012 2014 2014 2013
#> 7 3 2013 NA 2015 2013
#> 8 3 2013 NA 2015 2014
#> 9 3 2013 NA 2015 2015
#> 10 4 2013 2014 2014 2013
# years per club
full_years %>% count(club_id)
#> Source: local data frame [4 x 2]
#>
#> club_id n
#> (int) (int)
#> 1 1 4
#> 2 2 2
#> 3 3 3
#> 4 4 1
# clubs per year
full_years %>% count(active_year)
#> Source: local data frame [4 x 2]
#>
#> active_year n
#> (int) (int)
#> 1 2012 2
#> 2 2013 4
#> 3 2014 2
#> 4 2015 2
Upvotes: 1
Reputation: 92300
I'm not sure how to achieve this using dplyr
, but here's a possible data.table
solution. This is basically creates a sequence per club while replacing NA
with the current year and then counts the incidents per year
library(data.table)
setDT(df)[, .(Year = founded:(replace(closed, is.na(closed), year(Sys.Date())) - 1L)),
by = club_id
][, .(Uniques = uniqueN(club_id)), by = Year]
# Year Uniques
# 1: 2012 2
# 2: 2013 4
# 3: 2014 2
# 4: 2015 2
Upvotes: 2
Reputation: 9825
This solution uses dcast
from the data.table
package:
library(data.table)
##Example data
DT <- data.table(club_id=1:4, founded=rep(2012:2013, each=2),
district_id=c(1, 2, 2, 3), closed=rep(c(NA, 2014), 2))
## Fill in NAs with current year, create row for each year the club
## exists, cast to columns for each year, and get the count of clubs
## per year using length function
dcast(DT[, .(year=founded:ifelse(is.na(closed), year(Sys.Date()), closed)),
by=club_id], . ~ year, length, fill=0)
## . 2012 2013 2014 2015 2016
## 1: . 2 4 4 2 2
The version below is similar, but does not count a club in a year unless it was open for the entire year. Beware of the case where a club was opened and closed in the same year. I've added a club that was opened and closed in 2015.
DT2 <- data.table(club_id=1:5, founded=c(rep(2012:2013, each=2), 2015),
district_id=c(1, 2, 2, 3, 3),
closed=c(rep(c(NA, 2014), 2), 2015))
## Fill in missing values with the current year
DT2[, closed2:=ifelse(is.na(closed), year(Sys.Date()), closed)]
## Cast to columns as before, ignore cases where the club's open and
## closed years match, and then subtract one from the closed year
dcast(DT2[founded!=closed2, .(year=founded:(closed2-1)), by=club_id],
. ~ year, length, fill=0)
## . 2012 2013 2014 2015
## 1: . 2 4 2 2
Upvotes: 0