AntVal
AntVal

Reputation: 665

Join two datasets and fill information for time intervals in r

I have two datasets that look something like this:

country <- c("Albania","Albania","Albania","Albania","Albania",
             "Belgium","Belgium","Belgium","Belgium","Belgium",
             "Canada","Canada","Canada","Canada","Canada",
             "Denmark","Denmark","Denmark","Denmark","Denmark")
year <- c(1992, 1993, 1994, 1995, 1996, 1992, 1993, 1994, 1995, 1996,1992, 1993, 1994, 1995, 1996,1992, 1993, 1994, 1995, 1996)
country.year <- data.frame(country, year)

    country.year

   country year
1  Albania 1992
2  Albania 1993
3  Albania 1994
4  Albania 1995
5  Albania 1996
6  Belgium 1992
7  Belgium 1993
8  Belgium 1994
9  Belgium 1995
10 Belgium 1996
11  Canada 1992
12  Canada 1993
13  Canada 1994
14  Canada 1995
15  Canada 1996
16 Denmark 1992
17 Denmark 1993
18 Denmark 1994
19 Denmark 1995
20 Denmark 1996
country <- c("Albania","Albania",
             "Belgium","Belgium",
             "Canada","Canada",
             "Denmark","Denmark","Denmark")
cabinet <- c(1200, 1201,
             1560, 1566,
             220, 440,
             880, 819, 870)
cabinet.position2 <- c(12,10,
                       0, 5,
                       -9, 2,
                       1,-15)
begining.date <- c("1991-12-01", "1996-01-10",
                   "1991-05-07", "1995-04-23",
                   "1992-01-01", "1996-01-01",
                   "1991-08-03", "1992-07-01", "1996-06-01")
end.date <- c("1996-01-09", "2000-02-01",
                   "1995-04-01", "1999-04-23",
                   "1995-09-01", "1999-11-30",
                   "1992-02-03", "1996-05-20", "2000-04-01")
cabinets <- data.frame(country, cabinet, begining.date, end.date)
> cabinets
  country cabinet begining.date   end.date
1 Albania    1200    1991-12-01 1996-01-09
2 Albania    1201    1996-01-10 2000-02-01
3 Belgium    1560    1991-05-07 1995-04-01
4 Belgium    1566    1995-04-23 1999-04-23
5  Canada     220    1992-01-01 1995-09-01
6  Canada     440    1996-01-01 1999-11-30
7 Denmark     880    1991-08-03 1992-02-03
8 Denmark     819    1992-07-01 1996-05-20
9 Denmark     870    1996-06-01 2000-04-01

What I want to have is a dataset where the unit of analysis is country*year as in the dataframe "country.year" but includes the position variable for each cabinet from dataframe "cabinets". This position variable concerns a cabinet's policy position, so it really isn't relevant for the data transformation task, but important for later on. So something like this:

country <- c("Albania","Albania","Albania","Albania","Albania",
             "Belgium","Belgium","Belgium","Belgium","Belgium",
             "Canada","Canada","Canada","Canada","Canada",
             "Denmark","Denmark","Denmark","Denmark","Denmark")
year2 <- c(1992, 1993, 1994, 1995, 1996,
           1992, 1993, 1994, 1995, 1996,
           1992, 1993, 1994, 1995, 1996,
           1992, 1993, 1994, 1995, 1996)
cabinet2 <- c(1200,1200,1200,1200, 1201,
             1560,1560,1560, 1566, 1566,
             220,220,220,220, 440,
             819, 819, 819, 819, 870)
cabinet.position2 <- c(12,12,12,12, 10,
              0,0,0, 5, 5,
              -9,-9,-9,-9, 2,
              1, 1, 1, 1, -15)
desired.df <- data.frame(country, year2, cabinet2,cabinet.position2)
desired.df
   country year2 cabinet2 cabinet.position2
1  Albania  1992     1200                12
2  Albania  1993     1200                12
3  Albania  1994     1200                12
4  Albania  1995     1200                12
5  Albania  1996     1201                10
6  Belgium  1992     1560                 0
7  Belgium  1993     1560                 0
8  Belgium  1994     1560                 0
9  Belgium  1995     1566                 5
10 Belgium  1996     1566                 5
11  Canada  1992      220                -9
12  Canada  1993      220                -9
13  Canada  1994      220                -9
14  Canada  1995      220                -9
15  Canada  1996      440                 2
16 Denmark  1992      819                 1
17 Denmark  1993      819                 1
18 Denmark  1994      819                 1
19 Denmark  1995      819                 1
20 Denmark  1996      870               -15

My main issue here is to assign cabinets to different years. As you can see above each year needs to be assigned one cabinet and their position. More importantly, and what is really difficult for me, sometimes a year has multiple cabinets, so I need each year's cabinet to be the one which spent more time as cabinet during that year (for instances, if in year 1995 cabinet A was there from Jan to May, and cabinet B was there from June to December, year 1995 should be assigned with cabinet B).

Any idea?

Thanks a lot!

Upvotes: 1

Views: 93

Answers (3)

chinsoon12
chinsoon12

Reputation: 25225

Here is another option using data.table::foverlaps:

library(data.table)
setDT(country.year)
setDT(cabinets)

#create start date and end date of the year
country.year[, paste0("yr.", c("start", "end")) := lapply(c("-01-01", "-12-31"),
    function(x) as.Date(paste0(year, x), format="%Y-%m-%d"))]

setkey(country.year, country, yr.start, yr.end)
setkey(cabinets, country, beginning.date, end.date)
foverlaps(country.year, cabinets)[, {
        k <- which.max(pmin(end.date, yr.end) - yr.start)
        .(cabinet2=cabinet[k], cabinet.position2=cabinet.position[k])
    }, .(country, year)]

output:

    country year cabinet2 cabinet.position2
 1: Albania 1992     1200                12
 2: Albania 1993     1200                12
 3: Albania 1994     1200                12
 4: Albania 1995     1200                12
 5: Albania 1996     1201                10
 6: Belgium 1992     1560                 0
 7: Belgium 1993     1560                 0
 8: Belgium 1994     1560                 0
 9: Belgium 1995     1566                 5
10: Belgium 1996     1566                 5
11:  Canada 1992      220                -9
12:  Canada 1993      220                -9
13:  Canada 1994      220                -9
14:  Canada 1995      220                -9
15:  Canada 1996      440                 2
16: Denmark 1992      819                 1
17: Denmark 1993      819                 1
18: Denmark 1994      819                 1
19: Denmark 1995      819                 1
20: Denmark 1996      870               -15

data (with date conversion, Ian Campbell's data fix and minor typo in the word beginning):

country <- c("Albania","Albania","Albania","Albania","Albania","Belgium","Belgium","Belgium","Belgium","Belgium","Canada","Canada","Canada","Canada","Canada","Denmark","Denmark","Denmark","Denmark","Denmark")
year <- c(1992, 1993, 1994, 1995, 1996, 1992, 1993, 1994, 1995, 1996,1992, 1993, 1994, 1995, 1996,1992, 1993, 1994, 1995, 1996)
country.year <- data.frame(country, year)

country <- c("Albania","Albania","Belgium","Belgium","Canada","Canada","Denmark","Denmark","Denmark")
cabinet <- c(1200, 1201, 1560, 1566, 220, 440, 880, 819, 870)
cabinet.position <- c(12, 10, 0, 5, -9, 2, NA, 1,-15)
beginning.date <- as.Date(c("1991-12-01", "1996-01-10","1991-05-07", "1995-04-23","1992-01-01", "1996-01-01","1991-08-03", "1992-07-01", "1996-06-01"))
end.date <- as.Date(c("1996-01-09", "2000-02-01","1995-04-01", "1999-04-23","1995-09-01", "1999-11-30","1992-02-03", "1996-05-20", "2000-04-01"))
cabinets <- data.frame(country, cabinet, cabinet.position, beginning.date, end.date)

Upvotes: 1

flafont11
flafont11

Reputation: 137

Edit: New version that includes the merge and creates a new variable to calculate time spent in office, after I reread the question (my bad) and OP's clarification on what cabinet position means.

TidyR solution involving non-equal join.

library(dplyr)
library(fuzzyjoin)
library(lubridate)

# putting data as Date
country.year <- country.year %>%
  mutate(year = paste0(year,"/01","/01"),
         year = as.Date(year, format = "%Y/%m/%d")) 
cabinets <- cabinets %>%
  mutate(begining.date = as.Date(begining.date),
         end.date = as.Date(end.date))

desired.df <- fuzzy_inner_join(country.year,cabinets,
                                    by=c("country"="country",
                                         "year"="begining.date",
                                         "year"="end.date"),
                                    match_fun = list(`==`, `>=`, `<=`))%>%
  select(country=country.x,everything())%>%
  mutate(year=str_sub(year,1,4),
         time.as.cabinet = end.date - begining.date)%>%
  group_by(country,year)%>%
  filter(time.as.cabinet==max(time.as.cabinet)) %>%
  select(country,year,cabinet,cabinet.position2, -country.y)

desired.df %>%
  head(10)
  country year  cabinet cabinet.position2
   <fct>   <chr>   <dbl>             <dbl>
 1 Albania 1992     1200                12
 2 Albania 1993     1200                12
 3 Albania 1994     1200                12
 4 Albania 1995     1200                12
 5 Albania 1996     1200                12
 6 Belgium 1992     1560                 0
 7 Belgium 1993     1560                 0
 8 Belgium 1994     1560                 0
 9 Belgium 1995     1560                 0
10 Belgium 1996     1566                 5

Upvotes: 1

David Arenburg
David Arenburg

Reputation: 92282

Using data.table, you can simultaneously do non-equal joins, calculate new variables and update the data in place in very fast manner. Here is an option

### Load data.table and convert the data.frames
library(data.table)
setDT(country.year) ; setDT(cabinets)

### Convert date columns to proper dates and create join columns 
date_cols <- grep("date", names(cabinets), value = TRUE)
cabinets[, (date_cols) := lapply(.SD, as.IDate), .SDcols = date_cols]
cabinets[, paste0(c("start", "end"), "_year") := lapply(.SD, year), .SDcols = date_cols]

### Join by year intervals, while calculating the larget time period and updating the data in place
country.year[
             , cabinet.position2 :=
               cabinets[.SD, 
                        cabinet.position2[which.max(end.date - as.IDate(paste0(year, "-01-01")))] 
                        , on = .(country, start_year <= year, end_year >= year)
                        , by = .EACHI]$V1
             ]


country.year
#     country year cabinet.position2
#  1: Albania 1992                12
#  2: Albania 1993                12
#  3: Albania 1994                12
#  4: Albania 1995                12
#  5: Albania 1996                10
#  6: Belgium 1992                 0
#  7: Belgium 1993                 0
#  8: Belgium 1994                 0
#  9: Belgium 1995                 5
# 10: Belgium 1996                 5
# 11:  Canada 1992                -9
# 12:  Canada 1993                -9
# 13:  Canada 1994                -9
# 14:  Canada 1995                -9
# 15:  Canada 1996                 2
# 16: Denmark 1992                 1
# 17: Denmark 1993                 1
# 18: Denmark 1994                 1
# 19: Denmark 1995                 1
# 20: Denmark 1996               -15

Upvotes: 2

Related Questions