Vedda
Vedda

Reputation: 7435

Merge two data frames to fill in missing dates

I have two data.frames; (1) df1 has year, state, and yield, and (2) df2 has specific weights for each state, but in different year intervals.

I need to merge df1 with df2 where the w variable is merged to fill in the missing years in df2.

For clarification, the years in df1 between 1910 and 1919, use variable w in df2 for each state and year 1910, and for years in 1920 and 1921, use the variable w for each state and year 1920. Because there is missing data in df2 that doesn't match with df2 I want to use the years between the two dates to get the variable w. Hope this is clear.

Sample Data:

df1

df1 <- structure(list(year = c(1910L, 1910L, 1910L, 1910L, 1910L, 1911L, 
1911L, 1911L, 1911L, 1911L, 1919L, 1920L, 1920L, 1920L, 1920L, 
1920L, 1921L, 1921L, 1921L, 1921L, 1921L), state = c("colorado", 
"kansas", "new mexico", "oklahoma", "texas", "colorado", "kansas", 
"new mexico", "oklahoma", "texas", "texas", "colorado", "kansas", 
"new mexico", "oklahoma", "texas", "colorado", "kansas", "new mexico", 
"oklahoma", "texas"), acre_yield = c("15.5", "19", "15", "16", 
"22", "14", "14.5", "19.5", "7", "11", "23", "18.5", "26.2", 
"20", "26", "20", "12", "22.8", "19.5", "23", "18")), .Names = c("year", 
"state", "acre_yield"), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 
7L, 8L, 9L, 10L, 50L, 51L, 52L, 53L, 54L, 55L, 56L, 57L, 58L, 
59L, 60L), class = "data.frame")

df2

    df2 <- structure(list(year = c(1910L, 1910L, 1910L, 1910L, 1910L, 1920L, 
1920L, 1920L, 1920L, 1920L), state = c("colorado", "kansas", 
"new mexico", "oklahoma", "texas", "colorado", "kansas", "new mexico", 
"oklahoma", "texas"), w = c(0.117773613611233, 0.332027298270738, 
0.0176064421992724, 0.492169193923849, 0.0404234519949076, 0.305574486110184, 
0.32107131682438, 0.0583601411807103, 0.264145354274187, 0.0508487016105393
)), .Names = c("year", "state", "w"), class = c("tbl_df", "data.frame"
), row.names = c(NA, -10L))

Desired Output:

   year      state acre_yield    w
1  1910   colorado       15.5    0.11777
2  1910     kansas         19    0.33202
3  1910 new mexico         15    0.01761
4  1910   oklahoma         16    0.49217
5  1910      texas         22    0.04042
6  1911   colorado         14    0.11777
7  1911     kansas       14.5    0.33202
8  1911 new mexico       19.5    0.01761
9  1911   oklahoma          7    0.49217
10 1911      texas         11    0.04042
50 1919      texas         23    0.04042
51 1920   colorado       18.5    0.30557
52 1920     kansas       26.2    0.32107
53 1920 new mexico         20    0.05836
54 1920   oklahoma         26    0.26414
55 1920      texas         20    0.05084
56 1921   colorado         12    0.30557
57 1921     kansas       22.8    0.32107
58 1921 new mexico       19.5    0.05836
59 1921   oklahoma         23    0.26414
60 1921      texas         18    0.05084

Upvotes: 2

Views: 387

Answers (3)

Arun
Arun

Reputation: 118809

Using rolling joins from data.table:

require(data.table)
dt1[, w := dt2[dt1, w, on=c("state", "year"), roll=Inf, rollends=TRUE]]

where dt1 and dt2 are data.tables corresponding to df1 and df2 respectively.

dt2[dt1, w, on=c("state", "year"), roll=Inf, rollends=TRUE] extracts dt2$w for each matching row of dt1 corresponding to columns state,year. If there's no match, the last matching values is retrieved. This is known as last observation carried forward (locf) join.

Upvotes: 1

alistaire
alistaire

Reputation: 43344

One way, with dplyr:

library(dplyr)
df3 <- df1 %>% filter(year < 1920) %>% 
               left_join(filter(df2, year == 1910) %>% select(-year))
df3 <- df1 %>% filter(year >= 1920) %>% 
               left_join(filter(df2, year == 1920) %>% select(-year)) %>% 
               bind_rows(df3) %>% 
               arrange(year, state)

It's split into two chains, one that just joins the pre-1920 data, the other which does the post-1920, joins the two, and sorts.


Update based on comments:

To split the years into 5-year increments and join on df2 values in those increments:

df1$year_factor <- cut(df1$year, seq(1900, 1950, 5), right = FALSE)
df2$year_factor <- cut(df2$year, seq(1900, 1950, 5), right = FALSE)
df3 <- df1 %>% left_join(select(df2, -year)) %>% select(-year_factor)

This is actually simpler, but it introduces (and removes) a dummy variable, and cut can be a little finicky; play with it as you like. It produces:

   year      state acre_yield          w
1  1910   colorado       15.5 0.11777361
2  1910     kansas         19 0.33202730
3  1910 new mexico         15 0.01760644
4  1910   oklahoma         16 0.49216919
5  1910      texas         22 0.04042345
6  1911   colorado         14 0.11777361
7  1911     kansas       14.5 0.33202730
8  1911 new mexico       19.5 0.01760644
9  1911   oklahoma          7 0.49216919
10 1911      texas         11 0.04042345
11 1919      texas         23         NA
12 1920   colorado       18.5 0.30557449
13 1920     kansas       26.2 0.32107132
14 1920 new mexico         20 0.05836014
15 1920   oklahoma         26 0.26414535
16 1920      texas         20 0.05084870
17 1921   colorado         12 0.30557449
18 1921     kansas       22.8 0.32107132
19 1921 new mexico       19.5 0.05836014
20 1921   oklahoma         23 0.26414535
21 1921      texas         18 0.05084870

Note the one NA value for the 1919 row; since df2 doesn't have any values between 1915 and 1919, there's nothing to insert. To go by decades, change the 5 in seq to 10, or otherwise set as you prefer.

Upvotes: 1

tblznbits
tblznbits

Reputation: 6778

Here's one way to do it with base R using apply:

df1$w <- apply(df1, 1, function(row) {
    idx <- which(df2$state == row['state'] & df2$year <= row['year'])
    idx <- max(idx) # want the max year that matches
    return(df2$w[idx])
})
df1
#    year      state acre_yield          w
# 1  1910   colorado       15.5 0.11777361
# 2  1910     kansas         19 0.33202730
# 3  1910 new mexico         15 0.01760644
# 4  1910   oklahoma         16 0.49216919
# 5  1910      texas         22 0.04042345
# 6  1911   colorado         14 0.11777361
# 7  1911     kansas       14.5 0.33202730
# 8  1911 new mexico       19.5 0.01760644
# 9  1911   oklahoma          7 0.49216919
# 10 1911      texas         11 0.04042345
# 50 1919      texas         23 0.04042345
# 51 1920   colorado       18.5 0.30557449
# 52 1920     kansas       26.2 0.32107132
# 53 1920 new mexico         20 0.05836014
# 54 1920   oklahoma         26 0.26414535
# 55 1920      texas         20 0.05084870
# 56 1921   colorado         12 0.30557449
# 57 1921     kansas       22.8 0.32107132
# 58 1921 new mexico       19.5 0.05836014
# 59 1921   oklahoma         23 0.26414535
# 60 1921      texas         18 0.05084870

I can't promise that this is the most R-fficient way to do this, but it's the first thing that came to mind.

Upvotes: 1

Related Questions