Reputation: 7435
I have two data.frames
; (1) df1
has year, state, and yield, and (2) df2
has specific weights for each state, but in different year intervals.
I need to merge df1
with df2
where the w
variable is merged to fill in the missing years in df2
.
For clarification, the years in df1
between 1910 and 1919, use variable w
in df2
for each state and year 1910, and for years in 1920 and 1921, use the variable w
for each state and year 1920. Because there is missing data in df2
that doesn't match with df2
I want to use the years between the two dates to get the variable w
. Hope this is clear.
Sample Data:
df1
df1 <- structure(list(year = c(1910L, 1910L, 1910L, 1910L, 1910L, 1911L,
1911L, 1911L, 1911L, 1911L, 1919L, 1920L, 1920L, 1920L, 1920L,
1920L, 1921L, 1921L, 1921L, 1921L, 1921L), state = c("colorado",
"kansas", "new mexico", "oklahoma", "texas", "colorado", "kansas",
"new mexico", "oklahoma", "texas", "texas", "colorado", "kansas",
"new mexico", "oklahoma", "texas", "colorado", "kansas", "new mexico",
"oklahoma", "texas"), acre_yield = c("15.5", "19", "15", "16",
"22", "14", "14.5", "19.5", "7", "11", "23", "18.5", "26.2",
"20", "26", "20", "12", "22.8", "19.5", "23", "18")), .Names = c("year",
"state", "acre_yield"), row.names = c(1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 10L, 50L, 51L, 52L, 53L, 54L, 55L, 56L, 57L, 58L,
59L, 60L), class = "data.frame")
df2
df2 <- structure(list(year = c(1910L, 1910L, 1910L, 1910L, 1910L, 1920L,
1920L, 1920L, 1920L, 1920L), state = c("colorado", "kansas",
"new mexico", "oklahoma", "texas", "colorado", "kansas", "new mexico",
"oklahoma", "texas"), w = c(0.117773613611233, 0.332027298270738,
0.0176064421992724, 0.492169193923849, 0.0404234519949076, 0.305574486110184,
0.32107131682438, 0.0583601411807103, 0.264145354274187, 0.0508487016105393
)), .Names = c("year", "state", "w"), class = c("tbl_df", "data.frame"
), row.names = c(NA, -10L))
Desired Output:
year state acre_yield w
1 1910 colorado 15.5 0.11777
2 1910 kansas 19 0.33202
3 1910 new mexico 15 0.01761
4 1910 oklahoma 16 0.49217
5 1910 texas 22 0.04042
6 1911 colorado 14 0.11777
7 1911 kansas 14.5 0.33202
8 1911 new mexico 19.5 0.01761
9 1911 oklahoma 7 0.49217
10 1911 texas 11 0.04042
50 1919 texas 23 0.04042
51 1920 colorado 18.5 0.30557
52 1920 kansas 26.2 0.32107
53 1920 new mexico 20 0.05836
54 1920 oklahoma 26 0.26414
55 1920 texas 20 0.05084
56 1921 colorado 12 0.30557
57 1921 kansas 22.8 0.32107
58 1921 new mexico 19.5 0.05836
59 1921 oklahoma 23 0.26414
60 1921 texas 18 0.05084
Upvotes: 2
Views: 387
Reputation: 118809
Using rolling joins from data.table
:
require(data.table)
dt1[, w := dt2[dt1, w, on=c("state", "year"), roll=Inf, rollends=TRUE]]
where dt1
and dt2
are data.tables corresponding to df1
and df2
respectively.
dt2[dt1, w, on=c("state", "year"), roll=Inf, rollends=TRUE]
extracts dt2$w
for each matching row of dt1
corresponding to columns state,year
. If there's no match, the last matching values is retrieved. This is known as last observation carried forward (locf) join.
Upvotes: 1
Reputation: 43344
One way, with dplyr
:
library(dplyr)
df3 <- df1 %>% filter(year < 1920) %>%
left_join(filter(df2, year == 1910) %>% select(-year))
df3 <- df1 %>% filter(year >= 1920) %>%
left_join(filter(df2, year == 1920) %>% select(-year)) %>%
bind_rows(df3) %>%
arrange(year, state)
It's split into two chains, one that just joins the pre-1920 data, the other which does the post-1920, joins the two, and sorts.
To split the years into 5-year increments and join on df2
values in those increments:
df1$year_factor <- cut(df1$year, seq(1900, 1950, 5), right = FALSE)
df2$year_factor <- cut(df2$year, seq(1900, 1950, 5), right = FALSE)
df3 <- df1 %>% left_join(select(df2, -year)) %>% select(-year_factor)
This is actually simpler, but it introduces (and removes) a dummy variable, and cut
can be a little finicky; play with it as you like. It produces:
year state acre_yield w
1 1910 colorado 15.5 0.11777361
2 1910 kansas 19 0.33202730
3 1910 new mexico 15 0.01760644
4 1910 oklahoma 16 0.49216919
5 1910 texas 22 0.04042345
6 1911 colorado 14 0.11777361
7 1911 kansas 14.5 0.33202730
8 1911 new mexico 19.5 0.01760644
9 1911 oklahoma 7 0.49216919
10 1911 texas 11 0.04042345
11 1919 texas 23 NA
12 1920 colorado 18.5 0.30557449
13 1920 kansas 26.2 0.32107132
14 1920 new mexico 20 0.05836014
15 1920 oklahoma 26 0.26414535
16 1920 texas 20 0.05084870
17 1921 colorado 12 0.30557449
18 1921 kansas 22.8 0.32107132
19 1921 new mexico 19.5 0.05836014
20 1921 oklahoma 23 0.26414535
21 1921 texas 18 0.05084870
Note the one NA
value for the 1919 row; since df2
doesn't have any values between 1915 and 1919, there's nothing to insert. To go by decades, change the 5
in seq
to 10
, or otherwise set as you prefer.
Upvotes: 1
Reputation: 6778
Here's one way to do it with base R using apply
:
df1$w <- apply(df1, 1, function(row) {
idx <- which(df2$state == row['state'] & df2$year <= row['year'])
idx <- max(idx) # want the max year that matches
return(df2$w[idx])
})
df1
# year state acre_yield w
# 1 1910 colorado 15.5 0.11777361
# 2 1910 kansas 19 0.33202730
# 3 1910 new mexico 15 0.01760644
# 4 1910 oklahoma 16 0.49216919
# 5 1910 texas 22 0.04042345
# 6 1911 colorado 14 0.11777361
# 7 1911 kansas 14.5 0.33202730
# 8 1911 new mexico 19.5 0.01760644
# 9 1911 oklahoma 7 0.49216919
# 10 1911 texas 11 0.04042345
# 50 1919 texas 23 0.04042345
# 51 1920 colorado 18.5 0.30557449
# 52 1920 kansas 26.2 0.32107132
# 53 1920 new mexico 20 0.05836014
# 54 1920 oklahoma 26 0.26414535
# 55 1920 texas 20 0.05084870
# 56 1921 colorado 12 0.30557449
# 57 1921 kansas 22.8 0.32107132
# 58 1921 new mexico 19.5 0.05836014
# 59 1921 oklahoma 23 0.26414535
# 60 1921 texas 18 0.05084870
I can't promise that this is the most R-fficient way to do this, but it's the first thing that came to mind.
Upvotes: 1