Reputation: 885
I have 2 data frames that I would like to merge. In df1
observation were recorded on 2 different days. Each record has an index , id1 person identification number and id2 refers the number of the day that recording was made (day had to be different ).There is also a Day variable that records the week day when the recording was made.
In df2
observations were recorded just based on serial number and id1 person identification number. There is just one observation per person. Similarly here there is also a Day variable that records when the recordings were made.
I would like to identify the observations from df2 that were recorded on the same day as in df1.
How can I do this?
Sample data
df1:
structure(list(index = c(11011202, 11011202, 11011202, 11011202,
11011203, 11011203, 11011207, 11011207, 11011207, 11011207, 11011209,
11011209, 11011209, 11011209, 11011210, 11011210, 11011210, 11011210,
11011211, 11011211, 11011211, 11011211, 11011212, 11011212, 11011212,
11011212, 11011212, 11011212, 11011212, 11011212, 11011213, 11011213,
11011213, 11011213, 11011213, 11011213, 11011217, 11011217, 11011219,
11011219, 11011220, 11011220, 11011220, 11011220, 11011220, 11011220,
11020202, 11020202, 11020202, 11020202), id1 = c(1, 1, 4, 4,
1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2,
2, 3, 3, 4, 4, 1, 1, 3, 3, 4, 4, 1, 1, 1, 1, 1, 1, 2, 2, 3, 3,
1, 1, 2, 2), id2 = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2,
1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1,
2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2), Day = c(5, 1, 5,
1, 1, 3, 4, 7, 4, 7, 4, 1, 4, 1, 5, 7, 5, 7, 1, 2, 1, 2, 7, 2,
7, 2, 7, 2, 7, 2, 7, 4, 7, 4, 7, 4, 4, 1, 3, 1, 1, 2, 1, 2, 1,
2, 4, 7, 4, 7)), row.names = c(NA, -50L), class = c("tbl_df",
"tbl", "data.frame"))
df2:
structure(list(Day = c(3, 3, 4, 6, 6, 6, 7, 7, 7, 7, 4, 4, 6,
6, 6, 4, 3, 7, 7, 5, 5, 7, 5, 6, 6), index = c(11011209, 11011209,
11011210, 11011212, 11011212, 11011213, 11011213, 11011220, 11011220,
11020208, 11020212, 11020212, 11020301, 11020301, 11020301, 11020305,
11020310, 11020315, 11020315, 11020316, 11020316, 11020320, 11020606,
11020611, 11020611), id1 = c(1, 2, 2, 1, 2, 1, 4, 1, 2, 2, 1,
2, 1, 2, 3, 1, 1, 1, 2, 1, 2, 2, 1, 1, 2)), row.names = c(NA,
-25L), class = c("tbl_df", "tbl", "data.frame"))
Upvotes: 1
Views: 88
Reputation: 24770
Edit: Based on your clarifications, I think this is what you wanted. I switched to using tidyverse since your data is stored in tibbles.
library(dplyr)
inner_join(df1,df2,by=c("id1","Day"),suffix=c(".df1",".df2"))
# index.df1 id1 id2 Day index.df2
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 11011202 1 1 5 11020316
# 2 11011202 1 1 5 11020606
# 3 11011203 1 2 3 11011209
# 4 11011203 1 2 3 11020310
# 5 11011207 1 1 4 11020212
# 6 11011207 1 1 4 11020305
# 7 11011207 1 2 7 11011220
# 8 11011207 1 2 7 11020315
# 9 11011207 2 1 4 11011210
#10 11011207 2 1 4 11020212
## … with 43 more rows
Upvotes: 1
Reputation: 886938
Both the datasets have duplicate
by
variables, resulting in duplication of rows after the join
. One option is to nest
by those variables, and then do a join
library(dplyr)
library(tidyr)
df2 %>%
group_by(Day, index) %>%
nest %>%
left_join(df1 %>%
rename(idnew = id1)) %>%
unnest(data)
Upvotes: 2