Reputation: 723
I have two data sets, df and df2. This is an oversimplified version of two very large and messy data frames.
In the original df, I created a unique id for each person by grouping them by belt and weight. I want the same id numbers each person has in df to be given to the same people in df2. They need to have the same name, and should be grouped by belt and weight. Note that there are people in df2 who are not in df.
The simplified df looks something like below
belt weight rank id name
1 purple open class 1 55 Tom Cruise
2 black rooster 2 79 Emma Watson
3 blue feather 3 63 John Doe
4 blue feather 4 63 John Doe
5 purple open class 5 55 Tom Cruise
6 brown heavy 6 3 James Bond
7 purple open class 7 55 Tom Cruise
8 purple heavy 8 61 Tom Cruise
9 black open class 9 70 Jane Doe
10 purple heavy 10 61 Tom Cruise
The second data frame looks something like this. A person who is in df2, but not in df should receive an NA for their id. Note that the id's must be given by belt and weight because some people have different points depending on which weight division they competed in
belt2 weight2 rank2 name points
1 purple open class 1 Tom Cruise 100
2 black rooster 2 Emma Watson 30
3 blue feather 3 John Doe 50
4 blue feather 4 John Doe 50
5 purple open class 5 Tom Cruise 100
6 brown heavy 6 James Bond 200
7 black rooster 7 Jon Snow 92
8 purple heavy 8 Tom Cruise 77
9 black open class 9 Jane Doe 88
10 purple heavy 10 Tom Cruise 77
This is what I would like df2 to look like:
belt2 weight2 rank2 id name points
1 purple open class 1 55 Tom Cruise 100
2 black rooster 2 79 Emma Watson 30
3 blue feather 3 63 John Doe 50
4 blue feather 4 63 John Doe 50
5 purple open class 5 55 Tom Cruise 100
6 brown heavy 6 3 James Bond 200
7 black rooster 7 NA Jon Snow 92
8 purple heavy 8 61 Tom Cruise 77
9 black open class 9 70 Jane Doe 88
10 purple heavy 10 61 Tom Cruise 77
Basically, I want the id numbers in df2 to match the id numbers in df. If there is not a match, fill with NA.
# df
belt <- c("purple", "black", "blue", "blue", "purple", "brown", "purple", "purple", "black", "purple")
weight <- c("open class", "rooster", "feather", "feather", "open class", "heavy", "open class", "heavy", "open class", "heavy")
rank <- 1:10
id <- c(55, 79, 63, 63, 55, 3, 55, 61, 70, 61)
names <- c("Tom Cruise", "Emma Watson", "John Doe", "John Doe", "Tom Cruise", "James Bond", "Tom Cruise", "Tom Cruise", "Jane Doe", "Tom Cruise")
(df <- data.frame(belt, weight, rank, id, name = names))
#df2
belt2 <- c("purple", "black", "blue", "blue", "purple", "brown", "black", "purple", "black", "purple")
weight2 <- c("open class", "rooster", "feather", "feather", "open class", "heavy", "rooster", "heavy", "open class", "heavy")
rank2 <- 1:10
names2 <- c("Tom Cruise", "Emma Watson", "John Doe", "John Doe", "Tom Cruise", "James Bond", "Jon Snow", "Tom Cruise", "Jane Doe", "Tom Cruise")
points <- c(100, 30, 50, 50, 100, 200, 92, 77, 88, 77)
(df2 <- data.frame(belt2, weight2, rank2, name = names2, points))
Upvotes: 2
Views: 128
Reputation: 76402
This can be solved with a right join and removing the duplicates after it. I will use base function merge
.
df3 <- merge(
df, df2,
by.x = c("belt", "weight", "rank", "name"),
by.y = c("belt2", "weight2", "rank2", "name"),
all.y = TRUE
)
df3 <- df3[!duplicated(df3),]
df3[order(df3$rank),]
#> belt weight rank name id points
#> 9 purple open class 1 Tom Cruise 55 100
#> 2 black rooster 2 Emma Watson 79 30
#> 4 blue feather 3 John Doe 63 50
#> 5 blue feather 4 John Doe 63 50
#> 10 purple open class 5 Tom Cruise 55 100
#> 6 brown heavy 6 James Bond 3 200
#> 3 black rooster 7 Jon Snow NA 92
#> 7 purple heavy 8 Tom Cruise 61 77
#> 1 black open class 9 Jane Doe 70 88
#> 8 purple heavy 10 Tom Cruise 61 77
Created on 2023-02-08 with reprex v2.0.2
A dplyr
right join is
suppressPackageStartupMessages({
library(dplyr)
})
df %>%
right_join(df2,
by = c("belt" = "belt2", "weight" = "weight2", "rank" = "rank2"),
suffix = c(".x", "")) %>%
select(-name.x) %>%
arrange(rank)
#> belt weight rank id name points
#> 1 purple open class 1 55 Tom Cruise 100
#> 2 black rooster 2 79 Emma Watson 30
#> 3 blue feather 3 63 John Doe 50
#> 4 blue feather 4 63 John Doe 50
#> 5 purple open class 5 55 Tom Cruise 100
#> 6 brown heavy 6 3 James Bond 200
#> 7 black rooster 7 NA Jon Snow 92
#> 8 purple heavy 8 61 Tom Cruise 77
#> 9 black open class 9 70 Jane Doe 88
#> 10 purple heavy 10 61 Tom Cruise 77
Created on 2023-02-08 with reprex v2.0.2
Upvotes: 3
Reputation: 401
You can achieve this task by using a left join
between the two dataframes.
df2 = df2 %>% left_join(df, by = c("belt2" = "belt", "weight2" = "weight", "name" = "name")) %>% select(belt2, weight2, rank2, name, points, id)
belt2 weight2 rank2 name points id
1 purple open class 1 Tom Cruise 100 55
2 purple open class 1 Tom Cruise 100 55
3 purple open class 1 Tom Cruise 100 55
4 black rooster 2 Emma Watson 30 79
5 blue feather 3 John Doe 50 63
6 blue feather 3 John Doe 50 63
7 blue feather 4 John Doe 50 63
8 blue feather 4 John Doe 50 63
9 purple open class 5 Tom Cruise 100 55
10 purple open class 5 Tom Cruise 100 55
11 purple open class 5 Tom Cruise 100 55
12 brown heavy 6 James Bond 200 3
13 black rooster 7 Jon Snow 92 NA
14 purple heavy 8 Tom Cruise 77 61
15 purple heavy 8 Tom Cruise 77 61
16 black open class 9 Jane Doe 88 70
17 purple heavy 10 Tom Cruise 77 61
18 purple heavy 10 Tom Cruise 77 61
Upvotes: 3