Reputation: 445
Assume I have two dataframes that I want to join with each other. However they will end up in many-to many relations, which I do not want. If there are several matches of y to x, I want to consider other columns (in this case country as well). A solution, which could be extedet to more "conditional" columns" would be nice. However, in my mind I do have think a left_join passing several columns in the "by" argument, but the first by argument as a "nececceary condition" and the other by arguments as additional evidence for subseting the corretct columns.
If there is only one possible to match no further conditions should be checked. If the further column (in this case country) has NA, ignore this condition.
# Create the Questions DataFrame
questions_df <- data.frame(
question_id = c(1, 2, 3, 4, 5),
title = c("How to use Python?", "What is SQL?", "Django tutorial", "Data science with R", "Stackoverflow"),
tag = c("python", "sql", "django", "r", "python"),
country = c("NM", "TSE", "FR", "Z", "ZAF")
)
# Create the Tags DataFrame
tags_df <- data.frame(
tag = c("python", "python", "sql", "django", "django", "r", "r"),
expert = c("Expert A", "Expert B", "Expert C", "Expert D", "Expert E", "Expert F", "Expert G"),
country = c("TGV", "NM", "TSE", "FR", "Z", "ZAF", NA)
)
# Perform a left join to illustrate the many-to-many relationship
result <- left_join(questions_df, tags_df, by = "tag")
Upvotes: 0
Views: 58
Reputation: 66880
As an alternative, we could arrange by mismatching of country (such that FALSE, where country matches, goes first), and keep the top row per tag:
left_join(questions_df, tags_df, by = "tag") |>
arrange(country.x != country.y) |>
slice_head(n = 1, by = tag)
Result
question_id title tag country.x expert country.y
1 1 How to use Python? python NM Expert B NM
2 2 What is SQL? sql TSE Expert C TSE
3 3 Django tutorial django FR Expert D FR
4 4 Data science with R r Z Expert F ZAF
(It's unclear to me whether in the case of question_id 4, whether you want to prefer the expert F with a different country, or expert G with no country. The accepted answer sets a preference for the NA; if you want that, we could add question_id, !is.na(country.y),
at the start of arrange()
to get the same output.)
Upvotes: 1
Reputation: 1914
# Perform the join based on the primary `tag` condition
result <- left_join(questions_df, tags_df, by = "tag") %>%
group_by(question_id) %>%
# If there are multiple matches, filter by `country`
filter(n() == 1 | (country.x == country.y | is.na(country.y))) %>%
ungroup() %>%
distinct()
result
Upvotes: 1