Sulz
Sulz

Reputation: 445

many to many in left_join

Assume I have two dataframes that I want to join with each other. However they will end up in many-to many relations, which I do not want. If there are several matches of y to x, I want to consider other columns (in this case country as well). A solution, which could be extedet to more "conditional" columns" would be nice. However, in my mind I do have think a left_join passing several columns in the "by" argument, but the first by argument as a "nececceary condition" and the other by arguments as additional evidence for subseting the corretct columns.

If there is only one possible to match no further conditions should be checked. If the further column (in this case country) has NA, ignore this condition.

# Create the Questions DataFrame
questions_df <- data.frame(
  question_id = c(1, 2, 3, 4, 5),
  title = c("How to use Python?", "What is SQL?", "Django tutorial", "Data science with R", "Stackoverflow"),
  tag = c("python", "sql", "django", "r", "python"),
  country = c("NM", "TSE", "FR", "Z", "ZAF")
)

# Create the Tags DataFrame
tags_df <- data.frame(
  tag = c("python", "python", "sql", "django", "django", "r", "r"),
  expert = c("Expert A", "Expert B", "Expert C", "Expert D", "Expert E", "Expert F", "Expert G"),
  country = c("TGV", "NM", "TSE", "FR", "Z", "ZAF", NA)
)

# Perform a left join to illustrate the many-to-many relationship
result <- left_join(questions_df, tags_df, by = "tag")

Upvotes: 0

Views: 58

Answers (2)

Jon Spring
Jon Spring

Reputation: 66880

As an alternative, we could arrange by mismatching of country (such that FALSE, where country matches, goes first), and keep the top row per tag:

left_join(questions_df, tags_df, by = "tag") |>
  arrange(country.x != country.y) |>
  slice_head(n = 1, by = tag)

Result

  question_id               title    tag country.x   expert country.y
1           1  How to use Python? python        NM Expert B        NM
2           2        What is SQL?    sql       TSE Expert C       TSE
3           3     Django tutorial django        FR Expert D        FR
4           4 Data science with R      r         Z Expert F       ZAF

(It's unclear to me whether in the case of question_id 4, whether you want to prefer the expert F with a different country, or expert G with no country. The accepted answer sets a preference for the NA; if you want that, we could add question_id, !is.na(country.y), at the start of arrange() to get the same output.)

Upvotes: 1

Aleksandr
Aleksandr

Reputation: 1914

# Perform the join based on the primary `tag` condition
result <- left_join(questions_df, tags_df, by = "tag") %>%
  group_by(question_id) %>%
  # If there are multiple matches, filter by `country`
  filter(n() == 1 | (country.x == country.y | is.na(country.y))) %>%
  ungroup() %>%
  distinct()

result

Upvotes: 1

Related Questions