Reputation:

Can I merge two data frames by partial string match of a common column?

I have two data frames that I wish to merge into one based on the column "Species"

Dataframe 1:

Species
Themeda australis
Themeda avenacea
Themeda quadrivalvis
Themeda triandra
Lomandra juncea
Lomandra longifolia

Dataframe 2:

Species	Common_name
Themeda australis (R.Br.) Stapf	Kangaroo grass
Themeda avenacea (F.Muell.) Lugger	Native oat
Themeda quadrivalvis (L.) Kuntze.	Grader grass
Themeda triandra Forssk.	Kangaroo grass
Lomandra juncea (F.Muell.) Ewart	Desert Mat-rush
Lomandra longifolia Labill.	Spiny-headed Mat-rush

df1 <- data.frame(Species =  c(
  "Themeda australis",
  "Themeda avenacea",
  "Themeda quadrivalvis",
  "Themeda triandra",
  "Lomandra juncea",
  "Lomandra longifolia"
))
df2 <- data.frame(Species = c(
  "Themeda australis (R.Br.) Stapf",
  "Themeda avenacea (F.Muell.) Lugger",
  "Themeda quadrivalvis (L.) Kuntze.",
  "Themeda triandra Forssk.",
  "Lomandra juncea (F.Muell.) Ewart",
  "Lomandra longifolia Labill."
), Common_name = c(
  "Kangaroo grass",
  "Native oat",
  "Grader grass",
  "Kangaroo grass",
  "Desert Mat-rush",
  "Spiny-headed Mat-rush"
))

However, because there are multiple species containing the same string, I would like to match the data frames by only the first two words of column Species (e.g. Themeda triandra == Themeda triandra Forssk.). Keeping in mind that I am working with big data: Dataframe 1 is 32,931 rows and Dataframe 2 is 16,185 rows. No matches can be denoted NA.

Desired output:

Species	Common_name
Themeda australis	Kangaroo grass
Themeda avenacea	Native oat
Themeda quadrivalvis	Grader grass
Themeda triandra	Kangaroo grass
Lomandra juncea	Desert Mat-rush
Lomandra longifolia	Spiny-headed Mat-rush

Is this possible?

I have tried the following:

output <- df1 %>%
  fuzzy_inner_join(df2, by = "Species", match_fun = str_detect)

Upvotes: 1

Answers (3)

Merijn van Tilborg

Reputation: 5887

I think the sample data is a bit confusing, where it looks that we can just get the result straight from df2 without taking df1 into account as it gives the same result.

However, OP clearly mentions differently "Dataframe 1 is 32,931 rows and Dataframe 2 is 16,185 rows. No matches can be denoted NA."

I suggest to alter once df2 (your lookup table). Not sure where you read that data from but perhaps write the lookup table with this extra column so you do not have to redo this task every lookup.

df2 <- df2 %>%
  mutate(Species_short = gsub("(\\w+ \\w+).+", "\\1", Species))

After that, just left_join it with your df1

df1 %>%
  left_join(df2, by = c("Species" = "Species_short"))

Results

#                               Species                         Species.y           Common_name
# 1                    Themeda triandra          Themeda triandra Forssk.        Kangaroo grass
# 2                   Themeda australis   Themeda australis (R.Br.) Stapf        Kangaroo grass
# 3                Themeda quadrivalvis Themeda quadrivalvis (L.) Kuntze.          Grader grass
# 4 Funky species without a common name                              <NA>                  <NA>
# 5                    Themeda triandra          Themeda triandra Forssk.        Kangaroo grass
# 6                     Lomandra juncea  Lomandra juncea (F.Muell.) Ewart       Desert Mat-rush
# 7                 Lomandra longifolia       Lomandra longifolia Labill. Spiny-headed Mat-rush
# 8                    Themeda triandra          Themeda triandra Forssk.        Kangaroo grass
# 9                     Lomandra juncea  Lomandra juncea (F.Muell.) Ewart       Desert Mat-rush

Data df1 (changed from OP)

df1 <- data.frame(Species =  c(
  "Themeda triandra",
  "Themeda australis",
  "Themeda quadrivalvis",
  "Funky species missing in df2",
  "Themeda triandra",
  "Lomandra juncea",
  "Lomandra longifolia",
  "Themeda triandra",
  "Lomandra juncea"
))

Upvotes: 0

asd-tm

Reputation: 5263

You can try this:

library(tidyverse)
df2 %>% mutate(Species = str_extract(Species, "^([\\w\\-]+) ([\\w\\-]+)"))

Result:

               Species           Common_name
1    Themeda australis        Kangaroo grass
2     Themeda avenacea            Native oat
3 Themeda quadrivalvis          Grader grass
4     Themeda triandra        Kangaroo grass
5      Lomandra juncea       Desert Mat-rush
6  Lomandra longifolia Spiny-headed Mat-rush

Actually it seems there is no need for join. We can just extract the first two words from Species column. Anyway You can (simple, not fuzzy) join this resulting data.frame with any other data.frame having two-word Species column.

Upvotes: 1

Mark

Reputation: 12518

Another way of doing things, if asd-tm's answer doesn't work for your larger dataset:

Instead of extracting the names based on the structure of the Species names in the second dataframe, I look for a match for any string in the Species column of the first dataframe (| means match any of them)

library(tidyverse)
df2 |> mutate(Species_clean = str_extract(Species, paste(df1$Species, collapse = "|")))

Output:

                             Species           Common_name        Species_clean
1    Themeda australis (R.Br.) Stapf        Kangaroo grass    Themeda australis
2 Themeda avenacea (F.Muell.) Lugger            Native oat     Themeda avenacea
3  Themeda quadrivalvis (L.) Kuntze.          Grader grass Themeda quadrivalvis
4           Themeda triandra Forssk.        Kangaroo grass     Themeda triandra
5   Lomandra juncea (F.Muell.) Ewart       Desert Mat-rush      Lomandra juncea
6        Lomandra longifolia Labill. Spiny-headed Mat-rush  Lomandra longifolia