Reputation:
I have two data frames that I wish to merge into one based on the column "Species"
Species |
---|
Themeda australis |
Themeda avenacea |
Themeda quadrivalvis |
Themeda triandra |
Lomandra juncea |
Lomandra longifolia |
Species | Common_name |
---|---|
Themeda australis (R.Br.) Stapf | Kangaroo grass |
Themeda avenacea (F.Muell.) Lugger | Native oat |
Themeda quadrivalvis (L.) Kuntze. | Grader grass |
Themeda triandra Forssk. | Kangaroo grass |
Lomandra juncea (F.Muell.) Ewart | Desert Mat-rush |
Lomandra longifolia Labill. | Spiny-headed Mat-rush |
df1 <- data.frame(Species = c(
"Themeda australis",
"Themeda avenacea",
"Themeda quadrivalvis",
"Themeda triandra",
"Lomandra juncea",
"Lomandra longifolia"
))
df2 <- data.frame(Species = c(
"Themeda australis (R.Br.) Stapf",
"Themeda avenacea (F.Muell.) Lugger",
"Themeda quadrivalvis (L.) Kuntze.",
"Themeda triandra Forssk.",
"Lomandra juncea (F.Muell.) Ewart",
"Lomandra longifolia Labill."
), Common_name = c(
"Kangaroo grass",
"Native oat",
"Grader grass",
"Kangaroo grass",
"Desert Mat-rush",
"Spiny-headed Mat-rush"
))
However, because there are multiple species containing the same string, I would like to match the data frames by only the first two words of column Species (e.g. Themeda triandra == Themeda triandra Forssk.). Keeping in mind that I am working with big data: Dataframe 1 is 32,931 rows and Dataframe 2 is 16,185 rows. No matches can be denoted NA.
Species | Common_name |
---|---|
Themeda australis | Kangaroo grass |
Themeda avenacea | Native oat |
Themeda quadrivalvis | Grader grass |
Themeda triandra | Kangaroo grass |
Lomandra juncea | Desert Mat-rush |
Lomandra longifolia | Spiny-headed Mat-rush |
Is this possible?
I have tried the following:
output <- df1 %>%
fuzzy_inner_join(df2, by = "Species", match_fun = str_detect)
Upvotes: 1
Views: 102
Reputation: 5887
I think the sample data is a bit confusing, where it looks that we can just get the result straight from df2 without taking df1 into account as it gives the same result.
However, OP clearly mentions differently "Dataframe 1 is 32,931 rows and Dataframe 2 is 16,185 rows. No matches can be denoted NA."
I suggest to alter once df2 (your lookup table). Not sure where you read that data from but perhaps write the lookup table with this extra column so you do not have to redo this task every lookup.
df2 <- df2 %>%
mutate(Species_short = gsub("(\\w+ \\w+).+", "\\1", Species))
After that, just left_join it with your df1
df1 %>%
left_join(df2, by = c("Species" = "Species_short"))
Results
# Species Species.y Common_name
# 1 Themeda triandra Themeda triandra Forssk. Kangaroo grass
# 2 Themeda australis Themeda australis (R.Br.) Stapf Kangaroo grass
# 3 Themeda quadrivalvis Themeda quadrivalvis (L.) Kuntze. Grader grass
# 4 Funky species without a common name <NA> <NA>
# 5 Themeda triandra Themeda triandra Forssk. Kangaroo grass
# 6 Lomandra juncea Lomandra juncea (F.Muell.) Ewart Desert Mat-rush
# 7 Lomandra longifolia Lomandra longifolia Labill. Spiny-headed Mat-rush
# 8 Themeda triandra Themeda triandra Forssk. Kangaroo grass
# 9 Lomandra juncea Lomandra juncea (F.Muell.) Ewart Desert Mat-rush
Data df1 (changed from OP)
df1 <- data.frame(Species = c(
"Themeda triandra",
"Themeda australis",
"Themeda quadrivalvis",
"Funky species missing in df2",
"Themeda triandra",
"Lomandra juncea",
"Lomandra longifolia",
"Themeda triandra",
"Lomandra juncea"
))
Upvotes: 0
Reputation: 5263
You can try this:
library(tidyverse)
df2 %>% mutate(Species = str_extract(Species, "^([\\w\\-]+) ([\\w\\-]+)"))
Result:
Species Common_name
1 Themeda australis Kangaroo grass
2 Themeda avenacea Native oat
3 Themeda quadrivalvis Grader grass
4 Themeda triandra Kangaroo grass
5 Lomandra juncea Desert Mat-rush
6 Lomandra longifolia Spiny-headed Mat-rush
Actually it seems there is no need for join. We can just extract the first two words from Species
column. Anyway You can (simple, not fuzzy) join this resulting data.frame with any other data.frame having two-word Species column.
Upvotes: 1
Reputation: 12518
Another way of doing things, if asd-tm's answer doesn't work for your larger dataset:
Instead of extracting the names based on the structure of the Species names in the second dataframe, I look for a match for any string in the Species column of the first dataframe (|
means match any of them)
library(tidyverse)
df2 |> mutate(Species_clean = str_extract(Species, paste(df1$Species, collapse = "|")))
Output:
Species Common_name Species_clean
1 Themeda australis (R.Br.) Stapf Kangaroo grass Themeda australis
2 Themeda avenacea (F.Muell.) Lugger Native oat Themeda avenacea
3 Themeda quadrivalvis (L.) Kuntze. Grader grass Themeda quadrivalvis
4 Themeda triandra Forssk. Kangaroo grass Themeda triandra
5 Lomandra juncea (F.Muell.) Ewart Desert Mat-rush Lomandra juncea
6 Lomandra longifolia Labill. Spiny-headed Mat-rush Lomandra longifolia
Upvotes: 0