Reputation: 3
I am trying to join several messy datasets together without using "fuzzy matching".
In the core dataset (example dataset1 below), I have simple names for companies. In the datasets I would like to join on (that contain additional information about these companies), these names are accompanied by a variety of suffixes, prefixes, and other complications. These suffixes, prefixes are a wide variety of different lengths, making fuzzy matching less appropriate.
I would like to join based on whether strings in "exporter_group" in dataset1 are contained within strings in "company" in dataset2 as a first step, retaining the "company" column from dataset2 so that I can check the match manually.
Is this possible? Am I taking the right approach? Another way I've thought is creating a map of the simple company names and using a string matching mutate in dataset2 to create a column with the target simple name, then joining based on that new column..
Any help appreciated! The examples below are one company, in reality I will have several hundred so this needs to scale to that.
dataset1 <- tibble::tribble(
~exporter_group, ~exporter,
"LOUIS DREYFUS", "LDC INDONESIA",
"LOUIS DREYFUS", "LDC TRADING INDONESIA",
"LOUIS DREYFUS", "LDC EAST INDONESIA",
"LOUIS DREYFUS", "LOUIS DREYFUS"
)
dataset2 <- tibble::tribble(
~company, ~parent_company, ~subsidiares, ~market_cap_usd, ~bloomberg_ticker, ~thomson_reuters_ticker,
"LOUIS DREYFUS COMPANY", NA, NA, NA, "0308213D NA EQUITY", NA
)
I've tried "fuzzy matching" and filtering based off "str_detect" but I've not quite got anywhere.
Upvotes: 0
Views: 87
Reputation: 610
You said you dont't want "fuzzy matching" because it is "less appropriate", but "these names are accompanied by a variety of suffixes, prefixes, and other complications". It sounds conflit to me, because fuzzy matching is design to deal with varing string.
Here is a simple method using the general packages, but it is limited to the fixed additional string - 'COMPANY'. Since company
in dataset2 is about having an extra ' COMPANY'
string at the end of exporter_group
in dataset1
, we can create the company
column for dataset1
, and then there is no problem for a simple inner join operation.
dataset1 %>%
mutate(company = paste0(exporter_group, " COMPANY") ) %>%
inner_join(dataset2)
Joining, by = "company"
# A tibble: 4 x 8
exporter_group exporter company parent_company subsidiares market_cap_usd bloomberg_t~1 thoms~2
<chr> <chr> <chr> <lgl> <lgl> <lgl> <chr> <lgl>
1 LOUIS DREYFUS LDC INDONESIA LOUIS DREYFUS COMPANY NA NA NA 0308213D NA ~ NA
2 LOUIS DREYFUS LDC TRADING INDONESIA LOUIS DREYFUS COMPANY NA NA NA 0308213D NA ~ NA
3 LOUIS DREYFUS LDC EAST INDONESIA LOUIS DREYFUS COMPANY NA NA NA 0308213D NA ~ NA
4 LOUIS DREYFUS LOUIS DREYFUS LOUIS DREYFUS COMPANY NA NA NA 0308213D NA ~ NA
This also keeps all the columns from two datasets. However, if it is not 'COMPANY' in the dataset2, I am afraid you have to do fuzzy matching like the other answer suggested.
Upvotes: 0
Reputation: 12518
You can use regex_join()
, from the fuzzyjoin package:
pacman::p_load(fuzzyjoin)
regex_join(dataset2, dataset1, by = c(company = "exporter_group"))
Output: (Note: there's something wrong with dataset2 in your example, so I had to make my own!)
A tibble: 4 × 8
company parent_company subsidiares market_cap_usd bloomberg_ticker
<chr> <chr> <dbl> <dbl> <chr>
1 LOUIS DREYFUS COMP… Amazon 4 1000 GOOG
2 LOUIS DREYFUS COMP… Amazon 4 1000 GOOG
3 LOUIS DREYFUS COMP… Amazon 4 1000 GOOG
4 LOUIS DREYFUS COMP… Amazon 4 1000 GOOG
# ℹ 3 more variables: thomson_reuters_ticker <chr>, exporter_group <chr>,
# exporter <chr>
regex_join()
by default does an inner join, but you may want to change this to be a full join or a left join, using either the mode
argument, or by changing it to be regex_left_join()
or regex_full_join()
respectively.
Data:
dataset2 <- tibble::tribble(
~company, ~parent_company, ~subsidiares, ~market_cap_usd, ~bloomberg_ticker, ~thomson_reuters_ticker,
"LOUIS DREYFUS COMPANY", "Amazon", 4, 1000, "GOOG", "idk")
Upvotes: 1