lyd-m
lyd-m

Reputation: 3

Is there a way in R to join between two columns based on whether a string in column 1 is contained within the string in column 2?

I am trying to join several messy datasets together without using "fuzzy matching".

In the core dataset (example dataset1 below), I have simple names for companies. In the datasets I would like to join on (that contain additional information about these companies), these names are accompanied by a variety of suffixes, prefixes, and other complications. These suffixes, prefixes are a wide variety of different lengths, making fuzzy matching less appropriate.

I would like to join based on whether strings in "exporter_group" in dataset1 are contained within strings in "company" in dataset2 as a first step, retaining the "company" column from dataset2 so that I can check the match manually.

Is this possible? Am I taking the right approach? Another way I've thought is creating a map of the simple company names and using a string matching mutate in dataset2 to create a column with the target simple name, then joining based on that new column..

Any help appreciated! The examples below are one company, in reality I will have several hundred so this needs to scale to that.

dataset1 <- tibble::tribble(
  ~exporter_group, ~exporter,
  "LOUIS DREYFUS", "LDC INDONESIA",
  "LOUIS DREYFUS", "LDC TRADING INDONESIA",
  "LOUIS DREYFUS", "LDC EAST INDONESIA",
  "LOUIS DREYFUS", "LOUIS DREYFUS"
)

dataset2 <- tibble::tribble(
  ~company, ~parent_company, ~subsidiares, ~market_cap_usd, ~bloomberg_ticker, ~thomson_reuters_ticker,
  "LOUIS DREYFUS COMPANY", NA, NA, NA, "0308213D NA EQUITY", NA
)

I've tried "fuzzy matching" and filtering based off "str_detect" but I've not quite got anywhere.

Upvotes: 0

Views: 87

Answers (2)

Jinjin
Jinjin

Reputation: 610

You said you dont't want "fuzzy matching" because it is "less appropriate", but "these names are accompanied by a variety of suffixes, prefixes, and other complications". It sounds conflit to me, because fuzzy matching is design to deal with varing string.

Here is a simple method using the general packages, but it is limited to the fixed additional string - 'COMPANY'. Since company in dataset2 is about having an extra ' COMPANY' string at the end of exporter_group in dataset1, we can create the company column for dataset1, and then there is no problem for a simple inner join operation.

dataset1 %>%
  mutate(company = paste0(exporter_group, " COMPANY") ) %>%
  inner_join(dataset2)

Joining, by = "company"
# A tibble: 4 x 8
  exporter_group exporter              company               parent_company subsidiares market_cap_usd bloomberg_t~1 thoms~2
  <chr>          <chr>                 <chr>                 <lgl>          <lgl>       <lgl>          <chr>         <lgl>  
1 LOUIS DREYFUS  LDC INDONESIA         LOUIS DREYFUS COMPANY NA             NA          NA             0308213D NA ~ NA     
2 LOUIS DREYFUS  LDC TRADING INDONESIA LOUIS DREYFUS COMPANY NA             NA          NA             0308213D NA ~ NA     
3 LOUIS DREYFUS  LDC EAST INDONESIA    LOUIS DREYFUS COMPANY NA             NA          NA             0308213D NA ~ NA     
4 LOUIS DREYFUS  LOUIS DREYFUS         LOUIS DREYFUS COMPANY NA             NA          NA             0308213D NA ~ NA   

This also keeps all the columns from two datasets. However, if it is not 'COMPANY' in the dataset2, I am afraid you have to do fuzzy matching like the other answer suggested.

Upvotes: 0

Mark
Mark

Reputation: 12518

You can use regex_join(), from the package:

pacman::p_load(fuzzyjoin)
regex_join(dataset2, dataset1, by = c(company = "exporter_group"))

Output: (Note: there's something wrong with dataset2 in your example, so I had to make my own!)

 A tibble: 4 × 8
  company             parent_company subsidiares market_cap_usd bloomberg_ticker
  <chr>               <chr>                <dbl>          <dbl> <chr>           
1 LOUIS DREYFUS COMP… Amazon                   4           1000 GOOG            
2 LOUIS DREYFUS COMP… Amazon                   4           1000 GOOG            
3 LOUIS DREYFUS COMP… Amazon                   4           1000 GOOG            
4 LOUIS DREYFUS COMP… Amazon                   4           1000 GOOG            
# ℹ 3 more variables: thomson_reuters_ticker <chr>, exporter_group <chr>,
#   exporter <chr>

regex_join() by default does an inner join, but you may want to change this to be a full join or a left join, using either the mode argument, or by changing it to be regex_left_join() or regex_full_join() respectively.

Data:

dataset2 <- tibble::tribble(
  ~company, ~parent_company, ~subsidiares, ~market_cap_usd, ~bloomberg_ticker, ~thomson_reuters_ticker,
  "LOUIS DREYFUS COMPANY", "Amazon", 4, 1000, "GOOG", "idk")

Upvotes: 1

Related Questions