J M
J M

Reputation: 122

Matching word list to individual words in strings using R

Edit: Fixed data example issue

Background/Data: I'm working on a merge between two datasets: one is a list of the legal names of various publicly traded companies and the second is a fairly dirty field with company names, individual titles, and all sorts of other difficult to predict words. The company name list is about 14,000 rows and the dirty data is about 1.3M rows. Not every publicly traded company will appear in the dirty data and some may appear multiple times with different presentations (Exxon Mobil, Exxon, ExxonMobil, etc.).

Accordingly, my current approach is to dismantle the publicly traded company name list into the individual words used in each title (after cleaning out some common words like company, corporation, inc, etc.), resulting in the data shown below as Have1. An example of some of the dirty data is shown below as Have2. I have also cleaned these strings to eliminate words like Inc and Company in my ongoing work, but in case anyone has a better idea than my current approach, I'm leaving the data as-is. Additionally, we can assume there are very few, if any, exact matches in the data and that the Have2 data is too noisy to successfully use a fuzzy match without additional work.

Question: What is the best way to go about determining which of the items in Have2 contains the words from Have1? Specifically, I think I need the final data to look like Want, so that I can then link the public company name to the dirty data name. The plan is to hand-verify the matches given the difficult of the Have2 data, but if anyone has any suggestions on another way to go about this, I am definitely open to suggestions (please, someone, have a suggestion haha).

Tried so far: I have code that sort of works, but takes ages to run and seems inefficient. That is:

library(data.table)
library(stringr)

company_name_data <- c("amazon inc", "apple inc", "radiation inc", "xerox inc", "notgoingtomatch inc")

have1 <- data.table(table(str_split(company_name_data, "\\W+", simplify = TRUE)))[!V1 == "inc"]

have2 <- c("ceo and director, apple inc",
           "current title - senior manager amazon, inc., division of radiation exposure, subdivision of corporate anarchy",
           "xerox inc., president and ceo",
           "president and ceo of the amazon apple assn., division 4")


#Uses Have2 and creates a matrix where each column is a word and each row reflects one of the items from Have2
have3 <- data.table(str_split(have2, "\\W+", simplify = TRUE))

#Creates container
store <- data.table()

#Loops through each of the Have1 company names and sees whether that word appears in the have3 matrix

for (i in 1:nrow(have1)){
  
  matches <- data.table(have2[sapply(1:nrow(have3), function(x) any(grepl(paste0("\\b",have1$V1[i],"\\b"), have3[x,])))])
  
  if (nrow(matches) == 0){
    next
  }
  
  #Create combo data
  matches[, have1_word := have1$V1[i]]
  
  #Storage
  store <- rbind(store, matches)
}

Want

Name (from Have2) Word (from Have1)
current title - senior manager amazon, inc., division of microwaves and radiation exposure, subdivision of corporate anarchy amazon
current title - senior manager amazon, inc., division of microwaves and radiation exposure, subdivision of corporate anarchy radiation
vp and general bird aficionado of the amazon apple assn. branch F amazon
vp and general bird aficionado of the amazon apple assn. branch F apple
ceo and director, apple inc apple
xerox inc., president and ceo xerox

Have1

Word N
amazon 1
apple 3
xerox 1
notgoingtomatch 2
radiation 1

Have2

Name
ceo and director, apple inc
current title - senior manager amazon, inc., division of microwaves and radiation exposure, subdivision of corporate anarchy
xerox inc., president and ceo
vp and general bird aficionado of the amazon apple assn. branch F

Upvotes: 2

Views: 475

Answers (1)

Kat
Kat

Reputation: 18754

Using what you have documented, in terms of data from company_name_data and have2 only:

library(tidytext)
library(tidyverse)

#------------ remove stop words before tokenization ---------------

# now split each phrase, remove the stop words, rejoin the phrases 
# this works through one row at a time** (this is vectorization)
comp2 <- unlist(lapply(company_name_data,     # split the phrases into individual words,
                       # remove stop words then reassemble phrases
                       function(x) {
                         paste(unlist(strsplit(x,
                                               " ")
                         )[!(unlist(strsplit(x, 
                                             " ")) %in% (stop_words$word %>% 
                                                           unlist())
                         ) # end 2nd unlist
                         ], # end subscript of string split
                         collapse=" ")})) # reassemble string

haveItAll <- data.frame(have2)
haveItAll$comp <- unlist(lapply(have2,
                           function(x){
                             paste(unlist(strsplit(x,
                                                   " ")
                             )[(unlist(strsplit(x, 
                                                 " ")) %in% comp2
                             ) # end 2nd unlist
                             ], # end subscript of string split
                             collapse=" ")})) # reassemble string

The results in the second column, based on the text analysis are "apple," "radiation," "xerox," and "amazon apple."

I'm certain this code isn't mine originally. I'm sure I got these ideas from somewhere on StackOverflow...

Upvotes: 1

Related Questions