Reputation: 122
Edit: Fixed data example issue
Background/Data: I'm working on a merge between two datasets: one is a list of the legal names of various publicly traded companies and the second is a fairly dirty field with company names, individual titles, and all sorts of other difficult to predict words. The company name list is about 14,000 rows and the dirty data is about 1.3M rows. Not every publicly traded company will appear in the dirty data and some may appear multiple times with different presentations (Exxon Mobil, Exxon, ExxonMobil, etc.).
Accordingly, my current approach is to dismantle the publicly traded company name list into the individual words used in each title (after cleaning out some common words like company, corporation, inc, etc.), resulting in the data shown below as Have1
. An example of some of the dirty data is shown below as Have2
. I have also cleaned these strings to eliminate words like Inc and Company in my ongoing work, but in case anyone has a better idea than my current approach, I'm leaving the data as-is. Additionally, we can assume there are very few, if any, exact matches in the data and that the Have2
data is too noisy to successfully use a fuzzy match without additional work.
Question: What is the best way to go about determining which of the items in Have2
contains the words from Have1
? Specifically, I think I need the final data to look like Want
, so that I can then link the public company name to the dirty data name. The plan is to hand-verify the matches given the difficult of the Have2
data, but if anyone has any suggestions on another way to go about this, I am definitely open to suggestions (please, someone, have a suggestion haha).
Tried so far: I have code that sort of works, but takes ages to run and seems inefficient. That is:
library(data.table)
library(stringr)
company_name_data <- c("amazon inc", "apple inc", "radiation inc", "xerox inc", "notgoingtomatch inc")
have1 <- data.table(table(str_split(company_name_data, "\\W+", simplify = TRUE)))[!V1 == "inc"]
have2 <- c("ceo and director, apple inc",
"current title - senior manager amazon, inc., division of radiation exposure, subdivision of corporate anarchy",
"xerox inc., president and ceo",
"president and ceo of the amazon apple assn., division 4")
#Uses Have2 and creates a matrix where each column is a word and each row reflects one of the items from Have2
have3 <- data.table(str_split(have2, "\\W+", simplify = TRUE))
#Creates container
store <- data.table()
#Loops through each of the Have1 company names and sees whether that word appears in the have3 matrix
for (i in 1:nrow(have1)){
matches <- data.table(have2[sapply(1:nrow(have3), function(x) any(grepl(paste0("\\b",have1$V1[i],"\\b"), have3[x,])))])
if (nrow(matches) == 0){
next
}
#Create combo data
matches[, have1_word := have1$V1[i]]
#Storage
store <- rbind(store, matches)
}
Want
Name (from Have2) | Word (from Have1) |
---|---|
current title - senior manager amazon, inc., division of microwaves and radiation exposure, subdivision of corporate anarchy | amazon |
current title - senior manager amazon, inc., division of microwaves and radiation exposure, subdivision of corporate anarchy | radiation |
vp and general bird aficionado of the amazon apple assn. branch F | amazon |
vp and general bird aficionado of the amazon apple assn. branch F | apple |
ceo and director, apple inc | apple |
xerox inc., president and ceo | xerox |
Have1
Word | N |
---|---|
amazon | 1 |
apple | 3 |
xerox | 1 |
notgoingtomatch | 2 |
radiation | 1 |
Have2
Name |
---|
ceo and director, apple inc |
current title - senior manager amazon, inc., division of microwaves and radiation exposure, subdivision of corporate anarchy |
xerox inc., president and ceo |
vp and general bird aficionado of the amazon apple assn. branch F |
Upvotes: 2
Views: 475
Reputation: 18754
Using what you have documented, in terms of data from company_name_data and have2 only:
library(tidytext)
library(tidyverse)
#------------ remove stop words before tokenization ---------------
# now split each phrase, remove the stop words, rejoin the phrases
# this works through one row at a time** (this is vectorization)
comp2 <- unlist(lapply(company_name_data, # split the phrases into individual words,
# remove stop words then reassemble phrases
function(x) {
paste(unlist(strsplit(x,
" ")
)[!(unlist(strsplit(x,
" ")) %in% (stop_words$word %>%
unlist())
) # end 2nd unlist
], # end subscript of string split
collapse=" ")})) # reassemble string
haveItAll <- data.frame(have2)
haveItAll$comp <- unlist(lapply(have2,
function(x){
paste(unlist(strsplit(x,
" ")
)[(unlist(strsplit(x,
" ")) %in% comp2
) # end 2nd unlist
], # end subscript of string split
collapse=" ")})) # reassemble string
The results in the second column, based on the text analysis are "apple," "radiation," "xerox," and "amazon apple."
I'm certain this code isn't mine originally. I'm sure I got these ideas from somewhere on StackOverflow...
Upvotes: 1