Reputation: 2067
I have some data which contains text and I would like to try and extract the company names from the text. The data looks like:
d <- data.frame(
textColumn = c(
"Apple CEO reports positive growth in Iphone sales",
"Apple's quarterly results are expected to beat that of Intel's",
"Microsoft is expected to release a new product which rivales Apple's Iphone which uses Intels processors",
"Intel Corporation seeks to hire 5000 new staff",
"Amazon enters a new market, the same as Intel"
)
)
Data:
textColumn
1 Apple CEO reports positive growth in Iphone sales
2 Apple's quarterly results are expected to beat that of Intel's
3 Microsoft is expected to release a new product which rivales Apple's Iphone which uses Intels processors
4 Intel Corporation seeks to hire 5000 new staff
5 Amazon enters a new market, the same as Intel
In a vector I have a number of company names.
companyNames <- c(
"Apple Inc",
"Intel Corp",
"Microsoft Corporation",
"Amazon Company"
)
Data:
[1] "Apple Inc" "Intel Corp" "Microsoft Corporation" "Amazon Company"
I cannot extract the company names with an exact match, since the vector mostly contains the full company names ("Apple Inc", "Intel Corp", etc.), whereas the text refers to the companies only as "Apple", "Intel", etc.
I would like to use fuzzy string extraction to try and extract the company names from the text so the expected output using this example would be:
c(
"Apple",
"Apple | Intel",
"Microsoft | Apple | Intel",
"Intel",
"Amazon | Intel"
)
Data:
[1] "Apple" "Apple | Intel" "Microsoft | Apple | Intel" "Intel" "Amazon | Intel"
Here, Apple is the only company mentioned in the first row of the text data, whereas Apple and Intel both occur in the second row (so I separate them with " | "). I am looking into fuzzyExtract from the fuzzywuzzyR package, but I cannot seem to get it working on my sample data.
Upvotes: 0
Views: 67
Reputation: 581
What you actually need is an exact match (assuming that the company names are spelled the same way in both the 'd' data.frame and the 'companyNames' vector):
# use stringsAsFactors = FALSE
d <- data.frame(
textColumn = c(
"Apple CEO reports positive growth in Iphone sales",
"Apple's quarterly results are expected to beat that of Intel's",
"Microsoft is expected to release a new product which rivales Apple's Iphone which uses Intels processors",
"Intel Corporation seeks to hire 5000 new staff",
"Amazon enters a new market, the same as Intel"
), stringsAsFactors = FALSE
)
companyNames <- c(
"Apple Inc",
"Intel Corp",
"Microsoft Corporation",
"Amazon Company"
)
# extract the company names (without the extensions Inc, Corp etc.)
companyNames = unlist(lapply(strsplit(companyNames, ' '), function(x) x[1]))
# use 'gregexpr' and 'substr' to append the company names to the 'output' vector
output = rep(NA_character_, nrow(d))
for (ROW in seq_len(nrow(d))) {
  # 'd' has a single column, so drop = TRUE returns the text of this row
  iter_row = d[ROW, , drop = TRUE]
  iter_vec = c()
  for (NAME in companyNames) {
    iter_match = gregexpr(pattern = NAME, text = iter_row)
    for (idx_match in seq_along(iter_match)) {
      # test only the first position, so a name that occurs several times
      # in one row does not pass a vector to 'if'
      if (iter_match[[idx_match]][1] != -1) {
        match_start_idx = iter_match[[idx_match]][1]
        match_length = attr(iter_match[[idx_match]], "match.length")[1]
        iter_company = substr(iter_row, match_start_idx, match_start_idx + match_length - 1)
        iter_vec = append(iter_vec, iter_company)
      }
    }
  }
  output[ROW] = paste(iter_vec, collapse = ' | ')
}
This gives:
[1] "Apple" "Apple | Intel" "Apple | Intel | Microsoft" "Intel" "Intel | Amazon"
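The same exact-match idea can probably be written without the explicit loops. Below is a minimal sketch in base R, assuming the short company name is everything before the first space; 'short_names', 'pattern' and 'found' are names made up for illustration. Unlike the loop, it lists the matches in the order they appear in the text, and unique() keeps a company from being listed twice in one row.

```r
# Sample data from the question
d <- data.frame(
  textColumn = c(
    "Apple CEO reports positive growth in Iphone sales",
    "Apple's quarterly results are expected to beat that of Intel's",
    "Microsoft is expected to release a new product which rivales Apple's Iphone which uses Intels processors",
    "Intel Corporation seeks to hire 5000 new staff",
    "Amazon enters a new market, the same as Intel"
  ),
  stringsAsFactors = FALSE
)
companyNames <- c("Apple Inc", "Intel Corp", "Microsoft Corporation", "Amazon Company")

# Assumption: the short name is the first word of each company name
short_names <- sub(" .*$", "", companyNames)

# One alternation pattern that matches any of the short names
pattern <- paste(short_names, collapse = "|")

# Extract all matches per row, then collapse them into one string per row
matches <- regmatches(d$textColumn, gregexpr(pattern, d$textColumn))
found <- vapply(
  matches,
  function(m) paste(unique(m), collapse = " | "),
  character(1),
  USE.NAMES = FALSE
)
print(found)
```

If a short name contained regex metacharacters (for example a dot), it would need escaping before being pasted into the pattern.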
Upvotes: 0
Reputation: 5138
This uses stringr to clean up the company names, extract them from the text, and then collapse the matches for each row into a single string. I am sure it will require some adaptation on your part, but this should get you started. Also, the \\b in the regex is a word boundary: it protects against partial matches for the elements of org_type. Hope this helps!
library(stringr)
# Removing the organization types you listed (e.g., Inc)
# You may also grab the first word, I chose types because it was more explicit
# but it would require checking on your part (either option will)
org_type <- c("Inc", "Corp", "Corporation", "Company")
company_clean <- str_remove_all(companyNames, str_c("\\s*\\b", org_type, "\\b", collapse = "|"))
# Extracting the company name matches from the list and pasting them together
sapply(str_extract_all(d$textColumn, str_c(company_clean, collapse = "|")), paste0, collapse = " | ")
[1] "Apple" "Apple | Intel" "Microsoft | Apple | Intel" "Intel" "Amazon | Intel"
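To see why the word boundary matters, here is a small sketch in base R (using sub() with perl = TRUE rather than stringr, purely so it runs without extra packages). The variable names are made up for illustration. Without \\b, the pattern for "Corp" also eats the start of "Corporation".

```r
# Without word boundaries, "Corp" matches inside "Corporation"
no_boundary <- sub("\\s*Corp", "", "Microsoft Corporation", perl = TRUE)

# With word boundaries, "Corp" must be a whole word, so nothing is removed here
with_boundary <- sub("\\s*\\bCorp\\b", "", "Microsoft Corporation", perl = TRUE)

# A standalone "Corp" is still removed as intended
standalone <- sub("\\s*\\bCorp\\b", "", "Intel Corp", perl = TRUE)

print(no_boundary)    # "Microsoftoration"
print(with_boundary)  # "Microsoft Corporation"
print(standalone)     # "Intel"
```

This is why the full pattern lists "Corporation" as its own alternative rather than relying on "Corp" to cover it.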
Upvotes: 1