user8959427

Reputation: 2067

Fuzzy extraction of names (stored in a vector) from a text column in R

I have some data containing text, and I would like to extract the company names from that text. The data looks like:

d <- data.frame(
  textColumn = c(
    "Apple CEO reports positive growth in Iphone sales",
    "Apple's quarterly results are expected to beat that of Intel's",
    "Microsoft is expected to release a new product which rivales Apple's Iphone which uses Intels processors",
    "Intel Corporation seeks to hire 5000 new staff",
    "Amazon enters a new market, the same as Intel"
  )
)

Data:

                                                                   textColumn
1                           Apple CEO reports positive growth in Iphone sales
2              Apple's quarterly results are expected to beat that of Intel's
3 Microsoft is expected to release a new product which rivales Apple's Iphone
4                              Intel Corporation seeks to hire 5000 new staff
5                               Amazon enters a new market, the same as Intel

In a vector I have a number of company names.

companyNames <- c(
  "Apple Inc",
  "Intel Corp",
  "Microsoft Corporation",
  "Amazon Company"
)

Data:

[1] "Apple Inc"             "Intel Corp"            "Microsoft Corporation" "Amazon Company"  

I cannot match the company names exactly, because the vector mostly holds the full company names (Apple Inc, Intel Corp, etc.), whereas the text refers to the companies only by their short names (Apple, Intel, etc.).

I would like to use fuzzy string extraction to try and extract the company names from the text so the expected output using this example would be:

c(
  "Apple",
  "Apple | Intel",
  "Microsoft | Apple | Intel",
  "Intel",
  "Amazon | Intel"
)

Data:

[1] "Apple"                     "Apple | Intel"             "Microsoft | Apple | Intel" "Intel"                     "Amazon | Intel" 

Apple is the only company in the first row of the text data, whereas Apple and Intel both occur in the second row, so I separate them with |. I am looking into fuzzyExtract from the fuzzywuzzyR package, but I cannot get it working on my sample data.

Upvotes: 0

Views: 67

Answers (2)

lampros

Reputation: 581

What you actually need is an exact match, assuming that the company names are spelled the same way in the 'd' data.frame and in the 'companyNames' vector once the extensions (Inc, Corp etc.) are dropped:


    # use stringsAsFactors = FALSE
    d <- data.frame(
      textColumn = c(
        "Apple CEO reports positive growth in Iphone sales",
        "Apple's quarterly results are expected to beat that of Intel's",
        "Microsoft is expected to release a new product which rivales Apple's Iphone which uses Intels processors",
        "Intel Corporation seeks to hire 5000 new staff",
        "Amazon enters a new market, the same as Intel"
      ), stringsAsFactors = FALSE
    )
    
    companyNames <- c(
      "Apple Inc",
      "Intel Corp",
      "Microsoft Corporation",
      "Amazon Company"
    )
    
    
    # extract the company names (without the extensions Inc, Corp etc.)
    companyNames = unlist(lapply(strsplit(companyNames, ' '), function(x) x[1]))
    
    
    # use 'gregexpr' and 'substr' to append the company names to the 'output' vector
    output = rep(NA, nrow(d))
    
    for (ROW in seq_len(nrow(d))) {
      
      iter_row = d[ROW, , drop = TRUE]
      iter_vec = c()
      
      for (NAME in companyNames) {
        # 'iter_row' is a single string, so 'gregexpr' returns a list of length 1
        iter_match = gregexpr(pattern = NAME, text = iter_row)
        
        for (idx_match in seq_along(iter_match)) {
          # test only the first position: there is one entry per occurrence,
          # and a condition of length > 1 is an error in 'if' (R >= 4.2)
          if (iter_match[[idx_match]][1] != -1) {
          
            match_start_idx = iter_match[[idx_match]][1]
            match_length = attr(iter_match[[idx_match]], "match.length")[1]
            
            iter_company = substr(iter_row, match_start_idx, match_start_idx + match_length - 1)
            iter_vec = append(iter_vec, iter_company)
          }
        }
      }
      
      output[ROW] = paste(iter_vec, collapse = ' | ')
    }

This gives:


[1] "Apple"   "Apple | Intel"   "Apple | Intel | Microsoft"   "Intel"   "Intel | Amazon"
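If the text can contain misspelled company names, exact matching will miss them. Below is a minimal sketch of one possible fallback, assuming base R's `agrepl` (approximate matching by edit distance) is acceptable in place of fuzzywuzzyR. The misspelled sentence, the helper name `fuzzy_companies`, and the tolerance of one edit are illustrative assumptions, not part of the original data.

```r
# Short company names (extensions such as Inc / Corp already removed)
short_names <- c("Apple", "Intel", "Microsoft", "Amazon")

texts <- c(
  "Microsft is expected to release a new product",  # misspelled on purpose
  "Amazon enters a new market, the same as Intel"
)

# Hypothetical helper: return every candidate name that occurs in 'text'
# within 'max_dist' edits (insertions, deletions, substitutions)
fuzzy_companies <- function(text, candidates, max_dist = 1) {
  hits <- vapply(
    candidates,
    function(nm) agrepl(nm, text, max.distance = max_dist),
    logical(1)
  )
  paste(candidates[hits], collapse = " | ")
}

output <- vapply(texts, fuzzy_companies, character(1),
                 candidates = short_names, USE.NAMES = FALSE)
print(output)
```

Note that this returns the name from the vector rather than the substring found in the text, and the names come out in the order of `short_names`, not in the order they appear in the sentence. With short names, even a small tolerance can produce false positives, so the threshold needs checking against real data.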

Upvotes: 0

Andrew

Reputation: 5138

This uses stringr to clean up the company names, extract the matches from each row of text, and collapse them into a single string per row. I am sure it will require some adaptation on your part, but it should get you started. Also, the \\b in the regex is a word boundary: it protects against partial matches for the elements of org_type. Hope this helps!

library(stringr)

# Removing the organization types you listed (e.g., Inc)
# You could also grab the first word; I chose types because it is more explicit,
# but it would require checking on your part (either option will)
org_type <- c("Inc", "Corp", "Corporation", "Company")

company_clean <- str_remove_all(companyNames, str_c("\\s*\\b", org_type, "\\b", collapse = "|"))

# Extracting the company name matches from the list and pasting them together
sapply(str_extract_all(d$textColumn, str_c(company_clean, collapse = "|")), paste0, collapse = " | ")
[1] "Apple"                     "Apple | Intel"             "Microsoft | Apple | Intel" "Intel"                     "Amazon | Intel"    
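To see what the \\b boundaries protect against, the pattern can be compared with and without them on a name that merely starts with one of the org types. This is a small sketch in base R (`gsub` with `perl = TRUE` rather than stringr, only to keep it dependency-free); "Incyte Corp" is just an illustrative input, not part of the original data.

```r
org_type <- c("Inc", "Corp", "Corporation", "Company")
x <- c("Apple Inc", "Incyte Corp", "Microsoft Corporation")

# Same construction as above, once with and once without word boundaries
with_boundary    <- paste0("\\s*\\b", org_type, "\\b", collapse = "|")
without_boundary <- paste0("\\s*", org_type, collapse = "|")

clean_bounded   <- gsub(with_boundary, "", x, perl = TRUE)
clean_unbounded <- gsub(without_boundary, "", x, perl = TRUE)

print(clean_bounded)    # "Apple" "Incyte" "Microsoft"
print(clean_unbounded)  # "Apple" "yte" "Microsoftoration"
```

Without the boundaries, "Inc" is stripped from the start of "Incyte", and "Corp" matches inside "Corporation" before the longer alternative gets a chance, leaving "oration" behind.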

Upvotes: 1
