Clemens

Reputation: 9

R - How to count the occurrences of specific strings in large text files

I am trying to find the occurrences of ~10,000 different locations in a list of emails. What I need is one vector with the most frequently mentioned location per email, one with the second most frequent, and one with the third.

Since my dataset is huge, I have performance problems. I tried the stringi and parallel packages, but it still runs very slowly (about 15 min for 20,000 emails and 10,000 locations). The input data (emails and cities) looks like this:

SearchVector = c('Berlin, 'Amsterdam', San Francisco', 'Los Angeles') ...
g$Message = c('This is the first mail from paris. Berlin is a nice place', 'This is the 2nd mail from San francisco. Beirut is a nice place to stay', 'This is the 3rd mail. Los Angeles is a great place') ...

Here is my code using stringi:

# libraries
library(doParallel)
library(stringi)

detectCores()
registerDoParallel(cores=7)
getDoParWorkers()

# function
getCount <- function(data, keyword)
{ 
  keyword2 = paste0( "^(", keyword, ")|(", keyword, ")$|[ ](", keyword, ")[ ]" )
  wcount <- stri_count(data, regex=keyword2)
  return(data.frame(wcount))
}

SearchVector = as.vector(countryList2)
Text = g$Message

cityName1 = character()
cityName2 = character()

result = foreach(i=Text, .combine=rbind, .inorder=FALSE, .packages=c('stringi'), .errorhandling=c('remove')) %dopar% 
{

  cities = as.data.frame(t(getCount(i, SearchVector)))
  colnames(cities) = SearchVector

  if ( length(cities[which(cities > 0)]) == 1 ) {
    cityName1 = names(sort(cities, decreasing = TRUE))[1]
    cityName2 = NA
  }
  else if ( length(cities[which(cities > 0)]) > 1 ) {
    cityName1 = names(sort(cities, decreasing = TRUE))[1]
    cityName2 = names(sort(cities, decreasing = TRUE))[2] 
  }

  else  {
    cityName1 = NA
    cityName2 = NA 

  }

  return(data.frame(cityName1, cityName2))
}


g$cityName1 = result[, 1]
g$cityName2 = result[, 2]

Any ideas how I can speed this up, for instance by using an index or something similar? I look forward to any help on this issue.

Many thanks, Clemens

Upvotes: 0

Views: 135

Answers (1)

Akhil Nair

Reputation: 3284

This is a bit too messy for a comment, but give this a shot:

library(data.table)
library(stringr)

dt = data.table(Text = g$Message, cleantext = tolower(g$Message))
dt[, place := str_extract_all(cleantext, paste0("(", paste(tolower(SearchVector), collapse = ")|("), ")"))]

Also, the SearchVector in your question is missing some quotes (after 'Berlin and before San Francisco').

data.table is usually lightning quick for things like this, but try it on a subset and see if it's acceptably fast.

The place column will print as a bunch of place names separated by commas, but internally it's a list, so it's easy to do all sorts of aggregation with it, like counting the places in each text, counting how many times each place is mentioned, etc.

dt[, n := lengths(place)]; dt  # lengths() gives an integer column; lapply() would create a list column
nplace = data.table(place = dt[, unlist(place)])[, .N, place]
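
The question asks for the first, second and third most frequent place per email, so here is a rough sketch of how that could be pulled out of the place list. The `place` list below is a toy stand-in for `dt$place`, and `top_k` is a hypothetical helper name, not something from data.table or stringr. It uses base R only; ties are left in whatever order `sort()` returns them.

```r
# Toy stand-in for dt$place (assumed): one character vector of matches per message
place <- list(
  c("berlin", "amsterdam", "berlin"),
  character(0),
  c("san francisco", "los angeles", "los angeles")
)

# Hypothetical helper: the k most frequent values of x,
# padded with NA when fewer than k distinct values exist
top_k <- function(x, k = 3) {
  if (length(x) == 0) return(rep(NA_character_, k))
  counts <- sort(table(x), decreasing = TRUE)
  names(counts)[seq_len(k)]
}

# vapply yields a k x n matrix; transpose to get one row per message
top <- t(vapply(place, top_k, character(3)))

cityName1 <- top[, 1]  # most frequent place per message
cityName2 <- top[, 2]  # second most frequent
cityName3 <- top[, 3]  # third most frequent
```

With the real data you would pass `dt$place` instead of the toy list and assign the three columns back to `g`, assuming the rows of `dt` are still in the same order as `g`.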

I also changed all the text to lower case before searching, for good luck (this probably isn't the fastest way to be case-insensitive, but it looks the most explicit to me).
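
If you would rather not lower-case everything, a possible alternative is to let the regex engine ignore case. The sketch below uses base R (`gregexpr` with `perl = TRUE`) on made-up inputs; the `\\Q...\\E` quoting and the `\\b` word boundaries are my additions, on the assumption that some place names contain regex metacharacters such as the dot in "St. Louis" and should only match as whole words.

```r
# Toy inputs (assumed, not the real data)
SearchVector <- c("Berlin", "San Francisco", "St. Louis")
Message <- c("Mail from san francisco and BERLIN",
             "Stu Louis wrote from St. Louis")

# Quote each name literally with \Q...\E (PCRE), join into one alternation,
# and require word boundaries on both sides
pattern <- paste0("\\b(?:",
                  paste0("\\Q", SearchVector, "\\E", collapse = "|"),
                  ")\\b")

# ignore.case = TRUE makes the match case-insensitive without touching the text
hits <- regmatches(Message,
                   gregexpr(pattern, Message, ignore.case = TRUE, perl = TRUE))

# Matches keep their original spelling; normalise before counting
hits_lower <- lapply(hits, tolower)
```

Without the quoting, the pattern for "St. Louis" would also match "Stu Louis" in the second message, because an unescaped dot matches any character.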

Upvotes: 1
