Kayla
Kayla

Reputation: 87

Fast way for String matching and replacement from another dataframe in R

I have two dataframes that look like this (although the first one is over 90 million rows long and the second dataframe is a little over 14 million rows) Also the second dataframe is randomly ordered

df1 <- data.frame(
  datalist = c("wiki/anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/individualism to complete wiki/collectivism",
               "strains of anarchism have often been divided into the categories of wiki/social_anarchism and wiki/individualist_anarchism or similar dual classifications",
               "the word is composed from the word wiki/anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e",
               "anarchy from anarchos meaning one without rulers from the wiki/privative prefix wiki/privative_alpha an- i.e",
               "authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/infinitive suffix -izein",
               "the first known use of this word was in 1539"),
  words = c("anarchist_schools_of_thought  individualism  collectivism", "social_anarchism  individualist_anarchism",
            "anarchy  -ism", "privative  privative_alpha", "infinitive", ""),

  stringsAsFactors=FALSE)

df2 <- data.frame(
  vocabword = c("anarchist_schools_of_thought", "individualism","collectivism" , "1965-66_nhl_season_by_team","social_anarchism","individualist_anarchism",                
                 "anarchy","-ism","privative","privative_alpha", "1310_the_ticket",  "infinitive"),
  token = c("Anarchist_schools_of_thought" ,"Individualism", "Collectivism",  "1965-66_NHL_season_by_team", "Social_anarchism", "Individualist_anarchism" ,"Anarchy",
            "-ism", "Privative" ,"Alpha_privative", "KTCK_(AM)" ,"Infinitive"), 
  stringsAsFactors = F)

I was able to extract all the words that come after the phrase "wiki/" into another column. Those words need to be replaced by the token column which matches to vocabword in the second dataframe. So for example I would look at the work "anarchist_schools_of_thought" which comes after wiki/ in the first row of the 1st dataframe, and then find the term "anarchist_schools_of_thought" in the second dataframe under vocab word and I want to replace it with the corresponding token which is "Anarchist_schools_of_thought".

So it should eventually come to look like this:

1 wiki/Anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/Individualism to complete wiki/Collectivism
2 strains of anarchism have often been divided into the categories of wiki/Social_anarchism and wiki/Individualist_anarchism or similar dual classifications
3 the word is composed from the word wiki/Anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e
4 anarchy from anarchos meaning one without rulers from the wiki/Privative prefix wiki/Alpha_privative an- i.e
5 authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/Infinitive suffix -izein
6 the first known use of this word was in 1539

I realize that a lot of them just capitalize the first letter of the words but some of them are significantly different. I could do a for loop but I think that would take way too much time and I'd prefer to do this either a data.table way or possibly a stringi or stringr way. And I normally would just do a merge but since there's multiple words needing replaced in a single row, it complicates things.

Thanks in advance for any help.

Upvotes: 2

Views: 2821

Answers (4)

Matt Summersgill
Matt Summersgill

Reputation: 4242

The way I've typically done this using straight stringi is as follows:

library(stringi)

Old <- df2[["vocabword"]]
New <- df2[["token"]]

stringi::stri_replace_all_regex(df1[["datalist"]],
                                "\\b"%s+%Old%s+%"\\b",
                                New,
                                vectorize_all = FALSE)

#[1] "wiki/Anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/Individualism to complete wiki/Collectivism"              
#[2] "strains of anarchism have often been divided into the categories of wiki/Social_anarchism and wiki/Individualist_anarchism or similar dual classifications"
#[3] "the word is composed from the word wiki/Anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e"                               
#[4] "Anarchy from anarchos meaning one without rulers from the wiki/Privative prefix wiki/Alpha_privative an- i.e"                                              
#[5] "authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/Infinitive suffix -izein"                                       
#[6] "the first known use of this word was in 1539"  

Theoretically, you should be able to get a reasonable improvement by parallelizing this the right way, but you won't be able to get any better than a Nx speed-up (Where N = # of cores available). -- My hunch is that cutting run-time from something like 8 months down to 15 days still doesn't really help you in a practical sense.

However, if you have 14 million potential replacements to make over 90 million rows, it seems like a fundamentally different approach might be necessary. What's the max number of words in any sentence?


Update: Adding some example code to benchmark potential solutions:

Adding to your additional sentences using stringi::stri_rand_lipsum() and adding additional replacement pairs using stringi::stri_rand_strings() makes it easier to see the effects of an increasing corpus size and vocabulary size on run-time.

With 1,000 sentences:

  • 1,000 replacement pairs: 3.9 seconds.
  • 10,000 replacement pairs: 36.5 seconds.
  • 100,000 replacement pairs: 365.4 seconds.

I'm not going to try 14 million, but this should help you evaluate if alternative methods will scale.

library(stringi)

ExtraSentenceCount <- 1e3
ExtraVocabCount <- 1e4

Sentences <- c("wiki/anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/individualism to complete wiki/collectivism",
               "strains of anarchism have often been divided into the categories of wiki/social_anarchism and wiki/individualist_anarchism or similar dual classifications",
               "the word is composed from the word wiki/anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e",
               "anarchy from anarchos meaning one without rulers from the wiki/privative prefix wiki/privative_alpha an- i.e",
               "authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/infinitive suffix -izein",
               "the first known use of this word was in 1539",
               stringi::stri_rand_lipsum(ExtraSentenceCount))

vocabword <- c("anarchist_schools_of_thought", "individualism","collectivism" , "1965-66_nhl_season_by_team","social_anarchism","individualist_anarchism",                
           "anarchy","-ism","privative","privative_alpha", "1310_the_ticket",  "infinitive",
           "a",
           stringi::stri_rand_strings(ExtraVocabCount,
                                      length = sample.int(8, ExtraVocabCount, replace = TRUE),
                                      pattern = "[a-z]"))

token <- c("Anarchist_schools_of_thought" ,"Individualism", "Collectivism",  "1965-66_NHL_season_by_team", "Social_anarchism", "Individualist_anarchism" ,"Anarchy",
           "-ism", "Privative" ,"Alpha_privative", "KTCK_(AM)" ,"Infinitive",
           "XXXX",
           stringi::stri_rand_strings(ExtraVocabCount,
                                      length = 3,
                                      pattern = "[0-9]"))

system.time({
  Cleaned <- stringi::stri_replace_all_regex(Sentences, "\\b"%s+%vocabword%s+%"\\b", token, vectorize_all = FALSE)
})

#   user  system elapsed 
# 36.652   0.070  36.768 

head(Cleaned)

# [1] "wiki/Anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/Individualism 749 complete wiki/Collectivism"                
# [2] "strains 454 anarchism have often been divided into the categories 454 wiki/Social_anarchism and wiki/Individualist_anarchism 094 similar dual classifications"
# [3] "the word 412 composed from the word wiki/Anarchy and the suffix wiki/-ism themselves derived respectively from the greek 190.546"                             
# [4] "Anarchy from anarchos meaning one without rulers from the wiki/Privative prefix wiki/Alpha_privative 358- 190.546"                                            
# [5] "authority sovereignty realm magistracy and the suffix 094 -ismos -isma from the verbal wiki/Infinitive suffix -izein"                                         
# [6] "the first known use 454 this word was 201 1539" 

Update 2: The method below doesn't account for the possibility that you have tags that are sub-strings of another-- i.e. wiki/Individualist and wiki/Individualist_anarchism could give you erroneous results. The only way that I really know to avoid that is using regex/word replacement of full words preceded and followed by word-boundaries (\\b), which can't be based on a fixed string.

One option that might give you hope relies on the fact that you essentially have all of your desired replacements marked with the prefix wiki/. If this is the case for your actual use, then we can take advantage of that and use fixed replacements instead of regex replacement of full words preceded and followed by word-boundaries (\\b). (this is necessary to avoid a vocab word like "ism" being replaced when it occurs as part of a longer word)

Using the same list as above:

prefixed_vocabword <- paste0("wiki/",vocabword)
prefixed_token <- paste0("wiki/",token)

system.time({
  Cleaned <- stringi::stri_replace_all_fixed(Sentences, prefixed_vocabword, prefixed_token, vectorize_all = FALSE)
})

That cuts run-time down to 10.4 seconds with 1,000 sentences and 10,000 replacements, but since the run-time is still grows linearly this would still take hours for your data size.

Upvotes: 0

Adam Sampson
Adam Sampson

Reputation: 2021

Because each of your terms begins with "wiki/" it is possible to re-arrange your dataset to make it MUCH easier to create matches. The method I'm putting forward is to move each "wiki/term" to its own row of the dataframe, use a join to match up words which is efficient, and then reverse the steps to put the strings back together the way they were but with the new terms in them.

library(tidyverse)
df1a <- df1 %>%
  # Create a separator character to identify where to split
  mutate(datalist = str_replace_all(datalist,"wiki/","|wiki/")) %>% 
  mutate(datalist = str_remove(datalist,"^\\|"))

  # Split so that each instance gets its own column
df1a <- 
  str_split(df1a$datalist,"\\|",simplify = TRUE) %>% 
  as.tibble() %>% 
  # Add a rownum column to keep track where to put back together for later
  mutate(rownum = 1:n()) %>% 
  # Gather the dataframe into a tidy form to prepare for joining
  gather("instance","text",-rownum,na.rm = TRUE) %>% 
  # Create a column for joining to the data lookup table
  mutate(keyword = text %>% str_extract("wiki/[^ ]+") %>% str_remove("wiki/")) %>% 
  # Join the keywords efficiently using left_bind
  left_join(df2,by = c("keyword" = "vocabword")) %>% 
  # Put the results back into the text string
  mutate(text = str_replace(text,"wiki/[^ ]+",paste0("wiki/",token))) %>%
  select(-token,-keyword) %>% 
  # Spread the data back out to the original number of rows
  spread(instance,text) %>% 
  # Re-combine the sentences/strings to their original form
  unite("datalist",starts_with("V"),sep="") %>%
  select("datalist")

Results:

# A tibble: 6 x 1
  datalist                                                                                                 
  <chr>                                                                                                    
1 wiki/Anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/Individ~
2 strains of anarchism have often been divided into the categories of wiki/Social_anarchism and wiki/Indiv~
3 the word is composed from the word wiki/Anarchy and the suffix wiki/-ism themselves derived respectively~
4 anarchy from anarchos meaning one without rulers from the wiki/Privative prefix wiki/Alpha_privative an-~
5 authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/Infinitive su~
6 the first known use of this word was in 1539    

Upvotes: 0

divibisan
divibisan

Reputation: 12165

This question has a solution that seems to work well with your data: R: replacing multiple regex with sub

install.packages('qdap')
qdap::mgsub(df2[,1], df2[,2], df1[,1])

[1] "wiki/Anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/Individualism to complete wiki/Collectivism"              
[2] "strains of anarchism have often been divided into the categories of wiki/Social_anarchism and wiki/Individualist_anarchism or similar dual classifications"
[3] "the word is composed from the word wiki/Anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e"                               
[4] "Anarchy from anarchos meaning one without rulers from the wiki/Privative prefix wiki/Alpha_Privative an- i.e"                                              
[5] "authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/Infinitive suffix -izein"                                       
[6] "the first known use of this word was in 1539"            

Upvotes: 0

acylam
acylam

Reputation: 18681

You can do this with str_replace_all from stringr:

library(stringr)

str_replace_all(df1$datalist, setNames(df2$vocabword, df2$token))

Basically, str_replace_all allows you to supply a named vector with original strings being the names and the replacement being the elements of the vector. You did all the hard work by creating a "dictionary" of strings and replacements. str_replace_all simply took that and do the replacement automatically.

Result:

[1] "wiki/Anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/Individualism to complete wiki/Collectivism"              
[2] "strains of anarchism have often been divided into the categories of wiki/Social_anarchism and wiki/Individualist_anarchism or similar dual classifications"
[3] "the word is composed from the word wiki/Anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e"                               
[4] "Anarchy from anarchos meaning one without rulers from the wiki/Privative prefix wiki/Privative_alpha an- i.e"                                              
[5] "authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/Infinitive suffix -izein"                                       
[6] "the first known use of this word was in 1539"

Upvotes: 2

Related Questions