ayeh

Reputation: 68

How to extract all instances of a vector of strings from text based on conditions

I'm relatively new to R. I'm trying to extract certain strings from text (one column of a data frame) and store them alongside their names (another column of the same data frame), subject to the conditions listed below.

A simplified example of what I'm trying to do is as follows:

textdf <- data.frame(names = letters[1:4], text = c("I'm trying to extract flowers from text", 
                                                "there are certain conditions on how to extract", 
                                                "this red rose is also nice-smelling", 
                                                "scarlet rose is also fine"))

extractdf <- data.frame(extractions = c("extract", "certain", "certain conditions", 
                                        "nice-smelling rose", "red rose"), 
                        synonyms = c(NA, NA, NA, NA, "scarlet rose"))

I want to

  1. Look in the "extractions" column and extract every entry that appears in the "text" column of my df.

  2. If an entry has no match, say there is no match for "red rose", look for its synonym instead, which in this case is "scarlet rose".

  3. For phrases that share the same first word, keep only the longest match. For example, if both "certain" and "certain conditions" match, keep "certain conditions".

  4. If possible, also extract "nice-smelling rose", even though its words appear in a different order in the text.

  5. Finally, store all the extractions in a separate column of df; a named list would also be fine.

So what I need is this:

#result
textdf <- data.frame(names = letters[1:4], text = c("I'm trying to extract flowers from text", 
                                                "there are certain conditions on how to extract", 
                                                "this red rose is also nice-smelling", 
                                                "scarlet rose is also fine"), 
                     ex = c("extract", "certain conditions, extract", "nice-smelling rose, red rose", "scarlet rose"))

I've tried:

##for the first item
library(rebus)
library(stringi)
sapply(textdf$text, function(x) stri_extract_all_regex(x, or1(extractdf$extractions)))

This finds "certain" but not "certain conditions".

##for the second and fourth item
library(stringdist)
Match_Idx = amatch(textdf$text, extractdf$extractions, method = 'lcs', maxDist = Inf)
Matches = data.frame(textdf$text, extractdf$extractions[Match_Idx])

This is nice because it extracts both "certain conditions" and "nice-smelling rose". The problem is this: what if a single text contains both "certain conditions" and "nice-smelling rose"? How can I make it find both?

I have no idea what to do for the third item. Do I have to tokenize both the text and the extractions, find the unique first words, and then extract the longest match?

I would appreciate help with any of the items, or with how to combine them all in a custom function so that I end up with everything I've extracted in one place.

Upvotes: 1

Views: 190

Answers (1)

jay.sf

Reputation: 73692

You could work with regular expressions which you put into a vector,

rex <- c("(extract)", "((?>(?>red)|(?>scarlet))\\srose)", 
         "(\\bcertain\\sconditions\\b)", 
         "((?>rose).*(?>nice-smelling)|(?>nice-smelling).*(?>rose))")

create a matching function

fun <- function(x, y) regmatches(x, regexpr(y, x, perl=TRUE))

and apply it with outer.

M <- outer(textdf$text, rex, Vectorize(fun))
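To see what ends up in each cell of `M`, it may help to call the matching function on single strings. This is only an illustration using the same `fun` and two of the patterns above; the point is that a failed match returns `character(0)` rather than `NA`, which is why `M` is a matrix of list elements and needs `unlist` later.

```r
# Same helper as above: returns the first match, or character(0) if there is none.
fun <- function(x, y) regmatches(x, regexpr(y, x, perl = TRUE))

# A pattern that matches returns the matched substring.
hit <- fun("there are certain conditions on how to extract",
           "(\\bcertain\\sconditions\\b)")
hit
# [1] "certain conditions"

# The alternation picks up the synonym as well.
syn <- fun("scarlet rose is also fine", "((?>(?>red)|(?>scarlet))\\srose)")
syn
# [1] "scarlet rose"

# A pattern that does not match returns a zero-length character vector.
miss <- fun("scarlet rose is also fine", "(extract)")
length(miss)
# [1] 0
```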

Next, clean up the matrix a little; how exactly depends on your data, e.g.

M[grep("((?>rose).*(?>nice-smelling)|(?>nice-smelling).*(?>rose))", 
       M, perl=TRUE)] <- "nice-smelling rose"

Finally collapse the resulting matrix and add the new vector to your data frame.

textdf$ex <- apply(M, 1, function(x) toString(unlist(x)))

Gives

textdf
#   names                                           text                           ex
# 1     a        I'm trying to extract flowers from text                      extract
# 2     b there are certain conditions on how to extract  extract, certain conditions
# 3     c            this red rose is also nice-smelling red rose, nice-smelling rose
# 4     d                      scarlet rose is also fine                 scarlet rose
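If you would rather not write the patterns by hand, a rough sketch of items 1 to 3 driven directly by `extractdf` could look like the following. This is a different technique from the regex vector above: it uses literal (`fixed = TRUE`) substring matching, so "extract" would also match inside "extraction", and it does not cover item 4 (phrases whose words appear in a different order). The helper name `extract_terms` and the column name `ex2` are made up for illustration.

```r
textdf <- data.frame(
  names = letters[1:4],
  text = c("I'm trying to extract flowers from text",
           "there are certain conditions on how to extract",
           "this red rose is also nice-smelling",
           "scarlet rose is also fine"),
  stringsAsFactors = FALSE
)
extractdf <- data.frame(
  extractions = c("extract", "certain", "certain conditions",
                  "nice-smelling rose", "red rose"),
  synonyms = c(NA, NA, NA, NA, "scarlet rose"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns the matched phrases for one text as a single string.
extract_terms <- function(text, extractions, synonyms) {
  has <- function(p) grepl(p, text, fixed = TRUE)  # literal substring test

  # Item 1: phrases that occur in the text.
  found <- vapply(extractions, has, logical(1), USE.NAMES = FALSE)

  # Item 2: fall back to the synonym only when the phrase itself is absent.
  syn_ok <- !found & !is.na(synonyms)
  syn_ok[syn_ok] <- vapply(synonyms[syn_ok], has, logical(1), USE.NAMES = FALSE)

  hits <- c(extractions[found], synonyms[syn_ok])
  if (length(hits) == 0) return("")

  # Item 3: among hits sharing the same first word, keep the longest.
  first_word <- sub(" .*", "", hits)
  longest <- ave(nchar(hits), first_word, FUN = max)
  toString(hits[nchar(hits) == longest])
}

textdf$ex2 <- vapply(textdf$text, extract_terms, character(1),
                     extractions = extractdf$extractions,
                     synonyms = extractdf$synonyms,
                     USE.NAMES = FALSE)
textdf$ex2
# [1] "extract"  "extract, certain conditions"  "red rose"  "scarlet rose"
```

Row 3 only yields "red rose" here, because "nice-smelling rose" never occurs literally in the text; for that case you would still need an order-insensitive pattern like the fourth one in `rex`.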

Upvotes: 1
