Reputation: 68
I'm relatively new to R and am trying to extract some strings from text (which is a column in a dataframe) and store them together with their names (which is another column of my dataframe) based on the conditions below:
A simplified example of what I'm trying to do is as follows:
textdf <- data.frame(names = letters[1:4], text = c("I'm trying to extract flowers from text",
"there are certain conditions on how to extract",
"this red rose is also nice-smelling",
"scarlet rose is also fine"))
extractdf <- data.frame(extractions = c("extract", "certain", "certain conditions",
"nice-smelling rose", "red rose"),
synonyms = c(NA, NA, NA, NA, "scarlet rose"))
I want to:
look in the "extractions" column and extract every entry that appears in the "text" column of my df;
if an entry has no match, say there is no match for "red rose", look for its synonym instead, which in this case is "scarlet rose".
So what I need is this
#result
textdf <- data.frame(names = letters[1:4], text = c("I'm trying to extract flowers from text",
"there are certain conditions on how to extract",
"this red rose is also nice-smelling",
"scarlet rose is also fine"),
ex = c("extract", "certain conditions, extract", "nice-smelling rose, red rose", "scarlet rose"))
I've tried:
##for the first item
library(rebus)
library(stringi)
sapply(textdf$text, function(x) stri_extract_all_regex(x, or1(extractdf$extractions)))
This finds "certain" but not "certain conditions".
##for the second and fourth item
library(stringdist)
Match_Idx = amatch(textdf$text, extractdf$extractions, method = 'lcs', maxDist = Inf)
Matches = data.frame(textdf$text, extractdf$extractions[Match_Idx])
This is nice because it extracts both "certain conditions" and "nice-smelling rose", but the problem is this: what if I have both "certain conditions" and "nice-smelling rose" in the same text? How can I make it find both?
I have no idea what to do for the third one... Do I have to tokenize both the text and the extractions, find the unique first words, and then extract the longest match?
I would appreciate help with any of the items, or with how to combine them all in a custom function so that I finally get everything I've extracted together.
Upvotes: 1
Views: 190
Reputation: 73692
You could work with regular expressions which you put into a vector,
rex <- c("(extract)", "((?>(?>red)|(?>scarlet))\\srose)",
"(\\bcertain\\sconditions\\b)",
"((?>rose).*(?>nice-smelling)|(?>nice-smelling).*(?>rose))")
create a matching function
fun <- function(x, y) regmatches(x, regexpr(y, x, perl=TRUE))
and apply it with outer.
M <- outer(textdf$text, rex, Vectorize(fun))
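To make the next step easier to follow, here is a small sketch of what fun returns and what shape M has. It rebuilds textdf and rex from above so that it runs on its own; the names hit and miss are only illustrative, and stringsAsFactors = FALSE is added as an assumption so the result does not depend on the R version. A pattern that does not match yields character(0), which is why outer() ends up with a list matrix rather than a plain character matrix.

```r
textdf <- data.frame(names = letters[1:4],
                     text = c("I'm trying to extract flowers from text",
                              "there are certain conditions on how to extract",
                              "this red rose is also nice-smelling",
                              "scarlet rose is also fine"),
                     stringsAsFactors = FALSE)
rex <- c("(extract)", "((?>(?>red)|(?>scarlet))\\srose)",
         "(\\bcertain\\sconditions\\b)",
         "((?>rose).*(?>nice-smelling)|(?>nice-smelling).*(?>rose))")
fun <- function(x, y) regmatches(x, regexpr(y, x, perl = TRUE))

hit  <- fun(textdf$text[3], rex[2])  # first match of the pattern in the text
miss <- fun(textdf$text[1], rex[2])  # no match: zero-length character vector

# One row per text, one column per pattern; each cell holds the match
# (or character(0)), so M is a list with a dim attribute.
M <- outer(textdf$text, rex, Vectorize(fun))
dim(M)
is.list(M)
M[[3, 2]]
```

Because empty cells are character(0), unlist() drops them silently when the rows are collapsed later.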
Now we should clean up the matrix a little; how exactly depends on your data, e.g.
M[grep("((?>rose).*(?>nice-smelling)|(?>nice-smelling).*(?>rose))",
M, perl=TRUE)] <- "nice-smelling rose"
Finally collapse the resulting matrix and add the new vector to your data frame.
textdf$ex <- apply(M, 1, function(x) toString(unlist(x)))
Gives
textdf
#   names                                           text                           ex
# 1     a        I'm trying to extract flowers from text                      extract
# 2     b there are certain conditions on how to extract  extract, certain conditions
# 3     c            this red rose is also nice-smelling red rose, nice-smelling rose
# 4     d                      scarlet rose is also fine                 scarlet rose
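If hand-writing one regex per term does not scale, the same idea can be sketched directly from extractdf. This is not the approach above but a word-based alternative under stated assumptions: a term counts as found when all of its whitespace-separated words occur in the text (so "nice-smelling rose" matches "red rose is also nice-smelling"), the synonym is tried only when the term itself is not found, and a hit is dropped when a longer hit contains all of its words (so "certain" gives way to "certain conditions"). The helper names words_of, has_all and extract_terms are made up for this example.

```r
textdf <- data.frame(names = letters[1:4],
                     text = c("I'm trying to extract flowers from text",
                              "there are certain conditions on how to extract",
                              "this red rose is also nice-smelling",
                              "scarlet rose is also fine"),
                     stringsAsFactors = FALSE)
extractdf <- data.frame(extractions = c("extract", "certain", "certain conditions",
                                        "nice-smelling rose", "red rose"),
                        synonyms = c(NA, NA, NA, NA, "scarlet rose"),
                        stringsAsFactors = FALSE)

# Split a string into lower-case words on whitespace (hyphenated words stay whole).
words_of <- function(s) strsplit(tolower(s), "\\s+")[[1]]

# TRUE if every word of `term` occurs among `text_words`.
has_all <- function(term, text_words) all(words_of(term) %in% text_words)

extract_terms <- function(text, terms, synonyms) {
  tw <- words_of(text)
  hits <- character(0)
  for (i in seq_along(terms)) {
    if (has_all(terms[i], tw)) {
      hits <- c(hits, terms[i])
    } else if (!is.na(synonyms[i]) && has_all(synonyms[i], tw)) {
      hits <- c(hits, synonyms[i])  # term not found: fall back to its synonym
    }
  }
  # Keep a hit only if no longer hit contains all of its words.
  keep <- vapply(seq_along(hits), function(i) {
    wi <- words_of(hits[i])
    !any(vapply(seq_along(hits), function(j) {
      wj <- words_of(hits[j])
      j != i && length(wj) > length(wi) && all(wi %in% wj)
    }, logical(1)))
  }, logical(1))
  toString(sort(hits[keep]))
}

textdf$ex <- vapply(textdf$text, extract_terms, character(1),
                    terms = extractdf$extractions,
                    synonyms = extractdf$synonyms,
                    USE.NAMES = FALSE)
textdf$ex
```

Note that this splits on whitespace only, so a word followed by punctuation (for example "rose,") would not be recognised without stripping the punctuation first.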
Upvotes: 1