Reputation: 13334
I need to remove all non-English words from a data frame that looks like this:
ID text
1 they all went to the store bonkobuns and bought chicken
2 if we believe no exomunch standards are in order then we're ok
3 living among the calipodians seems reasonable
4 given the state of all relimited editions we should be fine
I want to end up with a data frame like this:
ID text
1 they all went to the store and bought chicken
2 if we believe no standards are in order then we're ok
3 living among the seems reasonable
4 given the state of all editions we should be fine
I have a vector containing all English words: word_vec
I can remove all words that are in a vector from a data frame using the tm package:
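library(tm)
# for every row, strip each word in word_vec from the text column
for (k in seq_len(nrow(frame))) {
  for (i in seq_along(word_vec)) {
    frame$text[k] <- removeWords(frame$text[k], word_vec[i])
  }
}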
but I want to do the opposite. I want to 'keep' only the words found in the vector.
Upvotes: 2
Views: 2158
Reputation: 887251
You could try gsub with the unwanted words joined into one regular expression:
word_vec <- paste(c('bonkobuns ', 'exomunch ', 'calipodians ',
'relimited '), collapse="|")
gsub(word_vec, '', df1$text)
#[1] "they all went to the store and bought chicken"
#[2] "if we believe no standards are in order then we're ok"
#[3] "living among the seems reasonable"
#[4] "given the state of all editions we should be fine"
Suppose instead that word_vec already holds the opposite of the vector above, i.e. the words you want to keep, for example:
word_vec <- c("among", "editions", "bought", "seems", "fine",
"state", "in",
"then", "reasonable", "ok", "standards", "store", "order", "should",
"and", "be", "to", "they", "are", "no", "living", "all", "if",
"we're", "went", "of", "given", "the", "chicken", "believe",
"we")
word_vec2 <- paste(gsub('^ +| +$', '', gsub(paste(word_vec,
collapse="|"), '', df1$text)), collapse= ' |')
gsub(word_vec2, '', df1$text)
#[1] "they all went to the store and bought chicken"
#[2] "if we believe no standards are in order then we're ok"
#[3] "living among the seems reasonable"
#[4] "given the state of all editions we should be fine"
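The patterns above match plain substrings, so they rely on the trailing space and could also hit a short word inside a longer one. A minimal sketch of a stricter variant, assuming the unwanted words are known and stored in a hypothetical vector bad_words, and that df1 looks like the data in the question:

```r
# Sample data mirroring the question (df1 is assumed to look like this)
df1 <- data.frame(
  ID = 1:4,
  text = c("they all went to the store bonkobuns and bought chicken",
           "if we believe no exomunch standards are in order then we're ok",
           "living among the calipodians seems reasonable",
           "given the state of all relimited editions we should be fine"),
  stringsAsFactors = FALSE
)

# Hypothetical vector of words to drop
bad_words <- c("bonkobuns", "exomunch", "calipodians", "relimited")

# \\b anchors the alternation at word boundaries, so only whole words match
pattern <- paste0("\\b(", paste(bad_words, collapse = "|"), ")\\b")
cleaned <- gsub(pattern, "", df1$text, perl = TRUE)

# Collapse the doubled spaces left behind and trim both ends
cleaned <- trimws(gsub(" +", " ", cleaned))
cleaned
```

This still needs the list of words to remove; it only makes the matching less dependent on spacing.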
Upvotes: 2
Reputation: 193
All I can think of is the following procedure:
strsplit()
regexpr()
Maybe it's worth considering the function which() if you go down this road:
which(c('a','b','c','d','e') == 'd')
[1] 4
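The split-then-look-up idea can be sketched for a single string as follows. This is only an illustration: the short word_vec here is a hypothetical stand-in for the full English word list, and %in% is used for the lookup instead of regexpr().

```r
# Hypothetical stand-in for the full vector of English words
word_vec <- c("living", "among", "the", "seems", "reasonable")

txt <- "living among the calipodians seems reasonable"

# Split the string into tokens on whitespace
tokens <- strsplit(txt, "\\s+")[[1]]

# which() gives the positions of the tokens that appear in word_vec
keep_idx <- which(tokens %in% word_vec)

# Reassemble the kept tokens in their original order
result <- paste(tokens[keep_idx], collapse = " ")
result
```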
Upvotes: 0
Reputation: 10411
Here's a simple way to do it:
txt <- "Hi this is an example"
words <- c("this", "is", "an", "example")
paste(intersect(strsplit(txt, "\\s")[[1]], words), collapse=" ")
[1] "this is an example"
Of course the devil is in the details, so you might need to tweak things a little to take into account apostrophes and other punctuation marks.
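One such detail: intersect() returns unique values, so a word that occurs several times in a sentence is kept only once. A rough sketch of a variant that preserves repeats and works on a whole column, assuming a hypothetical helper named keep_known_words and a lower-case word list:

```r
# Hypothetical helper: keep only the tokens found in `words`, repeats included
keep_known_words <- function(text, words) {
  vapply(strsplit(text, "\\s+"), function(tokens) {
    paste(tokens[tolower(tokens) %in% words], collapse = " ")
  }, character(1))
}

txt <- "the cat saw the blorp and the dog"
words <- c("the", "cat", "saw", "and", "dog")

# intersect() drops the repeated "the"
via_intersect <- paste(intersect(strsplit(txt, "\\s")[[1]], words), collapse = " ")

# the %in% filter keeps every occurrence
via_filter <- keep_known_words(txt, words)

via_intersect
via_filter
```

Because the helper is vectorised over its first argument, something like frame$text <- keep_known_words(frame$text, word_vec) should cover the whole data frame from the question. Tokens with attached punctuation (for example "ok.") would still need cleaning first.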
Upvotes: 4