Reputation: 483
I have a long data frame of words. I want to use several specific words to find all of their part-of-speech variants.
For example:
df <- data.frame(word = c("clean", "grinding liquid cmp", "cleaning",
"cleaning composition", "supplying", "supply",
"supplying cmp abrasive", "chemical mechanical"))
word
1 clean
2 grinding liquid cmp
3 cleaning
4 cleaning composition
5 supplying
6 supply
7 supplying cmp abrasive
8 chemical mechanical
I want to extract the single-word entries for "clean" and "supply" in their different parts of speech. I have tried to do this with the grepl() function:
library(magrittr) # provides %>%
specific_word <- c("clean", "supply")
grep_onto <- df[grepl(paste(specific_word, collapse = "|"), df$word), ] %>%
data.frame(word = ., row.names = NULL) %>%
unique()
But the result is not what I want:
word
1 cleans
2 grinding liquid cmp
3 cleaning
4 cleaning composition
5 supplying
6 supply
7 supplying cmp abrasive
8 chemical mechanical
What I would like to get is:
word
1 clean
2 cleaning
3 supplying
4 supply
I suspect a regular expression can solve my problem, but I don't know how to write it. Can anyone give me some advice?
Upvotes: 0
Views: 5565
Reputation: 43334
There are various ways to do this, but generally, if you want the match to be a single word and you're using regex, you need to anchor the pattern to the beginning (^) and end ($) of the string so as to limit what can come before or after it. You seem to want the word to be able to extend with more letters, so add \\w* on either side to allow that:
df <- data.frame(word = c("clean", "grinding liquid cmp", "cleaning",
"cleaning composition", "supplying", "supply",
"supplying cmp abrasive", "chemical mechanical"))
specific_word <- c("clean", "supply")
pattern <- paste0('^\\w*', specific_word, '\\w*$', collapse = '|')
pattern
#> [1] "^\\w*clean\\w*$|^\\w*supply\\w*$"
df[grep(pattern, df$word), , drop = FALSE] # drop = FALSE to stop simplification to vector
#> word
#> 1 clean
#> 3 cleaning
#> 5 supplying
#> 6 supply
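To see what the anchors contribute, here is a minimal base R sketch that compares the unanchored pattern from the question with the anchored one above. The names words, loose and strict are only illustrative and do not appear in the original post.

```r
# Same terms as in the question, held in a plain character vector
words <- c("clean", "grinding liquid cmp", "cleaning",
           "cleaning composition", "supplying", "supply",
           "supplying cmp abrasive", "chemical mechanical")
specific_word <- c("clean", "supply")

# Unanchored: the pattern may match anywhere inside the string,
# so multi-word terms that merely contain a target word are kept
unanchored <- paste(specific_word, collapse = "|")
loose <- words[grepl(unanchored, words)]

# Anchored: the whole string must be a single run of word characters
# that contains one of the target words
anchored <- paste0("^\\w*", specific_word, "\\w*$", collapse = "|")
strict <- words[grepl(anchored, words)]

print(loose)
print(strict)
```

Because a space is not matched by \\w, any term with more than one word fails the anchored pattern. Note that the leading \\w* also accepts prefixes (a hypothetical entry such as "unclean" would match), so drop it if only suffixes should be allowed.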
Another interpretation of what you're looking for is to split each term into individual words and search each of those for a match. tidyr::separate_rows can be used for the split, after which you can filter the result with grepl:
library(tidyverse)
df <- tibble(word = c("clean", "grinding liquid cmp", "cleaning", # tibble() replaces the deprecated data_frame()
"cleaning composition", "supplying", "supply",
"supplying cmp abrasive", "chemical mechanical"))
specific_word <- c("clean", "supply")
df %>% separate_rows(word) %>%
filter(grepl(paste(specific_word, collapse = '|'), word)) %>%
distinct()
#> # A tibble: 4 x 1
#> word
#> <chr>
#> 1 clean
#> 2 cleaning
#> 3 supplying
#> 4 supply
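If you would rather avoid the tidyverse dependency, the same split-then-filter idea can be sketched in base R. This is only an approximation of the pipeline above: it assumes whitespace is the sole separator, and the names terms, tokens and hits are illustrative.

```r
terms <- c("clean", "grinding liquid cmp", "cleaning",
           "cleaning composition", "supplying", "supply",
           "supplying cmp abrasive", "chemical mechanical")
specific_word <- c("clean", "supply")

# Split every term on runs of whitespace and flatten into one vector of words
tokens <- unlist(strsplit(terms, "\\s+"))

# Keep the words that contain any target word, dropping duplicates
hits <- unique(tokens[grepl(paste(specific_word, collapse = "|"), tokens)])

result <- data.frame(word = hits, stringsAsFactors = FALSE)
print(result)
```

unique() keeps the first occurrence of each word, so the rows come out in the order the words first appear in the data.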
For more robust word tokenization, try tidytext::unnest_tokens or another dedicated word tokenizer.
Upvotes: 2