Eva

Reputation: 483

R: How to use grep() to find specific words?

I have a long data frame of terms. Given several specific words, I want to find every single-word entry that is a form of one of those words, whatever its part of speech.

For example:

df <- data.frame(word = c("clean", "grinding liquid cmp", "cleaning", 
                          "cleaning composition", "supplying", "supply", 
                          "supplying cmp abrasive", "chemical mechanical"))

  word
1 clean
2 grinding liquid cmp
3 cleaning
4 cleaning composition
5 supplying
6 supply
7 supplying cmp abrasive
8 chemical mechanical

I want to extract the single-word entries built on "clean" and "supply", regardless of their part of speech (POS). I have tried to do this with grepl():

library(dplyr)    # for %>%

specific_word <- c("clean", "supply")

grep_onto <- df[grepl(paste(specific_word, collapse = "|"), df$word), ] %>%
    data.frame(word = ., row.names = NULL) %>%
    unique()

But the result also includes the multi-word terms, which I don't want:

  word
1 clean
2 cleaning
3 cleaning composition
4 supplying
5 supply
6 supplying cmp abrasive

I would like to get:

  word
1 clean
2 cleaning
3 supplying
4 supply

I suspect a regular expression can solve my problem, but I don't know how to write it. Can anyone give me some advice?

Upvotes: 0

Views: 5565

Answers (1)

alistaire

Reputation: 43334

There are various ways to do this, but generally, if you want the match to be a single word and you're using regex, you need to anchor the pattern to the beginning (^) and end ($) of the string so as to limit what can come before or after it. You seem to want the word to be able to extend with more letters, so add \\w* on either side of it to allow that:

df <- data.frame(word = c("clean", "grinding liquid cmp", "cleaning", 
                          "cleaning composition", "supplying", "supply", 
                          "supplying cmp abrasive", "chemical mechanical"))

specific_word <- c("clean", "supply")
pattern <- paste0('^\\w*', specific_word, '\\w*$', collapse = '|')

pattern
#> [1] "^\\w*clean\\w*$|^\\w*supply\\w*$"

df[grep(pattern, df$word), , drop = FALSE]    # drop = FALSE to stop simplification to vector
#>        word
#> 1     clean
#> 3  cleaning
#> 5 supplying
#> 6    supply
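
To see what the anchors change, here is a small base R sketch that compares the unanchored pattern from the question with the anchored one. The names `terms`, `loose`, `strict`, `loose_hits` and `strict_hits` are only illustrative:

```r
# Same terms as in the example above, as a plain character vector.
terms <- c("clean", "grinding liquid cmp", "cleaning",
           "cleaning composition", "supplying", "supply",
           "supplying cmp abrasive", "chemical mechanical")
specific_word <- c("clean", "supply")

# Unanchored: matches any term that contains one of the words anywhere.
loose <- paste(specific_word, collapse = "|")

# Anchored: the whole term must be one unbroken run of word characters.
strict <- paste0("^\\w*", specific_word, "\\w*$", collapse = "|")

loose_hits  <- terms[grepl(loose, terms)]
strict_hits <- terms[grepl(strict, terms)]

# Terms that only the unanchored pattern lets through.
setdiff(loose_hits, strict_hits)
#> [1] "cleaning composition"   "supplying cmp abrasive"
```

Note that the leading \\w* also admits prefixed forms such as "unclean". If only suffixed forms are wanted, dropping it (so the pattern starts with ^clean) would be one way to tighten the match.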

Another interpretation of what you're looking for is to split each term into individual words, and search any of those for a match. tidyr::separate_rows can be used for such a split, which you can then filter with grepl:

library(tidyverse)

df <- tibble(word = c("clean", "grinding liquid cmp", "cleaning", 
                      "cleaning composition", "supplying", "supply", 
                      "supplying cmp abrasive", "chemical mechanical"))    # tibble() replaces the deprecated data_frame()

specific_word <- c("clean", "supply")

df %>% separate_rows(word) %>%
    filter(grepl(paste(specific_word, collapse = '|'), word)) %>% 
    distinct()
#> # A tibble: 4 x 1
#>        word
#>       <chr>
#> 1     clean
#> 2  cleaning
#> 3 supplying
#> 4    supply
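
If you would rather avoid the tidyverse dependency, the same split-then-filter idea can be sketched in base R. This is a rough equivalent that assumes the terms are separated by whitespace only; `terms`, `tokens` and `hits` are illustrative names:

```r
terms <- c("clean", "grinding liquid cmp", "cleaning",
           "cleaning composition", "supplying", "supply",
           "supplying cmp abrasive", "chemical mechanical")
specific_word <- c("clean", "supply")

# Split every term on whitespace and flatten into one vector of tokens.
tokens <- unlist(strsplit(terms, "\\s+"))

# Keep tokens containing any target word, then drop duplicates.
hits <- unique(tokens[grepl(paste(specific_word, collapse = "|"), tokens)])
hits
#> [1] "clean"     "cleaning"  "supplying" "supply"
```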

For more robust word tokenization, try tidytext::unnest_tokens or another dedicated word tokenizer.
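
As a rough sketch of the tokenizer route, assuming the tidytext package is installed: unnest_tokens() splits the `word` column into one lowercased token per row, stored here in a new column that I have arbitrarily called `token`.

```r
library(dplyr)
library(tidytext)

df <- tibble(word = c("clean", "grinding liquid cmp", "cleaning",
                      "cleaning composition", "supplying", "supply",
                      "supplying cmp abrasive", "chemical mechanical"))

specific_word <- c("clean", "supply")

# One row per token; the original `word` column is dropped by default.
result <- df %>%
    unnest_tokens(token, word) %>%
    filter(grepl(paste(specific_word, collapse = "|"), token)) %>%
    distinct()

result$token
#> [1] "clean"     "cleaning"  "supplying" "supply"
```

Unlike a plain whitespace split, the default tokenizer also strips punctuation and lowercases the text, which may or may not be what you want for your data.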

Upvotes: 2
