Luther_Proton

Reputation: 358

Most frequent phrases from text data in R

Does anyone here have experience identifying the most common phrases (3 to 7 consecutive words)? I understand that most frequency analysis focuses on the most frequent/common words (along with plotting a WordCloud) rather than phrases.

# Assuming a particular column in a data frame (df) with n rows that is all text data;
# I can't provide sample data, as using dput() on a large text file isn't feasible here.

library(tm)  # Corpus() and VectorSource() come from the tm package

Text <- df$Text_Column
docs <- Corpus(VectorSource(Text))
...

Thanks in advance!

Upvotes: 2

Views: 777

Answers (1)

JBGruber

Reputation: 12478

You have several options to do this in R. Let's grab some data first. I use the books by Jane Austen from the janeaustenr package and do some cleaning so that each paragraph ends up in a separate row:

library(janeaustenr)
library(tidyverse)
books <- austen_books() %>% 
  mutate(paragraph = cumsum(text == "" & lag(text) != "")) %>% 
  group_by(paragraph) %>% 
  summarise(book = head(book, 1),
            text = trimws(paste(text, collapse = " ")),
            .groups = "drop")

With tidytext:

library(tidytext)
# tidytext does not directly support multiple values for n, so map over 3:7
map_df(3L:7L, ~unnest_tokens(books, ngram, text, token = "ngrams", n = .x)) %>%
  count(ngram) %>%
  filter(!is.na(ngram)) %>% 
  slice_max(n, n = 10)
#> # A tibble: 10 × 2
#>    ngram               n
#>    <chr>           <int>
#>  1 i am sure         415
#>  2 i do not          412
#>  3 she could not     328
#>  4 it would be       258
#>  5 in the world      247
#>  6 as soon as        236
#>  7 a great deal      214
#>  8 would have been   211
#>  9 she had been      203
#> 10 it was a          202
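
Since the question also mentions plotting, the top phrases are easy to visualize. A minimal sketch, assuming the result of the tidytext pipeline above is saved in a (hypothetical) object called top_ngrams:

library(ggplot2)
# top_ngrams: the (hypothetical) saved result of the tidytext pipeline above
# order the phrases by frequency and plot them as a horizontal bar chart
ggplot(top_ngrams, aes(x = n, y = reorder(ngram, n))) +
  geom_col() +
  labs(x = "count", y = NULL)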

With quanteda:

library(quanteda)
books %>% 
  corpus(docid_field = "paragraph",
         text_field = "text") %>% 
  tokens(remove_punct = TRUE,
         remove_symbols = TRUE) %>% 
  tokens_ngrams(n = 3L:7L) %>%
  dfm() %>% 
  topfeatures(n = 10) %>% 
  enframe()
#> # A tibble: 10 × 2
#>    name            value
#>    <chr>           <dbl>
#>  1 i_am_sure         415
#>  2 i_do_not          412
#>  3 she_could_not     328
#>  4 it_would_be       258
#>  5 in_the_world      247
#>  6 as_soon_as        236
#>  7 a_great_deal      214
#>  8 would_have_been   211
#>  9 she_had_been      203
#> 10 it_was_a          202
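
Note that quanteda joins the words of each n-gram with "_" by default. If you prefer space-separated phrases, tokens_ngrams() takes a concatenator argument; a sketch, assuming toks holds the result of the tokens() call above:

# toks: the (hypothetical) saved result of the tokens() call above
tokens_ngrams(toks, n = 3L:7L, concatenator = " ")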

With text2vec:

library(text2vec)
itoken(books$text, tolower, word_tokenizer) %>% 
  create_vocabulary(ngram = c(3L, 7L), sep_ngram = " ") %>% 
  filter(str_detect(term, "[[:alpha:]]")) %>% # keep terms with at least one alphabetic character
  slice_max(term_count, n = 10)
#> Number of docs: 10293 
#> 0 stopwords:  ... 
#> ngram_min = 3; ngram_max = 7 
#> Vocabulary: 
#>                term term_count doc_count
#>  1:       i am sure        415       384
#>  2:        i do not        412       363
#>  3:   she could not        328       288
#>  4:     it would be        258       233
#>  5:    in the world        247       234
#>  6:      as soon as        236       233
#>  7:    a great deal        214       209
#>  8: would have been        211       192
#>  9:    she had been        203       179
#> 10:        it was a        202       194

Created on 2022-08-03 by the reprex package (v2.0.1)
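
If the n-gram vocabulary gets very large, text2vec can also drop rare phrases before ranking via prune_vocabulary(). A sketch, assuming the create_vocabulary() result above is saved in a (hypothetical) object called vocab, with a cutoff of 50 occurrences chosen purely for illustration:

# vocab: the (hypothetical) saved result of create_vocabulary() above
vocab %>% 
  prune_vocabulary(term_count_min = 50L) %>% # keep n-grams occurring at least 50 times
  slice_max(term_count, n = 10)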

Upvotes: 5
