Reputation: 4940

Frequency of each word in a set of strings

I have a column in a dataframe, where each row is a string. I would like to get the frequency of each word in this column.

I have tried:

prov <- df$column_x %>%
    na.omit() %>%
    tolower() %>%
    gsub("[,;?']", " ",.)

sort(table(prov), decreasing = TRUE)

This way, I get the number of times each whole string is repeated.

How could I get the number of times each word is repeated?

Upvotes: 0

Answers (3)

nghauran

Reputation: 6768

Pipes do the job: split each string into words, flatten the result into one vector, and tabulate it.

df <- data.frame(column_x = c("hello world", "hello morning hello", 
                              "bye bye world"), stringsAsFactors = FALSE)
library(dplyr) # provides the %>% pipe
df$column_x %>%
  na.omit() %>%
  tolower() %>%
  strsplit(split = " ") %>% # or strsplit(split = "\\W") 
  unlist() %>%
  table() %>%
  sort(decreasing = TRUE)
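
If you would rather not load dplyr just for the pipe, the same steps can be sketched in base R. This is only a sketch under a few assumptions: clean_and_count is a made-up helper name, the punctuation class is the one from the question's gsub() call, and the sample strings are invented for illustration.

```r
# clean_and_count() is a hypothetical helper, not part of any package.
clean_and_count <- function(strings) {
  strings <- tolower(strings[!is.na(strings)])        # drop NA rows, normalise case
  strings <- gsub("[,;?']", " ", strings)             # same character class as in the question
  words <- unlist(strsplit(trimws(strings), "\\s+"))  # split on runs of whitespace
  words <- words[nzchar(words)]                       # drop any empty tokens
  sort(table(words), decreasing = TRUE)
}

# Invented sample data, including punctuation and a missing value
sample_strings <- c("Hello, world?", NA, "hello; morning hello", "bye bye world")
freq <- clean_and_count(sample_strings)
freq
```

Splitting on "\\s+" rather than a single space matters here, because replacing punctuation with spaces can leave several spaces in a row.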

Upvotes: 1

Glaud

Reputation: 733

You can collapse your column into one string, split that string into words with the regular expression \\W (any non-word character), and then count each word's frequency with the table function.

library(dplyr)
x <- c("First part of some text,", "another part of text,", NA, "last part of text.")
x <- x %>% na.omit() %>% tolower()
xx <- paste(x, collapse = " ")
xxx <- unlist(strsplit(xx, "\\W"))
table(xxx)
xxx
        another   first    last      of    part    some    text 
      2       1       1       1       3       3       1       3
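
The unnamed entry with a count of 2 in the output above is the empty string: "\\W" splits on every single non-word character, so a comma followed by a space leaves an empty token between them. A possible variation, assuming you do not want empty tokens counted, is to split on runs of non-word characters with "\\W+" and drop anything empty (the variable names words and counts are just for this sketch):

```r
x <- c("First part of some text,", "another part of text,", NA, "last part of text.")
x <- tolower(x[!is.na(x)])                 # drop NA, normalise case

# "\\W+" treats ", " as one separator instead of two
words <- unlist(strsplit(paste(x, collapse = " "), "\\W+"))
words <- words[nzchar(words)]              # guard against a leading separator

counts <- table(words)
counts
```

The word counts themselves stay the same as above; only the empty entry goes away.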

Upvotes: 1

Z.Lin

Reputation: 29065

Sounds like you want a document-term matrix...

library(tm)

corp <- Corpus(VectorSource(df$x)) # convert column of strings into a corpus
dtm <- DocumentTermMatrix(corp)    # create document term matrix

> as.matrix(dtm)
    Terms
Docs hello world morning bye
   1     1     1       0   0
   2     2     0       1   0
   3     0     1       0   2

If you wish to join it to the original data frame, you can do so as well:

cbind(df, data.frame(as.matrix(dtm)))

                    x hello world morning bye
1         hello world     1     1       0   0
2 hello morning hello     2     0       1   0
3       bye bye world     0     1       0   2
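
The matrix above holds per-document counts, while the question asks for overall frequencies; those are just the column sums. With tm, colSums(as.matrix(dtm)) should give them. To show the idea without depending on tm, here is a rough base R sketch that builds the same kind of matrix by hand (dtm_base, vocab and total are names made up for this example, and the tokenizer is a plain "\\W+" split rather than tm's own):

```r
docs <- c("hello world", "hello morning hello", "bye bye world")

tokens <- strsplit(tolower(docs), "\\W+")  # one character vector per document
vocab <- unique(unlist(tokens))            # terms in order of first appearance

# Count every vocabulary term in each document: one row per document
dtm_base <- t(vapply(
  tokens,
  function(words) as.vector(table(factor(words, levels = vocab))),
  integer(length(vocab))
))
dimnames(dtm_base) <- list(Docs = seq_along(docs), Terms = vocab)

total <- colSums(dtm_base)                 # overall frequency of each word
total
```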

Sample data used:

df <- data.frame(
  x = c("hello world", 
        "hello morning hello", 
        "bye bye world"),
  stringsAsFactors = FALSE
)

> df
                    x
1         hello world
2 hello morning hello
3       bye bye world

Upvotes: 1
