Reputation: 4940
I have a column in a dataframe, where each row is a string. I would like to get the frequency of each word in this column.
I have tried:
prov <- df$column_x %>%
na.omit() %>%
tolower() %>%
gsub("[,;?']", " ", .)
sort(table(prov), decreasing = TRUE)
This gives me the number of times each whole string is repeated.
How can I get the number of times each word is repeated?
Upvotes: 0
Views: 1149
Reputation: 6768
Pipes do the job: split each string into words with strsplit(), flatten the result with unlist(), then tabulate.
df <- data.frame(column_x = c("hello world", "hello morning hello",
"bye bye world"), stringsAsFactors = FALSE)
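library(dplyr) # provides the %>% pipe
df$column_x %>%
na.omit() %>% # drop missing values
tolower() %>% # make the count case-insensitive
strsplit(split = " ") %>% # split into words; or strsplit(split = "\\W")
unlist() %>% # flatten the list into one character vector
table() %>% # count occurrences of each word
sort(decreasing = TRUE) # most frequent first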
Upvotes: 1
Reputation: 733
You can collapse your column into one string, split it into words with the regular expression \\W
(any non-word character), and count the frequency of each word with the table
function.
library(dplyr)
x <- c("First part of some text,", "another part of text,", NA, "last part of text.")
x <- x %>% na.omit() %>% tolower()
xx <- paste(x, collapse = " ")
xxx <- unlist(strsplit(xx, "\\W"))
table(xxx)
xxx
        another   first    last      of    part    some    text
      2       1       1       1       3       3       1       3
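Note that the output above has eight counts but only seven visible labels: the leading count of 2 belongs to the empty string, which appears because ", " is two consecutive non-word characters and \\W splits on each one separately. A minimal sketch of one way to avoid this, assuming it is acceptable to treat a run of non-word characters as a single separator (the names text, words and counts are just illustrative):

```r
x <- c("First part of some text,", "another part of text,", NA, "last part of text.")

# Collapse the non-missing strings into one lowercase string
text <- paste(tolower(x[!is.na(x)]), collapse = " ")

# "\\W+" matches one or more non-word characters, so ", " is one separator
words <- unlist(strsplit(text, "\\W+"))

# Guard against an empty first token if the text starts with a separator
words <- words[words != ""]

counts <- sort(table(words), decreasing = TRUE)
print(counts)
```

With this change the table should contain only the seven actual words.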
Upvotes: 1
Reputation: 29065
Sounds like you want a document-term matrix...
library(tm)
corp <- Corpus(VectorSource(df$x)) # convert column of strings into a corpus
dtm <- DocumentTermMatrix(corp) # create document term matrix
> as.matrix(dtm)
    Terms
Docs hello world morning bye
   1     1     1       0   0
   2     2     0       1   0
   3     0     1       0   2
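The matrix above holds counts per row of the data frame. If what you need is the frequency of each word over the whole column, summing the matrix columns should give it. The sketch below rebuilds the same matrix by hand (the name m is illustrative) so that it runs without tm; with the real object the equivalent call would be colSums(as.matrix(dtm)).

```r
# Hand-built copy of the document-term matrix shown above
m <- matrix(
  c(1, 1, 0, 0,
    2, 0, 1, 0,
    0, 1, 0, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(Docs = 1:3, Terms = c("hello", "world", "morning", "bye"))
)

# Sum each term's column to get its total count across all rows
totals <- sort(colSums(m), decreasing = TRUE)
print(totals)
```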
If you wish to join it to the original data frame, you can do so as well:
cbind(df, data.frame(as.matrix(dtm)))
                    x hello world morning bye
1         hello world     1     1       0   0
2 hello morning hello     2     0       1   0
3       bye bye world     0     1       0   2
Sample data used:
df <- data.frame(
x = c("hello world",
"hello morning hello",
"bye bye world"),
stringsAsFactors = FALSE
)
> df
                    x
1         hello world
2 hello morning hello
3       bye bye world
Upvotes: 1