Reputation: 25
I have a data frame df
which contains a column named strings
. The values in this column are some sentences.
For example:
id strings
1 "I want to go to school, how about you?"
2 "I like you."
3 "I like you so much"
4 "I like you very much"
5 "I don't like you"
Now, I have a list of stop word,
["I", "don't" "you"]
How can I make another data frame which stores the total number of occurrence of each unique word (except stop word)in the column of previous data frame.
keyword frequency
want 1
to 2
go 1
school 1
how 1
about 1
like 4
so 1
very 1
much 2
My idea is that:
But this seems really inefficient and I don't know how to really code this.
Upvotes: 0
Views: 2541
Reputation: 1753
Assuming you have a mystring
object and a vector of stopWords
, you can do it like this:
# split text into words vector
wordvector = strsplit(mystring, " ")[[1]]
# remove stopwords from the vector
vector = vector[!vector %in% stopWords]
At this point you can turn a frequency table()
into a dataframe
object:
frequency_df = data.frame(table(words))
Let me know if this can help you.
Upvotes: 0
Reputation: 897
One way is using tidytext
. Here a book and the code
library("tidytext")
library("tidyverse")
#> df <- data.frame( id = 1:6, strings = c("I want to go to school", "how about you?",
#> "I like you.", "I like you so much", "I like you very much", "I don't like you"))
df %>%
mutate(strings = as.character(strings)) %>%
unnest_tokens(word, string) %>% #this tokenize the strings and extract the words
filter(!word %in% c("I", "i", "don't", "you")) %>%
count(word)
#> # A tibble: 11 x 2
#> word n
#> <chr> <int>
#> 1 about 1
#> 2 go 1
#> 3 how 1
#> 4 like 4
#> 5 much 2
EDIT
All the tokens are transformed to lower case, so you either include i
in the stop_words or add the argument lower_case = FALSE
to unnest_tokens
Upvotes: 1
Reputation: 2229
At first, you can create a vector of all words through str_split
and then create a frequency table of the words.
library(stringr)
stop_words <- c("I", "don't", "you")
# create a vector of all words in your df
all_words <- unlist(str_split(df$strings, pattern = " "))
# create a frequency table
word_list <- as.data.frame(table(all_words))
# omit all stop words from the frequency table
word_list[!word_list$all_words %in% stop_words, ]
Upvotes: 1