RunninigPig

Reputation: 25

R: Count the frequency of every unique character in a column

I have a data frame df which contains a column named strings. The values in this column are some sentences.

For example:

id    strings
1     "I want to go to school, how about you?"
2     "I like you."
3     "I like you so much"
4     "I like you very much"
5     "I don't like you"

Now, I have a list of stop words:

["I", "don't", "you"]

How can I make another data frame that stores the total number of occurrences of each unique word (except stop words) in the strings column of the previous data frame?

keyword      frequency
  want            1
  to              2
  go              1
  school          1
  how             1
  about           1
  like            4
  so              1
  very            1
  much            2

My idea is:

  1. Combine the strings in the column into one big string.
  2. Make a list storing the unique words in the big string.
  3. Make a data frame with one column holding those unique words.
  4. Compute each word's frequency.

But this seems really inefficient, and I don't know how to actually code it.
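For reference, the four steps above can be sketched directly in base R (using the example data frame and stop-word list from the question; the `gsub` call for punctuation is an added assumption, since "school," and "you?" would otherwise count as separate words):

```r
# assumed inputs, matching the example above
df <- data.frame(
  id = 1:5,
  strings = c("I want to go to school, how about you?",
              "I like you.", "I like you so much",
              "I like you very much", "I don't like you"),
  stringsAsFactors = FALSE
)
stop_words <- c("I", "don't", "you")

# 1. combine the column into one big string
big_string <- paste(df$strings, collapse = " ")

# 2. split into words and strip punctuation
words <- strsplit(big_string, " ")[[1]]
words <- gsub("[[:punct:]]", "", words)  # note: this also turns "don't" into "dont"

# drop stop words (both original and punctuation-stripped forms)
words <- words[!words %in% c(stop_words, gsub("[[:punct:]]", "", stop_words))]

# 3. + 4. one column of unique words plus their frequencies
freq <- as.data.frame(table(keyword = words))
names(freq)[2] <- "frequency"
freq
```

This reproduces the desired table (want 1, to 2, go 1, school 1, how 1, about 1, like 4, so 1, very 1, much 2), and `table()` does the counting in one pass, so it is not actually inefficient at this scale.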

Upvotes: 0

Views: 2541

Answers (3)

Leevo

Reputation: 1753

Assuming you have a mystring object and a vector of stopWords, you can do it like this:

# split the text into a vector of words
words <- strsplit(mystring, " ")[[1]]

# remove stop words from the vector
words <- words[!words %in% stopWords]

At this point you can turn a frequency table() into a data frame:

frequency_df <- data.frame(table(words))

Let me know if this helps.
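A self-contained version of this approach, applied to the whole strings column from the question (note that without punctuation handling, "you." and "you?" survive the stop-word filter as separate tokens), might look like:

```r
stopWords <- c("I", "don't", "you")
strings <- c("I want to go to school, how about you?",
             "I like you.", "I like you so much",
             "I like you very much", "I don't like you")

# split every sentence and flatten into one word vector
words <- unlist(strsplit(strings, " "))

# remove stop words from the vector
words <- words[!words %in% stopWords]

# turn the frequency table() into a data frame
frequency_df <- data.frame(table(words))
frequency_df
```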

Upvotes: 0

c1au61o_HH

Reputation: 897

One way is using tidytext. Here is the code:

library("tidytext")
library("tidyverse")

#> df <- data.frame( id = 1:6, strings = c("I want to go to school", "how about you?",
#> "I like you.", "I like you so much", "I like you very much", "I don't like you"))

df %>% 
  mutate(strings = as.character(strings)) %>% 
  unnest_tokens(word, strings) %>%   # this tokenizes the strings and extracts the words
  filter(!word %in% c("I", "i", "don't", "you")) %>% 
  count(word)

#> # A tibble: 11 x 2
#>    word       n
#>    <chr>  <int>
#>  1 about      1
#>  2 go         1
#>  3 how        1
#>  4 like       4
#>  5 much       2

EDIT

All the tokens are transformed to lower case, so you either include "i" in the stop words or add the argument to_lower = FALSE to unnest_tokens.
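For example, keeping the original case so that "I" in the stop-word list matches (df is defined inline here to make the snippet self-contained):

```r
library(tidytext)
library(dplyr)

df <- data.frame(id = 1:5,
                 strings = c("I want to go to school, how about you?",
                             "I like you.", "I like you so much",
                             "I like you very much", "I don't like you"))

counts <- df %>%
  mutate(strings = as.character(strings)) %>%
  unnest_tokens(word, strings, to_lower = FALSE) %>%  # keep original case
  filter(!word %in% c("I", "don't", "you")) %>%
  count(word)
counts
```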

Upvotes: 1

Daniel

Reputation: 2229

First, create a vector of all words with str_split, then build a frequency table of the words.

library(stringr)
stop_words <- c("I", "don't", "you")

# create a vector of all words in your df
all_words <- unlist(str_split(df$strings, pattern = " "))

# create a frequency table 
word_list <- as.data.frame(table(all_words))

# omit all stop words from the frequency table
word_list[!word_list$all_words %in% stop_words, ]
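To match the column names in the question, the filtered table can be assigned, renamed, and sorted, e.g. (df is defined inline here as in the question):

```r
library(stringr)

df <- data.frame(strings = c("I want to go to school, how about you?",
                             "I like you.", "I like you so much",
                             "I like you very much", "I don't like you"))
stop_words <- c("I", "don't", "you")

all_words <- unlist(str_split(df$strings, pattern = " "))
word_list <- as.data.frame(table(all_words))
result <- word_list[!word_list$all_words %in% stop_words, ]

# rename to match the desired output and sort by frequency
names(result) <- c("keyword", "frequency")
result <- result[order(-result$frequency), ]
result
```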

Upvotes: 1
