Reputation: 2940
I'd like to generate a frequency table of word counts (how many rows contain 1 word, 2 words, and so on) in a dplyr pipe. It has to be a dplyr pipe because I am actually querying through bigrquery, which behaves like a dplyr pipe.
Suppose I have data like this:
tf1 <- tibble(row = 1:5, body = c("tt t ttt j ss oe", "kpw eero", "pow eir sap r", "s", "oe"))
I'd like to have a summary of the word counts (something like this):
  n_words freq
1       0    0
2       1    2
3       2    1
4       3    0
5       4    1
6       5    0
7       6    1
But I need to do this in a dplyr pipe, something like the following (which does not work):
### DOES NOT WORK
tf1 %>%
  wordcount(body, sep = " ", count.function = sum)
Upvotes: 0
Views: 2262
Reputation: 51592
Here is another idea that also uses complete() to fill in all values of n_words:
library(tidyverse)
tf1 %>%
  mutate(n_words = stringr::str_count(body, ' ') + 1) %>%
  count(n_words) %>%
  complete(n_words = 0:max(n_words))
which gives,
# A tibble: 7 x 2
  n_words     n
    <dbl> <int>
1      0.    NA
2      1.     2
3      2.     1
4      3.    NA
5      4.     1
6      5.    NA
7      6.     1
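The output above has NA where the question asked for 0. A minimal sketch of one way to get zeros and a freq column, assuming a local tibble (whether these verbs translate to SQL on a bigrquery table is not checked here, so a collect() beforehand may be needed). The `fill` argument of complete() and the `name` argument of count() do the work; the variable name `res` is just for illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Same sample data as in the question, built locally
tf1 <- tibble(
  row  = 1:5,
  body = c("tt t ttt j ss oe", "kpw eero", "pow eir sap r", "s", "oe")
)

res <- tf1 %>%
  # words = number of single spaces + 1 (assumes single-space separators)
  mutate(n_words = str_count(body, " ") + 1L) %>%
  # name the count column freq instead of the default n
  count(n_words, name = "freq") %>%
  # add the missing word counts and fill their frequency with 0 instead of NA
  complete(n_words = 0:max(n_words), fill = list(freq = 0L))

res
```

With the sample data, freq should be 0, 2, 1, 0, 1, 0, 1 for n_words 0 through 6, matching the table in the question.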
Upvotes: 5
Reputation: 3235
library(dplyr)
library(stringr)
tf1 %>% mutate(wordcount = str_split(body, " ") %>% lengths()) %>% count(wordcount)
## # A tibble: 4 x 2
## wordcount n
## <int> <int>
## 1 1 2
## 2 2 1
## 3 4 1
## 4 6 1
str_split(tf1$body, " ")
returns
[[1]]
[1] "tt" "t" "ttt" "j" "ss" "oe"
[[2]]
[1] "kpw" "eero"
[[3]]
[1] "pow" "eir" "sap" "r"
[[4]]
[1] "s"
[[5]]
[1] "oe"
lengths() calculates the length of each list element, so
str_split(tf1$body, " ") %>% lengths()
## [1] 6 2 4 1 1
mutate() adds this vector as the column wordcount.
count() then returns how many times each value occurs in wordcount and stores that in the column n.
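Both splitting on " " and counting spaces assume exactly one space between words and no empty strings. As a hedged sketch of an alternative, counting runs of non-whitespace with the regex "\\S+" avoids that assumption. The `messy` data and the column names below are made up for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical inputs: a double space, an empty string, padding spaces
messy <- tibble(body = c("tt t  ttt", "", " s ", "kpw eero"))

res <- messy %>%
  mutate(
    # spaces + 1: overcounts on repeated or padding spaces, gives 1 for ""
    by_spaces = str_count(body, " ") + 1L,
    # runs of non-whitespace characters: one match per word
    by_tokens = str_count(body, "\\S+")
  )

res
# by_spaces: 4 1 3 2
# by_tokens: 3 0 1 2
```

The by_tokens column can then be passed to count() exactly as wordcount is above.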
Upvotes: 0