Neal Barsch

Reputation: 2940

Summarize word count in dplyr pipe

I'd like to generate a frequency summary of word counts in a dplyr pipe. It has to be a dplyr pipe because I am actually querying through bigrquery, which behaves like one.

Suppose I have data like this:

tf1 <- tbl_df(data.frame(row= c(1:5), body=c("tt t ttt j ss oe", "kpw eero", "pow eir sap r", "s", "oe")))

I'd like to have a summary of the word counts (something like this):

  n_words freq
1       0    0
2       1    2
3       2    1
4       3    0
5       4    1
6       5    0
7       6    1

But I need to do this in a dplyr pipe, something like the following (which does not work):

### Does NOT work
tf1 %>%
  wordcount(body, sep = " ", count.function = sum)

Upvotes: 0

Views: 2262

Answers (2)

Sotos

Reputation: 51592

Here is another idea that also uses complete to fill in every value of n_words from 0 up to the maximum; values that do not occur get NA as their count:

library(tidyverse)

tf1 %>% 
   mutate(n_words = stringr::str_count(body, ' ') + 1) %>% 
   count(n_words) %>% 
   complete(n_words = 0:max(n_words))

which gives,

# A tibble: 7 x 2
  n_words     n
    <dbl> <int>
1      0.    NA
2      1.     2
3      2.     1
4      3.    NA
5      4.     1
6      5.    NA
7      6.     1
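
If zeros are wanted instead of NA (as in the question's desired output), one possible variation is to pass the fill argument of tidyr's complete. This is only a sketch: the column name freq and the object name word_freq are chosen here to match the question and are not part of the original answer. Note also that counting spaces and adding 1 reports an empty string as 1 word, so the 0 row stays empty for this data.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Example data from the question, built with tibble() instead of tbl_df()
tf1 <- tibble(
  row  = 1:5,
  body = c("tt t ttt j ss oe", "kpw eero", "pow eir sap r", "s", "oe")
)

word_freq <- tf1 %>%
  # words per row = number of single spaces + 1 (integer arithmetic)
  mutate(n_words = str_count(body, " ") + 1L) %>%
  # frequency of each word count, stored in a column named freq
  count(n_words, name = "freq") %>%
  # add the missing word counts and give them a frequency of 0
  complete(n_words = 0:max(n_words), fill = list(freq = 0L))

word_freq
```

With the example data this should reproduce the table asked for in the question, with rows for 0 through 6 words.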

Upvotes: 5

akraf

Reputation: 3235

library(dplyr)
library(stringr)
tf1 %>% mutate(wordcount = str_split(body, " ") %>% lengths()) %>% count(wordcount)
## # A tibble: 4 x 2
##   wordcount     n
##       <int> <int>
## 1         1     2
## 2         2     1
## 3         4     1
## 4         6     1

str_split(tf1$body, " ") returns

[[1]]
[1] "tt"  "t"   "ttt" "j"   "ss"  "oe" 

[[2]]
[1] "kpw"  "eero"

[[3]]
[1] "pow" "eir" "sap" "r"  

[[4]]
[1] "s"

[[5]]
[1] "oe"

lengths calculates the length of each list element, therefore

str_split(tf1$body, " ") %>% lengths()
## [1] 6 2 4 1 1

This is added as column wordcount by using mutate.

count returns how many times each value occurs in the column wordcount and stores that as column n.
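
Splitting on a single space assumes the words are separated by exactly one space. A hedged alternative, if the real data may contain repeated or leading spaces or empty strings, is to count runs of non-whitespace characters with str_count instead. The data frame tf2 below is made up for illustration and is not from the question.

```r
library(dplyr)
library(stringr)

# Hypothetical data with a double space, a leading space and an empty string
tf2 <- tibble(body = c("a  b", " lead", "", "one two three"))

counted <- tf2 %>%
  # "\\S+" matches one run of non-whitespace characters, i.e. one word
  mutate(wordcount = str_count(body, "\\S+"))

res <- counted %>% count(wordcount)
res
```

Here the empty string is counted as 0 words, whereas splitting on " " would report it as 1.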

Upvotes: 0
