user8248672
user8248672

Reputation:

Using summarise function to count frequency of words in dataframe

I've tried to look this up on SO, but can't find any simple answers. My question is extremely simple.

Have a look at the mpg dataset in ggplot2 package. I want to first group by model and manufacturer, then count the frequency of the character strings in the "drv" column ("f", "r", "4").

I've tried:

mpg %>% 
  group_by(model, manufacturer) %>% 
  summarise(sum_drv = sum(drv))

Which doesn't work because the drv column contains character values.

Thanks,

Upvotes: 0

Views: 3738

Answers (3)

AntoniosK
AntoniosK

Reputation: 16121

You didn't provide an ideal output so, you can have two different formats.

All drv values in a column:

library(tidyverse)

mpg %>% count(model, manufacturer, drv) 

# # A tibble: 38 x 4
#   model              manufacturer drv       n
#   <chr>              <chr>        <chr> <int>
# 1 4runner 4wd        toyota       4         6
# 2 a4                 audi         f         7
# 3 a4 quattro         audi         4         8
# 4 a6 quattro         audi         4         3
# 5 altima             nissan       f         6
# 6 c1500 suburban 2wd chevrolet    r         5
# 7 camry              toyota       f         7
# 8 camry solara       toyota       f         7
# 9 caravan 2wd        dodge        f        11
#10 civic              honda        f         9
# # ... with 28 more rows

Each drv value as a column:

mpg %>% count(model, manufacturer, drv) %>% spread(drv, n, fill=0)

# # A tibble: 38 x 5
#   model              manufacturer   `4`     f     r
# * <chr>              <chr>        <dbl> <dbl> <dbl>
# 1 4runner 4wd        toyota        6.00  0     0   
# 2 a4                 audi          0     7.00  0   
# 3 a4 quattro         audi          8.00  0     0   
# 4 a6 quattro         audi          3.00  0     0   
# 5 altima             nissan        0     6.00  0   
# 6 c1500 suburban 2wd chevrolet     0     0     5.00
# 7 camry              toyota        0     7.00  0   
# 8 camry solara       toyota        0     7.00  0   
# 9 caravan 2wd        dodge         0    11.0   0   
#10 civic              honda         0     9.00  0   
# # ... with 28 more rows

Upvotes: 2

LyzandeR
LyzandeR

Reputation: 37879

I think table is what you are looking for:

mpg %>% 
 group_by(model, manufacturer, drv) %>% 
 summarise(sum_drv = as.numeric(table(drv)))
# # A tibble: 38 x 4
# # Groups:   model, manufacturer [?]
# model manufacturer   drv sum_drv
# <chr>        <chr> <chr>   <dbl>
#  1        4runner 4wd       toyota     4       6
# 2                 a4         audi     f       7
# 3         a4 quattro         audi     4       8
# 4         a6 quattro         audi     4       3
# 5             altima       nissan     f       6
# 6 c1500 suburban 2wd    chevrolet     r       5
# 7              camry       toyota     f       7
# 8       camry solara       toyota     f       7
# 9        caravan 2wd        dodge     f      11
# 10              civic        honda     f       9
# # ... with 28 more rows

Upvotes: 1

MKR
MKR

Reputation: 20095

You can make a simple change to get the count/frequency as below. Include drv as part by group_by and then you have to just count per group.

mpg %>% 
  group_by(model, manufacturer, drv) %>% 
  summarise(sum_drv = n())

Results:
# A tibble: 38 x 4
# Groups:   model, manufacturer [?]
                model manufacturer   drv sum_drv
                <chr>        <chr> <chr>   <int>
 1        4runner 4wd       toyota     4       6
 2                 a4         audi     f       7
 3         a4 quattro         audi     4       8
 4         a6 quattro         audi     4       3
 5             altima       nissan     f       6
 6 c1500 suburban 2wd    chevrolet     r       5
 7              camry       toyota     f       7
 8       camry solara       toyota     f       7
 9        caravan 2wd        dodge     f      11
10              civic        honda     f       9

Upvotes: 1

Related Questions