Jazzmatazz
Jazzmatazz

Reputation: 645

What is the best way to count values within columns to create a summary table?

I have a tbl_df that has several columns that have multiple values in them. I am looking to use the values in the columns to create several columns. After that, I'm looking to summarize the columns.

One way I can go about it is to create several ifelse within a mutate but that seems inefficient. Is there a better way to go about this? I'm thinking that there is probably a dplyr and/or tidyr based solution.

Example of what I'm looking to do is below. It's only a sampling of the data and columns. It doesn't contain all of the columns that I'm looking to create. The summary table will have some sum and mean based columns.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df <- tibble::tribble(
  ~type,      ~bb_type,           ~description,
  "B",            NA,                 "ball",
  "S",            NA,                 "foul",
  "X",  "line_drive", "hit_into_play_no_out",
  "S",            NA,      "swinging_strike",
  "S",            NA,                 "foul",
  "X", "ground_ball",        "hit_into_play",
  "S",            NA,      "swinging_strike",
  "X",    "fly_ball",  "hit_into_play_score",
  "B",            NA,                 "ball",
  "S",            NA,                 "foul"
)


df <- df %>% 
  mutate(ground_ball = ifelse(bb_type == "ground_ball", 1, 0),
         fly_ball = if_else(bb_type == "fly_ball", 1, 0),
         X = if_else(type == "X", 1, 0),
# not sure if this is the based way to go about counting columns that start with swinging to sum later
         swinging_strike = grepl("^swinging", description))

df
#> # A tibble: 10 x 7
#>    type  bb_type    description       ground_ball fly_ball     X swinging_strike
#>    <chr> <chr>      <chr>                   <dbl>    <dbl> <dbl> <lgl>          
#>  1 B     <NA>       ball                       NA       NA     0 FALSE          
#>  2 S     <NA>       foul                       NA       NA     0 FALSE          
#>  3 X     line_drive hit_into_play_no…           0        0     1 FALSE          
#>  4 S     <NA>       swinging_strike            NA       NA     0 TRUE           
#>  5 S     <NA>       foul                       NA       NA     0 FALSE          
#>  6 X     ground_ba… hit_into_play               1        0     1 FALSE          
#>  7 S     <NA>       swinging_strike            NA       NA     0 TRUE           
#>  8 X     fly_ball   hit_into_play_sc…           0        1     1 FALSE          
#>  9 B     <NA>       ball                       NA       NA     0 FALSE          
#> 10 S     <NA>       foul                       NA       NA     0 FALSE

summary_df <- df %>% 
  summarize(n = n(),
            fly_ball = sum(fly_ball, na.rm = TRUE),
            ground_ball = sum(ground_ball, na.rm = TRUE))

summary_df
#> # A tibble: 1 x 3
#>       n fly_ball ground_ball
#>   <int>    <dbl>       <dbl>
#> 1    10        1           1

In summary, I'm looking to do the following:

  1. Create new columns for all of the values in bb_type and type that counts them
  2. Create a new column that counts the number of values that start with swinging in the description column. I'd like to see an example that chooses another text string from that column and creates a new column with the count as an additional example. Ex. ball
  3. How would I choose my own name while doing what I'm looking to achieve in 1 and 2? Would I have to simply use dplyr::rename after the fact?

Upvotes: 1

Views: 902

Answers (3)

dc37
dc37

Reputation: 16178

Using dplyr and tidyr you can do something like this. First, you can group by the "bb_type" variable by specifying .drop = FALSE, in order dplyr keep NA values. Then, you can count them and get the sum of all counted values and finally use pivot_wider to get the data being displayed in the orientation you are looking for:

library(dplyr)
library(tidyr)
df %>% group_by(bb_type, .drop = FALSE) %>%
  count() %>% 
  ungroup() %>% mutate(Sum = sum(n)) %>% 
  pivot_wider(.,names_from = bb_type,values_from = n) 

# A tibble: 1 x 5
    Sum fly_ball ground_ball line_drive  `NA`
  <int>    <int>       <int>      <int> <int>
1    10        1           1          1     7

Is it what you are looking for ?

Upvotes: 2

IRTFM
IRTFM

Reputation: 263471

This appears to be a request for a tabulation with a subsequent count of the entries in that tabulation

tb_df <- table(df$bb_type, useNA="always") 

c(Sum=sum(tb_df), tb_df)
        Sum    fly_ball ground_ball  line_drive        <NA> 
         10           1           1           1           7 

If you wanted it as a dataframe you would first turn it into a named list:

data.frame( as.list(  c(Sum=sum(tb_df), tb_df) ) )
  Sum fly_ball ground_ball line_drive NA.
1  10        1           1          1   7

If you wanted this done on all columns then first make a function that handles one column and lapply it to the tbl_df:

tally_col <- function(x){ tb <- table(x, useNA="always") 
 tal <- c(Sum=sum(tb), tb); data.frame( as.list(tal)) }

lapply(df, tally_col)
# ---output---
$type
  Sum B S X NA.
1  10 2 5 3   0

$bb_type
  Sum fly_ball ground_ball line_drive NA.
1  10        1           1          1   7

$description
  Sum ball foul hit_into_play hit_into_play_no_out hit_into_play_score swinging_strike NA.
1  10    2    3             1                    1                   1               2   0

Upvotes: 3

akrun
akrun

Reputation: 887851

We can use table with addmargins from base R

addmargins(table(df$bb_type, useNA = 'always'), 1)
#   fly_ball ground_ball  line_drive        <NA>         Sum 
#          1           1           1           7          10 

Upvotes: 2

Related Questions