Reputation: 1563

Summarizing data by name separated across multiple variables

I'm trying to count totals for goals, primary assists, and secondary assists for each player. My problem is that I can't get my head around the logic to do that, as the data I want to summarize by (player name) is listed across three variables (goal, primary assist and secondary assist)

Here's my reproducible data (it's from dput(), so apologies for the mess).

mydata <- structure(list(primary_assist = c("Dmitry Gilyazitdinov", "Evgeny Orlov", 
"Anton Burdasov", "Sergei Kalinin", "Stanislav Solovyov", "Vasily Streltsov", 
NA, "Bogdan Potekhin", "Bogdan Potekhin", "Vasily Streltsov", 
"Vasily Streltsov", "Viktor Postnikov", "Danil Kaskov", NA, NA, 
"Artemy Panarin"), secondary_assist = c("Andrei Badrutdinov", 
NA, NA, NA, "Danil Gubarev", "Nikita Manukhov", NA, "Evgeny Grigorenko", 
"Daniil Apalkov", "Ivan Boiko", NA, "Viktor Antipin", "Vitaly Sychov", 
NA, NA, "Stanislav Levin"), goal = c("Vitaly Kropachyov", "Dmitry Kozlov", 
"Stanislav Solovyov", "Kirill Polyansky", "Anton Burdasov", "Ilya Solodov", 
"Alexander Antropov", "Daniil Apalkov", "Evgeny Grigorenko", 
"Alexander Antropov", "Alexander Antropov", "Evgeny Grigorenko", 
"Denis Belonogov", "Vitaly Sychov", "Alexander Streltsov", "Pyotr Kopyttsov"
), team = c("Belye Medvedi", "Omskie Yastreby", "Belye Medvedi", 
"Omskie Yastreby", "Belye Medvedi", "Avto", "Avto", "Stalnye Lisy", 
"Stalnye Lisy", "Avto", "Avto", "Stalnye Lisy", "Avto", "Avto", 
"Avto", "Russkie Vityazi"), game_strength = c("PP", "EV", "EV", 
"EV", "EV", "PP", "SO", "EV", "PP", "PP", "EV", "PP", "PP", "EV", 
"PP", "EV"), season = c("2009-10", "2009-10", "2009-10", "2009-10", 
"2009-10", "2009-10", "2009-10", "2009-10", "2009-10", "2009-10", 
"2009-10", "2009-10", "2009-10", "2009-10", "2009-10", "2009-10"
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-16L), .Names = c("primary_assist", "secondary_assist", "goal", 
"team", "game_strength", "season"))

mydata
#> # A tibble: 16 x 6
#>    primary_assist       secondary_assist  goal  team  game_strength season
#>    <chr>                <chr>             <chr> <chr> <chr>         <chr> 
#>  1 Dmitry Gilyazitdinov Andrei Badrutdin~ Vita~ Bely~ PP            2009-~
#>  2 Evgeny Orlov         <NA>              Dmit~ Omsk~ EV            2009-~
#>  3 Anton Burdasov       <NA>              Stan~ Bely~ EV            2009-~
#>  4 Sergei Kalinin       <NA>              Kiri~ Omsk~ EV            2009-~
#>  5 Stanislav Solovyov   Danil Gubarev     Anto~ Bely~ EV            2009-~
#>  6 Vasily Streltsov     Nikita Manukhov   Ilya~ Avto  PP            2009-~
#>  7 <NA>                 <NA>              Alex~ Avto  SO            2009-~
#>  8 Bogdan Potekhin      Evgeny Grigorenko Dani~ Stal~ EV            2009-~
#>  9 Bogdan Potekhin      Daniil Apalkov    Evge~ Stal~ PP            2009-~
#> 10 Vasily Streltsov     Ivan Boiko        Alex~ Avto  PP            2009-~
#> 11 Vasily Streltsov     <NA>              Alex~ Avto  EV            2009-~
#> 12 Viktor Postnikov     Viktor Antipin    Evge~ Stal~ PP            2009-~
#> 13 Danil Kaskov         Vitaly Sychov     Deni~ Avto  PP            2009-~
#> 14 <NA>                 <NA>              Vita~ Avto  EV            2009-~
#> 15 <NA>                 <NA>              Alex~ Avto  PP            2009-~
#> 16 Artemy Panarin       Stanislav Levin   Pyot~ Russ~ EV            2009-~

So, I want to count the number of goals, primary assists, and secondary assists for each player, and then have 1 row for each player. Let's say the name "Artemy Panarin" is listed 1 time in goals, 0 times in primary assists, and 2 times in secondary assists, my output would look like this:

tibble::tibble(name = c("Artemy Panarin", "Stanislav Levin", "Danil Kaskov"), team = c("Russkie Vityazi", "Russkie Vityazi", "Avto"), goals = c(1, 1, 0), primary_assists = c(0, 0, 1), secondary_assists = c(2, 0, 0))
#> # A tibble: 3 x 5
#>   name            team            goals primary_assists secondary_assists
#>   <chr>           <chr>           <dbl>           <dbl>             <dbl>
#> 1 Artemy Panarin  Russkie Vityazi  1.00            0                 2.00
#> 2 Stanislav Levin Russkie Vityazi  1.00            0                 0   
#> 3 Danil Kaskov    Avto             0               1.00              0

Does that make any sense? Any ideas? Tidyverse solutions preferred. Thanks!

Upvotes: 6

Answers (3)

Felipe Flores

Reputation: 129

You can use gather and spread. First gather the goal and assists columns into a "key", and then group by key and player. You can conver the NA to 0s later

library(tidyverse)

mydata_tidy <- mydata %>% 
  gather(key = "key", value = "player", primary_assist, secondary_assist, goal) %>% 
  na.omit() 
mydata_tidy %>% 
  group_by(key, player) %>% 
  summarize(count = n()) %>% 
  spread(key, count) %>% 
  filter(player %in% c("Artemy Panarin", "Stanislav Levin", "Danil Kaskov"))
#> # A tibble: 3 x 4
#>   player           goal primary_assist secondary_assist
#>   <chr>           <int>          <int>            <int>
#> 1 Artemy Panarin     NA              1               NA
#> 2 Danil Kaskov       NA              1               NA
#> 3 Stanislav Levin    NA             NA                1

Created on 2018-07-18 by the reprex package (v0.2.0).

Upvotes: 2

akrun

Reputation: 887551

We can gather to 'long' format, grouped by 'name', 'team', and the 'key' column (from gather), summarise to get the count and spread back to 'wide' format

library(tidyverse)
gather(mydata, key, name, primary_assist:goal) %>% 
     group_by(name, team, key) %>% 
     summarise(n = n()) %>% 
     spread(key, n, fill = 0)
# A tibble: 30 x 5
# Groups:   name, team [30]
#   name                team             goal primary_assist secondary_assist
#   <chr>               <chr>           <dbl>          <dbl>            <dbl>
# 1 Alexander Antropov  Avto                3              0                0
# 2 Alexander Streltsov Avto                1              0                0
# 3 Andrei Badrutdinov  Belye Medvedi       0              0                1
# 4 Anton Burdasov      Belye Medvedi       1              1                0
# 5 Artemy Panarin      Russkie Vityazi     0              1                0
# 6 Bogdan Potekhin     Stalnye Lisy        0              2                0
# 7 Daniil Apalkov      Stalnye Lisy        1              0                1
# 8 Danil Gubarev       Belye Medvedi       0              0                1
# 9 Danil Kaskov        Avto                0              1                0
#10 Denis Belonogov     Avto                1              0                0
# ... with 20 more rows

Upvotes: 6

Nate

Reputation: 10671

One way to get the result is to reshape your data with gather()/spread() in addition to your summarize strategy.

library(tidyverse)

scoring_summary <- mydata %>%
    select(primary_assist:team) %>%
    gather("key", "player", -team) %>%
    group_by(player) %>%
    count(key) %>%
    spread(key, n)

# convert NAs to 0
scoring_summary[is.na(scoring_summary)] <- 0

scoring_summary

# A tibble: 28 x 4
# Groups:   player [28]
player               goal primary_assist secondary_assist
<chr>               <dbl>          <dbl>            <dbl>
1 Alexander Antropov      3              0                0
2 Alexander Streltsov     1              0                0
3 Andrei Badrutdinov      0              0                1
4 Anton Burdasov          1              1                0
5 Artemy Panarin          0              1                0
6 Bogdan Potekhin         0              2                0
7 Daniil Apalkov          1              0                1
8 Danil Gubarev           0              0                1
9 Danil Kaskov            0              1                0
10 Denis Belonogov        1              0                0

count() is doing the same thing as your original attempt with summarise(count(goals)

Upvotes: 3

Summarizing data by name separated across multiple variables

Answers (3)

Related Questions