Evan O.

Reputation: 1563

Using dplyr::lag to tidy data frame and fill variables

I'm trying to clean my data so that each row directly beneath a row containing "gamecentre-playbyplay-event" is labeled as a goal, the row directly beneath a "goal" row is labeled as a primary assist, and the row directly beneath a "primary assist" row is labeled as a secondary assist. A row that itself contains "gamecentre-playbyplay-event" is never an assist; it starts the next event.

Here's what the data looks like:

mydata

# A tibble: 15 x 1
   value                                                                                 
   <chr>                                                                                 
 1 "<div class=\"gamecentre-playbyplay-event team-border--lhjmq-bat gamecentre-playby"   
 2 "<a href=\"/players/14695\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
 3 "<a href=\"/players/16639\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
 4 "<a href=\"/players/17027\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
 5 "<div class=\"gamecentre-playbyplay-event team-border--lhjmq-mon gamecentre-playby"   
 6 "<a href=\"/players/17453\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
 7 "<a href=\"/players/14639\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
 8 "<div class=\"gamecentre-playbyplay-event team-border--lhjmq-mon gamecentre-playby"   
 9 "<a href=\"/players/18061\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
10 "<a href=\"/players/14752\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
11 "<a href=\"/players/17522\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
12 "<div class=\"gamecentre-playbyplay-event team-border--lhjmq-mon gamecentre-playby"   
13 "<a href=\"/players/14752\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
14 "<a href=\"/players/14639\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
15 "<a href=\"/players/14757\" class=\"gamecentre__link gamecentre__link--goal\" data-re"

There are a few problems here though.

  1. I need to set conditions so that the rows are correctly labeled.
  2. If an event has no "secondary assist" row, the secondary assist for that goal should be NA.
  3. If an event has no "primary assist" row, the primary assist for that goal should also be NA.

I'm trying to use dplyr::lag() for this, but I can't see how to get NAs when a goal has no primary or secondary assist.

Here's the basis of what I have so far:

library(dplyr)
library(stringr)

# keep only the rows that come directly after an event row
goals <- mydata %>%
  filter(dplyr::lag(str_detect(value, "gamecentre-playbyplay-event team-border"), 1))

goals

# A tibble: 4 x 1
  value                                                                                                                                
  <chr>                                                                                                                                
1 "<a href=\"/players/14695\" class=\"gamecentre__link gamecentre__link--goal\" data-re
2 "<a href=\"/players/17453\" class=\"gamecentre__link gamecentre__link--goal\" data-re
3 "<a href=\"/players/18061\" class=\"gamecentre__link gamecentre__link--goal\" data-re
4 "<a href=\"/players/14752\" class=\"gamecentre__link gamecentre__link--goal\" data-re

And here's what I'd want my data to look like at the end of all of this. I think using dplyr::lag() is the way to go, but I'm not sure.

# A tibble: 4 x 3
  goal                                     primary_assist                                secondary_assist                              
  <chr>                                    <chr>                                         <chr>                                         
1 "<a href=\"/players/14695\" class=\"gam~ "<a href=\"/players/16639\" class=\"gamecent~ "<a href=\"/players/17027\" class=\"gamecentr~
2 "<a href=\"/players/17453\" class=\"gam~ "<a href=\"/players/14639\" class=\"gamecent~ NA                                            
3 "<a href=\"/players/18061\" class=\"gam~ "<a href=\"/players/14752\" class=\"gamecent~ "<a href=\"/players/17522\" class=\"gamecentr~
4 "<a href=\"/players/14752\" class=\"gam~ "<a href=\"/players/14639\" class=\"gamecent~ "<a href=\"/players/14757\" class=\"gamecentr~

Any ideas?

dput:

    mydata <- structure(list(value = c("<div class=\"gamecentre-playbyplay-event team-border--lhjmq-bat gamecentre-playby", 
"<a href=\"/players/14695\" class=\"gamecentre__link gamecentre__link--goal\" data-re", 
"<a href=\"/players/16639\" class=\"gamecentre__link gamecentre__link--goal\" data-re", 
"<a href=\"/players/17027\" class=\"gamecentre__link gamecentre__link--goal\" data-re", 
"<div class=\"gamecentre-playbyplay-event team-border--lhjmq-mon gamecentre-playby", 
"<a href=\"/players/17453\" class=\"gamecentre__link gamecentre__link--goal\" data-re", 
"<a href=\"/players/14639\" class=\"gamecentre__link gamecentre__link--goal\" data-re", 
"<div class=\"gamecentre-playbyplay-event team-border--lhjmq-mon gamecentre-playby", 
"<a href=\"/players/18061\" class=\"gamecentre__link gamecentre__link--goal\" data-re", 
"<a href=\"/players/14752\" class=\"gamecentre__link gamecentre__link--goal\" data-re", 
"<a href=\"/players/17522\" class=\"gamecentre__link gamecentre__link--goal\" data-re", 
"<div class=\"gamecentre-playbyplay-event team-border--lhjmq-mon gamecentre-playby", 
"<a href=\"/players/14752\" class=\"gamecentre__link gamecentre__link--goal\" data-re", 
"<a href=\"/players/14639\" class=\"gamecentre__link gamecentre__link--goal\" data-re", 
"<a href=\"/players/14757\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
)), .Names = "value", class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -15L))

Upvotes: 2

Views: 309

Answers (1)

akrun

Reputation: 887531

One option is to create a grouping variable with cumsum() and then spread the data from long to wide:

library(tidyverse)
mydata %>%
   # start a new group at every row that contains 'playby' (the event rows)
   group_by(grp = cumsum(str_detect(value, 'playby'))) %>% 
   # drop the first row of each group, i.e. the 'playby' row itself
   filter(row_number() > 1) %>% 
   # label the remaining rows by their position within the group
   mutate(categ = c("goal", "primary_assist", "secondary_assist")[row_number()]) %>%
   # spread from long to wide; missing assists become NA
   spread(categ, value) %>% 
   # remove the grouping column as part of clean up
   ungroup %>% 
   select(-grp)
# A tibble: 4 x 3
#  goal                                   primary_assist                              secondary_assist                           
#  <chr>                                  <chr>                                       <chr>                                      
#1 "<a href=\"/players/14695\" class=\"g… "<a href=\"/players/16639\" class=\"gamece… "<a href=\"/players/17027\" class=\"gamece…
#2 "<a href=\"/players/17453\" class=\"g… "<a href=\"/players/14639\" class=\"gamece… <NA>                                       
#3 "<a href=\"/players/18061\" class=\"g… "<a href=\"/players/14752\" class=\"gamece… "<a href=\"/players/17522\" class=\"gamece…
#4 "<a href=\"/players/14752\" class=\"g… "<a href=\"/players/14639\" class=\"gamece… "<a href=\"/players/14757\" class=\"gamece…
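
In more recent versions of tidyr, spread() is superseded by pivot_wider(), so the same idea could be sketched as below. This is only a sketch under some assumptions: tidyr 1.0.0 or later is installed, and `toy` is a made-up, shortened stand-in for `mydata` (the "player A" etc. strings and the `roles` vector are hypothetical, not from the original data).

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical, shortened stand-in for mydata: two events,
# the second one has no secondary assist
toy <- tibble(value = c(
  "<div class=\"gamecentre-playbyplay-event\">",
  "player A", "player B", "player C",
  "<div class=\"gamecentre-playbyplay-event\">",
  "player D", "player E"
))

# labels for the first, second and third row after each event row
roles <- c("goal", "primary_assist", "secondary_assist")

wide <- toy %>%
  # each event row starts a new group
  mutate(grp = cumsum(str_detect(value, "gamecentre-playbyplay-event"))) %>%
  group_by(grp) %>%
  # drop the event row itself
  filter(row_number() > 1) %>%
  # label the remaining rows by their position within the group
  mutate(categ = roles[row_number()]) %>%
  ungroup() %>%
  # long to wide; a missing role is filled with NA
  pivot_wider(names_from = categ, values_from = value) %>%
  # remove the helper grouping column
  select(-grp)

wide
```

As with spread(), a column only exists if at least one group has that role, so if no goal in the data had a secondary assist, the secondary_assist column would be absent rather than all NA.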

Upvotes: 4
