Changing cell values in a row conditioned on values of another row

Question

I'm trying to change a data frame with several thousands of rows that each look like one of the following variations:

table, th, td {
  border: 1px solid black
}


  
     a 
     b 
     c 
  
  
     $10 x and $7 y 
     NA 
     NA 
  
  
     $70 a; $95 b 
     NA 
     NA 
  
  
     $6 j 
     NA 
     NA

and change it into this:

table, th, td {
  border: 1px solid black
}

    
      
         a 
         b 
         c 
      
      
         $10 x and $7 y 
         $10 x 
         $7 y 
      
      
         $70 a; $95 b 
         $70 a 
         $95 b 
      
      
         $6 j 
         $6 j 
         NA

This is my current code to accomplish this (I use the number of dollar signs because that is the only consistent value to determine number of transactions):

(This is formatted as a data.table, just in case that makes a difference)

  df$b[(str_count(df$a, pattern = "\$") == 2)] = unlist(strsplit(df$a, " and "))[1]
  df$c[(str_count(df$a, pattern = "\$") == 2)] = unlist(strsplit(df$a, " and "))[2]
  df$b[str_count(df$a, pattern = "\$") < 2] = df$a

Now, instead of the desired result, I get the following:

table, th, td {
  border: 1px solid black
}


  
     a 
     b 
     c 
  
  
     $10 x and $7 y 
     $10 x 
     $7 y 
  
  
     $70 a; $95 b 
     $10 x
     $7 y
  
  
     $6 j 
     $6 j 
     NA

Does anyone know how to fix this issue? I think it has to do with the fact that the strsplit() is taking the first subsetted row and applying it to every row in the subset, but I don't know how to change it to work properly.

David T · Accepted Answer

Don't try and write code to parse HTML, just call an HTML parser:

library(rvest)
library(tidyverse)

stage1 <- 
  "
  
     a 
     b 
     c 
  
  
     $10 x and $7 y 
     NA 
     NA 
  
  
     $70 a; $95 b 
     NA 
     NA 
  
  
     $6 j 
     NA 
     NA 
  
" %>% 
  rvest::minimal_html() %>% 
  rvest::html_node("table") %>% 
  rvest::html_table() %>% 
  as_tibble()

stage1

# A tibble: 3 x 3
  a              b     c    
             
1 $10 x and $7 y NA    NA   
2 $70 a; $95 b   NA    NA   
3 $6 j           NA    NA

Now clean up stage1 using separate and a regex

stage1 %>% 
  select(a) %>% 
  separate(col = "a", into = c("b", "c"), 
           sep = "(?ix) \s* (and|;) \s*",   # Perl stye regex, cases insensitive.
           remove = FALSE, 
           fill= "right")


  a              b     c    
             
1 $10 x and $7 y $10 x $7 y 
2 $70 a; $95 b   $70 a $95 b
3 $6 j           $6 j  NA

Changing cell values in a row conditioned on values of another row

Answers (2)

Related Questions