Reputation: 43
I'm trying to change a data frame with several thousands of rows that each look like one of the following variations:
table, th, td {
border: 1px solid black
}
<table>
<tr>
<th> a </th>
<th> b </th>
<th> c </th>
</tr>
<tr>
<td> $10 x and $7 y </td>
<td> NA </td>
<td> NA </td>
</tr>
<tr>
<td> $70 a; $95 b </td>
<td> NA </td>
<td> NA </td>
</tr>
<tr>
<td> $6 j </td>
<td> NA </td>
<td> NA </td>
</tr>
</table>
and change it into this:
table, th, td {
border: 1px solid black
}
<table>
<tr>
<th> a </th>
<th> b </th>
<th> c </th>
</tr>
<tr>
<td> $10 x and $7 y </td>
<td> $10 x </td>
<td> $7 y </td>
</tr>
<tr>
<td> $70 a; $95 b </td>
<td> $70 a </td>
<td> $95 b </td>
</tr>
<tr>
<td> $6 j </td>
<td> $6 j </td>
<td> NA </td>
</tr>
</table>
This is my current code to accomplish this (I use the number of dollar signs because that is the only consistent value to determine number of transactions):
(This is formatted as a data.table, just in case that makes a difference)
df$b[(str_count(df$a, pattern = "\\$") == 2)] = unlist(strsplit(df$a, " and "))[1]
df$c[(str_count(df$a, pattern = "\\$") == 2)] = unlist(strsplit(df$a, " and "))[2]
df$b[str_count(df$a, pattern = "\\$") < 2] = df$a
Now, instead of the desired result, I get the following:
table, th, td {
border: 1px solid black
}
<table>
<tr>
<th> a </th>
<th> b </th>
<th> c </th>
</tr>
<tr>
<td> $10 x and $7 y </td>
<td> $10 x </td>
<td> $7 y </td>
</tr>
<tr>
<td> $70 a; $95 b </td>
<td> $10 x</td>
<td> $7 y</td>
</tr>
<tr>
<td> $6 j </td>
<td> $6 j </td>
<td> NA </td>
</tr>
</table>
Does anyone know how to fix this issue? I think it has to do with the fact that the strsplit()
is taking the first subsetted row and applying it to every row in the subset, but I don't know how to change it to work properly.
Upvotes: 0
Views: 37
Reputation: 389175
You can use str_split_fixed
from stringr
stringr::str_split_fixed(df$a, '\\s*(;|and)\\s*', 2)
# [,1] [,2]
#[1,] "$10 x" "$7 y"
#[2,] "$70 a" "$95 b"
#[3,] "$6 j" ""
Upvotes: 0
Reputation: 2143
Don't try and write code to parse HTML, just call an HTML parser:
library(rvest)
library(tidyverse)
stage1 <-
"<table>
<tr>
<th> a </th>
<th> b </th>
<th> c </th>
</tr>
<tr>
<td> $10 x and $7 y </td>
<td> NA </td>
<td> NA </td>
</tr>
<tr>
<td> $70 a; $95 b </td>
<td> NA </td>
<td> NA </td>
</tr>
<tr>
<td> $6 j </td>
<td> NA </td>
<td> NA </td>
</tr>
</table>" %>%
rvest::minimal_html() %>%
rvest::html_node("table") %>%
rvest::html_table() %>%
as_tibble()
stage1
# A tibble: 3 x 3
a b c
<chr> <lgl> <lgl>
1 $10 x and $7 y NA NA
2 $70 a; $95 b NA NA
3 $6 j NA NA
Now clean up stage1
using separate
and a regex
stage1 %>%
select(a) %>%
separate(col = "a", into = c("b", "c"),
sep = "(?ix) \\s* (and|;) \\s*", # Perl stye regex, cases insensitive.
remove = FALSE,
fill= "right")
a b c
<chr> <chr> <chr>
1 $10 x and $7 y $10 x $7 y
2 $70 a; $95 b $70 a $95 b
3 $6 j $6 j NA
Upvotes: 2