Reputation: 747
I'm trying to make a table from HTML content. I've made an example HTML doc to show exactly what the issue is - so although there are many easier ways to accomplish what I'm asking in this example, I have to do it this way (make separate vectors) given the actual larger HTML doc I'm dealing with.
Basically I need to extract specific values from top rows and bottom rows in a weirdly formatted table. Sometimes, however, there aren't values available in a specific row/column (not even dummy blank values), so I can't set up a table because the resulting vectors have different lengths.
Example:
library(XML)
library(rvest)
htmlEx <- read_html(
'<table>
<thead>
<tbody>
<tr class="top">
<td class="price">
<span class="data-value"> 150 </span>
<small class="name"> Good1 </small>
</td>
</tr>
<tr class="bottom">
<td class="price">
<small class="name"> Good2 </small>
</td>
</tr>
<tr class="top">
<td class="price">
<span class="data-value"> 130 </span>
<small class="name"> Good3 </small>
</td>
</tr>
<tr class="bottom">
<td class="price">
<span class="data-value"> 180 </span>
<small class="name"> Good4 </small>
</td>
</tr>
</tbody>
</thead>
</table>'
)
htmlEx <- htmlTreeParse(htmlEx, useInternalNodes=T)
topVals <- trimws(xpathApply(htmlEx, '//*[contains(@class, "top")]//span', xmlValue))
topNames <- trimws(xpathApply(htmlEx, '//*[contains(@class, "top")]//small', xmlValue))
bottomVals <- trimws(xpathApply(htmlEx, '//*[contains(@class, "bottom")]//span', xmlValue))
bottomNames <- trimws(xpathApply(htmlEx, '//*[contains(@class, "bottom")]//small', xmlValue))
Since there isn't a data-value for the first "bottom" (for Good2), bottomVals is of length 1, so I can't compile a dataframe.
Ideally I'd like to change my xpathApply search so that if there is no <span> under <td class="price"> then it would show up as NA or "". My actual HTML has around 50 different rows with about 5-10 different values missing in different rows/columns, so I can't clean it with logic such as "if length bottomVals != length topVals then append an NA" because every day the order of missing data changes.
Is there a relatively easy fix to my xpath search to accomplish this, or will I have to change my approach completely?
EDIT:
My desired output for this example is for bottomVals to be [NA, 180], as there is no value for the first class="bottom". This way I can combine everything into a dataframe (data.frame(topNames, bottomNames, topVals, bottomVals)) since they're all of length 2. And to generalize, is there a way to look for a specific element and have it be NA if it doesn't exist? E.g. if I tried looking for a div instead of small/span I'd get [NA, NA].
I know this seems like a roundabout way to turn it into a dataframe, but it really is the easiest way given the actual DOM I'm working with (it's very unorganized and I have to do lots of data cleaning before compiling).
Upvotes: 2
Views: 416
Reputation: 24079
Here is a possible solution using just rvest. When the html/xml structure is missing some nodes, the easiest solution is to find a node common to every data point of interest.
In this case the "tr" row is common. From there, the html_node() function (singular, as opposed to html_nodes()) returns one result for every parent node, even if the subnode of interest is absent, in which case the result is NA.
library(rvest)
#find all tr nodes
tablerows<- html_nodes(htmlEx, "tr")
#parse each tr node and obtain the span value, name value and class
spanrows <- html_node(tablerows, "span") %>% html_text()
smallrows <- html_node(tablerows, "small") %>% html_text()
rowclasses <- tablerows %>% html_attr("class")
#names come from the <small> nodes, values from the <span> nodes
df <- data.frame(class = rowclasses, Names = smallrows, Values = spanrows)
df
   class Names Values
1    top Good1    150
2 bottom Good2   <NA>
3    top Good3    130
4 bottom Good4    180
This table can then be reshaped into the final desired form.
library(tidyr)
df$id = rep(1:(nrow(df)/2), each=2)
pivot_wider(df, id_cols=id, names_from=class, names_glue = "{class}_{.value}", values_from = c(Values, Names))
# A tibble: 2 x 5
     id top_Values bottom_Values top_Names bottom_Names
  <int> <fct>      <fct>         <fct>     <fct>
1     1 " 150 "    NA            " Good1 " " Good2 "
2     2 " 130 "    " 180 "       " Good3 " " Good4 "
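If the four separate vectors from the question are what is needed, the same html_node() idea can be applied per row class. The following is only a minimal sketch: `doc` and the helper `row_values()` are hypothetical names introduced here (not part of rvest), and the inline HTML is a condensed copy of the question's table.

```r
library(rvest)

# Assumption: a condensed copy of the question's table, parsed fresh with rvest.
doc <- read_html('<table><tbody>
  <tr class="top"><td class="price"><span class="data-value"> 150 </span><small class="name"> Good1 </small></td></tr>
  <tr class="bottom"><td class="price"><small class="name"> Good2 </small></td></tr>
  <tr class="top"><td class="price"><span class="data-value"> 130 </span><small class="name"> Good3 </small></td></tr>
  <tr class="bottom"><td class="price"><span class="data-value"> 180 </span><small class="name"> Good4 </small></td></tr>
</tbody></table>')

# Hypothetical helper: one result per row matched by `row_css`,
# with NA wherever the row has no descendant matching `child_css`.
row_values <- function(doc, row_css, child_css) {
  rows <- html_nodes(doc, row_css)                    # one entry per matching <tr>
  html_text(html_node(rows, child_css), trim = TRUE)  # html_node() keeps absent matches as NA
}

topVals     <- row_values(doc, "tr.top", "span")
topNames    <- row_values(doc, "tr.top", "small")
bottomVals  <- row_values(doc, "tr.bottom", "span")   # NA "180"
bottomNames <- row_values(doc, "tr.bottom", "small")

# All four vectors have length 2, so they combine directly.
result <- data.frame(topNames, bottomNames, topVals, bottomVals)
```

The same helper covers the general case raised in the question: asking for an element that never occurs, e.g. row_values(doc, "tr.bottom", "div"), gives NA for both rows.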
Upvotes: 1
Reputation: 626
This will populate it with an empty string when the node is not present:
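convert_empty <- function(x) {
  # look for the text of a <span> directly under this <td>
  value <- xpathApply(x, './span/text()')
  # a failed match can come back as an empty list rather than NULL,
  # so test the length, which covers both cases
  if (length(value) == 0) { return('') }
  return(xmlValue(value[[1]]))
}
bottomVals <- trimws(xpathApply(htmlEx, '//*[contains(@class, "bottom")]/td', convert_empty))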
Upvotes: 0