
Reputation: 747

Need help extracting xpath - R

I'm trying to make a table from HTML content. I've made an example HTML doc to show exactly what the issue is - so although there are many easier ways to accomplish what I'm asking in this example, I have to do it this way (make separate vectors) given the actual larger HTML doc I'm dealing with.

Basically I need to extract specific values from top rows and bottom rows in a weirdly formatted table. Sometimes, however, there aren't values available in a specific row/column (not even dummy blank values), so I can't set up a table because the variables have different lengths.


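library(rvest)  # read_html()
library(XML)    # htmlTreeParse(), xpathApply(), xmlValue()

htmlEx <- read_html('
      <table>
        <tr class="top">
          <td class="price">
            <span class="data-value"> 150 </span>
            <small class="name"> Good1 </small>
          </td></tr>
        <tr class="bottom">
          <td class="price">
            <small class="name"> Good2 </small>
          </td></tr>
        <tr class="top">
          <td class="price">
            <span class="data-value"> 130 </span>
            <small class="name"> Good3 </small>
          </td></tr>
        <tr class="bottom">
          <td class="price">
            <span class="data-value"> 180 </span>
            <small class="name"> Good4 </small>
          </td></tr>
      </table>')

# the XML functions below need an XML-package document, so re-parse the serialized HTML
htmlEx <- htmlTreeParse(as.character(htmlEx), asText = TRUE, useInternalNodes = TRUE)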

topVals <- trimws(xpathSApply(htmlEx, '//*[contains(@class, "top")]//span', xmlValue))
topNames <- trimws(xpathSApply(htmlEx, '//*[contains(@class, "top")]//small', xmlValue))

bottomVals <- trimws(xpathSApply(htmlEx, '//*[contains(@class, "bottom")]//span', xmlValue))
bottomNames <- trimws(xpathSApply(htmlEx, '//*[contains(@class, "bottom")]//small', xmlValue))

Since there isn't a data-value for the first "bottom" (for Good2), bottomVals is of length 1 so I can't compile a dataframe.

Ideally I'd like to change my xpathApply search so that if there is no <span> under <td class="price"> then it would show up as NA or "". My actual HTML has around 50 different rows with about 5-10 different values missing in different rows/columns, so I can't clean it with logic such as "if length bottomVals != length topVals then append an NA" because every day the order of missing data changes.

Is there a relatively easy fix to my xpath search to accomplish this, or will I have to change my approach completely?


My desired output for this example is for bottomVals to be [NA, 180], as there is no value for the first class="bottom". This way I can combine everything into a dataframe (data.frame(topNames, bottomNames, topVals, bottomVals)) since they're all of length 2. And to generalize, is there a way to look for a specific element and have it be NA if it doesn't exist? E.g. if I tried looking for a div instead of small/span I'd get [NA, NA].

I know this seems like a roundabout way to turn it into a dataframe, but it really is the easiest way given the actual DOM I'm working with (it's very unorganized and I have to do lots of data cleaning before compiling).

Upvotes: 2

Views: 416

Answers (2)


Reputation: 24079

Here is a possible solution using just rvest. When the html/xml structure is missing some nodes, the easiest solution is to find a node common to every data point of interest.

In this case the "tr" row is common. From there, html_node() (singular) returns exactly one result for every parent node, even if the subnode of interest is absent, in which case the result is NA.

# find all tr nodes (htmlEx here is the read_html() result, not the htmlTreeParse() one)
tablerows <- html_nodes(htmlEx, "tr")

#parse each tr node and obtain the span value, name value and class
spanrows <- html_node(tablerows, "span") %>% html_text()
smallrows <- html_node(tablerows, "small") %>% html_text()
rowclasses <- tablerows %>% html_attr("class")

df <- data.frame(class = rowclasses, Values = spanrows, Names = smallrows)

   class Values  Names
1    top   150   Good1
2 bottom   <NA>  Good2
3    top   130   Good3
4 bottom   180   Good4

This table can then be reshaped into the final desired form with tidyr's pivot_wider(), assuming the rows strictly alternate top/bottom.

df$id = rep(1:(nrow(df)/2), each=2)
pivot_wider(df, id_cols=id, names_from=class, names_glue = "{class}_{.value}", values_from = c(Names, Values))

# A tibble: 2 x 5
     id top_Names bottom_Names top_Values bottom_Values
  <int> <fct>     <fct>        <fct>      <fct>
1     1 " Good1 " " Good2 "    " 150 "    NA
2     2 " Good3 " " Good4 "    " 130 "    " 180 "
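
If separate, equal-length vectors are still needed, as the question asks, the same per-row idea can be applied to each row class directly. The following is only a minimal sketch: it assumes the rows are tr elements inside a table (the markup below is a reconstruction of the question's example), and extract_col is a hypothetical helper name, not an rvest function.

```r
library(rvest)

# Assumed reconstruction of the question's example markup
html <- '<table>
  <tr class="top"><td class="price"><span class="data-value"> 150 </span><small class="name"> Good1 </small></td></tr>
  <tr class="bottom"><td class="price"><small class="name"> Good2 </small></td></tr>
  <tr class="top"><td class="price"><span class="data-value"> 130 </span><small class="name"> Good3 </small></td></tr>
  <tr class="bottom"><td class="price"><span class="data-value"> 180 </span><small class="name"> Good4 </small></td></tr>
</table>'
doc <- read_html(html)

# Hypothetical helper: one entry per row of the given class,
# NA where the requested child element is missing from that row
extract_col <- function(doc, row_class, child) {
  rows <- html_nodes(doc, paste0("tr.", row_class))
  trimws(html_text(html_node(rows, child)))
}

topVals     <- extract_col(doc, "top", "span")
bottomVals  <- extract_col(doc, "bottom", "span")
topNames    <- extract_col(doc, "top", "small")
bottomNames <- extract_col(doc, "bottom", "small")

# All four vectors have one entry per row, so they combine cleanly
result <- data.frame(topNames, bottomNames, topVals, bottomVals)
```

Because html_node() is evaluated once per row, asking for an element that never occurs (for example extract_col(doc, "bottom", "div")) should give one NA per row rather than an empty vector.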

Upvotes: 1


Reputation: 626

This will populate it with an empty string when the node is not present:

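convert_empty <- function(x) {
  # text node(s) of a <span> directly under this <td>, if any
  value <- xpathApply(x, './span/text()')
  # no match yields a zero-length result, so test the length rather than is.null()
  if (length(value) == 0) { return ('') }
  return (xmlValue(value[[1]]))
}
bottomVals <- trimws(xpathSApply(htmlEx, '//*[contains(@class, "bottom")]/td', convert_empty))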
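
For the NA output and the generalization to other elements that the question asks about, the same idea can be parameterized by child element name. This is a rough sketch under the same assumptions about the markup (reconstructed below); child_or_na is a hypothetical helper, not part of the XML package.

```r
library(XML)

# Assumed reconstruction of the question's example markup
html <- '<table>
  <tr class="top"><td class="price"><span class="data-value"> 150 </span><small class="name"> Good1 </small></td></tr>
  <tr class="bottom"><td class="price"><small class="name"> Good2 </small></td></tr>
  <tr class="top"><td class="price"><span class="data-value"> 130 </span><small class="name"> Good3 </small></td></tr>
  <tr class="bottom"><td class="price"><span class="data-value"> 180 </span><small class="name"> Good4 </small></td></tr>
</table>'
doc <- htmlParse(html, asText = TRUE)

# Hypothetical helper: trimmed text of the first `child` element directly
# under a cell, or NA when the cell has no such child
child_or_na <- function(cell, child) {
  hits <- xpathSApply(cell, paste0("./", child), xmlValue)
  if (length(hits) == 0) NA_character_ else trimws(hits[[1]])
}

# One <td> per "bottom" row, so every result has one entry per row
cells <- getNodeSet(doc, '//tr[contains(@class, "bottom")]/td')
bottomVals  <- vapply(cells, child_or_na, character(1), child = "span")
bottomNames <- vapply(cells, child_or_na, character(1), child = "small")
bottomDivs  <- vapply(cells, child_or_na, character(1), child = "div")
```

The design choice is the same as in the rvest answer: iterate over a node that exists in every row (here the td) and look up the optional child relative to it, so a missing child becomes NA in place instead of shortening the vector.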

Upvotes: 0
