Rvest Split Data by Class Name where the class names change

Question

I'm web scraping sold eBay data using rvest.

Recently, eBay has started injecting hidden text into the readable text - see the image and scraped data.

Here is a URL example - you may or may not get the interlaced text: Example URL

XPath to a line item

//*[@id="srp-river-results"]/ul/li[1]/div/div[2]/div[2]/div/span[1]

XPath I use to get all lines

//*[@id="srp-river-results"]/ul/li/div/div[2]

I need the text from s-a4v02P and that text aggregated by line item.

I get something like this:

"So2ld Mar 8,D1 2021JUI" "KSold MQaUr2V 3,E 20R2KC1R" and so on

The question is: how can I get just "Sold Mar 8, 2021", "Sold Mar 3, 2021", and so on?

Code I have so far:

library(rvest)

# url holds the address of the eBay search results page (see the example URL above)
readHTML <- url %>%
    read_html()

Title <- readHTML %>%
    html_nodes(xpath='//*[@id="srp-river-results"]/ul/li/div/div[2]/a/h3') %>%
    html_text()

SoldDateTop <- readHTML %>%
    html_nodes(xpath='//*[@id="srp-river-results"]/ul/li/div/div[2]') %>%
    html_nodes("[class='s-item__title--tagblock ']") %>%
    html_nodes("[class='POSITIVE']") %>%
    # html_nodes("[class='s-8kjgi5']") %>% <-- this class name changes
    html_text()

Dave2e · Accepted Answer

To determine which tags are shown and which are hidden, look at the "style" elements on the page: they hold the CSS rules that mark each generated class as displayed or hidden.

Using rvest version 1.0.0

library(rvest)
page <- read_html(url)

# find the style tags
styles <- page %>% html_elements("style") %>% html_text2()

# get the "display inline" key
# Assuming it is always the first rule of the second style node
# (the backslashes must be doubled inside an R string)
displayInline <- gsub("(.*?) \\{.*", "\\1", styles[2])

# find span nodes with both class and role specified
parent <- page %>% html_elements(xpath=".//span[@class='POSITIVE' and @role='text']")

# retrieve the dates: keep only the visible children and join their text
sapply(parent, function(p) {p %>% html_elements(displayInline) %>% html_text() %>% paste(collapse = "")})

[1] "Sold  Mar 8, 2021"  "Sold  Mar 3, 2021"  "Sold  Feb 27, 2021" "Sold  Feb 22, 2021" "Sold  Feb 20, 2021" "Sold  Feb 19, 2021" "Sold  Feb 5, 2021" 
[8] "Sold  Feb 4, 2021"  "Sold  Feb 3, 2021"  "Sold  Jan 31, 2021" "Sold  Jan 27, 2021" "Sold  Jan 22, 2021" "Sold  Jan 10, 2021" "Sold  Jan 3, 2021" 
[15] "Sold  Jan 1, 2021"  "Sold  Jan 1, 2021"  "Sold  Dec 30, 2020" "Sold  Dec 25, 2020" "Sold  Dec 22, 2020" "Sold  Dec 20, 2020" "Sold  Dec 11, 2020"
[22] "Sold  Mar 3, 2021"  "Sold  Jan 27, 2021" "Sold  Dec 25, 2020"
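
The approach above relies on the visible class being the first rule of the second style node. If that position ever shifts, one possible variation is to search the CSS text for the rule that declares display: inline and take its selector. The sketch below runs against a small made-up HTML string rather than the live page; the class names s-vis002 and s-hid001 and the exact rule syntax are assumptions for illustration only, and eBay's real markup may differ. It uses html_text() for the style nodes.

```r
library(rvest)

# Made-up markup imitating the interlaced text (class names are hypothetical)
html <- paste0(
  '<html><head>',
  '<style>span.s-hid001 {display: none} span.s-vis002 {display: inline}</style>',
  '</head><body>',
  '<span class="POSITIVE" role="text">',
  '<span class="s-vis002">Sold </span><span class="s-hid001">Q7</span>',
  '<span class="s-vis002">Mar 8, 2021</span><span class="s-hid001">JUI</span>',
  '</span>',
  '<span class="POSITIVE" role="text">',
  '<span class="s-hid001">K</span><span class="s-vis002">Sold </span>',
  '<span class="s-hid001">ZZ</span><span class="s-vis002">Mar 3, 2021</span>',
  '</span>',
  '</body></html>'
)

page <- read_html(html)

# Join the text of every style node into one CSS string
css <- page %>% html_elements("style") %>% html_text() %>% paste(collapse = " ")

# Capture the selector in front of the first rule body containing "display: inline"
pattern <- "([^{}]+)\\{[^}]*display:[[:space:]]*inline[^}]*\\}"
visible <- trimws(regmatches(css, regexec(pattern, css))[[1]][2])

# Same parent lookup as in the answer
parents <- page %>%
  html_elements(xpath = ".//span[@class='POSITIVE' and @role='text']")

# Keep only the visible children of each parent and join their text
dates <- vapply(parents, function(p) {
  p %>% html_elements(visible) %>% html_text() %>% paste(collapse = "")
}, character(1))

dates
# [1] "Sold Mar 8, 2021" "Sold Mar 3, 2021"
```

Note that this pattern would also accept a value such as inline-block, and it only picks up the first matching rule, so it is worth printing css and visible on a real page before trusting the result.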
