Jacksonsox

Reputation: 1233

Rvest Split Data by Class Name where the class names change

I'm scraping sold-listing data from eBay using rvest.

Recently, eBay has started injecting hidden text into the readable text - see the image and scraped data.

Here is a URL example - you may or may not get the interlaced text: Example URL

XPath to a line item

//*[@id="srp-river-results"]/ul/li[1]/div/div[2]/div[2]/div/span[1]

XPath I use to get all lines

//*[@id="srp-river-results"]/ul/li/div/div[2]

I need only the text from the visible spans (class s-a4v02P in this example), with that text concatenated per line item.

I get something like this:

"So2ld Mar 8,D1 2021JUI" "KSold MQaUr2V 3,E 20R2KC1R" and so on

The question is: how can I get just "Sold Mar 8, 2021" "Sold Mar 3, 2021" and so on?

The code I have so far:

library(rvest)

readHTML <- url %>%
    read_html()

# listing titles
Title <- readHTML %>%
    html_nodes(xpath='//*[@id="srp-river-results"]/ul/li/div/div[2]/a/h3') %>%
    html_text()

# sold dates (currently returns the interlaced visible + hidden text)
SoldDateTop <- readHTML %>%
    html_nodes(xpath='//*[@id="srp-river-results"]/ul/li/div/div[2]') %>%
    html_nodes("[class='s-item__title--tagblock ']") %>%
    html_nodes("[class='POSITIVE']") %>%
    # html_nodes("[class='s-8kjgi5']") %>% <-- this class name changes
    html_text()

Example of the "hidden" text

Upvotes: 0

Views: 561

Answers (2)

Dave2e

Reputation: 24089

The page contains a “style” element whose rules hold the display/hidden keys; these determine which tags are shown and which are hidden.

Using rvest version 1.0.0

library(rvest)
page <- read_html(url)

# find the style tags
styles <- page %>% html_elements("style") %>% html_text2()

# get the "display inline" key
# assuming it is always the selector of the first rule in the second style element
displayInline <- gsub("(.*?) \\{.*", "\\1", styles[2])

# find span nodes with both class and role specified
parent <- page %>% html_elements(xpath = ".//span[@class='POSITIVE' and @role='text']")

# retrieve the dates: keep only the visible child spans and join their text
sapply(parent, function(p) {
  p %>% html_elements(displayInline) %>% html_text() %>% paste(collapse = "")
})

[1] "Sold  Mar 8, 2021"  "Sold  Mar 3, 2021"  "Sold  Feb 27, 2021" "Sold  Feb 22, 2021" "Sold  Feb 20, 2021" "Sold  Feb 19, 2021" "Sold  Feb 5, 2021" 
[8] "Sold  Feb 4, 2021"  "Sold  Feb 3, 2021"  "Sold  Jan 31, 2021" "Sold  Jan 27, 2021" "Sold  Jan 22, 2021" "Sold  Jan 10, 2021" "Sold  Jan 3, 2021" 
[15] "Sold  Jan 1, 2021"  "Sold  Jan 1, 2021"  "Sold  Dec 30, 2020" "Sold  Dec 25, 2020" "Sold  Dec 22, 2020" "Sold  Dec 20, 2020" "Sold  Dec 11, 2020"
[22] "Sold  Mar 3, 2021"  "Sold  Jan 27, 2021" "Sold  Dec 25, 2020"
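To see what the gsub() step does in isolation, here is a minimal base R sketch. The CSS string is made up for illustration (the real content of styles[2] may look different); it only assumes that the first rule in the text is the visible one.

```r
# Made-up CSS text standing in for styles[2]; the real rules on the page may differ.
css <- "span.s-a4v02P { display: inline; } span.s-8kjgi5 { display: none; }"

# Same pattern as in the answer: keep everything before the first " {".
display_inline <- gsub("(.*?) \\{.*", "\\1", css)
print(display_inline)
```

The result is a CSS selector string, which is why it can be passed straight to html_elements() in the last step.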

Upvotes: 3

QHarr

Reputation: 84465

A similar approach, but based on the observation that the variable part of the class value has length 6 for visible classes, so you can extract the appropriate visible class value from the CSS style instructions:

library(rvest)
library(magrittr)
library(stringr)
library(purrr)  # provides map(), used below

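# join the text of the visible spans inside one listing
get_sold_date <- function(nodelist, visible_class){
  nodelist %>%
    html_nodes(paste0('.POSITIVE span.', visible_class)) %>%
    html_text() %>%
    paste(collapse = '')
}

# first class name of the form "s-" plus six lowercase letters/digits
get_visible_class <- function(node){
  stringr::str_extract(node, '(s-[a-z0-9]{6})')
}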

page <- read_html('https://www.ebay.com/sch/i.html?_nkw=Star%20Wars%20Black%20Series%20%20-POTF%20-POTF2%20-POTFII%20-Vintage%20%20Boba%20Fett%20Han%20Solo%20(SDCC,San%20Diego,Con)%20Carbonite%20-Walgreens%20-3.75%20-3/4%20-Connexions%20-Die%20-Lot%20-Topps%20-Sideshow%20-1/6%20-1/12%20-AFA%20-UKG%20-Custom%20-Signature%20-Lego%20-Funko%20-Pop&LH_Sold=1&LH_ItemCondition=3&_dmd=7&_ipg=200&LH_Complete=1&LH_PrefLoc=1')
listings <- page %>% 
  html_nodes('#srp-river-results .s-item')

visible_class <- get_visible_class(page %>%
                                     html_node('style[type="text/css"]') %>%
                                     html_text(trim = TRUE))

dates <- map(listings, get_sold_date, visible_class)

print(dates)
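The extraction done by get_visible_class() can be tried without the page. Below is a rough base R equivalent of the str_extract() call, run on made-up style text; both class names are hypothetical, and it assumes the visible rule appears first.

```r
# Made-up style text; the class names are hypothetical.
css <- "span.s-1a2b3c { display: inline; } span.s-9z8y7x6w { display: none; }"

# First occurrence of "s-" followed by six lowercase letters or digits.
pattern <- "s-[a-z0-9]{6}"
visible_class <- regmatches(css, regexpr(pattern, css))
print(visible_class)

# The pattern is lowercase-only, so a class such as the question's
# "s-a4v02P" (with an uppercase letter) is not matched.
no_match <- regmatches("s-a4v02P", regexpr(pattern, "s-a4v02P"))
print(length(no_match))
```

If the class names on the page contain uppercase letters, as s-a4v02P in the question does, the character class would need to be widened to [A-Za-z0-9].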

This also means you can probably skip extracting the appropriate class and instead use a filter of some sort based on the length of the class value being 8, i.e. html_nodes('.POSITIVE span') %>% html_attr('class') %>% map(nchar) == 8. I will have a look at that later today.
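A minimal sketch of that length-based filter, in base R on plain vectors. The classes and texts below are hypothetical stand-ins for what html_attr('class') and html_text() might return for the child spans of one listing, and the sketch assumes hidden classes never have length 8.

```r
# Hypothetical class/text pairs for the child spans of one listing.
classes <- c("s-1a2b3c", "s-9z8y7x6w", "s-1a2b3c",
             "s-9z8y7x6w", "s-1a2b3c", "s-9z8y7x6w")
texts   <- c("So", "2", "ld Mar 8,", "D1", " 2021", "JUI")

# Interlaced text, as in the question.
raw_text <- paste(texts, collapse = "")

# Keep only spans whose class value is 8 characters long, then join.
visible   <- nchar(classes) == 8
sold_date <- paste(texts[visible], collapse = "")
print(sold_date)
```

This avoids reading the style element at all, but it breaks if visible and hidden class names ever share the same length.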

Upvotes: 2
