Reputation: 1233
I'm scraping eBay sold-listing data using rvest.
Recently, eBay has started injecting hidden text into the readable text - see the image and scraped data.
Here is a URL example - you may or may not get the interlaced text: Example URL
XPath to a line item
//*[@id="srp-river-results"]/ul/li[1]/div/div[2]/div[2]/div/span[1]
XPath I use to get all lines
//*[@id="srp-river-results"]/ul/li/div/div[2]
I need the text from the spans with class s-a4v02P,
aggregated by line item.
I get something like this:
"So2ld Mar 8,D1 2021JUI" "KSold MQaUr2V 3,E 20R2KC1R"
and so on
The question is: how can I get just "Sold Mar 8, 2021" "Sold Mar 3, 2021"
and so on?
The code I have so far:
library(rvest)

readHTML <- url %>%
  read_html()

Title <- readHTML %>%
  html_nodes(xpath = '//*[@id="srp-river-results"]/ul/li/div/div[2]/a/h3') %>%
  html_text()

SoldDateTop <- readHTML %>%
  html_nodes(xpath = '//*[@id="srp-river-results"]/ul/li/div/div[2]') %>%
  html_nodes("[class='s-item__title--tagblock ']") %>%
  html_nodes("[class='POSITIVE']") %>%
  # html_nodes("[class='s-8kjgi5']") %>% <-- this class name changes
  html_text()
Upvotes: 0
Views: 561
Reputation: 24089
The page contains a "style" element whose rules determine which tags are displayed and which are hidden. Reading the selector of the "display: inline" rule tells you which spans hold the real text.
Using rvest version 1.0.0
library(rvest)

page <- read_html(url)

# Find the style tags
styles <- page %>% html_elements("style") %>% html_text2()

# Get the selector of the "display: inline" rule,
# assuming it is always the first selector in the second style element
displayInline <- gsub("(.*?) \\{.*", "\\1", styles[2])

# Find span nodes with both class and role specified
parent <- page %>% html_elements(xpath = ".//span[@class='POSITIVE' and @role='text']")

# Retrieve the dates by joining only the visible fragments of each node
sapply(parent, function(p) {
  p %>% html_elements(displayInline) %>% html_text() %>% paste(collapse = "")
})
[1] "Sold Mar 8, 2021" "Sold Mar 3, 2021" "Sold Feb 27, 2021" "Sold Feb 22, 2021" "Sold Feb 20, 2021" "Sold Feb 19, 2021" "Sold Feb 5, 2021"
[8] "Sold Feb 4, 2021" "Sold Feb 3, 2021" "Sold Jan 31, 2021" "Sold Jan 27, 2021" "Sold Jan 22, 2021" "Sold Jan 10, 2021" "Sold Jan 3, 2021"
[15] "Sold Jan 1, 2021" "Sold Jan 1, 2021" "Sold Dec 30, 2020" "Sold Dec 25, 2020" "Sold Dec 22, 2020" "Sold Dec 20, 2020" "Sold Dec 11, 2020"
[22] "Sold Mar 3, 2021" "Sold Jan 27, 2021" "Sold Dec 25, 2020"
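For anyone who wants to try the idea without hitting eBay, here is a minimal sketch on a made-up HTML snippet. The class names (s-vis01, s-hid02) and the markup are assumptions for illustration only, not eBay's actual page, and instead of relying on the position of the rule it looks for the selector that precedes a "display: inline" declaration.

```r
library(rvest)

# Hypothetical stand-in for the real page: one rule shows the genuine
# fragments, the other hides the injected decoys. Class names are made up.
html <- '
<html><head>
<style>span.s-vis01 { display: inline; } span.s-hid02 { display: none; }</style>
</head><body>
<span class="POSITIVE" role="text"><span class="s-vis01">So</span><span class="s-hid02">2</span><span class="s-vis01">ld Mar 8,</span><span class="s-hid02">D1</span><span class="s-vis01"> 2021</span><span class="s-hid02">JUI</span></span>
</body></html>'

page <- read_html(html)
css <- page %>% html_element("style") %>% html_text()

# Take the selector text that sits directly before a block containing
# "display: inline" (the lookahead cannot cross a closing brace)
visible_selector <- trimws(regmatches(
  css,
  regexpr("[^{}]+(?=\\{[^}]*display:\\s*inline)", css, perl = TRUE)
))

# Same parent lookup as above, then join only the visible fragments
parents <- page %>%
  html_elements(xpath = ".//span[@class='POSITIVE' and @role='text']")

sold <- sapply(parents, function(p) {
  p %>% html_elements(visible_selector) %>% html_text() %>% paste(collapse = "")
})
print(sold)
```

On this snippet, sold should come out as "Sold Mar 8, 2021". Matching on the declaration rather than on styles[2] is only a guess at something more robust; if the real stylesheet expresses visibility differently, the regular expression would need adjusting.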
Upvotes: 3
Reputation: 84465
A similar approach, but based on the observation that the variable part of the class value has length 6 for visible classes, so you can extract the appropriate visible class value from the CSS style instructions.
library(rvest)
library(magrittr)
library(stringr)
library(purrr) # provides map()

# Join the text of the visible spans inside one listing
get_sold_date <- function(nodelist, visible_class) {
  nodelist %>%
    html_nodes(paste0('.POSITIVE span.', visible_class)) %>%
    html_text() %>%
    paste(collapse = '')
}

# Extract the first class of the form s-xxxxxx from the style text
get_visible_class <- function(node) {
  stringr::str_extract(node, '(s-[a-z0-9]{6})')
}

page <- read_html('https://www.ebay.com/sch/i.html?_nkw=Star%20Wars%20Black%20Series%20%20-POTF%20-POTF2%20-POTFII%20-Vintage%20%20Boba%20Fett%20Han%20Solo%20(SDCC,San%20Diego,Con)%20Carbonite%20-Walgreens%20-3.75%20-3/4%20-Connexions%20-Die%20-Lot%20-Topps%20-Sideshow%20-1/6%20-1/12%20-AFA%20-UKG%20-Custom%20-Signature%20-Lego%20-Funko%20-Pop&LH_Sold=1&LH_ItemCondition=3&_dmd=7&_ipg=200&LH_Complete=1&LH_PrefLoc=1')

listings <- page %>%
  html_nodes('#srp-river-results .s-item')

visible_class <- get_visible_class(
  page %>%
    html_node('style[type="text/css"]') %>%
    html_text(trim = TRUE)
)

dates <- map(listings, get_sold_date, visible_class)
print(dates)
This also means you can probably skip extracting the visible class and instead use a filter function of some sort based on the length of the class value being 8, i.e. html_nodes('.POSITIVE span') %>% html_attr('class') %>% map(nchar) == 8. I will have a look at that later today.
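A rough sketch of that length-based filter, run on a made-up snippet rather than the live page. The class names and the premise that hidden classes have a different length are assumptions for illustration, and get_sold_date_by_length is a hypothetical helper. Note that the parent's own class, POSITIVE, is itself 8 characters long, so the XPath below selects only descendant spans.

```r
library(rvest)

# Hypothetical markup: visible fragments carry an 8-character class,
# decoys a shorter one. These class names are invented for the example.
html <- '
<div>
<span class="POSITIVE"><span class="s-abc123">So</span><span class="s-ab12">2</span><span class="s-abc123">ld Mar 8,</span><span class="s-ab12">D1</span><span class="s-abc123"> 2021</span><span class="s-ab12">JUI</span></span>
<span class="POSITIVE"><span class="s-ab12">K</span><span class="s-abc123">Sold Mar 3,</span><span class="s-ab12">E</span><span class="s-abc123"> 2021</span><span class="s-ab12">R</span></span>
</div>'

page <- read_html(html)
listings <- page %>% html_elements(".POSITIVE")

# Keep only descendant spans whose class value has exactly n characters
get_sold_date_by_length <- function(node, n = 8) {
  spans <- node %>% html_elements(xpath = ".//span")
  classes <- spans %>% html_attr("class")
  keep <- !is.na(classes) & nchar(classes) == n
  paste(html_text(spans[keep]), collapse = "")
}

dates <- vapply(listings, get_sold_date_by_length, character(1))
print(dates)
```

On this snippet, dates should be "Sold Mar 8, 2021" and "Sold Mar 3, 2021". Whether the length rule holds on the real page over time is untested here, so treat it as a fallback to check against the style-based approach.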
Upvotes: 2