Reputation: 27
I've been trying to resolve this all day and I can't figure out a solution. Please help! So, to learn web scraping, I've been practicing on this website:
https://www.net-a-porter.com/fr/fr/Shop/Designers/Fendi
The goal is to scrape the price of EVERY PRODUCT. So, thanks to the resources on this website and other internet users, I made this code, which works perfectly:
# remDr is the RSelenium remote driver (rD$client), already navigated to the page
option <- remDr$findElement(using = 'xpath', "//*/option[@value = 'view_all']")
option$clickElement()
priceNodes <- remDr$findElements(using = 'css selector', ".price")
price <- unlist(lapply(priceNodes, function(x) { x$getElementText() }))
price <- gsub("€", "", price)
price <- gsub(",", "", price)
price <- as.numeric(price)
So with this I got the result that I want, which is a list of 204 values (prices). Now I'd like to turn this entire process into a function so that I can apply it to a list of addresses (in this case, the pages of other brands). And obviously it did not work:
FPrice <- function(x) {
  url1 <- x
  remDr <- rD$client
  remDr$navigate(url1)
  iframe <- remDr$findElement("css", value = ".view-more-less")
  option <- remDr$findElement(using = 'xpath', "//*/option[@value = 'view_all']")
  option$clickElement()
  priceNodes <- remDr$findElements(using = 'css selector', ".price")
  price <- unlist(lapply(priceNodes, function(x) { x$getElementText() }))
}
When I apply it like this:
FPrice("https://www.net-a-porter.com/fr/fr/Shop/Designers/Fendi")
an error message comes up and I don't get the data that I am looking for:
Selenium message:stale element reference: element is not attached to the page document
(Session info: chrome=61.0.3163.100)
(Driver info: chromedriver=2.33.506106 (8a06c39c4582fbfbab6966dbb1c38a9173bfb1a2),platform=Mac OS X 10.12.6 x86_64)
I think it is because there is a function inside of the function... Can anyone please help me resolve the problem? Thanks.
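In case it is actually a timing issue rather than the nesting, I also wondered whether adding a pause after the click would help, something like this (just a guess, not confirmed):
option$clickElement()
Sys.sleep(5) # guess: give the page time to re-render the full product list after "view all"
priceNodes <- remDr$findElements(using = 'css selector', ".price")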
P.S. With rvest I made another function:
Price <- function(x) {
  url1 <- x
  webpage <- read_html(url1)
  price_data_html <- html_nodes(webpage, ".price")
  price_data <- html_text(price_data_html)
  price_data <- gsub("€", "", price_data)
  price_data <- gsub(",", "", price_data)
  price_data <- as.numeric(price_data)
  return(price_data)
}
And it worked fine. I even applied it to a vector containing a list of addresses. However, with rvest I cannot configure the browser to select the "view all" option, so I only get 60 observations, while some brands offer more than 200 products, as is the case with Fendi.
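Roughly, the way I applied it over several brands looked like this (the second URL is just a placeholder for another brand page):
brand_urls <- c(
  "https://www.net-a-porter.com/fr/fr/Shop/Designers/Fendi",
  "https://www.net-a-porter.com/fr/fr/Shop/Designers/Valentino" # placeholder for any other brand page
)
prices_by_brand <- lapply(brand_urls, Price) # one numeric vector per brand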
Thank you very much for your patience. Hope to read from you very soon!
Upvotes: 0
Views: 384
Reputation: 78792
Astoundingly (I verified this), the site does not explicitly prevent scraping in its Terms & Conditions, and they left the /fr/fr path out of their robots.txt exclusions, i.e. you got lucky. This is likely an oversight on their part.
However, there is a non-Selenium approach to this. The main page loads the product <div>s via XHR calls, so find that via browser Developer Tools "Network" tab inspection and you can scrape away either page by page or completely. Here are the required 📦s:
library(httr)
library(rvest)
library(xml2)  # for xml_integer()
library(purrr)
For the paginated approach, we set up a function:
get_prices_on_page <- function(pg_num = 1) {

  Sys.sleep(5) # be kind

  GET(
    url = "https://www.net-a-porter.com/fr/fr/Shop/Designers/Fendi",
    query = list(
      view = "jsp",
      sale = "0",
      exclude = TRUE,
      pn = pg_num,
      npp = 60,
      image_view = "product",
      dScroll = "0"
    )
  ) -> res

  pg <- content(res, as = "parsed")

  list(
    total_pgs = html_node(pg, "div.data_totalPages") %>% xml_integer(),
    total_items = html_node(pg, "div.data_totalItems") %>% xml_integer(),
    prices_on_page = html_nodes(pg, "span.price") %>%
      html_text() %>%
      gsub("[^[:digit:]]", "", .) %>%
      as.numeric()
  )

}
Then get the first page:
prices <- get_prices_on_page(1)
and then iterate until we're done, smushing everything together:
c(
  prices$prices_on_page,
  map(2:prices$total_pgs, get_prices_on_page) %>%
    map("prices_on_page") %>%
    flatten_dbl()
) -> all_prices
all_prices
## [1] 601 1190 1700 1480 1300 590 950 1590 3200 410 950 595 1100 690
## [15] 900 780 2200 790 1300 410 1000 1480 750 495 850 850 900 450
## [29] 1600 1750 2200 750 750 1550 750 850 1900 1190 1200 1650 2500 580
## [43] 2000 2700 3900 1900 600 1200 650 950 600 800 1100 1200 1000 1100
## [57] 2500 1000 500 1645 550 1505 850 1505 850 2000 400 790 950 800
## [71] 500 2000 500 1300 350 550 290 550 450 2700 2200 650 250 200
## [85] 1700 250 250 300 450 800 800 800 900 600 900 375 5500 6400
## [99] 1450 3300 2350 1390 2700 1500 1790 2200 3500 3100 1390 1850 5000 1690
## [113] 2700 4800 3500 6200 3100 1850 1950 3500 1780 2000 1550 1280 3200 1350
## [127] 2700 1350 1980 3900 1580 18500 1850 1550 1450 1600 1780 1300 1980 1450
## [141] 1320 1460 850 1650 290 190 520 190 1350 290 850 900 480 450
## [155] 850 780 1850 750 450 1100 1550 550 495 850 890 850 590 595
## [169] 650 650 495 595 330 480 400 220 130 130 290 130 250 230
## [183] 210 900 380 340 430 380 370 390 460 255 300 480 550 410
## [197] 350 350 280 190 350 550 450 430
Or, we can get them all in one fell swoop by using the "view all on one page" feature the site has:
pg <- read_html("https://www.net-a-porter.com/fr/fr/Shop/Designers/Fendi?view=jsp&sale=0&exclude=true&pn=1&npp=view_all&image_view=product&dScroll=0")
html_nodes(pg, "span.price") %>%
html_text() %>%
gsub("[^[:digit:]]", "", .) %>%
as.numeric() -> all_prices
all_prices
# same result as above
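If you want to apply this to other designers (your original goal), one option is to wrap that one-shot version in a function and map over designer names, reusing the rvest and purrr packages loaded above. A minimal sketch, assuming the same view_all query parameters work for other designer pages (the non-Fendi slug below is only a placeholder):
get_designer_prices <- function(designer) {
  Sys.sleep(5) # keep being kind here too
  sprintf(
    "https://www.net-a-porter.com/fr/fr/Shop/Designers/%s?view=jsp&sale=0&exclude=true&pn=1&npp=view_all&image_view=product&dScroll=0",
    designer
  ) %>%
    read_html() %>%
    html_nodes("span.price") %>%
    html_text() %>%
    gsub("[^[:digit:]]", "", .) %>%
    as.numeric()
}
# e.g. one price vector per designer, named by slug ("Gucci" is a placeholder)
# designer_prices <- map(set_names(c("Fendi", "Gucci")), get_designer_prices)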
Please keep the crawl delay in if you use the paginated approach, and please don't misuse the content. While they don't disallow scraping, the T&C says it's for personal product-choosing use only.
Upvotes: 1