Reputation: 2085
My goal is to scrape all this diamond data from bluenile.com. I've got some code that seems to be doing that, but it only grabs the first 61 rows.
By the way, I am using the "SelectorGadget" Chrome extension to get the CSS selectors. If I scroll down a little, the highlighting stops. Does this have something to do with how the website loads its data?
library('rvest')
le_url <- "https://www.bluenile.com/diamonds/round-cut?track=DiaSearchRDmodrn"
webpage <- read_html(le_url)
shape_data_html <- html_nodes(webpage,'.shape')
price_data_html <- html_nodes(webpage,'.price')
carat_data_html <- html_nodes(webpage,'.carat')
cut_data_html <- html_nodes(webpage,'.cut')
color_data_html <- html_nodes(webpage,'.color')
clarity_data_html <- html_nodes(webpage,'.clarity')
#Converting data to text
shape_data <- html_text(shape_data_html)
price_data <- html_text(price_data_html)
carat_data <- html_text(carat_data_html)
cut_data <- html_text(cut_data_html)
color_data <- html_text(color_data_html)
clarity_data <- html_text(clarity_data_html)
# make a data.frame
le_mat <- cbind(shape_data, price_data, carat_data, cut_data, color_data, clarity_data)
le_df <- le_mat[-1,]
colnames(le_df) <- le_mat[1,]
Upvotes: 1
Views: 104
Reputation: 84465
Data is added dynamically via an API call as you scroll down the page. The API call has a query string that lets you specify startIndex (the start row) and pageSize (the number of results per page). The maximum results per page seems to be 1000. The response is JSON, from which you can extract all the info you want, including the total number of rows, available under the key countRaw. So you can request the initial 1000 rows, parse out the total row count (countRaw), and then loop, adjusting the startIndex parameter until you have all the results.
You can use a JSON parser, e.g. jsonlite, to handle the JSON response.
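As a rough sketch of the paging arithmetic described above, the start offsets can be computed from the total row count before any request is made. The helper name page_starts and the trimmed-down query string are assumptions for illustration only; a real request would likely need the full query string from the example call, and startIndex is assumed to be zero-based because the first request uses startIndex=0.

```r
# Hypothetical helper (not part of any library): the startIndex values
# needed to cover `total` rows when each request returns up to `page_size`.
page_starts <- function(total, page_size = 1000) {
  if (total <= 0) return(numeric(0))
  seq(0, total - 1, by = page_size)
}

# Illustrative template only: the query string is shortened for readability.
template <- "https://www.bluenile.com/api/public/diamond-search-grid/v2?startIndex=%d&pageSize=%d&shape=RD"

starts <- page_starts(2500)
# as.integer() keeps large offsets from being printed as e.g. "1e+05"
urls <- sprintf(template, as.integer(starts), 1000L)
print(urls)
```

Each element of urls could then be passed to jsonlite::fromJSON() in turn.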
Example API endpoint call for first 1000 results:
library(jsonlite)
url <- 'https://www.bluenile.com/api/public/diamond-search-grid/v2?startIndex=0&pageSize=1000&_=1562612289615&sortDirection=asc&sortColumn=default&shape=RD&hasVisualization=true&isFiltersExpanded=false&astorFilterActive=false&country=USA&language=en-us&currency=USD&productSet=BN&skus='
r <- jsonlite::fromJSON(url)
print(r$countRaw)
You get a list of 8 elements from each call. r$results is a dataframe containing the info of main interest.
Part of response:
Given the indicated result count I was expecting I could do something like (bearing in mind my limited R experience):
total <- r$countRaw
url2 <- 'https://www.bluenile.com/api/public/diamond-search-grid/v2?startIndex=placeholder&pageSize=1000&_=1562612289615&sortDirection=asc&sortColumn=default&shape=RD&hasVisualization=true&isFiltersExpanded=false&astorFilterActive=false&country=USA&language=en-us&currency=USD&productSet=BN&skus='
if (total > 1000) {
  # startIndex is zero-based, so the last useful offset is at most total - 1
  for (i in seq(1000, total - 1, by = 1000)) {
    # format() stops large offsets being inserted as e.g. "1e+05"
    newUrl <- gsub("placeholder", format(i, scientific = FALSE), url2)
    newdf <- jsonlite::fromJSON(newUrl)$results
    # do something with newdf e.g. merge
  }
}
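For the "do something with df" step, one option is to collect each page in a list and bind the pages at the end. The sketch below is an assumption-laden illustration: fetch_page is a made-up stand-in that fabricates rows so the example runs offline, and in practice it would be replaced by jsonlite::fromJSON(newUrl)$results. The loop also stops as soon as a page comes back empty.

```r
# Hypothetical stand-in for jsonlite::fromJSON(newUrl)$results.
# It fabricates rows purely so this sketch runs without network access.
fetch_page <- function(start, page_size, total) {
  if (start >= total) {
    return(data.frame(id = integer(0), price = numeric(0)))
  }
  idx <- seq(start + 1, min(start + page_size, total))
  data.frame(id = idx, price = idx * 10)
}

total <- 2500      # would come from r$countRaw
page_size <- 1000
starts <- seq(0, total - 1, by = page_size)

pages <- list()
for (start in starts) {
  page <- fetch_page(start, page_size, total)
  if (nrow(page) == 0) break          # no more data: stop early
  pages[[length(pages) + 1]] <- page  # keep each page for binding later
}

all_rows <- do.call(rbind, pages)
print(nrow(all_rows))
```

If the real results contain nested columns, jsonlite::rbind_pages(pages) may be a better fit than do.call(rbind, pages).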
However, it seems that results come back only for the first two calls, i.e. the initial df from r$results shown above and then:
url2 <- 'https://www.bluenile.com/api/public/diamond-search-grid/v2?startIndex=1000&pageSize=1000&_=1562612289615&sortDirection=asc&sortColumn=default&shape=RD&hasVisualization=true&isFiltersExpanded=false&astorFilterActive=false&country=USA&language=en-us&currency=USD&productSet=BN&skus='
r <- jsonlite::fromJSON(url2)
df2 <- r$results
Searching the page with the CSS selector .row yields 1002 matches, which does not agree with the total shown as "All diamonds", so I think there is some exploration to do around the filters.
Upvotes: 2