Tyler Knight

Reputation: 193

How to pull a product link from customer profile page on Amazon

I'm trying to get the product link from a customer's profile page using R's rvest package.

I've looked at various questions on Stack Overflow, including this one (could not read webpage with read_html using rvest package from r), but nothing I try returns the correct result.

For example on this profile page:

https://www.amazon.com/gp/profile/amzn1.account.AETT6GZORFV55BFNOAVFDIJ75QYQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8

I'd like to be able to return this link, with the end goal to extract the product id: B01A51S9Y2

https://www.amazon.com/Amagabeli-Stainless-Chainmail-Scrubber-Pre-Seasoned/dp/B01A51S9Y2?ref=pf_vv_at_pdctrvw_dp

library(dplyr)
library(rvest)
library(stringr)
library(httr)

# get url
url='https://www.amazon.com/gp/profile/amzn1.account.AETT6GZORFV55BFNOAVFDIJ75QYQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8'
x <- GET(url, add_headers('user-agent' = 'test'))
page <- read_html(x)

page %>%
  html_nodes("[class='a-link-normal profile-at-product-box-link a-text-normal']") %>%
  html_text()

# I also tested whether the ASIN appears anywhere in the page text, with no luck

test <- page %>%
  html_nodes("#a-page") %>%
  html_text()

grepl("B01A51S9Y2",test)

Thanks for the tip on RSelenium, @Qharr. That is helpful, but I'm still unsure how to extract the link or ASIN.

library(RSelenium)

driver <- rsDriver(browser=c("chrome"), port = 4574L, chromever = "77.0.3865.40")
rd <- driver[["client"]]
rd$open()
rd$navigate("https://www.amazon.com/gp/profile/amzn1.account.AETT6GZORFV55BFNOAVFDIJ75QYQ/ref=cm_cr_arp_d_gw_btm?ie=UTF8")
prod <- rd$findElement(using = "css selector", '.profile-at-product-box-link')
# getElementText is a method, so it needs parentheses to be called
prod$getElementText()

This doesn't return anything useful.

I then switched to getting the href attribute and was able to get the link:

prod <- rd$findElements(using = "css selector", '.profile-at-product-box-link')

for (link in seq_along(prod)) {
  print(prod[[link]]$getElementAttribute('href'))
}

Upvotes: 0

Views: 255

Answers (1)

QHarr

Reputation: 84465

That info is pulled in dynamically by a POST request the page makes after loading, which your initial rvest request doesn't capture. This subsequent request returns JSON containing the ASINs, the product links, and so on.


You can find it in the Network tab of the browser dev tools (F12). Press F5 to refresh the page, then examine the network traffic.


It is not a simple POST request to mimic, so I would just go with RSelenium to let the page render and then use the CSS selector

.profile-at-product-box-link

to gather a collection of webElements that you can loop over and extract the href attribute from.
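Once you have the href values, the product ID can be pulled out with a regular expression. The following is only a rough sketch in base R: extract_asin is a hypothetical helper name, and it assumes the ASIN is the 10-character uppercase alphanumeric code right after "/dp/", as in the example link from the question. The commented-out line shows how the links might be collected from the RSelenium elements in prod.

```r
# Hypothetical helper: return the 10-character code that follows "/dp/"
# in each URL, or NA when a URL has no such segment.
# Assumption: ASINs are 10 uppercase alphanumeric characters.
extract_asin <- function(urls) {
  matches <- regmatches(urls, regexec("/dp/([A-Z0-9]{10})", urls))
  vapply(
    matches,
    function(m) if (length(m) >= 2) m[2] else NA_character_,
    character(1)
  )
}

# With RSelenium, the links could be gathered from the webElements, e.g.:
# links <- unlist(lapply(prod, function(el) el$getElementAttribute("href")))

# Example link taken from the question
links <- "https://www.amazon.com/Amagabeli-Stainless-Chainmail-Scrubber-Pre-Seasoned/dp/B01A51S9Y2?ref=pf_vv_at_pdctrvw_dp"
asins <- extract_asin(links)
print(asins)
```

Links without a "/dp/" segment come back as NA, so they can be dropped afterwards with asins[!is.na(asins)].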

Upvotes: 1
