requests-html not finding page element

Question

I'm trying to navigate to this URL: https://www.instacart.com/store/wegmans/search_v3/horizon%201%25 and scrape data from the div with the class item-name item-row. There are two main problems: first, instacart.com requires a login before you can reach that URL; second, most of the page is generated with JavaScript.

I believe I've solved the first problem because my session.post(...) gets a 200 response code. I'm also pretty sure that r.html.render() is supposed to solve the second problem by rendering the JavaScript-generated HTML before I scrape it. Unfortunately, the last line of my code returns only an empty list, even though Selenium had no problem getting this element. Does anyone know why this isn't working?

from requests_html import HTMLSession
from bs4 import BeautifulSoup
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
session = HTMLSession()
res1 = session.get('http://www.instacart.com', headers=headers)
soup = BeautifulSoup(res1.content, 'html.parser')
token = soup.find('meta', {'name': 'csrf-token'}).get('content')
data = {"user": {"email": "alexanderjbusch@gmail.com", "password": "password"},
        "authenticity_token": token}
response = session.post('https://www.instacart.com/accounts/login', headers=headers, data=data)
print(response)
r = session.get("https://www.instacart.com/store/wegmans/search_v3/horizon%201%25", headers=headers)
r.html.render()
print(r.html.xpath("//div[@class='item-name item-row']"))

SIM · Accepted Answer

After logging in using the requests module and BeautifulSoup, you can use the link I already suggested in the comments to parse the required data, which is available as JSON. The following script should get you the name, quantity, price, and a link to each product. As written, it only retrieves 21 products, but the JSON content includes pagination information, so you can get all of the products by working with that pagination.

import json
import requests
from bs4 import BeautifulSoup

baseurl = 'https://www.instacart.com/store/'
data_url = "https://www.instacart.com/v3/retailers/159/module_data/dynamic_item_lists/cart_starters/storefront_canonical?origin_source_type=store_root_department&tracking.page_view_id=b974d56d-eaa4-4ce2-9474-ada4723fc7dc&source=web&cache_key=df535d-6863-f-1cd&per=30"

data = {"user": {"email": "alexanderjbusch@gmail.com", "password": "password"},
        "authenticity_token": ""}
headers = {
    'user-agent':'Mozilla/5.0',
    'x-requested-with': 'XMLHttpRequest'
}
with requests.Session() as s:
    # Fetch the home page to read the CSRF token that is sent with the login request
    res = s.get('https://www.instacart.com/', headers={'user-agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(res.text, 'lxml')
    token = soup.select_one("[name='csrf-token']").get('content')

    data["authenticity_token"] = token

    # Log in; the session keeps the cookies, so the next request reuses them
    s.post("https://www.instacart.com/accounts/login", json=data, headers=headers)
    resp = s.get(data_url, headers=headers)

    for item in resp.json()['module_data']['items']:
        name = item['name']
        quantity = item['size']
        price = item['pricing']['price']
        product_page = baseurl + item['click_action']['data']['container']['path']
        print(f'{name}\n{quantity}\n{price}\n{product_page}\n')

Partial output:

SB Whole Milk
1 gal
$3.90
https://www.instacart.com/store/items/item_147511418

Banana
At $0.69/lb
$0.26
https://www.instacart.com/store/items/item_147559922

Yellow Onion
At $1.14/lb
$0.82
https://www.instacart.com/store/items/item_147560764
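
To collect more than the first batch, the pagination idea can be sketched roughly as follows. This is only an illustration under assumptions: iter_all_items and fetch_page are hypothetical names, and the loop simply assumes that a page with no items marks the end. The real pagination fields should be read from the JSON response itself.

```python
from typing import Callable, Dict, Iterator


def iter_all_items(fetch_page: Callable[[int], Dict], max_pages: int = 50) -> Iterator[Dict]:
    """Yield items page by page until a page comes back empty.

    fetch_page is a hypothetical callable that takes a page number and
    returns the decoded JSON for that page. max_pages is a safety cap so
    the loop cannot run forever if the endpoint never returns an empty page.
    """
    for page in range(1, max_pages + 1):
        payload = fetch_page(page)
        # Same structure the script reads: payload['module_data']['items']
        items = payload.get('module_data', {}).get('items', [])
        if not items:
            break
        yield from items


# Offline demonstration with fake pages (no network access needed)
fake_pages = {
    1: {'module_data': {'items': [{'name': 'SB Whole Milk'}, {'name': 'Banana'}]}},
    2: {'module_data': {'items': [{'name': 'Yellow Onion'}]}},
}
demo_items = list(iter_all_items(lambda page: fake_pages.get(page, {})))
print([item['name'] for item in demo_items])
```

Inside the with block, fetch_page could be something like lambda page: s.get(data_url, headers=headers, params={'page': page}).json(), where the 'page' parameter name is a guess and not confirmed for this endpoint.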
