mumutuxin

Reputation: 21

How to scrape the data of nutrition facts for products from walmart.com?

I have tried urllib with BeautifulSoup, but soup.select with the relevant tags always returns an empty result. I am new to Python. Thank you so much for your help in advance!

My code is below for reference.

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import pandas as pd
url="https://www.walmart.com/ip/Twin-Pack-Kellogg-s-Frosted-Mini-Wheats-Breakfast-Cereal-48-Oz/940504168"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
name_box = soup.select('div.nutrition-facts-all-facts-servingSize.div.span')
print(name_box)

Upvotes: 0

Views: 1392

Answers (1)

Martin Evans

Reputation: 46779

You have picked a rather awkward page to start playing with web scraping, as the page you are trying to get relies heavily on JavaScript rendering. As such, you cannot simply pass the downloaded HTML to BeautifulSoup and get the information you want: the HTML you receive will be different from the HTML you see when inspecting the rendered page in a browser.

You could investigate using something like Selenium to obtain the final HTML via a browser and then parse that using BeautifulSoup. Alternatively, the quickest approach is to see if the information you want is already buried somewhere in what you have. In this case you can find it as JSON buried inside one of the <script> sections that is returned.

The JSON can be extracted using the following code:

import urllib.request
from bs4 import BeautifulSoup
import json
import re

url = "https://www.walmart.com/ip/Twin-Pack-Kellogg-s-Frosted-Mini-Wheats-Breakfast-Cereal-48-Oz/940504168"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

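for script in soup.find_all('script'):
    if '_setReduxState' in script.text:
        # Raw string avoids invalid-escape warnings; the braces are escaped so they match literally
        re_json = re.search(r'__WML_REDUX_INITIAL_STATE__ = (\{.*\});\}', script.text)
        if re_json is None:
            continue  # the marker was present but the assignment was not in the expected form
        data = json.loads(re_json.group(1))
        product_id = data['product']['midasContext']['productId']
        print(data['product']['idmlMap'][product_id]['modules']['NutritionFacts'])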

This will give you data holding a very deep JSON structure containing all of the information you probably want. I suggest printing data to see all the information you have available.
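Printing the whole of data can be overwhelming, so one option is to list only the key paths down to a limited depth. The following is a minimal sketch: iter_key_paths is a hypothetical helper, and sample is a made-up stand-in for data (the real structure is far larger and its contents may differ).

```python
def iter_key_paths(node, prefix=(), max_depth=3):
    """Yield (path, type_name) for every key or index down to max_depth levels."""
    if len(prefix) >= max_depth:
        return
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        items = enumerate(node)
    else:
        return  # scalars have no children to list
    for key, value in items:
        path = prefix + (key,)
        yield path, type(value).__name__
        yield from iter_key_paths(value, path, max_depth)


# Hypothetical stand-in for the parsed `data`
sample = {
    "product": {
        "midasContext": {"productId": "ABC123"},
        "idmlMap": {"ABC123": {"modules": {"NutritionFacts": {}}}},
    }
}

for path, kind in iter_key_paths(sample, max_depth=3):
    print(" -> ".join(map(str, path)), f"({kind})")
```

Raising max_depth shows more of the structure once you know which branch you are interested in.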

For example data['product']['idmlMap'][product_id]['modules']['NutritionFacts'] gives you all the nutrition information, but you will probably need to be a bit more specific to get the exact information you want.
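Because the structure is so deep, a missing key anywhere along the chain raises an exception. A small lookup helper can make drilling down less fragile. This is only a sketch: dig is a hypothetical helper, and the sample dictionary and its servingSize field are invented for illustration, not taken from the real page.

```python
def dig(node, *keys, default=None):
    """Follow keys/indexes into nested dicts and lists; return default if any step is missing."""
    for key in keys:
        try:
            node = node[key]
        except (KeyError, IndexError, TypeError):
            return default
    return node


# Hypothetical stand-in for the parsed `data`; field names inside NutritionFacts are made up
sample = {
    "product": {
        "midasContext": {"productId": "ABC123"},
        "idmlMap": {
            "ABC123": {"modules": {"NutritionFacts": {"servingSize": "1 cup"}}}
        },
    }
}

product_id = dig(sample, "product", "midasContext", "productId")
facts = dig(sample, "product", "idmlMap", product_id, "modules", "NutritionFacts", default={})
print(facts)
```

With the real data you would pass data instead of sample and extend the key list once you have found the exact field you need.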

Some of the elements in this structure contain the HTML used on the page, so you might then need to further parse some of these to extract the bits you want.
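As a rough illustration of that second parsing step, the sketch below pulls the text out of an HTML fragment held in a string. It uses the standard library's html.parser so that it runs without extra installs; passing the fragment to BeautifulSoup, as in the code above, would work just as well. TextExtractor, fragment_to_text and the fragment itself are hypothetical, and the real markup may look quite different.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the non-empty text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only nodes between tags
            self.chunks.append(text)


def fragment_to_text(fragment):
    """Return the list of text pieces found in an HTML fragment."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    return parser.chunks


# Hypothetical value of the kind that might be stored inside the JSON
fragment = "<div><span>Serving Size</span> <b>1 cup</b></div>"
print(fragment_to_text(fragment))  # ['Serving Size', '1 cup']
```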

Upvotes: 1
