Reputation: 741
I would like to parse the webpage http://dcsd.nutrislice.com/menu/meadow-view/lunch/ to grab today's lunch menu. (I've built an Adafruit #IoT Thermal Printer and I'd like to automatically print the menu each day.)
I initially approached this using BeautifulSoup but it turns out that most of the data is loaded in JavaScript and I'm not sure BeautifulSoup can handle it. If you view source you'll see the relevant data stored in bootstrapData['menuMonthWeeks']
.
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://dcsd.nutrislice.com/menu/meadow-view/lunch/"
soup = BeautifulSoup(urllib2.urlopen(url).read())
This is an easy way to get the source and review.
My question is: what is the easiest way to extract this data so that I can do something with it? Literally, all I want is a string something like:
Southwest Cheese Omelet, Potato Wedges, The Harvest Bar (THB), THB - Cheesy Pesto Bread, Ham Deli Sandwich, Red Pepper Sticks, Strawberries
I've thought about using webkit to process the page and get the HTML (i.e. what a browser does) but that seems unnecessarily complex. I'd rather simply find something that can parse the bootstrapData['menuMonthWeeks']
data.
Upvotes: 25
Views: 72016
Reputation: 2946
I realize this is about four years later, but nutrislice (at least now) has an api you can get direct JSON from. Your kid's lunch from a couple days ago: http://dcsd.nutrislice.com/menu/api/digest/school/meadow-view/menu-type/lunch/date/2018/03/14/
Upvotes: 1
Reputation: 60153
Something like PhantomJS may be more robust, but here's some basic Python code to extract it the full menu:
import json
import re
import urllib2
text = urllib2.urlopen('http://dcsd.nutrislice.com/menu/meadow-view/lunch/').read()
menu = json.loads(re.search(r"bootstrapData\['menuMonthWeeks'\]\s*=\s*(.*);", text).group(1))
print menu
After that, you'll want to search through the menu for the date you're interested in.
EDIT: Some overkill on my part:
import itertools
import json
import re
import urllib2
text = urllib2.urlopen('http://dcsd.nutrislice.com/menu/meadow-view/lunch/').read()
menus = json.loads(re.search(r"bootstrapData\['menuMonthWeeks'\]\s*=\s*(.*);", text).group(1))
days = itertools.chain.from_iterable(menu['days'] for menu in menus)
day = next(itertools.dropwhile(lambda day: day['date'] != '2014-01-13', days), None)
if day:
print '\n'.join(item['food']['description'] for item in day['menu_items'])
else:
print 'Day not found.'
Upvotes: 11
Reputation: 1125248
All you need is a little string slicing:
import json
soup = BeautifulSoup(urllib2.urlopen(url).read())
script = soup.findAll('script')[1].string
data = script.split("bootstrapData['menuMonthWeeks'] = ", 1)[-1].rsplit(';', 1)[0]
data = json.loads(data)
JSON is, after all, a subset of JavaScript.
Upvotes: 15
Reputation: 122338
Without messing with json
, if you prefer, which it's not recommended, you can try the following:
import urllib2
import re
url = "http://dcsd.nutrislice.com/menu/meadow-view/lunch/"
data = urllib2.urlopen(url).readlines()[60].partition('=')[2].strip()
foodlist = []
prev = 'name'
for i in re.findall('"([^"]*)"', data):
if "The Harvest Bar (THB)" in i or i == "description" or i == "start_date":
prev = i
continue
if prev == 'name':
if i.startswith("THB - "):
i = i[6:]
foodlist.append(i)
prev = i
I guess this is what you'll ultimately need:
Orange Chicken Bowl
Roasted Veggie Pesto Pizza
Cheese Sandwich & Yogurt Tube
Steamed Peas
Peaches
Southwest Cheese Omelet
Potato Wedges
Cheesy Pesto Bread
Ham Deli Sandwich
Red Pepper Sticks
Strawberries
Hamburger
Cheeseburger
Potato Wedges
Chicken Minestrone Soup
Veggie Deli Sandwich
Baked Beans
Green Beans
Fruit Cocktail
Cheese Pizza
Pepperoni Pizza
Diced Chicken w/ Cornbread
Turkey Deli Sandwich
Celery Sticks
Blueberries
Cowboy Mac
BYO Asian Salad
Sunbutter Sandwich
Stir Fry Vegetables
Pineapple Tidbits
Enchilada Blanco
Sausage & Black Olive Pizza
Cheese Sandwich & Yogurt Tube
Southwest Black Beans
Red Pepper Sticks
Applesauce
BBQ Roasted Chicken.
Hummus Cup w/ Pita bread
Ham Deli Sandwich
Mashed potatoes w/ gravy
Celery Sticks
Kiwi
Popcorn Chicken Bowl
Tuna Salad w/ Pita Bread
Veggie Deli Sandwich
Corn Niblets
Blueberries
Cheese Pizza
Pepperoni Pizza
BYO Chef Salad
BYO Vegetarian Chef Salad
Turkey Deli Sandwich
Steamed Cauliflower
Banana, Whole
Bosco Sticks
Chicken Egg Roll & Chow Mein Noodles
Sunbutter Sandwich
California Blend Vegetables
Fresh Pears
Baked Mac & Cheese
Italian Dunker
Ham Deli Sandwich
Red Pepper Sticks
Pineapple Tidbits
Hamburger
Cheeseburger
Baked Fries
BYO Taco Salad
Veggie Deli Sandwich
Baked Beans
Coleslaw
Fresh Grapes
Cheese Pizza
Pepperoni Pizza
Diced Chicken w/ Cornbread
Turkey Deli Sandwich
Steamed Cauliflower
Fruit Cocktail
French Dip w/ Au Jus
Baked Fries
Turkey Noodle Soup
Sunbutter Sandwich
Green Beans
Warm Cinnamon Apples
Rotisserie Chicken
Mashed potatoes w/ gravy
Bacon Cheeseburger Pizza
Cheese Sandwich & Yogurt Tube
Steamed Peas
Apple Wedges
Turkey Chili
Cornbread Muffins
BYO Chef Salad
BYO Vegetarian Chef Salad
Ham Deli Sandwich
Celery Sticks
Fresh Pears
Beef, Bean & Red Chili Burrito
Popcorn Chicken & Breadstick
Veggie Deli Sandwich
California Blend Vegetables
Strawberries
Cheese Pizza
Pepperoni Pizza
Hummus Cup w/ Pita bread
Turkey Deli Sandwich
Green Beans
Orange Wedges
Bosco Sticks
Cheesy Bean Soft Taco Roll Up
Sunbutter Sandwich
Pinto Bean Cup
Baby Carrots
Blueberries
With json
:
import urllib2
import json
url = "http://dcsd.nutrislice.com/menu/meadow-view/lunch/"
for line in urllib2.urlopen(url):
if "bootstrapData['menuMonthWeeks']" in line:
data = json.loads(line.split("=")[1].strip('\n;'))
print data[0]["name"]
break
Upvotes: 1
Reputation: 11396
without BeautifulSoup, one simple way can we:
import urllib2
import json
url = "http://dcsd.nutrislice.com/menu/meadow-view/lunch/"
for line in urllib2.urlopen(url):
if "bootstrapData['menuMonthWeeks']" in line:
data = json.loads(line.split("=")[1].strip('\n;'))
print data[0]["last_updated"]
output:
2013-11-11T11:18:13.636
for a more generic way see JavaScript parser in Python
Upvotes: 1