nashr rafeeg

Reputation: 829

Python script optimization for App Engine

I have the following script, which I am using to scrape data from my university website and insert it into a GAE database:

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

__author__ = "Nash Rafeeq"

url = "http://webspace.apiit.edu.my/schedule/timetable.jsp"
viewurl = "http://localhost:8000/timekeeper/intake/checkintake/"
inserturl = "http://localhost:8000/timekeeper/intake/addintake/"
print url
mech = Browser()
try:
    page = mech.open(url)
    html = page.read()
except Exception, err:
    print str(err)
soup = BeautifulSoup(html)
# the first <select> element holds one <option> per intake code
select = soup.find('select')
for option in select:
    intake = option.string
    try:
        # first HTTP request per entry: check whether the intake exists
        page = mech.open(viewurl + intake)
        html = page.read()
        print html
        if html == "Exist in database":
            print intake, "exists in the database, skipping"
        else:
            # second HTTP request per entry: insert the intake
            page = mech.open(inserturl + intake)
            html = page.read()
            print html
            if html == "Ok":
                print intake, "added to the database"
            else:
                print "Error adding", intake, "to database"
    except Exception, err:
        print str(err)

I am wondering what would be the best way to optimize this script so I can run it on the App Engine servers. As it is, it scrapes over 300 entries and takes well over 10 minutes to insert all the data on my local machine.

The model that is being used to store the data is:

class Intake(db.Model):
    intake = db.StringProperty(multiline=False, required=True)

    #@permalink
    def get_absolute_url(self):
        return "/timekeeper/%s/" % self.intake

    class Meta:
        db_table = "Intake"
        verbose_name_plural = "Intakes"
        ordering = ['intake']

Upvotes: 2

Views: 376

Answers (3)

nashr rafeeg

Reputation: 829

Hi, following tosh's and Nick's advice I have modified the script as below:

from google.appengine.api import urlfetch
from google.appengine.ext import db
from BeautifulSoup import BeautifulSoup
from timekeeper.models import Intake

__author__ = "Nash Rafeeq"

url = "http://webspace.apiit.edu.my/schedule/timetable.jsp"
try:
    page = urlfetch.fetch(url)
    soup = BeautifulSoup(page.content)
    # the first <select> element holds one <option> per intake code
    select = soup.find('select')
    models = []
    for option in select:
        intake_code = option.string
        # skip codes that are already in the datastore
        if Intake.all().filter('intake =', intake_code).count() < 1:
            models.append(Intake(intake=intake_code))
    # one batch put instead of two HTTP requests per entry
    if models:
        db.put(models)
except Exception, err:
    print str(err)

Am I on the right track? Also, I am not really sure how to get this to invoke on a schedule (once a week); what would be the best way to do it?

And thanks for the prompt answers.
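From the docs, a weekly schedule looks like it can be described in `cron.yaml`; a sketch, assuming the scraper is exposed at a hypothetical `/timekeeper/scrape/` handler URL:

```yaml
cron:
- description: weekly timetable scrape
  url: /timekeeper/scrape/
  schedule: every monday 09:00
```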

Upvotes: 1

Nick Johnson

Reputation: 101149

The first thing you should do is rewrite your script to use the App Engine datastore directly. A large part of the time you're spending is undoubtedly because you're using HTTP requests (two per entry!) to insert data into your datastore. Using the datastore directly with batch puts ought to cut a couple of orders of magnitude off your runtime.
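A minimal sketch of the batch-put idea: `db.put` accepts a list of entities, and since a single batch put is limited to 500 entities, a small chunking helper (hypothetical, not from the original script) keeps each call under that limit:

```python
def chunks(entities, size=500):
    """Yield successive lists of at most `size` entities."""
    for i in range(0, len(entities), size):
        yield entities[i:i + size]

# With the App Engine SDK available, the batch insert becomes:
# for batch in chunks(models):
#     db.put(batch)
```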

If your parsing code is still too slow, you can cut the work up into chunks and use the task queue API to do the work in multiple requests.

Upvotes: 2

tosh

Reputation: 5392

Divide and conquer.

  1. Make a list of tasks (e.g. urls to scrape/parse)
  2. Add your tasks into a queue (appengine taskqueue api, amazon sqs, …)
  3. Process your queue
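The steps above can be sketched against the App Engine taskqueue API; the `/timekeeper/worker/` URL and the 50-entry chunk size are assumptions for illustration, not part of the original code:

```python
def build_tasks(intake_codes, chunk_size=50):
    """Group intake codes into task payloads, one queue task per chunk."""
    tasks = []
    for i in range(0, len(intake_codes), chunk_size):
        chunk = intake_codes[i:i + chunk_size]
        tasks.append({'url': '/timekeeper/worker/',
                      'params': {'intakes': ','.join(chunk)}})
    return tasks

# With the App Engine SDK available, each payload would be enqueued as:
# from google.appengine.api import taskqueue
# for t in build_tasks(all_codes):
#     taskqueue.add(url=t['url'], params=t['params'])
```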

Upvotes: 4
