Mohideen bin Mohammed

Reputation: 20147

How to monitor particular content on a web page using Python at a set time interval?

I want to monitor changes to some content that is present on certain web pages. I want to do this on a daily basis, using any script or browser plugin.

For example, I want to get notified if particular content on some web pages changes, based on my query, without signing up for the site's own subscription service.

Upvotes: 0

Views: 142

Answers (2)

Mohideen bin Mohammed

Reputation: 20147

Here is my code showing how I scrape a table from one site. On that site they didn't define an id or class on the table, so you don't need to add anything. If an id or class is present, just use html.xpath('//table[@id=id_val]/tr') instead of html.xpath('//table/tr').

import time
import urllib.request  # Python 3; on Python 2 use urllib.urlopen instead
from lxml import etree

while True:
    web = urllib.request.urlopen("http://www.yoursite.com/")
    html = etree.HTML(web.read())
    tr_nodes = html.xpath('//table/tr')
    # keep rows whose third cell is 'Chennai', 'Across India',
    # or contains 'Chennai' in a '/'-separated list
    td_content = []
    for tr in tr_nodes:
        cells = [td.text for td in tr.xpath('td')]
        if cells[2] in ('Chennai', 'Across India') or 'Chennai' in cells[2].split('/'):
            td_content.append(tr.xpath('td'))
    main_list = []
    for i in td_content:
        # sixth cell holds the experience level; keep freshers / 0-year rows
        if i[5].text == 'Freshers' or 'Freshers' in i[5].text.split('/') or '0' in i[5].text.split(' '):
            sub_list = [td.text for td in i]
            sub_list.insert(6, 'http://yoursite.com/%s' % i[6].xpath('a')[0].get('href'))
            main_list.append(sub_list)
    print('main_list', main_list)
    time.sleep(60)  # 1 minute interval
    # time.sleep(86400)  # 1 day interval
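To illustrate the id-based xpath mentioned above without hitting a live site, here is a small self-contained demo. It uses the standard library's ElementTree on a made-up inline snippet (the table id "jobs" and its contents are invented for this example); with lxml the equivalent call would be html.xpath('//table[@id="jobs"]/tr').

```python
import xml.etree.ElementTree as ET

# Inline snippet standing in for the fetched page; the id "jobs"
# and the cell values are made up for this demo.
sample = """
<html><body>
  <table id="jobs">
    <tr><td>Role</td><td>City</td></tr>
    <tr><td>Developer</td><td>Chennai</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(sample)
# With an id present, the path can target that table directly
rows = root.findall('.//table[@id="jobs"]/tr')
cells = [[td.text for td in tr.findall('td')] for tr in rows]
print(cells)  # [['Role', 'City'], ['Developer', 'Chennai']]
```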

Upvotes: 2

Sajjan Kumar

Reputation: 373

You can do this simply by writing a Python script based on the urllib/requests/BeautifulSoup modules.

What you have to do is write a function that parses the required part of the website, then check in a loop whether it meets your requirement. If it does, exit the loop; if it doesn't, wait some time (using the time module's time.sleep() function) and check again, over and over.

import time

def parse(url):
    # fetch the page and extract the content you want,
    # e.g. with urllib/requests plus BeautifulSoup
    ...

def monitor(url, interval):
    while True:
        content = parse(url)
        if content_meets_requirement(content):  # your own check
            break  # condition met, stop polling
        time.sleep(interval)  # time after which you want to recheck

That's it, you are done. Don't forget to import the modules! :)
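Neither answer shows how to decide whether the content actually changed between polls. One simple approach (a sketch; the function names here are illustrative, not from either answer) is to hash the extracted text each run and compare it with the previous run's hash:

```python
import hashlib

def fingerprint(content):
    """Return a stable hash of the extracted content."""
    if isinstance(content, str):
        content = content.encode("utf-8")
    return hashlib.sha256(content).hexdigest()

def check_for_change(last_hash, new_content):
    """Compare freshly scraped content against the previous poll.

    Returns (changed, new_hash) so the caller can store new_hash
    for the next iteration of the polling loop.
    """
    new_hash = fingerprint(new_content)
    return new_hash != last_hash, new_hash
```

Inside the polling loop you would keep the last hash, call check_for_change on each new extraction, and only send a notification (email, desktop alert, etc.) when it reports a change.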

Upvotes: 2
