user8558176


How to extract data from multiple URLs using Python

Hi, I want to scrape data from multiple URLs. I am doing it like this:

for i in range(493):
    my_url = 'http://tis.nhai.gov.in/TollInformation?TollPlazaID={}'.format(i)

but it is not giving me complete data; it is printing only the last URL's data.

Here is my code, please help:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import psycopg2
import operator


for i in range(493):
    my_url = 'http://tis.nhai.gov.in/TollInformation?TollPlazaID={}'.format(i)

    uClient = uReq(my_url)
    page1_html = uClient.read()
    uClient.close()
    # html parsing
    page1_soup = soup(page1_html, 'html.parser')

    # grabbing data
    containers = page1_soup.findAll('div', {'class': 'PA15'})

    # Make the connection to PostgreSQL
    conn = psycopg2.connect(database='--',user='--', password='--', port=--)
    cursor = conn.cursor()
    for container in containers:
        toll_name1 = container.p.b.text
        toll_name = toll_name1.split(" ")[1]

        search1 = container.findAll('b')
        highway_number = search1[1].text.split(" ")[0]

        text = search1[1].get_text()
        onset = text.index('in')
        offset = text.index('Stretch')
        state = str(text[onset +2:offset]).strip(' ')

        location = list(container.p.descendants)[10]
        mystr = my_url[my_url.find('?'):]
        TID = mystr.strip('?TollPlazaID=')

        query = "INSERT INTO tollmaster (TID, toll_name, location, highway_number, state) VALUES (%s, %s, %s, %s, %s);"
        data = (TID, toll_name, location, highway_number, state)

        cursor.execute(query, data)

# Commit the transaction
conn.commit()

But it's displaying only the second-to-last URL's data.

Upvotes: 0

Views: 866

Answers (1)

Pythonist

Reputation: 697

It seems like some of the pages are missing your key information; you can use error catching for it, like this:

try:
    tbody = soup('table', {"class": "tollinfotbl"})[0].find_all('tr')[1:]
except IndexError:
    continue  # Skip this page if no items were scraped

You may want to add some logging/print statements to keep track of nonexistent tables.
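For example, a minimal sketch (reusing the question's page1_soup object and loop variable i; the exact log message is just an illustration):

import logging

logging.basicConfig(level=logging.INFO)

# inside the `for i in range(493):` loop:
try:
    tbody = page1_soup('table', {"class": "tollinfotbl"})[0].find_all('tr')[1:]
except IndexError:
    logging.info('TollPlazaID %s: no tollinfotbl table found, skipping', i)
    continue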

EDIT: It's showing information from only the last page because you are committing your transaction outside the for loop while overwriting your conn on every i. Just put conn.commit() inside the for loop, at the far end.
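Putting both fixes together, a minimal sketch of the restructured script might look like this (the parsing logic and credential placeholders are carried over from the question unchanged; treat it as a sketch, not a tested drop-in):

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import psycopg2

# Connect once, before the loop, instead of re-creating conn per URL
conn = psycopg2.connect(database='--', user='--', password='--', port='--')
cursor = conn.cursor()

query = ("INSERT INTO tollmaster "
         "(TID, toll_name, location, highway_number, state) "
         "VALUES (%s, %s, %s, %s, %s);")

for i in range(493):
    my_url = 'http://tis.nhai.gov.in/TollInformation?TollPlazaID={}'.format(i)

    uClient = uReq(my_url)
    page1_soup = soup(uClient.read(), 'html.parser')
    uClient.close()

    containers = page1_soup.findAll('div', {'class': 'PA15'})
    if not containers:
        continue  # this page has no toll info; skip it

    for container in containers:
        toll_name = container.p.b.text.split(" ")[1]

        search1 = container.findAll('b')
        highway_number = search1[1].text.split(" ")[0]

        text = search1[1].get_text()
        state = text[text.index('in') + 2:text.index('Stretch')].strip()

        location = list(container.p.descendants)[10]
        TID = str(i)  # the plaza ID is already the loop variable

        cursor.execute(query, (TID, toll_name, location,
                               highway_number, state))

    conn.commit()  # commit inside the for loop, at the far end

conn.close()

Opening the connection once also avoids leaking a connection per iteration, and the final conn.close() releases it cleanly.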

Upvotes: 1
