NitheshKHP

Reputation: 391

Scraping data through paginated table using python

I am scraping data from Google Finance's historical page for a stock (http://www.google.com/finance/historical?q=NSE%3ASIEMENS&ei=PLfUVIDTDuSRiQKhwYGQBQ).

I can scrape the 30 rows shown on the current page. The issue I am facing is that I am unable to scrape the rest of the data in the table (rows 31-241). How do I go to the next page or link? Following is my code:

import urllib2
import xlwt  # to write into an Excel spreadsheet
from bs4 import BeautifulSoup

# Main Coding Section

stock_links = open('stock_link_list.txt', 'r')  # opening text file for reading

for url in stock_links:
    OurFile = urllib2.urlopen(url)
    OurHtml = OurFile.read()
    OurFile.close()

    soup = BeautifulSoup(OurHtml)
    # grab the historical-price table and keep only its text
    soup1 = soup.find("table", {"class": "gf-table historical_price"}).get_text()

    end = url.index('&')
    filename = url[47:end]
    file = open(filename, 'w')  # opening text file for writing
    file.write(soup1)           # writing the table text to the file
    file.close()                # closing the text file

Upvotes: 1

Views: 2516

Answers (2)

Umair Ayub

Reputation: 21201

At first sight, the Row Limit option only lets you show a maximum of 30 rows per page, but I manually changed the query string parameters to larger numbers and realized you can view at most 200 rows per page.

Change URL to

https://www.google.com/finance/historical?q=NSE%3ASIEMENS&ei=OM3UVLFtkLnzBsjIgYAI&start=0&num=200

It will show 200 rows

Then change the parameters to start=200&num=400 for the next block of rows.
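For example, a minimal sketch of walking the table 200 rows at a time (assuming the requests library and that the start/num parameters behave as described above; the ei value is copied from the URL in this answer):

import requests
from bs4 import BeautifulSoup

base = ("https://www.google.com/finance/historical"
        "?q=NSE%3ASIEMENS&ei=OM3UVLFtkLnzBsjIgYAI&start={}&num=200")

for start in (0, 200):  # 241 rows in total, so two requests of up to 200 rows
    page = requests.get(base.format(start))
    soup = BeautifulSoup(page.content, "html.parser")
    table = soup.find("table", {"class": "gf-table historical_price"})
    if table is None:  # nothing left to fetch
        break
    print(table.get_text())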

More generally, if you have many other such links, you can scrape the pagination area (the last TR of the table), grab the links to the next pages from it, and scrape those in turn.
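A rough sketch of that idea, assuming the pagination links are plain anchor tags in the table's last row (the exact markup of Google Finance's pagination area is not shown here, so those selectors are an assumption):

import requests
from bs4 import BeautifulSoup
from urlparse import urljoin  # Python 2; use urllib.parse on Python 3

url = "https://www.google.com/finance/historical?q=NSE%3ASIEMENS"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
table = soup.find("table", {"class": "gf-table historical_price"})

# assumed layout: the last <tr> of the table holds the pagination links
pagination_row = table.find_all("tr")[-1]
next_page_urls = [urljoin(url, a["href"]) for a in pagination_row.find_all("a")]

for page_url in next_page_urls:
    page_soup = BeautifulSoup(requests.get(page_url).content, "html.parser")
    print(page_soup.find("table", {"class": "gf-table historical_price"}).get_text())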

Upvotes: 0

Padraic Cunningham

Reputation: 180391

You will have to fine-tune this, and I would catch more specific errors, but you can keep increasing start to get the next block of data:

url = "https://www.google.com/finance/historical?q=NSE%3ASIEMENS&ei=W8LUVLHnAoOswAOFs4DACg&start={}&num=30"

from bs4 import BeautifulSoup
import requests

# Main Coding Section
start = 0
while True:
    try:
        nxt = url.format(start)
        r = requests.get(nxt)
        soup = BeautifulSoup(r.content)
        # raises an AttributeError once no table comes back, which ends the loop
        print(soup.find("table", {"class": "gf-table historical_price"}).get_text())
    except Exception as e:
        print(e)
        break
    start += 30

This gets all the table data up to the last date, Feb 7:

......

Date
Open
High
Low
Close
Volume

Feb 7, 2014
552.60
557.90
548.25
551.50
119,711
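If you want the data in a file as in the question, a minimal variation of the same loop could write each page's table text out as it goes (the output filename here is hypothetical, and the "html.parser" argument is just an explicit parser choice):

from bs4 import BeautifulSoup
import requests

url = "https://www.google.com/finance/historical?q=NSE%3ASIEMENS&ei=W8LUVLHnAoOswAOFs4DACg&start={}&num=30"

start = 0
with open("NSE-SIEMENS.txt", "w") as out:  # hypothetical output filename
    while True:
        r = requests.get(url.format(start))
        soup = BeautifulSoup(r.content, "html.parser")
        table = soup.find("table", {"class": "gf-table historical_price"})
        if table is None:  # no table means we ran past the last page
            break
        out.write(table.get_text().encode("utf-8"))
        start += 30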

Upvotes: 2
