Reputation: 949
I'm trying to scrape a table from a website using Beautiful Soup in order to parse the data. How would I go about parsing it by its headers? So far, I can't even manage to print the entire table. Thanks in advance.
Here is the code:
import urllib2
from bs4 import BeautifulSoup
optionstable = "http://www.barchart.com/options/optdailyvol?type=stocks"
page = urllib2.urlopen(optionstable)
soup = BeautifulSoup(page, 'lxml')
table = soup.find("div", {"class":"dataTables_wrapper","id": "options_wrapper"})
table1 = table.find_all('table')
print table1
Upvotes: 3
Views: 1503
Reputation: 180411
You need to mimic the ajax request to get the table data:
import requests
from time import time
optionstable = "http://www.barchart.com/options/optdailyvol?type=stocks"
params = {"type": "stocks",
          "dir": "desc",
          "_": str(time()),
          "f": "base_symbol,type,strike,expiration_date,bid,ask,last,volume,open_interest,volatility,timestamp",
          "sEcho": "1",
          "iDisplayStart": "0",
          "iDisplayLength": "100",
          "iSortCol_0": "7",
          "sSortDir_0": "desc",
          "iSortingCols": "1",
          "bSortable_0": "true",
          "bSortable_1": "true",
          "bSortable_2": "true",
          "bSortable_3": "true",
          "bSortable_4": "true",
          "bSortable_5": "true",
          "bSortable_6": "true",
          "bSortable_7": "true",
          "bSortable_8": "true",
          "bSortable_9": "true",
          "bSortable_10": "true",
          "sortby": "Volume"}
Then do a get passing the params:
js = requests.get("http://www.barchart.com/option-center/getData.php", params=params).json()
Which gives you:
{u'aaData': [[u'<a href="/quotes/BAC">BAC</a>', u'Call', u'16.00', u'12/16/16', u'0.89', u'0.90', u'0.91', u'52,482', u'146,378', u'0.26', u'01:43'], [u'<a href="/quotes/ETE">ETE</a>', u'Call', u'20.00', u'01/20/17', u'0.38', u'0.41', u'0.40', u'40,785', u'72,011', u'0.42', u'01:27'], [u'<a href="/quotes/BAC">BAC</a>', u'Call', u'15.00', u'10/21/16', u'1.34', u'1.36', u'1.33', u'35,663', u'90,342', u'0.35', u'01:44'], [u'<a href="/quotes/COTY">COTY</a>', u'Put', u'38.00', u'10/21/16', u'15.00', u'15.30', u'15.10', u'32,321', u'242,382', u'1.24', u'01:44'], [u'<a href="/quotes/COTY">COTY</a>', u'Call', u'38.00', u'10/21/16', u'0.00', u'0.05', u'0.01', u'32,320', u'256,589', u'1.34', u'01:44'], [u'<a href="/quotes/WFC">WFC</a>', u'Put', u'40.00', u'10/21/16', u'0.01', u'0.03', u'0.02', u'32,121', u'37,758', u'0.39', u'01:43'], [u'<a href="/quotes/WFC">WFC</a>', u'Put', u'40.00', u'11/18/16', u'0.16', u'0.17', u'0.16', u'32,023', u'8,789', u'0.30', u'01:44']..................
There are many more params you can pass. If you watch the request in the Chrome dev tools under the XHR tab, you can see them all; the ones above were the minimum needed to get a result. There are quite a lot, so I won't post them all here, and I'll leave you to figure out how you can influence the results.
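For instance, iDisplayStart and iDisplayLength are the standard DataTables paging parameters (the row offset and the page size). Assuming the endpoint honors them the way DataTables normally does, you could page through the data with a small helper like this (page_params is a hypothetical name, not part of any library):

```python
def page_params(base_params, page, page_size=100):
    """Return a copy of base_params adjusted to fetch the given 0-indexed page.

    Assumes DataTables-style paging: iDisplayStart is the row offset,
    iDisplayLength is the number of rows per page.
    """
    params = dict(base_params)  # copy so the caller's dict isn't mutated
    params["iDisplayStart"] = str(page * page_size)
    params["iDisplayLength"] = str(page_size)
    return params

# e.g. the second page of 100 rows starts at row offset 100
second_page = page_params({"type": "stocks"}, page=1)
```

You would then pass each page's dict as params= in the requests.get call above.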
If you iterate over js[u'aaData'], you can see each sublist, where each entry corresponds to the columns like:
#base_symbol,type,strike,expiration_date,bid,ask,last,volume,open_interest,volatility,timestamp
[u'<a href="/quotes/AAPL">AAPL</a>', u'Call', u'116.00', u'10/14/16', u'1.36', u'1.38', u'1.37', u'21,812', u'7,258', u'0.23', u'10/10/16']
So if you want to filter rows based on some criteria, i.e. strike > 15:

for d in filter(lambda row: float(row[2]) > 15, js[u'aaData']):
    print(d)
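Since the row order matches the column list in the f parameter, you can also zip each row with the column names to get dicts keyed by header, which makes filters easier to read. A sketch, using one hard-coded sample row in place of the live js[u'aaData']:

```python
cols = ("base_symbol,type,strike,expiration_date,bid,ask,last,"
        "volume,open_interest,volatility,timestamp").split(",")

# sample data in the same shape as the entries in js[u'aaData']
sample_rows = [
    ['<a href="/quotes/BAC">BAC</a>', 'Call', '16.00', '12/16/16',
     '0.89', '0.90', '0.91', '52,482', '146,378', '0.26', '01:43'],
]

# one dict per row, keyed by column name
rows = [dict(zip(cols, row)) for row in sample_rows]
high_strike = [r for r in rows if float(r["strike"]) > 15]
```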
You might also find pandas useful; with a little bit of tidying up we can create a nice df:

# extract the base_symbol text from the anchor tag
for v in js[u'aaData']:
    v[0] = BeautifulSoup(v[0], 'lxml').a.text

import pandas as pd

cols = "base_symbol,type,strike,expiration_date,bid,ask,last,volume,open_interest,volatility,timestamp"
df = pd.DataFrame(js[u'aaData'], columns=cols.split(","))
print(df.head(5))
That gives you a nice df to work with:
base_symbol type strike expiration_date bid ask last volume \
0 BAC Call 16.00 12/16/16 0.89 0.90 0.91 52,482
1 ETE Call 20.00 01/20/17 0.38 0.41 0.40 40,785
2 BAC Call 15.00 10/21/16 1.34 1.36 1.33 35,663
3 COTY Put 38.00 10/21/16 15.00 15.30 15.10 32,321
4 COTY Call 38.00 10/21/16 0.00 0.05 0.01 32,320
open_interest volatility timestamp
0 146,378 0.26 10/10/16
1 72,011 0.42 10/10/16
2 90,342 0.35 10/10/16
3 242,382 1.24 10/10/16
4 256,589 1.34 10/10/16
You might just want to change the dtypes, e.g. df["strike"] = df["strike"].astype(float), etc.
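Note that columns like volume and open_interest contain thousands separators, so the commas need to be stripped before a numeric conversion. A sketch on a small hand-made frame in the same shape as the scraped one:

```python
import pandas as pd

# hand-made sample matching the scraped df's string columns
df = pd.DataFrame({"strike": ["16.00", "20.00"],
                   "volume": ["52,482", "40,785"]})

df["strike"] = df["strike"].astype(float)
# strip the thousands separators, then convert to int
df["volume"] = df["volume"].str.replace(",", "", regex=False).astype(int)
```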
Upvotes: 6
Reputation: 39013
The page is dynamic. It uses the DataTables jQuery plugin to display the table; if you look at the page source you'll see an empty table, which is filled in after the page loads.
You have two choices here. The first is to figure out which URL the JavaScript code running in the browser accesses to get the table data, and hope that simply requesting that URL will work. If it does, the response will probably be JSON, so you won't need to parse any HTML.
The second is to use a tool like Selenium or PhantomJS to load the page and run the JavaScript code. Then you can access the already populated table and get its data.
Upvotes: 2