Reputation: 43
I am attempting to visit this site and pull the information from the table. I'm totally new to this, and this is the furthest I've gotten. Most of the guides I find don't really get me to the end result I want, so I was hoping someone could help.
import requests
from bs4 import BeautifulSoup
source_code = requests.get('blank.com').text
soup = BeautifulSoup(source_code, "lxml")
table = soup.find_all('table')[7]
print(table)
This code outputs the following:
Process finished with exit code 0
What is my next step in tidying up this information so it can be used by other Python methods? I am looking to format it into a nice table, preferably with columns.
Thanks!
Upvotes: 2
Views: 169
Reputation: 13349
Use pandas.read_html
import requests
import pandas as pd
from bs4 import BeautifulSoup
source_code = requests.get('http://eoddata.com/stockquote/NASDAQ/GOOG.htm').text
soup = BeautifulSoup(source_code, "lxml")
table = soup.find_all('table')[7]
df = pd.read_html(str(table))[0]  # read_html returns a list of DataFrames; take the first one
df.columns = df.iloc[0]           # promote the first row to column headers
df = df[1:]                       # drop the header row from the data
Output:
In [20]: df
Out [20]:
Date Open High Low Close Volume Open Interest
1 02/02/18 1122 1123 1107 1112 4857900 0
2 02/01/18 1163 1174 1158 1168 2412100 0
3 01/31/18 1171 1173 1159 1170 1538600 0
4 01/30/18 1168 1177 1164 1164 1556300 0
5 01/29/18 1176 1187 1172 1176 1378900 0
6 01/26/18 1175 1176 1158 1176 2018700 0
7 01/25/18 1173 1176 1163 1170 1480500 0
8 01/24/18 1177 1180 1161 1164 1416600 0
9 01/23/18 1160 1172 1159 1170 1333000 0
10 01/22/18 1137 1160 1135 1156 1617500 0
df.iloc[1]
gives you the values of the row at integer position 1 (use df.loc[1] to select by index label), and
df[<column name>][<index>]
gives you the value of a particular column at the specified index label.
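For example, a quick sketch of these access patterns, assuming the df built above (its index labels start at 1 after dropping the header row):
print(df.iloc[1])       # row at integer position 1, i.e. the 02/01/18 row
print(df.loc[1])        # row with index label 1, i.e. the 02/02/18 row
print(df['Close'][1])   # the 'Close' value at index label 1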
Upvotes: 1
Reputation: 71451
You can try this:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen

s = soup(urlopen('http://eoddata.com/stockquote/NASDAQ/GOOG.htm').read(), 'lxml')
# take the text of every table cell, then slice out the cells belonging to the quote-history table
final_data = [i.text for i in s.find_all('td')][27:97]
# the table has 7 columns, so pair the headers with each group of 7 cells
headers = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Open Interest']
new_final_data = [dict(zip(headers, final_data[i:i+7])) for i in range(0, len(final_data), 7)]
Output:
[{'Date': '02/02/18', 'Open': '1,122', 'High': '1,123', 'Low': '1,107', 'Close': '1,112', 'Volume': '4,857,900', 'Open Interest': '0'},
 {'Date': '02/01/18', 'Open': '1,163', 'High': '1,174', 'Low': '1,158', 'Close': '1,168', 'Volume': '2,412,100', 'Open Interest': '0'},
 {'Date': '01/31/18', 'Open': '1,171', 'High': '1,173', 'Low': '1,159', 'Close': '1,170', 'Volume': '1,538,600', 'Open Interest': '0'},
 {'Date': '01/30/18', 'Open': '1,168', 'High': '1,177', 'Low': '1,164', 'Close': '1,164', 'Volume': '1,556,300', 'Open Interest': '0'},
 {'Date': '01/29/18', 'Open': '1,176', 'High': '1,187', 'Low': '1,172', 'Close': '1,176', 'Volume': '1,378,900', 'Open Interest': '0'},
 {'Date': '01/26/18', 'Open': '1,175', 'High': '1,176', 'Low': '1,158', 'Close': '1,176', 'Volume': '2,018,700', 'Open Interest': '0'},
 {'Date': '01/25/18', 'Open': '1,173', 'High': '1,176', 'Low': '1,163', 'Close': '1,170', 'Volume': '1,480,500', 'Open Interest': '0'},
 {'Date': '01/24/18', 'Open': '1,177', 'High': '1,180', 'Low': '1,161', 'Close': '1,164', 'Volume': '1,416,600', 'Open Interest': '0'},
 {'Date': '01/23/18', 'Open': '1,160', 'High': '1,172', 'Low': '1,159', 'Close': '1,170', 'Volume': '1,333,000', 'Open Interest': '0'},
 {'Date': '01/22/18', 'Open': '1,137', 'High': '1,160', 'Low': '1,135', 'Close': '1,156', 'Volume': '1,617,500', 'Open Interest': '0'}]
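If you then want this in a tabular form with columns, as the question asks, one possible follow-up (a sketch assuming pandas is installed):
import pandas as pd
df = pd.DataFrame(new_final_data)  # one column per header key, one row per dict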
Upvotes: 1
Reputation: 15376
You can create a nested list from the table using list comprehensions, e.g.:
table_data = [list(tr.stripped_strings) for tr in table.select('tr')]
The first list in table_data contains the table headers, so you can use it to get the column names if you write to CSV or create a dataframe.
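For instance, a minimal sketch of that idea, assuming pandas is installed and table_data was built as above:
import pandas as pd
# the first row holds the header cells, the remaining rows hold the data
df = pd.DataFrame(table_data[1:], columns=table_data[0])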
If you don't want table headers in the list, you can select text only from the table data cells:
table_data = [[td.text for td in tr.select('td')] for tr in table.select('tr')]
Upvotes: 2