Trying to learn

Reputation: 43

Pulling table from site using Python

I am attempting to visit this site and pull the information from the table. I'm totally new to this, and this is as far as I could get. Most guides I've found don't really get me to the result I want, so I was hoping someone could help.

import requests
from bs4 import BeautifulSoup

source_code = requests.get('blank.com').text
soup = BeautifulSoup(source_code, "lxml")

table = soup.find_all('table')[7]

print(table)

This code outputs the following:

Process finished with exit code 0

What is my next step in tidying up this information to be used by other python methods? I am looking to format it into a nice table with columns preferably.

Thanks!

Upvotes: 2

Views: 169

Answers (3)

Pygirl

Reputation: 13349

Use pandas.read_html

import requests
import pandas as pd
from bs4 import BeautifulSoup

source_code = requests.get('http://eoddata.com/stockquote/NASDAQ/GOOG.htm').text
soup = BeautifulSoup(source_code, "lxml")

table = soup.find_all('table')[7]

# read_html returns a list of DataFrames; take the first (and only) one
df = pd.read_html(str(table))[0]
# promote the first row to column headers, then drop it from the data
df.columns = df.iloc[0]
df = df[1:]

Output:

In [20]: df
Out [20]:   
        Date    Open    High    Low     Close   Volume  Open Interest
1   02/02/18    1122    1123    1107    1112    4857900 0
2   02/01/18    1163    1174    1158    1168    2412100 0
3   01/31/18    1171    1173    1159    1170    1538600 0
4   01/30/18    1168    1177    1164    1164    1556300 0
5   01/29/18    1176    1187    1172    1176    1378900 0
6   01/26/18    1175    1176    1158    1176    2018700 0
7   01/25/18    1173    1176    1163    1170    1480500 0
8   01/24/18    1177    1180    1161    1164    1416600 0
9   01/23/18    1160    1172    1159    1170    1333000 0
10  01/22/18    1137    1160    1135    1156    1617500 0

df.iloc[1] will give you the values of the row at index 1.

df[<column name>][<index>] gives the value of a particular column at the specified index.
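For instance, here is a minimal sketch of both access patterns, using a small frame rebuilt from two rows of the sample output above:

```python
import pandas as pd

# Two rows from the sample output above, rebuilt as a small frame
# with the same 1-based index as the scraped table
df = pd.DataFrame(
    {'Date': ['02/02/18', '02/01/18'], 'Close': [1112, 1168]},
    index=[1, 2],
)

print(df.iloc[1])       # second row by position (here, the row labelled 2)
print(df['Close'][1])   # 'Close' value at index label 1
```

Note that iloc is positional while plain [] indexing on a column is label-based, so with this 1-based index the two do not refer to the same row.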

Upvotes: 1

Ajax1234

Reputation: 71451

You can try this:

from bs4 import BeautifulSoup as soup
import urllib.request

headers = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Open Interest']
s = soup(urllib.request.urlopen('http://eoddata.com/stockquote/NASDAQ/GOOG.htm').read(), 'lxml')
# cells 27-96 of the page hold the table body: seven cells per row
final_data = [i.text for i in s.find_all('td')][27:97]
new_final_data = [dict(zip(headers, final_data[i:i+7])) for i in range(0, len(final_data), 7)]

Output (one dict per row; first row shown):

{'Date': '02/02/18', 'Open': '1,122', 'High': '1,123', 'Low': '1,107', 'Close': '1,112', 'Volume': '4,857,900', 'Open Interest': '0'}
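If each row ends up as a dict mapping header to value, pandas can assemble the table from the list directly; a minimal sketch with two hand-copied rows:

```python
import pandas as pd

# Two rows as header -> value dicts, copied by hand from the table
rows = [
    {'Date': '02/02/18', 'Close': '1,112', 'Volume': '4,857,900'},
    {'Date': '02/01/18', 'Close': '1,168', 'Volume': '2,412,100'},
]

# each dict becomes one row; keys become the column names
df = pd.DataFrame(rows)
print(df['Close'][0])   # '1,112'
```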

Upvotes: 1

t.m.adam

Reputation: 15376

You can create a nested list from the table using a list comprehension, e.g.:

table_data = [list(tr.stripped_strings) for tr in table.select('tr')]

The first list in table_data contains the table headers, so you can use it to get the column names if you write to csv or create a dataframe.

If you don't want the table headers in the list, you can select text only from the table data cells (td):

table_data = [[td.text for td in tr.select('td')] for tr in table.select('tr')]
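Either way, the nested list feeds straight into pandas or the csv module; a minimal sketch, assuming table_data came from the first comprehension (headers in row 0) and using a hypothetical quotes.csv output file:

```python
import csv
import pandas as pd

# Nested list as produced above: row 0 holds the headers
table_data = [
    ['Date', 'Open', 'Close'],
    ['02/02/18', '1,122', '1,112'],
    ['02/01/18', '1,163', '1,168'],
]

# the first row becomes the column names
df = pd.DataFrame(table_data[1:], columns=table_data[0])

# ...or write the same rows (headers included) to a csv file
with open('quotes.csv', 'w', newline='') as f:
    csv.writer(f).writerows(table_data)
```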

Upvotes: 2
