Reputation: 3
I'm attempting to extract a series of tables from an HTML document and append a new column with a constant value from a tag being used as a header. The idea would then be to make this new three column table a dataframe. below is the code i've come up with so far. I.e. each table would have a third column where all the row values would equal either AGO, DPK, ATK, or PMS depending which header precedes the series of tables. Would be grateful for any help as i'm new to python and HTML. Thanks a mill!
import pandas as pd
from bs4 import BeautifulSoup
from robobrowser import RoboBrowser
br = RoboBrowser()
br.open("https://oilpriceng.net/03-09-2019")
table = br.find_all('td', class_='vc_table_cell')
for element in table:
data = element.find('span', class_='vc_table_content')
prod_name = br.find_all('strong')
ago = prod_name[0].text
dpk = prod_name[1].text
atk = prod_name[2].text
pms = prod_name[3].text
if br.find('strong').text == ago:
data.append(ago.text)
elif br.find('strong').text == dpk:
data.append(dpk.text)
elif br.find('strong').text == atk:
data.append(atk.text)
elif br.find('strong').text == pms:
data.append(pms.text)
print(data.text)
df = pd.DataFrame(data)
The result i'm hoping for is to go from this
AGO
Enterprise Price
Coy A $0.5/L
Coy B $0.6/L
Coy C $0.7/L
to the new table below as a dataframe in Pandas
Enterprise Price Product
Coy A $0.5/L AGO
Coy B $0.6/L AGO
Coy C $0.7/L AGO
and to repeat the same thing for other tables with DPK, ATK and PMS information
Upvotes: 0
Views: 726
Reputation: 195613
I hope I understood your question right. This script will scrape all tables found in the page into the dataframe and save it to csv file:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://oilpriceng.net/03-09-2019/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
data, last = {'Enterprise':[], 'Price':[], 'Product':[]}, ''
for tag in soup.select('h1 strong, tr:has(td.vc_table_cell)'):
if tag.name == 'strong':
last = tag.get_text(strip=True)
else:
a, b = tag.select('td')
a, b = a.get_text(strip=True), b.get_text(strip=True)
if a and b != 'DEPOT PRICE':
data['Enterprise'].append(a)
data['Price'].append(b)
data['Product'].append(last)
df = pd.DataFrame(data)
print(df)
df.to_csv('data.csv')
Prints:
Enterprise Price Product
0 AVIDOR PH ₦190.0 AGO
1 SHORELINK AGO
2 BULK STRATEGIC PH ₦190.0 AGO
3 TSL AGO
4 MASTERS AGO
.. ... ... ...
165 CHIPET ₦132.0 PMS
166 BOND PMS
167 RAIN OIL PMS
168 MENJ ₦133.0 PMS
169 NIPCO ₦ 2,9000,000 LPG
[170 rows x 3 columns]
The data.csv
(screenshot from LibreOffice):
Upvotes: 0