Reputation: 83
I am trying to extract data from the HTML table on the following website: https://fuelkaki.sg/home
My Python code is shown below. pandas is unable to detect the table. I suspect this is because Beautiful Soup is not capturing the page's HTML properly.
import sys
import time
from bs4 import BeautifulSoup
import requests
import pandas as pd
try:
    url = 'https://fuelkaki.sg/home'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.69'}
    page = requests.get(url, headers=headers)
except Exception as e:
    error_type, error_obj, error_info = sys.exc_info()
    print('ERROR FOR LINK:', url)
    print(error_type, 'Line:', error_info.tb_lineno)
time.sleep(2)
soup=BeautifulSoup(page.text,'html.parser')
df = pd.read_html(page.text)
df
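One way to test that suspicion is to check whether the HTML returned by `requests` contains a `<table>` element at all. The sketch below uses only the standard library; `TableCounter`, `count_tables`, and the two sample strings are hypothetical stand-ins, not the real response from the site. If the count is zero for `page.text`, the table is probably built by JavaScript after the page loads, in which case neither Beautiful Soup nor `pd.read_html` can find it in the raw response.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts opening <table> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        if tag == "table":
            self.tables += 1


def count_tables(html: str) -> int:
    """Return how many <table> elements appear in the given HTML text."""
    parser = TableCounter()
    parser.feed(html)
    parser.close()
    return parser.tables


# Hypothetical samples: a JavaScript shell page versus a fully rendered page.
shell_html = "<html><body><app-root></app-root><script src='main.js'></script></body></html>"
rendered_html = "<html><body><table class='table'><tr><td>A</td></tr></table></body></html>"

print(count_tables(shell_html))     # 0
print(count_tables(rendered_html))  # 1
```

In the script above, the call to try would be `count_tables(page.text)`.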
I have tried using Selenium as well (see code below), but I am still unable to capture the HTML table information.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import pandas as pd
url = 'https://fuelkaki.sg/home'
options = Options()
options.binary_location = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"  # Chrome binary location; raw string keeps the backslashes literal
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(3)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')
df = pd.read_html(page)
df
Any advice would be much appreciated.
Upvotes: 0
Views: 73
Reputation: 4543
Use:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import numpy as np
import pandas as pd
url = 'https://fuelkaki.sg/home'
options = Options()
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(3)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')
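# Locate the rendered table by its CSS class.
table = soup.find("table", {"class": "table"})
# Strip the invisible U+202C character from each cell, then reshape the flat list into rows of 5 columns.
cells = [x.text.replace('\u202c', '') for x in table.find_all('td')]
pd.DataFrame(np.array(cells).reshape(-1, 5))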
Please be aware that scraping and reusing a website's data can be unethical, so check the site's terms of use first.
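The `reshape(-1, 5)` call in the code above hard-codes five columns and fails if the number of cells is not a multiple of five. As a rough sketch of the same grouping step in plain Python, with an explicit error message, something like the following could be used. The helper name `cells_to_rows` and the sample values are hypothetical placeholders, not real data from the site, and the extra `.strip()` is an addition that the original one-liner does not do.

```python
def cells_to_rows(cells, n_cols=5):
    """Group a flat list of cell strings into rows of n_cols cells each."""
    # Remove the invisible U+202C character and any surrounding whitespace.
    cleaned = [c.replace("\u202c", "").strip() for c in cells]
    if len(cleaned) % n_cols != 0:
        raise ValueError(
            f"{len(cleaned)} cells cannot be split into rows of {n_cols}"
        )
    # Slice the flat list into consecutive chunks of n_cols cells.
    return [cleaned[i:i + n_cols] for i in range(0, len(cleaned), n_cols)]


# Made-up placeholder cells standing in for the text of each <td>.
flat = ["a1", "a2\u202c", "a3", "a4", "a5", "b1", "b2", "b3", "b4", "b5\u202c"]
rows = cells_to_rows(flat)
print(rows)
```

The resulting list of rows can be passed straight to `pd.DataFrame(rows)`.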
Upvotes: 1