David

Reputation: 83

Difficulty extracting HTML table with Python and Pandas

I am trying to extract data from the HTML table on the following website: https://fuelkaki.sg/home

[Screenshot of the table on the page]

My Python code is shown below. Pandas is unable to detect the table; I suspect this is because Beautiful Soup is not capturing the page's HTML properly.

import sys
import time
from bs4 import BeautifulSoup
import requests
import pandas as pd

try:
    url = 'https://fuelkaki.sg/home'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.69'}
    page = requests.get(url, headers=headers)
except Exception as e:
    error_type, error_obj, error_info = sys.exc_info()
    print('ERROR FOR LINK:', url)
    print(error_type, 'Line:', error_info.tb_lineno)

time.sleep(2)
soup = BeautifulSoup(page.text, 'html.parser')

df = pd.read_html(page.text)
df
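
For what it's worth, a quick check like the one below (just a sketch) can show whether any table markup is present in the raw response at all; if nothing is found, the table is presumably rendered client-side by JavaScript, so requests alone will never see it:

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://fuelkaki.sg/home',
                    headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(resp.text, 'html.parser')

# An empty list here means the static HTML contains no <table> element,
# which would explain why pd.read_html() finds nothing to parse.
print(soup.find_all('table'))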

I have tried using Selenium as well (see code below), but still unable to capture the HTML table information.

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import pandas as pd

url = 'https://fuelkaki.sg/home'
options = Options()
options.binary_location = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"    # Chrome binary location specified here
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(3)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')


df = pd.read_html(page)
df
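
In case the fixed time.sleep(3) is not long enough for the table to render, an explicit wait could be used in its place (a sketch; the table.table CSS selector is an assumption about the page's markup):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for a <table class="table"> element to appear
# before grabbing the page source.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'table.table'))
)
page = driver.page_source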

Any advice would be much appreciated.

Upvotes: 0

Views: 73

Answers (1)

keramat

Reputation: 4543

Use:

import time
import numpy as np
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import pandas as pd

url = 'https://fuelkaki.sg/home'
options = Options()
options.add_argument('--disable-gpu')

driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(3)  # give the JavaScript-rendered table time to load
page = driver.page_source
driver.quit()

soup = BeautifulSoup(page, 'html.parser')
table = soup.find("table", {"class": "table"})

# Collect the cell texts (stripping the invisible U+202C characters)
# and reshape them into rows of five columns.
pd.DataFrame(np.array([x.text.replace('\u202c', '') for x in table.find_all('td')]).reshape(-1, 5))

Output:

[Screenshot of the resulting DataFrame]

Please be aware that using website data can be unethical.
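
As a side note, once the rendered HTML is available, the isolated table markup could also be handed to pandas directly (a sketch, reusing the table variable from the code above):

# pd.read_html() accepts an HTML string and returns a list of DataFrames;
# here the list should contain exactly one table.
df = pd.read_html(str(table))[0]
df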

Upvotes: 1
