David

Reputation: 83

Difficulty extracting HTML table with Python and Pandas

I am trying to extract data from the HTML table on the following website: https://fuelkaki.sg/home

[Screenshot of the table on the page]

My Python code is shown below. Pandas is unable to detect the table; I suspect this is because Beautiful Soup is not capturing the page's HTML properly.

import sys
import time
from bs4 import BeautifulSoup
import requests
import pandas as pd

try:
    url = 'https://fuelkaki.sg/home'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.69'}
    page = requests.get(url, headers=headers)
except Exception as e:
    error_type, error_obj, error_info = sys.exc_info()
    print('ERROR FOR LINK:', url)
    print(error_type, 'Line:', error_info.tb_lineno)

time.sleep(2)
soup = BeautifulSoup(page.text, 'html.parser')

df = pd.read_html(page.text)
df
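
For what it's worth, a quick check like the one below (just a sketch) can show whether any table markup is present in the raw response at all; if nothing is found, the table is presumably rendered client-side by JavaScript, so requests alone will never see it:

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://fuelkaki.sg/home',
                    headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(resp.text, 'html.parser')

# An empty list here means the static HTML contains no <table> element,
# which would explain why pd.read_html() finds nothing to parse.
print(soup.find_all('table'))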

I have tried using Selenium as well (see code below), but still unable to capture the HTML table information.

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import pandas as pd

url = 'https://fuelkaki.sg/home'
options = Options()
options.binary_location = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"    # Chrome binary location specified here
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(3)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')


df = pd.read_html(page)
df
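
In case the fixed time.sleep(3) is not long enough for the table to render, an explicit wait could be used in its place (a sketch; the table.table CSS selector is an assumption about the page's markup):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for a <table class="table"> element to appear
# before grabbing the page source.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'table.table'))
)
page = driver.page_source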

Any advice would be much appreciated.

Upvotes: 0

Views: 73

Answers (1)

keramat

Reputation: 4543

Use:

import time
import numpy as np
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import pandas as pd

url = 'https://fuelkaki.sg/home'
options = Options()
options.add_argument('--disable-gpu')

driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(3)  # give the JavaScript-rendered table time to load
page = driver.page_source
driver.quit()

soup = BeautifulSoup(page, 'html.parser')
table = soup.find("table", {"class": "table"})

# Collect the cell texts (stripping the invisible U+202C characters)
# and reshape them into rows of five columns.
pd.DataFrame(np.array([x.text.replace('\u202c', '') for x in table.find_all('td')]).reshape(-1, 5))

Output:

[Screenshot of the resulting DataFrame]

Please be aware that using website data can be unethical.
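
As a side note, once the rendered HTML is available, the isolated table markup could also be handed to pandas directly (a sketch, reusing the table variable from the code above):

# pd.read_html() accepts an HTML string and returns a list of DataFrames;
# here the list should contain exactly one table.
df = pd.read_html(str(table))[0]
df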

Upvotes: 1
