technophile_3

Reputation: 521

Problem in extracting values from HTML Page tables and saving in pandas dataframe using python

I am extracting information from here. I am able to extract the column names and all values from the table; however, I am not sure how to save those values in a pandas DataFrame.

For instance, for the first column 'SPAC' I get all the URLs together. These can be saved in a list, which in turn could be added to the DataFrame. The main challenge I am facing is with the rest of the values and columns. I get an output like:

SPAC
Target
Ticker
Announced
Deadline
TEV ($M)
TEV/IPO
Sector
Geography
Premium
Common
Warrant
Aldel Financial
Hagerty ADF 8/18/2021 4/13/2023 3,134 2698% Financial US/Canada -0.99% $10.00 $1.56
InterPrivate III Financial
Aspiration IPVF 8/18/2021 3/9/2023 1,943 751% Financial US/Canada -1.80% $9.82 $1.00

Here the first 12 lines are the column names, and the rest (from Aldel Financial onward) are the row values. The 'Aldel Financial' line itself I do not want saved in my DataFrame, as I have already extracted the link for it. The rest, i.e. from Hagerty through $1.56, are the values for the respective columns.

How do I save this to a pandas DataFrame, given that all the cells share the same id?
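One way to turn the text dump above into a DataFrame is to split it into lines and treat the first 12 lines as headers. A minimal sketch, assuming the output always has that shape: 12 header lines, then pairs of lines per deal (the SPAC name on one line, its eleven values on the next):

```python
import pandas as pd

# Stand-in for the text scraped from the table (copied from the output above).
raw = """SPAC
Target
Ticker
Announced
Deadline
TEV ($M)
TEV/IPO
Sector
Geography
Premium
Common
Warrant
Aldel Financial
Hagerty ADF 8/18/2021 4/13/2023 3,134 2698% Financial US/Canada -0.99% $10.00 $1.56
InterPrivate III Financial
Aspiration IPVF 8/18/2021 3/9/2023 1,943 751% Financial US/Canada -1.80% $9.82 $1.00"""

lines = [line for line in raw.splitlines() if line.strip()]
headers, body = lines[:12], lines[12:]

records = []
for name_line, value_line in zip(body[0::2], body[1::2]):
    # Split from the right so a multi-word Target (e.g. "ADS-TEC Energy")
    # stays intact: the last 10 tokens map to Ticker..Warrant, the
    # remainder is the Target name.
    parts = value_line.rsplit(maxsplit=10)
    records.append([name_line] + parts)

df = pd.DataFrame(records, columns=headers)
print(df.head())
```

This keeps the SPAC name as its own column (it can be dropped or replaced by the link list later). The right-split assumption only holds if every value line ends with exactly ten single-token fields.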

Here is my complete code.

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()


def extraction():
    print("Program started Successfully!")
    websites = ["https://www.spacresearch.com/symbol?s=live-deal&sector=&geography="]
    data = []
    for live_deals in websites:
        browser.get(live_deals)

        wait = WebDriverWait(browser, 10)
        wait.until(
            EC.element_to_be_clickable((By.XPATH, "(//input[@id='username'])[2]"))
        ).send_keys("kys")
        wait.until(
            EC.element_to_be_clickable((By.XPATH, "(//input[@id='password'])[2]"))
        ).send_keys("pasd")
        wait.until(
            EC.element_to_be_clickable((By.XPATH, "(//button[text()='Next'])[2]"))
        ).click()

        time.sleep(2)
        spac = browser.find_elements_by_class_name("ellipsis")
        for all_spac in spac:
            links = all_spac.find_element_by_tag_name("a").get_attribute("href")
            print(links)

        target = browser.find_elements_by_id("companies-table-deal-announced")
        for all_targets in target:
            print(all_targets.text)
        print("Program Ended successfully!")


extraction()

Please help me understand how I could save the extracted columns and values in a pandas DataFrame. Thanks!

EDIT: So I tried using pandas' read_html() and got the following output:

                              SPAC              Target Ticker  Announced   Deadline  TEV ($M)  TEV/IPO  Sector  Geography  Premium  Common  Warrant
0                CA Healthcare Acq            LumiraDx   CAHC   4/7/2021  1/29/2023   30000.0      NaN     NaN        NaN      NaN     NaN      NaN
1  European Sustainable Growth Acq      ADS-TEC Energy   EUSG  8/11/2021  1/26/2023     356.0      NaN     NaN        NaN      NaN     NaN      NaN
2                  Longview Acq II           HeartFlow    LGV  7/15/2021  3/23/2023    2373.0      NaN     NaN        NaN      NaN     NaN      NaN
3                      D8 Holdings  Vicarious Surgical    DEH  4/15/2021  7/17/2022    1119.0      NaN     NaN        NaN      NaN     NaN      NaN
4                 Isos Acquisition             Bowlero   ISOS   7/1/2021   3/5/2023    2616.0      NaN     NaN        NaN      NaN     NaN      NaN

Although this gives me the values with proper rows and columns, it does not give the URLs from the first column, and a lot of the column values are NaNs.

Here is how I implemented it:

table = pd.read_html(live_deals)
df = table[0]
print(df.head())
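A likely cause of the NaNs (an assumption, not verified against the site): `pd.read_html(live_deals)` re-fetches the URL without the logged-in Selenium session, so it sees an incomplete table. Parsing the already-rendered `browser.page_source` instead keeps the session, and the hrefs that `read_html` discards can be recovered from the same HTML. A sketch using a small hypothetical table as a stand-in for the page source:

```python
from io import StringIO

import pandas as pd
from lxml import html as lhtml

# Hypothetical stand-in for browser.page_source after login.
page_source = """<table>
  <tr><th>SPAC</th><th>Ticker</th></tr>
  <tr><td><a href="/company/adf">Aldel Financial</a></td><td>ADF</td></tr>
  <tr><td><a href="/company/ipvf">InterPrivate III</a></td><td>IPVF</td></tr>
</table>"""

# read_html parses the table text but drops the <a> tags...
df = pd.read_html(StringIO(page_source))[0]
# ...so pull the hrefs with a separate XPath pass and attach them.
df["SPAC_LINK"] = lhtml.fromstring(page_source).xpath("//td/a/@href")
print(df)
```

In the real script the call would be `pd.read_html(StringIO(browser.page_source))[0]`, run after the login waits have completed.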

Upvotes: 0

Views: 162

Answers (1)

Shine J

Reputation: 818

This should work:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

driver = webdriver.Chrome("D:/chromedriver/94/chromedriver.exe")


driver.get("https://www.spacresearch.com/symbol?s=live-deal&sector=&geography=")
wait = WebDriverWait(driver,15)
wait.until(EC.element_to_be_clickable((By.XPATH, "(//input[@id='username'])[2]"))).send_keys("[email protected]")
wait.until(EC.element_to_be_clickable((By.XPATH, "(//input[@id='password'])[2]"))).send_keys("YourPassword0101")
wait.until(EC.element_to_be_clickable((By.XPATH, "(//button[text()='Next'])[2]"))).click()

spac_data = {
    "SPAC": [],
    "SPAC_LINK": [],
    "Target": [],
    "Ticker": [],
    "Announced": [],
    "Deadline": [],
    "TEV ($M)": [],
    "TEV/IPO": [],
    "Sector": [],
    "Geography": [],
    "Premium": [],
    "Common": [],
    "Warrant": []
}


tbody = wait.until(EC.presence_of_element_located((By.TAG_NAME, 'tbody')))
rows = tbody.find_elements(By.TAG_NAME, 'tr')

for row in rows:
    cols = row.find_elements(By.TAG_NAME, 'td')
    link = cols[0].find_element(By.TAG_NAME, 'div').find_element(By.TAG_NAME, 'a').get_attribute('href')
    
    spac_data["SPAC"].append(cols[0].text)
    spac_data["SPAC_LINK"].append(link)
    spac_data["Target"].append(cols[1].text)
    spac_data["Ticker"].append(cols[2].text)
    spac_data["Announced"].append(cols[3].text)
    spac_data["Deadline"].append(cols[4].text)
    spac_data["TEV ($M)"].append(cols[5].text)
    spac_data["TEV/IPO"].append(cols[6].text)
    spac_data["Sector"].append(cols[7].text)
    spac_data["Geography"].append(cols[8].text)
    spac_data["Premium"].append(cols[9].text)
    spac_data["Common"].append(cols[10].text)
    spac_data["Warrant"].append(cols[11].text)

df = pd.DataFrame.from_dict(spac_data)
print(df)
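One possible follow-up (not part of the answer above): every cell collected via `.text` is a string, so columns such as `TEV ($M)` or `Premium` need cleaning before any numeric work. A hedged sketch with a hypothetical helper that strips `$`, `%`, and thousands separators:

```python
import pandas as pd

def to_number(s: pd.Series) -> pd.Series:
    # Remove $, %, and commas, then coerce what remains to floats
    # (unparseable cells become NaN rather than raising).
    return pd.to_numeric(s.str.replace(r"[$,%]", "", regex=True), errors="coerce")

# Sample values matching the scraped format.
df = pd.DataFrame({"TEV ($M)": ["3,134", "1,943"],
                   "Premium": ["-0.99%", "-1.80%"]})
for col in ["TEV ($M)", "Premium"]:
    df[col] = to_number(df[col])
print(df.dtypes)
```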

Upvotes: 1
