Reputation: 521
I am extracting information from here. I am able to extract the column names and all of the values from the table; however, I am not sure how to save those values in a pandas DataFrame.
For instance, for the first column, 'SPAC', I get all the URLs together. These can be saved in a list, which in turn could be added to the DataFrame. The main challenge I am facing is with the rest of the columns and values, where I get output like:
SPAC
Target
Ticker
Announced
Deadline
TEV ($M)
TEV/IPO
Sector
Geography
Premium
Common
Warrant
Aldel Financial
Hagerty ADF 8/18/2021 4/13/2023 3,134 2698% Financial US/Canada -0.99% $10.00 $1.56
InterPrivate III Financial
Aspiration IPVF 8/18/2021 3/9/2023 1,943 751% Financial US/Canada -1.80% $9.82 $1.00
Here the first 12 lines are the column names, and the rest (from 'Aldel Financial' onward) are the column values. I do not want 'Aldel Financial' itself saved in my DataFrame, as I have already extracted the link for it. The rest, i.e. from 'Hagerty' through '$1.56', are the values for the respective columns.
How do I save this to a pandas DataFrame, given that all the elements share the same ids?
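For reference, once each column's cell texts have been collected into a list, building the frame is a dict-of-lists construction. A minimal sketch, using made-up sample values taken from the rows shown above:

```python
import pandas as pd

# One list per column, using sample values from the rows shown above.
spac_data = {
    "SPAC": ["Aldel Financial", "InterPrivate III"],
    "Target": ["Hagerty", "Aspiration"],
    "Ticker": ["ADF", "IPVF"],
}

# Each dict key becomes a column; each list supplies that column's values.
df = pd.DataFrame(spac_data)
print(df)
```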
Here is my complete code.
def extraction():
    print("Program started Successfully!")
    websites = ["https://www.spacresearch.com/symbol?s=live-deal&sector=&geography="]
    data = []
    for live_deals in websites:
        browser.get(live_deals)
        wait = WebDriverWait(browser, 10)
        wait.until(
            EC.element_to_be_clickable((By.XPATH, "(//input[@id='username'])[2]"))
        ).send_keys("kys")
        wait.until(
            EC.element_to_be_clickable((By.XPATH, "(//input[@id='password'])[2]"))
        ).send_keys("pasd")
        wait.until(
            EC.element_to_be_clickable((By.XPATH, "(//button[text()='Next'])[2]"))
        ).click()
        time.sleep(2)
        spac = browser.find_elements_by_class_name("ellipsis")
        for all_spac in spac:
            links = all_spac.find_element_by_tag_name("a").get_attribute("href")
            print(links)
        target = browser.find_elements_by_id("companies-table-deal-announced")
        for all_targets in target:
            print(all_targets.text)
    print("Program Ended successfully!")

extraction()
Please help me understand how I can save the extracted columns and values in a pandas DataFrame. Thanks!
EDIT: I tried using pandas' read_html() and got the following output:
SPAC Target Ticker Announced Deadline TEV ($M) TEV/IPO Sector Geography Premium Common Warrant
0 CA Healthcare Acq LumiraDx CAHC 4/7/2021 1/29/2023 30000.0 NaN NaN NaN NaN NaN NaN
1 European Sustainable Growth Acq ADS-TEC Energy EUSG 8/11/2021 1/26/2023 356.0 NaN NaN NaN NaN NaN NaN
2 Longview Acq II HeartFlow LGV 7/15/2021 3/23/2023 2373.0 NaN NaN NaN NaN NaN NaN
3 D8 Holdings Vicarious Surgical DEH 4/15/2021 7/17/2022 1119.0 NaN NaN NaN NaN NaN NaN
4 Isos Acquisition Bowlero ISOS 7/1/2021 3/5/2023 2616.0 NaN NaN NaN NaN NaN NaN
Although this gives me the values in proper rows and columns, it does not give the URLs from the first column, and a lot of the column values are NaN.
Here is how I implemented it:
table = pd.read_html(live_deals)
df = table[0]
print(df.head())
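A likely cause of the NaNs (an assumption about this site, since it requires a login) is that read_html(live_deals) re-fetches the URL without the Selenium session, so only part of the table is visible; passing the page Selenium has already rendered, e.g. pd.read_html(browser.page_source), keeps the logged-in state. A minimal sketch of the mechanics, with an inline HTML snippet standing in for browser.page_source:

```python
import io
import pandas as pd

# Stand-in for browser.page_source after logging in via Selenium.
html = """
<table>
  <thead><tr><th>SPAC</th><th>Ticker</th><th>TEV ($M)</th></tr></thead>
  <tbody>
    <tr><td>Aldel Financial</td><td>ADF</td><td>3,134</td></tr>
    <tr><td>InterPrivate III</td><td>IPVF</td><td>1,943</td></tr>
  </tbody>
</table>
"""

# read_html parses every <table> in the markup and returns a list of
# DataFrames; by default it also strips "," thousands separators.
tables = pd.read_html(io.StringIO(html))
df = tables[0]
print(df)
```

Note that read_html still cannot recover the href attributes, so the links have to be collected separately, as in the question's loop over the "ellipsis" elements.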
Upvotes: 0
Views: 162
Reputation: 818
This should work:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

driver = webdriver.Chrome("D:/chromedriver/94/chromedriver.exe")
driver.get("https://www.spacresearch.com/symbol?s=live-deal&sector=&geography=")

wait = WebDriverWait(driver, 15)
wait.until(EC.element_to_be_clickable((By.XPATH, "(//input[@id='username'])[2]"))).send_keys("[email protected]")
wait.until(EC.element_to_be_clickable((By.XPATH, "(//input[@id='password'])[2]"))).send_keys("YourPassword0101")
wait.until(EC.element_to_be_clickable((By.XPATH, "(//button[text()='Next'])[2]"))).click()

spac_data = {
    "SPAC": [],
    "SPAC_LINK": [],
    "Target": [],
    "Ticker": [],
    "Announced": [],
    "Deadline": [],
    "TEV ($M)": [],
    "TEV/IPO": [],
    "Sector": [],
    "Geography": [],
    "Premium": [],
    "Common": [],
    "Warrant": []
}

tbody = wait.until(EC.presence_of_element_located((By.TAG_NAME, 'tbody')))
rows = tbody.find_elements(By.TAG_NAME, 'tr')
for row in rows:
    cols = row.find_elements(By.TAG_NAME, 'td')
    link = cols[0].find_element(By.TAG_NAME, 'div').find_element(By.TAG_NAME, 'a').get_attribute('href')
    spac_data["SPAC"].append(cols[0].text)
    spac_data["SPAC_LINK"].append(link)
    spac_data["Target"].append(cols[1].text)
    spac_data["Ticker"].append(cols[2].text)
    spac_data["Announced"].append(cols[3].text)
    spac_data["Deadline"].append(cols[4].text)
    spac_data["TEV ($M)"].append(cols[5].text)
    spac_data["TEV/IPO"].append(cols[6].text)
    spac_data["Sector"].append(cols[7].text)
    spac_data["Geography"].append(cols[8].text)
    spac_data["Premium"].append(cols[9].text)
    spac_data["Common"].append(cols[10].text)
    spac_data["Warrant"].append(cols[11].text)

df = pd.DataFrame.from_dict(spac_data)
print(df)
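One caveat: every value above comes from .text, so numeric-looking columns are stored as strings. If you need them as numbers, a conversion along these lines (assuming thousands separators like "3,134") works:

```python
import pandas as pd

# Sample column as it would come back from the scraper: strings with commas.
df = pd.DataFrame({"TEV ($M)": ["3,134", "1,943"]})

# Strip the thousands separators, then cast to a numeric dtype.
df["TEV ($M)"] = pd.to_numeric(df["TEV ($M)"].str.replace(",", "", regex=False))
print(df["TEV ($M)"].sum())
```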
Upvotes: 1