Parse a table with BeautifulSoup, Selenium in Python

Question

https://rocketreach.co/horizon-blue-cross-blue-shield-of-new-jersey-email-format_b5c604a3f42e0c54 This is the page I'm trying to get information from. I need to extract the formats listed in the table ("first '_' last", "first_initial last", and so on). If not all of them, then at least the topmost format.

Here's what I have so far:

def search_on_google(key_word, driver):
    driver.get("https://www.google.com/")
    searchBoard = driver.find_element_by_name('q')
    searchBoard.send_keys(key_word + " Rocketreach.co")
    searchBoard.send_keys(Keys.TAB)
    searchBoard.send_keys(Keys.ENTER)
    driver.find_element_by_tag_name("cite").click()
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for link in soup.find_all('meta'):
        content = link.get('content')
        print(content)

Edit:

    for i in range(1):
    driver.find_element_by_tag_name("cite").click()
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    WebDriverWait(driver, 10).until(EC.presence_of_element_located(
        (By.XPATH, "//table/tbody/tr[1]/td[1][not(contains(text(), 'Lorem ipsum...'))]")))

    table_id = driver.find_element(By.TAG_NAME, "tbody")
    rows = table_id.find_elements(By.TAG_NAME, "tr")
    for row in rows:
        tds = row.find_elements(By.TAG_NAME, "td")
        top_format.append(tds[0].text)
        domain.append(tds[1].text)
        print(top_format)
        print(domain)
        break

    return top_format

Arundeep Chohan · Accepted Answer

There is only one table on this page, and it is not inside an iframe, so you can print all of its contents with the following:

# Wait until the first cell's text no longer contains 'Lorem ipsum...'.
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//table/tbody/tr[1]/td[1][not(contains(text(), 'Lorem ipsum...'))]")))
table_id = driver.find_element(By.TAG_NAME, "tbody")
rows = table_id.find_elements(By.TAG_NAME, "tr")
one_urls = []  # collects the text of every cell, row by row
for row in rows:
    tds = row.find_elements(By.TAG_NAME, "td")
    for td in tds:
        one_urls.append(td.text)
print(one_urls)

You could check for empty cells before printing, or limit the loop to a range of rows.

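if tds[0].text == '':  # compare the cell's text, not the WebElement itself
    continue  # inside the row loop: skip rows whose first cell is empty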
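One way to read the two suggestions above (a check, or a range) is sketched below. It is only an illustration: `pick_formats` and its `limit` parameter are hypothetical names, and it assumes each table row has already been reduced to a list of cell strings (for example `[td.text for td in tds]`), with the format in the first column as in the question's code.

```python
def pick_formats(rows, limit=None):
    """Return the first-column text of each usable row.

    rows  : list of rows, each a list of cell strings (assumed shape).
    limit : optional cap on how many formats to return (the "range" idea).
    """
    formats = []
    for cells in rows:
        # The "check": skip rows with no cells or a blank first cell.
        if not cells or cells[0].strip() == "":
            continue
        formats.append(cells[0].strip())
    # Slicing with None keeps everything; limit=1 keeps only the topmost format.
    return formats[:limit]


# Made-up sample data, not scraped from the page.
sample = [
    ["", ""],
    ["first '_' last", "example.com"],
    ["first_initial last", "example.com"],
]
print(pick_formats(sample, limit=1))
```

Calling `pick_formats(rows, limit=1)` would cover the "at least the topmost format" case from the question, while leaving `limit` unset returns every non-empty format.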

I'd also suggest waiting before looking for the table, since you're clicking and loading a new page before reading it.

table_id = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "tbody")))

You will need these imports:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC
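The question also builds a parser object from `driver.page_source`. As a rough sketch of that route, the block below pulls the table rows out of an HTML string once the page has loaded. Note the swap: it uses Python's standard-library `html.parser` instead of BeautifulSoup so that it runs without extra packages. `TableRows` and `parse_table` are hypothetical names, and the sample HTML is invented to stand in for `driver.page_source`; the real page's markup may differ.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a <td>.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # ignore rows that had no <td> cells
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_table(html):
    """Return all table rows found in an HTML string."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Invented HTML standing in for driver.page_source.
html = """
<table><tbody>
  <tr><td>first '_' last</td><td>example.com</td></tr>
  <tr><td>first_initial last</td><td>example.com</td></tr>
</tbody></table>
"""
print(parse_table(html))
```

In the Selenium flow this would be called as `parse_table(driver.page_source)` after the wait has succeeded, so the table content is already present in the source.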
