confusedcoder

Reputation: 23

Parse a table with BeautifulSoup, Selenium in Python

https://rocketreach.co/horizon-blue-cross-blue-shield-of-new-jersey-email-format_b5c604a3f42e0c54 is the page I'm trying to get information out of. I need to extract the email formats listed in its table ("first '_' last", "first_initial last", and so on). If I can't get all of them, I need at least the topmost format.

Here's what I have so far:

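from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

def search_on_google(key_word, driver):
    driver.get("https://www.google.com/")
    # Type the query into Google's search box and submit it
    searchBoard = driver.find_element(By.NAME, 'q')
    searchBoard.send_keys(key_word + " Rocketreach.co")
    searchBoard.send_keys(Keys.TAB)
    searchBoard.send_keys(Keys.ENTER)
    # Open the first result, then parse the page it loads
    driver.find_element(By.TAG_NAME, "cite").click()
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for link in soup.find_all('meta'):
        content = link.get('content')
        print(content)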

Edit:

    driver.find_element(By.TAG_NAME, "cite").click()
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    # Wait until the first cell no longer holds placeholder text
    WebDriverWait(driver, 10).until(EC.presence_of_element_located(
        (By.XPATH, "//table/tbody/tr[1]/td[1][not(contains(text(), 'Lorem ipsum...'))]")))

    top_format = []
    domain = []
    table_id = driver.find_element(By.TAG_NAME, "tbody")
    rows = table_id.find_elements(By.TAG_NAME, "tr")
    for row in rows:
        tds = row.find_elements(By.TAG_NAME, "td")
        top_format.append(tds[0].text)
        domain.append(tds[1].text)
        print(top_format)
        print(domain)
        break  # only the topmost row is needed

    return top_format

Upvotes: 0

Views: 472

Answers (1)

Arundeep Chohan

Reputation: 9969

There's only one table on this page, and it isn't inside an iframe, so you can print everything in it with the following:

# Wait until the first cell no longer holds placeholder text
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//table/tbody/tr[1]/td[1][not(contains(text(), 'Lorem ipsum...'))]")))
one_urls = []
table_id = driver.find_element(By.TAG_NAME, "tbody")
rows = table_id.find_elements(By.TAG_NAME, "tr")
for row in rows:
    tds = row.find_elements(By.TAG_NAME, "td")
    for td in tds:
        one_urls.append(td.text)
print(one_urls)

You could check for an empty cell before printing, or limit the loop with a range.

if tds[0].text == '':
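As a rough sketch of how the check and the range could work together, assume the cell texts have already been collected into plain lists of strings, one list per row. `top_formats` is a hypothetical helper name and the sample rows are made up for illustration; neither comes from Selenium or from the page.

```python
from itertools import islice


def top_formats(rows, limit=1):
    """Return the first-column text of up to `limit` rows, skipping blanks."""
    # The check: ignore rows with no cells or an empty first cell.
    nonempty = (cells[0].strip() for cells in rows if cells and cells[0].strip())
    # The range: take only the first `limit` matches.
    return list(islice(nonempty, limit))


# With Selenium, `rows` could be built from the table like this:
#   rows = [[td.text for td in tr.find_elements(By.TAG_NAME, "td")]
#           for tr in table_id.find_elements(By.TAG_NAME, "tr")]
sample_rows = [
    ["", ""],                              # made-up blank row
    ["first '_' last", "example.com"],     # made-up sample data
    ["first_initial last", "example.com"],
]
print(top_formats(sample_rows))
print(top_formats(sample_rows, limit=2))
```

The default `limit=1` returns just the topmost format, which is what the question asks for as a minimum.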

I'd also suggest adding a wait before locating the table, since you're clicking through and loading a new page before reading it.

table_id = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "tbody")))

You'll need these imports:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
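Since the question also passes `driver.page_source` to a parser, here is a minimal sketch of the same row-by-row walk done on the HTML string instead of through Selenium elements. It uses the standard library's `html.parser` in place of BeautifulSoup, only so that it runs without extra installs. `TableTextParser`, `table_rows`, and the sample HTML are hypothetical and made up for illustration; the real page's markup may differ.

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:   # rows holding only <th> cells are skipped
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_rows(html):
    """Return a list of rows, each a list of <td> texts."""
    parser = TableTextParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up stand-in for driver.page_source
sample_html = """
<table>
  <thead><tr><th>Format</th><th>Example</th></tr></thead>
  <tbody>
    <tr><td>first '_' last</td><td>jane_doe@example.com</td></tr>
    <tr><td>first_initial last</td><td>jdoe@example.com</td></tr>
  </tbody>
</table>
"""
rows = table_rows(sample_html)
print(rows[0][0])
```

In the scraper itself you would call `table_rows(driver.page_source)` after the wait has passed, so the table is already filled in when the HTML is read.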

Upvotes: 1
