Julian Amrine

Reputation: 25

Why can't I use ".text" while scraping table headers with BeautifulSoup to remove unwanted HTML

When I run this code, I can see that the headers list is populated with the results I want; however, they are surrounded by some HTML I don't want to keep.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

# barchart.com uses javascript, so for now I need selenium to get full html
url = 'https://www.barchart.com/stocks/quotes/qqq/constituents'
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
browser = webdriver.Chrome(options=chrome_options)
browser.get(url)
page = browser.page_source

#  BeautifulSoup find table
soup = BeautifulSoup(page, 'lxml')
table = soup.find("table")
browser.quit()

# create list headers, then populate with th tagged cells
headers = []

for i in table.find_all('th'):
    title = i()
    headers.append(title)

So I tried:

for i in table.find_all('th'):
    title = i.text()
    headers.append(title)

Which returned "TypeError: 'str' object is not callable"

This seemed to work in some example documentation, but the Wikipedia tables used there seemed simpler than the ones on Barchart. Any ideas?

Upvotes: 2

Views: 38

Answers (1)

Hugo G

Reputation: 16494

As @MendelG pointed out, the error lies in i.text(): text is a property, not a method, so it cannot be called.

Alternatively, you can use get_text(), which is a method.

I would also suggest adding strip() to get rid of the extra whitespace around the text. Or, if you use get_text(), stripping is built in:

title = i.get_text(strip=True)
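
For illustration, here is a minimal standalone sketch of the difference (using html.parser on a hand-written th cell, not the Barchart page):

from bs4 import BeautifulSoup

cell = BeautifulSoup("<th><span> Symbol </span></th>", "html.parser").th
print(cell.text)                  # ' Symbol ' -- property access, no parentheses
print(cell.get_text(strip=True))  # 'Symbol'   -- method call, whitespace stripped

Applied to your full script: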
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

# barchart.com uses javascript, so for now I need selenium to get full html
url = 'https://www.barchart.com/stocks/quotes/qqq/constituents'
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
browser = webdriver.Chrome(options=chrome_options)
browser.get(url)
page = browser.page_source

#  BeautifulSoup find table
soup = BeautifulSoup(page, 'lxml')
table = soup.find("table")
browser.quit()

# create list headers, then populate with th tagged cells
headers = []

for i in table.find_all('th'):
    title = i.text.strip()
    # Or alternatively:
    #title = i.get_text(strip=True)
    headers.append(title)

print(headers)

This prints:

['Symbol', 'Name', '% Holding', 'Shares', 'Links']
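
One caveat, since barchart.com renders the table with JavaScript: page_source can be read before the table exists, in which case soup.find("table") returns None. If that happens, an explicit wait before grabbing the source helps (a sketch, assuming the table shows up within 10 seconds):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait until at least one <table> element is present, then read the rendered HTML
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "table"))
)
page = browser.page_source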

Upvotes: 1
