MinimalMaximizer

Reputation: 392

Using BeautifulSoup, how to scrape the table headers from a page

I've tried various pieces of code to scrape the names of table headers using BeautifulSoup and Python, and each time I just get an empty list back. This is the markup I want to extract from:

<table class="table table-bordered table-striped table-hover data-grid ng-scope">
    <thead>
       <tr>
          <th class="ng-isolate-scope sortable" data-colname="Advertiser" data-colsorter="sorter">
           Advertiser

The info I would like to extract is the value of the "data-colname" attribute. This is what I've tried:

for tx in soup.find_all('th'):
    table_headers.append(tx.get('th.data-colname'))
#this returns an empty list, tried other combinations of this sort ... all returned an empty list

#Another attempt was:
spans = [x.text.strip() for x in soup.select('th.ng-isolate-scope data-colname')]
# returns errors

Upvotes: 2

Views: 12491

Answers (2)

transcranial

Reputation: 381

To extract the value of the data-colname attribute, index the tag with the attribute name, for example:

for tx in soup.find_all('th'):
    table_headers.append(tx['data-colname'])

Here's the code I used:

from bs4 import BeautifulSoup
html = '<table class="table table-bordered table-striped table-hover data-grid ng-scope"> <thead><tr><th class="ng-isolate-scope sortable" data-colname="Advertiser" data-colsorter="sorter">Advertiser</th></tr></thead></table>'
soup = BeautifulSoup(html, 'lxml')
table_headers = []
for tx in soup.find_all('th'):
    table_headers.append(tx['data-colname'])

Output:

>>> print(table_headers)  # Python 3; Python 2 shows [u'Advertiser']
['Advertiser']
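
One caveat with subscripting: tx['data-colname'] raises a KeyError for any th that lacks the attribute. A minimal sketch of one way around that, assuming Python 3 and the built-in html.parser instead of lxml, and using made-up markup in which the second header cell has no data-colname, is to restrict the search to th elements that actually carry the attribute:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration: the second <th> has no data-colname.
html = (
    '<table><thead><tr>'
    '<th class="sortable" data-colname="Advertiser">Advertiser</th>'
    '<th class="actions">Actions</th>'
    '</tr></thead></table>'
)
soup = BeautifulSoup(html, 'html.parser')

# attrs={'data-colname': True} matches only tags that have the attribute,
# so the subscript below cannot raise KeyError.
table_headers = [
    th['data-colname']
    for th in soup.find_all('th', attrs={'data-colname': True})
]
print(table_headers)
```

Here the "Actions" cell is skipped rather than causing an exception.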

Upvotes: 2

Bruno Viegas

Reputation: 11

I think removing the "th." prefix from the argument to get() should solve your problem.

Since tx is already the th element itself:

<th class="ng-isolate-scope sortable" data-colname="Advertiser" data-colsorter="sorter">
           Advertiser

(or one of its sibling th elements), each iteration gives you a single element, so get() only needs the attribute name. So, long story short:

for tx in soup.find_all('th'):
    table_headers.append(tx.get('data-colname'))
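
The select() attempt from the question can probably be made to work as well. As a rough sketch, assuming Python 3, the built-in html.parser, and a bs4 version that supports CSS attribute selectors: a bare data-colname after a space is read as a descendant tag name, whereas an attribute is selected with square brackets. The snippet also shows that get() returns None for a name that is not an attribute of the tag, rather than raising.

```python
from bs4 import BeautifulSoup

html = (
    '<table><thead><tr>'
    '<th class="ng-isolate-scope sortable" data-colname="Advertiser" '
    'data-colsorter="sorter">Advertiser</th>'
    '</tr></thead></table>'
)
soup = BeautifulSoup(html, 'html.parser')

# th.ng-isolate-scope[data-colname] selects <th> elements with that class
# that also carry a data-colname attribute.
table_headers = [
    th.get('data-colname')
    for th in soup.select('th.ng-isolate-scope[data-colname]')
]

# 'th.data-colname' is not an attribute name, so get() falls back to None.
missing = soup.find('th').get('th.data-colname')

print(table_headers)
print(missing)
```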

Hope this helps.

Upvotes: 1
