Reputation: 127

Python webscraping tables with multiple header rows

I am working through an issue with scraping a webtable using python. I have been scraping what I would call 'standard' tables for a while and I feel like I understand that reasonably well. I define a standard table as having a structure like:

<table>
<tr class="row-class">
  <th>Bill</th>
  <td>1</td>
  <td>2</td>
  <td>3</td>
  <td>4</td>
</tr>
<tr class="row-class">
  <th>Ben</th>
  <td>2</td>
  <td>3</td>
  <td>4</td>
  <td>1</td>
</tr>
<tr class="row-class">
  <th>Barry</th>
  <td>3</td>
  <td>4</td>
  <td>1</td>
  <td>2</td>
</tr>
</table>

I have now come across a table instance which has a slightly different structure and I can't figure out how to get the data out of it in the format I need. The format I am now trying to scrape is:

<table>
<tr class="row-class">
  <th>Bill</th></tr>
  <tr><td>1</td>
  <td>2</td>
  <td>3</td>
  <td>4</td>
</tr>
<tr class="row-class">
  <th>Ben</th></tr>
  <tr>
  <td>2</td>
  <td>3</td>
  <td>4</td>
  <td>1</td>
</tr>
<tr class="row-class">
  <th>Barry</th></tr>
  <tr>
  <td>3</td>
  <td>4</td>
  <td>1</td>
  <td>2</td>
</tr>
</table>

The output I am trying to achieve is:

Bill,1,2,3,4
Ben,2,3,4,1
Barry,3,4,1,2

I assume the problem I am encountering is that because the header is stored in a separate tr row, I only get an output of:

Bill
Ben
Barry

I am wondering if the solution is to traverse the rows and determine if the next tag is a th or td and then perform an appropriate action? I'd appreciate any advice on how the code I am using to test this could be modified to achieve the desired output. The code is:

from bs4 import BeautifulSoup

t_obj = """<tr class="row-class">
  <th>Bill</th></tr>
  <tr><td>1</td>
  <td>2</td>
  <td>3</td>
  <td>4</td>
</tr>
<tr class="row-class">
  <th>Ben</th></tr>
  <tr>
  <td>2</td>
  <td>3</td>
  <td>4</td>
  <td>1</td>
</tr>
<tr class="row-class">
  <th>Barry</th></tr>
  <tr>
  <td>3</td>
  <td>4</td>
  <td>1</td>
  <td>2</td>
</tr>"""


soup = BeautifulSoup(t_obj)

trs = soup.find_all("tr", {"class":"row-class"})

for tr in trs:
    for th in tr.findAll('th'):
        print (th.get_text())
        for td in tr.findAll('td'):
            print(td.get_text())
            print(td.get_text())

Upvotes: 2

Answers (3)

dabingsou

Reputation: 2469

Process HTML to fit

from simplified_scrapy.simplified_doc import SimplifiedDoc 
t_obj = """<tr class="row-class">
  <th>Bill</th></tr>
  <tr><td>1</td>
  <td>2</td>
  <td>3</td>
  <td>4</td>
</tr>
<tr class="row-class">
  <th>Ben</th></tr>
  <tr>
  <td>2</td>
  <td>3</td>
  <td>4</td>
  <td>1</td>
</tr>
<tr class="row-class">
  <th>Barry</th></tr>
  <tr>
  <td>3</td>
  <td>4</td>
  <td>1</td>
  <td>2</td>
</tr>"""
doc = SimplifiedDoc()
doc.loadHtml(doc.replaceReg(t_obj,"</tr>\s*<tr>",''))# merge tr
trs = doc.trs # get all tr
for tr in trs:
  tds = tr.children # get td and th
  data = [td.text for td in tds]
  print (data)

result:

['Bill', '1', '2', '3', '4']
['Ben', '2', '3', '4', '1']
['Barry', '3', '4', '1', '2']

Upvotes: 0

Ajax1234

Reputation: 71471

You can use indexing:

from bs4 import BeautifulSoup as soup
d = soup(html, 'html.parser').find_all('tr')
result = [[d[i].text]+[c.text for c in d[i+1].find_all('td')] for i in range(0, len(d), 2)]

To print your result:

print('\n'.join(f'{a[1:]},{",".join(b)}' for a, *b in result))

Output:

Bill,1,2,3,4
Ben,2,3,4,1
Barry,3,4,1,2

Upvotes: 0

Andrej Kesely

Reputation: 195593

Here I use 3 methods how to pair the two <tr> tags together:

1st method is using zip() and CSS selector
2nd method is using BeautifulSoup's method find_next_sibling()
3rd method is using zip() and simple slicing with custom step

from bs4 import BeautifulSoup

t_obj = """<tr class="row-class">
  <th>Bill</th></tr>
  <tr><td>1</td>
  <td>2</td>
  <td>3</td>
  <td>4</td>
</tr>
<tr class="row-class">
  <th>Ben</th></tr>
  <tr>
  <td>2</td>
  <td>3</td>
  <td>4</td>
  <td>1</td>
</tr>
<tr class="row-class">
  <th>Barry</th></tr>
  <tr>
  <td>3</td>
  <td>4</td>
  <td>1</td>
  <td>2</td>
</tr>"""


soup = BeautifulSoup(t_obj, 'html.parser')

for tr1, tr2 in zip(soup.select('tr.row-class'), soup.select('tr.row-class ~ tr:not(.row-class)')):
    print( ','.join(tag.get_text() for tag in tr1.select('th') + tr2.select('td')) )

print()

for tr in soup.select('tr.row-class'):
    print( ','.join(tag.get_text() for tag in tr.select('th') + tr.find_next_sibling('tr').select('td')) )

print()

trs = soup.select('tr')
for tr1, tr2 in zip(trs[::2], trs[1::2]):
    print( ','.join(tag.get_text() for tag in tr1.select('th') + tr2.select('td')) )

Prints:

Bill,1,2,3,4
Ben,2,3,4,1
Barry,3,4,1,2

Bill,1,2,3,4
Ben,2,3,4,1
Barry,3,4,1,2

Bill,1,2,3,4
Ben,2,3,4,1
Barry,3,4,1,2

Upvotes: 3

Python webscraping tables with multiple header rows

Answers (3)

Related Questions