Reputation: 127
I am working through an issue with scraping a web table using Python. I have been scraping what I would call 'standard' tables for a while and feel I understand them reasonably well. By a standard table I mean one with a structure like:
<table>
<tr class="row-class">
<th>Bill</th>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
</tr>
<tr class="row-class">
<th>Ben</th>
<td>2</td>
<td>3</td>
<td>4</td>
<td>1</td>
</tr>
<tr class="row-class">
<th>Barry</th>
<td>3</td>
<td>4</td>
<td>1</td>
<td>2</td>
</tr>
</table>
I have now come across a table with a slightly different structure, and I can't figure out how to get the data out of it in the format I need. The structure I am now trying to scrape is:
<table>
<tr class="row-class">
<th>Bill</th></tr>
<tr><td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
</tr>
<tr class="row-class">
<th>Ben</th></tr>
<tr>
<td>2</td>
<td>3</td>
<td>4</td>
<td>1</td>
</tr>
<tr class="row-class">
<th>Barry</th></tr>
<tr>
<td>3</td>
<td>4</td>
<td>1</td>
<td>2</td>
</tr>
</table>
The output I am trying to achieve is:
Bill,1,2,3,4
Ben,2,3,4,1
Barry,3,4,1,2
I assume the problem is that each header is stored in a separate tr row from its data, so I only get an output of:
Bill
Ben
Barry
I am wondering if the solution is to traverse the rows and determine if the next tag is a th or td and then perform an appropriate action? I'd appreciate any advice on how the code I am using to test this could be modified to achieve the desired output. The code is:
from bs4 import BeautifulSoup
t_obj = """<tr class="row-class">
<th>Bill</th></tr>
<tr><td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
</tr>
<tr class="row-class">
<th>Ben</th></tr>
<tr>
<td>2</td>
<td>3</td>
<td>4</td>
<td>1</td>
</tr>
<tr class="row-class">
<th>Barry</th></tr>
<tr>
<td>3</td>
<td>4</td>
<td>1</td>
<td>2</td>
</tr>"""
soup = BeautifulSoup(t_obj, "html.parser")
trs = soup.find_all("tr", {"class": "row-class"})
for tr in trs:
    for th in tr.find_all("th"):
        print(th.get_text())
    # The td cells sit in the following tr, so this loop finds nothing here
    for td in tr.find_all("td"):
        print(td.get_text())
Upvotes: 2
Views: 890
Reputation: 2469
Preprocess the HTML so that each header row and the data row after it are merged into a single tr:
from simplified_scrapy.simplified_doc import SimplifiedDoc
t_obj = """<tr class="row-class">
<th>Bill</th></tr>
<tr><td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
</tr>
<tr class="row-class">
<th>Ben</th></tr>
<tr>
<td>2</td>
<td>3</td>
<td>4</td>
<td>1</td>
</tr>
<tr class="row-class">
<th>Barry</th></tr>
<tr>
<td>3</td>
<td>4</td>
<td>1</td>
<td>2</td>
</tr>"""
doc = SimplifiedDoc()
# Merge each header tr with the data tr that follows it
doc.loadHtml(doc.replaceReg(t_obj, r"</tr>\s*<tr>", ''))
trs = doc.trs  # get all tr
for tr in trs:
    tds = tr.children  # get td and th
    data = [td.text for td in tds]
    print(data)
Result:
['Bill', '1', '2', '3', '4']
['Ben', '2', '3', '4', '1']
['Barry', '3', '4', '1', '2']
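The same merge-then-parse idea can be sketched with only the standard re module and BeautifulSoup, for readers who do not have simplified_scrapy installed. This is a rough sketch, not a drop-in replacement: the shortened html string and the names merged and merged_rows are assumptions made for illustration, and the pattern only matches a closing tr followed by a bare <tr> with no attributes.

```python
import re
from bs4 import BeautifulSoup

# Shortened sample with the same shape as the question's markup (assumption)
html = """<tr class="row-class">
<th>Bill</th></tr>
<tr><td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
</tr>
<tr class="row-class">
<th>Ben</th></tr>
<tr>
<td>2</td>
<td>3</td>
<td>4</td>
<td>1</td>
</tr>"""

# Drop every "</tr> <tr>" boundary so the header row absorbs its data row.
# "<tr class=...>" is left alone because the pattern requires a bare "<tr>".
merged = re.sub(r"</tr>\s*<tr>", "", html)

soup = BeautifulSoup(merged, "html.parser")
merged_rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in soup.find_all("tr")
]

for row in merged_rows:
    print(",".join(row))
```

Editing markup with a regular expression is fragile if the real page adds attributes to the data rows, so it is worth checking the actual source before relying on it.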
Upvotes: 0
Reputation: 71471
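You can use indexing, pairing each even-indexed tr with the one after it (html below is the markup string, t_obj in the question):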
from bs4 import BeautifulSoup as soup
d = soup(html, 'html.parser').find_all('tr')
result = [[d[i].text]+[c.text for c in d[i+1].find_all('td')] for i in range(0, len(d), 2)]
To print your result:
print('\n'.join(f'{a[1:]},{",".join(b)}' for a, *b in result))
Output:
Bill,1,2,3,4
Ben,2,3,4,1
Barry,3,4,1,2
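The a[1:] slice above is there because .text on the header row keeps the newline that sits before the th tag. A possible variant, sketched here under the assumption that every header row is immediately followed by exactly one data row, uses get_text(strip=True) so no slicing is needed. The shortened html string and the names paired and lines are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened sample with the same shape as the question's markup (assumption)
html = """<tr class="row-class">
<th>Bill</th></tr>
<tr><td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
</tr>
<tr class="row-class">
<th>Ben</th></tr>
<tr>
<td>2</td>
<td>3</td>
<td>4</td>
<td>1</td>
</tr>"""

trs = BeautifulSoup(html, "html.parser").find_all("tr")

# Pair row 0 with row 1, row 2 with row 3, and so on.
# Stopping at len(trs) - 1 skips a trailing header row that has no data row.
paired = [
    [trs[i].get_text(strip=True)]
    + [td.get_text(strip=True) for td in trs[i + 1].find_all("td")]
    for i in range(0, len(trs) - 1, 2)
]

lines = [",".join(row) for row in paired]
print("\n".join(lines))
```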
Upvotes: 0
Reputation: 195593
Here are three ways to pair the two <tr> tags together:
1. zip() and a CSS selector
2. find_next_sibling()
3. zip() and simple slicing with a custom step
from bs4 import BeautifulSoup
t_obj = """<tr class="row-class">
<th>Bill</th></tr>
<tr><td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
</tr>
<tr class="row-class">
<th>Ben</th></tr>
<tr>
<td>2</td>
<td>3</td>
<td>4</td>
<td>1</td>
</tr>
<tr class="row-class">
<th>Barry</th></tr>
<tr>
<td>3</td>
<td>4</td>
<td>1</td>
<td>2</td>
</tr>"""
soup = BeautifulSoup(t_obj, 'html.parser')

# Method 1: zip() and a CSS selector
for tr1, tr2 in zip(soup.select('tr.row-class'), soup.select('tr.row-class ~ tr:not(.row-class)')):
    print(','.join(tag.get_text() for tag in tr1.select('th') + tr2.select('td')))

print()

# Method 2: find_next_sibling()
for tr in soup.select('tr.row-class'):
    print(','.join(tag.get_text() for tag in tr.select('th') + tr.find_next_sibling('tr').select('td')))

print()

# Method 3: zip() and slicing with a step of 2
trs = soup.select('tr')
for tr1, tr2 in zip(trs[::2], trs[1::2]):
    print(','.join(tag.get_text() for tag in tr1.select('th') + tr2.select('td')))
Prints:
Bill,1,2,3,4
Ben,2,3,4,1
Barry,3,4,1,2
Bill,1,2,3,4
Ben,2,3,4,1
Barry,3,4,1,2
Bill,1,2,3,4
Ben,2,3,4,1
Barry,3,4,1,2
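The traversal idea raised in the question (walk the rows and act on whether a row holds a th or td cells) can also be sketched directly. This is only a minimal illustration: group_rows is a hypothetical helper name, the shortened html string is an assumption, and the sketch assumes a header row always comes before its data rows. Unlike the pairing methods, it would also tolerate a header followed by more than one data row.

```python
from bs4 import BeautifulSoup

# Shortened sample with the same shape as the question's markup (assumption)
html = """<tr class="row-class">
<th>Bill</th></tr>
<tr><td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
</tr>
<tr class="row-class">
<th>Ben</th></tr>
<tr>
<td>2</td>
<td>3</td>
<td>4</td>
<td>1</td>
</tr>"""


def group_rows(markup):
    """Hypothetical helper: start a record at each th row, extend it with later td rows."""
    soup = BeautifulSoup(markup, "html.parser")
    records = []
    for tr in soup.find_all("tr"):
        header = tr.find("th")
        if header is not None:
            # A header row opens a new record
            records.append([header.get_text(strip=True)])
        elif records:
            # A data row is appended to the most recent record
            records[-1].extend(td.get_text(strip=True) for td in tr.find_all("td"))
    return records


rows = group_rows(html)
for row in rows:
    print(",".join(row))
```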
Upvotes: 3