I have a very basic question and couldn't find an answer to it on SO. Suppose I have an HTML table as follows:
html1 = """
<table>
<tbody><tr>
<th>Id</th>
<th>Month</th>
</tr>
<tr><td>1</td><td>January</td></tr>
<tr><td>2</td><td>February</td></tr>
<tr><td>3</td></tr>
<tr><td>4</td></tr>
<tr><td>5</td><td>October</td></tr>
<tr><td>6</td><td>December</td></tr>
<tr><td>7</td></tr>
<tr><td>Correct</td></tr>
</tbody></table>
"""
I want to drop the <tr> tags whose first <td> does not contain a digit and keep the rest of the table intact. I'm not sure if that makes sense, but below is the desired output:
<table>
<tbody><tr>
<th>Id</th>
<th>Month</th>
</tr>
<tr><td>1</td><td>January</td></tr>
<tr><td>2</td><td>February</td></tr>
<tr><td>3</td></tr>
<tr><td>4</td></tr>
<tr><td>5</td><td>October</td></tr>
<tr><td>6</td><td>December</td></tr>
<tr><td>7</td></tr>
</tbody></table>
Upvotes: 2
Views: 124
Reputation: 20038
To remove every <tr> whose first <td> is not a digit, check whether the text of that <td> fails .isdigit() and, if so, .extract() the row:
from bs4 import BeautifulSoup
html1 = """
<table>
<tbody>
<tr>
<th>Id</th>
<th>Month</th>
</tr>
<tr>
<td>1</td>
<td>January</td>
</tr>
<tr>
<td>2</td>
<td>February</td>
</tr>
<tr>
<td>3</td>
</tr>
<tr>
<td>4</td>
</tr>
<tr>
<td>5</td>
<td>October</td>
</tr>
<tr>
<td>6</td>
<td>December</td>
</tr>
<tr>
<td>7</td>
</tr>
<tr>
<td>Correct</td>
</tr>
</tbody>
</table>
"""
soup = BeautifulSoup(html1, "html.parser")
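# Drop each row whose first <td> is not purely digits.
# row.find("td") searches only inside the row (find_next would look past it),
# so the header row, which has only <th> cells, is skipped explicitly and kept.
for row in soup.find_all("tr"):
    first_td = row.find("td")
    if first_td is not None and not first_td.get_text(strip=True).isdigit():
        row.extract()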
print(soup.prettify())
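If you need this in more than one place, here is a minimal sketch of the same idea wrapped in a function. The name drop_non_numeric_rows and the small sample table are made up for illustration. It assumes that rows without any <td> (such as the header row) should be kept, and that whitespace around the cell text should be ignored.

```python
from bs4 import BeautifulSoup


def drop_non_numeric_rows(html: str) -> BeautifulSoup:
    """Hypothetical helper: remove <tr> rows whose first <td> is not all digits."""
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.find_all("tr"):
        # find() looks only at descendants of this row
        first_td = row.find("td")
        if first_td is None:
            continue  # no <td> at all (header row with <th> cells): keep it
        if not first_td.get_text(strip=True).isdigit():
            row.extract()  # detach the whole row from the tree
    return soup


# Illustrative input, not the table from the question
sample = """
<table>
<tr><th>Id</th><th>Month</th></tr>
<tr><td>1</td><td>January</td></tr>
<tr><td> 2 </td></tr>
<tr><td>Correct</td></tr>
</table>
"""

cleaned = drop_non_numeric_rows(sample)
first_cells = [
    row.find("td").get_text(strip=True)
    for row in cleaned.find_all("tr")
    if row.find("td") is not None
]
print(first_cells)  # ['1', '2']
```

Note that str.isdigit() returns False for values such as "-1" or "3.5", so rows with those ids would be dropped too. If that matters for your data, swap in a different check.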
Upvotes: 1