user7864386

Drop certain rows from an HTML table using BeautifulSoup

I have a very basic question and couldn't find an answer to it on SO. Suppose I have an HTML table as follows:

html1 = """
<table>
<tbody><tr>
<th>Id</th>
<th>Month</th>
</tr>
<tr><td>1</td><td>January</td></tr>
<tr><td>2</td><td>February</td></tr>
<tr><td>3</td></tr>
<tr><td>4</td></tr>
<tr><td>5</td><td>October</td></tr>
<tr><td>6</td><td>December</td></tr>
<tr><td>7</td></tr>
<tr><td>Correct</td></tr>
</tbody></table>
"""

I want to drop the <tr> tags whose first <td> does not contain a digit and keep the rest of the table intact. I'm not sure if that makes sense, so here is the desired output:

<table>
<tbody><tr>
<th>Id</th>
<th>Month</th>
</tr>
<tr><td>1</td><td>January</td></tr>
<tr><td>2</td><td>February</td></tr>
<tr><td>3</td></tr>
<tr><td>4</td></tr>
<tr><td>5</td><td>October</td></tr>
<tr><td>6</td><td>December</td></tr>
<tr><td>7</td></tr>
</tbody></table>

Upvotes: 2

Views: 124

Answers (1)

MendelG

Reputation: 20038

To remove every <tr> whose first <td> is not a digit, check that cell's text with .isdigit() and .extract() the row when the check fails:

from bs4 import BeautifulSoup


html1 = """
<table>
   <tbody>
      <tr>
         <th>Id</th>
         <th>Month</th>
      </tr>
      <tr>
         <td>1</td>
         <td>January</td>
      </tr>
      <tr>
         <td>2</td>
         <td>February</td>
      </tr>
      <tr>
         <td>3</td>
      </tr>
      <tr>
         <td>4</td>
      </tr>
      <tr>
         <td>5</td>
         <td>October</td>
      </tr>
      <tr>
         <td>6</td>
         <td>December</td>
      </tr>
      <tr>
         <td>7</td>
      </tr>
      <tr>
         <td>Correct</td>
      </tr>
   </tbody>
</table>
"""

soup = BeautifulSoup(html1, "html.parser")

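# Use find() rather than find_next() so the search stays inside the row;
# rows without a <td> (the header row) are kept.
for row in soup.find_all("tr"):
    first_td = row.find("td")
    if first_td is not None and not first_td.get_text(strip=True).isdigit():
        row.extract()

print(soup.prettify())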
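The same filter can also be packaged as a function. The following is a minimal sketch, not a definitive implementation: the helper name `drop_non_digit_rows` and the `sample` table are made up for illustration. It looks for the first `<td>` with `find()`, which searches only inside the row, whereas `find_next()` keeps searching through the rest of the document. That matters for a header row that has no `<td>` of its own: with `find_next()` it would be judged by the first cell of the following row.

```python
from bs4 import BeautifulSoup


def drop_non_digit_rows(html: str) -> str:
    """Return html without the <tr> rows whose first <td> is not all digits.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.find_all("tr"):
        first_td = row.find("td")  # searches inside this row only
        # Rows with no <td> at all (e.g. a header row of <th> cells) are kept.
        if first_td is not None and not first_td.get_text(strip=True).isdigit():
            row.decompose()  # remove the row from the tree and discard it
    return str(soup)


# Made-up sample: the header row is followed directly by a non-digit row.
sample = (
    "<table><tbody>"
    "<tr><th>Id</th><th>Month</th></tr>"
    "<tr><td>Total</td></tr>"
    "<tr><td>1</td><td>January</td></tr>"
    "<tr><td>Correct</td></tr>"
    "</tbody></table>"
)
result = drop_non_digit_rows(sample)
print(result)
```

Here the "Total" and "Correct" rows are removed while the header row and the "1" row stay. `decompose()` is used instead of `extract()` because the removed rows are not needed afterwards; `extract()` works equally well if you want to keep them.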

Upvotes: 1
