Alex Ketay
Alex Ketay

Reputation: 928

How can I get the first and third td from a table with BeautifulSoup?

I am currently using Python and BeautifulSoup to scrape some website data. I'm trying to pull cells from a table which is formatted like so:

<tr><td>1<td><td>20<td>5%</td></td></td></td></tr>

The problem with the above HTML is that BeautifulSoup reads it as one tag. I need to pull the values from the first <td> and the third <td>, which would be 1 and 20, respectively.

Unfortunately, I have no idea how to go about this. How can I get BeautifulSoup to read the 1st and 3rd <td> tags of each row of the table?

Update:

I figured out the problem. I was using html.parser instead of the default for BeautifulSoup. Once I switched to the default the problems went away. Also I used the method listed in the answer.

I also found out that the different parsers are very temperamental with broken code. For instance, the default parser refused to read past row 192, but html5lib got the job done.So try using lxml, html, and also html5lib if you are having problems parsing the entire table.

Upvotes: 6

Views: 19639

Answers (1)

schesis
schesis

Reputation: 59118

That's a nasty piece of HTML you've got there. If we ignore the semantics of table rows and table cells for a moment and treat it as pure XML, its structure looks like this:

<tr>
  <td>1
    <td>
      <td>20
        <td>5%</td>
      </td>
    </td>
  </td>
</tr>

BeautifulSoup, however, knows about the semantics of HTML tables, and instead parses it like this:

<tr>
  <td>1        <!-- an IMPLICITLY (no closing tag) closed td element -->
  <td>         <!-- as above -->
  <td>20       <!-- as above -->
  <td>5%</td>  <!-- an EXPLICITLY closed td element -->
  </td>        <!-- an error; ignore this -->
  </td>        <!-- as above -->
  </td>        <!-- as above -->
</tr>

... so that, as you say, 1 and 20 are in the first and third td elements (not tags) respectively.

You can actually get at the contents of these td elements like this:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<tr><td>1<td><td>20<td>5%</td></td></td></td></tr>")
>>> tr = soup.find("tr")
>>> tr
<tr><td>1</td><td></td><td>20</td><td>5%</td></tr>
>>> td_list = tr.find_all("td")
>>> td_list
[<td>1</td>, <td></td>, <td>20</td>, <td>5%</td>]
>>> td_list[0]  # Python starts counting list items from 0, not 1
<td>1</td>
>>> td_list[0].text
'1'
>>> td_list[2].text
'20'
>>> td_list[3].text
'5%'

Upvotes: 15

Related Questions