Matt

Reputation: 982

How can I identify the correct td tag using bs4 in Python when the HTML code is inconsistent?

I'm using BeautifulSoup4 in Python to parse some HTML code. I've managed to drill down to the correct table and identify the td tags, but the problem I'm facing is that the style attribute in the tags is inconsistently applied and it is making the task of getting the correct td tag a real challenge.

The data I'm trying to pull is a date field, but at any one time there will be multiple td tags that are hidden using CSS (what is visible depends on the option value selected elsewhere in the HTML code).

Actual examples:

<td style="display: none;">01/03/2016</td>
<td style="display: table-cell;">27/10/2015</td> <-- this is the tag I want

and

<td style="display:none">23/02/2016</td>
<td style="">09/05/2011</td> <-- this is the tag I want
<td style="display: none;">29/03/2011</td>
<td style="display:none">19/10/2010</td>

and

<td>27/10/2015</td> <-- this is the tag I want
<td style="display: none">01/03/2016</td>
<td style="display: none">22/03/2016</td>

and

<td style="display:none">11/04/2015</td>
<td style="display: table-cell;">02/02/2016</td> <-- this is the tag I want
<td style="display: none">18/10/2013</td>

How would I exclude/remove the incorrect items (which have styles of display:none and display: none) to leave me with the one that I do actually want?

Upvotes: 1

Views: 153

Answers (1)

Padraic Cunningham

Reputation: 180441

Filter the tds with a list comprehension, keeping a td only if its style attribute is not in the set {"display:none", "display: none;", "display: none"}. A td with no style attribute at all is kept too, because td.get("style") returns None for it:

In [7]: from bs4 import BeautifulSoup

In [8]: h1 = """<td style="display: none;">01/03/2016</td>
   ...: <td style="display: table-cell;">27/10/2015</td>"""

In [9]: h2 = """<td style="display:none">23/02/2016</td>
   ...: <td style="">09/05/2011</td> <-- this is the tag I want
   ...: <td style="display: none;">29/03/2011</td>
   ...: <td style="display:none">19/10/2010</td>"""

In [10]: h3 = """<td>27/10/2015</td> <-- this is the tag I want
   ....: <td style="display: none">01/03/2016</td>
   ....: <td style="display: none">22/03/2016</td>"""

In [11]: h4 = """<td style="display:none">11/04/2015</td>
   ....: <td style="display: table-cell;">02/02/2016</td> <-- this is the tag I want
   ....: <td style="display: none">18/10/2013</td>"""

In [12]: ignore = {"display:none", "display: none;", "display: none"}

In [13]: for html in [h1, h2, h3, h4]:
   ....:     soup = BeautifulSoup(html, "html.parser")
   ....:     # keep only the tds whose style is not one of the hidden variants
   ....:     print([td for td in soup.find_all("td") if td.get("style") not in ignore])
   ....:
[<td style="display: table-cell;">27/10/2015</td>]
[<td style="">09/05/2011</td>]
[<td>27/10/2015</td>]
[<td style="display: table-cell;">02/02/2016</td>]
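
If more spellings of the hidden style turn up than the three listed in the set, one possible alternative is to parse the style value instead of enumerating variants. The following is only a sketch under that assumption; is_hidden and visible_tds are hypothetical helper names, not part of bs4, and the third td in the sample uses a made-up spelling to show the normalisation:

```python
from bs4 import BeautifulSoup


def is_hidden(td):
    """Return True if the tag's inline style sets display to none."""
    style = td.get("style") or ""  # a missing style attribute counts as visible
    # Split "a: b; c: d" into declarations and compare each one
    # with surrounding whitespace and letter case ignored.
    for declaration in style.split(";"):
        prop, _, value = declaration.partition(":")
        if prop.strip().lower() == "display" and value.strip().lower() == "none":
            return True
    return False


def visible_tds(html):
    """Return the td tags in html that are not hidden by an inline style."""
    soup = BeautifulSoup(html, "html.parser")
    return [td for td in soup.find_all("td") if not is_hidden(td)]


html = """<td style="display:none">23/02/2016</td>
<td style="">09/05/2011</td>
<td style="DISPLAY : none ;">29/03/2011</td>
<td style="display:none">19/10/2010</td>"""

print([td.get_text() for td in visible_tds(html)])  # ['09/05/2011']
```

This only looks at the inline style attribute, so a td hidden through a stylesheet rule or a class would still be treated as visible.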

Upvotes: 1
