Reputation: 982
I'm using BeautifulSoup4 in Python to parse some HTML code. I've managed to drill down to the correct table and identify the td tags, but the problem I'm facing is that the style attribute in the tags is inconsistently applied and it is making the task of getting the correct td tag a real challenge.
The data I'm trying to pull is a date field, but at any one time there will be multiple td tags that are hidden using CSS (what is visible depends on the option value selected elsewhere in the HTML code).
Actual examples:
<td style="display: none;">01/03/2016</td>
<td style="display: table-cell;">27/10/2015</td> <-- this is the tag I want
and
<td style="display:none">23/02/2016</td>
<td style="">09/05/2011</td> <-- this is the tag I want
<td style="display: none;">29/03/2011</td>
<td style="display:none">19/10/2010</td>
and
<td>27/10/2015</td> <-- this is the tag I want
<td style="display: none">01/03/2016</td>
<td style="display: none">22/03/2016</td>
and
<td style="display:none">11/04/2015</td>
<td style="display: table-cell;">02/02/2016</td> <-- this is the tag I want
<td style="display: none">18/10/2013</td>
How would I exclude/remove the incorrect items (which have styles of display:none and display: none) to leave me with the one that I do actually want?
Upvotes: 1
Views: 153
Reputation: 180441
Filter the tds using a list comprehension, keeping a td only if its style attribute is not in the set {"display:none", "display: none;", "display: none"}:
In [7]: from bs4 import BeautifulSoup
In [8]: h1 = """<td style="display: none;">01/03/2016</td>
   ...: <td style="display: table-cell;">27/10/2015</td>"""
In [9]: h2 = """<td style="display:none">23/02/2016</td>
   ...: <td style="">09/05/2011</td> <-- this is the tag I want
   ...: <td style="display: none;">29/03/2011</td>
   ...: <td style="display:none">19/10/2010</td>"""
In [10]: h3 = """<td>27/10/2015</td> <-- this is the tag I want
   ....: <td style="display: none">01/03/2016</td>
   ....: <td style="display: none">22/03/2016</td>"""
In [11]: h4 = """<td style="display:none">11/04/2015</td>
   ....: <td style="display: table-cell;">02/02/2016</td> <-- this is the tag I want
   ....: <td style="display: none">18/10/2013</td>"""
In [12]: ignore = {"display:none", "display: none;", "display: none"}
In [13]: for html in [h1, h2, h3, h4]:
   ....:     soup = BeautifulSoup(html, "html.parser")
   ....:     # td.get("style") is None when the attribute is missing, so such tds are kept
   ....:     print([td for td in soup.find_all("td") if td.get("style") not in ignore])
   ....:
[<td style="display: table-cell;">27/10/2015</td>]
[<td style="">09/05/2011</td>]
[<td>27/10/2015</td>]
[<td style="display: table-cell;">02/02/2016</td>]
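Listing every spelling of the hidden style only works while the page sticks to those exact strings. As a rough alternative sketch (not part of the approach above), the style value could be parsed instead, so that spacing, a trailing semicolon or letter case no longer matter. The helper name is_hidden is made up for this illustration, and it only looks at the inline style string, not at stylesheets or parent elements:

```python
def is_hidden(style):
    """Return True if an inline style string sets display to none.

    Hypothetical helper for illustration; it only inspects the inline
    style value passed in (None when the attribute is missing).
    """
    if not style:  # covers a missing attribute (None) and style=""
        return False
    display = None
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        if sep and name.strip().lower() == "display":
            # if display appears more than once, the last declaration wins
            display = value.strip().lower()
    return display == "none"


# The style values seen in the question, plus one oddly spaced variant.
styles = [None, "", "display:none", "display: none;",
          "display: table-cell;", "DISPLAY : None ;"]
visible = [s for s in styles if not is_hidden(s)]
print(visible)
```

With BeautifulSoup the same check would slot into the list comprehension as [td for td in soup.find_all("td") if not is_hidden(td.get("style"))].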
Upvotes: 1