Reputation: 61
I am trying to parse a table and edit the text in each cell according to what I have in a QTableWidget, while keeping as much of the original styling as possible. The formatting of the td entries is inconsistent, and I can't figure out a good way to edit the text so that it comes out as I expect.
soup = BeautifulSoup(body, "html.parser")
tables = soup.find_all('table')
for table in tables:
    rows = table.find_all('tr')
    for r, row in enumerate(rows[1:]):
        cols = row.find_all('td')
        for c, ele in enumerate(cols):
            print(ele)
            # If I directly do
            # ele.string = tableWidget.item(r, c).text()
            # the text transforms to my expectation, but it loses all the styling like the hyperlink
            # e.g. I lose the entire <p> tag here <td style="padding:.75pt .75pt .75pt .75pt">07:00</td>
            # If I try the below, it doesn't work. ele.has_attr('p') is always false for some reason even though the td has <p> tag
            if ele.has_attr('p'):
                if not ele['p'].has_attr('a'):
                    ele['p'].string = tableWidget.item(r, c).text()
            else:
                ele.string = tableWidget.item(r, c).text()
Below is the output of ele; the text enclosed in ** is what I am trying to replace:
<td style="padding:.75pt .75pt .75pt .75pt"><p class="MsoNormal"><span style='font-family:"Calibri",sans-serif'><a href="https:/random link" target="_blank">**test**</a><o:p></o:p></span></p></td>
<td style="padding:.75pt .75pt .75pt .75pt"><p class="MsoNormal"><span style='font-family:"Calibri",sans-serif'>**07:00**<o:p></o:p></span></p></td>
<td style="padding:.75pt .75pt .75pt .75pt"><p class="MsoNormal"><span style='font-family:"Calibri",sans-serif'>**08:00**<o:p></o:p></span></p></td>
<td style="padding:.75pt .75pt .75pt .75pt">****</td>
Upvotes: 0
Views: 763
Reputation: 28565
It seems you understand that a and p are tags:
"it doesn't work. ele.has_attr('p') is always false for some reason even though the td has <p> tag"
but then you are using .has_attr(). p and a are not attributes; they are element tags. So you want to check whether the <td> has a <p> tag within it, and whether that <p> tag has an <a> tag.
So remove:
if ele.has_attr('p'):
    if not ele['p'].has_attr('a'):
and replace with:
if ele.find('p'):
    if not ele.find('p').find('a'):
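To make the attribute versus child-tag distinction concrete, here is a minimal sketch. The one-cell HTML string is a shortened stand-in for the cells in the question, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for one of the cells shown in the question.
html = '<td style="padding:.75pt"><p class="MsoNormal"><span>07:00</span></p></td>'
td = BeautifulSoup(html, "html.parser").td

# has_attr() only inspects the tag's own attributes (style, class, href, ...).
has_style_attr = td.has_attr("style")
has_p_attr = td.has_attr("p")

# find() searches the tag's descendants for a child tag with that name.
p_tag = td.find("p")
a_tag = p_tag.find("a") if p_tag is not None else None
```

Here has_style_attr is True and has_p_attr is False, while p_tag is the <p> element and a_tag is None because this cell has no link.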
Upvotes: 0
Reputation: 2012
You're very confused: td is a tag and p is a tag too. p is not an attribute of td. style is an attribute of td. class is an attribute of p.
You can use dot notation to chain find like this:
if ele.p:
    if not ele.p.a:
        ele.p.string = tableWidget.item(r, c).text()
else:
    ele.string = tableWidget.item(r, c).text()
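Note that assigning to .string on a tag replaces all of that tag's children, so a nested <span> or <a> would still be dropped. One possible alternative, sketched below, is to replace only the text node inside the cell and leave the surrounding tags alone. The helper name set_cell_text is hypothetical, the sample HTML is shortened from the question, and plain strings stand in for tableWidget.item(r, c).text().

```python
from bs4 import BeautifulSoup


def set_cell_text(td, new_text):
    """Replace the visible text of a cell while keeping its nested tags."""
    # Collect the text nodes that actually hold visible text.
    text_nodes = [node for node in td.find_all(string=True) if node.strip()]
    if not text_nodes:
        # No text anywhere in the cell: fall back to setting it on the td.
        # (This would discard any empty wrapper tags inside the cell.)
        td.string = new_text
        return
    # Swap the first text node in place; its parent tags are untouched.
    text_nodes[0].replace_with(new_text)
    # Drop any further text fragments so the old text does not linger.
    for extra in text_nodes[1:]:
        extra.extract()


html = (
    '<table><tr>'
    '<td style="padding:.75pt"><p class="MsoNormal"><span>'
    '<a href="https://example.com" target="_blank">test</a><o:p></o:p>'
    '</span></p></td>'
    '<td style="padding:.75pt"><p class="MsoNormal"><span>07:00<o:p></o:p>'
    '</span></p></td>'
    '<td style="padding:.75pt"></td>'
    '</tr></table>'
)
soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")

# Placeholder values standing in for the QTableWidget contents.
for td, value in zip(cells, ["new link text", "09:30", "n/a"]):
    set_cell_text(td, value)
```

With this approach the link cell keeps its <a href=...> wrapper, the time cell keeps its <p> and <span>, and the empty cell simply receives the text.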
Upvotes: 1