osk386

Reputation: 514

Change only the parent element's text in BeautifulSoup

I have this piece of code:

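from bs4 import BeautifulSoup

txt = """<p>Hi <span>Mark</span>, how are you?, Don't forget meeting on <strong>sunday</strong>, ok?</p>"""
soup = BeautifulSoup(txt)
for ft in soup.find_all('p'):
    # Uppercases the whole serialized tag, including tag names and nested text
    print(str(ft).upper())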

When I run it, I get this:

<P>HI <SPAN>MARK</SPAN>, HOW ARE YOU?, DON'T FORGET MEETING ON <STRONG>SUNDAY</STRONG>, OK?</P>

But I want to get this:

<p>HI <span>Mark</span>, HOW ARE YOU?, DON'T FORGET MEETING ON <strong>sunday</strong>, OK?</p>

I only want to change the text directly inside the p tag and keep the content of the nested tags as it is. I also want the tag names to stay lowercase.

Thanks

Upvotes: 0

Views: 431

Answers (1)

Birei

Reputation: 36282

You can assign the modified text to the tag's string attribute, p.string. Loop over the contents of the <p> tag and use the re module to check whether each child starts with a tag (< ... >). Leave those children unchanged and uppercase the rest. Something like:

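from bs4 import BeautifulSoup
import re

txt = """<p>Hi <span>Mark</span>, how are you?, Don't forget meeting on <strong>sunday</strong>, ok?</p>"""
# No parser is named, so bs4 picks the best one installed
soup = BeautifulSoup(txt)
for p in soup.find_all('p'):
    # Uppercase the text children; keep children that start with a tag as-is
    p.string = ''.join(
        str(t) if re.match(r'<[^>]+>', str(t)) else str(t).upper()
        for t in p.contents)

# formatter=None prints the string without escaping < and >
print(soup.prettify(formatter=None))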

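Because p.string is now a single plain string, the default formatter would escape the < and > of the inner tags, so I pass formatter=None to avoid encoding the HTML special characters. It yields: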

<html>
 <body>
  <p>
   HI <span>Mark</span>, HOW ARE YOU?, DON'T FORGET MEETING ON <strong>sunday</strong>, OK?
  </p>
 </body>
</html>
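
A possible alternative, sketched here under some assumptions, is to change the text nodes in place instead of rebuilding p.string. That keeps <span> and <strong> as real tags in the tree, so no regular expression or formatter=None is needed. The helper name upper_direct_text is made up for illustration, html.parser is chosen only so that no <html>/<body> wrapper is added, and the string=True argument assumes a reasonably recent bs4 (older releases spell it text=True).

```python
from bs4 import BeautifulSoup


def upper_direct_text(html):
    """Uppercase only the text nodes that are direct children of each <p>.

    Hypothetical helper, not part of BeautifulSoup.
    """
    # Assumption: html.parser, so the fragment is not wrapped in <html>/<body>.
    soup = BeautifulSoup(html, "html.parser")
    for p in soup.find_all("p"):
        # recursive=False limits the search to direct children of <p>;
        # string=True keeps only text nodes, so <span> and <strong> are skipped.
        for node in p.find_all(string=True, recursive=False):
            node.replace_with(node.upper())
    return str(soup)


txt = """<p>Hi <span>Mark</span>, how are you?, Don't forget meeting on <strong>sunday</strong>, ok?</p>"""
print(upper_direct_text(txt))
```

Since the nested tags are never converted to text, the soup can still be searched or modified afterwards, for example with soup.find('span').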

Upvotes: 1
