rmcape

Reputation: 21

Python: BeautifulSoup modifying text

I need to post-process a large number of XHTML files that I didn't generate, so I can't fix the code that produced them. I can't run regular expressions over the whole file, only over highly selective pieces, because the links and ids contain digits that must not change.

I've simplified this example a lot because the original files have RTL text. I'm only interested in modifying the digits that are within the visible text, not the markup. There seem to be 3 different cases.

Snippets from bk1.xhtml:

Case 1: cross reference with links - has digits in xt, inside an embedded bookref link

<aside epub:type='footnote' id="FN96"><p class="x"><a class="notebackref" href="#bk1_21_9"><span class="notemark">*</span>text</a>
<span class="xt"> <a class='bookref' href='bk50.xhtml#bk50_118_26'>some text with these digits: 26:118</a></span></p></aside>

Case 2: cross reference without links - has digits in xt with no embedded bookref text

<aside epub:type='footnote' id="FN100"><p class="x"><a class="notebackref" href="#bk1_21_42"><span class="notemark">*</span>text</a>
<span class="xt">some text with these digits: 26:118</span></p></aside>

Case 3: footnote without links, but has digits within the ft text

<aside epub:type='footnote' id="FN107"><p class="f"><a class="notebackref" href="#bk1_22_44"><span class="notemark">§</span>text</a>
<span class="ft">some text with these digits: 22</span></p></aside>

I'm trying to figure out how to identify the text strings within the user-visible portion so that I can modify just the relevant digits:

Case 1: I need to capture just <a class='bookref' href='bk50.xhtml#bk50_118_26'>some text 26:118</a>, assign the "some text 26:118" substring to a variable, and run regular expressions against that variable; then put the modified substring back into the file where it was.

Case 2: I need to capture just <span class="xt">some text 26:118</span>, run regular expressions against the "some text 26:118" substring to change just the digits, and then put the modified substring back into the file where it was.

Case 3: I need to capture just <span class="ft">some text 22</span>, run regular expressions against the "some text 22" substring to change just the digits, and then put the modified substring back into the file where it was.

I've got thousands of these to do across a lot of files. I know how to iterate through the files.

After I've processed all of the patterns within one file, I need to write out the changed tree.

I just need to post process it to fix the texts.

I've been googling, reading, and watching a lot of tutorials and I'm getting confused.

Thanks for any help with this.

Upvotes: 1

Views: 84

Answers (1)

Vinícius Figueiredo

Reputation: 6518

It seems you want the .replaceWith() method (named .replace_with() in current Beautiful Soup versions). First you have to find all the occurrences of the text you want to match:

from bs4 import BeautifulSoup

cases = '''
<aside epub:type='footnote' id="FN96"><p class="x"><a class="notebackref" href="#bk1_21_9"><span class="notemark">*</span>text</a>
<span class="xt"> <a class='bookref' href='bk50.xhtml#bk50_118_26'>some text with these digits: 26:118</a></span></p></aside>

<aside epub:type='footnote' id="FN100"><p class="x"><a class="notebackref" href="#bk1_21_42"><span class="notemark">*</span>text</a>
<span class="xt">some text with these digits: 26:118</span></p></aside>

<aside epub:type='footnote' id="FN107"><p class="f"><a class="notebackref" href="#bk1_22_44"><span class="notemark">§</span>text</a>
<span class="ft">some text with these digits: 22</span></p></aside>
'''

soup = BeautifulSoup(cases, 'lxml')

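# find_all() is the current name for findAll() in Beautiful Soup 4
case1 = soup.find_all('a', {'class': 'bookref'})
case2 = soup.find_all('span', {'class': 'xt'})
case3 = soup.find_all('span', {'class': 'ft'})

for match in case1 + case2 + case3:
    # .string is None when the tag has more than one child (e.g. the Case 1 span)
    text = match.string
    print(text)
    if text:
        newText = text.replace('some text', 'modified!') # this line is your regex things
        # replace_with() is the current name for replaceWith()
        text.replace_with(newText)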
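
The line marked "your regex things" is only a placeholder. As a rough sketch of what could go there: the question doesn't say how the digits need to change, so the rule below (swapping the two numbers in an "A:B" pair) and the name fix_digits are purely hypothetical stand-ins for whatever transformation you actually need.

```python
import re

# Hypothetical rule (an assumption, not from the question):
# rewrite every "A:B" digit pair as "B:A".
PAIR = re.compile(r'(\d+):(\d+)')

def fix_digits(text):
    """Return text with each 'A:B' pair of numbers swapped to 'B:A'."""
    return PAIR.sub(r'\2:\1', text)

# Text without an "A:B" pair (like Case 3's "22") is returned unchanged.
example = fix_digits('some text with these digits: 26:118')
# example == 'some text with these digits: 118:26'
```

In the loop above, newText = fix_digits(text) would then take the place of the text.replace(...) call. Because the function only ever sees the text node, the digits in href and id attributes are never touched.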

The print(text) in the loop prints:

some text with these digits: 26:118
None
some text with these digits: 26:118
some text with these digits: 22

If we run the loop again, it now prints:

modified! with these digits: 26:118
None
modified! with these digits: 26:118
modified! with these digits: 22
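
The None in the output comes from the Case 1 span, which has two children (a whitespace string and the link), so .string has nothing single to return. An alternative that avoids special-casing is to walk every text node that sits somewhere inside a span.xt or span.ft. The sketch below does that and also writes the changed tree back out, which the question asks about. The names swap_pair, rewrite_visible_text and process_file are made up for illustration, the swap rule is a placeholder, and 'html.parser' is used only to keep the example dependency-free; for real XHTML you may find another parser preserves the markup more faithfully, so check the output on one file first.

```python
import re
from pathlib import Path
from bs4 import BeautifulSoup

def swap_pair(text):
    # Placeholder rule (an assumption): turn "A:B" into "B:A".
    return re.sub(r'(\d+):(\d+)', r'\2:\1', text)

def rewrite_visible_text(markup, fix=swap_pair):
    """Apply fix() to each text node inside span.xt / span.ft."""
    soup = BeautifulSoup(markup, 'html.parser')
    # find_all(string=True) returns text nodes only, never attributes,
    # so href and id values are left alone.
    for node in soup.find_all(string=True):
        # Skip text that is not inside one of the target spans.
        if node.find_parent('span', class_=['xt', 'ft']) is None:
            continue
        new_text = fix(str(node))
        if new_text != node:
            node.replace_with(new_text)
    return str(soup)

def process_file(path):
    """Read one file, rewrite its visible text, and save it in place."""
    path = Path(path)
    markup = path.read_text(encoding='utf-8')
    path.write_text(rewrite_visible_text(markup), encoding='utf-8')
```

Each text node is visited once, so nested spans don't get the rule applied twice. Calling process_file() on each path from your existing file loop would cover the whole batch; running it on a copy of one file first is a cheap way to confirm the serialized output still looks like your XHTML.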

Upvotes: 1
