Reputation: 381
How do you use BeautifulSoup's .replace_with() without angle brackets being converted to entities like &gt; after a str() conversion and find-and-replace?
Python code
from bs4 import BeautifulSoup
with open("../dicttest.txt", "r", encoding="utf-8") as f:
    full_text = f.read()
parse_1 = BeautifulSoup(full_text, "html.parser")
for line in parse_1.find_all("grace", "AllExamples"):
    match = str(line).replace(";</i> <b>", ";</i><br> <b>")
    line.replace_with(match)
print(parse_1)
dicttest.txt
all
<link rel="stylesheet" type="text/css" href="stylesheet.css"><font size="-2">Duden-Oxford Deutsch-Englisch</font><br><grace class="SglMngArticle"><span class="WordHead"><b>all</b></span> <grace class="IPA">/al/</grace> <i>Indefinitpron.</i> <i>u. unbest. Zahlw.</i> </grace><br><br><grace class="NumArticle"><span class="Number">1.</span> <i>attr.</i> (<i>ganz, gesamt...</i>) all; </grace><grace class="AllExamples"><grace class="BoldExamples"><b>in aller Deutlichkeit</b></grace> in all clarity;<br> <grace class="BoldExamples"><b>alle Freude, die sie empfunden hat</b></grace> all the joy she felt;<br> <grace class="BoldExamples"><b>alles Geld, das ich noch habe</b></grace> all the money I have left;<br> <grace class="BoldExamples"><b>aller Eifer nützte ihm nichts</b></grace> all his zeal was to no avail;<br> <grace class="BoldExamples"><b>ich kann diese Leute alle nicht leiden</b></grace> I can't stand any of these people;<br> <grace class="BoldExamples"><b>ich will euch alle nicht mehr sehen</b></grace> I don't want to see any of you again;<br> <grace class="BoldExamples"><b>die Ärzte verdienen alle sehr viel</b></grace> doctors all earn a great deal;<br> <grace class="BoldExamples"><b>alles Geld spendete sie dem Roten Kreuz</b></grace> she donated all her money to the Red Cross;<br> <grace class="BoldExamples"><b>alles Leid der Welt</b></grace> all the suffering in the world;<br> <grace class="BoldExamples"><b>all unser/mein </b><i>usw.</i> <b>...</b> all our/my <i>etc. 
...;</i> <b>alles andere/Weitere/Übrige</b></grace> everything else;<br> <grace class="BoldExamples"><b>alles Übrige hat sich nicht geändert</b></grace> nothing else has changed;<br> <grace class="BoldExamples"><b>alles Schöne/Neue/Fremde</b></grace> everything <i>or</i> all that is beautiful/new/strange;<br> <grace class="BoldExamples"><b>alles Gute!</b></grace> all the best!;<br> <grace class="BoldExamples"><b>alle Fenster schließen</b></grace> close all the windows;<br> <grace class="BoldExamples"><b>sie gaben alle Waffen ab</b></grace> they handed in all their weapons;<br> <grace class="BoldExamples"><b>wir/ihr/sie alle</b></grace> all of us/you/them; we/you/they all;<br> <grace class="BoldExamples"><b>das sagen sie alle</b></grace> (<i>ugs.</i>) that's what they all say;<br> <grace class="BoldExamples"><b>alle Beteiligten/Anwesenden</b></grace> all those involved/present;<br> <grace class="BoldExamples"><b>trotz aller Vorbehalte werde ich ...</b></grace> in spite of all my reservations I shall ...;<br> <grace class="BoldExamples"><b>alle beide/alle zehn</b></grace> both of them/all ten of them;<br> <grace class="BoldExamples"><b>alle Männer/Frauen/Kinder</b></grace> all men/women/children;<br> <grace class="BoldExamples"><b>alle Mädchen über zwölf Jahre</b></grace> all girls over twelve;<br> <grace class="BoldExamples"><b>alle Mädchen in der Schule</b></grace> all the girls in the school;<br> <grace class="BoldExamples"><b>alle Bewohner der Stadt</b></grace> all the inhabitants of the town;<br> <grace class="BoldExamples"><b>ohne allen Anlass</b></grace> for no reason [at all]; without any reason [at all];<br> <grace class="BoldExamples"><b>gegen alle Erwartungen</b></grace> contrary to all expectations;<br> <grace class="BoldExamples"><b>alle Jahre wieder</b></grace> every year;<br> <grace class="BoldExamples"><b>alle fünf Minuten/Meter</b></grace> every five minutes/metres;<br> <grace class="BoldExamples"><b>Bücher aller Art</b></grace> books of all kinds; 
all kinds of books;<br> <grace class="BoldExamples"><b>in aller Eile</b></grace> with all haste;<br> <grace class="BoldExamples"><b>in aller Ruhe</b></grace> in peace and quiet;<br> <grace class="BoldExamples"><b>trotz aller Versuche/Anstrengungen</b></grace> despite all [his/her/their/<i>etc.</i>] attempts/efforts. </grace><br><br><grace class="NumArticle"><span class="Number">2.</span> <i>allein stehend</i> </grace><br><br><grace class="LetterArticle"><span class="Letter">a) </span>(<i>gesamt..., sämtlich</i>) everything; </grace><grace class="AllExamples"><grace class="BoldExamples"><b>alles geht vorüber</b></grace> everything passes [in time];<br> <grace class="BoldExamples"><b>alles für die Braut/den Bastler</b></grace> everything for the bride/handicraft enthusiast;<br> <grace class="BoldExamples"><b>das alles</b></grace> all that;<br> <grace class="BoldExamples"><b>ich weiß nicht, was das alles soll</b></grace> I don't know what all that is supposed to mean;<br> <grace class="BoldExamples"><b>das ist alles Unsinn</b></grace> that is all nonsense;<br> <grace class="BoldExamples"><b>von allem etwas verstehen/wissen</b></grace> understand/know a bit about everything;<br> <grace class="BoldExamples"><b>wer alles war </b><i>od.</i> <b>wer war alles dort</b></grace> who was there?;<br> <grace class="BoldExamples"><b>wen alles habt ihr getroffen?</b></grace> who did you meet?;<br> <grace class="BoldExamples"><b>das sind alles Gauner</b></grace> they're all scoundrels;<br> <grace class="BoldExamples"><b>was gab es dort alles zu sehen?</b></grace> what was there to see?;<br> <grace class="BoldExamples"><b>was es nicht alles gibt!</b></grace> well, would you believe it!; well, I never!;<br> <grace class="BoldExamples"><b>all[es] und jedes</b></grace> everything; (<i>wahllos</i>) anything and everything;<br> <grace class="BoldExamples"><b>trotz allem</b></grace> in spite of <i>or</i> despite everything;<br> <grace class="BoldExamples"><b>sie liebt ihren Hund über 
alles</b></grace> she loves her dog more than anything else;<br> <grace class="BoldExamples"><b>zu allem fähig sein</b></grace> (<i>fig.</i>) be capable of anything;<br> <grace class="BoldExamples"><b>alles schon mal da gewesen</b></grace> (<i>ugs.</i>) it's all happened before;<br> <grace class="BoldExamples"><b>das kenne ich alles schon</b></grace> I've heard it all before;<br> <grace class="BoldExamples"><b>alles in allem</b></grace> all in all;<br> <grace class="BoldExamples"><b>vor allem</b></grace> above all;<br> <grace class="BoldExamples"><b>alles klar </b><i>od.</i> <b>in Ordnung</b></grace> (<i>ugs.</i>) everything's fine <i>or</i> (<i>coll.</i>) OK;<br> <grace class="BoldExamples"><b>alles klar?</b></grace> everything all right <i>or</i> (<i>coll.</i>) OK?;<br> <grace class="BoldExamples"><b>dann treffen wir uns um 5<sup>00</sup> Uhr, alles klar?</b></grace> we'll meet at 5 o'clock then, all right <i>or</i> (<i>coll.</i>) OK?;<br> <grace class="BoldExamples"><b>das ist alles</b></grace> that's all <i>or</i> (<i>coll.</i>) it;<br> <grace class="BoldExamples"><b>ist das alles?</b></grace> is that all <i>or</i> (<i>coll.</i>) it?;<br> <grace class="BoldExamples"><b>nach allem, was man hört/weiß</b></grace> to judge from everything <i>or</i> all one hears/knows; </grace><br><grace class="LetterArticle"><span class="Letter">b) </span>(<i>jeder einzelne</i>) everyone; </grace><grace class="AllExamples"><grace class="BoldExamples"><b>alle miteinander</b></grace> all together;<br> <grace class="BoldExamples"><b>ihr seid/wir sind/sie sind ..., alle miteinander</b></grace> you/we/they are ..., all of you/us/them;<br> <grace class="BoldExamples"><b>alle auf einmal</b></grace> all at once;<br> <grace class="BoldExamples"><b>sprecht nicht alle auf einmal!</b></grace> don't all speak at once;<br> <grace class="BoldExamples"><b>am besten, wir gehen alle auf einmal zum Chef</b></grace> the best thing would be for us all to go and see the boss together;<br> <grace 
class="BoldExamples"><b>alle, die ...</b></grace> all those who ...;<br> <grace class="BoldExamples"><b>der Kampf aller gegen alle</b></grace> unfettered competition;<br> <grace class="BoldExamples"><b>in allem einverstanden sein</b></grace> agree <i>or</i> be agreed on everything;<br> <grace class="BoldExamples"><b>von allem etwas nehmen</b></grace> take a bit of everything;<br> <grace class="BoldExamples"><b>er ist bei allem, was er tut, sehr genau</b></grace> he is very precise in everything he does;<br> <grace class="BoldExamples"><b>sie ist in allem sehr empfindlich</b></grace> she is very sensitive about everything; </grace><br><grace class="LetterArticle"><span class="Letter">c) </span>(<i>Neutr. Sg.: alle Beteiligten</i>) </grace><grace class="AllExamples"><grace class="BoldExamples"><b>alles mal herhören!</b></grace> (<i>ugs.</i>) listen everybody!; (<i>stärker befehlend</i>) everybody listen!;<br> <grace class="BoldExamples"><b>alles war nach Hause gegangen</b></grace> (<i>ugs.</i>) everyone <i>or</i> everybody had gone home;<br> <grace class="BoldExamples"><b>alles aussteigen!</b></grace> (<i>ugs.</i>) everyone <i>or</i> all out!; (<i>vom Schaffner gesagt</i>) all change!</grace><br>
</>
a, A
<link rel="stylesheet" type="text/css" href="stylesheet.css"><font size="-2">Duden-Oxford Deutsch-Englisch</font><br><grace class="SglMngArticle"><span class="WordHead"><b>a, A</b></span> <grace class="IPA">/a:/</grace> <i>das;</i> <b>a/A, a/A</b> </grace><br><br><grace class="LetterArticle"><span class="Letter">a) </span>(<i>Buchstabe</i>) a/A; </grace><grace class="AllExamples"><b>kleines a</b> small a;<br> <b>großes A</b> capital A;<br> <b>das A und O</b> (<i>fig.</i>) the essential thing/things (<i>Gen.</i> for);<br> <b>von A bis Z</b> (<i>fig. ugs.</i>) from beginning to end;<br> <b>wer A sagt, muss auch B sagen</b> (<i>fig.</i>) if one starts a thing, one must go through with it; </grace><br><grace class="LetterArticle"><span class="Letter">b) </span>(<i>Musik</i>) [key of] A</grace><br>
</>
The whole story:
I'm making an HTML-based dictionary in Python using BeautifulSoup and regular expressions. The structure of the dictionary is mainly like this:
Headword | IPA
Article 1
...Article A
......All Examples (say, a German example with an English explanation)
......<b>German example</b>
......English explanation;
......<b>German example</b>
......English <i>explanation;</i>
......and so forth...
...Article B
......All Examples
......and so forth...
In order to style all of them with CSS, I have to assign CSS classes to every element (articles, examples...) in it. I used to do all of this in a plain text editor with regex find-and-replace. Everything works fine except that I want to process the text chunk by chunk; that is, I don't want a regex to affect more than the part I'm working with. Take the AllExamples element: I first give the whole block the class AllExamples, then give the German example and the English explanation different classes and add <br>s after the semicolons at the end of the English explanations. That's not easy, because:
This can't be done in a plain text editor with a single regex find-and-replace. In EditPad Pro, I can match the whole AllExamples class with one regex, then use a second regex to replace ; with ;<br> within the matched selection. That's fine if there are only a few instances to process, but a whole dictionary needs one-click batch processing.
The reason I have to match an area first is that there are many equivalent patterns outside the area which I don't want to touch.
There are exceptions in the structure. Notice the second English explanation, which ends with an i tag: that's where my regex that adds a <br> after each ; fails. So in this case, I have to replace ;</i> <b> with ;</i><br> <b>. Again, the whole AllExamples class has to be matched first, because of those equivalent patterns outside the area.
So BeautifulSoup is the solution: I can easily match the area with it and apply a simple .replace() to it. The problem is that BeautifulSoup treats tags and strings as totally different things. However, in my case the tags </i> and <b> need to be matched together with a ; which is a string.
So I'm going to mix tags and strings together and then do a find-and-replace as in a text editor. (I know some of you may write a more sophisticated function in Python to do this, but that seems hard for me.) Then I use the .replace_with() function to give the result back to BeautifulSoup, as in the related topic linked in this post. When I do this, however, all the angle brackets change to entities like &gt; in the printed result. What should I do to solve this issue?
Related topic here:
Python - Find text using beautifulSoup then replace in original soup variable
Upvotes: 1
Views: 1048
Reputation: 1121834
Your mistake is treating the HTML tags as text here. You serialised the BeautifulSoup object tree to an HTML string, manipulated that string, then handed it back to BeautifulSoup as a new text element. Text elements (NavigableString) are not tags, and anything HTML-like in them will be escaped. You'd have to parse the text back into an HTML structure.
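The escaping described above can be reproduced with a minimal sketch. The one-line HTML snippet and the names `soup` and `escaped_output` are made up for illustration and are not from the question's dictionary file:

```python
from bs4 import BeautifulSoup

# Tiny stand-in document (an assumption for illustration only).
soup = BeautifulSoup("<p><i>etc.;</i> <b>alles</b></p>", "html.parser")
p = soup.p

# str(p) serialises the tag; passing the edited str to replace_with()
# inserts it as a text node, so its markup is escaped on output.
p.replace_with(str(p).replace(";</i> <b>", ";</i><br> <b>"))

escaped_output = str(soup)
print(escaped_output)
```

The printed result contains `&lt;` and `&gt;` entities rather than tags, and the tree holds no real `br` element.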
The 'proper' way to handle this is to insert a new tag in the right places. Your text replacement shows the rules:
- Inside a <grace class="AllExamples"> tag, find any <i> element whose text ends in ; and which is followed by a <b> tag.
- Insert a <br/> right after.
I'd just search for <i> tags inside <grace class="AllExamples"> tags, then filter. Once you've found a match, use Tag.insert_after() to add a new <br/> tag:
for emphasis in parse_1.select('grace.AllExamples i'):
    # must have text that ends in ;
    if emphasis.string is None or not emphasis.string.endswith(';'):
        continue
    # must have a bold tag next
    next_tag = emphasis.find_next_sibling()
    if not next_tag or next_tag.name != 'b':
        continue
    # match confirmed, insert a break tag
    emphasis.insert_after(parse_1.new_tag('br'))
You could fold the text check and the next-sibling check into a generator function too, or into a function that's used to check each element in a .find_all() operation, but the above is probably the right level of encapsulation for this problem if there are related replacements you need to make.
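As a rough sketch of the function-based variant mentioned above, the two checks can live in a predicate that is passed to `.find_all()`. The helper name `needs_break` and the shortened sample HTML are assumptions for illustration, not part of the original code:

```python
from bs4 import BeautifulSoup

def needs_break(tag):
    """Hypothetical helper: True for an <i> whose text ends in ';'
    and whose next sibling tag is a <b>."""
    if tag.name != "i" or tag.string is None or not tag.string.endswith(";"):
        return False
    next_tag = tag.find_next_sibling()
    return next_tag is not None and next_tag.name == "b"

# Shortened sample modelled on the dictionary entry (illustrative only).
html = ('<grace class="AllExamples"><b>all unser</b> all our <i>etc. ...;</i> '
        '<b>alles andere</b> everything else;<br> <i>ugs.</i> <b>x</b></grace>')
soup = BeautifulSoup(html, "html.parser")

# Restrict the search to the AllExamples areas, then filter with the predicate.
for area in soup.find_all("grace", "AllExamples"):
    for emphasis in area.find_all(needs_break):
        emphasis.insert_after(soup.new_tag("br"))

result = str(soup)
print(result)
```

Only the `<i>` that ends in a semicolon gets a break after it; the `<i>ugs.</i>` element is left alone because its text does not end in `;`.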
In short, don't think of your HTML as a large body of text, but as a directed tree with nodes, where nodes are either tags or text elements. Use BeautifulSoup to navigate that tree and then, when in the right locations, manipulate the tree by adding or removing nodes as needed.
Upvotes: 2
Reputation: 19154
Convert the edited string back into tag elements by parsing it into a new soup, then pass that to replace_with():
match = str(line).replace(";</i> <b>", ";</i><br> <b>")
newElement = BeautifulSoup(match, "html.parser")
line.replace_with(newElement)
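A self-contained sketch of this re-parsing approach, with a shortened inline HTML string standing in for dicttest.txt (the sample markup and the name `reparsed_output` are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the dictionary file (illustrative only). The first
# <grace> has the same ";</i> <b>" pattern but lies outside AllExamples.
html = ('<grace class="NumArticle"><i>attr.;</i> <b>all</b></grace>'
        '<grace class="AllExamples"><b>all unser</b> <i>etc. ...;</i> '
        '<b>alles andere</b></grace>')
parse_1 = BeautifulSoup(html, "html.parser")

for line in parse_1.find_all("grace", "AllExamples"):
    match = str(line).replace(";</i> <b>", ";</i><br> <b>")
    # Re-parse the edited string so the replacement is made of tags,
    # not a single text node that would be escaped on output.
    line.replace_with(BeautifulSoup(match, "html.parser"))

reparsed_output = str(parse_1)
print(reparsed_output)
```

The string replacement only touches the matched AllExamples area, so the identical pattern in the NumArticle block stays unchanged.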
Upvotes: 0