Reputation: 15
I have an HTML file like the one below:
<!DOCTYPE HTML>
<html>
<head>
<title>Sezione microbiologia</title>
<link rel="stylesheet" href="./style.css">
</head>
<body>
<div id="content">
<section id="main">
<!-- SOME CONTENT... -->
<h1>Prima diluizione</h1>
<p>Some content including "prima diluizione"...</p>
<h1>Seconda diluizione</h1>
<p>Some content including "seconda diluizione"...</p>
<h1>Terza diluizione</h1>
<p>Some content including "terza diluizione"...</p>
</section>
<section id="second">
<!-- SOME CONTENT... -->
</section>
<section id="third">
<!-- SOME CONTENT... -->
</section>
<section id="footer">
<!-- SOME CONTENT... -->
</section>
</div>
</body>
</html>
Problem description:
I am trying to modify the <h1> headings that contain the word diluizione, replacing this word and its prefix with "Diluizione seriale". I tried to do this using Python's replace(), but the problem is that text in the <p> paragraphs is changed as well, whereas I only want the text in the h1 tags to be modified. On top of that, I have not yet found a way to automatically remove the prefix, i.e. "Prima", "Seconda", "Terza", etc.
The code I tried
This is what I have so far:
with open('./home.html') as file:
    text = file.read()
if "diluizione" in text:
    text = text.replace("diluizione", "diluizione seriale")
But this outputs:
<div id="content">
<section id="main">
<!-- SOME CONTENT... -->
<h1>Prima diluizione seriale</h1>
<p>Some content including "prima diluizione seriale"...</p>
<h1>Seconda diluizione seriale</h1>
<p>Some content including "seconda diluizione seriale"...</p>
<h1>Terza diluizione seriale</h1>
<p>Some content including "terza diluizione seriale"...</p>
</section>
So as you can see, the text in the <p> tags is affected too, and in the headings the prefix is still there.
My desired output would be:
<div id="content">
<section id="main">
<!-- SOME CONTENT... -->
<h1>Diluizione seriale</h1>
<p>Some content including "prima diluizione"...</p>
<h1>Diluizione seriale</h1>
<p>Some content including "seconda diluizione"...</p>
<h1>Diluizione seriale</h1>
<p>Some content including "terza diluizione"...</p>
</section>
Any help or suggestion is much appreciated. Thanks in advance.
Upvotes: 1
Views: 84
Reputation: 2304
You could use a regular expression through Python's re module to achieve this. To restrict the match to text inside the h1 tags, you can combine a positive lookbehind with a positive lookahead.
Code:
import re
with open("path/to/home.html") as file:
    text = file.read()
# Raw string so that \w reaches the regex engine unchanged
text = re.sub(r"(?<=<h1>)\w+ \w+(?=</h1>)", "Diluizione seriale", text)
print(text)
Explanation:
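The regular expression (?<=<h1>)\w+ \w+(?=</h1>) matches exactly two words (runs of word characters) separated by a single space, but only when they are immediately preceded by <h1> and immediately followed by </h1>. The lookbehind and lookahead are not part of the match, so the tags themselves are left untouched.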
Output:
<!-- SOME CONTENT... -->
<h1>Diluizione seriale</h1>
<p>Some content including "prima diluizione"...</p>
<h1>Diluizione seriale</h1>
<p>Some content including "seconda diluizione"...</p>
<h1>Diluizione seriale</h1>
<p>Some content including "terza diluizione"...</p>
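Note that the pattern above rewrites any two-word heading, whether or not it mentions diluizione, and skips headings with more or fewer words. If that matters, a slightly stricter variant could look like the sketch below. The names H1_DILUIZIONE and rewrite_headings are made up for this example, and it assumes the h1 tags carry no attributes and contain no nested tags.

```python
import re

# Hypothetical pattern: an <h1> whose text contains the word "diluizione"
# (any case), with any prefix or suffix that does not contain another tag.
H1_DILUIZIONE = re.compile(
    r"(<h1>)[^<]*\bdiluizione\b[^<]*(</h1>)",
    re.IGNORECASE,
)


def rewrite_headings(html: str, replacement: str = "Diluizione seriale") -> str:
    """Replace the full text of every matching <h1> with `replacement`."""
    # A function is used as the replacement so that backslashes in
    # `replacement` are never interpreted as group references.
    return H1_DILUIZIONE.sub(
        lambda m: m.group(1) + replacement + m.group(2), html
    )


if __name__ == "__main__":
    sample = (
        "<h1>Prima diluizione</h1>\n"
        '<p>Some content including "prima diluizione"...</p>\n'
        "<h1>Altro titolo</h1>\n"
    )
    print(rewrite_headings(sample))
```

Headings that do not contain the word are left alone, and the <p> text is never touched because the match has to start at an <h1> tag.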
Upvotes: 2
Reputation: 2469
Have a look at html.parser. Instead of doing string manipulation on the raw text, parse the HTML into a structure and then traverse it from there.
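As a rough illustration of that idea, here is a minimal sketch built on the standard library's html.parser.HTMLParser. The class H1Rewriter and the helper rewrite_h1 are names invented for this example. The parser re-emits everything it sees and only swaps the text it finds inside h1 elements. It assumes the heading text is a single plain run (no nested tags or entities inside the h1) and it ignores processing instructions.

```python
from html.parser import HTMLParser


class H1Rewriter(HTMLParser):
    """Hypothetical helper: re-emits the document, replacing h1 text."""

    def __init__(self, needle: str, replacement: str) -> None:
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.needle = needle.lower()
        self.replacement = replacement
        self.out: list[str] = []
        self.in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
        # get_starttag_text() returns the tag exactly as it was written.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False
        # The parser lowercases tag names, so </H1> comes out as </h1>.
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Only text inside an <h1> that mentions the needle is replaced.
        if self.in_h1 and self.needle in data.lower():
            data = self.replacement
        self.out.append(data)

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def rewrite_h1(html: str, needle: str = "diluizione",
               replacement: str = "Diluizione seriale") -> str:
    parser = H1Rewriter(needle, replacement)
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.out)
```

Calling rewrite_h1(text) on the contents of home.html should return the document with only the matching headings changed. For anything more complex than this, a tree-based parser is probably more comfortable than rebuilding the output by hand.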
Upvotes: 1