Reputation: 966

Using lxml to tag parts of a text

I am working with XML using the python lxml library.

I have a paragraph of text like so,

<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer facilisis elit eget
condimentum efficitur. Donec eu dignissim lectus. Integer tortor
lacus, porttitor at ipsum quis, tempus dignissim dui. Curabitur cursus
quis arcu in pellentesque. Aenean volutpat, tortor a commodo interdum,
lorem est convallis dui, sodales imperdiet ligula ligula non felis.</p>

Suppose I would want to tag to a specific bit of text like "tortor lacus, porttitor at ipsum quis, tempus" that exists inside the paragraph above, with the tag . How would I go about doing this with lxml. Right now I'm using text replace, but I feel that isn't the right way to go about this.

i.e. the result I am looking for would be

<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer facilisis elit eget
condimentum efficitur. Donec eu dignissim lectus. Integer <foobar>tortor
lacus, porttitor at ipsum quis, tempus</foobar> dignissim dui. Curabitur cursus 
quis arcu in pellentesque. Aenean volutpat, tortor a commodo interdum,
lorem est convallis dui, sodales imperdiet ligula ligula non felis.</p>

Upvotes: 1

Answers (3)

Daniel Haley

Reputation: 52858

Replacing text with an actual element is tricky in lxml; especially if you have mixed content (mix of text and child elements).

The tricky part is knowing what to do with the remaining text and where to insert the element. Should the remaining text be part of the parent .text? Should it be part of the .tail of the preceding sibling? Should it be part of the new element's .tail?

What I've done in the past is to process all of the text() nodes and add placeholder strings to the text (whether that's .text or .tail). I then serialize the tree to a string and do a search and replace on the placeholders. After that I either parse the string as XML to build a new tree (for further processing, validation, analysis, etc.) or write it to a file.

Please see my related question/answer for additional info on .text/.tail in this context.

Here's an example based on my answer in the question above.

Notes:

I added gotcha elements to show how it handles mixed content.
I added a second search string (Aenean volutpat) to show replacing more than one string.
In this example, I'm only processing text() nodes that are children of p.

Python

import re
from lxml import etree

xml = """<doc>
<p>Lorem ipsum dolor <gotcha>sit amet</gotcha>, consectetur adipiscing elit. Integer facilisis elit eget
condimentum efficitur. Donec eu dignissim lectus. Integer tortor
lacus, porttitor at ipsum quis, tempus dignissim dui. Curabitur cursus
quis arcu <gotcha>in pellentesque</gotcha>. Aenean volutpat, tortor a commodo interdum,
lorem est convallis dui, sodales imperdiet ligula ligula non felis.</p>
</doc>
"""


def update_text(orig_text, phrase_list, elemname):
    new_text = orig_text
    for phrase in phrase_list:
        if phrase in new_text:
            # Add placeholders for the new start/end tags.
            new_text = new_text.replace(phrase, f"[elemstart:{elemname}]{phrase}[elemend:{elemname}]")
        else:
            new_text = new_text
    return new_text


root = etree.fromstring(xml)

foobar_phrases = {"tortor lacus, porttitor at ipsum quis, tempus", "Aenean volutpat"}

for text in root.xpath("//p/text()"):
    parent = text.getparent()
    updated_text = update_text(text.replace("\n", " "), foobar_phrases, "foobar")
    if text.is_text:
        parent.text = updated_text
    elif text.is_tail:
        parent.tail = updated_text

# Serialze the tree to a string so we can replace the placeholders with proper tags.
serialized_tree = etree.tostring(root, encoding="utf-8").decode()
serialized_tree = re.sub(r"\[elemstart:([^\]]+)\]", r"<\1>", serialized_tree)
serialized_tree = re.sub(r"\[elemend:([^\]]+)\]", r"</\1>", serialized_tree)

# Now we can either parse the string back into a tree (for additional processing, validation, etc.),
# print it, write it to a file, etc.
print(serialized_tree)

Printed Output (line breaks added for readability)

<doc>
<p>Lorem ipsum dolor <gotcha>sit amet</gotcha>, consectetur adipiscing elit. 
Integer facilisis elit eget condimentum efficitur. Donec eu dignissim lectus.
Integer <foobar>tortor lacus, porttitor at ipsum quis, tempus</foobar> dignissim dui.
Curabitur cursus quis arcu <gotcha>in pellentesque</gotcha>. <foobar>Aenean volutpat</foobar>, 
tortor a commodo interdum, lorem est convallis dui, sodales imperdiet ligula ligula non felis.</p>
</doc>

Upvotes: 1

Liju

Reputation: 2303

If <p> tag won't be nested inside another <p>, You may consider regex replace

import re

a="""
other lines here that may contain foo
<p>
this is a foo inside para
and this is new line in this foo para
</p>
excess lines here that also may contain foo in it.
"""

search="foo"
newtagname="bar"

b=re.sub("("+search+")(?=[^><]*?</p>)","<"+newtagname+">\\1</"+newtagname+">",a)

print(b)

This prints

other lines here that may contain foo
<p>
this is a <bar>foo</bar> inside para
and this is new line in this <bar>foo</bar> para
</p>
excess lines here that also may contain foo in it.

Upvotes: 0

andy meissner

Reputation: 1322

You can check like this if there are any children:

from lxml import etree

root = etree.parse("test.xml").getroot()
paragraphs = root.findall("p")

print(f"Found {len(paragraphs)} paragraphs")

for i in range(len(paragraphs)):
    if len(list(paragraphs[i])) > 0:
        print(f"Paragraph {i} has children")
    else:
        print(f"Paragraph {i} has no children")

First the code filters all paragraphs, and than looks if the paragraph has children.

Now if you have no children you can just replace the text like before and if you have children you can replace the whole child

Upvotes: 0

Using lxml to tag parts of a text

Answers (3)

Related Questions