Ralph Corrigan

Reputation: 23

spaCy displacy.render produces </br> tags which are not XML-compliant

I am using spaCy for NER on various texts. The dataframe is being parsed into XML for storage and analysis in eXist-DB, and I want to store the visualizer results as HTML to show alongside it. So far so good. However, the generated HTML contains </br> tags, which eXist-DB rejects as invalid:

<!DOCTYPE html>
<html lang="xx">
    <head>
        <title>displaCy</title>
    </head>

    <body style="font-size: 16px; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; padding: 4rem 2rem; direction: ltr">
<figure style="margin-bottom: 6rem">
<div class="entities" style="line-height: 2.5; direction: ltr"></br></br>Some text here </br> some more text  
<mark class="entity" style="background: #33ff82; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
    more text
    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">LOC</span>
</mark>
more text </div>
</figure>
</body>
</html>

I could write something to go through each HTML document and change the tags, but is there any way to make displacy.render produce XML-compliant HTML from the start?

Upvotes: 0

Views: 150

Answers (1)

Ralph Corrigan

Reputation: 23

I've applied a simple (if inelegant) fix by running

import re
html = re.sub(r"</br>", "<br/>", html)

against my html before saving it. This works, but I would still like to know if there is anything I can apply to stop the </br> tag being created in the first place.
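For anyone wanting to reuse the workaround, here is a minimal sketch of it wrapped in a function, with a well-formedness check using the standard library's XML parser. The function name fix_br_tags and the sample fragment are my own illustrations, not part of spaCy, and the check only confirms that the fragment parses as XML, not that eXist-DB will accept it.

```python
import re
import xml.etree.ElementTree as ET


def fix_br_tags(html: str) -> str:
    """Replace the malformed </br> closing tags with self-closing <br/> tags.

    Hypothetical helper, not a spaCy function. The pattern also tolerates
    optional whitespace before the closing bracket, e.g. "</br >".
    """
    return re.sub(r"</br\s*>", "<br/>", html)


# Illustrative fragment modelled on the displaCy output in the question.
fragment = '<div class="entities"></br></br>Some text here </br> some more text</div>'

fixed = fix_br_tags(fragment)

# ET.fromstring raises xml.etree.ElementTree.ParseError if the markup
# is not well-formed XML, so this doubles as a sanity check before saving.
root = ET.fromstring(fixed)
```

Running the parser on the untouched fragment raises a ParseError, which is a quick way to confirm the tags are the only thing breaking well-formedness in a given document.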

Upvotes: 0
