Reputation: 23
I am using spaCy for NER on various texts. The dataframe is being parsed into XML for storage and analysis in eXist-DB, and I want to take the displaCy visualizer output as HTML to store and show alongside it. So far so good. However, the generated HTML contains </br> tags, which are not well-formed XML and so are rejected by eXist-DB:
<!DOCTYPE html>
<html lang="xx">
<head>
<title>displaCy</title>
</head>
<body style="font-size: 16px; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; padding: 4rem 2rem; direction: ltr">
<figure style="margin-bottom: 6rem">
<div class="entities" style="line-height: 2.5; direction: ltr"></br></br>Some text here </br> some more text
<mark class="entity" style="background: #33ff82; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
more text
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">LOC</span>
</mark>
more text </div>
</figure>
</body>
</html>
I could write something to go through each HTML document and change the tags, but is there any way to make displacy.render produce XML-compliant HTML from the start?
Upvotes: 0
Views: 150
Reputation: 23
I've applied a simple (if inelegant) fix by running
re.sub(r"</br>", "<br/>", html)
against my HTML before saving it. This works, but I would still like to know whether there is anything I can apply to stop the </br> tag being created in the first place.
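For anyone wanting to reuse the workaround, here is a minimal sketch of the same post-processing step wrapped in a function. The name fix_br_tags and the sample string are hypothetical and only for illustration. The function assumes the only offending markup is </br> (or a bare <br>), and it uses the standard library's XML parser as a rough well-formedness check on a fragment.

```python
import re
import xml.etree.ElementTree as ET

# Matches the stray closing form </br> and a bare <br>, but leaves <br/> alone.
_BR = re.compile(r"</br\s*>|<br\s*>", re.IGNORECASE)


def fix_br_tags(html: str) -> str:
    """Rewrite </br> and <br> as the self-closing <br/> (hypothetical helper)."""
    return _BR.sub("<br/>", html)


# Fragment modelled on the displaCy output shown in the question.
sample = '<div class="entities"></br></br>Some text here </br> some more text</div>'
fixed = fix_br_tags(sample)

# Raises xml.etree.ElementTree.ParseError if the fragment is still not well-formed.
ET.fromstring(fixed)
```

In practice this would be applied to the string returned by displacy.render before writing it to disk. It is still a patch on the output and does not stop the tag being generated.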
Upvotes: 0