Reputation: 79
I'm using spaCy NER to recognize named entities in text, but my input is a whole HTML page. How can I remove all the HTML tags and give only the raw text to the NER model for prediction, and after prediction, how can I show the same text with its HTML tags again?
I tried xml.etree.ElementTree to remove the HTML tags. This gives me the text without tags, but after prediction, how can I display this text with all the HTML tags in their original format?
import xml.etree.ElementTree

def remove_html_tags(text):
    """Remove html tags from a string"""
    return ''.join(xml.etree.ElementTree.fromstring(text).itertext())
Is there any way to display this text with the original HTML tags again, or does spaCy have a feature to ignore HTML tags when predicting named entities?
Upvotes: 0
Views: 1684
Reputation: 279
I don't think spaCy has functionality like that, but you could keep the parsed ElementTree around and pass just the text to spaCy. Something like:
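import xml.etree.ElementTree
root = xml.etree.ElementTree.fromstring(text)
# nlp expects a single string, so join the text fragments first
doc = nlp(''.join(root.itertext()))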
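To get the tags back after prediction, one option is to map each entity's character offsets onto the original markup. Below is a minimal sketch of that idea, not a spaCy feature. It swaps ElementTree for the standard-library html.parser so that pages which are not well-formed XML still parse. The names EntityMarker and mark_entities, and the choice of a <mark data-label="..."> wrapper, are my own assumptions for illustration. It also assumes the spans do not overlap and that NER was run on exactly the text this function returns, otherwise the offsets will not line up.

```python
from html import escape
from html.parser import HTMLParser


class EntityMarker(HTMLParser):
    """Re-emit the HTML, wrapping character spans of its text in <mark>."""

    def __init__(self, spans=()):
        super().__init__(convert_charrefs=True)
        self.spans = sorted(spans)  # (start, end, label) offsets into the plain text
        self.pos = 0                # how much plain text has been seen so far
        self.text = []              # plain-text chunks, in document order
        self.out = []               # rebuilt HTML chunks

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # raw tag, attributes intact

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_data(self, data):
        self.text.append(data)
        start, end = self.pos, self.pos + len(data)
        cursor = start
        for s, e, label in self.spans:
            s, e = max(s, start), min(e, end)  # clip the span to this chunk
            if s >= e:
                continue                       # span does not touch this chunk
            self.out.append(escape(data[cursor - start:s - start], quote=False))
            self.out.append(
                f'<mark data-label="{escape(label)}">'
                f"{escape(data[s - start:e - start], quote=False)}</mark>"
            )
            cursor = e
        self.out.append(escape(data[cursor - start:], quote=False))
        self.pos = end


def mark_entities(html, spans=()):
    """Return (plain_text, html_with_marks) for the given entity spans."""
    parser = EntityMarker(spans)
    parser.feed(html)
    parser.close()
    return "".join(parser.text), "".join(parser.out)


html_page = "<p>Apple is in <b>Cupertino</b>.</p>"
plain_text, _ = mark_entities(html_page)  # first pass: text only

# With spaCy (nlp is assumed to be an already loaded pipeline) the spans would be:
#   spans = [(ent.start_char, ent.end_char, ent.label_) for ent in nlp(plain_text).ents]
# Hand-written spans stand in here so the sketch runs without a model.
spans = [(0, 5, "ORG"), (12, 21, "GPE")]

_, marked_html = mark_entities(html_page, spans)  # second pass: re-insert tags
print(marked_html)
```

An entity that crosses a tag boundary is split into one <mark> per text chunk, which keeps the original nesting valid. Text inside script or style elements is also treated as plain text here, so you may want to filter those out before running NER.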
Upvotes: 0
Reputation: 13
I know it is a lazy way, but you can save the original state of your HTML page somewhere before stripping the tags.
Upvotes: 0