Jay Singh

Reputation: 79

How can I strip HTML tags from text before running spaCy NER, and then display the same text with its original HTML tags?

I'm using spaCy NER to recognize named entities in text, but my input is a whole HTML page. How can I remove all the HTML tags so that only the raw text is passed to the NER model for prediction, and then, after prediction, show the same text with its HTML tags?

I tried xml.etree.ElementTree to remove the HTML tags. This gives me the text without tags, but I don't know how to display that text with all the HTML tags in their original positions after prediction.

import xml.etree.ElementTree

def remove_html_tags(text):
    """Remove html tags from a string"""
    return ''.join(xml.etree.ElementTree.fromstring(text).itertext())

Is there any way to display this text again with the original HTML tags, or does spaCy have a feature for ignoring HTML tags while predicting named entities?

Upvotes: 0

Views: 1684

Answers (2)

jojo

Reputation: 279

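I don't think spaCy has functionality like that, but you could keep the parsed ElementTree around and pass only its text to spaCy. Some version of:

import xml.etree.ElementTree

root = xml.etree.ElementTree.fromstring(text)

# itertext() returns an iterator of strings, so join it before calling nlp()
doc = nlp(''.join(root.itertext()))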
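
To get the tags back after prediction, one option is to remember where each character of the stripped text sat in the original page, then wrap the predicted entities at those positions. The following is only a rough sketch of that idea, not a spaCy feature: the function names and the `<mark data-label="...">` wrapper are made up for illustration, the regex tag matcher is a simplification that will not cope with comments, scripts, or `>` inside attribute values, and character references such as `&amp;` are passed to the model as-is. Entity spans are assumed to be non-overlapping `(start_char, end_char, label)` tuples.

```python
import re
from html import escape

# Simplified tag matcher (assumption: no comments, scripts, or '>' in attributes).
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags_with_offsets(markup):
    """Return (plain_text, offsets); offsets[i] is the index in markup of plain_text[i]."""
    chars, offsets = [], []
    pos = 0
    for match in TAG_RE.finditer(markup):
        for i in range(pos, match.start()):
            chars.append(markup[i])
            offsets.append(i)
        pos = match.end()
    for i in range(pos, len(markup)):
        chars.append(markup[i])
        offsets.append(i)
    return "".join(chars), offsets


def wrap_entities(markup, offsets, spans, tag="mark"):
    """Wrap each (start, end, label) span of the plain text inside the original markup.

    A span that crosses a tag boundary is split into several wrapped runs,
    so the inserted elements never straddle an existing tag.
    """
    pieces = []  # (markup_start, markup_end, label)
    for start, end, label in spans:
        run_start = prev = offsets[start]
        for i in range(start + 1, end):
            if offsets[i] != prev + 1:  # a tag sits between these two characters
                pieces.append((run_start, prev + 1, label))
                run_start = offsets[i]
            prev = offsets[i]
        pieces.append((run_start, prev + 1, label))

    out = markup
    # Insert from right to left so earlier indices stay valid.
    for m_start, m_end, label in sorted(pieces, reverse=True):
        opening = f'<{tag} data-label="{escape(label, quote=True)}">'
        out = out[:m_start] + opening + out[m_start:m_end] + f"</{tag}>" + out[m_end:]
    return out


# With spaCy, the spans would come from the predicted entities, e.g.:
#   text, offsets = strip_tags_with_offsets(page)
#   doc = nlp(text)
#   spans = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
#   highlighted = wrap_entities(page, offsets, spans)
```

Because the original page is never modified until the final step, everything outside the wrapped entities comes back byte for byte.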

Upvotes: 0

asdsfa

Reputation: 13

I know it's the lazy way, but you could save a copy of the original HTML page somewhere before you strip the tags.
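
If the original page is kept untouched, the stripped copy only has to be good enough for the model, so it can come from a more forgiving parser. xml.etree.ElementTree raises a ParseError on HTML that is not well-formed XML (an unclosed <br>, for example). A minimal sketch using the standard library's html.parser instead; the class and function names here are made up for illustration, and skipping script and style content is my own assumption about what should not reach the NER model:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True, so &amp; arrives as &
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(markup):
    """Return the visible text of markup without requiring well-formed XML."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)
```

Note that character references are decoded here, so offsets in the returned text no longer line up one-to-one with the saved page.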

Upvotes: 0
