Parag

Reputation: 95

Python: Masking Named Entities in Email Text

I have written a Python script that extracts named entities as follows:

import os

import pandas as pd
from nltk.tag import StanfordNERTagger

# set java path
java_path = r'C:/Program Files/Java/jre1.8.0_161/bin/java.exe'
os.environ['JAVAHOME'] = java_path

# initialize NER tagger
sn = StanfordNERTagger('C:/Users/Parag/Documents/stanford-ner-2018-10-16/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz',
                       path_to_jar='C:/Users/Parag/Documents/stanford-ner-2018-10-16/stanford-ner-2018-10-16/stanford-ner.jar')

# tag named entities
ner_tagged_sentences = [sn.tag(sent.split()) for sent in dataset_unseen['Text']]

# extract all named entities
named_entities = []

for sentence in ner_tagged_sentences:
    temp_entity_name = ''
    temp_named_entity = None

    for term, tag in sentence:
        if tag != 'O':
            # extend the current entity with this token
            temp_entity_name = ' '.join([temp_entity_name, term]).strip()
            temp_named_entity = (temp_entity_name, tag)
        else:
            if temp_named_entity:
                named_entities.append(temp_named_entity)
                temp_entity_name = ''
                temp_named_entity = None

    # flush an entity that runs to the end of the sentence
    if temp_named_entity:
        named_entities.append(temp_named_entity)

entity_frame = pd.DataFrame(named_entities, columns=['Entity Name', 'Entity Type'])
entity_frame.head()

Output:

Entity Name        Entity Type   Frequency
ABC Farms          ORGANIZATION  5
Freddy Hill Lane   ORGANIZATION  3
North Lane Thames  ORGANIZATION  2

Now I want to mask these named entities with a pattern like "######" so that sensitive customer information is hidden, as GDPR requires.

I have tried two approaches:

  1. Looping over the original data frame, checking each Text value for entities listed in the named-entity data frame, and replacing each match with '#####'.

  2. Defining a function that masks named entities in Text:

def Detectner(row):
    ner_tagged_sentences = [sn.tag(sent.split()) for sent in row['Text']]
    results = ner_tagged_sentences.sub('**********',row['Text'])
    return results

dataset_unseen['Text'] = dataset_unseen.apply(Detectner,axis=1)

But I get the following error:

AttributeError: ("'list' object has no attribute 'sub'", 'occurred at index 0')

How can I extract and mask named entities in Text? Any improvement to this code is highly appreciated!

Upvotes: 3

Views: 2111

Answers (1)

nmq

Reputation: 3154

When you build the tagged sentences, you create a list in this line:

ner_tagged_sentences = [sn.tag(sent.split()) for sent in row['Text']]

ner_tagged_sentences is a list, and a list has no sub method, which is why the AttributeError is raised.
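To make the cause concrete, here is a minimal sketch. The tagged list is a hand-written stand-in for what sn.tag returns (a list of (token, tag) pairs), and the text is hypothetical; nothing here calls the Stanford tagger. It shows that the list has no sub attribute, whereas re.sub works on the string itself. Note also that if row['Text'] is a single string, `for sent in row['Text']` iterates over its characters, not over sentences.

```python
import re

# Hypothetical tagger output for one sentence: a list of (token, tag) pairs.
tagged = [("Contact", "O"), ("ABC", "ORGANIZATION"), ("Farms", "ORGANIZATION")]

# A list has no `sub` method, so calling tagged.sub(...) raises AttributeError.
has_sub = hasattr(tagged, "sub")
print(type(tagged).__name__, has_sub)

# Substitution has to be applied to the text (a str), for example with re.sub.
text = "Contact ABC Farms"
masked = re.sub(re.escape("ABC Farms"), "######", text)
print(masked)

# Iterating over a string yields single characters, not sentences.
first_items = [item for item in text][:3]
print(first_items)
```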

There are two ways you could anonymize the documents:

  1. Replace every token whose tag is not O with a mask (token-level)
  2. Replace the named-entity text directly in the document (string-level)
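A minimal sketch of the token-level option, assuming the tagger output is a list of (token, tag) pairs as sn.tag returns. The function name mask_tokens and the sample data are made up for illustration. Each run of consecutive non-O tokens is collapsed into a single mask, and the tokens are re-joined with single spaces, so the original whitespace is not preserved.

```python
from itertools import groupby


def mask_tokens(tagged_tokens, mask="######"):
    """Replace each run of consecutive non-'O' tokens with one mask string."""
    out = []
    # groupby splits the sequence into alternating entity / non-entity runs.
    for is_entity, group in groupby(tagged_tokens, key=lambda pair: pair[1] != "O"):
        if is_entity:
            out.append(mask)  # one mask for the whole run of entity tokens
        else:
            out.extend(token for token, _ in group)
    return " ".join(out)


# Hypothetical tagger output, written by hand for the example.
tagged = [
    ("Please", "O"),
    ("invoice", "O"),
    ("ABC", "ORGANIZATION"),
    ("Farms", "ORGANIZATION"),
    ("today", "O"),
]
masked = mask_tokens(tagged)
print(masked)
```

With the real tagger this would be applied per row, for example to sn.tag(row['Text'].split()).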

It seems like you are trying to do option (2).
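A minimal sketch of the string-level option, assuming you already have the entity names (for example the 'Entity Name' column of entity_frame). The function name mask_entities and the sample sentence are made up for illustration. Names are escaped with re.escape and sorted longest first so that a longer name is matched before any shorter name it contains.

```python
import re


def mask_entities(text, entity_names, mask="######"):
    """Replace every literal occurrence of a known entity name with the mask."""
    # Deduplicate, drop empty names, and try longer names first.
    names = sorted({name for name in entity_names if name}, key=len, reverse=True)
    if not names:
        return text
    pattern = re.compile("|".join(re.escape(name) for name in names))
    return pattern.sub(mask, text)


# Entity names taken from the question's output; the sentence is hypothetical.
entities = ["ABC Farms", "Freddy Hill Lane", "North Lane Thames"]
text = "ABC Farms is located at Freddy Hill Lane."
masked = mask_entities(text, entities)
print(masked)
```

On the data frame this could be applied as dataset_unseen['Text'].apply(lambda t: mask_entities(t, entity_frame['Entity Name'])). Because the question tokenizes with split(), extracted names may carry trailing punctuation (such as "Lane."), so they may need cleaning before matching.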

Upvotes: 1
