Reputation: 95
I have created a Python script to extract named entities as follows:
import os
import pandas as pd
from nltk.tag import StanfordNERTagger

# set java path
java_path = r'C:/Program Files/Java/jre1.8.0_161/bin/java.exe'
os.environ['JAVAHOME'] = java_path
# initialize NER tagger
sn = StanfordNERTagger('C:/Users/Parag/Documents/stanford-ner-2018-10-16/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz',
                       path_to_jar='C:/Users/Parag/Documents/stanford-ner-2018-10-16/stanford-ner-2018-10-16/stanford-ner.jar')
# tag named entities
ner_tagged_sentences = [sn.tag(sent.split()) for sent in dataset_unseen['Text']]
dataset_unseen['Text'] = dataset_unseen.apply(Detectner,axis=1)
# extract all named entities
named_entities = []
for sentence in ner_tagged_sentences:
    temp_entity_name = ''
    temp_named_entity = None
    for term, tag in sentence:
        if tag != 'O':
            temp_entity_name = ' '.join([temp_entity_name, term]).strip()
            temp_named_entity = (temp_entity_name, tag)
        else:
            if temp_named_entity:
                named_entities.append(temp_named_entity)
                temp_entity_name = ''
                temp_named_entity = None
entity_frame = pd.DataFrame(named_entities, columns=['Entity Name', 'Entity Type'])
entity_frame.head()
** Output **
Entity Name         Entity Type     Frequency
ABC Farms           ORGANIZATION    5
Freddy Hill Lane    ORGANIZATION    3
North Lane Thames   ORGANIZATION    2
Now I want to mask these named entities with a pattern like "######" to comply with GDPR by hiding sensitive customer information.
I have attempted options like:
Looping over the original data frame: check Text for named entities present in the named entity data frame, then mask each named entity with '#####'.
Defining a function for masking named entities in Text:
def Detectner(row):
    ner_tagged_sentences = [sn.tag(sent.split()) for sent in row['Text']]
    results = ner_tagged_sentences.sub('**********',row['Text'])
    return results
dataset_unseen['Text'] = dataset_unseen.apply(Detectner,axis=1)
But I get the following error:
AttributeError: ("'list' object has no attribute 'sub'", 'occurred at index 0')
How can I extract and mask named entities in Text? Any improvement to this code is highly appreciated!
Upvotes: 3
Views: 2111
Reputation: 3154
When you make the tagged sentences, you are creating a list in the line
ner_tagged_sentences = [sn.tag(sent.split()) for sent in row['Text']]
The type of ner_tagged_sentences is list, which has no sub method.
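For comparison, sub belongs to the re module (and to compiled pattern objects) and works on strings, not lists. Here is a minimal sketch of string-level masking, assuming the entity names have already been collected into a plain list. The helper name mask_entities and the sample sentence are made up for illustration and are not part of the question's code.

```python
import re

def mask_entities(text, entity_names, mask="##########"):
    """Replace every occurrence of the given entity names in text with mask."""
    if not entity_names:
        return text
    # Try longer names first so a longer entity is not partially masked
    # by a shorter one that happens to be its prefix.
    ordered = sorted(set(entity_names), key=len, reverse=True)
    # re.escape guards against names containing regex metacharacters.
    pattern = re.compile("|".join(re.escape(name) for name in ordered))
    return pattern.sub(mask, text)

# Hypothetical sample data, loosely based on the entities in the question.
sample = "Please deliver to ABC Farms near Freddy Hill Lane."
masked = mask_entities(sample, ["ABC Farms", "Freddy Hill Lane"])
print(masked)
```

This only masks exact string matches, so an entity that appears with different spacing or casing in the text would be missed.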
You can try multiple things to achieve your goal to make documents anonymous:
O tags with something (token-level)
It seems like you are trying to do number (2).
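A token-level variant could look roughly like the sketch below. It assumes the tagger returns a list of (term, tag) tuples per sentence, as sn.tag does in the question, and that every tag other than 'O' marks an entity token. The function name mask_tagged_tokens and the hard-coded tagged list are hypothetical stand-ins so the example runs without the Stanford tagger.

```python
def mask_tagged_tokens(tagged_tokens, mask="######"):
    """Rebuild a sentence, replacing each token whose tag is not 'O' with mask."""
    return " ".join(mask if tag != "O" else term for term, tag in tagged_tokens)

# Hypothetical tagger output; in the question this would come from
# sn.tag(text.split()) for a single text string.
tagged = [
    ("Contact", "O"),
    ("ABC", "ORGANIZATION"),
    ("Farms", "ORGANIZATION"),
    ("today", "O"),
]
masked_sentence = mask_tagged_tokens(tagged)
print(masked_sentence)
```

With the data frame from the question, something along the lines of dataset_unseen['Text'].apply(lambda text: mask_tagged_tokens(sn.tag(text.split()))) should apply this per row. Note that the tagger is called on one string per row here, not on each character of the string as in the Detectner function.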
Upvotes: 1