Reputation: 151
Loading the Presidio analyzer engine takes some time. I want to filter out specific names, but a different set of names for every document. I don't understand how to perform this seemingly simple task with Presidio. I can pass an allow_list to the analyze function, but not a deny_list. So I need a recognizer that can take a different deny list each time, without reloading the analyzer engine, because the NLP models don't need to be reloaded for each document.
How can a recognizer be built in Presidio that takes a different deny list (or better a deny dictionary) each time it is called?
I found the function my_analyzer_first.registry.remove_recognizer, but it does not seem to work: the recognizer I add appears to change the analyzer permanently. Running the code below, "Bob" is still recognized as a patient even after the recognizer has been removed.
from presidio_analyzer import PatternRecognizer, EntityRecognizer, RecognizerResult, AnalyzerEngine, nlp_engine
from presidio_analyzer.nlp_engine import NlpArtifacts
import time

my_analyzer_first = AnalyzerEngine(
    supported_languages=["en"], default_score_threshold=0.5
)

def on_the_fly_without_loading(text: str, deny_list: list[str]):
    denylist_recognizer = PatternRecognizer(supported_entity="[PATIENT]", deny_list=deny_list)
    my_analyzer_first.registry.add_recognizer(denylist_recognizer)
    entities = my_analyzer_first.analyze(text=text, language="en")
    my_analyzer_first.registry.remove_recognizer(denylist_recognizer)
    return entities

def sanity_check(text: str):
    entities = my_analyzer_first.analyze(text=text, language="en")
    return entities

if __name__ == "__main__":
    N = 10
    text = 4 * "I went to the zoo and said hello to Bob the tiger."
    deny_list = ["Bob"]
    print(on_the_fly_without_loading(text, deny_list))
    print(sanity_check(text))
You'd expect 'Bob' to be flagged only in the first call, but it is flagged in both.
When comparing the registry my_analyzer_first.registry before and after calling my_analyzer_first.registry.remove_recognizer(denylist_recognizer), I found that it doesn't change!
Upvotes: 0
Views: 205
Reputation: 74
Presidio has an 'ad-hoc' recognizers feature that allows passing recognizers directly to the AnalyzerEngine's 'analyze' method. This way, you can instantiate multiple recognizers (deny-list or pattern recognizers) and pass them per call, without changing the AnalyzerEngine instance or reloading the NLP models.
Based on your attempt:
from presidio_analyzer import PatternRecognizer, EntityRecognizer, RecognizerResult, AnalyzerEngine, nlp_engine
from presidio_analyzer.nlp_engine import NlpArtifacts
import time

denylist_recognizer = PatternRecognizer(supported_entity="[PATIENT]", deny_list=["Bob"])
text = 4 * "I went to the zoo and said hello to Bob the tiger."

my_analyzer_first = AnalyzerEngine(
    supported_languages=["en"], default_score_threshold=0.5
)

print(len(my_analyzer_first.analyze(text=text, language="en")))  # Prints 4
print(len(my_analyzer_first.analyze(text=text, language="en", ad_hoc_recognizers=[denylist_recognizer])))  # Prints 8
Upvotes: 0