Pink Flying Elephant

Reputation: 151

On-the-fly Deny List in Presidio?

Loading the Presidio analyzer engine takes some time. I want to filter out specific names, but a different set of names for every document. I don't understand how to perform this seemingly simple task with Presidio. I can pass an allow_list to the analyze function, but not a deny_list. So I need a recognizer that can take a different deny list on each call, without reloading the analyzer engine every time, because the NLP models do not need to be reloaded.

How can a recognizer be built in Presidio that takes a different deny list (or better a deny dictionary) each time it is called?

I found the function my_analyzer_first.registry.remove_recognizer, but it does not seem to work: adding the recognizer appears to change the analyzer permanently. Running the code below, "Bob" is recognized as a patient even after the recognizer has supposedly been removed.

from presidio_analyzer import PatternRecognizer, EntityRecognizer, RecognizerResult, AnalyzerEngine, nlp_engine
from presidio_analyzer.nlp_engine import NlpArtifacts
import time

my_analyzer_first = AnalyzerEngine(
        supported_languages=["en"], default_score_threshold=0.5
    )

def on_the_fly_without_loading(text : str, deny_list : list[str]):
    denylist_recognizer = PatternRecognizer(supported_entity="[PATIENT]", deny_list=deny_list)
    my_analyzer_first.registry.add_recognizer(denylist_recognizer)
    entities = my_analyzer_first.analyze(text=text, language="en")
    my_analyzer_first.registry.remove_recognizer(denylist_recognizer)
    return entities


def sanity_check(text : str):
    entities = my_analyzer_first.analyze(text=text, language="en")
    return entities

if __name__=="__main__":

    N = 10

    text = 4 * "I went to the zoo and said hello to Bob the tiger."

    deny_list = ["Bob"]

    print(on_the_fly_without_loading(text, deny_list))
    print(sanity_check(text))

You'd expect 'Bob' to be flagged only in the first call, but it is flagged in both. Comparing my_analyzer_first.registry before and after the call to my_analyzer_first.registry.remove_recognizer(denylist_recognizer), I found that the registry does not change.

Upvotes: 0

Views: 205

Answers (1)

Sharon Hart

Reputation: 74

Presidio has an 'ad-hoc' recognizers feature that lets you pass recognizers directly to the AnalyzerEngine's analyze method through the ad_hoc_recognizers argument. This way, you can instantiate as many recognizers as you need (deny-list or pattern recognizers) and use them for a single call without changing the AnalyzerEngine instance.

Based on your attempt:

# Only these two classes are used in this snippet.
from presidio_analyzer import AnalyzerEngine, PatternRecognizer

denylist_recognizer = PatternRecognizer(supported_entity="[PATIENT]", deny_list=["Bob"])
text = 4 * "I went to the zoo and said hello to Bob the tiger."
my_analyzer_first = AnalyzerEngine(
    supported_languages=["en"], default_score_threshold=0.5
)
print(len(my_analyzer_first.analyze(text=text, language="en"))) # Prints 4
print(len(my_analyzer_first.analyze(text=text, language="en", ad_hoc_recognizers=[denylist_recognizer]))) # Prints 8

Upvotes: 0
