Reputation: 319
I am working on detecting PI/SI information in a given Spark dataset. I have a set of rules (in CSV format), shown below:
Rule_No,Target,Pattern,Fuzzy_Match,EPDR,Category,Active
1,Name,name,true,PI - Name,General/ID,true
1,Name,identity,true,PI - Name,General/ID,true
1,Content,Smith,true,PI - Name,General/ID,true
1,Content,Jones,true,PI - Name,General/ID,true
1,Content,Williams,true,PI - Name,General/ID,true
5,Name,Gender,true,PI - Gender,General/ID,true
5,Content,M,false,PI - Gender,General/ID,true
5,Content,F,false,PI - Gender,General/ID,true
5,Content,Male,false,PI - Gender,General/ID,true
5,Content,Female,false,PI - Gender,General/ID,true
What I am trying to do is iterate over the dataset columns and apply each of these rules to check whether a particular column contains PII.
Say I have a column called name and a given rule says to scan the content of this column for the pattern Smith. If I find a match, I know this column is a PI column and I move on to the next column. Otherwise I keep applying rules to the column until one matches.
I am using a nested for comprehension to iterate over the list of columns and the list of rules. When I find a match, I want to move to the next column instead of applying the remaining rules.
I have written code like this:
for {
  c <- ds.columns.toList
  rule <- rules if rule.active && checkPII(ds, c, rule.target, rule.pattern, rule.fuzzyMatch)
} yield {
  <return PII information>
}
But this applies every rule to the same column even after a match is found. How can I move to the next column instead of applying the remaining rules?
Upvotes: 1
Views: 130
Reputation: 27356
A for comprehension with two generators desugars into flatMap, withFilter, and map calls, which always check every element. You need to use collectFirst, which stops at the first match.
ds.columns.toList.flatMap { c =>
  rules.collectFirst {
    case rule if rule.active && checkPII(ds, c, rule.target, rule.pattern, rule.fuzzyMatch) =>
      <return PII information>
  }
}
collectFirst returns an Option: Some for the first matching rule, or None if no rule matches. Using flatMap discards the None results, so you get back a list containing only the matching values.
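To see the difference without a Spark cluster, here is a minimal self-contained sketch. Everything in it is an assumption made for illustration: the Rule case class (its fields loosely mirror the CSV header), the in-memory Map standing in for the dataset, and the simplified two-argument checkPII, which treats a fuzzy match as a case-insensitive substring test. The real checkPII from the question is not shown in the thread, so this stub only counts how often it is called.

```scala
object PiiScanSketch {
  // Hypothetical rule type; fields loosely mirror the CSV header in the question.
  final case class Rule(ruleNo: Int, target: String, pattern: String,
                        fuzzyMatch: Boolean, epdr: String, active: Boolean)

  // Stand-in for the Spark dataset: column name -> a few sample values.
  val data: Map[String, List[String]] = Map(
    "name"   -> List("Alice Smith", "Bob Jones"),
    "gender" -> List("M", "F"),
    "city"   -> List("Oslo", "Lima")
  )
  val columns: List[String] = List("name", "gender", "city")

  val rules: List[Rule] = List(
    Rule(1, "Name", "name", true, "PI - Name", true),
    Rule(1, "Content", "Smith", true, "PI - Name", true),
    Rule(5, "Name", "Gender", true, "PI - Gender", true),
    Rule(5, "Content", "M", false, "PI - Gender", true)
  )

  // Counts checkPII invocations so the two approaches can be compared.
  var calls: Int = 0

  // Hypothetical stub: "Name" rules test the column name, "Content" rules test the values.
  // Fuzzy means case-insensitive substring; non-fuzzy means exact equality.
  def checkPII(col: String, rule: Rule): Boolean = {
    calls += 1
    val candidates = if (rule.target == "Name") List(col) else data.getOrElse(col, Nil)
    candidates.exists { v =>
      if (rule.fuzzyMatch) v.toLowerCase.contains(rule.pattern.toLowerCase)
      else v == rule.pattern
    }
  }

  // collectFirst stops at the first matching rule; flatMap drops columns with no match.
  def scanFirstMatch(): List[(String, String)] =
    columns.flatMap { c =>
      rules.collectFirst {
        case rule if rule.active && checkPII(c, rule) => (c, rule.epdr)
      }
    }

  // The original for comprehension: every rule is tried against every column.
  def scanAllRules(): List[(String, String)] =
    for {
      c <- columns
      rule <- rules if rule.active && checkPII(c, rule)
    } yield (c, rule.epdr)
}
```

With this sample data, scanFirstMatch() reports name and gender once each, while scanAllRules() reports each of them twice because a second rule also matches. Resetting calls to 0 before each run shows that the collectFirst version invokes checkPII fewer times, which matters when each check triggers a Spark job.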
Upvotes: 2