Reputation: 53
I've begun using PRX code within SAS to identify free text phrases entered in a database I'm using. A typical phrase I'm identifying is: 'positive modified hodge test' or 'positive for modified hodge test'. These phrases are embedded within large strings of text at times. What I don't want to flag are phrases that say 'previous positive hodge test'. I've read some documentation to implement a negative lookbehind to NOT flag phrases that include "previous" but it's not doing what I had anticipated.
if prxmatch("/pos\w+ (for)?(by)?\s?(the)?\s?(modi|hod|mht)/i") > 0 then hodge_id = 1;
The PRX code above will match all phrases below: "positive modified hodge" "previous positive hodge test" "confirmed positive hodge carbapenemase" "positive for modified hodge test" "positive by the modified hodge"
if prxmatch("/pos\w+ (for)?(by)?\s?(the)?\s?(modi|hod|mht)/i") > 0 then
hodge_id = 1; /* Without lookbehind */
if prxmatch("/(?<!previous)\s*pos\w+ (for)?(by)?\s?(the)?\s?(modi|hod|mht)/i") > 0 then
hodge_id = 1; /* With lookbehind */
Using the negative lookbehind, I expect to flag: "positive modified hodge" "confirmed positive hodge carbapenemase" "positive for modified hodge test" "positive by the modified hodge"
but not: "previous positive hodge test"
What happens is that it omits the phrase including "previous" but also the first phrase "positive modified hodge".
My PRX is in the beginning stages, so any advice in cleaning/simplifying it is appreciated.
Upvotes: 1
Views: 205
Reputation: 3315
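You were pretty close.
/*
you need to have
(?<!previous\s) or (?<!previous)\s
instead of (?<!previous)\s*
Because \s* can also match zero characters, the match is free to start
directly at "pos". At that position the text just before it is "revious "
(ending in the space), not "previous", so the lookbehind does not reject it.
Note that (?<!previous)\s requires a whitespace character before "pos",
so it will not match a string that begins with "positive".
*/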
data have;
    length string $200;
    infile datalines;
    input string & $;
    datalines;
this is cool and nice positive modified hodge wow so nice
this is wow confirmed positive hodge carbapenemase
now this positive for modified hodge test and later
cool is my name positive by the modified hodge hello
wow and wow previous positive hodge test
Mr cool
;
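data want;
    set have;
    /* Compile the pattern once, on the first observation, and keep its ID */
    if _N_ = 1 then
        do;
            retain patternID;
            pattern = "/(?<!previous\s)pos\w+ (for)?(by)?\s?(the)?\s?(modi|hod|mht)/i";
            patternID = prxparse(pattern);
        end;
    /* Flag rows where the phrase occurs and is not preceded by "previous " */
    if prxmatch(patternID, string) > 0 then
        hodge_id = 1;
    else hodge_id = 0;
    drop pattern patternID;
run;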
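To see the difference between the two lookbehind forms side by side, here is a minimal sketch that records where each pattern starts to match. The dataset name `demo` and the variables `phrase`, `pos_star`, and `pos_fixed` are made up for illustration; the two patterns are the ones from the question and from the fix above.

```sas
/* Illustrative only: demo, phrase, pos_star and pos_fixed are hypothetical names */
data demo;
    length phrase $60;
    infile datalines truncover;
    input phrase $60.;
    /* prxmatch returns the 1-based position where the match starts, or 0 if none */
    /* Original form: \s* may match nothing, so the match can start at "pos" */
    pos_star  = prxmatch("/(?<!previous)\s*pos\w+ (for)?(by)?\s?(the)?\s?(modi|hod|mht)/i", phrase);
    /* Fixed form: the space is part of the lookbehind */
    pos_fixed = prxmatch("/(?<!previous\s)pos\w+ (for)?(by)?\s?(the)?\s?(modi|hod|mht)/i", phrase);
    datalines;
positive modified hodge
previous positive hodge test
;
run;

proc print data=demo;
run;
```

For "previous positive hodge test", `pos_fixed` should be 0, while `pos_star` would be expected to point at the start of "positive", since the lookbehind there only sees "revious " and not "previous". Comparing the two columns on your own strings is a quick way to check which rows a pattern change affects before using it to set `hodge_id`.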
Upvotes: 1