Reputation: 43
I have a dataset with a text field that contains long free flowing text values , I need to identify and extract all the 16 digit account number from that text field and create a column from those extracted account number
Data that I have
input acct_num txt_field ;
DATALINES;
3435436 Payment issue reported 3456123789065322 to 0909876789432123 dated 9 mar 2024
7789976 Data declined and assigned to 7890512323454545
Data that I need
acct_num txt_field acct1 acct2
3435436 Payment issue reported 3456123789065322 to 0909876789432123 dated 9 mar 2024 3456123789065322 0909876789432123
7789976 Data declined and assigned to 7890512323454545 7890512323454545
As of now I have used Prxparse and prxmatch functions but those work when you know what exactly to look for in the text field, here I am just looking for any 16 digit values
Upvotes: 1
Views: 85
Reputation: 643
I have an alternate approach, but hopefully it satisfies your request.
SELECT SUBSTRING(txt_field, PATINDEX('%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%',Text), 16) AS AC_Number FROM TableA
Hope this Helps!
Upvotes: 0
Reputation: 6378
An alternative to regex is to use SCAN to parse the text into words, then just check the length of each word and whether it is all digits or not, e.g.:
data want ;
set have ;
length word $50 acct_num_16 $200 ;
do i=1 to countw(txt_field," ");
word=scan(txt_field,i," ") ;
if length(word)=16 and notdigit(trim(word))=0 then acct_num_16=catx('|',acct_num_16,word);
end ;
put acct_num_16= ;
drop i word ;
run ;
Upvotes: 2
Reputation: 2776
data want;
set have;
retain prx;
if _n_ = 1 then prx = prxparse('/\d{16}/o');
start = 1;
end = length(txt_field);
call prxnext(prx, start, end, txt_field, pos, len);
do while (pos > 0);
put "inloop";
string = substr(txt_field, pos, len);
output;
call prxnext(prx, start, end, txt_field, pos, len);
end;
drop prx start end pos len;
run;
Upvotes: 2
Reputation: 12909
You're on the right track with regex. Use call prxnext()
to iterate through all instances of 16-digit account numbers. The regex \b\d{16}\b
will find these.
data want;
set have;
length acct_num_16 $200.;
retain exprid;
/* Generate an expression ID */
if(_N_ = 1) then exprid = prxparse('/\b\d{16}\b/');
/* Scan all of the text */
start = 1;
stop = length(txt_field);
/* Find the first value */
call prxnext(exprid, start, stop, txt_field, pos, len);
/* Keep scanning until there are no more account numbers found and
append it to a comma-separated list */
do while (pos > 0);
acct_num_16 = catx(',', acct_num_16, substr(txt_field, pos, len) );
call prxnext(exprid, start, stop, txt_field, pos, len);
end;
keep acct_num txt_field acct_num_16;
run;
acct_num txt_field acct_num_16
3435436 Payment issue reported ... 3456123789065322,0909876789432123
7789976 Data declined and ... 7890512323454545
Upvotes: 2