Ezio_Auditore
Ezio_Auditore

Reputation: 43

Extracting all the 16 digit account numbers from a text field and creating a field for that account number

I have a dataset with a text field that contains long free flowing text values , I need to identify and extract all the 16 digit account number from that text field and create a column from those extracted account number

Data that I have

    input acct_num   txt_field  ;
    DATALINES; 
    3435436     Payment issue reported 3456123789065322 to 0909876789432123 dated 9 mar 2024  
    7789976     Data declined and assigned to 7890512323454545  

Data that I need

acct_num              txt_field                                                                       acct1               acct2
    3435436     Payment issue reported 3456123789065322 to 0909876789432123 dated 9 mar 2024   3456123789065322    0909876789432123
    7789976     Data declined and assigned to 7890512323454545                                 7890512323454545  

As of now I have used Prxparse and prxmatch functions but those work when you know what exactly to look for in the text field, here I am just looking for any 16 digit values

Upvotes: 1

Views: 85

Answers (4)

Elijah David
Elijah David

Reputation: 643

I have an alternate approach, but hopefully it satisfies your request.

SELECT SUBSTRING(txt_field, PATINDEX('%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%',Text), 16) AS AC_Number FROM TableA

Hope this Helps!

Upvotes: 0

Quentin
Quentin

Reputation: 6378

An alternative to regex is to use SCAN to parse the text into words, then just check the length of each word and whether it is all digits or not, e.g.:

data want ;
  set have ;
  length word $50 acct_num_16 $200 ;
  do i=1 to countw(txt_field," ");
    word=scan(txt_field,i," ") ;
    if length(word)=16 and notdigit(trim(word))=0 then acct_num_16=catx('|',acct_num_16,word);
  end ;
  put acct_num_16= ;
  drop i word ;
run ;

Upvotes: 2

shaun_m
shaun_m

Reputation: 2776

data want;
    set have;
    retain prx;
    if _n_ = 1 then prx = prxparse('/\d{16}/o');
    start = 1;
    end = length(txt_field);
    call prxnext(prx, start, end, txt_field, pos, len);
    do while (pos > 0);
        put "inloop";
        string = substr(txt_field, pos, len);
        output;
        call prxnext(prx, start, end, txt_field, pos, len);
    end;
    drop prx start end pos len;
run;

Upvotes: 2

Stu Sztukowski
Stu Sztukowski

Reputation: 12909

You're on the right track with regex. Use call prxnext() to iterate through all instances of 16-digit account numbers. The regex \b\d{16}\b will find these.

data want;
    set have;
    length acct_num_16 $200.;
    retain exprid;
   
    /* Generate an expression ID */
    if(_N_ = 1) then exprid = prxparse('/\b\d{16}\b/');

    /* Scan all of the text */
    start = 1;
    stop  = length(txt_field);

    /* Find the first value */
    call prxnext(exprid, start, stop, txt_field, pos, len);

    /* Keep scanning until there are no more account numbers found and 
       append it to a comma-separated list */
    do while (pos > 0);
        acct_num_16 = catx(',', acct_num_16, substr(txt_field, pos, len) );
        call prxnext(exprid, start, stop, txt_field, pos, len);
    end;

    keep acct_num txt_field acct_num_16;
run;
acct_num    txt_field                    acct_num_16
3435436     Payment issue reported ...   3456123789065322,0909876789432123
7789976     Data declined and ...        7890512323454545

Upvotes: 2

Related Questions