Reputation: 507

SAS extract substring from string with prxchange or prxposn(prxmatch(prxparse()))

2 SOLUTIONS POSTED AT BOTTOM

My code

    data test;  
        extract_string = "<some string here>";
        my_result1 = prxchange(cat("s/^.*", extract_string, ".*$/$1/"), -1, "A1M_PRE");  
        my_result2 = prxchange(cat("s/^.*", extract_string, ".*$/$1/"), -1, "AC2_0M");  
        my_result3 = prxchange(cat("s/^.*", extract_string, ".*$/$1/"), -1, "GA3_30M");
        my_result4 = prxchange(cat("s/^.*", extract_string, ".*$/$1/"), -1, "DE3_1H30M");  
    run;

Desired results

Extract the number after _ but preceding M in strings that have M at the end. The result set should be:

    my_result1 = ""  
    my_result2 = "0"  
    my_result3 = "30"  
    my_result4 = "30"

The following `extract_string` values fail

"\.*(\d*)M\b\"  
"\.*(\d*?)M\b\"  
"\.*(\d{*})M\b\"  
"\.*(\d{*?})M\b\"  
"\.*(\d){*}M\b\"  
"\.*(\d){*?}M\b\"  

"\.*(\d+)M\b\"  
"\.*(\d+?)M\b\"  
"\.*(\d{+})M\b\"  
"\.*(\d{+?})M\b\"  
"\.*(\d){+}M\b\"  
"\.*(\d){+?}M\b\"  

"\.*(\d+\d+)M\b\"

Potential solutions which I would request help with

Perhaps I just haven't tested the correct extract_string yet. Ideas?
Perhaps my cat("s/^.*", extract_string, ".*$/$1/") needs to be modified. Ideas?
Perhaps I need to use prxposn(prxmatch(prxparse())) instead of prxchange. How would that be formulated?

Links I've looked at but have not been able to successfully implement

https://support.sas.com/rnd/base/datastep/perl_regexp/regexp-tip-sheet.pdf

https://www.pharmasug.org/proceedings/2013/CC/PharmaSUG-2013-CC35.pdf

SAS PRX to extract substring please

extracting substring using regex in sas

Extract substring from a string in SAS

SOLUTIONS

Solution 1

The suffix in the cat function and the extract_string were modified.

    data test;  
        extract_string = "?(?:_[^_\r\n]*?(\d+)M)?$";
        my_result1 = prxchange(cat("s/^.*", extract_string, "/$1/"), -1, "A1M_PRE");
        my_result2 = prxchange(cat("s/^.*", extract_string, "/$1/"), -1, "AC2_0M");
        my_result3 = prxchange(cat("s/^.*", extract_string, "/$1/"), -1, "GA3_30M");
        my_result4 = prxchange(cat("s/^.*", extract_string, "/$1/"), -1, "DE3_1H30M");
    run;

Solution 2

This solution uses the other prx-family functions: prxparse, prxmatch, and prxposn.

    data have;
      length string $10;
      input string;
      datalines;
    A1M_PRE
    AC2_0M
    GA3_30M
    DE3_1H30M
    ;

    data want;
      set have;

      rxid = prxparse ('/_.*?(\d+)M\s*$/');

      length digit_string $8;

      if prxmatch (rxid, string) then digit_string = prxposn(rxid,1,string);

      number_extracted = input (digit_string, ? 12.);
    run;

Upvotes: 2

Answers (3)

Richard

Reputation: 27516

Use PRXPOSN to extract a match group.

Example:

Use pattern /_.*?(\d+)M\s*$/ to locate the last run of digits before a terminating M character.

Regex:

_ literal underscore
.*? non-greedy any characters
(\d+) capture one or more digits
M literal M
\s*$ any number of trailing spaces before the end of the string; needed because SAS pads character values on the right with spaces up to the variable's declared length
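
The padding point in the last item can be checked directly. The following is a minimal sketch, not part of the original answer; the variable names `without_s`, `with_s`, and `trimmed` are made up for illustration.

```sas
data _null_;
    length string $10;
    string = 'GA3_30M';   /* stored blank-padded to 10 characters */

    /* Without \s*, $ cannot match because blanks follow the M. */
    without_s = prxmatch('/_.*?(\d+)M$/', string);

    /* With \s*, the padding is absorbed; the match starts at the underscore. */
    with_s = prxmatch('/_.*?(\d+)M\s*$/', string);

    /* Trimming the value first has the same effect as \s* in the pattern. */
    trimmed = prxmatch('/_.*?(\d+)M$/', trim(string));

    /* Expected in the log: without_s=0 with_s=4 trimmed=4 */
    put without_s= with_s= trimmed=;
run;
```

`trim(string)` or `strip(string)` is an alternative to `\s*` when the pattern itself should stay untouched.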

    data have;
      length string $10;
      input string;
      datalines;
    A1M_PRE
    AC2_0M
    GA3_30M
    DE3_1H30M
    ;

    data want;
      set have;

      rxid = prxparse ('/_.*?(\d+)M\s*$/');

      length digit_string $8;

      if prxmatch (rxid, string) then digit_string = prxposn(rxid,1,string);

      number_extracted = input (digit_string, ? 12.);
    run;

Upvotes: 1

The fourth bird

Reputation: 163577

If you want to replace the whole line and keep only the digits preceding the M at the end of the line, you can use a capturing group. In the replacement, keep the value of group 1 with $1.

    ^.*?(?:_[^_\r\n]*?(\d+)M)?$

Explanation

^ Start of string
.*? Match any character, as few times as possible
(?: Non-capturing group
- _[^_\r\n]*? Match _ and then as few characters as possible, excluding underscores and newlines
- (\d+)M Capture group 1: match 1+ digits, followed by M
)? Close the group and make it optional
$ End of string

You could make the extract_string the full pattern:

    extract_string = "^.*?(?:_[^_\r\n]*?(\d+)M)?$";
    my_result1 = prxchange(cat("s/", extract_string, "/$1/"), -1, "A1M_PRE");

Or if you must keep the leading ^.* use

    extract_string = "?(?:_[^_\r\n]*?(\d+)M)?$";
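
As a rough end-to-end sketch of the first variant, the full pattern can be run against the four sample values in one DATA step. The data set name `fourth_bird_demo` and the variable `digits` are placeholders. `strip()` is an addition here: values read into a `$10` variable are blank-padded, and with the padding in place the optional group could not match, because `$` would no longer sit right after the `M`.

```sas
data fourth_bird_demo;
    length string $10 digits $8;
    input string $;

    /* Replace the whole value with capture group 1.                  */
    /* The pattern is a constant, so SAS compiles it only once.       */
    /* strip() removes the blank padding before the regex is applied. */
    digits = prxchange('s/^.*?(?:_[^_\r\n]*?(\d+)M)?$/$1/', -1, strip(string));

    /* Write each input value and the extracted digits to the log. */
    put string= digits=;
    datalines;
A1M_PRE
AC2_0M
GA3_30M
DE3_1H30M
;
run;
```

If the pattern behaves as described above, the log should show an empty `digits` for A1M_PRE and then 0, 30, and 30, matching the desired results in the question.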

Upvotes: 2

Cary Swoveland

Reputation: 110745

I understand that SAS can use Perl's regex engine. The latter supports \K, which directs the engine to discard everything matched so far and reset the starting point of the match to the current location. The following regular expression should therefore match the substring's digits that are of interest.

_.*?\K\d+(?=M$)

A failure to match would be interpreted as an empty string having been matched.

Upvotes: 3