Jayden.Cameron

Reputation: 507

SAS extract substring from string with prxchange or prxposn(prxmatch(prxparse()))

2 SOLUTIONS POSTED AT BOTTOM

My code

    data test;  
        extract_string = "<some string here>";
        my_result1 = prxchange(cat("s/^.*", extract_string, ".*$/$1/"), -1, "A1M_PRE");  
        my_result2 = prxchange(cat("s/^.*", extract_string, ".*$/$1/"), -1, "AC2_0M");  
        my_result3 = prxchange(cat("s/^.*", extract_string, ".*$/$1/"), -1, "GA3_30M");
        my_result4 = prxchange(cat("s/^.*", extract_string, ".*$/$1/"), -1, "DE3_1H30M");  
    run;

Desired results

Extract the digits that follow the _ and immediately precede the M, but only in strings that end in M. The result set should be:

    my_result1 = ""  
    my_result2 = "0"  
    my_result3 = "30"  
    my_result4 = "30"

The following extract_string values fail

"\.*(\d*)M\b\"  
"\.*(\d*?)M\b\"  
"\.*(\d{*})M\b\"  
"\.*(\d{*?})M\b\"  
"\.*(\d){*}M\b\"  
"\.*(\d){*?}M\b\"  

"\.*(\d+)M\b\"  
"\.*(\d+?)M\b\"  
"\.*(\d{+})M\b\"  
"\.*(\d{+?})M\b\"  
"\.*(\d){+}M\b\"  
"\.*(\d){+?}M\b\"  

"\.*(\d+\d+)M\b\" 

Potential solutions which I would request help with

Links I've looked at but have not been able to successfully implement

https://support.sas.com/rnd/base/datastep/perl_regexp/regexp-tip-sheet.pdf

https://www.pharmasug.org/proceedings/2013/CC/PharmaSUG-2013-CC35.pdf

SAS PRX to extract substring please

extracting substring using regex in sas

Extract substring from a string in SAS

SOLUTIONS

Solution 1

The suffix in the cat function and the extract_string were modified.

    data test;  
        extract_string = "?(?:_[^_\r\n]*?(\d+)M)?$";
        my_result1 = prxchange(cat("s/^.*", extract_string, "/$1/"), -1, "A1M_PRE");
        my_result2 = prxchange(cat("s/^.*", extract_string, "/$1/"), -1, "AC2_0M");
        my_result3 = prxchange(cat("s/^.*", extract_string, "/$1/"), -1, "GA3_30M");
        my_result4 = prxchange(cat("s/^.*", extract_string, "/$1/"), -1, "DE3_1H30M");
    run;

Solution 2

This solution uses the other prx-family functions: prxparse, prxmatch, and prxposn.

    data have;
      length string $10;
      input string;
      datalines;
    A1M_PRE
    AC2_0M
    GA3_30M
    DE3_1H30M
    ;

    data want;
      set have;

      rxid = prxparse ('/_.*?(\d+)M\s*$/');

      length digit_string $8;

      if prxmatch (rxid, string) then digit_string = prxposn(rxid,1,string);

      number_extracted = input (digit_string, ? 12.);
    run;

Upvotes: 2

Views: 3037

Answers (3)

Richard

Reputation: 27516

Use PRXPOSN to extract a match group.

Example:

Use pattern /_.*?(\d+)M\s*$/ to locate the last run of digits before a terminating M character.

Regex:

  • _ literal underscore
  • .*? non-greedy any characters
  • (\d+) capture one or more digits
  • M literal M
  • \s*$ any number of trailing spaces before the end of the string; needed because SAS character values are right-padded with spaces to the variable's declared length

data have;
  length string $10;
  input string;
  datalines;
A1M_PRE
AC2_0M
GA3_30M
DE3_1H30M
;

data want;
  set have;

  rxid = prxparse ('/_.*?(\d+)M\s*$/');

  length digit_string $8;

  if prxmatch (rxid, string) then digit_string = prxposn(rxid,1,string);

  number_extracted = input (digit_string, ? 12.);
run;
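
The effect of the trailing \s* can be checked with a small sketch. The dataset name pad_check and the three match_ variables are made up for illustration; the step compares the pattern with and without \s* on a padded $10 value, and on a trimmed copy of it.

```sas
data pad_check;
  length string $10;
  string = "AC2_0M";   /* stored with four trailing blanks */

  /* without \s*, $ must come right after M, so the padded value should not match */
  match_without = prxmatch('/_.*?(\d+)M$/', string);

  /* with \s*, the trailing blanks are absorbed before $ */
  match_with = prxmatch('/_.*?(\d+)M\s*$/', string);

  /* trimming the value first is another way to get the same effect */
  match_trimmed = prxmatch('/_.*?(\d+)M$/', trim(string));

  put match_without= match_with= match_trimmed=;
run;
```

prxmatch returns 0 when there is no match and the starting position of the match otherwise, so the log should show 0 for match_without and a nonzero position for the other two.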

Upvotes: 1

The fourth bird

Reputation: 163577

If you want to remove everything from the line and keep only the digits preceding M at the end of the line, you could use a capturing group. In the replacement, keep the value of group 1 with $1.

^.*?(?:_[^_\r\n]*?(\d+)M)?$

Explanation

  • ^ Start of string
  • .*? Match any character, as few times as possible
  • (?: Non-capturing group
    • _[^_\r\n]*? Match _ and then any character except an underscore, carriage return, or newline, as few times as possible
    • (\d+)M Capture group 1, match 1+ digits followed by M
  • )? Close the group and make it optional
  • $ End of string

You could make the extract_string the full pattern:

extract_string = "^.*?(?:_[^_\r\n]*?(\d+)M)?$";
my_result1 = prxchange(cat("s/", extract_string, "/$1/"), -1, "A1M_PRE");

Or if you must keep the leading ^.* use

extract_string = "?(?:_[^_\r\n]*?(\d+)M)?$";
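
As a minimal sketch, the full-pattern variant can be run against all four sample values in one DATA step. The variable names input_string and my_result, the declared lengths, and the DO loop are assumptions for illustration. The character class is written as [^_\r\n], which is how I read the intent of the explanation above. Because the pattern has no \s* before $, the sketch trims the padded variable before matching, and cats is used so that the padded pattern variable does not carry blanks into the regex.

```sas
data test;
    length extract_string $40 input_string $10 my_result $10;

    /* full pattern, so nothing is prepended in front of it */
    extract_string = "^.*?(?:_[^_\r\n]*?(\d+)M)?$";

    /* loop over the sample values from the question */
    do input_string = "A1M_PRE", "AC2_0M", "GA3_30M", "DE3_1H30M";
        /* cats strips the blank padding from extract_string;      */
        /* trim removes the padding that would otherwise follow M */
        my_result = prxchange(cats("s/", extract_string, "/$1/"), -1, trim(input_string));
        output;
    end;
run;
```

If this behaves as expected, my_result should be blank for A1M_PRE and 0, 30, and 30 for the other three values.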

Upvotes: 2

Cary Swoveland

Reputation: 110745

I understand that SAS can use Perl's regex engine. The latter supports \K, which directs the engine to discard everything matched so far and reset the starting point of the match to the current location. The following regular expression should therefore match the substring's digits that are of interest.

_.*?\K\d+(?=M$)

A failure to match would be interpreted as an empty string having been matched.
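
Whether \K is accepted depends on the Perl version behind the PRX functions in a given SAS release, which I have not verified. As a hedged alternative, the idea of matching only the digits can be sketched with the lookahead alone plus CALL PRXSUBSTR, which returns the position and length of the match. This is a different pattern from the one above: it drops the _.*? prefix, so it no longer requires an underscore before the digits, and it adds \s* to allow for the blank padding of SAS character variables. The names want_lookahead, pos, and len are made up for illustration.

```sas
data want_lookahead;
  length string $10 digit_string $8;
  input string;

  /* digits directly followed by a final M; \s* absorbs the blank padding */
  rxid = prxparse('/\d+(?=M\s*$)/');

  /* pos and len are both set to 0 when nothing matches */
  call prxsubstr(rxid, string, pos, len);

  /* digit_string stays blank for values without a match */
  if pos > 0 then digit_string = substr(string, pos, len);
  datalines;
A1M_PRE
AC2_0M
GA3_30M
DE3_1H30M
;
```

With the four sample values, digit_string should stay blank for A1M_PRE and hold 0, 30, and 30 for the others.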

Upvotes: 3
