Extracting a substring in SAS using regex

Problem

I need to extract a specific string from HTML using regex. The file name always follows this pattern:

<2 digits><word characters>.zip

I would like to do this in one step.

What I have

data have;
    infile datalines truncover;
    input Line $ 1-500;
    datalines;
"2001"
"2002"
;
run;

What I need

The file's name and extension from the HTML code.

File               Line                                                   
01data.zip         "2001"         
02moarstuff.zip    "2002"

What I've tried

I've tried using the following regular expression:

/\d+\w+(\.zip)/

After testing with http://regexr.com/, the expression does find the right string. I then tried a technique from page 3 of this SAS regex whitepaper: using the prxchange() function to remove everything except the desired substring:

data want;
    length File $25.;
    set have;

    file=prxchange('s/^.*\d+\w+(\.zip).*$/$1/',-1, line);
run;

This will get me:

File    Line                                                   
.zip    "2001"         
.zip    "2002"

The whole line is replaced with .zip, so the extension survives but the file's name is lost. I've tried other capture references ($1, $2, ...) in the replacement, without success.

Question

What am I doing wrong with this regex replacement?

hjpotter92 · Accepted Answer

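You were nearly there; the capture group just wraps the wrong part of the pattern. In your expression only \.zip sits inside the parentheses, so $1 holds nothing but the extension. Put the parentheses around the whole file name, digits included, and use \d{2} instead of \d+ so the greedy leading .* cannot swallow the first digit:

file=prxchange('s/^.*(\d{2}\w+\.zip).*$/$1/',-1, line);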
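As an alternative to substitution, the file name can be pulled out directly with prxparse(), prxmatch(), and prxposn(). The following is only a minimal sketch: the three input rows are hypothetical stand-ins (the original HTML lines are not shown in the question), and the helper variable name re is an arbitrary choice.

```sas
/* Hypothetical sample rows; the real HTML lines are not shown in the question. */
data have;
    infile datalines truncover;
    input Line $ 1-500;
    datalines;
<a href="/files/01data.zip">2001</a>
<a href="/files/02moarstuff.zip">2002</a>
<p>no archive on this line</p>
;
run;

data want;
    length File $25;
    set have;
    retain re;
    /* Compile the pattern once, on the first iteration only. */
    if _n_ = 1 then re = prxparse('/(\d{2}\w+\.zip)/');
    /* prxmatch() returns the match position, or 0 when nothing matches. */
    if prxmatch(re, Line) then File = prxposn(re, 1, Line);
    drop re;
run;
```

One reason to prefer this form: when a line contains no match, prxchange() returns the input unchanged, so File would hold the first 25 characters of the HTML line. With prxmatch() and prxposn(), File simply stays blank for such rows.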
