Stu Sztukowski

Reputation: 12909

Extracting a substring in SAS using regex

Problem

I need to extract a specific string from HTML using regex. The file name always follows this pattern:

<2 digits><any number of characters>.zip

I would like to do this in one step.

What I have

data have;
    infile datalines truncover;
    input Line $ 1-500;
    datalines;
"<td><a href=""Location/01data.zip"">2001</td>"
"<td><a href=""Location/02moarstuff.zip"">2002</td>"
;
run;

What I need

The file's name and extension from the HTML code.

File               Line                                                   
01data.zip         "<td><a href=""Location/01data.zip"">2001</td>"         
02moarstuff.zip    "<td><a href=""Location/02moarstuff.zip"">2002</td>"    

What I've tried

I've tried using the following regular expression:

/\d+\w+(\.zip)/

After testing with http://regexr.com/, the expression does find the right string. I then tried a technique found on page 3 of this SAS regex whitepaper, which uses the prxchange() function to remove everything except the desired substring:

data want;
    length File $25.;
    set have;

    file=prxchange('s/^.*\d+\w+(\.zip).*$/$1/',-1, line);
run;

This will get me:

File    Line                                                   
.zip    "<td><a href=""Location/01data.zip"">2001</td>"         
.zip    "<td><a href=""Location/02moarstuff.zip"">2002</td>" 

The whole line ends up replaced with just .zip, so the file's name is lost. I've tried different $ capture references in the replacement, without success.

Question

What am I doing wrong with this regex replacement?

Upvotes: 2

Views: 4295

Answers (1)

hjpotter92

Reputation: 80639

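You were nearly there; the capture group is just around the wrong part of the pattern. $1 returns only what is inside the parentheses, which in your expression is .zip. Move the parentheses so that they enclose the two digits and the name as well:

file=prxchange('s/^.*(\d{2}\w+\.zip).*$/$1/',-1, line);

Using \d{2} instead of \d+ also matters here: the leading .* is greedy, so with \d+ it would swallow the first digit and leave only the second one for the group.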
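If a line might not contain a matching file name, note that prxchange() returns the source string unchanged when nothing matches, so File would end up holding the start of the HTML line. A minimal alternative sketch, assuming the have dataset from the question, is to match first and then read the capture buffer with prxposn(). The variable name re is only illustrative.

```sas
data want;
    length File $25;
    set have;

    /* Compile the pattern once and keep its id across rows. */
    retain re;
    if _n_ = 1 then re = prxparse('/(\d{2}\w+\.zip)/');

    /* prxmatch() returns 0 when there is no match, so File stays blank
       for such rows instead of receiving a copy of Line. */
    if prxmatch(re, Line) then File = prxposn(re, 1, Line);

    drop re;
run;
```

Because this pattern is not anchored with ^.* the regex engine takes the leftmost match, so greedy backtracking no longer decides where the capture starts. For the two sample rows this should give 01data.zip and 02moarstuff.zip.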

Upvotes: 2
