Reputation: 12909
Problem
I need to extract a specific string from HTML using a regex. The string always follows this pattern:
<2 digits><any number of characters>.zip
I would like to do this in one step.
What I have
data have;
infile datalines truncover;
input Line $ 1-500;
datalines;
"<td><a href=""Location/01data.zip"">2001</td>"
"<td><a href=""Location/02moarstuff.zip"">2002</td>"
;
run;
What I need
The file's name and extension from the HTML code.
File Line
01data.zip "<td><a href=""Location/01data.zip"">2001</td>"
02moarstuff.zip "<td><a href=""Location/02moarstuff.zip"">2002</td>"
What I've tried
I've tried using the following regular expression:
/\d+\w+(\.zip)/
After testing with http://regexr.com/, the expression does find the right string. I then tried to use a technique found on page 3 of this SAS regex whitepaper to remove everything except the desired substring by using the prxchange() function:
data want;
length File $25.;
set have;
file=prxchange('s/^.*\d+\w+(\.zip).*$/$1/',-1, line);
run;
This will get me:
File Line
.zip "<td><a href=""Location/01data.zip"">2001</td>"
.zip "<td><a href=""Location/02moarstuff.zip"">2002</td>"
It ends up replacing the string with .zip, but I am missing the file's name. I've tried different values of $ in the replacement, but with no success.
Question
What am I doing wrong with this regex replacement?
Upvotes: 2
Views: 4295
Reputation: 80639
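You were nearly there; the capture group just covers the wrong part. $1 refers only to what sits inside the parentheses, so (\.zip) returns just the extension. Move the digits and the name inside the group, and use \d{2} rather than \d+ so that the greedy .* in front cannot swallow the first digit (with \d+ you would get 1data.zip):
file=prxchange('s/^.*(\d{2}\w+\.zip).*$/$1/',-1, line);
This returns 01data.zip and 02moarstuff.zip for the two sample rows.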
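If you would rather extract the match than rewrite the whole line, a minimal sketch along the following lines may also work. It assumes the have dataset and the Line variable from the question; the variable name rx is only illustrative. Note that prxchange() returns the source string unchanged when nothing matches, whereas this version leaves File blank for such rows.

```sas
data want;
    length File $25;
    set have;
    /* rx is a hypothetical name for the compiled pattern id; */
    /* compile once on the first row and keep it across rows  */
    retain rx;
    if _n_ = 1 then rx = prxparse('/(\d{2}\w+\.zip)/');
    /* prxmatch returns the match position, or 0 when there is no match */
    if prxmatch(rx, Line) then File = prxposn(rx, 1, Line);
    drop rx;
run;
```

Because the pattern is not anchored with a leading .*, the engine takes the leftmost position where two digits are followed by word characters and .zip, so the greedy-backtracking issue does not arise here.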
Upvotes: 2