Reputation: 15
I need to parse a link to a ZIP file out of HTML. The name of this ZIP file changes every month. Here is a snippet of the HTML I need to parse:
<a href="http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip">
The string I need to get is "http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip" so I can download the file using WebClient. The only portion of that zip file URL that remains constant from month to month is "http://nppes.viva-it.com/". Is there a way using a regular expression to parse the full URL, "http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip", out of the HTML?
Upvotes: 0
Views: 1640
Reputation: 32323
By using HtmlAgilityPack:
var html = "<a href=\"http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip\">";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var anchor = doc.DocumentNode.SelectSingleNode("//a");
var href = anchor.GetAttributeValue("href", null);
Now the href variable holds "http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip".
Isn't that simpler than a regex?
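On a full page, "//a" returns the first anchor, which may not be the ZIP link. The sketch below narrows the XPath to anchors whose href starts with the constant prefix from the question and then checks for the .zip suffix. It assumes HtmlAgilityPack is referenced; ZipLinkFinder and FindZipUrl are hypothetical names used only for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class ZipLinkFinder
{
    // Assumption: this host prefix stays constant from month to month, as the question states.
    private const string Prefix = "http://nppes.viva-it.com/";

    // Returns the href of the first anchor that starts with Prefix and ends in ".zip",
    // or null when the page contains no such anchor.
    public static string FindZipUrl(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // starts-with() is an XPath 1.0 function; the suffix is checked in C# below.
        var anchors = doc.DocumentNode.SelectNodes(
            "//a[starts-with(@href, '" + Prefix + "')]");
        if (anchors == null) return null; // SelectNodes yields null when nothing matches

        foreach (var anchor in anchors)
        {
            var href = anchor.GetAttributeValue("href", "");
            if (href.EndsWith(".zip", StringComparison.OrdinalIgnoreCase))
                return href;
        }
        return null;
    }
}
```

Calling ZipLinkFinder.FindZipUrl(pageHtml) on the downloaded page source then gives the URL to hand to WebClient, or null if the link is missing that month.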
Upvotes: 1
Reputation:
Here is a raw regex - uses branch reset.
The answer is in capture buffer 2.
<a
(?=\s)
(?= (?:[^>"']|"[^"]*"|'[^']*')*? (?<=\s)
href \s*=
(?|
(?> \s* (['"]) \s* (http://nppes\.viva-it\.com/ (?:(?!\g{-2}) .)+ \.zip ) \s* \g{-2} )
| (?> (?!\s*['"]) \s* () (http://nppes\.viva-it\.com/ [^\s>]* \.zip ) (?=\s|>) )
)
)
\s+ (?:".*?"|'.*?'|[^>]*?)+
>
Not sure if C# can do branch reset. If it can't, this variation works.
The answer is always capture buffer 2 concatenated with capture buffer 3; only one of the two is set for any given match.
<a
(?=\s)
(?= (?:[^>"']|"[^"]*"|'[^']*')*? (?<=\s)
href \s*=
(?:
(?> \s* (['"]) \s* (http://nppes\.viva-it\.com/ (?:(?!\g{-2}) .)+ \.zip ) \s* \g{-2} )
| (?> (?!\s*['"]) \s* (http://nppes\.viva-it\.com/ [^\s>]* \.zip ) (?=\s|>) )
)
)
\s+ (?:".*?"|'.*?'|[^>]*?)+
>
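As far as I know, the .NET regex engine supports neither the branch reset group (?|...) nor the relative backreference \g{-2}, so neither pattern above will compile in C# as written. The sketch below is a simplified adaptation, not a faithful port: it uses a named group that appears in both alternatives (which .NET allows) and \k<q> for the quote backreference, and it drops the lookahead that skips over other quoted attributes. HrefRegex and Extract are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefRegex
{
    // Two alternatives: a quoted href (single or double quotes) and an unquoted one.
    // Both capture into the same named group "url".
    private static readonly Regex ZipHref = new Regex(
        @"<a\s[^>]*?\bhref\s*=\s*
          (?:
              (?<q>[""'])\s*(?<url>http://nppes\.viva-it\.com/(?:(?!\k<q>).)+?\.zip)\s*\k<q>
            | (?<url>http://nppes\.viva-it\.com/[^\s>]*\.zip)(?=\s|>)
          )",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);

    // Returns the captured URL, or null when no matching anchor is found.
    public static string Extract(string html)
    {
        Match m = ZipHref.Match(html);
        return m.Success ? m.Groups["url"].Value : null;
    }
}
```

Because both alternatives write to the group named url, the caller reads a single group instead of concatenating two buffers.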
Upvotes: 0
Reputation: 224858
If there will only ever be one ZIP linked to on the page, no problem:
Regex re = new Regex(@"http://nppes\.viva-it\.com/.+\.zip");
string url = re.Match(html).Value; // The matched URL (empty string if nothing matched)
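Note that .+ is greedy, so if a second ".zip" ever appears later on the same line, the match runs through to the last one. A minimal sketch of a tighter variant follows, together with the WebClient download step mentioned in the question. The character class, the ZipDownloader class, and the pageUrl and targetPath parameters are assumptions for illustration, not part of the original answer.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class ZipDownloader
{
    // [^"'\s<>]+ cannot cross a quote, whitespace, or tag boundary,
    // so the match ends at the first URL's own ".zip".
    private static readonly Regex ZipUrl =
        new Regex(@"http://nppes\.viva-it\.com/[^""'\s<>]+\.zip");

    // Returns the first matching URL in the HTML, or null if there is none.
    public static string FindUrl(string html)
    {
        Match m = ZipUrl.Match(html);
        return m.Success ? m.Value : null;
    }

    // Fetches the page at pageUrl, finds the ZIP link, and saves the file to targetPath.
    // Both arguments are hypothetical; supply the real page address and a local path.
    public static void Download(string pageUrl, string targetPath)
    {
        using (var client = new WebClient())
        {
            string html = client.DownloadString(pageUrl);
            string url = FindUrl(html);
            if (url == null)
                throw new InvalidOperationException("No ZIP link found on the page.");
            client.DownloadFile(url, targetPath);
        }
    }
}
```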
Upvotes: 0