Reputation: 105

C# Using Regex.Match to retrieve file name from website source

I'm trying to retrieve a file name from a website's source using Regex.Match. I have something similar that retrieves the page title:

string title = Regex.Match(f, @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>", RegexOptions.IgnoreCase).Groups["Title"].Value;

Here f is the string that holds the source of my page.

What I need is to retrieve the file name from this source:

<br><p><b>Download:</b> 24 hours<br><b>Time Left for Download:</b> <span id='cd'></span></p><p>Click on the file name to begin download.</p><div class='linkbox'><ul><li><a href="http://site.com/file/y8Qi2Bw8SXPX/51423">blabla.pdf</a></li></div></ul>
<a id="facebookbtn-link" title="send to Facebook" href="http://www.facebook.com/sharer.php?u=http://site.com/product/komM8k" onclick="return popup(this)" ><img src="http://site/img/facebook.png" alt="Facebook" />Post on Facebook</a>

I need to retrieve blabla.pdf. The problem is that the page keeps updating the file names, so the name won't be the same every time. What I need is to retrieve whatever name appears in the position of >blabla.pdf

Upvotes: 1

Answers (3)

ΩmegaMan

Reputation: 31606

Since you are not doing tag processing but looking for a specific anchored pattern, I believe that Regex is a fine tool to use in this situation. Here is a pattern which will do the job:

string data = @"<br><p><b>Download:</b> 24 hours<br><b>Time Left for Download:</b>
<span id='cd'></span></p><p>Click on the file name to begin download.</p><div class='linkbox'><ul><li>
<a href=""http://site.com/file/y8Qi2Bw8SXPX/51423"">blabla.pdf</a></li></div></ul>
<a id=""facebookbtn-link"" title=""send to Facebook""
href=""http://www.facebook.com/sharer.php?u=http://site.com/product/komM8k""
onclick=""return popup(this)"" ><img src=""http://site/img/facebook.png"" alt=""Facebook"" />Post on Facebook</a>";


Console.WriteLine (Regex.Match(data, @"(?:\>)(?<PDF>[^\.]+\.pdf)(?:\<)").Groups["PDF"].Value);

// blabla.pdf is output

EDIT: To match any file, use the following (note that the named group changes from PDF to File, and that the extension must be exactly three lowercase letters):

Regex.Match(data, @"(?:\>)(?<File>[^\.]+\.[a-z]{3})(?:\</a\>)").Groups["File"].Value
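If no link on the page fits, Groups["File"].Value is just an empty string, so it may be worth checking Match.Success first. A minimal sketch of that, assuming the page source is already in a string; the class name FileLinkFinder and method name FindFileName are made up for illustration and are not part of the answer above:

```csharp
using System;
using System.Text.RegularExpressions;

public static class FileLinkFinder
{
    // Hypothetical helper wrapping the generalized pattern from the EDIT above.
    // Returns the link text when it looks like "name.ext" (no dots in the name,
    // three lowercase letters in the extension), or null when nothing matches.
    public static string FindFileName(string html)
    {
        Match m = Regex.Match(html, @"(?:\>)(?<File>[^\.]+\.[a-z]{3})(?:\</a\>)");
        return m.Success ? m.Groups["File"].Value : null;
    }
}
```

Calling FileLinkFinder.FindFileName(data) on the sample string above should give blabla.pdf. Names that contain extra dots, or extensions such as .docx or .PDF, would not match this pattern as written.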

Upvotes: 0

Jerry

Reputation: 4408

Try this pattern:

<a href="[^>]+>(.+?)</a>

The captured group ($1, which is Groups[1] in C#) should contain the file name.
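As a rough sketch of how this pattern could be used from C#, assuming the page source is already in a string (the class name AnchorText and method name FirstLinkText are invented for this example):

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorText
{
    // Hypothetical helper: returns the text of the first anchor whose tag
    // starts with <a href="...">, or null when the pattern does not match.
    public static string FirstLinkText(string html)
    {
        Match m = Regex.Match(html, "<a href=\"[^>]+>(.+?)</a>", RegexOptions.IgnoreCase);
        // Groups[1] is the first unnamed capture, i.e. $1 in replacement syntax.
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Note that this returns the text of the first matching link on the page. In the sample source the Facebook anchor starts with <a id=, so it is skipped, but any earlier <a href="..."> link would be returned instead of the file name.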

Upvotes: 0

Mike

Reputation: 2605

To elaborate on SLaks's answer: there is a library called the HTML Agility Pack, which is available as a NuGet package.

An example is here: http://htmlagilitypack.codeplex.com/wikipage?title=Examples

Upvotes: 2
