Reputation: 1
I followed the examples in this post to build an HttpWebRequest that lists the files in a web server directory: C# HttpWebRequest command to get directory listing
It works against the example server quoted in that post, but against my own server it only shows the last file added. My code is exactly like the example there. I noticed that the HTML my server returns is a little different. Does anyone have an idea? This is the HTML:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>186.215.156.154 - /download/Zatix/Zatix - Satisfação Geral/</title>
</head>
<body>
<h1>
186.215.156.154 - /download/Zatix/Zatix - Satisfação Geral/</h1>
<hr>
<pre>
<a href="/download/Zatix/">[Para a pasta superior]</a>
<br>
<br>
sexta-feira, 19 de novembro de 2010 11:17 52355 <a href="/download/Zatix/Zatix%20-%20Satisfa%C3%A7%C3%A3o%20Geral/Zatix%20-%20Satisfa%C3%A7%C3%A3o%20Geral_3_00.zip">Zatix - Satisfação Geral_3_00.zip</a><br>sexta-feira, 19 de novembro de 2010 11:17 52355 <a href="/download/Zatix/Zatix%20-%20Satisfa%C3%A7%C3%A3o%20Geral/Zatix%20-%20Satisfa%C3%A7%C3%A3o%20Geral_4_00.zip">Zatix - Satisfação Geral_4_00.zip</a>
<br>
</pre>
<hr>
</body>
</html>
I think I have to change the pattern returned by the GetDirectoryListingRegexForUrl method.
My code is something like this:
private string GetDirectoryListingRegexForUrl(string url)
{
    // Only the configured listing URL has a known pattern.
    if (url.Equals(Url))
    {
        return "<A HREF=\".*\">(?<name>.*)</A>";
    }
    throw new NotSupportedException();
}
public void ListStudies()
{
    Url = BaseUrl + this.clientName + "/" + this.activeStudy + "/";
    Console.WriteLine(Url);
    CookieContainer cookies;
    HttpWebResponse response;
    HttpWebRequest req = (HttpWebRequest)System.Net.WebRequest.Create(Url);
    req.Credentials = _NetworkCredential;
    req.CookieContainer = new CookieContainer();
    req.AllowAutoRedirect = true;
    cookies = req.CookieContainer;
    try
    {
        response = (HttpWebResponse)req.GetResponse();
        if (response.StatusCode != HttpStatusCode.OK)
            Console.WriteLine("URL NÃO RESPONDEU"); // "URL did not respond"
        else
            Console.WriteLine("URL OK");
        using (response)
        {
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
                string html = reader.ReadToEnd();
                Regex regex = new Regex(GetDirectoryListingRegexForUrl(Url));
                MatchCollection matches = regex.Matches(html);
                if (matches.Count > 0)
                {
                    foreach (Match match in matches)
                    {
                        if (match.Success)
                        {
                            Console.WriteLine(match.Groups["name"]);
                        }
                    }
                }
            }
        }
    }
    catch (Exception e)
    {
        MessageBox.Show(e.Message, "Update Error", MessageBoxButtons.OK, MessageBoxIcon.Error);
    }
}
I hope you can help me! Thanks.
Upvotes: 0
Views: 3437
Reputation:
Here's the correct regex:
<A HREF=\".*?\">(?<name>.*?)</A>
Compare it to the original one:
<A HREF=\".*\">(?<name>.*)</A>
The problem lies with the repetition operators .* which are greedy by default. Greedy means each one matches as much text as it can while still letting the whole pattern match. Your server puts several entries on one line, so the match starts at the first <A on that line and ends at the last </A> on it, swallowing every other <A ...>...</A> in between. The name group is left with only the text of the last link, which is why you see just the last file.
You need to make the repetition operators lazy. You do that by appending ? to them, like this: .*?
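To see the difference in isolation, here is a minimal sketch that runs both patterns above against a single line holding two links. The sample line, the shortened file names, and the class and method names (ListingRegexDemo, ExtractNames) are made up for illustration; only the two patterns come from this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ListingRegexDemo
{
    // Hypothetical sample: two entries on ONE line, like the listing in the question.
    public const string SampleLine =
        "<A HREF=\"/download/a_3_00.zip\">a_3_00.zip</A><br>" +
        "<A HREF=\"/download/a_4_00.zip\">a_4_00.zip</A>";

    // Original pattern: both .* are greedy.
    public const string GreedyPattern = "<A HREF=\".*\">(?<name>.*)</A>";

    // Corrected pattern: both .*? are lazy.
    public const string LazyPattern = "<A HREF=\".*?\">(?<name>.*?)</A>";

    // Returns the text captured by the "name" group for every match.
    public static List<string> ExtractNames(string html, string pattern)
    {
        var names = new List<string>();
        foreach (Match match in Regex.Matches(html, pattern))
        {
            names.Add(match.Groups["name"].Value);
        }
        return names;
    }

    public static void Main()
    {
        // Greedy: one match spanning the whole line, name = last file only.
        Console.WriteLine(string.Join(", ", ExtractNames(SampleLine, GreedyPattern)));
        // Lazy: one match per link.
        Console.WriteLine(string.Join(", ", ExtractNames(SampleLine, LazyPattern)));
    }
}
```

With the greedy pattern the list holds a single name (the last file); with the lazy pattern it holds one name per link. If your server emits lowercase tags, passing RegexOptions.IgnoreCase to Regex.Matches would probably be needed as well, but that is a guess about your setup.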
P.S. Parsing HTML with regular expressions is notoriously a bad idea. It is fine for a quick and dirty fix but a no-go as a long-term solution. On top of that, in your case the output will vary per server and likely per server version, so the code will not work everywhere. Please consider another approach, such as asking the server directly for a directory listing (if you have that kind of access, of course).
And finally, some entertaining reading on the topic:
RegEx match open tags except XHTML self-contained tags
Upvotes: 1
Reputation: 75862
Two major problems here.
1) The output of a request like this is completely arbitrary and not even guaranteed. It is up to the server.
2) Regex is not a suitable means of parsing HTML or any similar structure, because HTML is not a regular language. Assuming your response is reliable at all, your best bet is to use something like the HtmlAgilityPack to coerce it into a rigorous XHTML document (which may not be needed if you're lucky) and read that as an XML document, using XPath queries to pull out the content you're interested in.
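A rough sketch of what that could look like, assuming the HtmlAgilityPack NuGet package is referenced (it is third-party, not part of .NET). It runs the XPath query directly against the parsed document rather than converting to XML first. The class and method names are hypothetical, and skipping links whose href ends in "/" is only a guess at how folders and the parent link could be told apart from files.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, must be installed separately

public static class ListingParserSketch
{
    // Returns the link text of every anchor that looks like a file entry.
    public static List<string> ExtractFileNames(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var names = new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return names;
        }

        foreach (HtmlNode anchor in anchors)
        {
            string href = anchor.GetAttributeValue("href", "");

            // Assumption: hrefs ending in "/" are folders or the parent link.
            if (href.EndsWith("/", StringComparison.Ordinal))
            {
                continue;
            }

            // Decode entities such as &amp; in the visible link text.
            names.Add(HtmlEntity.DeEntitize(anchor.InnerText).Trim());
        }
        return names;
    }

    public static void Main()
    {
        // Hypothetical listing fragment shaped like the one in the question.
        string html =
            "<pre><a href=\"/download/\">[parent]</a><br>" +
            "<a href=\"/download/a_3_00.zip\">a_3_00.zip</a><br>" +
            "<a href=\"/download/a_4_00.zip\">a_4_00.zip</a></pre>";

        foreach (string name in ExtractFileNames(html))
        {
            Console.WriteLine(name);
        }
    }
}
```

This still depends on the server's listing format, per point 1, but it no longer cares about tag case, attribute order, or how many links share a line.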
Upvotes: 1