HttpWebRequest command to get directory listing

I followed the examples in the following post to build my HttpWebRequest and list the files in a web server directory: C# HttpWebRequest command to get directory listing

I'm trying to use the example there to list the files on my own web server. I can list the files from the example server quoted in the link, but for my server only the last added file is shown. My code is exactly like the example there. I noticed that the HTML my server returns is a little different. Does anyone have an idea? This is the HTML:

<html>
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>186.215.156.154 - /download/Zatix/Zatix - Satisfação Geral/</title>
</head>
<body>
    <h1>
        186.215.156.154 - /download/Zatix/Zatix - Satisfação Geral/</h1>
    <hr>
    <pre>
    <a href="/download/Zatix/">[Para a pasta superior]</a>
    <br>
    <br>
    sexta-feira, 19 de novembro de 2010    11:17        52355 <a href="/download/Zatix/Zatix%20-%20Satisfa%C3%A7%C3%A3o%20Geral/Zatix%20-%20Satisfa%C3%A7%C3%A3o%20Geral_3_00.zip">Zatix - Satisfação Geral_3_00.zip</a><br>sexta-feira, 19 de novembro de 2010    11:17        52355 <a href="/download/Zatix/Zatix%20-%20Satisfa%C3%A7%C3%A3o%20Geral/Zatix%20-%20Satisfa%C3%A7%C3%A3o%20Geral_4_00.zip">Zatix - Satisfação Geral_4_00.zip</a>
    <br>
</pre>
    <hr>
</body>
</html>

I think I have to change the pattern returned by the GetDirectoryListingRegexForUrl method.

My code is something like this:

private string GetDirectoryListingRegexForUrl(string url)
{
    if (url.Equals(Url))
    {
        return "<A HREF=\".*\">(?<name>.*)</A>";                   
    }
    throw new NotSupportedException();
}

public void ListStudies()
{
    Url = BaseUrl + this.clientName + "/" + this.activeStudy + "/";
    Console.WriteLine(Url);
    CookieContainer cookies;
    HttpWebResponse response;
    HttpWebRequest req = (HttpWebRequest)System.Net.WebRequest.Create(Url);            

    req.Credentials = _NetworkCredential;
    req.CookieContainer = new CookieContainer();
    req.AllowAutoRedirect = true;
    cookies = req.CookieContainer;

    try
    {
        response = (HttpWebResponse)req.GetResponse();

        if (response.StatusCode != HttpStatusCode.OK)
            Console.WriteLine("URL DID NOT RESPOND");
        else
            Console.WriteLine("URL OK");

        using (response)
        {
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
                string html = reader.ReadToEnd();
                Regex regex = new Regex(GetDirectoryListingRegexForUrl(Url));
                MatchCollection matches = regex.Matches(html);                                             

                if (matches.Count > 0)
                {
                    foreach (Match match in matches)
                    {
                        if (match.Success)
                        {
                            Console.WriteLine(match.Groups["name"]);                                    
                        }                                
                    }
                }
            }
        }
    }
    catch (Exception e)
    {
        MessageBox.Show(e.Message, "Update Error", MessageBoxButtons.OK, MessageBoxIcon.Error);
    }            
}

I hope you can help me! Thanks.

Upvotes: 0

Views: 3437

Answers (2)

user151323

Here's the correct regex:

<A HREF=\".*?\">(?<name>.*?)</A>

Compare it to the original one:

<A HREF=\".*\">(?<name>.*)</A>

The problem lies with the repetition operators .*, which are greedy by default. Greedy means the regex expands as far as possible while looking for a match: it starts at the first <A and finishes at the last </A> it can reach (on the same line, since . does not match a newline by default), swallowing everything in between. That 'everything' includes the other <A ...>...</A> elements in the middle.

You need to make the repetition operators lazy. You do that by appending ? to them, as in .*?.
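
To make the difference concrete, here is a minimal sketch that runs both patterns over two links on a single line, shaped like the listing in the question. The class name, the shortened file names, and the use of RegexOptions.IgnoreCase are my assumptions (the sample markup uses lowercase <a href> while the pattern is written in uppercase); it is an illustration, not the asker's actual code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class; names and sample data are illustrative only.
public static class GreedyVsLazyDemo
{
    // Two links on ONE line, separated by <br>, as in the question's listing.
    public const string SampleHtml =
        "<a href=\"/download/a_3_00.zip\">a_3_00.zip</a><br>" +
        "<a href=\"/download/a_4_00.zip\">a_4_00.zip</a>";

    // Pattern from the question: both .* are greedy.
    public const string GreedyPattern = "<A HREF=\".*\">(?<name>.*)</A>";

    // Corrected pattern: both quantifiers are lazy.
    public const string LazyPattern = "<A HREF=\".*?\">(?<name>.*?)</A>";

    // Returns the captured "name" group of every match.
    public static List<string> ExtractNames(string html, string pattern)
    {
        var names = new List<string>();
        // IgnoreCase is an assumption so that <A HREF> also matches <a href>.
        foreach (Match match in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
        {
            names.Add(match.Groups["name"].Value);
        }
        return names;
    }

    public static void Main()
    {
        // Greedy: a single match spanning both links; only the last name is captured.
        Console.WriteLine("Greedy: " + string.Join(", ", ExtractNames(SampleHtml, GreedyPattern)));
        // prints "Greedy: a_4_00.zip"

        // Lazy: one match per link.
        Console.WriteLine("Lazy: " + string.Join(", ", ExtractNames(SampleHtml, LazyPattern)));
        // prints "Lazy: a_3_00.zip, a_4_00.zip"
    }
}
```

The greedy run reproduces the symptom from the question: the first .* runs to the last "> on the line, so only the final file name ends up in the name group.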

P.S. Parsing HTML with regular expressions is notoriously a bad idea. It's okay if you need a quick-and-dirty fix, but it's a no-go as a long-term solution. On top of that, in your case the output will vary per server and likely per server version, so the code will not work universally. Please consider another approach, such as asking the server directly for a directory listing (if you have the access, of course).

And finally, some entertaining reading on the topic:

Parsing Html The Cthulhu Way

RegEx match open tags except XHTML self-contained tags

Upvotes: 1

annakata

Reputation: 75862

Two major problems here.

1). The output of a request like this is completely arbitrary and not even guaranteed. The format of the listing, and whether one is returned at all, is the server's concern.

2). Regex is not a suitable means for parsing HTML or any similar structure, because HTML is not a regular language. Your best bet, assuming your response is reliable at all, is to rely on something like the HtmlAgilityPack to enforce a rigorous XHTML document (this may not be required if you're lucky) and read that as an XML document, using XPath queries to pull out the content you're interested in.
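
A rough sketch of the HtmlAgilityPack route, assuming the third-party HtmlAgilityPack NuGet package is referenced. The class and method names are hypothetical, and skipping hrefs that end in "/" is my assumption for filtering out the parent-folder link (it would also skip subdirectories); adjust it to whatever your server actually emits.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package "HtmlAgilityPack"

// Hypothetical helper; not part of the code in the question.
public static class ListingParser
{
    // Returns the link text of every <a href="..."> that looks like a file entry.
    public static List<string> ExtractFileNames(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var names = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return names;
        }

        foreach (HtmlNode anchor in anchors)
        {
            string href = anchor.GetAttributeValue("href", "");

            // Assumption: entries whose href ends in "/" are folders
            // (such as the parent-folder link), not files.
            if (href.EndsWith("/"))
            {
                continue;
            }

            // Decode HTML entities in the visible link text.
            names.Add(HtmlEntity.DeEntitize(anchor.InnerText).Trim());
        }
        return names;
    }
}
```

In the question's ListStudies method, the html string read from the response could be passed to ExtractFileNames in place of the Regex matching loop. This still depends on the server's listing format, as point 1 says, but it no longer depends on how the markup is laid out on each line.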

Upvotes: 1
