adv12

Reputation: 8551

WebBrowser HtmlElement.GetAttribute("href") prepending hostname

My Windows Forms application hosts a WebBrowser control that displays a page full of links. I'm trying to find all the anchor elements in the loaded HtmlDocument and read their href attributes so I can provide a multi-file download interface in C#. Below is a simplified version of the function where I find and process the anchor elements:

public void ListAnchors(string baseUrl, HtmlDocument doc) // doc is retrieved from webBrowser.Document
{
    HtmlElementCollection anchors = doc.GetElementsByTagName("a");
    foreach (HtmlElement el in anchors)
    {
        string href = el.GetAttribute("href");
        Debug.WriteLine("el.Parent.InnerHtml = " + el.Parent.InnerHtml);
        Debug.WriteLine("el.GetAttribute(\"href\") = " + href);
    }
}

The anchor tags are all surrounded by <PRE> tags. The hostname from which I'm loading the HTML is a local machine on the network (lts930411). The source HTML for one entry looks like this:

<PRE><A href="/A/a150923a.lts">a150923a.lts</A></PRE>

The output of the above C# code for one anchor element is this:

el.Parent.InnerHtml = <A href="/A/a150923a.lts">a150923a.lts</A>

el.GetAttribute("href") = http://lts930411/A/a150923a.lts

Why is el.GetAttribute("href") adding the scheme and hostname prefix (http://lts930411) rather than returning the literal value of the href attribute from the source HTML? Is this behavior I can count on? Is this "feature" documented somewhere? (I was prepending the base URL myself, but that gave me addresses like http://lts930411http://lts930411/A/a150923a.lts. I'd be okay with just expecting the full URL if I could find documentation promising this will always happen.)

Upvotes: 3

Views: 8414

Answers (3)

reza.Nikmaram

Reputation: 337

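First, add a reference to Microsoft.mshtml to your project. Then cast each element's DomElement to HTMLAnchorElement and read its href property: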


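using mshtml;

var allTagA = webBrowser1.Document.GetElementsByTagName("a");
foreach (HtmlElement item in allTagA)
{
    // DomElement exposes the underlying MSHTML COM object for this anchor.
    // Note: per the IHTMLAnchorElement.href documentation, this property
    // holds the URL as resolved against the document's location.
    string href = ((HTMLAnchorElement)item.DomElement).href;
}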

Upvotes: 0

Reza Aghaei

Reputation: 125277

As stated in the IHTMLAnchorElement.href documentation, relative URLs are resolved against the location of the document containing the a element.
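To illustrate the kind of resolution described here, the following is a minimal sketch using System.Uri rather than the browser itself. The class and method names (HrefResolution, Resolve) are made up for the example, and the document URL http://lts930411/ is assumed from the question. It also shows that combining through the Uri constructor leaves an already absolute href alone, which avoids the doubled prefix mentioned in the question.

```csharp
using System;

static class HrefResolution
{
    // Hypothetical helper: resolves href against the document URL.
    // If href is already absolute, the Uri constructor returns it unchanged.
    public static Uri Resolve(string documentUrl, string href)
    {
        return new Uri(new Uri(documentUrl), href);
    }

    static void Main()
    {
        // Relative href, as written in the source HTML.
        Console.WriteLine(Resolve("http://lts930411/", "/A/a150923a.lts"));
        // Absolute href, as returned by GetAttribute("href").
        Console.WriteLine(Resolve("http://lts930411/", "http://lts930411/A/a150923a.lts"));
        // Both lines print http://lts930411/A/a150923a.lts
    }
}
```

Because both inputs produce the same result, code written this way should not need to know whether GetAttribute returned a relative or an absolute value.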

To get the untouched href attribute values instead, you can use this code:

// [^\"]* stops at the closing quote, so later attributes are not captured.
var expression = "href=\"([^\"]*)\"";
var list = document.GetElementsByTagName("a")
                   .Cast<HtmlElement>()
                   .Select(x => Regex.Match(x.OuterHtml, expression, RegexOptions.IgnoreCase))
                   .Where(m => m.Success)
                   .Select(m => m.Groups[1].Value)
                   .ToList();

The code above returns the untouched href attribute value of every a tag in the document.
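When pulling href values out of markup with a regular expression, it helps to stop the capture at the closing quote so that it cannot run into later attributes. Below is a small sketch on plain strings, with no WebBrowser involved. The names RawHref and Extract are hypothetical, and the second sample tag with a class attribute is invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class RawHref
{
    // [^"]* matches up to, but not including, the next double quote.
    static readonly Regex HrefPattern =
        new Regex("href\\s*=\\s*\"([^\"]*)\"", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the literal href value, or null if none is found.
    public static string Extract(string outerHtml)
    {
        Match m = HrefPattern.Match(outerHtml);
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        // Tag from the question.
        Console.WriteLine(Extract("<A href=\"/A/a150923a.lts\">a150923a.lts</A>"));
        // Invented tag with a second attribute after href.
        Console.WriteLine(Extract("<A href=\"/A/a150923a.lts\" class=\"x\">a150923a.lts</A>"));
        // Both lines print /A/a150923a.lts
    }
}
```

This only handles double-quoted attribute values; unquoted or single-quoted values would need a broader pattern or a real HTML parser.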

Upvotes: 2

c4pricorn

Reputation: 3481

Try this code:

    foreach (HtmlElement el in anchors)
    {
        string href = System.IO.Path.GetFileName(el.GetAttribute("href"));
        ...
    }
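Note that Path.GetFileName keeps only the last segment and drops the directory part. If the server-relative path is what you need, one possible alternative is the AbsolutePath property of System.Uri. This is a sketch only: the class and method names (HrefPath, ServerRelativePath) are hypothetical, and the input is the absolute URL reported in the question.

```csharp
using System;
using System.IO;

static class HrefPath
{
    // Hypothetical helper: returns the path component of an absolute URL.
    // AbsolutePath is returned in escaped form (for example, spaces as %20).
    public static string ServerRelativePath(string absoluteHref)
    {
        return new Uri(absoluteHref).AbsolutePath;
    }

    static void Main()
    {
        string href = "http://lts930411/A/a150923a.lts"; // value reported in the question

        Console.WriteLine(Path.GetFileName(href));   // a150923a.lts
        Console.WriteLine(ServerRelativePath(href)); // /A/a150923a.lts
    }
}
```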

Upvotes: 0
