Gary

Reputation: 83

C# Convert Relative to Absolute Links in HTML String

I'm mirroring some internal websites for backup purposes. Right now I use this C# code:

System.Net.WebClient client = new System.Net.WebClient();
byte[] dl = client.DownloadData(url);

This downloads the HTML into a byte array, which is what I want. The problem is that most of the links within the HTML are relative, not absolute.

I want to prepend the full http://domain.is to each relative link, converting it to an absolute link that points to the original content. I'm only concerned with href= and src=. Is there a regular expression that will cover the basic cases?

Edit [My Attempt]:

public static string RelativeToAbsoluteURLS(string text, string absoluteUrl)
{
    if (String.IsNullOrEmpty(text))
    {
        return text;
    }

    String value = Regex.Replace(
        text, 
        "<(.*?)(src|href)=\"(?!http)(.*?)\"(.*?)>", 
        "<$1$2=\"" + absoluteUrl + "$3\"$4>", 
        RegexOptions.IgnoreCase | RegexOptions.Multiline);

    return value.Replace(absoluteUrl + "/", absoluteUrl);
}

Upvotes: 8

Views: 16168

Answers (10)

Nathan Baulch

Reputation: 20683

The most robust solution would be to use the HTML Agility Pack, as others have suggested. However, a reasonable solution using regular expressions is possible using the Replace overload that takes a MatchEvaluator delegate, as follows:

var baseUri = new Uri("http://test.com");
var pattern = @"(?<name>src|href)=""(?<value>/[^""]*)""";
var matchEvaluator = new MatchEvaluator(
    match =>
    {
        var value = match.Groups["value"].Value;
        Uri uri;

        if (Uri.TryCreate(baseUri, value, out uri))
        {
            var name = match.Groups["name"].Value;
            return string.Format("{0}=\"{1}\"", name, uri.AbsoluteUri);
        }

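        // Leave the attribute unchanged if the value cannot be resolved;
        // returning null here would replace the match with an empty string.
        return match.Value;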
    });
var adjustedHtml = Regex.Replace(originalHtml, pattern, matchEvaluator);

The above sample searches for attributes named src and href that contain double quoted values starting with a forward slash. For each match, the static Uri.TryCreate method is used to determine if the value is a valid relative uri.

Note that this solution doesn't handle single quoted attribute values and certainly doesn't work on poorly formed HTML with unquoted values.
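
As a rough sketch of how the single-quote limitation might be relaxed, the pattern can capture the opening quote and require the same one to close the value. The class name QuoteAwareRewriter, the method MakeAbsolute and the sample markup are hypothetical, and this version deliberately resolves every value (not only those starting with a forward slash), relying on Uri.TryCreate to leave already-absolute URLs as they are. It still does not handle unquoted values.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteAwareRewriter
{
    // Hypothetical helper: rewrites src/href values wrapped in either quote style.
    public static string MakeAbsolute(string html, Uri baseUri)
    {
        // <quote> captures the opening quote; \k<quote> requires the same closing quote.
        const string pattern =
            @"\b(?<name>src|href)\s*=\s*(?<quote>[""'])(?<value>[^""']*)\k<quote>";

        return Regex.Replace(html, pattern, match =>
        {
            Uri resolved;
            if (!Uri.TryCreate(baseUri, match.Groups["value"].Value, out resolved))
            {
                // Keep the original text when the value cannot be resolved.
                return match.Value;
            }

            var quote = match.Groups["quote"].Value;
            return match.Groups["name"].Value + "=" + quote + resolved.AbsoluteUri + quote;
        }, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        // Sample input is made up for illustration.
        var html = "<a href='/docs/page.html'>Docs</a><img src=\"img/logo.png\" />";
        Console.WriteLine(MakeAbsolute(html, new Uri("http://test.com/site/")));
    }
}
```

Values that contain the other quote character (for example an apostrophe inside a double-quoted value) are not matched by this pattern and are left untouched.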

Upvotes: 9

Mahmoud

Reputation: 1

This is what you are looking for. This code snippet converts all the relative URLs inside the HTML to absolute ones:

Private Function ConvertALLrelativeLinksToAbsoluteUri(ByVal html As String, ByVal PageURL As String) As String
    Dim result As String = Nothing
    ' Getting all Href
    Dim opt As New RegexOptions
    Dim XpHref As New Regex("(href="".*?"")", RegexOptions.IgnoreCase)
    Dim i As Integer
    Dim NewSTR As String = html
    For i = 0 To XpHref.Matches(html).Count - 1
        Application.DoEvents()
        Dim Oldurl As String = Nothing
        Dim OldHREF As String = Nothing
        Dim MainURL As New Uri(PageURL)
        OldHREF = XpHref.Matches(html).Item(i).Value
        Oldurl = OldHREF.Replace("href=", "").Replace("HREF=", "").Replace("""", "")
        Dim NEWURL As New Uri(MainURL, Oldurl)
        Dim NewHREF As String = "href=""" & NEWURL.AbsoluteUri & """"
        NewSTR = NewSTR.Replace(OldHREF, NewHREF)
    Next
    html = NewSTR
    Dim XpSRC As New Regex("(src="".*?"")", RegexOptions.IgnoreCase)
    For i = 0 To XpSRC.Matches(html).Count - 1
        Application.DoEvents()
        Dim Oldurl As String = Nothing
        Dim OldHREF As String = Nothing
        Dim MainURL As New Uri(PageURL)
        OldHREF = XpSRC.Matches(html).Item(i).Value
        Oldurl = OldHREF.Replace("src=", "").Replace("SRC=", "").Replace("""", "")
        Dim NEWURL As New Uri(MainURL, Oldurl)
        Dim NewHREF As String = "src=""" & NEWURL.AbsoluteUri & """"
        NewSTR = NewSTR.Replace(OldHREF, NewHREF)
    Next
    Return NewSTR
End Function

Upvotes: 0

jfren484

Reputation: 1000

I know this is an older question, but I figured out how to do it with a fairly simple regex. It works well for me. It handles http/https and also root-relative and current directory-relative.

var host = "http://www.google.com/";
var baseUrl = host + "images/";
var html = "<html><head></head><body><img src=\"/images/srpr/logo3w.png\" /><br /><img src=\"srpr/logo3w.png\" /></body></html>";
var regex = "(?<=(?:href|src)=\")(?!https?://)(?<url>[^\"]+)";
html = Regex.Replace(
    html,
    regex,
    match => match.Groups["url"].Value.StartsWith("/")
        ? host + match.Groups["url"].Value.Substring(1)
        : baseUrl + match.Groups["url"].Value);

Upvotes: 0

Samidjo

Reputation: 2355

Simple function

public string ConvertRelativeUrlToAbsoluteUrl(string relativeUrl)
{
    if (Request.IsSecureConnection)
        return string.Format("https://{0}{1}", Request.Url.Host, Page.ResolveUrl(relativeUrl));
    else
        return string.Format("http://{0}{1}", Request.Url.Host, Page.ResolveUrl(relativeUrl));
}

Upvotes: 0

Smith

Reputation: 5951

Just use this function

    '# Converts a relative URL to an absolute URI
    Function RelativeToAbsoluteUrl(ByVal baseURI As Uri, ByVal RelativeUrl As String) As Uri
        ' get action tags, relative or absolute
        Dim uriReturn As Uri = New Uri(RelativeUrl, UriKind.RelativeOrAbsolute)
        ' Make it absolute if it's relative
        If Not uriReturn.IsAbsoluteUri Then
            Dim baseUrl As Uri = baseURI
            uriReturn = New Uri(baseUrl, uriReturn)
        End If
        Return uriReturn
    End Function

Upvotes: 0

Marc Gravell

Reputation: 1062520

Uri WebsiteImAt = new Uri(
       "http://www.w3schools.com/media/media_mimeref.asp?q=1&s=2,2#a");
string href = new Uri(WebsiteImAt, "/something/somethingelse/filename.asp")
       .AbsoluteUri;
string href2 = new Uri(WebsiteImAt, "something.asp").AbsoluteUri;
string href3 = new Uri(WebsiteImAt, "something").AbsoluteUri;

which with your Regex-based approach is probably (untested) mappable to:

        String value = Regex.Replace(text, "<(.*?)(src|href)=\"(?!http)(.*?)\"(.*?)>", match => 
            "<" + match.Groups[1].Value + match.Groups[2].Value + "=\""
                + new Uri(WebsiteImAt, match.Groups[3].Value).AbsoluteUri + "\""
                + match.Groups[4].Value + ">",RegexOptions.IgnoreCase | RegexOptions.Multiline);

I should also advise not to use Regex here, but to apply the Uri trick to some code using a DOM, perhaps XmlDocument (if xhtml) or the HTML Agility Pack (otherwise), looking at all //@src or //@href attributes.
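
A minimal sketch of the XmlDocument variant, assuming the page is well-formed XHTML with no DOCTYPE and no HTML-only entities such as &nbsp; (LoadXml throws on anything else). The class name XhtmlLinkRewriter, the method MakeAbsolute and the sample markup are made up for illustration:

```csharp
using System;
using System.Linq;
using System.Xml;

public static class XhtmlLinkRewriter
{
    // Hypothetical helper: resolves every src/href attribute against baseUri.
    public static string MakeAbsolute(string xhtml, Uri baseUri)
    {
        var doc = new XmlDocument();
        doc.PreserveWhitespace = true;
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

        // Copy the matches into a list first so the document is not
        // modified while the XPath result is still being enumerated.
        var attributes = doc.SelectNodes("//@src | //@href")
                            .Cast<XmlAttribute>()
                            .ToList();

        foreach (var attribute in attributes)
        {
            Uri resolved;
            if (Uri.TryCreate(baseUri, attribute.Value, out resolved))
            {
                attribute.Value = resolved.AbsoluteUri;
            }
        }

        return doc.OuterXml;
    }

    public static void Main()
    {
        var xhtml = "<html><body><a href=\"/a/b.html\">x</a><img src=\"pic.png\" /></body></html>";
        Console.WriteLine(MakeAbsolute(xhtml, new Uri("http://domain.is/dir/page.html")));
    }
}
```

Because unprefixed attributes are not in any XML namespace, the //@src and //@href queries should also match when the elements themselves are in the XHTML namespace.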

Upvotes: 5

Garett

Reputation: 16818

You could use the HTML Agility Pack to accomplish this. You would do something along these (untested) lines:

  • Load the URL
  • Select all links
  • Load each link into a Uri and test whether it is relative; if it is, convert it to absolute
  • Update the link's value with the new URI
  • Save the file

Here are a few examples:

Relative to absolute paths in HTML (asp.net)

http://htmlagilitypack.codeplex.com/wikipage?title=Examples&referringTitle=Home

http://blog.abodit.com/2010/03/a-simple-web-crawler-in-c-using-htmlagilitypack/
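
The steps listed above could look roughly like the following with the HTML Agility Pack. This is an untested sketch that assumes the HtmlAgilityPack NuGet package is referenced; the class name AgilityPackLinkRewriter and the method MakeAbsolute are hypothetical, and it works on an HTML string that has already been downloaded rather than loading or saving anything itself.

```csharp
using System;
using HtmlAgilityPack; // third-party package, not part of the .NET base class library

public static class AgilityPackLinkRewriter
{
    // Hypothetical helper: resolves href and src attributes against baseUri.
    public static string MakeAbsolute(string html, Uri baseUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        foreach (var name in new[] { "href", "src" })
        {
            // Select every element that carries the attribute.
            var nodes = doc.DocumentNode.SelectNodes("//*[@" + name + "]");
            if (nodes == null)
            {
                continue; // SelectNodes returns null when nothing matches
            }

            foreach (var node in nodes)
            {
                var value = node.GetAttributeValue(name, string.Empty);
                Uri resolved;
                if (Uri.TryCreate(baseUri, value, out resolved))
                {
                    node.SetAttributeValue(name, resolved.AbsoluteUri);
                }
            }
        }

        // Serialize the modified document back to a string.
        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling AgilityPackLinkRewriter.MakeAbsolute(html, new Uri("http://domain.is/")) on the downloaded markup should return the page with its links resolved; the result can then be written to disk as the mirror copy.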

Upvotes: 1

Ian Mercer

Reputation: 39277

You should use the HTML Agility Pack to load the HTML, access all the hrefs using it, and then use the Uri class to convert from relative to absolute as necessary.

See for example http://blog.abodit.com/2010/03/a-simple-web-crawler-in-c-using-htmlagilitypack/

Upvotes: 5

Matthew Manela

Reputation: 16752

While this may not be the most robust of solutions it should get the job done.

var host = "http://domain.is";
var someHtml = @"
<a href=""/some/relative"">Relative</a>
<img src=""/some/relative"" />
<a href=""http://domain.is/some/absolute"">Absolute</a>
<img src=""http://domain.is/some/absolute"" />
";


// First strip the host from links that are already absolute,
// then prefix every link with the host.
someHtml = someHtml.Replace("src=\"" + host, "src=\"");
someHtml = someHtml.Replace("href=\"" + host, "href=\"");
someHtml = someHtml.Replace("src=\"", "src=\"" + host);
someHtml = someHtml.Replace("href=\"", "href=\"" + host);

Upvotes: 1

Yogesh

Reputation: 14608

I think url is of type string. Use Uri instead with a base uri pointing to your domain:

Uri baseUri = new Uri("http://domain.is");
Uri myUri = new Uri(baseUri, url);

System.Net.WebClient client = new System.Net.WebClient();
byte[] dl = client.DownloadData(myUri);

Upvotes: 0
