azamsharp

Reputation: 20068

Regular Expression to Extract the Url out of the Anchor Tag

I want to extract the HTTP link (the href value) from anchor tags. Only links to .wmv files should be extracted.

Upvotes: 1

Views: 3223

Answers (3)

Rashmi Pandit

Reputation: 23798

Regex:

<a\s+href\s*=\s*(?<quote>["'])(?<link>[^"']*\.wmv)\k<quote>\s*>(?<name>.*?)\s*</a>

[Note: \s* is used in several places to match the extra whitespace that can occur in the HTML. The dot before wmv is escaped so it matches only a literal ".", and \k<quote> requires the closing quote to be the same character as the opening one.]

Sample C# code:

// Requires: using System.Text.RegularExpressions;

/// <summary>
/// Assigns proper values to wmvLink and name if htmlATag matches the pattern.
/// Matches only .wmv files.
/// </summary>
/// <returns>true if success, false otherwise</returns>
public static bool TryGetHrefDetailsWMV(string htmlATag, out string wmvLink, out string name)
{
    wmvLink = null;
    name = null;

    // Same pattern as above, escaped for a C# string literal.
    string pattern = "<a\\s+href\\s*=\\s*(?<quote>[\"'])(?<link>[^\"']*\\.wmv)\\k<quote>\\s*>(?<name>.*?)\\s*</a>";

    // Match once, so the test and the extraction use the same options.
    Match m = Regex.Match(htmlATag, pattern, RegexOptions.IgnoreCase);
    if (!m.Success)
        return false;

    wmvLink = m.Groups["link"].Value;
    name = m.Groups["name"].Value;
    return true;
}

string wmvLink, name;

MyRegEx.TryGetHrefDetailsWMV("<td><a href='/path/to/file'>Name of File</a></td>",
                out wmvLink, out name); // No match
MyRegEx.TryGetHrefDetailsWMV("<td><a href='/path/to/file.wmv'>Name of File</a></td>",
                out wmvLink, out name); // Match
MyRegEx.TryGetHrefDetailsWMV("<td><a    href='/path/to/file.wmv'   >Name of File</a></td>",
                out wmvLink, out name); // Match

Upvotes: 1

Peter Boughton

Reputation: 112150

I wouldn't do this with regex - I would probably use jQuery:

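// The attribute value is quoted because it starts with a dot.
// Note that .attr() returns the href of the first matching link only.
jQuery('a[href$=".wmv"]').attr('href')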

Compare this to chaos's simplified regex example, which (as stated) doesn't deal with fussy/complex markup, and you'll hopefully understand why a DOM parser is better than a regex for this type of problem.
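
To make that comparison concrete, here is a small JavaScript sketch (runnable in Node.js) that applies a strict pattern, similar in spirit to the one in the C# answer above, to four anchor tags that a browser would all treat as links to .wmv files. The pattern, the sample tags, and the helper name strictLinks are assumptions made for illustration, not code from any of the answers.

```javascript
// Strict pattern: href must be the first and only attribute, and must be quoted.
// Illustrative only; not the exact pattern from any answer in this thread.
const strict = /<a\s+href\s*=\s*(?<quote>["'])(?<link>[^"']*\.wmv)\k<quote>\s*>/i;

const samples = [
  `<a href="/video/a.wmv">A</a>`,                 // plain
  `<a class="dl" href="/video/b.wmv">B</a>`,      // attribute before href
  `<a href="/video/c.wmv" target="_blank">C</a>`, // attribute after href
  `<a href=/video/d.wmv>D</a>`,                   // unquoted value
];

// Hypothetical helper: return the links the strict pattern manages to find.
function strictLinks(tags) {
  return tags
    .map((tag) => strict.exec(tag))
    .filter((m) => m !== null)
    .map((m) => m.groups.link);
}

console.log(strictLinks(samples)); // [ '/video/a.wmv' ]
```

Only the first tag is matched. A DOM-based selector would find all four, because it works on the parsed attributes rather than on the raw text.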

Upvotes: 1

chaos

Reputation: 124277

Because HTML's syntactic rules are so loose, it's pretty difficult to do with any reliability (unless, say, you know for absolute certain that all your tags will use double quotes around their attribute values). Here's some fairly general regex-based code for the purpose:

function extract_urls($html) {
    $urls = array();
    // Strip HTML comments (including multi-line ones) so commented-out links are ignored.
    $html = preg_replace('/<!--.*?-->/s', '', $html);
    // One pattern per quoting style: double-quoted, single-quoted, unquoted.
    $patterns = array(
        '/<a\s+[^>]*href="([^"]+)"[^>]*>/is',
        '/<a\s+[^>]*href=\'([^\']+)\'[^>]*>/is',
        '/<a\s+[^>]*href=([^"\'][^> ]*)[^>]*>/is',
    );
    foreach ($patterns as $pattern) {
        preg_match_all($pattern, $html, $matches);
        foreach ($matches[1] as $url) {
            // Decode &amp; and keep each .wmv URL only once.
            $url = str_replace('&amp;', '&', trim($url));
            if (preg_match('/\.wmv\b/i', $url) && !in_array($url, $urls))
                $urls[] = $url;
        }
    }
    return $urls;
}
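
For comparison, the same idea (strip comments, then accept double-quoted, single-quoted, or unquoted href values) can be sketched in JavaScript with a single alternation instead of three passes. This is a minimal sketch, not a port of the PHP above: the function name extractWmvUrls and the sample markup are hypothetical, and being regex-based it shares the limitations already described.

```javascript
// Hypothetical helper: collect unique .wmv URLs from anchor tags in an HTML string.
function extractWmvUrls(html) {
  // Drop comments so commented-out links are ignored.
  const withoutComments = html.replace(/<!--[\s\S]*?-->/g, '');
  // href must follow whitespace (so data-href is skipped); its value may be
  // double-quoted (group 1), single-quoted (group 2), or unquoted (group 3).
  const anchor =
    /<a\s+(?:[^>]*?\s)?href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))[^>]*>/gi;
  const urls = new Set(); // a Set keeps insertion order and removes duplicates
  for (const m of withoutComments.matchAll(anchor)) {
    const raw = m[1] ?? m[2] ?? m[3];
    const url = raw.trim().replace(/&amp;/g, '&');
    if (/\.wmv\b/i.test(url)) urls.add(url);
  }
  return [...urls];
}

const html = `
  <!-- <a href="old.wmv">old</a> -->
  <a href="/v/one.wmv">1</a>
  <a class='x' href='/v/two.wmv?a=1&amp;b=2'>2</a>
  <a href=/v/three.WMV>3</a>
  <a href="/v/one.wmv">duplicate</a>
  <a href="/v/page.html">not a video</a>`;

console.log(extractWmvUrls(html));
// [ '/v/one.wmv', '/v/two.wmv?a=1&b=2', '/v/three.WMV' ]
```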

Upvotes: 2
