LP13

Reputation: 34109

Regular expression to extract title from web page

I have the code below, which downloads a web page and extracts its title. It works, but it also captures newline and tab characters, so sometimes the string looks like this:

\r\n\tSome WebSite | Official Company Website\r\n

public string GetPageTitle(string url)
{
    string regex = @"(?<=<title.*>)([\s\S]*)(?=</title>)";
    string source = this._client.DownloadString(url);
    return Regex.Match(source, regex, RegexOptions.IgnoreCase).Value;
}

What should the regular expression be so that it ignores \r\n and \t?

Upvotes: 0

Views: 151

Answers (1)

Rion Williams

Reputation: 76557

Consider Non-Regular Expression Options

If you aren't set on solving this with the regular expression alone, note that the Trim() method removes all leading and trailing white-space from a string, including tabs and new lines:

return Regex.Match(source, regex, RegexOptions.IgnoreCase).Value.Trim();

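Likewise, an explicit replacement would work as well. Note that this removes the characters everywhere in the title, not just at the ends, and that Environment.NewLine only matches the newline sequence of the platform the code runs on (\r\n on Windows, \n elsewhere):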

return Regex.Match(source, regex, RegexOptions.IgnoreCase).Value
                                                          .Replace("\t","")
                                                          .Replace(Environment.NewLine,"");
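
If you would rather not depend on which newline sequence the page happens to use, the two ideas above can be combined. The following is only a minimal sketch: the TitleExtractor class and ExtractTitle method are hypothetical names that are not part of the original code, and the pattern is a simplified variant of the one in the question that uses a lazy capture group instead of lookarounds.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original question.
public static class TitleExtractor
{
    public static string ExtractTitle(string html)
    {
        // Lazy capture so the match stops at the first </title>.
        // Singleline lets '.' match newlines inside the title.
        Match match = Regex.Match(
            html,
            @"<title[^>]*>(.*?)</title>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        if (!match.Success)
        {
            return string.Empty;
        }

        // Collapse each run of white-space (\r, \n, \t, spaces)
        // into one space, then trim the ends.
        return Regex.Replace(match.Groups[1].Value, @"\s+", " ").Trim();
    }
}
```

Inside GetPageTitle this could be called as return TitleExtractor.ExtractTitle(source);. Unlike the chained Replace calls, it keeps a single space between words when the title is wrapped across several lines.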

Upvotes: 1
