Regular expression to extract title from web page

Question

I have the code below that calls a website and extracts the title from the page. It works, but it also extracts newline and tab characters, so sometimes the string looks like:

Some WebSite | Official Company Website

public string GetPageTitle(string url)
    {
        string regex = @"(?<=<title.*>)([\s\S]*)(?=</title>)";
        string source = this._client.DownloadString(url);
        return Regex.Match(source, regex, RegexOptions.IgnoreCase).Value;           
    }

What should the regular expression be to ignore the newline and tab characters?

Rion Williams · Accepted Answer

Consider Non-Regular Expression Options

If you aren't set on solving this with a regular expression, note that the Trim() method removes any leading and trailing whitespace from a string, including tabs and newlines:

return Regex.Match(source, regex, RegexOptions.IgnoreCase).Value.Trim();

Likewise, an explicit replacement works, and it also removes tabs and line breaks that appear inside the string:

return Regex.Match(source, regex, RegexOptions.IgnoreCase).Value
                                                          .Replace("\t", "")
                                                          .Replace(Environment.NewLine, "");
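
One caveat with the explicit replacement: Environment.NewLine is "\r\n" on Windows and "\n" on Unix-like systems, so a page that uses a different line ending than the host machine may keep stray characters. A minimal sketch of an alternative, assuming it is acceptable to collapse every run of whitespace into a single space, is shown below. The TitleCleaner class and NormalizeWhitespace method are hypothetical names introduced for illustration and are not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleCleaner
{
    // Hypothetical helper: replaces each run of whitespace (spaces, tabs,
    // \r, \n) with one space, then trims the leading and trailing space.
    public static string NormalizeWhitespace(string rawTitle)
    {
        if (string.IsNullOrEmpty(rawTitle))
        {
            return string.Empty;
        }

        return Regex.Replace(rawTitle, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        // Sample input is an assumption modeled on the title in the question.
        string raw = "\r\n\tSome WebSite | Official Company Website\r\n";
        Console.WriteLine("[" + NormalizeWhitespace(raw) + "]");
        // prints "[Some WebSite | Official Company Website]"
    }
}
```

Inside GetPageTitle, this could be applied to the matched value in place of the chained Replace calls. Unlike removing tabs and newlines outright, it keeps words separated when a title is wrapped across several lines in the page source.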
