Sebastian

Reputation: 4811

Substring from the beginning of a word

The HTTP GET response for a request looks like this:

    <html>
      <head>        <script type="text/javascript">----</script>        <script type="text/javascript">---</script>             <title>Detailed Notes</title>
      </head>
      <body style="background-color: #FFFFFF; border-width: 0px; font-family: sans-serif; font-size: 13; color: #000000">           <p>this is one note&nbsp;</p>  </body>      </html>

I receive this as a string and I have to read the body part out of it.

I tried Html Agility Pack, but parsing fails because of some special characters in the HTML content (I think something in the commented-out script is causing this).

So to read the tag content I am thinking of a Substring operation, for example a Substring starting at the beginning of the <body tag.

How can I take a substring starting from the beginning of a given word in a text?

Upvotes: 2

Views: 182

Answers (2)

Jimi

Reputation: 32248

Using a simple Substring() with IndexOf() + LastIndexOf():

    // Cut everything from the last "</body>" onward, then drop everything before "<body".
    string BodyContent = input.Substring(0, input.LastIndexOf("</body>")).Substring(input.IndexOf("<body"));
    // Skip past the end of the opening <body ...> tag and trim the surrounding whitespace.
    BodyContent = BodyContent.Substring(BodyContent.IndexOf(">") + 1).Trim();

This will return:

    <p>this is one note&nbsp;</p>

To keep the <body> tags as well (7 is the length of "</body>"):

    string FullBody = input.Substring(0, input.LastIndexOf("</body>") + 7).Substring(input.IndexOf("<body")).Trim();

This will return:

    <body style="background-color: #FFFFFF; border-width: 0px; font-family: sans-serif; font-size: 13; color: #000000">           <p>this is one note&nbsp;</p>  </body>
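Note that IndexOf() and LastIndexOf() return -1 when the marker is missing, which makes Substring() throw. A minimal sketch of a guarded variant follows; the class and method names (HtmlSlice, TryExtractBodyContent) are hypothetical, and it assumes the opening tag contains no ">" inside an attribute value:

```csharp
using System;

static class HtmlSlice
{
    // Hypothetical helper: copies the text between the opening <body ...> tag
    // and the last </body> into 'content'. Returns false if a marker is missing.
    public static bool TryExtractBodyContent(string input, out string content)
    {
        content = string.Empty;

        int open = input.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
        if (open < 0) return false;

        // End of the opening tag (assumes no '>' inside an attribute value).
        int openEnd = input.IndexOf('>', open);
        if (openEnd < 0) return false;

        int close = input.LastIndexOf("</body>", StringComparison.OrdinalIgnoreCase);
        if (close <= openEnd) return false; // also covers close == -1

        content = input.Substring(openEnd + 1, close - openEnd - 1).Trim();
        return true;
    }
}
```

Called as HtmlSlice.TryExtractBodyContent(input, out string body), it should leave the same <p>...</p> fragment in body for the response in the question, and return false instead of throwing when there is no body element.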

Upvotes: 2

Ahmed Soliman

Reputation: 1710

The " characters will cause a problem, so you need to replace every " after you download the page source:

    WebClient client = new WebClient(); // create a WebClient instance (System.Net)
    string source = client.DownloadString("url").Replace("\"", ",,"); // get the HTML source and replace every " with a placeholder
    // sample input standing in for the downloaded source
    string code = "<body style=\"background-color: #FFFFFF; border-width: 0px; font-family: sans-serif; font-size: 13; color: #000000\">           <p>this is one note&nbsp;</p>  </body>";
    MatchCollection m0 = Regex.Matches(code, "(<body)(?<body>.*?)(</body>)", RegexOptions.Singleline); // use a regex (System.Text.RegularExpressions) to capture the text between the tags
    foreach (Match m in m0) // loop through the results
    {
        string result = m.Groups["body"].Value.Replace(",,", "\""); // get the captured text (it still includes the tag's attributes) and put the " back
    }
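If only the inner content is wanted, the pattern can be tightened so the attributes of the opening tag are skipped. This is a rough sketch, not part of the code above: the names BodyRegex and InnerBody are made up for illustration, and it assumes the attributes contain no ">" character. It also skips the placeholder step, on the assumption that a " inside a string held at runtime is an ordinary character for the regex engine (escaping is only needed when writing C# string literals):

```csharp
using System;
using System.Text.RegularExpressions;

static class BodyRegex
{
    // [^>]* consumes the attributes of the opening tag;
    // Singleline lets '.' match newlines, IgnoreCase accepts <BODY> too.
    private static readonly Regex BodyPattern = new Regex(
        @"<body[^>]*>(?<content>.*?)</body>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the trimmed text between <body ...> and </body>,
    // or an empty string when no body element is found.
    public static string InnerBody(string html)
    {
        Match m = BodyPattern.Match(html);
        return m.Success ? m.Groups["content"].Value.Trim() : string.Empty;
    }
}
```

BodyRegex.InnerBody(source) could then be called directly on the string returned by DownloadString().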

Upvotes: 1
