Reputation: 1017
I have this code:
private void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
{
    WebRequest request = WebRequest.Create(url);
    request.Method = "GET";
    WebResponse response = request.GetResponse();
    Stream stream = response.GetResponseStream();
    StreamReader reader = new StreamReader(stream);
    string content = reader.ReadToEnd();
    int start = content.IndexOf("profile/");
    int end = content.IndexOf("'");
    string result = content.Substring(start, end - start - 1);
    reader.Close();
    response.Close();
}
For example, I have a long line:
<span class="message-profile-name" ><a href='/profile/daniel'>daniel</a></span>: <span class="message-text">hello everyone<wbr/> <img class='emoticon emoticon-tongue' src='/t.gif'/></span>
I want to build a new string from it: daniel hello everyone
How can I do that? My code doesn't work; it throws this exception:
ArgumentOutOfRangeException: Length cannot be less than zero. Parameter name: length
on the line: string result = content.Substring(start, end - start - 1);
In this case start = 19572 and end = 2110.
Upvotes: 0
Views: 190
Reputation: 116178
Use HtmlAgilityPack instead of trying to parse manually.
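var wc = new WebClient();
wc.DownloadStringCompleted += (s, e) =>
{
    // Parse the downloaded page into a DOM instead of searching the raw text.
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(e.Result);
    // Take the first profile-name span, then its <a> child, then the href value,
    // e.g. "/profile/daniel" for the line in the question.
    var link = doc.DocumentNode
        .SelectSingleNode("//span[@class='message-profile-name']")
        .Element("a")
        .Attributes["href"].Value;
};
wc.DownloadStringAsync(new Uri("http://chatroll.com/rotternet"));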
Upvotes: 1
Reputation: 3960
It seems the string you want is always enclosed in an <a> element whose href has the form /profile/xxx. Once you have the content as text, this is simple with a regex, and the regex still works when the content contains multiple <a href=...> elements:
Match match = Regex.Match(content, @"(?<=<a\s*?href='/profile/\w*?'>\s*?)\w*?(?=\s*?<\s*?/a\s*?>)");
string result = match.Value;
This matches the link text of such an anchor, and .Value returns the element's value, in this case daniel. You can also precede the regex with (?i) to make it case-insensitive, so that it also matches variations written with different letter casing.
UPDATE:
To get the content from any other kind of element, just replace the parts of the pattern that match the surrounding element in (?<=<a\s*?href='/profile/\w*?'>\s*?)\w*?(?=\s*?<\s*?/a\s*?>). In your case, "message-text">hello everyone<wbr/> would be (?i)(?<="message-text"\s*?>\s*?).*?(?=\s*?<\s*?/wbr\s*?>), which gets hello everyone from that markup and its variations. The .*? means match anything (including spaces and punctuation), but as little as possible. Note that I changed the ending tag from the one in your reply; if it should be <wbr/> and not </wbr>, it's a tiny change you can make to get it working.
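Putting the two pieces together, here is a minimal sketch that pulls the name and the message out in one pass. It assumes the markup looks like the sample line in the question, with the message ending at the first tag inside the message-text span (the <wbr/> there). Note that it uses named capture groups rather than the lookarounds above, and that ChatLineParser and Extract are hypothetical names chosen only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ChatLineParser
{
    // <name> is the link text of the /profile/ anchor; <text> is everything
    // after the opening message-text span up to the next '<'.
    private static readonly Regex LinePattern = new Regex(
        @"<a\s+href='/profile/\w+'>\s*(?<name>\w+)\s*</a>.*?class=""message-text""\s*>\s*(?<text>[^<]*)",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns "name message" for the first chat line found,
    // or an empty string when the pattern does not match.
    public static string Extract(string html)
    {
        Match match = LinePattern.Match(html);
        if (!match.Success)
        {
            return string.Empty;
        }
        return match.Groups["name"].Value + " " + match.Groups["text"].Value.Trim();
    }

    public static void Main()
    {
        // The sample line from the question.
        string line =
            "<span class=\"message-profile-name\" ><a href='/profile/daniel'>daniel</a></span>: " +
            "<span class=\"message-text\">hello everyone<wbr/> <img class='emoticon emoticon-tongue' src='/t.gif'/></span>";
        Console.WriteLine(Extract(line)); // prints: daniel hello everyone
    }
}
```

Checking match.Success before reading the groups avoids the kind of out-of-range failure the manual IndexOf/Substring code runs into when the markers are not where it expects them.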
Upvotes: 0
Reputation: 62265
Use an appropriate tool to split the raw character data into pieces that are meaningful to you.
You can use HtmlAgilityPack to parse the string and get back a tree of meaningful nodes.
You can then iterate over those nodes and aggregate them into the result string according to your own logic.
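As a rough sketch of those steps, assuming the HtmlAgilityPack NuGet package is referenced and the markup matches the sample line in the question: ChatMessageExtractor and Extract are hypothetical names, and the two XPath expressions are assumptions based only on the class names visible in that sample.

```csharp
using System;
using HtmlAgilityPack;

public static class ChatMessageExtractor
{
    // Builds "name message" from the first profile-name / message-text pair.
    // Returns an empty string if either node is missing.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumed selectors, taken from the class names in the sample line.
        HtmlNode nameNode = doc.DocumentNode
            .SelectSingleNode("//span[@class='message-profile-name']/a");
        HtmlNode textNode = doc.DocumentNode
            .SelectSingleNode("//span[@class='message-text']");
        if (nameNode == null || textNode == null)
        {
            return string.Empty;
        }

        // InnerText drops the tags (<wbr/>, <img/>) and keeps only the text.
        string name = HtmlEntity.DeEntitize(nameNode.InnerText).Trim();
        string text = HtmlEntity.DeEntitize(textNode.InnerText).Trim();
        return name + " " + text;
    }
}
```

For the sample line, Extract should yield the name followed by the message text. For a page with many messages, SelectNodes with the same expressions returns collections you can loop over instead.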
Upvotes: 0