Reputation: 1283
I have a problem parsing the following HTML:
<tr>
<td><p><b>
<span>Company:</span></b>
<span>Test</span>
</p></td>
</tr>
<tr>
<td><p><b>
<span>Company:</span></b>
<span>Test 2</span>
</p></td>
</tr>
My code is:
HtmlDocument doc = new HtmlDocument();
doc.Load(@"email.txt");
Console.WriteLine(doc.DocumentNode.InnerText);
I get the following output: Company:TestCompany:Test 2
but I want:
Company: Test
Company: Test 2
So, the problem is that line breaks aren't being parsed.
P.S.: Setting doc.OptionWriteEmptyNodes = true; makes no difference.
Update: I mean that whatever HTML is there, line endings are not being parsed, even if there are <br /> tags and so on.
Upvotes: 0
Views: 379
Reputation: 460158
There is no line break in your HTML. Even in your browser you wouldn't see one; both labels would be displayed side by side. What is your actual requirement? DocumentNode.InnerText just returns the values of all text nodes concatenated side by side.
If you don't want that, you have to select what you want (e.g. all spans) and then use String.Join(Environment.NewLine, allInnerTexts):
var allInnerTexts = doc.DocumentNode.SelectNodes("//text()")
.Select(n => n.InnerText.Trim())
.Where(text => !String.IsNullOrEmpty(text));
Console.WriteLine(String.Join(Environment.NewLine, allInnerTexts));
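Note that the snippet above puts every text node on its own line, so "Company:" and "Test" end up on separate lines. If the goal is one line per table row, as in the question, a variation is to select each row first and join only the text nodes inside it. This is a minimal sketch, assuming each record lives in its own <tr> and that a single space is the wanted separator; the Program class and the per-row grouping are my assumptions, not something from the question.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.Load(@"email.txt");

        // SelectNodes returns null when nothing matches, so guard against that.
        var rows = doc.DocumentNode.SelectNodes("//tr");
        if (rows == null) return;

        foreach (var row in rows)
        {
            // Collect the non-empty text nodes that belong to this row only.
            var parts = row.SelectNodes(".//text()")
                ?.Select(n => n.InnerText.Trim())
                .Where(text => !String.IsNullOrEmpty(text))
                ?? Enumerable.Empty<string>();

            // One output line per row, with the row's texts separated by a space.
            Console.WriteLine(String.Join(" ", parts));
        }
    }
}
```

If the real input separates records with something other than <tr> (for example <br /> or <p>), the "//tr" XPath would need to change accordingly.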
Upvotes: 1