Raman Sinclair

Reputation: 1283

HtmlAgilityPack: can't parse line breaks (ignores line endings)

I have a problem parsing the following HTML:

<tr>
<td><p><b>
<span>Company:</span></b>
<span>Test</span>
</p></td>
</tr>

<tr>
<td><p><b>
<span>Company:</span></b>
<span>Test 2</span>
</p></td>
</tr>

My code is:

HtmlDocument doc = new HtmlDocument();
doc.Load(@"email.txt");
Console.WriteLine(doc.DocumentNode.InnerText);

The output is Company:TestCompany:Test 2, but I want:

Company: Test
Company: Test 2

So, the problem is that line breaks aren't being parsed.

P.S.: Setting doc.OptionWriteEmptyNodes = true makes no difference.

Update: To be clear, whatever HTML is in the file, the line breaks are not preserved, even when it contains <br /> tags.

Upvotes: 0

Views: 379

Answers (1)

Tim Schmelter

Reputation: 460158

There is no line break in your HTML. Even in a browser you wouldn't see one; both labels would be displayed side by side. What is your actual requirement? DocumentNode.InnerText just returns the text of every text node, concatenated.

If you don't want that, you have to select the nodes you want (for example, all spans) and then join their text with String.Join(Environment.NewLine, allInnerTexts):

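// Requires: using System.Linq;
// Select every text node, trim it, and drop the whitespace-only ones.
var allInnerTexts = doc.DocumentNode.SelectNodes("//text()")
   .Select(n => n.InnerText.Trim())
   .Where(text => !String.IsNullOrEmpty(text));
// Print each remaining piece of text on its own line.
Console.WriteLine(String.Join(Environment.NewLine, allInnerTexts));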
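Joining every text node puts "Company:" and "Test" on separate lines. If the goal is one line per table row, as in the question, a possible variation is to group the text by <tr> first. The following is only a sketch under assumptions: the class RowTextExtractor and the method ExtractRows are hypothetical names introduced for illustration, and it assumes each row's label and value live in <span> elements, as in the sample HTML. It is meant to yield "Company: Test" and "Company: Test 2".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class RowTextExtractor
{
    // Hypothetical helper (not part of HtmlAgilityPack): returns one string
    // per <tr>, built by joining the text of the <span> elements inside it.
    public static List<string> ExtractRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var lines = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var rows = doc.DocumentNode.SelectNodes("//tr");
        if (rows == null) return lines;

        foreach (var row in rows)
        {
            // ".//span" restricts the search to descendants of the current row.
            var spans = row.SelectNodes(".//span");
            if (spans == null) continue;

            var parts = spans
                .Select(s => HtmlEntity.DeEntitize(s.InnerText).Trim())
                .Where(t => t.Length > 0)
                .ToList();

            if (parts.Count > 0)
                lines.Add(string.Join(" ", parts));
        }
        return lines;
    }

    public static void Main()
    {
        // Inline copy of the sample from the question, so the sketch is self-contained.
        string html =
            "<tr><td><p><b><span>Company:</span></b><span>Test</span></p></td></tr>" +
            "<tr><td><p><b><span>Company:</span></b><span>Test 2</span></p></td></tr>";

        Console.WriteLine(string.Join(Environment.NewLine, ExtractRows(html)));
    }
}
```

To run it against the file from the question, the inline string could be replaced with System.IO.File.ReadAllText(@"email.txt"). Grouping by <tr> is a design choice that fits this sample; HTML that separates lines with <br /> tags would need a different grouping rule.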

Upvotes: 1
