Reputation: 15779
I need to detect sentence boundaries in HTML. There is lots of sentence boundary detection software out there (java.text.BreakIterator is the one I'm using), but all of it assumes plain text. HTML is richer than that, and includes some clues as to where sentences break.
For example, <p>, <ul>/<li>, <td>, and other tags mark sentence boundaries, or at least indicate that a sentence probably doesn't extend across them. <b>, <i>, <em>, <span>, <a>, and a few other tags can appear inside a sentence.
Is anyone aware of any software that takes advantage of HTML markup, in addition to the normal NLP stuff, in determining sentence boundaries?
Upvotes: 3
Views: 629
Reputation: 15779
The solution I implemented was: 1. split the document into separate blocks on all HTML tags except the inline tags (<i>, <b>, <span>, etc.); 2. strip the inline tags from each block; 3. look for sentences within each block using traditional methods.
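A minimal sketch of those three steps, for illustration only and not the code I actually ran. The class name HtmlSentenceSplitter, the method splitSentences, and the exact list of inline tags are assumptions made for this example. It uses regular expressions rather than a real HTML parser, so comments, scripts, and malformed markup are not handled.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

class HtmlSentenceSplitter {
    // Assumed list of inline tags; adjust it for your documents.
    private static final String INLINE = "b|i|em|strong|span|a|u|code|small|sub|sup";

    // Any tag that is NOT an opening or closing inline tag (treated as a block boundary).
    private static final Pattern BLOCK_TAG = Pattern.compile(
            "<(?!/?(?:" + INLINE + ")\\b)[^>]+>", Pattern.CASE_INSENSITIVE);

    // Opening or closing inline tags, with optional attributes.
    private static final Pattern INLINE_TAG = Pattern.compile(
            "</?(?:" + INLINE + ")\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    static List<String> splitSentences(String html) {
        List<String> sentences = new ArrayList<>();
        // Step 1: split the document into blocks on every non-inline tag.
        for (String block : BLOCK_TAG.split(html)) {
            // Step 2: strip the inline tags from the block.
            String text = INLINE_TAG.matcher(block).replaceAll("").trim();
            if (text.isEmpty()) {
                continue;
            }
            // Step 3: ordinary plain-text sentence detection within the block.
            BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
            it.setText(text);
            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
                String sentence = text.substring(start, end).trim();
                if (!sentence.isEmpty()) {
                    sentences.add(sentence);
                }
            }
        }
        return sentences;
    }
}
```

For example, calling HtmlSentenceSplitter.splitSentences on "<p>Hello <b>world</b>. How are you?</p><ul><li>First item</li><li>Second item</li></ul>" should give four sentences: "Hello world.", "How are you?", "First item", and "Second item". The two list items are kept apart even though neither ends with punctuation, which is the extra information the markup provides.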
Upvotes: 1