Reputation: 36726
I want to extract text blocks from an HTML page, and I'm using boilerpipe to do this. It works fine when a page contains a single text, but some pages, like blogs, have multiple texts on the same page.
I want to extract all of the texts, identifying each one as a separate text rather than getting them merged into one.
Is there a library that can do this?
EDIT: I'm using Jsoup to parse HTML, but I don't want parsing; I want information extraction, like boilerpipe does on these pages. I want to test other, similar tools.
Upvotes: 1
Views: 1346
Reputation: 11257
The closest Java library I'm aware of is the RoadRunner project: http://www.dia.uniroma3.it/db/roadRunner/ It's a system that constructs a special kind of regular expression over the tokens of an HTML document and can, in many cases, detect repeated patterns when given several documents based on the same template. For blogs, this might be achieved by, for example, looking at paginated pages. You would probably still have to pick out which repeated patterns are the ones of interest for each site.
For blogs, I would probably look for a feed link in the header of the blog and use a feed-parsing library to pull out the permalinks for each article. Then crawl those and run boilerpipe on each (this step is necessary because lots of blogs don't include the full text in the RSS/Atom feed). Lots of blogs don't include the full text on the main page either, so I'd focus on identifying the permalinks and go from there.
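A minimal sketch of that workflow, assuming the Rome library for feed parsing and boilerpipe for extraction (both added as dependencies); the feed URL here is a placeholder you'd discover from the page's `<link rel="alternate">` tag:

```java
import java.net.URL;
import com.rometools.rome.feed.synd.SyndEntry;
import com.rometools.rome.feed.synd.SyndFeed;
import com.rometools.rome.io.SyndFeedInput;
import com.rometools.rome.io.XmlReader;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BlogExtractor {
    public static void main(String[] args) throws Exception {
        // Hypothetical feed URL -- in practice, find it in the blog's HTML head
        URL feedUrl = new URL("http://example.com/feed");
        SyndFeed feed = new SyndFeedInput().build(new XmlReader(feedUrl));

        // Each feed entry points at one article's permalink
        for (SyndEntry entry : feed.getEntries()) {
            // Crawl the permalink and let boilerpipe strip the boilerplate
            String text = ArticleExtractor.INSTANCE.getText(new URL(entry.getLink()));
            System.out.println("=== " + entry.getTitle() + " ===");
            System.out.println(text);
        }
    }
}
```

This gives you one extracted text per article, which sidesteps the multiple-texts-per-page problem entirely.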
Upvotes: 1
Reputation: 17923
Jsoup is a very widely used parser for this type of task. Please check it out.
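For the multiple-texts case, Jsoup's CSS selectors let you grab each post as a separate element; a short sketch, where the URL and the `article, div.post` selector are assumptions you'd adapt after inspecting the target page's markup:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupBlocks {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://example.com/blog").get();
        // Hypothetical selector -- match whatever container each post uses on your page
        for (Element post : doc.select("article, div.post")) {
            System.out.println("--- block ---");
            System.out.println(post.text()); // plain text of this block only
        }
    }
}
```

Unlike boilerpipe, this requires knowing the page structure, but it naturally keeps each text separate.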
Upvotes: 3
Reputation: 26142
Well, personally I like using Doj together with HtmlUnit. Basically, Doj brings something similar to CSS selectors to Java.
Example (from official page):
Doj spanDoj = Doj.on(page).get("#updates tr", 1).get("td", 2).get("span.item");
You can see more complex examples on the linked page (scroll down).
Upvotes: 2