ansgri

Reputation: 2156

How do you grab a text from webpage (Java)?

I'm planning to write a simple J2SE application to aggregate information from multiple web sources.

The most difficult part, I think, is extraction of meaningful information from web pages, if it isn't available as RSS or Atom feeds. For example, I might want to extract a list of questions from stackoverflow, but I absolutely don't need that huge tag cloud or navbar.

What technique/library would you advise?


Upvotes: 3

Views: 10645

Answers (10)

VNVN

Reputation: 511

Check this out http://www.alchemyapi.com/api/demo.html

They return pretty good results and have an SDK for most platforms. It's not just text extraction; they also do keyword analysis, etc.

Upvotes: 0

Eric DeLabar

Reputation: 309

Have you considered taking advantage of RSS/Atom feeds? Why scrape the content when it's usually available for you in a consumable format? There are libraries available for consuming RSS in just about any language you can think of, and it'll be a lot less dependent on the markup of the page than attempting to scrape the content.
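Consuming a feed really is simpler than scraping: since RSS is well-formed XML, the JDK's own DOM parser is enough. A minimal sketch (no third-party library; the feed string here is a made-up example):

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RssTitles {
    // Extract the title of every <item> in an RSS 2.0 document.
    public static List<String> titles(String rssXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(rssXml.getBytes("UTF-8")));
        List<String> result = new ArrayList<String>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            // Only look at <title> elements inside this <item>.
            NodeList titles = ((Element) items.item(i))
                    .getElementsByTagName("title");
            if (titles.getLength() > 0) {
                result.add(titles.item(0).getTextContent());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String feed = "<rss version=\"2.0\"><channel>"
                + "<title>Example</title>"
                + "<item><title>First post</title></item>"
                + "<item><title>Second post</title></item>"
                + "</channel></rss>";
        System.out.println(titles(feed)); // [First post, Second post]
    }
}
```
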

If you absolutely MUST scrape content, look for microformats in the markup, most blogs (especially WordPress based blogs) have this by default. There are also libraries and parsers available for locating and extracting microformats from webpages.

Finally, aggregation services/applications such as Yahoo Pipes may be able to do this work for you without reinventing the wheel.

Upvotes: 0

Maxim

Reputation: 543

If your "web sources" are regular websites using HTML (as opposed to a structured XML format like RSS), I would suggest taking a look at HTMLUnit.

This library, while targeted at testing, is really a general-purpose "Java browser". It is built on Apache HttpClient, the NekoHTML parser, and Rhino for JavaScript support. It provides a really nice API to the web page and lets you traverse the website easily.

Upvotes: 0

Alexandre Victoor

Reputation: 3104

You can use NekoHTML to parse your HTML document. You will get a DOM document, and you can use XPath to retrieve the data you need.
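Once NekoHTML has turned messy HTML into a DOM, the XPath step is plain JDK code. A sketch of that step, using the JDK's built-in parser on already well-formed markup to stand in for NekoHTML's output (the page string and expression are made-up examples):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathExtract {
    // Evaluate an XPath expression against a well-formed document and
    // return the string value of the first matching node.
    public static String extract(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><h1 class=\"q\">How do I parse HTML?</h1></body></html>";
        // Grab the text of the <h1> with class "q".
        System.out.println(extract(page, "//h1[@class='q']")); // How do I parse HTML?
    }
}
```
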

Upvotes: 0

Joe Liversedge

Reputation: 4164

If you want to take advantage of any structural or semantic markup, you might want to explore converting the HTML to XML and using XQuery to extract the information in a standard form. Take a look at this IBM developerWorks article for some typical code, excerpted below (they're outputting HTML, which is, of course, not required):

<table>
{
  for $d in //td[contains(a/small/text(), "New York, NY")]
  for $row in $d/parent::tr/parent::table/tr
  where contains($d/a/small/text()[1], "New York")
  return <tr><td>{data($row/td[1])}</td> 
           <td>{data($row/td[2])}</td>              
           <td>{$row/td[3]//img}</td> </tr>
}
</table>

Upvotes: 2

Vhaerun

Reputation: 13266

If you want to do it the old-fashioned way, you need to connect a socket to the web server's port and then send the following data:

GET /file.html HTTP/1.0
Host: site.com
<ENTER>
<ENTER>

Then use Socket#getInputStream, read the data using a BufferedReader, and parse it using whatever you like.
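The steps above can be sketched with nothing but java.net and java.io. The request builder is the key part (each `<ENTER>` above is a CRLF, and the blank line terminates the headers); the fetch method is a bare-bones illustration with no error handling or HTTPS:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;

public class RawHttpGet {
    // Build a minimal HTTP/1.0 request; the trailing blank line
    // (the two <ENTER>s above) ends the header section.
    public static String buildRequest(String path, String host) {
        return "GET " + path + " HTTP/1.0\r\n"
             + "Host: " + host + "\r\n"
             + "\r\n";
    }

    // Open a socket to port 80, send the request, and return the raw
    // response (status line, headers, and body) as one string.
    public static String fetch(String host, String path) throws Exception {
        Socket socket = new Socket(host, 80);
        try {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(path, host).getBytes("US-ASCII"));
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream()));
            StringBuilder response = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                response.append(line).append('\n');
            }
            return response.toString();
        } finally {
            socket.close();
        }
    }
}
```
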

Upvotes: 0

jatanp

Reputation: 4112

You may use HTMLParser (http://htmlparser.sourceforge.net/) in combination with URL#getInputStream() to parse the content of HTML pages hosted on the Internet.

Upvotes: 3

James Law

Reputation: 282

You could look at how HttpUnit does it. It uses a couple of decent HTML parsers, one of which is NekoHTML. As for fetching the data, you can use what's built into the JDK (HttpURLConnection), or use Apache's HttpClient:

http://hc.apache.org/httpclient-3.x/

Upvotes: 2

graham r

Reputation:

You seem to want to screen-scrape. You would probably want to write a framework with an adapter/plugin per source site (since each site's format will differ) that parses the HTML source and extracts the text. You would probably use Java's I/O API to connect to the URL and stream the data via InputStreams.
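A minimal sketch of that adapter-per-site idea: one interface, one implementation per source, and an aggregator that just loops over pages. All names here (SiteAdapter, HeadlineAdapter) are made up for illustration, and the toy adapter extracts `<h2>` headings with a regex:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AggregatorSketch {

    // Each source site gets its own adapter, because each site's
    // markup differs.
    interface SiteAdapter {
        List<String> extract(String rawHtml);
    }

    // Toy adapter for a hypothetical site whose items live in <h2> tags.
    static class HeadlineAdapter implements SiteAdapter {
        public List<String> extract(String rawHtml) {
            List<String> out = new ArrayList<String>();
            Matcher m = Pattern.compile("<h2>(.*?)</h2>").matcher(rawHtml);
            while (m.find()) {
                out.add(m.group(1));
            }
            return out;
        }
    }

    // The aggregator is adapter-agnostic: it just delegates parsing.
    static List<String> aggregate(SiteAdapter adapter, String... pages) {
        List<String> all = new ArrayList<String>();
        for (String page : pages) {
            all.addAll(adapter.extract(page));
        }
        return all;
    }

    public static void main(String[] args) {
        String page = "<html><h2>First</h2><p>...</p><h2>Second</h2></html>";
        System.out.println(aggregate(new HeadlineAdapter(), page)); // [First, Second]
    }
}
```

In a real application the aggregator would also own the fetching (URL/InputStream) so adapters only ever see a String of HTML.
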

Upvotes: 0

IcePhoenix

Reputation:

In short, you can either parse the whole page and pick out the things you need (for speed, I recommend looking at SAXParser), or run the HTML through a regexp that trims off all of the tags. You can also convert it all into a DOM, but that's going to be expensive, especially if you're shooting for decent throughput.
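The regexp variant fits in a few lines of plain JDK code. This is the crude form of the idea: delete anything that looks like a tag, then collapse the leftover whitespace. It is fast but easily fooled by scripts, comments, and attributes containing `>`:

```java
public class HtmlStripper {
    // Remove tags, then normalize whitespace.
    public static String strip(String html) {
        return html.replaceAll("<[^>]*>", " ")  // drop anything tag-shaped
                   .replaceAll("\\s+", " ")      // collapse runs of whitespace
                   .trim();
    }

    public static void main(String[] args) {
        String page = "<html><body><h1>Title</h1><p>Some  text.</p></body></html>";
        System.out.println(strip(page)); // Title Some text.
    }
}
```
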

Upvotes: 0
