samwise

Reputation: 299

Parsing HTML webpages in Java

I need to parse/read a lot of HTML webpages (100+) for specific content (a few lines of text that are almost the same on each page).

I have tried Scanner objects with regular expressions, and jsoup with its HTML parser.

Both approaches are slow, and with jsoup I get the following error on multiple computers with different connections: java.net.SocketTimeoutException: Read timed out
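For reference, here is roughly what my jsoup version looks like (the URL list and the CSS selector below are placeholders for the real ones):

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.List;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class PageScanner {
        public static void main(String[] args) {
            // Placeholder URLs; the real list has 100+ pages
            List<String> urls = Arrays.asList(
                    "http://example.com/page1",
                    "http://example.com/page2");

            for (String url : urls) {
                try {
                    Document doc = Jsoup.connect(url).get();
                    // Placeholder selector; I'm really after a few specific lines of text
                    String text = doc.select("div.content").text();
                    System.out.println(url + " -> " + text);
                } catch (IOException e) {
                    // This is where the SocketTimeoutException shows up
                    System.err.println("Failed on " + url + ": " + e.getMessage());
                }
            }
        }
    }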

Is there anything better?

EDIT:

Now that I've gotten jsoup to work, I think a better question is how do I speed it up?

Upvotes: 4

Views: 1820

Answers (3)

billygoat

Reputation: 22004

I would suggest Nutch, an open-source web-search solution that includes support for HTML parsing. It's a very mature library; it uses Lucene under the hood, and I've found it to be a very reliable crawler.

Upvotes: 2

JustBeingHelpful

Reputation: 18990

A great skill to learn would be XPath; it would be perfect for this job. I just started learning it myself for automation testing. If you have questions, shoot me a message. I'd be glad to help you out, even though I'm not an expert.

Here's a nice link since you are interested in Java: http://www.ibm.com/developerworks/library/x-javaxpathapi/index.html

XPath is also a good thing to know even when you're not using Java, which is why I would choose that route.
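To give a rough idea, here's a minimal sketch using the standard javax.xml.xpath API. It assumes the page is well-formed XML/XHTML (plain HTML usually has to be cleaned up first, e.g. with a tidy-style library), and the file name and expression are just made-up examples:

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;

    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class XPathExample {
        public static void main(String[] args) throws Exception {
            // Assumes well-formed XML/XHTML; real-world HTML usually needs
            // cleaning up before the standard DOM parser will accept it.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse("page.xhtml"); // made-up file name

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Made-up expression: every <p> under the element with id="content"
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//*[@id='content']//p", doc, XPathConstants.NODESET);

            for (int i = 0; i < nodes.getLength(); i++) {
                System.out.println(nodes.item(i).getTextContent());
            }
        }
    }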

Upvotes: 0

Ed Staub

Reputation: 15700

Did you try lengthening the timeout on JSoup? It's only 3 seconds by default, I believe. See e.g. this.
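Something like this, assuming you're fetching with Jsoup.connect() (the URL and the 10-second value are just examples):

    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class TimeoutExample {
        public static void main(String[] args) throws IOException {
            Document doc = Jsoup.connect("http://example.com/page") // example URL
                    .timeout(10 * 1000) // milliseconds; 0 disables the timeout
                    .get();
            System.out.println(doc.title());
        }
    }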

Upvotes: 5
