Reputation: 299
I need to parse/read a lot of HTML webpages (100+) for specific content (a few lines of text that are almost the same on every page).
I've tried Scanner objects with regular expressions, and jsoup with its HTML parser.
Both approaches are slow, and with jsoup I get the following error: java.net.SocketTimeoutException: Read timed out (this happens on multiple computers with different connections).
Is there anything better?
EDIT:
Now that I've gotten jsoup to work, I think the better question is: how do I speed it up?
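To make the "speed it up" part concrete, here's a minimal sketch of the kind of loop I mean, with jsoup's timeout raised (to avoid the SocketTimeoutException) and the fetches done in parallel. The URL list, thread count, and CSS selector below are just placeholders.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PageScraper {
    public static void main(String[] args) throws Exception {
        // Placeholder URLs -- in reality this would be the list of 100+ pages.
        List<String> urls = Arrays.asList(
                "http://example.com/page1",
                "http://example.com/page2");

        // Fetch several pages at once instead of one after another.
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<String>> results = new ArrayList<>();

        for (String url : urls) {
            results.add(pool.submit(() -> {
                // Raise the connect/read timeout so slow pages don't throw
                // java.net.SocketTimeoutException: Read timed out.
                Document doc = Jsoup.connect(url)
                        .timeout(30_000)  // milliseconds
                        .get();
                // Placeholder selector -- replace with whatever matches the text you need.
                return doc.select("div.content").text();
            }));
        }

        for (Future<String> result : results) {
            System.out.println(result.get());
        }
        pool.shutdown();
    }
}
```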
Upvotes: 4
Views: 1820
Reputation: 22004
I would suggest Nutch, an open-source web-search solution that includes support for HTML parsing. It's a very mature project: it uses Lucene under the hood, and I've found it to be a very reliable crawler.
Upvotes: 2
Reputation: 18990
A great skill to learn would be XPath. It would be perfect for that job! I've just started learning it myself for automation testing. If you have questions, shoot me a message; I'd be glad to help you out, even though I'm not an expert.
Here's a nice link, since you're interested in Java: http://www.ibm.com/developerworks/library/x-javaxpathapi/index.html
XPath is also a good thing to know outside of Java, which is why I would choose that route. A small example is below.
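To give you an idea, here's a tiny sketch using Java's built-in XPath API (javax.xml.xpath). The file name and the expression are just placeholders, and it assumes the page is well-formed X(HT)ML; messy real-world HTML usually needs to be cleaned up first.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathExample {
    public static void main(String[] args) throws Exception {
        // Placeholder file -- must be well-formed XML/XHTML for this parser.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("page.xhtml");

        XPath xpath = XPathFactory.newInstance().newXPath();

        // Placeholder expression -- text of every <p> inside the element with id="content".
        NodeList nodes = (NodeList) xpath.evaluate(
                "//*[@id='content']//p/text()", doc, XPathConstants.NODESET);

        for (int i = 0; i < nodes.getLength(); i++) {
            System.out.println(nodes.item(i).getNodeValue());
        }
    }
}
```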
Upvotes: 0