giri

Reputation: 27199

How to extract data from a website using Java?

I am familiar with the Java programming language. I would like to extract data from a website and store it in a database running on my machine. Is that possible in Java? If so, which API should I use? For example, a website lists a number of schools. How can I extract that data and store it in my database using Java?

Upvotes: 6

Views: 42570

Answers (4)

vietspider

Reputation: 11

You can use VietSpider XML from

http://sourceforge.net/projects/binhgiang/files/

Download VietSpider3_16_XML_Windows.zip or VietSpider3_16_XML_Linux.zip

VietSpider Web Data Extractor crawls data from websites (a data scraper), formats it as standard XML (Text, CDATA), and then stores it in a relational database. The product supports various RDBMSs such as Oracle, MySQL, SQL Server, H2, HSQL, Apache Derby, and Postgres. The VietSpider crawler supports sessions (login, query by form input), multi-downloading, JavaScript handling, and proxies (including multi-proxy, by automatically scanning proxies from websites).

Upvotes: 1

Alex Dean

Reputation: 16065

You definitely need a good parser like NekoHTML.

Here's an example of using NekoHTML, albeit using Groovy (a Java-based scripting language) rather than Java itself:

http://www.keplarllp.com/blog/2010/01/better-competitive-intelligence-through-scraping-with-groovy

Upvotes: 1

almathie

Reputation: 731

Depending on what you are really trying to do, you can use many different solutions.

If you just want to fetch the HTML code of a web page, then URL.getContent() may be your solution. Here is a short tutorial:

http://www.javacoffeebreak.com/books/extracts/javanotesv3/c10/s4.html
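A minimal sketch of fetching a page with only the JDK might look like the following. It uses URL.openStream() rather than URL.getContent(), since the stream gives direct access to the page bytes. The class name PageFetcher, the example.com address, and the UTF-8 charset are assumptions for illustration only.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class PageFetcher {

    // Reads every line from the stream and returns the text as one string.
    // UTF-8 is an assumption; a real page may declare a different charset.
    static String readAll(InputStream in) throws IOException {
        StringBuilder html = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }

    // Opens the given address and returns the raw HTML of the page.
    static String fetch(String address) throws IOException {
        URL url = URI.create(address).toURL();
        try (InputStream in = url.openStream()) {
            return readAll(in);
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder address; replace with the page that lists the schools.
        String html = fetch("http://example.com/");
        System.out.println(html.length() + " characters downloaded");
    }
}
```

This only downloads the markup as text. Extracting individual values from it still needs an HTML parser.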

EDIT: I didn't realize the asker was looking for a way to parse the HTML code. Some tools have been suggested in the other answers. Sorry for that.

Upvotes: 0

lucas

Reputation: 6971

What you're referring to is commonly called 'screen scraping'. There are a variety of ways to do this in Java, but I prefer HtmlUnit. While it was designed as a way to test web functionality, you can use it to request a remote web page and parse it.

I would recommend using a good error-tolerant HTML parser like TagSoup to extract exactly what you're looking for from the HTML.
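As a rough sketch of the extraction step, the example below assumes a parser such as TagSoup has already turned the page into well-formed markup, and then uses the JDK's own XPath support to pull out values. It does not call TagSoup or HtmlUnit. The class name SchoolExtractor, the `school` CSS class, and the sample markup are all hypothetical. Real parser output may place elements in a namespace, in which case the XPath expression would need adjusting.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class SchoolExtractor {

    // Returns the trimmed text of every <li class="school"> element.
    // Assumes the input is already well-formed XML without namespaces.
    static List<String> extractNames(String xhtml) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                "//li[@class='school']",
                new InputSource(new StringReader(xhtml)),
                XPathConstants.NODESET);

        List<String> names = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            names.add(nodes.item(i).getTextContent().trim());
        }
        return names;
    }

    public static void main(String[] args) throws XPathExpressionException {
        // Made-up sample page standing in for cleaned-up parser output.
        String page = "<html><body><ul>"
                + "<li class=\"school\">Lincoln High</li>"
                + "<li class=\"school\"> Oak Elementary </li>"
                + "<li class=\"other\">About us</li>"
                + "</ul></body></html>";
        for (String name : extractNames(page)) {
            System.out.println(name);
        }
    }
}
```

The resulting list of names could then be written to the database, for instance with JDBC and a driver for whichever database is in use.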

Upvotes: 7
