Eric

Reputation: 11

Data Extraction?

I am looking for ways to extract data from various websites. I know there are programs you can buy, but since I am trying to learn, I want to do it myself. Does anyone have suggestions for a general structure, and if so, what language would you write it in? My first thought was Java, but I am more than willing and grateful to hear anyone else's opinion.

Upvotes: 1

Views: 1286

Answers (2)

Aravind Yarram

Reputation: 80194

Look at Hadoop (grids) and Solr (crawlers and indexers). They support heavy processing and efficient indexing (for fast searching), respectively.

Upvotes: 0

Freddy

Reputation: 442

What kind of data are you trying to extract, and from which websites? A little more detail on your idea or project would be helpful.

I recently needed to try out a few HTML parsers to get some data into a more consolidated format.

I tried JTidy (http://jtidy.sourceforge.net/) and looked into Web-Harvest (http://web-harvest.sourceforge.net/). JTidy wouldn't quite do what I wanted and Web-Harvest was overkill.

I ultimately settled on Java + htmlparser (http://htmlparser.sourceforge.net/).

It took very little development time to get what I needed, and htmlparser lets you build 'filters' that search for specific things in the DOM.
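
To make the 'filter' idea concrete, here is a minimal sketch that uses only the JDK's built-in DOM classes. It is not htmlparser's API. The class name `DomFilterSketch` and its methods are hypothetical, and it assumes the page is already well-formed (XHTML-style) markup, which real-world HTML often is not. That is the gap a tolerant parser such as htmlparser fills. A filter is modelled as a `Predicate<Element>`, and filters are combined with `and`.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical sketch: JDK-only DOM filtering, assuming well-formed input.
class DomFilterSketch {

    // Parse a well-formed (XHTML-style) string into a DOM tree.
    static Document parse(String xhtml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    }

    // Walk the tree depth-first and collect every element the filter accepts.
    static List<Element> collect(Node root, Predicate<Element> filter) {
        List<Element> matches = new ArrayList<>();
        walk(root, filter, matches);
        return matches;
    }

    private static void walk(Node node, Predicate<Element> filter, List<Element> out) {
        if (node instanceof Element) {
            Element element = (Element) node;
            if (filter.test(element)) {
                out.add(element);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), filter, out);
        }
    }

    // Example extraction: the href of every <a> element that has one.
    static List<String> linkTargets(String xhtml) throws Exception {
        Predicate<Element> isAnchor = e -> e.getTagName().equals("a");
        Predicate<Element> hasHref = e -> e.hasAttribute("href");
        List<String> hrefs = new ArrayList<>();
        for (Element e : collect(parse(xhtml), isAnchor.and(hasHref))) {
            hrefs.add(e.getAttribute("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><a href=\"/a\">A</a><a name=\"x\">no link</a>"
                + "<p><a href=\"/b\">B</a></p></body></html>";
        System.out.println(linkTargets(page)); // prints [/a, /b]
    }
}
```

The general structure is the same whichever parser you pick: fetch the page, parse it into a tree, then run filters over the tree to pull out the pieces you care about.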

Upvotes: 1
