Reputation: 11
I am looking for methods to extract data from various websites. I know there are programs out there you can buy, but since I am trying to learn, I want to do it myself. Does anyone have suggestions on a general structure, and if so, what language would you write it in? My first thought was Java, but I am more than willing and grateful to hear anyone else's opinion.
Upvotes: 1
Views: 1286
Reputation: 80194
Look at Hadoop (grid/distributed processing) and Solr (indexing and search). They support heavy processing and efficient indexing (for efficient searching), respectively.
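As a hedged sketch of the Solr side only (this assumes SolrJ 6+ and a local core named "pages"; the field names are hypothetical and would have to match your schema):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrIndexSketch {
        public static void main(String[] args) throws Exception {
            // Assumes a Solr server at localhost:8983 with a core named "pages".
            SolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/pages").build();
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "http://example.com/");              // hypothetical field names
            doc.addField("content", "text extracted from the page");
            solr.add(doc);    // queue the document
            solr.commit();    // make it searchable
            solr.close();
        }
    }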
Upvotes: 0
Reputation: 442
What kind of data are you trying to extract, and from which websites? A little more detail on your idea/project would be helpful.
I recently needed to look into and try a few HTML parsers to get some data into a more consolidated format.
I tried JTidy (http://jtidy.sourceforge.net/) and looked into Web-Harvest (http://web-harvest.sourceforge.net/). JTidy wouldn't quite do what I wanted, and Web-Harvest was overkill.
I ultimately settled on Java + htmlparser (http://htmlparser.sourceforge.net/).
It took very little development time to get what I needed, and htmlparser lets you define 'filters' that search for specific things in the DOM. A minimal sketch of that filter approach follows (the URL is a placeholder and error handling is omitted).
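    import org.htmlparser.Parser;
    import org.htmlparser.filters.NodeClassFilter;
    import org.htmlparser.tags.LinkTag;
    import org.htmlparser.util.NodeList;
    import org.htmlparser.util.ParserException;

    public class LinkExtractor {
        public static void main(String[] args) throws ParserException {
            // Point the parser at the page to scrape (placeholder URL).
            Parser parser = new Parser("http://example.com/");
            // Filter: keep only link (<a>) tags from the parsed DOM.
            NodeList links = parser.extractAllNodesThatMatch(
                    new NodeClassFilter(LinkTag.class));
            for (int i = 0; i < links.size(); i++) {
                LinkTag link = (LinkTag) links.elementAt(i);
                System.out.println(link.getLinkText() + " -> " + link.getLink());
            }
        }
    }

The same pattern works for other tags: swap in a different filter (e.g. a tag-name or attribute filter) to pull out tables, divs, or whatever element holds the data you're after.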
Upvotes: 1