Reputation: 20062
I am a beginner in web crawling. I am trying to crawl a page, for example, this page: http://shopping.yahoo.com/search;_ylt=AkzLiLhD9_ulIJy.SYsw9T0bFt0A?p=video&did=0
I need to extract the search results, such as Amazon.com or antonline.com. Can anybody suggest some techniques, tools, or software that could help me achieve this?
EDIT: I have to work with Java.
Upvotes: 2
Views: 1712
Reputation: 435
1. Read in the page from the URL. It'll be all markup.
2. Examine the markup and tease out patterns in the data. I'm assuming here that you'll want a title and price for each item. For example, I see in your example page that all titles are wrapped in <li class='hproduct'>
and all prices are inside <p class='price'>
3. Write regular expressions that find the contents of those elements, in that order, and apply them to extract the data.
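The steps above can be sketched with nothing but java.util.regex; the HTML string here is a simplified stand-in for the real search-results markup (the class names come from the page in the question):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PriceScraper {

    // Extract the text inside every <p class='price'>...</p> block.
    static List<String> extractPrices(String html) {
        List<String> prices = new ArrayList<>();
        // Capture everything between the opening and closing tags
        // that is not another tag.
        Pattern pricePattern = Pattern.compile("<p class='price'>([^<]+)</p>");
        Matcher m = pricePattern.matcher(html);
        while (m.find()) {
            prices.add(m.group(1).trim());
        }
        return prices;
    }

    public static void main(String[] args) {
        // Simplified stand-in for the downloaded markup.
        String html = "<li class='hproduct'>Camcorder"
                + "<p class='price'>$199.99</p></li>"
                + "<li class='hproduct'>Digital Camera"
                + "<p class='price'>$89.00</p></li>";
        System.out.println(extractPrices(html)); // prints [$199.99, $89.00]
    }
}
```

Bear in mind that regexes are brittle against real-world HTML: if the site reorders attributes or switches quote styles, the pattern silently stops matching, which is why the other answers suggest a proper HTML parser.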
Upvotes: 2
Reputation: 22171
Selenium WebDriver can do it:
http://seleniumhq.org/projects/webdriver/
I used it for extraction with Ruby a year ago, but it is also available for Java.
Look at Watir also: (http://watir.com)
You can also look at the HtmlUnit library.
Here is its getting-started guide, with examples of scraping (extracting) a web page's HTML elements:
http://htmlunit.sourceforge.net/gettingStarted.html
Upvotes: 1
Reputation: 33392
Basically the idea is to inspect the page in the browser devtools (Chrome DevTools or Firebug). Try to find distinctive ids or classes. On your page this is <ul class='hproducts'>
which contains a list of <li class='hproduct'>
elements. Use that!
Then you make a request, get the response, and parse it. (Google for DOM, SAX, XPath...) This varies a lot between languages and libraries. In Java, for example, we have the Jsoup library, which can fetch HTML (a little different from XML in this case, huh) and parse it in a convenient way.
Or better, google for their API ;)
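A minimal sketch of the Jsoup approach described above, assuming the jsoup library (https://jsoup.org) is on the classpath. The ul.hproducts / li.hproduct selectors come from the page in the question; the div.title wrapper and the HTML string are simplified stand-ins for the real markup:

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupScraper {

    // Return the title of every product entry in the results list.
    static List<String> productTitles(String html) {
        List<String> titles = new ArrayList<>();
        Document doc = Jsoup.parse(html);
        // CSS selector: every <li class='hproduct'> inside <ul class='hproducts'>
        for (Element product : doc.select("ul.hproducts li.hproduct")) {
            titles.add(product.select("div.title").text());
        }
        return titles;
    }

    public static void main(String[] args) {
        // Simplified stand-in for the search-results markup.
        String html = "<ul class='hproducts'>"
                + "<li class='hproduct'><div class='title'>Video Camera</div>"
                + "<p class='price'>$199.99</p></li>"
                + "<li class='hproduct'><div class='title'>Webcam</div>"
                + "<p class='price'>$29.99</p></li></ul>";
        System.out.println(productTitles(html)); // prints [Video Camera, Webcam]
    }
}
```

For a live page you would replace Jsoup.parse(html) with Jsoup.connect(url).get(), which fetches and parses the document in one call.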
Upvotes: 2