Reputation: 1470
Hi, I am going to work on a project in which I want to query a few websites, fetch similar data from each of them, and present it to the user. For example, if a user searches for "reebok shoes" in size "9.0" with a price range of "$30 to $75", my application should scrape a few websites (which I will provide) for this query and fetch the relevant data from them. I need to format the data and return it to the user without saving it in a database. I am new to this, so I need pointers on which framework or tool to choose, and on anything important I should know about web scraping. I have researched a few tools and frameworks, but I am not sure which of them can handle query-specific web scraping.
Upvotes: 1
Views: 1727
Reputation: 1470
After a good deal of research, I finally settled on the Simple HTML DOM parser (PHP), which helps extract the HTML tags and store them in JSON files. I then run some data-formatting functions and forward the formatted JSON file to the front end, where I present the data using HTML. I also tried Scrapy (Python), which is much easier than Simple HTML DOM. Let me know if anyone has any questions.
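For anyone wondering what the "format and forward as JSON" step might look like for the query in the question, here is a rough Python sketch. It is not my actual PHP code: the `Product` fields, the `matches` and `format_results` helpers, and the sample records are all made up for illustration, and it assumes the scraping step has already turned each listing into a record with a name, size, and price.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Product:
    # Hypothetical record shape; real sites will expose different fields.
    name: str
    size: float
    price: float
    source: str


def matches(product, keywords, size, min_price, max_price):
    """Return True if the product satisfies every part of the user's query."""
    name = product.name.lower()
    return (
        all(word in name for word in keywords.lower().split())
        and product.size == size
        and min_price <= product.price <= max_price
    )


def format_results(products, keywords, size, min_price, max_price):
    """Filter scraped products and serialize the matches to JSON, cheapest first."""
    hits = [p for p in products if matches(p, keywords, size, min_price, max_price)]
    hits.sort(key=lambda p: p.price)
    return json.dumps([asdict(p) for p in hits], indent=2)


# Stand-in for records scraped from several sites (made-up data).
scraped = [
    Product("Reebok Classic Shoes", 9.0, 54.99, "site-a.example"),
    Product("Reebok Running Shoes", 9.0, 89.00, "site-b.example"),
    Product("Reebok Club Shoes", 8.5, 45.00, "site-a.example"),
    Product("Nike Court Shoes", 9.0, 60.00, "site-b.example"),
]

print(format_results(scraped, "reebok shoes", 9.0, 30, 75))
```

Nothing is written to a database here: the records live in memory for the duration of the request and the JSON string goes straight to the front end.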
Upvotes: 1
Reputation: 1614
Try Crowbar to interpret all the JavaScript on remote websites and get the real content if it is not static. You can then use Crowbar itself to implement your scraping, or, if you find JavaScript too cumbersome (like me), you can use Perl and HTML::TagParser to get the content from the site.
For instance, I had to grab store addresses and shopnames from an electronics chain, so I did:
use HTML::TagParser;
# Parse the page source held in $html; a separate name avoids shadowing the string.
my $page = HTML::TagParser->new($html);
my $address = $page->getElementsByClassName("mystoremystorecontentcontainer")->innerText();
my $shopname = $page->getElementsByClassName("mystoremystorecontentmiddle text_headline")->innerText();
($html was a string holding the complete website.)
If you know how the data is arranged, that is, which id or class name the tag holding the data has, it can be pretty easy.
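If you would rather not use Perl, the same "find the tag by class name and take its text" idea can be sketched in Python with only the standard library. This is not what I ran: `ClassTextExtractor`, `text_by_class`, and the sample page are invented for illustration, and the sketch assumes reasonably well-formed markup in which every non-void tag is closed.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the inner text of every element carrying a given class name."""

    def __init__(self, class_name):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self.class_name = class_name
        self.depth = 0       # > 0 while inside a matching element
        self.parts = []      # text fragments of the current match
        self.results = []    # finished inner texts

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1  # nested tag inside a match
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self.depth = 1   # entering a matching element

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # Leaving the match: collapse whitespace and store the text.
            self.results.append(" ".join("".join(self.parts).split()))
            self.parts = []

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def text_by_class(html, class_name):
    """Return the inner text of each element whose class list contains class_name."""
    parser = ClassTextExtractor(class_name)
    parser.feed(html)
    parser.close()
    return parser.results


# Made-up page fragment standing in for a downloaded store page.
page = """
<div class="store">
  <div class="headline text_headline">Example Store &amp; Co.</div>
  <div class="address">1 Main St, <b>Springfield</b></div>
</div>
"""

print(text_by_class(page, "text_headline"))
print(text_by_class(page, "address"))
```

The class attribute is split on whitespace, so a tag with several classes matches on any one of them.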
A little warning: the method innerText() is badly implemented. If the text is not clean of special characters (e.g. a stray 'Ä' instead of an Ä), all hell will break loose. Good luck...
Upvotes: 0