shailbenq

Reputation: 1470

Data Aggregation from multiple websites for a given query

Hi, I am starting a project in which I want to query a few websites, fetch similar data from each of them, and present it to the user. For example, if a user searches for "reebok shoes" in size "9.0" with a price between $30 and $75, my application should scrape a few websites (which I will provide) for that query and fetch the relevant data from them. I don't want to save the data in a database; I only need to format it and return it to the user. I am new to this, so I need pointers on which framework or tool to choose, and on anything else important I should know about web scraping. I have researched a few tools and frameworks, but I am not sure which of them can handle query-specific web scraping.

Upvotes: 1

Views: 1727

Answers (2)

shailbenq

Reputation: 1470

After a good deal of research, I finally settled on the Simple HTML DOM (PHP) parser, which helps to extract the HTML tags and store the result in JSON files. I then run some data-formatting functions and forward the formatted JSON file to the front end, where I present the data using HTML. I also tried Scrapy (Python), which is much easier than Simple HTML DOM. Let me know if anyone has any questions.
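
A minimal Python sketch of the filter-and-format step described above, using the example query from the question. The field names (`site`, `title`, `size`, `price`), the `filter_items` helper, and the sample records are all made up for illustration; real scraped data would need its own cleaning first.

```python
import json


def filter_items(items, name, size, min_price, max_price):
    """Keep items whose title contains `name`, in the given size and price range."""
    wanted = name.lower()
    return [
        item for item in items
        if wanted in item["title"].lower()
        and item["size"] == size
        and min_price <= item["price"] <= max_price
    ]


# Hypothetical records standing in for what the per-site scrapers returned.
scraped = [
    {"site": "shop-a.example", "title": "Reebok Shoes Classic", "size": 9.0, "price": 59.99},
    {"site": "shop-b.example", "title": "Reebok Shoes Runner", "size": 9.0, "price": 89.00},
    {"site": "shop-b.example", "title": "Other Brand Sneaker", "size": 9.0, "price": 45.00},
    {"site": "shop-c.example", "title": "reebok shoes trail", "size": 10.0, "price": 40.00},
]

matches = filter_items(scraped, "reebok shoes", 9.0, 30, 75)

# Nothing is stored in a database; the JSON string goes straight to the front end.
payload = json.dumps({"query": "reebok shoes", "results": matches}, indent=2)
print(payload)
```

Keeping the filter separate from the scrapers means each site only has to produce records in the same shape, and the matching logic is written once.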

Upvotes: 1

Alexander Janssen

Reputation: 1614

Try Crowbar to interpret all the JavaScript on remote websites and get the real content if it's not static. You can then use Crowbar itself to implement your scraping, or, if you find JavaScript too cumbersome (like me), use Perl and HTML::TagParser to get the content from the site.

For instance, I had to grab store addresses and shop names from an electronics chain, so I did:

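use HTML::TagParser;
my $html = HTML::TagParser->new($html);  # parse the page source
# In scalar context, getElementsByClassName returns the first matching element.
my $address = $html->getElementsByClassName("mystoremystorecontentcontainer")->innerText();
my $shopname = $html->getElementsByClassName("mystoremystorecontentmiddle text_headline")->innerText();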

($html was a string holding the complete website.)

If you know how the data is arranged, that is, which id or class name the tag holding the data has, it can be pretty easy.

A little warning: the innerText() method is badly implemented. If the text is not clean of special characters (e.g. a stray 'Ä' instead of an &Auml;), all hell will break loose. Good luck...
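
For comparison, the same class-name lookup can be sketched in Python with only the standard library's `html.parser`. This is an illustration, not a port of HTML::TagParser: the `ClassTextExtractor` class, the `extract` helper, and the sample markup are hypothetical, and the sketch assumes reasonably well-formed HTML where every non-void tag is closed. The standard parser decodes character references itself, so both a raw 'Ä' and `&Auml;` come out as the same character.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying a given class name."""

    def __init__(self, class_name):
        super().__init__(convert_charrefs=True)
        self.class_name = class_name
        self.depth = 0      # > 0 while inside a matching element
        self.chunks = []    # text pieces of the current match
        self.results = []   # one string per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1            # nested tag inside a match
        elif self.class_name in classes:
            self.depth = 1             # entering a matching element

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:            # leaving the matching element
            text = " ".join("".join(self.chunks).split())
            self.results.append(text)
            self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract(html, class_name):
    parser = ClassTextExtractor(class_name)
    parser.feed(html)
    parser.close()
    return parser.results


# Made-up markup mixing an entity (&Auml;) with a raw 'Ä'.
page = """
<div class="store">
  <div class="middle text_headline">&Auml;hren Elektronik</div>
  <div class="address">Äußere Stra&szlig;e 1, 12345 K&ouml;ln</div>
</div>
"""

print(extract(page, "text_headline"))
print(extract(page, "address"))
```

For messy real-world pages a dedicated parsing library is usually the more robust choice; the point here is only that knowing the class name is most of the work.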

Upvotes: 0
