Reputation: 405
I'm very new to Java.
I want to retrieve the contents of news articles from a Google News search for the keyword "toy", from page 1 to page 10.
That is, retrieving 100 news articles from page 1 to page 10 (assuming 10 news articles on every page).
After reading Crawler4j vs. Jsoup for the pages crawling and parsing in Java,
I decided to use Crawler4j, as it can:
Take a base URI (home page).
Take all the URIs from each page and retrieve the contents of those too.
Move recursively through every URI it retrieves.
Retrieve the contents only of URIs that are inside this website (there could be external URIs referencing another website; we don't need those).
In my case, I can give it the Google search pages from p1 to p10 as seeds, and it should return the 100 news articles if I set int numberOfCrawlers = 1.
However, when I try the Quickstart example of Crawler4j,
it only returns the external links found from the original link, like these:
URL: http://www.ics.uci.edu/~lopes/
Text length: 2619
Html length: 11656
Number of outgoing links: 38
URL: http://www.ics.uci.edu/~welling/
Text length: 4503
Html length: 23713
Number of outgoing links: 24
URL: http://www.ics.uci.edu/~welling/teaching/courses.html
Text length: 2222
Html length: 15138
Number of outgoing links: 33
URL: http://www.ics.uci.edu/
Text length: 3661
Html length: 51628
Number of outgoing links: 86
Hence, I wonder: can crawler4j perform the function I described, or should I use crawler4j + Jsoup together?
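For reference, this is roughly how I add the seeds. Note that the tbm=nws and start= parameters are just my guess at how Google paginates the news results, and controller is the CrawlController from the Quickstart:

    // Assuming `controller` is the CrawlController from the Quickstart setup.
    // 10 results per page, so pages 1..10 correspond to start = 0, 10, ..., 90
    // (assumed URL scheme).
    for (int page = 0; page < 10; page++) {
        controller.addSeed("https://www.google.com/search?q=toy&tbm=nws&start=" + (page * 10));
    }
    controller.start(BasicCrawler.class, 1); // int numberOfCrawlers = 1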
Upvotes: 2
Views: 1008
Reputation: 5751
crawler4j respects crawler politeness mechanisms such as the robots.txt file. In your case, this file is https://www.google.com/robots.txt.
Inspecting this file reveals that crawling your given seed points is disallowed:
Disallow: /search
So you won't be able to crawl the given pages unless you modify the relevant classes to ignore robots.txt. However, this is not considered polite and does not comply with crawler ethics.
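If you do want to experiment regardless, crawler4j exposes this behaviour through RobotstxtConfig rather than requiring class modifications; a minimal sketch of disabling the check, assuming the crawler4j 4.x API (the storage folder is a placeholder), for testing only:

    CrawlConfig config = new CrawlConfig();
    config.setCrawlStorageFolder("/tmp/crawl"); // any writable folder

    PageFetcher pageFetcher = new PageFetcher(config);

    // Disables robots.txt handling entirely, so crawler4j will fetch
    // pages the site has asked crawlers not to touch. Impolite; testing only.
    RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
    robotstxtConfig.setEnabled(false);

    RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
    CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);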
Upvotes: 4
Reputation: 1020
There are a lot of questions in your post; I will try my best to answer them:
"Is it able to retrieve website content by Crawler4j?"
"Hence , I wonder can crawler4j perform the function I raised . Or should I use crawler4j +JSouptogether ?"
"It only returns the external links found from the orginal link . Like these"
To address the last point: in BasicCrawler, you'll need to add the allowed URLs, e.g. return href.startsWith("http://www.ics.uci.edu/"); modify this check to include more URLs.
In BasicCrawlController, you'll need to add your page seeds, and you can limit how deep the crawl goes with config.setMaxDepthOfCrawling(2);
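As for the first two questions: crawler4j can retrieve page content by itself (its visit() callback hands you the parsed text and HTML), and you can pass that HTML on to Jsoup if you need finer-grained extraction. Putting the pieces together, a minimal sketch based on the crawler4j 4.x Quickstart (class names follow the example; the seed and storage folder are placeholders, and the two classes go in separate files):

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.parser.HtmlParseData;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
    import edu.uci.ics.crawler4j.url.WebURL;

    // In BasicCrawler.java
    public class BasicCrawler extends WebCrawler {
        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            String href = url.getURL().toLowerCase();
            // Allow only URLs inside the site; add more prefixes here as needed.
            return href.startsWith("http://www.ics.uci.edu/");
        }

        @Override
        public void visit(Page page) {
            // This is where you extract content (and where Jsoup could take over).
            if (page.getParseData() instanceof HtmlParseData) {
                HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
                System.out.println("URL: " + page.getWebURL().getURL());
                System.out.println("Text length: " + htmlParseData.getText().length());
            }
        }
    }

    // In BasicCrawlController.java
    public class BasicCrawlController {
        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/tmp/crawl"); // intermediate crawl data
            config.setMaxDepthOfCrawling(2);            // stop 2 hops from the seeds

            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
            RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
            CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

            // Add your page seeds here.
            controller.addSeed("http://www.ics.uci.edu/");

            controller.start(BasicCrawler.class, 1); // int numberOfCrawlers = 1
        }
    }

If you want Jsoup in the picture, call Jsoup.parse(htmlParseData.getHtml()) inside visit() and use its selectors to pull out just the article body.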
Upvotes: 0