Reputation: 3573
I'm working on a project that needs a web crawler in Java: it should take a user query about a particular news subject, visit different news websites, extract the news content from those pages, and store it in files or a database. I then need to produce a summary of all the stored content. I'm new to this field, so I'd appreciate help from anyone with experience in how to do it.
Right now I have code that extracts news content from a single page, where I supply the page manually, but I have no idea how to integrate it into a web crawler so it can extract content from many different pages.
Can anyone give some good links to tutorials or implementations in Java that I can use or modify to suit my needs?
Upvotes: 4
Views: 21467
Reputation: 40345
I'd recommend that you check out my answers here: How can I bring google-like recrawling in my application(web or console) and Designing a web crawler
The first answer was provided for a C# question, but it's actually a language-agnostic answer, so it applies to Java too. Check out the links I've provided in both answers; there is some good reading material. I'd also suggest that you try one of the already existing Java crawlers rather than writing one yourself (it's not a small project).
...a web crawler in java which can take a user query about a particular news subject and then visits different news websites and then extracts news content from those pages and store it in some files/databases.
That requirement seems to go beyond the scope of "just a crawler" and into the area of machine learning and natural language processing. If you have a list of websites that you're sure serve news, then you might be able to extract the news content. However, even then you have to determine which part of a page is news and which isn't (e.g. there might also be navigation links, ads, comments, etc.). So exactly what kind of requirements are you facing here? Do you have a list of news websites? Do you have a reliable way to extract the news content?
Upvotes: 0
Reputation: 42597
One word of advice in addition to the other answers: make sure that your crawler respects robots.txt and rate-limits its requests (i.e. does not crawl sites rapidly and indiscriminately), or you are likely to get yourself or your organisation blocked by the sites you want to visit.
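As a rough illustration of the robots.txt part, here is a deliberately simplified sketch that only reads `Disallow:` prefixes under a `User-agent: *` group (real files also have per-agent groups, `Allow:` rules, wildcards, and `Crawl-delay`; a production crawler should use a proper library rather than this):

```java
import java.util.*;

public class RobotsTxt {
    // Collect Disallow path prefixes that apply to all agents ("*").
    static List<String> disallowedPaths(String robotsTxt) {
        List<String> disallowed = new ArrayList<>();
        boolean appliesToUs = false;
        for (String line : robotsTxt.split("\\R")) {
            line = line.trim();
            if (line.regionMatches(true, 0, "User-agent:", 0, 11)) {
                appliesToUs = line.substring(11).trim().equals("*");
            } else if (appliesToUs && line.regionMatches(true, 0, "Disallow:", 0, 9)) {
                String path = line.substring(9).trim();
                if (!path.isEmpty()) disallowed.add(path);
            }
        }
        return disallowed;
    }

    // A path is allowed if it doesn't start with any disallowed prefix.
    static boolean allowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }
}
```

Before fetching any page, the crawler would download `http://<host>/robots.txt` once per host, parse it like this, and skip URLs whose path is not allowed.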
Upvotes: 5
Reputation: 10121
I found this article really helpful when I was reading about web crawlers.
It provides a step-by-step guide to developing a multi-threaded crawler.
In essence, the following is a very high-level view of what a crawler should do:
- Insert the first URL into the queue
- Loop until enough documents are gathered:
  - Get the first URL from the queue and save its document
  - Extract the links from the saved document and insert them into the queue
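The loop above can be sketched in Java roughly as follows. This is a single-threaded sketch with a hypothetical seed URL; it uses a naive regex for link extraction so the example has no dependencies, whereas a real crawler would use an HTML parser (such as jsoup), normalise URLs, and handle errors and politeness delays:

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.*;
import java.util.regex.*;

public class SimpleCrawler {
    // Naive link extraction: finds absolute http(s) URLs in href attributes.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"(https?://[^\"]+)\"").matcher(html);
        while (m.find()) links.add(m.group(1));
        return links;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        queue.add("http://example.com/");   // placeholder seed URL
        int limit = 100;                    // stop once "enough documents" are gathered
        HttpClient client = HttpClient.newHttpClient();

        while (!queue.isEmpty() && visited.size() < limit) {
            String url = queue.poll();
            if (!visited.add(url)) continue;            // skip already-crawled URLs
            HttpResponse<String> resp = client.send(
                    HttpRequest.newBuilder(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.ofString());
            // ... save resp.body() to a file or database here ...
            for (String link : extractLinks(resp.body())) {
                if (!visited.contains(link)) queue.add(link);
            }
        }
    }
}
```

The `visited` set is what keeps the loop from re-fetching the same page forever; the article's multi-threaded version shares the queue and visited set between worker threads.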
Upvotes: 0
Reputation: 145
Here are some open source Java libraries that most people would recommend.
My personal favourite is Java Web Crawler, for its speed and ease of configuration.
By the way, if it's nothing that big (an assignment, say) and your source websites are NOT changing frequently, I would recommend implementing a simple HTML parser instead.
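To illustrate the "simple parser" idea: if the target page's markup is stable, a hard-coded pattern can stand in for a full parser. The `HeadlineScraper` class and the `<h2 class="headline">` markup below are made-up examples; inspect your actual source page and adjust the pattern accordingly:

```java
import java.util.*;
import java.util.regex.*;

public class HeadlineScraper {
    // Pull the text of every <h2 class="headline"> element from raw HTML.
    // Fragile by design: it only works while the site keeps this markup.
    static List<String> headlines(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("<h2 class=\"headline\">([^<]+)</h2>").matcher(html);
        while (m.find()) out.add(m.group(1).trim());
        return out;
    }
}
```

For anything beyond a fixed, rarely-changing page, a real HTML parser is the safer choice.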
Hope it helps.
Upvotes: 3
Reputation: 92274
You can use the jsoup HTML parser for this:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Fetch a page and select elements with a CSS selector
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
Upvotes: 8