Reputation: 1423
I'm making a simple program to scrape content from several webpages. I want to improve the speed of my program, so I want to use threads. I want to be able to control the number of threads with an integer (down the line I want users to be able to define this).
This is the code I want to create threads for:
public void runLocales(String langLocale){
    ParseXML parser = new ParseXML(langLocale);
    int statusCode = parser.getSitemapStatus();
    if (statusCode > 0){
        for (String page : parser.getUrls()){
            urlList.append(page+"\n");
        }
    }else {
        urlList.append("Connection timed out");
    }
}
And the ParseXML class:
import java.io.IOException;
import java.util.ArrayList;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParseXML {
    private String sitemapPath;
    private String sitemapName = "sitemap.xml";
    private String sitemapDomain = "somesite";
    Connection.Response response = null;
    boolean success = false;

    ParseXML(String langLocale){
        sitemapPath = sitemapDomain+"/"+langLocale+"/"+sitemapName;
        int i = 0;
        int retries = 3;
        // Try the request up to three times before giving up
        while (i < retries){
            try {
                response = Jsoup.connect(sitemapPath)
                        .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                        .timeout(10000)
                        .execute();
                success = true;
                break;
            } catch (IOException e) {
                // Request failed; fall through and retry
            }
            i++;
        }
    }

    // Returns the HTTP status code, or 0 if every attempt failed
    public int getSitemapStatus(){
        if(success){
            int statusCode = response.statusCode();
            return statusCode;
        }else {
            return 0;
        }
    }

    // Returns the text of every <loc> element, or null if parsing fails
    public ArrayList<String> getUrls(){
        ArrayList<String> urls = new ArrayList<String>();
        try {
            Document doc = response.parse();
            Elements element = doc.select("loc");
            for (Element page : element){
                urls.add(page.text());
            }
            return urls;
        } catch (IOException e) {
            System.out.println(e);
            return null;
        }
    }
}
I've been reading up on threads for a few days now and I can't figure out how to implement threading in my case. Can someone offer some insight, please?
Upvotes: 0
Views: 181
Reputation: 63966
Something like this should do:
// Needs java.util.Date; langLocale must be declared final to be used here
new Thread(
    new Runnable() {
        public void run() {
            try {
                runLocales(langLocale);
            } catch (Exception e) {
                e.printStackTrace();
            }
            System.out.println(
                "child thread " + new Date(System.currentTimeMillis()));
        }
    }).start();
Obviously, you still need to add the code to control how many threads you want to create, and decide what you want to do once your threshold is reached.
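One way to cap the number of threads, as a rough sketch rather than the only option, is a fixed-size pool from java.util.concurrent. The class name LocalePool, the method runAll, the locale strings, and the "done ..." placeholder are made up for illustration; in the real program the body of run() would call runLocales(locale). With a fixed pool, tasks submitted while every worker is busy simply wait in the pool's queue.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class LocalePool {
    // Runs one task per locale on at most maxThreads worker threads.
    static List<String> runAll(List<String> locales, int maxThreads) {
        // Thread-safe list, because several workers add to it concurrently
        final List<String> results = new CopyOnWriteArrayList<String>();
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        for (final String locale : locales) {
            pool.execute(new Runnable() {
                public void run() {
                    // Placeholder: the real program would call runLocales(locale) here
                    results.add("done " + locale);
                }
            });
        }
        pool.shutdown(); // accept no new tasks; queued tasks still run
        try {
            // Wait for the queued work to finish (the one-minute limit is arbitrary)
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return results;
    }

    public static void main(String[] args) {
        // The 2 could later come from user input
        List<String> out = runAll(Arrays.asList("en-us", "de-de", "fr-fr"), 2);
        System.out.println(out.size() + " locales processed");
    }
}
```

Note that the order of the results depends on which worker finishes first, so it may differ from the order of the input list.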
Upvotes: 1
Reputation: 6657
You can use a ThreadGroup to control the threads you want to maintain, or you can implement a ThreadPool mechanism for controlling threads.
You can find help on using the ThreadGroup class here, and a sample ThreadPool implementation here.
Hope this helps.
Enjoy!
Upvotes: 1
Reputation: 15675
Excuse me if I'm answering the obvious and your problem is different, but it looks like what you would like is to define:
public class Runner implements Runnable{ // Runnable is an interface, so implement it
    private final String langLocale;

    public Runner(String langLocale){
        this.langLocale = langLocale;
    }

    public void run(){ //Instead of public void runLocales(String langLocale)
        //Do your thing here
    }
}
And then create and start new threads using new Thread(new Runner("smth")).start();
You will probably want to keep track of each thread so you can join it, which keeps you from having too many threads running at a time. And when you have that problem, consider using a ThreadPool, where you hand in the Runnables directly.
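To illustrate the "keep track and join" idea, here is a minimal sketch that starts threads in batches and joins each batch before starting the next. The names JoinDemo, runInBatches, joinAll, and the completed counter are hypothetical, and the nested Runner is only a stand-in whose run() counts completions instead of scraping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class JoinDemo {
    // Counts finished tasks; used here only to make the result observable
    static final AtomicInteger completed = new AtomicInteger();

    // Stand-in for the Runner class from this answer
    static class Runner implements Runnable {
        private final String langLocale;
        Runner(String langLocale) { this.langLocale = langLocale; }
        public void run() {
            // The real work for langLocale would go here
            completed.incrementAndGet();
        }
    }

    // Starts at most maxThreads threads at a time and returns how many tasks ran.
    static int runInBatches(List<String> locales, int maxThreads) {
        completed.set(0);
        List<Thread> running = new ArrayList<Thread>();
        for (String locale : locales) {
            Thread t = new Thread(new Runner(locale));
            t.start();
            running.add(t);
            if (running.size() == maxThreads) {
                joinAll(running); // batch is full: wait before starting more
            }
        }
        joinAll(running); // wait for the last, possibly partial, batch
        return completed.get();
    }

    // Waits for every thread in the list, then empties the list.
    static void joinAll(List<Thread> threads) {
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        threads.clear();
    }

    public static void main(String[] args) {
        int n = runInBatches(Arrays.asList("en-us", "de-de", "fr-fr", "es-es", "it-it"), 2);
        System.out.println(n + " tasks finished");
    }
}
```

Batching is simple but wasteful, since one slow thread holds up its whole batch; that is the main reason a pool is usually the better fit.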
And one last thing: when crawling, be a good citizen! Respect the recommendations, use the robots.txt file, don't open more than a couple of threads to the same server, etc.
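If the pool is larger than what a single server should see, one possible way to hold back is a Semaphore with a couple of permits around the request. This is only a sketch: PoliteFetcher, measurePeak, and the 20 ms sleep standing in for the HTTP request are assumptions for illustration, and the peak counter exists purely to show that the limit holds.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class PoliteFetcher {
    // Runs `tasks` fake requests on `poolSize` threads, allowing at most
    // `maxPerServer` at once; returns the highest concurrency observed.
    static int measurePeak(int tasks, int poolSize, int maxPerServer) {
        final Semaphore permits = new Semaphore(maxPerServer);
        final AtomicInteger active = new AtomicInteger();
        final AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < tasks; i++) {
            pool.execute(new Runnable() {
                public void run() {
                    try {
                        permits.acquire(); // blocks while maxPerServer requests are in flight
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    try {
                        int now = active.incrementAndGet();
                        // Record the new peak if this is the highest count so far
                        int prev;
                        do {
                            prev = peak.get();
                        } while (now > prev && !peak.compareAndSet(prev, now));
                        Thread.sleep(20); // stand-in for the HTTP request
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        active.decrementAndGet();
                        permits.release();
                    }
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak concurrency: " + measurePeak(6, 6, 2));
    }
}
```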
Have fun!
Upvotes: 1
Reputation: 15807
Have a look at chapter 6 of the following book: http://www.google.co.uk/url?sa=t&rct=j&q=sun%20certified%20java%20programmer%20for%20java%206%20pdf&source=web&cd=1&ved=0CIEBEBYwAA&url=http%3A%2F%2Fm2projects.googlecode.com%2Ffiles%2FSCJP_Sun_Certified_Programmer_for_Java_6_Exam_310-065.pdf&ei=zauqT9b9GIuR0QWNzaCLBA&usg=AFQjCNEYGyInJaVdgFZ2JVlZlPtOEdCFPA&cad=rja. I bought the book and found it very useful.
Upvotes: 1