Reputation: 523
Is there an efficient way to parallelize a large number of GET requests in Java? I have a file with 200,000 lines, each of which needs a GET request to Wikimedia, and I then have to write part of the response to a common file. I've pasted the main part of my code below for reference.
while ((line = br.readLine()) != null) {
    count++;
    if ((count % 1000) == 0) {
        System.out.println(count + " tags parsed");
        fbw.flush();
        bw.flush();
    }
    //System.out.println(line);
    String target = new String(line);
    if (target.startsWith("\"") && (target.endsWith("\""))) {
        target = target.replaceAll("\"", "");
    }
    String url = "http://en.wikipedia.org/w/api.php?action=query&prop=revisions&format=xml&rvprop=timestamp&rvlimit=1&rvdir=newer&titles=";
    url = url + URLEncoder.encode(target, "UTF-8");
    URL obj = new URL(url);
    HttpURLConnection con = (HttpURLConnection) obj.openConnection();
    // optional, default is GET
    con.setRequestMethod("GET");
    // add request header
    //con.setRequestProperty("User-Agent", USER_AGENT);
    int responsecode = con.getResponseCode();
    //System.out.println("Sending 'Get' request to URL: " + url);
    BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
    String inputLine;
    StringBuffer response = new StringBuffer();
    while ((inputLine = in.readLine()) != null) {
        response.append(inputLine);
    }
    Document doc = loadXMLFromString(response.toString());
    NodeList x = doc.getElementsByTagName("revisions");
    if (x.getLength() == 1) {
        String time = x.item(0).getFirstChild().getAttributes().item(0).getTextContent().substring(0, 10).replaceAll("-", "");
        bw.write(line + "\t" + time + "\n");
    } else if (x.getLength() == 2) {
        String time = x.item(1).getFirstChild().getAttributes().item(0).getTextContent().substring(0, 10).replaceAll("-", "");
        bw.write(line + "\t" + time + "\n");
    } else {
        fbw.write(line + "\t" + "NULL" + "\n");
    }
}
I googled around and it seems there are two options: one is to create threads myself, and the other is to use something called an Executor. Could someone provide a little guidance on which one would be more appropriate for this task?
Upvotes: 4
Views: 514
Reputation: 1748
As stated above, you should dimension the number of parallel GET requests based on the capacity of the server. If you want to stay on the JVM but are willing to use Groovy, here is a really short example of parallel GET requests.
You start with a list of the URLs you want to fetch. Once the requests have completed, the tasks list contains all the results, accessible through the get() method for later processing; here they are simply printed out as an example.
import groovyx.net.http.AsyncHTTPBuilder

def urls = [
    'http://www.someurl.com',
    'http://www.anotherurl.com'
]

AsyncHTTPBuilder http = new AsyncHTTPBuilder(poolSize: urls.size())

def tasks = []
urls.each {
    tasks.add(http.get(uri: it) { resp, html -> return html })
}
tasks.each { println it.get() }
Note that for a production environment you would need to take care of timeouts, error responses, etc.
Upvotes: 0
Reputation: 718836
If you really, really need to do it via GET requests, I recommend that you use a ThreadPoolExecutor with a small thread pool (2 or 3 threads) to avoid overloading the Wikipedia servers. That will avoid a lot of coding...
Also consider using the Apache HttpClient libraries (with persistent connections!).
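For illustration, here is a rough sketch of that approach using a fixed pool of 3 threads and plain HttpURLConnection (the pool size, timeouts, the fetch() helper and the placeholder output are illustrative choices, not code from the question):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelFetch {

    // Fetch one page's revision XML; the URL layout matches the question.
    static String fetch(String title) throws Exception {
        String url = "http://en.wikipedia.org/w/api.php?action=query&prop=revisions"
                + "&format=xml&rvprop=timestamp&rvlimit=1&rvdir=newer&titles="
                + URLEncoder.encode(title, "UTF-8");
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setConnectTimeout(10_000);   // don't hang forever on a bad connection
        con.setReadTimeout(10_000);
        StringBuilder response = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                response.append(line);
            }
        }
        return response.toString();
    }

    public static void main(String[] args) throws Exception {
        // In the real program these would come from the 200,000-line input file.
        List<String> titles = Arrays.asList("Java_(programming_language)", "Wikipedia");

        // Small pool (2-3 threads) so we don't hammer the Wikipedia servers.
        ExecutorService pool = Executors.newFixedThreadPool(3);

        // Synchronize writes so lines from different tasks don't interleave.
        Object writeLock = new Object();

        for (String title : titles) {
            pool.submit(() -> {
                try {
                    String xml = fetch(title);
                    synchronized (writeLock) {
                        // Parse the XML and write to the output file here,
                        // as in the question's loop body.
                        System.out.println(title + " -> " + xml.length() + " chars");
                    }
                } catch (Exception e) {
                    System.err.println("Failed: " + title + " (" + e + ")");
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}

The single writeLock is there so that output lines produced by different worker threads do not interleave in the shared output file.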
But a much better idea is to use the database download option. Depending on what you are doing, you may be able to choose one of the smaller downloads. This page discusses the various options.
Note that Wikipedia prefers people to download the database dumps (etcetera) rather than pounding on their web servers.
Upvotes: 5
Reputation: 31724
What you need to do is this:
Upvotes: 0