user1943079

Reputation: 523

Parallelizing many GET requests

Is there an efficient way to parallelize a large number of GET requests in Java? I have a file with 200,000 lines, each one needing a GET request to Wikimedia, and then I have to write part of the response to a common file. I've pasted the main part of my code below for reference.

while ((line = br.readLine()) != null) {
    count++;
    if ((count % 1000) == 0) {
        System.out.println(count + " tags parsed");
        fbw.flush();
        bw.flush();
    }
    String target = line;
    // strip surrounding quotes, if present
    if (target.startsWith("\"") && target.endsWith("\"")) {
        target = target.substring(1, target.length() - 1);
    }
    String url = "http://en.wikipedia.org/w/api.php?action=query&prop=revisions&format=xml&rvprop=timestamp&rvlimit=1&rvdir=newer&titles="
            + URLEncoder.encode(target, "UTF-8");
    HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
    con.setRequestMethod("GET"); // optional; GET is the default
    //con.setRequestProperty("User-Agent", USER_AGENT);
    int responseCode = con.getResponseCode();
    // read the whole response body
    StringBuilder response = new StringBuilder();
    try (BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()))) {
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            response.append(inputLine);
        }
    }
    // pull the revision timestamp out of the XML and write it as yyyyMMdd
    Document doc = loadXMLFromString(response.toString());
    NodeList x = doc.getElementsByTagName("revisions");
    if (x.getLength() == 1) {
        String time = x.item(0).getFirstChild().getAttributes().item(0).getTextContent().substring(0, 10).replaceAll("-", "");
        bw.write(line + "\t" + time + "\n");
    } else if (x.getLength() == 2) {
        String time = x.item(1).getFirstChild().getAttributes().item(0).getTextContent().substring(0, 10).replaceAll("-", "");
        bw.write(line + "\t" + time + "\n");
    } else {
        fbw.write(line + "\t" + "NULL" + "\n");
    }
}

I googled around and it seems there are two options: one is to create threads and the other is to use something called an Executor. Could someone give me some guidance on which one is more appropriate for this task?

Upvotes: 4

Views: 514

Answers (3)

Jocce Nilsson

Reputation: 1748

As stated above, you should dimension the number of parallel GET requests based on the capacity of the server. If you want to stay on the JVM but are open to using Groovy, here is a really short example of parallel GET requests.

Initially there is a list of URLs that you want to fetch. Once done, the tasks list contains all the results, accessible through the get() method for later processing; here they are simply printed out as an example.

import groovyx.net.http.AsyncHTTPBuilder

def urls = [
  'http://www.someurl.com',
  'http://www.anotherurl.com'
]
AsyncHTTPBuilder http = new AsyncHTTPBuilder(poolSize:urls.size())
def tasks = []
urls.each{
  tasks.add(http.get(uri:it) { resp, html -> return html })
}
tasks.each { println it.get() }

Note that for a production environment you would need to take care of timeouts, error responses, etc.

Upvotes: 0

Stephen C

Reputation: 718836

If you really, really need to do it via GET requests, I recommend that you use a ThreadPoolExecutor with a small thread pool (2 or 3 threads) to avoid overloading the Wikipedia servers. That will also save you a lot of coding ...

Also consider using the Apache HttpClient libraries (with persistent connections!).
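
A minimal sketch of the thread-pool approach, using plain HttpURLConnection and Java's built-in executors (the titles.txt and results.tsv file names and the fetch helper are assumptions for illustration); a production version would reuse connections via HttpClient, set a User-Agent, and add retries:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SmallPoolFetcher {

    static final String API = "http://en.wikipedia.org/w/api.php?action=query"
            + "&prop=revisions&format=xml&rvprop=timestamp&rvlimit=1&rvdir=newer&titles=";

    // Fetch one title and return the raw XML response body.
    static String fetch(String title) throws Exception {
        URL url = new URL(API + URLEncoder.encode(title, "UTF-8"));
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
        }
        return body.toString();
    }

    public static void main(String[] args) throws Exception {
        // A small pool (3 threads) keeps the load on the server modest.
        ExecutorService pool = Executors.newFixedThreadPool(3);
        List<Future<String>> futures = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new FileReader("titles.txt"))) {
            String title;
            while ((title = br.readLine()) != null) {
                final String t = title;
                futures.add(pool.submit(() -> t + "\t" + fetch(t)));
            }
        }
        pool.shutdown();
        try (PrintWriter out = new PrintWriter("results.tsv")) {
            for (Future<String> f : futures) {
                try {
                    out.println(f.get()); // blocks until that request completes
                } catch (Exception e) {
                    // failed request; a real version would log the title and retry
                }
            }
        }
    }
}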


But a much better idea is to use the database download option. Depending on what you are doing, you may be able to choose one of the smaller downloads. This page discusses the various options.

Note that Wikipedia prefers people to download the database dumps (etcetera) rather than pounding on their web servers.

Upvotes: 5

Jatin

Reputation: 31724

What you need to do is this (a sketch follows the list):

  1. Have a producer thread that reads each line and adds it to a queue.
  2. Have a thread pool in which each thread takes a URL off the queue and performs a GET request.
  3. Each thread puts its response on a results queue.
  4. Have one more consumer thread that takes results off the queue and writes them to the file.
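
A minimal sketch of that pipeline, assuming a hypothetical fetchTimestamp stand-in for the GET-and-parse step from the question, and hypothetical titles.txt / results.tsv file names:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class PipelineSketch {

    static final String POISON = "<<EOF>>"; // sentinel telling the writer to stop

    // Hypothetical stand-in for the GET request + XML parsing in the question.
    static String fetchTimestamp(String title) {
        return "NULL";
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> results = new LinkedBlockingQueue<>();
        // The executor's internal work queue plays the role of the URL queue.
        ExecutorService workers = Executors.newFixedThreadPool(3);

        // Consumer thread: drains the results queue into the output file.
        Thread writer = new Thread(() -> {
            try (PrintWriter out = new PrintWriter("results.tsv")) {
                while (true) {
                    String line = results.take();
                    if (POISON.equals(line)) {
                        break;
                    }
                    out.println(line);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        writer.start();

        // Producer: read each title and hand it to the worker pool.
        try (BufferedReader br = new BufferedReader(new FileReader("titles.txt"))) {
            String title;
            while ((title = br.readLine()) != null) {
                final String t = title;
                workers.submit(() -> results.add(t + "\t" + fetchTimestamp(t)));
            }
        }

        // Wait for all GET tasks to finish, then stop the writer.
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.HOURS);
        results.add(POISON);
        writer.join();
    }
}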

Upvotes: 0
