Reputation: 3819
I wrote the following function to calculate the sizes of URLs. The input is a pre-built map from file type to an object that holds a Set of URL strings. I ran it on a set of 3000 URLs, and each time I raised the thread count up to 20, I got better results; above 20 threads the performance starts to decrease.
My real goal was to run it on a set of 500,000 URLs, so I thought I'd run it with 200 threads in the thread pool. The result I got was:
FileType: pdf. CountTotalFiles: 394231 .CountCalcedSize: 6. Size: 14 MB. Time took to calculate: 1010
FileType: pdf. CountTotalFiles: 394231 .CountCalcedSize: 2863. Size: 3 GB. Time took to calculate: 61004
FileType: pdf. CountTotalFiles: 394231 .CountCalcedSize: 3481. Size: 3 GB. Time took to calculate: 121002
FileType: pdf. CountTotalFiles: 394231 .CountCalcedSize: 3691. Size: 3 GB. Time took to calculate: 181004
FileType: pdf. CountTotalFiles: 394231 .CountCalcedSize: 3706. Size: 3 GB. Time took to calculate: 241004
FileType: pdf. CountTotalFiles: 394231 .CountCalcedSize: 3838. Size: 4 GB. Time took to calculate: 301004
FileType: pdf. CountTotalFiles: 394231 .CountCalcedSize: 4596. Size: 4 GB. Time took to calculate: 361004
FileType: pdf. CountTotalFiles: 394231 .CountCalcedSize: 5059. Size: 5 GB. Time took to calculate: 421008
This was quite disappointing: very soon, after only ~3000 URLs, the performance drops and only 100-150 URLs are processed per minute, whereas 2000 URLs were processed in the first minute.
Am I doing something wrong in the way I use the thread pool, or is there another bottleneck here?
void getSize(Map<String, Files> map) throws IOException {
    final BufferedWriter bw = new BufferedWriter(new FileWriter("bad_files.txt"));
    Set<Entry<String, Files>> entrySet = map.entrySet();
    for (Entry<String, Files> entry : entrySet) {
        final List<Long> sizeList = new ArrayList<Long>();
        Files filesObject = entry.getValue();
        HashSet<String> urlsSet = filesObject.urlsSet;
        ExecutorService pool = Executors.newFixedThreadPool(200);
        final long startTime = System.currentTimeMillis();
        final String k = entry.getKey();
        final Files value = entry.getValue();
        Timer t = new Timer();
        t.schedule(new TimerTask() {
            @Override
            public void run() {
                long size = 0;
                for (Long s : sizeList) {
                    size += s;
                }
                System.out.println("FileType: " + k + ". CountTotalFiles: " + value.urlsSet.size()
                        + " .CountCalcedSize: " + sizeList.size() + ". Size: "
                        + FileUtils.byteCountToDisplaySize(size) + ". Time took to calculate: "
                        + (System.currentTimeMillis() - startTime));
            }
        }, 1000, 60000);
        for (final String urlStr : urlsSet) {
            Runnable call = new Runnable() {
                @Override
                public void run() {
                    HttpURLConnection urlCon = null;
                    try {
                        URL url = new URL(urlStr);
                        urlCon = (HttpURLConnection) url.openConnection();
                        // if (url.getProtocol().equals("https")) {
                        //     setSSLContext((HttpsURLConnection) urlCon);
                        // }
                        if (urlCon.getResponseCode() != HttpURLConnection.HTTP_OK) {
                            bw.append("Response: " + urlCon.getResponseCode() + " " + urlStr + "\n");
                        } else {
                            sizeList.add(Long.valueOf(urlCon.getContentLength()));
                        }
                        // urlCon.disconnect();
                    } catch (Exception e) {
                        try {
                            bw.append(e.getMessage() + " " + urlStr + "\n");
                            // urlCon.disconnect();
                        } catch (IOException e1) {
                            e1.printStackTrace();
                        }
                    }
                }
            };
            pool.submit(call);
        }
        pool.shutdown();
        try {
            pool.awaitTermination(100, TimeUnit.DAYS);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        long size = 0;
        for (Long s : sizeList) {
            size += s;
        }
        System.out.println("FileType: " + entry.getKey() + ". CountTotalFiles: " + entry.getValue().urlsSet.size()
                + " .CountCalcedSize: " + sizeList.size() + ". Size: "
                + FileUtils.byteCountToDisplaySize(size) + ". Time took to calculate: "
                + (System.currentTimeMillis() - startTime));
    }
    bw.flush();
    bw.close();
}
Upvotes: 0
Views: 126
Reputation: 13242
If you are only interested in the headers, call urlCon.setRequestMethod("HEAD") on the connection before the request is sent (i.e. before calling getResponseCode()). The server then returns only the headers, not the body, which should improve performance.
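A minimal sketch of the idea (the class and method names here are my own, not from the question; timeout values are arbitrary assumptions):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class HeadSizeCheck {

    // Fetch only the response headers and read Content-Length from them.
    static long contentLength(String urlStr) throws Exception {
        HttpURLConnection con = (HttpURLConnection) new URL(urlStr).openConnection();
        con.setRequestMethod("HEAD"); // must be set before the request is actually sent
        con.setConnectTimeout(5000);  // fail fast on dead hosts (arbitrary values)
        con.setReadTimeout(5000);
        try {
            if (con.getResponseCode() == HttpURLConnection.HTTP_OK) {
                // getContentLengthLong() returns -1 if the header is absent
                return con.getContentLengthLong();
            }
            return -1;
        } finally {
            con.disconnect();
        }
    }
}
```

Note that a server is not obliged to send Content-Length on a HEAD response, so a -1 result may still need a fallback (e.g. a GET request).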
Upvotes: 1