Reputation: 3819
I wrote the following function to calculate the sizes of URLs. The input is a pre-built map from file type to an object that holds a Set of URL strings. I ran it on a set of 3000 URLs, and each time I raised the thread count up to 20, I got better results; above 20 threads the performance starts to decrease.
My real goal was to run it on a set of 500,000 URLs, so I thought I'd run it with 200 threads in the thread pool. The result I got was:
FileType: pdf. CountTotalFiles: 394231 .CountCalcedSize: 6. Size: 14 MB. Time took to calculate: 1010
FileType: pdf. CountTotalFiles: 394231 .CountCalcedSize: 2863. Size: 3 GB. Time took to calculate: 61004
FileType: pdf. CountTotalFiles: 394231 .CountCalcedSize: 3481. Size: 3 GB. Time took to calculate: 121002
FileType: pdf. CountTotalFiles: 394231 .CountCalcedSize: 3691. Size: 3 GB. Time took to calculate: 181004
FileType: pdf. CountTotalFiles: 394231 .CountCalcedSize: 3706. Size: 3 GB. Time took to calculate: 241004
FileType: pdf. CountTotalFiles: 394231 .CountCalcedSize: 3838. Size: 4 GB. Time took to calculate: 301004
FileType: pdf. CountTotalFiles: 394231 .CountCalcedSize: 4596. Size: 4 GB. Time took to calculate: 361004
FileType: pdf. CountTotalFiles: 394231 .CountCalcedSize: 5059. Size: 5 GB. Time took to calculate: 421008
This was quite disappointing: very soon, after only ~3000 URLs, the performance drops and only 100-150 URLs are processed per minute, whereas 2000 URLs were processed in the first minute.
Am I doing something wrong in the way I use the thread pool, or is there another bottleneck here?
void getSize(Map<String, Files> map) throws IOException {
    final BufferedWriter bw = new BufferedWriter(new FileWriter("bad_files.txt"));
    Set<Entry<String, Files>> entrySet = map.entrySet();
    for (Entry<String, Files> entry : entrySet) {
        final List<Long> sizeList = new ArrayList<Long>();
        Files filesObject = entry.getValue();
        HashSet<String> urlsSet = filesObject.urlsSet;
        ExecutorService pool = Executors.newFixedThreadPool(200);
        final long startTime = System.currentTimeMillis();
        final String k = entry.getKey();
        final Files value = entry.getValue();
        Timer t = new Timer();
        t.schedule(new TimerTask() {
            @Override
            public void run() {
                long size = 0;
                for (Long s : sizeList) {
                    size += s;
                }
                System.out.println("FileType: " + k + ". CountTotalFiles: " + value.urlsSet.size()
                        + " .CountCalcedSize: " + sizeList.size() + ". Size: "
                        + FileUtils.byteCountToDisplaySize(size) + ". Time took to calculate: "
                        + (System.currentTimeMillis() - startTime));
            }
        }, 1000, 60000);
        for (final String urlStr : urlsSet) {
            Runnable call = new Runnable() {
                @Override
                public void run() {
                    HttpURLConnection urlCon = null;
                    try {
                        URL url = new URL(urlStr);
                        urlCon = (HttpURLConnection) url.openConnection();
                        // if (url.getProtocol().equals("https")) {
                        //     setSSLContext((HttpsURLConnection) urlCon);
                        // }
                        if (urlCon.getResponseCode() != HttpURLConnection.HTTP_OK) {
                            bw.append("Response: " + urlCon.getResponseCode() + " " + urlStr + "\n");
                        } else {
                            sizeList.add(Long.valueOf(urlCon.getContentLength()));
                        }
                        // urlCon.disconnect();
                    } catch (Exception e) {
                        try {
                            bw.append(e.getMessage() + " " + urlStr + "\n");
                            // urlCon.disconnect();
                        } catch (IOException e1) {
                            e1.printStackTrace();
                        }
                    }
                }
            };
            pool.submit(call);
        }
        pool.shutdown();
        try {
            pool.awaitTermination(100, TimeUnit.DAYS);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        long size = 0;
        for (Long s : sizeList) {
            size += s;
        }
        System.out.println("FileType: " + entry.getKey() + ". CountTotalFiles: " + entry.getValue().urlsSet.size()
                + " .CountCalcedSize: " + sizeList.size() + ". Size: "
                + FileUtils.byteCountToDisplaySize(size) + ". Time took to calculate: "
                + (System.currentTimeMillis() - startTime));
    }
    bw.flush();
    bw.close();
}
Upvotes: 0
Views: 126
Reputation: 13242
If you are only interested in the headers, call urlCon.setRequestMethod("HEAD") on the connection before the request is sent (i.e. before calling getResponseCode()). The server then returns only the headers, not the body, which should improve performance.
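A minimal sketch of the idea (the class and method names here are my own, not from the question; timeout values are arbitrary assumptions):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class HeadSizeCheck {

    // Fetch only the response headers and read Content-Length from them.
    static long contentLength(String urlStr) throws Exception {
        HttpURLConnection con = (HttpURLConnection) new URL(urlStr).openConnection();
        con.setRequestMethod("HEAD"); // must be set before the request is actually sent
        con.setConnectTimeout(5000);  // fail fast on dead hosts (arbitrary values)
        con.setReadTimeout(5000);
        try {
            if (con.getResponseCode() == HttpURLConnection.HTTP_OK) {
                // getContentLengthLong() returns -1 if the header is absent
                return con.getContentLengthLong();
            }
            return -1;
        } finally {
            con.disconnect();
        }
    }
}
```

Note that a server is not obliged to send Content-Length on a HEAD response, so a -1 result may still need a fallback (e.g. a GET request).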
Upvotes: 1