Pawan

Reputation: 1588

Why does crawling a folder in Linux get faster with each iteration?

I have a program in which I need to crawl a specific folder and delete its contents. I am using the Files.walkFileTree method to achieve this. On Ubuntu 14.04 (64-bit, 4 GB RAM) the program runs fine, but the first time I crawl a folder it takes a very long time. On subsequent crawls of the same folder, the time decreases drastically and settles down. Here is the output of simple System.currentTimeMillis() calls to check the time spent:

Key: all times are in milliseconds.

FOLDER A

On the first time:

Deletion: 100100
Copy: 53
Crawl: 143244
Parse: 4307

On the second time:

Deletion: 486
Copy: 3
Crawl: 1424
Parse: 4581

On the third time:

Deletion: 567
Copy: 16
Crawl: 1999
Parse: 4027

FOLDER B

On the first time:

Deletion: 88971
Copy: 47
Crawl: 137623
Parse: 4125

On the second time:

Deletion: 443
Copy: 31
Crawl: 1631
Parse: 3986

On the third time:

Deletion: 434
Copy: 4
Crawl: 1648
Parse: 4048

Note that the Crawl time includes the time for deletion and copying. All three runs were performed on the same folder, with the same contents, just minutes apart from each other. This happens for each new folder I try the program on. Is this an issue with the file system, or is it something to do with my code?

The code I use to benchmark is here:

public void publish(boolean fastParse) throws IOException, InterruptedException {
    InfoHandler info = new InfoHandler();
    long start = System.currentTimeMillis();
    // Crawl the source directory into the queue (deletion and copying
    // happen inside the crawler, so they are included in the Crawl time).
    if (fastParse) {
        crawler.fastReadIntoQueue(Paths.get(DirectoryCrawler.SOURCE_DIRECTORY).normalize());
    }
    else {
        crawler.readIntoQueue(Paths.get(DirectoryCrawler.SOURCE_DIRECTORY).normalize());
    }
    long read = System.currentTimeMillis();
    info.findLatestPosts(fileQueue);
    //info.findNavigationPages(fileQueue);
    long ps = System.currentTimeMillis();
    parser = new Parser();
    parser.parse(fileQueue);
    info.writeInfoFile();
    long pe = System.currentTimeMillis();
    System.out.println("Crawl: " + (read - start));
    System.out.println("Parse: " + (pe - ps));
}

And the class which defines the readIntoQueue method can be found here: https://github.com/pawandubey/griffin/blob/master/src/main/java/com/pawandubey/griffin/DirectoryCrawler.java
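For reference, the deletion pass follows the standard recursive-delete pattern with Files.walkFileTree; a minimal sketch of that pattern (the class and method names here are placeholders, not the exact code from the repo) looks like this:

import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

public class TreeDeleter {
    // Deletes everything under root, but not root itself.
    static void deleteContents(final Path root) throws IOException {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                Files.delete(file);              // delete each regular file
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
                if (exc != null) throw exc;
                if (!dir.equals(root)) {
                    Files.delete(dir);           // directory is empty by now
                }
                return FileVisitResult.CONTINUE;
            }
        });
    }
}

The postVisitDirectory callback runs after a directory's entries have been visited, so each directory is already empty by the time it is deleted.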

Upvotes: 1

Views: 143

Answers (2)

Maks

Reputation: 202

Well, a possible answer (from http://www.tldp.org/LDP/tlk/fs/filesystem.html):

The Linux Virtual File system is implemented so that access to its files is as fast and efficient as possible. It must also make sure that the files and their data are kept correctly. These two requirements can be at odds with each other. The Linux VFS caches information in memory from each file system as it is mounted and used.

So your OS is probably just caching file system information. In that case, restarting the program won't "help" much (this is a feature of the OS, not of your application), and you should see the slowness again only after an OS restart, or after evicting the FS data from the caches (on Linux: sync; echo 3 > /proc/sys/vm/drop_caches, as root). You can play with it to prove or disprove this theory.
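One way to test the theory is to time the same traversal several times in a row, dropping the caches between runs when you want a cold pass again. A hypothetical probe (the class name and the entry counting are mine, not from the question's code):

import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

public class CacheProbe {
    // Walks the tree and counts entries, so the work can't be optimized away.
    static long walk(Path root) throws IOException {
        final long[] count = {0};
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path f, BasicFileAttributes a) {
                count[0]++;
                return FileVisitResult.CONTINUE;
            }
        });
        return count[0];
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args[0]);
        for (int i = 1; i <= 3; i++) {
            long t0 = System.currentTimeMillis();
            long n = walk(root);
            System.out.println("pass " + i + ": " + n + " entries in "
                    + (System.currentTimeMillis() - t0) + " ms");
        }
        // To get a cold pass again, run as root between invocations:
        //   sync; echo 3 > /proc/sys/vm/drop_caches
    }
}

Under the caching theory, the first pass should be slow and the later passes fast, matching the numbers in the question.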

Upvotes: 1

Hi-Angel

Reputation: 5619

It's because Linux usually caches everything pretty aggressively, following the principle that free RAM is wasted RAM. But don't worry: as soon as an application needs more RAM, Linux simply drops the cache.

You can run free to see this on your system; the output looks like:

$ free -h
             total       used       free     shared    buffers     cached
Mem:          2,9G       2,7G       113M        55M        92K       1,0G
-/+ buffers/cache:       1,8G       1,1G
Swap:         1,9G       100M       1,8G

In the first row, the «used» column (2,7G) is the amount of memory used by the system, including the cache. In the second row, the same column is the amount of memory used by applications. You can think of the difference between the two rows as effectively free memory, because as soon as an application needs it, the system, as mentioned, just drops the cache.
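If you want to watch the cache grow while the crawler runs, the same numbers are exposed in /proc/meminfo. A small Linux-only sketch (assuming the standard Cached: field of that file):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class CacheWatcher {
    // Prints the current page-cache size from /proc/meminfo.
    public static void main(String[] args) throws IOException {
        for (String line : Files.readAllLines(Paths.get("/proc/meminfo"), StandardCharsets.UTF_8)) {
            if (line.startsWith("Cached:")) {
                System.out.println(line.trim());
            }
        }
    }
}

Run it before and after a crawl: the Cached: value should jump by roughly the amount of file metadata and data the crawl touched.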

You can read more about it here.

Upvotes: 1
