Pawan

Reputation: 1588

Why does crawling a folder in Linux get faster with each iteration?

I have a program in which I need to crawl a specific folder and delete its contents. I am using the Files.walkFileTree method to achieve this. On Ubuntu 14.04 (64-bit, 4 GB RAM) the program runs fine, but the first time I crawl a folder it takes a very long time. On subsequent crawls of the same folder, the time decreases drastically and settles down. Here is the output of simple System.currentTimeMillis() calls to check the time spent:

Key: all times are in milliseconds.

FOLDER A

On the first time:

Deletion: 100100
Copy: 53
Crawl: 143244
Parse: 4307

On the second time:

Deletion: 486
Copy: 3
Crawl: 1424
Parse: 4581

On the third time:

Deletion: 567
Copy: 16
Crawl: 1999
Parse: 4027

FOLDER B

On the first time:

Deletion: 88971
Copy: 47
Crawl: 137623
Parse: 4125

On the second time:

Deletion: 443
Copy: 31
Crawl: 1631
Parse: 3986

On the third time:

Deletion: 434
Copy: 4
Crawl: 1648
Parse: 4048

Note that the Crawl time includes the time for deletion and copying. All three runs were performed on the same folder, with the same contents, just minutes apart from each other. This happens for each new folder I try the program on. Is this an issue with the file system, or is it something to do with my code?

The code I use to benchmark is here:

public void publish(boolean fastParse) throws IOException, InterruptedException {
    InfoHandler info = new InfoHandler();
    long start = System.currentTimeMillis();
    // Crawl the source directory into the queue (deletion and copying
    // happen inside the crawler, so they are included in the Crawl time).
    if (fastParse) {
        crawler.fastReadIntoQueue(Paths.get(DirectoryCrawler.SOURCE_DIRECTORY).normalize());
    }
    else {
        crawler.readIntoQueue(Paths.get(DirectoryCrawler.SOURCE_DIRECTORY).normalize());
    }
    long read = System.currentTimeMillis();
    info.findLatestPosts(fileQueue);
    //info.findNavigationPages(fileQueue);
    long ps = System.currentTimeMillis();
    parser = new Parser();
    parser.parse(fileQueue);
    info.writeInfoFile();
    long pe = System.currentTimeMillis();
    System.out.println("Crawl: " + (read - start));
    System.out.println("Parse: " + (pe - ps));
}

And the class which defines the readIntoQueue method can be found here: https://github.com/pawandubey/griffin/blob/master/src/main/java/com/pawandubey/griffin/DirectoryCrawler.java
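For reference, the deletion pass follows the standard recursive-delete pattern with Files.walkFileTree; a minimal sketch of that pattern (the class and method names here are placeholders, not the exact code from the repo) looks like this:

import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

public class TreeDeleter {
    // Deletes everything under root, but not root itself.
    static void deleteContents(final Path root) throws IOException {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                Files.delete(file);              // delete each regular file
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
                if (exc != null) throw exc;
                if (!dir.equals(root)) {
                    Files.delete(dir);           // directory is empty by now
                }
                return FileVisitResult.CONTINUE;
            }
        });
    }
}

The postVisitDirectory callback runs after a directory's entries have been visited, so each directory is already empty by the time it is deleted.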

Upvotes: 1

Views: 143

Answers (2)

Maks

Reputation: 202

Well, a possible answer (from http://www.tldp.org/LDP/tlk/fs/filesystem.html):

The Linux Virtual File system is implemented so that access to its files is as fast and efficient as possible. It must also make sure that the files and their data are kept correctly. These two requirements can be at odds with each other. The Linux VFS caches information in memory from each file system as it is mounted and used.

So your OS is probably just caching file system information. In that case, restarting the program won't "help" much (this is a feature of the OS, not of your application), and you should see the slowness again only after an OS restart, or after evicting the FS data from the caches (on Linux: sync; echo 3 > /proc/sys/vm/drop_caches, as root). You can play with it to prove or disprove this theory.
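One way to test the theory is to time the same traversal several times in a row, dropping the caches between runs when you want a cold pass again. A hypothetical probe (the class name and the entry counting are mine, not from the question's code):

import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

public class CacheProbe {
    // Walks the tree and counts entries, so the work can't be optimized away.
    static long walk(Path root) throws IOException {
        final long[] count = {0};
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path f, BasicFileAttributes a) {
                count[0]++;
                return FileVisitResult.CONTINUE;
            }
        });
        return count[0];
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args[0]);
        for (int i = 1; i <= 3; i++) {
            long t0 = System.currentTimeMillis();
            long n = walk(root);
            System.out.println("pass " + i + ": " + n + " entries in "
                    + (System.currentTimeMillis() - t0) + " ms");
        }
        // To get a cold pass again, run as root between invocations:
        //   sync; echo 3 > /proc/sys/vm/drop_caches
    }
}

Under the caching theory, the first pass should be slow and the later passes fast, matching the numbers in the question.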

Upvotes: 1

Hi-Angel

Reputation: 5619

It's because Linux usually caches everything pretty aggressively, following the principle that free RAM is wasted RAM. But don't worry: as soon as an application needs more RAM, Linux simply drops the cache.

You can run free to see this on your system; the output looks like:

$ free -h
             total       used       free     shared    buffers     cached
Mem:          2,9G       2,7G       113M        55M        92K       1,0G
-/+ buffers/cache:       1,8G       1,1G
Swap:         1,9G       100M       1,8G

In the first row, the «used» column (2,7G) is the amount of memory used by the system, including the cache. In the second row, the same column is the amount of memory used by applications. You can think of the difference between the two rows as effectively free memory, because as soon as an application needs it, the system, as mentioned, just drops the cache.
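If you want to watch the cache grow while the crawler runs, the same numbers are exposed in /proc/meminfo. A small Linux-only sketch (assuming the standard Cached: field of that file):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class CacheWatcher {
    // Prints the current page-cache size from /proc/meminfo.
    public static void main(String[] args) throws IOException {
        for (String line : Files.readAllLines(Paths.get("/proc/meminfo"), StandardCharsets.UTF_8)) {
            if (line.startsWith("Cached:")) {
                System.out.println(line.trim());
            }
        }
    }
}

Run it before and after a crawl: the Cached: value should jump by roughly the amount of file metadata and data the crawl touched.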

You can read more about it here.

Upvotes: 1
