Reputation: 1588
I have a program in which I need to crawl a specific folder and delete its contents. I am using the Files.walkFileTree method to do this.
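For context, the deletion part can be done with the standard walkFileTree pattern, roughly like this (a minimal sketch of the approach only, not the actual DirectoryCrawler code; the class and method names here are placeholders):

import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

public class DeleteTree {

    // Recursively delete everything under (and including) root.
    static void deleteRecursively(Path root) throws IOException {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                Files.delete(file);          // delete each regular file as it is visited
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
                if (exc != null) {
                    throw exc;               // propagate any error from the subtree
                }
                Files.delete(dir);           // delete the directory once it is empty
                return FileVisitResult.CONTINUE;
            }
        });
    }
}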
On Ubuntu 14.04 (64-bit, 4 GB RAM), the program runs fine, but the first time I crawl a folder it takes a very long time. On subsequent crawls of the same folder, the time decreases drastically and settles down.
Here is the output of simple System.currentTimeMillis() calls to check the time spent:
(all times in milliseconds)

FOLDER A
First run:
Deletion: 100100
Copy: 53
Crawl: 143244
Parse: 4307
Second run:
Deletion: 486
Copy: 3
Crawl: 1424
Parse: 4581
Third run:
Deletion: 567
Copy: 16
Crawl: 1999
Parse: 4027

FOLDER B
First run:
Deletion: 88971
Copy: 47
Crawl: 137623
Parse: 4125
Second run:
Deletion: 443
Copy: 31
Crawl: 1631
Parse: 3986
Third run:
Deletion: 434
Copy: 4
Crawl: 1648
Parse: 4048
Note that the Crawl time includes the Deletion and Copy times. All three runs were performed on the same folder, with the same contents, only minutes apart. The same pattern occurs for each new folder I try this program on. Is this an issue with the file system, or is it something to do with my code?
The code I use to benchmark is here:
public void publish(boolean fastParse) throws IOException, InterruptedException {
    InfoHandler info = new InfoHandler();
    long start = System.currentTimeMillis();

    // "Crawl" covers reading the source tree into the queue,
    // which also performs the deletion and copying.
    if (fastParse) {
        crawler.fastReadIntoQueue(Paths.get(DirectoryCrawler.SOURCE_DIRECTORY).normalize());
    }
    else {
        crawler.readIntoQueue(Paths.get(DirectoryCrawler.SOURCE_DIRECTORY).normalize());
    }
    long read = System.currentTimeMillis();

    info.findLatestPosts(fileQueue);
    //info.findNavigationPages(fileQueue);
    long ps = System.currentTimeMillis();

    // "Parse" covers parsing the queued files and writing the info file.
    parser = new Parser();
    parser.parse(fileQueue);
    info.writeInfoFile();
    long pe = System.currentTimeMillis();

    System.out.println("Crawl: " + (read - start));
    System.out.println("Parse: " + (pe - ps));
}
And the class which defines the readIntoQueue method can be found here: https://github.com/pawandubey/griffin/blob/master/src/main/java/com/pawandubey/griffin/DirectoryCrawler.java
Upvotes: 1
Views: 143
Reputation: 202
Well, a possible answer (from http://www.tldp.org/LDP/tlk/fs/filesystem.html):
The Linux Virtual File system is implemented so that access to its files is as fast and efficient as possible. It must also make sure that the files and their data are kept correctly. These two requirements can be at odds with each other. The Linux VFS caches information in memory from each file system as it is mounted and used.
So your OS probably just caches file-system information. In that case, restarting the program won't "help" much (this is a feature of the OS, not of your application), and you should only be able to reproduce the slowness after an OS restart (or perhaps by somehow evicting the FS data from the caches). You can play with it to prove or disprove this theory.
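One concrete way to test it (a minimal, hypothetical harness, not part of the asker's program) is to walk the same tree twice in a single run and compare the timings; a slow first pass followed by a much faster second pass is consistent with the cache explanation:

import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

public class CacheCheck {
    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args[0]);      // folder to crawl, passed on the command line
        for (int run = 1; run <= 2; run++) {
            long start = System.currentTimeMillis();
            final long[] files = {0};
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                    files[0]++;              // touch the metadata of every file
                    return FileVisitResult.CONTINUE;
                }
            });
            System.out.println("Run " + run + ": " + files[0] + " files in "
                    + (System.currentTimeMillis() - start) + " ms");
        }
    }
}

To get a genuinely cold first pass, the file-system caches must be empty beforehand, e.g. right after a reboot, or (on most Linux systems, as root) after running sync and writing 3 to /proc/sys/vm/drop_caches.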
Upvotes: 1
Reputation: 5619
It's because Linux usually caches everything pretty aggressively, following the principle that free RAM is wasted RAM. But don't worry: as soon as an application needs more RAM, Linux just drops the cache.
You can run free on your system; the output would look something like this:
$ free -h
                     total      used      free    shared   buffers    cached
Mem:                  2,9G      2,7G      113M       55M       92K      1,0G
-/+ buffers/cache:              1,8G      1,1G
Swap:                 1,9G      100M      1,8G
In the first row, the «used» column (2,7G) is the amount of memory used by the system, including the cache. In the second row, the same column is the amount of memory used by applications alone. You can think of the difference between the rows as effectively free memory: in this example only 113M is literally free, but the second row shows 1,1G free once buffers and cache are excluded, because as soon as an application needs that memory, the system, as mentioned, just drops the cache.
You may read more about it here
Upvotes: 1