ManInMoon

Reputation: 7005

How can I improve this SUPERFAST directory size finder?

I have several massive directories (I cannot restructure for legacy reasons).

A typical directory is likely to contain 150K sub-directories, each of which has nested directories and maybe 4K files.

I am unable to get a directory size from Windows Explorer or via Cygwin using du. Both just keep processing for hours.

I have written my own code to solve this problem. It is very fast for smaller folders, but still slow for these massive ones.

Can anyone improve?

(If you have a completely different solution I would be glad to hear of it too.)

var size = GetDirectorySize3b(@"C:\MyMassiveFolder");

        // Shared accumulator for the parallel workers (updated via Interlocked.Add).
        private long _size;

        public long GetDirectorySize3b(string parentDirectory)
        {
            // NOTE: files sitting directly in parentDirectory are not counted,
            // only the contents of its sub-directories.
            var dirs = Directory.GetDirectories(parentDirectory);
            var llDirs = SplitIntoLists(dirs.ToList(), 10);
            return ParallelDirSizeLLS(llDirs);
        }

        public List<List<string>> SplitIntoLists(List<string> l, int numLists)
        {
            var lls = new List<List<string>>();

            int listLength = l.Count / numLists + 1;
            for (int i = 0; i < l.Count; i += listLength)
            {
                lls.Add(l.Skip(i).Take(listLength).ToList());
            }

            return lls;
        }

        public long ParallelDirSizeLLS(List<List<string>> lls)
        {
            _size = 0;

            Parallel.ForEach(lls,
                //new ParallelOptions { MaxDegreeOfParallelism = 30 },
                ParallelDirSizeL);

            return _size;
        }

        private void ParallelDirSizeL(List<string> l)
        {
            foreach (var dir in l)
            {
                var ds = GetDirectorySize3(dir);
                Interlocked.Add(ref _size, ds);
            }
        }

        public long GetDirectorySize3(string parentDirectory)
        {
            // Requires a COM reference to the Microsoft Scripting Runtime (Scripting).
            var fso = new Scripting.FileSystemObject();
            Scripting.Folder folder = fso.GetFolder(parentDirectory);
            long dirSize = (long)folder.Size;

            Marshal.ReleaseComObject(fso);

            return dirSize;
        }

Upvotes: 2

Views: 350

Answers (6)

Dmytro Marchuk

Reputation: 476

Since storage devices do I/O synchronously, you will not get any speed benefit from parallelization of read operations.

A better approach might be to cache as much as possible into RAM and then process it in parallel. On the project I work on, the approach we use for operating on files on NTFS is to cache MFT records. However, that relies on hand-written file-system parsing code with a lot of man-hours in it, which is not a practical solution for you.

So you may want to find existing source code that does this for you. This link mentions two open-source fast-search implementations for NTFS which are worth a look, because they do exactly that: cache the MFT in memory for super-fast search. They do not solve your problem directly, but they do have source code for the approach.

It is a pretty low-level solution, but in my opinion every other method will give results similar to those already discussed, because processing a file or folder means reading its MFT record, and each record is typically 1 KB in size. A disk, however, services one 2 MB read faster than 2048 separate 1 KB reads, and records read together may physically sit near each other, in which case caching helps as well. The products mentioned do this for search, but you can reuse their code to determine file sizes.
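For comparison, here is a much simpler, purely managed approximation of the "read the metadata in one pass, then aggregate" idea. It does not touch the MFT directly, the path is only illustrative, and on Windows the FileInfo objects returned by the enumeration generally come back pre-populated from the directory-listing data, so summing Length usually does not hit the disk again per file:

    using System;
    using System.IO;
    using System.Linq;

    class SinglePassSize
    {
        // One recursive enumeration; sizes are read from the cached
        // directory-listing data rather than queried file by file.
        static long GetTreeSize(string root)
        {
            return new DirectoryInfo(root)
                .EnumerateFiles("*", SearchOption.AllDirectories)
                .Sum(f => f.Length);
        }

        static void Main()
        {
            // Illustrative path; inaccessible sub-directories will throw
            // UnauthorizedAccessException with this simple form.
            Console.WriteLine(GetTreeSize(@"C:\MyMassiveFolder"));
        }
    }

This will still be bound by how fast the file system can serve the enumeration, which is the point of the answer above: the MFT-level caching approach avoids exactly that per-directory cost.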

Upvotes: 1

Chief Wiggum

Reputation: 2934

I usually use the free version of TreeSize to get the size of massive folder structures. It takes its time, but so far it has always delivered:

TreeSize Free

Upvotes: 0

Aladdin

Reputation: 339

Actually, I suggest taking a very different approach to the problem.

My solution is based on how the file names a folder contains are collected. The OS-dependent methods for enumerating sub-folders and files are relatively slow for massive numbers of files, so you should go directly to the underlying file system and read the file structure from there.

Most Windows drives are formatted with NTFS, and there is a very efficient library for reading the file system directly. I will put a link to the library's source and an example of how to use it in the comments.

Upvotes: 0

rmalchow

Reputation: 2769

This basic Java class:

import java.io.File;
import java.util.concurrent.atomic.AtomicLong;

public class DirSize {

    // running totals, updated from the single-threaded recursion
    private static AtomicLong l = new AtomicLong();
    private static AtomicLong files = new AtomicLong();
    private static AtomicLong dirs = new AtomicLong();

    public static void recurse(File f) {
        if (f == null) {
            return;
        }
        if (f.isDirectory()) {
            dirs.getAndIncrement();
            File[] children = f.listFiles();   // null if the directory is unreadable
            if (children == null) {
                return;
            }
            for (File fc : children) {
                recurse(fc);
            }
        } else {
            files.getAndIncrement();
            l.getAndAdd(f.length());
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        recurse(new File("/usr"));
        long end = System.currentTimeMillis();
        System.out.println(end-start+" ms");
        System.out.println(files.get()+" files");
        System.out.println(dirs.get()+" dirs");
        System.out.println("size: "+l.get());
        System.out.println("size: "+(l.get()/(1024*1024))+" MB");
        double secs = (double)(end-start) / 1000d;
        double f = (double)files.get();
        System.out.println(Math.round(f/secs)+" files/s ");
    }

}

gives me:

11631 ms
386589 files
33570 dirs
size: 93068412461
size: 88756 MB
33238 files/s 

on a first-time run (but without a fresh OS reboot). This is macOS on a MacBook Pro with an SSD whose sequential read/write speed is above 700 MB/s. The point here is probably less the throughput than the fact that an SSD has essentially no seek time, because reading a file's size is an IOP, but a tiny one.

What disks are you running on? What filesystem? Does it have to be Windows?

Upvotes: 0

cpsaez

Reputation: 324

Why not use a FileSystemWatcher to monitor the directories and keep the size pre-calculated? Maybe create a SQLite file in the top directory with a table of all files and their properties, including size. When a file is created/modified/deleted, FileSystemWatcher notifies your app and you update the database, so size queries become fast. It's just an idea.
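A minimal in-memory sketch of that idea, assuming it is acceptable to keep the cache in the process rather than in SQLite. The class name and path are illustrative, and the usual FileSystemWatcher caveats (buffer overruns, missed events while the app is down) still apply:

    using System;
    using System.Collections.Generic;
    using System.IO;

    class WatchedSizeCache
    {
        private readonly object _gate = new object();
        private readonly Dictionary<string, long> _files = new Dictionary<string, long>();
        private readonly FileSystemWatcher _watcher;
        private long _total;

        public long Total { get { lock (_gate) return _total; } }

        public WatchedSizeCache(string root)
        {
            // One slow initial scan seeds the cache; afterwards only deltas are applied.
            // (Real code would skip inaccessible sub-directories instead of throwing.)
            foreach (var f in Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories))
                Apply(f);

            _watcher = new FileSystemWatcher(root)
            {
                IncludeSubdirectories = true,
                NotifyFilter = NotifyFilters.FileName | NotifyFilters.Size | NotifyFilters.LastWrite
            };
            _watcher.Created += (s, e) => Apply(e.FullPath);
            _watcher.Changed += (s, e) => Apply(e.FullPath);
            _watcher.Deleted += (s, e) => Remove(e.FullPath);
            _watcher.Renamed += (s, e) => { Remove(e.OldFullPath); Apply(e.FullPath); };
            _watcher.EnableRaisingEvents = true;
        }

        private void Apply(string path)
        {
            var fi = new FileInfo(path);
            if (!fi.Exists) return;                 // ignore directory events and races
            lock (_gate)
            {
                _files.TryGetValue(path, out long old);
                _files[path] = fi.Length;
                _total += fi.Length - old;
            }
        }

        private void Remove(string path)
        {
            lock (_gate)
            {
                if (_files.TryGetValue(path, out long old))
                {
                    _files.Remove(path);
                    _total -= old;
                }
            }
        }
    }

After construction, reading Total is instant; the initial scan is the only expensive step, and persisting the dictionary to SQLite (as suggested above) would let the app avoid repeating it on every start.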

Upvotes: 1

Igor

Reputation: 309

I am not sure about a solution, but maybe you can try using the Microsoft Indexing Service? It stores info about all indexed files, including their size.

I found some info: http://www.thejoyofcode.com/Using_Windows_Search_in_your_applications.aspx
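If the folder is covered by the index, Windows Search (the successor to Indexing Service) exposes its catalogue through an OLE DB provider, and a rough sketch of pulling sizes from it might look like the following. The scope path is illustrative, only files the indexer has picked up are counted, and the Search SQL dialect does not support aggregates, so the summing happens on the client:

    using System;
    using System.Data.OleDb;

    class IndexedFolderSize
    {
        static void Main()
        {
            // Standard connection string for the Windows Search OLE DB provider.
            const string conn =
                "Provider=Search.CollatorDSO;Extended Properties='Application=Windows';";
            // Illustrative scope; results are limited to what the indexer covers.
            const string query =
                "SELECT System.Size FROM SystemIndex " +
                "WHERE SCOPE='file:C:/MyMassiveFolder'";

            long total = 0;
            using (var connection = new OleDbConnection(conn))
            using (var command = new OleDbCommand(query, connection))
            {
                connection.Open();
                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        if (!reader.IsDBNull(0))
                            total += Convert.ToInt64(reader.GetValue(0));
                    }
                }
            }

            Console.WriteLine(total + " bytes (indexed files only)");
        }
    }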

Upvotes: 1
