Reputation: 933
I need to checksum every single file on a given USB disk in a C# application. I suspect the bottleneck here is the actual read off the disk, so I'm looking to make this as fast as possible.
I suspect this would be much quicker if I could read the files on the disk sequentially, in the actual order they appear on the disk (assuming the drive is not fragmented).
How can I find this information for each file from its standard path? I.e., given a file at "F:\MyFile.txt", how can I find the start location of this file on the disk?
I'm running a C# application in Windows.
Upvotes: 1
Views: 655
Reputation: 111860
Now... I don't really know if it will be useful for you:
using System;
using System.ComponentModel;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

// Input buffer for FSCTL_GET_RETRIEVAL_POINTERS: the VCN at which to start.
[StructLayout(LayoutKind.Sequential)]
public struct StartingVcnInputBuffer
{
    public long StartingVcn;
}

public static readonly int StartingVcnInputBufferSizeOf = Marshal.SizeOf(typeof(StartingVcnInputBuffer));
// Output buffer for FSCTL_GET_RETRIEVAL_POINTERS, sized for a single extent:
// ExtentCount, the starting VCN, and one (NextVcn, Lcn) pair.
[StructLayout(LayoutKind.Sequential)]
public struct RetrievalPointersBuffer
{
    public uint ExtentCount;
    public long StartingVcn;
    public long NextVcn;
    public long Lcn;
}

public static readonly int RetrievalPointersBufferSizeOf = Marshal.SizeOf(typeof(RetrievalPointersBuffer));
[DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
public static extern SafeFileHandle CreateFileW(
[MarshalAs(UnmanagedType.LPWStr)] string filename,
[MarshalAs(UnmanagedType.U4)] FileAccess access,
[MarshalAs(UnmanagedType.U4)] FileShare share,
IntPtr securityAttributes,
[MarshalAs(UnmanagedType.U4)] FileMode creationDisposition,
[MarshalAs(UnmanagedType.U4)] FileAttributes flagsAndAttributes,
IntPtr templateFile);
[DllImport("kernel32.dll", ExactSpelling = true, SetLastError = true, CharSet = CharSet.Auto)]
static extern bool DeviceIoControl(IntPtr hDevice, uint dwIoControlCode,
ref StartingVcnInputBuffer lpInBuffer, int nInBufferSize,
out RetrievalPointersBuffer lpOutBuffer, int nOutBufferSize,
out int lpBytesReturned, IntPtr lpOverlapped);
// Returns a FileStream that can only Read
public static void GetStartLogicalClusterNumber(string fileName, out FileStream file, out long startLogicalClusterNumber)
{
    SafeFileHandle handle = CreateFileW(fileName,
        FileAccess.Read | (FileAccess)0x80 /* FILE_READ_ATTRIBUTES */,
        FileShare.Read, IntPtr.Zero, FileMode.Open, 0, IntPtr.Zero);

    if (handle.IsInvalid)
    {
        throw new Win32Exception();
    }

    file = new FileStream(handle, FileAccess.Read);
    var svib = new StartingVcnInputBuffer();
    RetrievalPointersBuffer rpb;
    int bytesReturned;

    bool success = DeviceIoControl(handle.DangerousGetHandle(),
        (uint)589939 /* FSCTL_GET_RETRIEVAL_POINTERS */,
        ref svib, StartingVcnInputBufferSizeOf,
        out rpb, RetrievalPointersBufferSizeOf,
        out bytesReturned, IntPtr.Zero);

    if (success)
    {
        startLogicalClusterNumber = rpb.Lcn;
        return;
    }

    // Only read the last error when the call actually failed, so a stale
    // error code can't be mistaken for a failure.
    int error = Marshal.GetLastWin32Error();

    switch (error)
    {
        case 38: /* ERROR_HANDLE_EOF */
            startLogicalClusterNumber = -1; // empty file, no clusters allocated. Choose how to handle.
            break;

        case 234: /* ERROR_MORE_DATA */
            // More extents exist than fit in the buffer; the first LCN is all we need.
            startLogicalClusterNumber = rpb.Lcn;
            break;

        default:
            throw new Win32Exception(error);
    }
}
Note that the method returns a FileStream that you can keep open and use to read the file, or you can easily modify it to not create (and not return) the stream and then reopen the file when you want to hash it.
To use:
string[] fileNames = Directory.GetFiles(@"D:\");

foreach (string fileName in fileNames)
{
    try
    {
        long startLogicalClusterNumber;
        FileStream file;
        GetStartLogicalClusterNumber(fileName, out file, out startLogicalClusterNumber);
        // Remember to dispose 'file' once you have hashed it (or modify the
        // method to not open a stream at all).
    }
    catch (Exception e)
    {
        Console.WriteLine("Skipping: {0} for {1}", fileName, e.Message);
    }
}
I'm using the API described here: https://web.archive.org/web/20160130161216/http://www.wd-3.com/archive/luserland.htm . The program is much simpler than it could be, because you only need the initial Logical Cluster Number (a first version of this code extracted all the LCN extents, but that would be useless: you have to hash a file from its first byte to its last anyway, so only the starting position matters for ordering). Note that empty files (files with length 0) don't have any cluster allocated; for them the function returns -1 (ERROR_HANDLE_EOF). You can choose how to handle that.
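For completeness, here is an untested sketch of how you could put it all together: collect each file's starting LCN, sort by it, and hash in that order. HashInDiskOrder is a made-up name, and SHA256 just stands in for whatever checksum you actually need:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Sketch: collect each file's starting LCN, sort, then hash in on-disk order.
public static void HashInDiskOrder(string root)
{
    var entries = new List<Tuple<string, long, FileStream>>();

    foreach (string fileName in Directory.GetFiles(root))
    {
        try
        {
            FileStream file;
            long lcn;
            GetStartLogicalClusterNumber(fileName, out file, out lcn);
            entries.Add(Tuple.Create(fileName, lcn, file));
        }
        catch (Exception e)
        {
            Console.WriteLine("Skipping: {0} for {1}", fileName, e.Message);
        }
    }

    using (var sha = SHA256.Create())
    {
        // Empty files (LCN == -1) simply sort first; everything else
        // follows the physical order of its first cluster.
        foreach (var entry in entries.OrderBy(e => e.Item2))
        {
            using (FileStream stream = entry.Item3)
            {
                byte[] hash = sha.ComputeHash(stream);
                Console.WriteLine("{0}: {1}", entry.Item1, BitConverter.ToString(hash));
            }
        }
    }
}
One caveat with this sketch: it keeps every FileStream open until its turn comes, which can exhaust handles on directories with thousands of files. Storing only the path and LCN, and reopening each file inside the sorted loop, avoids that.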
Upvotes: 1
Reputation: 35905
If your drives are SSDs or based on memory stick technology - forget it.
Memory sticks and similar devices are generally built on flash (SSD-like) technology, where random read access carries no seek penalty, so you can just enumerate the files and run your checksum over each one.
You can try running this in several threads, but I am not sure that would speed up the process; it's something you need to measure, and it may also vary from device to device. A sketch of the multi-threaded variant follows below.
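For what it's worth, a minimal sketch of the threaded version you could benchmark (SHA256 is just an example checksum; compare the wall-clock time against a plain foreach to see whether your device benefits):
using System;
using System.IO;
using System.Security.Cryptography;
using System.Threading.Tasks;

// Each iteration gets its own hash instance: SHA256 objects are not thread-safe.
Parallel.ForEach(Directory.GetFiles(@"F:\"), fileName =>
{
    using (var sha = SHA256.Create())
    using (FileStream stream = File.OpenRead(fileName))
    {
        byte[] hash = sha.ComputeHash(stream);
        Console.WriteLine("{0}: {1}", fileName, BitConverter.ToString(hash));
    }
});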
Bonus
@xanatos mentioned an interesting point: "I always noticed that copying thousand of files on a memory stick is much slower than copying a single big file"
It is indeed much faster to copy one big file than a pile of small files, and the reason is (usually) not that the big file's data sits close together and is therefore easier for the hardware to read sequentially. The overhead comes from the OS, which has to do bookkeeping for every single file.
If you ever run procmon on Windows, you will observe a huge number of FileCreate, FileRead and FileWrite operations. To copy 100 files, the OS has to open each file, read its content, write it to another file, and close both files, plus it sends lots of update operations to the file system: updating attributes for both files, updating security descriptors for both files, updating directory information, and so on. So one copy operation drags along many satellite operations.
Upvotes: 1