Reputation: 405
I have an external disk with a billion files. If I mount the external disk in computer A, my program scans all file paths and saves them in a database table. After that, when I eject the external disk, that data still remains in the table. The problem is, if some files are deleted on computer B and I then mount the disk on computer A again, I must synchronize the database table on computer A. However, I don't want to scan all the files again because it takes a lot of time and wastes a lot of memory. Is there any way to update the database table without scanning all files, whilst minimizing the memory used?
Besides, in my case, the memory limitation is more important than time, which means I would rather save memory than save time.
I think I can split the files into many sections and use some function (maybe SHA1?) to check whether any files in a section have been deleted. However, I cannot figure out how to split the files into sections. Can anyone help me or give me better ideas?
Upvotes: 8
Views: 249
Reputation: 1999
If memory is important I would go for the operating system facilities.
If you have ext4 I will presume you are on Unix (you can install find on other operating systems such as Windows). If that is the case, you could use the native find command (this example covers the last 24 hours; you can of course remember the last scan time and adjust the filter to whatever you like):

    find /directory_path -type f -mtime -1 -print
Of course you won't catch the deletes this way. If a heuristic algorithm works for you, then you can create a thread that slowly goes over each file stored in your database (whatever you need to display first, then from newer to older) and checks whether it still exists. This won't consume much memory. I reckon you won't be able to show a billion files to the user anyway.
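A minimal sketch of such a slow background check, assuming a SQLite table named files with a path column (the table and column names, batch size, and pause are assumptions, not part of the answer):

    import os
    import sqlite3
    import time

    def prune_missing(db_path, batch=1000, pause=0.01):
        # Walk the table in small batches so we never hold many rows in memory.
        conn = sqlite3.connect(db_path)
        last = ""
        while True:
            rows = conn.execute(
                "SELECT path FROM files WHERE path > ? ORDER BY path LIMIT ?",
                (last, batch)).fetchall()
            if not rows:
                break
            gone = [(p,) for (p,) in rows if not os.path.exists(p)]
            if gone:
                conn.executemany("DELETE FROM files WHERE path = ?", gone)
                conn.commit()
            last = rows[-1][0]
            time.sleep(pause)  # go slowly so the check stays cheap
        conn.close()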
Upvotes: 0
Reputation: 56
Do you have a list of what's deleted when the delete happens (or can you change whatever process does the deleting to create one)? If so, couldn't you keep a list of "I've been deleted" entries with a timestamp, and then pick up items from this list to synchronize only what's changed? Naturally, you would still want some kind of batch job to sync during a slow time on the server, but I think that could reduce the load.
Another option, depending on what is making the changes, may be to have that process update the databases (if you have multiple nodes) directly when it deletes. This would introduce some coupling between the systems, but it would be the most efficient way to do it.
The best ways, in my opinion, are some variation on the idea of messaging that a delete has occurred (even if that's just a file you write somewhere with a list of recently deleted files), or some kind of direct callback mechanism, either through code or by having the delete process adjust the application's persistent data store directly.
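A minimal sketch of consuming such a deletion log, assuming one "epoch-timestamp<TAB>path" entry per line and the same files table as above (the log format and names are assumptions):

    import sqlite3

    def apply_deletion_log(db_path, log_path, last_sync):
        # Remove rows for every path whose delete entry is newer than last_sync.
        conn = sqlite3.connect(db_path)
        with open(log_path, encoding="utf-8") as log:
            for line in log:
                stamp, _, path = line.rstrip("\n").partition("\t")
                if float(stamp) > last_sync:
                    conn.execute("DELETE FROM files WHERE path = ?", (path,))
        conn.commit()
        conn.close()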
Even with all this said, you would always need to have some kind of index synchronization or periodic sanity check on the indexes to be sure that everything is matched up correctly.
You could (and I would be shocked if you didn't have to, given the number of files you have) partition the file space into folders with, say, 5,000-10,000 files per folder, and then create a simple file holding a hash of the names of all the files in that folder. This would catch deletes, but I still think that a direct callback of some form when the delete occurs is a much better idea. If you have one monolithic folder with all this stuff, creating something to break it into separate folders (we used simple numbers under the main folder so we could go on ad nauseam) should speed everything up greatly; even if you only do this for new files and leave the old files in place as is, at least you could stop the bleeding on file retrieval.
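A minimal sketch of such a per-folder name hash (the helper name is made up; SHA-1 is used because the question itself suggested it):

    import hashlib
    import os

    def folder_digest(folder):
        # Hash the sorted file names; if the digest matches the one stored at the
        # last scan, nothing in this folder was added, removed or renamed.
        h = hashlib.sha1()
        for name in sorted(os.listdir(folder)):
            h.update(name.encode("utf-8"))
            h.update(b"\0")  # separator so "ab"+"c" != "a"+"bc"
        return h.hexdigest()

Only folders whose digest changed need to be re-listed in detail and reconciled against the database.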
In my opinion, since you are programmatically controlling an index of the files, you should really have the same program involved somehow (or notified) when changes occur at the time of change to the underlying file system, as opposed to allowing changes to happen and then looking through everything for updates. Naturally, to catch the outliers where this communication breaks down, you should also have synchronization code in there to actually check what is in the file system and update the index periodically (although this could and probably should be batched out of process to the main application).
Upvotes: 0
Reputation: 111239
If you don't have control over the file system on the disk you have no choice but to scan the file names on the entire disk. To list the files that have been deleted you could do something like this:
    update files in database: set "seen on this scan" to false
    for each file on disk do:
        insert/update database, setting "seen on this scan" to true
    done
    deleted files = select from files where "seen on this scan" = false
A solution to the db performance problem could be accumulating the file names into a list of some kind and doing a bulk insert/update whenever you reach, say, 1000 files.
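A minimal sketch of this mark-and-sweep pass with batched writes, assuming a SQLite table files(path TEXT PRIMARY KEY, seen INTEGER) (the schema and batch size are assumptions):

    import os
    import sqlite3

    def sync(db_path, root, batch=1000):
        conn = sqlite3.connect(db_path)
        conn.execute("UPDATE files SET seen = 0")
        buf = []

        def flush():
            conn.executemany(
                "INSERT OR REPLACE INTO files(path, seen) VALUES (?, 1)", buf)
            conn.commit()
            buf.clear()

        for dirpath, _dirs, names in os.walk(root):
            for name in names:
                buf.append((os.path.join(dirpath, name),))
                if len(buf) >= batch:
                    flush()
        if buf:
            flush()
        # Anything still unseen was deleted while the disk was elsewhere.
        for (path,) in conn.execute("SELECT path FROM files WHERE seen = 0"):
            print("deleted:", path)
        conn.close()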
As for directories with 1 billion files, you just need to replace the code that lists the files with something that wraps the C functions opendir and readdir, so that entries are streamed rather than loaded all at once. If I were you I wouldn't worry about it too much for now. No sane person has 1 billion files in one directory because that sort of thing cripples file systems and common OS tools, so the risk is low and the solution is easy.
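If the scanner happens to be written in Python, for instance, os.scandir already streams entries (it is built on opendir/readdir on POSIX) instead of building the whole list; a sketch of such a streaming walk, with an explicit stack as just one possible shape:

    import os

    def iter_files(root):
        # Yields file paths one at a time; never holds a full listing in memory.
        stack = [root]
        while stack:
            current = stack.pop()
            with os.scandir(current) as it:
                for entry in it:
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    else:
                        yield entry.path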
Upvotes: 1
Reputation: 718768
In theory, you could speed things up by checking "modified" timestamps on directories. If a directory has not been modified, then you don't need to check any of the files in that directory. Unfortunately, you do need to scan possible subdirectories, and finding them involves scanning the directory ... unless you've saved the directory tree structure.
And of course, this is moot if you've got a flat directory containing a billion files.
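A minimal sketch of that pruning, assuming the directory paths and the mtimes recorded at the last scan have been saved (here as a dict loaded from the database; that representation is an assumption):

    import os

    def changed_dirs(saved_dirs):
        # saved_dirs: {dir_path: mtime recorded at the last scan}. A directory's
        # mtime changes when entries are added, removed or renamed in it, so only
        # these directories need their file lists re-read.
        for dir_path, old_mtime in saved_dirs.items():
            try:
                if os.stat(dir_path).st_mtime != old_mtime:
                    yield dir_path
            except FileNotFoundError:
                yield dir_path  # the directory itself was deleted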
I imagine that you are assembling all of the file paths in memory so that you can sort them before querying the database. (And sorting them is a GOOD idea ...) However, there is an alternative to sorting in memory:
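One such alternative (an assumption here, not spelled out by the answer) is an external merge sort: sort fixed-size chunks, spill them to temporary files, then merge, so only one chunk is ever held in memory. A sketch, assuming paths contain no newlines:

    import heapq
    import os
    import tempfile

    def external_sort(paths, chunk_size=100_000):
        # Sort an arbitrarily long stream of paths using bounded memory.
        chunks = []
        buf = []
        for p in paths:
            buf.append(p)
            if len(buf) >= chunk_size:
                chunks.append(_spill(sorted(buf)))
                buf.clear()
        if buf:
            chunks.append(_spill(sorted(buf)))
        files = [open(c, encoding="utf-8") for c in chunks]
        try:
            for line in heapq.merge(*files):
                yield line.rstrip("\n")
        finally:
            for f in files:
                f.close()
                os.unlink(f.name)

    def _spill(sorted_lines):
        # Write one sorted chunk to a temp file and return its path.
        fd, name = tempfile.mkstemp(text=True)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.writelines(line + "\n" for line in sorted_lines)
        return name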
(Do you REALLY have a billion files on a disc? That sounds like a bad design for your data store ...)
Upvotes: 0