Reputation: 1360
I plan on getting a huge folder of data. The total size of the folder would be approximately 2 TB,
spread across about 2 million files. I will need to perform some processing on those files (mainly removing 99% of them).
I anticipate some issues due to the size of the data. In particular, I would like to know whether Python can list all of these files with os.listdir()
in a reasonable time.
For instance, I know from experience that in some cases, deleting huge folders like this one on Ubuntu can be painful.
Upvotes: 4
Views: 1955
Reputation: 155506
os.scandir was created largely because of issues with using os.listdir on huge directories,
so I would expect os.listdir to suffer in the scenario you describe, while os.scandir
should perform better, both because it can process the folder with lower memory consumption
and because you (typically) benefit at least a little from avoiding per-entry stat
calls (e.g. to distinguish files from directories).
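For illustration, here is a minimal sketch (not from the original answer) of streaming a huge directory with os.scandir and deleting entries that match some rule. The should_delete predicate, the prune_directory name, and the example path are hypothetical placeholders.

    import os

    # Hypothetical predicate: keep only .jpg files, delete everything else.
    # Replace this with whatever selection rule actually applies.
    def should_delete(entry):
        return not entry.name.endswith(".jpg")

    def prune_directory(path):
        deleted = 0
        # os.scandir yields entries lazily, so the ~2 million names are never
        # held in memory at once, unlike the list built by os.listdir.
        with os.scandir(path) as it:
            for entry in it:
                # DirEntry.is_file() can usually answer from data cached while
                # reading the directory, avoiding an extra stat() per file.
                if entry.is_file() and should_delete(entry):
                    os.remove(entry.path)
                    deleted += 1
        return deleted

    # prune_directory("/data/huge_folder")  # path is an assumption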
Upvotes: 6
Reputation: 23556
Unless you're given those millions of files already sitting in one huge folder, you can easily separate them while copying, for example by using the first few characters of the filename as the folder name:
abcoweowiejr.jpg goes to the abc/ folder
012574034539.jpg goes to the 012/ folder
and so on... This way you never have to read a folder that holds millions of files.
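A rough sketch of that idea follows; the shard_copy name, the paths, and the 3-character prefix length are illustrative assumptions, not part of the answer.

    import os
    import shutil

    def shard_copy(src_dir, dest_root, prefix_len=3):
        # Walk the source directory lazily and copy each file into a
        # subfolder named after the first prefix_len characters of its name,
        # e.g. "abcoweowiejr.jpg" -> dest_root/abc/abcoweowiejr.jpg
        with os.scandir(src_dir) as it:
            for entry in it:
                if not entry.is_file():
                    continue
                subdir = os.path.join(dest_root, entry.name[:prefix_len])
                os.makedirs(subdir, exist_ok=True)
                shutil.copy2(entry.path, os.path.join(subdir, entry.name))

    # shard_copy("/incoming", "/data/sharded")  # paths are assumptions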
Upvotes: 2