Joseph Budin

Reputation: 1360

How does os.listdir() perform on very large folders?

I plan on receiving a huge folder of data. The total size of the folder would be approximately 2TB, spread across about 2 million files. I will need to perform some processing on those files (mainly removing 99% of them).

I anticipate some issues due to the size of the data. In particular, I would like to know if Python is able to list these files correctly using os.listdir() in a reasonable time.

For instance, I know from experience that in some cases, deleting huge folders like this one on Ubuntu can be painful.

Upvotes: 4

Views: 1955

Answers (2)

ShadowRanger

Reputation: 155506

os.scandir was created largely because of issues with using os.listdir on huge directories, so I would expect os.listdir to suffer in the scenario you describe. os.scandir should perform better, both because it can process the folder with lower memory consumption and because (typically) you benefit at least a little by avoiding per-entry stat calls (e.g. to distinguish files from directories).
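A minimal sketch of the difference, assuming the processing is deleting files that fail some filter (the function name `prune_files` and the `keep_suffix` parameter are made up for illustration):

```python
import os

def prune_files(root, keep_suffix=".jpg"):
    """Delete every regular file in `root` whose name does not end
    with `keep_suffix`, streaming entries instead of building a
    2-million-element list the way os.listdir(root) would."""
    with os.scandir(root) as it:
        for entry in it:
            # entry.is_file() can usually answer from data returned by
            # the directory scan itself, avoiding a separate stat() call
            if entry.is_file() and not entry.name.endswith(keep_suffix):
                os.remove(entry.path)
```

os.scandir yields entries lazily, so memory stays flat regardless of directory size; with os.listdir you would hold all 2 million names in a list and then stat each one yourself.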

Upvotes: 6

lenik

Reputation: 23556

Unless you're given those millions of files already in the form of a huge folder, you can easily separate them while copying. For example, use the first few characters of each file name as the folder name:

abcoweowiejr.jpg goes to abc/ folder
012574034539.jpg goes to 012/ folder

and so on... This way you never have to read a folder that has millions of files.
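A sketch of that copying step, assuming the files arrive in some flat source folder (the function names and the 3-character prefix length are illustrative choices, not part of the answer):

```python
import os
import shutil

def bucket_path(dest_root, filename, prefix_len=3):
    """Map a file name to a sub-folder named after its first few
    characters, e.g. 'abcoweowiejr.jpg' -> 'abc/abcoweowiejr.jpg'."""
    bucket = os.path.join(dest_root, filename[:prefix_len])
    os.makedirs(bucket, exist_ok=True)
    return os.path.join(bucket, filename)

def copy_into_buckets(src_root, dest_root):
    # Copy each file into its prefix bucket so that no single
    # destination folder ends up holding millions of entries.
    with os.scandir(src_root) as it:
        for entry in it:
            if entry.is_file():
                shutil.copy2(entry.path, bucket_path(dest_root, entry.name))
```

With random-looking names, a 3-character prefix splits 2 million files into many buckets of a few thousand entries each, which any directory-listing tool handles comfortably.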

Upvotes: 2
