Daniel_T
Daniel_T

Reputation: 39

Python scanning list for empty files

I am just looking for an opinion here...

I have a script which loops through a list of files (7.000 files to be exact) and removes them from the list, if the file size is 0kb.

files = [...]
files_0 = []

for f in files:
    if os.stat(f).st_size==0:
        files.remove(f)
        files_0.append(f)

Looping through the list takes some time. Would it be better if i used a dictionary {filename:filesize} and append each entry to the appropriate list?

Thanks for helping!

Upvotes: 0

Views: 757

Answers (4)

alani
alani

Reputation: 13079

The most efficient way is certainly to use sets for everything, as in Ronie Martinez's answer.

I'll take the liberty of quoting Ronie's code here, for ease of reference (although with a slight edit to use os.path.getsize):

files = [...]
files_0 = {f for f in files if os.path.getsize(f) == 0}
files = set(files) - files_0

As an addendum, here is what you could do if you actually care about the ordering for some reason:

files = [...]
files_0 = [f for f in files if os.path.getsize(f) == 0]

Followed by either:

files = [f for f in files if file not in files_0]  #

or better (because sets are much quicker than lists for testing inclusion):

files_0_set = set(files_0)
files = [f for f in files if f not in files_0_set]

In this case, although we don't specifically want to output the files_0_set, it is a useful intermediate.


By the way, the original code attempts to remove items from a list while iterating over it. This will not work reliably, and may lead to items being missed. It is fine to use an explicit loop if desired, instead of a set comprehension or list comprehension, but the removal of items must be deferred until after the loop.

Additionally, Ronie is correct that removing items from the middle of a list is slow, and therefore it is likely to be faster simply to construct a new list which excludes certain items. However, one situation where you might not want to do that, and therefore should use remove would be if files is a view of a list that you are also using via other references, so you want that list to be updated, not just to reassign the files variable. Here is an example of what you might do in that case. In this example, I will use an explicit loop when constructing files_0 for sake of clarity, although the list comprehension above could also still be used.

files_0 = []
for f in files:
    if os.path.getsize(f) == 0:
        files_0.append(f)

for f in files_0:
    files.remove(f)

The point is that you are doing the removal in a separate loop after the end of the loop over files, and this separate loop is over a different collection (in this case files_0) so you are not removing items from the list that you are iterating over.

Upvotes: 0

Abhishek D K
Abhishek D K

Reputation: 2425

try this ...

import os

files_0 = []
path = "<valid directory path>"

for rootdir,dirs,files in os.walk(path):  
  for f in files:  
    filename = os.path.join(root,f)   
      if os.stat(filename).st_size == 0:  
        print(" Removed file ",filename)  
        os.remove(filename)
        files_0.append(filename)

Upvotes: 0

Allan Juan
Allan Juan

Reputation: 2554

The remove operation can be pretty slow, as shown here. To be more specific, it has a time complexity of O(N). If you're not familiar with big O notation, that means in the worst case scenario you'll have to parse the entire list just to remove this single element.

You'll achieve a better performance by inverting your logic: instead of removing elements that shouldn't belong in the list, create another list and put only the valid elements there. E.g:

files = [ ... ]
valid_files = []
files_0 = []

for f in files:
    if os.stat(f).st_size == 0:
        files_0.append(f)
    else:
        valid_files.append(f)

Upvotes: 0

Ronie Martinez
Ronie Martinez

Reputation: 1276

Probably better to use a set comprehension because .append() and .remove() are not optimal. Then get the difference from the original list.

files = [...]
files_0 = {f for f in files if os.stat(f).st_size==0}
files = set(files) - files_0

Not tested though.

EDIT: if you need to retain the arrangement of items. Use a list comprehension because as stated above, .append() and .remove are not optimal. Creating a new list is faster.

Upvotes: 1

Related Questions