J Singh

Reputation: 35

Issue sorting lots of files in Python

I have a directory with over 10,000 files, all with the same extension and the same naming pattern, e.g.:

20150921(1)_0001.sgy
20150921(1)_0002.sgy
20150921(1)_0003.sgy
20150921(1)_0004.sgy
...
20150921(1)_13290.sgy

The code I'm currently using is:

from os import listdir
files = listdir('full data')
files.sort()

However, this returns a list in the following order:

20150921(1)_0001.sgy
...
20150921(1)_0998.sgy
20150921(1)_0999.sgy
20150921(1)_1000.sgy
20150921(1)_10000.sgy
20150921(1)_10001.sgy
20150921(1)_10002.sgy
20150921(1)_10003.sgy
20150921(1)_10004.sgy
20150921(1)_10005.sgy
20150921(1)_10006.sgy
20150921(1)_10007.sgy
20150921(1)_10008.sgy
20150921(1)_10009.sgy
20150921(1)_1001.sgy
20150921(1)_10010.sgy

The problem only arises once the file numbers reach 10000: it seems sort can't order the names correctly when the number grows from four digits to five. Can anyone see a way around this?

Upvotes: 1

Views: 84

Answers (3)

Andy

Reputation: 50540

This is called a natural sort. You can use the third-party natsort package (pip install natsort) to do this:

from natsort import natsorted
import pprint

files = ['20150921(1)_0001.sgy',
'20150921(1)_0102.sgy',
'20150921(1)_0011.sgy',
'20150921(1)_0003.sgy',
'20150921(1)_0004.sgy',
'20150921(1)_0010.sgy',
'20150921(1)_1001.sgy',
'20150921(1)_0012.sgy',
'20150921(1)_0101.sgy',
'20150921(1)_1003.sgy',
'20150921(1)_0103.sgy',
'20150921(1)_10002.sgy',
'20150921(1)_1002.sgy',
'20150921(1)_10001.sgy',
'20150921(1)_0002.sgy',
]

pprint.pprint(natsorted(files))

This outputs:

['20150921(1)_0001.sgy',
 '20150921(1)_0002.sgy',
 '20150921(1)_0003.sgy',
 '20150921(1)_0004.sgy',
 '20150921(1)_0010.sgy',
 '20150921(1)_0011.sgy',
 '20150921(1)_0012.sgy',
 '20150921(1)_0101.sgy',
 '20150921(1)_0102.sgy',
 '20150921(1)_0103.sgy',
 '20150921(1)_1001.sgy',
 '20150921(1)_1002.sgy',
 '20150921(1)_1003.sgy',
 '20150921(1)_10001.sgy',
 '20150921(1)_10002.sgy']
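If installing a third-party package isn't an option, a rough standard-library equivalent is to split each name into digit and non-digit runs and compare the digit runs as integers. This is only a sketch: `natural_key` is a hypothetical helper name, not part of natsort, and it makes no attempt to cover the signs, floats, or locale handling that natsort offers.

```python
import re

def natural_key(name):
    # Hypothetical helper (not part of natsort).
    # re.split with a capturing group keeps the digit runs, so the result
    # alternates text, digits, text, digits, ... e.g.
    # '20150921(1)_0001.sgy' -> ['', '20150921', '(', '1', ')_', '0001', '.sgy']
    parts = re.split(r'(\d+)', name)
    # Odd positions are always the digit runs; turn those into ints
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

files = ['20150921(1)_10001.sgy', '20150921(1)_1001.sgy', '20150921(1)_0002.sgy']
print(sorted(files, key=natural_key))
```

Because every key alternates string and integer in the same positions, Python never ends up comparing a string with an int.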

Upvotes: 4

Chad S.

Reputation: 6633

The names are being sorted alphabetically, i.e. compared as strings character by character, so '10000' comes before '1001'. If you want to sort them by the number, you will need to do a bit of parsing first:

   import os
   import re

   def filename_to_tuple(name):
       # '20150921(1)_0001.sgy' -> (20150921, 1, 1)
       match = re.match(r'(\d+)\((\d+)\)_(\d+)\.sgy', name)
       if not match:
           raise ValueError("Filename doesn't match expected pattern")
       return tuple(int(i) for i in match.groups())

   sorted_files = sorted(os.listdir('full data'), key=filename_to_tuple)
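To see the difference without touching the directory, here is a small check on an in-memory list. The three names are made up from the pattern in the question, and `numeric_fields` is a hypothetical helper that simply grabs every digit run, so it is looser than a full pattern match.

```python
import re

names = ['20150921(1)_1001.sgy', '20150921(1)_10000.sgy', '20150921(1)_0999.sgy']

# Plain string sort: '_10000' lands before '_1001' because the first
# differing character is '0' versus '1'
string_order = sorted(names)
print(string_order)

def numeric_fields(name):
    # Hypothetical helper: every run of digits, converted to int
    return tuple(int(d) for d in re.findall(r'\d+', name))

# Numeric sort: tuples of ints compare by value, so 1001 < 10000
numeric_order = sorted(names, key=numeric_fields)
print(numeric_order)
```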

Upvotes: 0

inspectorG4dget

Reputation: 113915

import os
sorted_filenames = sorted(os.listdir('full data'), key=lambda s: int(s.rsplit('.',1)[0].split("_",1)[1]))
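One thing to be aware of with this one-liner: the key is only the number after the underscore, so the part before it is ignored, and a name with no underscore followed by digits will raise an error. That is fine if every file shares the same prefix, as in the question. A sketch of what happens otherwise, on an in-memory list (the second date and the name `trailing_number` are made up for illustration):

```python
def trailing_number(s):
    # Same expression as the lambda: the text between the first '_' and the last '.'
    return int(s.rsplit('.', 1)[0].split('_', 1)[1])

names = ['20150922(1)_0001.sgy', '20150921(1)_10000.sgy', '20150921(1)_0002.sgy']

# Number only: files from different prefixes get interleaved
by_number = sorted(names, key=trailing_number)
print(by_number)

# Prefix first, then number: each prefix stays grouped together
by_prefix_then_number = sorted(names, key=lambda s: (s.split('_', 1)[0], trailing_number(s)))
print(by_prefix_then_number)
```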

Upvotes: 0