J280694

Reputation: 21

Python list sort creating different orders to expected output

I have a script to move items around and perform some basic functions on them. It relies on list.sort() to make sure the files are going to the right places.

For example I have 11 files:

A1_S1_ETC.ext
A2_S2_ETC.ext
...
...
A10_S10_ETC.ext
A11_S11_ETC.ext

The script asks for a path and output, from this I create two sorted lists using os and glob:

import glob
import os

pathA = raw_input()  # input directory
listA = glob.glob(os.path.join(pathA, '*.ext'))
listA.sort()
outp = raw_input()  # output location
filen = [x.split(pathA)[1].split('_')[0] for x in listA]
filen.sort()
outp1 = [pathA + s + '/' for s in filen]
outp1.sort()

But when printed:

print listA
['A10_S10_ETC.ext', 'A11_S11_ETC.ext', 'A1_S1_ETC.ext', 'A2_S2_ETC.ext']
print outp1
['/user/path/A1/', '/user/path/A10/', '/user/path/A11/', '/user/path/A2/']

I guess it's the '_SXX' part in the file name that's impacting the sort function? I don't care how it's sorted, as long as A1 files go into A1 directory - not just for this nomenclature but for any possible string.

Is there a way to do this - perhaps by asking the list.sort function to sort until the first underscore?

Upvotes: 2

Views: 174

Answers (3)

zvone

Reputation: 19382

What you want is called natural sort. See this thread about it: Does Python have a built in function for string natural sort?
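A common way to get this ordering is a key function that splits each name into text and digit runs and compares the digit runs as integers. The sketch below is only an illustration: natural_key is a made-up helper name, not something built into Python, and the file names are the ones from the question.

```python
import re

def natural_key(name):
    # re.split with a capturing group alternates text and digit runs:
    # 'A10_S10_ETC.ext' -> ['A', '10', '_S', '10', '_ETC.ext'].
    # Pieces at odd positions are the digit runs, so convert those to int.
    parts = re.split(r'(\d+)', name)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

names = ['A10_S10_ETC.ext', 'A11_S11_ETC.ext', 'A1_S1_ETC.ext', 'A2_S2_ETC.ext']
natural = sorted(names, key=natural_key)
print(natural)
```

Because text pieces always sit at even positions and numbers at odd positions, two keys never compare a string against an integer.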

Upvotes: 1

skyking

Reputation: 14400

Sorting is lexicographic: strings are compared character by character, and ASCII characters are ordered by their ASCII codes. The code for '0' is 48 and the code for '_' is 95, so '0' < '_', which is why 'A10_...' sorts before 'A1_...'.
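This can be checked quickly with the file names from the question. In the sketch below, dirs is only an illustrative stand-in for outp1 (prefix plus a trailing slash). The slash has code 47, which is below '0', so the directory list ends up in a different order from the file list.

```python
# File names from the question; dirs is an illustrative stand-in for outp1.
names = ['A1_S1_ETC.ext', 'A2_S2_ETC.ext', 'A10_S10_ETC.ext', 'A11_S11_ETC.ext']
dirs = [n.split('_')[0] + '/' for n in names]

# Character codes that decide the comparison at the third position.
codes = {ch: ord(ch) for ch in "/0_"}
print(codes)

# '0' (48) < '_' (95): 'A10_...' and 'A11_...' sort before 'A1_...'.
sorted_names = sorted(names)
print(sorted_names)

# '/' (47) < '0' (48): 'A1/' sorts before 'A10/'.
sorted_dirs = sorted(dirs)
print(sorted_dirs)
```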

To get a consistent order, you can supply the same comparison function to every sort. For example:

def mycmp(s1, s2):
    s1 = s1.split(pathA)[1].split('_')[0]        
    s2 = s2.split(pathA)[1].split('_')[0]
    return cmp(s1, s2)

outp1.sort(cmp=mycmp)

The point is that the same transformation is applied to both strings before they are compared.

Because this transformation discards information, elements that differ only after the first underscore compare as equal. In your case that does not matter: such elements would map to the same entry in outp1 anyway.
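For reference, the cmp argument of list.sort and the cmp() builtin no longer exist in Python 3. A rough equivalent with a key function might look like the sketch below. The value of pathA and the sample entries are placeholders, and prefix_key is a made-up name.

```python
# Placeholder values for illustration; pathA must be the prefix of every entry.
pathA = '/user/path/'
listA = ['/user/path/A2_S2_ETC.ext', '/user/path/A10_S10_ETC.ext',
         '/user/path/A1_S1_ETC.ext', '/user/path/A11_S11_ETC.ext']
outp1 = ['/user/path/A2/', '/user/path/A10/',
         '/user/path/A1/', '/user/path/A11/']

def prefix_key(s):
    # Same transformation as mycmp: drop pathA, keep the text before the first '_'.
    return s.split(pathA)[1].split('_')[0]

# Sorting both lists with the same key keeps them in matching order.
listA.sort(key=prefix_key)
outp1.sort(key=prefix_key)
print(listA)
print(outp1)
```

If the comparison function has to stay as it is, functools.cmp_to_key can wrap it so that it can be passed as key.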

Otherwise you would have to apply the sort before you transform the names. That means not sorting filen or outp1 separately, because their order would then follow from the order of listA.
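One way to read that suggestion: sort the file list once and derive each target directory from its own file, so the pairing never depends on two lists happening to sort the same way. The sketch below uses hypothetical paths in place of the glob results, and target_dir is a made-up helper name.

```python
import os

def target_dir(file_path, out_root):
    # Directory name is the part of the file name before the first underscore.
    prefix = os.path.basename(file_path).split('_')[0]
    return os.path.join(out_root, prefix)

# Hypothetical paths standing in for the glob results.
listA = sorted(['/user/path/A10_S10_ETC.ext', '/user/path/A1_S1_ETC.ext',
                '/user/path/A2_S2_ETC.ext', '/user/path/A11_S11_ETC.ext'])

# Each file is paired with its own directory; no second sort is needed.
pairs = [(f, target_dir(f, '/user/path')) for f in listA]
for src, dst in pairs:
    print(src, '->', dst)
```

With this layout the sort order only affects the order in which files are processed, not where they end up.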

Upvotes: 1

Ayush

Reputation: 3965

Sorting strings in Python is lexicographic: strings are compared character by character. So 'A10' and 'A11' come before 'A1_'.

You can get the expected behaviour using:

lst.sort(key=lambda x: int(x.split('_')[0][1:]))
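Note that glob.glob returns paths that include the directory from the pattern, so the directory part would need to be removed before taking the prefix. A sketch, assuming every file name starts with one letter followed by digits and an underscore (numeric_prefix is a made-up helper name and the paths are placeholders):

```python
import os

def numeric_prefix(file_path):
    name = os.path.basename(file_path)   # e.g. 'A10_S10_ETC.ext'
    prefix = name.split('_')[0]          # e.g. 'A10'
    return int(prefix[1:])               # e.g. 10; ValueError if not all digits

# Placeholder paths standing in for the glob results.
lst = ['/user/path/A10_S10_ETC.ext', '/user/path/A11_S11_ETC.ext',
       '/user/path/A1_S1_ETC.ext', '/user/path/A2_S2_ETC.ext']
lst.sort(key=numeric_prefix)
print(lst)
```

This only covers names of that one shape. For arbitrary strings a more general key would be needed.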

Upvotes: 1
