Reputation: 11
I am trying to sort a csv file by column. The file has many columns and looks like:
Tom,01AA01,234.56,334
Alice,01AS01,546.09,3434.3
Sam,01NA01,4574.3,65.45
Joy,01BA01,2897.03,455
Pam,01MA01,434.034,454
John,01AA02,343,24
Alice,01AS02,454,454.54
Tom,02BA01,3434,3454.2
And it continues for about 20 columns and 250 rows.
I want it to be sorted by the second column and ordered alphabetically for AA
, AS
, BA
in the second portion, and numerically for the third section '01', '02', '03',
and numerically for the first section '01', '02', '03'
. And then create a new csv file from this sort. They are not usually just 6 characters long, others are look like '02BAA', '01MAA', '02NAA'
and so on.
So in the end it would hopefully look like this for column 2.
01AA01
01AS01
01BA01
01MA01
01NA01
01AA02
01AS02
02BA01
I'm new to coding and not quite sure how to go about doing this. Thank you in advance.
Upvotes: 1
Views: 1629
Reputation: 103824
The default sort order for ASCII strings from Python's sorted
function is lexicographic (or 'ASCIIbetical'):
>>> li=['1', '20', '100', '11']
>>> sorted(li)
['1', '100', '11', '20']
Compared to integer magnitude when those list values are integers:
>>> sorted(map(int, li))
[1, 11, 20, 100]
ie, the magnitude of the numbers in strings to the human eye is different than the same strings to the computer's eye. (Written about more extensively in Codinghorror)
To fix it, we need to separate the letters from the numbers and convert the numbers to integers (or floats).
The easiest way is with a regex that captures all numbers, converts to ints then all letters.
This sorts into your target:
li1='''\
01AA01
01AS01
01NA01
01BA01
01MA01
01AA02
01AS02
02BA01'''.splitlines()
tgt='''\
01AA01
01AS01
01BA01
01MA01
01NA01
01AA02
01AS02
02BA01'''.splitlines()
import re
def kf(s):
nums=map(int, re.findall(r'(\d+)', s))
lets=re.findall(r'([a-zA-Z]+)', s)
return nums+lets
print tgt==sorted(li1, key=kf)
# True
Or, one line:
>>> tgt==sorted(li1, key=lambda s: map(int, re.findall(r'(\d+)', s))+re.findall(r'(\D+)', s))
True
Edit based on comments
The text of the question states:
I want it to be ordered numerically in the first section 01,02,03... and then alphabetically for AA, AS, BA in the second portion, and numerically again for the third section.
However, the example shows that this is not the case.
We can sort based on the pattern of (int, letters, int) with split:
>>> [re.split(r'(\D+)', e) for e in li1]
[['01', 'AA', '01'], ['01', 'AS', '01'], ['01', 'NA', '01'], ['01', 'BA', '01'], ['01', 'MA', '01'], ['01', 'AA', '02'], ['01', 'AS', '02'], ['02', 'BA', '01']]
>>> sorted(li1, key=lambda s: [int(e) if e.isdigit() else e for e in re.split(r'(\D+)', s)])
['01AA01', '01AA02', '01AS01', '01AS02', '01BA01', '01MA01', '01NA01', '02BA01']
# ^^ ^^ etc '01AA02', before '01AS01' in the example
By inspection, the pattern of the POSTED example is (int, int, letters)
which can be seen here:
>>> [map(int, re.findall(r'(\d+)', s))+re.findall(r'(\D+)', s) for s in li1]
[[1, 1, 'AA'], [1, 1, 'AS'], [1, 1, 'NA'], [1, 1, 'BA'], [1, 1, 'MA'], [1, 2, 'AA'], [1, 2, 'AS'], [2, 1, 'BA']]
If the TEXT is correct, use the split form of sort I have; if the EXAMPLE is correct, use the nums+lets
form.
Upvotes: 1
Reputation: 77347
Assuming this is a csv file, each line is a row and each column is marked with a comma. Since haven't given us an exmaple of your csv, I made up one that has two columns, with your data in col[1].
>>> print open('mycsv.csv').read()
fred, 01AA01
brenda, 01BA01
bob, 01AA02
alice, 01NA01
jane, 01AS01
blane, 02BA01
larry, 01MA01
mary, 01AS02
These can all be read into a list with the csv module. You end up with a list of rows, where the columns are another list
>>> import csv
>>> table=[row for row in csv.reader(open('mycsv.csv')) if row]
>>> print table
[['fred', ' 01AA01'], ['brenda', ' 01BA01'], ['bob', ' 01AA02'], ['alice', ' 01NA01'], ['jane', ' 01AS01'], ['blane', ' 02BA01'], ['larry', ' 01MA01'], ['mary', ' 01AS02']]
You can sort that list. By default, sort starts with the first key, then the second key if the first is the same, and etc. So, it wil lsort by 'fred' etc... But you can select a different sort key. Python calls key function with each list item so that you can transform it into what you want. The transformations can be simple like make lower case, or complex.
Its common to use lambdas for sort keys, but that may be a bit advanced, so here's a function that just grabs the key you want.
>>> def item_1(row):
... return row[1]
...
>>> print table
[['fred', ' 01AA01'], ['bob', ' 01AA02'], ['jane', ' 01AS01'], ['mary', ' 01AS02'], ['brenda', ' 01BA01'], ['larry', ' 01MA01'], ['alice', ' 01NA01'], ['blane', ' 02BA01']]
>>>
Upvotes: 0
Reputation: 43830
sorted()
and the list's .sort()
method take an optional key
argument.
Where:
key specifies a function of one argument that is used to extract a comparison key from each list element: key=str.lower.
In other words, the function (that you will write) given to the key argument parses and returns the sortable value for the given object.
So, given your input, "01AS01"
, you want to break it down into pieces that can be easily sorted.
As you mentioned, you want the results sorted by (int, str, int). Since sorted()
and .sort()
automatically will sort by number, in the case of ints, and alphabetically, in case of strings, all you need to do for your key function is to break your value, "01AS01"
into [1, "AS", 1]
and sorted()
/.sort()
will take care of the rest.
This is a similar example to dawg's but without using map()
and re
.
col = ['01AA01',
'01AS01',
'01NA01',
'01BA01',
'01MA01',
'01AA02',
'01AS02',
'02BA01']
def create_sort_key(value):
int_indexes = (0, 4)
str_indexes = (2,)
parsed_values = []
# get the starting index for groups of two
for i in range(0, 6, 2):
pair = value[i:i+2]
if i in int_indexes:
parsed_value = int(pair)
elif i in str_indexes:
parsed_value = str(pair)
else:
raise IndexError("unexpected index: {}".format(i))
parsed_values.append(parsed_value)
return parsed_values
col.sort(key=create_sort_key)
Upvotes: 0