Reputation: 19547
I am having issues figuring out how to sort a large data set into more useful data.
The original file, in CSV format, is shown below; the columns are the x, y, z position and, finally, the energy. The x, y, z coordinates cover quite a range, so this is only a small snippet. Basically, it was an energy search over a volume.
-2.800000,-1.000000,5.470000,-0.26488315
-3.000000,1.000000,4.070000,-0.81185718
-2.800000,-1.000000,3.270000,1.29303723
-2.800000,-0.400000,4.870000,-0.51165026
Unfortunately it's very difficult to plot in the requisite four dimensions, so I need to trim this data. I would like to do this in such a way that the volume is reduced to a surface: for each (x, y) pair, keep only the z with the lowest energy. On smaller data sets this was simple: in Excel, sort by x, then y, then energy, and delete all rows whose energy is above the lowest. This was easy enough for small sets of data but has quickly become problematic.
I have tried various approaches, such as splitting the CSV and using the sort command, but I am having little luck. Any advice on how to approach this would be much appreciated.
Upvotes: 3
Views: 5436
Reputation: 176780
This does what you ask in your comment to Raymond's answer: it returns just the row with the lowest z for each (x, y) pair:
from operator import itemgetter
from itertools import groupby
from csv import reader
def min_z(iterable):
    # the data converted from strings to numbers
    floats = [[float(n) for n in row] for row in iterable]
    # the data sorted by x, y, z
    floats.sort(key=itemgetter(0, 1, 2))
    # group the data by the (x, y) pair
    grouped_floats = groupby(floats, key=itemgetter(0, 1))
    # because the data is sorted, the first item of each group
    # is the row with the smallest z for that (x, y)
    return [next(rowgroup) for xy, rowgroup in grouped_floats]
data = """-2.800000,-1.000000,5.470000,-0.26488315
-3.000000,1.000000,4.070000,-0.81185718
-2.800000,-1.000000,3.270000,1.29303723
-2.800000,-0.400000,4.870000,-0.51165026""".splitlines()
print(min_z(reader(data)))
Prints:
[[-3.0, 1.0, 4.07, -0.81185718],
[-2.8, -1.0, 3.27, 1.29303723],
[-2.8, -0.4, 4.87, -0.51165026]]
Upvotes: 4
Reputation: 226296
After reading the data into a list of tuples with csv.reader, sort the data by (x, y) values. For clarity, use named tuples to identify the fields. Then use itertools.groupby to cluster the related (x, y) data points. For each group, use min to isolate the one with the lowest energy:
>>> import csv, collections, itertools
>>> raw_data = '''\
-2.800000,-1.000000,5.470000,-0.26488315
-3.000000,1.000000,4.070000,-0.81185718
-2.800000,-1.000000,3.270000,1.29303723
-2.800000,-0.400000,4.870000,-0.51165026
'''.splitlines()
>>> Sample = collections.namedtuple('Sample', ['x', 'y', 'z', 'energy'])
>>> data = [Sample(*map(float, row)) for row in csv.reader(raw_data)]
>>> data.sort(key=lambda s: (s.x, s.y))
>>> for xy, group in itertools.groupby(data, key=lambda s: (s.x, s.y)):
...     print(min(group, key=lambda s: s.energy))
...
Sample(x=-3.0, y=1.0, z=4.07, energy=-0.81185718)
Sample(x=-2.8, y=-1.0, z=5.47, energy=-0.26488315)
Sample(x=-2.8, y=-0.4, z=4.87, energy=-0.51165026)
Upvotes: 2
Reputation: 30210
I think numpy's lexsort will address your sorting needs.
In general I think your steps are:
Read the CSV into a numpy array -- have you tried Python's csv package or numpy's genfromtxt() function?
Sort using lexsort
Trim off unnecessary rows
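The steps above could be sketched like this (a minimal sketch, assuming the column order x, y, z, energy from the question's sample, and using an io.StringIO stand-in for the real file):

```python
import io

import numpy as np

# stand-in for the real CSV file; columns are x, y, z, energy
raw = io.StringIO(
    "-2.800000,-1.000000,5.470000,-0.26488315\n"
    "-3.000000,1.000000,4.070000,-0.81185718\n"
    "-2.800000,-1.000000,3.270000,1.29303723\n"
    "-2.800000,-0.400000,4.870000,-0.51165026\n"
)
data = np.genfromtxt(raw, delimiter=",")

# lexsort takes its keys last-key-first, so this sorts
# by x, then y, then energy
order = np.lexsort((data[:, 3], data[:, 1], data[:, 0]))
data = data[order]

# keep only the first row of each (x, y) group --
# after the sort, that is the lowest-energy row
keep = np.ones(len(data), dtype=bool)
keep[1:] = np.any(data[1:, :2] != data[:-1, :2], axis=1)
surface = data[keep]
print(surface)
```

For a real file you would pass the filename to np.genfromtxt instead of the StringIO object; the trimming step works because identical (x, y) rows are adjacent after the sort.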
EDIT: See this related SO question.
Upvotes: 0