Reputation: 159

Python Sorting by date internally based on 1st attribute

I have a data set with 8 attributes (which is sorted according to the first attribute) and is of the following format (Just an example, it is tab separated)

AX  0123  December 20, 2010  1  2  8.0  hello this
AX  2313  April 19, 2009  2  3  4.0  hi there
AX  4532  December 19, 2010  6  2  8.0  nice tie
AX  1244  January 10, 2011  3  4  8.0  king tale
BX  0214  September 10, 2009  2  3  9.0 this king
BX  0114  February 9, 2003  4  9  4.0  his brought
BX  3214  September 1, 2006  1  3  3.0 is great
MG  980   April 20, 2007  2  4  7.1  not available
MG  246   May 8, 2005  5  1  2.1  make goat

Now, that the file is sorted according first attribute, now i need to sort internally according to date based on first attribute, the output should be like this (I don't want to use the database, this is a huge file (2 GB) so I think a special python code might be required (Not sure if one can do this with a simple code)

AX  2313  April 19, 2009  2  3  4.0  hi there
AX  4532  December 19, 2010  6  2  8.0  nice tie
AX  0123  December 20, 2010  1  2  8.0  hello this
AX  1244  January 10, 2011  3  4  8.0  king tale
BX  0114  February 9, 2003  4  9  4.0  his brought
BX  3214  September 1, 2006  1  3  3.0 is great
BX  0214  September 10, 2009  2  3  9.0 this king
MG  246   May 8, 2005  5  1  2.1  make goat
MG  980   April 20, 2007  2  4  7.1  not available

Any replies are greatly appreciated. Let me know if you have any other questions.

Upvotes: 2

Answers (2)

bmu

Reputation: 36214

pandas is a python library designed for analysing data sets with different datatypes.

If your data is in data.txt, you can read it with pandas.read_csv() and than sort the resulting DataFrame.

>>> import datetime
>>> import pandas as pd

>>> def date_converter(date_string):
...     return datetime.datetime.strptime(datestring, '%B %d, %Y').date()
>>> df = pd.read_csv('data.txt', sep='\t', header=None,
...                  converters={2:date_converter})
>>> print df
  X.1   X.2         X.3  X.4  X.5  X.6            X.7
0  AX   123  2010-12-20    1    2  8.0     hello this
1  AX  2313  2009-04-19    2    3  4.0       hi there
2  AX  4532  2010-12-19    6    2  8.0       nice tie
3  AX  1244  2011-01-10    3    4  8.0      king tale
4  BX   214  2009-09-10    2    3  9.0      this king
5  BX   114  2003-02-09    4    9  4.0    his brought
6  BX  3214  2006-09-01    1    3  3.0       is great
7  MG   980  2007-04-20    2    4  7.1  not available
8  MG   246  2005-05-08    5    1  2.1      make goat

>>> df = df.set_index(['X.1', 'X.3'])  # using a hierarchical index
>>> df = df.sort_index()
>>> print df
                 X.2  X.4  X.5  X.6            X.7
X.1 X.3                                           
AX  2009-04-19  2313    2    3  4.0       hi there
    2010-12-19  4532    6    2  8.0       nice tie
    2010-12-20   123    1    2  8.0     hello this
    2011-01-10  1244    3    4  8.0      king tale
BX  2003-02-09   114    4    9  4.0    his brought
    2006-09-01  3214    1    3  3.0       is great
    2009-09-10   214    2    3  9.0      this king
MG  2005-05-08   246    5    1  2.1      make goat
    2007-04-20   980    2    4  7.1  not available

As it is numpy based, it should be the right choice for large data sets.

Upvotes: 0

joaquin

Reputation: 85653

OK, quick and dirty. You can improve it:

from datetime import datetime as dt
from collections import defaultdict

dd = defaultdict(list)

with open('test.txt') as f:
    for line in f:
        lines = line.split('\t')
        dd[lines[0]].append(lines)

def mydate(line):
    return dt.strptime(line[2], "%B %d, %Y")

keys = sorted(dd.keys())

my_list = []
for key in keys:
    dd[key].sort(key=mydate)
    my_list.extend(dd[key])

for item in my_list:
    print item

this produces:

['AX', '2313', 'April 19, 2009', '2', '3', '4.0', 'hi there\n']
['AX', '4532', 'December 19, 2010', '6', '2', '8.0', 'nice tie\n']
['AX', '0123', 'December 20, 2010', '1', '2', '8.0', 'hello this\n']
['AX', '1244', 'January 10, 2011', '3', '4', '8.0', 'king tale\n']
['BX', '0114', 'February 9, 2003', '4', '9', '4.0', 'his brought\n']
['BX', '3214', 'September 1, 2006', '1', '3', '3.0 is great\n']
['BX', '0214', 'September 10, 2009', '2', '3', '9.0 this king\n']
['MG', '246', 'May 8, 2005', '5', '1', '2.1', 'make goat']
['MG', '980', 'April 20, 2007', '2', '4', '7.1', 'not available\n']

then you only need to string.join() all the lists

text_lines = []
for item in my_list:
    text_lines.append('\t'.join(item))

full_text = ''.join(text_lines)

Upvotes: 1

Python Sorting by date internally based on 1st attribute

Answers (2)

Related Questions