Reputation: 159
I have a data set with 8 attributes (which is sorted according to the first attribute) and is of the following format (Just an example, it is tab separated)
AX 0123 December 20, 2010 1 2 8.0 hello this
AX 2313 April 19, 2009 2 3 4.0 hi there
AX 4532 December 19, 2010 6 2 8.0 nice tie
AX 1244 January 10, 2011 3 4 8.0 king tale
BX 0214 September 10, 2009 2 3 9.0 this king
BX 0114 February 9, 2003 4 9 4.0 his brought
BX 3214 September 1, 2006 1 3 3.0 is great
MG 980 April 20, 2007 2 4 7.1 not available
MG 246 May 8, 2005 5 1 2.1 make goat
Now, that the file is sorted according first attribute, now i need to sort internally according to date based on first attribute, the output should be like this (I don't want to use the database, this is a huge file (2 GB) so I think a special python code might be required (Not sure if one can do this with a simple code)
AX 2313 April 19, 2009 2 3 4.0 hi there
AX 4532 December 19, 2010 6 2 8.0 nice tie
AX 0123 December 20, 2010 1 2 8.0 hello this
AX 1244 January 10, 2011 3 4 8.0 king tale
BX 0114 February 9, 2003 4 9 4.0 his brought
BX 3214 September 1, 2006 1 3 3.0 is great
BX 0214 September 10, 2009 2 3 9.0 this king
MG 246 May 8, 2005 5 1 2.1 make goat
MG 980 April 20, 2007 2 4 7.1 not available
Any replies are greatly appreciated. Let me know if you have any other questions.
Upvotes: 2
Views: 290
Reputation: 36214
pandas is a python library designed for analysing data sets with different datatypes.
If your data is in data.txt, you can read it with pandas.read_csv()
and than sort the resulting DataFrame
.
>>> import datetime
>>> import pandas as pd
>>> def date_converter(date_string):
... return datetime.datetime.strptime(datestring, '%B %d, %Y').date()
>>> df = pd.read_csv('data.txt', sep='\t', header=None,
... converters={2:date_converter})
>>> print df
X.1 X.2 X.3 X.4 X.5 X.6 X.7
0 AX 123 2010-12-20 1 2 8.0 hello this
1 AX 2313 2009-04-19 2 3 4.0 hi there
2 AX 4532 2010-12-19 6 2 8.0 nice tie
3 AX 1244 2011-01-10 3 4 8.0 king tale
4 BX 214 2009-09-10 2 3 9.0 this king
5 BX 114 2003-02-09 4 9 4.0 his brought
6 BX 3214 2006-09-01 1 3 3.0 is great
7 MG 980 2007-04-20 2 4 7.1 not available
8 MG 246 2005-05-08 5 1 2.1 make goat
>>> df = df.set_index(['X.1', 'X.3']) # using a hierarchical index
>>> df = df.sort_index()
>>> print df
X.2 X.4 X.5 X.6 X.7
X.1 X.3
AX 2009-04-19 2313 2 3 4.0 hi there
2010-12-19 4532 6 2 8.0 nice tie
2010-12-20 123 1 2 8.0 hello this
2011-01-10 1244 3 4 8.0 king tale
BX 2003-02-09 114 4 9 4.0 his brought
2006-09-01 3214 1 3 3.0 is great
2009-09-10 214 2 3 9.0 this king
MG 2005-05-08 246 5 1 2.1 make goat
2007-04-20 980 2 4 7.1 not available
As it is numpy
based, it should be the right choice for large data sets.
Upvotes: 0
Reputation: 85653
OK, quick and dirty. You can improve it:
from datetime import datetime as dt
from collections import defaultdict
dd = defaultdict(list)
with open('test.txt') as f:
for line in f:
lines = line.split('\t')
dd[lines[0]].append(lines)
def mydate(line):
return dt.strptime(line[2], "%B %d, %Y")
keys = sorted(dd.keys())
my_list = []
for key in keys:
dd[key].sort(key=mydate)
my_list.extend(dd[key])
for item in my_list:
print item
this produces:
['AX', '2313', 'April 19, 2009', '2', '3', '4.0', 'hi there\n']
['AX', '4532', 'December 19, 2010', '6', '2', '8.0', 'nice tie\n']
['AX', '0123', 'December 20, 2010', '1', '2', '8.0', 'hello this\n']
['AX', '1244', 'January 10, 2011', '3', '4', '8.0', 'king tale\n']
['BX', '0114', 'February 9, 2003', '4', '9', '4.0', 'his brought\n']
['BX', '3214', 'September 1, 2006', '1', '3', '3.0 is great\n']
['BX', '0214', 'September 10, 2009', '2', '3', '9.0 this king\n']
['MG', '246', 'May 8, 2005', '5', '1', '2.1', 'make goat']
['MG', '980', 'April 20, 2007', '2', '4', '7.1', 'not available\n']
then you only need to string.join() all the lists
text_lines = []
for item in my_list:
text_lines.append('\t'.join(item))
full_text = ''.join(text_lines)
Upvotes: 1