Reputation: 359
Overview: I have data something like this (each row is a string):
81:0A:D7:19:25:7B, 2016-07-14 14:29:13, 2016-07-14 14:29:15, -69, 22:22:22:22:22:23,null,^M
3B:3F:B9:0A:83:E6, 2016-07-14 01:28:59, 2016-07-14 01:29:01, -36, 33:33:33:33:33:31,null,^M
B3:C0:6E:77:E5:31, 2016-07-14 08:26:45, 2016-07-14 08:26:47, -65, 33:33:33:33:33:32,null,^M
61:01:55:16:B5:52, 2016-07-14 06:25:32, 2016-07-14 06:25:34, -56, 33:33:33:33:33:33,null,^M
And I want to sort the rows by the first timestamp in each string, which for these four records is:
2016-07-14 01:28:59
2016-07-14 06:25:32
2016-07-14 08:26:45
2016-07-14 14:29:13
Now, I know the sort() method, but I don't understand how to use it here to sort all the rows by this timestamp, and I need to keep the final sorted data in the same format because another service is going to use it.
I also understand that I can supply a key function, but I am not clear on how to make it sort on the timestamp field.
Upvotes: 5
Views: 639
Reputation: 3335
If the format of each line must not change, a simple shell transformation may fit well, depending on the wider context of the solution (I know it is not a Python solution):
$ sort -t, -k2,2 sort_me_on_first_timestamp_field.txt
3B:3F:B9:0A:83:E6, 2016-07-14 01:28:59, 2016-07-14 01:29:01, -36, 33:33:33:33:33:31,null,^M
61:01:55:16:B5:52, 2016-07-14 06:25:32, 2016-07-14 06:25:34, -56, 33:33:33:33:33:33,null,^M
B3:C0:6E:77:E5:31, 2016-07-14 08:26:45, 2016-07-14 08:26:47, -65, 33:33:33:33:33:32,null,^M
81:0A:D7:19:25:7B, 2016-07-14 14:29:13, 2016-07-14 14:29:15, -69, 22:22:22:22:22:23,null,^M
Looks quite OK to me. The -t option tells sort to use the comma as the delimiter, and -k2,2 requests sorting on the second field only (fields are counted from one). Sometimes it is important to switch to numerical sorting with -n, but here, with fixed-length ISO datetime strings, lexical sorting works.
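To illustrate why lexical sorting is enough here, the following minimal sketch compares plain string ordering with ordering by parsed datetime values. It uses only the four timestamps from the question and assumes every timestamp has the same fixed-width `YYYY-MM-DD HH:MM:SS` layout; the variable names are made up for the illustration.

```python
from datetime import datetime

# Timestamps taken from the sample rows in the question.
stamps = [
    "2016-07-14 14:29:13",
    "2016-07-14 01:28:59",
    "2016-07-14 08:26:45",
    "2016-07-14 06:25:32",
]

# Plain string (lexical) ordering.
lexical = sorted(stamps)

# Chronological ordering: parse each string into a datetime first.
chronological = sorted(
    stamps, key=lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
)

print(lexical == chronological)  # True
```

The two orderings agree because the fields run from most to least significant and are zero-padded. With a different layout (for example day-first dates or unpadded hours) this would not be guaranteed.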
Again: if you are looking for a pure Python solution, I suggest picking the Python-based answer. This answer only offers a baseline alternative.
Update, to "measure" some scenario on some machine:
On the "machine of the developer", I concatenated the 4 sample lines repeatedly into files of 20, 200, 2000, ..., 2,000,000 lines. Sorting them with the sort command takes from 12 milliseconds up to 1.7 seconds (for 2 million lines) when writing to /dev/null, and 2 seconds when writing to a file.
A naive implementation of @juanpa.arrivillaga's proposed route sorting in-place:
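#! /usr/bin/env python
FILE_PATH_IN = './fhf.txt'
NL, FS = '\n', ','
# Read all lines; [:-1] drops the empty string after the final newline.
with open(FILE_PATH_IN) as f:
    list_of_strings = f.read().split(NL)[:-1]
# Sort in place on the second comma-separated field (the first timestamp).
list_of_strings.sort(key=lambda s: s.split(FS)[1])
with open(FILE_PATH_IN + ".out", "wt") as f:
    f.write(NL.join(list_of_strings))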
On the same machine this takes approx. 3 seconds for the 2 million line case, as does the other variant (using sorted to generate a new list):
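#! /usr/bin/env python
FILE_PATH_IN = './fhf.txt'
NL, FS = '\n', ','
# Read all lines; [:-1] drops the empty string after the final newline.
with open(FILE_PATH_IN) as f:
    list_of_strings = f.read().split(NL)[:-1]
# sorted() returns a new list ordered by the second comma-separated field.
with open(FILE_PATH_IN + ".out", "wt") as f:
    f.write(NL.join(sorted(list_of_strings, key=lambda s: s.split(FS)[1])))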
So my suggestion is to use the pure Python solution.
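Since the question asks for the output to stay in exactly the same format, here is a hedged variant of the two scripts above that never splits off or rejoins the line endings. It is a sketch assuming Python 3 and assuming the `^M` in the sample stands for a carriage return before the newline; the function name `sort_file_by_second_field` and the temporary file names are invented for the illustration.

```python
import os
import tempfile


def sort_file_by_second_field(path_in, path_out, sep=","):
    """Sort the lines of path_in by their second sep-separated field."""
    # newline="" disables newline translation, so "\r\n" endings survive as-is.
    with open(path_in, newline="") as f:
        lines = f.readlines()  # each line keeps its own line ending
    # Raises IndexError if a line has no second field.
    lines.sort(key=lambda s: s.split(sep)[1])
    with open(path_out, "w", newline="") as f:
        f.writelines(lines)


# Small demonstration on a temporary file holding two of the sample rows.
rows = [
    "81:0A:D7:19:25:7B, 2016-07-14 14:29:13, 2016-07-14 14:29:15, -69, 22:22:22:22:22:23,null,\r\n",
    "3B:3F:B9:0A:83:E6, 2016-07-14 01:28:59, 2016-07-14 01:29:01, -36, 33:33:33:33:33:31,null,\r\n",
]
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.txt")
    dst = os.path.join(tmp, "out.txt")
    with open(src, "w", newline="") as f:
        f.writelines(rows)
    sort_file_by_second_field(src, dst)
    with open(dst, newline="") as f:
        result = f.readlines()

print(result == [rows[1], rows[0]])  # True
```

Because the lines are written back byte-for-byte as they were read, the trailing comma, the carriage return, and the final newline all stay in place. I have not timed this variant.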
Upvotes: 2
Reputation: 95948
You can use the list method list.sort, which sorts in place, or the built-in function sorted(), which returns a new list. The key argument takes a function that is applied to each element of the sequence to produce the value used for comparison. You can combine str.split(',') with indexing to pick the second element, e.g. some_list[1], so:
In [8]: list_of_strings
Out[8]:
['81:0A:D7:19:25:7B, 2016-07-14 14:29:13, 2016-07-14 14:29:15, -69, 22:22:22:22:22:23,null,^M',
'3B:3F:B9:0A:83:E6, 2016-07-14 01:28:59, 2016-07-14 01:29:01, -36, 33:33:33:33:33:31,null,^M',
'B3:C0:6E:77:E5:31, 2016-07-14 08:26:45, 2016-07-14 08:26:47, -65, 33:33:33:33:33:32,null,^M',
'61:01:55:16:B5:52, 2016-07-14 06:25:32, 2016-07-14 06:25:34, -56, 33:33:33:33:33:33,null,^M']
In [9]: sorted(list_of_strings, key=lambda s: s.split(',')[1])
Out[9]:
['3B:3F:B9:0A:83:E6, 2016-07-14 01:28:59, 2016-07-14 01:29:01, -36, 33:33:33:33:33:31,null,^M',
'61:01:55:16:B5:52, 2016-07-14 06:25:32, 2016-07-14 06:25:34, -56, 33:33:33:33:33:33,null,^M',
'B3:C0:6E:77:E5:31, 2016-07-14 08:26:45, 2016-07-14 08:26:47, -65, 33:33:33:33:33:32,null,^M',
'81:0A:D7:19:25:7B, 2016-07-14 14:29:13, 2016-07-14 14:29:15, -69, 22:22:22:22:22:23,null,^M']
Or, if you'd rather sort the list in place:
In [12]: list_of_strings
Out[12]:
['81:0A:D7:19:25:7B, 2016-07-14 14:29:13, 2016-07-14 14:29:15, -69, 22:22:22:22:22:23,null,^M',
'3B:3F:B9:0A:83:E6, 2016-07-14 01:28:59, 2016-07-14 01:29:01, -36, 33:33:33:33:33:31,null,^M',
'B3:C0:6E:77:E5:31, 2016-07-14 08:26:45, 2016-07-14 08:26:47, -65, 33:33:33:33:33:32,null,^M',
'61:01:55:16:B5:52, 2016-07-14 06:25:32, 2016-07-14 06:25:34, -56, 33:33:33:33:33:33,null,^M']
In [13]: list_of_strings.sort(key=lambda s: s.split(',')[1])
In [14]: list_of_strings
Out[14]:
['3B:3F:B9:0A:83:E6, 2016-07-14 01:28:59, 2016-07-14 01:29:01, -36, 33:33:33:33:33:31,null,^M',
'61:01:55:16:B5:52, 2016-07-14 06:25:32, 2016-07-14 06:25:34, -56, 33:33:33:33:33:33,null,^M',
'B3:C0:6E:77:E5:31, 2016-07-14 08:26:45, 2016-07-14 08:26:47, -65, 33:33:33:33:33:32,null,^M',
'81:0A:D7:19:25:7B, 2016-07-14 14:29:13, 2016-07-14 14:29:15, -69, 22:22:22:22:22:23,null,^M']
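If the timestamps might not always be padded or spaced identically, a named key function that parses the field is a more defensive option. The following is only a sketch under that assumption: `first_timestamp` is a hypothetical helper, not part of any library, and the format string is taken from the sample rows.

```python
from datetime import datetime


def first_timestamp(row):
    """Return the first timestamp of a row as a datetime (hypothetical helper)."""
    # maxsplit=2 stops splitting once the second field has been isolated.
    field = row.split(",", 2)[1]
    # strip() removes the space that follows the comma in the sample data.
    return datetime.strptime(field.strip(), "%Y-%m-%d %H:%M:%S")


list_of_strings = [
    "81:0A:D7:19:25:7B, 2016-07-14 14:29:13, 2016-07-14 14:29:15, -69, 22:22:22:22:22:23,null,^M",
    "3B:3F:B9:0A:83:E6, 2016-07-14 01:28:59, 2016-07-14 01:29:01, -36, 33:33:33:33:33:31,null,^M",
    "B3:C0:6E:77:E5:31, 2016-07-14 08:26:45, 2016-07-14 08:26:47, -65, 33:33:33:33:33:32,null,^M",
    "61:01:55:16:B5:52, 2016-07-14 06:25:32, 2016-07-14 06:25:34, -56, 33:33:33:33:33:33,null,^M",
]

# The rows themselves are left untouched; only the sort key is parsed.
sorted_rows = sorted(list_of_strings, key=first_timestamp)

# Print the first field of each row to show the resulting order.
for row in sorted_rows:
    print(row.split(",", 1)[0])
```

Parsing costs more per row than comparing the raw substring, so for well-formed data like the sample the plain `s.split(',')[1]` key is likely the simpler choice.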
Upvotes: 10