Reputation: 63
So I've been working on a Python script that combines some information into a "bed" format. Which means that I'm working with features on a genome, my first column is the scaffold name (string), the second the start position on that scaffold (integer) and the third column is the stop position (integer), the other columns contain other information which is not relevant to my question. My issue is that my output is unsorted.
Now I know I can sort my files using this bash command:
$sort -k1,1 -k2,2n -k3,3n infile > outfile
But in the interest efficacy I'd like to know if there's a way to do this in Python. So far I've only seen list based sorts that deal with one either lexicographical or numerical sort. Not a combination of the two. So, do you guys have any ideas?
Snippet of my data (I want to sort by column 1, 2 and 3 (in that order)):
Scf_3R 8599253 8621866 FBgn0000014 FBgn0191744 -0.097558026153
Scf_3R 8497493 8503049 FBgn0000015 FBgn0025043 0.437973284047
Scf_3L 16209309 16236428 FBgn0000017 FBgn0184183 -1.19105585707
Scf_2L 10630469 10632308 FBgn0000018 FBgn0193617 0.073153454539
Scf_3R 12087670 12124207 FBgn0000024 FBgn0022516 -0.023946795475
Scf_X 14395665 14422243 FBgn0000028 FBgn0187465 0.00300558969397
Scf_3R 25163062 25165316 FBgn0000032 FBgn0189058 0.530118698187
Scf_3R 19757441 19808894 FBgn0000036 FBgn0189822 -0.282508464261
Upvotes: 6
Views: 2709
Reputation: 414179
To sort by your own sort criteria, just pass the corresponding key
function:
with open('infile', 'rb') as file:
lines = file.readlines()
def sort_key(line):
fields = line.split()
try:
return fields[0], int(fields[1]), int(fields[2])
except (IndexError, ValueError):
return () # sort invalid lines together
lines.sort(key=sort_key)
with open('outfile', 'wb') as file:
file.writelines(lines)
It assumes there is a newline at the end of the input file (append it if needed).
The code sorts text data by its bytes values (it is ok if the first column is ASCII), open the file in a text mode (use io.open()
on Python 2) if it is not the case (to sort by Unicode code point values). The result of the sort
command in the shell may depend on locale. You could use PyICU collator in Python.
If you need to sort files that do not fit in memory, see Sorting text file by using Python
Upvotes: 3
Reputation: 3335
The solution of @sparkandshine seems to be short and to the point for the specific sort pattern. Also the one offered by @j-f-sebastian to me looks excellent, terse and with important hints/links on internationalization and off-memory sorting strategies.
Maybe the following more explicit show case offers additional helpful info for the OP or someone with similar tasks to adapt to their needs. Please see comments in the mostly pep8 conforming code:
#! /usr/bin/env python
"""Provide a show case for hierarchical sort, that offers flexible
hierarchical lexcical, numeric column sort mixes at runtime.
Hopefully this draft solution offers ideas for helping migrate
the sort shell level operation into a pythonic solution - YMMV."""
from __future__ import print_function
from functools import partial # We use this to tailor the key function
def text_in_lines_gen(text_in_lines):
"""Mock generator simulating a line source for the data."""
for line in text_in_lines.split('\n'):
if line:
yield line.split()
def sort_hier_gen(iterable_lines, hier_sort_spec):
"""Given iterator of text lines, sort all lines based on
sort specification in hier_sort_spec.
Every entry in hier_sort_spec is expected to be a pair with first value
integer for index in columns of text blocks lines and second entry
type of sorting in ('int', 'float') numeric or any other for text
(lexical) ordering regime."""
num_codes = ('int', 'float')
converter_map = dict(zip(num_codes, (int, float)))
# Extract facts from sort spec, prepare processing:
key_ordered = tuple(k for k, _ in hier_sort_spec)
# Prepare key function: Step 1 ...
def _key_in(selected, r):
"""Inject the indexing into the key at sort time
via partial application, as key function in sort
has single argument only."""
return tuple(r[k] for k in selected)
_key = partial(_key_in, key_ordered) # ... step 2
convert_these_by = {}
for k, t in hier_sort_spec:
if t in num_codes:
convert_these_by[k] = converter_map[t]
if not convert_these_by: # early out
for row in sorted(iterable_lines, key=_key):
yield row
else:
def flow_converter(row_iter, converter_map):
"""Row based converter - Don't block the flow ;-)."""
for row in row_iter:
for k, convert in converter_map.items():
row[k] = convert(row[k])
yield row
for row in sorted(flow_converter(iterable_lines,
convert_these_by), key=_key):
yield row
def main():
"""Drive the hierarchical text-int-int sort."""
data_1 = """Scf_3R 8599253 8621866 FBgn0000014 FBgn0191744 -0.097558026153
Scf_3R 8497493 8503049 FBgn0000015 FBgn0025043 0.437973284047
Scf_3L 16209309 16236428 FBgn0000017 FBgn0184183 -1.19105585707
Scf_2L 10630469 10632308 FBgn0000018 FBgn0193617 0.073153454539
Scf_3R 12087670 12124207 FBgn0000024 FBgn0022516 -0.023946795475
Scf_X 14395665 14422243 FBgn0000028 FBgn0187465 0.00300558969397
Scf_3R 25163062 25165316 FBgn0000032 FBgn0189058 0.530118698187
Scf_3R 19757441 19808894 FBgn0000036 FBgn0189822 -0.282508464261"""
bar = []
x = 0
for a in range(3, 0, -1):
for b in range(3, 0, -1):
for c in range(3, 0, -1):
x += 1
bar.append('a_%d %d %0.1f %d' % (a, b, c * 1.1, x))
data_2 = '\n'.join(bar)
hier_sort_spec = ((0, 't'), (1, 'int'), (2, 'int'))
print("# Test data set 1 and sort spec={0}:".format(hier_sort_spec))
for sorted_row in sort_hier_gen(text_in_lines_gen(data_1), hier_sort_spec):
print(sorted_row)
hier_sort_spec = ((0, 't'), (1, None), (2, False))
print("# Test data set 1 and sort spec={0}:".format(hier_sort_spec))
for sorted_row in sort_hier_gen(text_in_lines_gen(data_1), hier_sort_spec):
print(sorted_row)
hier_sort_spec = ((0, 't'), (2, 'float'), (1, 'int'))
print("# Test data set 2 and sort spec={0}:".format(hier_sort_spec))
for sorted_row in sort_hier_gen(text_in_lines_gen(data_2), hier_sort_spec):
print(sorted_row)
if __name__ == '__main__':
main()
On my machine the three test cases (including the questions sample data) yield:
First:
# Test data set 1 and sort spec=((0, 't'), (1, 'int'), (2, 'int')):
['Scf_2L', 10630469, 10632308, 'FBgn0000018', 'FBgn0193617', '0.073153454539']
['Scf_3L', 16209309, 16236428, 'FBgn0000017', 'FBgn0184183', '-1.19105585707']
['Scf_3R', 8497493, 8503049, 'FBgn0000015', 'FBgn0025043', '0.437973284047']
['Scf_3R', 8599253, 8621866, 'FBgn0000014', 'FBgn0191744', '-0.097558026153']
['Scf_3R', 12087670, 12124207, 'FBgn0000024', 'FBgn0022516', '-0.023946795475']
['Scf_3R', 19757441, 19808894, 'FBgn0000036', 'FBgn0189822', '-0.282508464261']
['Scf_3R', 25163062, 25165316, 'FBgn0000032', 'FBgn0189058', '0.530118698187']
['Scf_X', 14395665, 14422243, 'FBgn0000028', 'FBgn0187465', '0.00300558969397']
Second:
# Test data set 1 and sort spec=((0, 't'), (1, None), (2, False)):
['Scf_2L', '10630469', '10632308', 'FBgn0000018', 'FBgn0193617', '0.073153454539']
['Scf_3L', '16209309', '16236428', 'FBgn0000017', 'FBgn0184183', '-1.19105585707']
['Scf_3R', '12087670', '12124207', 'FBgn0000024', 'FBgn0022516', '-0.023946795475']
['Scf_3R', '19757441', '19808894', 'FBgn0000036', 'FBgn0189822', '-0.282508464261']
['Scf_3R', '25163062', '25165316', 'FBgn0000032', 'FBgn0189058', '0.530118698187']
['Scf_3R', '8497493', '8503049', 'FBgn0000015', 'FBgn0025043', '0.437973284047']
['Scf_3R', '8599253', '8621866', 'FBgn0000014', 'FBgn0191744', '-0.097558026153']
['Scf_X', '14395665', '14422243', 'FBgn0000028', 'FBgn0187465', '0.00300558969397']
Third:
# Test data set 2 and sort spec=((0, 't'), (2, 'float'), (1, 'int')):
['a_1', 1, 1.1, '27']
['a_1', 2, 1.1, '24']
['a_1', 3, 1.1, '21']
['a_1', 1, 2.2, '26']
['a_1', 2, 2.2, '23']
['a_1', 3, 2.2, '20']
['a_1', 1, 3.3, '25']
['a_1', 2, 3.3, '22']
['a_1', 3, 3.3, '19']
['a_2', 1, 1.1, '18']
['a_2', 2, 1.1, '15']
['a_2', 3, 1.1, '12']
['a_2', 1, 2.2, '17']
['a_2', 2, 2.2, '14']
['a_2', 3, 2.2, '11']
['a_2', 1, 3.3, '16']
['a_2', 2, 3.3, '13']
['a_2', 3, 3.3, '10']
['a_3', 1, 1.1, '9']
['a_3', 2, 1.1, '6']
['a_3', 3, 1.1, '3']
['a_3', 1, 2.2, '8']
['a_3', 2, 2.2, '5']
['a_3', 3, 2.2, '2']
['a_3', 1, 3.3, '7']
['a_3', 2, 3.3, '4']
['a_3', 3, 3.3, '1']
Updated to use mostly generators for only having one copy of the data "around" as this is needed anyway (in-memory) for the global sort, but no need for having more copies ;-)
Added also functools.partial
as this for me is the fastest way, to adapt a key function to the flexible sort order.
One last update removed the remaining non-generator copy cruft in the case where a conversion is implemented, by defining a local generator function ofr the row based conversions. HTH.
Upvotes: 0
Reputation: 18007
Load data, sort them with sorted
, write to a new file.
# Load data
lists = list()
with open(filename, 'r') as f:
for line in f:
lists.append(line.rstrip().split())
# Sort data
results = sorted(lists, key=lambda x:(x[0], int(x[1]), int(x[2])))
# Write to a file
import csv
with open(filename, 'w') as f:
writer = csv.writer(f, delimiter='\t')
writer.writerows(results)
Upvotes: 2