Reputation: 147
I have a data that goes something like this:
apple 873 8695 913 1084
apple 873 8695 3953 4498
apple 873 8695 4549 8639
peach 22087 23527 22956 23034
peach 22087 23527 22799 22898
peach 22087 23527 22634 22722
peach 22087 23527 22456 22553
orange 10731 23193 22799 22932
orange 10731 23193 22634 22722
orange 10731 23193 22084 22553
orange 10731 23193 21965 22023
As you can see, for each kind of fruit, 2nd and 3rd columns are same but 4th and 5th are different. I want to write a python script that merges these lines and list everything in 4th and 5th column in one row separated by comma.
So something like this:
apple 873 8695 913,1084,3953,4498,4549,8639
peach 22087 23527 22956,23034,22799,22898,22634,22722,22456,22533
orange 10731 23193 22799,22932,22634,22722,22084,22553,21965,22023
Can someone help me on how to start on this? I had a very complex data set that i simplified to this format. Now all I need to do is this step. It seems that solution will be very simple but I am not even sure how to search for how to do this. Thus my title sounds very awkward as well. I would appreciate for any help.
Upvotes: 1
Views: 161
Reputation: 1458
This is a great opportunity to use groupby (see How do I use Python's itertools.groupby()? for nice explanation).
Essentially you could read your file into a list of strings with readlines(), and group them with the lambda function you see there, that is their three first elements. (Or one, or two). Then you get a nice group of entries that you can iterate through, and append their last elements (4th and 5th) columns to a list that you print out (with the start of the elements in the group). Given that your example data is stored in file.txt:
from itertools import groupby
f = open('file.txt')
lines = f.readlines()
for key, group in groupby(lines, lambda x: x.split()[0:3]):
restelms = []
for elm in group:
splitelm = elm.split()
for subelm in splitelm[3:]:
restelms.append(subelm)
print '\t'.join(key + [','.join(restelms)])
f.close()
Upvotes: 1
Reputation: 2555
Assuming your data file is in filename.txt
from collections import defaultdict
d = defaultdict(list)
for line in open('filename.txt'):
tokens = line.split()
d[tuple(tokens[:3])].extend(tokens[3:])
for k in sorted(d.keys()):
v = d[k]
print('{}\t{}\t{}\t{}'.format(k[0],k[1],k[2],','.join(v)))
outputs
apple 873 8695 913,1084,3953,4498,4549,8639
orange 10731 23193 22799,22932,22634,22722,22084,22553,21965,22023
peach 22087 23527 22956,23034,22799,22898,22634,22722,22456,22553
If you want to output the data to file output.txt, replace the last three lines with:
with open('output.txt','w') as outfile:
for k in sorted(d.keys()):
v = d[k]
print('{}\t{}\t{}\t{}'.format(k[0],k[1],k[2],','.join(v)),file=outfile)
Upvotes: 0
Reputation: 56634
from itertools import groupby
from operator import itemgetter
txt = dedent("""\
apple 873 8695 913 1084
apple 873 8695 3953 4498
apple 873 8695 4549 8639
peach 22087 23527 22956 23034
peach 22087 23527 22799 22898
peach 22087 23527 22634 22722
peach 22087 23527 22456 22553
orange 10731 23193 22799 22932
orange 10731 23193 22634 22722
orange 10731 23193 22084 22553
orange 10731 23193 21965 22023
""")
data = (row.split() for row in txt.splitlines())
data = [("\t".join(row[:3]), ",".join(row[3:])) for row in data]
output = [
label + "\t" + ",".join(row[1] for row in rows)
for label,rows in groupby(data, itemgetter(0))
]
print("\n".join(output))
results in
apple 873 8695 913,1084,3953,4498,4549,8639
peach 22087 23527 22956,23034,22799,22898,22634,22722,22456,22553
orange 10731 23193 22799,22932,22634,22722,22084,22553,21965,22023
Upvotes: 0