user3546860
user3546860

Reputation: 147

python: merge lines with common fields

I have a data that goes something like this:

apple   873     8695    913     1084
apple   873     8695    3953    4498
apple   873     8695    4549    8639
peach   22087   23527   22956   23034
peach   22087   23527   22799   22898
peach   22087   23527   22634   22722
peach   22087   23527   22456   22553
orange  10731   23193   22799   22932
orange  10731   23193   22634   22722
orange  10731   23193   22084   22553
orange  10731   23193   21965   22023

As you can see, for each kind of fruit, 2nd and 3rd columns are same but 4th and 5th are different. I want to write a python script that merges these lines and list everything in 4th and 5th column in one row separated by comma.

So something like this:

apple   873     8695    913,1084,3953,4498,4549,8639
peach   22087   23527   22956,23034,22799,22898,22634,22722,22456,22533
orange  10731   23193   22799,22932,22634,22722,22084,22553,21965,22023

Can someone help me on how to start on this? I had a very complex data set that i simplified to this format. Now all I need to do is this step. It seems that solution will be very simple but I am not even sure how to search for how to do this. Thus my title sounds very awkward as well. I would appreciate for any help.

Upvotes: 1

Views: 161

Answers (3)

user1603472
user1603472

Reputation: 1458

This is a great opportunity to use groupby (see How do I use Python's itertools.groupby()? for nice explanation).

Essentially you could read your file into a list of strings with readlines(), and group them with the lambda function you see there, that is their three first elements. (Or one, or two). Then you get a nice group of entries that you can iterate through, and append their last elements (4th and 5th) columns to a list that you print out (with the start of the elements in the group). Given that your example data is stored in file.txt:

from itertools import groupby

f = open('file.txt')

lines = f.readlines()

for key, group in groupby(lines, lambda x: x.split()[0:3]):
    restelms = []
    for elm in group:
        splitelm = elm.split()
        for subelm in splitelm[3:]:
            restelms.append(subelm)

    print '\t'.join(key + [','.join(restelms)])

f.close()

Upvotes: 1

mtadd
mtadd

Reputation: 2555

Assuming your data file is in filename.txt

from collections import defaultdict
d = defaultdict(list)
for line in open('filename.txt'):
    tokens = line.split()
    d[tuple(tokens[:3])].extend(tokens[3:])

for k in sorted(d.keys()):
    v = d[k]
    print('{}\t{}\t{}\t{}'.format(k[0],k[1],k[2],','.join(v)))

outputs

apple   873     8695    913,1084,3953,4498,4549,8639
orange  10731   23193   22799,22932,22634,22722,22084,22553,21965,22023
peach   22087   23527   22956,23034,22799,22898,22634,22722,22456,22553

If you want to output the data to file output.txt, replace the last three lines with:

with open('output.txt','w') as outfile:
    for k in sorted(d.keys()):
        v = d[k]
        print('{}\t{}\t{}\t{}'.format(k[0],k[1],k[2],','.join(v)),file=outfile)

Upvotes: 0

Hugh Bothwell
Hugh Bothwell

Reputation: 56634

from itertools import groupby
from operator import itemgetter

txt = dedent("""\
    apple   873     8695    913     1084
    apple   873     8695    3953    4498
    apple   873     8695    4549    8639
    peach   22087   23527   22956   23034
    peach   22087   23527   22799   22898
    peach   22087   23527   22634   22722
    peach   22087   23527   22456   22553
    orange  10731   23193   22799   22932
    orange  10731   23193   22634   22722
    orange  10731   23193   22084   22553
    orange  10731   23193   21965   22023
""")

data = (row.split() for row in txt.splitlines())
data = [("\t".join(row[:3]), ",".join(row[3:])) for row in data]

output = [
    label + "\t" + ",".join(row[1] for row in rows)
    for label,rows in groupby(data, itemgetter(0))
]

print("\n".join(output))

results in

apple   873     8695    913,1084,3953,4498,4549,8639
peach   22087   23527   22956,23034,22799,22898,22634,22722,22456,22553
orange  10731   23193   22799,22932,22634,22722,22084,22553,21965,22023

Upvotes: 0

Related Questions