Reputation: 263
I have a tab separated text file like this small example:
small example
:
chr15 60644581 60644675 ANXA2 ENST00000332680.4 annotated
chr15 60643922 60644018 ANXA2 ENST00000332680.4 annotated
chr15 60643391 60643450 ANXA2 ENST00000332680.4 annotated
chr15 60641273 60641396 ANXA2 ENST00000332680.4 annotated
chr15 60639831 60639888 ANXA2 ENST00000332680.4 annotated
chr14 25443870 25444024 STXBP6 ENST00000550887.1 annotated
chr14 25326232 25326363 STXBP6 ENST00000550887.1 annotated
chr14 25325141 25325307 STXBP6 ENST00000550887.1 annotated
chr14 25288242 25288400 STXBP6 ENST00000550887.1 annotated
I want to make another tab separated file with 6 columns but less rows. in fact I want to summarize the rows. rows with the same 4th column must be summarized in one row in fact at first I need to group the rows based on 4th column. the new row has the same 1st, 4th, 5th and 6th column but its 2nd column comes from the first row of each group and the 3rd column comes from the last row of each group. in the small example we have 2 groups based on 4th column so for the first group we would have this row:
chr15 60644581 60639888 ANXA2 ENST00000332680.4 annotated
and the expected output for the small example would look like this:
expected output
:
chr15 60644581 60639888 ANXA2 ENST00000332680.4 annotated
chr14 25443870 25288400 STXBP6 ENST00000550887.1 annotated
I am trying to do that in python using the following command but it does not return the expected output. do you know how to fix it?
import pandas as pd
p = pd.read_csv("myfile.txt", sep='\t')
data = p.groupby('4th column')
mylist = []
for i in range(len(data)):
new_line = 'data[0], data[1], data[2], data[3], data[4], data[5]'
mylist.append(new_line)
df = pd.DataFrame(mylist)
df.to_csv('outfile.txt', sep='\t', index=None, header=False)
Upvotes: 2
Views: 1960
Reputation: 2310
You do not specify what should be the grouped values of 1st, 5th and 6th columns so I will assume that any value will do, and will use first as with the 2nd column.
What you are looking for is the aggregation after groupby which would give the proper values. Pass a dictionary of {column name: aggregation function name} pairs to agg, especially {**"2": "first"**, **"3": "last"**}
.
import pandas as pd
p = pd.read_csv("myfile.txt", names=["1", "2", "3", "4", "5", "6"], header=None, sep='\t')
df = p.groupby("4").agg({"1": "first", "2": "first", "3": "last", "5": "first", "6": "first"})
df.reset_index(inplace=True)
df = df.reindex(sorted(df.columns), axis=1)
df.to_csv('outfile.txt', sep='\t', index=False, header=None)
Upvotes: 2