Reputation: 3173
I have some Excel data that I read in with Python pandas:
import pandas as pd
data = pd.read_csv('..../file.txt', sep='\t')
Here is some mock data, since my actual data is quite large:
data = {'ID': [0,1,2,3,4,5,6,2],
'VGene': ['IGHV1-J','IGHV1-J','IGHV2-J','IGHV2-J','IGHV1-J','IGHV1-J','IGHV2-J','IGHV2-J'],
'JGene':['IGHJ4-1','IGHJ4-1','IGHJ5-1','IGHJ5-1', 'IGHJ4-1','IGHJ4-1','IGHJ5-1','IGHJ5-1'],
'seq': ['AAAAAA','AAAAAC','TTTTTT','GGGGGG','AAAAAA','AAAAAC','TTTTTT','GGGGGG']}
data = pd.DataFrame(data)
Out[13]:
ID VGene JGene seq
0 0 IGHV1-J IGHJ4-1 AAAAAA
1 1 IGHV1-J IGHJ4-1 AAAAAC
2 2 IGHV2-J IGHJ5-1 TTTTTT
3 3 IGHV2-J IGHJ5-1 GGGGGG
4 4 IGHV1-J IGHJ4-1 AAAAAA
5 5 IGHV1-J IGHJ4-1 AAAAAC
6 6 IGHV2-J IGHJ5-1 TTTTTT
7 2 IGHV2-J IGHJ5-1 GGGGGG
For now I simply want to output the ID, VGene, JGene and seq values for each VGene and JGene combination:
def printoutput(sgrp):
return print(sgrp["ID"].unique(),sgrp["VGene"].unique(), sgrp['JGene'].unique(), sgrp['seq'].unique())
data.groupby(["VGene", "JGene"]).apply(printoutput)
The output:
[0 1 4 5] ['IGHV1-J'] ['IGHJ4-1'] ['AAAAAA' 'AAAAAC']
[0 1 4 5] ['IGHV1-J'] ['IGHJ4-1'] ['AAAAAA' 'AAAAAC']
[2 3 6] ['IGHV2-J'] ['IGHJ5-1'] ['TTTTTT' 'GGGGGG']
This seems right, except that it prints the first combination twice:
[0 1 4 5] ['IGHV1-J'] ['IGHJ4-1'] ['AAAAAA' 'AAAAAC']
[0 1 4 5] ['IGHV1-J'] ['IGHJ4-1'] ['AAAAAA' 'AAAAAC']
I tried it with a larger dataset and the same thing happened: the first group always gets printed twice. Any idea why?
Upvotes: 2
Views: 349
Reputation: 353059
As the documentation for groupby.apply explains:
In the current implementation apply calls func twice on the first group to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects, as they will take effect twice for the first group.
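If you want to see the extra call for yourself, here is a minimal sketch that records each invocation instead of printing. The `record_call` helper and the `calls` list are names I made up for illustration, and the frame is a cut-down version of the mock data. How many entries you get depends on your pandas version: as far as I know, releases from 0.25 onwards evaluate the first group only once, so you may not see the duplicate at all there.

```python
import pandas as pd

# Cut-down version of the mock data from the question.
data = pd.DataFrame({
    "VGene": ["IGHV1-J", "IGHV1-J", "IGHV2-J", "IGHV2-J"],
    "JGene": ["IGHJ4-1", "IGHJ4-1", "IGHJ5-1", "IGHJ5-1"],
    "seq": ["AAAAAA", "AAAAAC", "TTTTTT", "GGGGGG"],
})

calls = []  # hypothetical log of every time the function is invoked


def record_call(sgrp):
    # Side effect: remember which group we were called with,
    # identified here by the first sequence in the group.
    calls.append(sgrp["seq"].iloc[0])
    return len(sgrp)


data.groupby(["VGene", "JGene"]).apply(record_call)

# On pandas versions that probe the first group, its entry shows up twice.
print(calls)
```

Recent pandas versions may also emit a warning about `apply` operating on the grouping columns; that does not affect the call count.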
I'd just use an old-fashioned loop and get on with my day:
>>> for k, g in data.groupby(["VGene", "JGene"]):
... printoutput(g)
...
[0 1 4 5] ['IGHV1-J'] ['IGHJ4-1'] ['AAAAAA' 'AAAAAC']
[2 3 6] ['IGHV2-J'] ['IGHJ5-1'] ['TTTTTT' 'GGGGGG']
(Note that there's no need for your printoutput function to return anything; right now, since print returns None, it's returning the same thing it would have if there were no return at all.)
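If you'd rather avoid side effects inside the function altogether, one option is to return the unique values and do the printing in the loop. This is only a sketch under my own assumptions: `summarize` and `summaries` are hypothetical names, and I only collect ID and seq because the group key already carries VGene and JGene.

```python
import pandas as pd

# Mock data from the question.
data = pd.DataFrame({
    "ID": [0, 1, 2, 3, 4, 5, 6, 2],
    "VGene": ["IGHV1-J", "IGHV1-J", "IGHV2-J", "IGHV2-J",
              "IGHV1-J", "IGHV1-J", "IGHV2-J", "IGHV2-J"],
    "JGene": ["IGHJ4-1", "IGHJ4-1", "IGHJ5-1", "IGHJ5-1",
              "IGHJ4-1", "IGHJ4-1", "IGHJ5-1", "IGHJ5-1"],
    "seq": ["AAAAAA", "AAAAAC", "TTTTTT", "GGGGGG",
            "AAAAAA", "AAAAAC", "TTTTTT", "GGGGGG"],
})


def summarize(sgrp):
    """Return the unique IDs and sequences of one group (no printing)."""
    return {
        "ID": sgrp["ID"].unique().tolist(),
        "seq": sgrp["seq"].unique().tolist(),
    }


# Iterating over the groupby visits each group exactly once.
summaries = {}
for (vgene, jgene), grp in data.groupby(["VGene", "JGene"]):
    summaries[(vgene, jgene)] = summarize(grp)

for (vgene, jgene), summary in summaries.items():
    print(vgene, jgene, summary["ID"], summary["seq"])
```

Because `summarize` has no side effects, it would not matter if it were called an extra time, and the results stay available in `summaries` for later use.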
Upvotes: 3