Jessica

Reputation: 3173

Python pandas: groupby output duplicate

I have some Excel data that I read in with Python pandas:

import pandas as pd
data = pd.read_csv('..../file.txt', sep='\t')

Here I'm showing mock data, since my actual data is quite large:

data = {'ID': [0,1,2,3,4,5,6,2],
    'VGene': ['IGHV1-J','IGHV1-J','IGHV2-J','IGHV2-J','IGHV1-J','IGHV1-J','IGHV2-J','IGHV2-J'],
    'JGene':['IGHJ4-1','IGHJ4-1','IGHJ5-1','IGHJ5-1', 'IGHJ4-1','IGHJ4-1','IGHJ5-1','IGHJ5-1'],
    'seq': ['AAAAAA','AAAAAC','TTTTTT','GGGGGG','AAAAAA','AAAAAC','TTTTTT','GGGGGG']}

data = pd.DataFrame(data)

Out[13]: 
    ID  VGene    JGene     seq
0   0  IGHV1-J  IGHJ4-1  AAAAAA
1   1  IGHV1-J  IGHJ4-1  AAAAAC
2   2  IGHV2-J  IGHJ5-1  TTTTTT
3   3  IGHV2-J  IGHJ5-1  GGGGGG
4   4  IGHV1-J  IGHJ4-1  AAAAAA
5   5  IGHV1-J  IGHJ4-1  AAAAAC
6   6  IGHV2-J  IGHJ5-1  TTTTTT
7   2  IGHV2-J  IGHJ5-1  GGGGGG

For now I simply want to print the unique ID, VGene, JGene and seq values for each VGene and JGene combination:

def printoutput(sgrp):
    return print(sgrp["ID"].unique(),sgrp["VGene"].unique(), sgrp['JGene'].unique(), sgrp['seq'].unique())

data.groupby(["VGene", "JGene"]).apply(printoutput)

The output:

 [0 1 4 5] ['IGHV1-J'] ['IGHJ4-1'] ['AAAAAA' 'AAAAAC']
 [0 1 4 5] ['IGHV1-J'] ['IGHJ4-1'] ['AAAAAA' 'AAAAAC']
 [2 3 6] ['IGHV2-J'] ['IGHJ5-1'] ['TTTTTT' 'GGGGGG']

This looks right, except that the first combination is printed twice:

[0 1 4 5] ['IGHV1-J'] ['IGHJ4-1'] ['AAAAAA' 'AAAAAC']
[0 1 4 5] ['IGHV1-J'] ['IGHJ4-1'] ['AAAAAA' 'AAAAAC']

I tried it with a larger dataset and the same thing happened: the first group always gets printed twice. Any idea why?

Upvotes: 2

Views: 349

Answers (1)

DSM

Reputation: 353059

As the documentation for groupby.apply explains:

In the current implementation apply calls func twice on the first group to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects, as they will take effect twice for the first group.

I'd just use an old-fashioned loop and get on with my day:

>>> for k, g in data.groupby(["VGene", "JGene"]):
...     printoutput(g)
...     
[0 1 4 5] ['IGHV1-J'] ['IGHJ4-1'] ['AAAAAA' 'AAAAAC']
[2 3 6] ['IGHV2-J'] ['IGHJ5-1'] ['TTTTTT' 'GGGGGG']

(Note that there's no need for your printoutput function to return anything; right now, since print returns None, it's returning the same thing it would have if there were no return at all.)
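If you'd rather keep the printing separate from the per-group work, one option is to have the helper return the unique values and print afterwards. This is only a sketch built on the mock data from the question; `summarize` and `rows` are names made up for illustration, not anything pandas provides:

```python
import pandas as pd

# Mock data copied from the question.
data = pd.DataFrame({
    'ID': [0, 1, 2, 3, 4, 5, 6, 2],
    'VGene': ['IGHV1-J', 'IGHV1-J', 'IGHV2-J', 'IGHV2-J',
              'IGHV1-J', 'IGHV1-J', 'IGHV2-J', 'IGHV2-J'],
    'JGene': ['IGHJ4-1', 'IGHJ4-1', 'IGHJ5-1', 'IGHJ5-1',
              'IGHJ4-1', 'IGHJ4-1', 'IGHJ5-1', 'IGHJ5-1'],
    'seq': ['AAAAAA', 'AAAAAC', 'TTTTTT', 'GGGGGG',
            'AAAAAA', 'AAAAAC', 'TTTTTT', 'GGGGGG'],
})


def summarize(sgrp):
    # Hypothetical helper: collect the unique values of each column
    # for one group and return them instead of printing (no side effects).
    return {col: sgrp[col].unique().tolist()
            for col in ["ID", "VGene", "JGene", "seq"]}


# Iterating over the groupby object visits each group exactly once.
rows = [summarize(g) for _, g in data.groupby(["VGene", "JGene"])]

for row in rows:
    print(row["ID"], row["VGene"], row["JGene"], row["seq"])
```

Because the helper has no side effects, it would not matter how many times it gets called on a group; the printing happens once per entry in `rows`.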

Upvotes: 3
