Kpangee
Kpangee

Reputation: 57

How to retrieve element from a groupby DataFrame given index

Context:

my DataFrame contains the following columns: HapID, Marker, Start_position, End_position. For each HapID, I'd like to get: - the marker having the minimum Start_position (that is called leftMarker) - the marker having the maximum End_position (that is called rightMarker) - Interval being the difference (maximum End_position - minimum Start_position)

My issue is how to retrieve the marker names now that I know their index. I got the error below and I'm not sure how to sort it out although I've spent hours on it.

Here is the error message

AttributeError: Cannot access callable attribute 'iloc' of 'SeriesGroupBy' objects, try using the 'apply' method

Below is the data

HapID   Marker  Start_position  End_position
hap_1   mk1 1107207 1107256
hap_1   mk2 1104711 1104760
hap_1   mk3 1106845  1106894
hap_2   mk4 11901413 11901462
hap_2   mk5 206031250 206031299
hap_2   mk6 11498893 11498942
hap_2   mk7 17236023 17236072
hap_2   mk8 11692209 11692258
hap_2   mk9 11691512 11691561
hap_2   mk10 11615664 11615713

this is the expected output

HapID   leftMarker  rightMarker Start_position  End_position    Interval
hap_1   mk2 mk1 1104711 1107256 2545
hap_2   mk6 mk5 11498893    206031299   194532406

Code:

import pandas as pd
data = {
'HapID':['hap_1','hap_1','hap_1','hap_2','hap_2','hap_2','hap_2','hap_2','hap_2','hap_2'],
'Marker':['mk1','mk2','mk3','mk4','mk5','mk6','mk7','mk8','mk9','mk10'],
'Start_position':[1107207,1104711,1106845,11901413,206031250,11498893,17236023,11692209,11691512,11615664],
'End_position':[1107256,1104760,1106894,11901462,206031299,11498942,17236072,11692258,11691561,11615713]}
df = pd.DataFrame(data)

haplotypes = df.groupby(df['HapID'])
posi_1 = haplotypes.Start_position.min()
posi_2 = haplotypes.End_position.max()
diff_posi = posi_2 - posi_1
a = haplotypes.Start_position.idxmin()#index at minimum Start_position
b = haplotypes.End_position.idxmax() #index at maximum End_position
#print('{} {} {}'.format(posi_1,posi_2,diff_posi))
#print('{} {}'.format(a,b)) #just to se if I'm getting the index

Now, the problem is how to retrieve the markers at those positions for each haplotype

leftMarker = haplotypes.Marker.iloc(a)
rightMarker = haplotypes.Marker.iloc(b)

Upvotes: 2

Views: 716

Answers (2)

Alex
Alex

Reputation: 19124

This is a fairly straightforward case of applying a function to a pandas groupby. You should read the pandas docs on how to use groupby to get a better understanding of how/when to use this technique.

def my_fn(df):
    mk_min = df.loc[df['Start_position'].idxmin()]
    mk_max = df.loc[df['End_position'].idxmax()]
    vals = [mk_min['Marker'], mk_max['Marker'], mk_min['Start_position'], mk_max['End_position'], mk_max['End_position'] - mk_min['Start_position']]
    idx = ['leftMarker', 'rightMarker', 'Start_position', 'End_position', 'Interval']
    return pd.Series(vals, index=idx)

df.groupby('HapID').apply(my_fn)

returns

      leftMarker rightMarker  Start_position  End_position   Interval
HapID                                                                
hap_1        mk2         mk1         1104711       1107256       2545
hap_2        mk6         mk5        11498893     206031299  194532406

Upvotes: 0

Scott Boston
Scott Boston

Reputation: 153560

I think you need to retrieve the markers from the original dataframe.

leftMarker = df.loc[a,['HapID','Marker']]
rigthMarker = df.loc[b,['HapID','Marker']]

print(leftMarker)

   HapID Marker
1  hap_1    mk2
5  hap_2    mk6

print(rightMarker)

   HapID Marker
0  hap_1    mk1
4  hap_2    mk5

Upvotes: 1

Related Questions