Reputation: 1185
I have that df:
gene person allele allele2
A1 p1 G C
A2 p1 A C
A3 p1 A T
A1 p2 G C
A2 p2 T T
A3 p2 G C
A4 p2 A T
A2 p1 G C
A3 p1 C C
...
As u can see, in table I can have same person few times (record form different laboratories). First p1 is different sample that second p1, and I need to pick only unique samples with best score (the highest number of rows), so it's that example it will be first p1 coz it's have 3 when another have 2.
And I have no idea how to extract that table to got something like this:
gene person allele allele2
A1 p1 G C
A2 p1 A C
A3 p1 A T
A1 p2 G C
A2 p2 T T
A3 p2 G C
A4 p2 A T
...
I thinking about indexing it by for loop. For example, add to index i if person == above person. If not, i+1. And then I will have a groups. But... whole df has 3mln row, so before I start I decide to describe here my problem. Maybe it's better way to do this?
Upvotes: 2
Views: 959
Reputation: 862681
Create consecutive groups by compare Series.ne
with Series.shift
and Series.cumsum
, then counts by Series.map
with Series.value_counts
:
g = df['person'].ne(df['person'].shift()).cumsum()
s = g.map(g.value_counts())
print (s)
0 3
1 3
2 3
3 4
4 4
5 4
6 4
7 2
8 2
Name: person, dtype: int64
And last compare by maxima value per person
by GroupBy.transform
with Series
s
in boolean indexing
:
print (s.groupby(df['person']).transform('max'))
0 3
1 3
2 3
3 4
4 4
5 4
6 4
7 3
8 3
Name: person, dtype: int64
df = df[s.groupby(df['person']).transform('max').eq(s)]
print (df)
gene person allele allele2
0 A1 p1 G C
1 A2 p1 A C
2 A3 p1 A T
3 A1 p2 G C
4 A2 p2 T T
5 A3 p2 G C
6 A4 p2 A T
EDIT: If need first group if same size, e.g. here group p1
has same length 2 times:
#added last row for another data test
print (df)
gene person allele allele2
0 A1 p1 G C
1 A2 p1 A C
2 A3 p1 A T
3 A1 p2 G C
4 A2 p2 T T
5 A3 p2 G C
6 A4 p2 A T
7 A2 p1 G C
8 A3 p1 C C
9 A4 p1 C C
g = df['person'].ne(df['person'].shift()).cumsum()
print (g)
0 1
1 1
2 1
3 2
4 2
5 2
6 2
7 3
8 3
9 3
Name: person, dtype: int32
#same size 3
s = g.map(g.value_counts())
print (s)
0 3
1 3
2 3
3 4
4 4
5 4
6 4
7 3
8 3
9 3
Name: person, dtype: int64
#selected first max index in s
idx = s.groupby(df['person']).idxmax()
print (idx)
person
p1 0
p2 3
Name: person, dtype: int64
#seelcted groups g
print (g.loc[idx])
0 1
3 2
Name: person, dtype: int32
#selected only matched groups
print (g.isin(g.loc[idx]))
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 False
8 False
9 False
Name: person, dtype: bool
df = df[g.isin(g.loc[idx])]
print (df)
gene person allele allele2
0 A1 p1 G C
1 A2 p1 A C
2 A3 p1 A T
3 A1 p2 G C
4 A2 p2 T T
5 A3 p2 G C
6 A4 p2 A T
Upvotes: 2