martin

Reputation: 1185

The best way to group rows with the same name

I have this df:

gene  person  allele    allele2
A1      p1       G          C
A2      p1       A          C
A3      p1       A          T
A1      p2       G          C
A2      p2       T          T
A3      p2       G          C
A4      p2       A          T
A2      p1       G          C
A3      p1       C          C
...

As you can see, the table can contain the same person several times (records from different laboratories). The first p1 is a different sample than the second p1, and I need to keep only unique samples with the best score (the highest number of rows). So in this example it will be the first p1, because it has 3 rows while the other has 2.

And I have no idea how to filter the table to get something like this:

gene  person  allele    allele2
A1      p1       G          C
A2      p1       A          C
A3      p1       A          T
A1      p2       G          C
A2      p2       T          T
A3      p2       G          C
A4      p2       A          T
...

I was thinking about indexing it with a for loop: for example, keep index i if the person is the same as the one above, otherwise use i+1, and then I would have groups. But the whole df has 3 million rows, so before I start I decided to describe my problem here. Maybe there is a better way to do this?
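To show what I mean, here is a rough sketch of that loop, just as an illustration on the sample data above (sample_id is a column name I made up):

import pandas as pd

df = pd.DataFrame({
    'gene':    ['A1', 'A2', 'A3', 'A1', 'A2', 'A3', 'A4', 'A2', 'A3'],
    'person':  ['p1', 'p1', 'p1', 'p2', 'p2', 'p2', 'p2', 'p1', 'p1'],
    'allele':  ['G',  'A',  'A',  'G',  'T',  'G',  'A',  'G',  'C'],
    'allele2': ['C',  'C',  'T',  'C',  'T',  'C',  'T',  'C',  'C'],
})

#bump a counter every time the person changes, so consecutive rows
#from the same sample get the same label
labels = []
i = 0
prev = None
for person in df['person']:
    if person != prev:
        i += 1
    labels.append(i)
    prev = person
df['sample_id'] = labels
print(df)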

Upvotes: 2

Views: 959

Answers (1)

jezrael

Reputation: 862681

Create consecutive group labels by comparing with Series.ne against Series.shift and taking the Series.cumsum, then map the group counts back to the rows with Series.map and Series.value_counts:

g = df['person'].ne(df['person'].shift()).cumsum()
s = g.map(g.value_counts())

print (s)
0    3
1    3
2    3
3    4
4    4
5    4
6    4
7    2
8    2
Name: person, dtype: int64

Lastly, compare s against the maximum value per person, computed with GroupBy.transform, and use the result for boolean indexing:

print (s.groupby(df['person']).transform('max'))
0    3
1    3
2    3
3    4
4    4
5    4
6    4
7    3
8    3
Name: person, dtype: int64

df = df[s.groupby(df['person']).transform('max').eq(s)]
print (df)
  gene person allele allele2
0   A1     p1      G       C
1   A2     p1      A       C
2   A3     p1      A       T
3   A1     p2      G       C
4   A2     p2      T       T
5   A3     p2      G       C
6   A4     p2      A       T
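
For completeness, the whole first approach as one runnable snippet, assuming df is built from the sample data in the question:

import pandas as pd

df = pd.DataFrame({
    'gene':    ['A1', 'A2', 'A3', 'A1', 'A2', 'A3', 'A4', 'A2', 'A3'],
    'person':  ['p1', 'p1', 'p1', 'p2', 'p2', 'p2', 'p2', 'p1', 'p1'],
    'allele':  ['G',  'A',  'A',  'G',  'T',  'G',  'A',  'G',  'C'],
    'allele2': ['C',  'C',  'T',  'C',  'T',  'C',  'T',  'C',  'C'],
})

#label consecutive blocks of the same person
g = df['person'].ne(df['person'].shift()).cumsum()
#size of each block, aligned back to the rows
s = g.map(g.value_counts())
#keep rows whose block size equals the person's maximum block size
df = df[s.groupby(df['person']).transform('max').eq(s)]
print (df)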

EDIT: If you need the first group when sizes tie, e.g. here person p1 appears in two groups of the same length:

#added one more row for a different test dataset
print (df)
  gene person allele allele2
0   A1     p1      G       C
1   A2     p1      A       C
2   A3     p1      A       T
3   A1     p2      G       C
4   A2     p2      T       T
5   A3     p2      G       C
6   A4     p2      A       T
7   A2     p1      G       C
8   A3     p1      C       C
9   A4     p1      C       C

g = df['person'].ne(df['person'].shift()).cumsum()
print (g)
0    1
1    1
2    1
3    2
4    2
5    2
6    2
7    3
8    3
9    3
Name: person, dtype: int32

#both p1 groups have the same size, 3
s = g.map(g.value_counts())
print (s)
0    3
1    3
2    3
3    4
4    4
5    4
6    4
7    3
8    3
9    3
Name: person, dtype: int64

#select the first index of the maximal size per person
idx = s.groupby(df['person']).idxmax()
print (idx)
person
p1    0
p2    3
Name: person, dtype: int64

#select the matching group labels from g
print (g.loc[idx])
0    1
3    2
Name: person, dtype: int32

#keep only rows belonging to the matched groups
print (g.isin(g.loc[idx]))
0     True
1     True
2     True
3     True
4     True
5     True
6     True
7    False
8    False
9    False
Name: person, dtype: bool

df = df[g.isin(g.loc[idx])]
print (df)
  gene person allele allele2
0   A1     p1      G       C
1   A2     p1      A       C
2   A3     p1      A       T
3   A1     p2      G       C
4   A2     p2      T       T
5   A3     p2      G       C
6   A4     p2      A       T
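
And the EDIT variant as one piece, assuming the 10-row sample with the added last row:

import pandas as pd

df = pd.DataFrame({
    'gene':    ['A1', 'A2', 'A3', 'A1', 'A2', 'A3', 'A4', 'A2', 'A3', 'A4'],
    'person':  ['p1', 'p1', 'p1', 'p2', 'p2', 'p2', 'p2', 'p1', 'p1', 'p1'],
    'allele':  ['G',  'A',  'A',  'G',  'T',  'G',  'A',  'G',  'C',  'C'],
    'allele2': ['C',  'C',  'T',  'C',  'T',  'C',  'T',  'C',  'C',  'C'],
})

#consecutive group labels and their sizes
g = df['person'].ne(df['person'].shift()).cumsum()
s = g.map(g.value_counts())

#idxmax returns the first index of the maximal size per person,
#so on ties the earlier group wins
idx = s.groupby(df['person']).idxmax()
df = df[g.isin(g.loc[idx])]
print (df)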

Upvotes: 2
