Ordering a dataframe using value_counts

Question

I have a dataframe in which under the column "component_id", I have component_ids repeating several times. Here is what the df looks like:

In [82]: df.head()
Out[82]:
   index  molregno      chembl_id  assay_id     tid     tid  component_id  
0      0    942606  CHEMBL1518722    688422  103668  103668          4891
1      0    942606  CHEMBL1518722    688422  103668  103668          4891
2      0    942606  CHEMBL1518722    688721      78      78           286
3      0    942606  CHEMBL1518722    688721      78      78           286
4      0    942606  CHEMBL1518722    688779  103657  103657          5140

  component_synonym
0              LMN1
1              LMNA
2              LGR3
3              TSHR
4              MAPT

As can be seen, the same component_id can be linked to various component_synonyms(essentially the same gene, but different names). I wanted to find out the frequency of each gene as I want to find out the top 20 most frequently hit genes and therefore, I performed a value_counts on the column "component_id". I get something like this.

In [84]: df.component_id.value_counts()
Out[84]:
5432    804
3947    402
5147    312
3       304
2693    294
75      282
Name: component_id, dtype: int64

Is there a way for me to order the entire dataframe according to the component_id that is present the most number of times? And also, is it possible for my dataframe to contain only the first occurrence of each component_id?

Any advice would be greatly appreciated!

Bharath M Shetty · Accepted Answer

I think you can make use of count to sort the rows and then drop the count column i.e

df['count'] = df.groupby('component_id')['component_id'].transform('count')
df_sorted = df.sort_values(by='count',ascending=False).drop('count',1)

Ordering a dataframe using value_counts

Answers (1)

Related Questions