Inan Khan

Reputation: 91

Combining Duplicate rows in pandas dataframe

I have this pandas dataframe, df, in Python:

    uniprot_id(PK)                                       protein_name  ... protein_family protein_subfamily
0           Q8TAS1              Serine/threonine-protein kinase Kist   ...            KIS               NaN
1           P35916     Vascular endothelial growth factor receptor 3   ...          VEGFR               NaN
2           Q96SB4                             SRSF protein kinase 1   ...           SRPK               NaN
3           Q6P3W7                               SCY1-like protein 2   ...           SCY1               NaN
4           Q9UKI8    Serine/threonine-protein kinase tousled-like 1   ...            TLK               NaN
5           P30291                          Wee1-like protein kinase   ...            WEE               NaN
6           Q15120                            Pyruvate dehydrogenase   ...           PDHK               NaN
7           Q7L7X3              Serine/threonine-protein kinase TAO1   ...          STE20               TAO
8           O75385              Serine/threonine-protein kinase ULK1   ...            ULK               NaN
9           P08922        Proto-oncogene tyrosine-protein kinase ROS   ...            Sev               NaN
10          Q9P289                Serine/threonine-protein kinase 26   ...          STE20               YSK
11          Q9NRP7                Serine/threonine-protein kinase 36   ...            ULK               NaN
12          Q9C0K7         STE20-related kinase adapter protein beta   ...          STE20              STLK
13          Q8IZX4  Transcription initiation factor TFIID subunit ...  ...           TAF1               NaN
14          Q9UKE5          TRAF2 and NCK-interacting protein kinase   ...          STE20               MSN
15          Q5TCY1                              Tau-tubulin kinase 1   ...           TTBK               NaN
16          P33981               Dual specificity protein kinase TTK   ...            TTK               NaN
17          P07949  Proto-oncogene tyrosine-protein kinase recepto...  ...            Ret               NaN
18          O14730              Serine/threonine-protein kinase RIO3   ...            RIO              RIO3
19          O43353  Receptor-interacting serine/threonine-protein ...  ...           RIPK               NaN
20          P57078  Receptor-interacting serine/threonine-protein ...  ...           RIPK               NaN
21          Q9Y2H1           Serine/threonine-protein kinase 38-like   ...            NDR               NaN
22          Q9UEW8  STE20/SPS1-related proline-alanine-rich protei...  ...          STE20              FRAY
23          Q8TDR2                Serine/threonine-protein kinase 35   ...           NKF4               NaN
24          P49842                Serine/threonine-protein kinase 19   ...            G11               NaN
25          Q13177             Serine/threonine-protein kinase PAK 2   ...          STE20              PAKA
26          B5MCJ9            Tripartite motif-containing protein 66   ...           TIF1               NaN
27          Q6IBK5  Transcription initiation factor IIF subunit alpha  ...         GTF2F1               NaN
28          Q8N165            Serine/threonine-protein kinase PDIK1L   ...           NKF4               NaN
29          Q86YV6         Myosin light chain kinase family member 4   ...           MLCK               NaN
30          Q8TCG2         Phosphatidylinositol 4-kinase type 2-beta   ...            NaN               NaN
31          Q16654                            Pyruvate dehydrogenase   ...           PDHK               NaN
32          P51817  cAMP-dependent protein kinase catalytic subuni...  ...            PKA               NaN
33      A0A0B4J2F2    Putative serine/threonine-protein kinase SIK1B   ...            NaN               NaN
34          P57059              Serine/threonine-protein kinase SIK1   ...          CAMKL               QIK
35          Q9H0K1              Serine/threonine-protein kinase SIK2   ...          CAMKL               QIK
36          Q9Y2K2              Serine/threonine-protein kinase SIK3   ...          CAMKL               QIK
37          Q9BXU1                Serine/threonine-protein kinase 31   ...   Other-Unique               NaN
38          Q13263          Transcription intermediary factor 1-beta   ...           TIF1               NaN
39          Q32MK0                       Myosin light chain kinase 3   ...           MLCK               NaN
40          Q13153             Serine/threonine-protein kinase PAK 1   ...          STE20              PAKA
41          Q16816  Phosphorylase b kinase gamma catalytic chain; ...  ...            PHK               NaN
42          Q05823                       2-5A-dependent ribonuclease   ...   Other-Unique               NaN
43          Q8IWB6    Inactive serine/threonine-protein kinase TEX14   ...           NKF5               NaN
44          Q8IWB6    Inactive serine/threonine-protein kinase TEX14   ...           NKF5               NaN
45          Q9BX84  Transient receptor potential cation channel su...  ...          Alpha              ChaK
46          Q9H1R3  Myosin light chain kinase 2; skeletal/cardiac ...  ...           MLCK               NaN
47          O75116                   Rho-associated protein kinase 2   ...           DMPK              ROCK
48          Q01973  Inactive tyrosine-protein kinase transmembrane...  ...            Ror               NaN
49          O75962                  Triple functional domain protein   ...           Trio               NaN
50          Q9Y4A5  Transformation/transcription domain-associated...  ...           PIKK             TRRAP
51          Q8NEB9  Phosphatidylinositol 3-kinase catalytic subuni...  ...            NaN               NaN
52          Q496M5     Inactive serine/threonine-protein kinase PLK5   ...            NaN               NaN
53          O00444              Serine/threonine-protein kinase PLK4   ...            PLK               NaN
54          Q06418            Tyrosine-protein kinase receptor TYRO3   ...            Axl               NaN
55          Q9Y572  Receptor-interacting serine/threonine-protein ...  ...           RIPK               NaN
56          Q6IQ55                              Tau-tubulin kinase 2   ...           TTBK               NaN
57          Q6PHR2              Serine/threonine-protein kinase ULK3   ...            ULK               NaN
58          P30530              Tyrosine-protein kinase receptor UFO   ...            Axl               NaN
59          Q9Y6S9                Ribosomal protein S6 kinase-like 1   ...           RSKL               NaN
60          Q01974  Tyrosine-protein kinase transmembrane receptor...  ...            Ror               NaN
61          Q15772  Striated muscle preferentially expressed prote...  ...           Trio               NaN
62          Q15772  Striated muscle preferentially expressed prote...  ...           Trio               NaN
63          Q9UHD2              Serine/threonine-protein kinase TBK1   ...            IKK               NaN
64          Q8TEA7  TBC domain-containing protein kinase-like protein  ...           TBCK               NaN
65          Q96PF2  Testis-specific serine/threonine-protein kinas...  ...           TSSK               NaN
66          Q9H792            Inactive tyrosine-protein kinase PEAK1   ...           NKF3               NaN
67          O43930     Putative serine/threonine-protein kinase PRKY   ...            PKA               NaN
68          P0C1S8                        Wee1-like protein kinase 2   ...            WEE               NaN
69          Q96KB5  Lymphokine-activated killer T-cell-originated ...  ...           TOPK               NaN
70          Q9BXA6  Testis-specific serine/threonine-protein kinas...  ...           TSSK               NaN
71          Q96C45              Serine/threonine-protein kinase ULK4   ...            ULK               NaN
72          P29597         Non-receptor tyrosine-protein kinase TYK2   ...            Jak               NaN
73          P29597         Non-receptor tyrosine-protein kinase TYK2   ...           JakB               NaN
74          Q8WZ42                                             Titin   ...           MLCK               NaN
75          Q86UE8    Serine/threonine-protein kinase tousled-like 2   ...            TLK               NaN
76          Q9BXA7  Testis-specific serine/threonine-protein kinas...  ...           TSSK               NaN
77          Q96KG9                    N-terminal kinase-like protein   ...           SCY1               NaN
78          Q9NRH2       SNF-related serine/threonine-protein kinase   ...          CAMKL              SNRK
79          O94768               Serine/threonine-protein kinase 17B   ...           DAPK               NaN
80          O75716                Serine/threonine-protein kinase 16   ...            NAK               NaN
81          Q15831             Serine/threonine-protein kinase STK11   ...          CAMKL               LKB
82          P07947                       Tyrosine-protein kinase Yes   ...            Src               NaN
83          Q8IV63     Inactive serine/threonine-protein kinase VRK3   ...            VRK               NaN
84          P35968     Vascular endothelial growth factor receptor 2   ...          VEGFR               NaN
85          Q99986              Serine/threonine-protein kinase VRK1   ...            VRK               NaN
86          Q9BYP7              Serine/threonine-protein kinase WNK3   ...            WNK               NaN
87          Q96BR1              Serine/threonine-protein kinase Sgk3   ...            SGK               NaN
88          Q9H2G2        STE20-like serine/threonine-protein kinase   ...          STE20               SLK
89          O94804                Serine/threonine-protein kinase 10   ...          STE20               SLK
90          Q9UPN9                E3 ubiquitin-protein ligase TRIM33   ...           TIF1               NaN
91          Q92519                                Tribbles homolog 2   ...           Trbl               NaN
92          Q9UL54              Serine/threonine-protein kinase TAO2   ...          STE20               TAO
93          Q96RU8                                Tribbles homolog 1   ...           Trbl               NaN
94          Q96PN8  Testis-specific serine/threonine-protein kinas...  ...           TSSK               NaN
95          Q9H4A3              Serine/threonine-protein kinase WNK1   ...            WNK               NaN
96          Q6SA08  Testis-specific serine/threonine-protein kinas...  ...           TSSK               NaN
97          P43403                    Tyrosine-protein kinase ZAP-70   ...            Syk               NaN
98          P42681                       Tyrosine-protein kinase TXK   ...            Tec               NaN
99          P17948     Vascular endothelial growth factor receptor 1   ...          VEGFR               NaN
100         P21675   Transcription initiation factor TFIID subunit 1   ...           TAF1               NaN
101         Q02763                           Angiopoietin-1 receptor   ...            Tie               NaN
102         Q96J92              Serine/threonine-protein kinase WNK4   ...            WNK               NaN
103         Q13470         Non-receptor tyrosine-protein kinase TNK1   ...            Ack               NaN
104         Q9Y3S1              Serine/threonine-protein kinase WNK2   ...            WNK               NaN
105         Q86Y07              Serine/threonine-protein kinase VRK2   ...            VRK               NaN
106         Q96RU7                                Tribbles homolog 3   ...           Trbl               NaN
107         Q9NRL2  Bromodomain adjacent to zinc finger domain pro...  ...            BAZ               NaN
108         Q9NSY1                    BMP-2-inducible protein kinase   ...            NAK               NaN
109         Q13131  5-AMP-activated protein kinase catalytic subun...  ...          CAMKL              AMPK
110         Q96QP1                            Alpha-protein kinase 1   ...          Alpha               NaN
111         Q00532                    Cyclin-dependent kinase-like 1   ...           CDKL               NaN
112         P07333   Macrophage colony-stimulating factor 1 receptor   ...          PDGFR               NaN
113         Q13705                          Activin receptor type-2B   ...           STKR             STKR2
114         Q9UIG0                     Tyrosine-protein kinase BAZ1B   ...            BAZ               NaN
115         Q8IWQ3             Serine/threonine-protein kinase BRSK2   ...          CAMKL              BRSK
116         P51813           Cytoplasmic tyrosine-protein kinase BMX   ...            Tec               NaN
117         Q08345  Epithelial discoidin domain-containing recepto...  ...            DDR               NaN
118         Q16832            Discoidin domain-containing receptor 2   ...            DDR               NaN
119         Q8N568             Serine/threonine-protein kinase DCLK2   ...         DCAMKL               NaN
120         O76039                    Cyclin-dependent kinase-like 5   ...           CDKL               NaN
121         P00533                  Epidermal growth factor receptor   ...           EGFR               NaN
122         Q13873        Bone morphogenetic protein receptor type-2   ...           STKR             STKR2
123         P50613                         Cyclin-dependent kinase 7   ...            CDK              CDK7
124         Q9UQB9                                   Aurora kinase C   ...            Aur               NaN
125         P25440                  Bromodomain-containing protein 2   ...            BRD               NaN
126         P51451                       Tyrosine-protein kinase Blk   ...            Src               NaN
127         P29323                          Ephrin type-B receptor 2   ...            Eph               NaN
128         P54764                          Ephrin type-A receptor 4   ...            Eph               NaN
129         Q05397                           Focal adhesion kinase 1   ...            FAK               NaN
130         P11801                Serine/threonine-protein kinase H1   ...            PSK               NaN
131         P23443                Ribosomal protein S6 kinase beta-1   ...            RSK            RSKp70
132         Q96LW2       Ribosomal protein S6 kinase-related protein   ...           RSKR               NaN
133         Q9UK32               Ribosomal protein S6 kinase alpha-6   ...            RSK            RSKp90
134         Q9UK32               Ribosomal protein S6 kinase alpha-6   ...           RSKb              RSKb
135         Q8NB16          Mixed lineage kinase domain-like protein   ...     TKL-Unique               NaN
136         O00750  Phosphatidylinositol 4-phosphate 3-kinase C2 d...  ...            NaN               NaN
137         O60566  Mitotic checkpoint serine/threonine-protein ki...  ...            BUB               NaN
138         Q9UPZ9               Serine/threonine-protein kinase ICK   ...            RCK               NaN
139         O14965                                   Aurora kinase A   ...            Aur               NaN
140         O60885                  Bromodomain-containing protein 4   ...            BRD               NaN
141         Q58F21               Bromodomain testis-specific protein   ...            BRD               NaN
142         Q15131                        Cyclin-dependent kinase 10   ...            CDK             CDK10
143         Q00537                        Cyclin-dependent kinase 17   ...            CDK           PCTAIRE
144         Q8NI60              Atypical kinase COQ8A; mitochondrial   ...           ABC1            ABC1-A
145         Q15303           Receptor tyrosine-protein kinase erbB-4   ...           EGFR               NaN
146         P08069             Insulin-like growth factor 1 receptor   ...           InsR               NaN
147         O15111  Inhibitor of nuclear factor kappa-B kinase sub...  ...            IKK               NaN
148         O14920  Inhibitor of nuclear factor kappa-B kinase sub...  ...            IKK               NaN
149         O43187   Interleukin-1 receptor-associated kinase-like 2   ...           IRAK               NaN
150         Q9Y243         RAC-gamma serine/threonine-protein kinase   ...            Akt               NaN
151         Q04771                           Activin receptor type-1   ...           STKR             STKR1
152         Q7Z695  Uncharacterized aarF domain-containing protein...  ...           ABC1            ABC1-C
153         P16066             Atrial natriuretic peptide receptor 1   ...            RGC               NaN
154         Q8NFD2  Ankyrin repeat and protein kinase domain-conta...  ...           RIPK               NaN
155         Q13535               Serine/threonine-protein kinase ATR   ...           PIKK               ATR
156         P36894       Bone morphogenetic protein receptor type-1A   ...           STKR             STKR1
157         P11274                 Breakpoint cluster region protein   ...            BCR               NaN
158         Q09013                           Myotonin-protein kinase   ...           DMPK               GEK
159         Q13315                         Serine-protein kinase ATM   ...           PIKK               ATM
160         P53004                            Biliverdin reductase A   ...          BLVRA               NaN
161         O43683  Mitotic checkpoint serine/threonine-protein ki...  ...            BUB               NaN
162         P10398             Serine/threonine-protein kinase A-Raf   ...            RAF               NaN
163         P20594             Atrial natriuretic peptide receptor 2   ...            RGC               NaN
164         P35626                 Beta-adrenergic receptor kinase 2   ...            GRK              BARK
165         P49761              Dual specificity protein kinase CLK3   ...            CLK               NaN
166         P24941                         Cyclin-dependent kinase 2   ...            CDK              CDK2
167         P50750                         Cyclin-dependent kinase 9   ...            CDK              CDK9
168         Q07002                        Cyclin-dependent kinase 18   ...            CDK           PCTAIRE
169         P29320                          Ephrin type-A receptor 3   ...            Eph               NaN
170         P54762                          Ephrin type-B receptor 1   ...            Eph               NaN
171         P22455               Fibroblast growth factor receptor 4   ...           FGFR               NaN
172         P31751          RAC-beta serine/threonine-protein kinase   ...            Akt               NaN
173         Q15059                  Bromodomain-containing protein 3   ...            BRD               NaN
174         P00519                      Tyrosine-protein kinase ABL1   ...            Abl               NaN
175         O00238       Bone morphogenetic protein receptor type-1B   ...           STKR             STKR1
176         P31749         RAC-alpha serine/threonine-protein kinase   ...            Akt               NaN
177         Q8NER5                          Activin receptor type-1C   ...           STKR             STKR1
178         P27037                          Activin receptor type-2A   ...           STKR             STKR2
179         P68400                    Casein kinase II subunit alpha   ...            CK2               NaN
180         P15056             Serine/threonine-protein kinase B-raf   ...            RAF               NaN
181         Q06187                       Tyrosine-protein kinase BTK   ...            Tec               NaN
182         Q9C098             Serine/threonine-protein kinase DCLK3   ...         DCAMKL               NaN
183         Q00526                         Cyclin-dependent kinase 3   ...            CDK              CDK2
184         P19784                    Casein kinase II subunit alpha   ...            CK2               NaN
185         Q8NEV1                  Casein kinase II subunit alpha 3   ...            NaN               NaN

Some rows are duplicates (same uniprot_id), as shown below:

133         Q9UK32               Ribosomal protein S6 kinase alpha-6   ...            RSK            RSKp90
134         Q9UK32               Ribosomal protein S6 kinase alpha-6   ...           RSKb              RSKb

What would be the best way to combine the columns of these rows, separating the values with a semicolon if they differ? If the values are the same, I just want a single value. Ideally, I would also like to be able to specify the order of the combined values. The desired result looks like this:

133         Q9UK32               Ribosomal protein S6 kinase alpha-6   ...            RSK; RSKb            RSKp90; RSKb

Upvotes: 0

Views: 48

Answers (1)

eshirvana

Reputation: 24568

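Something like this (the dropna() is there because str.join only accepts strings and your columns contain NaN):

df.groupby(['uniprot_id(PK)','protein_name'])[['protein_family','protein_subfamily']].agg(lambda x: '; '.join(x.dropna()))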

You probably want to sort your df before the group by, since rows keep their relative order inside each group:

df.sort_values(['protein_family','protein_subfamily']).groupby(...)

If this is not what you want, you can sort within each group instead:

df.groupby(['uniprot_id(PK)','protein_name'])[['protein_family','protein_subfamily']].agg(lambda x: '; '.join(x.dropna().sort_values()))
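For the "single value if they are the same" part of the question, here is a minimal sketch of one way to do it. The helper name join_unique and the small sample frame are my own assumptions for illustration (the sample reuses three duplicated IDs from the question and leaves out the hidden columns); it drops NaN, keeps each distinct value once in order of appearance, and leaves NaN when a group has nothing to join.

```python
import numpy as np
import pandas as pd

# Small sample built from rows shown in the question (other columns omitted).
df = pd.DataFrame({
    "uniprot_id(PK)": ["Q9UK32", "Q9UK32", "P29597", "P29597", "Q8IWB6", "Q8IWB6"],
    "protein_name": [
        "Ribosomal protein S6 kinase alpha-6",
        "Ribosomal protein S6 kinase alpha-6",
        "Non-receptor tyrosine-protein kinase TYK2",
        "Non-receptor tyrosine-protein kinase TYK2",
        "Inactive serine/threonine-protein kinase TEX14",
        "Inactive serine/threonine-protein kinase TEX14",
    ],
    "protein_family": ["RSK", "RSKb", "Jak", "JakB", "NKF5", "NKF5"],
    "protein_subfamily": ["RSKp90", "RSKb", np.nan, np.nan, np.nan, np.nan],
})


def join_unique(values: pd.Series, sep: str = "; "):
    """Join the distinct non-null values of one group, in order of appearance."""
    # join_unique is a hypothetical helper name, not a pandas function.
    distinct = values.dropna().astype(str).unique()
    # Keep NaN when the group has no usable value instead of returning "".
    return sep.join(distinct) if len(distinct) else np.nan


key_cols = ["uniprot_id(PK)", "protein_name"]
value_cols = ["protein_family", "protein_subfamily"]

# sort=False keeps the groups in the order they first appear in df.
combined = (
    df.groupby(key_cols, sort=False)[value_cols]
      .agg(join_unique)
      .reset_index()
)
print(combined)
```

On this sample, Q9UK32 collapses to one row with "RSK; RSKb" and "RSKp90; RSKb", while Q8IWB6 keeps a single "NKF5". To control the order, either sort df before grouping or replace distinct with sorted(distinct) inside the helper.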

Upvotes: 2
