Reputation: 91
I have this pandas DataFrame in Python, named df:
uniprot_id(PK) protein_name ... protein_family protein_subfamily
0 Q8TAS1 Serine/threonine-protein kinase Kist ... KIS NaN
1 P35916 Vascular endothelial growth factor receptor 3 ... VEGFR NaN
2 Q96SB4 SRSF protein kinase 1 ... SRPK NaN
3 Q6P3W7 SCY1-like protein 2 ... SCY1 NaN
4 Q9UKI8 Serine/threonine-protein kinase tousled-like 1 ... TLK NaN
5 P30291 Wee1-like protein kinase ... WEE NaN
6 Q15120 Pyruvate dehydrogenase ... PDHK NaN
7 Q7L7X3 Serine/threonine-protein kinase TAO1 ... STE20 TAO
8 O75385 Serine/threonine-protein kinase ULK1 ... ULK NaN
9 P08922 Proto-oncogene tyrosine-protein kinase ROS ... Sev NaN
10 Q9P289 Serine/threonine-protein kinase 26 ... STE20 YSK
11 Q9NRP7 Serine/threonine-protein kinase 36 ... ULK NaN
12 Q9C0K7 STE20-related kinase adapter protein beta ... STE20 STLK
13 Q8IZX4 Transcription initiation factor TFIID subunit ... ... TAF1 NaN
14 Q9UKE5 TRAF2 and NCK-interacting protein kinase ... STE20 MSN
15 Q5TCY1 Tau-tubulin kinase 1 ... TTBK NaN
16 P33981 Dual specificity protein kinase TTK ... TTK NaN
17 P07949 Proto-oncogene tyrosine-protein kinase recepto... ... Ret NaN
18 O14730 Serine/threonine-protein kinase RIO3 ... RIO RIO3
19 O43353 Receptor-interacting serine/threonine-protein ... ... RIPK NaN
20 P57078 Receptor-interacting serine/threonine-protein ... ... RIPK NaN
21 Q9Y2H1 Serine/threonine-protein kinase 38-like ... NDR NaN
22 Q9UEW8 STE20/SPS1-related proline-alanine-rich protei... ... STE20 FRAY
23 Q8TDR2 Serine/threonine-protein kinase 35 ... NKF4 NaN
24 P49842 Serine/threonine-protein kinase 19 ... G11 NaN
25 Q13177 Serine/threonine-protein kinase PAK 2 ... STE20 PAKA
26 B5MCJ9 Tripartite motif-containing protein 66 ... TIF1 NaN
27 Q6IBK5 Transcription initiation factor IIF subunit alpha ... GTF2F1 NaN
28 Q8N165 Serine/threonine-protein kinase PDIK1L ... NKF4 NaN
29 Q86YV6 Myosin light chain kinase family member 4 ... MLCK NaN
30 Q8TCG2 Phosphatidylinositol 4-kinase type 2-beta ... NaN NaN
31 Q16654 Pyruvate dehydrogenase ... PDHK NaN
32 P51817 cAMP-dependent protein kinase catalytic subuni... ... PKA NaN
33 A0A0B4J2F2 Putative serine/threonine-protein kinase SIK1B ... NaN NaN
34 P57059 Serine/threonine-protein kinase SIK1 ... CAMKL QIK
35 Q9H0K1 Serine/threonine-protein kinase SIK2 ... CAMKL QIK
36 Q9Y2K2 Serine/threonine-protein kinase SIK3 ... CAMKL QIK
37 Q9BXU1 Serine/threonine-protein kinase 31 ... Other-Unique NaN
38 Q13263 Transcription intermediary factor 1-beta ... TIF1 NaN
39 Q32MK0 Myosin light chain kinase 3 ... MLCK NaN
40 Q13153 Serine/threonine-protein kinase PAK 1 ... STE20 PAKA
41 Q16816 Phosphorylase b kinase gamma catalytic chain; ... ... PHK NaN
42 Q05823 2-5A-dependent ribonuclease ... Other-Unique NaN
43 Q8IWB6 Inactive serine/threonine-protein kinase TEX14 ... NKF5 NaN
44 Q8IWB6 Inactive serine/threonine-protein kinase TEX14 ... NKF5 NaN
45 Q9BX84 Transient receptor potential cation channel su... ... Alpha ChaK
46 Q9H1R3 Myosin light chain kinase 2; skeletal/cardiac ... ... MLCK NaN
47 O75116 Rho-associated protein kinase 2 ... DMPK ROCK
48 Q01973 Inactive tyrosine-protein kinase transmembrane... ... Ror NaN
49 O75962 Triple functional domain protein ... Trio NaN
50 Q9Y4A5 Transformation/transcription domain-associated... ... PIKK TRRAP
51 Q8NEB9 Phosphatidylinositol 3-kinase catalytic subuni... ... NaN NaN
52 Q496M5 Inactive serine/threonine-protein kinase PLK5 ... NaN NaN
53 O00444 Serine/threonine-protein kinase PLK4 ... PLK NaN
54 Q06418 Tyrosine-protein kinase receptor TYRO3 ... Axl NaN
55 Q9Y572 Receptor-interacting serine/threonine-protein ... ... RIPK NaN
56 Q6IQ55 Tau-tubulin kinase 2 ... TTBK NaN
57 Q6PHR2 Serine/threonine-protein kinase ULK3 ... ULK NaN
58 P30530 Tyrosine-protein kinase receptor UFO ... Axl NaN
59 Q9Y6S9 Ribosomal protein S6 kinase-like 1 ... RSKL NaN
60 Q01974 Tyrosine-protein kinase transmembrane receptor... ... Ror NaN
61 Q15772 Striated muscle preferentially expressed prote... ... Trio NaN
62 Q15772 Striated muscle preferentially expressed prote... ... Trio NaN
63 Q9UHD2 Serine/threonine-protein kinase TBK1 ... IKK NaN
64 Q8TEA7 TBC domain-containing protein kinase-like protein ... TBCK NaN
65 Q96PF2 Testis-specific serine/threonine-protein kinas... ... TSSK NaN
66 Q9H792 Inactive tyrosine-protein kinase PEAK1 ... NKF3 NaN
67 O43930 Putative serine/threonine-protein kinase PRKY ... PKA NaN
68 P0C1S8 Wee1-like protein kinase 2 ... WEE NaN
69 Q96KB5 Lymphokine-activated killer T-cell-originated ... ... TOPK NaN
70 Q9BXA6 Testis-specific serine/threonine-protein kinas... ... TSSK NaN
71 Q96C45 Serine/threonine-protein kinase ULK4 ... ULK NaN
72 P29597 Non-receptor tyrosine-protein kinase TYK2 ... Jak NaN
73 P29597 Non-receptor tyrosine-protein kinase TYK2 ... JakB NaN
74 Q8WZ42 Titin ... MLCK NaN
75 Q86UE8 Serine/threonine-protein kinase tousled-like 2 ... TLK NaN
76 Q9BXA7 Testis-specific serine/threonine-protein kinas... ... TSSK NaN
77 Q96KG9 N-terminal kinase-like protein ... SCY1 NaN
78 Q9NRH2 SNF-related serine/threonine-protein kinase ... CAMKL SNRK
79 O94768 Serine/threonine-protein kinase 17B ... DAPK NaN
80 O75716 Serine/threonine-protein kinase 16 ... NAK NaN
81 Q15831 Serine/threonine-protein kinase STK11 ... CAMKL LKB
82 P07947 Tyrosine-protein kinase Yes ... Src NaN
83 Q8IV63 Inactive serine/threonine-protein kinase VRK3 ... VRK NaN
84 P35968 Vascular endothelial growth factor receptor 2 ... VEGFR NaN
85 Q99986 Serine/threonine-protein kinase VRK1 ... VRK NaN
86 Q9BYP7 Serine/threonine-protein kinase WNK3 ... WNK NaN
87 Q96BR1 Serine/threonine-protein kinase Sgk3 ... SGK NaN
88 Q9H2G2 STE20-like serine/threonine-protein kinase ... STE20 SLK
89 O94804 Serine/threonine-protein kinase 10 ... STE20 SLK
90 Q9UPN9 E3 ubiquitin-protein ligase TRIM33 ... TIF1 NaN
91 Q92519 Tribbles homolog 2 ... Trbl NaN
92 Q9UL54 Serine/threonine-protein kinase TAO2 ... STE20 TAO
93 Q96RU8 Tribbles homolog 1 ... Trbl NaN
94 Q96PN8 Testis-specific serine/threonine-protein kinas... ... TSSK NaN
95 Q9H4A3 Serine/threonine-protein kinase WNK1 ... WNK NaN
96 Q6SA08 Testis-specific serine/threonine-protein kinas... ... TSSK NaN
97 P43403 Tyrosine-protein kinase ZAP-70 ... Syk NaN
98 P42681 Tyrosine-protein kinase TXK ... Tec NaN
99 P17948 Vascular endothelial growth factor receptor 1 ... VEGFR NaN
100 P21675 Transcription initiation factor TFIID subunit 1 ... TAF1 NaN
101 Q02763 Angiopoietin-1 receptor ... Tie NaN
102 Q96J92 Serine/threonine-protein kinase WNK4 ... WNK NaN
103 Q13470 Non-receptor tyrosine-protein kinase TNK1 ... Ack NaN
104 Q9Y3S1 Serine/threonine-protein kinase WNK2 ... WNK NaN
105 Q86Y07 Serine/threonine-protein kinase VRK2 ... VRK NaN
106 Q96RU7 Tribbles homolog 3 ... Trbl NaN
107 Q9NRL2 Bromodomain adjacent to zinc finger domain pro... ... BAZ NaN
108 Q9NSY1 BMP-2-inducible protein kinase ... NAK NaN
109 Q13131 5-AMP-activated protein kinase catalytic subun... ... CAMKL AMPK
110 Q96QP1 Alpha-protein kinase 1 ... Alpha NaN
111 Q00532 Cyclin-dependent kinase-like 1 ... CDKL NaN
112 P07333 Macrophage colony-stimulating factor 1 receptor ... PDGFR NaN
113 Q13705 Activin receptor type-2B ... STKR STKR2
114 Q9UIG0 Tyrosine-protein kinase BAZ1B ... BAZ NaN
115 Q8IWQ3 Serine/threonine-protein kinase BRSK2 ... CAMKL BRSK
116 P51813 Cytoplasmic tyrosine-protein kinase BMX ... Tec NaN
117 Q08345 Epithelial discoidin domain-containing recepto... ... DDR NaN
118 Q16832 Discoidin domain-containing receptor 2 ... DDR NaN
119 Q8N568 Serine/threonine-protein kinase DCLK2 ... DCAMKL NaN
120 O76039 Cyclin-dependent kinase-like 5 ... CDKL NaN
121 P00533 Epidermal growth factor receptor ... EGFR NaN
122 Q13873 Bone morphogenetic protein receptor type-2 ... STKR STKR2
123 P50613 Cyclin-dependent kinase 7 ... CDK CDK7
124 Q9UQB9 Aurora kinase C ... Aur NaN
125 P25440 Bromodomain-containing protein 2 ... BRD NaN
126 P51451 Tyrosine-protein kinase Blk ... Src NaN
127 P29323 Ephrin type-B receptor 2 ... Eph NaN
128 P54764 Ephrin type-A receptor 4 ... Eph NaN
129 Q05397 Focal adhesion kinase 1 ... FAK NaN
130 P11801 Serine/threonine-protein kinase H1 ... PSK NaN
131 P23443 Ribosomal protein S6 kinase beta-1 ... RSK RSKp70
132 Q96LW2 Ribosomal protein S6 kinase-related protein ... RSKR NaN
133 Q9UK32 Ribosomal protein S6 kinase alpha-6 ... RSK RSKp90
134 Q9UK32 Ribosomal protein S6 kinase alpha-6 ... RSKb RSKb
135 Q8NB16 Mixed lineage kinase domain-like protein ... TKL-Unique NaN
136 O00750 Phosphatidylinositol 4-phosphate 3-kinase C2 d... ... NaN NaN
137 O60566 Mitotic checkpoint serine/threonine-protein ki... ... BUB NaN
138 Q9UPZ9 Serine/threonine-protein kinase ICK ... RCK NaN
139 O14965 Aurora kinase A ... Aur NaN
140 O60885 Bromodomain-containing protein 4 ... BRD NaN
141 Q58F21 Bromodomain testis-specific protein ... BRD NaN
142 Q15131 Cyclin-dependent kinase 10 ... CDK CDK10
143 Q00537 Cyclin-dependent kinase 17 ... CDK PCTAIRE
144 Q8NI60 Atypical kinase COQ8A; mitochondrial ... ABC1 ABC1-A
145 Q15303 Receptor tyrosine-protein kinase erbB-4 ... EGFR NaN
146 P08069 Insulin-like growth factor 1 receptor ... InsR NaN
147 O15111 Inhibitor of nuclear factor kappa-B kinase sub... ... IKK NaN
148 O14920 Inhibitor of nuclear factor kappa-B kinase sub... ... IKK NaN
149 O43187 Interleukin-1 receptor-associated kinase-like 2 ... IRAK NaN
150 Q9Y243 RAC-gamma serine/threonine-protein kinase ... Akt NaN
151 Q04771 Activin receptor type-1 ... STKR STKR1
152 Q7Z695 Uncharacterized aarF domain-containing protein... ... ABC1 ABC1-C
153 P16066 Atrial natriuretic peptide receptor 1 ... RGC NaN
154 Q8NFD2 Ankyrin repeat and protein kinase domain-conta... ... RIPK NaN
155 Q13535 Serine/threonine-protein kinase ATR ... PIKK ATR
156 P36894 Bone morphogenetic protein receptor type-1A ... STKR STKR1
157 P11274 Breakpoint cluster region protein ... BCR NaN
158 Q09013 Myotonin-protein kinase ... DMPK GEK
159 Q13315 Serine-protein kinase ATM ... PIKK ATM
160 P53004 Biliverdin reductase A ... BLVRA NaN
161 O43683 Mitotic checkpoint serine/threonine-protein ki... ... BUB NaN
162 P10398 Serine/threonine-protein kinase A-Raf ... RAF NaN
163 P20594 Atrial natriuretic peptide receptor 2 ... RGC NaN
164 P35626 Beta-adrenergic receptor kinase 2 ... GRK BARK
165 P49761 Dual specificity protein kinase CLK3 ... CLK NaN
166 P24941 Cyclin-dependent kinase 2 ... CDK CDK2
167 P50750 Cyclin-dependent kinase 9 ... CDK CDK9
168 Q07002 Cyclin-dependent kinase 18 ... CDK PCTAIRE
169 P29320 Ephrin type-A receptor 3 ... Eph NaN
170 P54762 Ephrin type-B receptor 1 ... Eph NaN
171 P22455 Fibroblast growth factor receptor 4 ... FGFR NaN
172 P31751 RAC-beta serine/threonine-protein kinase ... Akt NaN
173 Q15059 Bromodomain-containing protein 3 ... BRD NaN
174 P00519 Tyrosine-protein kinase ABL1 ... Abl NaN
175 O00238 Bone morphogenetic protein receptor type-1B ... STKR STKR1
176 P31749 RAC-alpha serine/threonine-protein kinase ... Akt NaN
177 Q8NER5 Activin receptor type-1C ... STKR STKR1
178 P27037 Activin receptor type-2A ... STKR STKR2
179 P68400 Casein kinase II subunit alpha ... CK2 NaN
180 P15056 Serine/threonine-protein kinase B-raf ... RAF NaN
181 Q06187 Tyrosine-protein kinase BTK ... Tec NaN
182 Q9C098 Serine/threonine-protein kinase DCLK3 ... DCAMKL NaN
183 Q00526 Cyclin-dependent kinase 3 ... CDK CDK2
184 P19784 Casein kinase II subunit alpha ... CK2 NaN
185 Q8NEV1 Casein kinase II subunit alpha 3 ... NaN NaN
Some rows are duplicates, as shown below:
133 Q9UK32 Ribosomal protein S6 kinase alpha-6 ... RSK RSKp90
134 Q9UK32 Ribosomal protein S6 kinase alpha-6 ... RSKb RSKb
I was wondering what the best way would be to combine the columns of these rows, separating the values with a semicolon when they differ (if they are the same, I just want a single value). Ideally, I would also like to specify the order of the values if possible. The desired result:
133 Q9UK32 Ribosomal protein S6 kinase alpha-6 ... RSK; RSKb RSKp90; RSKb
Upvotes: 0
Views: 48
Reputation: 24568
Something like this (x.dropna() keeps the NaN values out of the join, which only accepts strings):
df.groupby(['uniprot_id','protein_name'])[['protein_family','protein_subfamily']].agg(lambda x: '; '.join(x.dropna()))
You will probably want to sort your df before the groupby:
df.sort_values(['protein_family','protein_subfamily']).groupby(...)
If this is not what you want, you may want to sort within each group instead:
df.groupby(['uniprot_id','protein_name'])[['protein_family','protein_subfamily']].agg(lambda x : '; '.join(x.dropna().sort_values()))
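A plain join repeats a value when the duplicate rows agree, while the question asks for a single value in that case. Below is a minimal sketch of one way to handle that, not necessarily the only one. The helper join_unique is a hypothetical name introduced here, the sample frame is a small subset of the rows from the question, and the key column is assumed to be called uniprot_id (the question's printout shows it as uniprot_id(PK), so adjust the name if needed).

```python
import numpy as np
import pandas as pd

# Small sample built from rows shown in the question (column names assumed).
df = pd.DataFrame({
    "uniprot_id": ["Q9UK32", "Q9UK32", "Q8IWB6", "Q8IWB6", "Q8TAS1"],
    "protein_name": [
        "Ribosomal protein S6 kinase alpha-6",
        "Ribosomal protein S6 kinase alpha-6",
        "Inactive serine/threonine-protein kinase TEX14",
        "Inactive serine/threonine-protein kinase TEX14",
        "Serine/threonine-protein kinase Kist",
    ],
    "protein_family": ["RSK", "RSKb", "NKF5", "NKF5", "KIS"],
    "protein_subfamily": ["RSKp90", "RSKb", np.nan, np.nan, np.nan],
})


def join_unique(values, sep="; "):
    """Hypothetical helper: join the distinct non-null values of one group.

    Values keep the order in which they first appear in the group;
    a group with no non-null values stays NaN.
    """
    distinct = values.dropna().astype(str).unique()
    return sep.join(distinct) if len(distinct) else np.nan


merged = (
    df.groupby(["uniprot_id", "protein_name"], sort=False)[
        ["protein_family", "protein_subfamily"]
    ]
    .agg(join_unique)
    .reset_index()
)

print(merged)
```

Because join_unique keeps first-seen order, the order inside each joined string follows the row order of df, so sorting df before the groupby (as suggested above) is one way to control it.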
Upvotes: 2