Reputation: 65
I have a pandas dataframe like the following. I created the last 3 columns based on unique values from the column RefIDPrefix
.
RefIDPrefix | RefIDNumber | GO | PMID | Reactome |
---|---|---|---|---|
GO | 12345 | |||
PMID | 23456 | |||
Reactome | 34567 | |||
GO | 45678 | |||
GO | 56789 | |||
PMID | 67890 |
I want to fill the last 3 columns like the following. Basically, based on the value in RefIDPrefix
, I want to take the value in RefIDNumber
and put it in the correct column corresponding to RefIDPrefix
.
RefIDPrefix | RefIDNumber | GO | PMID | Reactome |
---|---|---|---|---|
GO | 12345 | 12345 | ||
PMID | 23456 | 23456 | ||
Reactome | 34567 | 34567 | ||
GO | 45678 | 45678 | ||
GO | 56789 | 56789 | ||
PMID | 67890 | 67890 |
I have been trying to do this for a while but haven't been able to figure out how to go about it. Any help would be appreciated!
Upvotes: 2
Views: 187
Reputation: 23227
You can use df.pivot()
to build the columns from RefIDPrefix
and .join()
it back to the original df
df.join(df.pivot(columns='RefIDPrefix', values='RefIDNumber').fillna(''))
Output:
RefIDPrefix RefIDNumber GO PMID Reactome
0 GO 12345 12345.0
1 PMID 23456 23456.0
2 Reactome 34567 34567.0
3 GO 45678 45678.0
4 GO 56789 56789.0
5 PMID 67890 67890.0
For the display format of the numbers in new columns (currently displayed like float
number with decimal point), if your RefIDNumber
column is actually in string, the numbers in new column will also be in string and have no decimal point (like integers).
However, if RefIDNumber
is in numeric format (most probably in positive numbers for ID numbers), we can retain the numbers as integer
by fine-tuning the .fillna()
part, as follows:
df.join(df.pivot(columns='RefIDPrefix', values='RefIDNumber').fillna(-1, downcast='infer').replace(-1, ''))
Output:
RefIDPrefix RefIDNumber GO PMID Reactome
0 GO 12345 12345
1 PMID 23456 23456
2 Reactome 34567 34567
3 GO 45678 45678
4 GO 56789 56789
5 PMID 67890 67890
Upvotes: 3