Reputation: 5
I am working with triqler data outputs. I have been able to convert it into a pandas dataframe that looks like the following:
A | B |
---|---|
Jan05 GN=XYZ NA | Q9GLD3;A0A0A0RBT6 |
Dec25 GN=ZYX Y | A0A8I3PIE0;A0A8I3PEP4;A0A0B4J198 |
Nov12 GN=NHAN Y | P60524;A0A1K0GGH0;A0A8I3P9B9 |
May22 GN=GZV X | P09582;A0A1R3UGQ4;A0A8I3NWV7 |
Jun24 MNIAV X | P09582;A0A1R3UGQ4;A0A8I3NWV7 |
May22 CAUCGZV Y | P09582;A0A1R3UGQ4;A0A8I3NWV7 |
I need to create a new column that either uses the string after the "GN=" from the values in column A or the first part before the semicolon in column B if there is no "GN=" in column A.
In other words, I need it to look like this:
A | B | C |
---|---|---|
Jan05 GN=XYZ NA | Q9GLD3;A0A0A0RBT6 | XYZ |
Dec25 GN=ZYX Y | A0A8I3PIE0;A0A8I3PEP4;A0A0B4J198 | ZYX |
Nov12 GN=NHAN Y | P60524;A0A1K0GGH0;A0A8I3P9B9 | NHAN |
May22 GN=GZV X | P09582;A0A1R3UGQ4;A0A8I3NWV7 | GZV |
Jun24 MNIAV X | P09582;A0A1R3UGQ4;A0A8I3NWV7 | P09582 |
May22 CAUCGZV Y | P09582;A0A1R3UGQ4;A0A8I3NWV7 | P09582 |
I am not sure which packages or commands I should be using to accomplish this goal. I think that a conditional statement would be the way to go.
Upvotes: 0
Views: 87
Reputation: 3720
df['C'] = (
df['A'].str.extract('GN=(.+) ')
.where(df['A'].str.contains('GN='),df['B'].str.extract('^(.+?);.*'))
)
print(df)
Result
A B C
0 Jan05 GN=XYZ NA Q9GLD3;A0A0A0RBT6 XYZ
1 Dec25 GN=ZYX Y A0A8I3PIE0;A0A8I3PEP4;A0A0B4J198 ZYX
2 Nov12 GN=NHAN Y P60524;A0A1K0GGH0;A0A8I3P9B9 NHAN
3 May22 GN=GZV X P09582;A0A1R3UGQ4;A0A8I3NWV7 GZV
4 Jun24 MNIAV X P09582;A0A1R3UGQ4;A0A8I3NWV7 P09582
5 May22 CAUCGZV Y P09582;A0A1R3UGQ4;A0A8I3NWV7 P09582
Upvotes: 0
Reputation: 30050
Let's try
df['C'] = (df['A'].str.extract('GN=([^\s]+)')[0]
.fillna(df['B'].str.split(';').str[0]))
print(df)
A B C
0 Jan05 GN=XYZ NA Q9GLD3;A0A0A0RBT6 XYZ
1 Dec25 GN=ZYX Y A0A8I3PIE0;A0A8I3PEP4;A0A0B4J198 ZYX
2 Nov12 GN=NHAN Y P60524;A0A1K0GGH0;A0A8I3P9B9 NHAN
3 May22 GN=GZV X P09582;A0A1R3UGQ4;A0A8I3NWV7 GZV
4 Jun24 MNIAV X P09582;A0A1R3UGQ4;A0A8I3NWV7 P09582
5 May22 CAUCGZV Y P09582;A0A1R3UGQ4;A0A8I3NWV7 P09582
Upvotes: 1
Reputation: 2188
action1 = df['A'].str.split('GN=',expand=True)[1].str.split(' ',expand=True)[0]
action2 = df['B'].str.split(';',expand=True)[0]
df['C'] = np.where(df['A'].str.contains('GN='),action1, action2)
df
###
A B C
0 Jan05 GN=XYZ NA Q9GLD3;A0A0A0RBT6 XYZ
1 Dec25 GN=ZYX Y A0A8I3PIE0;A0A8I3PEP4;A0A0B4J198 ZYX
2 Nov12 GN=NHAN Y P60524;A0A1K0GGH0;A0A8I3P9B9 NHAN
3 May22 GN=GZV X P09582;A0A1R3UGQ4;A0A8I3NWV7 GZV
4 Jun24 MNIAV X P09582;A0A1R3UGQ4;A0A8I3NWV7 P09582
5 May22 CAUCGZV Y P09582;A0A1R3UGQ4;A0A8I3NWV7 P09582
Upvotes: 0
Reputation: 1067
try this
data['new_column']=data.apply(lambda x:x["A"].split("GN=")[1] if("GN="in x["A"]) else x["B"].split(";")[0],axis=1)
Upvotes: 0
Reputation: 774
Adding on the previous answer to eliminate NA
import pandas as pd
#example of data
cols=['A','B']
data=[['Jan05 GN=XYZ NA', 'Q9GLD3;A0A0A0RBT6'],['Jan05 GN=XYZ NA','Q9GLD3;A0A0A0RBT6'],['Jun24 MNIAV X','P09582;A0A1R3UGQ4;A0A8I3NWV7']]
df=pd.DataFrame(data,columns=cols)
df
df['C']=df.apply(lambda x:x["A"].split("GN=")[1].split()[0] if("GN="in x["A"]) else x["B"].split(";")[0],axis=1)
Upvotes: 0