Sterling Wright
Sterling Wright

Reputation: 5

Is there a way to create a new column based on a substring and text that follow it?

I am working with triqler data outputs. I have been able to convert it into a pandas dataframe that looks like the following:

A B
Jan05 GN=XYZ NA Q9GLD3;A0A0A0RBT6
Dec25 GN=ZYX Y A0A8I3PIE0;A0A8I3PEP4;A0A0B4J198
Nov12 GN=NHAN Y P60524;A0A1K0GGH0;A0A8I3P9B9
May22 GN=GZV X P09582;A0A1R3UGQ4;A0A8I3NWV7
Jun24 MNIAV X P09582;A0A1R3UGQ4;A0A8I3NWV7
May22 CAUCGZV Y P09582;A0A1R3UGQ4;A0A8I3NWV7

I need to create a new column that either uses the string after the "GN=" from the values in column A or the first part before the semicolon in column B if there is no "GN=" in column A.

In other words, I need it to look like this:

A B C
Jan05 GN=XYZ NA Q9GLD3;A0A0A0RBT6 XYZ
Dec25 GN=ZYX Y A0A8I3PIE0;A0A8I3PEP4;A0A0B4J198 ZYX
Nov12 GN=NHAN Y P60524;A0A1K0GGH0;A0A8I3P9B9 NHAN
May22 GN=GZV X P09582;A0A1R3UGQ4;A0A8I3NWV7 GZV
Jun24 MNIAV X P09582;A0A1R3UGQ4;A0A8I3NWV7 P09582
May22 CAUCGZV Y P09582;A0A1R3UGQ4;A0A8I3NWV7 P09582

I am not sure which packages or commands I should be using to accomplish this goal. I think that a conditional statement would be the way to go.

Upvotes: 0

Views: 87

Answers (5)

sitting_duck
sitting_duck

Reputation: 3720

df['C'] = (
    df['A'].str.extract('GN=(.+) ')
    .where(df['A'].str.contains('GN='),df['B'].str.extract('^(.+?);.*'))
)

print(df)

Result

                 A                                 B       C
0  Jan05 GN=XYZ NA                 Q9GLD3;A0A0A0RBT6     XYZ
1   Dec25 GN=ZYX Y  A0A8I3PIE0;A0A8I3PEP4;A0A0B4J198     ZYX
2  Nov12 GN=NHAN Y      P60524;A0A1K0GGH0;A0A8I3P9B9    NHAN
3   May22 GN=GZV X      P09582;A0A1R3UGQ4;A0A8I3NWV7     GZV
4    Jun24 MNIAV X      P09582;A0A1R3UGQ4;A0A8I3NWV7  P09582
5  May22 CAUCGZV Y      P09582;A0A1R3UGQ4;A0A8I3NWV7  P09582

Upvotes: 0

Ynjxsjmh
Ynjxsjmh

Reputation: 30050

Let's try

df['C'] = (df['A'].str.extract('GN=([^\s]+)')[0]
           .fillna(df['B'].str.split(';').str[0]))
print(df)

                 A                                 B       C
0  Jan05 GN=XYZ NA                 Q9GLD3;A0A0A0RBT6     XYZ
1   Dec25 GN=ZYX Y  A0A8I3PIE0;A0A8I3PEP4;A0A0B4J198     ZYX
2  Nov12 GN=NHAN Y      P60524;A0A1K0GGH0;A0A8I3P9B9    NHAN
3   May22 GN=GZV X      P09582;A0A1R3UGQ4;A0A8I3NWV7     GZV
4    Jun24 MNIAV X      P09582;A0A1R3UGQ4;A0A8I3NWV7  P09582
5  May22 CAUCGZV Y      P09582;A0A1R3UGQ4;A0A8I3NWV7  P09582

Upvotes: 1

Baron Legendre
Baron Legendre

Reputation: 2188

action1 = df['A'].str.split('GN=',expand=True)[1].str.split(' ',expand=True)[0]
action2 = df['B'].str.split(';',expand=True)[0]
df['C'] = np.where(df['A'].str.contains('GN='),action1, action2)
df
###
                 A                                 B       C
0  Jan05 GN=XYZ NA                 Q9GLD3;A0A0A0RBT6     XYZ
1   Dec25 GN=ZYX Y  A0A8I3PIE0;A0A8I3PEP4;A0A0B4J198     ZYX
2  Nov12 GN=NHAN Y      P60524;A0A1K0GGH0;A0A8I3P9B9    NHAN
3   May22 GN=GZV X      P09582;A0A1R3UGQ4;A0A8I3NWV7     GZV
4    Jun24 MNIAV X      P09582;A0A1R3UGQ4;A0A8I3NWV7  P09582
5  May22 CAUCGZV Y      P09582;A0A1R3UGQ4;A0A8I3NWV7  P09582

Upvotes: 0

Mouad Slimane
Mouad Slimane

Reputation: 1067

try this

data['new_column']=data.apply(lambda x:x["A"].split("GN=")[1] if("GN="in x["A"]) else x["B"].split(";")[0],axis=1)

Upvotes: 0

Ran A
Ran A

Reputation: 774

Adding on the previous answer to eliminate NA

import pandas as pd
#example of data
cols=['A','B']
data=[['Jan05 GN=XYZ NA', 'Q9GLD3;A0A0A0RBT6'],['Jan05 GN=XYZ NA','Q9GLD3;A0A0A0RBT6'],['Jun24 MNIAV X','P09582;A0A1R3UGQ4;A0A8I3NWV7']]
df=pd.DataFrame(data,columns=cols)
df
df['C']=df.apply(lambda x:x["A"].split("GN=")[1].split()[0] if("GN="in x["A"]) else x["B"].split(";")[0],axis=1)

Upvotes: 0

Related Questions