Reputation: 125
I have a pandas dataframe
_id _score ensembl ensembl.gene notfound
query
Dnmt3a 1788 89.405594 NaN ENSG00000119772 NaN
SUMO1 7341 85.157100 NaN ENSG00000116030 NaN
GADD45a 1647 86.867760 NaN ENSG00000116717 NaN
Rad17 5884 85.377050 [{u'gene': u'ENSG00000155093'}, {u'gene': u'ENSG00000282185'}] NaN NaN
DRS NaN NaN NaN NaN True
On the basis of the values of 'ensembl', 'ensembl.gene' and 'notfound', how to find out the ensemble id for a particular instance. The output should be based on three conditions
If both the value of 'ensembl' and 'ensembl.gene' is 'NaN', then output is "Not found". e.g. fifth row.
If the value of 'ensembl' is 'NaN', then just print the value of 'ensembl.gene' e.g. first, second and third row.
If the value of 'ensembl.gene' is 'NaN', then print first part of the value of 'ensembl' e.g. in fourth row the value of 'ensembl.gene' is 'NaN', hence the output is the first part of 'ensembl' value i.e. ENSG00000155093.
The output should be
Ensemble_ID
query
Dnmt3a ENSG00000119772
SUMO1 ENSG00000116030
GADD45a ENSG00000116717
Rad17 ENSG00000155093
DRS Not_found
Upvotes: 0
Views: 135
Reputation: 2407
First create a copy of the column "ensemble.gene". Then apply the "where" method and some regex. Finally use "fillna".
df["Ensemble_ID"]=df["ensembl.gene"]
df["Ensemble_ID"]=df["Ensemble_ID"].where(df["ensembl"].isna(),df["ensembl"].str.extract(r"u'(ENSG\d+)",expand=False))
df["Ensemble_ID"].fillna("Not_found",inplace=True)
df["Ensemble_ID"]
query
Dnmt3a ENSG00000119772
SUMO1 ENSG00000116030
GADD45a ENSG00000116717
Rad17 ENSG00000155093
DRS Not_found
Name: Ensemble_ID, dtype: object
Upvotes: 1
Reputation: 4487
If I understand correctly, this is what you need:
import numpy as np
def make_id(row):
if row['ensembl'] is np.nan and row['ensembl.gene'] is np.nan: # 1) If both the value of 'ensembl' and 'ensembl.gene' is 'NaN', then output is "Not found".
return 'Not Found'
elif row['ensembl'] is np.nan: # 2) If the value of 'ensembl' is 'NaN', then just print the value of 'ensembl.gene'
return row['ensembl.gene']
else: # 3) (otherwise) If the value of 'ensembl.gene' is 'NaN', then print first part of the value of 'ensembl'
return row['ensembl'][0]['gene']
df = pd.DataFrame({'ensembl': [np.nan,[{u'gene': u'ENSG00000155093'}],np.nan], 'ensembl.gene':[1,4,5]})
df['id'] = df.apply(lambda row: make_id(row), axis=1)
print(df)
ensembl ensembl.gene id
0 None 1 1
1 [{'gene': 'ENSG00000155093'}] 4 ENSG00000155093
2 None 5 5
In this way the ids of each row of your df are generated and saved in the corresponding 'id'
column.
Note: If the missing value is not represented by np.nan, replace np.nan with another placeholder 'nan' inside the algorithm
Upvotes: 2
Reputation: 4487
If I understand your question correctly, this code solved your problem:
searched_id = df.loc[df['ensembl']=='ENSG00000119772'].index[0]
You can generalize the code in a simple way as follows:
def get_index(df, pred)
return df.loc[pred].index
In this way the results are filtered according to the predicate, and the corresponding index list is returned. An example of use is the following:
pred = (df['ensemble']=='val1') & (df['ensembl.gene']=='val2') & (df['notfound']=='val3')
searched_id = get_index(df, pred)
If I haven't answered your question, try to rephrase the question because it's really unclear
Upvotes: 1
Reputation: 1404
As far as I understood you want to find the value of "_id"
that correspond to known values of 'ensembl', 'ensembl.gene'
and 'notfound'
.
Here how to do it with a toy dataframe (easily extendable to your case).
import numpy as np
import pandas as pd
df = pd.DataFrame({'id':[0,1,2,3],
'col_1':[11,12,13,14],
'col_2':[110,120,130,140],
'col_3':[1100,1200,1300,1400]})
condition = np.logical_and(
np.logical_and(df['col_1']==13,
df['col_2']==130),df['col_3']==1300)
print (f'the index(es) corresponding to the values of the columns is:
{df["id"][condition].values}')
# output
the index corresponding to the values of the columns is: [2]
Upvotes: 0