Pratik Dutta
Pratik Dutta

Reputation: 125

How to select a particular column in a dataframe among two columns based on the column value?

I have a pandas dataframe

        _id     _score      ensembl   ensembl.gene  notfound
query                   
Dnmt3a  1788    89.405594   NaN      ENSG00000119772     NaN
SUMO1   7341    85.157100   NaN      ENSG00000116030    NaN
GADD45a 1647    86.867760   NaN      ENSG00000116717    NaN
Rad17   5884    85.377050   [{u'gene': u'ENSG00000155093'}, {u'gene': u'ENSG00000282185'}]  NaN NaN
DRS     NaN     NaN         NaN       NaN               True

On the basis of the values of 'ensembl', 'ensembl.gene' and 'notfound', how to find out the ensemble id for a particular instance. The output should be based on three conditions

  1. If both the value of 'ensembl' and 'ensembl.gene' is 'NaN', then output is "Not found". e.g. fifth row.

  2. If the value of 'ensembl' is 'NaN', then just print the value of 'ensembl.gene' e.g. first, second and third row.

  3. If the value of 'ensembl.gene' is 'NaN', then print first part of the value of 'ensembl' e.g. in fourth row the value of 'ensembl.gene' is 'NaN', hence the output is the first part of 'ensembl' value i.e. ENSG00000155093.

The output should be

    Ensemble_ID
query                   
Dnmt3a  ENSG00000119772
SUMO1   ENSG00000116030
GADD45a ENSG00000116717
Rad17   ENSG00000155093
DRS     Not_found

Upvotes: 0

Views: 135

Answers (4)

kantal
kantal

Reputation: 2407

First create a copy of the column "ensemble.gene". Then apply the "where" method and some regex. Finally use "fillna".

df["Ensemble_ID"]=df["ensembl.gene"]
df["Ensemble_ID"]=df["Ensemble_ID"].where(df["ensembl"].isna(),df["ensembl"].str.extract(r"u'(ENSG\d+)",expand=False))
df["Ensemble_ID"].fillna("Not_found",inplace=True)

df["Ensemble_ID"]                                                                                                   
query
Dnmt3a     ENSG00000119772
SUMO1      ENSG00000116030
GADD45a    ENSG00000116717
Rad17      ENSG00000155093
DRS              Not_found
Name: Ensemble_ID, dtype: object

Upvotes: 1

Massifox
Massifox

Reputation: 4487

If I understand correctly, this is what you need:

import numpy as np

def make_id(row): 
    if row['ensembl'] is np.nan and row['ensembl.gene'] is np.nan:  # 1) If both the value of 'ensembl' and 'ensembl.gene' is 'NaN', then output is "Not found".
        return 'Not Found'
    elif row['ensembl'] is np.nan:                                  # 2) If the value of 'ensembl' is 'NaN', then just print the value of 'ensembl.gene'
        return row['ensembl.gene']
    else:                                                           # 3) (otherwise) If the value of 'ensembl.gene' is 'NaN', then print first part of the value of 'ensembl' 
        return row['ensembl'][0]['gene']



df = pd.DataFrame({'ensembl': [np.nan,[{u'gene': u'ENSG00000155093'}],np.nan], 'ensembl.gene':[1,4,5]})
df['id'] = df.apply(lambda row: make_id(row), axis=1)
print(df)  

                         ensembl  ensembl.gene               id
0                           None             1                1
1  [{'gene': 'ENSG00000155093'}]             4  ENSG00000155093
2                           None             5                5

In this way the ids of each row of your df are generated and saved in the corresponding 'id' column.

Note: If the missing value is not represented by np.nan, replace np.nan with another placeholder 'nan' inside the algorithm

Upvotes: 2

Massifox
Massifox

Reputation: 4487

If I understand your question correctly, this code solved your problem:

searched_id = df.loc[df['ensembl']=='ENSG00000119772'].index[0]

You can generalize the code in a simple way as follows:

def get_index(df, pred)
    return df.loc[pred].index

In this way the results are filtered according to the predicate, and the corresponding index list is returned. An example of use is the following:

pred = (df['ensemble']=='val1') & (df['ensembl.gene']=='val2') & (df['notfound']=='val3')
searched_id = get_index(df, pred)

If I haven't answered your question, try to rephrase the question because it's really unclear

Upvotes: 1

JacoSolari
JacoSolari

Reputation: 1404

As far as I understood you want to find the value of "_id" that correspond to known values of 'ensembl', 'ensembl.gene' and 'notfound'. Here how to do it with a toy dataframe (easily extendable to your case).

import numpy as np
import pandas as pd
df = pd.DataFrame({'id':[0,1,2,3],
                   'col_1':[11,12,13,14],
                   'col_2':[110,120,130,140],
                   'col_3':[1100,1200,1300,1400]})
condition = np.logical_and(
    np.logical_and(df['col_1']==13,
                   df['col_2']==130),df['col_3']==1300)
print (f'the index(es) corresponding to the values of the columns is:
       {df["id"][condition].values}')
# output
the index corresponding to the values of the columns is: [2]

Upvotes: 0

Related Questions