Dsh M
Dsh M

Reputation: 411

Panda_Merge multiple values in df1 to unique values in df2

I have one dataframe (df1) with details of Genes and list of organs they are associated with and another mapping dataframe (df2) that mach these organs to unique organ type.

E.g.

df1 <-

data.frame ("Gene_name"=c("Gene1", "Gene2", "Gene3, "Gene4"),   
"Organ_name"=c("Skin, Stomach, Eyes, Hair", "Lungs, Mouth, Oesophagus", "Pharynx, Lungs, Throat, Skin", "Stomach, Small intestine"))

df2 <-

data.frame ("Type"=c("External", "External", "External", "External"......"Internal", "Internal", "Internal"...),

 "Organ"=c("Skin", "Eyes", "Hair", "Legs",.... "Lungs", "Small intestine", "Oesophagus".....))

I want to see what major category individual genes belong to. Is it frequently present internally or externally?

If I split "Organ_name" using str.split(",") then in some cases I am getting around 20 columns. Merging these individual "Organ_name" columns of df1 with "Type" in df2 using Organ as the key is a big pain.

Is there a better way to analyse this data? How to know the frequency/count for "Type" of organ? Please let me know

Upvotes: 0

Views: 48

Answers (1)

jpp
jpp

Reputation: 164773

Below is an example of how you can structure your logic with pandas.

Setup

import pandas as pd

df1 = pd.DataFrame({"Gene_name": ("Gene1", "Gene2", "Gene3", "Gene4"),   
                    "Organ_name": ("Skin, Stomach, Eyes, Hair", "Lungs, Mouth, Oesophagus",
                                  "Pharynx, Lungs, Throat, Skin", "Stomach, Small intestine")})

df2 = pd.DataFrame({"Type": ("External", "External", "External", "External", "Internal", "Internal", "Internal"),
                    "Organ": ("Skin", "Eyes", "Hair", "Legs", "Lungs", "Small intestine", "Oesophagus")})

Solution

t = df2.set_index('Organ')['Type']

df1['Organ_list'] = df1['Organ_name'].str.split(', ')

df1['Int_Ext'] = [list(filter(None, map(t.get, x))) for x in df1['Organ_list']]

df1['Int_Ext_Flag'] = df1['Int_Ext'].apply(lambda x: 'Internal' if \
                      x.count('Internal') / len(x) >= 0.5 else 'External')

Result

  Gene_name                    Organ_name                      Organ_list  \
0     Gene1     Skin, Stomach, Eyes, Hair     [Skin, Stomach, Eyes, Hair]   
1     Gene2      Lungs, Mouth, Oesophagus      [Lungs, Mouth, Oesophagus]   
2     Gene3  Pharynx, Lungs, Throat, Skin  [Pharynx, Lungs, Throat, Skin]   
3     Gene4      Stomach, Small intestine      [Stomach, Small intestine]   

                          Int_Ext Int_Ext_Flag  
0  [External, External, External]     External  
1            [Internal, Internal]     Internal  
2            [Internal, External]     Internal  
3                      [Internal]     Internal 

Explanation

  • Create a mapping from organ to type using df2.
  • Split the strings in df1['Organ_list'] to form a list.
  • Map elements of this list to type. Add logic to determine whether "Internal" or "External" via pd.Series.apply.
  • In this example, I filter out organs which have not been mapped to type via list(filter(None, ...)).

Upvotes: 1

Related Questions