Reputation: 249
I have two completely different dataframes that I need to map (thanks biology). All the tutorials out there on pandas are on much simpler conversions, and I can't get around solving this (real newbie) without 4 nested loops without success. Really curious on a pythonic way to solve this problem without having to go back to Excel.
The first is something like this df1. Observations of zeros and 1s for thousands of genes in a-j categories.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randint(0,2,size =(10,10)),columns=list('abcdefghij'), index = ['gene1','gene2','gene3','gene4','gene5','gene6','gene7','gene8','gene9','gene10'])
print(df1)
a b c d e f g h i j
gene1 1 0 1 0 1 0 1 1 1 0
gene2 0 1 0 0 0 0 0 0 1 0
gene3 0 1 1 1 1 1 0 0 0 0
gene4 1 0 1 0 0 1 0 1 1 1
gene5 0 0 1 0 0 0 0 0 0 0
gene6 0 1 0 0 1 0 1 0 1 0
gene7 1 1 0 1 1 0 0 0 1 0
gene8 0 0 0 1 1 1 1 0 1 0
gene9 1 0 1 0 1 0 1 1 0 1
gene10 1 0 0 0 1 0 1 0 1 1
Then the second is something like this df2. A mapping of higher level categories (X-W) for the lower level ones. This girl has NaNs and no indexes.
df2 = pd.DataFrame({'X': ['a','NaN','NaN','NaN'],
'Y': ['d', 'b', 'c','f'],
'Z':['g', 'h','e','NaN'],
'W': ['i', 'j','NaN','Nan']},index=None)
print(df2)
W X Y Z
0 i a d g
1 j NaN b h
2 NaN NaN c e
3 Nan NaN f NaN
What I need is something like result1. Here resides another tricky thing. E.g. gene4 is in i and j categories, and both are in W, but I still just want a '1' in result1.loc['gene4','W']. The final result still needs to be binary.
result1 = pd.DataFrame({'X': ['1','0','0','1','0','0','1','0','1','1'],
'Y': ['1','1','1','1','1','1','1','1','1','0'],
'Z': ['1','0','1','1','0','1','1','1','1','1'],
'W': ['1','1','0','1','0','1','1','1','1','1']}, index = ['gene1','gene2','gene3','gene4','gene5','gene6','gene7','gene8','gene9','gene10'])
print(result1)
W X Y Z
gene1 1 1 1 1
gene2 1 0 1 0
gene3 0 0 1 1
gene4 1 1 1 1
gene5 0 0 1 0
gene6 1 0 1 1
gene7 1 1 1 1
gene8 1 0 1 1
gene9 1 1 1 1
gene10 1 1 0 1
This could be another possible result format. [updated with actual expected result]. If someone is feeling like teaching them both (or an easy interconversion), much extra appreciation, and science is also grateful.
result1 = pd.DataFrame({'1': ['gene1','gene1','gene1','gene1'],
'2': ['gene2','gene4','gene2','gene3'],
'3': ['gene4','gene7','gene3','gene4'],
'4': ['gene6','gene9','gene4','gene6'],
'5': ['gene7','gene10','gene5','gene7'],
'6': ['gene8','NaN','gene6','gene8'],
'7': ['gene9','NaN','gene7','gene9'],
'8': ['gene10','NaN','gene8','gene10'],
'9': ['NaN','NaN','gene9','NaN'],
},
index = ['W','X','Y','Z'])
print(result1)
1 2 3 4 5 6 7 8 9
W gene1 gene2 gene4 gene6 gene7 gene8 gene9 gene10 NaN
X gene1 gene4 gene7 gene9 gene10 NaN NaN NaN NaN
Y gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 gene9
Z gene1 gene3 gene4 gene6 gene7 gene8 gene9 gene10 NaN
Thank you a lot for your patience reading this long question.
Upvotes: 1
Views: 83
Reputation: 153510
Here we go! Let's try this.
df1 = pd.DataFrame(np.random.randint(0,2,size =(10,10)),columns=list('abcdefghij'), index = ['gene1','gene2','gene3','gene4','gene5','gene6','gene7','gene8','gene9','gene10'])
df2 = pd.DataFrame({'X': ['a','NaN','NaN','NaN'],
'Y': ['d', 'b', 'c','f'],
'Z':['g', 'h','e','NaN'],
'W': ['i', 'j','NaN','NaN']},index=None)
df2 = df2.replace('NaN',np.nan)
gmap = df2.stack().reset_index().drop('level_0',axis=1).set_index(0)['level_1']
df3 = df1.stack().replace(0,np.nan).dropna().reset_index(level=1)['level_1'].map(gmap).reset_index().drop_duplicates()
df_out = df3.groupby(['index','level_1'])['level_1'].count().unstack()
print(df_out)
Output:
level_1 W X Y Z
index
gene1 1.0 NaN NaN NaN
gene10 1.0 1.0 1.0 1.0
gene2 1.0 1.0 1.0 1.0
gene3 1.0 1.0 1.0 1.0
gene4 1.0 NaN 1.0 1.0
gene5 1.0 NaN 1.0 NaN
gene6 1.0 1.0 1.0 1.0
gene7 NaN 1.0 1.0 1.0
gene8 NaN NaN 1.0 1.0
gene9 1.0 NaN NaN 1.0
df1 = pd.DataFrame(np.random.randint(0,2,size =(10,10)),columns=list('abcdefghij'), index = ['gene1','gene2','gene3','gene4','gene5','gene6','gene7','gene8','gene9','gene10'])
df2 = pd.DataFrame({'X': ['a','NaN','NaN','NaN'],
'Y': ['d', 'b', 'c','f'],
'Z':['g', 'h','e','NaN'],
'W': ['i', 'j','NaN','NaN']},index=None)
df2 = df2.replace('NaN',np.nan)
gmap = df2.stack().reset_index().drop('level_0',axis=1).set_index(0)['level_1']
df3 = df1.stack().replace(0,np.nan).dropna().reset_index(level=1)['level_1'].map(gmap).reset_index().drop_duplicates()
df3['cols'] = df3['index'].str.split('gene').str[1].astype(int)
df_out2 = df3.set_index(['level_1','cols'])['index'].unstack()
Output:
cols 1 2 3 4 5 6 7 8 9 10
level_1
W gene1 gene2 gene3 gene4 gene5 None gene7 gene8 gene9 gene10
X None None gene3 None gene5 None None gene8 gene9 gene10
Y gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 gene9 gene10
Z None gene2 None gene4 None gene6 None gene8 gene9 None
Upvotes: 1