claudia chiodi
claudia chiodi

Reputation: 41

How to create a 2Darray from a dictionary?

I have a file containing the correlation between two factors which looks like this:

gr2_1 gr2_2 90%
gr2_1 gr2_3 80%
gr2_1 gr4_7 55%
gr2_2 gr3A_1 5%
gr2_2 gr3A_2 100%
gr3B_1 gr2_1 60%

I want to convert it into a 2D array:

       gr2   gr3A    gr3B gr3C gr4
gr2_1  90%   none    none none 55%
gr2_2 none  5%,100%  none none none
gr2_3  65%   none     20% none none
gr3B_1 60%   none    none none none

So in the first column I will have the names of each element, not repeated (huge list). And in the first line the names of the groups (just those 5). As you can see some elements can have more than one (usually no more than 2-3) elements correlated. In the matrix I want to have the percentage of correlation between the two elements. I don't care about losing the ID of the second element, I just want to know if the first element is present in in which group and with which correlation.

I'm a very beginner to Python, I think I have to do it using a dictionary, than I think to isolate the name of the group "gr2" and use a regular expression, but nothing I tried worked

Upvotes: 4

Views: 82

Answers (1)

yatu
yatu

Reputation: 88305

Here's one approach. There are a few discrepancies between your sample input file and expected output, so I've gone ahead and shared a solution making some assumptions.

import pandas as pd
# load your file into a list
l = [l.rstrip('\n').split() for l in open('my_file.txt')]
# [['gr2_1', 'gr2_2', '90%'], ['gr2_1', 'gr2_3', '80%'],...
# Construct a dataframe from it
df = pd.DataFrame(l).add_prefix('col')

    col0    col1  col2
0   gr2_1   gr2_2   90%
1   gr2_1   gr2_3   80%
2   gr2_1   gr4_7   55%
3   gr2_2  gr3A_1    5%
4   gr2_2  gr3A_2  100%
5  gr3B_1   gr2_1   60%

And we can then use pd.crosstab with a custom aggregation function, for instance str.cat with a comma separator, to concatenate entries with a common col0 and col1.

We will also need some extra work on the col1 column, as we want pd.crosstab to regard gr2_2 and gr2_3 as the same. So we can use str.split and slice the result keeping the first item:

(pd.crosstab(index=df.col0, 
            columns=df.col1.str.split('_').str[0], 
            values = df.col2, 
            aggfunc= lambda x: x.str.cat(sep=',')))

col1        gr2     gr3A  gr4
col0                         
gr2_1   90%,80%      NaN  55%
gr2_2       NaN  5%,100%  NaN
gr3B_1      60%      NaN  NaN

Upvotes: 3

Related Questions