Reputation: 544
I have a data frame of language similarity. Here is a small snippet that's been edited for simplicity:
0 1 2
0 English Spanish 0.50
1 English Russian 0.15
I would like to create a correlation dataframe such as:
English Spanish Russian
English 1 0.5 0.15
Spanish 0.5 1 -
Russian 0.15 - 1
To create the first dataframe, I ran:
pairing_list = [["English","Spanish",0.5],["English","Russian",0.15]]
df = pd.DataFrame(pairing_list)
I have tried:
df.corr()
Which returns:
2
2 1.0
I have looked at other similar questions but it seems that the data for use in .corr() is by itself (ie: my data here is already a correlation between the two columns, whereas the examples I have seen are not yet such related).
To clarify: the data presented is already the similarity between the two languages, and thus is not some value associated with one language alone; it is for the pair listed in the columns.
How could I use Python / Pandas to do this?
Upvotes: 2
Views: 225
Reputation: 93191
I assume that the pairs are unique, i.e. if there's English - Spanish
, there won't be a Spanish - English
.
# Rename the columns. Because working with columns
# named like numbers is a pain in the butt
df.columns = ['lang1', 'lang2', 'corr']
# Create the opposite pairs
tmp1 = df.copy().set_axis(['lang2', 'lang1', 'corr'], axis=1)
# Create the identity pairs
all_lang = np.unique(df[['lang1', 'lang2']])
tmp2 = pd.DataFrame({
'lang1': all_lang,
'lang2': all_lang,
'corr': 1.0
})
# Concatenate all 3 together and beat the data frame into shape
result = (
pd.concat([df, tmp1, tmp2], axis=0)
.set_index(['lang1', 'lang2'])
.unstack()
.droplevel(0, axis=1)
.rename_axis(index=[None], columns=[None])
)
Upvotes: 1
Reputation: 120539
Use crosstab
to create the all language combinations and fill with the existing data:
lg = pd.concat([df[0], df[1]]).unique() # ['English', 'Spanish', 'Russian']
cx = pd.crosstab(lg, lg)
cx.update(df.set_index([0, 1]).squeeze().unstack())
cx.update(df.set_index([0, 1]).squeeze().unstack().T)
>>> cx
col_0 English Russian Spanish
row_0
English 1.00 0.15 0.5
Russian 0.15 1.00 0.0
Spanish 0.50 0.00 1.0
Upvotes: 1