Steak
Steak

Reputation: 544

How to Create a Correlation Dataframe from already related data

I have a data frame of language similarity. Here is a small snippet that's been edited for simplicity:

    0       1       2
0   English Spanish 0.50
1   English Russian 0.15

I would like to create a correlation dataframe such as:

        English Spanish Russian
English 1       0.5     0.15
Spanish 0.5     1       -
Russian 0.15    -       1

To create the first dataframe, I ran:

pairing_list = [["English","Spanish",0.5],["English","Russian",0.15]]
df = pd.DataFrame(pairing_list)

I have tried:

df.corr()

Which returns:

        2
2       1.0

I have looked at other similar questions but it seems that the data for use in .corr() is by itself (ie: my data here is already a correlation between the two columns, whereas the examples I have seen are not yet such related).

To clarify: the data presented is already the similarity between the two languages, and thus is not some value associated with one language alone; it is for the pair listed in the columns.

How could I use Python / Pandas to do this?

Upvotes: 2

Views: 225

Answers (2)

Code Different
Code Different

Reputation: 93191

I assume that the pairs are unique, i.e. if there's English - Spanish, there won't be a Spanish - English.

# Rename the columns. Because working with columns
# named like numbers is a pain in the butt
df.columns = ['lang1', 'lang2', 'corr']

# Create the opposite pairs
tmp1 = df.copy().set_axis(['lang2', 'lang1', 'corr'], axis=1)

# Create the identity pairs
all_lang = np.unique(df[['lang1', 'lang2']])
tmp2 = pd.DataFrame({
    'lang1': all_lang,
    'lang2': all_lang,
    'corr': 1.0
})

# Concatenate all 3 together and beat the data frame into shape
result = (
    pd.concat([df, tmp1, tmp2], axis=0)
        .set_index(['lang1', 'lang2'])
        .unstack()
        .droplevel(0, axis=1)
        .rename_axis(index=[None], columns=[None])
)

Upvotes: 1

Corralien
Corralien

Reputation: 120539

Use crosstab to create the all language combinations and fill with the existing data:

lg = pd.concat([df[0], df[1]]).unique()  # ['English', 'Spanish', 'Russian']
cx = pd.crosstab(lg, lg)

cx.update(df.set_index([0, 1]).squeeze().unstack())
cx.update(df.set_index([0, 1]).squeeze().unstack().T)
>>> cx
col_0    English  Russian  Spanish
row_0
English     1.00     0.15      0.5
Russian     0.15     1.00      0.0
Spanish     0.50     0.00      1.0

Upvotes: 1

Related Questions