How to import precomputed distance matrix (in csv format) to python?

Question

I have been trying to import a precomputed distance matrix with pandas so that I can plot it as a heatmap with seaborn. I used the following code:

import pandas as pd
msa = pd.read_csv("Multiple_alignment_distance_matrix.csv")

The output below does not look like a distance matrix.

    sp|Q9BYW2|SETD2_HUMAN Histone-lysine N-methyltransferase SETD2 OS=Homo sapiens OX=9606 GN=SETD2 PE=1 SV=3   sp|P42858|HD_HUMAN Huntingtin OS=Homo sapiens OX=9606 GN=HTT PE=1 SV=2  sp|Q8IUH5|ZDH17_HUMAN Palmitoyltransferase ZDHHC17 OS=Homo sapiens OX=9606 GN=ZDHHC17 PE=1 SV=2 sp|O75400|PR40A_HUMAN Pre-mRNA-processing factor 40 homolog A OS=Homo sapiens OX=9606 GN=PRPF40A PE=1 SV=2  tr|F8VU11|F8VU11_HUMAN PRP40 pre-mRNA processing factor 40 homolog B (Yeast), isoform CRA_a OS=Homo sapiens OX=9606 GN=PRPF40B PE=1 SV=2    sp|Q6NWY9|PR40B_HUMAN Pre-mRNA-processing factor 40 homolog B OS=Homo sapiens OX=9606 GN=PRPF40B PE=1 SV=1  sp|P43357|MAGA3_HUMAN Melanoma-associated antigen 3 OS=Homo sapiens OX=9606 GN=MAGEA3 PE=1 SV=1 tr|A0A024RBM8|A0A024RBM8_HUMAN AMPylator FICD OS=Homo sapiens OX=9606 GN=HYPE PE=3 SV=1 sp|Q9BVA6|FICD_HUMAN Protein adenylyltransferase FICD OS=Homo sapiens OX=9606 GN=FICD PE=1 SV=2 tr|B3KSH4|B3KSH4_HUMAN Huntingtin interacting protein 2, isoform CRA_a OS=Homo sapiens OX=9606 GN=HIP2 PE=2 SV=1    tr|B4DIZ2|B4DIZ2_HUMAN cDNA FLJ57995, moderately similar to Ubiquitin-conjugating enzyme E2-25 kDa OS=Homo sapiens OX=9606 PE=2 SV=1    sp|P61086|UBE2K_HUMAN Ubiquitin-conjugating enzyme E2 K OS=Homo sapiens OX=9606 GN=UBE2K PE=1 SV=3
0   sp|Q9BYW2|SETD2_HUMAN Histone-lysine N-methylt...   2564    409 69  114 109 107 41  89  89  9   13  19
1   sp|P42858|HD_HUMAN Huntingtin OS=Homo sapiens ...   409 3142    90  126 143 143 59  58  58  15  14  18
2   sp|Q8IUH5|ZDH17_HUMAN Palmitoyltransferase ZDH...   69  90  632 5   10  10  1   16  16  0   2   2
3   sp|O75400|PR40A_HUMAN Pre-mRNA-processing fact...   114 126 5   957 502 498 15  5   5   0   0   0
4   tr|F8VU11|F8VU11_HUMAN PRP40 pre-mRNA processi...   109 143 10  502 892 870 17  3   3   0   0   0
5   sp|Q6NWY9|PR40B_HUMAN Pre-mRNA-processing fact...   107 143 10  498 870 871 16  3   3   0   0   0
6   sp|P43357|MAGA3_HUMAN Melanoma-associated anti...   41  59  1   15  17  16  314 1   1   0   0   0
7   tr|A0A024RBM8|A0A024RBM8_HUMAN AMPylator FICD ...   89  58  16  5   3   3   1   458 458 19  29  42
8   sp|Q9BVA6|FICD_HUMAN Protein adenylyltransfera...   89  58  16  5   3   3   1   458 458 19  29  42
9   tr|B3KSH4|B3KSH4_HUMAN Huntingtin interacting ...   9   15  0   0   0   0   0   19  19  97  67  97
10  tr|B4DIZ2|B4DIZ2_HUMAN cDNA FLJ57995, moderate...   13  14  2   0   0   0   0   29  29  67  139 139
11  sp|P61086|UBE2K_HUMAN Ubiquitin-conjugating en...   19  18  2   0   0   0   0   42  42  97  139 200

The columns look all right, but the rows are numbered (0, 1, 2, ...) and the protein names show up as an ordinary column. I then tried to create the heatmap from this:

import seaborn as sns
sns.heatmap(msa)

But I get a TypeError. I have read the pandas and scipy documentation, but I am having a hard time understanding it.

mozway · Accepted Answer

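Pass index_col=0 to read_csv so that the first column (the protein names) becomes the row index instead of a data column. Without it, that column of strings stays in the data, which is most likely what triggers the TypeError in sns.heatmap: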

import pandas as pd
import seaborn as sns
df = pd.read_csv('Multiple_alignment_distance_matrix.csv', index_col=0)
sns.heatmap(df)
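To see what index_col=0 changes without needing the original file, here is a minimal sketch that reads a small made-up matrix from a string. The labels A, B, C and the numbers are placeholders, not values from the question, and it assumes the real CSV has the same layout (labels in the first column and in the header row).

```python
import io
import pandas as pd

# Hypothetical 3x3 matrix standing in for the real CSV file.
csv_text = """,A,B,C
A,10,2,3
B,2,20,5
C,3,5,30
"""

# Default read: the label column is kept as an ordinary (text) data column.
raw = pd.read_csv(io.StringIO(csv_text))
has_text_column = not pd.api.types.is_numeric_dtype(raw.iloc[:, 0])

# With index_col=0 the labels become the row index and only numbers remain.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in df.dtypes)

print(raw.shape, has_text_column)
print(df.shape, all_numeric)
```

The second frame is square and purely numeric, which is the shape sns.heatmap expects.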

Bonus: nicer names

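import re

def prot_name(s):
    # Keep only the protein description between the accession and " OS=".
    match = re.search(r'^[^ ]+ (.*) OS=', s)
    # Fall back to the original label so unmatched names do not become None.
    return match.group(1) if match else s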

sns.heatmap(df.rename(columns=prot_name, index=prot_name))
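As a quick check of what the renaming does, the sketch below applies the same regular expression to one header copied from the question and to a made-up label with no " OS=" field. Returning the input unchanged when nothing matches is an assumption added here, so that such labels are kept rather than lost.

```python
import re

def prot_name(label):
    # Capture the text between the first space and the " OS=" field.
    match = re.search(r'^[^ ]+ (.*) OS=', label)
    # Assumption: keep the original label when the pattern does not match.
    return match.group(1) if match else label

header = "sp|P42858|HD_HUMAN Huntingtin OS=Homo sapiens OX=9606 GN=HTT PE=1 SV=2"
short = prot_name(header)
unchanged = prot_name("custom_label")  # hypothetical label without " OS="

print(short)      # Huntingtin
print(unchanged)  # custom_label
```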
