gthm

Reputation: 1948

Reading a matrix and fetching row and column names in Python

I would like to read a matrix file that looks something like this:

sample  sample1 sample2 sample3
sample1 1   0.7 0.8
sample2 0.7 1   0.8
sample3 0.8 0.8 1

I would like to fetch all the pairs that have a value of 0.8 or greater, e.g. sample1,sample3 0.8, sample2,sample3 0.8, etc., in a large file.

When I use csv.reader, each line turns into a list, and keeping track of row and column names makes the program clumsy. I would like to know an elegant way of doing it, e.g. with numpy or pandas.

Desired output:

sample1,sample3 0.8 
sample2,sample3 0.8

The 1s can be ignored, because the value between a sample and itself is always 1.

Upvotes: 3

Views: 3030

Answers (5)

JARS

Reputation: 1119

Use scipy.sparse.coo_matrix, as it works with a "(row, col, data)" format.

from scipy.sparse import coo_matrix
import numpy as np

M = np.array([[1.0, 0.7, 0.8], [0.7, 1.0, 0.8], [0.8, 0.8, 1.0]])  # np.matrix is deprecated; a plain 2-D array works
S = coo_matrix(M)

Here, S.row and S.col are arrays of row and column indices, S.data is the array of values at those indices. So you can filter by

idx = S.data >= 0.8

And for instance create a new matrix only with those elements:

S2 = coo_matrix((S.data[idx], (S.row[idx], S.col[idx])))
print(S2)

The output is

(0, 0)  1.0
(0, 2)  0.8
(1, 1)  1.0
(1, 2)  0.8
(2, 0)  0.8
(2, 1)  0.8
(2, 2)  1.0

Note (0,1) does not appear as the value is 0.7.
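To get back to the sample names the question asks for, the COO indices can be mapped through the header labels. A sketch (the names list is an assumption, standing in for the labels read from the file's first row); it also restricts to the upper triangle so the diagonal and mirrored duplicates are skipped:

```python
from scipy.sparse import coo_matrix
import numpy as np

# Assumed: the row/column labels read from the file header
names = ['sample1', 'sample2', 'sample3']

M = np.array([[1.0, 0.7, 0.8], [0.7, 1.0, 0.8], [0.8, 0.8, 1.0]])
S = coo_matrix(M)

# Keep the upper triangle only (row < col): drops the diagonal 1s
# and reports each symmetric pair once
idx = (S.data >= 0.8) & (S.row < S.col)
pairs = [(names[i], names[j], v)
         for i, j, v in zip(S.row[idx], S.col[idx], S.data[idx])]
# pairs -> [('sample1', 'sample3', 0.8), ('sample2', 'sample3', 0.8)]
```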

Upvotes: 2

Lee

Reputation: 31040

To read it in you need the skipinitialspace and index_col parameters:

import pandas as pd

a = pd.read_csv('yourfile.txt', sep=' ', skipinitialspace=True, index_col=0)

To get the values pair wise:

[[x, y, round(a[x][y], 3)] for x in a.index for y in a.columns if x != y and a[x][y] >= 0.8][:2]

Gives:

[['sample1', 'sample3', 0.8], 
['sample2', 'sample3', 0.8]]

Upvotes: 2

Andy Hayden

Reputation: 375495

You can mask out everything on or below the diagonal with np.triu:

In [11]: df
Out[11]:
         sample1  sample2  sample3
sample
sample1      1.0      0.7      0.8
sample2      0.7      1.0      0.8
sample3      0.8      0.8      1.0

In [12]: np.triu(df, 1)
Out[12]:
array([[ 0. ,  0.7,  0.8],
       [ 0. ,  0. ,  0.8],
       [ 0. ,  0. ,  0. ]])

In [13]: np.triu(df, 1) >= 0.8
Out[13]:
array([[False, False,  True],
       [False, False,  True],
       [False, False, False]], dtype=bool)

Then to extract the index/columns where it's True I think you have to use np.where*:

In [14]: np.where(np.triu(df, 1) >= 0.8)
Out[14]: (array([0, 1]), array([2, 2]))

This gives you an array of row indices followed by an array of column indices (this is the least efficient part of this numpy version):

In [16]: index, cols = np.where(np.triu(df, 1) >= 0.8)

In [17]: [(df.index[i], df.columns[j], df.iloc[i, j]) for i, j in zip(index, cols)]
Out[17]:
[('sample1', 'sample3', 0.80000000000000004),
 ('sample2', 'sample3', 0.80000000000000004)]

As desired.

*I may be forgetting an easier way to get this last chunk (Edit: the below pandas code does it, but I think there may be another way too.)


You can use the same trick in pandas but with stack to get the index/columns natively:

In [21]: (np.triu(df, 1) >= 0.8) * df
Out[21]:
         sample1  sample2  sample3
sample
sample1        0        0      0.8
sample2        0        0      0.8
sample3        0        0      0.0

In [22]: res = ((np.triu(df, 1) >= 0.8) * df).stack()

In [23]: res
Out[23]:
sample
sample1  sample1    0.0
         sample2    0.0
         sample3    0.8
sample2  sample1    0.0
         sample2    0.0
         sample3    0.8
sample3  sample1    0.0
         sample2    0.0
         sample3    0.0
dtype: float64

In [24]: res[res!=0]
Out[24]:
sample
sample1  sample3    0.8
sample2  sample3    0.8
dtype: float64

Upvotes: 5

Kartik

Reputation: 8683

If you want to use Pandas, the following answer will help. I am assuming you will figure out how to read your matrix files into Pandas yourself, and that your columns and rows are labelled correctly, with all the row names as the DataFrame index. What you end up with after reading the data is a DataFrame that looks a lot like the matrix at the top of your question. I take it as my starting point that you have read the data into a variable called df.

Pandas stores a DataFrame column-wise, so operating on a whole column at once is efficient. So we loop over the columns and filter the rows of each one in a single vectorised step.

pairs = {}
for col in df.columns:
    pairs[col] = df[(df[col] >= 0.8) & (df[col] < 1)].index.tolist()
    # If row names are not an index, but a different column named 'names' run the following line, instead of the line above
    # pairs[col] = df[(df[col] >= 0.8) & (df[col] < 1)]['names'].tolist()

Alternatively, you can use apply() to do this, because that too will loop over all columns. (Maybe in 0.17 it will release the GIL for faster results, I do not know because I have not tried it.)
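A minimal sketch of that apply() variant (the toy frame below is an assumption standing in for your data; since the matching lists have different lengths, apply returns a Series of lists):

```python
import pandas as pd

# Toy frame mirroring the question's matrix; row labels are the index
df = pd.DataFrame(
    [[1.0, 0.7, 0.8], [0.7, 1.0, 0.8], [0.8, 0.8, 1.0]],
    index=['sample1', 'sample2', 'sample3'],
    columns=['sample1', 'sample2', 'sample3'],
)

# apply() walks the columns; each call returns the matching row labels
pairs = df.apply(lambda col: col[(col >= 0.8) & (col < 1)].index.tolist()).to_dict()
# pairs -> {'sample1': ['sample3'], 'sample2': ['sample3'],
#           'sample3': ['sample1', 'sample2']}
```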

pairs will now contain the column name as key and, as value, a list of the row names where the correlation is at least 0.8 but less than 1.

If you also want to extract the correlation values from the DataFrame, replace .tolist() with .to_dict(). Called on the filtered column, .to_dict() generates a dict with the index as key and the value as value: {index -> value}. So, ultimately, your pairs will look like {column -> {index -> value}}. It will also be guaranteed free of nan, since the filter has already removed them. Note that .to_dict() will only give you the row names you want if your index contains them; otherwise you will get the default index, which is just numbers.
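As a sketch of that .to_dict() variant (the toy frame below is an assumption standing in for your data), filtering the column first and then converting:

```python
import pandas as pd

df = pd.DataFrame(
    [[1.0, 0.7, 0.8], [0.7, 1.0, 0.8], [0.8, 0.8, 1.0]],
    index=['sample1', 'sample2', 'sample3'],
    columns=['sample1', 'sample2', 'sample3'],
)

pairs = {}
for col in df.columns:
    # .to_dict() on the filtered column keeps the values: {row_label: value}
    pairs[col] = df[col][(df[col] >= 0.8) & (df[col] < 1)].to_dict()
# pairs -> {'sample1': {'sample3': 0.8}, 'sample2': {'sample3': 0.8},
#           'sample3': {'sample1': 0.8, 'sample2': 0.8}}
```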


P.S. If your file is huge, I would recommend reading it in chunks. In that case, the piece of code above is repeated for each chunk, so it should go inside the loop that iterates over chunks. You will then have to be careful to append the new data coming from each chunk to pairs. The following links are for your reference:

  1. Pandas I/O docs
  2. Pandas read_csv() function
  3. SO question on chunked read

You might also want to read reference 1 for other types of I/O supported by Pandas.
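A rough sketch of the chunked version (reading from an in-memory buffer here as a stand-in for your large file; chunksize is tiny purely for the demonstration and would be much larger in practice):

```python
import io
import pandas as pd

# Stand-in for a huge whitespace-delimited matrix file
data = io.StringIO(
    "sample sample1 sample2 sample3\n"
    "sample1 1 0.7 0.8\n"
    "sample2 0.7 1 0.8\n"
    "sample3 0.8 0.8 1\n"
)

pairs = {}
# Each chunk is a DataFrame holding the next `chunksize` rows
for chunk in pd.read_csv(data, sep=r'\s+', index_col=0, chunksize=2):
    for col in chunk.columns:
        hits = chunk[(chunk[col] >= 0.8) & (chunk[col] < 1)].index.tolist()
        # Append to any hits collected from earlier chunks
        pairs.setdefault(col, []).extend(hits)
# pairs -> {'sample1': ['sample3'], 'sample2': ['sample3'],
#           'sample3': ['sample1', 'sample2']}
```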

Upvotes: 3

Greg Reda

Reputation: 1854

pandas' read_table can handle regular expressions in the sep parameter.

In [19]: !head file.txt
sample  sample1 sample2 sample3
sample1 1   0.7 0.8
sample2 0.7 1   0.8
sample3 0.8 0.8 1

In [20]: df = pd.read_table('file.txt', sep='\s+')

In [21]: df
Out[21]:
    sample  sample1  sample2  sample3
0  sample1      1.0      0.7      0.8
1  sample2      0.7      1.0      0.8
2  sample3      0.8      0.8      1.0

From there, you can filter on all values >= 0.8.

In [23]: df[df >= 0.8]
Out[23]:
    sample  sample1  sample2  sample3
0  sample1      1.0      NaN      0.8
1  sample2      NaN      1.0      0.8
2  sample3      0.8      0.8      1.0

Upvotes: 0
