Reputation: 1948
I would like to read a matrix file, something that looks like this:
sample sample1 sample2 sample3
sample1 1 0.7 0.8
sample2 0.7 1 0.8
sample3 0.8 0.8 1
I would like to fetch all the pairs that have a value of 0.8 or higher, e.g.:
sample1,sample3 0.8
sample2,sample3 0.8
and so on, from a large file.
When I use csv.reader, each line is turned into a list, and keeping track of row and column names makes the program clumsy. I would like to know an elegant way of doing this, e.g. with numpy or pandas.
Desired output:
sample1,sample3 0.8
sample2,sample3 0.8
The 1s can be ignored, because a sample compared with itself is always 1.
Upvotes: 3
Views: 3030
Reputation: 1119
You can use scipy.sparse.coo_matrix, since it stores the matrix in a "(row, col) data" format.
from scipy.sparse import coo_matrix
import numpy as np
M = np.array([[1.0, 0.7, 0.8], [0.7, 1.0, 0.8], [0.8, 0.8, 1.0]])
S = coo_matrix(M)
Here, S.row and S.col are arrays of row and column indices, S.data is the array of values at those indices. So you can filter by
idx = S.data >= 0.8
And for instance create a new matrix only with those elements:
S2 = coo_matrix((S.data[idx], (S.row[idx], S.col[idx])))
print(S2)
The output is
(0, 0) 1.0
(0, 2) 0.8
(1, 1) 1.0
(1, 2) 0.8
(2, 0) 0.8
(2, 1) 0.8
(2, 2) 1.0
Note (0,1) does not appear as the value is 0.7.
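To get from those (row, col) indices back to the asker's sample names, one sketch along the same lines (the names list here is an assumption, since the answer above never loads labels):

```python
from scipy.sparse import coo_matrix
import numpy as np

names = ['sample1', 'sample2', 'sample3']  # assumed row/column labels
M = np.array([[1.0, 0.7, 0.8], [0.7, 1.0, 0.8], [0.8, 0.8, 1.0]])
S = coo_matrix(M)

# Keep only the strict upper triangle (row < col) so each pair is
# reported once and the always-1 diagonal is skipped.
keep = (S.data >= 0.8) & (S.row < S.col)
for i, j, v in zip(S.row[keep], S.col[keep], S.data[keep]):
    print('%s,%s %s' % (names[i], names[j], v))
```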
Upvotes: 2
Reputation: 31040
To read it in you need the skipinitialspace and index_col parameters:
a = pd.read_csv('yourfile.txt', sep=' ', skipinitialspace=True, index_col=0)
To get the values pairwise:
[[x, y, round(a[x][y], 3)] for x in a.index for y in a.columns if x != y and a[x][y] >= 0.8][:2]
Gives:
[['sample1', 'sample3', 0.8],
['sample2', 'sample3', 0.8]]
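The comprehension visits every ordered pair, so each match appears twice (once mirrored) before the [:2] slice trims the list. A variant that walks only the upper triangle reports each pair exactly once; a sketch with the frame rebuilt inline so it runs standalone:

```python
import pandas as pd

# Stand-in for the frame the read_csv call above produces.
a = pd.DataFrame(
    [[1.0, 0.7, 0.8], [0.7, 1.0, 0.8], [0.8, 0.8, 1.0]],
    index=['sample1', 'sample2', 'sample3'],
    columns=['sample1', 'sample2', 'sample3'],
)

pairs = [
    [x, y, round(a.loc[x, y], 3)]
    for i, x in enumerate(a.index)
    for y in a.columns[i + 1:]  # only columns to the right of the diagonal
    if a.loc[x, y] >= 0.8
]
print(pairs)
```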
Upvotes: 2
Reputation: 375495
You can mask out the values outside the strict upper triangle with np.triu:
In [11]: df
Out[11]:
sample1 sample2 sample3
sample
sample1 1.0 0.7 0.8
sample2 0.7 1.0 0.8
sample3 0.8 0.8 1.0
In [12]: np.triu(df, 1)
Out[12]:
array([[ 0. , 0.7, 0.8],
[ 0. , 0. , 0.8],
[ 0. , 0. , 0. ]])
In [13]: np.triu(df, 1) >= 0.8
Out[13]:
array([[False, False, True],
[False, False, True],
[False, False, False]], dtype=bool)
Then to extract the index/columns where it's True, I think you have to use np.where*:
In [14]: np.where(np.triu(df, 1) >= 0.8)
Out[14]: (array([0, 1]), array([2, 2]))
This gives you an array of row indices followed by an array of column indices (this is the least efficient part of this numpy version):
In [16]: index, cols = np.where(np.triu(df, 1) >= 0.8)
In [17]: [(df.index[i], df.columns[j], df.iloc[i, j]) for i, j in zip(index, cols)]
Out[17]:
[('sample1', 'sample3', 0.80000000000000004),
('sample2', 'sample3', 0.80000000000000004)]
As desired.
*I may be forgetting an easier way to get this last chunk (Edit: the below pandas code does it, but I think there may be another way too.)
You can use the same trick in pandas but with stack to get the index/columns natively:
In [21]: (np.triu(df, 1) >= 0.8) * df
Out[21]:
sample1 sample2 sample3
sample
sample1 0 0 0.8
sample2 0 0 0.8
sample3 0 0 0.0
In [22]: res = ((np.triu(df, 1) >= 0.8) * df).stack()
In [23]: res
Out[23]:
sample
sample1 sample1 0.0
sample2 0.0
sample3 0.8
sample2 sample1 0.0
sample2 0.0
sample3 0.8
sample3 sample1 0.0
sample2 0.0
sample3 0.0
dtype: float64
In [24]: res[res!=0]
Out[24]:
sample
sample1 sample3 0.8
sample2 sample3 0.8
dtype: float64
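A closely related sketch: instead of multiplying by the mask (which leaves zeros to filter out afterwards), DataFrame.where puts NaN outside the mask, and NaN never passes the >= 0.8 comparison. The toy frame is rebuilt inline so the snippet is self-contained:

```python
import numpy as np
import pandas as pd

# Same toy frame as in the transcript above.
df = pd.DataFrame(
    [[1.0, 0.7, 0.8], [0.7, 1.0, 0.8], [0.8, 0.8, 1.0]],
    index=pd.Index(['sample1', 'sample2', 'sample3'], name='sample'),
    columns=['sample1', 'sample2', 'sample3'],
)

# where() keeps values on the strict upper triangle, NaN elsewhere;
# NaN compares False against >= 0.8, so only real pairs survive.
res = df.where(np.triu(np.ones(df.shape, dtype=bool), k=1)).stack()
print(res[res >= 0.8])
```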
Upvotes: 5
Reputation: 8683
If you want to use Pandas, the following answer will help. I am assuming you will figure out how to read your matrix files into Pandas yourself, and that your columns and rows are labelled correctly, with all row names in the DataFrame index. What you end up with after reading the data is a DataFrame that looks a lot like the matrix at the top of your question. I am taking it that you have read the data into a variable called df as my starting point.
Pandas operates most efficiently on whole columns, so we loop over the columns and filter the rows within each one.
pairs = {}
for col in df.columns:
    pairs[col] = df[(df[col] >= 0.8) & (df[col] < 1)].index.tolist()
    # If row names are not the index but a column named 'names', use this line instead:
    # pairs[col] = df[(df[col] >= 0.8) & (df[col] < 1)]['names'].tolist()
Alternatively, you can use apply() to do this, because that too loops over all columns. (Maybe in 0.17 it will release the GIL for faster results; I do not know, because I have not tried it.)
pairs will now contain the column name as key and, as value, a list of row names where the correlation is at least 0.8 but less than 1.
If you also want to extract the correlation values from the DataFrame, replace .tolist() with .to_dict(). .to_dict() generates a dict mapping index to value: {index -> value}. So ultimately your pairs will look like {column -> {index -> value}}. It is also guaranteed to be free of nan. Note that .to_dict() will only work if your index contains the row names you want; otherwise it will return the default index, which is just numbers.
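Putting that together, a minimal sketch of the {column -> {index -> value}} variant (the toy frame below stands in for the asker's real data):

```python
import pandas as pd

df = pd.DataFrame(
    [[1.0, 0.7, 0.8], [0.7, 1.0, 0.8], [0.8, 0.8, 1.0]],
    index=['sample1', 'sample2', 'sample3'],
    columns=['sample1', 'sample2', 'sample3'],
)

pairs = {}
for col in df.columns:
    # Rows in this column with a correlation of at least 0.8 but below 1.
    matches = df[(df[col] >= 0.8) & (df[col] < 1)]
    pairs[col] = matches[col].to_dict()  # {index -> value}
print(pairs)
```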
Ps. If your file is huge, I would recommend reading it in chunks. In that case the piece of code above is repeated for each chunk, so it should sit inside your loop that iterates over chunks; you will then have to be careful to append the new data from each chunk to pairs. The pandas I/O documentation also covers the other file formats Pandas can read.
Upvotes: 3
Reputation: 1854
pandas' read_table can handle regular expressions in the sep parameter.
In [19]: !head file.txt
sample sample1 sample2 sample3
sample1 1 0.7 0.8
sample2 0.7 1 0.8
sample3 0.8 0.8 1
In [20]: df = pd.read_table('file.txt', sep=r'\s+')
In [21]: df
Out[21]:
sample sample1 sample2 sample3
0 sample1 1.0 0.7 0.8
1 sample2 0.7 1.0 0.8
2 sample3 0.8 0.8 1.0
From there, you can filter on all values >= 0.8.
In [23]: df[df >= 0.8]
Out[23]:
sample sample1 sample2 sample3
0 sample1 1.0 NaN 0.8
1 sample2 NaN 1.0 0.8
2 sample3 0.8 0.8 1.0
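From there it is still a couple of steps to the asker's pair list; one way, sketched here with the file contents inlined so it runs standalone, is to move the sample column into the index, stack, and keep each unordered pair once:

```python
import io
import pandas as pd

# The file from the answer, inlined for a self-contained example.
text = (
    "sample sample1 sample2 sample3\n"
    "sample1 1 0.7 0.8\n"
    "sample2 0.7 1 0.8\n"
    "sample3 0.8 0.8 1\n"
)
df = pd.read_table(io.StringIO(text), sep=r'\s+')

# Flatten to (row, column) pairs, filter, and require row < col so the
# diagonal and the mirrored duplicates are dropped.
s = df.set_index('sample').stack()
s = s[s >= 0.8]
s = s[[row < col for row, col in s.index]]
print(s)
```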
Upvotes: 0