Reputation: 2886
Using
cell = pd.read_csv('test_cell.txt',header=2,sep='\t')
I create a pandas DataFrame object that looks like this:
Name Description LN18 22RV1 DU145
0 100009676_at LOC100009676 1 2 3
1 10000_at AKT3 4 5 6
2 10001_at MED6 7 8 9
3 NaN NaN NaN NaN NaN
I'd like to do some matrix multiplication (outer product) with another matrix that has similar rows. For example:
patients =
id Pat_1 Pat_2 Pat_3
0 MED6 1 1 1
1 LOC100009676 2 2 2
2 ABCD 3 3 3
These two matrices have some similar row names -- gene symbols. Since I'm familiar with iterating over dictionaries with keys, how can I establish a "key" column (to column 1 - "Description") with my first pandas DataFrame cell
such that it matches my other DataFrame patients
gene symbol keys?
I'd like to perform my outerproduct across "MED6","LOC100009676", and so on, and add each gene score down a each patient:cell comparison. The final data structure is a 1D matrix of 9 entries (given two 3x3 matricies as above).
dot_prod_total = sum(np.outer(cell[key], patient[key]).flat for key in cell)
dot_prod_total =
Pat_1&LN18 Pat_1&22RV1 Pat_1&DU145 Pat_2&LN18 ...
8 10 12 16
Thanks to user @Ivc who helped me out with the dot_prod_total numpy generator comprehension.
Upvotes: 0
Views: 94
Reputation: 1485
I guess what you want to do is pd.merge
on the gene names, then sum the scores for each cell:patient pair. But your data frames are reshaped; pd.stack
is useful for that purpose.
cell_s=cell.set_index(['Description','Name']).stack().reset_index()
cell_s.columns = ['Description', 'Name', 'cell', 's1']
Then, cell_s looks like this:
Description Name cell s1
0 LOC100009676 100009676_at LN18 1
1 LOC100009676 100009676_at 22RV1 2
2 LOC100009676 100009676_at DU145 3
3 AKT3 10000_at LN18 4
4 AKT3 10000_at 22RV1 5
5 AKT3 10000_at DU145 6
6 MED6 10001_at LN18 7
7 MED6 10001_at 22RV1 8
8 MED6 10001_at DU145 9
....
Do it again for patients:
patients_s=patients.set_index('id').stack().reset_index()
patients_s.columns=['id', 'patient', 's2']
Then, join the two and product two scores:
merged=cell_s.merge(patients_s, left_on='Description',right_on='id')
merged['score']=merged.s1*merged.s2
It looks like :
Description Name cell s1 id patient s2 score
0 LOC100009676 100009676_at LN18 1 LOC100009676 Pat_1 2 2
1 LOC100009676 100009676_at LN18 1 LOC100009676 Pat_2 2 2
2 LOC100009676 100009676_at LN18 1 LOC100009676 Pat_3 2 2
3 LOC100009676 100009676_at 22RV1 2 LOC100009676 Pat_1 2 4
4 LOC100009676 100009676_at 22RV1 2 LOC100009676 Pat_2 2 4
5 LOC100009676 100009676_at 22RV1 2 LOC100009676 Pat_3 2 4
6 LOC100009676 100009676_at DU145 3 LOC100009676 Pat_1 2 6
...
Finally, groupby and sum the scores:
scores=merged.groupby(['patient','cell'])['score'].sum()
You get scores
like this:
patient cell
Pat_1 22RV1 12
DU145 15
LN18 9
Pat_2 22RV1 12
DU145 15
LN18 9
Pat_3 22RV1 12
DU145 15
LN18 9
For plotting scores as a histogram, pandas has hist()
methods.
scores = scores.reset_index()
#Plot a histogram with all scores:
scores.hist()
#Plot a histogram with specific cells:
cells = your_map['tissue_1'] #e.g. cells = ['LN18', 'DU145']
scores[scores.cell.isin(cells)].hist()
#or get the score array and input this to your plotting functions
score_array = scores[scores.cell.isin(cells)].score.values
Upvotes: 1