Thomas Matthew
Thomas Matthew

Reputation: 2886

Set DataFrame row index to a different column

The Setup

Using

cell = pd.read_csv('test_cell.txt',header=2,sep='\t')

I create a pandas DataFrame object that looks like this:

           Name   Description  LN18  22RV1  DU145 
0  100009676_at  LOC100009676  1     2      3     
1      10000_at          AKT3  4     5      6    
2      10001_at          MED6  7     8      9  
3           NaN           NaN  NaN   NaN    NaN

I'd like to do some matrix multiplication (outer product) with another matrix that has similar rows. For example:

patients =
       id  Pat_1   Pat_2   Pat_3
    0          MED6    1   1   1
    1  LOC100009676    2   2   2
    2          ABCD    3   3   3

Question

These two matrices have some similar row names -- gene symbols. Since I'm familiar with iterating over dictionaries with keys, how can I establish a "key" column (to column 1 - "Description") with my first pandas DataFrame cell such that it matches my other DataFrame patients gene symbol keys?

Ultimately:

I'd like to perform my outerproduct across "MED6","LOC100009676", and so on, and add each gene score down a each patient:cell comparison. The final data structure is a 1D matrix of 9 entries (given two 3x3 matricies as above).

dot_prod_total = sum(np.outer(cell[key], patient[key]).flat for key in cell) 


dot_prod_total =

Pat_1&LN18   Pat_1&22RV1   Pat_1&DU145   Pat_2&LN18   ...
8            10            12            16

Thanks to user @Ivc who helped me out with the dot_prod_total numpy generator comprehension.

Upvotes: 0

Views: 94

Answers (1)

Jihun
Jihun

Reputation: 1485

I guess what you want to do is pd.merge on the gene names, then sum the scores for each cell:patient pair. But your data frames are reshaped; pd.stack is useful for that purpose.

cell_s=cell.set_index(['Description','Name']).stack().reset_index()
cell_s.columns = ['Description', 'Name', 'cell', 's1']    

Then, cell_s looks like this:

    Description          Name   cell  s1
0  LOC100009676  100009676_at   LN18   1
1  LOC100009676  100009676_at  22RV1   2
2  LOC100009676  100009676_at  DU145   3
3          AKT3      10000_at   LN18   4
4          AKT3      10000_at  22RV1   5
5          AKT3      10000_at  DU145   6
6          MED6      10001_at   LN18   7
7          MED6      10001_at  22RV1   8
8          MED6      10001_at  DU145   9
....

Do it again for patients:

patients_s=patients.set_index('id').stack().reset_index()
patients_s.columns=['id', 'patient', 's2']

Then, join the two and product two scores:

merged=cell_s.merge(patients_s, left_on='Description',right_on='id')
merged['score']=merged.s1*merged.s2

It looks like :

     Description          Name   cell  s1            id patient  s2  score
0   LOC100009676  100009676_at   LN18   1  LOC100009676   Pat_1   2      2
1   LOC100009676  100009676_at   LN18   1  LOC100009676   Pat_2   2      2
2   LOC100009676  100009676_at   LN18   1  LOC100009676   Pat_3   2      2
3   LOC100009676  100009676_at  22RV1   2  LOC100009676   Pat_1   2      4
4   LOC100009676  100009676_at  22RV1   2  LOC100009676   Pat_2   2      4
5   LOC100009676  100009676_at  22RV1   2  LOC100009676   Pat_3   2      4
6   LOC100009676  100009676_at  DU145   3  LOC100009676   Pat_1   2      6
...

Finally, groupby and sum the scores:

scores=merged.groupby(['patient','cell'])['score'].sum()

You get scores like this:

patient  cell 
Pat_1    22RV1    12
         DU145    15
         LN18      9
Pat_2    22RV1    12
         DU145    15
         LN18      9
Pat_3    22RV1    12
         DU145    15
         LN18      9

For comment

For plotting scores as a histogram, pandas has hist() methods.

scores = scores.reset_index()
#Plot a histogram with all scores:
scores.hist()
#Plot a histogram with specific cells:
cells = your_map['tissue_1']  #e.g. cells = ['LN18', 'DU145']    
scores[scores.cell.isin(cells)].hist()
#or get the score array and input this to your plotting functions
score_array = scores[scores.cell.isin(cells)].score.values

Upvotes: 1

Related Questions