Maurice F
Maurice F

Reputation: 25

How to make a percentage matrix in pandas/numpy?

I am struggling to get something like a matrix dataframe in pandas/NumPy.

I have the following dataframe

Value Source Placement First
A GooFeed 1 GooFeed
A CyberCyber 2 GooFeed
A GooFeed 3 GooFeed
B CyberCyber 1 CyberCyber
B GooFeed 2 CyberCyber
C WowFeed 1 WowFeed
C WowFeed 2 WowFeed
C CyberCyber 3 WowFeed
C GooFeed 4 WowFeed
C CyberCyber 5 WowFeed
D GooFeed 1 GooFeed

What I want to achieve is a percentage of the feeds having data from each other. The feeds with the 1 in Placement got the data (value A) first. Sometimes the same values appear in multiple data feeds and so are multiple placements from the same feed at one value, like with value C and CyberCyber. But the values should be unique and not count as 2 times data from WowFeed.

I calculated total values in a separate dataframe like the following:

Source Firsts TotalDistinctValues FirstSources
GooFeed 2 4 {'CyberCyber':1;'WowFeed':1}
CyberCyber 1 3 {'GooFeed':1;'WowFeed':1}
WowFeed 1 1

And the matrix I am looking for is the following:

source GooFeed CyberCyber WowFeed Total
GooFeed 25% 25% 50%
CyberCyber 33% 33% 66%
WowFeed
Average 33% 25% 58%

Average (bottom row) of the column values excluding the value source matching itself. Like GooFeed at GooFeed.

Total (outer row on the right) should be to total of its values from its own distinct values.

I really do not get my hand around how this should work. Am I even right to try it in pandas/NumPy?

I am so glad for every help.

Best regards

Maurice

Upvotes: 1

Views: 930

Answers (2)

Maurice F
Maurice F

Reputation: 25

I managed to work with piterbargs to answer and achieve everything I need. Thank you piterbarg!

For the ones interested I will post the code because I edited it a bit (caution - different naming of column / values).

enter image description here

enter image description here

Upvotes: 1

piterbarg
piterbarg

Reputation: 8219

Need a few steps here. In the first step we drop duplicates to avoid double-counting and pivot:

df2 = pd.pivot_table(df.drop_duplicates(['Value','Source','First']), index = 'Source', columns = 'First',values = 'Value',aggfunc = 'count', fill_value = 0)
df2

we get

First       CyberCyber  GooFeed WowFeed
Source          
CyberCyber  1           1       1
GooFeed     1           2       1
WowFeed     0           0       1

Now we need to translate into proportions and calculate Totals. The somewhat tricky bit is to get rid of the diagonal (this is where np.eye() comes in)

import numpy as np
df3 = df2.div( df2.sum(axis=1), axis=0 )
df4 = df3*(1 - np.eye(len(df3))
df4['Total'] = df4.sum(axis=1)
df4.loc['Total'] = df4.sum(axis=0)
df4

and we get


First       CyberCyber  GooFeed     WowFeed     Total
Source              
CyberCyber  0.00        0.333333    0.333333    0.666667
GooFeed     0.25        0.000000    0.250000    0.500000
WowFeed     0.00        0.000000    0.000000    0.000000
Total       0.25        0.333333    0.583333    1.166667

if you want you can replace 0's with NaNs, say, using df4.replace(0,np.NaN) and manually set the bottom right cell to NaN if that's of consequence

Upvotes: 0

Related Questions