How to make a percentage matrix in pandas/numpy?

Question

I am struggling to get something like a matrix dataframe in pandas/NumPy.

I have the following dataframe

Value	Source	Placement	First
A	GooFeed	1	GooFeed
A	CyberCyber	2	GooFeed
A	GooFeed	3	GooFeed
B	CyberCyber	1	CyberCyber
B	GooFeed	2	CyberCyber
C	WowFeed	1	WowFeed
C	WowFeed	2	WowFeed
C	CyberCyber	3	WowFeed
C	GooFeed	4	WowFeed
C	CyberCyber	5	WowFeed
D	GooFeed	1	GooFeed

What I want to achieve is a percentage of the feeds having data from each other. The feeds with the 1 in Placement got the data (value A) first. Sometimes the same values appear in multiple data feeds and so are multiple placements from the same feed at one value, like with value C and CyberCyber. But the values should be unique and not count as 2 times data from WowFeed.

I calculated total values in a separate dataframe like the following:

Source	Firsts	TotalDistinctValues	FirstSources
GooFeed	2	4	`{'CyberCyber':1;'WowFeed':1}`
CyberCyber	1	3	`{'GooFeed':1;'WowFeed':1}`
WowFeed	1	1

And the matrix I am looking for is the following:

source	GooFeed	CyberCyber	WowFeed	Total
GooFeed		25%	25%	50%
CyberCyber	33%		33%	66%
WowFeed
Average	33%	25%	58%

Average (bottom row) of the column values excluding the value source matching itself. Like GooFeed at GooFeed.

Total (outer row on the right) should be to total of its values from its own distinct values.

I really do not get my hand around how this should work. Am I even right to try it in pandas/NumPy?

I am so glad for every help.

Best regards

Maurice

piterbarg · Accepted Answer

Need a few steps here. In the first step we drop duplicates to avoid double-counting and pivot:

df2 = pd.pivot_table(df.drop_duplicates(['Value','Source','First']), index = 'Source', columns = 'First',values = 'Value',aggfunc = 'count', fill_value = 0)
df2

we get

First       CyberCyber  GooFeed WowFeed
Source          
CyberCyber  1           1       1
GooFeed     1           2       1
WowFeed     0           0       1

Now we need to translate into proportions and calculate Totals. The somewhat tricky bit is to get rid of the diagonal (this is where np.eye() comes in)

import numpy as np
df3 = df2.div( df2.sum(axis=1), axis=0 )
df4 = df3*(1 - np.eye(len(df3))
df4['Total'] = df4.sum(axis=1)
df4.loc['Total'] = df4.sum(axis=0)
df4

and we get


First       CyberCyber  GooFeed     WowFeed     Total
Source              
CyberCyber  0.00        0.333333    0.333333    0.666667
GooFeed     0.25        0.000000    0.250000    0.500000
WowFeed     0.00        0.000000    0.000000    0.000000
Total       0.25        0.333333    0.583333    1.166667

if you want you can replace 0's with NaNs, say, using df4.replace(0,np.NaN) and manually set the bottom right cell to NaN if that's of consequence

How to make a percentage matrix in pandas/numpy?

Answers (2)

Related Questions