Reputation: 25
I am struggling to get something like a matrix dataframe in pandas/NumPy.
I have the following dataframe
Value | Source | Placement | First |
---|---|---|---|
A | GooFeed | 1 | GooFeed |
A | CyberCyber | 2 | GooFeed |
A | GooFeed | 3 | GooFeed |
B | CyberCyber | 1 | CyberCyber |
B | GooFeed | 2 | CyberCyber |
C | WowFeed | 1 | WowFeed |
C | WowFeed | 2 | WowFeed |
C | CyberCyber | 3 | WowFeed |
C | GooFeed | 4 | WowFeed |
C | CyberCyber | 5 | WowFeed |
D | GooFeed | 1 | GooFeed |
What I want to achieve is a percentage of the feeds having data from each other. The feeds with the 1 in Placement got the data (value A) first. Sometimes the same values appear in multiple data feeds and so are multiple placements from the same feed at one value, like with value C and CyberCyber. But the values should be unique and not count as 2 times data from WowFeed.
I calculated total values in a separate dataframe like the following:
Source | Firsts | TotalDistinctValues | FirstSources |
---|---|---|---|
GooFeed | 2 | 4 | {'CyberCyber':1;'WowFeed':1} |
CyberCyber | 1 | 3 | {'GooFeed':1;'WowFeed':1} |
WowFeed | 1 | 1 |
And the matrix I am looking for is the following:
source | GooFeed | CyberCyber | WowFeed | Total |
---|---|---|---|---|
GooFeed | 25% | 25% | 50% | |
CyberCyber | 33% | 33% | 66% | |
WowFeed | ||||
Average | 33% | 25% | 58% |
Average (bottom row) of the column values excluding the value source matching itself. Like GooFeed at GooFeed.
Total (outer row on the right) should be to total of its values from its own distinct values.
I really do not get my hand around how this should work. Am I even right to try it in pandas/NumPy?
I am so glad for every help.
Best regards
Maurice
Upvotes: 1
Views: 930
Reputation: 25
I managed to work with piterbargs to answer and achieve everything I need. Thank you piterbarg!
For the ones interested I will post the code because I edited it a bit (caution - different naming of column / values).
Upvotes: 1
Reputation: 8219
Need a few steps here. In the first step we drop duplicates to avoid double-counting and pivot:
df2 = pd.pivot_table(df.drop_duplicates(['Value','Source','First']), index = 'Source', columns = 'First',values = 'Value',aggfunc = 'count', fill_value = 0)
df2
we get
First CyberCyber GooFeed WowFeed
Source
CyberCyber 1 1 1
GooFeed 1 2 1
WowFeed 0 0 1
Now we need to translate into proportions and calculate Totals. The somewhat tricky bit is to get rid of the diagonal (this is where np.eye()
comes in)
import numpy as np
df3 = df2.div( df2.sum(axis=1), axis=0 )
df4 = df3*(1 - np.eye(len(df3))
df4['Total'] = df4.sum(axis=1)
df4.loc['Total'] = df4.sum(axis=0)
df4
and we get
First CyberCyber GooFeed WowFeed Total
Source
CyberCyber 0.00 0.333333 0.333333 0.666667
GooFeed 0.25 0.000000 0.250000 0.500000
WowFeed 0.00 0.000000 0.000000 0.000000
Total 0.25 0.333333 0.583333 1.166667
if you want you can replace 0's with NaNs, say, using df4.replace(0,np.NaN)
and manually set the bottom right cell to NaN if that's of consequence
Upvotes: 0