Reputation: 93
So I'm importing some data from a database into a DataFrame object using Pandas. The data's format is as follows:
time info from to frequency
19:00 ... A X 20
19:00 ... B Z 9
21:00 ... A Y 2
21:00 ... A Z 5
23:55 ... A X 8
Now, I have two problems I need to solve:
Compute the sum of the frequencies of every movement from point to point, independently of time, so that from point A to point X this sum would be 28. As you may guess, "time" and "info" are disposable columns; I do not need them in this situation.
Since I can guarantee that all points in "from" are the same as in "to", I would like to have such sums mentioned above in form of a matrix of some sort.
I was already solving both of these with the following code:
import pandas as pd

def make_matrix(df: pd.DataFrame) -> dict:
    # Get grouped version, discarding time and info...
    grouped = df.groupby(['from', 'to'])['frequency'].sum()
    # Fill dictionary acting as matrix...
    D = {}
    for (_from, _to), freq in grouped.items():
        # Create the inner dict the first time a source point is seen
        if _from not in D:
            D[_from] = {}
        D[_from][_to] = int(freq)
    return D
For context, the first line turns the example DataFrame into:
from  to  frequency
A     X          28
      Y           2
      Z           5
B     Z           9
Thing is, I'm pretty sure there's a better way of doing this, but I can't find it elsewhere in StackOverflow or Google, since this is a pretty particular situation.
Also I'm looking for a better way because this dictionary ends up without the None/0 value for every instance of point X to the same point X.
I was thinking there should be an easier way to merge these columns without them ending up in the format of grouped: pd.Series, since having to iterate over each tuple like (A,X),(A,Y),(A,Z) and such, as well as having to artificially add None to the trivial case of (X,X) in the dictionary, felt very hacky...
Edit 1: I'm adding the desired matrix output... It should be something like this:
     A     B     ...  X   Y   Z
A    null  0     ...  28  2   5
B    0     null  ...  0   0   9
.
.
.
X    0     0     ...  0   0   0
Y    0     0     ...  0   0   0
Z    0     0     ...  0   0   0
Additionally, if there were another tuple such as from X to A with frequency 25, the matrix position M[X][A] would hold 25 instead of 0.
Edit 2: It is possible I'm indexing wrong; it could be the transposed matrix instead of the example one. Either way you get the problem: it is a non-symmetric square matrix.
Upvotes: 3
Views: 159
Reputation: 8508
You can remove all the other columns and just keep from, to, and frequency. Then do a pivot_table on it to get what you want.
import pandas as pd
df = pd.DataFrame({'time'     : ['19:00']*4 + ['20:00']*3 + ['21:00']*4,
                   'from'     : ['A','B','A','A','B','X','Y','Z','X','A','Z'],
                   'to'       : ['X','Z','Y','Z','X','B','A','B','Z','X','Y'],
                   'frequency': [20, 9, 2, 5, 8, 10, 20, 6, 8, 3, 7],
                   'othercols': ['junk']*11})
print(df)
# Keep only the columns needed for the matrix
df = df[['from','to','frequency']]
# aggfunc='sum' (string form) avoids the warning newer pandas emits for the builtin sum
df1 = pd.pivot_table(df, values=['frequency'], index=['from'], columns=['to'], aggfunc='sum', margins=False)
print(df1)
The output of this will be:
Original DataFrame:
     time from to  frequency othercols
0   19:00    A  X         20      junk
1   19:00    B  Z          9      junk
2   19:00    A  Y          2      junk
3   19:00    A  Z          5      junk
4   20:00    B  X          8      junk
5   20:00    X  B         10      junk
6   20:00    Y  A         20      junk
7   21:00    Z  B          6      junk
8   21:00    X  Z          8      junk
9   21:00    A  X          3      junk
10  21:00    Z  Y          7      junk
Result as Expected DataFrame:
     frequency
to           A     B     X    Y    Z
from
A          NaN   NaN  23.0  2.0  5.0
B          NaN   NaN   8.0  NaN  9.0
X          NaN  10.0   NaN  NaN  8.0
Y         20.0   NaN   NaN  NaN  NaN
Z          NaN   6.0   NaN  7.0  NaN
If there is a mapping between a row and a column, and a value exists, it is summed and put in the grid. We have two values from A to X. It sums them up (20 + 3) and puts the result in the grid.
If you want the NaN values to be 0, then you can use:
df1 = pd.pivot_table(df, values=['frequency'], index=['from'], columns=['to'], aggfunc='sum', margins=False).fillna(0).astype(int)
Output will be:
     frequency
to           A   B   X  Y  Z
from
A            0   0  23  2  5
B            0   0   8  0  9
X            0  10   0  0  8
Y           20   0   0  0  0
Z            0   6   0  7  0
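As a possible variation on the above, pivot_table also accepts a fill_value argument, and passing values as a plain string rather than a list avoids the extra "frequency" level on the columns. The following is only a minimal sketch on the same sample data; the name matrix is chosen here for illustration.

```python
import pandas as pd

# Same sample data as above, without the columns that get dropped anyway
df = pd.DataFrame({
    'from':      ['A', 'B', 'A', 'A', 'B', 'X', 'Y', 'Z', 'X', 'A', 'Z'],
    'to':        ['X', 'Z', 'Y', 'Z', 'X', 'B', 'A', 'B', 'Z', 'X', 'Y'],
    'frequency': [20, 9, 2, 5, 8, 10, 20, 6, 8, 3, 7],
})

# values as a string keeps the columns flat; fill_value replaces missing pairs with 0
matrix = df.pivot_table(values='frequency', index='from', columns='to',
                        aggfunc='sum', fill_value=0)
print(matrix)
```

With flat columns, a single cell can be read as matrix.loc['A', 'X'].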
Upvotes: 0
Reputation: 36
(df.groupby(['from', 'to'])['frequency'].sum()
   .unstack(fill_value=0)
)
Try this one (the parentheses are needed so the chained call can span several lines) and I am sure you will get the right output.
Upvotes: 1
Reputation: 150735
You could try:
(df.groupby(['from', 'to'])['frequency'].sum()
.unstack(fill_value=0)
)
Output:
to     X  Y  Z
from
A     28  2  5
B      0  0  9
Now, if you want all the points available as both rows and columns, you can use reindex:
all_cols = sorted(set(df['from']).union(set(df['to'])) )
(df.groupby(['from', 'to'])['frequency'].sum()
.unstack(fill_value=0)
.reindex(all_cols, fill_value=0)
.reindex(all_cols, fill_value=0, axis=1)
)
Output:
to    A  B   X  Y  Z
from
A     0  0  28  2  5
B     0  0   0  0  9
X     0  0   0  0  0
Y     0  0   0  0  0
Z     0  0   0  0  0
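The question's desired output also has null on the diagonal (a point to itself). One way this could be sketched, assuming NaN is an acceptable stand-in for null, is to mask the diagonal with a boolean identity matrix. The names all_points and matrix are chosen here for illustration, and the sample rows are the ones from the question with the info column left out.

```python
import numpy as np
import pandas as pd

# Sample rows from the question (the "info" column is omitted)
df = pd.DataFrame({
    'time':      ['19:00', '19:00', '21:00', '21:00', '23:55'],
    'from':      ['A', 'B', 'A', 'A', 'A'],
    'to':        ['X', 'Z', 'Y', 'Z', 'X'],
    'frequency': [20, 9, 2, 5, 8],
})

# Every point that appears as a source or a destination
all_points = sorted(set(df['from']) | set(df['to']))

# Square matrix of summed frequencies, 0 where no movement exists
matrix = (df.groupby(['from', 'to'])['frequency'].sum()
            .unstack(fill_value=0)
            .reindex(index=all_points, columns=all_points, fill_value=0))

# Convert to float so NaN can be stored, then blank out the diagonal
matrix = matrix.astype(float).mask(np.eye(len(all_points), dtype=bool))
print(matrix)
```

Note that the values become floats, since NaN cannot be stored in a plain integer column.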
Upvotes: 1