Kvothe

Reputation: 93

Python: Attempting to make a matrix out of DataFrame from pandas

So I'm importing some data from a database into a DataFrame object using Pandas. The data's format is as follows:

time   info    from    to    frequency
19:00  ...      A      X        20
19:00  ...      B      Z         9
21:00  ...      A      Y         2
21:00  ...      A      Z         5
23:55  ...      A      X         8

Now, I have two problems I need to solve:

  1. Sum the frequencies of every movement from point to point, regardless of time, so that from point A to point X the sum would be 28. As you may guess, "time" and "info" are disposable columns; I do not need them in this situation.

  2. Since I can guarantee that the points in "from" are the same as the points in "to", I would like to have the sums mentioned above in the form of a matrix of some sort.

I am currently solving both of these with the following code:

import pandas as pd

def make_matrix(df: pd.DataFrame) -> dict:

    # Get grouped version, discarding time and info...
    grouped = df.groupby(['from', 'to'])['frequency'].sum()

    # Fill dictionary acting as matrix...
    D = {}
    for (_from, _to), freq in grouped.items():
        # setdefault creates the inner dict the first time _from is seen,
        # so the first (_from, _to) pair of each origin is not dropped
        D.setdefault(_from, {})[_to] = int(freq)

    return D

For context, the groupby line turns the example DataFrame into:

from    to    frequency
   A     X        28
         Y         2
         Z         5
   B     Z         9

The thing is, I'm pretty sure there's a better way of doing this, but I can't find it on Stack Overflow or Google, since this is a fairly particular situation.

I'm also looking for a better way because this dictionary ends up without a None/0 value for trips from a point to itself (for example, X to X).

I was thinking there should be an easier way to merge these columns without ending up with the grouped pd.Series. Having to iterate over each tuple like (A,X), (A,Y), (A,Z), and having to artificially add None for the trivial case of (X,X) in the dictionary, felt very hacky.

Edit 1: I'm adding the desired matrix output... It should be something like this:

    A     B    ...    X    Y    Z
A  null   0    ...   28    2    5
B   0    null  ...    0    0    9
.
.
.
X   0     0    ...    0    0    0
Y   0     0    ...    0    0    0
Z   0     0    ...    0    0    0

Additionally, if there were another tuple such as from X to A with frequency 25, the matrix position M[X][A] would hold 25 instead of 0.

Edit 2: I may be indexing this wrong, and the result could be the transpose of the example matrix. Either way you get the problem: it is a non-symmetric square matrix.

Upvotes: 3

Views: 159

Answers (3)

Joe Ferndz

Reputation: 8508

You can remove all the columns and just keep from, to, and frequency. Then do a pivot_table on it to get what you want.

import pandas as pd
df = pd.DataFrame ({'time'     :['19:00']*4 + ['20:00']*3 + ['21:00']*4,
                    'from'     :['A','B','A','A','B','X','Y','Z','X','A','Z'],
                    'to'       :['X','Z','Y','Z','X','B','A','B','Z','X','Y'],
                    'frequency':[20 ,  9,  2,  5,  8, 10, 20,  6,  8,  3,  7],
                    'othercols':['junk']*11})

print (df)

df = df[['from','to','frequency']]

df1 = pd.pivot_table(df, values=['frequency'], index=['from'], columns=['to'], aggfunc='sum', margins=False)

print (df1)

The output of this will be:

Original DataFrame:

     time from to  frequency othercols
0   19:00    A  X         20      junk
1   19:00    B  Z          9      junk
2   19:00    A  Y          2      junk
3   19:00    A  Z          5      junk
4   20:00    B  X          8      junk
5   20:00    X  B         10      junk
6   20:00    Y  A         20      junk
7   21:00    Z  B          6      junk
8   21:00    X  Z          8      junk
9   21:00    A  X          3      junk
10  21:00    Z  Y          7      junk

Resulting DataFrame:

     frequency                      
to           A     B     X    Y    Z
from                                
A          NaN   NaN  23.0  2.0  5.0
B          NaN   NaN   8.0  NaN  9.0
X          NaN  10.0   NaN  NaN  8.0
Y         20.0   NaN   NaN  NaN  NaN
Z          NaN   6.0   NaN  7.0  NaN

Whenever a (from, to) pair occurs in the data, its frequencies are summed and placed in the grid. For example, A to X occurs twice, so that cell holds 20 + 3 = 23. Pairs that never occur are left as NaN.

If you want the NaN values to be 0, chain fillna(0).astype(int):

df1 = pd.pivot_table(df, values=['frequency'], index=['from'], columns=['to'], aggfunc='sum', margins=False).fillna(0).astype(int)

Output will be:

     frequency              
to           A   B   X  Y  Z
from                        
A            0   0  23  2  5
B            0   0   8  0  9
X            0  10   0  0  8
Y           20   0   0  0  0
Z            0   6   0  7  0
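As a possible variation on the call above, here is a minimal sketch that passes fill_value=0 to pivot_table directly, and passes 'frequency' as a plain string so the columns stay single-level. It reuses the same sample data; the variable name matrix is just for illustration.

```python
import pandas as pd

# Same sample data as above, keeping only the three relevant columns.
df = pd.DataFrame({
    'from':      ['A', 'B', 'A', 'A', 'B', 'X', 'Y', 'Z', 'X', 'A', 'Z'],
    'to':        ['X', 'Z', 'Y', 'Z', 'X', 'B', 'A', 'B', 'Z', 'X', 'Y'],
    'frequency': [20, 9, 2, 5, 8, 10, 20, 6, 8, 3, 7],
})

# values as a string (not a list) avoids the extra 'frequency' column level;
# fill_value=0 fills the (from, to) pairs that never occur in the data.
matrix = df.pivot_table(values='frequency', index='from', columns='to',
                        aggfunc='sum', fill_value=0)

print(matrix)
print(matrix.loc['A', 'X'])  # sum of the two A -> X rows (20 + 3)
```

With single-level columns, a cell can be looked up as matrix.loc['A', 'X'] rather than through the ('frequency', 'X') column tuple.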

Upvotes: 0

its-akanksha

Reputation: 36

(df.groupby(['from', 'to'])['frequency'].sum()
   .unstack(fill_value=0)
)

Try this; it sums the frequencies per (from, to) pair and then pivots the "to" level into columns, which should give you the output you want.
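To make this easy to run, here is a small self-contained sketch. It assumes the five sample rows from the question (the "info" column is left out because it is not used); the names grouped and result are only for illustration.

```python
import pandas as pd

# Sample rows from the question, without the unused "info" column.
df = pd.DataFrame({
    'time':      ['19:00', '19:00', '21:00', '21:00', '23:55'],
    'from':      ['A', 'B', 'A', 'A', 'A'],
    'to':        ['X', 'Z', 'Y', 'Z', 'X'],
    'frequency': [20, 9, 2, 5, 8],
})

# Step 1: one summed frequency per (from, to) pair, as a Series
# indexed by a two-level MultiIndex.
grouped = df.groupby(['from', 'to'])['frequency'].sum()

# Step 2: move the inner index level ("to") into the columns,
# filling pairs that never occur with 0.
result = grouped.unstack(fill_value=0)

print(result)
```

Note that only points that actually appear in "from" become rows, and only points that appear in "to" become columns, so the result here is 2 x 3 rather than square.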

Upvotes: 1

Quang Hoang

Reputation: 150735

You could try:

(df.groupby(['from', 'to'])['frequency'].sum()
   .unstack(fill_value=0)
)

Output:

to     X  Y  Z
from          
A     28  2  5
B      0  0  9

Now, if you want every point to appear as both a row and a column, you can use reindex:

all_cols = sorted(set(df['from']).union(set(df['to'])) )

(df.groupby(['from', 'to'])['frequency'].sum()
   .unstack(fill_value=0)
   .reindex(all_cols, fill_value=0)
   .reindex(all_cols, fill_value=0, axis=1)
)

Output:

to    A  B   X  Y  Z
from                
A     0  0  28  2  5
B     0  0   0  0  9
X     0  0   0  0  0
Y     0  0   0  0  0
Z     0  0   0  0  0
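The question's Edit 1 also asks for null on the diagonal. One way this might be done, sketched here under the assumption that NaN is an acceptable stand-in for null, is to mask the diagonal after reindexing. The sample rows are the ones from the question, and the names points, matrix and masked are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample rows from the question (only the relevant columns).
df = pd.DataFrame({
    'from':      ['A', 'B', 'A', 'A', 'A'],
    'to':        ['X', 'Z', 'Y', 'Z', 'X'],
    'frequency': [20, 9, 2, 5, 8],
})

# Every point that appears as an origin or a destination.
points = sorted(set(df['from']) | set(df['to']))

# Square matrix of summed frequencies, 0 where a pair never occurs.
matrix = (
    df.groupby(['from', 'to'])['frequency'].sum()
      .unstack(fill_value=0)
      .reindex(index=points, columns=points, fill_value=0)
)

# NaN cannot live in an integer column, so convert to float first,
# then blank out the cells where row label == column label.
diagonal = np.eye(len(points), dtype=bool)
masked = matrix.astype(float).mask(diagonal)

print(masked)
```

The trade-off is that the masked matrix is float-valued; if integer counts matter more than the null diagonal, the unmasked matrix can be used instead.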

Upvotes: 1
