TylerNG

Reputation: 941

Pandas converting df into matrix by conditions

Is it possible to convert a df into a matrix like the following? Given df:

Name Value
x    5
x    2
x    3
x    3
y    3
y    2
z    4

The matrix would be:

Name    1    2    3   4   5   
x       4    4    3   1   1
y       2    2    1   0   0
z       1    1    1   1   0

Here's the logic behind it:

Name    1    2    3    4    5   (5 columns, since 5 is the max in Value)
--------------------------------------------------------------------
x       4 (x has 4 values >= 1)    4 (x has 4 values >= 2)    3 (x has 3 values >= 3)    1 (x has 1 value >= 4)      1 (x has 1 value >= 5)
y       2 (y has 2 values >= 1)    2 (y has 2 values >= 2)    1 (y has 1 value >= 3)     0 (y has no values >= 4)    0 (y has no values >= 5)
z       1 (z has 1 value >= 1)     1 (z has 1 value >= 2)     1 (z has 1 value >= 3)     1 (z has 1 value >= 4)      0 (z has no values >= 5)

Let me know if this makes sense.
I know I have to use sort, group, and count but couldn't figure out how to set up the matrix.

Thank you!!!

Upvotes: 4

Views: 490

Answers (5)

BENY

Reputation: 323316

This is a good question. I would use pd.cut; note that it also works for floats :-)

df['G']=pd.cut(df.Value,list(range(df.Value.max()+1)),labels=list(range(1,df.Value.max()+1)))

df1=df.groupby(['Name','G']).count().sort_index(level='G',ascending=False).\
         groupby(level='Name').cumsum().\
             Value.unstack().bfill(axis=1).fillna(0)
df1
Out[398]: 
G       1    2    3    4    5
Name                         
x     4.0  4.0  3.0  1.0  1.0
y     2.0  2.0  1.0  0.0  0.0
z     1.0  1.0  1.0  1.0  0.0
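
To see how pd.cut treats non-integer input, here is a minimal sketch. The sample values are made up for illustration and are not from the question. With the default right-closed bins, each value lands in the bin labelled with the next integer at or above it.

```python
import pandas as pd

# Hypothetical float values, chosen only to illustrate the binning
values = pd.Series([0.5, 1.2, 2.0])

# Bin edges 0, 1, 2, 3 give the intervals (0, 1], (1, 2], (2, 3]
edges = list(range(4))
binned = pd.cut(values, edges, labels=[1, 2, 3])

print(binned.tolist())  # [1, 2, 2]
```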

Upvotes: 2

cs95

Reputation: 402814

Probably the fastest solution, using numpy's broadcasting -

i = np.arange(1, df.Value.max() + 1)
j = df.Value.values[:, None] >= i

df = pd.DataFrame(j, columns=i, index=df.Name).sum(level=0)

        1    2    3    4    5
Name                         
x     4.0  4.0  3.0  1.0  1.0
y     2.0  2.0  1.0  0.0  0.0
z     1.0  1.0  1.0  1.0  0.0

Caveat: In exchange for performance, this is a somewhat memory-hungry method, since it materializes a boolean array of shape (len(df), df.Value.max()). For large data, it may result in a memory blowout, so use with discretion.


Details

Create a range of values, from 1 to df.Value.max() -

i = np.arange(1, df.Value.max() + 1)
i
array([1, 2, 3, 4, 5])

Perform a broadcasted comparison between df.Value and i -

j = df.Value.values[:, None] >= i
j

array([[ True,  True,  True,  True,  True],
       [ True,  True, False, False, False],
       [ True,  True,  True, False, False],
       [ True,  True,  True, False, False],
       [ True,  True,  True, False, False],
       [ True,  True, False, False, False],
       [ True,  True,  True,  True, False]], dtype=bool)

Load this into a dataframe, and perform a grouped sum by df.Name to get your final result.

k = pd.DataFrame(j, columns=i, index=df.Name)
k
         1     2      3      4      5
Name                                 
x     True  True   True   True   True
x     True  True  False  False  False
x     True  True   True  False  False
x     True  True   True  False  False
y     True  True   True  False  False
y     True  True  False  False  False
z     True  True   True   True  False
k.sum(level=0)

        1    2    3    4    5
Name                         
x     4.0  4.0  3.0  1.0  1.0
y     2.0  2.0  1.0  0.0  0.0
z     1.0  1.0  1.0  1.0  0.0

If you need to convert the result to integers, call .astype(int) -

k.sum(level=0).astype(int)

      1  2  3  4  5
Name               
x     4  4  3  1  1
y     2  2  1  0  0
z     1  1  1  1  0
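
On recent pandas versions the level argument of sum is no longer available (to my knowledge it was deprecated in 1.3 and removed in 2.0). A sketch of the same grouped sum written with groupby(level=0), rebuilding the question's data so that it runs on its own:

```python
import numpy as np
import pandas as pd

# Data from the question
df = pd.DataFrame({"Name": list("xxxxyyz"), "Value": [5, 2, 3, 3, 3, 2, 4]})

# Thresholds 1..max, then a broadcasted (rows x thresholds) comparison
i = np.arange(1, df["Value"].max() + 1)
j = df["Value"].to_numpy()[:, None] >= i

# Label rows by Name and sum the boolean flags within each Name
k = pd.DataFrame(j, columns=i, index=df["Name"])
result = k.groupby(level=0).sum()
print(result)
```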

Upvotes: 8

Brad Solomon

Reputation: 40908

Here's a way to go about this with groupby:

def get_counts(frame, idx):
    idx = np.arange(1, idx+1)[::-1]
    vc = frame['Value'].value_counts().reindex(idx)
    return vc.cumsum().ffill().sort_index().fillna(0.).astype(int)

idx = df['Value'].max()
print(df.groupby('Name').apply(lambda f: get_counts(f, idx)))

Value  1  2  3  4  5
Name                
x      4  4  3  1  1
y      2  2  1  0  0
z      1  1  1  1  0

This builds what is essentially a "helper function" that gets applied to each sub-frame of your groupby object.
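
To make the helper's behaviour concrete, here is a small sketch that calls it on a single sub-frame by hand. The frame sub is built here for illustration and holds just the "y" rows from the question, with 5 passed in as the overall maximum.

```python
import numpy as np
import pandas as pd

def get_counts(frame, idx):
    # Thresholds in descending order, so cumsum accumulates from the top down
    idx = np.arange(1, idx + 1)[::-1]
    vc = frame["Value"].value_counts().reindex(idx)
    return vc.cumsum().ffill().sort_index().fillna(0.).astype(int)

# The "y" rows from the question; 5 is the max of Value over the whole df
sub = pd.DataFrame({"Name": ["y", "y"], "Value": [3, 2]})
counts = get_counts(sub, 5)
print(counts.tolist())  # [2, 2, 1, 0, 0]
```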

Upvotes: 2

rpanai

Reputation: 13447

Not sure if this is the best way, but you can try something like:

import pandas as pd
import numpy as np

df = pd.DataFrame({"Name":["x","x","x","x","y","y","z"],
                  "Value":[5,2,3,3,3,2,4]})

mv = df["Value"].max()
out=[]
for i in range(mv):
    out.append(df.groupby("Name").apply(lambda x : len(x[x["Value"]>=i+1])))

df2  = pd.concat(out, axis=1)
df2.columns = np.arange(1,mv+1)
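
The same idea can be sketched without apply by grouping a boolean mask for each threshold. This is a variant rather than the code above, and the name cols is introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["x", "x", "x", "x", "y", "y", "z"],
                   "Value": [5, 2, 3, 3, 3, 2, 4]})

mv = df["Value"].max()

# For each threshold k, count the rows per Name whose Value is >= k
cols = {k: (df["Value"] >= k).groupby(df["Name"]).sum()
        for k in range(1, mv + 1)}

# The dict keys (1..mv) become the column labels
df2 = pd.concat(cols, axis=1)
print(df2)
```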

Upvotes: 3

DSM

Reputation: 353359

This isn't the prettiest, but should work:

d2 = df.pivot_table(index="Name", columns="Value", aggfunc=len)
d2 = d2.reindex(range(1, df["Value"].max()+1), axis=1).fillna(0)
d2 = d2.iloc[:, ::-1].cumsum(axis=1).iloc[:, ::-1]

gives me

In [115]: d2
Out[115]: 
Value    1    2    3    4    5
Name                          
x      4.0  4.0  3.0  1.0  1.0
y      2.0  2.0  1.0  0.0  0.0
z      1.0  1.0  1.0  1.0  0.0

where the repeated .iloc[:, ::-1] is just to get the cumulative sum to occur right-to-left.
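
The reverse, accumulate, reverse-back trick can be seen in isolation with a plain numpy sketch. The array below holds the per-value counts for "x" from the question (Value 2 once, 3 twice, 5 once); the variable names are mine.

```python
import numpy as np

# Counts of "x" rows for Values 1..5, in that order
counts = np.array([0, 1, 2, 0, 1])

# Reverse, accumulate, reverse back: each slot then holds the
# number of rows whose Value is >= that slot's Value
at_least = np.cumsum(counts[::-1])[::-1]
print(at_least.tolist())  # [4, 4, 3, 1, 1]
```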

Upvotes: 4
