PRO88
PRO88

Reputation: 437

pandas int or float column to percentage distribution

I have a pandas dataframe df:

import pandas as pd
import numpy as np
data = {'A':[250,100,400,np.nan,300]}
df = pd.DataFrame(data)
print(df)

       A
0  250.0
1  100.0
2  400.0
3    NaN
4  300.0

I want to transform this datafarme (DF) based on values in the list (values).

values = [0,200,400,600]

In df, first number 250. It is between 200 and 400 in list values, such that (|200-250|)/(400-200) = 0.25 and (400-250)/(400-200)=0.75,respectively. If data is missing (np.nan) then row must be filled with 0. I want to convert this respresent this dataframe in this manner.

Desired dataframe:

     0   200   400  600
0  0.0  0.25  0.75  0.0
1  0.5  0.50  0.00  0.0
2  0.0  0.00  1.00  0.0
3  0.0  0.00  0.00  0.0
4  0.0  0.50  0.50  0.0

Upvotes: 5

Views: 617

Answers (2)

user3483203
user3483203

Reputation: 51155

Setup

There is quite a bit to do here. I wanted to come up with a completely "vectorized" method, but the overhead may make this not the most performant solution for you to use. Make sure you include this setup in any timings before choosing an approach.

u = df['A'].values
v = np.asarray(values)
nan_u = np.isnan(u)
a = np.where(nan_u, 0, u)

d = v[1:] - v[:-1]
diffs = np.concatenate((d[:1], d))
s = np.searchsorted(v, a)

You can use vectorized methods to calculate the ratio if a value fell anywhere, and then mask between valid ranges.

This makes the assumption that all values that are non-null fall between min(values) and max(values). If this requirement is not met then the step where you create the mask m will fail with an index error.


m = np.zeros((u.shape[0], v.shape[0]), dtype='i1')
ar = np.arange(m.shape[0])

m[ar, s] = 1
m[ar, s-1] = 1

f = np.abs((v - a[:, None]) / diffs)

final = f * m
final[nan_u] = 0

final

array([[0.  , 0.25, 0.75, 0.  ],
       [0.5 , 0.5 , 0.  , 0.  ],
       [0.  , 1.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.5 , 0.5 , 0.  ]])

The one step I am not happy about in this answer is where I create the differences:

d = v[1:] - v[:-1]
diffs = np.concatenate((d[:1], d))

You really only have 3 "differences", but since you need 4 for shape reasons, you can append either the first difference or last difference to the beginning or end of the array respectively. It seems very clunky, and there may be a way to avoid this.

Upvotes: 4

BENY
BENY

Reputation: 323306

Here is one way using pd.cut

s=pd.cut(df.A,values).dropna()
x=s.map(lambda x : x.left).astype(int).to_frame('V')
y=s.map(lambda x : x.right).astype(int).to_frame('V')
x['r']=(df.A-x.V)/(y.V-x.V)
y['r']=(y.V-df.A)/(y.V-x.V)
df1=pd.concat([x,y]).set_index('V',append=True).\
       r.unstack(fill_value=0).\
        reindex(columns=values,index=df.index,fill_value=0)
df1
Out[110]: 
V  0     200   400  600
0  0.0  0.25  0.75  0.0
1  0.5  0.50  0.00  0.0
2  0.0  1.00  0.00  0.0
3  0.0  0.00  0.00  0.0
4  0.0  0.50  0.50  0.0

Upvotes: 5

Related Questions