Reputation: 437
I have a pandas dataframe df:
import pandas as pd
import numpy as np
data = {'A':[250,100,400,np.nan,300]}
df = pd.DataFrame(data)
print(df)
A
0 250.0
1 100.0
2 400.0
3 NaN
4 300.0
I want to transform this datafarme (DF) based on values in the list (values).
values = [0,200,400,600]
In df, first number 250. It is between 200 and 400 in list values
, such that (|200-250|)/(400-200) = 0.25 and (400-250)/(400-200)=0.75,respectively. If data is missing (np.nan) then row must be filled with 0. I want to convert this respresent this dataframe in this manner.
Desired dataframe:
0 200 400 600
0 0.0 0.25 0.75 0.0
1 0.5 0.50 0.00 0.0
2 0.0 0.00 1.00 0.0
3 0.0 0.00 0.00 0.0
4 0.0 0.50 0.50 0.0
Upvotes: 5
Views: 617
Reputation: 51155
Setup
There is quite a bit to do here. I wanted to come up with a completely "vectorized" method, but the overhead may make this not the most performant solution for you to use. Make sure you include this setup in any timings before choosing an approach.
u = df['A'].values
v = np.asarray(values)
nan_u = np.isnan(u)
a = np.where(nan_u, 0, u)
d = v[1:] - v[:-1]
diffs = np.concatenate((d[:1], d))
s = np.searchsorted(v, a)
You can use vectorized methods to calculate the ratio if a value fell anywhere, and then mask between valid ranges.
This makes the assumption that all values that are non-null fall between min(values)
and max(values)
. If this requirement is not met then the step where you create the mask m
will fail with an index error.
m = np.zeros((u.shape[0], v.shape[0]), dtype='i1')
ar = np.arange(m.shape[0])
m[ar, s] = 1
m[ar, s-1] = 1
f = np.abs((v - a[:, None]) / diffs)
final = f * m
final[nan_u] = 0
final
array([[0. , 0.25, 0.75, 0. ],
[0.5 , 0.5 , 0. , 0. ],
[0. , 1. , 0. , 0. ],
[0. , 0. , 0. , 0. ],
[0. , 0.5 , 0.5 , 0. ]])
The one step I am not happy about in this answer is where I create the differences:
d = v[1:] - v[:-1]
diffs = np.concatenate((d[:1], d))
You really only have 3 "differences", but since you need 4 for shape reasons, you can append either the first difference or last difference to the beginning or end of the array respectively. It seems very clunky, and there may be a way to avoid this.
Upvotes: 4
Reputation: 323306
Here is one way using pd.cut
s=pd.cut(df.A,values).dropna()
x=s.map(lambda x : x.left).astype(int).to_frame('V')
y=s.map(lambda x : x.right).astype(int).to_frame('V')
x['r']=(df.A-x.V)/(y.V-x.V)
y['r']=(y.V-df.A)/(y.V-x.V)
df1=pd.concat([x,y]).set_index('V',append=True).\
r.unstack(fill_value=0).\
reindex(columns=values,index=df.index,fill_value=0)
df1
Out[110]:
V 0 200 400 600
0 0.0 0.25 0.75 0.0
1 0.5 0.50 0.00 0.0
2 0.0 1.00 0.00 0.0
3 0.0 0.00 0.00 0.0
4 0.0 0.50 0.50 0.0
Upvotes: 5