mcatis
mcatis

Reputation: 1266

How to use a pandas interval to lookup values, to fill another dataframe

I have two dataframes (df1, df2):

x  id  
35  4   
55  3
92  2
99  5

and

id  x               val
1   (0.0, 50.0]     1.2
2   (90.0, inf]     0.5
3   (0.0, 50.0]     8.9
3  (50.0, 90.0]     9.9
4   (0.0, 50.0]     4.3
4  (50.0, 90.0]     1.1
4   (90.0, inf]     2.9
5  (50.0, 90.0]     3.2
5   (90.0, inf]     5.1

Want to add a new column x_new in the first dataframe, df1, which values depends on the lookup-table from the second dataframe, df2. According to the id and the value of x, there is a special multiplier, to get the new value x_new:

  x  id   x_new
 35  4    35*4.3
 55  3    55*9.9 
 92  2    ...
 99  5    ...

The value ranges in the second dataframe were created with a pandas cut:

df2 = df.groupby(['id', pd.cut(df.x, [0,50,90,np.inf])]).apply(lambda x: np.average(x['var1']/x['var2'], weights=x['var1'])).reset_index(name='val')

My idea is starting with the pandas built in lookup function:

df1['x_new'] = df.lookup(df.index, df['id'])

Don't know how to get it work.

Also see my previous question for more information about the code.

Upvotes: 2

Views: 1732

Answers (1)

Trenton McKinney
Trenton McKinney

Reputation: 62493

  • A value can be found in a pd.Interval
    • 40 in pd.Interval(0.0, 50.0, closed='right') evaluates as True
  • Likewise, if a pd.Interval is in a index, a value passed using .loc, will find the correct interval.
    • df2.loc[(3, 35)] will return 8.9
    • Since df2 is multi-indexed, the values for the index, are passed as a tuple.
    • A KeyError will occur if a value from df1 doesn't exist in the index of df2, so you may need to write a function with try-except.
      • df1_in_df2 = df1[df1.id.isin(df2.index.get_level_values(0))] will find all df1.id in df2.index
import pandas as pd
import numpy as np

# setupt dataframes
df1 = pd.DataFrame({'id': [4, 3, 2, 5], 'x': [35, 55, 92, 99]})
df2 = pd.DataFrame({'id': [1, 2, 3, 3, 4, 4, 4, 5, 5], 'x': [pd.Interval(0.0, 50.0, closed='right'), pd.Interval(90.0, np.inf, closed='right'), pd.Interval(0.0, 50.0, closed='right'), pd.Interval(50.0, 90.0, closed='right'), pd.Interval(0.0, 50.0, closed='right'), pd.Interval(50.0, 90.0, closed='right'), pd.Interval(90.0, np.inf, closed='right'), pd.Interval(50.0, 90.0, closed='right'), pd.Interval(90.0, np.inf, closed='right')], 'val': [1.2, 0.5, 8.9, 9.9, 4.3, 1.1, 2.9, 3.2, 5.1]})

# set id and x as the index of df2
df2 = df2.set_index(['id', 'x'])

# display(df2)
                 val
id x                
1  (0.0, 50.0]   1.2
2  (90.0, inf]   0.5
3  (0.0, 50.0]   8.9
   (50.0, 90.0]  9.9
4  (0.0, 50.0]   4.3
   (50.0, 90.0]  1.1
   (90.0, inf]   2.9
5  (50.0, 90.0]  3.2
   (90.0, inf]   5.1

# use a lambda expression to pass id and x of df1 as index labels to df2 and return val
df1['val'] = df1.apply(lambda x: df2.loc[(x['id'], x['x'])], axis=1)

# multiple x and val to get x_new
df1['x_new'] = df1.x.mul(df1.val)

# display(df1)
   id   x  val  x_new
0   4  35  4.3  150.5
1   3  55  9.9  544.5
2   2  92  0.5   46.0
3   5  99  5.1  504.9

Upvotes: 3

Related Questions