user13132640
user13132640

Reputation: 349

Accessing binned data in another dataframe?

I have a two dataframes, generally structured as below:

enter image description here

The binned values in df2 correspond to the same variable as "A" in df1, but is based on a larger dataset, and provides values for a different variable, "C." I would like to adjust the values in "B" based on the correct value of "C" in df2. For example, the first row's "B" of 15.3 would be adjusted by the value 1.5 from "C" since its "A" falls in the bin [2,4).

I'm not sure how to access rows in the binned df though, since I can't directly query the index with the exact values in df1's "A."

Any tips on how to access this data would be greatly appreciated.

EDIT: My apologies, I now added a code example:

The setup:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'A':[4.4, 3.6, 9.2, 3.4], 'B':[15.3, 10.8, 10.3, 17.0]})

df2 = pd.DataFrame({'A':[0.0, 4.9, 9.3, 4.5, 2.9, 3.2, 1.0, 6.7, 8.7, 9.8, 3.4, .7, 2.2, 6.5, 3.4, 1.7, 9.4, 10.0],
                    'C':[1.3, 4.3, 4.8, 3.5, 1.7, 2.2, 1.1, 4.9, 5.6, 5.6, 2.5, .6, 1.9, 4.1, 3.6, 2.8, 6.3, 5.9]})

bins = np.arange(df2['A'].min(), df2['A'].max()+2, 2)

df2 = df2.groupby(pd.cut(df2['A'], bins)).mean()

My thinking was then to determine the appropriate bin for each value of A in DF1, and then somehow reference this. The first line works to do this, but the second step (using this bin to query df2) does not:

df1['Bin']=pd.cut(df1['A'], bins)
df1['Product'] = df1['B'] * df2.loc(df1['Bin'])['C']

The error I get from that second line is "TypeError: 'Series' objects are mutable, thus they cannot be hashed."

My desired output is a "product" column which multiples the value in "B" by the correct value in "C" based on which bin the "A" value falls into.

Upvotes: 0

Views: 541

Answers (1)

piterbarg
piterbarg

Reputation: 8229

The idea is for each element in 'A' from df1 to find the (index of) the corresponding bin from df2 and then merge on that. This can probably be done more elegantly but seems to work. I change some column names in df2 eventually hope it is ok. Complete code:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'A':[4.4, 3.6, 9.2, 3.4], 'B':[15.3, 10.8, 10.3, 17.0]})

df2 = pd.DataFrame({'A':[0.0, 4.9, 9.3, 4.5, 2.9, 3.2, 1.0, 6.7, 8.7, 9.8, 3.4, .7, 2.2, 6.5, 3.4, 1.7, 9.4, 10.0],
                    'C':[1.3, 4.3, 4.8, 3.5, 1.7, 2.2, 1.1, 4.9, 5.6, 5.6, 2.5, .6, 1.9, 4.1, 3.6, 2.8, 6.3, 5.9]})

bins = np.arange(df2['A'].min(), df2['A'].max()+2, 2)
df3 = df2.groupby(pd.cut(df2['A'], bins)).mean()

# Here I am resetting the index on df3 and renaming the bin column to bins
df3 = df3.rename_axis('bins').reset_index()

# The main calc of matching the bin. 
match_indx = [[a in interval for interval in df3['bins']].index(True) for a in df1['A']]

# stick in into df1
df1['bin_index'] = match_indx

# merge
df4 = df1.merge(df3, left_on = 'bin_index', right_index=True)
df4

now we get df4 looking like this


    A_x B   bin_index   bins    A_y     C
0   4.4 15.3    2   (4.0, 6.0]  4.70    3.90
1   3.6 10.8    1   (2.0, 4.0]  3.02    2.38
3   3.4 17.0    1   (2.0, 4.0]  3.02    2.38
2   9.2 10.3    4   (8.0, 10.0] 9.44    5.64

Now we can do your calc. Column A from df1 got renamed to A_x (you can rename it back if you want, I did not bother)

df4['Product'] = df4['B'] * df4['C']

The match logic is easier understood in the traditional loop format

match_indx = []
for a in df1['A']:
    # binary_mask will have True or False depending on whether a is in df3['bins'][i] or not
    binary_mask = []
    for interval in df3['bins']:
        binary_mask.append(a in interval)
    # find the first (and only) item that is True in binary_mask. Return its index
    index = binary_mask.index(True)
    match_indx.append(index)

Upvotes: 1

Related Questions