Sashmit Bhaduri

Reputation: 101

Distributing value into multiple bins in pandas

I have two pandas dataframes (actual dataframes are much larger):

events = pd.DataFrame({'Begin':[959.44, 1222.82, 2217.59], 'End':[978.00,1240.41,2799.43]})

markers = pd.DataFrame({'Marker': [0, 256.0, 700, 975.33, 1188.2, 1230.88, 2500, 3120.22]})

I want to distribute the events across the markers, which I'm trying to treat as bin edges, that is, [0, 256.0], [256.0, 700], etc. I'm trying to end up with another column in the markers dataframe that accounts for the total event duration observed during each bin. Each of the events may end up in multiple bins. For example, the 959.44 to 978.00 event should have 15.89 (975.33 - 959.44) counted in the 700-975.33 bin, and the rest, 2.67 (978.00 - 975.33), should be counted in the 975.33-1188.2 bin.

I've been trying to use pandas.cut to bin the markers dataframe, but I'm not sure how to account for multiple bins. Is this the best way to do this?

Upvotes: 0

Views: 83

Answers (1)

Bharath M Shetty

Reputation: 30605

IIUC, you can build an IntervalIndex from the markers to get the ranges, then use get_loc to look up the marker value for each event, i.e.:

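import pandas as pd

# Pair each marker with the previous one to form (Begin, Marker] bins
markers['Begin'] = markers['Marker'].shift()
nm = markers.sort_index(axis=1).dropna()
nm.index = pd.IntervalIndex.from_arrays(nm['Begin'], nm['Marker'])

# Find the bin containing each event's Begin and take that bin's upper edge
events['mark'] = events['Begin'].apply(lambda x: nm.iloc[nm.index.get_loc(x)]['Marker'])
# Distance from the event's Begin to the end of that bin
events['new'] = events['mark'] - events['Begin']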

Output:

    Begin      End     mark     new
0   959.44   978.00   975.33   15.89
1  1222.82  1240.41  1230.88    8.06
2  2217.59  2799.43  2500.00  282.41

Explanation

Create an interval index by shifting Marker and dropping the NaN row, i.e.:

nm.index = pd.IntervalIndex.from_arrays(nm['Begin'], nm['Marker'])
                     Begin   Marker
(0.0, 256.0]          0.00   256.00
(256.0, 700.0]      256.00   700.00
(700.0, 975.33]     700.00   975.33
(975.33, 1188.2]    975.33  1188.20
(1188.2, 1230.88]  1188.20  1230.88
(1230.88, 2500.0]  1230.88  2500.00
(2500.0, 3120.22]  2500.00  3120.22

Look up each event's Begin in the interval index with get_loc to get the position of the matching bin, then take the Marker value at that position, i.e.:

    Begin      End     mark
0   959.44   978.00   975.33
1  1222.82  1240.41  1230.88
2  2217.59  2799.43  2500.00
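
As a small illustration of the lookup step, here is a minimal sketch of how an IntervalIndex maps a value to a bin position. It builds the index with IntervalIndex.from_breaks instead of from_arrays, and the shortened edges list and the names idx and pos are only for this example. The intervals are closed on the right by default, so a value outside every bin has no match.

```python
import pandas as pd

# Shortened list of marker values, used only for this illustration
edges = [0, 256.0, 700, 975.33]

# Consecutive edges become the bins (0, 256], (256, 700], (700, 975.33]
idx = pd.IntervalIndex.from_breaks(edges)

# get_loc returns the position of the bin that contains a single value
pos = idx.get_loc(959.44)

# get_indexer does the same for many values at once; -1 means "no bin"
positions = idx.get_indexer([959.44, 100.0, 5000.0])
```

For many events, get_indexer avoids the per-row apply, at the cost of having to handle the -1 entries yourself.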

Finally, subtract Begin from mark to get the new column. Note that this gives only the part of each event that falls in its first bin.
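
If each event needs to be split across every bin it touches, as the question asks, one possible approach is to compute the overlap between each event and each bin directly. This is a sketch using NumPy broadcasting rather than the IntervalIndex lookup above; the names edges, lo, hi, overlap, bins and the Total column are made up for this example, and it builds a full events-by-bins matrix, so it assumes that matrix fits in memory.

```python
import numpy as np
import pandas as pd

events = pd.DataFrame({'Begin': [959.44, 1222.82, 2217.59],
                       'End': [978.00, 1240.41, 2799.43]})
markers = pd.DataFrame({'Marker': [0, 256.0, 700, 975.33, 1188.2,
                                   1230.88, 2500, 3120.22]})

# Consecutive markers are the lower and upper edges of each bin
edges = markers['Marker'].to_numpy()
lo, hi = edges[:-1], edges[1:]

# Column vectors, so each event broadcasts against every bin
begin = events['Begin'].to_numpy()[:, None]
end = events['End'].to_numpy()[:, None]

# Overlap of event i with bin j: min(End, hi) - max(Begin, lo), floored at 0
overlap = np.clip(np.minimum(end, hi) - np.maximum(begin, lo), 0, None)

# Sum over events to get the total time observed in each bin
bins = pd.DataFrame({'Begin': lo, 'Marker': hi, 'Total': overlap.sum(axis=0)})
print(bins)
```

For the sample data, the 959.44 to 978.00 event contributes 15.89 to the 700-975.33 bin and 2.67 to the 975.33-1188.2 bin. If a running total is wanted, bins['Total'].cumsum() can be added as another column.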

Hope it helps.

Upvotes: 1
