Saraha
Saraha

Reputation: 146

Bin data based on ranges of borders

I have the following two dataframes:

borders

start   end
25000   30000
85000   90000
105000  110000

... this specifies start & end of borders.

to_bin

start end
3676 4686
24943 25902
25010 26000
29000 31000
51174 52100
54224 54682
58014 59024
91987 92988
117574 119637

... this is my data.

I want to bin my data to_bin based on the borders, so all the values where at least the start is outside of e.g. border 1 should be in bin_1. On the other side, it is also allowed that a start is inside a border, whereas the end is outside. Additionally, one can look at the data as if it were circular, meaning that after the last line of to_bin, it goes back to the first line, therefore these should all be in one bin.

Here's what I expect:

3676 4686   bin_1
24943 25902 bin_1                # only the start is in bin_1
25010 26000 NaN                  # no bin because fully inside border 1
29000 31000 bin_2                # only the end is in bin_2
51174 52100 bin_2
54224 54682 bin_2
58014 59024 bin_2
84900   85500 bin_2              # only the start is in bin_2
91987 92988 bin_3
117574 119637   bin_1            # everything after the last border belongs to the first bin

I only got so far to assign a bin if the data has both start and end outside a border:

columnname = "bin"
for index, row in borders.iterrows():
    if index == 0: # goes into first bin
        data.loc[data['end'] <= borders.iloc[[index]]['start'].item(), columnname] = "bin_1"
    elif index == borders.shape[0]-1: # goes into first bin
        data.loc[data['start'] >= borders.iloc[[index]]['end'].item(), columnname] = "bin_1"
    else:
        name = "bin_" + str(index+1)
        data.loc[(data['start'] >= borders.iloc[[index-1]]['end'].item()) & (data['end'] <= borders.iloc[[index]]['start'].item()), columnname] = name

This gives me:

start   end bin
3676    4686    bin_1
24943   25902   NaN # should be in bin_1 because of start outside border
25010   26000   NaN
51174   52100   bin_2
54224   54682   bin_2
58014   59024   bin_2
91987   92988   NaN # I don't know why this is wrong...
117574  119637  bin_1

EDIT/CORRECTION: I forgot to add the "other way round", ie. that an end can also be outside a border to be assigned a bin.

Upvotes: 0

Views: 325

Answers (2)

Valdi_Bo
Valdi_Bo

Reputation: 30981

Define functions:

  1. getBin to get bin definition from a row of (a little changed) borders:

    def getBin(row):
        return pd.Series([pd.Interval(row.stPrev, row.start, closed='neither'),
            'bin_' + str(row.name + 1)], index=['Range', 'label'])
    

    Details concerning how borders is changed follow later.

  2. getBinName to get the bin name for the current row (if one exists, either for start or for end) or NaN (if no bin found):

    def getBinName(row):
        # First attempt - find bin for start
        bin = bins[bins.index.contains(row.start)].label
        if bin.size > 0: return bin.iloc[0]
        # Second attempt - find bin for end
        bin = bins[bins.index.contains(row.end)].label
        if bin.size > 0: return bin.iloc[0]
        # Nothing found
        return np.nan
    

Start the actual processing with adding stPrev column to borders, containing the previous value of end column or 0 for the first row:

borders['stPrev'] = borders.end.shift().fillna(0, downcast='infer')

(print borders to see the result).

Then create an auxiliary DataFrame containing description of bins:

bins = borders.apply(getBin, axis=1)
bins = bins.append(pd.Series([pd.Interval(borders.end.iloc[-1], to_bin.end.iloc[-1],
    closed='neither'), 'bin_1'], index=bins.columns), ignore_index=True)\
    .set_index('Range')

For your data sample it contains:

                  label
Range                  
(0, 25000)        bin_1
(30000, 85000)    bin_2
(90000, 105000)   bin_3
(110000, 119637)  bin_1

And the last step is to generate a new column - the bin name in to_bin:

to_bin['bin'] = to_bin.apply(getBinName, axis=1)

The result is:

    start     end    bin
0    3676    4686  bin_1
1   24943   25902  bin_1
2   25010   26000    NaN
3   29000   31000  bin_2
4   51174   52100  bin_2
5   54224   54682  bin_2
6   58014   59024  bin_2
7   84900   85500  bin_2
8   91987   92988  bin_3
9  117574  119637  bin_1

Upvotes: 2

kasper
kasper

Reputation: 92

Seems like i didn't understand your problem wholly. Especially why do you use condition for both start and end, when you need:

... all the values where at least the start is outside of e.g. border 1 should be in bin_1.

I made a piece of code with only 'start' condition.

to_bin['bins'] = 'NaN'  # creating new column with 'NaN' values as default
# iterating from last value in borders to the second element
for i in range(len(borders) - 1, 0, -1):  
    # iterating from last value in to_bin to the first element
    for j in range(len(to_bin) - 1, -1, -1):  
        # start value from to_bin must be between borders start/end to be assigned to bin
        if to_bin['start'][j] < borders['start'][i] and to_bin['start'][j] > borders['end'][i-1]:
            to_bin.loc[j, 'bins'] = 'bin_' + str(i+1)
        # first bin that have another (circular) condition
        elif to_bin['start'][j] < borders['start'][0] or to_bin['start'][j] > borders['end'][len(borders) - 1]:
            to_bin.loc[j, 'bins']= 'bin_1'

Result:

    start   end bins
0   3676    4686    bin_1
1   24943   25902   bin_1
2   25010   26000   NaN
3   51174   52100   bin_2
4   54224   54682   bin_2
5   58014   59024   bin_2
6   84900   85500   bin_2
7   91987   92988   bin_3
8   117574  119637  bin_1

Upvotes: -1

Related Questions