Reputation: 146
I have the following two dataframes:
borders
start end
25000 30000
85000 90000
105000 110000
... this specifies start & end of borders.
to_bin
start end
3676 4686
24943 25902
25010 26000
29000 31000
51174 52100
54224 54682
58014 59024
91987 92988
117574 119637
... this is my data.
I want to bin my data to_bin
based on the borders, so all the values where at least the start is outside of e.g. border
1 should be in bin_1
. On the other side, it is also allowed that a start is inside a border, whereas the end is outside.
Additionally, one can look at the data as if it were circular, meaning that after the last line of to_bin
, it goes back to the first line, therefore these should all be in one bin.
Here's what I expect:
3676 4686 bin_1
24943 25902 bin_1 # only the start is in bin_1
25010 26000 NaN # no bin because fully inside border 1
29000 31000 bin_2 # only the end is in bin_2
51174 52100 bin_2
54224 54682 bin_2
58014 59024 bin_2
84900 85500 bin_2 # only the start is in bin_2
91987 92988 bin_3
117574 119637 bin_1 # everything after the last border belongs to the first bin
I only got so far to assign a bin if the data has both start and end outside a border:
columnname = "bin"
for index, row in borders.iterrows():
if index == 0: # goes into first bin
data.loc[data['end'] <= borders.iloc[[index]]['start'].item(), columnname] = "bin_1"
elif index == borders.shape[0]-1: # goes into first bin
data.loc[data['start'] >= borders.iloc[[index]]['end'].item(), columnname] = "bin_1"
else:
name = "bin_" + str(index+1)
data.loc[(data['start'] >= borders.iloc[[index-1]]['end'].item()) & (data['end'] <= borders.iloc[[index]]['start'].item()), columnname] = name
This gives me:
start end bin
3676 4686 bin_1
24943 25902 NaN # should be in bin_1 because of start outside border
25010 26000 NaN
51174 52100 bin_2
54224 54682 bin_2
58014 59024 bin_2
91987 92988 NaN # I don't know why this is wrong...
117574 119637 bin_1
EDIT/CORRECTION: I forgot to add the "other way round", ie. that an end can also be outside a border to be assigned a bin
.
Upvotes: 0
Views: 325
Reputation: 30981
Define functions:
getBin to get bin definition from a row of (a little changed) borders:
def getBin(row):
return pd.Series([pd.Interval(row.stPrev, row.start, closed='neither'),
'bin_' + str(row.name + 1)], index=['Range', 'label'])
Details concerning how borders is changed follow later.
getBinName to get the bin name for the current row (if one exists, either for start or for end) or NaN (if no bin found):
def getBinName(row):
# First attempt - find bin for start
bin = bins[bins.index.contains(row.start)].label
if bin.size > 0: return bin.iloc[0]
# Second attempt - find bin for end
bin = bins[bins.index.contains(row.end)].label
if bin.size > 0: return bin.iloc[0]
# Nothing found
return np.nan
Start the actual processing with adding stPrev column to borders, containing the previous value of end column or 0 for the first row:
borders['stPrev'] = borders.end.shift().fillna(0, downcast='infer')
(print borders to see the result).
Then create an auxiliary DataFrame containing description of bins:
bins = borders.apply(getBin, axis=1)
bins = bins.append(pd.Series([pd.Interval(borders.end.iloc[-1], to_bin.end.iloc[-1],
closed='neither'), 'bin_1'], index=bins.columns), ignore_index=True)\
.set_index('Range')
For your data sample it contains:
label
Range
(0, 25000) bin_1
(30000, 85000) bin_2
(90000, 105000) bin_3
(110000, 119637) bin_1
And the last step is to generate a new column - the bin name in to_bin:
to_bin['bin'] = to_bin.apply(getBinName, axis=1)
The result is:
start end bin
0 3676 4686 bin_1
1 24943 25902 bin_1
2 25010 26000 NaN
3 29000 31000 bin_2
4 51174 52100 bin_2
5 54224 54682 bin_2
6 58014 59024 bin_2
7 84900 85500 bin_2
8 91987 92988 bin_3
9 117574 119637 bin_1
Upvotes: 2
Reputation: 92
Seems like i didn't understand your problem wholly. Especially why do you use condition for both start and end, when you need:
... all the values where at least the start is outside of e.g. border 1 should be in bin_1.
I made a piece of code with only 'start' condition.
to_bin['bins'] = 'NaN' # creating new column with 'NaN' values as default
# iterating from last value in borders to the second element
for i in range(len(borders) - 1, 0, -1):
# iterating from last value in to_bin to the first element
for j in range(len(to_bin) - 1, -1, -1):
# start value from to_bin must be between borders start/end to be assigned to bin
if to_bin['start'][j] < borders['start'][i] and to_bin['start'][j] > borders['end'][i-1]:
to_bin.loc[j, 'bins'] = 'bin_' + str(i+1)
# first bin that have another (circular) condition
elif to_bin['start'][j] < borders['start'][0] or to_bin['start'][j] > borders['end'][len(borders) - 1]:
to_bin.loc[j, 'bins']= 'bin_1'
Result:
start end bins
0 3676 4686 bin_1
1 24943 25902 bin_1
2 25010 26000 NaN
3 51174 52100 bin_2
4 54224 54682 bin_2
5 58014 59024 bin_2
6 84900 85500 bin_2
7 91987 92988 bin_3
8 117574 119637 bin_1
Upvotes: -1