Reputation: 105
I am using Proc HPBIN to split my data into equally-spaced buckets i.e. each bucket has an equal proportion of the total range of the variable.
My issue is when I have extremely skewed data with a large range. Almost all of my datapoints lie in one bucket while there is a couple of observations scattered around the extremes.
I'm wondering if there is a way to force PROC HPBIN to consider the proportion of values in each bin and make sure there is at least e.g. 5% of observations in a bin and to group others?
DATA var1;
DO VAR1 = 1 TO 100;
OUTPUT;
END;
DO VAR1 = 500 TO 505;
OUTPUT;
END;
DO VAR1 = 7000 TO 7015;
OUTPUT;
END;
DO VAR1 = 1000000 TO 1000010;
OUTPUT;
END;
RUN;
/*Use proc hpbin to generate bins of equal width*/
ODS EXCLUDE ALL;
ODS OUTPUT
Mapping = bin_width_results;
PROC HPBIN
DATA=var1
numbin = 15
bucket;
input VAR1 / numbin = 15;
RUN;
ODS EXCLUDE NONE;
Id like to see a way that proc hpbin or other method groups together the bins which are empty and allows at least 5% of proportion per bucket. However, I am not looking to use percentiles in this case (it is another plot on my pdf) because I'd see like to see the spread.
Upvotes: 0
Views: 1330
Reputation: 27508
Quantile
option and 20
bins should give you ~5% per bin
PROC HPBIN DATA=var1 quantile;
input VAR1 / numbin = 20;
RUN;
When the values of a bin need to be dynamically rebinned due overly high proportions in a bin (problem bins) you need to hpbin
only those values in the problem bins. A macro can be written to loop around the HPBIN
process, zooming in on problem areas.
For example:
DATA have;
DO VAR1 = 1 TO 100;
OUTPUT;
END;
DO VAR1 = 500 TO 505;
OUTPUT;
END;
DO VAR1 = 7000 TO 7015;
OUTPUT;
END;
DO VAR1 = 1000000 TO 1000010;
OUTPUT;
END;
RUN;
%macro bin_zoomer (data=, var=, nbins=, rezoom=0.25, zoomlimit=8, out=);
%local data_view step nextstep outbins zoomers;
proc sql;
create view data_zoom1 as
select 1 as step, &var from &data;
quit;
%let step = 1;
%let data_view = data_zoom&step;
%let outbins = bins_step&step;
%bin:
%if &step > &zoomlimit %then %goto done;
ODS EXCLUDE ALL;
ODS OUTPUT Mapping = &outbins;
PROC HPBIN DATA=&data_view bucket ;
id step;
input &var / numbin = &nbins;
RUN;
ODS EXCLUDE NONE;
proc sql noprint;
select count(*) into :zoomers trimmed
from &outbins
where proportion >= &rezoom
;
%put NOTE: &=zoomers;
%if &zoomers = 0 %then %goto done;
%let step = %eval(&step+1);
proc sql;
create view data_zoom&step as
select &step as step, *
from &data_view data
join &outbins bins
on data.&var between bins.LB and bins.UB
and bins.proportion >= &rezoom
;
quit;
%let outbins = bins_step&step;
%let data_view = data_zoom&step;
%goto bin;
%done:
%put NOTE: done @ &=step;
* stack the bins that are non-problem or of final zoom;
* the LB to UB domains from step2+ will discretely cover the bounds
* of the original step1 bins;
data &out;
set
bins_step1-bins_step&step
indsname = source
;
if proportion < &rezoom or source = "bins_step&step";
step = source;
run;
%mend;
options mprint;
%bin_zoomer(data=have, var=var1, nbins=15, out=bins);
Upvotes: 1
Reputation: 63424
Have you tried using the WINSOR
method (winsorised binning)? From the documentation:
Winsorized binning is similar to bucket binning except that both tails are cut off to obtain a smooth binning result. This technique is often used to remove outliers during the data preparation stage.
You can specify the WINSORRATE
to impact how it adjusts these tails.
Upvotes: 1