Reputation: 133
Working with lab data, I want to overlay a subset of data points on a boxplot grouped by treatment and sequenced by timepoint. Bringing all elements together is not straightforward in SAS, and requires a clever approach that I can't devise or find myself :)
The beauty of the desired plot is that it displays 2 distinct types of outliers: statistical outliers, which the boxplot already flags, and clinical outliers, i.e. values outside the normal range.
Overlaying the normal-range outliers is difficult when the data are grouped (e.g., by treatment) and then blocked or categorized by another variable (e.g., a timepoint). SAS internally determines the spacing of the boxplots, so this spacing is difficult to mimic for the overlaid normal-range data markers. A generic solution in this direction would be an unreliable kludge.
Below, I've demoed the approach of manually mimicking the group separation for the overlay markers, just to give an idea of intent. As expected, normal range outliers do not line up with the boxplot groups. Plus, data points that meet both outlier criteria (statistical and clinical) appear as separate points, rather than single points with overlaid markers. My annotations in green:
Is there an easy, robust way to instruct SAS to overlay grouped data points on a boxplot, keeping everything aligned as intended?
Here's the code to reproduce that miss:
proc sql;
  create table labstruct
    ( mygroup char(3) label='Treatment Group'
    , myvisitnum num label='Visit number'
    , myvisitname char(8) label='Visit name'
    , labtestname char(8) label='Name of lab test'
    , labseed num label='Lab measurement seed'
    , lablow num label='Low end of normal range'
    , labhigh num label='High end of normal range'
    )
  ;
  insert into labstruct
    values('A', 1, 'Day 1', 'Test XYZ', 48, 40, 60)
    values('A', 5, 'Week 1', 'Test XYZ', 50, 40, 60)
    values('A', 10, 'Week 2', 'Test XYZ', 52, 40, 60)
    values('B', 1, 'Day 1', 'Test XYZ', 52, 40, 60)
    values('B', 5, 'Week 1', 'Test XYZ', 50, 40, 60)
    values('B', 10, 'Week 2', 'Test XYZ', 48, 40, 60)
  ;
quit;

data labdata;
  set labstruct;
  * Put normal range outliers on 2nd axis, manually separate groups on 2nd axis *;
  select (mygroup);
    when ('A') scatternum = myvisitnum - 1;
    when ('B') scatternum = myvisitnum + 1;
    otherwise;
  end;
  * Make more obs from the seeds above *;
  label labvalue = 'Lab measurement';
  do repeat = 1 to 20;
    labvalue = labseed + 6*rannor(3297);
    * Scatter plot ONLY normal range outliers *;
    if labvalue < lablow or labvalue > labhigh
      then scattervalue = labvalue;
      else scattervalue = .;
    output;
  end;
  drop repeat labseed;
run;

proc sgplot data=labdata;
  block x=myvisitnum block=myvisitname /
    nofill
    lineattrs=(color=lightgray);
  vbox labvalue /
    category=myvisitnum
    group=mygroup
    outlierattrs=(symbol=square);
  scatter x=scatternum y=scattervalue /
    group=mygroup
    x2axis
    jitter;
  x2axis display=none;
  keylegend / position=bottom type=marker;
run;
Upvotes: 4
Views: 2268
Reputation: 133
Thanks for the insights! I was stuck on the same disconnect between the boxplot's discrete axis and the scatter plot's linear axis. It turns out that with SAS 9.4, scatter plots can handle "categories" like the vbox, but SAS refers to this as the x-axis rather than a category. This SAS 9.4 example also helped crack it for me (as soon as I'd given up, naturally :).
This is pretty close, and leaves most processing to SAS (always my preference for a robust solution):
The updated code: the "category" from the VBOX is the "x" for the SCATTER. Note that the default cluster widths for VBOX and SCATTER differ (0.7 and 0.85, respectively), so I have to explicitly set them to the same value:
proc sql;
  create table labstruct
    ( mygroup char(3) label='Treatment Group'
    , myvisitnum num label='Visit number'
    , myvisitname char(8) label='Visit name'
    , labtestname char(8) label='Name of lab test'
    , labseed num label='Lab measurement seed'
    , lablow num label='Low end of normal range'
    , labhigh num label='High end of normal range'
    )
  ;
  insert into labstruct
    values('A', 1, 'Day 1', 'Test XYZ', 48, 40, 60)
    values('A', 5, 'Week 1', 'Test XYZ', 50, 40, 60)
    values('A', 10, 'Week 2', 'Test XYZ', 52, 40, 60)
    values('B', 1, 'Day 1', 'Test XYZ', 52, 40, 60)
    values('B', 5, 'Week 1', 'Test XYZ', 50, 40, 60)
    values('B', 10, 'Week 2', 'Test XYZ', 48, 40, 60)
  ;
quit;

data labdata;
  set labstruct;
  * Make more obs from the seeds above *;
  label labvalue = 'Lab measurement';
  do repeat = 1 to 20;
    labvalue = labseed + 6*rannor(3297);
    * Scatter plot ONLY normal range outliers *;
    if labvalue < lablow or labvalue > labhigh
      then scattervalue = labvalue;
      else scattervalue = .;
    output;
  end;
  drop repeat labseed;
run;

proc sgplot data=labdata;
  block x=myvisitnum block=myvisitname /
    nofill
    lineattrs=(color=lightgray);
  vbox labvalue /
    category=myvisitnum
    group=mygroup
    groupdisplay=cluster
    clusterwidth=0.7
    outlierattrs=(symbol=square);
  scatter x=myvisitnum y=scattervalue /
    group=mygroup
    groupdisplay=cluster
    clusterwidth=0.7
    jitter;
  keylegend /
    position=bottom type=marker;
run;
Thanks, again, for getting me back on track so quickly!
Upvotes: 1
Reputation: 63424
So - I think there is a solution here, but I'm not sure how general it is. As written, it only works for a two-element boxplot.
The issue you have right now is that the axis type by default for a scatterplot is linear, not discrete, while a boxplot is by default discrete. This is always going to be messy if you have it set up that way, though you could in theory work out the exact difference and plot it. You could also use the annotate facility, though it will have the same problem.
However, if you set the scatterplot to use a discrete axis, you can use the discreteoffset option to make things line up properly - more or less. Unfortunately, there's no way to use the group on scatterplot to tell SAS to place the appropriate marker on the appropriate boxplot, so by default everything ends up in the center of the discrete axis; so you will need to use two separate plots here, one for a and one for b, one with negative offset and one with positive.
The advantage of discreteoffset is it should be a constant value for any two-group boxplot, unless you make some alteration to the box widths; no matter how big the actual plot is, the discreteoffset amount should be the same (as it's a percentage of the total width of the block assigned for that value).
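As a rough sketch of where such an offset comes from: assume the boxes in a cluster share the cluster width equally, so that group i of n is centred at (i - (n+1)/2) * clusterwidth / n. The 0.7 below is taken as the VBOX default cluster width, the names offsets, ngroups and groupindex are hypothetical, and the even-split rule is an assumption rather than documented SAS behaviour:

```sas
/* Sketch only: estimate DISCRETEOFFSET values for n clustered groups,  */
/* assuming the cluster width is divided evenly among the groups.       */
data offsets;
  clusterwidth = 0.7;  /* assumed VBOX default cluster width */
  ngroups = 2;         /* number of treatment groups per visit */
  do groupindex = 1 to ngroups;
    /* Centre of this group's slot, relative to the category midpoint */
    offset = (groupindex - (ngroups + 1) / 2) * clusterwidth / ngroups;
    output;
  end;
run;

proc print data=offsets noobs;
run;
```

For two groups this gives -0.175 and +0.175. For three groups the same assumption gives about -0.233, 0 and +0.233. Either way, it is worth checking the result against the rendered plot before relying on it.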
Some things to consider here include having six elements in your boxplot instead of three (so get rid of group and just have six different visnum values, a_1 b_1 etc.); that would guarantee that each boxplot centered right on the center of the discrete axis (then your scatterplot would have a 0 discrete offset). You also could consider rolling your own boxplot; calculate your own IQR, for example, and then use high-low plots to draw the boxes and draw the whiskers via annotation, then scatterplot all of the different outliers (not just your 'normal' ones).
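The six-category idea might look something like the sketch below. It assumes the labdata data set from the question. The names labsorted, labdata6 and visitgroup are made up for illustration, and the zero-padded visit number is only there so that the categories sort in visit order. Without a group variable the boxes lose their per-treatment colours, which is the trade-off of this approach.

```sas
/* Sort so that the two treatments sit side by side within each visit */
proc sort data=labdata out=labsorted;
  by myvisitnum mygroup;
run;

/* One category per visit/treatment pair, e.g. 01_A, 01_B, 05_A, ... */
data labdata6;
  set labsorted;
  length visitgroup $8;
  visitgroup = catx('_', put(myvisitnum, z2.), mygroup);
run;

proc sgplot data=labdata6;
  vbox labvalue /
    category=visitgroup
    outlierattrs=(symbol=square);
  /* Same discrete x values as the boxes, so no offset is needed */
  scatter x=visitgroup y=scattervalue /
    jitter;
  xaxis discreteorder=data;  /* keep the sorted data order on the axis */
run;
```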
Here's the code that seems to work for your specific example, and hopefully would work for most similar cases (with two bars). For 3 bars it's probably easy as well (1 bar has a 0 offset, the other two are probably around +/- 0.25). Beyond that you start having to do more calculations to figure out where the boxes will be, but overall SAS will be pretty good at spacing them out equally so it'll usually be fairly straightforward.
proc sql;
  create table labstruct
    ( mygroup char(3) label='Treatment Group'
    , myvisitnum num label='Visit number'
    , myvisitname char(8) label='Visit name'
    , labtestname char(8) label='Name of lab test'
    , labseed num label='Lab measurement seed'
    , lablow num label='Low end of normal range'
    , labhigh num label='High end of normal range'
    )
  ;
  insert into labstruct
    values('A', 1, 'Day 1', 'Test XYZ', 48, 40, 60)
    values('A', 5, 'Week 1', 'Test XYZ', 50, 40, 60)
    values('A', 10, 'Week 2', 'Test XYZ', 52, 40, 60)
    values('B', 1, 'Day 1', 'Test XYZ', 52, 40, 60)
    values('B', 5, 'Week 1', 'Test XYZ', 50, 40, 60)
    values('B', 10, 'Week 2', 'Test XYZ', 48, 40, 60)
  ;
quit;

data labdata;
  set labstruct;
  * Give each group its own scatter x-variable so each can take its own discreteoffset *;
  select (mygroup);
    when ('A') a_scatternum = myvisitnum; /* Note the separate names now, but no added +/- 1 */
    when ('B') b_scatternum = myvisitnum;
    otherwise;
  end;
  * Make more obs from the seeds above *;
  label labvalue = 'Lab measurement';
  do repeat = 1 to 20;
    labvalue = labseed + 6*rannor(3297);
    * Scatter plot ONLY normal range outliers *;
    if labvalue < lablow or labvalue > labhigh
      then scattervalue = labvalue;
      else scattervalue = .;
    output;
  end;
  drop repeat labseed;
run;

proc sgplot data=labdata noautolegend; /* suppress auto-legend */
  block x=myvisitnum block=myvisitname /
    nofill
    lineattrs=(color=lightgray);
  vbox labvalue /
    category=myvisitnum
    group=mygroup
    outlierattrs=(symbol=square) name="boxplot"; /* Name for keylegend */
  scatter x=a_scatternum y=scattervalue / /* Now you have two of these - and no need for an x2axis */
    group=mygroup discreteoffset=-0.175
    jitter
    ;
  scatter x=b_scatternum y=scattervalue /
    group=mygroup discreteoffset=0.175
    jitter
    ;
  keylegend "boxplot" / position=bottom type=marker; /* Needed to make a custom keylegend or else you have a mess with three plots in it */
run;
Upvotes: 3