Dante

Reputation: 133

Boxplot by groups, plus a user-defined scatter plot (markers for a subset of values)

Working with lab data, I want to overlay a subset of data points on a boxplot grouped by treatment and sequenced by timepoint. Bringing all elements together is not straightforward in SAS, and requires a clever approach that I can't devise or find myself :)

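The beauty of the desired plot is that it displays 2 distinct types of outliers: statistical outliers (the points the boxplot itself flags) and clinical outliers (values outside the normal range).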

This is difficult when grouping data (e.g., by treatment) and then blocking or categorizing by another variable (e.g., a timepoint). SAS internally determines the spacing of the boxplots, so this spacing is difficult to mimic for the overlaid normal-range data markers. A generic solution in this direction would be an unreliable kludge.

Below, I've demoed the approach of manually mimicking the group separation for the overlay markers, just to give an idea of the intent. As expected, the normal-range outliers do not line up with the boxplot groups. Plus, data points that meet both outlier criteria (statistical and clinical) appear as separate points, rather than as single points with overlaid markers. My annotations are in green:

[Image: SGPLOT-overlay-fail]

Is there an easy, robust way to instruct SAS to overlay grouped data points on a boxplot, keeping everything aligned as intended?

Here's the code to reproduce that miss:

proc sql;
  create table labstruct
    (  mygroup         char(3) label='Treatment Group'
     , myvisitnum      num     label='Visit number'
     , myvisitname     char(8) label='Visit name'
     , labtestname     char(8) label='Name of lab test'
     , labseed         num     label='Lab measurement seed'
     , lablow          num     label='Low end of normal range'
     , labhigh         num     label='High end of normal range'
    )
  ;
  insert into labstruct
    values('A', 1,  'Day 1',  'Test XYZ', 48, 40, 60)
    values('A', 5,  'Week 1', 'Test XYZ', 50, 40, 60)
    values('A', 10, 'Week 2', 'Test XYZ', 52, 40, 60)
    values('B', 1,  'Day 1',  'Test XYZ', 52, 40, 60)
    values('B', 5,  'Week 1', 'Test XYZ', 50, 40, 60)
    values('B', 10, 'Week 2', 'Test XYZ', 48, 40, 60)
  ;
quit;

data labdata;
  set labstruct;

  * Put normal range outliers on 2nd axis, manually separate groups on 2nd axis *;
  select (mygroup);
    when ('A') scatternum = myvisitnum - 1;
    when ('B') scatternum = myvisitnum + 1;
    otherwise;
  end;

  * Make more obs from the seeds above *;
  label labvalue = 'Lab measurement';
  do repeat = 1 to 20;
    labvalue = labseed + 6*rannor(3297);

    * Scatter plot ONLY normal range outliers *;
    if labvalue < lablow or labvalue > labhigh 
       then scattervalue = labvalue;
    else scattervalue = .;

    output;
  end;
  drop repeat labseed;
run;

proc sgplot data=labdata;
  block x=myvisitnum block=myvisitname / 
        nofill 
        lineattrs=(color=lightgray);
  vbox labvalue / 
       category=myvisitnum
       group=mygroup
       outlierattrs=(symbol=square);
  scatter x=scatternum y=scattervalue /
       group=mygroup
       x2axis
       jitter;
  x2axis display=none;
  keylegend / position=bottom type=marker;
run;

Upvotes: 4

Views: 2268

Answers (2)

Dante

Reputation: 133

Thanks for the insights! I was stuck on the same disconnect between the boxplot's discrete axis and the scatter plot's linear axis. It turns out that with SAS 9.4, scatter plots can handle "categories" like the VBOX does, but SAS refers to this as the x variable rather than a category. This SAS 9.4 example also helped crack it for me (as soon as I'd given up, naturally :).

This is pretty close, and leaves most processing to SAS (always my preference for a robust solution):

[Image: resulting plot]

The updated code: The "category" from the VBOX is the "x" for the SCATTER. Note that the default CLUSTERWIDTH values for VBOX and SCATTER differ (0.7 and 0.85, respectively), so I have to set them explicitly to the same value:

proc sql;
  create table labstruct
    (  mygroup         char(3) label='Treatment Group'
     , myvisitnum      num     label='Visit number'
     , myvisitname     char(8) label='Visit name'
     , labtestname     char(8) label='Name of lab test'
     , labseed         num     label='Lab measurement seed'
     , lablow          num     label='Low end of normal range'
     , labhigh         num     label='High end of normal range'
    )
  ;
  insert into labstruct
    values('A', 1,  'Day 1',  'Test XYZ', 48, 40, 60)
    values('A', 5,  'Week 1', 'Test XYZ', 50, 40, 60)
    values('A', 10, 'Week 2', 'Test XYZ', 52, 40, 60)
    values('B', 1,  'Day 1',  'Test XYZ', 52, 40, 60)
    values('B', 5,  'Week 1', 'Test XYZ', 50, 40, 60)
    values('B', 10, 'Week 2', 'Test XYZ', 48, 40, 60)
  ;
quit;

data labdata;
  set labstruct;

  * Make more obs from the seeds above *;
  label labvalue = 'Lab measurement';
  do repeat = 1 to 20;
    labvalue = labseed + 6*rannor(3297);

    * Scatter plot ONLY normal range outliers *;
    if labvalue < lablow or labvalue > labhigh 
       then scattervalue = labvalue;
    else scattervalue = .;

    output;
  end;
  drop repeat labseed;
run;

proc sgplot data=labdata;
  block x=myvisitnum block=myvisitname / 
        nofill 
        lineattrs=(color=lightgray);
  vbox labvalue / 
       category=myvisitnum
       group=mygroup
       groupdisplay=cluster
       clusterwidth=0.7
       outlierattrs=(symbol=square);
  scatter x=myvisitnum y=scattervalue /
       group=mygroup
       groupdisplay=cluster
       clusterwidth=0.7
       jitter;
  keylegend / 
       position=bottom type=marker;
run;

Thanks, again, for getting me back on track so quickly!

Upvotes: 1

Joe

Reputation: 63424

So - I think there is a solution here, but I'm not sure how general it is. Certainly, it only works for a two element boxplot.

The issue you have right now is that the axis type by default for a scatterplot is linear, not discrete, while a boxplot is by default discrete. This is always going to be messy if you have it set up that way, though you could in theory work out the exact difference and plot it. You could also use the annotate facility, though it will have the same problem.

However, if you set the scatterplot to use a discrete axis, you can use the discreteoffset option to make things line up properly, more or less. Unfortunately, there's no way to use the group on the scatterplot to tell SAS to place each marker on the matching boxplot, so by default everything ends up in the center of the discrete axis. You will therefore need two separate scatter plots here, one for A and one for B, one with a negative offset and one with a positive offset.

The advantage of discreteoffset is that it should be a constant value for any two-group boxplot, unless you make some alteration to the box widths. No matter how big the actual plot is, the discreteoffset amount should be the same, because it is expressed as a fraction of the total width of the block assigned to that value.
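
As a rough check on where such an offset comes from, here is a minimal sketch. It assumes the boxes in a cluster are spaced evenly and that the VBOX cluster width is 0.7 (the default quoted in the other answer); the variable names clusterwidth, ngroups and offset are hypothetical, and the formula is my assumption about the layout rather than something taken from the SAS documentation.

```sas
/* Sketch only: assumes evenly spaced boxes within a cluster and a    */
/* VBOX cluster width of 0.7 (an assumption, not a documented rule).  */
data _null_;
  clusterwidth = 0.7;
  do ngroups = 2 to 3;
    do i = 1 to ngroups;
      /* Center of box i, measured from the category midpoint */
      offset = clusterwidth * (2*i - 1 - ngroups) / (2*ngroups);
      put ngroups= i= offset=;   /* written to the SAS log */
    end;
  end;
run;
```

Under these assumptions, two groups work out to offsets of -0.175 and +0.175, which matches the values used in this answer's code. Three groups work out to 0 and roughly +/- 0.233.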

Some things to consider here include having six elements in your boxplot instead of three (so get rid of group and just have six different visnum values, a_1 b_1 etc.); that would guarantee that each boxplot centered right on the center of the discrete axis (then your scatterplot would have a 0 discrete offset). You also could consider rolling your own boxplot; calculate your own IQR, for example, and then use high-low plots to draw the boxes and draw the whiskers via annotation, then scatterplot all of the different outliers (not just your 'normal' ones).
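
The first idea (one discrete category per group/visit combination) could be sketched roughly as follows. This assumes the labdata data set from the question already exists; the names labdata6 and visitgroup are hypothetical, and the sort is there only because I am assuming the discrete axis follows data order.

```sas
/* Sketch only: build one category per treatment/visit pair. */
data labdata6;
  set labdata;
  length visitgroup $12;                        /* hypothetical variable */
  visitgroup = catx('_', mygroup, myvisitnum);  /* e.g. A_1, B_1, A_5 */
run;

/* Keep the two treatments of each visit next to each other */
proc sort data=labdata6;
  by myvisitnum mygroup;
run;

proc sgplot data=labdata6;
  /* No GROUP= here, so each box sits on the center of its own category */
  vbox labvalue / category=visitgroup outlierattrs=(symbol=square);
  /* Character x variable, so the markers land on the same discrete axis */
  scatter x=visitgroup y=scattervalue;
run;
```

The trade-off is that the group coloring and the block labels per visit are lost, so they would have to be rebuilt some other way.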

Here's the code that seems to work for your specific example, and hopefully would work for most similar cases (with two bars). For 3 bars it's probably easy as well (1 bar has a 0 offset, the other two are probably around +/- 0.25). Beyond that you start having to do more calculations to figure out where the boxes will be, but overall SAS will be pretty good at spacing them out equally, so it'll usually be fairly straightforward.

proc sql;
  create table labstruct
    (  mygroup         char(3) label='Treatment Group'
     , myvisitnum      num     label='Visit number'
     , myvisitname     char(8) label='Visit name'
     , labtestname     char(8) label='Name of lab test'
     , labseed         num     label='Lab measurement seed'
     , lablow          num     label='Low end of normal range'
     , labhigh         num     label='High end of normal range'
    )
  ;
  insert into labstruct
    values('A', 1,  'Day 1',  'Test XYZ', 48, 40, 60)
    values('A', 5,  'Week 1', 'Test XYZ', 50, 40, 60)
    values('A', 10, 'Week 2', 'Test XYZ', 52, 40, 60)
    values('B', 1,  'Day 1',  'Test XYZ', 52, 40, 60)
    values('B', 5,  'Week 1', 'Test XYZ', 50, 40, 60)
    values('B', 10, 'Week 2', 'Test XYZ', 48, 40, 60)
  ;
quit;

data labdata;
  set labstruct;

  * One x variable per group so each group can get its own DISCRETEOFFSET *;
  select (mygroup);
    when ('A') a_scatternum = myvisitnum;  /* Note the separate names now, but no added +/- 1 */
    when ('B') b_scatternum = myvisitnum;
    otherwise;
  end;

  * Make more obs from the seeds above *;
  label labvalue = 'Lab measurement';
  do repeat = 1 to 20;
    labvalue = labseed + 6*rannor(3297);

    * Scatter plot ONLY normal range outliers *;
    if labvalue < lablow or labvalue > labhigh 
       then scattervalue = labvalue;
    else scattervalue = .;

    output;
  end;
  drop repeat labseed;
run;

proc sgplot data=labdata noautolegend;  /* suppress auto-legend */
  block x=myvisitnum block=myvisitname / 
        nofill 
        lineattrs=(color=lightgray);
  vbox labvalue / 
       category=myvisitnum
       group=mygroup
       outlierattrs=(symbol=square) name="boxplot"; /* Name for keylegend */
  scatter x=a_scatternum y=scattervalue /     /* Now you have two of these - and no need for an x2axis */
       group=mygroup discreteoffset=-0.175
        jitter
       ;  
  scatter x=b_scatternum y=scattervalue /
       group=mygroup discreteoffset=0.175
        jitter
       ;
  keylegend "boxplot" / position=bottom type=marker;  /* Needed to make a custom keylegend or else you have a mess with three plots in it */
run;

Upvotes: 3
