Roland
Roland

Reputation: 537

Random ordered sampling with replacement in SAS

I have a data set from which I'd like to draw a sample with replacement. When I use proc surveyselect, the samples drawn are in the excact same order as in the original dataset and multiple draws are written below each other.

proc surveyselect data=sashelp.baseball outhits method=urs n=1000 out=mydata;

However, it's important to me that the position in the outtable is sampled as well. Is there an option in proc surveyselect, or am I better off to just sample the rownumber myself and output it, like outlined in this paper,p4?

As a toy example (not in SAS notation), suppose I have a list of values [a, b, c, d] and I draw five times with repetition (and keeping the order of draws):

First a, then c, then a, then b, then c. The result I want is [a, c, a, b, c], but sas only gives output of the type

Upvotes: 3

Views: 332

Answers (2)

Dominic Comtois
Dominic Comtois

Reputation: 10411

So here is a solution which only requires BASE SAS. Minor changes would be needed to allow inclusion of additional columns such as an ID or a DATE, for instance. I don't claim it's the most efficient way to do this. It relies heavily on PROC SQL which is my preference. Having said that, it should produce the results you wish in quite reasonable time.

The length of the generated SQL code justifies the need for a separate sas program. If you don't want to show the whole %included file in the log, just leave out the /source2 option.

Generate Sample Data

data mymatrix;
  input c1 c2 c3 c4 c5;
  datalines;
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
26 27 28 29 30
31 32 33 34 35
36 37 38 39 40
;

Declare Macro %DrawSample

Parameters:

lib = library in which ds is found
ds = table to sample from
out = table to generate
outfile = path/name of the sas program containing the insert strings
n = number of repetitions

%macro DrawSample(lib, ds, out, outfile, n);

  %local nrows ncols cols;

  proc sql;
    /* get number of rows in source table */
    select count(*)
      into :nrows
      from &lib..&ds;

    /* get variable names */
    select name, count(name)
      into :cols separated by " ",
           :ncols
      from dictionary.columns
     where libname = upcase("&lib")
           and memname = upcase("&ds");

  quit;

  data _null_;
    file "&outfile";
    length query $ 256;
    array column(&ncols) $32;
    put "PROC SQL;";
    put " /* create an empty table with same structure */";
    put "  create table &out as";
    put "    select *";
    put "      from &lib..&ds";
    put "        where 1 = 2;";
    put " ";

    do i = 1 to &n;
      %* Randomize column order;
      do j = 1 to &ncols;
        column(j) = scan("&cols", 1 + floor((&ncols)*rand("uniform")));
      end;
      %* Build the query;
      query = cat("  INSERT INTO &out SELECT ", column(1));
      do j = 2 to &ncols;
        query = catx(", ", query, column(j));
      end;
      rownumber = 1 + floor(&nrows * rand("uniform"));
      query = catx(" ", query, "FROM &lib..&ds(firstobs=", rownumber,
                   "obs=", rownumber, ");");
      put query;
    end;
    put "QUIT;";
  run;

  %include "&outfile" / source2;

%mend;

Calling the Macro

%DrawSample(lib=work, ds=mymatrix, out=matrixSample, outfile=myRandomSample.sas, n=1000);

Et voilà!

Upvotes: 1

Longfish
Longfish

Reputation: 7602

Not sure exactly what you're after, but something that may help is to use the option OUTALL instead of OUTHITS. This will create an output dataset the same size as the original, with a selected column to show if the record has been sampled and a numberhits column to show how many times that record has been selected. It won't create a row for each time a record is selected.

You can then select the observation number for all records in the sample.

Upvotes: 0

Related Questions