franvergara66

Reputation: 10784

Time of execution of SAS code

I'm studying for my SAS test and I found the following question:

Given a dataset with 5000 observations, and given the following two datasets (each a subset of the first):

data test1 (keep=cust_id trans_type gender);
  set transaction;
  where trans_type = 'Travel';
run;

data test2;
  set transaction (keep=cust_id trans_type gender);
  where trans_type = 'Travel';
run;

I. Which one of the above two datasets (test1, test2) will take less time to run?

I think both take the same time to run, because they basically execute the same instructions in a different order. Am I right, or does the order of the instructions affect the runtime?

Upvotes: 0

Views: 349

Answers (2)

Joe

Reputation: 63434

The answer the book is looking for is that test2 will be faster, because the two steps differ in what they read and write:

test1: Read in all variables, only write out 3 of them

test2: Read in 3 variables, write out all variables read

SAS can take advantage of the physical dataset structure to read a subset of the variables more efficiently, particularly if those variables are stored consecutively.
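One way to check this on your own data is to turn on detailed timing and compare the two steps in the log. This is only a sketch: it assumes a dataset WORK.transaction exists with the variables named in the question plus some others, and the numbers you see will depend on your system.

```sas
/* Sketch: time the two approaches from the question.               */
/* Assumes WORK.transaction exists and has cust_id, trans_type and  */
/* gender plus other variables.                                     */
options fullstimer;   /* log real time, CPU time and memory per step */

/* KEEP= on the output dataset: every variable is read into the     */
/* program data vector, but only three are written.                 */
data test1(keep=cust_id trans_type gender);
  set transaction;
  where trans_type = 'Travel';
run;

/* KEEP= on the input dataset: only three variables are read,       */
/* and all of them are written.                                     */
data test2;
  set transaction(keep=cust_id trans_type gender);
  where trans_type = 'Travel';
run;

options nofullstimer; /* restore the default timing notes */
```

Compare the "real time" and "cpu time" notes that follow each step in the log. Run each step more than once, since disk caching can make the first run look slower.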

However, in real-world scenarios this may or may not be true, and with a 5000-row dataset in particular you probably won't see any difference between the two. For example:

data class1M;
  set sashelp.class;
  do _i = 1 to 1e6;
    output;
  end;
run;

data test1(keep=name sex);
  set class1M;
run;

data test2;
  set class1M(keep=name sex);
run;

Both of these data steps take the same amount of time. That's likely because the dataset is being read into memory and then bits are being grabbed as needed - a 250MB dataset just isn't big enough for reading fewer variables to pay off.

However, if you add a bunch of other variables:

data class1M;
  set sashelp.class;
  length a b c d e f g h i j k l 8;
  do _i = 1 to 1e6;
    output;
  end;
run;

data test1(keep=name sex);
  set class1M;
run;

data test2;
  set class1M(keep=name sex);
run;

Now test1 takes a lot longer to run than test2. That's because the dataset for test1 no longer fits into memory, so it is read piece by piece, while the data read for test2 still fits in memory. Make it a lot bigger row-wise, say 10M rows, and both test1 and test2 will take a long time, though test2 will be a bit faster.
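The two placements of the option can also be combined when a variable is needed inside the step but not in the result. A minimal sketch, using the class1M dataset built above (test3 is a hypothetical name, and the age cut-off is arbitrary):

```sas
/* Sketch only: test3 is a hypothetical dataset name.               */
data test3(drop=age);               /* age is used below but not written out */
  set class1M(keep=name sex age);   /* read only the variables the step needs */
  if age > 12;                      /* subsetting IF on a variable kept on input */
run;
```

The input-side keep= limits what is read, and the output-side drop= trims what is written.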

Upvotes: 2

Haikuo Bian

Reputation: 906

test2 will take less time to run, as it brings in fewer variables.

Upvotes: 0
