Reputation: 3117
I am trying to figure out how to populate a "fake" column by choosing randomly from another table column.
So far this was easy using an array and the rantbl()
function as there were not a lot of modalities.
data want;
set have;
array values[2] $10 _temporary_ ('NO','YES');
value=values[rantbl(0,0.5,0.5)];
array start_dates[4] _temporary_ (1735689600,1780358400,1798848000,1798848000);
START_DATE=start_dates[rantbl(0,0.25,0.25,0.25,0.25)];
format START_DATE datetime20.;
run;
However, my question is what happens if there are, for example, more than 150 modalities in the other table? Hence, is there a way to put into an array all the modalities that are in another table ? Or better, to populate the new "fake" column with modalities from another table's column with regards to the modalities's distribution in the other table ?
Upvotes: 1
Views: 139
Reputation: 63424
There are probably a dozen good ways to do this, which one being ideal depending on various details of your data - in particular, how performance sensitive this is.
The most SASsy way to do this, I would say, is to use PROC SURVEYSELECT. This generates a random sample of the size you want, and then merges it on. It is not the fastest way, but it is very easy to understand and is fast-ish as long as you aren't talking humungous data sizes.
data _null_;
set sashelp.cars nobs=nobs_cars;
call symputx('nobs_cars',nobs_Cars);
stop;
run;
proc surveyselect data=sashelp.class sampsize=&nobs_Cars out=names(keep=name)
seed=7 method=urs outhits outorder=random;
run;
data want;
merge sashelp.cars names;
run;
In this example, we are taking the dataset sashelp.cars
, and appending an owner's name to each car, which we choose at random from the dataset sashelp.class
.
What we're doing here is first determining how many records we need - the number of observations in the to-be-merged-to dataset. This step can be skipped if you know that already, but it takes basically zero time no matter what the dataset size.
Second, we use proc surveyselect
to generate the random list. We use method=urs
to ask for simple random sampling with replacement, meaning we take 428 (in this case) separate pulls, each time every row being equally likely to be chosen. We use outhits
and outorder=random
to get a dataset with one row per desired output dataset row and in a random order (without outhits
it gives one row per input dataset row, and a number of times sampled variable, and without outrandom
it gives them in sorted order). sampsize
is used with our created macro variable that stores the number of observations in the eventual output dataset.
Third, we do a side by side merge (with no by
statement, intentionally). Please note that in some installations, options mergenoby
is set to give a warning or error for this particular usage; if so you may need to do this slightly differently, though it is easy to do so using two set
statements (set sashelp.cars; set names;
) to achieve the identical results.
Upvotes: 2
Reputation: 4937
I'm not entirely sure, but here's how I interpret your request and how I would solve it.
You have a table one
. You want to create a new data set want
with an additional column. This column should have values that are sampled from a pool of values given in yet another data set two
in column y
. You want too simulate the new column in the want
data set according to the distribution of y
in the two
data set.
So, in the example below, there should be a .5 change of simulating y = 3 and .25 for 1 and 2 respectively.
I think the way to go is not using arrays at all. See if this helps you.
data one;
do x = 1 to 1e4;
output;
end;
run;
data two;
input y;
datalines;
1
2
3
3
;
data want;
set one;
p = ceil(rand('uniform')*n);
set two(keep = y) nobs = n point = p;
run;
To verify that the new column resembles the distribution from the two
data set:
proc freq data = want;
tables y / nocum;
run;
Upvotes: 3