Reputation: 15
Im a new SAS User and I have a small problem
I have one large empty table A with lets say 100 columns that I have created with a simple proc sql; create table
I have another table B with lets say 40 columns and table C with 55 columns. I want to add these two tables into table A, basically I want a table with 100 columns containing the data from table B & C and I'm doing this with a Union command.
Since I dont have values for all 100 variables I have to set default values. Lets say I have a column named nourishment in table A, food in table B and has no equivalent in table C. I have rules like "If the data comes from table B then value =xxx if its from table C then Value="DefaultValue"
I'd do this easily with R or python but Im struggling with sas.
I'm using SAS sql commands (a Union command)
How do you set default values ? (for all data types : character, numeric or dateI'm using SAS sql commands )
Upvotes: 1
Views: 1664
Reputation: 51601
Why would you create an empty dataset? What is it going to be used for? Perhaps you want to use it as a default structure definition? If so and you want to stack B and C and get them in the structure defined by A you could code this way.
data want ;
set a(obs=0) b c ;
run;
Not sure what the purpose would be to have default values. Couldn't you use formats if you want missing values to display in special ways?
Or you could create code to default values and perhaps just %include it or wrap the logic into a macro. So it you had a code file name 'defaults.sas' with lines like this.
startdate=coalesce(startdate,'01JAN2013'd);
gender=coalescec(gender,'UNKNOWN');
Then your little program to make a new dataset that looks like A and uses the data from B and C would look like this.
data want ;
set a(obs=0) b c ;
%include 'defaults.sas';
run;
If you really did want to aggregate the records into some large dataset then perhaps you want to use PROC APPEND
to add the records once they are created in the right structure.
proc append data=want base=a ;
run;
Upvotes: 0
Reputation: 27508
NFN:
A DATA Step approach for stacking data is the simplest. Use SET to stack the data and array processing to apply your defaults. For example:
data stacked_data;
set
TARGET_TEMPLATE (obs = 0)
ONE
TWO
;
array allchar _character_;
array allnum _numeric_;
array dates d1-d5;
do over allchar; if missing(allchar) then allchar = '*UNKNOWN*'; end;
do over allnum; if missing(allnum) then allnum = -995; end;
do over dates; if missing(dates) then dates='01NOV1971'd; end;
run;
A subtle issue is that any missing values in ONE or TWO will be replaced with the default value.
In Proc SQL you will want to create a single row table containing the default values for A. That table can be joined to the union of B and C. The join select will involve coalesce() in order to choose the predefined default value when a column is not from B or C.
For example, suppose you have an empty (zero rows), richly columned, target table (your A) acting as a template:
data TARGET_TEMPLATE;
length _n_ 8;
length a1-a5 $25 d1-d5 4 x1-x20 y1-y20 z1-z20 p1-p20 q1-q20 r1-r20 8;
call missing (of _all_);
format d1-d5 yymmdd10.;
stop;
run;
Because Proc SQL does not provide syntax for a default constraint you need to create a table of your own defaults. This is probably easiest with DATA Step:
data TARGET_DEFAULTS;
if 0 then set TARGET_TEMPLATE (obs=0); * prep pdv to match TARGET;
array allchar _character_ (1000 * '*UNKNOWN*');
array allnum _numeric_ (1000 * -995);
array d d1-d5 (5 * '01NOV1971'd); * override the allnum array initialization;
output;
stop;
run;
Here is some generated demo data, ONE and TWO, that correspond to your B and C:
data ONE;
if 0 then set TARGET_TEMPLATE (obs=0); * prep pdv of demo data to match TARGET;
do _n_ = 1 to 100;
array a a1 a3 a5;
array num x: y: z:;
array d d1 d2;
do over a; a = catx (' ', 'ONE', _n_, _i_); end;
do over num; num = 1000 + _n_ + _i_; end;
retain foodate '01jan1975'd;
do over d; d=foodate; foodate+1; end;
output;
end;
keep a1 a3 a5 x: y: z: d1 d2; * keep the disparate columns that were populated;
run;
data TWO;
if 0 then set TARGET_TEMPLATE (obs=0); * prep pdv of demo data to match TARGET;
do _n_ = 1 to 200;
array a a1 a2 a3;
array num x5 y5 z5 p: q: r:;
array d d1 d2;
do over a; a = catx (' ', 'TWO', _n_, _i_); end;
do over num; num = 20000 + _n_*10 + _i_; end;
retain foodate '01jan1985'd;
do over d; d=foodate; foodate+1; end;
output;
end;
keep a1 a2 a3 x5 y5 z5 p: q: r:; * keep the disparate columns that were populated;
run;
A stacking of A, B and C is simple SQL but does not introduce target specific default values:
proc sql noprint;
* generic UNION stack with SAS missing values (space and dot) for cells
* where ONE and TWO did not contribute any data;
create table stacked_data as
select * from have_data_TEMPLATE %*** empty template first ensures template column order and formats are honored in output data;
outer union corresponding %*** align by column name, do not remove duplicates;
select * from ONE
outer union corresponding
select * from TWO
;
When the stacking is put in a sub-query, it can be joined with the defaults. The choosing of the target default value for each column involves examining DICTIONARY.COLUMNS and generating the SQL source for selecting the coalescence of stack and default.
proc sql noprint;
* codegen select items ;
select cat('coalesce(STACK.',trim(name),',DEFAULT.',trim(name),') as ',trim(name))
into :coalesces separated by ','
from DICTIONARY.COLUMNS
where libname = 'WORK' and memname = 'HAVE_DATA_TEMPLATE' %* dictionary lib and mem name values are always uppercase;
order by npos
;
create table stacked_data_with_defaults as
select * from TARGET_TEMPLATE %*** output honors template;
outer union corresponding
select
source
, &coalesces %*** apply codegen;
from
(
select * from WORK.have_data_TEMPLATE %*** ensure fully columned sub-select that will align with coalesces;
outer union corresponding
select 'one' as source, * from ONE
outer union corresponding
select 'two' as source, * from TWO
) as STACK
join
TARGET_DEFAULTS as DEFAULT
on 1=1
;
quit;
Upvotes: 0
Reputation: 63424
I would not recommend using UNION
in PROC SQL
at any point in your SAS usage. UNION
is almost always inferior to a simple data step, or a data step view.
That's because the data step seamlessly handles issues like differing variables on different tables. SAS is quite comfortable with vertically combining datasets; SQL is always a bit trickier when they're not identical.
data c;
set a b;
run;
That runs whether or not a
and b
are identical, so long as a
and b
don't have conflicting variable names (that aren't intended to be in the same column); and if they do, just use the rename
dataset option to resolve it.
If you do as the above, and don't use union
, you'll get a missing value automatically for those dates.
Upvotes: 0
Reputation: 27508
. as MyColumnName
SAS can deal with missing values.
Using a specially coded value, such as 'NA', to represent a missing value condition can work but may lead to headaches and extra coding. Recommended read in SAS help: "Working with Missing Values"
The default SAS missing value for numerics (which also includes dates) is period.
. as MyColumnName
SAS also has 27 special missing values for numerics that are expressed as . < character >
.A as MyColumnName
...
.Z as MyColumnName
._ as MyColumnName
The missing value for character variables is a single space
' '
'' empty quote string also works
' ' as does a longer empty string
Rule of thumb: be consistent when coding your missing values.
You can use OPTIONS MISSING to specify what character is shown when a missing value is printed.
OPTIONS MISSING = '*'; * My special representation of missing for this report;
Proc PRINT data=myData;
run;
OPTIONS MISSING = '.'; * Restore to the default;
SAS custom formats can also be used to customize what is printed for missing values.
Proc FORMAT;
value MissingN
. = 'N/A'
.N = 'Special N/A different than regular N/A' /* for .N */
;
value $MissingC
' ' = 'N/A'
;
value SillyChristmasStocking
.C = 'Bad'
.O = 'children'
.A = 'get'
.L = 'No toys'
;
The token after the value
keyword can be any new valid SAS name that you want to used for your format name.
Proc PRINT data=myData;
format myColumnName MissingN.;
format name $MissingC.;
format behaviour SillyChristmasStocking.;
run;
As for your character missing value conditions, I would continue to use " "
or ' '
You mention UNION which is a SQL feature. In SQL, JOIN also occur, perhaps more often then UNION. When JOINing and values from two source columns collide, you will want to use either COALESCE() function or CASE statements to select the non-missing value.
Upvotes: 0
Reputation: 1792
Dates in SAS are actually just numeric values. Often they have a date format applied to make them readable.
So you could just assign a missing value by default like so:
. as ColumnName
or any default date like so
'17NOV2017'd as ColumnName
Upvotes: 1