Reputation: 33
Say that my data set has quite a lot of missing/invalid values and I would like to remove (or drop) the entire variable (or column) if it contains too many invalid values.
Take the following example: the variable 'gender' has quite a lot of "#N/A"s. I would like to remove that variable if more than a certain percentage of its data points are "#N/A", say more than 50% or more than 30%.
In addition, I would like to make the percentage configurable, i.e., remove the entire variable if more than x% of the observations under that variable are "#N/A". I also want to be able to define what an invalid value is: it could be "#N/A", "Invalid Value", " ", or anything else that I pre-define.
data dat;
input id score gender $;
cards;
1 10 1
1 10 1
1 9 #N/A
1 9 #N/A
1 9 #N/A
1 8 #N/A
2 9 #N/A
2 8 #N/A
2 9 #N/A
2 9 2
2 10 2
;
run;
Please make the solution as generalized as possible. For example, if the real data set contains thousands of variables, I need to be able to loop through all those variables instead of referencing their variable names one by one. Furthermore, the data set could contain more than just "#N/A" as bad values, other things like ".", "Invalid Obs", "N.A." could also exist at the same time.
PS: Actually, I thought of a way to make this problem easier. We could probably read in all the data points as numeric values, so that all the "#N/A", "N.A.", " " values get turned into ".", which makes the drop criterion simpler. Hope that helps you solve this problem for me ...
Update: below is the code I am working on. Got stuck at the last block.
data dat;
input id $ score $ gender $;
cards;
1 10 1
1 10 1
1 9 #N/A
1 9 #N/A
1 9 #N/A
1 8 #N/A
2 9 #N/A
2 8 #N/A
2 9 #N/A
2 9 2
2 10 2
;
run;
proc contents data=dat out=test0(keep=name type) noprint;
run;
/*A DATA step is used to subset the test0 data set to keep only the character */
/*variables (type=2). A new list of numeric variable names is created from */
/*the character variable names with a "_n" appended to the end of each name. */
data test0;
set test0;
if type=2;
newname=trim(left(name))||"_n";
run;
/*The macro system option SYMBOLGEN is set to be able to see what the macro*/
/*variables resolved to in the SAS log. */
options symbolgen;
/*PROC SQL is used to create three macro variables with the INTO clause. One */
/*macro variable named c_list will contain a list of each character variable */
/*separated by a blank space. The next macro variable named n_list will */
/*contain a list of each new numeric variable separated by a blank space. The */
/*last macro variable named renam_list will contain a list of each new numeric */
/*variable and each character variable separated by an equal sign to be used on*/
/*the RENAME statement. */
proc sql noprint;
select trim(left(name)), trim(left(newname)),
trim(left(newname))||'='||trim(left(name))
into :c_list separated by ' ', :n_list separated by ' ',
:renam_list separated by ' '
from test0;
quit;
/*The DATA step is used to convert the character values to numeric. An ARRAY */
/*statement is used for the list of character variables and another ARRAY for */
/*the list of numeric variables. A DO loop is used to process each variable */
/*to convert the value from character to numeric with the INPUT function. The */
/*DROP statement is used to prevent the character variables from being written */
/*to the output data set, and the RENAME statement is used to rename the new */
/*numeric variable names back to the original character variable names. */
data test2;
set dat;
array ch(*) $ &c_list;
array nu(*) &n_list;
do i = 1 to dim(ch);
/* The ?? modifier suppresses invalid-data notes for values such as "#N/A". */
nu(i)=input(ch(i), ?? 8.);
end;
drop i &c_list;
rename &renam_list;
run;
data test3;
set test2;
array myVars(*) &c_list;
countTotal=1;
do i = 1 to dim(myVars);
myCounter = count(.,myVars(i));
/* if sum(countMissing)/sum(countTotal) lt 0.5 then drop VNAME(myVars(i)); */
end;
run;
The problem, and where I got stuck, is that I am not able to drop the variables that I want to drop. The reason is that I do not want to use the variable names in the DROP statement. Instead, I want it done in a loop where I can reference the variable names with the loop index "i". I tried to use the array reference "myVars(i)", but it doesn't seem to work with the DROP statement.
Upvotes: 2
Views: 2098
Reputation: 63434
In general, you'll find this sort of thing is simpler with the built-in procs; this is SAS's bread and butter. You just need to restate the question.
What you want is to drop variables with a % of missing/bad data higher than 50%, so you need a frequency table of variables, right?
So - use PROC FREQ. This is the simplified version (only looks for "#N/A"), but it should be easy to modify the last step to make it look for other values (and to sum up the percents for them). Or, like you'll see in the linked question (from my comment on the question), you can use a special format that puts all invalid values to one formatted value, and all valid values to another formatted value. (You'll have to construct this format.)
Concept: use PROC FREQ to get a frequency table, then look at that dataset to find the rows where the F_ column holds an invalid value and the percent is above 50.
This won't work with actual missing values (" " or .); you'll need to add the /MISSING option to PROC FREQ if you have those also.
data dat;
input id $ score $ gender $;
cards;
1 10 1
1 10 1
1 9 #N/A
1 9 #N/A
1 9 #N/A
1 8 #N/A
2 9 #N/A
2 8 #N/A
2 9 #N/A
2 9 2
2 10 2
;
run;
*shut off ODS for the moment, and only use ODS OUTPUT, so we do not get a mess in our results window;
ods exclude all;
ods output onewayfreqs=freq_tables;
proc freq data=dat;
tables id score gender;
run;
ods output close;
ods exclude none;
*now we check for variables that match our criteria;
data has_missing;
set freq_tables;
if coalescec(of f_:) ='#N/A' and percent>50;
varname = substr(table,7);
run;
*now we put those into a macro variable to drop;
proc sql noprint;
select varname
into :droplist separated by ' '
from has_missing;
quit;
*and we drop them;
data dat_fixed;
set dat;
drop &droplist.;
run;
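For the format-based variant mentioned above, here is a rough, untested sketch that also makes the cutoff configurable. The format name $badfmt, the macro variable pct_threshold, the dataset names freq_fmt, to_drop and dat_clean, and the list of invalid values are all assumptions for illustration; adjust them to your data.

```sas
/* Hypothetical format: every value treated as invalid maps to 'BAD'. */
proc format;
  value $badfmt
    '#N/A', 'N.A.', 'Invalid Obs', ' ' = 'BAD'
    other = 'OK';
run;

%let pct_threshold = 50;  /* hypothetical cutoff, in percent */
%let droplist = ;         /* start empty so the reference resolves if nothing qualifies */

ods exclude all;
ods output onewayfreqs=freq_fmt;
proc freq data=dat;
  /* _all_ avoids listing variables; MISSING keeps blanks in the percentages */
  tables _all_ / missing;
  format _character_ $badfmt.;
run;
ods exclude none;

/* Keep the rows where the 'BAD' group exceeds the cutoff. */
data to_drop;
  set freq_fmt;
  length varname $32;
  if coalescec(of f_:) = 'BAD' and percent > &pct_threshold;
  varname = scan(table, 2, ' ');  /* Table holds e.g. "Table gender" */
run;

proc sql noprint;
  select varname into :droplist separated by ' '
  from to_drop;
quit;

data dat_clean;
  set dat;
  drop &droplist;
run;
```

This assumes the variables of interest are character, as in the example data; numeric variables would need their own format.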
Upvotes: 0
Reputation: 9569
My understanding is that SAS processes DROP statements during data step compilation, i.e. before it looks at any of the data from any input datasets. Therefore, you cannot use the vname function like that to select variables to drop, as it doesn't evaluate the variable names until the data step has finished compiling and has moved on to execution.
You will need to output a temporary dataset or view containing all your variables, including the ones you don't want, build up a list of variables that you want to drop, in a macro variable, then drop them in a subsequent data step.
Refer to this paper and page 3 in particular for more details of which things run during compilation rather than execution:
http://www.lexjansen.com/nesug/nesug11/ds/ds04.pdf
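As a minimal, untested sketch of that two-step approach, assuming the all-numeric dataset test2 from the question: count the missing values per variable first, build the drop list in a macro variable, and only then drop in a separate data step. The names miss_counts, miss_share and droplist are hypothetical.

```sas
%let miss_share = 0.5;  /* hypothetical cutoff: drop if the missing share exceeds this */

/* One output row: the number of missing values per numeric variable. */
/* _FREQ_ holds the total number of observations.                     */
proc means data=test2 noprint;
  var _numeric_;
  output out=miss_counts(drop=_type_) nmiss=;
run;

/* Build the list of variable names at execution time. */
data _null_;
  set miss_counts;
  length droplist $32767;
  array v(*) _numeric_;  /* includes _FREQ_, which is skipped below */
  do i = 1 to dim(v);
    if upcase(vname(v(i))) ne '_FREQ_' and v(i) / _freq_ > &miss_share then
      droplist = catx(' ', droplist, vname(v(i)));
  end;
  call symputx('droplist', droplist);
run;

/* The macro variable resolves before this step compiles, so DROP sees real names. */
data test3;
  set test2;
  drop &droplist;
run;
```

With the example data, gender is missing in 7 of 11 rows after conversion, so it would be the only variable in the list.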
Upvotes: 1