Reputation: 3
My current issue is that I have data in one table, while other table contains value ranges for classifying each column into a category.
Ex: For "City" - value from 0 to 6 is changed to 1, 7-16 to 2 and 17+ to 3.
Ultimately, I will have to use this code on a table of more than 100 columns with a total of 500 value ranges/categories.
I have suitable code for creating the categories and for selecting a column after column, but the main code (reading the condition and applying them) eludes me.
In the following example code test1 contains the raw data, test2 the value ranges for all columns, test3 conditions for selected column.
proc sql noprint;
create table work.test1
(Id char(4),
city num,
country num);
insert into work.test1
values('1639',5,42260)
values('1065',10,38090)
values('1400',15,29769);
create table work.test2
(condition char(7),
g_l char(6),
g_p char(6));
insert into work.test2
values('city',"low","6")
values('city',"7","16")
values('city',"17","high")
values('country',"low","1000")
values('country',"1001","high");
%let zmien = "city";
data work.test3 (where=(condition = &zmien));
set work.test2;
run;
proc sql noprint;
select count(warunek) into :ile_war
from work.test3;
quit;
%let kat = 0; /* place where current category is stored */
%let v_l = 0; /* place where lower border of the category is stored */
%let v_h = 0; /* place where higher border of the category is stored */
%macro kat(ile_war);
My idea was to use a macro do loop to go through all categories for each column. The whole idea works if I do not use a macro (which precludes usage of a loop to my knowledge) and instead of call symput use a simple equations (x = y) in if's.
%macro kat(ile);
%do a=1 %to &ile;
data work.test4;
set work.tesT3 point=a;
%if g_l = "low" %then %do;
call symput('kat',&a);
call symput('war_l',0);
%end;
%if g_l ~= "low" %then %do;
call symput('kat',&a);
call symput('war_l',g_l);
%end;
%if g_p = "high" %then %do;
call symput('war_h',9999999);
%end;
%if g_p ~= "high" %then %do;
call symput('war_h',g_p);
%end;
output;
stop;
data work.test1;
modify work.test1(WHERE=(&zmien BETWEEN &war_l AND &war_h));
&zmien=&kat;
replace;
run;
%end;
%mend;
Any help with the macro or a suggestion of doing it in some other way would be very appreciated.
EDIT: So by trying to use the recommended proc format I hit an issue - it works when I hardcode the ranges and variable/column to be changed, but I do not know how to make it work in case of:
A) column name as a content of macro variable (got an error that the format is either not found or not applicable)
B) ranges being in a dataset
How can I read the values for variable column, insert it into format, use it to categorize the data and then overwrite it to use for different column?
Upvotes: 0
Views: 704
Reputation: 51621
So it sounds like you want to use the format data in TEST2 to convert the values in TEST1 into codes. So for CITY you have three levels so you want to generate values 1,2,3. So you can do that with formats, but if you want the results to be numbers instead of strings then you will need to use an INPUT() function call to convert the formatted value back to a number.
First let's create your sample data using normal SAS code since it is much easier edit to adjust the test data than SQL INSERT statements.
data test1;
input id $ city country ;
cards;
1639 5 42260
1065 10 38090
1400 15 29769
;
data test2;
input condition $ g_l $ g_p $ ;
cards;
city low 6
city 7 16
city 17 high
country low 1000
country 1001 high
;
We can convert TEST2
dataset into a format. We can use a data step to create the data needed to define formats using PROC FORMAT. Let's assume that it is already sorted by CONDITION into the categories you want to create so that we can generate the category numbers. Also I will assume that CONDITION is a valid format name (starts with alpha or underscore and does NOT end in a digit).
data formats ;
length fmtname $32 start end 8 hlo $3 label $32 ;
keep fmtname -- label;
set test2;
by condition notsorted;
if first.condition then row=1;
else row + 1;
fmtname = condition ;
start=input(g_l,??32.);
end=input(g_p,??32.);
if g_l='low' then hlo=cats(hlo,'L');
if g_p='high' then hlo=cats(hlo,'H');
label = left(put(row,32.));
run;
proc format cntlin=formats ;
run;
To use these formats to convert the values into category numbers we need to generate some code. When the list of variables is small enough you can put the code into a single macro variable (max length 64K bytes).
For example if we want to generate new variables with _GRP
suffix for any variable in the input dataset, TEST1
, whose name is in the list of conditions in the metadata table, TEST2
. We could use code like this to generate a macro variable.
proc contents data=test1 out=contents noprint;
run;
proc sql noprint ;
select distinct cats(name,'_grp=input(put(',name,',',name,'.),32.)')
into :recode separated by ';'
from contents
where upcase(name) in (select upcase(condition) from test2)
;
quit ;
For your example the RECODE
macro variable looks like this:
city_grp=input(put(city,city.),32.);
country_grp=input(put(country,country.),32.)
Which you can then use in a date step to create a new dataset from your old one.
data want ;
set test1 ;
&recode;
run;
Results:
country_
Obs id city country city_grp grp
1 1639 5 42260 1 2
2 1065 10 38090 2 2
3 1400 15 29769 2 2
If you have a lot of variables that you are recoding then instead of generating a macro variable you can just write the code to a file instead.
proc sql noprint ;
create table names as
select distinct name
from contents
where upcase(name) in (select upcase(condition) from test2)
;
quit ;
filename code temp;
data _null_;
set names ;
file code ;
put name +(-1) '_grp=input(put(' name ',' name +(-1) '.),32.);' ;
run;
data want ;
set test1 ;
%include code / source2;
run;
You might also want to generate another series of formats that you could use to decode the categories back into descriptions. So for your new CITY_GRP
variable you might want to generate a format CITY_GRP.
that would translate 1
to low - 6
, etc.
data format2 ;
length fmtname $32 start 8 label $50 ;
keep fmtname -- label;
set test2;
by condition notsorted;
if first.condition then row=1;
else row + 1;
fmtname = catx('_',condition,'grp') ;
start=row ;
label = catx(' - ',g_l,g_p);
run;
proc format cntlin=format2; run;
proc print data=want;
format city_grp city_grp. country_grp country_grp.;
run;
Result:
Obs id city country city_grp country_grp
1 1639 5 42260 low - 6 1001 - high
2 1065 10 38090 7 - 16 1001 - high
3 1400 15 29769 7 - 16 1001 - high
Upvotes: 0
Reputation: 21294
Here's how this would work if you used formats. You could automate the further the actual implementation of the formats if desired, but here's the approach I would recommend. You can create IF/THEN if you'd like, but it seems like a lot more work and more finicky to me.
*create formats from the data set, test2;
data createFormats;
set test2;
by condition notsorted;
fmtname = catx('_', condition, 'fmt' );
start = g_l;
end = g_p;
label = catx(" to ", g_l, g_p);
run;
proc format cntlin=createFormats;
run;
title 'Original Data';
proc print data=test1;
run;
*recode into formats;
data new;
set test1;
*this part can be automated via a macro assuming you use consistent naming structure as here;
city_group = put(city, city_fmt.);
country_group = put(country, country_fmt.);
run;
title 'formats applied';
proc print data=new;
run;
*apply formats for display, will be honoured by most procs;
proc datasets lib=work nodetails nolist;
modify test1;
*this could also be automated via a macro;
format city city_fmt. country country_fmt.;
run;quit;
title 'Recoded into new variables';
proc print data=test1;
run;
Upvotes: 1