Reputation: 1537
I have a common column ID
among multiple data sets. Now I'm looking for how many times certain ID
appears in these data sets. Even an ID
appears multiple times in one data set, the visit
in the data set is still counted as 1
.
So given m
data sets, for any ID
, the value visit
is between 1
and m
.
Ideal Output:
ID # Visit
222 5
233 5
556 3
...
...
667 1
Data sets: (they don't have same prefix, this is an example).
data1: (# visit of 222
is 1, even it appears twice)
ID col2 col3 ...
21
222
222
...
data5: (# visit of 222
is 1)
ID col87 col12 ...
222
623
126
I don't know how to start with this.It seems like a dictionary traverse.
Upvotes: 0
Views: 413
Reputation: 11
I think the solution below will work (you must provide the list of the datasets you want to compute the statistics on):
%macro macro_answer(dataset_list);
%let ndsets = %sysfunc(countw(&dataset_list., ' '));
/* Create a frequency table for every dataset in the list of input datasets, containing a column with the dataset name */
/* and the count of each ID in this dataset */
%do i = 1 %to &ndsets;
proc sql;
create table __freq_&i. as
select "%scan(&dataset_list., ' ')" as dataset_name
,ID
,count(*) as _count_
from %scan(&dataset_list., ' ')
group by 1,2
;quit;
%end;
/* Append every frequency table create above */
data __all_datasets_freq;
length dataset_name $41.; /*8 for libname + 1 for '.' + 32 for dataset name*/
set
%do i = 1 %to &ndsets;
__freq_&i.;
%end;
run;
/* Drop intermediary tables for frequency of each dataset */
proc sql;
%do i = 1 %to &ndsets.;
drop table __freq_&i.;
%end;
;quit;
%mend macro_answer;
The solution provided by Haikuo Bian probably works and is also simpler to understand but, in my experience, you should always compute statistics on datasets and then stack up all of your tables rather than stack up all of them and calculate statistics on the big table; depending on your machine and the size of the tables, you can expect some performance gains using the former.
Additionally, if your tables resides on a database (Teradata, Oracle etc) and you are working with SAS Pass-Through, you can use you the database Pass-Through the most efficient way, once you don't have to use SAS to append all tables and create on big table before computing statistics. Besides that, with little tweaks on my code, you can loop over a entire lib (or even over a list of libs) with datasets with certain pattern name, making use of the dictionary table dictionary.tables.
Disclaimer: I don't have access to SAS right now, so I can't confirm that the code I provided really works.
Upvotes: 0
Reputation: 63424
Here's a data step approach, assuming all datasets are sorted by ID.
Basically, you compare indsname
value of current row to previous row, and if it changed, increment count by 1. Notice the slightly different than usual approach to resetting count - I do it after the indsname check, and to 1, not to 0 before; that's because in some cases you may have two consecutive IDs with different indsname values. Since we can't use INDSNAME in the BY statement, we can't rely on SAS to "change" the value for us (as we would with a nested by statement) so we have to do it out of order.
data data1;
do id = 1 to 20;
output;
end;
run;
data data2;
do id = 1 to 30 by 2;
output;
end;
run;
data data3;
do id = 1 to 30 by 3;
output;
end;
run;
data want;
set data: indsname=dsn;
by id;
count+ifn(dsn=lag(dsn),0,1);
if first.id then count=1;
if last.id then output;
run;
Upvotes: 0
Reputation: 906
This is not tested, but something along the line should work:
/*Stack up all of your tables, keep 'ID' only*/
data have (keep=ID);
set data: indsname=dsn;
dsname = dsn;
run;
/*Proc SQL to get the job done*/
Proc sql;
create table want as
select ID, count(distinct dsname) as visit from have
group by ID
;
quit;
Upvotes: 4