Reputation: 1435
I have observations with column ID, a, b, c, and d. I want to count the number of unique values in columns a, b, c, and d. So:
I want:
I can't figure out how to count distinct within each row; I can do it among multiple rows but within the row by the columns, I don't know.
I used a different method (that is less efficient) that I felt I understood more. I am still going to look into the ways listed below however to learn the correct method. Here is what I did in case anyone was wondering: I created four tables where in each table I created a variable named for example ‘abcd’ and placed a variable under that name.
So it was something like this:
PROC SQL;
CREATE TABLE table1_a AS
SELECT
*
a as abcd
FROM table_I_have_with_all_columns
;
QUIT;
PROC SQL;
CREATE TABLE table2_b AS
SELECT
*
b as abcd
FROM table_I_have_with_all_columns
;
QUIT;
PROC SQL;
CREATE TABLE table3_c AS
SELECT
*
c as abcd
FROM table_I_have_with_all_columns
;
QUIT;
PROC SQL;
CREATE TABLE table4_d AS
SELECT
*
d as abcd
FROM table_I_have_with_all_columns
;
QUIT;
Then I stacked them (this means I have duplicate rows but that's OK, because I just want all of the variables in 1 column and I can do distinct count.
data ALL_STACK;
set
table1_a
table1_b
table1_c
table1_d
;
run;
Then I counted all unique values in ‘abcd’ grouped by ID
PROC SQL ;
CREATE TABLE count_unique AS
SELECT
My_id,
COUNT(DISTINCT abcd) as Count_customers
FROM ALL_STACK
GROUP BY my_id
;
RUN;
Obviously, it’s not efficient to replicate a table 4 times just to put a variables under the same name and then stack them. But my tables were somewhat small enough that I could do it and then immediately delete them after the stack. If you have a very large dataset this method would most certainly be troublesome. I used this method over the others because I was trying to use Procs more than loops, etc.
Upvotes: 2
Views: 4952
Reputation: 51566
You could just put the unique values into a temporary array. Let's convert your photograph into data.
data have;
input id a b c d;
datalines;
11 2 3 4 4
22 1 8 1 1
33 6 . 1 2
44 . 1 1 .
;
So make an array of the input variables and another temporary array to hold the unique values. Then loop over the input variables and save the unique values. Finally count how many unique values there are.
data want ;
set have ;
array unique (4) _temporary_;
array values a b c d ;
call missing(of unique(*));
do _n_=1 to dim(values);
if not missing(values(_n_)) then
if not whichn(values(_n_),of unique(*)) then
unique(_n_)=values(_n_)
;
end;
count=n(of unique(*));
run;
Output:
Obs id a b c d count
1 11 2 3 4 4 3
2 22 1 8 1 1 2
3 33 6 . 1 2 3
4 44 . 1 1 . 1
Upvotes: 1
Reputation: 3315
one more way using sortn and using conditions.
data have; input id a b c d; datalines; 11 2 3 4 4 22 1 8 1 1 33 6 . 1 2 44 . 1 1 . 55 . . . . 66 1 2 3 4 77 . 3 . 4 88 . 9 5 . 99 . . 2 2 76 . . . 2 58 1 1 . . 50 2 . 2 . 66 2 . 7 . 89 1 1 1 . 75 1 2 3 . 76 . 5 6 7 88 . 1 1 1 43 1 . . 1 31 1 . . 2 ;
data want;
set have;
_a=a; _b=b; _c=c; _d=d;
array hello(*) _a _b _c _d;
call sortn(of hello(*));
if a=. and b = . and c= . and d =. then count=0;
else count=1;
do i = 1 to dim(hello)-1;
if hello(i) = . then count+ 0;
else if hello(i)-hello(i+1) = . then count+0;
else if hello(i)-hello(i+1) = 0 then count+ 0;
else if hello(i)-hello(i+1) ne 0 then count+ 1;
end;
drop i _:;
run;
Upvotes: 1
Reputation: 27498
A linear search for duplicates in an array is O(n2) and perfectly fine for small n. The n for a b c d
is four.
The search evaluates every pair in the array and has a flow very similar to a bubble sort.
data have;
input id a b c d; datalines;
11 2 3 4 4
22 1 8 1 1
33 6 . 1 2
44 . 1 1 .
55 . . . .
66 1 2 3 4
run;
The linear search for duplicates will occur on every row, and the count_distinct
will be initialized automatically in each row to a missing (.) value. The sum
function is used to increment the count when a non-missing value is not found in any prior array indices.
* linear search O(N**2);
data want;
set have;
array x a b c d;
do i = 1 to dim(x) while (missing(x(i)));
end;
if i <= dim(x) then count_distinct = 1;
do j = i+1 to dim(x);
if missing(x(j)) then continue;
do k = i to j-1 ;
if x(k) = x(j) then leave;
end;
if k = j then count_distinct = sum(count_distinct,1);
end;
drop i j k;
run;
Upvotes: 2
Reputation: 4554
Try to transpose dataset, each ID becomes one column, frequency each ID column by option nlevels, which count frequency of value, then merge back with original dataset.
Proc transpose data=have prefix=ID out=temp;
id ID;
run;
Proc freq data=temp nlevels;
table ID:;
ods output nlevels=count(keep=TableVar NNonMisslevels);
run;
data count;
set count;
ID=compress(TableVar,,'kd');
drop TableVar;
run;
data want;
merge have count;
by id;
run;
Upvotes: 1