hope288
hope288

Reputation: 1435

SAS Comparing values across multiple columns of the same observation

I have observations with column ID, a, b, c, and d. I want to count the number of unique values in columns a, b, c, and d. So:

enter image description here

I want:

enter image description here

I can't figure out how to count distinct within each row; I can do it among multiple rows but within the row by the columns, I don't know.

Update

I used a different method (that is less efficient) that I felt I understood more. I am still going to look into the ways listed below however to learn the correct method. Here is what I did in case anyone was wondering: I created four tables where in each table I created a variable named for example ‘abcd’ and placed a variable under that name.

So it was something like this:

PROC SQL;
CREATE TABLE table1_a AS
    SELECT 
        *
        a as abcd
    FROM table_I_have_with_all_columns
;
QUIT;

PROC SQL;
CREATE TABLE table2_b AS
    SELECT 
        *
        b as abcd
    FROM table_I_have_with_all_columns
;
QUIT;

PROC SQL;
CREATE TABLE table3_c AS
    SELECT 
        *
        c as abcd
    FROM table_I_have_with_all_columns
;
QUIT;

PROC SQL;
CREATE TABLE table4_d AS
    SELECT 
        *
        d as abcd
    FROM table_I_have_with_all_columns
;
QUIT;

Then I stacked them (this means I have duplicate rows but that's OK, because I just want all of the variables in 1 column and I can do distinct count.

data ALL_STACK;
    set 
    table1_a
    table1_b
    table1_c
    table1_d
;
run;

Then I counted all unique values in ‘abcd’ grouped by ID

PROC SQL ;
    CREATE TABLE count_unique AS
    SELECT
    My_id,
    COUNT(DISTINCT abcd) as Count_customers
    FROM ALL_STACK
    GROUP BY my_id
    ;
RUN;

Obviously, it’s not efficient to replicate a table 4 times just to put a variables under the same name and then stack them. But my tables were somewhat small enough that I could do it and then immediately delete them after the stack. If you have a very large dataset this method would most certainly be troublesome. I used this method over the others because I was trying to use Procs more than loops, etc.

Upvotes: 2

Views: 4952

Answers (4)

Tom
Tom

Reputation: 51566

You could just put the unique values into a temporary array. Let's convert your photograph into data.

data have;
  input id a b c d; 
datalines;
11 2 3 4 4
22 1 8 1 1
33 6 . 1 2
44 . 1 1 .
;

So make an array of the input variables and another temporary array to hold the unique values. Then loop over the input variables and save the unique values. Finally count how many unique values there are.

data want ;
  set have ;
  array unique (4) _temporary_;
  array values a b c d ;
  call missing(of unique(*));
  do _n_=1 to dim(values);
    if not missing(values(_n_)) then
      if not whichn(values(_n_),of unique(*)) then 
        unique(_n_)=values(_n_)
    ;
  end;
  count=n(of unique(*));
run;

Output:

Obs    id    a    b    c    d    count

 1     11    2    3    4    4      3
 2     22    1    8    1    1      2
 3     33    6    .    1    2      3
 4     44    .    1    1    .      1

Upvotes: 1

Kiran
Kiran

Reputation: 3315

one more way using sortn and using conditions.

data have;
  input id a b c d; datalines;
  11 2 3 4 4
  22 1 8 1 1
  33 6 . 1 2
  44 . 1 1 .
  55 . . . .
  66 1 2 3 4
  77 . 3 . 4
  88 . 9 5 .
  99 . . 2 2
  76 . . . 2
  58 1 1 . .
  50 2 . 2 .
  66 2 . 7 .
  89 1 1 1 .
  75 1 2 3 .
  76 . 5 6 7
  88 . 1 1 1
  43 1 . . 1
  31 1 . . 2
  ;
data want;
set have;
_a=a; _b=b; _c=c; _d=d; 
array hello(*) _a _b _c _d;
call sortn(of hello(*));
if a=. and b = . and c= . and d =. then count=0;
else count=1;
do i = 1 to dim(hello)-1;
if hello(i) = . then count+ 0;
else if hello(i)-hello(i+1) = . then count+0;
else if hello(i)-hello(i+1) = 0  then count+ 0;
else if hello(i)-hello(i+1) ne 0  then count+ 1;
end;
drop i _:;
run;

Upvotes: 1

Richard
Richard

Reputation: 27498

A linear search for duplicates in an array is O(n2) and perfectly fine for small n. The n for a b c d is four.

The search evaluates every pair in the array and has a flow very similar to a bubble sort.

data have;
  input id a b c d; datalines;
11 2 3 4 4
22 1 8 1 1
33 6 . 1 2
44 . 1 1 .
55 . . . .
66 1 2 3 4
run;

The linear search for duplicates will occur on every row, and the count_distinct will be initialized automatically in each row to a missing (.) value. The sum function is used to increment the count when a non-missing value is not found in any prior array indices.

* linear search O(N**2);

data want;
  set have;
  array x a b c d;

  do i = 1 to dim(x) while (missing(x(i)));
  end;
  if i <= dim(x) then count_distinct = 1;
  do j = i+1 to dim(x);
    if missing(x(j)) then continue;
    do k = i to j-1 ;
      if x(k) = x(j) then leave;
    end;
    if k = j then count_distinct = sum(count_distinct,1);
  end;
  drop i j k;
run;

Upvotes: 2

Shenglin Chen
Shenglin Chen

Reputation: 4554

Try to transpose dataset, each ID becomes one column, frequency each ID column by option nlevels, which count frequency of value, then merge back with original dataset.

Proc transpose data=have prefix=ID out=temp;
id ID;
run;

Proc freq data=temp nlevels;
table ID:;
ods output nlevels=count(keep=TableVar NNonMisslevels);
run;

data count;
   set count;
   ID=compress(TableVar,,'kd');
   drop TableVar;
run;

data want;
    merge have count;
    by id;
run;

Upvotes: 1

Related Questions