Jed
Jed

Reputation: 399

How do I do a cluster analysis on table with both character and numeric variables in SAS?

Account_id <- c("00qwerf1”, “00uiowe3”, “11heooiue” , “11heooihe” , 
"00sdffrg3”, “03vthjygjj”, “11mpouhhu” , “1poihbusw”)

Postcode <- c(“EN8 7WD”, “EN7 9BB”, “EN6 8YQ”, “EN8 7TT”, “EN7 9BC”, “EN6 
8YQ”, “EN8 7WD”, “EN7 7WB) 

Age <- c(“30”, “35”, “40”, “50”, “60”, “32”, “34”, “45”)

DF <- data.frame(Account_id, Postcode, Age)

I want to do cluster analysis on my dataframe in SAS. I understand that technically a dataframe is not used in SAS, however I have just used this format for illustration purposes. Account_id and Postcode are both character variables and Age is a numeric variable.

Below is the code that I have used after conducting a data step;

Proc fastclus data=DF maxc-8 maxiter=10 seed=5 out=clus;
Run;

The cluster analysis does not work because Account_id and Postcode are character variables. Is there a way to change these variables into numeric variables, or is there a clustering method that works with both character and numeric variables?

Upvotes: 0

Views: 479

Answers (2)

Richard
Richard

Reputation: 27508

You can determine the unique values of each variable and then assign the ordinality of the original value as it's numeric representation for the purpose of fastclus.

Sample code

Note: The FASTCLUS seed= option is a data set specifier, not a simple number (as is used with random number generators)

* hacky tweak to place your R coded data values in a SAS data set;
data have;
  array _Account_id(8) $20 _temporary_ ("00qwerf1", "00uiowe3", "11heooiue" , "11heooihe" , 
"00sdffrg3", "03vthjygjj", "11mpouhhu" , "1poihbusw");

  array _postcode(8) $7 _temporary_ ("EN8 7WD", "EN7 9BB", "EN6 8YQ", "EN8 7TT", "EN7 9BC", "EN6 
8YQ", "EN8 7WD", "EN7 7WB");

  array _age (8) $3  _temporary_ ("30", "35", "40", "50", "60", "32", "34", "45");

  do _n_ = 1 to dim (_account_id);
    Account_id = _account_id(_n_);
    Postcode = _postcode(_n_);
    Age = _age(_n_);
    output;
  end;
run;

* get lists of distinct values for each variable;
proc means noprint data=have;
  class _all_;
  ways 1;
  output out=have_freq;
run;

* compute ordinal of each variables original value;
data have_freq2;
  set have_freq;
  if not missing(Account_id) then unum_Account_id + 1;
  if not missing(Postcode) then unum_Postcode + 1;
  if not missing(Age) then unum_Age + 1;
run;

* merge back by original value to obtain ordinal values;
proc sql;
  create table have_unumified as
  select 
    Account_id, Postcode, Age
   , (select unum_Account_id from have_freq2 where have_freq2.Account_id = have.Account_id) as unum_Account_id
   , (select unum_Postcode   from have_freq2 where have_freq2.Postcode = have.Postcode) as unum_Postcode
   , (select unum_Age   from have_freq2 where have_freq2.Age = have.Age) as unum_Age
   from have
   ;
run;

* fastclus on the ordinal values (seed= not specified);
Proc fastclus data=have_unumified maxc=8 maxiter=10 out=clus_on_unum;
  var unum_:;
Run;

Upvotes: 0

user667489
user667489

Reputation: 9569

Before you can do clustering you need to define a metric that can be used to calculate the distance between observations. By default proc fastclus uses the Euclidean metric. This requires that all input variables are numeric and works best if they are all rescaled to have the same mean and variance, so that they are all equally important when growing clusters.

You could use postcode in a by statement if you wanted to perform a separate cluster analysis for each postcode, but if you want to use postcode itself as a clustering variable you will need to convert it to a numeric form. Replacing postcode with two variables for the latitude and longitude of postcode centroid might be a good option.

It's less obvious what would be a good option for your account ID variable, as this doesn't appear to be a measurement of anything. I would try to get hold of something else like account creation date or last activity date, which can be converted to a numeric value in a more obvious way.

Upvotes: 0

Related Questions