Krishna Das
Krishna Das

Reputation: 91

kmeans clustering using PL/R in postgresql

I am trying to use kmeans function in PL/R. I have created following table

CREATE TABLE EMP (NAME1 TEXT, AGE SMALLINT, SALARY NUMERIC );

INSERT INTO EMP VALUES 
('Joe', 41, 55000),
('Jill', 27, 25000),
('Jack', 31, 45000),
('Joker', 65, 20000),
('Joy', 22, 31000),
('Jane', 72, 35000),
('Jackson', 42, 65000),
('Jessica', 23, 37000);

My clustering function is given below

CREATE OR REPLACE FUNCTION CLUS(sal NUMERIC[])
RETURNS INTEGER[] AS
$BODY$

a <- NA
a = kmeans(sal, 3)$cluster
return(a)

$BODY$
LANGUAGE 'plr' ;

When I am executing the following query

SELECT * , CLUS(ARRAY (SELECT SALARY FROM EMP ORDER BY SALARY)) AS CLUSTER_ID from emp order by salary;

My output is

name1   age salary  cluster_id
Joker   65  20000   {3,3,2,2,2,2,1,1}
Jill    27  25000   {2,2,2,2,2,3,3,1}
Joy     22  31000   {3,3,2,2,2,2,1,1}
Jane    72  35000   {1,1,1,1,1,2,2,3}
Jessica 23  37000   {1,1,2,2,2,2,3,3}
Jack    31  45000   {2,2,3,3,3,3,1,1}
Joe     41  55000   {1,1,3,3,3,3,2,2}
Jackson 42  65000   {2,2,3,3,3,3,1,1}

Problems I am facing are, every time a row is evaluated clusters are changing and I want the following output …

name1   age salary  cluster_id
Joker   65  20000   3
Jill    27  25000   3
Joy     22  31000   2
Jane    72  35000   2
Jessica 23  37000   2
Jack    31  45000   2
Joe     41  55000   1
Jackson 42  65000   1

please let me know if it can be done in a better and simple way

Upvotes: 1

Views: 1956

Answers (1)

user1509107
user1509107

Reputation:

Please read the K-means documentation first.

You will notice that K-means involves a random element - this is what causes your output to show different cluster-ids by row.

See here to know how to use set.seed to replicate the same cluster result at every execution given the same input.

Second thing you are doing wrong is the way you implemented your CLUS function & the way you are calling it.

Let me highlight the issue by expanding the query you are running:

SELECT 
* 
,CLUS(ARRAY (SELECT SALARY FROM EMP ORDER BY SALARY)) AS CLUSTER_ID 
from emp 
order by salary;

name    age salary  inputForClus                                         cluster_id
Joker   65  20000   {20000,25000,31000,35000,37000,45000,55000,65000}   {3,3,2,2,2,2,1,1}
Jill    27  25000   {20000,25000,31000,35000,37000,45000,55000,65000}   {2,2,2,2,2,3,3,1}
Joy     22  31000   {20000,25000,31000,35000,37000,45000,55000,65000}   {3,3,2,2,2,2,1,1}
Jane    72  35000   {20000,25000,31000,35000,37000,45000,55000,65000}   {1,1,1,1,1,2,2,3}
Jess    23  37000   {20000,25000,31000,35000,37000,45000,55000,65000}   {1,1,2,2,2,2,3,3}
Jack    31  45000   {20000,25000,31000,35000,37000,45000,55000,65000}   {2,2,3,3,3,3,1,1}
Joe     41  55000   {20000,25000,31000,35000,37000,45000,55000,65000}   {1,1,3,3,3,3,2,2}
Jack    42  65000   {20000,25000,31000,35000,37000,45000,55000,65000}   {2,2,3,3,3,3,1,1}

Notice how the inputForClus column is the same for every row. The cluster ids change from row to row because of the random element mentioned earlier.

The correct way to apply k-means in your case is to write a function which takes a id column, and a numeric array. The function would then return a table which has two columns id & cluster-id. You could implement your function as one which retruns table. Then you can join the cluster-ids back to your table on id.

Search the Postgresql documentation to see how to write table returning functions, first.

Another way could probably be to specify k-means function as an aggregate function.

Upvotes: 1

Related Questions