Reputation: 91
I am trying to use kmeans function in PL/R. I have created following table
CREATE TABLE EMP (NAME1 TEXT, AGE SMALLINT, SALARY NUMERIC );
INSERT INTO EMP VALUES
('Joe', 41, 55000),
('Jill', 27, 25000),
('Jack', 31, 45000),
('Joker', 65, 20000),
('Joy', 22, 31000),
('Jane', 72, 35000),
('Jackson', 42, 65000),
('Jessica', 23, 37000);
My clustering function is given below
CREATE OR REPLACE FUNCTION CLUS(sal NUMERIC[])
RETURNS INTEGER[] AS
$BODY$
a <- NA
a = kmeans(sal, 3)$cluster
return(a)
$BODY$
LANGUAGE 'plr' ;
When I am executing the following query
SELECT * , CLUS(ARRAY (SELECT SALARY FROM EMP ORDER BY SALARY)) AS CLUSTER_ID from emp order by salary;
My output is
name1 age salary cluster_id
Joker 65 20000 {3,3,2,2,2,2,1,1}
Jill 27 25000 {2,2,2,2,2,3,3,1}
Joy 22 31000 {3,3,2,2,2,2,1,1}
Jane 72 35000 {1,1,1,1,1,2,2,3}
Jessica 23 37000 {1,1,2,2,2,2,3,3}
Jack 31 45000 {2,2,3,3,3,3,1,1}
Joe 41 55000 {1,1,3,3,3,3,2,2}
Jackson 42 65000 {2,2,3,3,3,3,1,1}
Problems I am facing are, every time a row is evaluated clusters are changing and I want the following output …
name1 age salary cluster_id
Joker 65 20000 3
Jill 27 25000 3
Joy 22 31000 2
Jane 72 35000 2
Jessica 23 37000 2
Jack 31 45000 2
Joe 41 55000 1
Jackson 42 65000 1
please let me know if it can be done in a better and simple way
Upvotes: 1
Views: 1956
Reputation:
Please read the K-means documentation first.
You will notice that K-means involves a random element - this is what causes your output to show different cluster-ids by row.
See here to know how to use set.seed
to replicate the same cluster result at every execution given the same input.
Second thing you are doing wrong is the way you implemented your CLUS
function & the way you are calling it.
Let me highlight the issue by expanding the query you are running:
SELECT
*
,CLUS(ARRAY (SELECT SALARY FROM EMP ORDER BY SALARY)) AS CLUSTER_ID
from emp
order by salary;
name age salary inputForClus cluster_id
Joker 65 20000 {20000,25000,31000,35000,37000,45000,55000,65000} {3,3,2,2,2,2,1,1}
Jill 27 25000 {20000,25000,31000,35000,37000,45000,55000,65000} {2,2,2,2,2,3,3,1}
Joy 22 31000 {20000,25000,31000,35000,37000,45000,55000,65000} {3,3,2,2,2,2,1,1}
Jane 72 35000 {20000,25000,31000,35000,37000,45000,55000,65000} {1,1,1,1,1,2,2,3}
Jess 23 37000 {20000,25000,31000,35000,37000,45000,55000,65000} {1,1,2,2,2,2,3,3}
Jack 31 45000 {20000,25000,31000,35000,37000,45000,55000,65000} {2,2,3,3,3,3,1,1}
Joe 41 55000 {20000,25000,31000,35000,37000,45000,55000,65000} {1,1,3,3,3,3,2,2}
Jack 42 65000 {20000,25000,31000,35000,37000,45000,55000,65000} {2,2,3,3,3,3,1,1}
Notice how the inputForClus column is the same for every row. The cluster ids change from row to row because of the random element mentioned earlier.
The correct way to apply k-means in your case is to write a function which takes a id
column, and a numeric array
. The function would then return a table which has two columns id
& cluster-id
. You could implement your function as one which retruns table
. Then you can join the cluster-ids back to your table on id
.
Search the Postgresql documentation to see how to write table returning functions, first.
Another way could probably be to specify k-means function as an aggregate
function.
Upvotes: 1