Reputation: 3320
I mused about how hard it would be to create an Excel LAMBDA function to deliver K-means clusters. I built a one-dimensional version with only a small investment of time.
One-dimensional versions are, by themselves, quite useful for solving clustering problems. For example, I used a 1D version to examine journey data of people walking between buildings on a campus (using access control data). Management at this company suspected that when employees walked between buildings while on duty, they were using the journeys to "go shopping", "take long breaks", etc., and I thought I could measure to what extent this really happened. The data quickly exposed interesting details - e.g., don't try to change buildings just after lunch, because the elevators are full. It also exposed "clusters" of transit times:
This confirmed the managers' suspicions and put measurements on them. Unfortunately, it also revealed at least one major reason this facility had the lowest productivity in the world.
In another 1D case, I clustered invoice amounts to build a kind of classifier - more like binning for a histogram.
So, here is my LAMBDA-based, non-VBA one-dimensional K-means function:
=LAMBDA( dataArray, nodes, [changeThreshold], [kArray],
    LET(
        nSeq, SEQUENCE(, nodes),
        n, IF(ISOMITTED(kArray), nSeq / (nodes + 1), kArray),
        adjust, LAMBDA(array, nArray,
            BYCOL(
                IF(BYROW(ABS(array - nArray), LAMBDA(x, MATCH(MIN(x), x, 0))) = nSeq, array, ""),
                LAMBDA(x, AVERAGE(x))
            )
        ),
        next_kArray, adjust(dataArray, n),
        IF(
            SUM(ABS(next_kArray - n)) <= IF(ISOMITTED(changeThreshold), 0.01, changeThreshold),
            next_kArray,
            KMEANS_1D( dataArray, nodes, changeThreshold, next_kArray )
        )
    )
)
Where the arguments are:
- dataArray - a single-column range of values to cluster
- nodes - the number of cluster nodes to produce
- changeThreshold (optional) - the total node movement below which iteration stops; defaults to 0.01
- kArray (optional) - the current node values, used internally by the recursion
NOTES: This has no error handling, so user input errors, data containing errors, convergence faults, etc. result in #VALUE!. It only takes vertically presented data from a single column. If you want it to take tabular, row-wise data, change it - but if I put in all of those manipulations, the algorithm gets lost in the code, and this is only for illustration. Also, I chose not to make a random initial selection of nodes because I feared the function would become volatile, so I used a simple even distribution ( nSeq / (nodes + 1) ).
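To make the recursion easier to follow outside Excel, here is a rough Python transcription of the same logic. The function name is mine, and, like the LAMBDA's nSeq / (nodes + 1) seeding, it assumes the data lies roughly on a unit scale; this is a sketch, not the author's code.

```python
def kmeans_1d(data, nodes, change_threshold=0.01, k_array=None):
    # Evenly spaced starting nodes over (0, 1), as in nSeq / (nodes + 1).
    # This assumes data roughly in [0, 1]; rescale the seeds for other ranges.
    if k_array is None:
        k_array = [(i + 1) / (nodes + 1) for i in range(nodes)]

    # "adjust": assign each point to its nearest node, then average each group.
    groups = [[] for _ in range(nodes)]
    for x in data:
        nearest = min(range(nodes), key=lambda j: abs(x - k_array[j]))
        groups[nearest].append(x)
    # Raises ZeroDivisionError if a group is empty (the LAMBDA's #DIV/0! case).
    next_k = [sum(g) / len(g) for g in groups]

    # Recurse until the total node movement drops below the threshold.
    if sum(abs(a - b) for a, b in zip(next_k, k_array)) <= change_threshold:
        return next_k
    return kmeans_1d(data, nodes, change_threshold, next_k)
```

For example, six points forming three tight pairs converge to the three pair midpoints in two passes.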
Now, the challenge is to make a multi-dimensional version. That has proven to be not so trivial. I will offer it as an answer and ask whether there are improvements or alternatives.
I will not mark a correct answer - this is just for sharing and collaboration.
Upvotes: 3
Views: 185
Reputation: 11468
I'm kinda unfamiliar with K-means, so I tried to understand it by doing a little research. I used REDUCE to iterate:
=LET(
    xy, C3:D102,
    k, 3,
    REDUCE(TAKE(xy, k), SEQUENCE(10),
        LAMBDA(r, i,
            LET(
                t, BYROW(xy,
                    LAMBDA(b,
                        LET(
                            m, BYROW((r - b) ^ 2, LAMBDA(n, SUM(n))) ^ 0.5,
                            XMATCH(MIN(m), m)
                        )
                    )
                ),
                MAP(
                    SEQUENCE(k) + SEQUENCE(, COLUMNS(xy)) * 0,
                    SEQUENCE(, COLUMNS(xy)) + SEQUENCE(k) * 0,
                    LAMBDA(x, y, AVERAGE(CHOOSECOLS(FILTER(xy, t = x), y)))
                )
            )
        )
    )
)
It first takes the first k rows (I figured there's no need for a volatile random choice); this is the starting point. From there it calculates, column-wise, the distance from each row in your range to the k starting rows (differences squared, then summed, then raised to the power of a half). Each row is assigned to its nearest k-value, and all rows nearest a given k-value are averaged to become the new k-values.
I used 10 iterations.
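Sketched in Python under my reading of the formula (the function name is made up), the REDUCE loop is: seed with the first k rows, then repeat "assign each point to its nearest centroid, re-average each cluster" a fixed 10 times.

```python
def kmeans_fixed_iters(xy, k, iters=10):
    centroids = [list(row) for row in xy[:k]]  # TAKE(xy, k): seed with first k rows
    dims = len(xy[0])
    for _ in range(iters):
        # Nearest-centroid index per row, by Euclidean distance
        assign = []
        for row in xy:
            d = [sum((a - b) ** 2 for a, b in zip(row, c)) ** 0.5
                 for c in centroids]
            assign.append(d.index(min(d)))  # XMATCH(MIN(m), m)
        # New centroids: column-wise average of each cluster's rows (the MAP)
        for i in range(k):
            members = [row for row, a in zip(xy, assign) if a == i]
            if members:  # the FILTER/AVERAGE would error on an empty cluster
                centroids[i] = [sum(r[j] for r in members) / len(members)
                                for j in range(dims)]
    return centroids
```

Note that the worksheet version errors if a cluster ever goes empty; the `if members` guard here simply keeps the old centroid in that case.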
Upvotes: 2
Reputation: 3320
Here is the result:
=LAMBDA( dataArray, nodes, [changeThreshold], [kArray],
    LET(
        nSeq, SEQUENCE(, nodes),
        dimensions, COLUMNS(dataArray),
        dSeq, SEQUENCE(, dimensions, 0),
        rSeq, SEQUENCE(ROWS(dataArray)),
        nCoordSeq, SEQUENCE(, nodes * dimensions, 0),
        genStartK, LAMBDA(minDimensions, maxDimensions,
            LET(
                incr, 2 * PI() / (nodes - 1),
                center, VSTACK(SIGN(dSeq + 1) * 0.5),
                ring, COS(DROP(TRANSPOSE(nSeq), -1) * incr + PI() * (dSeq + 1) / dimensions) * SQRT(2) / 4 + 0.5,
                kMatrix, VSTACK(center, ring) * (maxDimensions - minDimensions) + minDimensions,
                TOROW(kMatrix)
            )
        ),
        reshape, LAMBDA(matrix, newCols, WRAPROWS(TOROW(matrix), newCols)),
        distance, LAMBDA(da, ka,
            LET(
                squareDiff, (INDEX(da, rSeq, MOD(nCoordSeq, dimensions) + 1) - ka) ^ 2,
                offsetAdd, BYROW(reshape(squareDiff, dimensions), LAMBDA(a, SUM(a))),
                SQRT(reshape(offsetAdd, nodes))
            )
        ),
        adjust, LAMBDA(array, nArray,
            BYCOL(
                IF(
                    BYROW(distance(array, nArray), LAMBDA(x, MATCH(MIN(x), x, 0))) =
                        INT(nCoordSeq / dimensions + 1),
                    INDEX(array, rSeq, MOD(nCoordSeq, dimensions) + 1)
                ),
                LAMBDA(x, AVERAGE(x))
            )
        ),
        n, IF(
            ISOMITTED(kArray),
            genStartK(BYCOL(dataArray, LAMBDA(a, MIN(a))), BYCOL(dataArray, LAMBDA(a, MAX(a)))),
            TOROW(kArray)
        ),
        next_kArray, IFERROR(adjust(dataArray, n), n),
        IF(
            SUM(ABS(next_kArray - n)) <= IF(ISOMITTED(changeThreshold), 0.01, changeThreshold),
            WRAPROWS(next_kArray, dimensions),
            KMEANS(dataArray, nodes, changeThreshold, next_kArray)
        )
    )
)
The same arguments are used as in the 1D version. In the input array, each row represents a data point and each column represents a dimension. e.g.:
In the example above, dataArray would be the range B2:D5. Then the number of output nodes needs to be set (an integer).
The output is a set of multidimensional nodes that cluster the data where each row represents a node and each column represents a dimension. e.g.:
The outputs are in cells I2:K5, based on a three-dimensional (three-column) array as input and 4 nodes for the expected output.
Here is a two dimensional example to illustrate the function. Create two columns that represent the x and y of your sample data. Then put this formula into the x column:
=LET( r, RANDARRAY( 100 ),
m, CHOOSE( ROUNDUP( 3*r, 0 ), 0.25, 0.5, 0.75 ),
NORM.INV( r, m, SD ) )
and this formula into the y column:
=LET( r, RANDARRAY( 100 ),
m, CHOOSE( ROUNDUP( 2*C3#, 0 ), 0.75, 0.25 ),
IFERROR( NORM.INV( r, m, SD ), SD*RAND() ) )
This creates some artificial 2-D clusters.
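For reference, here is a rough Python analogue of those two worksheet formulas (assuming SD is a named cell holding the cluster spread; the centers and defaults below are illustrative choices of mine, not values from the screenshots): pick a cluster center per point, then draw normally distributed x/y around it.

```python
import random

def make_clusters(n=100, sd=0.05, seed=0):
    # Illustrative cluster means, standing in for the CHOOSE(...) lists
    centers = [(0.25, 0.75), (0.5, 0.25), (0.75, 0.75)]
    rng = random.Random(seed)
    pts = []
    for _ in range(n):
        cx, cy = rng.choice(centers)  # pick a cluster for this point
        # NORM.INV over a uniform draw is equivalent to a Gaussian sample
        pts.append((rng.gauss(cx, sd), rng.gauss(cy, sd)))
    return pts
```

The worksheet version is slightly more entangled (it reuses the same RANDARRAY draw to pick the mean and to invert the normal CDF), but the effect is the same: artificial 2-D clusters around a handful of means.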
Now apply the KMEANS function to the data with four nodes, and here are the results:
A multi-dimensional (even just two-dimensional) version is an order of magnitude more difficult. With each problem encountered in debugging, I found a solution, but I worried the whole time that this function for clustering was becoming a cluster of a function (pun intended).
For example, the simple distance calculation ABS(dataArray - nArray) must be changed into a Pythagorean distance of the form SQRT( SUM( (dataArray - nArray)^2 ) ), with the squared differences summed across dimensions. It must also accommodate the layout of each array (e.g., are the dimensions presented column-wise, row-wise, in separate symmetrical tables, ...). I ultimately created an internal lambda function, for readability, that can be edited without changing the other parts:
distance, LAMBDA(da, ka,
    LET(
        squareDiff, (INDEX(da, rSeq, MOD(nCoordSeq, dimensions) + 1) - ka) ^ 2,
        offsetAdd, BYROW(reshape(squareDiff, dimensions), LAMBDA(a, SUM(a))),
        SQRT(reshape(offsetAdd, nodes))
    )
)
While it is a simple calculation, it required a lot of restructuring to perform it across dimensions. I am unhappy with the distance function, because the only way I could find to do it was to use my own variant of WRAPROWS, called reshape, to build an array called offsetAdd that facilitates BYROW summation. This is then reshaped back into columnar form to make a table of distances, where columns represent nodes and rows represent each data point in dataArray.
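What all that plumbing computes can be stated much more directly in Python (the names distance_table and nodes_flat are mine; nodes_flat plays the role of the TOROW'd kArray): a table of Euclidean distances with one row per data point and one column per node.

```python
def distance_table(data, nodes_flat, dims):
    # Rebuild the node matrix from its row-wise flattened form (cf. WRAPROWS)
    nodes = [nodes_flat[i:i + dims] for i in range(0, len(nodes_flat), dims)]
    # For each data row, the Pythagorean distance to every node
    return [[sum((p - q) ** 2 for p, q in zip(row, node)) ** 0.5
             for node in nodes]
            for row in data]
```

For instance, two points and two nodes at the same positions give a 2x2 table with zeros on the diagonal.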
The reshape function is there for readability.
reshape, LAMBDA(matrix, newCols, WRAPROWS(TOROW(matrix), newCols)),
The adjust internal function required substantial modification to adapt to multi-dimensionality.
adjust, LAMBDA(array, nArray,
    BYCOL(
        IF(
            BYROW(distance(array, nArray), LAMBDA(x, MATCH(MIN(x), x, 0))) =
                INT(nCoordSeq / dimensions + 1),
            INDEX(array, rSeq, MOD(nCoordSeq, dimensions) + 1)
        ),
        LAMBDA(x, AVERAGE(x))
    )
)
No error handling: I put no effort into error handling or adaptive inputs. The function is already large, so it would be hard to understand if it were bloated with input shaping and error handling. There is one exception. I wrapped the adjust call in an IFERROR because I noticed that if, by chance, one of the nodes is not selected by any data points during the adjustment iterations, it throws a #DIV/0! error. This could be handled a number of ways - including my preference, which is to do nothing. If you get this result, it is likely because you have too many nodes for the sample data (e.g., there are 3 tightly defined clusters but you are asking for 5). The simplest (but most brute-force) solution was to replace the error with the prior node values (n). It works...
next_kArray, IFERROR(adjust(dataArray, n), n)
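A Python sketch of that failure mode and the fallback (1D for brevity, names invented; not the author's code): averaging an empty cluster fails, so the update falls back to the previous node positions, mirroring IFERROR(adjust(dataArray, n), n).

```python
def safe_adjust(data, centroids):
    groups = {i: [] for i in range(len(centroids))}
    for x in data:
        nearest = min(groups, key=lambda i: abs(x - centroids[i]))
        groups[nearest].append(x)
    try:
        # ZeroDivisionError here is the analogue of the #DIV/0! in AVERAGE
        return [sum(g) / len(g) for g in groups.values()]
    except ZeroDivisionError:
        return centroids  # keep the prior nodes, like IFERROR(..., n)
```

With two data points and two centroids where one centroid attracts nothing, the prior centroids are returned unchanged; otherwise each centroid moves to its cluster mean.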
If you have the time to dive into this and see ways to improve it, I would love to learn from you. I have hopefully made it modular enough to allow surgical improvements to its parts.
Also, sorry that there are no code comments. For some reason, that feature is not working anymore in the Advanced Formula Environment.
Upvotes: 3