Reputation: 4643
I was surprised to find out that clara from library(cluster) allows NAs, but the function documentation says nothing about how it handles these values.
So my questions are:
1. How does clara handle NAs?
2. What about kmeans (NAs not allowed)?

[Update] I did find these lines of code in the clara function:
inax <- is.na(x)
valmisdat <- 1.1 * max(abs(range(x, na.rm = TRUE)))
x[inax] <- valmisdat
which replace missing values with valmisdat. I'm not sure I understand the reason for using such a formula. Any ideas? Would it be more "natural" to treat NAs in each column separately, maybe replacing them with the column mean or median?
Upvotes: 13
Views: 15319
Reputation: 103
Looking at the clara C code, I noticed that when there are missing values in the observations, the sum of squares is "reduced" in proportion to the number of missing values, which I think is wrong! Line 646 of clara.c reads "dsum *= (nobs / pp)": it counts the number of non-missing values in each pair of observations (nobs), divides that by the number of variables (pp), and multiplies the sum of squares by the result. I think it should be done the other way round, i.e. "dsum *= (pp / nobs)".
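To make the difference between the two scale factors concrete, here is a minimal C sketch. It is not the code from clara.c: the function name na_scaled_sumsq, the inflate flag, and the use of NaN as the missing-value marker are assumptions made for illustration. It sums squared differences over the coordinates observed in both rows and then rescales by either pp / nobs or nobs / pp.

```c
#include <math.h>
#include <stddef.h>

/* Sum of squared differences between rows a and b (each of length pp),
 * skipping every coordinate where either value is missing (NaN here).
 * If inflate is nonzero the partial sum is scaled by pp / nobs,
 * otherwise by nobs / pp. Returns -1.0 when no coordinate is usable. */
double na_scaled_sumsq(const double *a, const double *b, size_t pp, int inflate)
{
    double dsum = 0.0;
    size_t nobs = 0;   /* coordinates observed in both rows */

    for (size_t j = 0; j < pp; j++) {
        if (isnan(a[j]) || isnan(b[j]))
            continue;  /* this coordinate does not take part */
        double d = a[j] - b[j];
        dsum += d * d;
        nobs++;
    }
    if (nobs == 0)
        return -1.0;   /* nothing to compare */

    double ratio = inflate ? (double) pp / (double) nobs
                           : (double) nobs / (double) pp;
    return dsum * ratio;
}
```

For a = (1, NaN, 3, 4) and b = (2, 2, NaN, 6), only the first and last coordinates are usable, so dsum = 1 + 4 = 5 with nobs = 2 and pp = 4. Scaling by pp / nobs gives 10, while scaling by nobs / pp gives 2.5, so the second form shrinks the sum further the more values are missing.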
Upvotes: 0
Reputation: 131
I'm not sure kmeans can handle missing data by ignoring the missing values in a row.
There are two steps in kmeans:
1. computing the distance between each observation and each cluster mean;
2. updating the cluster means from the observations assigned to them.
When we have missing data in our observations:
Step 1 can be handled by adjusting the distance metric appropriately, as clara/pam/daisy in the cluster package do. But Step 2 can only be performed if we have some value for each column of an observation. Therefore imputing might be the next best option for kmeans to deal with missing data.
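As an illustration of the imputation route, here is a minimal sketch of column-mean imputation, written in C to match the clara.c excerpts in this thread. The function name impute_column_means, the row-major layout, and the use of NaN as the missing-value marker are all assumptions; the same idea would be applied to the data matrix before passing it to kmeans.

```c
#include <math.h>
#include <stddef.h>

/* Replace every missing entry (NaN here) of an n-by-p row-major matrix x
 * with the mean of the observed values in the same column.
 * A column with no observed value at all is left unchanged.
 * Returns the number of entries that were filled in. */
size_t impute_column_means(double *x, size_t n, size_t p)
{
    size_t filled = 0;

    for (size_t j = 0; j < p; j++) {
        double sum = 0.0;
        size_t count = 0;

        /* first pass: mean of the observed values in column j */
        for (size_t i = 0; i < n; i++) {
            double v = x[i * p + j];
            if (!isnan(v)) {
                sum += v;
                count++;
            }
        }
        if (count == 0)
            continue;  /* nothing to impute from */

        /* second pass: overwrite the missing entries with that mean */
        double mean = sum / (double) count;
        for (size_t i = 0; i < n; i++) {
            if (isnan(x[i * p + j])) {
                x[i * p + j] = mean;
                filled++;
            }
        }
    }
    return filled;
}
```

Mean imputation pulls the imputed observations towards the centre of each column, so it can blur cluster boundaries; a per-column median would follow the same two-pass pattern.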
Upvotes: 3
Reputation: 174803
Although not stated explicitly, I believe that NAs are handled in the manner described in the ?daisy help page. The Details section has:
In the daisy algorithm, missing values in a row of x are not included in the dissimilarities involving that row.
Given that the same code is used internally by clara(), that is how I understand NAs in the data are handled: they just don't take part in the computation. This is a reasonably standard way of proceeding in such cases and is, for example, used in the definition of Gower's generalised similarity coefficient.
Update: The C sources for clara.c clearly indicate that this (the above) is how NAs are handled by clara() (lines 350-356 in ./src/clara.c):
if (has_NA && jtmd[j] < 0) { /* x[,j] has some Missing (NA) */
    /* in the following line (Fortran!), x[-2] ==> seg.fault
       {BDR to R-core, Sat, 3 Aug 2002} */
    if (x[lj] == valmd[j] || x[kj] == valmd[j]) {
        continue /* next j */;
    }
}
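The equality test against valmd[j] suggests why the R code in the question overwrites NAs with 1.1 * max(abs(range(x, na.rm = TRUE))): that value lies outside the range of the observed data, so it can serve as a marker. The following standalone C sketch illustrates the idea under the assumption that valmd[j] holds that marker. The names na_sentinel, encode_missing and usable_coords are hypothetical and not part of the cluster package, and NaN stands in for R's NA.

```c
#include <math.h>
#include <stddef.h>

/* Marker value: 1.1 times the largest absolute observed value.
 * Caveat: if every observed value is 0 the marker is 0 as well
 * and can no longer be told apart from real data. */
double na_sentinel(const double *x, size_t len)
{
    double maxabs = 0.0;
    for (size_t i = 0; i < len; i++)
        if (!isnan(x[i]) && fabs(x[i]) > maxabs)
            maxabs = fabs(x[i]);
    return 1.1 * maxabs;
}

/* Overwrite each missing entry (NaN here) with the marker.
 * Returns how many entries were replaced. */
size_t encode_missing(double *x, size_t len, double sentinel)
{
    size_t replaced = 0;
    for (size_t i = 0; i < len; i++) {
        if (isnan(x[i])) {
            x[i] = sentinel;
            replaced++;
        }
    }
    return replaced;
}

/* Count the coordinates that would take part in a dissimilarity between
 * two rows of length pp: those where neither row carries the marker. */
size_t usable_coords(const double *row_l, const double *row_k,
                     size_t pp, double sentinel)
{
    size_t nobs = 0;
    for (size_t j = 0; j < pp; j++) {
        if (row_l[j] == sentinel || row_k[j] == sentinel)
            continue;  /* next j */
        nobs++;
    }
    return nobs;
}
```

With the data (1, NaN, -5, 2, 3, NaN) the largest absolute value is 5, so the marker is 5.5. Treating the data as two rows of three variables, only the first coordinate is observed in both rows and would contribute to their dissimilarity.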
Upvotes: 9