Reputation: 187
In an experiment, we collected data through an online survey. We used this data to conduct a cluster analysis (using the mclust library).
According to the clusters gained through our analysis, we then selected certain people for further examination.
Later we reopened our online survey, gaining additional data (although not as much as in the first round).
Now, I'd like to assign the new data (here: seconddata) to the clusters using the model fitted on the first data (here: firstdata).
To do the clustering in the first place, we scaled the firstdata and then conducted the cluster analysis. Now, to assign the clusters to the new data, I scaled the seconddata as well, and then tried to assign the clusters. Even though we theoretically draw the seconddata from the same population as the firstdata, there is a slight difference in their distributions, which is of course to be expected.
In my original data, the clusters assigned to the seconddata seem to make little to no sense: for example, some clusters that were previously the most prevalent are suddenly empty. I suspect this is because of the slight difference in the distributions of the two datasets, but I'm unsure.
Here is an example of how I tried to do the assigning. The cluster itself makes no sense but the procedure remains the same for my original data.
# library -----------------------------------------------------------------
library(tidyverse)
library(ggplot2)
library(scatterplot3d)
library(hopkins)
library(factoextra)
library(NbClust)
library(mclust)
library(ggpubr)
library(writexl)
library(reshape2)
library(hms)
library("lubridate")
set.seed(1591593)
# first dataset -----------------------------------------------------------
n1 = 100
#number of observations in first data frame
firstdata <- expand.grid(n1 = 1:n1,
x1 = NA,
x2 = NA,
x3 = NA)
firstdata$x1 <- sample(runif(n = n1, min = 50, max = 100), replace = TRUE)
firstdata$x2 <- sample(runif(n = n1, min = 20, max = 25), replace = TRUE)
firstdata$x3 <- sample(runif(n = n1, min = 30, max = 80), replace = TRUE)
#creating data
firstdata = firstdata %>%
select(contains("x")) %>%
mutate_all(scale)
#scale data
par(mfrow = c(1, 1)) # display 1 plot
scatterplot3d(firstdata$x1, firstdata$x2, firstdata$x3)
## model-based clustering ---------------------------------------------
# Fit model
c_model <- Mclust(firstdata)
# Show optimal model
summary(c_model)
# Plot all models
fviz_mclust(c_model, "BIC", palette = "jco")
# Print cluster sizes
table(c_model$classification)
# Cluster plot
fviz_mclust(c_model, "classification", geom = "point",
pointsize = 1.5, palette = "jco")
# Cluster no. (classification) as variable (new dataframe)
firstdata_cluster = firstdata
firstdata_cluster$cluster = as.factor(c_model$classification)
# Compute probability to belong to resp. cluster and not to other cluster
firstdata_cluster$probability = 1-(c_model$uncertainty)
# Cluster label in classifications
clusters = MclustDA(firstdata, class = firstdata_cluster$cluster)
summary(clusters)
# second dataset ----------------------------------------------------------
n2 = 10
#number of observations in second data frame
seconddata <- expand.grid(n2= 1:n2,
x1 = NA,
x2 = NA,
x3 = NA)
seconddata$x1 <- sample(runif(n = n2, min = 50, max = 100), replace = TRUE)
seconddata$x2 <- sample(runif(n = n2, min = 20, max = 25), replace = TRUE)
seconddata$x3 <- sample(runif(n = n2, min = 30, max = 80), replace = TRUE)
# create second data frame
## assigning clusters to the new data ------------------------------------
seconddata = seconddata %>%
select(contains("x")) %>%
mutate_all(scale)
# scale second data frame
prediction_seconddata = predict(clusters, seconddata)
round(prediction_seconddata$z, 0)
#assign clusters to the new second data frame/new data using the first model
prediction_seconddata_cluster<-as.data.frame(round(prediction_seconddata$z, 0))
prediction_seconddata_cluster
#assigned clusters of the second data frame
I do not have any experience in clustering data, so I'm a bit unsure whether this approach is acceptable. I'm pretty sure something in my procedure is off...
Do you have any suggestions on how to match the seconddata to the clusters?
Thanks!
Upvotes: 0
Views: 354
Reputation: 187
The answer to my question was that I had to rescale the seconddata using the mean and standard deviation of the firstdata before assigning the clusters, instead of scaling the seconddata on its own.
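To illustrate why this matters, here is a minimal sketch with made-up numbers (the vectors `old` and `new` are hypothetical and not from my survey). The same raw value gets a different z-score depending on whose mean and sd are used, so new points scaled with their own parameters can end up in a different region of the fitted model.

```r
# Hypothetical raw values of one variable in each survey round
old <- c(50, 60, 70, 80, 90)   # first round (the model was fitted on the scaled version)
new <- c(70, 80, 90)           # second round, slightly shifted distribution

# Scaling the new data on its own: the raw value 80 becomes 0 (the "centre")
own_scaled <- (new - mean(new)) / sd(new)

# Scaling the new data with the first round's parameters:
# the raw value 80 stays above the centre, as it was in the first round
ref_scaled <- (new - mean(old)) / sd(old)

print(own_scaled)
print(ref_scaled)
```

Only the second version puts the new observations on the same scale that the model was fitted on.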
See: # rescale seconddata --------
# library -----------------------------------------------------------------
library(tidyverse)
library(ggplot2)
library(scatterplot3d)
library(hopkins)
library(factoextra)
library(NbClust)
library(mclust)
library(ggpubr)
library(writexl)
library(reshape2)
library(hms)
library("lubridate")
set.seed(1591593)
# first dataset -----------------------------------------------------------
n1 = 100
#number of observations in first data frame
firstdata <- expand.grid(n1 = 1:n1,
x1 = NA,
x2 = NA,
x3 = NA)
firstdata$x1 <- sample(runif(n = n1, min = 50, max = 100), replace = TRUE)
firstdata$x2 <- sample(runif(n = n1, min = 20, max = 25), replace = TRUE)
firstdata$x3 <- sample(runif(n = n1, min = 30, max = 80), replace = TRUE)
#creating data
firstdata = firstdata %>%
select(contains("x")) %>%
mutate_all(scale)
#scale data
par(mfrow = c(1, 1)) # display 1 plot
scatterplot3d(firstdata$x1, firstdata$x2, firstdata$x3)
## model-based clustering ---------------------------------------------
# Fit model
c_model <- Mclust(firstdata)
# Show optimal model
summary(c_model)
# Plot all models
fviz_mclust(c_model, "BIC", palette = "jco")
# Print cluster sizes
table(c_model$classification)
# Cluster plot
fviz_mclust(c_model, "classification", geom = "point",
pointsize = 1.5, palette = "jco")
# Cluster no. (classification) as variable (new dataframe)
firstdata_cluster = firstdata
firstdata_cluster$cluster = as.factor(c_model$classification)
# Compute probability to belong to resp. cluster and not to other cluster
firstdata_cluster$probability = 1-(c_model$uncertainty)
# Cluster label in classifications
clusters = MclustDA(firstdata, class = firstdata_cluster$cluster)
summary(clusters)
# second dataset ----------------------------------------------------------
n2 = 10
#number of observations in second data frame
seconddata <- expand.grid(n2= 1:n2,
x1 = NA,
x2 = NA,
x3 = NA)
seconddata$x1 <- sample(runif(n = n2, min = 50, max = 100), replace = TRUE)
seconddata$x2 <- sample(runif(n = n2, min = 20, max = 25), replace = TRUE)
seconddata$x3 <- sample(runif(n = n2, min = 30, max = 80), replace = TRUE)
# create second data frame
# rescale seconddata --------------------------------------------------------
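# firstdata was already standardized above, so mean() and sd() of its columns
# would just return ~0 and 1. Use the original means and sds that scale()
# stored as attributes on each scaled column instead.
firstdata_x1_mean <- attr(firstdata$x1, "scaled:center")
firstdata_x1_sd <- attr(firstdata$x1, "scaled:scale")
firstdata_x2_mean <- attr(firstdata$x2, "scaled:center")
firstdata_x2_sd <- attr(firstdata$x2, "scaled:scale")
firstdata_x3_mean <- attr(firstdata$x3, "scaled:center")
firstdata_x3_sd <- attr(firstdata$x3, "scaled:scale")
#mean and sd of the old (first) data before scaling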
seconddata <- seconddata %>%
mutate_at(vars(x1),function(x) (x - firstdata_x1_mean) / firstdata_x1_sd) %>%
mutate_at(vars(x2),function(x) (x - firstdata_x2_mean) / firstdata_x2_sd) %>%
mutate_at(vars(x3),function(x) (x - firstdata_x3_mean) / firstdata_x3_sd)
#rescale the new data according to the parameters of the old data
## assigning clusters to the new data ------------------------------------
seconddata = seconddata %>%
select(contains("x"))
prediction_seconddata = predict(clusters, seconddata)
round(prediction_seconddata$z, 0)
#assign clusters to the new second data frame/new data using the first model
prediction_seconddata_cluster<-as.data.frame(round(prediction_seconddata$z, 0))
prediction_seconddata_cluster
#assigned clusters of the second data frame
Now the new clustering is correct.
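A more compact way to do the same rescaling might look like the sketch below. It assumes the unscaled data of both rounds is still available; the data frames `first_raw` and `second_raw` are hypothetical stand-ins, not my survey data. It relies on base R's `scale()`, which records the centering and scaling values as the attributes `scaled:center` and `scaled:scale` and accepts them again through its `center` and `scale` arguments.

```r
# Hypothetical raw data standing in for the two survey rounds
set.seed(1)
first_raw  <- data.frame(x1 = runif(100, 50, 100), x2 = runif(100, 20, 25))
second_raw <- data.frame(x1 = runif(10, 50, 100),  x2 = runif(10, 20, 25))

# Scale the first round and keep the parameters that scale() records
first_scaled <- scale(first_raw)
centers <- attr(first_scaled, "scaled:center")
sds     <- attr(first_scaled, "scaled:scale")

# Apply exactly the same shift and stretch to the second round
second_scaled <- scale(second_raw, center = centers, scale = sds)

# Check: identical to the manual (x - mean) / sd formula
manual_x1 <- (second_raw$x1 - mean(first_raw$x1)) / sd(first_raw$x1)
print(all.equal(as.numeric(second_scaled[, "x1"]), manual_x1))
```

This avoids one pair of mean/sd variables per column, which is where copy-and-paste mistakes creep in easily.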
Maybe this helps someone else.
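One caveat about reading the result: `round(z, 0)` only marks a cluster when its probability is at least 0.5, so an observation without a clear favourite gets a row of zeros. The sketch below uses a made-up probability matrix `z` (hypothetical values, not output from my model) to show how the most probable cluster per row could be picked instead. As far as I know, the list returned by `predict()` also has a `classification` element that contains this assignment directly.

```r
# Hypothetical membership probabilities: two observations, three clusters
z <- matrix(c(0.45, 0.35, 0.20,
              0.05, 0.90, 0.05),
            nrow = 2, byrow = TRUE,
            dimnames = list(NULL, c("1", "2", "3")))

# Rounding: the first row becomes all zeros, so no cluster is assigned
print(round(z, 0))

# Pick the most probable cluster per row and keep its probability
cluster     <- colnames(z)[max.col(z, ties.method = "first")]
probability <- apply(z, 1, max)
print(data.frame(cluster, probability))
```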
Upvotes: 0