Linus

Reputation: 187

Assign clusters to new data using a model based on previously gained data

In an experiment, we collected data through an online survey. We used this data to conduct a cluster analysis (using the mclust package). Based on the clusters found in that analysis, we then selected certain people for further examination.

Later we reopened our online survey, gaining additional data (although not as much as in the first round).

Now, I'd like to assign the new data (here: seconddata) to clusters using the model fitted on the first data (here: firstdata).

To do the clustering in the first place, we scaled firstdata and then ran the cluster analysis. To assign clusters to the new data, I scaled seconddata as well (using its own mean and standard deviation) and then tried to assign the clusters. Even though seconddata is, in theory, drawn from the same population as firstdata, their distributions differ slightly, which is of course to be expected.

With my original data, the clusters assigned to seconddata make little to no sense: for example, some clusters that were previously the most prevalent are suddenly empty. I suspect this is caused by the slight difference between the distributions of the two datasets, but I'm unsure.
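A minimal sketch of why scaling each dataset on its own could matter, using made-up numbers rather than the survey data. The names first_raw, second_raw and value are hypothetical and only serve the illustration: the same raw observation receives a different z-score depending on which sample's mean and standard deviation are used.

```r
# Toy numbers (hypothetical, not the survey data)
set.seed(1)
first_raw  <- runif(100, min = 50, max = 100)  # large first sample
second_raw <- runif(10,  min = 50, max = 100)  # small second sample

value <- 75  # one raw observation that could occur in either round

# z-score of the same raw value, relative to each sample's own statistics
z_first  <- (value - mean(first_raw))  / sd(first_raw)
z_second <- (value - mean(second_raw)) / sd(second_raw)

# The two z-scores differ, so a model fitted on first-sample z-scores
# sees the observation at a different location if it is scaled with
# the second sample's statistics.
c(z_first = z_first, z_second = z_second)
```

With only ten observations, the mean and standard deviation of the second sample can be far from those of the first, so the shift can be large enough to move points across cluster boundaries.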

Here is an example of how I tried to do the assignment. The clustering itself is meaningless here (the data is random), but the procedure is the same as for my original data.


# library -----------------------------------------------------------------
library(tidyverse)
library(ggplot2)
library(scatterplot3d)
library(hopkins)
library(factoextra)
library(NbClust)
library(mclust)
library(ggpubr)
library(writexl)
library(reshape2)
library(hms)
library("lubridate") 

set.seed(1591593)

# first dataset -----------------------------------------------------------

n1 = 100
#number of observations in first data frame

firstdata <- expand.grid(n1 = 1:n1,
                         x1 = NA,
                         x2 = NA,
                         x3 = NA)


firstdata$x1 <- sample(runif(n = n1, min = 50, max = 100), replace = TRUE)
firstdata$x2 <- sample(runif(n = n1, min = 20, max = 25), replace = TRUE)
firstdata$x3 <- sample(runif(n = n1, min = 30, max = 80), replace = TRUE)
#creating data

firstdata = firstdata %>%
  select(contains("x")) %>%
  mutate_all(scale)
#scale data

par(mfrow = c(1, 1)) # display 1 plot
scatterplot3d(firstdata$x1, firstdata$x2, firstdata$x3)


## model-based clustering ---------------------------------------------

# Fit model
c_model <- Mclust(firstdata)

# Show optimal model
summary(c_model) 

# Plot all models
fviz_mclust(c_model, "BIC", palette = "jco")

# Print cluster sizes
table(c_model$classification)

# Cluster plot
fviz_mclust(c_model, "classification", geom = "point", 
            pointsize = 1.5, palette = "jco")

# Cluster no. (classification) as variable (new dataframe)
firstdata_cluster = firstdata
firstdata_cluster$cluster = as.factor(c_model$classification)

# Compute probability to belong to resp. cluster and not to other cluster
firstdata_cluster$probability = 1-(c_model$uncertainty)

# Cluster label in classifications
clusters = MclustDA(firstdata, class = firstdata_cluster$cluster)
summary(clusters)


# second dataset ----------------------------------------------------------

n2 = 10
#number of observations in second data frame

seconddata <- expand.grid(n2= 1:n2,
                            x1 = NA,
                            x2 = NA,
                            x3 = NA)

seconddata$x1 <- sample(runif(n = n2, min = 50, max = 100), replace = TRUE)
seconddata$x2 <- sample(runif(n = n2, min = 20, max = 25), replace = TRUE)
seconddata$x3 <- sample(runif(n = n2, min = 30, max = 80), replace = TRUE)
# create second data frame


## assigning clusters to the new data ------------------------------------

seconddata = seconddata %>%
  select(contains("x")) %>%
  mutate_all(scale)
# scale second data frame

prediction_seconddata = predict(clusters, seconddata)
round(prediction_seconddata$z, 0)
#assign clusters to the new second data frame/new data using the first model

prediction_seconddata_cluster<-as.data.frame(round(prediction_seconddata$z, 0))
prediction_seconddata_cluster
#assigned clusters of the second data frame
 

I do not have any experience in clustering data, so I'm a bit unsure whether this approach is acceptable. I'm pretty sure something in my procedure is off...

Do you have any suggestions on how to match the seconddata to the clusters?

Thanks!

Upvotes: 0

Views: 354

Answers (1)

Linus

Reputation: 187

The answer to my question was that I had to rescale seconddata using the mean and standard deviation of the unscaled firstdata, rather than its own, before assigning the clusters.

See the section marked "# rescale seconddata" in the code below.

# library -----------------------------------------------------------------
library(tidyverse)
library(ggplot2)
library(scatterplot3d)
library(hopkins)
library(factoextra)
library(NbClust)
library(mclust)
library(ggpubr)
library(writexl)
library(reshape2)
library(hms)
library("lubridate") 

set.seed(1591593)

# first dataset -----------------------------------------------------------

n1 = 100
#number of observations in first data frame

firstdata <- expand.grid(n1 = 1:n1,
                         x1 = NA,
                         x2 = NA,
                         x3 = NA)


firstdata$x1 <- sample(runif(n = n1, min = 50, max = 100), replace = TRUE)
firstdata$x2 <- sample(runif(n = n1, min = 20, max = 25), replace = TRUE)
firstdata$x3 <- sample(runif(n = n1, min = 30, max = 80), replace = TRUE)
#creating data

firstdata = firstdata %>%
  select(contains("x")) %>%
  mutate_all(scale)
#scale data

par(mfrow = c(1, 1)) # display 1 plot
scatterplot3d(firstdata$x1, firstdata$x2, firstdata$x3)


## model-based clustering ---------------------------------------------

# Fit model
c_model <- Mclust(firstdata)

# Show optimal model
summary(c_model) 

# Plot all models
fviz_mclust(c_model, "BIC", palette = "jco")

# Print cluster sizes
table(c_model$classification)

# Cluster plot
fviz_mclust(c_model, "classification", geom = "point", 
            pointsize = 1.5, palette = "jco")

# Cluster no. (classification) as variable (new dataframe)
firstdata_cluster = firstdata
firstdata_cluster$cluster = as.factor(c_model$classification)

# Compute probability to belong to resp. cluster and not to other cluster
firstdata_cluster$probability = 1-(c_model$uncertainty)

# Cluster label in classifications
clusters = MclustDA(firstdata, class = firstdata_cluster$cluster)
summary(clusters)


# second dataset ----------------------------------------------------------

n2 = 10
#number of observations in second data frame

seconddata <- expand.grid(n2= 1:n2,
                          x1 = NA,
                          x2 = NA,
                          x3 = NA)

seconddata$x1 <- sample(runif(n = n2, min = 50, max = 100), replace = TRUE)
seconddata$x2 <- sample(runif(n = n2, min = 20, max = 25), replace = TRUE)
seconddata$x3 <- sample(runif(n = n2, min = 30, max = 80), replace = TRUE)
# create second data frame



# rescale seconddata --------------------------------------------------------

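# firstdata was already standardised above, so mean() and sd() on it would give ~0 and 1.
# Recover the original mean and sd from the attributes that scale() attached to each column.
firstdata_x1_mean <- attr(firstdata$x1, "scaled:center")
firstdata_x1_sd <- attr(firstdata$x1, "scaled:scale")
firstdata_x2_mean <- attr(firstdata$x2, "scaled:center")
firstdata_x2_sd <- attr(firstdata$x2, "scaled:scale")
firstdata_x3_mean <- attr(firstdata$x3, "scaled:center")
firstdata_x3_sd <- attr(firstdata$x3, "scaled:scale")
#mean and sd of the unscaled first data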

seconddata <- seconddata %>% 
  mutate_at(vars(x1),function(x) (x - firstdata_x1_mean) / firstdata_x1_sd) %>% 
  mutate_at(vars(x2),function(x) (x - firstdata_x2_mean) / firstdata_x2_sd) %>% 
  mutate_at(vars(x3),function(x) (x - firstdata_x3_mean) / firstdata_x3_sd) 
#rescale the new data according to the parameters of the old data


## assigning clusters to the new data ------------------------------------

seconddata = seconddata %>%
  select(contains("x"))

prediction_seconddata = predict(clusters, seconddata)
round(prediction_seconddata$z, 0)
#assign clusters to the new second data frame/new data using the first model

prediction_seconddata_cluster<-as.data.frame(round(prediction_seconddata$z, 0))
prediction_seconddata_cluster
#assigned clusters of the second data frame


Now the new clustering is correct.
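A more compact variant of the same idea, offered as a sketch under assumptions rather than as the code used for the survey. The data frames first_raw and second_raw are hypothetical stand-ins for the two survey rounds, and G = 2 is an arbitrary choice so that the example has more than one cluster. Here scale() is applied to the second data with the center and scale values stored from the first data, and predict() is called on the Mclust fit directly instead of going through MclustDA.

```r
library(mclust)

set.seed(1)
# Hypothetical raw data standing in for the two survey rounds
first_raw  <- data.frame(x1 = runif(100, 50, 100),
                         x2 = runif(100, 20, 25),
                         x3 = runif(100, 30, 80))
second_raw <- data.frame(x1 = runif(10, 50, 100),
                         x2 = runif(10, 20, 25),
                         x3 = runif(10, 30, 80))

# Scale the first data; scale() stores the column means and sds as attributes
first_scaled <- scale(first_raw)
centers <- attr(first_scaled, "scaled:center")
sds     <- attr(first_scaled, "scaled:scale")

# Apply the first data's means and sds to the second data
second_scaled <- scale(second_raw, center = centers, scale = sds)

# Fit on the first data (G = 2 is an assumption for illustration),
# then assign the new rows with predict()
fit  <- Mclust(first_scaled, G = 2)
pred <- predict(fit, newdata = second_scaled)

pred$classification   # hard cluster label for each new row
round(pred$z, 2)      # membership probabilities per cluster
```

Keeping the raw data and the scaling parameters in separate objects avoids computing the mean and standard deviation from data that has already been standardised.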

Maybe this helps someone else.

Upvotes: 0
