Mayo

Reputation: 11

Clustering leads to very concentrated clusters

To understand my problem, you will need the whole dataset: https://pastebin.com/82paf0G8

Pre-processing: I had a list of orders with 696 unique item numbers and wanted to cluster the items based on how frequently each pair of items is ordered together. For each pair of items, I counted how often both occur in the same order; the highest count for any pair was 489. I then "calculated" the similarity/correlation as: frequency / maximum frequency over all pairs (489). The result is the dataset that I have uploaded.
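As a rough illustration of the pre-processing described above, the sketch below counts pair co-occurrences on a small made-up order table and scales them by the largest count. The `orders` data frame and its column names are assumptions for illustration only; they are not the real order data.

```r
# Toy order lines (hypothetical data): one row per (order, item)
orders <- data.frame(
  order = c(1, 1, 1, 2, 2, 3, 3, 4),
  item  = c("A", "B", "C", "A", "B", "B", "C", "A"),
  stringsAsFactors = FALSE
)

# Incidence matrix: rows are orders, columns are items, entries are 0/1
inc <- unclass(table(orders$order, orders$item))
inc[inc > 0] <- 1

# co[i, j] = number of orders that contain both item i and item j
co <- crossprod(inc)
diag(co) <- 0            # ignore an item paired with itself

# Similarity = pair frequency / maximum pair frequency
sim <- co / max(co)
sim
```

With this scaling, a single very frequent pair sets the denominator for every other pair, so most similarities can end up close to zero.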

Similarity/correlation: I don't know if my similarity approach is the best in this case. I also tried Jaccard's coefficient (the Jaccard index), but got almost the same results.
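For reference, the Jaccard index of two items is the number of orders containing both, divided by the number of orders containing at least one of them. The following is a minimal sketch on a made-up 0/1 order-by-item matrix; the helper name `jaccard_sim` and the toy data are hypothetical.

```r
# Jaccard similarity between the columns (items) of a 0/1 incidence matrix
jaccard_sim <- function(inc) {
  co <- crossprod(inc)               # orders containing both items
  n  <- diag(co)                     # orders containing each single item
  uni <- outer(n, n, "+") - co       # orders containing at least one of the two
  j <- co / uni
  j[uni == 0] <- 0                   # items that never occur at all
  j
}

# Toy incidence matrix (hypothetical): 4 orders x 3 items
inc <- matrix(
  c(1, 1, 1,
    1, 1, 0,
    0, 1, 1,
    1, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("order", 1:4), c("A", "B", "C"))
)

jaccard_sim(inc)
```

Unlike dividing by a global maximum, this normalizes each pair by how often its own two items are ordered, so rarely ordered items are not automatically pushed towards zero.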

The dataset: The dataset contains the material numbers V1 and V2, and N, the correlation between the two material numbers, which ranges from 0 to 1.

With help from someone else, I managed to create a distance matrix and apply PAM clustering.

Why PAM clustering? A data scientist suggested this: more than 95% of the pairs carry no information, which puts all these materials at the same distance and produces a single, very dispersed cluster. This problem can be solved using a PAM algorithm, but you will still get a very concentrated group. Another solution is to increase the weight of the distances other than one.
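The share of pairs without information can be checked directly on the similarity matrix. This is only a sketch on a made-up 4 x 4 matrix; the helper name `zero_pair_share` is hypothetical.

```r
# Fraction of distinct item pairs whose similarity is exactly zero
zero_pair_share <- function(ss) {
  off_diag <- ss[upper.tri(ss)]      # each unordered pair counted once
  mean(off_diag == 0)
}

# Toy symmetric similarity matrix (hypothetical): only A and B co-occur
items <- c("A", "B", "C", "D")
ss <- matrix(0, 4, 4, dimnames = list(items, items))
ss["A", "B"] <- ss["B", "A"] <- 0.3
diag(ss) <- 1

zero_pair_share(ss)   # 5 of the 6 pairs are zero
```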

Problem 1: The matrix is only 567x567. I think that for clustering I need the full 696x696 matrix, even though most of its entries are zeros, but I'm not sure.

Problem 2: The clustering does not work very well. I get very concentrated clusters: most items end up in the first cluster. Also, according to the usual ways of verifying PAM clusters, my clustering results are poor. Is this due to the similarity analysis? What else should I use? Is it due to 95% of the data being zeros? Should I change the zeros to something else?
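If the full matrix is wanted, one possible way is to embed the smaller matrix into a zero matrix indexed by the complete item list. The sketch below assumes the similarity matrix carries item numbers as row and column names and that a full list of item numbers is available; `pad_similarity` and the toy inputs are hypothetical.

```r
# Embed a named similarity matrix into a larger one covering all items;
# items that never appear in a pair get similarity 0 to everything else
pad_similarity <- function(ss, all_items) {
  all_items <- as.character(all_items)
  full <- matrix(0, length(all_items), length(all_items),
                 dimnames = list(all_items, all_items))
  full[rownames(ss), colnames(ss)] <- ss
  diag(full) <- 1
  full
}

# Toy input (hypothetical): 2 items with pair data, 3 items in total
ss <- matrix(c(1, 0.5, 0.5, 1), 2, 2,
             dimnames = list(c("A", "B"), c("A", "B")))
full <- pad_similarity(ss, c("A", "B", "C"))
full
```

Note that items added this way are at the maximum distance from every other item, so they carry no information for grouping and may simply be absorbed into one large cluster.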

The whole code and results:

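# Suppose X is the dataset (columns V1, V2, N)
library(data.table)
df <- data.table(X)
# Mirror every pair so both (V1, V2) and (V2, V1) exist, then cast to wide form
ss <- dcast(rbind(df, df[, .(V1 = V2, V2 = V1, N)]), V1 ~ V2, value.var = "N")[, -1]
# Rescale so the largest similarity is 1
ss <- ss / max(ss, na.rm = TRUE)
# Pairs that never occur together get similarity 0
ss[is.na(ss)] <- 0
# Each item is fully similar to itself
diag(ss) <- 1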

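Now using PAM clustering:

library(cluster)  # provides pam()
# Turn similarities into distances
dd2 <- as.dist(1 - sqrt(ss))
# Partition into 4 clusters around medoids
pam2 <- pam(dd2, 4)
# Number of items per cluster
summary(as.factor(pam2$clustering))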

But I get very concentrated clusters, as:

1   2   3   4 
382 100  23  62
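One common way to judge a PAM result is the average silhouette width, which `pam()` from the `cluster` package stores in the fitted object. The sketch below runs on a made-up block-structured similarity matrix, not on the uploaded data, and reuses the `1 - sqrt(s)` transform from the code above.

```r
library(cluster)  # provides pam()

# Toy similarity matrix (hypothetical): two groups of three items,
# similarity 0.64 within a group and 0 between groups
items <- paste0("item", 1:6)
ss <- matrix(0, 6, 6, dimnames = list(items, items))
ss[1:3, 1:3] <- 0.64
ss[4:6, 4:6] <- 0.64
diag(ss) <- 1

# Same similarity-to-distance transform as in the question
dd <- as.dist(1 - sqrt(ss))
fit <- pam(dd, k = 2)

table(fit$clustering)          # cluster sizes
fit$silinfo$avg.width          # overall average silhouette width
fit$silinfo$clus.avg.widths    # average silhouette width per cluster
```

Values close to 1 suggest well separated clusters, while values near 0 suggest that items sit between clusters. Comparing `avg.width` across several values of `k` is one way to see whether a different number of clusters fits better.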

Upvotes: 1

Views: 270

Answers (2)

mnm

Reputation: 2012

@Mayo, forget what the data scientist said about PAM. Since you've mentioned that this work is for a thesis, from an academic viewpoint your current justification for why PAM is required does not hold up. Essentially, you need to either prove or justify why PAM is necessary for your case study. Given the (continuous) nature of the variables in the dataset, V1, V2 and N, I do not see why PAM is applicable here (as I mentioned in the comments, PAM works best for mixed variables). Continuing further, see this post on correlation detection in R:

# Objective: Detect Highly Correlated variables, visualize them and remove them
data("mtcars")
my_data <- mtcars[, c(1,3,4,5,6,7)]
# print the first 6 rows
head(my_data, 6)
# compute correlation matrix using the cor()
res<- cor(my_data)
round(res, 2) # Unfortunately, the function cor() returns only the correlation coefficients between variables. 
# Visualize the correlation
# install.packages("corrplot")
library(corrplot)
corrplot(res, type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45)
# Positive correlations are displayed in blue and negative correlations in red color. Color intensity and the size of the circle are proportional to the correlation coefficients. In the right side of the correlogram, the legend color shows the correlation coefficients and the corresponding colors.
# tl.col (for text label color) and tl.srt (for text label string rotation) are used to change text colors and rotations.

#Apply correlation filter at 0.80,
#install.packages("caret", dependencies = TRUE)
library(caret)
highlyCor <- colnames(my_data)[findCorrelation(res, cutoff = 0.80, verbose = TRUE)]
# show highly correlated variables
highlyCor
[1] "disp" "mpg" 

[Figure: High Correlation]

removeHighCor <- findCorrelation(res, cutoff = 0.80) # returns indices of highly correlated variables
# remove highly correlated variables from the dataset
my_data <- my_data[, -removeHighCor]
# check the dimensions after removal
dim(my_data)
[1] 32  4

Hope this helps.

Upvotes: 0

AidanGawronski

Reputation: 2085

I'm not sure where you get the number 696 from. After the rbind, you have a data frame with 567 unique values in each of V1 and V2; the dcast then gives, as expected, a 567 x 567 matrix. As for the clustering, I see no issue with your clusters.

dim(df) # [1] 7659    3

test <- rbind(df, df[, .(V1 = V2, V2 = V1, N)])
dim(test) # [1] 15318     3

length(unique(test$V1)) # 567
length(unique(test$V2)) # 567

test2 <- dcast(test, V1~V2, value.var = "N")[,-1]
dim(test2) # [1] 567 567
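To see where a gap such as 696 versus 567 could come from, the item numbers in the pair table can be compared against the full item list. This is a sketch on made-up data; `all_items` and `pairs` are hypothetical stand-ins for the full list of item numbers and the uploaded V1 / V2 / N table.

```r
# Hypothetical master list of item numbers
all_items <- c("A", "B", "C", "D", "E")

# Toy pair table in the same V1 / V2 / N layout
pairs <- data.frame(
  V1 = c("A", "B"),
  V2 = c("B", "C"),
  N  = c(1.0, 0.4),
  stringsAsFactors = FALSE
)

# Items that appear in at least one pair, in either column
in_pairs <- union(pairs$V1, pairs$V2)

# Items that never appear in any pair
missing_items <- setdiff(all_items, in_pairs)

length(in_pairs)
missing_items
```

Items listed in `missing_items` never appear in any row of the pair table, so they cannot show up in the matrix produced by the dcast.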

Upvotes: 0
