Tamas

Reputation: 786

R, SOM, Kohonen Package, Outlier Detection

I have experimented a little with SOM. First I used MiniSOM in Python, but I was not impressed and switched to the kohonen package in R, which offers more features. Basically, I applied SOM to three use cases: (1) clustering in 2D with generated data, (2) clustering with higher-dimensional data (the built-in wine data set), and (3) outlier detection. I solved all three use cases, but I would like to raise a question about the outlier detection I applied. For this purpose I used the vector som$distances, which contains one distance per row of the input data set. Rows with exceptionally large distances can be outliers. However, I do not know how this distance is computed. The package description (https://cran.r-project.org/web/packages/kohonen/kohonen.pdf) only states for this metric: "distance to the closest unit".

  1. Could you please tell me how this distance is computed?
  2. Could you please comment on the outlier detection I used? How would you have done it? (On the generated data set it really finds the outliers. In the real wine data set there are four clearly larger values among the 177 wines; see the charts below. I quite like the idea of using bar charts to depict this.)

Charts:

The row id of the potential outliers:

# print the row id of the outliers
# the threshold 10 can be taken from the bar chart,
# below which the vast majority of the values fall
df_wine[df_wine$value > 10, ]

It produces the following output:
    index    value
59     59 12.22916
110   110 13.41211
121   121 15.86576
158   158 11.50079

My annotated code snippet:

        library(kohonen)
        library(ggplot2)

        data(wines)

        scaled_wines <- scale(wines)

        # creating and training SOM
        som.wines <- som(scaled_wines, grid = somgrid(5, 5, "hexagonal"))
        summary(som.wines)

        #looking for outliers, dist = distance to the closest unit
        som.wines$distances

        len <- length(som.wines$distances)
        index_in_vector <- c(1:len)
        df_wine <- data.frame(cbind(index_in_vector, som.wines$distances))
        colnames(df_wine) <- c("index", "value")

        po <- ggplot(df_wine, aes(index, value)) + geom_bar(stat = "identity")
        po <- po + ggtitle("Outliers?") +
          theme(plot.title = element_text(hjust = 0.5)) +
          ylab("Distances in som.wines$distances") +
          xlab("Row Index in the Data Set")
        plot(po)

        # print the row id of the outliers
        # the threshold 10 can be taken from the bar chart,
        # below which the vast majority of the values fall
        df_wine[df_wine$value > 10, ]
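
A less manual alternative to the eyeballed cutoff of 10 (a sketch, assuming the usual quartile/IQR rule with the conventional 1.5 multiplier) would be to derive the threshold from the distances themselves:

        # data-driven alternative to the hard-coded threshold 10:
        # flag rows whose distance exceeds Q3 + 1.5 * IQR of all distances
        thr <- quantile(df_wine$value, 0.75) + 1.5 * IQR(df_wine$value)
        df_wine[df_wine$value > thr, ]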

Further Code Samples

With regard to the discussion in the comments, I am also posting the code snippets asked for. As far as I remember, I constructed the code lines responsible for the clustering based on samples I found in the description of the kohonen package (https://cran.r-project.org/web/packages/kohonen/kohonen.pdf). However, I am not completely sure; it was more than a year ago. The code is provided as is, without any warranty :-). Please bear in mind that a particular clustering approach may perform with different accuracy on different data. I would also recommend comparing it with t-SNE on the wine data set (data(wines), available in R). Moreover, it would be worth implementing heat-maps to show how the data are distributed with regard to the individual variables. (In the case of the above example with 2 variables this is not important, but it would be nice for the wine data set.)
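
A minimal sketch of such a t-SNE comparison, assuming the Rtsne package is installed and using the vintages labels that come with data(wines) for colouring:

            library(kohonen)
            library(Rtsne)
            library(ggplot2)

            data(wines)            # provides the wines matrix and the vintages factor
            scaled_wines <- scale(wines)

            set.seed(42)
            # check_duplicates = FALSE in case some scaled rows coincide
            tsne_out <- Rtsne(scaled_wines, perplexity = 30, check_duplicates = FALSE)

            df_tsne <- data.frame(tsne_out$Y, vintage = vintages)
            colnames(df_tsne)[1:2] <- c("dim1", "dim2")

            pt <- ggplot(df_tsne, aes(dim1, dim2, color = vintage)) + geom_point()
            pt <- pt + ggtitle("t-SNE of the Wine Data Set") +
              theme(plot.title = element_text(hjust = 0.5))
            plot(pt)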

Data Generation with Five Clusters and Two Outliers, and Plotting

            library(stats)
            library(ggplot2)

            library(kohonen)


            generate_data <- function(num_of_points, num_of_clusters, outliers=TRUE){
              num_of_points_per_cluster <- num_of_points/num_of_clusters
              cat(sprintf("#### num_of_points_per_cluster = %s, num_of_clusters = %s \n", num_of_points_per_cluster, num_of_clusters))
              arr<-array()
              
              standard_dev_y <- 6000
              standard_dev_x <- 2
              
              # for reproducibility setting the random generator
              set.seed(10)
              
              for (i in 1:num_of_clusters){
                centroid_y <- runif(1, min=10000, max=200000)
                centroid_x <- runif(1, min=20, max=70)
                cat(sprintf("centroid_x = %s \n, centroid_y = %s", centroid_x, centroid_y ))
                
                vector_y <- rnorm(num_of_points_per_cluster, mean=centroid_y, sd=standard_dev_y)
                vector_x <- rnorm(num_of_points_per_cluster, mean=centroid_x, sd=standard_dev_x)
                cluster <- array(c(vector_y, vector_x), dim=c(num_of_points_per_cluster, 2))
                cluster <- cbind(cluster, i)
                
                arr <- rbind(arr, cluster)
              }
              
              if(outliers){
                #adding two outliers
                arr <- rbind(arr, c(10000, 30, 6))
                arr <- rbind(arr, c(150000, 70, 6))
              }
              
              colnames(arr) <-c("y", "x", "Cluster")
              # workaround: remove the initial NA row created by array()
              arr <- na.omit(arr)
              return(arr)
            }

            scatter_plot_data <- function(data_in, couloring_base_indx, main_label){
              
              df <- data.frame(data_in)
              colnames(df) <-c("y", "x", "Cluster")

              pl <- ggplot(data=df, aes(x = x,y=y)) + geom_point(aes(color=factor(df[, couloring_base_indx]))) 
              pl <- pl + ggtitle(main_label) + theme(plot.title = element_text(hjust = 0.5))
              print(pl)
              
            }

            ##################
            # generating data
            data <- generate_data(100, 5, TRUE)
            print(data)
            scatter_plot_data(data, couloring_base_indx = 3, "Original Clusters with Outliers \n 102 Points")
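
The "without outliers" variant referenced at the end of this post can be generated with the same helper, for example:

            data_no_outliers <- generate_data(100, 5, FALSE)
            scatter_plot_data(data_no_outliers, couloring_base_indx = 3, "Original Clusters without Outliers \n 100 Points")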

Preparation, Clustering and Plotting

I used a hierarchical clustering approach on top of the Kohonen map (SOM).

            normalising_data <- function(data){
              # normalizing data points not the cluster identifiers
              mtrx <- data.matrix(data)
              umtrx <- scale(mtrx[,1:2])
              umtrx <- cbind(umtrx, factor(mtrx[,3]))
              colnames(umtrx) <-c("y", "x", "Cluster")
              return(umtrx)
            }

            train_som <- function(umtrx){
              # unsupervised learning
              set.seed(7)
              g <- somgrid(xdim=5, ydim=5, topo="hexagonal")
              #map<-som(umtrx[, 1:2], grid=g, alpha=c(0.005, 0.01), radius=1, rlen=1000)
              map<-som(umtrx[, 1:2], grid=g)
              summary(map)
              
              return(map)
            }

            plot_som_data <- function(map){
              par(mfrow=c(3,2))
              # plot some characteristics of the SOM map
              plot(map, type='changes')
              plot(map, type='codes', main="Mapping Data")
              plot(map, type='count')
              plot(map, type='mapping') # how many data points are held by each neuron
              plot(map, type='dist.neighbours') # the darker the colour, the closer the points; the lighter the colour, the more distant the points

              # switch the plot layout back to normal
              par(mfrow=c(1,1))
            }

            plot_distances_to_the_closest_point <- function(map){

              # to see which neuron each data point is mapped to
              map$unit.classif
              
              #looking for outliers, dist = distance to the closest unit
              map$distances
              max(map$distances)
              
              len <- length(map$distances)
              index_in_vector <- c(1:len)
              df<-data.frame(cbind(index_in_vector, map$distances))
              colnames(df) <-c("index", "value")
              
              po <- ggplot(df, aes(index, value)) + geom_bar(stat = "identity")
              po <- po + ggtitle("Outliers?") +
                theme(plot.title = element_text(hjust = 0.5)) +
                ylab("Distances in som$distances") +
                xlab("Row Index in the Data Set")
              plot(po)
              
              return(df)
              
            }


            ###################
            # unsupervised learning

            umtrx <- normalising_data(data)

            map<-train_som(umtrx)
            plot_som_data(map)
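
            # (addition, not in the original snippets: the outlier bar chart for the
            # generated data can be produced with the helper defined above)
            df_generated <- plot_distances_to_the_closest_point(map)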

            #####################
            # creating the dendrogram and then the clusters for the neurons
            dendrogram <- hclust(object.distances(map, "codes"), method = 'ward.D')
            plot(dendrogram)

            clusters <- cutree(dendrogram, 7)
            clusters
            length(clusters)

            #visualising the clusters on the map
            par(mfrow = c(1,1))
            plot(map, type='dist.neighbours', main="Mapping Data")
            add.cluster.boundaries(map, clusters)

Plots with the Clusters

You can also create nice heat-maps for selected variables, but I have not implemented them; for clustering with only 2 variables it does not make much sense. If you implement them for the wine data set, please add the code and the charts to this post.
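
A minimal sketch of how such heat-maps could look for the wine SOM trained earlier (som.wines), assuming the kohonen type = "property" plot and that the codebook vectors sit in $codes (a matrix in older package versions, a one-element list in newer ones):

            # heat-map of every variable across the SOM grid (sketch, see note above)
            codes <- if (is.list(som.wines$codes)) som.wines$codes[[1]] else som.wines$codes

            par(mfrow = c(3, 5))   # the wine data set has 13 variables
            for (v in colnames(codes)) {
              plot(som.wines, type = "property", property = codes[, v], main = v)
            }
            par(mfrow = c(1, 1))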

            #see the predicted clusters with the data set
            # 1. add the vector of the neuron ids to the data
            mapped_neurons <- map$unit.classif
            umtrx <- cbind(umtrx, mapped_neurons)

            # 2. taking the predicted clusters and adding them to the original matrix
            # very good description of the apply functions:
            # https://www.guru99.com/r-apply-sapply-tapply.html
            get_cluster_for_the_row <- function(x, cltrs){
              return(cltrs[x])
            }

            predicted_clusters <- sapply(umtrx[, 4], get_cluster_for_the_row, cltrs = clusters)

            # note: mtrx only exists inside normalising_data() above, so rebuild the
            # original (unscaled) matrix here before appending the predicted clusters
            mtrx <- cbind(data.matrix(data), predicted_clusters)
            scatter_plot_data(mtrx, couloring_base_indx = 4, "Predicted Clusters with Outliers \n 102 Points")
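
A quick way to check how well the predicted clusters recover the generated ones is a simple cross-tabulation, for example:

            # cross-tabulate the generated cluster ids against the predicted clusters
            table(original = umtrx[, "Cluster"], predicted = predicted_clusters)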

See the predicted clusters below, both for the case with outliers and the case without them.

[Plots: predicted clusters, with and without outliers]

Upvotes: 2

Views: 909

Answers (1)

h45

Reputation: 246

  1. I am not quite sure, but I often find that the distance between two objects residing in a particular dimensional space is measured with the Euclidean distance. For example, two points A and B in a two-dimensional space located at A(x=3, y=4) and B(x=6, y=8) are 5 distance units apart; this is the result of computing sqrt((3-6)^2 + (4-8)^2). The same applies to data of higher dimension, by adding the squared difference of the two points' values in each additional dimension: if A(x=3, y=4, z=5) and B(x=6, y=8, z=7), then the distance is sqrt((3-6)^2 + (4-8)^2 + (5-7)^2), and so on. In kohonen, I think that after the model has finished the training phase, the algorithm calculates the distance of each datum to all nodes and then assigns it to the nearest node (the node with the smallest distance to it). Eventually, the values inside the 'distances' variable returned by the model are the distances of every datum to its nearest node. One thing to note from your script is that the algorithm does not measure the distance directly on the original property values of the data, because they were scaled before being fed to the model; the distance measurement is applied to the scaled version of the data. The scaling is a standard procedure to eliminate the dominance of one variable over another. (A small sketch for cross-checking this against som$distances follows after point 2.)
  2. I believe that your method is acceptable, because the values inside the 'distances' variable are the distances of each datum to its nearest node. So if the distance between a datum and its nearest node is large, this also means that its distances to all the other nodes are obviously even larger.
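
To see this in the actual numbers, here is a minimal sketch that recomputes, for every scaled wine, both the plain Euclidean distance and the sum of squared differences between the row and the codebook vector of its winning unit; comparing these columns with som$distances shows which definition your installed version of the kohonen package actually reports:

        library(kohonen)

        data(wines)
        scaled_wines <- scale(wines)
        som.wines <- som(scaled_wines, grid = somgrid(5, 5, "hexagonal"))

        # codebook vectors: a matrix in older kohonen versions, a one-element list in newer ones
        codes   <- if (is.list(som.wines$codes)) som.wines$codes[[1]] else som.wines$codes
        winners <- som.wines$unit.classif          # winning unit for each row

        diffs     <- scaled_wines - codes[winners, ]
        sum_sq    <- rowSums(diffs^2)              # sum of squared differences
        euclidean <- sqrt(sum_sq)                  # plain Euclidean distance

        # compare with what the package reports
        head(cbind(som.wines$distances, sum_sq, euclidean))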

Upvotes: 1
