Gaussian Mixture Models (GMM)
Note: this write-up is not finished yet, so please proceed with caution.
GMM is a method used for clustering. It assumes that the data come from a distribution that is a combination (a mixture) of several Gaussian distributions.
The clusters in a GMM overlap with each other, so the GMM provides what is called a "soft" or probabilistic clustering. Each point does not strictly belong to a single cluster, but instead has a degree or probability of membership in each cluster. In this kind of clustering, a point is more likely to have come from some clusters than from others, but there is still a possibility, perhaps remote, that it may have come from any of them.
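To make the "soft" membership idea concrete, here is a minimal sketch with two made-up one-dimensional Gaussian components; every number below is an arbitrary illustration value, not something estimated from data.
mu    <- c(0, 4)      # component means (made up)
sigma <- c(1, 1.5)    # component standard deviations (made up)
w     <- c(0.5, 0.5)  # mixing proportions (made up)
x <- 1.2              # a single observation
# Posterior probability that x came from each component (Bayes' rule)
num <- w * dnorm(x, mean = mu, sd = sigma)
num / sum(num)        # two positive probabilities summing to 1: a soft assignment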
Consider the simplest case: a mixture of two one-dimensional Gaussian distributions, A and B. Such a model is described by five parameters: the mean and standard deviation of cluster A, the mean and standard deviation of cluster B, and the mixing probability that a point is drawn from A rather than B. If we knew which cluster each training instance came from, estimating these parameters would be straightforward; conversely, if we knew the parameters, computing each instance's cluster probabilities would be straightforward.
The problem is that we know neither of these things: not the distribution that each training instance came from nor the five mixture model parameters. So we adopt the procedure used for the k-means clustering algorithm and iterate. Start with initial guesses for the five parameters, use them to calculate the cluster probabilities for each instance, use these probabilities to reestimate the parameters, and repeat. (If you prefer, you can start with guesses for the classes of the instances instead.) This is called the EM algorithm, for expectation maximization. The first step—calculation of the cluster probabilities, which are the “expected” class values—is “expectation”; the second, calculation of the distribution parameters, is “maximization” of the likelihood of the distributions given the data available.
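As a rough illustration of these two steps, here is a hand-rolled EM loop for a two-component, one-dimensional mixture fitted to simulated data. This is only a sketch of the idea: the starting values are arbitrary guesses, there is no convergence check, and in the rest of this document the mclust package does all of this work for us.
set.seed(1)
x <- c(rnorm(100, mean = 0, sd = 1), rnorm(100, mean = 4, sd = 1.5))  # toy data

# Initial guesses for the five parameters
muA <- 0; sdA <- 1; muB <- 1; sdB <- 1; pA <- 0.5

for (iter in 1:50) {
  # E-step: expected cluster membership (responsibility) of each point
  densA <- pA * dnorm(x, muA, sdA)
  densB <- (1 - pA) * dnorm(x, muB, sdB)
  respA <- densA / (densA + densB)

  # M-step: re-estimate the parameters from the weighted data
  muA <- sum(respA * x) / sum(respA)
  muB <- sum((1 - respA) * x) / sum(1 - respA)
  sdA <- sqrt(sum(respA * (x - muA)^2) / sum(respA))
  sdB <- sqrt(sum((1 - respA) * (x - muB)^2) / sum(1 - respA))
  pA  <- mean(respA)
}

c(muA = muA, sdA = sdA, muB = muB, sdB = sdB, pA = pA)  # should be close to 0, 1, 4, 1.5, 0.5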
library(mclust)
## Package 'mclust' version 5.4.6
## Type 'citation("mclust")' for citing this R package in publications.
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(cluster)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data <- iris
summary(data)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
# Standardize the four numeric columns (drop the Species factor)
scaled_data <- scale(data[, -5])
# List of clusters to be analysed
test_clusters <- 1:10
# Training the model
model <- Mclust(scaled_data, G=test_clusters)
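Before worrying about how many clusters to keep, it is worth peeking at what Mclust fitted: given a vector of values for G, it keeps the combination of covariance model and number of components that scores best according to its own BIC. The fields used below (G, modelName, classification, z) are standard components of an Mclust object.
summary(model)              # selected covariance model and number of components
model$G                     # number of mixture components chosen
model$modelName             # covariance structure of the chosen model
head(model$classification)  # hard cluster assignment of the first rows
head(round(model$z, 3))     # soft membership probabilities of the first rows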
How to select K?
If we choose K by maximizing the likelihood, then as K increases the maximized likelihood cannot decrease, so more complex models will always appear to improve the fit.
It is therefore necessary to penalize the complexity of the model: we need a balance between how well the model fits the data and how simple the model is.
This is why we choose K based on the score returned by either the Akaike information criterion (AIC) or the Bayesian information criterion (BIC).
AIC score: AIC = 2k - 2 ln(L), where k is the number of estimated parameters (not the number of clusters) and L is the maximized likelihood of the model.
BIC score: BIC = k ln(n) - 2 ln(L), where n is the number of observations.
When comparing models fitted by maximum likelihood to the same data, the smaller the AIC or BIC, the better the fit.
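As a sanity check, both criteria can be computed by hand from the model fitted above. An Mclust object stores loglik (maximized log-likelihood), df (number of estimated parameters) and n (number of observations); the sketch below uses the same smaller-is-better sign convention as stats::AIC() and stats::BIC(). Note that mclust's own model$bic flips the sign (2*loglik - df*log(n)), so there a larger value indicates a better model.
manual_AIC <- -2 * model$loglik + 2 * model$df
manual_BIC <- -2 * model$loglik + log(model$n) * model$df
c(AIC = manual_AIC, BIC = manual_BIC)

# mclust's internal BIC uses the opposite sign convention (larger is better)
model$bic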
AICscores <- c()
BICscores <- c()
for (k in test_clusters) {
  # Fit a GMM with k components and store both criteria
  kmodel <- Mclust(scaled_data, G = k)
  # Note: the k = 2 argument of AIC() is the penalty per parameter (the default),
  # not the number of clusters
  AICscores <- append(AICscores, AIC(kmodel, k = 2))
  BICscores <- append(BICscores, BIC(kmodel))
}
Then we plot the results that each criterion gave us:
# AIC and BIC score comparison
plot(1:length(AICscores), AICscores,
     type = "o",
     main = "Akaike Information Criterion (AIC) Score",
     xlab = "Number of components",
     ylab = "AIC Scores",
     xaxt = "n")
axis(side = 1, at = 1:length(AICscores), labels = 1:length(AICscores), cex.axis = 0.8)

plot(1:length(BICscores), BICscores,
     type = "o",
     main = "Bayesian Information Criterion (BIC) Score",
     xlab = "Number of components",
     ylab = "BIC Scores",
     xaxt = "n")
axis(side = 1, at = 1:length(BICscores), labels = 1:length(BICscores), cex.axis = 0.8)
The Mclust function also reports which covariance structure (the model "form") best fits the Gaussian density of the data, returning not only its own BIC scores but also the best-fitting form.
Here is an image for reference
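If you want to reproduce that comparison yourself, mclust can plot the BIC it computed for every covariance model and number of components it tried (remember that mclust's internal BIC convention is larger-is-better, unlike the stats::BIC() values plotted above):
# BIC of every covariance structure / number of components combination tried
summary(model$BIC)
plot(model, what = "BIC")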
After this we can conclude that the best number of components for this data is K = 2, so we refit the model with G = 2:
opt_model <- Mclust(scaled_data, G = 2)
# Attach the estimated mixture density at each point (used later for the 3D plot)
opt_model$density <- densityMclust(scaled_data, G = 2)$density
plot(opt_model)
Option 2 shows the actual clusters that were formed.
Option 3 shows the classification uncertainty, i.e., the rows the model is least sure how to assign to either cluster.
Option 4 shows the estimated density of each cluster.
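If you prefer not to go through the interactive menu that plot() opens for an Mclust object, the same plots can be requested directly with the what argument:
plot(opt_model, what = "classification")  # the clusters that were formed
plot(opt_model, what = "uncertainty")     # points the model is least sure about
plot(opt_model, what = "density")         # estimated mixture density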
We have already seen how the groups look in a 2D plane, but not every data set behaves like this one: sometimes the clusters intersect each other, which makes them difficult to tell apart. One way to distinguish the clusters better is to plot them in 3D, using the estimated density at each point as the height. For this we will use the plotly library, which I personally like for its flexibility and features.
plot_ly(x = scaled_data[, 2],
        y = scaled_data[, 3],
        z = opt_model$density,
        color = opt_model$classification,
        colors = c("#FF0000", "#0000FF"),
        mode = "markers",
        type = "scatter3d",
        opacity = 0.5) %>% hide_colorbar()
We saw that the AIC and BIC metrics let us decide which number of clusters best fits our data. Sadly we are not done yet: we found the best number of clusters, but we do not know yet whether those groups really represent the behavior of the data. For that we turn to the silhouette coefficient.
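As a quick reminder of the standard definition: for each observation i, let a(i) be its average distance to the other points in its own cluster and b(i) the smallest average distance to the points of any other cluster. The silhouette width is s(i) = (b(i) - a(i)) / max(a(i), b(i)), which ranges from -1 to 1: values near 1 mean the point sits comfortably inside its cluster, values near 0 mean it lies between two clusters, and negative values suggest it may be in the wrong cluster.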
# Searching for K based on the silhouette coefficient
coefSil <- c()
for (k in test_clusters[-1]) {  # the silhouette is undefined for a single cluster
  model <- Mclust(iris[, 1:4], G = k)
  temp <- silhouette(model$classification, dist(iris[, 1:4]))
  coefSil[k] <- mean(temp[, 3])
}
tempDF <- data.frame(CS = coefSil, K = test_clusters)

ggplot(tempDF) +
  aes(x = K, y = CS) +
  geom_line() +
  scale_x_continuous(breaks = test_clusters) +
  xlab("N. of Clusters") +
  ylab("Silhouette Coefficient") +
  ggtitle("Silhouette Coefficient vs N. of Clusters")
## Warning: Removed 1 row(s) containing missing values (geom_path).
# Silhouette coefficient for the chosen K = 2
model <- Mclust(iris[, 1:4], G = 2)
cluster <- model$classification
coefSil <- silhouette(cluster, dist(iris[, 1:4]))
summary(coefSil)
## Silhouette of 150 units in 2 clusters from silhouette.default(x = cluster, dist = dist(iris[, 1:4])) :
## Cluster sizes and average silhouette widths:
## 50 100
## 0.8287325 0.6157364
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1518 0.6374 0.7075 0.6867 0.8115 0.8767
fviz_silhouette(coefSil,label=F)
## cluster size ave.sil.width
## 1 1 50 0.83
## 2 2 100 0.62
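As one final optional check, not part of the criteria used above: since iris also carries the true species labels, we can cross-tabulate the two clusters against Species. The cluster sizes of 50 and 100 above already suggest that setosa gets its own cluster while versicolor and virginica are merged. adjustedRandIndex() is the agreement measure shipped with mclust (1 means perfect agreement).
# Compare the 2-cluster solution with the known species labels
table(Cluster = model$classification, Species = iris$Species)
adjustedRandIndex(model$classification, iris$Species)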