Gaussian Mixture Models (GMM)
Note: this write-up is not finished yet, so please proceed with caution.
GMM is a method used for clustering. It assumes that the data come from a distribution that is a combination (a mixture) of several Gaussian distributions.
The clusters in a GMM overlap with each other, so the GMM provides what is called a "soft" or probabilistic clustering. Each point does not strictly belong to a single cluster, but instead has a degree or probability of membership in each cluster. In this kind of clustering, a point is more likely to have come from some clusters than from others, but there is still a possibility, perhaps remote, that it may have come from any of them.
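To make the "soft" membership idea concrete, here is a minimal sketch with two made-up one-dimensional Gaussian components; every number below is an arbitrary illustration value, not something estimated from data.
mu    <- c(0, 4)      # component means (made up)
sigma <- c(1, 1.5)    # component standard deviations (made up)
w     <- c(0.5, 0.5)  # mixing proportions (made up)
x <- 1.2              # a single observation
# Posterior probability that x came from each component (Bayes' rule)
num <- w * dnorm(x, mean = mu, sd = sigma)
num / sum(num)        # two positive probabilities summing to 1: a soft assignment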
Consider the simplest case: a mixture of two one-dimensional Gaussian distributions, A and B. Such a model is described by five parameters: the mean and standard deviation of cluster A, the mean and standard deviation of cluster B, and the mixing probability that a point is drawn from A rather than B. If we knew which cluster each training instance came from, estimating these parameters would be straightforward; conversely, if we knew the parameters, computing each instance's cluster probabilities would be straightforward.
The problem is that we know neither of these things: not the distribution that each training instance came from nor the five mixture model parameters. So we adopt the procedure used for the k-means clustering algorithm and iterate. Start with initial guesses for the five parameters, use them to calculate the cluster probabilities for each instance, use these probabilities to reestimate the parameters, and repeat. (If you prefer, you can start with guesses for the classes of the instances instead.) This is called the EM algorithm, for expectation maximization. The first step—calculation of the cluster probabilities, which are the “expected” class values—is “expectation”; the second, calculation of the distribution parameters, is “maximization” of the likelihood of the distributions given the data available.
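As a rough illustration of these two steps, here is a hand-rolled EM loop for a two-component, one-dimensional mixture fitted to simulated data. This is only a sketch of the idea: the starting values are arbitrary guesses, there is no convergence check, and in the rest of this document the mclust package does all of this work for us.
set.seed(1)
x <- c(rnorm(100, mean = 0, sd = 1), rnorm(100, mean = 4, sd = 1.5))  # toy data

# Initial guesses for the five parameters
muA <- 0; sdA <- 1; muB <- 1; sdB <- 1; pA <- 0.5

for (iter in 1:50) {
  # E-step: expected cluster membership (responsibility) of each point
  densA <- pA * dnorm(x, muA, sdA)
  densB <- (1 - pA) * dnorm(x, muB, sdB)
  respA <- densA / (densA + densB)

  # M-step: re-estimate the parameters from the weighted data
  muA <- sum(respA * x) / sum(respA)
  muB <- sum((1 - respA) * x) / sum(1 - respA)
  sdA <- sqrt(sum(respA * (x - muA)^2) / sum(respA))
  sdB <- sqrt(sum((1 - respA) * (x - muB)^2) / sum(1 - respA))
  pA  <- mean(respA)
}

c(muA = muA, sdA = sdA, muB = muB, sdB = sdB, pA = pA)  # should be close to 0, 1, 4, 1.5, 0.5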
library(mclust)
## Package 'mclust' version 5.4.6
## Type 'citation("mclust")' for citing this R package in publications.
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(cluster)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data <- iris
summary(data)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
# Standardize the four numeric columns (drop the Species factor)
scaled_data <- scale(data[, -5])
# List of clusters to be analysed
test_clusters <- 1:10
# Training the model
model <- Mclust(scaled_data, G=test_clusters)
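Before worrying about how many clusters to keep, it is worth peeking at what Mclust fitted: given a vector of values for G, it keeps the combination of covariance model and number of components that scores best according to its own BIC. The fields used below (G, modelName, classification, z) are standard components of an Mclust object.
summary(model)              # selected covariance model and number of components
model$G                     # number of mixture components chosen
model$modelName             # covariance structure of the chosen model
head(model$classification)  # hard cluster assignment of the first rows
head(round(model$z, 3))     # soft membership probabilities of the first rows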
How to select K?
If we choose K by maximizing the likelihood, then as K increases the maximized likelihood cannot decrease, so more complex models will always appear to improve the fit.
It is therefore necessary to penalize the complexity of the model: we need a balance between how well the model fits the data and how simple the model is.
This is why we choose K based on the score returned by either the Akaike information criterion (AIC) or the Bayesian information criterion (BIC).
AIC score: AIC = 2k - 2 ln(L), where k is the number of estimated parameters (not the number of clusters) and L is the maximized likelihood of the model.
BIC score: BIC = k ln(n) - 2 ln(L), where n is the number of observations.
When comparing models fitted by maximum likelihood to the same data, the smaller the AIC or BIC, the better the fit.
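As a sanity check, both criteria can be computed by hand from the model fitted above. An Mclust object stores loglik (maximized log-likelihood), df (number of estimated parameters) and n (number of observations); the sketch below uses the same smaller-is-better sign convention as stats::AIC() and stats::BIC(). Note that mclust's own model$bic flips the sign (2*loglik - df*log(n)), so there a larger value indicates a better model.
manual_AIC <- -2 * model$loglik + 2 * model$df
manual_BIC <- -2 * model$loglik + log(model$n) * model$df
c(AIC = manual_AIC, BIC = manual_BIC)

# mclust's internal BIC uses the opposite sign convention (larger is better)
model$bic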
AICscores <- c()
BICscores <- c()
for (k in test_clusters) {
  # Fit a GMM with k components and store both criteria
  kmodel <- Mclust(scaled_data, G = k)
  # Note: the k = 2 argument of AIC() is the penalty per parameter (the default),
  # not the number of clusters
  AICscores <- append(AICscores, AIC(kmodel, k = 2))
  BICscores <- append(BICscores, BIC(kmodel))
}
Then we plot the results that each criterion gave us:
# AIC and BIC score comparison
plot(1:length(AICscores), AICscores,
     type = "o",
     main = "Akaike Information Criterion (AIC) Score",
     xlab = "Number of components",
     ylab = "AIC Scores",
     xaxt = "n")
axis(side = 1, at = 1:length(AICscores), labels = 1:length(AICscores), cex.axis = 0.8)

plot(1:length(BICscores), BICscores,
     type = "o",
     main = "Bayesian Information Criterion (BIC) Score",
     xlab = "Number of components",
     ylab = "BIC Scores",
     xaxt = "n")
axis(side = 1, at = 1:length(BICscores), labels = 1:length(BICscores), cex.axis = 0.8)
The Mclust function also reports which covariance structure (the model "form") best fits the Gaussian density of the data, returning not only its own BIC scores but also the best-fitting form.
Here is an image for reference
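If you want to reproduce that comparison yourself, mclust can plot the BIC it computed for every covariance model and number of components it tried (remember that mclust's internal BIC convention is larger-is-better, unlike the stats::BIC() values plotted above):
# BIC of every covariance structure / number of components combination tried
summary(model$BIC)
plot(model, what = "BIC")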
After this we can conclude that the best number of components for this data is K = 2, so we refit the model with G = 2:
opt_model <- Mclust(scaled_data, G = 2)
# Attach the estimated mixture density at each point (used later for the 3D plot)
opt_model$density <- densityMclust(scaled_data, G = 2)$density
plot(opt_model)
Option 2 shows the actual clusters that were formed.
Option 3 shows the classification uncertainty, i.e., the rows the model is least sure how to assign to either cluster.
Option 4 shows the estimated density of each cluster.
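If you prefer not to go through the interactive menu that plot() opens for an Mclust object, the same plots can be requested directly with the what argument:
plot(opt_model, what = "classification")  # the clusters that were formed
plot(opt_model, what = "uncertainty")     # points the model is least sure about
plot(opt_model, what = "density")         # estimated mixture density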
We have already seen how the groups look in a 2D plane, but not every data set behaves like this one: sometimes the clusters intersect each other, which makes them difficult to tell apart. One way to distinguish the clusters better is to plot them in 3D, using the estimated density at each point as the height. For this we will use the plotly library, which I personally like for its flexibility and features.
plot_ly(x = scaled_data[, 2],
        y = scaled_data[, 3],
        z = opt_model$density,
        color = opt_model$classification,
        colors = c("#FF0000", "#0000FF"),
        mode = "markers",
        type = "scatter3d",
        opacity = 0.5) %>% hide_colorbar()
We saw that the AIC and BIC metrics let us decide which number of clusters best fits our data. Sadly we are not done yet: we found the best number of clusters, but we do not know yet whether those groups really represent the behavior of the data. For that we turn to the silhouette coefficient.
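As a quick reminder of the standard definition: for each observation i, let a(i) be its average distance to the other points in its own cluster and b(i) the smallest average distance to the points of any other cluster. The silhouette width is s(i) = (b(i) - a(i)) / max(a(i), b(i)), which ranges from -1 to 1: values near 1 mean the point sits comfortably inside its cluster, values near 0 mean it lies between two clusters, and negative values suggest it may be in the wrong cluster.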
# Searching for K based on the silhouette coefficient
coefSil <- c()
for (k in test_clusters[-1]) {  # the silhouette is undefined for a single cluster
  model <- Mclust(iris[, 1:4], G = k)
  temp <- silhouette(model$classification, dist(iris[, 1:4]))
  coefSil[k] <- mean(temp[, 3])
}
tempDF <- data.frame(CS = coefSil, K = test_clusters)

ggplot(tempDF) +
  aes(x = K, y = CS) +
  geom_line() +
  scale_x_continuous(breaks = test_clusters) +
  xlab("N. of Clusters") +
  ylab("Silhouette Coefficient") +
  ggtitle("Silhouette Coefficient vs N. of Clusters")
## Warning: Removed 1 row(s) containing missing values (geom_path).
# Silhouette coefficient for the chosen K = 2
model <- Mclust(iris[, 1:4], G = 2)
cluster <- model$classification
coefSil <- silhouette(cluster, dist(iris[, 1:4]))
summary(coefSil)
## Silhouette of 150 units in 2 clusters from silhouette.default(x = cluster, dist = dist(iris[, 1:4])) :
## Cluster sizes and average silhouette widths:
## 50 100
## 0.8287325 0.6157364
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1518 0.6374 0.7075 0.6867 0.8115 0.8767
fviz_silhouette(coefSil,label=F)
## cluster size ave.sil.width
## 1 1 50 0.83
## 2 2 100 0.62
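As one final optional check, not part of the criteria used above: since iris also carries the true species labels, we can cross-tabulate the two clusters against Species. The cluster sizes of 50 and 100 above already suggest that setosa gets its own cluster while versicolor and virginica are merged. adjustedRandIndex() is the agreement measure shipped with mclust (1 means perfect agreement).
# Compare the 2-cluster solution with the known species labels
table(Cluster = model$classification, Species = iris$Species)
adjustedRandIndex(model$classification, iris$Species)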