Let’s look at some of the clustering methods in common use today:
Hierarchical (connectivity-based) clustering is built on the core idea that objects are more related to nearby objects than to those farther away. Its advantage is that it produces a full hierarchy of clusterings, from 1 (all data in one cluster) to N (each data point as its own cluster), allowing the analyst to choose the level that best fits the calculated statistic. The main disadvantage is that it becomes time consuming for larger datasets, since standard agglomerative algorithms scale at least quadratically with the number of points.
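A minimal sketch of this idea using SciPy, on synthetic two-blob data assumed purely for illustration: the hierarchy is built once and can then be cut at any level from 1 to N clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two synthetic blobs as toy data (an assumption for illustration)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Build the full hierarchy once (Ward linkage); this is the step
# that gets expensive as the dataset grows
Z = linkage(X, method="ward")

# The same hierarchy can be cut at any level, from 1 cluster to N
for k in (1, 2, 5, len(X)):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, "requested ->", len(np.unique(labels)), "clusters found")
```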
In distribution-based clustering, clusters are modeled using statistical distributions and are defined as the objects most likely belonging to the same distribution. These models are complex in nature, and for real data sets there may not be any concisely defined mathematical model that fits.
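A common instance of this approach is the Gaussian mixture model. The sketch below, using scikit-learn on assumed synthetic data, shows the "most likely distribution" assignment, both as hard labels and as membership probabilities.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 0.5, (100, 2))])

# Model the data as a mixture of two Gaussian distributions; each point
# is assigned to the distribution it most likely came from
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)        # hard assignments
probs = gmm.predict_proba(X)   # soft "most likely distribution" scores
print(probs[:3].round(3))
```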
Centroid-based clustering represents each cluster by a central (mean) vector, which need not be a member of the data set, and assigns each object to the nearest centroid. Its advantages are that it partitions the data space into a clear structure and is conceptually close to nearest-neighbor classification, which makes it popular in machine learning. The main disadvantage is that the required number of clusters must be specified in advance, which introduces subjectivity; the process must be repeated for different values of k and the results compared to arrive at the final number.
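The repeat-and-compare loop looks roughly like this with scikit-learn's K-means, on assumed random data; inertia (within-cluster sum of squares) is one common comparison statistic.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))

# k must be chosen in advance; in practice the fit is repeated for
# several candidate values and the results compared (e.g., via inertia)
for k in (2, 3, 4, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, "clusters -> inertia", round(km.inertia_, 1))
```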
In density-based clustering, clusters or groups are defined as regions of comparatively higher concentration (density) than the rest of the data. The main disadvantage is that cluster borders are identified by a drop in density, which can be ambiguous; it can also fail to detect intrinsic cluster structures that are frequent in real-life scenarios.
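DBSCAN is the classic algorithm in this family. A short sketch on assumed synthetic data (one dense blob over a sparse background) shows how low-density points are set aside as noise rather than forced into a cluster.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
dense = rng.normal(0, 0.3, (100, 2))   # high-density region
sparse = rng.uniform(-4, 4, (30, 2))   # low-density background
X = np.vstack([dense, sparse])

# Points in high-density regions form clusters; points in low-density
# regions are labeled -1 (noise). Border behavior depends on eps.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("clusters:", len(set(labels)) - (1 if -1 in labels else 0),
      "| noise points:", int((labels == -1).sum()))
```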
Combining hierarchical and K-means clustering
The hierarchical clustering method produces a hierarchy, giving the user the opportunity to choose appropriate clusters rather than partitioning the data in a single, fixed way. K-means-type algorithms require the number of clusters to be specified in advance, which is one of their biggest drawbacks. However, when the dataset is large, K-means is usually computationally faster than hierarchical clustering, provided k is kept small.
For optimum results, it is essential to use a combination of hierarchical clustering and K-means clustering along with the GLM (Generalized Linear Model) procedure.
As a first step, hierarchical clustering is run on a sample of the data to determine the optimum number of clusters. The GLM procedure then checks which variables contribute significantly to the cluster model, so that only those variables are used in the next step. Finally, the K-means algorithm is run with the number of clusters (from step 1) and the significant variables (from step 2) to produce results on the same sample.
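A minimal end-to-end sketch of these three steps follows, on assumed synthetic data. Note the hedges: k is fixed for brevity where in practice one would inspect the dendrogram or a statistic such as the silhouette score, and the GLM step is approximated here with a one-way ANOVA per variable, a substitution of my own rather than the exact procedure described above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import f_oneway
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 5)), rng.normal(3, 1, (200, 5))])
X[:, 3:] = rng.normal(size=(400, 2))   # last two variables are pure noise

# Step 1: hierarchical clustering on a sample to pick the number of
# clusters (k is fixed here for brevity)
sample = X[rng.choice(len(X), size=100, replace=False)]
Z = linkage(sample, method="ward")
k = 2
sample_labels = fcluster(Z, t=k, criterion="maxclust")

# Step 2: GLM-style screening -- keep variables that differ
# significantly across the preliminary clusters (one-way ANOVA used
# as a stand-in for the GLM procedure; an assumption, not its exact method)
keep = []
for j in range(sample.shape[1]):
    groups = [sample[sample_labels == c, j] for c in np.unique(sample_labels)]
    _, p = f_oneway(*groups)
    if p < 0.05:
        keep.append(j)

# Step 3: K-means on the full data with k from step 1 and the
# significant variables from step 2
final = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[:, keep])
print("kept variables:", keep, "| cluster sizes:", np.bincount(final.labels_))
```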
The results can then be extended to the larger population. However, it does not end here. To get the most mileage out of this approach, we require:
- A sizeable amount of data that is clean, accurate, and in a form that works well for this process
- Additional data describing the following parameters:
  - Demographics of customers
  - Competitor data
- A domain knowledge expert
Better results to enable the right decisions
By using such a combined clustering approach, customers can carry out further analysis and advanced model-building activities for each of the groups. Additional benefits include: