In our interactions with enterprise leaders, we come across varied expectations. A chief marketing officer may need help understanding customers better in order to market products more effectively, while a chief human resource officer may ask, “What is the reason for increased attrition, and how can we manage our employees better?” Most of these questions have no ready answers. They are typical examples of unsupervised learning problems in data science: the answers can be found only by discovering structure in the customer or employee data, a step commonly known as “clustering”.
We came across a similar problem at a large credit card (CC) company with a customer base of around 1.2 million that was facing high attrition. The management needed a strategy to bring attrition under better control and to manage the customer population effectively. The traditional methods of clustering, based on a few financial parameters of customers such as repayment history and outstanding balance, did not help the company mitigate the problem efficiently. The reason was that the behavioral and socio-economic aspects of customers, which are almost equally important, were not taken into account alongside the financial parameters, leading to incorrect clustering.
Current industry challenges
Almost all industries are generating huge amounts of data that is analyzed to gain insights into specific business problems. However, the data cannot be used as it is: it carries intrinsic patterns and sub-populations whose characteristics differ from one another. A model built or analysis done on such data without accounting for that structure will produce polluted results, leading to incorrect business decisions. In a nutshell, the typical challenges the industry faces are heterogeneous data, hidden structure within it, and analyses that mislead when that structure is ignored.
Cluster analysis and different methods
Cluster analysis is the process of organizing objects into groups whose members are similar in some way. A cluster is therefore a collection of objects that are similar to one another but dissimilar to the objects belonging to other clusters.
Let’s look at some of the clustering methods currently in use:
Hierarchical clustering
This is connectivity-based clustering, built on the core idea that objects are more related to nearby objects than to those farther away. The advantage of this algorithm is that it yields every level of clustering from 1 (all data in one cluster) to N (each data point as its own cluster), allowing the analyst to choose the number of clusters based on a calculated statistic. The main disadvantage is that it becomes time-consuming on larger datasets.
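A minimal sketch using SciPy on synthetic data (the library choice and feature matrix are our own illustration, not prescribed by the method):

```python
# Hierarchical (agglomerative) clustering with SciPy on synthetic data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))        # 200 customers, 4 numeric features

# Ward linkage builds the full hierarchy, from N singleton clusters up to 1.
Z = linkage(X, method="ward")

# Cut the tree at a chosen number of clusters; the analyst can compare
# statistics at several cuts before committing to one.
labels = fcluster(Z, t=5, criterion="maxclust")
print(np.bincount(labels)[1:])       # cluster sizes for a 5-cluster cut
```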
Distribution models
Here, clusters are modeled using statistical distributions and are defined as the objects most likely to belong to the same distribution. These models are complex in nature, and for real data sets there may not be any concisely defined mathematical model.
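One common realization of this family is the Gaussian mixture model fitted by expectation-maximization; a sketch with scikit-learn (our choice of tool, on synthetic data) follows:

```python
# Distribution-based clustering via a Gaussian mixture model (EM fitting).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.0, (100, 2)),   # first synthetic segment
               rng.normal(4, 1.5, (100, 2))])  # second, wider segment

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)            # hard assignment to the most likely component
probs = gmm.predict_proba(X)       # soft membership probabilities per point
print(labels[:5], probs[0].round(3))
```

The soft probabilities are what distinguish this family: a customer can partially belong to several segments rather than being forced into exactly one.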
K-Means algorithm
This is centroid-based clustering: clusters are represented by a central (mean) vector, which may not necessarily be a member of the data set, and each object is assigned to the nearest centroid. Its advantages are that it partitions the data space into a clear structure and that it is conceptually close to nearest-neighbor classification, which makes it popular in machine learning. The main disadvantage is that the required number of clusters must be specified in advance, introducing subjectivity. The process therefore needs to be repeated for different cluster counts and the results compared to arrive at the final number, as sketched below.
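A minimal sketch of that repeat-and-compare loop using scikit-learn’s KMeans and the within-cluster sum of squares (inertia) as one common comparison statistic; the data is synthetic:

```python
# Fit K-means for several values of k and compare inertia (the "elbow" method).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))        # illustrative customer features

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    print(k, round(km.inertia_, 1))  # look for the point of diminishing returns
```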
Density-based clustering
In this type of clustering, clusters are defined as regions of comparatively higher concentration (density) than the rest of the data. The main disadvantage is that the method relies on a drop in density to identify cluster borders, and it can fail to detect the intrinsic cluster structures that are frequent in real-life scenarios.
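DBSCAN is a widely used algorithm of this type; the sketch below (scikit-learn, synthetic data, illustrative eps/min_samples values) shows how dense regions become clusters while sparse points are flagged as noise:

```python
# Density-based clustering with DBSCAN: dense regions form clusters,
# sparse points are labeled -1 (noise). eps and min_samples need tuning.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (150, 2)),   # dense blob
               rng.normal(3, 0.3, (150, 2)),   # second dense blob
               rng.uniform(-2, 5, (30, 2))])   # scattered background noise

labels = DBSCAN(eps=0.4, min_samples=5).fit_predict(X)
print(sorted(set(labels)))          # e.g., [-1, 0, 1]: noise plus two clusters
```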
Combining hierarchical and K-means clustering
The hierarchical clustering method produces a hierarchy, giving the user the opportunity to choose appropriate clusters instead of committing to a single partition of the data. K-means-type algorithms require the number of clusters to be specified in advance, which is one of their biggest drawbacks. However, when the number of variables is large, K-means is usually computationally faster than hierarchical clustering, provided k is kept small.
For optimum results, it is essential to use a combination of hierarchical clustering and K-means clustering along with the GLM (Generalized Linear Model) procedure.
As a first step, hierarchical clustering is run on a sample of the data to determine the optimum number of clusters. The GLM procedure then checks for the variables that contribute significantly to the cluster model, so that only those variables are used in the next steps. Finally, the K-means algorithm is run with the number of clusters (from step 1) and the significant variables (from step 2) to produce the clustering of the same sample.
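The sketch below walks through this three-step pipeline on synthetic data. A per-variable one-way ANOVA stands in for the GLM significance check (our approximation), and all names and parameter values are illustrative:

```python
# Combined pipeline: hierarchical clustering on a sample to pick k,
# a significance test to select variables, then K-means on those variables.
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import f_oneway
from sklearn.cluster import KMeans

# Synthetic stand-in for customer data: 8 variables, of which only the
# first 4 carry real group structure.
rng = np.random.default_rng(3)
centers = rng.normal(scale=3, size=(4, 8))
centers[:, 4:] = 0
assign = rng.integers(0, 4, size=5000)
df = pd.DataFrame(centers[assign] + rng.normal(size=(5000, 8)),
                  columns=[f"var{i}" for i in range(8)])

# Step 1: hierarchical clustering on a sample to choose the cluster count.
sample = df.sample(n=500, random_state=3)
Z = linkage(sample, method="ward")
k = 7                                # chosen after inspecting the hierarchy
prelim = fcluster(Z, t=k, criterion="maxclust")

# Step 2: keep variables that differ significantly across the preliminary
# clusters (one-way ANOVA here, approximating the GLM procedure).
significant = [c for c in df.columns
               if f_oneway(*[sample[c][prelim == g]
                             for g in range(1, k + 1)]).pvalue < 0.05]

# Step 3: K-means with k from step 1 and the variables from step 2,
# fitted on the sample and then extended to the full population.
km = KMeans(n_clusters=k, n_init=10, random_state=3).fit(sample[significant])
df["cluster"] = km.predict(df[significant])
print(df["cluster"].value_counts())
```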
The results could then be extended to the larger population, as the final step of the sketch above shows. However, it does not end here; getting the most mileage out of the approach requires further effort beyond the initial clustering.
Better results to enable the right decisions
By using such a combined clustering approach, customers can carry out further analysis and advanced model-building activities for each of the groups, along with other additional benefits.
If we relate these expected benefits to the business case we discussed, the credit card company would be able to (statistically) divide its customer base into a manageable number of clusters based on the within-group and across-group relationships. Thus, we come down from 1.2 million customers to, say, just seven groups. The strategy to address customer attrition and related issues can then be devised for a group of customers instead of for individual customers, reducing the complexities involved.
Rahul Pundlik
Lead Consultant, Wipro Ltd.
Rahul has 15+ years of experience in cross-domain automation and analytical solutions. He enjoys designing & implementing statistical models for real-world applications.