Insurance fraud detection is a challenging problem, given the variety of fraud patterns and the relatively small ratio of known frauds in typical samples. While building detection models, the savings from loss prevention need to be balanced against the cost of false alerts. Machine learning techniques allow for improving predictive accuracy, enabling loss control units to achieve higher coverage with low false positive rates. In this paper, multiple machine learning techniques for fraud detection are presented and their performance on various data sets examined. The impact of feature engineering, feature selection and parameter tweaking is explored with the objective of achieving superior predictive performance.
Insurance fraud covers the range of improper activities an individual may commit in order to obtain a favorable outcome from the insurance company. These range from staging the incident, to misrepresenting the situation (including the relevant actors and the cause of the incident), to inflating the extent of the damage caused.
Potential situations could include:
- Covering up for a situation that wasn’t covered under insurance (e.g. drunk driving, performing risky acts, illegal activities etc.)
- Misrepresenting the context of the incident: This could include shifting blame for incidents where the insured party is at fault, or failure to take agreed-upon safety measures
- Inflating the impact of the incident: Increasing the estimate of the loss incurred, either by adding unrelated losses (faking losses) or by attributing inflated costs to the actual losses
The insurance industry has grappled with the challenge of insurance claim fraud from the very start. On one hand, there is the risk of harming customer satisfaction through delayed payouts or prolonged investigation during a period of stress, along with the costs of investigation and pressure from insurance industry regulators. On the other hand, improper payouts hurt profitability and encourage similar delinquent behavior from other policy holders.
According to the FBI, the insurance industry in the USA consists of over 7,000 companies that collectively receive over $1 trillion in premiums annually. The FBI also estimates the total cost of insurance fraud (non-health insurance) at more than $40 billion annually.
It must be noted that insurance fraud is not a victimless crime: the losses due to fraud affect all the involved parties through increased premium costs, a trust deficit during the claims process, and reduced process efficiency and innovation.
Hence the insurance industry has an urgent need to develop the capability to identify potential fraud with a high degree of accuracy, so that legitimate claims can be cleared rapidly while flagged cases are scrutinized in detail.
2.0 Why Machine Learning in Fraud Detection?
The traditional approach for fraud detection is based on developing heuristics around fraud indicators. Based on these heuristics, a decision on fraud would be made in one of two ways. In certain scenarios, rules would be framed that define whether the case needs to be sent for investigation. In other cases, a checklist would be prepared with scores for the various indicators of fraud. An aggregation of these scores, along with the value of the claim, would determine if the case needs to be sent for investigation. The criteria for determining indicators and the thresholds would be tested statistically and periodically recalibrated.
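The checklist-style approach above can be sketched as follows. This is a minimal illustration only: the indicator names, weights, value weighting and threshold are hypothetical, not taken from any real insurer's rulebook.

```python
# Hypothetical checklist scoring: each fraud indicator carries a weight,
# and the aggregate score (plus a claim-value component) is compared
# against a referral threshold.

FRAUD_INDICATORS = {
    "claim_soon_after_policy_start": 3,
    "no_police_report": 2,
    "prior_claims_history": 2,
    "inconsistent_statements": 4,
}

def referral_score(claim_flags, claim_value, value_weight=0.001):
    """Aggregate the weights of triggered indicators with a claim-value term."""
    indicator_score = sum(
        weight
        for flag, weight in FRAUD_INDICATORS.items()
        if claim_flags.get(flag)
    )
    return indicator_score + value_weight * claim_value

def needs_investigation(claim_flags, claim_value, threshold=8.0):
    """Decide whether the claim is referred for investigation."""
    return referral_score(claim_flags, claim_value) >= threshold

flags = {"no_police_report": True, "inconsistent_statements": True}
# 2 + 4 + 0.001 * 2500 = 8.5, above the threshold of 8.0
print(needs_investigation(flags, claim_value=2500))
```

In a real deployment, the weights and threshold would be the quantities recalibrated periodically, as described above.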
The challenge with the above approaches is that they rely very heavily on manual intervention, which leads to the following limitations:
- Constrained to operate with a limited set of known parameters based on heuristic knowledge, while being aware that some of the other attributes could also influence decisions
- Inability to capture context-specific relationships between parameters (geography, customer segment, insurance sales process) that might not reflect the typical picture. Consultations with industry experts indicate that there is no ‘typical model’, making it challenging to determine a model specific to each context
- Recalibration of the model is a manual exercise that has to be conducted periodically to reflect changing behavior and to ensure that the model adapts to feedback from investigations. Conducting this recalibration regularly is challenging
- Incidence of fraud (as a percentage of overall claims) is low: typically less than 1% of claims are classified as fraudulent. Additionally, new modus operandi for fraud need to be uncovered on a proactive basis
These challenges are difficult to address from a traditional statistics perspective. Hence, insurers have started looking at leveraging machine learning capability. The intent is to present a variety of data to the algorithm without prejudging the relevance of the data elements, and, based on identified frauds, to have the machine develop a model that can be tested on these known frauds through a variety of algorithmic techniques.
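As a minimal sketch of this idea, the snippet below trains a classifier on a synthetic dataset with roughly 1% positives, mirroring the low fraud incidence noted above. The choice of scikit-learn, the random forest model, and the `class_weight="balanced"` handling of imbalance are illustrative assumptions, not the specific configuration used in this paper.

```python
# Train a classifier on a heavily imbalanced synthetic dataset and
# report per-class precision/recall, which matter more than accuracy
# when fewer than 1% of cases are fraudulent.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for claims data: ~1% positive (fraud) class.
X, y = make_classification(
    n_samples=20000, n_features=20, weights=[0.99, 0.01], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" reweights the rare fraud class during training.
clf = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
)
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te), digits=3))
```

With a plain accuracy metric, a model that labels everything non-fraud would already score ~99% here, which is why the per-class report is the more meaningful view.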
3.0 Exercise Objectives
Explore various machine learning techniques to improve accuracy of detection in imbalanced samples. The impact of feature engineering, feature selection and parameter tweaking is explored with the objective of achieving superior predictive performance.
As a procedure, the data will be split into three different segments: training, testing and cross-validation. The algorithm will be trained on a partial set of data, and parameters tweaked on a testing set. Performance will then be examined on the cross-validation set. The high-performing models will then be tested on various random splits of the data to ensure consistency of results.
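The split-and-repeat procedure above can be sketched as follows. The 60/20/20 proportions, the logistic regression model, and the AUC metric are illustrative assumptions for the sake of a runnable example.

```python
# Three-way split (train / test / cross-validation) repeated over
# several random seeds to check that results are consistent.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in for an insurance claims dataset.
X, y = make_classification(n_samples=5000, weights=[0.97, 0.03], random_state=0)

def evaluate_split(seed):
    # 60% training, 20% testing (for parameter tweaking), 20% cross-validation.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.4, stratify=y, random_state=seed
    )
    X_test, X_cv, y_test, y_cv = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed
    )
    model = LogisticRegression(max_iter=1000, class_weight="balanced")
    model.fit(X_train, y_train)
    # Score on the held-out cross-validation segment.
    return roc_auc_score(y_cv, model.predict_proba(X_cv)[:, 1])

# Repeat over random splits: a small standard deviation indicates
# the model's performance is consistent, as the procedure requires.
scores = [evaluate_split(seed) for seed in range(5)]
print(np.mean(scores), np.std(scores))
```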
The exercise was conducted on Apollo™ – Wipro’s Anomaly Detection Platform, which applies a combination of pre-defined rules and predictive machine learning algorithms to identify outliers in data. It is built on open-source technology with a library of pre-built algorithms that enable rapid deployment, and can be customized and managed. This Big Data Platform is comprised of three layers, as indicated below.
Three layers of Apollo’s architecture
- Data Cleansing
- Business Rules
- ML Algorithms
- Detailed Reports
- Case Management
The exercise described above was performed on four different insurance datasets. The dataset names cannot be disclosed for reasons of confidentiality.
Data descriptions for the datasets are given below.
4.0 Data Set Description
4.1 Introduction to Datasets