Use of artificial intelligence (AI) techniques in day-to-day business activities is slowly but surely becoming ubiquitous. The capacity of these techniques to build mathematical models of complex, intuitive tasks with impressive accuracy has been reported time and again in the media. It is no wonder then that, as per Adobe1, the proportion of enterprises using AI will double in 2019 compared to 2018.
One of the most common tasks AI is used for is classification: categorizing numbers, images, or text into a set of pre-defined classes. Think of an algorithm distinguishing a cat from a dog in an image, and you have a classifier. There are more useful, non-trivial cases where classifiers are used to identify credit card fraud or to ensure no sensitive information leaks from an organization. In such tasks, the job of the classifier is to identify all cases of potential fraud or data leakage while ensuring that no actual incident slips through.
In AI speak, the task is to ensure the machine catches all true positives, i.e. produces no false negatives. The cost of cases missed by such systems can be significant, with organizations facing regulatory penalties as well as reputational damage. Despite the potential costs involved, it can be extremely difficult to first reduce, and then completely eliminate, false negatives from classification results.
How to reduce false negatives
The right approach to classification helps deal with anomalies in the task. An effective approach uses a cascade of models to selectively reduce false negatives: the initial layer separates the positive and negative classes, while the second layer examines only the predicted negatives to find any positives hidden among them. The steps involved in this classification approach are:
- Filter the output of the primary classifier to retain only the negatives, i.e. valid, normal observations. This eliminates part of the variation in the dataset, potentially leading to simpler models and better learners.
- Generate a new target from the original labels. Here, positives imply the original false negatives while negatives imply the original true negatives.
- Use appropriate sampling techniques to obtain a balanced dataset, as the original is likely to be highly imbalanced. Because the input is itself the output of a classifier, the proportion of positive cases the subsequent algorithm has to learn (the original false negatives) will be extremely low compared to the negatives (the original true negatives). Balancing is required for the algorithm to learn effectively.
- Apply a non-linear transformation to the feature set. The positives labelled in the previous step are the cases from the original dataset that are hardest to classify, so a non-linear transformation can allow better separation of the classes for the subsequent algorithm.
- Dimensionality reduction techniques can also be employed to keep the resultant model simple and avoid complicating the flow further.
- Use a secondary classifier on the balanced dataset to identify the positives (i.e. the original false negatives).
- Use regular model validation techniques to ensure models perform satisfactorily.
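The steps above can be sketched as follows. This is a minimal illustration with synthetic data, not the authors' actual pipeline; the choice of logistic regression as the primary classifier, the naive oversampling, and all hyperparameters are assumptions.

```python
# Sketch of the cascaded-classifier approach: a primary model, then a
# secondary model trained only on the primary's predicted negatives.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

# Primary classifier separates positives from negatives.
primary = LogisticRegression(max_iter=1000).fit(X, y)
pred = primary.predict(X)

# Step 1: keep only the predicted negatives.
neg_mask = pred == 0
X_neg, y_neg = X[neg_mask], y[neg_mask]

# Step 2: new target -- the original false negatives become the positives.
y_new = (y_neg == 1).astype(int)
pos_idx = np.where(y_new == 1)[0]
neg_idx = np.where(y_new == 0)[0]

combined = pred.copy()
if len(pos_idx) and len(neg_idx):
    # Step 3: naive oversampling of the rare positives to balance classes.
    boost = rng.choice(pos_idx, size=len(neg_idx), replace=True)
    X_bal = np.vstack([X_neg[neg_idx], X_neg[boost]])
    y_bal = np.concatenate([np.zeros(len(neg_idx), dtype=int),
                            np.ones(len(boost), dtype=int)])

    # Steps 4-5: transformation / dimensionality reduction (PCA here),
    # then a secondary classifier hunting for the hidden positives.
    pca = PCA(n_components=10).fit(X_bal)
    secondary = SVC(kernel="rbf").fit(pca.transform(X_bal), y_bal)

    # Final decision: positive if either layer flags the observation.
    combined[neg_mask] = secondary.predict(pca.transform(X_neg))

fn_before = ((y == 1) & (pred == 0)).sum()
fn_after = ((y == 1) & (combined == 0)).sum()
```

Because the second layer can only flip predicted negatives to positives, the cascade can never increase the false negative count, only the false positive count.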
Validation of the multi-layered classification approach
In an experiment with four real-world datasets of emails flagged as potentially containing sensitive data, the primary classifier for one of the files performed fairly well, with a false negative rate (FNR) of 1.31%. Due to the criticality of the decision, isolation of the remaining false negatives was required.
The classification approach described above was followed, using principal component analysis (PCA) for transformation and dimensionality reduction, followed by a support vector classifier (SVC) with a radial basis function (RBF) kernel, to identify the false negatives. This performed very well, reducing the FNR to 0.11%.
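The PCA-plus-SVC stage can be expressed as a single scikit-learn pipeline. The component count and the synthetic data below are assumptions for illustration; the source does not state the hyperparameters used.

```python
# Secondary-model stage: PCA for dimensionality reduction feeding an
# RBF-kernel support vector classifier, chained as one estimator.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

secondary = make_pipeline(PCA(n_components=10), SVC(kernel="rbf"))
secondary.fit(X, y)
pred = secondary.predict(X)
```

Wrapping both steps in one pipeline ensures the PCA projection fitted on the training data is reapplied consistently at prediction time.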
The improvement in FNR comes, as usual, at a price: the false positive rate (FPR), which was 2.65% after the primary model, increased to 8% after the secondary model. This FPR was deemed acceptable.
The various metrics for all four files are given in Table 1. They were calculated by predicting labels for the whole dataset and comparing them with the original labels, so they represent the combined performance on the train and test datasets. The mean percentage reduction in FNR achieved by this approach is 78.97%.
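For concreteness, the reported rates follow directly from confusion-matrix counts: FNR = FN / (FN + TP) and FPR = FP / (FP + TN). The counts below are illustrative assumptions, not the actual confusion matrices behind Table 1.

```python
# Compute false-negative and false-positive rates from raw counts.
def rates(tp, fn, fp, tn):
    """Return (false-negative rate, false-positive rate)."""
    fnr = fn / (fn + tp)  # share of actual positives that were missed
    fpr = fp / (fp + tn)  # share of actual negatives wrongly flagged
    return fnr, fpr

# Hypothetical counts: 13 misses out of 1000 positives, 80 false
# alarms out of 1000 negatives.
fnr, fpr = rates(tp=987, fn=13, fp=80, tn=920)
```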