In my last blog, I stressed the importance of understanding and defining the problem before getting on to execution.
The steps that follow 'problem identification' are setting up the infrastructure, getting access to the relevant data for the problem and its scope, and then running the detection models to deliver business outcomes. Let us now delve into these stages briefly:
Step 2 – Data Handling
Although this seems to be an operational step, it requires a fair bit of attention and effort, as the investment made here pays off by avoiding re-work and bottlenecks later in the process. Typically, data could be sourced from anywhere: the ERP, business systems, log data, timesheets, HR records and so on. Some or most of the data will require cleansing, and you might need to identify proxies for missing data. At this point, it is advisable to ensure compliance with privacy regulations, tokenize any sensitive information and verify that you conform to jurisdictional requirements, especially when handling personnel data, health records or financial records. In addition, when working with log data (network, physical and other access logs from multiple locations and servers), consistency across data sets should be maintained. For example, even the date formats should be checked, as inconsistent formats can lead to skewed results.
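To make the date-format and tokenization points concrete, here is a minimal sketch in Python. The source names (`erp`, `badge_log`), their date formats and the helper names are all assumptions made up for illustration, and keyed hashing is only one possible way to tokenize a field; your own privacy requirements may call for something different.

```python
import hashlib
import hmac
from datetime import datetime

# Hypothetical sources and the date format each one is assumed to use.
# Declaring the format per source avoids guessing whether 04/03/2021
# means 4 March or 3 April.
SOURCE_FORMATS = {
    "erp": "%Y-%m-%d %H:%M:%S",
    "badge_log": "%d/%m/%Y %H:%M",
}


def normalize_timestamp(raw, source):
    """Return the timestamp as an ISO 8601 string, or None if it cannot be parsed."""
    fmt = SOURCE_FORMATS.get(source)
    if fmt is None:
        return None
    try:
        return datetime.strptime(raw.strip(), fmt).isoformat()
    except ValueError:
        # Leave unparseable values for manual review instead of guessing.
        return None


def tokenize(value, secret_key):
    """Replace a sensitive value with a keyed hash (HMAC-SHA256).

    The same input and key always give the same token, so records can
    still be joined across data sets without exposing the raw value.
    """
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
```

Records where `normalize_timestamp` returns `None` are worth setting aside and looking at by hand, since a silently mis-parsed date is exactly the kind of inconsistency that skews results later on.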
Step 3 – Model Execution
Execution of detection algorithms is the core of the process. This step flags anomalous records for investigation. To execute this without any hassles, a data scientist should ideally work closely with the business analyst to understand the data problem and iterate between machine learning models so as to arrive at the best performing one. What is critical here is a decision on the desired levels of precision and recall, based on the impact of missing a true anomaly (a false negative) vs. the cost of investigating false positives. We have seen this to be a function of whether the business in question has recently burnt its fingers through an act of fraud. Such a business is usually prepared to live with more false positives and the extra investigation burden, as long as it can be assured of catching as many frauds as possible, for example in areas such as data theft.
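The precision vs. recall decision can be sketched as a simple cost comparison. The snippet below is only an illustration: it assumes the model outputs an anomaly score per record, that a labelled sample is available, and that the business can put rough figures on a missed anomaly (`cost_missed`) and on one wasted investigation (`cost_investigation`). The function names and the sample numbers are made up.

```python
def precision_recall(labels, scores, threshold):
    """Precision, recall and error counts when flagging records with score >= threshold."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if s >= threshold and y == 1)
    fp = sum(1 for y, s in pairs if s >= threshold and y == 0)
    fn = sum(1 for y, s in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall, fp, fn


def pick_threshold(labels, scores, cost_missed, cost_investigation):
    """Return (threshold, total_cost) with the lowest total cost on the sample."""
    best = None
    for t in sorted(set(scores)):
        _, _, fp, fn = precision_recall(labels, scores, t)
        cost = fn * cost_missed + fp * cost_investigation
        if best is None or cost < best[1]:
            best = (t, cost)
    return best


# Made-up labelled sample: 1 = confirmed anomaly, 0 = normal record.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

# A business that has recently been hit by fraud weighs a miss heavily.
cautious = pick_threshold(labels, scores, cost_missed=100, cost_investigation=1)

# A business where investigations are expensive weighs false positives more.
lean = pick_threshold(labels, scores, cost_missed=1, cost_investigation=10)
```

On this sample the cautious setting picks the lower threshold, catching every anomaly at the price of one extra investigation, while the lean setting picks a higher one and accepts a miss. The real cost figures have to come from the business, not from the data scientist.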