The time of artificial intelligence has finally arrived. Although AI emerged as a field of study in the late 1960s, it remained on the margins for decades as a research area with little or no impact on mainstream, real-world problem solving.
The advent of affordable and prevalent high-performance compute hardware, coupled with the parallel processing offered by emerging platforms such as cloud-based infrastructure, has made it possible to deploy AI solutions with performance good enough for real-world applications. Additionally, the commoditization of AI methods and techniques through a range of AI/ML frameworks and libraries has brought machine learning out of research labs and into the mainstream.
The applicability of AI and ML methods to a whole range of real-world problems, including industry, healthcare, and security, ensures that more and more domains will use machine learning for analysis and the prediction of anomalies. New strides are taken every day towards discovering new modeling techniques and improving existing ones. One of the key challenges today is that every AI solution is custom-designed for a specific use case, and additional effort is required to make it reusable for another, similar scenario.
One key factor that is often overlooked while evaluating a problem for an AI application is the source data to be ingested by the model. A good model needs equally good data to generate quality predictions. The questions, then, are what constitutes good data for a problem and how we ensure the goodness of that data.
This paper discusses and illustrates the impact of data on real-world machine learning and analytics use cases. It covers the parameters to consider while selecting and preparing the right data for AI/ML-based solutions, techniques and approaches to make that data more effective, and ways to format it for ingestion.
The first opportunity to ensure the goodness of data ingested by an AI application is at the source of that data. The following are a few key parameters that affect the quality of data at the point of capture:
Point of capture: Data for an AI application may come from a variety of sources, such as live video feeds, pre-recorded media, sensor data, historic data, and equipment feeds. One of the key considerations is the point of capture, i.e., the moment when a sample is taken for analysis. For a vision-based solution, for example, an image captured at the point when all the key features are identifiable within the region of interest helps train a model with much better results. Similarly, for an industrial use case, it can be helpful to place the respective sensor as near to the area of interest as possible. For example, placing a vibration or temperature sensor directly on the machine under investigation makes measurements more accurate and helps reduce effects due to environmental factors.
Method of capture: The method of capture also plays a critical role. For example, temperature variations can be captured via thermocouples, infra-red imaging, and digital temperature sensors. While thermocouples and temperature sensors are best suited to identifying changes in temperature at a specific location, IR imaging helps identify the area affected by the heat. Similarly, defects on a production line can be captured more accurately by images, whereas machine wear can be captured more accurately by physical sensors. The right method of data capture can improve the feature-detection ability of a model and reduce the amount of training required to produce useful inferences.
Noise in capture: Noise in a data stream is usually unavoidable. In a sense, noise is a necessary evil for a good AI model: real-world data is seldom without it, and its presence during training helps a model become more robust and fault-tolerant. That said, unwanted noise can skew a model and reduce the accuracy of its predictions, particularly when the noise is not natural. One such instance is data poisoning, which corrupts the data by introducing spurious variables. For example, ambient variations such as a fan or an additional heat source near the point of capture may affect a temperature sensor; similarly, an insecticide spray may affect a chemical sensor. A minimal sketch of this distinction follows.
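To make the distinction concrete, the sketch below (in Python with NumPy; the signal shape, noise level, and bias window are illustrative assumptions, not measurements) contrasts benign zero-mean sensor jitter, which a model should learn to tolerate, with a poisoning-style sustained bias:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical clean temperature readings (degrees C) from a controlled rig.
clean = 25.0 + 0.5 * np.sin(np.linspace(0, 4 * np.pi, 200))

# Modest zero-mean Gaussian jitter mimics natural sensor noise; training on
# data like this tends to make a model more robust and fault-tolerant.
natural = clean + rng.normal(0.0, 0.2, size=clean.shape)

# A poisoned stream (e.g. a heat source placed near the sensor) looks very
# different: a sustained bias over a window rather than zero-mean jitter.
poisoned = clean.copy()
poisoned[80:120] += 6.0  # sustained 6-degree offset over one window

# Zero-mean jitter barely moves the overall mean; a sustained bias does.
print("natural mean shift: ", abs(natural.mean() - clean.mean()))
print("poisoned mean shift:", abs(poisoned.mean() - clean.mean()))
```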
Another source of noise is sensor glitches, which may introduce outliers into a data stream or produce false readings. Among the most common causes of such variations are sensor quality and environmental conditions (e.g., dust, humidity). While outliers can be discarded and missing values estimated, as sketched below, it is always recommended to train and optimize a model on real-world data rather than data from a closed, clean environment.
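One common way to handle both problems, shown here as a sketch using pandas on a made-up stream (the MAD threshold of 5 is an assumption, not a universal rule), is to flag outliers with a robust statistic and then estimate the resulting gaps by interpolation:

```python
import numpy as np
import pandas as pd

# Hypothetical sensor stream with one glitch spike and one dropped reading.
readings = pd.Series([21.1, 21.3, 21.2, 98.7, 21.4, np.nan, 21.5, 21.3])

# A robust outlier test based on the median absolute deviation (MAD);
# unlike the mean/std, the median is not dragged toward the glitch itself.
median = readings.median()
mad = (readings - median).abs().median()
outliers = (readings - median).abs() > 5 * mad

# Discard outliers, then estimate missing values by linear interpolation.
cleaned = readings.mask(outliers).interpolate(limit_direction="both")
print(cleaned.round(2).tolist())
```

A plain z-score test would actually miss the 98.7 spike here, because with so few samples the spike inflates the standard deviation that is supposed to catch it; this masking effect is why a median-based statistic is used in the sketch.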
Frequency of capture: The frequency of capture refers to the rate at which samples are taken for training and analysis. For example, the sampling rate required to produce accurate predictions for temperature variation in a room is much lower than that required for vibration data from a motor, as the potential for temperature to change value suddenly is extremely limited. A high frequency of capture for a data stream with low variance may produce duplicate and irrelevant data, whereas capture at a lower-than-required frequency may miss critical changes in the data; the sketch below illustrates the former case.
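As a rough illustration, the following sketch (using pandas on a synthetic slow-moving temperature stream; the rates and aggregation window are assumptions) downsamples an oversampled signal to remove near-duplicate readings while preserving the trend:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical room-temperature stream sampled once per second for an hour.
idx = pd.date_range("2024-01-01", periods=3600, freq="s")
temp = pd.Series(22.0 + 0.01 * rng.normal(0, 1, 3600).cumsum(), index=idx)

# At 1 Hz, consecutive readings of a slow-moving signal are near-duplicates;
# resampling to one-minute means keeps the trend at 1/60th of the volume.
per_minute = temp.resample("1min").mean()
print(len(temp), "raw samples ->", len(per_minute), "aggregated samples")
```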
Duration of capture: The duration of a captured sample refers to how long a data source must be monitored to capture an event. For example, if current-leakage data is to be monitored over a number of samples, care must be taken to compensate for sensor glitches so that the validity of a sample can be ensured before it is considered for analysis; methods for “filtering out” such instances are discussed in the subsequent section. In any case, enough samples must be captured to ensure that an event is recorded in its entirety, as in the sketch below.
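The following sketch illustrates one way to apply both ideas; the event duration, sample rate, sensor range, margin factor, and glitch threshold are all illustrative assumptions. It sizes the capture window from the expected event duration and rejects windows with too many glitchy readings:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical current-leakage capture: events last roughly 2 s and the
# sensor samples at 100 Hz; the 1.5x margin on the window is an assumption.
EVENT_DURATION_S = 2.0
SAMPLE_RATE_HZ = 100
WINDOW = int(EVENT_DURATION_S * SAMPLE_RATE_HZ * 1.5)

def window_is_valid(samples, max_glitch_fraction=0.05):
    """Reject a capture window when too many readings look like glitches,
    crudely defined here as non-finite or outside the sensor's range."""
    glitches = ~np.isfinite(samples) | (samples < 0.0) | (samples > 1.0)
    return glitches.mean() <= max_glitch_fraction

stream = rng.normal(0.1, 0.02, WINDOW)
stream[::97] = np.nan  # simulate occasional dropped readings
print("window:", WINDOW, "samples; valid:", window_is_valid(stream))
```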
Enough emphasis cannot be placed on the selection of input data while preparing for a machine learning use case; it can quite literally make or break a model. A model may produce completely different results if careful consideration is not given to selecting the input data. Depending on the problem being tackled, there are a number of aspects to consider:
Feature selection: One of the key challenges of machine learning is dealing with high-dimensional data; the term dimensionality here refers to the number of attributes or features a data set has. It may seem intuitively efficient to capture as many attributes as possible from a data source in order to accurately capture its state. However, this has an interesting side effect: as the number of attributes (the dimensionality) increases, the number of possible distinct tuples grows exponentially, and as a result the coverage provided by any selected sample space of fixed size decreases. This phenomenon is known as the “Curse of Dimensionality”. High dimensionality also leads to higher complexity and resource requirements (e.g., compute, memory, network bandwidth). Figure 1 illustrates how the efficiency of a machine learning model diminishes as the dimensionality of the dataset grows. One common mitigation is sketched below.
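A standard mitigation is to reduce dimensionality before training. The sketch below (using scikit-learn on synthetic data; the dataset and the 95% variance threshold are illustrative assumptions) keeps only the principal components needed to explain most of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)

# Synthetic high-dimensional data: 200 samples with 50 attributes, but only
# a handful of underlying factors actually drive the variation.
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

# Keep just enough principal components to explain 95% of the variance;
# the threshold is an illustrative assumption, not a universal rule.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(f"{X.shape[1]} features reduced to {X_reduced.shape[1]} components")
```

On data like this, PCA recovers roughly the five underlying factors, so a model trains on a far smaller sample space without losing the information that drives the signal.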