With advancements in Artificial Intelligence and Machine Learning techniques, organizations across the globe have realized the potential of AI to enable quick and informed decisions. They are moving towards rapid adoption of these technologies for survival and subsequent favorable outcomes. Due to this growing demand for AI across industries, infrastructure decisions are increasingly being driven by AI workloads.
However, before deciding on an infrastructure solution for AI platforms, enterprises should gain insight into the data lifecycle from an AI model's perspective. Data can be the key challenge in implementing AI and ML.
Here are some key pointers that enterprises need to know about data before defining infrastructure for AI:
- Unstructured data (logs, sensor data, images, and marketing data) as well as semi-structured data needs to be stored, managed, and grown efficiently.
- Data differs in size, type, and format, which makes gathering it from multiple sources challenging.
- Processing large volumes of data through complex ETL logic to arrive at meaningful insights can be difficult.
- Hence, the data lake needs to be format-agnostic and highly scalable, depending on the data stored.
- High-performance computing platforms and engines are needed to train the models, churn through huge volumes of data from the data lake, and then use AI tools to generate key insights.
- High-performance computation is required to train deep neural network models in parallel on large chunks of data.
- Efficient storage along with a well-suited throughput channel is required to send data from storage to the compute engines.
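The ETL processing mentioned above can be sketched as a small extract-transform-load pipeline. This is a minimal illustration only; the log schema and the in-memory "lake" are assumptions, not part of any specific platform.

```python
# Minimal extract-transform-load sketch over semi-structured log records.
# The record fields and the in-memory "lake" (a plain list) are illustrative.
import json

raw_logs = [
    '{"sensor": "temp-01", "value": "21.5", "unit": "C"}',
    '{"sensor": "temp-02", "value": "bad", "unit": "C"}',   # malformed value
    '{"sensor": "temp-01", "value": "22.1", "unit": "C"}',
]

def extract(lines):
    """Parse raw JSON lines, skipping records that fail to parse."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue

def transform(records):
    """Keep only records whose value is numeric, converting the type."""
    for rec in records:
        try:
            rec["value"] = float(rec["value"])
        except ValueError:
            continue
        yield rec

def load(records):
    """Load cleaned records into the 'data lake' (here, just a list)."""
    return list(records)

lake = load(transform(extract(raw_logs)))
```

In a real deployment each stage would be distributed (e.g. across a cluster), but the shape of the logic is the same: drop or repair bad records early so downstream compute only sees usable data.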
Efficient management of data for enabling AI
Data is crucial for modern AI and ML algorithms. Collecting data in raw form, managing these large datasets, and labeling them with related information to help train an AI model are the major data challenges.
A data architecture that describes how data is collected, stored, transformed, distributed and consumed is essential to implement AI solutions.
Datasets have to go through multiple processing stages before being used to train the AI model. Let’s see what happens to data at each of these stages.
Data Collection and Ingestion - This step involves ingesting data from multiple sources into a data lake. The data here is a mix of structured and unstructured data, so the data lake needs to be scalable and agile enough to handle every type of data. The datasets will carry inferences, which must be saved for processing in the subsequent steps.
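The ingestion step can be pictured as landing raw files from each source in a partitioned area of the lake. A minimal sketch, assuming a simple raw/&lt;source&gt;/&lt;date&gt; layout and hypothetical source names:

```python
# Sketch of landing mixed-format source files in a date-partitioned data lake.
# The directory layout and source/file names are illustrative assumptions.
import pathlib
import tempfile
from datetime import date

lake_root = pathlib.Path(tempfile.mkdtemp())  # stand-in for real lake storage

def ingest(source: str, filename: str, payload: bytes) -> pathlib.Path:
    """Store a raw payload under raw/<source>/<YYYY-MM-DD>/<filename>."""
    dest = lake_root / "raw" / source / date.today().isoformat() / filename
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(payload)
    return dest

p1 = ingest("crm", "contacts.csv", b"id,name\n1,Ana\n")
p2 = ingest("sensors", "temp.jsonl", b'{"sensor": "t1", "value": 21.5}\n')
```

Keeping raw data untouched in its landing zone is what lets later stages re-run cleansing or transformation logic without re-collecting from the sources.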
Data Preparation (Clean and Transform) - Data stored in the data lake post ingestion isn't directly suitable for training AI models; it needs cleansing and the required transformations. These transformations are therefore performed in the data lake.
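A typical cleanse-and-transform pass deduplicates records and handles missing values before anything reaches the training platform. A minimal sketch, where the schema (rows with "age" and "income") is an illustrative assumption:

```python
# Sketch of a cleanse-and-transform pass before model training.
# The row schema and imputation choices are illustrative assumptions.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},   # missing age -> imputed with the mean
    {"age": 34, "income": 52000},     # exact duplicate -> dropped
    {"age": 45, "income": None},      # missing income -> filled with 0
]

# Impute missing ages with the mean of the known ages.
known_ages = [r["age"] for r in rows if r["age"] is not None]
mean_age = sum(known_ages) / len(known_ages)

seen, cleaned = set(), []
for r in rows:
    age = r["age"] if r["age"] is not None else mean_age
    income = r["income"] if r["income"] is not None else 0
    key = (age, income)
    if key not in seen:               # drop duplicates after imputation
        seen.add(key)
        cleaned.append({"age": age, "income": income})
```

At scale this logic would run in a distributed engine over the lake rather than in plain Python, but the operations (imputation, type coercion, deduplication) are the same.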
Explore - The next step is to feed these processed datasets from the data lake to the AI/ML tools on the AI platform, which combines high-capacity storage with GPU and CPU servers. Initially, the platform runs tests with a smaller set of data; full-scale model training is then carried out through multiple iterations.
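The explore pattern above, validating the pipeline on a small sample before committing expensive compute to the full dataset, can be sketched as follows. The toy `train` function is a placeholder assumption standing in for a real training call:

```python
# Sketch of the explore step: smoke-test the pipeline on a small random
# sample before full-scale training. The "train" function is a placeholder.
import random

random.seed(0)
full_dataset = [(x, 2 * x + 1) for x in range(10_000)]  # toy (input, label) pairs

def train(dataset, epochs=1):
    """Placeholder for a real training call; returns a trivial 'model'."""
    return {"examples_seen": len(dataset) * epochs}

# First, test with roughly 1% of the data to catch pipeline errors cheaply...
sample = random.sample(full_dataset, k=len(full_dataset) // 100)
assert train(sample)["examples_seen"] == 100

# ...then run full-scale training over multiple iterations (epochs).
model = train(full_dataset, epochs=5)
```

The point of the small-sample run is purely operational: it surfaces data-format and pipeline bugs in seconds instead of after hours of GPU time.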
Training & Evaluation - At this stage, random datasets are taken from the data lake and fed into the AI platform for training and updating the model. The infrastructure requirement at this stage is high-performance storage that can keep pace with the processing speed of the GPUs.
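Drawing random data from the lake for training while holding out an evaluation split might look like the sketch below. The dataset contents and 80/20 split ratio are illustrative assumptions:

```python
# Sketch of shuffling lake data into train/eval splits and batching the
# training split for the GPUs. Dataset and split ratio are illustrative.
import random

random.seed(1)
records = [(x, 2 * x) for x in range(1000)]  # toy (input, label) pairs
random.shuffle(records)

split = int(0.8 * len(records))
train_set, eval_set = records[:split], records[split:]

def batches(dataset, size):
    """Yield fixed-size batches; the last batch may be smaller."""
    for i in range(0, len(dataset), size):
        yield dataset[i:i + size]

n_batches = sum(1 for _ in batches(train_set, 64))
```

This batching loop is exactly where the storage-throughput requirement bites: if the next batch cannot be read from storage as fast as the GPUs finish the current one, the accelerators sit idle.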
Scoring or Prediction - After training and evaluation, AI models are deployed to production as PMML files (Java) or model objects (Python Flask, RServer, R modelObj) for prediction (scoring). The business application needing a prediction invokes these objects in accordance with the algorithm.
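The scoring path, persisting a trained model as an object that a business application later loads and invokes, can be sketched as below. The trivial linear model and the use of pickle as the serialization format are illustrative assumptions:

```python
# Sketch of the scoring path: serialize a trained model object, then load
# and invoke it for prediction. The model and format are illustrative.
import pickle

class Model:
    """Toy trained model; stands in for a real serialized model object."""
    def __init__(self, weight, bias):
        self.weight, self.bias = weight, bias

    def predict(self, x):
        return self.weight * x + self.bias

# Training side: persist the fitted model object.
blob = pickle.dumps(Model(weight=2.0, bias=1.0))

# Serving side: the business application loads the object and scores a request.
model = pickle.loads(blob)
score = model.predict(10)   # -> 21.0
```

In practice the serving side usually sits behind an HTTP endpoint (e.g. a Flask route) so that applications invoke predictions over the network rather than loading model objects directly.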
After data processing and model training are complete, the data becomes redundant or cold and is no longer needed day-to-day. Sometimes, however, the cold data may need to be referred back to for problem-solving. Hence, cold data is mostly stored in a low-cost storage tier.
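Tiering cold data down to low-cost storage is often just an age-based policy. A minimal sketch, where the directory names and the 30-day threshold are illustrative assumptions:

```python
# Sketch of an age-based cold-tiering policy: move files untouched for
# longer than a threshold to cheaper storage. Paths/threshold illustrative.
import os
import pathlib
import shutil
import tempfile
import time

hot = pathlib.Path(tempfile.mkdtemp(prefix="hot_"))    # fast, costly tier
cold = pathlib.Path(tempfile.mkdtemp(prefix="cold_"))  # cheap archive tier

stale = hot / "old_run.parquet"
stale.write_text("archived training data")
# Pretend the file is 90 days old by back-dating its modification time.
old_ts = time.time() - 90 * 24 * 3600
os.utime(stale, (old_ts, old_ts))

def tier_cold(src, dst, max_age_days=30):
    """Move files older than max_age_days from src to dst; return moved paths."""
    cutoff = time.time() - max_age_days * 24 * 3600
    moved = []
    for f in src.iterdir():
        if f.is_file() and f.stat().st_mtime < cutoff:
            moved.append(shutil.move(str(f), str(dst / f.name)))
    return moved

moved = tier_cold(hot, cold)
```

Cloud object stores typically offer this as a built-in lifecycle policy (transitioning objects between storage classes by age), so hand-rolled scripts like this are mainly useful on-premises.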
Infrastructure selection criteria for AI
To make the right choices about compute and storage, it is vital to understand the storage and compute requirements at each stage of the data lifecycle. Figure 1 maps the storage and compute requirements to the phases of the data lifespan; however, this might vary on a case-to-case basis.