Having covered aspects of why and how to bring deep learning (DL) inference into edge devices in Part 1 and the top-seven industries building the next generation of edge devices in Part 2, we now look at architectural aspects of edge devices with DL inference in the third part of this blog series.
Offline training of DL systems is likely to continue to find a home in the cloud, which tends to be built with large numbers of Central Processing Units (CPUs), Graphics Processing Units (GPUs) and/or Field Programmable Gate Arrays (FPGAs), along with specialized artificial intelligence (AI) chipsets. However, as we have already discussed, DL inference makes more sense at the edge. It is predicted that by 2025, the market potential for cloud- and edge-based AI chipsets will reach $14.6 billion and $51.6 billion, respectively. Unlike cloud-based AI chipsets, edge-based AI chipsets must meet many more stringent constraints, including:
- Ability to achieve high performance, handling up to tens of billions of floating-point operations per second (BFLOPS), with minimal die size and power consumption
- Ability to perform with low latency, with response times in the range of a few milliseconds
- Low dynamic power consumption (up to a few watts) and extremely low leakage power consumption (up to a few milliwatts)
- Small on-chip static random-access memory (SRAM), up to one megabit (1 Mb)
- Low off-chip dynamic random-access memory (DRAM) bandwidth, up to a few tens of gigabits per second (Gbps)
- Smaller die size, up to a few square millimetres
- Low cost, preferably below a few tens of US Dollars
In order to realize edge-based DL inference, some necessary requirements for Deep Neural Network (DNN) models include:
- Very efficient DNN models with a minimal number of layers (around a few tens), needing less compute power while still yielding a reasonable mean average precision (mAP)
- Compression and quantization of layer weights and activations so that they can be represented with few bits
- Minimal inference accuracy loss (about 1-2%) with low-precision data and weights, using 1-bit or 8-bit integers (INT1/INT8) instead of 16-bit or 32-bit floating point (FP16/FP32)
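To make the quantization requirement concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. The function names and the sample weights are assumptions made for illustration; real toolchains typically use calibrated, per-channel schemes.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to INT8 codes in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One shared scale maps the largest magnitude onto the code 127.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [c * scale for c in codes]


# Hypothetical FP32 weights of one layer.
weights = [0.5, -1.27, 0.03, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight now occupies 8 bits instead of 32, which is a 4x reduction in storage and memory traffic. The price is a rounding error of at most half a quantization step per value.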
DL inference in edge devices can be realized with several architectural choices, including CPUs, GPUs, FPGAs and custom systems on chip (SoCs). However, no single choice fits all possible application scenarios and areas, so it is worthwhile to consider the trade-offs between them. CPU-based inference is feasible whenever the computational requirements are modest (typically up to a few hundred million floating-point operations per second, MFLOPS). Here, the CPU referred to is a general-purpose, off-the-shelf application processor chip. If the computational requirements are very high (typically up to a few tens of trillions of floating-point operations per second, TFLOPS), the CPU can optionally offload compute-intensive operations to an additional GPU. Here, the GPU referred to is a general-purpose, off-the-shelf graphics chip.
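As a rough illustration of how a workload relates to these figures, the sustained compute requirement is the number of operations per inference multiplied by the inference rate. The sketch below assumes 500 MFLOPS as a stand-in for "a few hundred MFLOPS". The two workloads are hypothetical.

```python
# Assumption: "a few hundred MFLOPS" is taken as 500 MFLOPS for illustration.
CPU_LIMIT_FLOPS = 500e6


def required_flops(flops_per_inference, inferences_per_second):
    """Sustained compute rate (operations per second) a workload demands."""
    return flops_per_inference * inferences_per_second


def cpu_only_is_enough(flops_per_inference, inferences_per_second,
                       cpu_limit=CPU_LIMIT_FLOPS):
    """True if the workload stays within the assumed CPU-only budget."""
    return required_flops(flops_per_inference, inferences_per_second) <= cpu_limit


# Hypothetical workloads: a small audio model and a video object detector.
keyword_spotter = required_flops(5e6, 20)   # 5 MFLOPs per inference, 20 per second
video_detector = required_flops(2e9, 30)    # 2 GFLOPs per frame, 30 frames per second

print(keyword_spotter, cpu_only_is_enough(5e6, 20))
print(video_detector, cpu_only_is_enough(2e9, 30))
```

Under these assumptions the audio model fits a CPU-only design, while the video detector exceeds the budget by two orders of magnitude and would need offloading.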
CPU-based DL inference solutions using off-the-shelf application processor chips can be attractive, particularly where they allow users to reuse already purchased hardware for these newer inference workloads. Although CPU-based inference solutions are attractive for relatively low-performance applications, their power consumption is higher and they won’t scale up in performance, because the configuration of such application processor chips is fixed. In FPGA or custom SoC solutions, the CPU offloads predetermined tasks to the hardware inference engine built into them.
Although very high performance levels can be achieved (up to a few tens of TFLOPS), GPU-based solutions are normally not preferred for DL inference in edge devices because of their very high power consumption and high cost. Furthermore, GPU-based solutions do not typically benefit from low-precision inference, because their architectures are inherently tuned for FP16/FP32 precision. However, in some safety-critical applications such as autonomous cars, the power consumption and cost of GPUs are justified by the higher performance demands.
Like GPUs, FPGA-based solutions are not normally preferred for DL inference workloads on edge devices, because of their high power consumption and very high cost. Compared to GPUs, FPGAs run at lower clock speeds, and many have not yet reached the performance levels of state-of-the-art GPUs. On the other hand, FPGA-based solutions do benefit from low-precision inference and are able to achieve low latency. FPGA-based solutions may be considered when production volumes are very low or for prototyping purposes.
Custom SoC-based solutions are attractive for inference at the edge for several reasons. They achieve the best trade-off between power, performance, and die size. SoCs can run at much higher clock speeds than FPGAs and achieve a 5-10x performance improvement. Since the SoCs are custom-designed for inference applications, their power consumption and die size are lower than those of FPGAs. In addition, SoC-based solutions offer the lowest cost when production volumes are high. Hence, SoC-based DL inference solutions are best suited for applications such as energy, utilities, industrial and surveillance, as discussed in Part 2 of this blog series (which need a few hundred MFLOPS to a few BFLOPS), including battery-operated solutions. Overall, custom SoC-based DL inference solutions are the most preferable in terms of die size, power consumption, and cost.
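The dependence of cost on production volume can be sketched with a simple amortization model: a one-time engineering cost spread over the volume, plus a per-unit cost. All monetary figures below are hypothetical and chosen only to show the shape of the trade-off. They are not market data.

```python
def unit_cost(nre_cost, per_unit_cost, volume):
    """Effective cost per device: amortized one-time cost plus per-unit cost."""
    return nre_cost / volume + per_unit_cost


def break_even_volume(nre_a, unit_a, nre_b, unit_b):
    """Volume at which option B (higher one-time cost, lower unit cost)
    costs the same per device as option A."""
    return (nre_b - nre_a) / (unit_a - unit_b)


# Hypothetical figures, for illustration only.
FPGA_NRE, FPGA_UNIT = 50_000, 200.0    # low up-front cost, expensive parts
SOC_NRE, SOC_UNIT = 5_000_000, 20.0    # high up-front cost, cheap parts

crossover = break_even_volume(FPGA_NRE, FPGA_UNIT, SOC_NRE, SOC_UNIT)
print(crossover)

for volume in (1_000, 1_000_000):
    print(volume,
          unit_cost(FPGA_NRE, FPGA_UNIT, volume),
          unit_cost(SOC_NRE, SOC_UNIT, volume))
```

With these assumed numbers the FPGA is cheaper per device below the crossover volume and the custom SoC is cheaper above it. This matches the qualitative argument that FPGAs suit prototyping and very low volumes, while SoCs win at high volumes.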
Custom SoC-based inference solutions typically include one or more 64-bit CPU cores, a hardware DL inference engine, and peripheral interfaces for connecting various sensors, microphones, speakers, cameras, and displays. It should be noted that there is no single inference hardware engine that can cater to all application areas. An inference engine must be scaled considering the characteristics of the application area.
FPGA-based solutions are completely reconfigurable from a hardware point of view and can be changed in the field. Even though custom SoC-based solutions are not fully hardware-configurable in the way FPGA-based solutions are, this should not be a concern. Care must be taken to design the DL hardware accelerator engine with a generic layer approach and software configurability, so that there are no limitations in mapping DNN models to custom SoCs. The inference hardware acceleration engine should implement the different layer types, including pre-processing, convolution, activation, pooling, softmax, fully connected and post-processing, to which a DNN model can be mapped.
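The generic layer approach can be pictured as a simple mapping check: a model maps to the engine if every one of its layers is of a type the engine implements. The sketch below is a hypothetical illustration. The layer names follow the list above, while the model definitions and the `custom-nms` layer are invented for the example.

```python
# Layer types the accelerator is assumed to implement (taken from the text).
SUPPORTED_LAYER_TYPES = {
    "pre-processing", "convolution", "activation", "pooling",
    "softmax", "fully connected", "post-processing",
}


def unsupported_layers(model_layers):
    """Return (index, layer_type) pairs that the engine cannot map."""
    return [(index, layer_type)
            for index, layer_type in enumerate(model_layers)
            if layer_type not in SUPPORTED_LAYER_TYPES]


# Hypothetical model built only from generic layers.
small_cnn = [
    "pre-processing", "convolution", "activation", "pooling",
    "convolution", "activation", "pooling",
    "fully connected", "softmax", "post-processing",
]

# Same model, but ending in a made-up layer type the engine lacks.
with_custom = small_cnn[:-1] + ["custom-nms"]

print(unsupported_layers(small_cnn))
print(unsupported_layers(with_custom))
```

A layer that fails this check would have to fall back to software on the CPU, which is why a broad and generic layer set matters in a fixed-function design.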
A DNN model is made up of several convolution, activation and pooling layers, among others. Depending on the complexity of the application use case, the total number of layers in a DNN model can vary from a few dozen to a few hundred. DL inference hardware acceleration engines should also support Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) and similar architectures, to cater to imaging, audio and non-imaging applications, and to their fusion. The inference hardware accelerator must satisfy certain constraints, including:
- The design of the multiply-and-accumulate (MAC) units must be tuned to the inference precision levels of the target DNN model
- High performance, in terms of BFLOPS, within a small die size and with low power consumption
- Performance levels should scale up or down by adjusting the clock frequency
- Weights must be loaded fast enough to keep the available MACs well utilized, so that DRAM bandwidth doesn’t become a bottleneck
- Small on-chip SRAM size to minimize die size
- Low off-chip DRAM bandwidth for low cost
- High-throughput internal bus fabric for low latency
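The tension between small SRAM and low DRAM bandwidth can be estimated with simple arithmetic. The sketch below assumes a worst case in which every weight is re-read from DRAM once per inference. The model size and frame rate are hypothetical, and the 1 Mb SRAM budget is taken from the constraints listed earlier.

```python
def weight_bits(num_parameters, bits_per_weight):
    """Total storage needed for the model weights, in bits."""
    return num_parameters * bits_per_weight


def weight_streaming_gbps(num_parameters, bits_per_weight, inferences_per_second):
    """DRAM bandwidth in Gbps if all weights are re-read once per inference."""
    return weight_bits(num_parameters, bits_per_weight) * inferences_per_second / 1e9


SRAM_BITS = 1_000_000   # 1 Mb on-chip budget (treated as 10^6 bits)
PARAMS = 5_000_000      # hypothetical model with 5 million weights
FPS = 30                # hypothetical inference rate

int8_gbps = weight_streaming_gbps(PARAMS, 8, FPS)
fp32_gbps = weight_streaming_gbps(PARAMS, 32, FPS)
fits_on_chip = weight_bits(PARAMS, 8) <= SRAM_BITS

print(int8_gbps, fp32_gbps, fits_on_chip)
```

Under these assumptions the INT8 weights do not fit in on-chip SRAM, so they must be streamed. Quantizing from FP32 to INT8 cuts the required weight bandwidth by a factor of four, which is one reason low-precision inference and the DRAM constraint go together.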
Indicative design attributes of a DL inference hardware accelerator in 16-nanometer and smaller process geometry nodes, catering to various application areas, include:
- Scalable die size: up to 5 mm²
- Power consumption range: up to 1000 mW peak power
- Frequency of operation: above 1 GHz
- Performance: up to a few tens of BFLOPS of compute power
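These attributes are linked by simple arithmetic: peak throughput is the number of MAC units times the clock frequency times the operations counted per MAC. The sketch below assumes the common convention of counting one MAC as two operations (a multiply and an add). The 20 BFLOPS target and the 50% utilization figure are hypothetical values inside the ranges above.

```python
import math


def peak_ops_per_second(num_macs, clock_hz, ops_per_mac=2):
    """Peak throughput, assuming every MAC unit completes one MAC per cycle."""
    return num_macs * clock_hz * ops_per_mac


def macs_needed(target_ops_per_second, clock_hz, utilization=1.0, ops_per_mac=2):
    """Number of MAC units needed to sustain a target throughput."""
    return math.ceil(target_ops_per_second / (clock_hz * ops_per_mac * utilization))


def efficiency_gops_per_watt(ops_per_second, power_watts):
    """Throughput per watt, in billions of operations per second per watt."""
    return ops_per_second / 1e9 / power_watts


TARGET = 20e9   # hypothetical target: 20 billion operations per second
CLOCK = 1e9     # 1 GHz
POWER = 1.0     # 1000 mW peak power

ideal_macs = macs_needed(TARGET, CLOCK)
realistic_macs = macs_needed(TARGET, CLOCK, utilization=0.5)
efficiency = efficiency_gops_per_watt(TARGET, POWER)

print(ideal_macs, realistic_macs, efficiency)
```

Because peak throughput is linear in clock frequency, the same MAC array can be scaled up or down in performance by adjusting the clock, at a corresponding change in dynamic power.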
A custom SoC solution that combines a CPU with a hardware accelerator is best suited to DL inference in most edge device applications. If needed, GPU cores can additionally be deployed in very high-performance, safety-critical edge device SoC solutions, such as those for autonomous cars. The critical success factors for inference in edge devices include:
- Whether domain experts in the respective market segments can train the DNN models
- The training dataset, and the simplicity, accuracy and corner-case coverage of the DNN model
- How well the DNN models map to the hardware accelerator
- Whether a high mAP can be achieved
To reap the benefits of DL inference SoCs, investment must go beyond building just the chipsets. A very efficient and intelligent software layer that runs on top of these chipsets is essential; without it, these inference SoCs are not usable. In the near future we will see inference start to move from the cloud to edge devices, and in light of the above architectural aspects and trade-offs, vertically integrated inference solutions will dictate their future success.