Having covered aspects of why and how to bring deep learning (DL) inference into edge devices in Part 1 and the top-seven industries building the next generation of edge devices in Part 2, we now look at architectural aspects of edge devices with DL inference in the third part of this blog series.
Offline training of DL systems is likely to continue to find a home in the cloud, which tends to be built with large numbers of Central Processing Units (CPUs), Graphics Processing Units (GPUs) and/or Field Programmable Gate Arrays (FPGAs), along with specialized artificial intelligence (AI) chipsets. However, as we have already discussed, DL inference makes more sense at the edge. It is predicted that by 2025, the market potential for cloud- and edge-based AI chipsets will reach $14.6 billion and $51.6 billion, respectively. Unlike cloud-based AI chipsets, edge-based AI chipsets must meet far more stringent constraints, including:
Realizing edge-based DL inference imposes several requirements on Deep Neural Network (DNN) models, including:
DL inference in edge devices can be realized with architectural choices including CPUs, GPUs, FPGAs and custom systems on chips (SoCs). However, no single choice fits all possible application scenarios and areas, so it is worthwhile to consider the trade-offs between the different architectures. CPU-based inference is quite feasible whenever the computational requirements are modest (typically up to a few hundred millions of floating-point operations per second, MFLOPS). Here, the CPU referred to is a general-purpose, off-the-shelf application processor chip. If computational requirements are much higher (typically up to a few tens of trillions of floating-point operations per second, TFLOPS), the CPU can offload compute-intensive operations to an additional GPU. Here, the GPU referred to is a general-purpose, off-the-shelf graphics chip.
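The compute thresholds just mentioned can be expressed as a simple device-selection heuristic. The following is an illustrative Python sketch; the budget constants and function name are assumptions drawn from the rough figures above, not product specifications.

```python
# Illustrative sketch: picking an inference target from a per-workload
# compute requirement. Thresholds are assumptions based on the rough
# figures in the text, not measured device limits.

CPU_BUDGET_FLOPS = 500e6    # ~a few hundred MFLOPS: CPU-only is viable
GPU_BUDGET_FLOPS = 20e12    # ~a few tens of TFLOPS: CPU offloads to a GPU

def select_inference_target(required_flops: float) -> str:
    """Pick a deployment target for a given sustained FLOPS requirement."""
    if required_flops <= CPU_BUDGET_FLOPS:
        return "cpu"                 # off-the-shelf application processor
    if required_flops <= GPU_BUDGET_FLOPS:
        return "cpu+gpu"             # CPU offloads compute-intense operations
    return "custom-accelerator"      # beyond general-purpose budgets

print(select_inference_target(200e6))   # e.g. a lightweight audio model
print(select_inference_target(5e12))    # e.g. a heavy vision pipeline
```

In a real system the decision would also weigh latency, power, and cost, as the following paragraphs discuss; raw FLOPS is only the first-order filter.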
CPU-based DL inference solutions using off-the-shelf application processor chips can be attractive, particularly where they allow users to reuse already-purchased hardware for these newer inference workloads. Though attractive for relatively low-performance applications, CPU-based solutions consume more power and will not scale up in performance, owing to the fixed configuration of such application processor chips. In FPGA or custom SoC solutions, by contrast, the CPU offloads predetermined tasks to the hardware inference engine built into them.
Though very high performance levels can be achieved (up to a few tens of TFLOPS), GPU-based solutions are normally not preferred for DL inference in edge devices due to their very high power consumption and cost. Furthermore, GPU-based solutions typically do not benefit from low-precision inference, as their architectures are inherently tuned for handling FP16/32 precision. However, in some safety-critical applications, such as autonomous cars, GPU power consumption and cost are justified by the higher performance demands.
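To make the low-precision point concrete, here is a minimal sketch of symmetric INT8 quantization, the kind of low-precision inference that FP16/32-tuned GPU architectures do not exploit well. The function names and the single-scale scheme are illustrative assumptions, not a specific toolchain's API.

```python
# Illustrative sketch of symmetric INT8 quantization: each FP32 weight is
# mapped to one byte via a single scale factor, trading a small amount of
# accuracy for 4x smaller weights and cheaper integer arithmetic.

def quantize_int8(weights):
    """Map float weights to int8 values with one symmetric scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

w = [0.42, -1.27, 0.05, 0.9]
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
# Each weight now occupies 1 byte instead of 4, at a small rounding cost.
```

Accelerators with native INT8 datapaths (common in the FPGA and custom SoC solutions discussed next) can turn this byte-level representation directly into lower power and higher throughput.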
Like GPU-based solutions, FPGA-based solutions are not normally preferred for DL inference workloads on edge devices, due to high power consumption and very high cost. Compared to GPUs, FPGAs run at lower clock speeds and have not yet reached the performance levels of state-of-the-art GPUs. On the other hand, FPGA-based solutions do benefit from low-precision inference and are able to achieve low latency. They may be considered when production volumes are very low or for prototyping purposes.
Custom SoC-based solutions are attractive for inference at the edge for several reasons. They achieve the best trade-off between power, performance, and die size. SoCs can run at much higher clock speeds than FPGAs, achieving a 5-10x performance improvement. Since SoCs are custom-designed for inference applications, their power consumption and die size are lower than those of FPGAs. In addition, SoC-based solutions have the lowest cost when production volumes are high. Hence, SoC-based DL inference solutions, including battery-operated ones, are best suited for applications such as energy, utilities, industrial and surveillance, as discussed in Part 2 of this blog series (which need a few hundred MFLOPS to a few GFLOPS). Overall, custom SoC-based DL inference solutions are the most preferable in terms of die size, power consumption, and cost.
Custom SoC-based inference solutions typically include one or more 64-bit CPU cores, a hardware DL inference engine, and peripheral interfaces for connecting various sensors, microphones, speakers, cameras, and displays. It should be noted that no single inference hardware engine can cater to all application areas; an inference engine must be scaled to the characteristics of its target application area.
FPGA-based solutions are completely reconfigurable from a hardware point of view and can be changed in the field. Even though custom SoC-based solutions are not fully hardware-configurable (unlike FPGA-based solutions), this should not be a concern. Care must be exercised in designing the DL hardware accelerator engine so that it adopts a generic-layer approach and software configurability, leaving no limitations in mapping DNN models to custom SoCs. The inference hardware acceleration engine should include different layers, including pre-processing, convolution, activation, pooling, softmax, fully connected and post-processing, to which a DNN model can be mapped.
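The generic-layer approach above can be sketched as a software-configurable mapping check: a model described as a sequence of layer records either maps cleanly onto the engine's layer primitives or is rejected. The layer vocabulary below mirrors the engine layers named in the text; the record fields and function names are hypothetical.

```python
# Illustrative sketch: validating that a DNN model, described generically,
# maps onto a fixed set of accelerator layer primitives. The primitive set
# follows the engine layers listed in the text; field names are assumptions.

SUPPORTED_LAYERS = {
    "pre_processing", "convolution", "activation", "pooling",
    "softmax", "fully_connected", "post_processing",
}

def validate_mapping(model):
    """Check every layer in a model maps onto a supported engine primitive."""
    unsupported = [l["type"] for l in model if l["type"] not in SUPPORTED_LAYERS]
    if unsupported:
        raise ValueError(f"cannot map layers: {unsupported}")
    return True

tiny_classifier = [
    {"type": "pre_processing"},
    {"type": "convolution", "kernel": 3, "out_channels": 16},
    {"type": "activation", "fn": "relu"},
    {"type": "pooling", "kind": "max", "stride": 2},
    {"type": "fully_connected", "units": 10},
    {"type": "softmax"},
    {"type": "post_processing"},
]
validate_mapping(tiny_classifier)   # raises ValueError if unmappable
```

Because the mapping lives in software, new model topologies can be deployed on an already-fabricated SoC, which is why the lack of FPGA-style hardware reconfigurability need not be a concern.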
Any DNN model is made up of several convolution, activation and pooling layers, among others. Based on the complexity of the application use case, the total number of layers in a DNN model can vary from a few dozen to a few hundred. DL inference hardware acceleration engines should also support CNNs (Convolutional Neural Networks), RNNs (Recurrent Neural Networks) and similar architectures, to cater to imaging, audio and non-imaging applications, as well as their fusion. The inference hardware accelerator must fulfill certain constraints, including:
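Since convolution layers dominate a DNN's compute budget, a quick per-layer estimate shows how the MFLOPS-to-GFLOPS figures discussed earlier arise. This is a standard back-of-the-envelope calculation (counting one multiply-accumulate, or MAC, as 2 FLOPs); the example dimensions are arbitrary.

```python
# Illustrative sketch: estimating the arithmetic cost of one convolution
# layer. Each output element needs c_in * k * k multiply-accumulates (MACs),
# and 1 MAC is conventionally counted as 2 FLOPs. Bias and padding ignored.

def conv2d_flops(h_out, w_out, c_in, c_out, k):
    """FLOPs for one 2-D convolution layer with a k x k kernel."""
    macs = h_out * w_out * c_out * (c_in * k * k)
    return 2 * macs

# A 3x3 convolution producing a 112x112x64 output from 32 input channels:
flops = conv2d_flops(112, 112, 32, 64, 3)
print(f"{flops / 1e6:.1f} MFLOPs for this single layer")
```

Summing such estimates over the few dozen to few hundred layers of a model (and multiplying by the target frame rate) gives the sustained throughput the accelerator must be scaled to deliver.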
Indicative design attributes of a DL inference hardware accelerator in 16-nanometer and smaller process geometry nodes, catering to various application areas, include:
A custom SoC solution combining CPU cores with a hardware DL inference accelerator is best suited for most edge device applications. Further, if needed, GPU cores can be deployed in very high-performance, safety-critical edge device SoC solutions, such as autonomous cars. The critical success factors for inference in edge devices include:
To reap the benefits of DL inference SoCs, investment must go beyond building the chipsets themselves. A very efficient and intelligent software layer that runs on top of these chipsets is essential; without it, these inference SoCs are not usable. In the near future we will see inference start to drift from the cloud to edge devices, and considering the architectural aspects and trade-offs above, vertically-integrated inference solutions will dictate their future success.
Dr. Vijay Kumar K
Chief Architect and Distinguished Member of Technical Staff - VLSI Technology Practice Group of Product Engineering Services, Wipro
Vijay has been with Wipro for about 19 years and has been in the VLSI industry for more than 24 years. He has architected and designed several complex, cutting-edge SoCs, ASICs, FPGAs and systems in various application areas for top-notch semiconductor companies globally. He also specializes in the video domain and has created several solutions around video compression, post-processing, etc. He has been granted 14 US patents so far. He is currently working on next-generation architectures of semiconductor devices, including edge devices with machine learning.