How to rapidly design and deploy smart machine vision systems
The Kria K26 platform provides production-ready SOMs designed to be plugged into a carrier card with solution-specific peripherals.

It begins with data type optimization

The needs of deep learning algorithms are evolving, and not every application needs high-precision calculations: lower-precision data types such as INT8, or custom data formats, are increasingly used. GPU-based systems can be challenged when architectures optimized for high-precision data must be adapted to handle lower-precision data formats efficiently. The Kria K26 SOM is reconfigurable, enabling it to support a wide range of data types, from FP32 to INT8 and others. Reconfigurability also results in lower overall energy consumption; for example, operations optimized for INT8 consume an order of magnitude less energy than FP32 operations (Figure 1).
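To make the data type point concrete, the sketch below shows symmetric post-training quantization of FP32 weights to INT8 in NumPy. It is a minimal illustration, not the actual Vitis AI quantizer flow; the tensor, names, and per-tensor scale scheme are assumptions for the example.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization of FP32 values to INT8."""
    scale = np.abs(x).max() / 127.0  # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 values from the INT8 codes."""
    return q.astype(np.float32) * scale

# Illustrative weight tensor; a real flow quantizes trained weights and
# activations, using calibration data to pick the scales.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.2, size=(64, 64)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# INT8 needs a quarter of the storage of FP32, and INT8 multiply-accumulate
# operations cost far less energy -- the gap Figure 1 quantifies in hardware.
print(f"scale = {scale:.6f}, max abs error = {np.abs(w - w_hat).max():.6f}")
```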
Optimal architecture for minimal power consumption

Designs implemented on a multicore GPU or CPU architecture can be power-hungry because of their typical power usage patterns:

■ 30% for the cores
■ 30% for the internal memory (L1, L2, L3)
■ 40% for the external memory (such as DDR)

GPUs need frequent accesses to relatively inefficient external DDR memory to support programmability, and those accesses can become a bottleneck under high-bandwidth computing demands. The Zynq MPSoC architecture used in the Kria K26 SOM supports the development of applications with little or no access to external memory. In a typical automotive application, for example, communication between the GPU and the various modules requires multiple accesses to external DDR memory, while the Zynq MPSoC-based solution incorporates a pipeline designed to avoid most DDR accesses (Figure 2).

Figure 2: In this typical automotive application, the GPU requires multiple accesses to DDR for communication between the various modules (left), while the pipeline architecture of the Zynq MPSoC (right) avoids most DDR accesses. Image source: AMD Xilinx
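The difference can be pictured with a purely software analogy (a sketch, not Zynq MPSoC code): in the first pipeline below, each stage materializes a full intermediate frame before the next stage reads it back, standing in for a DDR round trip per stage; in the second, the stages are chained lazily so each row streams straight through, standing in for an on-chip dataflow pipeline.

```python
from typing import Iterator, List

Row = List[int]  # one line of pixels (illustrative)

def buffered_pipeline(frame: List[Row]) -> List[Row]:
    """Each stage writes a full intermediate frame before the next stage
    reads it back -- the software analog of a DDR round trip per stage."""
    stage1 = [[p + 1 for p in row] for row in frame]   # full frame buffered
    stage2 = [[p * 2 for p in row] for row in stage1]  # full frame buffered
    return stage2

def streaming_pipeline(rows: Iterator[Row]) -> Iterator[Row]:
    """Stages are chained lazily, so each row flows straight through --
    the software analog of an on-chip dataflow pipeline."""
    stage1 = ([p + 1 for p in row] for row in rows)
    stage2 = ([p * 2 for p in row] for row in stage1)
    return stage2

frame = [[0, 1, 2], [3, 4, 5]]
# Same result; only the intermediate storage pattern differs.
assert buffered_pipeline(frame) == list(streaming_pipeline(iter(frame)))
```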
Pruning leverages the advantages

The performance of neural networks on the K26 SOM can be enhanced using an AI optimization tool that enables data optimization and pruning. Neural networks are commonly over-parameterized, leaving high levels of redundancy that can be removed through pruning and model compression. Using Xilinx's AI Optimizer can result in a 50X reduction in model complexity with only a nominal impact on model accuracy. For example, a single-shot detector (SSD) plus VGG convolutional neural network (CNN) architecture requiring 117 giga operations (Gops) was refined over 11 iterations of pruning with the AI Optimizer. Before optimization, the model ran at 18 frames per second (FPS) on a Zynq UltraScale+ MPSoC. After 11 iterations (the 12th run of the model), complexity had dropped from 117 Gops to 11.6 Gops (10X), performance had risen from 18 to 103 FPS (over 5X), and accuracy had fallen from a mean average precision (mAP) of 61.55 for object detection to 60.4 mAP, only about 1% lower (Figure 3).

Figure 3: After relatively few iterations, pruning can reduce model complexity (Gops) by 10X and improve performance (FPS) by 5X, with only a 1% reduction in accuracy (mAP). Image source: AMD Xilinx
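The AI Optimizer itself is proprietary, but the core idea can be sketched with simple magnitude pruning: rank the weights by absolute value and zero out the smallest. Production tools typically prune whole channels (structured pruning), so the savings appear as fewer operations rather than scattered zeros, and they fine-tune between passes to preserve accuracy; the code and numbers below are illustrative only.

```python
import numpy as np

def magnitude_prune(w: np.ndarray, fraction: float) -> np.ndarray:
    """Zero out the given fraction of weights with the smallest magnitudes."""
    k = int(w.size * fraction)
    threshold = np.sort(np.abs(w), axis=None)[k]
    return np.where(np.abs(w) < threshold, 0.0, w)

rng = np.random.default_rng(1)
w = rng.normal(0.0, 0.1, size=(256, 256)).astype(np.float32)

# The SSD+VGG example cut complexity from 117 Gops to 11.6 Gops, i.e.
# roughly 90% of the operations were pruned away over 11 passes, with
# fine-tuning between passes to hold accuracy.
pruned = magnitude_prune(w, 0.90)
print(f"sparsity: {(pruned == 0.0).mean():.1%}")   # ~90% of weights zeroed
print(f"complexity reduction: {117 / 11.6:.1f}X")  # ~10X, matching Figure 3
```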
Real-world application example

A machine learning application for automobile license plate detection and recognition, also called automatic number plate recognition (ANPR), was developed based on vision analytics software from Uncanny Vision. ANPR is used in automated toll systems, highway monitoring, secure gate and parking access, and other applications. This ANPR application includes an AI-based pipeline that decodes the video and pre-processes the image, followed by ML detection and optical character recognition (OCR) of the plate characters (Figure 4).

Figure 4: Typical image processing flow for an AI-based ANPR application. Image source: AMD Xilinx
Implementing ANPR requires one or more H.264 or H.265 encoded real-time streaming protocol (RTSP) feeds that must first be decoded (uncompressed). The decoded video frames are scaled, cropped, color space converted, and normalized (pre-processed), then sent to the ML detection algorithm. High-performance ANPR implementations require a multi-stage AI pipeline: the first stage detects and localizes the vehicle in the image, creating a region of interest (ROI). At the same time, other algorithms optimize the image quality for subsequent use by the OCR algorithm and track the vehicle's motion across multiple frames. The vehicle ROI is then cropped further to generate the number plate ROI, which the OCR algorithm processes to determine the characters on the plate. Compared with other commercial SOMs based on GPUs or CPUs, Uncanny Vision's ANPR application ran 2-3X faster on the Kria KV260 SOM, at a cost of less than $100 per RTSP feed.
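Uncanny Vision's implementation is proprietary, but the stages just described map naturally onto a capture-and-process loop. The skeleton below sketches that loop with OpenCV handling RTSP decode; detect_vehicle, locate_plate, and read_plate are hypothetical stubs standing in for the trained detection and OCR models, and vehicle boxes are assumed to arrive as normalized corner coordinates. Frame-to-frame vehicle tracking and image-quality optimization are omitted for brevity.

```python
import cv2
import numpy as np

# Hypothetical model stubs -- a real system loads trained networks here.
def detect_vehicle(inp: np.ndarray):
    """Stage 1: return a vehicle box as normalized (x0, y0, x1, y1), or None."""
    return None  # stub

def locate_plate(vehicle_roi: np.ndarray):
    """Stage 2: return the number plate crop from the vehicle ROI, or None."""
    return None  # stub

def read_plate(plate_roi: np.ndarray) -> str:
    """Stage 3: OCR the plate characters."""
    return ""  # stub

def preprocess(frame: np.ndarray, size=(300, 300)) -> np.ndarray:
    """Scale, color space convert, and normalize a decoded frame."""
    resized = cv2.resize(frame, size)
    rgb = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB)
    return rgb.astype(np.float32) / 255.0

def run_anpr(rtsp_url: str) -> None:
    cap = cv2.VideoCapture(rtsp_url)  # decodes the H.264/H.265 RTSP feed
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        box = detect_vehicle(preprocess(frame))
        if box is None:
            continue  # no vehicle in this frame
        h, w = frame.shape[:2]
        x0, y0, x1, y1 = box
        vehicle = frame[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)]
        plate = locate_plate(vehicle)  # crop further to the plate ROI
        if plate is not None:
            print(read_plate(plate))   # recognized characters

    cap.release()

# Example: run_anpr("rtsp://camera.local/stream1")  # illustrative URL
```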
Figure 5: The Kria KV260 vision AI starter kit is a comprehensive development environment for machine vision applications. Image source: AMD Xilinx