The co-processor architecture: an embedded system architecture for rapid prototyping
and coworkers, to obtain the Xilinx Alliance Partner certification for 2020-2021. The following tools and devices were used in this effort:
■ Vivado HLS v2019
■ Assessment and simulation device: xczu7ev-ffvc1156-2-e

Beginning with the C-based implementation, the DCT algorithm accepts two arrays of 16-bit numbers: array ‘a’ is the input array to the DCT, and array ‘b’ is the output array from the DCT. The data width (DW) is therefore defined as 16, and the number of elements within the arrays (N) is 1024/DW, or 64. Lastly, the size of the DCT matrix (DCT_SIZE) is set to 8, meaning that an 8 x 8 matrix is used.
acceleration, loop unrolling, and other techniques are readily available. Once the DCT code was created as a project within the Vivado HLS tool, the next step was to synthesize the design for FPGA implementation. It is at this step that some of the most impactful benefits of moving an algorithm’s execution from an MCU to an FPGA become apparent; as a reference, this step is equivalent to the System Management with the Microcontroller milestone discussed above. Modern FPGA tools offer a suite of optimizations and enhancements that greatly improve the performance of complex algorithms. Before analyzing the results, there are some important terms to keep in mind:
■ Latency – the number of clock cycles required to execute all iterations of the loop [10]
■ Interval – the number of clock cycles before the next iteration of a loop starts to process data [11]
■ BRAM – Block Random Access Memory
■ DSP48E – Digital Signal Processing slice for the UltraScale architecture
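As a rough illustration of how latency and interval interact (the helper names and any example numbers are hypothetical, not the article’s measurements): a pipelined loop that accepts new data every interval (II) cycles finishes in about latency + (trips − 1) × II cycles, instead of trips × latency when iterations run strictly back to back.

```c
#include <assert.h>

/* Back-of-envelope cycle estimates for a hardware loop;
   any concrete numbers used with these helpers are hypothetical. */
static int unpipelined_cycles(int trips, int iter_latency)
{
    return trips * iter_latency;                  /* iterations run back to back */
}

static int pipelined_cycles(int trips, int iter_latency, int interval)
{
    return iter_latency + (trips - 1) * interval; /* new iteration every II cycles */
}
```

For example, 64 iterations of 8 cycles each take 512 cycles unpipelined, but only 8 + 63 × 1 = 71 cycles when pipelined with an interval of 1.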
Should parts become obsolete, or optimizations be required, the same architecture can accommodate these changes. New MCUs and new FPGAs can be fitted into the design while the interfaces remain relatively untouched. Additionally, since both the MCU and FPGA are field-updatable, user-specific changes and optimizations can be applied in the field and remotely. In closing, this architecture blends the development speed and availability of an MCU with the performance and expandability of an FPGA. With optimizations and performance enhancements available at every development step, the co-processor architecture can meet the needs of even the most challenging requirements, both for today’s designs and beyond.
Array partition
This directive partitions the arrays into their individual elements, flattening all memory accesses to single elements within these arrays. Doing so trades block RAM for registers and look-up tables, but again the execution time of this algorithm is cut in half.

Dataflow
This directive allows the designer to specify a target number of clock cycles between each of the input reads. It is supported only for top-level functions, and only loops and functions exposed at this level benefit from it.

Inline
The INLINE directive flattens all loops, both inner and outer. Both row and column processes can now execute concurrently. The number of required clock cycles is kept to a minimum, even though this consumes more FPGA resources.
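In Vivado HLS these directives can be embedded directly in the C source as pragmas. The sketch below shows the pragma placement on a trivial stand-in function; the loop labels, array names, and the function itself are illustrative assumptions, not the article’s DCT source. Outside the HLS tool the pragmas are ignored, so the code still compiles and runs as ordinary C:

```c
#define DCT_SIZE 8

/* Illustrative stand-in for an HLS top-level function: doubles each
   element of an 8 x 8 block. The pragmas show directive placement only. */
void scale_block(short in[DCT_SIZE][DCT_SIZE],
                 short out[DCT_SIZE][DCT_SIZE])
{
#pragma HLS INLINE                 /* flatten this function into its caller */
#pragma HLS ARRAY_PARTITION variable=in complete dim=2
                                   /* expose each row element individually  */
row_loop:
    for (int r = 0; r < DCT_SIZE; r++) {
#pragma HLS PIPELINE II=1          /* pipeline; inner-loop operations overlap */
col_loop:
        for (int c = 0; c < DCT_SIZE; c++)
            out[r][c] = (short)(in[r][c] * 2);
    }
}
```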
■ FF – Flip-flop
■ LUT – Look-up Table
■ URAM – UltraRAM, a dense memory block available in UltraScale+ devices
Default
The default optimization setting comes from the unaltered result of translating the C-based algorithm to synthesizable HDL. No optimizations are enabled, and this can be used as a performance reference to better understand the other optimizations.

Conclusion
The co-processor hardware architecture provides the embedded designer with a high-performance platform that maintains its design flexibility throughout development and past product release. By first validating algorithms in C or C++, processes, data and signal paths, and critical functionality can be verified in a relatively short amount of time. Then, by translating the processor-intensive algorithms into the co-processor FPGA, the designer can enjoy the benefits of hardware acceleration and a more modular design.
Pipeline inner loop
The PIPELINE directive instructs Vivado HLS to unroll the inner loops so that new data can start being processed while existing data is still in the pipeline. Thus, new data does not have to wait for the existing data to be complete before processing can begin.
Following the premise of this article, the C-based algorithm implementation allows the designer to quickly develop and validate the algorithm’s functionality. Although execution time is an important consideration, this validation weights functionality more heavily than execution time. This weighting is acceptable, since the ultimate implementation of this algorithm will be in an FPGA, where hardware
Pipeline outer loop
By applying the PIPELINE directive to the outer loop, the outer loop’s operations are now pipelined, and the inner loops’ operations occur concurrently. Both the latency and the interval time are cut in half by applying this directive to the outer loop.

Table 2: FPGA algorithm execution optimization findings (resource utilization).

Solution                          BRAM_18K  DSP48E  FF    LUT   URAM
Default (solution)                5         1       246   964   0
Pipeline inner loop (solution 2)  5         1       223   1211  0
Pipeline outer loop (solution 3)  5         8       516   1356  0
Array partition (solution 4)      3         8       862   1879  0
Dataflow (solution 5)             3         8       868   1654  0
Inline (solution 6)               3         16      1086  1462  0