Many deep convolutional neural network (CNN) inference accelerators on the field-programmable gate array (FPGA) platform have been widely adopted due to their low power consumption and high performance. In this paper, we develop the following to improve performance and power efficiency. First, we use a high bandwidth memory (HBM) to expand the bandwidth of data transmission between the off-chip memory and the accelerator. Second, a fully-pipelined manner, which consists of pipelined inter-layer computation and a pipelined computation engine, is implemented to decrease idle time among layers. Third, a multi-core architecture with shared-dual buffers is designed to reduce off-chip memory access and maximize the throughput. We designed the proposed accelerator on the Xilinx Alveo U280 platform with in-depth Verilog HDL instead of high-level synthesis as the previous works and explored the VGG-16 model to verify the system during our experiment. With a similar accelerator architecture, the experimental results demonstrate that the memory bandwidth of HBM is 13.2× better than DDR4. Compared with other accelerators in terms of throughput, our accelerator is 1.9×/1.65×/11.9× better than FPGA+HBM2 based/low batch size (4) GPGPU/low batch size (4) CPU. Compared with the previous DDR+FPGA/DDR+GPGPU/DDR+CPU based accelerators in terms of power efficiency, our proposed system provides 1.4-1.7×/1.7-12.6×/6.6-37.1× improvement with the large-scale CNN model.
Van-Cam NGUYEN
Nara Institute of Science and Technology
Yasuhiko NAKASHIMA
Nara Institute of Science and Technology
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Van-Cam NGUYEN, Yasuhiko NAKASHIMA, "Implementation of Fully-Pipelined CNN Inference Accelerator on FPGA and HBM2 Platform" in IEICE TRANSACTIONS on Information and Systems,
vol. E106-D, no. 6, pp. 1117-1129, June 2023, doi: 10.1587/transinf.2022EDP7155.
Abstract: Many deep convolutional neural network (CNN) inference accelerators on the field-programmable gate array (FPGA) platform have been widely adopted due to their low power consumption and high performance. In this paper, we develop the following to improve performance and power efficiency. First, we use a high bandwidth memory (HBM) to expand the bandwidth of data transmission between the off-chip memory and the accelerator. Second, a fully-pipelined manner, which consists of pipelined inter-layer computation and a pipelined computation engine, is implemented to decrease idle time among layers. Third, a multi-core architecture with shared-dual buffers is designed to reduce off-chip memory access and maximize the throughput. We designed the proposed accelerator on the Xilinx Alveo U280 platform with in-depth Verilog HDL instead of high-level synthesis as the previous works and explored the VGG-16 model to verify the system during our experiment. With a similar accelerator architecture, the experimental results demonstrate that the memory bandwidth of HBM is 13.2× better than DDR4. Compared with other accelerators in terms of throughput, our accelerator is 1.9×/1.65×/11.9× better than FPGA+HBM2 based/low batch size (4) GPGPU/low batch size (4) CPU. Compared with the previous DDR+FPGA/DDR+GPGPU/DDR+CPU based accelerators in terms of power efficiency, our proposed system provides 1.4-1.7×/1.7-12.6×/6.6-37.1× improvement with the large-scale CNN model.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2022EDP7155/_p
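The abstract describes two architectural ideas: pipelined inter-layer computation and shared-dual (ping-pong) buffers, so that one layer fills a buffer while the next layer drains the other. The snippet below is a minimal conceptual sketch of that buffering schedule, written in Python purely for illustration; the names (Tile, produce, consume, run_pipeline) are assumptions of this sketch, and the paper's actual design is a Verilog HDL implementation on the Alveo U280, not this code.

```python
# Conceptual sketch only (not the authors' Verilog design): two adjacent CNN
# layers overlap in time by ping-ponging between a pair of shared buffers,
# so the consumer layer starts as soon as the producer has filled one buffer.
from dataclasses import dataclass

@dataclass
class Tile:
    layer: int   # index of the layer that produced this tile
    index: int   # tile position within that layer's output

def produce(layer: int, index: int) -> Tile:
    """Stand-in for layer `layer` computing one output tile."""
    return Tile(layer, index)

def consume(tile: Tile) -> None:
    """Stand-in for the next layer reading a finished tile."""
    print(f"layer {tile.layer + 1} consumes tile {tile.index} of layer {tile.layer}")

def run_pipeline(num_tiles: int) -> None:
    # Shared-dual ("ping-pong") buffers: while one buffer holds a finished
    # tile for the consumer, the producer fills the other one.
    buffers = [None, None]
    write_sel = 0
    for step in range(num_tiles + 1):
        read_sel = write_sel ^ 1
        if step < num_tiles:
            buffers[write_sel] = produce(layer=0, index=step)  # fill one buffer
        if buffers[read_sel] is not None:
            consume(buffers[read_sel])                          # drain the other
        write_sel ^= 1                                          # swap roles each step

if __name__ == "__main__":
    run_pipeline(4)
```

Each loop iteration models one pipeline step; in hardware the produce and consume of a step run in parallel, which is why the idle time between layers shrinks.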
@ARTICLE{e106-d_6_1117,
author={Van-Cam NGUYEN and Yasuhiko NAKASHIMA},
journal={IEICE TRANSACTIONS on Information and Systems},
title={Implementation of Fully-Pipelined CNN Inference Accelerator on FPGA and HBM2 Platform},
year={2023},
volume={E106-D},
number={6},
pages={1117-1129},
abstract={Many deep convolutional neural network (CNN) inference accelerators on the field-programmable gate array (FPGA) platform have been widely adopted due to their low power consumption and high performance. In this paper, we develop the following to improve performance and power efficiency. First, we use a high bandwidth memory (HBM) to expand the bandwidth of data transmission between the off-chip memory and the accelerator. Second, a fully-pipelined manner, which consists of pipelined inter-layer computation and a pipelined computation engine, is implemented to decrease idle time among layers. Third, a multi-core architecture with shared-dual buffers is designed to reduce off-chip memory access and maximize the throughput. We designed the proposed accelerator on the Xilinx Alveo U280 platform with in-depth Verilog HDL instead of high-level synthesis as the previous works and explored the VGG-16 model to verify the system during our experiment. With a similar accelerator architecture, the experimental results demonstrate that the memory bandwidth of HBM is 13.2× better than DDR4. Compared with other accelerators in terms of throughput, our accelerator is 1.9×/1.65×/11.9× better than FPGA+HBM2 based/low batch size (4) GPGPU/low batch size (4) CPU. Compared with the previous DDR+FPGA/DDR+GPGPU/DDR+CPU based accelerators in terms of power efficiency, our proposed system provides 1.4-1.7×/1.7-12.6×/6.6-37.1× improvement with the large-scale CNN model.},
keywords={},
doi={10.1587/transinf.2022EDP7155},
ISSN={1745-1361},
month={June},}
TY - JOUR
TI - Implementation of Fully-Pipelined CNN Inference Accelerator on FPGA and HBM2 Platform
T2 - IEICE TRANSACTIONS on Information and Systems
SP - 1117
EP - 1129
AU - Van-Cam NGUYEN
AU - Yasuhiko NAKASHIMA
PY - 2023
DO - 10.1587/transinf.2022EDP7155
JO - IEICE TRANSACTIONS on Information and Systems
SN - 1745-1361
VL - E106-D
IS - 6
JA - IEICE TRANSACTIONS on Information and Systems
Y1 - June 2023
AB - Many deep convolutional neural network (CNN) inference accelerators on the field-programmable gate array (FPGA) platform have been widely adopted due to their low power consumption and high performance. In this paper, we develop the following to improve performance and power efficiency. First, we use a high bandwidth memory (HBM) to expand the bandwidth of data transmission between the off-chip memory and the accelerator. Second, a fully-pipelined manner, which consists of pipelined inter-layer computation and a pipelined computation engine, is implemented to decrease idle time among layers. Third, a multi-core architecture with shared-dual buffers is designed to reduce off-chip memory access and maximize the throughput. We designed the proposed accelerator on the Xilinx Alveo U280 platform with in-depth Verilog HDL instead of high-level synthesis as the previous works and explored the VGG-16 model to verify the system during our experiment. With a similar accelerator architecture, the experimental results demonstrate that the memory bandwidth of HBM is 13.2× better than DDR4. Compared with other accelerators in terms of throughput, our accelerator is 1.9×/1.65×/11.9× better than FPGA+HBM2 based/low batch size (4) GPGPU/low batch size (4) CPU. Compared with the previous DDR+FPGA/DDR+GPGPU/DDR+CPU based accelerators in terms of power efficiency, our proposed system provides 1.4-1.7×/1.7-12.6×/6.6-37.1× improvement with the large-scale CNN model.
ER -