The original paper is in English. Non-English content has been machine-translated and may contain typographical errors or mistranslations. ex. Some numerals are expressed as "XNUMX".
Edge computing infrastructures increasingly prioritize a small footprint and scalable, easy-to-estimate performance. In this paper, we propose the following to improve the footprint and scalability of systolic arrays: (1) column multithreading, which reduces the number of physical units while maintaining performance even for back-to-back floating-point accumulations; (2) a cascaded peer-to-peer AXI bus for a scalable multichip structure, together with an intra-chip parallel local memory bus for low latency; (3) multilevel loop control in any unit to reduce startup overhead, and adaptive operation shifting for efficient reuse of local memories. We designed a systolic array with a single-column × 64-row configuration in Verilog HDL, evaluated its frequency and performance on an FPGA attached to a ZYNQ system as an AXI slave device, and evaluated its area with a TSMC 28 nm library and memory generator. We identified the following: (1) for a matrix multiplication / a convolution operation / a light-field depth extraction whose size is larger than the capacity of the local memory, the execution speed is 6.3× / 9.2× / 6.6× that of a similar systolic array (EMAX); (2) the estimated speed with a 4-chip configuration is 19.6× / 16.0× / 8.5×; (3) the size of a single chip is 8.4 mm² (0.31× of EMAX), and the basic performance per area is 2.4×.
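The single-chip and 4-chip figures in the abstract can be compared directly to see how much each workload gains from daisy-chaining. The following is a minimal arithmetic sketch, not part of the paper: it assumes both sets of speedups are measured against the same EMAX baseline, and the names `scaling_summary`, `SINGLE_CHIP`, and `FOUR_CHIP` are hypothetical.

```python
# Speedups relative to EMAX, copied from the abstract.
# Assumption: the single-chip and 4-chip figures share the same EMAX baseline.
WORKLOADS = ("matrix multiplication", "convolution", "light-field depth extraction")
SINGLE_CHIP = (6.3, 9.2, 6.6)    # measured, one chip
FOUR_CHIP = (19.6, 16.0, 8.5)    # estimated, four chips
NUM_CHIPS = 4


def scaling_summary(single, multi, chips):
    """Return (gain, efficiency) per workload.

    gain       = multi-chip speedup divided by single-chip speedup
    efficiency = gain divided by the number of chips (1.0 would be linear scaling)
    """
    return [(m / s, m / s / chips) for s, m in zip(single, multi)]


if __name__ == "__main__":
    rows = scaling_summary(SINGLE_CHIP, FOUR_CHIP, NUM_CHIPS)
    for name, (gain, eff) in zip(WORKLOADS, rows):
        print(f"{name}: {gain:.2f}x from 1 to 4 chips ({eff:.0%} of linear)")
```

Under that assumption, going from one chip to four gives roughly 3.1× for matrix multiplication, 1.7× for convolution, and 1.3× for light-field depth extraction, so the estimated benefit of the multichip structure varies considerably by workload.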
Jun IWAMOTO
Nara Institute of Science and Technology
Yuma KIKUTANI
Nara Institute of Science and Technology
Renyuan ZHANG
Nara Institute of Science and Technology
Yasuhiko NAKASHIMA
Nara Institute of Science and Technology
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Jun IWAMOTO, Yuma KIKUTANI, Renyuan ZHANG, Yasuhiko NAKASHIMA, "Daisy-Chained Systolic Array and Reconfigurable Memory Space for Narrow Memory Bandwidth" in IEICE TRANSACTIONS on Information,
vol. E103-D, no. 3, pp. 578-589, March 2020, doi: 10.1587/transinf.2019EDP7144.
Abstract: A paradigm shift toward edge computing infrastructures that prioritize small footprint and scalable/easy-to-estimate performance is increasing. In this paper, we propose the following to improve the footprint and the scalability of systolic arrays: (1) column multithreading for reducing the number of physical units and maintaining the performance even for back-to-back floating-point accumulations; (2) a cascaded peer-to-peer AXI bus for a scalable multichip structure and an intra-chip parallel local memory bus for low latency; (3) multilevel loop control in any unit for reducing the startup overhead and adaptive operation shifting for efficient reuse of local memories. We designed a systolic array with a single column × 64 row configuration with Verilog HDL, evaluated the frequency and the performance on an FPGA attached to a ZYNQ system as an AXI slave device, and evaluated the area with a TSMC 28nm library and memory generator and identified the following: (1) the execution speed of a matrix multiplication/a convolution operation/a light-field depth extraction, whose size is larger than the capacity of the local memory, is 6.3× / 9.2× / 6.6× compared with a similar systolic array (EMAX); (2) the estimated speed with a 4-chip configuration is 19.6× / 16.0× / 8.5×; (3) the size of a single chip is 8.4 mm2 (0.31× of EMAX) and the basic performance per area is 2.4×.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2019EDP7144/_p
@ARTICLE{e103-d_3_578,
author={Jun IWAMOTO and Yuma KIKUTANI and Renyuan ZHANG and Yasuhiko NAKASHIMA},
journal={IEICE TRANSACTIONS on Information},
title={Daisy-Chained Systolic Array and Reconfigurable Memory Space for Narrow Memory Bandwidth},
year={2020},
volume={E103-D},
number={3},
pages={578-589},
abstract={A paradigm shift toward edge computing infrastructures that prioritize small footprint and scalable/easy-to-estimate performance is increasing. In this paper, we propose the following to improve the footprint and the scalability of systolic arrays: (1) column multithreading for reducing the number of physical units and maintaining the performance even for back-to-back floating-point accumulations; (2) a cascaded peer-to-peer AXI bus for a scalable multichip structure and an intra-chip parallel local memory bus for low latency; (3) multilevel loop control in any unit for reducing the startup overhead and adaptive operation shifting for efficient reuse of local memories. We designed a systolic array with a single column × 64 row configuration with Verilog HDL, evaluated the frequency and the performance on an FPGA attached to a ZYNQ system as an AXI slave device, and evaluated the area with a TSMC 28nm library and memory generator and identified the following: (1) the execution speed of a matrix multiplication/a convolution operation/a light-field depth extraction, whose size is larger than the capacity of the local memory, is 6.3× / 9.2× / 6.6× compared with a similar systolic array (EMAX); (2) the estimated speed with a 4-chip configuration is 19.6× / 16.0× / 8.5×; (3) the size of a single chip is 8.4 mm2 (0.31× of EMAX) and the basic performance per area is 2.4×.},
keywords={},
doi={10.1587/transinf.2019EDP7144},
ISSN={1745-1361},
month={March},
}
TY - JOUR
TI - Daisy-Chained Systolic Array and Reconfigurable Memory Space for Narrow Memory Bandwidth
T2 - IEICE TRANSACTIONS on Information
SP - 578
EP - 589
AU - Jun IWAMOTO
AU - Yuma KIKUTANI
AU - Renyuan ZHANG
AU - Yasuhiko NAKASHIMA
PY - 2020
DO - 10.1587/transinf.2019EDP7144
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E103-D
IS - 3
JA - IEICE TRANSACTIONS on Information
Y1 - March 2020
AB - A paradigm shift toward edge computing infrastructures that prioritize small footprint and scalable/easy-to-estimate performance is increasing. In this paper, we propose the following to improve the footprint and the scalability of systolic arrays: (1) column multithreading for reducing the number of physical units and maintaining the performance even for back-to-back floating-point accumulations; (2) a cascaded peer-to-peer AXI bus for a scalable multichip structure and an intra-chip parallel local memory bus for low latency; (3) multilevel loop control in any unit for reducing the startup overhead and adaptive operation shifting for efficient reuse of local memories. We designed a systolic array with a single column × 64 row configuration with Verilog HDL, evaluated the frequency and the performance on an FPGA attached to a ZYNQ system as an AXI slave device, and evaluated the area with a TSMC 28nm library and memory generator and identified the following: (1) the execution speed of a matrix multiplication/a convolution operation/a light-field depth extraction, whose size is larger than the capacity of the local memory, is 6.3× / 9.2× / 6.6× compared with a similar systolic array (EMAX); (2) the estimated speed with a 4-chip configuration is 19.6× / 16.0× / 8.5×; (3) the size of a single chip is 8.4 mm2 (0.31× of EMAX) and the basic performance per area is 2.4×.
ER -