Ahmad Karimi Pashaki (احمد كريمي پاشاكي)

Title

Hardware Implementation of a Convolutional Neural Network to Accelerate the Inference Stage

Degree level

Master's

Field of study

Electronic Integrated Circuits

Year of study

1396

Defense date

1399/07/23

Supervisor

Dr. Sattar Mirzakuchaki

Faculty

Electrical Engineering

Abstract

In this thesis, an efficient hardware architecture for implementing convolutional neural networks was designed and implemented on the Zynq UltraScale+ ZU3EG chip, using the AVNET Ultra96 board. The convolutional neural network is implemented as three separate modules, Conv2D, AddReluPadd, and MaxPool, which can operate in parallel. The goal is a system that combines high computing power with low power consumption and is suitable for embedded systems. In the proposed architecture, data is read from and written to external memory as a stream, and several techniques greatly reduce the number of memory accesses, which lowers power consumption and raises performance. Parallelization is applied in different parts of the implementation, and pipelining increases throughput. For the ReLU activation function, a new and highly efficient circuit that uses only a single multiplexer is introduced. The ARM processor on the Zynq controls the modules, issues their start and finish commands, and sets the read and write addresses. The implementation is also flexible: by changing the hyperparameters, any convolutional neural network composed of convolution layers, the ReLU function, and max pooling can be run on this system. AlexNet was implemented with two 32-bit data types, floating point and fixed point, and the results were compared. With fixed-point data, the proposed architecture runs at 300 MHz, consumes 2.87 W, and achieves 51.31 GOPS. Performance per DSP is 1.006, which is much higher than in similar works and shows that high performance was achieved with far fewer resources. The high frequency and performance together with low power consumption make this architecture a good choice for embedded systems.
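The figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below derives energy efficiency, operations per clock cycle, and the DSP count implied by the reported numbers. These derived values are not stated in the record, and the unit of the "performance per DSP" figure is assumed here to be GOPS per DSP; the function names are illustrative.

```python
# Figures reported in the abstract for the fixed-point design.
FREQ_MHZ = 300.0       # operating frequency
POWER_W = 2.87         # power consumption
PERF_GOPS = 51.31      # performance
PERF_PER_DSP = 1.006   # assumption: unit is GOPS per DSP


def energy_efficiency(perf_gops: float, power_w: float) -> float:
    """Performance per watt, in GOPS/W."""
    return perf_gops / power_w


def ops_per_cycle(perf_gops: float, freq_mhz: float) -> float:
    """Average number of operations completed per clock cycle."""
    return (perf_gops * 1e9) / (freq_mhz * 1e6)


def implied_dsp_count(perf_gops: float, perf_per_dsp: float) -> float:
    """DSP count implied by the performance and performance-per-DSP figures."""
    return perf_gops / perf_per_dsp


if __name__ == "__main__":
    print(f"{energy_efficiency(PERF_GOPS, POWER_W):.2f} GOPS/W")
    print(f"{ops_per_cycle(PERF_GOPS, FREQ_MHZ):.2f} ops/cycle")
    print(f"{implied_dsp_count(PERF_GOPS, PERF_PER_DSP):.1f} DSPs (implied)")
```

Under these assumptions the design works out to roughly 17.9 GOPS/W and about 171 operations per cycle, with an implied count of about 51 DSPs.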

Date of data entry

1399/08/06

Title in English

Implementing a Hardware Accelerator for Convolutional Neural Network

Release date

3/21/2021 12:00:00 AM

Record entered by (student)

Ahmad Karimi Pashaki

Author: Ahmad Karimi Pashaki

Abstract in English

In this thesis, an efficient hardware architecture for implementing a convolutional neural network was designed and implemented on the Zynq UltraScale+ ZU3EG chip. The AVNET Ultra96 evaluation board was used for the implementation. In this architecture, the convolutional neural network is implemented as three separate modules, Conv2D, AddReluPadd, and MaxPool, which can operate in parallel. The purpose of this implementation is to present a system that combines high computing power with low power consumption and is suitable for use in embedded systems. In the proposed architecture, reads from and writes to off-chip memory are streamed, and the number of memory accesses is greatly reduced using several techniques, which lowers power consumption and increases speed. Parallelization is applied in different parts of the implementation, and pipelining increases the throughput. A novel, highly efficient circuit that uses only one multiplexer is presented for the ReLU activation function. The ARM processor embedded in the Zynq controls the modules, sends them start and finish commands, and manages the memory addresses for read and write operations. In addition, the proposed implementation is flexible: any convolutional neural network that consists of convolution layers, the ReLU function, and max pooling can be run on this system by changing the hyperparameters. The AlexNet network was implemented with two 32-bit data types, floating point and fixed point, and the results were compared. With fixed-point data, the proposed architecture operates at 300 MHz, consumes 2.87 W, and achieves 51.31 GOPS. The performance per DSP in this system is 1.006 GOPS per DSP. This figure is much higher than in similar works and indicates that very high performance was achieved with far fewer resources. With its high operating frequency and performance and its low power consumption, the proposed architecture is a favorable choice for embedded systems.
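The abstract says only that ReLU is built from a single multiplexer and does not describe the circuit. One plausible reading, sketched below as a software model, is that the sign bit of a two's-complement fixed-point word selects between zero and the unchanged input. The 16 fractional bits (Q16.16) and all function names are assumptions made for illustration; the record specifies only that the data is 32-bit fixed point.

```python
WIDTH = 32                  # word width stated in the abstract
MASK = (1 << WIDTH) - 1
FRAC_BITS = 16              # assumption: Q16.16 format, not given in the record


def to_fixed(x: float, frac_bits: int = FRAC_BITS) -> int:
    """Encode a real number as a 32-bit two's-complement fixed-point word."""
    return int(round(x * (1 << frac_bits))) & MASK


def from_fixed(word: int, frac_bits: int = FRAC_BITS) -> float:
    """Decode a 32-bit two's-complement fixed-point word to a real number."""
    if word >> (WIDTH - 1):          # sign bit set: value is negative
        word -= 1 << WIDTH
    return word / (1 << frac_bits)


def relu_mux(word: int) -> int:
    """Model of a 2:1 multiplexer: the sign bit selects 0 or the input word."""
    word &= MASK
    sign = (word >> (WIDTH - 1)) & 1  # multiplexer select line
    return 0 if sign else word


if __name__ == "__main__":
    for value in (1.5, -2.25, 0.0):
        print(value, "->", from_fixed(relu_mux(to_fixed(value))))
```

In this model no comparator or arithmetic is needed, because in two's complement a negative value is identified by its most significant bit alone. The same selection would not map directly onto the floating-point variant without considering its sign-magnitude encoding.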

Link to this record

https://dl.iust.ac.ir/dl/search/default.aspx?Term=22520&Field=0&DTC=6