Abstract
In this thesis, an efficient hardware architecture for convolutional neural networks is designed and implemented on a Zynq UltraScale+ ZU3EG chip, using the AVNET Ultra96 evaluation board. The network is realized as three separate modules, Conv2D, AddReluPadd, and MaxPool, that can operate in parallel. The goal is a system that combines high computing power with low power consumption and is therefore suitable for embedded systems.
In the proposed architecture, off-chip memory is read and written in a streaming fashion, and several techniques greatly reduce the number of memory accesses, which lowers power consumption and increases speed. Parallelism is exploited in several parts of the design, and pipelining further increases throughput. The design also introduces a novel, highly efficient circuit that implements the ReLU activation function with a single multiplexer.
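As an illustration of how ReLU can reduce to a single multiplexer, the following is a minimal behavioral sketch, not the circuit from this thesis. It assumes a two's-complement word whose most significant bit is the sign and uses that bit as the select line of a 2:1 multiplexer. The function name `relu_mux` and the default 16-bit width are hypothetical choices made for the example.

```python
def relu_mux(x: int, width: int = 16) -> int:
    """Behavioral model of ReLU as one 2:1 multiplexer (illustrative assumption).

    x is interpreted as a two's-complement word of `width` bits.
    The sign bit (MSB) acts as the mux select line.
    """
    mask = (1 << width) - 1
    word = x & mask                      # keep only `width` bits
    sign = (word >> (width - 1)) & 1     # select line: 1 means negative
    return 0 if sign else word           # mux: negative -> 0, otherwise pass through


# Example: a positive value passes through, a negative one is clamped to zero.
positive = relu_mux(5)
negative = relu_mux(-3)
```

Because IEEE 754 floating-point words also carry their sign in the most significant bit, the same select-on-sign idea would apply to the 32-bit floating-point data path.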
The ARM processor embedded in the Zynq controls the modules, sends them start and finish commands, and manages the memory addresses for read and write operations. The implementation is also flexible: any convolutional neural network composed of convolution layers, ReLU activations, and max pooling can be mapped onto the system by changing the hyperparameters.
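To make the role of the hyperparameters concrete, the sketch below shows the kind of per-layer description a controlling processor might hold and the standard formula for the resulting feature-map size. The `LayerParams` class and its field names are hypothetical and are not taken from the thesis; the numeric values are the commonly cited first-stage AlexNet settings for a 227x227 input.

```python
from dataclasses import dataclass


@dataclass
class LayerParams:
    """Hypothetical per-layer hyperparameters (names are illustrative)."""
    in_size: int   # input feature-map height/width (square maps assumed)
    kernel: int    # kernel (window) size
    stride: int    # stride
    pad: int       # zero padding on each side


def out_size(p: LayerParams) -> int:
    """Standard output-size formula for convolution and pooling layers."""
    return (p.in_size + 2 * p.pad - p.kernel) // p.stride + 1


# First AlexNet stage: 11x11 convolution with stride 4, then 3x3 max pooling with stride 2.
conv1 = LayerParams(in_size=227, kernel=11, stride=4, pad=0)
pool1 = LayerParams(in_size=out_size(conv1), kernel=3, stride=2, pad=0)
sizes = (out_size(conv1), out_size(pool1))
```

Changing only these fields describes a different layer, which is the sense in which a different network could be mapped onto the same hardware modules.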
AlexNet was implemented with two data types, 32-bit floating point and fixed point, and the results were compared. With fixed-point data, the proposed architecture operates at 300 MHz, consumes 2.87 W, and delivers 51.31 GOPS. Its performance per DSP is 1.006 GOPS, which is much higher than that of similar works and indicates that very high performance is achieved with far fewer resources. With its high operating frequency and performance and its low power consumption, the proposed architecture is a favorable choice for embedded systems.
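The reported figures can be combined into two derived quantities, shown here as a short arithmetic sketch. It assumes that the per-DSP figure is expressed in GOPS per DSP slice; the variable names are illustrative, and the derived values are not stated in the abstract itself.

```python
# Figures reported for the fixed-point implementation.
performance_gops = 51.31   # throughput in GOPS
power_w = 2.87             # power consumption in watts
gops_per_dsp = 1.006       # performance per DSP (assumed unit: GOPS per DSP slice)

# Energy efficiency: throughput divided by power, about 17.88 GOPS/W.
energy_efficiency = performance_gops / power_w

# DSP count implied by the two reported figures, about 51 under the assumed unit.
implied_dsp_count = performance_gops / gops_per_dsp
```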