Abstract
Open research questions remain in many fields of science, including climate, nuclear physics, materials science, chemistry, biology, and security. Answering them requires the next generation of computing systems: exascale computing systems. These systems face many challenges, including high energy consumption, scalability, high data transmission and reception rates, and high fault rates. Because an exascale system contains between 100 thousand and 1 million nodes, a node fails on average every 2 minutes. Current approaches to fault tolerance rely on spatial and temporal redundancy. In the checkpoint approach, which is based on temporal redundancy, the application takes a checkpoint at regular intervals. At extreme scale, the high failure rate forces checkpoints to be taken at short intervals, so most of the processor's time is spent taking checkpoints. The other approach is full replication, in which each application process runs transparently on two separate processors. Full replication has an average efficiency of 50%. This dissertation presents a partial process replication method in which each application process has a partial process that runs on a processor shared, in a time-sharing manner, with the partial process of another application process. The method aims to achieve higher application efficiency than the checkpoint method while using fewer nodes than full replication. Simulation results indicate that the proposed method outperforms both the checkpoint and full replication methods. In the worst case, application efficiency is 50%, and the method uses fewer nodes than full replication.
Keywords: High Performance Computing, Exascale Computing Systems, Fault Tolerance, Checkpoint, Full Replication, Partial Process Replication
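
The abstract's claim that short checkpoint intervals consume most of the processor's time can be illustrated with a rough calculation. The sketch below is not the dissertation's model: it assumes Young's first-order approximation for the checkpoint interval, and the checkpoint costs are hypothetical values chosen only for illustration. The 2-minute failure interval and the 50% figure for full replication are taken from the abstract.

```python
import math

# Figure quoted in the abstract for full replication.
FULL_REPLICATION_EFFICIENCY = 0.5


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Checkpoint interval from Young's first-order approximation (assumed model)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_efficiency(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Approximate fraction of time spent on useful work under checkpointing.

    First-order waste = time writing checkpoints + expected rework after a failure.
    The approximation is only rough when the interval is not much smaller than
    the mean time between failures, so the result is clipped at 0.
    """
    tau = young_interval(checkpoint_cost_s, mtbf_s)
    waste = checkpoint_cost_s / tau + tau / (2.0 * mtbf_s)
    return max(0.0, 1.0 - waste)


if __name__ == "__main__":
    mtbf_s = 120.0  # abstract: a node fails on average every 2 minutes
    # Hypothetical checkpoint costs in seconds (not from the dissertation).
    for cost_s in (1.0, 5.0, 15.0, 30.0, 60.0):
        eff = checkpoint_efficiency(cost_s, mtbf_s)
        print(f"checkpoint cost {cost_s:5.1f} s -> efficiency ~ {eff:.2f} "
              f"(full replication: {FULL_REPLICATION_EFFICIENCY:.2f})")
```

Under these assumptions, checkpointing falls below the 50% efficiency of full replication once a checkpoint costs more than about 15 seconds, which is the regime the abstract describes.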