English Abstract
In today's world, pre-trained Transformer-based models such as BERT and GPT-3, widely recognized as large AI models, have gained significant importance. Distributed training has become a fundamental approach to accelerating their training: it spreads the workload across multiple GPUs, which is essential for models that demand ever more data and training time. Despite past advances, achieving high GPU utilization remains a major challenge, especially in academic environments, which often feature heterogeneous infrastructure and limited inter-node bandwidth and therefore violate the assumptions of existing methods. In previous methods, the node with the lowest computational power becomes the bottleneck, slowing computation and increasing the waiting time of the other nodes. This study addresses that issue by adjusting per-node batch sizes so that node waiting times are minimized, improving node utilization without reducing the convergence speed. Moreover, existing methods for coping with GPU memory limitations often rely on high-speed inter-node communication; when network bandwidth is low (e.g., 1 Gb/s), this reliance increases training time. In this research, that challenge is mitigated with the LSDP (Locally Sharded Data Parallel) method, which leverages CPU memory instead of inter-node communication. Finally, combining these two strategies yields the LSHDP (Locally Sharded Heterogeneous Data Parallel) framework, which is suited to heterogeneous infrastructures with low inter-node communication speeds. Experiments demonstrate that this method outperforms previous approaches in such environments, achieving improvements of 35.39% and 52.57% over data-parallel and Fully Sharded Data Parallel (FSDP) training, respectively.
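To illustrate the heterogeneity-aware batch-sizing idea summarized above, the following minimal sketch splits a fixed global batch across nodes in proportion to each node's measured throughput, so faster nodes receive larger local batches and all nodes finish a step at roughly the same time. The function name, throughput figures, and rounding policy are illustrative assumptions, not the thesis implementation.

```python
def proportional_batch_sizes(global_batch, throughputs):
    """Split `global_batch` across nodes proportionally to `throughputs` (samples/s)."""
    total = sum(throughputs)
    # Initial proportional shares, rounded down.
    shares = [int(global_batch * t / total) for t in throughputs]
    # Hand any remainder left by rounding to the fastest nodes first.
    remainder = global_batch - sum(shares)
    for i in sorted(range(len(throughputs)), key=lambda i: -throughputs[i])[:remainder]:
        shares[i] += 1
    return shares

# Example: a 3-node cluster where node 0 is twice as fast as nodes 1 and 2.
print(proportional_batch_sizes(128, [200.0, 100.0, 100.0]))  # -> [64, 32, 32]
```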