Structure-aware communication scheduling for distributed deep learning applications

Keywords

DCN, cloud computing, distributed training, communication scheduling

Degree Level

masters

Degree Name

M. Sc.

Publisher

Memorial University of Newfoundland

Abstract

With the growing popularity of large-scale deep neural networks (DNNs), efficient communication scheduling has become crucial for reducing overall training time in distributed DNN training. In multi-job distributed training scenarios, current communication scheduling methods do not effectively exploit the periodic communication patterns of deep learning training (DLT) jobs to reduce potential link contention. When multiple tenants run concurrent jobs and compete for network resources, training performance can degrade because of this contention. In this thesis, we explore leveraging periodic communication patterns to schedule DLT jobs. We analyze the performance of static shift-based scheduling strategies, which align job iterations using the least common multiple (LCM) of their iteration periods to handle multi-job communication conflicts. Through theoretical modeling and validation on real-world traces, we identify fundamental limitations of such shift-based strategies, showing that about 73% of job combinations fail to benefit because of structural constraints in job periodicity. To address this limitation, we propose a set of structure-aware scheduling enhancements targeting both feasible and infeasible cases. For structurally feasible combinations, we prune the scheduling search space by leveraging the relationship between job periods and aggregation structures, and we further eliminate redundant evaluations through structural equivalence filtering. This two-step approach reduces runtime by over 99.47% compared with generic Mixed-Integer Linear Programming (MILP) solvers, enabling fast and scalable optimization. For infeasible combinations, where shift-based optimization is fundamentally ineffective, we develop a delay-tolerant strategy that reconstructs the optimization space through minimal periodic padding, transforming unschedulable job pairs into schedulable ones. Experimental results demonstrate that our methods significantly extend the applicability of shift-based scheduling and can yield over 10% makespan reduction as well as consistent fairness improvements across diverse contention scenarios.
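
The idea of shift-based scheduling over an LCM window can be illustrated with a small sketch. This is not the thesis's model or solver; it is a toy under stated assumptions: time is discretized into integer slots, each job communicates for a fixed number of slots at the start of every period, all jobs share a single link, and the function names `overlap_slots` and `best_shift` are hypothetical. The sketch brute-forces the start offset of one job and counts contended slots in one hyperperiod.

```python
from math import lcm  # math.lcm is available in Python 3.9+


def overlap_slots(periods, durations, shifts):
    """Count slots in one hyperperiod where two or more jobs communicate.

    Assumption (not from the thesis): job i occupies the shared link during
    slots t with (t - shift_i) mod period_i < duration_i.
    """
    hyperperiod = lcm(*periods)  # the schedule repeats after this many slots
    contended = 0
    for t in range(hyperperiod):
        active = sum(
            1
            for p, c, s in zip(periods, durations, shifts)
            if (t - s) % p < c
        )
        if active >= 2:
            contended += 1
    return contended


def best_shift(p1, c1, p2, c2):
    """Try every start offset for job 2 (job 1 is fixed at offset 0).

    Returns (contended_slots, shift) for the offset with the least contention;
    ties are broken in favor of the smallest shift.
    """
    return min(
        (overlap_slots((p1, p2), (c1, c2), (0, s)), s) for s in range(p2)
    )


if __name__ == "__main__":
    print(best_shift(4, 1, 6, 1))
    print(best_shift(3, 1, 4, 1))
```

In this toy model, periods 4 and 6 admit a conflict-free shift: `best_shift(4, 1, 6, 1)` returns `(0, 1)`. Coprime periods 3 and 4 leave one contended slot per hyperperiod for every shift, so `best_shift(3, 1, 4, 1)` returns `(1, 0)`, a small instance of a combination whose periodicity prevents shifting alone from removing contention.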
