Joint optimization of algebraic transformation, device placement, and network bandwidth allocation for distributed DNN training
Abstract
Deep neural networks (DNNs) have been widely adopted across a broad range of application domains, including autonomous driving and recommendation systems. To achieve better predictive performance, the sizes of training datasets and model parameters have grown substantially, and training large DNN models on a single device has become impractical due to limits on memory and computational resources. Distributed training across multiple devices in a cluster has therefore become the standard way to train such models efficiently. Model parallelism is a widely used strategy for distributed DNN training that partitions model parameters across devices. However, traditional model parallelism suffers from low hardware utilization because only one device works at a time. To improve utilization, tensor, pipeline, and inter-operator parallelism strategies have been proposed. Among these, inter-operator parallelism requires optimizing device placement and scheduling to minimize training latency, yet determining an optimal device placement is an NP-hard problem: as DNN model sizes increase, the placement search space grows exponentially, leading to significant placement search latency. In addition, the communication traffic pattern of a DNN training job is determined by its parallelization strategy, producing non-uniform traffic across inter-server communication links. Traditional Clos-based topologies, which provide uniform bandwidth and latency, become suboptimal under such conditions: links connecting servers that synchronize frequently may become congested while other links remain underutilized. This imbalance wastes network resources and can significantly hinder training performance.
Given these challenges, this thesis addresses two problems: 1) under the non-uniform traffic distribution of DNN training jobs, how to improve network resource utilization so as to minimize communication costs, and 2) given the NP-hardness of the device placement search problem and growing model sizes, how to determine an efficient device placement in polynomial time. First, to improve network resource utilization, we present, for the first time, a joint optimization of device placement and network bandwidth allocation to accelerate large-scale distributed DNN training. We propose a novel approach to implementing the bandwidth allocation: on each network interface, Open vSwitch (OVS) enforces a Quality of Service (QoS) policy and corresponding flow rules to route egress traffic and allocate bandwidth for each destination server. In addition, we leverage policy-based routing with custom routing tables to spread traffic across multiple physical links whenever the bandwidth demand from a source server exceeds the capacity of a single link to the destination server. Second, to tackle the NP-hardness of the joint optimization problem, we propose a novel algebraic transformation framework based on iterative operator fusion and co-location. The framework generates a compact representation of the DNN computation graph, significantly reducing the search space for device placement and network bandwidth allocation while incurring minimal degradation in training latency. We evaluate our approach on real-world DNN benchmarks, demonstrating up to a 22% reduction in training latency; incorporating network bandwidth allocation provides up to an additional 11% reduction. More importantly, our design achieves up to 650 times lower solution search latency than state-of-the-art methods.
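The multi-link spreading described above can be sketched abstractly: when a source server's demand toward one destination exceeds a single link's capacity, the remainder spills onto further links. The snippet below is an illustrative model of that allocation decision only (the function name `split_demand` and the greedy policy are assumptions for illustration; the thesis realizes this effect with OVS QoS rules and policy-based routing, not with this code).

```python
def split_demand(demand_gbps, link_caps_gbps):
    """Greedily assign a per-destination bandwidth demand across the
    available physical links; returns the per-link allocation in Gbps."""
    alloc = []
    remaining = demand_gbps
    for cap in link_caps_gbps:
        take = min(cap, remaining)  # saturate this link, spill the rest
        alloc.append(take)
        remaining -= take
    if remaining > 1e-9:
        raise ValueError("demand exceeds total capacity of all links")
    return alloc

# A 40 Gbps demand over two 25 Gbps links: the first link is saturated
# and the remaining 15 Gbps spills onto the second.
print(split_demand(40, [25, 25]))  # [25, 15]
```

In the actual system, each per-link share would be enforced by an OVS QoS policy on the corresponding interface, with a custom routing table steering the matching egress flows onto that link.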
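The graph-compaction idea behind the algebraic transformation framework can be illustrated with a minimal sketch: repeatedly merge an operator with its sole consumer when that consumer has no other producer, so linear chains collapse into single fused groups and the placement search sees far fewer nodes. All names here (`fuse_chains`, the union-find grouping) are hypothetical; this is not the thesis implementation, which also handles co-location and cost-aware fusion decisions.

```python
from collections import defaultdict

def fuse_chains(edges):
    """edges: list of (src, dst) pairs of an operator DAG.
    Returns a mapping node -> fused-group representative after
    collapsing single-successor/single-predecessor chains."""
    succ, pred, nodes = defaultdict(list), defaultdict(list), set()
    for s, d in edges:
        succ[s].append(d)
        pred[d].append(s)
        nodes.update((s, d))
    group = {n: n for n in nodes}  # union-find parent pointers

    def find(n):
        while group[n] != n:
            group[n] = group[group[n]]  # path compression
            n = group[n]
        return n

    changed = True
    while changed:  # iterate until no fusible chain edge remains
        changed = False
        for s, d in edges:
            # fuse s->d when d is s's only consumer and s is d's only producer
            if len(succ[s]) == 1 and len(pred[d]) == 1 and find(s) != find(d):
                group[find(d)] = find(s)
                changed = True
    return {n: find(n) for n in nodes}

# A small DAG: chain a->b->c plus a branch a->d.
# b->c fuses (a->b does not, since a has two consumers), leaving 3 groups.
groups = fuse_chains([("a", "b"), ("b", "c"), ("a", "d")])
```

Device placement and bandwidth allocation are then searched over the fused groups rather than the original operators, which is what shrinks the search space.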
