Graphs, which model real-world entities as vertices and the relationships among entities as edges, have proven to be a powerful tool for describing real-world problems in many applications. In most real-world scenarios, entities and their relationships are subject to constant change. Graphs that record such changes are called dynamic graphs. In recent years, the widespread application of dynamic graphs has stimulated extensive research on dynamic graph processing systems that continuously ingest graph updates and produce up-to-date analytics results. As dynamic graphs grow larger, ever higher performance is demanded of dynamic graph processing systems. With their massive parallel processing power and high memory bandwidth, GPUs have become mainstream vehicles for accelerating dynamic graph processing tasks. GPU-based dynamic graph processing systems mainly address two challenges: maintaining the graph data when updates occur (i.e., graph updating) and producing analytics results in time (i.e., graph computing). In this paper, we survey GPU-based dynamic graph processing systems and review their methods for addressing both graph updating and graph computing. To discuss existing dynamic graph processing systems on GPUs comprehensively, we first introduce the terminology of dynamic graph processing and then develop a taxonomy that describes the methods employed for graph updating and graph computing. In addition, we discuss the challenges and future research directions of dynamic graph processing on GPUs.
Hongru GAO, Xiaofei LIAO, Zhiyuan SHAO, Kexin LI, Jiajie CHEN, Hai JIN
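To make the graph-updating half of that picture concrete, the minimal CUDA sketch below inserts a batch of edges into fixed-capacity per-vertex adjacency arrays, one thread per update. The layout, the capacity constant, and the lack of overflow handling are illustrative assumptions for this sketch, not the design of any surveyed system.

// Minimal sketch of batched edge insertion on the GPU: each thread claims a
// slot in the source vertex's adjacency array with an atomic add and writes
// the destination vertex. Real systems use far more elaborate structures.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void insert_edges(const int *src, const int *dst, int num_updates,
                             int *adj, int *degree, int cap) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_updates) return;
    int u = src[i];
    // Atomically reserve a slot in u's adjacency list, then write the edge.
    int slot = atomicAdd(&degree[u], 1);
    if (slot < cap)
        adj[u * cap + slot] = dst[i];   // overflow handling omitted in this sketch
}

int main() {
    const int n = 4;                    // vertices
    const int m = 3;                    // edge insertions in this batch
    const int cap = 8;                  // assumed per-vertex neighbor capacity
    int h_src[m] = {0, 0, 2};
    int h_dst[m] = {1, 3, 1};
    int *d_src, *d_dst, *d_adj, *d_deg;
    cudaMalloc(&d_src, m * sizeof(int));
    cudaMalloc(&d_dst, m * sizeof(int));
    cudaMalloc(&d_adj, n * cap * sizeof(int));
    cudaMalloc(&d_deg, n * sizeof(int));
    cudaMemcpy(d_src, h_src, m * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_dst, h_dst, m * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_deg, 0, n * sizeof(int));
    insert_edges<<<1, 128>>>(d_src, d_dst, m, d_adj, d_deg, cap);
    cudaDeviceSynchronize();
    int h_deg[n];
    cudaMemcpy(h_deg, d_deg, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("degree of vertex 0 after the batch: %d\n", h_deg[0]);
    return 0;
}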
The 3D reconstruction pipeline uses the Bundle Adjustment algorithm to refine the camera and point parameters. Bundle Adjustment is compute-intensive, and many researchers have improved its performance by implementing the algorithm on GPUs. In the previous research work, "Improving Accuracy and Computational Burden of Bundle Adjustment Algorithm using GPUs," the authors first improved the algorithmic performance of Bundle Adjustment, reducing the mean square error by adding a radial distortion parameter and explicitly computing analytical derivatives, and then reduced the computational burden of the algorithm using GPUs. With the naïve CUDA implementation, a speedup of 10× was achieved for the largest dataset of 13,678 cameras, 4,455,747 points, and 28,975,571 projections. In this paper, we present the optimization of the Bundle Adjustment CUDA code on GPUs to achieve higher speedup. We propose a new data memory layout for the parameters in the Bundle Adjustment algorithm, resulting in contiguous memory access. We demonstrate that it improves memory throughput on the GPUs, thereby improving overall performance. We also demonstrate an increase in the computational throughput of the algorithm by optimizing the CUDA kernels to utilize the GPU resources effectively. A comparative performance study of explicitly computing an algorithm parameter versus using the Jacobians instead is presented. In the previous work, the Bundle Adjustment algorithm failed to converge for certain datasets because several of the camera block matrices in the augmented normal equation were rank-deficient. In this work, we identify the cameras that cause rank-deficient matrices and preprocess the datasets to ensure the convergence of the BA algorithm. Our optimized CUDA implementation achieves convergence of the Bundle Adjustment algorithm in around 22 seconds for the largest dataset, compared to 654 seconds for the sequential implementation.
Pranay R. Kommera, Suresh S. Muknahallipatna, John E. McInroy
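As a rough illustration of why a contiguous parameter layout helps, the hedged CUDA sketch below contrasts an array-of-structures (AoS) camera layout with a structure-of-arrays (SoA) layout. The parameter set, names, and the trivial focal-scaling kernels are assumptions made for this example, not the paper's actual data structures or kernels.

// AoS vs. SoA camera parameter layouts. With SoA, consecutive threads read
// consecutive elements of the same parameter array, giving coalesced loads;
// with AoS, the same access is strided by the struct size.
#include <cuda_runtime.h>

struct CameraAoS {              // one struct per camera: threads touching
    double rot[3];              // focal of cameras i and i+1 hit addresses
    double trans[3];            // sizeof(CameraAoS) bytes apart (strided)
    double focal, k1, k2;
};

__global__ void scale_focal_aos(CameraAoS *cams, int n, double s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) cams[i].focal *= s;          // uncoalesced, strided access
}

struct CamerasSoA {             // one array per parameter: thread i reads
    double *rot;                // focal[i], thread i+1 reads focal[i+1],
    double *trans;              // i.e., fully contiguous, coalesced loads
    double *focal, *k1, *k2;
};

__global__ void scale_focal_soa(CamerasSoA cams, int n, double s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) cams.focal[i] *= s;          // contiguous, coalesced access
}

int main() {
    const int n = 1024;
    CamerasSoA cams;
    cudaMalloc(&cams.rot,   3 * n * sizeof(double));
    cudaMalloc(&cams.trans, 3 * n * sizeof(double));
    cudaMalloc(&cams.focal, n * sizeof(double));
    cudaMalloc(&cams.k1,    n * sizeof(double));
    cudaMalloc(&cams.k2,    n * sizeof(double));
    cudaMemset(cams.focal, 0, n * sizeof(double));
    scale_focal_soa<<<(n + 255) / 256, 256>>>(cams, n, 1.01);
    cudaDeviceSynchronize();
    return 0;
}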
Bundle adjustment is a camera and point refinement technique in a 3D scene reconstruction pipeline. The camera parameters and the 3D points are refined by minimizing the difference between the computed and observed projections of the image points, formulated as a non-linear least-squares problem. The Levenberg-Marquardt method is used to solve this non-linear least-squares problem. Solving it is computationally expensive, with cost proportional to the number of cameras, points, and projections. In this paper, we implement the Bundle Adjustment (BA) algorithm and analyze techniques to improve its algorithmic performance by reducing the mean square error. We investigate using an additional radial distortion camera parameter in the BA algorithm and demonstrate better convergence of the mean square error. We also demonstrate the use of explicitly computed analytical derivatives. In addition, we implement the BA algorithm on GPUs using the CUDA parallel programming model to reduce its computational time burden. CUDA streams, atomic operations, and the cuBLAS library are proposed, implemented, and demonstrated to improve the performance of the BA algorithm. Our implementation demonstrates better convergence of the BA algorithm and achieves a speedup of up to 16× on various datasets.
Pranay R. Kommera, Suresh S. Muknahallipatna, John E. McInroy
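For reference, the textbook formulation that this abstract alludes to can be written as follows; the notation is chosen here for exposition and is not taken from the paper.

% Standard bundle-adjustment objective and Levenberg-Marquardt step
% (textbook form, with notation chosen for this summary).
\begin{align}
  \min_{\{\mathbf{C}_i\},\{\mathbf{X}_j\}} \;
    &\sum_{(i,j)\in\mathcal{V}}
      \left\lVert \pi(\mathbf{C}_i, \mathbf{X}_j) - \mathbf{x}_{ij} \right\rVert^2, \\
  \left(\mathbf{J}^{\top}\mathbf{J}
      + \lambda\,\mathrm{diag}(\mathbf{J}^{\top}\mathbf{J})\right)\boldsymbol{\delta}
    &= -\mathbf{J}^{\top}\mathbf{r},
\end{align}

where the camera parameters C_i and 3D points X_j are optimized over the set of visible observations, x_ij is the observed projection, pi is the projection function (optionally including the radial distortion parameter), r is the residual vector with Jacobian J, lambda is the Levenberg-Marquardt damping factor, and delta is the parameter update.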
We report a high-performance implementation of Kohn–Sham time-dependent density functional theory (TDDFT) within the Tamm–Dancoff approximation on multiple graphics processing units (GPUs). Our algorithm, which uses multiple parallel models in tandem on massively parallel computing systems, scales optimally with material size, considerably reducing the computational wall time. A benchmark TDDFT study was performed on a green fluorescent protein complex composed of 4,353 atoms with 40,518 atomic orbitals represented by Gaussian-type functions, demonstrating the effect of distant protein residues on the excitation. On the largest molecule attempted to date, to the best of our knowledge, the proposed strategy demonstrated reasonably high efficiency on up to 256 GPUs on a custom-built state-of-the-art GPU computing system with NVIDIA A100 GPUs. We believe that our GPU-oriented algorithms, which enable first-principles simulation of very large-scale applications, may yield a deeper understanding of the molecular basis of material behaviors, eventually revealing new possibilities for breakthrough designs of new material systems.
Inkoo Kim, Daun Jeong, Won-Joon Son, Hyung-Jin Kim, Young Min Rhee, Yongsik Jung, Hyeonho Choi, Jinkyu Yim, Inkook Jang, Dae Sin Kim
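For context, the working equation behind a TDA-TDDFT excited-state calculation is the standard eigenvalue problem below. This is the textbook form (for hybrid functionals an additional exact-exchange term enters A); none of the notation is specific to the paper's implementation.

% Tamm--Dancoff approximation to linear-response TDDFT (textbook form).
\begin{equation}
  \mathbf{A}\,\mathbf{X} = \omega\,\mathbf{X}, \qquad
  A_{ia,jb} = \delta_{ij}\delta_{ab}\,(\varepsilon_a - \varepsilon_i)
            + (ia\,|\,jb) + (ia\,|\,f_{\mathrm{xc}}\,|\,jb),
\end{equation}

where i, j index occupied and a, b virtual Kohn–Sham orbitals, epsilon are orbital energies, f_xc is the exchange-correlation kernel, and the eigenvalues omega are the excitation energies. The size of A grows with the number of atomic orbitals, which is what makes systems such as the 40,518-orbital protein complex demanding and multi-GPU parallelism attractive.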
When training a large-scale knowledge graph embedding (KGE) model with multiple graphics processing units (GPUs), a partition-based method is necessary for parallel training. However, existing partition-based training methods suffer from low GPU utilization and high input/output (IO) overhead between memory and disk. To address the high IO overhead between disk and memory, we optimize twice partitioning with fine-grained GPU scheduling, reducing the IO overhead between CPU memory and disk. To address the low GPU utilization caused by GPU load imbalance, we propose balanced partitioning and dynamic scheduling methods that accelerate training in different cases. With the above methods, we propose fine-grained partitioning KGE, an efficient KGE training framework for multiple GPUs. We conducted experiments on several knowledge graph benchmarks, and the results show that our method achieves a speedup over existing frameworks for KGE training.
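As a rough sketch of what balanced partitioning has to achieve, the host-side snippet below assigns entity partitions to GPUs greedily, largest triple count first, onto the least-loaded GPU. This is a generic load-balancing heuristic shown only for illustration; it is not the partitioning or scheduling algorithm proposed in the paper, and the partition sizes are toy data.

// Greedy largest-first assignment of entity partitions to GPUs so that the
// number of training triples per GPU is roughly balanced (illustrative only).
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    const int num_gpus = 4;
    // Number of training triples falling into each entity partition (toy data).
    std::vector<long long> partition_triples = {90, 70, 65, 40, 35, 20, 15, 10};

    std::vector<int> order(partition_triples.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return partition_triples[a] > partition_triples[b];
    });

    std::vector<long long> load(num_gpus, 0);
    std::vector<std::vector<int>> assignment(num_gpus);
    for (int p : order) {
        // Place the next-largest partition on the currently least-loaded GPU.
        int g = (int)(std::min_element(load.begin(), load.end()) - load.begin());
        assignment[g].push_back(p);
        load[g] += partition_triples[p];
    }
    for (int g = 0; g < num_gpus; ++g)
        printf("GPU %d: %lld triples\n", g, load[g]);
    return 0;
}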
In distributed training, increasing the batch size can improve parallelism, but it can also bring many difficulties to the training process and cause training errors. In this work, we investigate the occurrence of training errors in theory and train ResNet-50 on CIFAR-10 using Stochastic Gradient Descent (SGD) and Adaptive moment estimation (Adam) while keeping the total batch size in the parameter server constant and lowering the batch size on each Graphics Processing Unit (GPU). A new method that considers momentum to eliminate training errors in distributed training is proposed. We define a Momentum-like Factor (MF) to represent the influence of former gradients on parameter updates in each iteration. Then, we modify the MF values and conduct experiments to explore how different MF values influence training performance with SGD, Adam, and Nesterov accelerated gradient. Experimental results reveal that increasing MFs is a reliable method for reducing training errors in distributed training. An analysis of the convergence conditions in distributed training with a large batch size and multiple GPUs is also presented.
Yu Tang, Zhigang Kan, Lujia Yin, Zhiquan Lai, Zhaoning Zhang, Linbo Qiao, Dongsheng Li
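To fix notation for how earlier gradients enter each parameter update, the standard SGD-with-momentum recursion can be unrolled as below; the paper's exact Momentum-like Factor (MF) definition is not reproduced here, this only shows the kind of influence the MF is meant to capture.

% Standard SGD with momentum, unrolled to show the geometric weighting of
% past gradients in the current update (not the paper's MF definition).
\begin{align}
  v_{t} &= \mu\, v_{t-1} + g_{t}, \\
  \theta_{t+1} &= \theta_{t} - \eta\, v_{t}
    = \theta_{t} - \eta \sum_{k=0}^{t} \mu^{\,t-k} g_{k},
\end{align}

where g_k is the mini-batch gradient at step k, mu is the momentum coefficient, and eta is the learning rate. The geometric weights mu^(t-k) quantify how strongly former gradients shape the current update, which is the influence the MF is intended to represent and modify.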
We present BADF (Bounding Volume Hierarchy Based Adaptive Distance Fields), a novel algorithm for accelerating the construction of adaptive distance fields (ADFs) of rigid and deformable models on graphics processing units (GPUs). Our approach is based on constructing a bounding volume hierarchy (BVH), which we use to generate an octree-based ADF. We exploit the coherence between successive frames and sort the grid points of the octree to accelerate the computation. Our approach is applicable to both rigid and deformable models. Our GPU-based algorithm is about 20× to 50× faster than current mainstream CPU-based algorithms. BADF can construct the distance fields for deformable models with 60k triangles at interactive rates on an NVIDIA GeForce GTX 1060. Moreover, we observe a 3× speedup over prior GPU-based ADF algorithms.
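To show the baseline such methods improve on, the CUDA sketch below evaluates an unsigned distance field on a regular grid by brute force: each thread scans every mesh vertex. BADF instead prunes this scan with BVH traversal over an adaptive octree; the kernel here is only a reference point, not the paper's algorithm, and reducing "distance to the mesh" to "distance to the nearest vertex" is a simplification for the sketch.

// Brute-force unsigned distance field: one thread per grid point, linear
// scan over mesh vertices. This O(grid * geometry) loop is exactly what a
// BVH-based traversal avoids.
#include <cfloat>
#include <cuda_runtime.h>

__global__ void brute_force_distance(const float3 *verts, int num_verts,
                                     const float3 *grid_pts, int num_pts,
                                     float *dist) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_pts) return;
    float3 p = grid_pts[i];
    float best = FLT_MAX;
    for (int v = 0; v < num_verts; ++v) {       // the scan a BVH would prune
        float dx = p.x - verts[v].x;
        float dy = p.y - verts[v].y;
        float dz = p.z - verts[v].z;
        best = fminf(best, dx * dx + dy * dy + dz * dz);
    }
    dist[i] = sqrtf(best);                      // unsigned distance only
}

int main() {
    const int nv = 3, np = 2;
    float3 h_verts[nv] = {{0, 0, 0}, {1, 0, 0}, {0, 1, 0}};
    float3 h_pts[np]   = {{0.5f, 0.5f, 1.0f}, {2, 0, 0}};
    float3 *d_verts, *d_pts;
    float *d_dist;
    cudaMalloc(&d_verts, nv * sizeof(float3));
    cudaMalloc(&d_pts,   np * sizeof(float3));
    cudaMalloc(&d_dist,  np * sizeof(float));
    cudaMemcpy(d_verts, h_verts, nv * sizeof(float3), cudaMemcpyHostToDevice);
    cudaMemcpy(d_pts,   h_pts,   np * sizeof(float3), cudaMemcpyHostToDevice);
    brute_force_distance<<<1, 64>>>(d_verts, nv, d_pts, np, d_dist);
    cudaDeviceSynchronize();
    return 0;
}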