Found 687 articles related to "GPUS"
A survey on dynamic graph processing on GPUs: concepts, terminologies and systems
2024
Graphs, which model real-world entities as vertices and the relationships among entities as edges, have proven to be a powerful tool for describing real-world problems in applications. In most real-world scenarios, entities and their relationships are subject to constant change. Graphs that record such changes are called dynamic graphs. In recent years, the widespread application scenarios of dynamic graphs have stimulated extensive research on dynamic graph processing systems that continuously ingest graph updates and produce up-to-date graph analytics results. As dynamic graphs grow larger, higher performance is demanded of dynamic graph processing systems. With their massive parallel processing power and high memory bandwidth, GPUs have become mainstream vehicles for accelerating dynamic graph processing tasks. GPU-based dynamic graph processing systems mainly address two challenges: maintaining the graph data when updates occur (i.e., graph updating) and producing analytics results in time (i.e., graph computing). In this paper, we survey GPU-based dynamic graph processing systems and review their methods for addressing both graph updating and graph computing. To comprehensively discuss existing dynamic graph processing systems on GPUs, we first introduce the terminology of dynamic graph processing and then develop a taxonomy describing the methods employed for graph updating and graph computing. In addition, we discuss the challenges and future research directions of dynamic graph processing on GPUs.
Hongru GAO, Xiaofei LIAO, Zhiyuan SHAO, Kexin LI, Jiajie CHEN, Hai JIN
Keywords: GPUS
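To make the "graph updating" challenge above concrete, here is a minimal, hypothetical CUDA sketch of how a GPU system might ingest a batch of edge insertions in parallel (one thread per new edge, with slack space reserved per vertex). The names and the flat fixed-capacity layout are illustrative only; the systems covered by the survey use more elaborate dynamic graph formats.

```cuda
#include <cuda_runtime.h>

// Hypothetical batch edge-insertion kernel: each thread claims a slot in the
// source vertex's edge array with an atomic counter and writes the new neighbor.
// adj holds capacity_per_vertex slots per vertex; degree[u] counts used slots.
__global__ void insert_edges(const int* src, const int* dst, int batch_size,
                             int* adj, int* degree, int capacity_per_vertex) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per inserted edge
    if (e >= batch_size) return;
    int u = src[e];
    int slot = atomicAdd(&degree[u], 1);             // claim the next free slot of u
    if (slot < capacity_per_vertex)                  // sketch only: drop silently on overflow
        adj[(size_t)u * capacity_per_vertex + slot] = dst[e];
}
```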
Optimized CUDA Implementation to Improve the Performance of Bundle Adjustment Algorithm on GPUs
2024
The 3D reconstruction pipeline uses the Bundle Adjustment algorithm to refine the camera and point parameters. Bundle Adjustment is a compute-intensive algorithm, and many researchers have improved its performance by implementing it on GPUs. In the previous research work, "Improving Accuracy and Computational Burden of Bundle Adjustment Algorithm using GPUs," the authors first demonstrated an algorithmic performance improvement by reducing the mean square error using an additional radial distortion parameter and explicitly computed analytical derivatives, and then reduced the computational burden of the Bundle Adjustment algorithm using GPUs. With the naïve CUDA implementation, a speedup of 10× was achieved for the largest dataset of 13,678 cameras, 4,455,747 points, and 28,975,571 projections. In this paper, we present the optimization of the Bundle Adjustment CUDA code on GPUs to achieve higher speedup. We propose a new data memory layout for the parameters in the Bundle Adjustment algorithm, resulting in contiguous memory access. We demonstrate that it improves memory throughput on the GPUs, thereby improving overall performance. We also demonstrate an increase in the computational throughput of the algorithm by optimizing the CUDA kernels to utilize GPU resources effectively. A comparative performance study of explicitly computing an algorithm parameter versus using the Jacobians instead is presented. In the previous work, the Bundle Adjustment algorithm failed to converge for certain datasets because several block matrices of the cameras in the augmented normal equation were rank-deficient. In this work, we identify the cameras that cause rank-deficient matrices and preprocess the datasets to ensure convergence of the BA algorithm. Our optimized CUDA implementation achieves convergence of the Bundle Adjustment algorithm in around 22 seconds for the largest dataset, compared to 654 seconds for the sequential implementation.
Pranay R. Kommera, Suresh S. Muknahallipatna, John E. McInroy
Keywords: LEVENBERG-MARQUARDT
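As a rough illustration of the contiguous-memory parameter layout described in the abstract above, the sketch below contrasts an array-of-structures camera record with a structure-of-arrays layout in CUDA. All struct and field names are hypothetical, not the authors' actual data structures.

```cuda
#include <cuda_runtime.h>

struct CameraAoS {            // one record per camera: threads reading the same
    double rotation[3];       // field for consecutive cameras touch strided addresses
    double translation[3];
    double focal, k1, k2;     // intrinsics, including radial distortion coefficients
};

struct CamerasSoA {           // one array per field: thread i reads focal[i], so a
    double* rotation;         // warp's loads to the same field are coalesced
    double* translation;
    double* focal;
    double* k1;
    double* k2;
};

// With the SoA layout, a kernel indexing parameters by camera id generates
// contiguous (coalesced) global-memory transactions per warp:
__global__ void scale_focal(CamerasSoA cams, double s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) cams.focal[i] *= s;   // consecutive threads hit consecutive addresses
}
```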
Optimized Implementation of Polynomial Multiplication over Rings on GPUs
2024
As a core component of lattice-based cryptographic algorithms, the efficiency and correctness of polynomial multiplication over rings are critical to the practicality and security of lattice-based schemes. Existing ring polynomial multiplication algorithms such as NTT and KNTT are highly parallel, and running them on CPUs makes it difficult to fully exploit this advantage. This also means that many CPU-based implementations of ring polynomial multiplication still have considerable room for improvement. To address this problem, this paper implements efficient polynomial multiplication over rings based on the KNTT algorithm proposed by Zhu et al., exploiting the many-core architecture and powerful parallel computing capability of GPUs. Thread blocks in the GPU threading model are mapped one-to-one to the low-degree polynomials split out by the KNTT algorithm, so that each thread block is responsible for the parallel NTT computation of one polynomial. Since multiple thread blocks on a GPU can be scheduled to start their computation simultaneously, the polynomials are also processed in parallel, which further improves the efficiency of the KNTT algorithm on the GPU. Experimental results show that the KNTT algorithm implemented on the GPU is significantly faster than both the GPU version of the NTT algorithm and the original CPU version. When the modulus polynomial degree N is 16384, it achieves a 93.78% speed improvement over the original C version; compared with the GPU version of the NTT algorithm, it achieves a 40.62% improvement at N = 2048.
赵新颖, 袁峰, 赵臻, 王保仓
Keywords: polynomial multiplication; NTT
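The block-to-polynomial mapping described above can be sketched as follows. This is a simplified illustration: each block applies a naive O(n²) transform to its polynomial rather than the butterfly-based KNTT/NTT kernels of the paper, and all names and the 32-bit modulus are assumptions.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Modular multiplication with a 64-bit intermediate to avoid overflow.
__device__ uint32_t mul_mod(uint32_t a, uint32_t b, uint32_t q) {
    return (uint32_t)(((uint64_t)a * b) % q);
}

// Illustrative kernel: thread block b transforms the b-th low-degree polynomial.
// omega is a primitive n-th root of unity modulo q; a production kernel would
// use an in-place butterfly NTT with shared memory instead of this naive form.
__global__ void ntt_one_block_per_poly(const uint32_t* polys, uint32_t* out,
                                       int n, uint32_t omega, uint32_t q) {
    const uint32_t* a = polys + (size_t)blockIdx.x * n;   // this block's polynomial
    uint32_t* y = out + (size_t)blockIdx.x * n;
    for (int k = threadIdx.x; k < n; k += blockDim.x) {   // one output point per thread
        uint32_t wk = 1;                                   // omega^k
        for (int t = 0; t < k; ++t) wk = mul_mod(wk, omega, q);
        uint32_t acc = 0, w = 1;                           // w runs over omega^(k*j)
        for (int j = 0; j < n; ++j) {
            acc = (uint32_t)(((uint64_t)acc + mul_mod(a[j], w, q)) % q);
            w = mul_mod(w, wk, q);
        }
        y[k] = acc;                                        // y[k] = sum_j a[j]*omega^(k*j) mod q
    }
}
```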
Improving Accuracy and Computational Burden of Bundle Adjustment Algorithm Using GPUs
2023
Bundle adjustment is a camera and point refinement technique in a 3D scene reconstruction pipeline. The camera parameters and the 3D points are refined by minimizing the difference between the computed and observed projections of the image points, formulated as a non-linear least-squares problem. The Levenberg-Marquardt method is used to solve the non-linear least-squares problem, which is computationally expensive in proportion to the number of cameras, points, and projections. In this paper, we implement the Bundle Adjustment (BA) algorithm and analyze techniques to improve algorithmic performance by reducing the mean square error. We investigate using an additional radial distortion camera parameter in the BA algorithm and demonstrate better convergence of the mean square error. We also demonstrate the use of explicitly computed analytical derivatives. In addition, we implement the BA algorithm on GPUs using the CUDA parallel programming model to reduce its computational time burden. CUDA streams, atomic operations, and the cuBLAS library in the CUDA programming model are proposed, implemented, and demonstrated to improve the performance of the BA algorithm. Our implementation has demonstrated better convergence of the BA algorithm and achieved a speedup of up to 16× for the BA algorithm on various datasets.
Pranay R. Kommera, Suresh S. Muknahallipatna, John E. McInroy
Keywords: LEVENBERG-MARQUARDT
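For orientation, one common way to write the objective that such a Bundle Adjustment formulation minimizes, with a single radial distortion coefficient, is shown below; the paper's exact camera parameterization is not given in the abstract, so this is only a representative form.

$$
\mathbf{p}_{ij} = \pi\!\left(R_i \mathbf{X}_j + \mathbf{t}_i\right), \qquad
\hat{\mathbf{x}}_{ij} = f_i\left(1 + k_i \lVert \mathbf{p}_{ij}\rVert^2\right)\mathbf{p}_{ij}, \qquad
E = \sum_{(i,j)} \left\lVert \hat{\mathbf{x}}_{ij} - \mathbf{x}_{ij} \right\rVert^2 ,
$$

where $\pi$ denotes perspective division, $(R_i, \mathbf{t}_i, f_i, k_i)$ are the parameters of camera $i$, $\mathbf{X}_j$ is the $j$-th 3D point, and $\mathbf{x}_{ij}$ is the observed projection. The Levenberg-Marquardt method then iterates on this non-linear least-squares problem.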
Kohn–Sham time-dependent density functional theory with Tamm–Dancoff approximation on massively parallel GPUs
2023
We report a high-performance multi graphics processing unit (GPU) implementation of the Kohn–Sham time-dependent density functional theory (TDDFT) within the Tamm–Dancoff approximation. Our algorithm, which uses multiple parallel models in tandem on massively parallel computing systems, scales optimally with material size, considerably reducing the computational wall time. A benchmark TDDFT study was performed on a green fluorescent protein complex composed of 4353 atoms with 40,518 atomic orbitals represented by Gaussian-type functions, demonstrating the effect of distant protein residues on the excitation. For what is, to the best of our knowledge, the largest molecule attempted to date, the proposed strategy demonstrated reasonably high efficiencies up to 256 GPUs on a custom-built state-of-the-art GPU computing system with Nvidia A100 GPUs. We believe that our GPU-oriented algorithms, which empower first-principles simulation for very large-scale applications, may render a deeper understanding of the molecular basis of material behaviors, eventually revealing new possibilities for breakthrough designs of new material systems.
Inkoo Kim, Daun Jeong, Won-Joon Son, Hyung-Jin Kim, Young Min Rhee, Yongsik Jung, Hyeonho Choi, Jinkyu Yim, Inkook Jang, Dae Sin Kim
Keywords: GPUS; GRAPHICS; MASSIVE
Efficient Knowledge Graph Embedding Training Framework with Multiple GPUs (cited: 1)
2023
When training a large-scale knowledge graph embedding (KGE) model with multiple graphics processing units (GPUs), a partition-based method is necessary for parallel training. However, existing partition-based training methods suffer from low GPU utilization and high input/output (IO) overhead between memory and disk. To address the high IO overhead between disk and memory, we optimized the twice partitioning with fine-grained GPU scheduling to reduce the IO overhead between CPU memory and disk. To address the low GPU utilization caused by GPU load imbalance, we proposed balanced partitioning and dynamic scheduling methods to accelerate the training speed in different cases. With the above methods, we propose fine-grained partitioning KGE, an efficient KGE training framework with multiple GPUs. We conducted experiments on several knowledge graph benchmarks, and the results show that our method achieves a speedup over existing frameworks in the training of KGE.
Ding Sun, Zhen Huang, Dongsheng Li, Min Guo
Accelerating Graph Neural Network Training on Multiple GPUs
2023
Graph neural networks (GNNs) have recently attracted wide attention due to their strong representational power and flexibility. With the growth of graph data and the limits of GPU memory capacity, training GNNs on traditional general-purpose deep learning systems can no longer meet the requirements and fails to fully exploit the performance of GPU devices. How to efficiently utilize GPU hardware for GNN training has therefore become one of the important research problems in this field. The traditional approach performs the GNN computation via sparse matrix multiplication; when facing GPU memory capacity limits, the computation is distributed to each device through distributed matrix multiplication. The main shortcomings of such methods are: (1) sparse matrix multiplication ignores the sparse distribution characteristics of the graph data itself, so computation efficiency is low; (2) they ignore the computation and memory-access characteristics of the GPU and cannot fully utilize the GPU hardware. To improve training efficiency, some existing work uses graph sampling to reduce the computation cost and storage requirement of each iteration while also supporting flexible distributed scaling, but due to sampling randomness and variance, these methods often affect the accuracy of the trained model. To this end, we propose a high-performance multi-GPU GNN training framework. To guarantee model accuracy, training is based on the full graph; we explore different multi-GPU GNN partitioning schemes, study the impact of different graph data layouts on GPU performance during GNN computation, and propose a sparse-block-aware GPU memory-access optimization technique. A prototype system was implemented based on C++ and CuDNN. Experiments on four large-scale GNN datasets show that: (1) graph reordering improves the GPU cache hit rate by about 40%, with a computation speedup of up to 2x; (2) compared with the existing system DGL, an overall speedup of 5.8x is achieved.
苗旭鹏, 王驭捷, 沈佳, 邵蓥侠, 崔斌
Keywords: distributed computing; memory optimization; GPU acceleration
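The cache behavior that the graph reordering above improves arises in the neighbor-aggregation (sparse) step of GNN computation. Below is a minimal, hypothetical CUDA sketch of a CSR-based mean-aggregation kernel; the names are illustrative, and it omits the sparse-block-aware optimizations of the paper.

```cuda
#include <cuda_runtime.h>

// Illustrative CSR mean-aggregation: block v gathers the features of v's
// neighbors and averages them. The locality of feat[col_idx[e] * feat_dim + f]
// depends directly on the vertex ordering, which is what graph reordering targets.
__global__ void csr_mean_aggregate(const int* row_ptr, const int* col_idx,
                                   const float* feat, float* out,
                                   int num_vertices, int feat_dim) {
    int v = blockIdx.x;                                  // one block per destination vertex
    if (v >= num_vertices) return;
    int begin = row_ptr[v], end = row_ptr[v + 1];
    for (int f = threadIdx.x; f < feat_dim; f += blockDim.x) {
        float acc = 0.0f;
        for (int e = begin; e < end; ++e)                // visit v's neighbors
            acc += feat[(size_t)col_idx[e] * feat_dim + f];
        out[(size_t)v * feat_dim + f] = (end > begin) ? acc / (end - begin) : 0.0f;
    }
}
```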
High-Throughput Parallel Algorithms for GPU-Based Satellite Communication Baseband Processing
2023
Satellite communication is widely used in encrypted communication, emergency communication, and other fields. Its baseband processing algorithms are relatively complex and require substantial computing power. Traditional platforms such as FPGAs and DSPs have long development cycles, whereas GPU-based software-defined radio solutions are convenient to develop and offer superior performance. This paper proposes a GPU-based suite of satellite communication baseband algorithms that implements high-speed processing of the satellite communication downlink. Experimental results show that the GPU-based satellite communication link meets the minimum latency requirement, with a peak baseband processing rate of 978 Mbps.
李荣春, 周鑫, 王庆林, 梅松竹
Keywords: satellite communication; LDPC; VITERBI; RS
Increasing Momentum-Like Factors: A Method for Reducing Training Errors on Multiple GPUs (cited: 1)
2022
In distributed training, increasing batch size can improve parallelism, but it can also bring many difficulties to the training process and cause training errors. In this work, we investigate the occurrence of training errors in theory and train ResNet-50 on CIFAR-10 by using Stochastic Gradient Descent (SGD) and Adaptive moment estimation (Adam) while keeping the total batch size in the parameter server constant and lowering the batch size on each Graphics Processing Unit (GPU). A new method that considers momentum to eliminate training errors in distributed training is proposed. We define a Momentum-like Factor (MF) to represent the influence of former gradients on parameter updates in each iteration. Then, we modify the MF values and conduct experiments to explore how different MF values influence the training performance based on SGD, Adam, and Nesterov accelerated gradient. Experimental results reveal that increasing MFs is a reliable method for reducing training errors in distributed training. The analysis of convergent conditions in distributed training with consideration of a large batch size and multiple GPUs is presented in this paper.
Yu Tang, Zhigang Kan, Lujia Yin, Zhiquan Lai, Zhaoning Zhang, Linbo Qiao, Dongsheng Li
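The abstract does not give the paper's exact definition of the Momentum-like Factor, but the quantity it modulates is the standard momentum term in the update rules the paper experiments with (SGD with momentum, Nesterov, Adam). For orientation, the plain momentum form is

$$ v_{t+1} = \mu\, v_t - \eta\, \nabla f(\theta_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}, $$

where $\mu$ is the momentum coefficient weighting former gradients and $\eta$ is the learning rate; increasing a momentum-like factor gives past gradients more influence on each parameter update, which is the effect the abstract reports as reducing training errors.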
BADF: Bounding Volume Hierarchies Centric Adaptive Distance Field Computation for Deformable Objects on GPUs
2022
We present a novel algorithm, BADF (Bounding Volume Hierarchy Based Adaptive Distance Fields), for accelerating the construction of ADFs (adaptive distance fields) of rigid and deformable models on graphics processing units. Our approach is based on constructing a bounding volume hierarchy (BVH), and we use that hierarchy to generate an octree-based ADF. We exploit the coherence between successive frames and sort the grid points of the octree to accelerate the computation. Our approach is applicable to rigid and deformable models. Our GPU-based (graphics processing unit based) algorithm is about 20x-50x faster than current mainstream central processing unit based algorithms. Our BADF algorithm can construct the distance fields for deformable models with 60k triangles at interactive rates on an NVIDIA GeForce GTX 1060. Moreover, we observe a 3x speedup over prior GPU-based ADF algorithms.
Xiao-Rui Chen, Min Tang, Cheng Li, Dinesh Manocha, Ruo-Feng Tong
Keywords: OCTREE

Related Authors

张阿漫
Publications: 196; Citations: 1,030; H-index: 17
Affiliation: College of Shipbuilding Engineering, Harbin Engineering University
Research topics: underwater explosion, bubbles, numerical simulation, warships, jets
明付仁
Publications: 26; Citations: 117; H-index: 6
Affiliation: College of Shipbuilding Engineering, Harbin Engineering University
Research topics: underwater explosion, SPH method, SPH, underwater contact explosion, fluid-structure interaction
刘金硕
Publications: 100; Citations: 134; H-index: 7
Affiliation: Wuhan University
Research topics: smart meters, disassembly, machine code, multithreading, text
吴效明
Publications: 373; Citations: 1,065; H-index: 15
Affiliation: South China University of Technology
Research topics: ZigBee, feature extraction, support vector machines, wavelet transform, hemodynamics
褚晓文
Publications: 8; Citations: 19; H-index: 3
Affiliation: Hong Kong Baptist University
Research topics: GPU acceleration, Compute Unified Device Architecture (CUDA), prestack time migration, seismic exploration, GPU