DOWNLOAD PRESENTATION
WATCH VIDEOIn this work, we present different implementations of the Lattice Boltzmann Method algorithm for fluid dynamics simulation, parallelized from a single sequential implementation. We discuss versions of the algorithm that adapt to diverse CPU-GPU hardware as well as versions fine-tuned for a GPU cluster. We compare different programming approaches, compiler optimizations and scheduling techniques for the LBM implementation. We successfully reduce the execution time of the simulation from 5 days on a single CPU node to just 200 seconds on a single GPU node. Lattice Boltzmann Method (LBM) simulation is a widely used technique to observe the free flow of fluid through porous media. The oil and gas industries use this technique to measure the porosity of a rock. Porosity of the rock is a detrimental factor in choosing a well for oil and natural gas extraction. The process of choosing a well involves making a preliminary analysis of the rock on the field and then further processing it in the research lab for greater accuracy. The field version of LBM is usually run on a single desktop system available and should be portable with reasonable accuracy. The LBM implementation for the research laboratory must be fine-tuned to a particular research cluster and must achieve maximum performance and precision. LBM simulation is computationally intensive. For certain rocks, computing a 300x300x300 grid is equivalent to simulating a 3000x3000x3000 cube of atoms and takes around 5 days to measure the porosity of the rock with reasonable accuracy. Recently, high performance heterogeneous architectures have become ubiquitous and are being widely adopted by many industries including oil and gas industries. However, extracting the maximum performance on these heterogeneous architectures is non-trivial and requires rigorous training. In our work, we show simple yet powerful approaches and methods to take advantage of these modern heterogeneous architectures and achieve good performance. LBM has two main kernels: collision and propagation. The sequential version of the LBM code takes 20 seconds per iteration for a grid size of 300x300x300 of single precision data values on a single CPU node. One of our optimizations merges the collision and propagation kernels into a single loop in order to avoid loading and storing from memory the entire grid between the invocations of the kernels. Merging the kernels reduces the time per iteration to 12 seconds. Next, we employ array linearization and loop normalization optimizations; reducing the time per step to 6.1 seconds. Finally, we fine-tune the application by building a sparse version of the merged collision and propagation kernels. An OpenMP version of this optimized kernel takes 1 second per time step on a 12-core CPU node while an OpenCL version takes 0.02 seconds per time step on a single GPU node. We implemented the cluster version of LBM using MPI + OpenCL. The cluster implementation enables LBM simulation over very large grid sizes. Further, we experiment with different optimizations using Habanero-C, a portable language for heterogeneous (CPU-GPU) architectures. The "forasync" construct in Habanero-C is compiled down to OpenCL. Merging the collision and propagation kernels requires a shadow copy of the grid to store the new values computed in each time step. This poses a constraint on the grid size of the simulation since the GPU memory is often limited on a single device. When the entire grid does not fit in the available GPU memory, we partition the grid and perform the computation on one partition at a time. This requires data to be copied to and from the device. We overlap the communication of data with computation of the kernel to hide the data-copying overhead. The propagation kernel has conditional statements to handle the boundary cells. We strip-mine the kernel into computation on the inner grid and computation on the boundaries. The inner grid, which is free of conditionals, is efficiently executed on the GPU while the boundary cells, which have conditionals, are executed on the CPU. We also experiment by changing the data layout of the grid and observe the performance on various heterogeneous hardware. We conclude with a discussion of the performance of our implementations on recent heterogeneous (CPU-GPU) devices from AMD, Intel and NVIDIA. We are in the process of evaluating the performance of these versions on various heterogeneous hardware including AMD APU/ discrete devices, Intel IVB device and Nvidia Fermi and Kepler devices.