This molecular dynamics simulation is the same as the Molecular Dynamics with Lennard-Jones potential example, with the difference that it runs on NVIDIA CUDA GPUs; for this reason we can use one million particles. Before starting with this example, we suggest working through the examples GPU first steps and Molecular Dynamics with Lennard-Jones potential first.
In the CPU example we have four loops over the particles: the force calculation, the two steps of the Verlet integration, and the total-energy calculation.
All of these loops must be converted into CUDA kernels, and each kernel must be launched.
The calculate-forces function now contains a call to the function getDomainIteratorGPU(), which is the GPU equivalent of getDomainIterator(). It returns a struct with a convenient division into workgroups (wthr) and threads per workgroup (thr) for iterating across all the domain particles. The CUDA kernel launch is a standard NVIDIA CUDA kernel launch; if we want to pass a distributed data structure to the kernel, we have to remember to call toKernel() on it.
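As a sketch of this pattern (the kernel name calc_force_gpu and its extra arguments are illustrative, not necessarily the names used in the example source):

```cpp
// Sketch only: calc_force_gpu and the trailing arguments are
// illustrative names, not taken from the example source.
auto it = vd.getDomainIteratorGPU();   // workgroup/thread decomposition

// it.wthr = number of workgroups, it.thr = threads per workgroup.
// toKernel() produces a lightweight view of vd usable inside a kernel.
calc_force_gpu<<<it.wthr, it.thr>>>(vd.toKernel(), NN.toKernel(), sigma12, sigma6, r_cut2);
```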
The kernel is templated on all the parameters passed with toKernel(). This is not strictly necessary, but it is convenient because it lets the compiler infer the type returned by toKernel() without writing the argument type explicitly. To get the particle index we use the macro GET_PARTICLE. This macro simply retrieves the particle id from the CUDA block id and thread id, and it also guards against the particle index overflowing past the number of particles (the macro is simple to follow and is located in the file src/Vector/vector_dist_kernel.hpp). The rest of the kernel is more or less a copy-paste of the CPU iteration code.
For the two kernels that compose the two-step Verlet integration, we simply followed the steps described above. More interesting is the calculation of the total energy of the system. In this case we have to do a full reduction, i.e. accumulate the energy of all the particles. In the CPU-based code we used an external accumulator and looped over all the particles, summing kinetic and potential energy. On the GPU this is not possible, because such an operation is serial and would not be fast. To bypass this problem we first calculate the kinetic plus potential energy of each particle in a kernel. This kernel is very similar to the CPU code, with the difference that we do not accumulate into an external variable, but store the result in a per-particle property.
After that, we can use the function reduce_local<energy,_add_>(vd) to perform a parallel reduction on the GPU of the property energy on the vector_dist_gpu vd (the standard accumulation operation is the sum).
The rest remains mostly the same as the CPU code, with the exception that in the main loop, where particle redistribution and ghost filling happen directly on the GPU, we use the option RUN_ON_DEVICE.
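A sketch of those two calls in the main loop (the property list passed to ghost_get is illustrative; check the example source for the exact one):

```cpp
// Sketch: redistribution and ghost fill stay on the GPU.
vd.map(RUN_ON_DEVICE);            // redistribute particles across processors on the device
vd.ghost_get<>(RUN_ON_DEVICE);    // fill the ghost layer on the device
```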
Every time we have to write a file, we have to offload the data from GPU memory to host memory.
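A sketch of the offload step before writing (the property names and the write call are illustrative; the deviceToHost* calls are the kind of transfer functions expected here, but the exact ones used are in the example source):

```cpp
// Sketch: copy positions and the needed properties back to the host
// before writing. Property names and file name are illustrative.
vd.deviceToHostPos();
vd.deviceToHostProp<velocity, force>();
vd.write_frame("particles", i);
```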