Molecular Dynamic with Lennard-Jones potential on GPU

This Molecular dynamic simulation is exactly the same as Molecular Dynamic with Lennard-Jones potential , with the difference that it work on Nvidia CUDA GPU and for this reason we use one milions particles. Before starting with this example, we suggest to start from the example GPU first steps and Molecular Dynamic with Lennard-Jones potential .

CUDA Kernels

In the CPU example we have 4 loops

One loop to calculate forces
One loop to do position and velocity integration as first step of the Verlet integration
One loop to do velocity integration as second step of the Verlet integration
One loop to calculate the total energy of the system

All these loops must be converted into CUDA Kernels and launch the cuda kernel.

Calculate forces

The calculate forces function now has a call to the function getDomainIteratorGPU() , this function is equivalent to getDomainIterator(). What it return is struct with a convenient division in workgroups (wthr ) and threads (thr ) in a workgroups, to iterate across all the domain particles. The cuda kernel launch is a standard NVIDIA CUDA launch kernel, where if we want to pass the distributed data-structure to the kernel we have to remember to use the function toKernel().

 
template<typename CellList> void calc_forces(vector_dist_gpu<3,real_number, aggregate<real_number[3],real_number[3],real_number> > & vd, CellList & NN, real_number sigma12, real_number sigma6, real_number r_cut2)
{
    vd.updateCellList(NN);
 
    // Get an iterator over particles
    auto it2 = vd.getDomainIteratorGPU();
 
    CUDA_LAUNCH(calc_force_gpu,it2,vd.toKernel(),NN.toKernel(),sigma12,sigma6,r_cut2);
}
 

The kernel is templated on all the parameters passed with toKernel(). This is not strictly necessary but is convenient because we let the compiler to infer the type returned by toKernel() without write the argument type explicitly. To get the particle index we use the macro GET_PARTICLE this macro simply retrieve the particle id from the CUDA block id and thread is + it avoid that the particle index does not overflow (The macro is simple to follow and is located in the file src/Vector/vector_dist_kernel.hpp). The rest of the kernel is more or less a copy paste of the CPU iteration code

 
template<typename vector_dist_type,typename NN_type>
__global__ void calc_force_gpu(vector_dist_type vd, NN_type NN, real_number sigma12, real_number sigma6, real_number r_cut2)
{
    auto p = GET_PARTICLE(vd);
 
    // Get the position xp of the particle
    Point<3,real_number> xp = vd.getPos(p);
 
    // Reset the force counter
    vd.template getProp<force>(p)[0] = 0.0;
    vd.template getProp<force>(p)[1] = 0.0;
    vd.template getProp<force>(p)[2] = 0.0;
 
 
 
    // Get an iterator over the neighborhood particles of p
    auto Np = NN.getNNIterator(NN.getCell(vd.getPos(p)));
 
    // For each neighborhood particle ...
    while (Np.isNext())
    {
        // ... q
        auto q = Np.get();
 
        // if (p == q) skip this particle
        if (q == p) {++Np; continue;};
 
        // Get the position of p
        Point<3,real_number> xq = vd.getPos(q);
 
        // Get the distance between p and q
        Point<3,real_number> r = xp - xq;
 
        // take the norm of this vector
        real_number rn = norm2(r);
 
        if (rn > r_cut2)
        {++Np; continue;};
 
        // Calculate the force, using pow is slower
        Point<3,real_number> f = 24.0*(2.0 *sigma12 / (rn*rn*rn*rn*rn*rn*rn) -  sigma6 / (rn*rn*rn*rn)) * r;
 
        // we sum the force produced by q on p
        vd.template getProp<force>(p)[0] += f.get(0);
        vd.template getProp<force>(p)[1] += f.get(1);
        vd.template getProp<force>(p)[2] += f.get(2);
 
        // Next neighborhood
        ++Np;
    }
}
 

For the two kernels that compose the 2 steps verlet integration, we simply followed the steps we described above. More interesting is instead the calculation of the total energy of the system. In this case we have to do a full reduction, or accumulating the energy of all the particles. In the CPU based code we used an external accumulator and we loop on all the particles summing Kinetics energy and potential energy. On GPU this is not possible because such operation is serial and would not be fast on GPU. To bypass this problem we have to first calculate the Kinetic energy + the potential energy for each particles in a kernel. This kernel is very similar to the CPU code with the difference that we do not accumulate on an external variable, but we store in a property per particle.

 
template<typename vector_dist_type,typename NN_type>
__global__ void particle_energy(vector_dist_type vd, NN_type NN, real_number sigma12, real_number sigma6, real_number shift, real_number r_cut2)
{
    auto p = GET_PARTICLE(vd);
 
    // Get the position of the particle p
    Point<3,real_number> xp = vd.getPos(p);
 
    // Get an iterator over the neighborhood of the particle p
    auto Np = NN.getNNIterator(NN.getCell(vd.getPos(p)));
 
    real_number E = 0;
 
    // For each neighborhood of the particle p
    while (Np.isNext())
    {
        // Neighborhood particle q
        auto q = Np.get();
 
        // if p == q skip this particle
        if (q == p) {++Np; continue;};
 
        // Get position of the particle q
        Point<3,real_number> xq = vd.getPos(q);
 
        // take the normalized direction
        real_number rn = norm2(xp - xq);
 
        if (rn > r_cut2)
        {++Np;continue;}
 
        // potential energy (using pow is slower)
        E += 2.0 * ( sigma12 / (rn*rn*rn*rn*rn*rn) - sigma6 / ( rn*rn*rn) ) - shift;
 
        // Next neighborhood
        ++Np;
    }
 
    // Kinetic energy of the particle given by its actual speed
    vd.template getProp<energy>(p) = E + (vd.template getProp<velocity>(p)[0]*vd.template getProp<velocity>(p)[0] +
            vd.template getProp<velocity>(p)[1]*vd.template getProp<velocity>(p)[1] +
            vd.template getProp<velocity>(p)[2]*vd.template getProp<velocity>(p)[2]) / 2;
}
 

After that, we can use the function reduce_local<energy,_add_>(vd) to operate a parallel reduction on GPU of the property energy on the vector_dist_gpu vd (the standard accumulation operation is the sum).

Note: reduce_local does only a local (within one GPU) reduction

 
    // Calculated energy
    return reduce_local<energy,_add_>(vd);

The rest remain mostly the same with the CPU code, with the exception that in the main loop where particle redistribution and ghost fill happen directly on GPU we use the option RUN_ON_DEVICE .

 
        // Because we moved the particles in space we have to map them and re-sync the ghost
        vd.map(RUN_ON_DEVICE);
        vd.template ghost_get<>(RUN_ON_DEVICE);
 

Every time we have to write a file we have to offload from GPU memory to host memory

 
            vd.deviceToHostPos();
            vd.deviceToHostProp<0,1,2>();
 
            // We write the particle position for visualization (Without ghost)
            vd.deleteGhost();
            vd.write_frame("particles_",f);
 

Table of Contents

Molecular Dynamic with Lennard-Jones potential on GPU

CUDA Kernels

Calculate forces