OpenFPM_pdata  4.1.0
Project that contain the implementation of distributed structures
 
Loading...
Searching...
No Matches
Vector 4 data-reordering and cache friendliness

Data reordering, computation-reordering and cache friendly computation

In this example we show how reordering the data can significantly improve the computation speed. In order to do this we will re-work the molecular dynamic example.

Initialization

The initialization is the same as the molecular dynamic example. The difference are in the parameters. We will use a bigger system, with more particles. The delta time for integration is chosen in order to keep the system stable.

See also
Initialization
double dt = 0.0001;
double r_cut = 0.03;
double sigma = r_cut/3.0;
double sigma12 = pow(sigma,12);
double sigma6 = pow(sigma,6);
openfpm_init(&argc,&argv);
Vcluster<> & v_cl = create_vcluster();
// we will use it do place particles on a 10x10x10 Grid like
size_t sz[3] = {40,40,40};
// domain
Box<3,double> box({0.0,0.0,0.0},{1.0,1.0,1.0});
// Boundary conditions
size_t bc[3]={PERIODIC,PERIODIC,PERIODIC};
// ghost, big enough to contain the interaction radius
Ghost<3,double> ghost(r_cut);
// Create vector
This class represent an N-dimensional box.
Definition Box.hpp:61
Implementation of VCluster class.
Definition VCluster.hpp:59
Implementation of 1-D std::vector like structure.
Distributed vector.

Particles on a grid like position

Here we place the particles on a grid like manner

See also
Particles on a grid like position
auto it = vd.getGridIterator(sz);
while (it.isNext())
{
vd.add();
auto key = it.get();
vd.getLastPos()[0] = key.get(0) * it.getSpacing(0);
vd.getLastPos()[1] = key.get(1) * it.getSpacing(1);
vd.getLastPos()[2] = key.get(2) * it.getSpacing(2);
vd.template getLastProp<velocity>()[0] = 0.0;
vd.template getLastProp<velocity>()[1] = 0.0;
vd.template getLastProp<velocity>()[2] = 0.0;
vd.template getLastProp<force>()[0] = 0.0;
vd.template getLastProp<force>()[1] = 0.0;
vd.template getLastProp<force>()[2] = 0.0;
++it;
}

Molecular dynamic steps

Here we do 30000 MD steps using verlet integrator the cycle is the same as the molecular dynamic example. with the following changes.

Reordering

Every 200 iterations we reorder the data.

if (i % 200 == 0)
vd.reorder(5);

This function reorder the particles internally using an hilbert curve of order m (in this case m=5). More in detail, a cell list is created with \( 2^m \times 2^m \) Cells in 2D or \( 2^m \times 2^m \times 2^m \) Cells in 3D. The particles are reordered in the vector following an Hilbert curve passing through the cells, in the way described in the figure below

+------+------+------+------+     Example of Hilbert curve for m = 2
|      |      |      |      |
|  6+---->7   |  10+--->11  |
|  ^   |  +   |  ^   |   +  |
+--|------|------|-------|--+
|  +   |  v   |  +   |   v  |
|  5   |  8+---->9   |  12  |
|  ^   |      |      |   +  |
+--|---------------------|--+
|  +   |      |      |   v  |
|  4<----+3   |  14<---+13  |
|      |  ^   |   +  |      |
+---------|-------|---------+
|      |  +   |   v  |      |
|  1+---->2   |  15+--->16  |
|      |      |      |      |
+------+------+------+------+


 Suppose now that the particles are ordered like the situation (BEFORE).
 After the call to reorder they will be reorder like (AFTER)


               BEFORE                                AFTER

Particles   id      Cell                         id        Cell
             0         1                          0           1
             1         7                          3           1
             2         8                         16           2
             3         1                         10           3
             4         9                         20           3
             5         9                         18           4
             6         6                          6           6
             7         7                          1           7
             8        12                          7           7
             9        10                          2           8
            10         3                          4           9
            11        13                          5           9
            12        13                          9          10
            13        15                         15          10
            14        14                          8          12
            15        10                         11          13
            16         2                         12          13
            17        16                         14          14
            18         4                         19          14
            19        14                         13          15
            20         3                          2          16



 * 

We cannot explain here what is a cache, but in practice is a fast memory in the CPU able to store chunks of memory. The cache in general is much smaller than RAM, but the big advantage is its speed. Retrieve data from the cache is much faster than RAM. Unfortunately the factors that determine what is on cache and what is not are multiples: Type of cache, algorithm ... . Qualitatively all caches will tend to load chunks of data that you read multiple-time, or chunks of data that probably you will read based on pattern analysis. A small example is a linear memory copy where you read consecutively memory and you write on consecutive memory. Modern CPU recognize such pattern and decide to load on cache the consecutive memory before you actually require it.

Reordering the vector in the way described above has the advantage that when we do computation on particles and its neighborhood consecutively (from the first to the end) we will have the tendency to:

  • read always the same particles
  • read the memory more consecutively or in predictable way

That are the 2 factor required to take advantage of the cache

if (i % 200 == 0)
vd.reorder(5);

In order to show in practice what happen we first show the graph when we do not reorder

The measure has oscillation but we see an asymptotic behavior from 0.04 in the initial condition to 0.124 . Below we show what happen when we reorder every 10000 iterations

As we can see the reorder at iteration 0 does not produce any effect or performance improve in force calculation. This is because the particles on a grid are already ordered enough. As soon as the particle move the situation degrade and the calculation on the force increase. At 10000 we try to reorder the particles and as we can see this reduce drastically the time to compute the forces.

Note
There is still a gap between the initial condition and the reordered situation. This is because the particles from the initial conditions are getting near. This increase the average number of neighborhood per particle and so the overall computational cost of the force calculation

This show how the reordering of the data can significantly improve the performance. As soon as the the particles move we see a progressive degrade of the performance. The particles will have the tendency to disorder again. At 20000 we do the same things, and as we can see after reordering we get again a dramatic increase in performance. From this graph we clearly see that wait 10000 iteration is too much for reordering. We have to increase the frequency of reordering. Finding a good frequency depend from

  • How fast the performance degrade (Or how fast particles move in a disordered way)
  • How reorder is heavy compared to the full iteration

In this configuration

  • Every 600 time step we get around 10% performance degrade
  • Reordering is in general one order of magnitude than the calculation of the force

Even if the optimal is an higher frequency, we decide to reorder every 200 iteration. Getting the following graph where the force calculation improve of a factor a little smaller than 2.

In cases where particles move a lot, and so when degradation of performance is fast, consider to use computation reordering

See also
Computation-reordering and cache friendly computation

Timers

In order to collect the time of the force calculation we insert two timers around the function calc_force. The overall performance is instead calculated with another timer around the time stepping

timer time;
if (i % 10 == 0)
time.start();
Class for cpu time benchmarking.
Definition timer.hpp:28
void start()
Start the timer.
Definition timer.hpp:90
if (i % 10 == 0)
{
time.stop();
x.add(i);
y.add({time.getwct()});
}
void stop()
Stop the timer.
Definition timer.hpp:119
double getwct()
Return the elapsed real time.
Definition timer.hpp:130
See also
Molecular dynamic steps
// Get the Cell list structure
auto NN = vd.getCellList(r_cut);
// calculate forces
calc_forces(vd,NN,sigma12,sigma6);
unsigned long int f = 0;
timer time2;
time2.start();
#ifndef TEST_RUN
size_t Nstep = 30000;
#else
size_t Nstep = 300;
#endif
// MD time stepping
for (size_t i = 0; i < Nstep ; i++)
{
// Get the iterator
auto it3 = vd.getDomainIterator();
// integrate velicity and space based on the calculated forces (Step1)
while (it3.isNext())
{
auto p = it3.get();
// here we calculate v(tn + 0.5)
vd.template getProp<velocity>(p)[0] += 0.5*dt*vd.template getProp<force>(p)[0];
vd.template getProp<velocity>(p)[1] += 0.5*dt*vd.template getProp<force>(p)[1];
vd.template getProp<velocity>(p)[2] += 0.5*dt*vd.template getProp<force>(p)[2];
// here we calculate x(tn + 1)
vd.getPos(p)[0] += vd.template getProp<velocity>(p)[0]*dt;
vd.getPos(p)[1] += vd.template getProp<velocity>(p)[1]*dt;
vd.getPos(p)[2] += vd.template getProp<velocity>(p)[2]*dt;
++it3;
}
// Because we mooved the particles in space we have to map them and re-sync the ghost
vd.map();
vd.template ghost_get<>();
if (i % 200 == 0)
vd.reorder(5);
timer time;
if (i % 10 == 0)
time.start();
// calculate forces or a(tn + 1) Step 2
calc_forces(vd,NN,sigma12,sigma6);
if (i % 10 == 0)
{
time.stop();
x.add(i);
y.add({time.getwct()});
}
// Integrate the velocity Step 3
auto it4 = vd.getDomainIterator();
while (it4.isNext())
{
auto p = it4.get();
// here we calculate v(tn + 1)
vd.template getProp<velocity>(p)[0] += 0.5*dt*vd.template getProp<force>(p)[0];
vd.template getProp<velocity>(p)[1] += 0.5*dt*vd.template getProp<force>(p)[1];
vd.template getProp<velocity>(p)[2] += 0.5*dt*vd.template getProp<force>(p)[2];
++it4;
}
// After every iteration collect some statistic about the confoguration
if (i % 100 == 0)
{
// We calculate the energy
double energy = calc_energy(vd,NN,sigma12,sigma6);
auto & vcl = create_vcluster();
vcl.sum(energy);
vcl.execute();
// We also print on terminal the value of the energy
// only one processor (master) write on terminal
if (vcl.getProcessUnitID() == 0)
std::cout << "Energy: " << energy << std::endl;
f++;
}
}
time2.stop();
std::cout << "Performance: " << time2.getwct() << std::endl;

Plotting graphs

This code follow the same as the one in molecular dynamic code. The difference is that in this case the output is the computation time of the force for each iteration

See also
Plotting graphs
// Google charts options, it store the options to draw the X Y graph
GCoptions options;
// Title of the graph
options.title = std::string("Force calculation time");
// Y axis name
options.yAxis = std::string("Time");
// X axis name
options.xAxis = std::string("iteration");
// width of the line
options.lineWidth = 1.0;
// Object that draw the X Y graph
// Add the graph
// The graph that it produce is in svg format that can be opened on browser
cg.AddLinesGraph(x,y,options);
// Write into html format
cg.write("gc_plot2_out.html");
Small class to produce graph with Google chart in HTML.
void write(std::string file)
It write the graphs on file in html format using Google charts.
void AddLinesGraph(openfpm::vector< X > &x, openfpm::vector< Y > &y, const GCoptions &opt)
Add a simple lines graph.
Google chart options.
std::string xAxis
X axis name.
size_t lineWidth
Width of the line.
std::string title
Title of the chart.
std::string yAxis
Y axis name.

Finalize

At the very end of the program we have always de-initialize the library

openfpm_finalize();

Full code

#include "Vector/vector_dist.hpp"
#include "data_type/aggregate.hpp"
#include "Plot/GoogleChart.hpp"
#include "Plot/util.hpp"
#include "timer.hpp"
#include "energy_force.hpp"
int main(int argc, char* argv[])
{
double dt = 0.0001;
double r_cut = 0.03;
double sigma = r_cut/3.0;
double sigma12 = pow(sigma,12);
double sigma6 = pow(sigma,6);
openfpm_init(&argc,&argv);
Vcluster<> & v_cl = create_vcluster();
// we will use it do place particles on a 10x10x10 Grid like
size_t sz[3] = {40,40,40};
// domain
Box<3,double> box({0.0,0.0,0.0},{1.0,1.0,1.0});
// Boundary conditions
size_t bc[3]={PERIODIC,PERIODIC,PERIODIC};
// ghost, big enough to contain the interaction radius
Ghost<3,double> ghost(r_cut);
// Create vector
auto it = vd.getGridIterator(sz);
while (it.isNext())
{
vd.add();
auto key = it.get();
vd.getLastPos()[0] = key.get(0) * it.getSpacing(0);
vd.getLastPos()[1] = key.get(1) * it.getSpacing(1);
vd.getLastPos()[2] = key.get(2) * it.getSpacing(2);
vd.template getLastProp<velocity>()[0] = 0.0;
vd.template getLastProp<velocity>()[1] = 0.0;
vd.template getLastProp<velocity>()[2] = 0.0;
vd.template getLastProp<force>()[0] = 0.0;
vd.template getLastProp<force>()[1] = 0.0;
vd.template getLastProp<force>()[2] = 0.0;
++it;
}
// Get the Cell list structure
auto NN = vd.getCellList(r_cut);
// calculate forces
calc_forces(vd,NN,sigma12,sigma6);
unsigned long int f = 0;
timer time2;
time2.start();
#ifndef TEST_RUN
size_t Nstep = 30000;
#else
size_t Nstep = 300;
#endif
// MD time stepping
for (size_t i = 0; i < Nstep ; i++)
{
// Get the iterator
auto it3 = vd.getDomainIterator();
// integrate velicity and space based on the calculated forces (Step1)
while (it3.isNext())
{
auto p = it3.get();
// here we calculate v(tn + 0.5)
vd.template getProp<velocity>(p)[0] += 0.5*dt*vd.template getProp<force>(p)[0];
vd.template getProp<velocity>(p)[1] += 0.5*dt*vd.template getProp<force>(p)[1];
vd.template getProp<velocity>(p)[2] += 0.5*dt*vd.template getProp<force>(p)[2];
// here we calculate x(tn + 1)
vd.getPos(p)[0] += vd.template getProp<velocity>(p)[0]*dt;
vd.getPos(p)[1] += vd.template getProp<velocity>(p)[1]*dt;
vd.getPos(p)[2] += vd.template getProp<velocity>(p)[2]*dt;
++it3;
}
// Because we mooved the particles in space we have to map them and re-sync the ghost
vd.map();
vd.template ghost_get<>();
if (i % 200 == 0)
vd.reorder(5);
timer time;
if (i % 10 == 0)
time.start();
// calculate forces or a(tn + 1) Step 2
calc_forces(vd,NN,sigma12,sigma6);
if (i % 10 == 0)
{
time.stop();
x.add(i);
y.add({time.getwct()});
}
// Integrate the velocity Step 3
auto it4 = vd.getDomainIterator();
while (it4.isNext())
{
auto p = it4.get();
// here we calculate v(tn + 1)
vd.template getProp<velocity>(p)[0] += 0.5*dt*vd.template getProp<force>(p)[0];
vd.template getProp<velocity>(p)[1] += 0.5*dt*vd.template getProp<force>(p)[1];
vd.template getProp<velocity>(p)[2] += 0.5*dt*vd.template getProp<force>(p)[2];
++it4;
}
// After every iteration collect some statistic about the confoguration
if (i % 100 == 0)
{
// We calculate the energy
double energy = calc_energy(vd,NN,sigma12,sigma6);
auto & vcl = create_vcluster();
vcl.sum(energy);
vcl.execute();
// We also print on terminal the value of the energy
// only one processor (master) write on terminal
if (vcl.getProcessUnitID() == 0)
std::cout << "Energy: " << energy << std::endl;
f++;
}
}
time2.stop();
std::cout << "Performance: " << time2.getwct() << std::endl;
// Google charts options, it store the options to draw the X Y graph
GCoptions options;
// Title of the graph
options.title = std::string("Force calculation time");
// Y axis name
options.yAxis = std::string("Time");
// X axis name
options.xAxis = std::string("iteration");
// width of the line
options.lineWidth = 1.0;
// Object that draw the X Y graph
// Add the graph
// The graph that it produce is in svg format that can be opened on browser
cg.AddLinesGraph(x,y,options);
// Write into html format
cg.write("gc_plot2_out.html");
openfpm_finalize();
}