In this example we show how reordering the data can significantly improve the computation speed. In order to do this we will re-work the molecular dynamic example.
The initialization is the same as the molecular dynamic example. The difference are in the parameters. We will use a bigger system, with more particles. The delta time for integration is chosen in order to keep the system stable.
Here we place the particles on a grid like manner
Here we do 30000 MD steps using verlet integrator the cycle is the same as the molecular dynamic example. with the following changes.
Every 200 iterations we reorder the data.
This function reorder the particles internally using an hilbert curve of order m (in this case m=5). More in detail, a cell list is created with \( 2^m \times 2^m \) Cells in 2D or \( 2^m \times 2^m \times 2^m \) Cells in 3D. The particles are reordered in the vector following an Hilbert curve passing through the cells, in the way described in the figure below
+------+------+------+------+ Example of Hilbert curve for m = 2
| | | | |
| 6+---->7 | 10+--->11 |
| ^ | + | ^ | + |
+--|------|------|-------|--+
| + | v | + | v |
| 5 | 8+---->9 | 12 |
| ^ | | | + |
+--|---------------------|--+
| + | | | v |
| 4<----+3 | 14<---+13 |
| | ^ | + | |
+---------|-------|---------+
| | + | v | |
| 1+---->2 | 15+--->16 |
| | | | |
+------+------+------+------+
Suppose now that the particles are ordered like the situation (BEFORE).
After the call to reorder they will be reorder like (AFTER)
BEFORE AFTER
Particles id Cell id Cell
0 1 0 1
1 7 3 1
2 8 16 2
3 1 10 3
4 9 20 3
5 9 18 4
6 6 6 6
7 7 1 7
8 12 7 7
9 10 2 8
10 3 4 9
11 13 5 9
12 13 9 10
13 15 15 10
14 14 8 12
15 10 11 13
16 2 12 13
17 16 14 14
18 4 19 14
19 14 13 15
20 3 2 16
* We cannot explain here what is a cache, but in practice is a fast memory in the CPU able to store chunks of memory. The cache in general is much smaller than RAM, but the big advantage is its speed. Retrieve data from the cache is much faster than RAM. Unfortunately the factors that determine what is on cache and what is not are multiples: Type of cache, algorithm ... . Qualitatively all caches will tend to load chunks of data that you read multiple-time, or chunks of data that probably you will read based on pattern analysis. A small example is a linear memory copy where you read consecutively memory and you write on consecutive memory. Modern CPU recognize such pattern and decide to load on cache the consecutive memory before you actually require it.
Reordering the vector in the way described above has the advantage that when we do computation on particles and its neighborhood consecutively (from the first to the end) we will have the tendency to:
That are the 2 factor required to take advantage of the cache
In order to show in practice what happen we first show the graph when we do not reorder
The measure has oscillation but we see an asymptotic behavior from 0.04 in the initial condition to 0.124 . Below we show what happen when we reorder every 10000 iterations
As we can see the reorder at iteration 0 does not produce any effect or performance improve in force calculation. This is because the particles on a grid are already ordered enough. As soon as the particle move the situation degrade and the calculation on the force increase. At 10000 we try to reorder the particles and as we can see this reduce drastically the time to compute the forces.
This show how the reordering of the data can significantly improve the performance. As soon as the the particles move we see a progressive degrade of the performance. The particles will have the tendency to disorder again. At 20000 we do the same things, and as we can see after reordering we get again a dramatic increase in performance. From this graph we clearly see that wait 10000 iteration is too much for reordering. We have to increase the frequency of reordering. Finding a good frequency depend from
In this configuration
Even if the optimal is an higher frequency, we decide to reorder every 200 iteration. Getting the following graph where the force calculation improve of a factor a little smaller than 2.
In cases where particles move a lot, and so when degradation of performance is fast, consider to use computation reordering
In order to collect the time of the force calculation we insert two timers around the function calc_force. The overall performance is instead calculated with another timer around the time stepping
This code follow the same as the one in molecular dynamic code. The difference is that in this case the output is the computation time of the force for each iteration
At the very end of the program we have always de-initialize the library