Preliminary remarks

The input was agreed to be of the kind:

1 1000000.0 1.E6 40 40
50000 1 10 183 50 1
0.01 0.01 0.3 0.1 0.1 5.0 2.0E-03 1.0 1.0
1 1 1 0 1 1 4 0 0 2
1 0 0 0 2 1 0 0 0 0
1 0 2 0 0 2 0 0 0 2
0 0 2 0 1 0 1 1 0 1
0 0 0 0 0 0 0 0 0 0
1.0E-05 0.01 0.002 10.0 1.0E-06 0.1
2.35 20.0 0.1 0 0.0 0.0 0.0
0.5 0.0 0.0 0.0
0 0.005 -1.0 1.0 5.0 5 0

(shown here for 50K particles). Note that the runs do have a mass spectrum.

The data collected so far (outputs...) is in:
/home/Tux/oporth/bench (Andreas' results as well) and should be readable.
Very useful are the times??k files, which contain the last timing output with the corresponding nbody time added as the first column, and which should look like this:

#TIME PE N ttot treg tirr tpredtot tint tinit tks ttcomm tadj tmov tprednb tsub tsub2 xtsub1 xtsub2
5.0 4 50000 19182.78432 14081.13 148.85 5.41 14625.98 172.21 1625.79 .00 2761.95 183.38 119.19 17.95 16.76 2.46152D+10 2.69353D+10
5.0 8 50000 10652.73794 7068.44 95.55 5.44 7523.57 108.91 1624.99 .00 1398.45 146.83 119.28 23.82 19.85 2.87156D+10 3.14242D+10
...

Then the important plots can be done with the attached routines.
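The attached routines are not reproduced here. As a minimal sketch of how a times??k file of the format shown above could be read, the following Python snippet parses the two sample rows and derives the speedup relative to the first row. The function names and the HEADER list are illustrative assumptions, not part of the attached routines; the only format details relied upon are the column order of the sample header and the Fortran-style "D" exponents in the last two columns.

```python
from typing import Dict, List

# Column names copied from the sample header line (without the leading '#').
HEADER = ("TIME PE N ttot treg tirr tpredtot tint tinit tks ttcomm "
          "tadj tmov tprednb tsub tsub2 xtsub1 xtsub2").split()


def parse_value(token: str) -> float:
    """Convert one field, accepting Fortran 'D' exponents such as 2.46152D+10."""
    return float(token.replace("D", "E").replace("d", "e"))


def parse_times(text: str) -> List[Dict[str, float]]:
    """Parse the content of a times file into one dict per run."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, the '#TIME ...' header and a '...' continuation marker.
        if not line or line.startswith("#") or line.startswith("..."):
            continue
        tokens = line.split()
        if len(tokens) != len(HEADER):
            raise ValueError(f"expected {len(HEADER)} fields, got {len(tokens)}")
        rows.append(dict(zip(HEADER, map(parse_value, tokens))))
    return rows


# The two sample rows from the text above.
SAMPLE = """\
#TIME PE N ttot treg tirr tpredtot tint tinit tks ttcomm tadj tmov tprednb tsub tsub2 xtsub1 xtsub2
5.0 4 50000 19182.78432 14081.13 148.85 5.41 14625.98 172.21 1625.79 .00 2761.95 183.38 119.19 17.95 16.76 2.46152D+10 2.69353D+10
5.0 8 50000 10652.73794 7068.44 95.55 5.44 7523.57 108.91 1624.99 .00 1398.45 146.83 119.28 23.82 19.85 2.87156D+10 3.14242D+10
"""

rows = parse_times(SAMPLE)
base = rows[0]
for row in rows:
    # Speedup and parallel efficiency relative to the first (smallest PE) run.
    speedup = base["ttot"] / row["ttot"]
    efficiency = speedup * base["PE"] / row["PE"]
    print(f"PE={int(row['PE']):4d}  ttot={row['ttot']:10.2f}  "
          f"speedup={speedup:5.2f}  efficiency={efficiency:4.2f}")
```

For a real file, the text would be read with open(path).read() instead of the SAMPLE string.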


Results for different hardware

Cineca

The hardware is described at https://hpc.cineca.it/docs/user-guide-zwiki/SP5UserGuide
It is essentially an IBM Power5 machine with 512 processors in total and a 2 GB/s network. The benchmarks were done by Oliver with the Nbody6++ (Cambody) code.
a)b)
As described in DEISA, the optimal processor number for each particle number N is found at the intersection of the parallelized calculation timing with the communication timing, which can be read off plot a).
The optimal processor number for a given N is shown in plot b).
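The intersection can also be estimated numerically rather than read off by eye. The sketch below is one possible way to do this, assuming the two timings are available as lists sampled at the same processor numbers; it interpolates linearly in log-log space between the two samples that bracket the crossing. The function name crossing_np and the power-law test data are hypothetical and are not benchmark results; which columns of the times files make up the "calculation" and "communication" timings is not specified here and is left to the caller.

```python
import math
from typing import Optional, Sequence


def crossing_np(nps: Sequence[float],
                t_calc: Sequence[float],
                t_comm: Sequence[float]) -> Optional[float]:
    """Estimate the processor number at which t_calc and t_comm intersect.

    The difference of the log-timings is interpolated linearly in log(NP)
    between the two samples that bracket its sign change. All timings must
    be strictly positive. Returns None if the curves do not cross.
    """
    diff = [math.log(c) - math.log(m) for c, m in zip(t_calc, t_comm)]
    for i in range(len(diff) - 1):
        d0, d1 = diff[i], diff[i + 1]
        if d0 == 0.0:
            return float(nps[i])
        if d0 * d1 < 0.0:
            x0, x1 = math.log(nps[i]), math.log(nps[i + 1])
            return math.exp(x0 + (x1 - x0) * d0 / (d0 - d1))
    if diff and diff[-1] == 0.0:
        return float(nps[-1])
    return None


# Hypothetical test data (NOT measurements): calculation time falling as
# 1/NP, communication time rising as sqrt(NP).
nps = [8, 16, 32, 64, 128, 256]
t_calc = [8000.0 / p for p in nps]
t_comm = [5.0 * math.sqrt(p) for p in nps]

np_opt = crossing_np(nps, t_calc, t_comm)
print(f"estimated optimal processor number: {np_opt:.1f}")
```

Interpolating in log-log space is a design choice: timings that follow power laws in NP are straight lines there, so the estimate is exact for such curves and reasonable for measured data that deviates mildly from them.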

c)d)
Plot c) is an example of the relevant timings for N=50K. It is striking that, before the optimal processor number is reached, the (serial) regularization calculation exceeds the (parallelized) force calculations in the timing. This problem is also present for higher particle numbers, as seen in plot d) for N=131K (the wiggles in tks are due to the differing final calculated time in nbody units).
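To illustrate why a serial part overtakes a parallelized part at moderate processor numbers, the following sketch extrapolates the two sample rows from the preliminary remarks (N=50000, PE=4). It assumes, purely for illustration, that treg scales ideally as 1/NP while tks stays constant; whether those sample rows were measured on this machine is not stated, so the resulting number is only an order-of-magnitude estimate. The function name is hypothetical.

```python
def np_where_serial_dominates(np_ref: float,
                              t_parallel_ref: float,
                              t_serial: float) -> float:
    """Processor number at which an ideally scaling parallel timing,
    t_parallel_ref * np_ref / NP, drops to the constant serial timing."""
    return np_ref * t_parallel_ref / t_serial


# Values from the PE=4 sample row: treg = 14081.13, tks = 1625.79.
NP_REF, TREG_REF, TKS = 4, 14081.13, 1625.79

np_cross = np_where_serial_dominates(NP_REF, TREG_REF, TKS)
print(f"tks would exceed treg beyond NP ~ {np_cross:.0f}")

# Extrapolated treg under the ideal-scaling assumption, next to constant tks.
for np_ in (4, 8, 16, 32, 64, 128):
    treg_est = TREG_REF * NP_REF / np_
    print(f"NP={np_:4d}  treg~{treg_est:9.2f}  tks~{TKS:8.2f}")
```

Under these assumptions the serial timing takes over at a few tens of processors, which is the qualitative behaviour described for plot c).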

(O.P. 13.02.08)

Louhi

The benchmarks were done by Andreas with his Nbody6GC Code.

Concerning the parallelized computation and communication one gets:

a)b)
c)
c) shows that the timing is dominated by the tprednb calculation, which might be due to a peculiarity of Andreas' code. Additionally, the tks timing yields implausibly large values (>ttot) and cannot be used.
(O.P. 13.02.08)

Jump

The benchmarks were done by Andreas with his Nbody6GC Code.

Concerning the parallelized computation and communication one gets:
a)b)
c)
The tks timing is indeed negligible (below 1 sec and thus not in plot c)).

Jubl

The benchmarks were done by Andreas with probably the Nbody6++ Code. The Jubl machine is a Blue Gene with 2^14 cores.

a)b)
c)

Huygens

The hardware is described at https://subtrac.sara.nl/userdoc/wiki/huygens/description. It has a heterogeneous structure with 16 cores per node. The nodes are interconnected with an Infiniband network providing an MPI latency of 4.5 microseconds and an MPI bandwidth of 1.2 GB/s between neighbouring nodes. The benchmarks were done by Oliver with the Nbody6++ code (Marc-André's package, see Compiling problems).
a)b)
Scaling in the communication is a bit strange, probably due to the hierarchical topology. The optimal processor number for a given N is shown in plot b).
c)
Plot c) is an example of the relevant timings for N=50K. The optimal processor number for N=50K is probably between NP=128 and NP=256, which would explain the rise in t_calc. A simulation with NP=192 will clarify the case and yield an additional point for plot b).
(O.P.)

RZG

Information about the hardware can be found at http://www.rzg.mpg.de/computing/hardware/power4
The benchmarks were done by José with the Cambody version of Nbody6++.
a)b)
Relevant timings for N=50K:
c)

LRZ

Information about the hardware can be found at
http://www.lrz-muenchen.de/services/compute/hlrb/
The benchmarks were done by José with the Cambody version of Nbody6++.
a)b)
Relevant timings for N=50K:
c)

The optimal processor numbers

A comparison of the b)-type plots is now given.

