The input was agreed to be of the kind:
(here for 50K particles) The runs do have a mass spectrum!
The data collected so far (outputs...) is in:
/home/Tux/oporth/bench (including Andreas' results) and should be readable.
Very useful are the times??k files, which contain the last timing output with the corresponding N-body time added in the first row. They should look like this:
The important plots can then be produced with the attached routines.
The hardware is described at https://hpc.cineca.it/docs/user-guide-zwiki/SP5UserGuide
It is essentially an IBM Power5 machine with 512 processors in total and a 2 GB/s network. The benchmarks were done by Oliver with the Nbody6++ (Cambody) code.
a)
b)
As described in DEISA, the optimal processor number for each particle number N is found at the intersection of the parallelized calculation timing with the communication timing, which can be read off plot a).
The optimal processor number for a given N is shown in plot b).
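The intersection read off plot a) can also be estimated numerically. The following is only a minimal sketch, assuming the calculation and communication timings are available as plain lists per processor number; the function name `crossover_np` and all numbers are made up for illustration and are not measured benchmark data. It interpolates linearly in log-log space between the two measurements that bracket the crossing.

```python
import math

def crossover_np(nps, t_calc, t_comm):
    """Estimate the processor number where t_calc and t_comm intersect.

    Interpolates linearly in log-log space between the two neighbouring
    measurements that bracket the sign change of log(t_calc) - log(t_comm).
    Returns None if the curves do not cross within the measured range.
    """
    diffs = [math.log(c) - math.log(m) for c, m in zip(t_calc, t_comm)]
    for i in range(len(nps) - 1):
        d0, d1 = diffs[i], diffs[i + 1]
        if d0 == 0:
            return float(nps[i])          # curves meet exactly at a data point
        if d0 * d1 < 0:
            frac = d0 / (d0 - d1)         # position of the zero between i and i+1
            x0, x1 = math.log(nps[i]), math.log(nps[i + 1])
            return math.exp(x0 + frac * (x1 - x0))
    if diffs[-1] == 0:
        return float(nps[-1])
    return None

# Made-up illustrative timings in seconds (NOT benchmark results):
nps = [8, 16, 32, 64, 128]
t_calc = [800.0, 400.0, 200.0, 100.0, 50.0]   # falls like 1/NP
t_comm = [25.0, 50.0, 100.0, 200.0, 400.0]    # grows like NP
np_opt = crossover_np(nps, t_calc, t_comm)
print(f"estimated optimal NP: {np_opt:.1f}")
```

Interpolating in log-log space is a design choice here: timing curves are usually plotted on logarithmic axes and are close to power laws between neighbouring points, so the estimate matches what one would read off the plot by eye.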
c)
d)
Plot c) is an example of the relevant timings for N=50K. It is striking that, before the optimal processor number is reached, the (serial) regularization calculation already exceeds the (parallelized) force calculations in the timing. This problem is also present for higher particle numbers, as seen in plot d) for N=131K (the wiggles in tks are due to the differing final calculated time in N-body units).
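Why a serial part overtakes a parallelized part can be illustrated with a simple Amdahl-type model. This is a rough sketch under the assumption that the force calculation scales ideally as 1/NP and that the regularization time is independent of NP; neither assumption is taken from the measurements, and the function names and numbers are purely illustrative.

```python
def total_time(n_proc, t_serial, t_parallel_1):
    """Run time with a fixed serial part and an ideally scaling parallel part.

    t_serial     : time of the serial part (e.g. regularization), in seconds
    t_parallel_1 : time of the parallel part on a single processor, in seconds
    """
    return t_serial + t_parallel_1 / n_proc

def np_serial_dominates(t_serial, t_parallel_1):
    """Processor number above which the serial part exceeds the parallel part."""
    return t_parallel_1 / t_serial

# Made-up illustrative values (NOT benchmark results):
t_serial = 50.0         # seconds, assumed independent of NP
t_parallel_1 = 3200.0   # seconds on one processor
np_limit = np_serial_dominates(t_serial, t_parallel_1)
print(f"serial part dominates for NP > {np_limit:.0f}")
print(f"total time at that NP: {total_time(np_limit, t_serial, t_parallel_1):.0f} s")
```

In this model, adding processors beyond `np_limit` can at best halve the remaining run time, which is why a serial bottleneck below the optimal processor number matters.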
(O.P. 13.02.08)
The benchmarks were done by Andreas with his Nbody6GC Code.
Concerning the parallelized computation and communication one gets:
a)
b)
c)
Plot c) shows that the timing is dominated by the tprednb calculation, which might be due to a peculiarity of Andreas' code. Additionally, the tks timing yields implausibly large values (>ttot) and cannot be used.
(O.P. 13.02.08)
The benchmarks were done by Andreas with his Nbody6GC Code.
Concerning the parallelized computation and communication one gets:
a)
b)
c)
The tks timing is indeed negligible (below 1 sec and thus not in plot c)).
The benchmarks were done by Andreas, probably with the Nbody6++ code. The Jubl machine is a Blue Gene with 2^14 cores.
a)
b)
c)
The hardware is described at https://subtrac.sara.nl/userdoc/wiki/huygens/description. It has a heterogeneous structure with 16 cores per node. The nodes are interconnected with an Infiniband network providing an MPI latency of 4.5 microseconds and an MPI bandwidth of 1.2 GB/sec between neighbouring nodes. The benchmarks were done by Oliver with the Nbody6++ (Marc-Andrés Package, see Compiling problems) code.
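To get a feeling for these two network figures, one can use the standard latency-plus-bandwidth estimate for the transfer time of a single message between neighbouring nodes. This is only a sketch: it assumes 1 GB = 10^9 bytes, ignores contention and the topology, and the message size used is a hypothetical value, not something taken from the code.

```python
LATENCY_S = 4.5e-6          # MPI latency quoted above, in seconds
BANDWIDTH_B_PER_S = 1.2e9   # MPI bandwidth quoted above, assuming 1 GB = 1e9 bytes

def message_time(n_bytes, latency=LATENCY_S, bandwidth=BANDWIDTH_B_PER_S):
    """Estimated transfer time of one message: latency + size / bandwidth."""
    return latency + n_bytes / bandwidth

def break_even_size(latency=LATENCY_S, bandwidth=BANDWIDTH_B_PER_S):
    """Message size (bytes) at which latency and transfer term are equal."""
    return latency * bandwidth

# Hypothetical message size of 1 MB, for illustration only:
t_msg = message_time(1.0e6)
s_half = break_even_size()
print(f"1 MB message: {t_msg * 1e3:.3f} ms")
print(f"latency dominates below about {s_half:.0f} bytes")
```

Messages much smaller than the break-even size are latency-bound, so sending many small messages costs more than the bandwidth figure alone would suggest.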
a)
b)
Scaling in the communication is a bit strange, probably due to the hierarchical topology. The optimal processor number for a given N is shown in plot b).
c)
Plot c) is an example of the relevant timings for N=50K. The optimal processor number for N=50K is probably between NP=128 and NP=256, which would explain the rise in t_calc. A simulation with NP=192 will clarify the case and yield an additional point for plot b).
(O.P.)
Information about the hardware can be found at http://www.rzg.mpg.de/computing/hardware/power4
The benchmarks were done by José with the Cambody version of Nbody6++.
a)
b)
Relevant timings for N=50K:
c)
Information about the hardware can be found at http://www.lrz-muenchen.de/services/compute/hlrb/
The benchmarks were done by José with the Cambody version of Nbody6++.
a)
b)
Relevant timings for N=50K:
c)
A comparison of the b)-type plots is now given:
