1) STDIO - Error 11: Resource temporarily unavailable
2) a) ERROR: 0032-117 User pack or receive buffer is too small (1368) in MPI_Sendrecv, task 0 b) MPI_SENDRECV : Message truncated
3) The run hangs in a KS regularization - what to do?
4) Segmentation fault directly after starting the code
5) Run hangs after output fpoly2 time
6) The energy is not equal to -0.25 or energy errors occur.
1) STDIO - Error 11: Resource temporarily unavailable
We got the following error on our Beowulf PC cluster Hydra @ ARI, using PGI mpif77 / pgf77, when trying to restart an nbody6++ run:
PGFIO/stdio: Resource temporarily unavailable
PGFIO-F-/unformatted read/unit=1/error code returned by host stdio - 11.
File name = comm.1 unformatted, sequential access record = 1
In source file mydump.F, at line number 61
PSIlogger: Child with rank 0 exited with status 1.
PSIlogger: Child with rank 4 exited on signal 15.
PSIlogger: Child with rank 3 exited on signal 15.
PSIlogger: Child with rank 8 exited on signal 15.
PSIlogger: Child with rank 6 exited on signal 15.
PSIlogger: Child with rank 5 exited on signal 15.
PSIlogger: Child with rank 7 exited on signal 15.
PSIlogger: Child with rank 2 exited on signal 15.
PSIlogger: Child with rank 1 exited on signal 15.
PSIlogger: Child with rank 9 exited on signal 15.
PSIlogger: done
The code tries to read the file called comm.1, which contains all the
COMMON blocks in order to restart the scientific simulation from the
point where it previously stopped.
stdio error 11 (EAGAIN) comes from the operating system and indicates here that the file is not available to the process. First check that your process can access the file and has read permission on it. The file may also be locked by another process. Solution so far: use ifort. (A. Ernst)
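The checks named above (access, read permission, lock) can be sketched as a small Python helper. This is only an illustration and not part of nbody6++: the function name diagnose_restart_file is hypothetical, and the lock test only detects advisory flock-style locks on Unix-like systems, so a lock taken by other means (or by an NFS server) may go unnoticed.

```python
import errno
import fcntl
import os


def diagnose_restart_file(path):
    """Return a list of problems found for a restart file (empty if none)."""
    if not os.path.isfile(path):
        return ["missing or not a regular file"]
    if not os.access(path, os.R_OK):
        return ["no read permission for the current user"]

    problems = []
    if os.path.getsize(path) == 0:
        problems.append("file is empty")

    with open(path, "rb") as handle:
        try:
            # Non-blocking shared lock: fails at once if another process
            # holds an exclusive flock on the same file.
            fcntl.flock(handle, fcntl.LOCK_SH | fcntl.LOCK_NB)
        except OSError as exc:
            if exc.errno in (errno.EAGAIN, errno.EACCES):
                problems.append("locked by another process")
            else:
                problems.append("lock state could not be tested: " + str(exc))
        else:
            fcntl.flock(handle, fcntl.LOCK_UN)
    return problems


if __name__ == "__main__":
    # comm.1 is the restart file named in the error message above.
    print(diagnose_restart_file("comm.1"))
```

Run it in the directory of the failed restart, as the same user that submits the job; an empty list means none of these checks found a problem.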
2)
a) ERROR: 0032-117 User pack or receive buffer is too small (1368) in MPI_Sendrecv, task 0
This nasty bug occurred on Juelich's Jump system. The error message means that the "sendcount" to rank y in an MPI_SENDRECV call on rank x and the "recvcount" from rank x in the corresponding MPI_SENDRECV call on rank y do not have the same value.
b) MPI_SENDRECV : Message truncated
This similar error occurred on Hydra. (A. Ernst)
It seems that the array inum(p) which contains the message size for processor p is different
on rank 0 as compared with the other ranks. The reason for this is that the variable
NXTLEN is different on rank 0 as compared with the other ranks.
In general: this error can have different causes, e.g. a missing MPI_BCAST, or a variable from a COMMON block that is used in a routine whose header does not declare that COMMON block (this was the case with a variable used in the Chain regularization routines). How to find the bug: first look for the particle in NXTLST that is causing the problem. Then try to find the value of TIME at which one processor starts to differ from the others. That should narrow the search down to a few routines in which you can look for the bug. (A.E.)
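The step "find the value of TIME at which one processor starts to differ" can be sketched in Python. The sketch assumes you have added your own debug output so that every rank records (TIME, NXTLEN) at each block step; the function name first_divergence and the record layout are hypothetical and not part of nbody6++.

```python
def first_divergence(records_by_rank):
    """Find the first step at which the ranks disagree.

    records_by_rank maps rank -> list of (time, nxtlen) tuples in output
    order. Returns (step_index, {rank: (time, nxtlen)}) for the first
    mismatch, or None if all ranks agree over the steps they share.
    """
    ranks = sorted(records_by_rank)
    if not ranks:
        return None
    # Only compare the steps that every rank has actually recorded.
    n_steps = min(len(records_by_rank[r]) for r in ranks)
    for step in range(n_steps):
        entries = {r: records_by_rank[r][step] for r in ranks}
        if len(set(entries.values())) > 1:
            return step, entries
    return None


if __name__ == "__main__":
    # Made-up example values: rank 1 differs from rank 0 at the third step.
    example = {
        0: [(0.0, 5), (0.125, 7), (0.25, 9)],
        1: [(0.0, 5), (0.125, 7), (0.25, 8)],
    }
    print(first_divergence(example))
```

The TIME reported for the first mismatch is where to start looking; the routines called between the last matching step and this one are the candidates.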
Cambody version: Make sure that intgrt.F contains the declaration of the COMMON block CHAINC, since one variable of this block (ICH) is used in the Chain predictions in intgrt.F.
3) What to do if the run hangs in a KS regularization? A first aid is the following:
5) Run hangs after output fpoly2 time. The code hangs after the output
appears in the diagnostics file. The problem is an infinite loop in start.F after label 80:
A solution is to comment out the above line. In principle, NLIST is deprecated and, as Sverre says, all references to this array should be removed anyway. (O.P./A.E.)
6) The energy is not equal to -0.25 or energy errors occur. If you run "grep ADJUST diagnosticsfile" you can see the value of the total energy. If it is not equal to -0.25 at the beginning, or if unreasonable energy errors occur, the first thing to check is the compiler optimization option in the Makefile: with pgf, -O4 seems to work in most cases, but -O3 has worked in all cases so far. (A.E.)