Runtime problems:

Problems:

1) STDIO - Error 11: Resource temporarily unavailable
2) a) ERROR: 0032-117 User pack or receive buffer is too small (1368) in MPI_Sendrecv, task 0 b) MPI_SENDRECV : Message truncated
3) The run hangs in a KS regularization - what to do?
4) Segmentation fault directly after starting the code
5) Run hangs after output fpoly2 time
6) The energy is not equal to -0.25 or energy errors occur.

Solutions:

1) STDIO - Error 11: Resource temporarily unavailable: We were getting the following strange error on our Beowulf PC cluster Hydra @ ARI using PGI mpif77 / pgf77 while trying to restart the nbody6++ run:

PGFIO/stdio: Resource temporarily unavailable
PGFIO-F-/unformatted read/unit=1/error code returned by host stdio - 11.
File name = comm.1 unformatted, sequential access record = 1
In source file mydump.F, at line number 61
PSIlogger: Child with rank 0 exited with status 1.
PSIlogger: Child with rank 4 exited on signal 15.
PSIlogger: Child with rank 3 exited on signal 15.
PSIlogger: Child with rank 8 exited on signal 15.
PSIlogger: Child with rank 6 exited on signal 15.
PSIlogger: Child with rank 5 exited on signal 15.
PSIlogger: Child with rank 7 exited on signal 15.
PSIlogger: Child with rank 2 exited on signal 15.
PSIlogger: Child with rank 1 exited on signal 15.
PSIlogger: Child with rank 9 exited on signal 15.

PSIlogger: done

The code tries to read the file called comm.1, which contains all the
COMMON blocks in order to restart the scientific simulation from the
point where it previously stopped.

stdio error 11 (EAGAIN) comes from the operating system and indicates here that the file is temporarily unavailable for some reason. First check that the file exists and that your process has read permission for it. The file may also be locked by another process. Solution so far: use ifort (A. Ernst)
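
The access checks mentioned above can be sketched as a small shell function. This is only a minimal sketch: the function name check_restart_file is hypothetical and not part of nbody6++, and it only tests existence, read permission and non-zero size. It does not detect locks held by other processes.

```shell
# Hypothetical helper (not part of nbody6++): report whether a restart
# file exists, is readable by the current user, and is not empty.
check_restart_file() {
    f=${1:-comm.1}
    if [ ! -e "$f" ]; then
        echo "missing: $f"
        return 1
    fi
    if [ ! -r "$f" ]; then
        echo "not readable: $f"
        return 1
    fi
    if [ ! -s "$f" ]; then
        echo "empty: $f"
        return 1
    fi
    echo "ok: $f"
}

# Example call, run in the directory of the simulation:
#   check_restart_file comm.1
```

If the file passes these checks, tools such as "fuser comm.1" or "lsof comm.1" (where installed) may show whether another process still has it open.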

2)

a) ERROR: 0032-117 User pack or receive buffer is too small (1368) in MPI_Sendrecv, task 0
This nasty bug occurred on Juelich's Jump system. The error message means that the "sendcount"
to rank y in an MPI_SENDRECV call on rank x and the "recvcount" from rank x in the corresponding
MPI_SENDRECV call on rank y do not have the same value.

b) MPI_SENDRECV : Message truncated. This similar error occurred on Hydra. (A. Ernst)
It seems that the array inum(p), which contains the message size for processor p, differs
between rank 0 and the other ranks. The reason is that the variable NXTLEN differs between
rank 0 and the other ranks.

In general: This error can have several causes, e.g. a missing MPI_BCAST, or a variable from a COMMON block that is used in a routine whose header does not declare that COMMON block (this was the case with a variable used in the Chain regularization routines). How to find the bug: First look for the particle in NXTLST which is causing the problem. Then try to find the value of TIME at which one processor starts to differ from the others. This may let you narrow the search down to a few routines. (A.E.)

Cambody version: Make sure that intgrt.F contains the declaration of the COMMON block CHAINC, since one variable of this block (ICH) is used in the Chain predictions in intgrt.F.

3) What to do if the run hangs in a KS regularization? A first aid is the following:

  • Kill the run
  • Make backups of comm.1, comm.2 and conf.3
  • cp comm.2 comm.1 (this restores the common blocks from the last stored time point depending on DTADJ & DELTAT)
  • Restart the run with KSTART = 5 (see FAQ in this wiki) after you have modified one or more of the parameters ETAI, ETAR, ETAU, DTMIN, RMIN, NNBOPT in the third line of the restart input file
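
The backup and restore steps of the list above could be scripted roughly as follows. This is a sketch under assumptions: the function name restore_from_comm2 and the timestamped backup directory are hypothetical, and killing the run and editing the restart input file still have to be done by hand.

```shell
# Hypothetical helper (not part of nbody6++): back up comm.1, comm.2 and
# conf.3 into a timestamped directory, then restore comm.1 from comm.2.
# Prints the name of the backup directory on success.
restore_from_comm2() {
    dir=${1:-.}
    bak="$dir/backup.$(date +%Y%m%d%H%M%S)"
    mkdir -p "$bak" || return 1
    for f in comm.1 comm.2 conf.3; do
        if [ -f "$dir/$f" ]; then
            cp -p "$dir/$f" "$bak/" || return 1
        fi
    done
    # Same as "cp comm.2 comm.1" in the list above.
    cp "$dir/comm.2" "$dir/comm.1" || return 1
    echo "$bak"
}

# Example call, run after the hanging job has been killed:
#   restore_from_comm2 /path/to/run/directory
```

Keeping the backups in a separate directory means the original comm.1 is still available if the restart from comm.2 turns out not to help.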

4) Segmentation fault directly after starting the code. This nasty bug appeared on SARA's Huygens system: a segfault right at the beginning, before any output was written. Wim Rijks from SARA helped: There is a duplicate symbol "FCLOSE". Solution: Rename the routine fclose.f to something like myfclose.f and modify all calls to this routine using sed. I have a small script which does that for you. (A.E.)

      • This did not help me; I get the segfault in any case. Are you sure that this is the solution? (José experienced the same problems.) (O.P.)
Did you also rename the COMMON block CLOSE? (A.E.)
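
The renaming of FCLOSE described above might look roughly like the following. This is not A.E.'s script, only a sketch: the function name rename_fclose and the new symbol name MYFCLOSE are assumptions, and it only touches *.f and *.F files in one directory. It does not rename the COMMON block CLOSE.

```shell
# Hypothetical helper (not the original script): rename the symbol FCLOSE
# to MYFCLOSE in all Fortran sources of a directory and rename fclose.f
# to myfclose.f.
rename_fclose() {
    dir=${1:-.}
    for f in "$dir"/*.f "$dir"/*.F; do
        [ -f "$f" ] || continue
        # Match FCLOSE (any case) only as a whole word: the character
        # before it, and the character after it or the end of the line,
        # must not be part of an identifier.
        sed -e 's/\([^A-Za-z0-9_]\)[Ff][Cc][Ll][Oo][Ss][Ee]\([^A-Za-z0-9_]\)/\1MYFCLOSE\2/g' \
            -e 's/\([^A-Za-z0-9_]\)[Ff][Cc][Ll][Oo][Ss][Ee]$/\1MYFCLOSE/' \
            "$f" > "$f.tmp" && mv "$f.tmp" "$f" || return 1
    done
    if [ -f "$dir/fclose.f" ]; then
        mv "$dir/fclose.f" "$dir/myfclose.f" || return 1
    fi
}

# Example call, run on a copy of the source directory:
#   rename_fclose /path/to/source/copy
```

Any Makefile entry that names the old file would need the same change by hand, so it is safer to try this on a copy of the source tree first.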

5) Run hangs after output fpoly2 time. The code hangs after the output

fpoly1 time= ...
fpoly2 time= ...

appears in the diagnostics file. The problem is an infinite loop in start.F after label 80:

80 CONTINUE
*
* Check whether membership range is acceptable.
* Next statement commented (A. Ernst)
* IF (NNB.EQ.1) GO TO 70
*

A solution is to comment out the IF (NNB.EQ.1) GO TO 70 statement, as shown above. In principle, NLIST is deprecated and, as Sverre says, all references to this array should be removed anyway. (O.P./A.E.)

6) The energy is not equal to -0.25 or energy errors occur. If you run "grep ADJUST diagnosticsfile" you can see the value of the total energy. If it is not equal to -0.25 at the beginning, or unreasonable energy errors occur, the first thing to check is the compiler optimization option in the Makefile: -O4 seems to work in most cases, but -O3 has worked in all cases so far with pgf. (A.E.)

