Louhi - Cray XT4


Site

Louhi is located at CSC in Espoo, Finland (near Helsinki) and takes part in the DEISA grid project.
Here is a user guide: http://www.csc.fi/english/pages/louhi_guide/index_html

Compiling

My make rule is:

# CRAY XT4 Fully parallel Mode -fastsse
#----------------------------------------------------------------------------
xt4-fastsse:
(tab) ./modify.sh
(tab) cp mpif.xt4.h mpif.h
(tab) $(MAKE) $(RESULT) "FC = f77" "FFLAGS = -O4 -fastsse -D PUREMPI -D PARALLEL" \
(tab) "SOURCE = energy_mpi.f fpoly1_mpi.f fpoly2_mpi.f $(SOURCE)" \
(tab) "CFLAGS = -D PUREMPI -D PARALLEL"
(tab) ./unmodify.sh

  1. modify.sh (by aernst) changes MPI_REAL to MPI_REAL8 before compiling.
  2. unmodify.sh changes MPI_REAL8 back to MPI_REAL after the build.
  3. mpif.xt4.h is the MPI header for the XT4; the rule copies it to mpif.h.
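
The two scripts themselves are not shown on this page. As a rough idea of what such a pair could look like, here is a minimal sketch written as stdin/stdout filters. The function names to_real8 and to_real are made up for this example, and the real modify.sh by aernst may work quite differently.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the original modify.sh / unmodify.sh.

# Rewrite MPI_REAL as MPI_REAL8. The trailing group makes sure the name
# ends here, so MPI_REAL4, MPI_REAL8 and MPI_REAL16 are left untouched.
to_real8() {
    sed -E 's/MPI_REAL([^0-9A-Za-z_]|$)/MPI_REAL8\1/g'
}

# Reverse step: rewrite every MPI_REAL8 as MPI_REAL again.
# Caveat: this also changes any MPI_REAL8 that was in the source originally.
to_real() {
    sed 's/MPI_REAL8/MPI_REAL/g'
}
```

Used on a single file this would be something like `to_real8 < energy_mpi.f > energy_mpi.f.tmp && mv energy_mpi.f.tmp energy_mpi.f`. The match is case-sensitive, so lowercase mpi_real in a source file would need an extra pattern.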

Optimization

Some hints for optimization:

  • The -small_pages option was tested and gives a speedup of approx. 1.5 in the nbody6++ code.
  • The -fastsse compile option has little effect on performance (but does no harm either).
  • Setting MPI_COLL_OPT_ON (export MPI_COLL_OPT_ON=1 in the jobscript) gives a factor of 1.3 in tmov.
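
A jobscript fragment combining the two runtime settings might look like the sketch below. Several things here are assumptions and not taken from this page: that -small_pages is passed to the job launcher, that the launcher is yod, and the executable name ./nbody6. Batch-system directives and process-count options are site-specific and left out.

```shell
#!/bin/sh
# Jobscript fragment (sketch only).

# Enable the optimized MPI collectives.
export MPI_COLL_OPT_ON=1

# ASSUMPTION: -small_pages is given to the job launcher, and the launcher
# is yod. LAUNCHER defaults to a dry run that only prints the command.
LAUNCHER="${LAUNCHER:-echo yod}"

# ./nbody6 is a hypothetical name for the executable.
$LAUNCHER -small_pages ./nbody6
```

Run as is, the fragment only echoes the launch line; setting LAUNCHER=yod in the environment would start the real job.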

Runtime Problems

I hit a nasty bug in all of my runs (40K+, 128 processors): after some time (approx. 15 walltime hours) the code crashes while writing the comm.2 file. This is fatal because comm.2 is then corrupted and the code cannot be restarted.
The last lines of output are:

1188882745.619525:3-1311:(rw.c:519:get_io_group()): kmalloc of 'group' (5571264 bytes) failed at rw.c:519
1188882745.624228:3-1311:(rw.c:519:get_io_group()): -117320599 total bytes allocated by Lustre, 0 by Portals

This might be because not enough memory can be allocated; there is (only?) 1 GB per processor.
The crash seems to happen in mydump.F.

  • The debug file is:


----- DEBUG: PCB, CONTEXT, STACK TRACE ---------------------

PROCESSOR 0
log_nid = 0 phys_nid = 0x51f host_id = 8 host_pid = 2715
group_id = 52048 num_procs = 128 rank = 0 local_pid = 3
base_node_index = 0 last_node_index = 63

text_base = 0x00000000200000 text_len = 0x00000000400000
data_base = 0x00000000600000 data_len = 0x00000015e00000
stack_base = 0x0000007ec00000 stack_len = 0x00000001000000
heap_base = 0x00000016600000 heap_len = 0x00000025e00000

ss = 0x000000000000001f fs = 000000000000000000 gs = 0x0000000000000017
rip = 0x00000000002d6b90
rdi = 0x0000000000000002 rsi = 0xfffffffffffffff4 rbp = 0x000000007fbfa610
rsp = 0x000000007fbfa540 rbx = 0x000000001be8bdf0 rdx = 0x000000000c2fa85c
rcx = 0x0000000004400000 rax = 0xfffffffffffffff4 cs = 0x000000000000001f
R8 = 0x0000000003e00000 R9 = 0xfffffffffffffff4 R10 = 0x000000001be8bd20
R11 = 0x000000003b9f2fb0 R12 = 0x000000001be8bdf0 R13 = 0x0000000004400000
R14 = 0x000000003b98f630 R15 = 0x0000000004400000
rflg = 0x0000000000010202 prev_sp = 0x000000007fbfa540
error_code = 4

SIGNAL #11Segmentation fault fault_address = 0xfffffffffffffffc

0x7fbfa590 0x 46dce939 0x 5502c000000000 0x 7fbfa610 0x 2d73c3
0x7fbfa5b0 0x 4b2560 0x f901d469 0x 0 0x 207
0x7fbfa5d0 0x 1be8bdf0 0x 1b4af9e0 0x 7fbfa720 0x 3e00000
0x7fbfa5f0 0x 4400000 0x c2fa85c 0xfffffffffffffff4 0x 23b9f2fb0
0x7fbfa610 0x 7fbfa780 0x 2d809a 0x 0 0x 16f894b0
0x7fbfa630 0x 37fbfa640 0x 13 0x 7fbfa670 0x 0
0x7fbfa650 0x a262cc16d3870 0x 1ded16f892b0 0x 4400000 0x c2fa85c
0x7fbfa670 0x 7fbfa710 0x 4400000 0x 46dce939 0x 0
0x7fbfa690 0x 0 0x 10 0x 7fbfa6d0 0x 16f892b0
0x7fbfa6b0 0x 0 0x bde7320 0x 7fbfa7c0 0x 427a39
0x7fbfa6d0 0xa262cc1600000020 0x 3e00000 0x 81fffff 0x 1622b5e0
0x7fbfa6f0 0x 84 0x 200 0x bde7340 0xfffffffffffffff4
0x7fbfa710 0x 1b8fde80 0x 1be8b7c0 0x 0 0x bde7328
0x7fbfa730 0x 3b9f2fb0 0x 1be8bdf0 0x 1be899c0 0x 1be8bd20
0x7fbfa750 0x 3b9926b0 0x 3b9926b0 0x 4400000 0x 3e00000

Stack Trace: ------------------------------
#0 0x00000000002d6b90 in llu_queue_pio()
#1 0x00000000002d809a in llu_file_prwv()
#2 0x00000000004388cc in _sysio_enumerate_extents()
#3 0x00000000002d8d01 in llu_file_rwx()
#4 0x00000000002d8e80 in llu_iop_write()
#5 0x0000000000436391 in _sysio_iiox()
#6 0x00000000004365af in _sysio_iiov()
#7 0x0000000000437a39 in __write()
#8 0x0000000000454ae0 in _IO_new_file_write()
could not find symbol for addr 0x00008618000085bc
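
As a rough cross-check of the memory hypothesis, the segment lengths in the PROCESSOR 0 dump above can be added up. This is only arithmetic on the values printed in the dump; whether the full segments are actually in use is an assumption.

```shell
#!/bin/sh
# Segment lengths copied from the PROCESSOR 0 dump above.
text_len=0x00000000400000
data_len=0x00000015e00000
stack_len=0x00000001000000
heap_len=0x00000025e00000

# Convert a byte count to whole MiB (1 MiB = 1048576 bytes).
mib() { echo $(( $1 / 1048576 )); }

total=$(( $text_len + $data_len + $stack_len + $heap_len ))

echo "text=$(mib $text_len) data=$(mib $data_len) stack=$(mib $stack_len) heap=$(mib $heap_len) total=$(mib $total) MiB"
# prints: text=4 data=350 stack=16 heap=606 total=976 MiB
```

The segments add up to 976 MiB, close to the 1 GB available per processor. That would fit the idea that the failed allocation of 5571264 bytes (about 5.3 MiB) ran out of memory, but it does not prove it.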



o.p.

