[CP2K-user] cp2k.popt calculation "freezes": poll([{fd=5, events=POLLIN}, {fd=15, events=POLLIN}], 2, 0) = 0 (Timeout)
Philipp Rahe
hquerq... at gmail.com
Sun Apr 28 18:21:55 UTC 2019
Dear all,
When running CP2K on our local cluster (SGI UV2000), I have recently
observed that all cp2k processes of an MPI job sometimes "freeze" at 100%
CPU usage (according to top). For example, when running a CELL_OPT
calculation via
mpirun -n 144 cp2k.popt -o cp2k.output cp2k.inp
the run "freezes" after several steps. The last entries in the output file
are:
>> tail -f cp2k.output
RS_GRID| Information for grid number 10584
RS_GRID| Bounds 1 -62 62 Points: 125
RS_GRID| Bounds 2 -72 71 Points: 144
RS_GRID| Bounds 3 -144 143 Points: 288
RS_GRID| Real space distribution over 8 groups
RS_GRID| Real space distribution along direction 2
RS_GRID| Border size 37
RS_GRID| Real space distribution over 18 groups
RS_GRID| Real space distribution along direction 3
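As a quick sanity check on the RS_GRID values above, the following sketch recomputes the point counts from the inclusive bounds and multiplies the two reported group counts. The reading that the two "Real space distribution" entries together span all MPI ranks is an assumption on my part, not something the output states.

```python
# Values copied from the RS_GRID lines in the output above.
bounds = {1: (-62, 62), 2: (-72, 71), 3: (-144, 143)}
reported_points = {1: 125, 2: 144, 3: 288}


def points(lower, upper):
    """Number of grid points in an inclusive index range."""
    return upper - lower + 1


for axis, (lower, upper) in bounds.items():
    # Each reported "Points" value should equal upper - lower + 1.
    assert points(lower, upper) == reported_points[axis]

# Reported group counts: 8 along direction 2, 18 along direction 3.
groups = {2: 8, 3: 18}
total_groups = groups[2] * groups[3]
print(total_groups)
```

The product equals the 144 ranks passed to mpirun via -n, so the last lines written before the freeze look internally consistent.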
I compiled cp2k-6.1 with the toolchain script and recently changed to
OpenMPI 3.1.4 due to a bug in 3.1.0
(https://github.com/open-mpi/ompi/issues/5638) that caused cp2k runs to
crash. (The MPI installed on our cluster is a bit outdated, which is why
I'm not using it.)
The regtest gave 1 COMPILE WARNING, 0 FAILED/WRONG, 3015 CORRECT, 16 NEW.
I inspected one of the "frozen" cp2k.popt processes:
>> strace -fp 38571
Process 38571 attached with 3 threads
[pid 38585] epoll_wait(10, <unfinished ...>
[pid 38582] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[pid 38571] poll([{fd=5, events=POLLIN}, {fd=15, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 38571] poll([{fd=5, events=POLLIN}, {fd=15, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 38571] poll([{fd=5, events=POLLIN}, {fd=15, events=POLLIN}], 2, 0) = 0 (Timeout)
...
The last line repeats until I stop strace. The file descriptors are:
>> lsof -p 38571
...
cp2k.popt 38571 prahe 5u 0000 0,10 0 10625 anon_inode
...
cp2k.popt 38571 prahe 15u IPv4 158047374 0t0 TCP *:polestar (LISTEN)
...
>> ls -l /proc/38571/fd
...
lrwx------ 1 prahe ustudent 64 Apr 27 18:23 15 -> socket:[158047374]
...
lrwx------ 1 prahe ustudent 64 Apr 27 18:23 5 -> anon_inode:[eventfd]
...
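Checking descriptors by hand does not scale to 144 processes, so here is a minimal sketch of the same lookup as `ls -l /proc/<pid>/fd`, written in Python. The helper name list_fds is hypothetical, and the approach assumes Linux with /proc mounted and permission to read the target process's fd directory.

```python
import os


def list_fds(pid="self"):
    """Map each open file descriptor of a process to its link target.

    Linux only: reads the symlinks under /proc/<pid>/fd, like `ls -l`.
    """
    fd_dir = f"/proc/{pid}/fd"
    targets = {}
    for name in os.listdir(fd_dir):
        try:
            targets[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return targets


if __name__ == "__main__":
    # Without an argument the helper inspects the calling process itself.
    for fd, target in sorted(list_fds().items()):
        print(fd, "->", target)
```

Calling list_fds(38571) would be expected to report entries such as the socket and the eventfd shown above, and looping over all cp2k.popt PIDs would show whether every rank polls the same kind of descriptors.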
The strace output is the same for three of the 144 processes; I haven't
checked the others. At this point I understand that the processes are
waiting for some input, but I'm unfortunately lost otherwise. Any
suggestions? Or, since I have had these issues since switching to OpenMPI
3.1.4: is this a bad choice?
Please let me know if you need any further files/info.
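For reference, the third argument of poll() is the timeout in milliseconds, so the strace lines show a non-blocking check that returns 0 because neither descriptor is readable. The sketch below reproduces that pattern in Python. The pipe and the listening socket are hypothetical stand-ins for fd 5 and fd 15, not the descriptors OpenMPI actually uses.

```python
import os
import select
import socket

# Stand-in for fd 5 (the eventfd): the read end of an empty pipe.
read_end, write_end = os.pipe()

# Stand-in for fd 15: a listening TCP socket with no incoming connection.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)

poller = select.poll()
poller.register(read_end, select.POLLIN)
poller.register(listener.fileno(), select.POLLIN)

# Timeout 0 ms: return at once. An empty list corresponds to the
# "= 0 (Timeout)" result that strace prints.
ready = poller.poll(0)
print(ready)

# A loop built on such calls never sleeps, so it keeps one core busy
# for as long as nothing arrives on either descriptor.
calls = 0
for _ in range(100_000):
    if poller.poll(0):
        break
    calls += 1
print(calls)

listener.close()
os.close(read_end)
os.close(write_end)
```

A loop like this would be consistent with the 100% CPU usage seen in top, but it does not by itself show which MPI call the ranks are stuck in.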
Thanks in advance and best regards,
Philipp