[CP2K-user] cp2k.popt calculation "freezes": poll([{fd=5, events=POLLIN}, {fd=15, events=POLLIN}], 2, 0) = 0 (Timeout)

Philipp Rahe hquerq... at gmail.com
Sun Apr 28 18:21:55 UTC 2019


Dear all,

When running cp2k on our local cluster (SGI UV2000), I have recently 
observed that all cp2k processes sometimes "freeze" during an MPI job while 
showing 100% CPU usage (according to top). For example, when running a 
CELL_OPT calculation via

mpirun -n 144 cp2k.popt -o cp2k.output cp2k.inp

the run "freezes" after several steps. The last entries in the output file 
are:

>> tail -f cp2k.output

 RS_GRID| Information for grid number                                      10584
 RS_GRID|   Bounds   1            -62      62                Points:         125
 RS_GRID|   Bounds   2            -72      71                Points:         144
 RS_GRID|   Bounds   3           -144     143                Points:         288
 RS_GRID| Real space distribution over                                  8 groups
 RS_GRID| Real space distribution along direction                              2
 RS_GRID| Border size                                                         37
 RS_GRID| Real space distribution over                                 18 groups
 RS_GRID| Real space distribution along direction                              3

I compiled cp2k-6.1 with the toolchain script and recently switched to 
openmpi 3.1.4 because of a bug in 3.1.0 
(https://github.com/open-mpi/ompi/issues/5638) that caused cp2k runs to 
crash. (The MPI installed on our cluster is a bit outdated, which is why I 
am not using it.) 
The regtest gave 1 COMPILE WARNING, 0 FAILED/WRONG, 3015 CORRECT, 16 NEW. 

I inspected one of the "frozen" cp2k.popt processes:

>> strace -fp 38571
Process 38571 attached with 3 threads
[pid 38585] epoll_wait(10,  <unfinished ...>
[pid 38582] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[pid 38571] poll([{fd=5, events=POLLIN}, {fd=15, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 38571] poll([{fd=5, events=POLLIN}, {fd=15, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 38571] poll([{fd=5, events=POLLIN}, {fd=15, events=POLLIN}], 2, 0) = 0 (Timeout)
...

The last line repeats until I stop strace. The file descriptors are:

>> lsof -p 38571
...
cp2k.popt 38571 prahe    5u  0000               0,10         0       10625 anon_inode
...
cp2k.popt 38571 prahe   15u  IPv4          158047374       0t0         TCP *:polestar (LISTEN)
...


>> ls -l /proc/38571/fd
...
lrwx------ 1 prahe ustudent 64 Apr 27 18:23 15 -> socket:[158047374]
...
lrwx------ 1 prahe ustudent 64 Apr 27 18:23 5 -> anon_inode:[eventfd]
...
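
A side note on reading the strace lines above: the third argument of poll() 
is the timeout in milliseconds, and a value of 0 makes the call return 
immediately. The repeated "= 0 (Timeout)" results therefore look like a 
busy-polling loop rather than a blocked call, which would fit the 100% CPU 
usage shown by top. Below is a minimal Python sketch of that behaviour. It 
uses a pipe as a stand-in for the real descriptors (the eventfd and the 
listening socket), and the helper name poll_pipe is purely illustrative.

```python
import os
import select


def poll_pipe(write_first):
    """Poll the read end of a fresh pipe once with a timeout of 0 ms.

    The pipe is only a stand-in for the descriptors seen in strace;
    this is not how Open MPI itself is implemented.
    """
    read_fd, write_fd = os.pipe()
    try:
        if write_first:
            os.write(write_fd, b"x")  # make the read end readable
        poller = select.poll()
        poller.register(read_fd, select.POLLIN)
        # Timeout 0: return at once, whether or not anything is ready.
        return poller.poll(0)
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    # Nothing to read: empty result, which strace reports as "= 0 (Timeout)".
    print(poll_pipe(write_first=False))
    # Data pending: one (fd, event mask) pair with POLLIN set.
    print(poll_pipe(write_first=True))
```

A loop that keeps calling poll() this way never sleeps in the kernel, so 
the process stays at full CPU load even though nothing arrives.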

The strace output is the same for three of the 144 processes; I haven't 
checked the others. As far as I understand, the processes are waiting for 
some input, but beyond that I am unfortunately lost. Do you have any 
suggestions? Also, since I have only had these issues since switching to 
openmpi 3.1.4: is this version a bad choice? 
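
To compare the remaining processes without reading each trace by eye, the 
saved strace output could be summarized per thread. The following is only a 
rough sketch: the function name summarize_strace and the regular expression 
are my own assumptions, and the sample text is taken from the output shown 
earlier (with the mail line wrapping undone).

```python
import re
from collections import Counter

# Matches lines such as "[pid 38571] poll([...], 2, 0) = 0 (Timeout)".
# Continuation lines like "<... epoll_wait resumed>" are skipped on purpose.
_CALL = re.compile(r"^\[pid\s+(\d+)\]\s+(\w+)\(")

SAMPLE = """\
Process 38571 attached with 3 threads
[pid 38585] epoll_wait(10,  <unfinished ...>
[pid 38582] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[pid 38571] poll([{fd=5, events=POLLIN}, {fd=15, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 38571] poll([{fd=5, events=POLLIN}, {fd=15, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 38571] poll([{fd=5, events=POLLIN}, {fd=15, events=POLLIN}], 2, 0) = 0 (Timeout)
"""


def summarize_strace(text):
    """Count system calls per (pid, syscall name) in `strace -f` output."""
    counts = Counter()
    for line in text.splitlines():
        match = _CALL.match(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    for (pid, call), n in sorted(summarize_strace(SAMPLE).items()):
        print(pid, call, n)
```

If every rank shows the same pattern (one thread spinning in poll, the 
others parked), that would suggest all ranks are waiting on each other 
rather than one rank being stuck alone.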

Please let me know if you need any further files/info. 

Thanks in advance and best regards,
Philipp

