CP2K calculation stuck in communication

Ondrej Marsalek ondrej.... at gmail.com
Wed Mar 4 11:09:07 UTC 2009

Dear everyone,

I have a problem with a CP2K calculation that is stuck in
communication, and I would like to ask for ideas on where to start
debugging. I realize that this is slightly off-topic, but perhaps
someone can at least point me in the right direction. I will provide
the basic information here and will gladly provide more details if
anyone finds them useful.

The setup is a cluster with two dual-core Opterons per node and an
InfiniBand interconnect, running OpenMPI 1.3 and a recent CP2K (which
behaves fine elsewhere).

All the processes are alive and taking 100% CPU, but there is no
output (no, it is not buffering; the timescales make that clear).
Attaching gdb reveals the stack shown (with a comment) at the end of
this message. I don't have TotalView there, but could maybe install the
trial if there is a good chance that it helps. This problem appears
repeatedly. Before some software updates, we had a problem that looked
similar on the surface, but one of the nodes involved was entirely
frozen. As it does not happen elsewhere, I thought that IB was somehow
to blame, but seeing the stack now, I am not so sure.

I would very much appreciate any suggestions.



The backtrace in gdb looks something like this, but it is "constant"
only from opal_progress downwards (frame #1 and everything listed below
it). Clearly, opal_progress runs, and on different attaches you can see
other frames above it, for example btl_sm_component_progress or
something deeper inside btl_openib_component_progress.

(gdb) backtrace
#0  0x00002b0d461c6da6 in btl_openib_component_progress () from
#1  0x00002b0d41adc778 in opal_progress () from
#2  0x00002b0d415dd7d2 in ompi_request_default_wait_all () from
#3  0x00002b0d4764654b in ompi_coll_tuned_sendrecv_actual () from
#4  0x00002b0d4764adee in ompi_coll_tuned_alltoall_intra_pairwise ()
from /home/marsalek/opt/openmpi-1.3-intel/lib/openmpi/mca_coll_tuned.so
#5  0x00002b0d47646e51 in ompi_coll_tuned_alltoall_intra_dec_fixed ()
from /home/marsalek/opt/openmpi-1.3-intel/lib/openmpi/mca_coll_tuned.so
#6  0x00002b0d415f51d5 in PMPI_Alltoall () from
#7  0x00002b0d4139182c in pmpi_alltoall__ () from
#8  0x000000000075c348 in message_passing_mp_mp_alltoall_r45_ ()
#9  0x000000000170d294 in ps_wavelet_base_mp_f_poissonsolver_ ()
#10 0x00000000013ea32b in ps_wavelet_util_mp_psolver_ ()
#11 0x00000000013e2b1f in ps_wavelet_types_mp_ps_wavelet_solve_ ()
#12 0x0000000000810e18 in pw_poisson_methods_mp_pw_poisson_solve_ ()
#13 0x00000000008a670c in qs_ks_methods_mp_qs_ks_build_kohn_sham_matrix_ ()
#14 0x00000000008a46cb in qs_ks_methods_mp_qs_ks_update_qs_env_ ()
#15 0x0000000000885f11 in qs_force_mp_qs_forces_ ()
#16 0x000000000052f4f1 in force_env_methods_mp_force_env_calc_energy_force_ ()
#17 0x00000000012a5521 in integrator_mp_nvt_ ()
#18 0x0000000000b594c3 in velocity_verlet_control_mp_velocity_verlet_ ()
#19 0x0000000000723521 in md_run_mp_qs_mol_dyn_ ()
#20 0x0000000000498297 in cp2k_runs_mp_cp2k_run_ ()
#21 0x0000000000496f06 in cp2k_runs_mp_run_input_ ()
#22 0x0000000000495dba in MAIN__ ()
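
The observation that the stack is "constant" only from opal_progress
downwards can be checked mechanically: take several backtraces of the
same process and keep the outer frames that all of them share. Below is
a minimal sketch in Python, assuming each sample is plain gdb
`backtrace` output in the format shown above. The names parse_frames
and common_outer_frames are hypothetical helpers made up for this
illustration; they are not part of gdb, OpenMPI, or CP2K.

```python
import re

# Matches gdb frame lines such as
#   "#4  0x00002b0d4764adee in ompi_coll_tuned_alltoall_intra_pairwise ()"
# The address part is optional because gdb may omit it for some frames.
# Continuation lines ("from /path/to/lib.so") do not start with '#'
# and are therefore ignored.
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([^\s(]+)\s*\(")


def parse_frames(backtrace_text):
    """Return the function names of a gdb backtrace, innermost frame first."""
    frames = []
    for line in backtrace_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(match.group(2))
    return frames


def common_outer_frames(samples):
    """Return the outermost frames shared by every sample, innermost first.

    Stacks are compared from the outermost frame (e.g. MAIN__) inwards,
    and the comparison stops at the first frame where the samples differ.
    """
    # Reverse each stack so that index 0 is the outermost frame.
    stacks = [parse_frames(sample)[::-1] for sample in samples]
    shared = []
    for names in zip(*stacks):
        if len(set(names)) != 1:
            break
        shared.append(names[0])
    # Return in the same order gdb prints frames: innermost first.
    return shared[::-1]
```

Each sample can be collected non-interactively with something like
`gdb -p PID -batch -ex bt`, repeated a few times per process. If the
first name returned by common_outer_frames is always opal_progress, the
process never leaves the wait inside the alltoall call between samples,
which is what the manual inspection above suggests.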

