CP2K calculation stuck in communication

Axel akoh... at gmail.com
Wed Mar 4 13:49:06 UTC 2009



On Mar 4, 6:09 am, Ondrej Marsalek <ondrej.... at gmail.com> wrote:
> Dear everyone,
>
> I have a problem with a CP2K calculation that is stuck in
> communication and I would like to ask for ideas on where to start.
> I realize that this is slightly off-topic, but perhaps someone can at
> least point me in the right direction. I will provide basic
> information here and will be glad to provide more details if anyone
> finds them useful.

> The setup is a cluster with two dual-core Opterons per node and an
> Infiniband interconnect, running OpenMPI 1.3 and a recent CP2K (which
> behaves fine elsewhere).

I remember people reporting problems with OpenMPI 1.3 over Infiniband
for alltoall and related collectives. The workaround was to insert a
call to mpi_barrier immediately before the alltoall; you could try
that, see the sketch below.
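In the CP2K wrapper that shows up in your trace (mp_alltoall_r45 in the
message_passing module) it would look roughly like this; note that this
is an untested sketch, and the actual argument names in the CP2K source
differ:

    ! untested sketch, not the actual CP2K code: synchronize all ranks
    ! first, to work around the reported OpenMPI 1.3 / openib alltoall hang
    CALL mpi_barrier(comm, ierr)
    CALL mpi_alltoall(sendbuf, scount, MPI_DOUBLE_PRECISION, &
                      recvbuf, rcount, MPI_DOUBLE_PRECISION, comm, ierr)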
Also, with the openib BTL I always need to set btl_openib_use_srq to 1,
either via --mca on the command line or in ~/.openmpi/mca-params.conf.
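For reference, that looks something like this (the binary name and
process count are just placeholders for your actual run):

    # on the command line:
    mpirun --mca btl_openib_use_srq 1 -np 64 cp2k.popt input.inp

    # or persistently, one line in ~/.openmpi/mca-params.conf:
    btl_openib_use_srq = 1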

hope that helps.
    axel.


> All the processes are alive and taking 100% CPU, but there is no
> output (no, it is not the buffers; the timescale makes that clear).
> Attaching gdb reveals the stack shown (with a comment) at the end of
> the message. I don't have Totalview there, but could maybe install
> the trial version if there is a good chance that it would help. This
> problem appears repeatedly. Before some software updates, we had a
> problem that looked similar on the surface, but one of the nodes
> involved was entirely frozen. As it does not happen elsewhere, I
> thought that the IB was somehow to blame, but seeing the stack now,
> I am not so sure.
>
> I would very much appreciate any suggestions.
>
> Best,
> Ondrej
>
> ---
>
> The backtrace in gdb looks something like this, but is "constant" only
> from opal_progress downwards. Clearly, opal_progress keeps running,
> and you can see different frames above it, for example
> btl_sm_component_progress or something deeper inside
> btl_openib_component_progress.
>
> (gdb) backtrace
> #0  0x00002b0d461c6da6 in btl_openib_component_progress () from
> /home/marsalek/opt/openmpi-1.3-intel/lib/openmpi/mca_btl_openib.so
> #1  0x00002b0d41adc778 in opal_progress () from
> /home/marsalek/opt/openmpi-1.3-intel/lib/libopen-pal.so.0
> #2  0x00002b0d415dd7d2 in ompi_request_default_wait_all () from
> /home/marsalek/opt/openmpi-1.3-intel/lib/libmpi.so.0
> #3  0x00002b0d4764654b in ompi_coll_tuned_sendrecv_actual () from
> /home/marsalek/opt/openmpi-1.3-intel/lib/openmpi/mca_coll_tuned.so
> #4  0x00002b0d4764adee in ompi_coll_tuned_alltoall_intra_pairwise ()
> from /home/marsalek/opt/openmpi-1.3-intel/lib/openmpi/mca_coll_tuned.so
> #5  0x00002b0d47646e51 in ompi_coll_tuned_alltoall_intra_dec_fixed ()
> from /home/marsalek/opt/openmpi-1.3-intel/lib/openmpi/mca_coll_tuned.so
> #6  0x00002b0d415f51d5 in PMPI_Alltoall () from
> /home/marsalek/opt/openmpi-1.3-intel/lib/libmpi.so.0
> #7  0x00002b0d4139182c in pmpi_alltoall__ () from
> /home/marsalek/opt/openmpi-1.3-intel/lib/libmpi_f77.so.0
> #8  0x000000000075c348 in message_passing_mp_mp_alltoall_r45_ ()
> #9  0x000000000170d294 in ps_wavelet_base_mp_f_poissonsolver_ ()
> #10 0x00000000013ea32b in ps_wavelet_util_mp_psolver_ ()
> #11 0x00000000013e2b1f in ps_wavelet_types_mp_ps_wavelet_solve_ ()
> #12 0x0000000000810e18 in pw_poisson_methods_mp_pw_poisson_solve_ ()
> #13 0x00000000008a670c in qs_ks_methods_mp_qs_ks_build_kohn_sham_matrix_ ()
> #14 0x00000000008a46cb in qs_ks_methods_mp_qs_ks_update_qs_env_ ()
> #15 0x0000000000885f11 in qs_force_mp_qs_forces_ ()
> #16 0x000000000052f4f1 in force_env_methods_mp_force_env_calc_energy_force_ ()
> #17 0x00000000012a5521 in integrator_mp_nvt_ ()
> #18 0x0000000000b594c3 in velocity_verlet_control_mp_velocity_verlet_ ()
> #19 0x0000000000723521 in md_run_mp_qs_mol_dyn_ ()
> #20 0x0000000000498297 in cp2k_runs_mp_cp2k_run_ ()
> #21 0x0000000000496f06 in cp2k_runs_mp_run_input_ ()
> #22 0x0000000000495dba in MAIN__ ()

