CP2K calculation stuck in communication
Axel
akoh... at gmail.com
Wed Mar 4 16:03:50 UTC 2009
On Mar 4, 10:00 am, Ondrej Marsalek <ondrej.... at gmail.com> wrote:
> And one more thing - I have found elsewhere a post by you, Axel,
> mentioning that the reserved memory limit is important. I have just
> checked and found that it is set to 1024 kB on our nodes.
>
> Could this, together with not using a shared receive buffer (I have
> not used the use_srq parameter before) cause the behaviour described?
hard to say. we saw that there was heavy complaining about
too small a value impeding performance and then changed it
and had no problems since. we never tried the other way around,
since the machine has been running rock solid and with good
performance for 6 months now.
cheers,
axel.
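
A quick way to see which limit a process actually gets is to query it from inside the process. The sketch below assumes that the "reserved memory limit" discussed above is the locked-memory limit (RLIMIT_MEMLOCK, the value `ulimit -l` reports); the thread does not say so explicitly. The helper name is made up for illustration.

```python
import resource


def describe_memlock_limit():
    """Return the soft and hard locked-memory limits as readable strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

    def fmt(value):
        # RLIM_INFINITY means "no limit"; any other value is in bytes.
        if value == resource.RLIM_INFINITY:
            return "unlimited"
        return f"{value // 1024} kB"

    return fmt(soft), fmt(hard)


if __name__ == "__main__":
    soft, hard = describe_memlock_limit()
    print(f"locked memory limit: soft={soft}, hard={hard}")
```

It is probably worth running this through the same launcher or batch system as the real job, since the limit seen there may differ from the one in an interactive shell.
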
>
> Ondrej
>
> On Wed, Mar 4, 2009 at 15:47, Ondrej Marsalek <ondrej.... at gmail.com> wrote:
> > On Wed, Mar 4, 2009 at 14:49, Axel <akoh... at gmail.com> wrote:
>
> >> i remember people reporting some problems with openmpi 1.3 over
> >> infiniband for alltoall and related. the workaround was to insert a
> >> call to mpi_barrier before the alltoall.... you could try this.
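
The barrier workaround described above could look roughly like the following sketch. It is written with mpi4py purely for illustration (an assumption: CP2K itself is Fortran, and the real change would go next to its own alltoall wrapper); the function name is hypothetical.

```python
def alltoall_with_barrier(comm, send_items):
    """Synchronize all ranks, then exchange one item with every rank."""
    # The barrier makes every rank enter the alltoall at about the same time.
    comm.Barrier()
    return comm.alltoall(send_items)


if __name__ == "__main__":
    try:
        from mpi4py import MPI  # assumption: mpi4py is installed
    except ImportError:
        print("mpi4py not available; skipping the demo")
    else:
        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()
        # Each rank sends the tuple (sender, receiver) to every rank.
        received = alltoall_with_barrier(
            comm, [(rank, dest) for dest in range(size)]
        )
        print(f"rank {rank} received {received}")
```

The extra barrier costs some time on every call, so it is best seen as a diagnostic or stopgap rather than a fix.
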
>
> > Well, we upgraded because there was trouble before; at least now the
> > nodes do not freeze. The problem is that it runs fine for hours to days
> > and then "stops" without any other sign of failure.
>
> >> also with the openib btl, i always need to set btl_openib_use_srq to 1
> >> via --mca or ~/.openmpi/mca.conf.
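
For reference, the two ways of setting the parameter mentioned above can be sketched as follows. The helper names, the process count, and the `cp2k.popt input.inp` invocation are hypothetical placeholders. As far as I know, the per-user file Open MPI reads is `$HOME/.openmpi/mca-params.conf`, with one `name = value` entry per line.

```python
def mpirun_command(nprocs, program_args, mca_params):
    """Build an mpirun argument list with one --mca pair per parameter."""
    cmd = ["mpirun", "-np", str(nprocs)]
    for name, value in mca_params.items():
        cmd += ["--mca", name, str(value)]
    return cmd + list(program_args)


def mca_params_file_lines(mca_params):
    """Format parameters as 'name = value' lines for an MCA parameter file."""
    return [f"{name} = {value}" for name, value in mca_params.items()]


if __name__ == "__main__":
    params = {"btl_openib_use_srq": 1}
    # Hypothetical job: 16 processes running cp2k.popt on input.inp.
    print(" ".join(mpirun_command(16, ["cp2k.popt", "input.inp"], params)))
    print("\n".join(mca_params_file_lines(params)))
```
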
>
> > OK, I'll try to use it. When you say "need", what does that mean? What
> > happens otherwise?
>
> > Thanks,
> > Ondrej
>
> >> hope that helps.
> >> axel.
>
> >>> All the processes are alive and taking 100% CPU, but there is no
> >>> output (no, it is not the buffers, the timescales make that clear).
> >>> Attaching gdb reveals the stack shown (with a comment) at the end of
> >>> the message. I don't have Totalview there, but could maybe install the
> >>> trial, if there is a good chance that it helps. This problem appears
> >>> repeatedly. Before some software updates, we had a problem that looked
> >>> similar on the surface, but one of the nodes involved was entirely
> >>> frozen. As it does not happen elsewhere, I thought that the IB is to
> >>> blame, somehow, but seeing the stack now, I am not so sure.
>
> >>> I would very much appreciate any suggestions.
>
> >>> Best,
> >>> Ondrej
>
> >>> ---
>
> >>> The backtrace in gdb looks something like this, but is "constant" only
> >>> from opal_progress downwards. Clearly, opal_progress runs and you can
> >>> see other stuff above it, for example btl_sm_component_progress or
> >>> something deeper in btl_openib_component_progress.
>
> >>> (gdb) backtrace
> >>> #0 0x00002b0d461c6da6 in btl_openib_component_progress () from
> >>> /home/marsalek/opt/openmpi-1.3-intel/lib/openmpi/mca_btl_openib.so
> >>> #1 0x00002b0d41adc778 in opal_progress () from
> >>> /home/marsalek/opt/openmpi-1.3-intel/lib/libopen-pal.so.0
> >>> #2 0x00002b0d415dd7d2 in ompi_request_default_wait_all () from
> >>> /home/marsalek/opt/openmpi-1.3-intel/lib/libmpi.so.0
> >>> #3 0x00002b0d4764654b in ompi_coll_tuned_sendrecv_actual () from
> >>> /home/marsalek/opt/openmpi-1.3-intel/lib/openmpi/mca_coll_tuned.so
> >>> #4 0x00002b0d4764adee in ompi_coll_tuned_alltoall_intra_pairwise ()
> >>> from /home/marsalek/opt/openmpi-1.3-intel/lib/openmpi/mca_coll_tuned.so
> >>> #5 0x00002b0d47646e51 in ompi_coll_tuned_alltoall_intra_dec_fixed ()
> >>> from /home/marsalek/opt/openmpi-1.3-intel/lib/openmpi/mca_coll_tuned.so
> >>> #6 0x00002b0d415f51d5 in PMPI_Alltoall () from
> >>> /home/marsalek/opt/openmpi-1.3-intel/lib/libmpi.so.0
> >>> #7 0x00002b0d4139182c in pmpi_alltoall__ () from
> >>> /home/marsalek/opt/openmpi-1.3-intel/lib/libmpi_f77.so.0
> >>> #8 0x000000000075c348 in message_passing_mp_mp_alltoall_r45_ ()
> >>> #9 0x000000000170d294 in ps_wavelet_base_mp_f_poissonsolver_ ()
> >>> #10 0x00000000013ea32b in ps_wavelet_util_mp_psolver_ ()
> >>> #11 0x00000000013e2b1f in ps_wavelet_types_mp_ps_wavelet_solve_ ()
> >>> #12 0x0000000000810e18 in pw_poisson_methods_mp_pw_poisson_solve_ ()
> >>> #13 0x00000000008a670c in qs_ks_methods_mp_qs_ks_build_kohn_sham_matrix_ ()
> >>> #14 0x00000000008a46cb in qs_ks_methods_mp_qs_ks_update_qs_env_ ()
> >>> #15 0x0000000000885f11 in qs_force_mp_qs_forces_ ()
> >>> #16 0x000000000052f4f1 in force_env_methods_mp_force_env_calc_energy_force_ ()
> >>> #17 0x00000000012a5521 in integrator_mp_nvt_ ()
> >>> #18 0x0000000000b594c3 in velocity_verlet_control_mp_velocity_verlet_ ()
> >>> #19 0x0000000000723521 in md_run_mp_qs_mol_dyn_ ()
> >>> #20 0x0000000000498297 in cp2k_runs_mp_cp2k_run_ ()
> >>> #21 0x0000000000496f06 in cp2k_runs_mp_run_input_ ()
> >>> #22 0x0000000000495dba in MAIN__ ()
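
The observation that the stack is "constant" only from opal_progress downwards can be checked mechanically by comparing several gdb samples. The following is a minimal sketch with hypothetical helper names; it assumes frame lines in the `#N 0xADDR in name ()` form shown above, and the addresses in the demo are made up.

```python
import re

# Matches frame lines such as "#1 0x00002b0d41adc778 in opal_progress () from"
FRAME_RE = re.compile(r"#\d+\s+0x[0-9a-fA-F]+ in (\S+) \(")


def frame_names(backtrace_text):
    """Return function names from the innermost (#0) to the outermost frame."""
    return FRAME_RE.findall(backtrace_text)


def common_outer_frames(backtraces):
    """Return the frames shared by all samples, counted from the outermost."""
    stacks = [frame_names(text)[::-1] for text in backtraces]  # outermost first
    shared = []
    for names in zip(*stacks):
        if len(set(names)) != 1:
            break
        shared.append(names[0])
    return shared[::-1]  # back to innermost-first order


if __name__ == "__main__":
    sample_a = (
        "#0 0x1 in btl_openib_component_progress ()\n"
        "#1 0x2 in opal_progress ()\n"
        "#2 0x3 in ompi_request_default_wait_all ()\n"
    )
    sample_b = (
        "#0 0x4 in btl_sm_component_progress ()\n"
        "#1 0x2 in opal_progress ()\n"
        "#2 0x3 in ompi_request_default_wait_all ()\n"
    )
    print(common_outer_frames([sample_a, sample_b]))
```
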