[CP2K:1824] Re: CP2K calculation stuck in communication

Ondrej Marsalek ondrej.... at gmail.com
Wed Mar 4 15:00:35 UTC 2009


And one more thing - I have found a post by you elsewhere, Axel,
mentioning that the reserved memory limit is important. I have just
checked and found that it is set to 1024 kB on our nodes.
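
Assuming the limit in question is the per-process locked-memory limit
that InfiniBand registered memory counts against, this is roughly how
it can be checked and raised on a node (just a sketch, the exact
settings depend on the system):

  # shows "max locked memory" in kB for the current shell
  ulimit -l

  # to raise it for all users, e.g. in /etc/security/limits.conf:
  # *  soft  memlock  unlimited
  # *  hard  memlock  unlimited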

Could this, together with not using a shared receive buffer (I have
not used the use_srq parameter before), cause the behaviour described?
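
If I understand the suggestion below correctly, enabling it would look
something along these lines (the command line is only illustrative, the
actual job script of course differs):

  mpirun --mca btl_openib_use_srq 1 -np <N> cp2k.popt <input>

or, equivalently, putting "btl_openib_use_srq = 1" into the MCA
parameter file Axel mentions.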

Ondrej


On Wed, Mar 4, 2009 at 15:47, Ondrej Marsalek <ondrej.... at gmail.com> wrote:
> On Wed, Mar 4, 2009 at 14:49, Axel <akoh... at gmail.com> wrote:
>>
>> i remember people reporting some problems with openmpi 1.3 over
>> infiniband
>> for alltoall and related. the workaround was to insert a call to
>> mpi_barrier
>> before the alltoall.... you could try this.
>
> Well, we upgraded because there was trouble before; at least now the
> nodes do not freeze. The problem is that it runs fine for hours to days
> and then "stops" without any other sign of failure.
>
>> also with the openib btl, i always need to set btl_openib_use_srq to 1
>> via --mca or ~/.openmpi/mca.conf.
>
> OK, I'll try to use it. When you say "need", what does that mean? What
> happens otherwise?
>
> Thanks,
> Ondrej
>
>
>> hope that helps.
>>    axel.
>>
>>
>>> All the processes are alive and taking 100% CPU, but there is no
>>> output (no, it is not the buffers, the timescales make that clear).
>>> Attaching gdb reveals the stack shown (with a comment) at the end of
>>> the message. I don't have Totalview there, but could maybe install the
>>> trial if there is a good chance that it would help. This problem appears
>>> repeatedly. Before some software updates, we had a problem that looked
>>> similar on the surface, but one of the nodes involved was entirely
>>> frozen. As it does not happen elsewhere, I thought that the IB was
>>> somehow to blame, but seeing the stack now, I am not so sure.
>>>
>>> I would very much appreciate any suggestions.
>>>
>>> Best,
>>> Ondrej
>>>
>>> ---
>>>
>>> The backtrace in gdb looks something like this, but is "constant" only
>>> from opal_progress downwards. Clearly, opal_progress keeps running, and
>>> you can see different frames above it, for example btl_sm_component_progress
>>> or something deeper in btl_openib_component_progress.
>>>
>>> (gdb) backtrace
>>> #0  0x00002b0d461c6da6 in btl_openib_component_progress () from
>>> /home/marsalek/opt/openmpi-1.3-intel/lib/openmpi/mca_btl_openib.so
>>> #1  0x00002b0d41adc778 in opal_progress () from
>>> /home/marsalek/opt/openmpi-1.3-intel/lib/libopen-pal.so.0
>>> #2  0x00002b0d415dd7d2 in ompi_request_default_wait_all () from
>>> /home/marsalek/opt/openmpi-1.3-intel/lib/libmpi.so.0
>>> #3  0x00002b0d4764654b in ompi_coll_tuned_sendrecv_actual () from
>>> /home/marsalek/opt/openmpi-1.3-intel/lib/openmpi/mca_coll_tuned.so
>>> #4  0x00002b0d4764adee in ompi_coll_tuned_alltoall_intra_pairwise ()
>>> from /home/marsalek/opt/openmpi-1.3-intel/lib/openmpi/mca_coll_tuned.so
>>> #5  0x00002b0d47646e51 in ompi_coll_tuned_alltoall_intra_dec_fixed ()
>>> from /home/marsalek/opt/openmpi-1.3-intel/lib/openmpi/mca_coll_tuned.so
>>> #6  0x00002b0d415f51d5 in PMPI_Alltoall () from
>>> /home/marsalek/opt/openmpi-1.3-intel/lib/libmpi.so.0
>>> #7  0x00002b0d4139182c in pmpi_alltoall__ () from
>>> /home/marsalek/opt/openmpi-1.3-intel/lib/libmpi_f77.so.0
>>> #8  0x000000000075c348 in message_passing_mp_mp_alltoall_r45_ ()
>>> #9  0x000000000170d294 in ps_wavelet_base_mp_f_poissonsolver_ ()
>>> #10 0x00000000013ea32b in ps_wavelet_util_mp_psolver_ ()
>>> #11 0x00000000013e2b1f in ps_wavelet_types_mp_ps_wavelet_solve_ ()
>>> #12 0x0000000000810e18 in pw_poisson_methods_mp_pw_poisson_solve_ ()
>>> #13 0x00000000008a670c in qs_ks_methods_mp_qs_ks_build_kohn_sham_matrix_ ()
>>> #14 0x00000000008a46cb in qs_ks_methods_mp_qs_ks_update_qs_env_ ()
>>> #15 0x0000000000885f11 in qs_force_mp_qs_forces_ ()
>>> #16 0x000000000052f4f1 in force_env_methods_mp_force_env_calc_energy_force_ ()
>>> #17 0x00000000012a5521 in integrator_mp_nvt_ ()
>>> #18 0x0000000000b594c3 in velocity_verlet_control_mp_velocity_verlet_ ()
>>> #19 0x0000000000723521 in md_run_mp_qs_mol_dyn_ ()
>>> #20 0x0000000000498297 in cp2k_runs_mp_cp2k_run_ ()
>>> #21 0x0000000000496f06 in cp2k_runs_mp_run_input_ ()
>>> #22 0x0000000000495dba in MAIN__ ()
>>>
>>
>


