[CP2K:2540] problems with scp-nddo when running in parallel
Laino Teodoro
teodor... at gmail.com
Fri Feb 5 23:19:15 UTC 2010
Thanks Garold,
this is indeed a bug in one of the dbcsr_util routines.
Specifically, with the input you provided, I was able to track the problem
down to ENSURE_ARRAY_SIZE_D, line 477 of dbcsr_util_d_.F .
This error happens whether you do GEO_OPT or ENERGY, so the fact that you
observe the crash only at the 11th step is just because the memory gets
progressively dirty and only bombs at that point. The bug itself, however,
appears from the very beginning, during the setup of some of the NDDO
matrices.
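
To illustrate the failure mode: ENSURE_ARRAY_SIZE_D follows the usual
grow-a-buffer-on-demand pattern. A minimal sketch of that pattern (this is
NOT the actual CP2K source; the interface, growth policy and bookkeeping
here are purely illustrative):

  SUBROUTINE ensure_array_size_d(array, used, needed)
    ! Grow a rank-1 buffer so that at least "needed" elements fit,
    ! preserving the "used" elements already stored in it.
    REAL(KIND=8), DIMENSION(:), POINTER :: array
    INTEGER, INTENT(IN)                 :: used, needed
    REAL(KIND=8), DIMENSION(:), POINTER :: tmp

    IF (needed > SIZE(array)) THEN
       ALLOCATE (tmp(MAX(needed, 2*SIZE(array))))
       tmp(1:used) = array(1:used)   ! copy the live data across
       DEALLOCATE (array)
       array => tmp                  ! SIZE(array) is now the only valid size
    END IF
  END SUBROUTINE ensure_array_size_d

If the requested size is computed incorrectly, or a caller keeps writing
with a stale size after the reallocation, the writes land past the end of
the buffer. Such an overflow corrupts neighbouring memory silently, which
is why a run can survive ten optimization steps before it finally bombs.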
Urban or Valery: could you please have a look at this problem?
Thanks!
Teo
On 5 Feb 2010, at 23:39, garold wrote:
> Dear cp2k experts,
>
> I am encountering a strange problem when using the SCP-NDDO method for
> both bulk water MD (periodic) and GEO_OPT of water clusters (PERIODIC
> NONE). Since the crash happens the same way in both cases, I am trying
> to debug the simpler system, the cluster. Here is what I have
> observed:
>
> 1. serial (sopt) GEO_OPT: no problem for water_n, n=1-21; can
> reproduce results obtained with another program
>
> 2. parallel (popt, 32 procs or 4 procs) GEO_OPT: calculations crash
> for water_n, if n>=4 (see error message below for water_4 on 4 procs)
>
> 3. parallel (popt, 32 procs) ENERGY and ENERGY_FORCE: no problem, even
> for water_4
>
> 4. parallel (popt and pdbg, 2 procs) GEO_OPT: no problem, even for
> water_4
>
> The crash does not happen with the 29May09 cvs version (but that
> version has other problems and thus I would like to use the latest
> version if I can). It does happen with the current version (04feb10),
> as well as with some earlier versions I examined (19jan10, 01dec09,
> 24nov09, 12oct09).
>
> Following the src development on the cp2k website, I see that some
> major changes took place in the period 29may - 12oct, in particular
> the introduction of the dbcsr* routines for data manipulation. Perhaps
> the crash is related to this? I have tried to do some debugging to
> localize the error more precisely but have not gotten very far yet. I
> used ddt on 4 procs and can see exactly where the crash occurs but I
> suspect the error is much further upstream (a pdf containing snapshots
> from a ddt session is attached). Note that the crash occurs on the
> 11th step of geometry optimization; up to that point there are only
> small differences compared to a run that works (e.g., on 2 procs). I
> have also attached the input and output files (the bad output as well
> as a good output for sdiff-ing).
>
> Thank you in advance for any assistance or suggestions.
>
> Best,
> Garold
>
>
> ps:
>
> Here’s the error message for water_4 geo_opt on 4 procs
>
> (see the attachments for more details:
> cp2k_scpnddo_parallel_bug_nprocs4_w4_geoopt_ddt.pdf
> w4.out_seems_correct_2_procs
> w4.out
> w4.inp
> )
>
>
> .
> [lines deleted]
> .
>
> --------------------------
> OPTIMIZATION STEP: 11
> --------------------------
>
> Number of electrons:                                 32
> Number of occupied orbitals:                         16
> Number of molecular orbitals:                        16
>
> Number of orbital functions:                         48
> Number of independent orbital functions:             48
>
> .
> [lines deleted]
> .
>
> 7 OT DIIS 0.15E+00 0.0 0.00000197 -46.1242466471 -2.46E-06
>
> Core Hamiltonian energy:                -75.9765505690
> Two-electron integral energy:          -123.1122092931
> SCP electrostatic energy:                -0.1642845813
> SCP kernel energy:                        0.1642798700
> SCP dispersion energy:                   -0.0703730607
>
> 8 OT DIIS 0.15E+00 0.0 0.00000153 -46.1242406283 6.02E-06
>
> Core Hamiltonian energy:                -75.9764389930
> Two-electron integral energy:          -123.1124672616
> SCP electrostatic energy:                -0.1642952273
> SCP kernel energy:                        0.1642969843
> SCP dispersion energy:                   -0.0703730607
>
> 9 OT DIIS 0.15E+00 0.0 0.00000094 -46.1242515683 -1.09E-05
>
> *** SCF run converged in 9 steps ***
>
>
> Core-core repulsion energy [eV]:       2489.26453361951690
> Core Hamiltonian energy [eV]:         -2067.42404549389585
> Two-electron integral energy [eV]:    -3350.06060418325342
> Electronic energy [eV]:               -3742.45434758552256
>
> Total energy [eV]:                    -1255.10471452199113
>
> Atomic reference energy [eV]:          1223.34254295964183
> Heat of formation [kcal/mol]:          -732.45313459066131
> CP2K| MPI error 686886414 in mpi_allreduce @ mp_sum_d : Message truncated, error stack:
> MPI_Allreduce(714)...............: MPI_Allreduce(sbuf=0x7fffffff1138, rbuf=0x7fffffff0688, count=1, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD) failed
> MPIDI_CRAY_SMPClus_Allreduce(209): MPIDI_CRAY_SMPClus_Allreduce(sendbuf=0x7fffffff1138, recvbuf=0x7fffffff0688, count=1, type=MPI_DOUBLE_PRECISION, op=MPI_SUM, comm=MPI_COMM_WORLD) failed
> MPIR_Allreduce(289)..............:
> (unknown)(): Message truncated
> CP2K| Abnormal program termination, stopped by process number 3
> Application 161757 exit codes: 1
> Application 161757 resources: utime 0, stime 0
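
A remark on the error message itself: "Message truncated" from
MPI_Allreduce is the typical symptom of ranks entering the same collective
with different counts or types. A minimal reproducer, assuming an
MPICH-based MPI like the Cray one in the log above (illustrative code, not
CP2K source):

  PROGRAM allreduce_mismatch
    USE mpi
    IMPLICIT NONE
    INTEGER      :: ierr, rank, n
    REAL(KIND=8) :: sbuf(2), rbuf(2)

    CALL MPI_Init(ierr)
    CALL MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    sbuf = 1.0D0
    ! Rank 0 reduces two elements, all other ranks only one: the
    ! collective's internal messages no longer match, and the library
    ! typically aborts with "Message truncated", as in the log above.
    n = MERGE(2, 1, rank == 0)
    CALL MPI_Allreduce(sbuf, rbuf, n, MPI_DOUBLE_PRECISION, &
                       MPI_SUM, MPI_COMM_WORLD, ierr)
    CALL MPI_Finalize(ierr)
  END PROGRAM allreduce_mismatch

In our case the mismatched count presumably comes from the corrupted size
bookkeeping described above, so the allreduce is only where the damage
becomes visible, not where it originates.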