[CP2K:2540] problems with scp-nddo when running in parallel
Laino Teodoro
teodor... at gmail.com
Sat Feb 6 21:33:37 UTC 2010
Dear Garold,
Urban provided a patch for the first bug. It is already in CVS.
Unfortunately, there is still another error, but this one is related to
the BFGS implementation that Florian modified about a year ago.
The error shows up at this line:
In CP_FM_SYEVD, line 124 of cp_fm_diag.F.
This is just a library call, which tells you that something is wrong
with the arguments passed to it.
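
Since the failure points at the arguments rather than at the library itself, it can help to validate the matrix just before the diagonalization call. What follows is a minimal Python sketch of that kind of check, not CP2K code: the helper name `check_symmetric_input` and the tolerance are hypothetical, and the two conditions it tests (non-finite entries and broken symmetry) are only assumed to be plausible culprits here.

```python
import math


def check_symmetric_input(matrix, tol=1e-10):
    """Return a list of problems found in a matrix meant for a symmetric eigensolver.

    Hypothetical helper for illustration only; it is not part of CP2K.
    An empty list means no problem was detected.
    """
    problems = []
    n = len(matrix)

    # The matrix must be square: every row needs exactly n entries.
    for i, row in enumerate(matrix):
        if len(row) != n:
            problems.append(f"row {i} has {len(row)} entries, expected {n}")
    if problems:
        return problems

    # NaN or Inf entries usually mean the error happened further upstream.
    for i in range(n):
        for j in range(n):
            if not math.isfinite(matrix[i][j]):
                problems.append(f"non-finite entry at ({i}, {j})")

    # Compare the upper triangle with the lower triangle.
    for i in range(n):
        for j in range(i + 1, n):
            a, b = matrix[i][j], matrix[j][i]
            if math.isfinite(a) and math.isfinite(b) and abs(a - b) > tol:
                problems.append(f"asymmetry at ({i}, {j}): {a} vs {b}")

    return problems
```

Running such a check on every process and comparing the reports would also show whether the ranks agree on what they are about to diagonalize.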
Florian, could you please check this problem? You can reproduce it
with the input Garold provided + NAG pdbg (2 procs: crashes after 11
steps; 4 procs: crashes after 2 steps).
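
To see at which optimization step a crashing run first departs from a working one, the total energies of the two outputs can be compared step by step. The Python sketch below assumes the labels "OPTIMIZATION STEP:" and "Total energy [eV]:" appear as in the excerpt quoted in this thread; the function names and the tolerance are hypothetical, and the layout of a real output file may differ.

```python
import re

STEP_RE = re.compile(r"OPTIMIZATION STEP:\s*(\d+)")
NUMBER_RE = re.compile(r"[-+]?\d+\.\d+(?:[eE][-+]?\d+)?")
ENERGY_LABEL = "Total energy [eV]:"  # label as printed in the quoted excerpt


def energies_by_step(text):
    """Map each optimization step number to the last total energy printed in it."""
    energies = {}
    step = None
    pending = False  # True when the label was seen but its value was not yet
    for line in text.splitlines():
        line = line.lstrip("> ").strip()  # tolerate mail quoting
        if not line:
            continue
        match = STEP_RE.search(line)
        if match:
            step = int(match.group(1))
            pending = False
            continue
        on_label_line = ENERGY_LABEL in line
        if on_label_line:
            tail = line.split(ENERGY_LABEL, 1)[1]
        elif pending:
            tail = line  # value wrapped onto the next non-empty line
        else:
            continue
        number = NUMBER_RE.search(tail)
        if number and step is not None:
            energies[step] = float(number.group(0))
        # Look at most one line past the label for a wrapped value.
        pending = on_label_line and number is None
    return energies


def first_divergence(good, bad, tol=1e-6):
    """Return (step, good_energy, bad_energy) for the first differing step, or None."""
    for step in sorted(set(good) & set(bad)):
        if abs(good[step] - bad[step]) > tol:
            return step, good[step], bad[step]
    return None
```

Fed with the contents of the two attached outputs (w4.out_seems_correct_2_procs and w4.out), `first_divergence` would report the earliest step whose energies differ by more than the chosen tolerance, which is a better starting point for debugging than the step where the run finally aborts.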
Thanks
Teo
On 5 Feb 2010, at 23:39, garold wrote:
> Dear cp2k experts,
>
> I am encountering a strange problem when using the SCP-NDDO method for
> both bulk water MD (periodic) and water clusters GEO_OPT (periodic
> none). Since the crash happens the same way in both cases, I am trying
> to debug in the case of the simpler system, the cluster. Here is what
> I have observed:
>
> 1. serial (sopt) GEO_OPT: no problem for water_n, n=1-21; can
> reproduce results obtained with another program
>
> 2. parallel (popt, 32 procs or 4 procs) GEO_OPT: calculations crash
> for water_n, if n>=4 (see error message below for water_4 on 4 procs)
>
> 3. parallel (popt, 32 procs) ENERGY and ENERGY_FORCE: no problem, even
> for water_4
>
> 4. parallel (popt and pdbg, 2 procs) GEO_OPT: no problem, even for
> water_4
>
> The crash does not happen with the 29May09 cvs version (but that
> version has other problems and thus I would like to use the latest
> version if I can). It does happen with the current version (04feb10),
> as well as with some earlier versions I examined (19jan10, 01dec09,
> 24nov09, 12oct09).
>
> Following the src development on the cp2k website, I see that some
> major changes took place in the period 29may - 12oct, in particular
> the introduction of the dbcsr* routines for data manipulation. Perhaps
> the crash is related to this? I have tried to do some debugging to
> localize the error more precisely but have not gotten very far yet. I
> used ddt on 4 procs and can see exactly where the crash occurs but I
> suspect the error is much further upstream (a pdf containing snapshots
> from a ddt session is attached). Note that the crash occurs on the
> 11th step of geometry optimization; up to that point there are only
> small differences compared to a run that works (e.g., on 2 procs). I
> have also attached the input and output files (the bad output as well
> as a good output for sdiff-ing).
>
> Thank you in advance for any assistance or suggestions.
>
> Best,
> Garold
>
>
> ps:
>
> Here’s the error message for water_4 geo_opt on 4 procs
>
> (see the attachments for more details:
> cp2k_scpnddo_parallel_bug_nprocs4_w4_geoopt_ddt.pdf
> w4.out_seems_correct_2_procs
> w4.out
> w4.inp
> )
>
>
> .
> [lines deleted]
> .
>
> --------------------------
> OPTIMIZATION STEP: 11
> --------------------------
>
> Number of electrons: 32
> Number of occupied orbitals: 16
> Number of molecular orbitals: 16
>
> Number of orbital functions: 48
> Number of independent orbital functions: 48
>
> .
> [lines deleted]
> .
>
> 7 OT DIIS 0.15E+00 0.0 0.00000197 -46.1242466471 -2.46E-06
>
> Core Hamiltonian energy: -75.9765505690
> Two-electron integral energy: -123.1122092931
> SCP electrostatic energy: -0.1642845813
> SCP kernel energy: 0.1642798700
> SCP dispersion energy: -0.0703730607
> 8 OT DIIS 0.15E+00 0.0 0.00000153 -46.1242406283 6.02E-06
>
> Core Hamiltonian energy: -75.9764389930
> Two-electron integral energy: -123.1124672616
> SCP electrostatic energy: -0.1642952273
> SCP kernel energy: 0.1642969843
> SCP dispersion energy: -0.0703730607
> 9 OT DIIS 0.15E+00 0.0 0.00000094 -46.1242515683 -1.09E-05
>
> *** SCF run converged in 9 steps ***
>
>
> Core-core repulsion energy [eV]: 2489.26453361951690
> Core Hamiltonian energy [eV]: -2067.42404549389585
> Two-electron integral energy [eV]: -3350.06060418325342
> Electronic energy [eV]: -3742.45434758552256
>
> Total energy [eV]: -1255.10471452199113
>
> Atomic reference energy [eV]: 1223.34254295964183
> Heat of formation [kcal/mol]: -732.45313459066131
> CP2K| MPI error 686886414 in mpi_allreduce @ mp_sum_d : Message truncated, error stack:
> MPI_Allreduce(714)...............: MPI_Allreduce(sbuf=0x7fffffff1138, rbuf=0x7fffffff0688, count=1, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD) failed
> MPIDI_CRAY_SMPClus_Allreduce(209): MPIDI_CRAY_SMPClus_Allreduce(sendbuf=0x7fffffff1138, recvbuf=0x7fffffff0688, count=1, type=MPI_DOUBLE_PRECISION, op=MPI_SUM, comm=MPI_COMM_WORLD) failed
> MPIR_Allreduce(289)..............: (unknown)(): Message truncated
> CP2K| Abnormal program termination, stopped by process number 3
> Application 161757 exit codes: 1
> Application 161757 resources: utime 0, stime 0
>
