Problems with SCP-NDDO when running in parallel
garold
g.murd... at gmail.com
Fri Feb 5 22:39:21 UTC 2010
Dear cp2k experts,
I am encountering a strange problem when using the SCP-NDDO method, both
for bulk water MD (periodic) and for GEO_OPT of water clusters (PERIODIC
NONE). Since the crash happens the same way in both cases, I am trying
to debug the simpler system, the cluster. Here is what I have observed:
1. serial (sopt) GEO_OPT: no problem for water_n, n=1-21; can
reproduce results obtained with another program
2. parallel (popt, 32 procs or 4 procs) GEO_OPT: calculations crash
for water_n if n >= 4 (see the error message below for water_4 on 4 procs)
3. parallel (popt, 32 procs) ENERGY and ENERGY_FORCE: no problem, even
for water_4
4. parallel (popt and pdbg, 2 procs) GEO_OPT: no problem, even for
water_4
The crash does not happen with the 29May09 CVS version (but that
version has other problems, so I would like to use the latest version
if I can). It does happen with the current version (04Feb10), as well
as with some earlier versions I examined (19Jan10, 01Dec09, 24Nov09,
12Oct09).
Following the source development on the CP2K website, I see that some
major changes took place between 29May09 and 12Oct09, in particular
the introduction of the dbcsr* routines for data manipulation. Perhaps
the crash is related to this? I have tried to do some debugging to
localize the error more precisely but have not gotten very far yet.
Using DDT on 4 procs I can see exactly where the crash occurs, but I
suspect the error is much further upstream (a PDF containing snapshots
from a DDT session is attached). Note that the crash occurs on the
11th step of geometry optimization; up to that point there are only
small differences compared to a run that works (e.g., on 2 procs). I
have also attached the input and output files (the bad output as well
as a good output for sdiff-ing).
Thank you in advance for any assistance or suggestions.
Best,
Garold
PS:
Here is the error message for the water_4 GEO_OPT on 4 procs
(see the attachments for more details:
cp2k_scpnddo_parallel_bug_nprocs4_w4_geoopt_ddt.pdf
w4.out_seems_correct_2_procs
w4.out
w4.inp
)
.
[lines deleted]
.
--------------------------
OPTIMIZATION STEP: 11
--------------------------
Number of electrons:                          32
Number of occupied orbitals:                  16
Number of molecular orbitals:                 16
Number of orbital functions:                  48
Number of independent orbital functions:      48
.
[lines deleted]
.
  7 OT DIIS  0.15E+00  0.0  0.00000197  -46.1242466471  -2.46E-06
      Core Hamiltonian energy:        -75.9765505690
      Two-electron integral energy:  -123.1122092931
      SCP electrostatic energy:        -0.1642845813
      SCP kernel energy:                0.1642798700
      SCP dispersion energy:           -0.0703730607
  8 OT DIIS  0.15E+00  0.0  0.00000153  -46.1242406283   6.02E-06
      Core Hamiltonian energy:        -75.9764389930
      Two-electron integral energy:  -123.1124672616
      SCP electrostatic energy:        -0.1642952273
      SCP kernel energy:                0.1642969843
      SCP dispersion energy:           -0.0703730607
  9 OT DIIS  0.15E+00  0.0  0.00000094  -46.1242515683  -1.09E-05
*** SCF run converged in 9 steps ***
  Core-core repulsion energy [eV]:        2489.26453361951690
  Core Hamiltonian energy [eV]:          -2067.42404549389585
  Two-electron integral energy [eV]:     -3350.06060418325342
  Electronic energy [eV]:                -3742.45434758552256
  Total energy [eV]:                     -1255.10471452199113
  Atomic reference energy [eV]:           1223.34254295964183
  Heat of formation [kcal/mol]:           -732.45313459066131
CP2K| MPI error 686886414 in mpi_allreduce @ mp_sum_d : Message truncated, error stack:
MPI_Allreduce(714)...............: MPI_Allreduce(sbuf=0x7fffffff1138, rbuf=0x7fffffff0688, count=1, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD) failed
MPIDI_CRAY_SMPClus_Allreduce(209): MPIDI_CRAY_SMPClus_Allreduce(sendbuf=0x7fffffff1138, recvbuf=0x7fffffff0688, count=1, type=MPI_DOUBLE_PRECISION, op=MPI_SUM, comm=MPI_COMM_WORLD) failed
MPIR_Allreduce(289)..............:
(unknown)(): Message truncated
CP2K| Abnormal program termination, stopped by process number 3
Application 161757 exit codes: 1
Application 161757 resources: utime 0, stime 0
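
In case it helps narrow things down: as far as I understand, "Message
truncated" from MPI_Allreduce usually means the ranks did not agree on
the collective, e.g. one rank passes a larger count than another, or the
ranks end up in different code paths and post different collectives.
Below is a minimal standalone sketch, not CP2K code (the program name
and counts are made up purely for illustration), of the kind of count
mismatch that MPICH typically reports as MPI_ERR_TRUNCATE; the "failing"
rank calls the allreduce with count=1, as in the traceback above, while
another rank contributes more data. Depending on the allreduce algorithm
such a mismatch can also hang or fail differently, so this is only a
guess at the failure mode, not a diagnosis.

  ! Hypothetical reproducer, not CP2K code: the ranks disagree on the count
  ! passed to MPI_Allreduce, which MPICH tends to report as "Message truncated".
  program allreduce_mismatch
    use mpi
    implicit none
    integer :: ierr, rank, n
    real(kind=8) :: sbuf(2), rbuf(2)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

    sbuf = 1.0d0
    ! Rank 0 reduces two doubles, everyone else reduces one: the collective
    ! is mismatched, and the extra data is "truncated" on the smaller ranks.
    if (rank == 0) then
       n = 2
    else
       n = 1
    end if

    call MPI_Allreduce(sbuf, rbuf, n, MPI_DOUBLE_PRECISION, MPI_SUM, &
                       MPI_COMM_WORLD, ierr)

    call MPI_Finalize(ierr)
  end program allreduce_mismatch

If something like this is what is going on, the real question is
probably which code path makes process 3 call mp_sum_d at a different
point (or with a different count) than the other ranks after
optimization step 11.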