problems with scp-nddo when running in parallel

garold g.murd... at gmail.com
Fri Feb 5 22:39:21 UTC 2010


Dear cp2k experts,

I am encountering a strange problem when using the SCP-NDDO method for
both bulk water MD (periodic) and water cluster GEO_OPT (periodic
none). Since the crash happens the same way in both cases, I am trying
to debug the simpler system, the cluster.  Here is what I have
observed:

1. serial (sopt) GEO_OPT: no problem for water_n, n=1-21; can
reproduce results obtained with another program

2. parallel (popt, 32 procs or 4 procs) GEO_OPT: calculations crash
for water_n, if n>=4 (see error message below for water_4 on 4 procs)

3. parallel (popt, 32 procs) ENERGY and ENERGY_FORCE: no problem, even
for water_4

4. parallel (popt and pdbg, 2 procs) GEO_OPT: no problem, even for
water_4

The crash does not happen with the 29May09 cvs version (but that
version has other problems and thus I would like to use the latest
version if I can).  It does happen with the current version (04feb10),
as well as with some earlier versions I examined (19jan10, 01dec09,
24nov09, 12oct09).

Following the src development on the cp2k website, I see that some
major changes took place in the period 29may - 12oct, in particular
the introduction of the dbcsr* routines for data manipulation. Perhaps
the crash is related to this? I have tried to do some debugging to
localize the error more precisely but have not gotten very far yet.  I
used ddt on 4 procs and can see exactly where the crash occurs but I
suspect the error is much further upstream (a pdf containing snapshots
from a ddt session is attached).  Note that the crash occurs on the
11th step of geometry optimization; up to that point there are only
small differences compared to a run that works (e.g., on 2 procs).  I
have also attached the input and output files (the bad output as well
as a good output for sdiff-ing).

Thank you in advance for any assistance or suggestions.

Best,
Garold


ps:

Here’s the error message for water_4 geo_opt on 4 procs

(see the attachments for more details:
	cp2k_scpnddo_parallel_bug_nprocs4_w4_geoopt_ddt.pdf
	w4.out_seems_correct_2_procs
	w4.out
	w4.inp
)


.
[lines deleted]
.

 --------------------------
 OPTIMIZATION STEP:     11
 --------------------------

 Number of electrons:                                                         32
 Number of occupied orbitals:                                                 16
 Number of molecular orbitals:                                                16

 Number of orbital functions:                                                 48
 Number of independent orbital functions:                                     48

.
[lines deleted]
.

     7 OT DIIS     0.15E+00    0.0     0.00000197       -46.1242466471 -2.46E-06

  Core Hamiltonian energy:                                      -75.9765505690
  Two-electron integral energy:                                -123.1122092931
  SCP electrostatic energy:                                       -0.1642845813
  SCP kernel energy:                                               0.1642798700
  SCP dispersion energy:                                          -0.0703730607
     8 OT DIIS     0.15E+00    0.0     0.00000153       -46.1242406283  6.02E-06

  Core Hamiltonian energy:                                      -75.9764389930
  Two-electron integral energy:                                -123.1124672616
  SCP electrostatic energy:                                       -0.1642952273
  SCP kernel energy:                                               0.1642969843
  SCP dispersion energy:                                          -0.0703730607
     9 OT DIIS     0.15E+00    0.0     0.00000094       -46.1242515683 -1.09E-05

  *** SCF run converged in     9 steps ***


  Core-core repulsion energy [eV]:                          2489.26453361951690
  Core Hamiltonian energy [eV]:                            -2067.42404549389585
  Two-electron integral energy [eV]:                       -3350.06060418325342
  Electronic energy [eV]:                                  -3742.45434758552256

  Total energy [eV]:                                       -1255.10471452199113

  Atomic reference energy [eV]:                             1223.34254295964183
  Heat of formation [kcal/mol]:                             -732.45313459066131
 CP2K|  MPI error 686886414 in mpi_allreduce @ mp_sum_d : Message truncated, error stack:
MPI_Allreduce(714)...............: MPI_Allreduce(sbuf=0x7fffffff1138, rbuf=0x7fffffff0688, count=1, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD) failed
MPIDI_CRAY_SMPClus_Allreduce(209): MPIDI_CRAY_SMPClus_Allreduce(sendbuf=0x7fffffff1138, recvbuf=0x7fffffff0688, count=1, type=MPI_DOUBLE_PRECISION, op=MPI_SUM, comm=MPI_COMM_WORLD) failed
MPIR_Allreduce(289)..............:
(unknown)(): Message truncated
 CP2K| Abnormal program termination, stopped by process number 3
Application 161757 exit codes: 1
Application 161757 resources: utime 0, stime 0
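
For what it's worth, a "Message truncated" failure inside MPI_Allreduce
usually means the ranks disagree on the count (or datatype) passed to the
collective, so one process receives more data than it expects. Below is a
minimal sketch of that failure mode (not CP2K code; the program and values
are made up purely for illustration) which, with MPICH-style libraries,
tends to abort with an error of this kind:

    /* Hypothetical reproducer: ranks pass different counts to
     * MPI_Allreduce, which is an erroneous (mismatched) collective. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double sbuf[2] = {1.0, 2.0};
        double rbuf[2] = {0.0, 0.0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Rank 0 reduces 2 elements, every other rank reduces 1; the
         * implementation may abort with a truncation error much like
         * the one in the output above. */
        int count = (rank == 0) ? 2 : 1;
        MPI_Allreduce(sbuf, rbuf, count, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        printf("rank %d: rbuf[0] = %f\n", rank, rbuf[0]);
        MPI_Finalize();
        return 0;
    }

If something like this is what happens here, the real bug would be upstream
of mp_sum_d itself, e.g. one rank entering the reduction with a different
number of elements (or skipping it entirely) during the geometry update,
which would fit the fact that only some process counts trigger the crash.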


