problems with scp-nddo when running in parallel

garold g.murd... at
Fri Feb 5 22:39:21 UTC 2010

Dear cp2k experts,

I am encountering a strange problem when using the SCP-NDDO method for
both bulk water MD (periodic) and water clusters GEO_OPT (periodic
none). Since the crash happens the same way in both cases, I am trying
to debug in the case of the simpler system, the cluster.  Here is what
I have observed:

1. serial (sopt) GEO_OPT: no problem for water_n, n=1-21; can
reproduce results obtained with another program

2. parallel (popt, 32 procs or 4 procs) GEO_OPT: calculations crash
for water_n, if n>=4 (see error message below for water_4 on 4 procs)

3. parallel (popt, 32 procs) ENERGY and ENERGY_FORCE: no problem, even
for water_4

4. parallel (popt and pdbg, 2 procs) GEO_OPT: no problem, even for

The crash does not happen with the 29May09 cvs version (but that
version has other problems and thus I would like to use the latest
version if I can).  It does happen with the current version (04feb10),
as well as with some earlier versions I examined (19jan10, 01dec09,
24nov09, 12oct09).

Following the src development on the cp2k website, I see that some
major changes took place in the period 29may - 12oct, in particular
the introduction of the dbcsr* routines for data manipulation. Perhaps
the crash is related to this? I have tried to do some debugging to
localize the error more precisely but have not gotten very far yet.  I
used ddt on 4 procs and can see exactly where the crash occurs but I
suspect the error is much further upstream (a pdf containing snapshots
from a ddt session is attached).  Note that the crash occurs on the
11th step of geometry optimization; up to that point there are only
small differences compared to a run that works (e.g., on 2 procs).  I
have also attached the input and output files (a bad output as well
as a good output for sdiff-ing).
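
As a complement to sdiff, a short script could pull the SCF iteration lines out of the good and bad outputs and report the first one where the total energies disagree. The following is only a minimal sketch: the function names, the regular expression, and the tolerance are hypothetical, and the assumed column layout (step, "OT DIIS", time step, time, convergence, total energy) is inferred from the excerpt in this post, not taken from the CP2K documentation. It also assumes each iteration is on a single line in the real output files, unlike the mail-wrapped excerpt.

```python
import re

# Assumed layout of an SCF iteration line, inferred from the excerpt in this post:
#   step  OT DIIS  time-step  time  convergence  total-energy  [energy change]
SCF_LINE = re.compile(
    r"^\s*(\d+)\s+OT\s+DIIS\s+(\S+)\s+(\S+)\s+(\S+)\s+(-?\d+\.\d+)"
)


def scf_energies(text):
    """Return (scf_step, total_energy) for every OT DIIS line found in text."""
    energies = []
    for line in text.splitlines():
        match = SCF_LINE.match(line)
        if match:
            energies.append((int(match.group(1)), float(match.group(5))))
    return energies


def first_divergence(good, bad, tol=1e-6):
    """Return the index of the first SCF line where the two runs disagree.

    Two entries disagree if their step numbers differ or their total
    energies differ by more than tol. Returns None if the common prefix
    of both lists agrees everywhere.
    """
    for index, ((step_g, e_g), (step_b, e_b)) in enumerate(zip(good, bad)):
        if step_g != step_b or abs(e_g - e_b) > tol:
            return index
    return None
```

Reading both files with `open(path).read()`, passing each through `scf_energies`, and calling `first_divergence` on the two lists would give the position of the first SCF iteration (counted over the whole run) at which the 4-proc and 2-proc runs part ways. The tolerance would need tuning, since small differences between runs on different processor counts are expected.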

Thank you in advance for any assistance or suggestions.



Here’s the error message for water_4 GEO_OPT on 4 procs
(see the attachments for more details):

[lines deleted]


 Number of electrons:                                                         32
 Number of occupied orbitals:                                                 16
 Number of molecular orbitals:                                                16

 Number of orbital functions:                                                 48
 Number of independent orbital functions:                                     48

[lines deleted]

     7 OT DIIS     0.15E+00    0.0     0.00000197       -46.1242466471

  Core Hamiltonian energy:
  Two-electron integral energy:
  SCP electrostatic energy:
  SCP kernel energy:
  SCP dispersion energy:
     8 OT DIIS     0.15E+00    0.0     0.00000153       -46.1242406283  6.02E-06

  Core Hamiltonian energy:
  Two-electron integral energy:
  SCP electrostatic energy:
  SCP kernel energy:
  SCP dispersion energy:
     9 OT DIIS     0.15E+00    0.0     0.00000094       -46.1242515683

  *** SCF run converged in     9 steps ***

  Core-core repulsion energy [eV]:
  Core Hamiltonian energy [eV]:
  Two-electron integral energy [eV]:
  Electronic energy [eV]:

  Total energy [eV]:

  Atomic reference energy [eV]:
  Heat of formation [kcal/mol]:
 CP2K|  MPI error 686886414 in mpi_allreduce @ mp_sum_d : Message truncated, error stack:
MPI_Allreduce(714)...............: MPI_Allreduce(sbuf=0x7fffffff1138,
rbuf=0x7fffffff0688, count=1, MPI_DOUBLE_PRECISION, MPI_SUM,
recvbuf=0x7fffffff0688, count=1, type=MPI_DOUBLE_PRECISION,
op=MPI_SUM, comm=MPI_COMM_WORLD) failed
(unknown)(): Message truncated
 CP2K| Abnormal program termination, stopped by process number 3
Application 161757 exit codes: 1
Application 161757 resources: utime 0, stime 0

