[CP2K:3937] Cray XE6 NAN Errors

Iain Bethune ibet... at epcc.ed.ac.uk
Thu Jul 26 19:54:54 UTC 2012


Hi,

I can try running this locally to check whether the problem reproduces on our XE6.  Could you please email me the full set of files needed to run the job, and tell me which version of CP2K you are using and how many MPI tasks and OpenMP threads you run with (the thread count only applies if you are using cp2k.psmp)?
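
Before sending the files, it may help to confirm exactly which step first goes bad by scanning the energy file for the first non-finite value. The following is a minimal Python sketch, not part of CP2K. It assumes the energy file (probably named something like watbox-1.ener for this project) is whitespace-separated, with the step number in the first column and header lines starting with "#". The function name first_nan_step is hypothetical.

```python
import math


def first_nan_step(lines):
    """Return (step, bad_columns) for the first data row that contains a
    non-finite or unparsable value, or None if every row is clean.

    Assumed layout: whitespace-separated columns, step number in column 0,
    header/comment lines starting with '#'.
    """
    for line in lines:
        fields = line.split()
        # Skip blank lines and header lines.
        if not fields or fields[0].startswith("#"):
            continue
        bad_columns = []
        for index, token in enumerate(fields):
            try:
                value = float(token)
            except ValueError:
                # Tokens that are not numbers at all (e.g. "*****") are
                # reported as bad too.
                bad_columns.append(index)
                continue
            if math.isnan(value) or math.isinf(value):
                bad_columns.append(index)
        if bad_columns:
            return fields[0], bad_columns
    return None
```

To use it, pass an open file handle, for example `first_nan_step(open("watbox-1.ener"))`. The returned column indices show which quantities went non-finite first, which can be compared with the step at which the job aborts.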

Cheers

- Iain

--

Iain Bethune
Applications Consultant, EPCC

Email: ibet... at epcc.ed.ac.uk
Twitter: @IainBethune
Tel/Fax: +44 (0)131 650 5201/6555
Mob: +44 (0)7598317015
Addr: 2404 JCMB, The King's Buildings, Mayfield Road, Edinburgh, EH9 3JZ

On 26 Jul 2012, at 13:55, DEC014 wrote:

> The job fails at Step 5251 with the NaN error as described before.  I've tried restarting from previous restart points, and each time it fails at the same point.  Below is my input file, minus the majority of the atom coordinates to save space (it's just a water box).  I'm hoping there's an easy fix, since getting the supercomputing center to recompile software is nearly impossible.
> 
> Input File:
> @SET CP2K_DATA /u/dec014/QS
> 
>  &GLOBAL
>   PROJECT watbox
>   PRINT_LEVEL LOW
>   PREFERRED_FFT_LIBRARY   FFTW
>   &TIMINGS
>      THRESHOLD 0.000001
>   &END
>   RUN_TYPE MD
>  &END GLOBAL
> 
>  &MOTION
>    &MD
>      ENSEMBLE NPT_F
>      STEPS 20000
>      TIMESTEP 1
>      TEMPERATURE 298.15
>     &THERMOSTAT
>       REGION MASSIVE
>       &NOSE
>         LENGTH 3
>         YOSHIDA 3
>         TIMECON 50
>         MTS 2
>       &END NOSE
>     &END
>     &BAROSTAT
>       PRESSURE 1.0
>       TIMECON 50
>       &THERMOSTAT
>         &NOSE
>           LENGTH 3
>           YOSHIDA 3
>           TIMECON 50
>           MTS 2
>         &END NOSE
>       &END THERMOSTAT
>     &END BAROSTAT
>    &END MD
>    &PRINT
>      &TRAJECTORY
>        &EACH
>         MD 20
>        &END EACH
>      &END TRAJECTORY
>      &VELOCITIES
>        &EACH
>         MD 20
>        &END EACH
>      &END VELOCITIES
>      &CELL
>        &EACH
>          MD 1
>        &END EACH
>      &END CELL
>      &STRESS
>        &EACH
>          MD 1
>        &END EACH
>      &END STRESS
>      &RESTART
>        FILENAME rst-md
>        &EACH
>          MD 250
>        &END EACH
>      &END RESTART
>    &END PRINT
>  &END MOTION
> 
>  &FORCE_EVAL
>    METHOD QS
>    STRESS_TENSOR ANALYTICAL
>    &DFT
>     BASIS_SET_FILE_NAME ${CP2K_DATA}/GTH_BASIS_SETS
>     POTENTIAL_FILE_NAME ${CP2K_DATA}/POTENTIAL
>     &MGRID
>       CUTOFF 400
>     &END MGRID
>     &QS
>       EPS_DEFAULT 1.0E-14
>       EXTRAPOLATION ASPC
>     &END QS
>     &SCF
>       SCF_GUESS ATOMIC
>       MAX_SCF 20
>       &OUTER_SCF
>         MAX_SCF 20
>       &END OUTER_SCF
>       &OT ON
>         MINIMIZER DIIS
>       &END OT
>         &PRINT
>           &RESTART OFF
>           &END RESTART
>         &END PRINT
>     &END SCF
>     &XC
>       &XC_FUNCTIONAL
>         &PBE
>          PARAMETRIZATION REVPBE
>         &END
>       &END XC_FUNCTIONAL
>       &VDW_POTENTIAL
>         POTENTIAL_TYPE PAIR_POTENTIAL
>         &PAIR_POTENTIAL
>           TYPE DFTD2
>           SCALING 1.0e0
>         &END PAIR_POTENTIAL
>       &END VDW_POTENTIAL
>     &END XC
>    &END DFT   
>    &SUBSYS
>      &KIND H
>        BASIS_SET DZVP-GTH
>        POTENTIAL GTH-PBE-q1
>      &END KIND
>      &KIND O
>        BASIS_SET DZVP-GTH
>        POTENTIAL GTH-PBE-q6
>      &END KIND
>      &CELL
>        ABC 26.6800 27.7260 25.9569
>        &CELL_REF
>          ABC 26.6800 27.7260 25.9569
>        &END CELL_REF
>      &END CELL
>      &COORD
> O           8.4042      7.96733      3.94052
> H          8.74996      6.96949      3.98698
> H          7.54856      7.98758      4.41086
> { .... more water box coordinates ....}
> O          22.2647      1.92667      12.3791
> H           22.876      1.27983      11.8524
> H          22.4165      1.57844      13.3125
>      &END COORD
>    &END SUBSYS
>  &END FORCE_EVAL
> 
> 
> 
> On Wednesday, July 25, 2012 3:22:50 PM UTC-4, IBethune wrote:
> Hi, 
> 
> I would be very surprised if a machine upgrade could cause the software to start producing numerical nonsense, however it is always a good idea to recompile the code after a new hardware or software upgrade to ensure you are getting good performance.  You should also update your code to a recent SVN version if possible, to pick up any relevant bug-fixes.  This *may* also help with the numerical troubles. 
> 
> Beyond that it's hard to say without seeing an input file and more specific detail of the problem. 
> 
> Cheers 
> 
> - Iain 
> 
> On 25 Jul 2012, at 17:34, DEC014 wrote: 
> 
> > I am running DFT MD simulations on a Cray XE6 machine.  They used to run perfectly fine; however, the machines underwent some upgrades.  Now, periodically and seemingly at random, the simulations run into NaN or MPI errors.  In the OUT file, the Barostat, Energy Drift, and Conserved Quantity fields produce NaN, and a corresponding NaN shows up in the ENER file.  I'm re-running a job that completed before on the same system to see if it's a job error or a system error.  I'm guessing the upgrades are the problem, but I'm curious whether anyone else is running into similar situations.
> >
> > If the upgrades are the problem, what will solve it?  Recompiling the software?
> > 
> > -- 
> > You received this message because you are subscribed to the Google Groups "cp2k" group. 
> > To view this discussion on the web visit https://groups.google.com/d/msg/cp2k/-/bG8mx_rJCgYJ. 
> > To post to this group, send email to cp... at googlegroups.com. 
> > To unsubscribe from this group, send email to cp2k+uns... at googlegroups.com. 
> > For more options, visit this group at http://groups.google.com/group/cp2k?hl=en. 
> 
> 
> -- 
> The University of Edinburgh is a charitable body, registered in 
> Scotland, with registration number SC005336. 
> 

