Farming - Abnormal Termination

Daniele Bovi daniel... at gmail.com
Mon Jul 15 10:19:45 UTC 2013


Dear CP2K community,

I'm trying to use the farming feature to run a large parallel job on the 
CINECA Fermi cluster.

This is a typical input file:
##############################################
&GLOBAL
  PROJECT farming-50
  PROGRAM FARMING
  RUN_TYPE NONE
  WALLTIME 86000
&END GLOBAL
&FARMING
  NGROUP 32
@INCLUDE JOBS/primi_50/jobs_HS14.inc
@INCLUDE JOBS/primi_50/jobs_BS8-AABA.inc
@INCLUDE JOBS/primi_50/jobs_BS8-ABAA.inc
@INCLUDE JOBS/primi_50/jobs_BS8-AAAB.inc
@INCLUDE JOBS/primi_50/jobs_BS6-BAAA.inc
@INCLUDE JOBS/primi_50/jobs_BS2-ABBA.inc
@INCLUDE JOBS/primi_50/jobs_BS2-AABB.inc
@INCLUDE JOBS/primi_50/jobs_BS2-ABAB.inc
&END FARMING
#############################################

Each include file (*.inc) contains 50 equivalent single-point calculations 
(ENERGY). For example, jobs_HS14.inc begins as follows:
############################################
  &JOB
    DIRECTORY STEP_21001/HS14/
    INPUT_FILE_NAME cp2k.inp
  &END JOB
  &JOB
    DIRECTORY STEP_21011/HS14/
    INPUT_FILE_NAME cp2k.inp
  &END JOB
    ...
    ...
    ...
############################################
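
An include file with this layout can be generated rather than written by hand. The following is a minimal Python sketch, not part of the original setup: the helper names (job_block, build_include) are hypothetical, and only the two step directories shown above are used, since the remaining ones are elided in the listing.

```python
def job_block(directory, input_file="cp2k.inp"):
    """Return one &JOB ... &END JOB section in the layout shown above."""
    return (
        "  &JOB\n"
        f"    DIRECTORY {directory}\n"
        f"    INPUT_FILE_NAME {input_file}\n"
        "  &END JOB\n"
    )


def build_include(step_dirs, state, input_file="cp2k.inp"):
    """Concatenate one &JOB section per step directory for a given state."""
    return "".join(
        job_block(f"{step}/{state}/", input_file) for step in step_dirs
    )


# Only the two directories visible in the excerpt; the rest of the list
# would have to be supplied by the user.
text = build_include(["STEP_21001", "STEP_21011"], "HS14")
print(text, end="")
```

The resulting string can be written to a file such as jobs_HS14.inc and pulled in with @INCLUDE as in the farming input above.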

The total is 50 x 8 = 400 jobs.

We request 1024 nodes x 2 MPI processes per node x 16 OpenMP threads per 
process, and we are using the executable 
/cineca/prod/applications/cp2k/2.3/bgq-xl--1.0/bin/cp2k.psmp

We split the farm into 32 groups, so each job runs on 1024 x 2 / 32 = 64 MPI 
processes x 16 threads, i.e. on 32 nodes.
(The single-point inputs, with the computational resources described above, 
were tested beforehand in a small farming job and everything worked fine.)
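
The resource arithmetic above can be checked with a short Python sketch. It assumes that the groups are of equal size and that jobs are processed in rounds of one job per group; the function name farming_layout and the notion of "rounds" are introduced here for illustration and are not CP2K terminology.

```python
import math


def farming_layout(nodes, ranks_per_node, threads_per_rank, ngroup, njobs):
    """Split the allocation evenly over ngroup groups (assumed equal-sized)."""
    total_ranks = nodes * ranks_per_node
    if total_ranks % ngroup or nodes % ngroup:
        raise ValueError("allocation does not divide evenly over the groups")
    rounds = math.ceil(njobs / ngroup)
    return {
        "total_ranks": total_ranks,
        "ranks_per_group": total_ranks // ngroup,
        "nodes_per_group": nodes // ngroup,
        "threads_per_rank": threads_per_rank,
        # Number of passes needed if every group runs one job per pass.
        "rounds": rounds,
        "jobs_in_last_round": njobs - (rounds - 1) * ngroup,
    }


layout = farming_layout(nodes=1024, ranks_per_node=2, threads_per_rank=16,
                        ngroup=32, njobs=400)
for key, value in layout.items():
    print(f"{key}: {value}")
```

With the numbers from this post, the sketch gives 64 MPI processes and 32 nodes per group; under the one-job-per-group-per-round assumption, 400 jobs would need 13 rounds, the last one only half full.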

When we run the big farming job (400 jobs in 32 groups), MPI errors occur 
after the first round of 32 jobs has been executed, and every calculation is 
aborted.
This is the main output (after the correct preamble and the list of jobs 
assigned to the groups):
########STDOUT###########################################################################################
  Running Job 00001 in STEP_21001/BS8/AABA/. Done, output in OEC_QMMM_MA.out
  Running Job 00033 in STEP_21321/BS8/AABA/. CP2K| condition FAILED at line 2152
 CP2K| Abnormal program termination, stopped by process number 256
2013-07-06 05:29:38.027 (WARN ) [0x40001058b00] :372729:ibm.runjob.client.Job: terminated by signal 6
2013-07-06 05:29:38.027 (WARN ) [0x40001058b00] :372729:ibm.runjob.client.Job: abnormal termination by signal 6 from rank 256
2013-07-06 05:29:38.027 (WARN ) [0x40001058b00] :372729:ibm.runjob.client.Job: 59 RAS events
2013-07-06 05:29:38.027 (WARN ) [0x40001058b00] :372729:ibm.runjob.client.Job: most recent RAS event text: DDR Correctable Error Summary : count=10000 MCFIR error status:  [MEMORY_CE] This bit is set when a memory CE is detected on a non-maintenance memory read op;
###########################################################################################################

########STDERR###########################################################################################
Abort(1) on node 256 (rank 256 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 256
###########################################################################################################
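
To narrow down which group the aborting process belongs to, the global rank can be mapped to a group index. The sketch below is only an illustration: it assumes that ranks are assigned to groups in contiguous blocks of 64, which the output above does not confirm, and the helper name locate_rank is hypothetical.

```python
def locate_rank(global_rank, ranks_per_group):
    """Return (group_index, rank_within_group), both counted from zero.

    Assumes contiguous, equal-sized blocks of ranks per group; a different
    rank-to-group mapping would give a different answer.
    """
    return divmod(global_rank, ranks_per_group)


group_index, local_rank = locate_rank(256, 64)
print(f"group index {group_index}, local rank {local_rank}")
```

Under that assumption, rank 256 would be the first process of the fifth group (index 4 counting from zero).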

Could I have forgotten some crucial keywords in the input? Maybe something 
concerning the memory allocation rules?

Thank you for your help,
Daniele
