Farming - Abnormal Termination
Daniele Bovi
daniel... at gmail.com
Mon Jul 15 10:19:45 UTC 2013
Dear CP2K community,
I'm trying to use the farming feature to run a large parallel job on the
CINECA Fermi cluster.
This is a typical input file:
##############################################
&GLOBAL
PROJECT farming-50
PROGRAM FARMING
RUN_TYPE NONE
WALLTIME 86000
&END GLOBAL
&FARMING
NGROUP 32
@INCLUDE JOBS/primi_50/jobs_HS14.inc
@INCLUDE JOBS/primi_50/jobs_BS8-AABA.inc
@INCLUDE JOBS/primi_50/jobs_BS8-ABAA.inc
@INCLUDE JOBS/primi_50/jobs_BS8-AAAB.inc
@INCLUDE JOBS/primi_50/jobs_BS6-BAAA.inc
@INCLUDE JOBS/primi_50/jobs_BS2-ABBA.inc
@INCLUDE JOBS/primi_50/jobs_BS2-AABB.inc
@INCLUDE JOBS/primi_50/jobs_BS2-ABAB.inc
&END FARMING
#############################################
Each include file (*.inc) contains 50 equivalent single-point calculations
(ENERGY). For example, jobs_HS14.inc begins with:
############################################
&JOB
DIRECTORY STEP_21001/HS14/
INPUT_FILE_NAME cp2k.inp
&END JOB
&JOB
DIRECTORY STEP_21011/HS14/
INPUT_FILE_NAME cp2k.inp
&END JOB
...
...
...
############################################
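For readers who want to produce include files of this shape, here is a minimal Python sketch. It only reproduces the &JOB layout shown above (DIRECTORY and INPUT_FILE_NAME); the helper names job_block and jobs_include are hypothetical, and the two directories are the ones quoted in the example, not the full list of 50.

```python
def job_block(directory, input_file="cp2k.inp"):
    """Return one &JOB section in the layout shown in the example above."""
    return "\n".join([
        "&JOB",
        f"DIRECTORY {directory}",
        f"INPUT_FILE_NAME {input_file}",
        "&END JOB",
    ])


def jobs_include(directories, input_file="cp2k.inp"):
    """Concatenate one &JOB section per directory into include-file text."""
    return "\n".join(job_block(d, input_file) for d in directories) + "\n"


# Only the two directories quoted in the post; the remaining ones are not listed there.
text = jobs_include(["STEP_21001/HS14/", "STEP_21011/HS14/"])
print(text, end="")
```

The resulting text could be written to a file such as jobs_HS14.inc and pulled in with @INCLUDE, as in the &FARMING section above.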
The total number of jobs is 50 x 8 = 400.
We request 1024 nodes x 2 MPI processes per node x 16 OpenMP threads per
process, and we are using the executable
/cineca/prod/applications/cp2k/2.3/bgq-xl--1.0/bin/cp2k.psmp
We split the farming run into 32 groups, so each job gets 1024 x 2 / 32 = 64
MPI processes x 16 threads on 32 nodes.
(The single-point inputs were tested beforehand, with the per-job resources
described above, in a small farming job and everything worked fine.)
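The resource arithmetic above can be checked with a short Python sketch. The function name farming_layout is hypothetical, and the number of rounds is only a rough estimate that assumes each group runs one job at a time and all jobs take about the same time.

```python
def farming_layout(nodes, ranks_per_node, ngroup, njobs):
    """Split an allocation evenly over farming groups (illustrative only)."""
    total_ranks = nodes * ranks_per_node
    if total_ranks % ngroup or nodes % ngroup:
        raise ValueError("allocation does not divide evenly into groups")
    return {
        "total_ranks": total_ranks,
        "ranks_per_group": total_ranks // ngroup,
        "nodes_per_group": nodes // ngroup,
        # Ceiling division: rough number of rounds if every group
        # processes one job at a time.
        "rounds": (njobs + ngroup - 1) // ngroup,
    }


# Numbers from the post: 1024 nodes, 2 MPI ranks per node, NGROUP 32, 400 jobs.
layout = farming_layout(nodes=1024, ranks_per_node=2, ngroup=32, njobs=400)
print(layout)
```

With these inputs each group gets 64 MPI ranks on 32 nodes, and the 400 jobs need about 13 rounds of 32 jobs.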
When we run the big farming job (400 jobs in 32 groups), MPI errors occur
after the first 32 jobs have been executed, and every calculation is abruptly
stopped.
This is the main output (after the correct preamble and the list of
job-to-group assignments):
########STDOUT###########################################################################################
Running Job 00001 in STEP_21001/BS8/AABA/. Done, output in OEC_QMMM_MA.out
Running Job 00033 in STEP_21321/BS8/AABA/. CP2K| condition FAILED at line 2152
CP2K| Abnormal program termination, stopped by process number 256
2013-07-06 05:29:38.027 (WARN ) [0x40001058b00] :372729:ibm.runjob.client.Job: terminated by signal 6
2013-07-06 05:29:38.027 (WARN ) [0x40001058b00] :372729:ibm.runjob.client.Job: abnormal termination by signal 6 from rank 256
2013-07-06 05:29:38.027 (WARN ) [0x40001058b00] :372729:ibm.runjob.client.Job: 59 RAS events
2013-07-06 05:29:38.027 (WARN ) [0x40001058b00] :372729:ibm.runjob.client.Job: most recent RAS event text: DDR Correctable Error Summary : count=10000 MCFIR error status: [MEMORY_CE] This bit is set when a memory CE is detected on a non-maintenance memory read op;
###########################################################################################################
########STDERR###########################################################################################
Abort(1) on node 256 (rank 256 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 256
###########################################################################################################
Could I have forgotten some crucial keywords in the input? Maybe something
concerning the memory allocation rules?
Thank you for your help,
Daniele