Dear CP2K community,<div><br></div><div>I'm trying to use the farming feature to run a big parallel job on the CINECA Fermi cluster.</div><div><br></div><div>This is a typical input file:</div><div><div>##############################################</div><div>&GLOBAL</div><div> PROJECT farming-50</div><div> PROGRAM FARMING</div><div> RUN_TYPE NONE</div><div> WALLTIME 86000</div><div>&END GLOBAL</div><div>&FARMING</div><div> NGROUP 32</div><div>@INCLUDE JOBS/primi_50/jobs_HS14.inc</div><div>@INCLUDE JOBS/primi_50/jobs_BS8-AABA.inc</div><div>@INCLUDE JOBS/primi_50/jobs_BS8-ABAA.inc</div><div>@INCLUDE JOBS/primi_50/jobs_BS8-AAAB.inc</div><div>@INCLUDE JOBS/primi_50/jobs_BS6-BAAA.inc</div><div>@INCLUDE JOBS/primi_50/jobs_BS2-ABBA.inc</div><div>@INCLUDE JOBS/primi_50/jobs_BS2-AABB.inc</div><div>@INCLUDE JOBS/primi_50/jobs_BS2-ABAB.inc</div><div>&END FARMING<br></div><div>#############################################</div></div><div><br></div><div>Each include file (*.inc) contains 50 equivalent single-point calculations (ENERGY). 
For example, jobs_HS14.inc begins as follows:</div><div>############################################</div><div><div> &JOB</div><div> DIRECTORY STEP_21001/HS14/</div><div> INPUT_FILE_NAME cp2k.inp</div><div> &END JOB</div><div> &JOB</div><div> DIRECTORY STEP_21011/HS14/</div><div> INPUT_FILE_NAME cp2k.inp</div><div> &END JOB</div></div><div> ...</div><div> ...</div><div> ...</div><div>############################################</div><div><br></div><div>The total is 50 x 8 = 400 jobs.</div><div><br></div><div>We request 1024 nodes x 2 MPI processes per node x 16 OpenMP threads,</div><div>and we are using the executable </div><div>/cineca/prod/applications/cp2k/2.3/bgq-xl--1.0/bin/cp2k.psmp<br></div><div><br></div><div>We split the farming run into 32 groups, so each job gets 1024 x 2 / 32 = 64 MPI processes x 16 threads on 32 nodes.</div><div>(The single-point inputs were tested beforehand with the computational resources described above in a small farming job, and everything worked fine.)</div><div><br></div><div>When we run the big farming job (400 jobs in 32 groups), MPI errors occur after the first round of 32 jobs and every calculation is abruptly stopped.</div><div>This is the main output (after the correct preamble and the list and assignment of the jobs to the groups):</div><div>########STDOUT###########################################################################################</div><div><div> Running Job 00001 in STEP_21001/BS8/AABA/. Done, output in OEC_QMMM_MA.out</div><div> Running Job 00033 in STEP_21321/BS8/AABA/. 
CP2K| condition FAILED at line 2152</div><div> CP2K| Abnormal program termination, stopped by process number 256</div><div>2013-07-06 05:29:38.027 (WARN ) [0x40001058b00] :372729:ibm.runjob.client.Job: terminated by signal 6</div><div>2013-07-06 05:29:38.027 (WARN ) [0x40001058b00] :372729:ibm.runjob.client.Job: abnormal termination by signal 6 from rank 256</div><div>2013-07-06 05:29:38.027 (WARN ) [0x40001058b00] :372729:ibm.runjob.client.Job: 59 RAS events</div><div>2013-07-06 05:29:38.027 (WARN ) [0x40001058b00] :372729:ibm.runjob.client.Job: most recent RAS event text: DDR Correctable Error Summary : count=10000 MCFIR error status: [MEMORY_CE] This bit is set when a memory CE is detected on a non-maintenance memory read op;</div></div><div>###########################################################################################################</div><div><br></div><div>########STDERR###########################################################################################</div><div><div>Abort(1) on node 256 (rank 256 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 256</div></div><div>###########################################################################################################</div><div><br></div><div>Could I have forgotten some crucial keywords in the input? Maybe something concerning the memory allocation rules?</div><div><br></div><div>Thank you for your help,</div><div>Daniele</div><div><br></div>