[CP2K:2946] Re: libdbcsr, MPI error

Teodoro Laino teodor... at gmail.com
Thu Nov 25 15:16:29 UTC 2010


Roger,

I understand the frustration. I assume this is very similar to the problem Marco reported a few months ago.
It could be anything: from a possible bug in the code to a compiler or library issue.
My personal feeling (but it's only speculation!) is that the problem is in the mpich2 library (some bug or memory leak there).

Identifying the source is time-consuming, and debugging a job that fails ONLY after 100 hours (more than 4 days!) is simply not practical. Keep in mind that the normal queue limits in computer centers are 12 (exceptionally 24) hours.

So either you come up with a test case that fails faster (so that one does not have to wait 4 days before making another test) or you live with it: after 100 hours you can always restart your job, and you can count on it running for another 100 hours.

Best,
Teo

P.S.: One last thing: I don't like these lines:

> 			-L/gpfs/apps/MPICH2/mx/1.0.8p1..3/64/lib \
> 			-L/gpfs/apps/MPICH2/slurm/64/lib \


which are related to this line here:

> 			-I/gpfs/apps/MPICH2/mx/1.0.8p1..3/64/include

Why do you need the MPICH2/slurm path? (You should be able to delete it safely: it most likely will not solve your problem, but at least it will be clear what you are doing.)
Hopefully the compiler is smart enough to use the first path, but keep in mind that
mixing includes and libraries from different installations can create problems.
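
To make the point concrete, here is a rough sketch in Python (my own illustration, not part of CP2K or its build system) of the check I have in mind. It takes the -I and -L flags copied from the archfile quoted below, strips a trailing include, lib or lib64 directory, and reports any package under /gpfs/apps/ that is referenced from more than one installation. The function names and the assumption that every package lives directly under /gpfs/apps/ are mine.

```python
import posixpath
from collections import defaultdict

# Install root as it appears in the archfile in this thread (an assumption
# about this particular cluster, not a general convention).
APPS_ROOT = "/gpfs/apps/"


def install_prefixes(flags):
    """Group the -I/-L directories found under APPS_ROOT by package name."""
    prefixes = defaultdict(set)
    for flag in flags.split():
        if flag[:2] not in ("-I", "-L") or not flag[2:].startswith(APPS_ROOT):
            continue
        path = flag[2:].rstrip("/")
        # Drop a trailing include/lib/lib64 so that the -I and -L flags of
        # one installation map to the same prefix.
        if posixpath.basename(path) in ("include", "lib", "lib64"):
            path = posixpath.dirname(path)
        package = path[len(APPS_ROOT):].split("/")[0]
        prefixes[package].add(path)
    return prefixes


def mixed_installs(flags):
    """Return {package: sorted prefixes} for packages seen under >1 prefix."""
    return {
        pkg: sorted(paths)
        for pkg, paths in install_prefixes(flags).items()
        if len(paths) > 1
    }


# The -I and -L flags copied from the quoted archfile.
ARCH_FLAGS = """
-I/gpfs/apps/FFTW/3.2.1/64/include
-I/gpfs/apps/LIBINT/1.1.4/64/include
-I/gpfs/apps/MPICH2/mx/1.0.8p1..3/64/include
-L/gpfs/apps/LAPACK/3.2.1/64/lib
-L/gpfs/apps/SCALAPACK/1.8/mpich2/64
-L/gpfs/apps/FFTW/3.2.1/64/lib
-L/gpfs/apps/LIBINT/1.1.4/64/lib
-L/opt/ibmcmp/xlmass/5.0/lib64
-L/gpfs/apps/MPICH2/mx/1.0.8p1..3/64/lib
-L/gpfs/apps/MPICH2/slurm/64/lib
-L/opt/osshpc/mx/lib64
-L/usr/lib64
"""

if __name__ == "__main__":
    for pkg, paths in mixed_installs(ARCH_FLAGS).items():
        print(pkg, "is referenced from more than one installation:")
        for path in paths:
            print("   ", path)
```

With the flags above, only MPICH2 is reported, with the mx and slurm directories. This does not tell you which libmpich or libpmi actually gets linked; it only points at the places where two installations can be mixed.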

On Nov 25, 2010, at 3:53 PM, nadler wrote:

> The error still comes up after several attempts. Current version
> installed is 2.2.45. Furthermore, as mentioned in the first post:
> independently of the number of cpus chosen, the execution stops after
> a certain amount of cpu hours. In my case it is around 100 +/-10
> hours; the guy from the support team told me that the same happens to
> him after 163 +/-2 cpu hours, using the same input file I am using.
> Any ideas about what could be the problem? Below I give
> information about the clusters, compiler and current archfile.
> Thanks!
> 
> Compiler: IBM XL Fortran for Linux, V12.1
> 
> The information about the machines:
> First, 1036 nodes of eServer BladeCenter JS20, having 2 PPC cpus
> (2.2GHz) with 4GB RAM per node.
> Then, 168 nodes of eServer BladeCenter JS21, having 4 PPC cpus
> (2.3GHz) with 8GB RAM.
> Communication occurs via Myrinet.
> 
> The current archfile is:
> 
> PERL     = perl
> CC       = xlc
> CPP      = cpp
> FC       = xlf95_r -qsuffix=f=F
> LD       = xlf95_r
> AR       = ar -r
> DFLAGS   = -D__AIX -D__ESSL -D__FFTSG -D__FFTW3 -D__parallel -D__BLACS -D__SCALAPACK -D__LIBINT
> CPPFLAGS = -C $(DFLAGS) -P -traditional \
> 			-I/gpfs/apps/FFTW/3.2.1/64/include
> FCFLAGS  = -O2 -qstrict -q64 -qarch=ppc970 -qcache=auto -qmaxmem=-1 -qtune=ppc970 \
> 			-I/gpfs/apps/FFTW/3.2.1/64/include \
> 			-I/gpfs/apps/LIBINT/1.1.4/64/include \
> 			-I/gpfs/apps/MPICH2/mx/1.0.8p1..3/64/include
> FCFLAGS2 = -O0 -qstrict -q64 -qarch=ppc970 -qcache=auto -qmaxmem=-1 -qtune=ppc970 \
> 			-I/gpfs/apps/FFTW/3.2.1/64/include \
> 			-I/gpfs/apps/LIBINT/1.1.4/64/include \
> 			-I/gpfs/apps/MPICH2/mx/1.0.8p1..3/64/include
> LDFLAGS  = $(FCFLAGS) \
> 			-L/gpfs/apps/LAPACK/3.2.1/64/lib \
> 			-L/gpfs/apps/SCALAPACK/1.8/mpich2/64 \
> 			-L/gpfs/apps/FFTW/3.2.1/64/lib \
> 			-L/gpfs/apps/LIBINT/1.1.4/64/lib \
> 			-L/opt/ibmcmp/xlmass/5.0/lib64 \
> 			-L/gpfs/apps/MPICH2/mx/1.0.8p1..3/64/lib \
> 			-L/gpfs/apps/MPICH2/slurm/64/lib \
> 			-L/opt/osshpc/mx/lib64 \
> 			-L/usr/lib64
> LIBS     =  -lscalapack \
> 			/gpfs/apps/SCALAPACK/1.8/mpich2/64/blacs.a \
> 			-lmass_64 \
> 			-lmpich -lpmi -lmyriexpress -lpthread \
> 			-llapack -lessl -lfftw3f -lfftw3 -lint -lderiv
> 
> OBJECTS_ARCHITECTURE = machine_aix.o
> 
> ### To speed up compilation time ###
> pint_types.o: pint_types.F
> 	$(FC) -c $(FCFLAGS2) $<
> md_run.o: md_run.F
> 	$(FC) -c $(FCFLAGS2) $<
> kg_energy.o: kg_energy.F
> 	$(FC) -c $(FCFLAGS2) $<
> integrator.o: integrator.F
> 	$(FC) -c $(FCFLAGS2) $<
> geo_opt.o: geo_opt.F
> 	$(FC) -c $(FCFLAGS2) $<
> qmmm_init.o: qmmm_init.F
> 	$(FC) -c $(FCFLAGS2) $<
> cp2k_runs.o: cp2k_runs.F
> 	$(FC) -c $(FCFLAGS2) $<
> mc_ensembles.o: mc_ensembles.F
> 	$(FC) -c $(FCFLAGS2) $<
> ep_methods.o: ep_methods.F
> 	$(FC) -c $(FCFLAGS2) $<
> mc_ge_moves.o: mc_ge_moves.F
> 	$(FC) -c $(FCFLAGS2) $<
> force_env_methods.o: force_env_methods.F
> 	$(FC) -c $(FCFLAGS2) $<
> cp_lbfgs_optimizer_gopt.o: cp_lbfgs_optimizer_gopt.F
> 	$(FC) -c $(FCFLAGS2) $<
> mc_types.o: mc_types.F
> 	$(FC) -c $(FCFLAGS2) $<
> f77_interface.o: f77_interface.F
> 	$(FC) -c $(FCFLAGS2) $<
> mc_moves.o: mc_moves.F
> 	$(FC) -c $(FCFLAGS2) $<
> 