Sigsegv error during cell optimization
Maricarmen
Maricarme... at cemes.fr
Mon May 18 20:04:22 UTC 2009
Dear Teo,
I'm sorry but I think there has been a slight misunderstanding. You
see, the issue with the file numbering in CELL_OPT runs is posted in a
different thread, for it is completely independent from the issue
posted here. What I'm referring to here is the issue with the
segmentation fault signal that I've been getting in the latest
machines I've tried to run CP2K on.
The WALLTIME thing I mentioned was only to illustrate that the
application is actually dying, even though the job keeps going on. I
never meant that the flag itself doesn't work. Actually, I think it is a
very useful feature.
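
One way to notice this kind of silent hang before the whole allocation is spent is to check whether the output file is still being written. The following is only a minimal sketch, assuming a Linux node with GNU coreutils (`stat -c`, `touch -d`); the helper name `is_stale` and the 30-minute threshold are hypothetical choices, not anything provided by CP2K or PBS.

```shell
# is_stale FILE SECONDS
# Succeeds (exit 0) if FILE was last modified more than SECONDS ago.
# Assumption: GNU stat is available (stat -c %Y prints the mtime in epoch seconds).
is_stale() {
    stale_file="$1"
    stale_limit="$2"
    stale_now="$(date +%s)"
    stale_mtime="$(stat -c %Y "$stale_file")" || return 2
    [ $(( stale_now - stale_mtime )) -gt "$stale_limit" ]
}

# Demonstration on a scratch file whose timestamp is pushed two hours back.
demo_out="$(mktemp)"
touch -d '2 hours ago' "$demo_out"
if is_stale "$demo_out" 1800; then
    echo "output has not changed for more than 30 minutes"
fi
```

In a job script, such a check could be polled in a loop next to the running application, with the job ended by hand once the output goes stale; how the application is started and stopped depends on the local MPI setup.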
> The sad story is that I had to prepare this test by myself and I didn't
> see anything back from your side (apart from complaints): that's a very
> strange thing and I can tell you that it is very uncommon when somebody
> asks for help that the helper has to guess even how to reproduce the
> problem!!
I'm sorry again. As I just said, I wasn't talking about the numbering
issue. You hadn't received any answer from me because I was planning to do
as you suggested (create a test file) starting this week. You suggested
that last week and I don't normally work on weekends, so I was going
to get to it today. I'm sorry that you took time out of your day to do what
I was going to do shortly. I didn't mean for you to do
what I was supposed to do.
Also, I must say that I believe complaining is very different from
asking for help. I'm very sorry if I sounded like I was complaining,
that was never my intention. I thought this was a help forum and I was
only reaching out to see if someone that might have had the same
problem could know a possible solution. Actually, Juerg's idea is very
logical and that's what I'll do. It is difficult to make progress
if you have no control over the system's setup and you have to keep
asking others to change the configuration by trial and error in order to
try and see if your problem's been solved or not. If you add the
fact that they only give you limited calculation time, then the need
for an easy and/or quick solution is maximized. That's why I chose to
ask you first rather than my system's administrators (who had never
heard of CP2K before).
I guess next time I'll just ask them to try things out before posting
here. I just thought maybe I could save some time and effort.
>
> For your convenience, keep also in mind the same suggestions we always
> say: CP2K is not an easy code.
> Highly demanding in terms of compilers/libraries.
> If you have a very tight problem in terms of timing (I'm aware the world
> today is based on timing!) keep in mind
> that there are bunches of other codes.. much more well documented...
> more easy friendly.. that may possibly help you better
> to achieve your goals.
>
Yes, I know that. Thanks for the suggestion. But we've chosen CP2K for
a reason and we'll stick with it. I really value the work you all do,
and I really appreciated how attentive you were to us during the
Tutorial. I can tell that you really love what you do and that you
really want to contribute to the scientific community. That is
absolutely remarkable. I apologize if somehow I implied otherwise in
my post.
Best regards,
Maricarmen
> Regards,
> Teo
>
> Maricarmen wrote:
> > I guess I rushed in. It's NOT working. I'm just not getting the
> > sigsegv signal, but CP2K just dies (usually when starting the second
> > cellopt step, but in any case it is always when starting the SCF cycle),
> > no matter how big or small the system is. It starts fine, and then
> > after a few steps it hangs and stays there until the job is killed by
> > external signal due to limit time being reached. So now I'm spending
> > all my calculation time without doing any better than before.
> > May I add that I'm using the WALLTIME flag, but it is just not
> > working. As I said, the job is killed by MPI.
> > Pleeeease, could someone help me find out how to solve this? I'm not
> > just wasting my calculation time for the year, but real time to get
> > some useful results...
> > I wouldn't want to bother the administrators again without knowing
> > where the issue comes from. Should I tell them to try another
> > compiler??
>
> > Maricarmen
>
> > On 11 May, 09:31, Maricarme... at cemes.fr wrote:
>
> >> Ciao everyone,
>
> >> I wanted to let you know that we have apparently solved the problem.
> >> The machine administrators have recompiled the code with these
> >> settings:
>
> >> - Classical optimization (-O2 -g) for INTEL 11 compilers
> >> - SGI MPT 1.22 MPI library
> >> - Intel MKL and Intel FFTW libraries
>
> >> I have been testing it the whole weekend and it looks like it works
> >> again :)
> >> Thanks a lot for your help.
>
> >> Cheers,
>
> >> Maricarmen
>
> >> On 6 May, 16:49, Axel <akoh... at gmail.com> wrote:
>
> >>> ciao maricarmen,
>
> >>> On May 6, 9:24 am, Maricarme... at cemes.fr wrote:
>
> >>>> Thanks Teo,
>
> >>>> Actually the Intel fortran compiler is version 10.1.017. I can't find
> >>>> any comments on this particular version. I found something on 10.1.018
> >>>> though, and it seemed to work fine.
> >>>> In the machine there is also version 11.0.83, but I actually found
> >>>> some messages on the list reporting problems with the latest compilers
> >>>> (e.g. versions 11).
>
> >>> hard to say, but the fact that it is up to patch level 83 is somewhat
> >>> telling.
> >>> i'd try the 10.1 first.
>
> >>>> For the plain popt CP2K version I'll have to ask the administrators to
> >>>> recompile the code (they did it the first time), so I might as well
> >>>> ask them to use the newer compiler this time. Otherwise, do you think
> >>>> it's better to compile to the popt version with the same compiler
> >>>> (e.g. 10.1.017)?
>
> >>> i would suggest to first go a bit more conservative in optimization and
> >>> replace '-O3 -xS' with '-O2'. using a less aggressive optimization frequently
> >>> helps with intel compilers. since you seem to be on an itanium processor
> >>> machine, you'll be seeing more problems, though. those compilers are
> >>> generally lagging behind the x86 versions in reliability, independent of the
> >>> individual version.
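
The flag change suggested here can be tried on a copy of the arch file rather than by editing it in place. What follows is a minimal sketch under stated assumptions: the file name `example.psmp` is hypothetical, and the `FCFLAGS` line is an abbreviated version of the one in the arch file quoted further down in this thread.

```shell
# Sketch only: work in a scratch directory, never on the real arch file.
workdir="$(mktemp -d)"
arch="$workdir/example.psmp"   # hypothetical arch file name

# Abbreviated FCFLAGS line modelled on the arch file quoted in this thread.
# Single quotes keep $(DFLAGS) literal, as make expects it.
printf '%s\n' 'FCFLAGS = $(DFLAGS) -fpp -free -O3 -xS' > "$arch"

# Replace the aggressive '-O3 -xS' pair with the more conservative '-O2'.
sed 's/-O3 -xS/-O2/' "$arch" > "$arch.conservative"

new_flags="$(cat "$arch.conservative")"
echo "$new_flags"
```

Keeping both files side by side makes it easy to rebuild with either set of flags and compare the two executables on the same test input.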
>
> >>> if you look through the files in the arch directory, there are several entries
> >>> with exceptions for files that are better compiled without any optimizations
> >>> to work around too aggressive compilers. i'd try to collect all of them into
> >>> a special arch file in case you still are seeing problems.
>
> >>> finally, i'd have a closer look at the mpi manpage. on altix machines there
> >>> are a few environment variables that can affect the stability and performance
> >>> of parallel jobs. i remember having tinkered with that on a machine, but i have
> >>> currently no access to it, and forgot to transfer the job scripts before that.
>
> >>> cheers,
> >>> axel.
>
> >>>> Ciao,
>
> >>>> Maricarmen
>
> >>>> On 6 May, 09:56, Teodoro Laino <teodor... at gmail.com> wrote:
>
> >>>>> Hi Maricarmen,
>
> >>>>> could you try a plain popt version without the smp support?
> >>>>> Keep as well in the submission script ompthreads=1.
>
> >>>>> which version of intel compiler are you using? did you check on this
> >>>>> mailing list that it is a "good one"?
> >>>>> In case, do you have access to other compilers on that machine?
>
> >>>>> Teo
>
> >>>>> Maricarme... at cemes.fr wrote:
>
> >>>>>> Hello everyone,
>
> >>>>>> I'm running a DFT cell optimization for Mx-V4O11 crystals (M = Ag and
> >>>>>> Cu). My cells are approximately 14x7x7 and about 260 atoms. Below is a
> >>>>>> copy of one of my input files. The problem is I keep getting a SIGSEGV
> >>>>>> (11) error, usually when starting the SCF cycles for the second cell
> >>>>>> opt step (an extract from the output file is also below).
> >>>>>> I'm running in parallel at a computing center
> >>>>>> (http://www.cines.fr/spip.php?rubrique186), and the administrators have
> >>>>>> already checked the stack size (which according to them is set to
> >>>>>> unlimited). Below is also a copy of the job submission file, and of the
> >>>>>> arch file.
> >>>>>> I even tried to run a cell opt test for a smaller cell (14*3*3, about
> >>>>>> 68 atoms), which I had already run at a different computing center
> >>>>>> without any issues, and I still get the segmentation fault error.
> >>>>>> This clearly indicates to me that the problem is related to the
> >>>>>> configuration of the machines, to the way CP2K was installed, or to
> >>>>>> the job submission's characteristics (or to something else??). I must
> >>>>>> say I always get the exact same error during cell opt's second step,
> >>>>>> no matter what the system is (small or big cell, Ag or Cu).
> >>>>>> I tried running an Energy test on the smaller cell and it worked fine.
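
Because the stack size is mentioned above as already checked, it may still be worth printing the limit from inside the job script itself, since the batch shell does not necessarily inherit the limits of an interactive login. This is a minimal sketch; the function name `report_stack_limit` is hypothetical, and it only reports what the shell running the script sees, not what each MPI rank on other nodes sees.

```shell
# Print the soft and hard stack limits of the current shell.
# Values are in kilobytes, or the word "unlimited".
report_stack_limit() {
    soft_limit="$(ulimit -S -s)"
    hard_limit="$(ulimit -H -s)"
    echo "stack limit (soft/hard): ${soft_limit}/${hard_limit}"
}

report_stack_limit
```

Placed just before the `mpiexec` line, the output ends up in the job log next to the CP2K output, which makes it easy to compare runs on different machines.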
>
> >>>>>> I would really appreciate it if any of you could shed some light on this,
> >>>>>> for I'm pretty stuck on it right now.
>
> >>>>>> Cheers,
>
> >>>>>> Maricarmen.
>
> >>>>>> Arch file:
>
> >>>>>> # by default some intel compilers put temporaries on the stack
> >>>>>> # this might lead to segmentation faults if the stack limit is set too low
> >>>>>> # stack limits can be increased by sysadmins or e.g with ulimit -s 256000
> >>>>>> # Tested on a HPC non-Itanium clusters @ UDS (France)
> >>>>>> # Note: -O2 produces an executable which is slightly faster than -O3
> >>>>>> # and the compilation time was also much shorter.
> >>>>>> CC = icc -diag-disable remark
> >>>>>> CPP =
> >>>>>> FC = ifort -diag-disable remark -openmp
> >>>>>> LD = ifort -diag-disable remark -openmp
> >>>>>> AR = ar -r
>
> >>>>>> #Better with mkl (intel lapack/blas) only
> >>>>>> #DFLAGS = -D__INTEL -D__FFTSG -D__parallel
> >>>>>> #If you want to use BLACS and SCALAPACK use the flags below
> >>>>>> DFLAGS = -D__INTEL -D__FFTSG -D__parallel -D__BLACS -D__SCALAPACK -D__FFTW3
> >>>>>> CPPFLAGS =
> >>>>>> FCFLAGS = $(DFLAGS) -fpp -free -O3 -xS -I/opt/software/SGI/intel/mkl/10.0.3.020/include -I/opt/software/SGI/intel/mkl/10.0.3.020/include/fftw
> >>>>>> LDFLAGS = -L/opt/software/SGI/intel/mkl/10.0.3.020/lib/em64t
> >>>>>> #LIBS = -lmkl -lm -lpthread -lguide -openmp
> >>>>>> #If you want to use BLACS and SCALAPACK use the libraries below
> >>>>>> LIBS = -Wl,--allow-multiple-definition -lmkl_scalapack_lp64 /scratch/grisolia/blacsF77init_MPI-LINUX-0.a /scratch/grisolia/blacs_MPI-LINUX-0.a -lmpi -lmkl -lfftw3xf_intel -lmkl_blacs_lp64
>
> >>>>>> OBJECTS_ARCHITECTURE = machine_intel.o
>
> >>>>>> -------
>
> >>>>>> Job submission's file (getting the sigsegv error):
>
> >>>>>> #PBS -N cp2k
> >>>>>> #PBS -l walltime=24:00:00
> >>>>>> #PBS -S /bin/bash
> >>>>>> #PBS -l select=8:ncpus=8:mpiprocs=8:ompthreads=1
> >>>>>> #PBS -j oe
> >>>>>> #PBS -M gris... at cemes.fr -m abe
>
> >>>>>> PBS_O_WORKDIR=/scratch/grisolia/CuVO/Fixed/
>
> >>>>>> cd $PBS_O_WORKDIR
>
> >>>>>> export OMP_NUM_THREADS=1
> >>>>>> export MKL_NUM_THREADS=1
> >>>>>> export MPI_GROUP_MAX=512
>
> >>>>>> /usr/pbs/bin/mpiexec /scratch/grisolia/cp2k/exe/Linux-x86-64-jade/cp2k.psmp
>
> ...
>
> Attachment: bug_report_cell.tgz (3363 K)