mpich? problems on a linux cluster

Axel akoh... at gmail.com
Fri Dec 7 19:45:38 UTC 2007

Previous message (by thread): [CP2K:469] Re: mpich? problems on a linux cluster
Next message (by thread): [CP2K:471] Re: mpich? problems on a linux cluster

hi carlo,

On Dec 7, 11:10 am, "carlo antonio pignedoli" <c.pig... at gmail.com>
wrote:
> Dear Axel
> we are using the normal gigabit.

please check your scaling. my hunch is that you'll find
little or no speedup once the job has to go over the network.

it is also quite likely that you are overloading your network
(or the switch).
it depends a bit on your input, though.
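
a minimal sketch of what "checking the scaling" could look like: time
the same input on 1 and on N processes, then compare. the function name
scaling and its three arguments (time on 1 process, time on N processes,
N) are made up for illustration; they are not part of cp2k or of any
MPI package.

```shell
# scaling T1 TN N
#   T1 = wall time on 1 process, TN = wall time on N processes
#   prints the speedup (T1/TN) and the parallel efficiency (speedup/N)
scaling() {
    awk -v t1="$1" -v tn="$2" -v n="$3" \
        'BEGIN { s = t1 / tn; printf "speedup %.2f efficiency %.2f\n", s, s / n }'
}
```

an efficiency that drops sharply as soon as the processes are spread
over more than one node would point at the interconnect rather than at
the code.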

> I did the dmesg and... well I'm not an expert, I got something that
> looks like an error.

well, you should at least get a message indicating the segfault.
it is hard to tell without seeing the output. the last 10-20 lines
will probably be enough.
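
one way to grab just those lines, as a sketch: the helper name
kernel_tail is made up here, and on some systems reading the kernel
message buffer requires root, in which case dmesg only prints an error.

```shell
# print the last N lines (default 20) of the kernel message buffer
kernel_tail() {
    dmesg 2>&1 | tail -n "${1:-20}"
}

kernel_tail 20
```

run it on the node where the job failed, since each node has its own
kernel message buffer.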

> for the ulimit -a I have
>
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 38912
> max locked memory       (kbytes, -l) 32
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> stack size              (kbytes, -s) 8192

this is "small". it is thus possible that your
executable was running ok until the first completed
SCF and then needed some extra memory which was not
available from the stack.

> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 38912
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> what is your suggestion for a reasonable value.

at least 10 times as large. i usually set it to 1GB
or unlimited if i test on a machine where i have to
change this.
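
as a sketch, assuming a bourne-type shell (sh/bash), where the stack
limit is given in kbytes as in the listing above: 1GB is 1048576 kbytes.
raising the soft limit only works up to the hard limit, so on a locked
down machine this will fail and the sysadmins have to change the hard
limit.

```shell
# show the current soft and hard stack limits (kbytes or "unlimited")
echo "soft: $(ulimit -S -s)  hard: $(ulimit -H -s)"

# try to raise the soft limit to 1GB = 1048576 kbytes
if ulimit -S -s 1048576 2>/dev/null; then
    echo "stack limit now: $(ulimit -S -s)"
else
    echo "could not raise the stack limit (hard limit too low?)"
fi
```

the limit is per process and inherited by child processes, so it has to
be in effect in the shell that actually starts the executable on each
node. whether a setting made on the login node reaches the remote
processes depends on how the MPI launcher starts them.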

ciao,
   axel.

>
> Thanks a lot
>
> Carlo
>
> On Dec 7, 2007 4:38 PM, Axel <akoh... at gmail.com> wrote:
>
>
>
> > carlo,
>
> > one more thing that may be important: what interconnect
> > do you have and is it working correctly under high load?
>
> > cp2k is very demanding and i've run across multiple machines
> > (myrinet/infiniband) where the MPI runtime settings needed to
> > be tweaked to have the job run reliably. i suggest you log into
> > the failing node and have a look at the kernel message buffer
> > with "dmesg" and see if there is anything suspicious.
>
> > the second possible cause of segmentation faults with intel
> > compilers is the lack of sufficient stack size. for historical
> > reasons, the intel fortran frontend allocates temporary arrays
> > by default on the stack instead of the heap. please check your
> > cluster nodes for whether the stack segment is large enough
> > (ulimit -a), and have the sysadmins increase it if needed.
>
> > a second option is to reset the stack size from within cp2k, but
> > that requires some (ugly?) modifications of the code and they need
> > to be in c. i'll put an updated version of those into the files
> > section later.
>
> > the third option is to use the -heap-arrays flag, which is only
> > supported by intel compilers 10.0 and later.
>
> > hope that helps,
> >    axel.
>
> > On Dec 7, 7:59 am, "carlo antonio pignedoli" <c.pig... at gmail.com>
> > wrote:
> > > Ciao Teo,
>
> > > we are using the cmkl libraries
> > > intel clustertoolkit for linux
> > > version 9.1
