Why is my cp2k.popt running much slower than cp2k.sopt?

hawk2012 hawk2... at gmail.com
Thu Jul 24 20:26:26 UTC 2008


Hi,
The attached file log1omp is the output from running cp2k.sopt with
OMP_NUM_THREADS=1, while the attached file log4omp is the output from
running cp2k.popt with OMP_NUM_THREADS=1 on 4 CPUs. With
OMP_NUM_THREADS=1 set, cp2k.sopt runs normally: a single process at
100% CPU usage, and the CPU time equals the wall time. However,
cp2k.popt still behaves strangely, with more than 100% CPU usage for
each of the four processes shown by 'top', and it takes much longer to
finish the job than cp2k.sopt does. By monitoring the test job in real
time I found that the actual computation time per optimization step is
almost the same for cp2k.sopt and cp2k.popt; what cp2k.popt spends a
suspicious amount of time on is job initialization and finalization.
It takes a very long time for cp2k.popt to print the message passing
performance, reference and timing sections after the following lines
in the output file:
  Total energy:                                      -17.20960460111594

  outer SCF iter =    1 RMS gradient =   0.13E-05 energy =     -17.2096046011
  outer SCF loop converged in   1 iterations or   20 steps


 ENERGY| Total FORCE_EVAL ( QS ) energy (a.u.):      -17.209604601164045
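
As a cross-check on the 'top' numbers, I also counted how many threads
each rank actually spawns (just a quick sketch, assuming GNU ps from
procps on a Linux node; the ranks are matched by command name):

  # show PID, thread count (NLWP) and CPU usage for each cp2k.popt rank
  ps -C cp2k.popt -o pid,nlwp,pcpu,args

If NLWP is larger than 1 for every rank, then some threaded library is
presumably still active underneath, which would match the more than
100% CPU usage I see in 'top'.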

The MPI performance information that was eventually printed is shown below:
 -------------------------------------------------------------------------------
 -                         MESSAGE PASSING PERFORMANCE                         -
 -------------------------------------------------------------------------------

 ROUTINE             CALLS  TOT TIME [s]  AVE VOLUME [Bytes]  PERFORMANCE [MB/s]
 MP_Group                5         0.000
 MP_Bcast              158         0.003                119.                6.27
 MP_Allreduce          486      2372.141                123.                0.00
 MP_Gather              68        60.748
 MP_Alltoall           884         0.686             350347.              451.20
 MP_ISendRecv         1356         0.037              38736.             1423.46
 MP_Wait               996       535.073

It seems that the MPI performance is really bad: a lot of time is
spent in MP_Allreduce and MP_Wait. cp2k.sopt took only 162 seconds to
finish the job, while cp2k.popt took 3010 seconds for the same job.
There must be something wrong with the executable cp2k.popt, since my
other parallel executables run with normal performance using the same
/home/mpich.g95/bin/mpirun. Any suggestions?
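
One thing I am not certain about is whether OMP_NUM_THREADS=1 really
reaches all four ranks, since I only export it in the interactive shell
before calling mpirun. If that could be the problem, I could launch
through a small wrapper along these lines (only a sketch: the script
name, the cp2k.popt path and the input file name are placeholders, and
GOTO_NUM_THREADS is the GotoBLAS-specific variable that its FAQ
apparently provides in addition to OMP_NUM_THREADS):

  #!/bin/sh
  # run_cp2k.sh -- wrapper so every MPI rank inherits the thread settings,
  # in case mpirun does not forward my interactive environment
  export OMP_NUM_THREADS=1     # keep OpenMP-threaded libraries to one thread
  export GOTO_NUM_THREADS=1    # GotoBLAS-specific thread count override
  exec /path/to/cp2k.popt "$@"

and then run it as

  /home/mpich.g95/bin/mpirun -np 4 ./run_cp2k.sh test.inp

Would that be a reasonable approach, or is it better to rebuild libgoto
without threading altogether?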

On Jul 21, 3:41 pm, Axel <akoh... at gmail.com> wrote:
> On Jul 21, 2:11 pm, Axel <akoh... at gmail.com> wrote:
>
> > On Jul 19, 11:50 pm, hawk2012 <hawk2... at gmail.com> wrote:
>
> > > No, I did not use Intel MKL library to link the executable cp2k.popt.
>
> > but you use a threaded GOTO and the timings that you present
> > only make sense for multi-threaded use (cpu time higher than wall time).
> > so please check how GOTO controls threading or link with a
> > non-threaded BLAS. in MKL you have to set OMP_NUM_THREADS=1 on all nodes.
>
> to follow up my own response: out of curiosity i looked up the
> goto blas FAQ and indeed it has the same (stupid IMNSHO) default
> behavior of threading across all available local CPUs.
>
> so people on multi-core machines or altixen, beware of the threaded
> BLASes and set OMP_NUM_THREADS=1 by default in your environment.
>
> cheers,
>    axel.
>
>
>
> > cheers,
> >    axel.
>
> > > The libraries I used are listed in my Linux-x86-64-g95.popt file:
> > > CC       = cc
> > > CPP      =
> > > FC       = /home/mpich.g95/bin/mpif90
> > > LD       = /home/mpich.g95/bin/mpif90
> > > AR       = ar -r
> > > DFLAGS   = -D__G95 -D__FFTSG -D__parallel -D__BLACS -D__SCALAPACK -D__FFTW3 -D__LIBINT
> > > CPPFLAGS =
> > > FCFLAGS  = $(DFLAGS) -ffree-form -O2 -ffast-math -march=opteron -cpp -g
> > > LDFLAGS  = $(FCFLAGS)
> > > LIBS     = /home/scalapack/scalapack-1.8.0/libscalapack.a \
> > >            /home/BLACS/LIB/blacsF77init_MPI-LINUX-0.a \
> > >            /home/BLACS/LIB/blacs_MPI-LINUX-0.a \
> > >            /home/BLACS/LIB/blacsCinit_MPI-LINUX-0.a \
> > >            /home/lapack-3.1.1/lapack_LINUX.a \
> > >            /home/GotoBLAS/libgoto.a \
> > >            /home/fftw/lib/libfftw3.a \
> > >            /home/libint/lib/libderiv.a \
> > >            /home/libint/lib/libint.a \
> > >            /usr/lib64/libstdc++.so.6 -lpthread
>
> > > OBJECTS_ARCHITECTURE = machine_g95.o
>
> > > On Jul 19, 3:03 pm, Axel <akoh... at gmail.com> wrote:
>
> > > > On Jul 19, 3:34 pm, hawk2012 <hawk2... at gmail.com> wrote:
>
> > > > > Dear All:
>
> > > > > With the help from this discussion group I successfully compiled both
> > > > > serial and parallel executables of cp2k with g95 compiler and
> > > > > mpich1.2.6.
>
> > > > > However, with the same input file I found that it takes much longer
> > > > > to run cp2k.popt with 4 CPUs than to run cp2k.sopt with 1 CPU.
> > > > > Attached file log.sopt is the output file for cp2k.sopt with 1 CPU
> > > > > while log.popt-4CPUs is the output file for cp2k.popt with 4 CPUs.
> > > > > From the output file log.popt-4CPUs it looks like the job really is
> > > > > running in parallel on 4 CPUs: 4 process numbers are shown, the total
> > > > > number of message passing processes is 4, and it is decomposed as 2x2
> > > > > (Number of processor rows 2, Number of processor cols 2). When I ran
> > > > > 'top', I indeed saw four cp2k.popt processes running.
>
> > > > > It is very strange. Is this due to the particular input file I used,
> > > > > or to something else? Could anyone take a look at these two output
> > > > > files and tell me what the possible reason is?
>
> > > > you are using MKL version 10.0 or later, right?
>
> > > > have a look at the summary of CPU time and ELAPSED time.
> > > > in your "serial" calculation, the CPU time is almost 4 times
> > > > your elapsed time. this usually happens when MKL is used
> > > > in multi-threaded mode (you are running on a quad-core node or
> > > > a two-way dual-core node, right?). since version 10 MKL multi-threads
> > > > by default across all available cpus. now if you switch to MPI,
> > > > MKL does not know that, and thus with -np 4 you are _still_ running
> > > > with 4 threads per MPI task, i.e. 16 threads altogether. that clogs
> > > > up your memory bus and drags down your performance.
>
> > > > add to that that a serial executable is a bit faster due to the lack
> > > > of parallel overhead, plus the fact that SMP performance of MPICH-1
> > > > is suboptimal, and your experience is completely understandable.
>
> > > > please read the MKL documentation and either set OMP_NUM_THREADS=1
> > > > in your environment or link with the sequential mkl libraries
> > > > explicitly.
>
> > > > this has been discussed in this group before. please check the
> > > > archives.
>
> > > > cheers,
> > > >    axel.

