Why is my cp2k.popt running much slower than cp2k.sopt?

hawk2012 hawk2... at gmail.com
Fri Jul 25 18:45:21 UTC 2008


Hi,
I found the problem. The real cause is still the improper compilation
of libgoto.a. If I link cp2k against the standard BLAS library instead,
the resulting cp2k.popt runs without the multithreading problem and
takes only 86.68 seconds to finish the test job on 4 CPUs. The
cp2k.sopt linked against the standard BLAS library takes 164.28 seconds
for the same job. So cp2k.popt built with the standard BLAS library
really does improve the calculation speed, although the parallel
efficiency is only about 50%.
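For reference, the ~50% figure follows directly from the two wall times
quoted above; a quick arithmetic sketch (4 CPUs assumed, as in the test):

```shell
# Check the efficiency claim from the wall times quoted above
# (164.28 s serial cp2k.sopt, 86.68 s cp2k.popt on 4 CPUs).
awk 'BEGIN {
  speedup = 164.28 / 86.68
  printf "speedup = %.2fx, efficiency = %.0f%%\n", speedup, 100 * speedup / 4
}'
# prints: speedup = 1.90x, efficiency = 47%
```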

It seems that the environment variables OMP_NUM_THREADS and
GOTO_NUM_THREADS only work for the serial cp2k compiled with
libgoto.a. I set both OMP_NUM_THREADS and GOTO_NUM_THREADS to 1 and
cp2k.sopt runs with 100% CPU usage for one process. However, running
cp2k.popt still has a problem. With 4 processes running a job, the CPU
usage of some processes is much higher than 100% while for others it
is lower than 100%. Obviously the multithreading is not turned off,
and this might be the reason why the job takes so much longer to
finish, since the system spends a lot of time synchronizing among all
4 processes.
I will try to recompile libgoto.a to see if I can turn off the
multithreading.
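In case it helps anyone hitting the same symptom: GotoBLAS reads
GOTO_NUM_THREADS (and falls back to OMP_NUM_THREADS), so the usual
workaround is to export both before starting the MPI job. A minimal
sketch; note that whether mpirun actually forwards these variables to
every rank depends on the MPI implementation (MPICH1's mpirun often
does not), and the input file name below is just a placeholder:

```shell
# Pin the BLAS/OpenMP thread count to 1 per MPI rank, so the library's
# internal threading cannot fight with the 4 MPI processes for cores.
export OMP_NUM_THREADS=1
export GOTO_NUM_THREADS=1
echo "per-rank threads: OMP=$OMP_NUM_THREADS GOTO=$GOTO_NUM_THREADS"
# Launcher path taken from this thread; input.inp is a placeholder:
# /home/mpich.g95/bin/mpirun -np 4 ./cp2k.popt input.inp
```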

On Jul 25, 11:24 am, Axel <akoh... at gmail.com> wrote:
> On Jul 25, 11:36 am, hawk2012 <hawk2... at gmail.com> wrote:
>
> > Thank you for your help.
> > The machine I used to test cp2k is an SMP machine with 8 cores
> > running CentOS 5.0. So there is no network bottleneck and no TCP/IP
> > connection latency. On the same machine I tested another parallel MD
>
> depending on how you run your job and configured your MPI
> library, you may still be using TCP/IP.
>
> > program (a simple MD program for an LJ potential) and found that the
> > parallel efficiency is almost linear (it took 158 seconds to finish
> > the test job with 4 CPUs while it only took 85 seconds to finish the
> > same job).
>
> i've posted benchmark numbers on CP2K performance on 4 cores a while
> ago. so while you cannot get linear scaling because of the memory
> bandwidth requirements of cp2k, the numbers demonstrate that it _is_
> possible to get reasonably good scaling results. so the problem _has_
> to be on the side of your low-level support (MPI library, machine
> setup, compilers, how you start the parallel job).
>
> how large a test job are you running? could you simply be
> running out of memory and swapping heavily?
> check out the older discussions on scaling in the group and
> try to reproduce those numbers.
>
> axel.
>
> > On Jul 24, 4:22 pm, Axel <akoh... at gmail.com> wrote:
>
> > > > It seems that the MPI performance is really bad. It spent a lot of time
> > > > in calling MP_Allreduce and MP_Wait. For cp2k.sopt it took only 162
>
> > > right this is what is needed. a lot. and this is why cp2k needs
> > > a very fast and low latency network and a good MPI implementation.
>
> > > > seconds to finish the job while it took 3010 seconds to finish the
> > > > same job. There must be something wrong with the executable
> > > > cp2k.popt, since my other parallel executable can be run using the
> > > > same /home/mpich.g95/bin/mpirun with normal performance. Any suggestions?
>
> > > before discussing any cp2k related issues. you should first
> > > check how well your MPI setup works _at all_. i suspect there
> > > is a much lower lying problem than cp2k and its requirements.
>
> > > most MPI packages come with some benchmark examples to measure
> > > performance and latencies. i suggest trying those first to
> > > check how well your setup works, and comparing it to similar
> > > setups. it would help a _lot_ if you gave a sufficiently detailed
> > > account of your hardware when discussing performance. please
> > > see earlier discussions on the subject.
>
> > > if collective operations and barriers are giving you problems,
> > > then you may not be using your machine correctly or have not
> > > set it up correctly. they should also matter a lot when
> > > using TCP/IP connections for parallel computation, which incur
> > > large latency penalties due to the TCP/IP encoding. the fact
> > > that you are using MPICH doesn't help either, since its
> > > collectives, especially in version 1.x.x, are supposed to be
> > > pretty inefficient.
>
> > > what worries me even more is the fact that you seem to be
> > > running your tests as root. this goes against about everything
> > > i've learned during my career about good computer use practices.
>
> > > basically, the root account should only be used if it cannot be avoided.
> > > to give an example: on our own local clusters (where i do maintain
> > > MPI, compilers, libraries and most applications including cp2k)
> > > i don't even _know_ the root password (and don't _want_ to, since
> > > this way, it is close to impossible to mess up the machines by
> > > accident or carelessness).
>
> > > cheers,
> > >     axel.


More information about the CP2K-user mailing list