Why is my cp2k.popt running much slower than cp2k.sopt?
akoh... at gmail.com
Sun Jul 27 00:12:11 CEST 2008
On Jul 25, 2:45 pm, hawk2012 <hawk2... at gmail.com> wrote:
> I found the problem. The real problem is still due to improper
> compilation of libgoto.a. If I use the standard BLAS library to link
> cp2k, the resulting cp2k.popt runs without the multithreading
> problem and took only 86.68 seconds to finish the test job with 4
> CPUs. The cp2k.sopt linked with the standard BLAS library takes 164.28
> seconds to finish the same job. So cp2k.popt compiled with the
> standard BLAS library really does improve the calculation speed,
> although the parallel efficiency is only about 50%.
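(side note: which BLAS gets linked is controlled by the LIBS line of your
cp2k arch file; a minimal sketch, assuming the reference libraries live in
/usr/lib:

  LIBS = -L/usr/lib -llapack -lblas

paths and library names here are only examples and depend on your
installation.)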
> It seems that the environment variables OMP_NUM_THREADS and
> GOTO_NUM_THREADS only work for the serial cp2k compiled with libgoto.a. I
> set both OMP_NUM_THREADS and GOTO_NUM_THREADS equal to 1 and cp2k.sopt
> runs with 100% CPU usage for 1 process. However, running
> cp2k.popt still has problems. With 4 processes running a job, the
> CPU usage for some processes is much higher than 100% while for
> others it is lower than 100%. Obviously multithreading
> is not turned off, and this might be the reason why it takes a
you probably just have a crappy MPI implementation. i'd bet if you run

  mpirun -np 4 env OMP_NUM_THREADS=1 cp2k.popt -i input.inp -o output.out < /dev/null

you can work around it.
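the same trick should also cover GOTO_NUM_THREADS, i.e. something like

  mpirun -np 4 env OMP_NUM_THREADS=1 GOTO_NUM_THREADS=1 cp2k.popt -i input.inp -o output.out < /dev/null

env(1) passes every variable you list on to each cp2k process it starts.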
> much longer time to finish the job, since the system spent a lot of
> time synchronizing among all 4 processes.
> I will try to recompile libgoto.a to see if I can turn off
> multithreading.
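(for reference: with GotoBLAS, threading is usually chosen at build time
in Makefile.rule; a sketch, assuming a GotoBLAS source tree:

  # in Makefile.rule, disable the SMP setting, e.g.:
  # SMP = 1
  make clean && make

the exact variable name (SMP, USE_THREAD, ...) differs between libgoto
releases, so check your Makefile.rule.)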
> On Jul 25, 11:24 am, Axel <akoh... at gmail.com> wrote:
> > On Jul 25, 11:36 am, hawk2012 <hawk2... at gmail.com> wrote:
> > > Thank you for your help.
> > > The machine I used to test cp2k is an SMP machine with 8 cores running
> > > CentOS 5.0, so there is no network bottleneck and no TCP/IP
> > > connection latency. On the same machine I tested another parallel MD
> > depending on how you run your job and how you configured your MPI
> > library, you may still be using TCP/IP.
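(with MPICH1 the communication device is fixed at configure time, so a
TCP/IP (ch_p4) build will use sockets even on a single SMP box; a sketch,
assuming an MPICH1 source tree:

  ./configure -device=ch_shmem --prefix=/home/mpich.g95
  make && make install

the exact device and flag names differ between MPICH versions, so check
./configure --help.)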
> > > program (a simple MD program for a LJ potential) and found that the
> > > parallel efficiency is almost linear (it took 158 seconds to finish
> > > the test job with 4 CPUs while it only took 85 seconds to finish the
> > > same job).
> > i've posted benchmark numbers on CP2k performance on 4 cores a while
> > ago. so while you cannot get linear scaling because of the memory
> > bandwidth requirements of cp2k, the numbers demonstrate that it _is_
> > possible to get reasonably good scaling results. so the problem _has_
> > to be on the side of your low-level support (MPI library, machine
> > setup, compilers, how you start the parallel job).
> > how large a test job are you running? could you be simply
> > running out of memory and swapping heavily?
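(easy to check: run vmstat alongside the job, e.g.

  vmstat 5

nonzero values in the si/so (swap in/out) columns while cp2k is running
mean you are swapping.)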
> > check out the older discussions on scaling in the group and
> > try to reproduce those numbers.
> > axel.
> > > On Jul 24, 4:22 pm, Axel <akoh... at gmail.com> wrote:
> > > > > It seems that the MPI performance is really bad. It spent a lot of time
> > > > > calling MP_Allreduce and MP_Wait. For cp2k.sopt it took only 162
> > > > right, this is what is needed. a lot. and this is why cp2k needs
> > > > a very fast, low-latency network and a good MPI implementation.
> > > > > seconds to finish the job while cp2k.popt took 3010 seconds to finish the
> > > > > same job. There must be something wrong with the executable cp2k.popt since
> > > > > my other parallel executable can be run using the same /home/mpich.g95/
> > > > > bin/mpirun with normal performance. Any suggestions?
> > > > before discussing any cp2k related issues, you should first
> > > > check how well your MPI setup works _at all_. i suspect there
> > > > is a much lower-level problem than cp2k and its requirements.
> > > > most MPI packages come with some benchmark examples to measure
> > > > performance and latencies. i suggest trying those first to
> > > > check how well your setup works, and compare it to similar
> > > > setups. it would help a _lot_ if you gave a sufficiently detailed
> > > > account of your hardware when discussing performance. please
> > > > see earlier discussions on the subject.
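(for example, mpich ships the perftest/mpptest programs; a minimal run
would look something like

  mpirun -np 2 ./mpptest

names and locations depend on your MPI package; others ship osu_latency
or the intel MPI benchmarks instead, so look in its examples directory.)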
> > > > if collective operations and barriers are giving you problems,
> > > > then you may not be using your machine correctly or have not
> > > > set it up correctly. they also matter a lot when
> > > > using TCP/IP connections for parallel computation, which incur
> > > > large latency penalties due to the TCP/IP encoding. the fact
> > > > that you are using MPICH doesn't help either, since its
> > > > collectives, especially in version 1.x.x, are supposed to be
> > > > pretty inefficient.
> > > > what worries me even more is the fact that you seem to be
> > > > running your tests as root. this goes against about everything
> > > > i've learned during my career about good computer use practices.
> > > > basically, the root account should only be used if it cannot be avoided.
> > > > to give an example: on our own local clusters (where i do maintain
> > > > MPI, compilers, libraries and most applications including cp2k)
> > > > i don't even _know_ the root password (and don't _want_ to, since
> > > > this way it is close to impossible to mess up the machines by
> > > > accident or carelessness).
> > > > cheers,
> > > > axel.