[CP2K:7570] terrible performance across infiniband

Ronald Cohen rco... at carnegiescience.edu
Mon Mar 21 22:05:34 UTC 2016


Neither my experience in general nor the CP2K web pages in particular suggest that is the case; please see the CP2K performance page. I am now sure the problem is that the OpenMPI build is not using the proper InfiniBand libraries or drivers.
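A quick way to confirm that is to check which byte-transfer-layer (BTL) components the Open MPI installation was built with, and to force InfiniBand explicitly. This is only a sketch, assuming Open MPI with ompi_info on the path; the cp2k.popt binary and input names are just this benchmark's, and the BTL component names (openib, sm) depend on the Open MPI version installed:

    # list the BTL components this Open MPI build knows about;
    # "openib" must appear for verbs/InfiniBand support
    ompi_info | grep btl

    # force InfiniBand (plus shared memory and self) so the run aborts
    # instead of silently falling back to TCP over ethernet
    mpirun --mca btl openib,self,sm -np 16 ./cp2k.popt H2O-64.inp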

Thank you!

Ron

Sent from my iPad

> On Mar 21, 2016, at 5:36 PM, Glen MacLachlan <mac... at gwu.edu> wrote:
> 
> It's hard to judge the performance when you set OMP_NUM_THREADS=1, because the overhead associated with OpenMP means that launching a single thread is almost always a performance killer: an OMP_NUM_THREADS=1 run never rivals a genuinely single-threaded one because of that overhead. Nobody sets OMP_NUM_THREADS=1 except when playing around; we never do that in production jobs. How about when you scale up to 4 or 8 threads?
> 
> Glen
> 
> P.S. I see you're in DC...so am I. I support CP2K for the chemists at GWU. Hope you aren't using Metro to get around the DMV :p
> 
>> On Mar 21, 2016 5:11 PM, "Cohen, Ronald" <rco... at carnegiescience.edu> wrote:
>> Yes, I am using hybrid mode, but even if I set OMP_NUM_THREADS=1, performance is terrible.
>> 
>> ---
>> Ronald Cohen
>> Geophysical Laboratory
>> Carnegie Institution
>> 5251 Broad Branch Rd., N.W.
>> Washington, D.C. 20015
>> rco... at carnegiescience.edu
>> office: 202-478-8937
>> skype: ronaldcohen
>> https://twitter.com/recohen3
>> https://www.linkedin.com/profile/view?id=163327727
>> 
>>> On Mon, Mar 21, 2016 at 5:04 PM, Glen MacLachlan <mac... at gwu.edu> wrote:
>>> Are you conflating MPI with OpenMP? OMP_NUM_THREADS sets the number of threads used by OpenMP, and OpenMP does not work in a distributed-memory environment unless you piggyback on MPI, which is hybrid use. I'm not sure CP2K has ever worked optimally in hybrid mode, or at least that is the impression I get from the comments in the source code.
>>> 
>>> As for MPI, are you sure your MPI stack was compiled with IB bindings? I had similar issues and the problem was that I wasn't actually using IB. If you can, disable eth and leave only IB and see what happens.
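>>> A rough sketch of that check, assuming Open MPI and the standard OFED tools (the hostname node2 is a placeholder):
>>> 
>>>     # exclude the TCP transport entirely, so the job fails rather than
>>>     # quietly running over ethernet
>>>     mpirun --mca btl ^tcp -np 16 ./cp2k.popt H2O-64.inp
>>> 
>>>     # confirm the IB port is Active and measure the raw link bandwidth
>>>     ibstat
>>>     ib_write_bw            # start the perftest server on one node
>>>     ib_write_bw node2      # run the client from the other node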
>>> 
>>> Glen
>>> 
>>>> On Mar 21, 2016 4:48 PM, "Ronald Cohen" <rco... at carnegiescience.edu> wrote:
>>>> On the dco machine deepcarbon I find decent single-node MPI performance, but running on the same number of processors across two nodes is terrible, even with the InfiniBand interconnect. This is the CP2K H2O-64 benchmark:
>>>> 
>>>> 
>>>>  
>>>> On 16 cores on 1 node: total time 530 seconds
>>>>  SUBROUTINE                       CALLS  ASD         SELF TIME        TOTAL TIME
>>>>                                 MAXIMUM       AVERAGE  MAXIMUM  AVERAGE  MAXIMUM
>>>>  CP2K                                 1  1.0    0.015    0.019  530.306  530.306
>>>>  -                                                                             -
>>>>  -                         MESSAGE PASSING PERFORMANCE                         -
>>>>  -                                                                             -
>>>>  -------------------------------------------------------------------------------
>>>> 
>>>>  ROUTINE             CALLS  TOT TIME [s]  AVE VOLUME [Bytes]  PERFORMANCE [MB/s]
>>>>  MP_Group                5         0.000
>>>>  MP_Bcast             4103         0.029              44140.             6191.05
>>>>  MP_Allreduce        21860         7.077                263.                0.81
>>>>  MP_Gather              62         0.008                320.                2.53
>>>>  MP_Sync                54         0.001
>>>>  MP_Alltoall         19407        26.839             648289.              468.77
>>>>  MP_ISendRecv        21600         0.091              94533.            22371.25
>>>>  MP_Wait            238786        50.545
>>>>  MP_comm_split          50         0.004
>>>>  MP_ISend            97572         0.741             239205.            31518.68
>>>>  MP_IRecv            97572         8.605             239170.             2711.98
>>>>  MP_Memory          167778        45.018
>>>>  -------------------------------------------------------------------------------
>>>> 
>>>> 
>>>> On 16 cores on 2 nodes: total time 5053 seconds!!
>>>> 
>>>> SUBROUTINE                       CALLS  ASD         SELF TIME        TOTAL TIME
>>>>                                 MAXIMUM       AVERAGE  MAXIMUM  AVERAGE  MAXIMUM
>>>>  CP2K                                 1  1.0    0.311    0.363 5052.904 5052.909
>>>> 
>>>> 
>>>> -------------------------------------------------------------------------------
>>>>  -                                                                             -
>>>>  -                         MESSAGE PASSING PERFORMANCE                         -
>>>>  -                                                                             -
>>>>  -------------------------------------------------------------------------------
>>>> 
>>>>  ROUTINE             CALLS  TOT TIME [s]  AVE VOLUME [Bytes]  PERFORMANCE [MB/s]
>>>>  MP_Group                5         0.000
>>>>  MP_Bcast             4119         0.258              43968.              700.70
>>>>  MP_Allreduce        21892      1546.186                263.                0.00
>>>>  MP_Gather              62         0.049                320.                0.40
>>>>  MP_Sync                54         0.071
>>>>  MP_Alltoall         19407      1507.024             648289.                8.35
>>>>  MP_ISendRecv        21600         0.104              94533.            19656.44
>>>>  MP_Wait            238786       513.507
>>>>  MP_comm_split          50         4.096
>>>>  MP_ISend            97572         1.102             239206.            21176.09
>>>>  MP_IRecv            97572         2.739             239171.             8520.75
>>>>  MP_Memory          167778        18.845
>>>>  -------------------------------------------------------------------------------
>>>> 
>>>> Any ideas? The code was built with the latest gfortran and I built all of the dependencies, using this arch file.
>>>> 
>>>> CC   = gcc
>>>> CPP  =
>>>> FC   = mpif90
>>>> LD   = mpif90
>>>> AR   = ar -r
>>>> PREFIX   = /home/rcohen
>>>> FFTW_INC   = $(PREFIX)/include
>>>> FFTW_LIB   = $(PREFIX)/lib
>>>> LIBINT_INC = $(PREFIX)/include
>>>> LIBINT_LIB = $(PREFIX)/lib
>>>> LIBXC_INC  = $(PREFIX)/include
>>>> LIBXC_LIB  = $(PREFIX)/lib
>>>> GCC_LIB = $(PREFIX)/gcc-trunk/lib
>>>> GCC_LIB64  = $(PREFIX)/gcc-trunk/lib64
>>>> GCC_INC = $(PREFIX)/gcc-trunk/include
>>>> DFLAGS  = -D__FFTW3 -D__LIBINT -D__LIBXC2\
>>>>     -D__LIBINT_MAX_AM=7 -D__LIBDERIV_MAX_AM1=6 -D__MAX_CONTR=4\
>>>>     -D__parallel -D__SCALAPACK -D__HAS_smm_dnn -D__ELPA3 
>>>> CPPFLAGS   =
>>>> FCFLAGS = $(DFLAGS) -O2 -ffast-math -ffree-form -ffree-line-length-none\
>>>>     -fopenmp -ftree-vectorize -funroll-loops\
>>>>     -mtune=native  \
>>>>      -I$(FFTW_INC) -I$(LIBINT_INC) -I$(LIBXC_INC) -I$(MKLROOT)/include \
>>>>      -I$(GCC_INC) -I$(PREFIX)/include/elpa_openmp-2015.11.001/modules
>>>> LIBS    =  \
>>>>     $(PREFIX)/lib/libscalapack.a $(PREFIX)/lib/libsmm_dnn_sandybridge-2015-11-10.a \
>>>>     $(FFTW_LIB)/libfftw3.a\
>>>>     $(FFTW_LIB)/libfftw3_threads.a\
>>>>     $(LIBXC_LIB)/libxcf90.a\
>>>>     $(LIBXC_LIB)/libxc.a\
>>>>     $(PREFIX)/lib/liblapack.a  $(PREFIX)/lib/libtmglib.a $(PREFIX)/lib/libgomp.a  \
>>>>     $(PREFIX)/lib/libderiv.a $(PREFIX)/lib/libint.a  -lelpa_openmp -lgomp -lopenblas
>>>> LDFLAGS = $(FCFLAGS)  -L$(GCC_LIB64) -L$(GCC_LIB) -static-libgfortran -L$(PREFIX)/lib 
>>>> 
>>>> The two-node run used OMP_NUM_THREADS=2 and the one-node run used OMP_NUM_THREADS=1. A two-node run with OMP_NUM_THREADS=1 is now underway.
>>>> 
>>>> I am checking whether OMP_NUM_THREADS=1 on two nodes is faster than OMP_NUM_THREADS=2, but I do not think it will be.
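>>>> For reference, the two runs being compared would be launched roughly like this (a sketch only; the mpirun flags assume Open MPI, 8 cores used per node, and the cp2k.psmp hybrid binary):
>>>> 
>>>>     # pure MPI: 16 ranks over 2 nodes, 1 OpenMP thread per rank
>>>>     export OMP_NUM_THREADS=1
>>>>     mpirun -np 16 -npernode 8 ./cp2k.psmp H2O-64.inp
>>>> 
>>>>     # hybrid: 8 ranks over 2 nodes, 2 OpenMP threads per rank
>>>>     export OMP_NUM_THREADS=2
>>>>     mpirun -np 8 -npernode 4 ./cp2k.psmp H2O-64.inp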
>>>> 
>>>> Ron Cohen
>>>> 
>>>> 
>>>> 
>>> 
>> 
> 