[CP2K:7568] terrible performance across infiniband

Cohen, Ronald rco... at carnegiescience.edu
Mon Mar 21 21:11:56 UTC 2016


Yes, I am using hybrid mode. But even if I set OMP_NUM_THREADS=1, performance
is terrible.
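
For what it is worth, the two message-passing tables quoted below can be
reduced to a per-call cost. This is a rough Python sketch, not part of CP2K:
the variable and function names are made up for illustration, and the only
inputs are the call counts and total times copied from those tables.

```python
# Call counts and total times [s] copied from the two
# MESSAGE PASSING PERFORMANCE tables quoted in this thread.
one_node = {
    "MP_Allreduce": (21860, 7.077),
    "MP_Alltoall": (19407, 26.839),
    "MP_Wait": (238786, 50.545),
}
two_nodes = {
    "MP_Allreduce": (21892, 1546.186),
    "MP_Alltoall": (19407, 1507.024),
    "MP_Wait": (238786, 513.507),
}
# Total CP2K run times [s] from the timing summaries.
total_one_node = 530.306
total_two_nodes = 5052.904


def per_call_ms(calls, seconds):
    """Average time per call in milliseconds."""
    return 1000.0 * seconds / calls


def summarize():
    """Map routine -> (ms/call on 1 node, ms/call on 2 nodes, total-time ratio)."""
    rows = {}
    for name, (calls_1, time_1) in one_node.items():
        calls_2, time_2 = two_nodes[name]
        rows[name] = (
            per_call_ms(calls_1, time_1),
            per_call_ms(calls_2, time_2),
            time_2 / time_1,
        )
    return rows


if __name__ == "__main__":
    print(f"overall slowdown: {total_two_nodes / total_one_node:.1f}x")
    for name, (ms_1, ms_2, ratio) in summarize().items():
        print(f"{name:14s} {ms_1:8.3f} ms/call -> {ms_2:8.3f} ms/call ({ratio:.0f}x)")
```

On these figures the two-node run is about 9.5 times slower overall, while
MP_Allreduce goes from roughly 0.3 ms to roughly 71 ms per call for messages
averaging 263 bytes. Calls that small are dominated by latency rather than
bandwidth, so the numbers suggest a problem with the interconnect path or
process placement rather than with the volume of data moved, though they
cannot say which.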

---
Ronald Cohen
Geophysical Laboratory
Carnegie Institution
5251 Broad Branch Rd., N.W.
Washington, D.C. 20015
rco... at carnegiescience.edu
office: 202-478-8937
skype: ronaldcohen
https://twitter.com/recohen3
https://www.linkedin.com/profile/view?id=163327727

On Mon, Mar 21, 2016 at 5:04 PM, Glen MacLachlan <mac... at gwu.edu> wrote:

> Are you conflating MPI with OpenMP? OMP_NUM_THREADS sets the number of
> threads used by OpenMP, and OpenMP doesn't work in a distributed-memory
> environment unless you piggyback on MPI, which would be hybrid use. I'm
> not sure CP2K ever worked optimally in hybrid mode, or at least that's what
> I've gathered from reading the comments in the source code.
>
> As for MPI, are you sure your MPI stack was compiled with IB bindings? I
> had similar issues and the problem was that I wasn't actually using IB. If
> you can, disable eth and leave only IB and see what happens.
>
> Glen
> On Mar 21, 2016 4:48 PM, "Ronald Cohen" <rco... at carnegiescience.edu>
> wrote:
>
>> On the dco machine deepcarbon I find decent single-node MPI performance,
>> but running on the same number of processors across two nodes is terrible,
>> even with the InfiniBand interconnect. This is the CP2K H2O-64 benchmark:
>>
>>
>>
>> On 16 cores on 1 node: total time 530 seconds
>>
>>  SUBROUTINE                       CALLS  ASD         SELF TIME        TOTAL TIME
>>                                 MAXIMUM       AVERAGE  MAXIMUM  AVERAGE  MAXIMUM
>>  CP2K                                 1  1.0    0.015    0.019  530.306  530.306
>>
>>  -------------------------------------------------------------------------------
>>  -                         MESSAGE PASSING PERFORMANCE                         -
>>  -------------------------------------------------------------------------------
>>
>>  ROUTINE             CALLS  TOT TIME [s]  AVE VOLUME [Bytes]  PERFORMANCE [MB/s]
>>  MP_Group                5         0.000
>>  MP_Bcast             4103         0.029              44140.             6191.05
>>  MP_Allreduce        21860         7.077                263.                0.81
>>  MP_Gather              62         0.008                320.                2.53
>>  MP_Sync                54         0.001
>>  MP_Alltoall         19407        26.839             648289.              468.77
>>  MP_ISendRecv        21600         0.091              94533.            22371.25
>>  MP_Wait            238786        50.545
>>  MP_comm_split          50         0.004
>>  MP_ISend            97572         0.741             239205.            31518.68
>>  MP_IRecv            97572         8.605             239170.             2711.98
>>  MP_Memory          167778        45.018
>>
>>  -------------------------------------------------------------------------------
>>
>>
>> On 16 cores on 2 nodes: total time 5053 seconds !!
>>
>>  SUBROUTINE                       CALLS  ASD         SELF TIME        TOTAL TIME
>>                                 MAXIMUM       AVERAGE  MAXIMUM  AVERAGE  MAXIMUM
>>  CP2K                                 1  1.0    0.311    0.363 5052.904 5052.909
>>
>>  -------------------------------------------------------------------------------
>>  -                         MESSAGE PASSING PERFORMANCE                         -
>>  -------------------------------------------------------------------------------
>>
>>  ROUTINE             CALLS  TOT TIME [s]  AVE VOLUME [Bytes]  PERFORMANCE [MB/s]
>>  MP_Group                5         0.000
>>  MP_Bcast             4119         0.258              43968.              700.70
>>  MP_Allreduce        21892      1546.186                263.                0.00
>>  MP_Gather              62         0.049                320.                0.40
>>  MP_Sync                54         0.071
>>  MP_Alltoall         19407      1507.024             648289.                8.35
>>  MP_ISendRecv        21600         0.104              94533.            19656.44
>>  MP_Wait            238786       513.507
>>  MP_comm_split          50         4.096
>>  MP_ISend            97572         1.102             239206.            21176.09
>>  MP_IRecv            97572         2.739             239171.             8520.75
>>  MP_Memory          167778        18.845
>>
>>  -------------------------------------------------------------------------------
>>
>> Any ideas? The code was built with the latest gfortran and I built all of
>> the dependencies, using this arch file.
>>
>> CC   = gcc
>> CPP  =
>> FC   = mpif90
>> LD   = mpif90
>> AR   = ar -r
>> PREFIX   = /home/rcohen
>> FFTW_INC   = $(PREFIX)/include
>> FFTW_LIB   = $(PREFIX)/lib
>> LIBINT_INC = $(PREFIX)/include
>> LIBINT_LIB = $(PREFIX)/lib
>> LIBXC_INC  = $(PREFIX)/include
>> LIBXC_LIB  = $(PREFIX)/lib
>> GCC_LIB = $(PREFIX)/gcc-trunk/lib
>> GCC_LIB64  = $(PREFIX)/gcc-trunk/lib64
>> GCC_INC = $(PREFIX)/gcc-trunk/include
>> DFLAGS  = -D__FFTW3 -D__LIBINT -D__LIBXC2\
>>     -D__LIBINT_MAX_AM=7 -D__LIBDERIV_MAX_AM1=6 -D__MAX_CONTR=4\
>>     -D__parallel -D__SCALAPACK -D__HAS_smm_dnn -D__ELPA3
>> CPPFLAGS   =
>> FCFLAGS = $(DFLAGS) -O2 -ffast-math -ffree-form -ffree-line-length-none\
>>     -fopenmp -ftree-vectorize -funroll-loops\
>>     -mtune=native  \
>>      -I$(FFTW_INC) -I$(LIBINT_INC) -I$(LIBXC_INC) -I$(MKLROOT)/include \
>>      -I$(GCC_INC) -I$(PREFIX)/include/elpa_openmp-2015.11.001/modules
>> LIBS    =  \
>>     $(PREFIX)/lib/libscalapack.a $(PREFIX)/lib/libsmm_dnn_sandybridge-2015-11-10.a \
>>     $(FFTW_LIB)/libfftw3.a\
>>     $(FFTW_LIB)/libfftw3_threads.a\
>>     $(LIBXC_LIB)/libxcf90.a\
>>     $(LIBXC_LIB)/libxc.a\
>>     $(PREFIX)/lib/liblapack.a  $(PREFIX)/lib/libtmglib.a $(PREFIX)/lib/libgomp.a  \
>>     $(PREFIX)/lib/libderiv.a $(PREFIX)/lib/libint.a  -lelpa_openmp -lgomp -lopenblas
>> LDFLAGS = $(FCFLAGS)  -L$(GCC_LIB64) -L$(GCC_LIB) -static-libgfortran -L$(PREFIX)/lib
>>
>> It was run with OMP_NUM_THREADS=2 on the two nodes and OMP_NUM_THREADS=1
>> on the one node.
>>
>> I am now checking whether OMP_NUM_THREADS=1 on two nodes is faster than
>> OMP_NUM_THREADS=2, but I do not think it will be.
>>
>> Ron Cohen
>>
>>
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "cp2k" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to cp2k+uns... at googlegroups.com.
>> To post to this group, send email to cp... at googlegroups.com.
>> Visit this group at https://groups.google.com/group/cp2k.
>> For more options, visit https://groups.google.com/d/optout.
>>