[CP2K:7567] terrible performance across infiniband
Glen MacLachlan
mac... at gwu.edu
Mon Mar 21 21:04:00 UTC 2016
Are you conflating MPI with OpenMP? OMP_NUM_THREADS sets the number of
threads used by OpenMP, and OpenMP doesn't work in a distributed-memory
environment unless you piggyback on MPI, which would be hybrid use. I'm
not sure CP2K ever worked optimally in hybrid mode, or at least that's
the impression I've gotten from reading the comments in the source code.
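To make the distinction concrete, here is a minimal sketch for the H2O-64
benchmark, assuming an Open MPI-style mpirun and the standard CP2K binary
names (cp2k.popt for MPI-only, cp2k.psmp for hybrid MPI+OpenMP); adjust
paths and names for your build:

  # pure MPI: 16 ranks, one thread each (OpenMP plays no role here)
  export OMP_NUM_THREADS=1
  mpirun -np 16 cp2k.popt -i H2O-64.inp -o H2O-64.out

  # hybrid MPI+OpenMP: 8 ranks x 2 threads = the same 16 cores
  export OMP_NUM_THREADS=2
  mpirun -np 8 cp2k.psmp -i H2O-64.inp -o H2O-64.out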
As for MPI, are you sure your MPI stack was compiled with IB bindings? I
had similar issues, and the problem was that I wasn't actually using IB.
If you can, disable eth, leave only IB, and see what happens.
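If you're on Open MPI, a couple of quick checks along those lines (a
sketch; MVAPICH2 and Intel MPI have their own equivalents):

  # is the InfiniBand (openib) BTL even built into your MPI stack?
  ompi_info | grep -i openib

  # force IB and fail loudly rather than silently falling back to TCP
  mpirun --mca btl openib,self,sm -np 16 cp2k.popt -i H2O-64.inp -o H2O-64.out

If the forced-IB run aborts, your stack has no IB support and your
two-node traffic has been going over ethernet all along.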
Glen
On Mar 21, 2016 4:48 PM, "Ronald Cohen" <rco... at carnegiescience.edu> wrote:
> On the dco machine deepcarbon I find decent single-node MPI performance,
> but running on the same number of processors across two nodes is terrible,
> even with the InfiniBand interconnect. This is the CP2K H2O-64 benchmark:
>
>
>
> On 16 cores on 1 node: total time 530 seconds
>
> SUBROUTINE     CALLS  ASD     SELF TIME          TOTAL TIME
>              MAXIMUM      AVERAGE  MAXIMUM   AVERAGE   MAXIMUM
> CP2K               1  1.0    0.015    0.019   530.306   530.306
>
> -------------------------------------------------------------------------------
> -                                                                             -
> -                        MESSAGE PASSING PERFORMANCE                          -
> -                                                                             -
> -------------------------------------------------------------------------------
>
> ROUTINE          CALLS  TOT TIME [s]  AVE VOLUME [Bytes]  PERFORMANCE [MB/s]
> MP_Group             5         0.000
> MP_Bcast          4103         0.029              44140.             6191.05
> MP_Allreduce     21860         7.077                263.                0.81
> MP_Gather           62         0.008                320.                2.53
> MP_Sync             54         0.001
> MP_Alltoall      19407        26.839             648289.              468.77
> MP_ISendRecv     21600         0.091              94533.            22371.25
> MP_Wait         238786        50.545
> MP_comm_split       50         0.004
> MP_ISend         97572         0.741             239205.            31518.68
> MP_IRecv         97572         8.605             239170.             2711.98
> MP_Memory       167778        45.018
> -------------------------------------------------------------------------------
>
>
> On 16 cores on 2 nodes: total time 5053 seconds!!
>
> SUBROUTINE     CALLS  ASD     SELF TIME          TOTAL TIME
>              MAXIMUM      AVERAGE  MAXIMUM   AVERAGE   MAXIMUM
> CP2K               1  1.0    0.311    0.363  5052.904  5052.909
>
> -------------------------------------------------------------------------------
> -                                                                             -
> -                        MESSAGE PASSING PERFORMANCE                          -
> -                                                                             -
> -------------------------------------------------------------------------------
>
> ROUTINE          CALLS  TOT TIME [s]  AVE VOLUME [Bytes]  PERFORMANCE [MB/s]
> MP_Group             5         0.000
> MP_Bcast          4119         0.258              43968.              700.70
> MP_Allreduce     21892      1546.186                263.                0.00
> MP_Gather           62         0.049                320.                0.40
> MP_Sync             54         0.071
> MP_Alltoall      19407      1507.024             648289.                8.35
> MP_ISendRecv     21600         0.104              94533.            19656.44
> MP_Wait         238786       513.507
> MP_comm_split       50         4.096
> MP_ISend         97572         1.102             239206.            21176.09
> MP_IRecv         97572         2.739             239171.             8520.75
> MP_Memory       167778        18.845
> -------------------------------------------------------------------------------
>
> Any ideas? The code was built with the latest gfortran, and I built all of
> the dependencies myself, using this arch file:
>
> CC = gcc
> CPP =
> FC = mpif90
> LD = mpif90
> AR = ar -r
> PREFIX = /home/rcohen
> FFTW_INC = $(PREFIX)/include
> FFTW_LIB = $(PREFIX)/lib
> LIBINT_INC = $(PREFIX)/include
> LIBINT_LIB = $(PREFIX)/lib
> LIBXC_INC = $(PREFIX)/include
> LIBXC_LIB = $(PREFIX)/lib
> GCC_LIB = $(PREFIX)/gcc-trunk/lib
> GCC_LIB64 = $(PREFIX)/gcc-trunk/lib64
> GCC_INC = $(PREFIX)/gcc-trunk/include
> DFLAGS = -D__FFTW3 -D__LIBINT -D__LIBXC2\
> -D__LIBINT_MAX_AM=7 -D__LIBDERIV_MAX_AM1=6 -D__MAX_CONTR=4\
> -D__parallel -D__SCALAPACK -D__HAS_smm_dnn -D__ELPA3
> CPPFLAGS =
> FCFLAGS = $(DFLAGS) -O2 -ffast-math -ffree-form -ffree-line-length-none\
> -fopenmp -ftree-vectorize -funroll-loops\
> -mtune=native \
> -I$(FFTW_INC) -I$(LIBINT_INC) -I$(LIBXC_INC) -I$(MKLROOT)/include \
> -I$(GCC_INC) -I$(PREFIX)/include/elpa_openmp-2015.11.001/modules
> LIBS = \
>        $(PREFIX)/lib/libscalapack.a \
>        $(PREFIX)/lib/libsmm_dnn_sandybridge-2015-11-10.a \
>        $(FFTW_LIB)/libfftw3.a \
>        $(FFTW_LIB)/libfftw3_threads.a \
>        $(LIBXC_LIB)/libxcf90.a \
>        $(LIBXC_LIB)/libxc.a \
>        $(PREFIX)/lib/liblapack.a $(PREFIX)/lib/libtmglib.a \
>        $(PREFIX)/lib/libgomp.a \
>        $(PREFIX)/lib/libderiv.a $(PREFIX)/lib/libint.a \
>        -lelpa_openmp -lgomp -lopenblas
> LDFLAGS = $(FCFLAGS) -L$(GCC_LIB64) -L$(GCC_LIB) -static-libgfortran -L$(PREFIX)/lib
>
> The two-node run used OMP_NUM_THREADS=2; the one-node run used
> OMP_NUM_THREADS=1. I am now checking whether OMP_NUM_THREADS=1 on two
> nodes is faster than OMP_NUM_THREADS=2, but I do not think it will be.
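>
> For reference, the two layouts I am comparing look roughly like this (a
> sketch; I am assuming Open MPI's -npernode, other stacks spell it
> differently):
>
>   # two nodes, 16 ranks total, 1 OpenMP thread per rank
>   export OMP_NUM_THREADS=1
>   mpirun -np 16 -npernode 8 cp2k.psmp -i H2O-64.inp -o H2O-64.out
>
>   # two nodes, 8 ranks total, 2 OpenMP threads per rank (same 16 cores)
>   export OMP_NUM_THREADS=2
>   mpirun -np 8 -npernode 4 cp2k.psmp -i H2O-64.inp -o H2O-64.out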
>
> Ron Cohen
>
>
>