terrible performance across infiniband

Ronald Cohen rco... at carnegiescience.edu
Mon Mar 21 20:48:39 UTC 2016


On the dco machine deepcarbon I find decent single-node MPI performance, 
but running on the same number of processors across two nodes is terrible, 
even with the InfiniBand interconnect. This is the CP2K H2O-64 benchmark:


 
On 16 cores on 1 node: total time 530 seconds

 SUBROUTINE                       CALLS  ASD         SELF TIME        TOTAL TIME
                                 MAXIMUM       AVERAGE  MAXIMUM  AVERAGE  MAXIMUM
 CP2K                                 1  1.0    0.015    0.019  530.306  530.306

 -------------------------------------------------------------------------------
 -                                                                             -
 -                         MESSAGE PASSING PERFORMANCE                         -
 -                                                                             -
 -------------------------------------------------------------------------------

 ROUTINE             CALLS  TOT TIME [s]  AVE VOLUME [Bytes]  PERFORMANCE [MB/s]
 MP_Group                5         0.000
 MP_Bcast             4103         0.029              44140.             6191.05
 MP_Allreduce        21860         7.077                263.                0.81
 MP_Gather              62         0.008                320.                2.53
 MP_Sync                54         0.001
 MP_Alltoall         19407        26.839             648289.              468.77
 MP_ISendRecv        21600         0.091              94533.            22371.25
 MP_Wait            238786        50.545
 MP_comm_split          50         0.004
 MP_ISend            97572         0.741             239205.            31518.68
 MP_IRecv            97572         8.605             239170.             2711.98
 MP_Memory          167778        45.018
 -------------------------------------------------------------------------------


On 16 cores on 2 nodes: total time 5053 seconds!!

 SUBROUTINE                       CALLS  ASD         SELF TIME        TOTAL TIME
                                 MAXIMUM       AVERAGE  MAXIMUM  AVERAGE  MAXIMUM
 CP2K                                 1  1.0    0.311    0.363 5052.904 5052.909

 -------------------------------------------------------------------------------
 -                                                                             -
 -                         MESSAGE PASSING PERFORMANCE                         -
 -                                                                             -
 -------------------------------------------------------------------------------

 ROUTINE             CALLS  TOT TIME [s]  AVE VOLUME [Bytes]  PERFORMANCE [MB/s]
 MP_Group                5         0.000
 MP_Bcast             4119         0.258              43968.              700.70
 MP_Allreduce        21892      1546.186                263.                0.00
 MP_Gather              62         0.049                320.                0.40
 MP_Sync                54         0.071
 MP_Alltoall         19407      1507.024             648289.                8.35
 MP_ISendRecv        21600         0.104              94533.            19656.44
 MP_Wait            238786       513.507
 MP_comm_split          50         4.096
 MP_ISend            97572         1.102             239206.            21176.09
 MP_IRecv            97572         2.739             239171.             8520.75
 MP_Memory          167778        18.845
 -------------------------------------------------------------------------------

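Doing the arithmetic on the MP_Allreduce and MP_Alltoall rows makes the slowdown concrete; a quick sketch, using only the numbers from the two tables above:

```python
# Per-call and per-byte cost of the two dominant MPI routines,
# taken from the MESSAGE PASSING PERFORMANCE tables above.

# MP_Allreduce (263-byte messages): seconds per call
allreduce_1node = 7.077 / 21860       # 1 node,  ~0.3 ms per call
allreduce_2node = 1546.186 / 21892    # 2 nodes, ~70 ms per call

# MP_Alltoall (648289-byte messages): effective bandwidth in MB/s
alltoall_1node = 19407 * 648289 / 26.839 / 1e6
alltoall_2node = 19407 * 648289 / 1507.024 / 1e6

print(f"Allreduce: {allreduce_1node*1e6:.0f} us/call -> {allreduce_2node*1e3:.1f} ms/call")
print(f"Alltoall : {alltoall_1node:.0f} MB/s -> {alltoall_2node:.1f} MB/s")
```

Tens of milliseconds for a 263-byte allreduce is several orders of magnitude above typical InfiniBand latency, so this could indicate the MPI traffic is not actually going over IB (e.g. a TCP/Ethernet fallback) or that rank placement/binding is pathological, rather than the fabric itself being slow.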
Any ideas? The code was built with the latest gfortran and I built all of 
the dependencies, using this arch file.

CC   = gcc
CPP  =
FC   = mpif90
LD   = mpif90
AR   = ar -r
PREFIX   = /home/rcohen
FFTW_INC   = $(PREFIX)/include
FFTW_LIB   = $(PREFIX)/lib
LIBINT_INC = $(PREFIX)/include
LIBINT_LIB = $(PREFIX)/lib
LIBXC_INC  = $(PREFIX)/include
LIBXC_LIB  = $(PREFIX)/lib
GCC_LIB = $(PREFIX)/gcc-trunk/lib
GCC_LIB64  = $(PREFIX)/gcc-trunk/lib64
GCC_INC = $(PREFIX)/gcc-trunk/include
DFLAGS  = -D__FFTW3 -D__LIBINT -D__LIBXC2\
    -D__LIBINT_MAX_AM=7 -D__LIBDERIV_MAX_AM1=6 -D__MAX_CONTR=4\
    -D__parallel -D__SCALAPACK -D__HAS_smm_dnn -D__ELPA3 
CPPFLAGS   =
FCFLAGS = $(DFLAGS) -O2 -ffast-math -ffree-form -ffree-line-length-none\
    -fopenmp -ftree-vectorize -funroll-loops\
    -mtune=native  \
     -I$(FFTW_INC) -I$(LIBINT_INC) -I$(LIBXC_INC) -I$(MKLROOT)/include \
     -I$(GCC_INC) -I$(PREFIX)/include/elpa_openmp-2015.11.001/modules
LIBS    = \
    $(PREFIX)/lib/libscalapack.a $(PREFIX)/lib/libsmm_dnn_sandybridge-2015-11-10.a \
    $(FFTW_LIB)/libfftw3.a \
    $(FFTW_LIB)/libfftw3_threads.a \
    $(LIBXC_LIB)/libxcf90.a \
    $(LIBXC_LIB)/libxc.a \
    $(PREFIX)/lib/liblapack.a $(PREFIX)/lib/libtmglib.a $(PREFIX)/lib/libgomp.a \
    $(PREFIX)/lib/libderiv.a $(PREFIX)/lib/libint.a -lelpa_openmp -lgomp -lopenblas
LDFLAGS = $(FCFLAGS) -L$(GCC_LIB64) -L$(GCC_LIB) -static-libgfortran -L$(PREFIX)/lib

The two-node run used OMP_NUM_THREADS=2 and the one-node run used 
OMP_NUM_THREADS=1.

I am now checking whether OMP_NUM_THREADS=1 on two nodes is faster than 
OMP_NUM_THREADS=2, but I do not think it will be.

Ron Cohen


