terrible performance across infiniband
Ronald Cohen
rco... at carnegiescience.edu
Mon Mar 21 20:48:39 UTC 2016
On the dco machine deepcarbon I find decent single-node MPI performance,
but running on the same number of processors across two nodes is terrible,
even with the InfiniBand interconnect. This is the CP2K H2O-64 benchmark.
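The runs were launched roughly like this (a sketch, assuming OpenMPI; the
binary, hostfile, and output names are my shorthand, not verbatim from the
job scripts):

  mpirun -np 16 --hostfile nodes.txt ./cp2k.psmp H2O-64.inp > H2O-64.out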
On 16 cores on 1 node: total time 530 seconds
SUBROUTINE     CALLS  ASD       SELF TIME           TOTAL TIME
             MAXIMUM        AVERAGE  MAXIMUM    AVERAGE   MAXIMUM
CP2K               1  1.0     0.015    0.019    530.306   530.306

-------------------------------------------------------------------------------
-                       MESSAGE PASSING PERFORMANCE                           -
-------------------------------------------------------------------------------
ROUTINE          CALLS  TOT TIME [s]  AVE VOLUME [Bytes]  PERFORMANCE [MB/s]
MP_Group             5         0.000
MP_Bcast          4103         0.029              44140.             6191.05
MP_Allreduce     21860         7.077                263.                0.81
MP_Gather           62         0.008                320.                2.53
MP_Sync             54         0.001
MP_Alltoall      19407        26.839             648289.              468.77
MP_ISendRecv     21600         0.091              94533.            22371.25
MP_Wait         238786        50.545
MP_comm_split       50         0.004
MP_ISend         97572         0.741             239205.            31518.68
MP_IRecv         97572         8.605             239170.             2711.98
MP_Memory       167778        45.018
-------------------------------------------------------------------------------
On 16 cores on 2 nodes: total time 5053 seconds!!
SUBROUTINE     CALLS  ASD       SELF TIME           TOTAL TIME
             MAXIMUM        AVERAGE  MAXIMUM    AVERAGE   MAXIMUM
CP2K               1  1.0     0.311    0.363   5052.904  5052.909

-------------------------------------------------------------------------------
-                       MESSAGE PASSING PERFORMANCE                           -
-------------------------------------------------------------------------------
ROUTINE          CALLS  TOT TIME [s]  AVE VOLUME [Bytes]  PERFORMANCE [MB/s]
MP_Group             5         0.000
MP_Bcast          4119         0.258              43968.              700.70
MP_Allreduce     21892      1546.186                263.                0.00
MP_Gather           62         0.049                320.                0.40
MP_Sync             54         0.071
MP_Alltoall      19407      1507.024             648289.                8.35
MP_ISendRecv     21600         0.104              94533.            19656.44
MP_Wait         238786       513.507
MP_comm_split       50         4.096
MP_ISend         97572         1.102             239206.            21176.09
MP_IRecv         97572         2.739             239171.             8520.75
MP_Memory       167778        18.845
-------------------------------------------------------------------------------
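Note the MP_Allreduce and MP_Alltoall lines: the effective bandwidth collapses
to essentially zero, which looks like the inter-node traffic silently falling
back to TCP over Ethernet instead of using the InfiniBand fabric. I have not
yet ruled this out; a quick check would be something like the following (a
sketch, assuming OpenMPI and the standard infiniband-diags/perftest tools;
nodes.txt is shorthand):

  ibstat                            # HCA port should show State: Active
  ib_write_bw <other-node>          # raw RDMA bandwidth between the two nodes
  mpirun --mca btl openib,self,sm -np 16 --hostfile nodes.txt ./cp2k.psmp H2O-64.inp

The last command forces OpenMPI's openib transport; if it aborts instead of
running, the earlier runs were not using InfiniBand at all.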
Any ideas? The code was built with the latest gfortran, and I built all of
the dependencies myself, using this arch file:
CC         = gcc
CPP        =
FC         = mpif90
LD         = mpif90
AR         = ar -r
PREFIX     = /home/rcohen
FFTW_INC   = $(PREFIX)/include
FFTW_LIB   = $(PREFIX)/lib
LIBINT_INC = $(PREFIX)/include
LIBINT_LIB = $(PREFIX)/lib
LIBXC_INC  = $(PREFIX)/include
LIBXC_LIB  = $(PREFIX)/lib
GCC_LIB    = $(PREFIX)/gcc-trunk/lib
GCC_LIB64  = $(PREFIX)/gcc-trunk/lib64
GCC_INC    = $(PREFIX)/gcc-trunk/include
DFLAGS     = -D__FFTW3 -D__LIBINT -D__LIBXC2 \
             -D__LIBINT_MAX_AM=7 -D__LIBDERIV_MAX_AM1=6 -D__MAX_CONTR=4 \
             -D__parallel -D__SCALAPACK -D__HAS_smm_dnn -D__ELPA3
CPPFLAGS   =
FCFLAGS    = $(DFLAGS) -O2 -ffast-math -ffree-form -ffree-line-length-none \
             -fopenmp -ftree-vectorize -funroll-loops \
             -mtune=native \
             -I$(FFTW_INC) -I$(LIBINT_INC) -I$(LIBXC_INC) -I$(MKLROOT)/include \
             -I$(GCC_INC) -I$(PREFIX)/include/elpa_openmp-2015.11.001/modules
LIBS       = \
             $(PREFIX)/lib/libscalapack.a \
             $(PREFIX)/lib/libsmm_dnn_sandybridge-2015-11-10.a \
             $(FFTW_LIB)/libfftw3.a \
             $(FFTW_LIB)/libfftw3_threads.a \
             $(LIBXC_LIB)/libxcf90.a \
             $(LIBXC_LIB)/libxc.a \
             $(PREFIX)/lib/liblapack.a $(PREFIX)/lib/libtmglib.a \
             $(PREFIX)/lib/libgomp.a \
             $(PREFIX)/lib/libderiv.a $(PREFIX)/lib/libint.a -lelpa_openmp -lgomp \
             -lopenblas
LDFLAGS    = $(FCFLAGS) -L$(GCC_LIB64) -L$(GCC_LIB) -static-libgfortran \
             -L$(PREFIX)/lib
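The build itself was the standard CP2K make driven by that arch file, roughly
as follows (the arch file name here is my shorthand; VERSION=psmp selects the
MPI+OpenMP binary):

  cd cp2k/makefiles && make -j 16 ARCH=deepcarbon VERSION=psmp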
It was run with OMP_NUM_THREADS=2 on the two nodes and with OMP_NUM_THREADS=1
on the one node. I am now checking whether running with OMP_NUM_THREADS=1 on
two nodes is faster than OMP_NUM_THREADS=2, but I do not think it will be.
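The comparison is just the same launch with the variable exported per run,
along these lines (again assuming OpenMPI; -x exports an environment variable
to the remote ranks):

  mpirun -np 16 -x OMP_NUM_THREADS=1 --hostfile nodes.txt ./cp2k.psmp H2O-64.inp
  mpirun -np 16 -x OMP_NUM_THREADS=2 --hostfile nodes.txt ./cp2k.psmp H2O-64.inp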
Ron Cohen