<div dir="ltr"><div style="font-family: arial, sans-serif; font-size: 12.8px;">On the DCO machine deepcarbon I find decent single-node MPI performance, but running on the same number of processors across two nodes is terrible, even with the InfiniBand interconnect. This is the CP2K H2O-64 benchmark:</div><div style="font-family: arial, sans-serif; font-size: 12.8px;"><br></div><div style="font-family: arial, sans-serif; font-size: 12.8px;"><br></div><div style="font-family: arial, sans-serif; font-size: 12.8px;"> </div><div style="font-family: arial, sans-serif; font-size: 12.8px;">On 16 cores on 1 node: total time 530 seconds</div><div style="font-family: arial, sans-serif; font-size: 12.8px;"><div> SUBROUTINE CALLS ASD SELF TIME TOTAL TIME</div><div> MAXIMUM AVERAGE MAXIMUM AVERAGE MAXIMUM</div><div> CP2K 1 1.0 0.015 0.019 530.306 530.306</div></div><div style="font-family: arial, sans-serif; font-size: 12.8px;"><div> - -</div><div> - MESSAGE PASSING PERFORMANCE -</div><div> - -</div><div> -----------------------------<wbr>------------------------------<wbr>--------------------</div><div><br></div><div> ROUTINE CALLS TOT TIME [s] AVE VOLUME [Bytes] PERFORMANCE [MB/s]</div><div> MP_Group 5 0.000</div><div> MP_Bcast 4103 0.029 44140. 6191.05</div><div> MP_Allreduce 21860 7.077 263. 0.81</div><div> MP_Gather 62 0.008 320. 2.53</div><div> MP_Sync 54 0.001</div><div> MP_Alltoall 19407 26.839 648289. 468.77</div><div> MP_ISendRecv 21600 0.091 94533. 22371.25</div><div> MP_Wait 238786 50.545</div><div> MP_comm_split 50 0.004</div><div> MP_ISend 97572 0.741 239205. 31518.68</div><div> MP_IRecv 97572 8.605 239170. 
2711.98</div><div> MP_Memory 167778 45.018</div><div> -----------------------------<wbr>------------------------------<wbr>--------------------</div></div><div style="font-family: arial, sans-serif; font-size: 12.8px;"><br></div><div style="font-family: arial, sans-serif; font-size: 12.8px;"><br></div><div style="font-family: arial, sans-serif; font-size: 12.8px;">On 16 cores on 2 nodes: total time 5053 seconds!</div><div style="font-family: arial, sans-serif; font-size: 12.8px;"><br></div><div style="font-family: arial, sans-serif; font-size: 12.8px;"><div>SUBROUTINE CALLS ASD SELF TIME TOTAL TIME</div><div> MAXIMUM AVERAGE MAXIMUM AVERAGE MAXIMUM</div><div> CP2K 1 1.0 0.311 0.363 5052.904 5052.909</div></div><div style="font-family: arial, sans-serif; font-size: 12.8px;"><br></div><div style="font-family: arial, sans-serif; font-size: 12.8px;"><br></div><div style="font-family: arial, sans-serif; font-size: 12.8px;"><div>------------------------------<wbr>------------------------------<wbr>-------------------</div><div> - -</div><div> - MESSAGE PASSING PERFORMANCE -</div><div> - -</div><div> -----------------------------<wbr>------------------------------<wbr>--------------------</div><div><br></div><div> ROUTINE CALLS TOT TIME [s] AVE VOLUME [Bytes] PERFORMANCE [MB/s]</div><div> MP_Group 5 0.000</div><div> MP_Bcast 4119 0.258 43968. 700.70</div><div> MP_Allreduce 21892 1546.186 263. 0.00</div><div> MP_Gather 62 0.049 320. 0.40</div><div> MP_Sync 54 0.071</div><div> MP_Alltoall 19407 1507.024 648289. 8.35</div><div> MP_ISendRecv 21600 0.104 94533. 19656.44</div><div> MP_Wait 238786 513.507</div><div> MP_comm_split 50 4.096</div><div> MP_ISend 97572 1.102 239206. 21176.09</div><div> MP_IRecv 97572 2.739 239171. 
8520.75</div><div> MP_Memory 167778 18.845</div><div> -----------------------------<wbr>------------------------------<wbr>--------------------</div></div><div style="font-family: arial, sans-serif; font-size: 12.8px;"><br></div><div style="font-family: arial, sans-serif; font-size: 12.8px;">Any ideas? The code was built with the latest gfortran and I built all of the dependencies, using this arch file.</div><div style="font-family: arial, sans-serif; font-size: 12.8px;"><br></div><div style="font-family: arial, sans-serif; font-size: 12.8px;"><div style="font-size: 12.8px;">CC = gcc</div><div style="font-size: 12.8px;">CPP =</div><div style="font-size: 12.8px;">FC = mpif90</div><div style="font-size: 12.8px;">LD = mpif90</div><div style="font-size: 12.8px;">AR = ar -r</div><div style="font-size: 12.8px;">PREFIX = /home/rcohen</div><div style="font-size: 12.8px;">FFTW_INC = $(PREFIX)/include</div><div style="font-size: 12.8px;">FFTW_LIB = $(PREFIX)/lib</div><div style="font-size: 12.8px;">LIBINT_INC = $(PREFIX)/include</div><div style="font-size: 12.8px;">LIBINT_LIB = $(PREFIX)/lib</div><div style="font-size: 12.8px;">LIBXC_INC = $(PREFIX)/include</div><div style="font-size: 12.8px;">LIBXC_LIB = $(PREFIX)/lib</div><div style="font-size: 12.8px;">GCC_LIB = $(PREFIX)/gcc-trunk/lib</div><div style="font-size: 12.8px;">GCC_LIB64 = $(PREFIX)/gcc-trunk/lib64</div><div style="font-size: 12.8px;">GCC_INC = $(PREFIX)/gcc-trunk/include</div><div style="font-size: 12.8px;">DFLAGS = -D__FFTW3 -D__LIBINT -D__LIBXC2\</div><div style="font-size: 12.8px;"> -D__LIBINT_MAX_AM=7 -D__LIBDERIV_MAX_AM1=6 -D__MAX_CONTR=4\</div><div style="font-size: 12.8px;"> -D__parallel -D__SCALAPACK -D__HAS_smm_dnn -D__ELPA3 </div><div style="font-size: 12.8px;">CPPFLAGS =</div><div style="font-size: 12.8px;">FCFLAGS = $(DFLAGS) -O2 -ffast-math -ffree-form -ffree-line-length-none\</div><div style="font-size: 12.8px;"> -fopenmp -ftree-vectorize -funroll-loops\</div><div style="font-size: 12.8px;"> 
-mtune=native \</div><div style="font-size: 12.8px;"> -I$(FFTW_INC) -I$(LIBINT_INC) -I$(LIBXC_INC) -I$(MKLROOT)/include \</div><div style="font-size: 12.8px;"> -I$(GCC_INC) -I$(PREFIX)/include/elpa_openmp-2015.11.001/modules</div><div style="font-size: 12.8px;">LIBS = \</div><div style="font-size: 12.8px;"> $(PREFIX)/lib/libscalapack.a $(PREFIX)/lib/libsmm_dnn_sandybridge-2015-11-10.a \</div><div style="font-size: 12.8px;"> $(FFTW_LIB)/libfftw3.a\</div><div style="font-size: 12.8px;"> $(FFTW_LIB)/libfftw3_threads.a\</div><div style="font-size: 12.8px;"> $(LIBXC_LIB)/libxcf90.a\</div><div style="font-size: 12.8px;"> $(LIBXC_LIB)/libxc.a\</div><div style="font-size: 12.8px;"> $(PREFIX)/lib/liblapack.a $(PREFIX)/lib/libtmglib.a $(PREFIX)/lib/libgomp.a \</div><div style="font-size: 12.8px;"> $(PREFIX)/lib/libderiv.a $(PREFIX)/lib/libint.a -lelpa_openmp -lgomp -lopenblas</div><div style="font-size: 12.8px;">LDFLAGS = $(FCFLAGS) -L$(GCC_LIB64) -L$(GCC_LIB) -static-libgfortran -L$(PREFIX)/lib </div><div><br></div><div>It was run with <span style="font-size: 12.8px;">OMP_NUM_THREADS=2 on the two nodes</span> and OMP_NUM_THREADS=1 on the one node.</div><div><br></div><div>I am now rerunning with <span style="font-size: 12.8px;">OMP_NUM_THREADS=1 on two nodes to check whether it is faster than </span><span style="font-size: 12.8px;">OMP_NUM_THREADS=2, but I do not think it will be.</span></div><div><br></div><div>Ron Cohen</div><div><br></div></div><div style="font-family: arial, sans-serif; font-size: 12.8px;"><br></div><div><br></div></div>
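<div style="font-family: arial, sans-serif; font-size: 12.8px;">To put numbers on the slowdown, here is a minimal Python sketch that divides the two-node timings by the one-node timings from the two reports above. The dictionary and function names are made up for illustration; the timings are copied from the reports, and only the routines that dominate the communication time are included.</div>

```python
# Timings in seconds, copied from the two CP2K reports above.
one_node = {
    "CP2K total": 530.306,
    "MP_Allreduce": 7.077,
    "MP_Alltoall": 26.839,
    "MP_Wait": 50.545,
}
two_nodes = {
    "CP2K total": 5052.909,
    "MP_Allreduce": 1546.186,
    "MP_Alltoall": 1507.024,
    "MP_Wait": 513.507,
}


def slowdown(one, two):
    """Return the two-node time divided by the one-node time per entry."""
    return {name: two[name] / one[name] for name in one}


ratios = slowdown(one_node, two_nodes)
for name, ratio in ratios.items():
    print(f"{name:14s} {ratio:7.1f}x slower on two nodes")
```

<div style="font-family: arial, sans-serif; font-size: 12.8px;">The collectives (MP_Allreduce and MP_Alltoall) degrade far more than the total run time does, which suggests looking at the inter-node communication path first.</div>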