mixed OpenMP-MPI for HF exchange
Simone Piccinin
piccini... at gmail.com
Wed Nov 9 10:43:38 UTC 2011
Dear CP2K users,
I am performing some tests on a multi-core machine (each node has 4
sockets with 8 cores each and 128 GB of memory). I am doing hybrid-
functional calculations on a system with ~30 atoms, ~100 electrons and
~700 basis functions, and I would like to exploit the mixed OpenMP-MPI
parallelization scheme. As I vary the number of threads and MPI tasks
in my tests, I check the timings and whether all the electron
repulsion integrals (ERIs) are stored in-core or not.
Below I report the results of the tests (number of cores, wall time in
seconds for the whole calculation, for the 1st SCF iteration and for
the last SCF iteration) for a pure MPI run (one MPI task per core), a
mixed OpenMP-MPI run using 4 threads per MPI task, and a mixed
OpenMP-MPI run using 8 threads per MPI task.
I find that, for a given number of cores, the higher the number of
threads the smaller the number of ERIs stored in-core, and therefore
the performance degrades. On the other hand, when the mixed OpenMP-MPI
calculation manages to keep all the ERIs in-core (i.e. when the number
of cores is sufficiently large), the performance is better than the
corresponding pure MPI run with the same number of cores.
Is this the expected behavior? It is pretty much the opposite of what
I expected. With one MPI task per socket and 8 threads per task
(i.e. one thread per core) I would expect each MPI task to see 32 GB of
memory, whereas in a pure MPI calculation with one MPI task per core
each task should have "only" 4 GB of memory available. So it is not
clear to me why, when comparing calculations with the same number of
cores, the pure MPI calculation manages to fit all the ERIs in-core
while the mixed one does not (in fact, it seems that the higher the
number of threads, the smaller the available memory).
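To put numbers on this expectation, here is a minimal sketch (plain
Python, just the arithmetic, assuming a node's 128 GB is divided evenly
among the MPI tasks running on it) of the nominal memory per MPI task
for the three layouts I am comparing:

# Nominal per-MPI-task memory on a 4-socket x 8-core node with 128 GB,
# assuming the node memory is split evenly among the MPI tasks on the node.
NODE_CORES = 4 * 8   # 32 cores per node
NODE_MEM_GB = 128    # memory per node

for threads_per_task in (1, 4, 8):  # popt, psmp with 4 threads, psmp with 8 threads
    tasks_per_node = NODE_CORES // threads_per_task
    mem_per_task_gb = NODE_MEM_GB / tasks_per_node
    print(f"{threads_per_task} thread(s) per task: {tasks_per_node:2d} tasks/node, "
          f"~{mem_per_task_gb:.0f} GB per MPI task")

# Prints: 4 GB/task for pure MPI, 16 GB/task with 4 threads, 32 GB/task with 8 threads.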
Here are the results of the tests:
VERSION: popt (pure MPI, 1 task per core)
#cores  walltot(s)  1st SCF(s)  last SCF(s)
-------------------------------------------
    64         710       560.0         10.5
   128         410       295.7          7.7
   256         239       159.5          5.1
   512         317       107.8         11.3

VERSION: psmp (4 threads per MPI task)
#cores  walltot(s)  1st SCF(s)  last SCF(s)
-------------------------------------------
    64        1744       536.7         86.0  --> not all in-core
   128         356       273.3          5.6
   256         223       155.3          3.7
   512         149        89.6          2.9
  1024         140        61.5          3.6

VERSION: psmp (8 threads per MPI task)
#cores  walltot(s)  1st SCF(s)  last SCF(s)
-------------------------------------------
    64       ~3600       530.0        217.9  --> not all in-core
   128         692       274.9         29.5  --> not all in-core
   256         208       150.0          3.6
   512         145        89.4          2.8
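
For easier comparison, here is a small Python sketch that puts the
three sets of runs on the same footing (the numbers are the total wall
times from the tables above, with the ~3600 s run approximated as
3600 s; speedup and parallel efficiency are taken relative to the
64-core popt run):

# Speedup and parallel efficiency of the total wall times reported above,
# relative to the 64-core pure-MPI (popt) run.
runs = {
    "popt (1 thread)": {64: 710, 128: 410, 256: 239, 512: 317},
    "psmp (4 threads)": {64: 1744, 128: 356, 256: 223, 512: 149, 1024: 140},
    "psmp (8 threads)": {64: 3600, 128: 692, 256: 208, 512: 145},  # ~3600 s approximated
}
t_ref, cores_ref = runs["popt (1 thread)"][64], 64

for label, times in runs.items():
    for cores, t in sorted(times.items()):
        speedup = t_ref / t
        efficiency = speedup * cores_ref / cores
        print(f"{label:16s} {cores:5d} cores: "
              f"speedup {speedup:5.2f}, efficiency {efficiency:4.0%}")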
and here is how I compiled the code (the mixed OpenMP-MPI version):
CC = mpicc
CPP =
FC = mpif90
LD = mpif90
AR = ar -r
INTEL_LIB = /usr/local/Intel_compilers/c/composer_xe_2011_sp1.6.233/mkl/lib/intel64/
LIBINT_DIR = /ccc/cont005/home/pa0315/piccinis/LIBS/LIBINT_1.1.4/lib/
WRAPPER_DIR = /ccc/cont005/home/pa0315/piccinis/cp2k/tools/hfx_tools/libint_tools
CPPFLAGS =
DFLAGS = -D__GFORTRAN -D__FFTSG -D__parallel -D__BLACS -D__SCALAPACK -D__LIBINT
FCFLAGS = -O2 -fopenmp -ffast-math -funroll-loops -ftree-vectorize -march=native -ffree-form $(DFLAGS)
LDFLAGS = $(FCFLAGS)
LIBS = -L$(INTEL_LIB) -Wl,-rpath,$(INTEL_LIB) \
       -lmkl_scalapack_lp64 \
       -lmkl_blacs_openmpi_lp64 \
       -lmkl_intel_lp64 \
       -lmkl_sequential \
       -lmkl_core \
       $(WRAPPER_DIR)/libint_cpp_wrapper.o \
       $(LIBINT_DIR)/libderiv.a \
       $(LIBINT_DIR)/libint.a \
       -lstdc++
OBJECTS_ARCHITECTURE = machine_gfortran.o
Any help in understanding the performance of the OpenMP-MPI
parallelization scheme would be much appreciated.
Best wishes,
Simone Piccinin
CNR-IOM, Trieste (Italy)