mixed OpenMP-MPI for HF exchange

Simone Piccinin piccini... at gmail.com
Wed Nov 9 11:43:38 CET 2011


Dear CP2K users,

I am performing some tests on a multi-core machine (each node has 4
sockets with 8 cores each and 128 GB of memory). I am doing hybrid-
functional calculations on a system with ~30 atoms, ~100 electrons and
~700 basis functions, and I would like to exploit the mixed OpenMP-MPI
parallelization scheme. As I vary the number of threads and MPI tasks
in my tests, I check the timings and whether all the ERIs (electron
repulsion integrals) are stored in-core or not.
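For context, the size of the in-core ERI buffer in CP2K is set per MPI process in the &HF / &MEMORY input section; a minimal sketch of the relevant block (keyword name as I understand it from the CP2K manual; the value is illustrative):

```
&XC
  &HF
    &MEMORY
      ! MAX_MEMORY is given in MB and applies per MPI process
      MAX_MEMORY 1024
    &END MEMORY
  &END HF
&END XC
```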

Below I report the results of the tests (number of cores, time in
seconds for the whole calculation, for the 1st SCF iteration and for
the last SCF iteration) for a pure MPI run (one MPI task per core), a
mixed OpenMP-MPI run using 4 threads per MPI task, and a mixed
OpenMP-MPI run using 8 threads per task.

I find that, for a given number of cores, the higher the number of
threads the smaller the number of ERIs stored in-core, and therefore
the worse the performance. On the other hand, when the mixed
OpenMP-MPI calculation does manage to keep all the ERIs in-core (i.e.
when the number of cores is sufficiently large), the performance is
better than that of the corresponding pure MPI run on the same number
of cores.

Is this the expected behavior? It is pretty much the opposite of what
I expected. With one MPI task per socket and 8 threads (i.e. one
thread per core) I would expect each MPI task to see 32 GB of memory,
whereas in a pure MPI calculation with one MPI task per core each task
should have "only" 4 GB available. So it is not clear to me why, when
comparing calculations with the same number of cores, the pure MPI
calculation manages to fit all the ERIs in-core while the mixed one
does not (indeed, the higher the number of threads, the smaller the
available memory appears to be).
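To make the arithmetic behind my expectation explicit, here is a back-of-envelope sketch (plain Python; the numbers come from the node description above, and it only illustrates what I expect, not what CP2K actually does internally):

```python
# Back-of-envelope memory budget per MPI task on one node:
# 4 sockets x 8 cores = 32 cores, 128 GB of memory per node.
NODE_MEMORY_GB = 128
CORES_PER_NODE = 32

def memory_per_task_gb(threads_per_task):
    """Memory each MPI task can use if the tasks evenly share the node."""
    tasks_per_node = CORES_PER_NODE // threads_per_task
    return NODE_MEMORY_GB / tasks_per_node

# Pure MPI: one task per core -> 4 GB per task.
print(memory_per_task_gb(1))   # 4.0
# Hybrid: one task per socket, 8 threads each -> 32 GB per task.
print(memory_per_task_gb(8))   # 32.0
```

So naively the hybrid run should have 8x more memory per task for the ERI buffer, which is the opposite of what the timings suggest.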

Here are the results of the tests:

VERSION: popt
#cores walltot 1st    last
--------------------------
64      710    560.0  10.5
128     410    295.7   7.7
256     239    159.5   5.1
512     317    107.8  11.3

VERSION: psmp (4 threads)
#cores walltot 1st     last
---------------------------
64     1744    536.7  86.0  --> not all in-core
128     356    273.3   5.6
256     223    155.3   3.7
512     149     89.6   2.9
1024    140     61.5   3.6

VERSION: psmp (8 threads)
#cores walltot 1st    last
--------------------------
64    ~3600   530.0 217.9   --> not all in-core
128     692   274.9  29.5   --> not all in-core
256     208   150.0   3.6
512     145    89.4   2.8


and here is how I compiled the code (the mixed OpenMP-MPI version):

CC       = mpicc
CPP      =
FC       = mpif90
LD       = mpif90
AR       = ar -r
INTEL_LIB=/usr/local/Intel_compilers/c/composer_xe_2011_sp1.6.233/mkl/lib/intel64/
LIBINT_DIR=/ccc/cont005/home/pa0315/piccinis/LIBS/LIBINT_1.1.4/lib/
WRAPPER_DIR=/ccc/cont005/home/pa0315/piccinis/cp2k/tools/hfx_tools/libint_tools
CPPFLAGS =
DFLAGS   = -D__GFORTRAN -D__FFTSG -D__parallel -D__BLACS -D__SCALAPACK -D__LIBINT
FCFLAGS  = -O2 -fopenmp -ffast-math -funroll-loops -ftree-vectorize -march=native -ffree-form $(DFLAGS)
LDFLAGS  = $(FCFLAGS)
LIBS    = -L$(INTEL_LIB) -Wl,-rpath,$(INTEL_LIB) \
          -lmkl_scalapack_lp64     \
          -lmkl_blacs_openmpi_lp64 \
          -lmkl_intel_lp64         \
          -lmkl_sequential         \
          -lmkl_core \
           $(WRAPPER_DIR)/libint_cpp_wrapper.o \
           $(LIBINT_DIR)/libderiv.a \
           $(LIBINT_DIR)/libint.a \
           -lstdc++

OBJECTS_ARCHITECTURE = machine_gfortran.o
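For reference, this is the kind of launch line I mean when I say N threads per MPI task (node count, binary name and flags are illustrative; the exact placement/binding options depend on the MPI stack, so the snippet only prints the command):

```shell
# Hypothetical launch sketch: one MPI task per socket, 8 OpenMP threads each.
NNODES=2           # illustrative node count
TASKS_PER_NODE=4   # one task per socket (4 sockets per node)
export OMP_NUM_THREADS=8
NTASKS=$((NNODES * TASKS_PER_NODE))
# Print rather than execute, since mpirun placement flags vary between MPI stacks:
echo "mpirun -np ${NTASKS} ./cp2k.psmp -i input.inp  # ${NTASKS} tasks x ${OMP_NUM_THREADS} threads = $((NTASKS * OMP_NUM_THREADS)) cores"
```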


Any help in understanding the performance of the OpenMP-MPI
parallelization scheme would be much appreciated.

Best wishes,
Simone Piccinin
CNR-IOM, Trieste (Italy)



More information about the CP2K-user mailing list