[CP2K:3594] mixed openMP-MPI for HF exchange

hut... at pci.uzh.ch
Wed Nov 9 11:33:27 UTC 2011


Hi

Did you set the MEMORY keyword correctly? CP2K uses only this
information when deciding on the number of integrals kept in core.
From the manual:

Defines the maximum amount of memory [MB] to be consumed by the full HFX module. All temporary buffers and helper arrays are subtracted from this number. What remains will be used for storage of integrals. NOTE: This number is assumed to represent the memory available to one MPI process. When running a threaded version, cp2k automatically takes care of distributing the memory among all involved sub-processes.
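
For illustration, a minimal sketch of where this per-process budget
is set in the input (section layout quoted from memory, and 3000 MB
is only a placeholder, not a recommendation):

&FORCE_EVAL
  &DFT
    &XC
      &HF
        &MEMORY
          ! memory in MB available to ONE MPI process for ERI storage
          MAX_MEMORY 3000
        &END MEMORY
      &END HF
    &END XC
  &END DFT
&END FORCE_EVAL

When running the threaded version, this per-process budget is then
split among the threads of that MPI task.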

regards

Juerg Hutter


---------------------------------------------------------------
Juerg Hutter                   Phone : ++41 44 635 4491
Physical Chemistry Institute   FAX   : ++41 44 635 6838
University of Zurich           E-mail: hut... at pci.uzh.ch
Winterthurerstrasse 190
CH-8057 Zurich, Switzerland
---------------------------------------------------------------

-----cp... at googlegroups.com wrote: -----
To: cp2k <cp... at googlegroups.com>
From: Simone Piccinin 
Sent by: cp... at googlegroups.com
Date: 11/09/2011 11:43AM
Subject: [CP2K:3594] mixed openMP-MPI for HF exchange

Dear CP2K users,

I am performing some tests on a multi-core machine (each node has 4
sockets with 8 cores each and 128 GB of memory). I am doing hybrid-
functional calculations on a system with ~30 atoms, ~100 electrons
and ~700 basis functions, and I'd like to exploit the mixed
OpenMP-MPI parallelization scheme. As I change the number of threads
and MPI tasks in my tests, I check the timings and whether all the
ERIs are stored in-core or not.

Below I report the results of the tests (number of cores, wall time
in seconds for the whole calculation, for the first SCF iteration and
for the last SCF iteration) for a pure MPI run (one MPI task per
core), a mixed OpenMP-MPI run with 4 threads per MPI task, and a
mixed OpenMP-MPI run with 8 threads per MPI task.

I find that, for a given number of cores, the higher the number of
threads the smaller the number of ERIs stored in-core, and therefore
the worse the performance. On the other hand, when the mixed
OpenMP-MPI calculation manages to keep all the ERIs in-core (i.e.
when the number of cores is sufficiently large), the performance is
better than that of the corresponding pure MPI run with the same
number of cores.

Is this the expected behavior? It is pretty much the opposite of what
I expected. With one MPI task per socket and 8 threads per task (i.e.
one thread per core) I would expect each MPI task to see 32 GB of
memory, whereas a pure MPI calculation with one MPI task per core
should have "only" 4 GB available per task. So it is not clear to me,
when comparing calculations with the same number of cores, why the
pure MPI calculation manages to fit all the ERIs in-core while the
mixed one does not (and, in fact, the higher the number of threads,
the smaller the available memory seems to be).
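
To spell out the arithmetic I have in mind (illustrative numbers
only, assuming the full node memory were available to CP2K):

! node: 4 sockets x 8 cores = 32 cores, 128 GB ~ 128000 MB
! pure MPI (1 task per core):   128000 MB / 32 tasks ~  4000 MB per MPI task
! hybrid   (1 task per socket): 128000 MB /  4 tasks ~ 32000 MB per MPI task,
!                               shared by that task's 8 threads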

Here are the results of the tests:

VERSION: popt
#cores walltot 1st    last
--------------------------
64      710    560.0  10.5
128     410    295.7   7.7
256     239    159.5   5.1
512     317    107.8  11.3

VERSION: psmp (4 threads)
#cores walltot 1st     last
---------------------------
64     1744    536.7  86.0  --> not all in-core
128     356    273.3   5.6
256     223    155.3   3.7
512     149     89.6   2.9
1024    140     61.5   3.6

VERSION: psmp (8 threads)
#cores walltot 1st    last
--------------------------
64    ~3600   530.0 217.9   --> not all in-core
128     692   274.9  29.5   --> not all in-core
256     208   150.0   3.6
512     145    89.4   2.8


and here is how I compiled the code (the mixed OpenMP-MPI version):

CC       = mpicc
CPP      =
FC       = mpif90
LD       = mpif90
AR       = ar -r
INTEL_LIB=/usr/local/Intel_compilers/c/composer_xe_2011_sp1.6.233/mkl/lib/intel64/
LIBINT_DIR=/ccc/cont005/home/pa0315/piccinis/LIBS/LIBINT_1.1.4/lib/
WRAPPER_DIR=/ccc/cont005/home/pa0315/piccinis/cp2k/tools/hfx_tools/libint_tools
CPPFLAGS =
DFLAGS   = -D__GFORTRAN -D__FFTSG -D__parallel -D__BLACS -D__SCALAPACK -D__LIBINT
FCFLAGS  = -O2 -fopenmp -ffast-math -funroll-loops -ftree-vectorize -march=native -ffree-form $(DFLAGS)
LDFLAGS  = $(FCFLAGS)
LIBS    = -L$(INTEL_LIB) -Wl,-rpath,$(INTEL_LIB) \
          -lmkl_scalapack_lp64     \
          -lmkl_blacs_openmpi_lp64 \
          -lmkl_intel_lp64         \
          -lmkl_sequential         \
          -lmkl_core \
           $(WRAPPER_DIR)/libint_cpp_wrapper.o \
           $(LIBINT_DIR)/libderiv.a \
           $(LIBINT_DIR)/libint.a \
           -lstdc++

OBJECTS_ARCHITECTURE = machine_gfortran.o


Any help in understanding the performance of the OpenMP-MPI
parallelization scheme would be much appreciated.

Best wishes,
Simone Piccinin
CNR-IOM, Trieste (Italy)




