[CP2K:3594] mixed openMP-MPI for HF exchange
hut... at pci.uzh.ch
Wed Nov 9 11:33:27 UTC 2011
Hi
Did you set the MEMORY keyword correctly? CP2K uses only this
information when deciding how many integrals are kept in core. From
the manual:
Defines the maximum amount of memory [MB] to be consumed by the full HFX module. All temporary buffers and helper arrays are subtracted from this number. What remains will be used for storage of integrals. NOTE: This number is assumed to represent the memory available to one MPI process. When running a threaded version, cp2k automatically takes care of distributing the memory among all involved sub-processes.
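For reference, a minimal sketch of where this limit lives in the input
(the standard &DFT/&XC/&HF/&MEMORY block); the 3000 MB value is only an
illustrative number, not something taken from this thread:

&FORCE_EVAL
  &DFT
    &XC
      &HF
        &MEMORY
          # per-MPI-process limit in MB; in a threaded (psmp) run this
          # amount is shared among the threads of that rank
          MAX_MEMORY 3000
        &END MEMORY
      &END HF
    &END XC
  &END DFT
&END FORCE_EVAL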
regards
Juerg Hutter
--------------------------------------------------------------
Juerg Hutter Phone : ++41 44 635 4491
Physical Chemistry Institute FAX : ++41 44 635 6838
University of Zurich E-mail: hut... at pci.uzh.ch
Winterthurerstrasse 190
CH-8057 Zurich, Switzerland
---------------------------------------------------------------
-----cp... at googlegroups.com wrote: -----
To: cp2k <cp... at googlegroups.com>
From: Simone Piccinin
Sent by: cp... at googlegroups.com
Date: 11/09/2011 11:43AM
Subject: [CP2K:3594] mixed openMP-MPI for HF exchange
Dear CP2K users,
I am performing some tests on a multi-core machine (each node has 4
sockets with 8 cores each, and 128 GB of memory). I am doing
hybrid-functional calculations on a system with ~30 atoms, ~100
electrons and ~700 basis functions, and I'd like to exploit the mixed
OpenMP-MPI parallelization scheme. As I change the number of threads
and MPI tasks in my tests, I check the timings and whether all the
ERIs are stored in-core or not.
Below I report the results of the tests (#cores, and the time in
seconds of the whole calculation, the 1st SCF iteration and the last
SCF iteration) for a pure MPI run (one MPI task per core), a mixed
OpenMP-MPI run using 4 threads per MPI task, and a mixed OpenMP-MPI
run using 8 threads per task.
I find that for a given number of cores, the higher the number of
threads, the smaller the number of ERIs stored in core, and therefore
the performance degrades. On the other hand, when the mixed
OpenMP-MPI calculation manages to keep all the ERIs in-core (i.e. when
the number of cores is sufficiently large), the performance is better
than the corresponding pure MPI run with the same number of cores.
Is this the expected behavior? It is pretty much the opposite of what
I expected. In the case of one MPI task per socket and 8 threads
(i.e. one thread per core) I would expect each MPI task to see 32 GB
of memory, whereas a pure MPI calculation with one MPI task per core
should have "only" 4 GB of memory available per task. So it is not
clear to me, when comparing calculations with the same number of
cores, why the pure MPI calculation manages to fit all the ERIs
in-core while the mixed one does not (and it actually seems that the
higher the number of threads, the smaller the available memory).
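A possible reading, given the MAX_MEMORY behaviour quoted above (a
sketch with purely illustrative numbers, not values from these runs):
the in-core buffer is MAX_MEMORY per MPI rank, so its total size
scales with the number of ranks rather than with the node memory or
the core count, e.g.:

  # Illustrative arithmetic only, assuming the hypothetical
  # MAX_MEMORY 3000 MB per rank from the sketch above:
  #   popt, 256 cores = 256 ranks x 1 thread  -> 256 x 3000 MB ~ 750 GB in-core
  #   psmp, 256 cores =  32 ranks x 8 threads ->  32 x 3000 MB ~  94 GB in-core
  # Matching the pure-MPI total with 8 threads per rank would mean
  # raising the per-rank limit roughly 8x, e.g.
  &MEMORY
    MAX_MEMORY 24000   # MB per MPI rank, shared among its 8 threads
  &END MEMORY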
Here are the results of the tests:
VERSION: popt (pure MPI)
#cores  walltot(s)  1st SCF(s)  last SCF(s)
-------------------------------------------
    64     710        560.0       10.5
   128     410        295.7        7.7
   256     239        159.5        5.1
   512     317        107.8       11.3

VERSION: psmp (4 threads per MPI task)
#cores  walltot(s)  1st SCF(s)  last SCF(s)
-------------------------------------------
    64    1744        536.7       86.0   --> not all in-core
   128     356        273.3        5.6
   256     223        155.3        3.7
   512     149         89.6        2.9
  1024     140         61.5        3.6

VERSION: psmp (8 threads per MPI task)
#cores  walltot(s)  1st SCF(s)  last SCF(s)
-------------------------------------------
    64   ~3600        530.0      217.9   --> not all in-core
   128     692        274.9       29.5   --> not all in-core
   256     208        150.0        3.6
   512     145         89.4        2.8
and here's how I compiled the code (the mixed OpenMP-MPI version):
CC = mpicc
CPP =
FC = mpif90
LD = mpif90
AR = ar -r
INTEL_LIB=/usr/local/Intel_compilers/c/composer_xe_2011_sp1.6.233/mkl/lib/intel64/
LIBINT_DIR=/ccc/cont005/home/pa0315/piccinis/LIBS/LIBINT_1.1.4/lib/
WRAPPER_DIR=/ccc/cont005/home/pa0315/piccinis/cp2k/tools/hfx_tools/libint_tools
CPPFLAGS =
DFLAGS = -D__GFORTRAN -D__FFTSG -D__parallel -D__BLACS -D__SCALAPACK -D__LIBINT
FCFLAGS = -O2 -fopenmp -ffast-math -funroll-loops -ftree-vectorize -march=native -ffree-form $(DFLAGS)
LDFLAGS = $(FCFLAGS)
LIBS = -L$(INTEL_LIB) -Wl,-rpath,$(INTEL_LIB) \
-lmkl_scalapack_lp64 \
-lmkl_blacs_openmpi_lp64 \
-lmkl_intel_lp64 \
-lmkl_sequential \
-lmkl_core \
$(WRAPPER_DIR)/libint_cpp_wrapper.o \
$(LIBINT_DIR)/libderiv.a \
$(LIBINT_DIR)/libint.a \
-lstdc++
OBJECTS_ARCHITECTURE = machine_gfortran.o
Any help in understanding the performance of the OpenMP-MPI
parallelization scheme would be much appreciated.
Best wishes,
Simone Piccinin
CNR-IOM, Trieste (Italy)