mixed openMP-MPI for HF exchange
Simone Piccinin
piccini... at gmail.com
Wed Nov 9 13:26:03 UTC 2011
Dear Flo, dear Juerg,
thanks! Indeed, the problem was that MAX_MEMORY was too small. I had
set it to 2500 MB for the pure MPI run and did not increase it when
using the OpenMP scheme. Setting it to a larger value solves the
problem.
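In hindsight the arithmetic is simple (the numbers are only meant as
an illustration): with MAX_MEMORY 2500, a pure MPI run has 2500 MB per
core available for integral storage, whereas a psmp run with 8 threads
per MPI task still has only 2500 MB per task, i.e. roughly
2500/8 ≈ 312 MB per core. To keep the same in-core capacity,
MAX_MEMORY therefore has to be scaled up by roughly the number of
threads per task.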
Regarding the performance comparison of pure MPI vs. OpenMP-MPI (when
everything fits in-core in both cases), I see a considerable
improvement in scalability for larger systems. This is what I get for
a system with 120 atoms, 430 electrons, and 1766 basis functions
(#cores, total wall time, and time of the first and last SCF
iteration, all in seconds):
VERSION: popt
cores walltot 1st last
--------------------------
256 2092 1749.8 23.6
512 1119 887.8 14.1
1024 712 514.6 7.7
2048 1078 338.2 32.8
4096 1844 504.0 65.3
VERSION: psmp (4 threads)
cores walltot 1st last
--------------------------
512 1051 834.8 13.7
1024 630 443.0 11.6
2048 402 246.3 6.1
4096 429 147.3 13.0
VERSION: psmp (8 threads)
cores walltot 1st last
--------------------------
512 1022 825.6 12.9
1024 625 427.7 12.0
2048 430 235.5 11.5
4096 300 148.1 5.1
Thanks again.
Simone
On Nov 9, 12:33 pm, hut... at pci.uzh.ch wrote:
> Hi
>
> Did you correctly use the MEMORY keyword? CP2K will use only
> this information when deciding on the number of integrals kept
> in core. From the manual:
>
> Defines the maximum amount of memory [MB] to be consumed by the full HFX module. All temporary buffers and helper arrays are subtracted from this number. What remains will be used for storage of integrals. NOTE: This number is assumed to represent the memory available to one MPI process. When running a threaded version, cp2k automatically takes care of distributing the memory among all involved sub-processes.
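>
> In the input file this is the MAX_MEMORY keyword in the &MEMORY
> subsection of &HF (under &FORCE_EVAL / &DFT / &XC); a minimal sketch
> of where it sits (the value is just a placeholder):
>
> &XC
>   &HF
>     &MEMORY
>       ! total memory in MB for the HFX module of one MPI process,
>       ! shared among all OpenMP threads of that process
>       MAX_MEMORY 2500
>     &END MEMORY
>   &END HF
> &END XC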
>
> regards
>
> Juerg Hutter
>
> --------------------------------------------------------------
> Juerg Hutter Phone : ++41 44 635 4491
> Physical Chemistry Institute FAX : ++41 44 635 6838
> University of Zurich E-mail: hut... at pci.uzh.ch
> Winterthurerstrasse 190
> CH-8057 Zurich, Switzerland
> ---------------------------------------------------------------
>
> -----cp... at googlegroups.com wrote: -----
>
> To: cp2k <cp... at googlegroups.com>
> From: Simone Piccinin
> Sent by: cp... at googlegroups.com
> Date: 11/09/2011 11:43AM
> Subject: [CP2K:3594] mixed openMP-MPI for HF exchange
>
> Dear CP2K users,
>
> I am performing some tests on a multi-core machine (each node has 4
> sockets with 8 cores each and 128 GB of memory). I am doing hybrid-
> functional calculations on a system with ~30 atoms, ~100 electrons,
> and ~700 basis functions, and I'd like to exploit the mixed
> OpenMP-MPI parallelization scheme. As I change the number of threads
> and MPI tasks in my tests, I check the timings and whether all the
> ERI are stored in-core or not.
>
> Below I report the results of the tests (#cores, time in seconds for
> the whole calculation, the first SCF iteration, and the last SCF
> iteration) for a pure MPI run (one MPI task per core), a mixed
> OpenMP-MPI run using 4 threads per MPI task, and a mixed OpenMP-MPI
> run using 8 threads per MPI task.
>
> I find that for a given number of cores, the higher the number of
> threads, the smaller the number of ERI stored in-core, and therefore
> the performance degrades. On the other hand, when the mixed
> OpenMP-MPI calculation manages to keep all the ERI in-core (i.e.
> when the number of cores is sufficiently large), the performance is
> better than the corresponding pure MPI run with the same number of
> cores.
>
> Is this the expected behavior? It is pretty much the opposite of
> what I expected. With one MPI task per socket and 8 threads (i.e.
> one thread per core) I would expect the MPI task to see 32 GB of
> memory, whereas a pure MPI calculation with one MPI task per core
> should have "only" 4 GB of memory available. So it is not clear to
> me, when comparing calculations with the same number of cores, why
> the pure MPI calculation manages to fit all the ERI in-core while
> the mixed one does not (and it actually seems that the higher the
> number of threads, the smaller the available memory).
>
> Here are the results of the tests:
>
> VERSION: popt
> #cores walltot 1st last
> --------------------------
> 64 710 560.0 10.5
> 128 410 295.7 7.7
> 256 239 159.5 5.1
> 512 317 107.8 11.3
>
> VERSION: psmp (4 threads)
> #cores walltot 1st last
> ---------------------------
> 64 1744 536.7 86.0 --> not all in-core
> 128 356 273.3 5.6
> 256 223 155.3 3.7
> 512 149 89.6 2.9
> 1024 140 61.5 3.6
>
> VERSION: psmp (8 threads)
> #cores walltot 1st last
> --------------------------
> 64 ~3600 530.0 217.9 --> not all in-core
> 128 692 274.9 29.5 --> not all in-core
> 256 208 150.0 3.6
> 512 145 89.4 2.8
>
> and here's how I compiled the code (the mixed OpenMP-MPI version):
>
> CC = mpicc
> CPP =
> FC = mpif90
> LD = mpif90
> AR = ar -r
> INTEL_LIB=/usr/local/Intel_compilers/c/composer_xe_2011_sp1.6.233/mkl/lib/intel64/
> LIBINT_DIR=/ccc/cont005/home/pa0315/piccinis/LIBS/LIBINT_1.1.4/lib/
> WRAPPER_DIR=/ccc/cont005/home/pa0315/piccinis/cp2k/tools/hfx_tools/libint_tools
> CPPFLAGS =
> DFLAGS = -D__GFORTRAN -D__FFTSG -D__parallel -D__BLACS -D__SCALAPACK -D__LIBINT
> FCFLAGS = -O2 -fopenmp -ffast-math -funroll-loops -ftree-vectorize -march=native -ffree-form $(DFLAGS)
> LDFLAGS = $(FCFLAGS)
> LIBS = -L$(INTEL_LIB) -Wl,-rpath,$(INTEL_LIB) \
> -lmkl_scalapack_lp64 \
> -lmkl_blacs_openmpi_lp64 \
> -lmkl_intel_lp64 \
> -lmkl_sequential \
> -lmkl_core \
> $(WRAPPER_DIR)/libint_cpp_wrapper.o \
> $(LIBINT_DIR)/libderiv.a \
> $(LIBINT_DIR)/libint.a \
> -lstdc++
>
> OBJECTS_ARCHITECTURE = machine_gfortran.o
>
> Any help in understanding the performance of the OpenMP-MPI
> parallelization scheme would be much appreciated.
>
> Best wishes,
> Simone Piccinin
> CNR-IOM, Trieste (Italy)
>