mixed openMP-MPI for HF exchange
Simone Piccinin
piccini... at gmail.com
Wed Nov 9 13:26:03 UTC 2011
Dear Flo, dear Juerg,
thanks! Indeed, the problem was that MAX_MEMORY was too small. I had
set it to 2500 MB for the pure MPI run and did not increase it when
using the OpenMP scheme. Setting it to a larger value solves the
problem.
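In hindsight the arithmetic is simple (the numbers are only meant as
an illustration): with MAX_MEMORY 2500, a pure MPI run has 2500 MB per
core available for integral storage, whereas a psmp run with 8 threads
per MPI task still has only 2500 MB per task, i.e. roughly
2500/8 ≈ 312 MB per core. To keep the same in-core capacity,
MAX_MEMORY therefore has to be scaled up by roughly the number of
threads per task.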
Regarding the performance comparison of pure MPI vs. OpenMP-MPI (when
everything fits in-core in both cases), I see a considerable
improvement in scalability for larger systems. This is what I get for
a system with 120 atoms, 430 electrons, and 1766 basis functions
(#cores, total wall time, and time of the first and last SCF
iteration, all in seconds):
VERSION: popt
cores walltot 1st last
--------------------------
256 2092 1749.8 23.6
512 1119 887.8 14.1
1024 712 514.6 7.7
2048 1078 338.2 32.8
4096 1844 504.0 65.3
VERSION: psmp (4 threads)
cores walltot 1st last
--------------------------
512 1051 834.8 13.7
1024 630 443.0 11.6
2048 402 246.3 6.1
4096 429 147.3 13.0
VERSION: psmp (8 threads)
cores walltot 1st last
--------------------------
512 1022 825.6 12.9
1024 625 427.7 12.0
2048 430 235.5 11.5
4096 300 148.1 5.1
Thanks again.
Simone
On Nov 9, 12:33 pm, hut... at pci.uzh.ch wrote:
> Hi
>
> Did you correctly use the MEMORY keyword? CP2K will use only
> this information when deciding on the number of integrals kept
> in core. From the manual:
>
> Defines the maximum amount of memory [MB] to be consumed by the full HFX module. All temporary buffers and helper arrays are subtracted from this number. What remains will be used for storage of integrals. NOTE: This number is assumed to represent the memory available to one MPI process. When running a threaded version, cp2k automatically takes care of distributing the memory among all involved sub-processes.
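>
> In the input file this is the MAX_MEMORY keyword in the &MEMORY
> subsection of &HF (under &FORCE_EVAL / &DFT / &XC); a minimal sketch
> of where it sits (the value is just a placeholder):
>
> &XC
>   &HF
>     &MEMORY
>       ! total memory in MB for the HFX module of one MPI process,
>       ! shared among all OpenMP threads of that process
>       MAX_MEMORY 2500
>     &END MEMORY
>   &END HF
> &END XC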
>
> regards
>
> Juerg Hutter
>
> --------------------------------------------------------------
> Juerg Hutter Phone : ++41 44 635 4491
> Physical Chemistry Institute FAX : ++41 44 635 6838
> University of Zurich E-mail: hut... at pci.uzh.ch
> Winterthurerstrasse 190
> CH-8057 Zurich, Switzerland
> ---------------------------------------------------------------
>
> -----cp... at googlegroups.com wrote: -----
>
> To: cp2k <cp... at googlegroups.com>
> From: Simone Piccinin
> Sent by: cp... at googlegroups.com
> Date: 11/09/2011 11:43AM
> Subject: [CP2K:3594] mixed openMP-MPI for HF exchange
>
> Dear CP2K users,
>
> I am performing some tests on a multi-core machine (each node has 4
> sockets with 8 cores each and 128 GB of memory). I am doing hybrid-
> functional calculations on a system with ~30 atoms, ~100 electrons,
> and ~700 basis functions, and I'd like to exploit the mixed
> OpenMP-MPI parallelization scheme. As I change the number of threads
> and MPI tasks in my tests, I check the timings and whether all the
> ERI are stored in-core or not.
>
> Below I report the results of the tests (#cores, time in seconds for
> the whole calculation, the first SCF iteration, and the last SCF
> iteration) for a pure MPI run (one MPI task per core), a mixed
> OpenMP-MPI run using 4 threads per MPI task, and a mixed OpenMP-MPI
> run using 8 threads per MPI task.
>
> I find that for a given number of cores, the higher the number of
> threads, the smaller the number of ERI stored in-core, and therefore
> the performance degrades. On the other hand, when the mixed
> OpenMP-MPI calculation manages to keep all the ERI in-core (i.e.
> when the number of cores is sufficiently large), the performance is
> better than the corresponding pure MPI run with the same number of
> cores.
>
> Is this the expected behavior? It is pretty much the opposite of
> what I expected. With one MPI task per socket and 8 threads (i.e.
> one thread per core) I would expect the MPI task to see 32 GB of
> memory, whereas a pure MPI calculation with one MPI task per core
> should have "only" 4 GB of memory available. So it is not clear to
> me, when comparing calculations with the same number of cores, why
> the pure MPI calculation manages to fit all the ERI in-core while
> the mixed one does not (and it actually seems that the higher the
> number of threads, the smaller the available memory).
>
> Here are the results of the tests:
>
> VERSION: popt
> #cores walltot 1st last
> --------------------------
> 64 710 560.0 10.5
> 128 410 295.7 7.7
> 256 239 159.5 5.1
> 512 317 107.8 11.3
>
> VERSION: psmp (4 threads)
> #cores walltot 1st last
> ---------------------------
> 64 1744 536.7 86.0 --> not all in-core
> 128 356 273.3 5.6
> 256 223 155.3 3.7
> 512 149 89.6 2.9
> 1024 140 61.5 3.6
>
> VERSION: psmp (8 threads)
> #cores walltot 1st last
> --------------------------
> 64 ~3600 530.0 217.9 --> not all in-core
> 128 692 274.9 29.5 --> not all in-core
> 256 208 150.0 3.6
> 512 145 89.4 2.8
>
> and here's how I compiled the code (the mixed OpenMP-MPI version):
>
> CC = mpicc
> CPP =
> FC = mpif90
> LD = mpif90
> AR = ar -r
> INTEL_LIB=/usr/local/Intel_compilers/c/composer_xe_2011_sp1.6.233/mkl/lib/intel64/
> LIBINT_DIR=/ccc/cont005/home/pa0315/piccinis/LIBS/LIBINT_1.1.4/lib/
> WRAPPER_DIR=/ccc/cont005/home/pa0315/piccinis/cp2k/tools/hfx_tools/libint_tools
> CPPFLAGS =
> DFLAGS = -D__GFORTRAN -D__FFTSG -D__parallel -D__BLACS -D__SCALAPACK -D__LIBINT
> FCFLAGS = -O2 -fopenmp -ffast-math -funroll-loops -ftree-vectorize -march=native -ffree-form $(DFLAGS)
> LDFLAGS = $(FCFLAGS)
> LIBS = -L$(INTEL_LIB) -Wl,-rpath,$(INTEL_LIB) \
> -lmkl_scalapack_lp64 \
> -lmkl_blacs_openmpi_lp64 \
> -lmkl_intel_lp64 \
> -lmkl_sequential \
> -lmkl_core \
> $(WRAPPER_DIR)/libint_cpp_wrapper.o \
> $(LIBINT_DIR)/libderiv.a \
> $(LIBINT_DIR)/libint.a \
> -lstdc++
>
> OBJECTS_ARCHITECTURE = machine_gfortran.o
>
> Any help in understanding the performance of the OpenMP-MPI
> parallelization scheme would be much appreciated.
>
> Best wishes,
> Simone Piccinin
> CNR-IOM, Trieste (Italy)
>