[CP2K-user] [CP2K:11823] Re: CP2K performance on GPUs

Tiziano Müller tiziano... at chem.uzh.ch
Fri May 31 08:24:44 UTC 2019


Hi Chris,

An arch file for CP2K with P100 GPUs is included in the
regtester output from Piz Daint here:

https://www.cp2k.org/static/regtest/trunk/cscs-daint-xc50_gpu/CRAY_XC50-gfortran_gpu.psmp.out

Those outputs are usually available from here:

  https://dashboard.cp2k.org/


(click the link in the Status column)

Best regards,
Tiziano


On 30.05.19 at 10:15, CNelson wrote:
> Hi both,
> would it be possible to get a copy of the arch file you used to build
> CP2K with the new V100 GPUs?
> Cheers,
> Chris.
> 
> On Sunday, 4 November 2018 21:06:12 UTC, Alfio Lazzaro wrote:
> 
>     OK, the best way is to attach the arch file, the input file,
>     and the output that you got from CP2K.
>     The only GPU-accelerated part of CP2K is DBCSR, but it may be that
>     you are bound by something else.
> 
>     I agree with you that the reoptimization is not that important at
>     this stage...
> 
>     Alfio
> 
> 
>     On Sunday, 4 November 2018 19:13:02 UTC+1, fo... at gmail.com
>     wrote:
> 
>         Thanks Alfio for the response.
> 
>         Yes, 8 V100 GPUs is extreme. The test I used takes around
>         500 seconds on a system with a dual-socket Intel Skylake Gold 6148
>         (40 cores, 20 cores/socket). Do you think this test is not large
>         enough to run on GPUs? If so, can you recommend a test from the
>         CP2K tests folder?
> 
>         I also tried runs with 1 and 2 V100 GPUs. They were slower
>         than the run with 8 V100 GPUs.
> 
>         CP2K was able to recognize all 8 GPUs, according to "DBCSR| ACC:
>         Number of devices/node".
> 
>         I tried reoptimizing the kernels for the V100, but could not
>         determine which block size values have to be passed to the
>         tune.py script.
> 
>         Although CP2K-6.1 already has optimized kernel parameters for the
>         P100, even a 2xP100 run was slower than the CPU-only benchmark.
> 
>         On Sunday, November 4, 2018 at 2:33:11 PM UTC+5:30, Alfio
>         Lazzaro wrote:
> 
>             You may take a look at this issue on
>             GitHub: https://github.com/cp2k/cp2k/issues/73
> 
>             In your particular case, your setup of 8 V100 is pretty
>             extreme and it would require a large computation. Which test
>             are you using for benchmarking?
> 
>             Then, your setup of 8 ranks + 5 threads should be OK. CP2K
>             assigns ranks to GPUs in a round-robin manner, so in your
>             case there is one rank talking to each GPU.
>             We don't have much experience with multi-GPU nodes, so I
>             would suggest a scalability test: run 1 rank, 2 ranks, ...
>             8 ranks (always 5 threads) to check how the performance
>             scales. BTW, make sure CP2K is able to recognize all 8 GPUs
>             by checking the following output at the beginning:
> 
>              DBCSR| ACC: Number of devices/node                         1
> 
>             You might also consider reoptimizing the kernels for
>             the V100, but this is not a priority...
> 
>             Alfio
> 
> 
> 
>             On Saturday, 3 November 2018 07:55:09 UTC+1,
>             fo... at gmail.com wrote:
> 
>                 Hi,
> 
>                 How is the CP2K performance on GPUs in general?
> 
>                 I'm getting very low performance on GPUs (Nvidia V100
>                 SXM2). It is a single-node benchmark with 8 GPUs and
>                 two Intel Skylake Gold 6148 processors.
> 
>                 The CP2K time on 8 GPUs (CP2K-6.1 psmp version,
>                 ifort-2017, CUDA-9.2, 8 MPI ranks + 5 threads per rank)
>                 is still slower than the CP2K time of the CPU-only
>                 benchmark.
> 
>                 For CPU runs, CP2K-6.1 is built with LIBXSMM-1.8.3.
> 
>                 For GPU runs, I have tried both with and without LIBXSMM.
>                 There is no performance difference, and both builds are
>                 still slower than the CPU-only benchmark, even when
>                 using all 8 GPUs and all 40 CPU cores. Can someone
>                 please share their experience with CP2K performance on
>                 GPUs?
> 
>                 The CUDA specific DFLAGS used are: -D__ACC -D__DBCSR_ACC
>                 -D__PW_CUDA.
> 

-- 
Tiziano Müller
University of Zurich
Department of Chemistry
Winterthurerstrasse 190
CH-8057 Zürich

Tel: +41 44 63 54234
www.chem.uzh.ch
tiziano... at chem.uzh.ch