[CP2K-user] CP2K performance on GPUs

CNelson chri... at gmail.com
Thu May 30 08:15:40 UTC 2019

Hi Both,
Would it be possible to get a copy of the arch file you used to build CP2K 
with the new V100 GPUs?

On Sunday, 4 November 2018 21:06:12 UTC, Alfio Lazzaro wrote:

> OK, the best way is if you can attach the arch file, the input file, and 
> the output that you got from CP2K.
> The only GPU-accelerated part in CP2K is DBCSR, but it may be that you are 
> bound by something else.
> I agree with you that the reoptimization is not that important at this 
> stage...
> Alfio
> On Sunday, 4 November 2018 at 19:13:02 UTC+1, fo... at gmail.com 
> wrote:
>> Thanks, Alfio, for the response.
>> Yes, 8 V100 GPUs is extreme. The test I had used takes around 500 seconds 
>> on a system with an Intel SKL Gold 6148 with 40 cores (20 cores/socket). 
>> Do you think this test is not large enough to run on GPUs? If so, can you 
>> recommend a test from the CP2K tests folder?
>> I had also tried runs with 1 and 2 V100 GPUs. Their performance was 
>> slower than the 8 V100 GPU run. 
>> CP2K was able to recognize all 8 GPUs, as per "DBCSR| ACC: Number of 
>> devices/node".
>> I had tried reoptimizing the kernels for the V100, but I could not 
>> determine what block-size values have to be passed to the tune.py script.
>> Although CP2K-6.1 already has optimized kernel parameters for the P100, 
>> even a 2xP100 run was slower than the CPU-only benchmark.
>> On Sunday, November 4, 2018 at 2:33:11 PM UTC+5:30, Alfio Lazzaro wrote:
>>> You may take a look at this issue on GitHub: 
>>> https://github.com/cp2k/cp2k/issues/73
>>> In your particular case, your setup of 8 V100s is pretty extreme and 
>>> would require a large computation. Which test are you using for 
>>> benchmarking?
>>> Then, your setup of 8 ranks + 5 threads should be OK. CP2K attaches 
>>> ranks to GPUs in a round-robin manner, so in your case there is one 
>>> rank talking to each GPU.
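[The round-robin assignment described above can be illustrated with a small sketch; this is an illustration of the mapping only, not CP2K's actual code:]

```python
# Illustration of round-robin rank-to-GPU assignment (not CP2K's code):
# rank r on a node with n visible devices talks to device r mod n.

def assign_device(rank: int, ndevices: int) -> int:
    """Return the device index a given MPI rank is attached to."""
    return rank % ndevices

# 8 ranks on a node with 8 GPUs: exactly one rank per GPU.
mapping = [assign_device(rank, 8) for rank in range(8)]
print(mapping)  # → [0, 1, 2, 3, 4, 5, 6, 7]
```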
>>> We don't have much experience with multi-GPU nodes, hence I would 
>>> suggest doing a scalability test by running 1 rank, 2 ranks, ..., 8 ranks 
>>> (always 5 threads) to check how the performance scales. BTW, make sure 
>>> CP2K is able to recognize the 8 GPUs by checking the following output at 
>>> the beginning:
>>>  DBCSR| ACC: Number of devices/node                                          1
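[The rank-scaling scan suggested above can be sketched as a small driver script; `cp2k.psmp` and `benchmark.inp` are placeholder names, and the `echo` makes it a dry run — remove it to actually launch the jobs:]

```shell
# Sketch: scan the MPI rank count at a fixed 5 OpenMP threads per rank
# and keep one output file per run for comparison (dry run via echo).
export OMP_NUM_THREADS=5
for nranks in 1 2 4 8; do
  echo "mpirun -np ${nranks} cp2k.psmp -i benchmark.inp -o out_${nranks}ranks.log"
done
```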
>>> Eventually, you might consider reoptimizing the kernels for the V100, 
>>> but this is not a priority...
>>> Alfio
>>> On Saturday, 3 November 2018 at 07:55:09 UTC+1, fo... at gmail.com 
>>> wrote:
>>>> Hi,
>>>> How is CP2K's performance on GPUs in general?
>>>> I'm getting very low performance on GPUs (NVIDIA V100 SXM2). It is a 
>>>> single-node benchmark with 8 GPUs and dual Intel Skylake Gold 6148 
>>>> processors. 
>>>> The CP2K time on 8 GPUs (CP2K-6.1 psmp version, ifort-2017, CUDA-9.2, 
>>>> 8 MPI ranks + 5 threads per rank) is still slower than the CP2K time of 
>>>> the CPU-only benchmark.
>>>> For CPU runs, CP2K-6.1 is built with LIBXSMM-1.8.3.
>>>> For GPU runs, I have tried both with and without LIBXSMM; there is no 
>>>> performance difference, and both are still slower than the CPU-only 
>>>> benchmark even when using all 8 GPUs and all 40 CPU cores. Can someone 
>>>> please share their experience with CP2K performance on GPUs?
>>>> The CUDA-specific DFLAGS used are: -D__ACC -D__DBCSR_ACC -D__PW_CUDA.
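[For reference, a CUDA-enabled arch-file fragment carrying these DFLAGS might look roughly like the following. This is a hedged sketch for illustration: the compiler settings, the `GPUVER` value, and the library list are assumptions, not the poster's actual file.]

```makefile
# Hypothetical fragment of a CP2K 6.1 CUDA arch file (illustrative only).
NVCC    = nvcc
GPUVER  = V100                       # selects the tuned GPU kernel parameter set
DFLAGS += -D__ACC -D__DBCSR_ACC -D__PW_CUDA
NVFLAGS = $(DFLAGS) -O3 -arch sm_70  # sm_70 = Volta/V100
LIBS   += -lcudart -lcublas -lcufft  # cuBLAS/cuFFT are needed for -D__PW_CUDA
```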

More information about the CP2K-user mailing list