[CP2K-user] CP2K performance on GPUs

Alfio Lazzaro alfio.... at gmail.com
Sun Nov 4 21:06:12 UTC 2018


OK, the best way forward is for you to attach the arch file, the input file,
and the output that you got from CP2K.
The only GPU-accelerated part of CP2K is DBCSR, but it may be that you are
bound by something else.
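
To make the "bound by something else" point concrete: if only part of the run
time is spent in the accelerated code, Amdahl's law caps the overall gain no
matter how fast the GPUs are. A minimal Python sketch follows. The DBCSR
fractions and the GPU speedup factor are hypothetical illustration values, not
measurements from this benchmark, and the function name is made up for the
example.

```python
def overall_speedup(accel_fraction, accel_speedup):
    """Amdahl's law: overall speedup when only part of the run is accelerated.

    accel_fraction: share of the original run time spent in the accelerated
                    part (here: a hypothetical DBCSR share), between 0 and 1.
    accel_speedup:  factor by which that part alone gets faster.
    """
    if not 0.0 <= accel_fraction <= 1.0:
        raise ValueError("accel_fraction must be in [0, 1]")
    if accel_speedup <= 0.0:
        raise ValueError("accel_speedup must be positive")
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / accel_speedup)


if __name__ == "__main__":
    # Hypothetical DBCSR shares of the total run time, with an assumed 10x
    # speedup of the DBCSR part itself.
    for fraction in (0.2, 0.5, 0.9):
        gain = overall_speedup(fraction, 10.0)
        print(f"DBCSR share {fraction:.0%}: overall speedup {gain:.2f}x")
```

The timing report at the end of the CP2K output is the place to look for the
actual share of time spent in DBCSR.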

I agree with you that the reoptimization is not that important at this 
stage...

Alfio
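
For reference, the round-robin rank-to-GPU assignment described in the earlier
message quoted below can be sketched in Python. This is only an illustration
that assumes a plain modulo rule (rank modulo device count); it is not taken
from the CP2K sources, and both function names are made up for the example.
The second helper reads the device count from the "DBCSR| ACC: Number of
devices/node" line of a CP2K output.

```python
import re


def round_robin_device(rank, n_devices):
    """Return the GPU index for an MPI rank under an assumed modulo rule."""
    if n_devices < 1:
        raise ValueError("n_devices must be at least 1")
    return rank % n_devices


def devices_per_node(output_text):
    """Extract the value of 'DBCSR| ACC: Number of devices/node'.

    Returns None if the line is not present. The pattern tolerates the value
    being wrapped onto the next line, as happens in mail archives.
    """
    match = re.search(
        r"DBCSR\|\s*ACC:\s*Number of devices/node\s+(\d+)", output_text
    )
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # 8 ranks on 8 GPUs: each rank gets its own device.
    print([round_robin_device(rank, 8) for rank in range(8)])
    # 8 ranks on 2 GPUs: four ranks share each device.
    print([round_robin_device(rank, 2) for rank in range(8)])
```

With 8 ranks and 8 devices every rank maps to a distinct GPU; with fewer
devices than ranks, several ranks share a GPU under this rule.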


On Sunday, 4 November 2018 at 19:13:02 UTC+1, for... at gmail.com wrote:
>
> Thanks Alfio for the response.
>
> Yes, 8 V100 GPUs is extreme. The test I used takes around 500 seconds on a
> system with an Intel SKL G-6148, 40 cores (20 cores/socket). Do you think
> this test is not large enough to run on GPUs? If so, can you recommend a
> test from the CP2K tests folder?
>
> I also tried runs with 1 and 2 V100 GPUs. Both were slower than the
> 8 V100 GPU run.
>
> CP2K was able to recognize all 8 GPUs, as reported by "DBCSR| ACC: Number of
> devices/node".
>
> I tried reoptimizing the kernels for the V100, but could not determine
> which block size values have to be passed to the tune.py script.
>
> CP2K-6.1 already has optimized kernel parameters for the P100, yet even a
> 2xP100 GPU run was slower than the CPU-only benchmark.
>
> On Sunday, November 4, 2018 at 2:33:11 PM UTC+5:30, Alfio Lazzaro wrote:
>>
>> You may want to take a look at this issue on GitHub:
>> https://github.com/cp2k/cp2k/issues/73
>>
>> In your particular case, a setup with 8 V100s is pretty extreme and would
>> require a large computation to keep busy. Which test are you using for
>> benchmarking?
>>
>> Then, your setup of 8 ranks + 5 threads should be OK. CP2K assigns ranks
>> to GPUs in a round-robin manner, so in your case there is one rank
>> talking to each GPU.
>> We don't have much experience with multi-GPU nodes, so I would suggest
>> doing some scalability tests by running 1 rank, 2 ranks, ... 8 ranks
>> (always 5 threads) to check how the performance scales. BTW, make sure CP2K
>> is able to recognize all 8 GPUs by checking the following output at the
>> beginning:
>>
>>  DBCSR| ACC: Number of devices/node                                      1
>>
>> If needed, you might consider reoptimizing the kernels for the V100, but
>> this is not a priority...
>>
>> Alfio
>>
>>
>>
>> On Saturday, 3 November 2018 at 07:55:09 UTC+1, for... at gmail.com wrote:
>>>
>>> Hi,
>>>
>>> How is the CP2K performance on GPUs in general?
>>>
>>> I'm getting very low performance on GPUs (Nvidia V100 SXM2). It is a
>>> single-node benchmark with 8 GPUs and dual Intel Skylake Gold 6148
>>> processors.
>>>
>>> The CP2K run on 8 GPUs (CP2K-6.1 psmp version, ifort-2017, CUDA-9.2,
>>> 8 MPI ranks + 5 threads per rank) is still slower than the CPU-only
>>> benchmark.
>>>
>>> For CPU runs, the CP2K-6.1 is built with LIBXSMM-1.8.3.
>>>
>>> For GPU runs, I have tried both with and without LIBXSMM. There is no
>>> performance difference, and both are still slower than the CPU-only
>>> benchmark even when using all 8 GPUs and all 40 CPU cores. Can someone
>>> please share their experience with CP2K performance on GPUs?
>>>
>>> The CUDA specific DFLAGS used are: -D__ACC -D__DBCSR_ACC -D__PW_CUDA.
>>>
>>>