[CP2K-user] CP2K performance on GPUs

foru... at gmail.com foru... at gmail.com
Sun Nov 4 18:13:01 UTC 2018


Thanks Alfio for the response.

Yes, 8 V100 GPUs is extreme. The test I used takes around 500 seconds on a 
system with dual Intel Skylake Gold 6148 processors (40 cores, 20 
cores/socket). Do you think this test is not large enough to run on GPUs? 
If so, can you recommend a test from the CP2K tests folder?

I also tried runs with 1 and 2 V100 GPUs. Both were slower than the run 
with 8 V100 GPUs.
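
To compare such runs, speedup and parallel efficiency relative to the 
smallest run can be computed with a short Python sketch. The function name 
and the input layout (a dict mapping GPU or rank count to wall time in 
seconds) are illustrative assumptions, not part of CP2K:

```python
def scaling_report(timings):
    """Return (count, speedup, efficiency) tuples, sorted by count.

    timings: dict mapping resource count (GPUs or MPI ranks) to the
    measured wall time in seconds. The smallest count is the baseline.
    """
    if not timings:
        raise ValueError("timings must contain at least one measurement")

    base_count = min(timings)
    base_time = timings[base_count]

    report = []
    for count in sorted(timings):
        # Speedup: how many times faster than the baseline run.
        speedup = base_time / timings[count]
        # Efficiency: speedup divided by the increase in resources.
        efficiency = speedup / (count / base_count)
        report.append((count, speedup, efficiency))
    return report
```

An efficiency well below 1.0 when going from 1 to 8 GPUs would suggest the 
test is too small to keep the additional devices busy.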

CP2K was able to recognize all 8 GPUs, according to the "DBCSR| ACC: Number 
of devices/node" line in the output.
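
This check can be scripted. The following is a minimal Python sketch that 
extracts the device count from the CP2K output text; the function name is 
hypothetical, and the pattern assumes the value follows the label after 
whitespace (possibly wrapped onto the next line, as in the quoted output):

```python
import re

# Label as printed by CP2K; \s+ also matches a line break, so the value
# is found even when the line has been wrapped.
_DEVICES_RE = re.compile(r"DBCSR\|\s*ACC:\s*Number of devices/node\s+(\d+)")


def devices_per_node(cp2k_output):
    """Return the reported number of devices per node, or None if absent."""
    match = _DEVICES_RE.search(cp2k_output)
    return int(match.group(1)) if match else None
```

Passing the contents of the output file to devices_per_node() and 
comparing the result with the expected number of GPUs confirms that all 
devices are visible to the run.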

I tried reoptimizing the kernels for the V100, but could not determine 
which block size values have to be passed to the tune.py script.

CP2K-6.1 already has optimized kernel parameters for the P100, yet even a 
run on 2 P100 GPUs was slower than the CPU-only benchmark.

On Sunday, November 4, 2018 at 2:33:11 PM UTC+5:30, Alfio Lazzaro wrote:
>
> You may take a look at this issue on GitHub: 
> https://github.com/cp2k/cp2k/issues/73
>
> In your particular case, your setup of 8 V100 is pretty extreme and it 
> would require a large computation. Which test are you using for 
> benchmarking?
>
> Then, your setup of 8 ranks + 5 threads should be OK. CP2K assigns ranks 
> to GPUs in a round-robin manner, so in your case there is one rank 
> talking to each GPU.
> We don't have much experience with multi-GPU nodes, hence I would suggest 
> doing some scalability tests by running 1 rank, 2 ranks, ... 8 ranks 
> (always 5 threads) to check how the performance scales. BTW, make sure 
> CP2K is able to recognize 8 GPUs by checking the following output at the 
> beginning:
>
>  DBCSR| ACC: Number of devices/node                                    1
>
> You might also consider reoptimizing the kernels for the V100, but this 
> is not a priority...
>
> Alfio
>
>
>
> On Saturday, November 3, 2018 at 07:55:09 UTC+1, for... at gmail.com 
> wrote:
>>
>> Hi,
>>
>> How is the CP2K performance on GPUs in general?
>>
>> I'm getting very low performance on GPUs (Nvidia V100 SXM2). It is a 
>> single-node benchmark with 8 GPUs and dual Intel Skylake Gold 6148 
>> processors. 
>>
>> The CP2K time on 8 GPUs (CP2K-6.1 psmp version, ifort-2017, CUDA-9.2, 
>> 8 MPI ranks + 5 threads per rank) is still slower than the CP2K time of 
>> the CPU-only benchmark.
>>
>> For CPU runs, CP2K-6.1 is built with LIBXSMM-1.8.3.
>>
>> For GPU runs, I have tried both with and without LIBXSMM. There is no 
>> performance difference. Both are still slower than the CPU-only 
>> benchmark, even when using all 8 GPUs and all 40 CPU cores. Can someone 
>> please share their experience of CP2K performance with GPUs?
>>
>> The CUDA specific DFLAGS used are: -D__ACC -D__DBCSR_ACC -D__PW_CUDA.
>>
>>