[CP2K-user] [CP2K:18387] Re: GPU oversubscription errors with 2023.1

Alfio Lazzaro alfio.lazzaro at gmail.com
Thu Jan 26 08:45:48 UTC 2023


I'm sorry, I don't know how to reply to your question. However, I've opened 
a ticket on the github CP2K repository, maybe someone more expert can reply 
to you (see https://github.com/cp2k/cp2k/issues/2530 ).

From my experience, ELPA is useful when you run with many MPI ranks. You can 
check your outputs for timing entries like `cp_fm_syevd` (ScaLAPACK) or 
`cp_fm_diag_elpa` (ELPA), e.g.:

cp_fm_syevd                         36 10.6    0.001    0.001   13.586   13.587

In this particular case, the diagonalization takes 13.6 seconds (last column). 
So you can check how much time you spend in the diagonalizer and compare it to 
the total run time.
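
A minimal sketch of that check (`job.out` is just a placeholder for your actual 
output file):

  # extract the diagonalization entries from the timing report at the end of the run
  grep -E 'cp_fm_syevd|cp_fm_diag_elpa' job.out

Compare that number with the overall "CP2K" entry of the same timing report to 
see which fraction of the run is spent in diagonalization.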
Another possibility is to switch the diagonalizer (ELPA vs ScaLAPACK) directly 
in your input file (see 
https://manual.cp2k.org/trunk/CP2K_INPUT/GLOBAL.html#PREFERRED_DIAG_LIBRARY).
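
For example (just a sketch; check the manual page above for the exact keyword 
values supported by your CP2K version, e.g. whether ScaLAPACK is selected with 
"SL"):

  &GLOBAL
    ! "SL" selects ScaLAPACK here; "ELPA" switches back to ELPA
    PREFERRED_DIAG_LIBRARY SL
  &END GLOBAL
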
In this case, I can suggest building ELPA without GPU support, so that you can 
still use ELPA on the CPU (assuming that it is beneficial in your case), by 
hacking the toolchain installation file:

https://github.com/cp2k/cp2k/blob/master/tools/toolchain/scripts/stage5/install_elpa.sh
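
A quick way to see what needs changing there (a sketch, run from the top of the 
CP2K source tree; the exact configure flag names differ between ELPA and 
toolchain versions):

  # list the GPU-related configure options the toolchain passes to ELPA
  grep -n -i 'gpu' tools/toolchain/scripts/stage5/install_elpa.sh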

Hope it helps.

Alfio


On Wednesday, January 25, 2023 at 12:37:49 UTC+1, jerryt... at gmail.com 
wrote:

> Hi Alfio,
> Yes, ELPA was the problem.  I removed it from my build and CP2K worked as 
> expected.  Where does ELPA help the most?  The majority of my AIMD jobs are 
> 1000 atoms or less.  Will ELPA provide a performance advantage over 
> SCALAPACK for systems of that size?
>
> Thank you,
> Jerry
>
> On Monday, January 23, 2023 at 4:09:41 AM UTC-5 Alfio Lazzaro wrote:
>
>> I have no clue what's wrong here, but I see in your log that ELPA is 
>> giving some warning messages. For this reason, I would suggest avoiding 
>> ELPA, i.e. add `--with-elpa=no` during the toolchain installation. Does it 
>> work on a single GPU, i.e. a single MPI rank?
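>>
>> A minimal sketch of both steps (paths, file names, and the remaining 
>> toolchain flags are placeholders for your actual setup):
>>
>>   # rebuild the toolchain without ELPA
>>   ./install_cp2k_toolchain.sh --with-elpa=no <your other flags>
>>   # then rerun the failing job on a single MPI rank, i.e. a single GPU
>>   mpirun -np 1 ./exe/local_cuda/cp2k.psmp -i job.inp -o job.out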
>>
>> On Friday, January 20, 2023 at 15:33:45 UTC+1, jerryt... at gmail.com 
>> wrote:
>>
>>> Dear Forum,
>>> I successfully compiled v2023.1 (gcc-10.3.0, cuda-11.2, and MKL lib) 
>>> with the toolchain using:
>>>
>>> "-j 8 --no-check-certificate --install-all --with-gcc=system 
>>> --with-openmpi --with-mkl --with-sirius=no --with-spfft=no 
>>> --with-cmake=system --enable-cuda --gpu-ver=P100 --with-pexsi 
>>> --with-sirius=no --with-quip=no --with-hdf5=no --with-libvdwxc=no 
>>> --with-spla=no --with-libtorch=no"
>>>
>>> However, when I ran a test job, the job crashed and I got GPU 
>>> oversubscription as shown below:
>>>
>>> +-----------------------------------------------------------------------------+
>>> | NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
>>> |-------------------------------+----------------------+----------------------+
>>> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>>> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>>> |                               |                      |               MIG M. |
>>> |===============================+======================+======================|
>>> |   0  Tesla V100-SXM2...  Off  | 00000000:1A:00.0 Off |                    0 |
>>> | N/A   42C    P0    80W / 300W |   2585MiB / 16384MiB |     53%      Default |
>>> |                               |                      |                  N/A |
>>> +-------------------------------+----------------------+----------------------+
>>> |   1  Tesla V100-SXM2...  Off  | 00000000:1C:00.0 Off |                    0 |
>>> | N/A   36C    P0    75W / 300W |   1664MiB / 16384MiB |     65%      Default |
>>> |                               |                      |                  N/A |
>>> +-------------------------------+----------------------+----------------------+
>>> |   2  Tesla V100-SXM2...  Off  | 00000000:1D:00.0 Off |                    0 |
>>> | N/A   36C    P0    73W / 300W |   1616MiB / 16384MiB |     54%      Default |
>>> |                               |                      |                  N/A |
>>> +-------------------------------+----------------------+----------------------+
>>> |   3  Tesla V100-SXM2...  Off  | 00000000:1E:00.0 Off |                    0 |
>>> | N/A   40C    P0    71W / 300W |   1614MiB / 16384MiB |     55%      Default |
>>> |                               |                      |                  N/A |
>>> +-------------------------------+----------------------+----------------------+
>>>
>>> +-----------------------------------------------------------------------------+
>>> | Processes:                                                                  |
>>> |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
>>> |        ID   ID                                                   Usage      |
>>> |=============================================================================|
>>> |    0   N/A  N/A    159608      C   .../exe/local_cuda/cp2k.psmp     1655MiB |
>>> |    0   N/A  N/A    159609      C   .../exe/local_cuda/cp2k.psmp      307MiB |
>>> |    0   N/A  N/A    159610      C   .../exe/local_cuda/cp2k.psmp      307MiB |
>>> |    0   N/A  N/A    159611      C   .../exe/local_cuda/cp2k.psmp      307MiB |
>>> |    1   N/A  N/A    159609      C   .../exe/local_cuda/cp2k.psmp     1659MiB |
>>> |    2   N/A  N/A    159610      C   .../exe/local_cuda/cp2k.psmp     1611MiB |
>>> |    3   N/A  N/A    159611      C   .../exe/local_cuda/cp2k.psmp     1609MiB |
>>> +-----------------------------------------------------------------------------+
>>>
>>> However, using CP2K 2022.2, I ran the job successfully and did not get 
>>> this oversubscription.
>>>
>>>
>>> +-----------------------------------------------------------------------------+
>>> | NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
>>> |-------------------------------+----------------------+----------------------+
>>> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>>> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>>> |                               |                      |               MIG M. |
>>> |===============================+======================+======================|
>>> |   0  Tesla V100-SXM2...  Off  | 00000000:1A:00.0 Off |                    0 |
>>> | N/A   37C    P0    63W / 300W |   1598MiB / 16384MiB |     31%      Default |
>>> |                               |                      |                  N/A |
>>> +-------------------------------+----------------------+----------------------+
>>> |   1  Tesla V100-SXM2...  Off  | 00000000:1C:00.0 Off |                    0 |
>>> | N/A   33C    P0    63W / 300W |   1606MiB / 16384MiB |     29%      Default |
>>> |                               |                      |                  N/A |
>>> +-------------------------------+----------------------+----------------------+
>>> |   2  Tesla V100-SXM2...  Off  | 00000000:1D:00.0 Off |                    0 |
>>> | N/A   34C    P0    63W / 300W |   1562MiB / 16384MiB |     27%      Default |
>>> |                               |                      |                  N/A |
>>> +-------------------------------+----------------------+----------------------+
>>> |   3  Tesla V100-SXM2...  Off  | 00000000:1E:00.0 Off |                    0 |
>>> | N/A   37C    P0    67W / 300W |   1560MiB / 16384MiB |     25%      Default |
>>> |                               |                      |                  N/A |
>>> +-------------------------------+----------------------+----------------------+
>>>
>>> +-----------------------------------------------------------------------------+
>>> | Processes:                                                                  |
>>> |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
>>> |        ID   ID                                                   Usage      |
>>> |=============================================================================|
>>> |    0   N/A  N/A    163862      C   .../exe/local_cuda/cp2k.psmp     1599MiB |
>>> |    1   N/A  N/A    163863      C   .../exe/local_cuda/cp2k.psmp     1603MiB |
>>> |    2   N/A  N/A    163864      C   .../exe/local_cuda/cp2k.psmp     1557MiB |
>>> |    3   N/A  N/A    163865      C   .../exe/local_cuda/cp2k.psmp     1555MiB |
>>> +-----------------------------------------------------------------------------+
>>>
>>> Additionally, the system output file shows the following CUDA runtime error:
>>>
>>> CUDA RUNTIME API error: EventRecord failed with error cudaErrorInvalidResourceHandle
>>> (/cluster/home/tanoury/CP2K/cp2k-2023.1_GPU/exts/dbcsr/src/acc/cuda_hip/acc_event.cpp::60)
>>>
>>> I have also attached the error output.
>>>
>>> Any help to solve this problem is greatly appreciated.
>>>
>>> Thank you so much,
>>> Jerry
>>>
>>>


