[CP2K-user] [CP2K:18385] Re: GPU oversubscription errors with 2023.1

jerryt...@gmail.com jerrytanoury at gmail.com
Wed Jan 25 11:37:49 UTC 2023
Previous message (by thread): [CP2K-user] [CP2K:18376] Re: GPU oversubscription errors with 2023.1
Next message (by thread): [CP2K-user] [CP2K:18387] Re: GPU oversubscription errors with 2023.1
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi Alfio,
Yes, ELPA was the problem. After I removed it from my build, CP2K worked as
expected. Where does ELPA help the most? Most of my AIMD jobs have 1000 atoms
or fewer. Would ELPA provide a performance advantage over ScaLAPACK for
systems of that size?

Thank you,
Jerry

On Monday, January 23, 2023 at 4:09:41 AM UTC-5 Alfio Lazzaro wrote:

> I have no clue what's wrong here; however, I see in your log that ELPA is
> giving a warning message. For this reason, I would suggest avoiding ELPA,
> i.e. adding `--with-elpa=no` during the toolchain installation. Does it
> work on a single GPU, i.e. a single MPI rank?
>
> On Friday, January 20, 2023 at 15:33:45 UTC+1, jerryt... at gmail.com
> wrote:
>
>> Dear Forum,
>> I successfully compiled v2023.1 (gcc-10.3.0, cuda-11.2, and MKL lib) with 
>> the toolchain using:
>>
>> "-j 8 --no-check-certificate --install-all --with-gcc=system 
>> --with-openmpi --with-mkl --with-sirius=no --with-spfft=no 
>> --with-cmake=system --enable-cuda --gpu-ver=P100 --with-pexsi 
>> --with-sirius=no --with-quip=no --with-hdf5=no --with-libvdwxc=no 
>> --with-spla=no --with-libtorch=no"
>>
>> However, when I ran a test job, it crashed and nvidia-smi showed the GPUs
>> oversubscribed, as shown below:
>>
>> +-----------------------------------------------------------------------------+
>> | NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
>> |-------------------------------+----------------------+----------------------+
>> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>> |                               |                      |               MIG M. |
>> |===============================+======================+======================|
>> |   0  Tesla V100-SXM2...  Off  | 00000000:1A:00.0 Off |                    0 |
>> | N/A   42C    P0    80W / 300W |   2585MiB / 16384MiB |     53%      Default |
>> |                               |                      |                  N/A |
>> +-------------------------------+----------------------+----------------------+
>> |   1  Tesla V100-SXM2...  Off  | 00000000:1C:00.0 Off |                    0 |
>> | N/A   36C    P0    75W / 300W |   1664MiB / 16384MiB |     65%      Default |
>> |                               |                      |                  N/A |
>> +-------------------------------+----------------------+----------------------+
>> |   2  Tesla V100-SXM2...  Off  | 00000000:1D:00.0 Off |                    0 |
>> | N/A   36C    P0    73W / 300W |   1616MiB / 16384MiB |     54%      Default |
>> |                               |                      |                  N/A |
>> +-------------------------------+----------------------+----------------------+
>> |   3  Tesla V100-SXM2...  Off  | 00000000:1E:00.0 Off |                    0 |
>> | N/A   40C    P0    71W / 300W |   1614MiB / 16384MiB |     55%      Default |
>> |                               |                      |                  N/A |
>> +-------------------------------+----------------------+----------------------+
>>
>> +-----------------------------------------------------------------------------+
>> | Processes:                                                                  |
>> |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
>> |        ID   ID                                                   Usage      |
>> |=============================================================================|
>> |    0   N/A  N/A    159608      C   .../exe/local_cuda/cp2k.psmp     1655MiB |
>> |    0   N/A  N/A    159609      C   .../exe/local_cuda/cp2k.psmp      307MiB |
>> |    0   N/A  N/A    159610      C   .../exe/local_cuda/cp2k.psmp      307MiB |
>> |    0   N/A  N/A    159611      C   .../exe/local_cuda/cp2k.psmp      307MiB |
>> |    1   N/A  N/A    159609      C   .../exe/local_cuda/cp2k.psmp     1659MiB |
>> |    2   N/A  N/A    159610      C   .../exe/local_cuda/cp2k.psmp     1611MiB |
>> |    3   N/A  N/A    159611      C   .../exe/local_cuda/cp2k.psmp     1609MiB |
>> +-----------------------------------------------------------------------------+
>>
>> With CP2K 2022.2, however, the same job ran successfully and showed no
>> oversubscription:
>>
>>
>> +-----------------------------------------------------------------------------+
>> | NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
>> |-------------------------------+----------------------+----------------------+
>> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>> |                               |                      |               MIG M. |
>> |===============================+======================+======================|
>> |   0  Tesla V100-SXM2...  Off  | 00000000:1A:00.0 Off |                    0 |
>> | N/A   37C    P0    63W / 300W |   1598MiB / 16384MiB |     31%      Default |
>> |                               |                      |                  N/A |
>> +-------------------------------+----------------------+----------------------+
>> |   1  Tesla V100-SXM2...  Off  | 00000000:1C:00.0 Off |                    0 |
>> | N/A   33C    P0    63W / 300W |   1606MiB / 16384MiB |     29%      Default |
>> |                               |                      |                  N/A |
>> +-------------------------------+----------------------+----------------------+
>> |   2  Tesla V100-SXM2...  Off  | 00000000:1D:00.0 Off |                    0 |
>> | N/A   34C    P0    63W / 300W |   1562MiB / 16384MiB |     27%      Default |
>> |                               |                      |                  N/A |
>> +-------------------------------+----------------------+----------------------+
>> |   3  Tesla V100-SXM2...  Off  | 00000000:1E:00.0 Off |                    0 |
>> | N/A   37C    P0    67W / 300W |   1560MiB / 16384MiB |     25%      Default |
>> |                               |                      |                  N/A |
>> +-------------------------------+----------------------+----------------------+
>>
>> +-----------------------------------------------------------------------------+
>> | Processes:                                                                  |
>> |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
>> |        ID   ID                                                   Usage      |
>> |=============================================================================|
>> |    0   N/A  N/A    163862      C   .../exe/local_cuda/cp2k.psmp     1599MiB |
>> |    1   N/A  N/A    163863      C   .../exe/local_cuda/cp2k.psmp     1603MiB |
>> |    2   N/A  N/A    163864      C   .../exe/local_cuda/cp2k.psmp     1557MiB |
>> |    3   N/A  N/A    163865      C   .../exe/local_cuda/cp2k.psmp     1555MiB |
>> +-----------------------------------------------------------------------------+
>>
>> Additionally, the system output file shows the following CUDA runtime 
>> error:
>>
>> CUDA RUNTIME API error: EventRecord failed with error cudaErrorInvalidResourceHandle (/cluster/home/tanoury/CP2K/cp2k-2023.1_GPU/exts/dbcsr/src/acc/cuda_hip/acc_event.cpp::60)
>>
>> I have also attached the error output.
>>
>> Any help to solve this problem is greatly appreciated.
>>
>> Thank you so much,
>> Jerry
>>
>>
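
The oversubscription in the 2023.1 listing can also be checked mechanically. Below is a minimal sketch, assuming the "Processes:" rows have the layout shown in the nvidia-smi output quoted above. The function names `gpus_per_pid` and `oversubscribed` are hypothetical helpers, not part of CP2K or nvidia-smi. The sketch only reports which PIDs hold memory on more than one GPU; it does not identify the cause.

```python
import re
from collections import defaultdict

# One row of the "Processes:" table:
# GPU index, GI ID, CI ID, PID, type, process name, memory in MiB.
ROW = re.compile(
    r"^\|\s+(\d+)\s+\S+\s+\S+\s+(\d+)\s+\S+\s+(\S+)\s+(\d+)MiB\s+\|$"
)


def gpus_per_pid(smi_text):
    """Map each PID to the sorted list of GPU indices it appears on."""
    seen = defaultdict(set)
    for line in smi_text.splitlines():
        match = ROW.match(line.strip())
        if match:
            gpu, pid = int(match.group(1)), int(match.group(2))
            seen[pid].add(gpu)
    return {pid: sorted(gpus) for pid, gpus in seen.items()}


def oversubscribed(smi_text):
    """Keep only the PIDs that appear on more than one GPU."""
    return {
        pid: gpus
        for pid, gpus in gpus_per_pid(smi_text).items()
        if len(gpus) > 1
    }


# Process rows copied from the 2023.1 listing in this message.
SAMPLE_2023_1 = """
|    0   N/A  N/A    159608      C   .../exe/local_cuda/cp2k.psmp     1655MiB |
|    0   N/A  N/A    159609      C   .../exe/local_cuda/cp2k.psmp      307MiB |
|    0   N/A  N/A    159610      C   .../exe/local_cuda/cp2k.psmp      307MiB |
|    0   N/A  N/A    159611      C   .../exe/local_cuda/cp2k.psmp      307MiB |
|    1   N/A  N/A    159609      C   .../exe/local_cuda/cp2k.psmp     1659MiB |
|    2   N/A  N/A    159610      C   .../exe/local_cuda/cp2k.psmp     1611MiB |
|    3   N/A  N/A    159611      C   .../exe/local_cuda/cp2k.psmp     1609MiB |
"""

if __name__ == "__main__":
    for pid, gpus in oversubscribed(SAMPLE_2023_1).items():
        print(f"PID {pid} holds memory on GPUs {gpus}")
```

On the 2023.1 rows this flags PIDs 159609, 159610 and 159611, each of which holds 307MiB on GPU 0 in addition to the memory on its own GPU. The 2022.2 listing has one process per GPU, so nothing would be flagged there.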
