[CP2K-user] [CP2K:18363] GPU oversubscription errors with 2023.1

jerryt...@gmail.com jerrytanoury at gmail.com
Fri Jan 20 14:33:45 UTC 2023


Dear Forum,
I successfully compiled CP2K v2023.1 (gcc-10.3.0, cuda-11.2, and MKL) with 
the toolchain, using the following options:

"-j 8 --no-check-certificate --install-all --with-gcc=system --with-openmpi 
--with-mkl --with-sirius=no --with-spfft=no --with-cmake=system 
--enable-cuda --gpu-ver=P100 --with-pexsi --with-sirius=no --with-quip=no 
--with-hdf5=no --with-libvdwxc=no --with-spla=no --with-libtorch=no"

However, when I ran a test job, it crashed, and nvidia-smi showed GPU 
oversubscription: processes 159609, 159610, and 159611 each appear on 
GPU 0 in addition to their own GPU, as shown below:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:1A:00.0 Off |                    0 |
| N/A   42C    P0    80W / 300W |   2585MiB / 16384MiB |     53%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:1C:00.0 Off |                    0 |
| N/A   36C    P0    75W / 300W |   1664MiB / 16384MiB |     65%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:1D:00.0 Off |                    0 |
| N/A   36C    P0    73W / 300W |   1616MiB / 16384MiB |     54%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:1E:00.0 Off |                    0 |
| N/A   40C    P0    71W / 300W |   1614MiB / 16384MiB |     55%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    159608      C   .../exe/local_cuda/cp2k.psmp     1655MiB |
|    0   N/A  N/A    159609      C   .../exe/local_cuda/cp2k.psmp      307MiB |
|    0   N/A  N/A    159610      C   .../exe/local_cuda/cp2k.psmp      307MiB |
|    0   N/A  N/A    159611      C   .../exe/local_cuda/cp2k.psmp      307MiB |
|    1   N/A  N/A    159609      C   .../exe/local_cuda/cp2k.psmp     1659MiB |
|    2   N/A  N/A    159610      C   .../exe/local_cuda/cp2k.psmp     1611MiB |
|    3   N/A  N/A    159611      C   .../exe/local_cuda/cp2k.psmp     1609MiB |
+-----------------------------------------------------------------------------+

With CP2K 2022.2, however, the job ran successfully and showed no such 
oversubscription; each GPU hosts exactly one cp2k.psmp process:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:1A:00.0 Off |                    0 |
| N/A   37C    P0    63W / 300W |   1598MiB / 16384MiB |     31%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:1C:00.0 Off |                    0 |
| N/A   33C    P0    63W / 300W |   1606MiB / 16384MiB |     29%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:1D:00.0 Off |                    0 |
| N/A   34C    P0    63W / 300W |   1562MiB / 16384MiB |     27%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:1E:00.0 Off |                    0 |
| N/A   37C    P0    67W / 300W |   1560MiB / 16384MiB |     25%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    163862      C   .../exe/local_cuda/cp2k.psmp     1599MiB |
|    1   N/A  N/A    163863      C   .../exe/local_cuda/cp2k.psmp     1603MiB |
|    2   N/A  N/A    163864      C   .../exe/local_cuda/cp2k.psmp     1557MiB |
|    3   N/A  N/A    163865      C   .../exe/local_cuda/cp2k.psmp     1555MiB |
+-----------------------------------------------------------------------------+
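
To make the difference between the two process tables explicit, here is a 
minimal Python sketch that parses nvidia-smi process rows and lists every PID 
that appears on more than one GPU. It is not part of CP2K or the NVIDIA 
tooling; the function names and the regular expression are illustrative 
assumptions, and the rows are copied from the two tables in this message.

```python
import re
from collections import defaultdict

# Process rows copied from the nvidia-smi output of the 2023.1 run.
ROWS_2023_1 = """
|    0   N/A  N/A    159608      C   .../exe/local_cuda/cp2k.psmp     1655MiB |
|    0   N/A  N/A    159609      C   .../exe/local_cuda/cp2k.psmp      307MiB |
|    0   N/A  N/A    159610      C   .../exe/local_cuda/cp2k.psmp      307MiB |
|    0   N/A  N/A    159611      C   .../exe/local_cuda/cp2k.psmp      307MiB |
|    1   N/A  N/A    159609      C   .../exe/local_cuda/cp2k.psmp     1659MiB |
|    2   N/A  N/A    159610      C   .../exe/local_cuda/cp2k.psmp     1611MiB |
|    3   N/A  N/A    159611      C   .../exe/local_cuda/cp2k.psmp     1609MiB |
"""

# Process rows copied from the nvidia-smi output of the 2022.2 run.
ROWS_2022_2 = """
|    0   N/A  N/A    163862      C   .../exe/local_cuda/cp2k.psmp     1599MiB |
|    1   N/A  N/A    163863      C   .../exe/local_cuda/cp2k.psmp     1603MiB |
|    2   N/A  N/A    163864      C   .../exe/local_cuda/cp2k.psmp     1557MiB |
|    3   N/A  N/A    163865      C   .../exe/local_cuda/cp2k.psmp     1555MiB |
"""

# Columns: GPU index, GI ID, CI ID, PID, type, process name, memory in MiB.
# This pattern is an assumption based on the table layout shown in this post.
ROW_RE = re.compile(
    r"^\|\s+(\d+)\s+\S+\s+\S+\s+(\d+)\s+\S+\s+(\S+)\s+(\d+)MiB\s+\|$"
)


def processes_per_gpu(table_text):
    """Map GPU index -> list of (pid, MiB) parsed from process rows."""
    per_gpu = defaultdict(list)
    for line in table_text.splitlines():
        match = ROW_RE.match(line.strip())
        if match:
            gpu, pid, _name, mib = match.groups()
            per_gpu[int(gpu)].append((int(pid), int(mib)))
    return dict(per_gpu)


def pids_on_multiple_gpus(per_gpu):
    """Return {pid: sorted GPU indices} for PIDs present on more than one GPU."""
    gpus_by_pid = defaultdict(set)
    for gpu, procs in per_gpu.items():
        for pid, _mib in procs:
            gpus_by_pid[pid].add(gpu)
    return {pid: sorted(gpus) for pid, gpus in gpus_by_pid.items() if len(gpus) > 1}


if __name__ == "__main__":
    for label, rows in (("2023.1", ROWS_2023_1), ("2022.2", ROWS_2022_2)):
        per_gpu = processes_per_gpu(rows)
        counts = {gpu: len(procs) for gpu, procs in sorted(per_gpu.items())}
        print(label, "processes per GPU:", counts)
        print(label, "PIDs on several GPUs:", pids_on_multiple_gpus(per_gpu))
```

For the 2023.1 rows this reports four processes on GPU 0 and shows PIDs 
159609, 159610, and 159611 on GPU 0 as well as on GPUs 1, 2, and 3 
respectively; for the 2022.2 rows no PID appears on more than one GPU.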

Additionally, the system output file shows the following CUDA runtime error:

CUDA RUNTIME API error: EventRecord failed with error cudaErrorInvalidResourceHandle
(/cluster/home/tanoury/CP2K/cp2k-2023.1_GPU/exts/dbcsr/src/acc/cuda_hip/acc_event.cpp::60)

I have also attached the error output.

Any help in solving this problem would be greatly appreciated.

Thank you so much,
Jerry

-------------- next part --------------
A non-text attachment was scrubbed...
Name: error_file
Type: application/octet-stream
Size: 66698 bytes
Desc: not available
URL: <https://lists.cp2k.org/archives/cp2k-user/attachments/20230120/2b1171f9/attachment-0001.obj>

