[CP2K-user] [CP2K:18363] GPU oversubscription errors with 2023.1
jerryt...@gmail.com
jerrytanoury at gmail.com
Fri Jan 20 14:33:45 UTC 2023
Dear Forum,
I successfully compiled v2023.1 (gcc-10.3.0, cuda-11.2, and MKL lib) with
the toolchain using:
"-j 8 --no-check-certificate --install-all --with-gcc=system --with-openmpi
--with-mkl --with-sirius=no --with-spfft=no --with-cmake=system
--enable-cuda --gpu-ver=P100 --with-pexsi --with-sirius=no --with-quip=no
--with-hdf5=no --with-libvdwxc=no --with-spla=no --with-libtorch=no"
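However, when I ran a test job, it crashed, and nvidia-smi showed the GPUs
oversubscribed: three of the four cp2k.psmp processes also hold memory on
GPU 0 in addition to their own GPU, as shown below: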
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:1A:00.0 Off |                    0 |
| N/A   42C    P0    80W / 300W |   2585MiB / 16384MiB |     53%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:1C:00.0 Off |                    0 |
| N/A   36C    P0    75W / 300W |   1664MiB / 16384MiB |     65%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:1D:00.0 Off |                    0 |
| N/A   36C    P0    73W / 300W |   1616MiB / 16384MiB |     54%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:1E:00.0 Off |                    0 |
| N/A   40C    P0    71W / 300W |   1614MiB / 16384MiB |     55%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    159608      C   .../exe/local_cuda/cp2k.psmp     1655MiB |
|    0   N/A  N/A    159609      C   .../exe/local_cuda/cp2k.psmp      307MiB |
|    0   N/A  N/A    159610      C   .../exe/local_cuda/cp2k.psmp      307MiB |
|    0   N/A  N/A    159611      C   .../exe/local_cuda/cp2k.psmp      307MiB |
|    1   N/A  N/A    159609      C   .../exe/local_cuda/cp2k.psmp     1659MiB |
|    2   N/A  N/A    159610      C   .../exe/local_cuda/cp2k.psmp     1611MiB |
|    3   N/A  N/A    159611      C   .../exe/local_cuda/cp2k.psmp     1609MiB |
+-----------------------------------------------------------------------------+
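To make the oversubscription in the process list above easier to spot, here is a minimal Python sketch that groups the rows by PID and reports every process that shows up on more than one GPU. The (GPU, PID) pairs are transcribed by hand from the listing above; the function names `gpus_per_pid` and `multi_gpu_pids` are only illustrative and are not part of CP2K or nvidia-smi.

```python
from collections import defaultdict

# (GPU index, PID) pairs transcribed by hand from the "Processes" section
# of the nvidia-smi output for the 2023.1 run.
rows = [
    (0, 159608), (0, 159609), (0, 159610), (0, 159611),
    (1, 159609), (2, 159610), (3, 159611),
]


def gpus_per_pid(rows):
    """Map each PID to the sorted list of GPUs it appears on."""
    seen = defaultdict(set)
    for gpu, pid in rows:
        seen[pid].add(gpu)
    return {pid: sorted(gpus) for pid, gpus in seen.items()}


def multi_gpu_pids(rows):
    """Return only the PIDs that appear on more than one GPU."""
    return {
        pid: gpus
        for pid, gpus in gpus_per_pid(rows).items()
        if len(gpus) > 1
    }


if __name__ == "__main__":
    for pid, gpus in sorted(multi_gpu_pids(rows).items()):
        print(f"PID {pid} appears on GPUs {gpus}")
```

For the listing above this flags PIDs 159609, 159610 and 159611, each of which appears on GPU 0 as well as on its own GPU; PID 159608 appears only on GPU 0.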
By contrast, with CP2K 2022.2 the job ran successfully and there was no
oversubscription; each process appears on a single GPU:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:1A:00.0 Off |                    0 |
| N/A   37C    P0    63W / 300W |   1598MiB / 16384MiB |     31%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:1C:00.0 Off |                    0 |
| N/A   33C    P0    63W / 300W |   1606MiB / 16384MiB |     29%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:1D:00.0 Off |                    0 |
| N/A   34C    P0    63W / 300W |   1562MiB / 16384MiB |     27%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:1E:00.0 Off |                    0 |
| N/A   37C    P0    67W / 300W |   1560MiB / 16384MiB |     25%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    163862      C   .../exe/local_cuda/cp2k.psmp     1599MiB |
|    1   N/A  N/A    163863      C   .../exe/local_cuda/cp2k.psmp     1603MiB |
|    2   N/A  N/A    163864      C   .../exe/local_cuda/cp2k.psmp     1557MiB |
|    3   N/A  N/A    163865      C   .../exe/local_cuda/cp2k.psmp     1555MiB |
+-----------------------------------------------------------------------------+
Additionally, the system output file shows the following CUDA runtime error:
CUDA RUNTIME API error: EventRecord failed with error cudaErrorInvalidResourceHandle
(/cluster/home/tanoury/CP2K/cp2k-2023.1_GPU/exts/dbcsr/src/acc/cuda_hip/acc_event.cpp::60)
I have also attached the error output.
Any help to solve this problem is greatly appreciated.
Thank you so much,
Jerry
-------------- next part --------------
A non-text attachment was scrubbed...
Name: error_file
Type: application/octet-stream
Size: 66698 bytes
Desc: not available
URL: <https://lists.cp2k.org/archives/cp2k-user/attachments/20230120/2b1171f9/attachment-0001.obj>