Hi Alfio,

Yes, ELPA was the problem. I removed it from my build and CP2K worked as expected. Where does ELPA help the most? The majority of my AIMD jobs are 1000 atoms or fewer. Will ELPA provide a performance advantage over ScaLAPACK for systems of that size?

Thank you,
Jerry

On Monday, January 23, 2023 at 4:09:41 AM UTC-5 Alfio Lazzaro wrote:

I have no clue what's wrong here, but I do see in your log that ELPA is printing a warning message. For that reason, I would suggest avoiding ELPA, i.e. adding `--with-elpa=no` during the toolchain installation. Does it work on a single GPU, i.e. with a single MPI rank?

On Friday, January 20, 2023 at 15:33:45 UTC+1 jerryt...@gmail.com wrote:

Dear Forum,

I successfully compiled v2023.1 (gcc-10.3.0, cuda-11.2, and the MKL libraries) with the toolchain using:

"-j 8 --no-check-certificate --install-all --with-gcc=system --with-openmpi --with-mkl --with-sirius=no --with-spfft=no --with-cmake=system --enable-cuda --gpu-ver=P100 --with-pexsi --with-sirius=no --with-quip=no --with-hdf5=no --with-libvdwxc=no --with-spla=no --with-libtorch=no"
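For reference, a minimal sketch of the same toolchain invocation with the suggested `--with-elpa=no` added; the script path assumes a standard CP2K source checkout (tools/toolchain/install_cp2k_toolchain.sh), and the remaining flags simply mirror the command quoted above:

# Sketch: rebuild the toolchain with ELPA disabled, keeping the rest of the
# original flag set unchanged. Run from the top of the CP2K source tree.
cd tools/toolchain
./install_cp2k_toolchain.sh -j 8 --no-check-certificate --install-all \
    --with-gcc=system --with-openmpi --with-mkl --with-elpa=no \
    --with-sirius=no --with-spfft=no --with-cmake=system \
    --enable-cuda --gpu-ver=P100 --with-pexsi --with-quip=no \
    --with-hdf5=no --with-libvdwxc=no --with-spla=no --with-libtorch=no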
However, when I ran a test job, the job crashed and I got GPU oversubscription, as shown below:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:1A:00.0 Off |                    0 |
| N/A   42C    P0    80W / 300W |   2585MiB / 16384MiB |     53%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:1C:00.0 Off |                    0 |
| N/A   36C    P0    75W / 300W |   1664MiB / 16384MiB |     65%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:1D:00.0 Off |                    0 |
| N/A   36C    P0    73W / 300W |   1616MiB / 16384MiB |     54%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:1E:00.0 Off |                    0 |
| N/A   40C    P0    71W / 300W |   1614MiB / 16384MiB |     55%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    159608      C   .../exe/local_cuda/cp2k.psmp     1655MiB |
|    0   N/A  N/A    159609      C   .../exe/local_cuda/cp2k.psmp      307MiB |
|    0   N/A  N/A    159610      C   .../exe/local_cuda/cp2k.psmp      307MiB |
|    0   N/A  N/A    159611      C   .../exe/local_cuda/cp2k.psmp      307MiB |
|    1   N/A  N/A    159609      C   .../exe/local_cuda/cp2k.psmp     1659MiB |
|    2   N/A  N/A    159610      C   .../exe/local_cuda/cp2k.psmp     1611MiB |
|    3   N/A  N/A    159611      C   .../exe/local_cuda/cp2k.psmp     1609MiB |
+-----------------------------------------------------------------------------+
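As a generic check for rank-to-GPU mapping problems like the one above, one can pin each MPI rank to a single device before CP2K starts. A minimal sketch, assuming OpenMPI (which exports OMPI_COMM_WORLD_LOCAL_RANK for every rank) and a hypothetical wrapper script named gpu_bind.sh:

#!/bin/bash
# gpu_bind.sh -- illustrative per-rank GPU binding wrapper (hypothetical name).
# Each local MPI rank is shown only one GPU, so ranks cannot pile onto device 0.
export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}
exec "$@"

It would be launched as, e.g., "mpirun -np 4 ./gpu_bind.sh ./exe/local_cuda/cp2k.psmp -i job.inp -o job.out", where job.inp is a placeholder input file name.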
However, using CP2K 2022.2, I ran the job successfully and did not get this oversubscription:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:1A:00.0 Off |                    0 |
| N/A   37C    P0    63W / 300W |   1598MiB / 16384MiB |     31%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:1C:00.0 Off |                    0 |
| N/A   33C    P0    63W / 300W |   1606MiB / 16384MiB |     29%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:1D:00.0 Off |                    0 |
| N/A   34C    P0    63W / 300W |   1562MiB / 16384MiB |     27%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:1E:00.0 Off |                    0 |
| N/A   37C    P0    67W / 300W |   1560MiB / 16384MiB |     25%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    163862      C   .../exe/local_cuda/cp2k.psmp     1599MiB |
|    1   N/A  N/A    163863      C   .../exe/local_cuda/cp2k.psmp     1603MiB |
|    2   N/A  N/A    163864      C   .../exe/local_cuda/cp2k.psmp     1557MiB |
|    3   N/A  N/A    163865      C   .../exe/local_cuda/cp2k.psmp     1555MiB |
+-----------------------------------------------------------------------------+

Additionally, the system output file shows the following CUDA runtime error:

CUDA RUNTIME API error: EventRecord failed with error cudaErrorInvalidResourceHandle (/cluster/home/tanoury/CP2K/cp2k-2023.1_GPU/exts/dbcsr/src/acc/cuda_hip/acc_event.cpp::60)

I have also attached the error output.

Any help to solve this problem is greatly appreciated.

Thank you so much,
Jerry
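The single-rank check suggested in the reply above can be run as a quick test of the same input on one MPI rank and one GPU. A minimal sketch; the input/output file names and thread count are placeholders:

# Sketch: single-rank, single-GPU run of the same input, to check whether the
# crash only appears when multiple ranks share the node's GPUs.
export CUDA_VISIBLE_DEVICES=0   # expose only the first GPU
export OMP_NUM_THREADS=4        # placeholder thread count
mpirun -np 1 ./exe/local_cuda/cp2k.psmp -i job.inp -o job_1rank.out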