I'm sorry, I don't know how to answer your question. However, I have opened a ticket on the CP2K GitHub repository; maybe someone more expert can reply to you (see https://github.com/cp2k/cp2k/issues/2530).

In my experience, ELPA is useful when you run on many MPI ranks. You can check your outputs for timing entries such as `cp_fm_syevd` (ScaLAPACK) or `cp_fm_diag_elpa` (ELPA), e.g.:

cp_fm_syevd                         36 10.6    0.001    0.001   13.586   13.587

In this particular case, the diagonalization (here ScaLAPACK's `cp_fm_syevd`) takes 13.6 seconds in total (last column). So you can check how much time you spend in the diagonalizer and compare it to the total runtime.
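For example, a quick way to pull those timing entries out of an output file (a minimal sketch; `cp2k.out` is a placeholder for your actual output file name):

grep -E 'cp_fm_syevd|cp_fm_diag_elpa' cp2k.out

Comparing that time against the total runtime reported in the same timing section tells you how much there is to gain from tuning the diagonalizer.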
Another possibility: you can always switch the diagonalizer (ELPA vs. ScaLAPACK) in your input file (see https://manual.cp2k.org/trunk/CP2K_INPUT/GLOBAL.html#PREFERRED_DIAG_LIBRARY), along the lines of the snippet below.
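Something like this (a minimal sketch; `SL` selects ScaLAPACK in the CP2K versions I have used, but please check the accepted values for your version on the manual page above):

&GLOBAL
  ! Choose the diagonalization library: ELPA or SL (ScaLAPACK)
  PREFERRED_DIAG_LIBRARY SL
&END GLOBAL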
In that case, I can suggest building ELPA without GPU support, so that you still have ELPA on the CPU (assuming it is beneficial in your case), by hacking the toolchain installation file:

https://github.com/cp2k/cp2k/blob/master/tools/toolchain/scripts/stage5/install_elpa.sh

Hope it helps.

Alfio

On Wednesday, January 25, 2023, at 12:37:49 UTC+1, jerryt...@gmail.com wrote:

Hi Alfio,

Yes, ELPA was the problem. I removed it from my build and CP2K worked as expected. Where does ELPA help the most? The majority of my AIMD jobs are 1000 atoms or fewer. Will ELPA provide a performance advantage over ScaLAPACK for systems of that size?

Thank you,
Jerry

On Monday, January 23, 2023, at 4:09:41 AM UTC-5, Alfio Lazzaro wrote:

I have no clue what's wrong here; however, I see in your log that ELPA is emitting a warning message. For this reason, I would suggest avoiding ELPA, i.e. adding `--with-elpa=no` to the toolchain installation command. Does it work on a single GPU, i.e. with a single MPI rank?

On Friday, January 20, 2023, at 15:33:45 UTC+1, jerryt...@gmail.com wrote:

Dear Forum,

I successfully compiled v2023.1 (gcc-10.3.0, cuda-11.2, and the MKL library) with the toolchain using:

"-j 8 --no-check-certificate --install-all --with-gcc=system --with-openmpi --with-mkl --with-sirius=no --with-spfft=no --with-cmake=system --enable-cuda --gpu-ver=P100 --with-pexsi --with-sirius=no --with-quip=no --with-hdf5=no --with-libvdwxc=no --with-spla=no --with-libtorch=no"

However, when I ran a test job, it crashed with the GPU oversubscription shown below (note the four cp2k.psmp processes on GPU 0):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:1A:00.0 Off |                    0 |
| N/A   42C    P0    80W / 300W |   2585MiB / 16384MiB |     53%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:1C:00.0 Off |                    0 |
| N/A   36C    P0    75W / 300W |   1664MiB / 16384MiB |     65%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:1D:00.0 Off |                    0 |
| N/A   36C    P0    73W / 300W |   1616MiB / 16384MiB |     54%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:1E:00.0 Off |                    0 |
| N/A   40C    P0    71W / 300W |   1614MiB / 16384MiB |     55%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    159608      C   .../exe/local_cuda/cp2k.psmp     1655MiB |
|    0   N/A  N/A    159609      C   .../exe/local_cuda/cp2k.psmp      307MiB |
|    0   N/A  N/A    159610      C   .../exe/local_cuda/cp2k.psmp      307MiB |
|    0   N/A  N/A    159611      C   .../exe/local_cuda/cp2k.psmp      307MiB |
|    1   N/A  N/A    159609      C   .../exe/local_cuda/cp2k.psmp     1659MiB |
|    2   N/A  N/A    159610      C   .../exe/local_cuda/cp2k.psmp     1611MiB |
|    3   N/A  N/A    159611      C   .../exe/local_cuda/cp2k.psmp     1609MiB |
+-----------------------------------------------------------------------------+

With CP2K 2022.2, however, the same job ran successfully and did not show this oversubscription:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:1A:00.0 Off |                    0 |
| N/A   37C    P0    63W / 300W |   1598MiB / 16384MiB |     31%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:1C:00.0 Off |                    0 |
| N/A   33C    P0    63W / 300W |   1606MiB / 16384MiB |     29%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:1D:00.0 Off |                    0 |
| N/A   34C    P0    63W / 300W |   1562MiB / 16384MiB |     27%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:1E:00.0 Off |                    0 |
| N/A   37C    P0    67W / 300W |   1560MiB / 16384MiB |     25%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    163862      C   .../exe/local_cuda/cp2k.psmp     1599MiB |
|    1   N/A  N/A    163863      C   .../exe/local_cuda/cp2k.psmp     1603MiB |
|    2   N/A  N/A    163864      C   .../exe/local_cuda/cp2k.psmp     1557MiB |
|    3   N/A  N/A    163865      C   .../exe/local_cuda/cp2k.psmp     1555MiB |
+-----------------------------------------------------------------------------+

Additionally, the system output file shows the following CUDA runtime error:

CUDA RUNTIME API error: EventRecord failed with error cudaErrorInvalidResourceHandle (/cluster/home/tanoury/CP2K/cp2k-2023.1_GPU/exts/dbcsr/src/acc/cuda_hip/acc_event.cpp::60)

I have also attached the error output.

Any help solving this problem is greatly appreciated.

Thank you so much,
Jerry