[CP2K-user] [CP2K:18387] Re: GPU oversubscription errors with 2023.1
Alfio Lazzaro
alfio.lazzaro at gmail.com
Thu Jan 26 08:45:48 UTC 2023
I'm sorry, I don't have an answer to your question. However, I've opened an
issue on the CP2K GitHub repository; maybe someone more expert can reply to
you (see https://github.com/cp2k/cp2k/issues/2530 ).
From my experience, ELPA is useful when you run with many ranks. You can
check your outputs for timing entries like `cp_fm_syevd` (ScaLAPACK) or
`cp_fm_diag_elpa` (ELPA), e.g.:

cp_fm_syevd        36    10.6    0.001    0.001    13.586    13.587

In this particular case, the diagonalization (here ScaLAPACK's `cp_fm_syevd`)
takes 13.6 seconds (last column). So, you can check how much time you spend
in the diagonalizer and compare it to the total time.
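For example, a quick way to pull those entries out of a run (just a shell
sketch; `cp2k_output.out` is a placeholder for your actual output file):

  # print the diagonalization entries from the TIMING report
  grep -E "cp_fm_syevd|cp_fm_diag_elpa" cp2k_output.out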
Another possibility: you can always switch the diagonalizer (ELPA vs.
ScaLAPACK) in your input file (see
https://manual.cp2k.org/trunk/CP2K_INPUT/GLOBAL.html#PREFERRED_DIAG_LIBRARY).
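For example, to force ScaLAPACK you can set it in the GLOBAL section (a
minimal sketch; keep your existing GLOBAL keywords as they are):

  &GLOBAL
    ! ... your existing keywords (PROJECT, RUN_TYPE, ...) ...
    PREFERRED_DIAG_LIBRARY SCALAPACK   ! or ELPA
  &END GLOBAL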
In that case, I suggest building ELPA without GPU support, so that you can
still have ELPA on the CPU (assuming it is beneficial in your case), by
hacking the toolchain installation file:
https://github.com/cp2k/cp2k/blob/master/tools/toolchain/scripts/stage5/install_elpa.sh
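As a starting point for that hack (only a sketch; the exact configure flags
depend on the ELPA version shipped with the toolchain), you can first locate
the GPU-related lines:

  # run from the CP2K source tree; shows the lines of the ELPA toolchain
  # script that mention GPU, i.e. the ones to edit or disable
  grep -n -i "gpu" tools/toolchain/scripts/stage5/install_elpa.sh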
Hope it helps.
Alfio
On Wednesday, January 25, 2023 at 12:37:49 UTC+1, jerryt... at gmail.com
wrote:
> Hi Alfio,
> Yes, ELPA was the problem. I removed it from my build and CP2K worked as
> expected. Where does ELPA help the most? The majority of my AIMD jobs are
> 1000 atoms or less. Will ELPA provide a performance advantage over
> SCALAPACK for systems of that size?
>
> Thank you,
> Jerry
>
> On Monday, January 23, 2023 at 4:09:41 AM UTC-5 Alfio Lazzaro wrote:
>
>> I have no clue what's wrong here; however, I see in your log that ELPA is
>> emitting some warning messages. For this reason, I would suggest avoiding
>> ELPA, i.e. adding `--with-elpa=no` during the toolchain installation (see
>> the sketch below). Does it work on a single GPU, i.e. with a single MPI rank?
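>> Something along these lines (a rough sketch only; the launcher, paths, and
>> remaining toolchain flags are placeholders that depend on your setup):
>>
>>   # rebuild the toolchain without ELPA, keeping your other flags
>>   ./install_cp2k_toolchain.sh --with-elpa=no [your other flags]
>>   # then test with a single MPI rank, i.e. a single GPU
>>   mpirun -np 1 exe/local_cuda/cp2k.psmp -i your_input.inp -o test.out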
>>
>> On Friday, January 20, 2023 at 15:33:45 UTC+1, jerryt... at gmail.com
>> wrote:
>>
>>> Dear Forum,
>>> I successfully compiled v2023.1 (gcc-10.3.0, cuda-11.2, and MKL lib)
>>> with the toolchain using:
>>>
>>> "-j 8 --no-check-certificate --install-all --with-gcc=system
>>> --with-openmpi --with-mkl --with-sirius=no --with-spfft=no
>>> --with-cmake=system --enable-cuda --gpu-ver=P100 --with-pexsi
>>> --with-sirius=no --with-quip=no --with-hdf5=no --with-libvdwxc=no
>>> --with-spla=no --with-libtorch=no"
>>>
>>> However, when I ran a test job, the job crashed and I got GPU
>>> oversubscription as shown below:
>>>
>>> NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7
>>>
>>> GPU  Name                Bus-Id            Temp  Perf  Pwr:Usage/Cap  Memory-Usage        GPU-Util
>>>  0   Tesla V100-SXM2...  00000000:1A:00.0  42C   P0    80W / 300W     2585MiB / 16384MiB  53%
>>>  1   Tesla V100-SXM2...  00000000:1C:00.0  36C   P0    75W / 300W     1664MiB / 16384MiB  65%
>>>  2   Tesla V100-SXM2...  00000000:1D:00.0  36C   P0    73W / 300W     1616MiB / 16384MiB  54%
>>>  3   Tesla V100-SXM2...  00000000:1E:00.0  40C   P0    71W / 300W     1614MiB / 16384MiB  55%
>>>
>>> Processes:
>>> GPU  PID     Type  Process name                  GPU Memory Usage
>>>  0   159608  C     .../exe/local_cuda/cp2k.psmp  1655MiB
>>>  0   159609  C     .../exe/local_cuda/cp2k.psmp   307MiB
>>>  0   159610  C     .../exe/local_cuda/cp2k.psmp   307MiB
>>>  0   159611  C     .../exe/local_cuda/cp2k.psmp   307MiB
>>>  1   159609  C     .../exe/local_cuda/cp2k.psmp  1659MiB
>>>  2   159610  C     .../exe/local_cuda/cp2k.psmp  1611MiB
>>>  3   159611  C     .../exe/local_cuda/cp2k.psmp  1609MiB
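>>> (As an aside, the process-to-GPU mapping above can also be listed directly,
>>> assuming a driver whose nvidia-smi supports these query fields:
>>>
>>>   nvidia-smi --query-compute-apps=gpu_bus_id,pid,process_name,used_memory --format=csv
>>>
>>> which makes the four ranks attached to GPU 0 easy to spot.)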
>>>
>>> However, using CP2K 2022.2, I ran the job successfully and did not get
>>> this oversubscription.
>>>
>>>
>>> NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7
>>>
>>> GPU  Name                Bus-Id            Temp  Perf  Pwr:Usage/Cap  Memory-Usage        GPU-Util
>>>  0   Tesla V100-SXM2...  00000000:1A:00.0  37C   P0    63W / 300W     1598MiB / 16384MiB  31%
>>>  1   Tesla V100-SXM2...  00000000:1C:00.0  33C   P0    63W / 300W     1606MiB / 16384MiB  29%
>>>  2   Tesla V100-SXM2...  00000000:1D:00.0  34C   P0    63W / 300W     1562MiB / 16384MiB  27%
>>>  3   Tesla V100-SXM2...  00000000:1E:00.0  37C   P0    67W / 300W     1560MiB / 16384MiB  25%
>>>
>>> Processes:
>>> GPU  PID     Type  Process name                  GPU Memory Usage
>>>  0   163862  C     .../exe/local_cuda/cp2k.psmp  1599MiB
>>>  1   163863  C     .../exe/local_cuda/cp2k.psmp  1603MiB
>>>  2   163864  C     .../exe/local_cuda/cp2k.psmp  1557MiB
>>>  3   163865  C     .../exe/local_cuda/cp2k.psmp  1555MiB
>>>
>>> Additionally, the system output file shows the following CUDA runtime
>>> error:
>>>
>>> CUDA RUNTIME API error: EventRecord failed with error cudaErrorInvalidResourceHandle
>>> (/cluster/home/tanoury/CP2K/cp2k-2023.1_GPU/exts/dbcsr/src/acc/cuda_hip/acc_event.cpp::60)
>>>
>>> I have also attached the error output.
>>>
>>> Any help to solve this problem is greatly appreciated.
>>>
>>> Thank you so much,
>>> Jerry
>>>
>>>