[CP2K-user] [CP2K:22108] Re: GPU vs CPU performance on consumer workstation

Frederick Stein f.stein at hzdr.de
Wed Feb 18 10:30:18 UTC 2026

Previous message (by thread): [CP2K-user] [CP2K:22104] Re: GPU vs CPU performance on consumer workstation
Next message (by thread): [CP2K-user] [CP2K:22094] Confusion about the D4 parameters
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Dear Rafael,
In general, your test is quite small. You should see more of a difference 
with a larger test such as H2O-512.
The routines where CP2K spent most of its time in your case (see "self 
time" in the "timing" section at the end of the output) are 
grid_collocate_task_list, grid_integrate_task_list and 
cp_fm_cholesky_invert. The last one is performed using ScaLAPACK, which is 
not GPU-accelerated but should run efficiently on the CPU. The first two 
mostly run on the GPU (see section "grid statistics" in your output file) 
but, as mentioned, will not be well accelerated by your GPU (only the 
higher memory bandwidth may actually improve the performance). Operations 
within DBCSR also used the GPU.
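
As a rough illustration of reading the "self time" column, the following Python sketch ranks routines from a timing report by their maximum self time. The column layout it assumes (name, calls, ASD, self time average/maximum, total time average/maximum) and all numbers in the sample are illustrative assumptions, not values taken from the attached output files; the helper name top_self_time is hypothetical.

```python
import re

# Illustrative excerpt only: the numbers are made up, and the column layout
# (name, calls, ASD, self avg, self max, total avg, total max) is an assumption.
SAMPLE = """\
 SUBROUTINE                       CALLS  ASD         SELF TIME        TOTAL TIME
                                MAXIMUM       AVERAGE  MAXIMUM  AVERAGE  MAXIMUM
 CP2K                                 1  1.0    0.010    0.010   50.000   50.000
 grid_collocate_task_list           110  9.0   12.000   12.000   12.000   12.000
 grid_integrate_task_list           110  9.0   15.000   15.000   15.000   15.000
 cp_fm_cholesky_invert               11 10.0    8.000    8.000    8.000    8.000
"""

# One routine name followed by an integer call count and five decimal numbers.
ROW = re.compile(
    r"^\s*(\S+)\s+(\d+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$"
)


def top_self_time(report, n=3):
    """Return the n routines with the largest maximum self time."""
    rows = []
    for line in report.splitlines():
        match = ROW.match(line)
        if match:  # header lines do not match and are skipped
            rows.append((match.group(1), float(match.group(5))))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:n]


for name, self_time in top_self_time(SAMPLE):
    print(f"{name:30s} {self_time:8.3f} s")
```

Sorting by self time rather than total time avoids counting a parent routine such as CP2K, whose total time includes everything it calls.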
Considering that all of these operations use double-precision numbers, 
the efficiency on your GPU is poor and would be better on a GPGPU 
(compare https://dashboard.cp2k.org/archive/perf-openmp/commit_2064daf5fd3962f4cfa5dcce2bfa3d6108bed819.txt 
for a CPU test 
and https://dashboard.cp2k.org/archive/perf-cuda-volta/commit_e68d6fd0baf8cfa324767cbe0a05190d11f10215.txt 
for a V100 test).
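
To make the double-precision argument concrete, here is a small back-of-the-envelope sketch in Python. The FP64:FP32 throughput ratios (1:64 for a consumer-class card, 1:2 for a V100-class card) are commonly quoted figures and should be treated as assumptions, and the FP32 peak is a purely hypothetical number chosen to be equal for both cards so that only the ratio differs.

```python
def fp64_peak_tflops(fp32_peak_tflops: float, fp64_ratio: float) -> float:
    """Estimate the FP64 peak from the FP32 peak and the FP64:FP32 ratio."""
    return fp32_peak_tflops * fp64_ratio


# Hypothetical FP32 peak, identical for both cards to isolate the ratio effect.
HYPOTHETICAL_FP32_PEAK = 10.0

# Assumed FP64:FP32 ratios (not taken from the thread).
consumer = fp64_peak_tflops(HYPOTHETICAL_FP32_PEAK, 1 / 64)
datacenter = fp64_peak_tflops(HYPOTHETICAL_FP32_PEAK, 1 / 2)

print(f"consumer-class:   {consumer:.3f} TFLOP/s FP64")
print(f"datacenter-class: {datacenter:.3f} TFLOP/s FP64")
print(f"ratio:            {datacenter / consumer:.0f}x")
```

This is only an upper bound on arithmetic throughput; memory bandwidth and data transfers also affect the actual runtime.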
If you want full CPU usage, you may consider this 
keyword: https://manual.cp2k.org/cp2k-2025_2-branch/CP2K_INPUT/GLOBAL/GRID.html#CP2K_INPUT.GLOBAL.GRID.BACKEND
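
A minimal sketch of how that keyword could be placed in an input file, written as a small Python helper that renders the section text. The section nesting follows the manual path GLOBAL/GRID/BACKEND from the link above; the value CPU is an assumption here, so please check the accepted values on the manual page. The helper name grid_backend_section is hypothetical.

```python
def grid_backend_section(backend: str = "CPU") -> str:
    """Render a GLOBAL/GRID section selecting the grid backend.

    The default value "CPU" is an assumption; consult the linked
    manual page for the values accepted by BACKEND.
    """
    return "\n".join(
        [
            "&GLOBAL",
            "  &GRID",
            f"    BACKEND {backend}",
            "  &END GRID",
            "&END GLOBAL",
        ]
    )


print(grid_backend_section("CPU"))
```

In an existing input file, the &GRID subsection would be merged into the &GLOBAL section that is already there rather than adding a second one.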
Best,
Frederick
rafa... at gmail.com wrote on Tuesday, 17 February 2026 at 23:37:24 UTC+1:

> Thanks for the reply, Frederick.
>
> I hoped that a CPU+GPU job would be at least as quick as a CPU-only run. 
> It seems my GPU job is only using 1 CPU core, whereas I expected to utilize 
> all 4 cores (my GPU run is on par with a 1-core CPU run). During a GPU run, 
> my GPU utilization periodically spikes to 100% but it is idling at 0% most 
> of the time; it seems like the GPU job is bottlenecked by the partial CPU 
> utilization while the GPU is idle.
>
> It leaves me wondering how CP2K is distributing the workload to the GPU 
> and CPU; and can I expect full CPU utilization during the CPU portion of a 
> CPU+GPU run?
> On Saturday, February 7, 2026 at 12:20:19 PM UTC-8 Frederick Stein wrote:
>
>> Dear Rafael,
>> consumer GPU cards such as yours will not provide an acceleration for 
>> CP2K, no matter the workload, because CP2K relies on double-precision 
>> floating-point numbers for accuracy, which are not well supported by 
>> consumer cards such as NVIDIA RTX.
>> The GPU performance has improved since then (grid library, PDGEMM in RPA, 
>> DGEMM in MP2, ...), so some comments in the linked issue are no longer 
>> correct.
>> I can't tell how much memory (CPU or GPU) you need for this test.
>> If you are interested in using the latest version of CP2K, be aware that 
>> you need to switch to the CMake-based (or Spack or EasyBuild) build system.
>> Best,
>> Frederick
>>
>> rafa... at gmail.com wrote on Saturday, 7 February 2026 at 19:31:35 UTC+1:
>>
>>> Hello, I'm testing CP2K performance on an older workstation PC and I'm 
>>> finding that the CPU version of CP2K 2025.2 is faster than the GPU 
>>> version. My understanding is that many consumer GPUs do not have great 
>>> double precision performance, but I can't tell if the slower GPU timing is 
>>> normal for my system or if there is anything I can improve? For example, a 
>>> CPU-only H2O-32.inp benchmark is twice as fast as a GPU run. The timings 
>>> show that "grid_collocate_task_list" and "grid_integrate_task_list" are the 
>>> most time consuming steps.
>>>
>>> I came across a similar thread from 2018 issue73 
>>> <https://github.com/cp2k/cp2k/issues/73>, but I wonder how those 
>>> comments hold up for the 2025.2 CP2K version? Should I expect any 
>>> performance gains from a GPU on small systems (<250 atoms)? I attached the 
>>> ARCH files I used to build the CPU and GPU versions of CP2K along with the 
>>> output files from the H2O-32.inp benchmarks.
>>>
>>> My system has: hyperthreaded 4-core AMD Ryzen 5 2400G CPU, NVIDIA RTX 
>>> 3050 6 GB GPU, and 16 GB RAM.
>>>
>>> For CPU runs I use 4 MPI ranks with 2 OMP threads to get full CPU 
>>> utilization. For GPU runs I use 1 MPI rank with 2 OMP threads; increasing 
>>> OMP_NUM_THREADS to 4, 6, 8 does not show increased CPU utilization during a 
>>> GPU run.
>>>
>>> (I am unable to run H2O-64.inp on GPU because of a CUDA OOM 
>>> error: ERROR: "cudaErrorLaunchOutOfResources" at 
>>> /home/raf/cp2k-home/cp2k-colordiffusion/cp2k-2025.2/src/grid/gpu/grid_gpu_collocate.cu:387 )
>>>
>>> Thanks,
>>> Rafal
>>>
>>
