Hi everyone,<div><br></div><div>I'm deploying and optimizing CP2K on an AMD cluster with 4 AMD GPUs per node. I have the DBCSR and PW parts running on the GPUs, but I have questions about performance. I am testing the QS/H2O-256.inp and QS_DM_LS/H2O-dft-ls.NREP2.inp benchmarks:</div><div><br></div><div>1. To use all 4 GPUs, is there any problem with running more than 4 MPI processes (e.g., 8 or 16)?</div><div>2. I saw a performance boost in the QS_DM_LS tests, since all DBCSR operations were assigned to the GPUs, but the QS tests benefited very little. I checked the output log and found that only about 2% of the flops were processed by the GPUs (the ACC column below); only the 32 x 32 x 32 blocks went to the GPUs, and every other block size fell back to BLAS. How can I get more of the computation onto the GPUs? I understand that this is scheduled by the program.</div><div><br></div><div><div><div> -------------------------------------------------------------------------------</div><div> - -</div><div> - DBCSR STATISTICS -</div><div> - -</div><div> -------------------------------------------------------------------------------</div><div> COUNTER TOTAL BLAS SMM ACC</div><div> flops 9 x 9 x 32 1430456039424 100.0% 0.0% 0.0%</div><div> flops 32 x 32 x 32 1962800054272 0.0% 0.0% 100.0%</div><div> flops 22 x 9 x 32 1986255912960 100.0% 0.0% 0.0%</div><div> flops 9 x 22 x 32 1992003932160 100.0% 0.0% 0.0%</div><div> flops 22 x 22 x 32 2753958699008 100.0% 0.0% 0.0%</div><div> flops 32 x 32 x 9 4454954827776 100.0% 0.0% 0.0%</div><div> flops 32 x 32 x 22 5444944789504 100.0% 0.0% 0.0%</div><div> flops 9 x 32 x 32 5492290093056 100.0% 0.0% 0.0%</div><div> flops 22 x 32 x 32 6712799002624 100.0% 0.0% 0.0%</div><div> flops 9 x 32 x 9 11613072052224 100.0% 0.0% 0.0%</div><div> flops 22 x 32 x 9 15239176077312 100.0% 0.0% 0.0%</div><div> flops 9 x 32 x 22 15239176077312 100.0% 0.0% 0.0%</div><div> flops 22 x 32 x 22 19911132921856 100.0% 0.0% 0.0%</div>
<div> flops inhomo. stacks 0 0.0% 0.0% 0.0%</div><div> flops total 94.233020E+12 97.9% 0.0% 2.1%</div><div> flops max/rank 5.910120E+12 97.9% 0.0% 2.1%</div><div> matmuls inhomo. stacks 0 0.0% 0.0% 0.0%</div><div> matmuls total 6806383904 99.6% 0.0% 0.4%</div><div> number of processed stacks 728928 84.0% 0.0% 16.0%</div><div> average stack size 11073.8 0.0 256.0</div><div> marketing flops 145.650931E+12</div></div></div><div><br></div><div>The QS benchmark takes a long time to run, so I think getting more of it onto the GPUs is critical. Thanks.</div><div><br></div><div><br></div>
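<div>For reference, the roughly 2% GPU share can be reproduced from the per-kernel counters in the table above. This is only a minimal sketch: the numbers are copied by hand from the log, and the dictionary layout and the helper name <code>backend_share</code> are my own for illustration, not anything provided by CP2K or DBCSR.</div>

```python
# Per-kernel flop counts copied by hand from the DBCSR STATISTICS table.
# Key: block sizes (m, n, k); value: (flops, backend that processed 100% of them).
FLOPS = {
    (9, 9, 32): (1430456039424, "BLAS"),
    (32, 32, 32): (1962800054272, "ACC"),
    (22, 9, 32): (1986255912960, "BLAS"),
    (9, 22, 32): (1992003932160, "BLAS"),
    (22, 22, 32): (2753958699008, "BLAS"),
    (32, 32, 9): (4454954827776, "BLAS"),
    (32, 32, 22): (5444944789504, "BLAS"),
    (9, 32, 32): (5492290093056, "BLAS"),
    (22, 32, 32): (6712799002624, "BLAS"),
    (9, 32, 9): (11613072052224, "BLAS"),
    (22, 32, 9): (15239176077312, "BLAS"),
    (9, 32, 22): (15239176077312, "BLAS"),
    (22, 32, 22): (19911132921856, "BLAS"),
}


def backend_share(flops_by_kernel):
    """Return (total flops, {backend: percent of total flops})."""
    total = sum(flops for flops, _ in flops_by_kernel.values())
    per_backend = {}
    for flops, backend in flops_by_kernel.values():
        per_backend[backend] = per_backend.get(backend, 0) + flops
    shares = {b: 100.0 * f / total for b, f in per_backend.items()}
    return total, shares


total, shares = backend_share(FLOPS)
print(f"total flops: {total:.6e}")
for backend, pct in sorted(shares.items()):
    print(f"{backend}: {pct:.1f}%")
```

<div>The total matches the "flops total" row, and the only kernel counted under ACC is 32 x 32 x 32, which is why the GPU share is so small for this input.</div>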