[CP2K-user] [CP2K:17631] cellopt calculation on EIGER aborted

Krack Matthias (PSI) matthias.krack at psi.ch
Thu Sep 8 11:28:07 UTC 2022


Hello

There is a hard limit (48*1024) coded in the GPU grid routines of CP2K because of the limited shared memory available on the GPU. Using more nodes does not help here, because this won’t increase the shared memory available per GPU. A workaround is to use the CPU implementation of grid_integrate instead of the GPU implementation by selecting the grid BACKEND <https://manual.cp2k.org/cp2k-2022_1-branch/CP2K_INPUT/GLOBAL/GRID.html#BACKEND> CPU explicitly (the default is AUTO, which selects GPU automatically on Piz Daint). Alternatively, you can try to change the code and increase that limit, e.g. to 52*1024 (the 51.41 kb requested in your run is slightly more than 51 kb), with the risk, however, of triggering other problems.
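
For illustration, selecting the CPU backend in the input file might look like the fragment below. This is a minimal sketch that assumes the section nesting implied by the manual link (GLOBAL, then GRID, then BACKEND); all other keywords already present in the GLOBAL section of the existing input stay unchanged.

```
&GLOBAL
  ! ... existing GLOBAL keywords stay here ...
  &GRID
    ! Use the CPU grid routines instead of the GPU ones (default: AUTO)
    BACKEND CPU
  &END GRID
&END GLOBAL
```

With this setting only the grid routines are moved to the CPU; whether the run then fits into memory and how much slower it becomes would have to be tested.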

I don’t know what causes the error on Eiger.

HTH

Matthias

From: "cp2k at googlegroups.com" <cp2k at googlegroups.com>
Reply to: "cp2k at googlegroups.com" <cp2k at googlegroups.com>
Date: Thursday, 8 September 2022 at 11:37
To: "cp2k at googlegroups.com" <cp2k at googlegroups.com>
Subject: [CP2K:17629] cellopt calculation on EIGER aborted

Hello all,

I am trying to run a cell optimization for a metal-organic framework using the SCAN functional and the rVV10 vdW functional. As I had problems with SCF convergence, I increased the cutoff and used the NN50_SMOOTH method for calculating the XC derivatives and the NN50 density smoothing for the XC calculations, as suggested in another conversation here.
The single-point calculation converged with these settings, but when I tried to run the cellopt on Piz Daint (32 nodes, 64 GB RAM per node) I got an out-of-memory error:
"ERROR: Not enough shared memory in grid_gpu_integrate.
cab_len: 4704, alpha_len: 1512, cxyz_len: 364, total smem_per_block: 51.406250 kb"
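
The total in this message can be reproduced from the three lengths. The sketch below assumes that each entry is an 8-byte double; that is an assumption, although it reproduces the printed value exactly. It then compares the request with the 48*1024 limit mentioned in the reply and with two possible raised limits.

```python
# Lengths reported in the error message (number of array entries).
cab_len, alpha_len, cxyz_len = 4704, 1512, 364

# Assumption: every entry is a double-precision number (8 bytes).
BYTES_PER_ENTRY = 8

smem_bytes = (cab_len + alpha_len + cxyz_len) * BYTES_PER_ENTRY
smem_kb = smem_bytes / 1024

hard_limit = 48 * 1024  # limit quoted in the reply

print(f"requested: {smem_bytes} bytes = {smem_kb:.6f} kb")
print(f"exceeds 48*1024: {smem_bytes > hard_limit}")
print(f"fits in 51*1024: {smem_bytes <= 51 * 1024}")
print(f"fits in 52*1024: {smem_bytes <= 52 * 1024}")
```

Under that assumption the request is 52640 bytes, which is above both 48*1024 and 51*1024 but below 52*1024.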


So I tried running the calculation on Alps (Eiger) instead (256 GB RAM per node). Now, as soon as the SCF calculation starts, I get an error in the CP2K output file that I don't understand:
"libfabric:187819:1662628695:cxi:core:cxip_ux_onload_cb():2259<warn> nid001534: RXC (0x2300:32:0): PtlTE 105LE resources not recovered during flow control. FI_CXI_RX_MATCH_MODE=[hybrid|software] is required.

Program received signal SIGABRT: Process abort signal."

Does anyone have an idea what went wrong?
I am using CP2K 9.1. I have attached my input file and the output file with the complete error message.
Thank you!
--
You received this message because you are subscribed to the Google Groups "cp2k" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cp2k+unsubscribe at googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cp2k/c0f4eecc-78a1-407c-a18d-20d35785d392n%40googlegroups.com.
