Hi Frederick, <br /><br />I am writing this as a follow-up to our previous discussions. I am currently seeing a recurring problem with CP2K: tasks are killed after about 10 days with errors like those in the attached outputs. This is not particularly annoying, as a restart is sufficient and the simulation can continue. Unfortunately, I don't think you will be able to reproduce this error, given the very long simulation time. However, if there is anything else I can provide to help understand the source of these problems, let me know. <div><br /></div><div>Best</div><div>Bartosz<br /><br /></div><div class="gmail_quote"><div dir="auto" class="gmail_attr">On Monday, 28 October 2024 at 09:34:45 UTC+1, bartosz mazur wrote:<br/></div><blockquote class="gmail_quote" style="margin: 0 0 0 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">Many thanks, Frederick, for your help! <br><br><div class="gmail_quote"><div dir="auto" class="gmail_attr">On Friday, 25 October 2024 at 14:27:36 UTC+2, Frederick Stein wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>Regarding the other issues:</div><div>I can confirm them but cannot provide fixes for all of them because they probably trigger bugs in ifort. Because ifort is already deprecated, these bugs will probably not be fixed. Furthermore, we do not see any issues on our Intel CI. I will fix what I can, but some of them will remain, as we will focus our efforts on supporting the new ifx compiler.<br></div><br><div class="gmail_quote"><div dir="auto" class="gmail_attr">On Friday, 25 October 2024 at 11:46:00 UTC+2, Frederick Stein wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>Dear Bartosz,
<div></div><div>I will check the other issues with your regtests.</div>
</div><div>Regarding your latest issue, please provide more information, such as an output file or some context. If I am supposed to retry the calculation on my local machine, I need all additional input files, such as your PLUMED file. I can run your input file up to the point where CP2K needs PLUMED.</div><div>Best,</div><div>Frederick<br></div><div class="gmail_quote"><div dir="auto" class="gmail_attr">On Friday, 25 October 2024 at 10:15:19 UTC+2, bartosz mazur wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">I just got another error with LibXSMM, now in my regular simulation and without using OpenMP. This is the error:<div><br></div><div><font face="Courier New">```</font></div><div><font face="Courier New">[1729843139.920274] [r23c01b04:2913 :0] ib_md.c:295 UCX ERROR ibv_reg_mr(address=0x14f0b46fc080, length=7424, access=0xf) failed: Cannot allocate memory<br>[1729843139.920290] [r23c01b04:2913 :0] ucp_mm.c:70 UCX ERROR failed to register address 0x14f0b46fc080 (host) length 7424 on md[4]=mlx5_0: Input/output error (md supports: host)<br><br>LIBXSMM_VERSION: develop-1.17-3834 (25693946)[1729843139.932647] [r23c01b04:2945 :0] ib_md.c:295 UCX ERROR ibv_reg_mr(address=0x1491f069e040, length=8128, access=0xf) failed: Cannot allocate memory<br>[1729843139.932660] [r23c01b04:2945 :0] ucp_mm.c:70 UCX ERROR failed to register address 0x1491f069e040 (host) length 8128 on md[4]=mlx5_0: Input/output error (md supports: host)</font></div><div><font face="Courier New"><br><br>CLX/DP TRY JIT STA COL<br> 0..13 4 4 0 0<br></font></div><div><font face="Courier New"> 14..23 4 4 0 0</font></div><div><font face="Courier New"><br> 24..64 0 0 0 0<br></font></div><div><font face="Courier New">Registry and code: 13 MB + 80 KB (gemm=8)<br>Command (PID=2913): /lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp -i cp2k.inp -o cp2k.out<br>Uptime: 407633.177169 
s</font></div><div><font face="Courier New">```</font></div><div><br></div><div>and this is simulation input I'm using:</div><div><br></div><div><font face="Courier New">```</font></div><div><font face="Courier New">&GLOBAL<br> PROJECT uam1o_npt_rms<br> RUN_TYPE MD<br> PRINT_LEVEL LOW<br> PREFERRED_DIAG_LIBRARY SCALAPACK<br>&END GLOBAL<br><br>&FORCE_EVAL<br> METHOD QUICKSTEP<br> STRESS_TENSOR ANALYTICAL<br> &DFT<br> BASIS_SET_FILE_NAME BASIS_MOLOPT_UZH<br> POTENTIAL_FILE_NAME POTENTIAL_UZH<br> &MGRID<br> CUTOFF 500<br> &END MGRID<br> &XC<br> &XC_FUNCTIONAL PBE<br> &END XC_FUNCTIONAL<br> &VDW_POTENTIAL<br> POTENTIAL_TYPE PAIR_POTENTIAL<br> &PAIR_POTENTIAL<br> TYPE DFTD3(BJ)<br> PARAMETER_FILE_NAME dftd3.dat<br> REFERENCE_FUNCTIONAL PBE<br> R_CUTOFF 25.0<br> &END PAIR_POTENTIAL<br> &END VDW_POTENTIAL<br> &END XC<br> &END DFT<br><br> &SUBSYS<br> &CELL<br> A 12.2807999 0.0000000 0.0000000<br> B 7.6258602 9.6257200 0.0000000<br> C -2.1557724 -1.0420258 18.0042801<br> &END CELL<br> &COORD<br> Zn 11.37811 4.60286 0.24515<br> Zn 8.15435 3.05288 8.74518<br> Zn 6.37590 3.97311 17.74650<br> Zn 9.59842 5.54014 9.24747<br> S 11.79344 6.72692 17.10850<br> S 4.06825 3.00573 9.90358<br> S 5.95830 1.84422 0.90027<br> S 13.67407 5.58944 8.10767<br> O 10.72408 3.58291 1.89315<br> O 8.51986 4.01962 1.53085<br> O 6.60135 3.91587 7.68572<br> O 7.74637 5.79259 8.21600<br> O 15.32810 8.58246 5.10041<br> O 9.35608 2.93551 7.09500<br> O 10.38999 4.93007 7.45977<br> O 11.66491 6.35111 1.31266<br> O 9.48582 6.62478 0.77364<br> O 2.59062 2.40094 3.91496<br> O 7.03031 4.99173 16.09885<br> O 9.23544 4.56122 16.46252<br> O 11.14602 4.67776 10.31440<br> O 10.00982 2.79915 9.77218<br> O 2.41388 0.01898 12.91899<br> O 8.39375 5.66143 10.89628<br> O 7.36998 3.66087 10.53589<br> O 6.08863 2.22161 16.68336<br> O 8.26988 1.95313 17.21650<br> O 15.16937 6.16381 14.09906<br> N 13.25907 3.80728 0.04001<br> N 2.36335 -0.74130 17.33402<br> N 7.60676 1.08576 8.95623<br> N 15.77729 5.75974 9.67861<br> N 
4.49430 4.76652 17.95756<br> N 15.38873 9.31230 0.67467<br> N 10.14308 7.50848 9.04236<br> N 1.96529 2.83557 8.33233<br> C 6.76554 5.18292 7.68414<br> C 14.28210 4.11624 0.86006<br> C 9.47998 3.39622 2.09658<br> C 3.20112 3.42080 0.84626<br> C 9.91466 1.18589 3.17244<br> C 9.08210 2.29987 3.02657<br> C 5.74710 6.04945 7.01821<br> C 7.83265 2.30920 3.66005<br> C 3.35793 2.34328 -0.04029<br> C 4.51663 1.46385 -0.02755<br> C 16.24194 7.75266 5.73606<br> C 4.78940 5.52817 6.14198<br> C 7.40810 1.21174 4.39947<br> C 16.18016 6.38244 5.49010<br> C 9.48869 0.06986 3.88005<br> C 11.27238 1.77457 17.14330<br> C 5.77166 7.43009 7.27236<br> C 11.14819 8.24901 17.58588<br> C 8.22170 0.08058 4.47135<br> C 0.15087 1.02286 17.07544<br> C 17.16180 8.28565 6.64351<br> C 10.57067 7.01060 1.31282<br> C 6.72654 0.47459 8.14002<br> C 10.27972 3.79035 6.89470<br> C 14.15006 8.72843 8.15880<br> C 11.73751 2.06868 5.82537<br> C 11.38838 3.41515 5.96966<br> C 10.52304 8.34339 1.98566<br> C 12.16584 4.39562 5.33967<br> C 14.89762 7.93801 9.04648<br> C 14.86698 6.48365 9.03575<br> C 2.67167 1.17044 3.27681<br> C 11.52468 8.76552 2.86608<br> C 13.29140 4.04007 4.60622<br> C 3.78230 0.36534 3.52266<br> C 12.87823 1.70260 5.12344<br> C 8.27761 0.34001 9.85941<br> C 9.42677 9.18364 1.73295<br> C 3.27553 4.45658 9.42657<br> C 13.66559 2.69775 4.53650<br> C 15.77023 8.59069 9.93240<br> C 1.68356 0.78491 2.36643<br> C 10.98451 3.41041 10.31327<br> C 3.46873 4.45681 17.14097<br> C 8.27403 5.18373 15.89814<br> C 14.54907 5.15099 17.15930<br> C 7.83119 7.39584 14.82858<br> C 8.66916 6.28563 14.97331<br> C 11.99928 2.54577 10.98702<br> C 9.92072 6.28547 14.34388<br> C 16.54982 7.26986 0.04271<br> C 15.39103 8.14919 0.03189<br> C 1.50023 0.84646 12.27989<br> C 12.95126 3.06908 11.86817<br> C 10.34198 7.38826 13.61070<br> C 1.55836 2.21699 12.52561<br> C 8.25354 8.51697 14.12666<br> C 6.48249 6.79770 0.85630<br> C 11.97760 1.16465 10.73446<br> C 6.60385 0.32218 0.42301<br> C 9.52282 8.51550 13.54043<br> 
C 17.60321 7.54791 0.92891<br> C 0.58530 0.31102 11.36884<br> C 7.18362 1.56332 16.68291<br> C 11.01926 8.11905 9.86341<br> C 7.47582 4.80132 11.10039<br> C 3.59282 -0.13430 9.84955<br> C 6.01179 6.51430 12.17471<br> C 6.36853 5.17005 12.02942<br> C 7.23131 0.22715 16.01652<br> C 5.59963 4.18477 12.66234<br> C 2.84614 0.65728 8.96213<br> C 2.87561 2.11161 8.97508<br> C 15.08536 7.39548 14.73440<br> C 6.23001 -0.19920 15.13769<br> C 4.47482 4.53325 13.40042<br> C 13.97400 8.19851 14.48576<br> C 4.87173 6.87322 12.88120<br> C 9.47231 8.25578 8.14046<br> C 8.32790 -0.61137 16.27301<br> C 14.46698 4.13864 8.58475<br> C 4.09294 5.87331 13.47165<br> C 1.97640 0.00563 8.07267<br> C 16.07240 7.78504 15.64417<br> H 14.10215 4.93465 1.55678<br> H 3.98110 3.68721 1.55899<br> H 10.89072 1.19647 2.69205<br> H 7.19958 3.19021 3.56839<br> H 4.75923 4.45384 5.96230<br> H 6.45299 1.21835 4.92062<br> H 15.44211 6.00062 4.78824<br> H 17.75043 8.81610 3.97156<br> H 10.41563 1.57993 16.49923<br> H 6.49332 7.81303 7.99143<br> H 0.24800 0.19739 16.37425<br> H 9.53586 -0.26872 6.84508<br> H 6.19685 1.12218 7.44173<br> H 13.45550 8.28133 7.44815<br> H 11.11633 1.31384 6.30260<br> H 11.87413 5.44074 5.42962<br> H 12.38442 8.12016 3.04474<br> H 13.88694 4.78876 4.08791<br> H 4.53915 0.70283 4.22717<br> H 0.88557 0.65625 5.03328<br> H 8.96418 0.89159 10.50060<br> H 8.67994 8.85961 1.01083<br> H 16.35704 8.00331 10.63471<br> H 13.12606 1.45212 2.16563<br> H 3.64702 3.63930 16.44281<br> H 13.76743 4.88477 16.44833<br> H 6.85355 7.37827 15.30535<br> H 10.55820 5.40745 14.43410<br> H 12.97886 4.14375 12.04672<br> H 11.29905 7.38966 13.09313<br> H 2.29216 2.60091 13.23073<br> H -0.01303 -0.23279 14.03603<br> H 7.34113 6.99275 1.49776<br> H 11.26049 0.78023 10.01184<br> H 17.50743 8.37258 1.63130<br> H 8.21398 8.86531 11.16822<br> H 11.54834 7.47018 10.56097<br> H 4.28503 0.31205 10.56295<br> H 6.62643 7.27289 11.69479<br> H 5.89748 3.14154 12.57118<br> H 5.36986 0.44461 14.95599<br> H 3.88656 
3.78035 13.92095<br> H 13.21826 7.85764 13.78163<br> H 16.85773 7.91771 12.97237<br> H 8.78884 7.70469 7.49554<br> H 9.07452 -0.28399 16.99402<br> H 1.39009 0.59398 7.37083<br> H 4.63062 7.11938 15.84758<br> &END COORD<br> &KIND Zn<br> BASIS_SET TZVP-MOLOPT-PBE-GTH-q12<br> POTENTIAL GTH-PBE-q12<br> &END KIND<br> &KIND S<br> BASIS_SET TZVP-MOLOPT-PBE-GTH-q6<br> POTENTIAL GTH-PBE-q6<br> &END KIND<br> &KIND O<br> BASIS_SET TZVP-MOLOPT-PBE-GTH-q6<br> POTENTIAL GTH-PBE-q6<br> &END KIND<br> &KIND N<br> BASIS_SET TZVP-MOLOPT-PBE-GTH-q5<br> POTENTIAL GTH-PBE-q5<br> &END KIND<br> &KIND C<br> BASIS_SET TZVP-MOLOPT-PBE-GTH-q4<br> POTENTIAL GTH-PBE-q4<br> &END KIND<br> &KIND H<br> BASIS_SET TZVP-MOLOPT-PBE-GTH-q1<br> POTENTIAL GTH-PBE-q1<br> &END KIND<br> &END SUBSYS<br>&END FORCE_EVAL<br><br>&MOTION<br> &MD<br> ENSEMBLE NPT_I<br> TEMPERATURE 298<br> TIMESTEP 1.0<br> STEPS 50000<br> &THERMOSTAT<br> TYPE NOSE<br> &NOSE<br> LENGTH 3<br> YOSHIDA 3<br> TIMECON 1000<br> &END NOSE<br> &END THERMOSTAT<br> &BAROSTAT<br> PRESSURE 1.0<br> TIMECON 4000<br> &END BAROSTAT<br> &END MD<br> &FREE_ENERGY<br> METHOD METADYN<br> &METADYN<br> USE_PLUMED .TRUE.<br> PLUMED_INPUT_FILE plumed.dat<br> &END METADYN<br> &END FREE_ENERGY<br> &PRINT<br> &TRAJECTORY<br> &EACH<br> MD 5<br> &END EACH<br> &END TRAJECTORY<br> &FORCES<br> UNIT eV*angstrom^-1<br> &EACH<br> MD 5<br> &END EACH<br> &END FORCES<br> &CELL<br> &EACH<br> MD 5<br> &END EACH<br> &END CELL<br> &END PRINT<br>&END MOTION<br>```</font></div><div><br></div><div>This simulation was performed with the previous version of CP2K (so without your fix). </div><div class="gmail_quote"><div dir="auto" class="gmail_attr">On Friday, 25 October 2024 at 09:50:47 UTC+2, bartosz mazur wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi Frederick, <div><br></div><div>it helped with most of the tests! Now only 13 have failed. 
In the attachments you will find the full output from the regtests, and here is the output from a single job with TRACE enabled:</div><div><br></div><div><font face="Courier New">```<br>Loading intel/2024a<br> Loading requirement: GCCcore/13.3.0 zlib/1.3.1-GCCcore-13.3.0<br> binutils/2.42-GCCcore-13.3.0 intel-compilers/2024.2.0<br> numactl/2.0.18-GCCcore-13.3.0 UCX/1.16.0-GCCcore-13.3.0<br> impi/2021.13.0-intel-compilers-2024.2.0 imkl/2024.2.0 iimpi/2024a<br> imkl-FFTW/2024.2.0-iimpi-2024a</font></div><div><font face="Courier New"><br>Currently Loaded Modulefiles:<br> 1) GCCcore/13.3.0 7) impi/2021.13.0-intel-compilers-2024.2.0 <br> 2) zlib/1.3.1-GCCcore-13.3.0 8) imkl/2024.2.0 <br> 3) binutils/2.42-GCCcore-13.3.0 9) iimpi/2024a <br> 4) intel-compilers/2024.2.0 10) imkl-FFTW/2024.2.0-iimpi-2024a <br> 5) numactl/2.0.18-GCCcore-13.3.0 11) intel/2024a <br> 6) UCX/1.16.0-GCCcore-13.3.0 <br></font></div><div><font face="Courier New">2 MPI processes with 2 OpenMP threads each<br>started at Fri Oct 25 09:34:34 CEST 2024 in /lustre/tmp/slurm/3127182<br>SIRIUS 7.6.1, git hash: <a href="https://api.github.com/repos/electronic-structure/SIRIUS/git/ref/tags/v7.6.1" rel="nofollow" target="_blank" data-saferedirecturl="https://www.google.com/url?hl=pl&q=https://api.github.com/repos/electronic-structure/SIRIUS/git/ref/tags/v7.6.1&source=gmail&ust=1732200829440000&usg=AOvVaw0q0uHkdJ-Top5sFRG3sICC">https://api.github.com/repos/electronic-structure/SIRIUS/git/ref/tags/v7.6.1</a><br>Warning! 
Compiled in 'debug' mode with assert statements enabled!</font></div><div><font face="Courier New"><br><br>LIBXSMM_VERSION: develop-1.17-3834 (25693946)<br>CLX/DP TRY JIT STA COL<br></font></div><div><font face="Courier New"> 0..13 8 8 0 0 <br></font></div><div><font face="Courier New"> 14..23 0 0 0 0 <br> 24..64 0 0 0 0 <br></font></div><div><font face="Courier New">Registry and code: 13 MB + 64 KB (gemm=8)<br>Command (PID=423503): /lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp -i dftd3src1.inp -o dftd3src1.out<br>Uptime: 2.752513 s</font></div><div><font face="Courier New"><br><br>===================================================================================<br>= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES<br></font></div><div><font face="Courier New">= RANK 0 PID 423503 RUNNING AT r21c01b03</font></div><div><font face="Courier New"><br>= KILLED BY SIGNAL: 11 (Segmentation fault)<br>===================================================================================<br><br>===================================================================================<br>= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES<br></font></div><div><font face="Courier New">= RANK 1 PID 423504 RUNNING AT r21c01b03</font></div><div><font face="Courier New"><br>= KILLED BY SIGNAL: 9 (Killed)<br>===================================================================================<br></font></div><div><font face="Courier New">finished at Fri Oct 25 09:34:39 CEST 2024<br>```</font><br><br>and the last lines:<br><br><font face="Courier New">```<br> 000000:000002<< 13 3 mp_sendrecv_dm2 <br> 0.000 Hostmem: 955 MB GPUmem: 0 MB<br> 000000:000002>> 13 4 mp_sendrecv_dm2 <br> start Hostmem: 955 MB GPUmem: 0 MB<br> 000000:000002<< 13 4 mp_sendrecv_dm2 <br> 0.000 Hostmem: 955 MB GPUmem: 0 MB<br> 000000:000002<< 12 2 pw_nn_compose_r 0<br> .003 Hostmem: 955 MB GPUmem: 0 MB<br> 000000:000002<< 11 1 xc_pw_derive 0.003 H<br> ostmem: 955 MB GPUmem: 0 MB<br> 
000000:000002>> 11 5 pw_zero start Hostme<br> m: 955 MB GPUmem: 0 MB<br> 000000:000002<< 11 5 pw_zero 0.000 Hostme<br> m: 955 MB GPUmem: 0 MB<br> 000000:000002>> 11 2 xc_pw_derive start H<br> ostmem: 955 MB GPUmem: 0 MB<br> 000000:000002>> 12 3 pw_nn_compose_r s<br> tart Hostmem: 955 MB GPUmem: 0 MB<br> 000000:000002>> 13 5 mp_sendrecv_dm2 <br> start Hostmem: 955 MB GPUmem: 0 MB<br> 000000:000002<< 13 5 mp_sendrecv_dm2 <br> 0.000 Hostmem: 955 MB GPUmem: 0 MB<br> 000000:000002>> 13 6 mp_sendrecv_dm2 <br> start Hostmem: 955 MB GPUmem: 0 MB<br> 000000:000002<< 13 6 mp_sendrecv_dm2 <br> 0.000 Hostmem: 955 MB GPUmem: 0 MB<br> 000000:000002<< 12 3 pw_nn_compose_r 0<br> .002 Hostmem: 955 MB GPUmem: 0 MB<br> 000000:000002<< 11 2 xc_pw_derive 0.002 H<br> ostmem: 955 MB GPUmem: 0 MB<br> 000000:000002>> 11 6 pw_zero start Hostme<br> m: 955 MB GPUmem: 0 MB<br> 000000:000002<< 11 6 pw_zero 0.001 Hostme<br> m: 960 MB GPUmem: 0 MB<br> 000000:000002>> 11 3 xc_pw_derive start H<br> ostmem: 960 MB GPUmem: 0 MB<br> 000000:000002>> 12 4 pw_nn_compose_r s<br> tart Hostmem: 960 MB GPUmem: 0 MB<br> 000000:000002>> 13 7 mp_sendrecv_dm2 <br> start Hostmem: 960 MB GPUmem: 0 MB<br> 000000:000002<< 13 7 mp_sendrecv_dm2 <br> 0.000 Hostmem: 960 MB GPUmem: 0 MB<br> 000000:000002>> 13 8 mp_sendrecv_dm2 <br> start Hostmem: 960 MB GPUmem: 0 MB<br> 000000:000002<< 13 8 mp_sendrecv_dm2 <br> 0.000 Hostmem: 960 MB GPUmem: 0 MB<br> 000000:000002<< 12 4 pw_nn_compose_r 0<br> .002 Hostmem: 960 MB GPUmem: 0 MB<br> 000000:000002<< 11 3 xc_pw_derive 0.002 H<br> ostmem: 960 MB GPUmem: 0 MB<br> 000000:000002>> 11 1 pw_spline_scale_deriv <br> start Hostmem: 960 MB GPUmem: 0 MB<br> 000000:000002<< 11 1 pw_spline_scale_deriv <br> 0.001 Hostmem: 960 MB GPUmem: 0 MB<br> 000000:000002>> 11 20 pw_pool_give_back_pw <br> start Hostmem: 965 MB GPUmem: 0 MB<br> 000000:000002<< 11 20 pw_pool_give_back_pw <br> 0.000 Hostmem: 965 MB GPUmem: 0 MB<br> 000000:000002>> 11 21 pw_pool_give_back_pw <br> start Hostmem: 965 MB 
GPUmem: 0 MB<br> 000000:000002<< 11 21 pw_pool_give_back_pw <br> 0.000 Hostmem: 965 MB GPUmem: 0 MB<br> 000000:000002>> 11 22 pw_pool_give_back_pw <br> start Hostmem: 965 MB GPUmem: 0 MB<br> 000000:000002<< 11 22 pw_pool_give_back_pw <br> 0.000 Hostmem: 965 MB GPUmem: 0 MB<br> 000000:000002>> 11 23 pw_pool_give_back_pw <br> start Hostmem: 965 MB GPUmem: 0 MB<br> 000000:000002<< 11 23 pw_pool_give_back_pw <br> 0.000 Hostmem: 965 MB GPUmem: 0 MB<br> 000000:000002>> 11 1 xc_functional_eval s<br> tart Hostmem: 965 MB GPUmem: 0 MB<br> 000000:000002>> 12 1 b97_lda_eval star<br> t Hostmem: 965 MB GPUmem: 0 MB<br> 000000:000002<< 12 1 b97_lda_eval 0.10<br> 3 Hostmem: 979 MB GPUmem: 0 MB<br> 000000:000002<< 11 1 xc_functional_eval 0<br> .103 Hostmem: 979 MB GPUmem: 0 MB<br> 000000:000002<< 10 1 xc_rho_set_and_dset_create <br> 0.120 Hostmem: 979 MB GPUmem: 0 MB<br> 000000:000002>> 10 1 check_for_derivatives s<br> tart Hostmem: 979 MB GPUmem: 0 MB<br> 000000:000002<< 10 1 check_for_derivatives 0<br> .000 Hostmem: 979 MB GPUmem: 0 MB<br> 000000:000002>> 10 14 pw_create_r3d start Hos<br> tmem: 979 MB GPUmem: 0 MB<br> 000000:000002<< 10 14 pw_create_r3d 0.000 Hos<br> tmem: 979 MB GPUmem: 0 MB<br> 000000:000002>> 10 15 pw_create_r3d start Hos<br> tmem: 979 MB GPUmem: 0 MB<br> 000000:000002<< 10 15 pw_create_r3d 0.000 Hos<br> tmem: 979 MB GPUmem: 0 MB<br> 000000:000002>> 10 16 pw_create_r3d start Hos<br> tmem: 979 MB GPUmem: 0 MB<br> 000000:000002<< 10 16 pw_create_r3d 0.000 Hos<br> tmem: 979 MB GPUmem: 0 MB<br> 000000:000002>> 10 17 pw_create_r3d start Hos<br> tmem: 979 MB GPUmem: 0 MB<br> 000000:000002<< 10 17 pw_create_r3d 0.000 Hos<br> tmem: 979 MB GPUmem: 0 MB<br>```</font><br><br></div><div>Best</div><div>Bartosz</div><div><br></div><div class="gmail_quote"><div dir="auto" class="gmail_attr">środa, 23 października 2024 o 09:15:33 UTC+2 Frederick Stein napisał(a):<br></div><blockquote class="gmail_quote" style="margin:0 0 0 0.8ex;border-left:1px solid 
rgb(204,204,204);padding-left:1ex"><div>Dear Bartosz,</div><div>My fix is merged. Can you switch to the CP2K master and try it again? We are still working on a few issues with the Intel compilers so that we can eventually migrate from ifort to ifx.</div><div>Best,</div><div>Frederick<br></div><br><div class="gmail_quote"><div dir="auto" class="gmail_attr">On Tuesday, 22 October 2024 at 17:45:21 UTC+2, bartosz mazur wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Great! Thank you for your help. <div><br></div><div>Best</div><div>Bartosz<br><br></div><div class="gmail_quote"><div dir="auto" class="gmail_attr">On Tuesday, 22 October 2024 at 15:24:04 UTC+2, Frederick Stein wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>I have a fix for it. Contrary to my first guess, it is a case of invalid type conversion from real to complex numbers in pw_derive (yes, Fortran is rather strict about that). The same problem may also be present in a few other places. I am currently running more tests and I will open a pull request within the next few days.</div><div>Best,</div><div>Frederick<br></div><br><div class="gmail_quote"><div dir="auto" class="gmail_attr">On Tuesday, 22 October 2024 at 13:12:49 UTC+2, Frederick Stein wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">I can reproduce the error locally. I am investigating it now.<br><br><div class="gmail_quote"><div dir="auto" class="gmail_attr">On Tuesday, 22 October 2024 at 11:58:57 UTC+2, bartosz mazur wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">I loaded it because it was needed for compilation. 
I have unloaded the module, but the error still occurs: <div><br></div><div><font face="Courier New"></font></div><div><font face="Courier New">```<br>LIBXSMM_VERSION: develop-1.17-3834 (25693946)<br>CLX/DP TRY JIT STA COL<br> 0..13 2 2 0 0 <br> 14..23 0 0 0 0 <br> 24..64 0 0 0 0 <br>Registry and code: 13 MB + 16 KB (gemm=2)<br></font></div><div><font face="Courier New">Command (PID=15485): /lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp -i H2O-9.inp -o H2O-9.out<br>Uptime: 1.757102 s</font></div><div><font face="Courier New"><br><br>===================================================================================<br>= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES<br></font></div><div><font face="Courier New">= RANK 0 PID 15485 RUNNING AT r30c01b01</font></div><div><font face="Courier New"><br>= KILLED BY SIGNAL: 11 (Segmentation fault)<br>===================================================================================<br><br>===================================================================================<br>= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES<br></font></div><div><font face="Courier New">= RANK 1 PID 15486 RUNNING AT r30c01b01</font></div><div><font face="Courier New"><br>= KILLED BY SIGNAL: 9 (Killed)<br>===================================================================================<br>```</font></div><div><font face="Courier New"></font><br><br>and the last 100 lines:<br><br><font face="Courier New">```<br> 000000:000002>> 11 37 pw_create_c1d start <br> Hostmem: 697 MB GPUmem: 0 MB<br> 000000:000002<< 11 37 pw_create_c1d 0.000 <br> Hostmem: 697 MB GPUmem: 0 MB<br> 000000:000002<< 10 64 pw_pool_create_pw 0.000<br> Hostmem: 697 MB GPUmem: 0 MB<br> 000000:000002>> 10 25 pw_copy start Hostmem: <br> 697 MB GPUmem: 0 MB<br> 000000:000002<< 10 25 pw_copy 0.001 Hostmem: <br> 697 MB GPUmem: 0 MB<br> 000000:000002>> 10 17 pw_axpy start Hostmem: <br> 697 MB GPUmem: 0 MB<br> 000000:000002<< 10 17 pw_axpy 0.001 
Hostmem: <br> 697 MB GPUmem: 0 MB<br> 000000:000002>> 10 19 mp_sum_d start Hostmem:<br> 697 MB GPUmem: 0 MB<br> 000000:000002<< 10 19 mp_sum_d 0.000 Hostmem:<br> 697 MB GPUmem: 0 MB<br> 000000:000002>> 10 3 pw_poisson_solve start <br> Hostmem: 697 MB GPUmem: 0 MB<br> 000000:000002>> 11 3 pw_poisson_rebuild s<br> tart Hostmem: 697 MB GPUmem: 0 MB<br> 000000:000002<< 11 3 pw_poisson_rebuild 0<br> .000 Hostmem: 697 MB GPUmem: 0 MB<br> 000000:000002>> 11 65 pw_pool_create_pw st<br> art Hostmem: 697 MB GPUmem: 0 MB<br> 000000:000002>> 12 38 pw_create_c1d sta<br> rt Hostmem: 697 MB GPUmem: 0 MB<br> 000000:000002<< 12 38 pw_create_c1d 0.0<br> 00 Hostmem: 697 MB GPUmem: 0 MB<br> 000000:000002<< 11 65 pw_pool_create_pw 0.<br> 000 Hostmem: 697 MB GPUmem: 0 MB<br> 000000:000002>> 11 26 pw_copy start Hostme<br> m: 697 MB GPUmem: 0 MB<br> 000000:000002<< 11 26 pw_copy 0.001 Hostme<br> m: 697 MB GPUmem: 0 MB<br> 000000:000002>> 11 3 pw_multiply_with sta<br> rt Hostmem: 697 MB GPUmem: 0 MB<br> 000000:000002<< 11 3 pw_multiply_with 0.0<br> 01 Hostmem: 697 MB GPUmem: 0 MB<br> 000000:000002>> 11 27 pw_copy start Hostme<br> m: 697 MB GPUmem: 0 MB<br> 000000:000002<< 11 27 pw_copy 0.001 Hostme<br> m: 697 MB GPUmem: 0 MB<br> 000000:000002>> 11 3 pw_integral_ab start<br> Hostmem: 697 MB GPUmem: 0 MB<br> 000000:000002>> 12 20 mp_sum_d start Ho<br> stmem: 697 MB GPUmem: 0 MB<br> 000000:000002<< 12 20 mp_sum_d 0.001 Ho<br> stmem: 697 MB GPUmem: 0 MB<br> 000000:000002<< 11 3 pw_integral_ab 0.004<br> Hostmem: 697 MB GPUmem: 0 MB<br> 000000:000002>> 11 4 pw_poisson_set start<br> Hostmem: 697 MB GPUmem: 0 MB<br> 000000:000002>> 12 66 pw_pool_create_pw <br> start Hostmem: 697 MB GPUmem: 0 MB<br> 000000:000002>> 13 39 pw_create_c1d <br> start Hostmem: 697 MB GPUmem: 0 MB<br> 000000:000002<< 13 39 pw_create_c1d <br> 0.000 Hostmem: 697 MB GPUmem: 0 MB<br> 000000:000002<< 12 66 pw_pool_create_pw <br> 0.000 Hostmem: 697 MB GPUmem: 0 MB<br> 000000:000002>> 12 28 pw_copy start Hos<br> tmem: 697 MB 
GPUmem: 0 MB<br> 000000:000002<< 12 28 pw_copy 0.001 Hos<br> tmem: 697 MB GPUmem: 0 MB<br> 000000:000002>> 12 7 pw_derive start H<br> ostmem: 697 MB GPUmem: 0 MB<br> 000000:000002<< 12 7 pw_derive 0.002 H<br> ostmem: 697 MB GPUmem: 0 MB<br> 000000:000002>> 12 67 pw_pool_create_pw <br> start Hostmem: 697 MB GPUmem: 0 MB<br> 000000:000002>> 13 40 pw_create_c1d <br> start Hostmem: 697 MB GPUmem: 0 MB<br> 000000:000002<< 13 40 pw_create_c1d <br> 0.000 Hostmem: 697 MB GPUmem: 0 MB<br> 000000:000002<< 12 67 pw_pool_create_pw <br> 0.000 Hostmem: 697 MB GPUmem: 0 MB<br> 000000:000002>> 12 29 pw_copy start Hos<br> tmem: 697 MB GPUmem: 0 MB<br> 000000:000002<< 12 29 pw_copy 0.001 Hos<br> tmem: 697 MB GPUmem: 0 MB<br> 000000:000002>> 12 8 pw_derive start H<br> ostmem: 697 MB GPUmem: 0 MB<br> 000000:000002<< 12 8 pw_derive 0.002 H<br> ostmem: 697 MB GPUmem: 0 MB<br> 000000:000002>> 12 68 pw_pool_create_pw <br> start Hostmem: 697 MB GPUmem: 0 MB<br> 000000:000002>> 13 41 pw_create_c1d <br> start Hostmem: 697 MB GPUmem: 0 MB<br> 000000:000002<< 13 41 pw_create_c1d <br> 0.000 Hostmem: 697 MB GPUmem: 0 MB<br> 000000:000002<< 12 68 pw_pool_create_pw <br> 0.000 Hostmem: 697 MB GPUmem: 0 MB<br> 000000:000002>> 12 30 pw_copy start Hos<br> tmem: 697 MB GPUmem: 0 MB<br> 000000:000002<< 12 30 pw_copy 0.001 Hos<br> tmem: 697 MB GPUmem: 0 MB<br> 000000:000002>> 12 9 pw_derive start H<br> ostmem: 697 MB GPUmem: 0 MB<br> ```</font><br><br></div><div>This is the list of currently loaded modules (all come with intel):</div><div><br></div><div><font face="Courier New">```</font></div><div><font face="Courier New">Currently Loaded Modulefiles:<br> 1) GCCcore/13.3.0 7) impi/2021.13.0-intel-compilers-2024.2.0 <br> 2) zlib/1.3.1-GCCcore-13.3.0 8) imkl/2024.2.0 <br> 3) binutils/2.42-GCCcore-13.3.0 9) iimpi/2024a <br> 4) intel-compilers/2024.2.0 10) imkl-FFTW/2024.2.0-iimpi-2024a <br> 5) numactl/2.0.18-GCCcore-13.3.0 11) intel/2024a <br> 6) UCX/1.16.0-GCCcore-13.3.0 </font></div><div><font 
face="Courier New">```</font></div><div class="gmail_quote"><div dir="auto" class="gmail_attr">On Tuesday, 22 October 2024 at 11:12:57 UTC+2, Frederick Stein wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>Dear Bartosz,</div><div>I am currently running some tests with the latest Intel compiler myself. What bothers me about your setup is the module GCCcore/13.3.0. Why is it loaded? Can you unload it? This would at least reduce potential interference between the Intel and the GCC compilers.</div><div>Best,</div><div>Frederick<br></div><br><div class="gmail_quote"><div dir="auto" class="gmail_attr">On Monday, 21 October 2024 at 16:33:45 UTC+2, bartosz mazur wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">The error for ssmp is:<div><br></div><div><font face="Courier New">```</font></div><div><font face="Courier New"></font></div><div><font face="Courier New">LIBXSMM_VERSION: develop-1.17-3834 (25693946)<br>CLX/DP TRY JIT STA COL<br></font></div><div><font face="Courier New"> 0..13 4 4 0 0 <br></font></div><div><font face="Courier New"> 14..23 0 0 0 0 <br> 24..64 0 0 0 0 <br></font></div><div><font face="Courier New">Registry and code: 13 MB + 32 KB (gemm=4)<br>Command (PID=54845): /lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.ssmp -i H2O-9.inp -o H2O-9.out<br>Uptime: 2.861583 s<br>/var/spool/slurmd/r30c01b15/job3120330/slurm_script: line 36: 54845 Segmentation fault (core dumped) /lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.ssmp -i H2O-9.inp -o H2O-9.out</font></div><div><font face="Courier New">```</font></div><div><br></div><div>and the last 100 lines of the output:</div><div><br></div><div><font face="Courier New">```</font></div><div><font face="Courier New"> 000000:000001>> 12 20 mp_sum_d start Ho<br> stmem: 380 MB GPUmem: 0 MB<br> 000000:000001<< 12 20 
mp_sum_d 0.000 Ho<br> stmem: 380 MB GPUmem: 0 MB<br> 000000:000001<< 11 13 dbcsr_dot_sd 0.000 H<br> ostmem: 380 MB GPUmem: 0 MB<br> 000000:000001<< 10 12 calculate_ptrace_kp 0.0<br> 00 Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001<< 9 6 evaluate_core_matrix_traces <br> 0.000 Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001>> 9 6 rebuild_ks_matrix start Ho<br> stmem: 380 MB GPUmem: 0 MB<br> 000000:000001>> 10 6 qs_ks_build_kohn_sham_matrix <br> start Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001>> 11 140 pw_pool_create_pw st<br> art Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001>> 12 79 pw_create_c1d sta<br> rt Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001<< 12 79 pw_create_c1d 0.0<br> 00 Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001<< 11 140 pw_pool_create_pw 0.<br> 000 Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001>> 11 141 pw_pool_create_pw st<br> art Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001>> 12 80 pw_create_c1d sta<br> rt Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001<< 12 80 pw_create_c1d 0.0<br> 00 Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001<< 11 141 pw_pool_create_pw 0.<br> 000 Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001>> 11 61 pw_copy start Hostme<br> m: 380 MB GPUmem: 0 MB<br> 000000:000001<< 11 61 pw_copy 0.004 Hostme<br> m: 380 MB GPUmem: 0 MB<br> 000000:000001>> 11 35 pw_axpy start Hostme<br> m: 380 MB GPUmem: 0 MB<br> 000000:000001<< 11 35 pw_axpy 0.002 Hostme<br> m: 380 MB GPUmem: 0 MB<br> 000000:000001>> 11 6 pw_poisson_solve sta<br> rt Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001>> 12 6 pw_poisson_rebuild <br> start Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001<< 12 6 pw_poisson_rebuild <br> 0.000 Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001>> 12 142 pw_pool_create_pw <br> start Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001>> 13 81 pw_create_c1d <br> start Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001<< 13 81 pw_create_c1d <br> 0.000 Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001<< 12 142 pw_pool_create_pw <br> 0.000 Hostmem: 
380 MB GPUmem: 0 MB<br> 000000:000001>> 12 62 pw_copy start Hos<br> tmem: 380 MB GPUmem: 0 MB<br> 000000:000001<< 12 62 pw_copy 0.003 Hos<br> tmem: 380 MB GPUmem: 0 MB<br> 000000:000001>> 12 6 pw_multiply_with <br> start Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001<< 12 6 pw_multiply_with <br> 0.002 Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001>> 12 63 pw_copy start Hos<br> tmem: 380 MB GPUmem: 0 MB<br> 000000:000001<< 12 63 pw_copy 0.003 Hos<br> tmem: 380 MB GPUmem: 0 MB<br> 000000:000001>> 12 6 pw_integral_ab st<br> art Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001<< 12 6 pw_integral_ab 0.<br> 005 Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001>> 12 7 pw_poisson_set st<br> art Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001>> 13 143 pw_pool_create_pw <br> start Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001>> 14 82 pw_create_c1d <br> start Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001<< 14 82 pw_create_c1d <br> 0.000 Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001<< 13 143 pw_pool_create_pw <br> 0.000 Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001>> 13 64 pw_copy start <br> Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001<< 13 64 pw_copy 0.003 <br> Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001>> 13 16 pw_derive star<br> t Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001<< 13 16 pw_derive 0.00<br> 6 Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001>> 13 144 pw_pool_create_pw <br> start Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001>> 14 83 pw_create_c1d <br> start Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001<< 14 83 pw_create_c1d <br> 0.000 Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001<< 13 144 pw_pool_create_pw <br> 0.000 Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001>> 13 65 pw_copy start <br> Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001<< 13 65 pw_copy 0.004 <br> Hostmem: 380 MB GPUmem: 0 MB<br> 000000:000001>> 13 17 pw_derive star<br> t Hostmem: 380 MB GPUmem: 0 MB</font></div><div><font face="Courier New">```</font></div><div><br></div><div>for psmp the last 
100 lines is:</div><div><font face="Courier New"><br></font></div><div><font face="Courier New">```</font></div><div><font face="Courier New"> 000000:000002<< 9 7 evaluate_core_matrix_traces <br></font></div><div><font face="Courier New"> 0.000 Hostmem: 693 MB GPUmem: 0 MB<br></font></div><div><font face="Courier New"> 000000:000002>> 9 7 rebuild_ks_matrix start Ho</font></div><div><font face="Courier New"><br> stmem: 693 MB GPUmem: 0 MB<br></font></div><div><font face="Courier New"> 000000:000002>> 10 7 qs_ks_build_kohn_sham_matrix <br></font></div><div><font face="Courier New"> start Hostmem: 693 MB GPUmem: 0 MB<br></font></div><div><font face="Courier New"> 000000:000002>> 11 164 pw_pool_create_pw st<br> art Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002>> 12 93 pw_create_c1d sta<br> rt Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002<< 12 93 pw_create_c1d 0.0<br> 00 Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002<< 11 164 pw_pool_create_pw 0.<br> 000 Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002>> 11 165 pw_pool_create_pw st<br> art Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002>> 12 94 pw_create_c1d sta<br> rt Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002<< 12 94 pw_create_c1d 0.0<br> 00 Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002<< 11 165 pw_pool_create_pw 0.<br> 000 Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002>> 11 73 pw_copy start Hostme</font></div><div><font face="Courier New"><br> m: 693 MB GPUmem: 0 MB<br></font></div><div><font face="Courier New"> 000000:000002<< 11 73 pw_copy 0.001 Hostme</font></div><div><font face="Courier New"><br> m: 693 MB GPUmem: 0 MB<br></font></div><div><font face="Courier New"> 000000:000002>> 11 41 pw_axpy start Hostme</font></div><div><font face="Courier New"><br> m: 693 MB GPUmem: 0 MB<br></font></div><div><font face="Courier New"> 000000:000002<< 11 41 pw_axpy 0.001 Hostme</font></div><div><font face="Courier New"><br> m: 693 MB GPUmem: 0 MB<br></font></div><div><font face="Courier New"> 000000:000002>> 11 52 mp_sum_d 
start Hostm</font></div><div><font face="Courier New"><br> em: 693 MB GPUmem: 0 MB<br></font></div><div><font face="Courier New"> 000000:000002<< 11 52 mp_sum_d 0.000 Hostm</font></div><div><font face="Courier New"><br> em: 693 MB GPUmem: 0 MB<br></font></div><div><font face="Courier New"> 000000:000002>> 11 7 pw_poisson_solve sta<br> rt Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002>> 12 7 pw_poisson_rebuild <br></font></div><div><font face="Courier New"> start Hostmem: 693 MB GPUmem: 0 MB<br></font></div><div><font face="Courier New"> 000000:000002<< 12 7 pw_poisson_rebuild <br></font></div><div><font face="Courier New"> 0.000 Hostmem: 693 MB GPUmem: 0 MB<br></font></div><div><font face="Courier New"> 000000:000002>> 12 166 pw_pool_create_pw </font></div><div><font face="Courier New"><br> start Hostmem: 693 MB GPUmem: 0 MB<br></font></div><div><font face="Courier New"> 000000:000002>> 13 95 pw_create_c1d <br></font></div><div><font face="Courier New"> start Hostmem: 693 MB GPUmem: 0 MB<br></font></div><div><font face="Courier New"> 000000:000002<< 13 95 pw_create_c1d <br></font></div><div><font face="Courier New"> 0.000 Hostmem: 693 MB GPUmem: 0 MB<br></font></div><div><font face="Courier New"> 000000:000002<< 12 166 pw_pool_create_pw </font></div><div><font face="Courier New"><br> 0.000 Hostmem: 693 MB GPUmem: 0 MB<br></font></div><div><font face="Courier New"> 000000:000002>> 12 74 pw_copy start Hos</font></div><div><font face="Courier New"><br> tmem: 693 MB GPUmem: 0 MB<br></font></div><div><font face="Courier New"> 000000:000002<< 12 74 pw_copy 0.001 Hos</font></div><div><font face="Courier New"><br> tmem: 693 MB GPUmem: 0 MB<br></font></div><div><font face="Courier New"> 000000:000002>> 12 7 pw_multiply_with <br></font></div><div><font face="Courier New"> start Hostmem: 693 MB GPUmem: 0 MB<br></font></div><div><font face="Courier New"> 000000:000002<< 12 7 pw_multiply_with <br></font></div><div><font face="Courier New"> 0.001 Hostmem: 693 MB GPUmem: 0 
MB<br></font></div><div><font face="Courier New"> 000000:000002>> 12 75 pw_copy start Hos</font></div><div><font face="Courier New"><br> tmem: 693 MB GPUmem: 0 MB<br></font></div><div><font face="Courier New"> 000000:000002<< 12 75 pw_copy 0.001 Hos</font></div><div><font face="Courier New"><br> tmem: 693 MB GPUmem: 0 MB<br></font></div><div><font face="Courier New"> 000000:000002>> 12 7 pw_integral_ab st<br> art Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002>> 13 53 mp_sum_d start</font></div><div><font face="Courier New"><br> Hostmem: 693 MB GPUmem: 0 MB<br></font></div><div><font face="Courier New"> 000000:000002<< 13 53 mp_sum_d 0.000</font></div><div><font face="Courier New"><br> Hostmem: 693 MB GPUmem: 0 MB<br></font></div><div><font face="Courier New"> 000000:000002<< 12 7 pw_integral_ab 0.<br> 003 Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002>> 12 8 pw_poisson_set st<br> art Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002>> 13 167 pw_pool_create_pw <br></font></div><div><font face="Courier New"> start Hostmem: 693 MB GPUmem: 0 MB<br></font></div><div><font face="Courier New"> 000000:000002>> 14 96 pw_create_c1d </font></div><div><font face="Courier New"><br> start Hostmem: 693 MB GPUmem: 0 MB<br></font></div><div><font face="Courier New"> 000000:000002<< 14 96 pw_create_c1d </font></div><div><font face="Courier New"><br> 0.000 Hostmem: 693 MB GPUmem: 0 MB<br></font></div><div><font face="Courier New"> 000000:000002<< 13 167 pw_pool_create_pw <br></font></div><div><font face="Courier New"> 0.000 Hostmem: 693 MB GPUmem: 0 MB<br></font></div><div><font face="Courier New"> 000000:000002>> 13 76 pw_copy start <br></font></div><div><font face="Courier New"> Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002<< 13 76 pw_copy 0.001 <br> Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002>> 13 19 pw_derive star<br> t Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002<< 13 19 pw_derive 0.00<br> 2 Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002>> 13 168 pw_pool_create_pw <br> start 
Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002>> 14 97 pw_create_c1d <br> start Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002<< 14 97 pw_create_c1d <br> 0.000 Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002<< 13 168 pw_pool_create_pw <br> 0.000 Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002>> 13 77 pw_copy start <br> Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002<< 13 77 pw_copy 0.001 <br> Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002>> 13 20 pw_derive star<br> t Hostmem: 693 MB GPUmem: 0 MB</font></div><div><font face="Courier New"></font></div><div><font face="Courier New">```</font></div><div><br></div><div>Thanks</div><div>Bartosz<br><br></div><div class="gmail_quote"><div dir="auto" class="gmail_attr">On Monday, 21 October 2024 at 08:58:34 UTC+2, Frederick Stein wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>Dear Bartosz,</div><div>I have no idea about the issue with LibXSMM.</div><div>Regarding the trace, I do not know either, as there is not much that could break in pw_derive (it just performs multiplications) and the sequence of operations is too unspecific. It may be that the code actually breaks somewhere else. Can you do the same with the ssmp version and post the last 100 lines? This way, we remove the asynchronicity issues that affect backtraces with the psmp version.</div><div>Best,</div><div>Frederick<br></div><br><div class="gmail_quote"><div dir="auto" class="gmail_attr">bartosz mazur schrieb am Sonntag, 20. 
Oktober 2024 um 16:47:15 UTC+2:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">The error is:<div><br></div><div><font face="Courier New">```</font></div><div><font face="Courier New"></font></div><div><font face="Courier New">LIBXSMM_VERSION: develop-1.17-3834 (25693946)<br>CLX/DP TRY JIT STA COL<br></font></div><div><font face="Courier New"> 0..13 2 2 0 0<br> 14..23 0 0 0 0</font></div><div><font face="Courier New"><br> 24..64 0 0 0 0<br></font></div><div><font face="Courier New">Registry and code: 13 MB + 16 KB (gemm=2)<br>Command (PID=2607388): /lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp -i H2O-9.inp -o H2O-9.out<br>Uptime: 5.288243 s</font></div><div><font face="Courier New"><br><br>===================================================================================<br>= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES<br></font></div><div><font face="Courier New">= RANK 0 PID 2607388 RUNNING AT r21c01b10</font></div><div><font face="Courier New"><br>= KILLED BY SIGNAL: 11 (Segmentation fault)<br>===================================================================================<br><br>===================================================================================<br>= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES<br></font></div><div><font face="Courier New">= RANK 1 PID 2607389 RUNNING AT r21c01b10<br>= KILLED BY SIGNAL: 9 (Killed)<br>===================================================================================<br></font></div><div><font face="Courier New">```</font></div><div><br></div><div>and the last 20 lines:</div><div><br></div><div><font face="Courier New">```</font></div><div><font face="Courier New"> 000000:000002<< 13 76 pw_copy 0.001<br> Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002>> 13 19 pw_derive star<br> t Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002<< 13 19 pw_derive 0.00<br> 2 Hostmem: 693 MB GPUmem: 0 MB<br> 
000000:000002>> 13 168 pw_pool_create_pw<br> start Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002>> 14 97 pw_create_c1d<br> start Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002<< 14 97 pw_create_c1d<br> 0.000 Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002<< 13 168 pw_pool_create_pw<br> 0.000 Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002>> 13 77 pw_copy start<br> Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002<< 13 77 pw_copy 0.001<br> Hostmem: 693 MB GPUmem: 0 MB<br> 000000:000002>> 13 20 pw_derive star<br> t Hostmem: 693 MB GPUmem: 0 MB</font></div><div><font face="Courier New">```</font><br><br></div><div>Thanks!</div><div class="gmail_quote"><div dir="auto" class="gmail_attr">On Friday, 18 October 2024 at 17:18:39 UTC+2, Frederick Stein wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Please pick one of the failing tests. Then add the TRACE keyword to the &GLOBAL section and run the test manually. This increases the size of the output file dramatically (to several million lines). Can you send me the last ~20 lines of the output?<br><div class="gmail_quote"><div dir="auto" class="gmail_attr">On Friday, 18 October 2024 at 17:09:40 UTC+2, bartosz mazur wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">I'm using the do_regtests.py script, not make regtesting, but I assume it makes no difference. As I mentioned in my previous message, with `--ompthreads 1` all tests passed for both ssmp and psmp. For ssmp with `--ompthreads 2` I observe errors similar to those for psmp with the same setting; I attach an example output. 
<div><br></div><div>Thanks</div><div>Bartosz<br><br></div><div class="gmail_quote"><div dir="auto" class="gmail_attr">On Friday, 18 October 2024 at 16:24:16 UTC+2, Frederick Stein wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>Dear Bartosz,<br></div><div>What happens if you set the number of OpenMP threads to 1 (add '--ompthreads 1' to TESTOPTS)? What errors do you observe with the ssmp version?</div><div>Best,</div><div>Frederick<br></div><br><div class="gmail_quote"><div dir="auto" class="gmail_attr">On Friday, 18 October 2024 at 15:37:43 UTC+2, bartosz mazur wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi Frederick,<div><br></div><div>thanks again for your help. I have tested different simulation variants and found that the problem occurs when using OpenMP. For MPI calculations without OpenMP, all tests pass. I have also tested the effect of the <font face="Courier New">`OMP_PROC_BIND` </font>and <font face="Courier New">`OMP_PLACES`</font> parameters and, apart from their effect on simulation time, they have no significant effect on whether errors occur. 
Below are the results for ssmp:</div><div><br></div><div><font face="Courier New">```</font></div><div><font face="Courier New">OMP_PROC_BIND, OMP_PLACES, correct, total, wrong, failed, time <br>spread, threads, 3850, 4144, 4, 290, 186min<br>spread, cores, 3831, 4144, 3, 310, 183min<br>spread, sockets, 3864, 4144, 3, 277, 104min<br>close, threads, 3879, 4144, 3, 262, 171min<br>close, cores, 3854, 4144, 0, 290, 168min<br>close, sockets, 3865, 4144, 3, 276, 104min<br>master, threads, 4121, 4144, 0, 23, 1002min<br>master, cores, 4121, 4144, 0, 23, 986min<br>master, sockets, 3942, 4144, 3, 199, 219min<br>false, threads, 3918, 4144, 0, 226, 178min<br>false, cores, 3919, 4144, 3, 222, 176min<br>false, sockets, 3856, 4144, 4, 284, 104min<br>```</font></div><div><br></div><div>and psmp:</div><div><br></div><div><font face="Courier New">```</font></div><div><font face="Courier New">OMP_PROC_BIND, OMP_PLACES, results<br>spread, threads, Summary: correct: 4097 / 4227; failed: 130; 495min<br>spread, cores, 26 / 362<br>spread, cores, 26 / 362<br>close, threads, Summary: correct: 4133 / 4227; failed: 94; 484min<br>close, cores, 60 / 362<br>close, sockets, 13 / 362<br>master, threads, 13 / 362<br>master, cores, 79 / 362<br>master, sockets, Summary: correct: 4153 / 4227; failed: 74; 563min<br>false, threads, Summary: correct: 4153 / 4227; failed: 74; 556min<br>false, cores, Summary: correct: 4106 / 4227; failed: 121; 511min<br>false, sockets, 96 / 362</font></div><div><font face="Courier New">not specified, not specified, Summary: correct: 4129 / 4227; failed: 98; 263min</font><br></div><div><font face="Courier New">```</font></div><div><br></div><div>Any ideas what I could do next to have more information about the source of the problem or maybe you see a potential solution at this stage? I would appreciate any further help. 
<br></div><div><br></div><div>Best</div><div>Bartosz</div><div><br></div><div><br></div><div class="gmail_quote"><div dir="auto" class="gmail_attr">piątek, 11 października 2024 o 14:30:25 UTC+2 Frederick Stein napisał(a):<br></div><blockquote class="gmail_quote" style="margin:0 0 0 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>Dear Bartosz,</div><div>If I am not mistaken, you used 8 OpenMP threads. The test do not run that efficiently with such a large number of threads. 2 should be sufficient.</div><div>The test result suggests that most of the functionality may work but due to a missing backtrace (or similar information), it is hard to tell why they fail. You could also try to run some of the single-node tests to assess the stability of CP2K.<br></div><div>Best,</div><div>Frederick<br></div><br><div class="gmail_quote"><div dir="auto" class="gmail_attr">bartosz mazur schrieb am Freitag, 11. Oktober 2024 um 13:48:42 UTC+2:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Sorry, forgot attachments.<div><br></div></blockquote></div></blockquote></div></blockquote></div></blockquote></div></blockquote></div></blockquote></div></blockquote></div></blockquote></div></blockquote></div></blockquote></div></blockquote></div></blockquote></div></blockquote></div></blockquote></div></blockquote></div></blockquote></div></blockquote></div></blockquote></div></blockquote></div></blockquote></div>
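<div>The ssmp regtest table quoted above is easier to compare when the settings are ranked by how many tests did not pass. The following is only a minimal sketch: the rows are copied from the table in this thread, while the names <font face="Courier New">SSMP_TABLE</font>, <font face="Courier New">parse_rows</font> and <font face="Courier New">rank_by_problems</font> are hypothetical helpers introduced for illustration and are not part of CP2K or do_regtests.py.</div>
<pre>
```python
# Rows copied from the ssmp regtest table quoted in this thread.
# Columns: OMP_PROC_BIND, OMP_PLACES, correct, total, wrong, failed, time
SSMP_TABLE = """
spread, threads, 3850, 4144, 4, 290, 186min
spread, cores, 3831, 4144, 3, 310, 183min
spread, sockets, 3864, 4144, 3, 277, 104min
close, threads, 3879, 4144, 3, 262, 171min
close, cores, 3854, 4144, 0, 290, 168min
close, sockets, 3865, 4144, 3, 276, 104min
master, threads, 4121, 4144, 0, 23, 1002min
master, cores, 4121, 4144, 0, 23, 986min
master, sockets, 3942, 4144, 3, 199, 219min
false, threads, 3918, 4144, 0, 226, 178min
false, cores, 3919, 4144, 3, 222, 176min
false, sockets, 3856, 4144, 4, 284, 104min
"""


def parse_rows(text):
    """Turn the comma-separated table into a list of dicts (hypothetical helper)."""
    rows = []
    for line in text.strip().splitlines():
        bind, places, correct, total, wrong, failed, time = (
            field.strip() for field in line.split(",")
        )
        rows.append({
            "bind": bind,
            "places": places,
            "correct": int(correct),
            "total": int(total),
            "wrong": int(wrong),
            "failed": int(failed),
            "minutes": int(time.replace("min", "")),
        })
    return rows


def rank_by_problems(rows):
    """Sort settings by the number of wrong plus failed tests, fewest first."""
    return sorted(rows, key=lambda row: row["wrong"] + row["failed"])


if __name__ == "__main__":
    for row in rank_by_problems(parse_rows(SSMP_TABLE)):
        share = 100.0 * (row["wrong"] + row["failed"]) / row["total"]
        print(f"{row['bind']:8s} {row['places']:8s} {share:5.1f}% not passing, {row['minutes']} min")
```
</pre>
<div>Sorting by wrong plus failed tests puts the two <font face="Courier New">master</font> settings with <font face="Courier New">threads</font> and <font face="Courier New">cores</font> first (23 failures out of 4144 each), which are also the slowest runs in the table. The sketch only reorders the numbers already posted; it does not explain why the remaining tests fail.</div>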
<p></p>
-- <br />
To view this discussion visit <a href="https://groups.google.com/d/msgid/cp2k/48b72f1a-c321-4833-aeb9-1f747967acfcn%40googlegroups.com?utm_medium=email&utm_source=footer">https://groups.google.com/d/msgid/cp2k/48b72f1a-c321-4833-aeb9-1f747967acfcn%40googlegroups.com</a>.<br />