<div dir="ltr"><div>Dear Peter,</div><div><br></div><div>Please specify your hardware and attach the input/output files you're testing.</div><div>I also recommend disabling hyperthreading completely at the BIOS level. There are at least two reasons to do so: 1) cp2k, like many other HPC programs, does not benefit from this technology; 2) as shown by <a href="http://ia.cr/2018/1060">a recent study</a>, HT is not safe on multiuser systems such as clusters and servers, which I guess is what your system is.</div><div><br></div><div>Best wishes,</div><div>Anton K.<br></div><br>On Thursday, 29 November 2018 13:20:43 UTC+3, Peter Kraus wrote:<blockquote class="gmail_quote" style="margin: 0;margin-left: 0.8ex;border-left: 1px #ccc solid;padding-left: 1ex;"><div dir="ltr"><div>Dear Anton,</div><div><br></div><div>Thanks for the suggestion. MPICH 3.3 seems quicker than OpenMPI 3.1: with 16 MPI processes of 8 OpenMP threads each (128 cores total), it takes ~130 s per wavefunction optimisation step, while OpenMPI takes ~200 s. However, with OpenMPI running 8x8 parallelisation (64 cores, which fits into one of my hyper-threaded nodes), I get ~7 s per step, so the multi-node MPI penalty is still ridiculous. 
This is for a V2O5 bulk system with 168 atoms, using PBE and a DZ basis set.<br></div><div><br></div><div>Best,</div><div>Peter<br></div><br>On Wednesday, 28 November 2018 13:17:27 UTC+1, Anton Kudelin wrote:<blockquote class="gmail_quote" style="margin:0;margin-left:0.8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Try MPICH or one of its derivatives (MVAPICH), configured with --with-device=ch3:nemesis<br><br>On Wednesday, 28 November 2018 14:35:04 UTC+3, Peter Kraus wrote:<blockquote class="gmail_quote" style="margin:0;margin-left:0.8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>Dear Mike,</div><div><br></div><div>I have tried to use CP2K on our cluster with nodes connected using 10 GbE, and all I see is a very significant slowdown. This was using gcc-8.2.0, openmpi-3.1.1 and OpenBLAS/fftw/scalapack compiled with those two, with OpenMP enabled where possible. I've resorted to submitting "SMP"-like jobs (by selecting the smp parallel environment, but parallelising using both MPI and OpenMP). <br></div><div><br></div><div>If you figure out how to squeeze extra performance from the 10 GbE, please let me know.</div><div><br></div><div>Best,</div><div>Peter<br></div><br>On Monday, 12 November 2018 18:01:48 UTC+1, Mike Ruggiero wrote:<blockquote class="gmail_quote" style="margin:0;margin-left:0.8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hello cp2k community - I have recently set up a small computing cluster, with 20-24 core server nodes linked via 10 GbE connections. While scaling on single nodes is as it should be (i.e., nearly linear), I get little to no speed-up when running multi-node simulations. After digging around, it seems that this is relatively well known for cp2k, but I'm curious if anyone has had any success using cp2k over 10 GbE connections. Any advice would be greatly appreciated! 
<div><br></div><div>Best,</div><div>Michael Ruggiero </div></div></blockquote></div></blockquote></div></blockquote></div></blockquote></div>
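<div><br></div><div>For reference, a rough sketch of the arithmetic behind the timings quoted in this thread. The three numbers are the approximate per-step times Peter reported; the variable and function names are made up for illustration, and nothing here measures anything new.</div>

```python
# Approximate per-step timings quoted in the thread (seconds).
# Names below are illustrative only, not taken from CP2K or any MPI library.
single_node_openmpi = 7.0    # 8 MPI x 8 OpenMP, 64 cores, one node
multi_node_mpich = 130.0     # 16 MPI x 8 OpenMP, 128 cores, MPICH 3.3
multi_node_openmpi = 200.0   # 16 MPI x 8 OpenMP, 128 cores, OpenMPI 3.1


def slowdown(multi_node_s, single_node_s):
    """Return how many times slower a multi-node step is than a single-node step."""
    return multi_node_s / single_node_s


# Gain from switching MPI implementation in the multi-node runs.
mpich_gain = multi_node_openmpi / multi_node_mpich

print(f"MPICH vs OpenMPI across nodes: {mpich_gain:.2f}x faster")
print(f"MPICH, multi-node vs single node: "
      f"{slowdown(multi_node_mpich, single_node_openmpi):.1f}x slower")
print(f"OpenMPI, multi-node vs single node: "
      f"{slowdown(multi_node_openmpi, single_node_openmpi):.1f}x slower")
```

<div>On these figures, MPICH is about 1.5x faster than OpenMPI across nodes, yet still roughly 19x slower per step than the single-node run despite using twice as many cores.</div>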