<div dir="ltr">I am finding a very strange dependence of the benchmark timing on how I launch the job under Open MPI. Does anyone have any insight?<div><br></div><div>This is CP2K 3.0.</div><div><br></div><div>If I simply use:<br><br>mpirun -n 16 cp2k.psmp H2O-64.inp >> H2O-64_REC.log<br><br></div><div>with<br><br></div><div>#PBS -l nodes=n013.cluster.com:ppn=4+n014.cluster.com:ppn=4+n015.cluster.com:ppn=4+n016.cluster.com:ppn=4<br>for example,<br><br>the timing is 165 seconds, whereas for<br><br></div><div>#PBS -l nodes=4:ppn=16,pmem=1gb<br><div>mpirun --map-by ppr:4:node -n 16 cp2k.psmp H2O-64.inp >> H2O-64_REC.log </div><div>it is 368 seconds!</div><div><br></div><div>Ron</div><div><br></div></div></div><div class="gmail_extra"><br clear="all"><div><div class="gmail_signature">---<br>Ronald Cohen<br>Geophysical Laboratory<br>Carnegie Institution<br>5251 Broad Branch Rd., N.W.<br>Washington, D.C. 20015<br><a href="mailto:rco...@carnegiescience.edu" target="_blank">rco...@carnegiescience.edu</a><br>office: 202-478-8937<br>skype: ronaldcohen<br><a href="https://twitter.com/recohen3" target="_blank">https://twitter.com/recohen3</a><br><a href="https://www.linkedin.com/profile/view?id=163327727" target="_blank">https://www.linkedin.com/profile/view?id=163327727</a><br></div></div> <br><div class="gmail_quote">On Wed, Mar 23, 2016 at 4:28 PM, Ronald Cohen <span dir="ltr">&lt;<a href="mailto:rco...@carnegiescience.edu" target="_blank">rco...@carnegiescience.edu</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">So I finally got decent performance with gfortran, Open MPI, and OpenBLAS across InfiniBand. <div>Now I find that using OpenMP with half the number of MPI processes seems to give better performance for the 64-molecule H2O test case. Is that reasonable? I recompiled everything, including BLAS and ScaLAPACK, without -fopenmp 
to make the popt version.</div><div><br></div><div>Wall-clock timings, in seconds:</div><div><br></div><div>1 node, 16 MPI procs, psmp, OMP_NUM_THREADS=1: 834</div><div>1 node, 16 MPI procs, popt, OMP_NUM_THREADS=1: 836</div><div>2 nodes, 16 MPI procs, psmp, OMP_NUM_THREADS=2: 266</div><div>2 nodes, 32 MPI procs, popt, OMP_NUM_THREADS=1: 430</div><div>4 nodes, 64 MPI procs, popt, OMP_NUM_THREADS=1: 331<br></div><div><div>4 nodes, 32 MPI procs, psmp, OMP_NUM_THREADS=2: 189</div><div>4 nodes, 64 MPI procs, psmp, OMP_NUM_THREADS=4: 166</div></div><div><br></div><div>So there is no measurable overhead from using the psmp build (compiled with OpenMP) with OMP_NUM_THREADS=1.</div><div>Using OpenMP threads improves performance far more than just increasing the number of MPI processes.</div><div>This may be because this machine has only 1 GB of memory per core, but even 4 threads is better than 2, so it seems OpenMP </div><div>is more efficient than MPI for this case.</div><div><br></div><div>There is still room for improvement, though. Any ideas on how to tune this for better performance?</div><div><br></div><div><br></div><div>Ron</div><div><br></div><div> <br></div></div></blockquote></div><br></div>
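<div>To make the quoted timings easier to compare, here is a minimal sketch in Python that tabulates them. The numbers are copied from the list above; everything else is an assumption for illustration: 16 cores per node (inferred from <code>ppn=16</code> and the 16-rank single-node runs), the 1-node psmp run as the speedup baseline, and the helper names <code>Run</code> and <code>summarize</code>, which are hypothetical.</div>

```python
from dataclasses import dataclass

# Assumption: 16 cores per node, inferred from "#PBS -l nodes=4:ppn=16".
CORES_PER_NODE = 16


@dataclass(frozen=True)
class Run:
    nodes: int
    mpi_procs: int
    build: str        # "psmp" (MPI + OpenMP) or "popt" (MPI only)
    omp_threads: int
    seconds: float

    @property
    def slots(self) -> int:
        """Total threads requested: MPI ranks times OpenMP threads per rank."""
        return self.mpi_procs * self.omp_threads

    @property
    def oversubscribed(self) -> bool:
        """True if more threads are requested than cores are assumed available."""
        return self.slots > self.nodes * CORES_PER_NODE


# Timings exactly as listed in the quoted message.
RUNS = [
    Run(1, 16, "psmp", 1, 834),
    Run(1, 16, "popt", 1, 836),
    Run(2, 16, "psmp", 2, 266),
    Run(2, 32, "popt", 1, 430),
    Run(4, 64, "popt", 1, 331),
    Run(4, 32, "psmp", 2, 189),
    Run(4, 64, "psmp", 4, 166),
]


def summarize(runs, baseline_seconds):
    """Return (run, speedup, per-node efficiency) relative to a baseline time."""
    rows = []
    for r in runs:
        speedup = baseline_seconds / r.seconds
        rows.append((r, speedup, speedup / r.nodes))
    return rows


if __name__ == "__main__":
    for r, speedup, eff in summarize(RUNS, RUNS[0].seconds):
        flag = "  <-- more threads than assumed cores" if r.oversubscribed else ""
        print(f"{r.nodes} node(s) {r.mpi_procs:3d} ranks x {r.omp_threads} threads "
              f"{r.build}: {r.seconds:5.0f} s  speedup {speedup:4.2f}  "
              f"per-node eff. {eff:4.2f}{flag}")
```

<div>Under the 16-cores-per-node assumption, the last row (64 ranks with 4 threads each) asks for 256 threads on 64 cores, so it may be worth double-checking whether that run really used 64 ranks; the 16-rank, 4-per-node run reported at 165 seconds in the newer message would fit that row.</div>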