Together with Rad, we investigated the problem a bit further:

1) Rad's job was stopping after ~180 MD steps with that error message.
2) It always crashes at the same point, independently of the number of procs used (32/64/128/256, ...).
3) The regtest-norotho/graphite2 sample gives him the error:

 Total charge density (g-space):               0.0086530397

 *****************************************************************************
 *** 10:24:03 ERRORL2 in cp_fm_cholesky:cp_fm_cholesky_decompose processor ***
 *** 0  err=-300 condition FAILED at line 116                             ***
 *****************************************************************************

 ===== Routine Calling Stack =====

            8 cp_fm_cholesky_decompose
            7 make_preconditioner
            6 init_scf_loop
            5 scf_env_do_scf
            4 qs_energies
            3 qs_forces
            2 qs_mol_dyn_low
            1 CP2K

 CP2K| Stopped by process number 0
 CP2K| Abnormal program termination
 ======================================================================================
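As far as I remember, cp_fm_cholesky_decompose just hands the matrix to ScaLAPACK (pdpotrf), so a failure there in a stock regtest points at the parallel linear algebra stack rather than at CP2K itself. A quick way to check that stack outside of CP2K is a tiny BLACS grid test; the sketch below is only an illustration I typed up (the file name and link line are made up and have to be adapted to the machine; the Cblacs_* prototypes are declared by hand because BLACS installs no C header):

/* blacs_grid_check.c -- (hypothetical file name) minimal BLACS "hello
 * world": bring up a 1 x nprocs process grid and print the coordinates.
 * If even this fails or hangs, BLACS/ScaLAPACK is the suspect, not CP2K.
 * Library names vary per installation, e.g. something like:
 *   mpicc blacs_grid_check.c -o blacs_check -lscalapack -lblacs -llapack -lblas
 */
#include <stdio.h>
#include <mpi.h>

/* C interface of the BLACS, declared by hand */
void Cblacs_pinfo(int *mypnum, int *nprocs);
void Cblacs_get(int icontxt, int what, int *val);
void Cblacs_gridinit(int *icontxt, char *order, int nprow, int npcol);
void Cblacs_gridinfo(int icontxt, int *nprow, int *npcol, int *myrow, int *mycol);
void Cblacs_gridexit(int icontxt);
void Cblacs_exit(int notdone);

int main(int argc, char **argv)
{
    int iam, nprocs, ictxt;
    int nprow, npcol, myrow, mycol;
    char order[] = "Row";

    MPI_Init(&argc, &argv);

    Cblacs_pinfo(&iam, &nprocs);               /* my id and number of procs  */
    Cblacs_get(-1, 0, &ictxt);                 /* default system context     */
    Cblacs_gridinit(&ictxt, order, 1, nprocs); /* trivial 1 x nprocs grid    */
    Cblacs_gridinfo(ictxt, &nprow, &npcol, &myrow, &mycol);

    printf("process %d of %d sits at (%d,%d) of a %d x %d BLACS grid\n",
           iam, nprocs, myrow, mycol, nprow, npcol);

    Cblacs_gridexit(ictxt);
    Cblacs_exit(1);                            /* 1: leave MPI running       */
    MPI_Finalize();
    return 0;
}

If the grid comes up cleanly on all processes one would have to dig deeper (e.g. an actual pdpotrf call), but on a mis-linked installation even this much usually already misbehaves.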
I tried to run the same input file of Rad on another machine (XT3, 32 procs); in one hour the job reached 430 MD steps and I did not observe any crash due to MPI class errors.

After this summary (point (3) in particular), I believe that on the machine on which Rad was originally running his job, SCALAPACK or MPI (or both, since error (3) comes from SCALAPACK and does not show up on other machines) may not be properly installed.
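For the MPI side there is an equally simple check that does not involve CP2K at all: create and free communicators/groups in a tight loop and watch whether the group count blows up. The sketch below is only an illustration I put together (file name and loop count are arbitrary, it is not taken from the CP2K tree); on an MPI that leaks group entries it should die with the same MPI_GROUP_MAX message quoted in Rad's original mail below, while on a healthy installation it runs to completion:

/* comm_churn.c -- (hypothetical file name) repeatedly create and free
 * communicators and groups, with matching frees. CP2K creates and frees
 * MPI groups during a run (see Juerg's mail below); this loop does the
 * same thing in isolation. A healthy MPI finishes the loop; one that
 * leaks internal group entries runs out long before 10000 iterations.
 * Build/run, e.g.:  mpicc comm_churn.c -o comm_churn
 *                   mpirun -np 4 ./comm_churn
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (i = 0; i < 10000; ++i) {
        MPI_Group world_group, my_group;
        MPI_Comm  split_comm, dup_comm;

        /* split the world in two halves and duplicate the result */
        MPI_Comm_split(MPI_COMM_WORLD, rank < size / 2, rank, &split_comm);
        MPI_Comm_dup(split_comm, &dup_comm);

        /* also exercise explicit group objects */
        MPI_Comm_group(MPI_COMM_WORLD, &world_group);
        MPI_Group_incl(world_group, 1, &rank, &my_group);

        /* the matching frees -- exactly the part that, if missing or
         * broken, makes the group count grow like a memory leak */
        MPI_Group_free(&my_group);
        MPI_Group_free(&world_group);
        MPI_Comm_free(&dup_comm);
        MPI_Comm_free(&split_comm);
    }

    if (rank == 0)
        printf("10000 create/free cycles completed -- no group leak seen\n");

    MPI_Finalize();
    return 0;
}

If this already aborts on Rad's machine, the MPI installation is to blame; if it runs clean there, the unmatched free would have to be on the CP2K side, along the lines Juerg describes below.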
Teo

p.s.: as a further confirmation I asked Rad to run a simple NVE example on the same machine to see what happens (it should also abort with the same error message).

On 2 Nov 2007, at 14:00, Nichols A. Romero wrote:

> Rad,
>
> Is this NPT issue reproducible on other computer platforms?
>
> Please test that for us if you can.
>
> On 11/2/07, Juerg Hutter <hut...@pci.uzh.ch> wrote:
>>
>> Hi
>>
>> this could be a problem of CP2K or the compiler (or the MPI
>> installation). If it is a problem of CP2K, the obvious question is why
>> it didn't show up before. Can you run a small system with NPT in
>> parallel? If the error persists, please send the input. Another thing
>> to test would be whether the error depends on the number of CPUs.
>> CP2K generates and frees MPI groups during the calculation. If the free
>> calls do not match, it is possible that the number of groups keeps
>> increasing (similar to a memory leak). It is possible that your input
>> takes a new route through the code where this happens.
>>
>> Another possibility is that either the compiler or the installed MPI
>> has a broken implementation of the freeing of communicators.
>>
>> regards
>>
>> Juerg Hutter
>>
>> ----------------------------------------------------------
>> Juerg Hutter                  Phone : ++41 44 635 4491
>> Physical Chemistry Institute  FAX   : ++41 44 635 6838
>> University of Zurich          E-mail: hut...@pci.uzh.ch
>> Winterthurerstrasse 190
>> CH-8057 Zurich, Switzerland
>> ----------------------------------------------------------
>>
>> On Thu, 1 Nov 2007, Rad wrote:
>>
>>> Dear All,
>>>
>>> I am trying to perform an NPT ensemble with an MPI-compiled code and
>>> run into the following error:
>>>
>>> Please set the environment variable MPI_GROUP_MAX for additional space.
>>> MPI has run out of internal group entries.
>>> Please set the environment variable MPI_GROUP_MAX for additional space.
>>> The current value of MPI_GROUP_MAX is 512
>>>
>>> I have no problem running the calculation with the serially compiled
>>> code (I tried both NPT_I and NPT_F). I tried the MPI run with a cell
>>> having 56 atoms, expanded it to a supercell with 224 atoms, changed the
>>> ranks to 64, 32, 16, 8, used temperatures of 2.5 K, 200 K, 300 K and
>>> various pressures (1 bar, 50 bars), etc., and I get the same error.
>>>
>>> The code is compiled on an IA64 Linux cluster using the Intel compiler
>>> (version 9.1).
>>>
>>> Please let me know if you have any suggestions; I would also like to
>>> know whether the NPT part has been tested on different MPI
>>> architectures. If it has been tested on a particular arch, let me know
>>> and I will run it on that arch.
>>>
>>> Thanks
>>> Rad
>>
>> Ph.D.
>> DoD User Productivity Enhancement and Technology Transfer (PET) Group
>> High Performance Technologies, Inc.
>> Reston, VA
>> 443-567-8328 (C)
>> 410-278-2692 (O)