parallel distribution of data

Matt W MattWa... at gmail.com
Wed Mar 26 21:40:57 UTC 2008


> I managed to take down an entire supercomputer with the H2O-8192!

;) Sorry about that; I hope it wasn't too bad.

My suspicion would then be that skipping the optimization leaves the
job horribly load balanced, so at least some processors end up with
ridiculous memory requirements.
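
To put very rough numbers on that (everything below is made up for
illustration, not taken from your run): cp_fm_create allocates each
rank's local piece of a distributed full matrix, so the most loaded
rank sets the memory high-water mark.  A quick Python sketch:

    # Illustrative only: per-rank memory for an N x N double-precision
    # distributed full matrix. N and the skew factor are hypothetical,
    # not taken from the failing job.
    def per_rank_mib(n, nranks, skew=1.0):
        # skew = 1.0 is a perfectly even split; larger values model a
        # badly balanced distribution where one rank owns a bigger share
        total_bytes = 8 * n * n          # 8 bytes per double
        return skew * total_bytes / nranks / 2**20

    n = 40000                            # hypothetical matrix dimension
    for procs in (512, 1024):
        print(procs, "ranks:",
              round(per_rank_mib(n, procs)), "MiB even,",
              round(per_rank_mib(n, procs, skew=8.0)), "MiB with 8x skew")

The point being: doubling the processor count halves the even share,
but if one rank owns several times its fair share, going from 512 to
1024 procs barely changes its allocation.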

I'm afraid I really have no further suggestions; maybe the load-balancing
distribution code will cope with 1024 procs, but I doubt it.  To
my mind none of this is due to the new distributed routines; it's an
old problem that people who really know the code (not me) have worked
around occasionally when needed.
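
For anyone digging this thread up later: SKIP_OPTIMIZATION is an input
keyword, and I believe it lives in the &DISTRIBUTION section.  Roughly
the fragment below, but I'm writing the section path from memory, so
do check the input reference for your CP2K version:

    &FORCE_EVAL
      &DFT
        &QS
          &DISTRIBUTION
            ! bypass the load-balance optimization of the distribution
            SKIP_OPTIMIZATION TRUE
          &END DISTRIBUTION
        &END QS
      &END DFT
    &END FORCE_EVAL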

Sorry for not being of more help,

Matt

>
> On Fri, Mar 21, 2008 at 11:24 AM, Nichols A. Romero <naro... at gmail.com>
> wrote:
>
> > I think you are right. I have doubled the number of processors to 1024.
> > It will probably be a while before that gets through the queue.
>
> > On 3/21/08, Matt W <MattWa... at gmail.com> wrote:
>
> > > Well, actually you've got through the 2D distribution routines, so
> > > you've now got a different problem;)  I guess this could just be
> > > running out of memory (it's an allocation that's failed).  Someone
> > > else will have to comment more fully.
>
> > > Matt
>
> > > > Just to follow up. I tried your suggestion of SKIP_OPTIMIZATION TRUE,
> > > > but that did not work.
>
> > > >   preconditioner : FULL_KINETIC        : cholesky inversion of T + eS
> > > >   stepsize       :    0.15000000
> > > >   energy_gap     :    0.20000000
> > > >   eps_taylor     :   0.10000E-15
> > > >   max_taylor     :             4
>
> > > >   ----------------------------------- OT --------------------------------------
> > > >  *
> > > >  *** 03:21:02 ERRORL2 in cp_fm_types:cp_fm_create processor 0      ***
> > > >  *** err=-300  condition FAILED at line 169                        ***
> > > >  *
>
> > > >  ===== Routine Calling Stack =====
>
> > > >             7 make_preconditioner_single
> > > >             6 init_scf_loop
> > > >             5 scf_env_do_scf
> > > >             4 qs_energies
> > > >             3 qs_forces
> > > >             2 qs_mol_dyn_low
> > > >             1 CP2K
> > > >  CP2K| condition FAILED at line 169
> > > >  CP2K| Abnormal program termination, stopped by process number 0
> > > > [0] [MPI Abort by user] Aborting Program!
>
> --
> Nichols A. Romero, Ph.D.
> DoD User Productivity Enhancement and Technology Transfer (PET) Group
> High Performance Technologies, Inc.
> Reston, VA
> 443-567-8328 (C)
> 410-278-2692 (O)

