problem scaling multiple walker metadynamics using many nodes

gl... at bristol.ac.uk
Mon Feb 12 09:55:15 UTC 2018


Dear Luis,

Here's the way we are doing it. Suppose I have 5 walkers running on 10 nodes 
with 28 processors per node.

First, my master input file is:

&GLOBAL
  PROJECT SnF.450C.mwalk
  PROGRAM FARMING
  RUN_TYPE NONE
&END GLOBAL

&FARMING
  NGROUP 5
  GROUP_PARTITION 56 56 56 56 56
  MAX_JOBS_PER_GROUP 1
  &JOB
    DIRECTORY dir0
    INPUT_FILE_NAME cn0.in
  &END JOB
  &JOB
    DIRECTORY dir1
    INPUT_FILE_NAME cn1.in
  &END JOB
  &JOB
    DIRECTORY dir2
    INPUT_FILE_NAME cn2.in
  &END JOB
  &JOB
    DIRECTORY dir3
    INPUT_FILE_NAME cn3.in
  &END JOB
  &JOB
    DIRECTORY dir4
    INPUT_FILE_NAME cn4.in
  &END JOB
  &RESTART
    &EACH
      MD 1
    &END EACH
  &END RESTART
&END FARMING

So far, this looks just like what you are doing (although I don't see the 
GROUP_PARTITION directive in your input file). Presumably, your input 
files for each walker (cn0.in, cn1.in, etc.) each contain a section like:

       &MULTIPLE_WALKERS
         NUMBER_OF_WALKERS 5
         WALKER_ID 1
         &WALKERS_FILE_NAME
           ../WALK_DATA_FILES/WALKER_1.data
           ../WALK_DATA_FILES/WALKER_2.data
           ../WALK_DATA_FILES/WALKER_3.data
           ../WALK_DATA_FILES/WALKER_4.data
           ../WALK_DATA_FILES/WALKER_5.data
         &END WALKERS_FILE_NAME
       &END MULTIPLE_WALKERS

where WALKER_ID is 1, 2, 3, 4, or 5. Note that I have created a directory 
WALK_DATA_FILES in the master directory (where the master input file, 
mwalk.in, resides). If all of that is in place, then perhaps the problem is 
the way the job is run.

On our HPC system, each node has 28 processors. Since I want each walker 
to use 56 processors, I submit the job asking for 10 nodes with 28 CPUs 
per node (or, equivalently, 10 nodes with 28 tasks per node and 1 CPU per 
task). I have no idea how the system partitions the processors over the 
nodes: walker 1 might use nodes 1 and 2, walker 2 nodes 3 and 4, and so 
on, or the walkers might be distributed over the nodes some other way. 
However, with this approach I consistently see linear scaling with the 
number of nodes (measured as the CPU time per timestep for each walker).
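For concreteness, here is a sketch of the kind of batch script I mean, written 
for SLURM (the job name, time limit, and cp2k binary name are assumptions; 
adapt the directives to your scheduler and system):

```shell
#!/bin/bash
#SBATCH --job-name=mwalk         # hypothetical job name
#SBATCH --nodes=10               # 5 walkers x 56 ranks each / 28 ranks per node
#SBATCH --ntasks-per-node=28     # one MPI rank per core
#SBATCH --cpus-per-task=1

# Launch a single FARMING run over all 280 MPI ranks; the master input
# (mwalk.in) then splits them into 5 groups of 56 via GROUP_PARTITION.
srun cp2k.popt -i mwalk.in -o mwalk.out
```

The key point is that you ask the scheduler for the total processor count (280 
here) in one job, and let GROUP_PARTITION in the FARMING input divide those 
ranks among the walkers.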

Make sure that each walker is doing its own thing (as seen in the *-HILLS-LOCAL.metadynLog 
file in each walker directory) and that the walkers are communicating with 
each other (as seen in the *-HILLS.metadynLog file that is replicated in 
each walker directory).
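From the master directory, a quick way to do both checks might be something 
like the following (the dir* names match the example above; the exact file 
name patterns are assumptions based on my runs):

```shell
# Per-walker hills: each walker's local file should be growing independently.
wc -l dir*/*-HILLS-LOCAL.metadynLog

# Shared hills: the replicated file in each directory should also be growing,
# which shows the walkers are actually exchanging hills with each other.
wc -l dir*/*-HILLS.metadynLog
```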

I hope this is helpful;  let me know if this works on your system, or if 
you find a better way!

Dave Sherman


On Wednesday, January 24, 2018 at 11:13:23 PM UTC, lar... at lbl.gov wrote:
>
> Dear all,
>
> I am running a multiple walker metadynamics simulation using FARMING. 
> Specifically, I use 6 replicas and submit the FARMING job to 6 nodes with 
> 68 cores each (see the file *inp* attached here). The simulation runs 
> fine, but I don't seem to be able to speedup the calculation using more 
> than one node per walker (i.e. per individual job). I tried submitting the 
> job to twice as many nodes, hoping that cp2k would assign 2 nodes per 
> replica, but that doesn't speed up the simulation at all. I have also tried 
> using  "GROUP_PARTITION 136 136 136 136 136 136" but that doesn't work 
> either. I know there is an issue because when I run a simple simulation 
> (i.e. no farming) of the same system I see a very clear speedup when going 
> from 1 to 2 nodes.
>
> Any ideas on how to make this work?
>
> Thanks!
> Luis
>