problem scaling multiple walker metadynamics using many nodes
gl... at bristol.ac.uk
Mon Feb 12 09:55:15 UTC 2018
Dear Luis,
Here's the way we are doing it: Suppose that I have 5 walkers running with
10 nodes and 28 processors per node:
First, my master input file is:
&GLOBAL
  PROJECT SnF.450C.mwalk
  PROGRAM FARMING
  RUN_TYPE NONE
&END GLOBAL
&FARMING
  NGROUP 5
  GROUP_PARTITION 56 56 56 56 56
  MAX_JOBS_PER_GROUP 1
  &JOB
    DIRECTORY dir0
    INPUT_FILE_NAME cn0.in
  &END JOB
  &JOB
    DIRECTORY dir1
    INPUT_FILE_NAME cn1.in
  &END JOB
  &JOB
    DIRECTORY dir2
    INPUT_FILE_NAME cn2.in
  &END JOB
  &JOB
    DIRECTORY dir3
    INPUT_FILE_NAME cn3.in
  &END JOB
  &JOB
    DIRECTORY dir4
    INPUT_FILE_NAME cn4.in
  &END JOB
  &RESTART
    &EACH
      MD 1
    &END EACH
  &END RESTART
&END FARMING
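(A quick sanity check on the partitioning: the five groups of 56 ranks add up to 5 x 56 = 280 MPI ranks, which exactly matches the 10 nodes x 28 processors per node that the job requests, so no ranks are left over.)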
So far, this looks just like what you are doing (although I don't see the
GROUP_PARTITION directive in your input file). Presumably, your input
files for each walker (cn0.in, cn1.in, etc.) each contain a section
&MULTIPLE_WALKERS
  NUMBER_OF_WALKERS 5
  WALKER_ID 1
  &WALKERS_FILE_NAME
    ../WALK_DATA_FILES/WALKER_1.data
    ../WALK_DATA_FILES/WALKER_2.data
    ../WALK_DATA_FILES/WALKER_3.data
    ../WALK_DATA_FILES/WALKER_4.data
    ../WALK_DATA_FILES/WALKER_5.data
  &END WALKERS_FILE_NAME
&END MULTIPLE_WALKERS
where WALKER_ID is 1, 2, 3, 4, or 5, respectively. Note that I have created a
directory WALK_DATA_FILES in the master directory (where mwalk.in resides).
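For reference, the resulting layout is something like this (the dirN-to-WALKER_ID
mapping is the natural one; the point is that the relative paths
../WALK_DATA_FILES/WALKER_*.data resolve from each walker directory):

master/
  mwalk.in           (the FARMING input above)
  WALK_DATA_FILES/   (shared walker communication files)
  dir0/cn0.in        (WALKER_ID 1)
  dir1/cn1.in        (WALKER_ID 2)
  dir2/cn2.in        (WALKER_ID 3)
  dir3/cn3.in        (WALKER_ID 4)
  dir4/cn4.in        (WALKER_ID 5)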
If all of that is in place, then perhaps the problem is the way the job is run.
On our HPC system, each node has 28 processors. Since I want each walker
to use 56 processors, I submit the job asking for 10 nodes with 28 CPUs
per node (that is, 10 nodes, 28 tasks per node, and 1 CPU per task). I
don't know how the system partitions the processors over the nodes:
walker 1 might use nodes 1 and 2, walker 2 nodes 3 and 4, and so on, or
the ranks might be scattered across the nodes. Either way, with this
approach I consistently get linear scaling with the number of nodes, as
measured by the CPU time per MD step for each walker.
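For concreteness, on a SLURM machine the submission script would look
something like the sketch below; the time limit and executable name are
placeholders for whatever your system provides, but the key point is
requesting 10 x 28 = 280 MPI ranks and launching a single FARMING run
over all of them:

#!/bin/bash
#SBATCH --nodes=10
#SBATCH --ntasks-per-node=28
#SBATCH --cpus-per-task=1
#SBATCH --time=24:00:00

# 280 MPI ranks in total; the FARMING input splits them into 5 groups
# of 56 (GROUP_PARTITION 56 56 56 56 56), one group per walker.
srun cp2k.popt -i mwalk.in -o mwalk.out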
Make sure that each walker is doing its own thing (as seen in the *-HILLS-LOCAL.metadynLog
file in each walker directory) and that the walkers are communicating with
each other (as seen in the *-HILLS.metadynLog file that is replicated in
each walker directory).
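A quick way to check both, assuming the default metadynamics file naming
(adjust the globs to your own PROJECT name):

# each walker should keep depositing its own hills, one line per hill:
wc -l dir*/*HILLS-LOCAL*.metadynLog
# the pooled file collects hills from all walkers, so it should grow
# roughly five times as fast as any single LOCAL file:
wc -l dir0/*-HILLS.metadynLog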
I hope this is helpful; let me know if this works on your system, or if
you find a better way!
Dave Sherman
On Wednesday, January 24, 2018 at 11:13:23 PM UTC, lar... at lbl.gov wrote:
>
> Dear all,
>
> I am running a multiple walker metadynamics simulation using FARMING.
> Specifically, I use 6 replicas and submit the FARMING job to 6 nodes with
> 68 cores each (see the attached file inp). The simulation runs
> fine, but I don't seem to be able to speed up the calculation using more
> than one node per walker (i.e. per individual job). I tried submitting the
> job to twice as many nodes, hoping that CP2K would assign 2 nodes per
> replica, but that doesn't speed up the simulation at all. I have also tried
> using "GROUP_PARTITION 136 136 136 136 136 136" but that doesn't work
> either. I know there is an issue because when I run a simple simulation
> (i.e. no farming) of the same system I see a very clear speedup when going
> from 1 to 2 nodes.
>
> Any ideas on how to make this work?
>
> Thanks!
> Luis
>