[CP2K:3705] hwloc support in cp2k-trunk
Axel
akoh... at gmail.com
Tue Jan 24 17:25:26 UTC 2012
On Monday, January 23, 2012 2:59:49 AM UTC-5, Christiane Pousa Ribeiro
wrote:
>
> Hi Axel,
>
hi christiane,
> I'm Christiane, the one responsible for the hwloc/libnuma support in
> cp2k.
>
thanks for taking the time to look into this.
> Concerning libnuma, the affinity support is much simpler than the one with
> hwloc: only thread/process affinity. I'll check this wrapper to see why it
> is not working and let you know.
>
ok. it may be version specific, too.
[akohlmey at g002 input]$ rpm -qif /usr/lib64/libnuma.so.1
Name        : numactl               Relocations: (not relocatable)
Version     : 2.0.3                 Vendor: Red Hat, Inc.
Release     : 9.el6                 Build Date: Thu Jun 17 10:46:17 2010
> About hwloc, it is true that it requires the latest version because of
> the PCI support for network cards and GPUs. By default this module only
> attaches processes and their memory to NUMA nodes. Their threads are not
> pinned to any cores, so they can move within a NUMA node. There are other
> strategies to place MPI tasks/threads that can be selected by setting the
> MACHINE_ARCH keys.
>
yes, this kind of behavior is what i would have expected.
this should also help with the internal threading in OpenMPI.
> Could you send me the input and MACHINE_ARCH keys that you used for these
> tests? I've tested the hwloc support on local intel/amd machines (with and
> without gpus) and on CRAY machines and I see no errors like that. All of
> them have NUMA characteristics.
>
please have a look at the attached file. you'll see that there
are some entries that don't look right. in particular, the node
names are all that of MPI rank 0.
> When you use numactl, how do you determine the cores for threads and
> MPI tasks? Do you attribute processes to NUMA nodes so that, consequently,
> threads are also attached to the same set of cores as their parent?
>
yes. our MPI installation is configured by default to have a 1:1 core to MPI
rank mapping (since there is practically nobody using MPI+OpenMP yet)
with memory affinity, to give people the best MPI-only performance.
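for reference, a site default like that could be set through OpenMPI's MCA
parameter file. the snippet below is only an illustration of the idea, not a
verbatim copy of our configuration (file location and comments are assumptions):

  # $OPENMPI_PREFIX/etc/openmpi-mca-params.conf  (hypothetical site default)
  # bind each MPI rank to one core; memory then stays local via first-touch
  mpi_paffinity_alone = 1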
at the end of the attached file i include a copy of the wrapper script,
which is OpenMPI specific (since that is the only MPI library installed).
overall, it looks to me like the default settings are giving desirable
processor and memory affinity (which is great), consistent with the best
settings i could get using my wrapper script, but the diagnostics seem
to be off and may confuse people, particularly technical support staff
in computing centers, who are often too literal and assume that any
software always gives 100% correct information. ;-)
cheers,
axel.
> So, if you have any suggestions or comments, we can discuss this and also
> solve the problems that you have found.
>
> --
> []'s
> Christiane Pousa Ribeiro
-------------- next part --------------
this is run using OpenMPI and OpenMP from gfortran 4.6.2
across 6 dual-processor nodes:
mpirun -npernode 2 -x OMP_NUM_THREADS=4 /path/to/cp2k/cp2k.psmp test.inp
(test.inp is the 32-water benchmark input with a MACHINE_ARCH section added.)
with different flags in that section i get the following output:
Ex.1:
&GLOBAL
&MACHINE_ARCH
PRINT_FULL T
&END
&END
MACHINE| Physical processing units organization
( P0, P1, P2, P3 ), ( P4, P5, P6, P7 )
MACHINE| Architecture organization
Machine#0(24GB)
NUMANode#0(12GB)
Socket#0
L3 (12MB)
L2(256KB) L1(32KB) Core#0 PU#0
L2(256KB) L1(32KB) Core#1 PU#1
L2(256KB) L1(32KB) Core#9 PU#2
L2(256KB) L1(32KB) Core#10 PU#3
--Network Card
--Network Card
NUMANode#1(12GB)
Socket#1
L3 (12MB)
L2(256KB) L1(32KB) Core#0 PU#4
L2(256KB) L1(32KB) Core#1 PU#5
L2(256KB) L1(32KB) Core#9 PU#6
L2(256KB) L1(32KB) Core#10 PU#7
for comparison, the output of lstopo on the same node:
Machine (24GB)
NUMANode #0 (phys=0 12GB) + Socket #0 + L3 #0 (12MB)
L2 #0 (256KB) + L1 #0 (32KB) + Core #0 + PU #0 (phys=0)
L2 #1 (256KB) + L1 #1 (32KB) + Core #1 + PU #1 (phys=1)
L2 #2 (256KB) + L1 #2 (32KB) + Core #2 + PU #2 (phys=2)
L2 #3 (256KB) + L1 #3 (32KB) + Core #3 + PU #3 (phys=3)
NUMANode #1 (phys=1 12GB) + Socket #1 + L3 #1 (12MB)
L2 #4 (256KB) + L1 #4 (32KB) + Core #4 + PU #4 (phys=4)
L2 #5 (256KB) + L1 #5 (32KB) + Core #5 + PU #5 (phys=5)
L2 #6 (256KB) + L1 #6 (32KB) + Core #6 + PU #6 (phys=6)
L2 #7 (256KB) + L1 #7 (32KB) + Core #7 + PU #7 (phys=7)
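note that the two listings do not number the cores the same way (Core#9/Core#10
above vs. Core #2/Core #3 in lstopo). my guess is that one prints physical (OS)
indices and the other logical ones; lstopo can also print physical indices,
which should make the two directly comparable (a suggestion only, not something
i have verified on this box):

  lstopo -p    # -p/--physical: report physical (OS) indices instead of logical ones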
========================================================================
&GLOBAL
&MACHINE_ARCH
PRINT_BRANCH T
&END
&END
MACHINE| NUMA node organization
NUMANode#0(12GB)
Socket#0
L3 (12MB)
L2(256KB) L1(32KB) Core#0 PU#0
L2(256KB) L1(32KB) Core#1 PU#1
L2(256KB) L1(32KB) Core#9 PU#2
L2(256KB) L1(32KB) Core#10 PU#3
========================================================================
&GLOBAL
&MACHINE_ARCH
PRINT_PROC T
&END
&END
this is without setting any processor/memory affinity externally; it again
gives some strange core ranges (e.g. 7-10, which does not even exist on these
8-core nodes). a way to cross-check this is suggested after the two listings below.
SCHED| Processes are now runing
Process 0 5173 running on NUMA node 0 core 0-3
Process 1 5174 running on NUMA node 1 core 4-7
Process 2 15205 running on NUMA node 0 core 2-5
Process 3 15206 running on NUMA node 1 core 5-8
Process 4 3091 running on NUMA node 0 core 3-6
Process 5 3092 running on NUMA node 1 core 4-7
Process 6 24978 running on NUMA node 0 core 1-4
Process 7 24979 running on NUMA node 1 core 7-10
Process 8 17698 running on NUMA node 0 core 2-5
Process 9 17699 running on NUMA node 1 core 4-7
Process 10 30755 running on NUMA node 0 core 0-3
Process 11 30756 running on NUMA node 1 core 4-7
MEMORY| Processes memory mapping
Process 0 5173 memory policy LOCAL node 0
Process 1 5174 memory policy LOCAL node 1
Process 2 15205 memory policy LOCAL node 0
Process 3 15206 memory policy LOCAL node 1
Process 4 3091 memory policy LOCAL node 0
Process 5 3092 memory policy LOCAL node 1
Process 6 24978 memory policy LOCAL node 0
Process 7 24979 memory policy LOCAL node 1
Process 8 17698 memory policy LOCAL node 0
Process 9 17699 memory policy LOCAL node 1
Process 10 30755 memory policy LOCAL node 0
Process 11 30756 memory policy LOCAL node 1
this is using my wrapper script around numactl (and what i expect):
NETWORK| Affinity is on
SCHED| Processes are now runing
Process 0 5264 running on NUMA node 0 core 0-3
Process 1 5265 running on NUMA node 1 core 4-7
Process 2 15294 running on NUMA node 0 core 0-3
Process 3 15293 running on NUMA node 1 core 4-7
Process 4 3193 running on NUMA node 0 core 0-3
Process 5 3192 running on NUMA node 1 core 4-7
Process 6 25070 running on NUMA node 0 core 0-3
Process 7 25071 running on NUMA node 1 core 4-7
Process 8 17793 running on NUMA node 0 core 0-3
Process 9 17794 running on NUMA node 1 core 4-7
Process 10 30853 running on NUMA node 0 core 0-3
Process 11 30852 running on NUMA node 1 core 4-7
MEMORY| Processes memory mapping
Process 0 5264 memory policy LOCAL node 0
Process 1 5265 memory policy LOCAL node 1
Process 2 15294 memory policy LOCAL node 0
Process 3 15293 memory policy LOCAL node 1
Process 4 3193 memory policy LOCAL node 0
Process 5 3192 memory policy LOCAL node 1
Process 6 25070 memory policy LOCAL node 0
Process 7 25071 memory policy LOCAL node 1
Process 8 17793 memory policy LOCAL node 0
Process 9 17794 memory policy LOCAL node 1
Process 10 30853 memory policy LOCAL node 0
Process 11 30852 memory policy LOCAL node 1
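one way to check independently whether those SCHED| lines are wrong, or whether
the processes really are bound like that, is to ask the kernel for the binding
of a given PID (standard Linux tools, nothing CP2K specific; the PID below is
taken from the first listing above):

  grep Cpus_allowed_list /proc/5173/status   # cpuset the kernel has recorded for that PID
  taskset -cp 5173                           # same information via util-linux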
========================================================================
&GLOBAL
&MACHINE_ARCH
PRINT_THREAD_CUR T
&END
&END
OMP | Thread 0 from process 0 running on NUMA node 0 core 0
OMP | Thread 1 from process 0 running on NUMA node 0 core 2
OMP | Thread 2 from process 0 running on NUMA node 0 core 1
OMP | Thread 3 from process 0 running on NUMA node 0 core 3
========================================================================
&GLOBAL
&MACHINE_ARCH
PRINT_THREAD T
&END
&END
this is without setting any kind of processor/memory affinity externally
OMP | Thread placement
Processs 0 running on g002
OMP | Thread 0 running on NUMA node 0 core 0
OMP | Thread 1 running on NUMA node 0 core 0
OMP | Thread 2 running on NUMA node 0 core 0
OMP | Thread 3 running on NUMA node 0 core 0
Processs 1 running on g002
OMP | Thread 0 running on NUMA node 1 core 4
OMP | Thread 1 running on NUMA node 1 core 4
OMP | Thread 2 running on NUMA node 1 core 5
OMP | Thread 3 running on NUMA node 1 core 5
Processs 2 running on g002
OMP | Thread 0 running on NUMA node 0 core 0
OMP | Thread 1 running on NUMA node 0 core 2
OMP | Thread 2 running on NUMA node 0 core 0
OMP | Thread 3 running on NUMA node 0 core 2
Processs 3 running on g002
OMP | Thread 0 running on NUMA node 1 core 5
OMP | Thread 1 running on NUMA node 1 core 4
OMP | Thread 2 running on NUMA node 1 core 7
OMP | Thread 3 running on NUMA node 1 core 4
Processs 4 running on g002
OMP | Thread 0 running on NUMA node 0 core 1
OMP | Thread 1 running on NUMA node 0 core 0
OMP | Thread 2 running on NUMA node 0 core 2
OMP | Thread 3 running on NUMA node 0 core 3
Processs 5 running on g002
OMP | Thread 0 running on NUMA node 1 core 4
OMP | Thread 1 running on NUMA node 1 core 6
OMP | Thread 2 running on NUMA node 1 core 7
OMP | Thread 3 running on NUMA node 1 core 5
Processs 6 running on g002
OMP | Thread 0 running on NUMA node 0 core 0
OMP | Thread 1 running on NUMA node 0 core 1
OMP | Thread 2 running on NUMA node 0 core 2
OMP | Thread 3 running on NUMA node 0 core 3
Processs 7 running on g002
OMP | Thread 0 running on NUMA node 1 core 5
OMP | Thread 1 running on NUMA node 1 core 6
OMP | Thread 2 running on NUMA node 1 core 6
OMP | Thread 3 running on NUMA node 1 core 4
Processs 8 running on g002
OMP | Thread 0 running on NUMA node 0 core 0
OMP | Thread 1 running on NUMA node 0 core 0
OMP | Thread 2 running on NUMA node 0 core 1
OMP | Thread 3 running on NUMA node 0 core 1
Processs 9 running on g002
OMP | Thread 0 running on NUMA node 1 core 5
OMP | Thread 1 running on NUMA node 1 core 4
OMP | Thread 2 running on NUMA node 1 core 4
OMP | Thread 3 running on NUMA node 1 core 4
Processs 10 running on g002
OMP | Thread 0 running on NUMA node 0 core 0
OMP | Thread 1 running on NUMA node 0 core 3
OMP | Thread 2 running on NUMA node 0 core 2
OMP | Thread 3 running on NUMA node 0 core 1
Processs 11 running on g002
OMP | Thread 0 running on NUMA node 1 core 5
OMP | Thread 1 running on NUMA node 1 core 4
OMP | Thread 2 running on NUMA node 1 core 6
OMP | Thread 3 running on NUMA node 1 core 7
...and again using my numactl wrapper.
OMP | Thread placement
Processs 0 running on g002
OMP | Thread 0 running on NUMA node 0 core 0
OMP | Thread 1 running on NUMA node 0 core 0
OMP | Thread 2 running on NUMA node 0 core 0
OMP | Thread 3 running on NUMA node 0 core 0
Processs 1 running on g002
OMP | Thread 0 running on NUMA node 1 core 4
OMP | Thread 1 running on NUMA node 1 core 5
OMP | Thread 2 running on NUMA node 1 core 6
OMP | Thread 3 running on NUMA node 1 core 7
Processs 2 running on g002
OMP | Thread 0 running on NUMA node 0 core 0
OMP | Thread 1 running on NUMA node 0 core 1
OMP | Thread 2 running on NUMA node 0 core 2
OMP | Thread 3 running on NUMA node 0 core 3
Processs 3 running on g002
OMP | Thread 0 running on NUMA node 1 core 4
OMP | Thread 1 running on NUMA node 1 core 5
OMP | Thread 2 running on NUMA node 1 core 6
OMP | Thread 3 running on NUMA node 1 core 7
Processs 4 running on g002
OMP | Thread 0 running on NUMA node 0 core 0
OMP | Thread 1 running on NUMA node 0 core 1
OMP | Thread 2 running on NUMA node 0 core 2
OMP | Thread 3 running on NUMA node 0 core 3
Processs 5 running on g002
OMP | Thread 0 running on NUMA node 1 core 4
OMP | Thread 1 running on NUMA node 1 core 5
OMP | Thread 2 running on NUMA node 1 core 6
OMP | Thread 3 running on NUMA node 1 core 7
Processs 6 running on g002
OMP | Thread 0 running on NUMA node 0 core 0
OMP | Thread 1 running on NUMA node 0 core 1
OMP | Thread 2 running on NUMA node 0 core 2
OMP | Thread 3 running on NUMA node 0 core 3
Processs 7 running on g002
OMP | Thread 0 running on NUMA node 1 core 4
OMP | Thread 1 running on NUMA node 1 core 6
OMP | Thread 2 running on NUMA node 1 core 7
OMP | Thread 3 running on NUMA node 1 core 5
Processs 8 running on g002
OMP | Thread 0 running on NUMA node 0 core 1
OMP | Thread 1 running on NUMA node 0 core 2
OMP | Thread 2 running on NUMA node 0 core 3
OMP | Thread 3 running on NUMA node 0 core 0
Processs 9 running on g002
OMP | Thread 0 running on NUMA node 1 core 4
OMP | Thread 1 running on NUMA node 1 core 5
OMP | Thread 2 running on NUMA node 1 core 6
OMP | Thread 3 running on NUMA node 1 core 7
Processs 10 running on g002
OMP | Thread 0 running on NUMA node 0 core 2
OMP | Thread 1 running on NUMA node 0 core 3
OMP | Thread 2 running on NUMA node 0 core 0
OMP | Thread 3 running on NUMA node 0 core 1
Processs 11 running on g002
OMP | Thread 0 running on NUMA node 1 core 4
OMP | Thread 1 running on NUMA node 1 core 5
OMP | Thread 2 running on NUMA node 1 core 6
OMP | Thread 3 running on NUMA node 1 core 7
========================================================================
for your reference:
mpirun --mca mpi_paffinity_alone 0 -x OMP_NUM_THREADS=4 -npernode 2 numawrap sh -c 'h=`hostname`; b=`hwloc-bind --get`; echo $h : $b'
g003 : 0x0000000f
g002 : 0x0000000f
g003 : 0x000000f0
g004 : 0x0000000f
g002 : 0x000000f0
g006 : 0x0000000f
g006 : 0x000000f0
g004 : 0x000000f0
g007 : 0x0000000f
g007 : 0x000000f0
g005 : 0x0000000f
g005 : 0x000000f0
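(for reading those masks: 0x0000000f corresponds to PUs 0-3, i.e. socket 0, and
0x000000f0 to PUs 4-7, i.e. socket 1, in the lstopo output above. if you want to
expand a mask yourself, something like the following should work, though i have
not verified this exact command line:

  hwloc-bind 0x0000000f -- grep Cpus_allowed_list /proc/self/status
)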
and here is the content of my "numawrap" script:
#!/bin/sh
# wrap the real command with numactl so that each MPI rank (and its
# OpenMP threads) stays on one NUMA node with local memory.
host=`uname -n`
mynode=${OMPI_COMM_WORLD_LOCAL_RANK-0}   # local MPI rank on this node
nmpi=${OMPI_COMM_WORLD_LOCAL_SIZE-1}     # number of MPI ranks on this node
nomp=${OMP_NUM_THREADS-1}                # OpenMP threads per rank
ntot=`expr ${nmpi} \* ${nomp}`
ncore=`grep '^processor' /proc/cpuinfo | wc -l`
minmpi=`numactl --hardware | awk '/^available/ { print $2;}'`  # number of NUMA nodes
maxomp=`expr ${ncore} / ${minmpi}`                             # cores per NUMA node

# don't oversubscribe
if [ ${ntot} -gt ${ncore} ]
then
    echo "Error: too many tasks requested on node ${host}:"
    echo "  ${nomp}xOpenMP * ${nmpi}xMPI = ${ntot} > ${ncore} CPU cores"
    exit 1
fi

# don't run threads across multiple NUMA "nodes"
if [ ${nomp} -gt ${maxomp} ]
then
    echo "Error: too many threads requested on node ${host}:"
    echo "  ${nomp}xOpenMP > ${maxomp} CPU cores per socket"
    exit 1
fi

# multiple MPI tasks per socket: map several local ranks onto the same NUMA node
if [ ${nomp} -lt ${maxomp} ]
then
    div=1
    if [ ${nmpi} -gt ${minmpi} ]
    then
        div=`expr ${nmpi} / ${minmpi}`
        chk=`expr ${div} \* ${minmpi}`
        if [ ${chk} -ne ${nmpi} ]
        then
            echo "Error: MPI tasks cannot be evenly assigned to CPUs on node ${host}:"
            echo "  ${nmpi}xMPI vs ${minmpi} CPU sockets"
            exit 1
        fi
    fi
    mynode=`expr ${mynode} / ${div}`
fi

# cancel per-node processor binding and reset it using numactl
if [ ${nomp} -gt 1 ]
then
    export OMPI_MCA_mpi_paffinity_alone=0
    numactl --cpunodebind=${mynode} --membind=${mynode} "$@"
else
    exec "$@"
fi
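the wrapper is inserted between the mpirun options and the binary, so the run
above presumably corresponds to something like this (an illustration, not a
verbatim copy of my job script):

  mpirun -npernode 2 -x OMP_NUM_THREADS=4 numawrap /path/to/cp2k/cp2k.psmp test.inp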