[CP2K:3705] hwloc support in cp2k-trunk
Axel
akoh... at gmail.com
Tue Jan 24 17:25:26 UTC 2012
On Monday, January 23, 2012 2:59:49 AM UTC-5, Christiane Pousa Ribeiro
wrote:
>
> Hi Axel,
>
hi christiane,
> I'm Christiane, the one responsible for the hwloc/libnuma support in
> cp2k.
>
thanks for taking the time to look into this.
> Concerning libnuma, the affinity support is much simpler than the one with
> hwloc: only thread/process affinity. I'll check this wrapper to see why it
> is not working and let you know.
>
ok. it may be version specific, too.
[akohlmey at g002 input]$ rpm -qif /usr/lib64/libnuma.so.1
Name        : numactl               Relocations: (not relocatable)
Version     : 2.0.3                 Vendor: Red Hat, Inc.
Release     : 9.el6                 Build Date: Thu Jun 17 10:46:17 2010
> About hwloc, it is true that it requires the latest version because of
> the PCI support for network cards and GPUs. By default this module only
> attaches processes and their memory to NUMA nodes. Their threads are not
> pinned to any cores, so they can move within a NUMA node. There are other
> strategies to place MPI tasks/threads that can be selected by setting the
> MACHINE_ARCH keys.
>
yes, this kind of behavior is what i would have expected.
this should also help with the internal threading in OpenMPI.
> Could you send me the input and MACHINE_ARCH keys that you used for these
> tests? I've tested the hwloc support on local intel/amd machines (with and
> without gpus) and on CRAY machines and I see no errors like that. All of
> them have NUMA characteristics.
>
please have a look at the attached file. you'll see that there
are some entries that don't look right. in particular, the node
names are all that of MPI rank 0.
> When you use numactl, how do you determine the cores for threads and
> MPI tasks? Do you attribute processes to NUMA nodes so that, consequently,
> threads are also attached to the same set of cores as their parent?
>
yes. our MPI installation is configured by default to have a 1:1 core to MPI
rank mapping (since there is practically nobody using MPI+OpenMP yet)
with memory affinity, to give people the best MPI-only performance.
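for reference, a site default like that could be set through OpenMPI's MCA
parameter file. the snippet below is only an illustration of the idea, not a
verbatim copy of our configuration (file location and comments are assumptions):

  # $OPENMPI_PREFIX/etc/openmpi-mca-params.conf  (hypothetical site default)
  # bind each MPI rank to one core; memory then stays local via first-touch
  mpi_paffinity_alone = 1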
at the end of the attached file i include a copy of the wrapper script,
which is OpenMPI specific (since that is the only MPI library installed).
overall, it looks to me like the default settings are giving desirable
processor and memory affinity (which is great), consistent with the best
settings i could get using my wrapper script, but the diagnostics seem
to be off and may confuse people, particularly technical support staff
in computing centers, who are often too literal and assume that any
software always gives 100% correct information. ;-)
cheers,
axel.
> So, if you have any suggestions or comments, we can discuss this and also
> solve the problems that you have found.
>
> --
> []'s
> Christiane Pousa Ribeiro
-------------- next part --------------
this is run using OpenMPI and OpenMP from gfortran 4.6.2
across 6 dual-processor nodes:
mpirun -npernode 2 -x OMP_NUM_THREADS=4 /path/to/cp2k/cp2k.psmp test.inp
(test.inp is the 32-water benchmark input with a MACHINE_ARCH section added.)
with different flags in that section i get the following output:
Ex.1:
&GLOBAL
&MACHINE_ARCH
PRINT_FULL T
&END
&END
MACHINE| Physical processing units organization
( P0, P1, P2, P3 ), ( P4, P5, P6, P7 )
MACHINE| Architecture organization
Machine#0(24GB)
NUMANode#0(12GB)
Socket#0
L3 (12MB)
L2(256KB) L1(32KB) Core#0 PU#0
L2(256KB) L1(32KB) Core#1 PU#1
L2(256KB) L1(32KB) Core#9 PU#2
L2(256KB) L1(32KB) Core#10 PU#3
--Network Card
--Network Card
NUMANode#1(12GB)
Socket#1
L3 (12MB)
L2(256KB) L1(32KB) Core#0 PU#4
L2(256KB) L1(32KB) Core#1 PU#5
L2(256KB) L1(32KB) Core#9 PU#6
L2(256KB) L1(32KB) Core#10 PU#7
for comparison, the output of lstopo on the same node:
Machine (24GB)
NUMANode #0 (phys=0 12GB) + Socket #0 + L3 #0 (12MB)
L2 #0 (256KB) + L1 #0 (32KB) + Core #0 + PU #0 (phys=0)
L2 #1 (256KB) + L1 #1 (32KB) + Core #1 + PU #1 (phys=1)
L2 #2 (256KB) + L1 #2 (32KB) + Core #2 + PU #2 (phys=2)
L2 #3 (256KB) + L1 #3 (32KB) + Core #3 + PU #3 (phys=3)
NUMANode #1 (phys=1 12GB) + Socket #1 + L3 #1 (12MB)
L2 #4 (256KB) + L1 #4 (32KB) + Core #4 + PU #4 (phys=4)
L2 #5 (256KB) + L1 #5 (32KB) + Core #5 + PU #5 (phys=5)
L2 #6 (256KB) + L1 #6 (32KB) + Core #6 + PU #6 (phys=6)
L2 #7 (256KB) + L1 #7 (32KB) + Core #7 + PU #7 (phys=7)
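note that the two listings do not number the cores the same way (Core#9/Core#10
above vs. Core #2/Core #3 in lstopo). my guess is that one prints physical (OS)
indices and the other logical ones; lstopo can also print physical indices,
which should make the two directly comparable (a suggestion only, not something
i have verified on this box):

  lstopo -p    # -p/--physical: report physical (OS) indices instead of logical ones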
========================================================================
&GLOBAL
&MACHINE_ARCH
PRINT_BRANCH T
&END
&END
MACHINE| NUMA node organization
NUMANode#0(12GB)
Socket#0
L3 (12MB)
L2(256KB) L1(32KB) Core#0 PU#0
L2(256KB) L1(32KB) Core#1 PU#1
L2(256KB) L1(32KB) Core#9 PU#2
L2(256KB) L1(32KB) Core#10 PU#3
========================================================================
&GLOBAL
&MACHINE_ARCH
PRINT_PROC T
&END
&END
this is without setting any processor/memory affinity externally; it again
gives some strange core ranges (e.g. 7-10, which does not even exist on these
8-core nodes). a way to cross-check this is suggested after the two listings below.
SCHED| Processes are now runing
Process 0 5173 running on NUMA node 0 core 0-3
Process 1 5174 running on NUMA node 1 core 4-7
Process 2 15205 running on NUMA node 0 core 2-5
Process 3 15206 running on NUMA node 1 core 5-8
Process 4 3091 running on NUMA node 0 core 3-6
Process 5 3092 running on NUMA node 1 core 4-7
Process 6 24978 running on NUMA node 0 core 1-4
Process 7 24979 running on NUMA node 1 core 7-10
Process 8 17698 running on NUMA node 0 core 2-5
Process 9 17699 running on NUMA node 1 core 4-7
Process 10 30755 running on NUMA node 0 core 0-3
Process 11 30756 running on NUMA node 1 core 4-7
MEMORY| Processes memory mapping
Process 0 5173 memory policy LOCAL node 0
Process 1 5174 memory policy LOCAL node 1
Process 2 15205 memory policy LOCAL node 0
Process 3 15206 memory policy LOCAL node 1
Process 4 3091 memory policy LOCAL node 0
Process 5 3092 memory policy LOCAL node 1
Process 6 24978 memory policy LOCAL node 0
Process 7 24979 memory policy LOCAL node 1
Process 8 17698 memory policy LOCAL node 0
Process 9 17699 memory policy LOCAL node 1
Process 10 30755 memory policy LOCAL node 0
Process 11 30756 memory policy LOCAL node 1
this is using my wrapper script around numactl (and what i expect):
NETWORK| Affinity is on
SCHED| Processes are now runing
Process 0 5264 running on NUMA node 0 core 0-3
Process 1 5265 running on NUMA node 1 core 4-7
Process 2 15294 running on NUMA node 0 core 0-3
Process 3 15293 running on NUMA node 1 core 4-7
Process 4 3193 running on NUMA node 0 core 0-3
Process 5 3192 running on NUMA node 1 core 4-7
Process 6 25070 running on NUMA node 0 core 0-3
Process 7 25071 running on NUMA node 1 core 4-7
Process 8 17793 running on NUMA node 0 core 0-3
Process 9 17794 running on NUMA node 1 core 4-7
Process 10 30853 running on NUMA node 0 core 0-3
Process 11 30852 running on NUMA node 1 core 4-7
MEMORY| Processes memory mapping
Process 0 5264 memory policy LOCAL node 0
Process 1 5265 memory policy LOCAL node 1
Process 2 15294 memory policy LOCAL node 0
Process 3 15293 memory policy LOCAL node 1
Process 4 3193 memory policy LOCAL node 0
Process 5 3192 memory policy LOCAL node 1
Process 6 25070 memory policy LOCAL node 0
Process 7 25071 memory policy LOCAL node 1
Process 8 17793 memory policy LOCAL node 0
Process 9 17794 memory policy LOCAL node 1
Process 10 30853 memory policy LOCAL node 0
Process 11 30852 memory policy LOCAL node 1
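one way to check independently whether those SCHED| lines are wrong, or whether
the processes really are bound like that, is to ask the kernel for the binding
of a given PID (standard Linux tools, nothing CP2K specific; the PID below is
taken from the first listing above):

  grep Cpus_allowed_list /proc/5173/status   # cpuset the kernel has recorded for that PID
  taskset -cp 5173                           # same information via util-linux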
========================================================================
&GLOBAL
&MACHINE_ARCH
PRINT_THREAD_CUR T
&END
&END
OMP | Thread 0 from process 0 running on NUMA node 0 core 0
OMP | Thread 1 from process 0 running on NUMA node 0 core 2
OMP | Thread 2 from process 0 running on NUMA node 0 core 1
OMP | Thread 3 from process 0 running on NUMA node 0 core 3
========================================================================
&GLOBAL
&MACHINE_ARCH
PRINT_THREAD T
&END
&END
this is without setting any kind of processor/memory affinity externally
OMP | Thread placement
Processs 0 running on g002
OMP | Thread 0 running on NUMA node 0 core 0
OMP | Thread 1 running on NUMA node 0 core 0
OMP | Thread 2 running on NUMA node 0 core 0
OMP | Thread 3 running on NUMA node 0 core 0
Processs 1 running on g002
OMP | Thread 0 running on NUMA node 1 core 4
OMP | Thread 1 running on NUMA node 1 core 4
OMP | Thread 2 running on NUMA node 1 core 5
OMP | Thread 3 running on NUMA node 1 core 5
Processs 2 running on g002
OMP | Thread 0 running on NUMA node 0 core 0
OMP | Thread 1 running on NUMA node 0 core 2
OMP | Thread 2 running on NUMA node 0 core 0
OMP | Thread 3 running on NUMA node 0 core 2
Processs 3 running on g002
OMP | Thread 0 running on NUMA node 1 core 5
OMP | Thread 1 running on NUMA node 1 core 4
OMP | Thread 2 running on NUMA node 1 core 7
OMP | Thread 3 running on NUMA node 1 core 4
Processs 4 running on g002
OMP | Thread 0 running on NUMA node 0 core 1
OMP | Thread 1 running on NUMA node 0 core 0
OMP | Thread 2 running on NUMA node 0 core 2
OMP | Thread 3 running on NUMA node 0 core 3
Processs 5 running on g002
OMP | Thread 0 running on NUMA node 1 core 4
OMP | Thread 1 running on NUMA node 1 core 6
OMP | Thread 2 running on NUMA node 1 core 7
OMP | Thread 3 running on NUMA node 1 core 5
Processs 6 running on g002
OMP | Thread 0 running on NUMA node 0 core 0
OMP | Thread 1 running on NUMA node 0 core 1
OMP | Thread 2 running on NUMA node 0 core 2
OMP | Thread 3 running on NUMA node 0 core 3
Processs 7 running on g002
OMP | Thread 0 running on NUMA node 1 core 5
OMP | Thread 1 running on NUMA node 1 core 6
OMP | Thread 2 running on NUMA node 1 core 6
OMP | Thread 3 running on NUMA node 1 core 4
Processs 8 running on g002
OMP | Thread 0 running on NUMA node 0 core 0
OMP | Thread 1 running on NUMA node 0 core 0
OMP | Thread 2 running on NUMA node 0 core 1
OMP | Thread 3 running on NUMA node 0 core 1
Processs 9 running on g002
OMP | Thread 0 running on NUMA node 1 core 5
OMP | Thread 1 running on NUMA node 1 core 4
OMP | Thread 2 running on NUMA node 1 core 4
OMP | Thread 3 running on NUMA node 1 core 4
Processs 10 running on g002
OMP | Thread 0 running on NUMA node 0 core 0
OMP | Thread 1 running on NUMA node 0 core 3
OMP | Thread 2 running on NUMA node 0 core 2
OMP | Thread 3 running on NUMA node 0 core 1
Processs 11 running on g002
OMP | Thread 0 running on NUMA node 1 core 5
OMP | Thread 1 running on NUMA node 1 core 4
OMP | Thread 2 running on NUMA node 1 core 6
OMP | Thread 3 running on NUMA node 1 core 7
...and again using my numactl wrapper.
OMP | Thread placement
Processs 0 running on g002
OMP | Thread 0 running on NUMA node 0 core 0
OMP | Thread 1 running on NUMA node 0 core 0
OMP | Thread 2 running on NUMA node 0 core 0
OMP | Thread 3 running on NUMA node 0 core 0
Processs 1 running on g002
OMP | Thread 0 running on NUMA node 1 core 4
OMP | Thread 1 running on NUMA node 1 core 5
OMP | Thread 2 running on NUMA node 1 core 6
OMP | Thread 3 running on NUMA node 1 core 7
Processs 2 running on g002
OMP | Thread 0 running on NUMA node 0 core 0
OMP | Thread 1 running on NUMA node 0 core 1
OMP | Thread 2 running on NUMA node 0 core 2
OMP | Thread 3 running on NUMA node 0 core 3
Processs 3 running on g002
OMP | Thread 0 running on NUMA node 1 core 4
OMP | Thread 1 running on NUMA node 1 core 5
OMP | Thread 2 running on NUMA node 1 core 6
OMP | Thread 3 running on NUMA node 1 core 7
Processs 4 running on g002
OMP | Thread 0 running on NUMA node 0 core 0
OMP | Thread 1 running on NUMA node 0 core 1
OMP | Thread 2 running on NUMA node 0 core 2
OMP | Thread 3 running on NUMA node 0 core 3
Processs 5 running on g002
OMP | Thread 0 running on NUMA node 1 core 4
OMP | Thread 1 running on NUMA node 1 core 5
OMP | Thread 2 running on NUMA node 1 core 6
OMP | Thread 3 running on NUMA node 1 core 7
Processs 6 running on g002
OMP | Thread 0 running on NUMA node 0 core 0
OMP | Thread 1 running on NUMA node 0 core 1
OMP | Thread 2 running on NUMA node 0 core 2
OMP | Thread 3 running on NUMA node 0 core 3
Processs 7 running on g002
OMP | Thread 0 running on NUMA node 1 core 4
OMP | Thread 1 running on NUMA node 1 core 6
OMP | Thread 2 running on NUMA node 1 core 7
OMP | Thread 3 running on NUMA node 1 core 5
Processs 8 running on g002
OMP | Thread 0 running on NUMA node 0 core 1
OMP | Thread 1 running on NUMA node 0 core 2
OMP | Thread 2 running on NUMA node 0 core 3
OMP | Thread 3 running on NUMA node 0 core 0
Processs 9 running on g002
OMP | Thread 0 running on NUMA node 1 core 4
OMP | Thread 1 running on NUMA node 1 core 5
OMP | Thread 2 running on NUMA node 1 core 6
OMP | Thread 3 running on NUMA node 1 core 7
Processs 10 running on g002
OMP | Thread 0 running on NUMA node 0 core 2
OMP | Thread 1 running on NUMA node 0 core 3
OMP | Thread 2 running on NUMA node 0 core 0
OMP | Thread 3 running on NUMA node 0 core 1
Processs 11 running on g002
OMP | Thread 0 running on NUMA node 1 core 4
OMP | Thread 1 running on NUMA node 1 core 5
OMP | Thread 2 running on NUMA node 1 core 6
OMP | Thread 3 running on NUMA node 1 core 7
========================================================================
for your reference:
mpirun --mca mpi_paffinity_alone 0 -x OMP_NUM_THREADS=4 -npernode 2 numawrap sh -c 'h=`hostname`; b=`hwloc-bind --get`; echo $h : $b'
g003 : 0x0000000f
g002 : 0x0000000f
g003 : 0x000000f0
g004 : 0x0000000f
g002 : 0x000000f0
g006 : 0x0000000f
g006 : 0x000000f0
g004 : 0x000000f0
g007 : 0x0000000f
g007 : 0x000000f0
g005 : 0x0000000f
g005 : 0x000000f0
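(for reading those masks: 0x0000000f corresponds to PUs 0-3, i.e. socket 0, and
0x000000f0 to PUs 4-7, i.e. socket 1, in the lstopo output above. if you want to
expand a mask yourself, something like the following should work, though i have
not verified this exact command line:

  hwloc-bind 0x0000000f -- grep Cpus_allowed_list /proc/self/status
)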
and here is the content of my "numawrap" script:
#!/bin/sh
# wrap the real command with numactl so that each MPI rank (and its
# OpenMP threads) stays on one NUMA node with local memory.
host=`uname -n`
mynode=${OMPI_COMM_WORLD_LOCAL_RANK-0}   # local MPI rank on this node
nmpi=${OMPI_COMM_WORLD_LOCAL_SIZE-1}     # number of MPI ranks on this node
nomp=${OMP_NUM_THREADS-1}                # OpenMP threads per rank
ntot=`expr ${nmpi} \* ${nomp}`
ncore=`grep '^processor' /proc/cpuinfo | wc -l`
minmpi=`numactl --hardware | awk '/^available/ { print $2;}'`  # number of NUMA nodes
maxomp=`expr ${ncore} / ${minmpi}`                             # cores per NUMA node

# don't oversubscribe
if [ ${ntot} -gt ${ncore} ]
then
    echo "Error: too many tasks requested on node ${host}:"
    echo "  ${nomp}xOpenMP * ${nmpi}xMPI = ${ntot} > ${ncore} CPU cores"
    exit 1
fi

# don't run threads across multiple NUMA "nodes"
if [ ${nomp} -gt ${maxomp} ]
then
    echo "Error: too many threads requested on node ${host}:"
    echo "  ${nomp}xOpenMP > ${maxomp} CPU cores per socket"
    exit 1
fi

# multiple MPI tasks per socket: map several local ranks onto the same NUMA node
if [ ${nomp} -lt ${maxomp} ]
then
    div=1
    if [ ${nmpi} -gt ${minmpi} ]
    then
        div=`expr ${nmpi} / ${minmpi}`
        chk=`expr ${div} \* ${minmpi}`
        if [ ${chk} -ne ${nmpi} ]
        then
            echo "Error: MPI tasks cannot be evenly assigned to CPUs on node ${host}:"
            echo "  ${nmpi}xMPI vs ${minmpi} CPU sockets"
            exit 1
        fi
    fi
    mynode=`expr ${mynode} / ${div}`
fi

# cancel per-node processor binding and reset it using numactl
if [ ${nomp} -gt 1 ]
then
    export OMPI_MCA_mpi_paffinity_alone=0
    numactl --cpunodebind=${mynode} --membind=${mynode} "$@"
else
    exec "$@"
fi
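the wrapper is inserted between the mpirun options and the binary, so the run
above presumably corresponds to something like this (an illustration, not a
verbatim copy of my job script):

  mpirun -npernode 2 -x OMP_NUM_THREADS=4 numawrap /path/to/cp2k/cp2k.psmp test.inp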