[CP2K-user] [CP2K:20750] slow slurm regtests

bartosz mazur bamaz.97 at gmail.com
Mon Oct 7 13:10:09 UTC 2024


Hi all, 

I am trying to run the regtests with the Slurm sbatch script, but they
execute extremely slowly. Looking at the running job, I can see that only 4
CPUs are in use (out of the 48 requested). It looks as if the tests are run
one after the other, each using 2 MPI ranks x 2 OpenMP threads = 4 CPUs.
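
To make the numbers concrete, here is a minimal sketch of the CPU budget. It uses the `#SBATCH` values from the script below (24 tasks per node, 2 CPUs per task) and 2 ranks x 2 threads per test. The variable names are only illustrative, and the sketch does not claim to reproduce how do_regtest.py derives its own worker count.

```shell
#!/bin/bash
# Illustrative arithmetic only; the variable names are hypothetical.
ntasks_per_node=24   # from: #SBATCH --ntasks-per-node=24
cpus_per_task=2      # from: #SBATCH --cpus-per-task=2
mpi_ranks=2          # MPI ranks per test
omp_threads=2        # OpenMP threads per rank

# CPUs requested for the job on the single node.
allocated_cpus=$(( ntasks_per_node * cpus_per_task ))
# CPUs occupied by one running test.
cpus_per_test=$(( mpi_ranks * omp_threads ))
# How many tests would fit side by side in the allocation.
concurrent_tests=$(( allocated_cpus / cpus_per_test ))

echo "allocated CPUs:       ${allocated_cpus}"
echo "CPUs per test:        ${cpus_per_test}"
echo "tests that could fit: ${concurrent_tests}"
```

With these values, up to 12 tests could run side by side, so seeing only 4 busy CPUs corresponds to a single test running at a time.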

I have already tried different `mpiexec` command settings and changed the
`srun` command, but this did not help. When using 4 nodes, the job still
runs on only 4 CPUs of a single node. I also don't understand why the
system reports 2 GPUs when `nvidia-smi --query-gpu=gpu_name
--format=csv,noheader | wc -l` is called; I modified do_regtest.py to
force 0 GPUs, but that didn't change anything either. The instructions at
https://www.cp2k.org/dev:regtesting#run_with_sbatch are out of date, so
maybe something else needs to be changed in the script?
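
For reference, the GPU count can be checked by hand on a compute node with a small wrapper around the same `nvidia-smi` query. This is only a sketch: the function name `count_gpus` is hypothetical, and the fallback to 0 when `nvidia-smi` is missing or fails is an assumption of this example, not necessarily what do_regtest.py does.

```shell
#!/bin/bash
# Count visible NVIDIA GPUs with the nvidia-smi query quoted above.
# Prints 0 when nvidia-smi is not installed or exits with an error
# (this fallback is an assumption made for the sketch).
count_gpus() {
  local names
  if command -v nvidia-smi >/dev/null 2>&1 &&
     names=$(nvidia-smi --query-gpu=gpu_name --format=csv,noheader 2>/dev/null); then
    # One GPU name per line; count only non-empty lines.
    printf '%s\n' "${names}" | grep -c . || true
  else
    echo 0
  fi
}

num_gpus=$(count_gpus)
echo "GPUs reported: ${num_gpus}"
```

Running it inside the job (rather than on the login node) shows what the batch environment itself sees.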

I would appreciate any help!

Here is my sbatch script:

```
#!/bin/bash -l
#SBATCH --time=06:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH --cpus-per-task=2
#SBATCH --ntasks-per-core=1
#SBATCH --mem=180G

set -o errexit
set -o nounset
set -o pipefail

export MPICH_OFI_STARTUP_CONNECT=1
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
# export OMP_PROC_BIND=close
# export OMP_PLACES=cores

module load intel/2022b
module load GCC/12.2.0

# Let the user see the currently loaded modules in the slurm log for completeness:
module list

CP2K_BASE_DIR="/lustre/pd01/hpc-kuchta-1716987452/software/cp2k"
CP2K_TEST_DIR=${TMPDIR}

CP2K_VERSION="psmp"

NTASKS_SINGLE_TEST=2
NNODES_SINGLE_TEST=1
SRUN_CMD="srun --cpu-bind=verbose,cores"

# to run tests across nodes (to check for communication effects), use:
# NNODES_SINGLE_TEST=4
# SRUN_CMD="srun --cpu-bind=verbose,cores --ntasks-per-node 2"

# the following should be sufficiently generic:

mkdir -p "${CP2K_TEST_DIR}"
cd "${CP2K_TEST_DIR}"

cp2k_rel_dir=$(realpath --relative-to="${CP2K_TEST_DIR}" "${CP2K_BASE_DIR}/exe/local")
# srun does not like `-np`, override the complete command instead:
export cp2k_run_prefix="${SRUN_CMD} -N ${NNODES_SINGLE_TEST} -n ${NTASKS_SINGLE_TEST}"

"${CP2K_REGEST_SCRIPT_DIR:-${CP2K_BASE_DIR}/tests}/do_regtest.py" \
  --mpiranks ${NTASKS_SINGLE_TEST} \
  --ompthreads ${OMP_NUM_THREADS} \
  --maxtasks ${SLURM_NTASKS} \
  --num_gpus 0 \
  --workbasedir "${CP2K_TEST_DIR}" \
  --mpiexec "mpiexec -n {N}" \
  --debug \
  "${cp2k_rel_dir}" \
  "${CP2K_VERSION}" \
 |& tee "${CP2K_TEST_DIR}/${CP2K_ARCH}.${CP2K_VERSION}.log"
```

And here is the output after 1 h of execution:

```

Loading intel/2022b
  Loading requirement: GCCcore/12.2.0 zlib/1.2.12-GCCcore-12.2.0
    binutils/2.39-GCCcore-12.2.0 intel-compilers/2022.2.1
    numactl/2.0.16-GCCcore-12.2.0 UCX/1.13.1-GCCcore-12.2.0
    impi/2021.7.1-intel-compilers-2022.2.1 imkl/2022.2.1 iimpi/2022b
    imkl-FFTW/2022.2.1-iimpi-2022b
Currently Loaded Modulefiles:
 1) GCCcore/12.2.0                  7) impi/2021.7.1-intel-compilers-2022.2.1
 2) zlib/1.2.12-GCCcore-12.2.0      8) imkl/2022.2.1
 3) binutils/2.39-GCCcore-12.2.0    9) iimpi/2022b
 4) intel-compilers/2022.2.1       10) imkl-FFTW/2022.2.1-iimpi-2022b
 5) numactl/2.0.16-GCCcore-12.2.0  11) intel/2022b
 6) UCX/1.13.1-GCCcore-12.2.0      12) GCC/12.2.0

*************************** Testing started ****************************
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp'] ('--version',)

----------------------------- Settings ---------------------------------
MPI ranks:      2
OpenMP threads: 2
GPU devices:    2
Workers:        6
Timeout [s]:    400
Work base dir:  /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41
MPI exec:       mpiexec -n {N}
Smoke test:     False
Valgrind:       False
Keepalive:      False
Flag slow:      False
Debug:          True
Binary dir:     /lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local
VERSION:        psmp
Flags:          omp,libint,fftw3,libxc,libgrpp,pexsi,elpa,parallel,scalapack,mpi_f08,cosma,xsmm,plumed2,spglib,mkl,sirius,libvori,libbqb,libvdwxc,hdf5
------------------------------------------------------------------------
Copying test files ... done
Skipping UNIT/nequip_unittest because its requirements are not satisfied.
Skipping TMC/regtest_ana_on_the_fly because its requirements are not satisfied.
Skipping QS/regtest-cusolver because its requirements are not satisfied.
Skipping QS/regtest-dlaf because its requirements are not satisfied.
Skipping Fist/regtest-nequip because its requirements are not satisfied.
Skipping Fist/regtest-allegro because its requirements are not satisfied.
Skipping QS/regtest-dft-vdw-corr-4 because its requirements are not satisfied.
Skipping Fist/regtest-deepmd because its requirements are not satisfied.
Skipping Fist/regtest-quip because its requirements are not satisfied.
Launched 362 test directories and 6 worker...


Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/dbt_tas_unittest.psmp'] ('/lustre/pd01/hpc-kuchta-1716987452/software/cp2k',)
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/dbt_unittest.psmp'] ('/lustre/pd01/hpc-kuchta-1716987452/software/cp2k',)
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/grid_unittest.psmp'] ('/lustre/pd01/hpc-kuchta-1716987452/software/cp2k',)
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/libcp2k_unittest.psmp'] ('/lustre/pd01/hpc-kuchta-1716987452/software/cp2k',)
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/memory_utilities_unittest.psmp'] ('/lustre/pd01/hpc-kuchta-1716987452/software/cp2k',)
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/parallel_rng_types_unittest.psmp'] ('/lustre/pd01/hpc-kuchta-1716987452/software/cp2k',)
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp'] ('RPA_SIGMA_H2O_clenshaw.inp',)
>>> /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/memory_utilities_unittest
    memory_utilities_unittest                                 -           OK (   0.29 sec)
<<< /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/memory_utilities_unittest (1 of 362) done in 0.29 sec
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp'] ('H2O_ref.inp',)
>>> /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/dbt_unittest
    dbt_unittest                                              - RUNTIME FAIL (   1.61 sec)
<<< /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/dbt_unittest (2 of 362) done in 1.61 sec
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp'] ('h2o_f01_coulomb_only.inp',)
>>> /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/dbt_tas_unittest
    dbt_tas_unittest                                          - RUNTIME FAIL (   1.84 sec)
<<< /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/dbt_tas_unittest (3 of 362) done in 1.84 sec
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp'] ('test01.inp',)
>>> /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/parallel_rng_types_unittest
    parallel_rng_types_unittest                               -           OK (   2.04 sec)
<<< /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/parallel_rng_types_unittest (4 of 362) done in 2.04 sec
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp'] ('h2o_f21.inp',)
>>> /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/grid_unittest
    grid_unittest                                             -           OK (   2.53 sec)
<<< /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/grid_unittest (5 of 362) done in 2.53 sec
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp'] ('h2o_dip12.inp',)
>>> /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/libcp2k_unittest
    libcp2k_unittest                                          -           OK (  19.03 sec)
<<< /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/libcp2k_unittest (6 of 362) done in 19.03 sec
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp'] ('RPA_SIGMA_H2O_minimax.inp',)
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp'] ('RPA_SIGMA_H_minimax.inp',)
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp'] ('H2O_pao_exp.inp',)
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp'] ('RPA_SIGMA_H_clenshaw.inp',)
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp'] ('RPA_SIGMA_H2O_minimax_NUM_INTEG_GROUPS.inp',)
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp'] ('H2O-5.inp',)
>>> /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/QS/regtest-rpa-sigma
    RPA_SIGMA_H2O_clenshaw.inp                     -17.19226814           OK (  83.42 sec)
    RPA_SIGMA_H2O_minimax.inp                      -17.18984039           OK (  83.59 sec)
    RPA_SIGMA_H_minimax.inp                       -0.5150377917           OK (  63.64 sec)
    RPA_SIGMA_H_clenshaw.inp                      -0.5150909069           OK (  65.65 sec)
    RPA_SIGMA_H2O_minimax_NUM_INTEG_GROUPS.inp     -17.18984039           OK (  86.54 sec)
<<< /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/QS/regtest-rpa-sigma (7 of 362) done in 382.84 sec
```
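
As a quick consistency check on the log above: the five per-test times in QS/regtest-rpa-sigma add up to the 382.84 sec reported for the whole directory, which suggests that the tests inside one directory ran one after another. A minimal sketch of that sum follows, with the times copied from the log; `awk` is used only because plain bash arithmetic is integer-only.

```shell
#!/bin/bash
# Per-test wall times (seconds) copied from the regtest-rpa-sigma block of the log.
times="83.42 83.59 63.64 65.65 86.54"

# Sum all fields on the line and print the result with two decimals.
total=$(echo "${times}" | awk '{ s = 0; for (i = 1; i <= NF; i++) s += $i; printf "%.2f", s }')

# The log reports 382.84 sec for the whole directory.
echo "sum of test times: ${total} sec"
```

Note that one directory took more than six minutes, against a per-test timeout of 400 s shown in the settings.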
