[CP2K-user] [CP2K:20750] slow slurm regtests
bartosz mazur
bamaz.97 at gmail.com
Mon Oct 7 13:10:09 UTC 2024
Hi all,
I am trying to run the regtests via the SLURM sbatch script below, but they
execute extremely slowly. Looking at the running job, I can see that only 4
of the 48 allocated CPUs are in use. It looks as if the tests run one after
another, i.e. a single test's 2 MPI ranks x 2 OMP threads = 4 CPUs.
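The arithmetic behind that observation, as a minimal shell sketch (the worker
count of 6 is taken from the regtest log further down; the numbers are this
job's values, nothing general):
```shell
# CPU footprint of the regtests with this job's settings.
MPIRANKS=2      # --mpiranks: MPI ranks per test
OMPTHREADS=2    # --ompthreads: OpenMP threads per rank
WORKERS=6       # "Workers: 6" reported by do_regtest.py
echo "CPUs per test: $(( MPIRANKS * OMPTHREADS ))"
echo "CPUs if all workers ran concurrently: $(( MPIRANKS * OMPTHREADS * WORKERS ))"
```
If the workers really ran in parallel, roughly 24 CPUs should be busy, so
seeing only 4 in use matches one test running at a time.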
I have already tried different `mpiexec` command settings and swapped in the
`srun` command, but that did not help. With 4 nodes the job still runs on
only 4 CPUs of a single node. I also don't understand why the system reports
2 GPUs when `nvidia-smi --query-gpu=gpu_name --format=csv,noheader | wc -l`
is called, so I modified do_regtest.py to force 0 GPUs, but that didn't
change anything either. The instructions at
https://www.cp2k.org/dev:regtesting#run_with_sbatch are out of date, so maybe
something else in the script needs to be changed?
I would appreciate any help!
Here is my sbatch script:
```
#!/bin/bash -l
#SBATCH --time=06:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH --cpus-per-task=2
#SBATCH --ntasks-per-core=1
#SBATCH --mem=180G
set -o errexit
set -o nounset
set -o pipefail
export MPICH_OFI_STARTUP_CONNECT=1
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
# export OMP_PROC_BIND=close
# export OMP_PLACES=cores
module load intel/2022b
module load GCC/12.2.0
# Let the user see the currently loaded modules in the slurm log for completeness:
module list
CP2K_BASE_DIR="/lustre/pd01/hpc-kuchta-1716987452/software/cp2k"
CP2K_TEST_DIR=${TMPDIR}
CP2K_VERSION="psmp"
NTASKS_SINGLE_TEST=2
NNODES_SINGLE_TEST=1
SRUN_CMD="srun --cpu-bind=verbose,cores"
# to run tests across nodes (to check for communication effects), use:
# NNODES_SINGLE_TEST=4
# SRUN_CMD="srun --cpu-bind=verbose,cores --ntasks-per-node 2"
# the following should be sufficiently generic:
mkdir -p "${CP2K_TEST_DIR}"
cd "${CP2K_TEST_DIR}"
cp2k_rel_dir=$(realpath --relative-to="${CP2K_TEST_DIR}" "${CP2K_BASE_DIR}/exe/local")
# srun does not like `-np`, override the complete command instead:
export cp2k_run_prefix="${SRUN_CMD} -N ${NNODES_SINGLE_TEST} -n ${NTASKS_SINGLE_TEST}"
"${CP2K_REGEST_SCRIPT_DIR:-${CP2K_BASE_DIR}/tests}/do_regtest.py" \
--mpiranks ${NTASKS_SINGLE_TEST} \
--ompthreads ${OMP_NUM_THREADS} \
--maxtasks ${SLURM_NTASKS} \
--num_gpus 0 \
--workbasedir "${CP2K_TEST_DIR}" \
--mpiexec "mpiexec -n {N}" \
--debug \
"${cp2k_rel_dir}" \
"${CP2K_VERSION}" \
|& tee "${CP2K_TEST_DIR}/${CP2K_ARCH}.${CP2K_VERSION}.log"
```
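For what it's worth, the worker count that do_regtest.py reports in the log
below is consistent with integer division of --maxtasks by ranks times
threads. This is my guess at the sizing rule from the numbers alone, not
something I verified in the script:
```shell
# Assumed pool sizing: SLURM_NTASKS=24 passed as --maxtasks,
# divided by 2 MPI ranks x 2 OMP threads per test.
MAXTASKS=24
MPIRANKS=2
OMPTHREADS=2
echo $(( MAXTASKS / (MPIRANKS * OMPTHREADS) ))
```
which gives 6, matching the "Workers: 6" line in the log.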
And here is the output after about 1 hour of execution:
```
Loading intel/2022b
Loading requirement: GCCcore/12.2.0 zlib/1.2.12-GCCcore-12.2.0
binutils/2.39-GCCcore-12.2.0 intel-compilers/2022.2.1
numactl/2.0.16-GCCcore-12.2.0 UCX/1.13.1-GCCcore-12.2.0
impi/2021.7.1-intel-compilers-2022.2.1 imkl/2022.2.1 iimpi/2022b
imkl-FFTW/2022.2.1-iimpi-2022b
Currently Loaded Modulefiles:
 1) GCCcore/12.2.0                 7) impi/2021.7.1-intel-compilers-2022.2.1
 2) zlib/1.2.12-GCCcore-12.2.0     8) imkl/2022.2.1
 3) binutils/2.39-GCCcore-12.2.0   9) iimpi/2022b
 4) intel-compilers/2022.2.1      10) imkl-FFTW/2022.2.1-iimpi-2022b
 5) numactl/2.0.16-GCCcore-12.2.0 11) intel/2022b
 6) UCX/1.13.1-GCCcore-12.2.0     12) GCC/12.2.0
*************************** Testing started ****************************
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp']
('--version',)
----------------------------- Settings ---------------------------------
MPI ranks: 2
OpenMP threads: 2
GPU devices: 2
Workers: 6
Timeout [s]: 400
Work base dir: /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41
MPI exec: mpiexec -n {N}
Smoke test: False
Valgrind: False
Keepalive: False
Flag slow: False
Debug: True
Binary dir: /lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local
VERSION: psmp
Flags:
omp,libint,fftw3,libxc,libgrpp,pexsi,elpa,parallel,scalapack,mpi_f08,cosma,xsmm,plumed2,spglib,mkl,sirius,libvori,libbqb,libvdwxc,hdf5
------------------------------------------------------------------------
Copying test files ... done
Skipping UNIT/nequip_unittest because its requirements are not satisfied.
Skipping TMC/regtest_ana_on_the_fly because its requirements are not satisfied.
Skipping QS/regtest-cusolver because its requirements are not satisfied.
Skipping QS/regtest-dlaf because its requirements are not satisfied.
Skipping Fist/regtest-nequip because its requirements are not satisfied.
Skipping Fist/regtest-allegro because its requirements are not satisfied.
Skipping QS/regtest-dft-vdw-corr-4 because its requirements are not satisfied.
Skipping Fist/regtest-deepmd because its requirements are not satisfied.
Skipping Fist/regtest-quip because its requirements are not satisfied.
Launched 362 test directories and 6 worker...
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/dbt_tas_unittest.psmp']
('/lustre/pd01/hpc-kuchta-1716987452/software/cp2k',)
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/dbt_unittest.psmp']
('/lustre/pd01/hpc-kuchta-1716987452/software/cp2k',)
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/grid_unittest.psmp']
('/lustre/pd01/hpc-kuchta-1716987452/software/cp2k',)
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/libcp2k_unittest.psmp']
('/lustre/pd01/hpc-kuchta-1716987452/software/cp2k',)
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/memory_utilities_unittest.psmp']
('/lustre/pd01/hpc-kuchta-1716987452/software/cp2k',)
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/parallel_rng_types_unittest.psmp']
('/lustre/pd01/hpc-kuchta-1716987452/software/cp2k',)
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp']
('RPA_SIGMA_H2O_clenshaw.inp',)
>>> /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/memory_utilities_unittest
memory_utilities_unittest - OK ( 0.29 sec)
<<< /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/memory_utilities_unittest (1 of 362) done in 0.29 sec
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp']
('H2O_ref.inp',)
>>> /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/dbt_unittest
dbt_unittest - RUNTIME FAIL ( 1.61 sec)
<<< /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/dbt_unittest (2 of 362) done in 1.61 sec
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp']
('h2o_f01_coulomb_only.inp',)
>>> /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/dbt_tas_unittest
dbt_tas_unittest - RUNTIME FAIL ( 1.84 sec)
<<< /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/dbt_tas_unittest (3 of 362) done in 1.84 sec
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp']
('test01.inp',)
>>> /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/parallel_rng_types_unittest
parallel_rng_types_unittest - OK ( 2.04 sec)
<<< /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/parallel_rng_types_unittest (4 of 362) done in 2.04 sec
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp']
('h2o_f21.inp',)
>>> /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/grid_unittest
grid_unittest - OK ( 2.53 sec)
<<< /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/grid_unittest (5 of 362) done in 2.53 sec
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp']
('h2o_dip12.inp',)
>>> /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/libcp2k_unittest
libcp2k_unittest - OK ( 19.03 sec)
<<< /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/libcp2k_unittest (6 of 362) done in 19.03 sec
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp']
('RPA_SIGMA_H2O_minimax.inp',)
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp']
('RPA_SIGMA_H_minimax.inp',)
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp']
('H2O_pao_exp.inp',)
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp']
('RPA_SIGMA_H_clenshaw.inp',)
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp']
('RPA_SIGMA_H2O_minimax_NUM_INTEG_GROUPS.inp',)
Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp']
('H2O-5.inp',)
>>> /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/QS/regtest-rpa-sigma
RPA_SIGMA_H2O_clenshaw.inp -17.19226814 OK ( 83.42 sec)
RPA_SIGMA_H2O_minimax.inp -17.18984039 OK ( 83.59 sec)
RPA_SIGMA_H_minimax.inp -0.5150377917 OK ( 63.64 sec)
RPA_SIGMA_H_clenshaw.inp -0.5150909069 OK ( 65.65 sec)
RPA_SIGMA_H2O_minimax_NUM_INTEG_GROUPS.inp -17.18984039 OK ( 86.54 sec)
<<< /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/QS/regtest-rpa-sigma (7 of 362) done in 382.84 sec
```