[CP2K-user] [CP2K:20760] Re: slow slurm regtests

Johann Pototschnig pototschnig.johann at gmail.com
Tue Oct 8 14:35:13 UTC 2024


Hi,

The combination of regtest with slurm leads to all workers being run on the 
same processors. (This of course leads to resource starvation if there are 
several workers.)
The tests are rather small, so more than 2 MPI processes are not useful and 
can lead to failing tests.

Regarding OpenMP threads, you can go a bit larger, but too many threads 
don't make sense either, as the tests are quite small.

Due to these limitations the tests take time, but this setup should still be 
a bit faster than yours, where the workers are fighting for resources. 
They should finish within 10 h.

I would suggest something like:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=8
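
For reference, your log below suggests how do_regtest.py sizes its worker 
pool: with --maxtasks 24, 2 MPI ranks and 2 OpenMP threads it launched 
6 workers, i.e. apparently maxtasks / (mpiranks * ompthreads). A minimal, 
untested sketch of how the flags could be kept consistent with the 
allocation above (variable names taken from your script; the worker 
formula is my assumption, not something I checked in the source):

```
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=8

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

# 2 ranks x 8 threads = 16 CPUs per test; setting --maxtasks to the total
# number of allocated CPUs keeps the worker pool inside the allocation.
"${CP2K_BASE_DIR}/tests/do_regtest.py" \
  --mpiranks ${SLURM_NTASKS} \
  --ompthreads ${OMP_NUM_THREADS} \
  --maxtasks $(( SLURM_NTASKS * SLURM_CPUS_PER_TASK )) \
  --workbasedir "${CP2K_TEST_DIR}" \
  "${CP2K_BASE_DIR}/exe/local" "${CP2K_VERSION}"
```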


The assignment of workers to different CPUs is not straightforward. 
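
If you want to see where the ranks of a worker actually end up, one quick 
check (a sketch; taskset and srun's --cpu-bind=verbose are standard tools, 
but I have not tried this on your machine) is to print each task's affinity 
mask inside the allocation:

```
# Each task prints its CPU affinity list; if all tasks report the same
# cores, the workers are indeed being pinned on top of each other.
srun --cpu-bind=verbose,cores -n 2 bash -c 'taskset -cp $$'
```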

best,
Johann


On Monday, October 7, 2024 at 3:11:45 PM UTC+2 bartosz mazur wrote:

> Hi all, 
>
> I am trying to run regtests using the slurm sbatch script, but what I am 
> observing is extremely slow execution. Looking at the job, I can see that 
> only 4 CPUs are being used (out of the 48 requested). It looks as if the 
> tests are run one after another, i.e. 2 MPI x 2 OMP = 4 CPUs. 
>
> I have already tried different `mpiexec` command settings and changed the 
> `srun` command, but this did not help. When using 4 nodes, the job still 
> runs on only 4 CPUs of a single node. I don't quite understand why the 
> system reports 2 GPUs when `nvidia-smi --query-gpu=gpu_name 
> --format=csv,noheader | wc -l` is called, so I modified do_regtest.py to 
> force 0 GPUs, but that didn't change anything either. The instructions at 
> https://www.cp2k.org/dev:regtesting#run_with_sbatch are out of date, so 
> maybe something else needs to be changed in the script?
>
> I would appreciate any help!
>
> Here is my sbatch script:
>
> ```
> #!/bin/bash -l
> #SBATCH --time=06:00:00
> #SBATCH --nodes=1
> #SBATCH --ntasks-per-node=24
> #SBATCH --cpus-per-task=2
> #SBATCH --ntasks-per-core=1
> #SBATCH --mem=180G
>
> set -o errexit
> set -o nounset
> set -o pipefail
>
> export MPICH_OFI_STARTUP_CONNECT=1
> export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
> # export OMP_PROC_BIND=close
> # export OMP_PLACES=cores
>
> module load intel/2022b
> module load GCC/12.2.0
>
> # Let the user see the currently loaded modules in the slurm log for completeness:
> module list
>
> CP2K_BASE_DIR="/lustre/pd01/hpc-kuchta-1716987452/software/cp2k"
> CP2K_TEST_DIR=${TMPDIR}
>
> CP2K_VERSION="psmp"
>
> NTASKS_SINGLE_TEST=2
> NNODES_SINGLE_TEST=1
> SRUN_CMD="srun --cpu-bind=verbose,cores"
>
> # to run tests across nodes (to check for communication effects), use:
> # NNODES_SINGLE_TEST=4
> # SRUN_CMD="srun --cpu-bind=verbose,cores --ntasks-per-node 2"
>
> # the following should be sufficiently generic:
>
> mkdir -p "${CP2K_TEST_DIR}"
> cd "${CP2K_TEST_DIR}"
>
> cp2k_rel_dir=$(realpath --relative-to="${CP2K_TEST_DIR}" "${CP2K_BASE_DIR}/exe/local")
> # srun does not like `-np`, override the complete command instead:
> export cp2k_run_prefix="${SRUN_CMD} -N ${NNODES_SINGLE_TEST} -n ${NTASKS_SINGLE_TEST}"
>
> "${CP2K_REGEST_SCRIPT_DIR:-${CP2K_BASE_DIR}/tests}/do_regtest.py" \
>   --mpiranks ${NTASKS_SINGLE_TEST} \
>   --ompthreads ${OMP_NUM_THREADS} \
>   --maxtasks ${SLURM_NTASKS} \
>   --num_gpus 0 \
>   --workbasedir "${CP2K_TEST_DIR}" \
>   --mpiexec "mpiexec -n {N}" \
>   --debug \
>   "${cp2k_rel_dir}" \
>   "${CP2K_VERSION}" \
>  |& tee "${CP2K_TEST_DIR}/${CP2K_ARCH}.${CP2K_VERSION}.log"
> ```
>
> and output after 1h of execution:
>
> ```
> Loading intel/2022b
>   Loading requirement: GCCcore/12.2.0 zlib/1.2.12-GCCcore-12.2.0
>     binutils/2.39-GCCcore-12.2.0 intel-compilers/2022.2.1
>     numactl/2.0.16-GCCcore-12.2.0 UCX/1.13.1-GCCcore-12.2.0
>     impi/2021.7.1-intel-compilers-2022.2.1 imkl/2022.2.1 iimpi/2022b
>     imkl-FFTW/2022.2.1-iimpi-2022b
> Currently Loaded Modulefiles:
>  1) GCCcore/12.2.0                  7) impi/2021.7.1-intel-compilers-2022.2.1
>  2) zlib/1.2.12-GCCcore-12.2.0      8) imkl/2022.2.1
>  3) binutils/2.39-GCCcore-12.2.0    9) iimpi/2022b
>  4) intel-compilers/2022.2.1       10) imkl-FFTW/2022.2.1-iimpi-2022b
>  5) numactl/2.0.16-GCCcore-12.2.0  11) intel/2022b
>  6) UCX/1.13.1-GCCcore-12.2.0      12) GCC/12.2.0
> *************************** Testing started ****************************
> Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp'] ('--version',)
>
> ----------------------------- Settings ---------------------------------
> MPI ranks:      2
> OpenMP threads: 2
> GPU devices:    2
> Workers:        6
> Timeout [s]:    400
> Work base dir:  /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41
> MPI exec:       mpiexec -n {N}
> Smoke test:     False
> Valgrind:       False
> Keepalive:      False
> Flag slow:      False
> Debug:          True
> Binary dir:     /lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local
> VERSION:        psmp
> Flags:          omp,libint,fftw3,libxc,libgrpp,pexsi,elpa,parallel,scalapack,mpi_f08,cosma,xsmm,plumed2,spglib,mkl,sirius,libvori,libbqb,libvdwxc,hdf5
> ------------------------------------------------------------------------
> Copying test files ... done
> Skipping UNIT/nequip_unittest because its requirements are not satisfied.
> Skipping TMC/regtest_ana_on_the_fly because its requirements are not satisfied.
> Skipping QS/regtest-cusolver because its requirements are not satisfied.
> Skipping QS/regtest-dlaf because its requirements are not satisfied.
> Skipping Fist/regtest-nequip because its requirements are not satisfied.
> Skipping Fist/regtest-allegro because its requirements are not satisfied.
> Skipping QS/regtest-dft-vdw-corr-4 because its requirements are not satisfied.
> Skipping Fist/regtest-deepmd because its requirements are not satisfied.
> Skipping Fist/regtest-quip because its requirements are not satisfied.
> Launched 362 test directories and 6 worker...
>
> Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/dbt_tas_unittest.psmp'] ('/lustre/pd01/hpc-kuchta-1716987452/software/cp2k',)
> Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/dbt_unittest.psmp'] ('/lustre/pd01/hpc-kuchta-1716987452/software/cp2k',)
> Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/grid_unittest.psmp'] ('/lustre/pd01/hpc-kuchta-1716987452/software/cp2k',)
> Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/libcp2k_unittest.psmp'] ('/lustre/pd01/hpc-kuchta-1716987452/software/cp2k',)
> Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/memory_utilities_unittest.psmp'] ('/lustre/pd01/hpc-kuchta-1716987452/software/cp2k',)
> Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/parallel_rng_types_unittest.psmp'] ('/lustre/pd01/hpc-kuchta-1716987452/software/cp2k',)
> Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp'] ('RPA_SIGMA_H2O_clenshaw.inp',)
> >>> /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/memory_utilities_unittest
>     memory_utilities_unittest                        -           OK (   0.29 sec)
> <<< /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/memory_utilities_unittest (1 of 362) done in 0.29 sec
> Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp'] ('H2O_ref.inp',)
> >>> /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/dbt_unittest
>     dbt_unittest                                     - RUNTIME FAIL (   1.61 sec)
> <<< /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/dbt_unittest (2 of 362) done in 1.61 sec
> Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp'] ('h2o_f01_coulomb_only.inp',)
> >>> /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/dbt_tas_unittest
>     dbt_tas_unittest                                 - RUNTIME FAIL (   1.84 sec)
> <<< /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/dbt_tas_unittest (3 of 362) done in 1.84 sec
> Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp'] ('test01.inp',)
> >>> /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/parallel_rng_types_unittest
>     parallel_rng_types_unittest                      -           OK (   2.04 sec)
> <<< /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/parallel_rng_types_unittest (4 of 362) done in 2.04 sec
> Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp'] ('h2o_f21.inp',)
> >>> /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/grid_unittest
>     grid_unittest                                    -           OK (   2.53 sec)
> <<< /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/grid_unittest (5 of 362) done in 2.53 sec
> Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp'] ('h2o_dip12.inp',)
> >>> /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/libcp2k_unittest
>     libcp2k_unittest                                 -           OK (  19.03 sec)
> <<< /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/UNIT/libcp2k_unittest (6 of 362) done in 19.03 sec
> Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp'] ('RPA_SIGMA_H2O_minimax.inp',)
> Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp'] ('RPA_SIGMA_H_minimax.inp',)
> Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp'] ('H2O_pao_exp.inp',)
> Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp'] ('RPA_SIGMA_H_clenshaw.inp',)
> Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp'] ('RPA_SIGMA_H2O_minimax_NUM_INTEG_GROUPS.inp',)
> Creating subprocess: ['mpiexec', '-n', '2', '/lustre/pd01/hpc-kuchta-1716987452/software/cp2k/exe/local/cp2k.psmp'] ('H2O-5.inp',)
> >>> /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/QS/regtest-rpa-sigma
>     RPA_SIGMA_H2O_clenshaw.inp                      -17.19226814           OK (  83.42 sec)
>     RPA_SIGMA_H2O_minimax.inp                       -17.18984039           OK (  83.59 sec)
>     RPA_SIGMA_H_minimax.inp                        -0.5150377917           OK (  63.64 sec)
>     RPA_SIGMA_H_clenshaw.inp                       -0.5150909069           OK (  65.65 sec)
>     RPA_SIGMA_H2O_minimax_NUM_INTEG_GROUPS.inp      -17.18984039           OK (  86.54 sec)
> <<< /lustre/tmp/slurm/3090305/TEST-psmp-2024-10-07_13-58-41/QS/regtest-rpa-sigma (7 of 362) done in 382.84 sec
> ```
>
