[CP2K-user] [CP2K:15200] Re: Does CP2K allow a multi-GPU run?

Lenard Carroll lenardc... at gmail.com
Thu Apr 22 17:35:55 UTC 2021


Oh, you meant the error file. Please find it attached.

I have run on CPU only and with one GPU. It works.
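For reference, the launch arithmetic the quoted thread below converges on (one MPI rank per GPU, with the node's CPUs split evenly across the ranks) can be sketched in a few lines of shell. The numbers 40 and 4 come from the PBS request quoted further down; the division itself is generic:

```shell
# Thread/rank arithmetic for a node with ncpus=40 and ngpus=4 (the
# resource request in the PBS script quoted below): one MPI rank per
# GPU, and the CPUs divided evenly among the ranks.
NCPUS=40
NGPUS=4
RANKS=$NGPUS                                  # one MPI rank per GPU
export OMP_NUM_THREADS=$(( NCPUS / NGPUS ))   # 40 / 4 = 10 threads per rank
echo "ranks=$RANKS threads_per_rank=$OMP_NUM_THREADS"
# The matching launch line would then be:
#   mpiexec -n "$RANKS" cp2k.psmp -i gold.inp -o gold_pbc.out
```

This prints `ranks=4 threads_per_rank=10`, matching the `mpiexec -n 4` / `OMP_NUM_THREADS=10` combination suggested in the thread.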

On Thu, Apr 22, 2021 at 7:31 PM Alfio Lazzaro <alfio.... at gmail.com>
wrote:

> I'm sorry, I cannot assist you; I'm not an expert on how to use CP2K (I'm
> not a domain scientist). Without the full log, I cannot help you...
> I assume you should have a log file from PBS where you can see the error
> message. My guess is that it is a memory limit.
> Have you executed on a CPU only?
>
>
>
> Il giorno giovedì 22 aprile 2021 alle 17:45:06 UTC+2 ASSIDUO Network ha
> scritto:
>
>> Here's the log file. The job ended prematurely.
>>
>> On Thu, Apr 22, 2021 at 3:23 PM Lenard Carroll <len... at gmail.com>
>> wrote:
>>
>>> Not sure yet. The job is still in the queue. As soon as it is finished
>>> I'll post the log file info here.
>>>
>>> On Thu, Apr 22, 2021 at 3:15 PM Alfio Lazzaro <al... at gmail.com>
>>> wrote:
>>>
>>>> And it works? Check the output and the performance... It may be that
>>>> your particular test case doesn't use the GPU at all, so could you attach
>>>> the log (at least the final part of it)?
>>>>
>>>> Il giorno giovedì 22 aprile 2021 alle 13:42:16 UTC+2 ASSIDUO Network ha
>>>> scritto:
>>>>
>>>>> I am using 30 threads now over 3 GPUs, so I used:
>>>>>
>>>>> export OMP_NUM_THREADS=10
>>>>> mpiexec -n 3 cp2k.psmp -i gold50.inp -o gold50.out
>>>>>
>>>>>
>>>>> On Thu, Apr 22, 2021 at 1:34 PM Alfio Lazzaro <al... at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Wait, I see you have 32 threads in total, so you need to have 32/4 = 8
>>>>>> threads per rank.
>>>>>> Please change
>>>>>>
>>>>>> export OMP_NUM_THREADS=8
>>>>>>
>>>>>> Il giorno giovedì 22 aprile 2021 alle 13:27:59 UTC+2 ASSIDUO Network
>>>>>> ha scritto:
>>>>>>
>>>>>>> Shall do. I already set it up, but it's in a long queue.
>>>>>>>
>>>>>>> On Thu, Apr 22, 2021 at 1:22 PM Alfio Lazzaro <al... at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Could you try what I suggested:
>>>>>>>>
>>>>>>>> export OMP_NUM_THREADS=10
>>>>>>>> mpirun -np 4 ./cp2k.psmp -i gold.inp -o gold_pbc.out
>>>>>>>>
>>>>>>>> Please check the corresponding log.
>>>>>>>>
>>>>>>>> As I said above, you need an MPI rank per GPU, and you told us that
>>>>>>>> you have 4 GPUs, so you need 4 ranks (or a multiple of 4). With 10 you
>>>>>>>> get an imbalance.
>>>>>>>>
>>>>>>>>
>>>>>>>> Il giorno giovedì 22 aprile 2021 alle 10:17:27 UTC+2 ASSIDUO
>>>>>>>> Network ha scritto:
>>>>>>>>
>>>>>>>>> Correction, he told me to use:
>>>>>>>>>
>>>>>>>>> mpirun -np 10 cp2k.psmp -i gold.inp -o gold_pbc.out
>>>>>>>>>
>>>>>>>>> but it didn't run correctly.
>>>>>>>>>
>>>>>>>>> On Thu, Apr 22, 2021 at 9:51 AM Lenard Carroll <
>>>>>>>>> len... at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> He suggested I try out:
>>>>>>>>>> mpirun -n 10 cp2k.psmp -i gold.inp -o gold_pbc.out
>>>>>>>>>>
>>>>>>>>>> as he is hoping that will spread the 10 CPUs over the selected 4
>>>>>>>>>> GPUs.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 22, 2021 at 9:48 AM Alfio Lazzaro <
>>>>>>>>>> al... at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>> Your command to run CP2K doesn't mention MPI (mpirun, mpiexec,
>>>>>>>>>>> ...). Are you running with multiple ranks?
>>>>>>>>>>>
>>>>>>>>>>> You can check those lines in the output:
>>>>>>>>>>>
>>>>>>>>>>>  GLOBAL| Total number of message passing processes            32
>>>>>>>>>>>  GLOBAL| Number of threads for this process                    4
>>>>>>>>>>>
>>>>>>>>>>> And check your numbers.
>>>>>>>>>>> My guess is that you have 1 rank and 40 threads.
>>>>>>>>>>> To use 4 GPUs you need 4 ranks (and fewer threads per rank), i.e.
>>>>>>>>>>> something like
>>>>>>>>>>>
>>>>>>>>>>> export OMP_NUM_THREADS=10
>>>>>>>>>>> mpiexec -n 4 ./cp2k.psmp -i gold.inp -o gold_pbc.out
>>>>>>>>>>>
>>>>>>>>>>> Please check with your sysadmin on how to run with multiple MPI
>>>>>>>>>>> ranks.
>>>>>>>>>>>
>>>>>>>>>>> Hope it helps.
>>>>>>>>>>>
>>>>>>>>>>> Alfio
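[Editor's note: if the four ranks all end up selecting the same device, one common workaround on Open MPI clusters is a small wrapper that pins each local rank to its own GPU via `CUDA_VISIBLE_DEVICES`. This is not from the thread: the file name `gpu_bind.sh` is invented, and the sketch assumes Open MPI, which exports `OMPI_COMM_WORLD_LOCAL_RANK` to every process it launches.]

```shell
# Write a hypothetical wrapper (gpu_bind.sh is an invented name) that
# maps MPI local rank i to GPU i, then execs the real command.
cat > gpu_bind.sh <<'EOF'
#!/bin/sh
# Open MPI sets OMPI_COMM_WORLD_LOCAL_RANK for each launched process.
export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
exec "$@"
EOF
chmod +x gpu_bind.sh

# Simulate what Open MPI would do for local rank 2: the wrapped
# command should see only device 2.
OMPI_COMM_WORLD_LOCAL_RANK=2 ./gpu_bind.sh sh -c 'echo "$CUDA_VISIBLE_DEVICES"'
```

With the wrapper in place the launch would become `mpiexec -n 4 ./gpu_bind.sh cp2k.psmp -i gold.inp -o gold_pbc.out`; whether it is needed at all depends on how the site's CP2K build assigns devices to ranks.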
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Il giorno mercoledì 21 aprile 2021 alle 09:26:53 UTC+2 ASSIDUO
>>>>>>>>>>> Network ha scritto:
>>>>>>>>>>>
>>>>>>>>>>>> This is what my PBS file looks like:
>>>>>>>>>>>>
>>>>>>>>>>>> #!/bin/bash
>>>>>>>>>>>> #PBS -P <PROJECT>
>>>>>>>>>>>> #PBS -N <JOBNAME>
>>>>>>>>>>>> #PBS -l select=1:ncpus=40:ngpus=4
>>>>>>>>>>>> #PBS -l walltime=08:00:00
>>>>>>>>>>>> #PBS -q gpu_4
>>>>>>>>>>>> #PBS -m be
>>>>>>>>>>>> #PBS -M none
>>>>>>>>>>>>
>>>>>>>>>>>> module purge
>>>>>>>>>>>> module load chpc/cp2k/8.1.0/cuda10.1/openmpi-4.0.0/gcc-7.3.0
>>>>>>>>>>>> source $SETUP
>>>>>>>>>>>> cd $PBS_O_WORKDIR
>>>>>>>>>>>>
>>>>>>>>>>>> cp2k.psmp -i gold.inp -o gold_pbc.out
>>>>>>>>>>>>
>>>>>>>>>>>>
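[Editor's note: putting the thread's advice into the job script, a revised version might look like the sketch below. Everything except the last two lines is copied from the script quoted above; the change is launching with `OMP_NUM_THREADS=10` and `mpirun -np 4`. The exact launcher (`mpirun` vs `mpiexec`) and any binding flags are site-dependent, so check with the sysadmin.]

```shell
#!/bin/bash
#PBS -P <PROJECT>
#PBS -N <JOBNAME>
#PBS -l select=1:ncpus=40:ngpus=4
#PBS -l walltime=08:00:00
#PBS -q gpu_4
#PBS -m be
#PBS -M none

module purge
module load chpc/cp2k/8.1.0/cuda10.1/openmpi-4.0.0/gcc-7.3.0
source $SETUP
cd $PBS_O_WORKDIR

# One MPI rank per GPU (4 GPUs requested above), 40/4 = 10 threads per rank.
export OMP_NUM_THREADS=10
mpirun -np 4 cp2k.psmp -i gold.inp -o gold_pbc.out
```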
>>>>>>>>>>>> On Wed, Apr 21, 2021 at 9:22 AM Alfio Lazzaro <
>>>>>>>>>>>> al... at gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The way to use 4 GPUs per node is to use 4 MPI ranks. How many
>>>>>>>>>>>>> ranks are you using?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Il giorno martedì 20 aprile 2021 alle 19:44:15 UTC+2 ASSIDUO
>>>>>>>>>>>>> Network ha scritto:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm asking, since the administrator running my country's HPC
>>>>>>>>>>>>>> is saying that although I'm requesting access to 4 GPUs, CP2K is only using
>>>>>>>>>>>>>> 1. I checked the following output:
>>>>>>>>>>>>>>  DBCSR| ACC: Number of devices/node                         4
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And it shows that CP2K is picking up 4 GPUs.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tuesday, April 20, 2021 at 3:00:17 PM UTC+2 ASSIDUO
>>>>>>>>>>>>>> Network wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I currently have access to 4 GPUs to run an AIMD simulation,
>>>>>>>>>>>>>>> but only one of the GPUs is being used. Is there a way to use the other 3,
>>>>>>>>>>>>>>> and if so, can you tell me how to set it up with a PBS job?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>>>>> Google Groups "cp2k" group.
>>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails from
>>>>>>>>>>>>> it, send an email to cp... at googlegroups.com.
>>>>>>>>>>>>> To view this discussion on the web visit
>>>>>>>>>>>>> https://groups.google.com/d/msgid/cp2k/70ba0fce-8636-4b75-940d-133ce4dbf0can%40googlegroups.com
>>>>>>>>>>>>> <https://groups.google.com/d/msgid/cp2k/70ba0fce-8636-4b75-940d-133ce4dbf0can%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>>>>> .
>>>>>>>>>>>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.cp2k.org/archives/cp2k-user/attachments/20210422/87bdecf8/attachment.htm>
-------------- next part --------------
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:              gpu4002
  Local adapter:           mlx5_0
  Local port:              1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   gpu4002
  Local device: mlx5_0
--------------------------------------------------------------------------
[gpu4002:383094] 2 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[gpu4002:383094] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[gpu4002:383094] 2 more processes have sent help message help-mpi-btl-openib.txt / error in device init
error: GPU API call : invalid resource handle
terminate called after throwing an instance of 'std::runtime_error'
  what():  GPU ERROR

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x7fffd8ec324f in ???
#1  0x7fffd8ec31d7 in ???
#2  0x7fffd8ec48c7 in ???
#3  0x7fffe9813154 in _ZN9__gnu_cxx27__verbose_terminate_handlerEv
	at ../../.././libstdc++-v3/libsupc++/vterminate.cc:95
#4  0x7fffe9810f15 in _ZN10__cxxabiv111__terminateEPFvvE
	at ../../.././libstdc++-v3/libsupc++/eh_terminate.cc:47
#5  0x7fffe9810f60 in _ZSt9terminatev
	at ../../.././libstdc++-v3/libsupc++/eh_terminate.cc:57
#6  0x7fffe98111a3 in __cxa_throw
	at ../../.././libstdc++-v3/libsupc++/eh_throw.cc:93
#7  0x2dfb43b in check_runtime_status
	at /apps/chpc/chem/gpu/cp2k/8.1.0/tools/toolchain/build/cosma-2.2.0/libs/Tiled-MM/src/Tiled-MM/util.hpp:17
#8  0x2dfb43b in _ZNK3gpu13device_stream13enqueue_eventEv
	at /apps/chpc/chem/gpu/cp2k/8.1.0/tools/toolchain/build/cosma-2.2.0/libs/Tiled-MM/src/Tiled-MM/device_stream.hpp:62
#9  0x2dfb43b in _ZN3gpu11round_robinIdEEvRNS_12tiled_matrixIT_EES4_S4_RNS_13device_bufferIS2_EES7_S7_iiiS2_S2_RNS_9mm_handleIS2_EE
	at /apps/chpc/chem/gpu/cp2k/8.1.0/tools/toolchain/build/cosma-2.2.0/libs/Tiled-MM/src/Tiled-MM/tiled_mm.cpp:248
#10  0x2dfb99c in _ZN3gpu4gemmIdEEvRNS_9mm_handleIT_EEPS2_S5_S5_iiiS2_S2_b
	at /apps/chpc/chem/gpu/cp2k/8.1.0/tools/toolchain/build/cosma-2.2.0/libs/Tiled-MM/src/Tiled-MM/tiled_mm.cpp:341
#11  0x2dc62be in _ZN5cosma14local_multiplyIdEEvPN3gpu9mm_handleIT_EEPS3_S6_S6_iiiS3_S3_
	at /apps/chpc/chem/gpu/cp2k/8.1.0/tools/toolchain/build/cosma-2.2.0/src/cosma/local_multiply.cpp:86
#12  0x2dae250 in _ZN5cosma8multiplyIdEEvPNS_13cosma_contextIT_EERNS_11CosmaMatrixIS2_EES7_S7_RNS_8IntervalES9_S9_S9_mRKNS_8StrategyERNS_12communicatorES2_S2_
	at /apps/chpc/chem/gpu/cp2k/8.1.0/tools/toolchain/build/cosma-2.2.0/src/cosma/multiply.cpp:355
#13  0x2daef1d in _ZN5cosma8multiplyIdEEvPNS_13cosma_contextIT_EERNS_11CosmaMatrixIS2_EES7_S7_RKNS_8StrategyEP19ompi_communicator_tS2_S2_
	at /apps/chpc/chem/gpu/cp2k/8.1.0/tools/toolchain/build/cosma-2.2.0/src/cosma/multiply.cpp:272
#14  0x2d8122c in _ZN5cosma6pxgemmIdEEvcciiiT_PKS1_iiPKiS3_iiS5_S1_PS1_iiS5_
	at /apps/chpc/chem/gpu/cp2k/8.1.0/tools/toolchain/build/cosma-2.2.0/src/cosma/cosma_pxgemm.cpp:329
#15  0x2f12ecb in pdlaed1_
	at /apps/chpc/chem/gpu/cp2k/8.1.0/tools/toolchain/build/scalapack-2.1.0/SRC/pdlaed1.f:245
#16  0x2f12621 in pdlaed0_
	at /apps/chpc/chem/gpu/cp2k/8.1.0/tools/toolchain/build/scalapack-2.1.0/SRC/pdlaed0.f:215
#17  0x2eeee28 in pdstedc_
	at /apps/chpc/chem/gpu/cp2k/8.1.0/tools/toolchain/build/scalapack-2.1.0/SRC/pdstedc.f:246
#18  0x2e56eb9 in pzheevd_
	at /apps/chpc/chem/gpu/cp2k/8.1.0/tools/toolchain/build/scalapack-2.1.0/SRC/pzheevd.f:408
#19  0x252ddf2 in __cp_cfm_diag_MOD_cp_cfm_heevd
	at /apps/chpc/chem/gpu/cp2k/8.1.0/src/fm/cp_cfm_diag.F:123
#20  0x252fc6f in __cp_cfm_diag_MOD_cp_cfm_geeig
	at /apps/chpc/chem/gpu/cp2k/8.1.0/src/fm/cp_cfm_diag.F:174
#21  0x1b4e5ee in __qs_scf_diagonalization_MOD_do_general_diag_kp
	at /apps/chpc/chem/gpu/cp2k/8.1.0/src/qs_scf_diagonalization.F:531
#22  0x1010cf8 in __qs_scf_loop_utils_MOD_qs_scf_new_mos_kp
	at /apps/chpc/chem/gpu/cp2k/8.1.0/src/qs_scf_loop_utils.F:339
#23  0xff3a33 in __qs_scf_MOD_scf_env_do_scf
	at /apps/chpc/chem/gpu/cp2k/8.1.0/src/qs_scf.F:470
#24  0xfffe14 in __qs_scf_MOD_scf
	at /apps/chpc/chem/gpu/cp2k/8.1.0/src/qs_scf.F:234
#25  0xe2b846 in __qs_energy_MOD_qs_energies
	at /apps/chpc/chem/gpu/cp2k/8.1.0/src/qs_energy.F:88
#26  0xe4a491 in qs_forces
	at /apps/chpc/chem/gpu/cp2k/8.1.0/src/qs_force.F:209
#27  0xe4dc67 in __qs_force_MOD_qs_calc_energy_force
	at /apps/chpc/chem/gpu/cp2k/8.1.0/src/qs_force.F:114
#28  0xb3f3ef in __force_env_methods_MOD_force_env_calc_energy_force
	at /apps/chpc/chem/gpu/cp2k/8.1.0/src/force_env_methods.F:271
#29  0x6cc8f3 in qs_mol_dyn_low
	at /apps/chpc/chem/gpu/cp2k/8.1.0/src/motion/md_run.F:372
#30  0x6cd81b in __md_run_MOD_qs_mol_dyn
	at /apps/chpc/chem/gpu/cp2k/8.1.0/src/motion/md_run.F:153
#31  0x5db09e in cp2k_run
	at /apps/chpc/chem/gpu/cp2k/8.1.0/src/start/cp2k_runs.F:378
#32  0x5de824 in __cp2k_runs_MOD_run_input
	at /apps/chpc/chem/gpu/cp2k/8.1.0/src/start/cp2k_runs.F:983
#33  0x5d94b8 in cp2k
	at /apps/chpc/chem/gpu/cp2k/8.1.0/src/start/cp2k.F:337
#34  0x5196ac in main
	at /apps/chpc/chem/gpu/cp2k/8.1.0/src/start/cp2k.F:44
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 2 with PID 0 on node gpu4002 exited on signal 6 (Aborted).
--------------------------------------------------------------------------


More information about the CP2K-user mailing list