[CP2K-user] CUDA RUNTIME API error: EventRecord failed with error cudaErrorInvalidResourceHandle

Alfio Lazzaro alfio.... at gmail.com
Fri Feb 5 07:08:21 UTC 2021


Hello!
I assume that by "12 cpus" you mean 12 MPI ranks; could you confirm? How 
many threads per rank?
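The rank count matters because each GPU needs at least one rank attached to it. A minimal sketch of a round-robin rank-to-GPU mapping (the modulo rule here is an illustrative assumption, not code taken from CP2K/DBCSR):

```shell
# Illustrative round-robin mapping of MPI ranks to GPUs (an assumption
# for illustration, not CP2K/DBCSR source code): rank r attaches to
# GPU r % NGPUS, so 12 GPUs need at least 12 ranks to be fully used.
NGPUS=12
for rank in $(seq 0 11); do
  echo "rank $rank -> GPU $((rank % NGPUS))"
done
```

With fewer ranks than GPUs, some GPUs under this mapping would simply stay idle.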

First of all, keep in mind that multi-GPU support is still not well tested. 
That said, more GPUs do not mean faster execution if the code cannot 
exploit them. I see two possible explanations for your results:
1. the GPU-accelerated part of CP2K is DBCSR, and your benchmark likely 
does not use DBCSR much, so there is no speed-up. Judging from your CPU 
result, you seem to be bound by PDGEMMs, which is exactly where COSMA helps...
2. multiple GPUs can share the same PCIe link, in which case data movement 
becomes the bottleneck
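To check point 2 on your machine, one quick diagnostic (assuming the NVIDIA driver tools are installed) is the interconnect matrix printed by nvidia-smi:

```shell
# Print the GPU-to-GPU interconnect matrix; GPUs marked PIX/PXB sit
# behind the same PCIe switch, so host<->device transfers to them
# compete for the same bandwidth.
nvidia-smi topo -m
```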

One way to investigate would be for you to share the CP2K output files; I 
can take a look...
One more question: you said that it crashed for more than 6 GPUs. Do you 
have a run with 4 (or 6) GPUs and COSMA? If so, please share it as well.
One option is to build COSMA for CPU only and keep the GPUs for DBCSR. 
On the other hand, 6 GPUs with COSMA may already be enough to speed up 
the execution... 
For the rest, I suggest opening an issue on the COSMA page 
(https://github.com/eth-cscs/COSMA/issues) to understand why more than 6 
GPUs do not work (this is not strictly a CP2K issue).
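For reference, the COSMA-free rebuild and a matching launch could look like the following; the input/output file names and the binary path are placeholders for illustration, not taken from your setup:

```shell
# Rebuild the toolchain without COSMA (the flag from earlier in this thread).
cd cp2k-8.1/tools/toolchain
./install_cp2k_toolchain.sh --with-cosma=no

# 12 GPUs need at least 12 MPI ranks (one or more ranks per GPU);
# the PSMP binary is the one that supports multiple GPUs.
# Input/output names below are placeholders.
mpirun -np 12 ../../exe/local/cp2k.psmp -i sic.inp -o sic.out
```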

Alfio




Il giorno venerdì 5 febbraio 2021 alle 02:18:43 UTC+1 singlebook ha scritto:

>
> Hello!
>
> I removed COSMA from CP2K. It now works with multiple GPUs, but the speed 
> has not improved:
>
> 48 CPUs, no GPUs:  each SCF step takes 0.3 s (COSMA available in the CPU version)
> 48 CPUs, 12 GPUs:  each SCF step takes 1.8 s (COSMA not available)
> 12 CPUs, 12 GPUs:  each SCF step takes 1.2 s (COSMA not available)
>
> On Thursday, February 4, 2021 at 2:41:48 PM UTC+8 Alfio Lazzaro wrote:
>
>> Multi-GPU support is still not stable.
>> The error message comes from inside COSMA.
>> Could you remove this library from your CP2K installation? I assume 
>> you are using the toolchain, so just pass --with-cosma=no
>>
>> Then, I assume you are using the PSMP version of CP2K (the only way to 
>> use multiple GPUs). Could you confirm? Note that there must be a rank (or 
>> several ranks) attached to each GPU, e.g. for 12 GPUs you need at least 12 
>> ranks (or a multiple of 12).
>>
>> Alfio
>>
>> Il giorno giovedì 4 febbraio 2021 alle 02:20:50 UTC+1 singlebook ha 
>> scritto:
>>
>>>
>>> Hello, All
>>>
>>> I just installed CP2K v8.1 on my workstation, which has 12 NVIDIA K80 
>>> GPUs. The compilers are GCC 6.5 and CUDA 10.0.
>>>
>>> I want to perform AIMD for SiC, but whenever I use more than 6 GPUs it 
>>> gives me the following error:
>>>
>>> CUDA RUNTIME API error: EventRecord failed with error cudaErrorInvalidResourceHandle
>>> error: GPU API call : invalid resource handle
>>> terminate called after throwing an instance of 'std::runtime_error'
>>>   what():  GPU ERROR
>>>
>>> Program received signal SIGABRT: Process abort signal.
>>>
>>> Backtrace for this error:
>>> #0  0x7fc42ccc626f in ???
>>> #1  0x7fc42ccc61f7 in ???
>>> #2  0x7fc42ccc78e7 in ???
>>> #3  0x7fc43d68193c in _ZN9__gnu_cxx27__verbose_terminate_handlerEv
>>>     at ../../../../gcc-6.5.0/libstdc++-v3/libsupc++/vterminate.cc:95
>>> #4  0x7fc43d67f905 in _ZN10__cxxabiv111__terminateEPFvvE
>>>     at ../../../../gcc-6.5.0/libstdc++-v3/libsupc++/eh_terminate.cc:47
>>> #5  0x7fc43d67f950 in _ZSt9terminatev
>>>     at ../../../../gcc-6.5.0/libstdc++-v3/libsupc++/eh_terminate.cc:57
>>> #6  0x7fc43d67fb68 in __cxa_throw
>>>     at ../../../../gcc-6.5.0/libstdc++-v3/libsupc++/eh_throw.cc:87
>>> #7  0x2b12c82 in check_runtime_status
>>>     at /local/src/cp2k-8.1/tools/toolchain/build/cosma-2.2.0/libs/Tiled-MM/src/Tiled-MM/util.hpp:17
>>> #8  0x2b12c82 in _ZNK3gpu13device_stream13enqueue_eventEv
>>>     at /local/src/cp2k-8.1/tools/toolchain/build/cosma-2.2.0/libs/Tiled-MM/src/Tiled-MM/device_stream.hpp:62
>>> #9  0x2b12c82 in _ZN3gpu11round_robinIdEEvRNS_12tiled_matrixIT_EES4_S4_RNS_13device_bufferIS2_EES7_S7_iiiS2_S2_RNS_9mm_handleIS2_EE
>>>     at /local/src/cp2k-8.1/tools/toolchain/build/cosma-2.2.0/libs/Tiled-MM/src/Tiled-MM/tiled_mm.cpp:248
>>> #10 0x2b1351c in _ZN3gpu4gemmIdEEvRNS_9mm_handleIT_EEPS2_S5_S5_iiiS2_S2_b
>>>     at /local/src/cp2k-8.1/tools/toolchain/build/cosma-2.2.0/libs/Tiled-MM/src/Tiled-MM/tiled_mm.cpp:341
>>> #11 0x2adfdee in _ZN5cosma14local_multiplyIdEEvPN3gpu9mm_handleIT_EEPS3_S6_S6_iiiS3_S3_
>>>     at /local/src/cp2k-8.1/tools/toolchain/build/cosma-2.2.0/src/cosma/local_multiply.cpp:86
>>> #12 0x2ac8fb3 in _ZN5cosma8multiplyIdEEvPNS_13cosma_contextIT_EERNS_11CosmaMatrixIS2_EES7_S7_RNS_8IntervalES9_S9_S9_mRKNS_8StrategyERNS_12communicatorES2_S2_
>>>     at /local/src/cp2k-8.1/tools/toolchain/build/cosma-2.2.0/src/cosma/multiply.cpp:355
>>> #13 0x2ac9c26 in _ZN5cosma8multiplyIdEEvPNS_13cosma_contextIT_EERNS_11CosmaMatrixIS2_EES7_S7_RKNS_8StrategyEiS2_S2_
>>>     at /local/src/cp2k-8.1/tools/toolchain/build/cosma-2.2.0/src/cosma/multiply.cpp:272
>>> #14 0x2a9fc5d in ???
>>> #15 0x250cd5c in __cp_fm_basic_linalg_MOD_cp_fm_gemm
>>>     at /local/src/cp2k-8.1/src/fm/cp_fm_basic_linalg.F:446
>>> #16 0xcd8744 in __cp_gemm_interface_MOD_cp_gemm
>>>     at /local/src/cp2k-8.1/src/cp_gemm_interface.F:138
>>> #17 0x10c794b in __qs_wf_history_methods_MOD_wfi_extrapolate
>>>     at /local/src/cp2k-8.1/src/qs_wf_history_methods.F:912
>>> #18 0x17a5b53 in scf_env_initial_rho_setup
>>>     at /local/src/cp2k-8.1/src/qs_scf_initialization.F:1122
>>> #19 0x17a5b53 in init_scf_run
>>>     at /local/src/cp2k-8.1/src/qs_scf_initialization.F:1047
>>> #20 0x17a79b5 in __qs_scf_initialization_MOD_qs_scf_env_initialize
>>>     at /local/src/cp2k-8.1/src/qs_scf_initialization.F:182
>>> #21 0xf1e341 in __qs_scf_MOD_scf
>>>     at /local/src/cp2k-8.1/src/qs_scf.F:222
>>> #22 0xc0e966 in __qs_energy_MOD_qs_energies
>>>     at /local/src/cp2k-8.1/src/qs_energy.F:88
>>> #23 0x1979f13 in qs_forces
>>>     at /local/src/cp2k-8.1/src/qs_force.F:209
>>> #24 0x197dc87 in __qs_force_MOD_qs_calc_energy_force
>>>     at /local/src/cp2k-8.1/src/qs_force.F:114
>>> #25 0x112bfe5 in __force_env_methods_MOD_force_env_calc_energy_force
>>>     at /local/src/cp2k-8.1/src/force_env_methods.F:271
>>> #26 0x797c55 in __integrator_MOD_nvt
>>>     at /local/src/cp2k-8.1/src/motion/integrator.F:1103
>>> #27 0x78ddca in __velocity_verlet_control_MOD_velocity_verlet
>>>     at /local/src/cp2k-8.1/src/motion/velocity_verlet_control.F:77
>>> #28 0x6c1695 in qs_mol_dyn_low
>>>     at /local/src/cp2k-8.1/src/motion/md_run.F:481
>>> #29 0x6c209a in __md_run_MOD_qs_mol_dyn
>>>     at /local/src/cp2k-8.1/src/motion/md_run.F:153
>>> #30 0x5536ae in cp2k_run
>>>     at /local/src/cp2k-8.1/src/start/cp2k_runs.F:378
>>> #31 0x556764 in __cp2k_runs_MOD_run_input
>>>     at /local/src/cp2k-8.1/src/start/cp2k_runs.F:983
>>> #32 0x534a31 in cp2k
>>>     at /local/src/cp2k-8.1/src/start/cp2k.F:337
>>> #33 0x4ec1cc in main
>>>     at /local/src/cp2k-8.1/src/start/cp2k.F:44
>>>
>>> ===================================================================================
>>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>> =   PID 14969 RUNNING AT k172
>>> =   EXIT CODE: 134
>>> =   CLEANING UP REMAINING PROCESSES
>>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>> ===================================================================================
>>> YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
>>> This typically refers to a problem with your application.
>>> Please see the FAQ page for debugging suggestions
>>>
>>> The CPU-only version of CP2K runs without problems, and I also ran 
>>> classical MD for argon.inp from the exercises with 12 GPUs smoothly.
>>>
>>> Your response is highly appreciated.
>>>
>>> Best wishes,
>>>
>>> Wei
>>>
>>

