[CP2K-user] CUDA RUNTIME API error: EventRecord failed with error cudaErrorInvalidResourceHandle

singlebook chenw... at gmail.com
Thu Feb 4 08:34:52 UTC 2021


Thanks for your reply! Yes, I am using PSMP. I will recompile CP2K without 
COSMA and give you feedback later.

On Thursday, February 4, 2021 at 2:41:48 PM UTC+8 Alfio Lazzaro wrote:

> The multi-gpu support is still not stable.
> The error message is inside COSMA.
> Could you remove this library from your installation of CP2K? I assume you 
> are using the toolchain, so just use --with-cosma=no
>
> I also assume you are using the PSMP version of CP2K (the only way to use 
> multiple GPUs). Could you confirm? Note that there must be at least one 
> rank attached to each GPU, e.g. for 12 GPUs you need at least 12 ranks 
> (or a multiple of 12).
>
> Alfio
>
> On Thursday, February 4, 2021 at 02:20:50 UTC+1, singlebook wrote:
>
>>
>> Hello, All
>>
>> I have just installed CP2K v8.1 on my workstation, which has 12 NVIDIA 
>> K80 GPUs. The compiler is GCC 6.5 and the CUDA version is 10.0.
>>
>> I want to perform AIMD for SiC, but when I use more than 6 GPUs, it 
>> always gives me this error:
>>
>> CUDA RUNTIME API error: EventRecord failed with error cudaErrorInvalidResourceHandle
>> error: GPU API call : invalid resource handle
>> terminate called after throwing an instance of 'std::runtime_error'
>>   what():  GPU ERROR
>>
>> Program received signal SIGABRT: Process abort signal.
>>
>> Backtrace for this error:
>> #0  0x7fc42ccc626f in ???
>> #1  0x7fc42ccc61f7 in ???
>> #2  0x7fc42ccc78e7 in ???
>> #3  0x7fc43d68193c in _ZN9__gnu_cxx27__verbose_terminate_handlerEv
>>     at ../../../../gcc-6.5.0/libstdc++-v3/libsupc++/vterminate.cc:95
>> #4  0x7fc43d67f905 in _ZN10__cxxabiv111__terminateEPFvvE
>>     at ../../../../gcc-6.5.0/libstdc++-v3/libsupc++/eh_terminate.cc:47
>> #5  0x7fc43d67f950 in _ZSt9terminatev
>>     at ../../../../gcc-6.5.0/libstdc++-v3/libsupc++/eh_terminate.cc:57
>> #6  0x7fc43d67fb68 in __cxa_throw
>>     at ../../../../gcc-6.5.0/libstdc++-v3/libsupc++/eh_throw.cc:87
>> #7  0x2b12c82 in check_runtime_status
>>     at /local/src/cp2k-8.1/tools/toolchain/build/cosma-2.2.0/libs/Tiled-MM/src/Tiled-MM/util.hpp:17
>> #8  0x2b12c82 in _ZNK3gpu13device_stream13enqueue_eventEv
>>     at /local/src/cp2k-8.1/tools/toolchain/build/cosma-2.2.0/libs/Tiled-MM/src/Tiled-MM/device_stream.hpp:62
>> #9  0x2b12c82 in _ZN3gpu11round_robinIdEEvRNS_12tiled_matrixIT_EES4_S4_RNS_13device_bufferIS2_EES7_S7_iiiS2_S2_RNS_9mm_handleIS2_EE
>>     at /local/src/cp2k-8.1/tools/toolchain/build/cosma-2.2.0/libs/Tiled-MM/src/Tiled-MM/tiled_mm.cpp:248
>> #10  0x2b1351c in _ZN3gpu4gemmIdEEvRNS_9mm_handleIT_EEPS2_S5_S5_iiiS2_S2_b
>>     at /local/src/cp2k-8.1/tools/toolchain/build/cosma-2.2.0/libs/Tiled-MM/src/Tiled-MM/tiled_mm.cpp:341
>> #11  0x2adfdee in _ZN5cosma14local_multiplyIdEEvPN3gpu9mm_handleIT_EEPS3_S6_S6_iiiS3_S3_
>>     at /local/src/cp2k-8.1/tools/toolchain/build/cosma-2.2.0/src/cosma/local_multiply.cpp:86
>> #12  0x2ac8fb3 in _ZN5cosma8multiplyIdEEvPNS_13cosma_contextIT_EERNS_11CosmaMatrixIS2_EES7_S7_RNS_8IntervalES9_S9_S9_mRKNS_8StrategyERNS_12communicatorES2_S2_
>>     at /local/src/cp2k-8.1/tools/toolchain/build/cosma-2.2.0/src/cosma/multiply.cpp:355
>> #13  0x2ac9c26 in _ZN5cosma8multiplyIdEEvPNS_13cosma_contextIT_EERNS_11CosmaMatrixIS2_EES7_S7_RKNS_8StrategyEiS2_S2_
>>     at /local/src/cp2k-8.1/tools/toolchain/build/cosma-2.2.0/src/cosma/multiply.cpp:272
>> #14  0x2a9fc5d in ???
>> #15  0x250cd5c in __cp_fm_basic_linalg_MOD_cp_fm_gemm
>>     at /local/src/cp2k-8.1/src/fm/cp_fm_basic_linalg.F:446
>> #16  0xcd8744 in __cp_gemm_interface_MOD_cp_gemm
>>     at /local/src/cp2k-8.1/src/cp_gemm_interface.F:138
>> #17  0x10c794b in __qs_wf_history_methods_MOD_wfi_extrapolate
>>     at /local/src/cp2k-8.1/src/qs_wf_history_methods.F:912
>> #18  0x17a5b53 in scf_env_initial_rho_setup
>>     at /local/src/cp2k-8.1/src/qs_scf_initialization.F:1122
>> #19  0x17a5b53 in init_scf_run
>>     at /local/src/cp2k-8.1/src/qs_scf_initialization.F:1047
>> #20  0x17a79b5 in __qs_scf_initialization_MOD_qs_scf_env_initialize
>>     at /local/src/cp2k-8.1/src/qs_scf_initialization.F:182
>> #21  0xf1e341 in __qs_scf_MOD_scf
>>     at /local/src/cp2k-8.1/src/qs_scf.F:222
>> #22  0xc0e966 in __qs_energy_MOD_qs_energies
>>     at /local/src/cp2k-8.1/src/qs_energy.F:88
>> #23  0x1979f13 in qs_forces
>>     at /local/src/cp2k-8.1/src/qs_force.F:209
>> #24  0x197dc87 in __qs_force_MOD_qs_calc_energy_force
>>     at /local/src/cp2k-8.1/src/qs_force.F:114
>> #25  0x112bfe5 in __force_env_methods_MOD_force_env_calc_energy_force
>>     at /local/src/cp2k-8.1/src/force_env_methods.F:271
>> #26  0x797c55 in __integrator_MOD_nvt
>>     at /local/src/cp2k-8.1/src/motion/integrator.F:1103
>> #27  0x78ddca in __velocity_verlet_control_MOD_velocity_verlet
>>     at /local/src/cp2k-8.1/src/motion/velocity_verlet_control.F:77
>> #28  0x6c1695 in qs_mol_dyn_low
>>     at /local/src/cp2k-8.1/src/motion/md_run.F:481
>> #29  0x6c209a in __md_run_MOD_qs_mol_dyn
>>     at /local/src/cp2k-8.1/src/motion/md_run.F:153
>> #30  0x5536ae in cp2k_run
>>     at /local/src/cp2k-8.1/src/start/cp2k_runs.F:378
>> #31  0x556764 in __cp2k_runs_MOD_run_input
>>     at /local/src/cp2k-8.1/src/start/cp2k_runs.F:983
>> #32  0x534a31 in cp2k
>>     at /local/src/cp2k-8.1/src/start/cp2k.F:337
>> #33  0x4ec1cc in main
>>     at /local/src/cp2k-8.1/src/start/cp2k.F:44
>>
>> ===================================================================================
>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> =   PID 14969 RUNNING AT k172
>> =   EXIT CODE: 134
>> =   CLEANING UP REMAINING PROCESSES
>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>> ===================================================================================
>> YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
>> This typically refers to a problem with your application.
>> Please see the FAQ page for debugging suggestions
>>
>> The CPU version of CP2K runs without problems, and classical MD for the 
>> argon.inp exercise also ran smoothly with 12 GPUs.
>>
>> Your response is highly appreciated.
>>
>> Best wishes,
>>
>> Wei
>>
>
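
The rule quoted above (at least one MPI rank per GPU, ideally a multiple of
the GPU count) can be checked before launching a job. The sketch below is
only an illustration: the function names are hypothetical, and the
round-robin mapping (rank modulo GPU count) is an assumption that may differ
from how CP2K actually assigns ranks to devices.

```python
from collections import Counter


def ranks_per_gpu(n_ranks: int, n_gpus: int) -> Counter:
    """Count ranks per GPU under an assumed round-robin mapping (rank % n_gpus)."""
    if n_ranks < 1 or n_gpus < 1:
        raise ValueError("need at least one rank and one GPU")
    return Counter(rank % n_gpus for rank in range(n_ranks))


def check_layout(n_ranks: int, n_gpus: int) -> str:
    """Describe whether the rank count satisfies the one-rank-per-GPU rule."""
    counts = ranks_per_gpu(n_ranks, n_gpus)
    # GPUs that receive no rank at all under the assumed mapping.
    idle = [gpu for gpu in range(n_gpus) if counts[gpu] == 0]
    if idle:
        return f"too few ranks: {len(idle)} of {n_gpus} GPUs would have no rank"
    if n_ranks % n_gpus != 0:
        return "every GPU has a rank, but the load is uneven"
    return f"ok: {n_ranks // n_gpus} rank(s) per GPU"


if __name__ == "__main__":
    # 12 GPUs as in the workstation described in the original question.
    for n_ranks in (6, 12, 18, 24):
        print(n_ranks, "ranks on 12 GPUs ->", check_layout(n_ranks, 12))
```

For the 12-GPU workstation in this thread, the check passes only for 12, 24,
36, ... ranks; it says nothing about the COSMA error itself.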