[CP2K:3487] cuda_tools in CP2K

Urban Borštnik urban.b... at gmail.com
Thu Sep 8 10:54:53 CEST 2011


Dear Wei,

the GPU/Cuda support is still in early development and as such has quite
a few limitations, performance bottlenecks and possibly bugs--I have not
thoroughly tested it.

Regarding the out-of-memory: I believe the (__CUDAPW & __FFTCU &
__FFTSGL) options are currently incompatible with the __DBCSR_CUDA
option (this is due to different approaches to memory allocation on the
card).  You will probably have to choose one or the other.
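
As a quick sanity check before building, something along these lines
could flag the combination. This is only a sketch: check_cuda_flags is
a made-up helper, not part of CP2K, and it simply greps the arch file
for the -D options named above.

```shell
#!/bin/sh
# Hypothetical helper (not part of CP2K): report whether an arch file
# enables both GPU code paths at once.
# Prints "ok" and returns 0 if only one group of options is present,
# prints a warning and returns 1 if both groups are present.
check_cuda_flags() {
    arch_file="$1"
    has_dbcsr=0
    has_pw=0
    # "--" ends grep's option parsing, since the patterns start with "-".
    if grep -Eq -- '-D__DBCSR_CUDA([[:space:]]|$)' "$arch_file"; then
        has_dbcsr=1
    fi
    if grep -Eq -- '-D__(CUDAPW|FFTCU|FFTSGL)([[:space:]]|$)' "$arch_file"; then
        has_pw=1
    fi
    if [ "$has_dbcsr" -eq 1 ] && [ "$has_pw" -eq 1 ]; then
        echo "conflict: __DBCSR_CUDA combined with __CUDAPW/__FFTCU/__FFTSGL"
        return 1
    fi
    echo "ok"
    return 0
}
```

Assumed usage: check_cuda_flags followed by the path to your arch file.
Note that plain -D__FFTSG (without the trailing L) is not matched.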

Regarding parallelism (__DBCSR_CUDA) on a node (i.e., computer):

Currently the only supported configuration is 1 process (popt, no
threads!) using 1 GPU per node.  While processes can share the GPU,
there will be no performance gain from the CUDA part, though of course
the non-GPU parts will be faster.
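
In practice "1 process per node" means the number of MPI processes
equals the number of distinct nodes in the job. A minimal sketch of how
one might derive that number, assuming a host file with one host name
per line (count_nodes and hosts.txt are illustrative names, not
anything CP2K provides):

```shell
#!/bin/sh
# Hypothetical helper: count the distinct nodes in a host file (one host
# name per line, possibly repeated) so that exactly one MPI process can
# be started per node.
count_nodes() {
    # Drop blank lines, keep unique host names, count them, and strip
    # the padding some wc implementations add.
    grep -v '^[[:space:]]*$' "$1" | sort -u | wc -l | tr -d ' '
}

# Assumed usage. The option that places one process per node depends on
# the MPI launcher; Intel MPI's Hydra launcher accepts "-ppn 1" as far
# as I know, so check your launcher's documentation:
# mpirun -np "$(count_nodes hosts.txt)" -ppn 1 ./cp2k.popt test.inp
```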

Support for 1 process with multiple threads is forthcoming, followed by
support for multiple GPUs in a box, each controlled by one MPI
process.  These two developments should solve your problem.
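
To illustrate what "one MPI process per GPU" would mean: each process
picks a device from its rank within the node. The following is a
hypothetical sketch of that mapping, not CP2K code; gpu_for_rank is a
made-up name, and the local rank is passed in explicitly because the
environment variable that holds it differs between MPI libraries.

```shell
#!/bin/sh
# Hypothetical sketch, not CP2K code: choose a GPU index for an MPI
# process from its rank on the local node, wrapping around if there are
# more processes than GPUs.
gpu_for_rank() {
    local_rank="$1"   # rank of this process within its node (0-based)
    n_gpus="$2"       # number of GPUs in the node
    echo $(( local_rank % n_gpus ))
}

# Assumed usage inside a per-process wrapper script. CUDA_VISIBLE_DEVICES
# restricts which devices the CUDA runtime exposes to a process; whether
# this helps with the current CP2K code is untested.
# CUDA_VISIBLE_DEVICES=$(gpu_for_rank "$LOCAL_RANK" 2) ./cp2k.popt test.inp
```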

Having one process control multiple GPUs is currently not planned.

Sincerely,
Urban

On Tue, 2011-09-06 at 14:59 -0700, Wei wrote:
> Dear all,
> 
> I am interested in the cuda_tools in cp2k. I have compiled the recent
> cp2k (Version 2.2.320) with cuda4.0, intel compiler 12, intelmkl
> inside the package, and intelmpi (modification based on
> Linux-x86-64-dbcsr-cuda.popt, see it at the end).
> 
> If I run with "./cp2k.popt test.inp", it is ok for about 100 atoms
> (Sb,Te) or less, but it gives "CUDA Error: out of memory" when the
> system exceeds 120 atoms (is this normal? as each GPU has 6 GB device
> memory).
> 
> So I wonder how I can run it in parallel. Now I cannot run it with
> "mpirun -np 2 ./cp2k.popt test.inp", because it gives the "out of
> memory" problem at once.
> 
> CUDA Error: out of memory
>  ASSERTION FAILED:         1.EQ.        0
> 
>   stack:
>   error in dev_mem_alloc_i at line    35 with error type  -1
>   message: Could not allocate GPU device memory
>     6 error in dev_mem_alloc_i at line    35
>     5 called from dev_mem_alloc_any
>     4 called from init_card_c
>     3 called from dbcsr_multrec_init
>     2 called from dbcsr_mult_m_e_e
>     1 called from dbcsr_multiply_anytype
> 
> 
> Where can I get more information about this cuda_tools? Can this
> "popt" version utilize the resources between nodes like the normal
> case? As we have 2 GPUs (NVIDIA Quadro 6000 (Fermi)) and 2 6-core CPUs
> on each node, how can I get the best performance out of it? For
> example, by assigning the job to several nodes with several MPI
> processes controlling the two GPUs on each node? How?
> 
> Thanks a lot in advance!
> 
> 
> NVCC     = nvcc
> NVFLAGS  = $(DFLAGS) -g -arch sm_20
> 
> CC       = mpiicc
> CPP      =
> FC       = mpiifort
> LD       = $(FC)
> AR       = ar -r
> CPPFLAGS =
> DFLAGS   = -D__INTEL -D__FFTSG  -D__parallel -D__SCALAPACK -D__BLACS -D__DBCSR_CUDA
> INTEL_INC= /opt/intel/Compiler/12.0/4.191/rwthlnk/mkl/include
> MKLPATH  = /opt/intel/Compiler/12.0/4.191/rwthlnk/mkl/lib/intel64
> FCFLAGS  = $(DFLAGS) -I$(INTEL_INC) -O3 -msse2 -heap-arrays 64 -funroll-loops -fpp -free
> LDFLAGS  = $(FCFLAGS)
> CUDAPATH = /usr/local_rwth/sw/cuda/4.0.17/lib64
> LIBS     = $(CUDAPATH)/libcudart.so $(CUDAPATH)/libcufft.so \
>            $(CUDAPATH)/libcublas.so \
>            $(MKLPATH)/libmkl_scalapack_lp64.a $(MKLPATH)/libmkl_solver_lp64.a \
>            -Wl,--start-group \
>            $(MKLPATH)/libmkl_intel_lp64.a $(MKLPATH)/libmkl_sequential.a \
>            $(MKLPATH)/libmkl_core.a $(MKLPATH)/libmkl_blacs_intelmpi_lp64.a \
>            -Wl,--end-group -lpthread
> 
> OBJECTS_ARCHITECTURE = machine_intel.o
> 
> 
> Best regards,
> 
> Wei
> 
> ---------------------------------------------------------
> Wei ZHANG
> PhD student
> Institute for Theoretical Solid State Physics
> RWTH Aachen University, Germany
> 




