[CP2K:3487] cuda_tools in CP2K
Urban Borštnik
urban.b... at gmail.com
Thu Sep 8 08:54:53 UTC 2011
Dear Wei,
the GPU/CUDA support is still in early development and as such has quite
a few limitations, performance bottlenecks, and possibly bugs; I have not
thoroughly tested it.
Regarding the out-of-memory error: I believe the (__CUDAPW & __FFTCU &
__FFTSGL) options are currently incompatible with the __DBCSR_CUDA
option (this is due to different approaches to memory allocation on the
card). You will probably have to choose one or the other.
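For example (a sketch only, based on the arch file you quote below; the
exact flag set depends on your build), a DBCSR-only GPU build would use

  DFLAGS = -D__INTEL -D__FFTSG -D__parallel -D__SCALAPACK -D__BLACS \
           -D__DBCSR_CUDA

whereas a GPU-FFT build would drop -D__DBCSR_CUDA and instead add
-D__CUDAPW, -D__FFTCU, and -D__FFTSGL; just do not enable both sets at
once.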
Regarding parallelism (__DBCSR_CUDA) on a node (i.e., computer):
Currently the only supported configuration is 1 process (popt, no
threads!) using 1 GPU per node. While processes can share the GPU,
there will be no performance gain from the CUDA part, though of course
the non-GPU parts will be faster.
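Until then, the way to use the GPUs across your cluster is one MPI
process per node on several nodes. As a sketch (assuming your Intel MPI
mpirun accepts the -ppn processes-per-node option; check your MPI
documentation):

  mpirun -np 4 -ppn 1 ./cp2k.popt test.inp

This places 1 process on each of 4 nodes, so each process has a GPU to
itself (the second GPU on each node stays idle for now).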
Support for 1 process with multiple threads is forthcoming, followed by
support for multiple GPUs in a box, each controlled by one MPI
process; these two developments should solve your problem.
Having one process control multiple GPUs is currently not planned.
Sincerely,
Urban
On Tue, 2011-09-06 at 14:59 -0700, Wei wrote:
> Dear all,
>
> I am interested in the cuda_tools in CP2K. I have compiled the recent
> CP2K (Version 2.2.320) with CUDA 4.0, the Intel compiler 12, the MKL
> shipped with it, and Intel MPI (modified from
> Linux-x86-64-dbcsr-cuda.popt; see the arch file at the end).
>
> If I run with "./cp2k.popt test.inp", it is OK for about 100 atoms
> (Sb, Te) or fewer, but it gives "CUDA Error: out of memory" when the
> system exceeds 120 atoms (is this normal, given that each GPU has 6 GB
> of device memory?).
>
> So I wonder how I can run it in parallel. Currently I cannot run it
> with "mpirun -np 2 ./cp2k.popt test.inp", because it gives the "out of
> memory" error at once.
>
> CUDA Error: out of memory
> ASSERTION FAILED: 1.EQ. 0
>
> stack:
> error in dev_mem_alloc_i at line 35 with error type -1
> message: Could not allocate GPU device memory
> 6 error in dev_mem_alloc_i at line 35
> 5 called from dev_mem_alloc_any
> 4 called from init_card_c
> 3 called from dbcsr_multrec_init
> 2 called from dbcsr_mult_m_e_e
> 1 called from dbcsr_multiply_anytype
>
>
> Where can I get more information about these cuda_tools? Can this
> "popt" version utilize resources across nodes as in the normal case?
> We have 2 GPUs (NVIDIA Quadro 6000, Fermi) and 2 six-core CPUs on each
> node; how can I get the best performance out of this setup, e.g., by
> assigning the job to several nodes with several MPI ranks controlling
> the two GPUs on each node?
>
> Thanks a lot in advance!
>
>
> NVCC = nvcc
> NVFLAGS = $(DFLAGS) -g -arch sm_20
>
> CC = mpiicc
> CPP =
> FC = mpiifort
> LD = $(FC)
> AR = ar -r
> CPPFLAGS =
> DFLAGS = -D__INTEL -D__FFTSG -D__parallel -D__SCALAPACK -D__BLACS \
>          -D__DBCSR_CUDA
> INTEL_INC= /opt/intel/Compiler/12.0/4.191/rwthlnk/mkl/include
> MKLPATH = /opt/intel/Compiler/12.0/4.191/rwthlnk/mkl/lib/intel64
> FCFLAGS = $(DFLAGS) -I$(INTEL_INC) -O3 -msse2 -heap-arrays 64 \
>           -funroll-loops -fpp -free
> LDFLAGS = $(FCFLAGS)
> CUDAPATH = /usr/local_rwth/sw/cuda/4.0.17/lib64
> LIBS = $(CUDAPATH)/libcudart.so $(CUDAPATH)/libcufft.so \
>        $(CUDAPATH)/libcublas.so \
>        $(MKLPATH)/libmkl_scalapack_lp64.a $(MKLPATH)/libmkl_solver_lp64.a \
>        -Wl,--start-group $(MKLPATH)/libmkl_intel_lp64.a \
>        $(MKLPATH)/libmkl_sequential.a $(MKLPATH)/libmkl_core.a \
>        $(MKLPATH)/libmkl_blacs_intelmpi_lp64.a -Wl,--end-group -lpthread
>
> OBJECTS_ARCHITECTURE = machine_intel.o
>
>
> Best regards,
>
> Wei
>
> ---------------------------------------------------------
> Wei ZHANG
> PhD student
> Institute for Theoretical Solid State Physics
> RWTH Aachen University, Germany
>