[CP2K:1789] Re: GPGPU status

Ondrej Marsalek ondrej.... at gmail.com
Tue Feb 24 21:57:53 UTC 2009


On Sun, Feb 22, 2009 at 01:00, Axel <akoh... at gmail.com> wrote:
> On Feb 21, 4:28 pm, Ondrej Marsalek <ondrej.... at gmail.com> wrote:
>> Hi again,
>> thanks a lot to everyone who commented (and stayed on topic).
>> Let me please review "out loud" what I have understood the current
>> situation to be.
>> 1) CUDA/GPGPU for the whole of CP2K (and therefore I assume any AIMD
>> code) will have to wait until the accelerators get decent performance in
>> double precision.
> not quite. there are two issues here. point one is that the single
> precision performance of GPUs is _much_ higher than the double
> precision performance, so those magical 20x, 30x, 100x speedups are
> only possible in single precision. but the programming model is also
> different: you have to download data to the graphics card's memory,
> process it, and get the results back. this pays off best if the
> compute parts have "bad" scaling, O(N**2) or worse, or if you can do
> a large part of the calculation on the GPU without having to move the
> data back. with that in mind, you can gain from BLAS only for level 3
> calls and only if the data sets are large (which collides with MPI
> parallelisation). point two is that you have to design code
> differently, as you only gain a lot if you can use a lot (hundreds!)
> of threads. the good news here is that thread creation is cheap, the
> bad news is that you have to create threads in groups.
> i would not expect a significant improvement in double precision
> performance relative to single precision performance in the
> foreseeable future.
> but i also would not say that one has to wait until double precision
> performance is better. a lot could be done right now. your statement
> applies in the strict sense only to gaining performance from linking
> to CUBLAS.
> there are already hartree fock codes that run in single precision, but
> you'll have to redesign a code from the ground up to support this
> well. given the size of the cp2k codebase that would be a gigantic
> undertaking... one could consider implementing a special purpose code
> for special applications, though, using the know-how and the
> algorithms in cp2k. it still is a big pile of work.

thanks for the insights.
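Axel's point about transfer overhead can be made concrete with a back-of-the-envelope cost model. The sketch below is only illustrative; the bus bandwidth and GPU throughput figures are hypothetical round numbers, not measurements of any real card:

```python
# Back-of-the-envelope model of GPU offload cost.
# Both constants are hypothetical round numbers, not measured values.
PCIE_BW = 5e9     # bytes/s across the host-device bus (assumed)
GPU_FLOPS = 1e12  # single-precision flops/s of the card (assumed)

def transfer_fraction(bytes_moved, flops):
    """Fraction of total time spent moving data instead of computing."""
    t_xfer = bytes_moved / PCIE_BW
    t_comp = flops / GPU_FLOPS
    return t_xfer / (t_xfer + t_comp)

n = 10**7
# Level-1 BLAS (axpy): 3 vectors of single-precision floats moved, 2 flops per element.
print(transfer_fraction(3 * n * 4, 2 * n))        # ~0.999: the bus dominates

# Level-3 BLAS (4000x4000 matrix multiply): 3 matrices moved, 2*m**3 flops.
m = 4000
print(transfer_fraction(3 * m * m * 4, 2 * m**3)) # ~0.23: compute dominates
```

With these assumed numbers, a level-1 call spends over 99% of its time on the bus, while a large level-3 matrix multiply spends most of its time computing, which is why only large BLAS level 3 calls pay off.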

>> 2) The one exception is FFT and there is an implementation of FFT using
>> CUDA in CP2K.
> the reason here is that FFT can be done in single precision without
> much loss in stability of the result. but the FFT is the by far
> dominating time eater only in special cases (you have to keep
> amdahl's law in mind: even if the FFT became infinitely fast, you
> could only get a 2x speedup if the FFT is 50% of your total time).
> the downside is that parallelizing an FFT is a _hard_ problem.
> please check the literature, it really is. thus the potential speedup
> from using CUFFT is limited to extreme cases and would at best be of
> the order of 2x or 3x.

I am aware of that and do not expect any magical 100x speedup.
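The arithmetic behind that 2x ceiling is just Amdahl's law; a quick sketch in plain Python, illustrative only:

```python
def amdahl_speedup(fraction, component_speedup):
    """Overall speedup when only `fraction` of the runtime is accelerated."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)

# FFT at 50% of runtime: even an infinitely fast FFT caps the gain at 2x.
print(amdahl_speedup(0.5, 1e12))  # approaches 2.0 in the limit
print(amdahl_speedup(0.5, 3.0))   # ~1.5 for a realistic 3x faster FFT
```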

>> 3) CP2K can currently be compiled with FFT in single precision - I have
>> tested a build today (linked against a single precision build of FFTW)
>> and it seems to work.
> right, it should work again. we hope to be able to do some serious
> testing on the new NCSA Xeon/Tesla cluster in the next weeks.

I'd certainly be interested in the results.
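One way to see what a single-precision build gives up, independent of CP2K itself, is to round-trip a value through IEEE single precision with Python's standard struct module (an illustrative sketch):

```python
import struct

def to_single(x):
    """Round a Python float (IEEE double) to single precision and back."""
    return struct.unpack('f', struct.pack('f', x))[0]

x = 1.0 / 3.0
err = abs(to_single(x) - x) / x
print(err)  # ~3e-8: roughly 7 significant digits survive, vs ~16 in double
```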

