GPGPU status

Axel akoh... at gmail.com
Sun Feb 22 00:00:28 UTC 2009



On Feb 21, 4:28 pm, Ondrej Marsalek <ondrej.... at gmail.com> wrote:
> Hi again,
>
> thanks a lot to everyone who commented (and stayed on topic).
>
> Let me please review "out loud" what I have understood the current
> situation to be.
>
> 1) CUDA/GPGPU for the whole of CP2K (and therefore I assume any AIMD
> code) will have to wait until the accelerators get decent performance in
> double precision.

not quite. there are two issues here. point one is that the single precision
performance of GPUs is _much_ higher than the double precision performance,
so those magical 20x, 30x, 100x speedups are only possible in single precision.
but the programming model is also different: you have to download data to the
graphics card's memory, process it, and get the results back. this pays off
best if the compute parts have "bad" scaling, O(N**2) or worse, or if you can
do a large part of the calculation on the GPU without having to move the data
back. with that in mind, you can gain from BLAS only for level 3 calls, and
only if the data sets are large (which conflicts with MPI parallelisation).
point two is that you have to design code differently, as you only gain a lot
if you can use a lot (hundreds!) of threads. the good news here is that thread
creation is cheap; the bad news is that you have to create threads in groups.
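to make the transfer-versus-compute argument above concrete, here is a toy cost model (not CP2K code; every constant is invented for illustration, only the scaling argument matters) of why offloading pays off only when the per-element work grows with the problem size:

```python
# Toy model of GPU offloading: total time = PCIe transfer + on-device compute.
# All constants are hypothetical; only the O(N) vs O(N**2) comparison matters.

def offload_time(n, flops_per_elem, transfer_per_elem=1e-8, flop_time=1e-11):
    """Return (transfer_time, compute_time) for moving n elements to the
    GPU and back, doing flops_per_elem operations on each element."""
    transfer = 2 * n * transfer_per_elem      # download + upload: O(N)
    compute = n * flops_per_elem * flop_time  # work done on the card
    return transfer, compute

n = 10**7

# O(N) total work (constant flops per element): the bus dominates.
t, c = offload_time(n, flops_per_elem=10)
print(t > c)   # True: transfer time swamps an O(N) kernel

# O(N**2)-like work (flops per element grows with N): compute dominates
# and the one-time transfer cost is amortised.
t, c = offload_time(n, flops_per_elem=n * 1e-3)
print(t > c)   # False
```

the same reasoning is why level-3 BLAS (O(N**3) work on O(N**2) data) can win on a GPU while level-1/2 calls generally cannot.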

i would not expect a significant improvement in double precision performance
relative to single precision performance in the foreseeable future.

but i also would not say that one has to wait until double precision
performance is better. a lot could already be done right now. your statement
applies, in the strict sense, only to gaining performance from linking to
CUBLAS.

there are already hartree fock codes that run in single precision, but you'll
have to redesign a code from the ground up to support this well. given the
size of the cp2k codebase, that would be a gigantic undertaking... one could
consider implementing a special purpose code for special applications, though,
using the know-how and the algorithms in cp2k. it still is a big pile of work.

> 2) The one exception is FFT and there is an implementation of FFT using
> CUDA in CP2K.

the reason here is that the FFT can be done in single precision without much
loss in stability of the result. but the FFT is by far the dominant time eater
only in special cases (you have to keep amdahl's law in mind: even if the FFT
became infinitely fast, you could only get a 2x speedup if the FFT is 50% of
your total time).
the downside is that parallelizing an FFT is a _hard_ problem. please check
the literature, it really is. thus the potential speedup from using CUFFT is
limited to extreme cases and would at best be of the order of 2x or 3x.
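the amdahl's-law bound quoted above is easy to check directly (nothing here is CP2K-specific, it is just the standard formula):

```python
# Amdahl's law: if a fraction f of the runtime is sped up by a factor s,
# the overall speedup is 1 / ((1 - f) + f / s).

def amdahl_speedup(f, s):
    return 1.0 / ((1.0 - f) + f / s)

# FFT is 50% of the runtime and becomes infinitely fast: at most 2x overall.
print(amdahl_speedup(0.5, float("inf")))    # 2.0

# A (hypothetical) 10x FFT speedup on the same 50% share gives only ~1.82x.
print(round(amdahl_speedup(0.5, 10.0), 2))  # 1.82
```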

> 3) CP2K can currently be compiled with FFT in single precision - I have
> tested a build today (linked against a single precision build of FFTW)
> and it seems to work.

right, it should work again. we hope to be able to do some serious testing on
the new NCSA Xeon/Tesla cluster in the next weeks.
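for reference, the two preprocessor flags mentioned later in this thread (-D__FFTSGL and -D__CUDA) go into the arch file. the fragment below is only a hypothetical sketch: the flag names come from this thread, -lfftw3f is the conventional name of the single-precision FFTW 3 library, and everything else (paths, other libraries) is a placeholder to adapt to your setup:

```
# hypothetical arch-file fragment -- adapt paths/libs to your installation
DFLAGS  = -D__FFTSGL -D__CUDA        # single-precision FFT + CUDA (flags from this thread)
LIBS    = -lfftw3f -lcufft $(LIBS)   # single-precision FFTW 3 plus CUFFT (placeholder)
```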


cheers,
   axel.
>
> I will have more to say on this topic, but let me split it into a
> separate thread.
>
> Best regards,
> Ondrej
>
> On Thu, Feb 5, 2009 at 22:21, Ben Levine <ben.l... at gmail.com> wrote:
>
> > Hi Guys,
> > Sorry to be joining the discussion a little late.  I haven't checked
> > the group in a little while, I guess.
>
> > Yes, I've done some work incorporating some CUDA code into CP2K,
> > and after a long hiatus I am just starting to work on it again.  Right
> > now it is possible to run CP2K with a single precision FFT and several
> > associated scatter/gather operations run on the GPU.  As others have
> > said, running the FFT in single precision does not seem to degrade the
> > accuracy of the calculation significantly.  In my experience this
> > can't be said for other portions of the code, and this is the reason
> > that greater CUDA support is not yet provided.
>
> > To enable CUDA support compile with the -D__CUDA and -D__FFTSGL
> > compiler flags...  But hold off for a little bit if you would.  I'm
> > currently having some problems getting it to run in its current
> > state.  I'll post again when it's working.  Thanks for your interest.
>
> > Ben
>
> > On Jan 29, 8:59 am, Juerg Hutter <hut... at pci.uzh.ch> wrote:
> >> Hi
>
> >> > I am interested in the status of GPGPU code in CP2K. So far, I have
> >> > found only the very brief mention of single precision FFT using CUDA in
> >> > the input manual and this e-mail from the CPMD archive:
>
> >> >http://www.cpmd.org/pipermail/cpmd-list/2008-April/004330.html
>
> >> > Could someone please give me a brief overview of the options one has in
> >> > this area in CP2K? I would also like to know whether there is someone
> >> > working on some sort of GPGPU code at this time.
>
> >> we have been looking into this a couple of times and also have some
> >> accelerator cards available. Up to now we couldn't find a convincing
> >> application, meaning a project together with a hardware setup where
> >> the work/benefit ratio is good.
>
> >> > I also have one more specific question. Could FFTCU be adapted to use
> >> > double precision capable cards? I am interested in this because of
> >> > cluster calculations in open boundary conditions, where FFT seems to be
> >> > the main bottleneck. Also, does anyone have any experience using FFTCU?
>
> >> we have some (but not comprehensive) experience with a setup where all
> >> of CP2K is running double precision except for the FFT. (compile with
> >> -D__FFTSGL)
> >> It seems that the loss in accuracy is not dramatic and this might be
> >> an interesting option if a really fast single precision FFT is available.
>
> >> regards
>
> >> Juerg Hutter
>
> >> > Thanks a lot for any replies or comments,
> >> > Ondrej


More information about the CP2K-user mailing list