FFTSGL compile

Axel akoh... at gmail.com
Sun Feb 22 00:23:31 UTC 2009
Previous message (by thread): [CP2K:1783] Re: FFTSGL compile
Next message (by thread): [CP2K:1790] Re: FFTSGL compile

On Feb 21, 4:39 pm, Ondrej Marsalek <ondrej.... at gmail.com> wrote:
> Hi,
>
> as I said in another thread, I have built a single precision version of
> CP2K. It seems to (mostly) work, there were some minor issues in the
> tests, but I have not explored them. I am only interested in this
> because of the potential to use CUDA FFT. So far, I have linked against
> an SP build of FFTW (version 3.2.1).

please note that there are two different issues: a full single precision
version of cp2k, and a version with just the FFT in single precision. at the
moment, the full single precision version is mostly kept alive to keep the
code SP clean, and for the improbable case that somebody actually goes
over the code to rewrite all the parts where using single precision in the
current style leads to instabilities. quantum chemistry is a particularly
troublesome field, since you are interested in the differences of
large numbers.
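
to illustrate the point about differences of large numbers, here is a
minimal sketch in python. the two energies are made-up values chosen for
illustration (not cp2k output), and to_single is a hypothetical helper that
rounds a value to IEEE single precision via the struct module.

```python
import struct

def to_single(x):
    """Round a Python float (IEEE double) to the nearest IEEE single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# made-up energies for illustration only (not cp2k output)
e_a = -1100.12345
e_b = -1100.12346

diff_double = e_a - e_b                        # close to 1e-5
diff_single = to_single(e_a) - to_single(e_b)  # inputs stored in single

print("double:", diff_double)
print("single:", diff_single)
```

near 1100 the spacing between adjacent single precision values is about
1.2e-4, so a difference of 1e-5 cannot be resolved once the inputs have
been stored in single precision.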

> I have one question, though. The SP build seems to be a bit slower than
> a DP build, on identical input and a Core2 processor. Is this to be
> expected? Is it because SP is no faster than DP on such a CPU and there
> is the additional overhead of conversion?

assuming that you are talking about the version with the single precision FFT
only, then yes: you have an additional copy/conversion of the data, and that
should lead to a slowdown. and as long as you didn't compile your FFTW with
the FPU put into single precision mode (-pc32 with intel compilers; you need
-pc64 for double precision, and the default is -pc80, btw. this governs
how many iterations are needed to converge, e.g., a square root in the FPU),
there cannot be a speed difference between single and double precision
floating point, except for the difference in memory bandwidth requirements.
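
as a rough sketch of what the extra copy/conversion and the memory
bandwidth difference mean, the following python snippet converts a double
precision buffer to single precision and back. the buffer size and contents
are arbitrary assumptions, and the FFT itself is left out; this is not how
cp2k does it internally, just the general pattern.

```python
from array import array

n = 1 << 16
# double precision data, as the rest of the code would hold it
grid_dp = array("d", (0.001 * i for i in range(n)))

# extra step 1: copy/convert DP -> SP before the single precision FFT
grid_sp = array("f", grid_dp.tolist())
# (the single precision FFT would run on grid_sp here)
# extra step 2: copy/convert SP -> DP afterwards
back_dp = array("d", grid_sp.tolist())

bytes_dp = grid_dp.itemsize * len(grid_dp)
bytes_sp = grid_sp.itemsize * len(grid_sp)
max_err = max(abs(a - b) for a, b in zip(grid_dp, back_dp))

print("bytes in DP buffer:", bytes_dp)
print("bytes in SP buffer:", bytes_sp)
print("largest round-trip error:", max_err)
```

the single precision buffer needs half the memory traffic, but the two
conversion passes are pure overhead on top of the transform.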

> I should have a Tesla available soon and I'll be happy to test CUDA

actually an nvidia GTX 260 will be good enough for testing and is much
cheaper. the GTX 285 has a much higher internal memory bandwidth, which may
help for some applications (it should help with FFT), but already costs double.
i haven't checked the GTX 295 (the dual GPU version of the 285) yet.

> support and see the performance. I hope that there will be a clear gain
> for a cluster system (meaning a lot of "empty space").

as ben already stated, you should be doing very well if you get a 2x
speedup. somebody at nvidia told me that they are working on improving the
speed of CUFFT, but i doubt the gain will be spectacular.

those magical 20x and more speedups require code written
specifically for the GPU. thus we have refocused the GPU
programming activities in our group on extending/adapting
those applications for our purposes and writing (small)
analysis codes, or using GPUs for something that people never
considered doing before with CPUs (because it would have been
too slow). i can give you more details in private e-mail, as
this has little to do with cp2k.

> By the way, does anyone have any performance comparisons with and
> without CUDA used for FFT?

we have some preliminary results that were presented at last year's LCI
conference. perhaps ben can share them with you.

cheers,
   axel.

> Best,
> Ondrej
>
> On Tue, Feb 17, 2009 at 18:49, Ben Levine <ben.l... at gmail.com> wrote:
>
> > For completeness:  Iain Bethune submitted a patch which fixes my
> > problem.  I'm now looking into the CUDA compile to see if it's
> > working.  Thanks Iain!
>
> > On Feb 9, 6:23 pm, Ben Levine <ben.l... at gmail.com> wrote:
> >> Okay, well, I seem to have found the problem.  Once I get things
> >> cleaned up and tested I'll send a patch.
>
> >> On Feb 5, 5:38 pm, Ben Levine <ben.l... at gmail.com> wrote:
>
> >> > Hi Guys,
> >> > As mentioned in another thread, I'm once again working with the CUDA
> >> > capable version of CP2K.  Unfortunately, it's been a long time since I
> >> > last ran it and I'm having some difficulties.  I'm working with the
> >> > most recent version out of CVS.  I compiled a serial version of the
> >> > code successfully with -D__FFTSGL (with or without -D__CUDA).
> >> > However, when I run the executables my jobs die with a seg fault after
> >> > printing the line:
>
> >> > GLOBAL| This output is from process                                           0
>
> >> > I'm using the benchmark jobs from cp2k/tests/benchmarks as test runs
> >> > (specifically H2O-64.inp and H2O-512.inp).  I have reproduced this
> >> > error on two machines, though I use a very similar arch file on both.
> >> > I've included one below.  Simply removing -D__FFTSGL yields a fully
> >> > functioning double precision executable.  I wonder if anyone has an
> >> > idea what the problem is, and if others can reproduce it.
> >> > Thanks for your time!
>
> >> > Ben
>
> >> > # by default some intel compilers put temporaries on the stack
> >> > # this might lead to segmentation faults if the stack limit is set too low
> >> > # stack limits can be increased by sysadmins or e.g. with ulimit -s 256000
> >> > # furthermore new ifort (10.0?) compilers support the option
> >> > # -heap-arrays 64
> >> > # add this to the compilation flags if the other options do not work
> >> > # The following settings worked for:
> >> > # - AMD64 Opteron
> >> > # - SUSE Linux Enterprise Server 10.0 (x86_64)
> >> > # - Intel(R) Fortran Compiler for Intel(R) EM64T-based applications, Version 10.0
> >> > # - AMD acml library version 3.6.0
> >> > # - MPICH2-1.0.5p4
> >> > # - FFTW 3.1.2
> >> > #
> >> > PERL     = perl
> >> > CC       = gcc
> >> > CPP      = cpp
> >> > FC       = /opt/intel/fce/10.0.025/bin/ifort -FR
> >> > LD       = /opt/intel/fce/10.0.025/bin/ifort -i-static -openmp
> >> > AR       = ar -r
> >> > #DFLAGS   = -D__INTEL -D__FFTMKL -D__FFTSG
> >> > DFLAGS   = -D__INTEL -D__FFTSG -D__FFTSGL -D__FFTW3
> >> > CFLAGS   =  -O2
> >> > CPPFLAGS = -traditional -C $(DFLAGS) -P -I/opt/intel/mkl/10.0.1.014/include/fftw -I/opt/intel/mkl/10.0.1.014/include/
> >> > FCFLAGS  = $(DFLAGS) -O2 -xW
> >> > MKLPATH  = /opt/intel/mkl/10.0.1.014/lib/em64t/
> >> > LDFLAGS  = $(FCFLAGS)
> >> > LIBS     = -L$(MKLPATH)\
> >> >            $(MKLPATH)/libmkl_em64t.a\
> >> >            $(MKLPATH)/libmkl_lapack.a\
> >> >            $(MKLPATH)/libguide.a\
> >> >            /usr/local/lib/libfftw3f.a\
> >> >            -lpthread
>
> >> > OBJECTS_ARCHITECTURE = machine_intel.o
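
the arch file comments above suggest raising the stack limit (ulimit -s
256000) when intel-compiled binaries segfault. a small sketch for checking
the current limit from python on a unix-like system follows. the helper
name stack_limit_ok is hypothetical, and the 256000 KiB threshold is simply
the value from the comment, not a verified requirement.

```python
import resource

WANTED_KIB = 256000  # value from the arch file comment; ulimit -s counts KiB

def stack_limit_ok(wanted_kib=WANTED_KIB):
    """Return True if the soft stack limit is unlimited or >= wanted_kib."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
    if soft == resource.RLIM_INFINITY:
        return True
    return soft >= wanted_kib * 1024  # getrlimit reports bytes

soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
print("soft stack limit (bytes, -1 means unlimited):", soft)
print("large enough for the suggested setting:", stack_limit_ok())
```

if this reports a limit that is too small, running ulimit -s 256000 in the
shell before starting the executable (or compiling with -heap-arrays 64, as
the comments say) would be the things to try.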