[CP2K-user] [CP2K:18636] Re: Install issues with IBM Power9 processors with Nvidia V100 GPU
Nathan Keilbart
nathankeilbart at gmail.com
Fri Apr 7 23:26:21 UTC 2023
Thanks, Alfio, and sorry for my late reply. It seems something in my
environment was keeping MPI from being detected correctly. My scripts now
detect everything correctly, and after identifying the libraries that
wouldn't build I was finally able to get a working binary. One strange issue
is that the -ldl flag was needed when building the parallel binary. I'm not
sure whether this is normally detected, but on my system, with the options I
was providing, it wasn't, so I simply added it to the arch files.
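A minimal sketch of that kind of workaround, assuming the arch file collects
its link libraries in a make variable named LIBS (an assumption; check the
generated file before relying on it). The scratch file and its contents are
hypothetical stand-ins for a real arch file such as arch/local_cuda.psmp.

```shell
#!/bin/sh
# Hypothetical stand-in for an arch file; a real one would be
# something like arch/local_cuda.psmp.
arch_file=$(mktemp)
printf 'CC = mpicc\nFC = mpif90\nLD = mpif90\nLIBS = -lstdc++\n' > "$arch_file"

# Append -ldl to the link libraries only if it is not already present,
# so running this twice does not duplicate the flag.
if ! grep -q -e '-ldl' "$arch_file"; then
    printf 'LIBS += -ldl\n' >> "$arch_file"
fi

# Show the resulting LIBS lines.
grep '^LIBS' "$arch_file"
```

After editing a real arch file, the binaries have to be rebuilt for the
change to take effect.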
Initially, I was getting a CUDA memory issue when running my 300-atom test
system on one node with four GPUs, but I have since resubmitted the job
several times and it appears to be working. I'm not sure whether I just got
a bad node.
As I mentioned, I had to disable quite a few libraries. They install without
errors according to the terminal output, but binaries built with them
misbehave and crash before even reaching the initial SCF loop.
Here are the flags I used:
./install_cp2k_toolchain.sh --install-all --with-cmake=system
--with-openmpi=system --with-gcc=system --with-quip=no --with-libtorch=no
--with-plumed=no --with-cosma=no --with-sirius=no --enable-cuda
--gpu-ver=V100
In your opinion, would I gain any further speedup by debugging this
issue? I'm primarily concerned with the COSMA and SIRIUS libraries. Once
again, thank you for your help. I'm also working on an Intel system, where I
have a working binary, but I might have some questions as I'm seeing very
poor scaling when I use multiple nodes.
On Thursday, March 30, 2023 at 9:35:52 PM UTC-7 Alfio Lazzaro wrote:
> There is still something wrong in your local_cuda.psmp file.
> In your output above I cannot find the flag `-D__parallel`. I see only the
> following:
>
> -D__OFFLOAD_CUDA -D__DBCSR_ACC -D__FFTW3 -D__LIBINT -D__LIBXC
> -D__SCALAPACK -D__COSMA -D__ELPA -D__ELPA_NVIDIA_GPU -D__GSL -D__HDF5
> -D__LIBVDWXC -D__SPGLIB -D__LIBVORI -D__SPFFT -D__OFFLOAD_GEMM -D__SPLA
> -D__SIRIUS -D__CUDA
>
> So my guess is that the toolchain was not able to recognize MPI (no idea
> why). Could you add -D__parallel on top of those flags?
>
> On Friday, March 31, 2023 at 00:08:29 UTC+2 Nathan Keilbart wrote:
>
>> Thanks, Alfio. I wasn't sure which file was controlling that. I updated the
>> file to use those compilers and then ran make realclean. Now I am getting
>> this error:
>>
>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/src/fm/cp_blacs_env.F:192:19:
>>
>> gcd_max = -1
>> 1
>> Error: Symbol 'gcd_max' at (1) has no IMPLICIT type
>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/src/fm/cp_blacs_env.F:193:18:
>>
>> DO ipe = 1, CEILING(SQRT(REAL(npe, dp)))
>> 1
>> Error: Symbol 'ipe' at (1) has no IMPLICIT type
>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/src/fm/cp_blacs_env.F:194:18:
>>
>> jpe = npe/ipe
>> 1
>> Error: Symbol 'jpe' at (1) has no IMPLICIT type
>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/src/fm/cp_blacs_env.F:185:29:
>>
>> my_blacs_grid_layout = BLACS_GRID_SQUARE
>> 1
>> Error: Symbol 'my_blacs_grid_layout' at (1) has no IMPLICIT type; did you
>> mean 'blacs_grid_layout'?
>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/src/fm/cp_blacs_env.F:221:25:
>>
>> my_blacs_repeatable = .FALSE.
>> 1
>> Error: Symbol 'my_blacs_repeatable' at (1) has no IMPLICIT type; did you
>> mean 'blacs_repeatable'?
>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/src/fm/cp_blacs_env.F:213:18:
>>
>> my_row_major = .TRUE.
>> 1
>> Error: Symbol 'my_row_major' at (1) has no IMPLICIT type; did you mean
>> 'row_major'?
>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/src/fm/cp_blacs_env.F:174:11:
>>
>> npcol = 1
>> 1
>> Error: Symbol 'npcol' at (1) has no IMPLICIT type; did you mean 'ipcol'?
>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/src/fm/cp_blacs_env.F:175:9:
>>
>> npe = blacs_env%n_pid
>> 1
>> Error: Symbol 'npe' at (1) has no IMPLICIT type
>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/src/fm/cp_blacs_env.F:173:11:
>>
>> nprow = 1
>> 1
>> Error: Symbol 'nprow' at (1) has no IMPLICIT type; did you mean 'iprow'?
>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/src/fm/cp_blacs_env.F:188:22:
>>
>> SELECT CASE (my_blacs_grid_layout)
>> 1
>> Error: Argument of SELECT statement at (1) cannot be UNKNOWN
>> make[3]: *** [/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/Makefile:519:
>> cp_blacs_env.o] Error 1
>> make[2]: *** [/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/Makefile:146: all]
>> Error 2
>>
>> make[1]: *** [/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/Makefile:128:
>> psmp] Error 2
>> make: *** [Makefile:123: all] Error 2
>>
>> On Thursday, March 30, 2023 at 12:22:43 AM UTC-7 Alfio Lazzaro wrote:
>>
>>> There is no relation with the DBCSR compilation itself, you see a
>>> problem in DBCSR simply because it is the first to compile in CP2K.
>>> The error message is:
>>>
>>> /bin/sh: c: command not found
>>>
>>> and indeed you are using the command
>>>
>>> c -fno-omit-frame-pointer -fopenmp -g -mtune=native -O3 -funroll-loops
>>> ...
>>>
>>> for compiling, therefore there is something wrong in the compiler call.
>>> I think the problem is that the local_cuda.psmp file has something wrong
>>> in the definition of the compilers, namely the lines
>>>
>>> CC := mpicc
>>> FC := mpif90
>>> LD := mpif90
>>> AR := ar -r
>>>
>>> Could you check whether they are linking to the right commands?
>>>
>>>
>>>
>>>
>>> On Thursday, March 30, 2023 at 03:12:26 UTC+2 Nathan Keilbart wrote:
>>>
>>>> Hello everyone,
>>>>
>>>> I've been working on installing CP2K on a system with IBM Power9
>>>> processors and Nvidia V100 GPUs. I'm using the toolchain with these options:
>>>>
>>>> ./install_cp2k_toolchain.sh -j --with-cmake=system --mpi-mode=openmpi
>>>> --enable-cuda --gpu-ver=V100
>>>>
>>>> It installs all the dependencies without any errors, so I copy the files
>>>> over to the arch folder, source the setup file, and then run
>>>>
>>>> make -j ARCH=local_cuda VERSION=psmp
>>>>
>>>> The following are some of the last lines of output:
>>>>
>>>> /usr/bin/env python3
>>>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/exts/dbcsr/tools/build_utils/fypp/bin/fypp
>>>> -n --line-marker-format=gfortran5
>>>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/exts/dbcsr/src/tensors/dbcsr_tensor_test.F
>>>> dbcsr_tensor_test.F90
>>>> c -fno-omit-frame-pointer -fopenmp -g -mtune=native -O3 -funroll-loops
>>>>
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/openblas-0.3.21/include'
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/fftw-3.3.10/include'
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/libint-v2.6.0-cp2k-lmax-5/include'
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/libxc-6.0.0/include'
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/COSMA-2.6.2/include'
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/elpa-2022.11.001/nvidia/include/elpa_openmp-2022.11.001/modules'
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/elpa-2022.11.001/nvidia/include/elpa_openmp-2022.11.001/elpa'
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/gsl-2.7/include'
>>>> -I/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/hdf5-1.12.0/include
>>>> -I/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/libvdwxc-0.4.0/include
>>>> -I/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/spglib-1.16.2/include
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/SpFFT-1.0.6/include'
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/SpLA-1.5.4/include/spla'
>>>> -I/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/sirius-7.3.2/include/cuda
>>>> -fbacktrace -ffree-form -fimplicit-none -std=f2008 -Werror=aliasing
>>>> -Werror=ampersand -Werror=c-binding-type -Werror=intrinsic-shadow
>>>> -Werror=intrinsics-std -Werror=line-truncation -Werror=tabs
>>>> -Werror=target-lifetime -Werror=underflow -Werror=unused-but-set-variable
>>>> -Werror=unused-variable -Werror=unused-dummy-argument -Werror=conversion
>>>> -Werror=zerotrip -Wno-maybe-uninitialized -Wuninitialized
>>>> -Wuse-without-only -D__OFFLOAD_CUDA -D__DBCSR_ACC -D__FFTW3 -D__LIBINT
>>>> -D__LIBXC -D__SCALAPACK -D__COSMA -D__ELPA -D__ELPA_NVIDIA_GPU -D__GSL
>>>> -D__HDF5 -D__LIBVDWXC -D__SPGLIB -D__LIBVORI -D__SPFFT -D__OFFLOAD_GEMM
>>>> -D__SPLA -D__SIRIUS -D__CUDA -D__SHORT_FILE__="\"dbcsr_tensor_test.F\""
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/exts/dbcsr/src/tensors/'
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/exts/dbcsr/src'
>>>> dbcsr_tensor_test.F90
>>>> /bin/sh: c: command not found
>>>> make[4]:
>>>> [/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/exts/build_dbcsr//Makefile:258:
>>>> dbcsr_tensor_test.o] Error 127 (ignored)
>>>> /usr/bin/env python3
>>>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/exts/dbcsr/tools/build_utils/fypp/bin/fypp
>>>> -n --line-marker-format=gfortran5
>>>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/exts/dbcsr/src/tensors/dbcsr_tensor_api.F
>>>> dbcsr_tensor_api.F90
>>>> c -fno-omit-frame-pointer -fopenmp -g -mtune=native -O3 -funroll-loops
>>>>
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/openblas-0.3.21/include'
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/fftw-3.3.10/include'
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/libint-v2.6.0-cp2k-lmax-5/include'
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/libxc-6.0.0/include'
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/COSMA-2.6.2/include'
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/elpa-2022.11.001/nvidia/include/elpa_openmp-2022.11.001/modules'
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/elpa-2022.11.001/nvidia/include/elpa_openmp-2022.11.001/elpa'
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/gsl-2.7/include'
>>>> -I/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/hdf5-1.12.0/include
>>>> -I/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/libvdwxc-0.4.0/include
>>>> -I/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/spglib-1.16.2/include
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/SpFFT-1.0.6/include'
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/SpLA-1.5.4/include/spla'
>>>> -I/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/sirius-7.3.2/include/cuda
>>>> -fbacktrace -ffree-form -fimplicit-none -std=f2008 -Werror=aliasing
>>>> -Werror=ampersand -Werror=c-binding-type -Werror=intrinsic-shadow
>>>> -Werror=intrinsics-std -Werror=line-truncation -Werror=tabs
>>>> -Werror=target-lifetime -Werror=underflow -Werror=unused-but-set-variable
>>>> -Werror=unused-variable -Werror=unused-dummy-argument -Werror=conversion
>>>> -Werror=zerotrip -Wno-maybe-uninitialized -Wuninitialized
>>>> -Wuse-without-only -D__OFFLOAD_CUDA -D__DBCSR_ACC -D__FFTW3 -D__LIBINT
>>>> -D__LIBXC -D__SCALAPACK -D__COSMA -D__ELPA -D__ELPA_NVIDIA_GPU -D__GSL
>>>> -D__HDF5 -D__LIBVDWXC -D__SPGLIB -D__LIBVORI -D__SPFFT -D__OFFLOAD_GEMM
>>>> -D__SPLA -D__SIRIUS -D__CUDA -D__SHORT_FILE__="\"dbcsr_tensor_api.F\""
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/exts/dbcsr/src/tensors/'
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/exts/dbcsr/src'
>>>> dbcsr_tensor_api.F90
>>>> /bin/sh: c: command not found
>>>> make[4]:
>>>> [/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/exts/build_dbcsr//Makefile:258:
>>>> dbcsr_tensor_api.o] Error 127 (ignored)
>>>> Updating archive
>>>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/lib/local_cuda/psmp/exts/dbcsr/libdbcsr.a
>>>> ar: creating
>>>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/lib/local_cuda/psmp/exts/dbcsr/libdbcsr.a
>>>> ar: dbcsr_cuda_profiling.o: No such file or directory
>>>> make[4]: ***
>>>> [/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/exts/build_dbcsr//Makefile:330:
>>>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/lib/local_cuda/psmp/exts/dbcsr/libdbcsr.a]
>>>> Error 1
>>>> make[3]: ***
>>>> [/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/exts/build_dbcsr/Makefile:179:
>>>> libdbcsr] Error 2
>>>> make[2]: ***
>>>> [/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/exts/Makefile.inc:38: dbcsr]
>>>> Error 2
>>>> make[1]: *** [/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/Makefile:128:
>>>> psmp] Error 2
>>>> make: *** [Makefile:123: all] Error 2
>>>>
>>>> It seems to be having issues with the DBCSR module. I initially had a
>>>> problem with this because I had left off the --recursive option; after
>>>> making sure my git clone used it, I could at least build most of the
>>>> serial version. That gave me the cp2k.sopt binary, which seems to accept
>>>> inputs, though I haven't had a chance to test it much yet. When I got
>>>> this binary I had run
>>>>
>>>> make -j ARCH=local_cuda VERSION="ssmp sdbg psmp pdbg"
>>>>
>>>> as suggested.
>>>>
>>>> Also, I've attempted to install with spack by using
>>>>
>>>> spack install
>>>> cp2k@2023.1+cosma+cuda+elpa+libint+libxc+mpi+openmp+pexsi+plumed+sirius+spglib
>>>> smm=blas cuda_arch=70
>>>>
>>>> These are some of the last lines of output
>>>>
>>>> >> 4028 collect2: error: ld returned 1 exit status
>>>> >> 4029 collect2: error: ld returned 1 exit status
>>>> >> 4030 make[3]: ***
>>>> [/tmp/keilbart/spack-stage/spack-stage-cp2k-2023.1-24dhoyt24tbnn4d423glgoeqqquibmb6/spack-src/obj/linux-rhel7-power9le-gcc/psmp/
>>>> all.dep:178:
>>>> /tmp/keilbart/spack-stage/spack-stage-cp2k-2023.1-24dhoyt24tbnn4d423glgoeqqquibmb6/spack-src/exe/linux-rhel7-power9le-gcc/cp2k.p
>>>> smp] Error 1
>>>> 4031 make[3]: *** Waiting for unfinished jobs....
>>>> >> 4032 make[3]: ***
>>>> [/tmp/keilbart/spack-stage/spack-stage-cp2k-2023.1-24dhoyt24tbnn4d423glgoeqqquibmb6/spack-src/obj/linux-rhel7-power9le-gcc/psmp/
>>>> all.dep:194:
>>>> /tmp/keilbart/spack-stage/spack-stage-cp2k-2023.1-24dhoyt24tbnn4d423glgoeqqquibmb6/spack-src/exe/linux-rhel7-power9le-gcc/libcp2
>>>> k_unittest.psmp] Error 1
>>>> >> 4033 make[2]: ***
>>>> [/tmp/keilbart/spack-stage/spack-stage-cp2k-2023.1-24dhoyt24tbnn4d423glgoeqqquibmb6/spack-src/Makefile:146:
>>>> all] Error 2
>>>> >> 4034 make[1]: ***
>>>> [/tmp/keilbart/spack-stage/spack-stage-cp2k-2023.1-24dhoyt24tbnn4d423glgoeqqquibmb6/spack-src/Makefile:128:
>>>> psmp] Error 2
>>>> >> 4035 make: *** [Makefile:123: all] Error 2
>>>>
>>>> Finally, I also have some Intel machines that I'm attempting to build
>>>> on, and I'm having issues there as well, but we can start with the IBM
>>>> machine since we're hoping to accelerate the simulations with the GPUs.
>>>>
>>>> Please let me know what other information I can provide. Thank you.
>>>>
>>>> Nathan
>>>>
>>>