[CP2K-user] [CP2K:18636] Re: Install issues with IBM Power9 processors with Nvidia V100 GPU

Nathan Keilbart nathankeilbart at gmail.com
Fri Apr 7 23:26:21 UTC 2023


Thanks Alfio, and sorry for my late reply. It seems something in my
environment was keeping MPI from being detected correctly. My scripts now
detect everything correctly, and after identifying the libraries that
wouldn't build I was finally able to get a working binary. One strange issue
is that the -ldl flag was needed when building the parallel binary. I'm not
sure whether this is normally detected, but with my system and the inputs I
provided it wasn't, so I simply added it to the arch files.
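
A minimal sketch of that workaround, for anyone hitting the same link
error. It assumes the arch file is a GNU make fragment that collects link
flags in a LIBS variable; the function name ensure_ldl is made up for
illustration and is not part of CP2K or its toolchain.

```shell
# Append -ldl to the link flags of an arch file unless it is already there.
# Assumption: link libraries are collected in a make variable called LIBS.
ensure_ldl() {
  arch_file=$1
  # Match -ldl as a whole token on a line that assigns or appends to LIBS.
  if grep -Eq '^[[:space:]]*LIBS[[:space:]]*[:+]?=(.*[[:space:]])?-ldl([[:space:]]|$)' "$arch_file"; then
    echo "unchanged"
  else
    # The leading newline guards against a file with no trailing newline.
    printf '\nLIBS += -ldl\n' >> "$arch_file"
    echo "appended"
  fi
}

# Example call (path taken from this thread's ARCH/VERSION names):
#   ensure_ldl arch/local_cuda.psmp
```

Appending with += leaves the generated LIBS line untouched, so the change
is easy to spot and revert if the toolchain is rerun.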

Initially, I was getting a cuda memory issue when running my test system of 
300 atoms on one node with four GPUs but I have since resubmitted the job 
several times and it appears to be working. I'm not sure if I was just 
getting a bad node or something. 
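
One way to tell a bad node from a genuine memory problem might be to look
at how much GPU memory is already in use before the job starts. The sketch
below only filters the CSV output of nvidia-smi; the function name
busy_gpus and the 1000 MiB threshold are arbitrary choices for
illustration.

```shell
# Print the index of every GPU whose used memory (MiB) exceeds a threshold.
# Reads "index, memory.used" lines on stdin, as produced by:
#   nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits
busy_gpus() {
  threshold_mib=$1
  # Fields are separated by a comma plus optional spaces; compare numerically.
  awk -F', *' -v limit="$threshold_mib" '$2 + 0 > limit + 0 { print $1 }'
}

# Example call on a compute node (threshold is an assumption):
#   nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits \
#     | busy_gpus 1000
```

If this prints any GPU index before CP2K has started, something else is
already holding memory on that device.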

As I mentioned, I had to disable quite a few libraries. They install just
fine according to the terminal, but binaries built with them misbehave and
crash before even reaching the initial SCF loop. Here are the flags I used.

./install_cp2k_toolchain.sh --install-all --with-cmake=system \
    --with-openmpi=system --with-gcc=system --with-quip=no --with-libtorch=no \
    --with-plumed=no --with-cosma=no --with-sirius=no --enable-cuda \
    --gpu-ver=V100
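
To confirm which options actually ended up in the generated arch file, the
preprocessor defines can be checked directly. This is a rough sketch: the
define names come from the flag list quoted further down in this thread,
while the helper name has_define is hypothetical.

```shell
# Report "yes" or "no" depending on whether -D<name> appears in an arch file.
has_define() {
  arch_file=$1
  define=$2
  # Require a token boundary so __ELPA does not match __ELPA_NVIDIA_GPU.
  if grep -Eq -- "(^|[[:space:]])-D${define}([[:space:]=]|\$)" "$arch_file"; then
    echo yes
  else
    echo no
  fi
}

# Example: __parallel should be present in a psmp build, and the disabled
# libraries should be absent (path is an assumption based on this thread):
#   for d in __parallel __COSMA __SIRIUS; do
#     printf '%s: %s\n' "$d" "$(has_define arch/local_cuda.psmp "$d")"
#   done
```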

In your opinion, would I get any more of a speedup by debugging this
issue? I'm primarily concerned with the cosma and sirius libraries. Once
again, thank you for your help. I'm also working on an Intel system, where I
have a working binary, but I might have some questions as I'm seeing very
poor scaling when I use multiple nodes.

On Thursday, March 30, 2023 at 9:35:52 PM UTC-7 Alfio Lazzaro wrote:

> There is still something wrong in your local_cuda.psmp file.
> In your output above I cannot find the flag `-D__parallel`. I see only the
> following:
>
> -D__OFFLOAD_CUDA -D__DBCSR_ACC   -D__FFTW3  -D__LIBINT -D__LIBXC 
> -D__SCALAPACK -D__COSMA -D__ELPA -D__ELPA_NVIDIA_GPU -D__GSL -D__HDF5 
> -D__LIBVDWXC -D__SPGLIB -D__LIBVORI -D__SPFFT  -D__OFFLOAD_GEMM  -D__SPLA 
> -D__SIRIUS    -D__CUDA
>
> So my guess is that the toolchain was not able to recognize MPI (no idea 
> why). Could you add -D__parallel on top of those flags?
>
> On Friday, March 31, 2023 at 00:08:29 UTC+2 Nathan Keilbart wrote:
>
>> Thanks Alfio. I wasn't sure what file was controlling that. I updated the 
>> file to have those compilers and then did a make realclean. Afterwards, I 
>> am now getting this error:
>>
>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/src/fm/cp_blacs_env.F:192:19:
>>
>>              gcd_max = -1
>>                    1
>> Error: Symbol 'gcd_max' at (1) has no IMPLICIT type
>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/src/fm/cp_blacs_env.F:193:18:
>>
>>              DO ipe = 1, CEILING(SQRT(REAL(npe, dp)))
>>                   1
>> Error: Symbol 'ipe' at (1) has no IMPLICIT type
>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/src/fm/cp_blacs_env.F:194:18:
>>
>>                 jpe = npe/ipe
>>                   1
>> Error: Symbol 'jpe' at (1) has no IMPLICIT type
>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/src/fm/cp_blacs_env.F:185:29:
>>
>>           my_blacs_grid_layout = BLACS_GRID_SQUARE
>>                              1
>> Error: Symbol 'my_blacs_grid_layout' at (1) has no IMPLICIT type; did you 
>> mean 'blacs_grid_layout'?
>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/src/fm/cp_blacs_env.F:221:25:
>>
>>        my_blacs_repeatable = .FALSE.
>>                          1
>> Error: Symbol 'my_blacs_repeatable' at (1) has no IMPLICIT type; did you 
>> mean 'blacs_repeatable'?
>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/src/fm/cp_blacs_env.F:213:18:
>>
>>        my_row_major = .TRUE.
>>                   1
>> Error: Symbol 'my_row_major' at (1) has no IMPLICIT type; did you mean 
>> 'row_major'?
>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/src/fm/cp_blacs_env.F:174:11:
>>
>>        npcol = 1
>>            1
>> Error: Symbol 'npcol' at (1) has no IMPLICIT type; did you mean 'ipcol'?
>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/src/fm/cp_blacs_env.F:175:9:
>>
>>        npe = blacs_env%n_pid
>>          1
>> Error: Symbol 'npe' at (1) has no IMPLICIT type
>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/src/fm/cp_blacs_env.F:173:11:
>>
>>        nprow = 1
>>            1
>> Error: Symbol 'nprow' at (1) has no IMPLICIT type; did you mean 'iprow'?
>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/src/fm/cp_blacs_env.F:188:22:
>>
>>           SELECT CASE (my_blacs_grid_layout)
>>                       1
>> Error: Argument of SELECT statement at (1) cannot be UNKNOWN
>> make[3]: *** [/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/Makefile:519: 
>> cp_blacs_env.o] Error 1
>> make[2]: *** [/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/Makefile:146: all] 
>> Error 2
>>
>> make[1]: *** [/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/Makefile:128: 
>> psmp] Error 2
>> make: *** [Makefile:123: all] Error 2
>>
>> On Thursday, March 30, 2023 at 12:22:43 AM UTC-7 Alfio Lazzaro wrote:
>>
>>> There is no relation with the DBCSR compilation itself, you see a 
>>> problem in DBCSR simply because it is the first to compile in CP2K.
>>> The error message is:
>>>
>>> /bin/sh: c: command not found
>>>
>>> and indeed you are using the command
>>>
>>> c -fno-omit-frame-pointer -fopenmp -g -mtune=native  -O3 -funroll-loops  
>>>   ...
>>>
>>> for compiling, therefore there is something wrong in the compiler call.
>>> I think the problem is that the local_cuda.psmp file has something wrong 
>>> in the definition of the compilers, namely the lines
>>>
>>> CC             := mpicc
>>> FC             := mpif90
>>> LD             := mpif90
>>> AR             := ar -r
>>>
>>> Could you check if they are linking to the right commands?
>>>
>>>
>>>
>>>
>>> On Thursday, March 30, 2023 at 03:12:26 UTC+2 Nathan Keilbart wrote:
>>>
>>>> Hello everyone,
>>>>
>>>> I've been working on installing CP2K on a system with IBM Power9 
>>>> processors and Nvidia V100 GPUs. I'm using the toolchain with these options:
>>>>
>>>> ./install_cp2k_toolchain.sh -j --with-cmake=system --mpi-mode=openmpi \
>>>>     --enable-cuda --gpu-ver=V100
>>>>
>>>> It installs all the dependencies without any errors, so I then copy the 
>>>> files over to the arch folder and source the setup file, followed by
>>>>
>>>> make -j ARCH=local_cuda VERSION=psmp
>>>>
>>>> The following is some of the last lines of output
>>>>
>>>> /usr/bin/env python3 
>>>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/exts/dbcsr/tools/build_utils/fypp/bin/fypp 
>>>> -n --line-marker-format=gfortran5 
>>>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/exts/dbcsr/src/tensors/dbcsr_tensor_test.F 
>>>> dbcsr_tensor_test.F90
>>>> c -fno-omit-frame-pointer -fopenmp -g -mtune=native  -O3 -funroll-loops 
>>>>   
>>>>  -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/openblas-0.3.21/include' 
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/fftw-3.3.10/include' 
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/libint-v2.6.0-cp2k-lmax-5/include' 
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/libxc-6.0.0/include' 
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/COSMA-2.6.2/include' 
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/elpa-2022.11.001/nvidia/include/elpa_openmp-2022.11.001/modules' 
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/elpa-2022.11.001/nvidia/include/elpa_openmp-2022.11.001/elpa' 
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/gsl-2.7/include' 
>>>> -I/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/hdf5-1.12.0/include 
>>>> -I/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/libvdwxc-0.4.0/include 
>>>> -I/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/spglib-1.16.2/include 
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/SpFFT-1.0.6/include' 
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/SpLA-1.5.4/include/spla' 
>>>> -I/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/sirius-7.3.2/include/cuda 
>>>> -fbacktrace -ffree-form -fimplicit-none -std=f2008  -Werror=aliasing 
>>>> -Werror=ampersand -Werror=c-binding-type -Werror=intrinsic-shadow 
>>>> -Werror=intrinsics-std -Werror=line-truncation -Werror=tabs 
>>>> -Werror=target-lifetime -Werror=underflow -Werror=unused-but-set-variable 
>>>> -Werror=unused-variable -Werror=unused-dummy-argument -Werror=conversion 
>>>> -Werror=zerotrip -Wno-maybe-uninitialized -Wuninitialized 
>>>> -Wuse-without-only  -D__OFFLOAD_CUDA -D__DBCSR_ACC   -D__FFTW3  -D__LIBINT 
>>>> -D__LIBXC -D__SCALAPACK -D__COSMA -D__ELPA -D__ELPA_NVIDIA_GPU -D__GSL 
>>>> -D__HDF5 -D__LIBVDWXC -D__SPGLIB -D__LIBVORI -D__SPFFT  -D__OFFLOAD_GEMM 
>>>>  -D__SPLA -D__SIRIUS    -D__CUDA -D__SHORT_FILE__="\"dbcsr_tensor_test.F\"" 
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/exts/dbcsr/src/tensors/' 
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/exts/dbcsr/src' 
>>>> dbcsr_tensor_test.F90 
>>>> /bin/sh: c: command not found
>>>> make[4]: 
>>>> [/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/exts/build_dbcsr//Makefile:258: 
>>>> dbcsr_tensor_test.o] Error 127 (ignored)
>>>> /usr/bin/env python3 
>>>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/exts/dbcsr/tools/build_utils/fypp/bin/fypp 
>>>> -n --line-marker-format=gfortran5 
>>>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/exts/dbcsr/src/tensors/dbcsr_tensor_api.F 
>>>> dbcsr_tensor_api.F90
>>>> c -fno-omit-frame-pointer -fopenmp -g -mtune=native  -O3 -funroll-loops 
>>>>   
>>>>  -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/openblas-0.3.21/include' 
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/fftw-3.3.10/include' 
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/libint-v2.6.0-cp2k-lmax-5/include' 
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/libxc-6.0.0/include' 
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/COSMA-2.6.2/include' 
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/elpa-2022.11.001/nvidia/include/elpa_openmp-2022.11.001/modules' 
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/elpa-2022.11.001/nvidia/include/elpa_openmp-2022.11.001/elpa' 
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/gsl-2.7/include' 
>>>> -I/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/hdf5-1.12.0/include 
>>>> -I/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/libvdwxc-0.4.0/include 
>>>> -I/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/spglib-1.16.2/include 
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/SpFFT-1.0.6/include' 
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/SpLA-1.5.4/include/spla' 
>>>> -I/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/tools/toolchain/install/sirius-7.3.2/include/cuda 
>>>> -fbacktrace -ffree-form -fimplicit-none -std=f2008  -Werror=aliasing 
>>>> -Werror=ampersand -Werror=c-binding-type -Werror=intrinsic-shadow 
>>>> -Werror=intrinsics-std -Werror=line-truncation -Werror=tabs 
>>>> -Werror=target-lifetime -Werror=underflow -Werror=unused-but-set-variable 
>>>> -Werror=unused-variable -Werror=unused-dummy-argument -Werror=conversion 
>>>> -Werror=zerotrip -Wno-maybe-uninitialized -Wuninitialized 
>>>> -Wuse-without-only  -D__OFFLOAD_CUDA -D__DBCSR_ACC   -D__FFTW3  -D__LIBINT 
>>>> -D__LIBXC -D__SCALAPACK -D__COSMA -D__ELPA -D__ELPA_NVIDIA_GPU -D__GSL 
>>>> -D__HDF5 -D__LIBVDWXC -D__SPGLIB -D__LIBVORI -D__SPFFT  -D__OFFLOAD_GEMM 
>>>>  -D__SPLA -D__SIRIUS    -D__CUDA -D__SHORT_FILE__="\"dbcsr_tensor_api.F\"" 
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/exts/dbcsr/src/tensors/' 
>>>> -I'/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/exts/dbcsr/src' 
>>>> dbcsr_tensor_api.F90 
>>>> /bin/sh: c: command not found
>>>> make[4]: 
>>>> [/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/exts/build_dbcsr//Makefile:258: 
>>>> dbcsr_tensor_api.o] Error 127 (ignored)
>>>> Updating archive 
>>>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/lib/local_cuda/psmp/exts/dbcsr/libdbcsr.a
>>>> ar: creating 
>>>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/lib/local_cuda/psmp/exts/dbcsr/libdbcsr.a
>>>> ar: dbcsr_cuda_profiling.o: No such file or directory
>>>> make[4]: *** 
>>>> [/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/exts/build_dbcsr//Makefile:330: 
>>>> /usr/gapps/qsg/codes/cp2k/lassen/v2023.1/lib/local_cuda/psmp/exts/dbcsr/libdbcsr.a] 
>>>> Error 1
>>>> make[3]: *** 
>>>> [/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/exts/build_dbcsr/Makefile:179: 
>>>> libdbcsr] Error 2
>>>> make[2]: *** 
>>>> [/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/exts/Makefile.inc:38: dbcsr] 
>>>> Error 2
>>>> make[1]: *** [/usr/gapps/qsg/codes/cp2k/lassen/v2023.1/Makefile:128: 
>>>> psmp] Error 2
>>>> make: *** [Makefile:123: all] Error 2
>>>>
>>>> It seems that it is having issues with the DBCSR module. I initially 
>>>> had an issue with this because I had left off the --recursive option; 
>>>> after making sure my git clone used it, I could build most of the serial 
>>>> version. That gave me the cp2k.sopt binary, which seems to accept inputs, 
>>>> though I haven't had a chance to test it much yet. When I got this 
>>>> binary I had done 
>>>>
>>>> make -j ARCH=local_cuda VERSION="ssmp sdbg psmp pdbg"
>>>>
>>>> as suggested.
>>>>
>>>> Also, I've attempted to install with spack by using
>>>>
>>>> spack install \
>>>>   cp2k@2023.1+cosma+cuda+elpa+libint+libxc+mpi+openmp+pexsi+plumed+sirius+spglib \
>>>>   smm=blas cuda_arch=70
>>>>
>>>> These are some of the last lines of output
>>>>
>>>>  >> 4028    collect2: error: ld returned 1 exit status
>>>>   >> 4029    collect2: error: ld returned 1 exit status
>>>>   >> 4030    make[3]: *** 
>>>> [/tmp/keilbart/spack-stage/spack-stage-cp2k-2023.1-24dhoyt24tbnn4d423glgoeqqquibmb6/spack-src/obj/linux-rhel7-power9le-gcc/psmp/
>>>>              all.dep:178: 
>>>> /tmp/keilbart/spack-stage/spack-stage-cp2k-2023.1-24dhoyt24tbnn4d423glgoeqqquibmb6/spack-src/exe/linux-rhel7-power9le-gcc/cp2k.p
>>>>              smp] Error 1
>>>>      4031    make[3]: *** Waiting for unfinished jobs....
>>>>   >> 4032    make[3]: *** 
>>>> [/tmp/keilbart/spack-stage/spack-stage-cp2k-2023.1-24dhoyt24tbnn4d423glgoeqqquibmb6/spack-src/obj/linux-rhel7-power9le-gcc/psmp/
>>>>              all.dep:194: 
>>>> /tmp/keilbart/spack-stage/spack-stage-cp2k-2023.1-24dhoyt24tbnn4d423glgoeqqquibmb6/spack-src/exe/linux-rhel7-power9le-gcc/libcp2
>>>>              k_unittest.psmp] Error 1
>>>>   >> 4033    make[2]: *** 
>>>> [/tmp/keilbart/spack-stage/spack-stage-cp2k-2023.1-24dhoyt24tbnn4d423glgoeqqquibmb6/spack-src/Makefile:146: 
>>>> all] Error 2
>>>>   >> 4034    make[1]: *** 
>>>> [/tmp/keilbart/spack-stage/spack-stage-cp2k-2023.1-24dhoyt24tbnn4d423glgoeqqquibmb6/spack-src/Makefile:128: 
>>>> psmp] Error 2
>>>>   >> 4035    make: *** [Makefile:123: all] Error 2
>>>>
>>>> Finally, I also have some intel machines that I'm attempting to build 
>>>> on and having issues as well but we can start with the IBM machine as we're 
>>>> hoping to accelerate the simulations with the GPU.
>>>>
>>>> Please let me know what other information I can provide. Thank you.
>>>>
>>>> Nathan
>>>>
>>>
