Ok, thanks for emailing me the required data. There are a number of issues.

First, only matrix multiplications and FFTs can currently be accelerated by 
GPUs. Looking at the timing section, your calculation is dominated by CPU-only 
routines:
Total time:        CP2K                  609.771
Main bottlenecks:  integrate_v_rspace    268.894
                   calculate_rho_elec    205.875

This is normal for smaller calculations; GPUs become more useful for 
systems with 1000+ atoms.

The second problem is that only a small fraction (12.4%) of your 
matrix multiplication stacks is processed on the GPU:

 COUNTER                                      CPU                  GPU      
 number of processed stacks                179436                25344      
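
For reference, the 12.4% figure is just the GPU stack count divided by the 
total stack count. A minimal sketch of that arithmetic, using the two numbers 
from the table above (the function name gpu_fraction is only illustrative, it 
is not part of CP2K):

```python
# Stack counts copied from the COUNTER table above.
cpu_stacks = 179436
gpu_stacks = 25344


def gpu_fraction(cpu, gpu):
    """Return the share of processed stacks that ran on the GPU."""
    return gpu / (cpu + gpu)


share = 100 * gpu_fraction(cpu_stacks, gpu_stacks)
print(f"{share:.1f}% of the stacks were processed on the GPU")
```

You can rerun this with the counters from your next job to check whether more 
stacks reach the GPU once the missing kernels are in place.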

So few stacks reach the GPU because there are no GPU kernels for the block 
sizes of your basis set. You will have to add them manually:

Open: src/dbcsr/libsmm_acc/libcusmm/

There is a section with triples at the top of the file. Add to it:
triples += combinations(7,9,16,22)
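
To give an idea of what that line requests, here is a small sketch. It assumes 
that combinations() enumerates every ordered (m, n, k) triple that can be built 
from the given block sizes; the definition below is my own stand-in for 
illustration, not a copy of the helper in the script, so please check it 
against the actual file:

```python
from itertools import product


def combinations(*sizes):
    # Assumed behaviour (stand-in, not the script's own helper):
    # every ordered (m, n, k) triple drawn from the given block sizes.
    return list(product(sizes, repeat=3))


triples = []
triples += combinations(7, 9, 16, 22)

print(len(triples))   # 4 sizes -> 4**3 = 64 triples
print(triples[:3])    # first few (m, n, k) entries
```

Under that assumption, four block sizes ask for 64 kernels, so expect the 
kernel generation and the rebuild to take noticeably longer.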



P.S.: The main parameter that determines the speed of the calculations that 
you want to do is the CUTOFF parameter in CP2K_INPUT/FORCE_EVAL/DFT/MGRID.
