you are using MKL version 10.0 or later, right?

have a look at the summary of CPU time and ELAPSED time.
in your "serial" calculation, the CPU time is almost 4 times
your elapsed time. this usually happens when MKL is used
in multi-threaded mode (you are running on a quad-core node or
a two-way dual-core node, right?). since version 10, MKL multi-threads
by default across all available cpus. now if you switch to MPI,
MKL does not know that, and thus with -np 4 you are _still_ running
with 4 threads per MPI task, i.e. 16 threads altogether on 4 cores.
that oversubscribes the cpus, clogs up your memory bus and brings
down your performance.
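
to make the arithmetic explicit, here is a minimal sketch in python.
the function name and the core count of 4 are my assumptions for
illustration (based on the quad-core guess above), not something
taken from your output.

```python
def thread_load(mpi_tasks, threads_per_task, cores):
    """Return (total threads, threads per core) for one node.

    Hypothetical helper for illustration only: it just multiplies
    the MPI task count by the threads each task spawns.
    """
    total = mpi_tasks * threads_per_task
    return total, total / cores


# -np 4 with MKL threading over 4 cpus in every task (assumed 4 cores)
print(thread_load(4, 4, 4))  # (16, 4.0)

# the same run with one thread per MPI task
print(thread_load(4, 1, 4))  # (4, 1.0)
```

anything above 1.0 threads per core means the threads are competing
for the same cpus.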

add to that the fact that a serial executable is a bit faster due
to the lack of parallel overhead, and that the SMP performance of
MPICH-1 is suboptimal, and your experience is completely understandable.

please read the MKL documentation and either set OMP_NUM_THREADS=1
in your environment or link with the sequential MKL libraries.
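
if you launch your jobs from a script, the environment route could
look roughly like the sketch below. the helper name is made up, and
MKL_NUM_THREADS is an addition of mine (it is the MKL-specific
counterpart of OMP_NUM_THREADS). the commented-out launch line uses a
placeholder executable name, and whether mpirun passes the environment
on to all MPI tasks depends on your MPI installation, so please check
that before relying on it.

```python
import os


def single_threaded_env(base=None):
    """Return a copy of the environment with threading limited to 1.

    Hypothetical helper: 'base' defaults to the current environment.
    The original mapping is not modified.
    """
    env = dict(os.environ if base is None else base)
    env["OMP_NUM_THREADS"] = "1"  # the setting suggested above
    env["MKL_NUM_THREADS"] = "1"  # assumption: MKL-specific equivalent
    return env


env = single_threaded_env()
print(env["OMP_NUM_THREADS"])

# assumed usage, with a placeholder executable name:
# import subprocess
# subprocess.run(["mpirun", "-np", "4", "./your_executable"], env=env)
```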

