Added instrumented regions to the benchmarks, updated the FMA kernels to reach peak performance, added the correct BLAS libraries on the RV machines, correctly parallelized CG, and reduced its initialization time for sparse matrices.