Problem description:

I've written a trivial benchmark comparing matrix multiplication performance in three languages: Fortran (using Intel Parallel Studio 2015 and compiling with the ifort switches `/O3 /Qopt-prefetch=2 /Qopt-matmul /Qmkl:parallel`, which replaces `MatMul` calls with calls to the Intel MKL library), Python (using the current Anaconda version, including Anaconda Accelerate, which supplies NumPy 1.9.2 linked with the Intel MKL library), and MATLAB R2015a (which, again, does matrix multiplication using the Intel MKL library).

Since all three implementations use the same Intel MKL library for matrix multiplication, I would expect the results to be virtually identical, especially for matrices that are large enough for function call overhead to become negligible. However, this is far from the case: while MATLAB and Python display virtually identical performance, Fortran beats both by a factor of 2-3x. I'd like to understand why.

Here is the code I've used for the Fortran version:

    program MatMulTest
      implicit none
      integer, parameter :: N = 1024
      integer :: i, j, cr, cm
      real*8 :: t0, t1, rate
      real*8 :: A(N,N), B(N,N), C(N,N)

      call random_seed()
      call random_number(A)
      call random_number(B)

      ! First initialize the system_clock
      CALL system_clock(count_rate=cr)
      CALL system_clock(count_max=cm)
      rate = real(cr)
      WRITE(*,*) "system_clock rate: ", rate

      call cpu_time(t0)
      do i = 1, 100, 1
        C=MatMul(A,B)
      end do
      call cpu_time(t1)

      write(unit=*, fmt="(a24,f10.5,a2)") "Average time spent: ", (t1-t0), "ms"
      write(unit=*, fmt="(a24,f10.3)") "First element of C: ", C(1,1)
    end program MatMulTest

Do note that if your system clock rate is not 10000 as in my case, you need to modify the timing calculation accordingly to yield milliseconds.

The Python code:

    import time
    import numpy as np

    def main(N):
        A = np.random.rand(N,N)
        B = np.random.rand(N,N)
        for i in range(100):
            C = np.dot(A,B)
        print C[0,0]

    if __name__ == "__main__":
        N = 1024
        t0 = time.clock()
        main(N)
        t1 = time.clock()
        print "Time elapsed: " + str((t1-t0)*10) + " ms"

And, finally, the MATLAB snippet:

    N=1024;
    A=rand(N,N); B=rand(N,N);
    tic;
    for i=1:100
        C=A*B;
    end
    t=toc;
    disp(['Time elapsed: ', num2str(t*10), ' milliseconds'])

On my system, the results are as follows:

    Fortran: 38.08 ms
    Python:  104.29 ms
    MATLAB:  97.36 ms

CPU use is indistinguishable in all three cases (using a steady 47-49% on an i7-920D0 processor w/ HT enabled for the duration of the calculation). Furthermore, the relative performance stays roughly equal for arbitrary matrix sizes with the exception that for very small matrices (N<80 or so) it is useful to manually disable parallelization in Fortran.

Is there any established reason for the discrepancy here? Am I doing something wrong? I would expect that at least for larger matrices Fortran would have no meaningful advantage in this case.

Answer:

You have two issues here:

- In Python, you time the random initialisation as well as the computation, which you don't do in Fortran and MATLAB.
- In Fortran, you measure CPU time, whereas in Python and MATLAB you measure elapsed time. Since you noticed that the CPU usage is around 47-49%, this might just account for the difference.

Just fix these two things and retry... You might consider using `date_and_time()` rather than `cpu_time()` for that purpose.
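To make the two points concrete, here is a minimal sketch in Python 3 (not the asker's Python 2 environment) of a timing helper that starts the clock only after set-up is done and reports elapsed time and CPU time side by side. The helper name `average_ms` and the stand-in workload are hypothetical, chosen only for illustration; `time.perf_counter()` and `time.process_time()` are standard-library clocks, used here in place of the `time.clock()` call from the question, which is no longer available in current Python 3.

```python
import time


def average_ms(work, repeats=100):
    """Return (wall_ms, cpu_ms) per call of work(), averaged over `repeats` calls.

    `work` is any zero-argument callable (hypothetical helper, for illustration).
    """
    wall0 = time.perf_counter()   # elapsed (wall-clock) time
    cpu0 = time.process_time()    # CPU time used by this process, all threads summed
    for _ in range(repeats):
        work()
    cpu1 = time.process_time()
    wall1 = time.perf_counter()
    wall_ms = (wall1 - wall0) * 1000.0 / repeats
    cpu_ms = (cpu1 - cpu0) * 1000.0 / repeats
    return wall_ms, cpu_ms


if __name__ == "__main__":
    # Set-up happens before the timer starts, so it is not counted.
    data = [float(i) for i in range(200000)]
    # Stand-in workload; the matrix product would go here instead.
    wall_ms, cpu_ms = average_ms(lambda: sum(data), repeats=20)
    print("wall: %.3f ms, cpu: %.3f ms" % (wall_ms, cpu_ms))
```

For the benchmark in the question, one would create `A` and `B` first and then pass something like `lambda: np.dot(A, B)` as `work`. If the two reported numbers differ noticeably, that is a hint that figures taken with different kinds of clocks are not directly comparable.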