Building NumPy and SciPy with Intel MKL

NumPy and SciPy are extremely valuable tools for numerical methods in Python. Under the hood, NumPy relies on highly optimized C and Fortran routines, including a set of standardized libraries known as BLAS and LAPACK. BLAS and LAPACK are really just standardized interfaces to a number of matrix routines, and many implementations are available, both free and commercial. ATLAS (Automatically Tuned Linear Algebra Software) is the default BLAS implementation used by many Linux distributions, including Fedora. It is designed to be tuned for specific computer hardware at build time, and can be quite fast when this is done. However, common Linux distributions want their software to run on as many hardware systems as possible and put performance second. As a result, the default versions of ATLAS (and therefore NumPy and SciPy) that ship with these distros are tuned for the general case and are quite slow.

This tutorial describes how to build NumPy and SciPy in a way that is optimized for Intel CPUs that are at least as new as the Core2 architecture (or whichever architecture you choose). It also uses Intel’s version of BLAS and LAPACK included in the Intel Math Kernel Library (MKL). Although MKL is a commercial product, there are a number of options for obtaining MKL free of charge, as described on this page on the Intel web site. The CSU CS department also has an installed version that you may use in /s/parsons/l/sys/intel. Building NumPy and SciPy to use MKL should improve performance significantly and allow you to take advantage of multiple CPU cores when using NumPy and SciPy.

Note: We assume below that the Intel development software is installed in /opt/intel (the default location for a system-wide install). You may need to replace this with the path to your install, possibly in your home directory, or /s/parsons/l/sys/intel for a CS department install.

Building NumPy with MKL

First, let’s focus on NumPy. We’ll need it to build SciPy later anyway. Download the latest version of NumPy from this location. Then, find a suitable location to extract the files. Here is what I do.

mkdir -p ~/src/numpy
cp numpy-1.9.1.tar.gz ~/src/numpy
cd ~/src/numpy
tar -xvf numpy-1.9.1.tar.gz
cd numpy-1.9.1

Now, we need to modify the site.cfg to point NumPy to our MKL installation. First, copy the default and open it in your favorite editor (vim of course!)

cp site.cfg.example site.cfg
vim site.cfg

Now change the following lines.

include_dirs = /opt/intel/mkl/include
library_dirs = /opt/intel/mkl/lib/intel64
mkl_libs = mkl_def, mkl_intel_lp64, mkl_gnu_thread, mkl_core, mkl_mc3
lapack_libs = mkl_def, mkl_intel_lp64, mkl_gnu_thread, mkl_core, mkl_mc3

Note: the order of the MKL libraries above is important. Use the Intel Link Line Advisor if you are having problems or would like to use a different compiler or OpenMP library.
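
A mistyped path in site.cfg usually only shows up later as "MKL not found" in the build output, so it can help to check the directories first. The following is a minimal sketch, not part of NumPy: it assumes the settings above sit under an [mkl] section header (as in site.cfg.example) and that multiple directories are separated by os.pathsep. The helper name check_site_cfg is made up for illustration.

```python
import configparser
import os


def check_site_cfg(text):
    """Return the directories named in the [mkl] section that do not exist.

    Assumes the MKL settings live under an [mkl] section header; the
    function name and return format are illustrative only.
    """
    parser = configparser.RawConfigParser()
    parser.read_string(text)
    missing = []
    for key in ("include_dirs", "library_dirs"):
        value = parser.get("mkl", key, fallback="")
        # Multiple directories are assumed to be separated by os.pathsep
        for directory in filter(None, value.split(os.pathsep)):
            if not os.path.isdir(directory):
                missing.append(directory)
    return missing
```

To use it, read site.cfg from the NumPy source directory and pass its contents to check_site_cfg; an empty list means every listed directory exists.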

We’ll also need to set our LD_LIBRARY_PATH so that the MKL libraries can be found at run time. This should be added to your .bashrc (or whichever bash init scripts you use).

export LD_LIBRARY_PATH=/opt/intel/mkl/lib/intel64:${LD_LIBRARY_PATH}
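
After opening a new shell, a quick way to confirm that the export took effect is to look for the MKL directory in the environment from Python. This is only a sketch; the helper name on_library_path is hypothetical, and the path is the /opt/intel prefix assumed throughout this tutorial.

```python
import os


def on_library_path(directory, env=None):
    """Return True if `directory` is an entry in LD_LIBRARY_PATH.

    `env` defaults to os.environ; passing a plain dict makes the check
    easy to try out without touching the real environment.
    """
    env = os.environ if env is None else env
    entries = env.get("LD_LIBRARY_PATH", "").split(os.pathsep)
    wanted = os.path.normpath(directory)
    return any(os.path.normpath(entry) == wanted for entry in entries if entry)


# /opt/intel is the install prefix assumed in this tutorial
print(on_library_path("/opt/intel/mkl/lib/intel64"))
```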

Next, let’s set up some flags for gcc that will make the C routines included with NumPy a bit faster.

export CFLAGS='-fopenmp -O2 -march=core2 -ftree-vectorize'
export LDFLAGS='-lm -lpthread -lgomp'
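
Before committing to -march=core2, it may be worth confirming that the build machine supports the instruction sets that flag switches on. The sketch below assumes a Linux x86 system where /proc/cpuinfo lists CPU features on a "flags" line. The flag set reflects my reading of gcc's description of core2 (MMX, SSE, SSE2, SSE3 and SSSE3), and the helper name is made up for illustration.

```python
# Instruction-set flags implied by -march=core2, spelled the way they
# appear in /proc/cpuinfo (SSE3 is reported there as "pni").
CORE2_FLAGS = {"mmx", "sse", "sse2", "pni", "ssse3"}


def missing_core2_flags(cpuinfo_text):
    """Return the core2 flags that the first CPU in cpuinfo_text lacks."""
    for line in cpuinfo_text.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == "flags":
            return CORE2_FLAGS - set(value.split())
    # No "flags" line at all: report everything as missing
    return set(CORE2_FLAGS)
```

Passing the contents of /proc/cpuinfo to missing_core2_flags should return an empty set on a machine that can run code built with these flags.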

The -march=core2 flag tells gcc that it can use features that are included in core2 and newer architectures, which breaks compatibility with older CPUs and with AMD CPUs. This means that this version of NumPy may not work correctly on old hardware or AMD hardware. However, nearly everything in the CS department should work fine. The -ftree-vectorize flag tells gcc to perform automatic SSE vectorization for our core2 architecture. -fopenmp and -lgomp enable OpenMP functionality wherever it may be used.

With the --user flag used below, Python distutils will install our software in ~/.local. If, like me, you prefer to install your software somewhere else, you will also need to set up the PYTHONPATH environment variable in your .bashrc. Next, let’s build it!

# for python2.7
python setup.py build 2>&1 | less

# for python3
#python3 setup.py build 2>&1 | less

Note: If you re-build for any reason (including for python3) you will need to clean up with python setup.py clean --all

In the output that is piped to less, you should see something like this telling you that MKL was found:

blas_mkl_info:
FOUND:
library_dirs = ['/opt/intel/mkl/lib/intel64']
include_dirs = ['/opt/intel/mkl/include']
libraries = ['mkl_def', 'mkl_intel_lp64', 'mkl_gnu_thread', 'mkl_core', 'mkl_mc3', 'pthread']
define_macros = [('SCIPY_MKL_H', None)]

FOUND:
library_dirs = ['/opt/intel/mkl/lib/intel64']
include_dirs = ['/opt/intel/mkl/include']
libraries = ['mkl_def', 'mkl_intel_lp64', 'mkl_gnu_thread', 'mkl_core', 'mkl_mc3', 'pthread']
define_macros = [('SCIPY_MKL_H', None)]
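
Scrolling through the output in less works, but if the build output is saved to a file instead, the same check can be scripted. The following is a rough sketch under that assumption; section_found is a hypothetical helper, and it relies only on the section layout shown above (a "..._info:" header followed by FOUND:).

```python
def section_found(log_text, section="blas_mkl_info"):
    """Return True if `section:` is followed by a FOUND: line in the log.

    The scan stops at the next line that looks like another "..._info:"
    header, so a FOUND: belonging to a later section is not counted.
    """
    in_section = False
    for raw in log_text.splitlines():
        line = raw.strip()
        if line == section + ":":
            in_section = True
        elif in_section and line.endswith("_info:"):
            in_section = False
        elif in_section and line == "FOUND:":
            return True
    return False
```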

Next, let’s install the software

# for python2.7 default location
python setup.py install --user 2>&1 | less

# for python3 default location
#python3 setup.py install --user 2>&1 | less

To verify that everything worked, check the following two things. First, we should verify that we are able to load the correct version of NumPy from the ipython prompt. If you start ipython, run import numpy as np and then ?np, you should see something like this:

/s/chopin/b/grad/idfah/.local/lib/python3.3/site-packages/numpy/__init__.py
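
The same check can be done without ipython. This sketch uses the standard library's importlib to report which file a module would be loaded from; module_origin is a made-up helper name, and the function returns None when the module cannot be found at all.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Should point into ~/.local/... if the --user install is the one being picked up
print(module_origin("numpy"))
```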

Second, ldd should show that NumPy is using the MKL BLAS:

# for python 2.7
ldd ~/.local/lib/python2.7/site-packages/numpy/core/_dotblas.so
linux-vdso.so.1 => (0x00007fff7e7fe000)
libm.so.6 => /lib64/libm.so.6 (0x00007fe15e6ed000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fe15e4cf000)
libgomp.so.1 => /lib64/libgomp.so.1 (0x00007fe15e2b8000)
libmkl_def.so => /opt/intel/mkl/lib/intel64/libmkl_def.so (0x00007fe15ccd4000)
libmkl_intel_lp64.so => /opt/intel/mkl/lib/intel64/libmkl_intel_lp64.so (0x00007fe15c5b0000)
libmkl_gnu_thread.so => /opt/intel/mkl/lib/intel64/libmkl_gnu_thread.so (0x00007fe15ba54000)
libmkl_core.so => /opt/intel/mkl/lib/intel64/libmkl_core.so (0x00007fe15a526000)
libmkl_mc3.so => /opt/intel/mkl/lib/intel64/libmkl_mc3.so (0x00007fe15887a000)
libpython2.7.so.1.0 => /lib64/libpython2.7.so.1.0 (0x00007fe1584b4000)
libc.so.6 => /lib64/libc.so.6 (0x00007fe1580f6000)
/lib64/ld-linux-x86-64.so.2 (0x00007fe15ec54000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007fe157ef1000)
libutil.so.1 => /lib64/libutil.so.1 (0x00007fe157cee000)

# for python3
ldd ~/.local/lib/python3.3/site-packages/numpy/core/_dotblas.cpython-33m.so
linux-vdso.so.1 => (0x00007fff02dfe000)
libm.so.6 => /lib64/libm.so.6 (0x00007ffcfd574000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007ffcfd356000)
libgomp.so.1 => /lib64/libgomp.so.1 (0x00007ffcfd13f000)
libmkl_def.so => /opt/intel/mkl/lib/intel64/libmkl_def.so (0x00007ffcfbb5b000)
libmkl_intel_lp64.so => /opt/intel/mkl/lib/intel64/libmkl_intel_lp64.so (0x00007ffcfb437000)
libmkl_gnu_thread.so => /opt/intel/mkl/lib/intel64/libmkl_gnu_thread.so (0x00007ffcfa8db000)
libmkl_core.so => /opt/intel/mkl/lib/intel64/libmkl_core.so (0x00007ffcf93ad000)
libmkl_mc3.so => /opt/intel/mkl/lib/intel64/libmkl_mc3.so (0x00007ffcf7701000)
libpython3.3m.so.1.0 => /lib64/libpython3.3m.so.1.0 (0x00007ffcf728a000)
libc.so.6 => /lib64/libc.so.6 (0x00007ffcf6ecc000)
/lib64/ld-linux-x86-64.so.2 (0x00007ffcfdada000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007ffcf6cc7000)
libutil.so.1 => /lib64/libutil.so.1 (0x00007ffcf6ac4000)
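
Reading ldd output by eye gets tedious when checking several machines. As a sketch only, the following wraps ldd and keeps just the libmkl_* lines; both function names are hypothetical, and run_ldd assumes ldd is on the PATH.

```python
import subprocess


def mkl_libraries(ldd_output):
    """Map each libmkl_* name in ldd output to the path it resolved to."""
    resolved = {}
    for line in ldd_output.splitlines():
        name, arrow, rest = line.strip().partition(" => ")
        if arrow and name.startswith("libmkl_"):
            # rest looks like "/path/to/lib.so (0x...)" or "not found"
            resolved[name] = rest.split(" (")[0]
    return resolved


def run_ldd(path):
    """Run ldd on a shared object and return its standard output as text."""
    result = subprocess.run(["ldd", path], stdout=subprocess.PIPE,
                            universal_newlines=True, check=True)
    return result.stdout
```

Calling mkl_libraries(run_ldd(path)) on the _dotblas shared object shown above should list the five MKL libraries. An empty result, or entries marked "not found", suggests the build or LD_LIBRARY_PATH needs another look.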

Note: By default, MKL does not always free allocated memory buffers. For long-running experiments, this can cause MKL to use all available memory and begin swapping, even for small problem sizes. In order to prevent this memory leakage, set export MKL_DISABLE_FAST_MM=true in your .bashrc or relevant startup script.
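
For scripts launched from environments that do not read .bashrc (cron jobs or batch schedulers, for example), the same variable can be set from Python. This sketch assumes MKL reads the variable when the library is first loaded, so it has to run before numpy is imported; treat that ordering as an assumption rather than a documented guarantee. The helper name is made up for illustration.

```python
import os


def disable_mkl_fast_mm(env=None):
    """Set MKL_DISABLE_FAST_MM unless the environment already defines it."""
    env = os.environ if env is None else env
    env.setdefault("MKL_DISABLE_FAST_MM", "true")
    return env["MKL_DISABLE_FAST_MM"]


# Assumption: this must run before numpy (and therefore MKL) is imported
disable_mkl_fast_mm()
```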

Building SciPy to use our optimized NumPy and MKL

Before building SciPy, it is important that you follow the steps above to build the optimized NumPy and keep all of the environment variables from above! SciPy relies heavily on NumPy and should use our optimized NumPy. Once you are ready, download the latest SciPy source code from here. Again, extract the files to a suitable location.

mkdir -p ~/src/scipy
cp scipy-0.14.0.tar.gz ~/src/scipy
cd ~/src/scipy
tar -xvf scipy-0.14.0.tar.gz
cd scipy-0.14.0

Add the -shared flag for gcc. This is a workaround for now and might not be needed in the future.

export LDFLAGS="${LDFLAGS} -shared"

Now we can build SciPy

# python2.7
python setup.py build 2>&1 | less

# python3
#python3 setup.py build 2>&1 | less

Again, it should let you know in less that MKL was found:

blas_mkl_info:
FOUND:
libraries = ['mkl_def', 'mkl_intel_lp64', 'mkl_gnu_thread', 'mkl_core', 'mkl_mc3', 'pthread']
library_dirs = ['/opt/intel/mkl/lib/intel64']
define_macros = [('SCIPY_MKL_H', None)]
include_dirs = ['/opt/intel/mkl/include']

Finally, install SciPy with

# for python2.7
python setup.py install --user 2>&1 | less

# for python3
#python3 setup.py install --user 2>&1 | less

Benchmarks

To demonstrate the potential performance improvement of using MKL, here are some quick IPython timings. These timings were done on Fedora 23 using an HP Z800 with 12 Xeon E5645 cores (2 x 6 x 2.4 GHz) and 96 GB of memory.

First, here are the timings for a 4096x4096x4096 matrix multiply and a 2048x2048 SVD using the default install of numpy linked against ATLAS:

> %timeit np.random.random((4096,4096)).dot(np.random.random((4096,4096)))
1 loops, best of 3: 19.5 s per loop

> %timeit np.linalg.svd(np.random.random((2048,2048)))
1 loops, best of 3: 13 s per loop

And, here are the same timings using numpy linked against MKL:

> %timeit np.random.random((4096,4096)).dot(np.random.random((4096,4096)))
1 loops, best of 3: 2.23 s per loop

> %timeit np.linalg.svd(np.random.random((2048,2048)))
1 loops, best of 3: 3.48 s per loop
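
To repeat these measurements outside of IPython, something like the following standalone script could be used. It is a sketch that mimics "best of 3" with time.perf_counter rather than a reproduction of %timeit, and the function names are made up for illustration. Like the lines above, each timing includes the cost of generating the random inputs.

```python
import time

import numpy as np


def best_of(func, repeats=3):
    """Return the fastest wall-clock time, in seconds, over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def matmul(n):
    # Multiply two random n x n matrices
    return np.random.random((n, n)).dot(np.random.random((n, n)))


def svd(n):
    # Singular value decomposition of a random n x n matrix
    return np.linalg.svd(np.random.random((n, n)))


def run(matmul_size=4096, svd_size=2048):
    """Print best-of-3 timings for the two benchmarks used in this section."""
    print("matmul %d: %.2f s" % (matmul_size, best_of(lambda: matmul(matmul_size))))
    print("svd %d: %.2f s" % (svd_size, best_of(lambda: svd(svd_size))))
```

Calling run() with the default sizes matches the problem sizes above; smaller sizes are handy for a quick smoke test.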

Feel free to let me know if it is faster for your problems!