Setting Up Python for Science

August 22, 2014

Edit 2018: This document is obsolete - things have changed a lot since 2014. I am leaving this post up, but be aware that it reflects both an outdated view of my personal setup, as well as an ill-founded confidence in my own expertise!

This is about how I set up python. In particular, I think that a language is useless without ready libraries to do almost anything under the sun. I also firmly believe that python is a tool. I don’t care what programming language I am using, as long as I can get things done fast. I want the programming language to get out of my way, and let me focus on the interesting things. Therefore, I think the listed libraries should be default parts of python, and I treat them as such here.

Here are the components of my “bare-bones python installation”:

  • 64 bit python 2.7
    • Why python 2.7 rather than 3.x? As of writing this (2014), some libraries that I frequently use are still not ported to the 3.x line. But things are getting better and better - last time I tried 3.x, I lasted several weeks before switching back to 2 (mechanize is not yet ported). However, almost everything I do works on both python 2 and 3 (just change the print statements).
    • I recommend 64 bit, because a 32 bit python process can address at most 4GB of RAM (and often less in practice). I regularly go over that limit when doing numerical tasks.
  • Numpy
    • How is this not yet in the standard library? Honestly, if doing anything math-related, numpy is a gift from the heavens.
  • Pylab
    • Technically, this refers to the combination of numpy, scipy, and matplotlib, which together enable importing a simplified environment called pylab. There are no words which can adequately describe the utility of those three libraries. If you’ve ever used matlab, you will feel right at home using pylab.
  • IPython
    • This is technically not a library. Whenever I want to try something interactively in python, ipython is the way to go. Ipython is orders of magnitude more useful than the standard python console. It also includes the ipython notebook, which is amazing when doing analysis for projects/classes.
  • Numba
    • Vanilla python is slow. There are alternatives like Cython and pypy, but CPython is the standard, so all libraries are built to work with it. Because of this slowness, the standard workflow is to first write everything in Python, and then rewrite the performance-critical code in C. With Numba, this is done automatically. Just putting @jit before a function makes it compile to machine code transparently. The speedups on loops are enormous, and the time saved is the reason it has earned its place in my “bare-bones” python installation.
  • Theano
    • Math defines my programs. Many of the most important (and most computationally intensive) actions can be put into single equations. That is where Theano comes in. You can define an equation in theano, variables to update, variables to output, and even formulas of which to take a gradient, and theano will create functions which have optimized the computations. If available, theano even uses a GPU for computations on 32 bit floating point. That means that you can get the benefits of CUDA without the complexity involved with learning GPU programming! While originally, theano was meant for optimizing neural networks, its utility extends to any application using numerical calculations. When using theano on math-heavy work, the speed difference between fine-tuned C code, and thrown-together python hackery is effectively negligible - and python can even be faster if you set theano up to work with your GPU.
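To check whether an existing interpreter is the 64 bit build recommended in the list above, something like the following should work. It is a minimal sketch using only the standard library; the function name interpreter_bits is made up for illustration.

```python
import struct
import sys


def interpreter_bits():
    """Return the pointer size of the running interpreter in bits."""
    # struct.calcsize("P") is the size in bytes of a C pointer (void *),
    # which is 4 on a 32 bit build and 8 on a 64 bit build.
    return struct.calcsize("P") * 8


bits = interpreter_bits()
print("Python %d.%d, %d bit" % (sys.version_info[0], sys.version_info[1], bits))
if bits < 64:
    print("Warning: a 32 bit interpreter cannot address more than 4GB of RAM.")
```

The same code runs unchanged on python 2.7 and 3.x.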

Linux

Installing the above-mentioned components is easy on most linux systems. I don’t include details, since if you’re using linux, you probably already know how to use a terminal.

If you use Arch, pacman -S python2-matplotlib python2-ipython python2-scipy should get the basics out of the way, leaving python2-numba-git and python2-theano, which can be found in the AUR. At the time of writing, python2-numba-git is broken, but should be fixed any day now.

On ubuntu or debian start with: sudo apt-get install python-numpy python-scipy python-matplotlib, then use the theano install instructions. I am not sure how to install numba, since I have not used debian-based systems for a couple of years now. Google can probably lead you to instructions.

Once stuff is installed, skip down to “Does it work” to check if stuff is happening the way it is supposed to…

Windows

First things first: you need to get yourself some python! Go to the python website, and download the “Python 2.7.x Windows X86-64 Installer”. I am assuming that you know how to install things. The default settings are by far the best, and changing the installation directory is not recommended. The only thing that might be useful is making sure that “add python to system search PATH” is checked (I forgot the exact words used).

Once python itself is installed, you will want to get the rest of your libraries. Fortunately, www.lfd.uci.edu/~gohlke/pythonlibs/ includes 64 bit installers for all of the libraries we want!

You can choose the installers, making sure that the files are of the form nameoflibrary.win-amd64-py2.7.exe.
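If you are unsure which installer matches your interpreter, python can tell you its own platform tag and version. The sketch below builds the file ending to look for; the helper names are hypothetical, and it assumes the installers keep the naming scheme described above. On 64 bit python 2.7 on windows it should print .win-amd64-py2.7.exe.

```python
import sys
import sysconfig


def expected_installer_suffix():
    """Build the installer file ending that matches the running interpreter."""
    # sysconfig.get_platform() returns "win-amd64" on 64 bit windows python.
    platform_tag = sysconfig.get_platform()
    major, minor = sys.version_info[0], sys.version_info[1]
    return ".%s-py%d.%d.exe" % (platform_tag, major, minor)


def matches_interpreter(filename):
    """Return True if an installer file name ends with the expected suffix."""
    return filename.lower().endswith(expected_installer_suffix().lower())


print("Look for installers ending in: " + expected_installer_suffix())
```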

We’re not done yet. That was the easy part. Next comes actually getting Theano to work with 64 bit python.

Getting Theano to work

The main problem with theano, which is also its main benefit, is that it actually compiles stuff, so you need a 64 bit MinGW build! I myself got it working after reading this, but I had to make several modifications, so I wrote them down here. Also, all the computers with nice GPUs to which I have access use linux, so I cannot give instructions on enabling CUDA support in windows.

Getting the MinGW compiler

The MinGW project provides the standard GNU compilers for C and C++ (the same ones that you use on linux). 64 bit python is being used here, so 64 bit MinGW is needed. The only real annoyance here is that MinGW-64 comes with its own python environment. I ended up installing it with python, and then deleting all of the files associated with python. Lastly, make sure that the bin directory of MinGW-64 (the folder that has gcc.exe) is in your PATH. You do that by editing the PATH environment variable (searching for “environment variables” in the start menu search should get you there).
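A quick way to confirm that the PATH change took effect is to search PATH for gcc from python itself. This is a rough sketch that only uses the standard library; find_on_path is a made-up helper, and it simply checks each PATH entry for gcc or gcc.exe.

```python
import os


def find_on_path(program, path_value=None):
    """Return the full path of the first match for program on PATH, or None."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    for directory in path_value.split(os.pathsep):
        if not directory:
            continue
        # Try the bare name first, then the windows executable name.
        for candidate in (program, program + ".exe"):
            full_path = os.path.join(directory, candidate)
            if os.path.isfile(full_path):
                return full_path
    return None


gcc_path = find_on_path("gcc")
if gcc_path:
    print("gcc found at: " + gcc_path)
else:
    print("gcc is not on PATH")
```

Remember to open a new terminal after editing PATH, since already running programs keep the old value.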

Getting the Python library to work

I didn’t originally realize this, but the 64 bit development files are available here. Theano should work after installing them.

When importing theano, if it gives errors amounting to “not recognizing libpython”, then unfortunately, you might have to generate your own libpython .a file. This amounts to using dlltool. A tutorial can be found here.

Things that make life easy

Now it is time to set up python for easy running. By default, double-clicking on a file ending with .py will run it in the python interpreter. Unfortunately, if your program crashes, the interpreter window also closes, without letting you see the errors. That gets really annoying, especially since in my experience, I spend more time debugging programs than actually running them! The easiest solution involves installing git, which by itself is super useful, but more importantly, during installation you can choose to add Git Bash to your windows explorer context menu. This will allow you to open a command line in the correct directory by right clicking on the background of a folder.

From the “git bash”, you can run python programs simply by typing in python filename.py. You can also get the previous command by using the up arrow key. This will make the error messages easily visible.

Next, add C:/Python27/Scripts to the PATH environment variable. This is the same procedure used earlier to add the MinGW bin directory to PATH.

You might want to log out then log back in for the changes to take effect. Then, open bash (right click on desktop, and the menu should contain Git Bash). Run the following command:

pip install pyreadline

If it works, you’re good to go. If not, you might not have the Python scripts directory in your PATH.
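If pip is not found, python can report where its scripts directory is and whether that directory is on PATH. The following is a small sketch under the assumption that pip lives in the directory reported by sysconfig; scripts_dir_on_path is a hypothetical helper name.

```python
import os
import sysconfig


def scripts_dir_on_path(path_value=None):
    """Return True if this interpreter's scripts directory is listed on PATH."""
    scripts_dir = sysconfig.get_path("scripts")
    if path_value is None:
        path_value = os.environ.get("PATH", "")

    def normalize(entry):
        # Ignore differences in case (on windows), slashes and trailing separators.
        return os.path.normcase(os.path.normpath(entry))

    wanted = normalize(scripts_dir)
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    return any(normalize(entry) == wanted for entry in entries)


print("Scripts directory: " + sysconfig.get_path("scripts"))
print("On PATH: %s" % scripts_dir_on_path())
```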

What text editor should I use?

I didn’t think this section was necessary, but I frequently find people who are new to programming on windows want to use Notepad. Don’t. You want to use Notepad++. In particular, right clicking on a python file gives you the option of editing in notepad++, which is very useful. You can also use IDLE, which comes with default python (right click on a python file, and edit with IDLE is an option). IDLE was my first python editor, but opening files was too slow for me.

If you are a student (students get it free!), or happen to be rich enough to afford Visual Studio, then I highly recommend python tools for visual studio. In fact, since I am still a student, Visual Studio with pytools is my preferred development environment for python on windows.

Does it work?

So you’ve installed all of the libraries. How do you make sure they work? To start off, open bash, and type in ipython. You should get a python terminal. Just by importing everything you’ll be able to tell that things are probably installed correctly:

In [1]: import pylab
In [2]: import theano
In [3]: import numba
In [4]:
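The same check can be scripted, so that one missing library does not hide the status of the others. This is a hedged sketch: check_imports is a made-up helper, and it only reports whether each import succeeds and, where the library exposes it, its __version__.

```python
import importlib


def check_imports(module_names):
    """Try to import each module; map its name to a version or an error note."""
    results = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except Exception as error:
            # A broken install can raise errors other than ImportError.
            results[name] = "MISSING (%s)" % error
        else:
            results[name] = getattr(module, "__version__", "version unknown")
    return results


libraries = ["numpy", "scipy", "matplotlib", "theano", "numba"]
for name, status in sorted(check_imports(libraries).items()):
    print("%-12s %s" % (name, status))
```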

A sample program

Let’s try a program which will actually test if stuff works! Save this as test.py, and run it in bash using python test.py.

from pylab import *
from numba import jit

import theano.tensor as T
from theano import function

#The @jit tells numba to compile the given method. Removing it will show how much numba can speed things up
@jit
def this_takes_forever_without_jit(a):
    z = zeros(1000)
    for i in range(a):
        for j in range(1000):
            z[j] += i + j
    return z

print(this_takes_forever_without_jit(50000))

inp = T.vector("input")

#Creates a theano function, which takes the elementwise sigmoid of the input vector
sample_theano_function = function([inp],T.nnet.sigmoid(inp))

#Creates a vector with values from -10 to 10, going by 0.01
x = arange(-10,10,0.01)

y = sample_theano_function(x)

#Plots the results on a pretty plot!
plot(x,y)
show()

Run it, and see what happens!

If there are no errors, you can get back to doing science!
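The comment in test.py suggests removing @jit to see how much numba helps. A rough way to measure that without editing the file is to time the same function with and without compilation. This is only a sketch: nested_sum and time_call are made-up names, the loop is a simplified scalar version of the one in test.py, and if numba is not installed the code falls back to a do-nothing decorator, in which case both timings will be about the same.

```python
import time

try:
    from numba import jit
except ImportError:
    # Fallback so the script still runs without numba: a do-nothing decorator.
    def jit(func):
        return func


def nested_sum(n):
    """Add up i + j over a nested loop, the kind of code numba speeds up."""
    total = 0.0
    for i in range(n):
        for j in range(1000):
            total += i + j
    return total


# Calling jit directly on the function is equivalent to decorating it with @jit.
compiled_sum = jit(nested_sum)


def time_call(func, arg):
    """Return (result, elapsed seconds) for a single call of func(arg)."""
    start = time.time()
    result = func(arg)
    return result, time.time() - start


compiled_sum(10)  # warm-up call, so compilation time is not measured below
plain_result, plain_seconds = time_call(nested_sum, 2000)
jit_result, jit_seconds = time_call(compiled_sum, 2000)
print("plain python: %.4f s, with jit: %.4f s" % (plain_seconds, jit_seconds))
```

The warm-up call matters because numba compiles a function the first time it is called, so the first call is much slower than later ones.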

Final Thoughts

With the given setup, I have found myself with less reason than ever to use any other programming language. I still use R for some statistical workloads, and I write some stuff in C, but it became clear to me that no other language I know offers such an enormous boost in how much I can get done in a short amount of time.