Accelerating Python Libraries with Numba (Part 1)

Posted Tuesday, May 7, 2013


Welcome. This post is part of a series of Continuum Analytics Open Notebooks showcasing our projects, products, and services.

In this Continuum Open Notebook, you’ll see how Numba accelerates the performance of the GrowCut image segmentation window function by three orders of magnitude in two lines of Python.

Continuum Analytics Open Notebooks are free from copyright restrictions; you may reuse the text, images, and code in this open notebook without any restriction from us, although we always appreciate attribution. See the license at the bottom of this notebook for more information.

If you are reading the blog form of this notebook, you can run the code yourself on our cloud-based Python-in-the-browser app, Wakari. Wakari gives you a full Scientific Python stack, right from your browser, and allows you to write and share your own IPython Notebooks. Sign up for free here, and you will be able to edit and run the code in this notebook from your browser.


Today, we are showcasing Numba, a Continuum Analytics-sponsored open source project, and demonstrating how it can accelerate your code to C-like speeds through the use of a just-in-time specializing compiler. Don’t worry if you don’t know what that is; our computer scientists assure us that no kittens are harmed in the making of Numba code. If you are new to the Python community, you may not be familiar with similar frameworks, such as Weave, Cython, or NumExpr. If you are familiar with these tools, but not yet familiar with Numba, you might be wondering what Numba offers that these other tools don’t. Here are three of its most important distinguishing features:

  • Numba is a compiler - It leverages the LLVM Compilation Framework to parse, compile to, and optimize assembly code, in the same manner as the fastest compiled languages such as C and Fortran.

  • Numba is Python - Although Numba utilizes very powerful libraries beneath Python for performance, the code you write is always pure Python, making it easier to author, debug, verify, and maintain.

  • Accelerating a function in Numba can be done with two lines of code - You provide a numpy array-based Python (currently a subset of the full language, but we’re working on it!) function, import Numba’s autojit method, then either annotate it with the @autojit decorator or create the accelerated function by calling autojit directly.
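
For concreteness, here is a minimal sketch of those two lines in action on a toy function. The function name and the fallback are our own hypothetical additions, not part of Numba; the fallback simply lets the sketch run even where Numba isn’t installed.

```python
import numpy as np

def sum2d(arr):
    # A deliberately loop-heavy function: exactly the kind of
    # numpy array-based code Numba is designed to speed up.
    total = 0.0
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            total += arr[i, j]
    return total

try:
    # The two lines described above:
    from numba import autojit
    fast_sum2d = autojit()(sum2d)
except ImportError:
    # Hypothetical fallback so this sketch runs without Numba installed
    fast_sum2d = sum2d

print(fast_sum2d(np.ones((3, 3))))  # 9.0
```

Equivalently, you can annotate `sum2d` with the `@autojit` decorator at the definition site; the two forms produce the same specialized function.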

If you’d like to learn more about Numba, or if awesome image processing examples bore you, feel free to browse the documentation, or, if you’d prefer, head right to the code on GitHub.


First, we’ll set up the open notebook for inline plotting, import our accelerated Numba implementation of GrowCut along with several other packages, and finally, set up a default plot size.

%pylab inline
# the growcut_numba module is available from
# as
import growcut_numba
import numpy as np
from matplotlib import pyplot as plt
from skimage import io
import time
# Adjust the SIZEX and SIZEY variables here if you'd
# like larger or smaller images.
SIZEX, SIZEY = 14, 14
plt.rcParams['figure.figsize'] = (SIZEX, SIZEY)


You may be the sort of person who doesn’t buy the prettiest car on the lot, or the one with the 600 MHz satellite-assisted user interface. You might instead want a car that feels good on a test drive. Great! Let’s see how Numba code actually feels and performs for an application in image processing. In this notebook, we will look at the GrowCut algorithm, which performs image segmentation using a two-dimensional window function. Don’t worry if you’re not familiar with image segmentation; here is the basic idea.

Image Segmentation is the process of partitioning the pixels of an image into two or more sets, usually as a means of distinguishing objects in the image for further processing or analysis. Here’s an example of some cute kittens and their mother being segmented from the background.

Original image

We downloaded this cute image of a mother cat and her kitties by Artur Andzej from Wikimedia Commons.

# Notice how easy it is to read and plot images using scikit-image!
kittehs = io.imread('kittehs.jpg')
plt.imshow(kittehs)
# This last command turns the x/y axis off;
# comment it out to see pixel ranges
plt.axis('off')


GrowCut is an assisted algorithm, which means we need to manually segment part of the image ourselves to seed it. We could use a machine learning tool to cluster similar pixels, but for this demonstration we’ll hand-label seed sections of the image as interesting (kitteh!) and non-interesting (grass).

# labels and strengths are numpy arrays of the same height
# and width as the original image (but no channel dimension).
# Implicitly, a 0 value is an undetermined label
labels = np.zeros(kittehs.shape[0:2], dtype=np.int64)
grass = -1
kitteh = 1
# We mark the boundaries of the image as grassy
labels[0:150, :] = grass
labels[750:, :] = grass
labels[:, 0:100] = grass
labels[:, 1150:] = grass
# A rectangular chunk in the middle as kitteh
labels[300:550, 400:1000] = kitteh
# And an otherwise easy to miss kitteh tail chunk
labels[575:600, 100:140]  = kitteh
# Unlabeled pixels get a default strength of 0, so any
# labeled neighbor can eventually claim them.
strength = np.zeros_like(labels, dtype=np.float64)
# Then set the strength of all labeled values to 1.
strength[np.nonzero(labels)] = 1.0
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(SIZEX, SIZEX/2))
ax1.imshow(labels, vmin=-1, vmax=1, cmap='gray')
ax1.set_title('Seed label regions (black=grass, gray=unlabeled, white=kitteh)')
ax2.imshow(kittehs)
ax2.contour(labels, colors='r')
ax2.set_title('Seed label region boundaries on original image');

We’ll time 50 iterations of our accelerated GrowCut algorithm with our input image, labels, and label strengths. The algorithm works much like a cellular automaton: at each iteration, already-labeled pixels pass their labels on to sufficiently similar neighbors. Yes, go back and re-read that sentence, or skip down to the code below.
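
Before running the real thing, here is a hypothetical, stripped-down sketch of that update rule in one dimension. The names, the immediate-neighbor window, and the single-channel similarity measure are our own simplifications for illustration, not the actual growcut_numba code.

```python
import numpy as np

def grow_step(values, labels, strength):
    # One cellular-automaton step: each cell is "attacked" by its
    # neighbors, and an attack succeeds when
    # similarity * neighbor_strength exceeds the cell's own strength.
    new_labels = labels.copy()
    new_strength = strength.copy()
    n = len(values)
    for i in range(n):
        for j in (i - 1, i + 1):          # immediate neighbors only
            if 0 <= j < n:
                g = 1.0 - abs(values[i] - values[j])  # similarity in [0, 1]
                attack = g * strength[j]
                if attack > new_strength[i]:
                    new_strength[i] = attack
                    new_labels[i] = labels[j]
    return new_labels, new_strength

values = np.array([0.9, 0.85, 0.1])       # pixel intensities
labels = np.array([1, 0, -1])             # 0 = unlabeled
strength = np.array([1.0, 0.0, 1.0])
new_labels, new_strength = grow_step(values, labels, strength)
print(new_labels)  # the unlabeled middle cell is claimed by its similar neighbor
```

After one step, the middle cell takes label 1 from its very similar left neighbor, while the dissimilar right neighbor’s weak attack fails. Iterating this step is what grows the seed regions outward.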

If you’re following along on Wakari, this computation is going to take about a minute. So feel free to take that extra time browsing your favorite social media news site. We’ll keep the computations going while you’re in the other tab looking at pictures of cats.

tic = time.time()
growcut_labels = growcut_numba.growcut(kittehs, np.dstack((labels, strength)),
                         window_size=10, max_iter=50)
toc = time.time() - tic
print "Time elapsed: %d seconds" % toc
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Time elapsed: 41 seconds
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(SIZEX, SIZEX/2))
ax1.imshow(growcut_labels, interpolation='nearest', cmap='gray', vmin=-1, vmax=1)
ax1.set_title('GrowCut label regions (black=grass, gray=unlabeled, white=kitteh)')
ax2.imshow(kittehs)
ax2.contour(growcut_labels, colors='r')
ax2.set_title('GrowCut label region boundaries on original image');

# Create a copy of the original image
masked_kittehs = kittehs.copy()
# Then create a Boolean ndarray with values = False (kitteh) and True otherwise
bool_mask = growcut_labels != kitteh
# Now use numpy fancy indexing to set all True (non-kitteh) regions to 0
masked_kittehs[bool_mask, :] = 0
plt.imshow(masked_kittehs)

The untuned results aren’t perfect, but it’s easy to see that this is a useful algorithm, especially given the simplicity of its implementation, which we’ll discuss next.


GrowCut uses a cellular automata-based approach to segmenting an image. At the core of the algorithm is a window function that iterates over the image data. Typical approaches to vectorization here fail miserably due to the complicated inner kernel, and require an unreasonable amount of memory without substantial performance gains.

Native Python

import math

sqrt_3 = math.sqrt(3.0)

# Helpers to clamp the window to the image boundaries
def window_floor(idx, radius):
    return max(0, idx - radius)

def window_ceil(idx, ceil, radius):
    return min(ceil, idx + radius)

def kernel(image, state, state_next, window_radius):
    changes = 0
    height = image.shape[0]
    width = image.shape[1]
    for j in xrange(width):
        for i in xrange(height):
            winning_colony = state[i, j, 0]
            defense_strength = state[i, j, 1]
            for jj in xrange(window_floor(j, window_radius),
                             window_ceil(j+1, width, window_radius)):
                for ii in xrange(window_floor(i, window_radius),
                                 window_ceil(i+1, height, window_radius)):
                    if (ii == i and jj == j):
                        continue
                    # Squared Euclidean distance between the two pixels'
                    # RGB values, used as the similarity measure
                    d = image[i, j, 0] - image[ii, jj, 0]
                    s = d * d
                    for k in range(1, 3):
                        d = image[i, j, k] - image[ii, jj, k]
                        s += d * d
                    gval = 1.0 - math.sqrt(s)/sqrt_3
                    attack_strength = gval * state[ii, jj, 1]
                    if attack_strength > defense_strength:
                        defense_strength = attack_strength
                        winning_colony = state[ii, jj, 0]
                        changes += 1
            state_next[i, j, 0] = winning_colony
            state_next[i, j, 1] = defense_strength
    return changes

Python with Numba

This is all of the code required to accelerate the Python code with Numba. Notice we don’t even need to modify the original function! Accelerating this code maintains the native Python code’s slim memory profile with very little coding effort.

from numba import autojit
numba_kernel = autojit()(kernel)

You might be wondering why we started with the Numba-accelerated GrowCut kernel instead of the native Python one. Let’s time a single iteration of the native GrowCut code vs. a single iteration of the Numba GrowCut code to find out why. (SPOILER: You might want to come back in about ten minutes.)

tic = time.time()
growcut_labels = growcut_numba.growcut(kittehs, np.dstack((labels, strength)), window_size=10, max_iter=1)
numba_time = time.time() - tic
print "Single Numba iteration: %e seconds" % numba_time
tic = time.time()
# The pure-Python implementation lives alongside the Numba one;
# here we assume it is importable as growcut_python.
growcut_labels = growcut_python.growcut(kittehs, np.dstack((labels, strength)), window_size=10, max_iter=1)
native_time = time.time() - tic
print "Single native Python iteration: %e seconds" % native_time
print "Numba speedup: %d" % (native_time/numba_time)
Single Numba iteration: 8.518391e-01 seconds
Single native Python iteration: 5.961966e+02 seconds
Numba speedup: 699

Of course, we need to run many iterations of GrowCut, not just one. We saw earlier that fifty iterations of the Numba code run in under a minute. We can approximate the cost of fifty native Python iterations by multiplying the single-iteration cost: if one native iteration takes roughly 10 minutes, fifty iterations will take more than 8 hours. This is the power of Numba: by adding two lines of code to your file, you can cut a computation from 8 hours to under a minute.
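
For the skeptical, the back-of-envelope arithmetic behind that estimate, using the timings measured above:

```python
# Timings measured in the cells above
native_single = 596.1966   # seconds for one native Python iteration
numba_fifty = 41.0         # seconds for all fifty Numba iterations

# Estimated cost of fifty native iterations, in hours
est_native_fifty_hours = native_single * 50 / 3600.0
print(est_native_fifty_hours)  # ≈ 8.28 hours

# Effective end-to-end speedup over the full fifty-iteration run
speedup = (native_single * 50) / numba_fifty
```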


Some of you are very excited by this. We at Continuum certainly are. We love Python; it’s a beautiful language, and until now we’ve been willing to put up with the pain of interfacing to other languages and other techniques for getting assembly-level performance. Now, we don’t have to.

Some of you are a bit disappointed. You just spent all this time staring at pictures of cats and studying code to learn about GrowCut, and a speedup of three orders of magnitude comes with two lines of code? You want to understand why the Python code was so slow, see how other tools similar to Numba perform on this problem with charts and graphics, and learn how to use Numba in your own code and libraries. You might even be interested in learning how the professional version of Numba, NumbaPro, part of our Anaconda Accelerate product, can accelerate code on GPUs as well. Great! Stay tuned to this blog for more open notebooks in this series, where we’ll explore all these topics in more detail.


This notebook features the GrowCut algorithm, which was introduced in GrowCut - Interactive Multi-Label N-D Image Segmentation By Cellular Automata by V. Vezhnevets and V. Konouchine. Credit for the original native Python code is due to Dr. Nathan Faggian and Dr. Stefan van der Walt.

To the extent possible under law, Continuum Analytics has waived all copyright and related or neighboring rights to this article, Accelerating Python Libraries with Numba - Part 1.