A Python Concurrency Tutorial

By Edward D'Souza (June 14th, 2020)


I've always been afraid of concurrency in Python. Threading isn't supposed to be a good idea because of the GIL. Multi-processing sounds complicated. In this article I look at what happens if you try concurrency with a simple, computationally intensive task. I found that parallelizing tasks with multiple processes is both effective and surprisingly simple in Python.

(Note: Code requires Python 3.8)

Sections

- Example Problem
- Basic algorithm
- Single-threaded performance
- Multi-threaded performance
- Multi-process performance

Example Problem

We are going to explore concurrency in Python using this example problem:

Given a set of 2D points ({(x0, y0), (x1, y1), etc.}), find the closest pair.

This problem has an O(n log n) solution, but to keep things simple, we're going to use the naive O(n^2) solution.

Basic algorithm

Before getting into concurrency, let's start with the basic solution to this problem.

To make things more flexible later, we're going to structure it as a function that takes in two sets of points ("from" and "to") and finds the closest pair of points between the two sets. To find the closest pair within an entire set of points, we can pass in the same set of points to both arguments.

main.py
from algorithm import closest_pair

points = [(0,1), (9, 10), (2, 3)]
print(closest_pair(points, points))  # --> ((0, 1), (2, 3))
algorithm.py
import math
from random import randrange, seed


def closest_pair(from_set, to_set):
    """Return the closest pair (a, b) with a taken from from_set and b from to_set."""
    best_pair = None
    best_distance = math.inf

    for a in from_set:
        for b in to_set:
            if a != b:
                # math.dist and the walrus operator both require Python 3.8+.
                if (ab_distance := math.dist(a, b)) < best_distance:
                    best_pair = (a, b)
                    best_distance = ab_distance

    return best_pair


NUM_POINTS = 10_000


def generate_points(num_points=NUM_POINTS):
    seed(0)  # Fixed seed so every run works on the same set of points.
    return [(randrange(0, 100), randrange(0, 100)) for _ in range(num_points)]

Single-threaded performance

As a baseline, let's look at how long it takes the algorithm to complete using a single thread. Note that I'm using NUM_POINTS = 10,000 for all the examples to make the problem big enough to measure reliably on my machine.

single_thread1.py
from algorithm import closest_pair, generate_points

points = generate_points()
closest = closest_pair(points, points)
print(closest)
single_thread1.py results
$ time python single_thread1.py
((49, 97), (49, 96))
python single_thread1.py  16.00s user 0.06s system 99% cpu 16.096 total

Multi-threaded performance

For a CPU-bound task like this, the GIL means there shouldn't be any benefit to using multiple threads. But we're going to do it anyway just to see what happens. The plan: split the "from" points in half, let each thread compare its half against the full set of points, and then take the better of the two results.

multi_thread1.py
import math
import queue
import threading

from algorithm import closest_pair, generate_points

points = generate_points()

q = queue.Queue()
x1 = threading.Thread(
    target=lambda *args: q.put(closest_pair(*args)),
    args=(points[:len(points) // 2], points),
)
x2 = threading.Thread(
    target=lambda *args: q.put(closest_pair(*args)),
    args=(points[len(points) // 2:], points),
)
x1.start()
x2.start()
x1.join()
x2.join()

res1 = q.get()
res2 = q.get()

closest = min([res1, res2], key=lambda res: math.dist(res[0], res[1]))
print(closest)
multi_thread1.py results
$ time python multi_thread1.py
((49, 97), (49, 96))
python multi_thread1.py  16.43s user 0.15s system 99% cpu 16.584 total

As expected, there was no speed-up with multiple threads.
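
As an aside, the same two-way split can be written a bit more compactly with concurrent.futures.ThreadPoolExecutor from the standard library. The sketch below (a hypothetical multi_thread2.py, not one of the timed runs above) is equivalent to multi_thread1.py; since the work is still CPU-bound Python running under the GIL, I'd expect it to take roughly the same amount of time.

multi_thread2.py (sketch)
import math
from concurrent.futures import ThreadPoolExecutor

from algorithm import closest_pair, generate_points

points = generate_points()
half = len(points) // 2

with ThreadPoolExecutor(max_workers=2) as executor:
    # Each thread compares one half of the "from" points against all points.
    futures = [
        executor.submit(closest_pair, points[:half], points),
        executor.submit(closest_pair, points[half:], points),
    ]
    results = [f.result() for f in futures]

closest = min(results, key=lambda res: math.dist(res[0], res[1]))
print(closest)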

Multi-process performance

Our final move is to use multiple processes. Because each process gets its own Python interpreter, and therefore its own GIL, CPU-bound work can genuinely run in parallel. The multi-threading code above can be adapted with minimal changes to use multiple processes instead.

Initial Solution

multi_proc1.py
import math
from multiprocessing import Process, Queue
from algorithm import closest_pair, generate_points


def doit(q, *args):
    q.put(closest_pair(*args))


if __name__ == "__main__":
    q = Queue()

    points = generate_points()

    x1 = Process(target=doit, args=(q, points[: len(points) // 2], points))
    x2 = Process(target=doit, args=(q, points[len(points) // 2 :], points))
    x1.start()
    x2.start()
    x1.join()
    x2.join()

    res1 = q.get()
    res2 = q.get()

    closest = min([res1, res2], key=lambda res: math.dist(res[0], res[1]))
    print(closest)
multi_proc1.py results
$ time python multi_proc1.py
((49, 97), (49, 96))
python multi_proc1.py  16.69s user 0.10s system 196% cpu 8.540 total

Here we see close to the ideal 2x speed-up from splitting the work across two processes.

Maxing out the parallelism

We can clean up the code a little with multiprocessing.Pool and bump the level of parallelism to match the system's CPU count.

multi_proc2.py
import itertools
import multiprocessing
import math

from algorithm import closest_pair, generate_points


def segment_list(l, N):
    # Split l into N chunks: the first chunk absorbs the remainder,
    # and the rest come from the standard zip/iter grouping idiom.
    chunk_size, remainder = divmod(len(l), N)
    first, rest = l[: chunk_size + remainder], l[chunk_size + remainder :]
    return itertools.chain([first], zip(*[iter(rest)] * chunk_size))


def do_it(segment, points):
    return closest_pair(segment, points)


if __name__ == "__main__":
    N = multiprocessing.cpu_count()  # 8 for me.

    points = generate_points()
    segments = list(segment_list(points, N))
    proc_args = list(zip(segments, [points] * N))

    with multiprocessing.Pool(processes=N) as pool:
        results = pool.starmap(do_it, proc_args)

    closest = min(results, key=lambda res: math.dist(res[0], res[1]))
    print(closest)
multi_proc2.py results
$ time python multi_proc2.py
((49, 97), (49, 96))
python multi_proc2.py  35.08s user 0.22s system 759% cpu 4.650 total

Note that I only have 4 physical cores. cpu_count() returns 8 on my system because of hyperthreading, but since every process is running the exact same CPU-bound function, I probably don't see much benefit beyond 4 parallel tasks.
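
The standard library doesn't directly distinguish physical cores from logical ones. If you wanted the pool size to match physical cores, one option (a sketch, assuming the third-party psutil package is installed; it isn't used anywhere else in this article) looks like this:

core_count.py (sketch)
import multiprocessing

import psutil  # third-party: pip install psutil

print(multiprocessing.cpu_count())      # logical CPUs (8 on my machine)
print(psutil.cpu_count(logical=False))  # physical cores (4 on my machine)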

Given all that, reducing the total execution time to ~29% of the single-threaded solution is pretty good!
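
As a final aside, the same fan-out can also be written with concurrent.futures.ProcessPoolExecutor, which wraps multiprocessing behind the same interface used by ThreadPoolExecutor. This is just a sketch of an alternative spelling of multi_proc2.py (importing segment_list from it); I haven't benchmarked it, but I'd expect performance comparable to the multiprocessing.Pool version.

multi_proc3.py (sketch)
import math
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

from algorithm import closest_pair, generate_points
from multi_proc2 import segment_list


def do_it(segment, points):
    return closest_pair(segment, points)


if __name__ == "__main__":
    N = multiprocessing.cpu_count()

    points = generate_points()
    segments = list(segment_list(points, N))

    with ProcessPoolExecutor(max_workers=N) as executor:
        # map() accepts multiple iterables, so each worker gets one segment
        # plus the full list of points, just like starmap in multi_proc2.py.
        results = list(executor.map(do_it, segments, [points] * N))

    closest = min(results, key=lambda res: math.dist(res[0], res[1]))
    print(closest)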