By default, Python’s runtime executes in a single thread, with traffic directed by the Global Interpreter Lock (GIL). Most of the time this isn’t a significant bottleneck, but it becomes one when you want to run many jobs in parallel.

Python provides two ways to work around this issue: threading and multiprocessing. Each approach allows you to break a long-running job into parallel batches, which you can work on side-by-side. Depending on the job in question, you can sometimes speed up operations tremendously. At the very least, you can treat tasks in such a way that they don’t block other work while they wait to be completed.

This article introduces you to one of the most convenient ways to use threading and multiprocessing in Python: the Pool object, which works with both thread and process pools. We’ll also look at two newer mechanisms Python has under development for parallelism and concurrency: the free-threaded or ‘no-GIL’ build of Python, and the subinterpreter system. Both are still a long way from daily use, but they’re worth knowing about for the future.

Python threads vs. Python processes

Python threads are units of work that run independently of one another. In CPython, they’re implemented as actual operating system-level threads, but they’re serialized—meaning they’re forced to run serially—through the GIL. This is to ensure only one thread at a time can modify Python objects, so data doesn’t get corrupted.

Python threads aren’t good for running CPU-bound tasks side by side (at least not for now). But they’re a useful way to organize tasks that involve some waiting. Python can execute thread A or thread C while thread B is waiting for a reply from an external system, for example.

Python processes are whole instances of the Python interpreter that run independently. Each Python process has its own GIL and its own copy of the data to be worked on. That means multiple Python processes can run in parallel on separate hardware cores. The tradeoff is that a Python process takes longer to spin up than a Python thread, and any data interchange between interpreters is far slower than with threads.

How to choose between Python threads and Python processes

There are a few simple rules to help you choose between Python threads and Python processes:

  • If you’re performing long-running I/O bound operations, which require waiting on a service outside Python, use threads. Examples of these tasks include multiple parallel web-scraping or file-processing jobs.
  • If you’re performing long-running CPU bound operations handled by an external library written in C, such as NumPy, use threads. Here too, the work is being done outside Python.
  • If you’re performing long-running CPU bound operations in Python, use processes.

Thread pools and process pools in Python

The easiest way to work with threads and processes for many kinds of jobs is by using Python’s Pool objects. A pool lets you define a set of threads or processes (your choice) that you can feed any number of jobs; with pool.map, the results come back in the order the jobs were submitted.

As an example, let’s take the numbers from 1 to 15, construct URLs from them, and fetch those URLs in parallel. This example is I/O bound, so there’s likely to be no discernible performance difference between using threads or processes; still, the basic idea should be clear.


from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from urllib.request import urlopen
from time import perf_counter

def work(n):
    # The f-string prefix is required so {n} is interpolated into the URL
    with urlopen(f"https://www.google.com/#{n}") as f:
        contents = f.read(32)
    return contents

def run_pool(pool_type):
    with pool_type() as pool:
        start = perf_counter()
        results = pool.map(work, numbers)
    print("Time:", perf_counter() - start)
    print(list(results))

if __name__ == '__main__':
    numbers = list(range(1, 16))
    
    # Run the task using a thread pool
    run_pool(ThreadPoolExecutor)
    
    # Run the task using a process pool
    run_pool(ProcessPoolExecutor)

How Python multiprocessing works

In the above example, the concurrent.futures module provides high-level pool objects for running work in threads (ThreadPoolExecutor) and processes (ProcessPoolExecutor). Both pool types have the same API, so you can create functions that work interchangeably with both, as the example shows.

We use run_pool to submit instances of the work function to the different types of pools. By default, ProcessPoolExecutor creates one worker process per available CPU core, while ThreadPoolExecutor creates min(32, cpu_count + 4) threads, since threads are cheap and often spend their time waiting. There’s a certain amount of overhead associated with creating pools, so don’t overdo it: if you’re going to be processing lots of jobs over a long period of time, create the pool once and don’t dispose of it until you’re done. With the Executor objects, you can use a context manager (with/as) to create and dispose of pools.

pool.map() is the function we use to subdivide the work. It takes a function and an iterable of arguments, applies the function to each item, splits the work into chunks (you can specify the chunk size, but the default is generally fine), and feeds each chunk to a worker thread or process.

Normally, map blocks the thread it’s running in, meaning you can’t do anything else until map returns its finished work. If you want to dispatch work asynchronously, use the Executor’s submit() method, which returns a Future object you can poll or attach a callback to with add_done_callback(). (The older multiprocessing.Pool API offers map_async for the same purpose.) In this case, we want to wait until all the work is finished before doing anything else, so blocking isn’t a problem.

Finally, this basic example only involves threads and processes that have their own individual state. If you have a long-running CPU-bound operation where threads or processes need to share information with one another, look into using multiprocessing with shared memory or a server process.

On the whole, the more you can partition both the processing and the data to be processed, the faster everything will run. That’s a cardinal rule of multiprocessing and multithreading no matter what language you’re using.

CPU-bound vs. I/O-bound work

The above example works equally well with threads or subprocesses because the work involved isn’t “CPU-bound”—meaning it isn’t some long-running calculation. It involves waiting on a response from something external, in this case a network call. But if we did have CPU-bound work, threads would not be effective.

To see why, run this example:

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from time import perf_counter

def work(n):
    # The argument is ignored; each job performs the same CPU-bound sum
    total = 0
    for x in range(10_000_000):
        total += x
    return total

def run_pool(pool_type):
    with pool_type() as pool:
        start = perf_counter()
        results = pool.map(work, numbers)
    print("Time:", perf_counter() - start)
    print(list(results))

if __name__ == '__main__':
    numbers = list(range(1, 16))
    
    # Run the task using a thread pool
    run_pool(ThreadPoolExecutor)
    
    # Run the task using a process pool
    run_pool(ProcessPoolExecutor)

You’ll find the thread pool completes many times slower than the process pool, even though the process pool pays a non-negligible startup cost that the thread pool doesn’t.

One advantage of using the Executor abstraction is that if a given piece of work doesn’t behave well with threads, you can easily run it in a process pool instead, simply by changing the pool type.

Python threads after Python 3.13

Over the years, various projects have attempted to produce a version of the CPython interpreter without a GIL. A GIL-less CPython would allow threads to run with full parallelism. The above CPU-bound example, for instance, would complete in about the same amount of time when using either threads or subprocesses.

The bad news is that many attempts at a no-GIL CPython came with heavy tradeoffs: multithreaded programs ran well, but single-threaded programs, the majority of Python code, ran drastically slower. The most recent attempt to remove the GIL, now enshrined in PEP 703, fixes many of these issues, and it’s now being offered to end users in an experimental form.

If you install Python 3.13 from the official installers, you get the option to install a separate build of the interpreter that runs free-threaded. If you run the above thread-and-process pool example on that free-threaded build, you’ll see threads and processes doing about equally well.

The free-threaded build is still a long way from being recommended for production use. Single-threaded Python programs still experience a performance hit, but the plan over the next few releases is to minimize that before recommending the free-threaded build. In the meantime, it’s worth experimenting with the new build to get an idea of how free-threading compares to multiprocessing for common tasks.
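When experimenting, it helps to confirm which build you’re actually running. This sketch checks both the build-time flag and, on Python 3.13+, the runtime GIL state; note that sys._is_gil_enabled() is a provisional, underscore-prefixed API and absent on older versions, so we fall back gracefully:

```python
import sys
import sysconfig

# Py_GIL_DISABLED is 1 on a free-threaded build, 0 or None otherwise
free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

# sys._is_gil_enabled() (3.13+) reports whether the GIL is active right now;
# on older interpreters we assume the GIL is enabled
gil_active = getattr(sys, "_is_gil_enabled", lambda: True)()

print("Free-threaded build:", free_threaded_build)
print("GIL active:", gil_active)
```

On a standard build this reports a GIL-enabled interpreter; on the free-threaded build, the first flag is True and the GIL is normally inactive.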

Python subinterpreters vs. threads

Another feature being added to CPython “under the hood” is the concept of the subinterpreter, as detailed in PEP 734. Each CPython process can, in theory, run one or more instances of the actual Python interpreter side-by-side, each with its own GIL. This allows for many of the same behaviors as free-threading but without the automatic performance hit to single-threaded code, and with the GIL retained for when it actually does come in handy.

Right now, the subinterpreter concept is just that: a concept. The eventual plan, if PEP 734 is fully accepted, is to provide a new set of abstractions for Python developers to distribute workloads between subinterpreters. Unlike threads, subinterpreters can’t share state freely, so they would need to pass information back and forth through a Queue or a similar abstraction. But a developer could use an InterpreterPoolExecutor to distribute work across interpreters and synchronize the results, in the same way we do now for threads and processes. Until subinterpreters are officially added to Python, though, processes and threads are still the abstractions to use.