Scaling asyncio on Free-Threaded Python
Published September 10, 2025
Kumar Aditya
Introduction
The Python standard library provides the asyncio module to facilitate writing high-performance concurrent code. By leveraging async/await syntax, it provides a high-level API for creating and managing event loops, coroutines, and tasks, and for performing asynchronous I/O operations. It is used as a foundation for Python asynchronous frameworks that provide high-performance network and web servers, database connection libraries, distributed task queues, etc. Multiple libraries and frameworks, such as FastAPI and aiohttp, are built on top of asyncio.
In this blog post, we will explore the changes I made in the upcoming Python 3.14 release to enable asyncio to scale on the free-threaded build of CPython.
The GIL and asyncio: A brief recap
Before diving into the details of scaling asyncio on the free-threaded build of CPython, it's important to understand what the Global Interpreter Lock (GIL) is and how it is a significant limitation for asyncio in the first place.
The Global Interpreter Lock (GIL) is a global mutex that protects access to Python objects, preventing multiple threads from executing Python code at once. This means that even though you can have multiple threads in a Python program, only one thread can execute Python code at a time.
asyncio uses an event loop as a scheduler to enable highly efficient I/O-bound concurrency by switching between tasks during non-blocking I/O operations. The event loop leverages platform-specific support for asynchronous I/O, such as epoll on Linux, kqueue on macOS, and IOCP on Windows, to perform these operations efficiently. Since only one event loop can run per thread, CPU-bound tasks, which would otherwise block the event loop, are typically offloaded to separate threads. However, the GIL limits true parallel execution of Python code across threads. Hence, even when tasks are offloaded, they still compete for the GIL for execution. This lock contention limits parallelism and can tank performance for any CPU-bound workload.
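As a toy illustration of that pattern (the function and workload size below are made up), CPU-bound work is usually pushed off the event loop with asyncio.to_thread(); this keeps the loop responsive, but on a GIL-enabled build the worker threads still serialize on the GIL, so the two calls below do not actually overlap:

```python
import asyncio

def crunch(n: int) -> int:
    # CPU-bound work; under the GIL, threads running this cannot execute in parallel.
    return sum(i * i for i in range(n))

async def main() -> None:
    # Offloading keeps the event loop free to service other tasks, but on the
    # GIL-enabled build the two crunch() calls still run one at a time.
    results = await asyncio.gather(
        asyncio.to_thread(crunch, 5_000_000),
        asyncio.to_thread(crunch, 5_000_000),
    )
    print(results)

asyncio.run(main())
```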
The GIL also prevents multiple event loops running in different threads from executing in parallel. This limits the ability to scale asyncio applications across multiple CPU cores.
Scaling asyncio on Free-Threaded Python
The free-threaded build of CPython removes the GIL, allowing multiple threads to execute in parallel. This opens up new possibilities for asyncio applications, enabling them to scale across multiple CPU cores without the limitations imposed by the GIL. However, this means that asyncio needed to be adapted to work in a free-threaded environment, as it previously relied on the GIL and global state and was not thread-safe.
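For instance, on the free-threaded build each thread can run its own event loop with asyncio.run(), and those loops can now execute Python code in parallel. A minimal sketch (the worker coroutine and thread count here are arbitrary):

```python
import asyncio
import threading

async def worker(name: str) -> None:
    # Each thread runs its own, independent event loop.
    await asyncio.sleep(0.1)
    print(f"{name} finished on loop {id(asyncio.get_running_loop()):#x}")

def run_loop(name: str) -> None:
    asyncio.run(worker(name))  # one event loop per thread

threads = [threading.Thread(target=run_loop, args=(f"worker-{i}",)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```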
Since each thread can only run one event loop, asyncio internally does book-keeping for each thread running an event loop and primarily stores three key pieces of state:

- Current loop: When a thread starts running an asyncio event loop, it sets its current loop to the instance of the running event loop. Once the event loop is stopped, the current loop is set to None. The current loop is used to associate futures, tasks and callbacks with the running event loop. The current loop can be accessed using asyncio.get_running_loop().
- Tasks: When a task is created, it is added to the set of tasks to be executed by the current event loop. This allows each loop to manage its own tasks independently, and once a task is completed, it is removed from the set. The set of tasks can be accessed by asyncio.all_tasks().
- Current task: When a task starts executing, it is set as the current task for the event loop. This allows the event loop to keep track of which task is currently running, and once the task completes or suspends by awaiting on something, the current task state is reset. High-level APIs such as asyncio.timeout() and asyncio.TaskGroup rely on the current task for proper cancellation of tasks. The current task can be accessed using asyncio.current_task().
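All three pieces of state are visible through the public API. A short example (the coroutine names are just for illustration):

```python
import asyncio

async def child():
    await asyncio.sleep(0.1)

async def main():
    loop = asyncio.get_running_loop()          # current loop for this thread
    task = asyncio.create_task(child())        # tracked by this loop until it completes
    print(task in asyncio.all_tasks())         # True: child() is still pending
    print(asyncio.current_task().get_name())   # the task wrapping main()
    assert asyncio.current_task().get_loop() is loop
    await task

asyncio.run(main())
```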
Up until now, asyncio was designed with the assumption of a single-threaded environment and relied on the GIL to manage access to shared state. The current task state was stored in a global dictionary mapping threads to their current task, and all tasks were stored in a global WeakSet. This scales poorly with the number of threads in free-threading because of reference counting and lock contention on these global data structures.
In Python 3.14, I have implemented several changes to fix thread safety of asyncio and enable it to scale effectively on the free-threaded build of CPython. It is now implemented using lock-free data structures and per-thread state, allowing for highly efficient task management and execution across multiple threads. In the general case of multiple event loops running in parallel, there is no lock contention and performance scales linearly with the number of threads.

Here are the key changes:
- Per-thread linked list of tasks: Python 3.14 introduces a per-thread circular doubly linked list for storing tasks, replacing the global WeakSet. The linked list is per-thread, meaning that each thread maintains its own list of tasks, which allows lock-free addition and removal of tasks. Weak references are slow and prone to contention. The new implementation removes the use of weak references entirely and makes tasks responsible for removing themselves from the list when they are done. This requires cooperation between the task's deallocator and the executing threads to ensure that the task is removed from the list before it is freed; otherwise a thread could try to access an already freed task. By removing the use of weak references, the overhead of reference counting is eliminated entirely, and adding or removing a task now requires only updating the pointers in the linked list. This design allows for efficient, lock-free, and thread-safe task management and scales well on the free-threaded interpreter; a simplified sketch of the idea follows this list.
  This was implemented in https://github.com/python/cpython/pull/128869.
- Per-thread current task: Python 3.14 stores the current task in the thread state structure, which is local to each thread. By storing the current task on the thread state, the overhead of accessing it is reduced, allowing lock-free access without a dictionary lookup. This allows for faster switching between tasks, a very frequent operation in asyncio.
  This was implemented in https://github.com/python/cpython/pull/129899.
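The following is a conceptual Python sketch of the intrusive, per-thread doubly linked list idea from the first change; the real implementation lives in C inside CPython and the names here are invented. The point is that a task is itself a list node, so adding and removing it costs only a few pointer updates and needs no weak references or global container:

```python
import threading

class Node:
    __slots__ = ("prev", "next")

    def __init__(self):
        # A node starts out linked to itself: an empty circular list.
        self.prev = self.next = self

_tls = threading.local()

def _thread_head() -> Node:
    # One sentinel head per thread (the real code keeps this in the thread state).
    if not hasattr(_tls, "head"):
        _tls.head = Node()
    return _tls.head

class TrackedTask(Node):
    """A task that links itself into the creating thread's list."""

    def __init__(self):
        super().__init__()
        head = _thread_head()
        # Splice self in right after the head: O(1) pointer updates,
        # no locking needed since only this thread touches its own list
        # in the common case.
        self.prev = head
        self.next = head.next
        head.next.prev = self
        head.next = self

    def finish(self):
        # A finished task unlinks itself with two pointer updates.
        self.prev.next = self.next
        self.next.prev = self.prev
        self.prev = self.next = self

def tasks_in_this_thread():
    head = _thread_head()
    node = head.next
    while node is not head:
        yield node
        node = node.next
```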
Both of these changes allow asyncio to scale linearly with the number of threads in free-threading, and have significantly improved performance for both single-threaded and multi-threaded asyncio usage. The standard pyperformance benchmark suite shows a significant 10–20% improvement in single-threaded performance while also reducing memory usage.
For a deeper dive into the implementation, check out the internal docs for asyncio.
Benchmarks
Here are the benchmark results comparing the performance of asyncio on the free-threaded build with the GIL-enabled build, on a Windows machine with 6 physical CPU cores and 12 hyper-threads:
- TCP Benchmark: This benchmark measures raw TCP throughput. With a single worker, speed is 276 MB/s; with 6 workers it scales to 532 MB/s on the default build and 1455 MB/s on the free-threaded build; with 12 workers it reaches 698 MB/s and 1924 MB/s respectively.
Web Scraping: This benchmark measures the performance of using
aiohttp
with Web Scraping on asyncio.Speed with a single worker on default build is 12 stories/sec, with 12 workers that scales to 35 stories/sec, and with the free-threaded build it is 80 stories/sec.
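The exact benchmark code is not shown here, but as a rough sketch of what a multi-worker TCP throughput measurement can look like (ports, payload size, and duration below are made-up values), each worker thread runs its own event loop that serves and drives an echo connection, and the per-worker byte counts are summed:

```python
import asyncio
import threading
import time

PAYLOAD = b"x" * 65536
DURATION = 2.0

async def echo(reader, writer):
    # Echo everything back until the peer closes the connection.
    while data := await reader.read(65536):
        writer.write(data)
        await writer.drain()
    writer.close()

async def pump(port: int) -> int:
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    sent = 0
    deadline = time.monotonic() + DURATION
    while time.monotonic() < deadline:
        writer.write(PAYLOAD)
        await writer.drain()
        await reader.readexactly(len(PAYLOAD))
        sent += len(PAYLOAD)
    writer.close()
    await writer.wait_closed()
    return sent

async def worker(port: int, results: list) -> None:
    server = await asyncio.start_server(echo, "127.0.0.1", port)
    async with server:
        results.append(await pump(port))

def run_worker(port: int, results: list) -> None:
    asyncio.run(worker(port, results))  # one event loop per worker thread

results: list[int] = []
threads = [threading.Thread(target=run_worker, args=(9000 + i, results)) for i in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"aggregate throughput: {sum(results) / DURATION / 1e6:.1f} MB/s")
```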
Summary
asyncio now has first-class support for free-threading, scales linearly with the number of threads, and can take advantage of multiple cores effectively. It is now possible to run multiple event loops in parallel, which unlocks new possibilities for high-performance multi-threaded asyncio applications such as web servers, data processing pipelines, and more.