I spent an embarrassing number of hours last year debugging a production service that was silently dropping WebSocket connections under load. The fix turned out to be a single line -- a missing await that turned a coroutine into a silently-discarded object. That experience sent me down a rabbit hole: I decided I wouldn't use asyncio again until I could build my own Python async event loop from scratch. What I found was surprisingly elegant, and far less magical than I expected.
Here's the thing most Python developers get wrong about async: they treat `async`/`await` as concurrency pixie dust. Sprinkle some `async def` here, toss in an `await` there, and your code is magically fast. But `async` doesn't make anything faster in the way you think. It makes *waiting* on I/O cheaper. To really understand why -- and when it actually helps -- you need to see what's happening underneath.
The C10K Problem and Why Async Matters
In 1999, Dan Kegel posed a question that shaped the next two decades of server architecture: can a single server handle 10,000 simultaneous connections? This was the C10K problem. At the time, the standard model was one-thread-per-connection (or one-process-per-connection, for the truly old-school). Each thread consumed ~1MB of stack memory and forced expensive context switches. 10,000 threads meant 10GB of stack memory alone, before you even got to your application state.
The fundamental insight is that most networked connections spend the vast majority of their time waiting. A typical HTTP request might spend 0.1ms on actual CPU work and 100ms waiting for a database query or upstream API call. Thread-per-connection wastes resources holding a thread hostage during that 99.9% idle time.
This is the distinction between I/O-bound and CPU-bound work:
- I/O-bound: your code spends most of its time waiting for external operations (network, disk, database). This is where async excels.
- CPU-bound: your code spends most of its time computing (matrix multiplication, image processing, hashing). Async gives you zero benefit here -- you need multiprocessing or native extensions.
A Python async event loop solves the C10K problem by multiplexing thousands of connections onto a single thread. Instead of blocking a thread while waiting for data, the event loop registers interest in a socket, goes off to do other work, and comes back when data is ready. One thread, one stack, 10,000+ connections.
When you call an `async def` function, Python doesn't create a thread, doesn't schedule anything, doesn't touch the OS scheduler. It creates a coroutine object. That's it. A coroutine is essentially a generator on steroids -- it's a function whose execution can be suspended and resumed at specific points (the await expressions).
Under the hood, async def compiles to a function that returns a coroutine object with three key methods inherited from the generator protocol:
- `send(value)` -- resume execution, injecting `value` as the result of the most recent `await`
- `throw(exception)` -- resume execution by raising an exception at the `await` point
- `close()` -- tell the coroutine to clean up and exit
The await keyword is syntactic sugar for yielding control back to whoever is driving the coroutine. When the event loop calls coro.send(None) on a fresh coroutine, execution proceeds until the first await. At that point, the coroutine suspends, returning a "future" object that represents the pending I/O operation. The event loop takes note of that future, monitors the underlying I/O, and later calls coro.send(result) to inject the result and resume execution.
Here's a minimal demonstration of driving a coroutine manually, without any event loop:
```python
import types

@types.coroutine
def sleep_stub():
    """A minimal awaitable that yields control once."""
    yield "sleeping"

async def example():
    print("before sleep")
    await sleep_stub()
    print("after sleep")

# Drive it manually
coro = example()
result = coro.send(None)  # prints "before sleep", returns "sleeping"
print(f"Coroutine yielded: {result}")
try:
    coro.send(None)  # prints "after sleep", raises StopIteration
except StopIteration:
    print("Coroutine finished")
```
This is the entire async/await mechanism stripped bare: coroutines yield at await points, and something external (the event loop) resumes them. No threads, no parallelism -- just cooperative multitasking.
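The same trick demonstrates the other direction of the protocol: `throw()` resurfaces an exception *inside* the suspended coroutine, at the exact await point. A small sketch (redefining the `sleep_stub` helper so the snippet stands alone):

```python
import types

@types.coroutine
def sleep_stub():
    """Same helper as above: yields control once."""
    yield "sleeping"

async def fragile():
    try:
        await sleep_stub()
    except ValueError as e:
        return f"caught {e}"
    return "no exception"

coro = fragile()
coro.send(None)  # run up to the suspended await
try:
    coro.throw(ValueError("boom"))  # raise *at* the await point
except StopIteration as e:
    outcome = e.value  # the coroutine caught it and returned normally

print(outcome)  # -> caught boom
```

This is exactly how an event loop delivers an I/O error (say, a connection reset) to whichever coroutine was awaiting that socket.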
The Event Loop Internals: Selectors, Callbacks, and Tasks
CPython's asyncio event loop (specifically _UnixSelectorEventLoop or _WindowsSelectorEventLoop) is built on three pillars:
1. The selector (I/O multiplexing). Python's selectors module wraps the OS-level I/O multiplexing primitives: epoll on Linux and kqueue on macOS/BSD (on Windows, the selector-based loop falls back to select(), while asyncio's default proactor loop uses IOCP instead). When you await a network read, the event loop registers the file descriptor with the selector, saying "wake me up when this socket has data." The selector can monitor thousands of file descriptors simultaneously with a single system call.
2. The callback/ready queue. The event loop maintains a queue of callbacks that are ready to execute right now. Each iteration of the loop drains this queue. When a timer fires, an I/O operation completes, or you call loop.call_soon(), a callback gets added to this queue.
3. Tasks (coroutine wrappers). A Task is a Future subclass that wraps a coroutine and knows how to drive it via send()/throw(). When you call asyncio.create_task(coro), you're creating a Task that registers a callback to step the coroutine forward. Each "step" calls coro.send(result), which runs until the next await, which might return another future, which the Task then attaches a callback to, and so on.
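Pillar 1 is worth seeing in isolation before we wire it into a loop. Here's a minimal `selectors` round trip using a `socketpair()` as a stand-in for a real client connection:

```python
import selectors
import socket

sel = selectors.DefaultSelector()
left, right = socket.socketpair()  # connected pair: stands in for client/server
left.setblocking(False)
right.setblocking(False)

# "Wake me when `right` is readable." The data field can be any object;
# a real loop stores a callback here.
sel.register(right, selectors.EVENT_READ, data="right is readable")

left.send(b"ping")  # this makes `right` readable

ready = sel.select(timeout=1)
labels = [key.data for key, mask in ready]
payload = ready[0][0].fileobj.recv(4)

print(labels, payload)  # -> ['right is readable'] b'ping'

sel.close()
left.close()
right.close()
```

The `data` field attached at registration time is the whole trick: when the OS reports the socket ready, you get your own object back and know exactly what to do next.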
The core loop looks roughly like this pseudocode:
```
while there are pending tasks:
    1. Run all ready callbacks
    2. Poll the selector (with timeout = time until next scheduled callback)
    3. For each ready I/O event, schedule the associated callback
    4. Process scheduled (timed) callbacks whose deadlines have passed
```
That's it. That's the whole Python async event loop. Let's build one.
Building a Minimal Event Loop From Scratch
Here's a working event loop in ~200 lines of Python 3.12+. It supports coroutine scheduling, sleep, and selector-based socket I/O. This is a real, runnable implementation -- not pseudocode.
"""
mini_event_loop.py -- A minimal async event loop in ~200 lines.
Supports: coroutine tasks, sleep, and non-blocking socket I/O.
Requires: Python 3.12+
"""
import selectors
import socket
import time
import heapq
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Coroutine, Generator
# ---------------------------------------------------------------------------
# Awaitable primitives: these are what coroutines 'await' on
# ---------------------------------------------------------------------------
class _SleepAwaitable:
"""Yielded by sleep() to tell the loop to resume after a delay."""
def __init__(self, seconds: float):
self.seconds = seconds
class _ReadAwaitable:
"""Yielded by sock_recv() to tell the loop to wait for readable data."""
def __init__(self, sock: socket.socket, nbytes: int):
self.sock = sock
self.nbytes = nbytes
class _WriteAwaitable:
"""Yielded by sock_sendall() to tell the loop to wait for writability."""
def __init__(self, sock: socket.socket, data: bytes):
self.sock = sock
self.data = data
class _AcceptAwaitable:
"""Yielded by sock_accept() to tell the loop to wait for a connection."""
def __init__(self, sock: socket.socket):
self.sock = sock
import types
@types.coroutine
def sleep(seconds: float):
"""Suspend the current task for `seconds`."""
yield _SleepAwaitable(seconds)
@types.coroutine
def sock_recv(sock: socket.socket, nbytes: int) -> bytes:
"""Await readable data on a non-blocking socket."""
data = yield _ReadAwaitable(sock, nbytes)
return data
@types.coroutine
def sock_sendall(sock: socket.socket, data: bytes):
"""Await writability and send data on a non-blocking socket."""
yield _WriteAwaitable(sock, data)
@types.coroutine
def sock_accept(sock: socket.socket):
"""Await an incoming connection on a listening socket."""
result = yield _AcceptAwaitable(sock)
return result
# ---------------------------------------------------------------------------
# Task: wraps a coroutine and drives it step by step
# ---------------------------------------------------------------------------
class Task:
_next_id = 0
def __init__(self, coro: Coroutine):
Task._next_id += 1
self.id = Task._next_id
self.coro = coro
self.result = None
self.exception = None
self.finished = False
self._callbacks: list = []
def add_done_callback(self, fn):
if self.finished:
fn(self)
else:
self._callbacks.append(fn)
def _finish(self, result=None, exception=None):
self.result = result
self.exception = exception
self.finished = True
for cb in self._callbacks:
cb(self)
def __repr__(self):
return f"Task(id={self.id}, finished={self.finished})"
# ---------------------------------------------------------------------------
# Scheduled callback (for timers / sleep)
# ---------------------------------------------------------------------------
@dataclass(order=True)
class _Scheduled:
deadline: float
callback: Any = field(compare=False)
args: tuple = field(compare=False, default=())
# ---------------------------------------------------------------------------
# The Event Loop
# ---------------------------------------------------------------------------
class EventLoop:
def __init__(self):
self._selector = selectors.DefaultSelector()
self._ready: deque = deque()
self._scheduled: list = []
self._tasks: set[Task] = set()
self._stopping = False
def create_task(self, coro: Coroutine) -> Task:
"""Schedule a coroutine as a Task."""
task = Task(coro)
self._tasks.add(task)
self._ready.append((self._task_step, (task, None, None)))
return task
def run(self, coro: Coroutine):
"""Run a coroutine to completion (like asyncio.run)."""
main_task = self.create_task(coro)
final_result = None
final_exc = None
def on_main_done(t: Task):
nonlocal final_result, final_exc
final_result = t.result
final_exc = t.exception
self._stopping = True
main_task.add_done_callback(on_main_done)
self._run_forever()
self._selector.close()
if final_exc:
raise final_exc
return final_result
def _run_forever(self):
while not self._stopping:
self._run_once()
def _run_once(self):
timeout = 0 if self._ready else self._next_deadline_timeout()
events = self._selector.select(timeout=timeout)
for key, mask in events:
callback, args = key.data
self._ready.append((callback, args))
self._selector.unregister(key.fileobj)
now = time.monotonic()
while self._scheduled and self._scheduled[0].deadline <= now:
entry = heapq.heappop(self._scheduled)
self._ready.append((entry.callback, entry.args))
for _ in range(len(self._ready)):
callback, args = self._ready.popleft()
callback(*args)
def _next_deadline_timeout(self) -> float | None:
if not self._scheduled:
return 0.5
delay = self._scheduled[0].deadline - time.monotonic()
return max(0, delay)
def _task_step(self, task: Task, value: Any, exc: Exception | None):
"""Advance a task's coroutine by one step."""
try:
if exc:
awaitable = task.coro.throw(exc)
else:
awaitable = task.coro.send(value)
except StopIteration as e:
task._finish(result=e.value)
self._tasks.discard(task)
return
except Exception as e:
task._finish(exception=e)
self._tasks.discard(task)
return
self._handle_awaitable(task, awaitable)
def _handle_awaitable(self, task: Task, awaitable):
if isinstance(awaitable, _SleepAwaitable):
deadline = time.monotonic() + awaitable.seconds
heapq.heappush(
self._scheduled,
_Scheduled(deadline, self._task_step, (task, None, None)),
)
elif isinstance(awaitable, _ReadAwaitable):
self._selector.register(
awaitable.sock,
selectors.EVENT_READ,
(self._on_sock_read, (task, awaitable.sock, awaitable.nbytes)),
)
elif isinstance(awaitable, _WriteAwaitable):
self._selector.register(
awaitable.sock,
selectors.EVENT_WRITE,
(self._on_sock_write, (task, awaitable.sock, awaitable.data)),
)
elif isinstance(awaitable, _AcceptAwaitable):
self._selector.register(
awaitable.sock,
selectors.EVENT_READ,
(self._on_sock_accept, (task, awaitable.sock)),
)
elif isinstance(awaitable, Task):
awaitable.add_done_callback(
lambda t: self._ready.append(
(self._task_step, (task, t.result, t.exception))
)
)
else:
raise RuntimeError(f"Task yielded unknown awaitable: {awaitable!r}")
def _on_sock_read(self, task: Task, sock: socket.socket, nbytes: int):
data = sock.recv(nbytes)
self._task_step(task, data, None)
def _on_sock_write(self, task: Task, sock: socket.socket, data: bytes):
sock.sendall(data)
self._task_step(task, None, None)
def _on_sock_accept(self, task: Task, sock: socket.socket):
conn, addr = sock.accept()
conn.setblocking(False)
self._task_step(task, (conn, addr), None)
That's it. ~200 lines, and we have a working Python async event loop that supports coroutine task scheduling, sleep timers, and non-blocking socket I/O using the selectors module for I/O multiplexing.
How asyncio.run(), gather(), and create_task() Work Internally
Now that we've built our own loop, the real asyncio internals are less mysterious.
asyncio.run(coro) does four things: (1) creates a new event loop, (2) wraps coro in a Task, (3) calls loop.run_until_complete(task), which blocks until the task finishes, (4) cleans up by cancelling remaining tasks and closing the loop. Our EventLoop.run() method above is a simplified version of exactly this.
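Those four steps can be sketched against asyncio's public API (a rough equivalent only: the real asyncio.run also cancels leftover tasks and shuts down async generators, which this sketch omits):

```python
import asyncio

def run_like_asyncio_run(coro):
    """Sketch of asyncio.run(): cleanup of leftover tasks omitted."""
    loop = asyncio.new_event_loop()           # (1) create a fresh event loop
    try:
        task = loop.create_task(coro)         # (2) wrap the coroutine in a Task
        return loop.run_until_complete(task)  # (3) block until the task is done
    finally:
        loop.close()                          # (4) tear the loop down

async def main():
    await asyncio.sleep(0.01)
    return "done"

print(run_like_asyncio_run(main()))  # -> done
```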
asyncio.create_task(coro) wraps a coroutine in a Task object and schedules its first step on the running loop. The Task's __step method calls coro.send(result), which advances the coroutine to the next await. If the await yields a Future, the Task attaches a callback to that Future so that when the Future resolves, the Task steps again. This is precisely the _task_step + _handle_awaitable pattern in our implementation.
asyncio.gather(*coros) is surprisingly simple in concept. It creates a Task for each coroutine, then returns a Future that resolves when all Tasks complete. Internally, it attaches a callback to each child Task; when a child finishes, the callback checks if all siblings are done and, if so, resolves the parent Future with the collected results.
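That callback-counting mechanism fits in a few lines. Here's a sketch of gather's core idea on top of real asyncio primitives (`mini_gather` is a hypothetical name; error propagation and cancellation, which the real gather handles, are omitted):

```python
import asyncio

def mini_gather(*coros):
    """Sketch of gather's core: one parent Future, N child Tasks."""
    loop = asyncio.get_running_loop()
    parent = loop.create_future()
    children = [loop.create_task(c) for c in coros]
    remaining = len(children)

    def on_child_done(_task):
        nonlocal remaining
        remaining -= 1
        if remaining == 0 and not parent.done():
            # All siblings finished: resolve the parent with ordered results.
            parent.set_result([t.result() for t in children])

    for t in children:
        t.add_done_callback(on_child_done)
    return parent

async def demo():
    async def work(n):
        await asyncio.sleep(0.01 * n)
        return n * n
    return await mini_gather(work(1), work(2), work(3))

print(asyncio.run(demo()))  # -> [1, 4, 9]
```

Note that results come back in submission order, not completion order, because they're collected from the `children` list -- the same guarantee the real gather makes.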
Threads vs. Async vs. Multiprocessing: When to Use What
This is the table I wish someone had shown me five years ago:
|  | Threading | Asyncio | Multiprocessing |
|---|---|---|---|
| Best for | I/O-bound (legacy/blocking libs) | I/O-bound (async-native libs) | CPU-bound work |
| Concurrency model | Preemptive (OS schedules) | Cooperative (you yield) | True parallelism (separate processes) |
| GIL impact | Only one thread runs Python at a time | Single thread, no GIL issue | Each process has own GIL |
| Memory overhead | ~1MB per thread (stack) | ~1KB per coroutine (heap) | ~30MB+ per process (full interpreter) |
| Context switch cost | ~1-10 microseconds (kernel) | ~0.1 microseconds (userspace) | ~100 microseconds (process) |
| 10K connections | Impractical (~10GB RAM) | Trivial (~10MB RAM) | Absurd (~300GB RAM) |
| Debugging | Race conditions, deadlocks | Accidental blocking, missing awaits | Serialization overhead, IPC |
Here's a quick benchmark comparing fetching 100 URLs using threads vs. async:
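A real URL benchmark needs network access and a third-party client (aiohttp), so here's a self-contained stand-in: 100 "requests" that each wait 50 ms, exactly the shape of an I/O-bound workload. The sleeps stand in for the network round trip; the point is that both models overlap the waiting, finishing in roughly the longest single wait rather than the 5-second serial total:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

N, DELAY = 100, 0.05  # 100 "requests", each waiting 50 ms on "the network"

def blocking_fetch(i):
    time.sleep(DELAY)           # stands in for requests.get(url)
    return i

async def async_fetch(i):
    await asyncio.sleep(DELAY)  # stands in for an aiohttp request
    return i

# Threads: 100 blocking calls spread over 100 worker threads.
start = time.monotonic()
with ThreadPoolExecutor(max_workers=N) as pool:
    thread_results = list(pool.map(blocking_fetch, range(N)))
thread_elapsed = time.monotonic() - start

# Async: one thread, 100 coroutines suspended on the same event loop.
async def main():
    return await asyncio.gather(*(async_fetch(i) for i in range(N)))

start = time.monotonic()
async_results = asyncio.run(main())
async_elapsed = time.monotonic() - start

print(f"threads: {thread_elapsed:.2f}s  async: {async_elapsed:.2f}s")
# Both finish near 0.05s (concurrent waiting), not the 5s serial total.
```

The practical difference shows up in memory and scheduling cost, not wall time: the thread version pays for 100 stacks and kernel context switches, the async version for 100 small heap objects.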
Common Pitfalls in Async Python
After building the event loop, certain failure modes become obvious.
1. Blocking the loop. This is the cardinal sin. If you call time.sleep(5) inside an async def, you freeze the entire event loop for 5 seconds. Every other coroutine starves. Always use await asyncio.sleep(), not time.sleep(). Always use aiohttp, not requests.
2. Forgetting to await. Writing asyncio.create_task(my_coro()) is correct. Writing my_coro() alone creates a coroutine object that nobody ever drives. It just gets garbage collected. Python emits a "coroutine ... was never awaited" RuntimeWarning when the orphaned object is finalized, but it's easy to miss.
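You can see that warning fire deterministically by capturing it (`forgotten` is just an illustrative name):

```python
import gc
import warnings

async def forgotten():
    return 42

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    forgotten()      # creates a coroutine object; nobody ever drives it
    gc.collect()     # ensure the orphan is finalized, triggering the warning

messages = [str(w.message) for w in caught]
print(messages)      # contains "coroutine 'forgotten' was never awaited"
```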
3. CPU-bound work in async code. If you compute a Fibonacci number inside an async def, you block the event loop just like time.sleep. For CPU-bound work, use loop.run_in_executor() to offload to an executor -- a process pool for true parallelism, or a thread pool, which won't parallelize pure-Python CPU work (the GIL) but at least keeps the event loop thread responsive.
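A minimal sketch of the offload pattern, using the default thread-pool executor (passing `None`); in a real script you'd typically pass a ProcessPoolExecutor, which needs a `__main__` guard on platforms that spawn rather than fork:

```python
import asyncio

def crunch(n: int) -> int:
    # CPU-bound: calling this directly in a coroutine would block the loop.
    return sum(i * i for i in range(n))

async def main():
    loop = asyncio.get_running_loop()
    # None -> the loop's default ThreadPoolExecutor. Swap in a
    # concurrent.futures.ProcessPoolExecutor to sidestep the GIL.
    heavy = loop.run_in_executor(None, crunch, 10**6)
    result = await heavy  # the loop keeps running other tasks meanwhile
    return result

print(asyncio.run(main()))
```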
4. The function coloring problem. Once one function is async, every function that calls it must also be async. This "infects" your codebase upward. Think carefully before making your library async-first.
5. Debugging is harder. Stack traces in async code are fragmented across await points. Tools like asyncio.get_event_loop().set_debug(True) and python -X dev help.
When NOT to use async:
- Your I/O calls are already fast and infrequent
- Your workload is CPU-bound (use multiprocessing or C extensions)
- Your entire dependency chain is synchronous
- Your team isn't comfortable with the mental model
The Python async event loop is not magic. It's a `while True` loop that polls the OS for ready I/O, runs callbacks, and advances coroutines one send() at a time. Building your own event loop in ~200 lines is the single best way to understand Python coroutines under the hood. Once you see that a Task is just a state machine stepping through a coroutine, that await is just yield with better syntax, and that the event loop is just a scheduler spinning between I/O polling and callback execution -- asyncio's internals become transparent.