Last month I was profiling a Python microservice that processes 40,000 messages per second from a Kafka topic. CPU was pegged at 92%. I assumed it was our business logic -- some hairy graph traversal stuff. I was wrong. Over 61% of the CPU time was spent in json.loads() and the Pydantic model validation that followed. Serialization. The thing nobody thinks about until it's the only thing that matters.
This sent me down a rabbit hole that I've been living in for weeks: the world of Python zero-copy serialization. I benchmarked everything. I read C extension source code. I swapped serialization layers in production and watched our P99 latencies drop by 4x. What I found is that the Python serialization landscape in 2026 is genuinely exciting -- and most teams are leaving enormous performance on the table by defaulting to json.dumps() and calling it a day.
Here's everything I learned.
The Hidden Tax: Why Serialization Is Your Actual Bottleneck
Most Python developers think of serialization as a solved, boring problem. You call json.dumps(), you get a string. You call json.loads(), you get a dict. What's to optimize?
Everything, it turns out.
In a typical Python web service, serialization and deserialization happen constantly: reading request bodies, writing response bodies, communicating between microservices, pushing to and pulling from message queues, caching in Redis, logging structured events. A single API request might serialize and deserialize data six to ten times before a response leaves the server.
The stdlib json module is implemented partly in C (via _json), but it's fundamentally limited. It builds a complete Python object tree on every loads() call -- every string becomes a str, every number becomes an int or float, every object becomes a dict. Each of these is a heap allocation. For a 1 KB JSON payload with 30 fields, you're looking at 50+ individual Python object allocations, each going through the memory allocator, each tracked by the garbage collector.
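You can see this object-tree cost directly. The sketch below uses nothing but the stdlib: it counts every Python object that a single json.loads() call materializes for a modest 30-field payload (the field layout is my own, chosen for illustration):

```python
import json

# A 30-field object, each field holding a small two-element list
payload = json.dumps({f"field_{i}": [i, str(i)] for i in range(30)}).encode()

def count_objects(node):
    """Count every node in the tree json.loads() built.

    CPython caches small ints and some short strings, but every
    dict, list, and freshly parsed string is a heap allocation.
    """
    if isinstance(node, dict):
        return 1 + sum(count_objects(k) + count_objects(v)
                       for k, v in node.items())
    if isinstance(node, list):
        return 1 + sum(count_objects(v) for v in node)
    return 1  # str, int, float, bool, None

doc = json.loads(payload)
print(count_objects(doc))  # 121 objects from one loads() call
```

One call, well under 1 KB of JSON, and over a hundred objects for the allocator and garbage collector to manage.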
Multiply that by thousands of requests per second and you start to understand why serialization routinely shows up as 20-40% of CPU time in profiled Python services. I've seen it go as high as 60%+ in services that do lightweight processing on large message payloads -- which describes most data pipeline workers.
The fundamental issue is copying. Traditional serialization follows a pattern: bytes come in from the network, get copied into a Python buffer, get parsed into intermediate tokens, get copied again into Python objects, and then optionally get validated and copied into yet another set of typed objects (your Pydantic models, dataclasses, etc.). That's three or four copies of essentially the same data, each with its own allocation overhead.
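Here is that pattern made concrete with stdlib pieces. Pydantic and friends do a far more capable version of the second pass, but schematically it looks like this (the User type is illustrative):

```python
import json
from dataclasses import dataclass

@dataclass
class User:
    name: str
    age: int

raw = b'{"name": "Alice", "age": 30}'  # copy 1: bytes read off the socket

tree = json.loads(raw)                 # copy 2: a full dict/str/int tree

# Copy 3: validate the generic tree, then rebuild it as a typed object.
# This is, roughly, what a Pydantic-style layer does after parsing.
if not isinstance(tree.get("name"), str) or not isinstance(tree.get("age"), int):
    raise TypeError("payload does not match User schema")
user = User(name=tree["name"], age=tree["age"])
```

Every stage allocates, and the intermediate dict tree is thrown away the moment the typed object exists.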
Python zero-copy serialization aims to eliminate as many of these copies as possible. The approaches vary -- some reduce copies from four to two, some from four to one, and FlatBuffers achieves a true zero-copy read directly from the wire bytes. Let's look at each approach.
Protocol Buffers: The Incumbent Workhorse
Protocol Buffers (protobuf) has been Google's serialization format since 2001, and it's been open source since 2008. In 2026, with the v7.x release line (7.34.0 at time of writing), the Python implementation has matured considerably, thanks largely to the upb backend, which has been the default since version 4.21.0.
The upb backend is built on a high-performance C library rather than wrapping the C++ implementation like the old cpp backend. This gives Python protobuf substantially better parsing performance, especially for large payloads.
Here's what working with protobuf looks like in Python:
```proto
// user.proto
syntax = "proto3";

message User {
  string name = 1;
  string email = 2;
  int32 age = 3;
  repeated string tags = 4;
  Address address = 5;
}

message Address {
  string street = 1;
  string city = 2;
  string country = 3;
}
```
Protobuf's strengths are well-known: compact binary format (typically 2-3x smaller than JSON), strong schema evolution guarantees, and broad language support. In benchmarks, protobuf serialization in Python is roughly 53% faster than equivalent JSON serialization and 73% faster for deserialization.
But protobuf is not zero-copy. When you call ParseFromString(), the upb backend allocates a C-level message structure and copies all field values into it. It's faster than JSON parsing, but it still follows the parse-then-access pattern.
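You can watch the parse-then-access pattern without even running protoc, because the Python package ships pre-generated classes for the well-known types. The sketch below uses Struct, a dict-like well-known type, purely for illustration; with the user.proto schema above you'd use classes generated into user_pb2.py instead:

```python
from google.protobuf import struct_pb2

# Build and serialize a message (Struct stores all numbers as doubles)
msg = struct_pb2.Struct()
msg.update({"name": "Alice", "age": 30})
wire = msg.SerializeToString()   # compact binary encoding

# ParseFromString allocates a fresh message and copies every field
# value out of the wire bytes before you can touch anything
parsed = struct_pb2.Struct()
parsed.ParseFromString(wire)
print(parsed["name"], parsed["age"])
```

Fast, but the work happens up front for every field, whether or not you read it.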
FlatBuffers: True Zero-Copy, But at What Cost?
FlatBuffers is Google's answer to the question: "What if we never deserialized at all?"
Created by Wouter van Oortmerssen, FlatBuffers stores data in a binary format that can be read directly without a parsing step. The serialized bytes are the data structure. You access fields through accessor methods that calculate offsets into the underlying byte buffer on the fly.
This is genuine Python zero-copy serialization in the purest sense. When a FlatBuffer arrives over the network, you can start reading fields immediately without allocating any intermediate objects.
The numbers reflect this. In benchmarks, FlatBuffers deserialization takes approximately 0.09 microseconds compared to Protocol Buffers' 69 microseconds for the same payload. That's a 766x speedup on deserialization. The tradeoff is serialization speed: FlatBuffers serialization takes about 1,048 microseconds versus protobuf's 708 microseconds.
The API is verbose and imperative. In practice, I've found FlatBuffers most compelling for game development and data pipeline stages where you receive a large buffer but only need to read 2-3 fields out of a 50-field message.
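In real projects you run flatc over a schema and use the generated accessor classes. To show the zero-copy mechanics without generated code, here is a low-level sketch using the flatbuffers package's raw Builder and Table, the same primitives the generated code calls; the two-field layout is my own, not the article's schema:

```python
import flatbuffers
from flatbuffers import number_types

# Build a tiny table: field 0 = name (string), field 1 = age (int32)
b = flatbuffers.Builder(64)
name_off = b.CreateString("Alice")   # strings are written before the table
b.StartObject(2)
b.PrependUOffsetTRelativeSlot(0, name_off, 0)
b.PrependInt32Slot(1, 30, 0)
b.Finish(b.EndObject())
buf = b.Output()                     # these bytes ARE the data structure

# Read fields straight out of the buffer: no parse step, no object tree
root = flatbuffers.encode.Get(flatbuffers.packer.uoffset, buf, 0)
tab = flatbuffers.Table(buf, root)
o = tab.Offset(4)                    # vtable slot for field 0
name = tab.String(o + tab.Pos).decode() if o else None
o = tab.Offset(6)                    # vtable slot for field 1
age = tab.Get(number_types.Int32Flags, o + tab.Pos) if o else 0
print(name, age)
```

The read side is just offset arithmetic over the original bytes, which is exactly why access to a single field costs nanoseconds regardless of message size.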
msgspec: The Python-Native Speed Demon
This is where things get really interesting. msgspec, created by Jim Crist-Harif, achieves performance numbers that shouldn't be possible in Python. The key insight: validation and deserialization should be the same operation. When you use Pydantic or cattrs, data gets deserialized first (JSON bytes to Python dicts) and then validated second (Python dicts to typed model instances). msgspec does both in a single pass.
The benchmark numbers are striking:
| Library | JSON Encode (µs) | JSON Decode (µs) | Total (µs) | Relative |
|---|---|---|---|---|
| msgspec (Struct) | 140 | 367 | 507 | 1.0x |
| msgspec (no schema) | 183 | 482 | 665 | 1.3x |
| orjson | 179 | 464 | 643 | 1.3x |
| ujson | 628 | 855 | 1,483 | 2.9x |
| rapidjson | 514 | 1,131 | 1,645 | 3.2x |
| simdjson | 1,234 | 771 | 2,005 | 4.0x |
| stdlib json | 1,228 | 919 | 2,147 | 4.2x |
msgspec with Struct schemas is 4.2x faster than stdlib json for the round-trip, and it's doing more work -- it's also validating the data against a schema.
The memory story is even more dramatic. When decoding a large 77 MiB JSON file:
| Library | Peak Memory (MiB) | Decode Time (ms) |
|---|---|---|
| msgspec (Struct) | 67.6 | 176.8 |
| msgspec (no schema) | 218.3 | 630.5 |
| stdlib json | 295.0 | 868.6 |
| orjson | 406.3 | 691.7 |
Decoding a 77 MiB file with a peak of 67.6 MiB means msgspec needs less memory than the raw file itself. The Struct schema lets it match field names during parsing instead of allocating them as dict keys on every decoded object.
Here's what working with msgspec looks like:
```python
import msgspec
from typing import Optional


class Address(msgspec.Struct):
    street: str
    city: str
    country: str = "US"


class User(msgspec.Struct):
    name: str
    email: str
    age: int
    tags: list[str] = []
    address: Optional[Address] = None


# Reusable encoder/decoder instances amortize setup cost
encoder = msgspec.json.Encoder()
decoder = msgspec.json.Decoder(User)

user = User(
    name="Alice",
    email="alice@example.com",
    age=30,
    tags=["developer", "python"],
    address=Address(street="123 Main St", city="Portland"),
)

data = encoder.encode(user)
parsed = decoder.decode(data)
assert isinstance(parsed, User)

# MessagePack -- same API, even faster
mp_encoder = msgspec.msgpack.Encoder()
mp_decoder = msgspec.msgpack.Decoder(User)
mp_data = mp_encoder.encode(user)
parsed_mp = mp_decoder.decode(mp_data)
```
Head-to-Head: The Complete Benchmarks
| Approach | Encode Speed | Decode Speed | Validation | Wire Size | Zero-Copy | Ergonomics |
|---|---|---|---|---|---|---|
| stdlib json | 1.0x (baseline) | 1.0x (baseline) | None | Large | No | Excellent |
| orjson | 6.8x faster | 2.0x faster | None | Large | No | Good |
| msgspec (JSON) | 8.8x faster | 2.5x faster | Built-in | Large | Partial | Excellent |
| msgspec (msgpack) | 11.0x faster | 2.6x faster | Built-in | Medium | Partial | Excellent |
| protobuf (upb) | 2-3x faster | 3-4x faster | Schema-based | Small | No | Fair |
| FlatBuffers | 1.2x slower | 750x+ faster* | Schema-based | Medium-Large | Yes | Poor |
*FlatBuffers "decode" speed reflects accessing a single field from an already-received buffer.
When to Use What: A Decision Framework
Choose stdlib json when: You're writing scripts, not services. Your payloads are tiny and infrequent.
Choose orjson when: You need a zero-effort speedup on existing JSON code. You can't change your data model layer.
Choose msgspec when: You're building a new service. You want validation and serialization in one step. Memory efficiency matters. You're processing high-throughput message streams.
Choose protobuf when: You need cross-language serialization. Wire size is a primary constraint. You're using gRPC.
Choose FlatBuffers when: Read-heavy, write-rare pattern on large binary data. Sub-microsecond deserialization latency is a genuine requirement.
For the Kafka service I mentioned at the beginning, I went with msgspec using MessagePack encoding. CPU usage dropped from 92% to 34%. P99 latency went from 12ms to 3ms. Memory usage per worker dropped from 480MB to 190MB.
What's Coming Next
The Python zero-copy serialization space is evolving fast. Python 3.13's free-threaded mode changes the calculus for multi-threaded workers. msgspec's C extensions are already thread-safe, and early benchmarks suggest 3-4x throughput improvements for multi-threaded msgspec decode.
Arrow IPC and zero-copy columnar formats are increasingly relevant for data-heavy Python services. Apache Arrow's IPC format gives you zero-copy access to columnar data and plays beautifully with Pandas, Polars, and DuckDB.
The bottom line: serialization performance in Python is no longer a niche concern. With msgspec available as a pip install, every Python developer can get order-of-magnitude improvements in serialization throughput and memory efficiency. The days of accepting json.loads() as "fast enough" should be behind us. Your profiler will thank you.