Post-Quantum TLS: Migrating Service-to-Service Traffic to ML-KEM

The cryptographic foundations of the modern internet are built on a fragile assumption: that certain mathematical problems, like factoring large integers (RSA) or computing discrete logarithms on elliptic curves (ECC), are computationally infeasible. Against classical computers, this assumption has held for decades. Shor's algorithm, however, solves both problems in polynomial time on a sufficiently large quantum computer, which puts an expiration date on these algorithms.

While a Cryptographically Relevant Quantum Computer (CRQC) capable of breaking RSA-2048 or X25519 does not yet exist, the threat is not theoretical. It is temporal. This has led to the rise of 'Harvest Now, Decrypt Later' (HNDL) attacks, where adversaries capture encrypted traffic today with the intention of decrypting it once quantum hardware scales. For long-lived service-to-service communication and sensitive data backhauls, the migration to Post-Quantum Cryptography (PQC) must begin now.

In this article, we will explore the transition from classical key exchange to ML-KEM (Module-Lattice-Based Key-Encapsulation Mechanism), the standardized successor to CRYSTALS-Kyber, and how to implement it within your TLS infrastructure.

The NIST Selection: Why ML-KEM?

After a multi-year competition, the National Institute of Standards and Technology (NIST) finalized its first set of PQC standards in August 2024. Among them, FIPS 203 defines ML-KEM, derived from CRYSTALS-Kyber, as the primary standard for general-purpose encryption and key establishment.

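ML-KEM belongs to a family of algorithms based on the 'Learning with Errors' (LWE) problem over module lattices (Module-LWE). Where RSA and ECC rely on factoring and discrete logarithms, ML-KEM relies on the difficulty of recovering a secret from noisy linear equations, a problem closely related to finding short vectors in high-dimensional lattices. No known classical or quantum algorithm solves it efficiently at the standardized parameter sizes, although, as with RSA, this hardness is a well-studied conjecture rather than a proven fact.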
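
For readers who want the underlying problem stated concretely, Module-LWE can be written as below. The parameter values are the ones FIPS 203 specifies for ML-KEM-768; the notation is a common textbook form, not the standard's exact presentation.

```latex
\[
R_q = \mathbb{Z}_q[X]/(X^{n}+1), \qquad n = 256,\quad q = 3329,\quad k = 3
\]
\[
\mathbf{t} = \mathbf{A}\,\mathbf{s} + \mathbf{e},
\qquad \mathbf{A} \in R_q^{k \times k},\quad
\mathbf{s}, \mathbf{e} \in R_q^{k} \text{ with small coefficients}
\]
```

The public key is essentially the pair (A, t). Recovering the secret s from it means solving these equations despite the unknown error term e, which is what makes the problem hard.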

Key Advantages of ML-KEM:

Efficiency: ML-KEM is remarkably fast. In many benchmarks, it outperforms classical ECDH in terms of CPU cycles for encapsulation and decapsulation.
Security Margins: It offers high security levels (ML-KEM-768 targets NIST security category 3, roughly equivalent to AES-192) with relatively small key and ciphertext sizes compared to other PQC candidates.
Maturity: It was among the most heavily scrutinized candidates during the NIST competition, which makes it a conservative choice for production environments.

The Hybrid Approach: Safety in Transition

We are currently in a transition period. While ML-KEM is mathematically robust, its implementation in software is still relatively new. We don't want to discard the battle-tested security of Elliptic Curve Cryptography (ECC) just yet.

The industry standard for migrating to PQC is the Hybrid Key Exchange. Instead of replacing X25519 or P-256, we combine them with ML-KEM. In a hybrid handshake, the client and server perform two key exchanges in the same handshake, one classical and one post-quantum. The two resulting shared secrets are concatenated and fed into the TLS 1.3 key schedule, whose HKDF-based Key Derivation Function (KDF) derives the session keys from both.

This ensures that if ML-KEM were somehow found to have a flaw, the connection is still as secure as traditional ECC. If ECC is broken by a quantum computer, the ML-KEM layer provides the necessary protection.
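
The combination step can be sketched in a few lines of Go. This is an illustration under stated assumptions, not the TLS key schedule itself: the function names are hypothetical, the secrets are placeholders, and the zero salt stands in for the value TLS 1.3 derives from earlier handshake state. The concatenation order (ML-KEM secret first, then X25519) follows the X25519MLKEM768 definition.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// combineSecrets concatenates the two shared secrets in the order used by
// X25519MLKEM768: ML-KEM secret first, X25519 secret second.
func combineSecrets(mlkemSecret, x25519Secret []byte) []byte {
	combined := make([]byte, 0, len(mlkemSecret)+len(x25519Secret))
	combined = append(combined, mlkemSecret...)
	combined = append(combined, x25519Secret...)
	return combined
}

// hkdfExtract is HKDF-Extract from RFC 5869 with SHA-256, i.e. HMAC(salt, ikm).
func hkdfExtract(salt, ikm []byte) []byte {
	mac := hmac.New(sha256.New, salt)
	mac.Write(ikm)
	return mac.Sum(nil)
}

func main() {
	// Placeholder 32-byte secrets. A real handshake obtains these from
	// ML-KEM decapsulation and from an X25519 exchange.
	mlkemSecret := make([]byte, 32)
	x25519Secret := make([]byte, 32)
	for i := range mlkemSecret {
		mlkemSecret[i] = 0xAA
		x25519Secret[i] = 0xBB
	}

	combined := combineSecrets(mlkemSecret, x25519Secret)

	// Illustrative all-zero salt; TLS 1.3 uses a derived value here.
	prk := hkdfExtract(make([]byte, sha256.Size), combined)
	fmt.Printf("combined: %d bytes, extracted key: %d bytes\n", len(combined), len(prk))
}
```

Because both secrets feed the same extraction step, an attacker has to recover both of them to reconstruct the output, which is the property the hybrid design relies on.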

Implementing ML-KEM in the TLS Handshake

To implement ML-KEM for service-to-service communication (e.g., between a microservice and a database, or between two internal APIs), you need to update your TLS stack to support the new hybrid groups. The most common identifier currently in use is X25519MLKEM768 (codepoint 0x11EC).

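1. Language Support: Go 1.24+

Go has been a frontrunner in PQC adoption. Go 1.23 shipped the pre-standard X25519Kyber768Draft00 hybrid, enabled by default. Go 1.24 replaced it with the standardized X25519MLKEM768 group: crypto/tls exposes it as tls.X25519MLKEM768 and offers it by default whenever Config.CurvePreferences is left unset (it can be switched off with GODEBUG=tlsmlkem=0). If you set CurvePreferences explicitly, you must list the hybrid group yourself.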

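// Example: Configuring a Go TLS Server with ML-KEM (requires Go 1.24+)
package main

import (
	"crypto/tls"
	"log"
	"net/http"
)

func main() {
	// Load the server certificate and key; fail fast if they are missing or invalid.
	cert, err := tls.LoadX509KeyPair("server.crt", "server.key")
	if err != nil {
		log.Fatalf("loading key pair: %v", err)
	}

	config := &tls.Config{
		Certificates: []tls.Certificate{cert},
		// Allow the X25519 + ML-KEM-768 hybrid alongside classical groups
		// for peers that do not support PQC yet.
		CurvePreferences: []tls.CurveID{
			tls.X25519MLKEM768, // codepoint 0x11EC
			tls.X25519,
			tls.CurveP256,
		},
		// The hybrid group is only defined for TLS 1.3.
		MinVersion: tls.VersionTLS13,
	}

	server := &http.Server{
		Addr:      ":443",
		TLSConfig: config,
	}

	// Empty file arguments: the certificate comes from TLSConfig.Certificates.
	log.Fatal(server.ListenAndServeTLS("", ""))
}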

2. Infrastructure Support: Envoy and Nginx

If you use a service mesh like Istio or Linkerd, or a reverse proxy like Envoy or Nginx, the migration happens at the proxy level.

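Envoy: Envoy delegates TLS to BoringSSL, which added X25519MLKEM768 in late 2024, so you need an Envoy release built against a BoringSSL that includes it (check the release notes for your version). You can then enable the group on a listener's downstream TLS context:

transport_socket:
  name: envoy.transport_sockets.tls
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
    common_tls_context:
      tls_params:
        ecdh_curves:
          - "X25519MLKEM768"
          - "X25519"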

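Nginx: Nginx inherits its key exchange groups from the OpenSSL it is linked against. OpenSSL 3.5 and later support ML-KEM and the X25519MLKEM768 group natively; on older OpenSSL 3.x builds you need the oqs-provider (Open Quantum Safe). With a capable OpenSSL, set the groups through the ssl_ecdh_curve directive, for example ssl_ecdh_curve X25519MLKEM768:X25519;.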

Performance Considerations and the "MTU Problem"

While ML-KEM is computationally efficient, it introduces a significant change in network payload sizes.

X25519 Public Key (client key share): 32 bytes
ML-KEM-768 Encapsulation (Public) Key: 1,184 bytes
X25519 Public Key (server key share): 32 bytes
ML-KEM-768 Ciphertext: 1,088 bytes

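In a standard TLS 1.3 handshake, the ClientHello message contains the key shares. The hybrid X25519MLKEM768 share is 1,216 bytes (1,184 + 32), so the ClientHello typically no longer fits in a single TCP segment on a path with the standard Ethernet MTU (Maximum Transmission Unit) of 1,500 bytes and is split across two or more segments.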
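
The arithmetic behind this can be made explicit. The sketch below is a back-of-the-envelope calculation, not a packet parser: the 300 bytes assumed for the rest of the ClientHello (SNI, ALPN, signature algorithms, and so on) is an illustrative guess that varies by client, and the header sizes assume IPv4 and TCP without options.

```go
package main

import "fmt"

const (
	x25519ShareSize    = 32   // X25519 public value, same size in both directions
	mlkem768EncapKey   = 1184 // ML-KEM-768 encapsulation (public) key
	mlkem768Ciphertext = 1088 // ML-KEM-768 ciphertext
	ethernetMTU        = 1500
	ipv4HeaderSize     = 20 // without options
	tcpHeaderSize      = 20 // without options
)

// clientShareSize is the size of the hybrid key share sent in ClientHello.
func clientShareSize() int { return mlkem768EncapKey + x25519ShareSize }

// serverShareSize is the size of the hybrid key share sent in ServerHello.
func serverShareSize() int { return mlkem768Ciphertext + x25519ShareSize }

// maxSegmentPayload is the TCP payload that fits in one Ethernet frame.
func maxSegmentPayload() int { return ethernetMTU - ipv4HeaderSize - tcpHeaderSize }

// segmentsNeeded returns how many full-size TCP segments a message occupies.
func segmentsNeeded(messageBytes int) int {
	mss := maxSegmentPayload()
	return (messageBytes + mss - 1) / mss
}

func main() {
	// Assumption: roughly 300 bytes of ClientHello besides the key share.
	const otherClientHelloBytes = 300
	hello := clientShareSize() + otherClientHelloBytes

	fmt.Printf("client key share: %d bytes\n", clientShareSize())
	fmt.Printf("server key share: %d bytes\n", serverShareSize())
	fmt.Printf("estimated ClientHello: %d bytes -> %d TCP segment(s)\n",
		hello, segmentsNeeded(hello))
}
```

Under these assumptions the ClientHello is about 1,516 bytes against a 1,460-byte segment payload, so it spills into a second segment.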

The Fragmentation Risk

Over TCP, an oversized ClientHello is not fragmented at the IP layer; it is split across multiple TCP segments. The risk comes from poorly implemented middleboxes, firewalls, or load balancers that assume the entire ClientHello arrives in one packet and drop or reset the connection when it does not.

Actionable Advice: Before rolling out ML-KEM across your entire VPC, perform a "canary" test to ensure your network infrastructure handles larger ClientHello messages. Monitor for TLS handshake timeouts and connection resets, which are often the first symptom of a middlebox mishandling a split ClientHello.

Observability: How to Know You’re Secure

Migrating to PQC is a silent upgrade. Your services will continue to work, but how do you verify that the traffic is actually using ML-KEM?

You should update your telemetry to capture the negotiated TLS cipher suite and the Key Exchange Group. In Prometheus/Grafana, you should track:

tls_handshake_group_total: A counter partitioned by the group (e.g., x25519, x25519_mlkem768).
tls_handshake_duration_seconds: To ensure the larger keys aren't significantly impacting latency in your specific environment.

If you see a high percentage of x25519 and zero x25519_mlkem768, it means your clients or servers are falling back to classical cryptography, likely due to a configuration mismatch or an outdated library.

Strategic Rollout: A Step-by-Step Plan

Migrating an entire microservice architecture is a daunting task. I recommend a phased approach:

Phase 1: The Inventory

Identify all services that handle sensitive, long-lived data. This includes databases, message brokers (Kafka/RabbitMQ), and internal API gateways. These are your high-priority targets for HNDL protection.

Phase 2: Internal Tooling and SDKs

Update your base Docker images and shared communication libraries (e.g., a shared gRPC wrapper) to use a PQC-ready version of your language's runtime. For Go, this is 1.24+; for Java, look at Bouncy Castle's PQC offerings; for Rust, use rustls with the aws-lc-rs crypto provider.

Phase 3: The Hybrid Canary

Enable hybrid key exchange on a single, non-critical service-to-service link. Monitor for latency spikes and packet loss. This phase is about testing your network's tolerance for larger handshake packets.

Phase 4: Enforced Policy

Once confident, update your infrastructure-as-code (Terraform/Pulumi) to set the preferred curve to X25519MLKEM768 across the fleet.

Conclusion

The migration to Post-Quantum Cryptography is not an optional security hardening exercise; it is a fundamental requirement for maintaining data privacy in the coming decade. ML-KEM (Kyber) provides a robust, standardized path forward. By adopting a hybrid approach today, you protect your services against future quantum threats without sacrificing the proven security of elliptic curves.

Next Steps:

Audit your TLS stack: Check if your current language runtimes and proxies (Envoy/Nginx) support the X25519MLKEM768 group.
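Test large handshakes: Run real TLS handshakes that offer the hybrid group (for example, openssl s_client -groups X25519MLKEM768 with OpenSSL 3.5+) and capture them with tcpdump to verify that a ClientHello spanning multiple TCP segments gets through. A plain ping -s only exercises ICMP and IP fragmentation, not the TCP path a handshake takes.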
Enable Hybrid by Default: Start with internal service-to-service traffic where you have control over both ends of the connection.