Part VI · Distributed Systems & Request Lifecycle
Chapter 83 ~22 min read

Metering and billing pipelines

"If you can't meter it, you can't charge for it. If you can't charge for it, you can't build a business on it. This is an engineering problem, not a business problem"

Every ML product that charges per usage is fundamentally a metering pipeline attached to a serving stack. A user sends a request, the serving stack counts something — tokens, seconds, dollars, GPU-hours — and that count must travel reliably from the data plane to the invoice. The path is longer than it looks, full of dropped events, double-counted retries, clock-skewed aggregations, and subtle discrepancies that resolve to angry customer tickets at the end of the month. Getting this right is what separates platforms that scale from platforms that crumble at the first serious customer.

After this chapter the metering pipeline is fully mapped. The sidecar metering pattern. The event protocol. Central aggregation. The commercial metering platforms (Amberflo, Stripe Metering, Metronome) and what they replace. Idempotent billing events and why they matter. How per-token billing for LLM products specifically flows from “token emitted in vLLM” to “charge on a credit card.” This chapter leans on Chapter 82 (operations as the unit of billing) and connects forward to Chapter 84 (Kafka as the firehose the metering pipeline rides on).

Outline:

  1. The billing data path.
  2. Sidecar metering as the collection pattern.
  3. The usage event protocol.
  4. Central aggregation and deduplication.
  5. Idempotent billing events.
  6. Per-token LLM billing in detail.
  7. Commercial metering platforms.
  8. Rating, aggregation, and invoicing.
  9. Reconciliation and the audit problem.
  10. The mental model.

83.1 The billing data path

The path from “the user uses something” to “the user pays for it” has seven layers.

[Figure: the billing data path. Raw usage events flow from the data plane (vLLM emits raw counters) through the sidecar (buffer, batch, retry), the Kafka firehose (durable transport), the aggregator (dedup, bucket, sum), and rating (counters × price book) to the invoice and payment. A parallel copy of the raw events lands in an S3 archive with 7-year retention for audit and replay.]
Raw events are the source of truth; pricing is applied late in the rating layer so a price change can be applied retroactively by re-running rating against the archived events.

Every layer is a separate service with its own failure modes. The data plane generates events at request time. The metering collector (often a sidecar) receives structured counters. The transport carries them durably to central aggregation. The aggregator does the per-customer rollups. Rating applies prices. Billing generates invoices. Payment collects money.

The reason to know the full path is that bugs at every layer look identical to the customer (“my bill is wrong”) but are fixed in totally different places. A bug in the data plane (wrong token count) cannot be fixed in billing after the fact without a reconciliation. A bug in the aggregator (double-counted retries) is fixable at the aggregator but leaves inconsistent history. A bug in rating (wrong price tier applied) is fixable in rating with a replay, if the events are still available.

The design principle is: the data plane emits raw events, not pre-calculated charges. Dollars are computed late, in a stage that owns pricing. The data plane just says “this request used 12 tokens for model X at time T for tenant Y.” If the pricing for model X changes, old events are re-rated; the raw event is the source of truth.

83.2 Sidecar metering as the collection pattern

The collection layer needs to be close to the data plane (so it sees every request) but decoupled from it (so the billing system’s problems don’t crash the serving stack). The standard pattern is a sidecar.

A sidecar is a process that runs next to the main workload — in Kubernetes, it is a second container in the same pod — and receives events from the main process via a local mechanism (Unix socket, localhost HTTP, shared memory). The main process calls “log usage” on the sidecar and returns immediately; the sidecar is responsible for buffering, batching, retrying, and delivering the event downstream.

Why this shape:

  • Decoupling. A bug in the sidecar (or in the downstream billing service) does not affect the main process. Even if the sidecar crashes, the main process continues serving traffic.
  • Locality. The sidecar sees exactly the same request context the main process sees — same trace id, same user, same timing. It can also add host-level context (instance id, region, GPU model) that the main process doesn’t know about.
  • Independent scaling. The sidecar is a thin, stateless component. It can be updated independently of the serving stack without rebuilding vLLM.
  • Language independence. The main process writes in whatever (Python, Rust, C++); the sidecar is written in whatever the metering team uses.

The main process emits events on a hot path. The event emission must be cheap — ideally sub-millisecond. The standard approach:

# Main process
usage_client.log({
    "request_id": req_id,
    "tenant_id": tenant,
    "operation": "chat.completion",
    "model": "llama-3-70b",
    "input_tokens": prompt_len,
    "output_tokens": completion_len,
    "started_at": t0,
    "completed_at": t1,
})

The log call writes to a local buffer (in-memory queue, Unix socket) and returns. The sidecar batches buffered events, typically every 100-500ms or every N events, and POSTs them to the central aggregator. If the aggregator is down, the sidecar retries with backoff. If the retry buffer fills, the sidecar spills to local disk. If local disk fills, the sidecar drops and alerts — this is the only failure mode that loses data, and it must be loud.
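That delivery loop can be sketched in a few dozen lines. This is a minimal illustration, not a production sidecar: `send` is a hypothetical callable standing in for the HTTP POST to the central aggregator (returning True on success), and a real implementation would run `flush` on a background thread and replay spilled files once the aggregator recovers.

```python
import collections
import json
import os
import time

class MeterSidecar:
    # Sketch of the sidecar delivery loop: buffer, batch, retry with
    # backoff, spill to local disk. `send` is a stand-in for the HTTP
    # POST to the central aggregator (hypothetical).
    def __init__(self, send, spill_dir, max_batch=500,
                 max_buffer=50_000, max_retries=5, backoff_s=1.0):
        self.send = send
        self.spill_dir = spill_dir
        self.max_batch = max_batch
        self.max_buffer = max_buffer
        self.max_retries = max_retries
        self.backoff_s = backoff_s
        self.buffer = collections.deque()

    def log(self, event):
        # Hot path: enqueue and return immediately.
        if len(self.buffer) >= self.max_buffer:
            self._spill(list(self.buffer))   # buffer full: spill to disk
            self.buffer.clear()
        self.buffer.append(event)

    def flush(self):
        # Called every 100-500 ms, or when max_batch events accumulate.
        while self.buffer:
            n = min(self.max_batch, len(self.buffer))
            batch = [self.buffer.popleft() for _ in range(n)]
            for attempt in range(self.max_retries):
                if self.send(batch):
                    break
                time.sleep(self.backoff_s * 2 ** attempt)
            else:
                self._spill(batch)           # delivery failed: spill to disk

    def _spill(self, batch):
        # Last resort before dropping; if this disk fills too, drop + alert.
        os.makedirs(self.spill_dir, exist_ok=True)
        path = os.path.join(self.spill_dir, f"{time.time_ns()}.jsonl")
        with open(path, "w") as f:
            for event in batch:
                f.write(json.dumps(event) + "\n")
```

The main process only ever touches `log`; everything after the deque is the sidecar's problem, which is exactly the decoupling the pattern is for.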

Alternative: emit events straight to Kafka from the main process, no sidecar. This works at smaller scale but couples the main process’s latency to the Kafka producer’s latency, and Kafka producer libraries have their own quirks (partitioning, compression, batch flushes). The sidecar pattern absorbs those concerns in a dedicated component.

83.3 The usage event protocol

The event schema is the contract between the data plane and the billing system. Getting it right matters a lot because schemas are expensive to change once in production.

A reasonable schema:

{
  "event_id": "evt_01HX8ZKJA3P5T4YNCNDW34Q9VX",
  "event_time": "2026-04-10T12:34:56.789Z",
  "tenant_id": "acme-corp",
  "user_id": "user_abc123",
  "operation_id": "op_01HX8ZKJA...",
  "resource": "chat.completion",
  "model": "llama-3-70b-instruct",
  "region": "us-west-2",
  "counters": {
    "input_tokens": 1247,
    "output_tokens": 389,
    "cached_input_tokens": 900,
    "duration_ms": 4230
  },
  "metadata": {
    "trace_id": "abcd1234...",
    "request_id": "req_...",
    "hostname": "vllm-789xyz",
    "api_version": "2026-01-01"
  },
  "schema_version": "2"
}

The important properties:

event_id is globally unique and stable. The data plane generates it at event time and it does not change. Downstream dedup relies on it. Use ULIDs or UUIDs.

event_time is the time the work happened, not the time the event was delivered. The sidecar may deliver an event hours later after a network partition. The billing period that applies is determined by event_time, not delivery time.

counters is a map of numeric dimensions. Input tokens, output tokens, duration, cached tokens (for prefix caching), image count, audio seconds — whatever the product meters. Each counter is a signed integer or float.

tenant_id is the billing axis. Every event must have one. This is the primary key for aggregation.

resource names what was consumed. Chat completion, embedding, fine-tune, image generation, storage. The rating system uses this to find the price.

schema_version tracks schema evolution. Adding fields is usually forward-compatible; removing or renaming is not. The version lets aggregators handle multiple concurrent schema versions during rollouts.

The schema should be small (sub-kilobyte per event) so the firehose doesn’t become the bottleneck. The metadata fields are for debugging and trace correlation; they should not influence billing.

Serialization: JSON is universal but expensive. Protobuf or Avro is standard in large-scale pipelines because of the schema-registry discipline (Chapter 84). Pick based on volume.

83.4 Central aggregation and deduplication

The aggregator receives events from the firehose and rolls them up per tenant, per resource, per time bucket. Its responsibilities:

Deduplication. The same event_id can arrive more than once because of retries, at-least-once delivery semantics (Chapter 84), and sidecar restarts. The aggregator keeps a deduplication window (typically hours to days) and drops duplicates. Bloom filters, Redis sets with TTLs, or a dedup table in the event store all work.

Time bucketing. Events are grouped into buckets (minute, hour, day) for rollups. The bucket is derived from event_time, not from delivery time, which means events can arrive “late” for an already-closed bucket. The aggregator must handle late-arriving events — usually by re-opening the bucket for events that are only hours late and rejecting events that arrive days past the bucket’s close.

Per-counter sums. For each (tenant, resource, bucket), sum each counter. This is the primary storage: “acme-corp used 4.2M input tokens of llama-3-70b in the 12:00-13:00 bucket on 2026-04-10.”

Persistence. Store the rolled-up totals durably. Typical choice is a columnar store (ClickHouse, BigQuery, Druid) or a time-series database (TimescaleDB, InfluxDB). The raw events can also be kept (in object storage, partitioned by date) for audit and re-aggregation.

Streaming vs batch. Two designs:

  • Streaming aggregator (Flink, Beam, Kafka Streams). Consumes the firehose continuously, maintains in-memory rollups with periodic checkpoints, flushes to storage. Low latency (near-real-time dashboards), more complex to operate.
  • Batch aggregator. Drop raw events into object storage, run a Spark/Dataflow/BigQuery job every 5-30 minutes to aggregate. Simpler, higher latency, cheaper.

For billing, batch is usually enough — invoices are monthly, not real-time. For in-app usage dashboards and quota enforcement, streaming matters. A typical platform runs both: streaming aggregator for live counters, batch aggregator as the authoritative record.

The deduplication step is the one most teams get wrong. They assume Kafka’s “exactly-once” is actually exactly-once end-to-end and skip the dedup. Kafka’s exactly-once is within-Kafka only; once the sidecar is in the picture, with its retry buffers, you need application-level dedup. The event_id + dedup window is the right answer.
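Under those constraints, the aggregator core reduces to a few lines. A sketch with an in-memory seen-set and hourly buckets — in production the seen-set would be Redis or a Bloom filter with an expiry, and the totals a columnar store:

```python
import datetime

class Aggregator:
    # Sketch of the aggregator core: dedup by event_id, bucket by
    # event_time, sum counters per (tenant, resource, bucket).
    def __init__(self):
        self.seen = set()    # dedup window (in-memory stand-in; grows forever here)
        self.totals = {}     # (tenant, resource, bucket_start) -> {counter: sum}

    def ingest(self, event):
        # Application-level dedup: at-least-once delivery means the same
        # event_id can arrive more than once.
        if event["event_id"] in self.seen:
            return False
        self.seen.add(event["event_id"])
        # Bucket by event_time (when the work happened), never delivery time.
        t = datetime.datetime.fromisoformat(
            event["event_time"].replace("Z", "+00:00"))
        bucket = t.replace(minute=0, second=0, microsecond=0)
        key = (event["tenant_id"], event["resource"], bucket)
        sums = self.totals.setdefault(key, {})
        for counter, value in event["counters"].items():
            sums[counter] = sums.get(counter, 0) + value
        return True
```

Redelivering the same event is a no-op, which is the whole point: the firehose is free to retry as aggressively as it likes.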

83.5 Idempotent billing events

Billing is an area where double-counting is worse than dropping — you can’t bill a customer twice for the same token, even if that means occasionally losing an event. The discipline is end-to-end idempotency.

The chain of idempotency:

  1. Emission. The data plane generates an event_id for each logical usage event. If the sidecar retries the same event, the event_id is the same.
  2. Transport. The event carries its event_id through the firehose.
  3. Aggregation. The aggregator drops duplicates by event_id within the dedup window.
  4. Rating. The rating stage converts (tenant, resource, counters) totals into dollar amounts. Idempotent by construction — re-running it produces the same dollars.
  5. Invoice. The billing system issues an invoice per billing period. The invoice has its own idempotency key (usually period + tenant) so re-running the invoicing job does not duplicate invoices.
  6. Payment. The payment system uses an idempotency key (Chapter 78) so retrying a charge does not double-charge.

At every layer, the operation that adds information (incrementing a counter, issuing an invoice, charging a card) must be idempotent under retries. Without this discipline, a network hiccup or a restart produces double charges, and you spend your weekend writing apology emails and issuing refunds.
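The invoice-level key from step 5 is the simplest link in the chain to sketch. Here `store` is a stand-in for the billing database; the point is that re-running the job is a no-op:

```python
def issue_invoice(store, tenant_id, period, line_items):
    # Idempotency key: (tenant, billing period). Re-running the monthly
    # invoicing job after a crash returns the existing invoice instead
    # of creating a second one.
    key = (tenant_id, period)
    if key in store:
        return store[key]
    invoice = {
        "tenant_id": tenant_id,
        "period": period,
        "total": round(sum(item["amount"] for item in line_items), 2),
        "line_items": line_items,
    }
    store[key] = invoice
    return invoice
```

Against a real database this becomes an INSERT guarded by a unique index on (tenant_id, period) — same idea, with the uniqueness enforced by the store instead of the code.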

The most subtle trap: retries of the original request (not retries of the event emission). Consider an LLM completion that times out client-side, the client retries, and the second attempt actually succeeds. Both attempts generated real work and real usage events. Do you bill for both, or only the “successful” one?

The common answer: bill for completed work. If the server actually ran the request and produced output, it bills, regardless of whether the client saw the response. This is controversial (customers dislike it) but it matches reality — the server spent GPU seconds, and whoever retried should carry the cost. Some platforms are more generous and do not bill for requests that returned 5xx to the client; this is a business decision, not a technical constraint, but it requires knowing at event time whether the request “succeeded” in the client’s view. Usually you log both and let the rating layer apply the policy.

83.6 Per-token LLM billing in detail

Per-token billing for LLMs has specific quirks because the cost model has multiple dimensions.

[Figure: LLM token billing dimensions. One request produces one usage event carrying all the counter fields: cached_input_tokens (900 tokens, at roughly 0.1× the input rate, e.g. $0.15/Mtok), input_tokens (347 tokens, e.g. $1.50/Mtok), output_tokens (389 tokens, e.g. $6.00/Mtok), and internal reasoning_tokens at the model's rate. Input, cached input, output, and reasoning tokens each carry a different price per token.]
Cached tokens can be 10× cheaper than novel input tokens; the usage event must carry all dimensions so the rating layer applies the correct price to each.

The dimensions:

  • Input tokens: prompt + system prompt + chat history. Usually cheaper per token than output because prefill is compute-bound and parallelizable.
  • Output tokens: generated response. Usually more expensive per token because decode is memory-bandwidth-bound and sequential.
  • Cached input tokens: tokens that hit the prefix cache (Chapter 29). Usually discounted to 10-50% of normal input price because the server skipped most of the work.
  • Cached write: in some APIs, there’s a small premium for writing to the cache (Anthropic’s model), offset by the savings on reads.
  • Image tokens: images are tokenized into N tokens based on size and model (e.g., 170 tokens per 512x512 tile for some vision models). Priced per token like text.
  • Audio seconds: for audio models, priced per second of audio.
  • Reasoning tokens: for “thinking” models (o1-style), the internal reasoning tokens count against billing even though the user doesn’t see them.

The event needs to carry all these counters so rating can apply the right price to each. A single chat completion might produce an event with 7+ counters: input, output, cached-input, cached-write, image tokens, reasoning tokens, duration.
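Rating such an event is then a per-dimension multiply. A sketch with hypothetical per-million-token prices — the real price book is versioned per model, region, and tier (Section 83.8):

```python
# Hypothetical prices per million tokens for one model.
PRICES_PER_MTOK = {
    "input_tokens": 1.50,
    "cached_input_tokens": 0.15,   # ~10x cheaper: the prefill was skipped
    "output_tokens": 6.00,
    "reasoning_tokens": 6.00,      # often billed at the output rate
}

def rate_counters(counters, prices=PRICES_PER_MTOK):
    # Apply each dimension's price; counters the price book does not
    # know about (duration_ms, trace metadata) simply do not bill.
    dollars = 0.0
    for counter, count in counters.items():
        if counter in prices:
            dollars += count / 1_000_000 * prices[counter]
    return dollars
```

For the figure's example request (347 input, 900 cached, 389 output tokens), this comes to roughly $0.003 — a reminder of why aggregation happens before rating: individual events are fractions of a cent.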

Where the counters come from in vLLM:

vLLM (and similar inference servers) emits per-request metrics: num_input_tokens, num_output_tokens, time spent in prefill, time spent in decode. The sidecar reads these at request completion and emits a usage event. The response body also carries the counters back to the caller in a usage field:

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 1247,
    "completion_tokens": 389,
    "total_tokens": 1636,
    "cached_tokens": 900
  }
}

The gateway reads this and forwards it to the billing sidecar. This closes the loop: the counters that appear on the invoice are the same counters returned to the caller.

Cached token billing specifically is worth understanding because it is a load-bearing pricing mechanism. If a user sends the same 10k-token system prompt on every request, prefix caching (Chapter 29) means the server only does the prefill work for the novel tail of the prompt. The “cached tokens” counter is those 10k tokens; the “input tokens” counter is just the novel tail. The rating system charges cached tokens at the discounted rate and input tokens at the full rate. From a user perspective, this makes long system prompts affordable.

The catch with cached tokens is that the cache is per-replica and not guaranteed. A request might land on a replica that has the cache warm, or one that doesn’t. Billing that depended on “you will always pay the cached rate” would be a lie; billing that says “you pay cached rate when you hit the cache, we try hard to hit it” is honest. Most providers do the honest version and rely on prefix caching working well enough in aggregate.

Output token streaming and billing. Because output tokens are streamed (Chapter 79), the final token count is not known until the stream ends. Billing has to wait for the completion event to fire the usage record — it cannot emit per-token events, or it would overwhelm the pipeline. The pattern is: on stream completion, emit one event with the final counters. If the client canceled mid-stream, emit an event with the partial counts (because the server still did the work).
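A sketch of that pattern: wrap the token generator so that exactly one usage event fires at stream end, including on client cancellation. `emit_event` stands in for the sidecar's log call.

```python
def metered_stream(token_stream, emit_event, base_event):
    # Counts output tokens as they stream; emits exactly one usage
    # event when the stream ends.
    output_tokens = 0
    try:
        for token in token_stream:
            output_tokens += 1
            yield token
    finally:
        # Runs on normal completion AND on GeneratorExit (client
        # cancel). The server did the work either way, so partial
        # counts still bill.
        emit_event({**base_event,
                    "counters": {"output_tokens": output_tokens}})
```

The `finally` block is the load-bearing part: closing the generator mid-stream (what an HTTP disconnect does to the handler) still triggers the emission, with the partial count.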

83.7 Commercial metering platforms

Building a full metering and billing pipeline from scratch is a 6-12 month engineering effort. The commercial options exist to skip that effort.

Amberflo. Cloud-native usage metering, aggregation, pricing, and billing. You emit events via their SDK or API; they handle ingestion, dedup, aggregation, rating, and invoice generation. Integrates with Stripe for payment. Targeted at usage-based SaaS including AI platforms.

Stripe Billing + Metering (the “Stripe Meter” product). Stripe added usage-based metering in 2024-2025 that layers on top of the existing billing and subscription primitives. You push usage events (meter_events) to Stripe, Stripe aggregates them against a Meter resource, and the aggregated usage is attached to a subscription for invoice generation. The upside: if you are already on Stripe Billing, this is the least-friction option. The downside: limited flexibility in aggregation logic, best for simpler pricing models.

Metronome. Similar to Amberflo, B2B focused, strong on enterprise contract complexity (commits, overages, discounts, true-ups). Built by ex-Dropbox billing engineers. Used by several large AI platforms.

Chargebee, Recurly, Zuora. Older subscription billing systems with bolted-on metered billing. Adequate for some cases, not purpose-built.

Orb. Another recent usage-billing platform, similar shape to Metronome.

When to buy vs build: buy at early and mid stages. Commercial platforms solve the “pricing, invoicing, dunning, revenue recognition” side of billing, which is hundreds of person-months to build correctly. You still write the event emission side (the data plane is your product), but the downstream pipeline is theirs.

When to build: when the commercial platforms cannot model your pricing, or the per-event cost is prohibitive at your volume, or you have legal/compliance constraints that require keeping billing data in-house. Very large platforms (AWS, Google Cloud, the major AI labs) build their own, often iterating on something that started commercial and outgrew it.

The middle ground is “buy the billing system, build the metering pipeline.” Emit your own events into a Kafka firehose, run your own aggregator, push aggregated totals to the commercial billing system for invoicing. This splits the work at a natural boundary: you own the domain logic (what counts as usage), they own the financial logic (how it becomes an invoice).

83.8 Rating, aggregation, and invoicing

Rating is the step that turns “4.2M input tokens of llama-3-70b” into “$2,940.00.” It needs:

  • A price book: current prices per resource, per region, per tier. Usually a versioned database table. Price changes are rolled out as new rows with effective dates; old events use the price effective at their event time.
  • Tier logic: volume discounts (“over 10M tokens/month, price drops 20%”), committed-use discounts (“customer committed 50M tokens/month, first 50M at discounted rate”), free tiers, promotional credits.
  • Currency and FX: the platform might price in USD but bill in EUR / GBP / JPY. Exchange rates are usually applied at invoice time rather than event time, though either convention works if it is applied consistently.
  • Taxes: VAT, GST, sales tax. Usually handled by the billing system or a dedicated tax service (Avalara, Stripe Tax).
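The versioned price book lookup is worth making concrete, because it is what makes event_time (not delivery time) the billing-relevant timestamp. A sketch with hypothetical rows:

```python
import bisect
import datetime

UTC = datetime.timezone.utc

# Versioned price rows for one (resource, counter), sorted by
# effective date. Values are hypothetical $/Mtok.
PRICE_ROWS = [
    (datetime.datetime(2026, 1, 1, tzinfo=UTC), 2.00),
    (datetime.datetime(2026, 4, 1, tzinfo=UTC), 1.50),  # April price cut
]

def price_at(event_time, rows=PRICE_ROWS):
    # Latest row whose effective date is <= event_time. An event
    # stamped 2026-03-31T23:59:58Z that is delivered on April 2 still
    # rates at the March price, because event_time decides.
    i = bisect.bisect_right([r[0] for r in rows], event_time) - 1
    if i < 0:
        raise ValueError("no price effective at this time")
    return rows[i][1]
```

In a database this is the classic "effective-dated rows" pattern: new prices are new rows, old rows are never mutated, and re-rating archived events reproduces historical invoices exactly.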

Rating can run in real-time (streaming aggregator, live balance display) or batch (end-of-billing-period rating job, invoice generation). Most platforms run both: real-time for user-visible dashboards (“you’ve used $42 this month”), batch for the authoritative invoice.

Invoice generation takes the rated totals for a billing period, composes them into a line-item bill, applies credits and discounts, and ships it to the customer. For self-serve customers this is automated and payment is pulled from a card. For enterprise customers there is usually a human review step, a PO process, and NET-30/60/90 terms.

The rarely-discussed piece: credit memos and refunds. When a customer is over-billed (bug in metering, outage credit, goodwill adjustment), the billing system needs to issue a credit that offsets a future invoice or produces an actual refund. This is a whole sub-system and a whole class of bugs. Get it right early.

83.9 Reconciliation and the audit problem

Reconciliation is the process of verifying that what is billed matches what actually happened. Every mature billing pipeline runs periodic reconciliation jobs:

Data plane vs aggregator. Sum the raw events in the firehose (from object storage snapshots) and compare to the aggregator’s totals. Discrepancies indicate drops or double-counts.

[Figure: the billing reconciliation chain. Raw events (S3 archive), aggregator totals per tenant, rating dollar totals, and invoice line items are each summed independently and compared to their neighbor; a discrepancy at any comparison point localizes the buggy layer. Target: discrepancy below 0.1% of revenue at every comparison point.]
Each reconciliation comparison isolates the layer with the bug: a gap between raw events and aggregator totals means a drop or double-count in the aggregator, not in rating.

Aggregator vs rating. Verify that rating’s dollar totals match the aggregator’s counter totals times the price book.

Rating vs invoice. Verify that the invoice’s line items match rating’s output.

Invoice vs payment. Verify that collected payments match issued invoices.

Accounting reconciliation. The finance team has its own general ledger; it must match the billing system. This is a regulatory requirement (SOX and the like) at companies of a certain size.

Reconciliation produces a number: the discrepancy rate, expressed as a percentage of revenue or as absolute dollars. The target is well below 0.1% — above that and auditors, customers, or regulators start asking questions. Reconciliation discovers bugs that pre-commit tests and production monitoring missed, because the bugs only show up in aggregate over time.
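The data-plane-vs-aggregator comparison is the easiest one to sketch: re-sum the archived raw events independently and diff against the aggregator per (tenant, counter), flagging anything above the tolerance.

```python
def reconcile(raw_events, aggregator_totals, tolerance=0.001):
    # Independently re-sum the archived raw events (the ground truth)
    # and compare to the aggregator's totals per (tenant, counter).
    # Returns keys whose relative discrepancy exceeds the tolerance.
    truth = {}
    for event in raw_events:
        for counter, value in event["counters"].items():
            key = (event["tenant_id"], counter)
            truth[key] = truth.get(key, 0) + value
    discrepancies = {}
    for key, raw_sum in truth.items():
        agg_sum = aggregator_totals.get(key, 0)
        if raw_sum and abs(agg_sum - raw_sum) / raw_sum > tolerance:
            discrepancies[key] = {"raw": raw_sum, "aggregated": agg_sum}
    return discrepancies
```

In practice this runs as a scheduled batch job over a closed billing period, with the raw side read from the object-storage archive; the shape of the comparison is the same.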

The audit problem is: can you, at any point, tell a customer “here is every request you were billed for, here is the cost of each, here is the aggregated total on your invoice.” The answer must be yes, traceable, and verifiable. This requires keeping raw events (not just aggregates) for the audit window — usually 1-7 years depending on jurisdiction. Object storage is cheap enough that keeping compressed raw events for seven years is not a cost concern; losing them is a legal concern.

83.10 The mental model

Eight points to take into Chapter 84:

  1. The billing data path has seven stages: data plane → collector → transport → aggregator → rating → invoice → payment. Each is a separate service with separate failure modes.
  2. The sidecar metering pattern decouples the data plane from the billing system. Main process does the work; sidecar ships the usage events.
  3. The data plane emits raw counters, not dollars. Pricing is applied late in a dedicated rating stage.
  4. Every event has a stable event_id for deduplication end-to-end. At-least-once delivery plus application-level dedup, not “exactly-once.”
  5. Idempotency at every layer is non-negotiable. Double charges are worse than dropped events.
  6. LLM billing has multiple counter dimensions: input tokens, output tokens, cached tokens, image tokens, reasoning tokens, duration. The event schema must carry all of them.
  7. Commercial platforms (Stripe Metering, Amberflo, Metronome) are worth buying until you are very large. Build the metering pipeline; buy the billing platform.
  8. Reconciliation is mandatory. Raw events kept for the audit window; periodic jobs compare every layer; discrepancies investigated as bugs.

In Chapter 84, the firehose those usage events ride on: Kafka as the backbone of high-volume telemetry and the Part VI capstone.


Read it yourself

  • Amberflo’s “Usage-based pricing” engineering blog series — practical walkthroughs of metering pipelines.
  • Stripe’s Meter API documentation and the “Metered billing” pattern guide.
  • The OpenAI API Usage docs — a clean example of the per-token billing dimensions in production.
  • Anthropic’s prompt caching documentation — how cache write vs cache read billing works in production.
  • Martin Kleppmann, Designing Data-Intensive Applications, Chapter 11 (stream processing) — the deduplication and exactly-once framing underlies every metering pipeline.
  • The Google Cloud Billing Export to BigQuery documentation — a reference design for a reconciliation-ready billing store.

Practice

  1. Design the usage event schema for a product that charges per LLM chat completion with input, output, and cached tokens. What fields are required? What indexes would the aggregator need?
  2. Walk through the lifecycle of a single request: user sends a chat, 1200 input tokens, 400 output tokens, 900 of the input cached. Trace the usage event from vLLM to invoice. Where is the event stored at each stage?
  3. A customer reports “my bill is 3x what I expect.” Design the audit query that proves (or disproves) the billing system’s total. What tables do you query, what do you compare?
  4. A sidecar crashes and loses 30 seconds of events. What is the blast radius for billing? Which mitigations limit it?
  5. The price of llama-3-70b changed on 2026-04-01 at 00:00 UTC. An event arrives with event_time: 2026-03-31T23:59:58Z but is delivered on 2026-04-02. Which price does rating apply? Why?
  6. Compare building an in-house aggregator on Kafka Streams vs using Stripe Metering for a platform doing 10k events/sec. What decides between them?
  7. Stretch: Build a minimal end-to-end metering demo: Python sidecar → Redpanda (Kafka) → Python consumer that aggregates per-tenant token counts into Postgres → a rating function that applies per-model prices → an invoice script that emits a PDF. About 500 lines total.