Object storage primer: the S3 model
"The filesystem is a lie. The database is a privilege. Object storage is the ground truth."
Every ML system eventually touches object storage. Model weights sit there. Training datasets sit there. Logs get archived there. Parquet files for the lakehouse live there. Checkpoints during long training runs are dumped there. Terraform state, CI artifacts, container image layers for some registries, backup snapshots — all of it. The reason is simple: object storage is the only cloud primitive that offers effectively infinite capacity at effectively unlimited durability for effectively pennies per gigabyte-month. Nothing else in a cloud comes close on that combination.
But object storage is not a filesystem. It is not POSIX. It has semantics that surprise people coming from NFS or local disk, and those surprises are where production bugs live. The S3 API is the de-facto industry standard — GCS, Azure Blob, MinIO, Cloudflare R2, Backblaze B2, Ceph RGW, and a half-dozen others all implement “S3 compatibility” to some degree — and the S3 mental model is what gets tested in interviews. This chapter drills into that model: what the API actually promises, where it breaks, how multipart works, why presigned URLs exist, what request rates look like in practice, and what lifecycle policies do for your storage bill.
Outline:
- The object storage abstraction vs. filesystem.
- The S3 API as a contract.
- Consistency: eventual vs read-after-write.
- Multipart uploads: when and why.
- Presigned URLs: the auth pattern.
- Request rates and partition keys.
- Lifecycle policies and storage classes.
- Intelligent tiering and the cold path.
- Durability vs availability.
- The S3-compatible ecosystem.
- The mental model.
85.1 The object storage abstraction vs. filesystem
An object store is a flat key-value map. The key is a string (up to ~1024 bytes of UTF-8 in S3). The value is a blob of bytes (up to 5 TB in S3) plus a small bag of metadata (content-type, content-length, user-defined headers, ETag, last-modified). That is the entire data model. There is no directory. There is no mv. There is no chmod. There is no ftruncate. There is no “append to this file.” You put a key, you get a key, you delete a key. That is the contract.
Directories are a convention layered on top of the flat key space by using / as a separator in keys. The S3 API’s ListObjects call accepts a prefix parameter and an optional delimiter — if you pass prefix=models/llama-3-70b/ and delimiter=/, the response lists keys matching that prefix and collapses everything past the next / into “common prefixes.” The effect looks directory-ish from the outside but is stateless under the hood. There is no directory object. Deleting models/llama-3-70b/ as a “directory” requires enumerating every key with that prefix and deleting them one by one.
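Because the delimiter collapse is pure string logic, it can be emulated over an in-memory set of keys. The following sketch (the function name `list_with_delimiter` is ours, not an SDK call) shows what S3 computes statelessly on every ListObjectsV2 request:

```python
def list_with_delimiter(keys, prefix="", delimiter="/"):
    """Emulate S3's prefix/delimiter collapsing over a flat key space.

    Returns (contents, common_prefixes): keys directly "under" the prefix,
    and the pseudo-directories everything else collapses into.
    """
    contents, common = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        i = rest.find(delimiter)
        if i == -1:
            contents.append(key)                # a "file" directly under the prefix
        else:
            common.add(prefix + rest[:i + 1])   # collapsed into a "common prefix"
    return contents, sorted(common)
```

Running it with `prefix="models/"` against keys like `models/llama-3-70b/config.json` returns `models/llama-3-70b/` as a common prefix, exactly the directory-ish illusion described above, with no directory object anywhere.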
The implications are large. Renaming a 2 TB “directory” is not free — it is a copy of every object followed by a delete of every original, which at object-store latencies and rate limits can take hours. There is no way to “atomically replace” a file without using conditional writes (a relatively recent S3 feature). There is no way to get a stable read while a writer is modifying an object — you get whatever version happened to be durably written at the moment your GET landed. And there is no way to hold open a file descriptor the way POSIX lets you; every read is an HTTP GET, every write is an HTTP PUT, and the network is between you and your bytes.
All of this matters because ML workloads routinely want to pretend object storage is a filesystem. s3fs, goofys, mountpoint-s3, gcsfuse, and friends all present a FUSE view of a bucket as if it were POSIX. They lie. Every lie has an edge case. The correct mental model is that object storage is an HTTP-addressable key-value store, and the filesystem view is a convenient illusion for tools that don’t know any better.
85.2 The S3 API as a contract
The S3 API is small — fewer than thirty operations account for 99% of production traffic — and that smallness is its strength. The core verbs:
| Verb | Purpose |
|---|---|
| PutObject | Write a full object in one request. Max 5 GB per request. |
| GetObject | Read a full object or a byte range. |
| HeadObject | Fetch metadata without the body (cheap). |
| DeleteObject | Delete a single key. |
| DeleteObjects | Bulk delete up to 1,000 keys per request. |
| ListObjectsV2 | Paginated listing under a prefix. |
| CopyObject | Server-side copy, up to 5 GB per request. |
| CreateMultipartUpload | Begin a multipart upload session. |
| UploadPart | Upload one part (5 MB to 5 GB). |
| CompleteMultipartUpload | Commit all parts as one object. |
| AbortMultipartUpload | Discard an in-progress multipart. |
That is the core. The rest of the API surface is ACLs, bucket policies, versioning, replication, notifications, tagging, and the gnarly analytics features most teams never touch. The core verbs are what show up in interviews and in the SDK calls of every data pipeline.
Every operation is an authenticated HTTPS request, signed with AWS Signature V4 (SigV4). The signature uses the client’s access key and secret, the request method and path, the timestamp, and a canonical representation of the headers and body. The server recomputes the signature from the same inputs and compares. If they match, the request is authorized. If they do not, 403. The signing is the entire auth model: there are no sessions, no cookies, no long-lived tokens. Every request stands alone.
The “contract” framing matters because S3 compatibility is a spectrum, not a boolean. MinIO implements most of the API with strong consistency and honest behavior. Cloudflare R2 implements a subset and notably charges no egress fees. GCS implements an S3-compatible gateway with different semantics around consistency and listing. Azure Blob has its own API but provides a compatibility shim. When a tool says “S3 compatible,” the real question is: which subset, with which consistency guarantees, at what rate limits? Read the docs, do not trust the badge.
85.3 Consistency: eventual vs read-after-write
For years, the single most surprising thing about S3 was its consistency model. From 2006 until December 2020, S3 was eventually consistent for most operations: you could PUT an object, then GET it, and receive either the new object, the old object, or a 404 — any of these were legal according to the API contract. For tools that assumed “write then read works,” this was a source of nightmare bugs. Spark jobs would read stale data. Dataset validation would fail intermittently. Atomic publishes were not really atomic.
In December 2020, AWS shipped strong read-after-write consistency for all S3 operations, everywhere, with no opt-in. This is one of the most important quiet improvements in cloud history. The new model:
- PUT then GET — always returns the new object. No eventual consistency window.
- PUT then LIST — the new object is immediately visible in listings.
- DELETE then GET — immediately returns 404.
- Overwrite PUT then GET — returns the new version.
Read-after-write consistency does not mean every consistency property you want. Notably:
- No transactions. If you PUT two objects, there is no way to make them atomic. A reader can see the first without the second.
- No compare-and-swap — until recently. AWS shipped `If-Match` and `If-None-Match` conditional writes in 2024, which give you an ETag-based CAS primitive. This is what modern table formats like Iceberg and Delta use to do atomic commits on S3.
- Eventual consistency still exists for cross-region replication. A replicated bucket in another region lags the source by seconds to minutes.
- Listings can still have propagation lag on some S3-compatible stores. GCS has long had strong listing consistency; MinIO does by default; other vendors vary.
For ML systems, the implication is that you can treat a single S3 bucket as a durable key-value store with linearizable writes per-key. But anything that needs multi-key atomicity — “publish this dataset by writing a manifest that references all its parts” — has to use a table format or a conditional write pattern, not a sequence of naive PUTs.
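The conditional-write pattern looks like this in practice. A sketch, assuming a boto3 version recent enough to expose the `IfMatch`/`IfNoneMatch` parameters on `put_object` (added after the 2024 conditional-writes launch); the helper name `commit_manifest` is ours:

```python
def commit_manifest(s3, bucket, key, body, expected_etag=None):
    """Publish a manifest with an ETag-based compare-and-swap.

    expected_etag=None  -> create-only: If-None-Match: * makes the PUT fail
                           with 412 PreconditionFailed if the key exists.
    expected_etag="..." -> CAS overwrite: If-Match makes the PUT fail with
                           412 if another writer committed since we read
                           that ETag. Retry by re-reading and re-deciding.
    """
    kwargs = {"Bucket": bucket, "Key": key, "Body": body}
    if expected_etag is None:
        kwargs["IfNoneMatch"] = "*"
    else:
        kwargs["IfMatch"] = expected_etag
    return s3.put_object(**kwargs)
```

The `s3` argument is a regular `boto3.client("s3")`. The 412 response is the signal that you lost the race, which is precisely the primitive Iceberg and Delta commit protocols are built on.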
85.4 Multipart uploads: when and why
S3 accepts PUTs up to 5 GB in a single request. For anything larger, you must use multipart upload. For anything between 100 MB and 5 GB, you probably should use multipart upload anyway, because it gives you three things a single PUT does not: parallelism, resumability, and retry granularity.
The protocol is three-stage:
- CreateMultipartUpload. The server returns an `UploadId`. This is a handle for the in-progress upload.
- UploadPart. Upload each part with a sequential `PartNumber` (1 to 10,000) and the `UploadId`. Each part must be between 5 MB and 5 GB, except the last, which has no minimum. Each part returns an ETag.
- CompleteMultipartUpload. Send the list of (PartNumber, ETag) pairs. The server stitches them together atomically into a single object with a new ETag. Readers cannot see a partial object; the commit is atomic.
The reasons to use multipart, even for modest-size objects:
- Parallel upload. Upload 10 parts in parallel from 10 threads. A 10 GB object that would take ~100 seconds to PUT serially at 100 MB/s finishes in ~10 seconds over the same pipe if you can saturate it in parallel. This is how `aws s3 cp` gets its speed.
- Resumability. If part 7 fails, you retry part 7, not the whole thing. For a 200 GB training checkpoint, this matters.
- Memory discipline. You don’t need the whole object in memory to start uploading. You can stream parts as they’re produced.
The part size is a tuning knob. Larger parts mean fewer requests (lower request-rate overhead) but worse parallelism and slower retries. Smaller parts mean more requests (more overhead per byte, more signature computation) but better concurrency. For ML checkpoint writes, 64 MB to 256 MB parts are typical. For small objects, stop worrying and use a single PUT.
The one footgun: abandoned multipart uploads do not clean themselves up. If your process crashes between CreateMultipartUpload and CompleteMultipartUpload, the parts remain billed against your bucket until you either complete the upload, abort it, or configure a lifecycle policy to expire incomplete multipart uploads after N days. Always set that lifecycle rule. Always. (The default recommendation is 7 days.)
85.5 Presigned URLs: the auth pattern
A presigned URL is a signed GET or PUT URL that embeds the signature in the query string rather than the headers. It is valid for a fixed duration (up to 7 days for SigV4), after which the signature expires. The owning service generates the URL using its own credentials, and then hands it to a caller that has no S3 credentials of its own. The caller uses the URL directly against S3.
This is the auth pattern that unlocks a huge category of ML-serving architectures. The typical flows:
Dataset upload from a browser. The frontend hits the backend: “I want to upload a file.” The backend generates a presigned PUT URL for a specific key in a bucket, with a short TTL (15 minutes). The frontend PUTs the file directly to S3 using that URL. The bytes never flow through the backend. The backend’s compute, bandwidth, and memory stay free. This scales indefinitely.
Model artifact download. A training job produces a model and writes it to S3. A downstream inference deployment needs to fetch it. The control plane generates a presigned GET URL and passes it to the inference node via an init container or environment variable. The inference node fetches the model directly. No IAM credentials have to be provisioned on the inference node.
Temporary data sharing. A data scientist needs to send a 40 GB eval result to a collaborator. They generate a presigned URL valid for 1 day and email it. The collaborator downloads it, no IAM dance, no additional users to create.
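To make "the signature lives in the query string" concrete, here is a stdlib-only sketch of SigV4 query-string presigning for a GET (virtual-hosted style, no session token, `UNSIGNED-PAYLOAD`). It is educational: in production you would call boto3's `generate_presigned_url` rather than hand-rolling this.

```python
import datetime, hashlib, hmac, urllib.parse

def presign_get(bucket, key, access_key, secret_key,
                region="us-east-1", expires=900, now=None):
    """Build a SigV4 presigned GET URL using only the standard library."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    datestamp = now.strftime("%Y%m%d")
    host = f"{bucket}.s3.{region}.amazonaws.com"
    scope = f"{datestamp}/{region}/s3/aws4_request"
    params = {
        "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
        "X-Amz-Credential": f"{access_key}/{scope}",
        "X-Amz-Date": amz_date,
        "X-Amz-Expires": str(expires),
        "X-Amz-SignedHeaders": "host",
    }
    # Canonical query string: sorted, fully percent-encoded key=value pairs.
    canonical_query = "&".join(
        f"{urllib.parse.quote(k, safe='')}={urllib.parse.quote(v, safe='')}"
        for k, v in sorted(params.items()))
    canonical_request = "\n".join([
        "GET",
        "/" + urllib.parse.quote(key),
        canonical_query,
        f"host:{host}\n",       # canonical headers block (each line ends in \n)
        "host",                 # signed headers list
        "UNSIGNED-PAYLOAD",
    ])
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256", amz_date, scope,
        hashlib.sha256(canonical_request.encode()).hexdigest(),
    ])
    # Signing key: HMAC cascade over date, region, service, terminator.
    k = ("AWS4" + secret_key).encode()
    for step in (datestamp, region, "s3", "aws4_request"):
        k = hmac.new(k, step.encode(), hashlib.sha256).digest()
    signature = hmac.new(k, string_to_sign.encode(), hashlib.sha256).hexdigest()
    return (f"https://{host}/{urllib.parse.quote(key)}"
            f"?{canonical_query}&X-Amz-Signature={signature}")
```

Everything the server needs to recompute the signature is in the URL itself, which is why the holder of the URL needs no credentials, and why the URL is a bearer token.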
The gotchas:
- The URL is a bearer token. Anyone with the URL can use it. Keep the TTL short.
- The URL is bound to a specific method and path. A presigned PUT URL for `key=foo` cannot be used to PUT `key=bar`. You cannot use a presigned GET URL to DELETE.
- Headers matter at sign time. If you sign a URL with `Content-Type: application/json` in the signed headers, the client must send exactly that header. Mismatches are a common source of "why is my presigned URL returning 403" bugs.
- You can presign any S3 operation, not just GET/PUT. Presigned HEAD, presigned multipart upload part, presigned delete — all valid, all useful in specific cases.
Presigned URLs are the auth model that makes S3 usable as a backing store for multi-tenant, user-facing systems. Without them, every byte of user data would flow through your backend. With them, your backend becomes a control plane and S3 does the heavy lifting.
85.6 Request rates and partition keys
S3’s scaling is famous: a bucket can sustain 3,500 PUTs/sec and 5,500 GETs/sec per prefix. The per-prefix qualifier matters. S3 internally partitions keys by prefix, and each partition has its own rate limit. If all your writes go to a single prefix — say, you’re bulk-loading log files with keys like logs/2026/04/10/00/00/01.json — you hit the single-partition limit fast, the SDK starts getting 503 SlowDown errors, and your throughput stalls around 3,500 req/sec total.
The fix, historically, was to randomize the prefix of your key. Instead of logs/2026/04/10/..., you’d write logs/8a3f/2026/04/10/... where 8a3f was the first 4 hex characters of a hash of the rest of the key. Now your keys are spread across 65,536 possible partition prefixes, and S3 auto-scales each partition independently. Total throughput becomes effectively unlimited for reasonable workloads.
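The historical fix as a helper (the function name is ours; MD5 here is for distribution, not security):

```python
import hashlib

def sharded_key(rest, top="logs", width=4):
    """Prefix the key with the first `width` hex chars of a hash of the
    rest of the key, spreading writes across 16**width partition prefixes."""
    shard = hashlib.md5(rest.encode()).hexdigest()[:width]
    return f"{top}/{shard}/{rest}"
```

The shard is deterministic, so a reader that knows the logical key can recompute the full physical key without any lookup; what you lose is the ability to list a date range under a single prefix.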
In 2018, AWS silently improved the sharding algorithm so that pure-sequential-prefix workloads scale better automatically. As of current S3, you generally do not need to manually shard prefixes for steady-state workloads. But two cases still bite:
- Bursty writes with hot prefixes. A batch job that writes 100,000 objects to the same date partition in 60 seconds will still get throttled. For this, you either rate-limit the writer, shard the prefix, or retry with backoff (the SDK does this by default).
- S3-compatible stores that have not caught up. MinIO, R2, and friends have their own sharding behaviors, often worse than S3’s. If you’re hitting rate limits on a compatible store, sharding the prefix is still the fix.
The practical rule: design keys for scan patterns and distribution. Keys that sort well by time (good for ListObjectsV2 with a date prefix) are also the keys that concentrate writes. There is no free lunch. For ML training datasets, a common pattern is to use content-addressable keys (hash of the object) as the primary path and maintain a separate index file or database that maps “logical name” to “content hash.” The hash prefix naturally distributes writes, and the index provides ordered access.
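The content-addressable pattern in miniature (names and layout are illustrative): store bytes under their own hash, and keep a small index from logical name to content key.

```python
import hashlib

def content_key(data: bytes, root="data"):
    """Key derived from the object's bytes; the hash prefix distributes writes."""
    h = hashlib.sha256(data).hexdigest()
    return f"{root}/{h[:2]}/{h}"

# Toy index mapping logical names to content keys. In production this is a
# database table or a manifest file, not an in-memory dict.
index = {}

def publish(name, data: bytes):
    key = content_key(data)
    # s3.put_object(Bucket=..., Key=key, Body=data)  # the actual upload
    index[name] = key
    return key
```

A side benefit: identical content always maps to the same key, so re-publishing an unchanged dataset shard is a no-op write rather than a duplicate.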
85.7 Lifecycle policies and storage classes
Object storage is cheap but not free. $23/TB/month for S3 Standard adds up when you’re storing petabytes of training data. The way you control cost is with storage classes and lifecycle policies.
The main S3 storage classes:
| Class | $/GB/mo | Retrieval | Use case |
|---|---|---|---|
| Standard | 0.023 | Free, instant | Hot data |
| Standard-IA | 0.0125 | $0.01/GB | Warm data, 30+ day access |
| One Zone-IA | 0.01 | $0.01/GB | Non-critical warm data |
| Glacier Instant | 0.004 | $0.03/GB | Archive, millisecond access |
| Glacier Flexible | 0.0036 | $0.01-0.03/GB, minutes to hours | Archive |
| Glacier Deep Archive | 0.00099 | $0.02-0.10/GB, hours | Deep archive |
The prices have a 20× spread. The lesson is: data that is written once and rarely read belongs in a cold class. Training datasets from three years ago that you keep around “just in case” belong in Glacier Deep Archive, not Standard.
A lifecycle policy is a bucket-level rule that automatically transitions or expires objects based on age. The rules are simple: “for objects under prefix X, transition to class Y after N days, then to class Z after M days, then delete after P days.” Typical rules for an ML org:
- Log files: Standard for 30 days, then Glacier Instant for 90 days, then delete.
- Training checkpoints: Standard for 30 days (S3 will not transition objects younger than 30 days to Standard-IA), then Standard-IA for 60 days, then delete (you keep only the best checkpoint, which has a different policy).
- Model artifacts: Standard indefinitely for the current production model, Standard-IA for old versions, Glacier for archived.
- Incomplete multipart uploads: abort after 7 days, always.
Lifecycle rules run once a day, asynchronously. They are not a real-time enforcement mechanism. Transitions are eventually consistent: a rule that says “transition after 30 days” may transition between 30 and 32 days, not exactly at midnight. This is fine for cost optimization, wrong for compliance-driven deletion (for which you need a separate “GDPR delete” pipeline).
85.8 Intelligent tiering and the cold path
For workloads where access patterns are unpredictable, S3 offers Intelligent-Tiering: a storage class that automatically moves objects between frequent, infrequent, and archive access tiers based on observed access patterns. You pay a small monthly monitoring fee per object (~$0.0025 per 1000 objects) and, in exchange, you don’t have to design lifecycle rules by hand.
The heuristic is simple: an object that hasn’t been accessed in 30 days moves to the infrequent tier. Not accessed in 90 days, to the archive instant tier. Not accessed in 180 days, to the deep archive tier. Access it, and it bounces back to frequent. The savings can be 40-70% on workloads with long-tail access, which is most data platforms.
The catches:
- Small objects hurt. The per-object monitoring fee means Intelligent-Tiering is bad for buckets with millions of tiny objects. Use it for buckets with fewer, larger objects (think Parquet files, model weights), not for buckets with millions of 1 KB JSON logs.
- The monitoring fee is real. At 100M objects, it’s $250/month just for the monitoring. Math out whether the tier savings beat the fee.
- Access time changes behavior. When an object in the deep archive tier is accessed, the first read has the latency of Glacier (minutes to hours). If you have SLA-sensitive reads on an Intelligent-Tiering object, you may be surprised. For ML training, this is almost always fine; for user-facing serving, it can be a bug.
For the ML bulk data path — training datasets, checkpoints, lakehouse Parquet — Intelligent-Tiering is a default-good choice. For model weights served directly to inference, keep them in Standard with an explicit lifecycle policy.
85.9 Durability vs availability
S3 advertises eleven 9s of durability (99.999999999%). This is the probability that a given object, once written, still exists after a year. The number comes from S3’s internal replication: each object is stored across three or more availability zones, with erasure coding and background rebuild on any degradation. Even losing an entire AZ does not lose data.
S3 advertises four 9s of availability (99.99%) for Standard. This is different. Availability is the probability that a given GET or PUT succeeds at a given time. The four 9s allows for ~52 minutes of downtime per year, per region, per bucket. In practice, S3 has had several longer outages over the past decade (2017 us-east-1 outage, 2020 us-east-1 outage, 2024 eu-west-1 degradation). When S3 is down, every dependent system is down.
Durability does not protect you from:
- Deletes. If you delete the object, S3 cheerfully confirms that it’s gone. Enable versioning if you want protection against accidental deletes.
- Overwrites. A PUT to an existing key replaces it. Versioning protects this too.
- Corrupt uploads. If your client sends bad bytes, S3 stores bad bytes. Validate with ETag or content hashes.
- Regional failures. Eleven 9s is per-region. A region-wide disaster loses everything. Use cross-region replication for disaster recovery.
- Account compromise. If someone gets your credentials, they can delete your bucket. Enable MFA-delete, use bucket policies, and put critical buckets in a separate account.
The right mental model: object storage is durable but mutable. The durability number protects you against hardware failure, not against the most common causes of data loss, which are mistakes in code and mistakes in IAM.
85.10 The S3-compatible ecosystem
Because the S3 API won, every other object store implements some subset of it. A quick tour:
GCS (Google Cloud Storage). Native API is JSON-over-HTTPS with a different auth scheme. An S3-compatible interop layer exists but is not quite first-class. GCS has strong consistency, including listing consistency, and the strongest global bucket semantics. The killer feature is lower egress to GCP services and tight integration with BigQuery.
Azure Blob Storage. Native API is distinct from S3. An S3 shim exists but is used less. The Azure native API has block blobs (like multipart), page blobs (random-write, used for VM disks), and append blobs (append-only, good for logs).
MinIO. Open-source, self-hosted, highly S3-compatible. The go-to for air-gapped environments, on-prem ML platforms, and “I need an S3 locally for testing.” Deploys as a StatefulSet on Kubernetes. Supports most of the S3 API with strong consistency. The ML platform stack often uses MinIO as the local object store for tests and sometimes production.
Cloudflare R2. S3-compatible API, no egress fees. For workloads that serve files to the internet, the egress savings can be enormous — millions of dollars a year for high-traffic deployments. The tradeoff is a less mature ecosystem and some API gaps.
Backblaze B2. S3-compatible, cheap, good for backup. Less common in ML.
Ceph RGW (RADOS Gateway). Open-source, on-prem, S3-compatible, built on the Ceph distributed storage system. Used by HPC centers and some enterprise ML platforms. Operationally heavier than MinIO.
Wasabi, IDrive e2, various niche vendors. S3-compatible budget storage. Useful for archival and cold paths.
For ML systems, the pattern you see most often is: primary storage on S3 or GCS, a local MinIO for dev and tests, and occasionally an R2 or Wasabi bucket for high-egress or archival use cases. The S3 API is the portability layer that makes this possible.
85.11 The mental model
Eight points to take into Chapter 86:
- Object storage is a flat key-value map, not a filesystem. Directories are a convention on top of keys.
- The S3 API is the industry contract. Every other object store implements some compatible subset.
- S3 has strong read-after-write consistency as of December 2020. Multi-key atomicity still requires table formats or conditional writes.
- Use multipart upload for anything > 100 MB for parallelism, resumability, and retry granularity. Always lifecycle-expire incomplete multiparts.
- Presigned URLs are the auth primitive that moves bytes around your system without IAM credentials on every hop.
- Prefix design controls throughput. Hot prefixes get throttled; distributed prefixes scale.
- Lifecycle policies and storage classes control cost. 20× price spread between hot and cold.
- Durability is per-region and does not protect you from deletes, overwrites, or bad IAM. Versioning and cross-region replication fill the gaps.
In Chapter 86 the focus shifts from the bulk-byte layer to the structured data plane: document and key-value stores.
Read it yourself
- Amazon S3 User Guide, especially the sections on consistency, multipart upload, and storage classes (docs.aws.amazon.com/s3).
- DeCandia et al., Dynamo: Amazon’s Highly Available Key-value Store (2007). The paper that shaped how AWS thinks about durable distributed storage.
- The MinIO documentation on S3 compatibility and the specific API gaps.
- The vLLM / Hugging Face docs on loading models from S3 — a good practical reference for how ML systems actually fetch weights.
- Werner Vogels’ blog posts on S3’s strong consistency launch (December 2020).
- Jeff Barr’s AWS blog post on the S3 prefix auto-sharding change (2018).
Practice
- Walk through the three-step multipart upload protocol. What happens if your process crashes between `UploadPart` 5 and `UploadPart` 6? What if it crashes after `UploadPart` 10 but before `CompleteMultipartUpload`?
- You're uploading a 2 TB training dataset from an on-prem GPU box to S3 over a 1 Gbps link. Compute the theoretical minimum upload time. With 10-way parallel multipart, can you actually hit that? What's the bottleneck?
- Why is a presigned URL bound to a specific method and path? What’s the security argument?
- Design a key layout for a bucket that stores 1 billion training images. Optimize for (a) scan throughput under a date prefix and (b) write throughput without hot prefixes. What’s the tradeoff?
- Compute the monthly cost of storing 500 TB in Standard vs Standard-IA vs Glacier Deep Archive. When does Glacier stop making sense (think about retrieval costs)?
- A bucket has 100 million objects averaging 50 KB each. Calculate the Intelligent-Tiering monitoring fee per month. Is the monitoring fee worth it?
- Stretch: Set up MinIO locally with the Helm chart, create a bucket, and write a Python script using `boto3` that does a multipart upload of a 1 GB file with 10 parallel workers. Measure the throughput. Then add a lifecycle rule that expires incomplete multipart uploads after 1 day and verify it triggers by creating and abandoning a multipart upload.