Design

Created Apr 25, 2026. Source: openspec/changes/hydrate-first-party-blob-streams/design.md

Current State

The reference resource server (RS) already has the core blob substrate:

POST /v1/blobs stores content-addressed bytes for owner-authorized ingestion.
GET /v1/blobs/{blob_id} serves bytes only when the caller can see a record that references the blob through a visible blob_ref.
Read responses decorate visible blob_ref objects with fetch_url.
Query tests already prove grant-visible blob_ref enforcement and Gmail attachment expansion for synthetic hydrated records.

The missing layer is connector hydration. Real first-party connectors still often emit file-like records (Gmail attachments, Slack files, statements, receipts) as metadata without a populated blob_ref, so the transport exists but clients cannot reach the actual bytes.

Principles

Use the existing blob contract. Do not add /attachments/:id/content, /download, or stream-specific byte routes.
Connector output remains metadata-first. A file-like record can exist without bytes; hydration status tells clients whether bytes are available.
Hydration is grant-safe by construction. Bytes are fetchable only through a record whose blob_ref field is visible under the bearer's grant.
Hydration is idempotent. Re-running a connector should not duplicate stored bytes for the same content.
Failures are records, not surprises. If byte collection is blocked, too large, expired, unavailable, or intentionally skipped, preserve a metadata record with a machine-readable hydration status/reason.
No secrets in logs or timelines. Connector logs and run timelines must not include byte contents or signed source URLs.
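The idempotency principle follows from content addressing, which a minimal sketch can illustrate. The `InMemoryBlobStore` class and the `sha256:` identifier format below are assumptions made for illustration; they are not the reference RS storage API or its actual blob_id scheme.

```python
import hashlib


class InMemoryBlobStore:
    """Hypothetical stand-in for the blob seam; not the reference RS API."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # Derive the identifier from the content so identical bytes map to one entry.
        # The "sha256:" prefix is an assumed format, chosen only for this sketch.
        blob_id = "sha256:" + hashlib.sha256(data).hexdigest()
        # setdefault keeps the first stored copy; a repeat upload changes nothing.
        self._blobs.setdefault(blob_id, data)
        return blob_id

    def count(self) -> int:
        return len(self._blobs)


store = InMemoryBlobStore()
first = store.put(b"%PDF-1.7 example bytes")
second = store.put(b"%PDF-1.7 example bytes")  # simulated connector re-run
```

Because the identifier is a function of the bytes alone, a re-run yields the same identifier and the store still holds a single copy.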

Candidate Streams

The first implementation pass should audit all shipped first-party manifests and classify each stream:

Hydrate now: bytes are source-available, useful, bounded, and safe to fetch in normal sync.
Metadata only: bytes may exist but are not available through the current connector auth/path, are too expensive, or need separate design.
Not binary: no collectible byte payload.

Known candidates:

Gmail attachments: first vertical slice; stream already models metadata, blob_ref, content_sha256, and hydration_status.
Slack file/message attachment surfaces: likely useful, but distinguish Slack message attachments from actual uploaded files and remote files.
Chase/USAA/Amazon statements, receipts, invoices, or exports: high assistant value, but may require browser download handling and retention-size policy.
ChatGPT/Claude/Codex artifacts or uploaded files: audit first; many records may be text metadata or local filesystem references rather than collectible source bytes.
GitHub gists/release assets/files: audit whether these should be records, blobs, or both.

Hydration Model

For a hydrated file-like record, connector code should:

Fetch bytes from the source while operating under the owner-authorized connector session.
Upload/store bytes through the reference blob seam.
Emit a record whose data.blob_ref contains:
- blob_id
- mime_type
- size_bytes
- sha256
Emit a connector-specific integrity field such as content_sha256 where the manifest already declares it.
Emit hydration_status: "hydrated".
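The hydrated-record steps above can be sketched as a small builder. This is an illustration under stated assumptions: `build_hydrated_record` and the `filename` metadata key are hypothetical, and the blob_id is taken as an input because the document does not specify how the blob seam returns it.

```python
import hashlib


def build_hydrated_record(metadata: dict, data: bytes, mime_type: str, blob_id: str) -> dict:
    """Assemble a record whose data.blob_ref carries the four listed fields.

    `blob_id` is assumed to come back from the blob seam after upload.
    """
    digest = hashlib.sha256(data).hexdigest()
    return {
        "data": {
            **metadata,
            "blob_ref": {
                "blob_id": blob_id,
                "mime_type": mime_type,
                "size_bytes": len(data),
                "sha256": digest,
            },
            # Emit this only where the stream manifest already declares the field.
            "content_sha256": digest,
            "hydration_status": "hydrated",
        }
    }


record = build_hydrated_record(
    {"filename": "lease.pdf"},  # hypothetical metadata key
    b"example attachment bytes",
    "application/pdf",
    "blob_123",  # placeholder identifier
)
```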

For a non-hydrated file-like record, connector code should emit metadata plus a hydration_status value such as:

deferred
unavailable
blocked
too_large
failed

If a stream needs finer diagnostics, add an optional non-secret hydration_error_code or equivalent field through the manifest rather than overloading logs.
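One way to keep the status vocabulary machine-readable is to model it as an enumeration, as in the sketch below. The status values come from this section; the `HydrationStatus` class, the `build_metadata_only_record` helper, and the example error code are hypothetical names introduced for illustration.

```python
from enum import Enum
from typing import Optional


class HydrationStatus(str, Enum):
    HYDRATED = "hydrated"
    DEFERRED = "deferred"
    UNAVAILABLE = "unavailable"
    BLOCKED = "blocked"
    TOO_LARGE = "too_large"
    FAILED = "failed"


def build_metadata_only_record(
    metadata: dict,
    status: HydrationStatus,
    error_code: Optional[str] = None,
) -> dict:
    """Emit metadata with a status and no blob_ref (hypothetical helper)."""
    if status is HydrationStatus.HYDRATED:
        raise ValueError("a hydrated record needs a blob_ref; use the hydrated path")
    data = {**metadata, "hydration_status": status.value}
    if error_code is not None:
        # Optional, non-secret diagnostic; the field must be declared in the manifest.
        data["hydration_error_code"] = error_code
    return {"data": data}


skipped = build_metadata_only_record(
    {"filename": "statement.pdf"},  # hypothetical metadata key
    HydrationStatus.TOO_LARGE,
    error_code="size_limit_exceeded",  # illustrative code, not a defined value
)
```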

Size And Retention

This change should start conservative:

Set a per-connector default max blob size.
Allow operator override via environment/config only after documenting the consequence.
Keep blob storage local to the reference deployment.
Avoid automatic OCR/text extraction in this change; extracted text can be a later capability.
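A per-connector size cap with an operator override might look like the following sketch. Everything specific here is an assumption: the 25 MiB default, the `PDPP_<CONNECTOR>_MAX_BLOB_BYTES` variable name, and both function names are hypothetical and not defined by this change.

```python
import os
from typing import Mapping, Optional

# Illustrative defaults only; real limits are a per-connector policy decision.
DEFAULT_MAX_BLOB_BYTES = {"gmail": 25 * 1024 * 1024}
FALLBACK_MAX_BLOB_BYTES = 10 * 1024 * 1024


def max_blob_bytes(connector: str, env: Optional[Mapping[str, str]] = None) -> int:
    """Return the size cap for a connector, honoring an operator override."""
    env = os.environ if env is None else env
    # Hypothetical variable name, e.g. PDPP_GMAIL_MAX_BLOB_BYTES.
    raw = env.get(f"PDPP_{connector.upper()}_MAX_BLOB_BYTES")
    if raw is None:
        return DEFAULT_MAX_BLOB_BYTES.get(connector, FALLBACK_MAX_BLOB_BYTES)
    value = int(raw)
    if value <= 0:
        raise ValueError("max blob size override must be a positive integer")
    return value


def within_limit(size_bytes: int, limit: int) -> bool:
    # A record that fails this check would be emitted with status "too_large".
    return size_bytes <= limit
```

Passing the environment as a parameter keeps the override testable without mutating process state.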

Client Discovery

Clients should learn byte availability from:

/v1/schema stream fields and field capabilities,
blob_ref fields present in returned records,
blob_ref.fetch_url injected by the RS at read time,
hydration_status on file-like records.

Docs and examples should state explicitly that guessing /content, /download, or stream-specific byte URLs is not part of the PDPP contract.
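On the client side, the discovery rule reduces to reading what the RS returned and never constructing a URL. The sketch below assumes the record payload is a plain dictionary; `resolve_fetch_url` is a hypothetical helper and the example URL is a placeholder, not a documented host.

```python
from typing import Optional


def resolve_fetch_url(record_data: dict) -> Optional[str]:
    """Return the RS-provided fetch_url, or None when bytes are not reachable."""
    if record_data.get("hydration_status") != "hydrated":
        return None
    blob_ref = record_data.get("blob_ref")
    if not isinstance(blob_ref, dict):
        # blob_ref is absent or not visible under the current grant.
        return None
    # Use only the URL injected at read time; do not derive one from an id.
    return blob_ref.get("fetch_url")


hydrated = {
    "hydration_status": "hydrated",
    "blob_ref": {
        "blob_id": "blob_123",
        "fetch_url": "https://rs.example/v1/blobs/blob_123",  # placeholder
    },
}
metadata_only = {"hydration_status": "too_large", "filename": "statement.pdf"}
```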

First Vertical Slice

Implement Gmail attachment hydration first because:

The stream and relationship already exist.
Real user testing has confirmed metadata exists but blob_ref is absent.
Existing synthetic query-contract tests already define the desired grant-safe behavior.
It unlocks high-value assistant questions involving leases, invoices, receipts, PDFs, and documents.

The Gmail slice should prove:

A real sync can populate attachment blob_ref for at least one attachment.
expand=attachments exposes fetch_url only when the grant includes attachments.blob_ref.
Fetching fetch_url returns the original bytes.
Metadata-only fallback remains valid when download fails or is skipped.
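The grant condition in the second point can be illustrated with a simplified projection. This is not the reference RS implementation: representing a grant as a set of "stream.field" strings, the `project_attachment` name, and the relative fetch_url shape are all assumptions made for the sketch.

```python
def project_attachment(attachment: dict, granted_fields: set[str]) -> dict:
    """Keep only granted fields; decorate blob_ref with fetch_url if visible."""
    visible = {
        key: value
        for key, value in attachment.items()
        if f"attachments.{key}" in granted_fields
    }
    blob_ref = visible.get("blob_ref")
    if isinstance(blob_ref, dict):
        # fetch_url is added only when blob_ref itself survived the projection.
        visible["blob_ref"] = {
            **blob_ref,
            "fetch_url": f"/v1/blobs/{blob_ref['blob_id']}",
        }
    return visible


attachment = {
    "filename": "lease.pdf",
    "hydration_status": "hydrated",
    "blob_ref": {"blob_id": "blob_123", "mime_type": "application/pdf"},
}
with_bytes = project_attachment(
    attachment, {"attachments.filename", "attachments.blob_ref"}
)
without_bytes = project_attachment(attachment, {"attachments.filename"})
```

A test for the Gmail slice could assert the same two outcomes against a real query response: fetch_url present with the blob_ref grant, and no blob_ref at all without it.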

Non-Goals

No new public blob endpoint.
No resource-specific attachment content routes.
No automatic OCR/PDF/docx text extraction.
No hosted blob storage.
No cross-connector deduplication semantics beyond content-addressed storage.
No change to the PDPP grant model.