Add Schema Validation Coverage
Tasks: 25/25
1. Pre-work: shared helper
- Add `src/schema-registry.ts` exporting `makeValidateRecord(schemas)` returning a `ValidateRecord` closure with consistent diagnostics.
- Migrate `connectors/{amazon,chase,chatgpt,reddit,usaa}/schemas.ts` to use the helper. Remove duplicated boilerplate.
- Drop the `validateRecord as ValidateRecord` cast in `connectors/amazon/index.ts` now that the helper returns the right type directly.
- Confirm `pnpm typecheck` clean and `pnpm test` (640+ tests) green.
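A minimal sketch of what a shared helper like this might look like. The `ValidationResult` and `StreamSchema` shapes, the `check` method, and the `transactions` example are assumptions made for illustration; the real `ValidateRecord` type and schema library are not shown in this task list.

```typescript
// Hypothetical result shape; the real diagnostics format is not shown in this document.
type ValidationResult =
  | { ok: true }
  | { ok: false; stream: string; issues: string[] };

type ValidateRecord = (stream: string, record: unknown) => ValidationResult;

// Hypothetical minimal schema interface. A zod schema's safeParse result
// could be adapted to this shape, but that wiring is omitted here.
interface StreamSchema {
  // Returns a list of issues; an empty list means the record is valid.
  check(record: unknown): string[];
}

function makeValidateRecord(schemas: { [stream: string]: StreamSchema }): ValidateRecord {
  return (stream, record) => {
    // Own-property check so names like "constructor" are not treated as schemas.
    if (!Object.prototype.hasOwnProperty.call(schemas, stream)) {
      return { ok: false, stream, issues: [`no schema registered for stream "${stream}"`] };
    }
    const issues = schemas[stream].check(record);
    if (issues.length === 0) {
      return { ok: true };
    }
    return { ok: false, stream, issues };
  };
}

// Illustrative schema map for a single made-up stream.
const SCHEMAS: { [stream: string]: StreamSchema } = {
  transactions: {
    check(record) {
      const r = record as { currency?: unknown } | null | undefined;
      const valid = typeof r?.currency === "string" && r.currency !== "";
      return valid ? [] : ["currency: required non-empty string"];
    },
  },
};

// Typed as ValidateRecord directly, so callers need no cast.
const validateRecord: ValidateRecord = makeValidateRecord(SCHEMAS);
```

Because the factory's return type is already `ValidateRecord`, each connector can export the closure as-is, and every failure carries the stream name alongside the issues.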
2. Pre-work: replay existing schemas vs real DB
- Add `bin/replay-schemas.ts` that runs every record from the local sqlite through the connector's `validateRecord` and writes a JSON report under `local/`.
- Add `bin/sample-records.ts` that emits 5 representative records per stream, used to ground new schemas in real shapes.
- Replay the five existing schemas. Document drift: usaa 6/926 records fail on legitimate data-quality issues (missing currency, empty-string descriptions); reddit 1225/1225 fail because DB holds v0.1 records and the schema is v0.2.
- Decide: do not loosen schemas to mask drift. The SKIP_RESULT path is the diagnostic signal.
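The replay step can be sketched roughly as below. This is an illustration under assumptions, not the contents of `bin/replay-schemas.ts`: the `StoredRecord`, `Validate`, and `StreamReport` shapes are hypothetical, and reading from sqlite and writing the JSON file are left out so the sketch stays self-contained.

```typescript
// Hypothetical shape for a row read back from the local database.
interface StoredRecord {
  stream: string;
  data: unknown;
}

// Hypothetical validator signature: returns issues; an empty list means valid.
type Validate = (stream: string, data: unknown) => string[];

interface StreamReport {
  total: number;
  passed: number;
  failed: number;
  sampleIssues: string[];
}

function replay(
  records: readonly StoredRecord[],
  validate: Validate,
  maxSamples: number = 3
): { [stream: string]: StreamReport } {
  // Null-prototype object so stream names never collide with Object.prototype keys.
  const report: { [stream: string]: StreamReport } = Object.create(null);
  for (const record of records) {
    let entry: StreamReport | undefined = report[record.stream];
    if (entry === undefined) {
      entry = { total: 0, passed: 0, failed: 0, sampleIssues: [] };
      report[record.stream] = entry;
    }
    const issues = validate(record.stream, record.data);
    entry.total += 1;
    if (issues.length === 0) {
      entry.passed += 1;
    } else {
      entry.failed += 1;
      // Keep only a few sample issues per stream so the report stays readable.
      if (entry.sampleIssues.length < maxSamples) {
        entry.sampleIssues.push(issues.join("; "));
      }
    }
  }
  return report;
}

// Pass rate as a percentage string with one decimal place.
function passRate(entry: StreamReport): string {
  if (entry.total === 0) {
    return "n/a";
  }
  return ((entry.passed / entry.total) * 100).toFixed(1);
}

// Illustrative run with made-up records and a made-up currency rule.
const demoReport = replay(
  [
    { stream: "transactions", data: { currency: "USD" } },
    { stream: "transactions", data: {} },
  ],
  (_stream, data) =>
    typeof (data as { currency?: unknown }).currency === "string" ? [] : ["currency: missing"]
);
const demoJson = JSON.stringify(demoReport, null, 2);
```

Failures are counted and sampled rather than filtered out, which matches the decision above: the schema stays strict and the report is where drift shows up.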
3. Six new connectors
- github: schemas.ts (6 manifest-declared streams). Replay: 8702/8702 pass.
- ynab: schemas.ts (9 streams, milliunit-aware, composite transaction IDs). Replay: 21537/21537 pass.
- gmail: schemas.ts (5 streams; `labels` keyed on `name`, not `id`). Replay: 50485/50485 pass.
- codex: schemas.ts (6 streams) + manifest reconcile to declare `function_calls`. Replay: 74945/74945 pass.
- claude_code: schemas.ts (6 streams). Replay: 252863/252863 pass.
- slack: schemas.ts (13 manifest-declared streams; `messages.ts` is string-formatted float, not ISO). Replay: 349130/349130 pass.
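The Slack note is the kind of shape detail that is easy to get wrong, so here is a small sketch of validating a string-formatted float timestamp instead of an ISO date. The function names and the exact pattern are assumptions for illustration; the actual rule in the slack `schemas.ts` is not shown here.

```typescript
// Assumption: timestamps are strings of the form "<seconds>.<fraction>",
// for example "1700000000.000100". The real schema may be stricter.
const SLACK_TS_PATTERN = /^\d+\.\d+$/;

// Hypothetical helper: true only for string-formatted float timestamps.
function isSlackTs(value: unknown): boolean {
  return typeof value === "string" && SLACK_TS_PATTERN.test(value);
}

// Hypothetical helper: converts a validated ts to a Date at millisecond precision.
function slackTsToDate(ts: string): Date {
  if (!isSlackTs(ts)) {
    throw new Error(`not a string-formatted float timestamp: ${ts}`);
  }
  return new Date(Math.floor(parseFloat(ts) * 1000));
}
```

An ISO string such as `2023-11-14T22:13:20Z` or a bare number would be rejected by `isSlackTs`, which is the distinction the bullet above calls out.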
4. Manifest version bumps
- github 0.2.0 → 0.3.0
- gmail 0.1.0 → 0.2.0
- ynab 0.2.0 → 0.3.0
- codex 0.2.0 → 0.3.0 (also adds `function_calls` stream)
- claude_code 0.2.0 → 0.3.0
- slack 0.3.0 → 0.4.0
5. Validation
- `pnpm test` passes 640 tests, 0 failures.
- My-files-only `biome check` clean (14 files).
- Final replay: 771,485 records across 11 connectors. 9 at 100%, usaa at 99.4%, reddit at 0% (re-ingest required).
- `openspec validate add-schema-validation-coverage --strict`.
- `openspec validate --all --strict`.
6. Followups (not in this change)
- manifest/schema/emit reconciliation: audit every connector for declared streams with no emitted records, emitted streams missing from the manifest, and `SCHEMAS` keys that do not match the manifest. The owner review already fixed GitHub/Slack drift in this tranche; make this a reusable check.
- fixture replay tests: add committed `__fixtures__/` replay coverage for the six newly schemed connectors so schema confidence does not depend on Tim's local SQLite database.
- usaa: 6 records emit with missing currency or empty-string descriptions. File as data-quality issue; do not loosen schema.
- reddit: re-ingest from a v0.2 connector capture so the local DB matches the v0.2 schema. Tracked in `add-polyfill-connector-system/tasks.md`.
- zod cleanup: replace deprecated `z.string().url()` usage in connector schemas.
- The 18 connectors with no `parsers.ts` yet are out of scope; they don't have anything to validate.
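The manifest/schema/emit reconciliation followup could start from a set-difference check along these lines. This is a sketch under assumptions: the input and report shapes are hypothetical, the `SCHEMAS` comparison is read as a mismatch in either direction, and the stream names other than `function_calls` are made up.

```typescript
// Hypothetical inputs: stream names gathered from the manifest, the SCHEMAS
// object's keys, and the streams actually seen in emitted records.
interface ReconcileInput {
  manifestStreams: readonly string[];
  schemaKeys: readonly string[];
  emittedStreams: readonly string[];
}

interface ReconcileReport {
  declaredButNotEmitted: string[];
  emittedButNotDeclared: string[];
  schemaKeysNotInManifest: string[];
  manifestStreamsWithoutSchema: string[];
}

// Items of `a` that do not appear in `b`, preserving the order of `a`.
function difference(a: readonly string[], b: readonly string[]): string[] {
  return a.filter((item) => b.indexOf(item) === -1);
}

function reconcile(input: ReconcileInput): ReconcileReport {
  return {
    declaredButNotEmitted: difference(input.manifestStreams, input.emittedStreams),
    emittedButNotDeclared: difference(input.emittedStreams, input.manifestStreams),
    schemaKeysNotInManifest: difference(input.schemaKeys, input.manifestStreams),
    manifestStreamsWithoutSchema: difference(input.manifestStreams, input.schemaKeys),
  };
}

// True when none of the four drift categories has any entries.
function isClean(report: ReconcileReport): boolean {
  return (
    report.declaredButNotEmitted.length === 0 &&
    report.emittedButNotDeclared.length === 0 &&
    report.schemaKeysNotInManifest.length === 0 &&
    report.manifestStreamsWithoutSchema.length === 0
  );
}

// Illustrative data only: a manifest that has not yet declared function_calls.
const driftReport = reconcile({
  manifestStreams: ["sessions", "messages"],
  schemaKeys: ["sessions", "messages", "function_calls"],
  emittedStreams: ["sessions", "messages", "function_calls"],
});
```

In this made-up case `function_calls` lands in both `emittedButNotDeclared` and `schemaKeysNotInManifest`, so `isClean(driftReport)` is false until the manifest declares the stream.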