{
  "id": "doc_123_chunk_4",
  "metadata": {
    "source": "paper.pdf",
    "page": 12,
    "doc_id": "doc_123",
    "chunk_text": "55 KB of text...",
    "raw_html": "10 KB of HTML...",
    "summary": "9.5 KB summary..."
  }
}
v0.3.0 - MIT License
Stop vector DB metadata limit errors before upsert.
Scan JSON/JSONL records, validate upsert readiness, move heavy payloads to sidecar stores, and use safe_upsert from Python for cleaner ingestion pipelines.
$ vectormeta scan records.json --target pinecone
Target: pinecone / Limit: 40960 bytes (40.00 KB)
Records scanned: 2 / Oversized: 1
doc_123_chunk_4 -> 76,747 B; largest fields: chunk_text, raw_html, summary
$ vectormeta validate records.json --target pinecone --dim 1536
Records validated: 2 / Errors: 1
metadata_too_large -> doc_123_chunk_4
$ vectormeta fix records.json --target pinecone --sidecar-store sqlite --sidecar sidecars.sqlite --out ready.json
Fix summary: updated 2 records; stored content-addressed sidecar refs.
Before and after
Move storage-heavy payloads out of filterable metadata.
vectormeta keeps source, page, doc_id, and tags in the vector DB record, then writes text, HTML, tables, and summaries to JSON, file, or SQLite sidecars.
{
  "id": "doc_123_chunk_4",
  "metadata": {
    "source": "paper.pdf",
    "page": 12,
    "doc_id": "doc_123",
    "content_ref": "sqlite:8eb2..."
  }
}
// SQLite/FileStore payload
{
  "id": "doc_123_chunk_4",
  "chunk_text": "55 KB of text...",
  "raw_html": "10 KB of HTML...",
  "summary": "9.5 KB summary..."
}
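The fix summary describes the sidecar refs as content-addressed. A minimal sketch of that idea, assuming SHA-256 over compact JSON as the address and a single hypothetical `payloads` table. The function names, the schema, and the hash choice are all illustrative assumptions, not vectormeta's documented internals:

```python
import hashlib
import json
import sqlite3


def store_payload(conn: sqlite3.Connection, payload: dict) -> str:
    """Store a payload under the hash of its own bytes and return a reference."""
    # Hypothetical schema: one table keyed by the content digest.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payloads (digest TEXT PRIMARY KEY, body BLOB)"
    )
    # Sorted keys and compact separators make the bytes (and the hash) stable.
    body = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    digest = hashlib.sha256(body).hexdigest()
    # Identical content maps to the same key, so a repeated write is a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO payloads (digest, body) VALUES (?, ?)",
        (digest, body),
    )
    conn.commit()
    return f"sqlite:{digest}"


def load_payload(conn: sqlite3.Connection, ref: str) -> dict:
    """Resolve a 'sqlite:<digest>' reference back to the stored payload."""
    digest = ref.split(":", 1)[1]
    row = conn.execute(
        "SELECT body FROM payloads WHERE digest = ?", (digest,)
    ).fetchone()
    if row is None:
        raise KeyError(ref)
    return json.loads(row[0])
```

Keying by content means re-running a fix over unchanged records does not duplicate stored payloads.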
Workflow
CLI checks plus Python safe upsert.
Use it locally before upsert, or wire it into CI as a validation and metadata hygiene gate.
Detect oversized records
Measure compact UTF-8 JSON byte size, identify oversized records, and show the largest top-level metadata fields.
$ vectormeta scan records.json --target pinecone
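The sizing rule can be sketched in a few lines of Python. This is an illustration, not vectormeta's own code: it assumes the limit applies to the `metadata` object only, and the helper names (`compact_size`, `largest_fields`, `scan`) are hypothetical.

```python
import json

PINECONE_LIMIT_BYTES = 40 * 1024  # 40960 bytes, the pinecone preset


def compact_size(value) -> int:
    """Byte length of the value as compact UTF-8 JSON (no extra whitespace)."""
    text = json.dumps(value, separators=(",", ":"), ensure_ascii=False)
    return len(text.encode("utf-8"))


def largest_fields(metadata: dict, top: int = 3) -> list:
    """Top-level metadata fields ranked by their compact JSON size."""
    sizes = {key: compact_size(val) for key, val in metadata.items()}
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:top]


def scan(records: list, limit: int = PINECONE_LIMIT_BYTES) -> list:
    """Return (id, size, largest fields) for every record over the limit."""
    report = []
    for record in records:
        metadata = record.get("metadata", {})
        size = compact_size(metadata)
        if size > limit:
            report.append((record["id"], size, largest_fields(metadata)))
    return report
```

Counting encoded bytes rather than characters matters for non-ASCII text: `compact_size({"a": "é"})` is 10 bytes, although the JSON string has only 9 characters.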
Catch upsert failures
Check metadata size, ID hygiene, Pinecone metadata value rules, and vector dimension consistency before upload.
$ vectormeta validate records.json --target pinecone --dim 1536
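A rough sketch of what such checks might look like. Pinecone accepts strings, numbers, booleans, and lists of strings as metadata values. The rest is assumed for illustration: the error codes, the reading of "ID hygiene" as non-empty unique string IDs, and the `values` key for the vector.

```python
def is_valid_metadata_value(value) -> bool:
    """True for strings, numbers, booleans, and lists of strings."""
    if isinstance(value, (str, bool, int, float)):
        return True
    if isinstance(value, list):
        return all(isinstance(item, str) for item in value)
    return False  # None, nested objects, and mixed lists are rejected


def validate(records: list, dim: int) -> list:
    """Collect (error_code, subject) pairs; an empty list means ready to upsert."""
    errors = []
    seen_ids = set()
    for record in records:
        rid = record.get("id")
        # Assumed ID hygiene: a non-empty string that has not been seen before.
        if not isinstance(rid, str) or not rid.strip():
            errors.append(("invalid_id", rid))
        elif rid in seen_ids:
            errors.append(("duplicate_id", rid))
        else:
            seen_ids.add(rid)
        for key, value in record.get("metadata", {}).items():
            if not is_valid_metadata_value(value):
                errors.append(("invalid_metadata_value", f"{rid}.{key}"))
        # Every vector that is present must match the expected dimension.
        values = record.get("values")
        if values is not None and len(values) != dim:
            errors.append(("dimension_mismatch", rid))
    return errors
```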
Move heavy fields
Move text, HTML, summaries, tables, and OCR payloads into sidecars while preserving useful filter fields.
$ vectormeta fix records.json --sidecar ./sidecar --out ready.json
Restore for inspection
Load content_ref sidecars back into records for debugging, migration work, or local data review.
$ vectormeta hydrate ready.json --sidecar ./sidecar
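Conceptually, hydration is the inverse of the fix step. The sketch below assumes a plain dict stands in for the sidecar store and that `content_ref` is dropped once the payload is merged back. Both are illustrative choices, and `hydrate` here is a hypothetical helper, not the library's API.

```python
import copy


def hydrate(record: dict, store: dict) -> dict:
    """Return a copy of the record with its sidecar payload merged into metadata."""
    restored = copy.deepcopy(record)  # leave the caller's record untouched
    metadata = restored.get("metadata", {})
    ref = metadata.pop("content_ref", None)
    if ref is None:
        return restored  # nothing was moved out of this record
    payload = store[ref]  # raises KeyError if the sidecar entry is missing
    for key, value in payload.items():
        if key != "id":  # the payload's id only links it to the record
            metadata[key] = value
    return restored
```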
Reduction logic
Designed to keep vectors searchable.
The fixer follows a predictable policy order so cleaned metadata stays small without losing the fields your filters depend on.
Exact UTF-8 JSON sizing
Uses compact JSON bytes, not character counts or Python string reprs.
Move explicit fields first
Respects --move-fields before built-in heavy-field defaults.
Greedy non-keep removal
Moves the largest non-keep fields one at a time until metadata fits.
Protect filterable fields
Keeps source, page, doc_id, tags, language, and related fields by default.
Warn on hard tradeoffs
Reports when keep fields must move or a record still cannot fit.
Safe local sidecars
Sanitized filenames, overwrite protection, and path-traversal checks.
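The policy order above can be approximated as follows. This is a sketch under stated assumptions rather than the fixer's actual implementation: the heavy-field defaults are guessed from the example record, explicit move fields are moved unconditionally, and keep fields are never moved (the real tool reports when they must be).

```python
import json

KEEP_FIELDS = {"source", "page", "doc_id", "tags", "language"}
HEAVY_DEFAULTS = {"chunk_text", "raw_html", "summary"}  # assumed names


def compact_size(value) -> int:
    """Byte length of the value as compact UTF-8 JSON."""
    text = json.dumps(value, separators=(",", ":"), ensure_ascii=False)
    return len(text.encode("utf-8"))


def reduce_metadata(metadata: dict, limit: int, move_fields=()):
    """Split metadata into (kept, moved, fits) following the policy order."""
    kept = dict(metadata)
    moved = {}
    # 1. Explicitly requested fields move first.
    for name in move_fields:
        if name in kept:
            moved[name] = kept.pop(name)
    # 2. Heavy defaults, then 3. any other non-keep field, largest first,
    #    one field at a time until the metadata fits.
    for group in (HEAVY_DEFAULTS, None):
        while compact_size(kept) > limit:
            candidates = [
                key for key in kept
                if key not in KEEP_FIELDS and (group is None or key in group)
            ]
            if not candidates:
                break
            biggest = max(candidates, key=lambda key: compact_size(kept[key]))
            moved[biggest] = kept.pop(biggest)
    # fits is False when only keep fields remain and the record is still too large.
    return kept, moved, compact_size(kept) <= limit
```

Moving one field at a time and re-measuring keeps as much metadata in the vector DB record as the limit allows.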
Limit presets
Clear policies for strict and advisory targets.
Pinecone is the clearest strict-limit target in the MVP. Other presets are advisory and should be verified against your deployment.
| Target | Default | Policy | Note |
|---|---|---|---|
| pinecone | 40 KB | strict | Primary MVP target. Verify current official docs. |
| chroma | 256 KB | advisory | Local/configurable deployments. Not a universal hard limit. |
| qdrant | 64 KB | advisory | Conservative default. Configure --limit-kb for your cluster. |
| weaviate | 64 KB | advisory | Conservative default. Configure --limit-kb for your schema. |
| custom | required | manual | Use --target custom --limit-kb for team-specific policy. |
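As an illustration of how the presets in the table could resolve to a byte limit, here is a small lookup sketch. The `PRESETS` dict and `resolve_limit` are hypothetical names, and letting an explicit limit override a preset is an assumption about how `--limit-kb` behaves.

```python
# Defaults copied from the table above: target -> (limit in KB, policy).
PRESETS = {
    "pinecone": (40, "strict"),
    "chroma": (256, "advisory"),
    "qdrant": (64, "advisory"),
    "weaviate": (64, "advisory"),
}


def resolve_limit(target: str, limit_kb=None):
    """Return (limit_bytes, policy) for a target, honoring an explicit override."""
    if target == "custom":
        if limit_kb is None:
            raise ValueError("custom target requires an explicit limit in KB")
        return int(limit_kb * 1024), "manual"
    if target not in PRESETS:
        raise ValueError(f"unknown target: {target}")
    default_kb, policy = PRESETS[target]
    chosen_kb = default_kb if limit_kb is None else limit_kb
    return int(chosen_kb * 1024), policy
```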
Production checks
Built for production workflows
Typed, tested, and linted. Designed to slot into an existing CI pipeline as a validation step.
Use vectormeta before your next vector DB upsert.
Install from PyPI, scan and validate your records, or add safe_upsert to ingestion code.
$ pip install vectormeta