vectormeta

v0.3.0 - MIT License

Stop vector DB metadata limit errors before upsert.

Scan JSON/JSONL records, validate upsert readiness, move heavy payloads to sidecar stores, and use safe_upsert from Python for cleaner ingestion pipelines.

vectormeta - scan + validate + fix + safe_upsert

$ vectormeta scan records.json --target pinecone

Target: pinecone / Limit: 40960 bytes (40.00 KB)

Records scanned: 2 / Oversized: 1

doc_123_chunk_4 -> 76,747 B; largest fields: chunk_text, raw_html, summary

$ vectormeta validate records.json --target pinecone --dim 1536

Records validated: 2 / Errors: 1

metadata_too_large -> doc_123_chunk_4

$ vectormeta fix records.json --target pinecone --sidecar-store sqlite --sidecar sidecars.sqlite --out ready.json

Fix summary: updated 2 records; stored content-addressed sidecar refs.

Before and after

Move storage-heavy payloads out of filterable metadata.

vectormeta keeps source, page, doc_id, and tags in the vector DB record, then writes text, HTML, tables, and summaries to JSON, file, or SQLite sidecars.

Oversized / 76,747 B (about 75 KB) exceeds the 40 KB limit
{
  "id": "doc_123_chunk_4",
  "metadata": {
    "source": "paper.pdf",
    "page": 12,
    "doc_id": "doc_123",
    "chunk_text": "55 KB of text...",
    "raw_html": "10 KB of HTML...",
    "summary": "9.5 KB summary..."
  }
}

vectormeta fix -> sidecar/doc_123_chunk_4.json

Clean / under 1 KB, fits policy
{
  "id": "doc_123_chunk_4",
  "metadata": {
    "source": "paper.pdf",
    "page": 12,
    "doc_id": "doc_123",
    "content_ref": "sqlite:8eb2..."
  }
}

// SQLite/FileStore payload
{
  "id": "doc_123_chunk_4",
  "chunk_text": "55 KB of text...",
  "raw_html": "10 KB of HTML...",
  "summary": "9.5 KB summary..."
}
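The move shown above can be sketched in a few lines. This is a minimal illustration, not vectormeta's implementation: the SHA-256 digest, the `sidecar` table layout, and the helper names `store_payload` and `move_fields` are all assumptions made for the example. Only the `content_ref` key, the `sqlite:` prefix, and the payload shape come from the samples on this page.

```python
import hashlib
import json
import sqlite3


def store_payload(conn: sqlite3.Connection, payload: dict) -> str:
    """Store a payload under the hash of its canonical JSON and return a ref."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()  # assumed hash choice
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sidecar (digest TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    # Content addressing: identical payloads map to one row, so re-runs are no-ops.
    conn.execute("INSERT OR IGNORE INTO sidecar (digest, body) VALUES (?, ?)", (digest, body))
    conn.commit()
    return f"sqlite:{digest}"


def move_fields(record: dict, fields: list[str], conn: sqlite3.Connection) -> dict:
    """Return a copy of the record with the named fields replaced by a content_ref."""
    metadata = dict(record["metadata"])
    payload = {"id": record["id"]}
    for name in fields:
        if name in metadata:
            payload[name] = metadata.pop(name)
    metadata["content_ref"] = store_payload(conn, payload)
    return {**record, "metadata": metadata}


conn = sqlite3.connect(":memory:")
record = {
    "id": "doc_123_chunk_4",
    "metadata": {"source": "paper.pdf", "page": 12, "chunk_text": "long text", "raw_html": "<p>x</p>"},
}
clean = move_fields(record, ["chunk_text", "raw_html", "summary"], conn)
print(sorted(clean["metadata"]))
```

Because the ref is derived from the payload bytes, running the same fix twice yields the same ref and a single stored row.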

Workflow

CLI checks plus Python safe upsert.

Use it locally before upsert, or wire it into CI as a validation and metadata hygiene gate.
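To show the idea behind a safe upsert, here is a hedged sketch of a preflight wrapper. It is not vectormeta's `safe_upsert` signature, which this page does not show; `guarded_upsert`, `MetadataTooLargeError`, and `metadata_size` are hypothetical names, and the `upsert` argument stands in for whatever client call your pipeline uses.

```python
import json

LIMIT_BYTES = 40 * 1024  # Pinecone preset from this page (40960 bytes)


class MetadataTooLargeError(ValueError):
    """Raised before any network call when a record would be rejected."""


def metadata_size(metadata: dict) -> int:
    """Compact UTF-8 JSON size in bytes."""
    text = json.dumps(metadata, separators=(",", ":"), ensure_ascii=False)
    return len(text.encode("utf-8"))


def guarded_upsert(upsert, records: list[dict], limit_bytes: int = LIMIT_BYTES):
    """Hypothetical wrapper: check every record first, then call upsert once."""
    oversized = [
        r["id"] for r in records if metadata_size(r.get("metadata", {})) > limit_bytes
    ]
    if oversized:
        raise MetadataTooLargeError(f"metadata_too_large: {', '.join(oversized)}")
    return upsert(records)


# Stand-in for a vector DB client; swap in your client's own upsert callable.
sent: list[dict] = []
guarded_upsert(sent.extend, [{"id": "a", "metadata": {"page": 1}}])
print(len(sent))
```

Failing the whole batch before the first request keeps a partially uploaded batch from landing in the index.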

01 / SCAN

Detect oversized records

Measure compact UTF-8 JSON byte size, identify oversized records, and show the largest top-level metadata fields.

$ vectormeta scan records.json --target pinecone
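A rough sketch of the measurement, assuming the size is taken over the metadata object alone. `compact_size` and `largest_fields` are illustrative helper names, not part of the vectormeta API.

```python
import json

LIMIT_BYTES = 40 * 1024  # 40960, the Pinecone limit shown in the scan output


def compact_size(value) -> int:
    """Byte length of the value as compact UTF-8 JSON (no padding whitespace)."""
    text = json.dumps(value, separators=(",", ":"), ensure_ascii=False)
    return len(text.encode("utf-8"))


def largest_fields(metadata: dict, top: int = 3) -> list[tuple[str, int]]:
    """Top-level metadata keys ranked by the compact size of their values."""
    sizes = {key: compact_size(val) for key, val in metadata.items()}
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)[:top]


# 30,000 two-byte characters: a character count stays under the limit,
# but the UTF-8 byte size does not.
metadata = {"source": "paper.pdf", "page": 12, "chunk_text": "é" * 30_000}
size = compact_size(metadata)
print(size, size > LIMIT_BYTES)
print(largest_fields(metadata, top=1))
```

The example record is why byte counts matter: non-ASCII text can double or triple in size once encoded.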

02 / VALIDATE

Catch upsert failures

Check metadata size, ID hygiene, Pinecone metadata value rules, and vector dimension consistency before upload.

$ vectormeta validate records.json --target pinecone --dim 1536
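The checks can be approximated as below. Treat this as a sketch under stated assumptions: the `values` key for the vector, the 512-character ASCII ID rule, and the allowed metadata types (string, number, boolean, list of strings) reflect Pinecone's documented rules as commonly described and should be verified against current docs. Only the `metadata_too_large` code appears on this page; the other error codes are made up for illustration.

```python
import json

LIMIT_BYTES = 40 * 1024
MAX_ID_LENGTH = 512  # assumption; confirm against current Pinecone docs


def value_ok(value) -> bool:
    """Assumed rule: str, number, bool, or a list of str. None and dicts fail."""
    if isinstance(value, (str, int, float, bool)):
        return True
    return isinstance(value, list) and all(isinstance(v, str) for v in value)


def id_ok(rec_id) -> bool:
    """Assumed ID hygiene: non-empty ASCII string within the length cap."""
    return (
        isinstance(rec_id, str)
        and 0 < len(rec_id) <= MAX_ID_LENGTH
        and rec_id.isascii()
    )


def validate_record(record: dict, dim: int) -> list[str]:
    """Return a list of error codes; an empty list means the record looks ready."""
    errors = []
    if not id_ok(record.get("id")):
        errors.append("invalid_id")  # hypothetical code
    metadata = record.get("metadata", {})
    body = json.dumps(metadata, separators=(",", ":"), ensure_ascii=False)
    if len(body.encode("utf-8")) > LIMIT_BYTES:
        errors.append("metadata_too_large")
    for key, value in metadata.items():
        if not value_ok(value):
            errors.append(f"invalid_metadata_value:{key}")  # hypothetical code
    if len(record.get("values", [])) != dim:
        errors.append("dimension_mismatch")  # hypothetical code
    return errors


record = {"id": "doc_1", "values": [0.1] * 4, "metadata": {"tags": ["a"], "extra": {"n": 1}}}
print(validate_record(record, dim=4))
```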

03 / FIX

Move heavy fields

Move text, HTML, summaries, tables, and OCR payloads into sidecars while preserving useful filter fields.

$ vectormeta fix records.json --sidecar ./sidecar --out ready.json

04 / HYDRATE

Restore for inspection

Load content_ref sidecars back into records for debugging, migration work, or local data review.

$ vectormeta hydrate ready.json --sidecar ./sidecar
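Hydration is the reverse of the fix step. A minimal sketch, assuming the sidecar payload carries the record `id` plus the moved fields as in the sample above; the `hydrate` helper, the `load_payload` callable, and the `demo:abc123` ref are invented for this example, and keeping `content_ref` in the restored record is a design choice made here, not documented behavior.

```python
def hydrate(record: dict, load_payload) -> dict:
    """Return a copy of the record with sidecar fields merged back into metadata."""
    metadata = dict(record.get("metadata", {}))
    ref = metadata.get("content_ref")
    if ref is None:
        return record  # nothing was moved out of this record
    payload = load_payload(ref)
    for key, value in payload.items():
        if key != "id":
            # Existing metadata wins if a key is present in both places.
            metadata.setdefault(key, value)
    return {**record, "metadata": metadata}


# A dict stands in for the JSON, file, or SQLite sidecar store.
store = {"demo:abc123": {"id": "doc_123_chunk_4", "chunk_text": "full text"}}
clean = {
    "id": "doc_123_chunk_4",
    "metadata": {"source": "paper.pdf", "content_ref": "demo:abc123"},
}
restored = hydrate(clean, store.__getitem__)
print(sorted(restored["metadata"]))
```

Passing the loader as a callable keeps the merge logic independent of where the sidecar lives.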

Reduction logic

Designed to keep vectors searchable.

The fixer follows a predictable policy order so cleaned metadata stays small without losing the fields your filters depend on.

1 / Exact UTF-8 JSON sizing

Uses compact JSON bytes, not character counts or Python string reprs.

2 / Move explicit fields first

Respects --move-fields before built-in heavy-field defaults.

3 / Greedy non-keep removal

Moves the largest non-keep fields one at a time until metadata fits.

4 / Protect filterable fields

Keeps source, page, doc_id, tags, language, and related fields by default.

5 / Warn on hard tradeoffs

Reports when keep fields must move or a record still cannot fit.

6 / Safe local sidecars

Sanitized filenames, overwrite protection, and path-traversal checks.
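The policy order above can be sketched as a small function. This is an illustration under assumptions, not the fixer's source: explicit fields are moved unconditionally, the built-in heavy-field defaults are left out for brevity, and `reduce_metadata`, `DEFAULT_KEEP`, and the warning strings are hypothetical.

```python
import json

DEFAULT_KEEP = frozenset({"source", "page", "doc_id", "tags", "language"})


def compact_size(value) -> int:
    """Compact UTF-8 JSON size in bytes."""
    return len(json.dumps(value, separators=(",", ":"), ensure_ascii=False).encode("utf-8"))


def reduce_metadata(metadata: dict, limit_bytes: int, move_fields=(), keep=DEFAULT_KEEP):
    """Return (kept, moved, warnings) following the order described above."""
    kept, moved, warnings = dict(metadata), {}, []

    # Explicitly requested fields go first.
    for name in move_fields:
        if name in kept:
            moved[name] = kept.pop(name)

    # Greedy pass: largest non-keep field, one at a time, until it fits.
    while compact_size(kept) > limit_bytes:
        candidates = [k for k in kept if k not in keep]
        if not candidates:
            break
        biggest = max(candidates, key=lambda k: compact_size(kept[k]))
        moved[biggest] = kept.pop(biggest)

    # Last resort: move keep fields too, and say so.
    while kept and compact_size(kept) > limit_bytes:
        biggest = max(kept, key=lambda k: compact_size(kept[k]))
        moved[biggest] = kept.pop(biggest)
        warnings.append(f"moved keep field: {biggest}")

    if compact_size(kept) > limit_bytes:
        warnings.append("record still exceeds limit")
    return kept, moved, warnings


metadata = {"source": "paper.pdf", "page": 12, "chunk_text": "x" * 500, "summary": "y" * 100}
kept, moved, warnings = reduce_metadata(metadata, limit_bytes=200)
print(sorted(kept), sorted(moved), warnings)
```

With a 200-byte limit only `chunk_text` has to move; `summary` stays because the record already fits once the largest field is gone.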

Limit presets

Clear policies for strict and advisory targets.

Pinecone is the clearest strict-limit target in the MVP. Other presets are advisory and should be verified against your deployment.

Target    Default limit  Policy    Note
pinecone  40 KB          strict    Primary MVP target. Verify current official docs.
chroma    256 KB         advisory  Local/configurable deployments. Not a universal hard limit.
qdrant    64 KB          advisory  Conservative default. Configure --limit-kb for your cluster.
weaviate  64 KB          advisory  Conservative default. Configure --limit-kb for your schema.
custom    required       manual    Use --target custom --limit-kb for team-specific policy.

Production checks

Built for production workflows

Typed, tested, and linted. Designed to slot into CI without changes.

Python 3.10+, Typer CLI, Rich output, Pydantic config, Preflight validation, safe_upsert API, SQLite sidecars, Content deduplication, JSON + JSONL, Pytest tests, Ruff linting, Mypy strict, GitHub Actions CI, Overwrite protection, Path traversal checks, Store-backed sidecars

Use vectormeta before your next vector DB upsert.

Install from PyPI, scan and validate your records, or add safe_upsert to ingestion code.

$ pip install vectormeta