Variant Systems

MongoDB Vibe Code Cleanup

Your AI embedded everything, indexed nothing, and ignored schema validation. Let's build a database that scales.

At Variant Systems, we pair the right technology with the right approach to ship products that work.

Why this combination

  • AI embeds documents deeply instead of referencing, causing 16MB limit issues
  • Missing indexes on query fields make reads slower as collections grow
  • No JSON schema validation means any shape of document gets stored
  • AI-generated aggregation pipelines are verbose, unreadable, and inefficient

Why AI Treats Every Collection Like a JSON Dump

AI treats MongoDB like a JSON dumping ground. It embeds everything. An order document contains full product documents, which contain full category documents, which contain full vendor documents. The nesting goes three or four levels deep. This works in a demo with 10 orders. In production with 100,000 orders, documents approach the 16MB BSON limit. Updating a vendor name means updating every order that contains a product from that vendor. The “schema-less” flexibility becomes a data consistency nightmare.
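A sketch of the difference, with hypothetical field names (any real schema will vary):

```javascript
// What AI tends to generate: full product and vendor documents
// embedded inside every order, duplicated across the collection.
const embeddedOrder = {
  _id: "order-1",
  items: [{
    product: {
      name: "Widget",
      price: 9.99,
      vendor: { name: "Acme Corp", address: "1 Main St" }, // repeated in every order
    },
    quantity: 2,
  }],
};

// Referenced alternative: orders hold an ID plus a small snapshot of the
// fields that must be frozen at purchase time (name, price).
const referencedOrder = {
  _id: "order-1",
  items: [{
    productId: "prod-42",       // reference into the products collection
    nameAtPurchase: "Widget",   // denormalized snapshot, fixed at order time
    priceAtPurchase: 9.99,
    quantity: 2,
  }],
};

// Renaming the vendor now touches one vendor document, not every order.
```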

Indexes are missing or wrong. AI creates collections and immediately starts querying them. It never runs createIndex(). MongoDB does a full collection scan on every query. With 10,000 documents, you don’t notice. With 10 million, your API times out. When AI does add indexes, they’re wrong - single-field indexes on queries that filter by three fields. The index gets used for the first field and the rest is a scan.
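For illustration, here is the gap in mongosh terms (collection and field names are assumptions):

```javascript
// A query that filters by three fields:
db.orders.find({ userId: 42, status: "shipped", region: "EU" })

// The single-field index AI tends to add -- only userId is used;
// status and region are checked by scanning the matched documents:
db.orders.createIndex({ userId: 1 })

// A compound index that covers all three filter fields:
db.orders.createIndex({ userId: 1, status: 1, region: 1 })
```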

Schema validation doesn’t exist. MongoDB supports JSON Schema validation at the collection level, but AI never sets it up. Any document shape gets inserted. A user document without an email field? Stored. An order with a string where a number should be? Stored. A product with a negative price? Stored. Your application discovers these inconsistencies at read time, scattered across different code paths, weeks after the bad data was inserted.
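A minimal $jsonSchema validator showing what the database could have rejected; the collection, fields, and bounds here are illustrative:

```javascript
// Validator for a hypothetical users collection: email is required and
// must look like an address, age must be a non-negative integer.
const userValidator = {
  $jsonSchema: {
    bsonType: "object",
    required: ["email", "createdAt"],
    properties: {
      email:     { bsonType: "string", pattern: "^.+@.+$" },
      age:       { bsonType: "int", minimum: 0 },
      createdAt: { bsonType: "date" },
    },
  },
};

// Applied at creation time (mongosh):
//   db.createCollection("users", { validator: userValidator })
```

With this in place, the user-without-an-email and negative-price cases above fail at insert time instead of surfacing weeks later.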

Aggregation pipelines from AI are verbose and slow. AI generates $unwind followed by $group where a simple $project would work. It puts $match stages after $lookup stages instead of before, so the lookup processes documents that will be filtered out anyway. It uses $addFields repeatedly instead of combining transformations. The pipeline runs, but it processes 10x the data it needs to.
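The stage-ordering problem in miniature (collection and field names are hypothetical):

```javascript
// AI-typical ordering: $lookup joins every order, then $match
// discards most of the joined results.
const slowPipeline = [
  { $lookup: { from: "products", localField: "productId",
               foreignField: "_id", as: "product" } },
  { $match: { status: "shipped" } },
  { $group: { _id: "$userId", total: { $sum: "$amount" } } },
];

// Reordered: filter first, so the join only touches matching documents.
const fastPipeline = [
  { $match: { status: "shipped" } },
  { $lookup: { from: "products", localField: "productId",
               foreignField: "_id", as: "product" } },
  { $group: { _id: "$userId", total: { $sum: "$amount" } } },
];

// Helper to read off the stage order of a pipeline.
const stageNames = (pipeline) => pipeline.map((stage) => Object.keys(stage)[0]);
```

Both pipelines return the same result; the second does a fraction of the work.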

Restructuring Documents, Indexes, and Aggregation Pipelines

We start with the MongoDB profiler. We enable profiling for slow operations and analyze the output. Which queries are slowest? Which run most often? Which scan the most documents? This data drives every decision. We don’t guess about index needs - we measure.
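A minimal profiler setup in mongosh (the 100ms threshold is illustrative; tune it to your workload):

```javascript
// Record every operation slower than 100ms.
db.setProfilingLevel(1, { slowms: 100 })

// Worst offenders: slow operations, most documents examined first.
db.system.profile.find({ millis: { $gt: 100 } })
  .sort({ docsExamined: -1 })
  .limit(10)
```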

Document models get restructured. We evaluate every embedded document against three criteria: Is it accessed independently? Does it change frequently? Can it exceed reasonable size? Documents that answer yes to any of these become references instead of embeddings. We use $lookup for joins where needed, but the goal is a document model where the most common query patterns don’t need joins at all.

Indexes are built from query patterns. We analyze the profiler output and explain() results for every slow query. Compound indexes match the query’s filter fields, sort fields, and projection fields - in that order. Covered queries, where the index contains all the data the query needs, eliminate collection scans entirely. We verify with explain("executionStats") that every indexed query uses the index.
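A mongosh sketch of the build-and-verify loop (names assumed):

```javascript
// Compound index matching the filter field, then the sort field.
db.orders.createIndex({ userId: 1, createdAt: -1 })

// Verify the plan: the winning plan should show IXSCAN rather than
// COLLSCAN, with totalDocsExamined close to nReturned.
db.orders.find({ userId: 42 }).sort({ createdAt: -1 })
  .explain("executionStats")
```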

JSON Schema validation goes on every collection. We define required fields, field types, value ranges, and pattern constraints. Validation runs on insert and update. Documents that violate the schema are rejected at the database level, not discovered in application code. We use validationAction: "error" for critical collections and validationAction: "warn" for collections that need a migration period.
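As a sketch, tightening an existing collection in mongosh during a migration period (collection name and rules are illustrative):

```javascript
db.runCommand({
  collMod: "orders",
  validator: {
    $jsonSchema: { bsonType: "object", required: ["userId", "total"] },
  },
  validationLevel: "moderate",  // existing invalid documents can still be updated
  validationAction: "warn",     // log violations instead of rejecting writes
})
// Once existing data is migrated, switch to validationAction: "error".
```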

250KB Documents Shrink to 2KB With Proper References

Before: A collection of 5 million order documents averaging 250KB each because products are fully embedded. The most common query - orders by user - takes 4.2 seconds with a full collection scan. No schema validation. The aggregation pipeline for the monthly sales report takes 45 seconds and uses 2GB of RAM.

After: Order documents average 2KB with product references. A compound index on {userId: 1, createdAt: -1} makes the orders-by-user query return in 3ms. Schema validation prevents invalid documents at write time. The sales report aggregation runs in 800ms because $match filters early, $lookup only processes matched documents, and $group works on a fraction of the data.

Storage costs drop significantly. Those 5 million orders went from 1.2TB to 10GB. The savings on disk, backup storage, and replication bandwidth add up fast. Read performance improves because the working set fits in RAM instead of spilling to disk.

Schema Validation and Profiler Alerts That Stop Drift Early

We configure MongoDB Atlas Performance Advisor or install mtools for self-hosted deployments. Slow queries trigger alerts with suggested indexes. New queries that cause collection scans are flagged within hours of deployment.

Schema validation becomes part of the deployment pipeline. Every migration that modifies a collection also updates the validation schema. CI runs a check to ensure all collections have validation rules. Collections without schemas fail the pipeline.

We add Mongoose or Prisma with strict mode for the application layer. Schema definitions in the application match the database validation rules. TypeScript types are generated from the schema so the compiler enforces document structure. AI-generated code that inserts invalid documents fails at compile time, not in production.
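A Mongoose sketch of this setup (requires the mongoose package; model and field names are hypothetical):

```javascript
const mongoose = require("mongoose");

// strict: "throw" rejects documents with fields the schema doesn't declare,
// instead of silently dropping them -- stray AI-generated fields fail loudly.
const orderSchema = new mongoose.Schema(
  {
    userId: { type: mongoose.Schema.Types.ObjectId, required: true },
    total:  { type: Number, required: true, min: 0 },
    status: { type: String, enum: ["pending", "shipped", "delivered"] },
  },
  { strict: "throw" }
);

const Order = mongoose.model("Order", orderSchema);
```

The required fields and ranges here mirror the database-level $jsonSchema rules, so the two layers reject the same bad documents.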

Aggregation pipelines get reviewed for stage ordering. We document the rules: $match before $lookup. $project to reduce document size before expensive stages. $limit as early as possible. These rules are in a reviewable checklist that applies to every PR that touches an aggregation. AI doesn’t optimize pipeline order - your review process does.

What you get

  • Document model audit with embedding vs. referencing recommendations
  • Index strategy based on actual query patterns from profiler data
  • JSON schema validation for all collections
  • Aggregation pipeline optimization and rewriting
  • Connection and pool configuration for production workloads

Ideal for

  • MongoDB apps where AI created deeply embedded documents that hit size limits
  • Products with read performance degrading as collections grow
  • Teams with no schema validation who find corrupt data in production
  • Founders seeing connection timeouts during peak usage

Ready to build?

Tell us about your project and we'll figure out how we can help.

Get in touch