MongoDB Aggregation: Master the Data Assembly Line
Master the MongoDB aggregation pipeline — learn core stages, advanced joins, performance rules, and a common workflow for transforming data at scale.
If standard find() queries are like picking a single item off a shelf, the Aggregation Pipeline is like owning the entire factory. It allows you to take raw documents, put them through a series of “processing stations,” and come out with a finished product — whether that’s a summarized report, a complex data transformation, or a multi-collection join.
In this guide, we’ll break down how to build these pipelines without losing your mind (or your server’s RAM).
The big idea: the “Assembly Line” analogy
Think of an aggregation pipeline as a literal assembly line in a factory:
- Input: A stream of raw documents from your collection.
- Stages: Each stage is a workstation that performs one specific task (filtering, sorting, or reshaping).
- Output: The transformed data is passed to the next station. By the time it reaches the end, you have exactly the result you need.
Each workstation is called a Stage, and the syntax looks like an array of instructions:
db.collection.aggregate([ { stage1 }, { stage2 }, { stage3 } ])
The “Big Three” core stages
Most pipelines rely on these three heavy lifters to do 90% of the work.
1. $match (The Filter)
This is usually your first stage. It filters the documents so you only process what you actually need.
- SQL Equivalent: WHERE
- Pro-tip: Put this as early as possible. If the first stage filters out 90% of your data, every later stage has only a tenth of the documents to process.
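As a minimal sketch of an early filter, assume a hypothetical orders collection whose documents carry status and createdAt fields (neither name comes from this guide). The pipeline is a plain JavaScript array, so it can be built and inspected before it is handed to aggregate() in mongosh:

```javascript
// Hypothetical schema: orders documents with `status` and `createdAt` fields.
const since = new Date("2024-01-01T00:00:00Z");

const matchPipeline = [
  // Keep only shipped orders created on or after the cutoff date.
  { $match: { status: "shipped", createdAt: { $gte: since } } }
];

// In mongosh (not executed here): db.orders.aggregate(matchPipeline)
```

Both conditions live in one $match document, which behaves like an AND in a SQL WHERE clause.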
2. $group (The Summarizer)
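This is where the magic happens. You group documents by a specific field (like category or userId), which becomes the _id of each output document, and compute per-group values such as totals, counts, or averages with accumulators like $sum and $avg.
- SQL Equivalent: GROUP BY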
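A grouping stage might look like the sketch below. It assumes hypothetical category, price, and quantity fields on each input document; the output field names (totalRevenue, orderCount, avgPrice) are made up for illustration:

```javascript
// Hypothetical schema: each input document has `category`, `price`, `quantity`.
const groupPipeline = [
  {
    $group: {
      _id: "$category", // one output document per distinct category
      totalRevenue: { $sum: { $multiply: ["$price", "$quantity"] } },
      orderCount: { $sum: 1 }, // adds 1 for every document in the group
      avgPrice: { $avg: "$price" }
    }
  }
];

// In mongosh (not executed here): db.orders.aggregate(groupPipeline)
```

Using _id: null instead of a field path would collapse the whole input into a single summary document.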
3. $project (The Reshaper)
This stage lets you pick exactly which fields you want to pass along. You can rename fields, remove them, or even create new ones on the fly.
- SQL Equivalent: SELECT
- Pro-tip: Use this at the end to clean up your data before it hits your app.
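The three things $project can do (keep, rename, compute) can be sketched in one stage. The field names userId and totalPrice and the threshold of 100 are assumptions chosen for illustration:

```javascript
// Hypothetical schema: input documents have `userId` and `totalPrice`.
const projectPipeline = [
  {
    $project: {
      _id: 0, // drop the default _id field from the output
      customer: "$userId", // expose userId under a new name
      totalPrice: 1, // pass this field through unchanged
      isLargeOrder: { $gte: ["$totalPrice", 100] } // computed boolean field
    }
  }
];

// In mongosh (not executed here): db.orders.aggregate(projectPipeline)
```

Any field not listed here (other than _id, which is kept unless set to 0) is dropped from the output.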
Advanced moves: joins and arrays
Sometimes your data is messy or spread across different collections. That’s where these stages come in:
- $unwind (The Array Exploder): If a document has an array of 5 items, $unwind will turn that into 5 separate documents. This is essential if you want to run a $group or $sort on individual array elements.
- $lookup (The Join): This allows you to pull in data from another collection. For example, you can join an Orders collection with a Products collection to see the actual names of the items purchased.
- $addFields (The Calculator): Similar to $project, but it just adds new fields to the existing document without removing the old ones. Great for calculating computed values, for example:
  {
    $addFields: {
      totalPrice: { $multiply: ["$price", "$quantity"] }
    }
  }
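The three stages above are often combined. The sketch below assumes a hypothetical orders collection whose documents hold an items array of { productId, quantity } entries, and a products collection keyed by _id with a price field; all of these names are illustrative rather than taken from a real schema:

```javascript
// Hypothetical schema:
//   orders:   { _id, items: [{ productId, quantity }, ...] }
//   products: { _id, name, price }
const joinPipeline = [
  // One output document per array element in `items`.
  { $unwind: "$items" },
  {
    // Left outer join against the products collection.
    $lookup: {
      from: "products",
      localField: "items.productId",
      foreignField: "_id",
      as: "product"
    }
  },
  // $lookup stores matches in an array; flatten it to a single sub-document.
  // Items without a matching product are dropped by this stage.
  { $unwind: "$product" },
  // Add a computed field while keeping everything else on the document.
  {
    $addFields: {
      lineTotal: { $multiply: ["$product.price", "$items.quantity"] }
    }
  }
];

// In mongosh (not executed here): db.orders.aggregate(joinPipeline)
```

If unmatched items should be kept, the second $unwind can use the document form with preserveNullAndEmptyArrays: true.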
Performance rules: keep it snappy
Aggregations can be resource-intensive. Follow these “Laws of the Pipeline” to keep your database happy:
- Filter Early: Use $match first to reduce the document count.
- Use Indexes: Only stages at the very start of a pipeline, typically $match and $sort, can use your collection’s indexes. If you put a $project before a $match, you might lose the ability to use an index for that filter!
- Watch the 100MB Limit: Each stage has a RAM limit of 100MB. If you’re processing millions of documents, you might need to enable allowDiskUse: true, though it’s slower.
- $sort + $limit Optimization: If you put a $limit immediately after a $sort, MongoDB is smart enough to only keep the top “N” items in memory while sorting, which is much faster than sorting everything and then throwing most of it away.
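The rules above can be sketched as a pipeline plus an options object. The status and totalPrice fields and the limit of 10 are assumptions for illustration; whether an index is actually used depends on the indexes that exist on your collection:

```javascript
// Hypothetical schema: orders documents with `status` and `totalPrice`.
const topOrdersPipeline = [
  // First stage, so an index on `status` (if one exists) can serve the filter.
  { $match: { status: "shipped" } },
  // Sort descending by total...
  { $sort: { totalPrice: -1 } },
  // ...with $limit directly behind it, so only the top 10 are kept while sorting.
  { $limit: 10 }
];

// Lets stages that exceed the memory limit spill to temporary files on disk.
const aggregateOptions = { allowDiskUse: true };

// In mongosh (not executed here):
//   db.orders.aggregate(topOrdersPipeline, aggregateOptions)
//   db.orders.explain().aggregate(topOrdersPipeline)  // inspect the chosen plan
```

Running the pipeline through explain() is a reasonable way to check whether the first stage is really hitting an index.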
Common aggregation workflow
| Order | Stage | Purpose |
|---|---|---|
| 1 | $match | Narrow down the data (uses indexes!). |
| 2 | $unwind | Flatten arrays if needed. |
| 3 | $group | Calculate totals, averages, or counts. |
| 4 | $sort | Put your results in order. |
| 5 | $limit | Keep only the top results. |
| 6 | $project | Clean up and format the output. |
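Put together, the six rows of the table might translate into a pipeline like this one. It is a sketch under assumptions: the orders collection, its status field, and the category, price, and quantity fields inside each items entry are all hypothetical:

```javascript
// Hypothetical schema:
//   orders: { status, items: [{ category, price, quantity }, ...] }
const topCategoriesPipeline = [
  // 1. Narrow down the data.
  { $match: { status: "shipped" } },
  // 2. Flatten the items array.
  { $unwind: "$items" },
  // 3. Calculate totals per category.
  {
    $group: {
      _id: "$items.category",
      revenue: { $sum: { $multiply: ["$items.price", "$items.quantity"] } },
      itemsSold: { $sum: "$items.quantity" }
    }
  },
  // 4. Highest revenue first.
  { $sort: { revenue: -1 } },
  // 5. Keep only the top five categories.
  { $limit: 5 },
  // 6. Clean up the output shape.
  { $project: { _id: 0, category: "$_id", revenue: 1, itemsSold: 1 } }
];

// In mongosh (not executed here): db.orders.aggregate(topCategoriesPipeline)
```

Not every pipeline needs all six stages; $unwind in particular only belongs there when the data you are grouping on lives inside an array.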
Summary
The Aggregation Pipeline is the most powerful tool in your MongoDB toolkit. By thinking of it as an assembly line — filtering early, grouping logically, and reshaping only at the end — you can replace hundreds of lines of application code with a single database call. With tools like Spanna to help you build, visualize, and debug these pipelines stage-by-stage, you’ll be able to transform complex data into actionable insights in no time.