MongoDB Aggregation: Master the Data Assembly Line

Master the MongoDB aggregation pipeline — learn core stages, advanced joins, performance rules, and a common workflow for transforming data at scale.

If standard find() queries are like picking a single item off a shelf, the Aggregation Pipeline is like owning the entire factory. It allows you to take raw documents, put them through a series of “processing stations,” and come out with a finished product — whether that’s a summarized report, a complex data transformation, or a multi-collection join.

In this guide, we’ll break down how to build these pipelines without losing your mind (or your server’s RAM).

The big idea: the “Assembly Line” analogy

Think of an aggregation pipeline as a literal assembly line in a factory:

  1. Input: A stream of raw documents from your collection.
  2. Stages: Each stage is a workstation that performs one specific task (filtering, sorting, or reshaping).
  3. Output: The transformed data is passed to the next station. By the time it reaches the end, you have exactly the result you need.

Each workstation is called a Stage, and the syntax looks like an array of instructions:

db.collection.aggregate([ { stage1 }, { stage2 }, { stage3 } ])
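To make that shape concrete, here is a minimal sketch of a three-stage pipeline written as a plain JavaScript array. The orders collection and its status, customerId, and amount fields are hypothetical names invented for this example, and the aggregate() call is left as a comment so the snippet runs without a database.

```javascript
// Hypothetical collection and fields: orders, status, customerId, amount.
const basicPipeline = [
  // Station 1: keep only shipped orders.
  { $match: { status: "shipped" } },
  // Station 2: one output document per customer, with a running total.
  { $group: { _id: "$customerId", total: { $sum: "$amount" } } },
  // Station 3: biggest spenders first.
  { $sort: { total: -1 } }
];

// In mongosh, against a real database, this would be run as:
// db.orders.aggregate(basicPipeline)
```

Each element of the array is one stage, and the stages run in the order they are listed.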

The “Big Three” core stages

Most pipelines rely on these three heavy lifters to do 90% of the work.

1. $match (The Filter)

This is usually your first stage. It filters the documents so you only process what you actually need.

  • SQL Equivalent: WHERE
  • Pro-tip: Put this as early as possible. If the first stage filters out 90% of your documents, every later stage has only a tenth of the data left to process.
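A $match stage takes the same kind of query document you would pass to find(). The sketch below assumes hypothetical status and createdAt fields; it is roughly what WHERE status = 'shipped' AND createdAt >= '2024-01-01' would express in SQL.

```javascript
// Hypothetical fields: status, createdAt.
const matchStage = {
  $match: {
    status: "shipped",                                       // equality condition
    createdAt: { $gte: new Date("2024-01-01T00:00:00Z") }    // range condition
  }
};

// Placed first, so later stages only see the matching documents.
const matchFirstPipeline = [matchStage];

// In mongosh: db.orders.aggregate(matchFirstPipeline)
```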

2. $group (The Summarizer)

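This is where the summarizing happens. You group documents by a specific field (like category or userId) and use accumulators such as $sum, $avg, or $max to calculate totals, averages, and other per-group values.

  • SQL Equivalent: GROUP BY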
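As a sketch of what a $group stage can look like, the example below groups by a category field and computes three values per group. The field names (category, price) and the output names (totalRevenue, averagePrice, itemCount) are assumptions made up for illustration.

```javascript
// Hypothetical input fields: category, price.
const groupStage = {
  $group: {
    _id: "$category",                    // the group key; one output document per distinct category
    totalRevenue: { $sum: "$price" },    // sum of price within each group
    averagePrice: { $avg: "$price" },    // mean price within each group
    itemCount: { $sum: 1 }               // adds 1 per document, i.e. a count
  }
};

// In mongosh: db.products.aggregate([groupStage])
```

The group key always lands in _id; setting _id to null would put every document into a single group.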

3. $project (The Reshaper)

This stage lets you pick exactly which fields you want to pass along. You can rename fields, remove them, or even create new ones on the fly.

  • SQL Equivalent: SELECT
  • Pro-tip: Use this at the end to clean up your data before it hits your app.
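The following sketch shows the three things $project is typically used for: keeping a field, renaming one, and computing a new one. All field names here (name, customerId, price, quantity) are hypothetical.

```javascript
// Hypothetical input fields: name, customerId, price, quantity.
const projectStage = {
  $project: {
    _id: 0,                                          // drop the default _id field
    name: 1,                                         // keep name as-is
    customer: "$customerId",                         // "rename": expose customerId under a new name
    total: { $multiply: ["$price", "$quantity"] }    // create a computed field on the fly
  }
};

// In mongosh: db.orders.aggregate([projectStage])
```

Any field not listed (other than _id, which is included unless you turn it off) is dropped from the output.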

Advanced moves: joins and arrays

Sometimes your data is messy or spread across different collections. That’s where these stages come in:

  • $unwind (The Array Exploder): If a document has an array of 5 items, $unwind will turn that into 5 separate documents. This is essential if you want to run a $group or $sort on individual array elements.
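  • $lookup (The Join): This allows you to pull in data from another collection, much like a left outer join in SQL. The matching documents are attached to each input document as an array field. For example, you can join an Orders collection with a Products collection to see the actual names of the items purchased.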
  • $addFields (The Calculator): Similar to $project, but it just adds new fields to the existing document without removing the old ones. Great for calculating computed values, for example:
{
  $addFields: {
    totalPrice: { $multiply: ["$price", "$quantity"] }
  }
}
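To show how $unwind and $lookup fit together, here is a hedged sketch of the Orders and Products join described above. The collection name products and the fields items, items.productId, and product are assumptions; your schema will differ. The last few lines imitate what $unwind does to a single document in plain JavaScript, purely for illustration.

```javascript
// Hypothetical schema: each order has an items array of { productId, quantity }.
const joinPipeline = [
  // One output document per element of the items array.
  { $unwind: "$items" },
  // Attach matching product documents as an array field named "product".
  {
    $lookup: {
      from: "products",                 // the other collection
      localField: "items.productId",    // field on the order side
      foreignField: "_id",              // field on the product side
      as: "product"                     // name of the output array field
    }
  },
  // $lookup always yields an array, so flatten it.
  // Note: by default this drops orders whose product was not found.
  { $unwind: "$product" }
];

// In mongosh: db.orders.aggregate(joinPipeline)

// Plain-JavaScript imitation of $unwind on one document with a non-empty array:
const sampleOrder = { _id: 1, items: [{ productId: "p1" }, { productId: "p2" }] };
const unwound = sampleOrder.items.map((item) => ({ ...sampleOrder, items: item }));
```

After the imitation runs, unwound holds two documents, each with the same _id and a single object in items.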

Performance rules: keep it snappy

Aggregations can be resource-intensive. Follow these “Laws of the Pipeline” to keep your database happy:

  1. Filter Early: Use $match first to reduce the document count.
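  2. Use Indexes: Only stages at the very start of a pipeline (typically $match and $sort) can use your collection’s indexes. The optimizer can sometimes move a $match ahead of a $project for you, but if the filter depends on a field that the $project computes, it cannot, and you lose the index for that filter.
  3. Watch the 100MB Limit: Each stage has a RAM limit of 100MB. If you’re processing millions of documents, a memory-hungry stage such as $sort or $group may need allowDiskUse: true so it can spill to temporary files on disk, though that is slower. (Since MongoDB 6.0, spilling to disk is allowed by default.)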
  4. $sort + $limit Optimization: If you put a $limit immediately after a $sort, MongoDB is smart enough to only keep the top “N” items in memory while sorting, which is much faster than sorting everything and then throwing most of it away.
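The rules above can be sketched as a single pipeline plus an options object. The collection and field names (orders, status, amount) are hypothetical, and whether an index is actually used depends on the indexes that exist in your deployment, so treat the commented explain call as the way to check rather than a guarantee.

```javascript
// Hypothetical fields: status, amount.
const topTenPipeline = [
  { $match: { status: "shipped" } },   // first stage, so an index on status can be used
  { $sort: { amount: -1 } },           // largest amounts first
  { $limit: 10 }                       // directly after $sort, so only the top 10 are kept while sorting
];

// Second argument to aggregate(): lets large stages spill to disk.
const aggregateOptions = { allowDiskUse: true };

// In mongosh:
// db.orders.aggregate(topTenPipeline, aggregateOptions)
// To inspect the plan and confirm index use:
// db.orders.explain("executionStats").aggregate(topTenPipeline)
```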

Common aggregation workflow

Order  Stage     Purpose
1      $match    Narrow down the data (uses indexes!).
2      $unwind   Flatten arrays if needed.
3      $group    Calculate totals, averages, or counts.
4      $sort     Put your results in order.
5      $limit    Keep only the top results.
6      $project  Clean up and format the output.
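Put together, the workflow above might look like the sketch below, which finds the five best-selling products among shipped orders. Every collection and field name (orders, status, items, productId, quantity, unitsSold) is an assumption made for the example.

```javascript
// Hypothetical schema: orders with a status field and an items array of { productId, quantity }.
const workflowPipeline = [
  { $match: { status: "shipped" } },                                               // 1. narrow down
  { $unwind: "$items" },                                                           // 2. flatten the array
  { $group: { _id: "$items.productId", unitsSold: { $sum: "$items.quantity" } } }, // 3. total per product
  { $sort: { unitsSold: -1 } },                                                    // 4. best sellers first
  { $limit: 5 },                                                                   // 5. keep the top five
  { $project: { _id: 0, productId: "$_id", unitsSold: 1 } }                        // 6. tidy the output
];

// In mongosh: db.orders.aggregate(workflowPipeline)
```

Not every pipeline needs all six stages; drop the ones that do not apply, but keep the remaining stages in this relative order.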

Summary

The Aggregation Pipeline is one of the most powerful tools in your MongoDB toolkit. By thinking of it as an assembly line (filtering early, grouping logically, and reshaping only at the end), you can often replace long stretches of application code with a single database call. With tools like Spanna to help you build, visualize, and debug these pipelines stage-by-stage, you’ll be able to transform complex data into actionable insights in no time.
