b¬äd†ødd

Dedicated to design and performance of databases and audio systems.

MongoDB:
  Aggregation Framework

Overview:

Within MongoDB there are a couple of ways to aggregate data. One is via its Aggregation Framework and the other is using Map Reduce. While much buzz is generated regarding the latter across Big Data platforms, the former should not be discounted.

As newer technologies mature they often look to extend their capabilities. One popular facet being touted is the concept of operational BI. Today we associate BI as a form of reporting and the first level of data warehousing capability. Traditionally we think of dimensional schema/models with fact and dimension tables. Historically these were necessary for end user understanding and developing a physical design that worked well on SMP and traditionally OLTP-based hardware and query optimizers.

However, with some of the newer technologies like MongoDB, is there some overlap into the BI space? Instead of full-blown dimensional models, could the more simplistic sets be cast as hierarchical structures embedded into single collections? It is early on the curve, but it does make for an interesting case. However, it is important to understand the scope of the need. Within a precise usage, and a limited complexity, it may offer value.

The pros include in-memory capabilities, a relative quicker time-to-market, and a lower cost; and, MongoDB has automatic data replication capabilities within and across data centers. Think: latency mitigation, high availability, disaster recovery, etc. Plus, the technology accommodates near/real-time updates if designed appropriately.

The cons are that it would require a data mart style of implementation, potentially requiring redundant and aggregate stores of data which are purpose/usage-focused. MongoDB is also early on in their relationship with many supporting vendor technologies.

With near/real-time, replication, and lower complexity needs, MongoDB may provide a database option for operational BI. However, it would need to be validated along with the supporting interfaces for data integration and data visualization. With almost any reporting, aggregation is key.

That aside, let's view its Aggregate Framework which is native to MongoDB. It is a pipeline concept that applies criteria and performs projection and aggregation similar to SQL concepts we are all familiar with.

For example: SQL equivalents

SELECT ~ project
WHERE ~ match
GROUP BY ~ group
ORDER BY ~ sort

Using our Winter Olympics data from the recent past blogs, here are a few examples of Aggregation Framework.

A count of Winter Olympic games awarded by country (i.e. includes the planned 1940 and 1944 games as well as 2018).

// Winter Olympics awardings by country
db.olympics.aggregate(
    {"$match" : { "year" : { $lte : 2018 }}},
    {"$project" : {"location.country" : 1}},
    {"$group" : {"_id" : "$location.country", "count" : {"$sum" : 1}}},
    {"$sort" : {"count" : -1, "_id" : 1}}
)

results in

{ "_id" : "United States", "count" : 4 }
{ "_id" : "France", "count" : 3 }
{ "_id" : "Italy", "count" : 3 }
{ "_id" : "Japan", "count" : 3 }
{ "_id" : "Switzerland", "count" : 3 }
{ "_id" : "Austria", "count" : 2 }
{ "_id" : "Canada", "count" : 2 }
{ "_id" : "Germany", "count" : 2 }
{ "_id" : "Norway", "count" : 2 }
{ "_id" : "Russia", "count" : 1 }
{ "_id" : "South Korea", "count" : 1 }
{ "_id" : "Yugoslavia", "count" : 1 }

However, some of those games were never celebrated due to WWII and the 2018 South Korean games have not yet occurred. We refine our query to get a count of Winter Olympic games actually celebrated by country.

// Winter Olympics celebrations by country
db.olympics.aggregate(
    {"$match" : { "celebrated" : { $ne : false }}},
    {"$match" : { "year" : { $lte: 2014 }}},
    {"$project" : {"location.country" : 1}},
    {"$group" : {"_id" : "$location.country", "count" : {"$sum" : 1}}},
    {"$sort" : {"count" : -1, "_id" : 1}}
)

Now the result set is

{ "_id" : "United States", "count" : 4 }
{ "_id" : "France", "count" : 3 }
{ "_id" : "Austria", "count" : 2 }
{ "_id" : "Canada", "count" : 2 }
{ "_id" : "Italy", "count" : 2 }
{ "_id" : "Japan", "count" : 2 }
{ "_id" : "Norway", "count" : 2 }
{ "_id" : "Switzerland", "count" : 2 }
{ "_id" : "Germany", "count" : 1 }
{ "_id" : "Russia", "count" : 1 }
{ "_id" : "Yugoslavia", "count" : 1 }

An example of querying into an array of multiple disciplines (e.g. Ice Hockey, Figure Skating) per games. Here, the games by country that included Skeleton.

// Winter Olympics that have been celebrated and involve Skeleton:
db.olympics.aggregate(
    {"$match" : { "celebrated" : { $ne : false }}},
    {"$match" : { "year" : { $lte : 2014 }}},
    {"$match" : { "discipline" : { $in : ["Skeleton"] }}},
    {"$project" : {"location.country" : 1}},
    {"$group" : {"_id" : "$location.country", "count" : {"$sum" : 1}}},
    {"$sort" : {"count" : -1, "_id" : 1}}
)

Results in

{ "_id" : "Switzerland", "count" : 2 }
{ "_id" : "Canada", "count" : 1 }
{ "_id" : "Italy", "count" : 1 }
{ "_id" : "Russia", "count" : 1 }
{ "_id" : "United States", "count" : 1 }

Skeleton only occurred in the 1928 and 1948 games, both in St. Moritz Switzerland, and from the 2002 Salt Lake City games onward.

I hope this helps to illustrate some of the capabilities built into MongoDB for your consideration.

Cheers
-Brad