Counting Unique Items with Map-Reduce
Problem
You have a collection that stores pageviews by user, and you want to count the total number of unique user visits per day using map-reduce.
Each pageview document looks something like this:
{
"url" : "http://example.com/photos",
"user_id" : ObjectID('4be1c916e031933119d78b30'),
"date": "Wed May 05 2010 15:37:58 GMT-0400 (EDT)"
}
The solution requires grouping the pageviews by day and then counting, for each day, both the total number of visits and the number of unique user visits.
Solution
What's tricky about this situation is that it requires a two-pass map-reduce in order to scale well. The first pass groups by date and user id. This yields one result per user per day and, as a side effect, the number of pageviews for that user on that day. The second pass then counts those results by day.
1. First Pass
Map Step
The only tricky part about the map function is making sure that we emit on the day. Since we're storing a full date, we need to pull out just the year, month, and day, and then emit on that value:
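map = function() {
  // Truncate the timestamp to midnight UTC so that all views on the same day share a key.
  var day = Date.UTC(this.date.getUTCFullYear(), this.date.getUTCMonth(), this.date.getUTCDate());
  emit({day: day, user_id: this.user_id}, {count: 1});
}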
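If you want a cheaper date calculation, you can truncate the millisecond timestamp arithmetically by subtracting the time elapsed since midnight UTC:
var day = this.date.getTime() - (this.date.getTime() % (24 * 60 * 60 * 1000));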
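The day truncation can be checked outside the database. Here is a minimal sketch in plain JavaScript comparing a component-based truncation with an arithmetic one. The helper names dayFromParts and dayFromModulo are hypothetical, and the sketch assumes days are bucketed in UTC:

```javascript
// Milliseconds in one day; JavaScript timestamps are measured in milliseconds.
var MS_PER_DAY = 24 * 60 * 60 * 1000;

// Truncate by rebuilding the timestamp from the UTC year, month, and day.
function dayFromParts(date) {
  return Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate());
}

// Truncate arithmetically by subtracting the time elapsed since midnight UTC.
function dayFromModulo(date) {
  var ms = date.getTime();
  return ms - (ms % MS_PER_DAY);
}

// A made-up sample timestamp: 5 May 2010, 19:37:58 UTC.
var sample = new Date(Date.UTC(2010, 4, 5, 19, 37, 58));
var sameDay = dayFromParts(sample) === dayFromModulo(sample);
```

Both helpers return the timestamp for midnight UTC on the sample's day. The arithmetic form relies on non-negative timestamps, so it would round the wrong way for dates before 1970.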
Reduce Step
The reduce function is trivial, as it simply performs a count:
reduce = function(key, values) {
  var count = 0;
  values.forEach(function(v) { count += v.count; });
  return {count: count};
}
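One detail worth checking is that the reduce function returns a value shaped like the values it receives. MongoDB may call reduce more than once for the same key, feeding earlier results back in. The sketch below, in plain JavaScript, mirrors the recipe's reduce logic under the hypothetical name reduceCounts and uses a made-up key:

```javascript
// Same logic as the recipe's reduce function: sum the count fields.
function reduceCounts(key, values) {
  var count = 0;
  values.forEach(function (v) { count += v.count; });
  return { count: count };
}

// Three emitted values for one hypothetical key.
var emitted = [{ count: 1 }, { count: 1 }, { count: 1 }];

// Reduce everything in a single call...
var allAtOnce = reduceCounts('some-key', emitted);

// ...or reduce a partial batch first, then feed that result back in.
var partial = reduceCounts('some-key', emitted.slice(0, 2));
var inTwoSteps = reduceCounts('some-key', [partial, emitted[2]]);
```

Both routes give the same total, which is the property that makes the function safe to use as a reducer.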
Run the command
We run the mapReduce command, storing the output in the pageview_results collection:
db.pageviews.mapReduce(map, reduce, {out: 'pageview_results'});
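To make the first pass concrete, the following sketch imitates it in plain JavaScript on a small in-memory array. It is not how MongoDB executes map-reduce. Here runMapReduce is a hypothetical helper, emit is passed in as an argument rather than being a global, the sample documents are made up, and user ids are plain strings rather than ObjectIds:

```javascript
// Hypothetical stand-in for mapReduce over an in-memory array.
function runMapReduce(docs, map, reduce) {
  var groups = {};
  docs.forEach(function (doc) {
    // Bind `this` to the document, as MongoDB does for map functions.
    map.call(doc, function emit(key, value) {
      var k = JSON.stringify(key);
      if (!groups[k]) { groups[k] = []; }
      groups[k].push(value);
    });
  });
  // One output document per distinct key, shaped as {_id, value}.
  return Object.keys(groups).map(function (k) {
    var key = JSON.parse(k);
    return { _id: key, value: reduce(key, groups[k]) };
  });
}

// First-pass map: key on (day, user), emit a count of one per pageview.
function firstPassMap(emit) {
  var d = this.date;
  var day = Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate());
  emit({ day: day, user_id: this.user_id }, { count: 1 });
}

// Reduce: sum the counts for one key.
function sumCounts(key, values) {
  var count = 0;
  values.forEach(function (v) { count += v.count; });
  return { count: count };
}

// Made-up sample data: two users, two days.
var pageviews = [
  { url: '/photos', user_id: 'u1', date: new Date(Date.UTC(2010, 4, 5, 9, 0)) },
  { url: '/about', user_id: 'u1', date: new Date(Date.UTC(2010, 4, 5, 17, 30)) },
  { url: '/photos', user_id: 'u2', date: new Date(Date.UTC(2010, 4, 5, 12, 0)) },
  { url: '/photos', user_id: 'u1', date: new Date(Date.UTC(2010, 4, 6, 8, 0)) }
];

var firstPass = runMapReduce(pageviews, firstPassMap, sumCounts);
```

With this sample data, firstPass holds three documents, one per distinct user and day, and the entry for u1 on 5 May carries a count of 2.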
2. Second Pass
Map Step
Now that we have a preliminary set of results, we can do a second pass to count unique users by day. Here's the map function:
map = "function() { emit(this['_id']['day'], {count: 1}); }"
Because the first result set will store the emit key within an '_id' field, we have to reach into that object to get the date.
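The reason a plain count works here is that each first-pass result represents exactly one user on one day. The sketch below shows this in plain JavaScript on made-up documents shaped like map-reduce output, with the key under _id and the reduced value under value. The variable names and the string user ids are assumptions for illustration:

```javascript
var MAY_5 = Date.UTC(2010, 4, 5);
var MAY_6 = Date.UTC(2010, 4, 6);

// Made-up first-pass results: one document per (day, user) pair.
var firstPassOutput = [
  { _id: { day: MAY_5, user_id: 'u1' }, value: { count: 2 } },
  { _id: { day: MAY_5, user_id: 'u2' }, value: { count: 1 } },
  { _id: { day: MAY_6, user_id: 'u1' }, value: { count: 1 } }
];

// Counting documents per day is the same as counting unique users per day.
var uniqueUsersByDay = {};
firstPassOutput.forEach(function (doc) {
  var day = doc._id.day; // reach into _id to get the emitted day
  uniqueUsersByDay[day] = (uniqueUsersByDay[day] || 0) + 1;
});
```

Note that the per-user pageview totals in value.count are ignored on purpose: a user with two views on a day still contributes one unique visit.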
Reduce Step
It turns out that the same reduce function will work for the second pass; no need to rewrite.
Running the command
Now just run the mapReduce command on the result collection, and output to a new results collection:
db.pageview_results.mapReduce(map, reduce, {out: 'pageview_results_unique'});
Since we've specified that the output collection should be called pageview_results_unique, we can query that collection to see the results:
db.pageview_results_unique.find();
That's all there is to it!
3. Limiting the Operation
If our pageviews collection spans a long period of time, it might be prudent to run map-reduce over just a portion of the data. That can be achieved by passing a query selector to the map-reduce command. So, for instance, if we just wanted results from the past two weeks, we could run the first pass like this (with map set back to the first-pass map function, since the second pass reassigned it):
two_weeks_ago = new Date(Date.now() - 60 * 60 * 24 * 14 * 1000);
db.pageviews.mapReduce(map, reduce, {out: 'pageview_results', query: {date: {'$gt': two_weeks_ago}}});
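The cutoff arithmetic can be factored out and checked on its own. This is a minimal sketch in plain JavaScript. The helpers daysBefore and recentSelector are hypothetical names, and the fixed "now" value is made up so that the result is predictable:

```javascript
var MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical helper: the Date that lies `days` days before `now`.
function daysBefore(now, days) {
  return new Date(now.getTime() - days * MS_PER_DAY);
}

// Build the selector; only documents with date > cutoff satisfy it.
function recentSelector(now, days) {
  return { date: { '$gt': daysBefore(now, days) } };
}

// Made-up reference time: 19 May 2010, 12:00 UTC.
var now = new Date(Date.UTC(2010, 4, 19, 12, 0, 0));
var selector = recentSelector(now, 14);
```

The resulting selector has the same shape as the query option passed to mapReduce, with a cutoff of 5 May 2010, 12:00 UTC for this sample time.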
See Also
- The MongoDB docs on aggregation
- Map-Reduce Basics by Kyle Banker
- MapReduce: the Fanfiction by Kristina Chodorow

