Fork me on GitHub

MongoDB Cookbook MongoDB Cookbook

Counting Unique Items with Map-Reduce

Credit: Kyle Banker

Problem

You have a collection that stores pageviews by user, and you want to count the total number of unique user visits per day using map-reduce.

Each pageview document looks something like this:

{
    "url" : "http://example.com/photos",
    "user_id" : ObjectID('4be1c916e031933119d78b30'),
    "date": "Wed May 05 2010 15:37:58 GMT-0400 (EDT)"
}

The solution requires grouping the pageviews by day and then counting the total number of user visits and along with the number of unique visits for that day.

Solution

What's tricky about this situation is that it requires a two-pass map-reduce in order to scale well. The first pass involves grouping by date and user id. This allows us to group by user and day and returns, as a side effect, the number of pageviews per user per day.

1. First Pass

Map Step

The only tricky part about the map function is making sure that we emit on the day. Since we're storing a full date, we need to parse out just the year, month, and date, and then emit on that value:

map = function() {
  day = Date.UTC(this.date.getFullYear(), this.date.getMonth(), this.date.getDate());

  emit({day: day, user_id: this.user_id}, {count: 1});
}

If you want a more efficient date calculation, you can use this:

  day = (24 * 60 * 60) % this.date;
Reduce Step

The reduce function is trivial, as it simply performs a count:

reduce = "function(key, values) {
  var count = 0;

  values.forEach(function(v) {
    count += v['count'];
  });

  return {count: count};
}"
Run the command

We run the mapReduce command, storing the output in the pageview_results collection:

db.pageviews.mapReduce(map, reduce, {out: pageview_results});

2. Second Pass

Map Step

Now that we have a prelimiary set of results, we can do a second pass to count unique users by day. Here's the map function:

map = "function() {
  emit(this['_id']['day'], {count: 1});
}"

Because the first result set will store the emit key within an '_id' field, we have to reach into that object to get the date.

Reduce Step

It turns out that the same reduce function will work for the second pass; no need to rewrite.

Running the command

Now just run the mapReduce command on the result collection, and ouput to a new results collection.

db.pageview_results.mapReduce(map, reduce, {out: pageview_results_unique});

Since we've specified that the output collection should be called pageview_results_unique, we can query that collection to see the results:

db.pageview_results_unique.find();

That's all there is to it!

5. Limiting the Operation

If our pageviews collection spans a long period of time, it might be prudent to run map-reduce over just a portion of the data. That can be achieved by passing a query selector to the map-reduce command. So, for instance, if we just wanted results from the past two weeks, we could run:

two_weeks_ago = new Date(Date.now() - 60 * 60 * 24 * 14 * 1000);
db.pageviews.mapReduce(map, reduce,
  {out: pageview_results, query: {date: {'$gt': two_weeks_ago}}});

See Also

blog comments powered by Disqus