MongoDB Cookbook

The Random Attribute

Credit: Alberto Lerner, Dwight Merriman, and Aaron Staple

Problem

Consider a scenario where you'd like to issue a query and pick a random document from the result.

photos.find({"author":"John Doe"})

Any 'John Doe' would do. But you'd like a different one each time you run the query. Sure, you can always fetch the resulting documents and randomly pick one of them. But in that case the query would run in its entirety and all the results would be transferred to your app.
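To see why that's wasteful, here's roughly what the naive approach looks like in the shell (a sketch, assuming the collection is reachable as db.photos):

> all  = db.photos.find( { "author" : "John Doe" } ).toArray()  // every match is shipped to the client
> pick = all[ Math.floor( Math.random() * all.length ) ]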

Now, consider another scenario where you'd like to run a map/reduce but would be happy to trade result accuracy for performance. That is, you'd be happy to use a sample (of a given percentage) of your data. You don't really know the number of documents involved, but they are numerous.

Solution

We can add a special attribute to each document that we'll call here the 'random attribute,' or RA. The RA needs to be, well, random. If you use, for instance, a number generated by Math.random() in JavaScript, that would work.

> db.docs.drop()
> db.docs.save( { key : 1, ..., random : Math.random() } )
> db.docs.save( { key : 1, ..., random : Math.random() } )
> db.docs.save( { key : 2, ..., random : Math.random() } )
... many more insertions with 'key : 2' ...
> db.docs.save( { key : 2, ..., random : Math.random() } )
...

If you use Math.random(), the random attribute in your documents could have any value from 0 to 1:

...
{ "_id" : ObjectId("4bfa81198cf5fc1002a42b91"), "key" : 2, "random" : 0.23578915913357468}
...
{ "_id" : ObjectId("4bfa81198cf5fc1002a42b93"), "key" : 2, "random" : 0.8983254666113549 }
...

The solution also requires the RA to be indexed in a certain way; let's discuss this using an example.

1. Picking a random document from the result

If you're just interested in one document, you'd formulate your original query and add a filter over the RA. The idea is to find which of the result documents has the RA closest to a random number you draw at query time.

The code below shows one way to do it. You'd pick a random number on the fly -- using the same method you used to populate the RA -- and test your number against the stored attribute.

> rand = Math.random()
> cmp  = Math.random()  // coin flip: decides which direction to try first
> if ( cmp < 0.5 ) {
>   result = db.docs.findOne( { key : 2, random : { $gte : rand } } )
>   if ( result == null )
>     result = db.docs.findOne( { key : 2, random : { $lte : rand } } )
> } else {
>   result = db.docs.findOne( { key : 2, random : { $lte : rand } } )
>   if ( result == null )
>     result = db.docs.findOne( { key : 2, random : { $gte : rand } } )
> }

Note that we're not going for equality alone because the chance of an exact match is negligible. Instead we try either '$gte' or '$lte' first, choosing the direction with equal probability, knowing that the first search may return no result even though matching documents exist (the drawn number may lie beyond every stored RA in that direction). For that reason, an empty result must be followed by a search in the opposite direction.

The final -- but important -- detail about this query is that the search criteria and the RA must be indexed together. Please see the Caveats section for further details.

> db.docs.ensureIndex( { key : 1, random : 1 } )
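If you need this lookup in more than one place, it can be wrapped in a small shell function. The sketch below is just a convenience layer over the pattern above; the function name findRandom, the hard-coded collection db.docs, and the field name 'random' are our own assumptions, not part of the original recipe:

> function findRandom( criteria ) {
    var rand = Math.random();
    // with equal probability, try one direction first and the other as a fallback
    var ops = Math.random() < 0.5 ? [ "$gte", "$lte" ] : [ "$lte", "$gte" ];
    for ( var i = 0; i < 2; i++ ) {
        criteria.random = {};
        criteria.random[ ops[i] ] = rand;  // e.g. { random : { $gte : rand } }
        var result = db.docs.findOne( criteria );
        if ( result != null ) { delete criteria.random; return result; }
    }
    delete criteria.random;  // leave the caller's criteria object untouched
    return null;             // no document matches the criteria at all
  }
> findRandom( { key : 2 } )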

2. Map/reduce on a sample of the data

If your collection is large and the computation you want to run can operate on a sample instead, i.e., tolerating a less accurate result, you can apply an early filter based on the RA so that only sampled documents reach the mapping phase.

> db.docs.drop()
> for (i=0; i < 10000; i++) { db.docs.save( { key : i % 10, random : Math.random() } ) }
> m = function() { emit(this.key, 1); }
> r = function(k, vals) {
  var sum=0;
  for (var i in vals) sum += vals[i];
  return sum;
 }
> sample = 0.1
> res = db.docs.mapReduce(m, r, { query : { key : 2, random : { $lte : sample } } })

Mongo will issue the query over all the data, but the mapper will be called only for the sampled documents, here 10% of them. In the example above, the running time should be significantly reduced compared with the "full" query. The count for 'key : 2' with a 10% sample was 85, where the perfect result would have been 100.

You could improve accuracy by increasing the sample size, of course.
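Relatedly, since the reduced values are counts over the sample, you can scale them back up to estimate counts for the full collection. Here's a minimal sketch using map/reduce's finalize and scope options (scope makes the client-side sample fraction visible to the server-side function):

> f = function( key, reducedValue ) { return reducedValue / sample; }
> res = db.docs.mapReduce( m, r, { query : { key : 2, random : { $lte : sample } },
                                   scope : { sample : sample },
                                   finalize : f } )

With the numbers above, the 85 sampled documents for 'key : 2' would scale to an estimate of 850, against a true count of 1000.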

Here's a sample-based count for all the keys. This should give you an idea about the speed/accuracy trade-off.

> res = db.docs.mapReduce(m, r, { query : { random : { $lte : 0.1 } } })
...
> db[res.result].find()
{ "_id" : 0, "value" : 93 }
{ "_id" : 1, "value" : 82 }
{ "_id" : 2, "value" : 85 }
{ "_id" : 3, "value" : 92 }
{ "_id" : 4, "value" : 114 }
{ "_id" : 5, "value" : 104 }
{ "_id" : 6, "value" : 100 }
{ "_id" : 7, "value" : 90 }
{ "_id" : 8, "value" : 104 }
{ "_id" : 9, "value" : 103 }

In the map/reduce case, the RA doesn't necessarily need to be indexed the way it is in case 1 above.
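That said, if the sampling query itself becomes a bottleneck on a large collection, an index on the RA alone should let the query scan only the sampled range rather than the whole collection:

> db.docs.ensureIndex( { random : 1 } )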

Caveats

In the single-document case, the criteria that accompany the RA must be equality predicates, such as 'key : 2'; otherwise the compound index can't be used to seek directly to the drawn RA value. The map/reduce case has no such restriction.

The random attribute works best when the result set we're drawing from contains a large number of documents. Consider, for instance, a query with few results:

> db.docs.save( { key : 1, random : Math.random() } )
> db.docs.save( { key : 1, random : Math.random() } )
> db.docs.find()
{ "_id" : ObjectId("4bfa9585cffdb770c08e7cc9"), "key" : 1, "random" : 0.9988383572723725 }
{ "_id" : ObjectId("4bfa9586cffdb770c08e7cca"), "key" : 1, "random" : 0.8338006548262672 }

With so few documents, the RA cannot be considered uniformly distributed between 0 and 1 for that key. The net effect is that some documents in the result will appear much more often than others when a random document matching the search criteria 'key : 1' is requested.
