Sunday, 4 December 2011

A simple guide to finding distinct array values in a MongoDB collection

So you want to find the unique values within an array field of a document in a collection? A reasonable request.
In ANSI SQL you would reach for DISTINCT, JOINs and GROUP BY, the stuff you're used to, but in MongoDB your best bet is mapreduce.
It might seem like hard work, and probably a little intimidating at first, but it is certainly worth it; mapreduce is an extremely powerful tool.

Set up the collection, define the map and reduce functions, and execute the mapreduce command, here restricted by a query to the document whose id is "1":

use itemCollection
db.itemCollection.insert({
    "id" : "1",
    "somefield" : "blah",
    "someotherfield" : "meh",
    "notanotherfield" : "feh",
    "items" : ["item0", "item1", "item2", "item2"]
})
db.itemCollection.insert({
    "id" : "2",
    "somefield" : "bleh",
    "someotherfield" : "geh",
    "notanotherfield" : "beh",
    "items" : ["item0", "item1", "item2", "item1", "item0"]
})
db.itemCollection.insert({
    "id" : "3",
    "somefield" : "bluh",
    "someotherfield" : "heh",
    "notanotherfield" : "jeh",
    "items" : ["item0", "item1", "item2", "item3", "item4"]
})
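var map = function() {
    // Skip documents that have no items array.
    if (!this.items) {
        return;
    }
    // Emit each array element as a key with a count of 1.
    for (var i = 0; i < this.items.length; i++) {
        emit(this.items[i], 1);
    }
};
var reduce = function(key, values) {
    // Sum the counts emitted for this key.
    var count = 0;
    for (var i = 0; i < values.length; i++) {
        count += values[i];
    }
    return count;
};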
db.runCommand( { mapreduce : "itemCollection", map : map , reduce : reduce , query : { "id" : "1"} , out : "items" } );
db.items.find().forEach(printjson);
{ "_id" : "item0", "value" : 1 }
{ "_id" : "item1", "value" : 1 }
{ "_id" : "item2", "value" : 2 }


But now you want to find the unique values of that array across every document in the collection. Run the same command without the query:

db.runCommand( { mapreduce : "itemCollection", map : map , reduce : reduce, out : "items" } );
db.items.find().forEach(printjson)
{ "_id" : "item0", "value" : 4 }
{ "_id" : "item1", "value" : 4 }
{ "_id" : "item2", "value" : 4 }
{ "_id" : "item3", "value" : 1 }
{ "_id" : "item4", "value" : 1 }

The map function in this example iterates over each document's items array and emits every element as a key with a value of 1.
The reduce function receives all the values emitted for a given key and sums them. The keys of the output collection are therefore the distinct array values, and each value is the number of times that element occurred.
If you're looking to find the distinct array elements for a single document, pass a query that matches only that document (here, { "id" : "1" }). For the entire collection, just leave the query out. *simples*
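
If you'd like to see the emit, group and sum flow without a running MongoDB server, here is a rough sketch in plain JavaScript. It reuses the three sample documents and the same map and reduce logic, but simulateMapReduce is a hypothetical helper made up for this illustration, not a MongoDB API, and emit is passed in by hand because there is no server to provide it. The query handling is also simplified to an equality check on id.

```javascript
// Sample documents from the post (only the fields this sketch needs).
const docs = [
  { id: "1", items: ["item0", "item1", "item2", "item2"] },
  { id: "2", items: ["item0", "item1", "item2", "item1", "item0"] },
  { id: "3", items: ["item0", "item1", "item2", "item3", "item4"] }
];

// Same logic as the post's map function. Assumption: emit is passed in
// as an argument here, whereas MongoDB provides it to the map function.
function map(doc, emit) {
  if (!doc.items) {
    return;
  }
  doc.items.forEach(function (item) {
    emit(item, 1);
  });
}

// Same logic as the post's reduce function: sum the counts for one key.
function reduce(key, values) {
  return values.reduce(function (sum, v) { return sum + v; }, 0);
}

// Hypothetical helper (not a MongoDB API): filter by an optional id query,
// run map over the matching documents, group emitted values by key,
// then reduce each group to a single count.
function simulateMapReduce(documents, query) {
  const groups = {};
  documents
    .filter(function (doc) { return !query || doc.id === query.id; })
    .forEach(function (doc) {
      map(doc, function (key, value) {
        (groups[key] = groups[key] || []).push(value);
      });
    });
  const result = {};
  Object.keys(groups).forEach(function (key) {
    result[key] = reduce(key, groups[key]);
  });
  return result;
}

const singleDoc = simulateMapReduce(docs, { id: "1" });
const wholeCollection = simulateMapReduce(docs);

console.log(singleDoc);
console.log(wholeCollection);
// The keys on their own are the distinct array values.
console.log(Object.keys(wholeCollection));
```

The counts it prints should match the two result sets shown above. One difference from the real thing: this sketch always calls reduce, while MongoDB does not call reduce for a key that has only a single emitted value. The totals come out the same either way.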
