Steven Haines

Devs are from Venus, Ops are from Mars, Big Data: MongoDB

If you’re just joining this column, it is one response to the gap between how development and operations view technology and measure their success: it is entirely possible for development and operations to be individually successful while the organization as a whole fails.

So, what can we do to better align development and operations so that they speak the same language and work toward the success of the organization as a whole? This article series addresses a portion of this problem by giving operations teams insight into how specific architecture and development decisions affect the day-to-day operational requirements of an application.

The current article series is reviewing Big Data and the various solutions that have been built to capture, manage, and analyze very large amounts of data. Unlike the relational databases of the past, Big Data is not one-size-fits-all, but rather individual solutions have been built that address specific problem domains.

The last handful of articles reviewed Hadoop, MapReduce, and HBase, which are open-source implementations of three core Google technologies: the Google File System, MapReduce, and Bigtable, respectively. This article moves away from the Google stack and focuses on other NoSQL document storage implementations, starting with MongoDB.

NoSQL is a blanket term that originated in 2009 to group together non-relational databases. The term is unfortunate because it categorizes these databases by what they do not support rather than by their strengths, but it has stuck, so we'll use it.

Introduction to MongoDB

MongoDB is a document storage engine designed from its inception to support web-scale quantities of data, up to petabytes. It is meant to be deployed horizontally across dozens, hundreds, or even thousands of machines, and it provides the infrastructure to quickly locate and optionally analyze that data.

MongoDB is a document store: it saves “documents”, which are sets of data in JSON format, into document “collections”. The data is unstructured in the sense that it does not follow a relational model and there are no strict rules defining what a document must look like, even within the same collection, but because the data is stored as JSON it is very readable. Let's consider a simple example, from an earlier article, that builds a blog post:

[Code listing: an example blog post document]

This blog post has the following components:

  • A unique ID
  • The user that posted the message
  • The message itself
  • A link to a picture that has a title and a URL
  • A list of comments
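Reconstructing a document from the components above, it might look like the following (the field names and values here are illustrative, not taken from the original listing):

```javascript
// A hypothetical blog post document with embedded sub-documents,
// mirroring the components listed above (field names are illustrative).
const post = {
  _id: "4dad0c6cf535f389c971823d",      // a unique ID
  user: "shaines",                      // the user that posted the message
  message: "Check out this picture!",   // the message itself
  picture: {                            // an embedded picture sub-document
    title: "Sunset",
    url: "http://example.com/sunset.jpg"
  },
  comments: [                           // an embedded list of comments
    { user: "mhaines", text: "Nice shot!" }
  ]
};

// The picture and comments are read directly off the post document.
const pictureTitle = post.picture.title;
const commentCount = post.comments.length;
```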

What is strange, from a relational database perspective, is that the sub-items (picture and comments) are embedded inside the document and are not links to documents stored in another collection.

The choice between embedding documents and storing the IDs of documents from other collections is yours, and it depends on your use case: if you expect to manage pictures separately, independent of any blog post, then you might want them to live in their own collection; but if you will only ever manage pictures as children of blog posts, it is perfectly acceptable to embed them inside posts.

Furthermore, as we'll see when we discuss document queries, embedding the picture information inside a post performs much better than referencing it in another collection and loading it with a second query.
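The two modeling options can be sketched as plain objects (the object shapes and names below are illustrative, not MongoDB API):

```javascript
// Option 1: embed the picture directly inside the post document.
const embeddedPost = {
  user: "shaines",
  picture: { title: "Sunset", url: "http://example.com/sunset.jpg" }
};

// Option 2: keep pictures in their own collection and reference them by ID.
const pictures = {
  pic1: { title: "Sunset", url: "http://example.com/sunset.jpg" }
};
const referencedPost = { user: "shaines", pictureId: "pic1" };

// Embedded: one read gets everything.
const viaEmbed = embeddedPost.picture.title;
// Referenced: a second lookup is required to resolve the picture.
const viaRef = pictures[referencedPost.pictureId].title;
```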

To contrast this with the relational world, let's review how we might manage this document in a relational database. Figure 1 shows an entity-relationship model (as a class diagram) for these concepts.

[Figure 1: Entity-relationship model of the blog post concepts, drawn as a class diagram]

This diagram can be summarized as follows:

  • The Feed is the central entity that ties everything together
  • A feed can have zero or more pictures, but a picture can only be associated with one feed
  • A feed can only have one author but an author can have multiple feeds
  • A feed can have many comments, but any individual comment can only be made against one feed
  • A comment can only have one author, but an author can have multiple comments

This is a normalized data model that removes duplicate data. For example, if someone were to change their username, that change would be made in one place and then all feeds and comments made by that user would be updated.

In the MongoDB model the data is denormalized: the username is duplicated across multiple feeds, so changing it requires changes to every feed the user wrote or commented on. Aside from changing data, however, denormalization has the benefit that all of the data is collocated in the same document, so no additional queries are needed to retrieve it.
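The fan-out cost of a username change can be sketched in a few lines of plain JavaScript (this illustrates the idea only; it is not MongoDB's update API, and the documents are made up):

```javascript
// Denormalized: the username is copied into every feed document,
// so renaming a user means rewriting every document that contains it.
const feeds = [
  { user: "shaines", message: "First post" },
  { user: "shaines", message: "Second post" },
  { user: "other",   message: "Unrelated post" }
];

let touched = 0;
for (const feed of feeds) {
  if (feed.user === "shaines") {
    feed.user = "steve";   // every matching document must be rewritten
    touched++;
  }
}
// One logical change ("rename shaines") touched two physical documents.
```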

This is a pretty simple set of relationships, but let's consider what we would have to do to retrieve a feed from this data model.

SELECT * FROM feed
INNER JOIN user ON feed.user_id = user.id
INNER JOIN picture ON feed.id = picture.feed_id
WHERE user.username = 'shaines';

This would then need to be followed by a second query to retrieve the comments:

SELECT * FROM comment WHERE feed_id = ?;

Now contrast this with querying MongoDB for the same information:

db.feed.find( {'user': 'shaines'} );

Because the data is denormalized and each document contains the user information, feed information, picture information, and comments, everything can be retrieved with one simple query.

MongoDB Queries

You might have noticed that we were able to retrieve records from MongoDB using a query syntax, rather than only retrieving records by primary key, which is how many NoSQL solutions work. MongoDB somewhat bridges the gap between NoSQL key/value stores and relational databases, which may be one of the reasons for its popularity.

MongoDB defines a query syntax in which you request documents that match a particular pattern. For example, in the query above we retrieved all feeds with a username value of 'shaines'. We could add additional constraints to further refine the results, such as searching for feeds by both username and date.

Additionally, MongoDB supports range and comparison searches through its comparison directives, such as $lt for "less than" and $gt for "greater than". In this way we could write a single query to retrieve people who live in a certain county and are between the ages of 18 and 25.
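The behavior of these comparison directives can be sketched with a tiny matcher. This is plain JavaScript mimicking the query semantics, not MongoDB itself, and the `people` collection and its fields are hypothetical:

```javascript
// A minimal matcher supporting equality plus the $lt/$gt directives.
function matches(doc, query) {
  return Object.entries(query).every(([field, cond]) => {
    if (cond !== null && typeof cond === "object") {
      // Comparison directives: every operator in the condition must hold.
      return Object.entries(cond).every(([op, value]) => {
        if (op === "$lt") return doc[field] < value;
        if (op === "$gt") return doc[field] > value;
        return false; // operators beyond $lt/$gt are out of scope here
      });
    }
    return doc[field] === cond; // plain equality match
  });
}

const people = [
  { name: "Alice", county: "Orange", age: 22 },
  { name: "Bob",   county: "Orange", age: 40 },
  { name: "Carol", county: "Kings",  age: 20 }
];

// Equivalent in spirit to:
//   db.people.find( {county: "Orange", age: {$gt: 18, $lt: 25}} );
const results = people.filter(p =>
  matches(p, { county: "Orange", age: { $gt: 18, $lt: 25 } })
);
```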

Finally, like relational databases, MongoDB supports the notion of indexes. If you will frequently query for documents based on a particular field, you might want to create an index on that field so that MongoDB can search it more quickly. An index is a separately maintained structure that stores all of the values of its indexed field.

For example, if we created an index on a person's age then the index file for the age field would have all ages and those ages would contain links to the documents for the people matching that age. So age 18 might link to a dozen documents representing people who are 18 years old. Indexes speed up searches, but also slow down inserts and updates because of the additional time to update the index.

If you are predominantly inserting data into a collection, you may not want to absorb the additional overhead of maintaining the index; but if you query more than you insert, and you query against a known subset of fields, then indexes make a lot of sense.
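The trade-off can be sketched with a map-based index in plain JavaScript. This is an illustration of the principle only; MongoDB's actual indexes are tree structures maintained by the server, and the sample documents are made up:

```javascript
// Sample documents in a "users" collection.
const users = [
  { name: "Steve",   age: 42 },
  { name: "Michael", age: 13 },
  { name: "Linda",   age: 42 }
];

// Build an index on "age": each age maps to the documents containing it.
// This is the insert-time cost: every insert must also update the index.
const ageIndex = new Map();
for (const user of users) {
  if (!ageIndex.has(user.age)) ageIndex.set(user.age, []);
  ageIndex.get(user.age).push(user);
}

// Query by age: a single index lookup instead of scanning every document.
const fortyTwos = ageIndex.get(42) || [];
```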

MongoDB Documents

As I mentioned above, MongoDB documents are JSON documents, stored in a binary format called BSON. If you would like to take MongoDB for a spin, you can install and run it by following the instructions in my earlier InformIT article. You can launch the client command shell by executing the “mongo” command from the “bin” folder.

Let's use a new example database to host our sample data by executing the following command:

> use example;
switched to db example

Now that we're in the example database, let's insert a couple of documents into the “users” collection:
> db.users.insert({firstName: "Steve", lastName: "Haines", age: 42});
> db.users.insert({firstName: "Michael", lastName: "Haines", age: 13});

This added two user documents, one for me and one for my son. We can retrieve all users in the collection by using the find() command:

> db.users.find();
{ "_id" : ObjectId("4dad0c6cf535f389c971823d"), "firstName" : "Steve", "lastName" : "Haines", "age" : 42 }
{ "_id" : ObjectId("4dad0c78f535f389c971823e"), "firstName" : "Michael", "lastName" : "Haines", "age" : 13 }

You'll notice that Mongo created a default primary key named “_id” for each document we inserted; the rest of each document contains the values we supplied.

Now let's execute some queries that filter down the results a little:

> db.users.find( {firstName: "Steve"} );
{ "_id" : ObjectId("4dad0c6cf535f389c971823d"), "firstName" : "Steve", "lastName" : "Haines", "age" : 42 }
> db.users.find( {lastName: "Haines"} );
{ "_id" : ObjectId("4dad0c6cf535f389c971823d"), "firstName" : "Steve", "lastName" : "Haines", "age" : 42 }
{ "_id" : ObjectId("4dad0c78f535f389c971823e"), "firstName" : "Michael", "lastName" : "Haines", "age" : 13 }
> db.users.find( {age: {"$gt":14}} );
{ "_id" : ObjectId("4dad0c6cf535f389c971823d"), "firstName" : "Steve", "lastName" : "Haines", "age" : 42 }
> db.users.find( {age: {"$lt":14}} );
{ "_id" : ObjectId("4dad0c78f535f389c971823e"), "firstName" : "Michael", "lastName" : "Haines", "age" : 13 }

The first query retrieved all documents with a first name of “Steve”, which was just me. The second query retrieved all documents with a last name of “Haines”, which included both my son and me. And the last two queries searched by age (greater than 14 and less than 14).

Now let's combine two search criteria into a single query:

> db.users.find( {lastName:"Haines", age: {"$lt":14}} );
{ "_id" : ObjectId("4dad0c78f535f389c971823e"), "firstName" : "Michael", "lastName" : "Haines", "age" : 13}

In this example we search for all documents with the last name of “Haines” and an age less than 14, which just returns my son.

Updates work similarly: you provide search criteria and use the $set directive to change whatever values you want. For example, in June, when my son turns 14, we could execute the following command:

> db.users.update( {firstName:"Michael", lastName:"Haines"}, {$set: {age:14}} );
> db.users.find();
{ "_id" : ObjectId("4dad0c6cf535f389c971823d"), "firstName" : "Steve", "lastName" : "Haines", "age" : 42 }
{ "_id" : ObjectId("4dad0c78f535f389c971823e"), "firstName" : "Michael", "lastName" : "Haines", "age" : 14}

We retrieved all documents with a first name of “Michael” and a last name of “Haines” and set the age on all of those documents to 14. I executed a subsequent find() so that you could see the updated document.

Finally, let's clean up our data by first removing me from the collection and then removing all records:

> db.users.remove( {firstName:"Steve", lastName:"Haines"} );
> db.users.find();
{ "_id" : ObjectId("4dad0c78f535f389c971823e"), "firstName" : "Michael", "lastName" : "Haines", "age" : 14}
> db.users.remove();
> db.users.find();

The remove() method removes documents from a collection. Specifying search criteria removes specific documents and passing nothing to the remove() method removes all documents in the collection.


Conclusion

This article provided a brief introduction to MongoDB, defined its role in a web-based application, and, through a few example operations and queries, hopefully demonstrated why developers like to use it.

MongoDB provides support for web-scale data and high performance, but also supports query capabilities similar to what you might find in a relational database. It does not support joins or other advanced SQL concepts and it likes things denormalized, but if you can restate your problem in its terms then you can get the best of both the SQL and NoSQL worlds: web-scale data storage with advanced search capabilities.

The next article will review how to deploy MongoDB to a production environment. A co-worker mentioned to me this week that MongoDB is great for small sets of data, but that once the amount of data grows substantially its performance suffers. My response is the subject of the next article: how do you properly deploy MongoDB to support web-scale data?