Thursday, July 11, 2013

MongoDB-based Cache Service

I've talked about using a REST web service to wrap a database and provide a managed repository service. This time, I'd like to discuss building a cache service on top of two convenient MongoDB features.

Before I start, I want to make it clear that the cache service here is not one that tries to push response times down to sub-milliseconds. It's for results you hesitate to fetch again and would rather store somewhere. Usually that's a response from a remote web service (for example, the Maps API from Google), or the result of an expensive SQL statement. You want to cache it not only because you don't want to wait a few seconds again, but also to save your usage quota, or to reduce the workload on a database. In that case, you'll be happy if the response time drops from X seconds to X ms.

Depending on whether it's a single-node or clustered environment, and on the size and type (text or binary) of the cached data, there are quite a few products that can fulfill the task. But when we ask whether the solution can scale up and scale out, the answer becomes less clear. Consider configuring something like Memcached in a 4-node cluster and you'll get the idea. Basically you have to explicitly tell each node, "you are in a group, so you all have a shared memory or disk."

How about sharing nothing? As long as a node knows about the cache, it doesn't matter how many other nodes also know about it; they all read and write the same data, keyed the same way of course. The cache service can then be reduced to a couple of HTTP methods (POST and GET) backed by MongoDB. But why MongoDB?
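
To make that concrete, here is a minimal sketch of such a service, assuming Node.js with Express and the official MongoDB driver; the database name cachedb, the collection name mycoll, the port, and the /cache/:key route are all illustrative choices, not anything fixed.

const express = require('express');
const { MongoClient } = require('mongodb');

const app = express();
app.use(express.json());

let cache; // the MongoDB collection acting as the shared cache

// POST stores a value; a plain insert works even on a capped collection,
// where in-place updates that grow a document are not allowed
app.post('/cache/:key', async (req, res) => {
    await cache.insertOne({ key: req.params.key, value: req.body, created: new Date() });
    res.sendStatus(204);
});

// GET returns the newest entry for the key, or 404 on a cache miss
app.get('/cache/:key', async (req, res) => {
    const doc = await cache.findOne({ key: req.params.key }, { sort: { created: -1 } });
    if (!doc) return res.sendStatus(404);
    res.json(doc.value);
});

MongoClient.connect('mongodb://localhost:27017').then((client) => {
    cache = client.db('cachedb').collection('mycoll');
    app.listen(8080);
});

Every node runs the same code against the same MongoDB deployment, so there is nothing cluster-specific to configure on the cache nodes themselves.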

One aspect of a cache is its capacity, in bytes or in number of objects. In MongoDB, you can use capped collections to achieve this. You can create a capped collection using
db.createCollection("mycoll", {capped:true, size:100000})
or convert a collection to capped one using
db.runCommand({"convertToCapped": "mycoll", size: 100000});
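
If you'd rather cap by object count than by bytes, createCollection also accepts a max option alongside size (the numbers here are just examples):

db.createCollection("mycoll", {capped: true, size: 100000, max: 5000})

Note that size is still required even when max is given, and it takes precedence: documents are evicted as soon as either limit is reached.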

The value of the size parameter is in bytes. If what you actually want to know is how many documents the capped collection can hold, you need an estimate of the average document size. If the collection already has a fair number of documents, you can run
db.mycoll.stats()
and check the avgObjSize value before converting it to a capped collection. Here is an example:
{
    "ns" : "mydb.mycoll",
    "count" : 7739,
    "size" : 42885120,
    "avgObjSize" : 5541.429125209976,
    "storageSize" : 65724416,
    "numExtents" : 8,
    "nindexes" : 1,
    "lastExtentSize" : 23224320,
    "paddingFactor" : 1,
    "systemFlags" : 1,
    "userFlags" : 0,
    "totalIndexSize" : 228928,
    "indexSizes" : {
        "_id_" : 228928
    },
    "ok" : 1
}
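
From those numbers you can estimate the document capacity directly; a quick shell sketch, reusing the 100000-byte cap from the createCollection example above:

// documents that fit ≈ capped size in bytes / average document size
var stats = db.mycoll.stats();
print(Math.floor(100000 / stats.avgObjSize));  // ≈ 18 with the stats above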


If you run stats() on a capped collection, you'll see two more lines in the result; max is the document-count limit, and the huge default below simply means no explicit max was set:
    "capped" : true,
    "max" : NumberLong("9223372036854775807"),

Another common caching feature is Time To Live (TTL), which specifies when a cached item should be invalidated. In MongoDB, you can create an index on a date field and pass the expireAfterSeconds option to set the TTL for a collection.
db.mycoll.ensureIndex( { "created": 1 }, { expireAfterSeconds: 3600 } )
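
One thing to watch out for: the TTL monitor only expires documents whose indexed field holds an actual BSON date. A quick illustration in the shell (the key and value fields are just placeholders):

// expires roughly an hour after "created"
db.mycoll.insert({ key: "maps:route:42", value: "cached payload", created: new Date() })

// never expires: the string is not a BSON date, so the TTL index ignores it
db.mycoll.insert({ key: "oops", value: "cached payload", created: "2013-07-11T00:00:00Z" })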

Note, however, that the background task that deletes expired documents runs only once every 60 seconds, so don't expect this feature to be much more precise than that. Also, you can't make a collection both size- and time-based: TTL indexes are not supported on capped collections (who's going to need both anyway?).

So the next time you design a size-based or time-based cache, would you consider MongoDB?