Wednesday, September 18, 2013

Query logging and profiling in MongoDB

In my previous blog post I mentioned that the 1.5.0 release of the MongoDB driver for PHP has improved notification support using the PHP Notification API, and showed off a fairly useless (but fun) progress bar.
Now its time for a little bit more practical examples using this system: Query logging and profiling.

Those of you who clicked through to the github Pull Request may have noticed the long long description of the API and noticed all the different "MONGO_STREAM_NOTIFY_LOG_xxx" constants it included..

When instantiating MongoClient you can pass in "Driver Options" using the 3rd argument. Currently there is only one available Driver Option: "context".

The context key takes a Stream Context, and is assigned to all of the underlaying connections the driver creates - which means you can set any of the Socket Context and SSL Context options and it will Just Work (maaan.. that saved me a ton of work.. Not needing to implement any of the SSL things myself!).
Since its just a normal Stream Context, the lesser known Context Parameters will also Just Work, but what does that mean?

Well, PHP has a trivial notifications system when operating on streams. The out-of-the-box support is (incremental) progress updates when reading and writing to streams, which I covered in my previous blog post. The MongoDB driver however extends this support with a lot of other things.

Keep in mind I haven't documented any of this just yet as its all still experimental.

OK. Lets check this out.

(The entire script is available as a Gist on Github).
class MyStreamNotifications {
    public $reqs = array();

    /* Our callback function
     * The prototype is define by the PHP Stream Notification system:
    function update($notification_code, $severity, $message, $message_code, $bytes_transferred, $bytes_max) {
        switch($notification_code) {
            /* For now, lets focus on the TYPE_LOG, and ignore the others */
                /* This notification only uses two arguments, so we can ignore the rest */
                return $this->logMessage($message_code, $message);


    /* Our logging function
     * The $code describes the operation being executed,
     * and $message is a json serialized array, packed with information
     * Note: If ext/json is not available, $message with be PHP serialize()d instead
    function logMessage($code, $message) {
        $data = json_decode($message, true);

        switch($code) {
            /* Write operations are a one way street in MongoDB, so these don't end with a
             * When using WriteConcern (defaults to 1), and additional query (command) is
             * sent to check for errors, called getLastError*/
                /* We only care about the query and response headers, for now */

            /* We´ve got a query */
                /* Retrieve the Wire protocol Request ID, so we can match it with the
                 * Response headers once the query as executed */
                $reqid = $data["server"]["request_id"];

                /* Log all the metadata we have about the query, and add the start time */
                $this->reqs[$reqid] = $data;
                $this->reqs[$reqid]["start"] = microtime(true);

            /* The query has executed on the server, and now we are reading the
             * response headers */
                // Get the Request ID this respose is for
                $id = $data["server"]["request_id"];

                /* The driver runs some commands internally, such as isMaster() to
                 * sanitycheck the connection to the server, these commands currently
                 * don't trigger the notification system - that may/will change in the
                 * future though */
                if (!isset($this->reqs[$id])) {

                // Fetch the request info
                $req = $this->reqs[$id];
                // End the timer
                $end = microtime(true);

                // Add the ending time, and how long it took
                $req["ended"] = $end;
                $req["total"] = $end - $req["start"];

                $this->reqs[$id] = $req;

                // Unknown log message code, ignore

    function getRequests() {
        return $this->reqs;
Thats a good chunk of code, just to profile queries.. But bare with me for a moment.

There is a lot of things happening right there, but hopefully the inline comments make sense of it all.
The update() method is the actual callback function (which we will be registering a little later).
It has a lot of arguments, and the $bytes_transferred and $bytes_max aren't even relevant in most scenarios, but ohhwell. Nothing we can do about that as this is the interface PHP provides us with.

The $notification_code is the "main argument" in the callback function, it tells us what sort of notification it is, and there are a handful of different notifications that can happen.
Each notification then has its own set of "$message_code", in this case, one of the MONGO_STREAM_NOTIFY_LOG_xxx constants.
Since we are only interested in logging and profiling the queries we ignore all the different MONGO_STREAM_NOTIFY_LOG_xxx (they are only there for completeness - in case you want to extend the example and experiment with them later), and only catch the MONGO_STREAM_NOTIFY_LOG_QUERY and MONGO_STREAM_NOTIFY_LOG_RESPONSE_HEADER.

Moving on..
/* Some Helper Functions */

/* Split a Server Hash to extract the server name from it */
function getServerName($hash) {
    list($server, $replicaset, $dbname, $pid) = explode(";", $hash);

    return $server;

/* Resolve the server type to a readable string */
function getServerType($type) {
    switch($type) {
        // FIXME: Do we not export these as constants?
        case 0x01: return "STANDALONE";
        case 0x02: return "PRIMARY";
        case 0x04: return "SECONDARY";
        case 0x10: return "MONGOS";

/* Resolve Cursor (Query) Options to something meaningful */
function getCursorOptions($opts) {
    // FIXME: Do we not export these as constants?
    if (($opts & 0x04) == 0x04) {
        return "slaveOkay";
    return "Primary Only";
Here we define some helper functions, and actually few FIXMEs for myself so you don't have to remember these magic numbers :)
Other then that, nothing really important.
Next up, boilerplate setting to connect to a ReplicaSet.
/* Boilerplate stuff to setup a ReplicaSet connection */
// My replicaset seed list
$seeds = "toddler.local:30200,toddler.local:30201";

// MongoDB Connection options
$moptions = array(
    // Don't wait to long on servers that are down
    "connectTimeoutMS" => 300,
    // Read from secondaries, if available
    "readPreference" => MongoClient::RP_SECONDARY_PREFERRED,
    // My local ReplicaSet name
    "replicaSet" => "REPLICASET",
    // Don't wait more then a second to confirm Replication
    "wTimeoutMS" => 1000,
Again, that wasn't fun... Lets take a look at some fun things next!
/* Registering a stream context! This is the fun part! */

// Create an instance of our callback object
$mn = new MyStreamNotifications;

// No Stream Context Options, for now
$coptions = array();
// Configure the Notification Stream Context Parameter to call MyStreamNotifications->update()
$cparams  = array("notification" => array($mn, "update"));

// Create the Stream Context
$ctx = stream_context_create($coptions, $cparams);

$doptions = array("context" => $ctx);

// Instanciate a MongoClient with all our options
$mc = new MongoClient($seeds, $moptions, $doptions);
If you have never played around with PHP Stream Contexts, this may look a bit alien..
The stream_context_create() function takes two arguments, context options and parameters.
There is a very subtle difference between those two, and we do support both (yes yes, not documented.. :) still highly experimental), but for now, we are using the notification parameter.

Now that we have setup everything we need and have connected to MongoDB, configured just the way we like it, its time to run few commands and queries to check this thing out
// Query the example collection in the test database
// you should probably change this to something meaningful
$collection = $mc->selectCollection("test", "example");

// Random search criteria, you should change this to something meaningful
$criteria = array("search" => "criteria");
// Add some sorting & limit for the fun of it
$cursor = $collection->find($criteria)->skip(13)->limit(5)->sort(array("_id" => 1));

// Count on a secondary (we constructed the MongoClient with SECONDARY_PREFERRED)
$count = $cursor->count();
echo "We have $count results in total\n";

// Fetch the data from a primary

/* The query isn't execute until we start iterating over it.. */
foreach($cursor as $doc) {
When you are trying this out for yourself, make sure you have establish a connection to your MongoDB instance (modifying the $seeds and $moptions["replicaSet"] variables up there), and modify the database+collection name, along with the query criteria.

Now that we have executed some things, our $mn->reqs array has now some interesting info its time to print out
// Fetch our requests from our logger
$reqs = $mn->getRequests();
foreach($reqs as $request) {
    printf("Hit %s (which is a %s, I asked for %s) with:\n%s\n",
        json_encode($request["query"], JSON_PRETTY_PRINT)
    printf("Skipping the first %d matches, limiting the results to %d documents..\n",
    printf("Server execution time+network latency: %.8f sec\n\n", $request["total"]);
Putting it all together, and running the script should then print out something like:
We have 205 results in total
Hit localhost:30201 (which is a SECONDARY, I asked for slaveOkay) with:
    "count": "example",
    "query": {
        "search": "criteria"
Skipping the first 0 matches, limiting the results to -1 documents..
Server execution time+network latency: 0.00042319 sec

Hit localhost:30200 (which is a PRIMARY, I asked for Primary Only) with:
    "$query": {
        "search": "criteria"
    "$orderby": {
        "_id": 1
Skipping the first 13 matches, limiting the results to 5 documents..
Server execution time+network latency: 0.00045013 sec

You may want to var_dump() the $request variable to see what other fun things it contains to play with, hopefully I'll have it all properly documented by the time we release 1.5.0 version of the driver.
As you can see, the driver needs to massage the query a little bit to add things like orderby clauses, where as "skip" and "limit" are part of the wire protocol and therefore not part of the actual query syntax.
If you have the MongoDB profiler enabled, you could then lookup how long the query execution took on the server, subtract the time we got and you have the time it took us to read and write over the network.

Another related feature is the $comment operator that "tags" a query, so its easier to spot in the profiler. You could use this to for example group your queries ("person model", "retrieving comments", "product query", ...).

Although this is a good chunk of code, I hope it will be useful, especially for applications that require thorough auditing of its queries. This should also be easily integrable into most "Profiler/Developer toolbar" that the various frameworks provide.

Since this is all still highly experimental, there is still time to change some things, so I wonder.. what do you think?


Tuesday, September 17, 2013

PHP Stream Notifications ❤ MongoDB

PHP Streams have several pretty nifty features that most people don't really know about; Filters, Wrappers, Context, and Notifications. Documenting these is a bit difficult, and getting the user to discover these features is even more problematic, as these things usually live outside of the normal path (function reference).
Maybe I'll blog about these things in the future, but for now I want to talk about the Stream (context) Notifications - or more specifically; Stream Notifications in the MongoDB extension for PHP.

The Stream Notifications are essentially pretty simple: when a stream does something, it notifies something (a callback function) that it did something (notification codes, importance, message code, ..).
Since this feature is a little neglected and people don't seem to know about it, this "something happened" isn't really a lot of things. When PHP reads from the http stream it will tell you if it came across a "redirect header", "mime-type", and, when available, the content size. Simple things.
This is actually pretty powerful and allows you to do some work, even when file_get_contents() a large large file, so it doesn't have to be a blocking operation anymore.
An example of a simple, yet pretty cool, script would be a wget/curl style download application with a progress bar (we actually use that example to download pear things when you install PHP and don't have wget or fetch).

Enter MongoDB for PHP.
As of pecl/mongo 1.4.0 we have switched out our homegrown sockets library in favor of using the native PHP streams. This gives us several advantages, such as greater portability, automatic support for SSL, and memory reporting is now more inline with reality, as everything is mapped using the PHP allocation functions.

Now, since we are using the PHP streams.. a pretty cool thing happens: We can now play around with the notification API and Stream Contexts.
A highly experimental (undocumented) API was included in the 1.4.0 release, and now as we are prepping for 1.5.0, I've spent more time on it, adding things that are "useless, until you play with it".. Such as progress reporting for read/write.. Wait what?

Yeah. OMG LOL For realz.
I created a little example, called spindle.php, which queries the enron dataset, and prints out a progress bar for all socket read and writes, along with meta information like the actual command and query being sent - and even some information about the actual wireprotocol operation, in case you are interested. Writing the queries and reading the data is too fast on my laptop so I had to add usleep() into the code to see the progressbar spinning.. but its still fun.

I can't really think of a real practical use case for reporting socket read/write progress updates.. But its there. Hopefully someone can come up with some cool things using it :)

Note: The Pull Request hasn't been merged yet, and the API is still considered as experimental, so you'll have to use my improved-notify-support branch for now if you want to play with it.

Progress reporting obviously isn't the only thing the new Notification API does, I'll blog about the more practical notifications later :)


Friday, September 6, 2013

New designs floating around

Since 2008 there have been numerous efforts to create a new design for, all of which have failed - so far.
We've never come as close as two years ago, when the "beta mode" option was added to our website, but we never really got around to finish it.
The "beta design" has even received a lot of makeover compared to what is "beta mode" now.
To make things a little bit more awesome, there is also a new branch called "responsive" which has a lot of changes in it too, especially for manual pages.

Hopefully, one day, we'll actually finish one of these and flip the switch forever.

In the meanwhile, I'm working on retiring my little baby, PhD, as I want to encourage more people to get involved with the overall project, especially the neglected parts like documentations.
It doesn't help anyone to be greeted with a "ohh, btw, you have to have a doctorate to be able to work on the docs" like "almost" is the case with our usage of Docbook - not to mention the fact DocBook isn't exactly the easiest markup format to learn.

I've been a huge advocate of highly semantic markup, and never dreamed that I would want to get rid of Docbook for our documentation.. But now I finally understand.
It must be *simple* to contribute.

Although I have optimised our tooling from taking over a week down to roughly 3minutes, it still isn't good enough. It needs to instant. And we can't reach that point with out current markup. We need to change. Make it simpler. Attract new people. Make it fun to contribute and help out.

To that end I've started working on a tool I call "WTFM", which is a Markdownish renderer..
The goal is to keep some of our smart trickery and simplifications we do in Docbook, just without that massive amount of markup and confusion. There shouldn't be any need to "learn a new format". Just write the docs, click save and you see it rendered instantly. No weirdo wiki markup noone understands, or hundreds of rules to obey.

I haven't exactly finalised the tool, or the actual "non-format format" yet, but the current (pre-alpha has-a-lot-of-bugs) proof-of-concept is running - and to stay true to our "lets make a new design" goals.. It has completely different markup rendering, so I had to create a new CSS for :)

What is your favorite layout so far? You think you can help us finish any of these?
Shamefully enough, getting the website running locally is a huge pain in the ***, but these instructions should take away most of the pain...

Hope to hear from you!