
Why Flowdock migrated from Cassandra to MongoDB

Otto Hilska July 26th, 2010

Flowdock is a modern web-based team messenger. All software developers should be using it instead of their Campfires, Skype Chats, IRCs, etc. because it better supports their actual workflow.

Last weekend we completed a transition from Flowdock’s database of choice, Cassandra, to another NoSQL alternative, MongoDB. Since our technology stack has always generated some interest, I’ll now try to justify our decision in public.

Some of our users might remember this:

[Twitter screenshot: “having some database problems”]

At some point we started to have stability issues with Cassandra. All nodes would go into an infinite loop, running GC and trying to compact the data files, and occasionally falling off the cluster. We were unable to solve the problem; restarting and then compacting a node usually settled it down, but only for a while. Other people had reported similar problems. For the last couple of weeks, our Cassandra nodes ate all the resources they were given, slowing down Flowdock.

This was not the first time we had run into problems because of our bleeding-edge database choice. When upgrading from 0.4 to 0.5, we had to shut down the cluster, only to find out that it hadn’t flushed everything to disk (even though we explicitly flushed it, as instructed). As a result, we lost a couple of minutes of discussions, and our custom-built indices were miserably out of date and needed to be rebuilt. I think it was 4 AM when we finally got to leave the office.

The NoSQL scene has evolved since we made our original decision to go with Cassandra. MongoDB is changing rapidly, and the latest addition of auto-sharding and replica sets made it a compelling alternative to Cassandra. So I decided to give it a try.

It took me a day to write a conversion script for our data. Within a week or so we were able to run Flowdock purely on MongoDB. It was tested internally for a couple of weeks before it was deployed to production.

Now that we have made the change, I’m happy to see that, in addition to the performance and reliability characteristics, we got some benefits that are well known from most databases:

  1. Smart (multikey) indices. Manually maintaining indices was always tedious, and MongoDB can index everything we need out-of-the-box. For example, our messages have tags, implying a document format like this:
    { content: "Write a blog post about #mongodb.",
      workspace: 'myflow',
      tags: ["mongodb", "todo", "@Otto"] }
    

    Now, when looking for my own tasks, the Flowdock backend only needs to run this query:

    db.messages.find({
      workspace: 'myflow',
      tags: { $all: ["todo", "@Otto"] }
    })
    
  2. Queries. No matter how simple your data model is, every once in a while you need to perform a query that you didn’t plan in advance. MongoDB lets you construct complex queries directly from the console, pretty much like an SQL database. Without a matching index it will perform a sequential scan, which is still much faster and more convenient than processing millions of rows manually on the client side.
  3. Map-Reduce. It’s great for stuff like analytics. MongoDB’s Map-Reduce support is not perfect, but at least it’s easy to use.
  4. GridFS makes storing our files very easy. The storage capabilities expand together with the rest of our MongoDB cluster.
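
To make the multikey index idea from point 1 concrete, here is a minimal in-memory sketch in plain JavaScript. It is a conceptual model only, not how MongoDB implements its indices: the helper names `buildTagIndex` and `findAllTags` are hypothetical, and the second and third sample messages are made up for illustration. The point it shows is that every element of the `tags` array gets its own index entry, so a query for several tags can be answered from the index alone.

```javascript
// Conceptual model of a multikey index; NOT MongoDB's actual implementation.
// buildTagIndex and findAllTags are hypothetical helper names.

const messages = [
  { _id: 1, workspace: 'myflow', content: 'Write a blog post about #mongodb.', tags: ['mongodb', 'todo', '@Otto'] },
  // The two documents below are invented sample data.
  { _id: 2, workspace: 'myflow', content: 'Deploy the new backend.', tags: ['todo', '@Someone'] },
  { _id: 3, workspace: 'other', content: 'Review the migration script.', tags: ['todo', '@Otto'] },
];

// One index entry per array element: (workspace, tag) -> set of message ids.
function buildTagIndex(docs) {
  const index = new Map();
  for (const doc of docs) {
    for (const tag of doc.tags) {
      const key = `${doc.workspace}\u0000${tag}`;
      if (!index.has(key)) index.set(key, new Set());
      index.get(key).add(doc._id);
    }
  }
  return index;
}

// Ids of messages in the workspace that carry every requested tag,
// computed by intersecting the per-tag id sets.
function findAllTags(index, workspace, tags) {
  const sets = tags.map((tag) => index.get(`${workspace}\u0000${tag}`) || new Set());
  if (sets.length === 0) return [];
  const [first, ...rest] = sets;
  return [...first].filter((id) => rest.every((s) => s.has(id)));
}

const tagIndex = buildTagIndex(messages);
const myTasks = findAllTags(tagIndex, 'myflow', ['todo', '@Otto']);
console.log(myTasks); // ids of matching messages
```

With Cassandra we had to maintain a structure like `tagIndex` ourselves on every write; the benefit described above is that the database now does this bookkeeping for us.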

We have faced only some minor limitations:

  1. We found a bug in JSON parsing that got fixed in 10 minutes.
  2. Dots are not allowed in BSON document keys. Typically it might not be a problem, but we had to work around it in our data migration.
  3. Document size is limited to 4 megabytes. It’s not a problem with our data model, but since MongoDB supports fantastic atomic in-place updates, you have to be careful not to grow your documents above this limit.
  4. Adding new nodes is not as easy as it is with Cassandra. However, Cassandra has its own problems load-balancing them.
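
As an illustration of the second limitation, one possible way to handle dots in keys during a migration is to rewrite the keys before inserting. The sketch below is an assumption about how such a workaround could look, not the code we actually used: `sanitizeKeys` is a hypothetical helper, and the `_` replacement character and the sample row are made up.

```javascript
// Hypothetical migration helper: recursively replace "." in object keys.
// The replacement character is an assumption chosen for illustration.
function sanitizeKeys(value, replacement = '_') {
  if (Array.isArray(value)) {
    return value.map((item) => sanitizeKeys(item, replacement));
  }
  // Only rewrite plain objects; Dates and other special objects pass through unchanged.
  const isPlainObject =
    value !== null &&
    typeof value === 'object' &&
    Object.getPrototypeOf(value) === Object.prototype;
  if (isPlainObject) {
    const out = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key.split('.').join(replacement)] = sanitizeKeys(inner, replacement);
    }
    return out;
  }
  return value;
}

// Invented sample row with dotted keys at two nesting levels.
const legacyRow = { 'user.name': 'otto', meta: { 'client.version': '1.0', tags: ['todo'] } };
const cleaned = sanitizeKeys(legacyRow);
console.log(JSON.stringify(cleaned));
```

Note that a simple replacement like this can produce key collisions (for example `a.b` and `a_b` in the same object), so a real migration would need to check for them.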

So far it’s been a very smooth ride. Development and database administration just got a whole lot easier.

» Now give the new and better Flowdock a try!


16 Comments

  1. alexs

    Thanks for the post. I’d be interested to know a couple of things. First, what problems did you find with mongodb’s mapreduce support? Second, how could adding new nodes in mongodb be made easier? As far as I understood it’s just a matter of starting up a new instance in slave mode.

  2. I didn’t exactly have problems with Map-Reduce, but it’s not as good as it could be. For example, it doesn’t utilize more than one CPU core per node (I understand this is a limitation of the JavaScript interpreter). Also, it doesn’t cache parts of the result effectively, so running the query many times in a row doesn’t get any faster.

    About adding nodes: that’s true, if you’re just adding new slaves. Adding new shards and configuring replica sets still needs some extra work – not much, but it’s not fully automated either.

  3. What kind of reads/writes per second are you experiencing? Any comments on server load between the Cassandra and mongodb?

  4. serdar

    Salman’s question is perfect.

  5. I was wondering what kind of fail-over mechanism you had to implement.
    Are you using a MongoDB arbiter? Is it just the out-of-the-box solution or did you need to customize it to your needs? (in respect to error recovery)

  6. George

    Hope you have better luck than this guy: http://www.korokithakis.net/node/119

  7. @Salman:

    Our query throughput is not too impressive: we’ve got some 22M messages at the beginning of our public beta, but our backend mostly just writes to the database. With these volumes we could’ve done pretty well without Cassandra, but of course we want to be prepared for the future. As such, the load is very low.

    Cassandra would be much better suited for sites with a) a bit different data model and b) much more traffic.

  8. As far as I’m aware, even in the development builds, there is a limitation on using more than one index on arrays.

    What kind of write lock are you getting? I’m finding that it’s getting to ~35% time spent locked with 130 mil+ docs in a single collection. Do you run the munin stats? Can we see?

  9. @George:

    That post is from a guy running MongoDB in a non-production environment. Had he followed a production scenario with some sort of replication, he would have mitigated most, if not all, of the possible data loss. The only time I have seen MongoDB lose data is when the disk runs out of space and cannot fsync the mmap’d activity to disk. Even then there are solutions floating around to this issue.

  10. tszming

    >> Dots are not allowed in BSON document keys.

    Seems this validation is done by the client, i.e. the mongo shell.

    I was able to insert a key with a dot “.” using the Java driver.

  11. @Tszming the validation is done by the client, but if the java client is letting you do that insert then it is a bug (do you mind reporting it? ;)). Dots are disallowed because they can lead to subtle errors when they are present in key names – you should definitely avoid them.

  12. Martin

    There’s one major problem with MongoDB: It’s licensed AGPL 3.

    AGPL virally forces ALL connected software to a compatible license unless you buy a commercial license from MongoDB’s developers to get rid of AGPL. The project’s explanation that AGPL would stop affecting applications behind a driver is just wrong – if that should be part of the license then the license cannot be AGPL or GPL 3. The whole idea of AGPL is that software which connects differently than by being linked in terms of code can be influenced as well: If you connect by network or by shell commands instead of linking a library your software would not be affected by GPL 2 (that loophole is commonly used and angered too many hardcore GPL developers). However, with AGPL or GPL3 that trick is closed; the license affects all code regardless of how it connects. For that reason, MongoDB drivers should not be able to be licensed Apache or similar as well because that would allow for undisclosed code. At runtime, when you are connecting to MongoDB itself, the drivers will convert implicitly to AGPL license as long as they permit such transformations (MIT/BSD does, I’m not sure about Apache) and you are again linking directly against AGPL licensed libraries. If a driver is under a license that does not allow such transformations to AGPL at runtime, it breaks the terms of AGPL. MongoDB’s license is deeply flawed, unclear and confusing…

    This has prevented me as well as the company I work for from trying MongoDB – it’s of no use unless you code (A)GPL 3 applications or have the money to buy yourself free of these license restrictions, but then there are better databases you could spend your money on. I personally prefer licenses that are more free than (A)GPL because I don’t like being restricted in what I can use my own code for.

    So, by using the free MongoDB – how I see it – you would have to open up FlowDock under (A)GPL for everyone because:

    - FlowDock users access your webapp
    - the webapp uses a driver
    - that driver connects to MongoDB

    That last part of the chain forces every other part into a license compatible with AGPL 3, so your users have to get access to all parts’ source code or you are violating (A)GPL (and may get sued). Unless you buy a commercial license, of course.

  13. @Russell:

    I tested with 120M documents in a single collection, and wrote a script that inserts real-looking data at 500 inserts/sec, 250 reads/sec and 100 updates/sec. It doesn’t produce noticeable load, and write lock stays very low. A couple of times per minute an insert might take more than 100 ms, but otherwise it works very smoothly. There are five indices, a total of ~50 gigabytes. The test server had 12 GB of memory.

    I’m running MongoDB on non-virtualized servers, which might explain something.

  14. @Martin:

    “Note however that it is never required that applications using mongo be published. The copyleft applies only to the mongod and mongos database programs. This is why Mongo DB drivers are all licensed under an Apache license. Your application, even though it talks to the database, is a separate program and “work”.”
    http://www.mongodb.org/display/DOCS/Licensing

  15. Robert Cooper

    The problem with Linux is it is GPL. You can’t run software on Linux or that talks to Linux without being infected by the license.

    Martin: that is just ignent. That has never been how GPL software works. The rule has always been: what are you directly linking to? Or more to the point, if you did a monolithic compile, what code goes into your app. For some reason, people seem fond of not using the LGPL, but adding “Exceptions” to GPL code (see MySQL, Java, etc). But if passing data to/from a GPL application affected your code, the whole of the web would be “infected.”

  16. Martin

    @Robert Cooper: That’s the exact problem with AGPL and GPL 3. As soon as you interface an “infected” codebase it spreads.

    I know this is NOT the case with GPL 2 and of course not with commercially compatible licenses such as LGPL, Apache, MIT, BSD and a myriad of other licenses as well.

    Let’s assume there’s a server software like a database under GPL2. The protocol cannot be GPLed so if you write some driver to interface it without linking GPL licensed code you are free to choose whatever license you want for your driver. Another example is a data exchange happening through stdin/stdout: If you interface a GPL2 application in such a way then it is considered a public interface that cuts the GPL at that point.

    Some people didn’t like that because this is a common way to circumvent the legal restrictions of GPL without breaking it (which is okay since they had a point). The big difference between AGPL and GPL is that AGPL claims any interface to be under AGPL restrictions such as network communication or stdin/out. Although you don’t link directly to an AGPLed codebase you are still forced to license your code under either AGPL or at least GPL.

    I think that MongoDB’s core developers got confused with those licenses. What they claim MongoDB to be licensed under matches GPL2 but definitely not AGPL since the whole point of AGPL gets lost.

    To get back to your example, theoretically, if you are interfacing an AGPLed server via HTTP, AGPL would spread through that channel as well, to those MongoDB drivers that are working with REST, to your own driver that you wrote because you felt the urge to and so on. Of course that’s highly impractical and unrealistic, that’s just not the way any open system works. But that’s how AGPL has been intended to be; I’d like to call that some sort of broken design.
