The Blog

The what's what of the Flowdock atmosphere.

Full-Text Search with MongoDB — Flowdock Style

Otto Hilska March 30th, 2011

Flowdock is a collaboration tool for technical teams. From the very beginning, we had this idea of using tags to categorize real-time discussions. It works really well for building lists (like ToDo lists, lists of competitors, notes etc.) and for categorizing links (think delicious.com for your internal discussions). It changed the way we work. Still, sometimes you just forget to tag something – which is probably why our recently announced full-text search was a very common feature request.

We store all the messages in MongoDB. Implementing tags with multikey indices was trivial, and using the same mechanism for full-text search seemed to make sense. Some characteristics of our use case are:

  1. Each flow (workspace) is separate, and there’s never a need to search through all messages in the database.
  2. Search results are simply sorted chronologically; there is no need for a “PageRank”.
  3. In Flowdock everything happens in real time, so people expect to see their messages in search results immediately.
  4. We use MongoDB to store messages (with 3 copies of each message + backups). Having all that data in another redundant system would be cumbersome.

Search implementation

The implementation was very straightforward. Every message document would simply get a new field, _keywords, populated from the message contents. At this point we decided not to do stemming, but to get feedback from our customers about what’s relevant and what’s not. This has already resulted in adding the ability to jump back in chat history.

A message document would now look like this:

{ id: 12345,
  author: …,
  content: "Hey @Otto, should you write the blog post?",
  flow: "flowdock:developers",
  _keywords: ["hey", "otto", "should", "you", "write", "the",
    "blog", "post"],
  _tags: ["user:12"]
}
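As a rough illustration of how a _keywords field like the one above could be derived, here is a minimal sketch. The extractKeywords helper is hypothetical and not Flowdock's actual code; it assumes that lowercasing, splitting on anything that is not a letter or digit, and dropping duplicates is good enough, with no stemming.

```javascript
// Hypothetical helper (an assumption, not Flowdock's code): derive search
// keywords from a message's text. No stemming is applied.
function extractKeywords(content) {
  const words = content
    .toLowerCase()
    .split(/[^\p{L}\p{N}]+/u) // split on runs of non-letter, non-digit characters
    .filter((word) => word.length > 0); // drop empty strings from leading/trailing separators
  return [...new Set(words)]; // remove duplicates, keeping first-seen order
}

const message = {
  id: 12345,
  content: "Hey @Otto, should you write the blog post?",
  flow: "flowdock:developers",
};
message._keywords = extractKeywords(message.content);
console.log(message._keywords);
```

Dropping duplicates keeps the array, and with it the multikey index, smaller for repetitive messages.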

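We needed to add a new index, { flow: 1, _keywords: 1, id: -1 }. Since _tags is not part of this index (MongoDB cannot index two array fields in the same compound index), it basically means that we’re not going to be able to search by tags and keywords at the same time.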
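To make the role of that index more concrete, here is a sketch of the kind of query it is meant to serve. The buildSearchQuery helper is hypothetical. It also assumes that every search term must match (via $all) and that a higher id means a newer message, neither of which the post states.

```javascript
// Hypothetical helper (an assumption, not Flowdock's code): build the filter
// and sort order for a keyword search within a single flow.
function buildSearchQuery(flow, searchText) {
  const terms = searchText
    .toLowerCase()
    .split(/[^\p{L}\p{N}]+/u)
    .filter(Boolean);
  return {
    // Equality on flow plus a match on the _keywords array.
    filter: { flow: flow, _keywords: { $all: [...new Set(terms)] } },
    // Assumes a larger id means a newer message, so this sorts newest first.
    sort: { id: -1 },
  };
}

const query = buildSearchQuery("flowdock:developers", "Blog post");
console.log(JSON.stringify(query));
// In the mongo shell, this could be run as:
//   db.messages.find(query.filter).sort(query.sort)
```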

Keyword population was quietly enabled in our backend long before we rolled out the end-user UI. We also had a Scala script to go through all the older messages and populate the field.

Operational issues

Even before this exercise, our system was in a state where the indices were too large to fit in memory. As it turned out, the _keywords index size in our use case is ~2500 MB per 1 million messages. Since we had accumulated more than 50M messages, which at that rate works out to roughly 125 GB of keyword index, this really turned out to be a challenge.

(Screenshot: tweeting about our indexing process)

When running the migration we encountered several challenges:

  1. MongoDB supports generating the index in the background. However, since the service was running at the same time, index generation and active users were aggressively fighting over the available resources. Since adding new messages already involved updating several large indices, and since background index generation was populating the memory cache with its own values, all queries started hitting the disk and the user experience was not tolerable.

    To solve this issue, we removed _keywords from all the messages that the backend had populated so far, and then built the index while it was still empty. This was much faster.

  2. Running the _keywords population script also made all queries hit the disk. Since it’s not possible to give priorities to MongoDB queries, we ended up adding lots of sleep() calls in our script. Process 500 messages, sleep 5 seconds…

    By default MongoDB doesn’t ensure that the change was actually written to disk before returning from the query. This could lead to a situation where queries are just stacking up because the indexing script is faster than your database. Setting the Write Concern level appropriately will make the queries block until they’ve actually completed.

    If possible, it’s also good to optimize for data locality. If your data is partitioned per organization, it’s likely that two consecutive messages from the same organization are closer to each other in the index than two completely random messages. In some cases this might ease the disk I/O a lot.

    Even with these tricks we’d still have to stop the script every once in a while, when the queries started getting too slow.
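The "process 500 messages, sleep 5 seconds" pattern described above can be sketched as follows. This is not the original Scala script. populateInBatches and its callbacks are hypothetical names, and the sketch assumes that processBatch blocks until the database has acknowledged its writes, which is what an appropriate Write Concern is meant to ensure.

```javascript
// Hypothetical sketch (not the original script): process items in fixed-size
// batches, pausing between batches so that live queries get room to run.
// `processBatch` and `sleep` are supplied by the caller; in the mongo shell,
// the built-in sleep(ms) could be passed as `sleep`.
function populateInBatches(items, processBatch, sleep, batchSize = 500, pauseMs = 5000) {
  let processed = 0;
  for (let start = 0; start < items.length; start += batchSize) {
    const batch = items.slice(start, start + batchSize);
    processBatch(batch); // assumed to block until the writes are acknowledged
    processed += batch.length;
    if (start + batchSize < items.length) {
      sleep(pauseMs); // no pause is needed after the final batch
    }
  }
  return processed;
}

// Usage with stand-in callbacks that only record what would happen.
const batchSizes = [];
const pauses = [];
const total = populateInBatches(
  Array.from({ length: 1200 }, (_, i) => i),
  (batch) => batchSizes.push(batch.length),
  (ms) => pauses.push(ms)
);
console.log(total, batchSizes, pauses);
```

Feeding the items in an order that follows the data partitioning, for example one organization at a time, would combine this with the data-locality advice above.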

In the end we took a couple of shortcuts to make our lives easier:

  1. We deleted millions of messages from organizations who had stopped using Flowdock during the beta phase.
  2. We ordered SSDs from our service provider. It turned out to be the best investment ever! Of course it’s not doable for people using Amazon EC2 etc., but in our case we got a HUGE performance improvement. We could now populate _keywords on a live database with minimal load.

Another solution would, of course, have been to simply shard until the indices on each server fit in memory. Since in our use case people mostly access their newer messages, we felt that SSDs were a more efficient solution.

Now the search is running happily, with only a couple of queries per day taking more than 100 ms. Full-text search with MongoDB might be a good fit for many use cases; just be prepared for all the operational issues that any search solution will bring you.

Why Flowdock migrated from Cassandra to MongoDB

Otto Hilska July 26th, 2010

Flowdock is a modern web-based team messenger. All software developers should be using it instead of their Campfires, Skype Chats, IRCs, etc. because it better supports their actual workflow.

Last weekend we completed a transition from Flowdock’s database of choice, Cassandra, to another NoSQL alternative, MongoDB. Since our technology stack has always generated some interest, I’ll now try to justify our decision in public.

Some of our users might remember this:

(Twitter screenshot: having some database problems)

At some point we started to have stability issues with Cassandra. All nodes would go into an infinite loop, running GC and trying to compact the data files, occasionally falling off the cluster. We were unable to solve the problem, except that restarting and then compacting a node usually settled it down for a while. Other people had reported similar problems. For the last couple of weeks, our Cassandra nodes ate all the resources they were given, slowing down Flowdock.

This was not the first time we had run into problems because of our bleeding edge database choice. When upgrading from 0.4 to 0.5, we had to shut down the cluster, only to find out that it hadn’t flushed everything to the disk (even though we explicitly flushed it, as instructed). Thus we ended up having a couple of minutes of discussions lost, and our custom-built indices were miserably out of date and needed to be rebuilt. I think it was 4 AM when we finally got to leave the office.

The NoSQL scene has evolved since we made our original decision to go with Cassandra. MongoDB is changing rapidly, and the latest addition of auto-sharding and replica sets made it a compelling alternative to Cassandra. So I decided to give it a try.

It took me a day to write a conversion script for our data. Within a week or so we were able to run Flowdock purely on MongoDB. It was tested internally for a couple of weeks before it was deployed to production.

Now that we have made the change, I’m happy to see that we got some benefits (very well known in most databases) in addition to the performance and reliability characteristics:

  1. Smart (multikey) indices. Manually maintaining indices was always tedious, and MongoDB can index everything we need out-of-the-box. For example, our messages have tags, implying a document format like this:
    { content: "Write a blog post about #mongodb.",
      workspace: 'myflow',
      tags: ["mongodb", "todo", "@Otto"] }
    

    Now when looking for my own tasks, the Flowdock backend only needs to run this query:

    db.messages.find({
      workspace: 'myflow',
      tags: { $all: ["todo", "@Otto"] }
    })
    
  2. Queries. No matter how simple your data model is, every once in a while you need to perform a query that you didn’t plan in advance. MongoDB lets you construct complex queries directly from the console, pretty much like an SQL database. Without a matching index it will then perform a sequential scan, which is still much faster and more convenient than processing millions of rows manually on the client side.
  3. Map-Reduce. It’s great for stuff like analytics. MongoDB’s Map-Reduce support is not perfect, but at least it’s easy to use.
  4. GridFS makes storing our files very easy. The storage capabilities expand together with the rest of our MongoDB cluster.

We have faced only some minor limitations:

  1. We found a bug in JSON parsing that got fixed in 10 minutes.
  2. Dots are not allowed in BSON document keys. Typically it might not be a problem, but we had to work around it in our data migration.
  3. Document size is limited to 4 megabytes. It’s not a problem with our data model, but since MongoDB supports fantastic atomic in-place updates, you have to be careful not to grow your documents above this limit.
  4. Adding new nodes is not as easy as it is with Cassandra. However, Cassandra has its own problems load-balancing them.
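The post does not say how the dots-in-keys limitation was worked around during the migration. One possible approach, shown purely as an assumption, is to rewrite keys before inserting. The escapeKeys helper below is hypothetical and replaces dots with a substitute character in plain objects and arrays.

```javascript
// Hypothetical workaround (an assumption, not Flowdock's migration code):
// recursively replace dots in object keys so that the document can be stored.
function escapeKeys(value, replacement = "_") {
  if (Array.isArray(value)) {
    return value.map((item) => escapeKeys(item, replacement));
  }
  const isPlainObject =
    value !== null &&
    typeof value === "object" &&
    Object.getPrototypeOf(value) === Object.prototype;
  if (!isPlainObject) {
    return value; // primitives, dates and other special objects are left untouched
  }
  const escaped = {};
  for (const [key, inner] of Object.entries(value)) {
    escaped[key.split(".").join(replacement)] = escapeKeys(inner, replacement);
  }
  return escaped;
}

console.log(JSON.stringify(escapeKeys({ "user.name": "Otto", meta: { "a.b": [{ "c.d": 1 }] } })));
```

A replacement like this is lossy if a document contains both "a.b" and "a_b" as keys, so a real migration would need a replacement string that cannot occur in the data.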

So far it’s been a very smooth ride. Development and database administration just got a whole lot easier.

» Now give the new and better Flowdock a try!