About a month ago, we rolled out major changes to our online status tracking service. Very quickly, our public Streaming API started running out of memory and needed to be constantly restarted. Here’s a recap of what we learned while tracking down the bugs.
Introducing Presence 2.0
When Flowdock was launched, it was a web app that had one flow per browser window. We tracked who was online by having clients send regular pings to the server, which were then relayed to everyone else in the flow. As Flowdock became more popular and especially when support for multiple flows per window was added, this became one of our biggest architectural pain points. It’s actually quite surprising that we were able to get away without a rewrite for so long.
Our new, rewritten presence detection service tracks your open connections in real time. Only state changes – like going offline or becoming active after being idle – are relayed to others. The new presence updates no longer go through the regular message processors, taking a lot of pressure off our other backend services. However, it also means that our Streaming API needs to support updating your online state when connecting, sending your presence updates to others and determining which users’ presence updates should be sent to you.
Introducing memory leaks
When we deployed our new and improved Streaming API service to production, it started to constantly run out of memory. Node.js processes have a memory limit of around 1.4GB, which we hit daily. Our memory usage graphs looked like this:
We tried several things to figure out what’s wrong. Taking three heap dumps and comparing them using node-inspector on development machines gave us no clues. Heap dumps from production were of no use either. Just reading the code didn’t help – finding memory leaks without any idea of what might be wrong is as likely as winning the lottery.
What finally helped us track down the problem was our server stats and graphs. The message delivery volume for the said service had a curious (and huge) peak that started when we released our presence-enabled Android app and dropped back to normal levels when the service was restarted. This was a pretty clear indication that some message callbacks were not getting cleaned up.
We initially didn’t notice this because it had happened more than a week ago, and we hadn’t looked that far back in the graphs.
Introducing issue #1309
So – it appeared that a message callback that delivered messages to clients was not being cleaned up properly after a connection was closed. This meant that the callbacks kept piling up, and the measured message delivery volume as well as memory usage grew. We went over all the related code with a fine comb but couldn’t find any code path that didn’t perform the required cleanup. What was going on?
After a fair bit of desperate Googling, we found this: nodejs/io.js issue #1309, “No finish/close event on aborted http-response – race condition”. We had found our bug. A simple traffic simulation that opened connections and aborted them confirmed our finding. The
close event for the Node.js HTTP response was simply not firing, meaning the cleanup was never done.
We added a simple timer that checks if the connections are still alive and terminates the ones that are not. Even if the
close event does not fire, the
request object still has information about the state of the connection, allowing us to check whether it has been destroyed or not:
Feeling victorious, we deployed the fix and started seeing the
disconnect-dead-socket events in our stats.
Introducing bad asynchronous code
The next morning (we live in Helsinki, Finland, so our peak traffic is during the evening and nighttime, while people in the USA are at work), I opened the graphs to verify that we had fixed the memory leak.
Turns out we hadn’t. The memory usage graph still had a constant positive slope, albeit somewhat less steep than before. It was back to drawing board to figure out what was wrong.
Again, what helped us find the culprit was our graphs. We were leaking memory at a fairly constant rate, so it had to be something that happened all the time. It seemed to be impossible to reproduce in our development environment with a simple traffic simulation, so it had to be related to how the Streaming API is used in the real world.
Finally, we found it. I had overlooked a small detail in the local traffic simulation:
last-event-id, a field that’s part of the EventSource specification. EventSource has a built-in mechanism to handle reconnections without losing data. Since every sent message has an id, sending the id of the last seen message with
last-event-id when reconnecting forces the server to resend any messages that were missed while not connected. In case there are too many missed messages, we respond with a
409 Conflict, instructing the client to perform a full resync.
We were unable to simulate the memory leak due to not testing with the
last-event-id parameter. Adding it to the traffic simulation finally revealed that we were leaking a certain type of listener, which in turn quickly lead to finding what caused the memory leak by reviewing the code.
Our new presence code was coupled to each connection’s lifecycle, and cleaned up its own callbacks – the ones that relayed presence updates from the people you see over your open connections. However, there was a small gotcha: when connecting, we first needed to check (via an HTTP API call) which 1-to-1s are open so we know who you can “see”. In case of a sync error, the faster
409 response closed the connection and forced a cleanup. The slower 1-to-1 call then finished, and re-added a callback that would never be removed since the cleanup was already done.
What made the problem so pronounced? Our new presence code introduced a new place to which we could add callbacks, but due to how the code was organised, those callbacks’ scope included everything related to a single connection. The result: lots of leaked memory.
In the end, what really made a difference in pinpointing the leaks was good statistics data, graphs (via StatsD, Graphite and Grafana), logging and simulating real-world usage. We wrote a bunch of scripts that can generate variable amounts of traffic to a development server with configurable parameters, helping us test different kinds of traffic situations in the future. Graphs help us notice changes in patterns and are invaluable when tracking down these kinds of things. To be more specific, graphs for memory consumption, message delivery volumes, HTTP status codes over time and certain callback counts were very helpful.
If someone came up to me looking for pro tips on how to track down Node.js memory leaks, these would be my recommendations:
- Instrument everything. Count connections, callbacks and every kind of object that might create a memory leak if it does not get cleaned up. If you have one registered callback for every connection, instrument the count for both so you can see if there’s some disparity.
- Once you have instrumented everything, graph the data. Looking at graphs lets you see how the data changes over time and helps you notice irregularities. The graphs also tell you what kind of problems you’re facing. Does the metric for X grow in a linear fashion or does it follow the changes in some other metric?
- Throw your debugger out of the window. Alternatively, you could structure your code so that a heap dump shows meaningful information about retained memory content – things like named functions and certain types of objects. But in our experience, compiled code and functions tend to be the most visible content.
- Simulate real world usage with a script and see how your app behaves. Notice any disparities in counters that should measure the same thing, like a disparity between the number of callbacks and active clients if those should match.
By the way, we’re looking for people who love these kinds of challenges.