Expert Node.js Support
Learn more

risingstack

Node.js War Stories: Debugging Issues in Production

Node.js War Stories: Debugging Issues in Production

In this article, you can read stories from Netflix, RisingStack & nearForm about Node.js issues in production - so you can learn from our mistakes and avoid repeating them. You'll also learn what methods we used to debug these Node.js issues.

Special shoutout to Yunong Xiao of Netflix, Matteo Collina of nearForm & Shubhra Kar from Strongloop for helping us with their insights for this post!


At RisingStack, we have accumulated a tremendous experience of running Node apps in production in the past 4 years - thanks to our Node.js consulting, training and development business.

As well as the Node teams at Netflix & nearForm we picked up the habit of always writing thorough postmortems, so the whole team (and now the whole world) could learn from the mistakes we made.

Netflix & Debugging Node: Know your Dependencies

Let's start with a slowdown story from Yunong Xiao, which happened with our friends at Netflix.

The trouble started with the Netflix team noticing that their applications response time increased progressively - some of their endpoints' latency increased with 10ms every hour.

This was also reflected in the growing CPU usage.

Netflix debugging Nodejs in production with the Request latency graph Request latencies for each region over time - photo credit: Netflix

At first, they started to investigate whether the request handler is responsible for slowing things down.

After testing it in isolation, it turned out that the request handler had a constant response time around 1ms.

So the problem was not that, and they started to suspect that probably it's deeper in the stack.

The next thing Yunong & the Netflix team tried are CPU flame graphs and Linux Perf Events.

Flame graph of Netflix Nodejs slowdown Flame graph or the Netflix slowdown - photo credit: Netflix

What you can see in the flame graph above is that

  • it has high stacks (which means lot of function calls)
  • and the boxes are wide (meaning we are spending quite some time in those functions).

After further inspection, the team found that Express's router.handle and router.handle.next has lots of references.

The Express.js source code reveals a couple of interesting tidbits:

  • Route handlers for all endpoints are stored in one global array.
  • Express.js recursively iterates through and invokes all handlers until it finds the right route handler.

Before revealing the solution of this mystery, we have to get one more detail:

Netflix's codebase contained a periodical code that ran every 6 minutes and grabbed new route configs from an external resource and updated the application's route handlers to reflect the changes.

This was done by deleting old handlers and adding new ones. Accidentally, it also added the same static handler all over again - even before the API route handlers. As it turned out, this caused the extra 10ms response time hourly.

Takeaways from Netflix's Issue

  • Always know your dependencies - first, you have to fully understand them before going into production with them.
  • Observability is key - flame graphs helped the Netflix engineering team to get to the bottom of the issue.

Read the full story here: Node.js in Flames.


Expert help when you need it the most

Commercial Node.js Support by RisingStack
Learn more


RisingStack CTO: "Crypto takes time"

You may have already heard to story of how we broke down the monolithic infrastructure of Trace (our Node.js monitoring solution) into microservices from our CTO, Peter Marton.

The issue we'll talk about now is a slowdown which affected Trace in production:

As the very first versions of Trace ran on a PaaS, it used the public cloud to communicate with other services of ours.

To ensure the integrity of our requests, we decided to sign all of them. To do so, we went with Joyent's HTTP signing library. What's really great about it, is that the request module supports HTTP signature out of the box.

This solution was not only expensive, but it also had a bad impact on our response times.

network delay in nodejs request visualized by trace The network delay built up our response times - photo: Trace

As you can see on the graph above, the given endpoint had a response time of 180ms, however from that amount, 100ms was just the network delay between the two services alone.

As the first step, we migrated from the PaaS provider to use Kubernetes. We expected that our response times would be a lot better, as we can leverage internal networking.

We were right - our latency improved.

However, we expected better results - and a lot bigger drop in our CPU usage. The next step was to do CPU profiling, just like the guys at Netflix:

crypto sign function taking up cpu time

As you can see on the screenshot, the crypto.sign function takes up most of the CPU time, by consuming 10ms on each request. To solve this, you have two options:

  • if you are running in a trusted environment, you can drop request signing,
  • if you are in an untrusted environment, you can scale up your machines to have stronger CPUs.

Takeaways from Peter Marton

  • Latency in-between your services has a huge impact on user experience - whenever you can, leverage internal networking.
  • Crypto can take a LOT of time.

nearForm: Don't block the Node.js Event Loop

React is more popular than ever. Developers use it for both the frontend and the backend, or they even take a step further and use it to build isomorphic JavaScript applications.

However, rendering React pages can put some heavy load on the CPU, as rendering complex React components is CPU bound.

When your Node.js process is rendering, it blocks the event loop because of its synchronous nature.

As a result, the server can become entirely unresponsive - requests accumulate, which all puts load on the CPU.

What can be even worse is that even those requests will be served which no longer have a client - still putting load on the Node.js application, as Matteo Collina of nearForm explains.

It is not just React, but string operations in general. If you are building JSON REST APIs, you should always pay attention to JSON.parse and JSON.stringify.

As Shubhra Kar from Strongloop (now Joyent) explained, parsing and stringifying huge payloads can take a lot of time as well (and blocking the event loop in the meantime).

function requestHandler(req, res) {  
  const body = req.rawBody
  let parsedBody
  try {
    parsedBody = JSON.parse(body)
  }
  catch(e) {
     res.end(new Error('Error parsing the body'))
  }
  res.end('Record successfully received')
}

Simple request handler

The example above shows a simple request handler, which just parses the body. For small payloads, it works like a charm - however, if the JSON's size can be measured in megabytes, the execution time can be seconds instead of milliseconds. The same applies for JSON.stringify.

To mitigate these issues, first, you have to know about them. For that, you can use Matteo's loopbench module, or Trace's event loop metrics feature.

With loopbench, you can return a status code of 503 to the load balancer, if the request cannot be fulfilled. To enable this feature, you have to use the instance.overLimit option. This way ELB or NGINX can retry it on a different backend, and the request may be served.

Once you know about the issue and understand it, you can start working on fixing it - you can do it either by leveraging Node.js streams or by tweaking the architecture you are using.

Takeaways from nearForm

  • Always pay attention to CPU bound operations - the more you have, to more pressure you put on your event loop.
  • String operations are CPU-heavy operations

Debugging Node.js Issues in Production

I hope these examples from Netflix, RisingStack & nearForm will help you to debug your Node.js apps in Production.

If you'd like to learn more, I recommend checking out these recent posts which will help you to deepen your Node knowledge:

If you have any questions, please let us know in the comments!

Case Study: Finding a Node.js Memory Leak in Ghost

Case Study: Finding a Node.js Memory Leak in Ghost

At RisingStack, we have been using Ghost from the very beginning, and we love it! As of today we have more than 125 blogposts, with thousands of unique visitors every day, and with 1.5 million pageviews in 2016 overall.

In this post I’m going to share the story of how we discovered a node.js memory leak in [email protected], and what role Trace played in the process of detecting and fixing it.

What's Ghost?

Just a blogging platform

Node.js Memory Leak - The Ghost blogging platforms logo

Ghost is a fully open-source publishing platform written entirely in JavaScript. It uses Node.js for the backend, Ember.js for the admin side and Handlebars.js to power the rendering.

Ghost is actively developed - in the last 30 days, it had 10 authors with 66 commits to the master branch. The project's roadmap can be found here: https://trello.com/b/EceUgtCL/ghost-roadmap.

You can open an account at https://ghost.org/ and start writing instantly - or you can host your own version of Ghost, just like we do.

Our Ghost Deployment

Firstly, I'd like to give you a quick overview of how we deploy and use Ghost in production at RisingStack. We use Ghost as a npm module, required into a bigger project, something like this:

// adding Trace to monitor the blog
require([email protected]/trace')  
const path = require('path')  
const ghost = require('ghost')

ghost({  
  config: path.join(__dirname, 'config.js')
}).then(function (ghostServer) {
  ghostServer.start()
})

Deployments are done using Circle CI which creates a Docker image, pushes it to a Docker registry and deploys it to a staging environment. If everything looks good, the updates are moved to the production blog you are reading now. As a backing database, the blog uses PostgreSQL.

The Node.js Memory Leak

As we like to keep our dependencies up-to-date, we updated to [email protected] as soon as it came out. Once we did this, our alerts started to fire, as memory usage started to grow:

Node.js Memory leak in ghost - Trace memory metrics

Luckily, we had alerts set up for memory usage in Trace, which notified us that something is not right. As Trace integrates with Opsgenie and Pagerduty seamlessly, we could have set up alerts for those channels.

Node.js Monitoring and Debugging from the Experts of RisingStack

Set up alerts for your Node.js deployments using Trace
Learn more

We set up alerts for the blog service at 180 and 220 Mb because usually it consumes around 150 Mb when everything’s all right.

Setting up alerts for Node.js memory leaks in Trace

What was even better, is that the alerting was set up in a way that it triggered actions on the collector level. What does this mean? It means, that Trace could create a memory heapdump automatically, without human intervention. Once we started to investigate the issue, the memory heapdump was already in the Profiler section of Trace in the format that's supported by the Google Chrome DevTools.

This enabled us to start looking at the problem instantly, and in a way it happened in the production system, not by trying to reproduce the issue in a local development environment.

Also, as we could take multiple heapdumps from the application itself, we could compare them using the comparison view of the DevTools.

Memory heapshot comparison with Trace and Chrome's Devtools

How to use the comparison view to find the source of a problem? On the picture above, you can see that I compared the heapdump that Trace automatically collected when the alert was triggered with a heapdump that was requested earlier, when everything was ok with the service.

"You can use 2 heapdumps & Chromes comparison view to investigate a Node.js memory leak!" via @RisingStack #nodejs

Click To Tweet

What you have to look for is the #Delta, which shows +772 in our case. This means that at the time our high memory usage alert was triggered the heapdump had an extra 772 objects in it. On the bottom of the picture you can see what were these elements, and that they have something to do with the lodash module.

"The "delta" shows the extra objects in your memory heapdump in case of a leak!" via @RisingStack #nodejs

Click To Tweet

Figuring this out otherwise would be extremely challenging since you’d have to reproduce the issue in a local environment - which is tricky if you don’t even know what caused it.

Should I update? Well..

The final cause of the leak was found by Katharina Irrgang, a core Ghost contributor. To check out the whole thread you can take a look at the GitHub issue: https://github.com/TryGhost/Ghost/issues/7189 . A fix was shipped with 0.10.1. - but updating to it will cause another issue: slow response times.

Slow Response Times

Once we upgraded to the new version, we ran into a new problem - our blog's response time started to degrade. The 95 percentile grew from 100ms to almost 300ms. It instantly triggered our alerts set for response times.

Slow response time graph from Trace

For the slow response time we started to take CPU profiles using Trace. For now, we are still investigating the exact reason, but so far we suspect something is off with how moment.js is used.

CPU profile analysis with Trace

We will update the post once we found why it happens.

Conclusion

I hope this article helped you to figure out what to do in case you're experiencing memory leaks in your Node.js applications. If you'd like to get memory heapdumps automatically in a case like this, connect your services with Trace and enable alerting just like we did earlier.

If you have any additional questions, you can reach me in the comments section!

How Developers use Node.js - Survey Results

How Developers use Node.js - Survey Results

RisingStack, the provider of Trace - a next-gen Node.js debugging and performance monitoring solution and silver member of the Node Foundation conducted a survey during 2016 Summer to find out how developers use Node.js and what technologies they prefer with it. This article summarizes the results.

The results show that MongoDB, RabbitMQ, AWS, Jenkins, Docker and Amazon Container Services are the go-to choices for developing, containerizing and shipping Node.js applications.

The survey also let us find out various aspects of developing Node.js and choices for async control flow, debugging, continuous integration or finding packages. The results also tell Node developers major pain-point: debugging.

The survey was open for 35 days from 11 July until 15 August 2016. During this period, 1126 Node.js developers completed it. 55% of them have more than two years of Node.js experience, while 26% uses Node between one and two years. 20% works at a company that is publicly traded, 7% at a Fortune 500 enterprise.

Technologies used with Node.js

MongoDB became the go-to database

Node.js Survey - What databases are you using? MongoDB wins.

According to the results, MongoDB is clearly the go-to database for Node.js developers. Roughly ⅔ of our respondents claimed that they use MongoDB with their Node.js applications. It's also worth noticing that the popularity of Redis is massively increasing with the experience of Node engineers. This trend is also true in the case of PostgreSQL and ElasticSearch.

Node.js Monitoring and Debugging from the Experts of RisingStack

Build performant applications using Trace
Learn more

Node.js Survey - Database usage and developer experience

"The popularity of Redis and PostgreSQL is heavily increasing with developer XP" via @RisingStack #nodejs #survey

Click To Tweet


Redis leads as a caching solution, but a lot of developers still don’t do it

Node.js Survey - What do you use for caching? Redis wins.

Half of our respondents said that they are using Redis for caching, but a staggering 45% stated that they don’t use any. Cross referencing the answers with developer experience allows us to see that the popularity of Redis is quite high amongst long-time Node users, compared to engineers with less than one year of Node.js experience.

Node.js Survey - Caching usage and developer experience

The popularity of messaging systems is still low

According to our survey, 58% of Node.js developers don't use any messaging systems. This means that either developers are rarely using messaging in their microservices system, they use REST API-s or they don’t have a sophisticated system in place.

Node.js Survey - What messaging systems are you using? RabbitMQ wins.

Those who use Messaging systems answered that they mostly use RabbitMQ (24% of all respondents). If we only investigate the responses of people who use messaging systems, RabbitMQ beats the rest of the existing solutions by far.

Node.js Survey - Messaging system usage

Node.js apps are most likely running on AWS

According to our survey, 43% of Node.js developers use AWS for running their applications, but running an own datacenter is popular as well (34%) especially amongst enterprises (nearly 50% of them has own datacenters) - but this is no surprise.

Node.js Survey - Where do you run your Node.js apps? AWS.

"43% of Node.js developers use AWS for running their applications" via @RisingStack #nodejs #survey @awscloud

Click To Tweet

What’s interesting though is that Heroku and DigitalOcean are competing neck and neck to become the second biggest cloud platform for Node.js. According to our data, DigitalOcean is more popular with smaller companies (under 50) while Heroku stays strong as an enterprise solution as well.

Node.js Survey - Running apps and company size

Docker dominates in the Node community

Currently, Docker containers are the go-to solution for most of Node.js developers (47% of all respondents claimed to use it - which is 73% of all container tech users in the survey). Docker seems to be equally popular within all company sizes - but advanced developers appear to be using it much more (the ones with over one year’s of experience).

Node.js Survey - What container techs or VMs are you using? Docker.

"Docker containers are the go-to solution for most of Node.js developers." via @RisingStack #nodejs #survey @docker

Click To Tweet

64% of the respondents said that they use some container technology - which means that the popularity of containers rose since the last major Node.js survey from 45% with a significant 20% increase since January 2016.

Node.js Survey - Container techs and developer experience.

Amazon Container Service is the first choice for running containers

Node.js Survey - How do you run your containers? Amazon Container Service wins.

While Amazon Container Service leads as the choice of running containers with Node.js, it’s worth noting that Kubernetes is already on 25% according to our survey, and it seems to be popular especially with enterprise Node.js developers.

Node.js Survey - Running Amazon and Kubernetes with company size

Node.js development

Configuration files are being used more often than environmental variables

The majority of Node developers (59% vs. 38%) prefer config files over credentials. Only 29 respondents (3%) stated that they use both.

Node.js Survey - Environment variables or config files? Config files wins.

Using only configuration files suggests a possible security problem since it implies that credentials are stored in the repositories. If you have the credentials to productions systems in GitHub, you can quickly run into trouble with rogue developers.

Using Env vars is highly recommended for secrets - while developers can still use config files in general.

Promises lead with async control flow

In Node.js - most of the core libraries are working with callbacks. The results show that Node.js users are leaning towards using promises right now.

Node.js Survey - What do you use for async control flow? Promises wins.

Around half a year ago there was a pull-request in the core Node.js repository asking for async functions to return a native Promise. The answer for this was: “A Promises API doesn’t make sense for core right now because it's too early in the evolution of V8-based promises and their relationship to other ES* features. There is tiny interest within the TC in exploring this in the core in the short-term.”

Maybe it’s time to revisit the issue - since the demand is present.

Developers trust the console.log for debugging

Console.log is leading the race amongst other debugging solutions like the Node Inspector, the Built-in debugger and the debug module. Around ¾ of Node developers use it for finding errors in their applications - while much-sophisticated solutions are available as well.

Node.js Survey - How do you debug your applications? Using the console.log

"Around ¾ of Node developers use console.log to find errors in their applications." via @RisingStack #nodejs #survey

Click To Tweet

A closer look at the data lets us know that more experienced developers are leaning towards the Node Inspector and the Debug Module as well.

Node.js Survey - Debugging applications and developer experience

APM’s are still quite unpopular in the Node.js community

According to the responses in our survey, only ¼ of Node.js developers use APMs - application performance monitoring tools - to identify issues in their applications. Although, emerging trends in the dataset suggest that APM usage grows with company size and developer experience.

Node.js Survey - How do you identify issues in your app? Using logs.

SaaS CI’s still have a low market share in the Node community

According to the answers in our survey, using shell scripts is the most popular way of pushing code to staging or production environments - but Jenkins clearly wins among continuous delivery and integration platforms so far, and is becoming more popular as company size increases.

Node.js Survey - What do you use to push code or containers? Shell scripts win.

Node.js developers rarely update dependencies

Frequently updating dependencies is highly recommended with Node.js applications - since around 15% of npm packages carry a known vulnerability & 76% of Node shops use vulnerable dependencies according to a recent survey.

Node.js Survey - How often do you update dependencies? Less frequently than a month.

Updating dependencies less frequently than every week exposes applications to severe attacks all the time. According to our survey, 45% of Node.js developers update dependencies less frequently than a month, and 27% of them update dependencies month-by-month. Only 28% answered that they update dependencies at least every week.

These numbers correlate neither with company size nor with developer experience.

Node.js developers Google for their packages

According to our survey, the majority of developers use Google to find packages and decide which one of them they should use. Although the popularity of the npmjs.org/npms.io search platforms is 56% amongst our respondents, the data shows that it goes up to almost 70% for the demographic group of experienced (more than four years of Node development)! Preference increases with experience in this case.

Node.js Survey - How do you decide what package to pick? People mostly Google for them.

Junior Node.js developers don’t know what semantic versioning is

Although 71% of our respondents uses semantic versioning when publishing/consuming modules, this number should be higher in our opinion. Everyone should use semantic versioning since npm works with semver! Updating packages without using it can easily break Node.js applications.

Node.js Survey - Do you use semantic versioning? Mostly yes.

If we dig deeper in the dataset we can see that around half of the Node developers with less than a year experience don’t know what semver is or don’t use it - while advanced developers are embracing it on a much higher level.

Node.js teams introduce new tools and technologies very fast

According to our survey 35% of Node developers can introduce new tech/tools/product to their companies in a few days, and 29% in just a few weeks.

Node.js Survey - How much time is needed to introduce new technologies, tools or products to your company? A few weeks.

"35% of Node devs can introduce new tech/tools to their companies in a few days." via @RisingStack #nodejs #survey

Click To Tweet

If we investigate the data more thoroughly, a not-so-surprising pattern emerges which lets us know that the time needed to introduce new tech/tools is gradually increasing with the size of a company.

Debugging is the most severe pain-point for developing with Node.js

We also asked Node developers about what's their biggest pain points regarding development. The top answers were:

  • Debugging / Profiling / Performance Monitoring
  • Callbacks and Callback hell
  • Understanding Async programming
  • Dependency management
  • Lack of conventions/best practices
  • Structuring
  • Bad documentation
  • Finding the right packages

"The biggest pain point of Node.js development is debugging." via @RisingStack #nodejs #survey

Click To Tweet

Conclusion

Developing Node.js is still an interesting and ever-changing experience. We'd like to thank for the the engineers who took their time with answering to our questions, and we hope that the information presented in the article is valuable for the whole Node community.

The full dataset is going to be released and linked in this blogpost in a few days.