Node.js War Stories: Debugging Issues in Production

In this article, you can read stories from Netflix, RisingStack & nearForm about Node.js issues in production - so you can learn from our mistakes and avoid repeating them. You'll also learn what methods we used to debug these Node.js issues.

Special shoutout to Yunong Xiao of Netflix, Matteo Collina of nearForm & Shubhra Kar from Strongloop for helping us with their insights for this post!


At RisingStack, we have accumulated tremendous experience running Node apps in production over the past 4 years - thanks to our Node.js consulting, training and development business.

Like the Node teams at Netflix & nearForm, we picked up the habit of always writing thorough postmortems, so the whole team (and now the whole world) could learn from the mistakes we made.

Netflix & Debugging Node: Know your Dependencies

Let's start with a slowdown story from Yunong Xiao, which happened with our friends at Netflix.

The trouble started when the Netflix team noticed that their application's response time was increasing progressively - the latency of some of their endpoints increased by 10ms every hour.

This was also reflected in the growing CPU usage.

Request latencies for each region over time - photo credit: Netflix

At first, they investigated whether the request handler was responsible for slowing things down.

After testing it in isolation, it turned out that the request handler had a constant response time of around 1ms.

So the request handler was not the problem, and they started to suspect that the cause was deeper in the stack.

The next things Yunong & the Netflix team tried were CPU flame graphs and Linux Perf Events.

Flame graph of the Netflix slowdown - photo credit: Netflix

What you can see in the flame graph above is that

  • it has high stacks (which means a lot of function calls)
  • and the boxes are wide (meaning we are spending quite some time in those functions).

After further inspection, the team found that Express's router.handle and router.handle.next have lots of references.

The Express.js source code reveals a couple of interesting tidbits:

  • Route handlers for all endpoints are stored in one global array.
  • Express.js recursively iterates through and invokes all handlers until it finds the right route handler.

Before revealing the solution to this mystery, we need one more detail:

Netflix's codebase contained periodic code that ran every 6 minutes, grabbed new route configs from an external resource, and updated the application's route handlers to reflect the changes.

This was done by deleting old handlers and adding new ones. Accidentally, it also added the same static handler all over again - even before the API route handlers. As it turned out, this caused the extra 10ms response time hourly.
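
To make the effect concrete, here is a minimal sketch of how a handler array can grow on every refresh. It is a simplified stand-in, not Express's or Netflix's actual code: the names staticHandler, apiHandler, buggyRefresh and the /api/titles path are all made up for illustration.

```javascript
// Hypothetical handlers: the static one never matches an API path and passes on.
function staticHandler(path) {
  return null
}

function apiHandler(path) {
  return path === '/api/titles' ? 'ok' : null
}

// Simulated refresh with the bug described above: the API handler is
// removed and re-added, but the static handler is added again every time.
function buggyRefresh(stack) {
  const kept = stack.filter((handler) => handler !== apiHandler)
  kept.push(staticHandler)
  kept.push(apiHandler)
  return kept
}

// Walk the single array in order and count the handlers that are invoked
// before one of them produces a response.
function handle(stack, path) {
  let visited = 0
  for (const handler of stack) {
    visited++
    const result = handler(path)
    if (result !== null) {
      return { visited, result }
    }
  }
  return { visited, result: null }
}

let stack = [staticHandler, apiHandler]
const before = handle(stack, '/api/titles').visited // 2

// Ten refreshes, which is one hour at one refresh every 6 minutes
for (let i = 0; i < 10; i++) {
  stack = buggyRefresh(stack)
}
const after = handle(stack, '/api/titles').visited // 12

console.log({ before, after })
```

Every request now has to pass through ten extra handlers per hour of uptime before it reaches the API route, which is why the latency kept climbing until the process was restarted.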

Takeaways from Netflix's Issue

  • Always know your dependencies - first, you have to fully understand them before going into production with them.
  • Observability is key - flame graphs helped the Netflix engineering team to get to the bottom of the issue.

Read the full story here: Node.js in Flames.



RisingStack CTO: "Crypto takes time"

You may have already heard the story of how we broke down the monolithic infrastructure of Trace (our Node.js monitoring solution) into microservices from our CTO, Peter Marton.

The issue we'll talk about now is a slowdown which affected Trace in production:

As the very first versions of Trace ran on a PaaS, it used the public cloud to communicate with other services of ours.

To ensure the integrity of our requests, we decided to sign all of them. To do so, we went with Joyent's HTTP signing library. What's really great about it is that the request module supports HTTP signatures out of the box.

This solution was not only expensive, but it also had a bad impact on our response times.

The network delay built up our response times - photo: Trace

As you can see on the graph above, the given endpoint had a response time of 180ms; however, 100ms of that was just the network delay between the two services.

As the first step, we migrated from the PaaS provider to Kubernetes. We expected that our response times would be a lot better, as we could leverage internal networking.

We were right - our latency improved.

However, we expected better results - and a much bigger drop in our CPU usage. The next step was to do CPU profiling, just like the guys at Netflix did:

crypto sign function taking up cpu time

As you can see on the screenshot, the crypto.sign function takes up most of the CPU time, by consuming 10ms on each request. To solve this, you have two options:

  • if you are running in a trusted environment, you can drop request signing,
  • if you are in an untrusted environment, you can scale up your machines to have stronger CPUs.
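
If you want a rough feeling for this cost on your own hardware, you can time the signing step in isolation. The sketch below uses Node's built-in crypto module with a throwaway 2048-bit RSA key. The key type, key size, and the signed string are assumptions, since the article does not say what Trace used, and the measured numbers will differ from machine to machine.

```javascript
const crypto = require('crypto')

// Throwaway key pair for the measurement (assumed RSA 2048, see above)
const { privateKey, publicKey } = crypto.generateKeyPairSync('rsa', {
  modulusLength: 2048
})

// Sign a string synchronously, the way a per-request signature would be made
function signPayload(payload) {
  return crypto.createSign('RSA-SHA256').update(payload).sign(privateKey)
}

// Average wall-clock time of one signature over a number of iterations
function averageSignTimeMs(iterations) {
  const start = process.hrtime.bigint()
  for (let i = 0; i < iterations; i++) {
    signPayload('GET /hypothetical/endpoint ' + i)
  }
  const elapsedNs = process.hrtime.bigint() - start
  return Number(elapsedNs) / 1e6 / iterations
}

const avgMs = averageSignTimeMs(50)
console.log(`average signing time: ${avgMs.toFixed(2)} ms`)

// Sanity check: the signature verifies with the matching public key
const signature = signPayload('hello')
const verified = crypto
  .createVerify('RSA-SHA256')
  .update('hello')
  .verify(publicKey, signature)
console.log('signature verified:', verified)
```

Because the signing call is synchronous, whatever time it takes is spent on the main thread for every single request.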

Takeaways from Peter Marton

  • Latency in-between your services has a huge impact on user experience - whenever you can, leverage internal networking.
  • Crypto can take a LOT of time.

nearForm: Don't block the Node.js Event Loop

React is more popular than ever. Developers use it for both the frontend and the backend, or they even take a step further and use it to build isomorphic JavaScript applications.

However, rendering React pages can put some heavy load on the CPU, as rendering complex React components is CPU bound.

When your Node.js process is rendering, it blocks the event loop because of its synchronous nature.

As a result, the server can become entirely unresponsive - requests accumulate, all of which put load on the CPU.

What can be even worse is that requests which no longer have a client will still be served - still putting load on the Node.js application, as Matteo Collina of nearForm explains.

It is not just React, but string operations in general. If you are building JSON REST APIs, you should always pay attention to JSON.parse and JSON.stringify.

As Shubhra Kar from Strongloop (now Joyent) explained, parsing and stringifying huge payloads can take a lot of time as well (and block the event loop in the meantime).

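function requestHandler(req, res) {
  const body = req.rawBody
  let parsedBody
  try {
    // JSON.parse is synchronous: it blocks the event loop until it finishes
    parsedBody = JSON.parse(body)
  } catch (e) {
    // Reply with a client error and stop, so res.end is not called twice
    res.statusCode = 400
    return res.end('Error parsing the body')
  }
  res.end('Record successfully received')
}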

Simple request handler

The example above shows a simple request handler, which just parses the body. For small payloads, it works like a charm - however, if the JSON's size can be measured in megabytes, the execution time can be seconds instead of milliseconds. The same applies to JSON.stringify.
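
You can see the difference yourself with a quick measurement. This is only a sketch: the record shape and the record counts are arbitrary choices, and the timings depend entirely on your machine and Node.js version.

```javascript
// Build a JSON string containing the given number of made-up records
function buildPayload(recordCount) {
  const records = []
  for (let i = 0; i < recordCount; i++) {
    records.push({ id: i, name: 'record-' + i, tags: ['a', 'b', 'c'] })
  }
  return JSON.stringify(records)
}

// Time a single synchronous JSON.parse call
function timeParseMs(json) {
  const start = process.hrtime.bigint()
  const parsed = JSON.parse(json)
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6
  return { elapsedMs, length: parsed.length, bytes: Buffer.byteLength(json) }
}

const small = timeParseMs(buildPayload(10))
const large = timeParseMs(buildPayload(200000))

console.log(`small: ${small.bytes} bytes parsed in ${small.elapsedMs.toFixed(3)} ms`)
console.log(`large: ${large.bytes} bytes parsed in ${large.elapsedMs.toFixed(3)} ms`)
```

While the large payload is being parsed, no other request can be handled by the same process.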

To mitigate these issues, first, you have to know about them. For that, you can use Matteo's loopbench module, or Trace's event loop metrics feature.

With loopbench, you can return a status code of 503 to the load balancer if the request cannot be fulfilled. To enable this feature, you have to use the instance.overLimit property. This way ELB or NGINX can retry it on a different backend, and the request may be served.
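
The idea behind this kind of load shedding can be sketched without any dependencies. The monitor below is hand-rolled for illustration and is not loopbench's source code: createLoopMonitor, guardedHandler, and the 5ms / 42ms values are assumptions made for this example. Only the overLimit flag mirrors the property mentioned above.

```javascript
// Samples how late a timer fires. A late timer means the event loop was busy.
function createLoopMonitor(sampleIntervalMs, limitMs) {
  const state = { delay: 0, overLimit: false, last: Date.now() }

  function sample(now) {
    // Delay is the time beyond the expected interval since the last sample
    state.delay = Math.max(0, now - state.last - sampleIntervalMs)
    state.overLimit = state.delay > limitMs
    state.last = now
  }

  const timer = setInterval(() => sample(Date.now()), sampleIntervalMs)
  timer.unref() // do not keep the process alive just for monitoring

  return { state, sample, stop: () => clearInterval(timer) }
}

// Reject early with a 503 while the event loop is overloaded
function guardedHandler(monitor, req, res) {
  if (monitor.state.overLimit) {
    res.statusCode = 503
    return res.end('Service Unavailable')
  }
  res.end('Record successfully received')
}

// Demo with a fake response object and a simulated sample that is 100ms late
const monitor = createLoopMonitor(5, 42)
monitor.sample(monitor.state.last + 5 + 100)

const fakeRes = {
  statusCode: 200,
  body: null,
  end(text) {
    this.body = text
  }
}
guardedHandler(monitor, {}, fakeRes)
console.log(fakeRes.statusCode, fakeRes.body)
monitor.stop()
```

In a real server you would pass the same function to http.createServer, for example as (req, res) => guardedHandler(monitor, req, res), and let the monitor's own timer do the sampling.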

Once you know about the issue and understand it, you can start working on fixing it - you can do it either by leveraging Node.js streams or by tweaking the architecture you are using.

Takeaways from nearForm

  • Always pay attention to CPU bound operations - the more you have, the more pressure you put on your event loop.
  • String operations are CPU-heavy operations.

Debugging Node.js Issues in Production

I hope these examples from Netflix, RisingStack & nearForm will help you to debug your Node.js apps in production.

If you'd like to learn more, I recommend checking out these recent posts which will help you to deepen your Node knowledge:

If you have any questions, please let us know in the comments!

Node.js Examples - How Enterprises use Node in 2016

Node.js has had an extraordinary year so far: npm already hit 4 million users and processes a billion downloads a week, while major enterprises adopt it as their main production platform day by day.

The latest example of Node.js ruling the world is the fact that NASA uses it “to build the present and future systems supporting spaceship operations and development.” - according to the recent tweets of Collin Estes - Director of Software Technologies of the Space Agency.

Node.js Examples: Nasa is using it to design Spacewalks

Fortunately, the Node Foundation’s “Enterprise conversations” project lets us peek into the life of the greatest enterprises and their use cases as well.

This article summarizes how GoDaddy, Netflix, and Capital One use Node.js in 2016.

GoDaddy ditched .NET to work with Node.js

Charlie Robbins is the Director of Engineering for the UX platform at GoDaddy. He is one of the longest-term users of the technology, since he started to use it shortly after watching Ryan Dahl’s legendary Node.js presentation at JSConf in December 2009 and was one of the founders of Nodejitsu.

His team at GoDaddy uses Node.js for both front-end and back-end projects, and they recently rolled out their global site rebrand in one hour thanks to the help of Node.js.

Before that, the company primarily used .NET and was transitioning to Java. They figured out that despite the fact that Microsoft does a great job supporting .NET developers and they’ve made .NET open source, it doesn’t have a vibrant community of module publishers and they had to rely too much on what Microsoft released.

“The typical .NET scenario is that you wait for Microsoft to come out with something that you can use to do a certain task. You become really good at using that, but the search process for what’s good and what’s bad, it’s just not a skill that you develop.”

Because of this, the company had to develop a new skill: to go out and find all the other parts of the stack. As opposed to other enterprise technologies like .NET where most of the functionality was included in the standard library, they had to become experts in evaluating modules.

Node.js Examples: GoDaddy searching for new modules is a skill they need to learn

GoDaddy started to use Node for the front-end and then ended up using it more in the back-end as well. The same .NET engineers who were writing the back-end code were writing the JavaScript front-end code. The majority of engineers are full stack now.

The most exciting things for Charlie about Node.js are being handled mainly by the working groups.

“I’m very excited about the tracing working group and the things that are going to come out of that to build an open source instrumentation system of eco-tooling.”

Other exciting things for him are the diagnostics working group (previously: inclusivity) and the Node.js Live events - particularly for Node.js communities in countries where English is not used. Places like China, for example, where most of the engineers still primarily speak Chinese, and there’s not a lot of crossover.

“I’m excited to see those barriers start to come down and as those events get to run.”

As for GoDaddy and Node: they have just released the project that they’ve been working on pretty extensively with Cassandra. It was an eight-month long process, and you can read the full story of “Taming Cassandra in Node.js” at the GoDaddy engineering blog.





Netflix scales horizontally thanks to its Node container layer

The next participants in the Node Foundation’s enterprise conversation series are Kim Trott, the director of UI Platform Engineering, and Yunong Xiao, Platform Architect from Netflix.

Kim’s been at Netflix for nine years - she arrived just before the company launched its first streaming service. It was the era when you could only watch Netflix with Windows Media Player, and the full catalog consisted of only 50 titles.

“I've seen the evolution of Netflix going from DVD and streaming to now being our own content producer.“

Yunong Xiao, who’s well known for being the maintainer of restify, arrived two years ago and just missed the party the company held for reaching 15 million users - but since they are fast approaching their 100 millionth subscriber, he’ll have a chance to celebrate soon. Yunong previously worked at Joyent on Node.js and distributed systems, and at AWS as well. His role at Netflix is to have Node up and running at scale and to make sure it’s performing well.

Kim manages the UI platform team within the UI engineering part of the organization. Their role is to help all the teams building the Netflix application by making them more productive and efficient. This job can cover a wide range of tasks: it could be building libraries shared across all of the teams that make it easier to do data access or client side logging, or building things that make it easier to run Node applications in production for UI focused teams.

Kim provided us a brief update on how the containerization of the edge services has been going at Netflix since she talked about it at Node Interactive last December.

Node.js Examples: Netflix is using Node for the containerization of their edge services

When any device or client tries to access Netflix, it has to use something called edge services, which is a set of endpoint scripts - a monolithic JVM-based system that lets them mutate and access data. It’s been working really well, but since it’s a monolith, Netflix ran into some vertical scaling concerns. It was a great opportunity to leverage Node and Docker to scale all of these data access scripts out horizontally.

“Since I’ve spoken at Node Interactive we've made a lot of progress on the project, and we're actually about to run a full system test where we put real production traffic through the new Node container layer to prove out the whole stack and flush out any problems around scaling or memory, so that's really exciting.”

How did Node.js affect developer productivity at Netflix?

The developer productivity comes from breaking down the monolith into smaller, much more manageable pieces - and from being able to run them on local machines and do the containerization.

We can effectively guarantee that what you're running locally will very closely mirror what you run in production and that's really beneficial - said Kim.

“Because of the way Node works we can attach debuggers, set breakpoints, and step through the code. If you wanted to debug these groovy scripts in the past, you would make some code changes, upload it to the edge layer, run it, see if it breaks, make some more changes, upload it again, and so on...”

It saves us tens of minutes to test, but the real testament to this project is: all of our engineers who are working on the clients are asking: when do we get to use this instead of the current stack? - said Yunong.

The future of Node at Netflix

Over the next few months, the engineering team will move past building out the previously mentioned stack and start working on tooling and performance related problems. Finding better tools for post-mortem debugging is something that they're absolutely passionate about.

They are also planning to get involved in the working groups and contribute back to the community, so that they can help build better tools that everyone can leverage.

“One of the reasons why Node is so popular is the fact that it's got a really solid suite of tools just to debug, so that's something that we’re actually working contributing on.”

Node.js brings joy for developers at Capital One

Azat Mardan is a technology fellow at Capital One and an expert on Node.js and JavaScript. He’s also the author of Webapplog.com, and you’ve probably read one of his most popular books: Practical Node.js.

“Most people think of Capital One as a bank and not as a technology company, which it is. At Capital One, and especially this Technology Fellowship program, we bring innovation, so we have really interesting people on my team: Jim Jagielski and Mitch Pirtle. One founded Apache Software Foundation and the other, Joomla!, so I’m just honored to be on this team.”

Azat’s goal is to bring Node.js to Capital One and to teach Node.js courses internally, as well as to write for the blog and provide architectural advice. The company has over 5,000 engineers and several teams who started using Node.js at different times.

Capital One uses Node.js for:

  • Hygieia, which is an open-source dashboard for DevOps. It started in 2013 and was announced last year at OSCON, and it has about 900 GitHub stars right now. They’re using Node.js for the frontend and for the build too.
  • Building the orchestration layer. They have three versions of the Enterprise API, and it’s mostly built with Java, but it’s not convenient to use on the front end.

Node.js Examples: Capital One use cases

Capital One uses Angular mostly, but they have a little bit of React as well. In this case, the front-facing single page applications need something to massage and format the data - basically to make multiple calls to the different APIs. Node.js works really well for them for building this orchestration layer.

“It’s a brilliant technology for that piece of the stack because it allows us to use the same knowledge from the front end, to reuse some of the modules, to use the same developers. I think that’s the most widespread use case at Capital One, in terms of Node.js.”

The effect of Node.js on the company

Node.js allows much more transferable skill-sets between the front end and some of the back-end team, and it allows them to be a little bit more integrated.

“When I’m working with the team, and whether it’s Java or C# developers, they’re dabbling a little bit on front ends; so they’re not experts but once they switch to the stack where Node.js is used in the back end, they’re more productive because they don’t have that switch of context. I see this pure joy that it brings to them during development because JavaScript is just a fun language that they can use.”

From the business perspective: the teams can reuse some of the modules and templates for example, and some of the libraries as well. It’s great from both the developers and from the managerial perspective.

Also, Node has a noticeable effect on the positions and responsibilities of the engineers as well.

Big companies like Capital One will definitely need pure back-end engineers for some of the projects in the future, but more and more teams employ ninjas who can do front-end, back-end, and a little bit of DevOps too - so the teams are becoming smaller.

Instead of two teams, one pure back end and one pure front end, consisting of seven people overall, a ninja team of five can do both.

“That removes a lot of overhead in communication because now you have fewer people, so you need fewer meetings, and you actually can focus more on the work, instead of just wasting your time.”

The future of Node.js

Node.js has the potential to be the go-to framework for both startups and big companies, which is a really unique phenomenon - according to Azat.

“I’m excited about this year, actually. I think this year is when Node.js has gone mainstream.”

The Node.js Interactive in December has shown that major companies are supporting Node.js now. IBM said that Node.js and Java are the two languages for the APIs they would be focusing on, so the mainstream adoption of the language is coming, unlike what we’ve seen with Ruby - he said.

“I’m excited about Node.js in general, I see more demand for courses, for books, for different topics, and I think having this huge number of front-end JavaScript developers is just a tremendous advantage in Node.js.”

Start learning Node!

As you can see, adopting Node.js in an enterprise environment has tremendous benefits. It makes the developers happier and increases the productivity of the engineering teams.

If you’d like to start learning it, I suggest checking out our Node Hero tutorial series.

Share your thoughts in the comments.