
Announcing Free Node.js Monitoring & Debugging with Trace

Today, we’re excited to announce that Trace, our Node.js monitoring & debugging tool, is now free for open-source projects.

What is Trace?

We launched Trace a year ago with the intention of helping developers looking for a Node.js-specific APM that is easy to use and helps with the most difficult aspects of building Node projects, such as:

  • finding memory leaks in a production environment
  • profiling CPU usage to find bottlenecks
  • tracing distributed call-chains
  • avoiding security leaks & bad npm packages

...and so on.

Node.js Monitoring with Trace by RisingStack - Performance Metrics chart

Why are we giving it away for free?

We use a ton of open-source technology every day, and we are also the maintainers of some.

We know from experience that developing an open-source project is hard work, which requires a lot of knowledge and persistence.

Trace will save a lot of time for those who use Node for their open-source projects.

How to get started with Trace?

  1. Visit trace.risingstack.com and sign up - it's free.
  2. Connect your app with Trace.
  3. Head over to this form and tell us a little bit about your project.

That's it - your open-source project will now be monitored for free.

If you need help with Node.js Monitoring & Debugging

Just drop us a tweet at @RisingStack if you have any additional questions about the tool or the process.

If you'd like to read a little bit more about the topic, I recommend reading our previous article, The Definitive Guide for Monitoring Node.js Applications.

One more thing

At the same time of making Trace available for open-source projects, we're announcing our new line of business at RisingStack:

Commercial Node.js support, aimed at enterprises with Node.js applications running in a production environment.

RisingStack now helps companies bootstrap and operate Node.js apps - no matter which stage of their life cycle they're in.


Disclaimer: We retain the exclusive right to accept or deny your application to use Trace by RisingStack for free.

The Definitive Guide for Monitoring Node.js Applications

In the previous chapters of Node.js at Scale we learned how you can get Node.js testing and TDD right, and how you can use Nightwatch.js for end-to-end testing.

In this article, we will learn about running and monitoring Node.js applications in production. Let's discuss these topics:

  • What is monitoring?
  • What should be monitored?
  • Open-source monitoring solutions
  • SaaS and On-premise monitoring offerings

What is Node.js Monitoring?

Monitoring means observing the quality of software over time. The products and tools available in this industry usually go by the term Application Performance Monitoring, or APM for short.

If you have a Node.js application in a staging or production environment, you can (and should) do monitoring on different levels:

You can monitor

  • regions,
  • zones,
  • individual servers and,
  • of course, the Node.js software that runs on them.

In this guide, we'll deal with the software components only - if you run in a cloud environment, the rest is usually taken care of for you.

What should be monitored?

Each application you write in Node.js produces a lot of data about its behavior.

There are different layers from which an APM tool should collect data. The more of them you cover, the more insight you'll get into your system's behavior:

  • Service level
  • Host level
  • Instance (or process) level

The list below collects the most crucial problems you'll run into while maintaining a Node.js application in production. We'll also discuss how monitoring helps solve them and what kind of data you'll need to do so.

Problem 1: Service Downtimes

If your application is unavailable, your customers can't spend money on your sites. If your APIs are down, your business partners and the services depending on them will fail as well because of you.

We all know how cringeworthy it is to apologize for service downtimes.

Your topmost priority should be preventing failures and keeping your application's availability as close to 100% as possible.

Running a production app comes with great responsibility.

Node.js APMs can help you detect and prevent downtimes, since they usually collect service-level metrics.

This data can show whether your application handles requests properly, although it won't always tell you whether your public sites or APIs are available.

To get proper coverage of downtimes, we recommend setting up a pinger as well - one that emulates user behavior and provides reliable data on availability. If you want to cover everything, don't forget to include different regions such as the US, Europe, and Asia.
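A bare-bones pinger can be surprisingly small. The sketch below only checks status codes - the URL and interval are placeholders, and a real setup would also measure latency and probe from multiple regions:

const https = require('https')

// Poll a public endpoint once a minute and log its availability
setInterval(() => {
  const started = Date.now()
  https.get('https://example.com/healthz', (res) => {
    const ok = res.statusCode >= 200 && res.statusCode < 300
    console.log(res.statusCode, ok ? 'up' : 'DOWN', `${Date.now() - started} ms`)
    res.resume() // discard the body so the socket is released
  }).on('error', (err) => console.error('DOWN:', err.message))
}, 60 * 1000)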

Problem 2: Slow Services, Terrible Response Times

Slow response times have a huge impact on conversion rates, as well as on product usage. The faster your product is, the more customers and the higher user satisfaction you'll have.

Usually, all Node.js APMs can show if your services are slowing down, but interpreting that data requires further analysis.

I recommend doing two things to find the real reasons for slowing services.

  • Collect data on a process level too. Check out each instance of a service to figure out what happens under the hood.
  • Request CPU profiles when your services slow down and analyze them to find the faulty functions.

Eliminating performance bottlenecks enables you to scale your software more efficiently and also to optimize your budget.
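If your monitoring tool doesn't capture CPU profiles for you, newer Node.js versions (8+) let you record one from inside the process with the built-in inspector module. A minimal sketch - the duration and output file name are arbitrary:

const inspector = require('inspector')
const fs = require('fs')

// Record a CPU profile for `durationMs` milliseconds and write it to disk.
// The resulting .cpuprofile file can be loaded into Chrome DevTools.
function captureCpuProfile (durationMs) {
  const session = new inspector.Session()
  session.connect()
  session.post('Profiler.enable', () => {
    session.post('Profiler.start', () => {
      setTimeout(() => {
        session.post('Profiler.stop', (err, { profile }) => {
          if (!err) fs.writeFileSync('./slowdown.cpuprofile', JSON.stringify(profile))
          session.disconnect()
        })
      }, durationMs)
    })
  })
}

captureCpuProfile(10 * 1000)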

Problem 3: Solving Memory Leaks is Hard

Our Node.js consulting & development expertise has allowed us to build huge enterprise systems and help developers make them better.

What we constantly see is that memory leaks in Node.js applications are quite frequent, and that finding out what causes them is among the greatest struggles Node developers face.

This impression is backed by data as well: our Node.js Developer Survey showed that memory leaks cause a lot of headaches for even the best engineers.

To find memory leaks, you have to know exactly when they happen.

Some APMs collect memory usage data, which can be used to recognize a leak. What you should look for is steady growth in memory usage that ends in a service crash & restart (as Node runs out of memory at around 1.4 gigabytes by default).

Node.js memory leak shown in Trace, the node.js monitoring tool

If your APM collects data on the garbage collector as well, you can look for the same pattern. As extra objects pile up in a Node app's memory, the time spent on garbage collection increases with it. This is a great indicator of a memory leak.
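If you'd like to watch this metric without an APM, newer Node.js versions (8.5+) can report garbage collection pauses through the built-in perf_hooks module - a small sketch of the idea:

const { PerformanceObserver } = require('perf_hooks')

// Log every GC pause; a steadily climbing total alongside growing
// memory usage is a strong memory-leak signal
let totalGcTime = 0
const obs = new PerformanceObserver((list) => {
  list.getEntries().forEach((entry) => {
    totalGcTime += entry.duration
    console.log(`GC pause: ${entry.duration.toFixed(1)} ms (total: ${totalGcTime.toFixed(0)} ms)`)
  })
})
obs.observe({ entryTypes: ['gc'] })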

After figuring out that you have a leak, request a memory heapdump and look for the extra objects!

This sounds easy in theory but can be challenging in practice.

What you can do is request two heapdumps from your production system with a monitoring tool, then analyze them with Chrome's DevTools. If you look for the extra objects in comparison mode, you'll end up seeing what piles up in your app's memory.

If you'd like a more detailed rundown of these steps, I've written an article about finding a Node.js memory leak in Ghost, where I go into more detail.

Problem 4: Depending on Code Written by Anonymous Developers

Most Node.js applications rely heavily on npm. We can end up with a lot of dependencies written by developers of unknown expertise and intentions.

Roughly 76% of Node shops use vulnerable packages, while open source projects regularly grow stale, neglecting to fix security flaws.

There are a couple of possible steps to lower the security risks of using npm packages.

  1. Audit your modules with the Node Security Platform CLI
  2. Look for unused dependencies with the depcheck tool
  3. Use the npm stats API, or browse historic stats on npm-stat.com, to find out whether others are using a package
  4. Use the npm view <pkg> maintainers command to avoid packages maintained by only a few
  5. Use the npm outdated command or Greenkeeper to learn whether you're using the latest version of a package.
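Assuming the tools above are installed, a quick audit session might look like this (the package name is only an example):

nsp check                      # audit dependencies against known vulnerabilities
depcheck                       # list dependencies nothing in your code requires
npm view express maintainers   # check how many people maintain a package
npm outdated                   # see which dependencies lag behind their latest versions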

Going through these steps can consume a lot of your time, so picking a Node.js Monitoring Tool which can warn you about insecure dependencies is highly recommended.

Problem 5: Email Alerts Often Go Unnoticed

Let's be honest: we are developers who like spending time writing code - not going through our email account every 10 minutes.

In my experience, email alerts usually go unread, and it's very easy to miss a major outage or problem if we depend only on them.

Email is a subpar method to learn about issues in production.

I guess that you also don't want to watch dashboards for potential issues 24/7. This is why it is important to look for an APM with great alerting capabilities.

What I recommend is to use pager systems like Opsgenie or PagerDuty to learn about critical issues. Pair up the monitoring solution of your choice with one of these systems if you'd like to know about your alerts instantly.

A few alerting best-practices we follow at RisingStack:

  • Always keep alerting simple and alert on symptoms
  • Aim to have as few alerts as possible - associated with end-user pain
  • Alert on high response time and error rates as high up in the stack as possible

Problem 6: Finding Crucial Errors in the Code

If a feature is broken on your site, it can prevent customers from achieving their goals. Sometimes it can be a sign of bad code quality. Make sure you have proper test coverage for your codebase and a good QA process (preferably automated).

If you use an APM that collects errors from your app, you'll be able to find the ones that occur most often.

The more data your APM can access, the better the chances of finding and fixing critical issues. We recommend using a monitoring tool that collects and visualizes stack traces as well - so you'll be able to find the root causes of errors in a distributed system.


In the next part of the article, I will show you one open-source and one SaaS / on-premises Node.js monitoring solution that will help you operate your applications.

Prometheus - an Open-Source, General Purpose Monitoring Platform

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud.

Prometheus was started in 2012, and since then, many companies and organizations have adopted the tool. It is a standalone open-source project, maintained independently of any company.

In 2016, Prometheus joined the Cloud Native Computing Foundation, right after Kubernetes.

The most important features of Prometheus are:

  • a multi-dimensional data model (time series identified by metric name and key/value pairs),
  • a flexible query language to leverage this dimensionality,
  • time series collection happens via a pull model over HTTP by default,
  • pushing time series is supported via an intermediary gateway.

Node.js monitoring with prometheus

As you can see from these features, Prometheus is a general-purpose monitoring solution, so you can use it with any language or technology you prefer.

Check out the official Prometheus getting started pages if you'd like to give it a try.

Before you start monitoring your Node.js services, you need to add instrumentation to them via one of the Prometheus client libraries.

For this, there is a Node.js client module, which you can find here. It supports histograms, summaries, gauges and counters.

Essentially, all you have to do is require the Prometheus client, then expose its output at an endpoint:

const Prometheus = require('prom-client')
const server = require('express')()

// Expose the metrics collected by prom-client at a /metrics endpoint.
// Note: in recent prom-client versions register.metrics() returns a
// Promise, so you would await it instead.
server.get('/metrics', (req, res) => {
  res.set('Content-Type', Prometheus.register.contentType)
  res.end(Prometheus.register.metrics())
})

server.listen(process.env.PORT || 3000)

This endpoint produces output that Prometheus can consume - something like this:

# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1490433285  
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 33046528  
# HELP nodejs_eventloop_lag_seconds Lag of event loop in seconds.
# TYPE nodejs_eventloop_lag_seconds gauge
nodejs_eventloop_lag_seconds 0.000089751  
# HELP nodejs_active_handles_total Number of active handles.
# TYPE nodejs_active_handles_total gauge
nodejs_active_handles_total 4  
# HELP nodejs_active_requests_total Number of active requests.
# TYPE nodejs_active_requests_total gauge
nodejs_active_requests_total 0  
# HELP nodejs_version_info Node.js version info.
# TYPE nodejs_version_info gauge
nodejs_version_info{version="v4.4.2",major="4",minor="4",patch="2"} 1  

Of course, these are just the default metrics collected by the module we've used - you can extend them with your own. In the example below, we count the number of requests served:

const Prometheus = require('prom-client')
const server = require('express')()

const PrometheusMetrics = {
  // Note: newer prom-client versions take an options object instead:
  // new Prometheus.Counter({ name: 'throughput', help: '...' })
  requestCounter: new Prometheus.Counter('throughput', 'The number of requests served')
}

// Count every incoming request
server.use((req, res, next) => {
  PrometheusMetrics.requestCounter.inc()
  next()
})

server.get('/metrics', (req, res) => {
  res.set('Content-Type', Prometheus.register.contentType)
  res.end(Prometheus.register.metrics())
})

server.listen(3000)

Once you run it, the /metrics endpoint will include the throughput metrics as well:

# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1490433805  
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 25120768  
# HELP nodejs_eventloop_lag_seconds Lag of event loop in seconds.
# TYPE nodejs_eventloop_lag_seconds gauge
nodejs_eventloop_lag_seconds 0.144927586  
# HELP nodejs_active_handles_total Number of active handles.
# TYPE nodejs_active_handles_total gauge
nodejs_active_handles_total 0  
# HELP nodejs_active_requests_total Number of active requests.
# TYPE nodejs_active_requests_total gauge
nodejs_active_requests_total 0  
# HELP nodejs_version_info Node.js version info.
# TYPE nodejs_version_info gauge
nodejs_version_info{version="v4.4.2",major="4",minor="4",patch="2"} 1  
# HELP throughput The number of requests served
# TYPE throughput counter
throughput 5  

Once you have exposed your metrics, you can start querying and visualizing them - for that, please refer to the official Prometheus query documentation and the visualization documentation.
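To give one small, illustrative taste of PromQL: a query like rate(throughput[5m]) would turn the throughput counter defined above into a requests-per-second rate, averaged over five minutes.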

As you can imagine, instrumenting your codebase can take quite some time, since you have to create dashboards and alerts to make sense of the data. While these solutions can sometimes provide greater flexibility for your use case than hosted products, implementing them can take months - and then you have to deal with operating them as well.

If you have the time to dig deep into the topic, you'll be fine with it.

Meet Trace - our SaaS, and On-premises Node.js Monitoring Tool

As we just discussed, running your own solution requires domain knowledge, as well as expertise in how to do proper monitoring. You have to figure out what aggregation to use for which kinds of metrics, and so on.

This is why it can make a lot of sense to go with a hosted monitoring solution - whether it is a SaaS product or an on-premises offering.

At RisingStack, we are developing our own Node.js monitoring solution, called Trace. We built into Trace all the experience we've gained through years of providing professional Node services.

What's nice about Trace is that you get all the metrics you need by adding only a single line of code to your application - so it really takes just a few seconds to get started.

require('@risingstack/trace')

After this, the Trace collector automatically gathers your application's performance data and visualizes it for you in an easy to understand way.

Here are just a few things Trace is capable of doing with your production Node app:

  1. Send alerts about downtimes, slow services & bad status codes.
  2. Ping your websites and APIs with an external service and show Apdex metrics.
  3. Collect data on service, host and instance levels as well.
  4. Automatically create a (10-second-long) CPU profile in a production environment in case of a slowdown.
  5. Collect data on memory consumption and garbage collection.
  6. Create memory heapdumps automatically in case of a memory leak in production.
  7. Show errors and stack traces from your application.
  8. Visualize whole transaction call-chains in a distributed system.
  9. Show how your services communicate with each other on a live map.
  10. Automatically detect npm packages with security vulnerabilities.
  11. Mark new deployments and measure their effectiveness.
  12. Integrate with Slack, PagerDuty, and Opsgenie - so you'll never miss an alert.

Although Trace is currently a SaaS solution, we'll make an on-premises version available as well soon.

It will do exactly the same as the cloud version, but run in an Amazon VPC or in your own datacenter. If you're interested, let's talk!

Summary

I hope that in this chapter of Node.js at Scale I was able to give you useful advice about monitoring your Node.js applications. In the next article, you will learn how to debug Node.js applications in an easy way.

Node Hero - Monitoring Node.js Applications

This article is the 13th part of the tutorial series called Node Hero - in these chapters, you can learn how to get started with Node.js and deliver software products using it.

In the last article of the series, I’m going to show you how to do Node.js monitoring and how to find advanced issues in production environments.

The Importance of Node.js Monitoring

Getting insights into production systems is critical when you are building Node.js applications! You have an obligation to constantly detect bottlenecks and figure out what slows your product down.

An even greater issue is handling and preempting downtimes. You must be notified as soon as they happen, preferably before your customers start to complain. Based on these needs, proper monitoring should give you at least the following features and insights into your application's behavior:

  • Profiling on a code level: You have to understand how much time it takes to run each function in a production environment, not just locally.

  • Monitoring network connections: If you are building a microservices architecture, you have to monitor network connections and lower delays in the communication between your services.

  • Performance dashboard: Knowing and constantly seeing the most important performance metrics of your application is essential to have a fast, stable production system.

  • Real-time alerting: For obvious reasons, if anything goes down, you need to get notified immediately. This means that you need tools that can integrate with PagerDuty or Opsgenie - so your DevOps team won’t miss anything important.

"Getting insights into production systems is critical when you are building #nodejs applications" via @RisingStack

Click To Tweet

Server Monitoring versus Application Monitoring

One thing developers often confuse is monitoring servers versus monitoring the applications themselves. Since we tend to do a lot of virtualization, these concepts should be treated separately - a single server can host dozens of applications.

Let’s go through the major differences!

Server Monitoring

Server monitoring is responsible for the host machine. It should be able to help you answer the following questions:

  • Does my server have enough disk space?
  • Does it have enough CPU time?
  • Does my server have enough memory?
  • Can it reach the network?

For server monitoring, you can use tools like zabbix.

Application Monitoring

Application monitoring, on the other hand, is responsible for the health of a given application instance. It should let you know the answers to the following questions:

  • Can an instance reach the database?
  • How many requests does it handle?
  • What are the response times for the individual instances?
  • Can my application serve requests? Is it up?
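To make the last two questions easy to answer, it's worth exposing a small health-check endpoint from every instance. Below is a minimal sketch - Express is assumed, and checkDatabase and the /healthz path are placeholders you'd replace with your own connectivity test and route:

const express = require('express')
const app = express()

// Placeholder for a real connectivity test (e.g. running `SELECT 1`)
function checkDatabase (callback) {
  setImmediate(callback)
}

// Load balancers, pingers and orchestrators can poll this endpoint
app.get('/healthz', (req, res) => {
  checkDatabase((err) => {
    if (err) return res.status(500).json({ status: 'error', database: 'unreachable' })
    res.json({ status: 'ok' })
  })
})

app.listen(process.env.PORT || 3000)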

For application monitoring, I recommend using our tool called Trace. What else? :)

We developed it to be an easy-to-use and efficient tool that you can use to monitor and debug applications from the moment you start building them, up to the point when you have a huge production app with hundreds of services.

How to Use Trace for Node.js Monitoring

To get started with Trace, head over to https://trace.risingstack.com and create your free account!

Once you've registered, follow these steps to add Trace to your Node.js applications - it only takes a minute:

Start Node.js monitoring with these steps

Easy, right? If everything went well, you should see that the service you connected has just started sending data to Trace:

Reporting service in Trace for Node.js Monitoring

#1: Measure your performance

As the first step of monitoring your Node.js application, I recommend heading over to the metrics page and checking out the performance of your services.

Basic Node.js performance metrics

  • You can use the response time panel to check out median and 95th percentile response data. It helps you to figure out when and why your application slows down and how it affects your users.
  • The throughput graph shows requests per minute (rpm) for each status code category (200-299, 300-399, 400-499, 500+). This way you can easily separate healthy and problematic HTTP requests within your application.
  • The memory usage graph shows how much memory your process uses. It’s quite useful for recognizing memory leaks and preempting crashes.

Advanced Node.js Monitoring Metrics

If you’d like to see special Node.js metrics, check out the garbage collection and event loop graphs. Those can help you to hunt down memory leaks. Read our metrics documentation.

#2: Set up alerts

As I mentioned earlier, you need a proper alerting system in action for your production application.

Go to the alerting page of Trace and click Create a new alert.

  • The most important thing to do here is to set up downtime and memory alerts. Trace will notify you via email / Slack / PagerDuty / Opsgenie, and you can use webhooks as well.

  • I recommend setting up the alert we call Error rate by status code to know about HTTP requests with 4XX or 5XX status codes. These are errors you should definitely care about.

  • It can also be useful to create an alert for Response time - and get notified when your app starts to slow down.

#3: Investigate memory heapdumps

Go to the Profiler page and request a new memory heapdump, wait 5 minutes, and request another. Download them and open them on Chrome DevTools' Profiles page. Select the second one (the most recent), and click Comparison.

chrome heap snapshot for finding a node.js memory leak

With this view, you can easily find memory leaks in your application. In a previous article I wrote about this process in detail; you can read it here: Hunting a Ghost - Finding a Memory Leak in Node.js.

#4: CPU profiling

Profiling on the code level is essential for understanding how much time your functions take to run in the actual production environment. Luckily, Trace has this area covered too.

All you have to do is head over to the CPU Profiles tab on the Profiling page. Here you can request and download a profile that you can load into Chrome DevTools as well.

CPU profiling in Trace

Once you've loaded it, you'll be able to see a 10-second timeframe of your application, with all of your functions, their timings, and URLs as well.

With this data, you'll be able to figure out what slows down your application and deal with it!


The End

Update: as a sequel to Node Hero, we have started a new series called Node.js at Scale. Check it out if you are interested in more in-depth articles!

This is it.

During the 13 episodes of the Node Hero series, you learned the basics of building great applications with Node.js.

I hope you enjoyed it and improved a lot! Please share this series with your friends if you think they need it as well - and show them Trace too. It’s a great tool for Node.js development!

If you have any questions regarding Node.js monitoring, let me know in the comments section!


Introducing Distributed Tracing for Microservices Monitoring

At RisingStack, an enterprise Node.js development and consulting company, we have been working tirelessly for the past two years to build durable and efficient microservices architectures for our clients, as passionate advocates of this technology.

During this period, we had to face the cold fact that there are no proper tools to support microservices architectures and the developers working with them. Monitoring, debugging, and maintaining distributed systems is still extremely challenging.

We want to change this because doing microservices shouldn’t be so hard.

I am proud to announce that Trace - our microservices monitoring tool - has entered the open beta stage and is now free to use with Node.js services.

Trace provides:

  • A Distributed Trace view for all of your transactions with error details
  • Service Map to see the communication between your microservices
  • Metrics on CPU, memory, RPM, response time, event loop and garbage collection
  • Alerting with Slack, Pagerduty, and Webhook integration

Trace makes application-level transparency available on a large microservices system with very low overhead. It also helps you localize production issues faster, so you can debug and monitor applications with ease.

You can use Trace in any IaaS or PaaS environment, including Amazon AWS, Heroku or DigitalOcean. Our solution currently supports Node.js only, but it will be available for other languages later as well. The open beta program lasts until 1 July.

Get started with Trace for free

Read along to get details on the individual features and on how Trace works.

Distributed Tracing

The most important feature of Trace is the transaction view. By using this tool, you can visualize every transaction going through your infrastructure on a timeline - in a very detailed way.

Distributed Tracing View Trace by Risingstack

By attaching a correlation ID to certain requests, Trace groups the services taking part in a transaction and visualizes the exact data flow on a simple tree graph. Thanks to this, you can see the distributed call stacks and the dependencies between your microservices, and see where a request takes the most time.

This approach also lets you localize ongoing issues and show them on the graph. Trace provides detailed feedback on what caused an error in a transaction and gives you enough data to start debugging your system instantly.

Distributed Tracing with Detailed Error Message

When a service causes an error in a distributed system, usually all of the services taking part in that transaction will throw an error, and it is hard to figure out which one really caused the trouble in the first place. From now on, you won’t need to dig through log files to find the answer.

With Trace, you can instantly see the path of a certain request, which services were involved, and what caused the error in your system.

The technology Trace uses is primarily based on Google’s Dapper whitepaper. Read the whole study to get the exact details.

Microservices Topology

Trace automatically generates a dynamic service map based on how your services communicate with each other or with databases and external APIs. In this view, we provide feedback on infrastructure health as well, so you will get informed when something begins to slow down or when a service starts to handle an increased amount of requests.

Distributed Tracing with Service Topology Map

The service topology view also allows you to immediately get a sense of how many requests your microservices handle in a given period and how large their response times are.

With this information, you can see what your application looks like and understand the behavior of your microservices architecture.

Metrics and Alerting

Trace provides critical metrics data for each of your monitored services. Other than basics like CPU usage, memory usage, throughput and response time, our tool reports event loop and garbage collection metrics as well to make microservices development and operations easier.

Distributed Tracing with Metrics and Alerting

You can create alerts and get notified when a metric passes warning or error thresholds, so you can act immediately. Trace will alert you via Slack, PagerDuty, email, or webhook.

Give Microservices Monitoring a Try

Adding Trace to your services takes just a couple of lines of code, and it can be installed and used in under two minutes.

Click to sign up for Trace

We are curious about your feedback on Trace and on the concept of distributed transaction tracking, so don't hesitate to share your opinion in the comment section.

Monitoring Microservices Architectures: Enterprise Best Practices

By reading the following article, you can get insight into how lead engineers at IBM, the Financial Times, and Netflix think about the pain points of application monitoring, and what their best practices are for maintaining and developing microservices. I'd also like to introduce a solution we developed at RisingStack that aims to tackle the most important issues with monitoring microservices architectures.


Tearing down a monolithic application into a microservices architecture brings tremendous benefits to engineering teams and organizations. New features can be added without rewriting other services. Smaller codebases make development easier and faster, and the parts of an application can be scaled separately.

Unfortunately, migrating to a microservices architecture has its challenges as well, since it results in complex distributed systems where it can be difficult to understand the communication and request flow between the services. Also, monitoring gets increasingly frustrating thanks to a myriad of services generating a flood of unreliable alerts and un-actionable metrics.

Visibility is crucial for IBM when monitoring microservices architectures

Jason McGee, Vice President and Chief Technical Officer of Cloud Foundation Services at IBM, walked us through the microservice-related problems enterprises often face in his highly recommended DockerCon interview with The New Stack.


For a number of years - according to Jason - developer teams were struggling to deal with the increasing speed and delivery pressures placed on them, but with the arrival of microservices, things have changed.

Migrating from the Monolith to a Microservices Architecture

In a microservices architecture, a complex problem can be broken up into units that are truly independent, so the parts can continue to work separately. The services are decoupled, so people can operate in small groups with less coordination and therefore they can respond more quickly and go faster.

“It’s interesting that a lot of people talk about microservices as a technology when in reality I think it’s more about people, and how people are working together.”

The important thing about microservices for Jason is that you can give 5 or 10 people responsibility for a function, and they can manage that function throughout its lifecycle and update it whenever they need to - without having to coordinate with the rest of the world.

“But in technology, everything has a tradeoff, a downside. If you look at microservices at an organization level, the negative trade-off is the great increase in the complexity of operations. You end up with a much more complex operating environment.”

Right now, said Jason, a lot of activity in the microservices space is about what kind of tools and management systems teams have to put around their services to make microservices architectures practical. Teams with microservices have to understand how they want to factor their applications, what approaches they want to take for wiring everything together, and how they can achieve visibility into their services.

The first fundamental problem developers have to solve is how the services are going to find each other. After that, they have to manage complexity by instituting a standardized approach to service discovery. The second biggest problem is monitoring and bringing visibility to services: developers have to understand what's going on by getting visibility into what is happening in their cloud-based network of services.

Describing this in a simplified manner: an app can have hundreds of services behind the scene, and if it doesn’t work, someone has to figure out what’s going on. When developers just see miles of logs, they are going to have a hard time tracing back a problem to its cause. That’s why people working with microservices need excellent tools providing actionable outputs.

“There is no way a human can map how everyone is talking to everyone, so you need new tools to give you the visibility that you need. That’s a new problem that has to be solved for microservices to become an option.”


At RisingStack, as an enterprise Node.js development and consulting company, we have experienced the same problems with microservices since the moment of their conception.

Our frustration at not having proper tools to solve these issues led us to develop our own solution called Trace, a microservices monitoring tool with distributed transaction tracking, error detection, and process monitoring. Our tool is currently in open beta, so it can be used for free.

If you’d like to give it a look, we’d appreciate your feedback on our Node.js monitoring platform.


Financial Times eases the pain of monitoring microservices architectures with the right tools and smart alerts

Sarah Wells, Principal Engineer at the Financial Times, told the story of what it's like to move from monitoring a monolithic application to monitoring a microservices architecture in her Codemotion presentation, Alert overload: How to adopt a microservices architecture.

About two years ago, the Financial Times started working on a new project with the goal of building a new content platform (Fast FT) with a microservices architecture and APIs. The project team also started doing DevOps at the same time, because they were building a lot of new services and couldn't take the time to hand them over to a separate operations team. According to Sarah, supporting their own services meant that all of the pain the operations team used to have was suddenly transferred to them whenever they did shoddy monitoring and alerting.

“Microservices make it worse! Microservices are an efficient device for transforming business problems into distributed transaction problems.”

It’s also important to note here, that there’s a lot of things to like about microservices as Sarah mentioned:

“I am very happy that I can reason about what I’m trying to do because I can make changes live to a very small piece of my system and roll back really easily whenever I want to. I can change the architecture and I can get rid of the old stuff much more easily than I could when I was building a monolith.”

Let’s see what was the biggest challenge the DevOps team at Financial Times faced with a microservice architecture. According to Sarah, monitoring suddenly became much harder because they had a lot more systems than before. The app they built consisted of 45 microservices. They had 3 environments (integration, test, production) and 2 VM’s for each of those services. Since they ran 20 different checks per service (for things like CPU load, disk status, functional tests, etc.) and they ran them every 5 minutes at least. They ended up with 1,500,000 checks a day, which meant that they got alerts for unlikely and transient things all the time.

“When you build a microservices architecture and something fails, you’re going to get an alert from a service that’s using it. But if you’re not clever about how you do alerts, you’re also going to get alerts from every other service that uses it, and then you get a cascade of alerts.”

When a new developer joined Sarah's team, he couldn't believe the number of emails they got from different monitoring services, so he started counting them. The result was over 19,000 system monitoring alerts in 50 days - 380 a day on average. Functional monitoring was also an issue, since the team wanted to know when response times were getting slow or when an error was logged or returned to anyone. Needless to say, they got swamped by the amount of alerts: 12,745 response-time or error alerts in 50 days, 255 a day on average.

Monitoring a Microservices Architecture can cause trouble with Alerting

Sarah and the team finally developed three core principles for making this almost unbearable situation better.

1. Think about monitoring from the start.

The Financial Times team created far too many alerts without thinking about why they were doing it. As it turned out, it was the business functionality they really cared about, not the individual microservices - so that's what their alerting should have focused on. At the end of the day, they only wanted an alert when they needed to take action. Otherwise, it was just noise. They made sure their alerts were actually good, because anyone reading them should be able to work out what they mean and what needs to be done.

According to Sarah’s experiences, a good alert has clear language, is not fake, and contains a link to more explanatory information. They had also developed a smart solution: they tied all of their microservices together by passing around transaction ID’s as request headers, so the team instantly knew that if an error was caused thanks by an event in the system, and they could even search for it. The team also established health checks for every RESTful application, since they wanted to know early about problems that could affect their customers.

2. Use the right tools for the job.

Since the platform Sarah’s team have been working on was an internal PaaS, they figured out that they needed some tooling to get the job done. They used different solutions for service monitoring, log aggregation, graphing, real-time error analysis, and also built some custom in-house tools for themselves. You can check out the individual tools in Sarah’s presentation from slide51.

The main takeaway from their example was that they needed tools that could show whether something happened 10 minutes ago but disappeared soon after - while everyone was in a meeting. They also figured out the proper communication channel for alerting: it was not email, but Slack! The team established a clever reaction system to tag solved and work-in-progress issues in Slack.

3. Cultivate your alerts.

As soon as you stop paying attention to alerts, things will go wrong. When Sarah's team gets an alert, they review it and act on it immediately. If the alert isn't good, they either get rid of it or make it better. If it isn't helpful, they make sure it won't get sent again. It's also important to make sure that alerts haven't stopped working. To check this, the FT team often breaks things deliberately (they actually have a chaos monkey), just to make sure that alerts do fire.

How did the team benefit from these actions? They were able to turn off all emails from system monitoring and carry on with their work while still being able to monitor their systems. Sarah ended her presentation with a strong recommendation for using microservices, and with her previously discussed advice distilled into a brief form:

“I build microservices because they are good, and I really like working with them. If you do that, you have to appreciate that you need to work at supporting them. Think about monitoring from the start, make sure you have the right tools and continue to work on your alerts as you go.”

Death Star diagrams make no sense with Microservices Architectures

Adrian Cockcroft had the privilege of gaining a tremendous amount of microservices-related experience by working as Chief Architect for 7 years at Netflix - a company relying heavily on a microservices architecture to provide an excellent user experience.

According to Adrian, teams working with microservices have to deal with three major problems right now.

“When you have microservices, you end up with a high rate of change. You do a code push and floods of new microservices appear. It’s possible to launch thousands of them in a short time, which will certainly break any monitoring solution.”

The second problem is that everything is ephemeral: short lifetimes make it hard to aggregate historical views of services, and hand-tweaked monitoring tools take too much work to keep running.

“Microservices have increasingly complex calling patterns. These patterns are hard to figure out with 800 microservices calling each other all the time. The visualization of these flows gets overwhelming, and it’s hard to render so many nodes.”

These microservice diagrams may look complicated, but looking inside a monolith would be even more confusing, because it's tangled together in ways you can't even see. The system gets tangled together like a big mass of spaghetti, said Adrian.

A microservices architecture often ends up looking like a Death Star diagram

Furthermore, managing scale is a grave challenge in the industry right now, because a single company can have tens of thousands of instances across five continents, and that makes things complicated. Tooling is crucial in this area. Netflix built its own in-house monitoring tool. Twitter made its own tool too, called Zipkin (an open-source Java monitoring tool based on Google's Dapper technology). The problem with these tools is that when teams look at the systems they have successfully mapped out, they often end up with the so-called Death Star diagrams.

“Currently, there are a bunch of tools trying to do monitoring in a small way - they can show the request flow across a few services. The problem is, that they can only visualize your own bounded context - who are your clients, who are your dependencies. That works pretty well, but once you’re getting into what’s the big picture with everything, the result will be too difficult to comprehend.”

For Adrian, it was a great frustration at Netflix that every monitoring tool they tried exploded on impact. Another problem is that using, or even testing, monitoring tools at scale gets expensive very quickly. Adrian illustrated his claim with a frightening example: the single biggest budget component for Amazon is the monitoring system - it takes up 20% of the costs.

“Pretty much all of the tools you can buy now understand datacenters with a hundred nodes, that’s easy. Some of them can understand cloud. Some of them can get to a few thousand nodes. There’s a few alpha and beta monitoring solutions that claim they can get to the ten thousands. With APM’s you want to understand containers, because your containers might be coming and going in seconds - so event-driven monitoring is a big challenge for these systems.”

According to Adrian, there is still hope, since the tools currently being built will get to the point where large-scale companies can use them as commercial products.


If you have additional thoughts on the topic, feel free to share them in the comments section.

Hunting a Ghost - Finding a Memory Leak in Node.js

Finding a Node.js memory leak can be quite challenging - recently we had our fair share of it.

One of our client's microservices started to produce the following memory usage:


Node.js memory leak in Trace

Memory usage grabbed with Trace by RisingStack - our Node.js Performance monitoring and debugging tool

You may spend quite a few days on things like this: profiling the application and looking for the root cause. In this post, I would like to summarize what tools you can use and how, so you can learn from it.

The TL;DR version

In our particular case, the service was running on a small instance with only 512MB of memory. As it turned out, the application didn't leak any memory - the GC simply never started collecting unreferenced objects.

Why was this happening? By default, Node.js will try to use about 1.5GB of memory, which has to be capped when running on systems with less. This is the expected behavior, as garbage collection is a very costly operation.

The solution for it was adding an extra parameter to the Node.js process:

node --max_old_space_size=400 server.js --production  

Still, if the cause is not this obvious, what are your options for finding memory leaks?





Understanding V8's Memory Handling

Before diving into the techniques you can employ to find and fix memory leaks in Node.js applications, let's take a look at how memory is handled in V8.

Definitions
  • resident set size: the portion of memory occupied by a process that is held in RAM; this contains:
    • the code itself
    • the stack
    • the heap
  • stack: contains primitive types and references to objects
  • heap: stores reference types, like objects, strings or closures
  • shallow size of an object: the size of memory that is held by the object itself
  • retained size of an object: the size of memory that is freed up once the object is deleted, along with its dependent objects
How The Garbage Collector Works

Garbage collection is the process of reclaiming the memory occupied by objects that are no longer in use by the application. Usually, memory allocation is cheap, while collection is expensive and happens when the memory pool is exhausted.

An object is a candidate for garbage collection when it is unreachable from the root node, so not referenced by the root object or any other active objects. Root objects can be global objects, DOM elements or local variables.

The heap has two main segments: the New Space and the Old Space. The New Space is where new allocations happen; it is fast to collect garbage here, and it has a size of ~1-8MB. Objects living in the New Space are called the Young Generation. The Old Space is where objects that survived collection in the New Space are promoted - they are called the Old Generation. Allocation in the Old Space is fast, but collection is expensive, so it is performed infrequently.
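You can peek at these heap spaces from inside a running process using the core v8 module (available since Node.js 6) - a quick sketch:

var v8 = require('v8');

// Print how full each heap space is; if you suspect a leak,
// watch old_space grow steadily over time
v8.getHeapSpaceStatistics().forEach(function (space) {
  var usedMb = (space.space_used_size / 1024 / 1024).toFixed(1);
  var sizeMb = (space.space_size / 1024 / 1024).toFixed(1);
  console.log(space.space_name + ': ' + usedMb + ' MB used of ' + sizeMb + ' MB');
});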

Why is garbage collection expensive? The V8 JavaScript engine employs a stop-the-world garbage collector mechanism. In practice, it means that the program stops execution while garbage collection is in progress.

Usually, ~20% of the Young Generation survives into the Old Generation. Collection in the Old Space will only commence once it is close to exhaustion. To do so, the V8 engine uses two different collection algorithms:

  • Scavenge collection, which is fast and runs on the Young Generation,
  • Mark-Sweep collection, which is slower and runs on the Old Generation.

For more information on how this works, check out the article A Tour of V8: Garbage Collection. For more information on general memory management, visit the Memory Management Reference.

Tools / Techniques You Can Use to Find a Memory Leak in Node.js

The heapdump module

With the heapdump module, you can create a heap snapshot for later inspection. Adding it to your project is as easy as:

npm install heapdump --save  

Then in your entry point just add:

var heapdump = require('heapdump');  

Once that's done, you can create a heapdump either by using the kill -USR2 <pid> command or by calling:

heapdump.writeSnapshot(function(err, filename) {  
  console.log('dump written to', filename);
});

Once you have your snapshots, it's time to make sense of them. Make sure you capture several of them, some time apart, so you can compare them.

Google Chrome DevTools

First, you have to load your memory snapshots into the Chrome profiler. To do so, open Chrome DevTools, go to the Profiles tab, and load your heap snapshots.

find a node.js memory leak with chrome load profiles

Once you've loaded them, you should see something like this:

chrome heap snapshot for finding a node.js memory leak

So far so good - but what exactly can be seen in this screenshot?

One of the most important things here to notice is the selected view: Comparison. This mode enables you to compare two (or more) heap snapshots taken at different times, so you can pinpoint exactly what objects were allocated and not freed up in the meantime.

The other important view is Retainers. It shows exactly why an object cannot be garbage collected and what is holding a reference to it. In this case, a global variable called log is holding a reference to the object itself, preventing the garbage collector from freeing up space.
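If you'd like to practice this workflow on a known leak first, here is a deliberately leaky toy server reproducing the pattern above - a module-level log array retaining data from every request:

var http = require('http');

// Deliberately leaky: `log` holds a reference to data from every
// request, so none of it can ever be garbage collected
var log = [];

http.createServer(function (req, res) {
  log.push({ url: req.url, date: new Date(), payload: Buffer.alloc(10 * 1024) });
  res.end('ok');
}).listen(3000);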

Low-Level Tools

mdb

The mdb utility is an extensible utility for low-level debugging and editing of the live operating system, operating system crash dumps, user processes, user process core dumps, and object files.

gcore

The gcore utility generates a core dump of a running program with a given process ID.

Putting it together

To investigate dumps, first we have to create one. You can easily do so with:

gcore `pgrep node`  

Once you have it, you can search for all the JS objects on the heap using:

> ::findjsobjects

Of course, you have to take successive core dumps so that you can compare them.

Once you've identified objects that look suspicious, you can analyze them using:

object_id::jsprint  

Now all you have to do is find the retainer of the object (the root).

object_id::findjsobjects -r  

This command returns the id of the retainer. You can then use ::jsprint again to analyze the retainer.

For a detailed walkthrough, check out Yunong Xiao's talk from Netflix on how to use these tools.

Recommended Reading

UPDATE: Read the story of how we found a memory leak in our blogging platform by comparing heapshots with Trace and Chrome's DevTools.

Do you have additional thoughts or insights on Node.js memory leaks? Share them in the comments.

Trace - Microservice Monitoring and Debugging

Trace by RisingStack - Distributed Tracing, Service map, Alerting and Performance Monitoring for Microservices

We are happy to announce Trace, a microservices monitoring and debugging tool that empowers you to get all the metrics you need when operating microservices. Trace comes both as a free, open-source tool and as a hosted service.

Start monitoring your services

Why Trace for microservice monitoring?

Debugging and monitoring microservices can be really challenging:

  • there are no cross-service stack traces, which makes debugging hard,
  • it's easy to lose track of services when dealing with many of them,
  • detecting bottlenecks is difficult.

Key Features

Trace solves these problems with:

  • distributed stack traces,
  • a topology view for your services,
  • alerting for overwhelmed services,
  • third-party service monitoring (coming soon),
  • tracing of heterogeneous infrastructures with languages like Java, PHP or Ruby (coming soon).

How It Works

We want to monitor the traffic of our microservices. To do this, we have to access each HTTP request-response pair to get and set information. By wrapping the http core module's request function and the Server.prototype object, we can sniff all the information we need.

Trace is mostly based on the Google Dapper white paper - so we implemented the ServerReceive, ServerSend, ClientSend, ClientReceive events for monitoring the lifetime of a request.

trace events

In the example above, we want to catch the very first incoming request - SR (A): Server Receive. The http.Server will emit a request event with an http.IncomingMessage and an http.ServerResponse, with the signature:

function (request, response) { }  

In the wrapper, we can record any information we want: timing, the source, the requested path, or even the whole HTTP header for further investigation.

One of Trace's fundamental features is tracking whole transactions in microservices architectures. Luckily, we can do this by setting a request-id header on outgoing requests.

If our service has to call another service before it can send the response to its caller, we have to track these request-response pairs - called spans - as well. A span always originates from http.request calling an endpoint. By wrapping the http.request function, we can do the same as with http.Server.prototype, with one minor difference: here we want to pair the corresponding request and response, and assign a span-id to them.
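A highly simplified sketch of this wrapping idea follows - the real collector records much more (span ids, headers, and timings for all four events), but the mechanics are the same:

var http = require('http');
var originalRequest = http.request;

// Wrap http.request so every outgoing call (CS -> CR) is timed
http.request = function (options, callback) {
  var start = Date.now();
  var req = originalRequest.call(http, options, callback);
  req.on('response', function (res) {
    console.log('outgoing request:', res.statusCode, (Date.now() - start) + ' ms');
  });
  return req;
};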

The request-id, however, just passes through the span. To store the generated request-id, we use Continuation-Local Storage (CLS): after a request arrives and we generate the request-id, we store it in CLS, so when we call another service we can simply read it back.
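A minimal sketch of this pattern with the continuation-local-storage module - the namespace and header names here are illustrative, not the ones Trace actually uses:

var cls = require('continuation-local-storage');
var crypto = require('crypto');
var session = cls.createNamespace('tracer');

// Incoming side: reuse the caller's request-id or generate a new one,
// and keep it in CLS for the lifetime of this request
function incoming (req, res, next) {
  var requestId = req.headers['request-id'] || crypto.randomBytes(16).toString('hex');
  session.run(function () {
    session.set('request-id', requestId);
    next();
  });
}

// Outgoing side: read the id back from CLS and forward it as a header
function outgoingHeaders () {
  return { 'request-id': session.get('request-id') };
}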

Create reporters

After you set up the collector by simply requiring it in your main file:

require('@risingstack/trace');

You can select a reporting method to process the collected data. You can use:

  • our Trace servers to see the transactions, your topology and services,
  • Logstash,
  • or any other custom reporter (see later).

You have to provide a trace.config.js config file, where you can declare the reporter. If you just want to see the collected data, you can use Logstash with the following config file:

/**
* Trace example config file for using with Logstash
*/

var reporters = require('@risingstack/trace/lib/reporters');
var config = {};

config.appName = 'Example Service Name';

config.reporter = reporters.logstash.create({  
  type: 'tcp',
  host: 'localhost',
  port: 12201
});

module.exports = config;  

If you start Logstash with the following command, every collected packet will be displayed in the terminal:

logstash -e 'input { tcp { port => 12201 } } output { stdout {} }'  

This approach can also be really powerful when you want to tunnel these metrics into different systems like Elasticsearch, or just store them on S3.

Adding custom reporters

If you want to use the collector with your custom reporter, you have to provide your own implementation of the reporter API. The only required method is a send method with the collected data and a callback as parameters.

function CustomReporter (options) {  
  // init your reporter
}

CustomReporter.prototype.send = function (data, callback) {  
  // implement the data sending,
  // don't forget to call the callback after the data sending has ended
};

function create(options) {  
  return new CustomReporter(options);
}

module.exports.create = create;  

Use the Trace collector with Trace servers

If you want to enjoy all the benefits of our Trace service, you need to create an account first. After your API Key has been generated, you can use it in your config file:

/**
* Trace example config file for using with Trace servers
*/

var config = {};

config.appName = 'Example Service Name';

config.reporter = require('@risingstack/trace/lib/reporters').trace.create({
  apiKey: 'YOUR-APIKEY',
  appName: config.appName
});

module.exports = config;  

Adding Trace to your project

To use the Trace collector as a dependency of your project, use:

npm install --save @risingstack/trace

Currently, Trace supports a fixed set of versions of the most popular Node.js frameworks; check the module's documentation for the exact list.

Trace-as-a-Service

If you don't want to run your own infrastructure for storing and displaying microservice metrics, we provide microservices monitoring as a service as well. This is Trace:

trace topology

trace view

Check out our tool!

Start monitoring your services