By reading the following article, you can get insight on how lead engineers at IBM, Financial Times and Netflix think about the pain-points of application monitoring and what are their best practices for maintaining and developing microservices. Also, I’d like to introduce a solution we developed at RisingStack, which aims to tackle the most important issues with monitoring microservicesMicroservices are not a tool, rather a way of thinking when building software applications. Let's begin the explanation with the opposite: if you develop a single, self-contained application and keep improving it as a whole, it's usually called a monolith. Over time, it's more and more difficult to maintain and update it without breaking anything, so the development cycle may... architectures.
Killing the Monolith
Tearing down a monolithic application into a microservices architecture brings tremendous benefits to engineering teams and organizations. New features can be added without rewriting other services. Smaller codebases make development easier and faster, and the parts of an application can be scaled separately.
Unfortunately, migrating to a microservices architecture has its challenges as well since it requires complex distributed systems, where it can be difficult to understand the communication and request flow between the services. Also, monitoring gets increasingly frustrating thanks to a myriad of services generating a flood of unreliable alerts and un-actionable metrics.
Visibility is crucial for IBM with monitoring microservices architectures
Jason McGee, Vice President and Chief Technical Officer of Cloud Foundation Services at IBM let us take a look at the microservice related problems enterprises often face in his highly recommended Dockercon interview with The New Stack.
For a number of years – according to Jason – developer teams were struggling to deal with the increasing speed and delivery pressures they had to fulfill, but with the arrival of microservices, things have changed.
In a microservices architecture, a complex problem can be broken up into units that are truly independent, so the parts can continue to work separately. The services are decoupled, so people can operate in small groups with less coordination and therefore they can respond more quickly and go faster.
“It’s interesting that a lot of people talk about microservices as a technology when in reality I think it’s more about people, and how people are working together.”
The important thing about microservices for Jason is that anyone can give 5 or 10 people responsibility for a function, and they can manage that function throughout its lifecycle and update it whenever they need to – without having to coo
rdinate with the rest of the world.
“But in technology, everything has a tradeoff, a downside. If you look at microservices at an organization level, the negative trade-off is the great increase in the complexity of operations. You end up with a much more complex operating environment.”
Right now, a lot of activity in the microservices space is about that what kind of tools and management systems teams have to put around their services to make microservices architectures a practical thing to do, said Jason. Teams with microservices have to understand how they want to factor their applications, what approaches they want to take for wiring everything together, and how can they reach the visibility of their services.
The first fundamental problem developers have to solve is how the services are going to find each other. After that, they have to manage complexity by instituting some standardized approach for service discovery. The second biggest problem is about monitoring and bringing visibility to services. Developers have to understand what’s going on, by getting visibility into what is happening in their cloud-based network of services.
Describing this in a simplified manner: an app can have hundreds of services behind the scene, and if it doesn’t work, someone has to figure out what’s going on. When developers just see miles of logs, they are going to have a hard time tracing back a problem to its cause. That’s why people working with microservices need excellent tools providing actionable outputs.
“There is no way a human can map how everyone is talking to everyone, so you need new tools to give you the visibility that you need. That’s a new problem that has to be solved for microservices to became an option.”
Distributed Transaction Tracking
At RisingStack, as an enterprise Node.js development and consulting company, we experienced the same problems with microservices since the moment of their conception.
Our frustration of not having proper tools to solve these issues led us to develop our own solution called Trace, a microservice monitoring tool with distributed transaction tracking, error detection, and process monitoring for microservices. Our tool is currently in an open beta stage, therefore it can be used for free.
If you’d like to give it a look, we’d appreciate your feedback on our Node.js monitoring platform.
Financial Times eases the pain of monitoring microservices architectures with the right tools and smart alerts
Sarah Wells, Principal Engineer of Financial Times told the story of what it’s like to move from monitoring a monolithic application to monitoring a microservice architecture in her Codemotion presentation named Alert overload: How to adopt a microservices architecture.
About two years ago Financial Times started working on a new project where their goal was to build a new content platform (Fast FT) with a microservices architecture and APIs. The project team also started to do DevOps at the same time, because they were building a lot of new services, and they couldn’t take the time to hand them over to a different operations team. According to Sarah, supporting their own services meant that all of the pain the operations team used to have was suddenly transferred to them when they did shoddy monitoring and alerting.
“Microservices make it worse! Microservices are an efficient device for transforming business problems into distributed transaction problems.”
It’s also important to note here, that there’s a lot of things to like about microservices as Sarah mentioned:
“I am very happy that I can reason about what I’m trying to do because I can make changes live to a very small piece of my system and roll back really easily whenever I want to. I can change the architecture and I can get rid of the old stuff much more easily than I could when I was building a monolith.”
Let’s see what was the biggest challenge the DevOps team at Financial Times faced with a microservice architecture. According to Sarah, monitoring suddenly became much harder because they had a lot more systems than before. The app they built consisted of 45 microservices. They had 3 environments (integration, test, production) and 2 VM’s for each of those services. Since they ran 20 different checks per service (for things like CPU load, disk status, functional tests, etc.) and they ran them every 5 minutes at least. They ended up with 1,500,000 checks a day, which meant that they got alerts for unlikely and transient things all the time.
“When you build a microservices architecture and something fails, you’re going to get an alert from a service that’s using it. But if you’re not clever about how you do alerts, you’re also going to get alerts from every other service that uses it, and then you get a cascade of alerts.”
One time a new developer joined Sarah’s team he couldn’t believe the number of emails they got from different monitoring services, so he started to count them. The result was over 19,000 system monitoring alerts in 50 days, 380 a day on average. Functional monitoring was also an issue since the team wanted to know when their response time was getting slow or when they logged or returned an error to anyone. Needless to say, they got swamped by the amount of alerts they got, namely 12,745 response time or error alerts in 50 days, 255 a day on average.
Sarah and the team finally developed three core principles for making this almost unbearable situation better.
1.Think about monitoring from the start.
The Financial Times team created far too many alerts without thinking about why they were doing it. As it turned out, it was the business functionality they really cared about, not the individual microservices – so that’s what their alerting should have focused on. At the end of the day, they only wanted an alert when they needed to take action. Otherwise, it was just noise. They made sure that the alerts are actually good because anyone reading them should be able to work out what they mean and what is needed to do.
According to Sarah’s experiences, a good alert has clear language, is not fake, and contains a link to more explanatory information. They had also developed a smart solution: they tied all of their microservices together by passing around transaction ID’s as request headers, so the team instantly knew that if an error was caused thanks by an event in the system, and they could even search for it. The team also established health checks for every RESTful application, since they wanted to know early about problems that could affect their customers.
2.Use the right tools for the job.
Since the platform Sarah’s team have been working on was an internal PaaS, they figured out that they needed some tooling to get the job done. They used different solutions for service monitoring, log aggregation, graphing, real-time error analysis, and also built some custom in-house tools for themselves. You can check out the individual tools in Sarah’s presentation from slide51.
The main takeaway from their example was that they needed tools that could show if something happened 10 minutes ago but disappeared soon after – while everyone was in a meeting. They figured out the proper communication channel for alerting: it was not email, but Slack! The team had also established a clever reaction system to tag solved and work in progress issues in Slack.
3.Cultivate your alerts
As soon as you stop paying attention to alerts, things will go wrong. When Sarah’s team gets an alert, they are reviewing it and acting on it immediately. If the alert isn’t good, they are either getting rid of it or making it better. If it isn’t helpful, they make sure that it won’t get sent again. It’s also important to make sure that alerts didn’t stop working. To check this, the team of FT often breaks things deliberately (they actually have a chaos monkey), just to make sure that alerts do fire.
How did the team benefit from these actions? They were able to turn off all emails from system monitoring and they could carry on with work while they were still able to monitor their systems. Sarah ended her presentation with a huge recommendation for using microservices and with her previously discussed pieces of advice distilled in a brief form:
“I build microservices because they are good, and I really like working with them. If you do that, you have to appreciate that you need to work at supporting them. Think about monitoring from the start, make sure you have the right tools and continue to work on your alerts as you go.”
Death Star diagrams make no sense with Microservices Architectures
Adrian Cockroft had the privilege to gain a tremendous amount of microservices related experience by working as Chief Architect for 7 years at Netflix – a company heavily relying on a microservices architecture to provide excellent user experience.
According to Adrian, teams working with microservices have to deal with three major problems right now.
“When you have microservices, you end up with a high rate of change. You do a code push and floods of new microservices appear. It’s possible to launch thousands of them in a short time, which will certainly break any monitoring solution.”
The second problem is that everything is ephemeral: Short lifetimes make it hard to aggregate historical views of services, and hand tweaked monitoring tools take too much work to keep running.
“Microservices have increasingly complex calling patterns. These patterns are hard to figure out with 800 microservices calling each other all the time. The visualization of these flows gets overwhelming, and it’s hard to render so many nodes.”
These microservice diagrams may look complicated, but looking inside a monolith would be even more confusing because it’s tangled together in ways you can’t even see. The system gets tangled together, like a big mass of spaghetti – said Adrian.
Furthermore, managing scale is a grave challenge in the industry right now, because a single company can have tens of thousands of instances across five continents and that makes things complicated. Tooling is crucial in this area. Netflix built its own in-house monitoring tool. Twitter made its own tool too, which is called Zipkin (an open source Java monitoring tool based on Google’s Dapper technology). The problem with these tools is when teams look at the systems they have successfully mapped out, they often end up with the so-called Death Star diagrams.
“Currently, there are a bunch of tools trying to do monitoring in a small way – they can show the request flow across a few services. The problem is, that they can only visualize your own bounded context – who are your clients, who are your dependencies. That works pretty well, but once you’re getting into what’s the big picture with everything, the result will be too difficult to comprehend.”
For Adrian, it was a great frustration at Netflix that every monitoring tool they tried exploded on impact. Another problem is that using, or even testing monitoring tools at scale gets expensive very quickly. Adrian illustrated his claim with a frightening example: The single biggest budget component for Amazon is the monitoring system: it takes up 20% of the costs.
“Pretty much all of the tools you can buy now understand datacenters with a hundred nodes, that’s easy. Some of them can understand cloud. Some of them can get to a few thousand nodes. There’s a few alpha and beta monitoring solutions that claim they can get to the ten thousands. With APM’s you want to understand containers, because your containers might be coming and going in seconds – so event-driven monitoring is a big challenge for these systems.”
According to Adrian, there is still hope since the tools that are currently being built will get to the point where the large scale companies can use them as commercial products.
Additional Thoughts
If you have additional thoughts on the topic, feel free to share it in the comments section.