Hello everyone,
As requested on this long-standing issue and then discussed in several meetings, SIG-Observability believes the community would see value in a community-driven, and thus vendor-neutral, whitepaper about observability.
Some work has already been done in the past, but it was more or less abandoned over the last couple of months. Simone Ferlin and I reviewed the sections that may be interesting and, since we are not experts, we'd like to invite the brilliant minds of the CNCF community to help us write this whitepaper. The working document can be found here: https://docs.google.com/document/d/1eoxBe-tkQclixeNmKXcyCMmaF5w1Kh1rBDdLs0-cFsA/edit#
As said in the issue:
Goal: Support users in implementing observability and monitoring for their cloud-native workloads.
Target: End-users building cloud-native applications
Scope: Define basic concepts of data collection and analysis and how CNCF projects can be used for this. Maybe add 1-3 real-world reference examples
Even though this structure is not final, let me give a quick overview of what we think may be valuable. Please let us know if you would like to write about any of the topics or subtopics, or if you have suggestions about other topics that may be missing:
- What is observability?
This is the entry point of the whitepaper. If users want to implement observability, they first need to know what they are trying to achieve and what the end goal is.
- The three Observability pillars
Yes, there is a lot more to observability than just the three pillars, but we believe they are the starting point for anyone beginning their observability journey. Here we'd like to quickly explain how these pillars form the foundation of an observable system.
- Metrics
Explain what metrics are, which problems they are meant to solve, which problems they are not meant to solve, how they are usually collected, and general best practices. (A short sketch after the Traces item below illustrates one common way all three signals are emitted from application code.)
- Logs
Explain what logs are, which problems they are meant to solve, which problems they are not meant to solve, how they are usually collected, and general best practices.
- Traces
Explain what traces are, which problems they are meant to solve, which problems they are not meant to solve, how they are usually collected, and general best practices.
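To make the three signals a bit more concrete, here is a minimal, purely illustrative sketch in Python of one common way each of them is emitted from application code. It assumes the prometheus_client and opentelemetry-sdk packages, and every name in it (the "checkout" service, the metric, the log fields, the span names) is made up for illustration; it is a sketch of the general pattern, not a recommendation of specific tools.

```python
# Illustrative only: emit a metric, a structured log line, and a trace span
# from the same (hypothetical) "checkout" code path.
import json
import logging
import sys
import time

from prometheus_client import Counter, start_http_server
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Metrics: a counter exposed on /metrics for a scrape-based collector.
CHECKOUTS = Counter("checkout_requests_total", "Total checkout requests", ["status"])

# Logs: one JSON object per line on stdout, easy for log shippers to parse.
logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
logger = logging.getLogger("checkout")

def log_json(message, **fields):
    logger.info(json.dumps({"ts": time.time(), "msg": message, **fields}))

# Traces: spans printed to stdout; a real setup would export them to a backend.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

def handle_checkout(order_id):
    with tracer.start_as_current_span("handle_checkout") as span:  # trace the request
        span.set_attribute("order.id", order_id)
        log_json("checkout started", order_id=order_id)
        # ... the real work would happen here ...
        CHECKOUTS.labels(status="200").inc()                        # count the outcome
        log_json("checkout finished", order_id=order_id)

if __name__ == "__main__":
    start_http_server(8000)  # expose the /metrics endpoint for scraping
    handle_checkout("order-42")
```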
- Different ways to observe a system
Now that we have explained a set of telemetry data types, how do we get value out of them? Beginners tend to create telemetry data at random, which often does more harm than good. We need to be extremely clear that cloud-native observability is not just about using cloud-native tools; we need to explain clear methodologies and guidelines for getting value out of the collected telemetry! With that in mind, let's talk about the different ways to observe a system and dig deeper into methodologies. There are many ways to observe a system, so keep in mind that there is room to add approaches beyond the ones listed below.
- Monitoring
People tend to think that observability is just another word for monitoring, so we should start by explaining the difference between the two and then dig deeper into monitoring itself. Cover black-box and white-box monitoring, and perhaps the RED and USE methods. Talk a little about SLOs and SLO-based alerting, and explain if and when SLO-based alerts can replace traditional resource-based alerts, e.g., high CPU saturation or a node disk filling up. We can mention tools that help with cloud-native monitoring, but let's not get too deep into implementation details; the focus should be on teaching clear methodologies.
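To illustrate the mindset shift, here is a small, purely illustrative sketch of the arithmetic behind an SLO-based (error-budget burn-rate) alert. The 99.9% target, the one-hour window, and the 14.4x "fast burn" threshold are assumptions borrowed from commonly cited multi-window burn-rate examples, not a prescription.

```python
# Illustrative sketch: alert on how fast the error budget is burning,
# instead of alerting on raw resource symptoms such as CPU saturation.

SLO_TARGET = 0.999               # assume 99.9% of requests should succeed
ERROR_BUDGET = 1 - SLO_TARGET    # i.e. 0.1% of requests may fail

def burn_rate(failed, total):
    """1.0 means the budget is being spent exactly at the allowed pace."""
    if total == 0:
        return 0.0
    return (failed / total) / ERROR_BUDGET

# Request counters observed over the last hour, e.g. read from a metrics backend.
failed, total = 180, 10_000

rate = burn_rate(failed, total)
if rate >= 14.4:  # a commonly used fast-burn paging threshold (an assumption here)
    print(f"page: burning the error budget {rate:.1f}x too fast")
else:
    print(f"ok: burn rate is {rate:.1f}x")
```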
- Chaos Engineering
Chaos engineering adoption has been increasing a lot over the last couple of years. Let's explain what it is, its principles, its goals, and when someone should or shouldn't be doing chaos engineering.
- Data visualization and exploration
A very simple approach: just open your data visualization tool of choice and explore the collected telemetry. Even though it's simple, this might be the ultimate goal of an observable system! With a truly observable system, the person who finds the answers is not the one who has worked on the system the longest, but the one who is most curious about the collected telemetry. Building good dashboards is no easy task, and proper guidance is very much appreciated.
- M.E.L.T
Explain what M.E.L.T. (metrics, events, logs, traces) is and what its goals are, how it can be used not only for system resiliency but also to measure user experience, and how a company's product can be improved using this approach.
- [Undecided] Service Mesh
We were unsure how service mesh relates to observability. Tools like Linkerd can surface powerful information about network activity between different components, but we don't know whether that is just a tool feature or the real goal of a service mesh. Does service mesh overlap with monitoring? Does it overlap with data visualization and exploration? If we decide to keep service mesh as a way to observe a system, then let's be clear about its goals and how it differs from the other approaches.
- [Undecided] Continuous Profiling
A lot of new tools and companies based on application profiling have emerged over the last year. This methodology completely ignores the three pillars of observability and focuses on, well, application profiles. Adoption is not widespread yet, but some think it is worth mentioning. If we decide to keep it, let's talk about what a profile is and why continuously collecting profiles can help end users build more reliable and performant software.
- Use Cases
In this section, we aim to show real-life use cases of companies that were struggling with a particular problem and how they managed to solve it. It could be how a company used different observability methodologies to solve a specific problem, or the challenges it faced when trying to implement or adopt cloud-native observability methods. We've listed below some use cases we think might be useful, but don't feel limited to this shortlist.
- Implementing SL{A, I, O}s
SLOs are the standard way to ship new features into production while keeping reliability at an acceptable level. Getting started with them is hard, though, especially if a company is used to the old sysadmin approach where infrastructure and development teams have conflicting goals. We are interested in real-life use cases of companies that have successfully made the cultural change and are now happy with their SLOs.
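As a concrete starting point for such a use case, the back-of-the-envelope arithmetic below shows what an SLO target translates into as an error budget; the 99.9% target and the 30-day window are example values only.

```python
# What does "99.9% over 30 days" actually buy you? (example values only)
slo_target = 0.999
window_minutes = 30 * 24 * 60          # 43,200 minutes in a 30-day window
budget_minutes = (1 - slo_target) * window_minutes
print(f"{slo_target:.1%} over 30 days = {budget_minutes:.1f} minutes of error budget")
# -> about 43.2 minutes of "allowed" unavailability to spend on incidents and risky changes
```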
- Data retention and legal issues
Europe and several other regions around the world have laws that protect users' privacy. A central requirement of these laws is that users must be able to have all data related to them deleted if they decide to do so, and that includes observability data. When implementing cloud-native observability, how do we keep track of individual users' data, and how do we proceed when a user decides to delete everything?
- Gaps around observability
Cloud-native observability is still far from perfect. Some problems can't be easily solved yet, whether for lack of proper tooling or proper methodology. This section will cover areas that SIG-Observability has identified as not yet ready for wider adoption. This also means that there is room in the market if anyone is interested in working on these gaps :)
- Machine learning and statistics
Yes, a very controversial topic! Some say that machine learning is completely useless for cloud-native observability due to the ever-changing nature of cloud applications, while others say ML has room to solve some problems. Let's be clear about how ML can be used here and where we expect to see improvements in the future.
- Monitoring streaming APIs
There is the USE method for monitoring resources and the RED method for monitoring request-response-driven services. Neither of those methods applies to streaming APIs, because of their long-lived connections. If we intend to put people on call for streaming APIs they may not have written themselves, we need a methodology that applies to all streaming APIs and tells whether one is healthy or not. This section should give some direction on where we want to be in the near future when monitoring streaming APIs.
- Core dumps
After a crash, the kernel dumps a core file to the filesystem, which can be used to analyze the reason for the crash. When working in the cloud, access to the nodes is sometimes not allowed, or the experience of getting to them is far from great. This section should describe what a good experience would look like when someone needs a core dump.