Surfacing proxy internal state


Eliza Weisman <el...@...>
 

Recently, I've been debugging some Linkerd issues and have found the `client_state.json` endpoint (which displays a dump of a Linkerd instance's current routing state) to be extremely useful. In fact, I feel that if we had built this functionality into Linkerd much earlier, a number of issues could have been solved much more easily. So I've been thinking about what a similar feature in Conduit might look like.

It seems to me that it might be a good idea to follow a design similar to `tap` --- you request a state dump for the proxy for a given pod (or pod and service, in the case of a pod with multiple services?) and the control plane sends the request to that proxy & returns it to the user. It might be nice, eventually, to support `tap`-style querying here, so you can say "give me a state dump from all the proxies matching x predicates", but I don't think that needs to be prioritized. It should be pretty easy to make the proxy respond to these requests by converting its discovery state into a textual format --- if we wanted to stick with JSON like Linkerd does, we could use Serde, or we could just use the existing `fmt::Debug` impls. I suspect a proof of concept could be written in an afternoon.
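To make the `fmt::Debug` option concrete, here is a minimal sketch of what such a dump could look like. Everything in it is an assumption for illustration: `DiscoverySnapshot`, `dump_state`, and `example_snapshot` are hypothetical names, not types or functions from the Conduit proxy, and the real discovery state is certainly shaped differently.

```rust
use std::collections::BTreeMap;
use std::net::SocketAddr;

/// Hypothetical snapshot of what one proxy currently believes about
/// service discovery. These names are illustrative only and do not
/// come from the Conduit proxy.
#[derive(Debug, Clone, PartialEq)]
pub struct DiscoverySnapshot {
    /// Destination authority -> endpoints the proxy currently knows about.
    pub destinations: BTreeMap<String, Vec<SocketAddr>>,
}

/// Render the snapshot as text using the derived `fmt::Debug` impl.
/// `{:#?}` selects the pretty-printed, multi-line form.
pub fn dump_state(snapshot: &DiscoverySnapshot) -> String {
    format!("{:#?}", snapshot)
}

/// Build a small, made-up snapshot for demonstration purposes.
pub fn example_snapshot() -> DiscoverySnapshot {
    let endpoints: Vec<SocketAddr> = vec![
        "10.0.0.1:8080".parse().expect("valid socket address"),
        "10.0.0.2:8080".parse().expect("valid socket address"),
    ];
    let mut destinations = BTreeMap::new();
    destinations.insert("web.default.svc:8080".to_string(), endpoints);
    DiscoverySnapshot { destinations }
}
```

Calling `dump_state(&example_snapshot())` returns a multi-line string that names the destination and both endpoints. If JSON were preferred, the same struct could presumably derive Serde's `Serialize` instead, at the cost of pulling in `serde` and `serde_json`.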

What do you all think? Is this the right approach, and is this feature something worth having?

Liza


Oliver Gould <v...@...>
 

This is interesting! Thanks for starting the discussion.

On Mon, Feb 19, 2018 at 3:02 PM Eliza Weisman <el...@...> wrote:
It seems to me that it might be a good idea to follow a design similar to `tap` --- you request a state dump for the proxy for a given pod (or pod and service, in the case of a pod with multiple services?) and the control plane sends the request to that proxy & returns it to the user.

I totally agree that we're going to want to expose proxy state for diagnostics. Though, I'm skeptical that a big state dump is the right approach...

First, is this feature for Conduit developers or Conduit users? I would be particularly wary of exposing users to all of the internal details of the proxy in order to do simple debugging; and if it's just for developers, it probably doesn't need to be as complicated as a full-on `tap`-like feature...

What are the specific questions you'd like to ask of the proxy?


Andrew Seigner <si...@...>
 

Totally agree additional introspection in Linkerd would have saved tons of debug time. It's a lesson we should leverage with Conduit.

For reference, here's a related proposal for Linkerd: https://github.com/linkerd/linkerd/issues/1780

I think an important question to ask: Which information should come from the controller vs. proxy?

--
You received this message because you are subscribed to the Google Groups "conduit-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to conduit-dev+unsubscribe@googlegroups.com.
To post to this group, send email to condu...@....
To view this discussion on the web visit https://groups.google.com/d/msgid/conduit-dev/CABtqR6MwVM%2BtUGMfHXRMeKzC9PCfyJadUgvzctdSdzjYPPHesA%40mail.gmail.com.

For more options, visit https://groups.google.com/d/optout.


Eliza Weisman <el...@...>
 



On Tuesday, February 20, 2018 at 10:35:01 AM UTC-8, Oliver Gould wrote:
First, is this feature for Conduit developers or Conduit users? I would be particularly wary of exposing users to all of the internal details of the proxy in order to do simple debugging; and if it's just for developers, it probably doesn't need to be as complicated as a full-on `tap`-like feature...
 
I think ideally, the main use case I have in mind is a little of both. I'd like to be able to get this sort of state dump as a developer, but what I'd really like is a single command I could hypothetically give to a user to help diagnose a reported issue. I'd like an easy way to be able to say "run this command and give me back the output", and get something back that could provide more insight than, say, pulling logs.

While implementing a full-featured console command might be overkill, I do think it would be nice to be able to bundle this functionality in one step instead of "find which pod is in error by following these steps, set the proxy log level for that pod, pull the logs from that pod...". Possibly information from the control plane could be used to make reasonable guesses about which proxies to get debug information from automatically, though I'm not sure if that would be practical.

What are the specific questions you'd like to ask of the proxy?

This is still an open question --- I'd love input from others. From experience with Linkerd, I think service discovery/routing state is a good starting point. I've been looking into a Linkerd issue where k8s API events are received and trigger the log lines that say "this endpoint was changed", but the discovery change isn't actually reflected in routing. That is where `client_state.json` has been incredibly valuable: it lets me inspect what the Linkerd router actually sees at a given moment, regardless of what the logs claim it *should* see. Although Conduit's routing and discovery work differently, I think that, for example, being able to compare states between the control plane and an individual proxy could be very helpful...
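As a rough illustration of the comparison I have in mind, here is a sketch that diffs the endpoint set the control plane reports for a destination against the set a proxy reports. The names `EndpointDiff` and `diff_endpoints` are hypothetical, and I'm assuming both sides can be reduced to a set of socket addresses, which may not match how Conduit actually represents this.

```rust
use std::collections::BTreeSet;
use std::net::SocketAddr;

/// Differences between two views of the endpoints for one destination.
/// Hypothetical type for illustration; not part of Conduit.
#[derive(Debug, PartialEq)]
pub struct EndpointDiff {
    /// Endpoints the control plane lists but the proxy does not know about.
    pub missing_from_proxy: Vec<SocketAddr>,
    /// Endpoints the proxy still holds but the control plane no longer lists.
    pub stale_in_proxy: Vec<SocketAddr>,
}

impl EndpointDiff {
    /// True when both views contain exactly the same endpoints.
    pub fn is_consistent(&self) -> bool {
        self.missing_from_proxy.is_empty() && self.stale_in_proxy.is_empty()
    }
}

/// Compare the control plane's view with a single proxy's view.
/// Results are in ascending address order because the inputs are `BTreeSet`s.
pub fn diff_endpoints(
    control_plane: &BTreeSet<SocketAddr>,
    proxy: &BTreeSet<SocketAddr>,
) -> EndpointDiff {
    EndpointDiff {
        missing_from_proxy: control_plane.difference(proxy).copied().collect(),
        stale_in_proxy: proxy.difference(control_plane).copied().collect(),
    }
}
```

A non-empty `stale_in_proxy` would correspond to the Linkerd case above, where the logs say an endpoint changed but the router is still holding the old one.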


Oliver Gould <v...@...>
 

On Tue, Feb 20, 2018 at 11:31 AM Eliza Weisman <el...@...> wrote:
I'd really like is a single command I could hypothetically give to a user to help diagnose a reported issue. I'd like an easy way to be able to say "run this command and give me back the output".

Why should they have to give you the output at all?
 
being able to compare states between the control plane and an individual proxy could be very helpful...

My point is that if this comparison can be made, we should completely cut ourselves out of the loop and build the diagnostic features into the controller. We can give ourselves additional diagnostic features, but I think we should be primarily focused on encoding this into the product as much as possible.


Eliza Weisman <el...@...>
 



On Tuesday, February 20, 2018 at 11:49:26 AM UTC-8, Oliver Gould wrote:
Why should they have to give you the output at all?

My point is that if this comparison can be made, we should completely cut ourselves out of the loop and build the diagnostic features into the controller. We can give ourselves additional diagnostic features, but I think we should be primarily focused on encoding this into the product as much as possible.

I see your point, and I definitely agree that improving the ability of Conduit users to debug their own installations is worth prioritising. From my perspective, at least, any tools that make it easier for users to diagnose issues will also likely aid the developers when users do inevitably hit problems that require a bug report, but if we want these debugging tools to be accessible to everyone, their UX becomes much more important. I think `conduit check` is a good example of the right direction for this kind of thing.

Somewhat unrelatedly, I've also been thinking a bit about improving the readability and understandability of log messages across Conduit's components, and making them easier to access. But that might deserve its own discussion entirely.


Oliver Gould <v...@...>
 

On Tue, Feb 20, 2018 at 2:31 PM Eliza Weisman <el...@...> wrote:
From my perspective, at least, any tools that make it easier for users to diagnose issues will also likely aid the developers when users do inevitably hit problems that require a bug report, but if we want these debugging tools to be accessible to everyone, their UX becomes much more important. I think `conduit check` is a good example of the right direction for this kind of thing.

I think we're basically in violent agreement about this.

I wasn't being hyperbolic when I asked what specific questions the user wants to ask -- we'll build a much more useful tool if we can answer this question well.

Alex and I started to think through this a few weeks ago but put it on hold to focus on some other telemetry work.  Here's the short list of things we came up with then. Let's expand on this in the doc: https://docs.google.com/document/d/15rj21UWATZwnL34l06Y_CoRm-DEsHx6_EpgMXhxCSQM/edit#


Eliza Weisman <el...@...>
 

Thanks for the link, I wasn't aware of that document previously. Will take a look!


Phil Calçado <ph...@...>
 

The way I am reading this thread, I see two things being discussed:

1) A way to figure out what a proxy thinks is the state of the world right now
2) A way to track errors as they happen

(1) seems to be what client_state.json does for Linkerd. I think we can build something like this incrementally, and make sure that every time we add a new feature the equivalent data is added there.

(2) seems to be what the document Oliver linked to covers. It seems to me that the best approach to deal with this would be structured logging that can then be consumed by various tools.


But stepping out of the solutions domain into the problem domain, as mentioned previously there are two personas here. 

One is a "support engineer", someone from the Conduit community or a company's own devtools team who is trying to help a service owner debug an issue. The other persona is the service owner, who would like to know what is going on ASAP and on their own, without having to bug anybody else.

Now that we generally agree that this is a good feature to have, I would suggest we "start with the consumer" and spec this feature from what these personas need/want to accomplish instead of what technology or data we might have available.

I am more than happy to help spec these user stories if people think this is generally a good idea.

