Telemetry Meeting Notes


Alex Leong <al...@...>
 

Today we had a meeting to strategize on the direction we want to take with Conduit's telemetry and to figure out concrete next steps.

Attendees: Oliver Gould, Andrew Seigner, Risha Mars, Kevin Lingerfelt, Alex Leong

With Prometheus now scraping telemetry data from the proxies directly, we have much richer data available than is currently displayed in the web UI and CLI.  Furthermore, reworking the existing UIs to use the new data is cumbersome because this data needs to be plumbed through the public-api.  Therefore, we'd like to take a less iterative approach where we build brand new views for the web UI and CLI rather than updating the existing ones.  We can also speed development and improve flexibility by having these clients query Prometheus directly instead of using an API that papers over it.

To start, we will focus on the `conduit stat` command.  By updating this command to make queries directly to Prometheus, we should be able to get top-line metrics (request volume, success rate, latency) about any Kubernetes object in the mesh (namespace, service, deployment, job, etc.).  The web UI will follow the same pattern and should also display top-line metrics about all applicable Kubernetes objects.
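
As a rough illustration (not something we committed to in the meeting), the PromQL that `conduit stat` issues might look like the sketch below.  The metric and label names (`request_total`, `response_total`, `response_latency_ms_bucket`, `classification`) are placeholders until we confirm what the proxy actually exports:

```rust
/// Builds the Prometheus queries `conduit stat` might send for one label
/// selector (e.g. `namespace="emojivoto"`) over a time window (e.g. `1m`).
/// Metric names are illustrative placeholders, not confirmed proxy metrics.
fn top_line_queries(selector: &str, window: &str) -> (String, String, String) {
    // Request volume: per-second request rate over the window.
    let volume = format!("sum(rate(request_total{{{selector}}}[{window}]))");

    // Success rate: successful responses divided by all responses.
    let success_rate = format!(
        "sum(rate(response_total{{classification=\"success\",{selector}}}[{window}])) \
         / sum(rate(response_total{{{selector}}}[{window}]))"
    );

    // p99 latency from the latency histogram buckets.
    let latency_p99 = format!(
        "histogram_quantile(0.99, \
         sum(rate(response_latency_ms_bucket{{{selector}}}[{window}])) by (le))"
    );

    (volume, success_rate, latency_p99)
}
```

For example, `top_line_queries(r#"namespace="emojivoto""#, "1m")` would produce the three query strings for that namespace over a one-minute window; the same kind of helper should work for deployments, services, and other Kubernetes objects by swapping the selector.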

We also discussed the ListPods API which is used to determine how many pods have been connected to the mesh.  This functionality can also move out of the API and into the CLI and web service directly.  In order to accurately determine if a pod is connected to the mesh, having the proxy expose a `start-time` metric would be helpful.
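
For illustration only (the metric name and helper below are assumptions, not something we decided), exposing a start-time gauge from the proxy's metrics endpoint could be as simple as:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Appends a hypothetical start-time gauge to the proxy's Prometheus
/// exposition output. Reporting the proxy's start time as a Unix timestamp
/// lets the CLI and web UI tell whether a pod's proxy is actually up, and
/// therefore whether the pod is connected to the mesh.
fn render_start_time_metric(out: &mut String, started_at: SystemTime) {
    let secs = started_at
        .duration_since(UNIX_EPOCH)
        .expect("start time is after the Unix epoch")
        .as_secs();
    out.push_str("# HELP process_start_time_seconds Time the proxy started, in seconds since the Unix epoch.\n");
    out.push_str("# TYPE process_start_time_seconds gauge\n");
    out.push_str(&format!("process_start_time_seconds {}\n", secs));
}
```

The CLI could then compare the reported start time against the pod's creation time (or simply check that the metric is present and being scraped) to decide whether the pod counts as meshed.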

Next steps:
- Add a `start-time` metric to the proxy
- Write a spec for the `conduit stat` command
- Begin determining what Prometheus queries are needed to retrieve the data we want, using simulate-proxy as a mock source of data


Phil Calçado <ph...@...>
 

Thanks for the notes, Alex.

So that I understand this better, are we saying that clients (the CLI especially) will talk to other back-ends that aren't the API (in this case Prometheus, but I think this is a case of the zero-one-infinity rule[1])? Is this meant as a temporary solution for the telemetry project, or have we decided to move away from the architecture we have right now, which I understand to mean that clients should be dumb and logic should be consolidated in the API?

FWIW, preempting a positive answer, I have several concerns about smart clients, including keeping versions in sync between the CLI and back-ends, and am generally of the opinion that we should follow the kubectl/Kubernetes-CLI-SIG approach of having as much as possible done on the server side. I also believe we should have one API gateway that clients talk to, so that they don't have to make assumptions about the topology of the services in the control plane.

Cheers



Oliver Gould <v...@...>
 



On Tue, Mar 27, 2018 at 3:16 PM Phil Calçado <ph...@...> wrote:
So that I understand this better, are we saying that clients (the CLI especially) will talk to other back-ends that aren't the API (in this case Prometheus, but I think this is a case of the zero-one-infinity rule[1])? Is this meant as a temporary solution for the telemetry project, or have we decided to move away from the architecture we have right now, which I understand to mean that clients should be dumb and logic should be consolidated in the API?

I'd like for us to eventually have these sorts of APIs in the controller, but in the short term, I think doing so is premature abstraction.  For now, we need to start modeling Kubernetes resource types more fully -- talking about namespaces, replicationcontrollers, daemonsets, statefulsets, etc.  I don't think Conduit's API should have to fully model Kubernetes' API. I also think we need to be very careful about trying to abstract over it, as these abstractions are lossy, and changing them requires some care and attention.  Same goes for Prometheus -- what meaningful API can we put over Prometheus that makes it easier to interact with? I'm not convinced we're in the best place to do that right now, and I don't see a lot of value in simply modeling the Prometheus query API over gRPC.
 
FWIW, preempting a positive answer, I have several concerns about smart clients, including keeping versions in sync between the CLI and back-ends, and am generally of the opinion that we should follow the kubectl/Kubernetes-CLI-SIG approach of having as much as possible done on the server side. I also believe we should have one API gateway that clients talk to, so that they don't have to make assumptions about the topology of the services in the control plane.

I agree with the general idea that this logic should be serverside -- but that API isn't clear yet.  Until it is, I think it's a fair tradeoff to implement this as library functionality that is used in both the CLI and web.  Once we understand the use cases well enough that our API simplifies interacting with the backing systems, I think we should formalize it.  Until then, less is more.
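
Concretely, the kind of shared surface I have in mind might look roughly like this -- purely illustrative, none of these names are a proposal:

```rust
/// Illustrative sketch of a shared "stat" library the CLI and web service
/// could both use instead of each talking to Prometheus ad hoc. Nothing here
/// is a committed interface.
struct StatSummary {
    resource: String,    // e.g. "deployment/web"
    request_rate: f64,   // requests per second
    success_rate: f64,   // 0.0..=1.0
    latency_p99_ms: f64, // 99th-percentile latency in milliseconds
}

trait StatClient {
    type Error;

    /// Fetches top-line stats for every resource of `kind` (namespace,
    /// deployment, ...) matching `selector`, over the given time window.
    fn stat(&self, kind: &str, selector: &str, window: &str)
        -> Result<Vec<StatSummary>, Self::Error>;
}
```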




Oliver Gould <v...@...>
 

Although, as we implement this, I hope we find that there's a simple API that does enough of what we need... And, if that's the case, we should just keep this serverside. I don't think it's important to start with this logic in the controller, though.




Brian Smith <br...@...>
 

Oliver Gould <v...@buoyant.io> wrote:
I don't think Conduit's API should have to fully model Kubernetes' API. I also think we need to be very careful about trying to abstract over it, as these abstractions are lossy, and changing them requires some care and attention.  Same goes for Prometheus -- what meaningful API can we put over Prometheus that makes it easier to interact with? I'm not convinced we're in the best place to do that right now, and I don't see a lot of value in simply modeling the Prometheus query API over gRPC.

OTOH we could embrace lossy abstractions. I would expect the public API to look something like:

```rust
// `Error` is left abstract in this sketch; in practice it would wrap
// whatever can go wrong while talking to the backing store.
type Error = Box<dyn std::error::Error>;

fn query(query: &str) -> Result<Vec<Table>, Error> {
    // Issue `query` against Prometheus and turn the response into tables.
    unimplemented!()
}

struct Table {
    title: String,
    columns: Vec<Column>,
    cells: Vec<Vec<Value>>, // columns within rows
}

struct Column {
    heading: String,
    ty: Type, // `type` is a reserved word in Rust, hence `ty`
}

enum Value { String(String), Float(f64), /* ... */ }
enum Type { String, Float }
```

The web UI and CLI would pass the query as a string and then render the result tables either in text (CLI) or HTML (web UI).

IIUC, kubectl already does this, but using a different data model more like JSON/YAML, which is why it can render arbitrary kinds of objects with minimal configuration.
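
To illustrate the CLI half of that, a hypothetical renderer over the sketch above could be as simple as:

```rust
/// Hypothetical CLI-side renderer for the `Table` sketch above: prints the
/// title, the column headings, and one tab-separated line per row.
fn render_text(table: &Table) -> String {
    let mut out = format!("{}\n", table.title);
    let headings: Vec<&str> = table.columns.iter().map(|c| c.heading.as_str()).collect();
    out.push_str(&headings.join("\t"));
    out.push('\n');
    for row in &table.cells {
        let cells: Vec<String> = row
            .iter()
            .map(|v| match v {
                Value::String(s) => s.clone(),
                Value::Float(f) => format!("{:.2}", f),
            })
            .collect();
        out.push_str(&cells.join("\t"));
        out.push('\n');
    }
    out
}
```

The web UI would do the equivalent with HTML table elements.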

I agree with the general idea that this logic should be serverside -- but that API isn't clear yet.  Until it is, I think it's a fair tradeoff to implement this as library functionality that is used in both the CLI and web.  Once we understand the use cases well enough that our API simplifies interacting with the backing systems, I think we should formalize it.  Until then, less is more.

I think it is fine to iterate on things using the "smart client" model and then figure out a way to abstract things into a "dumb client" model once we have the UI worked out, if it makes sense to even abstract out the commonality.

One positive thing about this change w.r.t. security is that we don't have to worry about delegating any authentication or authorization, because all the access control will be done by Kubernetes. We'll need to document what roles users will need to have to be able to use each CLI command; this should be easy to do since we already gave those roles to the public API.

One negative thing about this change w.r.t. security is that we're giving people the ability to send arbitrary queries to Prometheus (IIUC), which means lots of exposure to any security bugs in Prometheus. However, we can address this later, once we understand users' needs better.

Cheers,
Brian