Recommendations for Open Source Analytic (OLAP) system / API to mine Thanos/Prometheus data.


Bartłomiej Płotka
 

Hi SIG Observability! 👋

I recently noticed that many of CNCF's Prometheus and Thanos users want to use the metric data collected by Prometheus for more advanced analytics: something more suitable for Business Intelligence / OLAP use cases.

As the Prometheus maintainers, we designed the Prometheus Query API and PromQL for realtime monitoring, or at most for simple analytics. They are far from efficient for Data Mining or Data Exploration.

I feel there are two things we are missing in the CNCF space: 

1. Please tell me if I am wrong here, but I don't see any particular BI/OLAP open source project in the CNCF space. If there is none, I think that as CNCF SIG Observability we have an opportunity to encourage such a project to either join or at least integrate more closely with the community. Do you think we, as CNCF SIG Observability, should be doing this? 🤔

2. Metric data, especially if you have years of it thanks to Thanos or Cortex, is an amazing source of information. In the Thanos community, we are actively looking for a project that will fit most of the requirements stated here. Are you currently a user of some Open Source OLAP system worth recommending? If yes, which one? Would you like to have good integration of such a system with metrics?

We are looking for your feedback, preferably on this GitHub issue: https://github.com/thanos-io/thanos/issues/2682. I also plan to put this topic on the next SIG agenda if we have time for it. 🤗

Kind Regards and have a good weekend!
Bartek


Ricardo Aravena
 

Bartek,

This is a great idea. Keep in mind that OLAPs are not necessarily used for monitoring and observability. For example, in the past, I worked on implementing Apache Druid to collect mobile analytics. In this space, I can think of these projects: Druid, Pinot, Kylin, ClickHouse, Mondrian, Cubes (there might be others). Druid, Pinot, and Kylin are already part of the Apache Foundation, so that leaves others that we could approach to join the CNCF.

Having said that, because OLAP systems can be quite complex, there are multiple components that may fall into the scope of other CNCF SIGs. For example, storing historical data (SIG-Storage), running your batch processor workers (SIG-Runtime), serving your real-time and historical data (SIG-Network).

In any case, it would be great to approach the different projects so that the CNCF community becomes aware of how OLAPs work, and to foster general interest.

Ricardo



On Fri, May 29, 2020 at 9:01 AM Bartłomiej Płotka <bwplotka@...> wrote:


Rob Skillington
 

Hey Bartek,

Glad to hear this topic brought up; it's something we think a lot about and have some experience with at Uber (running OLAP queries against monitoring and observability data).

1. With respect to SIG Observability, I think talking and moving forward on options/standardized approaches to OLAP on monitoring and observability data makes sense. With regards to BI/OLAP in general, I would say that SIG Observability should not be focused on this space and would probably be better served by a dedicated data engineering SIG.

2. At Uber, we ETL'd the subsets of data that users wanted to run large-scale processing on into an existing platform. The data warehouse supported Spark and Presto for interactive queries (i.e. pull raw data matching the query at query time) and HDFS (ingest raw data as it arrives via Kafka into HDFS and ETL/query there).

I'd love to see a project that was Prometheus Remote Read -> Spark for interactive or batch ETL against Prometheus data. Also Prometheus Remote Read -> Presto could be interesting, although Presto focuses more on interactive queries vs say Spark.

The major issue with other systems in this space tends to be owning the whole data pipeline that results, e.g. Thanos/Cortex/M3/ecosystem would need to support an ongoing export of data into another stateful system such as Druid, Pinot, ClickHouse, etc. You then also have to store the data in these other warehouses with smart policies; otherwise a lot of users end up just whitelisting all of the data to be warehoused. Typically this ends up with really large datasets existing in two different systems and a significant investment to keep the pipeline flowing between the two.

That is why I find projects that support interactive queries and ETL directly against the dataset in the Prometheus metrics store itself, and then save the results elsewhere, quite interesting, rather than ones that warehouse the whole dataset themselves.
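
As a rough illustration of operating on data read straight from the metrics store, here is a minimal Python sketch that flattens a Prometheus range-query result into one row per sample, a shape that batch or interactive engines can load. It is only a sketch under assumptions: it uses the JSON shape of the HTTP API's `query_range` response rather than Remote Read (which is protobuf-based and would need generated stubs), and the function name, row layout, and payload are made up for this example.

```python
from typing import Any, Dict, Iterator


def matrix_to_rows(response: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Flatten a `query_range` JSON response into one row per sample."""
    data = response.get("data", {})
    if response.get("status") != "success" or data.get("resultType") != "matrix":
        raise ValueError("expected a successful matrix result")
    for series in data["result"]:
        labels = dict(series["metric"])
        # The metric name is carried as the reserved __name__ label.
        name = labels.pop("__name__", "")
        for timestamp, value in series["values"]:
            yield {
                "metric": name,
                "labels": dict(labels),
                "timestamp": float(timestamp),
                "value": float(value),  # sample values arrive as strings
            }


# Hypothetical payload in the shape returned by /api/v1/query_range.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {
                "metric": {"__name__": "http_requests_total", "job": "api"},
                "values": [[1590742860, "10"], [1590742920, "25"]],
            }
        ],
    },
}

rows = list(matrix_to_rows(sample_response))
```

Because the function is a generator, a caller can feed rows to a downstream job without first materializing the whole result in memory.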

Best,
Rob


On Fri, May 29, 2020 at 12:50 PM Ricardo Aravena <raravena80@...> wrote:


Bartłomiej Płotka
 

Ricardo, Rob, thanks for the answers so far! (:

Ricardo:

> This is a great idea. Keep in mind that OLAPs are not necessarily used for monitoring and observability. 

Yup, we already have solid projects in the observability space. However, business-oriented analytics results are one of the best use cases/outcomes of the long-term observability data we collect for monitoring needs, right? (: It's quite an amazing side effect and benefit of collecting such data. The idea is to connect both worlds through integrations and Open APIs.

> In this space, I can think of these projects: Druid, Pinot, Kylin, ClickHouse, Mondrian, Cubes (there might be others). Druid, Pinot, and Kylin are already part of the Apache Foundation, so that leaves others that we could approach to join the CNCF.

Thanks for those examples; a recommendation for using any of those would be amazing as well. What I will say might be controversial, but the goal of this initiative is NOT to steal projects or compete with Apache. It's actually the opposite: integrate better with the most promising open-source systems that solve the community's use cases. I think if we can encourage some amazing project to join CNCF, that's great, but IMO CNCF is only here to help projects that need help. If Druid or others already have help from other organizations, that's fine; does it matter to us? (: In my opinion, what matters is that a promising project has the help, funding, and support it needs.

> For example, storing historical data (SIG-Storage), running your batch processor workers (SIG-Runtime), serving your real-time and historical data (SIG-Network).

I agree this is very connected. However, my honest opinion is that Analytics, even in an OLAP-based fashion, overlaps a little bit with SIG-Observability, and this is why I am interested in finding some solutions for our communities.

Rob:
> experience with it at Uber (running OLAP queries against monitoring and observability data).

Yes! This is what we are looking for - production-grade experience and recommendations for this.

> 1. With respect to SIG Observability, I think talking and moving forward on options/standardized approaches to OLAP on monitoring and observability data makes sense. With regards to BI/OLAP in general, I would say that SIG Observability should not be focused on this space and would probably be better served by a dedicated data engineering SIG.

I tend to agree. However, given that no one has started SIG-BigData yet, and given that observability data is quite an enormous source of meaningful information, I would love to explore at least API and integration possibilities here. Maybe I'm wrong (:

> 2. At Uber, we ETL'd the subsets of data that users wanted to run large-scale processing on into an existing platform. The data warehouse supported Spark and Presto for interactive queries (i.e. pull raw data matching the query at query time) and HDFS (ingest raw data as it arrives via Kafka into HDFS and ETL/query there).

Awesome, good examples. It's worth revisiting those amazing projects and their integrations (Spark, Presto, Hadoop).

> I'd love to see a project that was Prometheus Remote Read -> Spark for interactive or batch ETL against Prometheus data. Also Prometheus Remote Read -> Presto could be interesting, although Presto focuses more on interactive queries vs say Spark.

Yes! Does anyone have more info about that Spark integration? I remember some teams at Red Hat are already using Presto on Thanos data; I might try to find more information on that as well. (:

> The major issue with other systems in this space tends to be owning the whole data pipeline that results, e.g. Thanos/Cortex/M3/ecosystem would need to support an ongoing export of data into another stateful system such as Druid, Pinot, ClickHouse, etc. You then also have to store the data in these other warehouses with smart policies; otherwise a lot of users end up just whitelisting all of the data to be warehoused. Typically this ends up with really large datasets existing in two different systems and a significant investment to keep the pipeline flowing between the two.
> That is why I find projects that support interactive queries and ETL directly against the dataset in the Prometheus metrics store itself, and then save the results elsewhere, quite interesting, rather than ones that warehouse the whole dataset themselves.


Yes! This is actually the amazing novelty we would love to push toward as well. Instead of storing the same data in 5 places, can we keep it in just one? The idea would be to promote an efficient streaming read API rather than copying the data into different formats. I mentioned this in one of the requirements here. This might mean more work on those Thanos/Cortex/M3/ecosystem projects, but given we are already collaborating, it might be easier (: This is along the lines of what we are trying to push in the metrics/logs/tracing world, as mentioned by my teammate Frederic: can we reuse a similar index for those three, since we collect different data but from the same resources?
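
To make the "stream instead of copy" idea concrete, here is a small Python sketch of an analytics job that folds over chunks as they arrive instead of first copying the full dataset somewhere else. It is illustrative only: `read_chunks` is a hypothetical stand-in for a streaming read API (it just slices an in-memory list), and the sample layout and names are assumptions, not an existing Thanos or Prometheus interface.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Sample = Tuple[int, float]  # (timestamp in milliseconds, value)


def read_chunks(samples: List[Sample], chunk_size: int) -> Iterator[List[Sample]]:
    """Hypothetical stand-in for a streaming read API: yields small chunks."""
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


def daily_max(chunks: Iterable[List[Sample]]) -> Dict[int, float]:
    """Fold chunks into a per-day maximum, one chunk in memory at a time."""
    day_ms = 24 * 60 * 60 * 1000
    result: Dict[int, float] = {}
    for chunk in chunks:
        for ts_ms, value in chunk:
            day = ts_ms // day_ms  # day index since the Unix epoch
            if day not in result or value > result[day]:
                result[day] = value
    return result


stored = [(0, 1.0), (3_600_000, 4.0), (86_400_000, 2.0), (90_000_000, 0.5)]
maxima = daily_max(read_chunks(stored, chunk_size=2))
```

The aggregate is the only thing kept around; the raw samples stay in the one place they are already stored.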

Kind Regards,
Bartek

On Fri, 29 May 2020 at 18:16, Rob Skillington <rob@...> wrote:


Richard Hartmann
 

Hijacking top email to reply across the board.

As many of you will know, I have been nagging Prometheus-team about this for years, so yes, I think we should cover this.

At PromCon 2017's dev summit hallway track, we talked about connectors to existing data analysis, e.g. an R interface to natively access data stored in Prometheus format. Thanos' block storage would solve a lot of pain points, and Prometheus' remote read/write API is another obvious immediate attach point. Also, at around the same time, I started a discussion about extending PromQL in this direction, a discussion which never went anywhere, but which I can see being revived.

I disagree that the topic should be death-by-committee'd day 1 by splitting it across several SIGs. Concerted effort and input from subject-matter experts is good, though. But get something off the ground first before making it more cumbersome.

Overall, I think it's something which we should at least take a look at in the context of this SIG. Deeper analysis of data definitely falls under o11y.


Best,
Richard


On Fri, May 29, 2020 at 6:00 PM Bartłomiej Płotka <bwplotka@...> wrote:


Rob Skillington
 

Hey Richi,

With respect to using R with Prometheus data, it is possible to write R programs against both Spark and Presto. Spark is much more mature in this space, however. SparkR can run R functions over large datasets and has several machine learning algorithms already packaged for use with its native types, plus model persistence with MLlib:
https://spark.apache.org/docs/latest/sparkr.html

I agree that integration of OLAP for observability and monitoring data can and should be led by SIG-Observability. And agreed on deeper analysis of data falling within Observability.

I thought there was a question of whether SIG-Observability should drive general-purpose OLAP/BI within CNCF, which I thought might not be best, since data engineering is a huge space and there are many non-Observability use cases of OLAP/BI that would be better served by a SIG focused solely on driving work in that area.

Just my two cents though, maybe at first it could be folded into SIG-Observability and later broken out perhaps.

Rob


On Tue, Jun 2, 2020 at 8:32 AM Richard Hartmann <richih@...> wrote:


RichiH Hartmann
 

On Wed, Jun 3, 2020 at 3:38 PM Rob Skillington <rob@...> wrote:
> Just my two cents though, maybe at first it could be folded into SIG-Observability and later broken out perhaps.

The IETF, RIPE, and Prometheus model is to make something that works first, and then start branching out from there. So yes, I would strongly argue it should live somewhere related first, and then take on a life of its own.

NB: This is not SIG IETF, SIG RIPE, or SIG Prometheus. Just going on what I have seen work in the past.


Best
Richard


Bartłomiej Płotka
 

BTW, it looks like Dan also suggests that Analytics is currently in scope of our SIGs: https://github.com/cncf/landscape/issues/1632#issuecomment-638763810 (:

Kind Regards,
Bartek

On Thu, 4 Jun 2020 at 15:40, RichiH Hartmann via lists.cncf.io <richih=grafana.com@...> wrote:


Ricardo Aravena
 

> Thanks for those examples; a recommendation for using any of those would be amazing as well. What I will say might be controversial, but the goal of this initiative is NOT to steal projects or compete with Apache. It's actually the opposite: integrate better with the most promising open-source systems that solve the community's use cases. I think if we can encourage some amazing project to join CNCF, that's great, but IMO CNCF is only here to help projects that need help. If Druid or others already have help from other organizations, that's fine; does it matter to us? (: In my opinion, what matters is that a promising project has the help, funding, and support it needs.

I like Druid; they are backed by https://imply.io too. I think you have multiple options for integrating Prometheus or other time-series databases with these systems. One aspect would be to just support Lambda (batch/streaming), Kappa (all-in-one streaming), or both architectures.

At the batch layer, to create a warehouse, one way would be to support exporting from the TSDB to standard data formats, e.g. Parquet, Avro, Arrow (in-memory columnar), Protobuf, etc. For streaming, you can support the TSDB publishing to Kafka. Then you could orchestrate everything with Spark or Flink (batch and streaming), on top of K8s of course :-). Because there are many projects, and no one size fits all (data sources come in all different forms), I think it would be great to come up with a reference architecture that works for Prometheus (assuming all the integration points have been created).
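
For the batch export path, columnar formats such as Parquet and Arrow store one array per field rather than one record per sample. The sketch below shows that pivot in plain Python, as a possible first step before handing the columns to a real Parquet or Arrow writer (not shown here). The row layout, the `label_` column prefix, and the function name are assumptions made for this example, not an existing export format.

```python
from typing import Any, Dict, List


def rows_to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot row-oriented samples into column-oriented lists, one per field."""
    label_names = sorted({name for row in rows for name in row["labels"]})
    columns: Dict[str, List[Any]] = {"timestamp": [], "value": []}
    for name in label_names:
        columns["label_" + name] = []
    for row in rows:
        columns["timestamp"].append(row["timestamp"])
        columns["value"].append(row["value"])
        for name in label_names:
            # Series lacking a label get None so all columns stay equal length.
            columns["label_" + name].append(row["labels"].get(name))
    return columns


rows = [
    {"timestamp": 1590742860, "value": 10.0, "labels": {"job": "api", "zone": "eu"}},
    {"timestamp": 1590742920, "value": 25.0, "labels": {"job": "db"}},
]
columns = rows_to_columns(rows)
```

Turning each label into its own column keeps filters such as "job equals api" cheap in a columnar engine, at the cost of sparse columns when series carry very different label sets.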

It would also be super interesting to understand how some of the observability vendors have implemented their systems if they'd like to share. Although, it might be too much to ask since that typically constitutes a lot of their bread and butter :-) 

On Sat, May 30, 2020 at 5:44 AM Bartłomiej Płotka <bwplotka@...> wrote:
Ricardo, Rob thanks for the answers so far! (: 

Ricardo:

> This is a great idea. Keep in mind that OLAPs are not necessarily used for monitoring and observability. 

Yup, we have already solid projects in the observability space. However, business-oriented analytics results are one of the best use cases/outcome of the long term observability data we collect for monitoring needs, right? (: It's quite an amazing side effect and benefit of collecting such data. The idea is to connect both worlds through integrations and Open APIs.

> In this space, I can think of these projects: DruidPinotKylinClickhouseModrianCubes (There might be others)  Druid, Pinot, and Kylin are already part of the Apache Foundation so that leaves others that we could approach to join the CNCF.

Thanks for those examples, a recommendation of using those would be amazing as well. What I will say might be controversial, but the goal of this initiative is NOT to steal projects or compete with Apache. It's actually the opposite: Integrate better with the most promising open-source systems that solve the community use cases. I think if we can encourage some amazing project to join CNCF that's great, but IMO CNCF is only here to help project that needs help. If Druid or others already have helped from other organizations, that's fine, does it matter for us? (: In my opinion, what matters is that the promising project has the help, funding, and support it needs.

> For example, storing historical data (SIG-Storage), running your batch processor workers (SIG-Runtime), serving your real-time and historical data (SIG-Network).

I agree this is very connected. However, my honest opinion is that Analytics, even in OLAP based fashion overlaps a little bit with SIG-Observability and this is why I am interested to find some solutions for our communities. 

Rob:
> experience with it at Uber (running OLAP queries against monitoring and observability data).

Yes! This is what we are looking for - production-grade experience and recommendations for this.

> 1. With respect to SIG Observability, I think talking and moving forward on options/standardized approaches to OLAP on monitoring and observability data makes sense. With regards to BI/OLAP in general, I would say that SIG Observability should not be focused on this space and would probably be better served by a dedicated data engineering SIG.

I tend to agree, however, given no one started SIG-BigData yet and given the observability data is quite an enormous source of meaningful information, I would love to explore at least API and integrations possibilities here. Maybe I'm wrong (: 

> 2. At Uber we ETL'd subsets of data users wanted to do large processing on into an existing platform. The data warehouse supported Spark and Presto for interactive queries (i.e. pull raw data matching query at at query time) and HDFS (ingest raw data as it arrives via Kafka into HDFS and ETL/query there).

Awesome, good examples, worth to revisit those amazing projects and integrations there (Spark, Presto, Hadoop)

> I'd love to see a project that was Prometheus Remote Read -> Spark for interactive or batch ETL against Prometheus data. Also Prometheus Remote Read -> Presto could be interesting, although Presto focuses more on interactive queries vs say Spark.

Yes! Does anyone have more info about that Spark integration? I remember some teams are using Presto on Thanos data already at Red Hat,  I might try to find more information on that as well. (: 

> The major issue with other systems in this space tends to be owning the whole data pipeline that results, e.g. Thanos/Cortex/M3/ecosystem would need to support an ongoing export of data into another stateful system such as Druid, Pinot, Clickhouse, etc. You also then have to now store the data in these other warehouses with smart policies, otherwise a lot of users end up just whitelisting all of the data to be warehoused. Typically this ends up with really large datasets existing in two different systems and a significant investment to keep the pipeline flowing between the two. 
>
> That is why I think seeing projects that support interactive and ETL that operate on the dataset from the Prometheus metrics store itself and then save elsewhere being quite interesting, rather than warehouse the whole dataset themselves.


Yes! This is exactly the novelty we would love to push toward as well. Instead of storing the same data in 5 places, can we keep it in just one? The idea would be to promote an efficient streaming read API rather than copying the data into different formats. I mentioned this in one of the requirements here. This might mean more work for the Thanos/Cortex/M3/ecosystem projects, but given we are already collaborating, it might be easier (: This is along the lines of what we try to push in the metrics/logs/tracing world, as mentioned by my teammate Frederic: can we reuse a similar index for all three, since we collect different data but from the same resources?
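
To make the "read in place instead of copy" idea a bit more concrete, here is a minimal Python sketch. Everything in it is hypothetical: `stream_series` stands in for whatever streaming read API a metrics store might expose (it is not an actual Thanos, Cortex, or M3 call), and the in-memory `STORE` stands in for the store itself. The only point is that an analytic job can consume samples chunk by chunk and keep just the small aggregate, rather than warehousing the whole dataset a second time.

```python
from typing import Dict, Iterator, List, Tuple

Sample = Tuple[int, float]  # (timestamp in seconds, value)

# Hypothetical stand-in for the metrics store: series name -> samples.
STORE: Dict[str, List[Sample]] = {
    "http_requests_total": [(0, 1.0), (30, 2.0), (60, 4.0), (90, 8.0), (120, 16.0)],
}


def stream_series(name: str, start: int, end: int,
                  chunk_size: int = 2) -> Iterator[List[Sample]]:
    """Yield samples with start <= timestamp < end in small chunks.

    Assumption: a real store would stream these over the network; here
    we only slice an in-memory list to imitate that behaviour.
    """
    selected = [s for s in STORE.get(name, []) if start <= s[0] < end]
    for i in range(0, len(selected), chunk_size):
        yield selected[i:i + chunk_size]


def max_per_window(name: str, start: int, end: int,
                   window: int) -> Dict[int, float]:
    """Consume the stream and keep only one maximum per time window."""
    result: Dict[int, float] = {}
    for chunk in stream_series(name, start, end):
        for ts, value in chunk:
            bucket = ts - ts % window  # start of the window this sample falls in
            result[bucket] = max(value, result.get(bucket, float("-inf")))
    return result


if __name__ == "__main__":
    # Only this small aggregate would be saved elsewhere, not the raw samples.
    print(max_per_window("http_requests_total", 0, 150, 60))
```

The aggregate is the only thing that leaves the store; the raw samples stay where they already are.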

Kind Regards,
Bartek
