Re: Recommendations for an Open Source Analytics (OLAP) system / API to mine Thanos/Prometheus data.


Ricardo Aravena
 

> Thanks for those examples; a recommendation on using those would be amazing as well. What I will say might be controversial, but the goal of this initiative is NOT to steal projects or compete with Apache. It's actually the opposite: integrate better with the most promising open-source systems that solve the community's use cases. I think it's great if we can encourage some amazing project to join the CNCF, but IMO the CNCF is only here to help projects that need help. If Druid or others already have help from other organizations, that's fine; does it matter for us? (: In my opinion, what matters is that the promising project has the help, funding, and support it needs.

I like Druid; it is backed by https://imply.io too. I think you have multiple options for integrating Prometheus or other time-series databases with these systems. One decision would be whether to support a Lambda (batch + streaming) architecture, a Kappa (all-in-one streaming) architecture, or both.

At the batch layer, and to create a warehouse, one way would be to support exporting from the TSDB to standard data formats, e.g. Parquet, Avro, Arrow (in-memory columnar), Protobuf, etc. For streaming, you could have the TSDB publish to Kafka. Then you could orchestrate everything with Spark or Flink (batch and streaming), on top of K8s of course :-). Because there are many projects and no one size fits all (data sources come in all different forms), I think it would be great to come up with a reference architecture that works for Prometheus (assuming all the integration points have been created).
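As a rough sketch of how the batch leg could then be used (assuming a hypothetical Parquet export with metric name, label, timestamp, and value columns; no such export integration exists yet as far as I know), Spark can treat the long-term metric data like any other warehouse table:

```python
# Sketch only: assumes Prometheus/Thanos data was already exported to Parquet
# with columns such as metric_name, namespace (a label), timestamp (ms), value.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tsdb-parquet-olap").getOrCreate()

# Hypothetical export location; the export job itself is the missing integration point.
samples = spark.read.parquet("s3a://metrics-warehouse/prometheus/2020/05/")

# Example OLAP-style question: approximate daily p99 per namespace for one metric.
daily = (
    samples
    .filter(F.col("metric_name") == "container_cpu_usage_seconds_total")
    .withColumn("day", F.to_date(F.from_unixtime(F.col("timestamp") / 1000)))
    .groupBy("day", "namespace")
    .agg(F.expr("percentile_approx(value, 0.99)").alias("p99_value"))
)
daily.show()
```

The same exported dataset could equally be queried from Presto or Flink without keeping yet another copy.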

It would also be super interesting to understand how some of the observability vendors have implemented their systems, if they'd like to share. Although it might be too much to ask, since that typically constitutes a lot of their bread and butter :-)

On Sat, May 30, 2020 at 5:44 AM Bartłomiej Płotka <bwplotka@...> wrote:
Ricardo, Rob thanks for the answers so far! (: 

Ricardo:

> This is a great idea. Keep in mind that OLAPs are not necessarily used for monitoring and observability. 

Yup, we already have solid projects in the observability space. However, business-oriented analytics is one of the best use cases/outcomes of the long-term observability data we collect for monitoring needs, right? (: It's quite an amazing side effect and benefit of collecting such data. The idea is to connect both worlds through integrations and open APIs.

> In this space, I can think of these projects: Druid, Pinot, Kylin, Clickhouse, Mondrian, Cubes (there might be others). Druid, Pinot, and Kylin are already part of the Apache Foundation, so that leaves others that we could approach to join the CNCF.

Thanks for those examples; a recommendation on using those would be amazing as well. What I will say might be controversial, but the goal of this initiative is NOT to steal projects or compete with Apache. It's actually the opposite: integrate better with the most promising open-source systems that solve the community's use cases. I think it's great if we can encourage some amazing project to join the CNCF, but IMO the CNCF is only here to help projects that need help. If Druid or others already have help from other organizations, that's fine; does it matter for us? (: In my opinion, what matters is that the promising project has the help, funding, and support it needs.

> For example, storing historical data (SIG-Storage), running your batch processor workers (SIG-Runtime), serving your real-time and historical data (SIG-Network).

I agree this is very connected. However, my honest opinion is that analytics, even in an OLAP-based fashion, overlaps a little bit with SIG-Observability, and this is why I am interested in finding some solutions for our communities.

Rob:
> experience with it at Uber (running OLAP queries against monitoring and observability data).

Yes! This is what we are looking for - production-grade experience and recommendations for this.

> 1. With respect to SIG Observability, I think talking and moving forward on options/standardized approaches to OLAP on monitoring and observability data makes sense. With regards to BI/OLAP in general, I would say that SIG Observability should not be focused on this space and would probably be better served by a dedicated data engineering SIG.

I tend to agree. However, given that no one has started a SIG-BigData yet, and given that observability data is quite an enormous source of meaningful information, I would love to explore at least the API and integration possibilities here. Maybe I'm wrong (:

> 2. At Uber we ETL'd subsets of data users wanted to do large processing on into an existing platform. The data warehouse supported Spark and Presto for interactive queries (i.e. pull raw data matching the query at query time) and HDFS (ingest raw data as it arrives via Kafka into HDFS and ETL/query there).

Awesome, good examples. It's worth revisiting those amazing projects and the integrations there (Spark, Presto, Hadoop).
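For example, the ingest leg Rob describes could boil down to something like this (just a sketch with hypothetical topic and path names: consume samples from Kafka and land them as Parquet for later Spark/Presto queries):

```python
# Sketch only: consume metric samples from a hypothetical Kafka topic and land
# them as Parquet files for later Spark/Presto/Hive queries.
import json

from kafka import KafkaConsumer  # kafka-python
import pyarrow as pa
import pyarrow.parquet as pq

consumer = KafkaConsumer(
    "metric-samples",                            # hypothetical topic name
    bootstrap_servers="kafka.example.com:9092",  # hypothetical broker
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)

batch, BATCH_SIZE = [], 10_000
for msg in consumer:
    # Each message is assumed to look like {"name": ..., "labels": {...}, "ts": ..., "value": ...}.
    batch.append(msg.value)
    if len(batch) >= BATCH_SIZE:
        table = pa.Table.from_pylist(batch)
        # A real pipeline would write to HDFS/S3 with time-based partitioning.
        pq.write_table(table, f"/warehouse/metrics/part-{msg.offset}.parquet")
        batch.clear()
```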

> I'd love to see a project that was Prometheus Remote Read -> Spark for interactive or batch ETL against Prometheus data. Also Prometheus Remote Read -> Presto could be interesting, although Presto focuses more on interactive queries vs say Spark.

Yes! Does anyone have more info about that Spark integration? I remember some teams at Red Hat are already using Presto on Thanos data; I might try to find more information on that as well. (:
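In the meantime, to make the idea a bit more concrete, here is a minimal sketch of pulling Prometheus data into Spark. It goes through the plain HTTP query_range API against a hypothetical endpoint; a real integration would rather use the remote-read protocol (or a streaming read API) instead of one big HTTP response:

```python
# Sketch only: pull samples via Prometheus' HTTP query_range API and load them
# into Spark for further processing.
import time

import requests
from pyspark.sql import SparkSession

PROM_URL = "http://prometheus.example.com:9090"  # hypothetical endpoint

end = int(time.time())
start = end - 24 * 3600  # last 24 hours
resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    params={"query": "up", "start": start, "end": end, "step": "60s"},
)
resp.raise_for_status()

rows = []
for series in resp.json()["data"]["result"]:
    labels = series["metric"]
    for ts, value in series["values"]:
        rows.append((labels.get("job", ""), labels.get("instance", ""), float(ts), float(value)))

spark = SparkSession.builder.appName("prom-to-spark-sketch").getOrCreate()
df = spark.createDataFrame(rows, ["job", "instance", "timestamp", "value"])
df.groupBy("job").avg("value").show()
```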

> The major issue with other systems in this space tends to be owning the whole data pipeline that results, e.g. Thanos/Cortex/M3/ecosystem would need to support an ongoing export of data into another stateful system such as Druid, Pinot, Clickhouse, etc. You then also have to store the data in these other warehouses with smart policies; otherwise a lot of users end up just whitelisting all of the data to be warehoused. Typically this ends up with really large datasets existing in two different systems and a significant investment to keep the pipeline flowing between the two.
> That is why I think it would be quite interesting to see projects that support interactive queries and ETL operating on the dataset in the Prometheus metrics store itself and then saving results elsewhere, rather than warehousing the whole dataset themselves.


Yes! This is actually the amazing novelty we would love to push toward as well. Instead of storing the same data in 5 places, can we keep it in just one? The idea would be to promote an efficient streaming read API rather than copying the data into different formats. I mentioned this in one of the requirements here. This might mean more work on those Thanos/Cortex/M3/ecosystem projects, but given we are already collaborating, it might be easier (: This is along the lines of what we try to push in the metrics/logs/tracing world, as mentioned by my team colleague Frederic: can we reuse a similar index for those three, since we collect different data... but from the same resources?
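As a strawman for what "read in place instead of copying" could look like from the analytics side (again just a sketch against the plain HTTP API and a hypothetical endpoint, not an existing Thanos/Cortex/M3 feature): the analytics job asks the metrics store for exactly the series it needs and holds them as in-memory Arrow columns, so there is no second persistent copy to retain or expire.

```python
# Sketch only: query the metrics store on demand and keep the result as an
# in-memory Arrow table for analysis, instead of maintaining a warehouse copy.
import requests
import pyarrow as pa
import pyarrow.compute as pc

def read_series(base_url, promql, start, end, step="5m"):
    """Hypothetical helper: one bounded read straight from the metrics store."""
    resp = requests.get(
        f"{base_url}/api/v1/query_range",
        params={"query": promql, "start": start, "end": end, "step": step},
    )
    resp.raise_for_status()
    timestamps, values = [], []
    for series in resp.json()["data"]["result"]:
        for ts, value in series["values"]:
            timestamps.append(float(ts))
            values.append(float(value))
    return pa.table({"timestamp": timestamps, "value": values})

# Usage: no export job, no second dataset with its own retention policies.
table = read_series(
    "http://thanos-query.example.com:9090",   # hypothetical endpoint
    "sum(rate(http_requests_total[5m]))",
    start=1590710400, end=1590796800,         # arbitrary 24h window
)
print(pc.mean(table["value"]))
```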

Kind Regards,
Bartek

On Fri, 29 May 2020 at 18:16, Rob Skillington <rob@...> wrote:
Hey Bartek,

Glad to hear this topic brought up; it's something we think a lot about and have some experience with at Uber (running OLAP queries against monitoring and observability data).

1. With respect to SIG Observability, I think talking and moving forward on options/standardized approaches to OLAP on monitoring and observability data makes sense. With regards to BI/OLAP in general, I would say that SIG Observability should not be focused on this space and would probably be better served by a dedicated data engineering SIG.

2. At Uber we ETL'd subsets of data users wanted to do large processing on into an existing platform. The data warehouse supported Spark and Presto for interactive queries (i.e. pull raw data matching the query at query time) and HDFS (ingest raw data as it arrives via Kafka into HDFS and ETL/query there).

I'd love to see a project that was Prometheus Remote Read -> Spark for interactive or batch ETL against Prometheus data. Also Prometheus Remote Read -> Presto could be interesting, although Presto focuses more on interactive queries vs say Spark.

The major issue with other systems in this space tends to be owning the whole data pipeline that results, e.g. Thanos/Cortex/M3/ecosystem would need to support an ongoing export of data into another stateful system such as Druid, Pinot, Clickhouse, etc. You then also have to store the data in these other warehouses with smart policies; otherwise a lot of users end up just whitelisting all of the data to be warehoused. Typically this ends up with really large datasets existing in two different systems and a significant investment to keep the pipeline flowing between the two.

That is why I think it would be quite interesting to see projects that support interactive queries and ETL operating on the dataset in the Prometheus metrics store itself and then saving results elsewhere, rather than warehousing the whole dataset themselves.

Best,
Rob


On Fri, May 29, 2020 at 12:50 PM Ricardo Aravena <raravena80@...> wrote:
Bartek,

This is a great idea. Keep in mind that OLAPs are not necessarily used for monitoring and observability. For example, in the past, I worked on implementing Apache Druid to collect mobile analytics. In this space, I can think of these projects: Druid, Pinot, Kylin, Clickhouse, Mondrian, Cubes (there might be others). Druid, Pinot, and Kylin are already part of the Apache Foundation, so that leaves others that we could approach to join the CNCF.

Having said that, because OLAP systems can be quite complex, there are multiple components that may fall within the scope of other CNCF SIGs. For example, storing historical data (SIG-Storage), running your batch processor workers (SIG-Runtime), serving your real-time and historical data (SIG-Network).

In any case, it would be great to approach the different projects so that the CNCF community is aware of how OLAPs work, and to foster general interest.

Ricardo



On Fri, May 29, 2020 at 9:01 AM Bartłomiej Płotka <bwplotka@...> wrote:
Hi SIG Observability! 👋

I recently noticed that many CNCF Prometheus and Thanos users often want to use the metric data collected by Prometheus for more advanced analytics: something more suitable for Business Intelligence / OLAP use cases.

As the Prometheus maintainers, we designed the Prometheus Query API and PromQL for realtime monitoring, or at most for simple analytics. They are far from efficient for data mining or data exploration.

I feel there are two things we are missing in the CNCF space: 

1. Please tell me if I am wrong here, but I don't see any particular BI/OLAP open-source project in the CNCF space. If there isn't one, I think that as CNCF SIG Observability we have some possibility to encourage such a project to either join or at least integrate more closely with the community. Do you think we, as CNCF SIG Observability, should be doing this? 🤔

2. Metric data, especially if you have years of it thanks to Thanos or Cortex, is an amazing source of information. In the Thanos community, we are actively looking for a project that will fit most of the requirements stated here. Are you currently a user of some open-source OLAP system worth recommending? If yes, which one? Would you like to have good integration of such a system with metrics?

We are looking for your feedback, preferably on this GitHub issue: https://github.com/thanos-io/thanos/issues/2682. I also plan to put this topic on the next SIG agenda if we have time for it. 🤗

Kind Regards and have a good weekend!
Bartek
