Introducing Shaman, the Apache Druid Cloud Manager

Published on June 26, 2017 by Miguel Morales

ZeroX is pleased to announce the general availability of Shaman, a self-service platform for deploying Apache Druid clusters.

We’ve packaged years of experience managing high-scale, production Druid clusters into a dashboard-driven experience.

In this blog post, we review the major features and lay out our thoughts on the future of the product.



Data Schemas

Data schemas are designed to be easy to share across your data-driven applications, which is part of Shaman’s overarching vision.

Schemas define how data is processed in terms of columns. For example, take a simple JSON data event like the following (a representative ad-serving event; the field names are purely illustrative):
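
    {
      "timestamp": "2017-06-26T12:00:00Z",
      "event_type": "bid_request",
      "country": "US",
      "user_id": "u-1234",
      "price": 0.25
    }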

We can generate a schema through the Shaman dashboard. Shaman currently supports the following column types:

  • Long
  • Double
  • String
  • Filtered Count
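
Shaman’s dashboard generates the schema for you, so there is nothing to write by hand. Conceptually, though, the column definitions for the event above boil down to something like the following sketch (the JSON shape and names here are our own illustration, not Shaman’s actual format):

    {
      "dataSource": "ad_events",
      "timestampColumn": "timestamp",
      "columns": [
        {"name": "country",      "type": "string",        "treatAs": "dimension"},
        {"name": "price",        "type": "double",        "aggregate": "sum"},
        {"name": "user_id",      "type": "string",        "aggregate": "countUnique"},
        {"name": "bid_requests", "type": "filteredCount", "filter": {"column": "event_type", "value": "bid_request"}}
      ]
    }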

Columns can also be computed, as with the bid_requests column in the sketch above.

Certain types also allow for special post-processing triggers. For example, you can specify whether Shaman should count unique values, sum values, treat values as dimensions, and so on.

Filtered counts are a handy type that lets you count data events based on a particular column’s value, which keeps schemas simpler and easier to understand.

Data schemas may be shared across clusters and updated while a cluster is running.



Cluster Deployment

We’ve made cluster deployment as easy as possible. Shaman was designed to easily provision Druid clusters for development, staging, and production environments.

Single-node clusters are extremely simple and can handle a baseline of 1K requests per second. That capacity is often enough to prototype and deploy applications that don’t carry the strict SLA requirements of a production application.

Multi-node clusters, currently enabled upon request, are designed to be highly available and to scale horizontally, making them well suited to applications with strict SLA requirements. Basic multi-node clusters can handle a baseline of 8K requests per second.

Capacity can be added to multi-node clusters by adding nodes as needed. Shaman provides monitoring and alerts to notify you when capacity should be added to your cluster.



Real-time Ingestion

Shaman is built from the ground up for real-time ingestion. We provide an SSL-secured HTTP gateway that load-balances and routes incoming data to a cluster’s real-time indexing node.

Real-time ingestion is handled transparently by Tranquility. A data event’s timestamp must be within 5 minutes of the time Shaman receives the request; otherwise it is dropped by the real-time indexer and picked up later by Shaman’s built-in lambda architecture.
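
For illustration, sending an event to the gateway from Python might look like the sketch below. The gateway URL is a placeholder (each cluster gets its own endpoint from the dashboard), and the payload reuses the illustrative ad-serving event from earlier:

    import datetime

    import requests  # third-party HTTP client: pip install requests

    # Placeholder endpoint; substitute your cluster's gateway URL.
    GATEWAY_URL = "https://gateway.example.com/v1/events"

    event = {
        # Tranquility only indexes events stamped within 5 minutes of "now";
        # older events fall through to the lambda architecture instead.
        "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
        "event_type": "bid_request",
        "country": "US",
        "user_id": "u-1234",
        "price": 0.25,
    }

    # POST the event as JSON and fail loudly on any non-2xx response.
    response = requests.post(GATEWAY_URL, json=event)
    response.raise_for_status()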

When automatic backup storage is enabled, data received through the real-time gateway is sent to cold cloud storage. Automatic backups are required to activate Shaman’s built-in lambda architecture.



Lambda Jobs

Lambda jobs are currently in private testing and enabled upon request. We’ve completely automated the deployment of behind-the-scenes Hadoop clusters to efficiently process historical data.

Lambda jobs can be scheduled to run hourly, daily, or monthly. When lambda jobs are enabled, cluster data is compressed: segments are merged and re-aggregated at the appropriate time granularity.
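
Under the hood this corresponds to Druid’s standard granularity settings. A daily job that rolls data up to hourly granularity, for instance, amounts to a spec like the following (Shaman sets this for you; the snippet only shows the idea):

    {
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "HOUR"
      }
    }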



Data Storage Design

The built-in lambda architecture rolls up data based on the time range the job is configured for: hour, day, week, and month. Rolling up data keeps historical reports snappy, compresses segments, and re-processes any data dropped by the real-time nodes. We expect this design to work for most applications.
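
As a toy illustration of an hourly roll-up (all values invented):

    Raw events:
      2017-06-26T10:01Z  country=US  price=0.20
      2017-06-26T10:07Z  country=US  price=0.30
      2017-06-26T10:42Z  country=US  price=0.25

    After the hourly roll-up (one row per hour per dimension combination):
      2017-06-26T10:00Z  country=US  events=3  price_sum=0.75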



Data Visualization w/ Superset

Finally, we’ve bundled a custom version of Apache Superset, a business intelligence platform for Druid. This allows users to instantly visualize real-time and historical data on their Druid clusters.

Superset allows users to create and share reporting dashboards, and even exposes the native Druid query behind each chart, which can be used to build custom reporting applications.
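
For example, a chart built on the illustrative ad_events schema from earlier might reduce to a native timeseries query like this one (the data source and aggregator names follow our sketch above):

    {
      "queryType": "timeseries",
      "dataSource": "ad_events",
      "granularity": "hour",
      "intervals": ["2017-06-25T00:00:00Z/2017-06-26T00:00:00Z"],
      "aggregations": [
        {"type": "longSum",   "name": "bid_requests", "fieldName": "bid_requests"},
        {"type": "doubleSum", "name": "price_sum",    "fieldName": "price"}
      ]
    }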



The Future

Shaman aims to be a fully fledged data platform built on top of powerful open-source projects. Users can build data-intensive applications without worrying about infrastructure until they need to, if they ever need to. We’ve laid the foundational lambda architecture and plan to add support for other popular cluster types, such as Apache Spark and Kubernetes.

Our vision is to empower users to build data-driven applications with the same agility and speed as building typical web applications.

However, we need your feedback to get there. Our goal is to create the best possible experience by abstracting away the tedious processes that get in the way of building great applications.