This post deals with anomaly detection. There are three broad approaches to anomaly detection:
- Use basic statistics (standard deviation, average, and measures derived from them, such as the z score). This can be done in Imply Pivot and is covered in Part 2 of this post.
- Use a forecasting model and detect anomalies based on how much a specific metric deviates from the forecast. This can be done using a tool like Sherlock with Druid and is covered in Part 3 of this post.
- Use machine learning (clustering and classification especially). This is covered in my other post on anomaly detection.
Sherlock is an anomaly detection tool open sourced by Yahoo (https://github.com/yahoo/sherlock). It offers a number of anomaly detection models, such as the moving average model, the regression model, and the Olympic model. Druid is an open source real-time streaming analytics platform that generates the metrics on which anomaly detection can be done. The key value of using Druid in an anomaly detection pipeline is that it can generate sums, counts, averages, percentiles, and so on with sub-second latency by combining streaming and historical data. Imply provides commercial support for Druid, and Imply Pivot is a proprietary BI tool that makes it easy to slice and dice data in Druid and to define metrics based on statistics.
This post is in three parts. The first part covers setting up all the components and ingesting sample data. The second part covers setting up statistics using Imply Pivot, and the third part covers using Sherlock.
Part 1 — setup everything
a) Download the Imply quickstart from https://docs.imply.io/2021.08/quickstart/#start-unmanaged-imply
To use statistics, navigate to <imply install>/conf-quickstart/druid/_common, open common.runtime.properties, and add the druid-stats extension to the Druid extensions load list:
druid.extensions.loadList=["druid-stats", "druid-histogram", "druid-datasketches", "druid-kafka-indexing-service", "imply-utility-belt"]
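A small gotcha when copying the line above from a web page: the value of druid.extensions.loadList is a JSON array, so the quotes must be plain double quotes, and curly quotes will most likely fail to parse. Here is a minimal Python sketch that checks the property before you restart Imply. The helper name parse_load_list is mine and is not part of Druid or Imply; the sketch works on the text of common.runtime.properties that you pass in.

```python
import json


def parse_load_list(properties_text):
    """Return the list in druid.extensions.loadList, or None if the key is absent.

    Hypothetical helper for illustration; it only understands simple key=value lines.
    """
    for raw in properties_text.splitlines():
        line = raw.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "druid.extensions.loadList":
            # The value is a JSON array, so curly quotes raise json.JSONDecodeError here.
            return json.loads(value.strip())
    return None


sample = (
    'druid.extensions.loadList=["druid-stats", "druid-histogram", '
    '"druid-datasketches", "druid-kafka-indexing-service", "imply-utility-belt"]'
)
extensions = parse_load_list(sample)
print("druid-stats loaded:", "druid-stats" in extensions)
```

To run it against the real file, read the file contents into a string and pass that string to parse_load_list.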
b) Install Redis (https://linuxize.com/post/how-to-install-and-configure-redis-on-centos-7/). Redis is required for Sherlock.
c) Build Sherlock from https://github.com/yahoo/sherlock. Note that the location of the EGADS library has changed: follow https://github.com/yahoo/egads and make the corresponding changes to the pom.xml. Comment out the bintray location and add the jcenter location as below.
<!-- <repository>
Once Imply is started, browse to http://localhost:9095. This opens the Pivot UI. Click on Data > Load data.
Click on start new spec.
Click on example data and select Koalas to the Max (9 days). This loads 9 days of browsing data from http://koalastothemax.com/
Keep choosing next on all subsequent screens to accept the default options. On the partition screen, ensure that the segment granularity is day.
On the publish screen, specify kttm as the name of the data source in Druid.
On the next screen, click on submit to load the data. Once the task has succeeded, you should see segments for 9 days.
Part 2 — statistical anomaly detection in pivot
Navigate to http://localhost:9095
Click on settings and enable all experimental features.
Now click on visuals and click on “create new data cube”. Select the check box to create a SQL cube, as shown in the screenshot below.
Once you save the cube, click on edit measures.
We will add two statistical measures, the z score and the IQR (both are described in https://towardsdatascience.com/statistical-techniques-for-anomaly-detection-6ac89e32d17a).
Add a measure called z score session length and set it to abs(max(t.session_length)/stddev_samp(t.session_length)-avg(t.session_length)/stddev_samp(t.session_length))
The z score is (x-mean)/stddev for each x. In the measure above I have used max(x) because the Pivot visualisation aggregates over arbitrary time intervals. If the time interval in Pivot is the same as the granularity of the data, then max(x)=x.
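To make the z score measure concrete, here is a small Python sketch that mirrors the expression on a single time bucket. The function names and the sample values are made up for illustration; statistics.stdev is the sample standard deviation, which is what I am assuming stddev_samp corresponds to.

```python
import statistics


def z_score_measure(values):
    """Mirror of abs(max(x)/stddev_samp(x) - avg(x)/stddev_samp(x)) for one bucket."""
    sd = statistics.stdev(values)    # sample standard deviation; needs at least 2 values
    mean = statistics.fmean(values)
    return abs(max(values) / sd - mean / sd)


def z_scores(values):
    """Plain per-row z scores, (x - mean) / stddev, for comparison."""
    sd = statistics.stdev(values)
    mean = statistics.fmean(values)
    return [(x - mean) / sd for x in values]


# Made-up session lengths for one time bucket, with one unusually long session.
session_lengths = [10, 12, 11, 13, 12, 95]

print("measure:", z_score_measure(session_lengths))
print("largest row-level z score:", max(z_scores(session_lengths)))
```

The two printed numbers agree, because the measure is just the z score of the largest value in the bucket. One consequence of using max(x) is that the measure reacts to unusually large values only; an unusually small value in the bucket would need min(x) instead.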
Add another measure called iqr session length and set it to APPROX_QUANTILE(t."session_length", 0.75)-APPROX_QUANTILE(t."session_length", 0.25)
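The IQR measure can be sketched the same way. The Python below computes exact quartiles with statistics.quantiles, whereas APPROX_QUANTILE in Druid is approximate, so treat it as an illustration of the idea and not of Druid's numbers. The function name and both sample buckets are made up.

```python
import statistics


def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Two made-up buckets of session lengths: a quiet one and one with several bad rows.
normal_bucket = [10, 12, 11, 13, 12, 14, 11, 12, 13]
anomalous_bucket = [10, -400, -500, 13, 12, -450, 11, 12, 13]

print("IQR of normal bucket:", iqr(normal_bucket))
print("IQR of anomalous bucket:", iqr(anomalous_bucket))
```

Note that the IQR ignores the extreme tails, so it only jumps when a sizeable share of the rows in a bucket is off, as in the second bucket here where three of the nine rows are negative.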
Now choose multi-selection in the measure drop-down and select the two measures above for a 7 day window with hourly granularity.
Note that there is a clear anomaly on Aug 27 (between 2 and 3 PM, if you hover over it in the Pivot UI) based on the IQR. Select that window in the Pivot UI.
Now, in the measure drop-down, select session_length and unselect z score and IQR.
In the above screenshot it is clear that the sum of session_length is negative for at least one minute in the 2–3 PM time range. This is the anomaly that was detected by the statistics.
Part 3— using sherlock with druid
Sherlock is an open source anomaly detection tool (https://github.com/yahoo/sherlock). It makes use of Druid’s ability to rapidly execute queries with aggregations. After building Sherlock and installing and starting Redis (following Part 1 of this post), start Sherlock:
nohup java -jar sherlock-1.12-SNAPSHOT-jar-with-dependencies.jar &
Note that the above assumes that Redis is running on the default port (6379). If the port is different, specify the appropriate flag and the port number.
Once Sherlock is up, you should be able to navigate to http://localhost:4080
Click on add cluster and add your cluster.
Click on save in the above screen and navigate to druid clusters. If the cluster was added correctly, you should see the status as ok.
Click on flash query, specify the following query in the Druid query box, and set the other parameters as in the screenshot.
The above is a Druid native query. It can be obtained by running the following SQL query in the Druid console and looking at the query plan:
select city,country,platform,sum(session_length) from kttm group by 1,2,3
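For readers who have not seen a native query before, here is a rough Python sketch of what a groupBy query for the SQL above could look like. This is my reconstruction from the SQL, not a copy of the query used in the flash query screen. The longSum aggregator type (use doubleSum if the column is stored as a double), the output name, the hourly granularity, and the interval are all assumptions, so compare the result with the plan that Druid prints.

```python
import json


def build_groupby_query(datasource, dimensions, metric, intervals, granularity="hour"):
    """Build a Druid native groupBy query as a plain dict (illustrative helper)."""
    return {
        "queryType": "groupBy",
        "dataSource": datasource,
        "granularity": granularity,
        "dimensions": list(dimensions),
        "aggregations": [
            # Assumption: session_length is stored as a long column.
            {"type": "longSum", "name": "sum_" + metric, "fieldName": metric}
        ],
        "intervals": list(intervals),
    }


query = build_groupby_query(
    datasource="kttm",
    dimensions=["city", "country", "platform"],
    metric="session_length",
    # Placeholder interval: replace it with the three days you want to query.
    intervals=["2019-08-25T00:00:00Z/2019-08-28T00:00:00Z"],
)
print(json.dumps(query, indent=2))
```

The printed JSON is the kind of text that goes into a native query box. The granularity field is the part to change when you want the same metric per minute or per day.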
Populate the other parameters as below. Note that the query gets three days of data and we want to do anomaly detection for Aug 27.
Click on flash results to get the following. Note the anomalies at 14:00 and 15:00 on Aug 27, which are consistent with the earlier result from the statistics in Part 2.
Note that Sherlock offers a list of both time series models and anomaly detection models. How these differ and which use cases each one suits is a post for another day! The Olympic model, for instance, assumes that the value for the next point is the smoothed average over the previous n periods. K-sigma is a simple anomaly detection algorithm that looks for values that deviate from the expected value by more than k times the standard deviation.
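To give a feel for these two ideas, here is a deliberately simplified Python sketch. It is not the EGADS or Sherlock implementation: the forecast is a plain average of the values at the same position in the previous n periods (no smoothing or trimming), the k-sigma check uses the standard deviation of the forecast errors, and all names and numbers are made up.

```python
import statistics


def olympic_forecast(series, period, n_periods):
    """Forecast the next point as the mean of the values one, two, ... n periods back."""
    next_index = len(series)
    history = [
        series[next_index - k * period]
        for k in range(1, n_periods + 1)
        if next_index - k * period >= 0
    ]
    if not history:
        raise ValueError("series is shorter than one period")
    return statistics.fmean(history)


def k_sigma_anomalies(actual, expected, k=3.0):
    """Return indexes where the forecast error is more than k sigma from the mean error."""
    errors = [a - e for a, e in zip(actual, expected)]
    mean_error = statistics.fmean(errors)
    sigma = statistics.pstdev(errors)
    if sigma == 0:
        return []
    return [i for i, err in enumerate(errors) if abs(err - mean_error) > k * sigma]


# Three made-up periods of four points each; forecast the first point of period four.
history = [10, 20, 30, 20, 11, 21, 29, 19, 9, 19, 31, 21]
print("forecast:", olympic_forecast(history, period=4, n_periods=3))

# Made-up actual values against a flat forecast of 10, with one bad point at index 10.
expected = [10] * 12
actual = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10, -400, 10]
print("anomalous indexes:", k_sigma_anomalies(actual, expected, k=3.0))
```

With very short series a single outlier inflates sigma so much that it can hide itself, which is one reason the choice of k and of the history length matters in practice.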
Once the flash query results are acceptable, the query can be saved as a job and scheduled to run periodically.
In this example using Sherlock, and in the one in Part 2 using Pivot, the key thing to note is that both use cases require a data store that can be queried rapidly with aggregate queries (standard deviation, percentile, average, sum, and so on). Druid runs these queries with sub-second latency, which makes real-time anomaly detection possible.
Note that the Druid query used above has a granularity of 1 hour. One can also run the same query at a granularity of 1 day or 1 minute. This ability to query the data at arbitrary granularity is critical for anomaly detection. The two approaches seen here, one based on statistics and one based on forecasting, will flag different data points as anomalies depending on the granularity chosen for the query. Given that anomalies are by definition unpredictable, it is necessary to look at different granularities to detect them without too many false positives. Druid adds the following key capabilities for anomaly detection:
- Ability to generate metrics at different granularities.
- Ability to generate different metrics (count, sum, distinct count, percentile, etc.).
- Ability to group metrics by different sets of dimensions.
- Ability to do all of the above with sub-second query response.
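Here is a small Python sketch of why granularity matters. In Druid the roll-up happens on the server when you change the granularity of the query; the code below only imitates that on a handful of made-up rows, and the function name and timestamps are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# strftime patterns that truncate a timestamp to the chosen granularity.
BUCKET_FORMATS = {
    "minute": "%Y-%m-%dT%H:%M",
    "hour": "%Y-%m-%dT%H",
    "day": "%Y-%m-%d",
}


def rollup_sum(rows, granularity):
    """Sum (timestamp, value) rows into buckets of the given granularity."""
    fmt = BUCKET_FORMATS[granularity]
    totals = defaultdict(float)
    for timestamp, value in rows:
        totals[timestamp.strftime(fmt)] += value
    return dict(totals)


# Made-up rows: one minute with a negative session length among normal values.
rows = [
    (datetime(2000, 1, 1, 14, 4), 50.0),
    (datetime(2000, 1, 1, 14, 5), -30.0),
    (datetime(2000, 1, 1, 14, 6), 40.0),
    (datetime(2000, 1, 1, 15, 0), 45.0),
]

for granularity in ("minute", "hour", "day"):
    print(granularity, rollup_sum(rows, granularity))
```

At minute granularity the negative value stands out on its own, while at hourly or daily granularity it is absorbed into a positive total. The same row can therefore be an obvious anomaly at one granularity and invisible at another.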
The dataset we have looked at has 25 dimensions. This results in thousands of combinations of dimensions for which metrics need to be monitored. Customers often have 60,000 dimension-metric combinations that need to be monitored. Anomaly detection at this scale depends critically on the ability of the data store to rapidly return results for thousands of queries per second.
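As a back-of-the-envelope check on how quickly the combinations grow, here is a short Python sketch. The way of counting is an assumption on my part: it counts the sets of dimensions you could group by when you limit yourself to at most a few dimensions at a time, and it ignores the distinct values inside each dimension, which multiply the number further.

```python
from math import comb


def dimension_groupings(n_dimensions, max_group_size):
    """Count the non-empty sets of at most max_group_size dimensions."""
    return sum(comb(n_dimensions, k) for k in range(1, max_group_size + 1))


# 25 dimensions, grouping by one, two, or three of them at a time.
print(dimension_groupings(25, 3))

# Every possible non-empty set of the 25 dimensions.
print(dimension_groupings(25, 25))
```

Even with the limit of three dimensions per query the count is already in the thousands, and each of those groupings has to be monitored for every metric of interest.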