
Anomaly Detection

Getting Started

Introduction

IoT-connected devices typically generate significant amounts of data. This data can be used to solve many important problems by applying various data analysis methods. One type of task that can be solved effectively by means of data analysis is anomaly detection. Anomaly detection refers to the problem of finding patterns in data that do not conform to expected behavior.

Anomaly Detection Service

The Bosch IoT Analytics team has implemented a cloud service running in the Bosch IoT Cloud (BIC) and AWS. The main goal of this cloud service is to automate the task of anomaly detection by providing a self-service UI for creating analysis models and a scalable execution environment for executing these models.

This Tutorial

In this tutorial, we use data from a real use case in which the devices are boilers used for heating a house. The boilers are equipped with numerous sensors and send data to a central server in the form of irregular events. Each event contains numeric as well as categorical data. For example, a device can send data from its temperature sensors, such as Temp1=20.9 or Temp2=21.5. The final business goal is to decrease service costs by detecting unusual behavior that can potentially lead to problems with a device.

For the purposes of this tutorial, sample data is stored in a CSV file where one line represents one event.

NOTE

Using the Clone menu item of the Heating System example model, you can duplicate the example model to see what the model described in this tutorial should look like.

Content

  1. Creating an Analysis Model
  2. Wizard for Editing the Analysis Model
  3. Executing the Analysis
  4. Analyzing the Results
  5. Exploring the Results Visually

Creating an Analysis Model

Click the New button to create a new analysis model.

Continue with Wizard for Editing the Analysis Model.

Wizard for Editing the Analysis Model

This wizard guides you through the modeling process.

  1. Basic Details
  2. Input Data
  3. Feature Generation
  4. Aggregation over Time
  5. Analysis
  6. Output

Basic Details

Add a CSV file with the sample data to be analyzed to the assets. The CSV file to be added must look like this:

dt,Temp1,Temp2,Status
...
2015-03-01T23:29:57.000Z,,19.62,
2015-03-01T23:30:08.000Z,,19.42,
2015-03-01T23:30:08.000Z,20.54,,
2015-03-01T23:30:28.000Z,20.87,,
2015-03-01T23:30:37.000Z,20.66,,
2015-03-01T23:30:47.000Z,20.79,,
2015-03-01T23:31:17.000Z,20.64,,
2015-03-01T23:31:47.000Z,,19.42,Off
...

Alternatively, you can choose the MongoDB or the MySql option to specify the connection to a database in the Bosch IoT Cloud.

Click Next and continue with Input Data.

Input Data

A date column must exist in the source file and contain time stamps of events.

Choose the Input Type, the Input File, and specify the Date Time Field.

Choose the Date Time Format that fits the values in the Date Time Field.
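For example, the timestamps in the sample file (such as 2015-03-01T23:29:57.000Z) follow an ISO 8601 pattern. The following sketch shows how such a format can be checked locally with pandas before configuring the model; the file name and the use of pandas-style format codes are assumptions and are not part of the service itself.

import pandas as pd

# Hypothetical local check of the sample data (file name is assumed):
# parse the dt column with a format string matching values such as
# 2015-03-01T23:29:57.000Z
df = pd.read_csv("heating_system_sample.csv")
df["dt"] = pd.to_datetime(df["dt"], format="%Y-%m-%dT%H:%M:%S.%fZ")
print(df["dt"].min(), df["dt"].max())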

Click Next and continue with Feature Generation.

Feature Generation

We do not want to generate any additional domain-specific features.

Click Next and continue with Aggregation over Time.

Aggregation over Time

Since we are analyzing irregular events, it is necessary to convert them into a regular time series in which one measurement corresponds to one regular interval. The size of the time interval is chosen from the combo box. In our example, we choose 1 Hour, which means that we group all events belonging to one hour and aggregate them into one record (see the sketch after the list below).

  • In the Interval Length field, choose the time period between two measurements.
  • In the Interval Field field, enter the name of the column that will contain the interval identifiers.
  • The Device Field is left empty because we analyze data for only one device.
  • In the Date Time Field, enter the name of the column that contains the date/time stamps of the events.
  • In the Date Time Format field, choose the date format.
  • Four numeric columns are computed by finding the mean and the max of the two input parameters (Temp1 and Temp2).
  • The fifth definition generates two columns because the Status column takes two values: On and Off. The ads_pivot_count built-in aggregation function finds their frequencies.
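The following sketch illustrates what this 1 Hour aggregation produces, assuming the events are loaded into a pandas DataFrame outside the service; the file name and the output column names are illustrative (the example model uses names such as Tempm and Temp2m), and the service performs this step internally.

import pandas as pd

# Illustrative 1 Hour aggregation of the irregular events (not the service
# implementation; file and output column names are assumed).
df = pd.read_csv("heating_system_sample.csv", parse_dates=["dt"])
df["interval"] = df["dt"].dt.floor("1h")   # one group per hour

# Mean and max of the two temperature sensors per interval
numeric = df.groupby("interval").agg(
    Temp1_mean=("Temp1", "mean"),
    Temp1_max=("Temp1", "max"),
    Temp2_mean=("Temp2", "mean"),
    Temp2_max=("Temp2", "max"),
)

# Frequencies of the categorical Status values (On/Off) per interval,
# comparable to what ads_pivot_count produces
status_counts = pd.crosstab(df["interval"], df["Status"])

print(numeric.join(status_counts).head())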

Click Next and continue with Analysis.

Analysis

This page of the wizard allows you to choose the type of analysis algorithm that will be used for anomaly detection. By default, the following analysis types are available:

  • Multidimensional Scaling (MDS)
  • One-Class SVM Fit
  • Elliptic Envelope Fit

Choose the Multidimensional Scaling (MDS) option.

If the Trusted Data Scientist role is assigned to your user account, you can choose Custom as an additional Analysis Type. The trusted data scientist role is assigned to users who are experienced data scientists and have signed a special contract. Please get in touch via service-analytics@bosch-si.com if you are interested in this option.

If you have chosen Custom, the Analysis Function field appears.

Enter the name of a Python function used for data analysis.

Click Next and continue with Output.

Output

On this wizard page, you choose where the results of the analysis are stored. Choose the CSV Output Type to store the results in a CSV file, named in the Output File Name field, which can then be downloaded.

Alternatively, you can choose MongoDB or MySql as the Output Type to save the output to a database.

Click Finish and continue with Executing the Analysis.

Executing the Analysis

When an analysis model is complete, it can be executed. Executing an analysis model means starting a job that runs in BIC and processes the provided data.

Go to the list of analysis models (configurations) by clicking the Back to Overview link. Check the new configuration and then click the Execute Now button or use the corresponding item in the context menu.

Continue with Analyzing the Results.

Analyzing the Results

When the job has finished, you can click the View Dashboard menu item.

Alternatively, the CSV file with the results can be downloaded by clicking the Download Output menu item.

Continue with Exploring the Results Visually.

Exploring the Results Visually

When you click the View Dashboard button, the following interactive visualization charts are displayed. Use the +/- buttons to zoom in and out. Additionally, you can use the other functions to analyze certain areas and to download the plot as a PNG file.

One point in this Scatter Plot represents the behavior of one device during one time interval (1 hour in our example). We can easily see several outliers (marked in red) in this diagram, which represent devices whose behavior differs significantly from the majority of the devices.

In the Anomaly Indicator over Time chart, you can choose the Aggregation Time (Daily, 12h, 6h, 1h) and the Anomaly Threshold (10%, 20%, 30%) to determine the values that exceed the threshold within the specified period.

If you need other visualizations, please do not hesitate to contact the Anomaly Detection team to improve the service.

Alternatively, the results can be visualized using external tools.

Developer Guide

Feature Generation

Feature Generation in the Anomaly Detection service is intended to generate new columns and add them to the table without changing the number of rows. Each value in a new column depends only on the attribute values of the current row.

The values of the new columns are computed using Python functions. These functions can be standard Python functions, built-in functions provided by the Anomaly Detection service, or user-defined functions provided via asset files (only for trusted data scientists). Python files with the user-defined functions have to be uploaded along with the other assets; they are then dynamically imported during the execution of the workflow.

Any function that is used to generate a new column is supposed to return one value. However, this value is computed either from a single value of another (specified) column or from the values of all columns of one row. Which of the two applies depends on what is specified in the Feature Field of the new feature definition:

  • The Feature Field is empty. In this case, the function specified in the Function field is applied to one row (a sequence of all column values), which is passed as the parameter of this function. The individual column values of this row can then be accessed as array members; for example, if the input parameter name is row, then the function can access its values as row['column1'], row['column2'], etc.
  • The Feature Field is non-empty and contains an existing column name. In this case, the function specified in the Function field is applied to a single value (not a row), which is passed as the parameter. This value can then be accessed directly from the function body using the argument name.

Note that the New Feature Field can be equal to the Feature Field. In this case, the function overwrites the existing column with the newly generated values returned by the function. In other words, the function is applied to the existing Feature Field, the column is renamed to the New Feature Name, and its values are replaced with those generated by the Function.

Examples

  • Standard Single-Valued Functions for Column Transformations (Basic User)
    For example, if you want to apply a logarithmic transformation to your skewed data, you can use the standard mathematical functions log, log2, or log10 as the Function name on the Feature Generation tab. You can also use other built-in functions such as abs, float, etc. for numerical conversions before the analysis.



    In addition to that, on the Feature Generation tab in the wizard, enter the name of the new column in the New Feature Field field and enter the name of the column to be transformed in the Feature Field field.



  • User-Defined Functions for Row Transformations (Trusted Data Scientist)
    You can also use user-defined functions. Let us assume that two columns of the table, temp_inside and temp_outside, store the current temperature inside and outside of the building, respectively. Yet, for the data analysis it is more important to know the difference between them, and hence you need to generate a feature as the difference between these two columns. In other words, you need to define a new column temp_diff that is equal to the difference between temp_outside and temp_inside.



    For that purpose, you need to define a user-defined function in a .py file and upload it as an asset on the Basic details tab, e.g.:
    def temp_diff_fn(row):
        return row['temp_outside'] - row['temp_inside']
    Note that this function is applied to one row because it needs two columns in order to compute its result. It receives one row, the elements of which are accessed using column names, e.g. row['temp_outside']. In order to use this function, you have to provide a new feature name (e.g. temp_diff) and the name of the function as defined in the uploaded .py asset (e.g. temp_diff_fn). Note that the Feature Field is left empty to indicate that the function is applied to the whole row.
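To make the two cases above concrete, the following sketch imitates how such feature definitions behave, assuming the data is held in a pandas DataFrame; this is an illustration of the semantics, not the service's internal implementation.

import math
import pandas as pd

# Illustration of the two kinds of feature definitions (assumption: the data
# behaves like a pandas DataFrame inside the service).
df = pd.DataFrame({
    "temp_inside":  [20.5, 21.0, 19.8],
    "temp_outside": [5.2, 4.9, 6.1],
})

def temp_diff_fn(row):
    # Row transformation: Feature Field empty, the whole row is passed in
    return row["temp_outside"] - row["temp_inside"]

df["temp_diff"] = df.apply(temp_diff_fn, axis=1)

# Column transformation: Feature Field set to temp_inside, so the function
# (here the standard math.log) receives a single value
df["temp_inside_log"] = df["temp_inside"].apply(math.log)

print(df)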

List of Functions

These functions can be used for column transformations:

  • log
  • log2
  • log10
  • sqrt
  • float
  • int
  • str
  • abs

Note that other functions, for example functions available via other Python modules, can also be used. It is only important that the engine can resolve the function by name and that it takes one argument (and processes its values correctly).

Aggregation Over Time

Aggregation over time is a very common data transformation that makes continuous data suitable for anomaly detection algorithms. Generally, these methods compress a time series by replacing a set of consecutive values with a representative value (usually their average). This type of transformation can provide a number of benefits, such as reducing the dimensionality of the data and smoothing it by masking noise and missing values.


From the configuration wizard of the Anomaly Detection service, you can define the following:

  • Interval Length: length of the time interval used for aggregation, e.g. 1 Minute, 10 Minutes, 1 Hour, etc.
  • New Interval Field: name of the new column that will store the identifiers of the time intervals
  • Device Field: name of the existing column with device identifiers
  • Date Time Field: name of the existing column that stores the date/time stamps
  • Date Time Format: format identifier for the existing date/time data
  • Feature Aggregation definitions

All rows of the input data table are grouped according to the values in the Date Time Field. All rows with a date belonging to one interval (specified via the Interval Length) belong to one group. For example, if the Interval Length is 1 Month, then rows dated 1 May, 5 May, and 25 May belong to one group. All rows of one group are then converted into one record of the new output.

The rows of one group are aggregated into one value by applying a Function, which can be different for each new output column defined. However, this function does not receive the whole group (a number of rows) but rather the list of values from one column of these rows. This column is specified in the Feature Field; note that, in contrast to feature generation, this field is mandatory here. The aggregation function receives a list of values and returns one value; it can be any Python function with this signature, for example a standard function like sum or mean.
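As a sketch of such a custom aggregation, the hypothetical function below maps the list of values of one group to a single value (the range of the values); pandas is used here only to imitate the grouping, and the function name value_range is illustrative.

import pandas as pd

# Hypothetical custom aggregation: one value computed from the list of
# values of a group (here the range of the values).
def value_range(values):
    return max(values) - min(values)

df = pd.DataFrame({
    "dt":    pd.to_datetime(["2015-05-01", "2015-05-05", "2015-05-25",
                             "2015-06-02", "2015-06-17"]),
    "Temp1": [20.1, 22.4, 19.7, 21.0, 18.9],
})
df["interval"] = df["dt"].dt.to_period("M")   # 1 Month intervals, as above

# One output value per group, computed from that group's Temp1 values
print(df.groupby("interval")["Temp1"].agg(value_range))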

Example Illustration

In the Heating Systems analysis (provided as an example in ADS), the input data has two temperature sensor data columns, Temp1 and Temp2, one date/time identifier column dt, and one categorical column, the device status (On/Off) Status.

To aggregate the data with an Interval Length of 1 Hour, the name of the new aggregated date/time identifier column (the Interval Field) is set to interval.



For the aggregation, you can use the following:

  • Mean of Temp1 values over 1 Hour: Tempm
  • Maximum of Temp1 values over 1 Hour: Tempmx
  • Mean of Temp2 values over 1 Hour: Temp2m
  • Maximum of Temp2 values over 1 Hour: Temp2mx
  • Frequency of each value in Status over 1 Hour using ads_pivot_count: Status


List of Functions

Standard aggregation functions are normally applied to numeric columns by aggregating their values for the selected time interval:

  • mean: finds the average value of a group
  • sum: finds the sum of a group
  • std: finds the standard deviation of a group
  • var: finds the variance of a group
  • min: finds the minimum value of a group
  • max: finds the maximum value of a group

In addition to the standard Python aggregation functions, ADS provides some additional (convenience) functions which are useful in many data analysis scenarios:

  • ads_pivot_count

This function is applied to one (categorical) column with a finite number of values. In contrast to other aggregation functions, it generates several output columns (not one): as many new columns as there are unique values in this column. The values in these new columns are equal to the frequencies of the corresponding values. For example, if the input Status column takes the two values On and Off, then there will be two output columns whose values are equal to the frequency of each value, i.e. how often the statuses On and Off occurred during the specified interval.
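The behavior described above can be imitated with pandas as in the following sketch (an illustration only; the service computes this internally):

import pandas as pd

# Sketch of the ads_pivot_count behavior: one output column per unique
# Status value, holding its frequency per interval.
df = pd.DataFrame({
    "interval": ["01:00", "01:00", "01:00", "02:00", "02:00"],
    "Status":   ["On", "On", "Off", "Off", "Off"],
})

counts = df.groupby("interval")["Status"].value_counts().unstack(fill_value=0)
print(counts)
# Status    Off  On
# 01:00       1   2
# 02:00       2   0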

  • ads_pivot (only for the trusted data scientist role)
    This function is applied to one (categorical) column with a finite number of values, the name of which is specified in the Feature Field. Each unique value of this column is used to create one new column.
    The values in these new columns are computed by aggregating the values of another column, which is provided in the extended (mandatory) parameters in the values field.
    The aggregation function is provided in the extended (mandatory) parameter aggfunc.
    In particular, the following standard aggregation functions are possible:
    • mean
    • sum
    • std
    • var
    • min
    • max

Also, extended parameters can list included and excluded columns in the fields include and exclude, respectively.
Here is an example of extended parameters:
"arguments": {"aggfunc":"std", "values":"SENSOR_VALUE", "include":["CBA_01_EXT_WS", "CBA_01_EXT_T"], "exclude":["CO2_1", "RHW_1"]

Post-processing

Normalization. Currently, normalization can only be performed after the ads_pivot_count function, by normalizing the resulting frequencies. The type of transformation is specified in the additional parameters (and hence requires the trusted data scientist role). The preprocessing parameter is a JSON object whose field names specify the transformation types to be applied. The value of each transformation type key is an object with the parameters for that transformation. For example, if we want to scale the obtained frequencies (using default parameters), it can be done as follows:
"arguments": { "preprocessing": { "scale": {} } }

The following transformation types are supported:

  • scale
  • robust_scale
  • minmax_scale
  • maxabs_scale

It is also possible to apply the following mathematical functions to the result columns:

  • log
  • log2
  • log10
  • sqrt

More information about these transformations can be found in the scikit-learn preprocessing documentation.

Algorithms/Analysis

Anomaly Detection provides several data mining algorithms that are used as analysis types for detecting anomalies. Based on your problem statement and the suitability of your data, you can choose one of the algorithms below for your analysis.

Multidimensional Scaling (MDS)

This algorithm reduces the dimensionality of the original data set to a smaller number of dimensions while retaining the distances between objects (rows) in the original data as far as possible.
Given an n×n dissimilarity matrix D and a distance measure, the goal of MDS is to perform dimensionality reduction in a way that preserves dot products between data points as closely as possible. In the Anomaly Detection Service, we use the Euclidean distance measure for MDS.
The output of this scaling maximizes the retained variance in the data set while reducing dimensionality. Points that are far apart in the original data set will tend to be far apart in the projected data set. This makes it much easier to detect anomalies among the normal data.


This algorithm is generally useful when you have a high-dimensional data set.
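As an illustration of the idea (not of the service's internal implementation), the following sketch uses scikit-learn's MDS with Euclidean dissimilarities to project a small synthetic data set to two dimensions and rank the records by their distance from the bulk of the points:

import numpy as np
from sklearn.manifold import MDS

# Synthetic data: 100 records with 6 aggregated features, 3 of them unusual
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
X[:3] += 8

# Project to 2 dimensions while preserving Euclidean distances as far as possible
X_2d = MDS(n_components=2, dissimilarity="euclidean",
           random_state=0).fit_transform(X)

# Records far away from the center of the projection are anomaly candidates
dist = np.linalg.norm(X_2d - X_2d.mean(axis=0), axis=1)
print(np.argsort(dist)[-3:])   # indices of the three most distant records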


One-Class SVM Fit

In contrast to traditional Support Vector Machines (SVM), one-class SVMs attempt to learn a decision boundary (a hyperplane) that achieves the maximum separation between the points and the origin. Only a small fraction of data points is allowed to lie on the other side of the decision boundary. Those data points are considered as outliers or anomalies.

One-class SVM is not an outlier-detection method but rather a novelty-detection method: its training set should not be contaminated by outliers, as the model may fit them. That said, one-class SVM can still be useful in certain challenging situations, such as outlier detection in high dimensions.

One-Class SVM is generally useful when you have a lot of normal data and only a few cases of anomalous data.

Parameters (Trusted Data Scientist Role)

The following parameters can be set additionally by switching to the Raw mode and editing the JSON configuration for the analysis:

  • nu (float, in the interval (0, 1], default value 0.5): An upper bound on the fraction of training errors and a lower bound on the fraction of support vectors.
  • gamma (float): Kernel coefficient for 'rbf', 'poly', and 'sigmoid'. If gamma is 'auto', then 1/n_features is used instead.
  • kernel (string, default 'rbf'): Specifies the kernel type to be used in the algorithm. It must be one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed', or a callable.
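The parameters above correspond to those of scikit-learn's OneClassSVM, which can be used to sketch the behavior on synthetic data (an illustration only; in the service these values are set in the raw JSON configuration):

import numpy as np
from sklearn.svm import OneClassSVM

# Train on mostly normal behavior, then score new records
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
X_new = np.vstack([rng.normal(size=(5, 4)),
                   rng.normal(loc=6, size=(2, 4))])   # two unusual records

model = OneClassSVM(kernel="rbf", nu=0.1, gamma="auto").fit(X_train)
print(model.predict(X_new))   # +1 = normal, -1 = anomaly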


Elliptic Envelope Fit

This algorithm fits the data into an ellipse in a multidimensional space (an elliptic envelope). Here, the training data is assumed to be contaminated by outliers, and the data is assumed to be normally distributed. The algorithm produces a decision function representing the border that separates the central part of the data from the outer part containing the outliers. The greater the distance from this border, the more likely it is that the point represents an outlier.

Elliptic Envelope is generally useful for normally distributed data.

Parameters (Trusted Data Scientist Role)

The following parameters can be set additionally by switching to the Raw mode and editing the JSON configuration for the analysis:

  • contamination (float, in the interval (0, 0.5), default value 0.1) The amount of contamination of the data set, i.e. the proportion of outliers in the data set.
  • support_fraction (float, in the interval (0,1)) The proportion of points to be included in the support of the raw MCD estimate.
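These parameters correspond to those of scikit-learn's EllipticEnvelope; the following sketch shows the fit-and-predict pattern on synthetic, roughly normally distributed data (an illustration only):

import numpy as np
from sklearn.covariance import EllipticEnvelope

# Roughly normally distributed data with about 5% outliers
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(95, 3)),
               rng.normal(loc=7, size=(5, 3))])

model = EllipticEnvelope(contamination=0.1, support_fraction=0.9,
                         random_state=0).fit(X)
print(model.predict(X))            # +1 = inlier, -1 = outlier
print(model.decision_function(X))  # signed distance from the fitted border (negative = outside)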
