Monitoring rules reference

Monitoring rules are configured on a per-resource basis, with rules for the following resources:

All monitoring rules contain a configurable field called Alert severity which is the severity granted to an alert when its condition is triggered. Monitoring views can be configured to only send out alerts that meet or exceed a certain severity.

Rule componentDescriptionExample options
Alert severitySeverity of monitoring report conditionLow, Medium, High

Agent rules

Agent last heartbeat time

Alerts when the agent bootstrapper's last heartbeat is older than a set threshold.

Rule componentDescriptionExample options
If value is greater thanAmount of time elapsed since the last heartbeat received from the agent bootstrapper10 minutes

We recommend setting this monitor value to 10 minutes.

Agent manager version stale time

Alerts when the agent bootstrapper version has not been upgraded since a set threshold.

Rule componentDescriptionExample options
If value is greater thanAmount of time elapsed since the agent manager has been on an old version10 minutes

We recommend setting this monitor value to 10 days.

Agent version stale time

Alerts when the agent version has not been upgraded since a set threshold.

Rule componentDescriptionExample options
If value is greater thanAmount of time elapsed since the agent has been on an older version10 days

We recommend setting this monitor value to 10 days.

High CPU utilization

Alerts when the agent CPU utilization exceeds a set threshold.

Rule componentDescriptionExample options
If value is greater thanPercentage of CPU utilization80

We recommend setting this monitor value to 80 (%).

JVM heap usage is close to the limit

Alerts when the JVM heap usage exceeds a set threshold.

Rule componentDescriptionExample options
If value is greater thanPercentage of JVM heap used / JVM heap available70

We recommend setting this monitor value to 70 (%).

Low disk space

Alerts when the available disk space drops below a set threshold.

Rule componentDescriptionExample options
If value is less thanAvailable disk space10GB

We recommend setting this monitor value to 10GB.

Time until earliest keystore certificate expires

Alerts when a certificate in the agent's keystore will expire within a set threshold.

Rule componentDescriptionExample options
If value is less thanAmount of time until a certificate expires10 days

** We recommend setting this monitor value to medium severity at less than 30 days and high severity at less than 10 days.**

Time until earliest truststore certificate expires

Alerts when a certificate in the agent's truststore will expire within a set threshold.

Rule componentDescriptionExample options
If value is less thanAmount of time until a certificate expires10 days

We recommend setting this monitor value to medium severity at less than 30 days and high severity at less than 10 days.

Queue size

Alerts when the number of jobs queued on an agent exceeds a set threshold.

Rule componentDescriptionExample options
If value is greater thanThe number of jobs in the agent's job queue70

We recommend setting this monitor value to 70 (jobs).

Schedule rules

Consecutive schedule failures

Alerts when the number of consecutive schedule failures meets or exceeds a set threshold. This does not count schedule runs that result in a cancelled build.

Rule componentDescriptionExample options
If value is greater than or equal toThreshold of consecutive schedule failures1

The default behavior for this monitor is to alert with medium severity at one failure and high severity at three failures, though these thresholds are highly dependent on the frequency and stability of the schedules that are included in the monitoring rule's scope.

Schedule duration

Alerts when a schedule is running longer than a set threshold.

Rule componentDescriptionExample options
If value is greater than or equal toThe duration of the schedule2 hours
This monitor is typically used on highly critical schedules to quickly inform whether or not the schedule will complete in the expected time. Due to the variable nature of schedules, this monitor is often schedule-scoped.

Changelog jobs failing on active pipeline

Alerts when the "changelog" job for the object or link's active pipeline is failing. This rule is non-configurable, alerting with medium severity at one changelog job failure and high severity at three changelog job failures.

Merge changes job failing on active pipeline

Alerts when the "merge changes" job for the object or link's active pipeline is failing.

Rule componentDescriptionExample options
If value is greater than or equal toThreshold of consecutive merge job failures3

The default behavior for this monitor is to alert with low severity at one failure, medium severity at three failures, and high severity at seven failures.

Sync jobs failing on active pipeline

Alerts when the "sync" job for the object or link's active pipeline is failing.

Rule componentDescriptionExample options
If value is greater than or equal toThreshold of consecutive sync job failures3

The default behavior for this monitor is to alert with low severity at one failure, medium severity at three failures, and high severity at seven failures.

Changelog jobs failing on replacement pipeline

Alerts when the "changelog" job for the object or link's replacement pipeline is failing. This rule is non-configurable, alerting with medium severity at one changelog job failure and high severity at three changelog job failures.

Merge changes job failing on replacement pipeline

Alerts when the "merge changes" job for the object or link's replacement pipeline is failing.

Rule componentDescriptionExample options
If value is greater than or equal toThreshold of consecutive merge job failures3

The default behavior for this monitor is to alert with low severity at one failure, medium severity at three failures, and high severity at seven failures.

Sync jobs failing on replacement pipeline

Alerts when the "sync" job for the object or link's replacement pipeline is failing.

Rule componentDescriptionExample options
If value is greater than or equal toThreshold of consecutive sync job failures3

The default behavior for this monitor is to alert with low severity at one failure, medium severity at three failures, and high severity at seven failures.

Scroll job failing on pipeline

Alerts when the "scroll" job for the object or link's active or replacement pipeline is failing. Scroll jobs are responsible for streaming data from the backing datasource to the object databases.

Rule componentDescriptionExample options
If value is greater than or equal toThreshold of consecutive scroll job failures3

The default behavior for this monitor is to alert with low severity at one failure, medium severity at three failures, and high severity at seven failures, and these values are configurable.

Sync propagation delay

Alerts when a dataset backing the object has a transaction with a sync time that exceeds a set threshold.

Rule componentDescriptionExample options
If value is greater than or equal toThreshold of time taken to sync a transaction1 day

Invalid stream records detected

Alerts when records in an input stream contain format violations. The scroll job ignores these records. This rule is non-configurable, alerting with critical severity when the number of ignored rows is greater than or equal to one.

Streaming dataset rules

Derived stream monitors

Last checkpoint duration

Alerts if the last checkpoint took more time than the configured threshold to complete.

Rule componentDescriptionExample options
If value is greater thanThreshold of time taken to checkpoint10 minutes

Liveness: time since last successful checkpoint

Alerts if the stream has not completed a checkpoint since the configured threshold. The default threshold configuration is 2 minutes. This monitor encompasses streams that are not running as well as streams failing a checkpoint.

Rule componentDescriptionExample options
If value is greater than or equal toThreshold of time elapsed since last checkpoint2 minutes

Total lag

Alerts if a stream's lag (total unprocessed upstream records) exceeds the set threshold.

Rule componentDescriptionExample options
If value is greater thanThreshold of unprocessed upstream records1000

This monitor indicates that streaming transforms are taking too long to run, or there is a problem with the streaming transforms infrastructure.

Total throughput

Alerts if a stream's throughput (records processed per checkpoints) falls below the set threshold.

Rule componentDescriptionExample options
If value is less thanThreshold of records processed per checkpoint100

This monitor indicates that streaming transforms are taking too long to run, or there is a problem with the streaming transforms infrastructure.

Ingest stream monitors

Records ingested over last 5 minutes / 30 minutes / 1 hour / 4 hours / 1 day

Alerts if the number of records ingested into the raw stream's live view over the selected time window was less than or equal to the configured threshold.

Rule componentDescriptionExample options
If value is less than or equal toThreshold of ingested records per unit time100

Live deployment rules

Live deployment heartbeat

Alerts when deployment has not emitted a heartbeat for more than one minute.

Rule componentDescriptionExample options
If value is greater than or equal toThreshold of time elapsed since last heartbeat1 minute

Time series sync rules

Points written by the time series sync over last 5 or 30 minutes

Alerts if the number of points written by the time series sync over the last 5 or 30 minute window was less than or equal to the configured threshold.

Rule componentDescriptionExample options
If value is less than or equal toThreshold of points written per unit time100

Dataset rules

Time since job last succeeded

Alerts when a job on a dataset has not succeeded within a specified time threshold. Unlike the "Time since last updated" health check, the following conditions will always count as a passing status for the monitor:

  • The job succeeded, but the transaction was aborted
  • The job succeeded, but no new data was added
Rule componentDescriptionExample options
If value is greater thanAmount of time elapsed since the dataset was last updated1 day

We recommend setting this monitor value based on your dataset's expected update frequency. For daily updates, set it to 1 day.

Geotemporal observation rules

Geotemporal observations sent over last 5 or 30 minutes

Alerts if the number of geotemporal observations sent over the last 5 or 30 minute window was less than or equal to the configured threshold.

Rule componentDescriptionExample options
If value is less than or equal toThreshold of geotemporal observations sent per unit time100

Automation rules

The following rules apply to both automations and time series streaming automations.

Automation has no new evaluations

Alerts if there has been no new evaluation since the configured threshold. Use this rule to detect performance degradation in an automation that should have been evaluated but did not. This rule does not alert when the automation has not been triggered.

Rule componentDescriptionExample options
If value is greater than or equal toThreshold of time elapsed since last automation evaluation1 hour

Automation has no new triggers

Alerts if there have been no new monitor triggers within the configured threshold. Use this rule to detect when an automation is not being triggered as expected.

Rule componentDescriptionExample options
If value is greater than or equal toThreshold of time elapsed since last automation trigger1 day

The single latest event exceeded the automation's failure threshold

Alerts if the single most recent execution had at least the configured number of failures. This does not take into account multiple events.

Rule componentDescriptionExample options
If value is greater thanThreshold of number of failures in most recent automation execution10

Automation has been disabled by the system

Alerts if an automation was disabled by the system due to reaching limits or triggering cycles. This rule is non-configurable, alerting with high severity when the automation is disabled.

Function rules

Function executions can fail for a variety of reasons. For a full list of failure types, see function failure types.

The Number of function failures in window rule tracks all failure types. The Number of user-facing function failures in window rule tracks only user-facing errors. The Number of non-user-facing function failures in window rule tracks all failure types except user-facing errors.

Function duration p95

Alerts when the p95 function duration exceeds the specified thresholds. The p95 is measured over a sliding window of recent data.

Rule componentDescriptionExample options
If value is greater thanThreshold of duration10s

Number of function failures in window

Alerts when the total number of failed executions of a function in the given window exceeds a given threshold. This rule tracks all failure types, including both user-facing and non-user-facing errors.

Rule componentDescriptionExample options
If value is greater thanThreshold of number of failures0
Time windowThe time period to count failures in1 hour

Number of user-facing function failures in window

Alerts when the number of user-facing function failures over a given window exceeds the specified thresholds. This rule tracks only user-facing errors thrown by function code.

Rule componentDescriptionExample options
If value is greater thanThreshold of number of failures0
Time windowThe time period to count failures in1 hour

Number of non-user-facing function failures in window

Alerts when the number of function failures over a given window exceeds the specified thresholds, excluding user-facing errors thrown by function code. This rule is useful for monitoring infrastructure and system-level failures without noise from expected user input errors.

Rule componentDescriptionExample options
If value is greater thanThreshold of number of failures0
Time windowThe time period to count failures in1 hour

Action rules

Action executions can fail for a variety of reasons. For a full list of failure types, see action failure types.

The Number of action failures in window rule tracks all failure types. The Number of non-user-facing action failures in window rule tracks all failure types except user-facing function failures.

Action duration p95

Alerts when the p95 action duration exceeds the specified thresholds. The p95 is measured over a sliding window of recent data.

Rule componentDescriptionExample options
If value is greater thanThreshold of duration10s

Number of action failures in window

Alerts when the total number of failed executions of an action in the given window exceeds a given threshold. This rule tracks all failure types, including both user-facing and non-user-facing errors.

Rule componentDescriptionExample options
If value is greater thanThreshold of number of failures0
Time windowThe time period to count failures in1 hour

Number of non-user-facing action failures in window

Alerts when the number of action failures over a given window exceeds the specified thresholds, excluding failures caused by user-facing errors thrown by function-backed action code. This rule is useful for monitoring infrastructure and system-level failures without noise from expected user input errors.

This rule tracks all failure types except user-facing function failures thrown by function code and displayed to users.

Rule componentDescriptionExample options
If value is greater thanThreshold of number of failures0
Time windowThe time period to count failures in1 hour