Data expectations

Data expectations are requirements that can be applied to dataset outputs. These requirements (known as "expectations") can be used to create checks that improve data pipeline stability.

Data expectations can be set on each pipeline output. Pipeline Builder currently supports two types of data expectations: primary key and row count.

Screenshot of data expectations pane

If any expectations fail, the build will fail. The job expectations pane will show which data expectations passed and failed.

Primary key data expectations

Primary key expectations are provided with one or more column names and verify:

  • Each column has no null values.
  • The combination of columns is unique.

Example of a primary key data expectation

If a single column is selected, we check that every value in that column is unique.

If two columns are selected, we check that the combination of both columns is unique.

In our example, we'll use two columns from our dataset, id and time, as the primary key.

Example dataset:

id    time
1     8pm
1     9pm
2     8pm
3     8pm

The above example would pass the check. Even though the id value 1 and the time value 8pm each appear more than once, every combination of id and time is unique.

Conversely, the following would fail:

id    time
1     8pm
2     9pm
1     8pm

This table would fail the check because the 1 and 8pm combination is repeated.
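The primary key logic described above can be sketched in a few lines of Python. This is only an illustration of the two rules (no null values, unique combination), not how Pipeline Builder implements the check. The function name `check_primary_key` and the representation of rows as dictionaries are assumptions made for this example.

```python
from typing import Any, Iterable, Mapping, Sequence


def check_primary_key(rows: Iterable[Mapping[str, Any]],
                      key_columns: Sequence[str]) -> bool:
    """Hypothetical helper: return True if the key columns contain no
    nulls and every combination of their values is unique."""
    seen = set()
    for row in rows:
        # Build the combined key from the selected columns.
        key = tuple(row.get(column) for column in key_columns)
        # Rule 1: no key column may be null (None stands in for null here).
        if any(value is None for value in key):
            return False
        # Rule 2: the combination of key columns must not repeat.
        if key in seen:
            return False
        seen.add(key)
    return True


# The two example datasets from this section.
passing = [
    {"id": 1, "time": "8pm"},
    {"id": 1, "time": "9pm"},
    {"id": 2, "time": "8pm"},
    {"id": 3, "time": "8pm"},
]
failing = [
    {"id": 1, "time": "8pm"},
    {"id": 2, "time": "9pm"},
    {"id": 1, "time": "8pm"},
]

print(check_primary_key(passing, ["id", "time"]))  # True
print(check_primary_key(failing, ["id", "time"]))  # False
```

Note that the first dataset would fail if id alone were selected as the primary key, because the value 1 appears twice in that column.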

Row count data expectations

Row count expectations are provided with a minimum and/or maximum row count.

If a minimum row count is provided, the expectation will verify that there are at least the specified number of rows.

If a maximum row count is provided, the expectation will verify that there are at most this many rows.
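As a rough sketch of the row count rules, the following Python function treats both bounds as optional and inclusive, matching the "at least" and "at most" wording above. The function name `check_row_count` and its parameters are hypothetical and are not part of Pipeline Builder.

```python
from typing import Optional


def check_row_count(row_count: int,
                    minimum: Optional[int] = None,
                    maximum: Optional[int] = None) -> bool:
    """Hypothetical helper: return True if row_count satisfies every
    bound that was provided. Both bounds are inclusive."""
    # A minimum means "at least this many rows".
    if minimum is not None and row_count < minimum:
        return False
    # A maximum means "at most this many rows".
    if maximum is not None and row_count > maximum:
        return False
    return True


print(check_row_count(4, minimum=1))              # True
print(check_row_count(0, minimum=1))              # False
print(check_row_count(4, minimum=1, maximum=3))   # False
```

When only one bound is given, the other side is left unconstrained, so a dataset of any size above the minimum (or below the maximum) passes.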