Adding Any Constraint You Want with AWS Deequ

Conveniently, AWS Deequ ships with many validation methods, e.g. .isComplete(), .isContainedIn(), etc., but there is no clear documentation or example showing how to add your own. Fortunately, it’s actually very easy to do: most of the built-in checks just end up calling the satisfies function:

  /**
    * Creates a constraint that runs the given condition on the data frame.
    *
    * @param columnCondition Data frame column which is a combination of expression and the column
    *                        name. It has to comply with Spark SQL syntax.
    *                        Can be written in an exact same way with conditions inside the
    *                        `WHERE` clause.
    * @param constraintName  A name that summarizes the check being made. This name is being used to
    *                        name the metrics for the analysis being done.
    * @param assertion       Function that receives a double input parameter and returns a boolean
    * @param hint A hint to provide additional context why a constraint could have failed
    * @return
    */
  def satisfies(
      columnCondition: String,
      constraintName: String,
      assertion: Double => Boolean = Check.IsOne,
      hint: Option[String] = None)
    : CheckWithLastConstraintFilterable = {

    addFilterableConstraint { filter =>
      complianceConstraint(constraintName, columnCondition, assertion, filter, hint)
    }
  }

For instance, isGreaterThan simply delegates to satisfies:

  def isGreaterThan(
      columnA: String,
      columnB: String,
      assertion: Double => Boolean = Check.IsOne,
      hint: Option[String] = None)
    : CheckWithLastConstraintFilterable = {

    satisfies(s"$columnA > $columnB", s"$columnA is greater than $columnB", assertion,
      hint = hint)
  }

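from which we can see that .satisfies() simply requires:

  1. columnCondition: a Spark SQL expression, written the same way as a condition inside a WHERE clause.
  2. constraintName: a user-friendly check name that shows up in the validation result.
  3. assertion and hint, both optional: assertion is a Double => Boolean function that defaults to Check.IsOne, and hint adds context on why the constraint could have failed.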
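
To see the parameters above in context, here is a minimal end-to-end sketch. It assumes Deequ and Spark are on the classpath and that the installed Deequ version still accepts the satisfies parameters quoted above; the sample data, the column names price and cost, the check description, and the 0.5 threshold are all made up for illustration.

```scala
import com.amazon.deequ.{VerificationResult, VerificationSuite}
import com.amazon.deequ.checks.{Check, CheckLevel}
import org.apache.spark.sql.SparkSession

object SatisfiesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("deequ-satisfies-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data; 2 of the 3 rows satisfy price > cost.
    val df = Seq(
      (1, 10.0, 8.0),
      (2, 5.0, 7.0),
      (3, 3.0, 1.0)
    ).toDF("id", "price", "cost")

    val check = Check(CheckLevel.Error, "custom constraints")
      .satisfies(
        "price > cost",        // columnCondition: any Spark SQL boolean expression
        "Price exceeds cost",  // constraintName: appears in the validation result
        _ >= 0.5,              // assertion: applied to the fraction of matching rows
        hint = Some("At least half of the rows should have price > cost"))

    val result = VerificationSuite()
      .onData(df)
      .addCheck(check)
      .run()

    // One row per constraint, including constraint and constraint_message columns.
    VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate = false)

    spark.stop()
  }
}
```

Leaving out the assertion falls back to Check.IsOne, so every row would then have to satisfy the condition.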

Examples

Validate that the dateTime field is no earlier than a minimum start date:

.satisfies("dateTime >= '2021-01-01'", "Start date is reasonable")

Error message:

constraint: ComplianceConstraint(Compliance(Start date is reasonable,dateTime >= '2021-01-01',None))
constraint_message: Value: 0.9865361077111383 does not meet the constraint requirement!
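
The value in the message (0.9865...) is the fraction of rows that satisfied the condition, and the default assertion, Check.IsOne, only passes when that fraction is exactly 1.0. The sketch below models this in plain Scala, without Spark or Deequ, to show how the assertion interacts with the fraction. The compliance helper, the sample dates, and the 0.75 threshold are illustrative assumptions, not Deequ code.

```scala
import java.time.LocalDate

object ComplianceSketch {
  // In-memory stand-in for the compliance metric: the fraction of rows
  // for which the predicate holds. Yields NaN for an empty input.
  def compliance[A](rows: Seq[A])(predicate: A => Boolean): Double =
    rows.count(predicate).toDouble / rows.size

  // Same behaviour as the default assertion: every row must match.
  val isOne: Double => Boolean = _ == 1.0

  // A more lenient, hypothetical assertion that tolerates some bad rows.
  val atLeast75Percent: Double => Boolean = _ >= 0.75

  val minStart: LocalDate = LocalDate.parse("2021-01-01")

  // Made-up sample data: one of the four dates is before the minimum.
  val dates: Seq[LocalDate] =
    Seq("2020-12-31", "2021-01-01", "2021-06-15", "2022-02-02")
      .map(s => LocalDate.parse(s))

  // Mirrors the condition dateTime >= '2021-01-01'.
  val fraction: Double = compliance(dates)(d => !d.isBefore(minStart))

  def main(args: Array[String]): Unit = {
    println(s"fraction = $fraction")
    println(s"strict assertion passes: ${isOne(fraction)}")
    println(s"lenient assertion passes: ${atLeast75Percent(fraction)}")
  }
}
```

In Deequ itself, the same tolerance would be expressed by passing an assertion as the third argument, for example:

    .satisfies("dateTime >= '2021-01-01'", "Start date is reasonable", _ >= 0.98)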

Validate that dateTime column values are not later than the current timestamp:

.satisfies("dateTime <= current_timestamp", "End date is less than now")
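
If the same condition is needed in several places, it can be wrapped in a named method, following the same pattern isGreaterThan uses. The sketch below does this with an implicit class; CustomChecks, RichCheck, and isNotInFuture are hypothetical names that are not part of Deequ, and the code assumes Check.IsOne and CheckWithLastConstraintFilterable are accessible as they appear in the source quoted earlier.

```scala
import com.amazon.deequ.checks.{Check, CheckWithLastConstraintFilterable}

object CustomChecks {

  // Hypothetical extension: adds isNotInFuture to any Check via an implicit class.
  implicit class RichCheck(check: Check) {

    // Asserts that values in `column` are not later than the current timestamp.
    // The column name is interpolated into the SQL expression as-is, so it is
    // assumed to be a plain identifier that needs no escaping.
    def isNotInFuture(
        column: String,
        assertion: Double => Boolean = Check.IsOne,
        hint: Option[String] = None)
      : CheckWithLastConstraintFilterable = {

      check.satisfies(
        s"$column <= current_timestamp",
        s"$column is not in the future",
        assertion,
        hint = hint)
    }
  }
}
```

With import CustomChecks._ in scope, the custom check chains like any built-in one:

    Check(CheckLevel.Error, "date checks").isNotInFuture("dateTime")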