Adding Any Constraint You Want with AWS Deequ
Conveniently, AWS Deequ has a lot of methods for validation, e.g. .isComplete(), .isContainedIn() etc., but no clear documentation or examples on how to add your own. Fortunately, it's actually very easy to do. Most of the checks just end up using the satisfies function:
/**
  * Creates a constraint that runs the given condition on the data frame.
  *
  * @param columnCondition Data frame column which is a combination of expression and the column
  *                        name. It has to comply with Spark SQL syntax.
  *                        Can be written in an exact same way with conditions inside the
  *                        `WHERE` clause.
  * @param constraintName  A name that summarizes the check being made. This name is being used to
  *                        name the metrics for the analysis being done.
  * @param assertion       Function that receives a double input parameter and returns a boolean
  * @param hint            A hint to provide additional context why a constraint could have failed
  * @return
  */
def satisfies(
    columnCondition: String,
    constraintName: String,
    assertion: Double => Boolean = Check.IsOne,
    hint: Option[String] = None)
  : CheckWithLastConstraintFilterable = {
  addFilterableConstraint { filter =>
    complianceConstraint(constraintName, columnCondition, assertion, filter, hint)
  }
}
For instance, isGreaterThan simply boils down to the following:
def isGreaterThan(
    columnA: String,
    columnB: String,
    assertion: Double => Boolean = Check.IsOne,
    hint: Option[String] = None)
  : CheckWithLastConstraintFilterable = {
  satisfies(s"$columnA > $columnB", s"$columnA is greater than $columnB", assertion,
    hint = hint)
}
From this we can see that .satisfies() simply requires:
- columnCondition, which is a Spark SQL expression.
- constraintName, which is a user-friendly check name that pops up in the validation result.
- The rest of the parameters, which you can look up yourself 🤪
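Following the same pattern as isGreaterThan, you could wrap your own condition in a reusable helper. The sketch below is only one way to do it, assuming Deequ is on the classpath; `CustomChecks`, `RichCheck` and `isNotInFuture` are hypothetical names made up for this example and are not part of Deequ.

```scala
import com.amazon.deequ.checks.{Check, CheckLevel, CheckWithLastConstraintFilterable}

object CustomChecks {
  // Hypothetical extension method: Deequ itself does not ship isNotInFuture.
  // It only forwards a Spark SQL condition and a readable name to satisfies.
  implicit class RichCheck(val check: Check) extends AnyVal {
    def isNotInFuture(
        column: String,
        assertion: Double => Boolean = Check.IsOne,
        hint: Option[String] = None): CheckWithLastConstraintFilterable =
      check.satisfies(
        s"$column <= current_timestamp",
        s"$column is not in the future",
        assertion,
        hint = hint)
  }
}

object CustomChecksUsage {
  import CustomChecks._

  // The helper chains like any built-in check method.
  val check: Check =
    Check(CheckLevel.Error, "timestamp sanity").isNotInFuture("dateTime")
}
```

An implicit class keeps the helper next to your own code without touching the library, at the cost of one extra import wherever it is used.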
Examples
Validate that the dateTime field has a minimum start date:
.satisfies("dateTime >= '2021-01-01'", "Start date is reasonable")
Error message:
| constraint | constraint_message |
|---|---|
| ComplianceConstraint(Compliance(Start date is reasonable,dateTime >= '2021-01-01',None)) | Value: 0.9865361077111383 does not meet the constraint requirement! |
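The Value in the message is the fraction of rows for which the condition held, and the default assertion (Check.IsOne) only passes when that fraction is exactly 1.0. The plain-Scala sketch below models that arithmetic without Spark or Deequ. The `compliance` helper, the `isOne` function and the sample dates are made up for illustration and are not Deequ code.

```scala
import java.time.LocalDate

object ComplianceSketch {
  // Hypothetical helper: fraction of rows for which the predicate holds.
  def compliance[A](rows: Seq[A])(predicate: A => Boolean): Double = {
    require(rows.nonEmpty, "need at least one row")
    rows.count(predicate).toDouble / rows.size
  }

  // Mirrors the default assertion: every single row must comply.
  val isOne: Double => Boolean = _ == 1.0

  val cutoff: LocalDate = LocalDate.parse("2021-01-01")

  // Made-up sample data: one of the four dates is before the cutoff.
  val dates: Seq[LocalDate] =
    Seq("2020-12-31", "2021-01-01", "2021-02-15", "2021-03-01")
      .map(s => LocalDate.parse(s))

  // Equivalent in spirit to the condition dateTime >= '2021-01-01'.
  val ratio: Double = compliance(dates)(d => !d.isBefore(cutoff))

  val strictPasses: Boolean = isOne(ratio)    // default-style assertion
  val relaxedPasses: Boolean = ratio >= 0.75  // custom, more tolerant assertion
}
```

If a few bad rows are acceptable, passing your own assertion as the third argument of .satisfies(), for example `_ >= 0.95`, is the usual way to tolerate them.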
Validate that dateTime column values are not later than the current time:
.satisfies("dateTime <= current_timestamp", "End date is less than now")
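For completeness, here is a rough end-to-end sketch of how the two checks above might be wired into a verification run. It assumes Spark and Deequ are on the classpath; the object name, the sample rows and the 0.95 threshold are made up for this example.

```scala
import java.sql.Timestamp

import com.amazon.deequ.{VerificationResult, VerificationSuite}
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import org.apache.spark.sql.SparkSession

object SatisfiesSketch {
  def run(spark: SparkSession): VerificationResult = {
    import spark.implicits._

    // Made-up sample data: the first row is earlier than 2021-01-01.
    val df = Seq(
      Timestamp.valueOf("2020-12-31 23:59:59"),
      Timestamp.valueOf("2021-03-01 12:00:00"),
      Timestamp.valueOf("2021-06-15 08:30:00")
    ).toDF("dateTime")

    val check = Check(CheckLevel.Error, "dateTime sanity checks")
      // Default assertion: every row has to satisfy the condition.
      .satisfies("dateTime >= '2021-01-01'", "Start date is reasonable")
      // Custom assertion: at least 95% of rows have to satisfy it.
      .satisfies("dateTime <= current_timestamp", "End date is less than now", _ >= 0.95)

    VerificationSuite().onData(df).addCheck(check).run()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("satisfies-sketch")
      .getOrCreate()

    val result = run(spark)

    // One row per constraint, with the same columns as the table shown earlier.
    VerificationResult.checkResultsAsDataFrame(spark, result)
      .select("constraint", "constraint_status", "constraint_message")
      .show(truncate = false)

    spark.stop()
  }
}
```

With this sample data the first constraint should fail, since one of the three rows predates 2021, so the overall check status is expected to be Error.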