Amazon Deequ is an open-source data quality library developed internally at Amazon. It lets you define "unit tests for data" and run them at scale on datasets with billions of records. Deequ provides multiple features, such as automatic constraint suggestion and verification, metrics computation, and data profiling.
AWS Glue Data Quality, on the other hand, is a feature of AWS Glue that reduces the manual effort required to set up data quality checks from days to hours. It automatically computes statistics, recommends quality rules, monitors your data, and alerts you when it detects that quality has deteriorated. AWS Glue Data Quality is built on Deequ and offers a simplified user experience for customers who want to use this open-source package.
One key difference between the two is that Deequ is an open-source library that takes several steps to put into production: building the infrastructure, writing custom AWS Glue jobs, profiling the data, and generating rules before applying them. AWS Glue Data Quality simplifies this process by providing a serverless, cost-effective solution for achieving data quality at rest and in pipelines. Internally, however, it uses the same open-source Deequ library developed by Amazon.
|Feature|Amazon Deequ|AWS Glue Data Quality|
|---|---|---|
|Automatic constraint suggestions and verification|Yes|Yes|
|Simplified user experience|No|Yes|
|Built on top of Apache Spark|Yes|Yes|
|Regularly computes data quality metrics and generates reports|If configured|If turned on|
|Supports any data source supported by Apache Spark|Yes|No, as there is no API to access Spark.|
|Meets the requirements of production use cases at Amazon|Yes|Yes|
|Quickly analyzes data and creates data quality rules for you|Yes|Yes|
|Offers 25+ out-of-the-box DQ rules to start from|Yes|Yes|
|Provides a Data Quality score that gives an overview of the health of your data. You can use this score to make confident business decisions.|Yes|Yes|
|Helps you identify the exact records that caused your quality scores to go down, so that you can quarantine and fix them.|Can do|Yes|
|Pay-as-you-go billing to increase agility and improve costs.|n/a|Yes|
One of the features of a serverless solution like AWS Glue Data Quality is support for DQDL (Data Quality Definition Language). In essence, DQDL is a more compact way to define rules than writing Python or Scala code.
Data Quality Definition Language (DQDL) is a domain-specific language that you use to define rules for AWS Glue Data Quality. Amazon developed it to help users understand and work with their data quality tools.
DQDL is case-sensitive. A DQDL document contains a ruleset, which groups individual data quality rules together. To construct a ruleset, you must create a list named Rules (capitalized), delimited by a pair of square brackets. The list should contain one or more comma-separated DQDL rules.
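To make the ruleset structure concrete, here is a minimal Python sketch that assembles a ruleset as plain text. The helper `build_ruleset` and the column names `order_id` and `quantity` are hypothetical and only serve the illustration. `IsComplete`, `IsUnique`, and `ColumnValues` are DQDL rule types. The sketch only produces the text, it does not call any AWS API.

```python
# Minimal sketch: assemble a DQDL ruleset as text.
# build_ruleset and the column names are hypothetical, for illustration only.

def build_ruleset(rules):
    """Return a DQDL ruleset: a list named Rules, delimited by square
    brackets, holding one or more comma-separated rules."""
    if not rules:
        raise ValueError("a ruleset needs at least one rule")
    body = ",\n".join(f"    {rule}" for rule in rules)
    return f"Rules = [\n{body}\n]"


ruleset = build_ruleset([
    'IsComplete "order_id"',        # no missing values in the column
    'IsUnique "order_id"',          # no duplicate values in the column
    'ColumnValues "quantity" > 0',  # every value satisfies the expression
])
print(ruleset)
```

The printed text is what you would paste into a ruleset definition; keep the capital R in `Rules`, since DQDL is case-sensitive.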
The structure of a DQDL rule depends on the rule type. However, DQDL rules generally fit the following format:
<RuleType> <Parameter> <Parameter> <Expression>
RuleType is the case-sensitive name of the rule type that you want to configure, for example CustomSql, which takes a query written in the Spark SQL dialect. Rule parameters differ for each rule type.
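The general rule format can be sketched with a small Python helper that joins a rule type, its quoted parameters, and an optional expression. The function `format_rule` and the column name `customer_id` are hypothetical. The `CustomSql` line follows the shape of the examples in the AWS documentation, where, as far as I understand it, `primary` refers to the dataset being evaluated; treat that as an assumption and check it against your setup.

```python
# Minimal sketch of the <RuleType> <Parameter> ... <Expression> format.
# format_rule and the column name are hypothetical, for illustration only.

def format_rule(rule_type, parameters=(), expression=""):
    """Join the rule type, double-quoted parameters, and an optional expression."""
    parts = [rule_type]
    parts += [f'"{p}"' for p in parameters]
    if expression:
        parts.append(expression)
    return " ".join(parts)


# A rule with one parameter and no expression.
completeness = format_rule("IsComplete", ["customer_id"])
# A rule with no parameter, only an expression.
row_count = format_rule("RowCount", expression="> 100")
# CustomSql: the parameter is a Spark SQL query, followed by an expression.
custom_sql = format_rule(
    "CustomSql", ["select count(*) from primary"], "between 10 and 20"
)

for rule in (completeness, row_count, custom_sql):
    print(rule)
```

Not every rule type uses every slot, which is why both the parameters and the expression are optional in this sketch.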
DQDL supports the following logical operators that you can use to combine rules:
The and operator results in true if and only if all the rules that it connects are true. Otherwise, the combined rule results in false.
The or operator results in true if and only if one or more of the rules that it connects are true.
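The semantics of the two operators can be sketched in Python as follows. The helpers `combine` and `evaluate` are hypothetical: `combine` writes each rule in parentheses and joins them with the operator, which is the form used in the AWS documentation's examples, and `evaluate` mirrors the truth rules described above for a list of individual rule outcomes. The column name `id` is likewise an assumption.

```python
# Minimal sketch of combining DQDL rules with logical operators.
# combine and evaluate are hypothetical helpers, not part of any AWS library.

def combine(operator, *rules):
    """Wrap each rule in parentheses and join them with 'and' or 'or'."""
    if operator not in ("and", "or"):
        raise ValueError("operator must be 'and' or 'or'")
    if len(rules) < 2:
        raise ValueError("need at least two rules to combine")
    return f" {operator} ".join(f"({rule})" for rule in rules)


def evaluate(operator, outcomes):
    """'and' is true only if every outcome is true;
    'or' is true if at least one outcome is true."""
    if operator not in ("and", "or"):
        raise ValueError("operator must be 'and' or 'or'")
    return all(outcomes) if operator == "and" else any(outcomes)


both = combine("and", 'IsComplete "id"', 'IsUnique "id"')
either = combine("or", 'IsComplete "id"', 'IsUnique "id"')
print(both)
print(either)

# One rule passes and one fails: 'and' fails, 'or' passes.
print(evaluate("and", [True, False]))
print(evaluate("or", [True, False]))
```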