[Docs][DataQuality]: Add DataQuality Docs (#9512)

Co-authored-by: Jiajie Zhong <zhongjiajie955@gmail.com>
3.0.0/version-upgrade
zixi0825 2 years ago committed by GitHub
commit 337696e258
  1. 310
      docs/docs/en/guide/task/data-quality.md
  2. 313
      docs/docs/zh/guide/task/data-quality.md
  3. BIN
      docs/img/tasks/demo/custom_sql_check.png
  4. BIN
      docs/img/tasks/demo/enumeration_check.png
  5. BIN
      docs/img/tasks/demo/field_length_check.png
  6. BIN
      docs/img/tasks/demo/multi_table_accuracy_check.png
  7. BIN
      docs/img/tasks/demo/multi_table_comparison_check.png
  8. BIN
      docs/img/tasks/demo/null_check.png
  9. BIN
      docs/img/tasks/demo/regexp_check.png
  10. BIN
      docs/img/tasks/demo/result.png
  11. BIN
      docs/img/tasks/demo/rule_detail.png
  12. BIN
      docs/img/tasks/demo/rule_list.png
  13. BIN
      docs/img/tasks/demo/table_count_check.png
  14. BIN
      docs/img/tasks/demo/timeliness_check.png
  15. BIN
      docs/img/tasks/demo/uniqueness_check.png

310
docs/docs/en/guide/task/data-quality.md

@@ -0,0 +1,310 @@
# Overview
## Introduction
The data quality task checks data accuracy during data integration and processing. Data quality tasks in this release include single-table checks, single-table custom SQL checks, multi-table accuracy checks, and two-table value comparisons. The data quality task runs on Spark 2.4.0; other versions have not been verified, so users need to verify them on their own.
- The execution flow of the data quality task is as follows:

> The user defines the task in the interface, and the user input values are stored in `TaskParam`.
> When a task runs, `Master` parses `TaskParam`, encapsulates the parameters required by `DataQualityTask`, and sends them to `Worker`.
> `Worker` runs the data quality task. After the task finishes, it writes the statistical results to the specified storage engine. Currently, the results are stored in the `t_ds_dq_execute_result` table of `dolphinscheduler`.
> `Worker` sends the task result to `Master`. After `Master` receives `TaskResponse`, it checks whether the task type is `DataQualityTask`. If so, it reads the corresponding result from `t_ds_dq_execute_result` by `taskInstanceId` and judges it according to the check mode, operator, and threshold configured by the user. If the result is a failure, the operation defined by the user-configured failure policy (alert or interruption) is performed.
## Notes
Add the following configuration to `<server-name>/conf/common.properties`:
```properties
data-quality.jar.name=dolphinscheduler-data-quality-dev-SNAPSHOT.jar
```
- Fill in `data-quality.jar.name` according to the actual package name.
- If you package `data-quality` separately, remember to make the package name consistent with `data-quality.jar.name`.
- If you are upgrading from an older version, execute the `sql` update script to initialize the database before running.
- If you want to use `MySQL` data, comment out the `scope` of `MySQL` in `pom.xml`.
- Currently only `MySQL`, `PostgreSQL` and `HIVE` data sources have been tested; other data sources have not been tested yet.
- `Spark` must be configured to read `Hive` metadata, because `Spark` does not use `jdbc` to read `Hive`.
## Detail
- CheckMethod: [CheckFormula][Operator][Threshold]. If the result is true, the data does not meet expectations and the failure strategy is executed.
- CheckFormula:
  - Expected-Actual
  - Actual-Expected
  - (Actual/Expected)x100%
  - (Expected-Actual)/Expected x100%
- Operator: =, >, >=, <, <=, !=
- ExpectedValue:
  - FixValue
  - DailyAvg
  - WeeklyAvg
  - MonthlyAvg
  - Last7DayAvg
  - Last30DayAvg
  - SrcTableTotalRows
  - TargetTableTotalRows
- Example:
  - CheckFormula: Actual-Expected
  - Operator: >
  - Threshold: 0
  - ExpectedValue: FixValue=9

Assuming that the actual value is 10, the operator is >, and the expected value is 9, the result 10 - 9 > 0 is true. This means that the number of rows with an empty column has exceeded the threshold, and the task is judged to have failed.
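
The rule above can be sketched in a few lines of Python. This is an illustrative sketch only, not DolphinScheduler's implementation: the function name `is_check_failed` and the two lookup tables are hypothetical, and the example uses the Actual-Expected formula because it matches the arithmetic 10 - 9 > 0.

```python
import operator

# Hypothetical lookup tables mirroring the CheckFormula and Operator lists.
FORMULAS = {
    "Expected-Actual": lambda actual, expected: expected - actual,
    "Actual-Expected": lambda actual, expected: actual - expected,
    "(Actual/Expected)x100%": lambda actual, expected: actual / expected * 100,
    "(Expected-Actual)/Expected x100%": lambda actual, expected: (expected - actual) / expected * 100,
}
OPERATORS = {
    "=": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "!=": operator.ne,
}


def is_check_failed(formula, op, threshold, actual, expected):
    """Return True when [CheckFormula][Operator][Threshold] holds, i.e. the check fails."""
    value = FORMULAS[formula](actual, expected)
    return OPERATORS[op](value, threshold)


# Actual value 10, fixed expected value 9: 10 - 9 > 0 is true, so the check fails.
print(is_check_failed("Actual-Expected", ">", 0, actual=10, expected=9))
```

With the same inputs, the Expected-Actual formula gives 9 - 10 = -1, which is not greater than 0, so the choice of formula decides the outcome.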
# Guide
## NullCheck
### Introduction
The null check counts the rows in which the specified column is empty. This count can be compared with the total number of rows or with a specified threshold; if it exceeds the threshold, the check is judged as failed.
- The SQL statement that counts the rows where the specified column is empty is as follows:
```sql
SELECT COUNT(*) AS miss FROM ${src_table} WHERE (${src_field} is null or ${src_field} = '') AND (${src_filter})
```
- The SQL to calculate the total number of rows in the table is as follows:
```sql
SELECT COUNT(*) AS total FROM ${src_table} WHERE (${src_filter})
```
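
To see what the two statements return, the placeholders can be filled in and run against a small table. The sketch below uses Python's built-in `sqlite3` as a stand-in for the Spark environment; the `users` table, the `email` column, and the `1 = 1` filter are made-up sample values, and `string.Template` is used here only because it understands the `${...}` placeholder syntax.

```python
import sqlite3
from string import Template

# The two statements shown above, kept as templates with ${...} placeholders.
MISS_SQL = Template(
    "SELECT COUNT(*) AS miss FROM ${src_table} "
    "WHERE (${src_field} is null or ${src_field} = '') AND (${src_filter})"
)
TOTAL_SQL = Template("SELECT COUNT(*) AS total FROM ${src_table} WHERE (${src_filter})")

# Hypothetical sample values; "1 = 1" stands in for "no filter".
params = {"src_table": "users", "src_field": "email", "src_filter": "1 = 1"}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (3, ""), (4, "d@example.com")],
)

miss = conn.execute(MISS_SQL.substitute(params)).fetchone()[0]
total = conn.execute(TOTAL_SQL.substitute(params)).fetchone()[0]
print(miss, total)  # rows with an empty email, and all rows
```

Here `miss` plays the role of the actual value, and `total` is one possible expected value (SrcTableTotalRows).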
### UI Guide
![dataquality_null_check](/img/tasks/demo/null_check.png)
- Source data type: select MySQL, PostgreSQL, etc.
- Source data source: the corresponding data source under the source data type
- Source data table: drop-down to select the table where the validation data is located
- Src filter conditions: as the name suggests; it is also applied when counting the total number of rows in the table; optional
- Src table check column: drop-down to select the check column name
- Check method:
  - [Expected-Actual]
  - [Actual-Expected]
  - [Actual/Expected]x100%
  - [(Expected-Actual)/Expected]x100%
- Check operators: =, >, >=, <, <=, !=
- Threshold: the value used in the formula for comparison
- Failure strategy
  - Alert: the data quality task failed, the DolphinScheduler task result is successful, and an alert is sent
  - Blocking: the data quality task failed, the DolphinScheduler task result is failed, and an alert is sent
- Expected value type: select the desired type from the drop-down menu
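
The two failure strategies differ only in the task result they produce. A minimal sketch of that mapping follows; the function name and return shape are hypothetical, and the behavior for a passing check (no alert) is an assumption, since the list above only describes the failing case.

```python
def resolve_task_result(check_failed, failure_strategy):
    """Return (task_result, alert_sent) for a finished data quality check."""
    if not check_failed:
        # Assumption: a passing check leaves the task successful and sends no alert.
        return ("success", False)
    if failure_strategy == "Alert":
        # The check failed, but the DolphinScheduler task still counts as successful.
        return ("success", True)
    if failure_strategy == "Blocking":
        # The check failed and the DolphinScheduler task fails as well.
        return ("failure", True)
    raise ValueError(f"unknown failure strategy: {failure_strategy}")


print(resolve_task_result(True, "Alert"))
print(resolve_task_result(True, "Blocking"))
```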
## Timeliness Check
### Introduction
The timeliness check verifies whether data is processed within the expected time. A start time and an end time define the time range. If the amount of data within the time range does not reach the set threshold, the check task is judged as failed.
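
As a rough illustration of the idea, the snippet below counts the rows whose timestamp falls inside a time range, using Python's built-in `sqlite3`. The `orders` table, the `processed_at` column, and the query itself are assumptions made for this sketch; the SQL that the task actually generates is not shown in this document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, processed_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, "2022-04-01 08:00:00"),
        (2, "2022-04-01 09:30:00"),
        (3, "2022-04-02 10:00:00"),
    ],
)

# Hypothetical time range in "yyyy-MM-dd HH:mm:ss" format; ISO-style strings
# compare correctly as text, so no date parsing is needed here.
start_time, end_time = "2022-04-01 00:00:00", "2022-04-01 23:59:59"

in_range = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE processed_at >= ? AND processed_at <= ?",
    (start_time, end_time),
).fetchone()[0]
print(in_range)  # number of rows processed inside the range
```

The resulting count is the actual value that the check method then compares with the expected value and threshold.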
### UI Guide
![dataquality_timeliness_check](/img/tasks/demo/timeliness_check.png)
- Source data type: select MySQL, PostgreSQL, etc.
- Source data source: the corresponding data source under the source data type
- Source data table: drop-down to select the table where the validation data is located
- Src filter conditions: as the name suggests; it is also applied when counting the total number of rows in the table; optional
- Src table check column: drop-down to select the check column name
- Start time: the start time of the time range
- End time: the end time of the time range
- Time format: set the corresponding time format
- Check method:
  - [Expected-Actual]
  - [Actual-Expected]
  - [Actual/Expected]x100%
  - [(Expected-Actual)/Expected]x100%
- Check operators: =, >, >=, <, <=, !=
- Threshold: the value used in the formula for comparison
- Failure strategy
  - Alert: the data quality task failed, the DolphinScheduler task result is successful, and an alert is sent
  - Blocking: the data quality task failed, the DolphinScheduler task result is failed, and an alert is sent
- Expected value type: select the desired type from the drop-down menu
## Field Length Check
### Introduction
The field length check verifies whether the length of the selected field meets expectations. If some rows do not meet the requirement and their number exceeds the threshold, the task is judged as failed.
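
A minimal sketch of such a count, using Python's built-in `sqlite3`, is shown below. The `users` table, the `name` column, and the helper `count_by_length` are hypothetical. Whether the real rule counts rows that satisfy the comparison or rows that violate it is not stated here, so the sketch simply counts the rows for which the comparison is true.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?)",
    [("ab",), ("abcdef",), ("abcdefghij",)],
)

# The logical operators offered in the UI; anything else is rejected so the
# operator can be placed into the SQL text safely.
ALLOWED_OPERATORS = {"=", ">", ">=", "<", "<=", "!="}


def count_by_length(comparison, length_limit):
    """Count rows where LENGTH(name) <comparison> length_limit is true."""
    if comparison not in ALLOWED_OPERATORS:
        raise ValueError(f"unsupported operator: {comparison}")
    sql = f"SELECT COUNT(*) FROM users WHERE LENGTH(name) {comparison} ?"
    return conn.execute(sql, (length_limit,)).fetchone()[0]


too_long = count_by_length(">", 5)
print(too_long)  # rows whose name is longer than 5 characters
```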
### UI Guide
![dataquality_length_check](/img/tasks/demo/field_length_check.png)
- Source data type: select MySQL, PostgreSQL, etc.
- Source data source: the corresponding data source under the source data type
- Source data table: drop-down to select the table where the validation data is located
- Src filter conditions: as the name suggests; it is also applied when counting the total number of rows in the table; optional
- Src table check column: drop-down to select the check column name
- Logical operators: =, >, >=, <, <=, !=
- Field length limit: as the name suggests
- Check method:
  - [Expected-Actual]
  - [Actual-Expected]
  - [Actual/Expected]x100%
  - [(Expected-Actual)/Expected]x100%
- Check operators: =, >, >=, <, <=, !=
- Threshold: the value used in the formula for comparison
- Failure strategy
  - Alert: the data quality task failed, the DolphinScheduler task result is successful, and an alert is sent
  - Blocking: the data quality task failed, the DolphinScheduler task result is failed, and an alert is sent
- Expected value type: select the desired type from the drop-down menu
## Uniqueness Check
### Introduction
The uniqueness check verifies whether a field contains duplicate values. It is generally used to check whether a primary key is duplicated. If duplicates exist and the threshold is reached, the check task is judged as failed.
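
One common way to find duplicates is to group by the checked column and keep the groups with more than one row. The sketch below does this with Python's built-in `sqlite3`; the `orders` table and `order_id` column are made-up examples, and the query is an illustration rather than the statement the task generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?)",
    [(1,), (2,), (2,), (3,), (3,), (3,)],
)

# Count the distinct values that appear more than once in the checked column.
duplicated_values = conn.execute(
    "SELECT COUNT(*) FROM ("
    "SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1"
    ")"
).fetchone()[0]
print(duplicated_values)  # order_id values 2 and 3 are duplicated
```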
### UI Guide
![dataquality_uniqueness_check](/img/tasks/demo/uniqueness_check.png)
- Source data type: select MySQL, PostgreSQL, etc.
- Source data source: the corresponding data source under the source data type
- Source data table: drop-down to select the table where the validation data is located
- Src filter conditions: as the name suggests; it is also applied when counting the total number of rows in the table; optional
- Src table check column: drop-down to select the check column name
- Check method:
  - [Expected-Actual]
  - [Actual-Expected]
  - [Actual/Expected]x100%
  - [(Expected-Actual)/Expected]x100%
- Check operators: =, >, >=, <, <=, !=
- Threshold: the value used in the formula for comparison
- Failure strategy
  - Alert: the data quality task failed, the DolphinScheduler task result is successful, and an alert is sent
  - Blocking: the data quality task failed, the DolphinScheduler task result is failed, and an alert is sent
- Expected value type: select the desired type from the drop-down menu
## Regular Expression Check
### Introduction
The regular expression check verifies whether the values of a field follow a required format, such as a time format, email format, or ID card format. If some values do not match the format and their number exceeds the threshold, the task is judged as failed.
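
The counting step can be illustrated with Python's `re` module. The date pattern and the sample values below are made up for this sketch, and the regular expression dialect of the actual runtime may differ from Python's.

```python
import re

# Hypothetical pattern for dates written as yyyy-MM-dd.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")

values = ["2022-04-01", "2022/04/01", "20220401", "2021-12-31"]

# fullmatch requires the whole value to follow the pattern, not just a prefix.
mismatches = [v for v in values if DATE_PATTERN.fullmatch(v) is None]
print(len(mismatches), mismatches)
```

The number of mismatching values is what gets compared with the threshold.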
### UI Guide
![dataquality_regex_check](/img/tasks/demo/regexp_check.png)
- Source data type: select MySQL, PostgreSQL, etc.
- Source data source: the corresponding data source under the source data type
- Source data table: drop-down to select the table where the validation data is located
- Src filter conditions: as the name suggests; it is also applied when counting the total number of rows in the table; optional
- Src table check column: drop-down to select the check column name
- Regular expression: as the name suggests
- Check method:
  - [Expected-Actual]
  - [Actual-Expected]
  - [Actual/Expected]x100%
  - [(Expected-Actual)/Expected]x100%
- Check operators: =, >, >=, <, <=, !=
- Threshold: the value used in the formula for comparison
- Failure strategy
  - Alert: the data quality task failed, the DolphinScheduler task result is successful, and an alert is sent
  - Blocking: the data quality task failed, the DolphinScheduler task result is failed, and an alert is sent
- Expected value type: select the desired type from the drop-down menu
## Enumeration Check
### Introduction
The enumeration check verifies whether the values of a field fall within a list of enumeration values. If some values are outside the list and their number exceeds the threshold, the task is judged as failed.
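
A small sketch of this count with Python's built-in `sqlite3` follows. The `users` table, the `gender` column, and the enumeration list are made-up sample values; the query only illustrates the idea and is not taken from the task itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (gender TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?)",
    [("male",), ("female",), ("unknown",), ("female",), ("n/a",)],
)

# The UI takes the enumeration values as a comma-separated list.
enum_values = "male,female".split(",")
placeholders = ",".join("?" * len(enum_values))

not_in_enum = conn.execute(
    f"SELECT COUNT(*) FROM users WHERE gender NOT IN ({placeholders})",
    enum_values,
).fetchone()[0]
print(not_in_enum)  # rows whose value is outside the enumeration list
```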
### UI Guide
![dataquality_enum_check](/img/tasks/demo/enumeration_check.png)
- Source data type: select MySQL, PostgreSQL, etc.
- Source data source: the corresponding data source under the source data type
- Source data table: drop-down to select the table where the validation data is located
- Src table filter conditions: as the name suggests; it is also applied when counting the total number of rows in the table; optional
- Src table check column: drop-down to select the check column name
- List of enumeration values: separated by commas
- Check method:
  - [Expected-Actual]
  - [Actual-Expected]
  - [Actual/Expected]x100%
  - [(Expected-Actual)/Expected]x100%
- Check operators: =, >, >=, <, <=, !=
- Threshold: the value used in the formula for comparison
- Failure strategy
  - Alert: the data quality task failed, the DolphinScheduler task result is successful, and an alert is sent
  - Blocking: the data quality task failed, the DolphinScheduler task result is failed, and an alert is sent
- Expected value type: select the desired type from the drop-down menu
## Table Count Check
### Introduction
The table count check verifies whether the number of rows in the table reaches the expected value. If the row count does not meet the expectation, the task is judged as failed.
### UI Guide
![dataquality_count_check](/img/tasks/demo/table_count_check.png)
- Source data type: select MySQL, PostgreSQL, etc.
- Source data source: the corresponding data source under the source data type
- Source data table: drop-down to select the table where the validation data is located
- Src filter conditions: as the name suggests; it is also applied when counting the total number of rows in the table; optional
- Src table check column: drop-down to select the check column name
- Check method:
  - [Expected-Actual]
  - [Actual-Expected]
  - [Actual/Expected]x100%
  - [(Expected-Actual)/Expected]x100%
- Check operators: =, >, >=, <, <=, !=
- Threshold: the value used in the formula for comparison
- Failure strategy
  - Alert: the data quality task failed, the DolphinScheduler task result is successful, and an alert is sent
  - Blocking: the data quality task failed, the DolphinScheduler task result is failed, and an alert is sent
- Expected value type: select the desired type from the drop-down menu
## Custom SQL Check
### Introduction
### UI Guide
![dataquality_custom_sql_check](/img/tasks/demo/custom_sql_check.png)
- Source data type: select MySQL, PostgreSQL, etc.
- Source data source: the corresponding data source under the source data type
- Source data table: drop-down to select the table where the data to be verified is located
- Actual value name: the alias used in the SQL for the computed statistical value, such as max_num
- Actual value calculation SQL: the SQL that outputs the actual value
  - Note: the SQL must be a statistical SQL statement, such as counting the number of rows or calculating the maximum or minimum value
  - Example: `select max(a) as max_num from ${src_table}`; the table name must be written as `${src_table}`
- Src filter conditions: as the name suggests; it is also applied when counting the total number of rows in the table; optional
- Check method:
- Check operators: =, >, >=, <, <=, !=
- Threshold: the value used in the formula for comparison
- Failure strategy
  - Alert: the data quality task failed, the DolphinScheduler task result is successful, and an alert is sent
  - Blocking: the data quality task failed, the DolphinScheduler task result is failed, and an alert is sent
- Expected value type: select the desired type from the drop-down menu
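
The relationship between the actual value name and the calculation SQL can be sketched as follows. Python's built-in `sqlite3` and `string.Template` stand in for the real runtime, and the `demo_table` table is a made-up example; only the SQL text and the alias `max_num` come from the list above.

```python
import sqlite3
from string import Template

# The alias in the SQL must match the configured actual value name.
actual_value_name = "max_num"
actual_value_sql = Template("select max(a) as max_num from ${src_table}")

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets the result be read by column alias
conn.execute("CREATE TABLE demo_table (a INTEGER)")
conn.executemany("INSERT INTO demo_table VALUES (?)", [(3,), (7,), (5,)])

# ${src_table} is replaced with the selected source table before execution.
row = conn.execute(actual_value_sql.substitute(src_table="demo_table")).fetchone()
actual_value = row[actual_value_name]
print(actual_value)
```

If the alias and the actual value name differ, the lookup by name fails, which is why the two settings have to agree.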
## Multi-Table Accuracy Check
### Introduction
The accuracy check compares the data records of selected fields between two tables and reports the differences. For example:
- table test1

| c1 | c2 |
| :---: | :---: |
| a | 1 |
| b | 2 |

- table test2

| c21 | c22 |
| :---: | :---: |
| a | 1 |
| b | 3 |

If you compare the data in c1 and c21, the tables test1 and test2 are exactly the same. If you compare c2 and c22, the data in table test1 and table test2 are inconsistent.
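
The comparison can be reproduced with the two example tables. The sketch below uses Python's built-in `sqlite3` and a left join to count the rows of `test1` that have no matching row in `test2`; the helper `count_unmatched` and the join-based query are assumptions made for illustration, not the statement the task generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test1 (c1 TEXT, c2 INTEGER)")
conn.execute("CREATE TABLE test2 (c21 TEXT, c22 INTEGER)")
conn.executemany("INSERT INTO test1 VALUES (?, ?)", [("a", 1), ("b", 2)])
conn.executemany("INSERT INTO test2 VALUES (?, ?)", [("a", 1), ("b", 3)])


def count_unmatched(column_pairs):
    """Count test1 rows without a test2 row that matches on every column pair."""
    on_clause = " AND ".join(f"test1.{src} = test2.{dst}" for src, dst in column_pairs)
    first_target = column_pairs[0][1]
    sql = (
        f"SELECT COUNT(*) FROM test1 LEFT JOIN test2 ON {on_clause} "
        f"WHERE test2.{first_target} IS NULL"
    )
    return conn.execute(sql).fetchone()[0]


print(count_unmatched([("c1", "c21")]))  # c1 vs c21: every row has a match
print(count_unmatched([("c2", "c22")]))  # c2 vs c22: the value 2 has no match
```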
### UI Guide
![dataquality_multi_table_accuracy_check](/img/tasks/demo/multi_table_accuracy_check.png)
- Source data type: select MySQL, PostgreSQL, etc.
- Source data source: the corresponding data source under the source data type
- Source data table: drop-down to select the table where the data to be verified is located
- Src filter conditions: as the name suggests; it is also applied when counting the total number of rows in the table; optional
- Target data type: select MySQL, PostgreSQL, etc.
- Target data source: the corresponding data source under the target data type
- Target data table: drop-down to select the table where the data to be verified is located
- Target filter conditions: as the name suggests; it is also applied when counting the total number of rows in the table; optional
- Check column:
  - Fill in the source data column, operator and target data column respectively
- Verification method: select the desired verification method
- Operators: =, >, >=, <, <=, !=
- Failure strategy
  - Alert: the data quality task failed, the DolphinScheduler task result is successful, and an alert is sent
  - Blocking: the data quality task failed, the DolphinScheduler task result is failed, and an alert is sent
- Expected value type: select the desired type from the drop-down menu; only SrcTableTotalRows, TargetTableTotalRows and FixValue are suitable here
## Two-Table Value Comparison
### Introduction
Two-table value comparison lets users define a different statistical SQL statement for each of two tables and compare the resulting values. For example, compute the total amount of a column in source table A as sum1, compute the total amount of a column in the target table as sum2, and then compare sum1 and sum2 to determine the check result.
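
A minimal sketch of this comparison, using Python's built-in `sqlite3`, is shown below. The table names `src_orders` and `target_orders`, the `amount` column, and the final "difference is not zero" rule are made-up examples; the real check uses whichever verification method and operator are configured.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_orders (amount INTEGER)")
conn.execute("CREATE TABLE target_orders (amount INTEGER)")
conn.executemany("INSERT INTO src_orders VALUES (?)", [(10,), (20,), (30,)])
conn.executemany("INSERT INTO target_orders VALUES (?)", [(10,), (20,), (25,)])

# One statistical SQL per table: sum1 is the actual value, sum2 the expected value.
sum1 = conn.execute("SELECT SUM(amount) AS sum1 FROM src_orders").fetchone()[0]
sum2 = conn.execute("SELECT SUM(amount) AS sum2 FROM target_orders").fetchone()[0]

# Example rule: the check fails when Actual-Expected != 0.
check_failed = (sum1 - sum2) != 0
print(sum1, sum2, check_failed)
```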
### UI Guide
![dataquality_multi_table_comparison_check](/img/tasks/demo/multi_table_comparison_check.png)
- Source data type: select MySQL, PostgreSQL, etc.
- Source data source: the corresponding data source under the source data type
- Source data table: the table where the data is to be verified
- Actual value name: the alias used in the actual value calculation SQL, such as max_age1
- Actual value calculation SQL: the SQL that outputs the actual value
  - Note: the SQL must be a statistical SQL statement, such as counting the number of rows or calculating the maximum or minimum value
  - Example: `select max(age) as max_age1 from ${src_table}`; the table name must be written as `${src_table}`
- Target data type: select MySQL, PostgreSQL, etc.
- Target data source: the corresponding data source under the target data type
- Target data table: the table where the data is to be verified
- Expected value name: the alias used in the expected value calculation SQL, such as max_age2
- Expected value calculation SQL: the SQL that outputs the expected value
  - Note: the SQL must be a statistical SQL statement, such as counting the number of rows or calculating the maximum or minimum value
  - Example: `select max(age) as max_age2 from ${target_table}`; the table name must be written as `${target_table}`
- Verification method: select the desired verification method
- Operators: =, >, >=, <, <=, !=
- Failure strategy
  - Alert: the data quality task failed, the DolphinScheduler task result is successful, and an alert is sent
  - Blocking: the data quality task failed, the DolphinScheduler task result is failed, and an alert is sent
## Task result view
![dataquality_result](/img/tasks/demo/result.png)
## Rule View
### List of rules
![dataquality_rule_list](/img/tasks/demo/rule_list.png)
### Rules Details
![dataquality_rule_detail](/img/tasks/demo/rule_detail.png)

313
docs/docs/zh/guide/task/data-quality.md

@@ -0,0 +1,313 @@
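# Overview
## Task Type Introduction
The data quality task checks data accuracy during data integration and processing. Data quality tasks in this release include single-table checks, single-table custom SQL checks, multi-table accuracy checks, and two-table value comparisons. The data quality task runs on Spark 2.4.0; other versions have not been verified, so users need to verify them on their own.
- The execution logic of the data quality task is as follows:

> The user defines the task in the interface, and the user input values are saved in `TaskParam`.
> When a task runs, `Master` parses `TaskParam`, encapsulates the parameters required by `DataQualityTask`, and sends them to `Worker`.
> `Worker` runs the data quality task. After the task finishes, it writes the statistical results to the specified storage engine. Currently, the results are stored in the `t_ds_dq_execute_result` table of `dolphinscheduler`.
> `Worker` sends the task result to `Master`. After `Master` receives `TaskResponse`, it checks whether the task type is `DataQualityTask`. If so, it reads the corresponding result from `t_ds_dq_execute_result` by `taskInstanceId` and judges it according to the check method, operator, and threshold configured by the user. If the result is a failure, the operation defined by the user-configured failure strategy (alert or interruption) is performed.
## Notes
Add the following configuration to `<server-name>/conf/common.properties`:
```properties
data-quality.jar.name=dolphinscheduler-data-quality-dev-SNAPSHOT.jar
```
- Fill in `data-quality.jar.name` according to the actual package name.
- If you package `data-quality` separately, remember to make the package name consistent with `data-quality.jar.name`.
- If you are upgrading from an older version, execute the `sql` update script to initialize the database before running.
- If you want to use `MySQL` data, comment out the `scope` of `MySQL` in `pom.xml`.
- Currently only `MySQL`, `PostgreSQL` and `HIVE` data sources have been tested; other data sources have not been tested yet.
- `Spark` must be configured to read `Hive` metadata, because `Spark` does not use `jdbc` to read `Hive`.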
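## Check Logic in Detail
- Check formula: [Check method][Operator][Threshold]. If the result is true, the data does not meet expectations and the failure strategy is executed.
- Check method:
  - [Expected-Actual]: expected value - actual value
  - [Actual-Expected]: actual value - expected value
  - [Actual/Expected]x100%: (actual value / expected value) x 100%
  - [(Expected-Actual)/Expected]x100%: ((expected value - actual value) / expected value) x 100%
- Operator: =, >, >=, <, <=, !=
- Expected value type:
  - Fixed value
  - Daily average
  - Weekly average
  - Monthly average
  - Average of the last 7 days
  - Average of the last 30 days
  - Total rows of the source table
  - Total rows of the target table
- Example:
  - Check method: [Actual-Expected] (actual value - expected value)
  - Operator: >
  - Threshold: 0
  - Expected value type: fixed value = 9

Assuming that the actual value is 10, the operator is >, and the expected value is 9, the result 10 - 9 > 0 is true. This means that the number of rows with an empty column has exceeded the threshold, and the task is judged to have failed.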
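# Task Operation Guide
## Single-Table Check: Null Value Check
### Check Introduction
The null value check counts the rows in which the specified column is empty. This count can be compared with the total number of rows or with a specified threshold; if it exceeds the threshold, the check is judged as failed.
- The SQL statement that counts the rows where the specified column is empty is as follows:
```sql
SELECT COUNT(*) AS miss FROM ${src_table} WHERE (${src_field} is null or ${src_field} = '') AND (${src_filter})
```
- The SQL statement that counts the total number of rows in the table is as follows:
```sql
SELECT COUNT(*) AS total FROM ${src_table} WHERE (${src_filter})
```
### UI Operation Guide
![dataquality_null_check](/img/tasks/demo/null_check.png)
- Source data type: select MySQL, PostgreSQL, etc.
- Source data source: the corresponding data source under the source data type
- Source data table: drop-down to select the table where the data to be verified is located
- Source filter conditions: as the name suggests; it is also applied when counting the total number of rows in the table; optional
- Source table check column: drop-down to select the check column name
- Check method:
  - [Expected-Actual]: expected value - actual value
  - [Actual-Expected]: actual value - expected value
  - [Actual/Expected]x100%: (actual value / expected value) x 100%
  - [(Expected-Actual)/Expected]x100%: ((expected value - actual value) / expected value) x 100%
- Check operators: =, >, >=, <, <=, !=
- Threshold: the value used in the formula for comparison
- Failure strategy
  - Alert: the data quality task failed, the DolphinScheduler task result is successful, and an alert is sent
  - Blocking: the data quality task failed, the DolphinScheduler task result is failed, and an alert is sent
- Expected value type: select the desired type from the drop-down menu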
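## Single-Table Check: Timeliness Check
### Check Introduction
The timeliness check verifies whether data is processed within the expected time. A start time and an end time define the time range. If the amount of data within the time range does not reach the set threshold, the check task is judged as failed.
### UI Operation Guide
![dataquality_timeliness_check](/img/tasks/demo/timeliness_check.png)
- Source data type: select MySQL, PostgreSQL, etc.
- Source data source: the corresponding data source under the source data type
- Source data table: drop-down to select the table where the data to be verified is located
- Source filter conditions: as the name suggests; it is also applied when counting the total number of rows in the table; optional
- Source table check column: drop-down to select the check column name
- Start time: the start time of the time range
- End time: the end time of the time range
- Time format: set the corresponding time format
- Check method:
  - [Expected-Actual]: expected value - actual value
  - [Actual-Expected]: actual value - expected value
  - [Actual/Expected]x100%: (actual value / expected value) x 100%
  - [(Expected-Actual)/Expected]x100%: ((expected value - actual value) / expected value) x 100%
- Check operators: =, >, >=, <, <=, !=
- Threshold: the value used in the formula for comparison
- Failure strategy
  - Alert: the data quality task failed, the DolphinScheduler task result is successful, and an alert is sent
  - Blocking: the data quality task failed, the DolphinScheduler task result is failed, and an alert is sent
- Expected value type: select the desired type from the drop-down menu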
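## Single-Table Check: Field Length Check
### Check Introduction
The field length check verifies whether the length of the selected field meets expectations. If some rows do not meet the requirement and their number exceeds the threshold, the task is judged as failed.
### UI Operation Guide
![dataquality_length_check](/img/tasks/demo/field_length_check.png)
- Source data type: select MySQL, PostgreSQL, etc.
- Source data source: the corresponding data source under the source data type
- Source data table: drop-down to select the table where the data to be verified is located
- Source filter conditions: as the name suggests; it is also applied when counting the total number of rows in the table; optional
- Source table check column: drop-down to select the check column name
- Logical operators: =, >, >=, <, <=, !=
- Field length limit: as the name suggests
- Check method:
  - [Expected-Actual]: expected value - actual value
  - [Actual-Expected]: actual value - expected value
  - [Actual/Expected]x100%: (actual value / expected value) x 100%
  - [(Expected-Actual)/Expected]x100%: ((expected value - actual value) / expected value) x 100%
- Check operators: =, >, >=, <, <=, !=
- Threshold: the value used in the formula for comparison
- Failure strategy
  - Alert: the data quality task failed, the DolphinScheduler task result is successful, and an alert is sent
  - Blocking: the data quality task failed, the DolphinScheduler task result is failed, and an alert is sent
- Expected value type: select the desired type from the drop-down menu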
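## Single-Table Check: Uniqueness Check
### Check Introduction
The uniqueness check verifies whether a field contains duplicate values. It is generally used to check whether a primary key is duplicated. If duplicates exist and the threshold is reached, the check task is judged as failed.
### UI Operation Guide
![dataquality_uniqueness_check](/img/tasks/demo/uniqueness_check.png)
- Source data type: select MySQL, PostgreSQL, etc.
- Source data source: the corresponding data source under the source data type
- Source data table: drop-down to select the table where the data to be verified is located
- Source filter conditions: as the name suggests; it is also applied when counting the total number of rows in the table; optional
- Source table check column: drop-down to select the check column name
- Check method:
  - [Expected-Actual]: expected value - actual value
  - [Actual-Expected]: actual value - expected value
  - [Actual/Expected]x100%: (actual value / expected value) x 100%
  - [(Expected-Actual)/Expected]x100%: ((expected value - actual value) / expected value) x 100%
- Check operators: =, >, >=, <, <=, !=
- Threshold: the value used in the formula for comparison
- Failure strategy
  - Alert: the data quality task failed, the DolphinScheduler task result is successful, and an alert is sent
  - Blocking: the data quality task failed, the DolphinScheduler task result is failed, and an alert is sent
- Expected value type: select the desired type from the drop-down menu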
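## Single-Table Check: Regular Expression Check
### Check Introduction
The regular expression check verifies whether the values of a field follow a required format, such as a time format, email format, or ID card format. If some values do not match the format and their number exceeds the threshold, the task is judged as failed.
### UI Operation Guide
![dataquality_regex_check](/img/tasks/demo/regexp_check.png)
- Source data type: select MySQL, PostgreSQL, etc.
- Source data source: the corresponding data source under the source data type
- Source data table: drop-down to select the table where the data to be verified is located
- Source filter conditions: as the name suggests; it is also applied when counting the total number of rows in the table; optional
- Source table check column: drop-down to select the check column name
- Regular expression: as the name suggests
- Check method:
  - [Expected-Actual]: expected value - actual value
  - [Actual-Expected]: actual value - expected value
  - [Actual/Expected]x100%: (actual value / expected value) x 100%
  - [(Expected-Actual)/Expected]x100%: ((expected value - actual value) / expected value) x 100%
- Check operators: =, >, >=, <, <=, !=
- Threshold: the value used in the formula for comparison
- Failure strategy
  - Alert: the data quality task failed, the DolphinScheduler task result is successful, and an alert is sent
  - Blocking: the data quality task failed, the DolphinScheduler task result is failed, and an alert is sent
- Expected value type: select the desired type from the drop-down menu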
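## Single-Table Check: Enumeration Value Check
### Check Introduction
The enumeration value check verifies whether the values of a field fall within a list of enumeration values. If some values are outside the list and their number exceeds the threshold, the task is judged as failed.
### UI Operation Guide
![dataquality_enum_check](/img/tasks/demo/enumeration_check.png)
- Source data type: select MySQL, PostgreSQL, etc.
- Source data source: the corresponding data source under the source data type
- Source data table: drop-down to select the table where the data to be verified is located
- Source table filter conditions: as the name suggests; it is also applied when counting the total number of rows in the table; optional
- Source table check column: drop-down to select the check column name
- List of enumeration values: separated by ASCII commas (,)
- Check method:
  - [Expected-Actual]: expected value - actual value
  - [Actual-Expected]: actual value - expected value
  - [Actual/Expected]x100%: (actual value / expected value) x 100%
  - [(Expected-Actual)/Expected]x100%: ((expected value - actual value) / expected value) x 100%
- Check operators: =, >, >=, <, <=, !=
- Threshold: the value used in the formula for comparison
- Failure strategy
  - Alert: the data quality task failed, the DolphinScheduler task result is successful, and an alert is sent
  - Blocking: the data quality task failed, the DolphinScheduler task result is failed, and an alert is sent
- Expected value type: select the desired type from the drop-down menu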
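## Single-Table Check: Table Row Count Check
### Check Introduction
The table row count check verifies whether the number of rows in the table reaches the expected value. If the row count does not meet the expectation, the task is judged as failed.
### UI Operation Guide
![dataquality_count_check](/img/tasks/demo/table_count_check.png)
- Source data type: select MySQL, PostgreSQL, etc.
- Source data source: the corresponding data source under the source data type
- Source data table: drop-down to select the table where the data to be verified is located
- Source filter conditions: as the name suggests; it is also applied when counting the total number of rows in the table; optional
- Source table check column: drop-down to select the check column name
- Check method:
  - [Expected-Actual]: expected value - actual value
  - [Actual-Expected]: actual value - expected value
  - [Actual/Expected]x100%: (actual value / expected value) x 100%
  - [(Expected-Actual)/Expected]x100%: ((expected value - actual value) / expected value) x 100%
- Check operators: =, >, >=, <, <=, !=
- Threshold: the value used in the formula for comparison
- Failure strategy
  - Alert: the data quality task failed, the DolphinScheduler task result is successful, and an alert is sent
  - Blocking: the data quality task failed, the DolphinScheduler task result is failed, and an alert is sent
- Expected value type: select the desired type from the drop-down menu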
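## Single-Table Check: Custom SQL Check
### Check Introduction
### UI Operation Guide
![dataquality_custom_sql_check](/img/tasks/demo/custom_sql_check.png)
- Source data type: select MySQL, PostgreSQL, etc.
- Source data source: the corresponding data source under the source data type
- Source data table: drop-down to select the table where the data to be verified is located
- Actual value name: the alias used in the SQL for the computed statistical value, such as max_num
- Actual value calculation SQL: the SQL that outputs the actual value
  - Note: the SQL must be a statistical SQL statement, such as counting the number of rows or calculating the maximum or minimum value
  - Example: `select max(a) as max_num from ${src_table}`; the table name must be written as `${src_table}`
- Source filter conditions: as the name suggests; it is also applied when counting the total number of rows in the table; optional
- Check method:
- Check operators: =, >, >=, <, <=, !=
- Threshold: the value used in the formula for comparison
- Failure strategy
  - Alert: the data quality task failed, the DolphinScheduler task result is successful, and an alert is sent
  - Blocking: the data quality task failed, the DolphinScheduler task result is failed, and an alert is sent
- Expected value type: select the desired type from the drop-down menu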
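## Multi-Table Check: Accuracy Check
### Check Introduction
The accuracy check compares the data records of selected fields between two tables and reports the differences. For example:
- table test1

| c1 | c2 |
| :---: | :---: |
| a | 1 |
| b | 2 |

- table test2

| c21 | c22 |
| :---: | :---: |
| a | 1 |
| b | 3 |

If you compare the data in c1 and c21, tables test1 and test2 are exactly the same. If you compare c2 and c22, the data in table test1 and table test2 are inconsistent.
### UI Operation Guide
![dataquality_multi_table_accuracy_check](/img/tasks/demo/multi_table_accuracy_check.png)
- Source data type: select MySQL, PostgreSQL, etc.
- Source data source: the corresponding data source under the source data type
- Source data table: drop-down to select the table where the data to be verified is located
- Source filter conditions: as the name suggests; it is also applied when counting the total number of rows in the table; optional
- Target data type: select MySQL, PostgreSQL, etc.
- Target data source: the corresponding data source under the target data type
- Target data table: drop-down to select the table where the data to be verified is located
- Target filter conditions: as the name suggests; it is also applied when counting the total number of rows in the table; optional
- Check column:
  - Fill in the source data column, operator and target data column respectively
- Check method: select the desired check method
- Operators: =, >, >=, <, <=, !=
- Failure strategy
  - Alert: the data quality task failed, the DolphinScheduler task result is successful, and an alert is sent
  - Blocking: the data quality task failed, the DolphinScheduler task result is failed, and an alert is sent
- Expected value type: select the desired type from the drop-down menu; only SrcTableTotalRow, TargetTableTotalRow and the fixed value are suitable here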
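## Two-Table Check: Value Comparison
### Check Introduction
Two-table value comparison lets users define a different statistical SQL statement for each of two tables and compare the resulting values. For example, compute the total amount of a column in source table A as sum1, compute the total amount of a column in the target table as sum2, and then compare sum1 and sum2 to determine the check result.
### UI Operation Guide
![dataquality_multi_table_comparison_check](/img/tasks/demo/multi_table_comparison_check.png)
- Source data type: select MySQL, PostgreSQL, etc.
- Source data source: the corresponding data source under the source data type
- Source data table: the table where the data is to be verified
- Actual value name: the alias used in the actual value calculation SQL, such as max_age1
- Actual value calculation SQL: the SQL that outputs the actual value
  - Note: the SQL must be a statistical SQL statement, such as counting the number of rows or calculating the maximum or minimum value
  - Example: `select max(age) as max_age1 from ${src_table}`; the table name must be written as `${src_table}`
- Target data type: select MySQL, PostgreSQL, etc.
- Target data source: the corresponding data source under the target data type
- Target data table: the table where the data is to be verified
- Expected value name: the alias used in the expected value calculation SQL, such as max_age2
- Expected value calculation SQL: the SQL that outputs the expected value
  - Note: the SQL must be a statistical SQL statement, such as counting the number of rows or calculating the maximum or minimum value
  - Example: `select max(age) as max_age2 from ${target_table}`; the table name must be written as `${target_table}`
- Check method: select the desired check method
- Operators: =, >, >=, <, <=, !=
- Failure strategy
  - Alert: the data quality task failed, the DolphinScheduler task result is successful, and an alert is sent
  - Blocking: the data quality task failed, the DolphinScheduler task result is failed, and an alert is sent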
## Viewing Task Results
![dataquality_result](/img/tasks/demo/result.png)
## Viewing Rules
### Rule List
![dataquality_rule_list](/img/tasks/demo/rule_list.png)
### Rule Details
![dataquality_rule_detail](/img/tasks/demo/rule_detail.png)

BIN docs/img/tasks/demo/custom_sql_check.png (binary file not shown, 20 KiB)
BIN docs/img/tasks/demo/enumeration_check.png (binary file not shown, 20 KiB)
BIN docs/img/tasks/demo/field_length_check.png (binary file not shown, 18 KiB)
BIN docs/img/tasks/demo/multi_table_accuracy_check.png (binary file not shown, 21 KiB)
BIN docs/img/tasks/demo/multi_table_comparison_check.png (binary file not shown, 20 KiB)
BIN docs/img/tasks/demo/null_check.png (binary file not shown, 21 KiB)
BIN docs/img/tasks/demo/regexp_check.png (binary file not shown, 20 KiB)
BIN docs/img/tasks/demo/result.png (binary file not shown, 62 KiB)
BIN docs/img/tasks/demo/rule_detail.png (binary file not shown, 72 KiB)
BIN docs/img/tasks/demo/rule_list.png (binary file not shown, 54 KiB)
BIN docs/img/tasks/demo/table_count_check.png (binary file not shown, 20 KiB)
BIN docs/img/tasks/demo/timeliness_check.png (binary file not shown, 21 KiB)
BIN docs/img/tasks/demo/uniqueness_check.png (binary file not shown, 20 KiB)
