The data quality task is used to check the accuracy of data during data integration and processing. Data quality tasks in this release include single-table checks, single-table custom SQL checks, multi-table accuracy checks, and two-table value comparisons. The data quality task runs on Spark 2.4.0; other versions have not been verified, and users can verify them by themselves.
The execution flow of the data quality task is as follows:
- The user defines the task in the interface, and the user input is stored in `TaskParam`.
- When the task is run, `Master` parses `TaskParam`, encapsulates the parameters required by `DataQualityTask`, and sends them to `Worker`.
- `Worker` runs the data quality task. After the task finishes, it writes the statistical results to the specified storage engine. Currently, data quality task results are stored in the `t_ds_dq_execute_result` table of the `dolphinscheduler` database.
- `Worker` sends the task result to `Master`. After `Master` receives the `TaskResponse`, it checks whether the task type is `DataQualityTask`. If so, it reads the corresponding result from `t_ds_dq_execute_result` according to the `taskInstanceId`, and then evaluates the result according to the check mode, operator and threshold configured by the user. If the result is a failure, the corresponding action, an alert or blocking, is taken according to the failure strategy configured by the user.
- Please fill in `data-quality.jar.name` according to the actual package name.
- If you package `data-quality` separately, remember to modify the package name so that it is consistent with `data-quality.jar.name`.
- If you are upgrading from an old version, you need to execute the `sql` update script to initialize the database before running.
- If you want to use a `MySQL` data source, you need to comment out the `scope` of `MySQL` in `pom.xml`.
- Currently only `MySQL`, `PostgreSQL` and `HIVE` data sources have been tested; other data sources have not been tested yet.
- `Spark` needs to be configured to read `Hive` metadata, because `Spark` does not use `jdbc` to read `Hive`.
## Detailed Inspection Logic
| **Parameter** | **Description** |
| ----- | ---- |
| CheckMethod | [CheckFormula][Operator][Threshold]; if the result is true, it indicates that the data does not meet expectations and the failure strategy is executed. |
| CheckFormula | Expected-Actual |
| ExpectedValue | <ul><li>FixValue</li><li>DailyAvg</li><li>WeeklyAvg</li><li>MonthlyAvg</li><li>Last7DayAvg</li><li>Last30DayAvg</li><li>SrcTableTotalRows</li><li>TargetTableTotalRows</li></ul> |
| Example | <ul><li>CheckFormula: Expected-Actual</li><li>Operator: ></li><li>Threshold: 0</li><li>ExpectedValue: FixValue=9</li></ul> |

In the example, assuming that the actual value is 10, the operator is >, and the expected value is 9, the result 10 - 9 > 0 is true. This means that the number of rows with an empty value in the checked column has exceeded the threshold, and the task is judged to fail.
# Task Operation Guide

## Null Value Check for Single Table Check

### Inspection Introduction

The goal of the null value check is to check the number of empty rows in the specified column. The number of empty rows can be compared with the total number of rows or with a specified threshold; if it exceeds the threshold, the check is judged to fail.
- The SQL statement that counts the total number of rows in the table is as follows:
```sql
SELECT COUNT(*) AS total FROM ${src_table} WHERE (${src_filter})
```
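The number of empty rows themselves is counted by a statement of roughly the following form. This is a sketch; the exact SQL generated by the task may differ, for example in how an empty string is treated:

```sql
-- illustrative sketch: count rows where the checked column is NULL or an empty string
SELECT COUNT(*) AS miss
FROM ${src_table}
WHERE (${src_field} IS NULL OR ${src_field} = '')
  AND (${src_filter})
```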
### Interface Operation Guide

| **Parameter** | **Description** |
| ----- | ---- |
| Threshold | The value used in the formula for comparison. |
| Failure strategy | <ul><li>Alert: The data quality task fails, the DolphinScheduler task result is successful, and an alert is sent.</li><li>Blocking: The data quality task fails, the DolphinScheduler task result is failed, and an alert is sent.</li></ul> |
| Expected value type | Select the desired type from the drop-down menu. |
## Timeliness Check of Single Table Check

### Inspection Introduction

The timeliness check is used to check whether the data is processed within the expected time. A start time and an end time can be specified to define the time range. If the amount of data within the time range does not reach the configured threshold, the check task is judged to fail.
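Conceptually, the check counts the records that fall inside the configured time window with a statement of roughly the following form. This is a sketch with illustrative placeholders, not the exact SQL the task generates:

```sql
-- illustrative sketch: count rows whose timestamp column falls inside the checked window
SELECT COUNT(*) AS total
FROM ${src_table}
WHERE ${src_field} >= '${begin_time}'
  AND ${src_field} < '${end_time}'
  AND (${src_filter})
```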
### Interface Operation Guide

| **Parameter** | **Description** |
| ----- | ---- |
| Threshold | The value used in the formula for comparison. |
| Failure strategy | <ul><li>Alert: The data quality task fails, the DolphinScheduler task result is successful, and an alert is sent.</li><li>Blocking: The data quality task fails, the DolphinScheduler task result is failed, and an alert is sent.</li></ul> |
| Expected value type | Select the desired type from the drop-down menu. |
## Field Length Check for Single Table Check

### Inspection Introduction

The goal of the field length check is to verify whether the length of the selected field meets expectations. If there are rows that do not meet the requirement and the number of such rows exceeds the threshold, the task is judged to fail.
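Conceptually, the rows violating the length requirement are counted by a statement of roughly the following form. This is a sketch that assumes an upper bound on the length; the placeholders are illustrative and the exact SQL the task generates may differ:

```sql
-- illustrative sketch: count rows whose field length exceeds the configured limit
SELECT COUNT(*) AS miss
FROM ${src_table}
WHERE LENGTH(${src_field}) > ${length}
  AND (${src_filter})
```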
### Interface Operation Guide

| **Parameter** | **Description** |
| ----- | ---- |
| Threshold | The value used in the formula for comparison. |
| Failure strategy | <ul><li>Alert: The data quality task fails, the DolphinScheduler task result is successful, and an alert is sent.</li><li>Blocking: The data quality task fails, the DolphinScheduler task result is failed, and an alert is sent.</li></ul> |
| Expected value type | Select the desired type from the drop-down menu. |
## Uniqueness Check for Single Table Check

### Inspection Introduction

The goal of the uniqueness check is to verify whether the values of a field are duplicated. It is generally used to check whether the primary key is duplicated. If duplicates exist and their number reaches the threshold, the check task is judged to fail.
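Conceptually, the duplicated values can be found with a statement of roughly the following form. This is a sketch, not the exact SQL the task generates:

```sql
-- illustrative sketch: count the distinct values of the checked field that occur more than once
SELECT COUNT(*) AS duplicates
FROM (
  SELECT ${src_field}
  FROM ${src_table}
  WHERE (${src_filter})
  GROUP BY ${src_field}
  HAVING COUNT(*) > 1
) t
```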
### Interface Operation Guide

| **Parameter** | **Description** |
| ----- | ---- |
| Threshold | The value used in the formula for comparison. |
| Failure strategy | <ul><li>Alert: The data quality task fails, the DolphinScheduler task result is successful, and an alert is sent.</li><li>Blocking: The data quality task fails, the DolphinScheduler task result is failed, and an alert is sent.</li></ul> |
| Expected value type | Select the desired type from the drop-down menu. |
## Regular Expression Check for Single Table Check

### Inspection Introduction

The goal of the regular expression check is to verify whether the format of a field's values meets the requirement, for example a time format, email format, or ID card format. If the number of rows that do not match the format exceeds the threshold, the task is judged to fail.
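Conceptually, the non-matching rows are counted by a statement of roughly the following form. This is a sketch; the regular-expression operator differs between SQL engines (for example `RLIKE` in Spark SQL and `REGEXP` in MySQL), and the exact SQL the task generates may differ:

```sql
-- illustrative sketch: count rows whose value does not match the configured pattern
SELECT COUNT(*) AS miss
FROM ${src_table}
WHERE ${src_field} NOT REGEXP '${regexp_pattern}'
  AND (${src_filter})
```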
### Interface Operation Guide

| **Parameter** | **Description** |
| ----- | ---- |
| Threshold | The value used in the formula for comparison. |
| Failure strategy | <ul><li>Alert: The data quality task fails, the DolphinScheduler task result is successful, and an alert is sent.</li><li>Blocking: The data quality task fails, the DolphinScheduler task result is failed, and an alert is sent.</li></ul> |
| Expected value type | Select the desired type from the drop-down menu. |
## Enumeration Value Validation for Single Table Check

### Inspection Introduction

The goal of the enumeration value check is to verify whether the values of a field fall within the allowed enumeration range. If the number of rows whose values are outside the enumeration range exceeds the threshold, the task is judged to fail.
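Conceptually, the rows outside the enumeration range are counted by a statement of roughly the following form. This is a sketch in which `${enum_list}` stands for the comma-separated list of allowed values; the exact SQL the task generates may differ:

```sql
-- illustrative sketch: count rows whose value is not in the configured enumeration list
SELECT COUNT(*) AS miss
FROM ${src_table}
WHERE ${src_field} NOT IN (${enum_list})
  AND (${src_filter})
```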
### Interface Operation Guide

| **Parameter** | **Description** |
| ----- | ---- |
| Threshold | The value used in the formula for comparison. |
| Failure strategy | <ul><li>Alert: The data quality task fails, the DolphinScheduler task result is successful, and an alert is sent.</li><li>Blocking: The data quality task fails, the DolphinScheduler task result is failed, and an alert is sent.</li></ul> |
| Expected value type | Select the desired type from the drop-down menu. |
## Table Row Number Verification for Single Table Check

### Inspection Introduction

The goal of the table row number check is to verify whether the number of rows in the table reaches the expected value. If the number of rows does not meet the expectation, the task is judged to fail.
### Interface Operation Guide

| **Parameter** | **Description** |
| ----- | ---- |
| Threshold | The value used in the formula for comparison. |
| Failure strategy | <ul><li>Alert: The data quality task fails, the DolphinScheduler task result is successful, and an alert is sent.</li><li>Blocking: The data quality task fails, the DolphinScheduler task result is failed, and an alert is sent.</li></ul> |
| Expected value type | Select the desired type from the drop-down menu. |
## Custom SQL Check for Single Table Check

### Interface Operation Guide

| **Parameter** | **Description** |
| ----- | ---- |
| Source data type | Select MySQL, PostgreSQL, etc. |
| Source data source | The corresponding data source under the source data type. |
| Source data table | Drop-down to select the table where the data to be verified is located. |
| Actual value name | The alias used in the actual value calculation SQL, such as max_num. |
| Actual value calculation SQL | SQL for outputting the actual value. Note: <ul><li>The SQL must be statistical SQL, such as counting the number of rows or calculating the maximum or minimum value.</li><li>`select max(a) as max_num from ${src_table}`; the table name must be filled in like this.</li></ul> |
| Src filter conditions | The filter condition for the source table; it is also used when counting the total number of rows in the table. Optional. |
| Check method | Select the desired check method. |
| Check operators | =, >, >=, <, <=, != |
| Threshold | The value used in the formula for comparison. |
| Failure strategy | <ul><li>Alert: The data quality task fails, the DolphinScheduler task result is successful, and an alert is sent.</li><li>Blocking: The data quality task fails, the DolphinScheduler task result is failed, and an alert is sent.</li></ul> |
| Expected value type | Select the desired type from the drop-down menu. |
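To make the shape of such a statement concrete, a custom actual-value query of the kind described above might look like the following sketch. The column `a` and the alias `max_num` are just the illustrative names used in the table above:

```sql
-- illustrative actual-value SQL for a custom check;
-- the alias must match the configured "Actual value name"
SELECT MAX(a) AS max_num FROM ${src_table}
```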
## Accuracy Check of Multi-table

### Inspection Introduction

The accuracy check compares the data records of selected fields between two tables and measures the differences. An example is as follows:
- table test1

| c1 | c2 |
| :---: | :---: |
| a | 1 |
| b | 2 |

- table test2

| c21 | c22 |
| :---: | :---: |
| a | 1 |
| b | 3 |

If you compare the data in c1 and c21, tables test1 and test2 are exactly the same. If you compare c2 and c22, the data in table test1 and table test2 are inconsistent.
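As a sketch of what the accuracy comparison measures (not the exact SQL the task generates), counting the source rows that have no matching row in the target table for the chosen column mapping could look like this:

```sql
-- illustrative sketch: count rows of test1 that have no counterpart in test2
-- for the column mapping c1 -> c21, c2 -> c22 from the example above
SELECT COUNT(*) AS miss
FROM test1 s
LEFT JOIN test2 t
  ON s.c1 = t.c21 AND s.c2 = t.c22
WHERE t.c21 IS NULL
```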
### Interface Operation Guide

| **Parameter** | **Description** |
| ----- | ---- |
| Source data type | Select MySQL, PostgreSQL, etc. |
| Source data source | The corresponding data source under the source data type. |
| Source data table | Drop-down to select the table where the data to be verified is located. |
| Src filter conditions | The filter condition for the source table; it is also used when counting the total number of rows in the table. Optional. |
| Target data type | Choose MySQL, PostgreSQL, etc. |
| Target data source | The corresponding data source under the target data type. |
| Target data table | Drop-down to select the table where the data to be verified is located. |
| Target filter conditions | The filter condition for the target table; it is also used when counting the total number of rows in the table. Optional. |
| Check column | Fill in the source data column, operator and target data column respectively. |
| Verification method | Select the desired verification method. |
| Operators | =, >, >=, <, <=, != |
| Failure strategy | <ul><li>Alert: The data quality task fails, the DolphinScheduler task result is successful, and an alert is sent.</li><li>Blocking: The data quality task fails, the DolphinScheduler task result is failed, and an alert is sent.</li></ul> |
| Expected value type | Select the desired type from the drop-down menu; only `SrcTableTotalRows`, `TargetTableTotalRows` and `FixValue` are suitable for selection here. |
## Comparison of the Values Checked by the Two Tables

### Inspection Introduction

Two-table value comparison allows users to define different statistical SQL for two tables and compare the resulting values. For example, calculate the sum of a column in source table A as sum1 and the sum of a column in the target table as sum2, then compare sum1 and sum2 to determine the check result.
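As an illustration of the idea, the two statistical statements for the example above might look like the following sketch. The column name `amount` and the aliases `sum1`/`sum2` are illustrative, not values the task requires:

```sql
-- illustrative sketch: actual value computed on the source table,
-- expected value computed on the target table, then sum1 is compared with sum2
SELECT SUM(amount) AS sum1 FROM ${src_table};
SELECT SUM(amount) AS sum2 FROM ${target_table};
```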
### Interface Operation Guide

| **Parameter** | **Description** |
| ----- | ---- |
| Source data type | Select MySQL, PostgreSQL, etc. |
| Source data source | The corresponding data source under the source data type. |
| Source data table | The table where the data to be verified is located. |
| Actual value name | The alias used in the actual value calculation SQL, such as max_age1. |
| Actual value calculation SQL | SQL for outputting the actual value. Note: <ul><li>The SQL must be statistical SQL, such as counting the number of rows or calculating the maximum or minimum value.</li><li>`select max(age) as max_age1 from ${src_table}`; the table name must be filled in like this.</li></ul> |
| Target data type | Choose MySQL, PostgreSQL, etc. |
| Target data source | The corresponding data source under the target data type. |
| Target data table | The table where the data to be verified is located. |
| Expected value name | The alias used in the expected value calculation SQL, such as max_age2. |
| Expected value calculation SQL | SQL for outputting the expected value. Note: <ul><li>The SQL must be statistical SQL, such as counting the number of rows or calculating the maximum or minimum value.</li><li>`select max(age) as max_age2 from ${target_table}`; the table name must be filled in like this.</li></ul> |
| Verification method | Select the desired verification method. |
| Operators | =, >, >=, <, <=, != |
| Failure strategy | <ul><li>Alert: The data quality task fails, the DolphinScheduler task result is successful, and an alert is sent.</li><li>Blocking: The data quality task fails, the DolphinScheduler task result is failed, and an alert is sent.</li></ul> |
The Resource Center is typically used for uploading files, UDF functions, and task group management. For a stand-alone environment, you can use a local file directory as the upload directory (this operation does not require deploying Hadoop). You can also upload to a Hadoop or MinIO cluster; in this case, you need Hadoop (2.6+), MinIO or other related environments.
## Local File Resource Configuration
Configure the file in the following paths: `api-server/conf/common.properties` and `worker-server/conf/common.properties`.
- Change `data.basedir.path` to the local directory path. Please make sure the user who deploys dolphinscheduler has read and write permissions, for example: `data.basedir.path=/tmp/dolphinscheduler`. The directory you configure will be created automatically if it does not exist.
- Modify the following two parameters, `resource.storage.type=HDFS` and `resource.hdfs.fs.defaultFS=file:///`.
## HDFS Resource Configuration

When it is necessary to use the Resource Center to create or upload relevant files, all files and resources will be stored on HDFS, so the following configuration is required.

### Configuring the common.properties

After version 3.0.0-alpha, if you want to upload resources to HDFS or S3 from the Resource Center, the following paths need to be configured: `api-server/conf/common.properties` and `worker-server/conf/common.properties`. This can be found as follows.
```properties
# user data local directory path, please make sure the directory exists and have read write permissions
# resource store on HDFS/S3 path, resource file will store to this base path, self configuration, please make sure the directory exists on hdfs and have read write permissions. "/dolphinscheduler" is recommended
# if resource.storage.type=HDFS, the user must have the permission to create directories under the HDFS root path
resource.hdfs.root.user=hdfs
# if resource.storage.type=S3, the value like: s3a://dolphinscheduler; if resource.storage.type=HDFS and namenode HA is enabled, you need to copy core-site.xml and hdfs-site.xml to conf dir
# if resourcemanager HA is enabled or not use resourcemanager, please keep the default value; If resourcemanager is single, you only need to replace ds1 to actual resourcemanager hostname
# Whether hive SQL is executed in the same session
support.hive.oneSession=false
# use sudo or not, if set true, executing user is tenant user and deploy user needs sudo permissions; if set false, executing user is the deploy user and doesn't need sudo permissions
sudo.enable=true
# network interface preferred like eth0, default: empty
```
> * If only the `api-server/conf/common.properties` file is configured, resource uploading is enabled, but you cannot use resources in tasks. If you want to use or execute the files in a workflow, you also need to configure `worker-server/conf/common.properties`.
> * If you want to use the resource upload function, the deployment user in [installation and deployment](../installation/standalone.md) must have the relevant operation authority.
> * If you are using a Hadoop cluster with HA, you need to enable HDFS resource upload and copy the `core-site.xml` and `hdfs-site.xml` from the Hadoop cluster to `worker-server/conf` and `api-server/conf`; otherwise, skip this copy step.
The task group is mainly used to control the concurrency of task instances and is designed to control the pressure on other resources (it can also be used to control the pressure on a Hadoop cluster, although the cluster has its own queue control). When creating a new task definition, you can configure the corresponding task group and set the priority with which the task runs within the task group.
**Note**: Task groups apply to tasks executed by workers. Node types executed by the master, such as [switch] nodes, [condition] nodes and [sub_process] nodes, are not controlled by the task group. Let's take the shell node as an example:
- Priority: When resources are insufficient and tasks are waiting, the task with the higher priority will be distributed to a worker by the master first. The larger the value, the higher the priority.
## Implementation Logic of Task Group

### Get Task Group Resources

When distributing a task, the master checks whether the task is configured with a task group. If not, the task is sent to a worker and runs normally. If a task group is configured, the master checks whether the remaining size of the task group's resource pool is sufficient for the current task before sending it to the worker: if decrementing the resource pool by 1 still satisfies the limit, the task continues to run; otherwise, it exits task distribution and waits for other tasks to wake it up.

### Release and Wake Up

When a task that occupies a task group resource finishes, the resource is released. After the release, the system checks whether any task is waiting in the current task group. If so, it marks the task with the highest priority to run and creates a new executable event. The event stores the ID of the task that is marked to acquire the resource; that task then obtains the task group resource and runs.
The resource management and file management functions are similar. The difference is that resource management is for uploading UDF resources, while file management is for uploading user programs, scripts and configuration files. Operations include rename, download and delete.

- Upload UDF resources

> Same as uploading files.

## Function Management

- Create UDF function

> Click `Create UDF Function`, enter the UDF function parameters, select the UDF resource, and click `Submit` to create the UDF function.

> Currently, only temporary UDF functions of Hive are supported.

- UDF function name: Enter the name of the UDF function.
- Package and class name: Enter the full path (fully qualified class name) of the UDF function.
- UDF resource: Set the resource file corresponding to the created UDF function.
- Only the administrator account in the security center has the authority to operate. The security center provides functions such as queue management, tenant management, user management, alarm group management, worker group management and token management. In the user management module, you can authorize resources, data sources, projects, etc.
- Administrator login: the default username and password is `admin/dolphinscheduler123`.
## Create Queue
## Token Management
> Since the back-end interface has a login check, token management provides a way to perform various operations on the system by calling interfaces.
- The administrator enters the `Security Center -> Token Management` page, clicks the `Create Token` button, selects the expiration time and user, clicks the `Generate Token` button, and then clicks the `Submit` button; the selected user's token is created successfully.
```java
// call the API with the generated token set in the request header
HttpPost httpPost = new HttpPost("http://127.0.0.1:12345/escheduler/projects/create");
httpPost.setHeader("token", "123");
```
## Granted Permissions
- Granted permissions include project permissions, resource permissions, data source permissions and UDF function permissions.
- The administrator can authorize projects, resources, data sources and UDF functions that normal users did not create to those users. Because the way to authorize projects, resources, data sources and UDF functions is the same, we take project authorization as an example.
- Note: Users have all permissions on the projects they create themselves. Those projects are not displayed in the project list and the selected project list.
- The administrator enters the `Security Center -> User Management` page and clicks the `Authorize` button of the user who needs to be authorized, as shown in the figure below:
- Create a task node in the workflow definition and select the worker group and the environment corresponding to the worker group. When executing the task, the worker will apply the environment configuration first before executing the task.
- Each process can be related to zero or several clusters to support multiple environments; currently only k8s is supported.
> Usage cluster
- After creation and authorization, k8s namespaces and processes will associate clusters. Each cluster will have separate workflows and task instances running independently.
- After creation and authorization, you can select it from the namespace drop-down list when editing a k8s task. If the k8s cluster name is `ds_null_k8s`, it indicates test mode and the cluster will not actually be operated.