Browse Source

[doc] Change tasks doc (#10639)

3.0.0/version-upgrade
sneh-wha 2 years ago committed by caishunfeng
parent
commit
38bce4f2d3
  1. 48
      docs/docs/en/guide/task/conditions.md
  2. 66
      docs/docs/en/guide/task/datax.md
  3. 3
      docs/docs/en/guide/task/dependent.md
  4. 29
      docs/docs/en/guide/task/emr.md
  5. 78
      docs/docs/en/guide/task/flink.md
  6. 49
      docs/docs/en/guide/task/http.md
  7. 86
      docs/docs/en/guide/task/jupyter.md
  8. 48
      docs/docs/en/guide/task/kubernetes.md
  9. 84
      docs/docs/en/guide/task/map-reduce.md
  10. 151
      docs/docs/en/guide/task/mlflow.md
  11. 65
      docs/docs/en/guide/task/openmldb.md
  12. 32
      docs/docs/en/guide/task/pigeon.md
  13. 41
      docs/docs/en/guide/task/python.md
  14. 59
      docs/docs/en/guide/task/shell.md
  15. 77
      docs/docs/en/guide/task/spark.md
  16. 35
      docs/docs/en/guide/task/sql.md
  17. 21
      docs/docs/en/guide/task/stored-procedure.md
  18. 33
      docs/docs/en/guide/task/sub-process.md
  19. 47
      docs/docs/en/guide/task/switch.md
  20. 39
      docs/docs/en/guide/task/zeppelin.md

48
docs/docs/en/guide/task/conditions.md

@ -4,25 +4,25 @@ Condition is a conditional node, that determines which downstream task should ru
## Create Task
- Click `Project -> Management-Project -> Name-Workflow Definition`, and click the "Create Workflow" button to enter the DAG editing page.
- Drag from the toolbar <img src="/img/conditions.png" width="20"/> task node to canvas.
## Parameter
- Node name: The node name in a workflow definition is unique.
- Run flag: Identifies whether this node schedules normally, if it does not need to execute, select the `prohibition execution`.
- Descriptive information: Describe the function of the node.
- Task priority: When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order.
- Worker grouping: Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution.
- Times of failed retry attempts: The number of times the task failed to resubmit. You can select from drop-down or fill-in a number.
- Failed retry interval: The time interval for resubmitting the task after a failed task. You can select from drop-down or fill-in a number.
- Timeout alarm: Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm email will send and the task execution will fail.
- Downstream tasks selection: Depending on the status of the predecessor task, you can jump to the corresponding branch, currently two branches are supported: success, failure
- Success: When the upstream task runs successfully, run the success branch.
- Failure: When the upstream task runs failed, run the failure branch.
- Upstream condition selection: can select one or more upstream tasks for conditions.
- Add an upstream dependency: the first parameter is to choose a specified task name, and the second parameter is to choose the upstream task status to trigger conditions.
- Select upstream task relationship: use `and` and `or` operators to handle the complex relationship of upstream when there are multiple upstream tasks for conditions.
- Click `Project Management -> Project Name -> Workflow Definition`, and click the "`Create Workflow`" button to enter the DAG editing page.
- Drag from the toolbar <img src="../../../../img/conditions.png" width="20"/> task node to canvas.
## Task Parameters
| **Parameter** | **Description** |
| -------- |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| Downstream tasks selection | Depending on the status of the predecessor task, you can jump to the corresponding branch, currently two branches are supported: success, failure <ul><li>Success: When the upstream task runs successfully, run the success branch.</li><li>Failure: When the upstream task runs failed, run the failure branch.</li></ul></li></ul> |
| Upstream condition selection | Can select one or more upstream tasks for conditions.<ul><li>Add an upstream dependency: the first parameter is to choose a specified task name, and the second parameter is to choose the upstream task status to trigger conditions.</li><li>Select upstream task relationship: use `and` and `or` operators to handle the complex relationship of upstream when there are multiple upstream tasks for conditions.</li></ul></li></ul> |
## Related Task
@ -41,21 +41,21 @@ Go to the workflow definition page, and then create the following task nodes:
- Node_Success: Shell task, print out "success", Node_A executes the successful branch.
- Node_False: Shell task, print out "false", Node_A executes the failed branch.
![condition_task01](/img/tasks/demo/condition_task01.png)
![condition_task01](../../../../img/tasks/demo/condition_task01.png)
### 2. View the execution result
After you finish creating the workflow, you can run the workflow online. You can view the execution status of each task on the workflow instance page. As shown below:
![condition_task02](/img/tasks/demo/condition_task02.png)
![condition_task02](../../../../img/tasks/demo/condition_task02.png)
In the above figure, the task status marked with a green check mark is the successfully executed task node.
## Notice
## Note
- The Conditions task supports multiple upstream tasks, but only two downstream tasks.
- The Conditions task and the workflow that contain it do not support copy operations.
- The predecessor task of Conditions cannot connect to its branch nodes, which will cause logical confusion and does not conform to DAG scheduling. The situation shown below is **wrong**.
![condition_task03](/img/tasks/demo/condition_task03.png)
![condition_task04](/img/tasks/demo/condition_task04.png)
![condition_task03](../../../../img/tasks/demo/condition_task03.png)
![condition_task04](../../../../img/tasks/demo/condition_task04.png)

66
docs/docs/en/guide/task/datax.md

@ -6,33 +6,37 @@ DataX task type for executing DataX programs. For DataX nodes, the worker will e
## Create Task
- Click Project Management -> Project Name -> Workflow Definition, and click the "Create Workflow" button to enter the DAG editing page.
- Drag the <img src="/img/tasks/icons/datax.png" width="15"/> from the toolbar to the drawing board.
## Task Parameter
- **Node name**: The node name in a workflow definition is unique.
- **Run flag**: Identifies whether this node can be scheduled normally, if it does not need to be executed, you can turn on the prohibition switch.
- **Descriptive information**: describe the function of the node.
- **Task priority**: When the number of worker threads is insufficient, they are executed in order from high to low, and when the priority is the same, they are executed according to the first-in first-out principle.
- **Worker grouping**: Tasks are assigned to the machines of the worker group to execute. If Default is selected, a worker machine will be randomly selected for execution.
- **Environment Name**: Configure the environment name in which to run the script.
- **Number of failed retry attempts**: The number of times the task failed to be resubmitted.
- **Failed retry interval**: The time, in cents, interval for resubmitting the task after a failed task.
- **Delayed execution time**: The time, in cents, that a task is delayed in execution.
- **Timeout alarm**: Check the timeout alarm and timeout failure. When the task exceeds the "timeout period", an alarm email will be sent and the task execution will fail.
- **Custom template**: Custom the content of the DataX node's json profile when the default data source provided does not meet the required requirements.
- **json**: json configuration file for DataX synchronization.
- **Custom parameters**: SQL task type, and stored procedure is a custom parameter order to set values for the method. The custom parameter type and data type are the same as the stored procedure task type. The difference is that the SQL task type custom parameter will replace the \${variable} in the SQL statement.
- **Data source**: Select the data source from which the data will be extracted.
- **sql statement**: the sql statement used to extract data from the target database, the sql query column name is automatically parsed when the node is executed, and mapped to the target table synchronization column name. When the source table and target table column names are inconsistent, they can be converted by column alias.
- **Target library**: Select the target library for data synchronization.
- **Pre-sql**: Pre-sql is executed before the sql statement (executed by the target library).
- **Post-sql**: Post-sql is executed after the sql statement (executed by the target library).
- **Stream limit (number of bytes)**: Limits the number of bytes in the query.
- **Limit flow (number of records)**: Limit the number of records for a query.
- **Running memory**: the minimum and maximum memory required can be configured to suit the actual production environment.
- **Predecessor task**: Selecting a predecessor task for the current task will set the selected predecessor task as upstream of the current task.
- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag the <img src="../../../../img/tasks/icons/datax.png" width="15"/> from the toolbar to the drawing board.
## Task Parameters
| **Parameter** | **Description** |
| ------- | ---------- |
| Node name | The node name in a workflow definition is unique. |
| Run flag | Identifies whether this node schedules normally, if it does not need to execute, select the prohibition execution. |
| Task priority | When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order. |
| Description | Describe the function of the node. |
| Worker group | Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution. |
| Environment Name | Configure the environment name in which run the script. |
| Number of failed retries | The number of times the task failed to resubmit. |
| Failed retry interval | The time interval (unit minute) for resubmitting the task after a failed task. |
| Cpu quota | Assign the specified CPU time quota to the task executed. Takes a percentage value. Default -1 means unlimited. For example, the full CPU load of one core is 100%,and that of 16 cores is 1600%. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md) |
| Max memory | Assign the specified max memory to the task executed. Exceeding this limit will trigger oom to be killed and will not automatically retry. Takes an MB value. Default -1 means unlimited. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md) |
| Delayed execution time | The time, in cents, that a task is delayed in execution. |
| Timeout alarm | Check the timeout alarm and timeout failure. When the task exceeds the "timeout period", an alarm email will be sent and the task execution will fail. |
| Custom template | Custom the content of the DataX node's json profile when the default data source provided does not meet the required requirements. |
| json | json configuration file for DataX synchronization. |
| Custom parameters | SQL task type, and stored procedure is a custom parameter order to set values for the method. The custom parameter type and data type are the same as the stored procedure task type. The difference is that the SQL task type custom parameter will replace the \${variable} in the SQL statement. |
| Data source | Select the data source from which the data will be extracted. |
| sql statement | the sql statement used to extract data from the target database, the sql query column name is automatically parsed when the node is executed, and mapped to the target table synchronization column name. When the source table and target table column names are inconsistent, they can be converted by column alias. |
| Target library | Select the target library for data synchronization. |
| Pre-sql | Pre-sql is executed before the sql statement (executed by the target library). |
| Post-sql | Post-sql is executed after the sql statement (executed by the target library). |
| Stream limit (number of bytes) | Limits the number of bytes in the query. |
| Limit flow (number of records) | Limit the number of records for a query. |
| Running memory | the minimum and maximum memory required can be configured to suit the actual production environment. |
| Predecessor task | Selecting a predecessor task for the current task will set the selected predecessor task as upstream of the current task. |
## Task Example
@ -42,7 +46,7 @@ This example demonstrates importing data from Hive into MySQL.
If you are using the DataX task type in a production environment, it is necessary to configure the required environment first. The configuration file is as follows: `/dolphinscheduler/conf/env/dolphinscheduler_env.sh`.
![datax_task01](/img/tasks/demo/datax_task01.png)
![datax_task01](../../../../img/tasks/demo/datax_task01.png)
After the environment has been configured, DolphinScheduler needs to be restarted.
@ -52,12 +56,12 @@ As the default data source does not contain data to be read from Hive, a custom
After writing the required json file, you can configure the node content by following the steps in the diagram below.
![datax_task02](/img/tasks/demo/datax_task02.png)
![datax_task02](../../../../img/tasks/demo/datax_task02.png)
### View run results
![datax_task03](/img/tasks/demo/datax_task03.png)
![datax_task03](../../../../img/tasks/demo/datax_task03.png)
### Notice
### Note
If the default data source provided does not meet your needs, you can configure the writer and reader of DataX according to the actual usage environment in the custom template option, available at https://github.com/alibaba/DataX.

3
docs/docs/en/guide/task/dependent.md

@ -22,7 +22,8 @@ Dependent nodes are **dependency check nodes**. For example, process A depends o
- **Failed retry interval**: The time interval (unit minute) for resubmitting the task after a failed task.
- **Delayed execution time**: The time (unit minute) that a task delays in execution.
## Examples
## Task Examples
The Dependent node provides a logical judgment function, which can detect the execution of the dependent node according to the logic.

29
docs/docs/en/guide/task/emr.md

@ -4,17 +4,24 @@
Amazon EMR task type, for creating EMR clusters on AWS and running computing tasks. Using [aws-java-sdk](https://aws.amazon.com/cn/sdk-for-java/) in the background code, to transfer JSON parameters to [RunJobFlowRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticmapreduce/model/RunJobFlowRequest.html) object and submit to AWS.
## Parameter
- Node name: The node name in a workflow definition is unique.
- Run flag: Identifies whether this node schedules normally, if it does not need to execute, select the `prohibition execution`.
- Descriptive information: Describe the function of the node.
- Task priority: When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order.
- Worker grouping: Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution.
- Times of failed retry attempts: The number of times the task failed to resubmit. You can select from drop-down or fill-in a number.
- Failed retry interval: The time interval for resubmitting the task after a failed task. You can select from drop-down or fill-in a number.
- Timeout alarm: Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm email will send and the task execution will fail.
- JSON: JSON corresponding to the [RunJobFlowRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticmapreduce/model/RunJobFlowRequest.html) object, for details refer to [API_RunJobFlow_Examples](https://docs.aws.amazon.com/emr/latest/APIReference/API_RunJobFlow.html#API_RunJobFlow_Examples).
## Create Task
* Click `Project Management -> Project Name -> Workflow Definition`, click the "`Create Workflow`" button to enter the DAG editing page.
* Drag `AmazonEMR` task from the toolbar to the artboard to complete the creation.
## Task Parameters
| **Parameter** | **Description** |
| ------- | ---------- |
| Node name | The node name in a workflow definition is unique. |
| Run flag | Identifies whether this node schedules normally, if it does not need to execute, select the `prohibition execution`.|
| Description | Describe the function of the node. |
| Task priority | When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order. |
| Worker grouping | Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution. |
| Times of failed retry attempts | The number of times the task failed to resubmit. You can select from drop-down or fill-in a number. |
| Failed retry interval: The time interval for resubmitting the task after a failed task. You can select from drop-down or fill-in a number. |
| Timeout alarm | Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm email will send and the task execution will fail. |
| JSON | JSON corresponding to the [RunJobFlowRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticmapreduce/model/RunJobFlowRequest.html) object, for details refer to [API_RunJobFlow_Examples](https://docs.aws.amazon.com/emr/latest/APIReference/API_RunJobFlow.html#API_RunJobFlow_Examples). |
## JSON example

78
docs/docs/en/guide/task/flink.md

@ -10,39 +10,41 @@ Flink task type, used to execute Flink programs. For Flink nodes:
## Create Task
- Click `Project -> Management-Project -> Name-Workflow Definition`, and click the "Create Workflow" button to enter the DAG editing page.
- Drag from the toolbar <img src="/img/tasks/icons/flink.png" width="15"/>task node to canvas.
## Task Parameter
- **Node name**: The node name in a workflow definition is unique.
- **Run flag**: Identifies whether this node schedules normally, if it does not need to execute, select the `prohibition execution`.
- **Descriptive information**: Describe the function of the node.
- **Task priority**: When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order.
- **Worker grouping**: Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution.
- **Environment Name**: Configure the environment name in which run the script.
- **Times of failed retry attempts**: The number of times the task failed to resubmit.
- **Failed retry interval**: The time interval (unit minute) for resubmitting the task after a failed task.
- **Delayed execution time**: The time (unit minute) that a task delays in execution.
- **Timeout alarm**: Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm email will send and the task execution will fail.
- **Program type**: Support Java, Scala, Python and SQL four languages.
- **The class of main function**: The **full path** of Main Class, the entry point of the Flink program.
- **Main jar package**: The jar package of the Flink program (upload by Resource Center).
- **Deployment mode**: Support 2 deployment modes: cluster and local.
- **Initialization script**: Script file to initialize session context.
- **Script**: The sql script file developed by the user that should be executed.
- **Flink version**: Select version according to the execution env.
- **Task name** (optional): Flink task name.
- **JobManager memory size**: Used to set the size of jobManager memories, which can be set according to the actual production environment.
- **Number of slots**: Used to set the number of slots, which can be set according to the actual production environment.
- **TaskManager memory size**: Used to set the size of taskManager memories, which can be set according to the actual production environment.
- **Number of TaskManager**: Used to set the number of taskManagers, which can be set according to the actual production environment.
- **Parallelism**: Used to set the degree of parallelism for executing Flink tasks.
- **Main program parameters**: Set the input parameters for the Flink program and support the substitution of custom parameter variables.
- **Optional parameters**: Support `--jar`, `--files`,` --archives`, `--conf` format.
- **Resource**: Appoint resource files in the `Resource` if parameters refer to them.
- **Custom parameter**: It is a local user-defined parameter for Flink, and will replace the content with `${variable}` in the script.
- **Predecessor task**: Selecting a predecessor task for the current task, will set the selected predecessor task as upstream of the current task.
- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag from the toolbar <img src="../../../../img/tasks/icons/flink.png" width="15"/>task node to canvas.
## Task Parameters
| **Parameter** | **Description** |
| ------- | ---------- |
| Node name | The node name in a workflow definition is unique. |
| Run flag | Identifies whether this node schedules normally, if it does not need to execute, select the `prohibition execution`. |
| Description | Describe the function of the node. |
| Task priority | When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order. |
| Worker grouping | Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution. |
| Environment Name | Configure the environment name in which run the script. |
| Times of failed retry attempts | The number of times the task failed to resubmit. |
| Failed retry interval | The time interval (unit minute) for resubmitting the task after a failed task. |
| Delayed execution time | The time (unit minute) that a task delays in execution. |
| Timeout alarm | Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm email will send and the task execution will fail. |
| Program type | Support Java, Scala, Python and SQL four languages. |
| Class of main function**: The **full path** of Main Class, the entry point of the Flink program. |
| Main jar package | The jar package of the Flink program (upload by Resource Center). |
| Deployment mode | Support 2 deployment modes: cluster and local. |
| Initialization script | Script file to initialize session context. |
| Script | The sql script file developed by the user that should be executed. |
| Flink version | Select version according to the execution environment. |
| Task name | Flink task name. |
| JobManager memory size | Used to set the size of jobManager memories, which can be set according to the actual production environment. |
| Number of slots | Used to set the number of slots, which can be set according to the actual production environment. |
| TaskManager memory size | Used to set the size of taskManager memories, which can be set according to the actual production environment. |
| Number of TaskManager | Used to set the number of taskManagers, which can be set according to the actual production environment. |
| Parallelism | Used to set the degree of parallelism for executing Flink tasks. |
| Main program parameters | Set the input parameters for the Flink program and support the substitution of custom parameter variables. |
| Optional parameters | Support `--jar`, `--files`,` --archives`, `--conf` format. |
| Resource | Appoint resource files in the `Resource` if parameters refer to them. |
| Custom parameter | It is a local user-defined parameter for Flink, and will replace the content with `${variable}` in the script. |
| Predecessor task | Selecting a predecessor task for the current task, will set the selected predecessor task as upstream of the current task. |
## Task Example
@ -54,7 +56,7 @@ This is a common introductory case in the big data ecosystem, which often apply
If you are using the flink task type in a production environment, it is necessary to configure the required environment first. The following is the configuration file: `bin/env/dolphinscheduler_env.sh`.
![demo-flink-simple](/img/tasks/demo/flink_task01.png)
![demo-flink-simple](../../../../img/tasks/demo/flink_task01.png)
#### Upload the Main Package
@ -62,21 +64,21 @@ When using the Flink task node, you need to upload the jar package to the Resour
After finish the Resource Centre configuration, upload the required target files directly by dragging and dropping.
![resource_upload](/img/tasks/demo/upload_jar.png)
![resource_upload](../../../../img/tasks/demo/upload_jar.png)
#### Configure Flink Nodes
Configure the required content according to the parameter descriptions above.
![demo-flink-simple](/img/tasks/demo/flink_task02.png)
![demo-flink-simple](../../../../img/tasks/demo/flink_task02.png)
### Execute the FlinkSQL Program
Configure the required content according to the parameter descriptions above.
![demo-flink-sql-simple](/img/tasks/demo/flink_sql_test.png)
![demo-flink-sql-simple](../../../../img/tasks/demo/flink_sql_test.png)
## Notice
## Note
- JAVA and Scala only used for identification, there is no difference. If use Python to develop Flink, there is no class of the main function and the rest is the same.

49
docs/docs/en/guide/task/http.md

@ -6,28 +6,28 @@ This node is used to perform http type tasks such as the common POST and GET req
## Create Task
- Click `Project -> Management-Project -> Name-Workflow Definition`, and click the "Create Workflow" button to enter the DAG editing page.
- Drag the <img src="/img/tasks/icons/http.png" width="15"/> from the toolbar to the drawing board.
## Task Parameter
- **Node name**: The node name in a workflow definition is unique.
- **Run flag**: Identifies whether this node can be scheduled normally, if it does not need to be executed, you can turn on the prohibition switch.
- **Descriptive information**: describe the function of the node.
- **Task priority**: When the number of worker threads is insufficient, they are executed in order from high to low, and when the priority is the same, they are executed according to the first-in first-out principle.
- **Worker grouping**: Tasks are assigned to the machines of the worker group to execute. If Default is selected, a worker machine will be randomly selected for execution.
- **Environment Name**: Configure the environment name in which to run the script.
- **Number of failed retry attempts**: The number of times the task failed to be resubmitted.
- **Failed retry interval**: The time, in cents, interval for resubmitting the task after a failed task.
- **Delayed execution time**: the time, in cents, that a task is delayed in execution.
- **Timeout alarm**: Check the timeout alarm and timeout failure. When the task exceeds the "timeout period", an alarm email will be sent and the task execution will fail.
- **Request address**: HTTP request URL.
- **Request type**: Support GET, POSt, HEAD, PUT, DELETE.
- **Request parameters**: Support Parameter, Body, Headers.
- **Verification conditions**: support default response code, custom response code, content included, content not included.
- **Verification content**: When the verification condition selects a custom response code, the content contains, and the content does not contain, the verification content is required.
- **Custom parameter**: It is a user-defined parameter of http part, which will replace the content with `${variable}` in the script.
- **Predecessor task**: Selecting a predecessor task for the current task will set the selected predecessor task as upstream of the current task.
- Click `Project Management -> Project Name -> Workflow Definition`, and click the "`Create Workflow`" button to enter the DAG editing page.
- Drag the <img src="../../../../img/tasks/icons/http.png" width="15"/> from the toolbar to the drawing board.
## Task Parameters
| **Parameter** | **Description** |
| ------- | ---------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| Request address | HTTP request URL. |
| Request type | Supports GET, POSt, HEAD, PUT, DELETE. || Request parameters |Supports Parameter, Body, Headers. || Verification conditions | Supports default response code, custom response code, content included, content not included.|
| Verification content | When the verification condition selects a custom response code, the content contains, and the content does not contain, the verification content is required. |
| Custom parameter | It is a user-defined parameter of http part, which will replace the content with `${variable}` in the script. |
| Pre tasks | Selecting a predecessor task for the current task will set the selected predecessor task as upstream of the current task. |
## Example
@ -40,8 +40,5 @@ The main configuration parameters are as follows:
- userName: Username
- userPassword: User login password
![http_task](/img/tasks/demo/http_task01.png)
![http_task](../../../../img/tasks/demo/http_task01.png)
## Notice
None

86
docs/docs/en/guide/task/jupyter.md

@ -0,0 +1,86 @@
# Jupyter
## Overview
Use `Jupyter Task` to create a jupyter-type task and execute jupyter notes. When the worker executes `Jupyter Task`,
it will use `papermill` to evaluate jupyter notes. Click [here](https://papermill.readthedocs.io/en/latest/) for details about `papermill`.
## Conda Configuration
- Config `conda.path` in `common.properties` to the path of your `conda.sh`, which should be the same `conda` you use to manage the python environment of your `papermill` and `jupyter`.
Click [here](https://docs.conda.io/en/latest/) for more information about `conda`.
- `conda.path` is set to `/opt/anaconda3/etc/profile.d/conda.sh` by default. If you have no idea where your `conda` is, simply run `conda info | grep -i 'base environment'`.
> NOTE: `Jupyter Task Plugin` uses `source` command to activate conda environment.
> If your tenant does not have permission to use `source`, `Jupyter Task Plugin` will not function.
## Python Dependency Management
### Use Pre-Installed Conda Environment
1. Create a conda environment manually or using `shell task` on your target worker.
2. In your `jupyter task`, set `condaEnvName` as the name of the conda environment you just created.
### Use Packed Conda Environment
1. Use [Conda-Pack](https://conda.github.io/conda-pack/) to pack your conda environment into `tarball`.
2. Upload packed conda environment to `resource center`.
3. Select your packed conda environment as `resource` in your `jupyter task`, e.g. `jupyter_env.tar.gz`.
> NOTE: Make sure you follow the [Conda-Pack](https://conda.github.io/conda-pack/) official instructions.
> If you unpack your packed conda environment, the directory structure should be the same as below:
```
.
├── bin
├── conda-meta
├── etc
├── include
├── lib
├── share
└── ssl
```
> NOTE: Please follow the `conda pack` instructions above strictly, and DO NOT modify `bin/activate`.
> `Jupyter Task Plugin` uses `source` command to activate your packed conda environment.
> If you are concerned about using `source`, choose other options to manage your python dependency.
## Create Task
- Click `Project Management-Project Name-Workflow Definition`, and click the "`Create Workflow`" button to enter the DAG editing page.
- Drag <img src="../../../../img/tasks/icons/jupyter.png" width="15"/> from the toolbar to the canvas.
## Task Parameters
| **Parameter** | **Description** |
| ------- | ---------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Cpu quota | Assign the specified CPU time quota to the task executed. Takes a percentage value. Default -1 means unlimited. For example, the full CPU load of one core is 100%,and that of 16 cores is 1600%. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md). |
| Max memory | Assign the specified max memory to the task executed. Exceeding this limit will trigger oom to be killed and will not automatically retry. Takes an MB value. Default -1 means unlimited. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md). |
| Timeout alarm | Check the timeout alarm and timeout failure. When the task exceeds the "timeout period", an alarm email will send and the task execution will fail. |
| Conda Env Name | Name of conda environment or packed conda environment tarball. |
|Input Note Path | Path of input jupyter note template. |
| Out Note Path | Path of output note. |
| Jupyter Parameters | Parameters in json format used for jupyter note parameterization. |
| Kernel | Jupyter notebook kernel. |
| Engine | Engine to evaluate jupyter notes. |
| Jupyter Execution Timeout | Timeout set for each jupyter notebook cell. |
| Jupyter Start Timeout | Timeout set for jupyter notebook kernel. |
| Others | Other command options for papermill. |
## Task Example
### Jupyter Task Example
This example illustrates how to create a jupyter task node.
![demo-jupyter-simple](../../../../img/tasks/demo/jupyter.png)

48
docs/docs/en/guide/task/kubernetes.md

@ -0,0 +1,48 @@
# K8S Node
## Overview
K8S task type used to execute a batch task. In this task, the worker submits the task by using a k8s client.
## Create Task
- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag from the toolbar <img src="../../../../img/tasks/icons/kubernetes.png" width="15"/> to the canvas.
## Task Parameters
| **Parameter** | **Description** |
| ------- | ---------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| Namespace | The namespace for running k8s task. |
| Min CPU | Minimum CPU requirement for running k8s task. |
| Min Memory | Minimum memory requirement for running k8s task. |
| Image | The registry url for image. |
| Custom parameter | It is a local user-defined parameter for K8S task, these params will pass to container as environment variables. |
| Predecessor task | Selecting a predecessor task for the current task, will set the selected predecessor task as upstream of the current task. |
## Task Example
### Configure the K8S Environment in DolphinScheduler
If you are using the K8S task type in a production environment, the K8S cluster environment is required.
### Configure K8S Nodes
Configure the required content according to the parameter descriptions above.
![K8S](../../../../img/tasks/demo/kubernetes-task-en.png)
## Note
Task name contains only lowercase alphanumeric characters or '-'

84
docs/docs/en/guide/task/map-reduce.md

@ -6,45 +6,51 @@ MapReduce(MR) task type used for executing MapReduce programs. For MapReduce nod
## Create Task
- Click `Project -> Management-Project -> Name-Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag from the toolbar <img src="/img/tasks/icons/mr.png" width="15"/> to the canvas.
## Task Parameter
- **Node name**: The node name in a workflow definition is unique.
- **Run flag**: Identifies whether this node schedules normally, if it does not need to execute, select the `prohibition execution`.
- **Descriptive information**: Describe the function of the node.
- **Task priority**: When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order.
- **Worker grouping**: Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution.
- **Environment Name**: Configure the environment name in which run the script.
- **Times of failed retry attempts**: The number of times the task failed to resubmit.
- **Failed retry interval**: The time interval (unit minute) for resubmitting the task after a failed task.
- **Delayed execution time**: The time (unit minute) that a task delays in execution.
- **Timeout alarm**: Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm email will send and the task execution will fail.
- **Resource**: Refers to the list of resource files that called in the script, and upload or create files by the Resource Center file management.
- **Custom parameters**: It is a local user-defined parameter for MapReduce, and will replace the content with `${variable}` in the script.
- **Predecessor task**: Selecting a predecessor task for the current task, will set the selected predecessor task as upstream of the current task.
- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag from the toolbar <img src="../../../../img/tasks/icons/mr.png" width="15"/> to the canvas.
## Task Parameters
### General
| **Parameter** | **Description** |
| ------- | ---------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| Resource | Refers to the list of resource files that called in the script, and upload or create files by the Resource Center file management. |
| Custom parameters | It is a local user-defined parameter for MapReduce, and will replace the content with `${variable}` in the script. |
| Predecessor task | Selecting a predecessor task for the current task, will set the selected predecessor task as upstream of the current task. |
### JAVA or SCALA Program
- **Program type**: Select JAVA or SCALA program.
- **The class of the main function**: The **full path** of Main Class, the entry point of the MapReduce program.
- **Main jar package**: The jar package of the MapReduce program.
- **Task name** (optional): MapReduce task name.
- **Command line parameters**: Set the input parameters of the MapReduce program and support the substitution of custom parameter variables.
- **Other parameters**: support `-D`, `-files`, `-libjars`, `-archives` format.
- **Resource**: Appoint resource files in the `Resource` if parameters refer to them.
- **User-defined parameter**: It is a local user-defined parameter for MapReduce, and will replace the content with `${variable}` in the script.
## Python Program
- **Program type**: Select Python language.
- **Main jar package**: The Python jar package for running MapReduce.
- **Other parameters**: support `-D`, `-mapper`, `-reducer,` `-input` `-output` format, and you can set the input of user-defined parameters, such as:
- `-mapper "mapper.py 1"` `-file mapper.py` `-reducer reducer.py` `-file reducer.py` `–input /journey/words.txt` `-output /journey/out/mr/\${currentTimeMillis}`
- The `mapper.py 1` after `-mapper` is two parameters, the first parameter is `mapper.py`, and the second parameter is `1`.
- **Resource**: Appoint resource files in the `Resource` if parameters refer to them.
- **User-defined parameter**: It is a local user-defined parameter for MapReduce, and will replace the content with `${variable}` in the script.
| **Parameter** | **Description** |
| ------- | ---------- |
| Program type | Select JAVA or SCALA program. |
| The class of the main function | The **full path** of Main Class, the entry point of the MapReduce program. |
| Main jar package | The jar package of the MapReduce program. |
| Task name | MapReduce task name. |
| Command line parameters | Set the input parameters of the MapReduce program and support the substitution of custom parameter variables. |
| Other parameters | Support `-D`, `-files`, `-libjars`, `-archives` format. |
| Resource | Appoint resource files in the `Resource` if parameters refer to them. |
| User-defined parameter | It is a local user-defined parameter for MapReduce, and will replace the content with `${variable}` in the script. |
### Python Program
| **Parameter** | **Description** |
| ------- | ---------- |
| Program type | Select Python language. |
| Main jar package | The Python jar package for running MapReduce. |
| Other parameters | Support `-D`, `-mapper`, `-reducer,` `-input` `-output` format, and you can set the input of user-defined parameters, such as:<ul><li>`-mapper "mapper.py 1"` `-file mapper.py` `-reducer reducer.py` `-file reducer.py` `–input /journey/words.txt` `-output /journey/out/mr/${currentTimeMillis}`</li><li>The `mapper.py 1` after `-mapper` is two parameters, the first parameter is `mapper.py`, and the second parameter is `1`. </li></ul> |
| Resource | Appoint resource files in the `Resource` if parameters refer to them. |
| User-defined parameter | It is a local user-defined parameter for MapReduce, and will replace the content with `${variable}` in the script. |
## Task Example
@ -56,7 +62,7 @@ This example is a common introductory type of MapReduce application, which used
If you are using the MapReduce task type in a production environment, it is necessary to configure the required environment first. The following is the configuration file: `bin/env/dolphinscheduler_env.sh`.
![mr_configure](/img/tasks/demo/mr_task01.png)
![mr_configure](../../../../img/tasks/demo/mr_task01.png)
#### Upload the Main Package
@ -64,10 +70,10 @@ When using the MapReduce task node, you need to use the Resource Centre to uploa
After finish the Resource Centre configuration, upload the required target files directly by dragging and dropping.
![resource_upload](/img/tasks/demo/upload_jar.png)
![resource_upload](../../../../img/tasks/demo/upload_jar.png)
#### Configure MapReduce Nodes
Configure the required content according to the parameter descriptions above.
![demo-mr-simple](/img/tasks/demo/mr_task02.png)
![demo-mr-simple](../../../../img/tasks/demo/mr_task02.png)

151
docs/docs/en/guide/task/mlflow.md

@ -0,0 +1,151 @@
# MLflow Node
## Overview
[MLflow](https://mlflow.org) is an excellent open source platform to manage the ML lifecycle, including experimentation,
reproducibility, deployment, and a central model registry.
MLflow task plugin used to execute MLflow tasks,Currently contains MLflow Projects and MLflow Models. (Model Registry will soon be rewarded for support)
- MLflow Projects: Package data science code in a format to reproduce runs on any platform.
- MLflow Models: Deploy machine learning models in diverse serving environments.
- Model Registry: Store, annotate, discover, and manage models in a central repository.
The MLflow plugin currently supports and will support the following:
- [x] MLflow Projects
- [x] BasicAlgorithm: contains LogisticRegression, svm, lightgbm, xgboost
- [x] AutoML: AutoML tool,contains autosklean, flaml
- [x] Custom projects: Support for running your own MLflow projects
- [ ] MLflow Models
- [x] MLFLOW: Use `MLflow models serve` to deploy a model service
- [x] Docker: Run the container after packaging the docker image
- [x] Docker Compose: Use docker compose to run the container, it will replace the docker run above
- [ ] Seldon core: Use Selcon core to deploy model to k8s cluster
- [ ] k8s: Deploy containers directly to K8S
- [ ] MLflow deployments: Built-in deployment modules, such as built-in deployment to SageMaker, etc
- [ ] Model Registry
- [ ] Register Model: Allows artifacts (Including model and related parameters, indicators) to be registered directly into the model center
## Create Task
- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag from the toolbar <img src="../../../../img/tasks/icons/mlflow.png" width="15"/> task node to canvas.
## Task Parameters and Example
| **Parameter** | **Description** |
| ------- | ---------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| Predecessor task | Selecting the predecessor task of the current task will set the selected predecessor task as the upstream of the current task. |
| MLflow Tracking Server URI | MLflow Tracking Server URI, default http://localhost:5000. |
| Experiment Name | Create the experiment where the task is running, if the experiment does not exist. If the name is empty, it is set to ` Default `, the same as MLflow. |
### MLflow Projects
#### BasicAlgorithm
![mlflow-conda-env](../../../../img/tasks/demo/mlflow-basic-algorithm.png)
**Task Parameters**
| **Parameter** | **Description** |
| ------- | ---------- |
| Register Model | Register the model or not. If register is selected, the following parameters are expanded. |
| Model Name | The registered model name is added to the original model version and registered as Production. |
| Data Path | The absolute path of the file or folder. Ends with .csv for file or contain train.csv and test.csv for folder(In the suggested way, users should build their own test sets for model evaluation. |
| Parameters | Parameter when initializing the algorithm/AutoML model, which can be empty. For example, parameters `"time_budget=30;estimator_list=['lgbm']"` for flaml 。The convention will be passed with '; ' shards each parameter, using the name before the equal sign as the parameter name, and using the name after the equal sign to get the corresponding parameter value through `python eval()`. <ul><li>[Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)</li><li>[SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html?highlight=svc#sklearn.svm.SVC)</li><li>[lightgbm](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier)</li><li>[xgboost](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier)</li></ul> |
| Algorithm |The selected algorithm currently supports `LR`, `SVM`, `LightGBM` and `XGboost` based on [scikit-learn](https://scikit-learn.org/) form. |
| Parameter Search Space | Parameter search space when running the corresponding algorithm, which can be empty. For example, the parameter `max_depth=[5, 10];n_estimators=[100, 200]` for lightgbm 。The convention will be passed with '; 'shards each parameter, using the name before the equal sign as the parameter name, and using the name after the equal sign to get the corresponding parameter value through `python eval()`. |
#### AutoML
![mlflow-automl](../../../../img/tasks/demo/mlflow-automl.png)
**Task Parameter**
| **Parameter** | **Description** |
| ------- | ---------- |
| Register Model | Register the model or not. If register is selected, the following parameters are expanded. |
| model name | The registered model name is added to the original model version and registered as Production. |
| Data Path | The absolute path of the file or folder. Ends with .csv for file or contain train.csv and test.csv for folder(In the suggested way, users should build their own test sets for model evaluation). |
| Parameters | Parameter when initializing the algorithm/AutoML model, which can be empty. For example, parameters `n_estimators=200;learning_rate=0.2` for flaml. The convention will be passed with '; 'shards each parameter, using the name before the equal sign as the parameter name, and using the name after the equal sign to get the corresponding parameter value through `python eval()`. The detailed parameter list is as follows: <ul><li>[flaml](https://microsoft.github.io/FLAML/docs/reference/automl#automl-objects)</li><li>[autosklearn](https://automl.github.io/auto-sklearn/master/api.html)</li></ul> |
| AutoML tool | The AutoML tool used, currently supports [autosklearn](https://github.com/automl/auto-sklearn) and [flaml](https://github.com/microsoft/FLAML). |
#### Custom projects
![mlflow-custom-project.png](../../../../img/tasks/demo/mlflow-custom-project.png)
**Task Parameter**
| **Parameter** | **Description** |
| ------- | ---------- |
| parameters | `--param-list` in `mlflow run`. For example `-P learning_rate=0.2 -P colsample_bytree=0.8 -P subsample=0.9`. |
| Repository | Repository url of MLflow Project,Support git address and directory on worker. If it's in a subdirectory,We add `#` to support this (same as `mlflow run`) , for example `https://github.com/mlflow/mlflow#examples/xgboost/xgboost_native`. |
| Project Version | Version of the project,default master. |
You can now use this feature to run all MLFlow projects on Github (For example [MLflow examples](https://github.com/mlflow/mlflow/tree/master/examples) ). You can also create your own machine learning library to reuse your work, and then use DolphinScheduler to use your library with one click.
### MLflow Models
**General Parameters**
| **Parameter** | **Description** |
| ------- | ---------- |
| Model-URI | Model-URI of MLflow , support `models:/<model_name>/suffix` format and `runs:/` format. See https://mlflow.org/docs/latest/tracking.html#artifact-stores |
| Port | The port to listen on. |
#### MLflow
![mlflow-models-mlflow](../../../../img/tasks/demo/mlflow-models-mlflow.png)
#### Docker
![mlflow-models-docker](../../../../img/tasks/demo/mlflow-models-docker.png)
#### DOCKER COMPOSE
![mlflow-models-docker-compose](../../../../img/tasks/demo/mlflow-models-docker-compose.png)
## Environment to Prepare
### Conda environment
You need to enter the admin account to configure a conda environment variable(Please
install [anaconda](https://docs.continuum.io/anaconda/install/)
or [miniconda](https://docs.conda.io/en/latest/miniconda.html#installing ) in advance).
![mlflow-conda-env](../../../../img/tasks/demo/mlflow-conda-env.png)
Note During the configuration task, select the conda environment created above. Otherwise, the program cannot find the
Conda environment.
![mlflow-set-conda-env](../../../../img/tasks/demo/mlflow-set-conda-env.png)
### Start the MLflow Service
Make sure you have installed MLflow, using 'pip install mlflow'.
Create a folder where you want to save your experiments and models and start MLflow service.
```sh
mkdir mlflow
cd mlflow
mlflow server -h 0.0.0.0 -p 5000 --serve-artifacts --backend-store-uri sqlite:///mlflow.db
```
After running, an MLflow service is started.
After this, you can visit the MLflow service (`http://localhost:5000`) page to view the experiments and models.
![mlflow-server](../../../../img/tasks/demo/mlflow-server.png)

65
docs/docs/en/guide/task/openmldb.md

@ -0,0 +1,65 @@
# OpenMLDB Node
## Overview
[OpenMLDB](https://openmldb.ai/) is an excellent open source machine learning database, providing a full-stack
FeatureOps solution for production.
OpenMLDB task plugin used to execute tasks on OpenMLDB cluster.
## Create Task
- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag from the toolbar <img src="../../../../img/tasks/icons/openmldb.png" width="15"/> task node to canvas.
## Task Parameters
| **Parameter** | **Description** |
| ------- | ---------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| Predecessor task | Selecting the predecessor task of the current task will set the selected predecessor task as the upstream of the current task. |
| zookeeper | OpenMLDB cluster zookeeper address, e.g. 127.0.0.1:2181. |
| zookeeper path | OpenMLDB cluster zookeeper path, e.g. /openmldb. |
| Execute Mode | Determine the init mode, offline or online. You can switch it in sql statement. |
| SQL statement | SQL statement. |
| Custom parameters | It is the user-defined parameters of Python, which will replace the content with \${variable} in the script. |
## Task Examples
### Load data
![load data](../../../../img/tasks/demo/openmldb-load-data.png)
We use `LOAD DATA` to load data into OpenMLDB cluster. We select `offline` here, so it will load to offline storage.
### Feature extraction
![fe](../../../../img/tasks/demo/openmldb-feature-extraction.png)
We use `SELECT INTO` to do feature extraction. We select `offline` here, so it will run sql on offline engine.
### Environment to Prepare
#### Start the OpenMLDB Cluster
You should create an OpenMLDB cluster first. If in production env, please check [deploy OpenMLDB](https://openmldb.ai/docs/en/v0.5/deploy/install_deploy.html).
You can follow [run OpenMLDB in docker](https://openmldb.ai/docs/zh/v0.5/quickstart/openmldb_quickstart.html#id11)
to a quick start.
#### Python Environment
The OpenMLDB task will use OpenMLDB Python SDK to connect OpenMLDB cluster. So you should have the Python env.
We will use `python3` by default. You can set `PYTHON_HOME` to use your custom python env.
Make sure you have installed OpenMLDB Python SDK in the host where the worker server running, using `pip install openmldb`.

32
docs/docs/en/guide/task/pigeon.md

@ -1,19 +1,27 @@
# Pigeon
## Overview
Pigeon is a task used to trigger remote tasks, acquire logs or status by calling remote WebSocket service. It is DolphinScheduler uses a remote WebSocket service to call tasks.
## Create
## Create Task
Drag from the toolbar <img src="/img/pigeon.png" width="20"/> to the canvas to create a new Pigeon task.
- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag from the toolbar <img src="../../../../img/pigeon.png" width="20"/> to the canvas to create a new Pigeon task.
## Parameter
## Task Parameters
- Node name: The node name in a workflow definition is unique.
- Run flag: Identifies whether this node schedules normally, if it does not need to execute, select the `prohibition execution`.
- Descriptive information: Describe the function of the node.
- Task priority: When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order.
- Worker grouping: Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution.
- Times of failed retry attempts: The number of times the task failed to resubmit. You can select from drop-down or fill-in a number.
- Failed retry interval: The time interval for resubmitting the task after a failed task. You can select from drop-down or fill-in a number.
- Timeout alarm: Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm email will send and the task execution will fail.
- Target task name: Target task name of this Pigeon node.
| **Parameter** | **Description** |
| ------- | ---------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| Target task name | Target task name of this Pigeon node. |
| Predecessor task | Selecting the predecessor task of the current task will set the selected predecessor task as the upstream of the current task. |

41
docs/docs/en/guide/task/python.md

@ -7,23 +7,29 @@ it will generate a temporary python script, and executes the script by the Linux
## Create Task
- Click Project Management-Project Name-Workflow Definition, and click the "Create Workflow" button to enter the DAG editing page.
- Drag <img src="/img/tasks/icons/python.png" width="15"/> from the toolbar to the canvas.
- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag <img src="../../../../img/tasks/icons/python.png" width="15"/> from the toolbar to the canvas.
## Task Parameter
- Node name: The node name in a workflow definition is unique.
- Run flag: Identifies whether this node can be scheduled normally, if it does not need to be executed, you can turn on the prohibition switch.
- Descriptive information: Describe the function of the node.
- Task priority: When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order.
- Worker grouping: Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution.
- Environment Name: Configure the environment name in which to run the script.
- Number of failed retry attempts: The failure task resubmitting times. It supports drop-down and hand-filling.
- Failed retry interval: The time interval for resubmitting the task after a failed task. It supports drop-down and hand-filling.
- Timeout alarm: Check the timeout alarm and timeout failure. When the task exceeds the "timeout period", an alarm email will send and the task execution will fail.
- Script: Python program developed by the user.
- Resource: Refers to the list of resource files that need to be called in the script, and the files uploaded or created by the resource center-file management.
- Custom parameters: It is the user-defined parameters of Python, which will replace the content with \${variable} in the script.
| **Parameter** | **Description** |
| ------- | ---------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| Cpu quota | Assign the specified CPU time quota to the task executed. Takes a percentage value. Default -1 means unlimited. For example, the full CPU load of one core is 100%,and that of 16 cores is 1600%. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md). |
| Max memory | Assign the specified max memory to the task executed. Exceeding this limit will trigger oom to be killed and will not automatically retry. Takes an MB value. Default -1 means unlimited. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md). |
| Timeout alarm | Check the timeout alarm and timeout failure. When the task exceeds the "timeout period", an alarm email will send and the task execution will fail. |
| Script | Python program developed by the user. |
| Resource | Refers to the list of resource files that need to be called in the script, and the files uploaded or created by the resource center-file management. |
| Custom parameters | It is the user-defined parameters of Python, which will replace the content with \${variable} in the script. |
## Task Example
@ -32,7 +38,7 @@ it will generate a temporary python script, and executes the script by the Linux
This example simulates a common task that runs by a simple command. The example is to print one line in the log file, as shown in the following figure:
"This is a demo of python task".
![demo-python-simple](/img/tasks/demo/python_ui_next.jpg)
![demo-python-simple](../../../../img/tasks/demo/python_ui_next.jpg)
```python
print("This is a demo of python task")
@ -44,12 +50,9 @@ This example simulates a custom parameter task. We use parameters for reusing ex
we declare a custom parameter named "param_key", with the value "param_val". Then we use echo to print the parameter "${param_key}" we just declared.
After running this example, we would see "param_val" print in the log.
![demo-python-custom-param](/img/tasks/demo/python_custom_param_ui_next.jpg)
![demo-python-custom-param](../../../../img/tasks/demo/python_custom_param_ui_next.jpg)
```python
print("${param_key}")
```
## Notice
None

59
docs/docs/en/guide/task/shell.md

@ -2,47 +2,48 @@
## Overview
Shell task used to create a shell task type and execute a series of shell scripts. When the worker run the shell task, a temporary shell script is generated, and use the Linux user with the same name as the tenant executes the script.
Shell task type, used to create a shell type task and execute a series of shell scripts. When the worker executes this task, a temporary shell script is generated and executed using the linux user with the same name as the tenant.
## Create Task
- Click `Project -> Management-Project -> Name-Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag from the toolbar <img src="/img/tasks/icons/shell.png" width="15"/> to the canvas.
## Task Parameter
- Node name: The node name in a workflow definition is unique.
- Run flag: Identifies whether this node schedules normally, if it does not need to execute, select the `prohibition execution`.
- Descriptive information: Describe the function of the node.
- Task priority: When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order.
- Worker grouping: Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution.
- Environment Name: Configure the environment name in which run the script.
- Times of failed retry attempts: The number of times the task failed to resubmit. You can select from drop-down or fill-in a number.
- Failed retry interval: The time interval for resubmitting the task after a failed task. You can select from drop-down or fill-in a number.
- Timeout alarm: Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm email will send and the task execution will fail.
- Script: Shell program developed by users.
- Resource: Refers to the list of resource files that called in the script, and upload or create files by the Resource Center file management.
- Custom parameters: It is a user-defined local parameter of Shell, and will replace the content with `${variable}` in the script.
- Predecessor task: Selecting a predecessor task for the current task, will set the selected predecessor task as upstream of the current task.
- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag from the toolbar <img src="../../../../img/tasks/icons/shell.png" width="15"/> to the canvas.
## Task Parameters
| **Parameter** | **Description** |
| ------- | ---------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| Script | A SHELL program developed by the user. |
| Resource | Refers to the list of resource files that need to be called in the script, and the files uploaded or created in Resource Center - File Management.|
| User-defined parameter | It is a user-defined parameter of Shell, which will replace the content with `${variable}` in the script. |
| Predecessor task | Selecting the predecessor task of the current task will set the selected predecessor task as the upstream of the current task. |
## Task Example
### Simply Print
### Print a Line 
We make an example simulate from a common task which runs by one command. The example is to print one line in the log file, as shown in the following figure:
"This is a demo of shell task".
This example shows how to simulate simple tasks with just one or more simple lines of command. In this example, we will see how to print a line in a log file.
![demo-shell-simple](/img/tasks/demo/shell.jpg)
![demo-shell-simple](../../../../img/tasks/demo/shell.jpg)
### Custom Parameters
### Use custom parameters
This example simulates a custom parameter task. We use parameters for reusing existing tasks as template or coping with the dynamic task. In this case,
we declare a custom parameter named "param_key", with the value "param_val". Then we use `echo` to print the parameter "${param_key}" we just declared.
After running this example, we would see "param_val" print in the log.
This example simulates a custom parameter task. In order to reuse existing tasks more conveniently, or when faced with dynamic requirements, we will use variables to ensure the reusability of scripts. In this example, we first define the parameter "param_key" in the custom script and set its value to "param_val". Then the echo command is declared in the "script", and the parameter "param_key" is printed out. When we save and run the task, we will see in the log that the value "param_val" corresponding to the parameter "param_key" is printed out.
![demo-shell-custom-param](/img/tasks/demo/shell_custom_param.jpg)
![demo-shell-custom-param](../../../../img/tasks/demo/shell_custom_param.jpg)
## Attention
## Note
The shell task type resolves whether the task log contains ```application_xxx_xxx``` to determine whether is the yarn task. If so, the corresponding application
will be use to judge the running state of the current shell node. At this time, if stops the operation of the workflow, the corresponding ```application_id```

77
docs/docs/en/guide/task/spark.md

@ -10,38 +10,39 @@ Spark task type for executing Spark application. When executing the Spark task,
## Create Task
- Click `Project -> Management-Project -> Name-Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag from the toolbar <img src="/img/tasks/icons/spark.png" width="15"/> to the canvas.
## Task Parameter
- **Node name**: The node name in a workflow definition is unique.
- **Run flag**: Identifies whether this node schedules normally, if it does not need to execute, select the `prohibition execution`.
- **Descriptive information**: Describe the function of the node.
- **Task priority**: When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order.
- **Worker grouping**: Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution.
- **Environment Name**: Configure the environment name in which run the script.
- **Times of failed retry attempts**: The number of times the task failed to resubmit.
- **Failed retry interval**: The time interval (unit minute) for resubmitting the task after a failed task.
- **Delayed execution time**: The time (unit minute) that a task delays in execution.
- **Timeout alarm**: Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm email will send and the task execution will fail.
- **Program type**: Supports Java, Scala, Python and SQL.
- **Spark version**: Support Spark1 and Spark2.
- **The class of main function**: The **full path** of Main Class, the entry point of the Spark program.
- **Main jar package**: The Spark jar package (upload by Resource Center).
- **SQL scripts**: SQL statements in .sql files that Spark sql runs.
- **Deployment mode**: (1) spark submit supports three modes: yarn-clusetr, yarn-client and local.
(2) spark sql supports yarn-client and local modes.
- **Task name** (optional): Spark task name.
- **Driver core number**: Set the number of Driver core, which can be set according to the actual production environment.
- **Driver memory size**: Set the size of Driver memories, which can be set according to the actual production environment.
- **Number of Executor**: Set the number of Executor, which can be set according to the actual production environment.
- **Executor memory size**: Set the size of Executor memories, which can be set according to the actual production environment.
- **Main program parameters**: Set the input parameters of the Spark program and support the substitution of custom parameter variables.
- **Optional parameters**: support `--jars`, `--files`,` --archives`, `--conf` format.
- **Resource**: Appoint resource files in the `Resource` if parameters refer to them.
- **Custom parameter**: It is a local user-defined parameter for Spark, and will replace the content with `${variable}` in the script.
- **Predecessor task**: Selecting a predecessor task for the current task, will set the selected predecessor task as upstream of the current task.
- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag from the toolbar <img src="../../../../img/tasks/icons/spark.png" width="15"/> to the canvas.
## Task Parameters
| **Parameter** | **Description** |
| ------- | ---------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| Program type | Supports Java, Scala, Python, and SQL. |
| Spark version | Support Spark1 and Spark2. |
| The class of main function | The **full path** of Main Class, the entry point of the Spark program. |
| Main jar package | The Spark jar package (upload by Resource Center). |
| SQL scripts | SQL statements in .sql files that Spark sql runs. |
| Deployment mode | <ul><li>spark submit supports three modes: yarn-clusetr, yarn-client and local.</li><li>spark sql supports yarn-client and local modes.</li></ul> |
| Task name | Spark task name. |
| Driver core number | Set the number of Driver core, which can be set according to the actual production environment. |
| Driver memory size | Set the size of Driver memories, which can be set according to the actual production environment. |
| Number of Executor | Set the number of Executor, which can be set according to the actual production environment. |
| Executor memory size | Set the size of Executor memories, which can be set according to the actual production environment. |
| Main program parameters | Set the input parameters of the Spark program and support the substitution of custom parameter variables. |
| Optional parameters | Support `--jars`, `--files`,` --archives`, `--conf` format. |
| Resource | Appoint resource files in the `Resource` if parameters refer to them. |
| Custom parameter | It is a local user-defined parameter for Spark, and will replace the content with `${variable}` in the script. |
| Predecessor task | Selecting a predecessor task for the current task, will set the selected predecessor task as upstream of the current task. |
## Task Example
@ -55,7 +56,7 @@ This is a common introductory case in the big data ecosystem, which often apply
If you are using the Spark task type in a production environment, it is necessary to configure the required environment first. The following is the configuration file: `bin/env/dolphinscheduler_env.sh`.
![spark_configure](/img/tasks/demo/spark_task01.png)
![spark_configure](../../../../img/tasks/demo/spark_task01.png)
##### Upload the Main Package
@ -63,23 +64,23 @@ When using the Spark task node, you need to upload the jar package to the Resour
After finish the Resource Centre configuration, upload the required target files directly by dragging and dropping.
![resource_upload](/img/tasks/demo/upload_jar.png)
![resource_upload](../../../../img/tasks/demo/upload_jar.png)
##### Configure Spark Nodes
Configure the required content according to the parameter descriptions above.
![demo-spark-simple](/img/tasks/demo/spark_task02.png)
![demo-spark-simple](../../../../img/tasks/demo/spark_task02.png)
### spark sql
### Spark sql
#### Execute DDL and DML statements
This case is to create a view table terms and write three rows of data and a table wc in parquet format and determine whether the table exists. The program type is SQL. Insert the data of the view table terms into the table wc in parquet format.
![spark_sql](/img/tasks/demo/spark_sql.png)
![spark_sql](../../../../img/tasks/demo/spark_sql.png)
## Notice
## Note
JAVA and Scala are only used for identification, and there is no difference when you use the Spark task. If your application is developed by Python, you could just ignore the parameter **Main Class** in the form. Parameter **SQL scripts** is only for SQL type and could be ignored in JAVA, Scala and Python.

35
docs/docs/en/guide/task/sql.md

@ -10,24 +10,21 @@ Refer to [DataSource](../datasource/introduction.md)
## Create Task
- Click `Project -> Management-Project -> Name-Workflow Definition`, and click the "Create Workflow" button to enter the DAG editing page.
- Drag from the toolbar <img src="/img/tasks/icons/sql.png" width="25"/> to the canvas.
- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag from the toolbar <img src="../../../../img/tasks/icons/sql.png" width="25"/> to the canvas.
## Task Parameter
- Data source: Select the corresponding DataSource.
- SQL type: Supports query and non-query.
- Query: supports `DML select` type commands, which return a result set. You can specify three templates for email notification as form, attachment or form attachment;
- Non-query: support `DDL` all commands and `DML update, delete, insert` three types of commands;
- Segmented execution symbol: When the data source does not support executing multiple SQL statements at a time, the symbol for splitting SQL statements is provided to call the data source execution method multiple times.
Example: 1. When the Hive data source is selected as the data source, this parameter does not need to be filled in. Because the Hive data source itself supports executing multiple SQL statements at one time;
2. When the MySQL data source is selected as the data source, and multi-segment SQL statements are to be executed, this parameter needs to be filled in with a semicolon `;`. Because the MySQL data source does not support executing multiple SQL statements at one time;
- SQL parameter: The input parameter format is `key1=value1;key2=value2...`.
- SQL statement: SQL statement.
- UDF function: For Hive DataSources, you can refer to UDF functions created in the resource center, but other DataSource do not support UDF functions.
- Custom parameters: SQL task type, and stored procedure is a custom parameter order, to set customized parameter type and data type for the method is the same as the stored procedure task type. The difference is that the custom parameter of the SQL task type replaces the `${variable}` in the SQL statement.
- Pre-SQL: Pre-SQL executes before the SQL statement.
- Post-SQL: Post-SQL executes after the SQL statement.
| **Parameter** | **Description** |
| ------- |-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Data source | Select the corresponding DataSource. |
| SQL type | Supports query and non-query. <ul><li>Query: supports `DML select` type commands, which return a result set. You can specify three templates for email notification as form, attachment or form attachment;</li><li>Non-query: support `DDL` all commands and `DML update, delete, insert` three types of commands;<ul><li>Segmented execution symbol: When the data source does not support executing multiple SQL statements at a time, the symbol for splitting SQL statements is provided to call the data source execution method multiple times. Example: 1. When the Hive data source is selected as the data source, this parameter does not need to be filled in. Because the Hive data source itself supports executing multiple SQL statements at one time; 2. When the MySQL data source is selected as the data source, and multi-segment SQL statements are to be executed, this parameter needs to be filled in with a semicolon `;. Because the MySQL data source does not support executing multiple SQL statements at one time.</li></ul></li></ul> |
| SQL parameter | The input parameter format is `key1=value1;key2=value2...`. |
| SQL statement | SQL statement. |
| UDF function | For Hive DataSources, you can refer to UDF functions created in the resource center, but other DataSource do not support UDF functions. |
| Custom parameters | SQL task type, and stored procedure is a custom parameter order, to set customized parameter type and data type for the method is the same as the stored procedure task type. The difference is that the custom parameter of the SQL task type replaces the `${variable}` in the SQL statement. |
| Pre-SQL | Pre-SQL executes before the SQL statement. |
| Post-SQL | Post-SQL executes after the SQL statement. |
## Task Example
@ -37,13 +34,13 @@ Refer to [DataSource](../datasource/introduction.md)
This example creates a temporary table `tmp_hello_world` in Hive and writes a row of data. Before creating a temporary table, we need to ensure that the table does not exist. So we use custom parameters to obtain the time of the day as the suffix of the table name every time we run, this task can run every different day. The format of the created table name is: `tmp_hello_world_{yyyyMMdd}`.
![hive-sql](/img/tasks/demo/hive-sql.png)
![hive-sql](../../../../img/tasks/demo/hive-sql.png)
#### After Running the Task Successfully, Query the Results in Hive
Log in to the bigdata cluster and use 'hive' command or 'beeline' or 'JDBC' and other methods to connect to the 'Apache Hive' for the query. The query SQL is `select * from tmp_hello_world_{yyyyMMdd}`, please replace `{yyyyMMdd}` with the date of the running day. The following shows the query screenshot:
![hive-sql](/img/tasks/demo/hive-result.png)
![hive-sql](../../../../img/tasks/demo/hive-result.png)
### Use Pre-SQL and Post-SQL Example
@ -51,6 +48,8 @@ Table created in the Pre-SQL, after use in the SQL statement, cleaned in the Pos
![pre_post_sql](../../../../img/tasks/demo/pre_post_sql.png)
## Notice
## Note
Pay attention to the selection of SQL type. If it is an insert operation, need to change to "Non-Query" type.
To compatible with long session,UDF function are created by the syntax(CREATE OR REPLACE)

21
docs/docs/en/guide/task/stored-procedure.md

@ -8,6 +8,21 @@
<img src="/img/procedure-en.png" width="80%" />
</p>
- DataSource: The DataSource type of the stored procedure supports MySQL and POSTGRESQL, select the corresponding DataSource.
- Method: The method name of the stored procedure.
- Custom parameters: The custom parameter types of the stored procedure support `IN` and `OUT`, and the data types support: VARCHAR, INTEGER, LONG, FLOAT, DOUBLE, DATE, TIME, TIMESTAMP and BOOLEAN.
## Task Parameters
| **Parameter** | **Description** |
| ------- | ---------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| DataSource | The DataSource type of the stored procedure supports MySQL and POSTGRESQL, select the corresponding DataSource. |
| Method | The method name of the stored procedure. |
| Custom parameters | The custom parameter types of the stored procedure support `IN` and `OUT`, and the data types support: VARCHAR, INTEGER, LONG, FLOAT, DOUBLE, DATE, TIME, TIMESTAMP and BOOLEAN. |
| Predecessor task | Selecting the predecessor task of the current task will set the selected predecessor task as the upstream of the current task. |

33
docs/docs/en/guide/task/sub-process.md

@ -6,20 +6,23 @@ The sub-process node is to execute an external workflow definition as a task nod
## Create Task
- Click `Project -> Management-Project -> Name-Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag from the toolbar <img src="/img/tasks/icons/sub_process.png" width="15"/> task node to canvas to create a new SubProcess task.
- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag from the toolbar <img src="../../../../img/tasks/icons/sub_process.png" width="15"/> task node to canvas to create a new SubProcess task.
## Task Parameter
- Node name: The node name in a workflow definition is unique.
- Run flag: Identifies whether this node schedules normally.
- Descriptive information: Describe the function of the node.
- Task priority: When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order.
- Worker grouping: Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution.
- Environment Name: Configure the environment name in which run the script.
- Timeout alarm: Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm email will send and the task execution will fail.
- Sub-node: It is the workflow definition of the selected sub-process. Enter the sub-node in the upper right corner to jump to the workflow definition of the selected sub-process.
- Predecessor task: Selecting a predecessor task for the current task, will set the selected predecessor task as upstream of the current task.
| **Parameter** | **Description** |
| ---- |---------|
| Node name | Unique name of node in workflow definition. |
| Run flag | Identifies whether this node schedules normally. |
| Description | Describe the function of the node. |
| Task priority | When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order. |
| Worker group | Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment name in which run the script. |
| Timeout alarm | Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm email will send and the task execution will fail. |
| Child node | It is the workflow definition of the selected sub-process. Enter the child node in the upper right corner to jump to the workflow definition of the selected sub-process. |
| Pre task | Selecting a predecessor task for the current task, will set the selected predecessor task as upstream of the current task.
## Task Example
@ -29,18 +32,18 @@ This example simulates a common task type, here we use a child node task to reca
Create a shell task to print "hello" and define the workflow as `test_dag01`.
![subprocess_task01](/img/tasks/demo/subprocess_task01.png)
![subprocess_task01](../../../../img/tasks/demo/subprocess_task01.png)
## Create the Sub_process task
To use the sub_process, you need to create the sub-node task, which is the shell task we created in the first step. After that, as shown in the diagram below, select the corresponding sub-node in position ⑤.
![subprocess_task02](/img/tasks/demo/subprocess_task02.png)
![subprocess_task02](../../../../img/tasks/demo/subprocess_task02.png)
After creating the sub_process, create a corresponding shell task for printing "world" and link both together. Save the current workflow and run it to get the expected result.
![subprocess_task03](/img/tasks/demo/subprocess_task03.png)
![subprocess_task03](../../../../img/tasks/demo/subprocess_task03.png)
## Notice
## Note
When using `sub_process` to recall a sub-node task, you need to ensure that the defined sub-node is online status, otherwise, the sub_process workflow will not work properly.

47
docs/docs/en/guide/task/switch.md

@ -1,29 +1,34 @@
# Switch
## Overview
The switch is a conditional judgment node, decide the branch executes according to the value of [global variable](../parameter/global.md) and the expression result written by the user.
**Note** Execute expressions using javax.script.ScriptEngine.eval.
**Note**: Execute expressions using javax.script.ScriptEngine.eval.
## Create Task
Click Project -> Management-Project -> Name-Workflow Definition, and click the Create Workflow button to enter the DAG editing page.
Drag from the toolbar <img src="../../../../img/switch.png" width="20"/> task node to canvas to create a task.
**Note** After created a switch task, you must first configure the upstream and downstream, then configure the parameter of task branches.
## Parameter
- Node name: The node name in a workflow definition is unique.
- Run flag: Identifies whether this node schedules normally, if it does not need to execute, select the `prohibition execution`.
- Descriptive information: Describe the function of the node.
- Task priority: When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order.
- Worker grouping: Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution.
- Environment name: The environment in Security, if not configured, it will not be used.
- Task group name: The group in Resources, if not configured, it will not be used.
- Times of failed retry attempts: The number of times the task failed to resubmit. You can select from drop-down or fill-in a number.
- Failed retry interval: The time interval for resubmitting the task after a failed task. You can select from drop-down or fill-in a number.
- Delay execution time: Task delay execution time.
- Timeout alarm: Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm email will send and the task execution will fail.
- Condition: You can configure multiple conditions for the switch task. When the conditions are satisfied, execute the configured branch. You can configure multiple different conditions to satisfy different businesses.
- Branch flow: The default branch flow, when all the conditions are not satisfied, execute this branch flow.
- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag from the toolbar <img src="../../../../img/switch.png" width="20"/> task node to canvas to create a task.
**Note**: After created a switch task, you must first configure the upstream and downstream, then configure the parameter of task branches.
## Task Parameters
| **Parameter** | **Description** |
| ------- | ---------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| Delay execution time | Task delay execution time. |
| Condition | You can configure multiple conditions for the switch task. When the conditions are satisfied, execute the configured branch. You can configure multiple different conditions to satisfy different businesses. |
| Branch flow | The default branch flow, when all the conditions are not satisfied, execute this branch flow. |
## Task Example

39
docs/docs/en/guide/task/zeppelin.md

@ -7,21 +7,26 @@ it will call `Zeppelin Client API` to trigger zeppelin notebook paragraph. Click
## Create Task
- Click Project Management-Project Name-Workflow Definition, and click the "Create Workflow" button to enter the DAG editing page.
- Drag <img src="/img/tasks/icons/zeppelin.png" width="15"/> from the toolbar to the canvas.
## Task Parameter
- Node name: The node name in a workflow definition is unique.
- Run flag: Identifies whether this node can be scheduled normally, if it does not need to be executed, you can turn on the prohibition switch.
- Descriptive information: Describe the function of the node.
- Task priority: When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order.
- Worker grouping: Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution.
- Number of failed retry attempts: The failure task resubmitting times. It supports drop-down and hand-filling.
- Failed retry interval: The time interval for resubmitting the task after a failed task. It supports drop-down and hand-filling.
- Timeout alarm: Check the timeout alarm and timeout failure. When the task exceeds the "timeout period", an alarm email will send and the task execution will fail.
- Zeppelin Note ID: The unique note id for a zeppelin notebook note.
- Zeppelin Paragraph ID: The unique paragraph id for a zeppelin notebook paragraph.
- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag <img src="../../../../img/tasks/icons/zeppelin.png" width="15"/> from the toolbar to the canvas.
## Task Parameters
| **Parameter** | **Description** |
| ------- | ---------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| Zeppelin Note ID | The unique note id for a zeppelin notebook note. |
| Zeppelin Paragraph ID | The unique paragraph id for a zeppelin notebook paragraph. If you want to schedule a whole note at a time, leave this field blank. |
| Zeppelin Parameters | Parameters in json format used for zeppelin dynamic form. |
## Task Example
@ -29,7 +34,7 @@ it will call `Zeppelin Client API` to trigger zeppelin notebook paragraph. Click
This example illustrates how to create a zeppelin paragraph task node.
![demo-zeppelin-paragraph](/img/tasks/demo/zeppelin.png)
![demo-zeppelin-paragraph](../../../../img/tasks/demo/zeppelin.png)
![demo-get-zeppelin-id](/img/tasks/demo/zeppelin_id.png)
![demo-get-zeppelin-id](../../../../img/tasks/demo/zeppelin_id.png)

Loading…
Cancel
Save