
[doc] Change tasks doc (#10639)

3.1.0-release
sneh-wha 2 years ago committed by GitHub
parent
commit
65ebdbef98
  1. 36  docs/docs/en/guide/task/conditions.md
  2. 58  docs/docs/en/guide/task/datax.md
  3. 3  docs/docs/en/guide/task/dependent.md
  4. 29  docs/docs/en/guide/task/emr.md
  5. 66  docs/docs/en/guide/task/flink.md
  6. 43  docs/docs/en/guide/task/http.md
  7. 54  docs/docs/en/guide/task/jupyter.md
  8. 44  docs/docs/en/guide/task/kubernetes.md
  9. 74  docs/docs/en/guide/task/map-reduce.md
  10. 125  docs/docs/en/guide/task/mlflow.md
  11. 69  docs/docs/en/guide/task/openmldb.md
  12. 32  docs/docs/en/guide/task/pigeon.md
  13. 37  docs/docs/en/guide/task/python.md
  14. 3  docs/docs/en/guide/task/shell.md
  15. 65  docs/docs/en/guide/task/spark.md
  16. 27  docs/docs/en/guide/task/sql.md
  17. 21  docs/docs/en/guide/task/stored-procedure.md
  18. 47  docs/docs/en/guide/task/switch.md
  19. 32  docs/docs/en/guide/task/zeppelin.md

36
docs/docs/en/guide/task/conditions.md

@@ -4,25 +4,25 @@ Condition is a conditional node, that determines which downstream task should ru
## Create Task ## Create Task
- Click `Project -> Management-Project -> Name-Workflow Definition`, and click the "Create Workflow" button to enter the DAG editing page. - Click `Project Management -> Project Name -> Workflow Definition`, and click the "`Create Workflow`" button to enter the DAG editing page.
- Drag from the toolbar <img src="../../../../img/conditions.png" width="20"/> task node to canvas. - Drag from the toolbar <img src="../../../../img/conditions.png" width="20"/> task node to canvas.
## Parameter ## Task Parameters
- Node name: The node name in a workflow definition is unique. | **Parameter** | **Description** |
- Run flag: Identifies whether this node schedules normally, if it does not need to execute, select the `prohibition execution`. | -------- |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
- Descriptive information: Describe the function of the node. | Node Name | Set the name of the task. Node names within a workflow definition are unique. |
- Task priority: When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order. | Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
- Worker grouping: Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution. | Description | Describes the function of this node. |
- Times of failed retry attempts: The number of times the task failed to resubmit. You can select from drop-down or fill-in a number. | Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
- Failed retry interval: The time interval for resubmitting the task after a failed task. You can select from drop-down or fill-in a number. | Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
- Timeout alarm: Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm email will send and the task execution will fail. | Task group name | The group in Resources, if not configured, it will not be used. |
- Downstream tasks selection: Depending on the status of the predecessor task, you can jump to the corresponding branch, currently two branches are supported: success, failure | Environment Name | Configure the environment in which to run the script. |
- Success: When the upstream task runs successfully, run the success branch. | Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
- Failure: When the upstream task runs failed, run the failure branch. | Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
- Upstream condition selection: can select one or more upstream tasks for conditions. | Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
- Add an upstream dependency: the first parameter is to choose a specified task name, and the second parameter is to choose the upstream task status to trigger conditions. | Downstream tasks selection | Depending on the status of the predecessor task, you can jump to the corresponding branch, currently two branches are supported: success, failure <ul><li>Success: When the upstream task runs successfully, run the success branch.</li><li>Failure: When the upstream task runs failed, run the failure branch.</li></ul></li></ul> |
- Select upstream task relationship: use `and` and `or` operators to handle the complex relationship of upstream when there are multiple upstream tasks for conditions. | Upstream condition selection | Can select one or more upstream tasks for conditions.<ul><li>Add an upstream dependency: the first parameter is to choose a specified task name, and the second parameter is to choose the upstream task status to trigger conditions.</li><li>Select upstream task relationship: use `and` and `or` operators to handle the complex relationship of upstream when there are multiple upstream tasks for conditions.</li></ul></li></ul> |
## Related Task ## Related Task
@@ -51,7 +51,7 @@ After you finish creating the workflow, you can run the workflow online. You can
In the above figure, the task status marked with a green check mark is the successfully executed task node. In the above figure, the task status marked with a green check mark is the successfully executed task node.
## Notice ## Note
- The Conditions task supports multiple upstream tasks, but only two downstream tasks. - The Conditions task supports multiple upstream tasks, but only two downstream tasks.
- The Conditions task and the workflow that contain it do not support copy operations. - The Conditions task and the workflow that contain it do not support copy operations.
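
The `Upstream condition selection` and `Downstream tasks selection` rows in this hunk can be read as a small decision procedure. The following is a minimal sketch, not DolphinScheduler's implementation: the status strings, `evaluate_conditions`, and `choose_branch` are hypothetical names chosen for illustration, and it assumes the success branch runs when the combined condition holds.

```python
from typing import Dict, List, Tuple

# Hypothetical status strings; the real scheduler uses its own status enum.
SUCCESS = "SUCCESS"
FAILURE = "FAILURE"


def evaluate_conditions(
    upstream_status: Dict[str, str],
    dependencies: List[Tuple[str, str]],
    relation: str = "and",
) -> bool:
    """Combine upstream dependencies with an `and` / `or` relation.

    Each dependency is (task_name, expected_status).
    """
    results = [upstream_status.get(task) == expected for task, expected in dependencies]
    if relation == "and":
        return all(results)
    if relation == "or":
        return any(results)
    raise ValueError(f"unknown relation: {relation!r}")


def choose_branch(condition_met: bool, success_branch: str, failure_branch: str) -> str:
    """Pick one of the two downstream branches the Conditions task supports."""
    return success_branch if condition_met else failure_branch


if __name__ == "__main__":
    status = {"task_a": SUCCESS, "task_b": FAILURE}
    deps = [("task_a", SUCCESS), ("task_b", SUCCESS)]
    met = evaluate_conditions(status, deps, relation="or")
    print(choose_branch(met, "on_success", "on_failure"))
```

With the `or` relation, one matching upstream task is enough to take the success branch; with `and`, every listed dependency must match.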

58
docs/docs/en/guide/task/datax.md

@@ -6,35 +6,37 @@ DataX task type for executing DataX programs. For DataX nodes, the worker will e
## Create Task ## Create Task
- Click Project Management -> Project Name -> Workflow Definition, and click the "Create Workflow" button to enter the DAG editing page. - Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag the <img src="../../../../img/tasks/icons/datax.png" width="15"/> from the toolbar to the drawing board. - Drag the <img src="../../../../img/tasks/icons/datax.png" width="15"/> from the toolbar to the drawing board.
## Task Parameter ## Task Parameters
- **Node name**: The node name in a workflow definition is unique. | **Parameter** | **Description** |
- **Run flag**: Identifies whether this node can be scheduled normally, if it does not need to be executed, you can turn on the prohibition switch. | ------- | ---------- |
- **Descriptive information**: describe the function of the node. | Node name | The node name in a workflow definition is unique. |
- **Task priority**: When the number of worker threads is insufficient, they are executed in order from high to low, and when the priority is the same, they are executed according to the first-in first-out principle. | Run flag | Identifies whether this node schedules normally, if it does not need to execute, select the prohibition execution. |
- **Worker grouping**: Tasks are assigned to the machines of the worker group to execute. If Default is selected, a worker machine will be randomly selected for execution. | Task priority | When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order. |
- **Environment Name**: Configure the environment name in which to run the script. | Description | Describe the function of the node. |
- **Number of failed retry attempts**: The number of times the task failed to be resubmitted. | Worker group | Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution. |
- **Failed retry interval**: The time, in cents, interval for resubmitting the task after a failed task. | Environment Name | Configure the environment name in which run the script. |
- **Cpu quota**: Assign the specified CPU time quota to the task executed. Takes a percentage value. Default -1 means unlimited. For example, the full CPU load of one core is 100%,and that of 16 cores is 1600%. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md) | Number of failed retries | The number of times the task failed to resubmit. |
- **Max memory**:Assign the specified max memory to the task executed. Exceeding this limit will trigger oom to be killed and will not automatically retry. Takes an MB value. Default -1 means unlimited. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md) | Failed retry interval | The time interval (unit minute) for resubmitting the task after a failed task. |
- **Delayed execution time**: The time, in cents, that a task is delayed in execution. | Cpu quota | Assign the specified CPU time quota to the task executed. Takes a percentage value. Default -1 means unlimited. For example, the full CPU load of one core is 100%,and that of 16 cores is 1600%. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md) |
- **Timeout alarm**: Check the timeout alarm and timeout failure. When the task exceeds the "timeout period", an alarm email will be sent and the task execution will fail. | Max memory | Assign the specified max memory to the task executed. Exceeding this limit will trigger oom to be killed and will not automatically retry. Takes an MB value. Default -1 means unlimited. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md) |
- **Custom template**: Custom the content of the DataX node's json profile when the default data source provided does not meet the required requirements. | Delayed execution time | The time, in cents, that a task is delayed in execution. |
- **json**: json configuration file for DataX synchronization. | Timeout alarm | Check the timeout alarm and timeout failure. When the task exceeds the "timeout period", an alarm email will be sent and the task execution will fail. |
- **Custom parameters**: SQL task type, and stored procedure is a custom parameter order to set values for the method. The custom parameter type and data type are the same as the stored procedure task type. The difference is that the SQL task type custom parameter will replace the \${variable} in the SQL statement. | Custom template | Custom the content of the DataX node's json profile when the default data source provided does not meet the required requirements. |
- **Data source**: Select the data source from which the data will be extracted. | json | json configuration file for DataX synchronization. |
- **sql statement**: the sql statement used to extract data from the target database, the sql query column name is automatically parsed when the node is executed, and mapped to the target table synchronization column name. When the source table and target table column names are inconsistent, they can be converted by column alias. | Custom parameters | SQL task type, and stored procedure is a custom parameter order to set values for the method. The custom parameter type and data type are the same as the stored procedure task type. The difference is that the SQL task type custom parameter will replace the \${variable} in the SQL statement. |
- **Target library**: Select the target library for data synchronization. | Data source | Select the data source from which the data will be extracted. |
- **Pre-sql**: Pre-sql is executed before the sql statement (executed by the target library). | sql statement | the sql statement used to extract data from the target database, the sql query column name is automatically parsed when the node is executed, and mapped to the target table synchronization column name. When the source table and target table column names are inconsistent, they can be converted by column alias. |
- **Post-sql**: Post-sql is executed after the sql statement (executed by the target library). | Target library | Select the target library for data synchronization. |
- **Stream limit (number of bytes)**: Limits the number of bytes in the query. | Pre-sql | Pre-sql is executed before the sql statement (executed by the target library). |
- **Limit flow (number of records)**: Limit the number of records for a query. | Post-sql | Post-sql is executed after the sql statement (executed by the target library). |
- **Running memory**: the minimum and maximum memory required can be configured to suit the actual production environment. | Stream limit (number of bytes) | Limits the number of bytes in the query. |
- **Predecessor task**: Selecting a predecessor task for the current task will set the selected predecessor task as upstream of the current task. | Limit flow (number of records) | Limit the number of records for a query. |
| Running memory | the minimum and maximum memory required can be configured to suit the actual production environment. |
| Predecessor task | Selecting a predecessor task for the current task will set the selected predecessor task as upstream of the current task. |
## Task Example ## Task Example
@@ -60,6 +62,6 @@ After writing the required json file, you can configure the node content by foll
![datax_task03](../../../../img/tasks/demo/datax_task03.png) ![datax_task03](../../../../img/tasks/demo/datax_task03.png)
### Notice ### Note
If the default data source provided does not meet your needs, you can configure the writer and reader of DataX according to the actual usage environment in the custom template option, available at https://github.com/alibaba/DataX. If the default data source provided does not meet your needs, you can configure the writer and reader of DataX according to the actual usage environment in the custom template option, available at https://github.com/alibaba/DataX.
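
The `Custom parameters` row in this file says that a custom parameter replaces `${variable}` in the SQL statement. As a rough illustration of that substitution only (the scheduler's own placeholder handling may differ), the sketch below uses Python's standard `string.Template`, which happens to accept the same `${name}` syntax. The function `render_sql`, the table, and the parameter name `dt` are hypothetical.

```python
from string import Template
from typing import Mapping


def render_sql(sql: str, params: Mapping[str, str]) -> str:
    """Replace ${name} placeholders in `sql` with values from `params`.

    Raises KeyError if a placeholder has no matching parameter.
    """
    return Template(sql).substitute(params)


if __name__ == "__main__":
    # Hypothetical statement and parameter, for illustration only.
    statement = "SELECT id, name FROM orders WHERE dt = '${dt}'"
    print(render_sql(statement, {"dt": "2022-06-27"}))
```

Failing loudly on a missing parameter is a deliberate choice in this sketch: a placeholder left unresolved in an extraction query is usually a configuration mistake worth surfacing before the job runs.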

3
docs/docs/en/guide/task/dependent.md

@@ -26,7 +26,8 @@ Dependent nodes are **dependency check nodes**. For example, process A depends o
| Delayed execution time | The time (unit minute) that a task delays in execution. | | Delayed execution time | The time (unit minute) that a task delays in execution. |
| Pre task | Selecting a predecessor task for the current task, will set the selected predecessor task as upstream of the current task. | | Pre task | Selecting a predecessor task for the current task, will set the selected predecessor task as upstream of the current task. |
## Examples
## Task Examples
The Dependent node provides a logical judgment function, which can detect the execution of the dependent node according to the logic. The Dependent node provides a logical judgment function, which can detect the execution of the dependent node according to the logic.

29
docs/docs/en/guide/task/emr.md

@@ -4,17 +4,24 @@
Amazon EMR task type, for creating EMR clusters on AWS and running computing tasks. Using [aws-java-sdk](https://aws.amazon.com/cn/sdk-for-java/) in the background code, to transfer JSON parameters to [RunJobFlowRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticmapreduce/model/RunJobFlowRequest.html) object and submit to AWS. Amazon EMR task type, for creating EMR clusters on AWS and running computing tasks. Using [aws-java-sdk](https://aws.amazon.com/cn/sdk-for-java/) in the background code, to transfer JSON parameters to [RunJobFlowRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticmapreduce/model/RunJobFlowRequest.html) object and submit to AWS.
## Parameter ## Create Task
- Node name: The node name in a workflow definition is unique. * Click `Project Management -> Project Name -> Workflow Definition`, click the "`Create Workflow`" button to enter the DAG editing page.
- Run flag: Identifies whether this node schedules normally, if it does not need to execute, select the `prohibition execution`. * Drag `AmazonEMR` task from the toolbar to the artboard to complete the creation.
- Descriptive information: Describe the function of the node.
- Task priority: When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order. ## Task Parameters
- Worker grouping: Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution.
- Times of failed retry attempts: The number of times the task failed to resubmit. You can select from drop-down or fill-in a number. | **Parameter** | **Description** |
- Failed retry interval: The time interval for resubmitting the task after a failed task. You can select from drop-down or fill-in a number. | ------- | ---------- |
- Timeout alarm: Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm email will send and the task execution will fail. | Node name | The node name in a workflow definition is unique. |
- JSON: JSON corresponding to the [RunJobFlowRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticmapreduce/model/RunJobFlowRequest.html) object, for details refer to [API_RunJobFlow_Examples](https://docs.aws.amazon.com/emr/latest/APIReference/API_RunJobFlow.html#API_RunJobFlow_Examples). | Run flag | Identifies whether this node schedules normally, if it does not need to execute, select the `prohibition execution`.|
| Description | Describe the function of the node. |
| Task priority | When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order. |
| Worker grouping | Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution. |
| Times of failed retry attempts | The number of times the task failed to resubmit. You can select from drop-down or fill-in a number. |
| Failed retry interval: The time interval for resubmitting the task after a failed task. You can select from drop-down or fill-in a number. |
| Timeout alarm | Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm email will send and the task execution will fail. |
| JSON | JSON corresponding to the [RunJobFlowRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticmapreduce/model/RunJobFlowRequest.html) object, for details refer to [API_RunJobFlow_Examples](https://docs.aws.amazon.com/emr/latest/APIReference/API_RunJobFlow.html#API_RunJobFlow_Examples). |
## JSON example ## JSON example

66
docs/docs/en/guide/task/flink.md

@@ -10,39 +10,41 @@ Flink task type, used to execute Flink programs. For Flink nodes:
## Create Task ## Create Task
- Click `Project -> Management-Project -> Name-Workflow Definition`, and click the "Create Workflow" button to enter the DAG editing page. - Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag from the toolbar <img src="../../../../img/tasks/icons/flink.png" width="15"/>task node to canvas. - Drag from the toolbar <img src="../../../../img/tasks/icons/flink.png" width="15"/>task node to canvas.
## Task Parameter ## Task Parameters
- **Node name**: The node name in a workflow definition is unique. | **Parameter** | **Description** |
- **Run flag**: Identifies whether this node schedules normally, if it does not need to execute, select the `prohibition execution`. | ------- | ---------- |
- **Descriptive information**: Describe the function of the node. | Node name | The node name in a workflow definition is unique. |
- **Task priority**: When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order. | Run flag | Identifies whether this node schedules normally, if it does not need to execute, select the `prohibition execution`. |
- **Worker grouping**: Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution. | Description | Describe the function of the node. |
- **Environment Name**: Configure the environment name in which run the script. | Task priority | When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order. |
- **Times of failed retry attempts**: The number of times the task failed to resubmit. | Worker grouping | Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution. |
- **Failed retry interval**: The time interval (unit minute) for resubmitting the task after a failed task. | Environment Name | Configure the environment name in which run the script. |
- **Delayed execution time**: The time (unit minute) that a task delays in execution. | Times of failed retry attempts | The number of times the task failed to resubmit. |
- **Timeout alarm**: Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm email will send and the task execution will fail. | Failed retry interval | The time interval (unit minute) for resubmitting the task after a failed task. |
- **Program type**: Support Java, Scala, Python and SQL four languages. | Delayed execution time | The time (unit minute) that a task delays in execution. |
- **The class of main function**: The **full path** of Main Class, the entry point of the Flink program. | Timeout alarm | Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm email will send and the task execution will fail. |
- **Main jar package**: The jar package of the Flink program (upload by Resource Center). | Program type | Support Java, Scala, Python and SQL four languages. |
- **Deployment mode**: Support 2 deployment modes: cluster and local. | Class of main function**: The **full path** of Main Class, the entry point of the Flink program. |
- **Initialization script**: Script file to initialize session context. | Main jar package | The jar package of the Flink program (upload by Resource Center). |
- **Script**: The sql script file developed by the user that should be executed. | Deployment mode | Support 2 deployment modes: cluster and local. |
- **Flink version**: Select version according to the execution env. | Initialization script | Script file to initialize session context. |
- **Task name** (optional): Flink task name. | Script | The sql script file developed by the user that should be executed. |
- **JobManager memory size**: Used to set the size of jobManager memories, which can be set according to the actual production environment. | Flink version | Select version according to the execution environment. |
- **Number of slots**: Used to set the number of slots, which can be set according to the actual production environment. | Task name | Flink task name. |
- **TaskManager memory size**: Used to set the size of taskManager memories, which can be set according to the actual production environment. | JobManager memory size | Used to set the size of jobManager memories, which can be set according to the actual production environment. |
- **Number of TaskManager**: Used to set the number of taskManagers, which can be set according to the actual production environment. | Number of slots | Used to set the number of slots, which can be set according to the actual production environment. |
- **Parallelism**: Used to set the degree of parallelism for executing Flink tasks. | TaskManager memory size | Used to set the size of taskManager memories, which can be set according to the actual production environment. |
- **Main program parameters**: Set the input parameters for the Flink program and support the substitution of custom parameter variables. | Number of TaskManager | Used to set the number of taskManagers, which can be set according to the actual production environment. |
- **Optional parameters**: Support `--jar`, `--files`,` --archives`, `--conf` format. | Parallelism | Used to set the degree of parallelism for executing Flink tasks. |
- **Resource**: Appoint resource files in the `Resource` if parameters refer to them. | Main program parameters | Set the input parameters for the Flink program and support the substitution of custom parameter variables. |
- **Custom parameter**: It is a local user-defined parameter for Flink, and will replace the content with `${variable}` in the script. | Optional parameters | Support `--jar`, `--files`,` --archives`, `--conf` format. |
- **Predecessor task**: Selecting a predecessor task for the current task, will set the selected predecessor task as upstream of the current task. | Resource | Appoint resource files in the `Resource` if parameters refer to them. |
| Custom parameter | It is a local user-defined parameter for Flink, and will replace the content with `${variable}` in the script. |
| Predecessor task | Selecting a predecessor task for the current task, will set the selected predecessor task as upstream of the current task. |
## Task Example ## Task Example
@@ -76,7 +78,7 @@ Configure the required content according to the parameter descriptions above.
![demo-flink-sql-simple](../../../../img/tasks/demo/flink_sql_test.png) ![demo-flink-sql-simple](../../../../img/tasks/demo/flink_sql_test.png)
## Notice ## Note
- JAVA and Scala only used for identification, there is no difference. If use Python to develop Flink, there is no class of the main function and the rest is the same. - JAVA and Scala only used for identification, there is no difference. If use Python to develop Flink, there is no class of the main function and the rest is the same.
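
The `Task priority` row in the table above describes ordering by priority from high to low, with first-in first-out order among tasks of equal priority. One common way to get that behaviour is a heap keyed on (priority rank, submission sequence). The sketch below is an illustration under that assumption, not the scheduler's actual queue; `TaskQueue`, `PRIORITY_RANK`, and the priority labels are names chosen for this example.

```python
import heapq
import itertools
from typing import List, Tuple

# Assumed priority labels for illustration; lower rank is dispatched first.
PRIORITY_RANK = {"HIGHEST": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3, "LOWEST": 4}


class TaskQueue:
    """Dispatch by priority, then first-in first-out within a priority."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        # Monotonic counter breaks ties so equal priorities keep arrival order.
        self._sequence = itertools.count()

    def submit(self, name: str, priority: str = "MEDIUM") -> None:
        heapq.heappush(self._heap, (PRIORITY_RANK[priority], next(self._sequence), name))

    def next_task(self) -> str:
        return heapq.heappop(self._heap)[2]


if __name__ == "__main__":
    queue = TaskQueue()
    queue.submit("flink_a", "MEDIUM")
    queue.submit("flink_b", "HIGH")
    queue.submit("flink_c", "MEDIUM")
    print([queue.next_task() for _ in range(3)])
```

The sequence counter matters: without it, two tasks with the same priority would be ordered by name rather than by arrival.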

43
docs/docs/en/guide/task/http.md

@@ -6,28 +6,28 @@ This node is used to perform http type tasks such as the common POST and GET req
## Create Task ## Create Task
- Click `Project -> Management-Project -> Name-Workflow Definition`, and click the "Create Workflow" button to enter the DAG editing page. - Click `Project Management -> Project Name -> Workflow Definition`, and click the "`Create Workflow`" button to enter the DAG editing page.
- Drag the <img src="../../../../img/tasks/icons/http.png" width="15"/> from the toolbar to the drawing board. - Drag the <img src="../../../../img/tasks/icons/http.png" width="15"/> from the toolbar to the drawing board.
## Task Parameter ## Task Parameters
- **Node name**: The node name in a workflow definition is unique. | **Parameter** | **Description** |
- **Run flag**: Identifies whether this node can be scheduled normally, if it does not need to be executed, you can turn on the prohibition switch. | ------- | ---------- |
- **Descriptive information**: describe the function of the node. | Node Name | Set the name of the task. Node names within a workflow definition are unique. |
- **Task priority**: When the number of worker threads is insufficient, they are executed in order from high to low, and when the priority is the same, they are executed according to the first-in first-out principle. | Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
- **Worker grouping**: Tasks are assigned to the machines of the worker group to execute. If Default is selected, a worker machine will be randomly selected for execution. | Description | Describes the function of this node. |
- **Environment Name**: Configure the environment name in which to run the script. | Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
- **Number of failed retry attempts**: The number of times the task failed to be resubmitted. | Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
- **Failed retry interval**: The time, in cents, interval for resubmitting the task after a failed task. | Task group name | The group in Resources, if not configured, it will not be used. |
- **Delayed execution time**: the time, in cents, that a task is delayed in execution. | Environment Name | Configure the environment in which to run the script. |
- **Timeout alarm**: Check the timeout alarm and timeout failure. When the task exceeds the "timeout period", an alarm email will be sent and the task execution will fail. | Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
- **Request address**: HTTP request URL. | Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
- **Request type**: Support GET, POSt, HEAD, PUT, DELETE. | Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
- **Request parameters**: Support Parameter, Body, Headers. | Request address | HTTP request URL. |
- **Verification conditions**: support default response code, custom response code, content included, content not included. | Request type | Supports GET, POSt, HEAD, PUT, DELETE. || Request parameters |Supports Parameter, Body, Headers. || Verification conditions | Supports default response code, custom response code, content included, content not included.|
- **Verification content**: When the verification condition selects a custom response code, the content contains, and the content does not contain, the verification content is required. | Verification content | When the verification condition selects a custom response code, the content contains, and the content does not contain, the verification content is required. |
- **Custom parameter**: It is a user-defined parameter of http part, which will replace the content with `${variable}` in the script. | Custom parameter | It is a user-defined parameter of http part, which will replace the content with `${variable}` in the script. |
- **Predecessor task**: Selecting a predecessor task for the current task will set the selected predecessor task as upstream of the current task. | Pre tasks | Selecting a predecessor task for the current task will set the selected predecessor task as upstream of the current task. |
## Example ## Example
@ -42,6 +42,3 @@ The main configuration parameters are as follows:
![http_task](../../../../img/tasks/demo/http_task01.png) ![http_task](../../../../img/tasks/demo/http_task01.png)
## Notice
None

54
docs/docs/en/guide/task/jupyter.md

@ -11,7 +11,7 @@ it will use `papermill` to evaluate jupyter notes. Click [here](https://papermil
Click [here](https://docs.conda.io/en/latest/) for more information about `conda`. Click [here](https://docs.conda.io/en/latest/) for more information about `conda`.
- `conda.path` is set to `/opt/anaconda3/etc/profile.d/conda.sh` by default. If you have no idea where your `conda` is, simply run `conda info | grep -i 'base environment'`. - `conda.path` is set to `/opt/anaconda3/etc/profile.d/conda.sh` by default. If you have no idea where your `conda` is, simply run `conda info | grep -i 'base environment'`.
> NOTICE: `Jupyter Task Plugin` uses `source` command to activate conda environment. > NOTE: `Jupyter Task Plugin` uses `source` command to activate conda environment.
> If your tenant does not have permission to use `source`, `Jupyter Task Plugin` will not function. > If your tenant does not have permission to use `source`, `Jupyter Task Plugin` will not function.
@ -28,7 +28,7 @@ Click [here](https://docs.conda.io/en/latest/) for more information about `conda
2. Upload packed conda environment to `resource center`. 2. Upload packed conda environment to `resource center`.
3. Select your packed conda environment as `resource` in your `jupyter task`, e.g. `jupyter_env.tar.gz`. 3. Select your packed conda environment as `resource` in your `jupyter task`, e.g. `jupyter_env.tar.gz`.
> **_Note:_** Make sure you follow the [Conda-Pack](https://conda.github.io/conda-pack/) official instructions. > NOTE: Make sure you follow the [Conda-Pack](https://conda.github.io/conda-pack/) official instructions.
> If you unpack your packed conda environment, the directory structure should be the same as below: > If you unpack your packed conda environment, the directory structure should be the same as below:
``` ```
@ -42,36 +42,40 @@ Click [here](https://docs.conda.io/en/latest/) for more information about `conda
└── ssl └── ssl
``` ```
> NOTICE: Please follow the `conda pack` instructions above strictly, and DO NOT modify `bin/activate`. > NOTE: Please follow the `conda pack` instructions above strictly, and DO NOT modify `bin/activate`.
> `Jupyter Task Plugin` uses `source` command to activate your packed conda environment. > `Jupyter Task Plugin` uses `source` command to activate your packed conda environment.
> If you are concerned about using `source`, choose other options to manage your python dependency. > If you are concerned about using `source`, choose other options to manage your python dependency.
## Create Task ## Create Task
- Click Project Management-Project Name-Workflow Definition, and click the "Create Workflow" button to enter the DAG editing page. - Click `Project Management-Project Name-Workflow Definition`, and click the "`Create Workflow`" button to enter the DAG editing page.
- Drag <img src="../../../../img/tasks/icons/jupyter.png" width="15"/> from the toolbar to the canvas. - Drag <img src="../../../../img/tasks/icons/jupyter.png" width="15"/> from the toolbar to the canvas.
## Task Parameter ## Task Parameters
- Node name: The node name in a workflow definition is unique. | **Parameter** | **Description** |
- Run flag: Identifies whether this node can be scheduled normally, if it does not need to be executed, you can turn on the prohibition switch. | ------- | ---------- |
- Descriptive information: Describe the function of the node. | Node Name | Set the name of the task. Node names within a workflow definition are unique. |
- Task priority: When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order. | Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
- Worker grouping: Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution. | Description | Describes the function of this node. |
- Number of failed retry attempts: The failure task resubmitting times. It supports drop-down and hand-filling. | Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
- Failed retry interval: The time interval for resubmitting the task after a failed task. It supports drop-down and hand-filling. | Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
- Cpu quota: Assign the specified CPU time quota to the task executed. Takes a percentage value. Default -1 means unlimited. For example, the full CPU load of one core is 100%, and that of 16 cores is 1600%. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md) | Task group name | The group in Resources, if not configured, it will not be used. |
- Max memory: Assign the specified max memory to the task executed. A task that exceeds this limit is OOM-killed and is not retried automatically. Takes an MB value. Default -1 means unlimited. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md) | Environment Name | Configure the environment in which to run the script. |
- Timeout alarm: Check the timeout alarm and timeout failure. When the task exceeds the "timeout period", an alarm email will be sent and the task execution will fail. | Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
- Conda Env Name: Name of conda environment or packed conda environment tarball. | Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
- Input Note Path: Path of input jupyter note template. | Cpu quota | Assign the specified CPU time quota to the task executed. Takes a percentage value. Default -1 means unlimited. For example, the full CPU load of one core is 100%, and that of 16 cores is 1600%. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md). |
- Out Note Path: Path of output note. | Max memory | Assign the specified max memory to the task executed. A task that exceeds this limit is OOM-killed and is not retried automatically. Takes an MB value. Default -1 means unlimited. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md). |
- Jupyter Parameters: Parameters in JSON format used for jupyter note parameterization. | Timeout alarm | Check the timeout alarm and timeout failure. When the task exceeds the "timeout period", an alarm email will be sent and the task execution will fail. |
- Kernel: Jupyter notebook kernel. | Conda Env Name | Name of conda environment or packed conda environment tarball. |
- Engine: Engine to evaluate jupyter notes. |Input Note Path | Path of input jupyter note template. |
- Jupyter Execution Timeout: Timeout set for each jupyter notebook cell. | Out Note Path | Path of output note. |
- Jupyter Start Timeout: Timeout set for jupyter notebook kernel. | Jupyter Parameters | Parameters in json format used for jupyter note parameterization. |
- Others: Other command options for papermill. | Kernel | Jupyter notebook kernel. |
| Engine | Engine to evaluate jupyter notes. |
| Jupyter Execution Timeout | Timeout set for each jupyter notebook cell. |
| Jupyter Start Timeout | Timeout set for jupyter notebook kernel. |
| Others | Other command options for papermill. |
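
To make the Jupyter-specific parameters more concrete, the sketch below shows how they could map onto a `papermill` command line. This is an assumption for illustration, not the plugin's actual code: the function name and the file paths are hypothetical, and only scalar values in the `Jupyter Parameters` JSON are handled.

```python
import json


def build_papermill_command(input_note, output_note, parameters_json,
                            kernel=None, engine=None,
                            execution_timeout=None, start_timeout=None):
    """Build a papermill argument list from the task parameters.

    Hypothetical helper: how the Jupyter task plugin really assembles its
    command is not shown in this document.
    """
    cmd = ["papermill", input_note, output_note]
    # "Jupyter Parameters" is a JSON object; pass each entry with -p name value.
    for name, value in json.loads(parameters_json).items():
        cmd += ["-p", name, str(value)]
    if kernel:
        cmd += ["--kernel", kernel]
    if engine:
        cmd += ["--engine", engine]
    if execution_timeout is not None:
        cmd += ["--execution-timeout", str(execution_timeout)]
    if start_timeout is not None:
        cmd += ["--start-timeout", str(start_timeout)]
    return cmd
```

Calling it with `'{"alpha": 0.5, "runs": 3}'` as the parameters yields two `-p` pairs, one per JSON key, after the input and output note paths.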
## Task Example ## Task Example

44
docs/docs/en/guide/task/kubernetes.md

@ -6,27 +6,31 @@ K8S task type used to execute a batch task. In this task, the worker submits the
## Create Task ## Create Task
- Click `Project -> Management-Project -> Name-Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page. - Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag from the toolbar <img src="../../../../img/tasks/icons/kubernetes.png" width="15"/> to the canvas. - Drag from the toolbar <img src="../../../../img/tasks/icons/kubernetes.png" width="15"/> to the canvas.
## Task Parameter ## Task Parameters
- **Node name**: The node name in a workflow definition is unique. | **Parameter** | **Description** |
- **Run flag**: Identifies whether this node schedules normally, if it does not need to execute, select the `prohibition execution`. | ------- | ---------- |
- **Descriptive information**: Describe the function of the node. | Node Name | Set the name of the task. Node names within a workflow definition are unique. |
- **Task priority**: When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order. | Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
- **Worker grouping**: Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution. | Description | Describes the function of this node. |
- **Environment Name**: Configure the environment name in which to run the task. | Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
- **Times of failed retry attempts**: The number of times the task failed to resubmit. | Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
- **Failed retry interval**: The time interval (unit minute) for resubmitting the task after a failed task. | Task group name | The group in Resources, if not configured, it will not be used. |
- **Delayed execution time**: The time (unit minute) that a task delays in execution. | Environment Name | Configure the environment in which to run the script. |
- **Timeout alarm**: Check the timeout alarm and timeout failure. When the task run exceeds the "timeout", an alarm email will be sent and the task execution will fail. | Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
- **Namespace**: the namespace for running k8s task | Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
- **Min CPU**: min CPU requirement for running k8s task | Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
- **Min Memory**: min memory requirement for running k8s task | Namespace | The namespace for running k8s task. |
- **Image**: the registry url for image | Min CPU | Minimum CPU requirement for running k8s task. |
- **Custom parameter**: It is a local user-defined parameter for K8S task, these params will pass to container as environment variables. | Min Memory | Minimum memory requirement for running k8s task. |
- **Predecessor task**: Selecting a predecessor task for the current task, will set the selected predecessor task as upstream of the current task. | Image | The registry url for image. |
| Custom parameter | It is a local user-defined parameter for K8S task, these params will pass to container as environment variables. |
| Predecessor task | Selecting a predecessor task for the current task, will set the selected predecessor task as upstream of the current task. |
## Task Example ## Task Example
### Configure the K8S Environment in DolphinScheduler ### Configure the K8S Environment in DolphinScheduler
@ -39,6 +43,6 @@ Configure the required content according to the parameter descriptions above.
![K8S](../../../../img/tasks/demo/kubernetes-task-en.png) ![K8S](../../../../img/tasks/demo/kubernetes-task-en.png)
## Notice ## Note
Task name contains only lowercase alphanumeric characters or '-' Task name contains only lowercase alphanumeric characters or '-'
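
The naming rule above can be checked before a workflow is saved. The following is a minimal sketch, assuming only the rule stated here (lowercase alphanumeric characters or `-`); the function name is hypothetical, and any further limits Kubernetes may impose, such as length or leading and trailing characters, are not checked.

```python
import re

# Only the rule stated in the note: lowercase letters, digits, and '-'.
_TASK_NAME_PATTERN = re.compile(r"[a-z0-9-]+")


def is_valid_k8s_task_name(name):
    """Return True if the name uses only lowercase alphanumerics or '-'."""
    return _TASK_NAME_PATTERN.fullmatch(name) is not None
```

With this check, `my-task-1` is accepted, while `My_Task` is rejected because of the uppercase letters and the underscore.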

74
docs/docs/en/guide/task/map-reduce.md

@ -6,45 +6,51 @@ MapReduce(MR) task type used for executing MapReduce programs. For MapReduce nod
## Create Task ## Create Task
- Click `Project -> Management-Project -> Name-Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page. - Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag from the toolbar <img src="../../../../img/tasks/icons/mr.png" width="15"/> to the canvas. - Drag from the toolbar <img src="../../../../img/tasks/icons/mr.png" width="15"/> to the canvas.
## Task Parameter ## Task Parameters
- **Node name**: The node name in a workflow definition is unique. ### General
- **Run flag**: Identifies whether this node schedules normally, if it does not need to execute, select the `prohibition execution`.
- **Descriptive information**: Describe the function of the node. | **Parameter** | **Description** |
- **Task priority**: When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order. | ------- | ---------- |
- **Worker grouping**: Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution. | Node Name | Set the name of the task. Node names within a workflow definition are unique. |
- **Environment Name**: Configure the environment name in which run the script. | Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
- **Times of failed retry attempts**: The number of times the task failed to resubmit. | Description | Describes the function of this node. |
- **Failed retry interval**: The time interval (unit minute) for resubmitting the task after a failed task. | Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
- **Delayed execution time**: The time (unit minute) that a task delays in execution. | Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
- **Timeout alarm**: Check the timeout alarm and timeout failure. When the task run exceeds the "timeout", an alarm email will be sent and the task execution will fail. | Task group name | The group in Resources, if not configured, it will not be used. |
- **Resource**: The list of resource files called in the script. Upload or create these files through the Resource Center file management. | Environment Name | Configure the environment in which to run the script. |
- **Custom parameters**: It is a local user-defined parameter for MapReduce, and will replace the content with `${variable}` in the script. | Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
- **Predecessor task**: Selecting a predecessor task for the current task, will set the selected predecessor task as upstream of the current task. | Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| Resource | The list of resource files called in the script. Upload or create these files through the Resource Center file management. |
| Custom parameters | It is a local user-defined parameter for MapReduce, and will replace the content with `${variable}` in the script. |
| Predecessor task | Selecting a predecessor task for the current task, will set the selected predecessor task as upstream of the current task. |
### JAVA or SCALA Program ### JAVA or SCALA Program
- **Program type**: Select JAVA or SCALA program. | **Parameter** | **Description** |
- **The class of the main function**: The **full path** of Main Class, the entry point of the MapReduce program. | ------- | ---------- |
- **Main jar package**: The jar package of the MapReduce program. | Program type | Select JAVA or SCALA program. |
- **Task name** (optional): MapReduce task name. | The class of the main function | The **full path** of Main Class, the entry point of the MapReduce program. |
- **Command line parameters**: Set the input parameters of the MapReduce program and support the substitution of custom parameter variables. | Main jar package | The jar package of the MapReduce program. |
- **Other parameters**: support `-D`, `-files`, `-libjars`, `-archives` format. | Task name | MapReduce task name. |
- **Resource**: Appoint resource files in the `Resource` if parameters refer to them. | Command line parameters | Set the input parameters of the MapReduce program and support the substitution of custom parameter variables. |
- **User-defined parameter**: It is a local user-defined parameter for MapReduce, and will replace the content with `${variable}` in the script. | Other parameters | Support `-D`, `-files`, `-libjars`, `-archives` format. |
| Resource | Appoint resource files in the `Resource` if parameters refer to them. |
## Python Program | User-defined parameter | It is a local user-defined parameter for MapReduce, and will replace the content with `${variable}` in the script. |
- **Program type**: Select Python language. ### Python Program
- **Main jar package**: The Python jar package for running MapReduce.
- **Other parameters**: support `-D`, `-mapper`, `-reducer`, `-input`, `-output` format, and you can set the input of user-defined parameters, such as: | **Parameter** | **Description** |
- `-mapper "mapper.py 1"` `-file mapper.py` `-reducer reducer.py` `-file reducer.py` `-input /journey/words.txt` `-output /journey/out/mr/\${currentTimeMillis}` | ------- | ---------- |
- The `mapper.py 1` after `-mapper` is two parameters, the first parameter is `mapper.py`, and the second parameter is `1`. | Program type | Select Python language. |
- **Resource**: Appoint resource files in the `Resource` if parameters refer to them. | Main jar package | The Python jar package for running MapReduce. |
- **User-defined parameter**: It is a local user-defined parameter for MapReduce, and will replace the content with `${variable}` in the script. | Other parameters | Support `-D`, `-mapper`, `-reducer`, `-input`, `-output` format, and you can set the input of user-defined parameters, such as:<ul><li>`-mapper "mapper.py 1"` `-file mapper.py` `-reducer reducer.py` `-file reducer.py` `-input /journey/words.txt` `-output /journey/out/mr/${currentTimeMillis}`</li><li>The `mapper.py 1` after `-mapper` consists of two parameters: the first parameter is `mapper.py`, and the second parameter is `1`.</li></ul> |
| Resource | Appoint resource files in the `Resource` if parameters refer to them. |
| User-defined parameter | It is a local user-defined parameter for MapReduce, and will replace the content with `${variable}` in the script. |
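
The `-mapper "mapper.py 1"` example passes `1` to the script as a command-line argument. The sketch below shows one way a streaming-style `mapper.py` could consume it. The meaning given to the argument (a minimum word length) and the helper names are assumptions for illustration; the original document does not say what the `1` stands for.

```python
import sys


def min_length_from_argv(argv):
    """Read the extra argument: with `-mapper "mapper.py 1"`, argv[1] is "1"."""
    return int(argv[1]) if len(argv) > 1 else 1


def map_lines(lines, min_length=1):
    """Emit one tab-separated `word<TAB>1` record per sufficiently long word."""
    for line in lines:
        for word in line.split():
            if len(word) >= min_length:
                yield word + "\t1"


def main(argv, stdin, stdout):
    """Read input lines from stdin and write mapper records to stdout."""
    for record in map_lines(stdin, min_length_from_argv(argv)):
        stdout.write(record + "\n")

# When saved as mapper.py, finish the file with:
#     main(sys.argv, sys.stdin, sys.stdout)
```

Keeping the logic in `map_lines` makes the mapper testable without a Hadoop cluster, since it accepts any iterable of lines.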
## Task Example ## Task Example

125
docs/docs/en/guide/task/mlflow.md

@ -31,35 +31,26 @@ The MLflow plugin currently supports and will support the following:
## Create Task ## Create Task
- Click `Project -> Management-Project -> Name-Workflow Definition`, and click the "Create Workflow" button to enter the - Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
DAG editing page.
- Drag from the toolbar <img src="../../../../img/tasks/icons/mlflow.png" width="15"/> task node to canvas. - Drag from the toolbar <img src="../../../../img/tasks/icons/mlflow.png" width="15"/> task node to canvas.
## Task Example ## Task Parameters and Example
First, introduce some general parameters of DolphinScheduler: | **Parameter** | **Description** |
| ------- | ---------- |
- **Node name**: The node name in a workflow definition is unique. | Node Name | Set the name of the task. Node names within a workflow definition are unique. |
- **Run flag**: Identifies whether this node schedules normally, if it does not need to execute, select | Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
the `prohibition execution`. | Description | Describes the function of this node. |
- **Descriptive information**: Describe the function of the node. | Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
- **Task priority**: When the number of worker threads is insufficient, execute in the order of priority from high | Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
to low, and tasks with the same priority will execute in a first-in first-out order. | Task group name | The group in Resources, if not configured, it will not be used. |
- **Worker grouping**: Assign tasks to the machines of the worker group to execute. If `Default` is selected, | Environment Name | Configure the environment in which to run the script. |
randomly select a worker machine for execution. | Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
- **Environment Name**: Configure the environment name in which run the script. | Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
- **Times of failed retry attempts**: The number of times the task failed to resubmit. | Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
- **Failed retry interval**: The time interval (unit minute) for resubmitting the task after a failed task. | Predecessor task | Selecting the predecessor task of the current task will set the selected predecessor task as the upstream of the current task. |
- **Delayed execution time**: The time (unit minute) that a task delays in execution. | MLflow Tracking Server URI | MLflow Tracking Server URI, default http://localhost:5000. |
- **Timeout alarm**: Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm | Experiment Name | Create the experiment where the task is running, if the experiment does not exist. If the name is empty, it is set to ` Default `, the same as MLflow. |
email will send and the task execution will fail.
- **Predecessor task**: Selecting a predecessor task for the current task, will set the selected predecessor task as
upstream of the current task.
Here are some specific parameters for the MLFlow component:
- **MLflow Tracking Server URI**: MLflow Tracking Server URI, default http://localhost:5000.
- **Experiment Name**: Create the experiment where the task is running, if the experiment does not exist. If the name is empty, it is set to ` Default `, the same as MLflow.
### MLflow Projects ### MLflow Projects
@ -67,71 +58,54 @@ Here are some specific parameters for the MLFlow component:
![mlflow-conda-env](../../../../img/tasks/demo/mlflow-basic-algorithm.png) ![mlflow-conda-env](../../../../img/tasks/demo/mlflow-basic-algorithm.png)
**Task Parameter** **Task Parameters**
| **Parameter** | **Description** |
- **Register Model**: Register the model or not. If register is selected, the following parameters are expanded. | ------- | ---------- |
- **Model Name**: The registered model name is added to the original model version and registered as | Register Model | Register the model or not. If register is selected, the following parameters are expanded. |
Production. | Model Name | The registered model name is added to the original model version and registered as Production. |
- **Data Path**: The absolute path of the file or folder. Ends with .csv for file or contain train.csv and | Data Path | The absolute path of the file or folder. Ends with .csv for file or contain train.csv and test.csv for folder (In the suggested way, users should build their own test sets for model evaluation). |
test.csv for folder (In the suggested way, users should build their own test sets for model evaluation). | Parameters | Parameters used when initializing the algorithm/AutoML model, which can be empty. For example, parameters `"time_budget=30;estimator_list=['lgbm']"` for flaml. By convention, parameters are separated by `;`: the text before the equal sign is the parameter name, and the text after it is evaluated with Python `eval()` to obtain the parameter value. <ul><li>[Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)</li><li>[SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html?highlight=svc#sklearn.svm.SVC)</li><li>[lightgbm](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier)</li><li>[xgboost](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier)</li></ul> |
- **Parameters**: Parameter when initializing the algorithm/AutoML model, which can be empty. For example | Algorithm | The selected algorithm currently supports `LR`, `SVM`, `LightGBM` and `XGboost` based on [scikit-learn](https://scikit-learn.org/) form. |
parameters `"time_budget=30;estimator_list=['lgbm']"` for flaml. By convention, parameters are separated by `;`: | Parameter Search Space | Parameter search space when running the corresponding algorithm, which can be empty. For example, the parameter `max_depth=[5, 10];n_estimators=[100, 200]` for lightgbm. By convention, parameters are separated by `;`: the text before the equal sign is the parameter name, and the text after it is evaluated with Python `eval()` to obtain the parameter value. |
the text before the equal sign is the parameter name, and the text after
it is evaluated with Python `eval()` to obtain the parameter value.
- [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)
- [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html?highlight=svc#sklearn.svm.SVC)
- [lightgbm](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier)
- [xgboost](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier)
- **Algorithm**:The selected algorithm currently supports `LR`, `SVM`, `LightGBM` and `XGboost` based
on [scikit-learn](https://scikit-learn.org/) form.
- **Parameter Search Space**: Parameter search space when running the corresponding algorithm, which can be
empty. For example, the parameter `max_depth=[5, 10];n_estimators=[100, 200]` for lightgbm. By convention,
parameters are separated by `;`: the text before the equal sign is the parameter name,
and the text after it is evaluated with Python `eval()` to obtain the parameter value.
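
The parameter convention described above (entries separated by `;`, name before the equal sign, value evaluated as a Python expression) can be sketched as follows. This is not the plugin's code: the function name is hypothetical, and the sketch deliberately uses `ast.literal_eval` in place of `eval()` as a safer stand-in that accepts only Python literals.

```python
import ast


def parse_params(param_string):
    """Parse 'name=value;name=value' into a dict of Python values.

    Hypothetical helper; values are parsed with ast.literal_eval rather
    than eval(), so only literals (numbers, strings, lists, ...) work.
    """
    params = {}
    for item in param_string.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate empty input and trailing separators
        # Split on the first '=' only, so values may contain '='.
        name, _, raw_value = item.partition("=")
        params[name.strip()] = ast.literal_eval(raw_value.strip())
    return params
```

Applied to `time_budget=30;estimator_list=['lgbm']`, this yields an integer and a list rather than strings. One limitation of splitting on `;` is that a value containing a semicolon would be cut in two.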
#### AutoML #### AutoML
![mlflow-automl](../../../../img/tasks/demo/mlflow-automl.png) ![mlflow-automl](../../../../img/tasks/demo/mlflow-automl.png)
**Task Parameter** **Task Parameter**
| **Parameter** | **Description** |
- **Register Model**: Register the model or not. If register is selected, the following parameters are expanded. | ------- | ---------- |
- **model name**: The registered model name is added to the original model version and registered as | Register Model | Register the model or not. If register is selected, the following parameters are expanded. |
Production. | model name | The registered model name is added to the original model version and registered as Production. |
- **Data Path**: The absolute path of the file or folder. Ends with .csv for file or contain train.csv and | Data Path | The absolute path of the file or folder. Ends with .csv for file or contain train.csv and test.csv for folder(In the suggested way, users should build their own test sets for model evaluation). |
test.csv for folder (In the suggested way, users should build their own test sets for model evaluation). | Parameters | Parameters used when initializing the algorithm/AutoML model, which can be empty. For example, parameters `n_estimators=200;learning_rate=0.2` for flaml. By convention, parameters are separated by `;`: the text before the equal sign is the parameter name, and the text after it is evaluated with Python `eval()` to obtain the parameter value. The detailed parameter list is as follows: <ul><li>[flaml](https://microsoft.github.io/FLAML/docs/reference/automl#automl-objects)</li><li>[autosklearn](https://automl.github.io/auto-sklearn/master/api.html)</li></ul> |
- **Parameters**: Parameter when initializing the algorithm/AutoML model, which can be empty. For example | AutoML tool | The AutoML tool used, currently supports [autosklearn](https://github.com/automl/auto-sklearn) and [flaml](https://github.com/microsoft/FLAML). |
parameters `n_estimators=200;learning_rate=0.2` for flaml. By convention, parameters are separated by `;`:
the text before the equal sign is the parameter name, and the text after it is evaluated
with Python `eval()` to obtain the parameter value. The detailed parameter list is as follows:
- [flaml](https://microsoft.github.io/FLAML/docs/reference/automl#automl-objects)
- [autosklearn](https://automl.github.io/auto-sklearn/master/api.html)
- **AutoML tool**: The AutoML tool used, currently
supports [autosklearn](https://github.com/automl/auto-sklearn)
and [flaml](https://github.com/microsoft/FLAML).
#### Custom projects #### Custom projects
![mlflow-custom-project.png](../../../../img/tasks/demo/mlflow-custom-project.png) ![mlflow-custom-project.png](../../../../img/tasks/demo/mlflow-custom-project.png)
**Task Parameter** **Task Parameter**
| **Parameter** | **Description** |
- **parameters**: `--param-list` in `mlflow run`. For example `-P learning_rate=0.2 -P colsample_bytree=0.8 -P subsample=0.9`. | ------- | ---------- |
- **Repository**: Repository URL of the MLflow project. It supports a git address or a directory on the worker. If the project is in a subdirectory, append `#` and the subdirectory path (same as `mlflow run`), for example `https://github.com/mlflow/mlflow#examples/xgboost/xgboost_native`. | parameters | `--param-list` in `mlflow run`. For example `-P learning_rate=0.2 -P colsample_bytree=0.8 -P subsample=0.9`. |
- **Project Version**: Version of the project, default master. | Repository | Repository URL of the MLflow project. It supports a git address or a directory on the worker. If the project is in a subdirectory, append `#` and the subdirectory path (same as `mlflow run`), for example `https://github.com/mlflow/mlflow#examples/xgboost/xgboost_native`. |
| Project Version | Version of the project, default master. |
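As a rough sketch of how the three fields above could map onto an `mlflow run` invocation: the repository becomes the project URI, the parameters string is passed through as `-P` options, and the project version is passed with `--version`. How DolphinScheduler actually assembles the command is an assumption here, and `build_mlflow_run_command` is a hypothetical helper.

```python
import shlex


def build_mlflow_run_command(repository, parameters, version="master"):
    """Assemble an argument list for `mlflow run` from the task fields."""
    # shlex.split turns "-P a=1 -P b=2" into ["-P", "a=1", "-P", "b=2"].
    return ["mlflow", "run", repository, *shlex.split(parameters), "--version", version]


command = build_mlflow_run_command(
    "https://github.com/mlflow/mlflow#examples/xgboost/xgboost_native",
    "-P learning_rate=0.2 -P colsample_bytree=0.8 -P subsample=0.9",
)
print(shlex.join(command))
```

Building the command as a list keeps the `#` in the repository URI from being interpreted by a shell when the list is handed to `subprocess.run`.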
You can now use this feature to run all MLFlow projects on Github (For example [MLflow examples](https://github.com/mlflow/mlflow/tree/master/examples) ). You can also create your own machine learning library to reuse your work, and then use DolphinScheduler to use your library with one click. You can now use this feature to run all MLFlow projects on Github (For example [MLflow examples](https://github.com/mlflow/mlflow/tree/master/examples) ). You can also create your own machine learning library to reuse your work, and then use DolphinScheduler to use your library with one click.
### MLflow Models ### MLflow Models
General Parameters: **General Parameters**
- **Model-URI**: Model URI of MLflow. It supports the `models:/<model_name>/suffix` format and the `runs:/` format. See https://mlflow.org/docs/latest/tracking.html#artifact-stores. | **Parameter** | **Description** |
- **Port**: The port to listen on. | ------- | ---------- |
| Model-URI | Model URI of MLflow. It supports the `models:/<model_name>/suffix` format and the `runs:/` format. See https://mlflow.org/docs/latest/tracking.html#artifact-stores |
| Port | The port to listen on. |
#### MLFLOW #### MLflow
![mlflow-models-mlflow](../../../../img/tasks/demo/mlflow-models-mlflow.png) ![mlflow-models-mlflow](../../../../img/tasks/demo/mlflow-models-mlflow.png)
@ -143,12 +117,9 @@ General Parameters:
![mlflow-models-docker-compose](../../../../img/tasks/demo/mlflow-models-docker-compose.png) ![mlflow-models-docker-compose](../../../../img/tasks/demo/mlflow-models-docker-compose.png)
- **Max Cpu Limit**: For example `1.0` or `0.5`, the same as docker compose. ## Environment to Prepare
- **Max Memory Limit**: For example `1G` or `500M`, the same as docker compose.
## Environment to prepare
### Conda env ### Conda environment
You need to log in with the admin account to configure a conda environment variable (Please You need to log in with the admin account to configure a conda environment variable (Please
install [anaconda](https://docs.continuum.io/anaconda/install/) install [anaconda](https://docs.continuum.io/anaconda/install/)
@ -161,7 +132,7 @@ Conda environment.
![mlflow-set-conda-env](../../../../img/tasks/demo/mlflow-set-conda-env.png) ![mlflow-set-conda-env](../../../../img/tasks/demo/mlflow-set-conda-env.png)
### Start the mlflow service ### Start the MLflow Service
Make sure you have installed MLflow, using `pip install mlflow`. Make sure you have installed MLflow, using `pip install mlflow`.

69
docs/docs/en/guide/task/openmldb.md

@ -9,65 +9,54 @@ OpenMLDB task plugin used to execute tasks on OpenMLDB cluster.
## Create Task ## Create Task
- Click `Project -> Management-Project -> Name-Workflow Definition`, and click the "Create Workflow" button to enter the - Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
DAG editing page.
- Drag from the toolbar <img src="../../../../img/tasks/icons/openmldb.png" width="15"/> task node to canvas. - Drag from the toolbar <img src="../../../../img/tasks/icons/openmldb.png" width="15"/> task node to canvas.
## Task Example ## Task Parameters
First, introduce some general parameters of DolphinScheduler | **Parameter** | **Description** |
| ------- | ---------- |
- **Node name**: The node name in a workflow definition is unique. | Node Name | Set the name of the task. Node names within a workflow definition are unique. |
- **Run flag**: Identifies whether this node schedules normally, if it does not need to execute, select | Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
the `prohibition execution`. | Description | Describes the function of this node. |
- **Descriptive information**: Describe the function of the node. | Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
- **Task priority**: When the number of worker threads is insufficient, execute in the order of priority from high | Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
to low, and tasks with the same priority will execute in a first-in first-out order. | Task group name | The group in Resources, if not configured, it will not be used. |
- **Worker grouping**: Assign tasks to the machines of the worker group to execute. If `Default` is selected, | Environment Name | Configure the environment in which to run the script. |
randomly select a worker machine for execution. | Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
- **Environment Name**: Configure the environment name in which run the script. | Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
- **Times of failed retry attempts**: The number of times the task failed to resubmit. | Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
- **Failed retry interval**: The time interval (unit minute) for resubmitting the task after a failed task. | Predecessor task | Selecting the predecessor task of the current task will set the selected predecessor task as the upstream of the current task. |
- **Delayed execution time**: The time (unit minute) that a task delays in execution. | zookeeper | OpenMLDB cluster zookeeper address, e.g. 127.0.0.1:2181. |
- **Timeout alarm**: Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm | zookeeper path | OpenMLDB cluster zookeeper path, e.g. /openmldb. |
email will send and the task execution will fail. | Execute Mode | Determine the init mode, offline or online. You can switch it in sql statement. |
- **Predecessor task**: Selecting a predecessor task for the current task, will set the selected predecessor task as | SQL statement | SQL statement. |
upstream of the current task. | Custom parameters | It is the user-defined parameters of Python, which will replace the content with \${variable} in the script. |
### OpenMLDB Parameters ## Task Examples
**Task Parameter** ### Load data
- **zookeeper**: OpenMLDB cluster zookeeper address, e.g. 127.0.0.1:2181.
- **zookeeper path**: OpenMLDB cluster zookeeper path, e.g. /openmldb.
- **Execute Mode**: Determine the init mode, offline or online. You can switch it in sql statement.
- **SQL statement**: SQL statement.
- Custom parameters: It is the user-defined parameters of Python, which will replace the content with \${variable} in the script.
Here are some examples:
#### Load data
![load data](../../../../img/tasks/demo/openmldb-load-data.png) ![load data](../../../../img/tasks/demo/openmldb-load-data.png)
We use `LOAD DATA` to load data into OpenMLDB cluster. We select `offline` here, so it will load to offline storage. We use `LOAD DATA` to load data into OpenMLDB cluster. We select `offline` here, so it will load to offline storage.
#### Feature extraction ### Feature extraction
![fe](../../../../img/tasks/demo/openmldb-feature-extraction.png) ![fe](../../../../img/tasks/demo/openmldb-feature-extraction.png)
We use `SELECT INTO` to do feature extraction. We select `offline` here, so it will run sql on offline engine. We use `SELECT INTO` to do feature extraction. We select `offline` here, so it will run sql on offline engine.
## Environment to prepare ### Environment to Prepare
### Start the OpenMLDB cluster #### Start the OpenMLDB Cluster
You should create an OpenMLDB cluster first. If in production env, please check [deploy OpenMLDB](https://openmldb.ai/docs/en/v0.5/deploy/install_deploy.html). You should create an OpenMLDB cluster first. If in production env, please check [deploy OpenMLDB](https://openmldb.ai/docs/en/v0.5/deploy/install_deploy.html).
You can follow [run OpenMLDB in docker](https://openmldb.ai/docs/zh/v0.5/quickstart/openmldb_quickstart.html#id11) You can follow [run OpenMLDB in docker](https://openmldb.ai/docs/zh/v0.5/quickstart/openmldb_quickstart.html#id11)
for a quick start. for a quick start.
### Python env #### Python Environment
The OpenMLDB task uses the OpenMLDB Python SDK to connect to the OpenMLDB cluster, so a Python environment is required. The OpenMLDB task uses the OpenMLDB Python SDK to connect to the OpenMLDB cluster, so a Python environment is required.

32
docs/docs/en/guide/task/pigeon.md

@ -1,19 +1,27 @@
# Pigeon # Pigeon
## Overview
Pigeon is a task used to trigger remote tasks and acquire their logs or status by calling a remote WebSocket service. DolphinScheduler uses this remote WebSocket service to call tasks. Pigeon is a task used to trigger remote tasks and acquire their logs or status by calling a remote WebSocket service. DolphinScheduler uses this remote WebSocket service to call tasks.
## Create ## Create Task
Drag from the toolbar <img src="../../../../img/pigeon.png" width="20"/> to the canvas to create a new Pigeon task. - Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag from the toolbar <img src="../../../../img/pigeon.png" width="20"/> to the canvas to create a new Pigeon task.
## Parameter ## Task Parameters
- Node name: The node name in a workflow definition is unique. | **Parameter** | **Description** |
- Run flag: Identifies whether this node schedules normally, if it does not need to execute, select the `prohibition execution`. | ------- | ---------- |
- Descriptive information: Describe the function of the node. | Node Name | Set the name of the task. Node names within a workflow definition are unique. |
- Task priority: When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order. | Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
- Worker grouping: Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution. | Description | Describes the function of this node. |
- Times of failed retry attempts: The number of times the task failed to resubmit. You can select from drop-down or fill-in a number. | Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
- Failed retry interval: The time interval for resubmitting the task after a failed task. You can select from drop-down or fill-in a number. | Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
- Timeout alarm: Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm email will send and the task execution will fail. | Task group name | The group in Resources, if not configured, it will not be used. |
- Target task name: Target task name of this Pigeon node. | Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| Target task name | Target task name of this Pigeon node. |
| Predecessor task | Selecting the predecessor task of the current task will set the selected predecessor task as the upstream of the current task. |

37
docs/docs/en/guide/task/python.md

@ -7,25 +7,29 @@ it will generate a temporary python script, and executes the script by the Linux
## Create Task ## Create Task
- Click Project Management-Project Name-Workflow Definition, and click the "Create Workflow" button to enter the DAG editing page. - Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag <img src="../../../../img/tasks/icons/python.png" width="15"/> from the toolbar to the canvas. - Drag <img src="../../../../img/tasks/icons/python.png" width="15"/> from the toolbar to the canvas.
## Task Parameter ## Task Parameter
- Node name: The node name in a workflow definition is unique. | **Parameter** | **Description** |
- Run flag: Identifies whether this node can be scheduled normally, if it does not need to be executed, you can turn on the prohibition switch. | ------- | ---------- |
- Descriptive information: Describe the function of the node. | Node Name | Set the name of the task. Node names within a workflow definition are unique. |
- Task priority: When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order. | Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
- Worker grouping: Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution. | Description | Describes the function of this node. |
- Environment Name: Configure the environment name in which to run the script. | Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
- Number of failed retry attempts: The failure task resubmitting times. It supports drop-down and hand-filling. | Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
- Failed retry interval: The time interval for resubmitting the task after a failed task. It supports drop-down and hand-filling. | Task group name | The group in Resources, if not configured, it will not be used. |
- Cpu quota: Assign the specified CPU time quota to the task executed. Takes a percentage value. Default -1 means unlimited. For example, the full CPU load of one core is 100%, and that of 16 cores is 1600%. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md). | Environment Name | Configure the environment in which to run the script. |
- Max memory: Assign the specified max memory to the task executed. Exceeding this limit causes the task to be killed by the OOM killer, and it will not be retried automatically. Takes an MB value. Default -1 means unlimited. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md). | Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
- Timeout alarm: Check the timeout alarm and timeout failure. When the task exceeds the "timeout period", an alarm email will send and the task execution will fail. | Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
- Script: Python program developed by the user. | Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
- Resource: Refers to the list of resource files that need to be called in the script, and the files uploaded or created by the resource center-file management. | Cpu quota | Assign the specified CPU time quota to the task executed. Takes a percentage value. Default -1 means unlimited. For example, the full CPU load of one core is 100%, and that of 16 cores is 1600%. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md). |
- Custom parameters: It is the user-defined parameters of Python, which will replace the content with \${variable} in the script. | Max memory | Assign the specified max memory to the task executed. Exceeding this limit causes the task to be killed by the OOM killer, and it will not be retried automatically. Takes an MB value. Default -1 means unlimited. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md). |
| Timeout alarm | Check the timeout alarm and timeout failure. When the task exceeds the "timeout period", an alarm email will send and the task execution will fail. |
| Script | Python program developed by the user. |
| Resource | Refers to the list of resource files that need to be called in the script, and the files uploaded or created by the resource center-file management. |
| Custom parameters | It is the user-defined parameters of Python, which will replace the content with \${variable} in the script. |
## Task Example ## Task Example
@ -52,6 +56,3 @@ After running this example, we would see "param_val" print in the log.
print("${param_key}") print("${param_key}")
``` ```
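To illustrate how a custom parameter replaces `${variable}` in the script, the sketch below uses Python's standard `string.Template`, which happens to share the `${name}` placeholder syntax. It is only a stand-in: DolphinScheduler performs its own substitution before the script runs, and the names `script` and `rendered` are illustrative.

```python
from string import Template

# The task script as written in the form, with a placeholder for the custom parameter.
script = 'print("${param_key}")'

# Custom parameter defined on the task: param_key -> param_val.
rendered = Template(script).substitute({"param_key": "param_val"})
print(rendered)
```

The rendered script is `print("param_val")`, which is why "param_val" appears in the task log.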
## Notice
None

3
docs/docs/en/guide/task/shell.md

@ -9,6 +9,9 @@ Shell task type, used to create a shell type task and execute a series of shell
- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page. - Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag from the toolbar <img src="../../../../img/tasks/icons/shell.png" width="15"/> to the canvas. - Drag from the toolbar <img src="../../../../img/tasks/icons/shell.png" width="15"/> to the canvas.
## Task Parameters
| **Parameter** | **Description** | | **Parameter** | **Description** |
| ------- | ---------- | | ------- | ---------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. | | Node Name | Set the name of the task. Node names within a workflow definition are unique. |

65
docs/docs/en/guide/task/spark.md

@ -10,38 +10,39 @@ Spark task type for executing Spark application. When executing the Spark task,
## Create Task ## Create Task
- Click `Project -> Management-Project -> Name-Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page. - Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag from the toolbar <img src="../../../../img/tasks/icons/spark.png" width="15"/> to the canvas. - Drag from the toolbar <img src="../../../../img/tasks/icons/spark.png" width="15"/> to the canvas.
## Task Parameter ## Task Parameters
- **Node name**: The node name in a workflow definition is unique. | **Parameter** | **Description** |
- **Run flag**: Identifies whether this node schedules normally, if it does not need to execute, select the `prohibition execution`. | ------- | ---------- |
- **Descriptive information**: Describe the function of the node. | Node Name | Set the name of the task. Node names within a workflow definition are unique. |
- **Task priority**: When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order. | Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
- **Worker grouping**: Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution. | Description | Describes the function of this node. |
- **Environment Name**: Configure the environment name in which run the script. | Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
- **Times of failed retry attempts**: The number of times the task failed to resubmit. | Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
- **Failed retry interval**: The time interval (unit minute) for resubmitting the task after a failed task. | Task group name | The group in Resources, if not configured, it will not be used. |
- **Delayed execution time**: The time (unit minute) that a task delays in execution. | Environment Name | Configure the environment in which to run the script. |
- **Timeout alarm**: Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm email will send and the task execution will fail. | Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
- **Program type**: Supports Java, Scala, Python and SQL. | Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
- **Spark version**: Support Spark1 and Spark2. | Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
- **The class of main function**: The **full path** of Main Class, the entry point of the Spark program. | Program type | Supports Java, Scala, Python, and SQL. |
- **Main jar package**: The Spark jar package (upload by Resource Center). | Spark version | Support Spark1 and Spark2. |
- **SQL scripts**: SQL statements in .sql files that Spark sql runs. | The class of main function | The **full path** of Main Class, the entry point of the Spark program. |
- **Deployment mode**: (1) spark submit supports three modes: yarn-cluster, yarn-client and local. | Main jar package | The Spark jar package (upload by Resource Center). |
(2) spark sql supports yarn-client and local modes. | SQL scripts | SQL statements in .sql files that Spark sql runs. |
- **Task name** (optional): Spark task name. | Deployment mode | <ul><li>spark submit supports three modes: yarn-cluster, yarn-client and local.</li><li>spark sql supports yarn-client and local modes.</li></ul> |
- **Driver core number**: Set the number of Driver core, which can be set according to the actual production environment. | Task name | Spark task name. |
- **Driver memory size**: Set the size of Driver memories, which can be set according to the actual production environment. | Driver core number | Set the number of Driver core, which can be set according to the actual production environment. |
- **Number of Executor**: Set the number of Executor, which can be set according to the actual production environment. | Driver memory size | Set the size of Driver memories, which can be set according to the actual production environment. |
- **Executor memory size**: Set the size of Executor memories, which can be set according to the actual production environment. | Number of Executor | Set the number of Executor, which can be set according to the actual production environment. |
- **Main program parameters**: Set the input parameters of the Spark program and support the substitution of custom parameter variables. | Executor memory size | Set the size of Executor memories, which can be set according to the actual production environment. |
- **Optional parameters**: support `--jars`, `--files`, `--archives`, `--conf` format. | Executor memory size | Set the size of Executor memories, which can be set according to the actual production environment. |
- **Resource**: Appoint resource files in the `Resource` if parameters refer to them. | Optional parameters | Support `--jars`, `--files`, `--archives`, `--conf` format. |
- **Custom parameter**: It is a local user-defined parameter for Spark, and will replace the content with `${variable}` in the script. | Resource | Appoint resource files in the `Resource` if parameters refer to them. |
- **Predecessor task**: Selecting a predecessor task for the current task, will set the selected predecessor task as upstream of the current task. | Custom parameter | It is a local user-defined parameter for Spark, and will replace the content with `${variable}` in the script. |
| Predecessor task | Selecting a predecessor task for the current task, will set the selected predecessor task as upstream of the current task. |
## Task Example ## Task Example
@ -71,7 +72,7 @@ Configure the required content according to the parameter descriptions above.
![demo-spark-simple](../../../../img/tasks/demo/spark_task02.png) ![demo-spark-simple](../../../../img/tasks/demo/spark_task02.png)
### spark sql ### Spark sql
#### Execute DDL and DML statements #### Execute DDL and DML statements
@ -79,7 +80,7 @@ This case is to create a view table terms and write three rows of data and a tab
![spark_sql](../../../../img/tasks/demo/spark_sql.png) ![spark_sql](../../../../img/tasks/demo/spark_sql.png)
## Notice ## Note
JAVA and Scala are only used for identification, and there is no difference when you use the Spark task. If your application is developed by Python, you could just ignore the parameter **Main Class** in the form. Parameter **SQL scripts** is only for SQL type and could be ignored in JAVA, Scala and Python. JAVA and Scala are only used for identification, and there is no difference when you use the Spark task. If your application is developed by Python, you could just ignore the parameter **Main Class** in the form. Parameter **SQL scripts** is only for SQL type and could be ignored in JAVA, Scala and Python.

27
docs/docs/en/guide/task/sql.md

@ -10,24 +10,21 @@ Refer to [DataSource](../datasource/introduction.md)
## Create Task ## Create Task
- Click `Project -> Management-Project -> Name-Workflow Definition`, and click the "Create Workflow" button to enter the DAG editing page. - Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag from the toolbar <img src="../../../../img/tasks/icons/sql.png" width="25"/> to the canvas. - Drag from the toolbar <img src="../../../../img/tasks/icons/sql.png" width="25"/> to the canvas.
## Task Parameter ## Task Parameter
- Data source: Select the corresponding DataSource. | **Parameter** | **Description** |
- SQL type: Supports query and non-query. | ------- |-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
- Query: supports `DML select` type commands, which return a result set. You can specify three templates for email notification as form, attachment or form attachment; | Data source | Select the corresponding DataSource. |
- Non-query: support `DDL` all commands and `DML update, delete, insert` three types of commands; | SQL type | Supports query and non-query. <ul><li>Query: supports `DML select` type commands, which return a result set. You can specify three templates for email notification as form, attachment or form attachment;</li><li>Non-query: support `DDL` all commands and `DML update, delete, insert` three types of commands;<ul><li>Segmented execution symbol: When the data source does not support executing multiple SQL statements at a time, the symbol for splitting SQL statements is provided to call the data source execution method multiple times. Example: 1. When the Hive data source is selected as the data source, this parameter does not need to be filled in. Because the Hive data source itself supports executing multiple SQL statements at one time; 2. When the MySQL data source is selected as the data source, and multi-segment SQL statements are to be executed, this parameter needs to be filled in with a semicolon `;`. Because the MySQL data source does not support executing multiple SQL statements at one time.</li></ul></li></ul> |
- Segmented execution symbol: When the data source does not support executing multiple SQL statements at a time, the symbol for splitting SQL statements is provided to call the data source execution method multiple times. | SQL parameter | The input parameter format is `key1=value1;key2=value2...`. |
Example: 1. When the Hive data source is selected as the data source, this parameter does not need to be filled in. Because the Hive data source itself supports executing multiple SQL statements at one time; | SQL statement | SQL statement. |
2. When the MySQL data source is selected as the data source, and multi-segment SQL statements are to be executed, this parameter needs to be filled in with a semicolon `;`. Because the MySQL data source does not support executing multiple SQL statements at one time; | UDF function | For Hive DataSources, you can refer to UDF functions created in the resource center, but other DataSource do not support UDF functions. |
- SQL parameter: The input parameter format is `key1=value1;key2=value2...`. | Custom parameters | The custom parameter types and data types are set in the same way as for the stored procedure task type. The difference is that a stored procedure passes custom parameters to the method in order, while the custom parameters of the SQL task type replace the `${variable}` in the SQL statement. |
- SQL statement: SQL statement. | Pre-SQL | Pre-SQL executes before the SQL statement. |
- UDF function: For Hive DataSources, you can refer to UDF functions created in the resource center, but other DataSource do not support UDF functions. | Post-SQL | Post-SQL executes after the SQL statement. |
- Custom parameters: The custom parameter types and data types are set in the same way as for the stored procedure task type. The difference is that a stored procedure passes custom parameters to the method in order, while the custom parameters of the SQL task type replace the `${variable}` in the SQL statement.
- Pre-SQL: Pre-SQL executes before the SQL statement.
- Post-SQL: Post-SQL executes after the SQL statement.
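The segmented execution symbol described above can be pictured with the following sketch: the script is split on the symbol and each statement is sent to the data source in its own call. Python's built-in `sqlite3` stands in for a data source that accepts only one statement per call (its `execute()` rejects multi-statement strings). The helper `run_segmented` is hypothetical, and the naive split does not handle a `;` inside a string literal.

```python
import sqlite3


def run_segmented(connection, script, separator=";"):
    """Split the script on the separator and execute one statement per call."""
    statements = [part.strip() for part in script.split(separator) if part.strip()]
    for statement in statements:
        connection.execute(statement)
    return statements


connection = sqlite3.connect(":memory:")
executed = run_segmented(
    connection,
    "CREATE TABLE terms (id INTEGER, name TEXT);"
    " INSERT INTO terms VALUES (1, 'a');"
    " INSERT INTO terms VALUES (2, 'b')",
)
row_count = connection.execute("SELECT COUNT(*) FROM terms").fetchone()[0]
print(len(executed), row_count)
```

With a data source such as Hive that accepts several statements at once, the split is unnecessary, which is why the field can be left empty there.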
## Task Example ## Task Example
@ -51,7 +48,7 @@ Table created in the Pre-SQL, after use in the SQL statement, cleaned in the Pos
![pre_post_sql](../../../../img/tasks/demo/pre_post_sql.png) ![pre_post_sql](../../../../img/tasks/demo/pre_post_sql.png)
## Notice ## Note
Pay attention to the selection of SQL type. If the statement is an insert operation, change the SQL type to "Non-Query". Pay attention to the selection of SQL type. If the statement is an insert operation, change the SQL type to "Non-Query".

21
docs/docs/en/guide/task/stored-procedure.md

@ -8,6 +8,21 @@
<img src="../../../../img/procedure-en.png" width="80%" /> <img src="../../../../img/procedure-en.png" width="80%" />
</p> </p>
- DataSource: The DataSource type of the stored procedure supports MySQL and POSTGRESQL, select the corresponding DataSource. ## Task Parameters
- Method: The method name of the stored procedure.
- Custom parameters: The custom parameter types of the stored procedure support `IN` and `OUT`, and the data types support: VARCHAR, INTEGER, LONG, FLOAT, DOUBLE, DATE, TIME, TIMESTAMP and BOOLEAN. | **Parameter** | **Description** |
| ------- | ---------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| DataSource | The DataSource type of the stored procedure supports MySQL and POSTGRESQL, select the corresponding DataSource. |
| Method | The method name of the stored procedure. |
| Custom parameters | The custom parameter types of the stored procedure support `IN` and `OUT`, and the data types support: VARCHAR, INTEGER, LONG, FLOAT, DOUBLE, DATE, TIME, TIMESTAMP and BOOLEAN. |
| Predecessor task | Selecting the predecessor task of the current task will set the selected predecessor task as the upstream of the current task. |
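
The `Custom parameters` row lists the supported parameter types (`IN`, `OUT`) and data types. As a rough sketch, a parameter could be modeled and checked against those lists as below; the class `ProcedureParam` and its field names are assumptions made for illustration and are not part of DolphinScheduler.

```python
from dataclasses import dataclass

# Values taken from the "Custom parameters" row of the table above.
SUPPORTED_DIRECTIONS = {"IN", "OUT"}
SUPPORTED_TYPES = {
    "VARCHAR", "INTEGER", "LONG", "FLOAT", "DOUBLE",
    "DATE", "TIME", "TIMESTAMP", "BOOLEAN",
}


@dataclass(frozen=True)
class ProcedureParam:
    """Hypothetical model of one stored procedure custom parameter."""

    name: str
    direction: str
    data_type: str
    value: str = ""

    def __post_init__(self) -> None:
        # Reject anything outside the documented lists.
        if self.direction not in SUPPORTED_DIRECTIONS:
            raise ValueError(f"unsupported parameter type: {self.direction}")
        if self.data_type not in SUPPORTED_TYPES:
            raise ValueError(f"unsupported data type: {self.data_type}")


params = [
    ProcedureParam("user_id", "IN", "INTEGER", "42"),
    ProcedureParam("total", "OUT", "DOUBLE"),
]
print([p.name for p in params])
```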

47
docs/docs/en/guide/task/switch.md

@ -1,29 +1,34 @@
# Switch # Switch
## Overview
The switch is a conditional judgment node that decides which branch to execute according to the value of the [global variable](../parameter/global.md) and the result of the expression written by the user. The switch is a conditional judgment node that decides which branch to execute according to the value of the [global variable](../parameter/global.md) and the result of the expression written by the user.
**Note** Execute expressions using javax.script.ScriptEngine.eval.
**Note**: Execute expressions using javax.script.ScriptEngine.eval.
## Create Task ## Create Task
Click Project -> Management-Project -> Name-Workflow Definition, and click the Create Workflow button to enter the DAG editing page. - Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
Drag from the toolbar <img src="../../../../img/switch.png" width="20"/> task node to canvas to create a task. - Drag from the toolbar <img src="../../../../img/switch.png" width="20"/> task node to canvas to create a task.
**Note** After creating a switch task, you must first configure the upstream and downstream tasks, then configure the parameters of the task branches. **Note**: After creating a switch task, you must first configure the upstream and downstream tasks, then configure the parameters of the task branches.
## Parameter ## Task Parameters
- Node name: The node name in a workflow definition is unique. | **Parameter** | **Description** |
- Run flag: Identifies whether this node schedules normally, if it does not need to execute, select the `prohibition execution`. | ------- | ---------- |
- Descriptive information: Describe the function of the node. | Node Name | Set the name of the task. Node names within a workflow definition are unique. |
- Task priority: When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order. | Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
- Worker grouping: Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution. | Description | Describes the function of this node. |
- Environment name: The environment in Security, if not configured, it will not be used. | Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
- Task group name: The group in Resources, if not configured, it will not be used. | Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
- Times of failed retry attempts: The number of times the task failed to resubmit. You can select from drop-down or fill-in a number. | Task group name | The group in Resources, if not configured, it will not be used. |
- Failed retry interval: The time interval for resubmitting the task after a failed task. You can select from drop-down or fill-in a number. | Environment Name | Configure the environment in which to run the script. |
- Delay execution time: Task delay execution time. | Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
- Timeout alarm: Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm email will send and the task execution will fail. | Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
- Condition: You can configure multiple conditions for the switch task. When the conditions are satisfied, execute the configured branch. You can configure multiple different conditions to satisfy different businesses. | Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
- Branch flow: The default branch flow, when all the conditions are not satisfied, execute this branch flow. | Delay execution time | Task delay execution time. |
| Condition | You can configure multiple conditions for the switch task. When the conditions are satisfied, execute the configured branch. You can configure multiple different conditions to satisfy different businesses. |
| Branch flow | The default branch flow, when all the conditions are not satisfied, execute this branch flow. |
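
The `Condition` and `Branch flow` rows can be read as: evaluate each condition against the global variables, run the branch of a satisfied condition, and fall back to the default branch when none is satisfied. The sketch below illustrates that flow under several assumptions: conditions are checked in order and the first match wins, `${name}` placeholders are filled from the global variables, and expressions use Python syntax. The actual task evaluates expressions with `javax.script.ScriptEngine.eval`, so the real expression syntax differs. The function name `pick_branch` is hypothetical.

```python
import re


def pick_branch(conditions, default_branch, global_vars):
    """Return the branch of the first satisfied condition, else the default branch."""
    for expression, branch in conditions:
        # Fill ${name} placeholders with the literal form of each global variable.
        resolved = re.sub(
            r"\$\{(\w+)\}",
            lambda m: repr(global_vars[m.group(1)]),
            expression,
        )
        # Evaluate with no builtins; a sketch only, not a safe sandbox.
        if eval(resolved, {"__builtins__": {}}, {}):
            return branch
    return default_branch


# Hypothetical conditions and branch names.
conditions = [("${n} > 10", "big"), ("${n} > 5", "medium")]
print(pick_branch(conditions, "small", {"n": 7}))
```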
## Task Example ## Task Example

32
docs/docs/en/guide/task/zeppelin.md

@ -7,22 +7,26 @@ it will call `Zeppelin Client API` to trigger zeppelin notebook paragraph. Click
## Create Task ## Create Task
- Click Project Management-Project Name-Workflow Definition, and click the "Create Workflow" button to enter the DAG editing page. - Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag <img src="../../../../img/tasks/icons/zeppelin.png" width="15"/> from the toolbar to the canvas. - Drag <img src="../../../../img/tasks/icons/zeppelin.png" width="15"/> from the toolbar to the canvas.
## Task Parameter ## Task Parameters
- Node name: The node name in a workflow definition is unique. | **Parameter** | **Description** |
- Run flag: Identifies whether this node can be scheduled normally, if it does not need to be executed, you can turn on the prohibition switch. | ------- | ---------- |
- Descriptive information: Describe the function of the node. | Node Name | Set the name of the task. Node names within a workflow definition are unique. |
- Task priority: When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order. | Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
- Worker grouping: Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution. | Description | Describes the function of this node. |
- Number of failed retry attempts: The failure task resubmitting times. It supports drop-down and hand-filling. | Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
- Failed retry interval: The time interval for resubmitting the task after a failed task. It supports drop-down and hand-filling. | Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
- Timeout alarm: Check the timeout alarm and timeout failure. When the task exceeds the "timeout period", an alarm email will send and the task execution will fail. | Task group name | The group in Resources, if not configured, it will not be used. |
- Zeppelin Note ID: The unique note id for a zeppelin notebook note. | Environment Name | Configure the environment in which to run the script. |
- Zeppelin Paragraph ID: The unique paragraph id for a zeppelin notebook paragraph. If you want to schedule a whole note at a time, leave this field blank. | Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
- Zeppelin Parameters: Parameters in json format used for zeppelin dynamic form. | Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| Zeppelin Note ID | The unique note id for a zeppelin notebook note. |
| Zeppelin Paragraph ID | The unique paragraph id for a zeppelin notebook paragraph. If you want to schedule a whole note at a time, leave this field blank. |
| Zeppelin Parameters | Parameters in json format used for zeppelin dynamic form. |
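
The `Zeppelin Parameters` row expects JSON text for the Zeppelin dynamic form. A small sketch of producing such text is shown below; the helper name `build_zeppelin_parameters` and the sample keys are hypothetical, and the assumption that names and values are plain strings is not confirmed by the source.

```python
import json


def build_zeppelin_parameters(values: dict[str, str]) -> str:
    """Serialize dynamic form values to JSON text for the parameter field."""
    for key, value in values.items():
        # Assumption of this sketch: form names and values are plain strings.
        if not isinstance(key, str) or not isinstance(value, str):
            raise TypeError("dynamic form names and values are assumed to be strings")
    return json.dumps(values, sort_keys=True)


# Hypothetical form fields.
print(build_zeppelin_parameters({"name": "world", "day": "2022-06-01"}))
```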
## Task Example ## Task Example
