
[Cherry-pick][Improvement] [Doc] Put default task parameters in a new file #11776 (#11957)

Co-authored-by: Eric Gao <ericgao.apache@gmail.com>
3.1.0-release
caishunfeng 2 years ago committed by GitHub
parent commit 0647b3e10c
  1. docs/configs/docsdev.js (8 changed lines)
  2. docs/docs/en/guide/task/appendix.md (24 changed lines)
  3. docs/docs/en/guide/task/chunjun.md (37 changed lines)
  4. docs/docs/en/guide/task/conditions.md (18 changed lines)
  5. docs/docs/en/guide/task/datax.md (43 changed lines)
  6. docs/docs/en/guide/task/dependent.md (18 changed lines)
  7. docs/docs/en/guide/task/dinky.md (22 changed lines)
  8. docs/docs/en/guide/task/dvc.md (61 changed lines)
  9. docs/docs/en/guide/task/emr.md (28 changed lines)
  10. docs/docs/en/guide/task/flink.md (51 changed lines)
  11. docs/docs/en/guide/task/hive-cli.md (26 changed lines)
  12. docs/docs/en/guide/task/http.md (29 changed lines)
  13. docs/docs/en/guide/task/jupyter.md (57 changed lines)
  14. docs/docs/en/guide/task/kubernetes.md (26 changed lines)
  15. docs/docs/en/guide/task/map-reduce.md (50 changed lines)
  16. docs/docs/en/guide/task/mlflow.md (109 changed lines)
  17. docs/docs/en/guide/task/openmldb.md (27 changed lines)
  18. docs/docs/en/guide/task/pigeon.md (20 changed lines)
  19. docs/docs/en/guide/task/python.md (22 changed lines)
  20. docs/docs/en/guide/task/pytorch.md (27 changed lines)
  21. docs/docs/en/guide/task/sagemaker.md (27 changed lines)
  22. docs/docs/en/guide/task/seatunnel.md (36 changed lines)
  23. docs/docs/en/guide/task/shell.md (21 changed lines)
  24. docs/docs/en/guide/task/spark.md (48 changed lines)
  25. docs/docs/en/guide/task/sql.md (24 changed lines)
  26. docs/docs/en/guide/task/stored-procedure.md (25 changed lines)
  27. docs/docs/en/guide/task/sub-process.md (19 changed lines)
  28. docs/docs/en/guide/task/switch.md (26 changed lines)
  29. docs/docs/en/guide/task/zeppelin.md (40 changed lines)
  30. docs/docs/zh/guide/task/appendix.md (25 changed lines)
  31. docs/docs/zh/guide/task/chunjun.md (29 changed lines)
  32. docs/docs/zh/guide/task/conditions.md (20 changed lines)
  33. docs/docs/zh/guide/task/datax.md (41 changed lines)
  34. docs/docs/zh/guide/task/dependent.md (13 changed lines)
  35. docs/docs/zh/guide/task/dinky.md (20 changed lines)
  36. docs/docs/zh/guide/task/dvc.md (23 changed lines)
  37. docs/docs/zh/guide/task/emr.md (31 changed lines)
  38. docs/docs/zh/guide/task/flink.md (51 changed lines)
  39. docs/docs/zh/guide/task/hive-cli.md (28 changed lines)
  40. docs/docs/zh/guide/task/http.md (31 changed lines)
  41. docs/docs/zh/guide/task/java.md (42 changed lines)
  42. docs/docs/zh/guide/task/jupyter.md (53 changed lines)
  43. docs/docs/zh/guide/task/kubernetes.md (27 changed lines)
  44. docs/docs/zh/guide/task/map-reduce.md (46 changed lines)
  45. docs/docs/zh/guide/task/mlflow.md (83 changed lines)
  46. docs/docs/zh/guide/task/openmldb.md (31 changed lines)
  47. docs/docs/zh/guide/task/pigeon.md (15 changed lines)
  48. docs/docs/zh/guide/task/python.md (21 changed lines)
  49. docs/docs/zh/guide/task/pytorch.md (55 changed lines)
  50. docs/docs/zh/guide/task/sagemaker.md (32 changed lines)
  51. docs/docs/zh/guide/task/seatunnel.md (38 changed lines)
  52. docs/docs/zh/guide/task/shell.md (17 changed lines)
  53. docs/docs/zh/guide/task/spark.md (16 changed lines)
  54. docs/docs/zh/guide/task/sql.md (10 changed lines)
  55. docs/docs/zh/guide/task/stored-procedure.md (12 changed lines)
  56. docs/docs/zh/guide/task/sub-process.md (14 changed lines)
  57. docs/docs/zh/guide/task/switch.md (19 changed lines)
  58. docs/docs/zh/guide/task/zeppelin.md (33 changed lines)

docs/configs/docsdev.js (8 changed lines)

@ -89,6 +89,10 @@ export default {
{
title: 'Task',
children: [
{
title: 'Appendix',
link: '/en-us/docs/dev/user_doc/guide/task/appendix.html',
},
{
title: 'Shell',
link: '/en-us/docs/dev/user_doc/guide/task/shell.html',
@ -713,6 +717,10 @@ export default {
{
title: '任务类型',
children: [
{
title: 'Appendix',
link: '/zh-cn/docs/dev/user_doc/guide/task/appendix.html',
},
{
title: 'Shell',
link: '/zh-cn/docs/dev/user_doc/guide/task/shell.html',

docs/docs/en/guide/task/appendix.md (24 changed lines)

@ -0,0 +1,24 @@
# DolphinScheduler Task Parameters Appendix
DolphinScheduler task plugins share some common default parameters. Each type of task contains all or **some** default parameters as follows:
## Default Task Parameters
| **Parameter** | **Description** |
|--------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Node Name | The name of the task. Node names within the same workflow must be unique. |
| Run Flag | Indicates whether to schedule the task. If you do not need to execute the task, you can turn on the `Prohibition execution` switch. |
| Description | Describes the function of this node. |
| Task Priority | When the number of worker threads is insufficient, the worker executes tasks according to priority. When two tasks have the same priority, the worker executes them in `first come first served` fashion. |
| Worker Group | Machines which execute the tasks. If you choose `default`, scheduler will send the task to a random worker. |
| Task Group Name | Resource group of tasks. It will not take effect if not configured. |
| Environment Name | Environment to execute the task. |
| Number of Failed Retries | The number of task retries for failures. You could select it by drop-down menu or fill it manually. |
| Failure Retry Interval | Interval of task retries for failures. You could select it by drop-down menu or fill it manually. |
| CPU Quota | Assign the specified CPU time quota to the task executed. Takes a percentage value. Default -1 means unlimited. For example, the full CPU load of one core is 100%, and that of 16 cores is 1600%. You could configure it by [task.resource.limit.state](../../architecture/configuration.md). |
| Max Memory | Assign the specified max memory to the task executed. Exceeding this limit triggers an OOM kill and the task will not automatically retry. Takes an MB value. Default -1 means unlimited. You could configure it by [task.resource.limit.state](../../architecture/configuration.md). |
| Timeout Alarm | Alarm for task timeout. When the task exceeds the "timeout threshold", an alarm email will be sent. |
| Delayed Execution Time | The time that a task delays for execution in minutes. |
| Resources | Resources which your task node uses. |
| Predecessor Task | The upstream task of the current task node. |

docs/docs/en/guide/task/chunjun.md (37 changed lines)

@ -11,26 +11,16 @@ ChunJun task type for executing ChunJun programs. For ChunJun nodes, the worker
## Task Parameters
| **Parameter** | **Description** |
| ------- | ---------- |
| Node name | The node name in a workflow definition is unique. |
| Run flag | Identifies whether this node schedules normally, if it does not need to execute, select the prohibition execution. |
| Task priority | When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order. |
| Description | Describe the function of the node. |
| Worker group | Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution. |
| Environment Name | Configure the environment name in which run the script. |
| Number of failed retries | The number of times the task failed to resubmit. |
| Failed retry interval | The time interval (unit minute) for resubmitting the task after a failed task. |
| Task group name | The task group name. |
| Priority | The task priority. |
| Delayed execution time | The time, in minutes, that a task is delayed in execution. |
| Timeout alarm | Check the timeout alarm and timeout failure. When the task exceeds the "timeout period", an alarm email will be sent and the task execution will fail. |
| Custom template | Custom the content of the ChunJun node's json profile. |
| json | json configuration file for ChunJun synchronization. |
| Custom parameters | It is a user-defined parameter, and will replace the content with `${variable}` in the script.
| Deploy mode | Execute chunjun task mode, eg local standalone. |
| Option Parameters | Support such as `-confProp "{\"flink.checkpoint.interval\":60000}"` |
| Predecessor task | Selecting a predecessor task for the current task will set the selected predecessor task as upstream of the current task. |
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) for default parameters.
| **Parameter** | **Description** |
|-------------------|---------------------------------------------------------------------------------------------------------------------------|
| Custom template   | Customize the content of the ChunJun node's json configuration file.                                                       |
| json              | The json configuration file for ChunJun synchronization.                                                                   |
| Custom parameters | User-defined parameters that replace `${variable}` placeholders in the script.                                             |
| Deploy mode       | The ChunJun task execution mode, e.g. `local` or `standalone`.                                                             |
| Option Parameters | Supports options such as `-confProp "{\"flink.checkpoint.interval\":60000}"`.                                              |
| Predecessor task | Selecting a predecessor task for the current task will set the selected predecessor task as upstream of the current task. |
## Task Example
@ -58,9 +48,9 @@ After writing the required json file, you can configure the node content by foll
### Note
Before executing `${CHUNJUN_HOME}/bin/start-chunjun`, you need to modify that shell script and remove the trailing '&' so that the process runs in the foreground, for example:
```shell
nohup $JAVA_RUN -cp $JAR_DIR $CLASS_NAME $@ &
@ -70,4 +60,5 @@ update to following:
```shell
nohup $JAVA_RUN -cp $JAR_DIR $CLASS_NAME $@
```
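If you prefer to script that change, the following is only a minimal sketch (it assumes the launch line ends with '&' and keeps a backup of the original script), not a step required by the plugin:

```shell
# Back up the launcher, then strip the trailing '&' so ChunJun runs in the foreground
cp ${CHUNJUN_HOME}/bin/start-chunjun ${CHUNJUN_HOME}/bin/start-chunjun.bak
sed -i 's/ &$//' ${CHUNJUN_HOME}/bin/start-chunjun
```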

docs/docs/en/guide/task/conditions.md (18 changed lines)

@ -9,19 +9,11 @@ Condition is a conditional node, that determines which downstream task should ru
## Task Parameters
| **Parameter** | **Description** |
| -------- |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| Downstream tasks selection | Depending on the status of the predecessor task, you can jump to the corresponding branch, currently two branches are supported: success, failure <ul><li>Success: When the upstream task runs successfully, run the success branch.</li><li>Failure: When the upstream task runs failed, run the failure branch.</li></ul></li></ul> |
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) for default parameters.
| **Parameter** | **Description** |
|------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Downstream tasks selection   | Depending on the status of the predecessor task, you can jump to the corresponding branch; currently two branches are supported: success and failure. <ul><li>Success: When the upstream task runs successfully, run the success branch.</li><li>Failure: When the upstream task fails, run the failure branch.</li></ul> |
| Upstream condition selection | Can select one or more upstream tasks for conditions.<ul><li>Add an upstream dependency: the first parameter chooses a specified task name, and the second parameter chooses the upstream task status that triggers the condition.</li><li>Select upstream task relationship: use `and` and `or` operators to handle complex relationships when there are multiple upstream tasks for conditions.</li></ul> |
## Related Task

docs/docs/en/guide/task/datax.md (43 changed lines)

@ -11,33 +11,22 @@ DataX task type for executing DataX programs. For DataX nodes, the worker will e
## Task Parameters
| **Parameter** | **Description** |
| ------- | ---------- |
| Node name | The node name in a workflow definition is unique. |
| Run flag | Identifies whether this node schedules normally, if it does not need to execute, select the prohibition execution. |
| Task priority | When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order. |
| Description | Describe the function of the node. |
| Worker group | Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution. |
| Environment Name | Configure the environment name in which run the script. |
| Number of failed retries | The number of times the task failed to resubmit. |
| Failed retry interval | The time interval (unit minute) for resubmitting the task after a failed task. |
| Cpu quota | Assign the specified CPU time quota to the task executed. Takes a percentage value. Default -1 means unlimited. For example, the full CPU load of one core is 100%,and that of 16 cores is 1600%. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md) |
| Max memory | Assign the specified max memory to the task executed. Exceeding this limit will trigger oom to be killed and will not automatically retry. Takes an MB value. Default -1 means unlimited. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md) |
| Delayed execution time | The time, in cents, that a task is delayed in execution. |
| Timeout alarm | Check the timeout alarm and timeout failure. When the task exceeds the "timeout period", an alarm email will be sent and the task execution will fail. |
| Custom template | Custom the content of the DataX node's json profile when the default data source provided does not meet the required requirements. |
| json | json configuration file for DataX synchronization. |
| Resource | When using custom json, if the cluster has kerberos authentication enabled, and datax needs to use the relevant keytab, xml file, etc. when reading or writing plug-ins such as hdfs and hbase, you can use this option. and the files uploaded or created in Resource Center - File Management.|
| Custom parameters | SQL task type, and stored procedure is a custom parameter order to set values for the method. The custom parameter type and data type are the same as the stored procedure task type. The difference is that the SQL task type custom parameter will replace the \${variable} in the SQL statement. |
| Data source | Select the data source from which the data will be extracted. |
| sql statement | the sql statement used to extract data from the target database, the sql query column name is automatically parsed when the node is executed, and mapped to the target table synchronization column name. When the source table and target table column names are inconsistent, they can be converted by column alias. |
| Target library | Select the target library for data synchronization. |
| Pre-sql | Pre-sql is executed before the sql statement (executed by the target library). |
| Post-sql | Post-sql is executed after the sql statement (executed by the target library). |
| Stream limit (number of bytes) | Limits the number of bytes in the query. |
| Limit flow (number of records) | Limit the number of records for a query. |
| Running memory | the minimum and maximum memory required can be configured to suit the actual production environment. |
| Predecessor task | Selecting a predecessor task for the current task will set the selected predecessor task as upstream of the current task. |
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) for default parameters.
| **Parameter** | **Description** |
|--------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Custom template                | Customize the content of the DataX node's json configuration file when the provided data sources do not meet your requirements.                                                                                                                                                                 |
| json                           | The json configuration file for DataX synchronization.                                                                                                                                                                                                                                           |
| Resource                       | When using a custom json, if the cluster has Kerberos authentication enabled and DataX needs keytab or xml files to read or write plug-ins such as HDFS and HBase, you can reference here the files uploaded or created in Resource Center - File Management.                                    |
| Custom parameters              | Custom parameters work like those of the SQL and stored procedure task types: the parameter and data types are the same as in the stored procedure task type, and the custom parameter replaces `${variable}` placeholders in the SQL statement.                                                |
| Data source                    | Select the data source from which the data will be extracted.                                                                                                                                                                                                                                    |
| sql statement                  | The SQL statement used to extract data from the target database. The query column names are parsed automatically when the node is executed and mapped to the target table columns; when the source and target column names differ, they can be converted with column aliases.                   |
| Target library | Select the target library for data synchronization. |
| Pre-sql | Pre-sql is executed before the sql statement (executed by the target library). |
| Post-sql | Post-sql is executed after the sql statement (executed by the target library). |
| Stream limit (number of bytes) | Limits the number of bytes in the query. |
| Limit flow (number of records) | Limit the number of records for a query. |
| Running memory                 | The minimum and maximum memory required, which can be configured to suit the actual production environment.                                                                                                                                                                                      |
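For orientation, once the json profile is in place the worker essentially hands it to the DataX launcher; a rough manual equivalent (the paths are assumptions, not what the worker literally runs) would be:

```shell
# Run a DataX synchronization job directly from a hand-written json profile
python ${DATAX_HOME}/bin/datax.py /tmp/datax_job.json
```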
## Task Example

docs/docs/en/guide/task/dependent.md (18 changed lines)

@ -4,7 +4,6 @@
Dependent nodes are **dependency check nodes**. For example, process A depends on the successful execution of process B from yesterday, and the dependent node checks whether process B ran successfully yesterday.
## Create Task
- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
@ -12,20 +11,11 @@ Dependent nodes are **dependency check nodes**. For example, process A depends o
## Task Parameter
| **Parameter** | **Description** |
| ----- | -----------|
| Node name | Unique name of node in workflow definition. |
| Run flag | Identifies whether this node schedules normally. |
| Description | Describe the function of the node. |
| Task priority | When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order. |
| Worker group | Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment name in which run the script. |
| Number of failed retries | The number of times the task failed to resubmit. |
| Failed retry interval | The time interval (unit minute) for resubmitting the task after a failed task. |
| Delayed execution time | The time (unit minute) that a task delays in execution. |
| Pre task | Selecting a predecessor task for the current task, will set the selected predecessor task as upstream of the current task. |
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) for default parameters.
| **Parameter** | **Description** |
|------------------|---------------------------------------------|
| Predecessor Task | The upstream task of the current task node. |
## Task Examples

docs/docs/en/guide/task/dinky.md (22 changed lines)

@ -12,21 +12,13 @@ it will call `Dinky API` to trigger dinky task. Click [here](http://www.dlink.to
## Task Parameter
| **Parameter** | **Description** |
| ------- | ---------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| Dinky Address | The url for a dinky server. |
| Dinky Task ID | The unique task id for a dinky task. |
| Online Task | Specify whether the current dinky job is online. If yes, the submitted job can only be submitted successfully when it is published and there is no corresponding Flink job instance running. |
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) for default parameters.
| **Parameter** | **Description** |
|---------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Dinky Address | The url for a dinky server. |
| Dinky Task ID | The unique task id for a dinky task. |
| Online Task   | Specify whether the current Dinky job is online. If yes, the job can only be submitted successfully when it has been published and there is no corresponding Flink job instance running.       |
## Task Example

docs/docs/en/guide/task/dvc.md (61 changed lines)

@ -18,31 +18,19 @@ The plugin provides the following three functions:
DAG editing page.
- Drag from the toolbar <img src="../../../../img/tasks/icons/dvc.png" width="15"/> task node to canvas.
## Task Example
First, introduce some general parameters of DolphinScheduler:
- **Node name**: The node name in a workflow definition is unique.
- **Run flag**: Identifies whether this node schedules normally, if it does not need to execute, select
the `prohibition execution`.
- **Descriptive information**: Describe the function of the node.
- **Task priority**: When the number of worker threads is insufficient, execute in the order of priority from high
to low, and tasks with the same priority will execute in a first-in first-out order.
- **Worker grouping**: Assign tasks to the machines of the worker group to execute. If `Default` is selected,
randomly select a worker machine for execution.
- **Environment Name**: Configure the environment name in which run the script.
- **Times of failed retry attempts**: The number of times the task failed to resubmit.
- **Failed retry interval**: The time interval (unit minute) for resubmitting the task after a failed task.
- **Delayed execution time**: The time (unit minute) that a task delays in execution.
- **Timeout alarm**: Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm
email will send and the task execution will fail.
- **Predecessor task**: Selecting a predecessor task for the current task, will set the selected predecessor task as
upstream of the current task.
Here are some specific parameters for the DVC plugin:
- **DVC Task Type** :Upload, Download or Init DVC。
- **DVC Repository** :The DVC repository address associated with the task execution.
## Task Parameters
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) for default parameters.
| **Parameter** | **Description** |
|-----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| DVC Task Type               | Upload, Download or Init DVC.                                                                                                                                                                                 |
| DVC Repository | The DVC repository address associated with the task execution. |
| Remote Store Url | The actual data is stored at the address. You can learn about the supported storage types from the [DVC supported storage types](https://dvc.org/doc/command-reference/remote/add#supported-storage-types). |
| Data Path in DVC Repository | The path which the task uploads /downloads data to in the repository. |
| Data Path In Worker         | The data path to be uploaded, or the path where the data is saved after it is downloaded to the local worker.                                                                                                |
| Version | After the data is uploaded, the version tag for the data will be saved in `git tag`. / The version of the data to download. |
| Version Message | Version Message. |
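The three task types roughly correspond to plain `dvc`/`git` operations. The following is only an illustrative sketch (repository layout, data path and version tag are assumptions), not the exact commands the plugin executes:

```shell
# Init DVC: turn an existing git repository into a DVC project and bind a remote store
dvc init
dvc remote add -d origin ~/dvc

# Upload: track the data, record it in git, push it to the remote store and tag the version
dvc add iris
git add iris.dvc .gitignore && git commit -m "add iris"
dvc push
git tag iris_1.0

# Download: check out a version tag and pull the matching data
git checkout iris_1.0
dvc pull
```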
### Init DVC
@ -54,11 +42,7 @@ The data is not actually stored in a Git repository, but somewhere else, and DVC
![dvc_init](../../../../img/tasks/demo/dvc_init.png)
**Task Parameter**
- **Remote Store Url** :The actual data is stored at the address. You can learn about the supported storage types from the [DVC supported storage types](https://dvc.org/doc/command-reference/remote/add#supported-storage-types) .
The example above shows that:
Initialize repository `git@github.com:<YOUR-NAME-OR-ORG>/dvc-data-repository-example.git` as a DVC project and bind the remote storage address to `~/dvc`
### Upload
@ -67,13 +51,6 @@ Used to upload and update data and record version numbers.
![dvc_upload](../../../../img/tasks/demo/dvc_upload.png)
**Task Parameter**
- **Data Path in DVC Repository** :The data will be uploaded to this path in the repository.
- **Data Path In Worker** :Data path to be uploaded.
- **Version** :After the data is uploaded, the version tag for the data will be saved in `git tag`.
- **Version Message** :Version Message.
The example above shows that:
Upload data `/home/data/iris` to the root directory of repository `git@github.com:<YOUR-NAME-OR-ORG>/dvc-data-repository-example.git`. The file or folder of data is named `iris`.
@ -86,12 +63,6 @@ Used to download data for a specific version.
![dvc_download](../../../../img/tasks/demo/dvc_download.png)
**Task Parameter**
- **Data Path in DVC Repository** :The path to the data to download in the DVC repository.
- **Data Path In Worker** :Path for saving data after the file is downloaded to the local.
- **Version** :The version of the data to download.
The example above shows that:
Download the data for iris data at version `iris_1.0` in repository `git@github.com:<YOUR-NAME-OR-ORG>/dvc-data-repository-example.git` to the `~/dvc_test/iris`
@ -115,11 +86,11 @@ which dvc
You need to enter the admin account to configure a conda environment variable(Please
install [anaconda](https://docs.continuum.io/anaconda/install/)
or [miniconda](https://docs.conda.io/en/latest/miniconda.html#installing) in advance).
![dvc_env_config](../../../../img/tasks/demo/dvc_env_config.png)
Note: during task configuration, select the conda environment created above; otherwise, the program cannot find the conda environment.
![dvc_env_name](../../../../img/tasks/demo/dvc_env_name.png)
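A minimal sketch of preparing such a conda environment on the worker (the environment name and python version are assumptions):

```shell
# Create a dedicated conda environment with dvc installed, then verify it is on the PATH
conda create -n dvc_env python=3.9 -y
conda activate dvc_env
pip install dvc
which dvc
```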

docs/docs/en/guide/task/emr.md (28 changed lines)

@ -2,7 +2,7 @@
## Overview
Amazon EMR task type, for operating EMR clusters on AWS and running computing tasks.
Using [aws-java-sdk](https://aws.amazon.com/cn/sdk-for-java/) in the background code to transfer JSON parameters to the task object and submit it to AWS. Two program types are currently supported:
* `RUN_JOB_FLOW` uses [API_RunJobFlow](https://docs.aws.amazon.com/emr/latest/APIReference/API_RunJobFlow.html#API_RunJobFlow_Examples) to submit a [RunJobFlowRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticmapreduce/model/RunJobFlowRequest.html) object
@ -15,26 +15,23 @@ Using [aws-java-sdk](https://aws.amazon.com/cn/sdk-for-java/) in the background
## Task Parameters
| **Parameter** | **Description** |
| ------- | ---------- |
| Node name | The node name in a workflow definition is unique. |
| Run flag | Identifies whether this node schedules normally, if it does not need to execute, select the `prohibition execution`.|
| Description | Describe the function of the node. |
| Task priority | When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order. |
| Worker grouping | Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution. |
| Times of failed retry attempts | The number of times the task failed to resubmit. You can select from drop-down or fill-in a number. |
| Failed retry interval | The time interval for resubmitting the task after a failed task. You can select from drop-down or fill-in a number. |
| Timeout alarm | Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm email will send and the task execution will fail. |
| Program Type | Select the program type. If it is `RUN_JOB_FLOW`, you need to fill in `jobFlowDefineJson`, if it is `ADD_JOB_FLOW_STEPS`, you need to fill in `stepsDefineJson`. |
| jobFlowDefineJson | JSON corresponding to the [RunJobFlowRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticmapreduce/model/RunJobFlowRequest.html) object, for details refer to [API_RunJobFlow_Examples](https://docs.aws.amazon.com/emr/latest/APIReference/API_RunJobFlow.html#API_RunJobFlow_Examples). |
| stepsDefineJson | JSON corresponding to the [AddJobFlowStepsRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticmapreduce/model/AddJobFlowStepsRequest.html) object, for details refer to [API_AddJobFlowSteps_Examples](https://docs.aws.amazon.com/emr/latest/APIReference/API_AddJobFlowSteps.html#API_AddJobFlowSteps_Examples). |
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) for default parameters.
| **Parameter** | **Description** |
|-------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Program Type | Select the program type. If it is `RUN_JOB_FLOW`, you need to fill in `jobFlowDefineJson`, if it is `ADD_JOB_FLOW_STEPS`, you need to fill in `stepsDefineJson`. |
| jobFlowDefineJson | JSON corresponding to the [RunJobFlowRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticmapreduce/model/RunJobFlowRequest.html) object, for details refer to [API_RunJobFlow_Examples](https://docs.aws.amazon.com/emr/latest/APIReference/API_RunJobFlow.html#API_RunJobFlow_Examples). |
| stepsDefineJson | JSON corresponding to the [AddJobFlowStepsRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticmapreduce/model/AddJobFlowStepsRequest.html) object, for details refer to [API_AddJobFlowSteps_Examples](https://docs.aws.amazon.com/emr/latest/APIReference/API_AddJobFlowSteps.html#API_AddJobFlowSteps_Examples). |
## Task Example
### Create an EMR cluster and run Steps
This example shows how to create an `EMR` task node of type `RUN_JOB_FLOW`. Taking the execution of `SparkPi` as an example, the task will create an `EMR` cluster and execute the `SparkPi` sample program.
![RUN_JOB_FLOW](../../../../img/tasks/demo/emr_run_job_flow.png)
jobFlowDefineJson example
```json
{
"Name": "SparkPi",
@ -76,11 +73,13 @@ jobFlowDefineJson example
```
### Add a Step to a Running EMR Cluster
This example shows how to create an `EMR` task node of type `ADD_JOB_FLOW_STEPS`. Taking the execution of `SparkPi` as an example, the task will add a `SparkPi` sample program to the running `EMR` cluster.
![ADD_JOB_FLOW_STEPS](../../../../img/tasks/demo/emr_add_job_flow_steps.png)
![JobFlowId](../../../../img/tasks/demo/emr_jobFlowId.png)
stepsDefineJson example
```json
{
"JobFlowId": "j-3V628TKAERHP8",
@ -105,3 +104,4 @@ stepsDefineJson example
- Failover for the EMR task type has not been implemented yet. At this time, DolphinScheduler only supports failover for YARN task types; other task types, such as EMR and K8S tasks, are not ready yet.
- A `stepsDefineJson` task definition only supports associating a single step, which better ensures the reliability of the task state.

docs/docs/en/guide/task/flink.md (51 changed lines)

@ -15,36 +15,26 @@ Flink task type, used to execute Flink programs. For Flink nodes:
## Task Parameters
| **Parameter** | **Description** |
|--------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Node name | The node name in a workflow definition is unique. |
| Run flag | Identifies whether this node schedules normally, if it does not need to execute, select the `prohibition execution`. |
| Description | Describe the function of the node. |
| Task priority | When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order. |
| Worker grouping | Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution. |
| Environment Name | Configure the environment name in which run the script. |
| Times of failed retry attempts | The number of times the task failed to resubmit. |
| Failed retry interval | The time interval (unit minute) for resubmitting the task after a failed task. |
| Delayed execution time | The time (unit minute) that a task delays in execution. |
| Timeout alarm | Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm email will send and the task execution will fail. |
| Program type | Support Java, Scala, Python and SQL four languages. |
| Class of main function | The **full path** of Main Class, the entry point of the Flink program. |
| Main jar package | The jar package of the Flink program (upload by Resource Center). |
| Deployment mode | Support 3 deployment modes: cluster, local and application (Flink 1.11 and later. See also [Run an application in Application Mode](https://nightlies.apache.org/flink/flink-docs-release-1.11/ops/deployment/yarn_setup.html#run-an-application-in-application-mode)). |
| Initialization script | Script file to initialize session context. |
| Script | The sql script file developed by the user that should be executed. |
| Flink version | Select version according to the execution environment. |
| Task name | Flink task name. |
| JobManager memory size | Used to set the size of jobManager memories, which can be set according to the actual production environment. |
| Number of slots | Used to set the number of slots, which can be set according to the actual production environment. |
| TaskManager memory size | Used to set the size of taskManager memories, which can be set according to the actual production environment. |
| Number of TaskManager | Used to set the number of taskManagers, which can be set according to the actual production environment. |
| Parallelism | Used to set the degree of parallelism for executing Flink tasks. |
| Main program parameters | Set the input parameters for the Flink program and support the substitution of custom parameter variables. |
| Optional parameters | Support `--jar`, `--files`,` --archives`, `--conf` format. |
| Resource | Appoint resource files in the `Resource` if parameters refer to them. |
| Custom parameter | It is a local user-defined parameter for Flink, and will replace the content with `${variable}` in the script. |
| Predecessor task | Selecting a predecessor task for the current task, will set the selected predecessor task as upstream of the current task. |
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) for default parameters.
| **Parameter** | **Description** |
|-------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Program type            | Supports four languages: Java, Scala, Python and SQL.                                                                                                                                                                                                                   |
| Class of main function | The **full path** of Main Class, the entry point of the Flink program. |
| Main jar package | The jar package of the Flink program (upload by Resource Center). |
| Deployment mode | Support 3 deployment modes: cluster, local and application (Flink 1.11 and later. See also [Run an application in Application Mode](https://nightlies.apache.org/flink/flink-docs-release-1.11/ops/deployment/yarn_setup.html#run-an-application-in-application-mode)). |
| Initialization script | Script file to initialize session context. |
| Script | The sql script file developed by the user that should be executed. |
| Flink version | Select version according to the execution environment. |
| Task name | Flink task name. |
| JobManager memory size | Used to set the size of jobManager memories, which can be set according to the actual production environment. |
| Number of slots | Used to set the number of slots, which can be set according to the actual production environment. |
| TaskManager memory size | Used to set the size of taskManager memories, which can be set according to the actual production environment. |
| Number of TaskManager | Used to set the number of taskManagers, which can be set according to the actual production environment. |
| Parallelism | Used to set the degree of parallelism for executing Flink tasks. |
| Main program parameters | Set the input parameters for the Flink program and support the substitution of custom parameter variables. |
| Optional parameters     | Supports options in `--jar`, `--files`, `--archives`, `--conf` format.                                                                                                                                                                                                 |
| Custom parameter | It is a local user-defined parameter for Flink, and will replace the content with `${variable}` in the script. |
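For orientation, in cluster mode these fields translate roughly into a plain `flink run` submission; the following is only a hedged sketch with a placeholder class, jar and sizes, not the exact command the worker builds:

```shell
# -c = class of main function, -p = parallelism, -yjm/-ytm = JobManager/TaskManager memory,
# trailing arguments = main program parameters
flink run -m yarn-cluster -c org.example.WordCount -p 2 \
  -yjm 1024m -ytm 2048m ./WordCount.jar --input /tmp/words.txt
```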
## Task Example
@ -83,3 +73,4 @@ Configure the required content according to the parameter descriptions above.
- JAVA and Scala are only used for identification, and there is no difference between them. If you use Python to develop Flink, there is no class of the main function and the rest is the same.
- Use SQL to execute Flink SQL tasks; currently only Flink 1.13 and above is supported.

docs/docs/en/guide/task/hive-cli.md (26 changed lines)

@ -24,24 +24,14 @@ You could choose between these two based on your needs.
## Task Parameters
| **Parameter** | **Description** |
|------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Node Name | The name of the task. Node names within the same workflow must be unique. |
| Run Flag | Indicating whether to schedule the task. If you do not need to execute the task, you can turn on the `Prohibition execution` switch. |
| Description | Describing the function of this node. |
| Task Priority | When the number of the worker threads is insufficient, the worker executes task according to the priority. When two tasks have the same priority, the worker will execute them in `first come first served` fashion. |
| Worker Group | Machines which execute the tasks. If you choose `default`, scheduler will send the task to a random worker. |
| Task Group Name | Resource group of tasks. It will not take effect if not configured. |
| Environment Name | Environment to execute the task. |
| Number of Failed Retries | The number of task retries for failures. You could select it by drop-down menu or fill it manually. |
| Failure Retry Interval | Interval of task retries for failures. You could select it by drop-down menu or fill it manually. |
| CPU Quota | Assign the specified CPU time quota to the task executed. Takes a percentage value. Default -1 means unlimited. For example, the full CPU load of one core is 100%, and that of 16 cores is 1600%. You could configure it by [task.resource.limit.state](../../architecture/configuration.md). |
| Max Memory | Assign the specified max memory to the task executed. Exceeding this limit will trigger oom to be killed and will not automatically retry. Takes an MB value. Default -1 means unlimited. You could configure it by [task.resource.limit.state](../../architecture/configuration.md). |
| Timeout Alarm | Alarm for task timeout. When the task exceeds the "timeout threshold", an alarm email will send. |
| Hive Cli Task Execution Type | The type of hive cli task execution, choose either `FROM_SCRIPT` or `FROM_FILE`. |
| Hive SQL Script | If you choose `FROM_SCRIPT` for `Hive Cli Task Execution Type`, you need to fill in your SQL script. |
| Hive Cli Options | Extra options for hive cli, such as `--verbose` |
| Resources | If you choose `FROM_FILE` for `Hive Cli Task Execution Type`, you need to select your SQL file. |
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) for default parameters.
| **Parameter** | **Description** |
|------------------------------|------------------------------------------------------------------------------------------------------|
| Hive Cli Task Execution Type | The type of hive cli task execution, choose either `FROM_SCRIPT` or `FROM_FILE`. |
| Hive SQL Script | If you choose `FROM_SCRIPT` for `Hive Cli Task Execution Type`, you need to fill in your SQL script. |
| Hive Cli Options | Extra options for hive cli, such as `--verbose` |
| Resources | If you choose `FROM_FILE` for `Hive Cli Task Execution Type`, you need to select your SQL file. |
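Conceptually, the two execution types correspond to the two standard ways of invoking the `hive` client; a hedged sketch (query and file name are placeholders):

```shell
# FROM_SCRIPT: the SQL script filled into the task is passed inline
hive --verbose -e "SELECT COUNT(*) FROM tmp_table;"

# FROM_FILE: the SQL file selected from Resources is passed as a file
hive --verbose -f ./hive_query.sql
```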
## Task Example

docs/docs/en/guide/task/http.md (29 changed lines)

@ -6,28 +6,19 @@ This node is used to perform http type tasks such as the common POST and GET req
## Create Task
- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag the <img src="../../../../img/tasks/icons/http.png" width="15"/> from the toolbar to the drawing board.
## Task Parameters
| **Parameter** | **Description** |
| ------- | ---------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| Request address | HTTP request URL. |
| Request type | Supports GET, POSt, HEAD, PUT, DELETE. || Request parameters |Supports Parameter, Body, Headers. || Verification conditions | Supports default response code, custom response code, content included, content not included.|
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) for default parameters.
| **Parameter** | **Description** |
|-------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|
| Request address         | HTTP request URL.                                                                                                                               |
| Request type            | Supports GET, POST, HEAD, PUT and DELETE.                                                                                                       |
| Request parameters      | Supports Parameter, Body and Headers.                                                                                                           |
| Verification conditions | Supports default response code, custom response code, content included and content not included.                                               |
| Verification content    | Required when the verification condition is custom response code, content included, or content not included.                                   |
| Custom parameter | It is a user-defined parameter of http part, which will replace the content with `${variable}` in the script. |
| Pre tasks | Selecting a predecessor task for the current task will set the selected predecessor task as upstream of the current task. |
## Example
@ -37,8 +28,8 @@ The main configuration parameters are as follows:
- URL: Address to access the target resource. Here is the system's login page.
- HTTP Parameters:
  - userName: Username
  - userPassword: User login password
![http_task](../../../../img/tasks/demo/http_task01.png)
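To get a feel for what this node does, a rough curl equivalent of the example above could look as follows; the host, port, URL path and credentials are placeholders for illustration, not the actual service address:

```shell
# POST the login form parameters to the target URL and inspect the response code and body
curl -i -X POST "http://<dolphinscheduler-host>:<port>/dolphinscheduler/login" \
  -d "userName=admin" -d "userPassword=<password>"
```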

docs/docs/en/guide/task/jupyter.md (57 changed lines)

@ -6,21 +6,20 @@ Use `Jupyter Task` to create a jupyter-type task and execute jupyter notes. When
it will use `papermill` to evaluate jupyter notes. Click [here](https://papermill.readthedocs.io/en/latest/) for details about `papermill`.
## Conda Configuration
- Config `conda.path` in `common.properties` to the path of your `conda.sh`, which should be the same `conda` you use to manage the python environment of your `papermill` and `jupyter`.
Click [here](https://docs.conda.io/en/latest/) for more information about `conda`.
- `conda.path` is set to `/opt/anaconda3/etc/profile.d/conda.sh` by default. If you have no idea where your `conda` is, simply run `conda info | grep -i 'base environment'`.
> NOTE: `Jupyter Task Plugin` uses `source` command to activate conda environment.
> If your tenant does not have permission to use `source`, `Jupyter Task Plugin` will not function.
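As a hedged illustration (the paths shown are the defaults mentioned above, not necessarily yours), locating and setting `conda.path` could look like:

```shell
# Find the base conda installation; conda.sh lives under etc/profile.d of that prefix
conda info | grep -i 'base environment'
# Then, in conf/common.properties on the worker, point conda.path at it:
#   conda.path=/opt/anaconda3/etc/profile.d/conda.sh
```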
## Python Dependency Management
### Use Pre-Installed Conda Environment
1. Create a conda environment manually or using `shell task` on your target worker.
2. In your `jupyter task`, set `condaEnvName` as the name of the conda environment you just created.
### Use Packed Conda Environment
@ -29,7 +28,7 @@ Click [here](https://docs.conda.io/en/latest/) for more information about `conda
3. Set `condaEnvName` as the name of your packed conda environment in your `jupyter task`, e.g. `jupyter_env.tar.gz`.
4. Select your packed conda environment as `resource` in your `jupyter task`, e.g. `jupyter_env.tar.gz`.
> NOTE: Make sure you follow the [Conda-Pack](https://conda.github.io/conda-pack/) official instructions.
> If you unpack your packed conda environment, the directory structure should be the same as below:
```
@ -41,11 +40,11 @@ Click [here](https://docs.conda.io/en/latest/) for more information about `conda
├── lib
├── share
└── ssl
```
> NOTICE: Please follow the `conda pack` instructions above strictly, and DO NOT modify `bin/activate`.
> `Jupyter Task Plugin` uses `source` command to activate your packed conda environment.
> If you are concerned about using `source`, choose other options to manage your python dependency.
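A hedged sketch of producing such a tarball with conda-pack (the environment name, python version and packages are assumptions):

```shell
# Build an environment with jupyter and papermill, then pack it for upload to the Resource Center
conda create -n jupyter_env python=3.9 -y
conda run -n jupyter_env pip install jupyter papermill
pip install conda-pack
conda pack -n jupyter_env -o jupyter_env.tar.gz
```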
### Construct From Requirements
@ -53,7 +52,7 @@ Click [here](https://docs.conda.io/en/latest/) for more information about `conda
2. Set `condaEnvName` as the name of your file of requirements in your `jupyter task`, e.g. `requirements.txt`.
3. Select your file of requirements as `resource` in your `jupyter task`, e.g. `requirements.txt`.
Here is an example file of requirements, from which `jupyter task plugin` will automatically
construct your python dependencies, run your python code and finally tear down the environment:
```text
@ -94,7 +93,7 @@ packaging==21.3
pandas==1.4.2
pandocfilters==1.5.0
papermill==2.3.4
```
## Create Task
@ -103,29 +102,19 @@ papermill==2.3.4
## Task Parameters
| **Parameter** | **Description** |
| ------- | ---------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Cpu quota | Assign the specified CPU time quota to the task executed. Takes a percentage value. Default -1 means unlimited. For example, the full CPU load of one core is 100%,and that of 16 cores is 1600%. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md). |
| Max memory | Assign the specified max memory to the task executed. Exceeding this limit will trigger oom to be killed and will not automatically retry. Takes an MB value. Default -1 means unlimited. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md). |
| Timeout alarm | Check the timeout alarm and timeout failure. When the task exceeds the "timeout period", an alarm email will send and the task execution will fail. |
| Conda Env Name | Name of conda environment or packed conda environment tarball. |
|Input Note Path | Path of input jupyter note template. |
| Out Note Path | Path of output note. |
| Jupyter Parameters | Parameters in json format used for jupyter note parameterization. |
| Kernel | Jupyter notebook kernel. |
| Engine | Engine to evaluate jupyter notes. |
| Jupyter Execution Timeout | Timeout set for each jupyter notebook cell. |
| Jupyter Start Timeout | Timeout set for jupyter notebook kernel. |
| Others | Other command options for papermill. |
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) for default parameters.
| **Parameter** | **Description** |
|---------------------------|-------------------------------------------------------------------|
| Conda Env Name | Name of conda environment or packed conda environment tarball. |
| Input Note Path | Path of input jupyter note template. |
| Out Note Path | Path of output note. |
| Jupyter Parameters | Parameters in json format used for jupyter note parameterization. |
| Kernel | Jupyter notebook kernel. |
| Engine | Engine to evaluate jupyter notes. |
| Jupyter Execution Timeout | Timeout set for each jupyter notebook cell. |
| Jupyter Start Timeout | Timeout set for jupyter notebook kernel. |
| Others | Other command options for papermill. |
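These fields map fairly directly onto a papermill invocation. A rough sketch with placeholder note paths and parameters (the plugin also appends kernel, engine and timeout options from the corresponding fields):

```shell
# Roughly what gets built under the hood: evaluate the input note with injected parameters into an output note
papermill input_note.ipynb output_note.ipynb -p city hangzhou -p num 3 -k python3
```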
## Task Example

docs/docs/en/guide/task/kubernetes.md (26 changed lines)

@ -11,25 +11,15 @@ K8S task type used to execute a batch task. In this task, the worker submits the
## Task Parameters
| **Parameter** | **Description** |
| ------- | ---------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| Namespace | The namespace for running k8s task. |
| Min CPU | Minimum CPU requirement for running k8s task. |
| Min Memory | Minimum memory requirement for running k8s task. |
| Image | The registry url for image. |
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) for default parameters.
| **Parameter** | **Description** |
|------------------|------------------------------------------------------------------------------------------------------------------|
| Namespace | The namespace for running k8s task. |
| Min CPU | Minimum CPU requirement for running k8s task. |
| Min Memory | Minimum memory requirement for running k8s task. |
| Image | The registry url for image. |
| Custom parameter | It is a local user-defined parameter for K8S task, these params will pass to container as environment variables. |
| Predecessor task | Selecting a predecessor task for the current task, will set the selected predecessor task as upstream of the current task. |
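For intuition only, the sketch below uses the Kubernetes Python client to describe the kind of Job these parameters characterize. The namespace, image, names and values are placeholders, and this is not how the DolphinScheduler worker itself constructs the Job.

```python
from kubernetes import client

# Illustrative only: placeholder namespace, image and values.
container = client.V1Container(
    name="k8s-task-demo",
    image="registry.example.com/demo:latest",            # Image
    env=[client.V1EnvVar(name="MY_PARAM", value="1")],   # Custom parameter -> env var
    resources=client.V1ResourceRequirements(
        requests={"cpu": "0.5", "memory": "512Mi"},      # Min CPU / Min Memory
    ),
)
job = client.V1Job(
    metadata=client.V1ObjectMeta(name="k8s-task-demo", namespace="default"),  # Namespace
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
        ),
    ),
)
```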
## Task Example

50
docs/docs/en/guide/task/map-reduce.md

@ -11,46 +11,34 @@ MapReduce(MR) task type used for executing MapReduce programs. For MapReduce nod
## Task Parameters
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) for default parameters.
### General
| **Parameter** | **Description** |
| ------- | ---------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| Resource | Refers to the list of resource files that called in the script, and upload or create files by the Resource Center file management. |
| **Parameter** | **Description** |
|-------------------|--------------------------------------------------------------------------------------------------------------------|
| Custom parameters | It is a local user-defined parameter for MapReduce, and will replace the content with `${variable}` in the script. |
| Predecessor task | Selecting a predecessor task for the current task, will set the selected predecessor task as upstream of the current task. |
### JAVA or SCALA Program
| **Parameter** | **Description** |
| ------- | ---------- |
| Program type | Select JAVA or SCALA program. |
| The class of the main function | The **full path** of Main Class, the entry point of the MapReduce program. |
| Main jar package | The jar package of the MapReduce program. |
| Task name | MapReduce task name. |
| Command line parameters | Set the input parameters of the MapReduce program and support the substitution of custom parameter variables. |
| Other parameters | Support `-D`, `-files`, `-libjars`, `-archives` format. |
| Resource | Appoint resource files in the `Resource` if parameters refer to them. |
| User-defined parameter | It is a local user-defined parameter for MapReduce, and will replace the content with `${variable}` in the script. |
| **Parameter** | **Description** |
|--------------------------------|--------------------------------------------------------------------------------------------------------------------|
| Program type | Select JAVA or SCALA program. |
| The class of the main function | The **full path** of Main Class, the entry point of the MapReduce program. |
| Main jar package | The jar package of the MapReduce program. |
| Task name | MapReduce task name. |
| Command line parameters | Set the input parameters of the MapReduce program and support the substitution of custom parameter variables. |
| Other parameters | Support `-D`, `-files`, `-libjars`, `-archives` format. |
| User-defined parameter | It is a local user-defined parameter for MapReduce, and will replace the content with `${variable}` in the script. |
### Python Program
| **Parameter** | **Description** |
| ------- | ---------- |
| Program type | Select Python language. |
| Main jar package | The Python jar package for running MapReduce. |
| Other parameters | Support `-D`, `-mapper`, `-reducer,` `-input` `-output` format, and you can set the input of user-defined parameters, such as:<ul><li>`-mapper "mapper.py 1"` `-file mapper.py` `-reducer reducer.py` `-file reducer.py` `–input /journey/words.txt` `-output /journey/out/mr/${currentTimeMillis}`</li><li>The `mapper.py 1` after `-mapper` is two parameters, the first parameter is `mapper.py`, and the second parameter is `1`. </li></ul> |
| Resource | Appoint resource files in the `Resource` if parameters refer to them. |
| User-defined parameter | It is a local user-defined parameter for MapReduce, and will replace the content with `${variable}` in the script. |
| **Parameter** | **Description** |
|------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Program type | Select Python language. |
| Main jar package | The Python jar package for running MapReduce. |
| Other parameters       | Support `-D`, `-mapper`, `-reducer`, `-input`, `-output` format, and you can set the input of user-defined parameters, such as:<ul><li>`-mapper "mapper.py 1"` `-file mapper.py` `-reducer reducer.py` `-file reducer.py` `-input /journey/words.txt` `-output /journey/out/mr/${currentTimeMillis}`</li><li>The `mapper.py 1` after `-mapper` is two parameters: the first parameter is `mapper.py`, and the second parameter is `1`.</li></ul> |
| User-defined parameter | It is a local user-defined parameter for MapReduce, and will replace the content with `${variable}` in the script. |
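To make the Python-program parameters concrete, the sketch below assembles the equivalent Hadoop streaming command with Python's `subprocess`. The streaming jar path and HDFS paths are placeholders, and the worker's actual command construction may differ.

```python
import subprocess

# Placeholder jar and paths; flags mirror the "Other parameters" example above.
cmd = [
    "hadoop", "jar", "/opt/hadoop/share/hadoop/tools/lib/hadoop-streaming.jar",
    "-D", "mapreduce.job.queuename=default",
    "-mapper", "mapper.py 1",   # "mapper.py" is the script, "1" is its first argument
    "-file", "mapper.py",
    "-reducer", "reducer.py",
    "-file", "reducer.py",
    "-input", "/journey/words.txt",
    "-output", "/journey/out/mr/demo",
]
subprocess.run(cmd, check=True)
```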
## Task Example

109
docs/docs/en/guide/task/mlflow.md

@ -13,21 +13,14 @@ MLflow task plugin used to execute MLflow tasks,Currently contains MLflow Proj
The MLflow plugin currently supports and will support the following:
- [x] MLflow Projects
- [x] BasicAlgorithm: contains LogisticRegression, svm, lightgbm, xgboost
- [x] AutoML: AutoML tool,contains autosklean, flaml
- [x] Custom projects: Support for running your own MLflow projects
- [ ] MLflow Models
- [x] MLFLOW: Use `MLflow models serve` to deploy a model service
- [x] Docker: Run the container after packaging the docker image
- [x] Docker Compose: Use docker compose to run the container, it will replace the docker run above
- [ ] Seldon core: Use Selcon core to deploy model to k8s cluster
- [ ] k8s: Deploy containers directly to K8S
- [ ] MLflow deployments: Built-in deployment modules, such as built-in deployment to SageMaker, etc
- [ ] Model Registry
- [ ] Register Model: Allows artifacts (Including model and related parameters, indicators) to be registered directly into the model center
- MLflow Projects
- BasicAlgorithm: contains LogisticRegression, svm, lightgbm, xgboost
  - AutoML: AutoML tool, contains autosklearn, flaml
- Custom projects: Support for running your own MLflow projects
- MLflow Models
- MLFLOW: Use `MLflow models serve` to deploy a model service
- Docker: Run the container after packaging the docker image
- Docker Compose: Use docker compose to run the container, it will replace the docker run above
## Create Task
@ -36,21 +29,12 @@ The MLflow plugin currently supports and will support the following:
## Task Parameters and Example
| **Parameter** | **Description** |
| ------- | ---------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| Predecessor task | Selecting the predecessor task of the current task will set the selected predecessor task as the upstream of the current task. |
| MLflow Tracking Server URI | MLflow Tracking Server URI, default http://localhost:5000. |
| Experiment Name | Create the experiment where the task is running, if the experiment does not exist. If the name is empty, it is set to ` Default `, the same as MLflow. |
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) for default parameters.
| **Parameter** | **Description** |
|----------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------|
| MLflow Tracking Server URI | MLflow Tracking Server URI, default http://localhost:5000. |
| Experiment Name            | Name of the experiment under which the task runs; the experiment is created if it does not exist. If the name is empty, `Default` is used, the same as MLflow.          |
### MLflow Projects
@ -59,51 +43,52 @@ The MLflow plugin currently supports and will support the following:
![mlflow-conda-env](../../../../img/tasks/demo/mlflow-basic-algorithm.png)
**Task Parameters**
| **Parameter** | **Description** |
| ------- | ---------- |
| Register Model | Register the model or not. If register is selected, the following parameters are expanded. |
| Model Name | The registered model name is added to the original model version and registered as Production. |
| Data Path | The absolute path of the file or folder. Ends with .csv for file or contain train.csv and test.csv for folder(In the suggested way, users should build their own test sets for model evaluation. |
| Parameters | Parameter when initializing the algorithm/AutoML model, which can be empty. For example, parameters `"time_budget=30;estimator_list=['lgbm']"` for flaml 。The convention will be passed with '; ' shards each parameter, using the name before the equal sign as the parameter name, and using the name after the equal sign to get the corresponding parameter value through `python eval()`. <ul><li>[Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)</li><li>[SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html?highlight=svc#sklearn.svm.SVC)</li><li>[lightgbm](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier)</li><li>[xgboost](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier)</li></ul> |
| Algorithm |The selected algorithm currently supports `LR`, `SVM`, `LightGBM` and `XGboost` based on [scikit-learn](https://scikit-learn.org/) form. |
| Parameter Search Space | Parameter search space when running the corresponding algorithm, which can be empty. For example, the parameter `max_depth=[5, 10];n_estimators=[100, 200]` for lightgbm 。The convention will be passed with '; 'shards each parameter, using the name before the equal sign as the parameter name, and using the name after the equal sign to get the corresponding parameter value through `python eval()`. |
| **Parameter** | **Description** |
|------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Register Model | Register the model or not. If register is selected, the following parameters are expanded. |
| Model Name | The registered model name is added to the original model version and registered as Production. |
| Data Path              | The absolute path of the file or folder. Ends with `.csv` for a file, or contains `train.csv` and `test.csv` for a folder (as suggested, users should build their own test sets for model evaluation). |
| Parameters             | Parameters used when initializing the algorithm/AutoML model, which can be empty. For example, `"time_budget=30;estimator_list=['lgbm']"` for flaml. Parameters are separated by `;`; the name before the equal sign is the parameter name, and the value after the equal sign is evaluated with Python `eval()`. <ul><li>[Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)</li><li>[SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html?highlight=svc#sklearn.svm.SVC)</li><li>[lightgbm](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier)</li><li>[xgboost](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier)</li></ul> |
| Algorithm              | The selected algorithm. Currently supports `LR`, `SVM`, `LightGBM` and `XGboost`, based on [scikit-learn](https://scikit-learn.org/). |
| Parameter Search Space | Parameter search space for the corresponding algorithm, which can be empty. For example, `max_depth=[5, 10];n_estimators=[100, 200]` for lightgbm. Parameters are separated by `;`; the name before the equal sign is the parameter name, and the value after the equal sign is evaluated with Python `eval()`. |
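The `;`-separated convention described for `Parameters` and `Parameter Search Space` can be pictured with the small sketch below. It only illustrates the convention and is not the plugin's actual parsing code.

```python
def parse_params(raw: str) -> dict:
    """Illustrative parser for strings such as "time_budget=30;estimator_list=['lgbm']"."""
    params = {}
    for pair in filter(None, (p.strip() for p in raw.split(";"))):
        name, value = pair.split("=", 1)
        params[name.strip()] = eval(value)  # the value is evaluated as a Python expression
    return params

print(parse_params("max_depth=[5, 10];n_estimators=[100, 200]"))
```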
#### AutoML
![mlflow-automl](../../../../img/tasks/demo/mlflow-automl.png)
**Task Parameter**
| **Parameter** | **Description** |
| ------- | ---------- |
| Register Model | Register the model or not. If register is selected, the following parameters are expanded. |
| model name | The registered model name is added to the original model version and registered as Production. |
| Data Path | The absolute path of the file or folder. Ends with .csv for file or contain train.csv and test.csv for folder(In the suggested way, users should build their own test sets for model evaluation). |
| Parameters | Parameter when initializing the algorithm/AutoML model, which can be empty. For example, parameters `n_estimators=200;learning_rate=0.2` for flaml. The convention will be passed with '; 'shards each parameter, using the name before the equal sign as the parameter name, and using the name after the equal sign to get the corresponding parameter value through `python eval()`. The detailed parameter list is as follows: <ul><li>[flaml](https://microsoft.github.io/FLAML/docs/reference/automl#automl-objects)</li><li>[autosklearn](https://automl.github.io/auto-sklearn/master/api.html)</li></ul> |
| AutoML tool | The AutoML tool used, currently supports [autosklearn](https://github.com/automl/auto-sklearn) and [flaml](https://github.com/microsoft/FLAML). |
| **Parameter** | **Description** |
|----------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Register Model | Register the model or not. If register is selected, the following parameters are expanded. |
| Model Name     | The registered model name is added to the original model version and registered as Production. |
| Data Path      | The absolute path of the file or folder. Ends with `.csv` for a file, or contains `train.csv` and `test.csv` for a folder (as suggested, users should build their own test sets for model evaluation). |
| Parameters     | Parameters used when initializing the AutoML model, which can be empty. For example, `n_estimators=200;learning_rate=0.2` for flaml. Parameters are separated by `;`; the name before the equal sign is the parameter name, and the value after the equal sign is evaluated with Python `eval()`. The detailed parameter list is as follows: <ul><li>[flaml](https://microsoft.github.io/FLAML/docs/reference/automl#automl-objects)</li><li>[autosklearn](https://automl.github.io/auto-sklearn/master/api.html)</li></ul> |
| AutoML tool | The AutoML tool used, currently supports [autosklearn](https://github.com/automl/auto-sklearn) and [flaml](https://github.com/microsoft/FLAML). |
#### Custom projects
![mlflow-custom-project.png](../../../../img/tasks/demo/mlflow-custom-project.png)
**Task Parameter**
| **Parameter** | **Description** |
| ------- | ---------- |
| parameters | `--param-list` in `mlflow run`. For example `-P learning_rate=0.2 -P colsample_bytree=0.8 -P subsample=0.9`. |
| Repository | Repository url of MLflow Project,Support git address and directory on worker. If it's in a subdirectory,We add `#` to support this (same as `mlflow run`) , for example `https://github.com/mlflow/mlflow#examples/xgboost/xgboost_native`. |
| Project Version | Version of the project,default master. |
You can now use this feature to run all MLFlow projects on Github (For example [MLflow examples](https://github.com/mlflow/mlflow/tree/master/examples) ). You can also create your own machine learning library to reuse your work, and then use DolphinScheduler to use your library with one click.
| **Parameter** | **Description** |
|-----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| parameters | `--param-list` in `mlflow run`. For example `-P learning_rate=0.2 -P colsample_bytree=0.8 -P subsample=0.9`. |
| Repository      | Repository URL of the MLflow project. Supports git addresses and directories on the worker. If the project is in a subdirectory, append `#` and the subdirectory path (same as `mlflow run`), for example `https://github.com/mlflow/mlflow#examples/xgboost/xgboost_native`. |
| Project Version | Version of the project; defaults to `master`.                                                                                                                                                                                                   |
You can now use this feature to run all MLflow projects on GitHub (for example the [MLflow examples](https://github.com/mlflow/mlflow/tree/master/examples)). You can also create your own machine learning library to reuse your work, and then use DolphinScheduler to run your library with one click.
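As a hedged illustration, the Custom Project fields correspond roughly to an `mlflow.projects.run` call like the one below. Tracking server and experiment set-up are omitted, and the parameter values are placeholders.

```python
import mlflow

# Roughly what the Custom Project fields describe; values are placeholders.
mlflow.projects.run(
    uri="https://github.com/mlflow/mlflow#examples/xgboost/xgboost_native",       # Repository
    version="master",                                                             # Project Version
    parameters={"learning_rate": 0.2, "colsample_bytree": 0.8, "subsample": 0.9}, # parameters
)
```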
### MLflow Models
**General Parameters**
| **Parameter** | **Description** |
| ------- | ---------- |
| Model-URI | Model-URI of MLflow , support `models:/<model_name>/suffix` format and `runs:/` format. See https://mlflow.org/docs/latest/tracking.html#artifact-stores |
| Port | The port to listen on. |
| **Parameter** | **Description** |
|---------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|
| Model-URI     | Model-URI of MLflow, supports the `models:/<model_name>/suffix` format and the `runs:/` format. See https://mlflow.org/docs/latest/tracking.html#artifact-stores |
| Port | The port to listen on. |
#### MLflow
@ -117,10 +102,10 @@ You can now use this feature to run all MLFlow projects on Github (For example [
![mlflow-models-docker-compose](../../../../img/tasks/demo/mlflow-models-docker-compose.png)
| **Parameter** | **Description** |
| ------- | ---------- |
| Max Cpu Limit | For example, `1.0` or `0.5`, the same as docker compose. |
| Max Memory Limit | For example `1G` or `500M`, the same as docker compose. |
| **Parameter** | **Description** |
|------------------|----------------------------------------------------------|
| Max Cpu Limit | For example, `1.0` or `0.5`, the same as docker compose. |
| Max Memory Limit | For example `1G` or `500M`, the same as docker compose. |
## Environment to Prepare
@ -128,7 +113,7 @@ You can now use this feature to run all MLFlow projects on Github (For example [
You need to enter the admin account to configure a conda environment variable(Please
install [anaconda](https://docs.continuum.io/anaconda/install/)
or [miniconda](https://docs.conda.io/en/latest/miniconda.html#installing ) in advance).
or [miniconda](https://docs.conda.io/en/latest/miniconda.html#installing) in advance).
![mlflow-conda-env](../../../../img/tasks/demo/mlflow-conda-env.png)
@ -153,4 +138,4 @@ After running, an MLflow service is started.
After this, you can visit the MLflow service (`http://localhost:5000`) page to view the experiments and models.
![mlflow-server](../../../../img/tasks/demo/mlflow-server.png)
![mlflow-server](../../../../img/tasks/demo/mlflow-server.png)

27
docs/docs/en/guide/task/openmldb.md

@ -2,7 +2,7 @@
## Overview
[OpenMLDB](https://openmldb.ai/) is an excellent open source machine learning database, providing a full-stack
[OpenMLDB](https://openmldb.ai/) is an excellent open source machine learning database, providing a full-stack
FeatureOps solution for production.
OpenMLDB task plugin used to execute tasks on OpenMLDB cluster.
@ -14,23 +14,14 @@ OpenMLDB task plugin used to execute tasks on OpenMLDB cluster.
## Task Parameters
| **Parameter** | **Description** |
| ------- | ---------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| Predecessor task | Selecting the predecessor task of the current task will set the selected predecessor task as the upstream of the current task. |
| zookeeper | OpenMLDB cluster zookeeper address, e.g. 127.0.0.1:2181. |
| zookeeper path | OpenMLDB cluster zookeeper path, e.g. /openmldb. |
| Execute Mode | Determine the init mode, offline or online. You can switch it in sql statement. |
| SQL statement | SQL statement. |
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) for default parameters.
| **Parameter** | **Description** |
|-------------------|--------------------------------------------------------------------------------------------------------------|
| zookeeper | OpenMLDB cluster zookeeper address, e.g. 127.0.0.1:2181. |
| zookeeper path | OpenMLDB cluster zookeeper path, e.g. /openmldb. |
| Execute Mode      | Determine the initial execution mode, offline or online. You can switch it in the SQL statement.              |
| SQL statement     | SQL statement.                                                                                                 |
| Custom parameters | User-defined parameters, which will replace the `${variable}` placeholders in the SQL statements.             |
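For orientation, the sketch below shows how the same information is typically used with the OpenMLDB Python SDK. The `connect()` call follows the OpenMLDB quickstart and its exact signature may differ between versions, so treat this as an assumption rather than the task plugin's implementation.

```python
import openmldb.dbapi

# Assumed connect() signature from the OpenMLDB quickstart; it may vary by version.
db = openmldb.dbapi.connect(zk="127.0.0.1:2181", zkPath="/openmldb")  # zookeeper / zookeeper path
cursor = db.cursor()
cursor.execute("SET @@execute_mode='offline'")  # Execute Mode
cursor.execute("SELECT 1")                      # SQL statement (placeholder)
```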
## Task Examples

20
docs/docs/en/guide/task/pigeon.md

@ -11,17 +11,9 @@ Pigeon is a task used to trigger remote tasks, acquire logs or status by calling
## Task Parameters
| **Parameter** | **Description** |
| ------- | ---------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| Target task name | Target task name of this Pigeon node. |
| Predecessor task | Selecting the predecessor task of the current task will set the selected predecessor task as the upstream of the current task. |
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) for default parameters.
| **Parameter** | **Description** |
|------------------|---------------------------------------|
| Target task name | Target task name of this Pigeon node. |

22
docs/docs/en/guide/task/python.md

@ -12,23 +12,11 @@ it will generate a temporary python script, and executes the script by the Linux
## Task Parameter
| **Parameter** | **Description** |
| ------- | ---------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| Cpu quota | Assign the specified CPU time quota to the task executed. Takes a percentage value. Default -1 means unlimited. For example, the full CPU load of one core is 100%,and that of 16 cores is 1600%. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md). |
| Max memory | Assign the specified max memory to the task executed. Exceeding this limit will trigger oom to be killed and will not automatically retry. Takes an MB value. Default -1 means unlimited. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md). |
| Timeout alarm | Check the timeout alarm and timeout failure. When the task exceeds the "timeout period", an alarm email will send and the task execution will fail. |
| Script | Python program developed by the user. |
| Resource | Refers to the list of resource files that need to be called in the script, and the files uploaded or created by the resource center-file management. |
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) for default parameters.
| **Parameter** | **Description** |
|-------------------|--------------------------------------------------------------------------------------------------------------|
| Script | Python program developed by the user. |
| Custom parameters | It is the user-defined parameters of Python, which will replace the content with \${variable} in the script. |
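A minimal sketch of what the `Script` field might contain; `${name}` stands for a custom parameter that the scheduler substitutes before the temporary script is executed, and the variable name is only an example.

```python
# Hypothetical Script content: ${name} is replaced with the custom parameter value
# before this temporary script runs.
greeting = "hello, ${name}"
print(greeting)
```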
## Task Example

27
docs/docs/en/guide/task/pytorch.md

@ -19,30 +19,7 @@ The task plugin picture is as follows
![pytorch](../../../../img/tasks/demo/pytorch_en.png)
First, introduce some general parameters of DolphinScheduler:
- **Node name**: The node name in a workflow definition is unique.
- **Run flag**: Identifies whether this node schedules normally, if it does not need to execute, select
the `prohibition execution`.
- **Descriptive information**: Describe the function of the node.
- **Task priority**: When the number of worker threads is insufficient, execute in the order of priority from high
to low, and tasks with the same priority will execute in a first-in first-out order.
- **Worker grouping**: Assign tasks to the machines of the worker group to execute. If `Default` is selected,
randomly select a worker machine for execution.
- **Environment Name**: Configure the environment name in which run the script.
- **Times of failed retry attempts**: The number of times the task failed to resubmit.
- **Failed retry interval**: The time interval (unit minute) for resubmitting the task after a failed task.
- **Delayed execution time**: The time (unit minute) that a task delays in execution.
- **Timeout alarm**: Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm
email will send and the task execution will fail.
- **Resource**: Refers to the list of resource files that need to be called in the script, and the files uploaded or created in Resource Center - File Management.
- **User-defined parameters**: It is a user-defined parameter of Shell, which will replace the content with `${variable}` in the script.
- **Predecessor task**: Selecting a predecessor task for the current task, will set the selected predecessor task as
upstream of the current task.
Here are some specific parameters for the Pytorch plugin:
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) for default parameters.
#### Run time parameters
@ -70,7 +47,6 @@ The preceding two parameters are used to minimize the running of the configurati
- If you choose `conda`, a new environment may be created with `conda`, and you need to specify the Python version.
- **Requirement File**: The default is `requirements.txt`.
We can use relative paths for `Python Script` and `Requirement File` if we set a `Project Path` that contains the Python script or the requirements file.
#### Demo
@ -81,7 +57,6 @@ We can run task like below:
![pytorch_note](../../../../img/tasks/demo/pytorch_note_en.png)
In addition, if the code is stored in the `Resource`, you can use the `Resource` parameter to download the code, and write the related parameters into the path of the corresponding resource.
## Environment configuration

27
docs/docs/en/guide/task/sagemaker.md

@ -6,7 +6,6 @@
[Amazon SageMaker Model Building Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines.html) is a tool for building machine learning pipelines that take advantage of direct SageMaker integration.
For users of big data and machine learning, the SageMaker task plugin helps connect big data workflows with SageMaker usage scenarios.
DolphinScheduler SageMaker task plugin features are as follows:
@ -21,39 +20,20 @@ DolphinScheduler SageMaker task plugin features are as follows:
## Task Example
First, introduce some general parameters of DolphinScheduler:
- **Node name**: The node name in a workflow definition is unique.
- **Run flag**: Identifies whether this node schedules normally, if it does not need to execute, select
the `prohibition execution`.
- **Descriptive information**: Describe the function of the node.
- **Task priority**: When the number of worker threads is insufficient, execute in the order of priority from high
to low, and tasks with the same priority will execute in a first-in first-out order.
- **Worker grouping**: Assign tasks to the machines of the worker group to execute. If `Default` is selected,
randomly select a worker machine for execution.
- **Environment Name**: Configure the environment name in which run the script.
- **Times of failed retry attempts**: The number of times the task failed to resubmit.
- **Failed retry interval**: The time interval (unit minute) for resubmitting the task after a failed task.
- **Delayed execution time**: The time (unit minute) that a task delays in execution.
- **Timeout alarm**: Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm
email will send and the task execution will fail.
- **Predecessor task**: Selecting a predecessor task for the current task, will set the selected predecessor task as
upstream of the current task.
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) for default parameters.
Here are some specific parameters for the SageMaker plugin:
- **SagemakerRequestJson**: Request parameters of StartPipelineExecution; see also [AWS API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_StartPipelineExecution.html). An illustrative payload is sketched below.
The task plugin is shown as follows:
![sagemaker_pipeline](../../../../img/tasks/demo/sagemaker_pipeline.png)
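A hedged example of what `SagemakerRequestJson` might contain. The field names follow the `StartPipelineExecution` API, while the pipeline name and parameter values are placeholders.

```python
import json

# Placeholder pipeline name and parameters; field names follow StartPipelineExecution.
request_json = json.dumps({
    "PipelineName": "DemoPipeline",
    "PipelineExecutionDisplayName": "dolphinscheduler-run",
    "PipelineParameters": [
        {"Name": "ProcessingInstanceType", "Value": "ml.m5.xlarge"},
    ],
})
print(request_json)
```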
## Environment to prepare
Some AWS configuration is required; modify the following fields in the `common.properties` file:
```yaml
# The AWS access key. if resource.storage.type=S3 or use EMR-Task, This configuration is required
resource.aws.access.key.id=<YOUR AWS ACCESS KEY>
@ -61,4 +41,5 @@ resource.aws.access.key.id=<YOUR AWS ACCESS KEY>
resource.aws.secret.access.key=<YOUR AWS SECRET KEY>
# The AWS Region to use. if resource.storage.type=S3 or use EMR-Task, This configuration is required
resource.aws.region=<AWS REGION>
```
```

36
docs/docs/en/guide/task/seatunnel.md

@ -12,31 +12,22 @@ Click [here](https://seatunnel.apache.org/) for more information about `Apache S
## Task Parameter
- Node name: The node name in a workflow definition is unique.
- Run flag: Identifies whether this node can be scheduled normally, if it does not need to be executed, you can turn on the prohibition switch.
- Descriptive information: describe the function of the node.
- Task priority: When the number of worker threads is insufficient, they are executed in order from high to low, and when the priority is the same, they are executed according to the first-in first-out principle.
- Worker grouping: Tasks are assigned to the machines of the worker group to execute. If Default is selected, a worker machine will be randomly selected for execution.
- Environment Name: Configure the environment name in which to run the script.
- Number of failed retry attempts: The number of times the task failed to be resubmitted.
- Failed retry interval: The time, in cents, interval for resubmitting the task after a failed task.
- Cpu quota: Assign the specified CPU time quota to the task executed. Takes a percentage value. Default -1 means unlimited. For example, the full CPU load of one core is 100%,and that of 16 cores is 1600%. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md)
- Max memory:Assign the specified max memory to the task executed. Exceeding this limit will trigger oom to be killed and will not automatically retry. Takes an MB value. Default -1 means unlimited. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md)
- Delayed execution time: The time, in cents, that a task is delayed in execution.
- Timeout alarm: Check the timeout alarm and timeout failure. When the task exceeds the "timeout period", an alarm email will be sent and the task execution will fail.
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) for default parameters.
- Engine: Supports FLINK and SPARK
- FLINK
- Run model: supports `run` and `run-application` modes
- Option parameters: used to add the parameters of the Flink engine, such as `-m yarn-cluster -ynm seatunnel`
- SPARK
- Deployment mode: specify the deployment mode, `cluster` `client` `local`
- Master: Specify the `Master` model, `yarn` `local` `spark` `mesos`, where `spark` and `mesos` need to specify the `Master` service address, for example: 127.0.0.1:7077
> Click [here](https://seatunnel.apache.org/docs/2.1.2/command/usage) for more information on the usage of `Apache SeaTunnel command`
- FLINK
- Run model: supports `run` and `run-application` modes
- Option parameters: used to add the parameters of the Flink engine, such as `-m yarn-cluster -ynm seatunnel`
- SPARK
- Deployment mode: specify the deployment mode, `cluster` `client` `local`
- Master: Specify the `Master` model, `yarn` `local` `spark` `mesos`, where `spark` and `mesos` need to specify the `Master` service address, for example: 127.0.0.1:7077
> Click [here](https://seatunnel.apache.org/docs/2.1.2/command/usage) for more information on the usage of `Apache SeaTunnel command`
- Custom Configuration: Supports custom configuration or selecting a configuration file from the Resource Center
> Click [here](https://seatunnel.apache.org/docs/2.1.2/concept/config) for more information about `Apache SeaTunnel config` file
> Click [here](https://seatunnel.apache.org/docs/2.1.2/concept/config) for more information about `Apache SeaTunnel config` file
- Script: Customize configuration information on the task node, including four parts: `env` `source` `transform` `sink`
- Resource file: The configuration file of the resource center can be referenced in the task node, and only one configuration file can be referenced.
- Predecessor task: Selecting a predecessor task for the current task will set the selected predecessor task as upstream of the current task.
## Task Example
@ -80,3 +71,4 @@ sink {
}
```

21
docs/docs/en/guide/task/shell.md

@ -9,25 +9,14 @@ Shell task type, used to create a shell type task and execute a series of shell
- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag from the toolbar <img src="../../../../img/tasks/icons/shell.png" width="15"/> to the canvas.
## Task Parameters
| **Parameter** | **Description** |
| ------- | ---------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| Script | A SHELL program developed by the user. |
| Resource | Refers to the list of resource files that need to be called in the script, and the files uploaded or created in Resource Center - File Management.|
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) for default parameters.
| **Parameter** | **Description** |
|------------------------|-----------------------------------------------------------------------------------------------------------|
| Script | A SHELL program developed by the user. |
| User-defined parameter | It is a user-defined parameter of Shell, which will replace the content with `${variable}` in the script. |
| Predecessor task | Selecting the predecessor task of the current task will set the selected predecessor task as the upstream of the current task. |
## Task Example

48
docs/docs/en/guide/task/spark.md

@ -15,34 +15,26 @@ Spark task type for executing Spark application. When executing the Spark task,
## Task Parameters
| **Parameter** | **Description** |
| ------- | ---------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| Program type | Supports Java, Scala, Python, and SQL. |
| Spark version | Support Spark1 and Spark2. |
| The class of main function | The **full path** of Main Class, the entry point of the Spark program. |
| Main jar package | The Spark jar package (upload by Resource Center). |
| SQL scripts | SQL statements in .sql files that Spark sql runs. |
| Deployment mode | <ul><li>spark submit supports three modes: yarn-clusetr, yarn-client and local.</li><li>spark sql supports yarn-client and local modes.</li></ul> |
| Task name | Spark task name. |
| Driver core number | Set the number of Driver core, which can be set according to the actual production environment. |
| Driver memory size | Set the size of Driver memories, which can be set according to the actual production environment. |
| Number of Executor | Set the number of Executor, which can be set according to the actual production environment. |
| Executor memory size | Set the size of Executor memories, which can be set according to the actual production environment. |
| Main program parameters | Set the input parameters of the Spark program and support the substitution of custom parameter variables. |
| Optional parameters | Support `--jars`, `--files`,` --archives`, `--conf` format. |
| Resource | Appoint resource files in the `Resource` if parameters refer to them. |
| Custom parameter | It is a local user-defined parameter for Spark, and will replace the content with `${variable}` in the script. |
| Predecessor task | Selecting a predecessor task for the current task, will set the selected predecessor task as upstream of the current task. |
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) for default parameters.
| **Parameter** | **Description** |
|----------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|
| Program type | Supports Java, Scala, Python, and SQL. |
| Spark version | Support Spark1 and Spark2. |
| The class of main function | The **full path** of Main Class, the entry point of the Spark program. |
| Main jar package | The Spark jar package (upload by Resource Center). |
| SQL scripts | SQL statements in .sql files that Spark sql runs. |
| Deployment mode            | <ul><li>spark submit supports three modes: yarn-cluster, yarn-client and local.</li><li>spark sql supports yarn-client and local modes.</li></ul>   |
| Task name | Spark task name. |
| Driver core number | Set the number of Driver core, which can be set according to the actual production environment. |
| Driver memory size | Set the size of Driver memories, which can be set according to the actual production environment. |
| Number of Executor | Set the number of Executor, which can be set according to the actual production environment. |
| Executor memory size | Set the size of Executor memories, which can be set according to the actual production environment. |
| Main program parameters | Set the input parameters of the Spark program and support the substitution of custom parameter variables. |
| Optional parameters        | Support `--jars`, `--files`, `--archives`, `--conf` format.                                                                                         |
| Resource | Appoint resource files in the `Resource` if parameters refer to them. |
| Custom parameter | It is a local user-defined parameter for Spark, and will replace the content with `${variable}` in the script. |
| Predecessor task | Selecting a predecessor task for the current task, will set the selected predecessor task as upstream of the current task. |
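For intuition, the sketch below shows roughly how these fields line up with a `spark-submit` command. The class, jar and argument values are placeholders, and the worker's actual command construction may differ.

```python
import subprocess

# Placeholder class, jar and arguments; flags mirror the table above.
cmd = [
    "spark-submit",
    "--master", "yarn", "--deploy-mode", "cluster",  # Deployment mode: yarn-cluster
    "--name", "spark-wordcount-demo",                # Task name
    "--class", "org.example.WordCount",              # The class of main function (placeholder)
    "--driver-cores", "1", "--driver-memory", "512M",
    "--num-executors", "2", "--executor-memory", "2G",
    "--conf", "spark.yarn.maxAppAttempts=1",         # Optional parameters
    "wordcount.jar",                                 # Main jar package (placeholder)
    "/input", "/output",                             # Main program parameters
]
subprocess.run(cmd, check=True)
```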
## Task Example

24
docs/docs/en/guide/task/sql.md

@ -15,16 +15,18 @@ Refer to [datasource-setting](../howto/datasource-setting.md) `DataSource Center
## Task Parameter
| **Parameter** | **Description** |
| ------- |-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Data source | Select the corresponding DataSource. |
| SQL type | Supports query and non-query. <ul><li>Query: supports `DML select` type commands, which return a result set. You can specify three templates for email notification as form, attachment or form attachment;</li><li>Non-query: support `DDL` all commands and `DML update, delete, insert` three types of commands;<ul><li>Segmented execution symbol: When the data source does not support executing multiple SQL statements at a time, the symbol for splitting SQL statements is provided to call the data source execution method multiple times. Example: 1. When the Hive data source is selected as the data source, this parameter does not need to be filled in. Because the Hive data source itself supports executing multiple SQL statements at one time; 2. When the MySQL data source is selected as the data source, and multi-segment SQL statements are to be executed, this parameter needs to be filled in with a semicolon `;. Because the MySQL data source does not support executing multiple SQL statements at one time.</li></ul></li></ul> |
| SQL parameter | The input parameter format is `key1=value1;key2=value2...`. |
| SQL statement | SQL statement. |
| UDF function | For Hive DataSources, you can refer to UDF functions created in the resource center, but other DataSource do not support UDF functions. |
| Custom parameters | SQL task type, and stored procedure is a custom parameter order, to set customized parameter type and data type for the method is the same as the stored procedure task type. The difference is that the custom parameter of the SQL task type replaces the `${variable}` in the SQL statement. |
| Pre-SQL | Pre-SQL executes before the SQL statement. |
| Post-SQL | Post-SQL executes after the SQL statement. |
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) for default parameters.
| **Parameter** | **Description** |
|-------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Data source | Select the corresponding DataSource. |
| SQL type          | Supports query and non-query. <ul><li>Query: supports `DML select` type commands, which return a result set. You can specify three templates for email notification: form, attachment, or form attachment;</li><li>Non-query: supports all `DDL` commands and the three `DML` command types update, delete and insert;<ul><li>Segmented execution symbol: when the data source does not support executing multiple SQL statements at a time, a symbol for splitting SQL statements is provided so that the data source execution method is called multiple times. Example: 1. When the Hive data source is selected, this parameter does not need to be filled in, because the Hive data source itself supports executing multiple SQL statements at one time; 2. When the MySQL data source is selected and multi-segment SQL statements are to be executed, this parameter needs to be filled in with a semicolon `;`, because the MySQL data source does not support executing multiple SQL statements at one time.</li></ul></li></ul> |
| SQL parameter | The input parameter format is `key1=value1;key2=value2...`. |
| SQL statement | SQL statement. |
| UDF function | For Hive DataSources, you can refer to UDF functions created in the resource center, but other DataSource do not support UDF functions. |
| Custom parameters | SQL task type, and stored procedure is a custom parameter order, to set customized parameter type and data type for the method is the same as the stored procedure task type. The difference is that the custom parameter of the SQL task type replaces the `${variable}` in the SQL statement. |
| Pre-SQL | Pre-SQL executes before the SQL statement. |
| Post-SQL | Post-SQL executes after the SQL statement. |
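To picture how custom parameters interact with the SQL statement, the small sketch below substitutes `${variable}` placeholders. It only illustrates the replacement behaviour and is not the SQL task's implementation.

```python
import re

def render_sql(sql: str, params: dict) -> str:
    """Replace ${variable} placeholders with custom parameter values."""
    return re.sub(r"\$\{(\w+)\}", lambda m: str(params.get(m.group(1), m.group(0))), sql)

# Example: the custom parameter dt fills the placeholder in the statement.
print(render_sql("SELECT * FROM orders WHERE dt = '${dt}'", {"dt": "2022-09-01"}))
```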
## Task Example
@ -52,4 +54,4 @@ Table created in the Pre-SQL, after use in the SQL statement, cleaned in the Pos
Pay attention to the selection of SQL type. If it is an insert operation, you need to change it to the "Non-Query" type.
To be compatible with long sessions, UDF functions are created using the `CREATE OR REPLACE` syntax.

25
docs/docs/en/guide/task/stored-procedure.md

@ -10,26 +10,17 @@
## Task Parameters
| **Parameter** | **Description** |
|--------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| DataSource | The DataSource type of the stored procedure supports MySQL, POSTGRESQL, ORACLE. |
| SQL Statement | call a stored procedure, such as `call test(${in1},${out1});`. |
| Custom parameters | The custom parameter types of the stored procedure support `IN` and `OUT`, and the data types support: VARCHAR, INTEGER, LONG, FLOAT, DOUBLE, DATE, TIME, TIMESTAMP and BOOLEAN. |
| Predecessor task | Selecting the predecessor task of the current task will set the selected predecessor task as the upstream of the current task. |
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) for default parameters.
| **Parameter** | **Description** |
|-------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| DataSource | The DataSource type of the stored procedure supports MySQL and POSTGRESQL, select the corresponding DataSource. |
| Method | The method name of the stored procedure. |
| Custom parameters | The custom parameter types of the stored procedure support `IN` and `OUT`, and the data types support: VARCHAR, INTEGER, LONG, FLOAT, DOUBLE, DATE, TIME, TIMESTAMP and BOOLEAN. A JDBC-level sketch of how `IN`/`OUT` parameters map to a procedure call follows the table. |
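The following is a minimal, illustrative JDBC sketch (not the task plugin's actual implementation) of what calling the `test` procedure prepared in the Remark below looks like with one `IN` and one `OUT` parameter. The connection URL and credentials are placeholder values, and a suitable JDBC driver is assumed to be on the classpath.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Types;

public class CallProcedureSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; in DolphinScheduler they come from the selected DataSource.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/dolphinscheduler", "user", "password");
             CallableStatement stmt = conn.prepareCall("{call test(?, ?)}")) {
            stmt.setInt(1, 5);                           // IN parameter, e.g. ${in1}
            stmt.registerOutParameter(2, Types.INTEGER); // OUT parameter, e.g. ${out1}
            stmt.execute();
            // The OUT value is what becomes available after the procedure finishes.
            System.out.println("out1 = " + stmt.getInt(2));
        }
    }
}
```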
## Remark
- Prepare: Create a stored procedure in the database, such as:
- Prepare: Create a stored procedure in the database, e.g.
```
CREATE PROCEDURE dolphinscheduler.test(in in1 INT, out out1 INT)

19
docs/docs/en/guide/task/sub-process.md

@ -6,23 +6,16 @@ The sub-process node is to execute an external workflow definition as a task nod
## Create Task
- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag from the toolbar <img src="../../../../img/tasks/icons/sub_process.png" width="15"/> task node to canvas to create a new SubProcess task.
## Task Parameter
| **Parameter** | **Description** |
| ---- |---------|
| Node name | Unique name of node in workflow definition. |
| Run flag | Identifies whether this node schedules normally. |
| Description | Describe the function of the node. |
| Task priority | When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order. |
| Worker group | Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment name in which run the script. |
| Timeout alarm | Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm email will send and the task execution will fail. |
| Child node | It is the workflow definition of the selected sub-process. Enter the child node in the upper right corner to jump to the workflow definition of the selected sub-process. |
| Pre task | Selecting a predecessor task for the current task, will set the selected predecessor task as upstream of the current task.
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) for default parameters.
| **Parameter** | **Description** |
|---------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Child node | The workflow definition of the selected sub-process. You can enter the child node in the upper right corner to jump to the workflow definition of the selected sub-process. |
## Task Example

26
docs/docs/en/guide/task/switch.md

@ -7,28 +7,20 @@ The switch is a conditional judgment node, decide the branch executes according
**Note**: Expressions are evaluated using `javax.script.ScriptEngine.eval`.
## Create Task
- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag from the toolbar <img src="../../../../img/switch.png" width="20"/> task node to canvas to create a task.
- Drag from the toolbar <img src="../../../../img/switch.png" width="20"/> task node to canvas to create a task.
**Note**: After creating a switch task, you must first configure the upstream and downstream tasks, and then configure the parameters of the task branches.
## Task Parameters
| **Parameter** | **Description** |
| ------- | ---------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| Delay execution time | Task delay execution time. |
| Condition | You can configure multiple conditions for the switch task. When the conditions are satisfied, execute the configured branch. You can configure multiple different conditions to satisfy different businesses. |
| Branch flow | The default branch flow, when all the conditions are not satisfied, execute this branch flow. |
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) for default parameters.
| **Parameter** | **Description** |
|---------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Condition | You can configure multiple conditions for the switch task. When a condition is satisfied, the configured branch is executed. You can configure multiple different conditions to cover different business cases (see the evaluation sketch below the table). |
| Branch flow | The default branch flow; when none of the conditions is satisfied, this branch flow is executed. |
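The note above says conditions are evaluated with `javax.script.ScriptEngine.eval`. The sketch below only illustrates that idea; it is not the plugin's actual code, the condition string and substituted value are made up, and it assumes a JavaScript `ScriptEngine` (for example Nashorn) is available on the worker's JDK.

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class SwitchConditionSketch {
    public static void main(String[] args) throws Exception {
        // A condition as it might be typed in the switch task, e.g. "${flag} > 1".
        String condition = "${flag} > 1";
        // Workflow parameters are substituted before evaluation (illustrative substitution).
        String substituted = condition.replace("${flag}", "3");
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("javascript");
        if (engine == null) {
            // Recent JDKs may not bundle a JavaScript engine; one must be provided.
            throw new IllegalStateException("No JavaScript ScriptEngine available on this JDK");
        }
        Object result = engine.eval(substituted);
        // A true result means the branch configured for this condition is taken.
        System.out.println("condition satisfied: " + Boolean.TRUE.equals(result));
    }
}
```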
## Task Example
@ -59,4 +51,4 @@ If executed correctly, then taskA will be executed correctly.
Execute and see if it works as expected. It can be seen that the specified downstream task A is executed as expected.
![switch_04](../../../../img/tasks/demo/switch_04.png)
![switch_04](../../../../img/tasks/demo/switch_04.png)

40
docs/docs/en/guide/task/zeppelin.md

@ -3,7 +3,7 @@
## Overview
Use `Zeppelin Task` to create a zeppelin-type task and execute zeppelin notebook paragraphs. When the worker executes `Zeppelin Task`,
it will call `Zeppelin Client API` to trigger zeppelin notebook paragraph. Click [here](https://zeppelin.apache.org/) for details about `Apache Zeppelin Notebook`.
it will call `Zeppelin Client API` to trigger zeppelin notebook paragraph. Click [here](https://zeppelin.apache.org/) for details about `Apache Zeppelin Notebook`.
## Create Task
@ -12,34 +12,26 @@ it will call `Zeppelin Client API` to trigger zeppelin notebook paragraph. Click
## Task Parameters
| **Parameter** | **Description** |
| ------- | ---------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
| Zeppelin Note ID | The unique note id for a zeppelin notebook note. |
| Zeppelin Paragraph ID | The unique paragraph id for a zeppelin notebook paragraph. If you want to schedule a whole note at a time, leave this field blank. |
| Zeppelin Production Note Directory | The directory for cloned note in production mode. |
| Zeppelin Rest Endpoint | The REST endpoint of your zeppelin server |
| Zeppelin Parameters | Parameters in json format used for zeppelin dynamic form. |
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) for default parameters.
| **Parameter** | **Description** |
|------------------------------------|------------------------------------------------------------------------------------------------------------------------------------|
| Zeppelin Note ID | The unique note id for a zeppelin notebook note. |
| Zeppelin Paragraph ID | The unique paragraph id for a zeppelin notebook paragraph. If you want to schedule a whole note at a time, leave this field blank. |
| Zeppelin Production Note Directory | The directory for cloned note in production mode. |
| Zeppelin Rest Endpoint | The REST endpoint of your zeppelin server |
| Zeppelin Parameters | Parameters in json format used for zeppelin dynamic form. |
## Production (Clone) Mode
- Fill in the optional `Zeppelin Production Note Directory` parameter to enable `Production Mode`.
- In `Production Mode`, the target note gets copied to the `Zeppelin Production Note Directory` you choose.
`Zeppelin Task Plugin` will execute the cloned note instead of the original one. Once execution done,
`Zeppelin Task Plugin` will delete the cloned note automatically.
Therefore, it increases the stability as the modification to a running note triggered by `Dolphin Scheduler`
will not affect the production task.
- In `Production Mode`, the target note is copied to the `Zeppelin Production Note Directory` you choose.
`Zeppelin Task Plugin` executes the cloned note instead of the original one, and deletes the cloned note automatically once execution is done.
This improves stability, because modifying the original note while `Dolphin Scheduler` is running it will not affect the production task.
- If you leave the `Zeppelin Production Note Directory` empty, `Zeppelin Task Plugin` will execute the original note.
- 'Zeppelin Production Note Directory' should both start and end with a `slash`. e.g. `/production_note_directory/`
- 'Zeppelin Production Note Directory' should both start and end with a `slash`. e.g. `/production_note_directory/`
## Task Example

25
docs/docs/zh/guide/task/appendix.md

@ -0,0 +1,25 @@
# DolphinScheduler任务参数附录
`DolphinScheduler`任务插件有一些公共参数,我们将这些公共参数列在文档中供您查阅。每种任务都有如下的所有或者**部分**默认参数:
## 默认任务参数
| **任务参数** | **描述** |
|----------|--------------------------------------------------------------------------------------------------------------------------------------|
| 任务名称 | 任务的名称,同一个工作流定义中的节点名称不能重复。 |
| 运行标志 | 标识这个节点是否需要调度执行,如果不需要执行,可以打开禁止执行开关。 |
| 描述 | 当前节点的功能描述。 |
| 任务优先级 | worker线程数不足时,根据优先级从高到低依次执行任务,优先级一样时根据先到先得原则执行。 |
| Worker分组 | 设置分组后,任务会被分配给该worker组的机器执行。若选择Default,则会随机选择一个worker执行。 |
| 任务组名称 | 任务资源组,未配置则不生效。 |
| 组内优先级 | 一个任务组内此任务的优先级。 |
| 环境名称 | 配置任务执行的环境。 |
| 失败重试次数 | 任务失败重新提交的次数,可以在下拉菜单中选择或者手动填充。 |
| 失败重试间隔 | 任务失败重新提交任务的时间间隔,可以在下拉菜单中选择或者手动填充。 |
| CPU 配额 | 为执行的任务分配指定的CPU时间配额,单位为百分比,默认-1代表不限制,例如1个核心的CPU满载是100%,16个核心的是1600%。该功能由 [task.resource.limit.state](../../architecture/configuration.md) 控制。 |
| 最大内存 | 为执行的任务分配指定的内存大小,超过会触发OOM被Kill同时不会进行自动重试,单位MB,默认-1代表不限制。该功能由 [task.resource.limit.state](../../architecture/configuration.md) 控制。 |
| 超时告警 | 设置超时告警、超时失败。当任务超过"超时时长"后,会发送告警邮件并且任务执行失败。 |
| 资源 | 任务执行时所需资源文件 |
| 前置任务 | 设置当前任务的前置(上游)任务。 |
| 延时执行时间 | 任务延迟执行的时间,以分为单位 |

29
docs/docs/zh/guide/task/chunjun.md

@ -11,24 +11,15 @@ ChunJun 任务类型,用于执行 ChunJun 程序。对于 ChunJun 节点,wor
## 任务参数
- 节点名称:设置任务节点的名称。一个工作流定义中的节点名称是唯一的。
- 运行标志:标识这个结点是否能正常调度,如果不需要执行,可以打开禁止执行开关。
- 描述:描述该节点的功能。
- 任务优先级:worker 线程数不足时,根据优先级从高到低依次执行,优先级一样时根据先进先出原则执行。
- Worker 分组:任务分配给 worker 组的机器执行,选择 Default ,会随机选择一台 worker 机执行。
- 环境名称:配置运行脚本的环境。
- 任务组名称:任务组的名称。
- 组内优先级:一个任务组内此任务的优先级。
- 失败重试次数:任务失败重新提交的次数。
- 失败重试间隔:任务失败重新提交任务的时间间隔,以分为单位。
- 延时执行时间:任务延迟执行的时间,以分为单位。
- 超时警告:勾选超时警告、超时失败,当任务超过“超时时长”后,会发送告警邮件并且任务执行失败。
- 自定义模板:自定义 ChunJun 节点的 json 配置文件内容,当前支持此种方式。
- json:ChunJun 同步的 json 配置文件。
- 自定义参数:用户自定义参数,会替换脚本中以 ${变量} 的内容。
- 部署方式: 执行ChunJun任务的方式,比如local,standalone等。
- 选项参数: 支持 `-confProp "{\"flink.checkpoint.interval\":60000}"` 格式。
- 前置任务:选择当前任务的前置任务,会将被选择的前置任务设置为当前任务的上游。
- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md#默认任务参数)。
| **任务参数** | **描述** |
|----------|------------------------------------------------------------|
| 自定义模板 | 自定义 ChunJun 节点的 json 配置文件内容,当前支持此种方式。 |
| json | ChunJun 同步的 json 配置文件。 |
| 自定义参数 | 用户自定义参数,会替换脚本中以 ${变量} 的内容。 |
| 部署方式 | 执行ChunJun任务的方式,比如local,standalone等。 |
| 选项参数 | 支持 `-confProp "{\"flink.checkpoint.interval\":60000}"` 格式。 |
## 任务样例
@ -44,4 +35,4 @@ ChunJun 任务类型,用于执行 ChunJun 程序。对于 ChunJun 节点,wor
### 配置 ChunJun 任务节点
从 Hive 中读取数据,所以需要自定义 json,可参考:[Hive Json Template](https://github.com/DTStack/chunjun/blob/master/chunjun-examples/json/hive/binlog_hive.json)
从 Hive 中读取数据,所以需要自定义 json,可参考:[Hive Json Template](https://github.com/DTStack/chunjun/blob/master/chunjun-examples/json/hive/binlog_hive.json)

20
docs/docs/zh/guide/task/conditions.md

@ -9,20 +9,12 @@ Conditions 是一个条件节点,根据上游任务运行状态,判断应该
## 任务参数
- 节点名称:设置任务的名称,一个工作流定义中的节点名称是唯一的。
- 运行标志:标识这个节点是否能正常调度,如果不需要执行,可以打开禁止执行开关。
- 描述信息:描述该节点的功能。
- 任务优先级:worker 线程数不足时,根据优先级从高到低依次执行,优先级一样时根据先进先出原则执行。
- Worker 分组:任务分配给 worker 组的机器机执行,选择 Default,会随机选择一台 worker 机执行。
- 失败重试次数:任务失败重新提交的次数,支持下拉和手填。
- 失败重试间隔:任务失败重新提交任务的时间间隔,支持下拉和手填。
- 超时告警:勾选超时告警、超时失败,当任务超过"超时时长"后,会发送告警邮件并且任务执行失败.
- 下游任务选择:根据前置任务的状态来跳转到对应的分支,目前支持两个分支:成功、失败
- 成功:当上游运行成功时,运行成功选择的分支
- 失败:当上游运行失败时,运行失败选择的分支
- 上游条件选择:可以为 Conditions 任务选择一个或多个上游任务
- 增加上游依赖:通过选择第一个参数选择对应的任务名称,通过第二个参数选择触发的 Conditions 任务的状态
- 上游任务关系选择:当有多个上游任务时,可以通过`且`以及`或`操作符实现任务的复杂关系。
- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md#默认任务参数)。
| **任务参数** | **描述** |
|----------|---------------------------------------------------------------------------------------------------------------------------------------|
| 下游任务选择 | 根据前置任务的状态来跳转到对应的分支:成功分支 - 当上游运行成功时,运行成功选择的分支;失败分支 - 当上游运行失败时,运行失败选择的分支 |
| 上游条件选择 | 可以为 Conditions 任务选择一个或多个上游任务:增加上游依赖 - 通过选择第一个参数选择对应的任务名称,通过第二个参数选择触发的 Conditions 任务的状态;上游任务关系选择 - 当有多个上游任务时,可以通过`且`以及`或`操作符实现任务的复杂关系。 |
## 相关任务

41
docs/docs/zh/guide/task/datax.md

@ -11,31 +11,20 @@ DataX 任务类型,用于执行 DataX 程序。对于 DataX 节点,worker
## 任务参数
- 节点名称:设置任务节点的名称。一个工作流定义中的节点名称是唯一的。
- 运行标志:标识这个结点是否能正常调度,如果不需要执行,可以打开禁止执行开关。
- 描述:描述该节点的功能。
- 任务优先级:worker 线程数不足时,根据优先级从高到低依次执行,优先级一样时根据先进先出原则执行。
- Worker 分组:任务分配给 worker 组的机器执行,选择 Default ,会随机选择一台 worker 机执行。
- 环境名称:配置运行脚本的环境。
- 失败重试次数:任务失败重新提交的次数。
- 失败重试间隔:任务失败重新提交任务的时间间隔,以分为单位。
- Cpu 配额: 为执行的任务分配指定的CPU时间配额,单位百分比,默认-1代表不限制,例如1个核心的CPU满载是100%,16个核心的是1600%。这个功能由 [task.resource.limit.state](../../architecture/configuration.md) 控制
- 最大内存:为执行的任务分配指定的内存大小,超过会触发OOM被Kill同时不会进行自动重试,单位MB,默认-1代表不限制。这个功能由 [task.resource.limit.state](../../architecture/configuration.md) 控制
- 延时执行时间:任务延迟执行的时间,以分为单位。
- 超时警告:勾选超时警告、超时失败,当任务超过“超时时长”后,会发送告警邮件并且任务执行失败。
- 自定义模板:当默认提供的数据源不满足所需要求的时,可自定义 datax 节点的 json 配置文件内容。
- json:DataX 同步的 json 配置文件。
- 资源:在使用自定义json中如果集群开启了kerberos认证后,datax读取或者写入hdfs、hbase等插件时需要使用相关的keytab,xml文件等,则可使用改选项。资源中心-文件管理上传或创建的文件
- 自定义参数:sql 任务类型,而存储过程是自定义参数顺序的给方法设置值自定义参数类型和数据类型同存储过程任务类型一样。区别在于SQL任务类型自定义参数会替换 sql 语句中 ${变量}。
- 数据源:选择抽取数据的数据源。
- sql 语句:目标库抽取数据的 sql 语句,节点执行时自动解析 sql 查询列名,映射为目标表同步列名,源表和目标表列名不一致时,可以通过列别名(as)转换。
- 目标库:选择数据同步的目标库。
- 目标库前置 sql:前置 sql 在 sql 语句之前执行(目标库执行)。
- 目标库后置 sql:后置 sql 在 sql 语句之后执行(目标库执行)。
- 限流(字节数):限制查询的字节数。
- 限流(记录数):限制查询的记录数。
- 运行内存:可根据实际生产环境配置所需的最小和最大内存。
- 前置任务:选择当前任务的前置任务,会将被选择的前置任务设置为当前任务的上游。
- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md#默认任务参数)。
| **任务参数** | **描述** |
|----------|-------------------------------------------------------------------------------------------------------|
| json | DataX 同步的 json 配置文件 |
| 资源 | 在使用自定义json时,如果集群开启了kerberos认证,datax读取或者写入hdfs、hbase等插件需要使用相关的keytab、xml文件等,则可使用该选项引用资源中心-文件管理中上传或创建的文件 |
| 自定义参数 | 自定义参数类型和数据类型同存储过程任务类型一样,区别在于SQL任务类型的自定义参数会替换 sql 语句中的 ${变量} |
| 数据源 | 选择抽取数据的数据源 |
| sql 语句 | 目标库抽取数据的 sql 语句,节点执行时自动解析 sql 查询列名,映射为目标表同步列名,源表和目标表列名不一致时,可以通过列别名(as)转换 |
| 目标库 | 选择数据同步的目标库 |
| 目标库前置 sql | 前置 sql 在 sql 语句之前执行(目标库执行) |
| 目标库后置 sql | 后置 sql 在 sql 语句之后执行(目标库执行) |
| 限流(字节数) | 限制查询的字节数 |
| 限流(记录数) | 限制查询的记录数 |
## 任务样例
@ -63,4 +52,4 @@ DataX 任务类型,用于执行 DataX 程序。对于 DataX 节点,worker
## 注意事项:
若默认提供的数据源不满足需求,可在自定义模板选项中,根据实际使用环境来配置 DataX 的 writer 和 reader,可参考:https://github.com/alibaba/DataX
若默认提供的数据源不满足需求,可在自定义模板选项中,根据实际使用环境来配置 DataX 的 writer 和 reader,可参考:https://github.com/alibaba/DataX

13
docs/docs/zh/guide/task/dependent.md

@ -11,17 +11,8 @@ Dependent 节点,就是**依赖检查节点**。比如 A 流程依赖昨天的
## 任务参数
- 节点名称:设置任务节点的名称。一个工作流定义中的节点名称是唯一的。
- 运行标志:标识这个结点是否能正常调度,如果不需要执行,可以打开禁止执行开关。
- 描述:描述该节点的功能。
- 任务优先级:worker 线程数不足时,根据优先级从高到低依次执行,优先级一样时根据先进先出原则执行。
- Worker 分组:任务分配给 worker 组的机器执行,选择 Default ,会随机选择一台 worker 机执行。
- 环境名称:配置运行脚本的环境。
- 失败重试次数:任务失败重新提交的次数。
- 失败重试间隔:任务失败重新提交任务的时间间隔,以分为单位。
- 超时警告:勾选超时警告、超时失败,当任务超过“超时时长”后,会发送告警邮件并且任务执行失败。
- 添加依赖:需要判断的依赖任务,可以是某一个项目中的工作流具体的任务执行情况。
- 前置任务:选择当前任务的前置任务,会将被选择的前置任务设置为当前任务的上游。
- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md#默认任务参数)。
- 此任务除上述链接中的默认参数外无其他参数。
## 任务样例

20
docs/docs/zh/guide/task/dinky.md

@ -12,20 +12,12 @@
## Task Parameter
| **参数** | **描述** |
|-------------|--------------------------------------------------------------------|
| 任务名称 | 设置任务的名称。一个工作流定义中的节点名称是唯一的。 |
| 运行标志 | 标识这个节点是否可以正常调度。如果不需要执行,可以打开禁止执行开关。 |
| 描述 | 描述该节点的功能。 |
| 任务优先级 | worker 线程数不足时,根据优先级从高到低依次执行,优先级一样时根据先进先出原则执行。 |
| Worker 分组 | 任务分配给 worker 组的机器机执行,选择 Default,会随机选择一台 worker 机执行。 |
| 任务组名 | 资源中的组,如果未配置,将不会使用。 |
| 环境名称 | 配置运行脚本的环境。 |
| 失败重试次数 | 任务失败重新提交的次数,支持下拉和手填。 |
| 失败重试间隔 | 任务失败重新提交任务的时间间隔,支持下拉和手填。 |
| 超时告警 | 勾选超时告警、超时失败,当任务超过"超时时长"后,会发送告警邮件并且任务执行失败. |
| Dinky 地址 | Dinky 服务的 url。 |
| Dinky 任务 ID | Dinky 作业对应的唯一ID。 |
- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md#默认任务参数)。
| **任务参数** | **描述** |
|-------------|---------------------------------------------------------------------|
| Dinky 地址 | Dinky 服务的 url。 |
| Dinky 任务 ID | Dinky 作业对应的唯一ID。 |
| 上线作业 | 指定当前 Dinky 作业是否上线。若选择上线,则被提交的作业须处于已发布状态,且当前没有对应的 Flink Job 实例在运行,才可提交成功。 |
## Task Example

23
docs/docs/zh/guide/task/dvc.md

@ -17,28 +17,15 @@ DVC 组件用于在DS上使用DVC的数据版本管理功能,帮助用户简
## 任务样例
首先介绍一些DS通用参数
- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md#默认任务参数)。
- **节点名称** :设置任务的名称。一个工作流定义中的节点名称是唯一的。
- **运行标志** :标识这个节点是否能正常调度,如果不需要执行,可以打开禁止执行开关。
- **描述** :描述该节点的功能。
- **任务优先级** :worker 线程数不足时,根据优先级从高到低依次执行,优先级一样时根据先进先出原则执行。
- **Worker 分组** :任务分配给 worker 组的机器执行,选择 Default,会随机选择一台 worker 机执行。
- **环境名称** :配置运行脚本的环境。
- **失败重试次数** :任务失败重新提交的次数。
- **失败重试间隔** :任务失败重新提交任务的时间间隔,以分钟为单位。
- **延迟执行时间** :任务延迟执行的时间,以分钟为单位。
- **超时告警** :勾选超时告警、超时失败,当任务超过"超时时长"后,会发送告警邮件并且任务执行失败。
- **前置任务** :选择当前任务的前置任务,会将被选择的前置任务设置为当前任务的上游。
以下是一些DVC 组件的常用参数
- **DVC任务类型** :可以选择 Upload、Download、Init DVC。
- **DVC仓库** :任务执行时关联的仓库地址。
| **任务参数** | **描述** |
|----------|-------------------------------|
| DVC任务类型 | 可以选择 Upload、Download、Init DVC |
| DVC仓库 | 任务执行时关联的仓库地址 |
### Init DVC
将git仓库初始化为DVC仓库, 并绑定数据储存的地方。
项目初始化后,仍然为git仓库,不过添加了DVC的特性。

31
docs/docs/zh/guide/task/emr.md

@ -2,31 +2,31 @@
## 综述
Amazon EMR 任务类型,用于在AWS上操作EMR集群并执行计算任务。
Amazon EMR 任务类型,用于在AWS上操作EMR集群并执行计算任务。
后台使用 [aws-java-sdk](https://aws.amazon.com/cn/sdk-for-java/) 将JSON参数转换为任务对象,提交到AWS,目前支持两种程序类型:
* `RUN_JOB_FLOW` 使用 [API_RunJobFlow](https://docs.aws.amazon.com/emr/latest/APIReference/API_RunJobFlow.html#API_RunJobFlow_Examples) 提交 [RunJobFlowRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticmapreduce/model/RunJobFlowRequest.html) 对象
* `ADD_JOB_FLOW_STEPS` 使用 [API_AddJobFlowSteps](https://docs.aws.amazon.com/emr/latest/APIReference/API_AddJobFlowSteps.html#API_AddJobFlowSteps_Examples) 提交 [AddJobFlowStepsRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticmapreduce/model/AddJobFlowStepsRequest.html) 对象
## 任务参数
- 节点名称:一个工作流定义中的节点名称是唯一的。
- 运行标志:标识这个节点是否能正常调度,如果不需要执行,可以打开禁止执行开关。
- 描述信息:描述该节点的功能。
- 任务优先级:worker线程数不足时,根据优先级从高到低依次执行,优先级一样时根据先进先出原则执行。
- Worker分组:任务分配给worker组的机器机执行,选择Default,会随机选择一台worker机执行。
- 失败重试次数:任务失败重新提交的次数,支持下拉和手填。
- 失败重试间隔:任务失败重新提交任务的时间间隔,支持下拉和手填。
- 超时告警:勾选超时告警、超时失败,当任务超过"超时时长"后,会发送告警邮件并且任务执行失败.
- 程序类型:选择程序类型,如果是`RUN_JOB_FLOW`,则需要填写`jobFlowDefineJson`,如果是`ADD_JOB_FLOW_STEPS`,则需要填写`stepsDefineJson`。
- jobFlowDefineJson: [RunJobFlowRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticmapreduce/model/RunJobFlowRequest.html) 对象对应的JSON,详细JSON定义参见 [API_RunJobFlow_Examples](https://docs.aws.amazon.com/emr/latest/APIReference/API_RunJobFlow.html#API_RunJobFlow_Examples)
- stepsDefineJson:[AddJobFlowStepsRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticmapreduce/model/AddJobFlowStepsRequest.html) 对象对应的JSON,详细JSON定义参见 [API_AddJobFlowSteps_Examples](https://docs.aws.amazon.com/emr/latest/APIReference/API_AddJobFlowSteps.html#API_AddJobFlowSteps_Examples)
- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md#默认任务参数)。
| **任务参数** | **描述** |
|-------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 程序类型 | 选择程序类型,如果是`RUN_JOB_FLOW`,则需要填写`jobFlowDefineJson`,如果是`ADD_JOB_FLOW_STEPS`,则需要填写`stepsDefineJson` |
| jobFlowDefineJson | [RunJobFlowRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticmapreduce/model/RunJobFlowRequest.html) 对象对应的JSON,详细JSON定义参见 [API_RunJobFlow_Examples](https://docs.aws.amazon.com/emr/latest/APIReference/API_RunJobFlow.html#API_RunJobFlow_Examples) |
| stepsDefineJson | [AddJobFlowStepsRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticmapreduce/model/AddJobFlowStepsRequest.html) 对象对应的JSON,详细JSON定义参见 [API_AddJobFlowSteps_Examples](https://docs.aws.amazon.com/emr/latest/APIReference/API_AddJobFlowSteps.html#API_AddJobFlowSteps_Examples) |
## 任务样例
### 创建EMR集群并运行Steps
该样例展示了如何创建`RUN_JOB_FLOW`类型`EMR`任务节点,以执行`SparkPi`为例,该任务会创建一个`EMR`集群,并且执行`SparkPi`示例程序。
![RUN_JOB_FLOW](../../../../img/tasks/demo/emr_run_job_flow.png)
jobFlowDefineJson 参数样例
```json
{
"Name": "SparkPi",
@ -68,11 +68,13 @@ jobFlowDefineJson 参数样例
```
### 向运行中的EMR集群添加Step
该样例展示了如何创建`ADD_JOB_FLOW_STEPS`类型`EMR`任务节点,以执行`SparkPi`为例,该任务会向运行中的`EMR`集群添加一个`SparkPi`示例程序。
![ADD_JOB_FLOW_STEPS](../../../../img/tasks/demo/emr_add_job_flow_steps.png)
![JobFlowId](../../../../img/tasks/demo/emr_jobFlowId.png)
stepsDefineJson 参数样例
```json
{
"JobFlowId": "j-3V628TKAERHP8",
@ -95,5 +97,6 @@ stepsDefineJson 参数样例
## 注意事项:
- EMR 任务类型的故障转移尚未实现。目前,DolphinScheduler 仅支持对 yarn task type 进行故障转移。其他任务类型,如 EMR 任务、k8s 任务尚未准备好。
- `stepsDefineJson` 一个任务定义仅支持关联单个step,这样可以更好的保证任务状态的可靠性。
- EMR 任务类型的故障转移尚未实现。目前,DolphinScheduler 仅支持对 yarn task type 进行故障转移。其他任务类型,如 EMR 任务、k8s 任务尚未准备好。
- `stepsDefineJson` 一个任务定义仅支持关联单个step,这样可以更好的保证任务状态的可靠性。

51
docs/docs/zh/guide/task/flink.md

@ -15,34 +15,26 @@ Flink 任务类型,用于执行 Flink 程序。对于 Flink 节点:
## 任务参数
- 节点名称:设置任务的名称。一个工作流定义中的节点名称是唯一的。
- 运行标志:标识这个节点是否能正常调度,如果不需要执行,可以打开禁止执行开关。
- 描述:描述该节点的功能。
- 任务优先级:worker 线程数不足时,根据优先级从高到低依次执行,优先级一样时根据先进先出原则执行。
- Worker 分组:任务分配给 worker 组的机器执行,选择 Default,会随机选择一台 worker 机执行。
- 环境名称:配置运行脚本的环境。
- 失败重试次数:任务失败重新提交的次数。
- 失败重试间隔:任务失败重新提交任务的时间间隔,以分钟为单位。
- 延迟执行时间:任务延迟执行的时间,以分钟为单位。
- 超时告警:勾选超时告警、超时失败,当任务超过"超时时长"后,会发送告警邮件并且任务执行失败。
- 程序类型:支持 Java、Scala、 Python 和 SQL 四种语言。
- 主函数的 Class:Flink 程序的入口 Main Class 的**全路径**。
- 主程序包:执行 Flink 程序的 jar 包(通过资源中心上传)。
- 部署方式:支持 cluster、 local 和 application (Flink 1.11和之后的版本支持,参见 [Run an application in Application Mode](https://nightlies.apache.org/flink/flink-docs-release-1.11/ops/deployment/yarn_setup.html#run-an-application-in-application-mode)) 三种模式的部署。
- 初始化脚本:用于初始化会话上下文的脚本文件。
- 脚本:用户开发的应该执行的 SQL 脚本文件。
- Flink 版本:根据所需环境选择对应的版本即可。
- 任务名称(选填):Flink 程序的名称。
- jobManager 内存数:用于设置 jobManager 内存数,可根据实际生产环境设置对应的内存数。
- Slot 数量:用于设置 Slot 的数量,可根据实际生产环境设置对应的数量。
- taskManager 内存数:用于设置 taskManager 内存数,可根据实际生产环境设置对应的内存数。
- taskManager 数量:用于设置 taskManager 的数量,可根据实际生产环境设置对应的数量。
- 并行度:用于设置执行 Flink 任务的并行度。
- 主程序参数:设置 Flink 程序的输入参数,支持自定义参数变量的替换。
- 选项参数:支持 `--jar`、`--files`、`--archives`、`--conf` 格式。
- 资源:如果其他参数中引用了资源文件,需要在资源中选择指定。
- 自定义参数:是 Flink 局部的用户自定义参数,会替换脚本中以 ${变量} 的内容
- 前置任务:选择当前任务的前置任务,会将被选择的前置任务设置为当前任务的上游。
- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md#默认任务参数)。
| **任务参数** | **描述** |
|-----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 程序类型 | 支持 Java、Scala、 Python 和 SQL 四种语言 |
| 主函数的 Class | Flink 程序的入口 Main Class 的**全路径** |
| 主程序包 | 执行 Flink 程序的 jar 包(通过资源中心上传) |
| 部署方式 | 支持 cluster、 local 和 application (Flink 1.11和之后的版本支持,参见 [Run an application in Application Mode](https://nightlies.apache.org/flink/flink-docs-release-1.11/ops/deployment/yarn_setup.html#run-an-application-in-application-mode)) 三种模式的部署 |
| 初始化脚本 | 用于初始化会话上下文的脚本文件 |
| 脚本 | 用户开发的应该执行的 SQL 脚本文件 |
| Flink 版本 | 根据所需环境选择对应的版本即可 |
| 任务名称(选填) | Flink 程序的名称 |
| jobManager 内存数 | 用于设置 jobManager 内存数,可根据实际生产环境设置对应的内存数 |
| Slot 数量 | 用于设置 Slot 的数量,可根据实际生产环境设置对应的数量 |
| taskManager 内存数 | 用于设置 taskManager 内存数,可根据实际生产环境设置对应的内存数 |
| taskManager 数量 | 用于设置 taskManager 的数量,可根据实际生产环境设置对应的数量 |
| 并行度 | 用于设置执行 Flink 任务的并行度 |
| 主程序参数 | 设置 Flink 程序的输入参数,支持自定义参数变量的替换 |
| 选项参数 | 支持 `--jar`、`--files`、`--archives`、`--conf` 格式 |
| 自定义参数 | 是 Flink 局部的用户自定义参数,会替换脚本中以 ${变量} 的内容 |
## 任务样例
@ -56,7 +48,7 @@ Flink 任务类型,用于执行 Flink 程序。对于 Flink 节点:
![flink-configure](../../../../img/tasks/demo/flink_task01.png)
#### 上传主程序包
#### 上传主程序包
在使用 Flink 任务节点时,需要利用资源中心上传执行程序的 jar 包,可参考[资源中心](../resource/configuration.md)。
@ -81,3 +73,4 @@ Flink 任务类型,用于执行 Flink 程序。对于 Flink 节点:
- Java 和 Scala 只是用来标识,没有区别,如果是 Python 开发的 Flink 则没有主函数的 class,其余的都一样。
- 使用 SQL 执行 Flink SQL 任务,目前只支持 Flink 1.13及以上版本。

28
docs/docs/zh/guide/task/hive-cli.md

@ -22,26 +22,14 @@
## 任务参数
- 前置任务:选择当前任务的前置任务,会将被选择的前置任务设置为当前任务的上游。
| **任务参数** | **描述** |
|---------------|-------------------------------------------------------------------------------------------------------------------------------------|
| 任务名称 | 设置任务的名称。一个工作流定义中的节点名称是唯一的。 |
| 运行标志 | 标识这个节点是否需要正常调度,如果不需要执行,可以打开禁止执行开关。 |
| 描述 | 描述该节点的功能。 |
| 任务优先级 | worker线程数不足时,根据优先级从高到低依次执行,优先级一样时根据先进先出原则执行。 |
| Worker分组 | 任务分配给worker组的机器机执行,选择Default,会随机选择一台worker机执行。 |
| 任务组名称 | 任务资源组,如果没有配置的话就不会生效。 |
| 环境名称 | 配置任务执行的环境。 |
| 失败重试次数 | 任务失败重新提交的次数,支持下拉和手填。 |
| 失败重试间隔 | 任务失败重新提交任务的时间间隔,支持下拉和手填。 |
| CPU 配额 | 为执行的任务分配指定的CPU时间配额,单位百分比,默认-1代表不限制,例如1个核心的CPU满载是100%,16个核心的是1600%。 [task.resource.limit.state](../../architecture/configuration.md) |
| 最大内存 | 为执行的任务分配指定的内存大小,超过会触发OOM被Kill同时不会进行自动重试,单位MB,默认-1代表不限制。这个功能由 [task.resource.limit.state](../../architecture/configuration.md) 控制。 |
| 超时告警 | 勾选超时告警、超时失败,当任务超过"超时时长"后,会发送告警邮件并且任务执行失败.这个功能由 [task.resource.limit.state](../../architecture/configuration.md) 控制。 |
| Hive Cli 任务类型 | Hive Cli任务执行方式,可以选择`FROM_SCRIPT`或者`FROM_FILE`。 |
| Hive SQL 脚本 | 手动填入您的Hive SQL脚本语句。 |
| Hive Cli 选项 | Hive Cli的其他选项,如`--verbose`。 |
| 资源 | 如果您选择`FROM_FILE`作为Hive Cli任务类型,您需要在资源中选择Hive SQL文件。 |
- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md#默认任务参数)。
| **任务参数** | **描述** |
|---------------|-----------------------------------------------------|
| Hive Cli 任务类型 | Hive Cli任务执行方式,可以选择`FROM_SCRIPT`或者`FROM_FILE`。 |
| Hive SQL 脚本 | 手动填入您的Hive SQL脚本语句。 |
| Hive Cli 选项 | Hive Cli的其他选项,如`--verbose`。 |
| 资源 | 如果您选择`FROM_FILE`作为Hive Cli任务类型,您需要在资源中选择Hive SQL文件。 |
## 任务样例

31
docs/docs/zh/guide/task/http.md

@ -12,23 +12,16 @@
## 任务参数
- 节点名称:设置任务的名称。一个工作流定义中的节点名称是唯一的。
- 运行标志:标识这个节点是否能正常调度,如果不需要执行,可以打开禁止执行开关。
- 描述:描述该节点的功能。
- 任务优先级:worker 线程数不足时,根据优先级从高到低依次执行,优先级一样时根据先进先出原则执行。
- Worker 分组:任务分配给 worker 组的机器机执行,选择 Default,会随机选择一台 worker 机执行。
- 环境名称:配置运行任务的环境。
- 失败重试次数:任务失败重新提交的次数,支持下拉和手填。
- 失败重试间隔:任务失败重新提交任务的时间间隔,支持下拉和手填。
- 延迟执行时间:任务延迟执行的时间,以分为单位。
- 超时告警:勾选超时告警、超时失败,当任务超过"超时时长"后,会发送告警邮件并且任务执行失败。
- 请求地址:http 请求 URL。
- 请求类型:支持 GET、POST、HEAD、PUT、DELETE。
- 请求参数:支持 Parameter、Body、Headers。
- 校验条件:支持默认响应码、自定义响应码、内容包含、内容不包含。
- 校验内容:当校验条件选择自定义响应码、内容包含、内容不包含时,需填写校验内容。
- 自定义参数:是 http 局部的用户自定义参数,会替换脚本中以 ${变量} 的内容。
- 前置任务:选择当前任务的前置任务,会将被选择的前置任务设置为当前任务的上游。
- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md#默认任务参数)。
| **任务参数** | **描述** |
|----------|-------------------------------------|
| 请求地址 | http 请求 URL |
| 请求类型 | 支持 GET、POST、HEAD、PUT、DELETE |
| 请求参数 | 支持 Parameter、Body、Headers |
| 校验条件 | 支持默认响应码、自定义响应码、内容包含、内容不包含,各项含义可参考表格下方的示意代码 |
| 校验内容 | 当校验条件选择自定义响应码、内容包含、内容不包含时,需填写校验内容 |
| 自定义参数 | 是 http 局部的用户自定义参数,会替换脚本中以 ${变量} 的内容 |
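下面是一段仅作示意的 Java 代码(并非 HTTP 任务插件的实现),用于说明"请求地址、请求类型、请求参数、校验条件(默认响应码)、校验内容(内容包含/不包含)"这些参数各自对应的含义,其中的 URL、参数和校验内容均为假设值:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class HttpTaskSketch {
    public static void main(String[] args) throws Exception {
        // 请求地址 + Parameter 形式的请求参数(假设值)
        URL url = new URL("http://example.com/login?userName=admin&userPassword=123456");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET"); // 请求类型
        int code = conn.getResponseCode(); // "默认响应码"校验条件检查的就是该返回码
        System.out.println("response code: " + code);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line);
            }
            // "内容包含/内容不包含"校验条件针对的是响应体中的内容("success" 仅为示例校验内容)
            System.out.println("body contains 'success': " + body.toString().contains("success"));
        }
    }
}
```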
## 任务样例
@ -38,8 +31,8 @@ HTTP 定义了与服务器交互的不同方法,最基本的方法有4种,
- URL:访问目标资源的地址,这里为系统的登录页面。
- HTTP Parameters
- userName:用户名;
- userPassword:用户登录密码。
- userName:用户名;
- userPassword:用户登录密码。
![http_task](../../../../img/tasks/demo/http_task01.png)

42
docs/docs/zh/guide/task/java.md

@ -0,0 +1,42 @@
# JAVA 节点
## 综述
该节点用于执行 java 类型的任务,支持使用单文件和jar包作为程序入口。
## 创建任务
- 点击项目管理 -> 项目名称 -> 工作流定义,点击"创建工作流"按钮,进入 DAG 编辑页面;
- 拖动工具栏的JAVA任务节点到画板中。
## 任务参数
- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md#默认任务参数)。
| **任务参数** | **描述** |
|----------|---------------------------------------------------------------|
| 模块路径 | 开启使用JAVA9+的模块化特性,把所有资源放入--module-path中,要求您的worker中的JDK版本支持模块化 |
| 主程序参数 | 作为普通Java程序main方法入口参数 |
| 虚拟机参数 | 配置启动虚拟机参数 |
| 脚本 | 若使用JAVA运行类型则需要编写JAVA代码。代码中必须存在public类,不用写package语句 |
| 资源 | 可以是外部JAR包也可以是其他资源文件,它们都会被加入到类路径或模块路径中,您可以在自己的JAVA脚本中轻松获取 |
## 任务样例
java任务类型有两种运行模式,这里以JAVA模式为例进行演示。
主要配置参数如下:
- 运行类型
- 模块路径
- 主程序参数
- 虚拟机参数
- 脚本文件
![java_task](../../../../img/tasks/demo/java_task02.png)
## 注意事项
使用JAVA运行类型时代码中必须存在public类,可以不写package语句
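如下是一个最简单的示意脚本(仅供参考,并非固定写法):包含 public 类、不写 package 语句,"主程序参数"中配置的内容会作为 main 方法的 args 传入:

```java
public class HelloDolphinScheduler {
    public static void main(String[] args) {
        // "主程序参数"中配置的内容会出现在 args 中
        for (String arg : args) {
            System.out.println("arg: " + arg);
        }
        System.out.println("hello from the JAVA task");
    }
}
```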

53
docs/docs/zh/guide/task/jupyter.md

@ -6,20 +6,20 @@
点击[这里](https://papermill.readthedocs.io/en/latest/) 获取更多关于`papermill`的信息。
## Conda虚拟环境配置
- 在`common.properties`配置`conda.path`,将其指向您的`conda.sh`。这里的`conda`应该是您用来管理您的 `papermill`和`jupyter`所在python环境的相同`conda`。
点击 [这里](https://docs.conda.io/en/latest/) 获取更多关于`conda`的信息.
点击 [这里](https://docs.conda.io/en/latest/) 获取更多关于`conda`的信息.
- `conda.path`默认设置为`/opt/anaconda3/etc/profile.d/conda.sh`。 如果您不清楚您的`conda`环境在哪里,只需要在命令行执行`conda info | grep -i 'base environment'`即可获得。
> 注意:`Jupyter任务插件`使用`source`命令激活conda环境,
> 如果您的租户没有`source`命令使用权限,`Jupyter任务插件`将无法使用。
> 注意:`Jupyter任务插件`使用`source`命令激活conda环境,
> 如果您的租户没有`source`命令使用权限,`Jupyter任务插件`将无法使用。
## Python依赖管理
### 使用预装好的Conda环境
1. 手动或使用`shell任务`在您的目标机器上创建conda环境。
2. 在您的`jupyter任务`中,将`condaEnvName`设置为您在上一步创建的conda环境名。
2. 在您的`jupyter任务`中,将`condaEnvName`设置为您在上一步创建的conda环境名。
### 使用打包的Conda环境
@ -27,7 +27,7 @@
2. 将您打包好的conda环境上传到`资源中心`.
3. 在您的`jupyter任务`资源设置中,添加您在上一步中上传的conda环境包,如`jupyter_env.tar.gz`.
> **_提示:_** 请您按照 [Conda-Pack](https://conda.github.io/conda-pack/) 官方指导打包conda环境,
> **_提示:_** 请您按照 [Conda-Pack](https://conda.github.io/conda-pack/) 官方指导打包conda环境,
> 正确打包出的conda环境包解压后文件目录结构应和下图完全一致:
```
@ -39,11 +39,11 @@
├── lib
├── share
└── ssl
```
```
> 注意: 请严格按照上述`conda pack`指示操作,并且不要随意修改`bin/activate`。
> `Jupyter任务插件`使用`source`命令激活您打包的conda环境。
> 若您对使用`source`命令有安全性上的担忧,请使用其他方法管理您的python依赖。
> 若您对使用`source`命令有安全性上的担忧,请使用其他方法管理您的python依赖。
### 由依赖需求文本文件临时构建
@ -52,7 +52,7 @@
3. 在您`jupyter任务`的`资源`中选取您的python依赖需求文本文件,如`requirements.txt`。
如下是一个依赖需求文本文件的样例,通过该文件,`jupyter任务插件`会自动构建您的python依赖,并执行您的python代码,
执行完成后会自动释放临时构建的环境。
执行完成后会自动释放临时构建的环境。
```text
fastjsonschema==2.15.3
@ -92,7 +92,7 @@ packaging==21.3
pandas==1.4.2
pandocfilters==1.5.0
papermill==2.3.4
```
```
## 创建任务
@ -101,26 +101,19 @@ papermill==2.3.4
## 任务参数
- 任务名称:设置任务的名称。一个工作流定义中的节点名称是唯一的。
- 运行标志:标识这个节点是否能正常调度,如果不需要执行,可以打开禁止执行开关。
- 描述:描述该节点的功能。
- 任务优先级:worker线程数不足时,根据优先级从高到低依次执行,优先级一样时根据先进先出原则执行。
- Worker分组:任务分配给worker组的机器机执行,选择Default,会随机选择一台worker机执行。
- 失败重试次数:任务失败重新提交的次数,支持下拉和手填。
- 失败重试间隔:任务失败重新提交任务的时间间隔,支持下拉和手填。
- Cpu 配额: 为执行的任务分配指定的CPU时间配额,单位百分比,默认-1代表不限制,例如1个核心的CPU满载是100%,16个核心的是1600%。
- 最大内存:为执行的任务分配指定的内存大小,超过会触发OOM被Kill同时不会进行自动重试,单位MB,默认-1代表不限制。这个功能由 [task.resource.limit.state](../../architecture/configuration.md) 控制
- 超时告警:勾选超时告警、超时失败,当任务超过"超时时长"后,会发送告警邮件并且任务执行失败.这个功能由 [task.resource.limit.state](../../architecture/configuration.md) 控制
- 前置任务:选择当前任务的前置任务,会将被选择的前置任务设置为当前任务的上游。
- Conda Env Name: Conda环境或打包的Conda环境包名称
- Input Note Path: 输入的jupyter note模板路径。
- Out Note Path: 输出的jupyter note路径。
- Jupyter Parameters: 用于对接jupyter note参数化的JSON格式参数。
- Kernel: Jupyter notebook 内核。
- Engine: 用于执行Jupyter note的引擎名称。
- Jupyter Execution Timeout: 对于每个jupyter notebook cell设定的超时时间。
- Jupyter Start Timeout: 对于jupyter notebook kernel设定的启动超时时间。
- Others: 传入papermill命令的其他参数。
- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md#默认任务参数)。
| **任务参数** | **描述** |
|---------------------------|------------------------------------|
| Conda Env Name | Conda环境或打包的Conda环境包名称 |
| Input Note Path | 输入的jupyter note模板路径 |
| Output Note Path | 输出的jupyter note路径 |
| Jupyter Parameters | 用于对接jupyter note参数化的JSON格式参数 |
| Kernel | Jupyter notebook 内核 |
| Engine | 用于执行Jupyter note的引擎名称 |
| Jupyter Execution Timeout | 对于每个jupyter notebook cell设定的超时时间 |
| Jupyter Start Timeout | 对于jupyter notebook kernel设定的启动超时时间 |
| Others | 传入papermill命令的其他参数 |
## 任务样例

27
docs/docs/zh/guide/task/kubernetes.md

@ -11,22 +11,15 @@ kubernetes任务类型,用于在kubernetes上执行一个短时和批处理的
## 任务参数
- 节点名称:设置任务的名称。一个工作流定义中的节点名称是唯一的。
- 运行标志:标识这个节点是否能正常调度,如果不需要执行,可以打开禁止执行开关。
- 描述:描述该节点的功能。
- 任务优先级:worker 线程数不足时,根据优先级从高到低依次执行,优先级一样时根据先进先出原则执行。
- Worker 分组:任务分配给 worker 组的机器执行,选择 Default 会随机选择一台 worker 机执行。
- 环境名称:配置运行任务的环境。
- 失败重试次数:任务失败重新提交的次数。
- 失败重试间隔:任务失败重新提交任务的时间间隔,以分为单位。
- 延迟执行时间:任务延迟执行的时间,以分为单位。
- 超时警告:勾选超时警告、超时失败,当任务超过“超时时长”后,会发送告警邮件并且任务执行失败。
- 命名空间:选择kubernetes集群上存在的命名空间
- 最小CPU:任务在kubernetes上运行所需的最小CPU
- 最小内存:任务在kubernetes上运行所需的最小内存
- 镜像:镜像地址
- 自定义参数:kubernetes任务局部的用户自定义参数,自定义参数最终会通过环境变量形式存在于容器中,提供给kubernetes任务使用
- 前置任务:在当前kubernetes任务之前需要执行的任务
- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md#默认任务参数)。
| **任务参数** | **描述** |
|----------|-----------------------------------------------------------------|
| 命名空间 | 选择kubernetes集群上存在的命名空间 |
| 最小CPU | 任务在kubernetes上运行所需的最小CPU |
| 最小内存 | 任务在kubernetes上运行所需的最小内存 |
| 镜像 | 镜像地址 |
| 自定义参数 | kubernetes任务局部的用户自定义参数,自定义参数最终会通过环境变量形式存在于容器中,提供给kubernetes任务使用(可参考表格下方的示意代码) |
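如下是一段示意性的 Java 代码(并非插件实现),说明"自定义参数会以环境变量形式存在于容器中"的含义:容器内运行的程序可以直接读取这些环境变量,其中的变量名 `MY_PARAM` 仅为假设值:

```java
public class ReadTaskParamSketch {
    public static void main(String[] args) {
        // 自定义参数以环境变量的形式注入容器,这里的变量名仅为示例
        String value = System.getenv("MY_PARAM");
        System.out.println("MY_PARAM = " + value);
    }
}
```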
## 任务样例
@ -42,4 +35,4 @@ kubernetes任务类型,用于在kubernetes上执行一个短时和批处理的
## 注意事项
任务名字限制在小写字母、数字和-这三种字符之中
任务名字限制在小写字母、数字和-这三种字符之中

46
docs/docs/zh/guide/task/map-reduce.md

@ -11,44 +11,32 @@ MapReduce(MR) 任务类型,用于执行 MapReduce 程序。对于 MapReduce
## 任务参数
- 节点名称:设置任务的名称。一个工作流定义中的节点名称是唯一的。
- 运行标志:标识这个节点是否能正常调度,如果不需要执行,可以打开禁止执行开关。
- 描述:描述该节点的功能。
- 任务优先级:worker 线程数不足时,根据优先级从高到低依次执行,优先级一样时根据先进先出原则执行。
- Worker 分组:任务分配给 worker 组的机器执行,选择Default,会随机选择一台 worker 机执行。
- 环境名称:配置运行脚本的环境。
- 失败重试次数:任务失败重新提交的次数。
- 失败重试间隔:任务失败重新提交任务的时间间隔,以分为单位。
- 延迟执行时间:任务延迟执行的时间,以分为单位。
- 超时告警:勾选超时告警、超时失败,当任务超过"超时时长"后,会发送告警邮件并且任务执行失败。
- 资源:是指脚本中需要调用的资源文件列表,资源中心-文件管理上传或创建的文件。
- 自定义参数:是 MapReduce 局部的用户自定义参数,会替换脚本中以 ${变量} 的内容。
- 前置任务:选择当前任务的前置任务,会将被选择的前置任务设置为当前任务的上游。
- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md#默认任务参数)。
### JAVA/SCALA 程序
- 程序类型:选择 JAVA/SCALA 语言。
- 主函数的 Class:是 MapReduce 程序的入口 Main Class 的**全路径**。
- 主程序包:执行 MapReduce 程序的 jar 包。
- 任务名称(选填):MapReduce 任务名称。
- 命令行参数:是设置 MapReduce 程序的输入参数,支持自定义参数变量的替换。
- 其他参数:支持 –D、-files、-libjars、-archives 格式。
- 资源: 如果其他参数中引用了资源文件,需要在资源中选择指定
- 自定义参数:是 MapReduce 局部的用户自定义参数,会替换脚本中以 ${变量} 的内容
| **任务参数** | **描述** |
|------------|------------------------------------------|
| 程序类型 | 选择 JAVA/SCALA 语言 |
| 主函数的 Class | 是 MapReduce 程序的入口 Main Class 的**全路径** |
| 主程序包 | 执行 MapReduce 程序的 jar 包 |
| 任务名称(选填) | MapReduce 任务名称 |
| 命令行参数 | 是设置 MapReduce 程序的输入参数,支持自定义参数变量的替换 |
| 其他参数 | 支持 -D、-files、-libjars、-archives 格式 |
| 自定义参数 | 是 MapReduce 局部的用户自定义参数,会替换脚本中以 ${变量} 的内容 |
### Python 程序
- 程序类型:选择 Python 语言。
- 主 jar 包:是运行 MapReduce 的 Python jar 包。
- 其他参数:支持 –D、-mapper、-reducer、-input -output格式,这里可以设置用户自定义参数的输入,比如:
- -mapper "mapper.py 1" -file mapper.py -reducer reducer.py -file reducer.py –input /journey/words.txt -output /journey/out/mr/${currentTimeMillis}
- 其中 -mapper 后的 mapper.py 1是两个参数,第一个参数是 mapper.py,第二个参数是 1。
- 资源: 如果其他参数中引用了资源文件,需要在资源中选择指定。
- 自定义参数:是 MapReduce 局部的用户自定义参数,会替换脚本中以 ${变量} 的内容。
| **任务参数** | **描述** |
|----------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 程序类型 | 选择 Python 语言 |
| 主 jar 包 | 是运行 MapReduce 的 Python jar 包 |
| 其他参数 | 支持 -D、-mapper、-reducer、-input、-output 格式,这里可以设置用户自定义参数的输入,比如:-mapper "mapper.py 1" -file mapper.py -reducer reducer.py -file reducer.py -input /journey/words.txt -output /journey/out/mr/${currentTimeMillis},其中 -mapper 后的 mapper.py 1是两个参数,第一个参数是 mapper.py,第二个参数是 1 |
| 自定义参数 | 是 MapReduce 局部的用户自定义参数,会替换脚本中以 ${变量} 的内容 |
## 任务样例
### 执行 WordCount 程序
### 执行 WordCount 程序
该样例为 MapReduce 应用中常见的入门类型,主要为统计输入的文本中,相同单词的数量有多少。

83
docs/docs/zh/guide/task/mlflow.md

@ -31,28 +31,16 @@ MLflow 组件用于执行 MLflow 任务,目前包含Mlflow Projects, 和MLflow
- 点击项目管理-项目名称-工作流定义,点击“创建工作流”按钮,进入 DAG 编辑页面;
- 拖动工具栏的 <img src="../../../../img/tasks/icons/mlflow.png" width="15"/> 任务节点到画板中。
## 任务样例
首先介绍一些DS通用参数
- **节点名称** :设置任务的名称。一个工作流定义中的节点名称是唯一的。
- **运行标志** :标识这个节点是否能正常调度,如果不需要执行,可以打开禁止执行开关。
- **描述** :描述该节点的功能。
- **任务优先级** :worker 线程数不足时,根据优先级从高到低依次执行,优先级一样时根据先进先出原则执行。
- **Worker 分组** :任务分配给 worker 组的机器执行,选择 Default,会随机选择一台 worker 机执行。
- **环境名称** :配置运行脚本的环境。
- **失败重试次数** :任务失败重新提交的次数。
- **失败重试间隔** :任务失败重新提交任务的时间间隔,以分钟为单位。
- **延迟执行时间** :任务延迟执行的时间,以分钟为单位。
- **超时告警** :勾选超时告警、超时失败,当任务超过"超时时长"后,会发送告警邮件并且任务执行失败。
- **前置任务** :选择当前任务的前置任务,会将被选择的前置任务设置为当前任务的上游。
- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md#默认任务参数)。
以下是一些MLflow 组件的常用参数
- **MLflow Tracking Server URI** :MLflow Tracking Server 的连接, 默认 http://localhost:5000。
- **实验名称** :任务运行时所在的实验,若实验不存在,则创建。若实验名称为空,则设置为`Default`, 与 MLflow 一样。
| **任务参数** | **描述** |
|----------------------------|---------------------------------------------------------|
| MLflow Tracking Server URI | MLflow Tracking Server 的连接,默认 http://localhost:5000 |
| 实验名称 | 任务运行时所在的实验,若实验不存在,则创建。若实验名称为空,则设置为`Default`,与 MLflow 一样 |
### MLflow Projects
@ -60,53 +48,44 @@ MLflow 组件用于执行 MLflow 任务,目前包含Mlflow Projects, 和MLflow
![mlflow-conda-env](../../../../img/tasks/demo/mlflow-basic-algorithm.png)
**任务参数**
- **注册模型** :是否注册模型,若选择注册,则会展开以下参数。
- **注册的模型名称** : 注册的模型名称,会在原来的基础上加上一个模型版本,并注册为Production。
- **数据路径** : 文件/文件夹的绝对路径, 若文件需以.csv结尾(自动切分训练集与测试集), 文件夹需包含train.csv和test.csv(建议方式,用户应自行构建测试集用于模型评估)。
详细的参数列表如下:
- [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)
- [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html?highlight=svc#sklearn.svm.SVC)
- [lightgbm](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier)
- [xgboost](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier)
- **算法** :选择的算法,目前基于 [scikit-learn](https://scikit-learn.org/) 形式支持 `lr`, `svm`, `lightgbm`, `xgboost`
- **参数搜索空间** : 运行对应算法的参数搜索空间, 可为空。如针对lightgbm 的 `max_depth=[5, 10];n_estimators=[100, 200]` 则会进行对应搜索。约定传入后会以;切分各个参数,等号前的名字作为参数名,等号后的名字将以python eval执行得到对应的参数值
| **任务参数** | **描述** |
|----------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 注册模型 | 是否注册模型,若选择注册,则会展开以下参数 |
| 注册的模型名称 | 注册的模型名称,会在原来的基础上加上一个模型版本,并注册为Production |
| 数据路径 | 文件/文件夹的绝对路径,若文件需以.csv结尾(自动切分训练集与测试集),文件夹需包含train.csv和test.csv(建议方式,用户应自行构建测试集用于模型评估)。详细的参数列表如下: [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression) [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html?highlight=svc#sklearn.svm.SVC) [lightgbm](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier) [xgboost](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier) |
| 算法 | 选择的算法,目前基于 [scikit-learn](https://scikit-learn.org/) 形式支持 `lr`,`svm`,`lightgbm`,`xgboost` |
| 参数搜索空间 | 运行对应算法的参数搜索空间,可为空。如针对lightgbm 的 `max_depth=[5, 10];n_estimators=[100, 200]` 则会进行对应搜索。约定传入后会以;切分各个参数,等号前的名字作为参数名,等号后的名字将以python eval执行得到对应的参数值 |
#### AutoML
![mlflow-automl](../../../../img/tasks/demo/mlflow-automl.png)
**任务参数**
- **注册模型** :是否注册模型,若选择注册,则会展开以下参数。
- **注册的模型名称** : 注册的模型名称,会在原来的基础上加上一个模型版本,并注册为Production。
- **数据路径** : 文件/文件夹的绝对路径, 若文件需以.csv结尾(自动切分训练集与测试集), 文件夹需包含train.csv和test.csv(建议方式,用户应自行构建测试集用于模型评估)。
- **参数** : 初始化AutoML训练器时的参数,可为空, 如针对 flaml 设置`time_budget=30;estimator_list=['lgbm']`。约定传入后会以; 切分各个参数,等号前的名字作为参数名,等号后的名字将以python eval执行得到对应的参数值。详细的参数列表如下:
- [flaml](https://microsoft.github.io/FLAML/docs/reference/automl#automl-objects)
- [autosklearn](https://automl.github.io/auto-sklearn/master/api.html)
- **AutoML工具** : 使用的AutoML工具,目前支持 [autosklearn](https://github.com/automl/auto-sklearn)
, [flaml](https://github.com/microsoft/FLAML)。
| **任务参数** | **描述** |
|----------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 注册模型 | 是否注册模型,若选择注册,则会展开以下参数 |
| 注册的模型名称 | 注册的模型名称,会在原来的基础上加上一个模型版本,并注册为Production |
| 数据路径 | 文件/文件夹的绝对路径,若文件需以.csv结尾(自动切分训练集与测试集),文件夹需包含train.csv和test.csv(建议方式,用户应自行构建测试集用于模型评估) |
| 参数 | 初始化AutoML训练器时的参数,可为空,如针对 flaml 设置`time_budget=30;estimator_list=['lgbm']`。约定传入后会以; 切分各个参数,等号前的名字作为参数名,等号后的名字将以python eval执行得到对应的参数值。详细的参数列表如下: [flaml](https://microsoft.github.io/FLAML/docs/reference/automl#automl-objects),[autosklearn](https://automl.github.io/auto-sklearn/master/api.html) |
| AutoML工具 | 使用的AutoML工具,目前支持 [autosklearn](https://github.com/automl/auto-sklearn),[flaml](https://github.com/microsoft/FLAML) |
#### Custom projects
![mlflow-custom-project.png](../../../../img/tasks/demo/mlflow-custom-project.png)
**任务参数**
- **参数** : `mlflow run`中的 --param-list 如 `-P learning_rate=0.2 -P colsample_bytree=0.8 -P subsample=0.9`
- **运行仓库** : MLflow Project的仓库地址,可以为github地址,或者worker上的目录, 如MLflow project位于子目录,可以添加 `#` 隔开, `https://github.com/mlflow/mlflow#examples/xgboost/xgboost_native`
- **项目版本** : 对应项目中git版本管理中的版本,默认 master
| **任务参数** | **描述** |
|----------|----------------------------------------------------------------------------------------------------------------------------------------------------|
| 参数 | `mlflow run`中的 --param-list 如 `-P learning_rate=0.2 -P colsample_bytree=0.8 -P subsample=0.9` |
| 运行仓库 | MLflow Project的仓库地址,可以为github地址,或者worker上的目录,如MLflow project位于子目录,可以添加 `#` 隔开,`https://github.com/mlflow/mlflow#examples/xgboost/xgboost_native` |
| 项目版本 | 对应项目中git版本管理中的版本,默认 master |
现在你可以使用这个功能来运行github上所有的MLflow Projects (如 [MLflow examples](https://github.com/mlflow/mlflow/tree/master/examples) )了。你也可以创建自己的机器学习库,用来复用你的研究成果,以后你就可以使用DolphinScheduler来一键操作使用你的算法库。
### MLflow Models
常用参数:
- **部署模型的URI** :MLflow 服务里面模型对应的URI, 支持 `models:/<model_name>/suffix` 格式 和 `runs:/` 格式
- **监听端口** :部署服务时的端口。
| **任务参数** | **描述** |
|----------|-----------------------------------------------------------------------|
| 部署模型的URI | MLflow 服务里面模型对应的URI,支持 `models:/<model_name>/suffix` 格式 和 `runs:/` 格式 |
| 监听端口 | 部署服务时的端口 |
#### MLFLOW
@ -120,8 +99,10 @@ MLflow 组件用于执行 MLflow 任务,目前包含Mlflow Projects, 和MLflow
![mlflow-models-docker-compose](../../../../img/tasks/demo/mlflow-models-docker-compose.png)
- **最大CPU限制** :如 `1.0` 或者 `0.5`, 与 docker compose 一致。
- **最大内存限制** :如 `1G` 或者 `500M`, 与 docker compose 一致。
| **任务参数** | **描述** |
|----------|--------------------------------------|
| 最大CPU限制 | 如 `1.0` 或者 `0.5`,与 docker compose 一致 |
| 最大内存限制 | 如 `1G` 或者 `500M`,与 docker compose 一致 |
## 环境准备

31
docs/docs/zh/guide/task/openmldb.md

@ -13,29 +13,14 @@ OpenMLDB任务组件可以连接OpenMLDB集群执行任务。
## 任务样例
首先介绍一些DS通用参数:
- **节点名称** :设置任务的名称。一个工作流定义中的节点名称是唯一的。
- **运行标志** :标识这个节点是否能正常调度,如果不需要执行,可以打开禁止执行开关。
- **描述** :描述该节点的功能。
- **任务优先级** :worker 线程数不足时,根据优先级从高到低依次执行,优先级一样时根据先进先出原则执行。
- **Worker 分组** :任务分配给 worker 组的机器执行,选择 Default,会随机选择一台 worker 机执行。
- **环境名称** :配置运行脚本的环境。
- **失败重试次数** :任务失败重新提交的次数。
- **失败重试间隔** :任务失败重新提交任务的时间间隔,以分钟为单位。
- **延迟执行时间** :任务延迟执行的时间,以分钟为单位。
- **超时告警** :勾选超时告警、超时失败,当任务超过"超时时长"后,会发送告警邮件并且任务执行失败。
- **前置任务** :选择当前任务的前置任务,会将被选择的前置任务设置为当前任务的上游。
### OpenMLDB 参数
**任务参数**
- **zookeeper地址** :OpenMLDB集群连接地址中的zookeeper地址, e.g. 127.0.0.1:2181。
- **zookeeper路径** : OpenMLDB集群连接地址中的zookeeper路径, e.g. /openmldb。
- **执行模式** :初始执行模式(离线/在线),你可以在sql语句中随时切换。
- **SQL语句** :SQL语句。
- 自定义参数:是PYTHON局部的用户自定义参数,会替换脚本中以${变量}的内容。
- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md#默认任务参数)。
| **任务参数** | **描述** |
|-------------|--------------------------------------------------|
| zookeeper地址 | OpenMLDB集群连接地址中的zookeeper地址, e.g. 127.0.0.1:2181 |
| zookeeper路径 | OpenMLDB集群连接地址中的zookeeper路径, e.g. /openmldb |
| 执行模式 | 初始执行模式(离线/在线),你可以在sql语句中随时切换 |
| SQL语句 | SQL语句 |
下面有几个例子:

15
docs/docs/zh/guide/task/pigeon.md

@ -8,12 +8,9 @@ Pigeon任务类型是通过调用远程websocket服务,实现远程任务的
## 任务参数
- 节点名称:一个工作流定义中的节点名称是唯一的。
- 运行标志:标识这个节点是否能正常调度,如果不需要执行,可以打开禁止执行开关。
- 描述信息:描述该节点的功能。
- 任务优先级:worker线程数不足时,根据优先级从高到低依次执行,优先级一样时根据先进先出原则执行。
- Worker分组:任务分配给worker组的机器机执行,选择Default,会随机选择一台worker机执行。
- 失败重试次数:任务失败重新提交的次数,支持下拉和手填。
- 失败重试间隔:任务失败重新提交任务的时间间隔,支持下拉和手填。
- 超时告警:勾选超时告警、超时失败,当任务超过"超时时长"后,会发送告警邮件并且任务执行失败.
- 目标任务名:输入Pigeon任务的目标任务名称
- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md#默认任务参数)。
| **任务参数** | **描述** |
|----------|-------------------|
| 目标任务名 | 输入Pigeon任务的目标任务名称 |

21
docs/docs/zh/guide/task/python.md

@ -12,21 +12,12 @@ Python 任务类型,用于创建 Python 类型的任务并执行一系列的 P
## 任务参数
- 任务名称:设置任务的名称。一个工作流定义中的节点名称是唯一的。
- 运行标志:标识这个节点是否能正常调度,如果不需要执行,可以打开禁止执行开关。
- 描述:描述该节点的功能。
- 任务优先级:worker线程数不足时,根据优先级从高到低依次执行,优先级一样时根据先进先出原则执行。
- Worker分组:任务分配给worker组的机器机执行,选择Default,会随机选择一台worker机执行。
- 环境名称:配置运行脚本的环境。
- 失败重试次数:任务失败重新提交的次数,支持下拉和手填。
- 失败重试间隔:任务失败重新提交任务的时间间隔,支持下拉和手填。
- Cpu 配额: 为执行的任务分配指定的CPU时间配额,单位百分比,默认-1代表不限制,例如1个核心的CPU满载是100%,16个核心的是1600%。这个功能由 [task.resource.limit.state](../../architecture/configuration.md) 控制
- 最大内存:为执行的任务分配指定的内存大小,超过会触发OOM被Kill同时不会进行自动重试,单位MB,默认-1代表不限制。这个功能由 [task.resource.limit.state](../../architecture/configuration.md) 控制
- 超时告警:勾选超时告警、超时失败,当任务超过"超时时长"后,会发送告警邮件并且任务执行失败.
- 脚本:用户开发的PYTHON程序。
- 资源:是指脚本中需要调用的资源文件列表,资源中心-文件管理上传或创建的文件。
- 自定义参数:是PYTHON局部的用户自定义参数,会替换脚本中以${变量}的内容。
- 前置任务:选择当前任务的前置任务,会将被选择的前置任务设置为当前任务的上游。
- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md#默认任务参数)。
| **任务参数** | **描述** |
|----------|-----------------------------------|
| 脚本 | 用户开发的PYTHON程序 |
| 自定义参数 | 是PYTHON局部的用户自定义参数,会替换脚本中以${变量}的内容 |
## 任务样例

55
docs/docs/zh/guide/task/pytorch.md

@ -13,57 +13,37 @@
- 点击项目管理-项目名称-工作流定义,点击“创建工作流”按钮,进入 DAG 编辑页面;
- 拖动工具栏的 <img src="../../../../img/tasks/icons/pytorch.png" width="15"/> 任务节点到画板中。
## 任务样例
组件图示如下:
![pytorch](../../../../img/tasks/demo/pytorch_en.png)
### 首先介绍一些DS通用参数
- **节点名称** :设置任务的名称。一个工作流定义中的节点名称是唯一的。
- **运行标志** :标识这个节点是否能正常调度,如果不需要执行,可以打开禁止执行开关。
- **描述** :描述该节点的功能。
- **任务优先级** :worker 线程数不足时,根据优先级从高到低依次执行,优先级一样时根据先进先出原则执行。
- **Worker 分组** :任务分配给 worker 组的机器执行,选择 Default,会随机选择一台 worker 机执行。
- **环境名称** :配置运行脚本的环境。
- **失败重试次数** :任务失败重新提交的次数。
- **失败重试间隔** :任务失败重新提交任务的时间间隔,以分钟为单位。
- **延迟执行时间** :任务延迟执行的时间,以分钟为单位。
- **资源**:是指脚本中需要调用的资源文件列表,资源中心-文件管理上传或创建的文件。
- **自定义参数**:是 SHELL 局部的用户自定义参数,会替换脚本中以 `${变量}` 的内容。
- **前置任务** :选择当前任务的前置任务,会将被选择的前置任务设置为当前任务的上游。
### Pytorch参数
### Pytorch组件独有的参数
- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md#默认任务参数)。
#### 运行参数
- **python脚本** :需要运行的python脚本文件入口。
- **脚本启动参数** :运行时的输入参数。
| **任务参数** | **描述** |
|----------|-------------------|
| python脚本 | 需要运行的python脚本文件入口 |
| 脚本启动参数 | 运行时的输入参数 |
以上为最小化配置运行所需的两个参数。另外还提供如下可选配置参数,当选择"展开更多配置"时,可以配置这些参数。
- **python项目地址** :设置`PYTHONPATH`环境变量,设置后运行python脚本时可以加载该地址下的python包/项目代码。支持本地路径或者Git url
- 若为本地路径,作为`PYTHONPATH`环境变量。
- 如果为Git URL (以`git@ | https:// | http:// `前缀),则会下载项目,并将下载后存放地址作为新的**python项目地址**,若需要运行子文件夹下的项目,可以添加 `#subdirectory` 来配置
| **任务参数** | **描述** |
|------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| python项目地址 | 设置`PYTHONPATH`环境变量,设置后运行python脚本时可以加载该地址下的python包/项目代码。支持本地路径或者Git url。若为本地路径,作为`PYTHONPATH`环境变量,如果为Git URL (以`git@ | https:// | http:// `前缀),则会下载项目,并将下载后存放地址作为新的**python项目地址**,若需要运行子文件夹下的项目,可以添加 `#subdirectory` 来配置 |
#### python环境参数
- **是否创建新环境** :是否创建新的python环境来运行该任务。
*否*
- **python命令路径** :如`/usr/bin/python`,默认为DS环境配置中的`${PYTHON_HOME}`。
*是*
- **python环境管理工具** :可以选择virtualenv或者conda。
- 若选择`virtualenv`,则会用`virtualenv`创建一个新环境,使用命令 `virtualenv -p ${PYTHON_HOME} venv` 创建
- 若选择`conda`, 则会使用`conda` 创建一个新环境,并需要指定创建的python版本。
- **依赖文件** :默认为 requirements.txt。
| **任务参数** | **描述** |
|--------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| 是否创建新环境 | 是否创建新的python环境来运行该任务 |
| python命令路径 | 如`/usr/bin/python`,默认为DS环境配置中的`${PYTHON_HOME}` |
| python环境管理工具 | 可以选择virtualenv或者conda,若选择`virtualenv`,则会用`virtualenv`创建一个新环境,使用命令 `virtualenv -p ${PYTHON_HOME} venv` 创建;若选择`conda`, 则会使用`conda` 创建一个新环境,并需要指定创建的python版本 |
| 依赖文件 | 默认为 requirements.txt |
若配置了`python项目地址`参数,则`python脚本`和`依赖文件`参数允许输入相对路径。
@ -75,7 +55,6 @@
![pytorch_note](../../../../img/tasks/demo/pytorch_note_en.png)
另外如果代码存放在资源中心,则可以使用`资源`参数下载代码,并将相关参数写成对应资源的路径即可。
## 环境配置
@ -89,6 +68,7 @@
### 使用Conda创建新环境
适用于新建环境运行该项目,需要在`安全中心`-`环境管理`中创建环境,参考如下配置并修改为实际环境即可。
```shell
# conda命令对应的目录加入PATH中
export PATH=$HOME/anaconda3/bin:$PATH
@ -104,9 +84,8 @@ export PATH=/home/lucky/anaconda3/bin:$PATH
export PYTHON_HOME=/usr/local/bin/python3.7
```
## 其他
本组件也可以运行xgboost, lightgbm, sklearn, tensorflow, keras 等项目。本组件可作为python组件运行机器学习任务的升级组件。
如果有需要,后续建议可以统一涵盖为PythonML组件,来运行机器学习项目。
如果有需要,后续建议可以统一涵盖为PythonML组件,来运行机器学习项目。

32
docs/docs/zh/guide/task/sagemaker.md

@ -8,45 +8,30 @@
对于使用大数据与人工智能的用户,SageMaker 任务组件可以帮助用户将大数据工作流与 SageMaker 的使用场景串联起来。
DolphinScheduler SageMaker 组件的功能:
- 启动 SageMaker Pipeline Execution,并持续获取状态,直至Pipeline执行完成。
DolphinScheduler SageMaker 组件的功能:
- 启动 SageMaker Pipeline Execution,并持续获取状态,直至Pipeline执行完成。
## 创建任务
- 点击项目管理-项目名称-工作流定义,点击“创建工作流”按钮,进入 DAG 编辑页面;
- 拖动工具栏的 <img src="../../../../img/tasks/icons/sagemaker.png" width="15"/> 任务节点到画板中。
## 任务样例
首先介绍一些DS通用参数
- **节点名称** :设置任务的名称。一个工作流定义中的节点名称是唯一的。
- **运行标志** :标识这个节点是否能正常调度,如果不需要执行,可以打开禁止执行开关。
- **描述** :描述该节点的功能。
- **任务优先级** :worker 线程数不足时,根据优先级从高到低依次执行,优先级一样时根据先进先出原则执行。
- **Worker 分组** :任务分配给 worker 组的机器执行,选择 Default,会随机选择一台 worker 机执行。
- **环境名称** :配置运行脚本的环境。
- **失败重试次数** :任务失败重新提交的次数。
- **失败重试间隔** :任务失败重新提交任务的时间间隔,以分钟为单位。
- **延迟执行时间** :任务延迟执行的时间,以分钟为单位。
- **超时告警** :勾选超时告警、超时失败,当任务超过"超时时长"后,会发送告警邮件并且任务执行失败。
- **前置任务** :选择当前任务的前置任务,会将被选择的前置任务设置为当前任务的上游。
以上参数如无特殊需求,可以默认即可
- **SagemakerRequestJson**: 启动SageMakerPipeline的需要的请求参数,可见 [AWS API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_StartPipelineExecution.html)
- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md#默认任务参数)。
| **任务参数** | **描述** |
|----------------------|-------------------------------------------------------------------------------------------------------------------------------------|
| SagemakerRequestJson | 启动SageMakerPipeline的需要的请求参数,可见 [AWS API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_StartPipelineExecution.html) |
组件图示如下:
![sagemaker_pipeline](../../../../img/tasks/demo/sagemaker_pipeline.png)
## 环境配置
需要进行AWS的一些配置,修改`common.properties`中的`xxxxx`为你的配置信息
```yaml
# The AWS access key. if resource.storage.type=S3 or use EMR-Task, This configuration is required
resource.aws.access.key.id=<YOUR AWS ACCESS KEY>
@ -54,4 +39,5 @@ resource.aws.access.key.id=<YOUR AWS ACCESS KEY>
resource.aws.secret.access.key=<YOUR AWS SECRET KEY>
# The AWS Region to use. if resource.storage.type=S3 or use EMR-Task, This configuration is required
resource.aws.region=<AWS REGION>
```
```

38
docs/docs/zh/guide/task/seatunnel.md

@ -12,31 +12,22 @@
## 任务参数
- 节点名称:设置任务节点的名称。一个工作流定义中的节点名称是唯一的。
- 运行标志:标识这个结点是否能正常调度,如果不需要执行,可以打开禁止执行开关。
- 描述:描述该节点的功能。
- 任务优先级:worker 线程数不足时,根据优先级从高到低依次执行,优先级一样时根据先进先出原则执行。
- Worker 分组:任务分配给 worker 组的机器执行,选择 Default ,会随机选择一台 worker 机执行。
- 环境名称:配置运行脚本的环境。
- 失败重试次数:任务失败重新提交的次数。
- 失败重试间隔:任务失败重新提交任务的时间间隔,以分为单位。
- Cpu 配额: 为执行的任务分配指定的CPU时间配额,单位百分比,默认-1代表不限制,例如1个核心的CPU满载是100%,16个核心的是1600%。这个功能由 [task.resource.limit.state](../../architecture/configuration.md) 控制
- 最大内存:为执行的任务分配指定的内存大小,超过会触发OOM被Kill同时不会进行自动重试,单位MB,默认-1代表不限制。这个功能由 [task.resource.limit.state](../../architecture/configuration.md) 控制
- 延时执行时间:任务延迟执行的时间,以分为单位。
- 超时警告:勾选超时警告、超时失败,当任务超过“超时时长”后,会发送告警邮件并且任务执行失败。
- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md#默认任务参数)。
- 引擎:支持 FLINK 和 SPARK
- FLINK
- 运行模型:支持 `run``run-application` 两种模式
- 选项参数:用于添加 Flink 引擎本身参数,例如 `-m yarn-cluster -ynm seatunnel`
- SPARK
- 部署方式:指定部署模式,`cluster` `client` `local`
- Master:指定 `Master` 模型,`yarn` `local` `spark` `mesos`,其中 `spark``mesos` 需要指定 `Master` 服务地址,例如:127.0.0.1:7077
> 点击 [这里](https://seatunnel.apache.org/docs/2.1.2/command/usage) 获取更多关于`Apache SeaTunnel command` 使用的信息
- 自定义配置:支持自定义配置或从资源中心选择配置文件
> 点击 [这里](https://seatunnel.apache.org/docs/2.1.2/concept/config) 获取更多关于`Apache SeaTunnel config` 文件介绍
- FLINK
- 运行模型:支持 `run` 和 `run-application` 两种模式
- 选项参数:用于添加 Flink 引擎本身参数,例如 `-m yarn-cluster -ynm seatunnel`
- SPARK
- 部署方式:指定部署模式,`cluster` `client` `local`
- Master:指定 `Master` 模型,`yarn` `local` `spark` `mesos`,其中 `spark`、`mesos` 需要指定 `Master` 服务地址,例如:127.0.0.1:7077
> 点击 [这里](https://seatunnel.apache.org/docs/2.1.2/command/usage) 获取更多关于`Apache SeaTunnel command` 使用的信息
- 自定义配置:支持自定义配置或从资源中心选择配置文件
> 点击 [这里](https://seatunnel.apache.org/docs/2.1.2/concept/config) 获取更多关于`Apache SeaTunnel config` 文件介绍
- 脚本:在任务节点中自定义配置信息,包括四部分:`env` `source` `transform` `sink`
- 资源文件:在任务节点引用资源中心的配置文件,只可以引用一个配置文件。
- 前置任务:选择当前任务的前置任务,会将被选择的前置任务设置为当前任务的上游。
## 任务样例
@ -80,3 +71,4 @@ sink {
}
```

17
docs/docs/zh/guide/task/shell.md

@ -11,21 +11,8 @@ Shell 任务类型,用于创建 Shell 类型的任务并执行一系列的 She
## Task Parameters
- Task name: the name of the task; node names must be unique within a workflow definition.
- Run flag: indicates whether the node can be scheduled normally; if it does not need to be executed, turn on the prohibit-execution switch.
- Description: describes the function of the node.
- Task priority: when the number of worker threads is insufficient, tasks are executed in order of priority from high to low; tasks with the same priority are executed on a first-in, first-out basis.
- Worker group: the task is dispatched to a machine in the selected worker group; choosing Default picks a worker machine at random.
- Environment name: the environment used to run the script.
- Number of failed retries: the number of times a failed task is resubmitted; can be chosen from a drop-down list or entered manually.
- Failed retry interval: the interval between resubmissions of a failed task; can be chosen from a drop-down list or entered manually.
- CPU quota: the CPU time quota allocated to the task, as a percentage; the default -1 means no limit (a fully loaded 1-core CPU is 100%, a 16-core CPU is 1600%). This feature is controlled by [task.resource.limit.state](../../architecture/configuration.md).
- Max memory: the maximum amount of memory allocated to the task, in MB; exceeding it triggers an OOM kill and the task is not retried automatically; the default -1 means no limit. This feature is controlled by [task.resource.limit.state](../../architecture/configuration.md).
- Timeout alarm: check timeout alarm and timeout failure; when the task runs longer than the "timeout duration", an alarm email is sent and the task execution fails.
- Script: the SHELL program developed by the user.
- Resources: the list of resource files that the script needs to reference; these are files uploaded or created in Resource Center - File Management.
- Custom parameters: user-defined parameters local to the SHELL task; they replace the `${variable}` placeholders in the script (see the sketch after this list).
- Predecessor task: selects the predecessor tasks of the current task and sets them as upstream of the current task.
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#默认任务参数) for the default parameters.
- Apart from the default parameters above, this task has no other parameters.
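As an illustration, a minimal sketch of a script body that uses one user-defined custom parameter; the parameter name `file_date` is hypothetical and would have to be defined in the task's custom parameters:

```shell
#!/bin/bash
# ${file_date} is replaced with the value of the custom parameter named
# "file_date" before the script is executed.
echo "processing data for ${file_date}"
```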
## Task Example

16
docs/docs/zh/guide/task/spark.md

@ -16,23 +16,14 @@ Spark 任务类型用于执行 Spark 应用。对于 Spark 节点,worker 支
## Task Parameters
- Node name: the name of the task; node names must be unique within a workflow definition.
- Run flag: indicates whether the node can be scheduled normally; if it does not need to be executed, turn on the prohibit-execution switch.
- Description: describes the function of the node.
- Task priority: when the number of worker threads is insufficient, tasks are executed in order of priority from high to low; tasks with the same priority are executed on a first-in, first-out basis.
- Worker group: the task is dispatched to a machine in the selected worker group; choosing Default picks a worker machine at random.
- Environment name: the environment used to run the script.
- Number of failed retries: the number of times a failed task is resubmitted.
- Failed retry interval: the interval, in minutes, between resubmissions of a failed task.
- Delayed execution time: the time, in minutes, by which task execution is delayed.
- Timeout alarm: check timeout alarm and timeout failure; when the task runs longer than the "timeout duration", an alarm email is sent and the task execution fails.
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#默认任务参数) for the default parameters.
- Program type: supports Java, Scala, Python, and SQL.
- Spark version: supports Spark1 and Spark2.
- Main class: the full path of the main class, the entry point of the Spark program.
- Main program package: the jar package that runs the Spark program (uploaded via the resource center).
- SQL script: the SQL statements in the .sql file run by Spark SQL.
- Deploy mode: (1) spark submit supports the yarn-cluster, yarn-client, and local modes.
  (2) spark sql supports the yarn-client and local modes.
- Task name (optional): the name of the Spark program.
- Driver cores: the number of driver cores, set according to the actual production environment.
- Driver memory: the amount of driver memory, set according to the actual production environment.
@ -42,7 +33,6 @@ Spark 任务类型用于执行 Spark 应用。对于 Spark 节点,worker 支
- Option parameters: supports the `--jar`, `--files`, `--archives`, and `--conf` formats.
- Resources: if resource files are referenced in other parameters, select and specify them here.
- Custom parameters: user-defined parameters local to the Spark task; they replace the ${variable} placeholders in the script.
- Predecessor task: selects the predecessor tasks of the current task and sets them as upstream of the current task.
## Task Example
@ -58,7 +48,7 @@ Spark 任务类型用于执行 Spark 应用。对于 Spark 节点,worker 支
![spark_configure](../../../../img/tasks/demo/spark_task01.png)
##### Upload the Main Program Package
When using the Spark task node, you need to upload the jar package of the executable program via the resource center; see [Resource Center](../resource/configuration.md).

10
docs/docs/zh/guide/task/sql.md

@ -15,13 +15,14 @@ SQL任务类型,用于连接数据库并执行相应SQL。
## Task Parameters
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#默认任务参数) for the default parameters.
- Data source: select the corresponding data source.
- SQL type: supports query and non-query.
  - Query: supports `DML select` commands, which return a result set; the email notification can use one of three templates: table, attachment, or table attachment.
  - Non-query: supports all `DDL` commands and the `DML update, delete, insert` commands.
- Segment execution symbol: when the data source does not support executing several SQL statements in one call, the symbol used to split the SQL statements so that the data source execution method is invoked multiple times (see the sketch after this list).
  Examples: 1. When a Hive data source is selected, this parameter is not needed, because Hive supports executing several SQL statements in one call.
  2. When a MySQL data source is selected and several SQL statements need to be executed, set this parameter to the semicolon `;`, because MySQL does not support executing several SQL statements in one call.
- SQL parameters: the input parameter format is key1=value1;key2=value2…
- SQL statement: the SQL statement to execute.
- UDF function: for HIVE data sources, UDF functions created in the resource center can be referenced; other data source types do not support UDF functions yet.
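As an illustration, a minimal sketch of a non-query script that relies on the segment execution symbol `;` when run against a MySQL data source; the table and column names are hypothetical:

```sql
-- Two statements; with the segment execution symbol set to ";",
-- they are split and sent to the data source one at a time.
INSERT INTO demo_user (id, name) VALUES (1, 'alice');
INSERT INTO demo_user (id, name) VALUES (2, 'bob');
```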
@ -53,5 +54,6 @@ SQL任务类型,用于连接数据库并执行相应SQL。
## Notes
* Pay attention to the choice of SQL type: for operations such as INSERT, select the non-query type.
* To remain compatible with long sessions, UDF functions are created with the CREATE OR REPLACE statement.

12
docs/docs/zh/guide/task/stored-procedure.md

@ -17,7 +17,13 @@ begin
END
```
- Data source: the data source type for stored procedures supports MySQL, POSTGRESQL, and ORACLE; select the corresponding data source.
- SQL Statement: calls the stored procedure, e.g. `call test(${in1},${out1});`.
- Custom parameters: the custom parameter types for stored procedures support IN and OUT; nine data types are supported: VARCHAR, INTEGER, LONG, FLOAT, DOUBLE, DATE, TIME, TIMESTAMP, and BOOLEAN.
## Task Parameters
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#默认任务参数) for the default parameters.

| **Task Parameter** | **Description**                                                                                                                                                    |
|--------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Data source        | The data source type for stored procedures supports MySQL, POSTGRESQL, and ORACLE; select the corresponding data source                                              |
| SQL Statement      | Calls the stored procedure, e.g. `call test(${in1},${out1});`                                                                                                         |
| Custom parameters  | The custom parameter types for stored procedures support IN and OUT; nine data types are supported: VARCHAR, INTEGER, LONG, FLOAT, DOUBLE, DATE, TIME, TIMESTAMP, and BOOLEAN |

14
docs/docs/zh/guide/task/sub-process.md

@ -12,15 +12,11 @@
## Task Parameters
- Node name: the name of the task; node names must be unique within a workflow definition.
- Run flag: indicates whether the node can be scheduled normally; if it does not need to be executed, turn on the prohibit-execution switch.
- Description: describes the function of the node.
- Task priority: when the number of worker threads is insufficient, tasks are executed in order of priority from high to low; tasks with the same priority are executed on a first-in, first-out basis.
- Worker group: the task is dispatched to a machine in the selected worker group; choosing Default picks a worker machine at random.
- Environment name: the environment used to run the script.
- Timeout alarm: check timeout alarm and timeout failure; when the task runs longer than the "timeout duration", an alarm email is sent and the task execution fails.
- Child node: selects the workflow definition of the sub-process; the enter-child-node button in the upper-right corner jumps to the workflow definition of the selected sub-process.
- Predecessor task: selects the predecessor tasks of the current task and sets them as upstream of the current task.
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#默认任务参数) for the default parameters.

| **Task Parameter** | **Description**                                                                                                                                         |
|--------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|
| Child node         | Selects the workflow definition of the sub-process; the enter-child-node button in the upper-right corner jumps to the workflow definition of the selected sub-process |

## Task Example

19
docs/docs/zh/guide/task/switch.md

@ -11,19 +11,12 @@ Switch 是一个条件判断节点,依据[全局变量](../parameter/global.md
## Task Parameters
- Node name: node names must be unique within a workflow definition.
- Run flag: indicates whether the node can be scheduled normally; if it does not need to be executed, turn on the prohibit-execution switch.
- Description: describes the function of the node.
- Task priority: when the number of worker threads is insufficient, tasks are executed in order of priority from high to low; tasks with the same priority are executed on a first-in, first-out basis.
- Worker group: the task is dispatched to a machine in the selected worker group; choosing Default picks a worker machine at random.
- Environment name: the environment configured in the security center; not used if not configured.
- Task group name: the task group configured in the resource center; not used if not configured.
- Number of failed retries: the number of times a failed task is resubmitted; can be chosen from a drop-down list or entered manually.
- Failed retry interval: the interval between resubmissions of a failed task; can be chosen from a drop-down list or entered manually.
- Delayed execution time: the time by which task execution is delayed.
- Timeout alarm: check timeout alarm and timeout failure; when the task runs longer than the "timeout duration", an alarm email is sent and the task execution fails.
- Conditions: multiple conditions can be configured for the switch task; when a condition is met, the specified branch is executed. Different conditions can be configured for different business cases; when comparing against strings, the value must be wrapped in "".
- Branch flow: the default branch; when none of the **conditions** are met, the branch specified in **branch flow** is run.
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#默认任务参数) for the default parameters.
| **Task Parameter** | **Description**                                                                                                                                                                                           |
|--------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Conditions         | Multiple conditions can be configured for the switch task; when a condition is met, the specified branch is executed. Different conditions can be configured for different business cases; when comparing against strings, the value must be wrapped in "" (see the example below) |
| Branch flow        | The default branch; when none of the **conditions** are met, the branch specified in **branch flow** is run                                                                                                   |
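For instance, a condition that routes on a global variable could look roughly like the following; the variable name `switchValue` and the compared string are hypothetical:

```
${switchValue} == "A"
```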
## Task Example

33
docs/docs/zh/guide/task/zeppelin.md

@ -12,30 +12,25 @@
## Task Parameters
- Task name: the name of the task; node names must be unique within a workflow definition.
- Run flag: indicates whether the node can be scheduled normally; if it does not need to be executed, turn on the prohibit-execution switch.
- Description: describes the function of the node.
- Task priority: when the number of worker threads is insufficient, tasks are executed in order of priority from high to low; tasks with the same priority are executed on a first-in, first-out basis.
- Worker group: the task is dispatched to a machine in the selected worker group; choosing Default picks a worker machine at random.
- Number of failed retries: the number of times a failed task is resubmitted; can be chosen from a drop-down list or entered manually.
- Failed retry interval: the interval between resubmissions of a failed task; can be chosen from a drop-down list or entered manually.
- Timeout alarm: check timeout alarm and timeout failure; when the task runs longer than the "timeout duration", an alarm email is sent and the task execution fails.
- Predecessor task: selects the predecessor tasks of the current task and sets them as upstream of the current task.
- Zeppelin Note ID: the unique ID of the Zeppelin note.
- Zeppelin Paragraph ID: the unique ID of the Zeppelin paragraph; leave this blank if you want to schedule the whole note at once.
- Zeppelin Rest Endpoint: the REST endpoint of your Zeppelin service.
- Zeppelin Production Note Directory: the directory where the cloned note is stored in production mode.
- Zeppelin Parameters: the parameters passed to the Zeppelin Dynamic Form.
- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#默认任务参数) for the default parameters.
| **Task Parameter**                  | **Description**                                                                                           |
|-------------------------------------|------------------------------------------------------------------------------------------------------------|
| Zeppelin Note ID                    | The unique ID of the Zeppelin note                                                                           |
| Zeppelin Paragraph ID               | The unique ID of the Zeppelin paragraph; leave this blank if you want to schedule the whole note at once     |
| Zeppelin Rest Endpoint              | The REST endpoint of your Zeppelin service                                                                   |
| Zeppelin Production Note Directory  | The directory where the cloned note is stored in production mode                                             |
| Zeppelin Parameters                 | The parameters passed to the Zeppelin Dynamic Form (see the example below)                                   |
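As an illustration, and assuming the field accepts a JSON map of dynamic-form values (an assumption, not confirmed by this page), `Zeppelin Parameters` might look like the following; the key and value are hypothetical:

```json
{"region": "us-east-1"}
```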
## Production (Clone) Mode
- Fill in the `Zeppelin Production Note Directory` parameter to enable `production mode`.
- In `production mode`, the target note is cloned into the `Zeppelin Production Note Directory` you specified.
  The `Zeppelin task plugin` executes the cloned note and automatically removes it after a successful run.
  In this mode, even if you accidentally modify a note that is being scheduled by `Dolphin Scheduler`, the execution of the production task is not affected,
  which improves stability.
- If you choose not to fill in the `Zeppelin Production Note Directory` parameter, the `Zeppelin task plugin` will execute your original note.
  The 'Zeppelin Production Note Directory' parameter should start and end with a `slash`, e.g. `/production_note_directory/`.
## Task Example
