Browse Source

[doc] Update task DataX document (#10218)

3.1.0-release
Jiajie Zhong 2 years ago committed by GitHub
parent
commit
f4b7754952
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
  1. 126
      docs/docs/en/guide/task/datax.md
  2. 122
      docs/docs/zh/guide/task/datax.md
  3. BIN
      docs/img/datax_edit.png

126
docs/docs/en/guide/task/datax.md

@ -1,63 +1,63 @@
# DataX
## Overview
DataX task type for executing DataX programs. For DataX nodes, the worker will execute `${DATAX_HOME}/bin/datax.py` to analyze the input json file.
## Create Task
- Click `Project -> Management-Project -> Name-Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag from the toolbar <img src="/img/tasks/icons/datax.png" width="15"/> task node to canvas.
## Task Parameter
- **Node name**: The node name in a workflow definition is unique.
- **Run flag**: Identifies whether this node schedules normally, if it does not need to execute, select the `prohibition execution`.
- **Descriptive information**: Describe the function of the node.
- **Task priority**: When the number of worker threads is insufficient, execute in the order of priority from high to low, and tasks with the same priority will execute in a first-in first-out order.
- **Worker grouping**: Assign tasks to the machines of the worker group to execute. If `Default` is selected, randomly select a worker machine for execution.
- **Environment Name**: Configure the environment name in which run the script.
- **Times of failed retry attempts**: The number of times the task failed to resubmit.
- **Failed retry interval**: The time interval (unit minute) for resubmitting the task after a failed task.
- **Delayed execution time**: The time (unit minute) that a task delays in execution.
- **Timeout alarm**: Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm email will send and the task execution will fail.
- **Custom template**: Customize the content of the DataX node's JSON profile when the default DataSource provided does not meet the requirements.
- **JSON**: JSON configuration file for DataX synchronization.
- **Custom parameters**: SQL task type, and stored procedure is a custom parameter order, to set customized parameter type and data type for the method is the same as the stored procedure task type. The difference is that the custom parameter of the SQL task type replaces the `${variable}` in the SQL statement.
- **Data source**: Select the data source to extract data.
- **SQL statement**: The SQL statement used to extract data from the target database, the SQL query column name is automatically parsed when execute the node, and mapped to the target table to synchronize column name. When the column names of the source table and the target table are inconsistent, they can be converted by column alias (as)
- **Target library**: Select the target library for data synchronization.
- **Pre-SQL**: Pre-SQL executes before the SQL statement (executed by the target database).
- **Post-SQL**: Post-SQL executes after the SQL statement (executed by the target database).
- **Stream limit (number of bytes)**: Limit the number of bytes for a query.
- **Limit flow (number of records)**: Limit the number of records for a query.
- **Running memory**: Set the minimum and maximum memory required, which can be set according to the actual production environment.
- **Predecessor task**: Selecting a predecessor task for the current task, will set the selected predecessor task as upstream of the current task.
## Task Example
This example demonstrates how to import data from Hive into MySQL.
### Configure the DataX environment in DolphinScheduler
If you are using the DataX task type in a production environment, it is necessary to configure the required environment first. The following is the configuration file: `bin/env/dolphinscheduler_env.sh`.
![datax_task01](/img/tasks/demo/datax_task01.png)
After finish the environment configuration, need to restart DolphinScheduler.
### Configure DataX Task Node
As the default DataSource does not contain data read from Hive, require a custom JSON, refer to: [HDFS Writer](https://github.com/alibaba/DataX/blob/master/hdfswriter/doc/hdfswriter.md). Note: Partition directories exist on the HDFS path, when importing data in real world situations, partitioning is recommended to be passed as a parameter, using custom parameters.
After finish the required JSON file, you can configure the node by following the steps in the diagram below:
![datax_task02](/img/tasks/demo/datax_task02.png)
### View Execution Result
![datax_task03](/img/tasks/demo/datax_task03.png)
### Notice
If the default DataSource provided does not meet your needs, you can configure the writer and reader of the DataX according to the actual usage environment in the custom template options, available at [DataX](https://github.com/alibaba/DataX).
# DataX
## Overview
DataX task type for executing DataX programs. For DataX nodes, the worker will execute `${DATAX_HOME}/bin/datax.py` to analyze the input json file.
## Create Task
- Click Project Management -> Project Name -> Workflow Definition, and click the "Create Workflow" button to enter the DAG editing page.
- Drag the <img src="/img/tasks/icons/datax.png" width="15"/> from the toolbar to the drawing board.
## Task Parameter
- **Node name**: The node name in a workflow definition is unique.
- **Run flag**: Identifies whether this node can be scheduled normally, if it does not need to be executed, you can turn on the prohibition switch.
- **Descriptive information**: describe the function of the node.
- **Task priority**: When the number of worker threads is insufficient, they are executed in order from high to low, and when the priority is the same, they are executed according to the first-in first-out principle.
- **Worker grouping**: Tasks are assigned to the machines of the worker group to execute. If Default is selected, a worker machine will be randomly selected for execution.
- **Environment Name**: Configure the environment name in which to run the script.
- **Number of failed retry attempts**: The number of times the task failed to be resubmitted.
- **Failed retry interval**: The time, in cents, interval for resubmitting the task after a failed task.
- **Delayed execution time**: The time, in cents, that a task is delayed in execution.
- **Timeout alarm**: Check the timeout alarm and timeout failure. When the task exceeds the "timeout period", an alarm email will be sent and the task execution will fail.
- **Custom template**: Custom the content of the DataX node's json profile when the default data source provided does not meet the required requirements.
- **json**: json configuration file for DataX synchronization.
- **Custom parameters**: SQL task type, and stored procedure is a custom parameter order to set values for the method. The custom parameter type and data type are the same as the stored procedure task type. The difference is that the SQL task type custom parameter will replace the \${variable} in the SQL statement.
- **Data source**: Select the data source from which the data will be extracted.
- **sql statement**: the sql statement used to extract data from the target database, the sql query column name is automatically parsed when the node is executed, and mapped to the target table synchronization column name. When the source table and target table column names are inconsistent, they can be converted by column alias.
- **Target library**: Select the target library for data synchronization.
- **Pre-sql**: Pre-sql is executed before the sql statement (executed by the target library).
- **Post-sql**: Post-sql is executed after the sql statement (executed by the target library).
- **Stream limit (number of bytes)**: Limits the number of bytes in the query.
- **Limit flow (number of records)**: Limit the number of records for a query.
- **Running memory**: the minimum and maximum memory required can be configured to suit the actual production environment.
- **Predecessor task**: Selecting a predecessor task for the current task will set the selected predecessor task as upstream of the current task.
## Task Example
This example demonstrates importing data from Hive into MySQL.
### Configuring the DataX environment in DolphinScheduler
If you are using the DataX task type in a production environment, it is necessary to configure the required environment first. The configuration file is as follows: `/dolphinscheduler/conf/env/dolphinscheduler_env.sh`.
![datax_task01](/img/tasks/demo/datax_task01.png)
After the environment has been configured, DolphinScheduler needs to be restarted.
### Configuring DataX Task Node
As the default data source does not contain data to be read from Hive, a custom json is required, refer to: [HDFS Writer](https://github.com/alibaba/DataX/blob/master/hdfswriter/doc/hdfswriter.md). Note: Partition directories exist on the HDFS path, when importing data in real world situations, partitioning is recommended to be passed as a parameter, using custom parameters.
After writing the required json file, you can configure the node content by following the steps in the diagram below.
![datax_task02](/img/tasks/demo/datax_task02.png)
### View run results
![datax_task03](/img/tasks/demo/datax_task03.png)
### Notice
If the default data source provided does not meet your needs, you can configure the writer and reader of DataX according to the actual usage environment in the custom template option, available at https://github.com/alibaba/DataX.

122
docs/docs/zh/guide/task/datax.md

@ -1,59 +1,63 @@
# DATAX 节点
## 综述
DataX 任务类型,用于执行 DataX 程序。对于 DataX 节点,worker 会通过执行 `${DATAX_HOME}/bin/datax.py` 来解析传入的 json 文件。
## 创建任务
- 点击项目管理 -> 项目名称 -> 工作流定义,点击“创建工作流”按钮,进入 DAG 编辑页面;
- 拖动工具栏的<img src="/img/tasks/icons/datax.png" width="15"/> 任务节点到画板中。
## 任务参数
- 节点名称:设置任务节点的名称。一个工作流定义中的节点名称是唯一的。
- 运行标志:标识这个结点是否能正常调度,如果不需要执行,可以打开禁止执行开关。
- 描述:描述该节点的功能。
- 任务优先级:worker 线程数不足时,根据优先级从高到低依次执行,优先级一样时根据先进先出原则执行。
- Worker 分组:任务分配给 worker 组的机器执行,选择 Default ,会随机选择一台 worker 机执行。
- 环境名称:配置运行脚本的环境。
- 失败重试次数:任务失败重新提交的次数。
- 失败重试间隔:任务失败重新提交任务的时间间隔,以分为单位。
- 延时执行时间:任务延迟执行的时间,以分为单位。
- 超时警告:勾选超时警告、超时失败,当任务超过“超时时长”后,会发送告警邮件并且任务执行失败。
- 自定义模板:当默认提供的数据源不满足所需要求的时,可自定义 datax 节点的 json 配置文件内容。
- json:DataX 同步的 json 配置文件。
- 自定义参数:sql 任务类型,而存储过程是自定义参数顺序的给方法设置值自定义参数类型和数据类型同存储过程任务类型一样。区别在于SQL任务类型自定义参数会替换 sql 语句中 ${变量}。
- 数据源:选择抽取数据的数据源。
- sql 语句:目标库抽取数据的 sql 语句,节点执行时自动解析 sql 查询列名,映射为目标表同步列名,源表和目标表列名不一致时,可以通过列别名(as)转换。
- 目标库:选择数据同步的目标库。
- 目标库前置 sql:前置 sql 在 sql 语句之前执行(目标库执行)。
- 目标库后置 sql:后置 sql 在 sql 语句之后执行(目标库执行)。
- 限流(字节数):限制查询的字节数。
- 限流(记录数):限制查询的记录数。
- 运行内存:可根据实际生产环境配置所需的最小和最大内存。
- 前置任务:选择当前任务的前置任务,会将被选择的前置任务设置为当前任务的上游。
## 任务样例
该样例演示为从 Hive 数据导入到 MySQL 中。
### 在 DolphinScheduler 中配置 DataX 环境
若生产环境中要是使用到 DataX 任务类型,则需要先配置好所需的环境。配置文件如下:`bin/env/dolphinscheduler_env.sh`。
![datax_task01](/img/tasks/demo/datax_task01.png)
<p align="center">
<img src="/img/datax_edit.png" width="80%" />
</p>
- 自定义模板:打开自定义模板开关时,可以自定义datax节点的json配置文件内容(适用于控件配置不满足需求时)
- 数据源:选择抽取数据的数据源
- sql语句:目标库抽取数据的sql语句,节点执行时自动解析sql查询列名,映射为目标表同步列名,源表和目标表列名不一致时,可以通过列别名(as)转换
- 目标库:选择数据同步的目标库
- 目标表:数据同步的目标表名
- 前置sql:前置sql在sql语句之前执行(目标库执行)。
- 后置sql:后置sql在sql语句之后执行(目标库执行)。
- json:datax同步的json配置文件
- 自定义参数:SQL任务类型,而存储过程是自定义参数顺序的给方法设置值自定义参数类型和数据类型同存储过程任务类型一样。区别在于SQL任务类型自定义参数会替换sql语句中${变量}。
# DATAX 节点
## 综述
DataX 任务类型,用于执行 DataX 程序。对于 DataX 节点,worker 会通过执行 `${DATAX_HOME}/bin/datax.py` 来解析传入的 json 文件。
## 创建任务
- 点击项目管理 -> 项目名称 -> 工作流定义,点击“创建工作流”按钮,进入 DAG 编辑页面;
- 拖动工具栏的<img src="/img/tasks/icons/datax.png" width="15"/> 任务节点到画板中。
## 任务参数
- 节点名称:设置任务节点的名称。一个工作流定义中的节点名称是唯一的。
- 运行标志:标识这个结点是否能正常调度,如果不需要执行,可以打开禁止执行开关。
- 描述:描述该节点的功能。
- 任务优先级:worker 线程数不足时,根据优先级从高到低依次执行,优先级一样时根据先进先出原则执行。
- Worker 分组:任务分配给 worker 组的机器执行,选择 Default ,会随机选择一台 worker 机执行。
- 环境名称:配置运行脚本的环境。
- 失败重试次数:任务失败重新提交的次数。
- 失败重试间隔:任务失败重新提交任务的时间间隔,以分为单位。
- 延时执行时间:任务延迟执行的时间,以分为单位。
- 超时警告:勾选超时警告、超时失败,当任务超过“超时时长”后,会发送告警邮件并且任务执行失败。
- 自定义模板:当默认提供的数据源不满足所需要求的时,可自定义 datax 节点的 json 配置文件内容。
- json:DataX 同步的 json 配置文件。
- 自定义参数:sql 任务类型,而存储过程是自定义参数顺序的给方法设置值自定义参数类型和数据类型同存储过程任务类型一样。区别在于SQL任务类型自定义参数会替换 sql 语句中 ${变量}。
- 数据源:选择抽取数据的数据源。
- sql 语句:目标库抽取数据的 sql 语句,节点执行时自动解析 sql 查询列名,映射为目标表同步列名,源表和目标表列名不一致时,可以通过列别名(as)转换。
- 目标库:选择数据同步的目标库。
- 目标库前置 sql:前置 sql 在 sql 语句之前执行(目标库执行)。
- 目标库后置 sql:后置 sql 在 sql 语句之后执行(目标库执行)。
- 限流(字节数):限制查询的字节数。
- 限流(记录数):限制查询的记录数。
- 运行内存:可根据实际生产环境配置所需的最小和最大内存。
- 前置任务:选择当前任务的前置任务,会将被选择的前置任务设置为当前任务的上游。
## 任务样例
该样例演示为从 Hive 数据导入到 MySQL 中。
### 在 DolphinScheduler 中配置 DataX 环境
若生产环境中要是使用到 DataX 任务类型,则需要先配置好所需的环境。配置文件如下:`/dolphinscheduler/conf/env/dolphinscheduler_env.sh`。
![datax_task01](/img/tasks/demo/datax_task01.png)
当环境配置完成之后,需要重启 DolphinScheduler。
### 配置 DataX 任务节点
由于默认的的数据源中并不包含从 Hive 中读取数据,所以需要自定义 json,可参考:[HDFS Writer](https://github.com/alibaba/DataX/blob/master/hdfswriter/doc/hdfswriter.md)。其中需要注意的是 HDFS 路径上存在分区目录,在实际情况导入数据时,分区建议进行传参,即使用自定义参数。
在编写好所需的 json 之后,可按照下图步骤进行配置节点内容。
![datax_task02](/img/tasks/demo/datax_task02.png)
### 查看运行结果
![datax_task03](/img/tasks/demo/datax_task03.png)
## 注意事项:
若默认提供的数据源不满足需求,可在自定义模板选项中,根据实际使用环境来配置 DataX 的 writer 和 reader,可参考:https://github.com/alibaba/DataX

BIN
docs/img/datax_edit.png

Binary file not shown.

Before

Width:  |  Height:  |  Size: 467 KiB

Loading…
Cancel
Save