
[Improve][docs] Add spark sql docs to task spark (#9851)

Branch: 3.0.0/version-upgrade
Authored by sq-q 3 years ago, committed via GitHub
Parent commit: b1bb69c959
  1. docs/docs/en/guide/task/spark.md (34 changed lines)
  2. docs/docs/zh/guide/task/spark.md (36 changed lines)
  3. docs/img/tasks/demo/spark_sql.png (new binary file)

docs/docs/en/guide/task/spark.md (34 changed lines)

@@ -2,7 +2,11 @@
 ## Overview
-Spark task type used to execute Spark program. For Spark nodes, the worker submits the task by using the spark command `spark submit`. See [spark-submit](https://spark.apache.org/docs/3.2.1/submitting-applications.html#launching-applications-with-spark-submit) for more details.
+Spark task type for executing Spark applications. When executing a Spark task, the worker submits the job to the Spark cluster with one of the following commands:
+(1) `spark submit` method to submit tasks. See [spark-submit](https://spark.apache.org/docs/3.2.1/submitting-applications.html#launching-applications-with-spark-submit) for more details.
+(2) `spark sql` method to submit tasks. See [spark sql](https://spark.apache.org/docs/3.2.1/sql-ref-syntax.html) for more details.
 ## Create Task
@@ -21,11 +25,13 @@
 - **Failed retry interval**: The time interval (unit minute) for resubmitting the task after a failed task.
 - **Delayed execution time**: The time (unit minute) that a task delays in execution.
 - **Timeout alarm**: Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm email will send and the task execution will fail.
-- **Program type**: Supports Java, Scala and Python.
+- **Program type**: Supports Java, Scala, Python and SQL.
 - **Spark version**: Support Spark1 and Spark2.
 - **The class of main function**: The **full path** of Main Class, the entry point of the Spark program.
 - **Main jar package**: The Spark jar package (upload by Resource Center).
-- **Deployment mode**: Support 3 deployment modes: yarn-cluster, yarn-client and local.
+- **SQL scripts**: SQL statements in the .sql file that Spark sql runs.
+- **Deployment mode**: (1) spark submit supports three modes: yarn-cluster, yarn-client and local.
+  (2) spark sql supports two modes: yarn-client and local.
 - **Task name** (optional): Spark task name.
 - **Driver core number**: Set the number of Driver core, which can be set according to the actual production environment.
 - **Driver memory size**: Set the size of Driver memories, which can be set according to the actual production environment.
@@ -39,17 +45,19 @@
 ## Task Example
-### Execute the WordCount Program
+### spark submit
+#### Execute the WordCount Program
 This is a common introductory case in the big data ecosystem, which often apply to computational frameworks such as MapReduce, Flink and Spark. The main purpose is to count the number of identical words in the input text. (Flink's releases attach this example job)
-#### Configure the Spark Environment in DolphinScheduler
+##### Configure the Spark Environment in DolphinScheduler
 If you are using the Spark task type in a production environment, it is necessary to configure the required environment first. The following is the configuration file: `bin/env/dolphinscheduler_env.sh`.
 ![spark_configure](/img/tasks/demo/spark_task01.png)
-#### Upload the Main Package
+##### Upload the Main Package
 When using the Spark task node, you need to upload the jar package to the Resource Centre for the execution, refer to the [resource center](../resource.md).
@@ -57,12 +65,22 @@
 ![resource_upload](/img/tasks/demo/upload_jar.png)
-#### Configure Spark Nodes
+##### Configure Spark Nodes
 Configure the required content according to the parameter descriptions above.
 ![demo-spark-simple](/img/tasks/demo/spark_task02.png)
+### spark sql
+#### Execute DDL and DML statements
+This case creates a view terms, writes three rows of data into it, creates a table wc in parquet format if it does not already exist, and then inserts the data of the view terms into the parquet table wc. The program type is SQL.
+![spark_sql](/img/tasks/demo/spark_sql.png)
 ## Notice
-JAVA and Scala only used for identification, there is no difference. If you use Python to develop Spark application, there is no class of the main function and the rest is the same.
+JAVA and Scala are only used for identification; there is no difference when you run the Spark task. If your application is developed in Python, simply ignore the **Main Class** parameter in the form. The **SQL scripts** parameter only applies to the SQL type and can be ignored for JAVA, Scala and Python.
+SQL does not currently support cluster mode.
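
The new "Execute DDL and DML statements" example above shows its script only as a screenshot (spark_sql.png). Purely as a hedged sketch of what such a .sql file could contain: the view name terms and table name wc follow the description in the diff, while the column name, the three rows of values, and the exact statements are assumptions.

```sql
-- Hypothetical sketch only: the real demo script is shown in spark_sql.png.
-- Create a view named terms and write three rows of data into it.
CREATE TEMPORARY VIEW terms AS
SELECT * FROM VALUES ('hello'), ('spark'), ('sql') AS t(word);

-- Create the parquet-format table wc only if it does not already exist.
CREATE TABLE IF NOT EXISTS wc (word STRING) USING parquet;

-- Insert the data of the view terms into the parquet table wc.
INSERT INTO wc SELECT word FROM terms;
```

A script of this kind is what the **SQL scripts** parameter refers to; per the notes above it runs with program type SQL in yarn-client or local mode, since cluster mode is not supported for SQL.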

docs/docs/zh/guide/task/spark.md (36 changed lines)

@@ -2,7 +2,11 @@
 ## Overview
-Spark task type, used to execute Spark programs. For Spark nodes, the worker submits the task with the spark command `spark submit`. See [spark-submit](https://spark.apache.org/docs/3.2.1/submitting-applications.html#launching-applications-with-spark-submit) for more details.
+The Spark task type is used to execute Spark applications. For Spark nodes, the worker supports two different spark commands for submitting tasks:
+(1) Submit tasks with `spark submit`. See [spark-submit](https://spark.apache.org/docs/3.2.1/submitting-applications.html#launching-applications-with-spark-submit) for more details.
+(2) Submit tasks with `spark sql`. See [spark sql](https://spark.apache.org/docs/3.2.1/sql-ref-syntax.html) for more details.
 ## Create Task
@@ -22,11 +26,13 @@
 - Failed retry interval: the time interval, in minutes, for resubmitting the task after a failure.
 - Delayed execution time: the time, in minutes, by which task execution is delayed.
 - Timeout alarm: check timeout alarm and timeout failure; when the task exceeds the "timeout duration", an alarm email is sent and the task execution fails.
-- Program type: supports three languages: Java, Scala and Python.
+- Program type: supports four languages: Java, Scala, Python and SQL.
 - Spark version: supports Spark1 and Spark2.
 - Main function Class: the full path of the Main class that is the entry point of the Spark program.
 - Main jar package: the jar package that runs the Spark program (uploaded through the Resource Center).
-- Deployment mode: supports three modes: yarn-cluster, yarn-client and local.
+- SQL scripts: the SQL statements in the .sql file that Spark sql runs.
+- Deployment mode: (1) spark submit supports three modes: yarn-cluster, yarn-client and local.
+  (2) spark sql supports two modes: yarn-client and local.
 - Task name (optional): the name of the Spark program.
 - Driver core number: sets the number of Driver cores, which can be set according to the actual production environment.
 - Driver memory size: sets the Driver memory size, which can be set according to the actual production environment.
@@ -40,17 +46,19 @@
 ## Task Example
-### Execute the WordCount Program
+### spark submit
+#### Execute the WordCount Program
 This is a common introductory case in the big data ecosystem, often applied to computing frameworks such as MapReduce, Flink and Spark. Its main purpose is to count how many identical words appear in the input text.
-#### Configure the Spark Environment in DolphinScheduler
+##### Configure the Spark Environment in DolphinScheduler
 If the Spark task type is used in a production environment, the required environment must be configured first. The configuration file is: `bin/env/dolphinscheduler_env.sh`.
 ![spark_configure](/img/tasks/demo/spark_task01.png)
-#### Upload the Main Package
+##### Upload the Main Package
 When using the Spark task node, you need to upload the jar package of the program through the Resource Center; see the [Resource Center](../resource.md).
@@ -58,12 +66,24 @@
 ![resource_upload](/img/tasks/demo/upload_jar.png)
-#### Configure Spark Nodes
+##### Configure Spark Nodes
 Configure the required content according to the parameter descriptions above.
 ![demo-spark-simple](/img/tasks/demo/spark_task02.png)
+### spark sql
+#### Execute DDL and DML Statements
+This example creates a view terms, writes three rows of data into it, creates a table wc in parquet format if it does not already exist, and then inserts the data of the view terms into the parquet table wc. The program type is SQL.
+![spark_sql](/img/tasks/demo/spark_sql.png)
 ## Notes:
-Note: JAVA and Scala are only used for identification and make no difference. If the Spark application is developed in Python, there is no main function class; everything else is the same.
+Note:
+JAVA and Scala are only used for identification; there is no difference when you run the Spark task. If your application is developed in Python, you can ignore the **Main Class** parameter in the form. The **SQL scripts** parameter only applies to the SQL type and can be ignored for JAVA, Scala and Python.
+SQL does not currently support cluster mode.

docs/img/tasks/demo/spark_sql.png (new binary file, 976 KiB)
