Browse Source

[1.3.6-prepare][Improvement][Docker&K8s] Transfer docker/k8s documents to the official website (#5191)

Shiwen Cheng 4 years ago committed by GitHub
parent
commit
be169b6390
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
  1. 12
      docker/README.md
  2. 910
      docker/build/README.md
  3. 910
      docker/build/README_zh_CN.md
  4. 722
      docker/kubernetes/dolphinscheduler/README.md

12
docker/README.md

@ -1 +1,11 @@
# DolphinScheduler for Docker
# DolphinScheduler for Docker and Kubernetes
### QuickStart in Docker
[![EN doc](https://img.shields.io/badge/document-English-blue.svg)](https://dolphinscheduler.apache.org/en-us/docs/latest/user_doc/docker-deployment.html)
[![CN doc](https://img.shields.io/badge/文档-中文版-blue.svg)](https://dolphinscheduler.apache.org/zh-cn/docs/latest/user_doc/docker-deployment.html)
### QuickStart in Kubernetes
[![EN doc](https://img.shields.io/badge/document-English-blue.svg)](https://dolphinscheduler.apache.org/en-us/docs/latest/user_doc/kubernetes-deployment.html)
[![CN doc](https://img.shields.io/badge/文档-中文版-blue.svg)](https://dolphinscheduler.apache.org/zh-cn/docs/latest/user_doc/kubernetes-deployment.html)

910
docker/build/README.md

@ -1,910 +0,0 @@
DolphinScheduler
=================
* [What is DolphinScheduler?](#what-is-dolphinscheduler)
* [Prerequisites](#prerequisites)
* [How to use this Docker image](#how-to-use-this-docker-image)
* [You can start a DolphinScheduler by docker\-compose (recommended)](#you-can-start-a-dolphinscheduler-by-docker-compose-recommended)
* [Or via specifying the existing PostgreSQL and ZooKeeper service](#or-via-specifying-the-existing-postgresql-and-zookeeper-service)
* [Or start a standalone DolphinScheduler server](#or-start-a-standalone-dolphinscheduler-server)
* [How to build a Docker image](#how-to-build-a-docker-image)
* [Support Matrix](#support-matrix)
* [Environment Variables](#environment-variables)
* [Database](#database)
* [ZooKeeper](#zookeeper)
* [Common](#common)
* [Master Server](#master-server)
* [Worker Server](#worker-server)
* [Alert Server](#alert-server)
* [Api Server](#api-server)
* [Logger Server](#logger-server)
* [Initialization scripts](#initialization-scripts)
* [FAQ](#faq)
* [How to stop DolphinScheduler by docker\-compose?](#how-to-stop-dolphinscheduler-by-docker-compose)
* [How to deploy DolphinScheduler on Docker Swarm?](#how-to-deploy-dolphinscheduler-on-docker-swarm)
* [How to use MySQL as the DolphinScheduler's database instead of PostgreSQL?](#how-to-use-mysql-as-the-dolphinschedulers-database-instead-of-postgresql)
* [How to support MySQL datasource in Datasource manage?](#how-to-support-mysql-datasource-in-datasource-manage)
* [How to support Oracle datasource in Datasource manage?](#how-to-support-oracle-datasource-in-datasource-manage)
* [How to support Python 2 pip and custom requirements\.txt?](#how-to-support-python-2-pip-and-custom-requirementstxt)
* [How to support Python 3?](#how-to-support-python-3)
* [How to support Hadoop, Spark, Flink, Hive or DataX?](#how-to-support-hadoop-spark-flink-hive-or-datax)
* [How to support Spark 3?](#how-to-support-spark-3)
* [How to support shared storage between Master, Worker and Api server?](#how-to-support-shared-storage-between-master-worker-and-api-server)
* [How to support local file resource storage instead of HDFS and S3?](#how-to-support-local-file-resource-storage-instead-of-hdfs-and-s3)
* [How to support S3 resource storage like MinIO?](#how-to-support-s3-resource-storage-like-minio)
* [How to configure SkyWalking?](#how-to-configure-skywalking)
## What is DolphinScheduler?
DolphinScheduler is a distributed and easy-to-expand visual DAG workflow scheduling system, dedicated to solving the complex dependencies in data processing, making the scheduling system out of the box for data processing.
GitHub URL: https://github.com/apache/incubator-dolphinscheduler
Official Website: https://dolphinscheduler.apache.org
![DolphinScheduler](https://dolphinscheduler.apache.org/img/hlogo_colorful.svg)
[![EN doc](https://img.shields.io/badge/document-English-blue.svg)](README.md)
[![CN doc](https://img.shields.io/badge/文档-中文版-blue.svg)](README_zh_CN.md)
## Prerequisites
- [Docker](https://docs.docker.com/engine/) 1.13.1+
- [Docker Compose](https://docs.docker.com/compose/) 1.11.0+
## How to use this Docker image
#### You can start a DolphinScheduler by docker-compose (recommended)
```
$ docker-compose -f ./docker/docker-swarm/docker-compose.yml up -d
```
The default **PostgreSQL** username `root`, password `root` and database `dolphinscheduler` are created in the `docker-compose.yml`.
The default **ZooKeeper** is created in the `docker-compose.yml`.
Access the Web UI: http://192.168.xx.xx:12345/dolphinscheduler
The default username is `admin` and the default password is `dolphinscheduler123`
> **Tip**: For quick start in docker, you can create a tenant named `ds` and associate the user `admin` with the tenant `ds`
#### Or via specifying the existing PostgreSQL and ZooKeeper service
You can specify existing **PostgreSQL** and **ZooKeeper** service. Example:
```
$ docker run -d --name dolphinscheduler \
-e DATABASE_HOST="192.168.x.x" -e DATABASE_PORT="5432" -e DATABASE_DATABASE="dolphinscheduler" \
-e DATABASE_USERNAME="test" -e DATABASE_PASSWORD="test" \
-e ZOOKEEPER_QUORUM="192.168.x.x:2181" \
-p 12345:12345 \
apache/dolphinscheduler:latest all
```
Access the Web UI:http://192.168.xx.xx:12345/dolphinscheduler
#### Or start a standalone DolphinScheduler server
You can start a standalone DolphinScheduler server.
* Start a **master server**, For example:
```
$ docker run -d --name dolphinscheduler-master \
-e DATABASE_HOST="192.168.x.x" -e DATABASE_PORT="5432" -e DATABASE_DATABASE="dolphinscheduler" \
-e DATABASE_USERNAME="test" -e DATABASE_PASSWORD="test" \
-e ZOOKEEPER_QUORUM="192.168.x.x:2181" \
apache/dolphinscheduler:latest master-server
```
* Start a **worker server** (including **logger server**), For example:
```
$ docker run -d --name dolphinscheduler-worker \
-e DATABASE_HOST="192.168.x.x" -e DATABASE_PORT="5432" -e DATABASE_DATABASE="dolphinscheduler" \
-e DATABASE_USERNAME="test" -e DATABASE_PASSWORD="test" \
-e ZOOKEEPER_QUORUM="192.168.x.x:2181" \
apache/dolphinscheduler:latest worker-server
```
* Start a **api server**, For example:
```
$ docker run -d --name dolphinscheduler-api \
-e DATABASE_HOST="192.168.x.x" -e DATABASE_PORT="5432" -e DATABASE_DATABASE="dolphinscheduler" \
-e DATABASE_USERNAME="test" -e DATABASE_PASSWORD="test" \
-e ZOOKEEPER_QUORUM="192.168.x.x:2181" \
-p 12345:12345 \
apache/dolphinscheduler:latest api-server
```
* Start a **alert server**, For example:
```
$ docker run -d --name dolphinscheduler-alert \
-e DATABASE_HOST="192.168.x.x" -e DATABASE_PORT="5432" -e DATABASE_DATABASE="dolphinscheduler" \
-e DATABASE_USERNAME="test" -e DATABASE_PASSWORD="test" \
apache/dolphinscheduler:latest alert-server
```
**Note**: You must be specify `DATABASE_HOST`, `DATABASE_PORT`, `DATABASE_DATABASE`, `DATABASE_USERNAME`, `DATABASE_PASSWORD`, `ZOOKEEPER_QUORUM` when start a standalone dolphinscheduler server.
## How to build a Docker image
You can build a docker image in A Unix-like operating system, You can also build it in Windows operating system.
In Unix-Like, Example:
```bash
$ cd path/incubator-dolphinscheduler
$ sh ./docker/build/hooks/build
```
In Windows, Example:
```bat
C:\incubator-dolphinscheduler>.\docker\build\hooks\build.bat
```
Please read `./docker/build/hooks/build` `./docker/build/hooks/build.bat` script files if you don't understand
## Support Matrix
| Type | Support | Notes |
| ------------------------------------------------------------ | ------------ | ------------------------------------- |
| Shell | Yes | |
| Python2 | Yes | |
| Python3 | Indirect Yes | Refer to FAQ |
| Hadoop2 | Indirect Yes | Refer to FAQ |
| Hadoop3 | Not Sure | Not tested |
| Spark-Local(client) | Indirect Yes | Refer to FAQ |
| Spark-YARN(cluster) | Indirect Yes | Refer to FAQ |
| Spark-Mesos(cluster) | Not Yet | |
| Spark-Standalone(cluster) | Not Yet | |
| Spark-Kubernetes(cluster) | Not Yet | |
| Flink-Local(local>=1.11) | Not Yet | Generic CLI mode is not yet supported |
| Flink-YARN(yarn-cluster) | Indirect Yes | Refer to FAQ |
| Flink-YARN(yarn-session/yarn-per-job/yarn-application>=1.11) | Not Yet | Generic CLI mode is not yet supported |
| Flink-Mesos(default) | Not Yet | |
| Flink-Mesos(remote>=1.11) | Not Yet | Generic CLI mode is not yet supported |
| Flink-Standalone(default) | Not Yet | |
| Flink-Standalone(remote>=1.11) | Not Yet | Generic CLI mode is not yet supported |
| Flink-Kubernetes(default) | Not Yet | |
| Flink-Kubernetes(remote>=1.11) | Not Yet | Generic CLI mode is not yet supported |
| Flink-NativeKubernetes(kubernetes-session/application>=1.11) | Not Yet | Generic CLI mode is not yet supported |
| MapReduce | Indirect Yes | Refer to FAQ |
| Kerberos | Indirect Yes | Refer to FAQ |
| HTTP | Yes | |
| DataX | Indirect Yes | Refer to FAQ |
| Sqoop | Indirect Yes | Refer to FAQ |
| SQL-MySQL | Indirect Yes | Refer to FAQ |
| SQL-PostgreSQL | Yes | |
| SQL-Hive | Indirect Yes | Refer to FAQ |
| SQL-Spark | Indirect Yes | Refer to FAQ |
| SQL-ClickHouse | Indirect Yes | Refer to FAQ |
| SQL-Oracle | Indirect Yes | Refer to FAQ |
| SQL-SQLServer | Indirect Yes | Refer to FAQ |
| SQL-DB2 | Indirect Yes | Refer to FAQ |
## Environment Variables
The Docker container is configured through environment variables, and the default value will be used if an environment variable is not set
Especially, it can be configured through the environment variable configuration file `config.env.sh` in Docker Compose and Docker Swarm
### Database
**`DATABASE_TYPE`**
This environment variable sets the type for database. The default value is `postgresql`.
**Note**: You must be specify it when start a standalone dolphinscheduler server. Like `master-server`, `worker-server`, `api-server`, `alert-server`.
**`DATABASE_DRIVER`**
This environment variable sets the type for database. The default value is `org.postgresql.Driver`.
**Note**: You must be specify it when start a standalone dolphinscheduler server. Like `master-server`, `worker-server`, `api-server`, `alert-server`.
**`DATABASE_HOST`**
This environment variable sets the host for database. The default value is `127.0.0.1`.
**Note**: You must be specify it when start a standalone dolphinscheduler server. Like `master-server`, `worker-server`, `api-server`, `alert-server`.
**`DATABASE_PORT`**
This environment variable sets the port for database. The default value is `5432`.
**Note**: You must be specify it when start a standalone dolphinscheduler server. Like `master-server`, `worker-server`, `api-server`, `alert-server`.
**`DATABASE_USERNAME`**
This environment variable sets the username for database. The default value is `root`.
**Note**: You must be specify it when start a standalone dolphinscheduler server. Like `master-server`, `worker-server`, `api-server`, `alert-server`.
**`DATABASE_PASSWORD`**
This environment variable sets the password for database. The default value is `root`.
**Note**: You must be specify it when start a standalone dolphinscheduler server. Like `master-server`, `worker-server`, `api-server`, `alert-server`.
**`DATABASE_DATABASE`**
This environment variable sets the database for database. The default value is `dolphinscheduler`.
**Note**: You must be specify it when start a standalone dolphinscheduler server. Like `master-server`, `worker-server`, `api-server`, `alert-server`.
**`DATABASE_PARAMS`**
This environment variable sets the database for database. The default value is `characterEncoding=utf8`.
**Note**: You must be specify it when start a standalone dolphinscheduler server. Like `master-server`, `worker-server`, `api-server`, `alert-server`.
### ZooKeeper
**`ZOOKEEPER_QUORUM`**
This environment variable sets zookeeper quorum. The default value is `127.0.0.1:2181`.
**Note**: You must be specify it when start a standalone dolphinscheduler server. Like `master-server`, `worker-server`, `api-server`.
**`ZOOKEEPER_ROOT`**
This environment variable sets zookeeper root directory for dolphinscheduler. The default value is `/dolphinscheduler`.
### Common
**`DOLPHINSCHEDULER_OPTS`**
This environment variable sets jvm options for dolphinscheduler, suitable for `master-server`, `worker-server`, `api-server`, `alert-server`, `logger-server`. The default value is empty.
**`DATA_BASEDIR_PATH`**
User data directory path, self configuration, please make sure the directory exists and have read write permissions. The default value is `/tmp/dolphinscheduler`
**`RESOURCE_STORAGE_TYPE`**
This environment variable sets resource storage type for dolphinscheduler like `HDFS`, `S3`, `NONE`. The default value is `HDFS`.
**`RESOURCE_UPLOAD_PATH`**
This environment variable sets resource store path on HDFS/S3 for resource storage. The default value is `/dolphinscheduler`.
**`FS_DEFAULT_FS`**
This environment variable sets fs.defaultFS for resource storage like `file:///`, `hdfs://mycluster:8020` or `s3a://dolphinscheduler`. The default value is `file:///`.
**`FS_S3A_ENDPOINT`**
This environment variable sets s3 endpoint for resource storage. The default value is `s3.xxx.amazonaws.com`.
**`FS_S3A_ACCESS_KEY`**
This environment variable sets s3 access key for resource storage. The default value is `xxxxxxx`.
**`FS_S3A_SECRET_KEY`**
This environment variable sets s3 secret key for resource storage. The default value is `xxxxxxx`.
**`HADOOP_SECURITY_AUTHENTICATION_STARTUP_STATE`**
This environment variable sets whether to startup kerberos. The default value is `false`.
**`JAVA_SECURITY_KRB5_CONF_PATH`**
This environment variable sets java.security.krb5.conf path. The default value is `/opt/krb5.conf`.
**`LOGIN_USER_KEYTAB_USERNAME`**
This environment variable sets login user from keytab username. The default value is `hdfs@HADOOP.COM`.
**`LOGIN_USER_KEYTAB_PATH`**
This environment variable sets login user from keytab path. The default value is `/opt/hdfs.keytab`.
**`KERBEROS_EXPIRE_TIME`**
This environment variable sets kerberos expire time, the unit is hour. The default value is `2`.
**`HDFS_ROOT_USER`**
This environment variable sets hdfs root user when resource.storage.type=HDFS. The default value is `hdfs`.
**`YARN_RESOURCEMANAGER_HA_RM_IDS`**
This environment variable sets yarn resourcemanager ha rm ids. The default value is empty.
**`YARN_APPLICATION_STATUS_ADDRESS`**
This environment variable sets yarn application status address. The default value is `http://ds1:8088/ws/v1/cluster/apps/%s`.
**`SKYWALKING_ENABLE`**
This environment variable sets whether to enable skywalking. The default value is `false`.
**`SW_AGENT_COLLECTOR_BACKEND_SERVICES`**
This environment variable sets agent collector backend services for skywalking. The default value is `127.0.0.1:11800`.
**`SW_GRPC_LOG_SERVER_HOST`**
This environment variable sets grpc log server host for skywalking. The default value is `127.0.0.1`.
**`SW_GRPC_LOG_SERVER_PORT`**
This environment variable sets grpc log server port for skywalking. The default value is `11800`.
**`HADOOP_HOME`**
This environment variable sets `HADOOP_HOME`. The default value is `/opt/soft/hadoop`.
**`HADOOP_CONF_DIR`**
This environment variable sets `HADOOP_CONF_DIR`. The default value is `/opt/soft/hadoop/etc/hadoop`.
**`SPARK_HOME1`**
This environment variable sets `SPARK_HOME1`. The default value is `/opt/soft/spark1`.
**`SPARK_HOME2`**
This environment variable sets `SPARK_HOME2`. The default value is `/opt/soft/spark2`.
**`PYTHON_HOME`**
This environment variable sets `PYTHON_HOME`. The default value is `/usr/bin/python`.
**`JAVA_HOME`**
This environment variable sets `JAVA_HOME`. The default value is `/usr/local/openjdk-8`.
**`HIVE_HOME`**
This environment variable sets `HIVE_HOME`. The default value is `/opt/soft/hive`.
**`FLINK_HOME`**
This environment variable sets `FLINK_HOME`. The default value is `/opt/soft/flink`.
**`DATAX_HOME`**
This environment variable sets `DATAX_HOME`. The default value is `/opt/soft/datax`.
### Master Server
**`MASTER_SERVER_OPTS`**
This environment variable sets jvm options for `master-server`. The default value is `-Xms1g -Xmx1g -Xmn512m`.
**`MASTER_EXEC_THREADS`**
This environment variable sets exec thread number for `master-server`. The default value is `100`.
**`MASTER_EXEC_TASK_NUM`**
This environment variable sets exec task number for `master-server`. The default value is `20`.
**`MASTER_DISPATCH_TASK_NUM`**
This environment variable sets dispatch task number for `master-server`. The default value is `3`.
**`MASTER_HOST_SELECTOR`**
This environment variable sets host selector for `master-server`. Optional values include `Random`, `RoundRobin` and `LowerWeight`. The default value is `LowerWeight`.
**`MASTER_HEARTBEAT_INTERVAL`**
This environment variable sets heartbeat interval for `master-server`. The default value is `10`.
**`MASTER_TASK_COMMIT_RETRYTIMES`**
This environment variable sets task commit retry times for `master-server`. The default value is `5`.
**`MASTER_TASK_COMMIT_INTERVAL`**
This environment variable sets task commit interval for `master-server`. The default value is `1000`.
**`MASTER_MAX_CPULOAD_AVG`**
This environment variable sets max cpu load avg for `master-server`. The default value is `-1`.
**`MASTER_RESERVED_MEMORY`**
This environment variable sets reserved memory for `master-server`, the unit is G. The default value is `0.3`.
### Worker Server
**`WORKER_SERVER_OPTS`**
This environment variable sets jvm options for `worker-server`. The default value is `-Xms1g -Xmx1g -Xmn512m`.
**`WORKER_EXEC_THREADS`**
This environment variable sets exec thread number for `worker-server`. The default value is `100`.
**`WORKER_HEARTBEAT_INTERVAL`**
This environment variable sets heartbeat interval for `worker-server`. The default value is `10`.
**`WORKER_MAX_CPULOAD_AVG`**
This environment variable sets max cpu load avg for `worker-server`. The default value is `-1`.
**`WORKER_RESERVED_MEMORY`**
This environment variable sets reserved memory for `worker-server`, the unit is G. The default value is `0.3`.
**`WORKER_GROUPS`**
This environment variable sets groups for `worker-server`. The default value is `default`.
### Alert Server
**`ALERT_SERVER_OPTS`**
This environment variable sets jvm options for `alert-server`. The default value is `-Xms512m -Xmx512m -Xmn256m`.
**`XLS_FILE_PATH`**
This environment variable sets xls file path for `alert-server`. The default value is `/tmp/xls`.
**`MAIL_SERVER_HOST`**
This environment variable sets mail server host for `alert-server`. The default value is empty.
**`MAIL_SERVER_PORT`**
This environment variable sets mail server port for `alert-server`. The default value is empty.
**`MAIL_SENDER`**
This environment variable sets mail sender for `alert-server`. The default value is empty.
**`MAIL_USER=`**
This environment variable sets mail user for `alert-server`. The default value is empty.
**`MAIL_PASSWD`**
This environment variable sets mail password for `alert-server`. The default value is empty.
**`MAIL_SMTP_STARTTLS_ENABLE`**
This environment variable sets SMTP tls for `alert-server`. The default value is `true`.
**`MAIL_SMTP_SSL_ENABLE`**
This environment variable sets SMTP ssl for `alert-server`. The default value is `false`.
**`MAIL_SMTP_SSL_TRUST`**
This environment variable sets SMTP ssl truest for `alert-server`. The default value is empty.
**`ENTERPRISE_WECHAT_ENABLE`**
This environment variable sets enterprise wechat enable for `alert-server`. The default value is `false`.
**`ENTERPRISE_WECHAT_CORP_ID`**
This environment variable sets enterprise wechat corp id for `alert-server`. The default value is empty.
**`ENTERPRISE_WECHAT_SECRET`**
This environment variable sets enterprise wechat secret for `alert-server`. The default value is empty.
**`ENTERPRISE_WECHAT_AGENT_ID`**
This environment variable sets enterprise wechat agent id for `alert-server`. The default value is empty.
**`ENTERPRISE_WECHAT_USERS`**
This environment variable sets enterprise wechat users for `alert-server`. The default value is empty.
### Api Server
**`API_SERVER_OPTS`**
This environment variable sets jvm options for `api-server`. The default value is `-Xms512m -Xmx512m -Xmn256m`.
### Logger Server
**`LOGGER_SERVER_OPTS`**
This environment variable sets jvm options for `logger-server`. The default value is `-Xms512m -Xmx512m -Xmn256m`.
## Initialization scripts
If you would like to do additional initialization in an image derived from this one, add one or more environment variables under `/root/start-init-conf.sh`, and modify template files in `/opt/dolphinscheduler/conf/*.tpl`.
For example, to add an environment variable `SECURITY_AUTHENTICATION_TYPE` in `/root/start-init-conf.sh`:
```
export SECURITY_AUTHENTICATION_TYPE=PASSWORD
```
and to modify `application-api.properties.tpl` template file, add the `SECURITY_AUTHENTICATION_TYPE`:
```
security.authentication.type=${SECURITY_AUTHENTICATION_TYPE}
```
`/root/start-init-conf.sh` will dynamically generate config file:
```sh
echo "generate dolphinscheduler config"
ls ${DOLPHINSCHEDULER_HOME}/conf/ | grep ".tpl" | while read line; do
eval "cat << EOF
$(cat ${DOLPHINSCHEDULER_HOME}/conf/${line})
EOF
" > ${DOLPHINSCHEDULER_HOME}/conf/${line%.*}
done
```
## FAQ
### How to stop DolphinScheduler by docker-compose?
Stop containers:
```
docker-compose stop
```
Stop containers and remove containers, networks and volumes:
```
docker-compose down -v
```
### How to deploy DolphinScheduler on Docker Swarm?
Assuming that the Docker Swarm cluster has been created (If there is no Docker Swarm cluster, please refer to [create-swarm](https://docs.docker.com/engine/swarm/swarm-tutorial/create-swarm/))
Start a stack named dolphinscheduler
```
docker stack deploy -c docker-stack.yml dolphinscheduler
```
Stop and remove the stack named dolphinscheduler
```
docker stack rm dolphinscheduler
```
### How to use MySQL as the DolphinScheduler's database instead of PostgreSQL?
> Because of the commercial license, we cannot directly use the driver of MySQL.
>
> If you want to use MySQL, you can build a new image based on the `apache/dolphinscheduler` image as follows.
1. Download the MySQL driver [mysql-connector-java-5.1.49.jar](https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.49/mysql-connector-java-5.1.49.jar) (require `>=5.1.47`)
2. Create a new `Dockerfile` to add MySQL driver:
```
FROM apache/dolphinscheduler:latest
COPY mysql-connector-java-5.1.49.jar /opt/dolphinscheduler/lib
```
3. Build a new docker image including MySQL driver:
```
docker build -t apache/dolphinscheduler:mysql-driver .
```
4. Modify all `image` fields to `apache/dolphinscheduler:mysql-driver` in `docker-compose.yml`
> If you want to deploy dolphinscheduler on Docker Swarm, you need modify `docker-stack.yml`
5. Comment the `dolphinscheduler-postgresql` block in `docker-compose.yml`
6. Add `dolphinscheduler-mysql` service in `docker-compose.yml` (**Optional**, you can directly use a external MySQL database)
7. Modify DATABASE environment variables in `config.env.sh`
```
DATABASE_TYPE=mysql
DATABASE_DRIVER=com.mysql.jdbc.Driver
DATABASE_HOST=dolphinscheduler-mysql
DATABASE_PORT=3306
DATABASE_USERNAME=root
DATABASE_PASSWORD=root
DATABASE_DATABASE=dolphinscheduler
DATABASE_PARAMS=useUnicode=true&characterEncoding=UTF-8
```
> If you have added `dolphinscheduler-mysql` service in `docker-compose.yml`, just set `DATABASE_HOST` to `dolphinscheduler-mysql`
8. Run a dolphinscheduler (See **How to use this docker image**)
### How to support MySQL datasource in `Datasource manage`?
> Because of the commercial license, we cannot directly use the driver of MySQL.
>
> If you want to add MySQL datasource, you can build a new image based on the `apache/dolphinscheduler` image as follows.
1. Download the MySQL driver [mysql-connector-java-5.1.49.jar](https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.49/mysql-connector-java-5.1.49.jar) (require `>=5.1.47`)
2. Create a new `Dockerfile` to add MySQL driver:
```
FROM apache/dolphinscheduler:latest
COPY mysql-connector-java-5.1.49.jar /opt/dolphinscheduler/lib
```
3. Build a new docker image including MySQL driver:
```
docker build -t apache/dolphinscheduler:mysql-driver .
```
4. Modify all `image` fields to `apache/dolphinscheduler:mysql-driver` in `docker-compose.yml`
> If you want to deploy dolphinscheduler on Docker Swarm, you need modify `docker-stack.yml`
5. Run a dolphinscheduler (See **How to use this docker image**)
6. Add a MySQL datasource in `Datasource manage`
### How to support Oracle datasource in `Datasource manage`?
> Because of the commercial license, we cannot directly use the driver of Oracle.
>
> If you want to add Oracle datasource, you can build a new image based on the `apache/dolphinscheduler` image as follows.
1. Download the Oracle driver [ojdbc8.jar](https://repo1.maven.org/maven2/com/oracle/database/jdbc/ojdbc8/) (such as `ojdbc8-19.9.0.0.jar`)
2. Create a new `Dockerfile` to add Oracle driver:
```
FROM apache/dolphinscheduler:latest
COPY ojdbc8-19.9.0.0.jar /opt/dolphinscheduler/lib
```
3. Build a new docker image including Oracle driver:
```
docker build -t apache/dolphinscheduler:oracle-driver .
```
4. Modify all `image` fields to `apache/dolphinscheduler:oracle-driver` in `docker-compose.yml`
> If you want to deploy dolphinscheduler on Docker Swarm, you need modify `docker-stack.yml`
5. Run a dolphinscheduler (See **How to use this docker image**)
6. Add a Oracle datasource in `Datasource manage`
### How to support Python 2 pip and custom requirements.txt?
1. Create a new `Dockerfile` to install pip:
```
FROM apache/dolphinscheduler:latest
COPY requirements.txt /tmp
RUN apt-get update && \
apt-get install -y --no-install-recommends python-pip && \
pip install --no-cache-dir -r /tmp/requirements.txt && \
rm -rf /var/lib/apt/lists/*
```
The command will install the default **pip 18.1**. If you upgrade the pip, just add one line
```
pip install --no-cache-dir -U pip && \
```
2. Build a new docker image including pip:
```
docker build -t apache/dolphinscheduler:pip .
```
3. Modify all `image` fields to `apache/dolphinscheduler:pip` in `docker-compose.yml`
> If you want to deploy dolphinscheduler on Docker Swarm, you need modify `docker-stack.yml`
4. Run a dolphinscheduler (See **How to use this docker image**)
5. Verify pip under a new Python task
### How to support Python 3?
1. Create a new `Dockerfile` to install Python 3:
```
FROM apache/dolphinscheduler:latest
RUN apt-get update && \
apt-get install -y --no-install-recommends python3 && \
rm -rf /var/lib/apt/lists/*
```
The command will install the default **Python 3.7.3**. If you also want to install **pip3**, just replace `python3` with `python3-pip` like
```
apt-get install -y --no-install-recommends python3-pip && \
```
2. Build a new docker image including Python 3:
```
docker build -t apache/dolphinscheduler:python3 .
```
3. Modify all `image` fields to `apache/dolphinscheduler:python3` in `docker-compose.yml`
> If you want to deploy dolphinscheduler on Docker Swarm, you need modify `docker-stack.yml`
4. Modify `PYTHON_HOME` to `/usr/bin/python3` in `config.env.sh`
5. Run a dolphinscheduler (See **How to use this docker image**)
6. Verify Python 3 under a new Python task
### How to support Hadoop, Spark, Flink, Hive or DataX?
Take Spark 2.4.7 as an example:
1. Download the Spark 2.4.7 release binary `spark-2.4.7-bin-hadoop2.7.tgz`
2. Run a dolphinscheduler (See **How to use this docker image**)
3. Copy the Spark 2.4.7 release binary into Docker container
```bash
docker cp spark-2.4.7-bin-hadoop2.7.tgz dolphinscheduler-worker:/opt/soft
```
Because the volume `dolphinscheduler-shared-local` is mounted on `/opt/soft`, all files in `/opt/soft` will not be lost
4. Attach the container and ensure that `SPARK_HOME2` exists
```bash
docker exec -it dolphinscheduler-worker bash
cd /opt/soft
tar zxf spark-2.4.7-bin-hadoop2.7.tgz
rm -f spark-2.4.7-bin-hadoop2.7.tgz
ln -s spark-2.4.7-bin-hadoop2.7 spark2 # or just mv
$SPARK_HOME2/bin/spark-submit --version
```
The last command will print Spark version if everything goes well
5. Verify Spark under a Shell task
```
$SPARK_HOME2/bin/spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME2/examples/jars/spark-examples_2.11-2.4.7.jar
```
Check whether the task log contains the output like `Pi is roughly 3.146015`
6. Verify Spark under a Spark task
The file `spark-examples_2.11-2.4.7.jar` needs to be uploaded to the resources first, and then create a Spark task with:
- Spark Version: `SPARK2`
- Main Class: `org.apache.spark.examples.SparkPi`
- Main Package: `spark-examples_2.11-2.4.7.jar`
- Deploy Mode: `local`
Similarly, check whether the task log contains the output like `Pi is roughly 3.146015`
7. Verify Spark on YARN
Spark on YARN (Deploy Mode is `cluster` or `client`) requires Hadoop support. Similar to Spark support, the operation of supporting Hadoop is almost the same as the previous steps
Ensure that `$HADOOP_HOME` and `$HADOOP_CONF_DIR` exists
### How to support Spark 3?
In fact, the way to submit applications with `spark-submit` is the same, regardless of Spark 1, 2 or 3. In other words, the semantics of `SPARK_HOME2` is the second `SPARK_HOME` instead of `SPARK2`'s `HOME`, so just set `SPARK_HOME2=/path/to/spark3`
Take Spark 3.1.1 as an example:
1. Download the Spark 3.1.1 release binary `spark-3.1.1-bin-hadoop2.7.tgz`
2. Run a dolphinscheduler (See **How to use this docker image**)
3. Copy the Spark 3.1.1 release binary into Docker container
```bash
docker cp spark-3.1.1-bin-hadoop2.7.tgz dolphinscheduler-worker:/opt/soft
```
4. Attach the container and ensure that `SPARK_HOME2` exists
```bash
docker exec -it dolphinscheduler-worker bash
cd /opt/soft
tar zxf spark-3.1.1-bin-hadoop2.7.tgz
rm -f spark-3.1.1-bin-hadoop2.7.tgz
ln -s spark-3.1.1-bin-hadoop2.7 spark2 # or just mv
$SPARK_HOME2/bin/spark-submit --version
```
The last command will print Spark version if everything goes well
5. Verify Spark under a Shell task
```
$SPARK_HOME2/bin/spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME2/examples/jars/spark-examples_2.12-3.1.1.jar
```
Check whether the task log contains the output like `Pi is roughly 3.146015`
### How to support shared storage between Master, Worker and Api server?
For example, Master, Worker and Api server may use Hadoop at the same time
1. Modify the volume `dolphinscheduler-shared-local` to support nfs in `docker-compose.yml`
> If you want to deploy dolphinscheduler on Docker Swarm, you need modify `docker-stack.yml`
```yaml
volumes:
dolphinscheduler-shared-local:
driver_opts:
type: "nfs"
o: "addr=10.40.0.199,nolock,soft,rw"
device: ":/path/to/shared/dir"
```
2. Put the Hadoop into the nfs
3. Ensure that `$HADOOP_HOME` and `$HADOOP_CONF_DIR` are correct
### How to support local file resource storage instead of HDFS and S3?
1. Modify the following environment variables in `config.env.sh`:
```
RESOURCE_STORAGE_TYPE=HDFS
FS_DEFAULT_FS=file:///
```
2. Modify the volume `dolphinscheduler-resource-local` to support nfs in `docker-compose.yml`
> If you want to deploy dolphinscheduler on Docker Swarm, you need modify `docker-stack.yml`
```yaml
volumes:
dolphinscheduler-resource-local:
driver_opts:
type: "nfs"
o: "addr=10.40.0.199,nolock,soft,rw"
device: ":/path/to/resource/dir"
```
### How to support S3 resource storage like MinIO?
Take MinIO as an example: Modify the following environment variables in `config.env.sh`
```
RESOURCE_STORAGE_TYPE=S3
RESOURCE_UPLOAD_PATH=/dolphinscheduler
FS_DEFAULT_FS=s3a://BUCKET_NAME
FS_S3A_ENDPOINT=http://MINIO_IP:9000
FS_S3A_ACCESS_KEY=MINIO_ACCESS_KEY
FS_S3A_SECRET_KEY=MINIO_SECRET_KEY
```
`BUCKET_NAME`, `MINIO_IP`, `MINIO_ACCESS_KEY` and `MINIO_SECRET_KEY` need to be modified to actual values
> **Note**: `MINIO_IP` can only use IP instead of domain name, because DolphinScheduler currently doesn't support S3 path style access
### How to configure SkyWalking?
Modify SKYWALKING environment variables in `config.env.sh`:
```
SKYWALKING_ENABLE=true
SW_AGENT_COLLECTOR_BACKEND_SERVICES=127.0.0.1:11800
SW_GRPC_LOG_SERVER_HOST=127.0.0.1
SW_GRPC_LOG_SERVER_PORT=11800
```
For more information please refer to the [incubator-dolphinscheduler](https://github.com/apache/incubator-dolphinscheduler.git) documentation.

910
docker/build/README_zh_CN.md

@ -1,910 +0,0 @@
DolphinScheduler
=================
* [DolphinScheduler 是什么?](#dolphinscheduler-是什么)
* [先决条件](#先决条件)
* [如何使用 Docker 镜像](#如何使用-docker-镜像)
* [以 docker\-compose 的方式启动 DolphinScheduler (推荐)](#以-docker-compose-的方式启动-dolphinscheduler-推荐)
* [或者通过指定已存在的 PostgreSQL 和 ZooKeeper 服务](#或者通过指定已存在的-postgresql-和-zookeeper-服务)
* [或者运行 DolphinScheduler 中的部分服务](#或者运行-dolphinscheduler-中的部分服务)
* [如何构建一个 Docker 镜像](#如何构建一个-docker-镜像)
* [支持矩阵](#支持矩阵)
* [环境变量](#环境变量)
* [数据库](#数据库)
* [ZooKeeper](#zookeeper)
* [通用](#通用)
* [Master Server](#master-server)
* [Worker Server](#worker-server)
* [Alert Server](#alert-server)
* [Api Server](#api-server)
* [Logger Server](#logger-server)
* [初始化脚本](#初始化脚本)
* [FAQ](#faq)
* [如何通过 docker\-compose 停止 DolphinScheduler?](#如何通过-docker-compose-停止-dolphinscheduler)
* [如何在 Docker Swarm 上部署 DolphinScheduler?](#如何在-docker-swarm-上部署-dolphinscheduler)
* [如何用 MySQL 替代 PostgreSQL 作为 DolphinScheduler 的数据库?](#如何用-mysql-替代-postgresql-作为-dolphinscheduler-的数据库)
* [如何在数据源中心支持 MySQL 数据源?](#如何在数据源中心支持-mysql-数据源)
* [如何在数据源中心支持 Oracle 数据源?](#如何在数据源中心支持-oracle-数据源)
* [如何支持 Python 2 pip 以及自定义 requirements\.txt?](#如何支持-python-2-pip-以及自定义-requirementstxt)
* [如何支持 Python 3?](#如何支持-python-3)
* [如何支持 Hadoop, Spark, Flink, Hive 或 DataX?](#如何支持-hadoop-spark-flink-hive-或-datax)
* [如何支持 Spark 3?](#如何支持-spark-3)
* [如何在 Master、Worker 和 Api 服务之间支持共享存储?](#如何在-masterworker-和-api-服务之间支持共享存储)
* [如何支持本地文件存储而非 HDFS 和 S3?](#如何支持本地文件存储而非-hdfs-和-s3)
* [如何支持 S3 资源存储,例如 MinIO?](#如何支持-s3-资源存储例如-minio)
* [如何配置 SkyWalking?](#如何配置-skywalking)
## DolphinScheduler 是什么?
一个分布式易扩展的可视化DAG工作流任务调度系统。致力于解决数据处理流程中错综复杂的依赖关系,使调度系统在数据处理流程中`开箱即用`。
GitHub URL: https://github.com/apache/incubator-dolphinscheduler
Official Website: https://dolphinscheduler.apache.org
![DolphinScheduler](https://dolphinscheduler.apache.org/img/hlogo_colorful.svg)
[![EN doc](https://img.shields.io/badge/document-English-blue.svg)](README.md)
[![CN doc](https://img.shields.io/badge/文档-中文版-blue.svg)](README_zh_CN.md)
## 先决条件
- [Docker](https://docs.docker.com/engine/) 1.13.1+
- [Docker Compose](https://docs.docker.com/compose/) 1.11.0+
## 如何使用 Docker 镜像
#### 以 docker-compose 的方式启动 DolphinScheduler (推荐)
```
$ docker-compose -f ./docker/docker-swarm/docker-compose.yml up -d
```
在`docker-compose.yml`文件中,默认创建的**PostgreSQL**的用户、密码和数据库,默认值分别为:`root`、`root`、`dolphinscheduler`。
同时,默认的**ZooKeeper**也会在`docker-compose.yml`文件中被创建。
访问前端页面:http://192.168.xx.xx:12345/dolphinscheduler
默认的用户是`admin`,默认的密码是`dolphinscheduler123`
> **提示**: 为了在docker中快速开始,你可以创建一个名为`ds`的租户,并将这个租户`ds`关联到用户`admin`
#### 或者通过指定已存在的 PostgreSQL 和 ZooKeeper 服务
你可以指定已存在的 **PostgreSQL****ZooKeeper** 服务. 如下:
```
$ docker run -d --name dolphinscheduler \
-e DATABASE_HOST="192.168.x.x" -e DATABASE_PORT="5432" -e DATABASE_DATABASE="dolphinscheduler" \
-e DATABASE_USERNAME="test" -e DATABASE_PASSWORD="test" \
-e ZOOKEEPER_QUORUM="192.168.x.x:2181" \
-p 12345:12345 \
apache/dolphinscheduler:latest all
```
访问前端页面:http://192.168.xx.xx:12345/dolphinscheduler
#### 或者运行 DolphinScheduler 中的部分服务
你能够运行 DolphinScheduler 中的部分服务。
* 启动一个 **master server**, 如下:
```
$ docker run -d --name dolphinscheduler-master \
-e DATABASE_HOST="192.168.x.x" -e DATABASE_PORT="5432" -e DATABASE_DATABASE="dolphinscheduler" \
-e DATABASE_USERNAME="test" -e DATABASE_PASSWORD="test" \
-e ZOOKEEPER_QUORUM="192.168.x.x:2181" \
apache/dolphinscheduler:latest master-server
```
* 启动一个 **worker server** (包括 **logger server**), 如下:
```
$ docker run -d --name dolphinscheduler-worker \
-e DATABASE_HOST="192.168.x.x" -e DATABASE_PORT="5432" -e DATABASE_DATABASE="dolphinscheduler" \
-e DATABASE_USERNAME="test" -e DATABASE_PASSWORD="test" \
-e ZOOKEEPER_QUORUM="192.168.x.x:2181" \
apache/dolphinscheduler:latest worker-server
```
* 启动一个 **api server**, 如下:
```
$ docker run -d --name dolphinscheduler-api \
-e DATABASE_HOST="192.168.x.x" -e DATABASE_PORT="5432" -e DATABASE_DATABASE="dolphinscheduler" \
-e DATABASE_USERNAME="test" -e DATABASE_PASSWORD="test" \
-e ZOOKEEPER_QUORUM="192.168.x.x:2181" \
-p 12345:12345 \
apache/dolphinscheduler:latest api-server
```
* 启动一个 **alert server**, 如下:
```
$ docker run -d --name dolphinscheduler-alert \
-e DATABASE_HOST="192.168.x.x" -e DATABASE_PORT="5432" -e DATABASE_DATABASE="dolphinscheduler" \
-e DATABASE_USERNAME="test" -e DATABASE_PASSWORD="test" \
apache/dolphinscheduler:latest alert-server
```
**注意**: 当你运行dolphinscheduler中的部分服务时,你必须指定这些环境变量 `DATABASE_HOST`, `DATABASE_PORT`, `DATABASE_DATABASE`, `DATABASE_USERNAME`, `DATABASE_PASSWORD`, `ZOOKEEPER_QUORUM`
## 如何构建一个 Docker 镜像
你能够在类Unix系统和Windows系统中构建一个docker镜像。
类Unix系统, 如下:
```bash
$ cd path/incubator-dolphinscheduler
$ sh ./docker/build/hooks/build
```
Windows系统, 如下:
```bat
C:\incubator-dolphinscheduler>.\docker\build\hooks\build.bat
```
如果你不理解这些脚本 `./docker/build/hooks/build` `./docker/build/hooks/build.bat`,请阅读里面的内容
## 支持矩阵
| Type | 支持 | 备注 |
| ------------------------------------------------------------ | ------- | --------------------- |
| Shell | 是 | |
| Python2 | 是 | |
| Python3 | 间接支持 | 详见 FAQ |
| Hadoop2 | 间接支持 | 详见 FAQ |
| Hadoop3 | 尚未确定 | 尚未测试 |
| Spark-Local(client) | 间接支持 | 详见 FAQ |
| Spark-YARN(cluster) | 间接支持 | 详见 FAQ |
| Spark-Mesos(cluster) | 尚不 | |
| Spark-Standalone(cluster) | 尚不 | |
| Spark-Kubernetes(cluster) | 尚不 | |
| Flink-Local(local>=1.11) | 尚不 | Generic CLI 模式尚未支持 |
| Flink-YARN(yarn-cluster) | 间接支持 | 详见 FAQ |
| Flink-YARN(yarn-session/yarn-per-job/yarn-application>=1.11) | 尚不 | Generic CLI 模式尚未支持 |
| Flink-Mesos(default) | 尚不 | |
| Flink-Mesos(remote>=1.11) | 尚不 | Generic CLI 模式尚未支持 |
| Flink-Standalone(default) | 尚不 | |
| Flink-Standalone(remote>=1.11) | 尚不 | Generic CLI 模式尚未支持 |
| Flink-Kubernetes(default) | 尚不 | |
| Flink-Kubernetes(remote>=1.11) | 尚不 | Generic CLI 模式尚未支持 |
| Flink-NativeKubernetes(kubernetes-session/application>=1.11) | 尚不 | Generic CLI 模式尚未支持 |
| MapReduce | 间接支持 | 详见 FAQ |
| Kerberos | 间接支持 | 详见 FAQ |
| HTTP | 是 | |
| DataX | 间接支持 | 详见 FAQ |
| Sqoop | 间接支持 | 详见 FAQ |
| SQL-MySQL | 间接支持 | 详见 FAQ |
| SQL-PostgreSQL | 是 | |
| SQL-Hive | 间接支持 | 详见 FAQ |
| SQL-Spark | 间接支持 | 详见 FAQ |
| SQL-ClickHouse | 间接支持 | 详见 FAQ |
| SQL-Oracle | 间接支持 | 详见 FAQ |
| SQL-SQLServer | 间接支持 | 详见 FAQ |
| SQL-DB2 | 间接支持 | 详见 FAQ |
## 环境变量
Docker 容器通过环境变量进行配置,缺省时将会使用默认值
特别地,在 Docker Compose 和 Docker Swarm 中,可以通过环境变量配置文件 `config.env.sh` 进行配置
### 数据库
**`DATABASE_TYPE`**
配置`database`的`TYPE`, 默认值 `postgresql`
**注意**: 当运行`dolphinscheduler`中`master-server`、`worker-server`、`api-server`、`alert-server`这些服务时,必须指定这个环境变量,以便于你更好的搭建分布式服务。
**`DATABASE_DRIVER`**
配置`database`的`DRIVER`, 默认值 `org.postgresql.Driver`
**注意**: 当运行`dolphinscheduler`中`master-server`、`worker-server`、`api-server`、`alert-server`这些服务时,必须指定这个环境变量,以便于你更好的搭建分布式服务。
**`DATABASE_HOST`**
配置`database`的`HOST`, 默认值 `127.0.0.1`
**注意**: 当运行`dolphinscheduler`中`master-server`、`worker-server`、`api-server`、`alert-server`这些服务时,必须指定这个环境变量,以便于你更好的搭建分布式服务。
**`DATABASE_PORT`**
配置`database`的`PORT`, 默认值 `5432`
**注意**: 当运行`dolphinscheduler`中`master-server`、`worker-server`、`api-server`、`alert-server`这些服务时,必须指定这个环境变量,以便于你更好的搭建分布式服务。
**`DATABASE_USERNAME`**
配置`database`的`USERNAME`, 默认值 `root`
**注意**: 当运行`dolphinscheduler`中`master-server`、`worker-server`、`api-server`、`alert-server`这些服务时,必须指定这个环境变量,以便于你更好的搭建分布式服务。
**`DATABASE_PASSWORD`**
配置`database`的`PASSWORD`, 默认值 `root`
**注意**: 当运行`dolphinscheduler`中`master-server`、`worker-server`、`api-server`、`alert-server`这些服务时,必须指定这个环境变量,以便于你更好的搭建分布式服务。
**`DATABASE_DATABASE`**
配置`database`的`DATABASE`, 默认值 `dolphinscheduler`
**注意**: 当运行`dolphinscheduler`中`master-server`、`worker-server`、`api-server`、`alert-server`这些服务时,必须指定这个环境变量,以便于你更好的搭建分布式服务。
**`DATABASE_PARAMS`**
配置`database`的`PARAMS`, 默认值 `characterEncoding=utf8`
**注意**: 当运行`dolphinscheduler`中`master-server`、`worker-server`、`api-server`、`alert-server`这些服务时,必须指定这个环境变量,以便于你更好的搭建分布式服务。
### ZooKeeper
**`ZOOKEEPER_QUORUM`**
配置`dolphinscheduler`的`Zookeeper`地址, 默认值 `127.0.0.1:2181`
**注意**: 当运行`dolphinscheduler`中`master-server`、`worker-server`、`api-server`这些服务时,必须指定这个环境变量,以便于你更好的搭建分布式服务。
**`ZOOKEEPER_ROOT`**
配置`dolphinscheduler`在`zookeeper`中数据存储的根目录,默认值 `/dolphinscheduler`
### 通用
**`DOLPHINSCHEDULER_OPTS`**
配置`dolphinscheduler`的`jvm options`,适用于`master-server`、`worker-server`、`api-server`、`alert-server`、`logger-server`,默认值 `""`
**`DATA_BASEDIR_PATH`**
用户数据目录, 用户自己配置, 请确保这个目录存在并且用户读写权限, 默认值 `/tmp/dolphinscheduler`
**`RESOURCE_STORAGE_TYPE`**
配置`dolphinscheduler`的资源存储类型,可选项为 `HDFS`、`S3`、`NONE`,默认值 `HDFS`
**`RESOURCE_UPLOAD_PATH`**
配置`HDFS/S3`上的资源存储路径,默认值 `/dolphinscheduler`
**`FS_DEFAULT_FS`**
配置资源存储的文件系统协议,如 `file:///`, `hdfs://mycluster:8020` or `s3a://dolphinscheduler`,默认值 `file:///`
**`FS_S3A_ENDPOINT`**
当`RESOURCE_STORAGE_TYPE=S3`时,需要配置`S3`的访问路径,默认值 `s3.xxx.amazonaws.com`
**`FS_S3A_ACCESS_KEY`**
当`RESOURCE_STORAGE_TYPE=S3`时,需要配置`S3`的`s3 access key`,默认值 `xxxxxxx`
**`FS_S3A_SECRET_KEY`**
当`RESOURCE_STORAGE_TYPE=S3`时,需要配置`S3`的`s3 secret key`,默认值 `xxxxxxx`
**`HADOOP_SECURITY_AUTHENTICATION_STARTUP_STATE`**
配置`dolphinscheduler`是否启用kerberos,默认值 `false`
**`JAVA_SECURITY_KRB5_CONF_PATH`**
配置`dolphinscheduler`的java.security.krb5.conf路径,默认值 `/opt/krb5.conf`
**`LOGIN_USER_KEYTAB_USERNAME`**
配置`dolphinscheduler`登录用户的keytab用户名,默认值 `hdfs@HADOOP.COM`
**`LOGIN_USER_KEYTAB_PATH`**
配置`dolphinscheduler`登录用户的keytab路径,默认值 `/opt/hdfs.keytab`
**`KERBEROS_EXPIRE_TIME`**
配置`dolphinscheduler`的kerberos过期时间,单位为小时,默认值 `2`
**`HDFS_ROOT_USER`**
当`RESOURCE_STORAGE_TYPE=HDFS`时,配置`dolphinscheduler`的hdfs的root用户名,默认值 `hdfs`
**`YARN_RESOURCEMANAGER_HA_RM_IDS`**
配置`dolphinscheduler`的yarn resourcemanager ha rm ids,默认值 `空`
**`YARN_APPLICATION_STATUS_ADDRESS`**
配置`dolphinscheduler`的yarn application status地址,默认值 `http://ds1:8088/ws/v1/cluster/apps/%s`
**`SKYWALKING_ENABLE`**
配置`skywalking`是否启用. 默认值 `false`
**`SW_AGENT_COLLECTOR_BACKEND_SERVICES`**
配置`skywalking`的collector后端地址. 默认值 `127.0.0.1:11800`
**`SW_GRPC_LOG_SERVER_HOST`**
配置`skywalking`的grpc服务主机或IP. 默认值 `127.0.0.1`
**`SW_GRPC_LOG_SERVER_PORT`**
配置`skywalking`的grpc服务端口. 默认值 `11800`
**`HADOOP_HOME`**
配置`dolphinscheduler`的`HADOOP_HOME`,默认值 `/opt/soft/hadoop`
**`HADOOP_CONF_DIR`**
配置`dolphinscheduler`的`HADOOP_CONF_DIR`,默认值 `/opt/soft/hadoop/etc/hadoop`
**`SPARK_HOME1`**
配置`dolphinscheduler`的`SPARK_HOME1`,默认值 `/opt/soft/spark1`
**`SPARK_HOME2`**
配置`dolphinscheduler`的`SPARK_HOME2`,默认值 `/opt/soft/spark2`
**`PYTHON_HOME`**
配置`dolphinscheduler`的`PYTHON_HOME`,默认值 `/usr/bin/python`
**`JAVA_HOME`**
配置`dolphinscheduler`的`JAVA_HOME`,默认值 `/usr/local/openjdk-8`
**`HIVE_HOME`**
配置`dolphinscheduler`的`HIVE_HOME`,默认值 `/opt/soft/hive`
**`FLINK_HOME`**
配置`dolphinscheduler`的`FLINK_HOME`,默认值 `/opt/soft/flink`
**`DATAX_HOME`**
配置`dolphinscheduler`的`DATAX_HOME`,默认值 `/opt/soft/datax`
### Master Server
**`MASTER_SERVER_OPTS`**
配置`master-server`的`jvm options`,默认值 `-Xms1g -Xmx1g -Xmn512m`
**`MASTER_EXEC_THREADS`**
配置`master-server`中的执行线程数量,默认值 `100`
**`MASTER_EXEC_TASK_NUM`**
配置`master-server`中的执行任务数量,默认值 `20`
**`MASTER_DISPATCH_TASK_NUM`**
配置`master-server`中的派发任务数量,默认值 `3`
**`MASTER_HOST_SELECTOR`**
配置`master-server`中派发任务时worker host的选择器,可选值为`Random`, `RoundRobin`和`LowerWeight`,默认值 `LowerWeight`
**`MASTER_HEARTBEAT_INTERVAL`**
配置`master-server`中的心跳交互时间,默认值 `10`
**`MASTER_TASK_COMMIT_RETRYTIMES`**
配置`master-server`中的任务提交重试次数,默认值 `5`
**`MASTER_TASK_COMMIT_INTERVAL`**
配置`master-server`中的任务提交交互时间,默认值 `1000`
**`MASTER_MAX_CPULOAD_AVG`**
配置`master-server`中的CPU中的`load average`值,默认值 `-1`
**`MASTER_RESERVED_MEMORY`**
配置`master-server`的保留内存,单位为G,默认值 `0.3`
### Worker Server
**`WORKER_SERVER_OPTS`**
配置`worker-server`的`jvm options`,默认值 `-Xms1g -Xmx1g -Xmn512m`
**`WORKER_EXEC_THREADS`**
配置`worker-server`中的执行线程数量,默认值 `100`
**`WORKER_HEARTBEAT_INTERVAL`**
配置`worker-server`中的心跳交互时间,默认值 `10`
**`WORKER_MAX_CPULOAD_AVG`**
配置`worker-server`中的CPU中的最大`load average`值,默认值 `-1`
**`WORKER_RESERVED_MEMORY`**
配置`worker-server`的保留内存,单位为G,默认值 `0.3`
**`WORKER_GROUPS`**
配置`worker-server`的分组,默认值 `default`
### Alert Server
**`ALERT_SERVER_OPTS`**
配置`alert-server`的`jvm options`,默认值 `-Xms512m -Xmx512m -Xmn256m`
**`XLS_FILE_PATH`**
配置`alert-server`的`XLS`文件的存储路径,默认值 `/tmp/xls`
**`MAIL_SERVER_HOST`**
配置`alert-server`的邮件服务地址,默认值 `空`
**`MAIL_SERVER_PORT`**
配置`alert-server`的邮件服务端口,默认值 `空`
**`MAIL_SENDER`**
配置`alert-server`的邮件发送人,默认值 `空`
**`MAIL_USER=`**
配置`alert-server`的邮件服务用户名,默认值 `空`
**`MAIL_PASSWD`**
配置`alert-server`的邮件服务用户密码,默认值 `空`
**`MAIL_SMTP_STARTTLS_ENABLE`**
配置`alert-server`的邮件服务是否启用TLS,默认值 `true`
**`MAIL_SMTP_SSL_ENABLE`**
配置`alert-server`的邮件服务是否启用SSL,默认值 `false`
**`MAIL_SMTP_SSL_TRUST`**
配置`alert-server`的邮件服务SSL的信任地址,默认值 `空`
**`ENTERPRISE_WECHAT_ENABLE`**
配置`alert-server`的邮件服务是否启用企业微信,默认值 `false`
**`ENTERPRISE_WECHAT_CORP_ID`**
配置`alert-server`的邮件服务企业微信`ID`,默认值 `空`
**`ENTERPRISE_WECHAT_SECRET`**
配置`alert-server`的邮件服务企业微信`SECRET`,默认值 `空`
**`ENTERPRISE_WECHAT_AGENT_ID`**
配置`alert-server`的邮件服务企业微信`AGENT_ID`,默认值 `空`
**`ENTERPRISE_WECHAT_USERS`**
配置`alert-server`的邮件服务企业微信`USERS`,默认值 `空`
### Api Server
**`API_SERVER_OPTS`**
配置`api-server`的`jvm options`,默认值 `-Xms512m -Xmx512m -Xmn256m`
### Logger Server
**`LOGGER_SERVER_OPTS`**
配置`logger-server`的`jvm options`,默认值 `-Xms512m -Xmx512m -Xmn256m`
## 初始化脚本
如果你想在编译的时候或者运行的时候附加一些其它的操作及新增一些环境变量,你可以在`/root/start-init-conf.sh`文件中进行修改,同时如果涉及到配置文件的修改,请在`/opt/dolphinscheduler/conf/*.tpl`中修改相应的配置文件
例如,在`/root/start-init-conf.sh`添加一个环境变量`SECURITY_AUTHENTICATION_TYPE`:
```
export SECURITY_AUTHENTICATION_TYPE=PASSWORD
```
当添加以上环境变量后,你应该在相应的模板文件`application-api.properties.tpl`中添加这个环境变量配置:
```
security.authentication.type=${SECURITY_AUTHENTICATION_TYPE}
```
`/root/start-init-conf.sh`将根据模板文件动态的生成配置文件:
```sh
echo "generate dolphinscheduler config"
ls ${DOLPHINSCHEDULER_HOME}/conf/ | grep ".tpl" | while read line; do
eval "cat << EOF
$(cat ${DOLPHINSCHEDULER_HOME}/conf/${line})
EOF
" > ${DOLPHINSCHEDULER_HOME}/conf/${line%.*}
done
```
## FAQ
### 如何通过 docker-compose 停止 DolphinScheduler?
停止所有容器:
```
docker-compose stop
```
停止所有容器并移除所有容器,网络和存储卷:
```
docker-compose down -v
```
### 如何在 Docker Swarm 上部署 DolphinScheduler?
假设 Docker Swarm 集群已经部署(如果还没有创建 Docker Swarm 集群,请参考 [create-swarm](https://docs.docker.com/engine/swarm/swarm-tutorial/create-swarm/))
启动名为 dolphinscheduler 的 stack
```
docker stack deploy -c docker-stack.yml dolphinscheduler
```
启动并移除名为 dolphinscheduler 的 stack
```
docker stack rm dolphinscheduler
```
### 如何用 MySQL 替代 PostgreSQL 作为 DolphinScheduler 的数据库?
> 由于商业许可证的原因,我们不能直接使用 MySQL 的驱动包.
>
> 如果你要使用 MySQL, 你可以基于官方镜像 `apache/dolphinscheduler` 进行构建.
1. 下载 MySQL 驱动包 [mysql-connector-java-5.1.49.jar](https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.49/mysql-connector-java-5.1.49.jar) (要求 `>=5.1.47`)
2. 创建一个新的 `Dockerfile`,用于添加 MySQL 的驱动包:
```
FROM apache/dolphinscheduler:latest
COPY mysql-connector-java-5.1.49.jar /opt/dolphinscheduler/lib
```
3. 构建一个包含 MySQL 驱动包的新镜像:
```
docker build -t apache/dolphinscheduler:mysql-driver .
```
4. 修改 `docker-compose.yml` 文件中的所有 image 字段为 `apache/dolphinscheduler:mysql-driver`
> 如果你想在 Docker Swarm 上部署 dolphinscheduler,你需要修改 `docker-stack.yml`
5. 注释 `docker-compose.yml` 文件中的 `dolphinscheduler-postgresql`
6. 在 `docker-compose.yml` 文件中添加 `dolphinscheduler-mysql` 服务(**可选**,你可以直接使用一个外部的 MySQL 数据库)
7. 修改 `config.env.sh` 文件中的 DATABASE 环境变量
```
DATABASE_TYPE=mysql
DATABASE_DRIVER=com.mysql.jdbc.Driver
DATABASE_HOST=dolphinscheduler-mysql
DATABASE_PORT=3306
DATABASE_USERNAME=root
DATABASE_PASSWORD=root
DATABASE_DATABASE=dolphinscheduler
DATABASE_PARAMS=useUnicode=true&characterEncoding=UTF-8
```
> 如果你已经添加了 `dolphinscheduler-mysql` 服务,设置 `DATABASE_HOST``dolphinscheduler-mysql` 即可
8. 运行 dolphinscheduler (详见**如何使用docker镜像**)
### 如何在数据源中心支持 MySQL 数据源?
> 由于商业许可证的原因,我们不能直接使用 MySQL 的驱动包.
>
> 如果你要添加 MySQL 数据源, 你可以基于官方镜像 `apache/dolphinscheduler` 进行构建.
1. 下载 MySQL 驱动包 [mysql-connector-java-5.1.49.jar](https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.49/mysql-connector-java-5.1.49.jar) (要求 `>=5.1.47`)
2. 创建一个新的 `Dockerfile`,用于添加 MySQL 驱动包:
```
FROM apache/dolphinscheduler:latest
COPY mysql-connector-java-5.1.49.jar /opt/dolphinscheduler/lib
```
3. 构建一个包含 MySQL 驱动包的新镜像:
```
docker build -t apache/dolphinscheduler:mysql-driver .
```
4. 将 `docker-compose.yml` 文件中的所有 `image` 字段修改为 `apache/dolphinscheduler:mysql-driver`
> 如果你想在 Docker Swarm 上部署 dolphinscheduler,你需要修改 `docker-stack.yml`
5. 运行 dolphinscheduler (详见**如何使用docker镜像**)
6. 在数据源中心添加一个 MySQL 数据源
### 如何在数据源中心支持 Oracle 数据源?
> 由于商业许可证的原因,我们不能直接使用 Oracle 的驱动包.
>
> 如果你要添加 Oracle 数据源, 你可以基于官方镜像 `apache/dolphinscheduler` 进行构建.
1. 下载 Oracle 驱动包 [ojdbc8.jar](https://repo1.maven.org/maven2/com/oracle/database/jdbc/ojdbc8/) (such as `ojdbc8-19.9.0.0.jar`)
2. 创建一个新的 `Dockerfile`,用于添加 Oracle 驱动包:
```
FROM apache/dolphinscheduler:latest
COPY ojdbc8-19.9.0.0.jar /opt/dolphinscheduler/lib
```
3. 构建一个包含 Oracle 驱动包的新镜像:
```
docker build -t apache/dolphinscheduler:oracle-driver .
```
4. 将 `docker-compose.yml` 文件中的所有 `image` 字段修改为 `apache/dolphinscheduler:oracle-driver`
> 如果你想在 Docker Swarm 上部署 dolphinscheduler,你需要修改 `docker-stack.yml`
5. 运行 dolphinscheduler (详见**如何使用docker镜像**)
6. 在数据源中心添加一个 Oracle 数据源
### 如何支持 Python 2 pip 以及自定义 requirements.txt?
1. 创建一个新的 `Dockerfile`,用于安装 pip:
```
FROM apache/dolphinscheduler:latest
COPY requirements.txt /tmp
RUN apt-get update && \
apt-get install -y --no-install-recommends python-pip && \
pip install --no-cache-dir -r /tmp/requirements.txt && \
rm -rf /var/lib/apt/lists/*
```
这个命令会安装默认的 **pip 18.1**. 如果你想升级 pip, 只需添加一行
```
pip install --no-cache-dir -U pip && \
```
2. 构建一个包含 pip 的新镜像:
```
docker build -t apache/dolphinscheduler:pip .
```
3. 将 `docker-compose.yml` 文件中的所有 `image` 字段修改为 `apache/dolphinscheduler:pip`
> 如果你想在 Docker Swarm 上部署 dolphinscheduler,你需要修改 `docker-stack.yml`
4. 运行 dolphinscheduler (详见**如何使用docker镜像**)
5. 在一个新 Python 任务下验证 pip
### 如何支持 Python 3?
1. 创建一个新的 `Dockerfile`,用于安装 Python 3:
```
FROM apache/dolphinscheduler:latest
RUN apt-get update && \
apt-get install -y --no-install-recommends python3 && \
rm -rf /var/lib/apt/lists/*
```
这个命令会安装默认的 **Python 3.7.3**. 如果你也想安装 **pip3**, 将 `python3` 替换为 `python3-pip` 即可
```
apt-get install -y --no-install-recommends python3-pip && \
```
2. 构建一个包含 Python 3 的新镜像:
```
docker build -t apache/dolphinscheduler:python3 .
```
3. 将 `docker-compose.yml` 文件中的所有 `image` 字段修改为 `apache/dolphinscheduler:python3`
> 如果你想在 Docker Swarm 上部署 dolphinscheduler,你需要修改 `docker-stack.yml`
4. 修改 `config.env.sh` 文件中的 `PYTHON_HOME``/usr/bin/python3`
5. 运行 dolphinscheduler (详见**如何使用docker镜像**)
6. 在一个新 Python 任务下验证 Python 3
### 如何支持 Hadoop, Spark, Flink, Hive 或 DataX?
以 Spark 2.4.7 为例:
1. 下载 Spark 2.4.7 发布的二进制包 `spark-2.4.7-bin-hadoop2.7.tgz`
2. 运行 dolphinscheduler (详见**如何使用docker镜像**)
3. 复制 Spark 2.4.7 二进制包到 Docker 容器中
```bash
docker cp spark-2.4.7-bin-hadoop2.7.tgz dolphinscheduler-worker:/opt/soft
```
因为存储卷 `dolphinscheduler-shared-local` 被挂载到 `/opt/soft`, 因此 `/opt/soft` 中的所有文件都不会丢失
4. 登录到容器并确保 `SPARK_HOME2` 存在
```bash
docker exec -it dolphinscheduler-worker bash
cd /opt/soft
tar zxf spark-2.4.7-bin-hadoop2.7.tgz
rm -f spark-2.4.7-bin-hadoop2.7.tgz
ln -s spark-2.4.7-bin-hadoop2.7 spark2 # or just mv
$SPARK_HOME2/bin/spark-submit --version
```
如果一切执行正常,最后一条命令将会打印 Spark 版本信息
5. 在一个 Shell 任务下验证 Spark
```
$SPARK_HOME2/bin/spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME2/examples/jars/spark-examples_2.11-2.4.7.jar
```
检查任务日志是否包含输出 `Pi is roughly 3.146015`
6. 在一个 Spark 任务下验证 Spark
文件 `spark-examples_2.11-2.4.7.jar` 需要先被上传到资源中心,然后创建一个 Spark 任务并设置:
- Spark版本: `SPARK2`
- 主函数的Class: `org.apache.spark.examples.SparkPi`
- 主程序包: `spark-examples_2.11-2.4.7.jar`
- 部署方式: `local`
同样地, 检查任务日志是否包含输出 `Pi is roughly 3.146015`
7. 验证 Spark on YARN
Spark on YARN (部署方式为 `cluster``client`) 需要 Hadoop 支持. 类似于 Spark 支持, 支持 Hadoop 的操作几乎和前面的步骤相同
确保 `$HADOOP_HOME``$HADOOP_CONF_DIR` 存在
### 如何支持 Spark 3?
事实上,使用 `spark-submit` 提交应用的方式是相同的, 无论是 Spark 1, 2 或 3. 换句话说,`SPARK_HOME2` 的语义是第二个 `SPARK_HOME`, 而非 `SPARK2``HOME`, 因此只需设置 `SPARK_HOME2=/path/to/spark3` 即可
以 Spark 3.1.1 为例:
1. 下载 Spark 3.1.1 发布的二进制包 `spark-3.1.1-bin-hadoop2.7.tgz`
2. 运行 dolphinscheduler (详见**如何使用docker镜像**)
3. 复制 Spark 3.1.1 二进制包到 Docker 容器中
```bash
docker cp spark-3.1.1-bin-hadoop2.7.tgz dolphinscheduler-worker:/opt/soft
```
4. 登录到容器并确保 `SPARK_HOME2` 存在
```bash
docker exec -it dolphinscheduler-worker bash
cd /opt/soft
tar zxf spark-3.1.1-bin-hadoop2.7.tgz
rm -f spark-3.1.1-bin-hadoop2.7.tgz
ln -s spark-3.1.1-bin-hadoop2.7 spark2 # or just mv
$SPARK_HOME2/bin/spark-submit --version
```
如果一切执行正常,最后一条命令将会打印 Spark 版本信息
5. 在一个 Shell 任务下验证 Spark
```
$SPARK_HOME2/bin/spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME2/examples/jars/spark-examples_2.12-3.1.1.jar
```
检查任务日志是否包含输出 `Pi is roughly 3.146015`
### 如何在 Master、Worker 和 Api 服务之间支持共享存储?
例如, Master、Worker 和 Api 服务可能同时使用 Hadoop
1. 修改 `docker-compose.yml` 文件中的 `dolphinscheduler-shared-local` 存储卷,以支持 nfs
> 如果你想在 Docker Swarm 上部署 dolphinscheduler,你需要修改 `docker-stack.yml`
```yaml
volumes:
dolphinscheduler-shared-local:
driver_opts:
type: "nfs"
o: "addr=10.40.0.199,nolock,soft,rw"
device: ":/path/to/shared/dir"
```
2. 将 Hadoop 放到 nfs
3. 确保 `$HADOOP_HOME``$HADOOP_CONF_DIR` 正确
### 如何支持本地文件存储而非 HDFS 和 S3?
1. 修改 `config.env.sh` 文件中下面的环境变量:
```
RESOURCE_STORAGE_TYPE=HDFS
FS_DEFAULT_FS=file:///
```
2. 修改 `docker-compose.yml` 文件中的 `dolphinscheduler-shared-local` 存储卷,以支持 nfs
> 如果你想在 Docker Swarm 上部署 dolphinscheduler,你需要修改 `docker-stack.yml`
```yaml
volumes:
dolphinscheduler-resource-local:
driver_opts:
type: "nfs"
o: "addr=10.40.0.199,nolock,soft,rw"
device: ":/path/to/resource/dir"
```
### 如何支持 S3 资源存储,例如 MinIO?
以 MinIO 为例: 修改 `config.env.sh` 文件中下面的环境变量
```
RESOURCE_STORAGE_TYPE=S3
RESOURCE_UPLOAD_PATH=/dolphinscheduler
FS_DEFAULT_FS=s3a://BUCKET_NAME
FS_S3A_ENDPOINT=http://MINIO_IP:9000
FS_S3A_ACCESS_KEY=MINIO_ACCESS_KEY
FS_S3A_SECRET_KEY=MINIO_SECRET_KEY
```
`BUCKET_NAME`, `MINIO_IP`, `MINIO_ACCESS_KEY``MINIO_SECRET_KEY` 需要被修改为实际值
> **注意**: `MINIO_IP` 只能使用 IP 而非域名, 因为 DolphinScheduler 尚不支持 S3 路径风格访问 (S3 path style access)
### 如何配置 SkyWalking?
修改 `config.env.sh` 文件中的 SKYWALKING 环境变量
```
SKYWALKING_ENABLE=true
SW_AGENT_COLLECTOR_BACKEND_SERVICES=127.0.0.1:11800
SW_GRPC_LOG_SERVER_HOST=127.0.0.1
SW_GRPC_LOG_SERVER_PORT=11800
```
更多信息请查看 [incubator-dolphinscheduler](https://github.com/apache/incubator-dolphinscheduler.git) 文档.

722
docker/kubernetes/dolphinscheduler/README.md

@ -1,722 +0,0 @@
DolphinScheduler
=================
* [Introduction](#introduction)
* [Prerequisites](#prerequisites)
* [Installing the Chart](#installing-the-chart)
* [Access DolphinScheduler UI](#access-dolphinscheduler-ui)
* [Uninstalling the Chart](#uninstalling-the-chart)
* [Support Matrix](#support-matrix)
* [Configuration](#configuration)
* [FAQ](#faq)
* [How to use MySQL as the DolphinScheduler's database instead of PostgreSQL?](#how-to-use-mysql-as-the-dolphinschedulers-database-instead-of-postgresql)
* [How to support MySQL datasource in Datasource manage?](#how-to-support-mysql-datasource-in-datasource-manage)
* [How to support Oracle datasource in Datasource manage?](#how-to-support-oracle-datasource-in-datasource-manage)
* [How to support Python 2 pip and custom requirements\.txt?](#how-to-support-python-2-pip-and-custom-requirementstxt)
* [How to support Python 3?](#how-to-support-python-3)
* [How to support Hadoop, Spark, Flink, Hive or DataX?](#how-to-support-hadoop-spark-flink-hive-or-datax)
* [How to support Spark 3?](#how-to-support-spark-3)
* [How to support shared storage between Master, Worker and Api server?](#how-to-support-shared-storage-between-master-worker-and-api-server)
* [How to support local file resource storage instead of HDFS and S3?](#how-to-support-local-file-resource-storage-instead-of-hdfs-and-s3)
* [How to support S3 resource storage like MinIO?](#how-to-support-s3-resource-storage-like-minio)
* [How to configure SkyWalking?](#how-to-configure-skywalking)
## Introduction
[DolphinScheduler](https://dolphinscheduler.apache.org) is a distributed and easy-to-expand visual DAG workflow scheduling system, dedicated to solving the complex dependencies in data processing, making the scheduling system out of the box for data processing.
This chart bootstraps a [DolphinScheduler](https://dolphinscheduler.apache.org) distributed deployment on a [Kubernetes](http://kubernetes.io) cluster using the [Helm](https://helm.sh) package manager.
## Prerequisites
- [Helm](https://helm.sh/) 3.1.0+
- [Kubernetes](https://kubernetes.io/) 1.12+
- PV provisioner support in the underlying infrastructure
## Installing the Chart
To install the chart with the release name `dolphinscheduler`:
```bash
$ git clone https://github.com/apache/incubator-dolphinscheduler.git
$ cd incubator-dolphinscheduler/docker/kubernetes/dolphinscheduler
$ helm repo add bitnami https://charts.bitnami.com/bitnami
$ helm dependency update .
$ helm install dolphinscheduler .
```
To install the chart with a namespace named `test`:
```bash
$ helm install dolphinscheduler . -n test
```
> **Tip**: If a namespace named `test` is used, the option `-n test` needs to be added to the `helm` and `kubectl` command
These commands deploy DolphinScheduler on the Kubernetes cluster in the default configuration. The [configuration](#configuration) section lists the parameters that can be configured during installation.
> **Tip**: List all releases using `helm list`
## Access DolphinScheduler UI
If `ingress.enabled` in `values.yaml` is set to `true`, you just access `http://${ingress.host}/dolphinscheduler` in browser.
> **Tip**: If there is a problem with ingress access, please contact the Kubernetes administrator and refer to the [Ingress](https://kubernetes.io/docs/concepts/services-networking/ingress/)
Otherwise, when `api.service.type=ClusterIP` you need to execute port-forward command like:
```bash
$ kubectl port-forward --address 0.0.0.0 svc/dolphinscheduler-api 12345:12345
$ kubectl port-forward --address 0.0.0.0 -n test svc/dolphinscheduler-api 12345:12345 # with test namespace
```
> **Tip**: If the error of `unable to do port forwarding: socat not found` appears, you need to install `socat` at first
And then access the web: http://192.168.xx.xx:12345/dolphinscheduler
Or when `api.service.type=NodePort` you need to execute the command:
```bash
NODE_IP=$(kubectl get no -n {{ .Release.Namespace }} -o jsonpath="{.items[0].status.addresses[0].address}")
NODE_PORT=$(kubectl get svc {{ template "dolphinscheduler.fullname" . }}-api -n {{ .Release.Namespace }} -o jsonpath="{.spec.ports[0].nodePort}")
echo http://$NODE_IP:$NODE_PORT/dolphinscheduler
```
And then access the web: http://$NODE_IP:$NODE_PORT/dolphinscheduler
The default username is `admin` and the default password is `dolphinscheduler123`
> **Tip**: For quick start in docker, you can create a tenant named `ds` and associate the user `admin` with the tenant `ds`
## Uninstalling the Chart
To uninstall/delete the `dolphinscheduler` deployment:
```bash
$ helm uninstall dolphinscheduler
```
The command removes all the Kubernetes components but PVC's associated with the chart and deletes the release.
To delete the PVC's associated with `dolphinscheduler`:
```bash
$ kubectl delete pvc -l app.kubernetes.io/instance=dolphinscheduler
```
> **Note**: Deleting the PVC's will delete all data as well. Please be cautious before doing it.
## Support Matrix
| Type | Support | Notes |
| ------------------------------------------------------------ | ------------ | ------------------------------------- |
| Shell | Yes | |
| Python2 | Yes | |
| Python3 | Indirect Yes | Refer to FAQ |
| Hadoop2 | Indirect Yes | Refer to FAQ |
| Hadoop3 | Not Sure | Not tested |
| Spark-Local(client) | Indirect Yes | Refer to FAQ |
| Spark-YARN(cluster) | Indirect Yes | Refer to FAQ |
| Spark-Mesos(cluster) | Not Yet | |
| Spark-Standalone(cluster) | Not Yet | |
| Spark-Kubernetes(cluster) | Not Yet | |
| Flink-Local(local>=1.11) | Not Yet | Generic CLI mode is not yet supported |
| Flink-YARN(yarn-cluster) | Indirect Yes | Refer to FAQ |
| Flink-YARN(yarn-session/yarn-per-job/yarn-application>=1.11) | Not Yet | Generic CLI mode is not yet supported |
| Flink-Mesos(default) | Not Yet | |
| Flink-Mesos(remote>=1.11) | Not Yet | Generic CLI mode is not yet supported |
| Flink-Standalone(default) | Not Yet | |
| Flink-Standalone(remote>=1.11) | Not Yet | Generic CLI mode is not yet supported |
| Flink-Kubernetes(default) | Not Yet | |
| Flink-Kubernetes(remote>=1.11) | Not Yet | Generic CLI mode is not yet supported |
| Flink-NativeKubernetes(kubernetes-session/application>=1.11) | Not Yet | Generic CLI mode is not yet supported |
| MapReduce | Indirect Yes | Refer to FAQ |
| Kerberos | Indirect Yes | Refer to FAQ |
| HTTP | Yes | |
| DataX | Indirect Yes | Refer to FAQ |
| Sqoop | Indirect Yes | Refer to FAQ |
| SQL-MySQL | Indirect Yes | Refer to FAQ |
| SQL-PostgreSQL | Yes | |
| SQL-Hive | Indirect Yes | Refer to FAQ |
| SQL-Spark | Indirect Yes | Refer to FAQ |
| SQL-ClickHouse | Indirect Yes | Refer to FAQ |
| SQL-Oracle | Indirect Yes | Refer to FAQ |
| SQL-SQLServer | Indirect Yes | Refer to FAQ |
| SQL-DB2 | Indirect Yes | Refer to FAQ |
## Configuration
The configuration file is `values.yaml`, and the following tables lists the configurable parameters of the DolphinScheduler chart and their default values.
| Parameter | Description | Default |
| --------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------- |
| `timezone` | World time and date for cities in all time zones | `Asia/Shanghai` |
| | | |
| `image.repository` | Docker image repository for the DolphinScheduler | `apache/dolphinscheduler` |
| `image.tag` | Docker image version for the DolphinScheduler | `latest` |
| `image.pullPolicy` | Image pull policy. One of Always, Never, IfNotPresent | `IfNotPresent` |
| `image.pullSecret` | Image pull secret. An optional reference to secret in the same namespace to use for pulling any of the images | `nil` |
| | | |
| `postgresql.enabled` | If not exists external PostgreSQL, by default, the DolphinScheduler will use a internal PostgreSQL | `true` |
| `postgresql.postgresqlUsername` | The username for internal PostgreSQL | `root` |
| `postgresql.postgresqlPassword` | The password for internal PostgreSQL | `root` |
| `postgresql.postgresqlDatabase` | The database for internal PostgreSQL | `dolphinscheduler` |
| `postgresql.persistence.enabled` | Set `postgresql.persistence.enabled` to `true` to mount a new volume for internal PostgreSQL | `false` |
| `postgresql.persistence.size` | `PersistentVolumeClaim` size | `20Gi` |
| `postgresql.persistence.storageClass` | PostgreSQL data persistent volume storage class. If set to "-", storageClassName: "", which disables dynamic provisioning | `-` |
| `externalDatabase.type` | If exists external PostgreSQL, and set `postgresql.enabled` value to false. DolphinScheduler's database type will use it | `postgresql` |
| `externalDatabase.driver` | If exists external PostgreSQL, and set `postgresql.enabled` value to false. DolphinScheduler's database driver will use it | `org.postgresql.Driver` |
| `externalDatabase.host` | If exists external PostgreSQL, and set `postgresql.enabled` value to false. DolphinScheduler's database host will use it | `localhost` |
| `externalDatabase.port` | If exists external PostgreSQL, and set `postgresql.enabled` value to false. DolphinScheduler's database port will use it | `5432` |
| `externalDatabase.username` | If exists external PostgreSQL, and set `postgresql.enabled` value to false. DolphinScheduler's database username will use it | `root` |
| `externalDatabase.password` | If exists external PostgreSQL, and set `postgresql.enabled` value to false. DolphinScheduler's database password will use it | `root` |
| `externalDatabase.database` | If exists external PostgreSQL, and set `postgresql.enabled` value to false. DolphinScheduler's database database will use it | `dolphinscheduler` |
| `externalDatabase.params` | If exists external PostgreSQL, and set `postgresql.enabled` value to false. DolphinScheduler's database params will use it | `characterEncoding=utf8` |
| | | |
| `zookeeper.enabled` | If not exists external Zookeeper, by default, the DolphinScheduler will use a internal Zookeeper | `true` |
| `zookeeper.fourlwCommandsWhitelist` | A list of comma separated Four Letter Words commands to use | `srvr,ruok,wchs,cons` |
| `zookeeper.persistence.enabled` | Set `zookeeper.persistence.enabled` to `true` to mount a new volume for internal Zookeeper | `false` |
| `zookeeper.persistence.size` | `PersistentVolumeClaim` size | `20Gi` |
| `zookeeper.persistence.storageClass` | Zookeeper data persistent volume storage class. If set to "-", storageClassName: "", which disables dynamic provisioning | `-` |
| `zookeeper.zookeeperRoot` | Specify dolphinscheduler root directory in Zookeeper | `/dolphinscheduler` |
| `externalZookeeper.zookeeperQuorum` | If exists external Zookeeper, and set `zookeeper.enabled` value to false. Specify Zookeeper quorum | `127.0.0.1:2181` |
| `externalZookeeper.zookeeperRoot` | If exists external Zookeeper, and set `zookeeper.enabled` value to false. Specify dolphinscheduler root directory in Zookeeper | `/dolphinscheduler` |
| | | |
| `common.configmap.DOLPHINSCHEDULER_OPTS` | The jvm options for dolphinscheduler, suitable for all servers | `""` |
| `common.configmap.DATA_BASEDIR_PATH` | User data directory path, self configuration, please make sure the directory exists and have read write permissions | `/tmp/dolphinscheduler` |
| `common.configmap.RESOURCE_STORAGE_TYPE` | Resource storage type: HDFS, S3, NONE | `HDFS` |
| `common.configmap.RESOURCE_UPLOAD_PATH` | Resource store on HDFS/S3 path, please make sure the directory exists on hdfs and have read write permissions | `/dolphinscheduler` |
| `common.configmap.FS_DEFAULT_FS` | Resource storage file system like `file:///`, `hdfs://mycluster:8020` or `s3a://dolphinscheduler` | `file:///` |
| `common.configmap.FS_S3A_ENDPOINT` | S3 endpoint when `common.configmap.RESOURCE_STORAGE_TYPE` is set to `S3` | `s3.xxx.amazonaws.com` |
| `common.configmap.FS_S3A_ACCESS_KEY` | S3 access key when `common.configmap.RESOURCE_STORAGE_TYPE` is set to `S3` | `xxxxxxx` |
| `common.configmap.FS_S3A_SECRET_KEY` | S3 secret key when `common.configmap.RESOURCE_STORAGE_TYPE` is set to `S3` | `xxxxxxx` |
| `common.configmap.HADOOP_SECURITY_AUTHENTICATION_STARTUP_STATE` | Whether to startup kerberos | `false` |
| `common.configmap.JAVA_SECURITY_KRB5_CONF_PATH` | The java.security.krb5.conf path | `/opt/krb5.conf` |
| `common.configmap.LOGIN_USER_KEYTAB_USERNAME` | The login user from keytab username | `hdfs@HADOOP.COM` |
| `common.configmap.LOGIN_USER_KEYTAB_PATH` | The login user from keytab path | `/opt/hdfs.keytab` |
| `common.configmap.KERBEROS_EXPIRE_TIME` | The kerberos expire time, the unit is hour | `2` |
| `common.configmap.HDFS_ROOT_USER` | The HDFS root user who must have the permission to create directories under the HDFS root path | `hdfs` |
| `common.configmap.YARN_RESOURCEMANAGER_HA_RM_IDS` | If resourcemanager HA is enabled, please set the HA IPs | `nil` |
| `common.configmap.YARN_APPLICATION_STATUS_ADDRESS` | If resourcemanager is single, you only need to replace ds1 to actual resourcemanager hostname, otherwise keep default | `http://ds1:8088/ws/v1/cluster/apps/%s` |
| `common.configmap.SKYWALKING_ENABLE` | Set whether to enable skywalking | `false` |
| `common.configmap.SW_AGENT_COLLECTOR_BACKEND_SERVICES` | Set agent collector backend services for skywalking | `127.0.0.1:11800` |
| `common.configmap.SW_GRPC_LOG_SERVER_HOST` | Set grpc log server host for skywalking | `127.0.0.1` |
| `common.configmap.SW_GRPC_LOG_SERVER_PORT` | Set grpc log server port for skywalking | `11800` |
| `common.configmap.HADOOP_HOME` | Set `HADOOP_HOME` for DolphinScheduler's task environment | `/opt/soft/hadoop` |
| `common.configmap.HADOOP_CONF_DIR` | Set `HADOOP_CONF_DIR` for DolphinScheduler's task environment | `/opt/soft/hadoop/etc/hadoop` |
| `common.configmap.SPARK_HOME1` | Set `SPARK_HOME1` for DolphinScheduler's task environment | `/opt/soft/spark1` |
| `common.configmap.SPARK_HOME2` | Set `SPARK_HOME2` for DolphinScheduler's task environment | `/opt/soft/spark2` |
| `common.configmap.PYTHON_HOME` | Set `PYTHON_HOME` for DolphinScheduler's task environment | `/usr/bin/python` |
| `common.configmap.JAVA_HOME` | Set `JAVA_HOME` for DolphinScheduler's task environment | `/usr/local/openjdk-8` |
| `common.configmap.HIVE_HOME` | Set `HIVE_HOME` for DolphinScheduler's task environment | `/opt/soft/hive` |
| `common.configmap.FLINK_HOME` | Set `FLINK_HOME` for DolphinScheduler's task environment | `/opt/soft/flink` |
| `common.configmap.DATAX_HOME` | Set `DATAX_HOME` for DolphinScheduler's task environment | `/opt/soft/datax` |
| `common.sharedStoragePersistence.enabled` | Set `common.sharedStoragePersistence.enabled` to `true` to mount a shared storage volume for Hadoop, Spark binary and etc | `false` |
| `common.sharedStoragePersistence.mountPath` | The mount path for the shared storage volume | `/opt/soft` |
| `common.sharedStoragePersistence.accessModes` | `PersistentVolumeClaim` access modes, must be `ReadWriteMany` | `[ReadWriteMany]` |
| `common.sharedStoragePersistence.storageClassName` | Shared Storage persistent volume storage class, must support the access mode: ReadWriteMany | `-` |
| `common.sharedStoragePersistence.storage` | `PersistentVolumeClaim` size | `20Gi` |
| `common.fsFileResourcePersistence.enabled` | Set `common.fsFileResourcePersistence.enabled` to `true` to mount a new file resource volume for `api` and `worker` | `false` |
| `common.fsFileResourcePersistence.accessModes` | `PersistentVolumeClaim` access modes, must be `ReadWriteMany` | `[ReadWriteMany]` |
| `common.fsFileResourcePersistence.storageClassName` | Resource persistent volume storage class, must support the access mode: ReadWriteMany | `-` |
| `common.fsFileResourcePersistence.storage` | `PersistentVolumeClaim` size | `20Gi` |
| | | |
| `master.podManagementPolicy` | PodManagementPolicy controls how pods are created during initial scale up, when replacing pods on nodes, or when scaling down | `Parallel` |
| `master.replicas` | Replicas is the desired number of replicas of the given Template | `3` |
| `master.annotations` | The `annotations` for master server | `{}` |
| `master.affinity` | If specified, the pod's scheduling constraints | `{}` |
| `master.nodeSelector` | NodeSelector is a selector which must be true for the pod to fit on a node | `{}` |
| `master.tolerations` | If specified, the pod's tolerations | `{}` |
| `master.resources` | The `resource` limit and request config for master server | `{}` |
| `master.configmap.MASTER_SERVER_OPTS` | The jvm options for master server | `-Xms1g -Xmx1g -Xmn512m` |
| `master.configmap.MASTER_EXEC_THREADS` | Master execute thread number | `100` |
| `master.configmap.MASTER_EXEC_TASK_NUM` | Master execute task number in parallel | `20` |
| `master.configmap.MASTER_DISPATCH_TASK_NUM` | Master dispatch task number | `3` |
| `master.configmap.MASTER_HOST_SELECTOR` | Master host selector to select a suitable worker, optional values include Random, RoundRobin, LowerWeight | `LowerWeight` |
| `master.configmap.MASTER_HEARTBEAT_INTERVAL` | Master heartbeat interval | `10` |
| `master.configmap.MASTER_TASK_COMMIT_RETRYTIMES` | Master commit task retry times | `5` |
| `master.configmap.MASTER_TASK_COMMIT_INTERVAL` | Master commit task interval | `1000` |
| `master.configmap.MASTER_MAX_CPULOAD_AVG` | Only less than cpu avg load, master server can work. default value : the number of cpu cores * 2 | `-1` |
| `master.configmap.MASTER_RESERVED_MEMORY` | Only larger than reserved memory, master server can work. default value : physical memory * 1/10, unit is G | `0.3` |
| `master.livenessProbe.enabled` | Turn on and off liveness probe | `true` |
| `master.livenessProbe.initialDelaySeconds` | Delay before liveness probe is initiated | `30` |
| `master.livenessProbe.periodSeconds` | How often to perform the probe | `30` |
| `master.livenessProbe.timeoutSeconds` | When the probe times out | `5` |
| `master.livenessProbe.failureThreshold` | Minimum consecutive successes for the probe | `3` |
| `master.livenessProbe.successThreshold` | Minimum consecutive failures for the probe | `1` |
| `master.readinessProbe.enabled` | Turn on and off readiness probe | `true` |
| `master.readinessProbe.initialDelaySeconds` | Delay before readiness probe is initiated | `30` |
| `master.readinessProbe.periodSeconds` | How often to perform the probe | `30` |
| `master.readinessProbe.timeoutSeconds` | When the probe times out | `5` |
| `master.readinessProbe.failureThreshold` | Minimum consecutive successes for the probe | `3` |
| `master.readinessProbe.successThreshold` | Minimum consecutive failures for the probe | `1` |
| `master.persistentVolumeClaim.enabled` | Set `master.persistentVolumeClaim.enabled` to `true` to mount a new volume for `master` | `false` |
| `master.persistentVolumeClaim.accessModes` | `PersistentVolumeClaim` access modes | `[ReadWriteOnce]` |
| `master.persistentVolumeClaim.storageClassName` | `Master` logs data persistent volume storage class. If set to "-", storageClassName: "", which disables dynamic provisioning | `-` |
| `master.persistentVolumeClaim.storage` | `PersistentVolumeClaim` size | `20Gi` |
| | | |
| `worker.podManagementPolicy` | PodManagementPolicy controls how pods are created during initial scale up, when replacing pods on nodes, or when scaling down | `Parallel` |
| `worker.replicas` | Replicas is the desired number of replicas of the given Template | `3` |
| `worker.annotations` | The `annotations` for worker server | `{}` |
| `worker.affinity` | If specified, the pod's scheduling constraints | `{}` |
| `worker.nodeSelector` | NodeSelector is a selector which must be true for the pod to fit on a node | `{}` |
| `worker.tolerations` | If specified, the pod's tolerations | `{}` |
| `worker.resources` | The `resource` limit and request config for worker server | `{}` |
| `worker.configmap.LOGGER_SERVER_OPTS` | The jvm options for logger server | `-Xms512m -Xmx512m -Xmn256m` |
| `worker.configmap.WORKER_SERVER_OPTS` | The jvm options for worker server | `-Xms1g -Xmx1g -Xmn512m` |
| `worker.configmap.WORKER_EXEC_THREADS` | Worker execute thread number | `100` |
| `worker.configmap.WORKER_HEARTBEAT_INTERVAL` | Worker heartbeat interval | `10` |
| `worker.configmap.WORKER_MAX_CPULOAD_AVG` | Only less than cpu avg load, worker server can work. default value : the number of cpu cores * 2 | `-1` |
| `worker.configmap.WORKER_RESERVED_MEMORY` | Only larger than reserved memory, worker server can work. default value : physical memory * 1/10, unit is G | `0.3` |
| `worker.configmap.WORKER_GROUPS` | Worker groups | `default` |
| `worker.livenessProbe.enabled` | Turn on and off liveness probe | `true` |
| `worker.livenessProbe.initialDelaySeconds` | Delay before liveness probe is initiated | `30` |
| `worker.livenessProbe.periodSeconds` | How often to perform the probe | `30` |
| `worker.livenessProbe.timeoutSeconds` | When the probe times out | `5` |
| `worker.livenessProbe.failureThreshold` | Minimum consecutive successes for the probe | `3` |
| `worker.livenessProbe.successThreshold` | Minimum consecutive failures for the probe | `1` |
| `worker.readinessProbe.enabled` | Turn on and off readiness probe | `true` |
| `worker.readinessProbe.initialDelaySeconds` | Delay before readiness probe is initiated | `30` |
| `worker.readinessProbe.periodSeconds` | How often to perform the probe | `30` |
| `worker.readinessProbe.timeoutSeconds` | When the probe times out | `5` |
| `worker.readinessProbe.failureThreshold` | Minimum consecutive successes for the probe | `3` |
| `worker.readinessProbe.successThreshold` | Minimum consecutive failures for the probe | `1` |
| `worker.persistentVolumeClaim.enabled` | Set `worker.persistentVolumeClaim.enabled` to `true` to enable `persistentVolumeClaim` for `worker` | `false` |
| `worker.persistentVolumeClaim.dataPersistentVolume.enabled` | Set `worker.persistentVolumeClaim.dataPersistentVolume.enabled` to `true` to mount a data volume for `worker` | `false` |
| `worker.persistentVolumeClaim.dataPersistentVolume.accessModes` | `PersistentVolumeClaim` access modes | `[ReadWriteOnce]` |
| `worker.persistentVolumeClaim.dataPersistentVolume.storageClassName` | `Worker` data persistent volume storage class. If set to "-", storageClassName: "", which disables dynamic provisioning | `-` |
| `worker.persistentVolumeClaim.dataPersistentVolume.storage` | `PersistentVolumeClaim` size | `20Gi` |
| `worker.persistentVolumeClaim.logsPersistentVolume.enabled` | Set `worker.persistentVolumeClaim.logsPersistentVolume.enabled` to `true` to mount a logs volume for `worker` | `false` |
| `worker.persistentVolumeClaim.logsPersistentVolume.accessModes` | `PersistentVolumeClaim` access modes | `[ReadWriteOnce]` |
| `worker.persistentVolumeClaim.logsPersistentVolume.storageClassName` | `Worker` logs data persistent volume storage class. If set to "-", storageClassName: "", which disables dynamic provisioning | `-` |
| `worker.persistentVolumeClaim.logsPersistentVolume.storage` | `PersistentVolumeClaim` size | `20Gi` |
| | | |
| `alert.replicas` | Replicas is the desired number of replicas of the given Template | `1` |
| `alert.strategy.type` | Type of deployment. Can be "Recreate" or "RollingUpdate" | `RollingUpdate` |
| `alert.strategy.rollingUpdate.maxSurge` | The maximum number of pods that can be scheduled above the desired number of pods | `25%` |
| `alert.strategy.rollingUpdate.maxUnavailable` | The maximum number of pods that can be unavailable during the update | `25%` |
| `alert.annotations` | The `annotations` for alert server | `{}` |
| `alert.affinity` | If specified, the pod's scheduling constraints | `{}` |
| `alert.nodeSelector` | NodeSelector is a selector which must be true for the pod to fit on a node | `{}` |
| `alert.tolerations` | If specified, the pod's tolerations | `{}` |
| `alert.resources` | The `resource` limit and request config for alert server | `{}` |
| `alert.configmap.ALERT_SERVER_OPTS` | The jvm options for alert server | `-Xms512m -Xmx512m -Xmn256m` |
| `alert.configmap.XLS_FILE_PATH` | XLS file path | `/tmp/xls` |
| `alert.configmap.MAIL_SERVER_HOST` | Mail `SERVER HOST ` | `nil` |
| `alert.configmap.MAIL_SERVER_PORT` | Mail `SERVER PORT` | `nil` |
| `alert.configmap.MAIL_SENDER` | Mail `SENDER` | `nil` |
| `alert.configmap.MAIL_USER` | Mail `USER` | `nil` |
| `alert.configmap.MAIL_PASSWD` | Mail `PASSWORD` | `nil` |
| `alert.configmap.MAIL_SMTP_STARTTLS_ENABLE` | Mail `SMTP STARTTLS` enable | `false` |
| `alert.configmap.MAIL_SMTP_SSL_ENABLE` | Mail `SMTP SSL` enable | `false` |
| `alert.configmap.MAIL_SMTP_SSL_TRUST` | Mail `SMTP SSL TRUST` | `nil` |
| `alert.configmap.ENTERPRISE_WECHAT_ENABLE` | `Enterprise Wechat` enable | `false` |
| `alert.configmap.ENTERPRISE_WECHAT_CORP_ID` | `Enterprise Wechat` corp id | `nil` |
| `alert.configmap.ENTERPRISE_WECHAT_SECRET` | `Enterprise Wechat` secret | `nil` |
| `alert.configmap.ENTERPRISE_WECHAT_AGENT_ID` | `Enterprise Wechat` agent id | `nil` |
| `alert.configmap.ENTERPRISE_WECHAT_USERS` | `Enterprise Wechat` users | `nil` |
| `alert.livenessProbe.enabled` | Turn on and off liveness probe | `true` |
| `alert.livenessProbe.initialDelaySeconds` | Delay before liveness probe is initiated | `30` |
| `alert.livenessProbe.periodSeconds` | How often to perform the probe | `30` |
| `alert.livenessProbe.timeoutSeconds` | When the probe times out | `5` |
| `alert.livenessProbe.failureThreshold` | Minimum consecutive successes for the probe | `3` |
| `alert.livenessProbe.successThreshold` | Minimum consecutive failures for the probe | `1` |
| `alert.readinessProbe.enabled` | Turn on and off readiness probe | `true` |
| `alert.readinessProbe.initialDelaySeconds` | Delay before readiness probe is initiated | `30` |
| `alert.readinessProbe.periodSeconds` | How often to perform the probe | `30` |
| `alert.readinessProbe.timeoutSeconds` | When the probe times out | `5` |
| `alert.readinessProbe.failureThreshold` | Minimum consecutive successes for the probe | `3` |
| `alert.readinessProbe.successThreshold` | Minimum consecutive failures for the probe | `1` |
| `alert.persistentVolumeClaim.enabled` | Set `alert.persistentVolumeClaim.enabled` to `true` to mount a new volume for `alert` | `false` |
| `alert.persistentVolumeClaim.accessModes` | `PersistentVolumeClaim` access modes | `[ReadWriteOnce]` |
| `alert.persistentVolumeClaim.storageClassName` | `Alert` logs data persistent volume storage class. If set to "-", storageClassName: "", which disables dynamic provisioning | `-` |
| `alert.persistentVolumeClaim.storage` | `PersistentVolumeClaim` size | `20Gi` |
| | | |
| `api.replicas` | Replicas is the desired number of replicas of the given Template | `1` |
| `api.strategy.type` | Type of deployment. Can be "Recreate" or "RollingUpdate" | `RollingUpdate` |
| `api.strategy.rollingUpdate.maxSurge` | The maximum number of pods that can be scheduled above the desired number of pods | `25%` |
| `api.strategy.rollingUpdate.maxUnavailable` | The maximum number of pods that can be unavailable during the update | `25%` |
| `api.annotations` | The `annotations` for api server | `{}` |
| `api.affinity` | If specified, the pod's scheduling constraints | `{}` |
| `api.nodeSelector` | NodeSelector is a selector which must be true for the pod to fit on a node | `{}` |
| `api.tolerations` | If specified, the pod's tolerations | `{}` |
| `api.resources` | The `resource` limit and request config for api server | `{}` |
| `api.configmap.API_SERVER_OPTS` | The jvm options for api server | `-Xms512m -Xmx512m -Xmn256m` |
| `api.livenessProbe.enabled` | Turn on and off liveness probe | `true` |
| `api.livenessProbe.initialDelaySeconds` | Delay before liveness probe is initiated | `30` |
| `api.livenessProbe.periodSeconds` | How often to perform the probe | `30` |
| `api.livenessProbe.timeoutSeconds` | When the probe times out | `5` |
| `api.livenessProbe.failureThreshold` | Minimum consecutive successes for the probe | `3` |
| `api.livenessProbe.successThreshold` | Minimum consecutive failures for the probe | `1` |
| `api.readinessProbe.enabled` | Turn on and off readiness probe | `true` |
| `api.readinessProbe.initialDelaySeconds` | Delay before readiness probe is initiated | `30` |
| `api.readinessProbe.periodSeconds` | How often to perform the probe | `30` |
| `api.readinessProbe.timeoutSeconds` | When the probe times out | `5` |
| `api.readinessProbe.failureThreshold` | Minimum consecutive successes for the probe | `3` |
| `api.readinessProbe.successThreshold` | Minimum consecutive failures for the probe | `1` |
| `api.persistentVolumeClaim.enabled` | Set `api.persistentVolumeClaim.enabled` to `true` to mount a new volume for `api` | `false` |
| `api.persistentVolumeClaim.accessModes` | `PersistentVolumeClaim` access modes | `[ReadWriteOnce]` |
| `api.persistentVolumeClaim.storageClassName` | `api` logs data persistent volume storage class. If set to "-", storageClassName: "", which disables dynamic provisioning | `-` |
| `api.persistentVolumeClaim.storage` | `PersistentVolumeClaim` size | `20Gi` |
| `api.service.type` | `type` determines how the Service is exposed. Valid options are ExternalName, ClusterIP, NodePort, and LoadBalancer | `ClusterIP` |
| `api.service.clusterIP` | `clusterIP` is the IP address of the service and is usually assigned randomly by the master | `nil` |
| `api.service.nodePort` | `nodePort` is the port on each node on which this service is exposed when type=NodePort | `nil` |
| `api.service.externalIPs` | `externalIPs` is a list of IP addresses for which nodes in the cluster will also accept traffic for this service | `[]` |
| `api.service.externalName` | `externalName` is the external reference that kubedns or equivalent will return as a CNAME record for this service | `nil` |
| `api.service.loadBalancerIP` | `loadBalancerIP` when service.type is LoadBalancer. LoadBalancer will get created with the IP specified in this field | `nil` |
| `api.service.annotations` | `annotations` may need to be set when service.type is LoadBalancer | `{}` |
| | | |
| `ingress.enabled` | Enable ingress | `false` |
| `ingress.host` | Ingress host | `dolphinscheduler.org` |
| `ingress.path` | Ingress path | `/dolphinscheduler` |
| `ingress.tls.enabled` | Enable ingress tls | `false` |
| `ingress.tls.secretName` | Ingress tls secret name | `dolphinscheduler-tls` |
## FAQ
### How to use MySQL as the DolphinScheduler's database instead of PostgreSQL?
> Because of the commercial license, we cannot directly use the driver of MySQL.
>
> If you want to use MySQL, you can build a new image based on the `apache/dolphinscheduler` image as follows.
1. Download the MySQL driver [mysql-connector-java-5.1.49.jar](https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.49/mysql-connector-java-5.1.49.jar) (require `>=5.1.47`)
2. Create a new `Dockerfile` to add MySQL driver:
```
FROM apache/dolphinscheduler:latest
COPY mysql-connector-java-5.1.49.jar /opt/dolphinscheduler/lib
```
3. Build a new docker image including MySQL driver:
```
docker build -t apache/dolphinscheduler:mysql-driver .
```
4. Push the docker image `apache/dolphinscheduler:mysql-driver` to a docker registry
5. Modify image `repository` and update `tag` to `mysql-driver` in `values.yaml`
6. Modify postgresql `enabled` to `false` in `values.yaml`
7. Modify externalDatabase (especially modify `host`, `username` and `password`) in `values.yaml`:
```yaml
externalDatabase:
type: "mysql"
driver: "com.mysql.jdbc.Driver"
host: "localhost"
port: "3306"
username: "root"
password: "root"
database: "dolphinscheduler"
params: "useUnicode=true&characterEncoding=UTF-8"
```
8. Run a DolphinScheduler release in Kubernetes (See **Installing the Chart**)
### How to support MySQL datasource in `Datasource manage`?
> Because of the commercial license, we cannot directly use the driver of MySQL.
>
> If you want to add MySQL datasource, you can build a new image based on the `apache/dolphinscheduler` image as follows.
1. Download the MySQL driver [mysql-connector-java-5.1.49.jar](https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.49/mysql-connector-java-5.1.49.jar) (require `>=5.1.47`)
2. Create a new `Dockerfile` to add MySQL driver:
```
FROM apache/dolphinscheduler:latest
COPY mysql-connector-java-5.1.49.jar /opt/dolphinscheduler/lib
```
3. Build a new docker image including MySQL driver:
```
docker build -t apache/dolphinscheduler:mysql-driver .
```
4. Push the docker image `apache/dolphinscheduler:mysql-driver` to a docker registry
5. Modify image `repository` and update `tag` to `mysql-driver` in `values.yaml`
6. Run a DolphinScheduler release in Kubernetes (See **Installing the Chart**)
7. Add a MySQL datasource in `Datasource manage`
### How to support Oracle datasource in `Datasource manage`?
> Because of the commercial license, we cannot directly use the driver of Oracle.
>
> If you want to add Oracle datasource, you can build a new image based on the `apache/dolphinscheduler` image as follows.
1. Download the Oracle driver [ojdbc8.jar](https://repo1.maven.org/maven2/com/oracle/database/jdbc/ojdbc8/) (such as `ojdbc8-19.9.0.0.jar`)
2. Create a new `Dockerfile` to add Oracle driver:
```
FROM apache/dolphinscheduler:latest
COPY ojdbc8-19.9.0.0.jar /opt/dolphinscheduler/lib
```
3. Build a new docker image including Oracle driver:
```
docker build -t apache/dolphinscheduler:oracle-driver .
```
4. Push the docker image `apache/dolphinscheduler:oracle-driver` to a docker registry
5. Modify image `repository` and update `tag` to `oracle-driver` in `values.yaml`
6. Run a DolphinScheduler release in Kubernetes (See **Installing the Chart**)
7. Add a Oracle datasource in `Datasource manage`
### How to support Python 2 pip and custom requirements.txt?
1. Create a new `Dockerfile` to install pip:
```
FROM apache/dolphinscheduler:latest
COPY requirements.txt /tmp
RUN apt-get update && \
apt-get install -y --no-install-recommends python-pip && \
pip install --no-cache-dir -r /tmp/requirements.txt && \
rm -rf /var/lib/apt/lists/*
```
The command will install the default **pip 18.1**. If you upgrade the pip, just add one line
```
pip install --no-cache-dir -U pip && \
```
2. Build a new docker image including pip:
```
docker build -t apache/dolphinscheduler:pip .
```
3. Push the docker image `apache/dolphinscheduler:pip` to a docker registry
4. Modify image `repository` and update `tag` to `pip` in `values.yaml`
5. Run a DolphinScheduler release in Kubernetes (See **Installing the Chart**)
6. Verify pip under a new Python task
### How to support Python 3?
1. Create a new `Dockerfile` to install Python 3:
```
FROM apache/dolphinscheduler:latest
RUN apt-get update && \
apt-get install -y --no-install-recommends python3 && \
rm -rf /var/lib/apt/lists/*
```
The command will install the default **Python 3.7.3**. If you also want to install **pip3**, just replace `python3` with `python3-pip` like
```
apt-get install -y --no-install-recommends python3-pip && \
```
2. Build a new docker image including Python 3:
```
docker build -t apache/dolphinscheduler:python3 .
```
3. Push the docker image `apache/dolphinscheduler:python3` to a docker registry
4. Modify image `repository` and update `tag` to `python3` in `values.yaml`
5. Modify `PYTHON_HOME` to `/usr/bin/python3` in `values.yaml`
6. Run a DolphinScheduler release in Kubernetes (See **Installing the Chart**)
7. Verify Python 3 under a new Python task
### How to support Hadoop, Spark, Flink, Hive or DataX?
Take Spark 2.4.7 as an example:
1. Download the Spark 2.4.7 release binary `spark-2.4.7-bin-hadoop2.7.tgz`
2. Ensure that `common.sharedStoragePersistence.enabled` is turned on
3. Run a DolphinScheduler release in Kubernetes (See **Installing the Chart**)
4. Copy the Spark 2.4.7 release binary into Docker container
```bash
kubectl cp spark-2.4.7-bin-hadoop2.7.tgz dolphinscheduler-worker-0:/opt/soft
kubectl cp -n test spark-2.4.7-bin-hadoop2.7.tgz dolphinscheduler-worker-0:/opt/soft # with test namespace
```
Because the volume `sharedStoragePersistence` is mounted on `/opt/soft`, all files in `/opt/soft` will not be lost
5. Attach the container and ensure that `SPARK_HOME2` exists
```bash
kubectl exec -it dolphinscheduler-worker-0 bash
kubectl exec -n test -it dolphinscheduler-worker-0 bash # with test namespace
cd /opt/soft
tar zxf spark-2.4.7-bin-hadoop2.7.tgz
rm -f spark-2.4.7-bin-hadoop2.7.tgz
ln -s spark-2.4.7-bin-hadoop2.7 spark2 # or just mv
$SPARK_HOME2/bin/spark-submit --version
```
The last command will print Spark version if everything goes well
6. Verify Spark under a Shell task
```
$SPARK_HOME2/bin/spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME2/examples/jars/spark-examples_2.11-2.4.7.jar
```
Check whether the task log contains the output like `Pi is roughly 3.146015`
7. Verify Spark under a Spark task
The file `spark-examples_2.11-2.4.7.jar` needs to be uploaded to the resources first, and then create a Spark task with:
- Spark Version: `SPARK2`
- Main Class: `org.apache.spark.examples.SparkPi`
- Main Package: `spark-examples_2.11-2.4.7.jar`
- Deploy Mode: `local`
Similarly, check whether the task log contains the output like `Pi is roughly 3.146015`
8. Verify Spark on YARN
Spark on YARN (Deploy Mode is `cluster` or `client`) requires Hadoop support. Similar to Spark support, the operation of supporting Hadoop is almost the same as the previous steps
Ensure that `$HADOOP_HOME` and `$HADOOP_CONF_DIR` exists
### How to support Spark 3?
In fact, the way to submit applications with `spark-submit` is the same, regardless of Spark 1, 2 or 3. In other words, the semantics of `SPARK_HOME2` is the second `SPARK_HOME` instead of `SPARK2`'s `HOME`, so just set `SPARK_HOME2=/path/to/spark3`
Take Spark 3.1.1 as an example:
1. Download the Spark 3.1.1 release binary `spark-3.1.1-bin-hadoop2.7.tgz`
2. Ensure that `common.sharedStoragePersistence.enabled` is turned on
3. Run a DolphinScheduler release in Kubernetes (See **Installing the Chart**)
4. Copy the Spark 3.1.1 release binary into Docker container
```bash
kubectl cp spark-3.1.1-bin-hadoop2.7.tgz dolphinscheduler-worker-0:/opt/soft
kubectl cp -n test spark-3.1.1-bin-hadoop2.7.tgz dolphinscheduler-worker-0:/opt/soft # with test namespace
```
5. Attach the container and ensure that `SPARK_HOME2` exists
```bash
kubectl exec -it dolphinscheduler-worker-0 bash
kubectl exec -n test -it dolphinscheduler-worker-0 bash # with test namespace
cd /opt/soft
tar zxf spark-3.1.1-bin-hadoop2.7.tgz
rm -f spark-3.1.1-bin-hadoop2.7.tgz
ln -s spark-3.1.1-bin-hadoop2.7 spark2 # or just mv
$SPARK_HOME2/bin/spark-submit --version
```
The last command will print Spark version if everything goes well
6. Verify Spark under a Shell task
```
$SPARK_HOME2/bin/spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME2/examples/jars/spark-examples_2.12-3.1.1.jar
```
Check whether the task log contains the output like `Pi is roughly 3.146015`
### How to support shared storage between Master, Worker and Api server?
For example, Master, Worker and Api server may use Hadoop at the same time
1. Modify the following configurations in `values.yaml`
```yaml
common:
sharedStoragePersistence:
enabled: false
mountPath: "/opt/soft"
accessModes:
- "ReadWriteMany"
storageClassName: "-"
storage: "20Gi"
```
`storageClassName` and `storage` need to be modified to actual values
> **Note**: `storageClassName` must support the access mode: `ReadWriteMany`
2. Put the Hadoop into the nfs
3. Ensure that `$HADOOP_HOME` and `$HADOOP_CONF_DIR` are correct
### How to support local file resource storage instead of HDFS and S3?
Modify the following configurations in `values.yaml`
```yaml
common:
configmap:
RESOURCE_STORAGE_TYPE: "HDFS"
RESOURCE_UPLOAD_PATH: "/dolphinscheduler"
FS_DEFAULT_FS: "file:///"
fsFileResourcePersistence:
enabled: true
accessModes:
- "ReadWriteMany"
storageClassName: "-"
storage: "20Gi"
```
`storageClassName` and `storage` need to be modified to actual values
> **Note**: `storageClassName` must support the access mode: `ReadWriteMany`
### How to support S3 resource storage like MinIO?
Take MinIO as an example: Modify the following configurations in `values.yaml`
```yaml
common:
configmap:
RESOURCE_STORAGE_TYPE: "S3"
RESOURCE_UPLOAD_PATH: "/dolphinscheduler"
FS_DEFAULT_FS: "s3a://BUCKET_NAME"
FS_S3A_ENDPOINT: "http://MINIO_IP:9000"
FS_S3A_ACCESS_KEY: "MINIO_ACCESS_KEY"
FS_S3A_SECRET_KEY: "MINIO_SECRET_KEY"
```
`BUCKET_NAME`, `MINIO_IP`, `MINIO_ACCESS_KEY` and `MINIO_SECRET_KEY` need to be modified to actual values
> **Note**: `MINIO_IP` can only use IP instead of domain name, because DolphinScheduler currently doesn't support S3 path style access
### How to configure SkyWalking?
Modify SKYWALKING configurations in `values.yaml`:
```yaml
common:
configmap:
SKYWALKING_ENABLE: "true"
SW_AGENT_COLLECTOR_BACKEND_SERVICES: "127.0.0.1:11800"
SW_GRPC_LOG_SERVER_HOST: "127.0.0.1"
SW_GRPC_LOG_SERVER_PORT: "11800"
```
For more information please refer to the [incubator-dolphinscheduler](https://github.com/apache/incubator-dolphinscheduler.git) documentation.
Loading…
Cancel
Save