sneh-wha
2 years ago
committed by
caishunfeng
20 changed files with 781 additions and 374 deletions
@@ -1,63 +1,67 @@
# DataX

## Overview

The DataX task type is used to execute DataX programs. For DataX task nodes, the worker executes `${DATAX_HOME}/bin/datax.py` to parse the input json file.

## Create Task

- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag the <img src="../../../../img/tasks/icons/datax.png" width="15"/> icon from the toolbar onto the canvas.

## Task Parameters

| **Parameter** | **Description** |
| ------------- | --------------- |
| Node name | The node name in a workflow definition is unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it does not need to be executed, turn on the prohibit execution switch. |
| Task priority | When the number of worker threads is insufficient, tasks are executed in order from high to low priority; tasks with the same priority are executed first-in, first-out. |
| Description | Describes the function of the node. |
| Worker group | Assigns the task to the machines of the worker group for execution. If `Default` is selected, a worker machine is chosen at random. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. |
| Failed retry interval | The time interval (in minutes) for resubmitting the task after it fails. |
| Cpu quota | Assign the specified CPU time quota to the task. Takes a percentage value; the default -1 means unlimited. For example, the full CPU load of one core is 100%, and that of 16 cores is 1600%. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md). |
| Max memory | Assign the specified maximum memory to the task. Exceeding this limit triggers an OOM kill, and the task is not retried automatically. Takes an MB value; the default -1 means unlimited. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md). |
| Delayed execution time | The time, in minutes, by which task execution is delayed. |
| Timeout alarm | Check the timeout alarm and timeout failure. When the task runs longer than the "timeout period", an alarm email is sent and the task execution fails. |
| Custom template | Customize the content of the DataX node's json configuration file when the provided default data sources do not meet your requirements. |
| json | The json configuration file for DataX synchronization. |
| Custom parameters | Works the same way as in the SQL task type: the custom parameter type and data type match the stored procedure task type, and each custom parameter replaces the corresponding \${variable} in the SQL statement. |
| Data source | Select the data source from which the data will be extracted. |
| SQL statement | The SQL statement used to extract data from the source database. The column names of the SQL query are parsed automatically when the node runs and mapped to the synchronization columns of the target table. When source and target column names differ, map them with column aliases. |
| Target library | Select the target database for data synchronization. |
| Pre-sql | SQL executed before the main SQL statement, on the target database. |
| Post-sql | SQL executed after the main SQL statement, on the target database. |
| Stream limit (number of bytes) | Limits the number of bytes for the query. |
| Limit flow (number of records) | Limits the number of records for the query. |
| Running memory | The minimum and maximum memory can be configured to suit the actual production environment. |
| Predecessor task | Selecting a predecessor task for the current task sets the selected task as upstream of the current task. |

## Task Example

This example demonstrates importing data from Hive into MySQL.

### Configuring the DataX Environment in DolphinScheduler

If you are using the DataX task type in a production environment, you must configure the required environment first in the configuration file `/dolphinscheduler/conf/env/dolphinscheduler_env.sh`.

![datax_task01](../../../../img/tasks/demo/datax_task01.png)
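
As a reference, the relevant entry in `dolphinscheduler_env.sh` looks roughly like the lines below; the installation path is an assumption and should match where DataX is actually deployed on the worker.

```sh
# Make the DataX installation visible to the worker (path is illustrative).
export DATAX_HOME=/opt/soft/datax
export PATH=$DATAX_HOME/bin:$PATH
```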

After the environment has been configured, DolphinScheduler needs to be restarted.

### Configuring the DataX Task Node

As the default data sources do not include reading from Hive, a custom json is required; refer to [HDFS Writer](https://github.com/alibaba/DataX/blob/master/hdfswriter/doc/hdfswriter.md). Note: partition directories exist on the HDFS path; when importing data in real-world situations, it is recommended to pass the partition as a custom parameter.
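
A minimal sketch of such a custom json, assuming the Hive table data sits on HDFS (read with `hdfsreader`) and is written to MySQL with `mysqlwriter`. The paths, columns, credentials, and the partition parameter `${dt}` are illustrative placeholders, not values required by the plugin.

```json
{
  "job": {
    "setting": { "speed": { "channel": 1 } },
    "content": [
      {
        "reader": {
          "name": "hdfsreader",
          "parameter": {
            "defaultFS": "hdfs://localhost:8020",
            "path": "/user/hive/warehouse/demo.db/source_table/dt=${dt}/*",
            "fileType": "text",
            "fieldDelimiter": "\t",
            "column": [
              { "index": 0, "type": "long" },
              { "index": 1, "type": "string" }
            ]
          }
        },
        "writer": {
          "name": "mysqlwriter",
          "parameter": {
            "username": "root",
            "password": "******",
            "column": ["id", "name"],
            "connection": [
              { "jdbcUrl": "jdbc:mysql://localhost:3306/demo", "table": ["target_table"] }
            ]
          }
        }
      }
    ]
  }
}
```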

After writing the required json file, you can configure the node content by following the steps in the diagram below.

![datax_task02](../../../../img/tasks/demo/datax_task02.png)

### View run results

![datax_task03](../../../../img/tasks/demo/datax_task03.png)

### Note

If the provided default data sources do not meet your needs, you can configure DataX's reader and writer in the custom template option according to your actual environment; see https://github.com/alibaba/DataX.
@@ -0,0 +1,86 @@

# Jupyter

## Overview

Use the `Jupyter Task` to create a jupyter-type task and execute Jupyter notebooks. When the worker executes a `Jupyter Task`, it uses `papermill` to evaluate the notebooks. Click [here](https://papermill.readthedocs.io/en/latest/) for details about `papermill`.
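
For orientation, a `papermill` run looks roughly like the command below; the notebook names, parameter, and kernel are illustrative placeholders rather than anything the plugin hard-codes.

```sh
# Evaluate a notebook template and write the executed result to an output notebook.
papermill input_note.ipynb output_note.ipynb -p alpha 0.6 -k python3
```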

## Conda Configuration

- Config `conda.path` in `common.properties` to the path of your `conda.sh`, which should be the same `conda` you use to manage the python environment of your `papermill` and `jupyter`. Click [here](https://docs.conda.io/en/latest/) for more information about `conda`.
- `conda.path` is set to `/opt/anaconda3/etc/profile.d/conda.sh` by default. If you have no idea where your `conda` is, simply run `conda info | grep -i 'base environment'` (see the sketch below).
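
A minimal sketch of the corresponding entry in `common.properties`, using the default path mentioned above; replace it with the `conda.sh` location reported by your own installation.

```
# conda.sh used by the Jupyter task plugin to activate environments
conda.path=/opt/anaconda3/etc/profile.d/conda.sh
```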

> NOTE: `Jupyter Task Plugin` uses the `source` command to activate the conda environment.
> If your tenant does not have permission to use `source`, `Jupyter Task Plugin` will not function.

## Python Dependency Management

### Use Pre-Installed Conda Environment

1. Create a conda environment manually or using a `shell task` on your target worker (see the sketch below).
2. In your `jupyter task`, set `condaEnvName` to the name of the conda environment you just created.
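
For example (a sketch only; the environment name and Python version are assumptions), the environment in step 1 could be created like this:

```sh
# Create a conda environment and install jupyter and papermill into it.
conda create -n jupyter_env python=3.9 -y
conda activate jupyter_env
pip install jupyter papermill
```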

### Use Packed Conda Environment

1. Use [Conda-Pack](https://conda.github.io/conda-pack/) to pack your conda environment into a `tarball` (see the sketch after this list).
2. Upload the packed conda environment to the `resource center`.
3. Select your packed conda environment as a `resource` in your `jupyter task`, e.g. `jupyter_env.tar.gz`.
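
A sketch of the packing step under the same assumptions (the environment name `jupyter_env` is a placeholder):

```sh
# Pack the environment into a tarball, then upload it to the resource center.
pip install conda-pack
conda pack -n jupyter_env -o jupyter_env.tar.gz
```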

> NOTE: Make sure you follow the [Conda-Pack](https://conda.github.io/conda-pack/) official instructions.
> If you unpack your packed conda environment, the directory structure should be the same as below:

```
.
├── bin
├── conda-meta
├── etc
├── include
├── lib
├── share
└── ssl
```

> NOTE: Please follow the `conda pack` instructions above strictly, and DO NOT modify `bin/activate`.
> `Jupyter Task Plugin` uses the `source` command to activate your packed conda environment.
> If you are concerned about using `source`, choose other options to manage your python dependency.

## Create Task

- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag <img src="../../../../img/tasks/icons/jupyter.png" width="15"/> from the toolbar to the canvas.

## Task Parameters

| **Parameter** | **Description** |
| ------------- | --------------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it does not need to be executed, turn on the prohibit execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, tasks are executed in order from high to low priority; tasks with the same priority are executed first-in, first-out. |
| Worker group | The task is assigned to the machines in the worker group for execution. If `Default` is selected, a worker machine is chosen at random. |
| Task group name | The task group in Resources; if not configured, it is not used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. Supports drop-down selection and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task after it fails. Supports drop-down selection and manual filling. |
| Cpu quota | Assign the specified CPU time quota to the task. Takes a percentage value; the default -1 means unlimited. For example, the full CPU load of one core is 100%, and that of 16 cores is 1600%. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md). |
| Max memory | Assign the specified maximum memory to the task. Exceeding this limit triggers an OOM kill, and the task is not retried automatically. Takes an MB value; the default -1 means unlimited. This function is controlled by [task.resource.limit.state](../../architecture/configuration.md). |
| Timeout alarm | Check the timeout alarm and timeout failure. When the task runs longer than the "timeout period", an alarm email is sent and the task execution fails. |
| Conda Env Name | Name of the conda environment or packed conda environment tarball. |
| Input Note Path | Path of the input Jupyter notebook template. |
| Out Note Path | Path of the output notebook. |
| Jupyter Parameters | Parameters in json format used for notebook parameterization. |
| Kernel | Jupyter notebook kernel. |
| Engine | Engine used to evaluate the notebooks. |
| Jupyter Execution Timeout | Timeout for each notebook cell. |
| Jupyter Start Timeout | Timeout for starting the Jupyter notebook kernel. |
| Others | Other command-line options passed to papermill. |

## Task Example

### Jupyter Task Example

This example illustrates how to create a Jupyter task node.

![demo-jupyter-simple](../../../../img/tasks/demo/jupyter.png)
@@ -0,0 +1,48 @@

# K8S Node

## Overview

The K8S task type is used to execute a batch task. For this task, the worker submits the task using a k8s client.

## Create Task

- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag <img src="../../../../img/tasks/icons/kubernetes.png" width="15"/> from the toolbar to the canvas.

## Task Parameters

| **Parameter** | **Description** |
| ------------- | --------------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it does not need to be executed, turn on the prohibit execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, tasks are executed in order from high to low priority; tasks with the same priority are executed first-in, first-out. |
| Worker group | The task is assigned to the machines in the worker group for execution. If `Default` is selected, a worker machine is chosen at random. |
| Task group name | The task group in Resources; if not configured, it is not used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. Supports drop-down selection and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task after it fails. Supports drop-down selection and manual filling. |
| Timeout alarm | Check the timeout alarm and timeout failure. When the task runs longer than the "timeout duration", an alarm email is sent and the task execution fails. |
| Namespace | The namespace in which the k8s task runs. |
| Min CPU | Minimum CPU requirement for running the k8s task. |
| Min Memory | Minimum memory requirement for running the k8s task. |
| Image | The registry URL of the image. |
| Custom parameter | Local user-defined parameters for the K8S task; these parameters are passed to the container as environment variables (see the sketch after this table). |
| Predecessor task | Selecting a predecessor task for the current task sets the selected task as upstream of the current task. |
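
The worker builds the job spec through the k8s client, but conceptually the parameters above map onto a batch Job roughly as in the sketch below. This is a generic Kubernetes manifest for illustration only; the names, values, and exact field mapping are assumptions, not the plugin's internal spec.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: demo-task            # task name: lowercase alphanumerics or '-'
  namespace: demo-namespace  # "Namespace" parameter
spec:
  template:
    spec:
      containers:
        - name: demo-task
          image: registry.example.com/demo:latest   # "Image" parameter
          resources:
            requests:
              cpu: "1"        # "Min CPU"
              memory: 512Mi   # "Min Memory"
          env:
            - name: MY_PARAM  # custom parameters surface as environment variables
              value: "demo"
      restartPolicy: Never
```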

## Task Example

### Configure the K8S Environment in DolphinScheduler

If you are using the K8S task type in a production environment, a K8S cluster environment is required.

### Configure K8S Nodes

Configure the required content according to the parameter descriptions above.

![K8S](../../../../img/tasks/demo/kubernetes-task-en.png)

## Note

The task name may contain only lowercase alphanumeric characters and '-'.
@@ -0,0 +1,151 @@

# MLflow Node

## Overview

[MLflow](https://mlflow.org) is an excellent open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.

The MLflow task plugin is used to execute MLflow tasks. It currently covers MLflow Projects and MLflow Models (Model Registry support is coming soon).

- MLflow Projects: Package data science code in a format to reproduce runs on any platform.
- MLflow Models: Deploy machine learning models in diverse serving environments.
- Model Registry: Store, annotate, discover, and manage models in a central repository.

The MLflow plugin currently supports, or will support, the following:

- [x] MLflow Projects
  - [x] BasicAlgorithm: includes Logistic Regression, SVM, LightGBM, and XGBoost
  - [x] AutoML: AutoML tools, including autosklearn and flaml
  - [x] Custom projects: Support for running your own MLflow projects
- [ ] MLflow Models
  - [x] MLFLOW: Use `MLflow models serve` to deploy a model service
  - [x] Docker: Run the container after packaging the docker image
  - [x] Docker Compose: Use docker compose to run the container; it will replace the docker run option above
  - [ ] Seldon Core: Use Seldon Core to deploy models to a k8s cluster
  - [ ] k8s: Deploy containers directly to K8S
  - [ ] MLflow deployments: Built-in deployment modules, such as built-in deployment to SageMaker
- [ ] Model Registry
  - [ ] Register Model: Allows artifacts (including the model and related parameters and metrics) to be registered directly into the model center

## Create Task

- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag <img src="../../../../img/tasks/icons/mlflow.png" width="15"/> from the toolbar to the canvas.

## Task Parameters and Example

| **Parameter** | **Description** |
| ------------- | --------------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it does not need to be executed, turn on the prohibit execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, tasks are executed in order from high to low priority; tasks with the same priority are executed first-in, first-out. |
| Worker group | The task is assigned to the machines in the worker group for execution. If `Default` is selected, a worker machine is chosen at random. |
| Task group name | The task group in Resources; if not configured, it is not used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. Supports drop-down selection and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task after it fails. Supports drop-down selection and manual filling. |
| Timeout alarm | Check the timeout alarm and timeout failure. When the task runs longer than the "timeout duration", an alarm email is sent and the task execution fails. |
| Predecessor task | Selecting a predecessor task for the current task sets the selected task as upstream of the current task. |
| MLflow Tracking Server URI | MLflow Tracking Server URI, `http://localhost:5000` by default. |
| Experiment Name | The experiment in which the task runs; it is created if it does not exist. If the name is empty, it is set to `Default`, the same as MLflow. |

### MLflow Projects

#### BasicAlgorithm

![mlflow-basic-algorithm](../../../../img/tasks/demo/mlflow-basic-algorithm.png)

**Task Parameters**

| **Parameter** | **Description** |
| ------------- | --------------- |
| Register Model | Whether to register the model. If selected, the following parameters are expanded. |
| Model Name | The registered model name; a new version is added to this model and registered as Production. |
| Data Path | The absolute path of a file or folder. A file must end with `.csv`; a folder must contain `train.csv` and `test.csv` (in the suggested setup, users should build their own test sets for model evaluation). |
| Parameters | Parameters used when initializing the algorithm/AutoML model; can be empty. For example, `"time_budget=30;estimator_list=['lgbm']"` for flaml. By convention, parameters are separated by `;`: the name before the equals sign is the parameter name, and the value after the equals sign is obtained through Python `eval()` (see the parsing sketch after this table). <ul><li>[Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)</li><li>[SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html?highlight=svc#sklearn.svm.SVC)</li><li>[lightgbm](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier)</li><li>[xgboost](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier)</li></ul> |
| Algorithm | The selected algorithm; currently supports `LR`, `SVM`, `LightGBM`, and `XGboost` based on [scikit-learn](https://scikit-learn.org/) form. |
| Parameter Search Space | The parameter search space used when running the corresponding algorithm; can be empty. For example, `max_depth=[5, 10];n_estimators=[100, 200]` for lightgbm. The same `;`-separated convention applies: the name before the equals sign is the parameter name, and the value after the equals sign is obtained through Python `eval()`. |
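
To make the `;`-separated convention concrete, here is a small sketch (not the plugin's actual code) of how a parameter string of this form can be split and evaluated:

```python
def parse_params(param_str: str) -> dict:
    """Parse 'a=1;b=[5, 10]' style strings: the text before '=' is the name,
    the text after '=' is evaluated with Python eval()."""
    params = {}
    for item in param_str.split(";"):
        if not item.strip():
            continue
        name, value = item.split("=", 1)
        params[name.strip()] = eval(value)  # e.g. "[5, 10]" -> [5, 10]
    return params


print(parse_params("time_budget=30;estimator_list=['lgbm']"))
# {'time_budget': 30, 'estimator_list': ['lgbm']}
```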

#### AutoML

![mlflow-automl](../../../../img/tasks/demo/mlflow-automl.png)

**Task Parameters**

| **Parameter** | **Description** |
| ------------- | --------------- |
| Register Model | Whether to register the model. If selected, the following parameters are expanded. |
| Model Name | The registered model name; a new version is added to this model and registered as Production. |
| Data Path | The absolute path of a file or folder. A file must end with `.csv`; a folder must contain `train.csv` and `test.csv` (in the suggested setup, users should build their own test sets for model evaluation). |
| Parameters | Parameters used when initializing the AutoML model; can be empty. For example, `n_estimators=200;learning_rate=0.2` for flaml. The same `;`-separated convention applies: the name before the equals sign is the parameter name, and the value after the equals sign is obtained through Python `eval()`. The detailed parameter lists are as follows: <ul><li>[flaml](https://microsoft.github.io/FLAML/docs/reference/automl#automl-objects)</li><li>[autosklearn](https://automl.github.io/auto-sklearn/master/api.html)</li></ul> |
| AutoML tool | The AutoML tool used; currently supports [autosklearn](https://github.com/automl/auto-sklearn) and [flaml](https://github.com/microsoft/FLAML). |

#### Custom projects

![mlflow-custom-project](../../../../img/tasks/demo/mlflow-custom-project.png)

**Task Parameters**

| **Parameter** | **Description** |
| ------------- | --------------- |
| Parameters | The `--param-list` of `mlflow run`, for example `-P learning_rate=0.2 -P colsample_bytree=0.8 -P subsample=0.9`. |
| Repository | Repository URL of the MLflow project; supports git addresses and directories on the worker. If the project is in a subdirectory, append `#` and the path (same as `mlflow run`), for example `https://github.com/mlflow/mlflow#examples/xgboost/xgboost_native`. |
| Project Version | Version of the project, `master` by default. |

You can now use this feature to run all MLflow projects on GitHub (for example, the [MLflow examples](https://github.com/mlflow/mlflow/tree/master/examples)). You can also create your own machine learning library to reuse your work, and then use DolphinScheduler to run it with one click.
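
A custom project task is roughly equivalent to an `mlflow run` invocation like the one below (illustrative only; the repository and parameter values are the examples from the table above):

```sh
mlflow run https://github.com/mlflow/mlflow#examples/xgboost/xgboost_native \
    -P learning_rate=0.2 -P colsample_bytree=0.8 -P subsample=0.9 \
    --version master
```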

### MLflow Models

**General Parameters**

| **Parameter** | **Description** |
| ------------- | --------------- |
| Model-URI | The model URI of MLflow; supports the `models:/<model_name>/suffix` format and the `runs:/` format. See https://mlflow.org/docs/latest/tracking.html#artifact-stores (an example command follows this table). |
| Port | The port to listen on. |
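
For the MLFLOW deployment mode, which uses `MLflow models serve` (see the overview), serving a registered model with these parameters looks roughly like the command below; the model name and port are placeholders.

```sh
mlflow models serve -m models:/demo_model/Production -p 7000 -h 0.0.0.0
```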

#### MLflow

![mlflow-models-mlflow](../../../../img/tasks/demo/mlflow-models-mlflow.png)

#### Docker

![mlflow-models-docker](../../../../img/tasks/demo/mlflow-models-docker.png)

#### Docker Compose

![mlflow-models-docker-compose](../../../../img/tasks/demo/mlflow-models-docker-compose.png)

## Environment to Prepare

### Conda Environment

Log in with the admin account to configure a conda environment variable (install [anaconda](https://docs.continuum.io/anaconda/install/) or [miniconda](https://docs.conda.io/en/latest/miniconda.html#installing) in advance).

![mlflow-conda-env](../../../../img/tasks/demo/mlflow-conda-env.png)

Note: when configuring the task, select the conda environment created above; otherwise, the program cannot find the conda environment.

![mlflow-set-conda-env](../../../../img/tasks/demo/mlflow-set-conda-env.png)

### Start the MLflow Service

Make sure you have installed MLflow, using `pip install mlflow`.

Create a folder where you want to save your experiments and models, and start the MLflow service:

```sh
mkdir mlflow
cd mlflow
mlflow server -h 0.0.0.0 -p 5000 --serve-artifacts --backend-store-uri sqlite:///mlflow.db
```

After running, an MLflow service is started.

You can then visit the MLflow service page (`http://localhost:5000`) to view the experiments and models.

![mlflow-server](../../../../img/tasks/demo/mlflow-server.png)
@@ -0,0 +1,65 @@

# OpenMLDB Node

## Overview

[OpenMLDB](https://openmldb.ai/) is an excellent open source machine learning database, providing a full-stack FeatureOps solution for production.

The OpenMLDB task plugin is used to execute tasks on an OpenMLDB cluster.

## Create Task

- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag <img src="../../../../img/tasks/icons/openmldb.png" width="15"/> from the toolbar to the canvas.

## Task Parameters

| **Parameter** | **Description** |
| ------------- | --------------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it does not need to be executed, turn on the prohibit execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, tasks are executed in order from high to low priority; tasks with the same priority are executed first-in, first-out. |
| Worker group | The task is assigned to the machines in the worker group for execution. If `Default` is selected, a worker machine is chosen at random. |
| Task group name | The task group in Resources; if not configured, it is not used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. Supports drop-down selection and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task after it fails. Supports drop-down selection and manual filling. |
| Timeout alarm | Check the timeout alarm and timeout failure. When the task runs longer than the "timeout duration", an alarm email is sent and the task execution fails. |
| Predecessor task | Selecting a predecessor task for the current task sets the selected task as upstream of the current task. |
| zookeeper | The OpenMLDB cluster zookeeper address, e.g. `127.0.0.1:2181`. |
| zookeeper path | The OpenMLDB cluster zookeeper path, e.g. `/openmldb`. |
| Execute Mode | Determines the initial execution mode, offline or online. You can switch it within the SQL statement. |
| SQL statement | The SQL statement to execute. |
| Custom parameters | User-defined parameters that replace the \${variable} placeholders in the script. |

## Task Examples

### Load data

![load data](../../../../img/tasks/demo/openmldb-load-data.png)

We use `LOAD DATA` to load data into the OpenMLDB cluster. We select `offline` here, so the data is loaded into offline storage.
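
A hedged sketch of what the SQL for this step can look like; the database, table, file path, and options are placeholders, and the supported `OPTIONS` depend on your OpenMLDB version.

```sql
-- Run in offline mode so the data lands in offline storage.
SET @@execute_mode='offline';
USE demo_db;
LOAD DATA INFILE '/tmp/train_sample.csv'
INTO TABLE t1 OPTIONS (format='csv', header=true, mode='append');
```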

### Feature extraction

![fe](../../../../img/tasks/demo/openmldb-feature-extraction.png)

We use `SELECT INTO` to do feature extraction. We select `offline` here, so the SQL runs on the offline engine.
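
A corresponding sketch for the feature-extraction step; the table, columns, window definition, and output path are placeholders.

```sql
-- Offline feature extraction, writing the result set out as files.
SET @@execute_mode='offline';
USE demo_db;
SELECT id, COUNT(amount) OVER w AS cnt_amount
FROM t1
WINDOW w AS (PARTITION BY id ORDER BY ts ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW)
INTO OUTFILE '/tmp/feature_data' OPTIONS (mode='overwrite');
```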

### Environment to Prepare

#### Start the OpenMLDB Cluster

You should create an OpenMLDB cluster first. For production environments, see [deploy OpenMLDB](https://openmldb.ai/docs/en/v0.5/deploy/install_deploy.html).

You can follow [run OpenMLDB in docker](https://openmldb.ai/docs/zh/v0.5/quickstart/openmldb_quickstart.html#id11) for a quick start.

#### Python Environment

The OpenMLDB task uses the OpenMLDB Python SDK to connect to the OpenMLDB cluster, so a Python environment is required.

We use `python3` by default. You can set `PYTHON_HOME` to use your own custom Python environment.

Make sure the OpenMLDB Python SDK is installed on the host where the worker server runs, using `pip install openmldb`.
@@ -1,19 +1,27 @@

# Pigeon

## Overview

Pigeon is a task type for triggering remote tasks and acquiring logs or status by calling a remote WebSocket service. It is the way DolphinScheduler calls tasks through a remote WebSocket service.

## Create Task

- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
- Drag from the toolbar <img src="../../../../img/pigeon.png" width="20"/> to the canvas to create a new Pigeon task.

## Task Parameters

| **Parameter** | **Description** |
| ------------- | --------------- |
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it does not need to be executed, turn on the prohibit execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, tasks are executed in order from high to low priority; tasks with the same priority are executed first-in, first-out. |
| Worker group | The task is assigned to the machines in the worker group for execution. If `Default` is selected, a worker machine is chosen at random. |
| Task group name | The task group in Resources; if not configured, it is not used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. Supports drop-down selection and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task after it fails. Supports drop-down selection and manual filling. |
| Timeout alarm | Check the timeout alarm and timeout failure. When the task runs longer than the "timeout duration", an alarm email is sent and the task execution fails. |
| Target task name | Target task name of this Pigeon node. |
| Predecessor task | Selecting a predecessor task for the current task sets the selected task as upstream of the current task. |