@ -6,7 +6,7 @@ Before explaining the architecture of the schedule system, let us first understa
**DAG:** Short for Directed Acyclic Graph. Tasks in a workflow are assembled into a directed acyclic graph, which is traversed topologically, starting from the nodes with zero indegree and continuing until no successor nodes remain. For example, see the following picture:
- The Master is mainly responsible for distributing tasks and monitoring the health of the Slaves. It can dynamically balance tasks across the Slaves so that no Slave node is overloaded while others sit idle.
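The topological traversal described above can be sketched with Kahn's algorithm. This is a minimal illustration, not EasyScheduler's actual implementation; the function name `topological_order` and the node labels are hypothetical.

```python
from collections import deque


def topological_order(edges, nodes):
    """Return the nodes in topological order (Kahn's algorithm).

    `edges` is a list of (source, target) pairs; `nodes` lists every node.
    """
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    # Start from the nodes that have no incoming edges (zero indegree).
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        # "Remove" the node: each successor loses one incoming edge.
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    # Any node left unvisited sits on a cycle, so the graph is not a DAG.
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return order
```

A task becomes runnable only once all of its predecessors have been visited, which is why traversal stops naturally when no successor nodes remain.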
@ -125,7 +125,7 @@ Problems in the design of centralized :
- In a decentralized design there is usually no Master/Slave concept: all roles are the same and all nodes have equal status. The global Internet is a typical decentralized distributed system; the failure of any networked node affects only a small range of functionality.
@ -141,13 +141,13 @@ EasyScheduler uses ZooKeeper distributed locks to implement only one Master to e
1. The core process algorithm for obtaining distributed locks is as follows
##### Third, the circular-wait problem caused by insufficient threads
@ -156,7 +156,7 @@ EasyScheduler uses ZooKeeper distributed locks to implement only one Master to e
- If a large DAG contains many nested sub-processes, it can end up in the "stuck" state shown in the following figure:
<p align="center">
<img src="https://analysys.github.io/EasyScheduler/zh_CN/images/lack_thread.png" alt="Thread is not enough to wait for loop" width="50%" />
<img src="https://analysys.github.io/easyscheduler_docs_cn/images/lack_thread.png" alt="Thread is not enough to wait for loop" width="50%" />
</p>
In the figure above, MainFlowThread waits for SubFlowThread1 to finish, SubFlowThread1 waits for SubFlowThread2, SubFlowThread2 waits for SubFlowThread3, and SubFlowThread3 waits for a free thread in the thread pool. The DAG process can therefore never finish, so none of its threads can be released. The parent and child processes are now waiting on each other in a loop. At this point the scheduling cluster is unavailable unless a new Master is started to add threads and break the deadlock.
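The starvation described above can be modelled in a few lines. This is a simplified sketch under stated assumptions: a semaphore stands in for the thread pool's free slots, `run_flow` is a hypothetical name, and a timeout is used only so the example terminates (in the scenario above the wait would never end).

```python
import threading


def run_flow(slots, depth, timeout=0.1):
    """Run a flow that nests `depth` sub-flows, each needing its own slot.

    Returns True if the whole chain completes, False if some flow could not
    get a slot because every slot is held by a waiting ancestor.
    """
    # Each flow needs one pool slot for as long as it runs.
    if not slots.acquire(timeout=timeout):
        return False  # no free slot: the parents above are stuck waiting
    try:
        if depth == 0:
            return True  # innermost flow has no child to wait for
        # The parent keeps its slot while waiting for the child to finish.
        return run_flow(slots, depth - 1, timeout)
    finally:
        slots.release()


# Main flow plus three nested sub-flows needs four slots in total.
starved = run_flow(threading.Semaphore(3), depth=3)
completed = run_flow(threading.Semaphore(4), depth=3)
```

With three slots the innermost sub-flow never gets one, so `starved` is `False`; with four slots the chain completes. This is why adding threads, for example by starting another Master, breaks the loop.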
@ -180,7 +180,7 @@ Fault tolerance is divided into service fault tolerance and task retry. Service
Service fault tolerance design relies on ZooKeeper's Watcher mechanism. The implementation principle is as follows:
The Master monitors the directories of other Masters and Workers. If a remove event is detected, process-instance or task-instance fault tolerance is performed, depending on the specific business logic.
@ -190,7 +190,7 @@ The Master monitors the directories of other Masters and Workers. If the remove
After ZooKeeper Master fault tolerance completes, scheduling is taken over again by the Scheduler thread in EasyScheduler. It traverses the DAG to find the "Running" and "Submit Successful" tasks. For a "Running" task, it monitors the status of the task instance. For a "Submit Successful" task, it checks whether the task already exists in the Task Queue: if it does, it likewise monitors the status of the task instance; if it does not, it resubmits the task instance.
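The recovery rule above amounts to a small decision table. The sketch below is illustrative only: the state strings, the `recovery_action` and `plan_recovery` names, and the returned action labels are assumptions, not EasyScheduler identifiers.

```python
def recovery_action(task_state, in_task_queue):
    """Decide what the Scheduler thread does with one task after failover."""
    if task_state == "RUNNING":
        # Already executing somewhere: just watch the task instance.
        return "monitor"
    if task_state == "SUBMIT_SUCCESS":
        # Submitted but maybe never queued: resubmit only if it is missing.
        return "monitor" if in_task_queue else "resubmit"
    # Other states are not covered by the rule described above.
    return "skip"


def plan_recovery(task_states, task_queue):
    """Map every task in the traversed DAG to its recovery action."""
    return {
        name: recovery_action(state, name in task_queue)
        for name, state in task_states.items()
    }


plan = plan_recovery(
    {"extract": "RUNNING", "load": "SUBMIT_SUCCESS", "report": "SUBMIT_SUCCESS"},
    task_queue={"load"},
)
```

Here `extract` and `load` are monitored, while `report` is resubmitted because it is not found in the queue.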
@ -200,7 +200,7 @@ After the ZooKeeper Master is fault-tolerant, it is rescheduled by the Scheduler
Once the Master Scheduler thread finds a task instance in the "need to be fault tolerant" state, it takes over the task and resubmits it.
@ -239,13 +239,13 @@ In the early scheduling design, if there is no priority design and fair scheduli
- Process definition priority: some processes need to be handled before others. The priority can be configured when a process is started manually or when it is started on a schedule. There are 5 levels, from highest to lowest: HIGHEST, HIGH, MEDIUM, LOW, and LOWEST. As shown below:
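One way to picture the five levels is as keys in a priority queue. The sketch below is an assumption-laden illustration: the numeric values, the `ProcessQueue` class, and the first-in-first-out tie-breaking are hypothetical and are not taken from EasyScheduler.

```python
import heapq
from enum import IntEnum
from itertools import count


class Priority(IntEnum):
    # Smaller value means "handled earlier"; the numbers are an assumption.
    HIGHEST = 0
    HIGH = 1
    MEDIUM = 2
    LOW = 3
    LOWEST = 4


class ProcessQueue:
    """Pops the highest-priority process first, FIFO within one level."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # insertion counter used to break ties

    def put(self, name, priority):
        heapq.heappush(self._heap, (int(priority), next(self._seq), name))

    def get(self):
        return heapq.heappop(self._heap)[2]


queue = ProcessQueue()
queue.put("report", Priority.LOW)
queue.put("billing", Priority.HIGHEST)
queue.put("etl", Priority.MEDIUM)
queue.put("audit", Priority.HIGHEST)
drained = [queue.get() for _ in range(4)]
```

The two HIGHEST processes come out first in submission order, followed by the MEDIUM and LOW ones.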
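After ZooKeeper Master fault tolerance completes, scheduling is taken over again by the Scheduler thread in EasyScheduler. It traverses the DAG to find the "Running" and "Submit Successful" tasks. For a "Running" task, it monitors the status of the task instance. For a "Submit Successful" task, it checks whether the task already exists in the Task Queue: if it does, it likewise monitors the status of the task instance; if it does not, it resubmits the task instance.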