EasyScheduler Proposal

Abstract

EasyScheduler is a distributed ETL scheduling engine with a powerful DAG visualization interface. EasyScheduler focuses on solving the problem of "complex task dependencies & triggers" in data processing. Just like its name, we are dedicated to making the scheduling system work out of the box.

Proposal

EasyScheduler provides many easy-to-use features to improve engineers' efficiency on data ETL workflow jobs. We propose the new concepts of "instance of process" and "instance of task", which let developers tune their jobs on the running state of a workflow instead of changing the task's template. The EasyScheduler DAG UI also lets data engineers easily add or change an ETL job in a complex scheduling system.

Its main objectives are as follows:

- Define complex task dependencies & triggers in a DAG by dragging and dropping. The DAG can also show the running state of tasks in real time.
- Define process result dependencies in addition to DAG dependencies, e.g. a weekly job depends on the success of the last 7 daily jobs.
- Support cluster HA, implementing distributed Master clusters and Worker clusters based on ZooKeeper services.
- Support multi-tenancy.
- Support automatic retry of failed jobs and recovery from specified nodes according to predefined rules.
- Support parallel or serial data backfilling instead of duplicated maintenance work.
- Support many data task types: Shell, MapReduce, Spark, SQL (mysql, postgresql, hive, sparksql), Python, Sub_Process, Stored Procedure, etc.
- Support displaying process running history as a tree or Gantt chart, along with task status statistics and process status statistics.
- Support scheduled and manual process runs, as well as manually pausing, stopping, recovering, and killing tasks at run time.
- Support defining process priority, task priority, and the related task timeout alarm.
- Support process global parameters and node-specific parameter settings.
- Support uploading, downloading, and managing resource files online, as well as creating and editing files online.
- Support online viewing and scrolling of task logs, online log download, etc.
- Support online viewing of Master/Worker CPU load, memory, etc.

EasyScheduler now has a fairly large community in China. It is also widely adopted by many companies and organizations as their ETL scheduling tool.

We believe that bringing EasyScheduler into the ASF could advance the development of a much stronger and more diverse open source community.

Analysys submits this proposal to donate EasyScheduler's source code and all related documentation to the Apache Software Foundation. The code is already under the Apache License, Version 2.0.

- Code base: https://www.github.com/analysys/easyscheduler
- Documentation: https://analysys.github.io/easyscheduler_docs

Background

We wanted to find a data processing tool with the following features:

- Easy to use: developers can build an ETL process with a very simple drag-and-drop operation. It is not only for ETL developers; people who can't write ETL code, such as system administrators, can also use the tool for ETL operations.
- Support HA and automatically switch master and worker nodes to keep the whole system stable.
- Solve the problem of "complex task dependencies" and monitor the ETL running status.
- Support multi-tenancy.
- Support many task types: Shell, MR, Spark, SQL (mysql, postgresql, hive, sparksql), Python, Sub_Process, Procedure, etc.
- Linear scalability.

We realized that no existing product met these requirements, so we decided to develop this tool ourselves. We designed EasyScheduler at the end of 2017. The first version for internal use was completed in May 2018. We then iterated through several internal versions and the system gradually stabilized.

We open sourced EasyScheduler in March 2019. It soon gained the interest of many ETL developers and many stars on GitHub. It then won the GVP (Gitee Most Valuable Project) award in April 2019, and one of our key members was invited to speak at GAIC Summit 2019 in June 2019.

Rationale

Many organizations (>30) (refer to Who is using EasyScheduler) already benefit from running EasyScheduler to make data processing pipelines easier. More than 100 feature ideas have come from the EasyScheduler community. Some third-party projects, such as Scriptis and waterdrop, also want to integrate with EasyScheduler through task plugins. These integrations will strengthen the features of EasyScheduler.

Current Status

Meritocracy

EasyScheduler was incubated at Analysys in 2017 and open sourced on GitHub in March 2019. Once open sourced, it was quickly adopted by multiple organizations. EasyScheduler has contributors and users from many companies; we have set up a PMC team and a Committer team. New contributors are guided and reviewed by existing PMC members. When they are ready, the PMC will start a vote to promote them to the PMC or the Committer team. Contributions are always welcomed and highly valued.

Community

We now have dedicated development teams for EasyScheduler at Analysys, and we already have external developers who have contributed code. We already have a user group of more than 1,000 people. We hope to grow the base of contributors by inviting all those who offer contributions through The Apache Way. Right now, we use GitHub for code hosting and Gitter for community communication.

Core Developers

The core developers, including experienced open source developers and team leaders, form a diverse group. All of them have deep expertise in workflow processing and the Hadoop ecosystem in general.

Known Risks

Orphaned products

EasyScheduler is widely adopted in China by many companies and organizations. The core developers of the EasyScheduler team plan to work full time on this project. Currently there are 10 use cases with more than 1,000 activity tasks per day using EasyScheduler in users' production environments. Furthermore, EasyScheduler has received more than 1,500 stars and been forked more than 500 times.
EasyScheduler has had eight major releases so far and has received 365 pull requests from contributors, which further demonstrates that EasyScheduler is a very active project. We plan to extend and diversify this community further through Apache.

Thus, it is very unlikely that EasyScheduler will become orphaned.

Inexperience with Open Source

The core developers are all active users and followers of open source. They are already committers and contributors to the EasyScheduler GitHub project. All have been involved with source code that has been released under an open source license, and several of them also have experience developing code in an open source environment; they are also active in Presto, Alluxio and other projects.

Therefore, we believe we have enough experience to deal with open source.

Homogenous Developers

The current developers work across a variety of organizations including Analysys, guandata and hydee; some individual developers have been accepted as developers of EasyScheduler as well.

Considering that fengjr and sefonsoft have shown great interest in EasyScheduler, we plan to encourage them to contribute and invite them as contributors to work together.

Reliance on Salaried Developers

At present, eight of the core developers are paid by their employers to contribute to the EasyScheduler project. We have also found other developers and researchers (>8) to contribute to the project, and we will make efforts to increase the diversity of the contributors and actively lobby for domain experts in the workflow space to contribute.

Relationships with Other Apache Products

EasyScheduler integrates Apache ZooKeeper as one of its service registration/discovery mechanisms. EasyScheduler is deeply integrated with Apache products. It currently supports many task types, such as Apache Hive, Apache Spark, Apache Hadoop, and so on.

An Excessive Fascination with the Apache Brand

We recognize the value and reputation that the Apache brand will bring to EasyScheduler.
However, what we value more is that the community provided by the Apache Software Foundation will enable the project to achieve long-term stable development. EasyScheduler is therefore proposing to enter incubation at Apache in order to help efforts to diversify the community, not so much to capitalize on the Apache brand.

Documentation

A complete set of EasyScheduler documentation is provided in both English and Simplified Chinese.

Initial Source

The project consists of two distinct codebases: core and documentation. The addresses of the two existing git repositories are as follows:

- https://github.com/analysys/easyscheduler
- https://github.com/analysys/easyscheduler_docs

Source and Intellectual Property Submission Plan

As soon as EasyScheduler is approved to join the Apache Incubator, Analysys will execute a Software Grant Agreement and the source code will be transitioned onto ASF infrastructure. The code is already licensed under the Apache License, Version 2.0.

External Dependencies

As all backend code dependencies are managed using Apache Maven, none of the external libraries need to be packaged in a source distribution.

- spring-tx-5.1.5.RELEASE.jar: Apache V2.0
- spring-web-5.1.5.RELEASE.jar: Apache V2.0
- spring-webmvc-5.1.5.RELEASE.jar: Apache V2.0
- stringtemplate-3.2.1.jar: BSD
- swagger-annotations-1.5.20.jar: Apache V2.0
- swagger-bootstrap-ui-1.9.3.jar: Apache V2.0
- swagger-models-1.5.20.jar: Apache V2.0
- tephra-api-0.6.0.jar: Apache V2.0
- tephra-core-0.6.0.jar: Apache V2.0
- tephra-hbase-compat-1.0-0.6.0.jar: Apache V2.0
- threetenbp-1.3.6.jar: BSD 3-clause
- transaction-api-1.1.jar: CDDL 1.0
- twill-api-0.6.0-incubating.jar: Apache V2.0
- twill-common-0.6.0-incubating.jar: Apache V2.0
- twill-core-0.6.0-incubating.jar: Apache V2.0
- twill-discovery-api-0.6.0-incubating.jar: Apache V2.0
- twill-discovery-core-0.6.0-incubating.jar: Apache V2.0
- twill-zookeeper-0.6.0-incubating.jar: Apache V2.0
- validation-api-2.0.1.Final.jar: Apache V2.0
- xercesImpl-2.9.1.jar: Apache V2.0
- xml-apis-1.4.01.jar: Apache V2.0, W3C
- xz-1.0.jar: Public Domain
- zookeeper-3.4.8.jar: Apache

The front-end UI currently relies on many components, which we will list separately.

https://github.com/analysys/easyscheduler_docs.git

Issue Tracking

The community would like to continue using GitHub Issues.

Continuous Integration tool

Travis (TODO)

Mailing Lists

- EasyScheduler-dev: for development discussions
- EasyScheduler-private: for PPMC discussions
- EasyScheduler-notifications: for user notifications

Initial Committers

- William-GuoWei
- Lidong Dai
- Zhanwei Qiao
- Liang Bao
- Gang Li
- Zijian Gong
- Jun Gao
- Baoqi Wu

Affiliations

- Analysys: William-GuoWei, Zhanwei Qiao, Liang Bao, Gang Li, Jun Gao, Lidong Dai
- Hydee: Zijian Gong
- Guandata: Baoqi Wu

Sponsors

Champion

- Sheng Wu (Apache Software Foundation Member, wusheng@apache.org)

Mentors

- Sheng Wu (Apache Software Foundation Member, wusheng@apache.org)
- ShaoFeng Shi (Apache Kylin committer & PMC, Apache Incubator PMC, shaofengshi@apache.org)
- Liang Chen (Apache Software Foundation Member, chenliang613@apache.org)

Sponsoring Entity

We expect the Apache Incubator to sponsor this project.