Oozie

Oozie is a server-based workflow engine specialized in running workflow jobs whose actions run Hadoop Map/Reduce and Pig jobs.

Oozie is a Java web application that runs in a Java servlet container.

For the purposes of Oozie, a workflow is a collection of actions (e.g. Hadoop Map/Reduce jobs, Pig jobs) arranged in a control dependency DAG (Directed Acyclic Graph). A "control dependency" from one action to another means that the second action can't run until the first action has completed.

Oozie workflow definitions are written in hPDL (an XML Process Definition Language similar to JBoss jBPM jPDL).

Oozie workflow actions start jobs in remote systems (e.g. Hadoop, Pig). When an action completes, the remote system calls back Oozie to report the completion. At that point Oozie proceeds to the next action in the workflow.

Oozie workflows contain two kinds of nodes:
1. control flow nodes
2. action nodes

Control flow nodes define the beginning and the end of a workflow (start, end and fail nodes) and provide a mechanism to control the workflow execution path (decision, fork and join nodes).
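The control flow nodes described above can be sketched in hPDL roughly as follows. This is a minimal illustration, not a production workflow: the node names, the `inputDir` and `outputDir` parameters (assumed to be full HDFS URIs), and the choice of simple `fs` actions as branch bodies are all assumptions made for the example. Note that in hPDL the "fail" node is declared with the `kill` element.

```xml
<!-- Hypothetical skeleton: all names and paths are illustrative. -->
<workflow-app name="control-flow-sketch" xmlns="uri:oozie:workflow:0.5">
    <!-- start: entry point of the workflow -->
    <start to="check-input"/>

    <!-- decision: pick an execution path from an EL predicate -->
    <decision name="check-input">
        <switch>
            <case to="split">${fs:exists(inputDir)}</case>
            <default to="end"/>
        </switch>
    </decision>

    <!-- fork: run two branches in parallel -->
    <fork name="split">
        <path start="branch-a"/>
        <path start="branch-b"/>
    </fork>

    <action name="branch-a">
        <fs>
            <mkdir path="${outputDir}/a"/>
        </fs>
        <ok to="merge"/>
        <error to="fail"/>
    </action>

    <action name="branch-b">
        <fs>
            <mkdir path="${outputDir}/b"/>
        </fs>
        <ok to="merge"/>
        <error to="fail"/>
    </action>

    <!-- join: wait for both branches before continuing -->
    <join name="merge" to="end"/>

    <!-- kill: ends the workflow job in error -->
    <kill name="fail">
        <message>Failed at node ${wf:lastErrorNode()}</message>
    </kill>

    <!-- end: successful completion -->
    <end name="end"/>
</workflow-app>
```

Every `fork` needs a matching `join`, and each action declares both an `ok` and an `error` transition, which is how the control dependencies of the DAG are expressed.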

Action nodes are the mechanism by which a workflow triggers the execution of a computation/processing task. Oozie provides support for different types of actions: Hadoop map-reduce, Hadoop file system, Pig, SSH, HTTP, email and Oozie sub-workflow. Oozie can be extended to support additional types of actions.
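As an illustration of an action node, here is a sketch of a Pig action as it might appear inside a `workflow-app`. The action name, the script file `script.pig`, the `INPUT` parameter, and the transition targets `next-step` and `fail` are hypothetical. `jobTracker` and `nameNode` are assumed to be supplied as job parameters.

```xml
<!-- Hypothetical Pig action; it would sit inside a <workflow-app> element. -->
<action name="pig-step">
    <pig>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- Pig script stored in the workflow application directory -->
        <script>script.pig</script>
        <!-- Value passed to the script as a Pig parameter -->
        <param>INPUT=${inputDir}</param>
    </pig>
    <!-- Where to go when the Pig job succeeds or fails -->
    <ok to="next-step"/>
    <error to="fail"/>
</action>
```

The other action types follow the same outer shape: an `action` element wrapping one type-specific element, followed by the `ok` and `error` transitions.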

Oozie workflows can be parameterized (using variables like ${inputDir} within the workflow definition). When submitting a workflow job, values for the parameters must be provided. If properly parameterized (e.g. using different output directories), several identical workflow jobs can run concurrently.
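A parameterized workflow might look like the sketch below. Only `${inputDir}` comes from the text above. The `outputDir`, `jobTracker` and `nameNode` parameters, the workflow and node names, and the mapper class `org.example.SampleMapper` are assumptions for illustration. The values are expected to come from the job configuration given at submission time.

```xml
<!-- Hypothetical parameterized workflow; names and classes are illustrative. -->
<workflow-app name="param-sketch" xmlns="uri:oozie:workflow:0.5">
    <start to="mr-step"/>

    <action name="mr-step">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- Remove old output so the job can be rerun safely -->
            <prepare>
                <delete path="${outputDir}"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>org.example.SampleMapper</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Map/Reduce step failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Submitting this definition twice with different `outputDir` values gives two jobs that do not write over each other's results, which is what makes concurrent runs safe.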

Actions in Oozie
Email Action
Shell Action
Hive Action
Hive 2 Action
Sqoop Action
Ssh Action
DistCp Action
Spark Action
Writing a Custom Action Executor

Data Pipeline Application

Commonly, multiple workflow applications are chained together to form a more complex application.

The output of multiple workflow jobs of a single workflow application is then consumed by a single workflow job of another workflow application. These workflow jobs are triggered by the recurrent actions of coordinator applications.
This set of interdependent coordinator applications is referred to as a data pipeline application.

Coordinator Action Execution Policies
Timeout, Concurrency, Throttle
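These policies are set in the `controls` element of a coordinator application. The sketch below shows where they go. The application name, the start and end dates, the numeric values and the `workflowAppPath` parameter are assumptions chosen for illustration, not recommended settings.

```xml
<!-- Hypothetical coordinator; the values are examples, not tuning advice. -->
<coordinator-app name="policy-sketch" frequency="${coord:days(1)}"
                 start="2016-04-10T06:00Z" end="2016-04-20T06:00Z"
                 timezone="GMT" xmlns="uri:oozie:coordinator:0.4">
    <controls>
        <!-- Minutes an action may wait for its inputs before timing out -->
        <timeout>60</timeout>
        <!-- Maximum number of actions of this job running at the same time -->
        <concurrency>1</concurrency>
        <!-- Order in which ready actions are picked up -->
        <execution>FIFO</execution>
        <!-- Maximum number of actions allowed to be waiting at once -->
        <throttle>2</throttle>
    </controls>
    <action>
        <workflow>
            <app-path>${workflowAppPath}</app-path>
        </workflow>
    </action>
</coordinator-app>
```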

<dataset name="inputdataset" frequency="${coord:days(1)}" initial-instance="2016-04-10T06:00Z" timezone="GMT">
    <uri-template>${nameNode}/user/aks/aaa/${YEAR}${MONTH}${DAY}</uri-template>
    <done-flag>_SUCCESS</done-flag>
</dataset>
<input-events>
    <data-in name="inputevent" dataset="inputdataset">
        <instance>${coord:current(0)}</instance>
    </data-in>
</input-events>

Here we define an input dataset and refer to it in an input event. Oozie EL functions such as ${coord:current(0)} specify which instance of the dataset to wait for. The coordinator action waits until that directory is available and contains the _SUCCESS flag file.
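To show where these fragments sit, here is a sketch of a complete coordinator application built around the same dataset and input event. The coordinator name, the end date, the `workflowAppPath` parameter and the `inputDir` property handed to the workflow are assumptions added for the example.

```xml
<!-- Hypothetical coordinator wrapping the dataset and input event above. -->
<coordinator-app name="daily-input-sketch" frequency="${coord:days(1)}"
                 start="2016-04-10T06:00Z" end="2016-04-20T06:00Z"
                 timezone="GMT" xmlns="uri:oozie:coordinator:0.4">
    <datasets>
        <dataset name="inputdataset" frequency="${coord:days(1)}"
                 initial-instance="2016-04-10T06:00Z" timezone="GMT">
            <uri-template>${nameNode}/user/aks/aaa/${YEAR}${MONTH}${DAY}</uri-template>
            <done-flag>_SUCCESS</done-flag>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="inputevent" dataset="inputdataset">
            <!-- current(0): the dataset instance for this action's nominal time -->
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>${workflowAppPath}</app-path>
            <configuration>
                <!-- Pass the resolved input directory to the workflow -->
                <property>
                    <name>inputDir</name>
                    <value>${coord:dataIn('inputevent')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
```

`coord:dataIn('inputevent')` resolves to the URI of the dataset instance the action waited for, so the workflow receives the concrete directory as its `inputDir` parameter.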
