FATA #14/ Big Data in Workflow

Nazar Khimin
3 min read · Apr 1, 2022

[FATA] — From test automation to architecture article series

A Recurrent Problem

Hadoop, Pig, Hive, and many other projects provide the foundation for storing and processing large amounts of data efficiently. Most of the time, it is not possible to perform all required processing with a single MapReduce, Pig, or Hive job. Jobs are often chained together, producing and consuming intermediate data and coordinating their flow of execution.
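To make the problem concrete, here is a minimal Python sketch of two hypothetical jobs chained by hand through an intermediate file; the job names and logic are illustrative stand-ins for real MapReduce or Hive stages, not from any real framework.

```python
import json
import os
import tempfile

def extract_job(out_path):
    # Stage 1: a stand-in for a MapReduce/Pig/Hive job that writes
    # intermediate data to shared storage.
    with open(out_path, "w") as f:
        json.dump([1, 2, 3], f)

def aggregate_job(in_path):
    # Stage 2: consumes the intermediate output of stage 1.
    with open(in_path) as f:
        return sum(json.load(f))

# Without a workflow system, the developer coordinates the chain manually:
path = os.path.join(tempfile.mkdtemp(), "intermediate.json")
extract_job(path)
result = aggregate_job(path)
print(result)  # 6
```

Every project ends up writing this kind of glue code, which is exactly the duplication the issues below describe.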

Main issues:

  1. Developers moving from one project to another had to learn the specifics of the custom framework used by the project they were joining.
  2. Significant effort was spent supporting multiple frameworks that accomplished the same task.
  3. Health checks were hard for administrators to maintain and run.
  4. Workflows were not easy to monitor.
  5. Errors were hard to track.

Apache Oozie

Oozie is a general-purpose system for running multistage Hadoop jobs.

Features:

  1. Sends email notifications upon completion of jobs.
  2. Supports running jobs periodically.
  3. Can be controlled from anywhere.

Advantages

  1. Designed to scale in a Hadoop cluster. Each job will be launched from a different DataNode. This means that the workflow load will be balanced and no single machine will become overburdened.
  2. Can handle 1250 job submissions.

Disadvantages

  1. Can be difficult for non-technical users.
  2. The list of allowed actions is limited.
  3. Less flexibility with actions and dependencies.

Oozie does not have a user interface for building workflows.

User interfaces that support Oozie include Hue, Ambari, and Falcon.

Control nodes —

  1. They define job chronology, setting rules for beginning and ending a workflow
  2. Include start, end, kill, fork, decision and join nodes.

Action nodes —

  1. They trigger the execution of tasks.

Flow nodes:

  1. Start
  2. End
  3. Kill
  4. Fork

Oozie is a workflow scheduler system for managing Apache Hadoop jobs. Oozie workflow jobs are Directed Acyclic Graphs (DAGs) of actions.
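As a sketch of such a DAG, a minimal Oozie workflow definition might look like the following (the workflow name, action name, and properties are illustrative):

```xml
<!-- Minimal illustrative Oozie workflow: start -> one action -> end,
     with a kill node for the error path. -->
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="first-job"/>
    <action name="first-job">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>first-job failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

The control nodes (start, kill, end) set the rules for beginning and ending the workflow, while the action node triggers the actual task.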

Apache Airflow is a platform for programmatically authoring, scheduling, and monitoring workflows.

Advantages:

  1. Easy-to-use UI
  2. Independent scheduler
  3. Pure Python
  4. Operators
  5. Functional principles
  6. Open source

Limitations/Downsides:

  1. Data modeling is not intuitive.
  2. Changing the schedule interval requires renaming the DAG.
  3. No native Windows support.
  4. CI/CD: if you deploy on Docker, a CI/CD process that restarts the Docker container will kill any running processes.

Notes:

  1. Every task execution is independent of other executions.
  2. Tasks can be scheduled on different machines.
  3. Tasks do not share a single process, which helps with performance.
  4. Tasks do not communicate with each other while they are being executed.
  5. If Airflow is installed on a single machine, all tasks run on that machine.
  6. If it is installed on a distributed system (multi-node architecture), tasks can run on different machines.
