I just completed my Big Data Processing Using Hadoop class at Johns Hopkins. The class started with a general overview of the Apache Hadoop platform, including HDFS (the Hadoop Distributed File System) and the MapReduce framework. We then dove into Pig, Hive, and a few other components. We completed three decent-sized projects and numerous lab assignments. The labs were designed to introduce us to these topics, while the projects presented real-life scenario problems and showed how this very complicated ecosystem of tools and ideas can be used to solve them. I very much enjoyed the topics and assignments in this class.
We also had an opportunity to choose our own topic (outside of what was being taught in class) and present it in class. I thought this was a really nice opportunity to dive into one of those other Apache projects that complement the Hadoop platform. While working on a few of the assignments, it became clear to me that we really needed a way to organize Hadoop jobs into a workflow and schedule them for execution.
As part of the assignment requirements, we were asked to create shell scripts to start the execution of our Hadoop jobs. Though this was quick and easy to do, it was not very clean, and it was also very limiting. Therefore I picked Apache Oozie to present, and I thought I would share my learning experience with Oozie here with you.
I started my investigation by looking at Apache Oozie's web site. There is a ton of information and documentation available for you there. It can be overwhelming if you are not sure where to begin.
Here is a quick overview of Oozie.
- It is a server-based workflow engine that runs in a Java application server. It is really a web application running on a servlet container.
- Oozie workflows are collections of MapReduce jobs arranged in a control-dependency DAG (Directed Acyclic Graph), which means a job's predecessors must complete before the job can begin, and there can be no cycles or loops in the graph definition.
- Workflow definitions are written in hPDL, an XML-based process definition language (a minimal example is sketched after this list).
- The workflow actions start jobs in remote systems (Hadoop clusters).
- Oozie provides built-in support for several action types: Pig, Java, MapReduce, Hive, and HTTP.
- You can also provide your own custom action types if you like. It is highly extensible.
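To make the hPDL idea a bit more concrete, here is a minimal sketch of what a workflow definition looks like. The workflow name, the action, and the paths are placeholders I made up purely for illustration:

<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="cleanup-output"/>
    <!-- a simple fs action that deletes an old output directory -->
    <action name="cleanup-output">
        <fs>
            <delete path="${nameNode}/user/demo/output"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <!-- the kill node ends the workflow and reports the failure -->
    <kill name="fail">
        <message>Workflow failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>

Every action declares where to go on success (ok) and on failure (error), which is how the control-dependency DAG gets expressed.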
The installation of Apache Oozie was very simple. In class, I was using a pseudo-distributed CDH bundle that I installed on a CentOS 7 VM. There was already a bundle available from yum, so I just installed it with "sudo yum install oozie".
This installed oozie-client, which is a command line tool for interfacing with the Oozie server. It also installed the oozie-server package itself and all of its dependencies, and it nicely registered a control script that you can use to stop and start Oozie with "sudo service oozie start | stop | restart | status".
Oozie uses an RDBMS (Relational Database Management System) to persist metadata about workflow and scheduling definitions, and you can choose from a wide variety of databases. I used an existing PostgreSQL database and configured Oozie to use it. This configuration is pretty simple and does not take very long (a sketch of the relevant properties appears after the commands below). Once the database was configured, the rest of the setup was very easy. I created a new user directory in HDFS and made sure that the oozie user owns this new directory, as seen below.
sudo -u hdfs hadoop fs -mkdir /user/oozie
sudo -u hdfs hadoop fs -chown oozie:oozie /user/oozie
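For reference, the database connection settings live in /etc/oozie/conf/oozie-site.xml. Here is a rough sketch of what mine looked like; the database name, user, and password are placeholders, so substitute your own:

<!-- JDBC settings for the Oozie metadata store (PostgreSQL) -->
<property>
    <name>oozie.service.JPAService.jdbc.driver</name>
    <value>org.postgresql.Driver</value>
</property>
<property>
    <name>oozie.service.JPAService.jdbc.url</name>
    <value>jdbc:postgresql://localhost:5432/oozie</value>
</property>
<property>
    <name>oozie.service.JPAService.jdbc.username</name>
    <value>oozie</value>
</property>
<property>
    <name>oozie.service.JPAService.jdbc.password</name>
    <value>oozie</value>
</property>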
The Oozie installation includes an oozie-sharelib.tar.gz archive in the /usr/lib/oozie/ directory. This archive contains all of the shared libraries Oozie needs. The next step was to extract it and upload it to the directory that was created above.
tar xzf /usr/lib/oozie/oozie-sharelib.tar.gz
sudo -u oozie hadoop fs -put share /user/oozie/share
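To sanity-check the upload, you can list the share directory in HDFS; you should see one subdirectory per supported component (pig, hive, and so on):

sudo -u oozie hadoop fs -ls /user/oozie/share/lib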
Once the configuration was done, I tried to execute a few of the examples they provide. I ran into many issues getting any of the examples to work. After many hours of debugging, I found out that this particular installation of Oozie defaults to a configuration that uses MapReduce 1. When I started the jobs, Oozie was looking for a JobTracker, which did not exist in my environment because my environment was set up to use MapReduce 2 (YARN). I had to make configuration changes to ensure that Oozie could work with YARN. The way I achieved this was to modify CATALINA_BASE in the /etc/oozie/conf/oozie-env.sh script.
export CATALINA_BASE=/usr/lib/oozie/oozie-server
Once I made this change and restarted the Oozie server, the example jobs started executing. I could see them in the Oozie Web Console at http://localhost:11000/oozie.
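For the record, I submitted the examples from the command line with the oozie client. The sketch below assumes the examples archive that ships with Oozie has been extracted and its job.properties adjusted for your cluster; the job id in the second command is a placeholder:

oozie job -oozie http://localhost:11000/oozie -config examples/apps/map-reduce/job.properties -run
oozie job -oozie http://localhost:11000/oozie -info <job-id>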
The next step for me was to take one of the assignments I completed in the course and use Apache Oozie to create a workflow definition so those jobs could execute. Before I dive into the details of how I ended up doing that, let me give you a quick overview of Oozie workflows. Oozie workflow definitions can have action nodes, control nodes, and transitions between these nodes. Control nodes control the start, the execution path, and the ending of the workflow; some of the available control nodes are start, end, decision, fork, join, and kill. Action nodes, on the other hand, are used to trigger the execution of a computation task; they include map-reduce, java, pig, hive, ssh, and http nodes.
You can use these control and action nodes to define your Oozie workflows. Each of these action nodes has its own configuration attributes that you need to provide. For example, with the map-reduce action node you need to provide your mapper and reducer classes, and depending on your job configuration you may also have to provide your input and output paths.
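As an illustration, a map-reduce action configured this way might look roughly like the sketch below. The class names are placeholders I made up, and note that these property names target the old mapred API; jobs written against the new API need different property names plus a flag to enable it:

<action name="wordcount">
    <map-reduce>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <!-- mapper and reducer classes for this step -->
            <property>
                <name>mapred.mapper.class</name>
                <value>com.example.WordCountMapper</value>
            </property>
            <property>
                <name>mapred.reducer.class</name>
                <value>com.example.WordCountReducer</value>
            </property>
            <!-- input and output paths in HDFS -->
            <property>
                <name>mapred.input.dir</name>
                <value>${inputDir}</value>
            </property>
            <property>
                <name>mapred.output.dir</name>
                <value>${outputDir}</value>
            </property>
        </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
</action>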
In our assignments, we used ToolRunner drivers to execute our MapReduce jobs. This allowed us to easily run our jobs on YARN and gave us extra configuration and validation capabilities. I had many validation checks and extra configuration parameters in my drivers, and I did not want to re-write them; I wanted to reuse them in my Oozie workflows as well. So I ended up using the java action node to configure my Oozie workflows instead of the map-reduce action. Though this is not ideal, since Oozie wraps the ToolRunner drivers which then themselves wrap the MapReduce jobs, it worked very nicely. In my workflow definition, I provided the input and output paths and the class configuration for my driver, which Oozie was then able to use to start executing.
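The resulting java action looked roughly like the sketch below. The main class name is a placeholder for my actual driver, which calls ToolRunner internally, and the arguments are passed straight through to it:

<action name="run-driver">
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- the driver's main() hands off to ToolRunner.run() -->
        <main-class>com.example.MyJobDriver</main-class>
        <arg>${inputDir}</arg>
        <arg>${outputDir}</arg>
    </java>
    <ok to="end"/>
    <error to="fail"/>
</action>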
What I learned from all of this is that Oozie is a very capable workflow engine for Hadoop jobs, and it can drive many of the existing tools in the ecosystem. Though I did not get into the details here, you can use coordinator jobs to schedule your workflows, either on a time basis or based on input data availability. I think the capability to wait for input data to arrive before starting jobs is a really nice feature. I will soon be posting my workflow definition here for you to look at and comment on.
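Though I only read about coordinators and did not use them, a definition that combines a time schedule with a data-availability trigger looks roughly like this; the dates, dataset, and paths are placeholders:

<coordinator-app name="daily-run" frequency="${coord:days(1)}"
                 start="2015-01-01T00:00Z" end="2015-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <datasets>
        <!-- one input directory is expected per day -->
        <dataset name="daily-input" frequency="${coord:days(1)}"
                 initial-instance="2015-01-01T00:00Z" timezone="UTC">
            <uri-template>${nameNode}/data/input/${YEAR}/${MONTH}/${DAY}</uri-template>
        </dataset>
    </datasets>
    <input-events>
        <!-- the workflow only starts once the day's input instance exists -->
        <data-in name="input" dataset="daily-input">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>${nameNode}/user/oozie/apps/my-workflow</app-path>
        </workflow>
    </action>
</coordinator-app>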
The one thing I could not get to work in all of this was the distributed cache. In one of my assignments, I had a CSV file that was made available to the jobs through the distributed cache. I read up on it and tried to configure my workflows to have access to this file through the distributed cache, but I kept running into issues with the file not being found. If you have any tips on how to get this working, I'd appreciate it if you leave a comment.
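For context, the kind of configuration I was attempting relied on the action's file element, which as I understand it should ship the file into the distributed cache and symlink it into the task's working directory. The path and symlink name below are placeholders, and since this is exactly the part that did not work for me, treat it as a question rather than a recipe:

<java>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <main-class>com.example.MyJobDriver</main-class>
    <arg>${inputDir}</arg>
    <arg>${outputDir}</arg>
    <!-- the fragment after # is the symlink name the task should see -->
    <file>${nameNode}/user/demo/lookup.csv#lookup.csv</file>
</java>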