Quick Start Hadoop Development Using Cloudera VM

So your company has some Big Data needs and has decided to use Hadoop to process all that data. As a developer, where do you start? You download and install Hadoop from Apache, get going fairly quickly, and begin writing your first MapReduce job. Pretty soon you realize you need a workflow engine like Oozie, and soon after that you think HBase might be a good fit for what you are trying to accomplish, or that you should use Hive instead of writing Java code for MapReduce.

The Hadoop ecosystem has grown quite a bit and manually installing each piece can become frustrating and time consuming. A low barrier alternative to being productive quickly with Hadoop technologies is to use a vendor distribution like the one from Cloudera. Since we use the Cloudera distribution at BlueCanary, the rest of this tutorial will be for using Cloudera’s distribution of Hadoop.

Cloudera has a prebuilt developer VM that includes all the major components and technologies used in the enterprise Hadoop stack: Hadoop, Hive, HBase, Oozie, Impala, and Hue. Explaining each one is out of scope for this post; I will elaborate on each in future posts. Hue, however, is worth a quick mention: it is a web-based UI for accessing HDFS, monitoring jobs, viewing logs, running Hive and Impala queries, and many other nice features that can make you productive fast.

Although the VM is a very good start, there are some annoyances. For example, out of the box you have to move your code onto the VM, and you cannot interact with it the way you would with a cluster. Also, VMs can be prone to corruption, and when that happens you may lose the code you were working on inside the VM.

To get around this, a few simple steps can make the VM act like a cluster. Developing and testing in this manner also leads to much smoother deployments to your production cluster.

Step 1: Download the VM

The VM can be downloaded directly from Cloudera’s downloads page.

Step 2: Load the VM in VirtualBox and Configure

Open VirtualBox and click on “File -> Import Appliance…”


From the file dialog, open “cloudera-quickstart-vm-4.4.0-1-virtualbox.ovf”, which will be in the extracted Cloudera VM download.

Set up “Network Adapter 2” in the VM’s Network Settings in VirtualBox as “Host-only Adapter”.

When setting up the “Host-only Adapter”, if the “Name” drop-down shows only “Not selected”, cancel the dialog and go to the VirtualBox preferences (“VirtualBox -> Preferences -> Network”).

Select “Host-only Networks”, then “Add”, and a new entry will be created (something like “vboxnet0”). Click “OK”. Now go back to the VM’s Network Settings; this time the “Name” drop-down for Adapter 2 should show vboxnet0. Select “vboxnet0”.


If you are not able to add the network this way, update VirtualBox to the latest version. The menu may be present in older versions but throw an exception or error message when you attempt to add.

Add Port 50010 to the NAT Adapter

Open the VM Network Settings and go to Adapter 1; it should say “Attached to: NAT”. Under “Port Forwarding”, add a rule that forwards host port 50010 to guest port 50010 (50010 is the default HDFS DataNode data-transfer port, which clients connect to directly when reading and writing blocks).

Note: The “Host-only Adapter” is needed for the Host machine to access the guest VM (CDH).
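If you prefer the command line, the same network setup can be sketched with VBoxManage (run these with the VM powered off). The VM name below is an example; check yours with VBoxManage list vms and the host-only interface names with VBoxManage list hostonlyifs:

```shell
# Create a host-only interface (typically appears as vboxnet0)
VBoxManage hostonlyif create

# Attach it as Adapter 2 of the VM
VBoxManage modifyvm "cloudera-quickstart-vm-4.4.0-1" --nic2 hostonly --hostonlyadapter2 vboxnet0

# Forward host port 50010 to guest port 50010 (HDFS DataNode) on the NAT adapter
VBoxManage modifyvm "cloudera-quickstart-vm-4.4.0-1" --natpf1 "datanode,tcp,,50010,,50010"
```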

Step 3: Set Up Access to the VM as a Cluster

Perform the following steps to wire up the VM and host:

On the guest VM, create the same user that exists on your host machine. For example, add user jdoe to the cloudera group with password “test”:

sudo useradd -g cloudera jdoe
echo "jdoe:test" | sudo chpasswd

Note: useradd’s -p flag expects an already-encrypted password, so chpasswd is used here to set the plain-text password.

On the host machine, edit the hosts file and add an IP entry for the VM:

sudo vi /etc/hosts

Add the line: <<vm_ip>>  localhost.localdomain

For example: 192.168.56.101        localhost.localdomain

Note: This has to be the 192.x.x.x host-only IP of the VM and NOT the 10.x.x.x NAT IP. If the VM does not have a 192.x.x.x address, then the “Host-only Adapter” network setting in VirtualBox is not set up correctly.

Step 4: Set Up Oozie for Running Hive Actions and Start HiveServer2 for JDBC Access

On the VM, execute the following commands as the hdfs user to copy hive-site.xml into HDFS:

sudo -u hdfs hadoop fs -mkdir /etc/hive/conf

sudo -u hdfs hadoop fs -put /etc/hive/conf/hive-site.xml /etc/hive/conf

Note: You can now point your Hive action at the HDFS location of this file using Oozie’s <job-xml></job-xml> element.
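As a sketch of how that HDFS copy gets used, a Hive action in workflow.xml can reference it via job-xml. The workflow name, script name, and transition targets below are hypothetical examples:

```xml
<workflow-app name="hive-demo-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="hive-node"/>
  <action name="hive-node">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- Points the action at the hive-site.xml copied to HDFS above -->
      <job-xml>/etc/hive/conf/hive-site.xml</job-xml>
      <script>demo.q</script>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Hive action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```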

Start HiveServer2 on the VM for JDBC access:

hive --service hiveserver2
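With HiveServer2 running, you can query Hive over JDBC from the host. Below is a minimal sketch; the hive-jdbc driver jar (and its dependencies) on the classpath, the jdoe/test credentials from the earlier step, and port 10000 (HiveServer2’s default) are the assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {

    // HiveServer2's JDBC URL pattern; 10000 is its default port.
    static String hiveUrl(String host, int port, String db) {
        return "jdbc:hive2://" + host + ":" + port + "/" + db;
    }

    public static void main(String[] args) throws Exception {
        // Requires the hive-jdbc driver on the classpath
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                hiveUrl("localhost.localdomain", 10000, "default"), "jdoe", "test");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```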

You can now access your VM as a cluster from your host machine. Give it a try: the following URL should bring up Hue:

http://localhost.localdomain:8888/

To access HDFS from the Java Hadoop API, use the following core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost.localdomain:8020</value>
  </property>
</configuration>
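As an illustration of the kind of client code this enables, the following sketch lists a directory on the VM’s HDFS. It assumes the CDH4 Hadoop client jars are on the classpath along with the core-site.xml above, and /user is just an example path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfs {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml from the classpath, so fs.default.name
        // resolves to the VM's NameNode
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(new Path("/user"))) {
            System.out.println(status.getPath());
        }
    }
}
```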

For Oozie, use the following NameNode and JobTracker settings:

nameNode=hdfs://localhost.localdomain:8020
jobTracker=localhost.localdomain:8021
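Putting it together, a minimal job.properties for submitting a workflow to the VM might look like the following. The ports match the core-site.xml above (8020 for the NameNode; 8021 is CDH’s default MR1 JobTracker port), and the application path is a hypothetical example:

```properties
nameNode=hdfs://localhost.localdomain:8020
jobTracker=localhost.localdomain:8021
queueName=default
oozie.use.system.libpath=true
# HDFS path where workflow.xml was uploaded (example path)
oozie.wf.application.path=${nameNode}/user/jdoe/workflows/hive-demo
```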

You are all set to develop your Big Data app now! You can access HDFS, launch Oozie jobs, run Hive queries via JDBC, and much more from your host machine. This makes interacting with the VM the same as interacting with a Hadoop cluster running Cloudera’s distribution. In future posts I will dive into the details of each technology.