How To Rebuild Cloudera’s Spark

As a follow-up to the post How to upgrade Spark on CDH5.5, I will show you how to get a build environment up and running with a CentOS 7 virtual machine running via Vagrant and VirtualBox. This will allow you to quickly build or rebuild Cloudera's version of Apache Spark from https://github.com/cloudera/spark.

Why?

You may want to rebuild Cloudera's Spark if you need functionality that is not compiled in by default. The Thriftserver and SparkR are two components that Cloudera neither ships nor supports, so if you are looking for either of them, these instructions will help.

Using a disposable virtual machine allows for a repeatable build and keeps your workstation free of all the build dependencies that would otherwise get installed.

Requirements

You will need Internet access during the installation and compilation of the Spark software.

Make sure that you have the following software installed on your local workstation:

- Git (https://git-scm.com/)
- Vagrant (https://www.vagrantup.com/)
- VirtualBox (https://www.virtualbox.org/)

Installation of these components is documented at their respective links.
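
To confirm that everything is in place, each tool can report its version from a terminal (exact version numbers will vary; any recent release should work):

git --version
vagrant --version
VBoxManage --version   # installed as part of VirtualBox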

Get Started

Clone the vagrant-sparkbuilder git repository to your local workstation:

git clone https://github.com/teamclairvoyant/vagrant-sparkbuilder.git
cd vagrant-sparkbuilder

Start the Vagrant instance that comes with the vagrant-sparkbuilder repository. This will boot a CentOS 7 virtual machine, install the Puppet agent, and instruct Puppet to configure the virtual machine with Oracle Java and a clone of Cloudera's Spark git repository. Then log in to the virtual machine:

vagrant up
vagrant ssh
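
If the Puppet provisioning fails partway through (a transient network problem, for example), the instance can usually be fixed by re-running the provisioner rather than starting over:

vagrant status      # confirm the state of the instance
vagrant provision   # re-run the Puppet configuration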

Inside the virtual machine, change to the spark directory:

cd spark
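
If you want to confirm which release you are about to build, plain git commands will show it (nothing here is specific to this project):

git status             # reports the checked-out branch, or a detached HEAD at the tag
git log -1 --oneline   # the most recent commit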

The automation has already checked out the branch/tag that corresponds to the target CDH version (presently defaulting to cdh5.7.0-release). Apply the provided undelete.patch (which modifies make-distribution.sh), then build Spark with the Hive Thriftserver while excluding the dependencies that are already shipped as part of CDH. The key options in this example are -Phive and -Phive-thriftserver. Expect the compilation to take 20-30 minutes, depending upon your Internet speed and your workstation's CPU and disk speed.

patch -p0 </vagrant/undelete.patch
./make-distribution.sh -DskipTests \
  -Dhadoop.version=2.6.0-cdh5.7.0 \
  -Phadoop-2.6 \
  -Pyarn \
  -Phive -Phive-thriftserver \
  -Pflume-provided \
  -Phadoop-provided \
  -Phbase-provided \
  -Phive-provided \
  -Pparquet-provided

If the above command fails with a ‘Cannot allocate memory’ error, either run it again or increase the amount of memory allocated to the virtual machine in the Vagrantfile.
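
Once the build completes, a quick sanity check is to confirm that the Thriftserver pieces actually made it into the distribution (jar names vary by version, so this just looks for the sbin scripts):

ls dist/sbin/ | grep thriftserver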

Copy the resulting distribution back to your local workstation:

rsync -a dist/ /vagrant/dist-cdh5.7.0-nodeps
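
Because /vagrant inside the guest is the folder that Vagrant syncs with the host, the distribution is now visible on your local workstation inside the cloned repository. From the vagrant-sparkbuilder directory on the host:

ls dist-cdh5.7.0-nodeps/lib/ dist-cdh5.7.0-nodeps/sbin/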

If you want to build against a different CDH release, use git to revert the patched make-distribution.sh, switch to the corresponding branch/tag, and re-apply the patch:

git checkout -- make-distribution.sh
git checkout cdh5.5.2-release
patch -p0 </vagrant/undelete.patch
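
Then rebuild exactly as before, making sure that hadoop.version matches the newly checked-out release. For example, for CDH 5.5.2 (the profiles are unchanged; only the version strings differ):

./make-distribution.sh -DskipTests \
  -Dhadoop.version=2.6.0-cdh5.5.2 \
  -Phadoop-2.6 \
  -Pyarn \
  -Phive -Phive-thriftserver \
  -Pflume-provided \
  -Phadoop-provided \
  -Phbase-provided \
  -Phive-provided \
  -Pparquet-provided
rsync -a dist/ /vagrant/dist-cdh5.5.2-nodeps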

Log out of the virtual machine with the exit command, then stop and/or destroy the virtual machine:

vagrant halt
vagrant destroy

More examples can be found at the vagrant-sparkbuilder GitHub site.

What Now?

From here, you should be able to make use of the newly built software.

If you recompiled Spark in order to get the Hive integration along with the JDBC Thriftserver, copy the newly built jars and scripts over to the node that will run the Spark Thriftserver and install them into the correct locations.

install -o root -g root -m 0644 dist-cdh5.7.0-nodeps/lib/spark*.jar \
  /opt/cloudera/parcels/CDH/jars/
install -o root -g root -m 0755 dist-cdh5.7.0-nodeps/sbin/start-thriftserver.sh \
  /opt/cloudera/parcels/CDH/lib/spark/sbin/
install -o root -g root -m 0755 dist-cdh5.7.0-nodeps/sbin/stop-thriftserver.sh \
  /opt/cloudera/parcels/CDH/lib/spark/sbin/

You should only need to install these files on the one node that will run the Thriftserver, not on all of the cluster members.
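
To verify the result, start the Thriftserver on that node and connect to it with beeline. The service account and port below are assumptions (spark, and the HiveServer2-style default of 10000); adjust them to match your environment:

sudo -u spark /opt/cloudera/parcels/CDH/lib/spark/sbin/start-thriftserver.sh
# once it is up, connect from the same node:
beeline -u jdbc:hive2://localhost:10000 -e 'show databases;'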