How To Rebuild Cloudera’s Spark

As a follow-up to the post How to upgrade Spark on CDH5.5, I will show you how to get a build environment up and running with a CentOS 7 virtual machine running via Vagrant and VirtualBox. This will allow for quick builds or rebuilds of Cloudera’s version of Apache Spark from https://github.com/cloudera/spark.

Why?

You may want to rebuild Cloudera’s Spark in the event that you want to add functionality that was not compiled in by default. The Thrift Server and SparkR are two components that Cloudera does not ship (nor support), so if you are looking for either of these, these instructions will help.

Using a disposable virtual machine will allow for a repeatable build and will keep your workstation computing environment clean of all the bits that may get installed.

Requirements

You will need Internet access during the installation and compilation of the Spark software.

Make sure that you have the following software installed on your local workstation:

  •  Git
  •  Vagrant
  •  VirtualBox

Installation of these components is documented at their respective websites.
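
You can confirm that the tools are installed by checking their versions from a terminal:

git --version
vagrant --version
VBoxManage --version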

Get Started

Clone the vagrant-sparkbuilder git repository to your local workstation:

git clone https://github.com/teamclairvoyant/vagrant-sparkbuilder.git
cd vagrant-sparkbuilder

Start the Vagrant instance that comes with the vagrant-sparkbuilder repository. This will boot a CentOS 7 virtual machine, install the Puppet agent, and instruct Puppet to configure the virtual machine with Oracle Java and the Cloudera Spark git repository. Then it will log you in to the virtual machine.

vagrant up
vagrant ssh

Inside the virtual machine, change to the spark directory:

cd spark

The automation has already checked out the branch/tag that corresponds to the target CDH version (presently defaulting to cdh5.7.0-release). Now you just need to build Spark with the Hive Thrift Server while excluding dependencies that are shipped as part of CDH. The key options in this example are -Phive and -Phive-thriftserver. Expect the compilation to take 20-30 minutes, depending upon your Internet speed and your workstation’s CPU and disk speed.

patch -p0 </vagrant/undelete.patch
./make-distribution.sh -DskipTests \
  -Dhadoop.version=2.6.0-cdh5.7.0 \
  -Phadoop-2.6 \
  -Pyarn \
  -Phive -Phive-thriftserver \
  -Pflume-provided \
  -Phadoop-provided \
  -Phbase-provided \
  -Phive-provided \
  -Pparquet-provided

If the above command fails with a ‘Cannot allocate memory’ error, either run it again or increase the amount of memory allocated to the virtual machine in the Vagrantfile.
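
For example, a memory bump for the VirtualBox provider in the Vagrantfile might look like the following sketch (the 4096 MB figure is an arbitrary example; use what your workstation can spare):

config.vm.provider "virtualbox" do |vb|
  # Allocate more RAM to the build VM; the Maven build is memory-hungry
  vb.memory = 4096
end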

Copy the resulting distribution back to your local workstation:

rsync -a dist/ /vagrant/dist-cdh5.7.0-nodeps

If you want to build against a different CDH release, then use git to change the code:

git checkout -- make-distribution.sh
git checkout cdh5.5.2-release
patch -p0 </vagrant/undelete.patch

Log out of the virtual machine with the exit command, then stop and/or destroy the virtual machine:

vagrant halt
vagrant destroy

More examples can be found at the vagrant-sparkbuilder GitHub site.

What Now?

From here, you should be able to make use of the newly built software.

If you are recompiling Spark in order to get the Hive integration along with the JDBC Thrift Server, copy over and then install the newly built jars and scripts to the correct locations on the node that will run the Spark Thrift Server.

install -o root -g root -m 0644 dist-cdh5.7.0-nodeps/lib/spark*.jar \
  /opt/cloudera/parcels/CDH/jars/
install -o root -g root -m 0755 dist-cdh5.7.0-nodeps/sbin/start-thriftserver.sh \
  /opt/cloudera/parcels/CDH/lib/spark/sbin/
install -o root -g root -m 0755 dist-cdh5.7.0-nodeps/sbin/stop-thriftserver.sh \
  /opt/cloudera/parcels/CDH/lib/spark/sbin/

You should only need to install on the one node, not on all the cluster members.
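
As a quick sanity check, you can start the Thrift Server on that node and connect to it with the beeline client that ships with CDH (the port below is an arbitrary, hypothetical choice):

cd /opt/cloudera/parcels/CDH/lib/spark
./sbin/start-thriftserver.sh --master yarn-client --hiveconf hive.server2.thrift.port=10001
/opt/cloudera/parcels/CDH/bin/beeline -u jdbc:hive2://localhost:10001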

How to upgrade Spark on CDH5.5

CDH is normally on a six-month release cycle, while Spark is on a four-month release cycle. This often leads to extra latency before a Spark version gets integrated and supported in CDH. At times, Cloudera might even choose to skip a couple of Spark versions. The reason for this latency seems to be the effort needed for integration testing and for fixing bugs in Spark and other projects to get everything working together in time.
But if you are like our teams and can’t wait to get your hands on the latest and greatest of both Spark and CDH, you can always run the latest version of Spark on CDH. We encountered a similar scenario recently: one of our clients wanted to use the Spark Thrift Server on CDH5.5, but the Spark Thrift JDBC/ODBC server is not included in CDH5.5. We figured this might be a common use case for many of you who want to be on the latest and greatest of Spark quickly, so we decided to put together the steps to install the latest version of Spark on CDH.
  •  Download the Spark source of the version that you want in your CDH. This can be done either by getting the code from the git repository or by downloading the source of the specific version from the Apache Spark site. Here is how to get the source for a specific version from git.
    To get the latest code from master:
        git clone git://github.com/apache/spark.git
    To get a specific maintenance branch with stability fixes on top of Spark 1.5:
        git clone git://github.com/apache/spark.git -b branch-1.5
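    Alternatively, to download the source tarball for a specific release directly (URL pattern assumed from the Apache archive; spark-1.5.1 is used here as an example):
        wget https://archive.apache.org/dist/spark/spark-1.5.1/spark-1.5.1.tgz
        tar -xzf spark-1.5.1.tgz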
    
  • cd spark
  • Execute the command below to build the Spark libraries from source:
     ./make-distribution.sh -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0-cdh5.5.0 -DskipTests -Phive -Phive-thriftserver

    • Spark needs to be built with the -Phive and -Phive-thriftserver options to get the Thrift Server as part of the distribution.
    • The Maven version needs to be 3.3 or higher, and you need to make sure Maven has enough heap memory (1-2 GB) allotted via its -Xmx setting, as shown below.
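      For instance, the Spark 1.5 build documentation recommends settings along these lines (adjust the heap size to your machine):
        export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"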
    • The build will take around 15-30 minutes and will put the libraries into the dist folder under the source tree:
      spark-examples-1.5.1-hadoop2.6.0-cdh5.5.0.jar
      spark-assembly-1.5.1-hadoop2.6.0-cdh5.5.0.jar
      spark-1.5.1-cdh5.5.0-yarn-shuffle.jar
  • Stop the Spark service on your cluster.
  • Copy the three jar files to all machines in your cluster. I copied them to /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/jars; one way to script that is sketched below.
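    A possible way to push the jars from the build machine to every node, assuming a hypothetical hosts.txt file listing the cluster hostnames and that the three jars landed in dist/lib:
      for host in $(cat hosts.txt); do
        # stage the jars in /tmp, then move them into the parcel's jars folder as root
        scp dist/lib/spark-*1.5.1*.jar "${host}:/tmp/"
        ssh -t "${host}" 'sudo mv /tmp/spark-*1.5.1*.jar /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/jars/'
      done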
  • Do the following on all the nodes in the cluster:
    • Log onto each node via a shell and run the commands below.
    • Change to the Spark lib directory:
      cd /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/lib
    • update the spark-assembly jar
      sudo rm spark-assembly.jar
      sudo rm  spark-assembly-1.5.0-cdh5.5.0-hadoop2.6.0-cdh5.5.0.jar
      sudo ln -s ../../../jars/spark-assembly-1.5.1-hadoop2.6.0-cdh5.5.0.jar
      sudo ln -s ../../../jars/spark-assembly-1.5.1-hadoop2.6.0-cdh5.5.0.jar spark-assembly.jar
      
      
    • update the spark examples jar
      sudo rm  spark-examples-1.5.0-cdh5.5.0-hadoop2.6.0-cdh5.5.0.jar
      sudo rm spark-examples.jar
      sudo ln -s ../../../jars/spark-examples-1.5.1-hadoop2.6.0-cdh5.5.0.jar
      sudo ln -s ../../../jars/spark-examples-1.5.1-hadoop2.6.0-cdh5.5.0.jar spark-examples.jar
      
      
    • update the yarn-shuffle jar
      cd /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/hadoop-yarn/lib
      sudo rm spark-1.5.0-cdh5.5.0-yarn-shuffle.jar
      sudo rm spark-yarn-shuffle.jar
      sudo ln -s ../../../jars/spark-1.5.1-cdh5.5.0-yarn-shuffle.jar
      sudo ln -s ../../../jars/spark-1.5.1-cdh5.5.0-yarn-shuffle.jar spark-yarn-shuffle.jar
    • update the examples jar in the examples folder
      cd /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/examples/lib
      sudo rm spark-examples-1.5.0-cdh5.5.0-hadoop2.6.0-cdh5.5.0.jar
      sudo ln -s ../../lib/spark-examples-1.5.1-hadoop2.6.0-cdh5.5.0.jar
    • List the symlinks in all of the folders and make sure none of them are broken; a broken symlink often blinks in colorized ls output, which indicates that it is pointing at a nonexistent target.
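      GNU find can also hunt down broken symlinks for you; the -xtype l test matches symlinks whose targets do not exist:
        find /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8 -xtype l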
    • Restart the Spark service from Cloudera Manager.
    • Copy the Thrift Server start and stop scripts from the dist/sbin folder to /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/sbin/ on all the nodes that you want to start the Thrift Server from.
  • Start the Thrift Server with the command:
    ./sbin/start-thriftserver.sh --master yarn-client --executor-memory 512m --hiveconf hive.server2.thrift.port=10001

    • If the JAVA_HOME environment variable is not set, you will need to set it before starting the server, for example:
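      (The JDK path below is hypothetical; point it at wherever your JDK actually lives.)
        export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera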
That should be it; you are all set to start playing with the latest version of Spark on your CDH cluster. Please note that installing the latest version of Spark like this might put you out of compliance with Cloudera support in case of any bugs or issues that might arise. If Cloudera support is important to you, you should check with your account rep to understand the implications.