How to upgrade Spark on CDH5.5

CDH is normally on a six-month release cycle, while Spark is on a four-month release cycle. This mismatch often adds latency before a Spark version is integrated and supported in CDH, and at times Cloudera might even choose to skip a couple of Spark versions. The reason for this latency seems to be the effort needed for integration testing and for fixing bugs in Spark and other projects so that everything works together in time.
But if you are like our teams and can’t wait to get your hands on the latest and greatest of both Spark and CDH, you can always run the latest version of Spark on CDH. We ran into this scenario recently: one of our clients wanted to use the Spark Thrift Server on CDH5.5, but the Spark Thrift JDBC/ODBC server is not included in CDH5.5. We figured this might be a common use case for many of you who want to be on the latest Spark quickly, so we decided to put together the steps to install the latest version of Spark on CDH.
  • Download the Spark source for the version that you want in your CDH. You can either get the code from the git repository or download the source of the specific version from the Apache Spark site. Here is how to get the source for a specific version from git:
    To get the latest code from master, use:
        git clone git://
    To get a specific maintenance branch with stability fixes on top of Spark 1.5, use:
        git clone git:// -b branch-1.5
  • cd /spark
  • Build the Spark libraries from source using the following build options: -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0-cdh5.5.0 -DskipTests -Phive -Phive-thriftserver

    • Spark needs to be built with the -Phive and -Phive-thriftserver options to get the Thrift server as part of the distribution.
    • The Maven version needs to be 3.3 or higher, and Maven needs enough heap memory (1-2 GB) allotted via the Xmx setting.
    • The build takes around 15-30 minutes and writes the libraries into the folder /dist.
  • Stop the Spark service on your cluster.
  • Copy the three jar files (the spark-assembly, spark-examples, and yarn-shuffle jars) to all machines in your cluster. I copied them to /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/jars
  • Repeat the following on every node in the cluster:
    • Log on to each node via shell and run the commands below.
    • Change directory to /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/lib
      • cd /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/lib
    • update the spark-assembly jar
      sudo rm spark-assembly.jar
      sudo rm  spark-assembly-1.5.0-cdh5.5.0-hadoop2.6.0-cdh5.5.0.jar
      sudo ln -s ../../../jars/spark-assembly-1.5.1-hadoop2.6.0-cdh5.5.0.jar
      sudo ln -s ../../../jars/spark-assembly-1.5.1-hadoop2.6.0-cdh5.5.0.jar spark-assembly.jar
    • update the spark examples jar
      sudo rm  spark-examples-1.5.0-cdh5.5.0-hadoop2.6.0-cdh5.5.0.jar
      sudo rm spark-examples.jar
      sudo ln -s ../../../jars/spark-examples-1.5.1-hadoop2.6.0-cdh5.5.0.jar
      sudo ln -s ../../../jars/spark-examples-1.5.1-hadoop2.6.0-cdh5.5.0.jar spark-examples.jar
    • update yarn-shuffle jar
      cd  /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/hadoop-yarn/lib
      sudo rm spark-1.5.0-cdh5.5.0-yarn-shuffle.jar
      sudo rm spark-yarn-shuffle.jar
      sudo ln -s ../../../jars/spark-1.5.1-cdh5.5.0-yarn-shuffle.jar
      sudo ln -s ../../../jars/spark-1.5.1-cdh5.5.0-yarn-shuffle.jar spark-yarn-shuffle.jar
      cd  /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/examples/lib
      sudo rm  spark-examples-1.5.0-cdh5.5.0-hadoop2.6.0-cdh5.5.0.jar
      sudo ln -s ../../lib/spark-examples-1.5.1-hadoop2.6.0-cdh5.5.0.jar
    • List the symlinks in each of these folders and make sure none of them are blinking. A blinking entry in the listing indicates that the symlink might be pointing to the wrong location.
    • Restart the Spark service from Cloudera Manager.
    • Copy the Thrift server start and stop scripts from the /dist/sbin folder to /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/sbin/ on all the nodes that you want to start the Thrift server from.
  • Start the Thrift server with the command
    ./sbin/ --master yarn-client --executor-memory 512m --hiveconf hive.server2.thrift.port=10001

    • If the JAVA_HOME environment variable is not set, you will need to set it before starting the server.
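The symlink check in the steps above can also be done without relying on terminal colors. The following is a minimal sketch, not part of the original procedure: the function name find_broken_symlinks is hypothetical, and it only inspects entries directly inside the directories it is given. It prints every symlink whose target does not exist.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print broken symlinks found directly inside each
# directory passed as an argument.
# A symlink is broken when the link itself exists (-L) but the file it
# points to does not (! -e).
find_broken_symlinks() {
  local dir entry
  for dir in "$@"; do
    for entry in "$dir"/*; do
      if [ -L "$entry" ] && [ ! -e "$entry" ]; then
        printf '%s\n' "$entry"
      fi
    done
  done
}

# Example invocation on a node (commented out; PARCEL is an assumed variable
# holding the parcel path used in the steps above):
# PARCEL=/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8
# find_broken_symlinks "$PARCEL/lib/spark/lib" \
#                      "$PARCEL/lib/hadoop-yarn/lib" \
#                      "$PARCEL/lib/spark/examples/lib"
```

If the function prints nothing, every symlink in those folders resolves to an existing file. Any path it prints is a link that should be recreated before restarting the Spark service.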
That should be it; you are all set to start playing with the latest version of Spark on your CDH. Please note that installing the latest version of Spark like this might put you out of compliance for Cloudera support in case of any bugs or issues that arise. If Cloudera support is important to you, check with your account rep to understand the implications.


Announcing Insight

Clairvoyant is proud to announce our new offering, Insight, a managed service that meets all your big data needs.

Why are we launching this offering?

Over the course of the last three years, our work with various organizations and teams has shown us that there is no shortage of interesting problems that can be solved by leveraging the data assets these organizations already have. There is a growing and widespread awareness that all businesses are, in some fashion or other, “DIGITAL BUSINESSES”. Data, and lots of it, along with decisions and strategies powered by that data, is the cornerstone of the transformation businesses are aiming for.

Why is it, then, that we are still not seeing many successful applications that are truly data-driven? What is hampering the application of the data-driven process at scale? From a technology perspective, it all boils down to one thing: infrastructure.

Infrastructure needs to be considered on multiple levels

  • Servers, Networks, Security
  • Platform Infrastructure
  • Application Infrastructure

Figuring out the right set of infrastructure components across all these layers is a complex, nontrivial task. We find teams asking and facing similar questions centered on picking the right tooling within the Hadoop ecosystem to map to their use cases. Putting together the planning, skill set, and experience required to do this with a quick turnaround is difficult and results in prolonged infrastructure projects.

While large enterprises typically have the financial means to absorb these delays, most organizations do not have that luxury, and time to market is key. “Insight” solves this problem for you.

Our experience and lessons learnt, combined with best-of-breed open source solutions, help address the decision fatigue involved in Big Data projects. We bring to the table solutions and architectural blueprints that map to your problems and use cases: solutions that have been proven to work and that drastically reduce your time to market. Our goal is to onboard your use case and make your data usable and queryable in a matter of days.

Why should you consider it?

Speed, agility, and time to market. You do not have to spend endless cycles figuring out the landscape and building your first solution. In the cloud or on premise, we bring best practices, security, and proven solutions together to accelerate your path to implementing data-driven applications.

How do you get started?

Reach out to us at insight.sales and we will schedule initial calls to walk you through our approach and solution options, and to come up with a specific configuration that best suits your needs.