Impala High Availability

The Impala daemon is a core component of the Impala architecture. The daemon process runs on each data node and is the process to which clients (Hue, JDBC, ODBC) connect to issue queries. When a query is submitted to an Impala daemon, that node serves as the coordinator node for the query. The coordinating daemon parallelizes the query and distributes work to the other nodes in the Impala cluster. The other nodes transmit partial results back to the coordinator, which constructs the final result set for the query.
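For example, a client that connects directly to a particular daemon with impala-shell makes that node the coordinator for its queries (the hostname below is a placeholder; 21000 is the default impala-shell port):

# Connect directly to one Impala daemon; that daemon coordinates the queries in this session
impala-shell -i datanode1.example.com:21000 -q "show databases"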

It is a recommended practice to run impalad on each of the data nodes in a cluster, as Impala takes advantage of data locality while processing queries. Clients therefore typically connect to one of the data nodes to run their queries, which creates a single point of failure if they always issue queries to the same node. In addition, the node acting as the coordinator for a query potentially requires more memory and CPU cycles than the other nodes that process that query. For clusters running production workloads, high availability from the clients' standpoint and load distribution across the nodes can be achieved by placing a proxy server or load balancer in front of the Impala daemons and issuing queries through it using round-robin scheduling.

HAProxy is a free, open-source load balancer that can be used as a proxy server to distribute the load across the Impala daemons. The high-level architecture for this setup is shown below.

[Figure: Impala high availability architecture with HAProxy in front of the Impala daemons]

Install the load balancer:

HAProxy can be installed and configured on Red Hat Enterprise Linux and CentOS systems using the following instructions.

yum install haproxy

Set up the configuration file: /etc/haproxy/haproxy.cfg.

See the following sample configuration file:

global
    log         127.0.0.1 local2

    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
    user        haproxy
    group       haproxy
    daemon

    # turn on stats unix socket
    stats socket /var/lib/haproxy/stats

#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#---------------------------------------------------------------------
defaults
    mode                  tcp
    log                   global
    retries               3
    timeout connect       50000s
    timeout client        50000s
    timeout server        50000s
    maxconn               3000

#---------------------------------------------------------------------
# main frontend which proxys to the backends - change the port
# if you want
#---------------------------------------------------------------------
frontend main *:5000
    acl url_static       path_beg       -i /static /images /javascript /stylesheets
    acl url_static       path_end       -i .jpg .gif .png .css .js

    use_backend static          if url_static
    default_backend             impala

#---------------------------------------------------------------------
#static backend for serving up images, stylesheets and such
#---------------------------------------------------------------------
backend static
    balance     roundrobin
    server      static 127.0.0.1:4331 check

#---------------------------------------------------------------------
#round robin balancing between the various backends
#---------------------------------------------------------------------
backend impala
    mode          tcp
    option        tcplog
    balance       roundrobin
    #balance      leastconn
    #---------------------------------------------------------------------
    # Replace the IP addresses with the IP addresses of your Impala daemon nodes
    #---------------------------------------------------------------------
    server client1 192.168.3.163:21000
    server client2 192.168.3.164:21000
    server client3 192.168.3.165:21000

Run the following command after making the changes:

service haproxy reload;
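Optionally, validate the configuration and confirm the listener before sending traffic; the checks below assume the frontend port 5000 from the sample configuration:

# Parse the configuration file without starting the service
haproxy -c -f /etc/haproxy/haproxy.cfg

# Confirm HAProxy is listening on the frontend port
sudo netstat -tlnp | grep :5000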

Note:

The key configuration options are balance and server in the backend impala section, along with the timeout options in the defaults section. When the balance parameter is set to leastconn, each new connection goes to the server that currently has the fewest connections. When balance is set to roundrobin, the proxy cycles through the servers so that each connection uses a different coordinator node.

  1. On systems managed by Cloudera Manager, on the page Impala > Configuration > Impala Daemon Default Group, specify a value for the Impala Daemons Load Balancer field. Specify the address of the load balancer in host:port format. This setting lets Cloudera Manager route all appropriate Impala-related operations through the proxy server.
  2. For any scripts, jobs, or configuration settings for applications that formerly connected to a specific data node to run Impala SQL statements, change the connection information (such as the -i option in impala-shell) to point to the load balancer instead, for example:
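
With HAProxy listening on the frontend port 5000 from the sample configuration (haproxy-host below is a placeholder for the load balancer's address):

# Point impala-shell at the load balancer instead of an individual Impala daemon
impala-shell -i haproxy-host:5000 -q "show databases"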

Test Impala through a Proxy for High Availability:

Manual testing with HAProxy:

Stop the Impala daemon service on the nodes one at a time, run queries through the proxy, and check whether Impala high availability is working correctly.
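A minimal sketch of that manual test, assuming a package-based installation where the daemon runs as the impala-server service (on Cloudera Manager clusters, stop the Impala Daemon role from the CM UI instead):

# On one Impala node, stop the daemon
sudo service impala-server stop

# From a client machine, run a query through the proxy; it should still succeed
impala-shell -i localhost:5000 -q "show databases"

# Restart the daemon before testing the next node
sudo service impala-server start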

Test Impala high availability with a shell script using HAProxy:

Run the following shell script to test Impala high availability through HAProxy.

Note: Replace the {table_name} and {database_name} placeholders.

# Issue the same query five times through the HAProxy frontend (localhost:5000)
for (( i = 0 ; i < 5; i++ ))
do
    impala-shell -i localhost:5000 -q "select * from {table_name}" -d {database_name}
done

Result: Run the above script and observe how the load balancer distributes the queries.

  • The query should execute on a different Impala daemon node in each iteration (when balance is set to roundrobin).
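
To confirm that connections are being spread across the backends, you can also query HAProxy's stats socket (enabled in the sample configuration above); this sketch assumes the socat utility is installed:

# Print cumulative session counts per backend server from the HAProxy stats socket
# CSV fields: 1 = backend name, 2 = server name, 8 = total sessions
echo "show stat" | sudo socat stdio /var/lib/haproxy/stats | cut -d',' -f1,2,8 | grep impala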

How To Rebuild Cloudera’s Spark

As a follow-up to the post How to upgrade Spark on CDH5.5, I will show you how to get a build environment up and running with a CentOS 7 virtual machine running via Vagrant and VirtualBox. This will allow for a quick build or rebuild of Cloudera's version of Apache Spark from https://github.com/cloudera/spark.

Why?

You may want to rebuild Cloudera's Spark if you want to add functionality that was not compiled in by default. The Thriftserver and SparkR are two things that Cloudera does not ship (nor support), so if you are looking for either of these, these instructions will help.

Using a disposable virtual machine will allow for a repeatable build and will keep your workstation computing environment clean of all the bits that may get installed.

Requirements

You will need Internet access during the installation and compilation of the Spark software.

Make sure that you have the following software installed on your local workstation:

  • Git
  • Vagrant
  • VirtualBox

Installation of these components is documented at their respective websites.

Get Started

Clone the vagrant-sparkbuilder git repository to your local workstation:

git clone https://github.com/teamclairvoyant/vagrant-sparkbuilder.git
cd vagrant-sparkbuilder

Start the Vagrant instance that comes with the vagrant-sparkbuilder repository. This will boot a CentOS 7 virtual machine, install the Puppet agent, and instruct Puppet to configure the virtual machine with Oracle Java and the Cloudera Spark git repository. Then log in to the virtual machine:

vagrant up
vagrant ssh

Inside the virtual machine, change to the spark directory:

cd spark

The automation has already checked out the branch/tag that corresponds to the target CDH version (presently defaulting to cdh5.7.0-release). Now you just need to build Spark with the Hive Thriftserver while excluding dependencies that are shipped as part of CDH. The key options in this example are -Phive and -Phive-thriftserver. Expect the compilation to take 20-30 minutes, depending upon your Internet speed and workstation CPU and disk speed.

patch -p0 </vagrant/undelete.patch
./make-distribution.sh -DskipTests \
  -Dhadoop.version=2.6.0-cdh5.7.0 \
  -Phadoop-2.6 \
  -Pyarn \
  -Phive -Phive-thriftserver \
  -Pflume-provided \
  -Phadoop-provided \
  -Phbase-provided \
  -Phive-provided \
  -Pparquet-provided

If the above command fails with a ‘Cannot allocate memory’ error, either run it again or increase the amount of memory in the Vagrantfile.
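The memory setting lives in the Vagrantfile's provider block. The snippet below is only an illustrative sketch assuming the VirtualBox provider (the exact variable names in the vagrant-sparkbuilder Vagrantfile may differ); after editing the file, apply the change with vagrant reload.

# Illustrative Vagrantfile provider block; adjust to match the repository's actual file
Vagrant.configure("2") do |config|
  config.vm.provider "virtualbox" do |vb|
    vb.memory = 4096   # MB of RAM for the build VM; raise this if the Spark build fails
  end
end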

Copy the resulting distribution back to your local workstation:

rsync -a dist/ /vagrant/dist-cdh5.7.0-nodeps

If you want to build against a different CDH release, then use git to change the code:

git checkout -- make-distribution.sh
git checkout cdh5.5.2-release
patch -p0 </vagrant/undelete.patch
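
After switching branches, rerun the build with a Hadoop version string that matches the new release, following the same pattern as the cdh5.7.0 build above (the version below assumes CDH 5.5.2 is built against Hadoop 2.6.0):

./make-distribution.sh -DskipTests \
  -Dhadoop.version=2.6.0-cdh5.5.2 \
  -Phadoop-2.6 \
  -Pyarn \
  -Phive -Phive-thriftserver \
  -Pflume-provided \
  -Phadoop-provided \
  -Phbase-provided \
  -Phive-provided \
  -Pparquet-provided

rsync -a dist/ /vagrant/dist-cdh5.5.2-nodeps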

Log out of the virtual machine with the exit command, then stop and/or destroy the virtual machine:

vagrant halt
vagrant destroy

More examples can be found at the vagrant-sparkbuilder GitHub site.

What Now?

From here, you should be able to make use of the newly built software.

If you are recompiling Spark in order to get the Hive integration along with the JDBC Thriftserver, copy over and then install the newly built jars to the correct locations on the node which will run the Spark Thriftserver.

install -o root -g root -m 0644 dist-cdh5.7.0-nodeps/lib/spark*.jar \
  /opt/cloudera/parcels/CDH/jars/
install -o root -g root -m 0755 dist-cdh5.7.0-nodeps/sbin/start-thriftserver.sh \
  /opt/cloudera/parcels/CDH/lib/spark/sbin/
install -o root -g root -m 0755 dist-cdh5.7.0-nodeps/sbin/stop-thriftserver.sh \
  /opt/cloudera/parcels/CDH/lib/spark/sbin/

You should only need to install on the one node, not on all the cluster members.
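
As a quick smoke test on that node, you can start the Thriftserver and connect to it with Beeline. This is only a sketch assuming a parcel-based CDH layout, the default Thriftserver port (10000), and a spark service user; adjust users, ports, and the YARN master setting for your cluster.

# Start the Spark Thriftserver on YARN
cd /opt/cloudera/parcels/CDH/lib/spark
sudo -u spark ./sbin/start-thriftserver.sh --master yarn-client

# Connect with Beeline on the default Thriftserver port
beeline -u jdbc:hive2://localhost:10000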