In a previous post, we explained how to configure a proxy server to provide load balancing for the Impala daemon. The proxy software used was HAProxy, a free, open source load balancer. This post demonstrates how to use Amazon’s Elastic Load Balancer (ELB) to load balance Impala when running in Amazon’s Elastic Compute Cloud (EC2).

Details

Like HAProxy, an Elastic Load Balancer is a reverse proxy that takes incoming TCP connections and distributes them among a set of EC2 instances. This is done partly for fault tolerance and partly for load distribution. Cloudera’s Using Impala through a Proxy for High Availability details how load balancing applies to part of Impala. To summarize, the proxy allows us to configure our Impala clients (Hue, Tableau, etc.) with a single hostname and port. This well-known hostname will not have to be changed out if there were to be…
Category: Impala
Impala High Availability
The Impala daemon is a core component of the Impala architecture. The daemon process runs on each data node and is the process to which clients (Hue, JDBC, ODBC) connect to issue queries. When a query is submitted to an Impala daemon, that node serves as the coordinator node for that query. The coordinator parallelizes the query and distributes work to the other nodes in the Impala cluster. The other nodes transmit partial results back to the coordinator, which constructs the final result set for the query. It is recommended practice to run impalad on each of the data nodes in a cluster, as Impala takes advantage of data locality while processing queries. As a result, most of the time Impala clients connect to one of the data nodes to run their queries. This might create a single point of failure for the clients if the…
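
To illustrate why pointing every client at one daemon is fragile, here is a minimal Python sketch of the selection logic a load-balancing proxy applies on the clients' behalf: rotate through the Impala daemons and skip any that are marked as down. It is an illustration only, not how HAProxy or ELB is implemented. The class name `RoundRobinBalancer`, its methods, and the `example.com` hostnames are hypothetical, and port 21050 is assumed to be the port your JDBC/ODBC clients use.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out backends in rotation, skipping any marked unhealthy.

    Hypothetical helper for illustration; a real proxy also runs
    health checks and forwards the TCP traffic itself.
    """

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._rotation = cycle(self.backends)

    def mark_down(self, backend):
        # Called when a health check (or a failed connection) says the node is gone.
        self.healthy.discard(backend)

    def mark_up(self, backend):
        # Only known backends can be brought back into rotation.
        if backend in self.backends:
            self.healthy.add(backend)

    def next_backend(self):
        if not self.healthy:
            raise RuntimeError("no healthy backends available")
        # One full pass over the rotation is enough to find a healthy node.
        for _ in range(len(self.backends)):
            candidate = next(self._rotation)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


# Assumed hostnames and port, for illustration only.
balancer = RoundRobinBalancer([
    ("impalad-1.example.com", 21050),
    ("impalad-2.example.com", 21050),
    ("impalad-3.example.com", 21050),
])

# Simulate losing one daemon: later picks simply skip it.
balancer.mark_down(("impalad-2.example.com", 21050))
picks = [balancer.next_backend() for _ in range(4)]
```

With the second daemon marked down, `picks` alternates between the two remaining daemons, so clients that only know the proxy's address keep working without any reconfiguration.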