Impala Load Balancing with Amazon Elastic Load Balancer

In a previous post, we explained how to configure a proxy server to provide load balancing for the Impala daemon. The proxy software used was HAProxy, a free, open source load balancer. This post will demonstrate how to use Amazon’s Elastic Load Balancer (ELB) to perform Impala load balancing when running in Amazon’s Elastic Compute Cloud (EC2).

Details

Similar to HAProxy, an Elastic Load Balancer is a reverse proxy that takes incoming TCP connections and distributes them amongst a set of EC2 instances. This is done partly for fault tolerance and partly for load distribution. Cloudera’s Using Impala through a Proxy for High Availability documentation details how load balancing applies to Impala.

To summarize, the proxy allows us to configure our Impala clients (Hue, Tableau, etc.) with a single, well-known hostname and port. That hostname does not have to change if one or more Impala daemons fail. The proxy also spreads the work of acting as query coordinator across the Impala daemons.
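For example, once the load balancer is in place, a client such as impala-shell only needs the ELB’s DNS name and port. The hostname below is a made-up placeholder; yours will come from the ELB we create later.

impala-shell -i internal-elb-impala-Cluster1-1234567890.us-west-2.elb.amazonaws.com:21000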

The steps we will take to set up the Impala load balancer are:

  1. Configure security groups.
  2. Create the ELB.
  3. Configure the ELB.
  4. Add instances to the ELB.
  5. Test the ELB.
  6. Configure Impala.

Requirements

The following examples require that you have the AWS CLI software installed and configured on your system. This can be on your Windows, Mac, or Linux workstation/laptop or on a Linux host running elsewhere. Spinning up an EC2 instance running Amazon Linux might be the fastest way to get the tools.
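If you want to confirm that the CLI is installed and your credentials work before going further, a quick sanity check (assuming you have already run aws configure) looks like this:

# Print the CLI version and verify the configured credentials are valid.
aws --version
aws sts get-caller-identity --output text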

Of course, you will need an AWS account and IAM privileges to create both the security groups and the ELB.

Implementation

The following code is run in a shell (bash or cmd.exe) on the system with the AWS CLI tools.

First, we will need to name the cluster that is running the Impala service. This will be used to name the ELB and security groups. Then, we will create some variables that will hold the ID numbers of existing AWS infrastructure. Lastly, we use an additional variable named $OPTS for general AWS CLI options we may need.

We will have to look up the VPC ID, the ID of the subnet where we will be deploying the load balancer, and the instance IDs of the Hadoop cluster workers that are running the Impala Daemon. The ELB can reside in the same subnet as the Hadoop cluster or it can be placed in a separate subnet. It is advisable to keep the ELB in the same Availability Zone as the cluster. (You are deploying your Hadoop cluster instances to the same AZ, right?)

CLNAME=Cluster1
VPCID=""
SUBNETID=""
INSTANCES=""
#OPTS="--profile default --region us-west-2"

If you have tagged your objects appropriately, you can use aws ec2 describe-* to look them up programmatically. These are examples. Your environment will be different.

# Return the VPCID of the VPC named "Cluster1".
VPCID=$(aws $OPTS ec2 describe-vpcs --output text --query 'Vpcs[*].VpcId' \
  --filter Name=tag:Name,Values="${CLNAME}")

# Return the SUBNETID of the subnet named "Cluster1 Private subnet 0".
SUBNETID=$(aws $OPTS ec2 describe-subnets --output text \
  --query 'Subnets[*].SubnetId' \
  --filter Name=tag:Name,Values="${CLNAME} Private subnet 0")

# Return the instance IDs of instances tagged "env=Cluster1" and "type=worker".
INSTANCES=$(aws $OPTS ec2 describe-instances --output text \
  --query 'Reservations[*].Instances[*].InstanceId' \
  --filters "Name=tag:env,Values=${CLNAME}" Name=tag:type,Values=worker)

Step 1

Next, we will define two new security groups. The first group will allow client-initiated traffic to reach the ELB. This should be locked down to something smaller than “everything” (0.0.0.0/0), especially if your cluster sits on the big, bad Internet; an example of tightening these rules follows the commands below.
The second group will allow the ELB to reach the Impala daemons running on the Hadoop cluster worker nodes.

echo "Allow Impala connections from clients to the load balancer."
FRONTEND=$(aws $OPTS ec2 create-security-group --output text --vpc-id $VPCID \
  --group-name "${CLNAME} Impala FE" --description "Impala Front-End Traffic")
aws $OPTS ec2 authorize-security-group-ingress --group-id $FRONTEND \
  --protocol tcp --port 21000 --cidr 0.0.0.0/0
aws $OPTS ec2 authorize-security-group-ingress --group-id $FRONTEND \
  --protocol tcp --port 21050 --cidr 0.0.0.0/0

echo "Allow Impala connections from the load balancer to the cluster."
BACKEND=$(aws $OPTS ec2 create-security-group --output text --vpc-id $VPCID \
  --group-name "${CLNAME} Impala BE" --description "Impala Back-End Traffic")
aws $OPTS ec2 authorize-security-group-ingress --group-id $BACKEND \
  --protocol tcp --port 21000 --source-group $FRONTEND
aws $OPTS ec2 authorize-security-group-ingress --group-id $BACKEND \
  --protocol tcp --port 21050 --source-group $FRONTEND
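If your clients all originate from a known address range, you can tighten the front-end rules by revoking the wide-open rule and authorizing a narrower CIDR. The 10.0.0.0/8 range below is only a placeholder; substitute your own client network (and repeat for port 21050).

# Replace the 0.0.0.0/0 rule on port 21000 with a narrower source range.
aws $OPTS ec2 revoke-security-group-ingress --group-id $FRONTEND \
  --protocol tcp --port 21000 --cidr 0.0.0.0/0
aws $OPTS ec2 authorize-security-group-ingress --group-id $FRONTEND \
  --protocol tcp --port 21000 --cidr 10.0.0.0/8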

Then we will add each EC2 instance to the new $BACKEND security group.

for INSTANCEID in $INSTANCES; do
  GROUPID=$(aws $OPTS ec2 describe-instance-attribute --instance-id $INSTANCEID \
    --attribute groupSet --output text --query 'Groups[*].GroupId')
  aws $OPTS ec2 modify-instance-attribute --instance-id $INSTANCEID \
    --groups $GROUPID $BACKEND
done
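To verify that each worker now carries the new group, you can list the group IDs attached to the instances, reusing the same variables:

for INSTANCEID in $INSTANCES; do
  # Print the security groups currently attached to each worker instance.
  aws $OPTS ec2 describe-instance-attribute --instance-id $INSTANCEID \
    --attribute groupSet --output text --query 'Groups[*].GroupId'
done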

Step 2

Next we get to the meat of this post: creating the load balancer. We will create an ELB with the name “elb-impala-Cluster1” and tell it to listen on ports 21000/TCP and 21050/TCP. The ELB will reside on subnet $SUBNETID and be a member of the security group $FRONTEND. This ELB will be internal/private and will not be available on the Internet. You can change this with the --scheme argument.

aws $OPTS elb create-load-balancer \
  --load-balancer-name elb-impala-${CLNAME} \
  --listeners \
  "Protocol=TCP,LoadBalancerPort=21000,InstanceProtocol=TCP,InstancePort=21000" \
  "Protocol=TCP,LoadBalancerPort=21050,InstanceProtocol=TCP,InstancePort=21050" \
  --subnets $SUBNETID \
  --security-groups $FRONTEND \
  --scheme internal

Step 3

After creation, we will modify some of the ELB configuration. This command sets the connection idle timeout to 3600 seconds. It also enables access logging to a previously created S3 bucket named “Cluster1-logs”, where files prefixed with “Cluster1-Impala” will be written every 60 minutes.

aws $OPTS elb modify-load-balancer-attributes \
  --load-balancer-name elb-impala-${CLNAME} \
  --load-balancer-attributes \
  "AccessLog={Enabled=true,S3BucketName=${CLNAME}-logs,EmitInterval=60,S3BucketPrefix=${CLNAME}-Impala},ConnectionSettings={IdleTimeout=3600}"

Further modifications to the ELB will add a health check that it uses to determine whether individual instances are available. The ELB will connect to port 21000/TCP every 30 seconds to test if the instance application is listening. Individual checks will time out after 5 seconds with no response. An instance will be considered failed after two failed checks and will return to a healthy status after five successful checks.

aws $OPTS elb configure-health-check \
  --load-balancer-name elb-impala-${CLNAME} \
  --health-check \
  Target=TCP:21000,Interval=30,Timeout=5,UnhealthyThreshold=2,HealthyThreshold=5

Step 4

Finally, we will attach the Hadoop worker instances to the ELB, at which point load balancing becomes available.

aws $OPTS elb register-instances-with-load-balancer \
  --load-balancer-name elb-impala-${CLNAME} \
  --instances $INSTANCES
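It can take a minute or two for the instances to pass the health check and move from OutOfService to InService. You can watch their state with:

aws $OPTS elb describe-instance-health \
  --load-balancer-name elb-impala-${CLNAME} --output table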

Let’s not forget to look up the all-important DNS name that we will use to talk to the ELB. We will use it to configure our client applications and impala-shell.

ELBDNSNAME=$(aws $OPTS elb describe-load-balancers --load-balancer-names \
  elb-impala-${CLNAME} --output text --query 'LoadBalancerDescriptions[*].DNSName')
echo "*** SAVE ME ***"
echo "ELBDNSNAME : ${ELBDNSNAME}"
ELBDNSNAME=${ELBDNSNAME}

Testing

Step 5

We will test our implementation to confirm that it works as expected.

for (( i = 0 ; i < 10; i++ )); do
  impala-shell -i ${ELBDNSNAME} -q 'SELECT pid();' 2>&1 | grep Coordinator:
done

You should get output similar to the following, which shows that we are connecting to a different coordinator each time:

Query submitted at: 2017-06-23 19:53:00 (Coordinator: http://ip-10-30-1-35.ec2.internal:25000)
Query submitted at: 2017-06-23 19:53:01 (Coordinator: http://ip-10-30-1-4.ec2.internal:25000)
Query submitted at: 2017-06-23 19:53:01 (Coordinator: http://ip-10-30-1-46.ec2.internal:25000)
Query submitted at: 2017-06-23 19:53:01 (Coordinator: http://ip-10-30-1-33.ec2.internal:25000)
Query submitted at: 2017-06-23 19:53:01 (Coordinator: http://ip-10-30-1-10.ec2.internal:25000)
Query submitted at: 2017-06-23 19:53:01 (Coordinator: http://ip-10-30-1-35.ec2.internal:25000)
Query submitted at: 2017-06-23 19:53:02 (Coordinator: http://ip-10-30-1-4.ec2.internal:25000)
Query submitted at: 2017-06-23 19:53:02 (Coordinator: http://ip-10-30-1-46.ec2.internal:25000)
Query submitted at: 2017-06-23 19:53:02 (Coordinator: http://ip-10-30-1-33.ec2.internal:25000)
Query submitted at: 2017-06-23 19:53:02 (Coordinator: http://ip-10-30-1-10.ec2.internal:25000)

Configure Impala

Step 6

Technically, there is nothing you need to configure in Impala itself, at least not on an insecure (non-Kerberized) cluster. You do, however, need to tell other applications in your Hadoop distribution about the load balancer.

From Cloudera’s Using Impala through a Proxy for High Availability:

On systems managed by Cloudera Manager, on the page Impala > Configuration > Impala Daemon Default Group, specify a value for the Impala Daemons Load Balancer field. Specify the address of the load balancer in host:port format. This setting lets Cloudera Manager route all appropriate Impala-related operations through the proxy server.
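Clients that speak the HiveServer2 protocol (Hue, Tableau, beeline, JDBC/ODBC applications) should point at port 21050 on the same load balancer DNS name. As a sketch, a beeline connection to an insecure cluster might look like the following, with the hostname again being a placeholder:

beeline -u "jdbc:hive2://internal-elb-impala-Cluster1-1234567890.us-west-2.elb.amazonaws.com:21050/default;auth=noSasl"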

Security

Since we at Clairvoyant tend to do a lot of security-enabled Hadoop deployments, it makes sense to describe how to get TLS enabled on the ELB.

Amazon has a service called AWS Certificate Manager (ACM). This service lets you provision a CA-signed TLS certificate onto the ELB with very little effort and will automatically renew the certificate before it expires.

First, we will request a certificate. Set the variable $DNAME to the fully qualified domain name that you are using for the certificate. Then we will add tags so that we can give the certificate a simple name.

DNAME=

ARN=$(aws $OPTS acm request-certificate --domain-name $DNAME \
  --subject-alternative-names $ELBDNSNAME --output text)
aws $OPTS acm add-tags-to-certificate --certificate-arn $ARN --tags \
  Key=Name,Value="${CLNAME}"

At this point, the certificate has not yet been issued. Emails have been sent to the domain contacts asking them to approve the request. Once the request is approved, we can continue with assigning the certificate to the ELB. We can list the certificates and watch to see if it has been approved; if the following command produces any output, the request has been approved by the domain owner.

aws $OPTS acm list-certificates --certificate-statuses ISSUED --output text \
  | grep $ARN
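Alternatively, you can poll the status of the certificate directly; it will read PENDING_VALIDATION until approval and ISSUED afterwards:

aws $OPTS acm describe-certificate --certificate-arn $ARN \
  --output text --query 'Certificate.Status'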

Lastly, we will modify the ELB from TCP mode to SSL mode and tell it to use the TLS certificate on ports 21000/TCP and 21050/TCP.

aws $OPTS elb delete-load-balancer-listeners \
  --load-balancer-name elb-impala-${CLNAME} --load-balancer-ports 21000
aws $OPTS elb create-load-balancer-listeners \
  --load-balancer-name elb-impala-${CLNAME} \
  --listeners \
  "Protocol=SSL,LoadBalancerPort=21000,InstanceProtocol=SSL,InstancePort=21000,SSLCertificateId=${ARN}"

aws $OPTS elb delete-load-balancer-listeners \
  --load-balancer-name elb-impala-${CLNAME} --load-balancer-ports 21050
aws $OPTS elb create-load-balancer-listeners \
  --load-balancer-name elb-impala-${CLNAME} \
  --listeners \
  "Protocol=SSL,LoadBalancerPort=21050,InstanceProtocol=SSL,InstancePort=21050,SSLCertificateId=${ARN}"

You should now have an Amazon-signed TLS certificate protecting your ELB traffic. To confirm, run the following command and look for something like issuer= /C=US/O=Amazon/OU=Server CA 1B/CN=Amazon.

openssl s_client -connect ${ELBDNSNAME}:21000 < /dev/null \
  | openssl x509 -noout -issuer

That’s it. Happy load balancing!

Encrypting Amazon EC2 boot volumes via Packer

In order to layer on some easy data-at-rest security, I want to encrypt the boot volumes of my Amazon EC2 instances.  I also want to use the centos.org CentOS images but those are not encrypted.  How can I end up with an encrypted copy of those AMIs in the fewest steps?

In the past, I have used shell scripts and the AWS CLI to perform the boot volume encryption dance. The steps are basically:

  1. Deploy an instance running the source AMI.
  2. Create an image from that instance.
  3. Copy the image and encrypt the copy.
  4. Delete the unencrypted image.
  5. Terminate the instance.
  6. Add tags to new AMI.

The script needs a lot of VPC/subnet/security group preparation (which I suppose could also have been scripted), and if there were errors during execution, cleanup was very manual (more potential script work). The script is very flexible and meets my needs, but it is a codebase that requires expertise to maintain. And I have better things to do with my time.

A simpler solution is Packer.

I had looked at Packer around July of 2016 and it was very promising, but it was missing one key feature: it could not actually encrypt the boot volume. Dave Konopka wrote a post describing the problem and his solution of using Ansible in Encrypted Amazon EC2 boot volumes with Packer and Ansible. Luckily, there was an outstanding pull request and as of version 0.11.0, Packer now has support for boot volume encryption whilst copying Marketplace AMIs.

The nice thing about a Packer template is that it takes care of dynamic generation of most objects. Temporary SSH keys and security groups are created just for the build and are then destroyed. The above steps for the boot volume encryption dance are followed with built-in error checking and recovery in case something goes wrong.

This template assumes automatic lookup of your AWS credentials. Read the docs (Specifying Amazon Credentials section) for more details.
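If you do not already have credentials in ~/.aws/credentials or an instance profile, one option (an assumption on my part, not a requirement of the template) is to export the standard environment variables before running Packer:

# Placeholder values taken from the AWS documentation; substitute your own keys.
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY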

Code can be downloaded from GitHub.

$ cat encrypt-centos.org-7-ami.json
{
    "description": "Copy the centos.org CentOS 7 AMI into our account so that we can add boot volume encryption.",
    "min_packer_version": "0.11.0",
    "variables": {
        "aws_region": "us-east-1",
        "aws_vpc": null,
        "aws_subnet": null,
        "ssh_username": "centos"
    },
    "builders": [
        {
            "type": "amazon-ebs",
            "ami_name": "CentOS Linux 7 x86_64 HVM EBS (encrypted) {{isotime \"20060102\"}}",
            "ami_description": "CentOS Linux 7 x86_64 HVM EBS (encrypted) {{isotime \"20060102\"}}",
            "instance_type": "t2.nano",
            "region": "{{user `aws_region`}}",
            "vpc_id": "{{user `aws_vpc`}}",
            "subnet_id": "{{user `aws_subnet`}}",
            "source_ami_filter": {
                "filters": {
                    "owner-alias": "aws-marketplace",
                    "product-code": "aw0evgkw8e5c1q413zgy5pjce",
                    "virtualization-type": "hvm"
                },
                "most_recent": true
            },
            "ami_virtualization_type": "hvm",
            "ssh_username": "{{user `ssh_username`}}",
            "associate_public_ip_address": true,
            "tags": {
                "Name": "CentOS 7",
                "OS": "CentOS",
                "OSVER": "7"
            },
            "encrypt_boot": true,
            "ami_block_device_mappings": [
                {
                    "device_name": "/dev/sda1",
                    "volume_type": "gp2",
                    "volume_size": 8,
                    "encrypted": true,
                    "delete_on_termination": true
                }
            ],
            "communicator": "ssh",
            "ssh_pty": true
        }
    ],
    "provisioners": [
        {
            "type": "shell",
            "execute_command": "sudo -S sh '{{.Path}}'",
            "inline_shebang": "/bin/sh -e -x",
            "inline": [
                "echo '** Shreding sensitive data ...'",
                "shred -u /etc/ssh/*_key /etc/ssh/*_key.pub",
                "shred -u /root/.*history /home/{{user `ssh_username`}}/.*history",
                "shred -u /root/.ssh/authorized_keys /home/{{user `ssh_username`}}/.ssh/authorized_keys",
                "sync; sleep 1; sync"
            ]
        }
    ]
}

To copy the CentOS 6 AMI, change any references to CentOS “7” to “6” and the product-code from “aw0evgkw8e5c1q413zgy5pjce” to “6x5jmcajty9edm3f211pqjfn2”.

When you build with this Packer template, you will have to pass in the variables aws_vpc and aws_subnet. The AWS region defaults to us-east-1, but can be overridden by setting aws_region. The newest centos.org CentOS AMI in that region will be automatically discovered.
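Before launching anything, you can sanity-check the template and your variables with packer validate, which accepts the same -var flags (the IDs below are placeholders):

$ packer validate -var 'aws_vpc=vpc-12345678' -var 'aws_subnet=subnet-23456789' \
    encrypt-centos.org-7-ami.json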

$ packer build -var 'aws_vpc=vpc-12345678' -var 'aws_subnet=subnet-23456789' \
    encrypt-centos.org-7-ami.json
amazon-ebs output will be in this color.

==> amazon-ebs: Prevalidating AMI Name...
    amazon-ebs: Found Image ID: ami-6d1c2007
==> amazon-ebs: Creating temporary keypair: packer_583c7438-d1d8-f33d-8517-1bdbbd84d2c9
==> amazon-ebs: Creating temporary security group for this instance...
==> amazon-ebs: Authorizing access to port 22 the temporary security group...
==> amazon-ebs: Launching a source AWS instance...
    amazon-ebs: Instance ID: i-5b68a2c4
==> amazon-ebs: Waiting for instance (i-5b68a2c4) to become ready...
==> amazon-ebs: Waiting for SSH to become available...
==> amazon-ebs: Connected to SSH!
==> amazon-ebs: Provisioning with shell script: /var/folders/42/drnmdknj7zz7bf03d91v8nkr0000gq/T/packer-shell797958164
    amazon-ebs: ** Shredding sensitive data ...
    amazon-ebs: shred: /root/.*history: failed to open for writing: No such file or directory
    amazon-ebs: shred: /home/centos/.*history: failed to open for writing: No such file or directory
==> amazon-ebs: Stopping the source instance...
==> amazon-ebs: Waiting for the instance to stop...
==> amazon-ebs: Creating the AMI: CentOS Linux 7 x86_64 HVM EBS (encrypted) 1480356920
    amazon-ebs: AMI: ami-33506f25
==> amazon-ebs: Waiting for AMI to become ready...
==> amazon-ebs: Creating Encrypted AMI Copy
==> amazon-ebs: Copying AMI: us-east-1(ami-33506f25)
==> amazon-ebs: Waiting for AMI copy to become ready...
==> amazon-ebs: Deregistering unencrypted AMI
==> amazon-ebs: Deleting unencrypted snapshots
    amazon-ebs: Snapshot ID: snap-5c87d7eb
==> amazon-ebs: Modifying attributes on AMI (ami-9d4b748b)...
    amazon-ebs: Modifying: description
==> amazon-ebs: Adding tags to AMI (ami-9d4b748b)...
    amazon-ebs: Adding tag: "OS": "CentOS"
    amazon-ebs: Adding tag: "OSVER": "7"
    amazon-ebs: Adding tag: "Name": "CentOS 7"
==> amazon-ebs: Tagging snapshot: snap-1eb5dc01
==> amazon-ebs: Terminating the source AWS instance...
==> amazon-ebs: Cleaning up any extra volumes...
==> amazon-ebs: Destroying volume (vol-aa727a37)...
==> amazon-ebs: Deleting temporary security group...
==> amazon-ebs: Deleting temporary keypair...
Build 'amazon-ebs' finished.

==> Builds finished. The artifacts of successful builds are:
--> amazon-ebs: AMIs were created:

us-east-1: ami-9d4b748b
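As a final check, you can confirm that the root volume of the new AMI really is encrypted (substitute the AMI ID from your own build output):

$ aws ec2 describe-images --image-ids ami-9d4b748b \
    --query 'Images[*].BlockDeviceMappings[*].Ebs.Encrypted' --output text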