Migrate Cloudera SCM Database from Embedded PostgreSQL to MySQL

If the Cloudera EDH cluster was set up using the embedded PostgreSQL DB, it is desirable and often required to migrate to an external DB, especially for production environments. The Cloudera Management Service databases (Activity Monitor, Service Monitor, Reports Manager, Host Monitor) can be migrated to an external PostgreSQL/MySQL DB from the Cloudera Manager console, but the scm-db would still use the embedded PostgreSQL.

The scm-db contains all the information about the services configured, their role assignments, configuration history, commands, users, and running processes. This is a relatively small database (<100 MB), and is the most important to back up and migrate. A monitoring database contains monitoring information about service and host status. In large clusters, this database can grow large.

The following steps walk through the process of migrating the scm-db from the embedded PostgreSQL to an external MySQL instance.

Helpful vantage point (Cloudera Manager architecture): http://www.cloudera.com/documentation/enterprise/latest/images/cm_arch.png

1. Stop all cluster services, including the Cloudera Manager monitoring services, using the Cloudera Manager UI.
    Then stop all CM agents by SSH-ing to each host and issuing `$ sudo service cloudera-scm-agent stop`


2. Back up the Cloudera Manager configuration by invoking the API call /api/<version>/cm/deployment:
$ curl -v -u admin:admin http://manager_host:7180/api/<version>/cm/deployment > /home/user/deployment
Note: It is critical that this file be retained and protected, as it will be needed to re-import the functional configuration into the new database.
Obtain the API version with: http://manager_host:7180/api/version
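As a quick sanity check before proceeding, the exported file can be validated as JSON. A minimal sketch, assuming python3 is available and the path used in the curl example above:

```shell
# Hypothetical helper: report whether the exported deployment file parses as JSON
check_backup() {
    python3 -m json.tool "$1" > /dev/null 2>&1 && echo "valid JSON" || echo "invalid JSON"
}
check_backup /home/user/deployment
```

If the file is reported invalid, re-run the export before continuing with the migration.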


3. Take a backup of the embedded PostgreSQL database, or make a tarball of the /var/lib/cloudera-scm-server-db/ directory, and store it in a safe location; either will accomplish the needed outcome. Also back up the contents of the directory /etc/cloudera-scm-server/:

# cd /var/lib/cloudera-scm-server-db/
# tar -cvf /home/user/cloudera-scm-server-db.tar .
# cd /etc/cloudera-scm-server/
# tar -cvf /home/user/cloudera-scm-server.tar .


4. On the server where the Cloudera Manager embedded database is running:

$ sudo service cloudera-scm-server stop 
$ sudo service cloudera-scm-server-db stop 
$ sudo chkconfig cloudera-scm-server-db off 


5. Ensure the MySQL JDBC connector is installed and in the correct location [1]


6. Install, configure, and start a new external MySQL database [2]


7. Create a "temp" database and grants for the "temp" user, as documented in "Example 1: Running the script when MySQL is installed on another host" [3]

# mysql -h mysql_hostname -u user -p
    mysql> create database temp;
    mysql> grant all on *.* to 'temp'@'%' identified by 'temp' with grant option;


8. Execute:

$ /usr/share/cmf/schema/scm_prepare_database.sh mysql -h mysql_hostname -u user -p --scm-host manager_host <scm_db_name> <scm_user> <scm_password>
Note: run /usr/share/cmf/schema/scm_prepare_database.sh with no arguments to get syntax assistance
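For example, using the temporary "temp" user created in step 7 and hypothetical values (scm/scm/scm_password) for the SCM database name, user, and password:

```
$ /usr/share/cmf/schema/scm_prepare_database.sh mysql -h mysql_hostname -u temp -p --scm-host manager_host scm scm scm_password
```

When prompted, enter the temp user's password; on success the script reports that the SCM database is configured correctly.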


9. When complete, start the Cloudera Manager server service:

$ sudo service cloudera-scm-server start 


10. Log in to the Cloudera Manager web user interface and upload the enterprise license key.


11. Restart the scm-server

$ sudo service cloudera-scm-server restart 


12. Restore the Cloudera Manager configuration by invoking the API from the command line of the manager node, using the file saved in step 2:

$ curl --upload-file /home/user/deployment -u admin:admin "http://manager_host:7180/api/<version>/cm/deployment?deleteCurrentDeployment=true"
Add the HTTP header option -H "Content-Type: application/json" if required.
Verify that there are no errors; the command prints an API response even on failure, so inspect the output.


13. Log in to Cloudera Manager and verify that all configuration options are now present; all services should be available but stopped.


14. Cleanup: $ sudo yum erase cloudera-manager-server-db
Note: Ensure the file /etc/cloudera-scm-server/db.properties is updated with the newly established database
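After the migration, db.properties should point at the external MySQL instance. A hypothetical example, where the host, database name, user, and password are placeholders to be replaced with the values used in step 8:

```
com.cloudera.cmf.db.type=mysql
com.cloudera.cmf.db.host=mysql_hostname
com.cloudera.cmf.db.name=scm
com.cloudera.cmf.db.user=scm
com.cloudera.cmf.db.password=scm_password
```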


15. If the cloudera-scm-agent processes fail to communicate with the manager, you will have to delete or move the cm_guid file from /var/lib/cloudera-scm-agent/ and restart the agent.
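On each affected host, that amounts to something like the following (moving rather than deleting, so the old GUID is kept as a backup):

```
$ sudo service cloudera-scm-agent stop
$ sudo mv /var/lib/cloudera-scm-agent/cm_guid /var/lib/cloudera-scm-agent/cm_guid.bak
$ sudo service cloudera-scm-agent start
```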



Impala High Availability

The Impala daemon is a core component of the Impala architecture. The daemon process runs on each DataNode and is the process to which clients (Hue, JDBC, ODBC) connect to issue queries. When a query is submitted to an Impala daemon, that node serves as the coordinator node for the query. Acting as the coordinator, the daemon parallelizes the query and distributes the work to the other nodes in the Impala cluster. The other nodes transmit partial results back to the coordinator, which constructs the final result set for the query.

It is recommended practice to run impalad on each of the DataNodes in a cluster, as Impala takes advantage of data locality while processing its queries. Clients can therefore connect to any of the data nodes to run their queries, but if they always issue queries to a single data node, that node becomes a single point of failure. In addition, the node acting as coordinator for a query potentially requires more memory and CPU cycles than the other nodes that process the query. For clusters running production workloads, high availability from the Impala clients' standpoint, together with load distribution across the nodes, can be achieved by placing a proxy server or load balancer in front of the Impala daemons and issuing queries through it using round-robin scheduling.

HAProxy is a free, open source load balancer that can be used as a proxy server to distribute the load across the different Impala daemons. The high-level architecture of this setup is shown below.



Install the load balancer:

HAProxy can be installed and configured on Red Hat Enterprise Linux and CentOS systems using the following instructions.

yum install haproxy

Set up the configuration file: /etc/haproxy/haproxy.cfg.

See the following section for a sample configuration file:

global
    log         127.0.0.1 local2
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
    user        haproxy
    group       haproxy
    daemon

    # turn on stats unix socket
    stats socket /var/lib/haproxy/stats

# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
defaults
    mode                  tcp
    log                   global
    option                tcplog
    retries               3
    timeout connect       50000s
    timeout client        50000s
    timeout server        50000s
    maxconn               3000

# main frontend which proxies to the impala backend - change the port
# if you want
frontend main
    bind *:5000
    default_backend       impala

# round robin balancing between the Impala daemons
backend impala
    mode          tcp
    option        tcplog
    balance       roundrobin
    #balance      leastconn
    # Replace the host names below with your Impala daemon nodes' addresses.
    # Port 21000 is the impala-shell port; use 21050 for JDBC/ODBC clients.
    server client1 client1.example.com:21000 check
    server client2 client2.example.com:21000 check
    server client3 client3.example.com:21000 check

Run the following command after making the changes:

service haproxy reload;
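Before reloading, HAProxy can be asked to validate the configuration file, which catches syntax errors without disturbing the running proxy:

```
$ haproxy -c -f /etc/haproxy/haproxy.cfg
```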


The key configuration options are balance and server in the backend impala section, together with the timeout options in the defaults section. When balance is set to leastconn, each new connection goes to the server with the lowest number of active connections. When balance is set to roundrobin, the proxy hands each new connection to the next server in turn, so each connection uses a different coordinator node.

  1. On systems managed by Cloudera Manager, on the page Impala > Configuration > Impala Daemon Default Group, specify a value for the Impala Daemons Load Balancer field. Specify the address of the load balancer in host:port format. This setting lets Cloudera Manager route all appropriate Impala-related operations through the proxy server.
  2. For any scripts, jobs, or configuration settings for applications that formerly connected to a specific DataNode to run Impala SQL statements, change the connection information (such as the -i option in impala-shell) to point to the load balancer instead.
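For example, with the sample configuration above, where HAProxy listens on port 5000, a shell session would connect through the proxy like this (haproxy_host is a placeholder for the proxy's address):

```
$ impala-shell -i haproxy_host:5000
```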

Test Impala through a Proxy for High Availability:

Manual testing with HAProxy:

Stop the Impala daemon services one at a time while running queries, and check whether Impala high availability keeps working.

Test the impala high availability with shell script using HAProxy:

Run the following shell script to test Impala high availability through HAProxy.

Note: Please change the ‘table_name’ and ‘database_name’ placeholders.

for (( i = 0 ; i < 5; i++ ))
do
    impala-shell -i localhost:5000 -q "select * from {table_name}" -d {database_name}
done

Result: run the above script and observe the load balancer in use.

  • Queries should execute on a different Impala daemon node for each iteration (when balance is roundrobin).
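To confirm the distribution, the HAProxy stats socket configured above can be queried. This hypothetical check (it assumes socat is installed) prints each backend server with its total session count, which should increase across the servers as the loop runs:

```
$ echo "show stat" | sudo socat stdio /var/lib/haproxy/stats | cut -d, -f1,2,8
```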