2016 Phoenix Data Conference

The third Phoenix Data Conference, the largest big data event in the Phoenix Valley, drew a tremendous response this year. Having established itself as a must-attend big data event in the Phoenix area, the conference attracted thought leaders and key companies discussing the latest trends and technologies in the big data space. Over 350 technologists, business leaders, data analysts and engineers attended the 2016 conference.

The Phoenix Data Conference 2016, which concluded last Saturday, focused on practical big data use cases from leading analytics companies such as Cloudera, SAP, Clairvoyant, MapR, StreamSets, Microsoft, Confluent, SnapLogic, DataTorrent, Tresata, Amazon, Choice Hotels, MemSQL, Wells Fargo and others. Technology leaders in the big data space shared innovative implementations and advances in the Hadoop space. Speakers also discussed specific challenges around security, talent availability, technical deployments and managed services.

We would like to thank our speakers, sponsors, attendees and volunteers for making the Phoenix Data Conference 2016 a huge success. The intent of the conference was to encourage the local community to come together and learn from other practitioners and experts in this field. We strongly believe that an active community and exposure to new ideas are key to improving the local talent base and opportunities for everyone. We hope an educated and well-informed talent pool will help bring more interesting companies and work to the greater Phoenix area.

None of this would have been possible without the speakers’ time and effort in sharing their knowledge with us. It was a pleasure to host all of you. We had a good mix of sessions: techniques for solving problems using different products, lessons-learned sessions, deep dives into specific technologies, and showcases of how companies are implementing their big data strategies.

A big thank you also to all the attendees. The discussions and questions made for a very engaging day. You made the event a great success!

If you missed all this big data fun but would like to know more, visit the website www.phxdataconference.com or join the monthly Phoenix Hadoop meetup.

In addition to the regular monthly meetups and the Phoenix Data Conference, we also hosted an all-day hands-on session on Hadoop. Based on the interest and feedback we received, we will host more workshops in the future to provide hands-on Hadoop training.

If you have any additional feedback, suggestions for future events or questions, please drop us a line at contact@phxdataconference.com. Your input is very valuable in helping us organize such events, so please keep it coming!

You can also fill out a short survey at



The Phoenix Data Conference Team

Intro to Machine Learning





Machine learning, statistical modeling and predictive analytics have been around for a long time, but the current hype is due to the rise of social media over the last decade. Terabytes and petabytes of digital data are created every day. This data contains many hidden stories and insights, and building a business feature or solution on them can benefit its owners. But it is impossible for humans to read through that much data manually and derive insights. This is where machine learning comes in handy.

Ok, so what is Machine Learning?

In simple technical terms, it is training a statistical algorithm or model with historical data and using that model to predict the outcome for new or unseen data.

Machine Learning can be categorized into two high-level groups:

  • Supervised Learning
  • Unsupervised Learning

Supervised Learning

In supervised learning, each entry in the dataset has an input (a set of attributes) for the model and a desired output (a target class). We train a model using the inputs and outputs, and use that model to predict the target class for new or unseen data.

Algorithms: Naïve Bayes, Decision Tree, Random Forest etc.

Example 1: Given a passenger’s information, we try to predict whether the person would have survived the sinking of the Titanic. In this scenario the outcome is categorical (True or False). We use classification algorithms for categorical prediction.

Example 2: Given a house’s features, we try to predict the price of the house. Here the target is the price, which is a continuous value. Regression algorithms serve these scenarios.
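
As a minimal sketch of the two cases above, the snippet below trains one classifier and one regressor from scikit-learn. All values are made up for illustration and are not taken from the Titanic data or any housing dataset; the variable names are hypothetical.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hypothetical toy inputs: [age, passenger class]; labels are invented.
X_cls = [[22, 3], [38, 1], [26, 3], [35, 1]]
y_cls = [0, 1, 0, 1]  # categorical target: 0 = False, 1 = True

clf = DecisionTreeClassifier(random_state=0).fit(X_cls, y_cls)
pred_class = clf.predict([[30, 1]])[0]  # always one of the known classes

# Hypothetical toy inputs: [square feet, bedrooms]; prices are invented.
X_reg = [[800, 2], [1200, 3], [1500, 3], [2000, 4]]
y_reg = [100000.0, 150000.0, 180000.0, 250000.0]  # continuous target

reg = DecisionTreeRegressor(random_state=0).fit(X_reg, y_reg)
pred_price = reg.predict([[1300, 3]])[0]  # a real-valued estimate

print(pred_class)
print(pred_price)
```

The classifier can only return a label it saw during training, while the regressor returns a number on the same scale as the training prices.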


Unsupervised Learning

With unsupervised learning, we don’t have to do the first step of curating data and annotating the target class manually (which is essential in supervised learning). In unsupervised learning, data entries or elements that are similar are grouped together. This process is called Clustering. Once clusters are created, new data is assigned to one of the clusters formed.

Algorithms: k-means, Dimensionality Reduction, Mean Shift etc.

Example: Categorization of news. Given a set of news articles, we group them into a few clusters and then assign subsequent news items to the clusters already formed.
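
The clustering idea can be sketched with k-means from scikit-learn. The points below are hypothetical 2-D values chosen to form two obvious groups, and the number of clusters is picked by hand; real news articles would first need to be turned into numeric vectors, which is not shown here.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two well-separated groups.
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# No labels are supplied; only the number of clusters is chosen by hand.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# New data is assigned to the nearest cluster already formed.
new_labels = kmeans.predict(np.array([[1.1, 0.9], [8.1, 8.0]]))

print(kmeans.labels_)
print(new_labels)
```

The cluster numbers themselves are arbitrary; what matters is which points share a number.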


Decision Tree classifier for Titanic passenger data:

Let’s look at the Titanic data and try to determine whether a passenger survived, given his or her age, class, and gender. We use DecisionTreeClassifier from scikit-learn.

Training Data:

     # Load the CSV file into a pandas DataFrame
     import pandas as pd
     titanic = pd.read_csv("** path/train.csv on your machine **")

[Table: sample rows from train.csv]


Our training data consists of seven features. Survived is the target class we are trying to predict for new data. Not all features in the raw data are useful for training, and some may even harm a model’s performance. Selecting the right features for training is called Feature Selection.

# From the data we select Pclass, Age and Gender for training; Survived is the target we are trying to predict.
# This is called Feature Selection
     dataColumnsWithTarget = ['Pclass', 'Age', 'Gender', 'Survived']
# .copy() gives an independent DataFrame, so later column assignments do not trigger a SettingWithCopyWarning.
     titanicdata = titanic[dataColumnsWithTarget].copy()

Any missing values have to be filled in with appropriate values. Sometimes this is the mean or the median; if missing values can’t be filled in sensibly, those entries can be dropped altogether. For instance, Age is missing for a few entries. We will fill in those missing values with the average age.


# Check if there are any NA or NaN (unknown) values in numerical fields. If we have categorical data, we have to convert it to numerical first and then check. We will do this step with the [Gender] field.
# If the result is True we have to address the case of missing values. In our scenario, we have missing values for Age.
     import numpy as np
     print(np.isnan(titanicdata['Age']).any())
     print(np.isnan(titanicdata['Pclass']).any())
# Find the average of [Age] and fill in NaN values with it.
# pandas skips NaN values when computing the mean, so no placeholder fill is needed first.
     titanicdata['Age'] = titanicdata['Age'].fillna(titanicdata['Age'].mean())
     print(titanicdata)
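
To make the mean-fill step concrete, here is a small sketch on a hypothetical Series of ages (the values are invented, not taken from train.csv). It relies on the fact that pandas ignores NaN by default when computing a mean.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing entry.
ages = pd.Series([20.0, 30.0, np.nan, 40.0])

# Series.mean() skips NaN by default, so the mean is taken over 20, 30 and 40.
mean_age = ages.mean()

# Replace the missing entry with that mean.
filled = ages.fillna(mean_age)

print(mean_age)
print(filled.tolist())
```

Filling with zeros before averaging would instead pull the mean down, since the zeros would be counted as real ages.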

Sometimes we do not get all features in the form required by the model. Transforming the available fields to best fit the model is called Feature Extraction.

The Gender field is categorical, but DecisionTreeClassifier accepts only numerical values, so we have to convert this data to numbers. We can use the LabelEncoder or OneHotEncoder classes from scikit-learn to convert it.


    from sklearn.preprocessing import LabelEncoder
    enc = LabelEncoder()
    label_encoder = enc.fit(titanicdata['Gender'])
    titanicdata['Gender'] = label_encoder.transform(titanicdata['Gender'])
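
The sketch below shows what the encoding produces on a hypothetical column of four values. For the one-hot variant it uses pandas `get_dummies` as a simple stand-in rather than scikit-learn's OneHotEncoder; that substitution is an assumption made here for brevity.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical values for a categorical column.
gender = pd.Series(['male', 'female', 'female', 'male'])

# LabelEncoder sorts the distinct classes, then maps each one to an integer.
enc = LabelEncoder().fit(gender)
codes = enc.transform(gender)
print(list(enc.classes_))  # class order used for the mapping
print(codes.tolist())

# One-hot alternative: one indicator column per category.
one_hot = pd.get_dummies(gender)
print(list(one_hot.columns))
```

Label encoding keeps a single column, which is enough for a two-valued field; one-hot encoding avoids implying an order when a field has more than two categories.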


Apart from features, fine-tuning the model parameters is another important step that can improve model performance. It is called Model Selection.

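# We need to split the initial train.csv (training data) into 2 splits. Use one split for training the model and the other to test it. We split it 80:20 for training and testing.
# train_test_split lives in sklearn.model_selection as of scikit-learn 0.18; sklearn.cross_validation was later removed.
   from sklearn.model_selection import train_test_split
# Separate the input features from the target column.
   titanic_x = titanicdata[['Pclass', 'Age', 'Gender']]
   titanic_y = titanicdata['Survived']
# test_size is the fraction held out for testing, so 0.20 gives the 80:20 split.
   x_train, x_test, y_train, y_test = train_test_split(titanic_x, titanic_y, test_size=0.20, random_state=33)
# Training a DecisionTreeClassifier
# Selecting model parameters is called Model Selection
   from sklearn import tree
   clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=5)
# Train the model using training data (80% of train.csv)
   clf = clf.fit(x_train, y_train)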
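
To check what an 80:20 split looks like in practice, the following sketch splits ten hypothetical rows. It assumes scikit-learn 0.18 or later, where `train_test_split` is found in `sklearn.model_selection`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features, alternating labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# test_size is the fraction held out for testing, so 0.20 means 80:20.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=33)

print(len(x_train))
print(len(x_test))
```

Fixing `random_state` makes the shuffle repeatable, so the same rows land in the same split on every run.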


Now we use the 20% of the training data that was held out for testing. We can evaluate model accuracy, the confusion matrix, and a classification report that shows precision, recall and F1 score using the metrics module from scikit-learn.

# Evaluate the model on the 20% of train.csv that was held out for testing.
   from sklearn import metrics
   y_test_pred = clf.predict(x_test)
# Print accuracy by comparing the actual ['Survived'] data we have from train.csv with the prediction from the model.
   print("Test Accuracy: %s" % metrics.accuracy_score(y_test, y_test_pred))
# Print the confusion matrix by comparing actual to predicted outcomes. True positive and true negative counts should be high in the matrix.
   print("Confusion Matrix")
   print(metrics.confusion_matrix(y_test, y_test_pred))
# The classification report shows precision and recall, which are used to measure a model's performance.
   print("Classification Report:")
   print(metrics.classification_report(y_test, y_test_pred))
# To predict for test.csv, where we do not have the outcome [Survived], load the file
# and follow the same feature selection and extraction steps used on the training data before calling clf.predict.
   test_df = pd.read_csv("** path/test.csv on your local machine **")
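
To see where these numbers come from, the sketch below computes the confusion-matrix counts and the derived scores by hand in plain Python. The actual and predicted labels are hypothetical and are not output from the Titanic model.

```python
# Hypothetical actual and predicted labels (1 = survived, 0 = did not).
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / float(len(pairs))   # share of correct predictions
precision = tp / float(tp + fp)            # predicted positives that are right
recall = tp / float(tp + fn)               # actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(tp)
print(tn)
print(fp)
print(fn)
print(accuracy)
```

Here six of the eight predictions are correct, with one false positive and one false negative, so accuracy, precision, recall and F1 all come out to 0.75.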

We can iterate these steps with different combinations of features and model parameters until we get satisfactory model performance.
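
One way to organize that iteration is a loop over candidate parameter values. The sketch below varies only `max_depth`, uses synthetic data from `make_classification` as a stand-in for the Titanic features, and scores each setting with 5-fold cross-validation instead of the single 80:20 split used above. The candidate values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the Titanic dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

candidate_depths = [1, 2, 3, 5]  # arbitrary values to try
scores = {}
for depth in candidate_depths:
    clf = DecisionTreeClassifier(criterion='entropy', max_depth=depth,
                                 min_samples_leaf=5, random_state=0)
    # Mean accuracy over 5 cross-validation folds for this depth.
    scores[depth] = cross_val_score(clf, X, y, cv=5).mean()

# Keep the depth with the highest mean accuracy.
best_depth = max(scores, key=scores.get)

print(scores)
print(best_depth)
```

The same loop can be extended to other parameters or to different feature subsets, at the cost of more training runs.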

The complete code can be downloaded from the link.