Machine Learning, statistical modeling and predictive analytics have been around for a long time, but the current hype is driven by the rise of social media over the last decade. Terabytes and petabytes of digital data are created every day. This data holds hidden stories and insights, and building business features or solutions on top of them can be very valuable to the data's owners. But it is impossible for humans to read through that much data manually and derive insights. This is where machine learning comes in handy.
Ok, so what is Machine Learning?
In simple technical terms, it is training a statistical algorithm/model with historic data and then using that model to predict the outcome for new or unseen data.
Machine Learning can be categorized into two high-level groups:
- Supervised Learning
- Unsupervised Learning
In supervised learning, each entry in the dataset has an input (a set of attributes) for the model and a desired output (a target class). We train a model on these input and output pairs, and use it to predict the target class for new or unseen data.
Algorithms: Naïve Bayes, Decision Tree, Random Forest etc.
Example 1: Given a passenger's information, we try to predict whether the person would have survived the sinking of the Titanic. In this scenario, the outcome is categorical (True or False). We use classification algorithms for categorical predictions.
Example 2: Given a house's features, we try to predict the price of the house. Here the target is the price, and it is a continuous value. Regression algorithms serve these scenarios.
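To make the regression case concrete, here is a minimal sketch using scikit-learn's DecisionTreeRegressor; the house features and prices below are invented purely for illustration.

# A minimal regression sketch with made-up house data: (area in sq. ft, bedrooms) -> price.
from sklearn.tree import DecisionTreeRegressor

X = [[1400, 3], [1600, 3], [1700, 4], [1875, 4], [1100, 2], [1550, 3]]  # input features
y = [245000, 312000, 279000, 308000, 199000, 219000]                    # target: price (continuous)

model = DecisionTreeRegressor(max_depth=2)
model.fit(X, y)

# Predict the price of an unseen house.
print(model.predict([[1500, 3]]))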
With unsupervised learning, we skip the first step of curating data and manually annotating a target class (which is essential in supervised learning). Instead, data entries or elements that are similar to each other are grouped together; this process is called Clustering. Once the clusters are created, new data is classified into one of them.
Algorithms and techniques: k-means, Mean Shift, dimensionality reduction etc.
Example: Categorization of news: given a set of news articles, we group them into a few clusters, then classify subsequent news items into the groups we just formed, as sketched below.
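Here is a minimal sketch of that idea using scikit-learn's TfidfVectorizer and k-means; the headlines are made up for illustration.

# Cluster a few made-up headlines, then assign a new headline to an existing cluster.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

headlines = [
    "Stocks rally as markets close higher",
    "Central bank holds interest rates steady",
    "Local team wins championship final",
    "Star striker signs record transfer deal",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(headlines)

kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
kmeans.fit(X)
print(kmeans.labels_)  # cluster assigned to each headline

# Classify a new headline into one of the clusters formed above.
new = vectorizer.transform(["Midfielder injured ahead of derby match"])
print(kmeans.predict(new))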
Decision Tree classifier for Titanic passenger data:
Let's look at the Titanic data and try to determine whether a passenger survived given his/her age, class and gender. We use DecisionTreeClassifier from scikit-learn.
# Load the CSV file
import pandas as pd
import numpy as np

titanic = pd.read_csv("** path/train.csv on your machine **")
[Sample CSV data]
Our training data consists of seven features; Survived is the target class we are trying to predict for new data. Not all features in the raw data are useful for training, and some may even hurt model performance. Selecting the right features for training is called Feature Selection.
# From the data we select Pclass, Age and Gender for training; Survived is the target we are trying to predict.
# This is called Feature Selection.
dataColumnsWithTarget = ['Pclass', 'Age', 'Gender', 'Survived']
titanicdata = titanic[dataColumnsWithTarget]
Any missing values have to be filled in with appropriate values, such as the mean or median; if they can't be filled in sensibly, those entries can be dropped altogether. For instance, Age is missing for a few entries. We will fill in the missing values with the average age.
# Check whether there are any NA or NaN (unknown) values in the numerical fields.
# If we have categorical data, we have to convert it to numerical first and then check;
# we will do that with the [Gender] field.
# If the result is True, we have to handle the missing values. In our scenario, Age has missing values.
print(np.isnan(titanicdata['Age']).any())
print(np.isnan(titanicdata['Pclass']).any())

# Fill the NaN values in [Age] with the average age.
# pandas' mean() skips NaN values, so the average can be computed directly.
titanicdata['Age'].fillna(titanicdata['Age'].mean(), inplace=True)
print(titanicdata)
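For completeness, here is a sketch of the other two strategies mentioned above (median fill and dropping incomplete rows); you would pick one of these instead of the mean fill.

# Two alternatives (pick one instead of the mean fill above):
titanicdata['Age'].fillna(titanicdata['Age'].median(), inplace=True)  # option 1: fill with the median
titanicdata = titanicdata.dropna(subset=['Age'])                      # option 2: drop rows with a missing Age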
Sometimes the raw data does not contain all features in the form the model requires. Transforming the available fields to best fit the model is called Feature Extraction.
The Gender field is categorical, but DecisionTreeClassifier accepts only numerical values, so we have to convert this data to numbers. We can use the LabelEncoder or OneHotEncoder classes from scikit-learn for the conversion.
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
label_encoder = enc.fit(titanicdata['Gender'])
titanicdata['Gender'] = label_encoder.transform(titanicdata['Gender'])
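As an aside, here is a sketch of the one-hot alternative mentioned above; each category becomes its own 0/1 column, which avoids implying an artificial ordering between categories. (This assumes a recent scikit-learn; older versions spell the argument sparse=False.)

# One-hot encoding as an alternative to the label encoding above.
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output=False)  # older scikit-learn versions use sparse=False
gender_onehot = ohe.fit_transform(titanicdata[['Gender']])
print(ohe.categories_)  # the category behind each new 0/1 column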
Apart from the features, fine-tuning the model's parameters is another important way to improve model performance; choosing the model and its parameters is called Model Selection.
# We need to split the initial train.csv (training data) into two parts: one for training
# the model, the other for testing it. We split 80:20 for training and testing.
from sklearn.model_selection import train_test_split

titanic_x = titanicdata[['Pclass', 'Age', 'Gender']]
titanic_y = titanicdata['Survived']
x_train, x_test, y_train, y_test = train_test_split(titanic_x, titanic_y, test_size=0.20, random_state=33)

# Training a DecisionTreeClassifier.
# Selecting the model and its parameters is called Model Selection.
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=5)

# Train the model using the training data (80% of train.csv).
clf = clf.fit(x_train, y_train)
Now we use the 20% of the training data that we held out for testing. We can evaluate the model's accuracy, its confusion matrix, and a classification report showing precision, recall and f1-score using the metrics module from scikit-learn.
# Evaluate the model on the 20% split we held out from train.csv.
from sklearn import metrics

y_test_pred = clf.predict(x_test)

# Print accuracy by comparing the actual ['Survived'] values from train.csv with the model's predictions.
print("Test Accuracy:", metrics.accuracy_score(y_test, y_test_pred))

# Print the confusion matrix comparing actual to predicted outcomes.
# The TruePositive and TrueNegative counts should be high in the matrix.
print("Confusion Matrix:")
print(metrics.confusion_matrix(y_test, y_test_pred))

# The classification report shows the precision and recall used to measure a model's performance.
print("Classification Report:")
print(metrics.classification_report(y_test, y_test_pred))

# Finally, predict for test.csv, where we do not know/have the outcome [Survived].
# We follow the same feature selection and extraction steps we used on the training data.
test_df = pd.read_csv("** path/test.csv on your local machine **")
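To finish the test.csv prediction, a sketch of the remaining steps is below; it assumes test.csv has the same Pclass, Age and Gender columns as train.csv.

# Apply the same preprocessing to test.csv, then predict.
test_features = test_df[['Pclass', 'Age', 'Gender']].copy()
test_features['Age'].fillna(test_features['Age'].mean(), inplace=True)      # same missing-value handling
test_features['Gender'] = label_encoder.transform(test_features['Gender'])  # same label encoding as training
predictions = clf.predict(test_features)
print(predictions)  # predicted [Survived] values for the unseen passengers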
We can iterate over these steps with different combinations of features and model parameters until we get satisfactory model performance.
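One common way to automate that iteration is scikit-learn's GridSearchCV, sketched below; it tries every combination in a parameter grid with cross-validation and keeps the best one. The grid values here are just illustrative.

# Search over decision-tree parameters with 5-fold cross-validation.
from sklearn.model_selection import GridSearchCV
from sklearn import tree

param_grid = {
    'criterion': ['entropy', 'gini'],
    'max_depth': [3, 5, 7],
    'min_samples_leaf': [1, 5, 10],
}
search = GridSearchCV(tree.DecisionTreeClassifier(), param_grid, cv=5)
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)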
The complete code can be downloaded from the link.