Statistics for Machine Learning: Techniques for exploring supervised, unsupervised, and reinforcement learning models with Python and R
Pratap Dangeti
Key Features
- Learn about the statistics behind powerful predictive models with p-values, ANOVA, and F-statistics
- Implement statistical computations programmatically for supervised and unsupervised learning through k-means clustering
- Master the statistical aspects of Machine Learning with the help of this example-rich guide to R and Python
Book Description
Complex statistics in Machine Learning worry a lot of developers. Knowing statistics helps you build strong Machine Learning models that are optimized for a given problem statement. This book will teach you all it takes to perform the complex statistical computations required for Machine Learning. You will gain information on the statistics behind supervised learning, unsupervised learning, reinforcement learning, and more. You will understand the real-world examples that discuss the statistical side of Machine Learning and familiarize yourself with it. You will also design programs for performing tasks such as model and parameter fitting, regression, classification, density estimation, and more. By the end of the book, you will have mastered the required statistics for Machine Learning and will be able to apply your new skills to any sort of industry problem.

What you will learn
- Understand the statistical and Machine Learning fundamentals necessary to build models
- Understand the major differences and parallels between the statistical way and the Machine Learning way to solve problems
- Learn how to prepare data and feed models by using the appropriate Machine Learning algorithms from the more-than-adequate R and Python packages
- Analyze the results and tune the model appropriately to your own predictive goals
- Understand the concepts of required statistics for Machine Learning
Year: 2017
Publisher: Packt Publishing
Language: English
Pages: 442 / 438
ISBN 10: 1788295757
ISBN 13: 9781788295758
File: PDF, 16.45 MB
Statistics for Machine Learning
Build supervised, unsupervised, and reinforcement learning
models using both Python and R
Pratap Dangeti
BIRMINGHAM - MUMBAI
Statistics for Machine Learning
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or
transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the author, nor Packt Publishing, nor its
dealers and distributors will be held liable for any damages caused or alleged to be caused
directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
First published: July 2017
Production reference: 1180717
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 9781788295758
www.packtpub.com
Credits
Author: Pratap Dangeti
Reviewer: Manuel Amunategui
Commissioning Editor: Veena Pagare
Acquisition Editor: Aman Singh
Content Development Editor: Mayur Pawanikar
Technical Editor: Dinesh Pawar
Copy Editor: Safis Editing
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Graphics: Tania Dutta
Production Coordinator: Arvindkumar Gupta
About the Author
Pratap Dangeti develops machine learning and deep learning solutions for structured,
image, and text data at TCS, in the analytics and insights innovation lab in Bangalore. He has
acquired a lot of experience in both analytics and data science. He received his master's
degree from IIT Bombay in its industrial engineering and operations research program. He
is an artificial intelligence enthusiast. When not working, he likes to read about next-gen
technologies and innovative methodologies.
First and foremost, I would like to thank my mom, Lakshmi, for her support throughout
my career and in writing this book. She has been my inspiration and motivation for
continuing to improve my knowledge and helping me move ahead in my career. She is my
strongest supporter, and I dedicate this book to her. I also thank my family and friends for
their encouragement, without which it would not be possible to write this book.
I would like to thank my acquisition editor, Aman Singh, and content development editor,
Mayur Pawanikar, who chose me to write this book and encouraged me constantly
throughout the period of writing with their invaluable feedback and input.
About the Reviewer
Manuel Amunategui is vice president of data science at SpringML, a startup offering
Google Cloud TensorFlow and Salesforce enterprise solutions. Prior to that, he worked as a
quantitative developer on Wall Street for a large equity-options market-making firm and as
a software developer at Microsoft. He holds master's degrees in predictive analytics and
international administration.
He is a data science advocate, blogger/vlogger (amunategui.github.io), a trainer on
Udemy and O'Reilly Media, and a technical reviewer at Packt Publishing.
www.PacktPub.com
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and
ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a
print book customer, you are entitled to a discount on the eBook copy. Get in touch with us
at service@packtpub.com for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a
range of free newsletters and receive exclusive discounts and offers on Packt books and
eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt
books and video courses, as well as industry-leading tools to help you plan your personal
development and advance your career.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Customer Feedback
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial
process. To help us improve, please leave us an honest review on this book's Amazon page
at https://www.amazon.com/dp/1788295757.
If you'd like to join our team of regular reviewers, you can email us at
customerreviews@packtpub.com. We award our regular reviewers with free eBooks and
videos in exchange for their valuable feedback. Help us be relentless in improving our
products!
Table of Contents
Preface
Chapter 1: Journey from Statistics to Machine Learning
Statistical terminology for model building and validation
Machine learning
Major differences between statistical modeling and machine learning
Steps in machine learning model development and deployment
Statistical fundamentals and terminology for model building and validation
Bias versus variance trade-off
Train and test data
Machine learning terminology for model building and validation
Linear regression versus gradient descent
Machine learning losses
When to stop tuning machine learning models
Train, validation, and test data
Cross-validation
Grid search
Machine learning model overview
Summary
Chapter 2: Parallelism of Statistics and Machine Learning
Comparison between regression and machine learning models
Compensating factors in machine learning models
Assumptions of linear regression
Steps applied in linear regression modeling
Example of simple linear regression from first principles
Example of simple linear regression using the wine quality data
Example of multilinear regression - step-by-step methodology of model building
Backward and forward selection
Machine learning models - ridge and lasso regression
Example of ridge regression machine learning
Example of lasso regression machine learning model
Regularization parameters in linear regression and ridge/lasso regression
Summary
Chapter 3: Logistic Regression Versus Random Forest
Maximum likelihood estimation
Logistic regression - introduction and advantages
Terminology involved in logistic regression
Applying steps in logistic regression modeling
Example of logistic regression using German credit data
Random forest
Example of random forest using German credit data
Grid search on random forest
Variable importance plot
Comparison of logistic regression with random forest
Summary
Chapter 4: Tree-Based Machine Learning Models
Introducing decision tree classifiers
Terminology used in decision trees
Decision tree working methodology from first principles
Comparison between logistic regression and decision trees
Comparison of error components across various styles of models
Remedial actions to push the model towards the ideal region
HR attrition data example
Decision tree classifier
Tuning class weights in decision tree classifier
Bagging classifier
Random forest classifier
Random forest classifier - grid search
AdaBoost classifier
Gradient boosting classifier
Comparison between AdaBoost and gradient boosting
Extreme gradient boosting - XGBoost classifier
Ensemble of ensembles - model stacking
Ensemble of ensembles with different types of classifiers
Ensemble of ensembles with bootstrap samples using a single type of classifier
Summary
Chapter 5: K-Nearest Neighbors and Naive Bayes
K-nearest neighbors
KNN voter example
Curse of dimensionality
Curse of dimensionality with 1D, 2D, and 3D example
KNN classifier with breast cancer Wisconsin data example
Tuning of k-value in KNN classifier
Naive Bayes
Probability fundamentals
Joint probability
Understanding Bayes theorem with conditional probability
Naive Bayes classification
Laplace estimator
Naive Bayes SMS spam classification example
Summary
Chapter 6: Support Vector Machines and Neural Networks
Support vector machines working principles
Maximum margin classifier
Support vector classifier
Support vector machines
Kernel functions
SVM multi-label classifier with letter recognition data example
Maximum margin classifier - linear kernel
Polynomial kernel
RBF kernel
Artificial neural networks - ANN
Activation functions
Forward propagation and backpropagation
Optimization of neural networks
Stochastic gradient descent - SGD
Momentum
Nesterov accelerated gradient - NAG
Adagrad
Adadelta
RMSprop
Adaptive moment estimation - Adam
Limited-memory Broyden-Fletcher-Goldfarb-Shanno - L-BFGS optimization algorithm
Dropout in neural networks
ANN classifier applied on handwritten digits using scikit-learn
Introduction to deep learning
Solving methodology
Deep learning software
Deep neural network classifier applied on handwritten digits using Keras
Summary
Chapter 7: Recommendation Engines
Content-based filtering
Cosine similarity
Collaborative filtering
Advantages of collaborative filtering over content-based filtering
Matrix factorization using the alternating least squares algorithm for collaborative filtering
Evaluation of recommendation engine model
Hyperparameter selection in recommendation engines using grid search
Recommendation engine application on movie lens data
User-user similarity matrix
Movie-movie similarity matrix
Collaborative filtering using ALS
Grid search on collaborative filtering
Summary
Chapter 8: Unsupervised Learning
K-means clustering
K-means working methodology from first principles
Optimal number of clusters and cluster evaluation
The elbow method
K-means clustering with the iris data example
Principal component analysis - PCA
PCA working methodology from first principles
PCA applied on handwritten digits using scikit-learn
Singular value decomposition - SVD
SVD applied on handwritten digits using scikit-learn
Deep auto encoders
Model building technique using encoder-decoder architecture
Deep auto encoders applied on handwritten digits using Keras
Summary
Chapter 9: Reinforcement Learning
Introduction to reinforcement learning
Comparing supervised, unsupervised, and reinforcement learning in detail
Characteristics of reinforcement learning
Reinforcement learning basics
Category 1 - value based
Category 2 - policy based
Category 3 - actor-critic
Category 4 - model-free
Category 5 - model-based
Fundamental categories in sequential decision making
Markov decision processes and Bellman equations
Dynamic programming
Algorithms to compute optimal policy using dynamic programming
Grid world example using value and policy iteration algorithms with basic Python
Monte Carlo methods
Comparison between dynamic programming and Monte Carlo methods
Key advantages of MC over DP methods
Monte Carlo prediction
The suitability of Monte Carlo prediction on grid-world problems
Modeling Blackjack example of Monte Carlo methods using Python
Temporal difference learning
Comparison between Monte Carlo methods and temporal difference learning
TD prediction
Driving office example for TD learning
SARSA on-policy TD control
Q-learning - off-policy TD control
Cliff walking example of on-policy and off-policy of TD control
Applications of reinforcement learning with integration of machine learning and deep learning
Automotive vehicle control - self-driving cars
Google DeepMind's AlphaGo
Robo soccer
Further reading
Summary
Index
Preface
Complex statistics in machine learning worry a lot of developers. Knowing statistics helps
you build strong machine learning models that are optimized for a given problem
statement. I believe that any machine learning practitioner should be proficient in statistics
as well as in mathematics, so that they can speculate on and solve any machine learning
problem in an efficient manner. In this book, we will cover the fundamentals of statistics
and machine learning, giving you a holistic view of the application of machine learning
techniques to relevant problems. We will discuss the application of frequently used
algorithms to various domain problems, using both Python and R programming. We will
use libraries such as scikit-learn, e1071, randomForest, C50, xgboost, and so on. We
will also go over the fundamentals of deep learning with the help of the Keras software.
Furthermore, we will have an overview of reinforcement learning with the pure Python
programming language.
The book is motivated by the following goals:
To help newbies get up to speed with various fundamentals, whilst also allowing
experienced professionals to refresh their knowledge on various concepts and to
have more clarity when applying algorithms on their chosen data.
To give a holistic view of both Python and R, this book will take you through
various examples using both languages.
To provide an introduction to new trends in machine learning, fundamentals of
deep learning and reinforcement learning are covered with suitable examples to
teach you state-of-the-art techniques.
What this book covers
Chapter 1, Journey from Statistics to Machine Learning, introduces you to all the necessary
fundamentals and basic building blocks of both statistics and machine learning. All
fundamentals are explained with the support of both Python and R code examples across
the chapter.
Chapter 2, Parallelism of Statistics and Machine Learning, compares the differences and draws
parallels between statistical modeling and machine learning using linear regression and
lasso/ridge regression examples.
Chapter 3, Logistic Regression Versus Random Forest, describes the comparison between
logistic regression and random forest using a classification example, explaining the detailed
steps in both modeling processes. By the end of this chapter, you will have a complete
picture of both the streams of statistics and machine learning.
Chapter 4, Tree-Based Machine Learning Models, focuses on the various tree-based machine
learning models used by industry practitioners, including decision trees, bagging, random
forest, AdaBoost, gradient boosting, and XGBoost, with the HR attrition example in both
languages.
Chapter 5, K-Nearest Neighbors and Naive Bayes, illustrates simple methods of machine
learning. K-nearest neighbors is explained using breast cancer data. The Naive Bayes model
is explained with a message classification example using various NLP preprocessing
techniques.
Chapter 6, Support Vector Machines and Neural Networks, describes the various
functionalities involved in support vector machines and the usage of kernels. It then
provides an introduction to neural networks. Fundamentals of deep learning are
exhaustively covered in this chapter.
Chapter 7, Recommendation Engines, shows us how to find similar movies based on similar
users, using the user-user similarity matrix. In the second section, recommendations are
made based on the movie-movie similarity matrix, in which similar movies are extracted
using cosine similarity. Finally, the collaborative filtering technique, which considers both
users and movies to determine recommendations, is applied, utilizing the alternating least
squares methodology.
Chapter 8, Unsupervised Learning, presents various techniques such as k-means clustering,
principal component analysis, singular value decomposition, and deep learning based deep
auto encoders. At the end is an explanation of why deep auto encoders are much more
powerful than conventional PCA techniques.
Chapter 9, Reinforcement Learning, provides exhaustive techniques that learn the optimal
path to reach a goal over episodic states, such as the Markov decision process, dynamic
programming, Monte Carlo methods, and temporal difference learning. Finally, some use
cases of notable applications combining machine learning and reinforcement learning are
provided.
What you need for this book
This book assumes that you know the basics of Python and R and how to install the
libraries. It does not assume that you are already equipped with the knowledge of advanced
statistics and mathematics, like linear algebra and so on.
The following versions of software are used throughout this book, but it should run fine
with any more recent ones as well:
Anaconda 3–4.3.1 (all Python and its relevant packages are included in
Anaconda: Python 3.6.1, NumPy 1.12.1, Pandas 0.19.2, and scikit-learn 0.18.1)
R 3.4.0 and RStudio 1.0.143
Theano 0.9.0
Keras 2.0.2
Who this book is for
This book is intended for developers with little to no background in statistics who want to
implement machine learning in their systems. Some programming knowledge in R or
Python will be useful.
Conventions
In this book, you will find a number of text styles that distinguish between different kinds
of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions,
pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The mode
function is not implemented in the numpy package." Any command-line input or output
is written as follows:
>>> import numpy as np
>>> from scipy import stats
>>> data = np.array([4,5,1,2,7,2,6,9,3])
# Calculate mean
>>> dt_mean = np.mean(data)
>>> print("Mean :", round(dt_mean, 2))
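Since the text notes that NumPy lacks a mode function, the mode is usually taken from SciPy instead. The following is a small illustrative sketch of ours (not from the book), reusing the same data array; the `keepdims=False` argument assumes SciPy 1.9 or later:

```python
import numpy as np
from scipy import stats

data = np.array([4, 5, 1, 2, 7, 2, 6, 9, 3])

# scipy.stats.mode returns the most frequent value and its count;
# keepdims=False (SciPy >= 1.9) yields plain scalars rather than arrays.
dt_mode = stats.mode(data, keepdims=False)
print("Mode :", int(dt_mode.mode), "Count :", int(dt_mode.count))
```

Here the value 2 occurs twice, so it is reported as the mode with a count of 2.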
New terms and important words are shown in bold.
Warnings or important notes appear like this.
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you thought about this
book, what you liked or disliked. Reader feedback is important for us as it helps us to
develop titles that you will really get the most out of. To send us general feedback, simply
email feedback@packtpub.com, and mention the book's title in the subject of your
message. If there is a topic that you have expertise in and you are interested in either
writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you
to get the most from your purchase.
Downloading the example code
You can download the example code files for this book from your account at
http://www.packtpub.com. If you purchased this book elsewhere, you can visit
http://www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register to our website using your email address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the dropdown menu where you purchased this book from.
7. Click on Code Download.
Once the file is downloaded, please make sure that you unzip or extract the folder using the
latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at
https://github.com/PacktPublishing/Statistics-for-Machine-Learning. We also have other
code bundles from our rich catalog of books and videos available at
https://github.com/PacktPublishing/. Check them out!
Downloading the color images of this book
We also provide you with a PDF file that has color images of the screenshots/diagrams used
in this book. The color images will help you better understand the changes in given outputs.
You can download this file from
https://www.packtpub.com/sites/default/files/downloads/StatisticsforMachineLearning_ColorImages.pdf.
Errata
Although we have taken care to ensure the accuracy of our content, mistakes do happen. If
you find a mistake in one of our books, maybe a mistake in the text or the code, we would be
grateful if you could report this to us. By doing so, you can save other readers from
frustration and help us to improve subsequent versions of this book. If you find any errata,
please report them by visiting http://www.packtpub.com/submit-errata, selecting your
book, clicking on the Errata Submission Form link, and entering the details of your errata.
Once your errata are verified, your submission will be accepted and the errata will be
uploaded to our website or added to any list of existing errata under the Errata section of
that title. To view the previously submitted errata, go to
https://www.packtpub.com/books/content/support and enter the name of the book in the
search field. The required information will appear under the Errata section.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At
Packt, we take the protection of our copyright and licenses very seriously. If you come
across any illegal copies of our works in any form on the Internet, please provide us with
the location address or website name immediately. Please contact us at
copyright@packtpub.com with a link to the suspected pirated material. We appreciate
your help in protecting our authors and our ability to bring you valuable content.
Questions
If you have a problem with any aspects of this book, you can contact us at
questions@packtpub.com, and we will do our best to address it.
1
Journey from Statistics to
Machine Learning
In recent times, machine learning (ML) and data science have gained popularity like never
before. This field is expected to grow exponentially in the coming years. First of all, what is
machine learning? And why does someone need to take pains to understand the principles?
Well, we have the answers for you. One simple example is book recommendations on
e-commerce websites: when someone searches for a particular book, the site also
recommends other products that were frequently bought together, giving users an idea of
what else they might like. Sounds like magic, right? In fact, utilizing machine learning, one
can achieve much more than this.
Machine learning is a branch of study in which a model can learn automatically from
experience, based on data, without being explicitly modeled as in statistical models.
Over a period of time and with more data, model predictions will become better.
In this first chapter, we will introduce the basic concepts necessary to understand both
statistical and machine learning terminology, in order to create a foundation for
understanding the similarity between both streams, whether you are a full-time statistician
or a software engineer who implements machine learning but would like to understand the
statistical workings behind the ML methods. We will quickly cover the fundamentals
necessary for understanding the building blocks of models.
In this chapter, we will cover the following:
Statistical terminology for model building and validation
Machine learning terminology for model building and validation
Machine learning model overview
Statistical terminology for model building
and validation
Statistics is the branch of mathematics dealing with the collection, analysis, interpretation,
presentation, and organization of numerical data.
Statistics is mainly classified into two sub-branches:
Descriptive statistics: These are used to summarize data, such as the mean and
standard deviation for continuous data types (such as age), whereas frequency
and percentage are useful for categorical data (such as gender).
Inferential statistics: Many times, a collection of the entire data (also known as
population in statistical methodology) is impossible, hence a subset of the data
points is collected, also called a sample, and conclusions about the entire
population will be drawn, which is known as inferential statistics. Inferences are
drawn using hypothesis testing, the estimation of numerical characteristics, the
correlation of relationships within data, and so on.
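As a quick illustration of the two sub-branches, consider the following sketch of ours (the age values are made up): descriptive statistics summarize the sample itself, while an inferential test, here a one-sample t-test, draws a conclusion about the population mean:

```python
import numpy as np
from scipy import stats

# A small sample of ages drawn from a larger population (illustrative values).
age = np.array([23, 25, 31, 35, 29, 41, 28, 33])

# Descriptive statistics: summarize the sample itself.
print("Mean:", round(np.mean(age), 2))
print("Std :", round(np.std(age, ddof=1), 2))  # sample standard deviation

# Inferential statistics: test the hypothesis that the population mean is 30.
t_stat, p_value = stats.ttest_1samp(age, popmean=30)
print("t =", round(t_stat, 2), ", p =", round(p_value, 3))
```

A large p-value here means the sample provides no evidence against a population mean of 30, which is how conclusions about the whole population are drawn from a subset of data points.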
Statistical modeling is applying statistics on data to find underlying hidden relationships by
analyzing the significance of the variables.
Machine learning
Machine learning is the branch of computer science that utilizes past experience to learn
from and use its knowledge to make future decisions. Machine learning is at the
intersection of computer science, engineering, and statistics. The goal of machine learning is
to generalize a detectable pattern or to create an unknown rule from given examples. An
overview of the machine learning landscape is as follows:
Machine learning is broadly classified into three categories; nonetheless, based on the
situation, these categories can be combined to achieve the desired results for particular
applications:
Supervised learning: This is teaching machines to learn the relationship between
other variables and a target variable, similar to the way in which a teacher
provides feedback to students on their performance. The major segments within
supervised learning are as follows:
Classification problem
Regression problem
Unsupervised learning: In unsupervised learning, algorithms learn by
themselves without any supervision or without any target variable provided. It is
a question of finding hidden patterns and relations in the given data. The
categories in unsupervised learning are as follows:
Dimensionality reduction
Clustering
Reinforcement learning: This allows the machine or agent to learn its behavior
based on feedback from the environment. In reinforcement learning, the agent
takes a series of decisive actions without supervision and, in the end, a reward
will be given, either +1 or -1. Based on the final payoff/reward, the agent
reevaluates its paths. Reinforcement learning problems are closer to the artificial
intelligence methodology than to frequently used machine learning
algorithms.
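The contrast between the first two categories can be seen on a toy dataset. The following sketch is our own illustration (it assumes scikit-learn is installed; the data and model choices are arbitrary): a supervised classifier is fitted with labels, while an unsupervised clusterer finds structure without them:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Two well-separated groups of one-dimensional points.
X = np.array([[1.0], [1.2], [0.9], [5.0], [5.2], [4.8]])
y = np.array([0, 0, 0, 1, 1, 1])  # target labels are available: supervised

# Supervised: learn the relationship between X and the target y.
clf = LogisticRegression().fit(X, y)

# Unsupervised: no target variable; find hidden groups in X alone.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(clf.predict([[1.1], [5.1]]))  # classifies new points using learned labels
print(km.labels_)                   # cluster assignments discovered from X only
```

Note that the clusterer recovers the same two groups, but its labels are arbitrary, since it never saw `y`.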
In some cases, when the number of variables is very high, we initially perform
unsupervised learning to reduce the dimensions, followed by supervised learning.
Similarly, in some artificial intelligence applications, supervised learning combined with
reinforcement learning could be utilized for solving a problem; an example is self-driving
cars in which, initially, images are converted to some numeric format using supervised
learning and combined with driving actions (left, forward, right, and backward).
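The "unsupervised first, then supervised" idea can be sketched as a pipeline. This example is our own illustration (it assumes scikit-learn and its bundled digits dataset): PCA reduces the 64 pixel features to 16 components before a classifier is trained on the reduced data:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)  # 8x8 images -> 64 features per sample

# Unsupervised step: reduce 64 dimensions to 16 principal components,
# then supervised step: classify digits on the reduced representation.
pipe = make_pipeline(PCA(n_components=16),
                     LogisticRegression(max_iter=1000))
pipe.fit(X, y)
print("Training accuracy:", round(pipe.score(X, y), 2))
```

Chaining the two stages in one pipeline ensures the same PCA projection learned on the training data is reused for any new samples.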
Major differences between statistical modeling
and machine learning
Though there are inherent similarities between statistical modeling and machine learning
methodologies, the parallels are sometimes not obviously apparent to many practitioners.
In the following table, we explain the differences succinctly to show the ways in which both
streams are similar and how they differ:
| Statistical modeling | Machine learning |
| --- | --- |
| Formalization of relationships between variables in the form of mathematical equations. | Algorithm that can learn from the data without relying on rule-based programming. |
| Required to assume the shape of the model curve prior to performing model fitting on the data (for example, linear, polynomial, and so on). | Does not need to assume the underlying shape, as machine learning algorithms can learn complex patterns automatically based on the provided data. |
| A statistical model predicts the output with 85 percent accuracy and with 90 percent confidence about it. | Machine learning just predicts the output with 85 percent accuracy. |
| In statistical modeling, various diagnostics of parameters are performed, like p-values, and so on. | Machine learning models do not perform any statistical diagnostic significance tests. |
| Data will be split into 70 percent - 30 percent to create training and testing data. The model is developed on training data and tested on testing data. | Data will be split into 50 percent - 25 percent - 25 percent to create training, validation, and testing data. Models are developed on training data, hyperparameters are tuned on validation data, and finally evaluated against test data. |
| Statistical models can be developed on a single dataset, called training data, as diagnostics are performed at both the overall accuracy and individual variable level. | Due to the lack of diagnostics on variables, machine learning algorithms need to be trained on two datasets, called training and validation data, to ensure two-point validation. |
| Statistical modeling is mostly used for research purposes. | Machine learning is very apt for implementation in a production environment. |
| From the school of statistics and mathematics. | From the school of computer science. |
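The 50/25/25 train/validation/test split described above can be sketched with two successive splits. This is our own illustration (it assumes scikit-learn; the toy arrays are arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)  # toy features
y = np.arange(100)                  # toy target

# First split: carve off 50% for training.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.5, random_state=42)

# Second split: divide the remainder evenly, giving 25% validation
# and 25% test of the original data.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 50 25 25
```

The validation set is used only for hyperparameter tuning; the test set stays untouched until the final evaluation.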
Steps in machine learning model development
and deployment
The development and deployment of machine learning models involves a series of steps
broadly similar to the statistical modeling process, in order to develop, validate, and
implement machine learning models. The steps are as follows:
1. Collection of data: Data for machine learning is collected directly from structured source data, web scraping, APIs, chat interactions, and so on, as machine learning can work on both structured and unstructured data (voice, image, and text).
2. Data preparation and missing/outlier treatment: Data is to be formatted as per
the chosen machine learning algorithm; also, missing value treatment needs to be
performed by replacing missing and outlier values with the mean/median, and so
on.
3. Data analysis and feature engineering: Data needs to be analyzed in order to
find any hidden patterns and relations between variables, and so on. Correct
feature engineering with appropriate business knowledge will solve 70 percent of
the problems. Also, in practice, 70 percent of the data scientist's time is spent on
feature engineering tasks.
4. Train the algorithm on training and validation data: Post feature engineering, the data will be divided into three chunks (train, validation, and test data) rather than the two (train and test) used in statistical modeling. Machine learning algorithms are applied to the training data, and the hyperparameters of the model are tuned based on the validation data to avoid overfitting.
5. Test the algorithm on test data: Once the model has shown a good enough
performance on train and validation data, its performance will be checked against
unseen test data. If the performance is still good enough, we can proceed to the
next and final step.
6. Deploy the algorithm: Trained machine learning algorithms will be deployed on
live streaming data to classify the outcomes. One example could be recommender
systems implemented by ecommerce websites.
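As a sketch of steps 4 and 5 above, the following snippet splits data into train/validation/test chunks, tunes a hyperparameter on the validation set, and evaluates once on unseen test data. The synthetic dataset and the `max_depth` grid are illustrative assumptions, not the book's data:

```python
# Sketch of steps 4 and 5: train/validation/test split with hyperparameter
# tuning on validation data; the synthetic dataset and depth grid below are
# illustrative assumptions, not the book's data
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(42)
X = rng.rand(1000, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# 50/25/25 split: carve out train first, then halve the remainder
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.5, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, train_size=0.5, random_state=42)

# Step 4: tune the max_depth hyperparameter against validation data
best_depth, best_acc = None, -1.0
for depth in (2, 4, 6, 8):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# Step 5: a single, final evaluation on unseen test data
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
final_model.fit(X_train, y_train)
test_acc = accuracy_score(y_test, final_model.predict(X_test))
print("best depth:", best_depth, "test accuracy:", round(test_acc, 3))
```

The key design point is that the test chunk is touched exactly once, after tuning, so the reported accuracy is not biased by the hyperparameter search.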
Statistical fundamentals and terminology for
model building and validation
Statistics itself is a vast subject on which a complete book could be written; however, here
the attempt is to focus on key concepts that are very much necessary with respect to the
machine learning perspective. In this section, a few fundamentals are covered and the
remaining concepts will be covered in later chapters wherever it is necessary to understand
the statistical equivalents of machine learning.
Predictive analytics depends on one major assumption: that history repeats itself!
By fitting a predictive model on historical data after validating key measures, the same
model will be utilized for predicting future events based on the same explanatory variables
that were significant on past data.
The first movers among statistical model implementers were the banking and pharmaceutical industries; over time, analytics expanded to other industries as well.
Statistical models are a class of mathematical models that are usually specified by
mathematical equations that relate one or more variables to approximate reality.
Assumptions embodied by statistical models describe a set of probability distributions, which distinguishes them from non-statistical, mathematical, or machine learning models. Statistical models always start with some underlying assumptions, which all the variables should hold; only then is the performance provided by the model statistically significant. Hence, knowing the various bits and pieces involved in all the building blocks provides a strong foundation for being a successful statistician.
In the following section, we have described various fundamentals with relevant codes:
Population: This is the totality, the complete list of observations, or all the data
points about the subject under study.
Sample: A sample is a subset of a population, usually a small portion of the
population that is being analyzed.
Usually, it is expensive to perform an analysis on an entire population;
hence, most statistical methods are about drawing conclusions about a
population by analyzing a sample.
Parameter versus statistic: Any measure that is calculated on the population is a
parameter, whereas on a sample it is called a statistic.
Mean: This is a simple arithmetic average, which is computed by taking the
aggregated sum of values divided by a count of those values. The mean is
sensitive to outliers in the data. An outlier is the value of a set or column that is
highly deviant from the many other values in the same data; it usually has very
high or low values.
Median: This is the midpoint of the data, calculated after arranging it in either ascending or descending order. If there are N observations, the median is the middle value, or the average of the two middle values when N is even.
Mode: This is the most repetitive data point in the data:
The Python code for the calculation of mean, median, and mode using a
numpy array and the stats package is as follows:
>>> import numpy as np
>>> from scipy import stats
>>> data = np.array([4,5,1,2,7,2,6,9,3])
# Calculate Mean
>>> dt_mean = np.mean(data) ; print ("Mean :",round(dt_mean,2))
# Calculate Median
>>> dt_median = np.median(data) ; print ("Median :",dt_median)
# Calculate Mode
>>> dt_mode = stats.mode(data); print ("Mode :",dt_mode[0][0])
The output of the preceding code is as follows:
We have used a NumPy array instead of a basic list as the data structure; the reason is that the scikit-learn package is built on top of NumPy arrays, and all its statistical models and machine learning algorithms operate on NumPy arrays. The mode function is not implemented in the numpy package, hence we have used SciPy's stats package. SciPy is also built on top of NumPy arrays.
The R code for descriptive statistics (mean, median, and mode) is given as
follows:
data <- c(4,5,1,2,7,2,6,9,3)
dt_mean = mean(data); print(round(dt_mean,2))
dt_median = median(data); print(dt_median)
func_mode <- function(input_dt) {
  unq <- unique(input_dt)
  unq[which.max(tabulate(match(input_dt,unq)))]
}
dt_mode = func_mode(data); print(dt_mode)
We have used the default stats package for R; however, the mode function is not built-in, hence we have written custom code for calculating the mode.
Measure of variation: Dispersion is the variation in the data, and measures the
inconsistencies in the value of variables in the data. Dispersion actually provides
an idea about the spread rather than central values.
Range: This is the difference between the maximum and minimum values.
Variance: This is the mean of the squared deviations from the mean (xi = data points, µ = mean of the data, N = number of data points). The dimension of variance is the square of the actual values. The reason to use the denominator N-1 for a sample instead of N for the population is the degrees of freedom: one degree of freedom is lost when calculating the sample variance, because the sample mean is substituted for the unknown population mean:
Standard deviation: This is the square root of variance. By applying the square root to the variance, we measure the dispersion with respect to the original variable rather than the square of its dimension:
Quantiles: These are simply identical fragments of the data. Quantiles cover
percentiles, deciles, quartiles, and so on. These measures are calculated after
arranging the data in ascending order:
Percentile: This is nothing but the percentage of data points below
the value of the original whole data. The median is the 50th
percentile, as the number of data points below the median is about
50 percent of the data.
Decile: Deciles divide the data into ten equal parts; the first decile is the 10th percentile, which means the number of data points below it is 10 percent of the whole data.
Quartile: This is one-fourth of the data; the first quartile is the 25th percentile, or 25 percent of the data, the second quartile is 50 percent of the data, and the third quartile is 75 percent of the data. The second quartile is also known as the median, the 50th percentile, or the 5th decile.
Interquartile range: This is the difference between the third
quartile and first quartile. It is effective in identifying outliers in
data. The interquartile range describes the middle 50 percent of the
data points.
The Python code is as follows:
>>> from statistics import variance, stdev
>>> game_points =
np.array([35,56,43,59,63,79,35,41,64,43,93,60,77,24,82])
# Calculate Variance
>>> dt_var = variance(game_points) ; print ("Sample variance:",
round(dt_var,2))
# Calculate Standard Deviation
>>> dt_std = stdev(game_points) ; print ("Sample std.dev:",
round(dt_std,2))
# Calculate Range
>>> dt_rng = np.max(game_points,axis=0) - np.min(game_points,axis=0) ; print ("Range:",dt_rng)
#Calculate percentiles
>>> print ("Quantiles:")
>>> for val in [20,80,100]:
...     dt_qntls = np.percentile(game_points,val)
...     print (str(val)+"%",dt_qntls)
# Calculate IQR
>>> q75, q25 = np.percentile(game_points, [75 ,25]); print ("Inter quartile range:",q75-q25)
The output of the preceding code is as follows:
The R code for dispersion (variance, standard deviation, range, quantiles, and
IQR) is as follows:
game_points <- c(35,56,43,59,63,79,35,41,64,43,93,60,77,24,82)
dt_var = var(game_points); print(round(dt_var,2))
dt_std = sd(game_points); print(round(dt_std,2))
range_val <- function(x) return(diff(range(x)))
dt_range = range_val(game_points); print(dt_range)
dt_quantile = quantile(game_points,probs = c(0.2,0.8,1.0));
print(dt_quantile)
dt_iqr = IQR(game_points); print(dt_iqr)
Hypothesis testing: This is the process of making inferences about the overall
population by conducting some statistical tests on a sample. Null and alternate
hypotheses are ways to validate whether an assumption is statistically significant
or not.
P-value: This is the probability of obtaining a test statistic result at least as extreme as the one that was actually observed, assuming that the null hypothesis is true (usually in modeling, against each independent variable, a p-value less than 0.05 is considered significant and greater than 0.05 is considered insignificant; nonetheless, these values and definitions may change with respect to context).
The steps involved in hypothesis testing are as follows:
1. Assume a null hypothesis (usually no difference, no significance, and
so on; a null hypothesis always tries to assume that there is no anomaly
pattern and is always homogeneous, and so on).
2. Collect the sample.
3. Calculate test statistics from the sample in order to verify whether the
hypothesis is statistically significant or not.
4. Decide either to accept or reject the null hypothesis based on the test
statistic.
Example of hypothesis testing: A chocolate manufacturer who is also your friend claims that all chocolates produced from his factory weigh at least 1,000 g, and you have a funny feeling that this might not be true; so you both collect a sample of 30 chocolates and find that the average chocolate weight is 990 g with a sample standard deviation of 12.5 g. Given the 0.05 significance level, can we reject the claim made by your friend?
The null hypothesis is that µ0 ≥ 1,000 (all chocolates weigh at least 1,000 g).
Collected sample:
Calculate the test statistic:

t = (990 - 1000) / (12.5/sqrt(30)) = -4.3818

Critical t value from the t tables: t(0.05, df = 29) = 1.699, so -t(0.05, 29) = -1.699

P-value = 7.03e-05

The test statistic is -4.3818, which is less than the critical value of -1.699. Hence, we can reject the null hypothesis (your friend's claim) that the mean weight of a chocolate is at least 1,000 g.
Also, another way of deciding the claim is by using the p-value. A p-value less than 0.05 means the claimed value and the sample mean are significantly different, hence we can reject the null hypothesis:
The Python code is as follows:
>>> from scipy import stats
>>> xbar = 990; mu0 = 1000; s = 12.5; n = 30
# Test Statistic
>>> t_smple = (xbar-mu0)/(s/np.sqrt(float(n))); print ("Test Statistic:",round(t_smple,2))
# Critical value from t-table
>>> alpha = 0.05
>>> t_alpha = stats.t.ppf(alpha,n-1); print ("Critical value from t-table:",round(t_alpha,3))
# Lower tail p-value from t-table
>>> p_val = stats.t.sf(np.abs(t_smple), n-1); print ("Lower tail p-value from t-table", p_val)
The R code for Tdistribution is as follows:
xbar = 990; mu0 = 1000; s = 12.5; n = 30
t_smple = (xbar - mu0)/(s/sqrt(n)); print(round(t_smple,2))
alpha = 0.05
t_alpha = qt(alpha, df = n-1); print(round(t_alpha,3))
p_val = pt(t_smple, df = n-1); print(p_val)
Type I and II error: Hypothesis testing is usually done on samples rather than the entire population, due to the practical constraints of the resources available to collect all the data. However, performing inferences about the population from samples comes with its own costs, such as rejecting good results or accepting false results; increasing the sample size helps minimize both type I and type II errors:
Type I error: Rejecting a null hypothesis when it is true
Type II error: Accepting a null hypothesis when it is false
Normal distribution: This is very important in statistics because of the central limit theorem, which states that the distribution of the means of all possible samples of size n drawn from a population with mean μ and variance σ2 approaches a normal distribution as n increases:
Example: Assume that the test scores of an entrance exam fit a normal
distribution. Furthermore, the mean test score is 52 and the standard
deviation is 16.3. What is the percentage of students scoring 67 or more in the
exam?
The Python code is as follows:
>>> from scipy import stats
>>> xbar = 67; mu0 = 52; s = 16.3
# Calculating z-score
>>> z = (67-52)/16.3
# Calculating probability under the curve
>>> p_val = 1 - stats.norm.cdf(z)
>>> print ("Prob. to score more than 67 is",round(p_val*100,2),"%")
The R code for normal distribution is as follows:
xbar = 67; mu0 = 52; s = 16.3
pr = 1 - pnorm(67, mean=52, sd=16.3)
print(paste("Prob. to score more than 67 is",round(pr*100,2),"%"))
Chisquare: This test of independence is one of the most basic and common
hypothesis tests in the statistical analysis of categorical data. Given two
categorical random variables X and Y, the chisquare test of independence
determines whether or not there exists a statistical dependence between them.
The test is usually performed by calculating χ2 from the data and comparing it with the critical χ2 value with (m-1)*(n-1) degrees of freedom from the table. If the calculated value is higher than the table value, we conclude that the two variables are dependent:
Example: In the following table, calculate whether the smoking habit has an
impact on exercise behavior:
The Python code is as follows:
>>> import pandas as pd
>>> from scipy import stats
>>> survey = pd.read_csv("survey.csv")
# Tabulating 2 variables with row & column variables respectively
>>> survey_tab = pd.crosstab(survey.Smoke, survey.Exer, margins = True)
While creating a table using the crosstab function, we also obtain the row and column total fields. However, in order to create the observed table, we need to extract the variables part and ignore the totals:
# Creating observed table for analysis (.ix is deprecated in recent
# pandas, so .iloc is used here instead)
>>> observed = survey_tab.iloc[0:4,0:3]
The chi2_contingency function in the stats package takes the observed table, calculates its expected table, and then computes the p-value in order to check whether the two variables are dependent or not. If the p-value is < 0.05, there is a strong dependency between the two variables, whereas if the p-value is > 0.05, there is no dependency between the variables:
>>> contg = stats.chi2_contingency(observed= observed)
>>> p_value = round(contg[1],3)
>>> print ("Pvalue is: ",p_value)
The p-value is 0.483, which means there is no dependency between the smoking habit and exercise behavior.
The R code for chisquare is as follows:
survey = read.csv("survey.csv",header=TRUE)
tbl = table(survey$Smoke,survey$Exer)
p_val = chisq.test(tbl)
ANOVA: Analysis of variance tests the hypothesis that the means of two or more
populations are equal. ANOVAs assess the importance of one or more factors by
comparing the response variable means at the different factor levels. The null
hypothesis states that all population means are equal while the alternative
hypothesis states that at least one is different.
Example: A fertilizer company developed three new types of universal
fertilizers after research that can be utilized to grow any type of crop. In
order to find out whether all three have a similar crop yield, they randomly
chose six crop types in the study. In accordance with the randomized block
design, each crop type will be tested with all three types of fertilizer
separately. The following table represents the yield in g/m2. At the 0.05 level
of significance, test whether the mean yields for the three new types of
fertilizers are all equal:
Fertilizer 1   Fertilizer 2   Fertilizer 3
62             54             48
62             56             62
90             58             92
42             36             96
84             72             92
64             34             80
The Python code is as follows:
>>> import pandas as pd
>>> from scipy import stats
>>> fetilizers = pd.read_csv("fetilizers.csv")
Calculating one-way ANOVA using the stats package:
>>> one_way_anova = stats.f_oneway(fetilizers["fertilizer1"],
fetilizers["fertilizer2"], fetilizers["fertilizer3"])
>>> print ("Statistic :", round(one_way_anova[0],2),", pvalue
:",round(one_way_anova[1],3))
Result: The p-value came out as less than 0.05, hence we can reject the null hypothesis that the mean crop yields of the fertilizers are equal. The fertilizers make a significant difference to the crops.
The R code for ANOVA is as follows:
fetilizers = read.csv("fetilizers.csv",header=TRUE)
r = c(t(as.matrix(fetilizers)))
f = c("fertilizer1","fertilizer2","fertilizer3")
k = 3; n = 6
tm = gl(k,1,n*k,factor(f))
blk = gl(n,k,k*n)
av = aov(r ~ tm + blk)
smry = summary(av)
Confusion matrix: This is the matrix of the actual versus the predicted. This
concept is better explained with the example of cancer prediction using the
model:
Some terms used in a confusion matrix are:
True positives (TPs): True positives are cases when we predict the
disease as yes when the patient actually does have the disease.
True negatives (TNs): Cases when we predict the disease as no
when the patient actually does not have the disease.
False positives (FPs): When we predict the disease as yes when the
patient actually does not have the disease. FPs are also considered
to be type I errors.
False negatives (FNs): When we predict the disease as no when the
patient actually does have the disease. FNs are also considered to
be type II errors.
Precision (P): When yes is predicted, how often is it correct?
TP/(TP+FP)
Recall (R)/sensitivity/true positive rate: Among the actual yeses, what fraction was predicted as yes?
TP/(TP+FN)
F1 score (F1): This is the harmonic mean of precision and recall, F1 = 2PR/(P+R). The constant 2 scales the score to 1 when both precision and recall are 1:
Specificity: Among the actual nos, what fraction was predicted as no? It is also equivalent to 1 - false positive rate:
TN/(TN+FP)
Area under curve (AUC): The receiver operating characteristic (ROC) curve plots the true positive rate (TPR) against the false positive rate (FPR); it is also known as a sensitivity versus (1 - specificity) graph:
Area under curve is utilized for setting the threshold of cutoff
probability to classify the predicted probability into various classes;
we will be covering how this method works in upcoming chapters.
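The confusion-matrix terms above can be checked numerically. In the following sketch, the actual and predicted label vectors are made-up illustrations (not the book's cancer data), with 1 meaning "has the disease":

```python
# Sketch: confusion-matrix metrics for a hypothetical disease-prediction model;
# the label vectors below are illustrative assumptions, not the book's data
from sklearn.metrics import confusion_matrix

y_actual = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]   # 1 = patient has the disease
y_pred   = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # model predictions

# For binary labels [0, 1], ravel() returns counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_actual, y_pred).ravel()

precision = tp / (tp + fp)           # TP/(TP+FP)
recall = tp / (tp + fn)              # TP/(TP+FN), a.k.a. sensitivity
specificity = tn / (tn + fp)         # TN/(TN+FP)
f1 = 2 * precision * recall / (precision + recall)

print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("precision:", precision, "recall:", round(recall, 3),
      "specificity:", specificity, "F1:", round(f1, 3))
```

Here the model makes one type I error (an FP) and two type II errors (FNs), which is why recall comes out lower than precision.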
Observation and performance window: In statistical modeling, the model tries to predict the event in advance rather than at the moment, so that some buffer time will exist to work on corrective actions. For example, a question from a credit card company would be: what is the probability that a particular customer will default in the coming 12-month period? So that I can call him and offer discounts or develop my collection strategies accordingly.
In order to answer this question, a probability of default model (or behavioral
scorecard in technical terms) needs to be developed by using independent
variables from the past 24 months and a dependent variable from the next 12
months. After preparing data with X and Y variables, it will be split into 70
percent - 30 percent as train and test data randomly; this method is called in-time validation, as both train and test samples are from the same time period:
In-time and out-of-time validation: In-time validation implies obtaining both the training and testing datasets from the same period of time, whereas out-of-time validation implies training and testing datasets drawn from different time periods. Usually, the model performs worse in out-of-time validation than in in-time validation, for the obvious reason that the characteristics of the train and test datasets might differ.
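A minimal sketch of the two validation schemes, assuming a hypothetical panel with a `snapshot_date` column (all column names, dates, and values below are illustrative assumptions):

```python
# Sketch: in-time vs out-of-time splits on a hypothetical panel with a
# snapshot_date column; column names and dates are illustrative assumptions
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "snapshot_date": pd.to_datetime(
        ["2016-01-31", "2016-02-29", "2016-03-31", "2017-01-31", "2017-02-28"]),
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "default_flag": [0, 1, 0, 1, 0],
})

# In-time: train and test drawn randomly from the same period (2016 here)
in_time = df[df["snapshot_date"] < "2017-01-01"]
train, test = train_test_split(in_time, train_size=0.7, random_state=42)

# Out-of-time: train on 2016, then test against later (2017) observations
oot_test = df[df["snapshot_date"] >= "2017-01-01"]
print(len(train), len(test), len(oot_test))
```

Comparing the model's scores on `test` versus `oot_test` gives a feel for how much performance degrades when the population drifts over time.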
R-squared (coefficient of determination): This is the measure of the percentage of the response variable variation that is explained by a model. It is also a measure of how well the model minimizes error compared with just utilizing the mean as an estimate. In some extreme cases, R-squared can even have a value less than zero, which means the predicted values from the model perform worse than taking the simple mean as a prediction for all the observations. We will study this parameter in detail in upcoming chapters:
Adjusted R-squared: The explanation of the adjusted R-squared statistic is almost the same as for R-squared, but it penalizes the R-squared value if extra variables without a strong correlation are included in the model:

Adjusted R-squared = 1 - [(1 - R2)(n - 1) / (n - k - 1)]

Here, R2 = sample R-squared value, n = sample size, k = number of predictors (or variables).
The adjusted R-squared value is the key metric in evaluating the quality of linear regressions. Any linear regression model with an adjusted R-squared value >= 0.7 is considered good enough to implement.
Example: The R-squared value of a sample is 0.5, with a sample size of 50 and 10 independent variables. Calculate the adjusted R-squared:
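The example works out as follows, plugging R2 = 0.5, n = 50, and k = 10 into the adjusted R-squared formula:

```python
# Worked example: adjusted R-squared for R2 = 0.5, n = 50, k = 10
r2, n, k = 0.5, 50, 10
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print("Adjusted R-squared:", round(adj_r2, 3))   # prints: Adjusted R-squared: 0.372
```

Note how the adjusted value (about 0.372) is markedly lower than the raw R-squared of 0.5, reflecting the penalty for carrying 10 predictors on only 50 observations.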
Maximum likelihood estimate (MLE): This is estimating the parameter values of
a statistical model (logistic regression, to be precise) by finding the parameter
values that maximize the likelihood of making the observations. We will cover
this method in more depth in Chapter 3, Logistic Regression Versus Random Forest.
Akaike information criterion (AIC): This is used in logistic regression and is similar in principle to adjusted R-squared for linear regression. It measures the relative quality of a model for a given set of data:

AIC = -2*ln(L) + 2*k

Here, k = number of predictors or variables, and L = the maximized value of the model's likelihood function.
The idea of AIC is to penalize the objective function if extra variables without
strong predictive abilities are included in the model. This is a kind of
regularization in logistic regression.
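The penalty idea can be sketched with a couple of hypothetical log-likelihood values (the numbers are illustrative, not from any fitted model):

```python
# Sketch: AIC = -2*ln(L) + 2*k; extra parameters must buy enough likelihood
# improvement, otherwise AIC gets worse (higher). Numbers are hypothetical.

def aic(log_likelihood, k):
    """Akaike information criterion from a model's log-likelihood and
    its number of estimated parameters k (lower AIC is better)."""
    return -2.0 * log_likelihood + 2.0 * k

print(aic(-100.0, 5))   # 210.0
print(aic(-99.5, 8))    # 215.0 -> worse despite the higher likelihood
```

The second model fits slightly better (log-likelihood -99.5 vs -100.0) but carries three extra parameters, so its AIC is higher and the simpler model is preferred.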
Entropy: This comes from information theory and is the measure of impurity in the data. If the sample is completely homogeneous, the entropy is zero, and if the sample is equally divided, it has an entropy of one. In decision trees, the predictor with the most heterogeneity will be considered nearest to the root node to classify the given data into classes in a greedy mode. We will cover this topic in more depth in Chapter 4, Tree-Based Machine Learning Models:

Entropy = - sum from i = 1 to n of p(i) * log2 p(i)
Here, n = number of classes. Entropy is maximal in the middle (value 1, when the classes are equally likely) and minimal at the extremes (value 0). A low value of entropy is desirable, as it will segregate the classes better:
Example: Given two types of coin in which the first one is a fair one (1/2 head
and 1/2 tail probabilities) and the other is a biased one (1/3 head and 2/3 tail
probabilities), calculate the entropy for both and justify which one is better
with respect to modeling:
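Working through the two entropies, with probabilities (1/2, 1/2) for the fair coin and (1/3, 2/3) for the biased one:

```python
# Worked example: entropy of the fair coin (1/2, 1/2) vs the biased coin (1/3, 2/3)
import math

def entropy(probs):
    """Shannon entropy in bits: -sum(p * log2(p)) over the class probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair = entropy([1/2, 1/2])      # 1.0 bit, the maximum for two classes
biased = entropy([1/3, 2/3])    # about 0.918 bits
print("fair:", fair, "biased:", round(biased, 3))
```

The biased coin has entropy of about 0.918, lower than the fair coin's 1.0.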
From both values, the decision tree algorithm chooses the biased coin rather than the fair coin as an observation splitter, due to the fact that its entropy value is lower.
Information gain: This is the expected reduction in entropy caused by
partitioning the examples according to a given attribute. The idea is to start with
mixed classes and to keep partitioning until each node reaches its observations of
the purest class. At every stage, the variable with maximum information gain is
chosen in greedy fashion:
Information gain = Entropy of parent - sum (weighted % * Entropy of child)
Weighted % = Number of observations in a particular child / sum (observations in all child nodes)
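The formula can be checked on a small hypothetical split; the parent counts (9 positive, 5 negative) and the two child nodes below are illustrative assumptions, not from the book:

```python
# Sketch: information gain for a hypothetical binary split; the parent holds
# 9 positive / 5 negative observations, split into children of (6,2) and (3,3)
import math

def entropy(counts):
    """Entropy of a node given its per-class observation counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

parent = [9, 5]
children = [[6, 2], [3, 3]]

n_parent = sum(parent)
# Weighted child entropy: each child weighted by its share of observations
weighted_child = sum(sum(c) / n_parent * entropy(c) for c in children)
info_gain = entropy(parent) - weighted_child
print("information gain:", round(info_gain, 3))
```

A split that separated the classes more cleanly would yield a larger gain; the greedy tree-growing procedure picks the attribute with the maximum gain at each stage.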
Gini: Gini impurity is a measure of misclassification, which applies in a multiclass classifier context. Gini works almost the same as entropy, except that Gini is faster to calculate:

Gini = 1 - sum of p(i)^2 over the classes
Here, the sum runs over the i classes. The similarity between Gini and entropy is shown as follows:
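A quick numerical comparison of the two impurity measures for a two-class problem (both are zero at the pure extremes and maximal at p = 0.5; the probability grid is an illustrative choice):

```python
# Sketch: comparing Gini impurity and entropy across class probabilities p
# for a two-class problem
import math

def gini(p):
    # Two-class Gini: 1 - sum(p_i^2)
    return 1 - (p**2 + (1 - p)**2)

def entropy(p):
    # Two-class entropy; pure nodes (p = 0 or 1) have zero impurity
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(p, round(gini(p), 3), round(entropy(p), 3))
```

Both curves have the same shape; Gini simply peaks at 0.5 rather than 1.0 and avoids the logarithm, which is why it is cheaper to compute.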
Bias versus variance tradeoff
Every model has both bias and variance error components in addition to white noise. Bias
and variance are inversely related to each other; while trying to reduce one component, the
other component of the model will increase. The true art lies in creating a good fit by
balancing both. The ideal model will have both low bias and low variance.
Errors from the bias component come from erroneous assumptions in the underlying
learning algorithm. High bias can cause an algorithm to miss the relevant relations between
features and target outputs; this phenomenon causes an underfitting problem.
On the other hand, errors from the variance component come from the model's sensitivity to even small changes in the training data; high variance can cause an overfitting problem:
An example of a high bias model is logistic or linear regression, in which the fit of the
model is merely a straight line and may have a high error component due to the fact that a
linear model could not approximate underlying data well.
An example of a high variance model is a decision tree, in which the model may fit an overly wiggly curve, such that even a small change in the training data will cause a drastic change in the fit of the curve.
At the moment, stateoftheart models are utilizing high variance models such as decision
trees and performing ensemble on top of them to reduce the errors caused by high variance
and at the same time not compromising on increases in errors due to the bias component.
The best example of this category is random forest, in which many decision trees will be grown independently and ensembled in order to come up with the best fit; we will cover this in upcoming chapters:
Train and test data
In practice, data usually will be split randomly 70-30 or 80-20 into train and test datasets respectively in statistical modeling, in which the training data is utilized for building the model and its effectiveness is checked on the test data:
In the following code, we split the original data into train and test data by 70 percent - 30 percent. An important point to consider here is that we set the seed value for the random number generator, so that the random sampling is repeatable and creates the same observations in the training and testing data every time. Repeatability is very much needed in order to reproduce the results:
# Train & Test split
>>> import pandas as pd
>>> from sklearn.model_selection import train_test_split
>>> original_data = pd.read_csv("mtcars.csv")
In the following code, train_size is 0.7, which means 70 percent of the data will be split into the training dataset and the remaining 30 percent into the testing dataset. random_state is the seed in this process of generating pseudo-random numbers, which makes the results reproducible by producing the exact same split on every run:
>>> train_data,test_data = train_test_split(original_data,train_size =
0.7,random_state=42)
The R code for the train and test split for statistical modeling is as follows:
full_data = read.csv("mtcars.csv",header=TRUE)
set.seed(123)
numrow = nrow(full_data)
trnind = sample(1:numrow,size = as.integer(0.7*numrow))
train_data = full_data[trnind,]
test_data = full_data[-trnind,]
Machine learning terminology for model
building and validation
There seems to be an analogy between statistical modeling and machine learning that we
will cover in subsequent chapters in depth. However, a quick view has been provided as
follows: in statistical modeling, linear regression with two independent variables tries to fit the best plane with the least errors, whereas in machine learning the errors are squared into a loss function (squaring ensures the function becomes convex, which enables faster convergence and also ensures a global optimum) that is optimized over the coefficient values rather than the independent variables:
Machine learning utilizes optimization for tuning all the parameters of various algorithms.
Hence, it is a good idea to know some basics about optimization.
Before stepping into gradient descent, an introduction to convex and non-convex functions is very helpful. Convex functions are functions in which a line drawn between any two random points on the function also lies within the function, whereas this isn't true for non-convex functions. It is important to know whether the function is convex or non-convex, due to the fact that in convex functions the local optimum is also the global optimum, whereas for non-convex functions a local optimum does not guarantee the global optimum:
Does it seem like a tough problem? One workaround could be to initiate the search process at different random locations; by doing so, it usually converges to the global optimum:
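A sketch of that multi-start idea, minimizing an arbitrary non-convex function with several random restarts (the objective function and the number of restarts are illustrative assumptions):

```python
# Sketch: random restarts for a non-convex objective; the function below and
# the number of restarts are illustrative assumptions, not from the book
import numpy as np
from scipy.optimize import minimize

def f(x):
    # Non-convex objective with several local minima; the global minimum
    # is near x = -1.306 with f(x) close to -7.95
    x0 = float(np.asarray(x).ravel()[0])
    return x0**2 + 10 * np.sin(x0)

rng = np.random.RandomState(0)
# Start the local optimizer from several random locations, keep the best result
results = [minimize(f, x0=rng.uniform(-10, 10)) for _ in range(10)]
best = min(results, key=lambda r: r.fun)
print("best x:", round(float(best.x[0]), 3),
      "best f(x):", round(float(best.fun), 3))
```

Each individual run only finds the local minimum of its own basin; taking the minimum over many random starts is what recovers the global optimum with high probability.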
Gradient descent: This is a way to minimize the objective function J(θ), parameterized by the model's parameters θ ∈ R^d, by updating the parameters in the direction opposite to the gradient of the objective function with respect to the parameters. The learning rate determines the size of the steps taken to reach the minimum.
Full batch gradient descent (all training observations considered in each and
every iteration): In full batch gradient descent, all the observations are
considered for each and every iteration; this methodology takes a lot of memory
and will be slow as well. Also, in practice, we do not need to have all the
observations to update the weights. Nonetheless, this method provides the best
way of updating parameters with less noise at the expense of huge computation.
Stochastic gradient descent (one observation per iteration): This method
updates weights by taking one observation at each stage of iteration. This method
provides the quickest way of traversing weights; however, a lot of noise is
involved while converging.
Mini batch gradient descent (about 30 training observations or more for each
and every iteration): This is a trade-off between huge computational cost and a
quick method of updating weights. In this method, at each iteration, about 30
observations are selected at random and gradients are calculated to update the
model weights. Here, a question many may ask is: why a minimum of 30 and not
any other number? If we look into statistical basics, 30 observations are required
in order to approximate the sample as representative of the population. However,
even 40, 50, and so on will also do well as batch sizes. Nonetheless, a
practitioner needs to vary the batch size and verify the results, to determine at
what value the model produces the optimum results:
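Before applying gradient descent to regression, the update rule itself can be illustrated on a one-variable convex function. The following is a minimal sketch; the function f(x) = (x - 3)^2, the learning rate, and the iteration count are illustrative choices, not from the text:

```python
# Minimize f(x) = (x - 3)^2 with plain gradient descent.
# The derivative is f'(x) = 2 * (x - 3); update rule: x <- x - learn_rate * f'(x)
def simple_gradient_descent(start, learn_rate=0.1, n_iter=100):
    x = start
    for _ in range(n_iter):
        grad = 2.0 * (x - 3.0)      # gradient of the objective at x
        x = x - learn_rate * grad   # step in the opposite direction of the gradient
    return x

print(simple_gradient_descent(start=0.0))  # converges toward the minimum at x = 3
```

Regardless of whether we start at 0 or 100, repeated steps against the gradient shrink the distance to the minimum geometrically, which is the behavior the regression example below exploits.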
Linear regression versus gradient descent
In the following code, a comparison has been made between applying linear regression in a
statistical way and gradient descent in a machine learning way on the same dataset:
>>> import numpy as np
>>> import pandas as pd
The following code describes reading data using a pandas DataFrame:
>>> train_data = pd.read_csv("mtcars.csv")
Converting DataFrame variables into NumPy arrays in order to process them in scikit-learn
packages, as scikit-learn is itself built on NumPy arrays, is shown next:
>>> X = np.array(train_data["hp"]) ; y = np.array(train_data["mpg"])
>>> X = X.reshape(32,1); y = y.reshape(32,1)
Importing linear regression from the scikit-learn package; this works on the least squares
method:
>>> from sklearn.linear_model import LinearRegression
>>> model = LinearRegression(fit_intercept = True)
Fitting a linear regression model on the data and displaying the intercept and coefficient of
the single variable (hp):
>>> model.fit(X,y)
>>> print ("Linear Regression Results" )
>>> print ("Intercept",model.intercept_[0] ,"Coefficient", model.coef_[0])
Now we will apply gradient descent from scratch; in future chapters, we can use the scikit-learn built-in modules rather than working from first principles. Here, however, an
illustration is provided of the internal workings of the optimization method on
which the whole of machine learning is built.
Defining the gradient descent function gradient_descent with the following:
x: Independent variable.
y: Dependent variable.
learn_rate: Learning rate with which gradients are updated; too low causes
slower convergence and too high causes overshooting of gradients.
batch_size: Number of observations considered at each iteration for updating
gradients; a high number causes a lower number of iterations and a low
number causes an erratic decrease in errors. Ideally, the batch size should be a
minimum of 30 due to statistical significance. However, various settings
need to be tried to check which one is better.
max_iter: Maximum number of iterations, beyond which the algorithm
terminates automatically:
>>> def gradient_descent(x, y, learn_rate, conv_threshold, batch_size, max_iter):
...     converged = False
...     iter = 0
...     m = batch_size
...     t0 = np.random.random(x.shape[1])
...     t1 = np.random.random(x.shape[1])
Mean square error calculation. Squaring of the error has been performed to create a convex
function, which has nice convergence properties:
...     MSE = (sum([(t0 + t1*x[i] - y[i])**2 for i in range(m)]) / m)
The following code states: run the algorithm until it meets the convergence criteria:
...     while not converged:
...         grad0 = 1.0/m * sum([(t0 + t1*x[i] - y[i]) for i in range(m)])
...         grad1 = 1.0/m * sum([(t0 + t1*x[i] - y[i])*x[i] for i in range(m)])
...         temp0 = t0 - learn_rate * grad0
...         temp1 = t1 - learn_rate * grad1
...         t0 = temp0
...         t1 = temp1
Calculate the new error with the updated parameters, in order to check whether the new
error has changed by more than the predefined convergence threshold value; otherwise,
stop the iterations and return the parameters:
...         MSE_New = (sum([(t0 + t1*x[i] - y[i])**2 for i in range(m)]) / m)
...         if abs(MSE - MSE_New) <= conv_threshold:
...             print ('Converged, iterations: ', iter)
...             converged = True
...         MSE = MSE_New
...         iter += 1
...         if iter == max_iter:
...             print ('Max iterations reached')
...             converged = True
...     return t0, t1
The following code describes running the gradient descent function with the defined
values: learning rate = 0.00003, convergence threshold = 1e-8, batch size = 32, maximum
number of iterations = 1500000:
>>> if __name__ == '__main__':
...     Inter, Coeff = gradient_descent(x=X, y=y, learn_rate=0.00003,
...         conv_threshold=1e-8, batch_size=32, max_iter=1500000)
...     print ('Gradient Descent Results')
...     print (('Intercept = %s Coefficient = %s') % (Inter, Coeff))
The R code for linear regression versus gradient descent is as follows:
# Linear Regression
train_data = read.csv("mtcars.csv", header=TRUE)
model <- lm(mpg ~ hp, data = train_data)
print (coef(model))
# Gradient descent
gradDesc <- function(x, y, learn_rate, conv_threshold, batch_size,
max_iter) {
  m <- runif(1, 0, 1)
  c <- runif(1, 0, 1)
  ypred <- m * x + c
  MSE <- sum((y - ypred) ^ 2) / batch_size
  converged = F
  iterations = 0
  while(converged == F) {
    m_new <- m - learn_rate * ((1 / batch_size) * (sum((ypred - y) * x)))
    c_new <- c - learn_rate * ((1 / batch_size) * (sum(ypred - y)))
    m <- m_new
    c <- c_new
    ypred <- m * x + c
    MSE_new <- sum((y - ypred) ^ 2) / batch_size
    if(MSE - MSE_new <= conv_threshold) {
      converged = T
      return(paste("Iterations:", iterations, "Optimal intercept:", c,
                   "Optimal slope:", m))
    }
    iterations = iterations + 1
    if(iterations > max_iter) {
      converged = T
      return(paste("Iterations:", iterations, "Optimal intercept:", c,
                   "Optimal slope:", m))
    }
    MSE = MSE_new
  }
}
gradDesc(x = train_data$hp, y = train_data$mpg, learn_rate = 0.00003,
         conv_threshold = 1e-8, batch_size = 32, max_iter = 1500000)
Machine learning losses
The loss function or cost function in machine learning is a function that maps the values of
variables onto a real number intuitively representing some cost associated with the variable
values. Optimization methods are applied to minimize the loss function by changing the
parameter values, which is the central theme of machine learning.
Zero-one loss is L0-1 = 1(m <= 0); in zero-one loss, the value of the loss is 0 for m > 0,
whereas it is 1 for m <= 0. The difficult part of this loss is that it is not differentiable and
non-convex, and optimizing it is NP-hard. Hence, in order to make optimization feasible
and solvable, these losses are replaced by different surrogate losses for different problems.
Surrogate losses used for machine learning in place of zero-one loss are given as follows.
The zero-one loss is not differentiable; hence approximated losses are used instead:
Squared loss (for regression)
Hinge loss (SVM)
Logistic/log loss (logistic regression)
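As a sketch of how these surrogates compare, each can be written as a function of the margin m = y * f(x) with labels y in {-1, +1}; the specific margin values below are illustrative choices, not from the text:

```python
import numpy as np

# Surrogate losses as functions of the margin m = y * f(x), y in {-1, +1}
def zero_one_loss(m):
    # 1 when the example is misclassified (m <= 0), 0 otherwise
    return np.where(m <= 0, 1.0, 0.0)

def hinge_loss(m):            # used by SVMs
    return np.maximum(0.0, 1.0 - m)

def log_loss(m):              # used by logistic regression
    return np.log(1.0 + np.exp(-m))

def squared_loss(m):          # regression-style surrogate
    return (1.0 - m) ** 2

margins = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # illustrative margin values
for name, fn in [("zero-one", zero_one_loss), ("hinge", hinge_loss),
                 ("log", log_loss), ("squared", squared_loss)]:
    print(name, np.round(fn(margins), 3))
```

Note that all three surrogates upper-bound (or closely track) the zero-one loss while remaining convex, which is what makes optimization tractable.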
Some loss functions are as follows:
When to stop tuning machine learning models
When to stop tuning the hyperparameters in a machine learning model is a million-dollar
question. This problem can mostly be solved by keeping tabs on training and testing errors.
While increasing the complexity of a model, the following stages occur:
Stage 1: Underfitting stage - high train and high test errors (or low train and low
test accuracy)
Stage 2: Good fit stage (ideal scenario) - low train and low test errors (or high
train and high test accuracy)
Stage 3: Overfitting stage - low train and high test errors (or high train and low
test accuracy)
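The three stages can be reproduced with a small experiment: fit polynomials of increasing degree and watch the train and test errors. This is a minimal sketch; the noisy sine data, the split, and the degrees (1, 3, 15) are illustrative assumptions, not from the text:

```python
import numpy as np

# Noisy sine data; every other point is held out as a test set
rng = np.random.RandomState(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
x_tr, y_tr = x[::2], y[::2]
x_ts, y_ts = x[1::2], y[1::2]

for degree in (1, 3, 15):   # under-fit, good fit, over-fit
    coefs = np.polyfit(x_tr, y_tr, degree)
    train_mse = float(np.mean((np.polyval(coefs, x_tr) - y_tr) ** 2))
    test_mse = float(np.mean((np.polyval(coefs, x_ts) - y_ts) ** 2))
    print(degree, round(train_mse, 3), round(test_mse, 3))
```

Degree 1 shows high train and test errors (Stage 1), degree 3 shows both low (Stage 2), and degree 15 drives the train error down while the test error stops improving or worsens (Stage 3).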
Train, validation, and test data
Cross-validation is not popular in the statistical modeling world for many reasons;
statistical models are linear in nature and robust, and do not have a high
variance/overfitting problem. Hence, the model fit will remain the same either on train or
test data, which does not hold true in the machine learning world. Also, in statistical
modeling, lots of tests are performed at the individual parameter level apart from
aggregated metrics, whereas in machine learning we do not have visibility at the individual
parameter level:
In the following code, both the R and Python implementations are provided. If no
percentages are given, the default parameters are 50 percent for train data, 25
percent for validation data, and 25 percent for the remaining test data.
The Python implementation has only a single train and test split function, hence we have
used it twice, and we have also used the number of observations to split rather than the
percentage (as shown in the previous train and test split example). Hence, a customized
function is needed to split the data into three datasets:
>>> import pandas as pd
>>> from sklearn.model_selection import train_test_split
>>> original_data = pd.read_csv("mtcars.csv")
>>> def data_split(dat, trf=0.5, vlf=0.25, tsf=0.25):
...     nrows = dat.shape[0]
...     trnr = int(nrows*trf)
...     vlnr = int(nrows*vlf)
The following Python code splits the data into training and the remaining data. The
remaining data will be further split into validation and test datasets:
...     tr_data, rmng = train_test_split(dat, train_size=trnr, random_state=42)
...     vl_data, ts_data = train_test_split(rmng, train_size=vlnr, random_state=45)
...     return (tr_data, vl_data, ts_data)
Implementation of the split function on the original data to create three datasets (by 50
percent, 25 percent, and 25 percent splits) is as follows:
>>> train_data, validation_data, test_data = data_split(original_data,
...     trf=0.5, vlf=0.25, tsf=0.25)
The R code for the train, validation, and test split is as follows:
# Train Validation & Test samples
trvaltest <- function(dat, prop = c(0.5, 0.25, 0.25)) {
  nrw = nrow(dat)
  trnr = as.integer(nrw * prop[1])
  vlnr = as.integer(nrw * prop[2])
  set.seed(123)
  trni = sample(1:nrow(dat), trnr)
  trndata = dat[trni, ]
  rmng = dat[-trni, ]
  vlni = sample(1:nrow(rmng), vlnr)
  valdata = rmng[vlni, ]
  tstdata = rmng[-vlni, ]
  mylist = list("trn" = trndata, "val" = valdata, "tst" = tstdata)
return(mylist)
}
outdata = trvaltest(mtcars,prop = c(0.5,0.25,0.25))
train_data = outdata$trn; valid_data = outdata$val; test_data = outdata$tst
Cross-validation
Cross-validation is another way of ensuring robustness in the model, at the expense of
computation. In the ordinary modeling methodology, a model is developed on train data
and evaluated on test data. In some extreme cases, train and test might not have been
homogeneously selected, and some unseen extreme cases might appear in the test data,
which will drag down the performance of the model.
On the other hand, in the cross-validation methodology, data is divided into equal parts
and training is performed on all parts of the data except one, on which performance is
evaluated. This process is repeated for as many parts as the user has chosen.
Example: In five-fold cross-validation, data is divided into five parts; the model is
subsequently trained on four parts of the data and tested on the remaining part. This
process runs five times, in order to cover all points in the data. Finally, the error calculated
is the average of all the errors:
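The five-fold procedure described above can be sketched with scikit-learn's KFold and cross_val_score; the synthetic regression data here is an illustrative assumption, not from the text:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data: y = 3x + noise
rng = np.random.RandomState(42)
X = rng.uniform(0, 10, size=(50, 1))
y = 3.0 * X.ravel() + rng.normal(scale=1.0, size=50)

model = LinearRegression()
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# scikit-learn reports error-type scorers as negative values, hence the sign flip
scores = cross_val_score(model, X, y, cv=kfold,
                         scoring='neg_mean_squared_error')
print("Fold MSEs:", np.round(-scores, 3))
print("Average MSE:", round(float(-scores.mean()), 3))
```

Each of the five folds serves once as the held-out part, and the averaged MSE is the cross-validated error estimate described above.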
Grid search
Grid search in machine learning is a popular way to tune the hyperparameters of the model
in order to find the best combination for determining the best fit:
In the following code, an implementation has been performed to determine whether a
particular user will click an ad or not. Grid search has been implemented using a decision
tree classifier for classification purposes. The tuning parameters are the depth of the tree,
the minimum number of observations in a terminal node, and the minimum number of
observations required to perform a node split:
# Grid search
>>> import pandas as pd
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.model_selection import GridSearchCV
>>> input_data = pd.read_csv("ad.csv", header=None)
>>> X_columns = set(input_data.columns.values)
>>> y = input_data[len(input_data.columns.values)-1]
>>> X_columns.remove(len(input_data.columns.values)-1)
>>> X = input_data[list(X_columns)]
Split the data into train and testing:
>>> X_train, X_test,y_train,y_test = train_test_split(X,y,train_size =
0.7,random_state=33)
Create a pipeline to create combinations of variables for the grid search:
>>> pipeline = Pipeline([
...     ('clf', DecisionTreeClassifier(criterion='entropy'))])
Combinations to explore are given as parameters in Python dictionary format:
>>> parameters = {
...     'clf__max_depth': (50, 100, 150),
...     'clf__min_samples_split': (2, 3),
...     'clf__min_samples_leaf': (1, 2, 3)}
The n_jobs field is for selecting the number of cores to use on a computer; -1 means use all
the cores. The scoring methodology here is accuracy; many other options can be chosen,
such as precision, recall, and f1:
>>> grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1,
...     scoring='accuracy')
>>> grid_search.fit(X_train, y_train)
Predict using the best parameters of grid search:
>>> y_pred = grid_search.predict(X_test)
The output is as follows:
>>> print ('\n Best score: \n', grid_search.best_score_)
>>> print ('\n Best parameters set: \n')
>>> best_parameters = grid_search.best_estimator_.get_params()
>>> for param_name in sorted(parameters.keys()):
>>>     print ('\t%s: %r' % (param_name, best_parameters[param_name]))
>>> print ("\n Confusion Matrix on Test data
\n",confusion_matrix(y_test,y_pred))
>>> print ("\n Test Accuracy \n",accuracy_score(y_test,y_pred))
>>> print ("\nPrecision Recall f1 table \n",classification_report(y_test,
y_pred))
The R code for grid searches on decision trees is as follows:
# Grid Search on Decision Trees
library(rpart)
input_data = read.csv("ad.csv",header=FALSE)
input_data$V1559 = as.factor(input_data$V1559)
set.seed(123)
numrow = nrow(input_data)
trnind = sample(1:numrow,size = as.integer(0.7*numrow))
train_data = input_data[trnind,]; test_data = input_data[-trnind,]
minspset = c(2,3);minobset = c(1,2,3)
initacc = 0
for (minsp in minspset){
for (minob in minobset){
tr_fit = rpart(V1559 ~.,data = train_data,method = "class",minsplit =
minsp, minbucket = minob)
tr_predt = predict(tr_fit,newdata = train_data,type = "class")
tble = table(tr_predt,train_data$V1559)
acc = (tble[1,1]+tble[2,2])/sum(tble)
acc
if (acc > initacc){
tr_predtst = predict(tr_fit,newdata = test_data,type = "class")
tblet = table(test_data$V1559,tr_predtst)
acct = (tblet[1,1]+tblet[2,2])/sum(tblet)
acct
print(paste("Best Score"))
print( paste("Train Accuracy ",round(acc,3),"Test
Accuracy",round(acct,3)))
print( paste(" Min split ",minsp," Min obs per node ",minob))
print(paste("Confusion matrix on test data"))
print(tblet)
precsn_0 = (tblet[1,1])/(tblet[1,1]+tblet[2,1])
precsn_1 = (tblet[2,2])/(tblet[1,2]+tblet[2,2])
print(paste("Precision_0: ",round(precsn_0,3),"Precision_1:
",round(precsn_1,3)))
rcall_0 = (tblet[1,1])/(tblet[1,1]+tblet[1,2])
rcall_1 = (tblet[2,2])/(tblet[2,1]+tblet[2,2])
print(paste("Recall_0: ",round(rcall_0,3),"Recall_1:
",round(rcall_1,3)))
initacc = acc
}
}
}
Machine learning model overview
Machine learning models are classified mainly into supervised, unsupervised, and
reinforcement learning methods. We will be covering detailed discussions about each
technique in later chapters; here is a very basic summary of them:
Supervised learning: This is akin to an instructor providing feedback to a student
on whether they have performed well in an examination or not. Here, a target
variable is present and models are tuned to achieve it. Many machine learning
methods fall into this category:
Classification problems
Logistic regression
Lasso and ridge regression
Decision trees (classification trees)
Bagging classifier
Random forest classifier
Boosting classifier (adaboost, gradient boost, and xgboost)
SVM classifier
Recommendation engine
Regression problems
Linear regression (lasso and ridge regression)
Decision trees (regression trees)
Bagging regressor
Random forest regressor
Boosting regressor (adaboost, gradient boost, and xgboost)
SVM regressor
Unsupervised learning: Continuing the teacher-student analogy, this is the case in
which the instructor is not present to provide feedback to the student, who needs
to prepare on his/her own. Unsupervised learning does not have as many methods
as supervised learning:
Principal component analysis (PCA)
K-means clustering
Reinforcement learning: This is the scenario in which an agent needs to take
multiple decisions prior to reaching the target, and the environment provides a
reward, either +1 or -1, rather than notifying how well or how badly the agent
performed across the path:
Markov decision process
Monte Carlo methods
Temporal difference learning
Logistic regression: This is for problems in which outcomes are discrete classes
rather than continuous values. For example, a customer will arrive or not, they
will purchase the product or not, and so on. In statistical methodology, it uses the
maximum likelihood method to calculate the parameters of individual variables.
In contrast, in machine learning methodology, log loss is minimized with
respect to the β coefficients (also known as weights). Logistic regression has a high
bias and a low variance error.
Linear regression: This is used for the prediction of continuous variables such as
customer income and so on. It utilizes error minimization to fit the best possible
line in statistical methodology. However, in machine learning methodology,
squared loss will be minimized with respect to β coefficients. Linear regression
also has a high bias and a low variance error.
Lasso and ridge regression: This uses regularization to control overfitting issues
by applying a penalty on coefficients. In ridge regression, a penalty is applied on
the sum of squares of coefficients, whereas in lasso, a penalty is applied on the
absolute values of the coefficients. The penalty can be tuned in order to change
the dynamics of the model fit. Ridge regression tries to minimize the magnitude
of coefficients, whereas lasso tries to eliminate them.
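The shrink-versus-eliminate contrast can be seen on synthetic data; this is a minimal sketch, and the data, the alpha values, and the true coefficients (only the first two variables carry signal) are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first two of five variables carry signal
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
y = 4.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)
print("Ridge coefficients:", np.round(ridge.coef_, 3))  # shrunk, but none exactly zero
print("Lasso coefficients:", np.round(lasso.coef_, 3))  # irrelevant ones driven to zero
```

Ridge reduces the magnitude of all five coefficients, while lasso sets the coefficients of the irrelevant variables to exactly zero, performing variable selection as a side effect of the penalty.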
Decision trees: Recursive binary splitting is applied to split the classes at each
level to classify observations to their purest class. The classification error rate is
simply the fraction of the training observations in that region that do not belong
to the most common class. Decision trees have an overfitting problem due to the high
variance in the way they fit; pruning is applied, after the tree has been grown completely,
to reduce the overfitting problem. Decision trees have a low bias and a high
variance error.
Bagging: This is an ensemble technique applied on decision trees in order to
minimize the variance error and at the same time not increase the error
component due to bias. In bagging, various samples are selected with a
subsample of observations and all variables (columns); individual decision trees
are subsequently fit independently on each sample, and later the results are
ensembled by taking the maximum vote (in regression cases, the mean of the
outcomes is calculated).
Random forest: This is similar to bagging except for one difference. In bagging,
all the variables/columns are selected for each sample, whereas in random forest
a few sub-columns are selected. The reason for selecting a few variables
rather than all is that, during each independently sampled tree, the significant
variables would always come first in the top layer of splitting, which would make
all the trees look more or less similar and defy the purpose of ensembling:
ensembles work better on diversified and independent individual models than on
correlated individual models. Random forest has both a low bias and a low
variance error.
Boosting: This is a sequential algorithm that applies on weak classifiers such as a
decision stump (a onelevel decision tree or a tree with one root node and two
terminal nodes) to create a strong classifier by ensembling the results. The
algorithm starts with equal weights assigned to all the observations, followed by
subsequent iterations in which more focus is given to misclassified observations
by increasing their weight and decreasing the weight of properly classified
observations. In the end, all the individual classifiers are combined to create a
strong classifier. Boosting might have an overfitting problem, but by carefully
tuning the parameters, we can obtain one of the best-performing machine
learning models.
Support vector machines (SVMs): This maximizes the margin between classes by
fitting the widest possible hyperplane between them. In the case of nonlinearly
separable classes, it uses kernels to move observations into higherdimensional
space and then separates them linearly with the hyperplane there.
Recommendation engine: This utilizes a collaborative filtering algorithm to
identify high-probability items for users who have not used them in the
past, by considering the tastes of similar users who have used that
particular item. It uses the alternating least squares (ALS) methodology to solve
this problem.
Principal component analysis (PCA): This is a dimensionality reduction
technique in which principal components are calculated in place of the original
variables. Principal components are determined where the variance in the data is
at its maximum; subsequently, the top n components covering about
80 percent of the variance are taken and used in further modeling processes, or
exploratory analysis is performed as unsupervised learning.
K-means clustering: This is an unsupervised algorithm that is mainly utilized for
segmentation exercises. K-means clustering classifies the given data into k clusters
in such a way that within-cluster variation is minimal and across-cluster
variation is maximal.
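A minimal sketch of K-means on two well-separated synthetic blobs follows; the blob locations, spreads, and cluster count are illustrative assumptions, not from the text:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs in two dimensions
rng = np.random.RandomState(0)
blob1 = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob2 = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
X = np.vstack([blob1, blob2])

# Fit k = 2 clusters; n_init restarts guard against bad random initializations
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
centers = sorted(km.cluster_centers_[:, 0])
print("Cluster centers (x-coordinates):", [round(float(c), 2) for c in centers])
```

The recovered centers sit near the true blob locations, illustrating the minimal within-cluster and maximal across-cluster variation described above.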
Markov decision process (MDP): In reinforcement learning, MDP is a
mathematical framework for modeling decisionmaking of an agent in situations
or environments where outcomes are partly random and partly under control. In
this model, environment is modeled as a set of states and actions that can be
performed by an agent to control the system's state. The objective is to control the
system in such a way that the agent's total payoff is maximized.
Monte Carlo method: Monte Carlo methods do not require complete knowledge
of the environment, in contrast with MDP. Monte Carlo methods require only
experience, which is obtained by sample sequences of states, actions, and rewards
from actual or simulated interaction with the environment. Monte Carlo methods
explore the space until the final outcome of the chosen sample sequences is
known, and then update their estimates accordingly.
Temporal difference learning: This is a core theme in reinforcement learning.
Temporal difference is a combination of both Monte Carlo and dynamic
programming ideas. Similar to Monte Carlo, temporal difference methods can
learn directly from raw experience without a model of the environment's
dynamics. Like dynamic programming, temporal difference methods update
estimates based in part on other learned estimates, without waiting for a final
outcome. Temporal difference is the best of both worlds and is most commonly
used in games such as AlphaGo and so on.
Summary
In this chapter, we have gained a highlevel view of various basic building blocks and
subcomponents involved in statistical modeling and machine learning, such as mean,
variance, interquartile range, pvalue, bias versus variance tradeoff, AIC, Gini, area under
the curve, and so on with respect to the statistics context, and crossvalidation, gradient
descent, and grid search concepts with respect to machine learning. We have explained all
the concepts with the support of both Python and R code with various libraries such as
numpy, scipy, pandas, scikit-learn, and statsmodels in Python, and the basic
stats package in R. In the next chapter, we will learn to draw parallels between statistical
models and machine learning models with linear regression problems and ridge/lasso
regression in machine learning using both Python and R code.
2
Parallelism of Statistics and Machine Learning
At first glance, machine learning seems to be distant from statistics. However, if we take a
deeper look into them, we can draw parallels between both. In this chapter, we will deep
dive into the details. Comparisons have been made between linear regression and
lasso/ridge regression in order to provide a simple comparison between statistical modeling
and machine learning. These are basic models in both worlds and are good to start with.
In this chapter, we will cover the following:
Understanding of statistical parameters and diagnostics
Compensating factors in machine learning models to equate statistical
diagnostics
Ridge and lasso regression
Comparison of adjusted Rsquare with accuracy
Comparison between regression and
machine learning models
Linear regression and machine learning models both try to solve the same problem in
different ways. In the following simple example of a two-variable equation, fitting the best
possible plane, regression models try to fit the best possible hyperplane by minimizing the
errors between the hyperplane and the actual observations. However, in machine learning,
the same problem is converted into an optimization problem in which errors are
modeled in squared form and minimized by altering the weights.
In statistical modeling, samples are drawn from the population and the model will be fitted
on sampled data. However, in machine learning, even small numbers such as 30
observations would be good enough to update the weights at the end of each iteration; in a
few cases, such as online learning, the model will be updated with even one observation:
Machine learning models can be effectively parallelized and made to work on multiple
machines in which model weights are broadcast across the machines, and so on. In the case
of big data with Spark, these techniques are implemented.
Statistical models are parametric in nature, which means a model has parameters on
which diagnostics are performed to check the validity of the model. Machine learning
models, on the other hand, are non-parametric: they do not have explicit parameters or
curve assumptions; these models learn by themselves based on the provided data and come
up with complex and intricate functions rather than fitting a predefined function.
Multicollinearity checks are required to be performed in statistical modeling, whereas, in
the machine learning space, weights are automatically adjusted to compensate for the
multicollinearity problem. If we consider tree-based ensemble methods such as bagging,
random forest, boosting, and so on, multicollinearity does not even exist, as the underlying
model is a decision tree, which does not have a multicollinearity problem in the first place.
With the evolution of big data and distributed parallel computing, more complex models
are producing stateoftheart results which were impossible with past technology.
Compensating factors in machine learning
models
Compensating factors in machine learning models to equate statistical diagnostics is
explained with the example of a beam being supported by two supports. If one of the
supports doesn't exist, the beam will eventually fall down by moving out of balance. A
similar analogy is applied for comparing statistical modeling and machine learning
methodologies here.
Two-point validation is performed in the statistical modeling methodology on training
data, using overall model accuracy and individual parameter significance tests. Because
either linear or logistic regression has low variance due to the shape of the model itself,
there is very little chance of it performing worse on unseen data. Hence, during
deployment, these models do not produce too many deviating results.
However, in the machine learning space, models have a high degree of flexibility, which
can range from simple to highly complex. On top of that, statistical diagnostics on
individual variables are not performed in machine learning. Hence, it is important to
ensure robustness to avoid overfitting of the models, which will ensure their usability
during the implementation phase for correct use on unseen data.
As mentioned previously, in machine learning, data is split into three parts (train data
- 50 percent, validation data - 25 percent, testing data - 25 percent) rather than the two
parts used in statistical methodology. Machine learning models should be developed on
training data, and their hyperparameters should be tuned based on validation data to
ensure the two-point validation equivalence; this way, the robustness of models is ensured
without diagnostics being performed at an individual variable level:
Before diving deep into comparisons between both streams, we will start understanding the
fundamentals of each model individually. Let us start with linear regression! This model
might sound trivial; however, knowing the linear regression working principles will create
a foundation for more advanced statistical and machine learning models. Below are the
assumptions of linear regression.
Assumptions of linear regression
Linear regression has the following assumptions, failing which the linear regression model
does not hold true:
The dependent variable should be a linear combination of independent variables
No autocorrelation in error terms
Errors should have zero mean and be normally distributed
No or little multicollinearity
Error terms should be homoscedastic
These are explained in detail as follows:
The dependent variable should be a linear combination of independent
variables: Y should be a linear combination of the X variables. Please note that, in
the following equation, even though X2 is raised to the power of 2, the equation
still holds the assumption of a linear combination of variables:
How to diagnose: Look into residual plots of the residuals versus the
independent variables. Also try including polynomial terms and see whether
there is any decrease in residual values, as polynomial terms may capture more
signal from the data in cases where simple linear models do not capture it.
In the preceding sample graph, initially, linear regression was applied and
the errors seem to have a pattern rather than being pure white noise; in this
case, it is simply showing the presence of nonlinearity. After increasing the
power of the polynomial value, now the errors simply look like white noise.
No autocorrelation in error terms: The presence of correlation in the error terms
penalizes model accuracy.
How to diagnose: Use the Durbin-Watson test. Durbin-Watson's d tests
the null hypothesis that the residuals are not linearly autocorrelated. While d
can lie between 0 and 4, d ≈ 2 indicates no autocorrelation, 0 < d < 2 implies
positive autocorrelation, and 2 < d < 4 indicates negative autocorrelation.
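The d statistic is simple to compute directly from residuals. Here is a minimal sketch; the white-noise and random-walk residual series are illustrative assumptions chosen to show the two regimes:

```python
import numpy as np

def durbin_watson(residuals):
    # d = sum of squared successive differences over sum of squared residuals
    diff = np.diff(residuals)
    return float(np.sum(diff ** 2) / np.sum(residuals ** 2))

rng = np.random.RandomState(1)
white_noise = rng.normal(size=500)             # uncorrelated residuals -> d near 2
random_walk = np.cumsum(rng.normal(size=500))  # positively autocorrelated -> d near 0
print(round(durbin_watson(white_noise), 2))
print(round(durbin_watson(random_walk), 2))
```

The uncorrelated series yields d close to 2, while the strongly positively autocorrelated series pushes d toward 0, matching the interpretation above.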
Errors should have zero mean and be normally distributed: Errors should have
zero mean for the model to create an unbiased estimate. Plotting the errors will
show the distribution of errors. If the error terms are not normally
distributed, confidence intervals will become too wide or too narrow,
which leads to difficulty in estimating coefficients based on the minimization of
least squares:
How to diagnose: Look into the Q-Q plot; tests such as the Kolmogorov-Smirnov
test will also be helpful. Looking at the above Q-Q plot, it is evident
that the first chart shows errors that are normally distributed, as the residuals do
not seem to deviate much from the diagonal line, whereas the right-hand chart
clearly shows that the errors are not normally distributed; in these scenarios, we
need to re-evaluate the variables by taking log transformations and so on, to
make the residuals look as they do in the left-hand chart.
No or little multicollinearity: Multicollinearity occurs when independent
variables are correlated with each other; this situation creates unstable models
by inflating the magnitude of the coefficients/estimates. It also becomes
difficult to determine which variable contributes to predicting the response
variable. VIF is calculated for each independent variable by computing the
R-squared value of that variable regressed on all the other independent
variables; we then eliminate the variables with the highest VIF values one by one.
How to diagnose: Look at scatter plots and run correlation coefficients on all
the variables of the data. Calculate the variance inflation factor (VIF).
VIF <= 4 suggests no multicollinearity; in banking scenarios, people use
VIF <= 2 as well!
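The VIF calculation described above can be sketched directly with NumPy (the `statsmodels.stats.outliers_influence.variance_inflation_factor` function provides the same quantity); the three simulated predictors below are illustrative assumptions:

```python
import numpy as np

def vif(X):
    # For each column j, regress x_j on the remaining columns (plus an
    # intercept) and convert the resulting R-squared into VIF = 1/(1 - R^2)
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)  # almost a copy of x1: collinear
x3 = rng.normal(size=200)             # independent predictor
v = vif(np.column_stack([x1, x2, x3]))
```

The two collinear columns produce VIF values far above 4, while the independent column stays near 1, which is exactly the pattern the elimination procedure looks for.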
Errors should be homoscedastic: Errors should have constant variance with
respect to the independent variable; when they do not, the confidence intervals
for the estimates become impractically wide or narrow, which degrades the
model's performance. One reason homoscedasticity may not hold is the presence
of outliers in the data, which drag the model fit toward them with higher
weights.
How to diagnose: Look at the residual versus predicted values plot; if any
cone-shaped or diverging pattern exists, it indicates that the errors do not
have constant variance, which impacts the model's predictions.
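A rough numeric companion to the plot is to compare residual variance in the lower and upper halves of the fitted values (a Goldfeld-Quandt-style ratio); the simulated data and the threshold of 2 used here are assumptions for illustration, not a formal test:

```python
import numpy as np

def variance_ratio(residuals, fitted):
    # Sort residuals by fitted value and compare variance in each half;
    # a ratio near 1 suggests constant variance, a large ratio suggests a cone
    order = np.argsort(fitted)
    r = residuals[order]
    half = len(r) // 2
    return np.var(r[half:]) / np.var(r[:half])

rng = np.random.default_rng(7)
fitted = np.linspace(1, 10, 500)
homo_resid = rng.normal(size=500)             # constant variance
hetero_resid = rng.normal(size=500) * fitted  # variance grows with fit

ratio_homo = variance_ratio(homo_resid, fitted)
ratio_hetero = variance_ratio(hetero_resid, fitted)
```

The heteroscedastic residuals produce a ratio well above 1, mirroring the cone shape one would see in the residual plot.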
Steps applied in linear regression modeling
The following steps are applied in linear regression modeling in industry:
1. Missing value and outlier treatment
2. Correlation check of independent variables
3. Train and test random split
4. Fit the model on train data
5. Evaluate the model on test data
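These five steps can be sketched end to end on simulated data; the column names (x1, x2, y), the synthetic coefficients, and the reliance on scikit-learn being installed are illustrative assumptions rather than the book's wine example:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
df["y"] = 2 * df["x1"] - df["x2"] + rng.normal(scale=0.5, size=300)

# 1. Missing value and outlier treatment (nothing to treat in synthetic data)
df = df.dropna()
# 2. Correlation check of independent variables
corr = df[["x1", "x2"]].corr()
# 3. Train and test random split (70:30)
X_train, X_test, y_train, y_test = train_test_split(
    df[["x1", "x2"]], df["y"], train_size=0.7, random_state=42)
# 4. Fit the model on train data
model = LinearRegression().fit(X_train, y_train)
# 5. Evaluate the model on test data
r2 = r2_score(y_test, model.predict(X_test))
```

With a low-noise synthetic signal the test R-squared comes out high; on real data, step 5 is where the deployment decision discussed later in the chapter is made.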
Example of simple linear regression from first
principles
The entire chapter has been presented with the popular wine quality dataset which is
openly available from the UCI machine learning repository at
https://archive.ics.uci.edu/ml/datasets/Wine+Quality.
Simple linear regression is a straightforward approach for predicting the
dependent/response variable Y given the independent/predictor variable X. It assumes a
linear relationship between X and Y:

Y = β0 + β1X + ε

β0 and β1 are two unknown constants, the intercept and slope parameters
respectively. Once we estimate these constants, we can utilize them for the prediction of
the dependent variable.
Residuals are the differences between the ith observed response value and the ith response
value predicted by the model, ei = yi - ŷi. The residual sum of squares is
RSS = e1² + e2² + ... + en². The least squares approach chooses the estimates of β0 and β1
that minimize the RSS.
In order to prove statistically that linear regression is significant, we have to perform
hypothesis testing. Let's assume we start with the null hypothesis that there is no significant
relationship between X and Y:

H0: β1 = 0
Ha: β1 ≠ 0

Since, if β1 = 0, the model shows no association between the two variables (Y = β0 + ε),
this is the null hypothesis assumption; in order to prove this assumption right or
wrong, we need to determine that β1 is sufficiently far from 0 (statistically significant in
distance from 0, to be precise), so that we can be confident that β1 is non-zero and that a
significant relationship exists between the two variables. Now, the question is, how far is
far enough from zero? It depends on the distribution of β1, that is, its mean and standard
error (similar to standard deviation). If the standard error is small, even relatively small
values may provide strong evidence that β1 ≠ 0, and hence that there is a relationship
between X and Y. In contrast, if SE(β1) is large, then β1 must be large in absolute value for
us to reject the null hypothesis. We usually perform the following t-test to check how many
standard deviations β1 is away from the value 0:

t = (β1 - 0) / SE(β1)
With this t value, we calculate the probability of observing any value equal to |t| or larger,
assuming β1 = 0; this probability is also known as the p-value. If the p-value < 0.05, it
signifies that β1 is significantly far from 0, hence we can reject the null hypothesis and
conclude that a strong relationship exists, whereas if the p-value > 0.05, we fail to reject the
null hypothesis and conclude that there is no significant relationship between the two
variables.
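This t-test can be carried out from first principles; the simulated data, the true slope of 0.8, and the availability of SciPy for the t distribution's CDF are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(size=100)  # true slope is 0.8, plus noise

# Least squares estimates of slope and intercept
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Standard error of the slope: sqrt(residual variance / sum of squares of x)
resid = y - (b0 + b1 * x)
n = len(x)
sigma2 = np.sum(resid ** 2) / (n - 2)
se_b1 = np.sqrt(sigma2 / np.sum((x - x.mean()) ** 2))

# t statistic and two-sided p-value under H0: beta1 = 0
t_stat = b1 / se_b1
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n - 2))
```

Because the true slope is non-zero and the standard error is small, the t statistic lands far from 0 and the p-value falls well below 0.05, so H0 is rejected.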
Once we have the coefficient values, we will try to predict the dependent variable and check
the R-squared value; if the value is >= 0.7, it means the model is good enough to deploy
on unseen data, whereas if it is lower (< 0.6), we can conclude that the model is not good
enough to deploy.
Example of simple linear regression using the
wine quality data
In the wine quality data, the dependent variable (Y) is wine quality and the independent (X)
variable we have chosen is alcohol content. We are testing here whether there is any
significant relation between both, to check whether a change in alcohol percentage is the
deciding factor in the quality of the wine:
>>> import pandas as pd
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import r2_score
>>> wine_quality = pd.read_csv("winequality-red.csv", sep=';')
>>> wine_quality.rename(columns=lambda x: x.replace(" ", "_"), inplace=True)
In the following step, the data is split into train and test using the 70:30 rule:
>>> x_train, x_test, y_train, y_test = train_test_split(wine_quality['alcohol'],
wine_quality["quality"], train_size=0.7, random_state=42)
After splitting a single variable out of the DataFrame, it becomes a pandas series, hence we
need to convert it back into a pandas DataFrame again:
>>> x_train = pd.DataFrame(x_train);x_test = pd.DataFrame(x_test)
>>> y_train = pd.DataFrame(y_train);y_test = pd.DataFrame(y_test)
The following function is for calculating the mean from the columns of the DataFrame. The
mean was calculated for both alcohol (independent) and the quality (dependent)
variables:
>>> def mean(values):
...     return round(sum(values)/float(len(values)), 2)
>>> alcohol_mean = mean(x_train['alcohol'])
>>> quality_mean = mean(y_train['quality'])
Variance and covariance are needed for calculating the coefficients of the regression
model:
>>> alcohol_variance = round(sum((x_train['alcohol'] - alcohol_mean)**2), 2)
>>> quality_variance = round(sum((y_train['quality'] - quality_mean)**2), 2)
>>> covariance = round(sum((x_train['alcohol'] - alcohol_mean) *
(y_train['quality'] - quality_mean)), 2)
>>> b1 = covariance/alcohol_variance
>>> b0 = quality_mean - b1*alcohol_mean
>>> print ("\n\nIntercept (B0):", round(b0,4), "Coefficient (B1):", round(b1,4))
After computing the coefficients, it is necessary to predict the quality variable, so that we
can test the quality of the fit using the R-squared value:
>>> y_test["y_pred"] = pd.DataFrame(b0 + b1*x_test['alcohol'])
>>> R_sqrd = 1 - (sum((y_test['quality'] - y_test['y_pred'])**2) /
sum((y_test['quality'] - mean(y_test['quality']))**2))
>>> print ("Test R-squared value", round(R_sqrd,4))
From the test R-squared value, we can conclude that there is no strong relationship between
the quality and alcohol variables in the wine data, as the R-squared value is less than 0.7.
Simple regression fit using first principles is described in the following R code:
wine_quality = read.csv("winequality-red.csv", header=TRUE, sep=";",
check.names=FALSE)
names(wine_quality) <- gsub(" ", "_", names(wine_quality))
set.seed(123)
numrow = nrow(wine_quality)
trnind = sample(1:numrow, size = as.integer(0.7*numrow))
train_data = wine_quality[trnind,]
test_data = wine_quality[-trnind,]
x_train = train_data$alcohol; y_train = train_data$quality
x_test = test_data$alcohol; y_test = test_data$quality
x_mean = mean(x_train); y_mean = mean(y_train)
x_var = sum((x_train - x_mean)**2); y_var = sum((y_train - y_mean)**2)
covariance = sum((x_train - x_mean)*(y_train - y_mean))
b1 = covariance/x_var
b0 = y_mean - b1*x_mean
pred_y = b0 + b1*x_test
R2 <- 1 - (sum((y_test - pred_y)**2)/sum((y_test - mean(y_test))**2))