Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow
Sebastian Raschka, Vahid Mirjalili
Categories:
Computer Science
Edition:
2nd
Language:
English
Pages:
622
ISBN 10:
1787125939
Python Machine Learning
Second Edition

Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow

Sebastian Raschka
Vahid Mirjalili

BIRMINGHAM - MUMBAI

Python Machine Learning
Second Edition

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: September 2015
Second edition: September 2017
Production reference: 3231017

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 9781787125933
www.packtpub.com

Credits

Authors: Sebastian Raschka, Vahid Mirjalili
Reviewers: Jared Huffman, Huai-En, Sun (Ryan Sun)
Acquisition Editor: Frank Pohlmann
Content Development Editor: Chris Nelson
Project Editor: Monika Sangwan
Technical Editors: Bhagyashree Rai, Nidhisha Shetty
Copy Editor: Safis Editing
Project Coordinator: Suzanne Coutinho
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Graphics: Kirk D'Penha
Production Coordinator: Arvindkumar Gupta

About the Authors

Sebastian Raschka, the author of the bestselling book, Python Machine Learning, has many years of experience with coding in Python, and he has given several seminars on the practical applications of data science, machine learning, and deep learning, including a machine learning tutorial at SciPy, the leading conference for scientific computing in Python.

While Sebastian's academic research projects are mainly centered around problem-solving in computational biology, he loves to write and talk about data science, machine learning, and Python in general, and he is motivated to help people develop data-driven solutions without necessarily requiring a machine learning background.

His work and contributions have recently been recognized by the departmental outstanding graduate student award 2016-2017 as well as the ACM Computing Reviews' Best of 2016 award. In his free time, Sebastian loves to contribute to open source projects, and the methods that he has implemented are now successfully used in machine learning competitions, such as Kaggle.

I would like to take this opportunity to thank the great Python community and developers of open source packages who helped me create the perfect environment for scientific research and data science. Also, I want to thank my parents who always encouraged and supported me in pursuing the path and career that I was so passionate about.

Special thanks to the core developers of scikit-learn. As a contributor to this project, I had the pleasure to work with great people who are not only very knowledgeable when it comes to machine learning but are also excellent programmers.

Lastly, I'd like to thank Elie Kawerk, who volunteered to review the book and provided valuable feedback on the new chapters.

Vahid Mirjalili obtained his PhD in mechanical engineering working on novel methods for large-scale, computational simulations of molecular structures. Currently, he is focusing his research efforts on applications of machine learning in various computer vision projects at the department of computer science and engineering at Michigan State University.

Vahid picked Python as his number-one choice of programming language, and throughout his academic and research career he has gained tremendous experience with coding in Python. He taught Python programming to the engineering class at Michigan State University, which gave him a chance to help students understand different data structures and develop efficient code in Python.

While Vahid's broad research interests focus on deep learning and computer vision applications, he is especially interested in leveraging deep learning techniques to extend privacy in biometric data such as face images so that information is not revealed beyond what users intend to reveal. Furthermore, he also collaborates with a team of engineers working on self-driving cars, where he designs neural network models for the fusion of multispectral images for pedestrian detection.

I would like to thank my PhD advisor, Dr. Arun Ross, for giving me the opportunity to work on novel problems in his research lab. I would also like to thank Dr. Vishnu Boddeti for inspiring my interests in deep learning and demystifying its core concepts.

About the Reviewers

Jared Huffman is an entrepreneur, gamer, storyteller, machine learning fanatic, and database aficionado. He has dedicated the past 10 years to developing software and analyzing data. His previous work has spanned a variety of topics, including network security, financial systems, and business intelligence, as well as web services, developer tools, and business strategy. Most recently, he was the founder of the data science team at Minecraft, with a focus on big data and machine learning. When not working, you can typically find him gaming or enjoying the beautiful Pacific Northwest with friends and family.

I'd like to thank Packt for giving me the opportunity to work on such a great book, my wife for the constant encouragement, and my daughter for sleeping through most of the late nights while I was reviewing and debugging code.

Huai-En, Sun (Ryan Sun) holds a master's degree in statistics from the National Chiao Tung University. He currently works as a data scientist analyzing the production line at PEGATRON. Machine learning and deep learning are his main areas of research.

www.PacktPub.com

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at customercare@packtpub.com for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

• Fully searchable across every book published by Packt
• Copy and paste, print, and bookmark content
• On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process.
To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787125939.

If you'd like to join our team of regular reviewers, you can email us at customerreviews@packtpub.com. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Table of Contents

Preface

Chapter 1: Giving Computers the Ability to Learn from Data
  Building intelligent machines to transform data into knowledge
  The three different types of machine learning
  Making predictions about the future with supervised learning
  Classification for predicting class labels
  Regression for predicting continuous outcomes
  Solving interactive problems with reinforcement learning
  Discovering hidden structures with unsupervised learning
  Finding subgroups with clustering
  Dimensionality reduction for data compression
  Introduction to the basic terminology and notations
  A roadmap for building machine learning systems
  Preprocessing – getting data into shape
  Training and selecting a predictive model
  Evaluating models and predicting unseen data instances
  Using Python for machine learning
  Installing Python and packages from the Python Package Index
  Using the Anaconda Python distribution and package manager
  Packages for scientific computing, data science, and machine learning
  Summary

Chapter 2: Training Simple Machine Learning Algorithms for Classification
  Artificial neurons – a brief glimpse into the early history of machine learning
  The formal definition of an artificial neuron
  The perceptron learning rule
  Implementing a perceptron learning algorithm in Python
  An object-oriented perceptron API
  Training a perceptron model on the Iris dataset
  Adaptive linear neurons and the convergence of learning
  Minimizing cost functions with gradient descent
  Implementing Adaline in Python
  Improving gradient descent through feature scaling
  Large-scale machine learning and stochastic gradient descent
  Summary

Chapter 3: A Tour of Machine Learning Classifiers Using scikit-learn
  Choosing a classification algorithm
  First steps with scikit-learn – training a perceptron
  Modeling class probabilities via logistic regression
  Logistic regression intuition and conditional probabilities
  Learning the weights of the logistic cost function
  Converting an Adaline implementation into an algorithm for logistic regression
  Training a logistic regression model with scikit-learn
  Tackling overfitting via regularization
  Maximum margin classification with support vector machines
  Maximum margin intuition
  Dealing with a nonlinearly separable case using slack variables
  Alternative implementations in scikit-learn
  Solving nonlinear problems using a kernel SVM
  Kernel methods for linearly inseparable data
  Using the kernel trick to find separating hyperplanes in high-dimensional space
  Decision tree learning
  Maximizing information gain – getting the most bang for your buck
  Building a decision tree
  Combining multiple decision trees via random forests
  K-nearest neighbors – a lazy learning algorithm
  Summary

Chapter 4: Building Good Training Sets – Data Preprocessing
  Dealing with missing data
  Identifying missing values in tabular data
  Eliminating samples or features with missing values
  Imputing missing values
  Understanding the scikit-learn estimator API
  Handling categorical data
  Nominal and ordinal features
  Creating an example dataset
  Mapping ordinal features
  Encoding class labels
  Performing one-hot encoding on nominal features
  Partitioning a dataset into separate training and test sets
  Bringing features onto the same scale
  Selecting meaningful features
  L1 and L2 regularization as penalties against model complexity
  A geometric interpretation of L2 regularization
  Sparse solutions with L1 regularization
  Sequential feature selection algorithms
  Assessing feature importance with random forests
  Summary

Chapter 5: Compressing Data via Dimensionality Reduction
  Unsupervised dimensionality reduction via principal component analysis
  The main steps behind principal component analysis
  Extracting the principal components step by step
  Total and explained variance
  Feature transformation
  Principal component analysis in scikit-learn
  Supervised data compression via linear discriminant analysis
  Principal component analysis versus linear discriminant analysis
  The inner workings of linear discriminant analysis
  Computing the scatter matrices
  Selecting linear discriminants for the new feature subspace
  Projecting samples onto the new feature space
  LDA via scikit-learn
  Using kernel principal component analysis for nonlinear mappings
  Kernel functions and the kernel trick
  Implementing a kernel principal component analysis in Python
  Example 1 – separating half-moon shapes
  Example 2 – separating concentric circles
  Projecting new data points
  Kernel principal component analysis in scikit-learn
  Summary

Chapter 6: Learning Best Practices for Model Evaluation and Hyperparameter Tuning
  Streamlining workflows with pipelines
  Loading the Breast Cancer Wisconsin dataset
  Combining transformers and estimators in a pipeline
  Using k-fold cross-validation to assess model performance
  The holdout method
  K-fold cross-validation
  Debugging algorithms with learning and validation curves
  Diagnosing bias and variance problems with learning curves
  Addressing over- and underfitting with validation curves
  Fine-tuning machine learning models via grid search
  Tuning hyperparameters via grid search
  Algorithm selection with nested cross-validation
  Looking at different performance evaluation metrics
  Reading a confusion matrix
  Optimizing the precision and recall of a classification model
  Plotting a receiver operating characteristic
  Scoring metrics for multiclass classification
  Dealing with class imbalance
  Summary

Chapter 7: Combining Different Models for Ensemble Learning
  Learning with ensembles
  Combining classifiers via majority vote
  Implementing a simple majority vote classifier
  Using the majority voting principle to make predictions
  Evaluating and tuning the ensemble classifier
  Bagging – building an ensemble of classifiers from bootstrap samples
  Bagging in a nutshell
  Applying bagging to classify samples in the Wine dataset
  Leveraging weak learners via adaptive boosting
  How boosting works
  Applying AdaBoost using scikit-learn
  Summary

Chapter 8: Applying Machine Learning to Sentiment Analysis
  Preparing the IMDb movie review data for text processing
  Obtaining the movie review dataset
  Preprocessing the movie dataset into a more convenient format
  Introducing the bag-of-words model
  Transforming words into feature vectors
  Assessing word relevancy via term frequency-inverse document frequency
  Cleaning text data
  Processing documents into tokens
  Training a logistic regression model for document classification
  Working with bigger data – online algorithms and out-of-core learning
  Topic modeling with Latent Dirichlet Allocation
  Decomposing text documents with LDA
  LDA with scikit-learn
  Summary

Chapter 9: Embedding a Machine Learning Model into a Web Application
  Serializing fitted scikit-learn estimators
  Setting up an SQLite database for data storage
  Developing a web application with Flask
  Our first Flask web application
  Form validation and rendering
  Setting up the directory structure
  Implementing a macro using the Jinja2 templating engine
  Adding style via CSS
  Creating the result page
  Turning the movie review classifier into a web application
  Files and folders – looking at the directory tree
  Implementing the main application as app.py
  Setting up the review form
  Creating a results page template
  Deploying the web application to a public server
  Creating a PythonAnywhere account
  Uploading the movie classifier application
  Updating the movie classifier
  Summary

Chapter 10: Predicting Continuous Target Variables with Regression Analysis
  Introducing linear regression
  Simple linear regression
  Multiple linear regression
  Exploring the Housing dataset
  Loading the Housing dataset into a data frame
  Visualizing the important characteristics of a dataset
  Looking at relationships using a correlation matrix
  Implementing an ordinary least squares linear regression model
  Solving regression for regression parameters with gradient descent
  Estimating the coefficient of a regression model via scikit-learn
  Fitting a robust regression model using RANSAC
  Evaluating the performance of linear regression models
  Using regularized methods for regression
  Turning a linear regression model into a curve – polynomial regression
  Adding polynomial terms using scikit-learn
  Modeling nonlinear relationships in the Housing dataset
  Dealing with nonlinear relationships using random forests
  Decision tree regression
  Random forest regression
  Summary

Chapter 11: Working with Unlabeled Data – Clustering Analysis
  Grouping objects by similarity using k-means
  K-means clustering using scikit-learn
  A smarter way of placing the initial cluster centroids using k-means++
  Hard versus soft clustering
  Using the elbow method to find the optimal number of clusters
  Quantifying the quality of clustering via silhouette plots
  Organizing clusters as a hierarchical tree
  Grouping clusters in bottom-up fashion
  Performing hierarchical clustering on a distance matrix
  Attaching dendrograms to a heat map
  Applying agglomerative clustering via scikit-learn
  Locating regions of high density via DBSCAN
  Summary

Chapter 12: Implementing a Multilayer Artificial Neural Network from Scratch
  Modeling complex functions with artificial neural networks
  Single-layer neural network recap
  Introducing the multilayer neural network architecture
  Activating a neural network via forward propagation
  Classifying handwritten digits
  Obtaining the MNIST dataset
  Implementing a multilayer perceptron
  Training an artificial neural network
  Computing the logistic cost function
  Developing your intuition for backpropagation
  Training neural networks via backpropagation
  About the convergence in neural networks
  A few last words about the neural network implementation
  Summary

Chapter 13: Parallelizing Neural Network Training with TensorFlow
  TensorFlow and training performance
  What is TensorFlow?
  How we will learn TensorFlow
  First steps with TensorFlow
  Working with array structures
  Developing a simple model with the low-level TensorFlow API
  Training neural networks efficiently with high-level TensorFlow APIs
  Building multilayer neural networks using TensorFlow's Layers API
  Developing a multilayer neural network with Keras
  Choosing activation functions for multilayer networks
  Logistic function recap
  Estimating class probabilities in multiclass classification via the softmax function
  Broadening the output spectrum using a hyperbolic tangent
  Rectified linear unit activation
  Summary

Chapter 14: Going Deeper – The Mechanics of TensorFlow
  Key features of TensorFlow
  TensorFlow ranks and tensors
  How to get the rank and shape of a tensor
  Understanding TensorFlow's computation graphs
  Placeholders in TensorFlow
  Defining placeholders
  Feeding placeholders with data
  Defining placeholders for data arrays with varying batch sizes
  Variables in TensorFlow
  Defining variables
  Initializing variables
  Variable scope
  Reusing variables
  Building a regression model
  Executing objects in a TensorFlow graph using their names
  Saving and restoring a model in TensorFlow
  Transforming Tensors as multidimensional data arrays
  Utilizing control flow mechanics in building graphs
  Visualizing the graph with TensorBoard
  Extending your TensorBoard experience
  Summary

Chapter 15: Classifying Images with Deep Convolutional Neural Networks
  Building blocks of convolutional neural networks
  Understanding CNNs and learning feature hierarchies
  Performing discrete convolutions
  Performing a discrete convolution in one dimension
  The effect of zero-padding in a convolution
  Determining the size of the convolution output
  Performing a discrete convolution in 2D
  Subsampling
  Putting everything together to build a CNN
  Working with multiple input or color channels
  Regularizing a neural network with dropout
  Implementing a deep convolutional neural network using TensorFlow
  The multilayer CNN architecture
  Loading and preprocessing the data
  Implementing a CNN in the TensorFlow low-level API
  Implementing a CNN in the TensorFlow Layers API
  Summary

Chapter 16: Modeling Sequential Data Using Recurrent Neural Networks
  Introducing sequential data
  Modeling sequential data – order matters
  Representing sequences
  The different categories of sequence modeling
  RNNs for modeling sequences
  Understanding the structure and flow of an RNN
  Computing activations in an RNN
  The challenges of learning long-range interactions
  LSTM units
  Implementing a multilayer RNN for sequence modeling in TensorFlow
  Project one – performing sentiment analysis of IMDb movie reviews using multilayer RNNs
  Preparing the data
  Embedding
  Building an RNN model
  The SentimentRNN class constructor
  The build method
  Step 1 – defining multilayer RNN cells
  Step 2 – defining the initial states for the RNN cells
  Step 3 – creating the RNN using the RNN cells and their states
  The train method
  The predict method
  Instantiating the SentimentRNN class
  Training and optimizing the sentiment analysis RNN model
  Project two – implementing an RNN for character-level language modeling in TensorFlow
  Preparing the data
  Building a character-level RNN model
  The constructor
  The build method
  The train method
  The sample method
  Creating and training the CharRNN Model
  The CharRNN model in the sampling mode
  Chapter and book summary

Index

Preface

Through exposure to the news and social media, you are probably aware of the fact that machine learning has become one of the most exciting technologies of our time and age.
Large companies, such as Google, Facebook, Apple, Amazon, and IBM, heavily invest in machine learning research and applications for good reasons. While it may seem that machine learning has become the buzzword of our time and age, it is certainly not a fad. This exciting field opens the way to new possibilities and has become indispensable to our daily lives. This is evident in talking to the voice assistant on our smartphones, recommending the right product for our customers, preventing credit card fraud, filtering out spam from our email inboxes, and detecting and diagnosing medical diseases; the list goes on and on.

If you want to become a machine learning practitioner, a better problem solver, or maybe even consider a career in machine learning research, then this book is for you. However, for a novice, the theoretical concepts behind machine learning can be quite overwhelming. Many practical books have been published in recent years that will help you get started in machine learning by implementing powerful learning algorithms.

Getting exposed to practical code examples and working through example applications of machine learning are a great way to dive into this field. Concrete examples help illustrate the broader concepts by putting the learned material directly into action. However, remember that with great power comes great responsibility! In addition to offering a hands-on experience with machine learning using the Python programming language and Python-based machine learning libraries, this book introduces the mathematical concepts behind machine learning algorithms, which is essential for using machine learning successfully. Thus, this book is different from a purely practical book; it is a book that discusses the necessary details regarding machine learning concepts and offers intuitive yet informative explanations of how machine learning algorithms work, how to use them, and most importantly, how to avoid the most common pitfalls.
Currently, if you type "machine learning" as a search term in Google Scholar, it returns an overwhelmingly large number of publications: 1,800,000. Of course, we cannot discuss the nitty-gritty of all the different algorithms and applications that have emerged in the last 60 years. However, in this book, we will embark on an exciting journey that covers all the essential topics and concepts to give you a head start in this field. If you find that your thirst for knowledge is not satisfied, this book references many useful resources that can be used to follow up on the essential breakthroughs in this field.

If you have already studied machine learning theory in detail, this book will show you how to put your knowledge into practice. If you have used machine learning techniques before and want to gain more insight into how machine learning actually works, this book is for you. Don't worry if you are completely new to the machine learning field; you have even more reason to be excited. Here is a promise that machine learning will change the way you think about the problems you want to solve and will show you how to tackle them by unlocking the power of data.

Before we dive deeper into the machine learning field, let's answer your most important question, "Why Python?" The answer is simple: it is powerful yet very accessible. Python has become the most popular programming language for data science because it allows us to forget about the tedious parts of programming and offers us an environment where we can quickly jot down our ideas and put concepts directly into action.

We, the authors, can truly say that the study of machine learning has made us better scientists, thinkers, and problem solvers. In this book, we want to share this knowledge with you. Knowledge is gained by learning. The key is our enthusiasm, and the real mastery of skills can only be achieved by practice.
The road ahead may be bumpy on occasions and some topics may be more challenging than others, but we hope that you will embrace this opportunity and focus on the reward. Remember that we are on this journey together, and throughout this book, we will add many powerful techniques to your arsenal that will help us solve even the toughest problems the data-driven way.

What this book covers

Chapter 1, Giving Computers the Ability to Learn from Data, introduces you to the main subareas of machine learning in order to tackle various problem tasks. In addition, it discusses the essential steps for creating a typical machine learning model by building a pipeline that will guide us through the following chapters.

Chapter 2, Training Simple Machine Learning Algorithms for Classification, goes back to the origins of machine learning and introduces binary perceptron classifiers and adaptive linear neurons. This chapter is a gentle introduction to the fundamentals of pattern classification and focuses on the interplay of optimization algorithms and machine learning.

Chapter 3, A Tour of Machine Learning Classifiers Using scikit-learn, describes the essential machine learning algorithms for classification and provides practical examples using one of the most popular and comprehensive open source machine learning libraries: scikit-learn.

Chapter 4, Building Good Training Sets – Data Preprocessing, discusses how to deal with the most common problems in unprocessed datasets, such as missing data. It also discusses several approaches to identify the most informative features in datasets and teaches you how to prepare variables of different types as proper input for machine learning algorithms.

Chapter 5, Compressing Data via Dimensionality Reduction, describes the essential techniques to reduce the number of features in a dataset to smaller sets while retaining most of their useful and discriminatory information. It discusses the standard approach to dimensionality reduction via principal component analysis and compares it to supervised and nonlinear transformation techniques.

Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning, discusses the dos and don'ts for estimating the performance of predictive models. Moreover, it discusses different metrics for measuring the performance of our models and techniques to fine-tune machine learning algorithms.

Chapter 7, Combining Different Models for Ensemble Learning, introduces you to the different concepts of combining multiple learning algorithms effectively. It teaches you how to build ensembles of experts to overcome the weaknesses of individual learners, resulting in more accurate and reliable predictions.

Chapter 8, Applying Machine Learning to Sentiment Analysis, discusses the essential steps to transform textual data into meaningful representations for machine learning algorithms to predict the opinions of people based on their writing.

Chapter 9, Embedding a Machine Learning Model into a Web Application, continues with the predictive model from the previous chapter and walks you through the essential steps of developing web applications with embedded machine learning models.

Chapter 10, Predicting Continuous Target Variables with Regression Analysis, discusses the essential techniques for modeling linear relationships between explanatory and response variables to make predictions on a continuous scale. After introducing different linear models, it also talks about polynomial regression and tree-based approaches.

Chapter 11, Working with Unlabeled Data – Clustering Analysis, shifts the focus to a different subarea of machine learning, unsupervised learning. We apply algorithms from three fundamental families of clustering algorithms to find groups of objects that share a certain degree of similarity.

Chapter 12, Implementing a Multilayer Artificial Neural Network from Scratch, extends the concept of gradient-based optimization, which we first introduced in Chapter 2, Training Simple Machine Learning Algorithms for Classification, to build powerful, multilayer neural networks based on the popular backpropagation algorithm in Python.

Chapter 13, Parallelizing Neural Network Training with TensorFlow, builds upon the knowledge from the previous chapter to provide you with a practical guide for training neural networks more efficiently. The focus of this chapter is on TensorFlow, an open source Python library that allows us to utilize multiple cores of modern GPUs.

Chapter 14, Going Deeper – The Mechanics of TensorFlow, covers TensorFlow in greater detail, explaining its core concepts of computational graphs and sessions. In addition, this chapter covers topics such as saving and visualizing neural network graphs, which will come in very handy during the remaining chapters of this book.

Chapter 15, Classifying Images with Deep Convolutional Neural Networks, discusses deep neural network architectures that have become the new standard in computer vision and image recognition fields: convolutional neural networks. This chapter will discuss the main concepts behind convolutional layers as a feature extractor and apply convolutional neural network architectures to an image classification task to achieve almost perfect classification accuracy.

Chapter 16, Modeling Sequential Data Using Recurrent Neural Networks, introduces another popular neural network architecture for deep learning that is especially well suited for working with sequential data and time series data. In this chapter, we will apply different recurrent neural network architectures to text data. We will start with a sentiment analysis task as a warm-up exercise and will learn how to generate entirely new text.
What you need for this book

The execution of the code examples provided in this book requires an installation of Python 3.6.0 or newer on macOS, Linux, or Microsoft Windows. We will make frequent use of Python's essential libraries for scientific computing throughout this book, including SciPy, NumPy, scikit-learn, Matplotlib, and pandas.

The first chapter will provide you with instructions and useful tips to set up your Python environment and these core libraries. We will add additional libraries to our repertoire, and installation instructions are provided in the respective chapters: the NLTK library for natural language processing (Chapter 8, Applying Machine Learning to Sentiment Analysis), the Flask web framework (Chapter 9, Embedding a Machine Learning Model into a Web Application), the Seaborn library for statistical data visualization (Chapter 10, Predicting Continuous Target Variables with Regression Analysis), and TensorFlow for efficient neural network training on graphics processing units (Chapters 13 to 16).

Who this book is for

If you want to find out how to use Python to start answering critical questions of your data, pick up Python Machine Learning, Second Edition. Whether you want to start from scratch or extend your data science knowledge, this is an essential and unmissable resource.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Using the out_file=None setting, we directly assigned the dot data to a dot_data variable, instead of writing an intermediate tree.dot file to disk."

A block of code is set as follows:

>>> from sklearn.neighbors import KNeighborsClassifier
>>> knn = KNeighborsClassifier(n_neighbors=5, p=2,
...                            metric='minkowski')
>>> knn.fit(X_train_std, y_train)
>>> plot_decision_regions(X_combined_std, y_combined,
...                       classifier=knn, test_idx=range(105,150))
>>> plt.xlabel('petal length [standardized]')
>>> plt.ylabel('petal width [standardized]')
>>> plt.show()

Any command-line input or output is written as follows:

pip3 install graphviz

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "After we click on the Dashboard button in the top-right corner, we have access to the control panel shown at the top of the page."

Warnings or important notes appear in a box like this.

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book, including what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply email feedback@packtpub.com, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your email address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's web page at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

• WinRAR / 7-Zip for Windows
• Zipeg / iZip / UnRarX for Mac
• 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/PythonMachineLearningSecondEdition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from http://www.packtpub.com/sites/default/files/downloads/PythonMachineLearningSecondEdition_ColorImages.pdf.

In addition, lower resolution color images are embedded in the code notebooks of this book that come bundled with the example code files.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submiterrata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at copyright@packtpub.com with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at questions@packtpub.com, and we will do our best to address the problem.

Giving Computers the Ability to Learn from Data

In my opinion, machine learning, the application and science of algorithms that make sense of data, is the most exciting field of all the computer sciences! We are living in an age where data comes in abundance; using self-learning algorithms from the field of machine learning, we can turn this data into knowledge. Thanks to the many powerful open source libraries that have been developed in recent years, there has probably never been a better time to break into the machine learning field and learn how to utilize powerful algorithms to spot patterns in data and make predictions about future events.

In this chapter, you will learn about the main concepts and different types of machine learning.
Together with a basic introduction to the relevant terminology, we will lay the groundwork for successfully using machine learning techniques for practical problem solving.

In this chapter, we will cover the following topics:

• The general concepts of machine learning
• The three types of learning and basic terminology
• The building blocks for successfully designing machine learning systems
• Installing and setting up Python for data analysis and machine learning

Building intelligent machines to transform data into knowledge

In this age of modern technology, there is one resource that we have in abundance: a large amount of structured and unstructured data. In the second half of the twentieth century, machine learning evolved as a subfield of Artificial Intelligence (AI) that involved self-learning algorithms that derived knowledge from data in order to make predictions. Instead of requiring humans to manually derive rules and build models from analyzing large amounts of data, machine learning offers a more efficient alternative for capturing the knowledge in data to gradually improve the performance of predictive models and make data-driven decisions.

Not only is machine learning becoming increasingly important in computer science research, but it also plays an ever greater role in our everyday lives. Thanks to machine learning, we enjoy robust email spam filters, convenient text and voice recognition software, reliable web search engines, challenging chess-playing programs, and, hopefully soon, safe and efficient self-driving cars.

The three different types of machine learning

In this section, we will take a look at the three types of machine learning: supervised learning, unsupervised learning, and reinforcement learning.
We will learn about the fundamental differences between the three different learning types and, using conceptual examples, we will develop an intuition for the practical problem domains where these can be applied.

Making predictions about the future with supervised learning

The main goal in supervised learning is to learn a model from labeled training data that allows us to make predictions about unseen or future data. Here, the term supervised refers to a set of samples where the desired output signals (labels) are already known.

Considering the example of email spam filtering, we can train a model using a supervised machine learning algorithm on a corpus of labeled emails, emails that are correctly marked as spam or not-spam, to predict whether a new email belongs to either of the two categories. A supervised learning task with discrete class labels, such as in the previous email spam filtering example, is also called a classification task. Another subcategory of supervised learning is regression, where the outcome signal is a continuous value.

Classification for predicting class labels

Classification is a subcategory of supervised learning where the goal is to predict the categorical class labels of new instances, based on past observations. Those class labels are discrete, unordered values that can be understood as the group memberships of the instances. The previously mentioned example of email spam detection represents a typical example of a binary classification task, where the machine learning algorithm learns a set of rules in order to distinguish between two possible classes: spam and non-spam emails.

However, the set of class labels does not have to be of a binary nature. The predictive model learned by a supervised learning algorithm can assign any class label that was presented in the training dataset to a new, unlabeled instance.
A typical example of a multiclass classification task is handwritten character recognition. Here, we could collect a training dataset that consists of multiple handwritten examples of each letter in the alphabet. Now, if a user provides a new handwritten character via an input device, our predictive model will be able to predict the correct letter in the alphabet with certain accuracy. However, our machine learning system would be unable to correctly recognize any of the digits zero to nine, for example, if they were not part of our training dataset.

The following figure illustrates the concept of a binary classification task given 30 training samples; 15 training samples are labeled as negative class (minus signs) and 15 training samples are labeled as positive class (plus signs). In this scenario, our dataset is two-dimensional, which means that each sample has two values associated with it: $x_1$ and $x_2$. Now, we can use a supervised machine learning algorithm to learn a rule (the decision boundary represented as a dashed line) that can separate those two classes and classify new data into each of those two categories given its $x_1$ and $x_2$ values.

Regression for predicting continuous outcomes

We learned in the previous section that the task of classification is to assign categorical, unordered labels to instances. A second type of supervised learning is the prediction of continuous outcomes, which is also called regression analysis. In regression analysis, we are given a number of predictor (explanatory) variables and a continuous response variable (outcome or target), and we try to find a relationship between those variables that allows us to predict an outcome.

For example, let's assume that we are interested in predicting the math SAT scores of our students.
If there is a relationship between the time spent studying for the test and the final scores, we could use it as training data to learn a model that uses the study time to predict the test scores of future students who are planning to take this test.

The term regression was devised by Francis Galton in his article Regression towards Mediocrity in Hereditary Stature in 1886. Galton described the biological phenomenon that the variance of height in a population does not increase over time. He observed that the height of parents is not passed on to their children, but instead the children's height is regressing towards the population mean.

The following figure illustrates the concept of linear regression. Given a predictor variable x and a response variable y, we fit a straight line to this data that minimizes the distance (most commonly the average squared distance) between the sample points and the fitted line. We can now use the intercept and slope learned from this data to predict the outcome variable of new data.

Solving interactive problems with reinforcement learning

Another type of machine learning is reinforcement learning. In reinforcement learning, the goal is to develop a system (agent) that improves its performance based on interactions with the environment. Since the information about the current state of the environment typically also includes a so-called reward signal, we can think of reinforcement learning as a field related to supervised learning. However, in reinforcement learning this feedback is not the correct ground truth label or value, but a measure of how well the action was measured by a reward function. Through its interaction with the environment, an agent can then use reinforcement learning to learn a series of actions that maximizes this reward via an exploratory trial-and-error approach or deliberative planning.

A popular example of reinforcement learning is a chess engine.
Here, the agent decides upon a series of moves depending on the state of the board (the environment), and the reward can be defined as win or lose at the end of the game.

There are many different subtypes of reinforcement learning. However, a general scheme is that the agent in reinforcement learning tries to maximize the reward by a series of interactions with the environment. Each state can be associated with a positive or negative reward, and a reward can be defined as accomplishing an overall goal, such as winning or losing a game of chess. For instance, in chess the outcome of each move can be thought of as a different state of the environment.

To explore the chess example further, let's think of visiting certain locations on the chess board as being associated with a positive event, for instance, removing an opponent's chess piece from the board or threatening the queen. Other positions, however, are associated with a negative event, such as losing a chess piece to the opponent in the following turn. Now, not every turn results in the removal of a chess piece, and reinforcement learning is concerned with learning the series of steps by maximizing a reward based on immediate and delayed feedback.

While this section provides a basic overview of reinforcement learning, please note that applications of reinforcement learning are beyond the scope of this book, which primarily focuses on classification, regression analysis, and clustering.

Discovering hidden structures with unsupervised learning

In supervised learning, we know the right answer beforehand when we train our model, and in reinforcement learning, we define a measure of reward for particular actions by the agent. In unsupervised learning, however, we are dealing with unlabeled data or data of unknown structure.
Using unsupervised learning techniques, we are able to explore the structure of our data to extract meaningful information without the guidance of a known outcome variable or reward function.

Finding subgroups with clustering

Clustering is an exploratory data analysis technique that allows us to organize a pile of information into meaningful subgroups (clusters) without having any prior knowledge of their group memberships. Each cluster that arises during the analysis defines a group of objects that share a certain degree of similarity but are more dissimilar to objects in other clusters, which is why clustering is also sometimes called unsupervised classification. Clustering is a great technique for structuring information and deriving meaningful relationships from data. For example, it allows marketers to discover customer groups based on their interests, in order to develop distinct marketing programs.

The following figure illustrates how clustering can be applied to organizing unlabeled data into three distinct groups based on the similarity of their features $x_1$ and $x_2$.

Dimensionality reduction for data compression

Another subfield of unsupervised learning is dimensionality reduction. Often we are working with data of high dimensionality, where each observation comes with a high number of measurements, which can present a challenge for limited storage space and the computational performance of machine learning algorithms. Unsupervised dimensionality reduction is a commonly used approach in feature preprocessing to remove noise from data, which can also degrade the predictive performance of certain algorithms, and compress the data onto a smaller dimensional subspace while retaining most of the relevant information.
Sometimes, dimensionality reduction can also be useful for visualizing data. For example, a high-dimensional feature set can be projected onto one-, two-, or three-dimensional feature spaces in order to visualize it via 3D or 2D scatterplots or histograms. The following figure shows an example where nonlinear dimensionality reduction was applied to compress a 3D Swiss Roll onto a new 2D feature subspace.

Introduction to the basic terminology and notations

Now that we have discussed the three broad categories of machine learning (supervised, unsupervised, and reinforcement learning), let us have a look at the basic terminology that we will be using throughout the book. The following table depicts an excerpt of the Iris dataset, which is a classic example in the field of machine learning. The Iris dataset contains the measurements of 150 Iris flowers from three different species: Setosa, Versicolor, and Virginica. Here, each flower sample represents one row in our dataset, and the flower measurements in centimeters are stored as columns, which we also call the features of the dataset.

To keep the notation and implementation simple yet efficient, we will make use of some of the basics of linear algebra. In the following chapters, we will use a matrix and vector notation to refer to our data. We will follow the common convention to represent each sample as a separate row in a feature matrix X, where each feature is stored as a separate column.

The Iris dataset consisting of 150 samples and four features can then be written as a 150 × 4 matrix $\boldsymbol{X} \in \mathbb{R}^{150 \times 4}$:

$$
\begin{bmatrix}
x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & x_4^{(1)} \\
x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & x_4^{(2)} \\
\vdots & \vdots & \vdots & \vdots \\
x_1^{(150)} & x_2^{(150)} & x_3^{(150)} & x_4^{(150)}
\end{bmatrix}
$$

For the rest of this book, unless noted otherwise, we will use the superscript i to refer to the ith training sample, and the subscript j to refer to the jth dimension of the training dataset.
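The superscript and subscript convention maps directly onto row and column indexing. The following is a minimal sketch using plain Python lists; the three rows are illustrative values in the style of the Iris measurements, the assumed column order is sepal length, sepal width, petal length, petal width, and the helper names `element`, `sample_row`, and `feature_column` are hypothetical, not part of any library. Note that the mathematical notation counts from 1, whereas Python counts from 0.

```python
# A tiny stand-in for the 150 x 4 feature matrix X: one row per flower sample,
# one column per feature (illustrative values, in centimeters).
X = [
    [5.1, 3.5, 1.4, 0.2],  # sample 1
    [4.9, 3.0, 1.4, 0.2],  # sample 2
    [4.7, 3.2, 1.3, 0.2],  # sample 3
]


def element(X, i, j):
    """Return x_j^(i): feature j of sample i, using the 1-based indices of the text."""
    return X[i - 1][j - 1]


def sample_row(X, i):
    """Return the row vector x^(i) holding all features of sample i."""
    return X[i - 1]


def feature_column(X, j):
    """Return the column vector x_j holding feature j of every sample."""
    return [row[j - 1] for row in X]


print(element(X, 3, 1))       # first feature of the third sample
print(sample_row(X, 2))       # all four features of the second sample
print(feature_column(X, 1))   # the first feature across all samples
```

With NumPy arrays, which the book uses for most of its examples, the same lookups would typically be written as `X[i - 1, j - 1]`, `X[i - 1]`, and `X[:, j - 1]`.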
We use lowercase, boldface letters to refer to vectors ($\boldsymbol{x} \in \mathbb{R}^{n \times 1}$) and uppercase, boldface letters to refer to matrices ($\boldsymbol{X} \in \mathbb{R}^{n \times m}$). To refer to single elements in a vector or matrix, we write the letters in italics ($x^{(n)}$ or $x^{(n)}_{(m)}$, respectively).

For example, $x_1^{(150)}$ refers to the first dimension of flower sample 150, the sepal length. Thus, each row in this feature matrix represents one flower instance and can be written as a four-dimensional row vector $\boldsymbol{x}^{(i)} \in \mathbb{R}^{1 \times 4}$:

$$
\boldsymbol{x}^{(i)} = \begin{bmatrix} x_1^{(i)} & x_2^{(i)} & x_3^{(i)} & x_4^{(i)} \end{bmatrix}
$$

And each feature dimension is a 150-dimensional column vector $\boldsymbol{x}_j \in \mathbb{R}^{150 \times 1}$. For example:

$$
\boldsymbol{x}_j = \begin{bmatrix} x_j^{(1)} \\ x_j^{(2)} \\ \vdots \\ x_j^{(150)} \end{bmatrix}
$$

Similarly, we store the target variables (here, class labels) as a 150-dimensional column vector:

$$
\boldsymbol{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(150)} \end{bmatrix} \quad \left( y \in \{\text{Setosa}, \text{Versicolor}, \text{Virginica}\} \right)
$$

A roadmap for building machine learning systems

In previous sections, we discussed the basic concepts of machine learning and the three different types of learning. In this section, we will discuss the other important parts of a machine learning system accompanying the learning algorithm. The following diagram shows a typical workflow for using machine learning in predictive modeling, which we will discuss in the following subsections.

Preprocessing – getting data into shape

Let's begin with discussing the roadmap for building machine learning systems. Raw data rarely comes in the form and shape that is necessary for the optimal performance of a learning algorithm. Thus, the preprocessing of the data is one of the most crucial steps in any machine learning application. If we take the Iris flower dataset from the previous section as an example, we can think of the raw data as a series of flower images from which we want to extract meaningful features. Useful features could be the color, the hue, the intensity of the flowers, the height, and the flower lengths and widths.
Many machine learning algorithms also require that the selected features are on the same scale for optimal performance, which is often achieved by transforming the features into the range [0, 1] or to a standard normal distribution with zero mean and unit variance, as we will see in later chapters.

Some of the selected features may be highly correlated and therefore redundant to a certain degree. In those cases, dimensionality reduction techniques are useful for compressing the features onto a lower dimensional subspace. Reducing the dimensionality of our feature space has the advantage that less storage space is required, and the learning algorithm can run much faster. In certain cases, dimensionality reduction can also improve the predictive performance of a model if the dataset contains a large number of irrelevant features (or noise), that is, if the dataset has a low signal-to-noise ratio.

To determine whether our machine learning algorithm not only performs well on the training set but also generalizes well to new data, we also want to randomly divide the dataset into a separate training and test set. We use the training set to train and optimize our machine learning model, while we keep the test set until the very end to evaluate the final model.

Training and selecting a predictive model

As we will see in later chapters, many different machine learning algorithms have been developed to solve different problem tasks. An important point that can be summarized from David Wolpert's famous No free lunch theorems is that we can't get learning "for free" (The Lack of A Priori Distinctions Between Learning Algorithms, D.H. Wolpert, 1996; No free lunch theorems for optimization, D.H. Wolpert and W.G. Macready, 1997). Intuitively, we can relate this concept to the popular saying, I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail (Abraham Maslow, 1966).
For example, each classification algorithm has its inherent biases, and no single classification model enjoys superiority if we don't make any assumptions about the task. In practice, it is therefore essential to compare at least a handful of different algorithms in order to train and select the best performing model. But before we can compare different models, we first have to decide upon a metric to measure performance. One commonly used metric is classification accuracy, which is defined as the proportion of correctly classified instances.

One legitimate question to ask is this: how do we know which model performs well on the final test dataset and real-world data if we don't use this test set for the model selection, but keep it for the final model evaluation? In order to address the issue embedded in this question, different cross-validation techniques can be used where the training dataset is further divided into training and validation subsets in order to estimate the generalization performance of the model. Finally, we also cannot expect that the default parameters of the different learning algorithms provided by software libraries are optimal for our specific problem task. Therefore, we will make frequent use of hyperparameter optimization techniques that help us to fine-tune the performance of our model in later chapters. Intuitively, we can think of those hyperparameters as parameters that are not learned from the data but represent the knobs of a model that we can turn to improve its performance. This will become much clearer in later chapters when we see actual examples.

Evaluating models and predicting unseen data instances

After we have selected a model that has been fitted on the training dataset, we can use the test dataset to estimate how well it performs on this unseen data to estimate the generalization error. If we are satisfied with its performance, we can now use this model to predict new, future data.
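The roadmap described so far (random train/test split, feature scaling, model fitting, and a final accuracy estimate on held-out data) can be sketched in a few lines. This is only an illustration under stated assumptions: the two-class dataset is synthetic, the nearest-centroid classifier is a deliberately simple stand-in rather than an algorithm taken from this chapter, and all variable and function names are hypothetical. It uses only the Python standard library so that it runs without additional packages.

```python
import random
import statistics

# Hypothetical two-feature, two-class data; feature 2 lives on a ten times larger scale.
rng = random.Random(0)
samples, labels = [], []
for label, (c1, c2) in enumerate([(1.0, 10.0), (3.0, 30.0)]):
    for _ in range(20):
        samples.append([c1 + rng.uniform(-0.4, 0.4), c2 + rng.uniform(-4.0, 4.0)])
        labels.append(label)

# 1) Randomly divide the dataset into a training set (70%) and a test set (30%).
indices = list(range(len(samples)))
rng.shuffle(indices)
n_test = round(0.3 * len(samples))
test_idx, train_idx = indices[:n_test], indices[n_test:]
X_train = [samples[i] for i in train_idx]
y_train = [labels[i] for i in train_idx]
X_test = [samples[i] for i in test_idx]
y_test = [labels[i] for i in test_idx]

# 2) Estimate the scaling parameters (mean, standard deviation) from the training set only.
means = [statistics.fmean(col) for col in zip(*X_train)]
stds = [statistics.pstdev(col) for col in zip(*X_train)]


def standardize(rows):
    """Apply the training-set mean and standard deviation to any rows."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


X_train_std = standardize(X_train)
X_test_std = standardize(X_test)  # reuses the training parameters

# 3) Fit a simple stand-in model: one mean vector (centroid) per class.
centroids = {}
for label in sorted(set(y_train)):
    members = [row for row, y in zip(X_train_std, y_train) if y == label]
    centroids[label] = [statistics.fmean(col) for col in zip(*members)]


def predict(row):
    """Return the label of the closest class centroid (squared Euclidean distance)."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(row, centroids[label]))
    return min(centroids, key=squared_distance)


# 4) Evaluate once on the held-out test set: accuracy = correct predictions / all predictions.
y_pred = [predict(row) for row in X_test_std]
test_accuracy = sum(p == t for p, t in zip(y_pred, y_test)) / len(y_test)
print(f"test accuracy: {test_accuracy:.2f}")
```

The design point to notice is step 2: the mean and standard deviation come from the training rows alone, and the test rows are transformed with those same values. In practice, the book relies on scikit-learn for these steps; the hand-written version here only makes the data flow explicit.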
It is important to note that the parameters for the previously mentioned procedures, such as feature scaling and dimensionality reduction, are solely obtained from the training dataset, and the same parameters are later reapplied to transform the test dataset, as well as any new data samples. Otherwise, the performance measured on the test data may be overly optimistic.

Using Python for machine learning

Python is one of the most popular programming languages for data science and therefore enjoys a large number of useful add-on libraries developed by its great developer and open-source community.

Although the performance of interpreted languages, such as Python, for computation-intensive tasks is inferior to lower-level programming languages, extension libraries such as NumPy and SciPy have been developed that build upon lower-layer Fortran and C implementations for fast and vectorized operations on multidimensional arrays.

For machine learning programming tasks, we will mostly refer to the scikit-learn library, which is currently one of the most popular and accessible open source machine learning libraries.

Installing Python and packages from the Python Package Index

Python is available for all three major operating systems (Microsoft Windows, macOS, and Linux), and the installer, as well as the documentation, can be downloaded from the official Python website: https://www.python.org.

This book is written for Python version 3.5.2 or higher, and it is recommended you use the most recent version of Python 3 that is currently available, although most of the code examples may also be compatible with Python 2.7.13 or higher. If you decide to use Python 2.7 to execute the code examples, please make sure that you know about the major differences between the two Python versions. A good summary of the differences between Python 3.5 and 2.7 can be found at https://wiki.python.org/moin/Python2orPython3.
The additional packages that we will be using throughout this book can be installed via the pip installer program, which has been bundled with the official Python installers since Python 3.4. More information about pip can be found at https://docs.python.org/3/installing/index.html.

After we have successfully installed Python, we can execute pip from the Terminal to install additional Python packages:

pip install SomePackage

Already installed packages can be updated via the --upgrade flag:

pip install SomePackage --upgrade

Using the Anaconda Python distribution and package manager

A highly recommended alternative Python distribution for scientific computing is Anaconda by Continuum Analytics. Anaconda is a free (including for commercial use), enterprise-ready Python distribution that bundles all the essential Python packages for data science, math, and engineering in one user-friendly cross-platform distribution. The Anaconda installer can be downloaded at http://continuum.io/downloads, and an Anaconda quickstart guide is available at https://conda.io/docs/testdrive.html.

After successfully installing Anaconda, we can install new Python packages using the following command:

conda install SomePackage

Existing packages can be updated using the following command:

conda update SomePackage

Packages for scientific computing, data science, and machine learning

Throughout this book, we will mainly use NumPy's multidimensional arrays to store and manipulate data. Occasionally, we will make use of pandas, which is a library built on top of NumPy that provides additional higher-level data manipulation tools that make working with tabular data even more convenient. To augment our learning experience and visualize quantitative data, which is often extremely useful to intuitively make sense of it, we will use the very customizable Matplotlib library.

The version numbers of the major Python packages that were used for writing this book are mentioned in the following list.
Please make sure that the version numbers of your installed packages are equal to, or greater than, those version numbers to ensure the code examples run correctly:

• NumPy 1.12.1
• SciPy 0.19.0
• scikit-learn 0.18.1
• Matplotlib 2.0.2
• pandas 0.20.1

Summary

In this chapter, we explored machine learning at a very high level and familiarized ourselves with the big picture and major concepts that we are going to explore in the following chapters in more detail. We learned that supervised learning is composed of two important subfields: classification and regression. While classification models allow us to categorize objects into known classes, we can use regression analysis to predict the continuous outcomes of target variables. Unsupervised learning not only offers useful techniques for discovering structures in unlabeled data, but it can also be useful for data compression in feature preprocessing steps. We briefly went over the typical roadmap for applying machine learning to problem tasks, which we will use as a foundation for deeper discussions and hands-on examples in the following chapters. Finally, we set up our Python environment and installed and updated the required packages to get ready to see machine learning in action.

Later in this book, in addition to machine learning itself, we will also introduce different techniques to preprocess our dataset, which will help us to get the best performance out of different machine learning algorithms. While we will cover classification algorithms quite extensively throughout the book, we will also explore different techniques for regression analysis and clustering.

We have an exciting journey ahead, covering many powerful techniques in the vast field of machine learning. However, we will approach machine learning one step at a time, building upon our knowledge gradually throughout the chapters of this book.
In the following chapter, we will start this journey by implementing one of the earliest machine learning algorithms for classification, which will prepare us for Chapter 3, A Tour of Machine Learning Classifiers Using scikit-learn, where we cover more advanced machine learning algorithms using the scikit-learn open source machine learning library.

Training Simple Machine Learning Algorithms for Classification

In this chapter, we will make use of two of the first algorithmically described machine learning algorithms for classification, the perceptron and adaptive linear neurons. We will start by implementing a perceptron step by step in Python and training it to classify different flower species in the Iris dataset. This will help us understand the concept of machine learning algorithms for classification and how they can be efficiently implemented in Python.

Discussing the basics of optimization using adaptive linear neurons will then lay the groundwork for using more powerful classifiers via the scikit-learn machine learning library in Chapter 3, A Tour of Machine Learning Classifiers Using scikit-learn.

The topics that we will cover in this chapter are as follows:

• Building an intuition for machine learning algorithms
• Using pandas, NumPy, and Matplotlib to read in, process, and visualize data
• Implementing linear classification algorithms in Python

Artificial neurons – a brief glimpse into the early history of machine learning

Before we discuss the perceptron and related algorithms in more detail, let us take a brief tour through the early beginnings of machine learning. Trying to understand how the biological brain works in order to design AI, Warren McCulloch and Walter Pitts published the first concept of a simplified brain cell, the so-called McCulloch-Pitts (MCP) neuron, in 1943 (A Logical Calculus of the Ideas Immanent in Nervous Activity, W. S. McCulloch and W.
Pitts, Bulletin of Mathematical Biophysics, 5(4): 115-133, 1943). Neurons are interconnected nerve cells in the brain that are involved in the processing and transmitting of chemical and electrical signals, which is illustrated in the following figure:

McCulloch and Pitts described such a nerve cell as a simple logic gate with binary outputs; multiple signals arrive at the dendrites, are then integrated into the cell body, and, if the accumulated signal exceeds a certain threshold, an output signal is generated that will be passed on by the axon.

Only a few years later, Frank Rosenblatt published the first concept of the perceptron learning rule based on the MCP neuron model (The Perceptron: A Perceiving and Recognizing Automaton, F. Rosenblatt, Cornell Aeronautical Laboratory, 1957). With his perceptron rule, Rosenblatt proposed an algorithm that would automatically learn the optimal weight coefficients that are then multiplied with the input features in order to make the decision of whether a neuron fires or not. In the context of supervised learning and classification, such an algorithm could then be used to predict if a sample belongs to one class or the other.

The formal definition of an artificial neuron

More formally, we can put the idea behind artificial neurons into the context of a binary classification task where we refer to our two classes as 1 (positive class) and -1 (negative class) for simplicity. We can then define a decision function $\phi(z)$ that takes a linear combination of certain input values $\boldsymbol{x}$ and a corresponding weight vector $\boldsymbol{w}$, where $z$ is the so-called net input $z = w_1 x_1 + \dots + w_m x_m$:

$$\boldsymbol{w} = \begin{bmatrix} w_1 \\ \vdots \\ w_m \end{bmatrix}, \quad \boldsymbol{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix}$$

Now, if the net input of a particular sample $\boldsymbol{x}^{(i)}$ is greater than a defined threshold $\theta$, we predict class 1, and class -1 otherwise.
In the perceptron algorithm, the decision function $\phi(\cdot)$ is a variant of a unit step function:

$$\phi(z) = \begin{cases} 1 & \text{if } z \geq \theta \\ -1 & \text{otherwise} \end{cases}$$

For simplicity, we can bring the threshold $\theta$ to the left side of the equation and define a weight-zero as $w_0 = -\theta$ and $x_0 = 1$ so that we write $z$ in a more compact form:

$$z = w_0 x_0 + w_1 x_1 + \dots + w_m x_m = \boldsymbol{w}^T \boldsymbol{x}$$

And:

$$\phi(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ -1 & \text{otherwise} \end{cases}$$

In machine learning literature, the negative threshold, or weight, $w_0 = -\theta$, is usually called the bias unit.

In the following sections, we will often make use of basic notations from linear algebra. For example, we will abbreviate the sum of the products of the values in $\boldsymbol{x}$ and $\boldsymbol{w}$ using a vector dot product, whereas superscript $T$ stands for transpose, which is an operation that transforms a column vector into a row vector and vice versa:

$$z = w_0 x_0 + w_1 x_1 + \dots + w_m x_m = \sum_{j=0}^{m} x_j w_j = \boldsymbol{w}^T \boldsymbol{x}$$

For example:

$$\begin{bmatrix} 1 & 2 & 3 \end{bmatrix} \times \begin{bmatrix} 4 \\ 5 \\ 6 \end{bmatrix} = 1 \times 4 + 2 \times 5 + 3 \times 6 = 32$$

Furthermore, the transpose operation can also be applied to a matrix to reflect it over its diagonal, for example:

$$\begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}^T = \begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \end{bmatrix}$$

In this book, we will only use very basic concepts from linear algebra; however, if you need a quick refresher, please take a look at Zico Kolter's excellent Linear Algebra Review and Reference, which is freely available at http://www.cs.cmu.edu/~zkolter/course/linalg/linalg_notes.pdf.

The following figure illustrates how the net input $z = \boldsymbol{w}^T \boldsymbol{x}$ is squashed into a binary output (-1 or 1) by the decision function of the perceptron (left subfigure) and how it can be used to discriminate between two linearly separable classes (right subfigure):

The perceptron learning rule

The whole idea behind the MCP neuron and Rosenblatt's thresholded perceptron model is to use a reductionist approach to mimic how a single neuron in the brain works: it either fires or it doesn't.
Thus, Rosenblatt's initial perceptron rule is fairly simple and can be summarized by the following steps:

1. Initialize the weights to 0 or small random numbers.
2. For each training sample $\boldsymbol{x}^{(i)}$:
   a. Compute the output value $\hat{y}$.
   b. Update the weights.

Here, the output value is the class label predicted by the unit step function that we defined earlier, and the simultaneous update of each weight $w_j$ in the weight vector $\boldsymbol{w}$ can be more formally written as:

$$w_j := w_j + \Delta w_j$$

The value of $\Delta w_j$, which is used to update the weight $w_j$, is calculated by the perceptron learning rule:

$$\Delta w_j = \eta \left( y^{(i)} - \hat{y}^{(i)} \right) x_j^{(i)}$$

Where $\eta$ is the learning rate (typically a constant between 0.0 and 1.0), $y^{(i)}$ is the true class label of the ith training sample, and $\hat{y}^{(i)}$ is the predicted class label. It is important to note that all weights in the weight vector are being updated simultaneously, which means that we don't recompute $\hat{y}^{(i)}$ before all of the weights have been updated by their respective $\Delta w_j$. Concretely, for a two-dimensional dataset, we would write the update as:

$$\Delta w_0 = \eta \left( y^{(i)} - \text{output}^{(i)} \right)$$

$$\Delta w_1 = \eta \left( y^{(i)} - \text{output}^{(i)} \right) x_1^{(i)}$$

$$\Delta w_2 = \eta \left( y^{(i)} - \text{output}^{(i)} \right) x_2^{(i)}$$

Before we implement the perceptron rule in Python, let us go through a simple thought experiment to illustrate how beautifully simple this learning rule really is.
In the two scenarios where the perceptron predicts the class label correctly, the weights remain unchanged:

$$\Delta w_j = \eta \left( -1 - (-1) \right) x_j^{(i)} = 0$$

$$\Delta w_j = \eta \left( 1 - 1 \right) x_j^{(i)} = 0$$

However, in the case of a wrong prediction, the weights are being pushed towards the direction of the positive or negative target class:

$$\Delta w_j = \eta \left( 1 - (-1) \right) x_j^{(i)} = \eta (2) x_j^{(i)}$$

$$\Delta w_j = \eta \left( -1 - 1 \right) x_j^{(i)} = \eta (-2) x_j^{(i)}$$

To get a better intuition for the multiplicative factor $x_j^{(i)}$, let us go through another simple example, where:

$$\hat{y}^{(i)} = -1, \quad y^{(i)} = +1, \quad \eta = 1$$

Let's assume that $x_j^{(i)} = 0.5$, and we misclassify this sample as -1. In this case, we would increase the corresponding weight by 1 so that the net input $x_j^{(i)} \times w_j$ would be more positive the next time we encounter this sample, and thus be more likely to be above the threshold of the unit step function to classify the sample as +1:

$$\Delta w_j = \left( 1 - (-1) \right) 0.5 = (2) \, 0.5 = 1$$

The weight update is proportional to the value of $x_j^{(i)}$. For example, if we have another sample $x_j^{(i)} = 2$ that is incorrectly classified as -1, we'd push the decision boundary by an even larger extent to classify this sample correctly the next time:

$$\Delta w_j = \left( 1 - (-1) \right) 2 = (2) \, 2 = 4$$

It is important to note that the convergence of the perceptron is only guaranteed if the two classes are linearly separable and the learning rate is sufficiently small. If the two classes can't be separated by a linear decision boundary, we can set a maximum number of passes over the training dataset (epochs) and/or a threshold for the number of tolerated misclassifications; the perceptron would never stop updating the weights otherwise:

Downloading the example code

If you bought this book directly from Packt, you can download the example code files from your account at http://www.packtpub.com.
If you purchased this book elsewhere, you can download all code examples and datasets directly from https://github.com/rasbt/python-machine-learning-book-2nd-edition.

Now, before we jump into the implementation in the next section, let us summarize what we just learned in a simple diagram that illustrates the general concept of the perceptron:

The preceding diagram illustrates how the perceptron receives the inputs of a sample x and combines them with the weights w to compute the net input. The net input is then passed on to the threshold function, which generates a binary output of -1 or +1, the predicted class label of the sample. During the learning phase, this output is used to calculate the error of the prediction and update the weights.

Implementing a perceptron learning algorithm in Python

In the previous section, we learned how Rosenblatt's perceptron rule works; let us now go ahead and implement it in Python, and apply it to the Iris dataset that we introduced in Chapter 1, Giving Computers the Ability to Learn from Data.

An object-oriented perceptron API

We will take an object-oriented approach to define the perceptron interface as a Python class, which allows us to initialize new Perceptron objects that can learn from data via a fit method, and make predictions via a separate predict method. As a convention, we append an underscore (_) to attributes that are not being created upon the initialization of the object but by calling the object's other methods, for example, self.w_.

If you are not yet familiar with Python's scientific libraries or need a refresher, please see the following resources:

• NumPy: https://sebastianraschka.com/pdf/books/dlb/appendix_f_numpy-intro.pdf
• pandas: https://pandas.pydata.org/pandas-docs/stable/10min.html
• Matplotlib: http://matplotlib.org/users/beginner.html

The following is the implementation of a perceptron:

    import numpy as np


    class Perceptron(object):
        """Perceptron classifier.

        Parameters
        ------------
        eta : float
          Learning rate (between 0.0 and 1.0)
        n_iter : int
          Passes over the training dataset.
        random_state : int
          Random number generator seed for random weight
          initialization.

        Attributes
        -----------
        w_ : 1d-array
          Weights after fitting.
        errors_ : list
          Number of misclassifications (updates) in each epoch.

        """
        def __init__(self, eta=0.01, n_iter=50, random_state=1):
            self.eta = eta
            self.n_iter = n_iter
            self.random_state = random_state

        def fit(self, X, y):
            """Fit training data.

            Parameters
            ----------
            X : {array-like}, shape = [n_samples, n_features]
              Training vectors, where n_samples is the number of
              samples and n_features is the number of features.
            y : array-like, shape = [n_samples]
              Target values.

            Returns
            -------
            self : object

            """
            rgen = np.random.RandomState(self.random_state)
            # w_[0] is the bias unit; w_[1:] are the feature weights
            self.w_ = rgen.normal(loc=0.0, scale=0.01,
                                  size=1 + X.shape[1])
            self.errors_ = []

            for _ in range(self.n_iter):
                errors = 0
                for xi, target in zip(X, y):
                    # update is zero when the prediction is correct
                    update = self.eta * (target - self.predict(xi))
                    self.w_[1:] += update * xi
                    self.w_[0] += update
                    errors += int(update != 0.0)
                self.errors_.append(errors)
            return self

        def net_input(self, X):
            """Calculate net input"""
            return np.dot(X, self.w_[1:]) + self.w_[0]

        def predict(self, X):
            """Return class label after unit step"""
            return np.where(self.net_input(X) >= 0.0, 1, -1)

Using this perceptron implementation, we can now initialize new Perceptron objects with a given learning rate eta and n_iter, which is the number of epochs (passes over the training set). Via the fit method, we initialize the weights in self.w_ to a vector in $\mathbb{R}^{m+1}$, where m stands for the number of dimensions (features) in the dataset, and where we add 1 for the first element in this vector that represents the bias unit.
Remember that the first element in this vector, self.w_[0], represents the so-called bias unit that we discussed earlier.

Also notice that this vector contains small random numbers drawn from a normal distribution with standard deviation 0.01 via rgen.normal(loc=0.0, scale=0.01, size=1 + X.shape[1]), where rgen is a NumPy random number generator that we seeded with a user-specified random seed so that we can reproduce previous results if desired.

Now, the reason we don't initialize the weights to zero is that the learning rate η (eta) only has an effect on the classification outcome if the weights are initialized to non-zero values. If all the weights are initialized to zero, the learning rate parameter eta affects only the scale of the weight vector, not the direction. If you are familiar with trigonometry, consider a vector v1 = [1 2 3], where the angle between v1 and a vector v2 = 0.5 × v1 would be exactly zero, as demonstrated by the following code snippet:

    >>> v1 = np.array([1, 2, 3])
    >>> v2 = 0.5 * v1
    >>> np.arccos(v1.dot(v2) / (np.linalg.norm(v1) *
    ...           np.linalg.norm(v2)))
    0.0

Here, np.arccos is the trigonometric inverse cosine and np.linalg.norm is a function that computes the length of a vector. (The choice to draw the random numbers from a normal distribution, rather than from a uniform distribution, for example, and the choice of a standard deviation of 0.01 were arbitrary; remember, we are just interested in small random values to avoid the properties of all-zero vectors as discussed earlier.)

NumPy indexing for one-dimensional arrays works similarly to Python lists using the square-bracket ([]) notation. For two-dimensional arrays, the first indexer refers to the row number and the second indexer to the column number. For example, we would use X[2, 3] to select the third row and fourth column of a two-dimensional array X.
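As a brief illustration of the indexing conventions described above, the following sketch builds two small arrays and selects elements from them. The array values and the names a, first, last_two, element, row, and column are chosen only for this example and do not appear elsewhere in the chapter:

```python
import numpy as np

# One-dimensional arrays are indexed like Python lists.
a = np.array([10, 20, 30, 40])
first = a[0]         # first element
last_two = a[-2:]    # slice containing the last two elements

# Two-dimensional array with 3 rows and 4 columns (illustrative values).
X = np.arange(12).reshape(3, 4)
element = X[2, 3]    # third row, fourth column
row = X[2]           # the entire third row
column = X[:, 3]     # the entire fourth column

print(first, last_two, element, row, column)
```

Because indexing is zero-based, X[2, 3] addresses the last row and last column of this 3 × 4 array.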
After the weights have been initialized, the fit method loops over all individual samples in the training set and updates the weights according to the perceptron learning rule that we discussed in the previous section. The class labels are predicted by the predict method, which is called in the fit method to predict the class label for the weight update, but predict can also be used to predict the class labels of new data after we have fitted our model. Furthermore, we also collect the number of misclassifications during each epoch in the self.errors_ list so that we can later analyze how well our perceptron performed during the training. The np.dot function that is used in the net_input method simply calculates the vector dot product $\boldsymbol{w}^T \boldsymbol{x}$.

Instead of using NumPy to calculate the vector dot product between two arrays a and b via a.dot(b) or np.dot(a, b), we could also perform the calculation in pure Python via sum([i * j for i, j in zip(a, b)]). However, the advantage of using NumPy over classic Python for loop structures is that its arithmetic operations are vectorized. Vectorization means that an elemental arithmetic operation is automatically applied to all elements in an array. By formulating our arithmetic operations as a sequence of instructions on an array, rather than performing a set of operations one element at a time, we can make better use of our modern CPU architectures with Single Instruction, Multiple Data (SIMD) support. Furthermore, NumPy uses highly optimized linear algebra libraries such as Basic Linear Algebra Subprograms (BLAS) and Linear Algebra Package (LAPACK) that have been written in C or Fortran. Lastly, NumPy also allows us to write our code in a more compact and intuitive way using the basics of linear algebra, such as vector and matrix dot products.
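The equivalence between the pure Python and the NumPy dot product can be checked with a minimal sketch such as the following. The input values and the variable names dot_python, dot_numpy, and dot_method are assumptions made for illustration only:

```python
import numpy as np

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

# Pure Python: multiply the paired elements and sum the products.
dot_python = sum(i * j for i, j in zip(a, b))

# NumPy: the same dot product, computed on whole arrays at once.
a_arr, b_arr = np.array(a), np.array(b)
dot_numpy = np.dot(a_arr, b_arr)
dot_method = a_arr.dot(b_arr)

print(dot_python, dot_numpy, dot_method)
```

All three expressions compute 1 × 4 + 2 × 5 + 3 × 6; for arrays this small any speed difference is negligible, and the benefit of vectorization only shows on larger inputs.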
Training a perceptron model on the Iris dataset

To test our perceptron implementation, we will load the two flower classes Setosa and Versicolor from the Iris dataset. Although the perceptron rule is not restricted to two dimensions, we will only consider the two features sepal length and petal length for visualization purposes. Also, we only chose the two flower classes Setosa and Versicolor for practical reasons. However, the perceptron algorithm can be extended to multi-class classification, for example, through the One-versus-All (OvA) technique.

OvA, sometimes also called One-versus-Rest (OvR), is a technique that allows us to extend a binary classifier to multi-class problems. Using OvA, we can train one classifier per class, where the particular class is treated as the positive class and the samples from all other classes are considered negative classes. If we were to classify a new data sample, we would use our n classifiers, where n is the number of class labels, and assign the class label with the highest confidence to the particular sample. In the case of the perceptron, we would use OvA to choose the class label that is associated with the largest absolute net input value.

First, we will use the pandas library to load the Iris dataset directly from the UCI Machine Learning Repository into a DataFrame object and print the last five lines via the tail method to check the data was loaded correctly:

    >>> import pandas as pd
    >>> df = pd.read_csv('https://archive.ics.uci.edu/ml/'
    ...                  'machine-learning-databases/iris/iris.data',
    ...                  header=None)
    >>> df.tail()

You can find a copy of the Iris dataset (and all other datasets used in this book) in the code bundle of this book, which you can use if you are working offline or the UCI server at https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data is temporarily unavailable.
For instance, to load the Iris dataset from a local directory, you can replace this line:

    df = pd.read_csv('https://archive.ics.uci.edu/ml/'
                     'machine-learning-databases/iris/iris.data',
                     header=None)

Replace it with this:

    df = pd.read_csv('your/local/path/to/iris.data',
                     header=None)

Next, we extract the first 100 class labels that correspond to the 50 Iris-setosa and 50 Iris-versicolor flowers, and convert the class labels into the two integer class labels 1 (versicolor) and -1 (setosa) that we assign to a vector y, where the values attribute of a pandas DataFrame yields the corresponding NumPy representation.

Similarly, we extract the first feature column (sepal length) and the third feature column (petal length) of those 100 training samples and assign them to a feature matrix X, which we can visualize via a two-dimensional scatter plot:

    >>> import matplotlib.pyplot as plt
    >>> import numpy as np

    >>> # select setosa and versicolor
    >>> y = df.iloc[0:100, 4].values
    >>> y = np.where(y == 'Iris-setosa', -1, 1)

    >>> # extract sepal length and petal length
    >>> X = df.iloc[0:100, [0, 2]].values

    >>> # plot data
    >>> plt.scatter(X[:50, 0], X[:50, 1],
    ...             color='red', marker='o', label='setosa')
    >>> plt.scatter(X[50:100, 0], X[50:100, 1],
    ...             color='blue', marker='x', label='versicolor')
    >>> plt.xlabel('sepal length [cm]')
    >>> plt.ylabel('petal length [cm]')
    >>> plt.legend(loc='upper left')
    >>> plt.show()

After executing the preceding code example, we should now see the following scatterplot:

The preceding scatterplot shows the distribution of flower samples in the Iris dataset along the two feature axes, petal length and sepal length. In this two-dimensional feature subspace, we can see that a linear decision boundary should be sufficient to separate Setosa from Versicolor flowers. Thus, a linear classifier such as the perceptron should be able to classify the flowers in this dataset perfectly.
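The label conversion and column selection used in the preceding code can also be tried in isolation, without downloading the dataset. The following is a minimal sketch on a tiny hand-made array; the names labels and features and the three rows of values are illustrative assumptions standing in for the loaded DataFrame:

```python
import numpy as np

# Hypothetical string labels standing in for the label column of the data.
labels = np.array(['Iris-setosa', 'Iris-versicolor', 'Iris-setosa'])

# Map 'Iris-setosa' to -1 and every other label to 1.
y = np.where(labels == 'Iris-setosa', -1, 1)

# Hypothetical 3 x 4 feature table (illustrative values only).
features = np.array([[5.1, 3.5, 1.4, 0.2],
                     [7.0, 3.2, 4.7, 1.4],
                     [4.9, 3.0, 1.4, 0.2]])

# Keep only the first and third columns, as in the text.
X = features[:, [0, 2]]

print(y)
print(X.shape)
```

Passing a list of column indices, as in [0, 2], selects several columns at once and returns a new two-dimensional array with one row per sample.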
Now, it's time to train our perceptron algorithm on the Iris data subset that we just extracted. Also, we will plot the misclassification error for each epoch to check whether the algorithm converged and found a decision boundary that separates the two Iris flower classes:

    >>> ppn = Perceptron(eta=0.1, n_iter=10)
    >>> ppn.fit(X, y)
    >>> plt.plot(range(1, len(ppn.errors_) + 1),
    ...          ppn.errors_, marker='o')
    >>> plt.xlabel('Epochs')
    >>> plt.ylabel('Number of updates')
    >>> plt.show()

After executing the preceding code, we should see the plot of the misclassification errors versus the number of epochs, as shown here:

As we can see in the preceding plot, our perceptron converged after the sixth epoch and should now be able to classify the training samples perfectly. Let us implement a small convenience function to visualize the decision boundaries for two-dimensional datasets:

    from matplotlib.colors import ListedColormap


    def plot_decision_regions(X, y, classifier, resolution=0.02):

        # setup marker generator and color map
        markers = ('s', 'x', 'o', '^', 'v')
        colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
        cmap = ListedColormap(colors[:len(np.unique(y))])

        # plot the decision surface
        x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
        x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
        xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                               np.arange(x2_min, x2_max, resolution))
        Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
        Z = Z.reshape(xx1.shape)
        plt.contourf(xx1, xx2, Z, alpha=0.3, cmap=cmap)
        plt.xlim(xx1.min(), xx1.max())
        plt.ylim(xx2.min(), xx2.max())

        # plot class samples
        for idx, cl in enumerate(np.unique(y)):
            plt.scatter(x=X[y == cl, 0],
                        y=X[y == cl, 1],
                        alpha=0.8,
                        c=colors[idx],
                        marker=markers[idx],
                        label=cl,
                        edgecolor='black')

First, we define a number of colors and markers and create a colormap from the list of colors via ListedColormap.
Then, we determine the minimum and maximum values for the two features and use those feature vectors to create a pair of grid arrays xx1 and xx2 via the NumPy meshgrid function. Since we trained our perceptron classifier on two feature dimensions, we need to flatten the grid arrays and create a matrix that has the same number of columns as the Iris training subset so that we can use the predict method to predict the class labels Z of the corresponding grid points.

After reshaping the predicted class labels Z into a grid with the same dimensions as xx1 and xx2, we can now draw a contour plot via Matplotlib's contourf function, which maps the different decision regions to different colors for each predicted class in the grid array:

    >>> plot_decision_regions(X, y, classifier=ppn)
    >>> plt.xlabel('sepal length [cm]')
    >>> plt.ylabel('petal length [cm]')
    >>> plt.legend(loc='upper left')
    >>> plt.show()

After executing the preceding code example, we should now see a plot of the decision regions, as shown in the following figure:

As we can see in the plot, the perceptron learned a decision boundary that is able to classify all flower samples in the Iris training subset perfectly.

Although the perceptron classified the two Iris flower classes perfectly, convergence is one of the biggest problems of the perceptron. Frank Rosenblatt proved mathematically that the perceptron learning rule converges if the two classes can be separated by a linear hyperplane. However, if classes cannot be separated perfectly by such a linear decision boundary, the weights will never stop updating unless we set a maximum number of epochs.

Adaptive linear neurons and the convergence of learning

In this section, we will take a look at another type of single-layer neural network: ADAptive LInear NEuron (Adaline).
Adaline was published by Bernard Widrow and his doctoral student Ted Hoff, only a few years after Frank Rosenblatt's perceptron algorithm, and can be considered as an improvement on the latter. (Refer to An Adaptive "Adaline" Neuron Using Chemical "Memistors", Technical Report Number 1553-2, B. Widrow and others, Stanford Electron Labs, Stanford, CA, October 1960).

The Adaline algorithm is particularly interesting because it illustrates the key concepts of defining and minimizing continuous cost functions. This lays the groundwork for understanding more advanced machine learning algorithms for classification, such as logistic regression, support vector machines, and regression models, which we will discuss in future chapters.

The key difference between the Adaline rule (also known as the Widrow-Hoff rule) and Rosenblatt's perceptron is that the weights are updated based on a linear activation function rather than a unit step function like in the perceptron. In Adaline, this linear activation function $\phi(z)$ is simply the identity function of the net input, so that:

$$\phi\left( \boldsymbol{w}^T \boldsymbol{x} \right) = \boldsymbol{w}^T \boldsymbol{x}$$

While the linear activation function is used for learning the weights, we still use a threshold function to make the final prediction, which is similar to the unit step function that we have seen earlier. The main differences between the perceptron and Adaline algorithm are highlighted in the following figure:

The illustration shows that the Adaline algorithm compares the true class labels with the linear activation function's continuous valued output to compute the model error and update the weights. In contrast, the perceptron compares the true class labels to the predicted class labels.

Minimizing cost functions with gradient descent

One of the key ingredients of supervised machine learning algorithms is a defined objective function that is to be optimized during the learning process. This objective function is often a cost function that we want to minimize.
In the case of Adaline, we can define the cost function $J$ to learn the weights as the Sum of Squared Errors (SSE) between the calculated outcome and the true class label:

$$J(\boldsymbol{w}) = \frac{1}{2} \sum_i \left( y^{(i)} - \phi\left( z^{(i)} \right) \right)^2$$

The term $\frac{1}{2}$ is just added for our convenience, which will make it easier to derive the gradient, as we will see in the following paragraphs. The main advantage of this continuous linear activation function, in contrast to the unit step function, is that the cost function becomes differentiable. Another nice property of this cost function is that it is convex; thus, we can use a simple yet powerful optimization algorithm called gradient descent to find the weights that minimize our cost function to classify the samples in the Iris dataset.

As illustrated in the following figure, we can describe the main idea behind gradient descent as climbing down a hill until a local or global cost minimum is reached. In each iteration, we take a step in the opposite direction of the gradient where the step size is determined by the value of the learning rate, as well as the slope of the gradient:

Using gradient descent, we can now update the weights by taking a step in the opposite direction of the gradient $\nabla J(\boldsymbol{w})$ of our cost function $J(\boldsymbol{w})$:

$$\boldsymbol{w} := \boldsymbol{w} + \Delta \boldsymbol{w}$$

Where the weight change $\Delta \boldsymbol{w}$ is defined as the negative gradient multiplied by the learning rate $\eta$:

$$\Delta \boldsymbol{w} = -\eta \nabla J(\boldsymbol{w})$$

To compute the gradient of the cost function, we need to compute the partial derivative of the cost function with respect to each weight $w_j$:

$$\frac{\partial J}{\partial w_j} = -\sum_i \left( y^{(i)} - \phi\left( z^{(i)} \right) \right) x_j^{(i)}$$

So that we can write the update of weight $w_j$ as:

$$\Delta w_j = -\eta \frac{\partial J}{\partial w_j} = \eta \sum_i \left( y^{(i)} - \phi\left( z^{(i)} \right) \right) x_j^{(i)}$$

Since we update all weights simultaneously, our Adaline learning rule becomes:

$$\boldsymbol{w} := \boldsymbol{w} + \Delta \boldsymbol{w}$$

For those who are familiar with calculus, the partial derivative of the SSE cost function with respect to the jth weight can be
obtained as follows:

$$\begin{aligned}
\frac{\partial J}{\partial w_j} &= \frac{\partial}{\partial w_j} \frac{1}{2} \sum_i \left( y^{(i)} - \phi\left( z^{(i)} \right) \right)^2 \\
&= \frac{1}{2} \frac{\partial}{\partial w_j} \sum_i \left( y^{(i)} - \phi\left( z^{(i)} \right) \right)^2 \\
&= \frac{1}{2} \sum_i 2 \left( y^{(i)} - \phi\left( z^{(i)} \right) \right) \frac{\partial}{\partial w_j} \left( y^{(i)} - \phi\left( z^{(i)} \right) \right) \\
&= \sum_i \left( y^{(i)} - \phi\left( z^{(i)} \right) \right) \frac{\partial}{\partial w_j} \left( y^{(i)} - \sum_k w_k x_k^{(i)} \right) \\
&= \sum_i \left( y^{(i)} - \phi\left( z^{(i)} \right) \right) \left( -x_j^{(i)} \right) \\
&= -\sum_i \left( y^{(i)} - \phi\left( z^{(i)} \right) \right) x_j^{(i)}
\end{aligned}$$

Although the Adaline learning rule looks identical to the perceptron rule, we should note that $\phi\left( z^{(i)} \right)$ with $z^{(i)} = \boldsymbol{w}^T \boldsymbol{x}^{(i)}$ is a real number and not an integer class label. Furthermore, the weight update is calculated based on all samples in the training set (instead of updating the weights incrementally after each sample), which is why this approach is also referred to as batch gradient descent.

Implementing Adaline in Python

Since the perceptron rule and Adaline are very similar, we will take the perceptron implementation that we defined earlier and change the fit method so that the weights are updated by minimizing the cost function via gradient descent:

    class AdalineGD(object):
        """ADAptive LInear NEuron classifier.

        Parameters
        ------------
        eta : float
          Learning rate (between 0.0 and 1.0)
        n_iter : int
          Passes over the training dataset.
        random_state : int
          Random number generator seed for random weight
          initialization.

        Attributes
        -----------
        w_ : 1d-array
          Weights after fitting.
        cost_ : list
          Sum-of-squares cost function value in each epoch.

        """
        def __init__(self, eta=0.01, n_iter=50, random_state=1):
            self.eta = eta
            self.n_iter = n_iter
            self.random_state = random_state

        def fit(self, X, y):
            """ Fit training data.

            Parameters
            ----------
            X : {array-like}, shape = [n_samples, n_features]
              Training vectors, where n_samples is the number of
              samples and n_features is the number of features.
            y : array-like, shape = [n_samples]
              Target values.

            Returns
            -------
            self : object

            """
            rgen = np.random.RandomState(self.random_state)
            self.w_ = rgen.normal(loc=0.0, scale=0.01,
                                  size=1 + X.shape[1])
            self.cost_ = []

            for i in range(self.n_iter):
                net_input = self.net_input(X)
                output = self.activation(net_input)
                # errors holds one entry per training sample
                errors = (y - output)
                self.w_[1:] += self.eta * X.T.dot(errors)
                self.w_[0] += self.eta * errors.sum()
                cost = (errors**2).sum() / 2.0
                self.cost_.append(cost)
            return self

        def net_input(self, X):
            """Calculate net input"""
            return np.dot(X, self.w_[1:]) + self.w_[0]

        def activation(self, X):
            """Compute linear activation"""
            return X

        def predict(self, X):
            """Return class label after unit step"""
            return np.where(self.activation(self.net_input(X))
                            >= 0.0, 1, -1)

Instead of updating the weights after evaluating each individual training sample, as in the perceptron, we calculate the gradient based on the whole training dataset via self.eta * errors.sum() for the bias unit (zero-weight) and via self.eta * X.T.dot(errors) for the weights 1 to m, where X.T.dot(errors) is a matrix-vector multiplication between our feature matrix and the error vector.

Please note that the activation method has no effect in the code since it is simply an identity function. Here, we added the activation function (computed via the activation method) to illustrate how information flows through a single-layer neural network: features from the input data, net input, activation, and output. In the next chapter, we will learn about a logistic regression classifier that uses a non-identity, nonlinear activation function. We will see that a logistic regression model is closely related to Adaline with the only difference being its activation and cost function.

Now, similar to the previous perceptron implementation, we collect the cost values in a self.cost_ list to check whether the algorithm converged after training.
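To make the batch update more concrete, the following minimal sketch performs a single gradient descent step by hand on a tiny made-up dataset and compares the cost before and after the step. The data, the helper name sse_cost, and the zero initialization of the weights are assumptions made purely for illustration; they are not part of the class described above:

```python
import numpy as np

# Tiny made-up dataset: 4 samples, 2 features, labels in {-1, 1}.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [-1.0, -2.0],
              [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

eta = 0.01
w = np.zeros(1 + X.shape[1])  # w[0] is the bias unit (zero init for simplicity)


def sse_cost(w):
    """Return the sum of squared errors divided by 2 for weights w."""
    errors = y - (X.dot(w[1:]) + w[0])
    return (errors ** 2).sum() / 2.0


cost_before = sse_cost(w)

# One batch update: the errors of all samples enter the same step.
errors = y - (X.dot(w[1:]) + w[0])
w[1:] += eta * X.T.dot(errors)
w[0] += eta * errors.sum()

cost_after = sse_cost(w)
print(cost_before, cost_after)
```

With a suitably small learning rate, the cost after the step is lower than before; repeating the step for several epochs is what the loop in the fit method does.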
Performing a matrix-vector multiplication is similar to calculating a vector dot-product where each row in the matrix is treated as a single row vector. This vectorized approach represents a more compact notation and results in a more efficient computation using NumPy. For example:

$$
\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}
\times
\begin{bmatrix} 7 \\ 8 \\ 9 \end{bmatrix}
=
\begin{bmatrix} 1 \times 7 + 2 \times 8 + 3 \times 9 \\ 4 \times 7 + 5 \times 8 + 6 \times 9 \end{bmatrix}
=
\begin{bmatrix} 50 \\ 122 \end{bmatrix}
$$

In practice, it often requires some experimentation to find a good learning rate η for optimal convergence. So, let's choose two different learning rates, η = 0.01 and η = 0.0001, to start with and plot the cost functions versus the number of epochs to see how well the Adaline implementation learns from the training data.

The learning rate η (eta), as well as the number of epochs (n_iter), are the so-called hyperparameters of the perceptron and Adaline learning algorithms. In Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning, we will take a look at different techniques to automatically find the values of different hyperparameters that yield optimal performance of the classification model.

Let us now plot the cost against the number of epochs for the two different learning rates:

    >>> import matplotlib.pyplot as plt
    >>> fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
    >>> ada1 = AdalineGD(n_iter=10, eta=0.01).fit(X, y)
    >>> ax[0].plot(range(1, len(ada1.cost_) + 1),
    ...            np.log10(ada1.cost_), marker='o')
    >>> ax[0].set_xlabel('Epochs')
    >>> ax[0].set_ylabel('log(Sum-squared-error)')
    >>> ax[0].set_title('Adaline - Learning rate 0.01')
    >>> ada2 = AdalineGD(n_iter=10, eta=0.0001).fit(X, y)
    >>> ax[1].plot(range(1, len(ada2.cost_) + 1),
    ...            ada2.cost_, marker='o')
    >>> ax[1].set_xlabel('Epochs')
    >>> ax[1].set_ylabel('Sum-squared-error')
    >>> ax[1].set_title('Adaline - Learning rate 0.0001')
    >>> plt.show()

As we can see in the resulting cost-function plots, we encountered two different types of problem. The left chart shows what could happen if we choose a learning rate that is too large.
Instead of minimizing the cost function, the error becomes larger in every epoch, because we overshoot the global minimum. On the other hand, we can see that the cost decreases on the right plot, but the chosen learning rate η = 0.0001 is so small that the algorithm would require a very large number of epochs to converge to the global cost minimum.

The following figure illustrates what might happen if we change the value of a particular weight parameter to minimize the cost function J. The left subfigure illustrates the case of a well-chosen learning rate, where the cost decreases gradually, moving in the direction of the global minimum. The subfigure on the right, however, illustrates what happens if we choose a learning rate that is too large: we overshoot the global minimum.

Improving gradient descent through feature scaling

Many machine learning algorithms that we will encounter throughout this book require some sort of feature scaling for optimal performance, which we will discuss in more detail in Chapter 3, A Tour of Machine Learning Classifiers Using scikit-learn and Chapter 4, Building Good Training Sets – Data Preprocessing.

Gradient descent is one of the many algorithms that benefit from feature scaling. In this section, we will use a feature scaling method called standardization, which gives our data the mean and variance of a standard normal distribution (zero mean and unit variance) and helps gradient descent learning to converge more quickly. Standardization shifts each feature so that its mean is centered at zero and rescales it so that its standard deviation is 1. For instance, to standardize the jth feature, we subtract the sample mean $\mu_j$ from every training sample and divide the result by the feature's standard deviation $\sigma_j$:

$$
x'_j = \frac{x_j - \mu_j}{\sigma_j}
$$

Here, $x_j$ is a vector consisting of the jth feature values of all n training samples, and this standardization technique is applied to each feature j in our dataset.
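As a minimal sketch of the formula above, standardization can be written in a few lines of NumPy. The small feature matrix here is hypothetical and serves only to show the arithmetic; the variable names mu, sigma, and X_std are likewise chosen for this illustration.

```python
import numpy as np

# Hypothetical feature matrix: 4 samples, 2 features on different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

# Per-feature mean and standard deviation, computed column-wise.
mu = X.mean(axis=0)
sigma = X.std(axis=0)

# Apply x'_j = (x_j - mu_j) / sigma_j to every feature via broadcasting.
X_std = (X - mu) / sigma

print(X_std.mean(axis=0))
print(X_std.std(axis=0))
```

After this transformation, each column has a mean of approximately zero and a standard deviation of one, regardless of its original scale. Note that np.std computes the population standard deviation by default (ddof=0).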
One of the reasons why standardization helps with gradient descent learning is that the optimizer has to go through