Introduction to Machine Learning with Python: A Guide for Data Scientists
Andreas C. Müller, Sarah Guido
Machine learning has become an integral part of many commercial applications and research projects, but this field is not exclusive to large companies with extensive research teams. If you use Python, even as a beginner, this book will teach you practical ways to build your own machine learning solutions. With all the data available today, machine learning applications are limited only by your imagination.
You’ll learn the steps necessary to create a successful machine learning application with Python and the scikit-learn library. Authors Andreas Müller and Sarah Guido focus on the practical aspects of using machine learning algorithms, rather than the math behind them. Familiarity with the NumPy and matplotlib libraries will help you get even more from this book.
With this book, you’ll learn:
 Fundamental concepts and applications of machine learning
 Advantages and shortcomings of widely used machine learning algorithms
 How to represent data processed by machine learning, including which data aspects to focus on
 Advanced methods for model evaluation and parameter tuning
 The concept of pipelines for chaining models and encapsulating your workflow
 Methods for working with text data, including text-specific processing techniques
 Suggestions for improving your machine learning and data science skills


Introduction to Machine Learning with Python
A Guide for Data Scientists
Andreas C. Müller and Sarah Guido

Beijing Boston Farnham Sebastopol Tokyo

Introduction to Machine Learning with Python
by Andreas C. Müller and Sarah Guido

Copyright © 2017 Sarah Guido, Andreas Müller. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Dawn Schanafelt
Production Editor: Kristen Brown
Copyeditor: Rachel Head
Proofreader: Jasmine Kwityn
Indexer: Judy McConville
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

October 2016: First Edition

Revision History for the First Edition
2016-09-22: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781449369415 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Introduction to Machine Learning with Python, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk.
If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

9781449369415
[LSI]

Table of Contents

Preface

1. Introduction
    Why Machine Learning?
    Problems Machine Learning Can Solve
    Knowing Your Task and Knowing Your Data
    Why Python?
    scikit-learn
    Installing scikit-learn
    Essential Libraries and Tools
    Jupyter Notebook
    NumPy
    SciPy
    matplotlib
    pandas
    mglearn
    Python 2 Versus Python 3
    Versions Used in this Book
    A First Application: Classifying Iris Species
    Meet the Data
    Measuring Success: Training and Testing Data
    First Things First: Look at Your Data
    Building Your First Model: k-Nearest Neighbors
    Making Predictions
    Evaluating the Model
    Summary and Outlook

2. Supervised Learning
    Classification and Regression
    Generalization, Overfitting, and Underfitting
    Relation of Model Complexity to Dataset Size
    Supervised Machine Learning Algorithms
    Some Sample Datasets
    k-Nearest Neighbors
    Linear Models
    Naive Bayes Classifiers
    Decision Trees
    Ensembles of Decision Trees
    Kernelized Support Vector Machines
    Neural Networks (Deep Learning)
    Uncertainty Estimates from Classifiers
    The Decision Function
    Predicting Probabilities
    Uncertainty in Multiclass Classification
    Summary and Outlook

3. Unsupervised Learning and Preprocessing
    Types of Unsupervised Learning
    Challenges in Unsupervised Learning
    Preprocessing and Scaling
    Different Kinds of Preprocessing
    Applying Data Transformations
    Scaling Training and Test Data the Same Way
    The Effect of Preprocessing on Supervised Learning
    Dimensionality Reduction, Feature Extraction, and Manifold Learning
    Principal Component Analysis (PCA)
    Non-Negative Matrix Factorization (NMF)
    Manifold Learning with t-SNE
    Clustering
    k-Means Clustering
    Agglomerative Clustering
    DBSCAN
    Comparing and Evaluating Clustering Algorithms
    Summary of Clustering Methods
    Summary and Outlook

4. Representing Data and Engineering Features
    Categorical Variables
    One-Hot-Encoding (Dummy Variables)
    Numbers Can Encode Categoricals
    Binning, Discretization, Linear Models, and Trees
    Interactions and Polynomials
    Univariate Nonlinear Transformations
    Automatic Feature Selection
    Univariate Statistics
    Model-Based Feature Selection
    Iterative Feature Selection
    Utilizing Expert Knowledge
    Summary and Outlook

5. Model Evaluation and Improvement
    Cross-Validation
    Cross-Validation in scikit-learn
    Benefits of Cross-Validation
    Stratified k-Fold Cross-Validation and Other Strategies
    Grid Search
    Simple Grid Search
    The Danger of Overfitting the Parameters and the Validation Set
    Grid Search with Cross-Validation
    Evaluation Metrics and Scoring
    Keep the End Goal in Mind
    Metrics for Binary Classification
    Metrics for Multiclass Classification
    Regression Metrics
    Using Evaluation Metrics in Model Selection
    Summary and Outlook

6. Algorithm Chains and Pipelines
    Parameter Selection with Preprocessing
    Building Pipelines
    Using Pipelines in Grid Searches
    The General Pipeline Interface
    Convenient Pipeline Creation with make_pipeline
    Accessing Step Attributes
    Accessing Attributes in a Grid-Searched Pipeline
    Grid-Searching Preprocessing Steps and Model Parameters
    Grid-Searching Which Model To Use
    Summary and Outlook

7. Working with Text Data
    Types of Data Represented as Strings
    Example Application: Sentiment Analysis of Movie Reviews
    Representing Text Data as a Bag of Words
    Applying Bag-of-Words to a Toy Dataset
    Bag-of-Words for Movie Reviews
    Stopwords
    Rescaling the Data with tf–idf
    Investigating Model Coefficients
    Bag-of-Words with More Than One Word (n-Grams)
    Advanced Tokenization, Stemming, and Lemmatization
    Topic Modeling and Document Clustering
    Latent Dirichlet Allocation
    Summary and Outlook

8. Wrapping Up
    Approaching a Machine Learning Problem
    Humans in the Loop
    From Prototype to Production
    Testing Production Systems
    Building Your Own Estimator
    Where to Go from Here
    Theory
    Other Machine Learning Frameworks and Packages
    Ranking, Recommender Systems, and Other Kinds of Learning
    Probabilistic Modeling, Inference, and Probabilistic Programming
    Neural Networks
    Scaling to Larger Datasets
    Honing Your Skills
    Conclusion

Index
Preface

Machine learning is an integral part of many commercial applications and research projects today, in areas ranging from medical diagnosis and treatment to finding your friends on social networks. Many people think that machine learning can only be applied by large companies with extensive research teams. In this book, we want to show you how easy it can be to build machine learning solutions yourself, and how to best go about it. With the knowledge in this book, you can build your own system for finding out how people feel on Twitter, or making predictions about global warming. The applications of machine learning are endless and, with the amount of data available today, mostly limited by your imagination.

Who Should Read This Book

This book is for current and aspiring machine learning practitioners looking to implement solutions to real-world machine learning problems. This is an introductory book requiring no previous knowledge of machine learning or artificial intelligence (AI). We focus on using Python and the scikit-learn library, and work through all the steps to create a successful machine learning application. The methods we introduce will be helpful for scientists and researchers, as well as data scientists working on commercial applications. You will get the most out of the book if you are somewhat familiar with Python and the NumPy and matplotlib libraries.

We made a conscious effort not to focus too much on the math, but rather on the practical aspects of using machine learning algorithms. As mathematics (probability theory, in particular) is the foundation upon which machine learning is built, we won’t go into the analysis of the algorithms in great detail. If you are interested in the mathematics of machine learning algorithms, we recommend the book The Elements of Statistical Learning (Springer) by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, which is available for free at the authors’ website.
We will also not describe how to write machine learning algorithms from scratch, and will instead focus on how to use the large array of models already implemented in scikit-learn and other libraries.

Why We Wrote This Book

There are many books on machine learning and AI. However, all of them are meant for graduate students or PhD students in computer science, and they’re full of advanced mathematics. This is in stark contrast with how machine learning is being used, as a commodity tool in research and commercial applications. Today, applying machine learning does not require a PhD. However, there are few resources out there that fully cover all the important aspects of implementing machine learning in practice, without requiring you to take advanced math courses. We hope this book will help people who want to apply machine learning without reading up on years’ worth of calculus, linear algebra, and probability theory.

Navigating This Book

This book is organized roughly as follows:
• Chapter 1 introduces the fundamental concepts of machine learning and its applications, and describes the setup we will be using throughout the book.
• Chapters 2 and 3 describe the actual machine learning algorithms that are most widely used in practice, and discuss their advantages and shortcomings.
• Chapter 4 discusses the importance of how we represent data that is processed by machine learning, and what aspects of the data to pay attention to.
• Chapter 5 covers advanced methods for model evaluation and parameter tuning, with a particular focus on cross-validation and grid search.
• Chapter 6 explains the concept of pipelines for chaining models and encapsulating your workflow.
• Chapter 7 shows how to apply the methods described in earlier chapters to text data, and introduces some text-specific processing techniques.
• Chapter 8 offers a high-level overview, and includes references to more advanced topics.
While Chapters 2 and 3 provide the actual algorithms, understanding all of these algorithms might not be necessary for a beginner. If you need to build a machine learning system ASAP, we suggest starting with Chapter 1 and the opening sections of Chapter 2, which introduce all the core concepts. You can then skip to “Summary and Outlook” on page 127 in Chapter 2, which includes a list of all the supervised models that we cover. Choose the model that best fits your needs and flip back to read the section devoted to it for details. Then you can use the techniques in Chapter 5 to evaluate and tune your model.

Online Resources

While studying this book, definitely refer to the scikit-learn website for more in-depth documentation of the classes and functions, and many examples. There is also a video course created by Andreas Müller, “Advanced Machine Learning with scikit-learn,” that supplements this book. You can find it at http://bit.ly/advanced_machine_learning_scikit-learn.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords. Also used for commands and module and package names.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined by context.

This element signifies a tip or suggestion.

This element signifies a general note.

This icon indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, IPython notebooks, etc.) is available for download at https://github.com/amueller/introduction_to_ml_with_python.
This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “An Introduction to Machine Learning with Python by Andreas C. Müller and Sarah Guido (O’Reilly). Copyright 2017 Sarah Guido and Andreas Müller, 9781449369415.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

Safari® Books Online

Safari Books Online is an on-demand digital library that delivers expert content in both book and video form from the world’s leading authors in technology and business.

Technology professionals, software developers, web designers, and business and creative professionals use Safari Books Online as their primary resource for research, problem solving, learning, and certification training.

Safari Books Online offers a range of plans and pricing for enterprise, government, education, and individuals.
Members have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Technology, and hundreds more. For more information about Safari Books Online, please visit us online.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/intro-machine-learning-python.

To comment or ask technical questions about this book, send email to bookquestions@oreilly.com.

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

From Andreas

Without the help and support of a large group of people, this book would never have existed.

I would like to thank the editors, Meghan Blanchette, Brian MacDonald, and in particular Dawn Schanafelt, for helping Sarah and me make this book a reality.

I want to thank my reviewers, Thomas Caswell, Olivier Grisel, Stefan van der Walt, and John Myles White, who took the time to read the early versions of this book and provided me with invaluable feedback, in addition to being some of the cornerstones of the scientific open source ecosystem.
I am forever thankful for the welcoming open source scientific Python community, especially the contributors to scikit-learn. Without the support and help from this community, in particular from Gael Varoquaux, Alex Gramfort, and Olivier Grisel, I would never have become a core contributor to scikit-learn or learned to understand this package as well as I do now. My thanks also go out to all the other contributors who donate their time to improve and maintain this package.

I’m also thankful for the discussions with many of my colleagues and peers that helped me understand the challenges of machine learning and gave me ideas for structuring a textbook. Among the people I talk to about machine learning, I specifically want to thank Brian McFee, Daniela Huttenkoppen, Joel Nothman, Gilles Louppe, Hugo Bowne-Anderson, Sven Kreis, Alice Zheng, Kyunghyun Cho, Pablo Baberas, and Dan Cervone.

My thanks also go out to Rachel Rakov, who was an eager beta tester and proofreader of an early version of this book, and helped me shape it in many ways.

On the personal side, I want to thank my parents, Harald and Margot, and my sister, Miriam, for their continuing support and encouragement. I also want to thank the many people in my life whose love and friendship gave me the energy and support to undertake such a challenging task.

From Sarah

I would like to thank Meg Blanchette, without whose help and guidance this project would not have even existed. Thanks to Celia La and Brian Carlson for reading in the early days. Thanks to the O’Reilly folks for their endless patience. And finally, thanks to DTS, for your everlasting and endless support.

CHAPTER 1
Introduction

Machine learning is about extracting knowledge from data. It is a research field at the intersection of statistics, artificial intelligence, and computer science and is also known as predictive analytics or statistical learning.
The application of machine learning methods has in recent years become ubiquitous in everyday life. From automatic recommendations of which movies to watch, to what food to order or which products to buy, to personalized online radio and recognizing your friends in your photos, many modern websites and devices have machine learning algorithms at their core. When you look at a complex website like Facebook, Amazon, or Netflix, it is very likely that every part of the site contains multiple machine learning models.

Outside of commercial applications, machine learning has had a tremendous influence on the way data-driven research is done today. The tools introduced in this book have been applied to diverse scientific problems such as understanding stars, finding distant planets, discovering new particles, analyzing DNA sequences, and providing personalized cancer treatments.

Your application doesn’t need to be as large-scale or world-changing as these examples in order to benefit from machine learning, though. In this chapter, we will explain why machine learning has become so popular and discuss what kinds of problems can be solved using machine learning. Then, we will show you how to build your first machine learning model, introducing important concepts along the way.

Why Machine Learning?

In the early days of “intelligent” applications, many systems used hand-coded rules of “if” and “else” decisions to process data or adjust to user input. Think of a spam filter whose job is to move the appropriate incoming email messages to a spam folder. You could make up a blacklist of words that would result in an email being marked as spam. This would be an example of using an expert-designed rule system to design an “intelligent” application. Manually crafting decision rules is feasible for some applications, particularly those in which humans have a good understanding of the process to model.
However, using hand-coded rules to make decisions has two major disadvantages:
• The logic required to make a decision is specific to a single domain and task. Changing the task even slightly might require a rewrite of the whole system.
• Designing rules requires a deep understanding of how a decision should be made by a human expert.

One example of where this hand-coded approach will fail is in detecting faces in images. Today, every smartphone can detect a face in an image. However, face detection was an unsolved problem until as recently as 2001. The main problem is that the way in which pixels (which make up an image in a computer) are “perceived” by the computer is very different from how humans perceive a face. This difference in representation makes it basically impossible for a human to come up with a good set of rules to describe what constitutes a face in a digital image.

Using machine learning, however, simply presenting a program with a large collection of images of faces is enough for an algorithm to determine what characteristics are needed to identify a face.

Problems Machine Learning Can Solve

The most successful kinds of machine learning algorithms are those that automate decision-making processes by generalizing from known examples. In this setting, which is known as supervised learning, the user provides the algorithm with pairs of inputs and desired outputs, and the algorithm finds a way to produce the desired output given an input. In particular, the algorithm is able to create an output for an input it has never seen before without any help from a human. Going back to our example of spam classification, using machine learning, the user provides the algorithm with a large number of emails (which are the input), together with information about whether any of these emails are spam (which is the desired output). Given a new email, the algorithm will then produce a prediction as to whether the new email is spam.
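To make the contrast concrete, here is a minimal sketch that puts a hand-coded blacklist rule next to a very simple learner that counts words in labeled emails. Everything in it is an assumption made for illustration: the blacklist, the four toy emails, and the helper names rule_based_is_spam, train, and predict are hypothetical, and the word-counting scheme is only a stand-in for a real supervised algorithm.

```python
from collections import Counter

# Hypothetical blacklist for the hand-coded approach (illustration only).
BLACKLIST = {"lottery", "winner", "prize"}


def rule_based_is_spam(email):
    """Hand-coded rule: flag the email if it contains any blacklisted word."""
    return any(word in BLACKLIST for word in email.lower().split())


def train(emails, labels):
    """Count in how many spam and non-spam emails each word appears."""
    counts = {True: Counter(), False: Counter()}
    for email, is_spam in zip(emails, labels):
        counts[is_spam].update(set(email.lower().split()))
    return counts


def predict(counts, email):
    """Predict spam if the email's words were seen more often in spam."""
    words = email.lower().split()
    spam_score = sum(counts[True][word] for word in words)
    ham_score = sum(counts[False][word] for word in words)
    return spam_score > ham_score


# Toy input/output pairs: the emails are the inputs, the labels the desired outputs.
emails = [
    "claim your free prize now",
    "free money waiting for you",
    "meeting agenda for monday",
    "lunch on monday with the team",
]
labels = [True, True, False, False]

model = train(emails, labels)
new_email = "get your free money today"
print("rule-based:", rule_based_is_spam(new_email))
print("learned:", predict(model, new_email))
```

The new email contains none of the blacklisted words, so the fixed rule cannot flag it, while the learned counts can still draw on words that appeared in the labeled spam. Adapting the rule means editing the blacklist by hand; adapting the learner means supplying more labeled emails.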
Machine learning algorithms that learn from input/output pairs are called supervised learning algorithms because a “teacher” provides supervision to the algorithms in the form of the desired outputs for each example that they learn from. While creating a dataset of inputs and outputs is often a laborious manual process, supervised learning algorithms are well understood and their performance is easy to measure. If your application can be formulated as a supervised learning problem, and you are able to create a dataset that includes the desired outcome, machine learning will likely be able to solve your problem.

Examples of supervised machine learning tasks include:

Identifying the zip code from handwritten digits on an envelope
    Here the input is a scan of the handwriting, and the desired output is the actual digits in the zip code. To create a dataset for building a machine learning model, you need to collect many envelopes. Then you can read the zip codes yourself and store the digits as your desired outcomes.

Determining whether a tumor is benign based on a medical image
    Here the input is the image, and the output is whether the tumor is benign. To create a dataset for building a model, you need a database of medical images. You also need an expert opinion, so a doctor needs to look at all of the images and decide which tumors are benign and which are not. It might even be necessary to do additional diagnosis beyond the content of the image to determine whether the tumor in the image is cancerous or not.

Detecting fraudulent activity in credit card transactions
    Here the input is a record of the credit card transaction, and the output is whether it is likely to be fraudulent or not. Assuming that you are the entity distributing the credit cards, collecting a dataset means storing all transactions and recording if a user reports any transaction as fraudulent.
An interesting thing to note about these examples is that although the inputs and outputs look fairly straightforward, the data collection process for these three tasks is vastly different. While reading envelopes is laborious, it is easy and cheap. Obtaining medical imaging and diagnoses, on the other hand, requires not only expensive machinery but also rare and expensive expert knowledge, not to mention the ethical concerns and privacy issues. In the example of detecting credit card fraud, data collection is much simpler. Your customers will provide you with the desired output, as they will report fraud. All you have to do to obtain the input/output pairs of fraudulent and non-fraudulent activity is wait.

Unsupervised algorithms are the other type of algorithm that we will cover in this book. In unsupervised learning, only the input data is known, and no known output data is given to the algorithm. While there are many successful applications of these methods, they are usually harder to understand and evaluate.

Examples of unsupervised learning include:

Identifying topics in a set of blog posts
    If you have a large collection of text data, you might want to summarize it and find prevalent themes in it. You might not know beforehand what these topics are, or how many topics there might be. Therefore, there are no known outputs.

Segmenting customers into groups with similar preferences
    Given a set of customer records, you might want to identify which customers are similar, and whether there are groups of customers with similar preferences. For a shopping site, these might be “parents,” “bookworms,” or “gamers.” Because you don’t know in advance what these groups might be, or even how many there are, you have no known outputs.

Detecting abnormal access patterns to a website
    To identify abuse or bugs, it is often helpful to find access patterns that are different from the norm.
    Each abnormal pattern might be very different, and you might not have any recorded instances of abnormal behavior. Because in this example you only observe traffic, and you don’t know what constitutes normal and abnormal behavior, this is an unsupervised problem.

For both supervised and unsupervised learning tasks, it is important to have a representation of your input data that a computer can understand. Often it is helpful to think of your data as a table. Each data point that you want to reason about (each email, each customer, each transaction) is a row, and each property that describes that data point (say, the age of a customer or the amount or location of a transaction) is a column. You might describe users by their age, their gender, when they created an account, and how often they have bought from your online shop. You might describe the image of a tumor by the grayscale values of each pixel, or maybe by using the size, shape, and color of the tumor.

Each entity or row here is known as a sample (or data point) in machine learning, while the columns (the properties that describe these entities) are called features.

Later in this book we will go into more detail on the topic of building a good representation of your data, which is called feature extraction or feature engineering. You should keep in mind, however, that no machine learning algorithm will be able to make a prediction on data for which it has no information. For example, if the only feature that you have for a patient is their last name, no algorithm will be able to predict their gender. This information is simply not contained in your data. If you add another feature that contains the patient’s first name, you will have much better luck, as it is often possible to tell the gender by a person’s first name.
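The table view of data described above can be sketched in a few lines of plain Python. The feature names and customer values below are invented for illustration and do not come from any real dataset; in practice such a table would usually be held in a NumPy array or a pandas DataFrame rather than nested lists.

```python
# Each row is one sample (here, a customer); each column is one feature.
# Feature names and values are made up for illustration.
feature_names = ["age", "account_age_days", "purchases"]
samples = [
    [34, 420, 12],   # customer 0
    [51, 1300, 3],   # customer 1
    [23, 35, 1],     # customer 2
]

n_samples = len(samples)          # number of rows
n_features = len(feature_names)   # number of columns

# Reading one column gives the values of a single feature across all samples.
age_column = feature_names.index("age")
ages = [row[age_column] for row in samples]

print("samples:", n_samples, "features:", n_features)
print("ages:", ages)
```

A model can only use what is in these columns: if the table held nothing but last names, no choice of algorithm would add the missing information.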
Knowing Your Task and Knowing Your Data

Quite possibly the most important part in the machine learning process is understanding the data you are working with and how it relates to the task you want to solve. It will not be effective to randomly choose an algorithm and throw your data at it. It is necessary to understand what is going on in your dataset before you begin building a model. Each algorithm is different in terms of what kind of data and what problem setting it works best for. While you are building a machine learning solution, you should answer, or at least keep in mind, the following questions:
• What question(s) am I trying to answer? Do I think the data collected can answer that question?
• What is the best way to phrase my question(s) as a machine learning problem?
• Have I collected enough data to represent the problem I want to solve?
• What features of the data did I extract, and will these enable the right predictions?
• How will I measure success in my application?
• How will the machine learning solution interact with other parts of my research or business product?

In a larger context, the algorithms and methods in machine learning are only one part of a greater process to solve a particular problem, and it is good to keep the big picture in mind at all times. Many people spend a lot of time building complex machine learning solutions, only to find out they don’t solve the right problem.

When going deep into the technical aspects of machine learning (as we will in this book), it is easy to lose sight of the ultimate goals. While we will not discuss the questions listed here in detail, we still encourage you to keep in mind all the assumptions that you might be making, explicitly or implicitly, when you start building machine learning models.

Why Python?

Python has become the lingua franca for many data science applications.
It combines the power of general-purpose programming languages with the ease of use of domain-specific scripting languages like MATLAB or R. Python has libraries for data loading, visualization, statistics, natural language processing, image processing, and more. This vast toolbox provides data scientists with a large array of general- and special-purpose functionality. One of the main advantages of using Python is the ability to interact directly with the code, using a terminal or other tools like the Jupyter Notebook, which we’ll look at shortly. Machine learning and data analysis are fundamentally iterative processes, in which the data drives the analysis. It is essential for these processes to have tools that allow quick iteration and easy interaction.

As a general-purpose programming language, Python also allows for the creation of complex graphical user interfaces (GUIs) and web services, and for integration into existing systems.

scikit-learn

scikit-learn is an open source project, meaning that it is free to use and distribute, and anyone can easily obtain the source code to see what is going on behind the scenes. The scikit-learn project is constantly being developed and improved, and it has a very active user community. It contains a number of state-of-the-art machine learning algorithms, as well as comprehensive documentation about each algorithm. scikit-learn is a very popular tool, and the most prominent Python library for machine learning. It is widely used in industry and academia, and a wealth of tutorials and code snippets are available online. scikit-learn works well with a number of other scientific Python tools, which we will discuss later in this chapter.

While reading this, we recommend that you also browse the scikit-learn user guide and API documentation for additional details on, and many more options for, each algorithm.
The online documentation is very thorough, and this book will provide you with all the prerequisites in machine learning to understand it in detail.

Installing scikit-learn

scikit-learn depends on two other Python packages, NumPy and SciPy. For plotting and interactive development, you should also install matplotlib, IPython, and the Jupyter Notebook. We recommend using one of the following prepackaged Python distributions, which will provide the necessary packages:

Anaconda
A Python distribution made for large-scale data processing, predictive analytics, and scientific computing. Anaconda comes with NumPy, SciPy, matplotlib, pandas, IPython, Jupyter Notebook, and scikit-learn. Available on Mac OS, Windows, and Linux, it is a very convenient solution and is the one we suggest for people without an existing installation of the scientific Python packages. Anaconda now also includes the commercial Intel MKL library for free. Using MKL (which is done automatically when Anaconda is installed) can give significant speed improvements for many algorithms in scikit-learn.

Enthought Canopy
Another Python distribution for scientific computing. This comes with NumPy, SciPy, matplotlib, pandas, and IPython, but the free version does not come with scikit-learn. If you are part of an academic, degree-granting institution, you can request an academic license and get free access to the paid subscription version of Enthought Canopy. Enthought Canopy is available for Python 2.7.x, and works on Mac OS, Windows, and Linux.

Python(x,y)
A free Python distribution for scientific computing, specifically for Windows. Python(x,y) comes with NumPy, SciPy, matplotlib, pandas, IPython, and scikit-learn.
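Whichever route you take to install the packages, it can be handy to confirm that they are actually importable. The following is a minimal sketch, not part of the book’s code. It assumes Python 3.4 or later (for importlib.util.find_spec), and the helper name is_installed is our own. Note that scikit-learn is imported under the name sklearn.

```python
import importlib.util

# Import names of the packages mentioned in this section
# (scikit-learn is imported as "sklearn").
packages = ["numpy", "scipy", "matplotlib", "pandas", "IPython", "sklearn"]


def is_installed(name):
    """Return True if a top-level module called `name` can be located."""
    return importlib.util.find_spec(name) is not None


status = {name: is_installed(name) for name in packages}
for name, found in status.items():
    print("{:<12} {}".format(name, "found" if found else "MISSING"))
```

The check only locates each module without importing it, so it is quick and has no side effects.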
If you already have a Python installation set up, you can use pip to install all of these packages:

$ pip install numpy scipy matplotlib ipython scikit-learn pandas

Essential Libraries and Tools

Understanding what scikit-learn is and how to use it is important, but there are a few other libraries that will enhance your experience. scikit-learn is built on top of the NumPy and SciPy scientific Python libraries. In addition to NumPy and SciPy, we will be using pandas and matplotlib. We will also introduce the Jupyter Notebook, which is a browser-based interactive programming environment. Briefly, here is what you should know about these tools in order to get the most out of scikit-learn [1].

Jupyter Notebook

The Jupyter Notebook is an interactive environment for running code in the browser. It is a great tool for exploratory data analysis and is widely used by data scientists. While the Jupyter Notebook supports many programming languages, we only need the Python support. The Jupyter Notebook makes it easy to incorporate code, text, and images, and all of this book was in fact written as a Jupyter Notebook. All of the code examples we include can be downloaded from GitHub.

NumPy

NumPy is one of the fundamental packages for scientific computing in Python. It contains functionality for multidimensional arrays, high-level mathematical functions such as linear algebra operations and the Fourier transform, and pseudorandom number generators.

In scikit-learn, the NumPy array is the fundamental data structure. scikit-learn takes in data in the form of NumPy arrays. Any data you’re using will have to be converted to a NumPy array. The core functionality of NumPy is the ndarray class, a multidimensional (n-dimensional) array. All elements of the array must be of the same type.
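One practical consequence of the single-type rule is that NumPy silently converts mixed inputs to a common type. The short sketch below is our own illustration rather than a listing from the book, and shows an integer being stored as a float once a floating-point value is present:

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])   # one float forces a floating-point array

print(ints.dtype)    # an integer type; the exact width is platform dependent
print(mixed.dtype)   # float64
print(mixed)         # every entry, including the 1 and the 3, is stored as a float
```

Checking the dtype attribute is a quick way to see which type NumPy chose for an array.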
A NumPy array looks like this:

In[2]:
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])
print("x:\n{}".format(x))

Out[2]:
x:
[[1 2 3]
 [4 5 6]]

We will be using NumPy a lot in this book, and we will refer to objects of the NumPy ndarray class as “NumPy arrays” or just “arrays.”

[1] If you are unfamiliar with NumPy or matplotlib, we recommend reading the first chapter of the SciPy Lecture Notes.

SciPy

SciPy is a collection of functions for scientific computing in Python. It provides, among other functionality, advanced linear algebra routines, mathematical function optimization, signal processing, special mathematical functions, and statistical distributions. scikit-learn draws from SciPy’s collection of functions for implementing its algorithms. The most important part of SciPy for us is scipy.sparse: this provides sparse matrices, which are another representation that is used for data in scikit-learn. Sparse matrices are used whenever we want to store a 2D array that contains mostly zeros:

In[3]:
from scipy import sparse

# Create a 2D NumPy array with a diagonal of ones, and zeros everywhere else
eye = np.eye(4)
print("NumPy array:\n{}".format(eye))

Out[3]:
NumPy array:
[[ 1.  0.  0.  0.]
 [ 0.  1.  0.  0.]
 [ 0.  0.  1.  0.]
 [ 0.  0.  0.  1.]]

In[4]:
# Convert the NumPy array to a SciPy sparse matrix in CSR format
# Only the nonzero entries are stored
sparse_matrix = sparse.csr_matrix(eye)
print("\nSciPy sparse CSR matrix:\n{}".format(sparse_matrix))

Out[4]:
SciPy sparse CSR matrix:
  (0, 0)    1.0
  (1, 1)    1.0
  (2, 2)    1.0
  (3, 3)    1.0

Usually it is not possible to create dense representations of sparse data (as they would not fit into memory), so we need to create sparse representations directly.
Here is a way to create the same sparse matrix as before, using the COO format:

In[5]:
data = np.ones(4)
row_indices = np.arange(4)
col_indices = np.arange(4)
eye_coo = sparse.coo_matrix((data, (row_indices, col_indices)))
print("COO representation:\n{}".format(eye_coo))

Out[5]:
COO representation:
  (0, 0)    1.0
  (1, 1)    1.0
  (2, 2)    1.0
  (3, 3)    1.0

More details on SciPy sparse matrices can be found in the SciPy Lecture Notes.

matplotlib

matplotlib is the primary scientific plotting library in Python. It provides functions for making publication-quality visualizations such as line charts, histograms, scatter plots, and so on. Visualizing your data and different aspects of your analysis can give you important insights, and we will be using matplotlib for all our visualizations. When working inside the Jupyter Notebook, you can show figures directly in the browser by using the %matplotlib notebook and %matplotlib inline commands. We recommend using %matplotlib notebook, which provides an interactive environment (though we are using %matplotlib inline to produce this book). For example, this code produces the plot in Figure 1-1:

In[6]:
%matplotlib inline
import matplotlib.pyplot as plt

# Generate a sequence of numbers from -10 to 10 with 100 steps in between
x = np.linspace(-10, 10, 100)
# Create a second array using sine
y = np.sin(x)
# The plot function makes a line chart of one array against another
plt.plot(x, y, marker="x")

Figure 1-1. Simple line plot of the sine function using matplotlib

pandas

pandas is a Python library for data wrangling and analysis. It is built around a data structure called the DataFrame that is modeled after the R DataFrame. Simply put, a pandas DataFrame is a table, similar to an Excel spreadsheet. pandas provides a great range of methods to modify and operate on this table; in particular, it allows SQL-like queries and joins of tables.
In contrast to NumPy, which requires that all entries in an array be of the same type, pandas allows each column to have a separate type (for example, integers, dates, floating-point numbers, and strings). Another valuable tool provided by pandas is its ability to ingest from a great variety of file formats and databases, like SQL, Excel files, and comma-separated values (CSV) files. Going into detail about the functionality of pandas is out of the scope of this book. However, Python for Data Analysis by Wes McKinney (O’Reilly, 2012) provides a great guide. Here is a small example of creating a DataFrame using a dictionary:

In[7]:
import pandas as pd
from IPython.display import display

# create a simple dataset of people
data = {'Name': ["John", "Anna", "Peter", "Linda"],
        'Location' : ["New York", "Paris", "Berlin", "London"],
        'Age' : [24, 13, 53, 33]
       }

data_pandas = pd.DataFrame(data)
# IPython.display allows "pretty printing" of dataframes
# in the Jupyter notebook
display(data_pandas)

This produces the following output:

   Age  Location   Name
0   24  New York   John
1   13     Paris   Anna
2   53    Berlin  Peter
3   33    London  Linda

There are several possible ways to query this table. For example:

In[8]:
# Select all rows that have an age column greater than 30
display(data_pandas[data_pandas.Age > 30])

This produces the following result:

   Age Location   Name
2   53   Berlin  Peter
3   33   London  Linda

mglearn

This book comes with accompanying code, which you can find on GitHub. The accompanying code includes not only all the examples shown in this book, but also the mglearn library. This is a library of utility functions we wrote for this book, so that we don’t clutter up our code listings with details of plotting and data loading. If you’re interested, you can look up all the functions in the repository, but the details of the mglearn module are not really important to the material in this book.
If you see a call to mglearn in the code, it is usually a way to make a pretty picture quickly, or to get our hands on some interesting data.

Throughout the book we make ample use of NumPy, matplotlib, and pandas. All the code will assume the following imports:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import mglearn

We also assume that you will run the code in a Jupyter Notebook with the %matplotlib notebook or %matplotlib inline magic enabled to show plots. If you are not using the notebook or these magic commands, you will have to call plt.show to actually show any of the figures.

Python 2 Versus Python 3

There are two major versions of Python that are widely used at the moment: Python 2 (more precisely, 2.7) and Python 3 (with the latest release being 3.5 at the time of writing). This sometimes leads to some confusion. Python 2 is no longer actively developed, but because Python 3 contains major changes, Python 2 code usually does not run on Python 3 without changes. If you are new to Python, or are starting a new project from scratch, we highly recommend using the latest version of Python 3. If you have a large codebase that you rely on that is written for Python 2, you are excused from upgrading for now. However, you should try to migrate to Python 3 as soon as possible. When writing any new code, it is for the most part quite easy to write code that runs under both Python 2 and Python 3 [2]. If you don’t have to interface with legacy software, you should definitely use Python 3. All the code in this book is written in a way that works for both versions. However, the exact output might differ slightly under Python 2.
[2] The six package can be very handy for that.

Versions Used in this Book

We are using the following versions of the previously mentioned libraries in this book:

In[9]:
import sys
print("Python version: {}".format(sys.version))

import pandas as pd
print("pandas version: {}".format(pd.__version__))

import matplotlib
print("matplotlib version: {}".format(matplotlib.__version__))

import numpy as np
print("NumPy version: {}".format(np.__version__))

import scipy as sp
print("SciPy version: {}".format(sp.__version__))

import IPython
print("IPython version: {}".format(IPython.__version__))

import sklearn
print("scikit-learn version: {}".format(sklearn.__version__))

Out[9]:
Python version: 3.5.2 |Anaconda 4.1.1 (64-bit)| (default, Jul  2 2016, 17:53:06)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
pandas version: 0.18.1
matplotlib version: 1.5.1
NumPy version: 1.11.1
SciPy version: 0.17.1
IPython version: 5.1.0
scikit-learn version: 0.18

While it is not important to match these versions exactly, you should have a version of scikit-learn that is at least as recent as the one we used.

This book assumes that you have version 0.18 or later of scikit-learn. The model_selection module was added in 0.18, and if you use an earlier version of scikit-learn, you will need to adjust the imports from this module.

Now that we have everything set up, let’s dive into our first application of machine learning.

A First Application: Classifying Iris Species

In this section, we will go through a simple machine learning application and create our first model. In the process, we will introduce some core concepts and terms.

Let’s assume that a hobby botanist is interested in distinguishing the species of some iris flowers that she has found. She has collected some measurements associated with each iris: the length and width of the petals and the length and width of the sepals, all measured in centimeters (see Figure 1-2).
She also has the measurements of some irises that have been previously identified by an expert botanist as belonging to the species setosa, versicolor, or virginica. For these measurements, she can be certain of which species each iris belongs to. Let’s assume that these are the only species our hobby botanist will encounter in the wild.

Our goal is to build a machine learning model that can learn from the measurements of these irises whose species is known, so that we can predict the species for a new iris.

Figure 1-2. Parts of the iris flower

Because we have measurements for which we know the correct species of iris, this is a supervised learning problem. In this problem, we want to predict one of several options (the species of iris). This is an example of a classification problem. The possible outputs (different species of irises) are called classes. Every iris in the dataset belongs to one of three classes, so this problem is a three-class classification problem.

The desired output for a single data point (an iris) is the species of this flower. For a particular data point, the species it belongs to is called its label.

Meet the Data

The data we will use for this example is the Iris dataset, a classical dataset in machine learning and statistics. It is included in scikit-learn in the datasets module. We can load it by calling the load_iris function:

In[10]:
from sklearn.datasets import load_iris
iris_dataset = load_iris()

The iris_dataset object that is returned by load_iris is a Bunch object, which is very similar to a dictionary. It contains keys and values:

In[11]:
print("Keys of iris_dataset: \n{}".format(iris_dataset.keys()))

Out[11]:
Keys of iris_dataset:
dict_keys(['target_names', 'feature_names', 'DESCR', 'data', 'target'])

The value of the key DESCR is a short description of the dataset.
We show the beginning of the description here (feel free to look up the rest yourself):

In[12]:
print(iris_dataset['DESCR'][:193] + "\n...")

Out[12]:
Iris Plants Database
====================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive att
...

The value of the key target_names is an array of strings, containing the species of flower that we want to predict:

In[13]:
print("Target names: {}".format(iris_dataset['target_names']))

Out[13]:
Target names: ['setosa' 'versicolor' 'virginica']

The value of feature_names is a list of strings, giving the description of each feature:

In[14]:
print("Feature names: \n{}".format(iris_dataset['feature_names']))

Out[14]:
Feature names:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
 'petal width (cm)']

The data itself is contained in the target and data fields. data contains the numeric measurements of sepal length, sepal width, petal length, and petal width in a NumPy array:

In[15]:
print("Type of data: {}".format(type(iris_dataset['data'])))

Out[15]:
Type of data: <class 'numpy.ndarray'>

The rows in the data array correspond to flowers, while the columns represent the four measurements that were taken for each flower:

In[16]:
print("Shape of data: {}".format(iris_dataset['data'].shape))

Out[16]:
Shape of data: (150, 4)

We see that the array contains measurements for 150 different flowers. Remember that the individual items are called samples in machine learning, and their properties are called features. The shape of the data array is the number of samples by the number of features, here (150, 4). This is a convention in scikit-learn, and your data will always be assumed to be in this shape.
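To make the shape convention concrete, here is a small sketch of our own that uses a few made-up measurements rather than the Iris data. It reads the number of samples and features off the shape, and shows how a flat array can be reshaped into a single row (one sample) or a single column (one feature):

```python
import numpy as np

# Hypothetical measurements: 3 samples with 2 features each.
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [4.7, 3.2]])
n_samples, n_features = X.shape

# A single sample given as a flat array has to become one row ...
one_sample = np.array([5.0, 2.9]).reshape(1, -1)
# ... and a single feature has to become one column.
one_feature = np.array([5.1, 4.9, 4.7]).reshape(-1, 1)

print(X.shape, one_sample.shape, one_feature.shape)
```

The -1 in reshape asks NumPy to infer that dimension from the number of elements, so the same two calls work for arrays of any length.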
Here are the feature values for the first five samples:

In[17]:
print("First five rows of data:\n{}".format(iris_dataset['data'][:5]))

Out[17]:
First five rows of data:
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]]

From this data, we can see that all of the first five flowers have a petal width of 0.2 cm and that the first flower has the longest sepal, at 5.1 cm.

The target array contains the species of each of the flowers that were measured, also as a NumPy array:

In[18]:
print("Type of target: {}".format(type(iris_dataset['target'])))

Out[18]:
Type of target: <class 'numpy.ndarray'>

target is a one-dimensional array, with one entry per flower:

In[19]:
print("Shape of target: {}".format(iris_dataset['target'].shape))

Out[19]:
Shape of target: (150,)

The species are encoded as integers from 0 to 2:

In[20]:
print("Target:\n{}".format(iris_dataset['target']))

Out[20]:
Target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

The meanings of the numbers are given by the iris_dataset['target_names'] array: 0 means setosa, 1 means versicolor, and 2 means virginica.

Measuring Success: Training and Testing Data

We want to build a machine learning model from this data that can predict the species of iris for a new set of measurements. But before we can apply our model to new measurements, we need to know whether it actually works, that is, whether we should trust its predictions.

Unfortunately, we cannot use the data we used to build the model to evaluate it. This is because our model can always simply remember the whole training set, and will therefore always predict the correct label for any point in the training set.
This “remembering” does not indicate to us whether our model will generalize well (in other words, whether it will also perform well on new data).

To assess the model’s performance, we show it new data (data that it hasn’t seen before) for which we have labels. This is usually done by splitting the labeled data we have collected (here, our 150 flower measurements) into two parts. One part of the data is used to build our machine learning model, and is called the training data or training set. The rest of the data will be used to assess how well the model works; this is called the test data, test set, or hold-out set.

scikit-learn contains a function that shuffles the dataset and splits it for you: the train_test_split function. This function extracts 75% of the rows in the data as the training set, together with the corresponding labels for this data. The remaining 25% of the data, together with the remaining labels, is declared as the test set. Deciding how much data you want to put into the training and the test set respectively is somewhat arbitrary, but using a test set containing 25% of the data is a good rule of thumb.

In scikit-learn, data is usually denoted with a capital X, while labels are denoted by a lowercase y. This is inspired by the standard formulation f(x)=y in mathematics, where x is the input to a function and y is the output. Following more conventions from mathematics, we use a capital X because the data is a two-dimensional array (a matrix) and a lowercase y because the target is a one-dimensional array (a vector).

Let’s call train_test_split on our data and assign the outputs using this nomenclature:

In[21]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

Before making the split, the train_test_split function shuffles the dataset using a pseudorandom number generator.
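As a rough illustration of what shuffling before splitting amounts to, here is a pure-Python sketch. It is not scikit-learn’s implementation: the function name shuffle_split, its parameters, and the toy data are our own, and the shuffled order will differ from what train_test_split produces for the same seed.

```python
import math
import random


def shuffle_split(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle the row indices with a seeded generator, then cut off a test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)      # same seed gives the same order
    n_test = int(math.ceil(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [rows[i] for i in train_idx]
    X_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Toy data sorted by label, mimicking how the Iris targets are ordered.
rows = [[i] for i in range(12)]
labels = [0] * 4 + [1] * 4 + [2] * 4

X_tr, X_te, y_tr, y_te = shuffle_split(rows, labels)
print(len(X_tr), len(X_te))
```

Because rows and labels are selected with the same shuffled indices, each sample stays paired with its own label after the split.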
If we just took the last 25% of the data as a test set, all the data points would have the label 2, as the data points are sorted by the label (see the output for iris_dataset['target'] shown earlier). Using a test set containing only one of the three classes would not tell us much about how well our model generalizes, so we shuffle our data to make sure the test data contains data from all classes.

To make sure that we will get the same output if we run the same function several times, we provide the pseudorandom number generator with a fixed seed using the random_state parameter. This will make the outcome deterministic, so this line will always have the same outcome. We will always fix the random_state in this way when using randomized procedures in this book.

The output of the train_test_split function is X_train, X_test, y_train, and y_test, which are all NumPy arrays. X_train contains 75% of the rows of the dataset, and X_test contains the remaining 25%:

In[22]:
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))

Out[22]:
X_train shape: (112, 4)
y_train shape: (112,)

In[23]:
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))

Out[23]:
X_test shape: (38, 4)
y_test shape: (38,)

First Things First: Look at Your Data

Before building a machine learning model it is often a good idea to inspect the data, to see if the task is easily solvable without machine learning, or if the desired information might not be contained in the data.

Additionally, inspecting your data is a good way to find abnormalities and peculiarities. Maybe some of your irises were measured using inches and not centimeters, for example. In the real world, inconsistencies in the data and unexpected measurements are very common.

One of the best ways to inspect data is to visualize it. One way to do this is by using a scatter plot.
A scatter plot of the data puts one feature along the x-axis and another along the y-axis, and draws a dot for each data point. Unfortunately, computer screens have only two dimensions, which allows us to plot only two (or maybe three) features at a time. It is difficult to plot datasets with more than three features this way. One way around this problem is to do a pair plot, which looks at all possible pairs of features. If you have a small number of features, such as the four we have here, this is quite reasonable. You should keep in mind, however, that a pair plot does not show the interaction of all of the features at once, so some interesting aspects of the data may not be revealed when visualizing it this way.

Figure 1-3 is a pair plot of the features in the training set. The data points are colored according to the species the iris belongs to. To create the plot, we first convert the NumPy array into a pandas DataFrame. pandas has a function to create pair plots called scatter_matrix. The diagonal of this matrix is filled with histograms of each feature:

In[24]:
# create dataframe from data in X_train
# label the columns using the strings in iris_dataset.feature_names
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)
# create a scatter matrix from the dataframe, color by y_train
# (pandas 0.20 and later expose this function as pd.plotting.scatter_matrix)
grr = pd.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
                        hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)

Figure 1-3. Pair plot of the Iris dataset, colored by class label

From the plots, we can see that the three classes seem to be relatively well separated using the sepal and petal measurements. This means that a machine learning model will likely be able to learn to separate them.

Building Your First Model: k-Nearest Neighbors

Now we can start building the actual machine learning model. There are many classification algorithms in scikit-learn that we could use.
Here we will use a k-nearest neighbors classifier, which is easy to understand. Building this model only consists of storing the training set. To make a prediction for a new data point, the algorithm finds the point in the training set that is closest to the new point. Then it assigns the label of this training point to the new data point.

The k in k-nearest neighbors signifies that instead of using only the closest neighbor to the new data point, we can consider any fixed number k of neighbors in the training set (for example, the closest three or five neighbors). Then, we can make a prediction using the majority class among these neighbors. We will go into more detail about this in Chapter 2; for now, we’ll use only a single neighbor.

All machine learning models in scikit-learn are implemented in their own classes, which are called Estimator classes. The k-nearest neighbors classification algorithm is implemented in the KNeighborsClassifier class in the neighbors module. Before we can use the model, we need to instantiate the class into an object. This is when we will set any parameters of the model. The most important parameter of KNeighborsClassifier is the number of neighbors, which we will set to 1:

In[25]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)

The knn object encapsulates the algorithm that will be used to build the model from the training data, as well as the algorithm to make predictions on new data points. It will also hold the information that the algorithm has extracted from the training data. In the case of KNeighborsClassifier, it will just store the training set.
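To make the idea of storing the training set and voting among neighbors concrete, here is a minimal pure-Python sketch of the prediction step. It is an illustration under our own assumptions, not how KNeighborsClassifier is implemented: the function name knn_predict and the toy data are hypothetical, distance is plain Euclidean, and ties are broken by whichever label was counted first.

```python
import math
from collections import Counter


def knn_predict(X_train, y_train, x_new, k=1):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every stored training point
    distances = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(row, x_new)))
        for row in X_train
    ]
    # Indices of the k training points with the smallest distances
    nearest = sorted(range(len(X_train)), key=lambda i: distances[i])[:k]
    # Majority class among those neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two features, two classes
X_toy = [[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [4.1, 3.9]]
y_toy = [0, 0, 1, 1]

pred_one = knn_predict(X_toy, y_toy, [1.1, 1.0], k=1)
pred_three = knn_predict(X_toy, y_toy, [3.8, 4.0], k=3)
print(pred_one, pred_three)
```

With k=1 the first query takes the label of its single closest point. With k=3 the second query has two neighbors of class 1 and one of class 0, so the vote yields class 1.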
To build the model on the training set, we call the fit method of the knn object, which takes as arguments the NumPy array X_train containing the training data and the NumPy array y_train of the corresponding training labels:

In[26]:
knn.fit(X_train, y_train)

Out[26]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

The fit method returns the knn object itself (and modifies it in place), so we get a string representation of our classifier. The representation shows us which parameters were used in creating the model. Nearly all of them are the default values, but you can also find n_neighbors=1, which is the parameter that we passed. Most models in scikit-learn have many parameters, but the majority of them are either speed optimizations or for very special use cases. You don’t have to worry about the other parameters shown in this representation. Printing a scikit-learn model can yield very long strings, but don’t be intimidated by these. We will cover all the important parameters in Chapter 2. In the remainder of this book, we will not show the output of fit because it doesn’t contain any new information.

Making Predictions

We can now make predictions using this model on new data for which we might not know the correct labels. Imagine we found an iris in the wild with a sepal length of 5 cm, a sepal width of 2.9 cm, a petal length of 1 cm, and a petal width of 0.2 cm. What species of iris would this be?
We can put this data into a NumPy array, again by calculating the shape, that is, the number of samples (1) multiplied by the number of features (4):

In[27]:
X_new = np.array([[5, 2.9, 1, 0.2]])
print("X_new.shape: {}".format(X_new.shape))

Out[27]:
X_new.shape: (1, 4)

Note that we made the measurements of this single flower into a row in a two-dimensional NumPy array, as scikit-learn always expects two-dimensional arrays for the data.

To make a prediction, we call the predict method of the knn object:

In[28]:
prediction = knn.predict(X_new)
print("Prediction: {}".format(prediction))
print("Predicted target name: {}".format(
       iris_dataset['target_names'][prediction]))

Out[28]:
Prediction: [0]
Predicted target name: ['setosa']

Our model predicts that this new iris belongs to the class 0, meaning its species is setosa. But how do we know whether we can trust our model? We don’t know the correct species of this sample, which is the whole point of building the model!

Evaluating the Model

This is where the test set that we created earlier comes in. This data was not used to build the model, but we do know what the correct species is for each iris in the test set.

Therefore, we can make a prediction for each iris in the test data and compare it against its label (the known species).
We can measure how well the model works by computing the accuracy, which is the fraction of flowers for which the right species was predicted:

In[29]:
y_pred = knn.predict(X_test)
print("Test set predictions:\n {}".format(y_pred))

Out[29]:
Test set predictions:
 [2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0 2]

In[30]:
print("Test set score: {:.2f}".format(np.mean(y_pred == y_test)))

Out[30]:
Test set score: 0.97

We can also use the score method of the knn object, which will compute the test set accuracy for us:

In[31]:
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))

Out[31]:
Test set score: 0.97

For this model, the test set accuracy is about 0.97, which means we made the right prediction for 97% of the irises in the test set. Under some mathematical assumptions, this means that we can expect our model to be correct 97% of the time for new irises. For our hobby botanist application, this high level of accuracy means that our model may be trustworthy enough to use. In later chapters we will discuss how we can improve performance, and what caveats there are in tuning a model.

Summary and Outlook

Let’s summarize what we learned in this chapter. We started with a brief introduction to machine learning and its applications, then discussed the distinction between supervised and unsupervised learning and gave an overview of the tools we’ll be using in this book. Then, we formulated the task of predicting which species of iris a particular flower belongs to by using physical measurements of the flower. We used a dataset of measurements that was annotated by an expert with the correct species to build our model, making this a supervised learning task. There were three possible species, setosa, versicolor, or virginica, which made the task a three-class classification problem. The possible species are called classes in the classification problem, and the species of a single iris is called its label.
The Iris dataset consists of two NumPy arrays: one containing the data, which is referred to as X in scikit-learn, and one containing the correct or desired outputs, which is called y. The array X is a two-dimensional array of features, with one row per data point and one column per feature. The array y is a one-dimensional array, which here contains one class label, an integer ranging from 0 to 2, for each of the samples.

We split our dataset into a training set, to build our model, and a test set, to evaluate how well our model will generalize to new, previously unseen data.

We chose the k-nearest neighbors classification algorithm, which makes predictions for a new data point by considering its closest neighbor(s) in the training set. This is implemented in the KNeighborsClassifier class, which contains the algorithm that builds the model as well as the algorithm that makes a prediction using the model. We instantiated the class, setting parameters. Then we built the model by calling the fit method, passing the training data (X_train) and training outputs (y_train) as parameters. We evaluated the model using the score method, which computes the accuracy of the model. We applied the score method to the test set data and the test set labels and found that our model is about 97% accurate, meaning it is correct 97% of the time on the test set.

This gave us the confidence to apply the model to new data (in our example, new flower measurements) and trust that the model will be correct about 97% of the time.
Here is a summary of the code needed for the whole training and evaluation procedure:

In[32]:
    X_train, X_test, y_train, y_test = train_test_split(
        iris_dataset['data'], iris_dataset['target'], random_state=0)

    knn = KNeighborsClassifier(n_neighbors=1)
    knn.fit(X_train, y_train)

    print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))

Out[32]:
    Test set score: 0.97

This snippet contains the core code for applying any machine learning algorithm using scikit-learn. The fit, predict, and score methods are the common interface to supervised models in scikit-learn, and with the concepts introduced in this chapter, you can apply these models to many machine learning tasks. In the next chapter, we will go into more depth about the different kinds of supervised models in scikit-learn and how to apply them successfully.

CHAPTER 2
Supervised Learning

As we mentioned earlier, supervised machine learning is one of the most commonly used and successful types of machine learning. In this chapter, we will describe supervised learning in more detail and explain several popular supervised learning algorithms. We already saw an application of supervised machine learning in Chapter 1: classifying iris flowers into several species using physical measurements of the flowers.

Remember that supervised learning is used whenever we want to predict a certain outcome from a given input, and we have examples of input/output pairs. We build a machine learning model from these input/output pairs, which comprise our training set. Our goal is to make accurate predictions for new, never-before-seen data. Supervised learning often requires human effort to build the training set, but afterward automates and often speeds up an otherwise laborious or infeasible task.

Classification and Regression

There are two major types of supervised machine learning problems, called classification and regression.
In classification, the goal is to predict a class label, which is a choice from a predefined list of possibilities. In Chapter 1 we used the example of classifying irises into one of three possible species. Classification is sometimes separated into binary classification, which is the special case of distinguishing between exactly two classes, and multiclass classification, which is classification between more than two classes. You can think of binary classification as trying to answer a yes/no question. Classifying emails as either spam or not spam is an example of a binary classification problem. In this binary classification task, the yes/no question being asked would be “Is this email spam?”

In binary classification we often speak of one class being the positive class and the other class being the negative class. Here, positive doesn’t represent having benefit or value, but rather what the object of the study is. So, when looking for spam, “positive” could mean the spam class. Which of the two classes is called positive is often a subjective matter, and specific to the domain.

The iris example, on the other hand, is an example of a multiclass classification problem. Another example is predicting what language a website is in from the text on the website. The classes here would be a predefined list of possible languages.

For regression tasks, the goal is to predict a continuous number, or a floating-point number in programming terms (or real number in mathematical terms). Predicting a person’s annual income from their education, their age, and where they live is an example of a regression task. When predicting income, the predicted value is an amount, and can be any number in a given range. Another example of a regression task is predicting the yield of a corn farm given attributes such as previous yields, weather, and number of employees working on the farm. The yield again can be an arbitrary number.
An easy way to distinguish between classification and regression tasks is to ask whether there is some kind of continuity in the output. If there is continuity between possible outcomes, then the problem is a regression problem. Think about predicting annual income. There is a clear continuity in the output. Whether a person makes $40,000 or $40,001 a year does not make a tangible difference, even though these are different amounts of money; if our algorithm predicts $39,999 or $40,001 when it should have predicted $40,000, we don’t mind that much.

By contrast, for the task of recognizing the language of a website (which is a classification problem), there is no matter of degree. A website is in one language, or it is in another. There is no continuity between languages, and there is no language that is between English and French.1

1 We ask linguists to excuse the simplified presentation of languages as distinct and fixed entities.

Generalization, Overfitting, and Underfitting

In supervised learning, we want to build a model on the training data and then be able to make accurate predictions on new, unseen data that has the same characteristics as the training set that we used. If a model is able to make accurate predictions on unseen data, we say it is able to generalize from the training set to the test set. We want to build a model that is able to generalize as accurately as possible.

Usually we build a model in such a way that it can make accurate predictions on the training set. If the training and test sets have enough in common, we expect the model to also be accurate on the test set. However, there are some cases where this can go wrong. For example, if we allow ourselves to build very complex models, we can always be as accurate as we like on the training set.

Let’s take a look at a made-up example to illustrate this point.
Say a novice data scientist wants to predict whether a customer will buy a boat, given records of previous boat buyers and customers who we know are not interested in buying a boat.2 The goal is to send out promotional emails to people who are likely to actually make a purchase, but not bother those customers who won’t be interested.

2 In the real world, this is actually a tricky problem. While we know that the other customers haven’t bought a boat from us yet, they might have bought one from someone else, or they may still be saving and plan to buy one in the future.

Suppose we have the customer records shown in Table 2-1.

Table 2-1. Example data about customers

    Age  Number of cars owned  Owns house  Number of children  Marital status  Owns a dog  Bought a boat
    66   1                     yes         2                   widowed         no          yes
    52   2                     yes         3                   married         no          yes
    22   0                     no          0                   married         yes         no
    25   1                     no          1                   single          no          no
    44   0                     no          2                   divorced        yes         no
    39   1                     yes         2                   married         yes         no
    26   1                     no          2                   single          no          no
    40   3                     yes         1                   married         yes         no
    53   2                     yes         2                   divorced        no          yes
    64   2                     yes         3                   divorced        no          no
    58   2                     yes         2                   married         yes         yes
    33   1                     no          1                   single          no          no

After looking at the data for a while, our novice data scientist comes up with the following rule: “If the customer is older than 45, and has less than 3 children or is not divorced, then they want to buy a boat.” When asked how well this rule of his does, our data scientist answers, “It’s 100 percent accurate!” And indeed, on the data that is in the table, the rule is perfectly accurate. There are many possible rules we could come up with that would explain perfectly if someone in this dataset wants to buy a boat. No age appears twice in the data, so we could say people who are 66, 52, 53, or 58 years old want to buy a boat, while all others don’t. While we can make up many rules that work well on this data, remember that we are not interested in making predictions for this dataset; we already know the answers for these customers.
We want to know if new customers are likely to buy a boat. We therefore want to find a rule that will work well for new customers, and achieving 100 percent accuracy on the training set does not help us there. We might not expect that the rule our data scientist came up with will work very well on new customers. It seems too complex, and it is supported by very little data. For example, the “or is not divorced” part of the rule hinges on a single customer.

The only measure of whether an algorithm will perform well on new data is the evaluation on the test set. However, intuitively3 we expect simple models to generalize better to new data. If the rule was “People older than 50 want to buy a boat,” and this would explain the behavior of all the customers, we would trust it more than the rule involving children and marital status in addition to age. Therefore, we always want to find the simplest model that still explains the data. Building a model that is too complex for the amount of information we have, as our novice data scientist did, is called overfitting. Overfitting occurs when you fit a model too closely to the particularities of the training set and obtain a model that works well on the training set but is not able to generalize to new data. On the other hand, if your model is too simple (say, “Everybody who owns a house buys a boat”), then you might not be able to capture all the aspects of and variability in the data, and your model will do badly even on the training set. Choosing too simple a model is called underfitting.

The more complex we allow our model to be, the better we will be able to predict on the training data. However, if our model becomes too complex, we start focusing too much on each individual data point in our training set, and the model will not generalize well to new data.

There is a sweet spot in between that will yield the best generalization performance. This is the model we want to find.
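To make the comparison concrete, the rules discussed above can be checked against the 12 rows of Table 2-1. The following is a minimal sketch in plain Python; the tuple layout and the names accuracy, complex_rule, older_than_50, and owns_house are our own illustrative choices, not code from scikit-learn or mglearn.

```python
# Rows of Table 2-1:
# (age, cars, owns_house, children, marital_status, owns_dog, bought_boat)
customers = [
    (66, 1, True, 2, "widowed", False, True),
    (52, 2, True, 3, "married", False, True),
    (22, 0, False, 0, "married", True, False),
    (25, 1, False, 1, "single", False, False),
    (44, 0, False, 2, "divorced", True, False),
    (39, 1, True, 2, "married", True, False),
    (26, 1, False, 2, "single", False, False),
    (40, 3, True, 1, "married", True, False),
    (53, 2, True, 2, "divorced", False, True),
    (64, 2, True, 3, "divorced", False, False),
    (58, 2, True, 2, "married", True, True),
    (33, 1, False, 1, "single", False, False),
]


def accuracy(rule, rows):
    """Fraction of rows for which the rule matches the 'bought a boat' column."""
    return sum(rule(row) == row[-1] for row in rows) / len(rows)


def complex_rule(row):
    # "older than 45, and has less than 3 children or is not divorced"
    age, _, _, children, marital, _, _ = row
    return age > 45 and (children < 3 or marital != "divorced")


def older_than_50(row):
    # the simpler rule: "People older than 50 want to buy a boat"
    return row[0] > 50


def owns_house(row):
    # the too-simple rule: "Everybody who owns a house buys a boat"
    return row[2]


for rule in (complex_rule, older_than_50, owns_house):
    print("{}: {:.2f}".format(rule.__name__, accuracy(rule, customers)))
```

On these rows the complex rule scores 1.00, the age-only rule 0.92 (it misses only the 64-year-old), and the house rule 0.75. All three numbers are training-set accuracies, so none of them tells us how the rules would do on new customers.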
The trade-off between overfitting and underfitting is illustrated in Figure 2-1.

3 And also provably, with the right math.

Figure 2-1. Trade-off of model complexity against training and test accuracy

Relation of Model Complexity to Dataset Size

It’s important to note that model complexity is intimately tied to the variation of inputs contained in your training dataset: the larger variety of data points your dataset contains, the more complex a model you can use without overfitting. Usually, collecting more data points will yield more variety, so larger datasets allow building more complex models. However, simply duplicating the same data points or collecting very similar data will not help.

Going back to the boat selling example, if we saw 10,000 more rows of customer data, and all of them complied with the rule “If the customer is older than 45, and has less than 3 children or is not divorced, then they want to buy a boat,” we would be much more likely to believe this to be a good rule than when it was developed using only the 12 rows in Table 2-1.

Having more data and building appropriately more complex models can often work wonders for supervised learning tasks. In this book, we will focus on working with datasets of fixed sizes. In the real world, you often have the ability to decide how much data to collect, which might be more beneficial than tweaking and tuning your model. Never underestimate the power of more data.

Supervised Machine Learning Algorithms

We will now review the most popular machine learning algorithms and explain how they learn from data and how they make predictions. We will also discuss how the concept of model complexity plays out for each of these models, and provide an overview of how each algorithm builds a model. We will examine the strengths and weaknesses of each algorithm, and what kind of data they can best be applied to.
We will also explain the meaning of the most important parameters and options.4 Many algorithms have a classification and a regression variant, and we will describe both.

4 Discussing all of them is beyond the scope of the book, and we refer you to the scikit-learn documentation for more details.

It is not necessary to read through the descriptions of each algorithm in detail, but understanding the models will give you a better feeling for the different ways machine learning algorithms can work. This chapter can also be used as a reference guide, and you can come back to it when you are unsure about the workings of any of the algorithms.

Some Sample Datasets

We will use several datasets to illustrate the different algorithms. Some of the datasets will be small and synthetic (meaning made-up), designed to highlight particular aspects of the algorithms. Other datasets will be large, real-world examples.

An example of a synthetic two-class classification dataset is the forge dataset, which has two features. The following code creates a scatter plot (Figure 2-2) visualizing all of the data points in this dataset. The plot has the first feature on the x-axis and the second feature on the y-axis. As is always the case in scatter plots, each data point is represented as one dot. The color and shape of the dot indicates its class:

In[2]:
    # generate dataset
    X, y = mglearn.datasets.make_forge()
    # plot dataset
    mglearn.discrete_scatter(X[:, 0], X[:, 1], y)
    plt.legend(["Class 0", "Class 1"], loc=4)
    plt.xlabel("First feature")
    plt.ylabel("Second feature")
    print("X.shape: {}".format(X.shape))

Out[2]:
    X.shape: (26, 2)

Figure 2-2. Scatter plot of the forge dataset

As you can see from X.shape, this dataset consists of 26 data points, with 2 features.

To illustrate regression algorithms, we will use the synthetic wave dataset. The wave dataset has a single input feature and a continuous target variable (or response) that we want to model.
The plot created here (Figure 2-3) shows the single feature on the x-axis and the regression target (the output) on the y-axis:

In[3]:
    X, y = mglearn.datasets.make_wave(n_samples=40)
    plt.plot(X, y, 'o')
    plt.ylim(-3, 3)
    plt.xlabel("Feature")
    plt.ylabel("Target")

Figure 2-3. Plot of the wave dataset, with the x-axis showing the feature and the y-axis showing the regression target

We are using these very simple, low-dimensional datasets because we can easily visualize them: a printed page has two dimensions, so data with more than two features is hard to show. Any intuition derived from datasets with few features (also called low-dimensional datasets) might not hold in datasets with many features (high-dimensional datasets). As long as you keep that in mind, inspecting algorithms on low-dimensional datasets can be very instructive.

We will complement these small synthetic datasets with two real-world datasets that are included in scikit-learn. One is the Wisconsin Breast Cancer dataset (cancer, for short), which records clinical measurements of breast cancer tumors. Each tumor is labeled as “benign” (for harmless tumors) or “malignant” (for cancerous tumors), and the task is to learn to predict whether a tumor is malignant based on the measurements of the tissue.

The data can be loaded using the load_breast_cancer function from scikit-learn:

In[4]:
    from sklearn.datasets import load_breast_cancer
    cancer = load_breast_cancer()
    print("cancer.keys(): \n{}".format(cancer.keys()))

Out[4]:
    cancer.keys():
    dict_keys(['feature_names', 'data', 'DESCR', 'target', 'target_names'])

Datasets that are included in scikit-learn are usually stored as Bunch objects, which contain some information about the dataset as well as the actual data. All you need to know about Bunch objects is that they behave like dictionaries, with the added benefit that you can access values using a dot (as in bunch.key instead of bunch['key']).
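As a quick check of the two access styles just described, the short sketch below loads the cancer dataset and compares dictionary-style and attribute-style access. It assumes scikit-learn is installed; the variable name same_object is ours, and the exact set of keys printed may differ between scikit-learn versions.

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# dictionary-style and attribute-style access refer to the same stored array
same_object = cancer["data"] is cancer.data
print("Same array either way: {}".format(same_object))

# a Bunch also supports the usual dictionary operations, such as key lookup
print("'target_names' in cancer: {}".format("target_names" in cancer))
print("Shape via attribute access: {}".format(cancer.data.shape))
```

Either style can be used interchangeably in the rest of the chapter; the listings here mostly use the attribute form.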
The dataset consists of 569 data points, with 30 features each:

In[5]:
    print("Shape of cancer data: {}".format(cancer.data.shape))

Out[5]:
    Shape of cancer data: (569, 30)

Of these 569 data points, 212 are labeled as malignant and 357 as benign:

In[6]:
    print("Sample counts per class:\n{}".format(
        {n: v for n, v in zip(cancer.target_names, np.bincount(cancer.target))}))

Out[6]:
    Sample counts per class:
    {'benign': 357, 'malignant': 212}

To get a description of the semantic meaning of each feature, we can have a look at the feature_names attribute:

In[7]:
    print("Feature names:\n{}".format(cancer.feature_names))

Out[7]:
    Feature names:
    ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
     'mean smoothness' 'mean compactness' 'mean concavity'
     'mean concave points' 'mean symmetry' 'mean fractal dimension'
     'radius error' 'texture error' 'perimeter error' 'area error'
     'smoothness error' 'compactness error' 'concavity error'
     'concave points error' 'symmetry error' 'fractal dimension error'
     'worst radius' 'worst texture' 'worst perimeter' 'worst area'
     'worst smoothness' 'worst compactness' 'worst concavity'
     'worst concave points' 'worst symmetry' 'worst fractal dimension']

You can find out more about the data by reading cancer.DESCR if you are interested.

We will also be using a real-world regression dataset, the Boston Housing dataset. The task associated with this dataset is to predict the median value of homes in several Boston neighborhoods in the 1970s, using information such as crime rate, proximity to the Charles River, highway accessibility, and so on. The dataset contains 506 data points, described by 13 features:

In[8]:
    from sklearn.datasets import load_boston
    boston = load_boston()
    print("Data shape: {}".format(boston.data.shape))

Out[8]:
    Data shape: (506, 13)

Again, you can get more information about the dataset by reading the DESCR attribute of boston.
For our purposes here, we will actually expand this dataset by not only considering these 13 measurements as input features, but also looking at all products (also called interactions) between features. In other words, we will not only consider crime rate and highway accessibility as features, but also the product of crime rate and highway accessibility. Including derived features like these is called feature engineering, which we will discuss in more detail in Chapter 4. This derived dataset can be loaded using the load_extended_boston function:

In[9]:
    X, y = mglearn.datasets.load_extended_boston()
    print("X.shape: {}".format(X.shape))

Out[9]:
    X.shape: (506, 104)

The resulting 104 features are the 13 original features together with the 91 possible combinations of two features within those 13, where a feature may also be combined with itself (78 products of two different features plus 13 squares).5

5 The number of combinations of k elements that can be selected from a set of n elements is called the binomial coefficient. Often this is written as $\binom{n}{k}$ and spoken as “n choose k.” Choosing two different features out of 13 gives $\binom{13}{2} = 78$ combinations; because a feature can also be paired with itself here, the count is that of combinations with repetition, $\binom{14}{2} = 91$.

We will use these datasets to explain and illustrate the properties of the different machine learning algorithms. But for now, let’s get to the algorithms themselves. First, we will revisit the k-nearest neighbors (k-NN) algorithm that we saw in the previous chapter.

k-Nearest Neighbors

The k-NN algorithm is arguably the simplest machine learning algorithm. Building the model consists only of storing the training dataset. To make a prediction for a new data point, the algorithm finds the closest data points in the training dataset, its “nearest neighbors.”

k-Neighbors classification

In its simplest version, the k-NN algorithm only considers exactly one nearest neighbor, which is the closest training data point to the point we want to make a prediction for. The prediction is then simply the known output for this training point.
Figure 2-4 illustrates this for the case of classification on the forge dataset:

In[10]:
    mglearn.plots.plot_knn_classification(n_neighbors=1)

Figure 2-4. Predictions made by the one-nearest-neighbor model on the forge dataset

Here, we added three new data points, shown as stars. For each of them, we marked the closest point in the training set. The prediction of the one-nearest-neighbor algorithm is the label of that point (shown by the color of the cross).

Instead of considering only the closest neighbor, we can also consider an arbitrary number, k, of neighbors. This is where the name of the k-nearest neighbors algorithm comes from. When considering more than one neighbor, we use voting to assign a label. This means that for each test point, we count how many neighbors belong to class 0 and how many neighbors belong to class 1. We then assign the class that is more frequent: in other words, the majority class among the k-nearest neighbors. The following example (Figure 2-5) uses the three closest neighbors:

In[11]:
    mglearn.plots.plot_knn_classification(n_neighbors=3)

Figure 2-5. Predictions made by the three-nearest-neighbors model on the forge dataset

Again, the prediction is shown as the color of the cross. You can see that the prediction for the new data point at the top left is not the same as the prediction when we used only one neighbor.

While this illustration is for a binary classification problem, this method can be applied to datasets with any number of classes. For more classes, we count how many neighbors belong to each class and again predict the most common class.

Now let’s look at how we can apply the k-nearest neighbors algorithm using scikit-learn.
First, we split our data into a training and a test set so we can evaluate generalization performance, as discussed in Chapter 1:

In[12]:
    from sklearn.model_selection import train_test_split
    X, y = mglearn.datasets.make_forge()

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Next, we import and instantiate the class. This is when we can set parameters, like the number of neighbors to use. Here, we set it to 3:

In[13]:
    from sklearn.neighbors import KNeighborsClassifier
    clf = KNeighborsClassifier(n_neighbors=3)

Now, we fit the classifier using the training set. For KNeighborsClassifier this means storing the dataset, so we can compute neighbors during prediction:

In[14]:
    clf.fit(X_train, y_train)

To make predictions on the test data, we call the predict method. For each data point in the test set, this computes its nearest neighbors in the training set and finds the most common class among these:

In[15]:
    print("Test set predictions: {}".format(clf.predict(X_test)))

Out[15]:
    Test set predictions: [1 0 1 0 1 0 0]

To evaluate how well our model generalizes, we can call the score method with the test data together with the test labels:

In[16]:
    print("Test set accuracy: {:.2f}".format(clf.score(X_test, y_test)))

Out[16]:
    Test set accuracy: 0.86

We see that our model is about 86% accurate, meaning the model predicted the class correctly for 86% of the samples in the test dataset.

Analyzing KNeighborsClassifier

For two-dimensional datasets, we can also illustrate the prediction for all possible test points in the xy-plane. We color the plane according to the class that would be assigned to a point in this region. This lets us view the decision boundary, which is the divide between where the algorithm assigns class 0 versus where it assigns class 1.
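The prediction step described above (find the k closest training points, then take a majority vote) can also be sketched without scikit-learn. This is a minimal illustration in plain Python, not the implementation used by KNeighborsClassifier; the function name knn_predict and the tiny dataset are made up for this example and are not the forge data.

```python
import math
from collections import Counter


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # order the training indices by Euclidean distance to the new point
    order = sorted(range(len(X_train)),
                   key=lambda i: math.dist(X_train[i], x_new))
    # count the labels of the k closest points and return the most frequent one
    votes = Counter(y_train[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# made-up two-feature training set; "building the model" is just storing it
X_train = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2), (2.2, 2.2),
           (4.0, 4.5), (5.0, 5.0), (4.5, 3.5)]
y_train = [0, 0, 0, 1, 1, 1, 1]

x_new = (2.1, 2.0)
print("k=1:", knn_predict(X_train, y_train, x_new, k=1))
print("k=3:", knn_predict(X_train, y_train, x_new, k=3))
```

For this query point the single nearest neighbor belongs to class 1, while two of the three nearest neighbors belong to class 0, so the prediction changes from 1 to 0 when going from one to three neighbors. This mirrors the kind of change visible between the one-neighbor and three-neighbor figures.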
The following code produces the visualizations of the decision boundaries for one, three, and nine neighbors shown in Figure 2-6:

In[17]:
    fig, axes = plt.subplots(1, 3, figsize=(10, 3))

    for n_neighbors, ax in zip([1, 3, 9], axes):
        # the fit method returns the object self, so we can instantiate
        # and fit in one line
        clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X, y)
        mglearn.plots.plot_2d_separator(clf, X, fill=True, eps=0.5, ax=ax, alpha=.4)
        mglearn.discrete_scatter(X[:, 0], X[:, 1], y, ax=ax)
        ax.set_title("{} neighbor(s)".format(n_neighbors))
        ax.set_xlabel("feature 0")
        ax.set_ylabel("feature 1")
    axes[0].legend(loc=3)

Figure 2-6. Decision boundaries created by the nearest neighbors model for different values of n_neighbors

As you can see on the left in the figure, using a single neighbor results in a decision boundary that follows the training data closely. Considering more and more neighbors leads to a smoother decision boundary. A smoother boundary corresponds to a simpler model. In other words, using few neighbors corresponds to high model complexity (as shown on the right side of Figure 2-1), and using many neighbors corresponds to low model complexity (as shown on the left side of Figure 2-1). If you consider the extreme case where the number of neighbors is the number of all data points in the training set, each test point would have exactly the same neighbors (all training points) and all predictions would be the same: the class that is most frequent in the training set.

Let’s investigate whether we can confirm the connection between model complexity and generalization that we discussed earlier. We will do this on the real-world Breast Cancer dataset. We begin by splitting the dataset into a training and a test set. Then we evaluate training and test set performance with different numbers of neighbors.
The results are shown in Figure 2-7:

In[18]:
    from sklearn.datasets import load_breast_cancer

    cancer = load_breast_cancer()
    X_train, X_test, y_train, y_test = train_test_split(
        cancer.data, cancer.target, stratify=cancer.target, random_state=66)

    training_accuracy = []
    test_accuracy = []
    # try n_neighbors from 1 to 10
    neighbors_settings = range(1, 11)

    for n_neighbors in neighbors_settings:
        # build the model
        clf = KNeighborsClassifier(n_neighbors=n_neighbors)
        clf.fit(X_train, y_train)
        # record training set accuracy
        training_accuracy.append(clf.score(X_train, y_train))
        # record generalization accuracy
        test_accuracy.append(clf.score(X_test, y_test))

    plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
    plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
    plt.ylabel("Accuracy")
    plt.xlabel("n_neighbors")
    plt.legend()

The plot shows the training and test set accuracy on the y-axis against the setting of n_neighbors on the x-axis. While real-world plots are rarely very smooth, we can still recognize some of the characteristics of overfitting and underfitting (note that because considering fewer neighbors corresponds to a more complex model, the plot is horizontally flipped relative to the illustration in Figure 2-1). Considering a single nearest neighbor, the prediction on the training set is perfect. But when more neighbors are considered, the model becomes simpler and the training accuracy drops. The test set accuracy for using a single neighbor is lower than when using more neighbors, indicating that using the single nearest neighbor leads to a model that is too complex. On the other hand, when considering 10 neighbors, the model is too simple and performance is even worse. The best performance is somewhere in the middle, using around six neighbors. Still, it is good to keep the scale of the plot in mind. The worst performance is around 88% accuracy, which might still be acceptable.

Figure 2-7. Comparison of training and test accuracy as a function of n_neighbors

k-neighbors regression

There is also a regression variant of the k-nearest neighbors algorithm. Again, let’s start by using the single nearest neighbor, this time using the wave dataset. We’ve added three test data points as green stars on the x-axis. The prediction using a single neighbor is just the target value of the nearest neighbor. These are shown as blue stars in Figure 2-8:

In[19]:
    mglearn.plots.plot_knn_regression(n_neighbors=1)

Figure 2-8. Predictions made by one-nearest-neighbor regression on the wave dataset

Again, we can use more than the single closest neighbor for regression. When using multiple nearest neighbors, the prediction is the average, or mean, of the relevant neighbors (Figure 2-9):

In[20]:
    mglearn.plots.plot_knn_regression(n_neighbors=3)

Figure 2-9. Predictions made by three-nearest-neighbors regression on the wave dataset

The k-nearest neighbors algorithm for regression is implemented in the KNeighborsRegressor class in scikit-learn. It’s used similarly to KNeighborsClassifier:

In[21]:
    from sklearn.neighbors import KNeighborsRegressor

    X, y = mglearn.datasets.make_wave(n_samples=40)

    # split the wave dataset into a training and a test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # instantiate the model and set the number of neighbors to consider to 3
    reg = KNeighborsRegressor(n_neighbors=3)
    # fit the model using the training data and training targets
    reg.fit(X_train, y_train)

Now we can make predictions on the test set:

In[22]:
    print("Test set predictions:\n{}".format(reg.predict(X_test)))

Out[22]:
    Test set predictions:
    [-0.054  0.357  1.137 -1.894 -1.139 -1.631  0.357  0.912 -0.447 -1.139]

We can also evaluate the model using the score method, which for regressors returns the R² score.
The R² score, also known as the coefficient of determination, is a measure of goodness of a prediction for a regression model, and usually yields a score between 0 and 1. A value of 1 corresponds to a perfect prediction, and a value of 0 corresponds to a constant model that just predicts the mean of the training set responses, y_train. The score can also be negative, which happens when the predictions are worse than such a constant model:

In[23]:
    print("Test set R^2: {:.2f}".format(reg.score(X_test, y_test)))

Out[23]:
    Test set R^2: 0.83

Here, the score is 0.83, which indicates a relatively good model fit.

Analyzing KNeighborsRegressor

For our one-dimensional dataset, we can see what the predictions look like for all possible feature values (Figure 2-10). To do this, we create a test dataset consisting of many points on the line:

In[24]:
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    # create 1,000 data points, evenly spaced between -3 and 3
    line = np.linspace(-3, 3, 1000).reshape(-1, 1)
    for n_neighbors, ax in zip([1, 3, 9], axes):
        # make predictions using 1, 3, or 9 neighbors
        reg = KNeighborsRegressor(n_neighbors=n_neighbors)
        reg.fit(X_train, y_train)
        ax.plot(line, reg.predict(line))
        ax.plot(X_train, y_train, '^', c=mglearn.cm2(0), markersize=8)
        ax.plot(X_test, y_test, 'v', c=mglearn.cm2(1), markersize=8)

        ax.set_title(
            "{} neighbor(s)\n train score: {:.2f} test score: {:.2f}".format(
                n_neighbors, reg.score(X_train, y_train),
                reg.score(X_test, y_test)))
        ax.set_xlabel("Feature")
        ax.set_ylabel("Target")
    axes[0].legend(["Model predictions", "Training data/target",
                    "Test data/target"], loc="best")

Figure 2-10. Comparing predictions made by nearest neighbors regression for different values of n_neighbors

As we can see from the plot, using only a single neighbor, each point in the training set has an obvious influence on the predictions, and the predicted values go through all of the data points. This leads to a very unsteady prediction. Considering more neighbors leads to smoother predictions, but these do not fit the training data as well.
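Both ideas from this section, predicting the mean target of the nearest neighbors and scoring with the coefficient of determination, fit in a few lines of plain Python. The sketch below is only an illustration under our own assumptions: the names knn_regress and r2 and the small single-feature dataset are made up, and the score uses the mean of the evaluated targets as its constant baseline, which is the usual definition of the coefficient of determination.

```python
def knn_regress(X_train, y_train, x_new, k=3):
    """Mean target of the k training points closest to x_new (single feature)."""
    order = sorted(range(len(X_train)), key=lambda i: abs(X_train[i] - x_new))
    return sum(y_train[i] for i in order[:k]) / k


def r2(y_true, y_pred):
    """1 minus the ratio of squared prediction error to squared deviation from the mean."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# made-up one-dimensional data (not the wave dataset)
X_train = [-2.0, -1.0, 0.0, 1.0, 2.0]
y_train = [-1.5, -0.5, 0.2, 0.8, 1.6]
X_test = [-1.2, 0.4, 1.7]
y_test = [-1.0, 0.5, 1.2]

for k in (1, 3):
    y_pred = [knn_regress(X_train, y_train, x, k=k) for x in X_test]
    print("k={}: R^2 = {:.2f}".format(k, r2(y_test, y_pred)))

# a constant prediction at the mean of the targets scores 0;
# predictions that are worse than that constant score below 0
mean_test = sum(y_test) / len(y_test)
print("constant model: {:.2f}".format(r2(y_test, [mean_test] * len(y_test))))
print("bad model: {:.2f}".format(r2(y_test, [1.0, -0.5, -1.2])))
```

With k=1 the prediction for a new point is exactly the target of its closest training point, so the fitted curve passes through every training point; with larger k the averaging smooths the curve at the cost of a looser fit to the training data.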
Strengths, weaknesses, and parameters

In principle, there are two important parameters to the KNeighbors classifier: the number of neighbors and how you measure distance between data points. In practice, using a small number of neighbors like three or five often works well, but you should certainly adjust this parameter. Choosing the right distance measure is somewhat beyond the scope of this book. By default, Euclidean distance is used, which works well in many settings.

One of the strengths of k-NN is that the model is very easy to understand, and often gives reasonable performance without a lot of adjustments. Using this algorithm is a good baseline method to try before considering more advanced techniques. Building the nearest neighbors model is usually very fast, but when your training set is very large (either in number of features or in number of samples), prediction can be slow. When using the k-NN algorithm, it’s important to preprocess your data (see Chapter 3). This approach often does not perform well on datasets with many features (hundreds or more), and it does particularly badly with datasets where most features are 0 most of the time (so-called sparse datasets).

So, while the k-nearest neighbors algorithm is easy to understand, it is not often used in practice, due to prediction being slow and its inability to handle many features. The method we discuss next has neither of these drawbacks.

Linear Models

Linear models are a class of models that are widely used in practice and have been studied extensively in the last few decades, with roots going back over a hundred years. Linear models make a prediction using a linear function of the input features, which we will explain shortly.

Linear models for regression

For regression, the general prediction formula for a linear model looks as follows:

    ŷ = w[0] * x[0] + w[1] * x[1] + ...
        + w[p] * x[p] + b

Here, x[0] to x[p] denote the features of a single data point (in this example, the number of features is p + 1), w and b are parameters of the model that are learned, and ŷ is the prediction the model makes. For a dataset with a single feature, this is:

    ŷ = w[0] * x[0] + b

which you might remember from high school mathematics as the equation for a line. Here, w[0] is the slope and b is the y-axis offset. For more features, w contains the slopes along each feature axis. Alternatively, you can think of the predicted response as being a weighted sum of the input features, with weights (which can be negative) given by the entries of w.

Trying to learn the parameters w[0] and b on our one-dimensional wave dataset might lead to the following line (see Figure 2-11):

In[25]:
    mglearn.plots.plot_linear_regression_wave()

Out[25]:
    w[0]: 0.393906  b: -0.031804

Figure 2-11. Predictions of a linear model on the wave dataset

We added a coordinate cross into the plot to make it easier to understand the line. Looking at w[0] we see that the slope should be around 0.4, which we can confirm visually in the plot. The intercept is where the prediction line should cross the y-axis: this is slightly below zero, which you can also confirm in the image.

Linear models for regression can be characterized as regression models for which the prediction is a line for a single feature, a plane when using two features, or a hyperplane in higher dimensions (that is, when using more features).

If you compare the predictions made by the straight line with those made by the KNeighborsRegressor in Figure 2-10, using a straight line to make predictions seems very restrictive. It looks like all the fine details of the data are lost. In a sense, this is true. It is a strong (and somewhat unrealistic) assumption that our target y is a linear combination of the features.
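The prediction formula given earlier in this section, a weighted sum of the features plus an offset, can be written as a single matrix product. The sketch below is illustrative only: linear_predict is a hypothetical helper, and the weights are made-up values (the single-feature ones are chosen to resemble the line described above, with a slope of about 0.4 and an intercept slightly below zero), not parameters learned from data.

```python
import numpy as np


def linear_predict(X, w, b):
    """Return w[0]*x[0] + ... + w[p]*x[p] + b for every row x of X."""
    return X @ w + b


# Single feature: the prediction is a line with slope w[0] and offset b.
# These values are assumptions for illustration, not fitted parameters.
w = np.array([0.4])
b = -0.03
X = np.array([[-1.0], [0.0], [1.0]])
y_hat = linear_predict(X, w, b)
print(y_hat)

# Two features: the prediction is a plane; weights may be negative.
w2 = np.array([2.0, -1.0])
b2 = 0.5
X2 = np.array([[1.0, 3.0]])
y_hat2 = linear_predict(X2, w2, b2)
print(y_hat2)
```

At x[0] = 0 the single-feature prediction equals b, which is the point where the line crosses the y-axis.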
But looking at one-dimensional data gives a somewhat skewed perspective. For datasets with many features, linear models can be very powerful. In particular, if you have more features than training data points, any target y can be perfectly modeled (on the training set) as a linear function (this is easy to see if you know some linear algebra).

There are many different linear models for regression. The difference between these models lies in how the model parameters w and b are learned from the training data, and how model complexity can be controlled. We will now take a look at the most popular linear models for regression.

Linear regression (aka ordinary least squares)

Linear regression, or ordinary least squares (OLS), is the simplest and most classic linear method for regression. Linear regression finds the parameters w and b that minimize the mean squared error between predictions and the true regression targets, y, on the training set. The mean squared error is the sum of the squared differences between the predictions and the true values, divided by the number of samples. Linear regression has no parameters to tune, which is a benefit, but it also has no way to control model complexity.

Here is the code that produces the model you can see in Figure 2-11:

In[26]:
    from sklearn.linear_model import LinearRegression

    X, y = mglearn.datasets.make_wave(n_samples=60)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    lr = LinearRegression().fit(X_train, y_train)

The “slope” parameters (w), also called weights or coefficients, are stored in the coef_ attribute, while the offset or intercept (b) is stored in the intercept_ attribute:

In[27]:
    print("lr.coef_: {}".format(lr.coef_))
    print("lr.intercept_: {}".format(lr.intercept_))

Out[27]:
    lr.coef_: [ 0.394]
    lr.intercept_: -0.031804343026759746

You might notice the strange-looking trailing underscore at the end of coef_ and intercept_. scikit-learn always stores a