Machine Learning with Python Cookbook: Practical Solutions from Preprocessing to Deep Learning
Chris Albon
This practical guide provides nearly 200 self-contained recipes to help you solve machine learning challenges you may encounter in your daily work. If you’re comfortable with Python and its libraries, including pandas and scikit-learn, you’ll be able to address specific problems such as loading data, handling text or numerical data, model selection, and dimensionality reduction, among many other topics.
Each recipe includes code that you can copy and paste into a toy dataset to ensure that it actually works. From there, you can insert, combine, or adapt the code to help construct your application. Recipes also include a discussion that explains the solution and provides meaningful context. This cookbook takes you beyond theory and concepts by providing the nuts and bolts you need to construct working machine learning applications.
You’ll find recipes for:
● Vectors, matrices, and arrays
● Handling numerical and categorical data, text, images, and dates and times
● Dimensionality reduction using feature extraction or feature selection
● Model evaluation and selection
● Linear and logistic regression, trees and forests, and k-nearest neighbors
● Support vector machines (SVM), naïve Bayes, clustering, and neural networks
● Saving and loading trained models
Who This Book Is For
This book is not an introduction to machine learning. If you are not comfortable with the basic concepts of machine learning or have never spent time learning machine learning, do not buy this book. Instead, this book is for the machine learning practitioner who, while comfortable with the theory and concepts of machine learning, would benefit from a quick reference containing code to solve challenges he runs into working on machine learning on an everyday basis.
This book assumes the reader is comfortable with the Python programming language and package management.
Who This Book Is Not For
As stated previously, this book is not an introduction to machine learning. This book should not be your first. If you are unfamiliar with concepts like cross-validation, random forest, and gradient descent, you will likely not benefit from this book as much as one of the many high-quality texts specifically designed to introduce you to the topic. I recommend reading one of those books and then coming back to this book to learn working, practical solutions for machine learning.
Categories:
Computers / Cybernetics: Artificial Intelligence
Year:
2018
Edition:
1
Publisher:
O’Reilly Media
Language:
English
Pages:
366
ISBN 10:
1491989386
ISBN 13:
9781491989388
File:
PDF, 4.59 MB
Machine Learning with Python Cookbook
Practical Solutions from Preprocessing to Deep Learning
by Chris Albon

Copyright © 2018 Chris Albon. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Rachel Roumeliotis and Jeff Bleiel
Production Editor: Melanie Yarbrough
Copyeditor: Kim Cofer
Proofreader: Rachel Monaghan
Indexer: Wendy Catalano
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

April 2018: First Edition

Revision History for the First Edition
2018-03-09: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491989388 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Machine Learning with Python Cookbook, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk.
If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

9781491989388 [LSI]

Table of Contents

Preface

1. Vectors, Matrices, and Arrays
  1.0 Introduction
  1.1 Creating a Vector
  1.2 Creating a Matrix
  1.3 Creating a Sparse Matrix
  1.4 Selecting Elements
  1.5 Describing a Matrix
  1.6 Applying Operations to Elements
  1.7 Finding the Maximum and Minimum Values
  1.8 Calculating the Average, Variance, and Standard Deviation
  1.9 Reshaping Arrays
  1.10 Transposing a Vector or Matrix
  1.11 Flattening a Matrix
  1.12 Finding the Rank of a Matrix
  1.13 Calculating the Determinant
  1.14 Getting the Diagonal of a Matrix
  1.15 Calculating the Trace of a Matrix
  1.16 Finding Eigenvalues and Eigenvectors
  1.17 Calculating Dot Products
  1.18 Adding and Subtracting Matrices
  1.19 Multiplying Matrices
  1.20 Inverting a Matrix
  1.21 Generating Random Values

2. Loading Data
  2.0 Introduction
  2.1 Loading a Sample Dataset
  2.2 Creating a Simulated Dataset
  2.3 Loading a CSV File
  2.4 Loading an Excel File
  2.5 Loading a JSON File
  2.6 Querying a SQL Database

3. Data Wrangling
  3.0 Introduction
  3.1 Creating a Data Frame
  3.2 Describing the Data
  3.3 Navigating DataFrames
  3.4 Selecting Rows Based on Conditionals
  3.5 Replacing Values
  3.6 Renaming Columns
  3.7 Finding the Minimum, Maximum, Sum, Average, and Count
  3.8 Finding Unique Values
  3.9 Handling Missing Values
  3.10 Deleting a Column
  3.11 Deleting a Row
  3.12 Dropping Duplicate Rows
  3.13 Grouping Rows by Values
  3.14 Grouping Rows by Time
  3.15 Looping Over a Column
  3.16 Applying a Function Over All Elements in a Column
  3.17 Applying a Function to Groups
  3.18 Concatenating DataFrames
  3.19 Merging DataFrames

4. Handling Numerical Data
  4.0 Introduction
  4.1 Rescaling a Feature
  4.2 Standardizing a Feature
  4.3 Normalizing Observations
  4.4 Generating Polynomial and Interaction Features
  4.5 Transforming Features
  4.6 Detecting Outliers
  4.7 Handling Outliers
  4.8 Discretizating Features
  4.9 Grouping Observations Using Clustering
  4.10 Deleting Observations with Missing Values
  4.11 Imputing Missing Values

5. Handling Categorical Data
  5.0 Introduction
  5.1 Encoding Nominal Categorical Features
  5.2 Encoding Ordinal Categorical Features
  5.3 Encoding Dictionaries of Features
  5.4 Imputing Missing Class Values
  5.5 Handling Imbalanced Classes

6. Handling Text
  6.0 Introduction
  6.1 Cleaning Text
  6.2 Parsing and Cleaning HTML
  6.3 Removing Punctuation
  6.4 Tokenizing Text
  6.5 Removing Stop Words
  6.6 Stemming Words
  6.7 Tagging Parts of Speech
  6.8 Encoding Text as a Bag of Words
  6.9 Weighting Word Importance

7. Handling Dates and Times
  7.0 Introduction
  7.1 Converting Strings to Dates
  7.2 Handling Time Zones
  7.3 Selecting Dates and Times
  7.4 Breaking Up Date Data into Multiple Features
  7.5 Calculating the Difference Between Dates
  7.6 Encoding Days of the Week
  7.7 Creating a Lagged Feature
  7.8 Using Rolling Time Windows
  7.9 Handling Missing Data in Time Series

8. Handling Images
  8.0 Introduction
  8.1 Loading Images
  8.2 Saving Images
  8.3 Resizing Images
  8.4 Cropping Images
  8.5 Blurring Images
  8.6 Sharpening Images
  8.7 Enhancing Contrast
  8.8 Isolating Colors
  8.9 Binarizing Images
  8.10 Removing Backgrounds
  8.11 Detecting Edges
  8.12 Detecting Corners
  8.13 Creating Features for Machine Learning
  8.14 Encoding Mean Color as a Feature
  8.15 Encoding Color Histograms as Features

9. Dimensionality Reduction Using Feature Extraction
  9.0 Introduction
  9.1 Reducing Features Using Principal Components
  9.2 Reducing Features When Data Is Linearly Inseparable
  9.3 Reducing Features by Maximizing Class Separability
  9.4 Reducing Features Using Matrix Factorization
  9.5 Reducing Features on Sparse Data

10. Dimensionality Reduction Using Feature Selection
  10.0 Introduction
  10.1 Thresholding Numerical Feature Variance
  10.2 Thresholding Binary Feature Variance
  10.3 Handling Highly Correlated Features
  10.4 Removing Irrelevant Features for Classification
  10.5 Recursively Eliminating Features

11. Model Evaluation
  11.0 Introduction
  11.1 Cross-Validating Models
  11.2 Creating a Baseline Regression Model
  11.3 Creating a Baseline Classification Model
  11.4 Evaluating Binary Classifier Predictions
  11.5 Evaluating Binary Classifier Thresholds
  11.6 Evaluating Multiclass Classifier Predictions
  11.7 Visualizing a Classifier’s Performance
  11.8 Evaluating Regression Models
  11.9 Evaluating Clustering Models
  11.10 Creating a Custom Evaluation Metric
  11.11 Visualizing the Effect of Training Set Size
  11.12 Creating a Text Report of Evaluation Metrics
  11.13 Visualizing the Effect of Hyperparameter Values

12. Model Selection
  12.0 Introduction
  12.1 Selecting Best Models Using Exhaustive Search
  12.2 Selecting Best Models Using Randomized Search
  12.3 Selecting Best Models from Multiple Learning Algorithms
  12.4 Selecting Best Models When Preprocessing
  12.5 Speeding Up Model Selection with Parallelization
  12.6 Speeding Up Model Selection Using Algorithm-Specific Methods
  12.7 Evaluating Performance After Model Selection

13. Linear Regression
  13.0 Introduction
  13.1 Fitting a Line
  13.2 Handling Interactive Effects
  13.3 Fitting a Nonlinear Relationship
  13.4 Reducing Variance with Regularization
  13.5 Reducing Features with Lasso Regression

14. Trees and Forests
  14.0 Introduction
  14.1 Training a Decision Tree Classifier
  14.2 Training a Decision Tree Regressor
  14.3 Visualizing a Decision Tree Model
  14.4 Training a Random Forest Classifier
  14.5 Training a Random Forest Regressor
  14.6 Identifying Important Features in Random Forests
  14.7 Selecting Important Features in Random Forests
  14.8 Handling Imbalanced Classes
  14.9 Controlling Tree Size
  14.10 Improving Performance Through Boosting
  14.11 Evaluating Random Forests with Out-of-Bag Errors

15. K-Nearest Neighbors
  15.0 Introduction
  15.1 Finding an Observation’s Nearest Neighbors
  15.2 Creating a K-Nearest Neighbor Classifier
  15.3 Identifying the Best Neighborhood Size
  15.4 Creating a Radius-Based Nearest Neighbor Classifier

16. Logistic Regression
  16.0 Introduction
  16.1 Training a Binary Classifier
  16.2 Training a Multiclass Classifier
  16.3 Reducing Variance Through Regularization
  16.4 Training a Classifier on Very Large Data
  16.5 Handling Imbalanced Classes

17. Support Vector Machines
  17.0 Introduction
  17.1 Training a Linear Classifier
  17.2 Handling Linearly Inseparable Classes Using Kernels
  17.3 Creating Predicted Probabilities
  17.4 Identifying Support Vectors
  17.5 Handling Imbalanced Classes

18. Naive Bayes
  18.0 Introduction
  18.1 Training a Classifier for Continuous Features
  18.2 Training a Classifier for Discrete and Count Features
  18.3 Training a Naive Bayes Classifier for Binary Features
  18.4 Calibrating Predicted Probabilities

19. Clustering
  19.0 Introduction
  19.1 Clustering Using K-Means
  19.2 Speeding Up K-Means Clustering
  19.3 Clustering Using Meanshift
  19.4 Clustering Using DBSCAN
  19.5 Clustering Using Hierarchical Merging

20. Neural Networks
  20.0 Introduction
  20.1 Preprocessing Data for Neural Networks
  20.2 Designing a Neural Network
  20.3 Training a Binary Classifier
  20.4 Training a Multiclass Classifier
  20.5 Training a Regressor
  20.6 Making Predictions
  20.7 Visualize Training History
  20.8 Reducing Overfitting with Weight Regularization
  20.9 Reducing Overfitting with Early Stopping
  20.10 Reducing Overfitting with Dropout
  20.11 Saving Model Training Progress
  20.12 k-Fold Cross-Validating Neural Networks
  20.13 Tuning Neural Networks
  20.14 Visualizing Neural Networks
  20.15 Classifying Images
  20.16 Improving Performance with Image Augmentation
  20.17 Classifying Text

21. Saving and Loading Trained Models
  21.0 Introduction
  21.1 Saving and Loading a scikit-learn Model
  21.2 Saving and Loading a Keras Model

Index
Preface

Over the last few years machine learning has become embedded in a wide variety of day-to-day business, nonprofit, and government operations. As the popularity of machine learning increased, a cottage industry of high-quality literature that taught applied machine learning to practitioners developed. This literature has been highly successful in training an entire generation of data scientists and machine learning engineers. This literature also approached the topic of machine learning from the perspective of providing a learning resource to teach an individual what machine learning is and how it works. However, while fruitful, this approach left out a different perspective on the topic: the nuts and bolts of doing machine learning day to day. That is the motivation of this book—not as a tome of machine learning knowledge for the student but as a wrench for the professional, to sit with dog-eared pages on desks ready to solve the practical day-to-day problems of a machine learning practitioner.

More specifically, the book takes a task-based approach to machine learning, with almost 200 self-contained solutions (you can copy and paste the code and it’ll run) for the most common tasks a data scientist or machine learning engineer building a model will run into.

The ultimate goal is for the book to be a reference for people building real machine learning systems. For example, imagine a reader has a JSON file containing 1,000 categorical and numerical features with missing data and categorical target vectors with imbalanced classes, and wants an interpretable model.
The motivation for this book is to provide recipes to help the reader learn processes such as:

• 2.5 Loading a JSON File
• 4.2 Standardizing a Feature
• 5.3 Encoding Dictionaries of Features
• 5.4 Imputing Missing Class Values
• 9.1 Reducing Features Using Principal Components
• 12.2 Selecting Best Models Using Randomized Search
• 14.4 Training a Random Forest Classifier
• 14.7 Selecting Important Features in Random Forests

The goal is for the reader to be able to:

1. Copy/paste the code and gain confidence that it actually works with the included toy dataset.
2. Read the discussion to gain an understanding of the theory behind the technique the code is executing and learn which parameters are important to consider.
3. Insert/combine/adapt the code from the recipes to construct the actual application.

Who This Book Is For

This book is not an introduction to machine learning. If you are not comfortable with the basic concepts of machine learning or have never spent time learning machine learning, do not buy this book. Instead, this book is for the machine learning practitioner who, while comfortable with the theory and concepts of machine learning, would benefit from a quick reference containing code to solve challenges he runs into working on machine learning on an everyday basis.

This book assumes the reader is comfortable with the Python programming language and package management.

Who This Book Is Not For

As stated previously, this book is not an introduction to machine learning. This book should not be your first. If you are unfamiliar with concepts like cross-validation, random forest, and gradient descent, you will likely not benefit from this book as much as one of the many high-quality texts specifically designed to introduce you to the topic. I recommend reading one of those books and then coming back to this book to learn working, practical solutions for machine learning.
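To make the recipe-chaining style concrete, here is a hedged sketch, not from the book: the data and column names are invented for illustration, and it assumes pandas is installed. It chains the flavor of a few of the recipes named above: load JSON-style records, impute a missing numerical value, and one-hot encode a categorical feature.

```python
# Hypothetical sketch: chaining recipe-style steps on invented JSON data.
# Assumes pandas is available; column names are illustrative only.
import json

import pandas as pd

# Simulated JSON records standing in for the reader's file (cf. Recipe 2.5)
raw = json.dumps([
    {"color": "red", "size": 3.0, "target": 0},
    {"color": "blue", "size": None, "target": 1},
    {"color": "red", "size": 5.0, "target": 1},
])
df = pd.DataFrame(json.loads(raw))

# Impute the missing numerical value with the column mean (cf. Recipe 4.11)
df["size"] = df["size"].fillna(df["size"].mean())

# One-hot encode the nominal categorical feature (cf. Recipe 5.1)
features = pd.get_dummies(df.drop(columns="target"), columns=["color"])
target = df["target"]

print(features.shape)  # (3, 3): size plus one indicator column per color
```

From here a model such as a random forest (cf. Recipe 14.4) could be fit on features and target.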
Terminology Used in This Book

Machine learning draws upon techniques from a wide range of fields, including computer science, statistics, and mathematics. For this reason, there is significant variation in the terminology used in the discussions of machine learning:

Observation
A single unit in our level of observation—for example, a person, a sale, or a record.

Learning algorithms
An algorithm used to learn the best parameters of a model—for example, linear regression, naive Bayes, or decision trees.

Models
An output of a learning algorithm’s training. Learning algorithms train models, which we then use to make predictions.

Parameters
The weights or coefficients of a model learned through training.

Hyperparameters
The settings of a learning algorithm that need to be set before training.

Performance
A metric used to evaluate a model.

Loss
A metric to maximize or minimize through training.

Train
Applying a learning algorithm to data using numerical approaches like gradient descent.

Fit
Applying a learning algorithm to data using analytical approaches.

Data
A collection of observations.

Acknowledgments

This book would not have been possible without the kind help of a number of friends and strangers. Listing everyone who lent a hand to this project would be impossible, but I wanted to at least mention: Angela Bassa, Teresa Borcuch, Justin Bozonier, Andre deBruin, Numa Dhamani, Dan Friedman, Joel Grus, Sarah Guido, Bill Kambouroglou, Mat Kelcey, Lizzie Kumar, Hilary Parker, Niti Paudyal, Sebastian Raschka, and Shreya Shankar. I owe them all a beer or five.

CHAPTER 1
Vectors, Matrices, and Arrays

1.0 Introduction

NumPy is the foundation of the Python machine learning stack. NumPy allows for efficient operations on the data structures often used in machine learning: vectors, matrices, and tensors. While NumPy is not the focus of this book, it will show up frequently throughout the following chapters.
This chapter covers the most common NumPy operations we are likely to run into while working on machine learning workflows.

1.1 Creating a Vector

Problem
You need to create a vector.

Solution
Use NumPy to create a one-dimensional array:

# Load library
import numpy as np

# Create a vector as a row
vector_row = np.array([1, 2, 3])

# Create a vector as a column
vector_column = np.array([[1],
                          [2],
                          [3]])

Discussion
NumPy’s main data structure is the multidimensional array. To create a vector, we simply create a one-dimensional array. Just like vectors, these arrays can be represented horizontally (i.e., rows) or vertically (i.e., columns).

See Also
• Vectors, Math Is Fun
• Euclidean vector, Wikipedia

1.2 Creating a Matrix

Problem
You need to create a matrix.

Solution
Use NumPy to create a two-dimensional array:

# Load library
import numpy as np

# Create a matrix
matrix = np.array([[1, 2],
                   [1, 2],
                   [1, 2]])

Discussion
To create a matrix we can use a NumPy two-dimensional array. In our solution, the matrix contains three rows and two columns (a column of 1s and a column of 2s).

NumPy actually has a dedicated matrix data structure:

matrix_object = np.mat([[1, 2],
                        [1, 2],
                        [1, 2]])

matrix([[1, 2],
        [1, 2],
        [1, 2]])

However, the matrix data structure is not recommended for two reasons. First, arrays are the de facto standard data structure of NumPy. Second, the vast majority of NumPy operations return arrays, not matrix objects.

See Also
• Matrix, Wikipedia
• Matrix, Wolfram MathWorld

1.3 Creating a Sparse Matrix

Problem
Given data with very few nonzero values, you want to efficiently represent it.
Solution
Create a sparse matrix:

# Load libraries
import numpy as np
from scipy import sparse

# Create a matrix
matrix = np.array([[0, 0],
                   [0, 1],
                   [3, 0]])

# Create compressed sparse row (CSR) matrix
matrix_sparse = sparse.csr_matrix(matrix)

Discussion
A frequent situation in machine learning is having a huge amount of data; however, most of the elements in the data are zeros. For example, imagine a matrix where the columns are every movie on Netflix, the rows are every Netflix user, and the values are how many times a user has watched that particular movie. This matrix would have tens of thousands of columns and millions of rows! However, since most users do not watch most movies, the vast majority of elements would be zero.

Sparse matrices only store nonzero elements and assume all other values will be zero, leading to significant computational savings. In our solution, we created a NumPy array with two nonzero values, then converted it into a sparse matrix. If we view the sparse matrix we can see that only the nonzero values are stored:

# View sparse matrix
print(matrix_sparse)

  (1, 1)    1
  (2, 0)    3

There are a number of types of sparse matrices. However, in compressed sparse row (CSR) matrices, (1, 1) and (2, 0) represent the (zero-indexed) indices of the nonzero values 1 and 3, respectively. For example, the element 1 is in the second row and second column.
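As a supplementary check not in the original text (it assumes SciPy is installed, as the solution above does), we can inspect what the CSR matrix actually stores and confirm that the dense matrix is recoverable:

```python
# Inspect the internals of a CSR matrix built from the same example data
import numpy as np
from scipy import sparse

matrix = np.array([[0, 0],
                   [0, 1],
                   [3, 0]])
matrix_sparse = sparse.csr_matrix(matrix)

# Only the nonzero values are stored...
print(matrix_sparse.data)  # [1 3]

# ...and nnz counts them
print(matrix_sparse.nnz)   # 2

# Converting back to a dense array recovers the original matrix
print(matrix_sparse.toarray())
```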
We can see the advantage of sparse matrices if we create a much larger matrix with many more zero elements and then compare this larger matrix with our original sparse matrix:

# Create larger matrix
matrix_large = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                         [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
                         [3, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

# Create compressed sparse row (CSR) matrix
matrix_large_sparse = sparse.csr_matrix(matrix_large)

# View original sparse matrix
print(matrix_sparse)

  (1, 1)    1
  (2, 0)    3

# View larger sparse matrix
print(matrix_large_sparse)

  (1, 1)    1
  (2, 0)    3

As we can see, despite the fact that we added many more zero elements in the larger matrix, its sparse representation is exactly the same as our original sparse matrix. That is, the addition of zero elements did not change the size of the sparse matrix.

As mentioned, there are many different types of sparse matrices, such as compressed sparse column, list of lists, and dictionary of keys. While an explanation of the different types and their implications is outside the scope of this book, it is worth noting that while there is no “best” sparse matrix type, there are meaningful differences between them and we should be conscious about why we are choosing one type over another.

See Also
• Sparse matrices, SciPy documentation
• 101 Ways to Store a Sparse Matrix

1.4 Selecting Elements

Problem
You need to select one or more elements in a vector or matrix.

Solution
NumPy’s arrays make that easy:

# Load library
import numpy as np

# Create row vector
vector = np.array([1, 2, 3, 4, 5, 6])

# Create matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Select third element of vector
vector[2]

3

# Select second row, second column
matrix[1,1]

5

Discussion
Like most things in Python, NumPy arrays are zero-indexed, meaning that the index of the first element is 0, not 1.
With that caveat, NumPy offers a wide variety of methods for selecting (i.e., indexing and slicing) elements or groups of elements in arrays:

# Select all elements of a vector
vector[:]

array([1, 2, 3, 4, 5, 6])

# Select everything up to and including the third element
vector[:3]

array([1, 2, 3])

# Select everything after the third element
vector[3:]

array([4, 5, 6])

# Select the last element
vector[-1]

6

# Select the first two rows and all columns of a matrix
matrix[:2,:]

array([[1, 2, 3],
       [4, 5, 6]])

# Select all rows and the second column
matrix[:,1:2]

array([[2],
       [5],
       [8]])

1.5 Describing a Matrix

Problem
You want to describe the shape, size, and dimensions of the matrix.

Solution
Use shape, size, and ndim:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12]])

# View number of rows and columns
matrix.shape

(3, 4)

# View number of elements (rows * columns)
matrix.size

12

# View number of dimensions
matrix.ndim

2

Discussion
This might seem basic (and it is); however, time and again it will be valuable to check the shape and size of an array both for further calculations and simply as a gut check after some operation.

1.6 Applying Operations to Elements

Problem
You want to apply some function to multiple elements in an array.

Solution
Use NumPy’s vectorize:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Create function that adds 100 to something
add_100 = lambda i: i + 100

# Create vectorized function
vectorized_add_100 = np.vectorize(add_100)

# Apply function to all elements in matrix
vectorized_add_100(matrix)

array([[101, 102, 103],
       [104, 105, 106],
       [107, 108, 109]])

Discussion
NumPy’s vectorize class converts a function into a function that can apply to all elements in an array or slice of an array.
It’s worth noting that vectorize is essentially a for loop over the elements and does not increase performance. Furthermore, NumPy arrays allow us to perform operations between arrays even if their dimensions are not the same (a process called broadcasting). For example, we can create a much simpler version of our solution using broadcasting:

# Add 100 to all elements
matrix + 100

array([[101, 102, 103],
       [104, 105, 106],
       [107, 108, 109]])

1.7 Finding the Maximum and Minimum Values

Problem
You need to find the maximum or minimum value in an array.

Solution
Use NumPy’s max and min:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Return maximum element
np.max(matrix)

9

# Return minimum element
np.min(matrix)

1

Discussion
Often we want to know the maximum and minimum value in an array or subset of an array. This can be accomplished with the max and min functions. Using the axis parameter we can also apply the operation along a certain axis:

# Find maximum element in each column
np.max(matrix, axis=0)

array([7, 8, 9])

# Find maximum element in each row
np.max(matrix, axis=1)

array([3, 6, 9])

1.8 Calculating the Average, Variance, and Standard Deviation

Problem
You want to calculate some descriptive statistics about an array.
Solution
Use NumPy’s mean, var, and std:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Return mean
np.mean(matrix)

5.0

# Return variance
np.var(matrix)

6.666666666666667

# Return standard deviation
np.std(matrix)

2.5819888974716112

Discussion
Just like with max and min, we can easily get descriptive statistics about the whole matrix or do calculations along a single axis:

# Find the mean value in each column
np.mean(matrix, axis=0)

array([ 4.,  5.,  6.])

1.9 Reshaping Arrays

Problem
You want to change the shape (number of rows and columns) of an array without changing the element values.

Solution
Use NumPy’s reshape:

# Load library
import numpy as np

# Create 4x3 matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9],
                   [10, 11, 12]])

# Reshape matrix into 2x6 matrix
matrix.reshape(2, 6)

array([[ 1,  2,  3,  4,  5,  6],
       [ 7,  8,  9, 10, 11, 12]])

Discussion
reshape allows us to restructure an array so that we maintain the same data but organize it as a different number of rows and columns. The only requirement is that the original and new matrices contain the same number of elements (i.e., the same size). We can see the size of a matrix using size:

matrix.size

12

One useful argument in reshape is -1, which effectively means “as many as needed,” so reshape(1, -1) means one row and as many columns as needed:

matrix.reshape(1, -1)

array([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12]])

Finally, if we provide one integer, reshape will return a 1D array of that length:

matrix.reshape(12)

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])

1.10 Transposing a Vector or Matrix

Problem
You need to transpose a vector or matrix.
Solution
Use the T method:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Transpose matrix
matrix.T

array([[1, 4, 7],
       [2, 5, 8],
       [3, 6, 9]])

Discussion
Transposing is a common operation in linear algebra where the column and row indices of each element are swapped. One nuanced point that is typically overlooked outside of a linear algebra class is that, technically, a vector cannot be transposed because it is just a collection of values:

# Transpose vector
np.array([1, 2, 3, 4, 5, 6]).T

array([1, 2, 3, 4, 5, 6])

However, it is common to refer to transposing a vector as converting a row vector to a column vector (notice the second pair of brackets) or vice versa:

# Transpose row vector
np.array([[1, 2, 3, 4, 5, 6]]).T

array([[1],
       [2],
       [3],
       [4],
       [5],
       [6]])

1.11 Flattening a Matrix

Problem
You need to transform a matrix into a one-dimensional array.

Solution
Use flatten:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Flatten matrix
matrix.flatten()

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

Discussion
flatten is a simple method to transform a matrix into a one-dimensional array. Alternatively, we can use reshape to create a row vector:

matrix.reshape(1, -1)

array([[1, 2, 3, 4, 5, 6, 7, 8, 9]])

1.12 Finding the Rank of a Matrix

Problem
You need to know the rank of a matrix.

Solution
Use NumPy’s linear algebra method matrix_rank:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 1, 1],
                   [1, 1, 10],
                   [1, 1, 15]])

# Return matrix rank
np.linalg.matrix_rank(matrix)

2

Discussion
The rank of a matrix is the dimension of the vector space spanned by its columns or rows. Finding the rank of a matrix is easy in NumPy thanks to matrix_rank.

See Also
• The Rank of a Matrix, CliffsNotes

1.13 Calculating the Determinant

Problem
You need to know the determinant of a matrix.
Solution
Use NumPy’s linear algebra method det:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
                   [2, 4, 6],
                   [3, 8, 9]])

# Return determinant of matrix
np.linalg.det(matrix)

0.0

Discussion
It can sometimes be useful to calculate the determinant of a matrix. NumPy makes this easy with det.

See Also
• The determinant | Essence of linear algebra, chapter 5, 3Blue1Brown
• Determinant, Wolfram MathWorld

1.14 Getting the Diagonal of a Matrix

Problem
You need to get the diagonal elements of a matrix.

Solution
Use diagonal:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
                   [2, 4, 6],
                   [3, 8, 9]])

# Return diagonal elements
matrix.diagonal()

array([1, 4, 9])

Discussion
NumPy makes getting the diagonal elements of a matrix easy with diagonal. It is also possible to get a diagonal off from the main diagonal by using the offset parameter:

# Return diagonal one above the main diagonal
matrix.diagonal(offset=1)

array([2, 6])

# Return diagonal one below the main diagonal
matrix.diagonal(offset=-1)

array([2, 8])

1.15 Calculating the Trace of a Matrix

Problem
You need to calculate the trace of a matrix.

Solution
Use trace:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
                   [2, 4, 6],
                   [3, 8, 9]])

# Return trace
matrix.trace()

14

Discussion
The trace of a matrix is the sum of the diagonal elements and is often used under the hood in machine learning methods. Given a NumPy multidimensional array, we can calculate the trace using trace. We can also return the diagonal of a matrix and calculate its sum:

# Return diagonal and sum elements
sum(matrix.diagonal())

14

See Also
• The Trace of a Square Matrix

1.16 Finding Eigenvalues and Eigenvectors

Problem
You need to find the eigenvalues and eigenvectors of a square matrix.
Solution
Use NumPy’s linalg.eig:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, -1, 3],
                   [1, 1, 6],
                   [3, 8, 9]])

# Calculate eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(matrix)

# View eigenvalues
eigenvalues

array([ 13.55075847,   0.74003145,  -3.29078992])

# View eigenvectors
eigenvectors

array([[-0.17622017, -0.96677403, -0.53373322],
       [-0.435951  ,  0.2053623 , -0.64324848],
       [-0.88254925,  0.15223105,  0.54896288]])

Discussion
Eigenvectors are widely used in machine learning libraries. Intuitively, given a linear transformation represented by a matrix, A, eigenvectors are vectors that, when that transformation is applied, change only in scale (not direction). More formally:

Av = λv

where A is a square matrix, λ contains the eigenvalues, and v contains the eigenvectors. In NumPy’s linear algebra toolset, eig lets us calculate the eigenvalues and eigenvectors of any square matrix.

See Also
• Eigenvectors and Eigenvalues Explained Visually, Setosa.io
• Eigenvectors and eigenvalues | Essence of linear algebra, Chapter 10, 3Blue1Brown

1.17 Calculating Dot Products

Problem
You need to calculate the dot product of two vectors.

Solution
Use NumPy’s dot:

# Load library
import numpy as np

# Create two vectors
vector_a = np.array([1,2,3])
vector_b = np.array([4,5,6])

# Calculate dot product
np.dot(vector_a, vector_b)

32

Discussion
The dot product of two vectors, a and b, is defined as:

∑_{i=1}^{n} a_i b_i

where a_i is the ith element of vector a. We can use NumPy’s dot function to calculate the dot product. Alternatively, in Python 3.5+ we can use the new @ operator:

# Calculate dot product
vector_a @ vector_b

32

See Also
• Vector dot product and vector length, Khan Academy
• Dot Product, Paul’s Online Math Notes

1.18 Adding and Subtracting Matrices

Problem
You want to add or subtract two matrices.
Solution
Use NumPy’s add and subtract:

# Load library
import numpy as np

# Create matrix
matrix_a = np.array([[1, 1, 1],
                     [1, 1, 1],
                     [1, 1, 2]])

# Create matrix
matrix_b = np.array([[1, 3, 1],
                     [1, 3, 1],
                     [1, 3, 8]])

# Add two matrices
np.add(matrix_a, matrix_b)

array([[ 2,  4,  2],
       [ 2,  4,  2],
       [ 2,  4, 10]])

# Subtract two matrices
np.subtract(matrix_a, matrix_b)

array([[ 0, -2,  0],
       [ 0, -2,  0],
       [ 0, -2, -6]])

Discussion
Alternatively, we can simply use the + and - operators:

# Add two matrices
matrix_a + matrix_b

array([[ 2,  4,  2],
       [ 2,  4,  2],
       [ 2,  4, 10]])

1.19 Multiplying Matrices

Problem
You want to multiply two matrices.

Solution
Use NumPy’s dot:

# Load library
import numpy as np

# Create matrix
matrix_a = np.array([[1, 1],
                     [1, 2]])

# Create matrix
matrix_b = np.array([[1, 3],
                     [1, 2]])

# Multiply two matrices
np.dot(matrix_a, matrix_b)

array([[2, 5],
       [3, 7]])

Discussion
Alternatively, in Python 3.5+ we can use the @ operator:

# Multiply two matrices
matrix_a @ matrix_b

array([[2, 5],
       [3, 7]])

If we want to do element-wise multiplication, we can use the * operator:

# Multiply two matrices element-wise
matrix_a * matrix_b

array([[1, 3],
       [1, 4]])

See Also
• Array vs. Matrix Operations, MathWorks

1.20 Inverting a Matrix

Problem
You want to calculate the inverse of a square matrix.

Solution
Use NumPy’s linear algebra inv method:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 4],
                   [2, 5]])

# Calculate inverse of matrix
np.linalg.inv(matrix)

array([[-1.66666667,  1.33333333],
       [ 0.66666667, -0.33333333]])

Discussion
The inverse of a square matrix, A, is a second matrix, A⁻¹, such that:

AA⁻¹ = I

where I is the identity matrix. In NumPy we can use linalg.inv to calculate A⁻¹ if it exists.
To see this in action, we can multiply a matrix by its inverse, and the result is the identity matrix:

# Multiply matrix and its inverse
matrix @ np.linalg.inv(matrix)

array([[ 1.,  0.],
       [ 0.,  1.]])

See Also
• Inverse of a Matrix

1.21 Generating Random Values

Problem
You want to generate pseudorandom values.

Solution
Use NumPy’s random:

# Load library
import numpy as np

# Set seed
np.random.seed(0)

# Generate three random floats between 0.0 and 1.0
np.random.random(3)

array([ 0.5488135 ,  0.71518937,  0.60276338])

Discussion
NumPy offers a wide variety of means to generate random numbers, many more than can be covered here. In our solution we generated floats; however, it is also common to generate integers:

# Generate three random integers between 0 and 10
np.random.randint(0, 11, 3)

array([3, 7, 9])

Alternatively, we can generate numbers by drawing them from a distribution:

# Draw three numbers from a normal distribution with mean 0.0
# and standard deviation of 1.0
np.random.normal(0.0, 1.0, 3)

array([-1.42232584,  1.52006949, -0.29139398])

# Draw three numbers from a logistic distribution with mean 0.0 and scale of 1.0
np.random.logistic(0.0, 1.0, 3)

array([-0.98118713, -0.08939902,  1.46416405])

# Draw three numbers greater than or equal to 1.0 and less than 2.0
np.random.uniform(1.0, 2.0, 3)

array([ 1.47997717,  1.3927848 ,  1.83607876])

Finally, it can sometimes be useful to return the same random numbers multiple times to get predictable, repeatable results. We can do this by setting the “seed” (an integer) of the pseudorandom generator. Random processes with the same seed will always produce the same output. We will use seeds throughout this book so that the code you see in the book and the code you run on your computer produce the same results.
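To make that reproducibility guarantee concrete, here is a minimal sketch (the seed value 42 is arbitrary) showing that re-seeding the generator replays the identical sequence of draws:

```python
import numpy as np

# Seed the pseudorandom generator and draw three floats
np.random.seed(42)
first_draw = np.random.random(3)

# Re-seeding with the same integer replays the same sequence
np.random.seed(42)
second_draw = np.random.random(3)

# The two draws are identical
print(np.array_equal(first_draw, second_draw))  # True
```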
CHAPTER 2
Loading Data

2.0 Introduction
The first step in any machine learning endeavor is to get the raw data into our system. The raw data might be a logfile, dataset file, or database. Furthermore, often we will want to retrieve data from multiple sources. The recipes in this chapter look at methods of loading data from a variety of sources, including CSV files and SQL databases. We also cover methods of generating simulated data with desirable properties for experimentation. Finally, while there are many ways to load data in the Python ecosystem, we will focus on using the pandas library’s extensive set of methods for loading external data, and using scikit-learn (an open source machine learning library in Python) for generating simulated data.

2.1 Loading a Sample Dataset

Problem
You want to load a preexisting sample dataset.

Solution
scikit-learn comes with a number of popular datasets for you to use:

# Load scikit-learn's datasets
from sklearn import datasets

# Load digits dataset
digits = datasets.load_digits()

# Create features matrix
features = digits.data

# Create target vector
target = digits.target

# View first observation
features[0]

array([  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.,   0.,   0.,  13.,
        15.,  10.,  15.,   5.,   0.,   0.,   3.,  15.,   2.,   0.,  11.,
         8.,   0.,   0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.,   0.,
         5.,   8.,   0.,   0.,   9.,   8.,   0.,   0.,   4.,  11.,   0.,
         1.,  12.,   7.,   0.,   0.,   2.,  14.,   5.,  10.,  12.,   0.,
         0.,   0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.])

Discussion
Often we do not want to go through the work of loading, transforming, and cleaning a real-world dataset before we can explore some machine learning algorithm or method. Luckily, scikit-learn comes with some common datasets we can quickly load. These datasets are often called “toy” datasets because they are far smaller and cleaner than a dataset we would see in the real world. Some popular sample datasets in scikit-learn are:

load_boston
Contains 506 observations on Boston housing prices.
It is a good dataset for exploring regression algorithms.

load_iris
Contains 150 observations on the measurements of Iris flowers. It is a good dataset for exploring classification algorithms.

load_digits
Contains 1,797 observations from images of handwritten digits. It is a good dataset for teaching image classification.

See Also
• scikit-learn toy datasets
• The Digit Dataset

2.2 Creating a Simulated Dataset

Problem
You need to generate a dataset of simulated data.

Solution
scikit-learn offers many methods for creating simulated data. Of those, three methods are particularly useful. When we want a dataset designed to be used with linear regression, make_regression is a good choice:

# Load library
from sklearn.datasets import make_regression

# Generate features matrix, target vector, and the true coefficients
features, target, coefficients = make_regression(n_samples = 100,
                                                 n_features = 3,
                                                 n_informative = 3,
                                                 n_targets = 1,
                                                 noise = 0.0,
                                                 coef = True,
                                                 random_state = 1)

# View feature matrix and target vector
print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])

Feature Matrix
 [[ 1.29322588 -0.61736206 -0.11044703]
 [-2.793085    0.36633201  1.93752881]
 [ 0.80186103 -0.18656977  0.0465673 ]]
Target Vector
 [-10.37865986  25.5124503   19.67705609]

If we are interested in creating a simulated dataset for classification, we can use make_classification:

# Load library
from sklearn.datasets import make_classification

# Generate features matrix and target vector
features, target = make_classification(n_samples = 100,
                                       n_features = 3,
                                       n_informative = 3,
                                       n_redundant = 0,
                                       n_classes = 2,
                                       weights = [.25, .75],
                                       random_state = 1)

# View feature matrix and target vector
print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])

Feature Matrix
 [[ 1.06354768 -1.42632219  1.02163151]
 [ 0.23156977  1.49535261  0.33251578]
 [ 0.15972951  0.83533515 -0.40869554]]
Target Vector
 [1 0 0]

Finally, if we want
a dataset designed to work well with clustering techniques, scikit-learn offers make_blobs:

# Load library
from sklearn.datasets import make_blobs

# Generate feature matrix and target vector
features, target = make_blobs(n_samples = 100,
                              n_features = 2,
                              centers = 3,
                              cluster_std = 0.5,
                              shuffle = True,
                              random_state = 1)

# View feature matrix and target vector
print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])

Feature Matrix
 [[ -1.22685609   3.25572052]
 [ -9.57463218   4.38310652]
 [-10.71976941   4.20558148]]
Target Vector
 [0 1 1]

Discussion
As might be apparent from the solutions, make_regression returns a feature matrix of float values and a target vector of float values, while make_classification and make_blobs return a feature matrix of float values and a target vector of integers representing membership in a class.

scikit-learn’s simulated datasets offer extensive options to control the type of data generated. scikit-learn’s documentation contains a full description of all the parameters, but a few are worth noting.

In make_regression and make_classification, n_informative determines the number of features that are used to generate the target vector. If n_informative is less than the total number of features (n_features), the resulting dataset will have redundant features that can be identified through feature selection techniques.

In addition, make_classification contains a weights parameter that allows us to simulate datasets with imbalanced classes. For example, weights = [.25, .75] would return a dataset with 25% of observations belonging to one class and 75% of observations belonging to a second class.

For make_blobs, the centers parameter determines the number of clusters generated.
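As a quick check of that behavior (the value centers = 4 and the variable names below are illustrative, not from the book's example), the number of distinct labels in the target vector matches the number of clusters requested:

```python
import numpy as np
from sklearn.datasets import make_blobs

# Request four clusters instead of three
blob_features, blob_target = make_blobs(n_samples = 100,
                                        n_features = 2,
                                        centers = 4,
                                        random_state = 1)

# One integer label per requested cluster
print(np.unique(blob_target))  # [0 1 2 3]
```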
Using the matplotlib visualization library, we can visualize the clusters generated by make_blobs:

# Load library
import matplotlib.pyplot as plt

# View scatterplot
plt.scatter(features[:,0], features[:,1], c=target)
plt.show()

See Also
• make_regression documentation
• make_classification documentation
• make_blobs documentation

2.3 Loading a CSV File

Problem
You need to import a comma-separated values (CSV) file.

Solution
Use the pandas library’s read_csv to load a local or hosted CSV file:

# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/simulated_data'

# Load dataset
dataframe = pd.read_csv(url)

# View first two rows
dataframe.head(2)

   integer             datetime  category
0        5  2015-01-01 00:00:00         0
1        5  2015-01-01 00:00:01         0

Discussion
There are two things to note about loading CSV files. First, it is often useful to take a quick look at the contents of the file before loading. It can be very helpful to see how a dataset is structured beforehand and what parameters we need to set to load in the file. Second, read_csv has over 30 parameters and therefore the documentation can be daunting. Fortunately, those parameters are mostly there to allow it to handle a wide variety of CSV formats. For example, CSV files get their names from the fact that the values are literally separated by commas (e.g., one row might be 2,"2015-01-01 00:00:00",0); however, it is common for “CSV” files to use other characters as separators, like tabs. pandas’ sep parameter allows us to define the delimiter used in the file.

Although it is not always the case, a common formatting issue with CSV files is that the first line of the file is used to define column headers (e.g., integer, datetime, category in our solution). The header parameter allows us to specify if or where a header row exists. If a header row does not exist, we set header=None.

2.4 Loading an Excel File

Problem
You need to import an Excel spreadsheet.
Solution
Use the pandas library’s read_excel to load an Excel spreadsheet:

# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/simulated_excel'

# Load data
dataframe = pd.read_excel(url, sheetname=0, header=1)

# View the first two rows
dataframe.head(2)

   5  2015-01-01 00:00:00  0
0  5  2015-01-01 00:00:01  0
1  9  2015-01-01 00:00:02  0

Discussion
This solution is similar to our solution for reading CSV files. The main difference is the additional parameter, sheetname, that specifies which sheet in the Excel file we wish to load (in newer versions of pandas, this parameter is named sheet_name). sheetname can accept both strings containing the name of the sheet and integers pointing to sheet positions (zero-indexed). If we need to load multiple sheets, we include them as a list. For example, sheetname=[0,1,2, "Monthly Sales"] will return a dictionary of pandas DataFrames containing the first, second, and third sheets and the sheet named Monthly Sales.

2.5 Loading a JSON File

Problem
You need to load a JSON file for data preprocessing.

Solution
The pandas library provides read_json to convert a JSON file into a pandas object:

# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/simulated_json'

# Load data
dataframe = pd.read_json(url, orient='columns')

# View the first two rows
dataframe.head(2)

   category             datetime  integer
0         0  2015-01-01 00:00:00        5
1         0  2015-01-01 00:00:01        5

Discussion
Importing JSON files into pandas is similar to the last few recipes we have seen. The key difference is the orient parameter, which indicates to pandas how the JSON file is structured. However, it might take some experimenting to figure out which argument (split, records, index, columns, or values) is the right one. Another helpful tool pandas offers is json_normalize, which can help convert semistructured JSON data into a pandas DataFrame.
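As a sketch of what json_normalize does (the nested record below is invented for illustration; note that in older pandas versions the function lives in pandas.io.json rather than at the top level):

```python
import pandas as pd

# A semistructured record with a nested object
records = [{'id': 1, 'name': {'first': 'Molly', 'last': 'Mooney'}}]

# Flatten the nested fields into ordinary dotted column names
flattened = pd.json_normalize(records)

print(list(flattened.columns))  # ['id', 'name.first', 'name.last']
```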
See Also
• json_normalize documentation

2.6 Querying a SQL Database

Problem
You need to load data from a database using the structured query language (SQL).

Solution
pandas’ read_sql_query allows us to make a SQL query to a database and load it:

# Load libraries
import pandas as pd
from sqlalchemy import create_engine

# Create a connection to the database
database_connection = create_engine('sqlite:///sample.db')

# Load data
dataframe = pd.read_sql_query('SELECT * FROM data', database_connection)

# View first two rows
dataframe.head(2)

  first_name last_name  age  preTestScore  postTestScore
0      Jason    Miller   42             4             25
1      Molly  Jacobson   52            24             94

Discussion
Out of all of the recipes presented in this chapter, this recipe is probably the one we will use most in the real world. SQL is the lingua franca for pulling data from databases. In this recipe we first use create_engine to define a connection to a SQL database engine called SQLite. Next we use pandas’ read_sql_query to query that database using SQL and put the results in a DataFrame.

SQL is a language in its own right and, while beyond the scope of this book, it is certainly worth knowing for anyone wanting to learn machine learning. Our SQL query, SELECT * FROM data, asks the database to give us all columns (*) from the table called data.

See Also
• SQLite
• W3Schools SQL Tutorial

CHAPTER 3
Data Wrangling

3.0 Introduction
Data wrangling is a broad term used, often informally, to describe the process of transforming raw data into a clean and organized format ready for use. For us, data wrangling is only one step in preprocessing our data, but it is an important step.

The most common data structure used to “wrangle” data is the data frame, which can be both intuitive and incredibly versatile. Data frames are tabular, meaning that they are based on rows and columns like you would see in a spreadsheet.
Here is a data frame created from data about passengers on the Titanic:

# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data as a dataframe
dataframe = pd.read_csv(url)

# Show first 5 rows
dataframe.head(5)

                                            Name PClass    Age     Sex  Survived  SexCode
0                   Allen, Miss Elisabeth Walton    1st  29.00  female         1        1
1                    Allison, Miss Helen Loraine    1st   2.00  female         0        1
2            Allison, Mr Hudson Joshua Creighton    1st  30.00    male         0        0
3  Allison, Mrs Hudson JC (Bessie Waldo Daniels)    1st  25.00  female         0        1
4                  Allison, Master Hudson Trevor    1st   0.92    male         1        0

There are three important things to notice in this data frame.

First, in a data frame each row corresponds to one observation (e.g., a passenger) and each column corresponds to one feature (gender, age, etc.). For example, by looking at the first observation we can see that Miss Elisabeth Walton Allen stayed in first class, was 29 years old, was female, and survived the disaster.

Second, each column contains a name (e.g., Name, PClass, Age) and each row contains an index number (e.g., 0 for the lucky Miss Elisabeth Walton Allen). We will use these to select and manipulate observations and features.

Third, two columns, Sex and SexCode, contain the same information in different formats. In Sex, a woman is indicated by the string female, while in SexCode, a woman is indicated by using the integer 1. We will want all our features to be unique, and therefore we will need to remove one of these columns.

In this chapter, we will cover a wide variety of techniques to manipulate data frames using the pandas library with the goal of creating a clean, well-structured set of observations for further preprocessing.

3.1 Creating a Data Frame

Problem
You want to create a new data frame.

Solution
pandas has many methods of creating a new DataFrame object.
One easy method is to create an empty data frame using DataFrame and then define each column separately:

# Load library
import pandas as pd

# Create DataFrame
dataframe = pd.DataFrame()

# Add columns
dataframe['Name'] = ['Jacky Jackson', 'Steven Stevenson']
dataframe['Age'] = [38, 25]
dataframe['Driver'] = [True, False]

# Show DataFrame
dataframe

               Name  Age  Driver
0     Jacky Jackson   38    True
1  Steven Stevenson   25   False

Alternatively, once we have created a DataFrame object, we can append new rows to the bottom:

# Create row
new_person = pd.Series(['Molly Mooney', 40, True], index=['Name','Age','Driver'])

# Append row
dataframe.append(new_person, ignore_index=True)

               Name  Age  Driver
0     Jacky Jackson   38    True
1  Steven Stevenson   25   False
2      Molly Mooney   40    True

Discussion
pandas offers what can feel like an infinite number of ways to create a DataFrame. In the real world, creating an empty DataFrame and then populating it will almost never happen. Instead, our DataFrames will be created from real data we have loaded from other sources (e.g., a CSV file or database).

3.2 Describing the Data

Problem
You want to view some characteristics of a DataFrame.
Solution
One of the easiest things we can do after loading the data is view the first few rows using head:

# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Show two rows
dataframe.head(2)

                           Name PClass   Age     Sex  Survived  SexCode
0  Allen, Miss Elisabeth Walton    1st  29.0  female         1        1
1   Allison, Miss Helen Loraine    1st   2.0  female         0        1

We can also take a look at the number of rows and columns:

# Show dimensions
dataframe.shape

(1313, 6)

Additionally, we can get descriptive statistics for any numeric columns using describe:

# Show statistics
dataframe.describe()

              Age     Survived      SexCode
count  756.000000  1313.000000  1313.000000
mean    30.397989     0.342727     0.351866
std     14.259049     0.474802     0.477734
min      0.170000     0.000000     0.000000
25%     21.000000     0.000000     0.000000
50%     28.000000     0.000000     0.000000
75%     39.000000     1.000000     1.000000
max     71.000000     1.000000     1.000000

Discussion
After we load some data, it is a good idea to understand how it is structured and what kind of information it contains. Ideally, we would view the full data directly. But with most real-world cases, the data could have thousands to hundreds of thousands to millions of rows and columns. Instead, we have to rely on pulling samples to view small slices and calculating summary statistics of the data.

In our solution, we are using a toy dataset of the passengers of the Titanic on her last voyage. Using head we can take a look at the first few rows (five by default) of the data. Alternatively, we can use tail to view the last few rows. With shape we can see how many rows and columns our DataFrame contains. And finally, with describe we can see some basic descriptive statistics for any numerical column.

It is worth noting that summary statistics do not always tell the full story. For example, pandas treats the columns Survived and SexCode as numeric columns because they contain 1s and 0s.
However, in this case the numerical values represent categories. For example, if Survived equals 1, it indicates that the passenger survived the disaster. For this reason, some of the summary statistics provided don’t make sense, such as the standard deviation of the SexCode column (an indicator of the passenger’s gender).

3.3 Navigating DataFrames

Problem
You need to select individual data or slices of a DataFrame.

Solution
Use loc or iloc to select one or more rows or values:

# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Select first row
dataframe.iloc[0]

Name        Allen, Miss Elisabeth Walton
PClass                               1st
Age                                   29
Sex                               female
Survived                               1
SexCode                                1
Name: 0, dtype: object

We can use : to define a slice of rows we want, such as selecting the second, third, and fourth rows:

# Select three rows
dataframe.iloc[1:4]

                                            Name PClass   Age     Sex  Survived  SexCode
1                    Allison, Miss Helen Loraine    1st   2.0  female         0        1
2            Allison, Mr Hudson Joshua Creighton    1st  30.0    male         0        0
3  Allison, Mrs Hudson JC (Bessie Waldo Daniels)    1st  25.0  female         0        1

We can even use it to get all rows up to a point, such as all rows up to and including the fourth row:

# Select four rows
dataframe.iloc[:4]

                                            Name PClass   Age     Sex  Survived  SexCode
0                   Allen, Miss Elisabeth Walton    1st  29.0  female         1        1
1                    Allison, Miss Helen Loraine    1st   2.0  female         0        1
2            Allison, Mr Hudson Joshua Creighton    1st  30.0    male         0        0
3  Allison, Mrs Hudson JC (Bessie Waldo Daniels)    1st  25.0  female         0        1

DataFrames do not need to be numerically indexed. We can set the index of a DataFrame to any value that is unique to each row. For example, we can set the index to be passenger names and then select rows using a name:

# Set index
dataframe = dataframe.set_index(dataframe['Name'])

# Show row
dataframe.loc['Allen, Miss Elisabeth Walton']

Name        Allen, Miss Elisabeth Walton
PClass                               1st
Age                                   29
Sex                               female
Survived                               1
SexCode                                1
Name: Allen, Miss Elisabeth Walton, dtype: object

Discussion
All rows in a pandas DataFrame have a unique index value. By default, this index is an integer indicating the row position in the DataFrame; however, it does not have to be. DataFrame indexes can be set to be unique alphanumeric strings or customer numbers. To select individual rows and slices of rows, pandas provides two methods:

• loc is useful when the index of the DataFrame is a label (e.g., a string).
• iloc works by looking for the position in the DataFrame. For example, iloc[0] will return the first row regardless of whether the index is an integer or a label.

It is useful to be comfortable with both loc and iloc since they will come up a lot during data cleaning.

3.4 Selecting Rows Based on Conditionals

Problem
You want to select DataFrame rows based on some condition.
For example, we can set the index to be passenger names and then select rows using a name: # Set index dataframe = dataframe.set_index(dataframe['Name']) # Show row dataframe.loc['Allen, Miss Elisabeth Walton'] Name Allen, Miss Elisabeth Walton PClass 1st Age 29 Sex female Survived 1 SexCode 1 Name: Allen, Miss Elisabeth Walton, dtype: object Discussion All rows in a pandas DataFrame have a unique index value. By default, this index is an integer indicating the row position in the DataFrame; however, it does not have to be. DataFrame indexes can be set to be unique alphanumeric strings or customer numbers. To select individual rows and slices of rows, pandas provides two methods: • loc is useful when the index of the DataFrame is a label (e.g., a string). • iloc works by looking for the position in the DataFrame. For example, iloc[0] will return the first row regardless of whether the index is an integer or a label. It is useful to be comfortable with both loc and iloc since they will come up a lot during data cleaning. 3.4 Selecting Rows Based on Conditionals Problem You want to select DataFrame rows based on some condition. 38  Chapter 3: Data Wrangling Solution This can be easily done in pandas. For example, if we wanted to select all the women on the Titanic: # Load library import pandas as pd # Create URL url = 'https://tinyurl.com/titaniccsv' # Load data dataframe = pd.read_csv(url) # Show top two rows where column 'sex' is 'female' dataframe[dataframe['Sex'] == 'female'].head(2) Name PClass Age Sex Survived SexCode 0 Allen, Miss Elisabeth Walton 1st 29.0 female 1 1 1 Allison, Miss Helen Loraine 1st 2.0 female 0 1 Take a second and look at the format of this solution. dataframe['Sex'] == 'female' is our conditional statement; by wrapping that in dataframe[] we are tell‐ ing pandas to “select all the rows in the DataFrame where the value of data frame['Sex'] is 'female'. Multiple conditions are easy as well. 
For example, here we select all the rows where the passenger is a female 65 or older:

# Filter rows
dataframe[(dataframe['Sex'] == 'female') & (dataframe['Age'] >= 65)]

                                                 Name PClass   Age     Sex  Survived  SexCode
73  Crosby, Mrs Edward Gifford (Catherine Elizabet...    1st  69.0  female         1        1

Discussion
Conditionally selecting and filtering data is one of the most common tasks in data wrangling. You rarely want all the raw data from the source; instead, you are interested in only some subsection of it. For example, you might only be interested in stores in certain states or the records of patients over a certain age.

3.5 Replacing Values

Problem
You need to replace values in a DataFrame.

Solution
pandas' replace is an easy way to find and replace values. For example, we can replace any instance of "female" in the Sex column with "Woman":

# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Replace values, show two rows
dataframe['Sex'].replace("female", "Woman").head(2)

0    Woman
1    Woman
Name: Sex, dtype: object

We can also replace multiple values at the same time:

# Replace "female" and "male" with "Woman" and "Man"
dataframe['Sex'].replace(["female", "male"], ["Woman", "Man"]).head(5)

0    Woman
1    Woman
2      Man
3    Woman
4      Man
Name: Sex, dtype: object

We can also find and replace across the entire DataFrame object by specifying the whole DataFrame instead of a single column:

# Replace values, show two rows
dataframe.replace(1, "One").head(2)

                           Name PClass Age     Sex Survived SexCode
0  Allen, Miss Elisabeth Walton    1st  29  female      One     One
1   Allison, Miss Helen Loraine    1st   2  female        0     One

replace also accepts regular expressions:

# Replace values, show two rows
dataframe.replace(r"1st", "First", regex=True).head(2)

                           Name PClass   Age     Sex  Survived  SexCode
0  Allen, Miss Elisabeth Walton  First  29.0  female         1        1
1   Allison, Miss Helen Loraine  First   2.0  female         0        1

Discussion
replace is a simple tool for substituting values, yet it also has the powerful ability to accept regular expressions.

3.6 Renaming Columns

Problem
You want to rename a column in a pandas DataFrame.

Solution
Rename columns using the rename method:

# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Rename column, show two rows
dataframe.rename(columns={'PClass': 'Passenger Class'}).head(2)

                           Name Passenger Class   Age     Sex  Survived  SexCode
0  Allen, Miss Elisabeth Walton             1st  29.0  female         1        1
1   Allison, Miss Helen Loraine             1st   2.0  female         0        1

Notice that the rename method can accept a dictionary as a parameter. We can use the dictionary to change multiple column names at once:

# Rename columns, show two rows
dataframe.rename(columns={'PClass': 'Passenger Class', 'Sex': 'Gender'}).head(2)

                           Name Passenger Class   Age  Gender  Survived  SexCode
0  Allen, Miss Elisabeth Walton             1st  29.0  female         1        1
1   Allison, Miss Helen Loraine             1st   2.0  female         0        1

Discussion
Using rename with a dictionary as an argument to the columns parameter is my preferred way to rename columns because it works with any number of columns. If we want to rename all columns at once, this helpful snippet of code creates a dictionary with the old column names as keys and empty strings as values:

# Load library
import collections

# Create dictionary
column_names = collections.defaultdict(str)

# Create keys
for name in dataframe.columns:
    column_names[name]

# Show dictionary
column_names

defaultdict(str,
            {'Age': '',
             'Name': '',
             'PClass': '',
             'Sex': '',
             'SexCode': '',
             'Survived': ''})

3.7 Finding the Minimum, Maximum, Sum, Average, and Count

Problem
You want to find the min, max, sum, average, or count of a numeric column.
Solution
pandas comes with some built-in methods for commonly used descriptive statistics:

# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Calculate statistics
print('Maximum:', dataframe['Age'].max())
print('Minimum:', dataframe['Age'].min())
print('Mean:', dataframe['Age'].mean())
print('Sum:', dataframe['Age'].sum())
print('Count:', dataframe['Age'].count())

Maximum: 71.0
Minimum: 0.17
Mean: 30.397989417989415
Sum: 22980.879999999997
Count: 756

Discussion
In addition to the statistics used in the solution, pandas offers variance (var), standard deviation (std), kurtosis (kurt), skewness (skew), standard error of the mean (sem), mode (mode), median (median), and a number of others. Furthermore, we can also apply these methods to the whole DataFrame:

# Show counts
dataframe.count()

Name        1313
PClass      1313
Age          756
Sex         1313
Survived    1313
SexCode     1313
dtype: int64

3.8 Finding Unique Values

Problem
You want to select all unique values in a column.

Solution
Use unique to view an array of all unique values in a column:

# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Select unique values
dataframe['Sex'].unique()

array(['female', 'male'], dtype=object)

Alternatively, value_counts will display all unique values with the number of times each value appears:

# Show counts
dataframe['Sex'].value_counts()

male      851
female    462
Name: Sex, dtype: int64

Discussion
Both unique and value_counts are useful for manipulating and exploring categorical columns. Very often in categorical columns there will be classes that need to be handled in the data wrangling phase. For example, in the Titanic dataset, PClass is a column indicating the class of a passenger's ticket.
There were three classes on the Titanic; however, if we use value_counts we can see a problem:

# Show counts
dataframe['PClass'].value_counts()

3rd    711
1st    322
2nd    279
*        1
Name: PClass, dtype: int64

While almost all passengers belong to one of three classes as expected, a single passenger has the class *. There are a number of strategies for handling this type of issue, which we will address in Chapter 5, but for now just realize that "extra" classes are common in categorical data and should not be ignored.

Finally, if we simply want to count the number of unique values, we can use nunique:

# Show number of unique values
dataframe['PClass'].nunique()

4

3.9 Handling Missing Values

Problem
You want to select missing values in a DataFrame.

Solution
isnull and notnull return booleans indicating whether a value is missing:

# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Select missing values, show two rows
dataframe[dataframe['Age'].isnull()].head(2)

                            Name PClass  Age     Sex  Survived  SexCode
12  Aubert, Mrs Leontine Pauline    1st  NaN  female         1        1
13      Barkworth, Mr Algernon H    1st  NaN    male         1        0

Discussion
Missing values are a ubiquitous problem in data wrangling, yet many underestimate the difficulty of working with missing data. pandas uses NumPy's NaN ("Not A Number") value to denote missing values, but it is important to note that NaN is not fully implemented natively in pandas.
For example, if we wanted to replace all strings containing male with missing values, we get an error:

# Attempt to replace values with NaN
dataframe['Sex'] = dataframe['Sex'].replace('male', NaN)

NameError                                 Traceback (most recent call last)
<ipython-input> in <module>()
      1 # Attempt to replace values with NaN
----> 2 dataframe['Sex'] = dataframe['Sex'].replace('male', NaN)

NameError: name 'NaN' is not defined

To have full functionality with NaN we need to import the NumPy library first:

# Load library
import numpy as np

# Replace values with NaN
dataframe['Sex'] = dataframe['Sex'].replace('male', np.nan)

Oftentimes a dataset uses a specific value to denote a missing observation, such as NONE, -999, or .. pandas' read_csv includes a parameter allowing us to specify the values used to indicate missing values:

# Load data, set missing values
dataframe = pd.read_csv(url, na_values=[np.nan, 'NONE', -999])

3.10 Deleting a Column

Problem
You want to delete a column from your DataFrame.
Solution
The best way to delete a column is to use drop with the parameter axis=1 (i.e., the column axis):

# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Delete column
dataframe.drop('Age', axis=1).head(2)

                           Name PClass     Sex  Survived  SexCode
0  Allen, Miss Elisabeth Walton    1st  female         1        1
1   Allison, Miss Helen Loraine    1st  female         0        1

You can also use a list of column names as the main argument to drop multiple columns at once:

# Drop columns
dataframe.drop(['Age', 'Sex'], axis=1).head(2)

                           Name PClass  Survived  SexCode
0  Allen, Miss Elisabeth Walton    1st         1        1
1   Allison, Miss Helen Loraine    1st         0        1

If a column does not have a name (which can sometimes happen), you can drop it by its column index using dataframe.columns:

# Drop column
dataframe.drop(dataframe.columns[1], axis=1).head(2)

                           Name   Age     Sex  Survived  SexCode
0  Allen, Miss Elisabeth Walton  29.0  female         1        1
1   Allison, Miss Helen Loraine   2.0  female         0        1

Discussion
drop is the idiomatic method of deleting a column. An alternative method is del dataframe['Age'], which works most of the time but is not recommended because of how it is called within pandas (the details of which are outside the scope of this book).

One habit I recommend learning is to never use pandas' inplace=True argument. Many pandas methods include an inplace parameter, which when True edits the DataFrame directly. This can lead to problems in more complex data processing pipelines because we are treating the DataFrames as mutable objects (which they technically are). I recommend treating DataFrames as immutable objects. For example:

# Create a new DataFrame
dataframe_name_dropped = dataframe.drop(dataframe.columns[0], axis=1)

In this example, we are not mutating the DataFrame dataframe but instead are making a new DataFrame that is an altered version of dataframe called dataframe_name_dropped.
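The immutable style also composes naturally with method chaining, since each method returns a new DataFrame. A minimal sketch on an invented toy frame (not the Titanic data):

```python
import pandas as pd

# Invented toy frame for illustration
df = pd.DataFrame({'Age': [29.0, 2.0], 'Sex': ['female', 'female']})

# Each call returns a new DataFrame, so steps chain without mutating df
cleaned = (df.drop('Age', axis=1)
             .rename(columns={'Sex': 'Gender'}))

print(list(df.columns))       # ['Age', 'Sex'] -- original untouched
print(list(cleaned.columns))  # ['Gender']
```

Because df is never modified, each intermediate result can be inspected or reused safely.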
If you treat your DataFrames as immutable objects, you will save yourself a lot of headaches down the road.

3.11 Deleting a Row

Problem
You want to delete one or more rows from a DataFrame.

Solution
Use a boolean condition to create a new DataFrame excluding the rows you want to delete:

# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Delete rows, show first two rows of output
dataframe[dataframe['Sex'] != 'male'].head(2)

                           Name PClass   Age     Sex  Survived  SexCode
0  Allen, Miss Elisabeth Walton    1st  29.0  female         1        1
1   Allison, Miss Helen Loraine    1st   2.0  female         0        1

Discussion
While technically you can use the drop method (for example, df.drop([0, 1], axis=0) to drop the first two rows), a more practical method is simply to wrap a boolean condition inside df[]. The reason is that we can use the power of conditionals to delete either a single row or (far more likely) many rows at once.

We can use boolean conditions to easily delete single rows by matching a unique value:

# Delete row, show first two rows of output
dataframe[dataframe['Name'] != 'Allison, Miss Helen Loraine'].head(2)

                                  Name PClass   Age     Sex  Survived  SexCode
0         Allen, Miss Elisabeth Walton    1st  29.0  female         1        1
2  Allison, Mr Hudson Joshua Creighton    1st  30.0    male         0        0

And we can even use it to delete a single row by row index:

# Delete row, show first two rows of output
dataframe[dataframe.index != 0].head(2)

                                  Name PClass   Age     Sex  Survived  SexCode
1          Allison, Miss Helen Loraine    1st   2.0  female         0        1
2  Allison, Mr Hudson Joshua Creighton    1st  30.0    male         0        0

3.12 Dropping Duplicate Rows

Problem
You want to drop duplicate rows from your DataFrame.
Solution
Use drop_duplicates, but be mindful of the parameters:

# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Drop duplicates, show first two rows of output
dataframe.drop_duplicates().head(2)

                           Name PClass   Age     Sex  Survived  SexCode
0  Allen, Miss Elisabeth Walton    1st  29.0  female         1        1
1   Allison, Miss Helen Loraine    1st   2.0  female         0        1

Discussion
A keen reader will notice that the solution didn't actually drop any rows:

# Show number of rows
print("Number Of Rows In The Original DataFrame:", len(dataframe))
print("Number Of Rows After Deduping:", len(dataframe.drop_duplicates()))

Number Of Rows In The Original DataFrame: 1313
Number Of Rows After Deduping: 1313

The reason is that drop_duplicates defaults to dropping only rows that match perfectly across all columns. Under this condition, every row in our DataFrame, dataframe, is actually unique. However, often we want to consider only a subset of columns to check for duplicate rows. We can accomplish this using the subset parameter:

# Drop duplicates
dataframe.drop_duplicates(subset=['Sex'])

                                  Name PClass   Age     Sex  Survived  SexCode
0         Allen, Miss Elisabeth Walton    1st  29.0  female         1        1
2  Allison, Mr Hudson Joshua Creighton    1st  30.0    male         0        0

Take a close look at the preceding output: we told drop_duplicates to consider any two rows with the same value for Sex to be duplicates and to drop them. Now we are left with a DataFrame of only two rows: one man and one woman. You might be asking why drop_duplicates decided to keep these two rows instead of two different rows. The answer is that drop_duplicates defaults to keeping the first occurrence of a duplicated row and dropping the rest.
We can control this behavior using the keep parameter:

# Drop duplicates
dataframe.drop_duplicates(subset=['Sex'], keep='last')

                     Name PClass   Age     Sex  Survived  SexCode
1307  Zabour, Miss Tamini    3rd   NaN  female         0        1
1312       Zimmerman, Leo    3rd  29.0    male         0        0

A related method is duplicated, which returns a boolean series denoting if a row is a duplicate or not. This is a good option if you don't want to simply drop duplicates.

3.13 Grouping Rows by Values

Problem
You want to group individual rows according to some shared value.

Solution
groupby is one of the most powerful features in pandas:

# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Group rows by the values of the column 'Sex', calculate mean
# of each group
dataframe.groupby('Sex').mean()

              Age  Survived  SexCode
Sex
female  29.396424  0.666667      1.0
male    31.014338  0.166863      0.0

Discussion
groupby is where data wrangling really starts to take shape. It is very common to have a DataFrame where each row is a person or an event and we want to group them according to some criterion and then calculate a statistic. For example, you can imagine a DataFrame where each row is an individual sale at a national restaurant chain and we want the total sales per restaurant. We can accomplish this by grouping rows by individual restaurants and then calculating the sum of each group.

Users new to groupby often write a line like this and are confused by what is returned:

# Group rows
dataframe.groupby('Sex')

<pandas.core.groupby.DataFrameGroupBy object at 0x10efacf28>

Why didn't it return something more useful? The reason is that groupby needs to be paired with some operation we want to apply to each group, such as calculating an aggregate statistic (e.g., mean, median, sum). When talking about grouping we often use shorthand and say "group by gender," but that is incomplete.
For grouping to be useful, we need to group by something and then apply a function to each of those groups:

# Group rows, count rows
dataframe.groupby('Survived')['Name'].count()

Survived
0    863
1    450
Name: Name, dtype: int64

Notice Name added after the groupby? That is because particular summary statistics are only meaningful to certain types of data. For example, while calculating the average age by gender makes sense, calculating the total age by gender does not. In this case we group the data into survived or not, then count the number of names (i.e., passengers) in each group.

We can also group by a first column, then group that grouping by a second column:

# Group rows, calculate mean
dataframe.groupby(['Sex','Survived'])['Age'].mean()

Sex     Survived
female  0           24.901408
        1           30.867143
male    0           32.320780
        1           25.951875
Name: Age, dtype: float64

3.14 Grouping Rows by Time

Problem
You need to group individual rows by time periods.

Solution
Use resample to group rows by chunks of time:

# Load libraries
import pandas as pd
import numpy as np

# Create date range
time_index = pd.date_range('06/06/2017', periods=100000, freq='30S')

# Create DataFrame
dataframe = pd.DataFrame(index=time_index)

# Create column of random values
dataframe['Sale_Amount'] = np.random.randint(1, 10, 100000)

# Group rows by week, calculate sum per week
dataframe.resample('W').sum()

            Sale_Amount
2017-06-11        86423
2017-06-18       101045
2017-06-25       100867
2017-07-02       100894
2017-07-09       100438
2017-07-16        10297

Discussion
Our standard Titanic dataset does not contain a datetime column, so for this recipe we have generated a simple DataFrame where each row is an individual sale. For each sale we know its date and time and its dollar amount (this data isn't realistic because every sale takes place precisely 30 seconds apart and is an exact dollar amount, but for the sake of simplicity let us pretend).
The raw data looks like this:

# Show three rows
dataframe.head(3)

                     Sale_Amount
2017-06-06 00:00:00            7
2017-06-06 00:00:30            2
2017-06-06 00:01:00            7

Notice that the date and time of each sale is the index of the DataFrame; this is because resample requires the index to be datetime-like values.

Using resample we can group the rows by a wide array of time periods (offsets) and then we can calculate some statistic on each time group:

# Group by two weeks, calculate mean
dataframe.resample('2W').mean()

            Sale_Amount
2017-06-11     5.001331
2017-06-25     5.007738
2017-07-09     4.993353
2017-07-23     4.950481

# Group by month, count rows
dataframe.resample('M').count()

            Sale_Amount
2017-06-30        72000
2017-07-31        28000

You might notice that in the two outputs the datetime index is a date despite the fact that we are grouping by weeks and months, respectively. The reason is that by default resample returns the label of the right "edge" (the last label) of the time group. We can control this behavior using the label parameter:

# Group by month, count rows
dataframe.resample('M', label='left').count()

            Sale_Amount
2017-05-31        72000
2017-06-30        28000

See Also
• List of pandas time offset aliases

3.15 Looping Over a Column

Problem
You want to iterate over every element in a column and apply some action.
Solution
You can treat a pandas column like any other sequence in Python:

# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Print first two names uppercased
for name in dataframe['Name'][0:2]:
    print(name.upper())

ALLEN, MISS ELISABETH WALTON
ALLISON, MISS HELEN LORAINE

Discussion
In addition to loops (often called for loops), we can also use list comprehensions:

# Show first two names uppercased
[name.upper() for name in dataframe['Name'][0:2]]

['ALLEN, MISS ELISABETH WALTON', 'ALLISON, MISS HELEN LORAINE']

Despite the temptation to fall back on for loops, a more Pythonic solution would use pandas' apply method, described in the next recipe.

3.16 Applying a Function Over All Elements in a Column

Problem
You want to apply some function over all elements in a column.

Solution
Use apply to apply a built-in or custom function on every element in a column:

# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Create function
def uppercase(x):
    return x.upper()

# Apply function, show two rows
dataframe['Name'].apply(uppercase)[0:2]

0    ALLEN, MISS ELISABETH WALTON
1     ALLISON, MISS HELEN LORAINE
Name: Name, dtype: object

Discussion
apply is a great way to do data cleaning and wrangling. It is common to write a function to perform some useful operation (separate first and last names, convert strings to floats, etc.) and then map that function to every element in a column.

3.17 Applying a Function to Groups

Problem
You have grouped rows using groupby and want to apply a function to each group.
Solution
Combine groupby and apply:

# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Group rows, apply function to groups
dataframe.groupby('Sex').apply(lambda x: x.count())

        Name  PClass  Age  Sex  Survived  SexCode
Sex
female   462     462  288  462       462      462
male     851     851  468  851       851      851

Discussion
In Recipe 3.16 I mentioned apply. apply is particularly useful when you want to apply a function to groups. By combining groupby and apply we can calculate custom statistics or apply any function to each group separately.

3.18 Concatenating DataFrames

Problem
You want to concatenate two DataFrames.

Solution
Use concat with axis=0 to concatenate along the row axis:

# Load library
import pandas as pd

# Create DataFrame
data_a = {'id': ['1', '2', '3'],
          'first': ['Alex', 'Amy', 'Allen'],
          'last': ['Anderson', 'Ackerman', 'Ali']}
dataframe_a = pd.DataFrame(data_a, columns = ['id', 'first', 'last'])

# Create DataFrame
data_b = {'id': ['4', '5', '6'],
          'first': ['Billy', 'Brian', 'Bran'],
          'last': ['Bonder', 'Black', 'Balwner']}
dataframe_b = pd.DataFrame(data_b, columns = ['id', 'first', 'last'])

# Concatenate DataFrames by rows
pd.concat([dataframe_a, dataframe_b], axis=0)

  id  first      last
0  1   Alex  Anderson
1  2    Amy  Ackerman
2  3  Allen       Ali
0  4  Billy    Bonder
1  5  Brian     Black
2  6   Bran   Balwner

You can use axis=1 to concatenate along the column axis:

# Concatenate DataFrames by columns
pd.concat([dataframe_a, dataframe_b], axis=1)

  id  first      last id  first     last
0  1   Alex  Anderson  4  Billy   Bonder
1  2    Amy  Ackerman  5  Brian    Black
2  3  Allen       Ali  6   Bran  Balwner

Discussion
Concatenating is not a word you hear much outside of computer science and programming, so if you have not heard it before, do not worry. The informal definition of concatenate is to glue two objects together.
In the solution we glued together two small DataFrames using the axis parameter to indicate whether we wanted to stack the two DataFrames on top of each other or place them side by side.

Alternatively we can use append to add a new row to a DataFrame:

# Create row
row = pd.Series([10, 'Chris', 'Chillon'], index=['id', 'first', 'last'])

# Append row
dataframe_a.append(row, ignore_index=True)

   id  first      last
0   1   Alex  Anderson
1   2    Amy  Ackerman
2   3  Allen       Ali
3  10  Chris   Chillon

3.19 Merging DataFrames

Problem
You want to merge two DataFrames.

Solution
To inner join, use merge with the on parameter to specify the column to merge on:

# Load library
import pandas as pd

# Create DataFrame
employee_data = {'employee_id': ['1', '2', '3', '4'],
                 'name': ['Amy Jones', 'Allen Keys', 'Alice Bees', 'Tim Horton']}
dataframe_employees = pd.DataFrame(employee_data, columns = ['employee_id', 'name'])

# Create DataFrame
sales_data = {'employee_id': ['3', '4', '5', '6'],
              'total_sales': [23456, 2512, 2345, 1455]}
dataframe_sales = pd.DataFrame(sales_data, columns = ['employee_id', 'total_sales'])

# Merge DataFrames
pd.merge(dataframe_employees, dataframe_sales, on='employee_id')

  employee_id        name  total_sales
0           3  Alice Bees        23456
1           4  Tim Horton         2512

merge defaults to inner joins.
If we want to do an outer join, we can specify that with the how parameter:

# Merge DataFrames
pd.merge(dataframe_employees, dataframe_sales, on='employee_id', how='outer')

  employee_id        name  total_sales
0           1   Amy Jones          NaN
1           2  Allen Keys          NaN
2           3  Alice Bees      23456.0
3           4  Tim Horton       2512.0
4           5         NaN       2345.0
5           6         NaN       1455.0

The same parameter can be used to specify left and right joins:

# Merge DataFrames
pd.merge(dataframe_employees, dataframe_sales, on='employee_id', how='left')

  employee_id        name  total_sales
0           1   Amy Jones          NaN
1           2  Allen Keys          NaN
2           3  Alice Bees      23456.0
3           4  Tim Horton       2512.0

We can also specify the column name in each DataFrame to merge on:

# Merge DataFrames
pd.merge(dataframe_employees, dataframe_sales,
         left_on='employee_id', right_on='employee_id')

  employee_id        name  total_sales
0           3  Alice Bees        23456
1           4  Tim Horton         2512

If instead of merging on two columns we want to merge on the indexes of each DataFrame, we can replace the left_on and right_on parameters with right_index=True and left_index=True.

Discussion
Oftentimes, the data we need to use is complex; it doesn't always come in one piece. Instead, in the real world we're usually faced with disparate datasets from multiple database queries or files. To get all that data into one place, we can load each data query or data file into pandas as individual DataFrames and then merge them together into a single DataFrame.

This process might be familiar to anyone who has used SQL, a popular language for doing merging operations (called joins). While the exact parameters used by pandas will be different, they follow the same general patterns used by other software languages and tools.

There are three aspects to specify with any merge operation. First, we have to specify the two DataFrames we want to merge together. In the solution we named them dataframe_employees and dataframe_sales.
Second, we have to specify the name(s) of the columns to merge on, that is, the columns whose values are shared between the two DataFrames. For example, in our solution both DataFrames have a column named employee_id. To merge the two DataFrames we will match up the values in each DataFrame's employee_id column with each other. If these two columns use the same name, we can use the on parameter. However, if they have different names we can use left_on and right_on.

What is the left and right DataFrame? The simple answer is that the left DataFrame is the first one we specified in merge and the right DataFrame is the second one. This language comes up again in the next sets of parameters we will need.

The last aspect, and most difficult for some people to grasp, is the type of merge operation we want to conduct. This is specified by the how parameter. merge supports the four main types of joins:

Inner
Return only the rows that match in both DataFrames (e.g., return any row with an employee_id value appearing in both dataframe_employees and dataframe_sales).

Outer
Return all rows in both DataFrames. If a row exists in one DataFrame but not in the other DataFrame, fill NaN values for the missing values (e.g., return all rows in both dataframe_employees and dataframe_sales).

Left
Return all rows from the left DataFrame but only rows from the right DataFrame that matched with the left DataFrame. Fill NaN values for the missing values (e.g., return all rows from dataframe_employees but only rows from dataframe_sales that have a value for employee_id that appears in dataframe_employees).

Right
Return all rows from the right DataFrame but only rows from the left DataFrame that matched with the right DataFrame. Fill NaN values for the missing values (e.g., return all rows from dataframe_sales but only rows from dataframe_employees that have a value for employee_id that appears in dataframe_sales).
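The four join types can be sketched on a pair of invented toy frames; comparing the row counts shows how each how value behaves:

```python
import pandas as pd

# Invented toy frames; only employee_id '2' appears in both
left = pd.DataFrame({'employee_id': ['1', '2'], 'name': ['Amy', 'Alice']})
right = pd.DataFrame({'employee_id': ['2', '3'], 'total_sales': [100, 200]})

# Same merge, four different join types
for how in ['inner', 'outer', 'left', 'right']:
    merged = pd.merge(left, right, on='employee_id', how=how)
    print(how, len(merged))
# inner 1, outer 3, left 2, right 2
```

Inner keeps only the shared key, outer keeps every key from both sides (filling NaN), and left/right keep every key from the named side.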
If you did not understand all of that right now, I encourage you to play around with the how parameter in your code and see how it affects what merge returns.

See Also
• A Visual Explanation of SQL Joins
• pandas documentation on merging

CHAPTER 4
Handling Numerical Data

4.0 Introduction
Quantitative data is the measurement of something: class size, monthly sales, or student scores. The natural way to represent these quantities is numerically (e.g., 29 students, $529,392 in sales). In this chapter, we will cover numerous strategies for transforming raw numerical data into features purpose-built for machine learning algorithms.

4.1 Rescaling a Feature

Problem
You need to rescale the values of a numerical feature to be between two values.

Solution
Use scikit-learn's MinMaxScaler to rescale a feature array:

# Load libraries
import numpy as np
from sklearn import preprocessing

# Create feature
feature = np.array([[-500.5],
                    [-100.1],
                    [0],
                    [100.1],
                    [900.9]])

# Create scaler
minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1))

# Scale feature
scaled_feature = minmax_scale.fit_transform(feature)

# Show feature
scaled_feature

array([[ 0.        ],
       [ 0.28571429],
       [ 0.35714286],
       [ 0.42857143],
       [ 1.        ]])

Discussion
Rescaling is a common preprocessing task in machine learning. Many of the algorithms described later in this book will assume all features are on the same scale, typically 0 to 1 or -1 to 1. There are a number of rescaling techniques, but one of the simplest is called min-max scaling. Min-max scaling uses the minimum and maximum values of a feature to rescale values to within a range. Specifically, min-max scaling calculates:

    x'_i = (x_i - min(x)) / (max(x) - min(x))

where x is the feature vector, x_i is an individual element of feature x, and x'_i is the rescaled element. In our example, we can see from the outputted array that the feature has been successfully rescaled to between 0 and 1:

array([[ 0.        ],
       [ 0.28571429],
       [ 0.35714286],
       [ 0.42857143],
       [ 1.        ]])

scikit-learn's MinMaxScaler offers two options to rescale a feature. One option is to use fit to calculate the minimum and maximum values of the feature, then use transform to rescale the feature. The second option is to use fit_transform to do both operations at once. There is no mathematical difference between the two options, but there is sometimes a practical benefit to keeping the operations separate because it allows us to apply the same transformation to different sets of the data.

See Also
• Feature scaling, Wikipedia
• About Feature Scaling and Normalization, Sebastian Raschka

4.2 Standardizing a Feature

Problem
You want to transform a feature to have a mean of 0 and a standard deviation of 1.

Solution
scikit-learn's StandardScaler performs both transformations:

# Load libraries
import numpy as np
from sklearn import preprocessing

# Create feature
x = np.array([[-1000.1],
              [-200.2],
              [500.5],
              [600.6],
              [9000.9]])

# Create scaler
scaler = preprocessing.StandardScaler()

# Transform the feature
standardized = scaler.fit_transform(x)

# Show feature
standardized

array([[-0.76058269],
       [-0.54177196],
       [-0.35009716],
       [-0.32271504],
       [ 1.97516685]])

Discussion
A common alternative to the min-max scaling discussed in Recipe 4.1 is rescaling of features to be approximately standard normally distributed. To achieve this, we use standardization to transform the data such that it has a mean, x̄, of 0 and a standard deviation, σ, of 1. Specifically, each element in the feature is transformed so that:

    x'_i = (x_i - x̄) / σ

where x'_i is our standardized form of x_i. The transformed feature represents the number of standard deviations the original value is away from the feature's mean value (also called a z-score in statistics).

Standardization is a common go-to scaling method for machine learning preprocessing and in my experience is used more than min-max scaling.
However, it depends on the learning algorithm. For example, principal component analysis often works better using standardization, while min-max scaling is often recommended for neural networks (both algorithms are discussed later in this book). As a general rule, I'd recommend defaulting to standardization unless you have a specific reason to use an alternative.

We can see the effect of standardization by looking at the mean and standard deviation of our solution's output:

# Print mean and standard deviation
print("Mean:", round(standardized.mean()))
print("Standard deviation:", standardized.std())

Mean: 0.0
Standard deviation: 1.0

If our data has significant outliers, it can negatively impact our standardization by affecting the feature's mean and variance. In this scenario, it is often helpful to instead rescale the feature using the median and quartile range. In scikit-learn, we do this using the RobustScaler method:

# Create scaler
robust_scaler = preprocessing.RobustScaler()

# Transform feature
robust_scaler.fit_transform(x)

array([[ -1.87387612],
       [ -0.875     ],
       [  0.        ],
       [  0.125     ],
       [ 10.61488511]])

4.3 Normalizing Observations

Problem
You want to rescale the feature values of observations to have unit norm (a total length of 1).

Solution
Use Normalizer with a norm argument:

# Load libraries
import numpy as np
from sklearn.preprocessing import Normalizer

# Create feature matrix
features = np.array([[0.5, 0.5],
                     [1.1, 3.4],
                     [1.5, 20.2],
                     [1.63, 34.4],
                     [10.9, 3.3]])

# Create normalizer
normalizer = Normalizer(norm="l2")

# Transform feature matrix
normalizer.transform(features)

array([[ 0.70710678,  0.70710678],
       [ 0.30782029,  0.95144452],
       [ 0.07405353,  0.99725427],
       [ 0.04733062,  0.99887928],
       [ 0.95709822,  0.28976368]])

Discussion
Many rescaling methods (e.g., min-max scaling and standardization) operate on features; however, we can also rescale across individual observations.
Normalizer rescales the values of individual observations to have unit norm (a length of 1). This type of rescaling is often used when we have many equivalent features (e.g., text classification when every word or n-word group is a feature).

Normalizer provides three norm options, with Euclidean norm (often called L2) being the default argument:

    ‖x‖₂ = √(x₁² + x₂² + ⋯ + xₙ²)

where x is an individual observation and xₙ is that observation's value for the nth feature.

# Transform feature matrix
features_l2_norm = Normalizer(norm="l2").transform(features)

# Show feature matrix
features_l2_norm
array([[ 0.70710678,  0.70710678],
       [ 0.30782029,  0.95144452],
       [ 0.07405353,  0.99725427],
       [ 0.04733062,  0.99887928],
       [ 0.95709822,  0.28976368]])

Alternatively, we can specify Manhattan norm (L1):

    ‖x‖₁ = Σᵢ₌₁ⁿ |xᵢ|

# Transform feature matrix
features_l1_norm = Normalizer(norm="l1").transform(features)

# Show feature matrix
features_l1_norm
array([[ 0.5       ,  0.5       ],
       [ 0.24444444,  0.75555556],
       [ 0.06912442,  0.93087558],
       [ 0.04524008,  0.95475992],
       [ 0.76760563,  0.23239437]])

Intuitively, the L2 norm can be thought of as the distance between two points in New York for a bird (i.e., a straight line), while L1 can be thought of as the distance for a human walking on the street (walk north one block, east one block, north one block, east one block, etc.), which is why it is called the "Manhattan norm" or "Taxicab norm."

Practically, notice that norm='l1' rescales an observation's values so they sum to 1, which can sometimes be a desirable quality:

# Print sum
print("Sum of the first observation's values:",
      features_l1_norm[0, 0] + features_l1_norm[0, 1])
Sum of the first observation's values: 1.0

4.4 Generating Polynomial and Interaction Features

Problem

You want to create polynomial and interaction features.
Solution

Even though some choose to create polynomial and interaction features manually, scikit-learn offers a built-in method:

# Load libraries
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Create feature matrix
features = np.array([[2, 3],
                     [2, 3],
                     [2, 3]])

# Create PolynomialFeatures object
polynomial_interaction = PolynomialFeatures(degree=2, include_bias=False)

# Create polynomial features
polynomial_interaction.fit_transform(features)
array([[ 2.,  3.,  4.,  6.,  9.],
       [ 2.,  3.,  4.,  6.,  9.],
       [ 2.,  3.,  4.,  6.,  9.]])

The degree parameter determines the maximum degree of the polynomial. For example, degree=2 will create new features raised to the second power:

    x₁, x₂, x₁², x₂²

while degree=3 will create new features raised to the second and third power:

    x₁, x₂, x₁², x₂², x₁³, x₂³

Furthermore, by default PolynomialFeatures includes interaction features:

    x₁x₂

We can restrict the features created to only interaction features by setting interaction_only to True:

# Create interaction-only features
interaction = PolynomialFeatures(degree=2,
                                 interaction_only=True,
                                 include_bias=False)
interaction.fit_transform(features)
array([[ 2.,  3.,  6.],
       [ 2.,  3.,  6.],
       [ 2.,  3.,  6.]])

Discussion

Polynomial features are often created when we want to include the notion that there exists a nonlinear relationship between the features and the target. For example, we might suspect that the effect of age on the probability of having a major medical condition is not constant over time but increases as age increases. We can encode that nonconstant effect in a feature, x, by generating that feature's higher-order forms (x², x³, etc.).

Additionally, often we run into situations where the effect of one feature is dependent on another feature.
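The interaction column x₁x₂ is exactly the term that captures such dependence. As a sanity check on the degree-2 output shown in the solution, the columns for one row can be rebuilt by hand (a NumPy sketch; the column order x₁, x₂, x₁², x₁x₂, x₂² matches what PolynomialFeatures emits with include_bias=False):

```python
import numpy as np

x1, x2 = 2.0, 3.0

# Columns produced for one row: x1, x2, x1^2, x1*x2, x2^2
row = np.array([x1, x2, x1**2, x1 * x2, x2**2])

print(row)  # [2. 3. 4. 6. 9.]
```

The value 6 in the middle of the array is the interaction term, the product of the two original features.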
A simple example would be if we were trying to predict whether or not our coffee was sweet and we had two features: 1) whether or not the coffee was stirred and 2) if we added sugar. Individually, each feature does not predict coffee sweetness, but the combination of their effects does. That is, a coffee would only be sweet if the coffee had sugar and was stirred. The effects of each feature on the target (sweetness) are dependent on each other. We can encode that relationship by including an interaction feature that is the product of the individual features.

4.5 Transforming Features

Problem

You want to make a custom transformation to one or more features.

Solution

In scikit-learn, use FunctionTransformer to apply a function to a set of features:

# Load libraries
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Create feature matrix
features = np.array([[2, 3],
                     [2, 3],
                     [2, 3]])

# Define a simple function
def add_ten(x):
    return x + 10

# Create transformer
ten_transformer = FunctionTransformer(add_ten)

# Transform feature matrix
ten_transformer.transform(features)
array([[12, 13],
       [12, 13],
       [12, 13]])

We can create the same transformation in pandas using apply:

# Load library
import pandas as pd

# Create DataFrame
df = pd.DataFrame(features, columns=["feature_1", "feature_2"])

# Apply function
df.apply(add_ten)
   feature_1  feature_2
0         12         13
1         12         13
2         12         13

Discussion

It is common to want to make some custom transformations to one or more features. For example, we might want to create a feature that is the natural log of the values of an existing feature. We can do this by creating a function and then mapping it to features using either scikit-learn's FunctionTransformer or pandas' apply. In the solution we created a very simple function, add_ten, which added 10 to each input, but there is no reason we could not define a much more complex function.
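For instance, the natural-log transform mentioned above can be written the same way as add_ten, simply by wrapping NumPy's log function (a sketch; np.log assumes strictly positive inputs):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Feature matrix of positive values (log is undefined at or below zero)
features = np.array([[1.0, 10.0],
                     [100.0, 1000.0]])

# Wrap NumPy's natural log in a transformer
log_transformer = FunctionTransformer(np.log)

# Transform feature matrix
print(log_transformer.transform(features))
```

Because FunctionTransformer is stateless, it can be dropped into a scikit-learn pipeline alongside fitted transformers like StandardScaler.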
4.6 Detecting Outliers

Problem

You want to identify extreme observations.

Solution

Detecting outliers is unfortunately more of an art than a science. However, a common method is to assume the data is normally distributed and, based on that assumption, "draw" an ellipse around the data, classifying any observation inside the ellipse as an inlier (labeled as 1) and any observation outside the ellipse as an outlier (labeled as -1):

# Load libraries
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

# Create simulated data
features, _ = make_blobs(n_samples = 10,
                         n_features = 2,
                         centers = 1,
                         random_state = 1)

# Replace the first observation's values with extreme values
features[0,0] = 10000
features[0,1] = 10000

# Create detector
outlier_detector = EllipticEnvelope(contamination=.1)

# Fit detector
outlier_detector.fit(features)

# Predict outliers
outlier_detector.predict(features)
array([-1,  1,  1,  1,  1,  1,  1,  1,  1,  1])

A major limitation of this approach is the need to specify a contamination parameter, which is the proportion of observations that are outliers, a value that we don't know. Think of contamination as our estimate of the cleanliness of our data. If we expect our data to have few outliers, we can set contamination to something small. However, if we believe that the data is very likely to have outliers, we can set it to a higher value.
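One way to build intuition for contamination is to refit the detector with different values on the same simulated data and count how many observations are flagged (a sketch assuming the blob data from the solution; with only 10 samples, contamination roughly sets the share of observations labeled -1):

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

# Recreate the simulated data from the solution
features, _ = make_blobs(n_samples=10, n_features=2,
                         centers=1, random_state=1)
features[0, 0] = 10000
features[0, 1] = 10000

# Count flagged observations for two contamination settings
counts = {}
for contamination in (0.1, 0.3):
    detector = EllipticEnvelope(contamination=contamination, random_state=1)
    labels = detector.fit(features).predict(features)
    counts[contamination] = int((labels == -1).sum())
    print(contamination, counts[contamination])
```

A larger contamination value moves the decision boundary inward, so more observations fall outside the ellipse and are flagged.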
Instead of looking at observations as a whole, we can instead look at individual features and identify extreme values in those features using the interquartile range (IQR):

# Create one feature
feature = features[:,0]

# Create a function to return the index of outliers
def indices_of_outliers(x):
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (iqr * 1.5)
    upper_bound = q3 + (iqr * 1.5)
    return np.where((x > upper_bound) | (x < lower_bound))

# Run function
indices_of_outliers(feature)
(array([0]),)

IQR is the difference between the first and third quartile of a set of data. You can think of IQR as the spread of the bulk of the data, with outliers being observations far from the main concentration of data. Outliers are commonly defined as any value 1.5 IQRs less than the first quartile or 1.5 IQRs greater than the third quartile.

Discussion

There is no single best technique for detecting outliers. Instead, we have a collection of techniques, all with their own advantages and disadvantages. Our best strategy is often trying multiple techniques (e.g., both EllipticEnvelope and IQR-based detection) and looking at the results as a whole.

If at all possible, we should take a look at observations we detect as outliers and try to understand them. For example, if we have a dataset of houses and one feature is number of rooms, is an outlier with 100 rooms really a house or is it actually a hotel that has been misclassified?

See Also

• Three ways to detect outliers (and the source of the IQR function used in this recipe)

4.7 Handling Outliers

Problem

You have outliers.

Solution

Typically we have three strategies we can use to handle outliers. First, we can drop them:

# Load library
import pandas as pd

# Create DataFrame
houses = pd.DataFrame()
houses['Price'] = [534433, 392