Main
The HundredPage Machine Learning Book
The HundredPage Machine Learning Book
Andriy Burkov
Is this book for you?
You will enjoy the book if you are:
 a software engineer or a scientist who wants to become a machine learning engineer or a data scientist
 a data scientist trying to stay on the edge of the stateoftheart and deepen their ML expertise
 a manager who wants to feel confident while talking about AI with engineers and product people
 a curious person looking to find out how machine learning works and maybe build something new
You will enjoy the book if you are:
 a software engineer or a scientist who wants to become a machine learning engineer or a data scientist
 a data scientist trying to stay on the edge of the stateoftheart and deepen their ML expertise
 a manager who wants to feel confident while talking about AI with engineers and product people
 a curious person looking to find out how machine learning works and maybe build something new
Categories:
Computers\\Cybernetics: Artificial Intelligence
Year:
2019
Publisher:
Andriy Burkov
Language:
english
Pages:
152
ISBN 13:
9781999579500
File:
PDF, 6.98 MB
Download (pdf, 6.98 MB)
Preview
 Open in Browser
 Checking other formats...
 Please login to your account first

Need help? Please read our short guide how to send a book to Kindle.
The file will be sent to your email address. It may take up to 15 minutes before you receive it.
The file will be sent to your Kindle account. It may takes up to 15 minutes before you received it.
Please note you need to add our NEW email km@bookmail.org to approved email addresses. Read more.
Please note you need to add our NEW email km@bookmail.org to approved email addresses. Read more.
You may be interested in
Most frequently terms
vector^{180}
algorithm^{169}
andriy burkov^{152}
draft^{140}
burkov the hundred^{140}
input^{139}
regression^{128}
neural^{113}
algorithms^{110}
dataset^{97}
gradient^{95}
linear^{86}
classification^{85}
label^{84}
vectors^{82}
erent^{74}
learning algorithm^{73}
labels^{73}
dimensional^{72}
parameters^{67}
feature vector^{65}
output^{64}
neural network^{62}
matrix^{61}
def^{56}
fig^{55}
svm^{53}
labeled^{50}
neural networks^{50}
prediction^{47}
learning algorithms^{46}
optimization^{46}
predict^{45}
gradient descent^{44}
feature vectors^{44}
metric^{43}
probability^{43}
logistic^{42}
cluster^{41}
sequence^{41}
layers^{41}
validation^{40}
hyperparameter^{37}
user^{37}
embedding^{36}
spam^{35}
derivative^{35}
binary^{34}
linear regression^{34}
clustering^{32}
variable^{32}
hyperparameters^{31}
supervised learning^{31}
dimensionality^{31}
logistic regression^{31}
units^{31}
likelihood^{30}
outputs^{29}
score^{28}
kernel^{27}
You can write a book review and share your experiences. Other readers will always be interested in your opinion of the books you've read. Whether you've loved the book or not, if you give your honest and detailed thoughts then people will find new books that are right for them.
1

2

The Hundred Page Machine Learning Book Andriy Burkov “All models are wrong, but some are useful.” — George Box The book is distributed on the “read first, buy later” principle. Andriy Burkov The HundredPage Machine Learning Book  Draft Preface Let’s start by telling the truth: machines don’t learn. What a typical “learning machine” does, is finding a mathematical formula, which, when applied to a collection of inputs (called “training data”), produces the desired outputs. This mathematical formula also generates the correct outputs for most other inputs (distinct from the training data) on the condition that those inputs come from the same or a similar statistical distribution as the one the training data was drawn from. Why isn’t that learning? Because if you slightly distort the inputs, the output is very likely to become completely wrong. It’s not how learning in animals works. If you learned to play a video game by looking straight at the screen, you would still be a good player if someone rotates the screen slightly. A machine learning algorithm, if it was trained by “looking” straight at the screen, unless it was also trained to recognize rotation, will fail to play the game on a rotated screen. So why the name “machine learning” then? The reason, as is often the case, is marketing: Arthur Samuel, an American pioneer in the field of computer gaming and artificial intelligence, coined the term in 1959 while at IBM. Similarly to how in the 2010s IBM tried to market the term “cognitive computing” to stand out from competition, in the 1960s, IBM used the new cool term “machine learning” to attract both clients and talented employees. As you can see, just like artificial intelligence is not intelligence, machine learning is not learning. However, machine learning is a universally recognized term that usually refers to the science and engineering of building machines capable of doing various useful things without being explicitly programmed to do so. So, the word “learning” in the term is used by analogy with the learning in animals rather than literally. Who This Book is For This book contains only those parts of the vast body of material on machine learning developed since the 1960s that have proven to have a significant practical value. A beginner in machine learning will find in this book just enough details to get a comfortable level of understanding of the field and start asking the right questions. Practitioners with experience can use this book as a collection of directions for further selfimprovement. The book also comes in handy when brainstorming at the beginning of a project, when you try to answer the question whether a given technical or business problem is “machinelearnable” and, if yes, which techniques you should try to solve it. How to Use This Book If you are about to start learning machine learning, you should read this book from the beginning to the end. (It’s just a hundred pages, not a big deal.) If you are interested Andriy Burkov The HundredPage Machine Learning Book  Draft 3 in a specific topic covered in the book and want to know more, most sections have a QR code. By scanning one of those QR codes with your phone, you will get a link to a page on the book’s companion wiki theMLbook.com with additional materials: recommended reads, videos, Q&As, code snippets, tutorials, and other bonuses. The book’s wiki is continuously updated with contributions from the book’s author himself as well as volunteers from all over the world. So this book, like a good wine, keeps getting better after you buy it. Scan the QR code below with your phone to get to the book’s wiki: Some sections don’t have a QR code, but they still most likely have a wiki page. You can find it by submitting the section’s title to the wiki’s search engine. Should You Buy This Book? This book is distributed on the “read first, buy later” principle. I firmly believe that paying for the content before consuming it is buying a pig in a poke. You can see and try a car in a dealership before you buy it. You can try on a shirt or a dress in a department store. You have to be able to read a book before paying for it. The read first, buy later principle implies that you can freely download the book, read it and share it with your friends and colleagues. If you liked the book, only then you have to buy it. Now you are all set. Enjoy your reading! Andriy Burkov The HundredPage Machine Learning Book  Draft 4 http://themlbook.com The Hundred Page Machine Learning Book Andriy Burkov “All models are wrong, but some are useful.” — George Box The book is distributed on the “read first, buy later” principle. Andriy Burkov The HundredPage Machine Learning Book  Draft 1 Introduction 1.1 What is Machine Learning Machine learning is a subfield of computer science that is concerned with building algorithms which, to be useful, rely on a collection of examples of some phenomenon. These examples can come from nature, be handcrafted by humans or generated by another algorithm. Machine learning can also be defined as the process of solving a practical problem by 1) gathering a dataset, and 2) algorithmically building a statistical model based on that dataset. That statistical model is assumed to be used somehow to solve the practical problem. To save keystrokes, I use the terms “learning” and “machine learning” interchangeably. 1.2 Types of Learning Learning can be supervised, semisupervised, unsupervised and reinforcement. 1.2.1 Supervised Learning In supervised learning1, the dataset is the collection of labeled examples {(xi, yi)}Ni=1. Each element xi among N is called a feature vector. A feature vector is a vector in which each dimension j = 1, . . . , D contains a value that describes the example somehow. That value is called a feature and is denoted as x(j). For instance, if each example x in our collection represents a person, then the first feature, x(1), could contain height in cm, the second feature, x(2), could contain weight in kg, x(3) could contain gender, and so on. For all examples in the dataset, the feature at position j in the feature vector always contains the same kind of information. It means that if x(2)i contains weight in kg in some example xi, then x(2)k will also contain weight in kg in every example xk, k = 1, . . . , N . The label yi can be either an element belonging to a finite set of classes {1, 2, . . . , C}, or a real number, or a more complex structure, like a vector, a matrix, a tree, or a graph. Unless otherwise stated, in this book yi is either one of a finite set of classes or a real number. You can see a class as a category to which an example belongs. For instance, if your examples are email messages and your problem is spam detection, then you have two classes {spam, not_spam}. The goal of a supervised learning algorithm is to use the dataset to produce a model that takes a feature vector x as input and outputs information that allows deducing the label for this feature vector. For instance, the model created using the dataset of people could take as input a feature vector describing a person and output a probability that the person has cancer. 1 In this book, if a term is in bold, that means that this term can be found in the index at the end of the book. Andriy Burkov The HundredPage Machine Learning Book  Draft 3 1.2.2 Unsupervised Learning In unsupervised learning, the dataset is a collection of unlabeled examples {xi}Ni=1. Again, x is a feature vector, and the goal of an unsupervised learning algorithm is to create a model that takes a feature vector x as input and either transforms it into another vector or into a value that can be used to solve a practical problem. For example, in clustering, the model returns the id of the cluster for each feature vector in the dataset. In dimensionality reduction, the output of the model is a feature vector that has fewer features than the input x; in outlier detection, the output is a real number that indicates how x is di�erent from a “typical” example in the dataset. 1.2.3 SemiSupervised Learning In semisupervised learning, the dataset contains both labeled and unlabeled examples. Usually, the quantity of unlabeled examples is much higher than the number of labeled examples. The goal of a semisupervised learning algorithm is the same as the goal of the supervised learning algorithm. The hope here is that using many unlabeled examples can help the learning algorithm to find (we might say “produce” or “compute”) a better model2. 1.2.4 Reinforcement Learning Reinforcement learning is a subfield of machine learning where the machine “lives” in an environment and is capable of perceiving the state of that environment as a vector of features. The machine can execute actions in every state. Di�erent actions bring di�erent rewards and could also move the machine to another state of the environment. The goal of a reinforcement learning algorithm is to learn a policy. A policy is a function f (similar to the model in supervised learning) that takes the feature vector of a state as input and outputs an optimal action to execute in that state. The action is optimal if it maximizes the expected average reward. Reinforcement learning solves a particular kind of problems where decision making is sequential, and the goal is longterm, such as game playing, robotics, resource management, or logistics. In this book, I put emphasis on oneshot decision making where input examples are independent of one another and the predictions made in the past. I leave reinforcement learning out of the scope of this book. 2 It could look counterintuitive that learning could benefit from adding more unlabeled examples. It seems like we add more uncertainty to the problem. However, when you add unlabeled examples, you add more information about your problem: a larger sample reflects better the probability distribution the data we labeled came from. Theoretically, a learning algorithm should be able to leverage this additional information. Andriy Burkov The HundredPage Machine Learning Book  Draft 4 1.3 How Supervised Learning Works In this section, I briefly explain how supervised learning works so that you have the picture of the whole process before we go into detail. I decided to use supervised learning as an example because it’s the type of machine learning most frequently used in practice. The supervised learning process starts with gathering the data. The data for supervised learning is a collection of pairs (input, output). Input could be anything, for example, email messages, pictures, or sensor measurements. Outputs are usually real numbers, or labels (e.g. “spam”, “not_spam”, “cat”, “dog”, “mouse”, etc). In some cases, outputs are vectors (e.g., four coordinates of the rectangle around a person on the picture), sequences (e.g. [“adjective”, “adjective”, “noun”] for the input “big beautiful car”), or have some other structure. Let’s say the problem that you want to solve using supervised learning is spam detection. You gather the data, for example, 10,000 email messages, each with a label either “spam” or “not_spam” (you could add those labels manually or pay someone to do that for us). Now, you have to convert each email message into a feature vector. The data analyst decides, based on their experience, how to convert a realworld entity, such as an email message, into a feature vector. One common way to convert a text into a feature vector, called bag of words, is to take a dictionary of English words (let’s say it contains 20,000 alphabetically sorted words) and stipulate that in our feature vector: • the first feature is equal to 1 if the email message contains the word “a”; otherwise, this feature is 0; • the second feature is equal to 1 if the email message contains the word “aaron”; otherwise, this feature equals 0; • . . . • the feature at position 20,000 is equal to 1 if the email message contains the word “zulu”; otherwise, this feature is equal to 0. You repeat the above procedure for every email message in our collection, which gives us 10,000 feature vectors (each vector having the dimensionality of 20,000) and a label (“spam”/“not_spam”). Now you have a machinereadable input data, but the output labels are still in the form of humanreadable text. Some learning algorithms require transforming labels into numbers. For example, some algorithms require numbers like 0 (to represent the label “not_spam”) and 1 (to represent the label “spam”). The algorithm I use to illustrate supervised learning is called Support Vector Machine (SVM). This algorithm requires that the positive label (in our case it’s “spam”) has the numeric value of +1 (one), and the negative label (“not_spam”) has the value of ≠1 (minus one). At this point, you have a dataset and a learning algorithm, so you are ready to apply the learning algorithm to the dataset to get the model. SVM sees every feature vector as a point in a highdimensional space (in our case, space Andriy Burkov The HundredPage Machine Learning Book  Draft 5 is 20,000dimensional). The algorithm puts all feature vectors on an imaginary 20,000 dimensional plot and draws an imaginary 20,000dimensional line (a hyperplane) that separates examples with positive labels from examples with negative labels. In machine learning, the boundary separating the examples of di�erent classes is called the decision boundary. The equation of the hyperplane is given by two parameters, a realvalued vector w of the same dimensionality as our input feature vector x, and a real number b like this: wx ≠ b = 0, where the expression wx means w(1)x(1) + w(2)x(2) + . . . + w(D)x(D), and D is the number of dimensions of the feature vector x. (If some equations aren’t clear to you right now, in Chapter 2 we revisit the math and statistical concepts necessary to understand them. For the moment, try to get an intuition of what’s happening here. It all becomes more clear after you read the next chapter.) Now, the predicted label for some input feature vector x is given like this: y = sign(wx ≠ b), where sign is a mathematical operator that takes any value as input and returns +1 if the input is a positive number or ≠1 if the input is a negative number. The goal of the learning algorithm — SVM in this case — is to leverage the dataset and find the optimal values wú and bú for parameters w and b. Once the learning algorithm identifies these optimal values, the model f(x) is then defined as: f(x) = sign(wúx ≠ bú) Therefore, to predict whether an email message is spam or not spam using an SVM model, you have to take a text of the message, convert it into a feature vector, then multiply this vector by wú, subtract bú and take the sign of the result. This will give us the prediction (+1 means “spam”, ≠1 means “not_spam”). Now, how does the machine find wú and bú? It solves an optimization problem. Machines are good at optimizing functions under constraints. So what are the constraints we want to satisfy here? First of all, we want the model to predict the labels of our 10,000 examples correctly. Remember that each example i = 1, . . . , 10000 is given by a pair (xi, yi), where xi is the feature vector of example i and yi is its label that takes values either ≠1 or +1. So the constraints are naturally: • wxi ≠ b Ø 1 if yi = +1, and • wxi ≠ b Æ ≠1 if yi = ≠1 Andriy Burkov The HundredPage Machine Learning Book  Draft 6 x(2) x(1) wx — b = 0 wx — b = 1 wx — b = — 1 b w 2 w Figure 1: An example of an SVM model for twodimensional feature vectors. We would also prefer that the hyperplane separates positive examples from negative ones with the largest margin. The margin is the distance between the closest examples of two classes, as defined by the decision boundary. A large margin contributes to a better generalization, that is how well the model will classify new examples in the future. To achieve that, we need to minimize the Euclidean norm of w denoted by ÎwÎ and given by ÒqD j=1(w(j))2. So, the optimization problem that we want the machine to solve looks like this: Minimize ÎwÎ subject to yi(wxi ≠ b) Ø 1 for i = 1, . . . , N . The expression yi(wxi ≠ b) Ø 1 is just a compact way to write the above two constraints. The solution of this optimization problem, given by wú and bú, is called the statistical model, or, simply, the model. The process of building the model is called training. For twodimensional feature vectors, the problem and the solution can be visualized as shown in fig. 1. The blue and orange circles represent, respectively, positive and negative examples, and the line given by wx ≠ b = 0 is the decision boundary. Why, by minimizing the norm of w, do we find the highest margin between the two classes? Geometrically, the equations wx ≠ b = 1 and wx ≠ b = ≠1 define two parallel hyperplanes, as you see in fig. 1. The distance between these hyperplanes is given by 2ÎwÎ , so the smaller Andriy Burkov The HundredPage Machine Learning Book  Draft 7 the norm ÎwÎ, the larger the distance between these two hyperplanes. That’s how Support Vector Machines work. This particular version of the algorithm builds the socalled linear model. It’s called linear because the decision boundary is a straight line (or a plane, or a hyperplane). SVM can also incorporate kernels that can make the decision boundary arbitrarily nonlinear. In some cases, it could be impossible to perfectly separate the two groups of points because of noise in the data, errors of labeling, or outliers (examples very di�erent from a “typical” example in the dataset). Another version of SVM can also incorporate a penalty hyperparameter for misclassification of training examples of specific classes. We study the SVM algorithm in more detail in Chapter 3. At this point, you should retain the following: any classification learning algorithm that builds a model implicitly or explicitly creates a decision boundary. The decision boundary can be straight, or curved, or it can have a complex form, or it can be a superposition of some geometrical figures. The form of the decision boundary determines the accuracy of the model (that is the ratio of examples whose labels are predicted correctly). The form of the decision boundary, the way it is algorithmically or mathematically computed based on the training data, di�erentiates one learning algorithm from another. In practice, there are two other essential di�erentiators of learning algorithms to consider: speed of model building and prediction processing time. In many practical cases, you would prefer a learning algorithm that builds a less accurate model fast. Additionally, you might prefer a less accurate model that is much quicker at making predictions. 1.4 Why the Model Works on New Data Why is a machinelearned model capable of predicting correctly the labels of new, previously unseen examples? To understand that, look at the plot in fig. 1. If two classes are separable from one another by a decision boundary, then, obviously, examples that belong to each class are located in two di�erent subspaces which the decision boundary creates. If the examples used for training were selected randomly, independently of one another, and following the same procedure, then, statistically, it is more likely that the new negative example will be located on the plot somewhere not too far from other negative examples. The same concerns the new positive example: it will likely come from the surroundings of other positive examples. In such a case, our decision boundary will still, with high probability, separate well new positive and negative examples from one another. For other, less likely situations, our model will make errors, but because such situations are less likely, the number of errors will likely be smaller than the number of correct predictions. Intuitively, the larger is the set of training examples, the more unlikely that the new examples will be dissimilar to (and lie on the plot far from) the examples used for training. To minimize the probability of making errors on new examples, the SVM algorithm, by looking for the largest margin, explicitly tries to draw the decision boundary in such a way that it lies as far as possible from examples of both classes. Andriy Burkov The HundredPage Machine Learning Book  Draft 8 The reader interested in knowing more about the learnability and un derstanding the close relationship between the model error, the size of the training set, the form of the mathematical equation that defines the model, and the time it takes to build the model is encouraged to read about the PAC learning. The PAC (for “probably approximately correct”) learning theory helps to analyze whether and under what conditions a learning algorithm will probably output an approximately correct classifier. Andriy Burkov The HundredPage Machine Learning Book  Draft 9 The Hundred Page Machine Learning Book Andriy Burkov “All models are wrong, but some are useful.” — George Box The book is distributed on the “read first, buy later” principle. Andriy Burkov The HundredPage Machine Learning Book  Draft 2 Notation and Definitions 2.1 Notation Let’s start by revisiting the mathematical notation we all learned at school, but some likely forgot right after the prom. 2.1.1 Scalars, Vectors, and Sets A scalar is a simple numerical value, like 15 or ≠3.25. Variables or constants that take scalar values are denoted by an italic letter, like x or a. Figure 1: Three vectors visualized as directions and as points. A vector is an ordered list of scalar values, called attributes. We denote a vector as a bold character, for example, x or w. Vectors can be visualized as arrows that point to some directions as well as points in a multidimensional space. Illustrations of three twodimensional vectors, a = [2, 3], b = [≠2, 5], and c = [1, 0] is given in fig. 1. We denote an attribute of a vector as an italic value with an index, like this: w(j) or x(j). The index j denotes a specific dimension of the vector, the position of an attribute in the list. For instance, in the vector a shown in red in fig. 1, a(1) = 2 and a(2) = 3. The notation x(j) should not be confused with the power operator, like this x2 (squared) or x3 (cubed). If we want to apply a power operator, say square, to an indexed attribute of a vector, we write like this: (x(j))2. A variable can have two or more indices, like this: x(j)i or like this x (k) i,j . For example, in neural networks, we denote as x(j)l,u the input feature j of unit u in layer l. Andriy Burkov The HundredPage Machine Learning Book  Draft 3 A set is an unordered collection of unique elements. We denote a set as a calligraphic capital character, for example, S. A set of numbers can be finite (include a fixed amount of values). In this case, it is denoted using accolades, for example, {1, 3, 18, 23, 235} or {x1, x2, x3, x4, . . . , xn}. A set can be infinite and include all values in some interval. If a set includes all values between a and b, including a and b, it is denoted using brackets as [a, b]. If the set doesn’t include the values a and b, such a set is denoted using parentheses like this: (a, b). For example, the set [0, 1] includes such values as 0, 0.0001, 0.25, 0.784, 0.9995, and 1.0. A special set denoted R includes all numbers from minus infinity to plus infinity. When an element x belongs to a set S, we write x œ S. We can obtain a new set S3 as an intersection of two sets S1 and S2. In this case, we write S3 Ω S1 fl S2. For example {1, 3, 5, 8} fl {1, 8, 4} gives the new set {1, 8}. We can obtain a new set S3 as a union of two sets S1 and S2. In this case, we write S3 Ω S1 fi S2. For example {1, 3, 5, 8} fi {1, 8, 4} gives the new set {1, 3, 4, 5, 8}. 2.1.2 Capital Sigma Notation The summation over a collection X = {x1, x2, . . . , xn≠1, xn} or over the attributes of a vector x = [x(1), x(2), . . . , x(m≠1), x(m)] is denoted like this: nÿ i=1 xi def= x1 + x2 + . . . + xn≠1 + xn, or else: mÿ j=1 x(j) def= x(1) + x(2) + . . . + x(m≠1) + x(m). The notation def= means “is defined as”. 2.1.3 Capital Pi Notation A notation analogous to capital sigma is the capital pi notation. It denotes a product of elements in a collection or attributes of a vector: nŸ i=1 xi def= x1 · x2 · . . . · xn≠1 · xn, where a · b means a multiplied by b. Where possible, we omit · to simplify the notation, so ab also means a multiplied by b. Andriy Burkov The HundredPage Machine Learning Book  Draft 4 2.1.4 Operations on Sets A derived set creation operator looks like this: S Õ Ω {x2  x œ S, x > 3}. This notation means that we create a new set S Õ by putting into it x squared such that that x is in S, and x is greater than 3. The cardinality operator S returns the number of elements in set S. 2.1.5 Operations on Vectors The sum of two vectors x + z is defined as the vector [x(1) + z(1), x(2) + z(2), . . . , x(m) + z(m)]. The di�erence of two vectors x ≠ z is defined as the vector [x(1) ≠ z(1), x(2) ≠ z(2), . . . , x(m) ≠ z(m)]. A vector multiplied by a scalar is a vector. For example xc def= [cx(1), cx(2), . . . , cx(m)]. A dotproduct of two vectors is a scalar. For example, wx def= qm i=1 w (i)x(i). In some books, the dotproduct is denoted as w · x. The two vectors must be of the same dimensionality. Otherwise, the dotproduct is undefined. The multiplication of a matrix W by a vector x gives another vector as a result. Let our matrix be, W = 5 w(1,1) w(1,2) w(1,3) w(2,1) w(2,2) w(2,3) 6 . When vectors participate in operations on matrices, a vector is by default represented as a matrix with one column. When the vector is on the right of the matrix, it remains a column vector. We can only multiply a matrix by vector if the vector has the same number of rows as the number of columns in the matrix. Let our vector be x def= [x(1), x(2), x(3)]. Then Wx is a twodimensional vector defined as, Wx = 5 w(1,1) w(1,2) w(1,3) w(2,1) w(2,2) w(2,3) 6 S U x(1) x(2) x(3) T V def= 5 w(1,1)x(1) + w(1,2)x(2) + w(1,3)x(3) w(2,1)x(1) + w(2,2)x(2) + w(2,3)x(3) 6 = 5 w (1) x w (2) x 6 If our matrix had, say, five rows, the result of the above product would be a fivedimensional vector. Andriy Burkov The HundredPage Machine Learning Book  Draft 5 When the vector is on the left side of the matrix in the multiplication, then it has to be transposed before we multiply it by the matrix. The transpose of the vector x denoted as x€ makes a row vector out of a column vector. Let’s say, x = 5 x(1) x(2) 6 , then, x € def= Ë x(1), x(2) È . The multiplication of the vector x by the matrix W is given by x€W, x € W = Ë x(1), x(2) È 5w(1,1) w(1,2) w(1,3) w(2,1) w(2,2) w(2,3) 6 def= # w(1,1)x(1) + w(2,1)x(2), w(1,2)x(1) + w(2,2)x(2), w(1,3)x(1) + w(2,3)x(2) $ As you can see, we can only multiply a vector by a matrix if the vector has the same number of dimensions as the number of rows in the matrix. 2.1.6 Functions A function is a relation that associates each element x of a set X , the domain of the function, to a single element y of another set Y , the codomain of the function. A function usually has a name. If the function is called f , this relation is denoted y = f(x) (read f of x), the element x is the argument or input of the function, and y is the value of the function or the output. The symbol that is used for representing the input is the variable of the function (we often say that f is a function of the variable x). We say that f(x) has a local minimum at x = c if f(x) Ø f(c) for every x in some open interval around x = c. An interval is a set of real numbers with the property that any number that lies between two numbers in the set is also included in the set. An open interval does not include its endpoints and is denoted using parentheses. For example, (0, 1) means greater than 0 and less than 1. The minimal value among all the local minima is called the global minimum. See illustration in fig. 2. A vector function, denoted as y = f(x) is a function that returns a vector y. It can have a vector or a scalar argument. Andriy Burkov The HundredPage Machine Learning Book  Draft 6 global minimum local minimum 6 4 2 0 –2 –4 –6 f(x) Figure 2: A local and a global minima of a function. 2.1.7 Max and Arg Max Given a set of values A = {a1, a2, . . . , an}, the operator, max aœA f(a) returns the highest value f(a) for all elements in the set A. On the other hand, the operator, arg max aœA f(a) returns the element of the set A that maximizes f(a). Sometimes, when the set is implicit or infinite, we can write maxa f(a) or arg max a f(a). Operators min and arg min operate in a similar manner. 2.1.8 Assignment Operator The expression a Ω f(x) means that the variable a gets the new value: the result of f(x). We say that the variable a gets assigned a new value. Similarly, a Ω [a1, a2] means that the twodimensional vector a gets the value [a1, a2]. Andriy Burkov The HundredPage Machine Learning Book  Draft 7 2.1.9 Derivative and Gradient A derivative f Õ of a function f is a function or a value that describes how fast f grows (or decreases). If the derivative is a constant value, like 5 or ≠3, then the function grows (or decreases) constantly at any point x of its domain. If the derivative f Õ is a function, then the function f can grow at a di�erent pace in di�erent regions of its domain. If the derivative f Õ is positive at some point x, then the function f grows at this point. If the derivative of f is negative at some x, then the function decreases at this point. The derivative of zero at x means that the function’s slope at x is horizontal. The process of finding a derivative is called di�erentiation. Derivatives for basic functions are known. For example if f(x) = x2, then f Õ(x) = 2x; if f(x) = 2x then f Õ(x) = 2; if f(x) = 2 then f Õ(x) = 0 (the derivative of any function f(x) = c, where c is a constant value, is zero). If the function we want to di�erentiate is not basic, we can find its derivative using the chain rule. For example if F (x) = f(g(x)), where f and g are some functions, then F Õ(x) = f Õ(g(x))gÕ(x). For example if F (x) = (5x + 1)2 then g(x) = 5x + 1 and f(g(x)) = (g(x))2. By applying the chain rule, we find F Õ(x) = 2(5x + 1)gÕ(x) = 2(5x + 1)5 = 50x + 10. Gradient is the generalization of derivative for functions that take several inputs (or one input in the form of a vector or some other complex structure). A gradient of a function is a vector of partial derivatives. You can look at finding a partial derivative of a function as the process of finding the derivative by focusing on one of the function’s inputs and by considering all other inputs as constant values. For example, if our function is defined as f([x(1), x(2)]) = ax(1) + bx(2) + c, then the partial derivative of function f with respect to x(1), denoted as ˆf ˆx(1) , is given by, ˆf ˆx(1) = a + 0 + 0 = a, where a is the derivative of the function ax(1); the two zeroes are respectively derivatives of bx(2) and c, because x(2) is considered constant when we compute the derivative with respect to x(1), and the derivative of any constant is zero. Similarly, the partial derivative of function f with respect to x(2), ˆf ˆx(2) , is given by, ˆf ˆx(2) = 0 + b + 0 = b. The gradient of function f , denoted as Òf is given by the vector [ ˆf ˆx(1) , ˆf ˆx(2) ]. The chain rule works with partial derivatives too, as I illustrate in Chapter 4. Andriy Burkov The HundredPage Machine Learning Book  Draft 8 (a) (b) Figure 3: A probability mass function and a probability density function. 2.2 Random Variable A random variable, usually written as an italic capital letter, like X, is a variable whose possible values are numerical outcomes of a random phenomenon. There are two types of random variables: discrete and continuous. A discrete random variable takes on only a countable number of distinct values such as red, yellow, blue or 1, 2, 3, . . .. The probability distribution of a discrete random variable is described by a list of probabilities associated with each of its possible values. This list of probabilities is called probability mass function (pmf). For example: Pr(X = red) = 0.3, Pr(X = yellow) = 0.45, Pr(X = blue) = 0.25. Each probability in a probability mass function is a value greater than or equal to 0. The sum of probabilities equals 1 (fig. 3a). A continuous random variable takes an infinite number of possible values in some interval. Examples include height, weight, and time. Because the number of values of a continuous random variable X is infinite, the probability Pr(X = c) for any c is 0. Therefore, instead of the list of probabilities, the probability distribution of a continuous random variable (a continuous probability distribution) is described by a probability density function (pdf). The pdf is a function whose codomain is nonnegative and the area under the curve is equal to 1 (fig. 3b). Let a discrete random variable X have k possible values {xi}ki=1. The expectation of X denoted as E[X] is given by, Andriy Burkov The HundredPage Machine Learning Book  Draft 9 E[X] def= kÿ i=1 xi Pr(X = xi) = x1 Pr(X = x1) + x2 Pr(X = x2) + · · · + xk Pr(X = xk), (1) where Pr(X = xi) is the probability that X has the value xi according to the pmf. The expectation of a random variable is also called the mean, average or expected value and is frequently denoted with the letter µ. The expectation is one of the most important statistics of a random variable. Another important statistic is the standard deviation. For a discrete random variable, the standard deviation usually denoted as ‡ is given by: ‡ def= E[(X ≠ µ)2] = Pr(X = x1)(x1 ≠ µ)2 + Pr(X = x2)(x2 ≠ µ)2 + · · · + Pr(X = xk)(xk ≠ µ)2, where µ = E[X]. The expectation of a continuous random variable X is given by, E[X] def= ⁄ R xfX(x) dx, (2) where fX is the pdf of the variable X and s R is the integral of function xfX . Integral is an equivalent of the summation over all values of the function when the function has a continuous domain. It equals the area under the curve of the function. The property of the pdf that the area under its curve is 1 mathematically means that s R fX(x) dx = 1. Most of the time we don’t know fX , but we can observe some values of X. In machine learning, we call these values examples, and the collection of these examples is called a sample or a dataset. 2.3 Unbiased Estimators Because fX is usually unknown, but we have a sample SX = {xi}Ni=1, we often content ourselves not with the true values of statistics of the probability distribution, such as expectation, but with their unbiased estimators. We say that ◊̂(SX) is an unbiased estimator of some statistic ◊ calculated using a sample SX drawn from an unknown probability distribution if ◊̂(SX) has the following property: E Ë ◊̂(SX) È = ◊, Andriy Burkov The HundredPage Machine Learning Book  Draft 10 where ◊̂ is a sample statistic, obtained using a sample SX and not the real statistic ◊ that can be obtained only knowing X; the expectation is taken over all possible samples drawn from X. Intuitively, this means that if you can have an unlimited number of such samples as SX , and you compute some unbiased estimator, such as µ̂, using each sample, then the average of all these µ̂ equals the real statistic µ that you would get computed on X. It can be shown that an unbiased estimator of an unknown E[X] (given by either eq. 1 or eq. 2) is given by 1N qN i=1 xi (called in statistics the sample mean). 2.4 Bayes’ Rule The conditional probability Pr(X = xY = y) is the probability of the random variable X to have a specific value x given that another random variable Y has a specific value of y. The Bayes’ Rule (also known as the Bayes’ Theorem) stipulates that: Pr(X = xY = y) = Pr(Y = yX = x) Pr(X = x)Pr(Y = y) . 2.5 Parameter Estimation Bayes’ Rule comes in handy when we have a model of X’s distribution, and this model f◊ is a function that has some parameters in the form of a vector ◊. An example of such a function could be the Gaussian function that has two parameters, µ and ‡, and is defined as: f◊(x) = 1Ô 2fi‡2 e≠ (x≠µ)2 2‡2 , where ◊ def= [µ, ‡]. This function has all the properties of a pdf. Therefore, we can use it as a model of an unknown distribution of X. We can update the values of parameters in the vector ◊ from the data using the Bayes’ Rule: Pr(◊ = ◊̂X = x) Ω Pr(X = x◊ = ◊̂) Pr(◊ = ◊̂)Pr(X = x) = Pr(X = x◊ = ◊̂) Pr(◊ = ◊̂) q ◊̃ Pr(X = x◊ = ◊̃) . (3) where Pr(X = x◊ = ◊̂) def= f◊̂. If we have a sample S of X and the set of possible values for ◊ is finite, we can easily estimate Pr(◊ = ◊̂) by applying Bayes’ Rule iteratively, one example x œ S at a time. The initial value Pr(◊ = ◊̂) can be guessed such that q ◊̂ Pr(◊ = ◊̂) = 1. This guess of the probabilities for di�erent ◊̂ is called the prior. Andriy Burkov The HundredPage Machine Learning Book  Draft 11 First, we compute Pr(◊ = ◊̂X = x1) for all possible values ◊̂. Then, before updating Pr(◊ = ◊̂X = x) once again, this time for x = x2 œ S using eq. 3, we replace the prior Pr(◊ = ◊̂) in eq. 3 by the new estimate Pr(◊ = ◊̂) Ω 1N q xœS Pr(◊ = ◊̂X = x). The best value of the parameters ◊ú given one example is obtained using the principle of maximumlikelihood: ◊ú = arg max ◊ NŸ i=1 Pr(◊ = ◊̂X = xi). (4) If the set of possible values for ◊ isn’t finite, then we need to optimize eq. 4 directly using a numerical optimization routine, such as gradient descent, which we consider in Chapter 4. Usually, we optimize the natural logarithm of the righthand side expression in eq. 4 because the logarithm of a product becomes the sum of logarithms and it’s easier for the machine to work with the sum than with a product1. 2.6 Classification vs. Regression Classification is a problem of automatically assigning a label to an unlabeled example. Spam detection is a famous example of classification. In machine learning, the classification problem is solved by a classification learning algorithm that takes a collection of labeled examples as inputs and produces a model that can take an unlabeled example as input and either directly output a label or output a number that can be used by the data analyst to deduce the label easily. An example of such a number is a probability. In a classification problem, a label is a member of a finite set of classes. If the size of the set of classes is two (“sick”/“healthy”, “spam”/“not_spam”), we talk about binary classification (also called binomial in some books). Multiclass classification (also called multinomial) is a classification problem with three or more classes2. While some learning algorithms naturally allow for more than two classes, others are by nature binary classification algorithms. There are strategies allowing to turn a binary classification learning algorithm into a multiclass one. I talk about one of them in Chapter 7. Regression is a problem of predicting a realvalued label (often called a target) given an unlabeled example. Estimating house price valuation based on house features, such as area, the number of bedrooms, location and so on is a famous example of regression. 1Multiplication of many numbers can give either a very small result or a very large one. It often results in the problem of numerical overflow when the machine cannot store such extreme numbers in memory. 2There’s still one label per example though. Andriy Burkov The HundredPage Machine Learning Book  Draft 12 The regression problem is solved by a regression learning algorithm that takes a collection of labeled examples as inputs and produces a model that can take an unlabeled example as input and output a target. 2.7 ModelBased vs. InstanceBased Learning Most supervised learning algorithms are modelbased. We have already seen one such algorithm: SVM. Modelbased learning algorithms use the training data to create a model that has parameters learned from the training data. In SVM, the two parameters we saw were wú and bú. After the model was built, the training data can be discarded. Instancebased learning algorithms use the whole dataset as the model. One instancebased algorithm frequently used in practice is kNearest Neighbors (kNN). In classification, to predict a label for an input example the kNN algorithm looks at the close neighborhood of the input example in the space of feature vectors and outputs the label that it saw the most often in this close neighborhood. 2.8 Shallow vs. Deep Learning A shallow learning algorithm learns the parameters of the model directly from the features of the training examples. Most supervised learning algorithms are shallow. The notorious exceptions are neural network learning algorithms, specifically those that build neural networks with more than one layer between input and output. Such neural networks are called deep neural networks. In deep neural network learning (or, simply, deep learning), contrary to shallow learning, most model parameters are learned not directly from the features of the training examples, but from the outputs of the preceding layers. Don’t worry if you don’t understand what that means right now. We look at neural networks more closely in Chapter 6. Andriy Burkov The HundredPage Machine Learning Book  Draft 13 The Hundred Page Machine Learning Book Andriy Burkov “All models are wrong, but some are useful.” — George Box The book is distributed on the “read first, buy later” principle. Andriy Burkov The HundredPage Machine Learning Book  Draft 3 Fundamental Algorithms In this chapter, I describe five algorithms which are not just the most known but also either very e�ective on their own or are used as building blocks for the most e�ective learning algorithms out there. 3.1 Linear Regression Linear regression is a popular regression learning algorithm that learns a model which is a linear combination of features of the input example. 3.1.1 Problem Statement We have a collection of labeled examples {(xi, yi)}Ni=1, where N is the size of the collection, xi is the Ddimensional feature vector of example i = 1, . . . , N , yi is a realvalued1 target and every feature x(j)i , j = 1, . . . , D, is also a real number. We want to build a model fw,b(x) as a linear combination of features of example x: fw,b(x) = wx + b, (1) where w is a Ddimensional vector of parameters and b is a real number. The notation fw,b means that the model f is parametrized by two values: w and b. We will use the model to predict the unknown y for a given x like this: y Ω fw,b(x). Two models parametrized by two di�erent pairs (w, b) will likely produce two di�erent predictions when applied to the same example. We want to find the optimal values (wú, bú). Obviously, the optimal values of parameters define the model that makes the most accurate predictions. You could have noticed that the form of our linear model in eq. 1 is very similar to the form of the SVM model. The only di�erence is the missing sign operator. The two models are indeed similar. However, the hyperplane in the SVM plays the role of the decision boundary: it’s used to separate two groups of examples from one another. As such, it has to be as far from each group as possible. On the other hand, the hyperplane in linear regression is chosen to be as close to all training examples as possible. You can see why this latter requirement is essential by looking at the illustration in fig. 1. It displays the regression line (in lightblue) for onedimensional examples (darkblue dots). We can use this line to predict the value of the target ynew for a new unlabeled input example xnew. If our examples are Ddimensional feature vectors (for D > 1), the only di�erence 1To say that yi is realvalued, we write yi œ R, where R denotes the set of all real numbers, an infinite set of numbers from minus infinity to plus infinity. Andriy Burkov The HundredPage Machine Learning Book  Draft 3 Figure 1: Linear Regression for onedimensional examples. with the onedimensional case is that the regression model is not a line but a plane (for two dimensions) or a hyperplane (for D > 2). Now you see why it’s essential to have the requirement that the regression hyperplane lies as close to the training examples as possible: if the blue line in fig. 1 was far from the blue dots, the prediction ynew would have fewer chances to be correct. 3.1.2 Solution To get this latter requirement satisfied, the optimization procedure which we use to find the optimal values for wú and bú tries to minimize the following expression: 1 N ÿ i=1...N (fw,b(xi) ≠ yi)2. (2) In mathematics, the expression we minimize or maximize is called an objective function, or, simply, an objective. The expression (f(xi) ≠ yi)2 in the above objective is called the loss function. It’s a measure of penalty for misclassification of example i. This particular choice of the loss function is called squared error loss. All modelbased learning algorithms have a loss function and what we do to find the best model is we try to minimize the objective known as the cost function. In linear regression, the cost function is given by the average loss, also called the empirical risk. The average loss, or empirical risk, for a model, is the average of all penalties obtained by applying the model to the training data. Andriy Burkov The HundredPage Machine Learning Book  Draft 4 Why is the loss in linear regression a quadratic function? Why couldn’t we get the absolute value of the di�erence between the true target yi and the predicted value f(xi) and use that as a penalty? We could. Moreover, we also could use a cube instead of a square. Now you probably start realizing how many seemingly arbitrary decisions are made when we design a machine learning algorithm: we decided to use the linear combination of features to predict the target. However, we could use a square or some other polynomial to combine the values of features. We could also use some other loss function that makes sense: the absolute di�erence between f(xi) and yi makes sense, the cube of the di�erence too; the binary loss (1 when f(xi) and yi are di�erent and 0 when they are the same) also makes sense, right? If we made di�erent decisions about the form of the model, the form of the loss function, and about the choice of the algorithm that minimizes the average loss to find the best values of parameters, we would end up inventing a di�erent machine learning algorithm. Sounds easy, doesn’t it? However, do not rush to invent a new learning algorithm. The fact that it’s di�erent doesn’t mean that it will work better in practice. People invent new learning algorithms for one of the two main reasons: 1. The new algorithm solves a specific practical problem better than the existing algorithms. 2. The new algorithm has better theoretical guarantees on the quality of the model it produces. One practical justification of the choice of the linear form for the model is that it’s simple. Why use a complex model when you can use a simple one? Another consideration is that linear models rarely overfit. Overfitting is the property of a model such that the model predicts very well labels of the examples used during training but frequently makes errors when applied to examples that weren’t seen by the learning algorithm during training. An example of overfitting in regression is shown in fig. 2. The data used to build the red regression line is the same as in fig. 1. The di�erence is that this time, this is the polynomial regression with a polynomial of degree 10. The regression line predicts almost perfectly the targets almost all training examples, but will likely make significant errors on new data, as you can see in fig. 1 for xnew. We talk more about overfitting and how to avoid it Chapter 5. Now you know why linear regression can be useful: it doesn’t overfit much. But what about the squared loss? Why did we decide that it should be squared? In 1705, the French mathematician AdrienMarie Legendre, who first published the sum of squares method for gauging the quality of the model stated that squaring the error before summing is convenient. Why did he say that? The absolute value is not convenient, because it doesn’t have a continuous derivative, which makes the function not smooth. Functions that are not smooth create unnecessary di�culties when employing linear algebra to find closed form solutions to optimization problems. Closed form solutions to finding an optimum of a function are simple algebraic expressions and are often preferable to using complex numerical optimization methods, such as gradient descent (used, among others, to train neural networks). Intuitively, squared penalties are also advantageous because they exaggerate the di�erence between the true target and the predicted one according to the value of this di�erence. We Andriy Burkov The HundredPage Machine Learning Book  Draft 5 y x new new Figure 2: Overfitting. might also use the powers 3 or 4, but their derivatives are more complicated to work with. Finally, why do we care about the derivative of the average loss? Remember from algebra that if we can calculate the gradient of the function in eq. 2, we can then set this gradient to zero2 and find the solution to a system of equations that gives us the optimal values wú and b ú. You can spend several minutes and check it yourself. 3.2 Logistic Regression The first thing to say is that logistic regression is not a regression, but a classification learning algorithm. The name comes from statistics and is due to the fact that the mathematical formulation of logistic regression is similar to that of linear regression. I explain logistic regression on the case of binary classification. However, it can naturally be extended to multiclass classification. 3.2.1 Problem Statement In logistic regression, we still want to model yi as a linear function of xi, however, with a binary yi this is not straightforward. The linear combination of features such as wxi + b is a function that spans from minus infinity to plus infinity, while yi has only two possible values. 2To find the minimum or the maximum of a function, we set the gradient to zero because the value of the gradient at extrema of a function is always zero. In 2D, the gradient at an extremum is a horizontal line. Andriy Burkov The HundredPage Machine Learning Book  Draft 6 Figure 3: Standard logistic function. At the time where the absence of computers required scientists to perform manual calculations, they were eager to find a linear classification model. They figured out that if we define a negative label as 0 and the positive label as 1, we would just need to find a simple continuous function whose codomain is (0, 1). In such a case, if the value returned by the model for input x is closer to 0, then we assign a negative label to x; otherwise, the example is labeled as positive. One function that has such a property is the standard logistic function (also known as the sigmoid function): f(x) = 11 + e≠x , where e is the base of the natural logarithm (also called Euler’s number ; ex is also known as the exp(x) function in Excel and many programming languages). Its graph is depicted in fig. 3. By looking at the graph of the standard logistic function, we can see how well it fits our classification purpose: if we optimize the values of x and b appropriately, we could interpret the output of f(x) as the probability of yi being positive. For example, if it’s higher than or equal to the threshold 0.5 we would say that the class of x is positive; otherwise, it’s negative. In practice, the choice of the threshold could be di�erent depending on the problem. We return to this discussion in Chapter 5 when we talk about model performance assessment. So our logistic regression model looks like this: Andriy Burkov The HundredPage Machine Learning Book  Draft 7 fw,b(x) def= 1 1 + e≠(wx+b) . (3) You can see the familiar term wx + b from linear regression. Now, how do we find the best values wú and bú for our model? In linear regression, we minimized the empirical risk which was defined as the average squared error loss, also known as the mean squared error or MSE. 3.2.2 Solution In logistic regression, instead of using a squared loss and trying to minimize the empirical risk, we maximize the likelihood of our training set according to the model. In statistics, the likelihood function defines how likely the observation (an example) is according to our model. For instance, assume that we have a labeled example (xi, yi) in our training data. Assume also that we have found (guessed) some specific values ŵ and b̂ of our parameters. If we now apply our model fŵ,b̂ to xi using eq. 3 we will get some value 0 < p < 1 as output. If yi is the positive class, the likelihood of yi being the positive class, according to our model, is given by p. Similarly, if yi is the negative class, the likelihood of it being the negative class is given by 1 ≠ p. The optimization criterion in logistic regression is called maximum likelihood. Instead of minimizing the average loss, like in linear regression, we now maximize the likelihood of the training data according to our model: Lw,b def= Ÿ i=1...N fw,b(xi)yi(1 ≠ fw,b(xi))(1≠yi). (4) The expression fw,b(x)yi(1 ≠ fw,b(x))(1≠yi) may look scary but it’s just a fancy mathematical way of saying: “fw,b(x) when yi = 1 and (1 ≠ fw,b(x)) otherwise”. Indeed, if yi = 1, then (1 ≠ fw,b(x))(1≠yi) equals 1 because (1 ≠ yi) = 0 and we know that anything power 0 equals 1. On the other hand, if yi = 0, then fw,b(x)yi equals 1 for the same reason. You may have noticed that we used the product operator r in the objective function instead of the sum operator q which was used in linear regression. It’s because the likelihood of observing N labels for N examples is the product of likelihoods of each observation (assuming that all observations are independent of one another, which is the case). You can draw a parallel with the multiplication of probabilities of outcomes in a series of independent experiments in the probability theory. Because of the exp function used in the model, in practice, it’s more convenient to maximize the loglikelihood instead of likelihood. The loglikelihood is defined like follows: LogLw,b def= ln(L(w,b(x)) = Nÿ i=1 yi ln fw,b(x) + (1 ≠ yi) ln (1 ≠ fw,b(x)). Andriy Burkov The HundredPage Machine Learning Book  Draft 8 Because ln is a strictly increasing function, maximizing this function is the same as maximizing its argument, and the solution to this new optimization problem is the same as the solution to the original problem. Contrary to linear regression, there’s no closed form solution to the above optimization problem. A typical numerical optimization procedure used in such cases is gradient descent. I talk about it in the next chapter. 3.3 Decision Tree Learning A decision tree is an acyclic graph that can be used to make decisions. In each branching node of the graph, a specific feature j of the feature vector is examined. If the value of the feature is below a specific threshold, then the left branch is followed; otherwise, the right branch is followed. As the leaf node is reached, the decision is made about the class to which the example belongs. As the title of the section suggests, a decision tree can be learned from data. 3.3.1 Problem Statement Like previously, we have a collection of labeled examples; labels belong to the set {0, 1}. We want to build a decision tree that would allow us to predict the class of an example given a feature vector. 3.3.2 Solution There are various formulations of the decision tree learning algorithm. In this book, we consider just one, called ID3. The optimization criterion, in this case, is the average loglikelihood: 1 N Nÿ i=1 yi ln fID3(xi) + (1 ≠ yi) ln (1 ≠ fID3(xi)), (5) where fID3 is a decision tree. By now, it looks very similar to logistic regression. However, contrary to the logistic regression learning algorithm which builds a parametric model fwú,bú by finding an optimal solution to the optimization criterion, the ID3 algorithm optimizes it approximately by constructing a nonparametric model fID3(x) def= Pr(y = 1x). Andriy Burkov The HundredPage Machine Learning Book  Draft 9 S={(x1, y1), (x2, y2), (x3, y3), (x4, y4), (x5, y5), (x6, y6), (x7, y7), (x8, y8), (x9, y9), (x10, y10), (x11, y11), (x12, y12)} x Pr(y = 1x) = (y1+y2+y3+y4+y5 +y6+y7+y8+y9+y10+y11+y12)/12 Pr(y = 1x) (a) x Pr(y = 1x) = (y1+y2+y4 +y6+y7+y8+y9)/7 Pr(y = 1x) x(3) < 18.3? S = {(x1, y1), (x2, y2), (x4, y4), (x6, y6), (x7, y7), (x8, y8), (x9, y9)} Pr(y = 1x) = (y3+y5+y10+y11+y12)/5 Pr(y = 1x) S+ = {(x3, y3), (x5, y5), (x10, y10), (x11, y11), (x12, y12)} Yes No (b) Figure 4: An illustration of a decision tree building algorithm. The set S contains 12 labeled examples. (a) In the beginning, the decision tree only contains the start node; it makes the same prediction for any input. (b) The decision tree after the first split; it tests whether feature 3 is less than 18.3 and, depending on the result, the prediction is made in one of the two leaf nodes. The ID3 learning algorithm works as follows. Let S denote a set of labeled examples. In the beginning, the decision tree only has a start node that contains all examples: S def= {(xi, yi)}Ni=1. Start with a constant model fSID3: f S ID3 = 1 S ÿ (x,y)œS y. (6) The prediction given by the above model, fSID3(x), would be the same for any input x. The corresponding decision tree is shown in fig 4a. Then we search through all features j = 1, . . . , D and all thresholds t, and split the set S into two subsets: S≠ def= {(x, y)  (x, y) œ S, x(j) < t} and S+ = {(x, y)  (x, y) œ S , x(j) Ø t}. The two new subsets would go to two new leaf nodes, and we evaluate, for all possible pairs (j, t) how good the split with pieces S≠ and S+ is. Finally, we pick the best such values (j, t), split S into S+ and S≠, form two new leaf nodes, and continue recursively on S+ and S≠ (or quit if no split produces a model that’s su�ciently better than the current one). A decision Andriy Burkov The HundredPage Machine Learning Book  Draft 10 tree after one split is illustrated in fig 4b. Now you should wonder what do the words “evaluate how good the split is” mean. In ID3, the goodness of a split is estimated by using the criterion called entropy. Entropy is a measure of uncertainty about a random variable. It reaches its maximum when all values of the random variables are equiprobable. Entropy reaches its minimum when the random variable can have only one value. The entropy of a set of examples S is given by: H(S) = ≠fSID3 ln fSID3 ≠ (1 ≠ fSID3) ln(1 ≠ fSID3). When we split a set of examples by a certain feature j and a threshold t, the entropy of a split, H(S≠, S+), is simply a weighted sum of two entropies: H(S≠, S+) = S≠ S H(S≠) + S+ S H(S+). (7) So, in ID3, at each step, at each leaf node, we find a split that minimizes the entropy given by eq. 7 or we stop at this leaf node. The algorithm stops at a leaf node in any of the below situations: • All examples in the leaf node are classified correctly by the onepiece model (eq. 6). • We cannot find an attribute to split upon. • The split reduces the entropy less than some ‘ (the value for which has to be found experimentally3). • The tree reaches some maximum depth d (also has to be found experimentally). Because in ID3, the decision to split the dataset on each iteration is local (doesn’t depend on future splits), the algorithm doesn’t guarantee an optimal solution. The model can be improved by using techniques like backtracking during the search for the optimal decision tree at the cost of possibly taking longer to build a model. The entropybased split criterion intuitively makes sense: entropy reaches its minimum of 0 when all examples in S have the same label; on the other hand, the entropy is at its maximum of 1 when exactly onehalf of examples in S is labeled with 1, making such a leaf useless for classification. The only remaining question is how this algorithm approximately maximizes the average loglikelihood criterion. I leave it for further reading. 3In Chapter 5, we will see how to do that when we talk about hyperparameter tuning. Andriy Burkov The HundredPage Machine Learning Book  Draft 11 3.4 Support Vector Machine We already considered SVM in the introduction, so this section only fills a couple of blanks. Two critical questions need to be answered: 1. What if there’s noise in the data and no hyperplane can perfectly separate positive examples from negative ones? 2. What if the data cannot be separated using a plane, but could be separated by a higherorder polynomial? Figure 5: Linearly nonseparable cases. Left: the presence of noise. Right: inherent nonlinearity. You can see both situations depicted in fig 5. In the left case, the data could be separated by a straight line if not for the noise (outliers or examples with wrong labels). In the right case, the decision boundary is a circle and not a straight line. Remember that in SVM, we want to satisfy the following constraints: a) wxi ≠ b Ø 1 if yi = +1, and b) wxi ≠ b Æ ≠1 if yi = ≠1 We also want to minimize ÎwÎ so that the hyperplane was equally distant from the closest examples of each class. Minimizing ÎwÎ is equivalent to minimizing 12 w 2, and the use of this term makes it possible to perform quadratic programming optimization later on. The optimization problem for SVM, therefore, looks like this: min 12 w 2 , such that yi(xiw ≠ b) ≠ 1 Ø 0, i = 1, . . . , N. (8) Andriy Burkov The HundredPage Machine Learning Book  Draft 12 3.4.1 Dealing with Noise To extend SVM to cases in which the data is not linearly separable, we introduce the hinge loss function: max (0, 1 ≠ yi(wxi ≠ b)). The hinge loss function is zero if the constraints a) and b) are satisfied, in other words, if wxi lies on the correct side of the decision boundary. For data on the wrong side of the decision boundary, the function’s value is proportional to the distance from the decision boundary. We then wish to minimize the following cost function, CÎwÎ2 + 1 N Nÿ i=1 max (0, 1 ≠ yi(wxi ≠ b)) , where the hyperparameter C determines the tradeo� between increasing the size of the decision boundary and ensuring that each xi lies on the correct side of the decision boundary. The value of C is usually chosen experimentally, just like ID3’s hyperparameters ‘ and d. SVMs that optimize hinge loss are called softmargin SVMs, while the original formulation is referred to as a hardmargin SVM. As you can see, for su�ciently high values of C, the second term in the cost function will become negligible, so the SVM algorithm will try to find the highest margin by completely ignoring misclassification. As we decrease the value of C, making classification errors is becoming more costly, so the SVM algorithm will try to make fewer mistakes by sacrificing the margin size. As we have already discussed, a larger margin is better for generalization. Therefore, C regulates the tradeo� between classifying the training data well (minimizing empirical risk) and classifying future examples well (generalization). 3.4.2 Dealing with Inherent NonLinearity SVM can be adapted to work with datasets that cannot be separated by a hyperplane in its original space. However, if we manage to transform the original space into a space of higher dimensionality, we could hope that the examples will become linearly separable in this transformed space. In SVMs, using a function to implicitly transform the original space into a higher dimensional space during the cost function optimization is called the kernel trick. The e�ect of applying the kernel trick is illustrated in fig. 6. As you can see, it’s possible to transform a twodimensional nonlinearlyseparable data into a linearlyseparable three dimensional data using a specific mapping „ : x ‘æ „(x), where „(x) is a vector of higher dimensionality than x. For the example of 2D data in fig. 5 (right), the mapping „ for example x = [q, p] that projects this example into a 3D space (fig. 6) would look like this „([q, p]) def= (q2, Ô 2qp, p2), where q2 means q squared. You see now that the data becomes linearly separable in the transformed space. Andriy Burkov The HundredPage Machine Learning Book  Draft 13 Figure 6: The data from fig. 5 (right) becomes linearly separable after a transformation into a threedimensional space. However, we don’t know a priori which mapping „ would work for our data. If we first transform all our input examples using some mapping into very high dimensional vectors and then apply SVM to this data, and we try all possible mapping functions, the computation could become very ine�cient, and we would never solve our classification problem. Fortunately, scientists figured out how to use kernel functions (or, simply, kernels) to e�ciently work in higherdimensional spaces without doing this transformation explicitly. To understand how kernels work, we have to see first how the optimization algorithm for SVM finds the optimal values for w and b. The method traditionally used to solve the optimization problem in eq. 8 is the method of Lagrange multipliers. Instead of solving the original problem from eq. 8, it is convenient to solve an equivalent problem formulated like this: max –1...–N Nÿ i=1 –i ≠ 1 2 Nÿ i=1 Nÿ k=1 yi–i(xixk)yk–k subject to Nÿ i=1 –iyi = 0 and –i Ø 0, i = 1, . . . , N, where –i are called Lagrange multipliers. When formulated like this, the optimization problem becomes a convex quadratic optimization problem, e�ciently solvable by quadratic programming algorithms. Andriy Burkov The HundredPage Machine Learning Book  Draft 14 Now, you could have noticed that in the above formulation, there is a term xixk, and this is the only place where the feature vectors are used. If we want to transform our vector space into higher dimensional space, we need to transform xi into „(xi) and xj into „(xj) and then multiply „(xi) and „(xj). It would be very costly to do so. On the other hand, we are only interested in the result of the dotproduct xixk, which, as we know, is a real number. We don’t care how this number was obtained as long as it’s correct. By using the kernel trick, we can get rid of a costly transformation of original feature vectors into higherdimensional vectors and avoid computing their dotproduct. We replace that by a simple operation on the original feature vectors that gives the same result. For example, instead of transforming (q1, p1) into (q21 , Ô 2q1p1, p21) and (q2, p2) into (q22 , Ô 2q2p2, p22) and then computing the dotproduct of (q21 , Ô 2q1p1, p21) and (q22 , Ô 2q2p2, p22) to obtain (q21q22 +2q1q2p1p2 +p21p22) we could find the dotproduct between (q1, p1) and (q2, p2) to get (q1q2 +p1p2) and then square it to get exactly the same result (q21q22 +2q1q2p1p2 +p21p22). That was an example of the kernel trick, and we used the quadratic kernel k(xi, xk) def= (xixk)2. Multiple kernel functions exist, the most widely used of which is the RBF kernel: k(x, xÕ) = exp 3 ≠Îx ≠ x ÕÎ2 2‡2 4 , where Îx ≠ xÕÎ2 is the squared Euclidean distance between two feature vectors. The Euclidean distance is given by the following equation: d(xi, xk) def= Ú1 x (1) i ≠ x (1) k 22 + 1 x (2) i ≠ x (2) k 22 + · · · + 1 x (N) i ≠ x (N) k 22 = ı̂ıÙ Dÿ j=1 1 x (j) i ≠ x (j) k 22 . It can be shown that the feature space of the RBF (for “radial basis function”) kernel has an infinite number of dimensions. By varying the hyperparameter ‡, the data analyst can choose between getting a smooth or curvy decision boundary in the original space. 3.5 kNearest Neighbors kNearest Neighbors (kNN) is a nonparametric learning algorithm. Contrary to other learning algorithms that allow discarding the training data after the model is built, kNN keeps all training examples in memory. Once a new, previously unseen example x comes in, the kNN algorithm finds k training examples closest to x and returns the majority label (in case of classification) or the average label (in case of regression). The closeness of two points is given by a distance function. For example, Euclidean distance seen above is frequently used in practice. Another popular choice of the distance function is the negative cosine similarity. Cosine similarity defined as, Andriy Burkov The HundredPage Machine Learning Book  Draft 15 s(xi, xk) def= cos(\(xi, xk)) = qD j=1 x (j) i x (j) kÚ qD j=1 1 x (j) i 22ÚqD j=1 1 x (j) k 22 , is a measure of similarity of the directions of two vectors. If the angle between two vectors is 0 degrees, then two vectors point to the same direction, and cosine similarity is equal to 1. If the vectors are orthogonal, the cosine similarity is 0. For vectors pointing in opposite directions, the cosine similarity is ≠1. If we want to use cosine similarity as a distance metric, we need to multiply it by ≠1. Other popular distance metrics include Chebychev distance, Mahalanobis distance, and Hamming distance. The choice of the distance metric, as well as the value for k, are the choices the analyst makes before running the algorithm. So these are hyperparameters. The distance metric could also be learned from data (as opposed to guessing it). We talk about that in Chapter 10. Now you know how the model building algorithm works and how the prediction is made. A reasonable question is what is the cost function here? Surprisingly, this question has not been well studied in the literature, despite the algorithm’s popularity since the earlier 1960s. The only attempt to analyze the cost function of kNN I’m aware of was undertaken by Li and Yang in 20034. Below, I outline their considerations. For simplicity, let’s make our derivation under the assumptions of binary classification (y œ {0, 1}) with cosine similarity and normalized feature vectors5. Under these assumptions, kNN does a locally linear classification with the vector of coe�cients, wx = ÿ (xÕ,yÕ)œRk(x) y Õ x Õ , (9) where Rk(x) is the set of k nearest neighbors to the input example x. The above equation says that we take the sum of all nearest neighbor feature vectors to some input vector x by ignoring those that have label 0. The classification decision is obtained by defining a threshold on the dotproduct wxx which, in the case of normalized feature vectors, is equal to the cosine similarity between wx and x. Now, defining the cost function like this: L = ≠ ÿ (xÕ,yÕ)œRk(x) y Õ x Õ wx + 1 2 w 2 and setting the first order derivative of the righthand side to zero yields the formula for the coe�cient vector in eq. 9. 4F. Li and Y. Yang, “A loss function analysis for classification methods in text categorization,” in ICML 2003, pp. 472–479, 2003. 5We discuss normalization later; for the moment assume that all features of feature vectors were squeezed into the range [0, 1]. Andriy Burkov The HundredPage Machine Learning Book  Draft 16 The Hundred Page Machine Learning Book Andriy Burkov “All models are wrong, but some are useful.” — George Box The book is distributed on the “read first, buy later” principle. Andriy Burkov The HundredPage Machine Learning Book  Draft 4 Anatomy of a Learning Algorithm 4.1 Building Blocks of a Learning Algorithm You may have noticed by reading the previous chapter that each learning algorithm we saw consisted of three parts: 1) a loss function; 2) an optimization criterion based on the loss function (a cost function, for example); and 3) an optimization routine that leverages training data to find a solution to the optimization criterion. These are the building blocks of any learning algorithm. You saw in the previous chapter that some algorithms were designed to explicitly optimize a specific criterion (both linear and logistic regressions, SVM). Some others, including decision tree learning and kNN, optimize the criterion implicitly. Decision tree learning and kNN are among the oldest machine learning algorithms and were invented experimentally based on intuition, without a specific global optimization criterion in mind, and (like it often happens in scientific history) the optimization criteria were developed later to explain why those algorithms work. By reading modern literature on machine learning, you often encounter references to gradient descent or stochastic gradient descent. These are two most frequently used optimization algorithms used in cases where the optimization criterion is di�erentiable. Gradient descent is an iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one starts at some random point and takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point. Gradient descent can be used to find optimal parameters for linear and logistic regression, SVM and also neural networks which we consider later. For many models, such as logistic regression or SVM, the optimization criterion is convex. Convex functions have only one minimum, which is global. Optimization criteria for neural networks are not convex, but in practice even finding a local minimum su�ces. Let’s see how gradient descent works. 4.2 Gradient Descent In this section, I demonstrate how gradient descent finds the solution to a linear regression problem1. I illustrate my description with Python source code as well as with plots that show how the solution improves after some iterations of the gradient descent algorithm. 1 As you know, linear regression has a closed form solution. That means that gradient descent is not needed to solve this specific type of problem. However, for illustration purposes, linear regression is a perfect problem to explain gradient descent. Andriy Burkov The HundredPage Machine Learning Book  Draft 3 I use a dataset with only one feature. However, the optimization criterion will have two parameters: w and b. The extension to multidimensional training data is straightforward: you have variables w(1), w(2), and b for twodimensional data, w(1), w(2), w(3), and b for threedimensional data and so on. Figure 1: The original data. The Yaxis corresponds to the sales in units (the quantity we want to predict), the Xaxis corresponds to our feature: the spendings on radio ads in M$. To give a practical example, I use the real dataset with the following columns: the Spendings of various companies on radio advertising each year and their annual Sales in terms of units sold. We want to build a regression model that we can use to predict units sold based on how much a company spends on radio advertising. Each row in the dataset represents one specific company: Company Spendings, M$ Sales, Units 1 37.8 22.1 2 39.3 10.4 3 45.9 9.3 4 41.3 18.5 .. .. .. We have data for 200 companies, so we have 200 training examples. Fig. 1 shows all examples on a 2D plot. Remember that the linear regression model looks like this: f(x) = wx + b. We don’t know Andriy Burkov The HundredPage Machine Learning Book  Draft 4 what the optimal values for w and b are and we want to learn them from data. To do that, we look for such values for w and b that minimize the mean squared error: l = 1 N Nÿ i=1 (yi ≠ (wxi + b))2. Gradient descent starts with calculating the partial derivative for every parameter: ˆl ˆw = 1 N Nÿ i=1 ≠2xi(yi ≠ (wxi + b)); ˆl ˆb = 1 N Nÿ i=1 ≠2(yi ≠ (wxi + b)). (1) To find the partial derivative of the term (yi ≠ (wx + b))2 with respect to w we applied the chain rule. Here, we have the chain f = f2(f1) where f1 = yi ≠ (wx + b) and f2 = f21 . To find a partial derivative of f with respect to w we have to first find the partial derivative of f with respect to f2 which is equal to 2(yi ≠ (wx + b)) (from calculus, we know that the derivative ˆf ˆx x 2 = 2x) and then we have to multiply it by the partial derivative of yi ≠ (wx + b) with respect to w which is equal to ≠x. So overall ˆlˆw = 1 N qN i=1 ≠2xi(yi ≠ (wxi + b)). In a similar way, the partial derivative of l with respect to b, ˆlˆb , was calculated. We initialize2 w0 = 0 and b0 = 0 and then iterate through our training examples, each example having the form of (xi, yi) = (Spendingsi, Salesi). For each training example, we update w and b using our partial derivatives. The learning rate – controls the size of an update: wi Ω – ≠2xi(yi ≠ (wi≠1xi + bi≠1)) N ; bi Ω – ≠2(yi ≠ (wi≠1xi + bi≠1)) N , (2) where wi and bi denote the values of w and b after using the example (xi, yi) for the update. One pass through all training examples is called an epoch. Typically, we need multiple epochs until we start seeing that the values for w and b don’t change much; then we stop. 2 In complex models, such as neural networks, which have thousands of parameters, the initialization of parameters may significantly a�ect the solution found using gradient descent. There are di�erent initialization methods (at random, with all zeroes, with small values around zero, and others) and it is an important choice the data analyst has to make. Andriy Burkov The HundredPage Machine Learning Book  Draft 5 It’s hard to imagine a machine learning engineer who doesn’t use Python. So, if you waited for the right moment to start learning Python, this is that moment. Below we show how to program gradient descent in Python. The function that updates the parameters w and b during one epoch is shown below: def update_w_and_b(spendings, sales, w, b, alpha): dl_dw = 0.0 dl_db = 0.0 N = len(spendings) for i in range(N): dl_dw += 2*spendings[i]*(sales[i]  (w*spendings[i] + b)) dl_db += 2*(sales[i]  (w*spendings[i] + b)) # update w and b w = w  (1/float(N))*dl_dw*alpha b = b  (1/float(N))*dl_db*alpha return w, b The function that loops over multiple epochs is shown below: def train(spendings, sales, w, b, alpha, epochs): for e in range(epochs): w, b = update_w_and_b(spendings, sales, w, b, alpha) # log the progress if e % 400 == 0: print("epoch:", e, "loss: ", avg_loss(spendings, sales, w, b)) return w, b The function avg_loss in the above code snippet is a function that computes the mean squared error. It is defined as: def avg_loss(spendings, sales, w, b): N = len(spendings) total_error = 0.0 for i in range(N): total_error += (sales[i]  (w*spendings[i] + b))**2 return total_error / float(N) If we run the train function for – = 0.001, w = 0.0, b = 0.0, and 15000 epochs, we will see the following output (shown partially): epoch: 0 loss: 92.32078294903626 epoch: 400 loss: 33.79131790081576 Andriy Burkov The HundredPage Machine Learning Book  Draft 6 epoch: 800 loss: 27.9918542960729 epoch: 1200 loss: 24.33481690722147 epoch: 1600 loss: 22.028754937538633 ... epoch: 2800 loss: 19.07940244306619 Epoch 0 Epoch 400 Epoch 800 Epoch 1200 Epoch 1600 Epoch 3000 Figure 2: The evolution of the regression line through gradient descent epochs. You can see that the average loss decreases as the train function loops through epochs. Fig. 2 shows the evolution of the regression line through epochs. Finally, once we have found the optimal values of parameters w and b, the only missing piece is a function that makes predictions: def predict(x, w, b): return w*x + b Try to execute the following code: w, b = train(x, y, 0.0, 0.0, 0.001, 15000) x_new = 23.0 y_new = predict(x_new, w, b) print(y_new) Andriy Burkov The HundredPage Machine Learning Book  Draft 7 The output is 13.97. The gradient descent algorithm is sensitive to the choice of the step –. It is also slow for large datasets. Fortunately, several significant improvements to this algorithm have been proposed. Stochastic gradient descent (SGD) is a version of the algorithm that speeds up the computation by approximating the gradient using smaller batches (subsets) of the training data. SGD itself has various “upgrades”. Adagrad is a version of SGD that scales – for each parameter according to the history of gradients. As a result, – is reduced for very large gradients and viceversa. Momentum is a method that helps accelerate SGD by orienting the gradient descent in the relevant direction and reducing oscillations. In neural network training, variants of SGD such as RMSprop and Adam, are most frequently used. Notice that gradient descent and its variants are not machine learning algorithms. They are solvers of minimization problems in which the function to minimize has a gradient in most points of its domain. 4.3 How Machine Learning Engineers Work Unless you are a research scientist or work for a huge corporation with a large R&D budget, you usually don’t implement machine learning algorithms yourself. You don’t implement gradient descent or some other solver either. You use libraries, most of which are open source. A library is a collection of algorithms and supporting tools implemented with stability and e�ciency in mind. The most frequently used in practice opensource machine learning library is scikitlearn. It’s written in Python and C. Here’s how you do linear regression in scikitlearn: def train(x, y): from sklearn.linear_model import LinearRegression model = LinearRegression().fit(x,y) return model model = train(x,y) x_new = 23.0 y_new = model.predict(x_new) print(y_new) The output will, again, be 13.97. Easy, right? You can replace LinearRegression with some other type of regression learning algorithm without modifying anything else. It just works. The same can be said about classification. You can easily replace LogisticRegression algorithm with SVC algorithm (this is scikitlearn’s name for the Support Vector Machine algorithm), DecisionTreeClassifier, NearestNeighbors or many other classification learning algorithms implemented in scikitlearn. Andriy Burkov The HundredPage Machine Learning Book  Draft 8 4.4 Learning Algorithms’ Particularities Here we outline some practical particularities that can di�erentiate one learning algorithm from another. You already know that di�erent learning algorithms can have di�erent hyperparameters (C in SVM, ‘ and d in ID3). Solvers such as gradient descent can also have hyperparameters, like – for example. Some algorithms, like decision tree learning, can accept categorical features. For example, if you have a feature “color” that can take values “red”, “yellow”, or “green”, you can keep this feature as is. SVM, logistic and linear regression, as well as kNN (with cosine similarity or Euclidean distance metrics), expect numerical values for all features. All algorithms implemented in scikitlearn expect numerical features. I show in the next chapter how to convert categorical features into numerical ones. Some algorithms, like SVM, allow the data analyst to provide weightings for each class. These weightings influence how the decision boundary is drawn. If the weight of some class is high, the learning algorithm tries to not make errors in predicting training examples of this class (typically, for the cost of making an error elsewhere). That could be important if instances of some class are in the minority in your training data, but you would like to avoid misclassifying examples of that class as much as possible. Some classification models, like SVM and kNN, given a feature vector only output the class. Others, like logistic regression or decision trees, can also return the score between 0 and 1 which can be interpreted as either how confident the model is about the prediction or as the probability that the input example belongs to a certain class3. Some classification algorithms (like decision tree learning, logistic regression, or SVM) build the model using the whole dataset at once. If you have got additional labeled examples, you have to rebuild the model from scratch. Other algorithms (such as Naïve Bayes, multilayer percep tron, SGDClassifier/SGDRegressor, PassiveAggressiveClassifier/PassiveAggressiveRegressor in scikitlearn) can be trained iteratively, one batch at a time. Once new training examples are available, you can update the model using only the new data. Finally, some algorithms, like decision tree learning, SVM, and kNN can be used for both clas sification and regression, while others can only solve one type of problem: either classification or regression, but not both. Usually, each library provides the documentation that explains what kind of problem each algorithm solves, what input values are allowed and what kind of output the model produces. The documentation also provides information on hyperparameters. 3 If it’s really necessary, the score for SVM and kNN predictions could be synthetically created using some simple techniques. Andriy Burkov The HundredPage Machine Learning Book  Draft 9 Andriy Burkov's “All models are wrong, but some are useful.” — George Box The book is distributed on the “read first, buy later” principle. Andriy Burkov The HundredPage Machine Learning Book  Draft 5 Basic Practice Until now, I only mentioned in passing some problems a data analyst can encounter when working on a machine learning problem: feature engineering, overfitting, and hyperparameter tuning. In this chapter, we talk about these and other challenges that have to be addressed before you can type model = LogisticRegresion().fit(x,y) in scikitlearn. 5.1 Feature Engineering When a product manager tells you “We need to be able to predict whether a particular customer will stay with us. Here are the logs of customers’ interactions with our product for five years.” you cannot just grab the data, load it into a library and get a prediction. You need to build a dataset first. Remember from the first chapter that the dataset is the collection of labeled examples {(xi, yi)}Ni=1. Each element xi among N is called a feature vector. A feature vector is a vector in which each dimension j = 1, . . . , D contains a value that describes the example somehow. That value is called a feature and is denoted as x(j). The problem of transforming raw data into a dataset is called feature engineering. For most practical problems, feature engineering is a laborintensive process that demands from the data analyst a lot of creativity and, preferably, domain knowledge. For example, to transform the logs of user interaction with a computer system, one could create features that contain information about the user and various statistics extracted from the logs. For each user, one feature would contain the price of the subscription; other features would contain the frequency of connections per day, week and year. Another feature would contain the average session duration in seconds or the average response time for one request, and so on. Everything measurable can be used as a feature. The role of the data analyst is to create informative features: those would allow the learning algorithm to build a model that predicts well labels of the data used for training. Highly informative features are also called features with high predictive power. For example, the average duration of a user’s session has high predictive power for the problem of predicting whether the user will keep using the application in the future. We say that a model has a low bias when it predicts well the training data. That is, the model makes few mistakes when we try to predict labels of the examples used to build the model. 5.1.1 OneHot Encoding Some learning algorithms only work with numerical feature vectors. When some feature in your dataset is categorical, like “colors” or “days of the week,” you can transform such a categorical feature into several binary ones. Andriy Burkov The HundredPage Machine Learning Book  Draft 3 If your example has a categorical feature “colors” and this feature has three possible values: “red,” “yellow,” “green,” you can transform this feature into a vector of three numerical values: red = [1, 0, 0] yellow = [0, 1, 0] green = [0, 0, 1] (1) By doing so, you increase the dimensionality of your feature vectors. You should not transform red into 1, yellow into 2, and green into 3 to avoid increasing the dimensionality because that would imply that there’s an order among the values in this category and this specific order is important for the decision making. If the order of a feature’s values is not important, using ordered numbers as values is likely to confuse the learning algorithm,1 because the algorithm will try to find a regularity where there’s no one, which may potentially lead to overfitting. 5.1.2 Binning An opposite situation, occurring less frequently in practice, is when you have a numerical feature but you want to convert it into a categorical one. Binning (also called bucketing) is the process of converting a continuous feature into multiple binary features called bins or buckets, typically based on value range. For example, instead of representing age as a single realvalued feature, the analyst could chop ranges of age into discrete bins: all ages between 0 and 5 yearsold could be put into one bin, 6 to 10 yearsold could be in the second bin, 11 to 15 yearsold could be in the third bin, and so on. For example, suppose in our feature j = 18 represents age. By applying binning, we replace this feature with the corresponding bins. Let the three new bins, “age_bin1”, “age_bin2” and “age_bin3” be added with indexes j = 123, j = 124 and j = 125 respectively. Now if x(18)i = 7 for some example xi, then we set feature x (124) i to 1; if x (18) i = 13, then we set feature x(125)i to 1, and so on. In some cases, a carefully designed binning can help the learning algorithm to learn using fewer examples. It happens because we give a “hint” to the learning algorithm that if the value of a feature falls within a specific range, the exact value of the feature doesn’t matter. 1 When the ordering of values of some categorical variable matters, we can replace those values by numbers by keeping only one variable. For example, if our variable represents the quality of an article, and the values are {poor, decent, good, excellent}, then we could replace those categories by numbers, for example, {1, 2, 3, 4}. Andriy Burkov The HundredPage Machine Learning Book  Draft 4 5.1.3 Normalization Normalization is the process of converting an actual range of values which a numerical feature can take, into a standard range of values, typically in the interval [≠1, 1] or [0, 1]. For example, suppose the natural range of a particular feature is 350 to 1450. By subtracting 350 from every value of the feature, and dividing the result by 1100, one can normalize those values into the range [0, 1]. More generally, the normalization formula looks like this: x̄(j) = x (j) ≠ min(j) max(j) ≠ min(j) , where min(j) and max(j) are, respectively, the minimum and the maximum value of the feature j in the dataset. Why do we normalize? Normalizing the data is not a strict requirement. However, in practice, it can lead to an increased speed of learning. Remember the gradient descent example from the previous chapter. Imagine you have a twodimensional feature vector. When you update the parameters of w(1) and w(2), you use partial derivatives of the average squared error with respect to w(1) and w(2). If x(1) is in the range [0, 1000] and x(2) the range [0, 0.0001], then the derivative with respect to a larger feature will dominate the update. Additionally, it’s useful to ensure that our inputs are roughly in the same relatively small range to avoid problems which computers have when working with very small or very big numbers (known as numerical overflow). 5.1.4 Standardization Standardization (or zscore normalization) is the procedure during which the feature values are rescaled so that they have the properties of a standard normal distribution with µ = 0 and ‡ = 1, where µ is the mean (the average value of the feature, averaged over all examples in the dataset) and ‡ is the standard deviation from the mean. Standard scores (or zscores) of features are calculated as follows: x̂(j) = x (j) ≠ µ(j) ‡(j) . You may ask when you should use normalization and when standardization. There’s no definitive answer to this question. Usually, if your dataset is not too big and you have time, you can try both and see which one performs better for your task. If you don’t have time to run multiple experiments, as a rule of thumb: Andriy Burkov The HundredPage Machine Learning Book  Draft 5 • unsupervised learning algorithms, in practice, more often benefit from standardization than from normalization; • standardization is also preferred for a feature if the values this feature takes are distributed close to a normal distribution (socalled bell curve); • again, standardization is preferred for a feature if it can sometimes have extremely high or low values (outliers); this is because normalization will “squeeze” the normal values into a very small range; • in all other cases, normalization is preferable. Modern implementations of the learning algorithms, which you can find in popular libraries, are robust to features lying in di�erent ranges. Feature rescaling is usually beneficial to most learning algorithms, but in many cases, the model will still be good when trained from the original features. 5.1.5 Dealing with Missing Features In some cases, the data comes to the analyst in the form of a dataset with features already defined. In some examples, values of some features can be missing. That often happens when the dataset was handcrafted, and the person working on it forgot to fill some values or didn’t get them measured at all. The typical approaches of dealing with missing values for a feature include: • Removing the examples with missing features from the dataset. That can be done if your dataset is big enough so you can sacrifice some training examples. • Using a learning algorithm that can deal with missing feature values (depends on the library and a specific implementation of the algorithm). • Using a data imputation technique. 5.1.6 Data Imputation Techniques One technique consists in replacing the missing value of a feature by an average value of this feature in the dataset: x̂(j) = 1 N x(j). Another technique is to replace the missing value by the same value outside the normal range of values. For example, if the normal range is [0, 1], then you can set the missing value equal to 2 or ≠1. The idea is that the learning algorithm will learn what is it better to do when the feature has a value significantly di�erent from other values. Alternatively, you can replace the missing value by a value in the middle of the range. For example, if the range for a feature is [≠1, 1], you can set the missing value to be equal to 0. Here, the idea is that if we use the value in the middle of the range to replace missing features, such value will not significantly a�ect the prediction. Andriy Burkov The HundredPage Machine Learning Book  Draft 6 A more advanced technique is to use the missing value as the target variable for a regression problem. You can use all remaining features [x(1)i , x (2) i , . . . , x (j≠1) i , x (j+1) i , . . . , x (D) i ] to form a feature vector x̂i, set ŷi = x(j), where j is the feature with a missing value. Now we can build a regression model to predict ŷ from the feature vectors x̂. Of course, to build training examples (x̂, ŷ), you only use those examples from the original dataset, in which the value of feature j is present. Finally, if you have a significantly large dataset and just a few features with missing values, you can increase the dimensionality of your feature vectors by adding a binary indicator feature for each feature with missing values. Let’s say feature j = 12 in your Ddimensional dataset has missing values. For each feature vector x, you then add the feature j = D + 1 which is equal to 1 if the value of feature 12 is present in x and 0 otherwise. The missing feature value then can be replaced by 0 or any number of your choice. At prediction time, if your example is not complete, you should use the same data imputation technique to fill the missing features as the technique you used to complete the training data. Before you start working on the learning problem, you cannot tell which data imputation technique will work the best. Try several techniques, build several models and select the one that works the best. 5.2 Learning Algorithm Selection Choosing a machine learning algorithm can be a di�cult task. If you have much time, you can try all of them. However, usually the time you have to solve a problem is limited. You can ask yourself several questions before starting to work on the problem. Depending on your answers, you can shortlist some algorithms and try them on your data. • Explainability Does your model have to be explainable to a nontechnical audience? Most very accurate learning algorithms are socalled “black boxes.” They learn models that make very few errors, but why a model made a specific prediction could be very hard to understand and even harder to explain. Examples of such models are neural networks or ensemble models. On the other hand, kNN, linear regression, or decision tree learning algorithms produce models that are not always the most accurate, however, the way they make their prediction is very straightforward. • Inmemory vs. outofmemory Can your dataset be fully loaded into the RAM of your server or personal computer? If yes, then you can choose from a wide variety of algorithms. Otherwise, you would prefer incremental learning algorithms that can improve the model by adding more data gradually. • Number of features and examples Andriy Burkov The HundredPage Machine Learning Book  Draft 7 How many training examples do you have in your dataset? How many features does each example have? Some algorithms, including neural networks and gradient boosting (we consider both later), can handle a huge number of examples and millions of features. Others, like SVM, can be very modest in their capacity. • Categorical vs. numerical features Is your data composed of categorical only, or numerical only features, or a mix of both? Depending on your answer, some algorithms cannot handle your dataset directly, and you would need to convert your categorical features into numerical ones by using some techniques like onehot encoding. • Nonlinearity of the data Is your data linearly separable or can it be modeled using a linear model? If yes, SVM with the linear kernel, logistic regression or linear regression can be a good choice. Otherwise, deep neural networks or ensemble algorithms, discussed in Chapters 6 and 7, might work better for your data. • Training speed How much time is a learning algorithm allowed to use to build a model? Neural networks are known to be slow to train. Simple algorithms like logistic and linear regression as well as decision tree learning are much faster. Some specialized libraries contain very e�cient implementations of some algorithms; you may prefer to do research online to find such libraries. Some algorithms, such as random forests, benefit from the availability of multiple CPU cores, so their model building time can be significantly reduced on a machine with dozens of CPU cores. • Prediction speed How fast does the model have to be when generating predictions? Will your model be used in production where very high throughput is required? Some algorithms, like SVMs, linear and logistic regression, or some types of neural networks, are extremely fast at the prediction time. Some others, like kNN, ensemble algorithms, and very deep or recurrent neural networks, can be slower2. If you don’t want to guess the best algorithm for your data, a popular way to choose one is by testing it on the validation set. We talk about that below. Alternatively, if you use scikitlearn, you could try their algorithm selection diagram shown in fig. 1. 2 The prediction speeds of kNN and ensemble methods implemented in the modern libraries are still very fast. Don’t be afraid of using these algorithms in your practice. Andriy Burkov The HundredPage Machine Learning Book  Draft 8 Fi gu re 1: M ac hi ne le ar ni ng al go rit hm se le ct io n di ag ra m fo r sc ik it le ar n. Andriy Burkov The HundredPage Machine Learning Book  Draft 9 5.3 Three Sets Until now, I used the expressions “dataset” and “training set” interchangeably. However, in practice data analysts work with three sets of labeled examples: 1) training set, 2) validation set, and 3) test set. Once you have got your annotated dataset, the first thing you do is you shu�e the examples and split the dataset into three subsets: training, validation, and test. The training set is usually the biggest one, and you use it to build the model. The validation and test sets are roughly the same sizes, much smaller than the size of the training set. The learning algorithm cannot use examples from these two subsets to build a model. That is why those two sets are often called holdout sets. There’s no optimal proportion to split the dataset into these three subsets. In the past, the rule of thumb was to use 70% of the dataset for training, 15% for validation and 15% for testing. However, in the age of big data, datasets often have millions of examples. In such cases, it could be reasonable to keep 95% for training and 2.5%/2.5% for validation/testing. You may wonder, what is the reason to have three sets and not one. Th