Pattern Recognition and Machine Learning
Christopher M. BishopThe dramatic growth in practical applications for machine learning over the last ten years has been accompanied by many important developments in the underlying algorithms and techniques. For example, Bayesian methods have grown from a specialist niche to become mainstream, while graphical models have emerged as a general framework for describing and applying probabilistic techniques. The practical applicability of Bayesian methods has been greatly enhanced by the development of a range of approximate inference algorithms such as variational Bayes and expectation propagation, while new models based on kernels have had a significant impact on both algorithms and applications.
This completely new textbook reflects these recent developments while providing a comprehensive introduction to the fields of pattern recognition and machine learning. It is aimed at advanced undergraduates or firstyear PhD students, as well as researchers and practitioners. No previous knowledge of pattern recognition or machine learning concepts is assumed. Familiarity with multivariate calculus and basic linear algebra is required, and some experience in the use of probabilities would be helpful though not essential as the book includes a selfcontained introduction to basic probability theory.
The book is suitable for courses on machine learning, statistics, computer science, signal processing, computer vision, data mining, and bioinformatics. Extensive support is provided for course instructors, including more than 400 exercises, graded according to difficulty. Example solutions for a subset of the exercises are available from the book web site, while solutions for the remainder can be obtained by instructors from the publisher. The book is supported by a great deal of additional material, and the reader is encouraged to visit the book web site for the latest information.
Coming soon:
*For students, worked solutions to a subset of exercises available on a public web site (for exercises marked "www" in the text)
*For instructors, worked solutions to remaining exercises from the Springer web site
*Lecture slides to accompany each chapter
*Data sets available for download
 Open in Browser
 Checking other formats...
 Convert to EPUB
 Convert to FB2
 Convert to MOBI
 Convert to TXT
 Convert to RTF
 Converted file can differ from the original. If possible, download the file in its original format.
 Please login to your account first

Need help? Please read our short guide how to send a book to Kindle.
Please note you need to add our email km@bookmail.org to approved email addresses. Read more.
You may be interested in Powered by Rec2Me
Most frequently terms

Information Science and Statistics Series Editors: M. Jordan J. Kleinberg B. Schölkopf Information Science and Statistics Akaike and Kitagawa: The Practice of Time Series Analysis. Bishop: Pattern Recognition and Machine Learning. Cowell, Dawid, Lauritzen, and Spiegelhalter: Probabilistic Networks and Expert Systems. Doucet, de Freitas, and Gordon: Sequential Monte Carlo Methods in Practice. Fine: Feedforward Neural Network Methodology. Hawkins and Olwell: Cumulative Sum Charts and Charting for Quality Improvement. Jensen: Bayesian Networks and Decision Graphs. Marchette: Computer Intrusion Detection and Network Monitoring: A Statistical Viewpoint. Rubinstein and Kroese: The CrossEntropy Method: A Unified Approach to Combinatorial Optimization, Monte Carlo Simulation, and Machine Learning. Studený: Probabilistic Conditional Independence Structures. Vapnik: The Nature of Statistical Learning Theory, Second Edition. Wallace: Statistical and Inductive Inference by Minimum Massage Length. Christopher M. Bishop Pattern Recognition and Machine Learning Christopher M. Bishop F.R.Eng. Assistant Director Microsoft Research Ltd Cambridge CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/⬃cmbishop Series Editors Michael Jordan Department of Computer Science and Department of Statistics University of California, Berkeley Berkeley, CA 94720 USA Professor Jon Kleinberg Department of Computer Science Cornell University Ithaca, NY 14853 USA Bernhard Schölkopf Max Planck Institute for Biological Cybernetics Spemannstrasse 38 72076 Tübingen Germany Library of Congress Control Number: 2006922522 ISBN10: 0387310738 ISBN13: 9780387310732 Printed on acidfree paper. © 2006 Springer Science+Business Media, LLC All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly anal; ysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. Printed in Singapore. 9 8 7 6 5 4 3 2 1 springer.com (KYO) This book is dedicated to my family: Jenna, Mark, and Hugh Total eclipse of the sun, Antalya, Turkey, 29 March 2006. Preface Pattern recognition has its origins in engineering, whereas machine learning grew out of computer science. However, these activities can be viewed as two facets of the same ﬁeld, and together they have undergone substantial development over the past ten years. In particular, Bayesian methods have grown from a specialist niche to become mainstream, while graphical models have emerged as a general framework for describing and applying probabilistic models. Also, the practical applicability of Bayesian methods has been greatly enhanced through the development of a range of approximate inference algorithms such as variational Bayes and expectation propagation. Similarly, new models based on kernels have had signiﬁcant impact on both algorithms and applications. This new textbook reﬂects these recent developments while providing a comprehensive introduction to the ﬁelds of pattern recognition and machine learning. It is aimed at advanced undergraduates or ﬁrst year PhD students, as well as researchers and practitioners, and assumes no previous knowledge of pattern recognition or machine learning concepts. Knowledge of multivariate calculus and basic linear algebra is required, and some familiarity with probabilities would be helpful though not essential as the book includes a selfcontained introduction to basic probability theory. Because this book has broad scope, it is impossible to provide a complete list of references, and in particular no attempt has been made to provide accurate historical attribution of ideas. Instead, the aim has been to give references that offer greater detail than is possible here and that hopefully provide entry points into what, in some cases, is a very extensive literature. For this reason, the references are often to more recent textbooks and review articles rather than to original sources. The book is supported by a great deal of additional material, including lecture slides as well as the complete set of ﬁgures used in the book, and the reader is encouraged to visit the book web site for the latest information: http://research.microsoft.com/∼cmbishop/PRML vii viii PREFACE Exercises The exercises that appear at the end of every chapter form an important component of the book. Each exercise has been carefully chosen to reinforce concepts explained in the text or to develop and generalize them in signiﬁcant ways, and each is graded according to difﬁculty ranging from (), which denotes a simple exercise taking a few minutes to complete, through to ( ), which denotes a signiﬁcantly more complex exercise. It has been difﬁcult to know to what extent these solutions should be made widely available. Those engaged in self study will ﬁnd worked solutions very beneﬁcial, whereas many course tutors request that solutions be available only via the publisher so that the exercises may be used in class. In order to try to meet these conﬂicting requirements, those exercises that help amplify key points in the text, or that ﬁll in important details, have solutions that are available as a PDF ﬁle from the book web site. Such exercises are denoted by www . Solutions for the remaining exercises are available to course tutors by contacting the publisher (contact details are given on the book web site). Readers are strongly encouraged to work through the exercises unaided, and to turn to the solutions only as required. Although this book focuses on concepts and principles, in a taught course the students should ideally have the opportunity to experiment with some of the key algorithms using appropriate data sets. A companion volume (Bishop and Nabney, 2008) will deal with practical aspects of pattern recognition and machine learning, and will be accompanied by Matlab software implementing most of the algorithms discussed in this book. Acknowledgements First of all I would like to express my sincere thanks to Markus Svensén who has provided immense help with preparation of ﬁgures and with the typesetting of the book in LATEX. His assistance has been invaluable. I am very grateful to Microsoft Research for providing a highly stimulating research environment and for giving me the freedom to write this book (the views and opinions expressed in this book, however, are my own and are therefore not necessarily the same as those of Microsoft or its afﬁliates). Springer has provided excellent support throughout the ﬁnal stages of preparation of this book, and I would like to thank my commissioning editor John Kimmel for his support and professionalism, as well as Joseph Piliero for his help in designing the cover and the text format and MaryAnn Brickner for her numerous contributions during the production phase. The inspiration for the cover design came from a discussion with Antonio Criminisi. I also wish to thank Oxford University Press for permission to reproduce excerpts from an earlier textbook, Neural Networks for Pattern Recognition (Bishop, 1995a). The images of the Mark 1 perceptron and of Frank Rosenblatt are reproduced with the permission of Arvin Calspan Advanced Technology Center. I would also like to thank Asela Gunawardana for plotting the spectrogram in Figure 13.1, and Bernhard Schölkopf for permission to use his kernel PCA code to plot Figure 12.17. PREFACE ix Many people have helped by proofreading draft material and providing comments and suggestions, including Shivani Agarwal, Cédric Archambeau, Arik Azran, Andrew Blake, Hakan Cevikalp, Michael Fourman, Brendan Frey, Zoubin Ghahramani, Thore Graepel, Katherine Heller, Ralf Herbrich, Geoffrey Hinton, Adam Johansen, Matthew Johnson, Michael Jordan, Eva Kalyvianaki, Anitha Kannan, Julia Lasserre, David Liu, Tom Minka, Ian Nabney, Tonatiuh Pena, Yuan Qi, Sam Roweis, Balaji Sanjiya, Toby Sharp, Ana Costa e Silva, David Spiegelhalter, Jay Stokes, Tara Symeonides, Martin Szummer, Marshall Tappen, Ilkay Ulusoy, Chris Williams, John Winn, and Andrew Zisserman. Finally, I would like to thank my wife Jenna who has been hugely supportive throughout the several years it has taken to write this book. Chris Bishop Cambridge February 2006 Mathematical notation I have tried to keep the mathematical content of the book to the minimum necessary to achieve a proper understanding of the ﬁeld. However, this minimum level is nonzero, and it should be emphasized that a good grasp of calculus, linear algebra, and probability theory is essential for a clear understanding of modern pattern recognition and machine learning techniques. Nevertheless, the emphasis in this book is on conveying the underlying concepts rather than on mathematical rigour. I have tried to use a consistent notation throughout the book, although at times this means departing from some of the conventions used in the corresponding research literature. Vectors are denoted by lower case bold Roman letters such as x, and all vectors are assumed to be column vectors. A superscript T denotes the transpose of a matrix or vector, so that xT will be a row vector. Uppercase bold roman letters, such as M, denote matrices. The notation (w1 , . . . , wM ) denotes a row vector with M elements, while the corresponding column vector is written as w = (w1 , . . . , wM )T . The notation [a, b] is used to denote the closed interval from a to b, that is the interval including the values a and b themselves, while (a, b) denotes the corresponding open interval, that is the interval excluding a and b. Similarly, [a, b) denotes an interval that includes a but excludes b. For the most part, however, there will be little need to dwell on such reﬁnements as whether the end points of an interval are included or not. The M × M identity matrix (also known as the unit matrix) is denoted IM , which will be abbreviated to I where there is no ambiguity about it dimensionality. It has elements Iij that equal 1 if i = j and 0 if i = j. A functional is denoted f [y] where y(x) is some function. The concept of a functional is discussed in Appendix D. The notation g(x) = O(f (x)) denotes that f (x)/g(x) is bounded as x → ∞. For instance if g(x) = 3x2 + 2, then g(x) = O(x2 ). The expectation of a function f (x, y) with respect to a random variable x is denoted by Ex [f (x, y)]. In situations where there is no ambiguity as to which variable is being averaged over, this will be simpliﬁed by omitting the sufﬁx, for instance xi xii MATHEMATICAL NOTATION E[x]. If the distribution of x is conditioned on another variable z, then the corresponding conditional expectation will be written Ex [f (x)z]. Similarly, the variance is denoted var[f (x)], and for vector variables the covariance is written cov[x, y]. We shall also use cov[x] as a shorthand notation for cov[x, x]. The concepts of expectations and covariances are introduced in Section 1.2.2. If we have N values x1 , . . . , xN of a Ddimensional vector x = (x1 , . . . , xD )T , we can combine the observations into a data matrix X in which the nth row of X corresponds to the row vector xT n . Thus the n, i element of X corresponds to the ith element of the nth observation xn . For the case of onedimensional variables we shall denote such a matrix by x, which is a column vector whose nth element is xn . Note that x (which has dimensionality N ) uses a different typeface to distinguish it from x (which has dimensionality D). Contents Preface vii Mathematical notation xi Contents 1 Introduction 1.1 Example: Polynomial Curve Fitting . . . . . . . 1.2 Probability Theory . . . . . . . . . . . . . . . . 1.2.1 Probability densities . . . . . . . . . . . 1.2.2 Expectations and covariances . . . . . . 1.2.3 Bayesian probabilities . . . . . . . . . . 1.2.4 The Gaussian distribution . . . . . . . . 1.2.5 Curve ﬁtting revisited . . . . . . . . . . 1.2.6 Bayesian curve ﬁtting . . . . . . . . . . 1.3 Model Selection . . . . . . . . . . . . . . . . . 1.4 The Curse of Dimensionality . . . . . . . . . . . 1.5 Decision Theory . . . . . . . . . . . . . . . . . 1.5.1 Minimizing the misclassiﬁcation rate . . 1.5.2 Minimizing the expected loss . . . . . . 1.5.3 The reject option . . . . . . . . . . . . . 1.5.4 Inference and decision . . . . . . . . . . 1.5.5 Loss functions for regression . . . . . . . 1.6 Information Theory . . . . . . . . . . . . . . . . 1.6.1 Relative entropy and mutual information Exercises . . . . . . . . . . . . . . . . . . . . . . . . xiii . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 4 12 17 19 21 24 28 30 32 33 38 39 41 42 42 46 48 55 58 xiii xiv CONTENTS 2 3 Probability Distributions 2.1 Binary Variables . . . . . . . . . . . . . . . . . . . 2.1.1 The beta distribution . . . . . . . . . . . . . 2.2 Multinomial Variables . . . . . . . . . . . . . . . . 2.2.1 The Dirichlet distribution . . . . . . . . . . . 2.3 The Gaussian Distribution . . . . . . . . . . . . . . 2.3.1 Conditional Gaussian distributions . . . . . . 2.3.2 Marginal Gaussian distributions . . . . . . . 2.3.3 Bayes’ theorem for Gaussian variables . . . . 2.3.4 Maximum likelihood for the Gaussian . . . . 2.3.5 Sequential estimation . . . . . . . . . . . . . 2.3.6 Bayesian inference for the Gaussian . . . . . 2.3.7 Student’s tdistribution . . . . . . . . . . . . 2.3.8 Periodic variables . . . . . . . . . . . . . . . 2.3.9 Mixtures of Gaussians . . . . . . . . . . . . 2.4 The Exponential Family . . . . . . . . . . . . . . . 2.4.1 Maximum likelihood and sufﬁcient statistics 2.4.2 Conjugate priors . . . . . . . . . . . . . . . 2.4.3 Noninformative priors . . . . . . . . . . . . 2.5 Nonparametric Methods . . . . . . . . . . . . . . . 2.5.1 Kernel density estimators . . . . . . . . . . . 2.5.2 Nearestneighbour methods . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 68 71 74 76 78 85 88 90 93 94 97 102 105 110 113 116 117 117 120 122 124 127 Linear Models for Regression 3.1 Linear Basis Function Models . . . . . . . . . 3.1.1 Maximum likelihood and least squares . 3.1.2 Geometry of least squares . . . . . . . 3.1.3 Sequential learning . . . . . . . . . . . 3.1.4 Regularized least squares . . . . . . . . 3.1.5 Multiple outputs . . . . . . . . . . . . 3.2 The BiasVariance Decomposition . . . . . . . 3.3 Bayesian Linear Regression . . . . . . . . . . 3.3.1 Parameter distribution . . . . . . . . . 3.3.2 Predictive distribution . . . . . . . . . 3.3.3 Equivalent kernel . . . . . . . . . . . . 3.4 Bayesian Model Comparison . . . . . . . . . . 3.5 The Evidence Approximation . . . . . . . . . 3.5.1 Evaluation of the evidence function . . 3.5.2 Maximizing the evidence function . . . 3.5.3 Effective number of parameters . . . . 3.6 Limitations of Fixed Basis Functions . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 138 140 143 143 144 146 147 152 152 156 159 161 165 166 168 170 172 173 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv CONTENTS 4 5 Linear Models for Classiﬁcation 4.1 Discriminant Functions . . . . . . . . . . . . . . 4.1.1 Two classes . . . . . . . . . . . . . . . . 4.1.2 Multiple classes . . . . . . . . . . . . . . 4.1.3 Least squares for classiﬁcation . . . . . . 4.1.4 Fisher’s linear discriminant . . . . . . . . 4.1.5 Relation to least squares . . . . . . . . . 4.1.6 Fisher’s discriminant for multiple classes 4.1.7 The perceptron algorithm . . . . . . . . . 4.2 Probabilistic Generative Models . . . . . . . . . 4.2.1 Continuous inputs . . . . . . . . . . . . 4.2.2 Maximum likelihood solution . . . . . . 4.2.3 Discrete features . . . . . . . . . . . . . 4.2.4 Exponential family . . . . . . . . . . . . 4.3 Probabilistic Discriminative Models . . . . . . . 4.3.1 Fixed basis functions . . . . . . . . . . . 4.3.2 Logistic regression . . . . . . . . . . . . 4.3.3 Iterative reweighted least squares . . . . 4.3.4 Multiclass logistic regression . . . . . . . 4.3.5 Probit regression . . . . . . . . . . . . . 4.3.6 Canonical link functions . . . . . . . . . 4.4 The Laplace Approximation . . . . . . . . . . . 4.4.1 Model comparison and BIC . . . . . . . 4.5 Bayesian Logistic Regression . . . . . . . . . . 4.5.1 Laplace approximation . . . . . . . . . . 4.5.2 Predictive distribution . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 181 181 182 184 186 189 191 192 196 198 200 202 202 203 204 205 207 209 210 212 213 216 217 217 218 220 Neural Networks 5.1 Feedforward Network Functions . . . . . . . 5.1.1 Weightspace symmetries . . . . . . . 5.2 Network Training . . . . . . . . . . . . . . . . 5.2.1 Parameter optimization . . . . . . . . . 5.2.2 Local quadratic approximation . . . . . 5.2.3 Use of gradient information . . . . . . 5.2.4 Gradient descent optimization . . . . . 5.3 Error Backpropagation . . . . . . . . . . . . . 5.3.1 Evaluation of errorfunction derivatives 5.3.2 A simple example . . . . . . . . . . . 5.3.3 Efﬁciency of backpropagation . . . . . 5.3.4 The Jacobian matrix . . . . . . . . . . 5.4 The Hessian Matrix . . . . . . . . . . . . . . . 5.4.1 Diagonal approximation . . . . . . . . 5.4.2 Outer product approximation . . . . . . 5.4.3 Inverse Hessian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225 227 231 232 236 237 239 240 241 242 245 246 247 249 250 251 252 . . . . . . . . . . . . . . . . xvi CONTENTS 6 7 5.4.4 Finite differences . . . . . . . . . . . . . . 5.4.5 Exact evaluation of the Hessian . . . . . . 5.4.6 Fast multiplication by the Hessian . . . . . 5.5 Regularization in Neural Networks . . . . . . . . 5.5.1 Consistent Gaussian priors . . . . . . . . . 5.5.2 Early stopping . . . . . . . . . . . . . . . 5.5.3 Invariances . . . . . . . . . . . . . . . . . 5.5.4 Tangent propagation . . . . . . . . . . . . 5.5.5 Training with transformed data . . . . . . . 5.5.6 Convolutional networks . . . . . . . . . . 5.5.7 Soft weight sharing . . . . . . . . . . . . . 5.6 Mixture Density Networks . . . . . . . . . . . . . 5.7 Bayesian Neural Networks . . . . . . . . . . . . . 5.7.1 Posterior parameter distribution . . . . . . 5.7.2 Hyperparameter optimization . . . . . . . 5.7.3 Bayesian neural networks for classiﬁcation Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252 253 254 256 257 259 261 263 265 267 269 272 277 278 280 281 284 Kernel Methods 6.1 Dual Representations . . . . . . . . . . . . 6.2 Constructing Kernels . . . . . . . . . . . . 6.3 Radial Basis Function Networks . . . . . . 6.3.1 NadarayaWatson model . . . . . . 6.4 Gaussian Processes . . . . . . . . . . . . . 6.4.1 Linear regression revisited . . . . . 6.4.2 Gaussian processes for regression . 6.4.3 Learning the hyperparameters . . . 6.4.4 Automatic relevance determination 6.4.5 Gaussian processes for classiﬁcation 6.4.6 Laplace approximation . . . . . . . 6.4.7 Connection to neural networks . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291 293 294 299 301 303 304 306 311 312 313 315 319 320 Sparse Kernel Machines 7.1 Maximum Margin Classiﬁers . . . . 7.1.1 Overlapping class distributions 7.1.2 Relation to logistic regression 7.1.3 Multiclass SVMs . . . . . . . 7.1.4 SVMs for regression . . . . . 7.1.5 Computational learning theory 7.2 Relevance Vector Machines . . . . . 7.2.1 RVM for regression . . . . . . 7.2.2 Analysis of sparsity . . . . . . 7.2.3 RVM for classiﬁcation . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325 326 331 336 338 339 344 345 345 349 353 357 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii CONTENTS 8 Graphical Models 8.1 Bayesian Networks . . . . . . . . . . . . . . 8.1.1 Example: Polynomial regression . . . 8.1.2 Generative models . . . . . . . . . . 8.1.3 Discrete variables . . . . . . . . . . . 8.1.4 LinearGaussian models . . . . . . . 8.2 Conditional Independence . . . . . . . . . . 8.2.1 Three example graphs . . . . . . . . 8.2.2 Dseparation . . . . . . . . . . . . . 8.3 Markov Random Fields . . . . . . . . . . . 8.3.1 Conditional independence properties . 8.3.2 Factorization properties . . . . . . . 8.3.3 Illustration: Image denoising . . . . 8.3.4 Relation to directed graphs . . . . . . 8.4 Inference in Graphical Models . . . . . . . . 8.4.1 Inference on a chain . . . . . . . . . 8.4.2 Trees . . . . . . . . . . . . . . . . . 8.4.3 Factor graphs . . . . . . . . . . . . . 8.4.4 The sumproduct algorithm . . . . . . 8.4.5 The maxsum algorithm . . . . . . . 8.4.6 Exact inference in general graphs . . 8.4.7 Loopy belief propagation . . . . . . . 8.4.8 Learning the graph structure . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359 360 362 365 366 370 372 373 378 383 383 384 387 390 393 394 398 399 402 411 416 417 418 418 Mixture Models and EM 9.1 Kmeans Clustering . . . . . . . . . . . . . 9.1.1 Image segmentation and compression 9.2 Mixtures of Gaussians . . . . . . . . . . . . 9.2.1 Maximum likelihood . . . . . . . . . 9.2.2 EM for Gaussian mixtures . . . . . . 9.3 An Alternative View of EM . . . . . . . . . 9.3.1 Gaussian mixtures revisited . . . . . 9.3.2 Relation to Kmeans . . . . . . . . . 9.3.3 Mixtures of Bernoulli distributions . . 9.3.4 EM for Bayesian linear regression . . 9.4 The EM Algorithm in General . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423 424 428 430 432 435 439 441 443 444 448 450 455 10 Approximate Inference 10.1 Variational Inference . . . . . . . . . . . . . . 10.1.1 Factorized distributions . . . . . . . . . 10.1.2 Properties of factorized approximations 10.1.3 Example: The univariate Gaussian . . . 10.1.4 Model comparison . . . . . . . . . . . 10.2 Illustration: Variational Mixture of Gaussians . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461 462 464 466 470 473 474 9 xviii CONTENTS 10.2.1 Variational distribution . . . . . . . . . 10.2.2 Variational lower bound . . . . . . . . 10.2.3 Predictive density . . . . . . . . . . . . 10.2.4 Determining the number of components 10.2.5 Induced factorizations . . . . . . . . . 10.3 Variational Linear Regression . . . . . . . . . 10.3.1 Variational distribution . . . . . . . . . 10.3.2 Predictive distribution . . . . . . . . . 10.3.3 Lower bound . . . . . . . . . . . . . . 10.4 Exponential Family Distributions . . . . . . . 10.4.1 Variational message passing . . . . . . 10.5 Local Variational Methods . . . . . . . . . . . 10.6 Variational Logistic Regression . . . . . . . . 10.6.1 Variational posterior distribution . . . . 10.6.2 Optimizing the variational parameters . 10.6.3 Inference of hyperparameters . . . . . 10.7 Expectation Propagation . . . . . . . . . . . . 10.7.1 Example: The clutter problem . . . . . 10.7.2 Expectation propagation on graphs . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475 481 482 483 485 486 486 488 489 490 491 493 498 498 500 502 505 511 513 517 11 Sampling Methods 11.1 Basic Sampling Algorithms . . . . . . . . 11.1.1 Standard distributions . . . . . . . 11.1.2 Rejection sampling . . . . . . . . . 11.1.3 Adaptive rejection sampling . . . . 11.1.4 Importance sampling . . . . . . . . 11.1.5 Samplingimportanceresampling . 11.1.6 Sampling and the EM algorithm . . 11.2 Markov Chain Monte Carlo . . . . . . . . 11.2.1 Markov chains . . . . . . . . . . . 11.2.2 The MetropolisHastings algorithm 11.3 Gibbs Sampling . . . . . . . . . . . . . . 11.4 Slice Sampling . . . . . . . . . . . . . . . 11.5 The Hybrid Monte Carlo Algorithm . . . . 11.5.1 Dynamical systems . . . . . . . . . 11.5.2 Hybrid Monte Carlo . . . . . . . . 11.6 Estimating the Partition Function . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523 526 526 528 530 532 534 536 537 539 541 542 546 548 548 552 554 556 12 Continuous Latent Variables 12.1 Principal Component Analysis . . . . . 12.1.1 Maximum variance formulation 12.1.2 Minimumerror formulation . . 12.1.3 Applications of PCA . . . . . . 12.1.4 PCA for highdimensional data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559 561 561 563 565 569 . . . . . . . . . . xix CONTENTS 12.2 Probabilistic PCA . . . . . . . . . . . 12.2.1 Maximum likelihood PCA . . . 12.2.2 EM algorithm for PCA . . . . . 12.2.3 Bayesian PCA . . . . . . . . . 12.2.4 Factor analysis . . . . . . . . . 12.3 Kernel PCA . . . . . . . . . . . . . . . 12.4 Nonlinear Latent Variable Models . . . 12.4.1 Independent component analysis 12.4.2 Autoassociative neural networks 12.4.3 Modelling nonlinear manifolds . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 570 574 577 580 583 586 591 591 592 595 599 13 Sequential Data 13.1 Markov Models . . . . . . . . . . . . . . . . . . 13.2 Hidden Markov Models . . . . . . . . . . . . . 13.2.1 Maximum likelihood for the HMM . . . 13.2.2 The forwardbackward algorithm . . . . 13.2.3 The sumproduct algorithm for the HMM 13.2.4 Scaling factors . . . . . . . . . . . . . . 13.2.5 The Viterbi algorithm . . . . . . . . . . . 13.2.6 Extensions of the hidden Markov model . 13.3 Linear Dynamical Systems . . . . . . . . . . . . 13.3.1 Inference in LDS . . . . . . . . . . . . . 13.3.2 Learning in LDS . . . . . . . . . . . . . 13.3.3 Extensions of LDS . . . . . . . . . . . . 13.3.4 Particle ﬁlters . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605 607 610 615 618 625 627 629 631 635 638 642 644 645 646 14 Combining Models 14.1 Bayesian Model Averaging . . . . . . . . . . 14.2 Committees . . . . . . . . . . . . . . . . . . 14.3 Boosting . . . . . . . . . . . . . . . . . . . 14.3.1 Minimizing exponential error . . . . 14.3.2 Error functions for boosting . . . . . 14.4 Treebased Models . . . . . . . . . . . . . . 14.5 Conditional Mixture Models . . . . . . . . . 14.5.1 Mixtures of linear regression models . 14.5.2 Mixtures of logistic models . . . . . 14.5.3 Mixtures of experts . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653 654 655 657 659 661 663 666 667 670 672 674 . . . . . . . . . . . . . . . . . . . . . . Appendix A Data Sets 677 Appendix B Probability Distributions 685 Appendix C Properties of Matrices 695 xx CONTENTS Appendix D Calculus of Variations 703 Appendix E Lagrange Multipliers 707 References 711 Index 729 1 Introduction The problem of searching for patterns in data is a fundamental one and has a long and successful history. For instance, the extensive astronomical observations of Tycho Brahe in the 16th century allowed Johannes Kepler to discover the empirical laws of planetary motion, which in turn provided a springboard for the development of classical mechanics. Similarly, the discovery of regularities in atomic spectra played a key role in the development and veriﬁcation of quantum physics in the early twentieth century. The ﬁeld of pattern recognition is concerned with the automatic discovery of regularities in data through the use of computer algorithms and with the use of these regularities to take actions such as classifying the data into different categories. Consider the example of recognizing handwritten digits, illustrated in Figure 1.1. Each digit corresponds to a 28×28 pixel image and so can be represented by a vector x comprising 784 real numbers. The goal is to build a machine that will take such a vector x as input and that will produce the identity of the digit 0, . . . , 9 as the output. This is a nontrivial problem due to the wide variability of handwriting. It could be 1 2 1. INTRODUCTION Figure 1.1 Examples of handwritten digits taken from US zip codes. tackled using handcrafted rules or heuristics for distinguishing the digits based on the shapes of the strokes, but in practice such an approach leads to a proliferation of rules and of exceptions to the rules and so on, and invariably gives poor results. Far better results can be obtained by adopting a machine learning approach in which a large set of N digits {x1 , . . . , xN } called a training set is used to tune the parameters of an adaptive model. The categories of the digits in the training set are known in advance, typically by inspecting them individually and handlabelling them. We can express the category of a digit using target vector t, which represents the identity of the corresponding digit. Suitable techniques for representing categories in terms of vectors will be discussed later. Note that there is one such target vector t for each digit image x. The result of running the machine learning algorithm can be expressed as a function y(x) which takes a new digit image x as input and that generates an output vector y, encoded in the same way as the target vectors. The precise form of the function y(x) is determined during the training phase, also known as the learning phase, on the basis of the training data. Once the model is trained it can then determine the identity of new digit images, which are said to comprise a test set. The ability to categorize correctly new examples that differ from those used for training is known as generalization. In practical applications, the variability of the input vectors will be such that the training data can comprise only a tiny fraction of all possible input vectors, and so generalization is a central goal in pattern recognition. For most practical applications, the original input variables are typically preprocessed to transform them into some new space of variables where, it is hoped, the pattern recognition problem will be easier to solve. For instance, in the digit recognition problem, the images of the digits are typically translated and scaled so that each digit is contained within a box of a ﬁxed size. This greatly reduces the variability within each digit class, because the location and scale of all the digits are now the same, which makes it much easier for a subsequent pattern recognition algorithm to distinguish between the different classes. This preprocessing stage is sometimes also called feature extraction. Note that new test data must be preprocessed using the same steps as the training data. Preprocessing might also be performed in order to speed up computation. For example, if the goal is realtime face detection in a highresolution video stream, the computer must handle huge numbers of pixels per second, and presenting these directly to a complex pattern recognition algorithm may be computationally infeasible. Instead, the aim is to ﬁnd useful features that are fast to compute, and yet that 1. INTRODUCTION 3 also preserve useful discriminatory information enabling faces to be distinguished from nonfaces. These features are then used as the inputs to the pattern recognition algorithm. For instance, the average value of the image intensity over a rectangular subregion can be evaluated extremely efﬁciently (Viola and Jones, 2004), and a set of such features can prove very effective in fast face detection. Because the number of such features is smaller than the number of pixels, this kind of preprocessing represents a form of dimensionality reduction. Care must be taken during preprocessing because often information is discarded, and if this information is important to the solution of the problem then the overall accuracy of the system can suffer. Applications in which the training data comprises examples of the input vectors along with their corresponding target vectors are known as supervised learning problems. Cases such as the digit recognition example, in which the aim is to assign each input vector to one of a ﬁnite number of discrete categories, are called classiﬁcation problems. If the desired output consists of one or more continuous variables, then the task is called regression. An example of a regression problem would be the prediction of the yield in a chemical manufacturing process in which the inputs consist of the concentrations of reactants, the temperature, and the pressure. In other pattern recognition problems, the training data consists of a set of input vectors x without any corresponding target values. The goal in such unsupervised learning problems may be to discover groups of similar examples within the data, where it is called clustering, or to determine the distribution of data within the input space, known as density estimation, or to project the data from a highdimensional space down to two or three dimensions for the purpose of visualization. Finally, the technique of reinforcement learning (Sutton and Barto, 1998) is concerned with the problem of ﬁnding suitable actions to take in a given situation in order to maximize a reward. Here the learning algorithm is not given examples of optimal outputs, in contrast to supervised learning, but must instead discover them by a process of trial and error. Typically there is a sequence of states and actions in which the learning algorithm is interacting with its environment. In many cases, the current action not only affects the immediate reward but also has an impact on the reward at all subsequent time steps. For example, by using appropriate reinforcement learning techniques a neural network can learn to play the game of backgammon to a high standard (Tesauro, 1994). Here the network must learn to take a board position as input, along with the result of a dice throw, and produce a strong move as the output. This is done by having the network play against a copy of itself for perhaps a million games. A major challenge is that a game of backgammon can involve dozens of moves, and yet it is only at the end of the game that the reward, in the form of victory, is achieved. The reward must then be attributed appropriately to all of the moves that led to it, even though some moves will have been good ones and others less so. This is an example of a credit assignment problem. A general feature of reinforcement learning is the tradeoff between exploration, in which the system tries out new kinds of actions to see how effective they are, and exploitation, in which the system makes use of actions that are known to yield a high reward. Too strong a focus on either exploration or exploitation will yield poor results. Reinforcement learning continues to be an active area of machine learning research. However, a 4 1. INTRODUCTION Figure 1.2 Plot of a training data set of N = 10 points, shown as blue circles, each comprising an observation of the input variable x along with the corresponding target variable t. The green curve shows the function sin(2πx) used to generate the data. Our goal is to predict the value of t for some new value of x, without knowledge of the green curve. 1 t 0 −1 0 x 1 detailed treatment lies beyond the scope of this book. Although each of these tasks needs its own tools and techniques, many of the key ideas that underpin them are common to all such problems. One of the main goals of this chapter is to introduce, in a relatively informal way, several of the most important of these concepts and to illustrate them using simple examples. Later in the book we shall see these same ideas reemerge in the context of more sophisticated models that are applicable to realworld pattern recognition applications. This chapter also provides a selfcontained introduction to three important tools that will be used throughout the book, namely probability theory, decision theory, and information theory. Although these might sound like daunting topics, they are in fact straightforward, and a clear understanding of them is essential if machine learning techniques are to be used to best effect in practical applications. 1.1. Example: Polynomial Curve Fitting We begin by introducing a simple regression problem, which we shall use as a running example throughout this chapter to motivate a number of key concepts. Suppose we observe a realvalued input variable x and we wish to use this observation to predict the value of a realvalued target variable t. For the present purposes, it is instructive to consider an artiﬁcial example using synthetically generated data because we then know the precise process that generated the data for comparison against any learned model. The data for this example is generated from the function sin(2πx) with random noise included in the target values, as described in detail in Appendix A. Now suppose that we are given a training set comprising N observations of x, written x ≡ (x1 , . . . , xN )T , together with corresponding observations of the values of t, denoted t ≡ (t1 , . . . , tN )T . Figure 1.2 shows a plot of a training set comprising N = 10 data points. The input data set x in Figure 1.2 was generated by choosing values of xn , for n = 1, . . . , N , spaced uniformly in range [0, 1], and the target data set t was obtained by ﬁrst computing the corresponding values of the function 1.1. Example: Polynomial Curve Fitting 5 sin(2πx) and then adding a small level of random noise having a Gaussian distribution (the Gaussian distribution is discussed in Section 1.2.4) to each such point in order to obtain the corresponding value tn . By generating data in this way, we are capturing a property of many real data sets, namely that they possess an underlying regularity, which we wish to learn, but that individual observations are corrupted by random noise. This noise might arise from intrinsically stochastic (i.e. random) processes such as radioactive decay but more typically is due to there being sources of variability that are themselves unobserved. Our goal is to exploit this training set in order to make predictions of the value t of the target variable for some new value x of the input variable. As we shall see later, this involves implicitly trying to discover the underlying function sin(2πx). This is intrinsically a difﬁcult problem as we have to generalize from a ﬁnite data x set. Furthermore the observed data are corrupted with noise, and so for a given there is uncertainty as to the appropriate value for t. Probability theory, discussed in Section 1.2, provides a framework for expressing such uncertainty in a precise and quantitative manner, and decision theory, discussed in Section 1.5, allows us to exploit this probabilistic representation in order to make predictions that are optimal according to appropriate criteria. For the moment, however, we shall proceed rather informally and consider a simple approach based on curve ﬁtting. In particular, we shall ﬁt the data using a polynomial function of the form M y(x, w) = w0 + w1 x + w2 x + . . . + wM x 2 = M wj xj (1.1) j =0 where M is the order of the polynomial, and xj denotes x raised to the power of j. The polynomial coefﬁcients w0 , . . . , wM are collectively denoted by the vector w. Note that, although the polynomial function y(x, w) is a nonlinear function of x, it is a linear function of the coefﬁcients w. Functions, such as the polynomial, which are linear in the unknown parameters have important properties and are called linear models and will be discussed extensively in Chapters 3 and 4. The values of the coefﬁcients will be determined by ﬁtting the polynomial to the training data. This can be done by minimizing an error function that measures the misﬁt between the function y(x, w), for any given value of w, and the training set data points. One simple choice of error function, which is widely used, is given by the sum of the squares of the errors between the predictions y(xn , w) for each data point xn and the corresponding target values tn , so that we minimize 1 2 {y(xn , w) − tn } 2 N E(w) = (1.2) n=1 where the factor of 1/2 is included for later convenience. We shall discuss the motivation for this choice of error function later in this chapter. For the moment we simply note that it is a nonnegative quantity that would be zero if, and only if, the 6 1. INTRODUCTION Figure 1.3 The error function (1.2) corresponds to (one half of) the sum of t the squares of the displacements (shown by the vertical green bars) of each data point from the function y(x, w). tn y(xn , w) xn x function y(x, w) were to pass exactly through each training data point. The geometrical interpretation of the sumofsquares error function is illustrated in Figure 1.3. Exercise 1.1 We can solve the curve ﬁtting problem by choosing the value of w for which E(w) is as small as possible. Because the error function is a quadratic function of the coefﬁcients w, its derivatives with respect to the coefﬁcients will be linear in the elements of w, and so the minimization of the error function has a unique solution, denoted by w , which can be found in closed form. The resulting polynomial is given by the function y(x, w ). There remains the problem of choosing the order M of the polynomial, and as we shall see this will turn out to be an example of an important concept called model comparison or model selection. In Figure 1.4, we show four examples of the results of ﬁtting polynomials having orders M = 0, 1, 3, and 9 to the data set shown in Figure 1.2. We notice that the constant (M = 0) and ﬁrst order (M = 1) polynomials give rather poor ﬁts to the data and consequently rather poor representations of the function sin(2πx). The third order (M = 3) polynomial seems to give the best ﬁt to the function sin(2πx) of the examples shown in Figure 1.4. When we go to a much higher order polynomial (M = 9), we obtain an excellent ﬁt to the training data. In fact, the polynomial passes exactly through each data point and E(w ) = 0. However, the ﬁtted curve oscillates wildly and gives a very poor representation of the function sin(2πx). This latter behaviour is known as overﬁtting. As we have noted earlier, the goal is to achieve good generalization by making accurate predictions for new data. We can obtain some quantitative insight into the dependence of the generalization performance on M by considering a separate test set comprising 100 data points generated using exactly the same procedure used to generate the training set points but with new choices for the random noise values included in the target values. For each choice of M , we can then evaluate the residual value of E(w ) given by (1.2) for the training data, and we can also evaluate E(w ) for the test data set. It is sometimes more convenient to use the rootmeansquare 7 1.1. Example: Polynomial Curve Fitting M =0 1 M =1 1 t t 0 0 −1 −1 0 x 1 0 M =3 1 x M =9 1 t 1 t 0 0 −1 −1 0 x 1 0 x 1 Figure 1.4 Plots of polynomials having various orders M , shown as red curves, ﬁtted to the data set shown in Figure 1.2. (RMS) error deﬁned by ERMS = 2E(w )/N (1.3) in which the division by N allows us to compare different sizes of data sets on an equal footing, and the square root ensures that ERMS is measured on the same scale (and in the same units) as the target variable t. Graphs of the training and test set RMS errors are shown, for various values of M , in Figure 1.5. The test set error is a measure of how well we are doing in predicting the values of t for new data observations of x. We note from Figure 1.5 that small values of M give relatively large values of the test set error, and this can be attributed to the fact that the corresponding polynomials are rather inﬂexible and are incapable of capturing the oscillations in the function sin(2πx). Values of M in the range 3 M 8 give small values for the test set error, and these also give reasonable representations of the generating function sin(2πx), as can be seen, for the case of M = 3, from Figure 1.4. 1. INTRODUCTION Figure 1.5 Graphs of the rootmeansquare error, deﬁned by (1.3), evaluated on the training set and on an independent test set for various values of M . 1 Training Test ERMS 8 0.5 0 0 3 M 6 9 For M = 9, the training set error goes to zero, as we might expect because this polynomial contains 10 degrees of freedom corresponding to the 10 coefﬁcients w0 , . . . , w9 , and so can be tuned exactly to the 10 data points in the training set. However, the test set error has become very large and, as we saw in Figure 1.4, the corresponding function y(x, w ) exhibits wild oscillations. This may seem paradoxical because a polynomial of given order contains all lower order polynomials as special cases. The M = 9 polynomial is therefore capable of generating results at least as good as the M = 3 polynomial. Furthermore, we might suppose that the best predictor of new data would be the function sin(2πx) from which the data was generated (and we shall see later that this is indeed the case). We know that a power series expansion of the function sin(2πx) contains terms of all orders, so we might expect that results should improve monotonically as we increase M . We can gain some insight into the problem by examining the values of the coefﬁcients w obtained from polynomials of various order, as shown in Table 1.1. We see that, as M increases, the magnitude of the coefﬁcients typically gets larger. In particular for the M = 9 polynomial, the coefﬁcients have become ﬁnely tuned to the data by developing large positive and negative values so that the correspondTable 1.1 Table of the coefﬁcients w for polynomials of various order. Observe how the typical magnitude of the coefﬁcients increases dramatically as the order of the polynomial increases. w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 M =0 0.19 M =1 0.82 1.27 M =6 0.31 7.99 25.43 17.37 M =9 0.35 232.37 5321.83 48568.31 231639.30 640042.26 1061800.52 1042400.18 557682.99 125201.43 9 1.1. Example: Polynomial Curve Fitting N = 15 1 N = 100 1 t t 0 0 −1 −1 0 x 1 0 x 1 Figure 1.6 Plots of the solutions obtained by minimizing the sumofsquares error function using the M = 9 polynomial for N = 15 data points (left plot) and N = 100 data points (right plot). We see that increasing the size of the data set reduces the overﬁtting problem. Section 3.4 ing polynomial function matches each of the data points exactly, but between data points (particularly near the ends of the range) the function exhibits the large oscillations observed in Figure 1.4. Intuitively, what is happening is that the more ﬂexible polynomials with larger values of M are becoming increasingly tuned to the random noise on the target values. It is also interesting to examine the behaviour of a given model as the size of the data set is varied, as shown in Figure 1.6. We see that, for a given model complexity, the overﬁtting problem become less severe as the size of the data set increases. Another way to say this is that the larger the data set, the more complex (in other words more ﬂexible) the model that we can afford to ﬁt to the data. One rough heuristic that is sometimes advocated is that the number of data points should be no less than some multiple (say 5 or 10) of the number of adaptive parameters in the model. However, as we shall see in Chapter 3, the number of parameters is not necessarily the most appropriate measure of model complexity. Also, there is something rather unsatisfying about having to limit the number of parameters in a model according to the size of the available training set. It would seem more reasonable to choose the complexity of the model according to the complexity of the problem being solved. We shall see that the least squares approach to ﬁnding the model parameters represents a speciﬁc case of maximum likelihood (discussed in Section 1.2.5), and that the overﬁtting problem can be understood as a general property of maximum likelihood. By adopting a Bayesian approach, the overﬁtting problem can be avoided. We shall see that there is no difﬁculty from a Bayesian perspective in employing models for which the number of parameters greatly exceeds the number of data points. Indeed, in a Bayesian model the effective number of parameters adapts automatically to the size of the data set. For the moment, however, it is instructive to continue with the current approach and to consider how in practice we can apply it to data sets of limited size where we 10 1. INTRODUCTION ln λ = −18 1 ln λ = 0 1 t t 0 0 −1 −1 0 x 1 0 x 1 Figure 1.7 Plots of M = 9 polynomials ﬁtted to the data set shown in Figure 1.2 using the regularized error function (1.4) for two values of the regularization parameter λ corresponding to ln λ = −18 and ln λ = 0. The case of no regularizer, i.e., λ = 0, corresponding to ln λ = −∞, is shown at the bottom right of Figure 1.4. may wish to use relatively complex and ﬂexible models. One technique that is often used to control the overﬁtting phenomenon in such cases is that of regularization, which involves adding a penalty term to the error function (1.2) in order to discourage the coefﬁcients from reaching large values. The simplest such penalty term takes the form of a sum of squares of all of the coefﬁcients, leading to a modiﬁed error function of the form N 1 λ 2 E(w) = {y(xn , w) − tn } + w2 (1.4) 2 2 n=1 2 + w12 + . . . + wM , and the coefﬁcient λ governs the relwhere w ≡ w w = ative importance of the regularization term compared with the sumofsquares error term. Note that often the coefﬁcient w0 is omitted from the regularizer because its inclusion causes the results to depend on the choice of origin for the target variable (Hastie et al., 2001), or it may be included but with its own regularization coefﬁcient (we shall discuss this topic in more detail in Section 5.5.1). Again, the error function in (1.4) can be minimized exactly in closed form. Techniques such as this are known in the statistics literature as shrinkage methods because they reduce the value of the coefﬁcients. The particular case of a quadratic regularizer is called ridge regression (Hoerl and Kennard, 1970). In the context of neural networks, this approach is known as weight decay. Figure 1.7 shows the results of ﬁtting the polynomial of order M = 9 to the same data set as before but now using the regularized error function given by (1.4). We see that, for a value of ln λ = −18, the overﬁtting has been suppressed and we now obtain a much closer representation of the underlying function sin(2πx). If, however, we use too large a value for λ then we again obtain a poor ﬁt, as shown in Figure 1.7 for ln λ = 0. The corresponding coefﬁcients from the ﬁtted polynomials are given in Table 1.2, showing that regularization has the desired effect of reducing 2 Exercise 1.2 T w02 11 1.1. Example: Polynomial Curve Fitting Section 1.3 Figure 1.8 Table of the coefﬁcients w for M = 9 polynomials with various values for the regularization parameter λ. Note that ln λ = −∞ corresponds to a model with no regularization, i.e., to the graph at the bottom right in Figure 1.4. We see that, as the value of λ increases, the typical magnitude of the coefﬁcients gets smaller. ln λ = −∞ 0.35 232.37 5321.83 48568.31 231639.30 640042.26 1061800.52 1042400.18 557682.99 125201.43 w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 ln λ = −18 0.35 4.74 0.77 31.97 3.89 55.28 41.32 45.95 91.53 72.68 ln λ = 0 0.13 0.05 0.06 0.05 0.03 0.02 0.01 0.00 0.00 0.01 the magnitude of the coefﬁcients. The impact of the regularization term on the generalization error can be seen by plotting the value of the RMS error (1.3) for both training and test sets against ln λ, as shown in Figure 1.8. We see that in effect λ now controls the effective complexity of the model and hence determines the degree of overﬁtting. The issue of model complexity is an important one and will be discussed at length in Section 1.3. Here we simply note that, if we were trying to solve a practical application using this approach of minimizing an error function, we would have to ﬁnd a way to determine a suitable value for the model complexity. The results above suggest a simple way of achieving this, namely by taking the available data and partitioning it into a training set, used to determine the coefﬁcients w, and a separate validation set, also called a holdout set, used to optimize the model complexity (either M or λ). In many cases, however, this will prove to be too wasteful of valuable training data, and we have to seek more sophisticated approaches. So far our discussion of polynomial curve ﬁtting has appealed largely to intuition. We now seek a more principled approach to solving problems in pattern recognition by turning to a discussion of probability theory. As well as providing the foundation for nearly all of the subsequent developments in this book, it will also Graph of the rootmeansquare error (1.3) versus ln λ for the M = 9 polynomial. 1 Training Test ERMS Table 1.2 0.5 0 −35 −30 ln λ −25 −20 12 1. INTRODUCTION give us some important insights into the concepts we have introduced in the context of polynomial curve ﬁtting and will allow us to extend these to more complex situations. 1.2. Probability Theory A key concept in the ﬁeld of pattern recognition is that of uncertainty. It arises both through noise on measurements, as well as through the ﬁnite size of data sets. Probability theory provides a consistent framework for the quantiﬁcation and manipulation of uncertainty and forms one of the central foundations for pattern recognition. When combined with decision theory, discussed in Section 1.5, it allows us to make optimal predictions given all the information available to us, even though that information may be incomplete or ambiguous. We will introduce the basic concepts of probability theory by considering a simple example. Imagine we have two boxes, one red and one blue, and in the red box we have 2 apples and 6 oranges, and in the blue box we have 3 apples and 1 orange. This is illustrated in Figure 1.9. Now suppose we randomly pick one of the boxes and from that box we randomly select an item of fruit, and having observed which sort of fruit it is we replace it in the box from which it came. We could imagine repeating this process many times. Let us suppose that in so doing we pick the red box 40% of the time and we pick the blue box 60% of the time, and that when we remove an item of fruit from a box we are equally likely to select any of the pieces of fruit in the box. In this example, the identity of the box that will be chosen is a random variable, which we shall denote by B. This random variable can take one of two possible values, namely r (corresponding to the red box) or b (corresponding to the blue box). Similarly, the identity of the fruit is also a random variable and will be denoted by F . It can take either of the values a (for apple) or o (for orange). To begin with, we shall deﬁne the probability of an event to be the fraction of times that event occurs out of the total number of trials, in the limit that the total number of trials goes to inﬁnity. Thus the probability of selecting the red box is 4/10 Figure 1.9 We use a simple example of two coloured boxes each containing fruit (apples shown in green and oranges shown in orange) to introduce the basic ideas of probability. 1.2. Probability Theory 13 ci } Figure 1.10 We can derive the sum and product rules of probability by considering two random variables, X, which takes the values {xi } where i = 1, . . . , M , and Y , which takes the values {yj } where j = 1, . . . , L. In this illustration we have M = 5 and L = 3. If we consider a total number N of instances of these variables, then we denote the number of instances where X = xi and Y = yj by nij , which is the number of yj points in the corresponding cell of the array. The number of points in column i, corresponding to X = xi , is denoted by ci , and the number of points in row j, corresponding to Y = yj , is denoted by rj . nij }r j xi and the probability of selecting the blue box is 6/10. We write these probabilities as p(B = r) = 4/10 and p(B = b) = 6/10. Note that, by deﬁnition, probabilities must lie in the interval [0, 1]. Also, if the events are mutually exclusive and if they include all possible outcomes (for instance, in this example the box must be either red or blue), then we see that the probabilities for those events must sum to one. We can now ask questions such as: “what is the overall probability that the selection procedure will pick an apple?”, or “given that we have chosen an orange, what is the probability that the box we chose was the blue one?”. We can answer questions such as these, and indeed much more complex questions associated with problems in pattern recognition, once we have equipped ourselves with the two elementary rules of probability, known as the sum rule and the product rule. Having obtained these rules, we shall then return to our boxes of fruit example. In order to derive the rules of probability, consider the slightly more general example shown in Figure 1.10 involving two random variables X and Y (which could for instance be the Box and Fruit variables considered above). We shall suppose that X can take any of the values xi where i = 1, . . . , M , and Y can take the values yj where j = 1, . . . , L. Consider a total of N trials in which we sample both of the variables X and Y , and let the number of such trials in which X = xi and Y = yj be nij . Also, let the number of trials in which X takes the value xi (irrespective of the value that Y takes) be denoted by ci , and similarly let the number of trials in which Y takes the value yj be denoted by rj . The probability that X will take the value xi and Y will take the value yj is written p(X = xi , Y = yj ) and is called the joint probability of X = xi and Y = yj . It is given by the number of points falling in the cell i,j as a fraction of the total number of points, and hence p(X = xi , Y = yj ) = nij . N (1.5) Here we are implicitly considering the limit N → ∞. Similarly, the probability that X takes the value xi irrespective of the value of Y is written as p(X = xi ) and is given by the fraction of the total number of points that fall in column i, so that p(X = xi ) = ci . N (1.6) Because the number of instances in column i in Figure 1.10 is just the sum of the number of instances in each cell of that column, we have ci = j nij and therefore, 14 1. INTRODUCTION from (1.5) and (1.6), we have p(X = xi ) = L p(X = xi , Y = yj ) (1.7) j =1 which is the sum rule of probability. Note that p(X = xi ) is sometimes called the marginal probability, because it is obtained by marginalizing, or summing out, the other variables (in this case Y ). If we consider only those instances for which X = xi , then the fraction of such instances for which Y = yj is written p(Y = yj X = xi ) and is called the conditional probability of Y = yj given X = xi . It is obtained by ﬁnding the fraction of those points in column i that fall in cell i,j and hence is given by p(Y = yj X = xi ) = nij . ci (1.8) From (1.5), (1.6), and (1.8), we can then derive the following relationship nij ci nij = · N ci N = p(Y = yj X = xi )p(X = xi ) p(X = xi , Y = yj ) = (1.9) which is the product rule of probability. So far we have been quite careful to make a distinction between a random variable, such as the box B in the fruit example, and the values that the random variable can take, for example r if the box were the red one. Thus the probability that B takes the value r is denoted p(B = r). Although this helps to avoid ambiguity, it leads to a rather cumbersome notation, and in many cases there will be no need for such pedantry. Instead, we may simply write p(B) to denote a distribution over the random variable B, or p(r) to denote the distribution evaluated for the particular value r, provided that the interpretation is clear from the context. With this more compact notation, we can write the two fundamental rules of probability theory in the following form. The Rules of Probability sum rule p(X) = p(X, Y ) (1.10) Y product rule p(X, Y ) = p(Y X)p(X). (1.11) Here p(X, Y ) is a joint probability and is verbalized as “the probability of X and Y ”. Similarly, the quantity p(Y X) is a conditional probability and is verbalized as “the probability of Y given X”, whereas the quantity p(X) is a marginal probability 1.2. Probability Theory 15 and is simply “the probability of X”. These two simple rules form the basis for all of the probabilistic machinery that we use throughout this book. From the product rule, together with the symmetry property p(X, Y ) = p(Y, X), we immediately obtain the following relationship between conditional probabilities p(Y X) = p(XY )p(Y ) p(X) (1.12) which is called Bayes’ theorem and which plays a central role in pattern recognition and machine learning. Using the sum rule, the denominator in Bayes’ theorem can be expressed in terms of the quantities appearing in the numerator p(X) = p(XY )p(Y ). (1.13) Y We can view the denominator in Bayes’ theorem as being the normalization constant required to ensure that the sum of the conditional probability on the lefthand side of (1.12) over all values of Y equals one. In Figure 1.11, we show a simple example involving a joint distribution over two variables to illustrate the concept of marginal and conditional distributions. Here a ﬁnite sample of N = 60 data points has been drawn from the joint distribution and is shown in the top left. In the top right is a histogram of the fractions of data points having each of the two values of Y . From the deﬁnition of probability, these fractions would equal the corresponding probabilities p(Y ) in the limit N → ∞. We can view the histogram as a simple way to model a probability distribution given only a ﬁnite number of points drawn from that distribution. Modelling distributions from data lies at the heart of statistical pattern recognition and will be explored in great detail in this book. The remaining two plots in Figure 1.11 show the corresponding histogram estimates of p(X) and p(XY = 1). Let us now return to our example involving boxes of fruit. For the moment, we shall once again be explicit about distinguishing between the random variables and their instantiations. We have seen that the probabilities of selecting either the red or the blue boxes are given by p(B = r) = 4/10 p(B = b) = 6/10 (1.14) (1.15) respectively. Note that these satisfy p(B = r) + p(B = b) = 1. Now suppose that we pick a box at random, and it turns out to be the blue box. Then the probability of selecting an apple is just the fraction of apples in the blue box which is 3/4, and so p(F = aB = b) = 3/4. In fact, we can write out all four conditional probabilities for the type of fruit, given the selected box p(F p(F p(F p(F = aB = r) = oB = r) = aB = b) = oB = b) = = = = 1/4 3/4 3/4 1/4. (1.16) (1.17) (1.18) (1.19) 16 1. INTRODUCTION p(Y ) p(X, Y ) Y =2 Y =1 X p(X) p(XY = 1) X X Figure 1.11 An illustration of a distribution over two variables, X, which takes 9 possible values, and Y , which takes two possible values. The top left ﬁgure shows a sample of 60 points drawn from a joint probability distribution over these variables. The remaining ﬁgures show histogram estimates of the marginal distributions p(X) and p(Y ), as well as the conditional distribution p(XY = 1) corresponding to the bottom row in the top left ﬁgure. Again, note that these probabilities are normalized so that p(F = aB = r) + p(F = oB = r) = 1 (1.20) p(F = aB = b) + p(F = oB = b) = 1. (1.21) and similarly We can now use the sum and product rules of probability to evaluate the overall probability of choosing an apple p(F = a) = p(F = aB = r)p(B = r) + p(F = aB = b)p(B = b) 1 4 3 6 11 = × + × = (1.22) 4 10 4 10 20 from which it follows, using the sum rule, that p(F = o) = 1 − 11/20 = 9/20. 1.2. Probability Theory 17 Suppose instead we are told that a piece of fruit has been selected and it is an orange, and we would like to know which box it came from. This requires that we evaluate the probability distribution over boxes conditioned on the identity of the fruit, whereas the probabilities in (1.16)–(1.19) give the probability distribution over the fruit conditioned on the identity of the box. We can solve the problem of reversing the conditional probability by using Bayes’ theorem to give p(B = rF = o) = 3 4 20 2 p(F = oB = r)p(B = r) = × × = . p(F = o) 4 10 9 3 (1.23) From the sum rule, it then follows that p(B = bF = o) = 1 − 2/3 = 1/3. We can provide an important interpretation of Bayes’ theorem as follows. If we had been asked which box had been chosen before being told the identity of the selected item of fruit, then the most complete information we have available is provided by the probability p(B). We call this the prior probability because it is the probability available before we observe the identity of the fruit. Once we are told that the fruit is an orange, we can then use Bayes’ theorem to compute the probability p(BF ), which we shall call the posterior probability because it is the probability obtained after we have observed F . Note that in this example, the prior probability of selecting the red box was 4/10, so that we were more likely to select the blue box than the red one. However, once we have observed that the piece of selected fruit is an orange, we ﬁnd that the posterior probability of the red box is now 2/3, so that it is now more likely that the box we selected was in fact the red one. This result accords with our intuition, as the proportion of oranges is much higher in the red box than it is in the blue box, and so the observation that the fruit was an orange provides signiﬁcant evidence favouring the red box. In fact, the evidence is sufﬁciently strong that it outweighs the prior and makes it more likely that the red box was chosen rather than the blue one. Finally, we note that if the joint distribution of two variables factorizes into the product of the marginals, so that p(X, Y ) = p(X)p(Y ), then X and Y are said to be independent. From the product rule, we see that p(Y X) = p(Y ), and so the conditional distribution of Y given X is indeed independent of the value of X. For instance, in our boxes of fruit example, if each box contained the same fraction of apples and oranges, then p(F B) = P (F ), so that the probability of selecting, say, an apple is independent of which box is chosen. 1.2.1 Probability densities As well as considering probabilities deﬁned over discrete sets of events, we also wish to consider probabilities with respect to continuous variables. We shall limit ourselves to a relatively informal discussion. If the probability of a realvalued variable x falling in the interval (x, x + δx) is given by p(x)δx for δx → 0, then p(x) is called the probability density over x. This is illustrated in Figure 1.12. The probability that x will lie in an interval (a, b) is then given by b p(x) dx. (1.24) p(x ∈ (a, b)) = a 18 1. INTRODUCTION Figure 1.12 The concept of probability for discrete variables can be extended to that of a probability density p(x) over a continuous variable x and is such that the probability of x lying in the interval (x, x + δx) is given by p(x)δx for δx → 0. The probability density can be expressed as the derivative of a cumulative distribution function P (x). p(x) δx P (x) x Because probabilities are nonnegative, and because the value of x must lie somewhere on the real axis, the probability density p(x) must satisfy the two conditions ∞ p(x) 0 (1.25) p(x) dx = 1. (1.26) −∞ Under a nonlinear change of variable, a probability density transforms differently from a simple function, due to the Jacobian factor. For instance, if we consider a change of variables x = g(y), then a function f (x) becomes f(y) = f (g(y)). Now consider a probability density px (x) that corresponds to a density py (y) with respect to the new variable y, where the sufﬁces denote the fact that px (x) and py (y) are different densities. Observations falling in the range (x, x + δx) will, for small values of δx, be transformed into the range (y, y + δy) where px (x)δx py (y)δy, and hence dx py (y) = px (x) dy = px (g(y)) g (y) . (1.27) Exercise 1.4 One consequence of this property is that the concept of the maximum of a probability density is dependent on the choice of variable. The probability that x lies in the interval (−∞, z) is given by the cumulative distribution function deﬁned by z p(x) dx (1.28) P (z) = −∞ which satisﬁes P (x) = p(x), as shown in Figure 1.12. If we have several continuous variables x1 , . . . , xD , denoted collectively by the vector x, then we can deﬁne a joint probability density p(x) = p(x1 , . . . , xD ) such 1.2. Probability Theory 19 that the probability of x falling in an inﬁnitesimal volume δx containing the point x is given by p(x)δx. This multivariate probability density must satisfy p(x) 0 (1.29) p(x) dx = 1 (1.30) in which the integral is taken over the whole of x space. We can also consider joint probability distributions over a combination of discrete and continuous variables. Note that if x is a discrete variable, then p(x) is sometimes called a probability mass function because it can be regarded as a set of ‘probability masses’ concentrated at the allowed values of x. The sum and product rules of probability, as well as Bayes’ theorem, apply equally to the case of probability densities, or to combinations of discrete and continuous variables. For instance, if x and y are two real variables, then the sum and product rules take the form p(x, y) dy (1.31) p(x) = p(x, y) = p(yx)p(x). (1.32) A formal justiﬁcation of the sum and product rules for continuous variables (Feller, 1966) requires a branch of mathematics called measure theory and lies outside the scope of this book. Its validity can be seen informally, however, by dividing each real variable into intervals of width ∆ and considering the discrete probability distribution over these intervals. Taking the limit ∆ → 0 then turns sums into integrals and gives the desired result. 1.2.2 Expectations and covariances One of the most important operations involving probabilities is that of ﬁnding weighted averages of functions. The average value of some function f (x) under a probability distribution p(x) is called the expectation of f (x) and will be denoted by E[f ]. For a discrete distribution, it is given by p(x)f (x) (1.33) E[f ] = x so that the average is weighted by the relative probabilities of the different values of x. In the case of continuous variables, expectations are expressed in terms of an integration with respect to the corresponding probability density E[f ] = p(x)f (x) dx. (1.34) In either case, if we are given a ﬁnite number N of points drawn from the probability distribution or probability density, then the expectation can be approximated as a 20 1. INTRODUCTION ﬁnite sum over these points N 1 E[f ] f (xn ). N (1.35) n=1 We shall make extensive use of this result when we discuss sampling methods in Chapter 11. The approximation in (1.35) becomes exact in the limit N → ∞. Sometimes we will be considering expectations of functions of several variables, in which case we can use a subscript to indicate which variable is being averaged over, so that for instance (1.36) Ex [f (x, y)] denotes the average of the function f (x, y) with respect to the distribution of x. Note that Ex [f (x, y)] will be a function of y. We can also consider a conditional expectation with respect to a conditional distribution, so that p(xy)f (x) (1.37) Ex [f y] = x with an analogous deﬁnition for continuous variables. The variance of f (x) is deﬁned by 2 var[f ] = E (f (x) − E[f (x)]) Exercise 1.5 (1.38) and provides a measure of how much variability there is in f (x) around its mean value E[f (x)]. Expanding out the square, we see that the variance can also be written in terms of the expectations of f (x) and f (x)2 var[f ] = E[f (x)2 ] − E[f (x)]2 . (1.39) In particular, we can consider the variance of the variable x itself, which is given by var[x] = E[x2 ] − E[x]2 . (1.40) For two random variables x and y, the covariance is deﬁned by cov[x, y] = Ex,y [{x − E[x]} {y − E[y]}] = Ex,y [xy] − E[x]E[y] Exercise 1.6 (1.41) which expresses the extent to which x and y vary together. If x and y are independent, then their covariance vanishes. In the case of two vectors of random variables x and y, the covariance is a matrix cov[x, y] = Ex,y {x − E[x]}{yT − E[yT ]} = Ex,y [xyT ] − E[x]E[yT ]. (1.42) If we consider the covariance of the components of a vector x with each other, then we use a slightly simpler notation cov[x] ≡ cov[x, x]. 1.2. Probability Theory 21 1.2.3 Bayesian probabilities So far in this chapter, we have viewed probabilities in terms of the frequencies of random, repeatable events. We shall refer to this as the classical or frequentist interpretation of probability. Now we turn to the more general Bayesian view, in which probabilities provide a quantiﬁcation of uncertainty. Consider an uncertain event, for example whether the moon was once in its own orbit around the sun, or whether the Arctic ice cap will have disappeared by the end of the century. These are not events that can be repeated numerous times in order to deﬁne a notion of probability as we did earlier in the context of boxes of fruit. Nevertheless, we will generally have some idea, for example, of how quickly we think the polar ice is melting. If we now obtain fresh evidence, for instance from a new Earth observation satellite gathering novel forms of diagnostic information, we may revise our opinion on the rate of ice loss. Our assessment of such matters will affect the actions we take, for instance the extent to which we endeavour to reduce the emission of greenhouse gasses. In such circumstances, we would like to be able to quantify our expression of uncertainty and make precise revisions of uncertainty in the light of new evidence, as well as subsequently to be able to take optimal actions or decisions as a consequence. This can all be achieved through the elegant, and very general, Bayesian interpretation of probability. The use of probability to represent uncertainty, however, is not an adhoc choice, but is inevitable if we are to respect common sense while making rational coherent inferences. For instance, Cox (1946) showed that if numerical values are used to represent degrees of belief, then a simple set of axioms encoding common sense properties of such beliefs leads uniquely to a set of rules for manipulating degrees of belief that are equivalent to the sum and product rules of probability. This provided the ﬁrst rigorous proof that probability theory could be regarded as an extension of Boolean logic to situations involving uncertainty (Jaynes, 2003). Numerous other authors have proposed different sets of properties or axioms that such measures of uncertainty should satisfy (Ramsey, 1931; Good, 1950; Savage, 1961; deFinetti, 1970; Lindley, 1982). In each case, the resulting numerical quantities behave precisely according to the rules of probability. It is therefore natural to refer to these quantities as (Bayesian) probabilities. In the ﬁeld of pattern recognition, too, it is helpful to have a more general no Thomas Bayes 1701–1761 Thomas Bayes was born in Tunbridge Wells and was a clergyman as well as an amateur scientist and a mathematician. He studied logic and theology at Edinburgh University and was elected Fellow of the Royal Society in 1742. During the 18th century, issues regarding probability arose in connection with gambling and with the new concept of insurance. One particularly important problem concerned socalled inverse probability. A solution was proposed by Thomas Bayes in his paper ‘Essay towards solving a problem in the doctrine of chances’, which was published in 1764, some three years after his death, in the Philosophical Transactions of the Royal Society. In fact, Bayes only formulated his theory for the case of a uniform prior, and it was PierreSimon Laplace who independently rediscovered the theory in general form and who demonstrated its broad applicability. 22 1. INTRODUCTION tion of probability. Consider the example of polynomial curve ﬁtting discussed in Section 1.1. It seems reasonable to apply the frequentist notion of probability to the random values of the observed variables tn . However, we would like to address and quantify the uncertainty that surrounds the appropriate choice for the model parameters w. We shall see that, from a Bayesian perspective, we can use the machinery of probability theory to describe the uncertainty in model parameters such as w, or indeed in the choice of model itself. Bayes’ theorem now acquires a new signiﬁcance. Recall that in the boxes of fruit example, the observation of the identity of the fruit provided relevant information that altered the probability that the chosen box was the red one. In that example, Bayes’ theorem was used to convert a prior probability into a posterior probability by incorporating the evidence provided by the observed data. As we shall see in detail later, we can adopt a similar approach when making inferences about quantities such as the parameters w in the polynomial curve ﬁtting example. We capture our assumptions about w, before observing the data, in the form of a prior probability distribution p(w). The effect of the observed data D = {t1 , . . . , tN } is expressed through the conditional probability p(Dw), and we shall see later, in Section 1.2.5, how this can be represented explicitly. Bayes’ theorem, which takes the form p(wD) = p(Dw)p(w) p(D) (1.43) then allows us to evaluate the uncertainty in w after we have observed D in the form of the posterior probability p(wD). The quantity p(Dw) on the righthand side of Bayes’ theorem is evaluated for the observed data set D and can be viewed as a function of the parameter vector w, in which case it is called the likelihood function. It expresses how probable the observed data set is for different settings of the parameter vector w. Note that the likelihood is not a probability distribution over w, and its integral with respect to w does not (necessarily) equal one. Given this deﬁnition of likelihood, we can state Bayes’ theorem in words posterior ∝ likelihood × prior (1.44) where all of these quantities are viewed as functions of w. The denominator in (1.43) is the normalization constant, which ensures that the posterior distribution on the lefthand side is a valid probability density and integrates to one. Indeed, integrating both sides of (1.43) with respect to w, we can express the denominator in Bayes’ theorem in terms of the prior distribution and the likelihood function (1.45) p(D) = p(Dw)p(w) dw. In both the Bayesian and frequentist paradigms, the likelihood function p(Dw) plays a central role. However, the manner in which it is used is fundamentally different in the two approaches. In a frequentist setting, w is considered to be a ﬁxed parameter, whose value is determined by some form of ‘estimator’, and error bars 1.2. Probability Theory Section 2.1 Section 2.4.3 Section 1.3 23 on this estimate are obtained by considering the distribution of possible data sets D. By contrast, from the Bayesian viewpoint there is only a single data set D (namely the one that is actually observed), and the uncertainty in the parameters is expressed through a probability distribution over w. A widely used frequentist estimator is maximum likelihood, in which w is set to the value that maximizes the likelihood function p(Dw). This corresponds to choosing the value of w for which the probability of the observed data set is maximized. In the machine learning literature, the negative log of the likelihood function is called an error function. Because the negative logarithm is a monotonically decreasing function, maximizing the likelihood is equivalent to minimizing the error. One approach to determining frequentist error bars is the bootstrap (Efron, 1979; Hastie et al., 2001), in which multiple data sets are created as follows. Suppose our original data set consists of N data points X = {x1 , . . . , xN }. We can create a new data set XB by drawing N points at random from X, with replacement, so that some points in X may be replicated in XB , whereas other points in X may be absent from XB . This process can be repeated L times to generate L data sets each of size N and each obtained by sampling from the original data set X. The statistical accuracy of parameter estimates can then be evaluated by looking at the variability of predictions between the different bootstrap data sets. One advantage of the Bayesian viewpoint is that the inclusion of prior knowledge arises naturally. Suppose, for instance, that a fairlooking coin is tossed three times and lands heads each time. A classical maximum likelihood estimate of the probability of landing heads would give 1, implying that all future tosses will land heads! By contrast, a Bayesian approach with any reasonable prior will lead to a much less extreme conclusion. There has been much controversy and debate associated with the relative merits of the frequentist and Bayesian paradigms, which have not been helped by the fact that there is no unique frequentist, or even Bayesian, viewpoint. For instance, one common criticism of the Bayesian approach is that the prior distribution is often selected on the basis of mathematical convenience rather than as a reﬂection of any prior beliefs. Even the subjective nature of the conclusions through their dependence on the choice of prior is seen by some as a source of difﬁculty. Reducing the dependence on the prior is one motivation for socalled noninformative priors. However, these lead to difﬁculties when comparing different models, and indeed Bayesian methods based on poor choices of prior can give poor results with high conﬁdence. Frequentist evaluation methods offer some protection from such problems, and techniques such as crossvalidation remain useful in areas such as model comparison. This book places a strong emphasis on the Bayesian viewpoint, reﬂecting the huge growth in the practical importance of Bayesian methods in the past few years, while also discussing useful frequentist concepts as required. Although the Bayesian framework has its origins in the 18th century, the practical application of Bayesian methods was for a long time severely limited by the difﬁculties in carrying through the full Bayesian procedure, particularly the need to marginalize (sum or integrate) over the whole of parameter space, which, as we shall 24 1. INTRODUCTION see, is required in order to make predictions or to compare different models. The development of sampling methods, such as Markov chain Monte Carlo (discussed in Chapter 11) along with dramatic improvements in the speed and memory capacity of computers, opened the door to the practical use of Bayesian techniques in an impressive range of problem domains. Monte Carlo methods are very ﬂexible and can be applied to a wide range of models. However, they are computationally intensive and have mainly been used for smallscale problems. More recently, highly efﬁcient deterministic approximation schemes such as variational Bayes and expectation propagation (discussed in Chapter 10) have been developed. These offer a complementary alternative to sampling methods and have allowed Bayesian techniques to be used in largescale applications (Blei et al., 2003). 1.2.4 The Gaussian distribution We shall devote the whole of Chapter 2 to a study of various probability distributions and their key properties. It is convenient, however, to introduce here one of the most important probability distributions for continuous variables, called the normal or Gaussian distribution. We shall make extensive use of this distribution in the remainder of this chapter and indeed throughout much of the book. For the case of a single realvalued variable x, the Gaussian distribution is deﬁned by 1 1 exp − 2 (x − µ)2 (1.46) N xµ, σ 2 = 2 1 / 2 2σ (2πσ ) which is governed by two parameters: µ, called the mean, and σ 2 , called the variance. The square root of the variance, given by σ, is called the standard deviation, and the reciprocal of the variance, written as β = 1/σ 2 , is called the precision. We shall see the motivation for these terms shortly. Figure 1.13 shows a plot of the Gaussian distribution. From the form of (1.46) we see that the Gaussian distribution satisﬁes N (xµ, σ 2 ) > 0. Exercise 1.7 (1.47) Also it is straightforward to show that the Gaussian is normalized, so that PierreSimon Laplace 1749–1827 It is said that Laplace was seriously lacking in modesty and at one point declared himself to be the best mathematician in France at the time, a claim that was arguably true. As well as being proliﬁc in mathematics, he also made numerous contributions to astronomy, including the nebular hypothesis by which the earth is thought to have formed from the condensation and cooling of a large rotating disk of gas and dust. In 1812 he published the ﬁrst edition of Théorie Analytique des Probabilités, in which Laplace states that “probability theory is nothing but common sense reduced to calculation”. This work included a discussion of the inverse probability calculation (later termed Bayes’ theorem by Poincaré), which he used to solve problems in life expectancy, jurisprudence, planetary masses, triangulation, and error estimation. 1.2. Probability Theory Figure 1.13 Plot of the univariate Gaussian showing the mean µ and the standard deviation σ. 25 N (xµ, σ 2 ) 2σ µ ∞ −∞ Exercise 1.8 N xµ, σ 2 dx = 1. x (1.48) Thus (1.46) satisﬁes the two requirements for a valid probability density. We can readily ﬁnd expectations of functions of x under the Gaussian distribution. In particular, the average value of x is given by ∞ N xµ, σ 2 x dx = µ. (1.49) E[x] = −∞ Because the parameter µ represents the average value of x under the distribution, it is referred to as the mean. Similarly, for the second order moment ∞ 2 N xµ, σ 2 x2 dx = µ2 + σ 2 . (1.50) E[x ] = −∞ From (1.49) and (1.50), it follows that the variance of x is given by var[x] = E[x2 ] − E[x]2 = σ 2 Exercise 1.9 (1.51) and hence σ 2 is referred to as the variance parameter. The maximum of a distribution is known as its mode. For a Gaussian, the mode coincides with the mean. We are also interested in the Gaussian distribution deﬁned over a Ddimensional vector x of continuous variables, which is given by N (xµ, Σ) = 1 1 1 exp − (x − µ)T Σ−1 (x − µ) D/ 2 1 / 2 2 (2π) Σ (1.52) where the Ddimensional vector µ is called the mean, the D × D matrix Σ is called the covariance, and Σ denotes the determinant of Σ. We shall make use of the multivariate Gaussian distribution brieﬂy in this chapter, although its properties will be studied in detail in Section 2.3. 26 1. INTRODUCTION Figure 1.14 Illustration of the likelihood function for a Gaussian distribution, shown by the red curve. Here the black points de p(x) note a data set of values {xn }, and the likelihood function given by (1.53) corresponds to the product of the blue values. Maximizing the likelihood involves adjusting the mean and variance of the Gaussian so as to maximize this product. N (xn µ, σ 2 ) xn x Now suppose that we have a data set of observations x = (x1 , . . . , xN )T , representing N observations of the scalar variable x. Note that we are using the typeface x to distinguish this from a single observation of the vectorvalued variable (x1 , . . . , xD )T , which we denote by x. We shall suppose that the observations are drawn independently from a Gaussian distribution whose mean µ and variance σ 2 are unknown, and we would like to determine these parameters from the data set. Data points that are drawn independently from the same distribution are said to be independent and identically distributed, which is often abbreviated to i.i.d. We have seen that the joint probability of two independent events is given by the product of the marginal probabilities for each event separately. Because our data set x is i.i.d., we can therefore write the probability of the data set, given µ and σ 2 , in the form 2 p(xµ, σ ) = N N xn µ, σ 2 . (1.53) n=1 Section 1.2.5 When viewed as a function of µ and σ 2 , this is the likelihood function for the Gaussian and is interpreted diagrammatically in Figure 1.14. One common criterion for determining the parameters in a probability distribution using an observed data set is to ﬁnd the parameter values that maximize the likelihood function. This might seem like a strange criterion because, from our foregoing discussion of probability theory, it would seem more natural to maximize the probability of the parameters given the data, not the probability of the data given the parameters. In fact, these two criteria are related, as we shall discuss in the context of curve ﬁtting. For the moment, however, we shall determine values for the unknown parameters µ and σ 2 in the Gaussian by maximizing the likelihood function (1.53). In practice, it is more convenient to maximize the log of the likelihood function. Because the logarithm is a monotonically increasing function of its argument, maximization of the log of a function is equivalent to maximization of the function itself. Taking the log not only simpliﬁes the subsequent mathematical analysis, but it also helps numerically because the product of a large number of small probabilities can easily underﬂow the numerical precision of the computer, and this is resolved by computing instead the sum of the log probabilities. From (1.46) and (1.53), the log likelihood 1.2. Probability Theory 27 function can be written in the form ln p xµ, σ 2 = − N 1 N N ln σ 2 − ln(2π). (xn − µ)2 − 2 2σ 2 2 (1.54) n=1 Exercise 1.11 Maximizing (1.54) with respect to µ, we obtain the maximum likelihood solution given by N 1 µML = xn (1.55) N n=1 which is the sample mean, i.e., the mean of the observed values {xn }. Similarly, maximizing (1.54) with respect to σ 2 , we obtain the maximum likelihood solution for the variance in the form 2 = σML N 1 (xn − µML )2 N (1.56) n=1 Section 1.1 Exercise 1.12 which is the sample variance measured with respect to the sample mean µML . Note that we are performing a joint maximization of (1.54) with respect to µ and σ 2 , but in the case of the Gaussian distribution the solution for µ decouples from that for σ 2 so that we can ﬁrst evaluate (1.55) and then subsequently use this result to evaluate (1.56). Later in this chapter, and also in subsequent chapters, we shall highlight the signiﬁcant limitations of the maximum likelihood approach. Here we give an indication of the problem in the context of our solutions for the maximum likelihood parameter settings for the univariate Gaussian distribution. In particular, we shall show that the maximum likelihood approach systematically underestimates the variance of the distribution. This is an example of a phenomenon called bias and is related to the problem of overﬁtting encountered in the context of polynomial curve ﬁtting. 2 are functions of We ﬁrst note that the maximum likelihood solutions µML and σML the data set values x1 , . . . , xN . Consider the expectations of these quantities with respect to the data set values, which themselves come from a Gaussian distribution with parameters µ and σ 2 . It is straightforward to show that E[µML ] = µ N −1 2 σ2 E[σML ] = N (1.57) (1.58) so that on average the maximum likelihood estimate will obtain the correct mean but will underestimate the true variance by a factor (N − 1)/N . The intuition behind this result is given by Figure 1.15. From (1.58) it follows that the following estimate for the variance parameter is unbiased N N 1 2 σML σ 2 = = (xn − µML )2 . (1.59) N −1 N −1 n=1 28 1. INTRODUCTION Figure 1.15 Illustration of how bias arises in using maximum likelihood to determine the variance of a Gaussian. The green curve shows the true Gaussian distribution from which data is generated, and the three red curves show the Gaussian distributions obtained by ﬁtting to three data sets, each consisting of two data points shown in blue, using the maximum likelihood results (1.55) and (1.56). Averaged across the three data sets, the mean is correct, but the variance is systematically underestimated because it is measured relative to the sample mean and not relative to the true mean. (a) (b) (c) In Section 10.1.3, we shall see how this result arises automatically when we adopt a Bayesian approach. Note that the bias of the maximum likelihood solution becomes less signiﬁcant as the number N of data points increases, and in the limit N → ∞ the maximum likelihood solution for the variance equals the true variance of the distribution that generated the data. In practice, for anything other than small N , this bias will not prove to be a serious problem. However, throughout this book we shall be interested in more complex models with many parameters, for which the bias problems associated with maximum likelihood will be much more severe. In fact, as we shall see, the issue of bias in maximum likelihood lies at the root of the overﬁtting problem that we encountered earlier in the context of polynomial curve ﬁtting. 1.2.5 Curve ﬁtting revisited Section 1.1 We have seen how the problem of polynomial curve ﬁtting can be expressed in terms of error minimization. Here we return to the curve ﬁtting example and view it from a probabilistic perspective, thereby gaining some insights into error functions and regularization, as well as taking us towards a full Bayesian treatment. The goal in the curve ﬁtting problem is to be able to make predictions for the target variable t given some new value of the input variable x on the basis of a set of training data comprising N input values x = (x1 , . . . , xN )T and their corresponding target values t = (t1 , . . . , tN )T . We can express our uncertainty over the value of the target variable using a probability distribution. For this purpose, we shall assume that, given the value of x, the corresponding value of t has a Gaussian distribution with a mean equal to the value y(x, w) of the polynomial curve given by (1.1). Thus we have (1.60) p(tx, w, β) = N ty(x, w), β −1 where, for consistency with the notation in later chapters, we have deﬁned a precision parameter β corresponding to the inverse variance of the distribution. This is illustrated schematically in Figure 1.16. 29 1.2. Probability Theory Figure 1.16 Schematic illustration of a Gaussian conditional distribution for t given x given by (1.60), in which the mean is given by the polynomial function y(x, w), and the precision is given by the parameter β, which is related to the variance by β −1 = σ 2 . t y(x, w) y(x0 , w) 2σ p(tx0 , w, β) x0 x We now use the training data {x, t} to determine the values of the unknown parameters w and β by maximum likelihood. If the data are assumed to be drawn independently from the distribution (1.60), then the likelihood function is given by p(tx, w, β) = N N tn y(xn , w), β −1 . (1.61) n=1 As we did in the case of the simple Gaussian distribution earlier, it is convenient to maximize the logarithm of the likelihood function. Substituting for the form of the Gaussian distribution, given by (1.46), we obtain the log likelihood function in the form ln p(tx, w, β) = − N N N β 2 ln β − ln(2π). {y(xn , w) − tn } + 2 2 2 (1.62) n=1 Consider ﬁrst the determination of the maximum likelihood solution for the polynomial coefﬁcients, which will be denoted by wML . These are determined by maximizing (1.62) with respect to w. For this purpose, we can omit the last two terms on the righthand side of (1.62) because they do not depend on w. Also, we note that scaling the log likelihood by a positive constant coefﬁcient does not alter the location of the maximum with respect to w, and so we can replace the coefﬁcie