Foundations of Machine Learning
Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar
Year: 2018
Edition: 2
Publisher: The MIT Press
Language: English
Pages: 505
ISBN: 0262039400; 9780262039406
Series: Adaptive Computation and Machine Learning
File: PDF, 8.30 MB
Foundations of Machine Learning, second edition
Adaptive Computation and Machine Learning series
Francis Bach, Editor

A complete list of books published in the Adaptive Computation and Machine Learning series appears at the back of this book.

Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar
The MIT Press, Cambridge, Massachusetts; London, England

© 2018 Massachusetts Institute of Technology. All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher. This book was set in LaTeX by the authors. Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data
Names: Mohri, Mehryar, author. | Rostamizadeh, Afshin, author. | Talwalkar, Ameet, author.
Title: Foundations of machine learning / Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar.
Description: Second edition. | Cambridge, MA : The MIT Press, [2018] | Series: Adaptive computation and machine learning series | Includes bibliographical references and index.
Identifiers: LCCN 2018022812 | ISBN 9780262039406 (hardcover : alk. paper)
Subjects: LCSH: Machine learning. | Computer algorithms.
Classification: LCC Q325.5 .M64 2018 | DDC 006.3/1
LC record available at https://lccn.loc.gov/2018022812

Contents

Preface
1 Introduction
  1.1 What is machine learning?
  1.2 What kind of problems can be tackled using machine learning?
  1.3 Some standard learning tasks
  1.4 Learning stages
  1.5 Learning scenarios
  1.6 Generalization
2 The PAC Learning Framework
  2.1 The PAC learning model
  2.2 Guarantees for finite hypothesis sets — consistent case
  2.3 Guarantees for finite hypothesis sets — inconsistent case
  2.4 Generalities (Deterministic versus stochastic scenarios; Bayes error and noise)
  2.5 Chapter notes
  2.6 Exercises
3 Rademacher Complexity and VC-Dimension
  3.1 Rademacher complexity
  3.2 Growth function
  3.3 VC-dimension
  3.4 Lower bounds
  3.5 Chapter notes
  3.6 Exercises
4 Model Selection
  4.1 Estimation and approximation errors
  4.2 Empirical risk minimization (ERM)
  4.3 Structural risk minimization (SRM)
  4.4 Cross-validation
  4.5 n-Fold cross-validation
  4.6 Regularization-based algorithms
  4.7 Convex surrogate losses
  4.8 Chapter notes
  4.9 Exercises
5 Support Vector Machines
  5.1 Linear classification
  5.2 Separable case (Primal optimization problem; Support vectors; Dual optimization problem; Leave-one-out analysis)
  5.3 Non-separable case (Primal optimization problem; Support vectors; Dual optimization problem)
  5.4 Margin theory
  5.5 Chapter notes
  5.6 Exercises
6 Kernel Methods
  6.1 Introduction
  6.2 Positive definite symmetric kernels (Definitions; Reproducing kernel Hilbert space; Properties)
  6.3 Kernel-based algorithms (SVMs with PDS kernels; Representer theorem; Learning guarantees)
  6.4 Negative definite symmetric kernels
  6.5 Sequence kernels (Weighted transducers; Rational kernels)
  6.6 Approximate kernel feature maps
  6.7 Chapter notes
  6.8 Exercises
7 Boosting
  7.1 Introduction
  7.2 AdaBoost (Bound on the empirical error; Relationship with coordinate descent; Practical use)
  7.3 Theoretical results (VC-dimension-based analysis; L1-geometric margin; Margin-based analysis; Margin maximization; Game-theoretic interpretation)
  7.4 L1-regularization
  7.5 Discussion
  7.6 Chapter notes
  7.7 Exercises
8 On-Line Learning
  8.1 Introduction
  8.2 Prediction with expert advice (Mistake bounds and Halving algorithm; Weighted majority algorithm; Randomized weighted majority algorithm; Exponential weighted average algorithm)
  8.3 Linear classification (Perceptron algorithm; Winnow algorithm)
  8.4 On-line to batch conversion
  8.5 Game-theoretic connection
  8.6 Chapter notes
  8.7 Exercises
9 Multi-Class Classification
  9.1 Multi-class classification problem
  9.2 Generalization bounds
  9.3 Uncombined multi-class algorithms (Multi-class SVMs; Multi-class boosting algorithms; Decision trees)
  9.4 Aggregated multi-class algorithms (One-versus-all; One-versus-one; Error-correcting output codes)
  9.5 Structured prediction algorithms
  9.6 Chapter notes
  9.7 Exercises
10 Ranking
  10.1 The problem of ranking
  10.2 Generalization bound
  10.3 Ranking with SVMs
  10.4 RankBoost (Bound on the empirical error; Relationship with coordinate descent; Margin bound for ensemble methods in ranking)
  10.5 Bipartite ranking (Boosting in bipartite ranking; Area under the ROC curve)
  10.6 Preference-based setting (Second-stage ranking problem; Deterministic algorithm; Randomized algorithm; Extension to other loss functions)
  10.7 Other ranking criteria
  10.8 Chapter notes
  10.9 Exercises
11 Regression
  11.1 The problem of regression
  11.2 Generalization bounds (Finite hypothesis sets; Rademacher complexity bounds; Pseudo-dimension bounds)
  11.3 Regression algorithms (Linear regression; Kernel ridge regression; Support vector regression; Lasso; Group norm regression algorithms; On-line regression algorithms)
  11.4 Chapter notes
  11.5 Exercises
12 Maximum Entropy Models
  12.1 Density estimation problem (Maximum Likelihood (ML) solution; Maximum a Posteriori (MAP) solution)
  12.2 Density estimation problem augmented with features
  12.3 Maxent principle
  12.4 Maxent models
  12.5 Dual problem
  12.6 Generalization bound
  12.7 Coordinate descent algorithm
  12.8 Extensions
  12.9 L2-regularization
  12.10 Chapter notes
  12.11 Exercises
13 Conditional Maximum Entropy Models
  13.1 Learning problem
  13.2 Conditional Maxent principle
  13.3 Conditional Maxent models
  13.4 Dual problem
  13.5 Properties (Optimization problem; Feature vectors; Prediction)
  13.6 Generalization bounds
  13.7 Logistic regression (Optimization problem; Logistic model)
  13.8 L2-regularization
  13.9 Proof of the duality theorem
  13.10 Chapter notes
  13.11 Exercises
14 Algorithmic Stability
  14.1 Definitions
  14.2 Stability-based generalization guarantee
  14.3 Stability of kernel-based regularization algorithms (Application to regression algorithms: SVR and KRR; Application to classification algorithms: SVMs; Discussion)
  14.4 Chapter notes
  14.5 Exercises
15 Dimensionality Reduction
  15.1 Principal component analysis
  15.2 Kernel principal component analysis (KPCA)
  15.3 KPCA and manifold learning (Isomap; Laplacian eigenmaps; Locally linear embedding (LLE))
  15.4 Johnson-Lindenstrauss lemma
  15.5 Chapter notes
  15.6 Exercises
16 Learning Automata and Languages
  16.1 Introduction
  16.2 Finite automata
  16.3 Efficient exact learning (Passive learning; Learning with queries; Learning automata with queries)
  16.4 Identification in the limit (Learning reversible automata)
  16.5 Chapter notes
  16.6 Exercises
17 Reinforcement Learning
  17.1 Learning scenario
  17.2 Markov decision process model
  17.3 Policy (Definition; Policy value; Optimal policies; Policy evaluation)
  17.4 Planning algorithms (Value iteration; Policy iteration; Linear programming)
  17.5 Learning algorithms (Stochastic approximation; TD(0) algorithm; Q-learning algorithm; SARSA; TD(λ) algorithm; Large state space)
  17.6 Chapter notes
Conclusion
A Linear Algebra Review
  A.1 Vectors and norms (Norms; Dual norms; Relationship between norms)
  A.2 Matrices (Matrix norms; Singular value decomposition; Symmetric positive semidefinite (SPSD) matrices)
B Convex Optimization
  B.1 Differentiation and unconstrained optimization
  B.2 Convexity
  B.3 Constrained optimization
  B.4 Fenchel duality (Subgradients; Core; Conjugate functions)
  B.5 Chapter notes
  B.6 Exercises
C Probability Review
  C.1 Probability
  C.2 Random variables
  C.3 Conditional probability and independence
  C.4 Expectation and Markov's inequality
  C.5 Variance and Chebyshev's inequality
  C.6 Moment-generating functions
  C.7 Exercises
D Concentration Inequalities
  D.1 Hoeffding's inequality
  D.2 Sanov's theorem
  D.3 Multiplicative Chernoff bounds
  D.4 Binomial distribution tails: Upper bounds
  D.5 Binomial distribution tails: Lower bound
  D.6 Azuma's inequality
  D.7 McDiarmid's inequality
  D.8 Normal distribution tails: Lower bound
  D.9 Khintchine-Kahane inequality
  D.10 Maximal inequality
  D.11 Chapter notes
  D.12 Exercises
E Notions of Information Theory
  E.1 Entropy
  E.2 Relative entropy
  E.3 Mutual information
  E.4 Bregman divergences
  E.5 Chapter notes
  E.6 Exercises
F Notation
Bibliography
Index

Preface

This book is a general introduction to machine learning that can serve as a reference book for researchers and a textbook for students. It covers fundamental modern topics in machine learning while providing the theoretical basis and conceptual tools needed for the discussion and justification of algorithms. It also describes several key aspects of the application of these algorithms.

We have aimed to present the most novel theoretical tools and concepts while giving concise proofs, even for relatively advanced results. In general, whenever possible, we have chosen to favor succinctness. Nevertheless, we discuss some crucial complex topics arising in machine learning and highlight several open research questions. Certain topics often merged with others or treated with insufficient attention are discussed separately here and with more emphasis: for example, a different chapter is reserved for each of multi-class classification, ranking, and regression.

Although we cover a very wide variety of important topics in machine learning, we have chosen to omit a few important ones, including graphical models and neural networks, both for the sake of brevity and because of the current lack of solid theoretical guarantees for some methods.

The book is intended for students and researchers in machine learning, statistics and other related areas. It can be used as a textbook for both graduate and advanced undergraduate classes in machine learning or as a reference text for a research seminar. The first three or four chapters of the book lay the theoretical foundation for the subsequent material.
Other chapters are mostly self-contained, with the exception of chapter 6, which introduces some concepts that are extensively used in later ones, and chapter 13, which is closely related to chapter 12. Each chapter concludes with a series of exercises, with full solutions presented separately. The reader is assumed to be familiar with basic concepts in linear algebra, probability, and analysis of algorithms. However, to further help, we have included an extensive appendix presenting a concise review of linear algebra, an introduction to convex optimization, a brief probability review, a collection of concentration inequalities useful to the analyses and discussions in this book, and a short introduction to information theory.

Our goal has been to give a unified presentation of multiple topics and areas, as opposed to the more specialized presentation adopted by some books, which favor a particular viewpoint, such as a Bayesian view, or a particular topic, such as kernel methods. The theoretical foundation of this book and its deliberate emphasis on proofs and analysis also make it very distinct from many other presentations.

In this second edition, we have updated the entire book. The changes include a different writing style in most chapters, new figures and illustrations, many simplifications, some additions to existing chapters, in particular chapter 6 and chapter 17, and several new chapters. We have added a full chapter on model selection (chapter 4), an important topic that was only briefly discussed in the previous edition. We have also added a new chapter on Maximum Entropy models (chapter 12) and a new chapter on Conditional Maximum Entropy models (chapter 13), which are both essential topics in machine learning. We have also significantly changed the appendix.
In particular, we have added a full section on Fenchel duality to appendix B on convex optimization, made a number of changes and additions to appendix D dealing with concentration inequalities, added appendix E on information theory, and updated most of the material. Additionally, we have included a number of new exercises and their solutions for existing and new chapters.

Most of the material presented here takes its origins in a machine learning graduate course (Foundations of Machine Learning) taught by the first author at the Courant Institute of Mathematical Sciences at New York University over the last fourteen years. This book has considerably benefited from the comments and suggestions of students in these classes, along with those of many friends, colleagues and researchers to whom we are deeply indebted.

We are particularly grateful to Corinna Cortes and Yishay Mansour, who made a number of key suggestions for the design and organization of the material presented in the first edition, with detailed comments that we have fully taken into account and that have greatly improved the presentation. We are also grateful to Yishay Mansour for using a preliminary version of the first edition of the book for teaching, and for reporting his feedback to us.

We also thank, for discussions, suggested improvements, and contributions of many kinds, the following colleagues and friends from academic and corporate research laboratories: Jacob Abernethy, Cyril Allauzen, Kareem Amin, Stephen Boyd, Aldo Corbisiero, Giulia DeSalvo, Claudio Gentile, Spencer Greenberg, Lisa Hellerstein, Sanjiv Kumar, Vitaly Kuznetsov, Ryan McDonald, Andrès Muñoz Medina, Tyler Neylon, Peter Norvig, Fernando Pereira, Maria Pershina, Borja de Balle Pigem, Ashish Rastogi, Michael Riley, Dmitry Storcheus, Ananda Theertha Suresh, Umar Syed, Csaba Szepesvári, Toshiyuki Tanaka, Eugene Weinstein, Jason Weston, Scott Yang, and Ningshan Zhang.
Finally, we thank the MIT Press publication team for their help and support in the development of this text.

1 Introduction

This chapter presents a preliminary introduction to machine learning, including an overview of some key learning tasks and applications, basic definitions and terminology, and a discussion of some general scenarios.

1.1 What is machine learning?

Machine learning can be broadly defined as computational methods using experience to improve performance or to make accurate predictions. Here, experience refers to the past information available to the learner, which typically takes the form of electronic data collected and made available for analysis. This data could be in the form of digitized human-labeled training sets, or other types of information obtained via interaction with the environment. In all cases, its quality and size are crucial to the success of the predictions made by the learner.

An example of a learning problem is how to use a finite sample of randomly selected documents, each labeled with a topic, to accurately predict the topic of unseen documents. Clearly, the larger the sample, the easier the task. But the difficulty of the task also depends on the quality of the labels assigned to the documents in the sample, since the labels may not all be correct, and on the number of possible topics.

Machine learning consists of designing efficient and accurate prediction algorithms. As in other areas of computer science, some critical measures of the quality of these algorithms are their time and space complexity. But in machine learning, we additionally need a notion of sample complexity to evaluate the sample size required for the algorithm to learn a family of concepts. More generally, theoretical learning guarantees for an algorithm depend on the complexity of the concept classes considered and the size of the training sample.
Since the success of a learning algorithm depends on the data used, machine learning is inherently related to data analysis and statistics. More generally, learning techniques are data-driven methods combining fundamental concepts in computer science with ideas from statistics, probability and optimization.

1.2 What kind of problems can be tackled using machine learning?

Predicting the label of a document, also known as document classification, is by no means the only learning task. Machine learning admits a very broad set of practical applications, which include the following:

• Text or document classification. This includes problems such as assigning a topic to a text or a document, or determining automatically whether the content of a web page is inappropriate or too explicit; it also includes spam detection.

• Natural language processing (NLP). Most tasks in this field, including part-of-speech tagging, named-entity recognition, context-free parsing, or dependency parsing, are cast as learning problems. In these problems, predictions admit some structure. For example, in part-of-speech tagging, the prediction for a sentence is a sequence of part-of-speech tags labeling each word. In context-free parsing, the prediction is a tree. These are instances of richer learning problems known as structured prediction problems.

• Speech processing applications. This includes speech recognition, speech synthesis, speaker verification, speaker identification, as well as subproblems such as language modeling and acoustic modeling.

• Computer vision applications. This includes object recognition, object identification, face detection, optical character recognition (OCR), content-based image retrieval, or pose estimation.

• Computational biology applications. This includes protein function prediction, identification of key sites, or the analysis of gene and protein networks.
• Many other problems, such as fraud detection for credit card, telephone or insurance companies, network intrusion detection, learning to play games such as chess, backgammon, or Go, unassisted control of vehicles such as robots or cars, medical diagnosis, the design of recommendation systems, search engines, or information extraction systems, are tackled using machine learning techniques.

This list is by no means comprehensive. Most prediction problems found in practice can be cast as learning problems, and the practical application area of machine learning keeps expanding. The algorithms and techniques discussed in this book can be used to derive solutions for all of these problems, though we will not discuss these applications in detail.

1.3 Some standard learning tasks

The following are some standard machine learning tasks that have been extensively studied:

• Classification: this is the problem of assigning a category to each item. For example, document classification consists of assigning a category such as politics, business, sports, or weather to each document, while image classification consists of assigning to each image a category such as car, train, or plane. The number of categories in such tasks is often fewer than a few hundred, but it can be much larger in some difficult tasks and even unbounded, as in OCR, text classification, or speech recognition.

• Regression: this is the problem of predicting a real value for each item. Examples of regression include the prediction of stock values or of variations of economic variables. In regression, the penalty for an incorrect prediction depends on the magnitude of the difference between the true and predicted values, in contrast with the classification problem, where there is typically no notion of closeness between the various categories.

• Ranking: this is the problem of learning to order items according to some criterion.
Web search, e.g., returning web pages relevant to a search query, is the canonical ranking example. Many other similar ranking problems arise in the context of the design of information extraction or natural language processing systems.

• Clustering: this is the problem of partitioning a set of items into homogeneous subsets. Clustering is often used to analyze very large data sets. For example, in the context of social network analysis, clustering algorithms attempt to identify natural communities within large groups of people.

• Dimensionality reduction or manifold learning: this problem consists of transforming an initial representation of items into a lower-dimensional representation while preserving some properties of the initial representation. A common example involves preprocessing digital images in computer vision tasks.

The main practical objectives of machine learning consist of generating accurate predictions for unseen items and of designing efficient and robust algorithms to produce these predictions, even for large-scale problems. To do so, a number of algorithmic and theoretical questions arise. Some fundamental questions include: Which concept families can actually be learned, and under what conditions? How well can these concepts be learned computationally?

1.4 Learning stages

Here, we will use the canonical problem of spam detection as a running example to illustrate some basic definitions and describe the use and evaluation of machine learning algorithms in practice, including their different stages. Spam detection is the problem of learning to automatically classify email messages as either spam or non-spam. The following is a list of definitions and terminology commonly used in machine learning:

• Examples: Items or instances of data used for learning or evaluation. In our spam problem, these examples correspond to the collection of email messages we will use for learning and testing.
• Features: The set of attributes, often represented as a vector, associated with an example. In the case of email messages, some relevant features may include the length of the message, the name of the sender, various characteristics of the header, the presence of certain keywords in the body of the message, and so on.

• Labels: Values or categories assigned to examples. In classification problems, examples are assigned specific categories, for instance, the spam and non-spam categories in our binary classification problem. In regression, items are assigned real-valued labels.

• Hyperparameters: Free parameters that are not determined by the learning algorithm, but rather specified as inputs to the learning algorithm.

• Training sample: Examples used to train a learning algorithm. In our spam problem, the training sample consists of a set of email examples along with their associated labels. The training sample varies for different learning scenarios, as described in section 1.5.

• Validation sample: Examples used to tune the parameters of a learning algorithm when working with labeled data. The validation sample is used to select appropriate values for the learning algorithm's free parameters (hyperparameters).

• Test sample: Examples used to evaluate the performance of a learning algorithm. The test sample is separate from the training and validation data and is not made available in the learning stage. In the spam problem, the test sample consists of a collection of email examples for which the learning algorithm must predict labels based on features. These predictions are then compared with the labels of the test sample to measure the performance of the algorithm.

• Loss function: A function that measures the difference, or loss, between a predicted label and a true label. Denoting the set of all labels as Y and the set of possible predictions as Y′, a loss function L is a mapping L : Y × Y′ → R+.
In most cases, Y′ = Y and the loss function is bounded, but these conditions do not always hold. Common examples of loss functions include the zero-one (or misclassification) loss defined over {−1, +1} × {−1, +1} by L(y, y′) = 1_{y′ ≠ y}, and the squared loss defined over I × I by L(y, y′) = (y′ − y)^2, where I ⊆ R is typically a bounded interval.

[Figure 1.1: Illustration of the typical stages of a learning process. Labeled data is partitioned into a training sample, validation data, and a test sample; features and prior knowledge feed the algorithm A(Θ); parameter selection on the validation data yields A(Θ0), which is then evaluated on the test sample.]

• Hypothesis set: A set of functions mapping features (feature vectors) to the set of labels Y.
In our example, these may be a set of functions mapping email features to Y = {spam, non-spam}. More generally, hypotheses may be functions mapping features to a different set Y′. They could be linear functions mapping email feature vectors to real numbers interpreted as scores (Y′ = R), with higher score values more indicative of spam than lower ones.

We now define the learning stages of our spam problem (see figure 1.1). We start with a given collection of labeled examples. We first randomly partition the data into a training sample, a validation sample, and a test sample. The size of each of these samples depends on a number of different considerations. For example, the amount of data reserved for validation depends on the number of hyperparameters of the algorithm, which are represented here by the vector Θ. Also, when the labeled sample is relatively small, the amount of training data is often chosen to be larger than that of the test data, since the learning performance directly depends on the training sample.

Next, we associate relevant features to the examples. This is a critical step in the design of machine learning solutions. Useful features can effectively guide the learning algorithm, while poor or uninformative ones can be misleading. Although it is critical, the choice of the features is, to a large extent, left to the user. This choice reflects the user's prior knowledge about the learning task, which in practice can have a dramatic effect on the performance results.

Now, we use the features selected to train our learning algorithm A by tuning the values of its free parameters Θ (also called hyperparameters). For each value of these parameters, the algorithm selects a different hypothesis out of the hypothesis set. We choose the one resulting in the best performance on the validation sample (Θ₀). Finally, using that hypothesis, we predict the labels of the examples in the test sample.
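The stages just described can be sketched in code. The following is a minimal illustration, not the book's method: the data, the single score feature, and the threshold-based "learning algorithm" are all hypothetical stand-ins, with Θ reduced to a single threshold hyperparameter tuned on the validation sample.

```python
import random

random.seed(0)

# Hypothetical labeled examples: one score feature x per email, label 1 = spam.
labels = [random.random() < 0.5 for _ in range(300)]
data = [(random.gauss(2.0 if y else 0.0, 1.0), int(y)) for y in labels]

# Randomly partition the data into training, validation, and test samples.
random.shuffle(data)
train, valid, test = data[:180], data[180:240], data[240:]

def algorithm_A(train_sample, theta):
    """Stand-in learner: returns the hypothesis 'spam iff score > theta'.
    A real algorithm would fit its hypothesis to the training sample."""
    return lambda x: int(x > theta)

def zero_one_error(h, sample):
    return sum(h(x) != y for x, y in sample) / len(sample)

# Tune the free parameter Theta using the validation sample only.
thetas = [t / 10 for t in range(-10, 31)]
best_theta = min(thetas, key=lambda t: zero_one_error(algorithm_A(train, t), valid))

# Final evaluation uses the held-out test sample, never the training error.
h_best = algorithm_A(train, best_theta)
test_error = zero_one_error(h_best, test)
print(best_theta, test_error)
```

The point of the sketch is the data flow: the validation sample selects among hypotheses, and the test sample is touched exactly once, for the final error estimate.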
The performance of the algorithm is evaluated by using the loss function associated with the task, e.g., the zero-one loss in our spam detection task, to compare the predicted and true labels. Thus, the performance of an algorithm is of course evaluated based on its test error and not its error on the training sample.

1.5 Learning scenarios

We next briefly describe some common machine learning scenarios. These scenarios differ in the types of training data available to the learner, the order and method by which training data is received, and the test data used to evaluate the learning algorithm.

• Supervised learning: The learner receives a set of labeled examples as training data and makes predictions for all unseen points. This is the most common scenario associated with classification, regression, and ranking problems. The spam detection problem discussed in the previous section is an instance of supervised learning.

• Unsupervised learning: The learner exclusively receives unlabeled training data, and makes predictions for all unseen points. Since in general no labeled example is available in that setting, it can be difficult to quantitatively evaluate the performance of a learner. Clustering and dimensionality reduction are examples of unsupervised learning problems.

• Semi-supervised learning: The learner receives a training sample consisting of both labeled and unlabeled data, and makes predictions for all unseen points. Semi-supervised learning is common in settings where unlabeled data is easily accessible but labels are expensive to obtain. Various types of problems arising in applications, including classification, regression, or ranking tasks, can be framed as instances of semi-supervised learning. The hope is that the distribution of unlabeled data accessible to the learner can help it achieve a better performance than in the supervised setting.
The analysis of the conditions under which this can indeed be realized is the topic of much modern theoretical and applied machine learning research.

• Transductive inference: As in the semi-supervised scenario, the learner receives a labeled training sample along with a set of unlabeled test points. However, the objective of transductive inference is to predict labels only for these particular test points. Transductive inference appears to be an easier task and matches the scenario encountered in a variety of modern applications. However, as in the semi-supervised setting, the assumptions under which a better performance can be achieved in this setting are research questions that have not been fully resolved.

• Online learning: In contrast with the previous scenarios, the online scenario involves multiple rounds where training and testing phases are intermixed. At each round, the learner receives an unlabeled training point, makes a prediction, receives the true label, and incurs a loss. The objective in the online setting is to minimize the cumulative loss over all rounds or to minimize the regret, that is, the difference between the cumulative loss incurred and that of the best expert in hindsight. Unlike the previous settings just discussed, no distributional assumption is made in online learning. In fact, instances and their labels may be chosen adversarially within this scenario.

• Reinforcement learning: The training and testing phases are also intermixed in reinforcement learning. To collect information, the learner actively interacts with the environment and in some cases affects the environment, and receives an immediate reward for each action. The objective of the learner is to maximize its reward over a course of actions and interactions with the environment.
However, no long-term reward feedback is provided by the environment, and the learner is faced with the exploration versus exploitation dilemma, since it must choose between exploring unknown actions to gain more information and exploiting the information already collected.

• Active learning: The learner adaptively or interactively collects training examples, typically by querying an oracle to request labels for new points. The goal in active learning is to achieve a performance comparable to the standard supervised learning scenario (or passive learning scenario), but with fewer labeled examples. Active learning is often used in applications where labels are expensive to obtain, for example computational biology applications.

In practice, many other intermediate and somewhat more complex learning scenarios may be encountered.

1.6 Generalization

Machine learning is fundamentally about generalization. As an example, the standard supervised learning scenario consists of using a finite sample of labeled examples to make accurate predictions about unseen examples. The problem is typically formulated as that of selecting a function out of a hypothesis set, that is, a subset of the family of all functions. The function selected is subsequently used to label all instances, including unseen examples.

How should a hypothesis set be chosen? With a rich or complex hypothesis set, the learner may choose a function or predictor that is consistent with the training sample, that is, one that commits no error on the training sample. With a less complex family, incurring some errors on the training sample may be unavoidable. But,

Figure 1.2 The zigzag line on the left panel is consistent over the blue and red training sample, but it is a complex separation surface that is not likely to generalize well to unseen data.
In contrast, the decision surface on the right panel is simpler and might generalize better in spite of its misclassification of a few points of the training sample.

which will lead to a better generalization? How should we define the complexity of a hypothesis set? Figure 1.2 illustrates these two types of solution: one is a zigzag line that perfectly separates the two populations of blue and red points and that is chosen from a complex family; the other one is a smoother line chosen from a simpler family that only imperfectly discriminates between the two sets.

We will see that, in general, the best predictor on the training sample may not be the best overall. A predictor chosen from a very complex family can essentially memorize the data, but generalization is distinct from the memorization of the training labels. We will see that the trade-off between the sample size and complexity plays a critical role in generalization. When the sample size is relatively small, choosing from too complex a family may lead to poor generalization, which is also known as overfitting. On the other hand, with too simple a family it may not be possible to achieve a sufficient accuracy, which is known as underfitting. In the next chapters, we will analyze the problem of generalization in more detail and will seek to derive theoretical guarantees for learning. This will depend on different notions of complexity that we will thoroughly discuss.

2 The PAC Learning Framework

Several fundamental questions arise when designing and analyzing algorithms that learn from examples: What can be learned efficiently? What is inherently hard to learn? How many examples are needed to learn successfully? Is there a general model of learning? In this chapter, we begin to formalize and address these questions by introducing the Probably Approximately Correct (PAC) learning framework.
The PAC framework helps define the class of learnable concepts in terms of the number of sample points needed to achieve an approximate solution, the sample complexity, and the time and space complexity of the learning algorithm, which depends on the cost of the computational representation of the concepts. We first describe the PAC framework and illustrate it, then present some general learning guarantees within this framework when the hypothesis set used is finite, both for the consistent case, where the hypothesis set used contains the concept to learn, and for the opposite inconsistent case.

2.1 The PAC learning model

We first introduce several definitions and the notation needed to present the PAC model, which will also be used throughout much of this book.

We denote by X the set of all possible examples or instances. X is also sometimes referred to as the input space. The set of all possible labels or target values is denoted by Y. For the purpose of this introductory chapter, we will limit ourselves to the case where Y is reduced to two labels, Y = {0, 1}, which corresponds to so-called binary classification. Later chapters will extend these results to more general settings.

A concept c : X → Y is a mapping from X to Y. Since Y = {0, 1}, we can identify c with the subset of X over which it takes the value 1. Thus, in the following, we equivalently refer to a concept to learn as a mapping from X to {0, 1}, or as a subset of X. As an example, a concept may be the set of points inside a triangle or the indicator function of these points. In such cases, we will say in short that the concept to learn is a triangle. A concept class is a set of concepts we may wish to learn and is denoted by C. This could, for example, be the set of all triangles in the plane.

We assume that examples are independently and identically distributed (i.i.d.) according to some fixed but unknown distribution D.
The learning problem is then formulated as follows. The learner considers a fixed set of possible concepts H, called a hypothesis set, which might not necessarily coincide with C. It receives a sample S = (x1, . . . , xm) drawn i.i.d. according to D as well as the labels (c(x1), . . . , c(xm)), which are based on a specific target concept c ∈ C to learn. The task is then to use the labeled sample S to select a hypothesis hS ∈ H that has a small generalization error with respect to the concept c. The generalization error of a hypothesis h ∈ H, also referred to as the risk or true error (or simply error) of h, is denoted by R(h) and defined as follows.1

Definition 2.1 (Generalization error) Given a hypothesis h ∈ H, a target concept c ∈ C, and an underlying distribution D, the generalization error or risk of h is defined by

R(h) = P_{x∼D}[h(x) ≠ c(x)] = E_{x∼D}[1_{h(x)≠c(x)}], (2.1)

where 1_ω is the indicator function of the event ω.2

The generalization error of a hypothesis is not directly accessible to the learner since both the distribution D and the target concept c are unknown. However, the learner can measure the empirical error of a hypothesis on the labeled sample S.

Definition 2.2 (Empirical error) Given a hypothesis h ∈ H, a target concept c ∈ C, and a sample S = (x1, . . . , xm), the empirical error or empirical risk of h is defined by

R̂_S(h) = (1/m) ∑_{i=1}^{m} 1_{h(xi)≠c(xi)}. (2.2)

Thus, the empirical error of h ∈ H is its average error over the sample S, while the generalization error is its expected error based on the distribution D. We will see in this chapter and the following chapters a number of guarantees relating these two quantities with high probability, under some general assumptions. We can already note that for a fixed h ∈ H, the expectation of the empirical error based on an i.i.d.
sample S is equal to the generalization error:

E_{S∼D^m}[R̂_S(h)] = R(h). (2.3)

Indeed, by the linearity of the expectation and the fact that the sample is drawn i.i.d., we can write

E_{S∼D^m}[R̂_S(h)] = (1/m) ∑_{i=1}^{m} E_{S∼D^m}[1_{h(xi)≠c(xi)}] = (1/m) ∑_{i=1}^{m} E_{S∼D^m}[1_{h(x)≠c(x)}],

for any x in sample S. Thus,

E_{S∼D^m}[R̂_S(h)] = E_{S∼D^m}[1_{h(x)≠c(x)}] = E_{x∼D}[1_{h(x)≠c(x)}] = R(h).

The following introduces the Probably Approximately Correct (PAC) learning framework. Let n be a number such that the computational cost of representing any element x ∈ X is at most O(n) and denote by size(c) the maximal cost of the computational representation of c ∈ C. For example, x may be a vector in R^n, for which the cost of an array-based representation would be in O(n). In addition, let hS denote the hypothesis returned by algorithm A after receiving a labeled sample S. To keep notation simple, the dependency of hS on A is not explicitly indicated.

Definition 2.3 (PAC-learning) A concept class C is said to be PAC-learnable if there exists an algorithm A and a polynomial function poly(·, ·, ·, ·) such that for any ε > 0 and δ > 0, for all distributions D on X and for any target concept c ∈ C, the following holds for any sample size m ≥ poly(1/ε, 1/δ, n, size(c)):

P_{S∼D^m}[R(hS) ≤ ε] ≥ 1 − δ. (2.4)

If A further runs in poly(1/ε, 1/δ, n, size(c)), then C is said to be efficiently PAC-learnable. When such an algorithm A exists, it is called a PAC-learning algorithm for C.

1 The choice of R instead of E to denote an error avoids possible confusion with the notation for expectations and is further justified by the fact that the term risk is also used in machine learning and statistics to refer to an error.
2 For this and other related definitions, the family of functions H and the target concept c must be measurable. The function classes we consider in this book all have this property.
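As a quick numerical sanity check of equation (2.3), the identity E_{S∼D^m}[R̂_S(h)] = R(h) can be verified for a fixed hypothesis by averaging the empirical error over many independent samples. The distribution, target concept, and hypothesis below are hypothetical choices for which R(h) is known in closed form.

```python
import random

random.seed(0)

def c(x):            # target concept: label 1 iff x < 0.5
    return int(x < 0.5)

def h(x):            # fixed hypothesis: label 1 iff x < 0.3
    return int(x < 0.3)

R_h = 0.2            # R(h) = P[0.3 <= x < 0.5] under D = uniform on [0, 1]

def empirical_error(m):
    S = [random.random() for _ in range(m)]
    return sum(h(x) != c(x) for x in S) / m

# Average the empirical error over many independent samples S of size m.
n_trials, m = 5000, 100
avg = sum(empirical_error(m) for _ in range(n_trials)) / n_trials
print(abs(avg - R_h))     # close to 0, as equation (2.3) predicts
```

Note that the identity holds only for a hypothesis h fixed before the sample is drawn; a hypothesis selected using S, such as hS, does not enjoy it, which is why the uniform bounds of the next sections are needed.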
A concept class C is thus PAC-learnable if the hypothesis returned by the algorithm after observing a number of points polynomial in 1/ε and 1/δ is approximately correct (error at most ε) with high probability (at least 1 − δ), which justifies the PAC terminology. The parameter δ > 0 is used to define the confidence 1 − δ and ε > 0 the accuracy 1 − ε. Note that if the running time of the algorithm is polynomial in 1/ε and 1/δ, then the sample size m must also be polynomial if the full sample is received by the algorithm.

Several key points of the PAC definition are worth emphasizing. First, the PAC framework is a distribution-free model: no particular assumption is made about the distribution D from which examples are drawn. Second, the training sample and the test examples used to define the error are drawn according to the same distribution D. This is a natural and necessary assumption for generalization to be possible in general. It can be relaxed to include favorable domain adaptation problems. Finally, the PAC framework deals with the question of learnability for a concept class C and not a particular concept. Note that the concept class C is known to the algorithm, but of course the target concept c ∈ C is unknown.

Figure 2.1 Target concept R and possible hypothesis R′. Circles represent training instances. A blue circle is a point labeled with 1, since it falls within the rectangle R. Others are red and labeled with 0.

In many cases, in particular when the computational representation of the concepts is not explicitly discussed or is straightforward, we may omit the polynomial dependency on n and size(c) in the PAC definition and focus only on the sample complexity. We now illustrate PAC-learning with a specific learning problem.
Example 2.4 (Learning axis-aligned rectangles) Consider the case where the set of instances are points in the plane, X = R², and the concept class C is the set of all axis-aligned rectangles lying in R². Thus, each concept c is the set of points inside a particular axis-aligned rectangle. The learning problem consists of determining with small error a target axis-aligned rectangle using the labeled training sample. We will show that the concept class of axis-aligned rectangles is PAC-learnable.

Figure 2.1 illustrates the problem. R represents a target axis-aligned rectangle and R′ a hypothesis. As can be seen from the figure, the error regions of R′ are formed by the area within the rectangle R but outside the rectangle R′ and the area within R′ but outside the rectangle R. The first area corresponds to false negatives, that is, points that are labeled as 0 or negatively by R′, which are in fact positive or labeled with 1. The second area corresponds to false positives, that is, points labeled positively by R′ which are in fact negatively labeled.

To show that the concept class is PAC-learnable, we describe a simple PAC-learning algorithm A. Given a labeled sample S, the algorithm consists of returning the tightest axis-aligned rectangle R′ = RS containing the points labeled with 1.

Figure 2.2 Illustration of the hypothesis R′ = RS returned by the algorithm.

By definition, RS does not produce any false positives, since its points must be included in the target concept R. Thus, the error region of RS is included in R.

Let R ∈ C be a target concept. Fix ε > 0. Let P[R] denote the probability mass of the region defined by R, that is, the probability that a point randomly drawn according to D falls within R.
Since errors made by our algorithm can be due only to points falling inside R, we can assume that P[R] > ε; otherwise, the error of RS is less than or equal to ε regardless of the training sample S received.

Now, since P[R] > ε, we can define four rectangular regions r1, r2, r3, and r4 along the sides of R, each with probability at least ε/4. These regions can be constructed by starting with the full rectangle R and then decreasing the size by moving one side as much as possible while keeping a distribution mass of at least ε/4. Figure 2.3 illustrates the definition of these regions.

Let l, r, b, and t be the four real values defining R: R = [l, r] × [b, t]. Then, for example, the left rectangle r4 is defined by r4 = [l, s4] × [b, t], with s4 = inf{s : P[[l, s] × [b, t]] ≥ ε/4}. It is not hard to see that the probability of the region r̄4 = [l, s4) × [b, t], obtained from r4 by excluding its rightmost side, is at most ε/4. The regions r̄1, r̄2, and r̄3 are defined in a similar way.

Observe that if RS meets all of these four regions ri, i ∈ [4], then, because it is a rectangle, it will have one side in each of these regions (geometric argument). Its error area, which is the part of R that it does not cover, is thus included in the union of the regions r̄i, i ∈ [4], and cannot have probability mass more than ε. By contraposition, if R(RS) > ε, then RS must miss at least one of the regions ri, i ∈ [4]. As a result, we can write

P_{S∼D^m}[R(RS) > ε] ≤ P_{S∼D^m}[∪_{i=1}^{4} {RS ∩ ri = ∅}] (2.5)
≤ ∑_{i=1}^{4} P_{S∼D^m}[{RS ∩ ri = ∅}] (by the union bound)
≤ 4(1 − ε/4)^m (since P[ri] ≥ ε/4)
≤ 4 exp(−mε/4),

Figure 2.3 Illustration of the regions r1, . . . , r4.

where for the last step we used the general inequality 1 − x ≤ e^{−x}, valid for all x ∈ R. For any δ > 0, to ensure that P_{S∼D^m}[R(RS) > ε] ≤ δ, we can impose

4 exp(−εm/4) ≤ δ ⇔ m ≥ (4/ε) log(4/δ).
(2.6)

Thus, for any ε > 0 and δ > 0, if the sample size m is greater than (4/ε) log(4/δ), then P_{S∼D^m}[R(RS) > ε] ≤ δ. Furthermore, the computational cost of the representation of points in R² and axis-aligned rectangles, which can be defined by their four corners, is constant. This proves that the concept class of axis-aligned rectangles is PAC-learnable and that the sample complexity of PAC-learning axis-aligned rectangles is in O((1/ε) log(1/δ)).

An equivalent way to present sample complexity results like (2.6), which we will often see throughout this book, is to give a generalization bound. A generalization bound states that with probability at least 1 − δ, R(RS) is upper bounded by some quantity that depends on the sample size m and δ. To obtain this, it suffices to set δ to be equal to the upper bound derived in (2.5), that is δ = 4 exp(−mε/4), and solve for ε. This yields that with probability at least 1 − δ, the error of the algorithm is bounded as follows:

R(RS) ≤ (4/m) log(4/δ). (2.7)

Other PAC-learning algorithms could be considered for this example. One alternative is to return the largest axis-aligned rectangle not containing the negative points, for example. The proof of PAC-learning just presented for the tightest axis-aligned rectangle can be easily adapted to the analysis of other such algorithms.

Note that the hypothesis set H we considered in this example coincided with the concept class C and that its cardinality was infinite. Nevertheless, the problem admitted a simple proof of PAC-learning. We may then ask if a similar proof can readily apply to other similar concept classes. This is not as straightforward because the specific geometric argument used in the proof is key. It is non-trivial to extend the proof to other concept classes such as that of non-concentric circles (see exercise 2.4). Thus, we need a more general proof technique and more general results.
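The tightest-rectangle analysis can be checked empirically. The sketch below uses a hypothetical setup, D uniform on the unit square and target rectangle [0.2, 0.8] × [0.3, 0.9], runs the learner repeatedly, and confirms that the error exceeds the bound (4/m) log(4/δ) of (2.7) in at most a δ fraction of the runs (in practice, far fewer, since the bound is loose).

```python
import math
import random

random.seed(1)

TARGET = (0.2, 0.8, 0.3, 0.9)   # hypothetical target rectangle: l, r, b, t

def inside(p, rect):
    l, r, b, t = rect
    return l <= p[0] <= r and b <= p[1] <= t

def learn(m):
    """Tightest axis-aligned rectangle containing the positively labeled points."""
    S = [(random.random(), random.random()) for _ in range(m)]
    pos = [p for p in S if inside(p, TARGET)]
    if not pos:                               # degenerate case: no positives seen
        return (0.5, 0.5, 0.5, 0.5)
    xs = [p[0] for p in pos]
    ys = [p[1] for p in pos]
    return (min(xs), max(xs), min(ys), max(ys))

def estimate_error(rect, n=5000):
    """Monte Carlo estimate of R(R_S): mass of the target region missed by rect."""
    pts = [(random.random(), random.random()) for _ in range(n)]
    return sum(inside(p, TARGET) != inside(p, rect) for p in pts) / n

m, delta = 500, 0.05
bound = (4 / m) * math.log(4 / delta)         # generalization bound (2.7)
trials = 100
violations = sum(estimate_error(learn(m)) > bound for _ in range(trials))
print(violations / trials)                    # at most delta, typically far less
```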
The next two sections provide us with such tools in the case of a finite hypothesis set.

2.2 Guarantees for finite hypothesis sets — consistent case

In the example of axis-aligned rectangles that we examined, the hypothesis hS returned by the algorithm was always consistent, that is, it admitted no error on the training sample S. In this section, we present a general sample complexity bound, or equivalently, a generalization bound, for consistent hypotheses, in the case where the cardinality |H| of the hypothesis set is finite. Since we consider consistent hypotheses, we will assume that the target concept c is in H.

Theorem 2.5 (Learning bound — finite H, consistent case) Let H be a finite set of functions mapping from X to Y. Let A be an algorithm that for any target concept c ∈ H and i.i.d. sample S returns a consistent hypothesis hS: R̂_S(hS) = 0. Then, for any ε, δ > 0, the inequality P_{S∼D^m}[R(hS) ≤ ε] ≥ 1 − δ holds if

m ≥ (1/ε) (log |H| + log (1/δ)). (2.8)

This sample complexity result admits the following equivalent statement as a generalization bound: for any ε, δ > 0, with probability at least 1 − δ,

R(hS) ≤ (1/m) (log |H| + log (1/δ)). (2.9)

Proof: Fix ε > 0. We do not know which consistent hypothesis hS ∈ H is selected by the algorithm A. This hypothesis further depends on the training sample S. Therefore, we need to give a uniform convergence bound, that is, a bound that holds for the set of all consistent hypotheses, which a fortiori includes hS. Thus, we will bound the probability that some h ∈ H would be consistent and have error more than ε. For any ε > 0, define H_ε by H_ε = {h ∈ H : R(h) > ε}. The probability that a hypothesis h in H_ε is consistent on a training sample S drawn i.i.d., that is, that it would have no error on any point in S, can be bounded as follows:

P[R̂_S(h) = 0] ≤ (1 − ε)^m.
Thus, by the union bound, the following holds:

P[∃h ∈ H_ε : R̂_S(h) = 0] = P[R̂_S(h1) = 0 ∨ · · · ∨ R̂_S(h_{|H_ε|}) = 0]
≤ ∑_{h∈H_ε} P[R̂_S(h) = 0] (union bound)
≤ ∑_{h∈H_ε} (1 − ε)^m ≤ |H| (1 − ε)^m ≤ |H| e^{−mε}.

Setting the right-hand side to be equal to δ and solving for ε concludes the proof. ∎

The theorem shows that when the hypothesis set H is finite, a consistent algorithm A is a PAC-learning algorithm, since the sample complexity given by (2.8) is dominated by a polynomial in 1/ε and 1/δ. As shown by (2.9), the generalization error of consistent hypotheses is upper bounded by a term that decreases as a function of the sample size m. This is a general fact: as expected, learning algorithms benefit from larger labeled training samples. The decrease rate of O(1/m) guaranteed by this theorem, however, is particularly favorable.

The price to pay for coming up with a consistent algorithm is the use of a larger hypothesis set H containing target concepts. Of course, the upper bound (2.9) increases with |H|. However, that dependency is only logarithmic. Note that the term log |H|, or the related term log₂ |H| from which it differs by a constant factor, can be interpreted as the number of bits needed to represent H. Thus, the generalization guarantee of the theorem is controlled by the ratio of this number of bits, log₂ |H|, and the sample size m.

We now use theorem 2.5 to analyze PAC-learning with various concept classes.

Example 2.6 (Conjunction of Boolean literals) Consider learning the concept class Cn of conjunctions of at most n Boolean literals x1, . . . , xn. A Boolean literal is either a variable xi, i ∈ [n], or its negation x̄i. For n = 4, an example is the conjunction x1 ∧ x̄2 ∧ x4, where x̄2 denotes the negation of the Boolean variable x2. (1, 0, 0, 1) is a positive example for this concept while (1, 0, 0, 0) is a negative example.
Observe that for n = 4, a positive example (1, 0, 1, 0) implies that the target concept cannot contain the literals x̄1 and x̄3 and that it cannot contain the literals x2 and x4. In contrast, a negative example is not as informative since it is not known which of its n bits are incorrect. A simple algorithm for finding a consistent hypothesis is thus based on positive examples and consists of the following: for each positive example (b1, . . . , bn) and i ∈ [n], if bi = 1 then x̄i is ruled out as a possible literal in the concept class and if bi = 0 then xi is ruled out. The conjunction of all the literals not ruled out is thus a hypothesis consistent with the target. Figure 2.4 shows an example training sample as well as a consistent hypothesis for the case n = 6.

0 1 1 0 1 1   +
0 1 1 1 1 1   +
0 0 1 1 0 1   −
0 1 1 1 1 1   +
1 0 0 1 1 0   −
0 1 0 0 1 1   +
0 1 ? ? 1 1

Figure 2.4 Each of the first six rows of the table represents a training example with its label, + or −, indicated in the last column. The last row contains 0 (respectively 1) in column i ∈ [6] if the ith entry is 0 (respectively 1) for all the positive examples. It contains "?" if both 0 and 1 appear as an ith entry for some positive example. Thus, for this training sample, the hypothesis returned by the consistent algorithm described in the text is x̄1 ∧ x2 ∧ x5 ∧ x6.

We have |H| = |Cn| = 3^n, since each literal can be included positively, with negation, or not included. Plugging this into the sample complexity bound for consistent hypotheses yields the following sample complexity bound for any ε > 0 and δ > 0:

m ≥ (1/ε) ((log 3) n + log (1/δ)). (2.10)

Thus, the class of conjunctions of at most n Boolean literals is PAC-learnable. Note that the computational complexity is also polynomial, since the training cost per example is in O(n). For δ = 0.02, ε = 0.1, and n = 10, the bound becomes m ≥ 149.
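The rule-out algorithm just described is straightforward to implement. The sketch below is an illustrative implementation, not the book's code, run on the training sample of figure 2.4; in the returned list, 1 encodes the literal xi, 0 its negation, and None the "?" entries where both literals are ruled out.

```python
def learn_conjunction(n, examples):
    """Consistent learner for conjunctions of Boolean literals: each positive
    example (b1, ..., bn) rules out the negated literal where bi = 1 and the
    plain literal where bi = 0; only literals never ruled out survive."""
    seen = [set() for _ in range(n)]          # bit values seen among positives
    for bits, label in examples:
        if label == '+':
            for i, b in enumerate(bits):
                seen[i].add(b)
    # 1 -> keep x_i, 0 -> keep its negation, None -> neither (the '?')
    return [1 if s == {1} else 0 if s == {0} else None for s in seen]

# Training sample of figure 2.4 (n = 6).
sample = [((0, 1, 1, 0, 1, 1), '+'), ((0, 1, 1, 1, 1, 1), '+'),
          ((0, 0, 1, 1, 0, 1), '-'), ((0, 1, 1, 1, 1, 1), '+'),
          ((1, 0, 0, 1, 1, 0), '-'), ((0, 1, 0, 0, 1, 1), '+')]

hypothesis = learn_conjunction(6, sample)
print(hypothesis)   # [0, 1, None, None, 1, 1], i.e. not-x1 AND x2 AND x5 AND x6
```

Each example is processed in O(n) time, matching the per-example training cost cited in the text.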
Thus, for a labeled sample of at least 149 examples, the bound guarantees 90% accuracy with a confidence of at least 98%.

Example 2.7 (Universal concept class) Consider the set X = {0, 1}^n of all Boolean vectors with n components, and let Un be the concept class formed by all subsets of X. Is this concept class PAC-learnable? To guarantee a consistent hypothesis, the hypothesis class must include the concept class, thus |H| ≥ |Un| = 2^(2^n). Theorem 2.5 gives the following sample complexity bound:

m ≥ (1/ε) ((log 2) 2^n + log (1/δ)). (2.11)

Here, the number of training samples required is exponential in n, which is the cost of the representation of a point in X. Thus, PAC-learning is not guaranteed by the theorem. In fact, it is not hard to show that this universal concept class is not PAC-learnable.

Example 2.8 (k-term DNF formulae) A disjunctive normal form (DNF) formula is a formula written as the disjunction of several terms, each term being a conjunction of Boolean literals. A k-term DNF is a DNF formula defined by the disjunction of k terms, each term being a conjunction of at most n Boolean literals. Thus, for k = 2 and n = 3, an example of a k-term DNF is (x1 ∧ x̄2 ∧ x3) ∨ (x̄1 ∧ x3).

Is the class C of k-term DNF formulae PAC-learnable? The cardinality of the class is 3^{nk}, since each term is a conjunction of at most n variables and there are 3^n such conjunctions, as seen previously. The hypothesis set H must contain C for consistency to be possible, thus |H| ≥ 3^{nk}. Theorem 2.5 gives the following sample complexity bound:

m ≥ (1/ε) ((log 3) nk + log (1/δ)), (2.12)

which is polynomial. However, it can be shown by a reduction from the graph 3-coloring problem that the problem of learning k-term DNF, even for k = 3, is not efficiently PAC-learnable, unless RP, the complexity class of problems that admit a randomized polynomial-time decision solution, coincides with NP (RP = NP), which is commonly conjectured not to be the case.
Thus, while the sample size needed for learning k-term DNF formulae is only polynomial, efficient PAC-learning of this class is not possible if RP ≠ NP.

Example 2.9 (k-CNF formulae) A conjunctive normal form (CNF) formula is a conjunction of disjunctions. A k-CNF formula is an expression of the form T1 ∧ · · · ∧ Tj with arbitrary length j ∈ N and with each term Ti being a disjunction of at most k Boolean attributes.

The problem of learning k-CNF formulae can be reduced to that of learning conjunctions of Boolean literals, which, as seen previously, is a PAC-learnable concept class. This can be done at the cost of introducing (2n)^k new variables Y_{u1,...,uk} using the following bijection:

(u1, . . . , uk) → Y_{u1,...,uk}, (2.13)

where u1, . . . , uk are Boolean literals over the original variables x1, . . . , xn. The value of Y_{u1,...,uk} is determined by Y_{u1,...,uk} = u1 ∨ · · · ∨ uk. Using this mapping, the original training sample can be transformed into one defined in terms of the new variables, and any k-CNF formula over the original variables can be written as a conjunction over the variables Y_{u1,...,uk}. This reduction to PAC-learning of conjunctions of Boolean literals can affect the original distribution of examples, but this is not an issue since in the PAC framework no assumption is made about the distribution. Thus, using this transformation, the PAC-learnability of conjunctions of Boolean literals implies that of k-CNF formulae.

This is a surprising result, however, since any k-term DNF formula can be written as a k-CNF formula. Indeed, using distributivity, a k-term DNF T1 ∨ · · · ∨ Tk with Ti = u_{i,1} ∧ · · · ∧ u_{i,ni} for i ∈ [k] can be rewritten as a k-CNF formula via

∨_{i=1}^{k} (u_{i,1} ∧ · · · ∧ u_{i,ni}) = ∧_{j1∈[n1],...,jk∈[nk]} (u_{1,j1} ∨ · · · ∨ u_{k,jk}).

To illustrate this rewriting in a specific case, observe, for example, that

(u1 ∧ u2 ∧ u3) ∨ (v1 ∧ v2 ∧ v3) = ∧_{i,j=1}^{3} (ui ∨ vj).
But, as we previously saw, k-term DNF formulae are not efficiently PAC-learnable if RP ≠ NP! What can explain this apparent inconsistency? The issue is that converting the k-CNF formula we have learned into the k-term DNF to which it is equivalent is in general intractable if RP ≠ NP. This example reveals some key aspects of PAC-learning, which include the cost of the representation of a concept and the choice of the hypothesis set. For a fixed concept class, learning can be intractable or not, depending on the choice of the representation.

2.3 Guarantees for finite hypothesis sets — inconsistent case

In the most general case, there may be no hypothesis in H consistent with the labeled training sample. This, in fact, is the typical case in practice, where the learning problems may be somewhat difficult or the concept classes more complex than the hypothesis set used by the learning algorithm. However, inconsistent hypotheses with a small number of errors on the training sample can be useful and, as we shall see, can benefit from favorable guarantees under some assumptions. This section presents learning guarantees precisely for this inconsistent case and finite hypothesis sets.

To derive learning guarantees in this more general setting, we will use Hoeffding's inequality (theorem D.2) or the following corollary, which relates the generalization error and empirical error of a single hypothesis.

Corollary 2.10 Fix ε > 0. Then, for any hypothesis h : X → {0, 1}, the following inequalities hold:

P_{S∼D^m}[R̂_S(h) − R(h) ≥ ε] ≤ exp(−2mε²) (2.14)
P_{S∼D^m}[R̂_S(h) − R(h) ≤ −ε] ≤ exp(−2mε²). (2.15)

By the union bound, this implies the following two-sided inequality:

P_{S∼D^m}[|R̂_S(h) − R(h)| ≥ ε] ≤ 2 exp(−2mε²). (2.16)

Proof: The result follows immediately from theorem D.2. □

Setting the right-hand side of (2.16) to be equal to δ and solving for ε yields immediately the following bound for a single hypothesis.
Corollary 2.11 (Generalization bound — single hypothesis) Fix a hypothesis h : X → {0, 1}. Then, for any δ > 0, the following inequality holds with probability at least 1 − δ:

R(h) ≤ R̂_S(h) + √(log(2/δ) / (2m)). (2.17)

The following example illustrates this corollary in a simple case.

Example 2.12 (Tossing a coin) Imagine tossing a biased coin that lands heads with probability p, and let our hypothesis be the one that always guesses tails. Then the true error rate is R(h) = p and the empirical error rate R̂_S(h) = p̂, where p̂ is the empirical probability of heads based on the training sample drawn i.i.d. Thus, corollary 2.11 guarantees with probability at least 1 − δ that

p − p̂ ≤ √(log(2/δ) / (2m)). (2.18)

Therefore, if we choose δ = 0.02 and use a sample of size 500, with probability at least 98%, the following approximation quality is guaranteed for p̂:

p − p̂ ≤ √(log(100) / 1000) ≈ 0.068. (2.19)

Can we readily apply corollary 2.11 to bound the generalization error of the hypothesis h_S returned by a learning algorithm when training on a sample S? No, since h_S is not a fixed hypothesis, but a random variable depending on the training sample S drawn. Note also that, unlike the case of a fixed hypothesis, for which the expectation of the empirical error is the generalization error (equation (2.3)), the generalization error R(h_S) is a random variable and in general distinct from the expectation E[R̂_S(h_S)], which is a constant.

Thus, as in the proof for the consistent case, we need to derive a uniform convergence bound, that is, a bound that holds with high probability for all hypotheses h ∈ H.

Theorem 2.13 (Learning bound — finite H, inconsistent case) Let H be a finite hypothesis set. Then, for any δ > 0, with probability at least 1 − δ, the following inequality holds:

∀h ∈ H, R(h) ≤ R̂_S(h) + √((log |H| + log(2/δ)) / (2m)). (2.20)

Proof: Let h1, ..., h_{|H|} be the elements of H.
Using the union bound and applying corollary 2.11 to each hypothesis yields:

P[∃h ∈ H : |R̂_S(h) − R(h)| > ε]
= P[(|R̂_S(h1) − R(h1)| > ε) ∨ ··· ∨ (|R̂_S(h_{|H|}) − R(h_{|H|})| > ε)]
≤ Σ_{h∈H} P[|R̂_S(h) − R(h)| > ε]
≤ 2|H| exp(−2mε²).

Setting the right-hand side to be equal to δ completes the proof. □

Thus, for a finite hypothesis set H,

R(h) ≤ R̂_S(h) + O(√(log₂|H| / m)).

As already pointed out, log₂|H| can be interpreted as the number of bits needed to represent H. Several other remarks similar to those made on the generalization bound in the consistent case can be made here: a larger sample size m guarantees better generalization, and the bound increases with |H|, but only logarithmically. But, here, the bound is a less favorable function of (log₂|H|)/m; it varies as the square root of this term. This is not a minor price to pay: for a fixed |H|, to attain the same guarantee as in the consistent case, a quadratically larger labeled sample is needed.

Note that the bound suggests seeking a trade-off between reducing the empirical error and controlling the size of the hypothesis set: a larger hypothesis set is penalized by the second term but could help reduce the empirical error, that is, the first term. But, for a similar empirical error, it suggests using a smaller hypothesis set. This can be viewed as an instance of the so-called Occam's razor principle, named after the theologian William of Occam: plurality should not be posited without necessity, also rephrased as: the simplest explanation is best. In this context, it could be expressed as follows: all other things being equal, a simpler (smaller) hypothesis set is better.

2.4 Generalities

In this section we will discuss some general aspects of the learning scenario, which, for simplicity, we left out of the discussion of the earlier sections.
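Before turning to these general aspects, the bound of theorem 2.13 can be evaluated numerically. The following short sketch (ours, with illustrative parameter values) also exhibits the quadratic sample-size cost relative to the consistent case:

```python
import math

def inconsistent_bound_gap(H_size, m, delta):
    """Second term of (2.20): sqrt((log|H| + log(2/delta)) / (2m))."""
    return math.sqrt((math.log(H_size) + math.log(2 / delta)) / (2 * m))

def sample_size_for_gap(H_size, eps, delta):
    """Smallest m making the gap at most eps (inverting (2.20))."""
    return math.ceil((math.log(H_size) + math.log(2 / delta)) / (2 * eps**2))

# The required m varies as 1/eps^2: halving eps roughly quadruples it,
# whereas the consistent-case bound of theorem 2.5 varies as 1/eps.
m1 = sample_size_for_gap(H_size=2**20, eps=0.1, delta=0.05)
m2 = sample_size_for_gap(H_size=2**20, eps=0.05, delta=0.05)
```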
2.4.1 Deterministic versus stochastic scenarios

In the most general scenario of supervised learning, the distribution D is defined over X × Y, and the training data is a labeled sample S drawn i.i.d. according to D:

S = ((x1, y1), ..., (xm, ym)).

The learning problem is to find a hypothesis h ∈ H with small generalization error

R(h) = P_{(x,y)∼D}[h(x) ≠ y] = E_{(x,y)∼D}[1_{h(x)≠y}].

This more general scenario is referred to as the stochastic scenario. Within this setting, the output label is a probabilistic function of the input. The stochastic scenario captures many real-world problems where the label of an input point is not unique. For example, if we seek to predict gender based on input pairs formed by the height and weight of a person, then the label will typically not be unique. For most pairs, both male and female are possible genders. For each fixed pair, there would be a probability distribution of the label being male.

The natural extension of the PAC-learning framework to this setting is known as agnostic PAC-learning.

Definition 2.14 (Agnostic PAC-learning) Let H be a hypothesis set. A is an agnostic PAC-learning algorithm if there exists a polynomial function poly(·, ·, ·, ·) such that for any ε > 0 and δ > 0, for all distributions D over X × Y, the following holds for any sample size m ≥ poly(1/ε, 1/δ, n, size(c)):

P_{S∼D^m}[R(h_S) − min_{h∈H} R(h) ≤ ε] ≥ 1 − δ. (2.21)

If A further runs in poly(1/ε, 1/δ, n), then it is said to be an efficient agnostic PAC-learning algorithm.

When the label of a point can be uniquely determined by some measurable function f : X → Y (with probability one), then the scenario is said to be deterministic. In that case, it suffices to consider a distribution D over the input space. The training sample is obtained by drawing (x1, ..., xm) according to D, and the labels are obtained via f: yi = f(xi) for all i ∈ [m]. Many learning problems can be formulated within this deterministic scenario.
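The stochastic scenario can be illustrated with a small simulation (our own sketch; the toy distribution and helper names are assumptions, not the book's): labels are drawn from a conditional distribution, so even the best fixed rule for the distribution makes errors.

```python
import random

def draw_sample(m, p_x, eta, seed=0):
    """Draw S = ((x1,y1),...,(xm,ym)) i.i.d. from a toy distribution over
    X x Y: x is drawn from p_x (a dict x -> P[x]) and the label is 1 with
    conditional probability eta[x] -- a stochastic, not deterministic, label."""
    rng = random.Random(seed)
    xs, ps = zip(*p_x.items())
    sample = []
    for _ in range(m):
        x = rng.choices(xs, weights=ps)[0]
        y = 1 if rng.random() < eta[x] else 0
        sample.append((x, y))
    return sample

def error(h, sample):
    """Empirical error: the fraction of points with h(x) != y."""
    return sum(h(x) != y for x, y in sample) / len(sample)

# Labels are not unique, so even the rule predicting the likelier label
# at each x has nonzero error (here about 0.5*0.1 + 0.5*0.2 = 0.15).
p_x = {0: 0.5, 1: 0.5}
eta = {0: 0.1, 1: 0.8}
S = draw_sample(5000, p_x, eta)
best = lambda x: 1 if eta[x] >= 0.5 else 0
err = error(best, S)
```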
In the previous sections, as well as in most of the material presented in this book, we have restricted our presentation to the deterministic scenario in the interest of simplicity. However, for all of this material, the extension to the stochastic scenario should be straightforward for the reader.

2.4.2 Bayes error and noise

In the deterministic case, by definition, there exists a target function f with no generalization error: R(f) = 0. In the stochastic case, there is a minimal non-zero error for any hypothesis.

Definition 2.15 (Bayes error) Given a distribution D over X × Y, the Bayes error R* is defined as the infimum of the errors achieved by measurable functions h : X → Y:

R* = inf_{h measurable} R(h). (2.22)

A hypothesis h with R(h) = R* is called a Bayes hypothesis or Bayes classifier.

By definition, in the deterministic case, we have R* = 0, but, in the stochastic case, R* ≠ 0. Clearly, the Bayes classifier h_Bayes can be defined in terms of the conditional probabilities as:

∀x ∈ X, h_Bayes(x) = argmax_{y∈{0,1}} P[y|x]. (2.23)

The average error made by h_Bayes on x ∈ X is thus min{P[0|x], P[1|x]}, and this is the minimum possible error. This leads to the following definition of noise.

Definition 2.16 (Noise) Given a distribution D over X × Y, the noise at point x ∈ X is defined by

noise(x) = min{P[1|x], P[0|x]}. (2.24)

The average noise, or the noise associated to D, is E[noise(x)]. Thus, the average noise is precisely the Bayes error: noise = E[noise(x)] = R*. The noise is a characteristic of the learning task indicative of its level of difficulty. A point x ∈ X for which noise(x) is close to 1/2 is sometimes referred to as noisy and is of course a challenge for accurate prediction.

2.5 Chapter notes

The PAC learning framework was introduced by Valiant [1984]. The book of Kearns and Vazirani [1994] is an excellent reference dealing with most aspects of PAC-learning and several other foundational questions in machine learning.
Our example of learning axis-aligned rectangles, also discussed in that reference, is originally due to Blumer et al. [1989]. The PAC learning framework is a computational framework since it takes into account the cost of the computational representations and the time complexity of the learning algorithm. If we omit the computational aspects, it is similar to the learning framework considered earlier by Vapnik and Chervonenkis [see Vapnik, 2000]. The definition of noise presented in this chapter can be generalized to arbitrary loss functions (see exercise 2.14).

Occam's razor principle is invoked in a variety of contexts, such as in linguistics to justify the superiority of a set of rules or syntax. The Kolmogorov complexity can be viewed as the corresponding framework in information theory. In the context of the learning guarantees presented in this chapter, the principle suggests selecting the most parsimonious explanation (the hypothesis set with the smallest cardinality). We will see in the next sections other applications of this principle with different notions of simplicity or complexity.

2.6 Exercises

2.1 Two-oracle variant of the PAC model. Assume that positive and negative examples are now drawn from two separate distributions D+ and D−. For an accuracy (1 − ε), the learning algorithm must find a hypothesis h such that:

P_{x∼D+}[h(x) = 0] ≤ ε and P_{x∼D−}[h(x) = 1] ≤ ε. (2.25)

Figure 2.5 (a) Gertrude's regions r1, r2, r3. (b) Hint for solution.

Thus, the hypothesis must have a small error on both distributions. Let C be any concept class and H be any hypothesis space. Let h0 and h1 represent the identically 0 and identically 1 functions, respectively. Prove that C is efficiently PAC-learnable using H in the standard (one-oracle) PAC model if and only if it is efficiently PAC-learnable using H ∪ {h0, h1} in this two-oracle PAC model.

2.2 PAC learning of hyper-rectangles.
An axis-aligned hyper-rectangle in R^n is a set of the form [a1, b1] × ... × [an, bn]. Show that axis-aligned hyper-rectangles are PAC-learnable by extending the proof given in example 2.4 for the case n = 2.

2.3 Concentric circles. Let X = R² and consider the set of concepts of the form c = {(x, y) : x² + y² ≤ r²} for some real number r. Show that this class can be (ε, δ)-PAC-learned from training data of size m ≥ (1/ε) log(1/δ).

2.4 Non-concentric circles. Let X = R² and consider the set of concepts of the form c = {x ∈ R² : ‖x − x0‖ ≤ r} for some point x0 ∈ R² and real number r. Gertrude, an aspiring machine learning researcher, attempts to show that this class of concepts may be (ε, δ)-PAC-learned with sample complexity m ≥ (3/ε) log(3/δ), but she is having trouble with her proof. Her idea is that the learning algorithm would select the smallest circle consistent with the training data. She has drawn three regions r1, r2, r3 around the edge of concept c, with each region having probability ε/3 (see figure 2.5(a)). She wants to argue that if the generalization error is greater than or equal to ε, then one of these regions must have been missed by the training data, and hence this event will occur with probability at most δ. Can you tell Gertrude if her approach works? (Hint: you may wish to use figure 2.5(b) in your solution.)

Figure 2.6 Axis-aligned right triangles.

2.5 Triangles. Let X = R² with orthonormal basis (e1, e2), and consider the set of concepts defined by the area inside a right triangle ABC with two sides parallel to the axes, with AB⃗/‖AB⃗‖ = e1, AC⃗/‖AC⃗‖ = e2, and ‖AB⃗‖/‖AC⃗‖ = α for some positive real α ∈ R+. Show, using similar methods to those used in the chapter for the axis-aligned rectangles, that this class can be (ε, δ)-PAC-learned from training data of size m ≥ (3/ε) log(3/δ). (Hint: you may consider using figure 2.6 in your solution.)

2.6 Learning in the presence of noise — rectangles.
In example 2.4, we showed that the concept class of axis-aligned rectangles is PAC-learnable. Consider now the case where the training points received by the learner are subject to the following noise: points negatively labeled are unaffected by noise, but the label of a positive training point is randomly flipped to negative with probability η ∈ (0, 1/2). The exact value of the noise rate η is not known to the learner, but an upper bound η′ is supplied to him with η ≤ η′ < 1/2. Show that the algorithm returning the tightest rectangle containing positive points can still PAC-learn axis-aligned rectangles in the presence of this noise. To do so, you can proceed using the following steps:

(a) Using the same notation as in example 2.4, assume that P[R] > ε. Suppose that R(R′) > ε. Give an upper bound on the probability that R′ misses a region rj, j ∈ [4], in terms of ε and η′.

(b) Use that to give an upper bound on P[R(R′) > ε] in terms of ε and η′, and conclude by giving a sample complexity bound.

2.7 Learning in the presence of noise — general case. In this question, we will seek a result that is more general than in the previous question. We consider a finite hypothesis set H, assume that the target concept is in H, and adopt the following noise model: the label of a training point received by the learner is randomly changed with probability η ∈ (0, 1/2). The exact value of the noise rate η is not known to the learner, but an upper bound η′ is supplied to him with η ≤ η′ < 1/2.

(a) For any h ∈ H, let d(h) denote the probability that the label of a training point received by the learner disagrees with the one given by h. Let h* be the target hypothesis; show that d(h*) = η.

(b) More generally, show that for any h ∈ H, d(h) = η + (1 − 2η)R(h), where R(h) denotes the generalization error of h.

(c) Fix ε > 0 for this and all the following questions.
Use the previous questions to show that if R(h) > ε, then d(h) − d(h*) ≥ ε′, where ε′ = ε(1 − 2η′).

(d) For any hypothesis h ∈ H and sample S of size m, let d̂(h) denote the fraction of the points in S whose labels disagree with those given by h. We will consider the algorithm L which, after receiving S, returns the hypothesis h_S with the smallest number of disagreements (thus d̂(h_S) is minimal). To show PAC-learning for L, we will show that for any h, if R(h) > ε, then with high probability d̂(h) ≥ d̂(h*). First, show that for any δ > 0, with probability at least 1 − δ/2, for m ≥ (2/ε′²) log(2/δ), the following holds:

d̂(h*) − d(h*) ≤ ε′/2.

(e) Second, show that for any δ > 0, with probability at least 1 − δ/2, for m ≥ (2/ε′²)(log |H| + log(2/δ)), the following holds for all h ∈ H:

d(h) − d̂(h) ≤ ε′/2.

(f) Finally, show that for any δ > 0, with probability at least 1 − δ, for m ≥ (2/(ε²(1 − 2η′)²))(log |H| + log(2/δ)), the following holds for all h ∈ H with R(h) > ε:

d̂(h) − d̂(h*) ≥ 0.

(Hint: use d̂(h) − d̂(h*) = [d̂(h) − d(h)] + [d(h) − d(h*)] + [d(h*) − d̂(h*)] and use the previous questions to lower bound each of these three terms.)

2.8 Learning intervals. Give a PAC-learning algorithm for the concept class C formed by closed intervals [a, b] with a, b ∈ R.

2.9 Learning union of intervals. Give a PAC-learning algorithm for the concept class C2 formed by unions of two closed intervals, that is, [a, b] ∪ [c, d], with a, b, c, d ∈ R. Extend your result to derive a PAC-learning algorithm for the concept class Cp formed by unions of p ≥ 1 closed intervals, thus [a1, b1] ∪ ··· ∪ [ap, bp], with ak, bk ∈ R for k ∈ [p]. What are the time and sample complexities of your algorithm as a function of p?

2.10 Consistent hypotheses. In this chapter, we showed that for a finite hypothesis set H, a consistent learning algorithm A is a PAC-learning algorithm. Here, we consider a converse question. Let Z be a finite set of m labeled points. Suppose that you are given a PAC-learning algorithm A.
Show that you can use A and a finite training sample S to find in polynomial time a hypothesis h ∈ H that is consistent with Z, with high probability. (Hint: you can select an appropriate distribution D over Z and give a condition on R(h) for h to be consistent.)

2.11 Senate laws. For important questions, President Mouth relies on expert advice. He selects an appropriate advisor from a collection of |H| = 2,800 experts.

(a) Assume that laws are proposed in a random fashion independently and identically according to some distribution D determined by an unknown group of senators. Assume that President Mouth can find and select an expert senator out of H who has consistently voted with the majority for the last m = 200 laws. Give a bound on the probability that such a senator incorrectly predicts the global vote for a future law. What is the value of the bound with 95% confidence?

(b) Assume now that President Mouth can find and select an expert senator out of H who has consistently voted with the majority for all but m′ = 20 of the last m = 200 laws. What is the value of the new bound?

2.12 Bayesian bound. Let H be a countable hypothesis set of functions mapping X to {0, 1}, and let p be a probability measure over H. This probability measure represents the prior probability over the hypothesis class, i.e., the probability that a particular hypothesis is selected by the learning algorithm. Use Hoeffding's inequality to show that for any δ > 0, with probability at least 1 − δ, the following inequality holds:

∀h ∈ H, R(h) ≤ R̂_S(h) + √((log(1/p(h)) + log(1/δ)) / (2m)). (2.26)

Compare this result with the bound given in the inconsistent case for finite hypothesis sets. (Hint: you could use δ′ = p(h)δ as the confidence parameter in Hoeffding's inequality.)

2.13 Learning with an unknown parameter. In example 2.9, we showed that the concept class of k-CNF formulae is PAC-learnable. Note, however, that the learning algorithm is given k as input. Is PAC-learning possible even when k is not provided?
More generally, consider a family of concept classes {Cs}s where Cs is the set of concepts in C with size at most s. Suppose we have a PAC-learning algorithm A that can be used for learning any concept class Cs when s is given. Can we convert A into a PAC-learning algorithm B that does not require the knowledge of s? This is the main objective of this problem.

To do this, we first introduce a method for testing a hypothesis h, with high probability. Fix ε > 0, δ > 0, and i ≥ 1, and define the sample size n by n = (32/ε)[i log 2 + log(2/δ)]. Suppose we draw an i.i.d. sample S of size n according to some unknown distribution D. We will say that a hypothesis h is accepted if it makes at most (3/4)ε errors on S and that it is rejected otherwise. Thus, h is accepted iff R̂(h) ≤ (3/4)ε.

(a) Assume that R(h) ≥ ε. Use the (multiplicative) Chernoff bound to show that in that case P_{S∼D^n}[h is accepted] ≤ δ/2^{i+1}.

(b) Assume that R(h) ≤ ε/2. Use the (multiplicative) Chernoff bound to show that in that case P_{S∼D^n}[h is rejected] ≤ δ/2^{i+1}.

(c) Algorithm B is defined as follows: we start with i = 1 and, at each round i ≥ 1, we guess the parameter size s to be s̃ = ⌊2^{(i−1)/log(2/δ)}⌋. We draw a sample S of size n (which depends on i) to test the hypothesis hi returned by A when it is trained with a sample of size S_A(ε/2, 1/2, s̃), that is, the sample complexity of A for a required precision ε/2, confidence 1/2, and size s̃ (we ignore the size of the representation of each example here). If hi is accepted, the algorithm stops and returns hi; otherwise it proceeds to the next iteration. Show that if at iteration i the estimate s̃ is larger than or equal to s, then P[hi is accepted] ≥ 3/8.

(d) Show that the probability that B does not halt after j = ⌈log(2/δ)/log(8/5)⌉ iterations with s̃ ≥ s is at most δ/2.

(e) Show that for i ≥ ⌈1 + (log₂ s) log(2/δ)⌉, the inequality s̃ ≥ s holds.
(f) Show that with probability at least 1 − δ, algorithm B halts after at most j′ = ⌈1 + (log₂ s) log(2/δ)⌉ + j iterations and returns a hypothesis with error at most ε.

2.14 In this exercise, we generalize the notion of noise to the case of an arbitrary loss function L : Y × Y → R+.

(a) Justify the following definition of the noise at point x ∈ X:

noise(x) = min_{y′∈Y} E_y[L(y, y′) | x].

What is the value of noise(x) in a deterministic scenario? Does the definition match the one given in this chapter for binary classification?

(b) Show that the average noise coincides with the Bayes error (minimum loss achieved by a measurable function).

3 Rademacher Complexity and VC-Dimension

The hypothesis sets typically used in machine learning are infinite. But the sample complexity bounds of the previous chapter are uninformative when dealing with infinite hypothesis sets. One could ask whether efficient learning from a finite sample is even possible when the hypothesis set H is infinite. Our analysis of the family of axis-aligned rectangles (example 2.4) indicates that this is indeed possible, at least in some cases, since we proved that that infinite concept class was PAC-learnable. Our goal in this chapter will be to generalize that result and derive general learning guarantees for infinite hypothesis sets.

A general idea for doing so consists of reducing the infinite case to the analysis of finite sets of hypotheses and then proceeding as in the previous chapter. There are different techniques for that reduction, each relying on a different notion of complexity for the family of hypotheses. The first complexity notion we will use is that of Rademacher complexity. This will help us derive learning guarantees using relatively simple proofs based on McDiarmid's inequality, while obtaining high-quality bounds, including data-dependent ones, which we will frequently make use of in future chapters.
However, the computation of the empirical Rademacher complexity is NP-hard for some hypothesis sets. Thus, we subsequently introduce two other, purely combinatorial, notions: the growth function and the VC-dimension. We first relate the Rademacher complexity to the growth function and then bound the growth function in terms of the VC-dimension. The VC-dimension is often easier to bound or estimate. We will review a series of examples showing how to compute or bound it, then relate the growth function and the VC-dimension. This leads to generalization bounds based on the VC-dimension. Finally, we present lower bounds based on the VC-dimension for two different settings: the realizable setting, where there is at least one hypothesis in the hypothesis set under consideration that achieves zero expected error, and the non-realizable setting, where no hypothesis in the set achieves zero expected error.

3.1 Rademacher complexity

We will continue to use H to denote a hypothesis set as in the previous chapters. Many of the results of this section are general and hold for an arbitrary loss function L : Y × Y → R. In what follows, G will generally be interpreted as the family of loss functions associated to H, mapping from Z = X × Y to R:

G = {g : (x, y) ↦ L(h(x), y) : h ∈ H}.

However, the definitions are given in the general case of a family of functions G mapping from an arbitrary input space Z to R.

The Rademacher complexity captures the richness of a family of functions by measuring the degree to which a hypothesis set can fit random noise. The following states the formal definitions of the empirical and average Rademacher complexity.

Definition 3.1 (Empirical Rademacher complexity) Let G be a family of functions mapping from Z to [a, b], and S = (z1, ..., zm) a fixed sample of size m with elements in Z.
Then, the empirical Rademacher complexity of G with respect to the sample S is defined as:

R̂_S(G) = E_σ[sup_{g∈G} (1/m) Σ_{i=1}^m σi g(zi)], (3.1)

where σ = (σ1, ..., σm)^T, with the σi's independent uniform random variables taking values in {−1, +1}.³ The random variables σi are called Rademacher variables.

Let g_S denote the vector of values taken by function g over the sample S: g_S = (g(z1), ..., g(zm))^T. Then, the empirical Rademacher complexity can be rewritten as

R̂_S(G) = E_σ[sup_{g∈G} (σ · g_S)/m].

The inner product σ · g_S measures the correlation of g_S with the vector of random noise σ. The supremum sup_{g∈G} (σ · g_S)/m is a measure of how well the function class G correlates with σ over the sample S. Thus, the empirical Rademacher complexity measures on average how well the function class G correlates with random noise on S. This describes the richness of the family G: richer or more complex families G can generate more vectors g_S and thus better correlate with random noise, on average.

³ We assume implicitly that the supremum over the family G in this definition is measurable and in general will adopt the same assumption throughout this book for other suprema over a class of functions. This assumption does not hold for arbitrary function classes, but it is valid for the hypothesis sets typically considered in practice in machine learning, and the instances discussed in this book.

Definition 3.2 (Rademacher complexity) Let D denote the distribution according to which samples are drawn. For any integer m ≥ 1, the Rademacher complexity of G is the expectation of the empirical Rademacher complexity over all samples of size m drawn according to D:

R_m(G) = E_{S∼D^m}[R̂_S(G)]. (3.2)

We are now ready to present our first generalization bounds based on Rademacher complexity.

Theorem 3.3 Let G be a family of functions mapping from Z to [0, 1]. Then, for any δ > 0, with probability at least 1 − δ over the draw of an i.i.d.
sample S of size m, each of the following holds for all g ∈ G:

E[g(z)] ≤ (1/m) Σ_{i=1}^m g(zi) + 2R_m(G) + √(log(1/δ)/(2m)) (3.3)

and

E[g(z)] ≤ (1/m) Σ_{i=1}^m g(zi) + 2R̂_S(G) + 3√(log(2/δ)/(2m)). (3.4)

Proof: For any sample S = (z1, ..., zm) and any g ∈ G, we denote by Ê_S[g] the empirical average of g over S: Ê_S[g] = (1/m) Σ_{i=1}^m g(zi). The proof consists of applying McDiarmid's inequality to the function Φ defined for any sample S by

Φ(S) = sup_{g∈G} (E[g] − Ê_S[g]). (3.5)

Let S and S′ be two samples differing by exactly one point, say zm in S and z′m in S′. Then, since the difference of suprema does not exceed the supremum of the difference, we have

Φ(S′) − Φ(S) ≤ sup_{g∈G} (Ê_S[g] − Ê_{S′}[g]) = sup_{g∈G} (g(zm) − g(z′m))/m ≤ 1/m. (3.6)

Similarly, we can obtain Φ(S) − Φ(S′) ≤ 1/m, thus |Φ(S) − Φ(S′)| ≤ 1/m. Then, by McDiarmid's inequality, for any δ > 0, with probability at least 1 − δ/2, the following holds:

Φ(S) ≤ E_S[Φ(S)] + √(log(2/δ)/(2m)). (3.7)

We next bound the expectation of the right-hand side as follows:

E_S[Φ(S)] = E_S[sup_{g∈G} (E[g] − Ê_S(g))]
= E_S[sup_{g∈G} E_{S′}[Ê_{S′}(g) − Ê_S(g)]] (3.8)
≤ E_{S,S′}[sup_{g∈G} (Ê_{S′}(g) − Ê_S(g))] (3.9)
= E_{S,S′}[sup_{g∈G} (1/m) Σ_{i=1}^m (g(z′i) − g(zi))] (3.10)
= E_{σ,S,S′}[sup_{g∈G} (1/m) Σ_{i=1}^m σi (g(z′i) − g(zi))] (3.11)
≤ E_{σ,S′}[sup_{g∈G} (1/m) Σ_{i=1}^m σi g(z′i)] + E_{σ,S}[sup_{g∈G} (1/m) Σ_{i=1}^m −σi g(zi)] (3.12)
= 2 E_{σ,S}[sup_{g∈G} (1/m) Σ_{i=1}^m σi g(zi)] = 2R_m(G). (3.13)

Equation (3.8) uses the fact that points in S′ are sampled in an i.i.d. fashion and thus E[g] = E_{S′}[Ê_{S′}(g)], as in (2.3). Inequality (3.9) holds due to the sub-additivity of the supremum function. In equation (3.11), we introduce Rademacher variables σi, which are uniformly distributed independent random variables taking values in {−1, +1}, as in definition 3.1.
This does not change the expectation appearing in (3.10): when σi = 1, the associated summand remains unchanged; when σi = −1, the associated summand flips signs, which is equivalent to swapping zi and z′i between S and S′. Since we are taking the expectation over all possible S and S′, this swap does not affect the overall expectation; we are simply changing the order of the summands within the expectation. Equation (3.12) holds by the sub-additivity of the supremum function, that is, the inequality sup(U + V) ≤ sup(U) + sup(V). Finally, (3.13) stems from the definition of Rademacher complexity and the fact that the variables σi and −σi are distributed in the same way.

The reduction to R_m(G) in equation (3.13) yields the bound in equation (3.3), using δ instead of δ/2. To derive a bound in terms of R̂_S(G), we observe that, by definition 3.1, changing one point in S changes R̂_S(G) by at most 1/m. Then, using again McDiarmid's inequality, with probability at least 1 − δ/2 the following holds:

R_m(G) ≤ R̂_S(G) + √(log(2/δ)/(2m)). (3.14)

Finally, we use the union bound to combine inequalities (3.7) and (3.14), which yields with probability at least 1 − δ:

Φ(S) ≤ 2R̂_S(G) + 3√(log(2/δ)/(2m)), (3.15)

which matches (3.4). □

The following result relates the empirical Rademacher complexity of a hypothesis set H to that of the family of loss functions G associated to H in the case of the binary (zero-one) loss.

Lemma 3.4 Let H be a family of functions taking values in {−1, +1}, and let G be the family of loss functions associated to H for the zero-one loss: G = {(x, y) ↦ 1_{h(x)≠y} : h ∈ H}. For any sample S = ((x1, y1), ..., (xm, ym)) of elements in X × {−1, +1}, let S_X denote its projection over X: S_X = (x1, ..., xm). Then, the following relation holds between the empirical Rademacher complexities of G and H:

R̂_S(G) = (1/2) R̂_{S_X}(H). (3.16)

Proof: For any sample S = ((x1, y1), ...
, (xm, ym)) of elements in X × {−1, +1}, by definition, the empirical Rademacher complexity of G can be written as:

R̂_S(G) = E_σ[sup_{h∈H} (1/m) Σ_{i=1}^m σi 1_{h(xi)≠yi}]
= E_σ[sup_{h∈H} (1/m) Σ_{i=1}^m σi (1 − yi h(xi))/2]
= (1/2) E_σ[sup_{h∈H} (1/m) Σ_{i=1}^m −σi yi h(xi)]
= (1/2) E_σ[sup_{h∈H} (1/m) Σ_{i=1}^m σi h(xi)] = (1/2) R̂_{S_X}(H),

where we used the fact that 1_{h(xi)≠yi} = (1 − yi h(xi))/2 and the fact that, for a fixed yi ∈ {−1, +1}, σi and −yi σi are distributed in the same way. □

Note that the lemma implies, by taking expectations, that for any m ≥ 1, R_m(G) = (1/2) R_m(H). These connections between the empirical and average Rademacher complexities can be used to derive generalization bounds for binary classification in terms of the Rademacher complexity of the hypothesis set H.

Theorem 3.5 (Rademacher complexity bounds — binary classification) Let H be a family of functions taking values in {−1, +1}, and let D be the distribution over the input space X. Then, for any δ > 0, with probability at least 1 − δ over a sample S of size m drawn according to D, each of the following holds for any h ∈ H:

R(h) ≤ R̂_S(h) + R_m(H) + √(log(1/δ)/(2m)) (3.17)

and

R(h) ≤ R̂_S(h) + R̂_S(H) + 3√(log(2/δ)/(2m)). (3.18)

Proof: The result follows immediately from theorem 3.3 and lemma 3.4. □

The theorem provides two generalization bounds for binary classification based on the Rademacher complexity. Note that the second bound, (3.18), is data-dependent: the empirical Rademacher complexity R̂_S(H) is a function of the specific sample S drawn. Thus, this bound could be particularly informative if we could compute R̂_S(H). But how can we compute the empirical Rademacher complexity? Using again the fact that σi and −σi are distributed in the same way, we can write

R̂_S(H) = E_σ[sup_{h∈H} (1/m) Σ_{i=1}^m −σi h(xi)] = −E_σ[inf_{h∈H} (1/m) Σ_{i=1}^m σi h(xi)].
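For a small finite class, R̂_S(H) can also be estimated directly from definition 3.1 by Monte Carlo over random σ vectors. The following sketch is our own illustration (not the book's; the helper name is an assumption):

```python
import random
from itertools import product

def empirical_rademacher(vectors, n_draws=2000, seed=0):
    """Monte Carlo estimate of R_hat_S = E_sigma[sup_h (1/m) sigma . h_S],
    where `vectors` lists the value vectors h_S = (h(x_1), ..., h(x_m))."""
    rng = random.Random(seed)
    m = len(vectors[0])
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        total += max(sum(s * v for s, v in zip(sigma, vec)) / m
                     for vec in vectors)
    return total / n_draws

# A class realizing every labeling of the sample fits any noise vector
# exactly, so its empirical Rademacher complexity is 1.
full_class = list(product((-1, 1), repeat=4))
r_full = empirical_rademacher(full_class, n_draws=100)
# A single hypothesis cannot correlate with random noise on average.
r_single = empirical_rademacher([(1, 1, 1, 1)], n_draws=4000)
```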
Now, for a fixed value of σ, computing inf_{h∈H} (1/m) Σ_{i=1}^m σ_i h(x_i) is equivalent to an empirical risk minimization problem, which is known to be computationally hard for some hypothesis sets. Thus, in some cases, computing R̂_S(H) could be computationally hard. In the next sections, we will relate the Rademacher complexity to combinatorial measures that are easier to compute and also of independent interest for their usefulness in the analysis of learning in many contexts.

3.2 Growth function

Here we will show how the Rademacher complexity can be bounded in terms of the growth function.

Definition 3.6 (Growth function) The growth function Π_H : N → N for a hypothesis set H is defined by:

    ∀m ∈ N, Π_H(m) = max_{{x_1,…,x_m} ⊆ X} |{(h(x_1), …, h(x_m)) : h ∈ H}|.    (3.19)

In other words, Π_H(m) is the maximum number of distinct ways in which m points can be classified using hypotheses in H. Each one of these distinct classifications is called a dichotomy and, thus, the growth function counts the number of dichotomies realized by the hypothesis set. This provides another measure of the richness of the hypothesis set H. However, unlike the Rademacher complexity, this measure does not depend on the distribution; it is purely combinatorial.

To relate the Rademacher complexity to the growth function, we will use Massart's lemma.

Theorem 3.7 (Massart's lemma) Let A ⊆ R^m be a finite set, with r = max_{x∈A} ‖x‖_2. Then the following holds:

    E_σ[ (1/m) sup_{x∈A} Σ_{i=1}^m σ_i x_i ] ≤ (r √(2 log |A|)) / m,    (3.20)

where the σ_i are independent uniform random variables taking values in {−1, +1} and x_1, …, x_m are the components of vector x.

Proof: The result follows immediately from the bound on the expectation of a maximum given by Corollary D.11, since the random variables σ_i x_i are independent and each σ_i x_i takes values in [−|x_i|, |x_i|], with √(Σ_{i=1}^m x_i²) ≤ r. □

Using this result, we can now bound the Rademacher complexity in terms of the growth function.
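Massart's lemma lends itself to a quick numerical sanity check. The sketch below (in Python; the set A is an arbitrary, hypothetical choice) computes the left-hand side of (3.20) exactly, by enumerating all 2^m sign vectors, and compares it to the bound.

```python
import itertools
import math

import numpy as np

# A small finite set A ⊂ R^m (the values here are arbitrary, for illustration).
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 5))            # |A| = 8 vectors in R^5, so m = 5
m = A.shape[1]
r = max(np.linalg.norm(x) for x in A)  # r = max_{x in A} ||x||_2

# Left-hand side of (3.20): E_sigma[(1/m) sup_{x in A} sum_i sigma_i x_i],
# computed exactly by averaging over all 2^m sign vectors sigma.
lhs = np.mean([np.max(A @ np.array(s)) / m
               for s in itertools.product([-1.0, 1.0], repeat=m)])

# Right-hand side of (3.20): r * sqrt(2 log |A|) / m  (natural log).
rhs = r * math.sqrt(2 * math.log(len(A))) / m
print(f"LHS = {lhs:.4f} <= bound = {rhs:.4f}")
```

Since the bound holds for every finite A, the comparison always succeeds; what varies with the choice of A is only how tight the bound is, through r and |A|.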
Corollary 3.8 Let G be a family of functions taking values in {−1, +1}. Then the following holds:

    R_m(G) ≤ √(2 log Π_G(m) / m).    (3.21)

Proof: For a fixed sample S = (x_1, …, x_m), we denote by G_S the set of vectors of function values (g(x_1), …, g(x_m))^⊤ where g is in G. Since g ∈ G takes values in {−1, +1}, the norm of these vectors is bounded by √m. We can then apply Massart's lemma as follows:

    R_m(G) = E_S[ E_σ[ sup_{u∈G_S} (1/m) Σ_{i=1}^m σ_i u_i ] ] ≤ E_S[ (√m √(2 log |G_S|)) / m ].

By definition, |G_S| is bounded by the growth function; thus,

    R_m(G) ≤ E_S[ (√m √(2 log Π_G(m))) / m ] = √(2 log Π_G(m) / m),

which concludes the proof. □

Combining the generalization bound (3.17) of theorem 3.5 with corollary 3.8 yields immediately the following generalization bound in terms of the growth function.

Corollary 3.9 (Growth function generalization bound) Let H be a family of functions taking values in {−1, +1}. Then, for any δ > 0, with probability at least 1 − δ, for any h ∈ H,

    R(h) ≤ R̂_S(h) + √(2 log Π_H(m) / m) + √(log(1/δ) / (2m)).    (3.22)

[Figure 3.1 VC-dimension of intervals on the real line. (a) Any two points can be shattered. (b) No sample of three points can be shattered, as the (+, −, +) labeling cannot be realized.]

Growth function bounds can also be derived directly (without using Rademacher complexity bounds first). The resulting bound is then the following:

    P[ |R(h) − R̂_S(h)| > ε ] ≤ 4 Π_H(2m) exp(−mε²/8),    (3.23)

which only differs from (3.22) by constants.

The computation of the growth function may not always be convenient since, by definition, it requires computing Π_H(m) for all m ≥ 1. The next section introduces an alternative measure of the complexity of a hypothesis set H that is based instead on a single scalar, which will turn out to be in fact deeply related to the behavior of the growth function.

3.3 VC-dimension

Here, we introduce the notion of VC-dimension (Vapnik-Chervonenkis dimension).
The VC-dimension is also a purely combinatorial notion, but it is often easier to compute than the growth function (or the Rademacher complexity). As we shall see, the VC-dimension is a key quantity in learning and is directly related to the growth function.

To define the VC-dimension of a hypothesis set H, we first introduce the concept of shattering. Recall from the previous section that, given a hypothesis set H, a dichotomy of a set S is one of the possible ways of labeling the points of S using a hypothesis in H. A set S of m ≥ 1 points is said to be shattered by a hypothesis set H when H realizes all possible dichotomies of S, that is, when Π_H(m) = 2^m.

Definition 3.10 (VC-dimension) The VC-dimension of a hypothesis set H is the size of the largest set that can be shattered by H:

    VCdim(H) = max{m : Π_H(m) = 2^m}.    (3.24)

Note that, by definition, if VCdim(H) = d, there exists a set of size d that can be shattered. However, this does not imply that all sets of size d or less are shattered and, in fact, this is typically not the case.

[Figure 3.2 Unrealizable dichotomies for four points using hyperplanes in R². (a) All four points lie on the convex hull. (b) Three points lie on the convex hull while the remaining point is interior.]

To further illustrate this notion, we will examine a series of examples of hypothesis sets and will determine the VC-dimension in each case. To compute the VC-dimension we will typically show a lower bound fo