Introduction to the Math of Neural Networks
Jeff Heaton
Year:
2012
Edition:
1st
Publisher:
Heaton Research
Language:
English
Pages:
102
INTRODUCTION TO THE MATH OF NEURAL NETWORKS

Heaton Research

Title: Introduction to the Math of Neural Networks
Author: Jeff Heaton
Published: May 01, 2012
Copyright: Copyright 2012 by Heaton Research, Inc. All Rights Reserved.
File Created: Thu May 17 13:06:16 CDT 2012
ISBN: 9781475190878
Price: 9.99 USD

Do not make illegal copies of this ebook. This eBook is copyrighted material, and public distribution is prohibited. If you did not receive this ebook from Heaton Research (http://www.heatonresearch.com) or an authorized bookseller, please contact Heaton Research, Inc. to purchase a licensed copy. DRM-free copies of our books can be purchased from: http://www.heatonresearch.com/book

If you purchased this book, thank you! Your purchase of this book supports the Encog Machine Learning Framework. http://www.encog.org

Publisher: Heaton Research, Inc.
Introduction to the Math of Neural Networks, May, 2012
Author: Jeff Heaton
Editor: WordsRU.com
Cover Art: Carrie Spear
ISBN: 9781475190878

Copyright © 2012 by Heaton Research, Inc., 1734 Clarkson Rd. #107, Chesterfield, MO 63017-4976. World rights reserved. The author(s) created reusable code in this publication expressly for reuse by readers. Heaton Research, Inc. grants readers permission to reuse the code found in this publication or downloaded from our website so long as the author(s) are attributed in any application containing the reusable code and the source code itself is never redistributed, posted online by electronic transmission, sold or commercially exploited as a standalone product. Aside from this specific exception concerning reusable code, no part of this publication may be stored in a retrieval system, transmitted, or reproduced in any way, including, but not limited to, photocopy, photograph, magnetic, or other record, without prior agreement and written permission of the publisher.

Heaton Research, Encog, the Encog Logo and the Heaton Research logo are all trademarks of Heaton Research, Inc., in the United States and/or other countries.

TRADEMARKS: Heaton Research has attempted throughout this book to distinguish proprietary trademarks from descriptive terms by following the capitalization style used by the manufacturer.

The author and publisher have made their best efforts to prepare this book, so the content is based upon the final release of software whenever possible. Portions of the manuscript may be based upon prerelease versions supplied by software manufacturer(s). The author and the publisher make no representation or warranties of any kind with regard to the completeness or accuracy of the contents herein and accept no liability of any kind including but not limited to performance, merchantability, fitness for any particular purpose, or any losses or damages of any kind caused or alleged to be caused directly or indirectly from this book.

SOFTWARE LICENSE AGREEMENT: TERMS AND CONDITIONS

The media and/or any online materials accompanying this book that are available now or in the future contain programs and/or text files (the "Software") to be used in connection with the book. Heaton Research, Inc. hereby grants to you a license to use and distribute software programs that make use of the compiled binary form of this book's source code. You may not redistribute the source code contained in this book without the written permission of Heaton Research, Inc. Your purchase, acceptance, or use of the Software will constitute your acceptance of such terms. The Software compilation is the property of Heaton Research, Inc. unless otherwise indicated and is protected by copyright to Heaton Research, Inc. or other copyright owner(s) as indicated in the media files (the "Owner(s)"). You are hereby granted a license to use and distribute the Software for your personal, noncommercial use only.
You may not reproduce, sell, distribute, publish, circulate, or commercially exploit the Software, or any portion thereof, without the written consent of Heaton Research, Inc. and the specific copyright owner(s) of any component software included on this media. In the event that the Software or components include specific license requirements or end-user agreements, statements of condition, disclaimers, limitations or warranties ("End-User License"), those End-User Licenses supersede the terms and conditions herein as to that particular Software component. Your purchase, acceptance, or use of the Software will constitute your acceptance of such End-User Licenses. By purchase, use or acceptance of the Software you further agree to comply with all export laws and regulations of the United States as such laws and regulations may exist from time to time.

SOFTWARE SUPPORT

Components of the supplemental Software and any offers associated with them may be supported by the specific Owner(s) of that material, but they are not supported by Heaton Research, Inc. Information regarding any available support may be obtained from the Owner(s) using the information provided in the appropriate README files or listed elsewhere on the media. Should the manufacturer(s) or other Owner(s) cease to offer support or decline to honor any offer, Heaton Research, Inc. bears no responsibility. This notice concerning support for the Software is provided for your information only. Heaton Research, Inc. is not the agent or principal of the Owner(s), and Heaton Research, Inc. is in no way responsible for providing any support for the Software, nor is it liable or responsible for any support provided, or not provided, by the Owner(s).

WARRANTY

Heaton Research, Inc. warrants the enclosed media to be free of physical defects for a period of ninety (90) days after purchase. The Software is not available from Heaton Research, Inc. in any other form or media than that enclosed herein or posted to www.heatonresearch.com. If you discover a defect in the media during this warranty period, you may obtain a replacement of identical format at no charge by sending the defective media, postage prepaid, with proof of purchase to:

Heaton Research, Inc.
Customer Support Department
1734 Clarkson Rd #107
Chesterfield, MO 63017-4976
Web: www.heatonresearch.com
E-Mail: support@heatonresearch.com

DISCLAIMER

Heaton Research, Inc. makes no warranty or representation, either expressed or implied, with respect to the Software or its contents, quality, performance, merchantability, or fitness for a particular purpose. In no event will Heaton Research, Inc., its distributors, or dealers be liable to you or any other party for direct, indirect, special, incidental, consequential, or other damages arising out of the use of or inability to use the Software or its contents even if advised of the possibility of such damage. In the event that the Software includes an online update feature, Heaton Research, Inc. further disclaims any obligation to provide this feature for any specific duration other than the initial posting. The exclusion of implied warranties is not permitted by some states. Therefore, the above exclusion may not apply to you. This warranty provides you with specific legal rights; there may be other rights that you may have that vary from state to state. The pricing of the book with the Software by Heaton Research, Inc. reflects the allocation of risk and limitations on liability contained in this agreement of Terms and Conditions.

SHAREWARE DISTRIBUTION

This Software may use various programs and libraries that are distributed as shareware. Copyright laws apply to both shareware and ordinary commercial software, and the copyright Owner(s) retains all rights. If you try a shareware program and continue using it, you are expected to register it.
Individual programs differ on details of trial periods, registration, and payment. Please observe the requirements stated in the appropriate files.

Introduction

• Math Needed for Neural Networks
• Prerequisites
• Other Resources
• Structure of this Book

If you have read other books I have written, you will know that I try to shield the reader from the mathematics behind AI. Often, you do not need to know the exact math that is used to train a neural network or perform a cluster operation. You simply want the result. This results-based approach is very much the focus of the Encog project. Encog is an advanced machine learning framework that allows you to perform many advanced operations, such as neural networks, genetic algorithms, support vector machines, simulated annealing and other machine learning methods. Encog allows you to use these advanced techniques without needing to know what is happening behind the scenes.

However, sometimes you really do want to know what is going on behind the scenes. You do want to know the math that is involved. In this book, you will learn what happens behind the scenes with a neural network. You will also be exposed to the math.

There are already many neural network books that at first glance appear as a math text. This is not what I seek to produce here. There are already several very good books that achieve a pure mathematical introduction to neural networks. My goal is to produce a mathematically-based neural network book that targets someone who has perhaps only a college-level algebra and computer programming background. These are the only two prerequisites for understanding this book, aside from one more that I will mention later in this introduction.

Neural networks overlap several bodies of mathematics. Neural network goals, such as classification, regression and clustering, come from statistics. The gradient descent that goes into backpropagation, along with other training methods, requires knowledge of Calculus.
Advanced training, such as Levenberg Marquardt, requires both Calculus and Matrix Mathematics. To read nearly any academic-level neural network or machine learning book, you will need some knowledge of Algebra, Calculus, Statistics and Matrix Mathematics. However, the reality is that you need only a relatively small amount of knowledge from each of these areas. The goal of this book is to teach you enough math to understand neural networks and their training. You will learn exactly how a neural network functions, and when you have finished this book, you should be able to implement your own in any computer language you are familiar with.

Since knowledge of some areas of mathematics is needed, I will provide an introductory-level tutorial on the math. I only assume that you know basic algebra to start out with. This book will discuss such mathematical concepts as derivatives, partial derivatives, matrix transformation, gradient descent and more. If you have not done this sort of math in a while, I plan for this book to be a good refresher. If you have never done this sort of math, then this book could serve as a good introduction. If you are very familiar with math, you can still learn neural networks from this book. However, you may want to skip some of the sections that cover basic material.

This book is not about Encog, nor is it about how to program in any particular programming language. I assume that you will likely apply these principles to programming languages. If you want examples of how I apply the principles in this book, you can learn more about Encog. This book is really more about the algorithms and mathematics behind neural networks.

I did say there was one other prerequisite to understanding this book, other than basic algebra and programming knowledge in any language. That final prerequisite is knowledge of what a neural network is and how it is used.
If you do not yet know how to use a neural network, you may want to start with my article, "A Non-Mathematical Introduction to Using Neural Networks", which you can find at http://www.heatonresearch.com/content/non-mathematical-introduction-using-neural-networks. The above article provides a brief crash course on what neural networks are. You may also want to look at some of the Encog examples. You can find more information about Encog at the following URL: http://www.heatonresearch.com/encog

If neural networks are cars, then this book is a mechanics guide. If I am going to teach you to repair and build cars, I make two basic assumptions, in order of importance. The first is that you've actually seen a car, and know what one is used for. The second assumption is that you know how to drive a car. If neither of these is true, then why do you care about learning the internals of how a car works? The same applies to neural networks.

Other Resources

There are many other resources on the internet that will be very useful as you read through this book. This section will provide you with an overview of some of these resources.

The first is the Khan Academy. This is a collection of YouTube videos that demonstrate many areas of mathematics. If you need additional review on any mathematical concept in this book, there is most likely a video on the Khan Academy that covers it. http://www.khanacademy.org

Second is the Neural Network FAQ. This text-only resource has a great deal of information on neural networks. http://www.faqs.org/faqs/ai-faq/neural-nets/

The Encog wiki has a fair amount of general information on machine learning. This information is not necessarily tied to Encog. There are articles in the Encog wiki that will be helpful as you complete this book. http://www.heatonresearch.com/wiki/Main_Page

Finally, the Encog forums are a place where AI and neural networks can be discussed.
These forums are fairly active and you will likely receive an answer from myself or from one of the community members at the forum. http://www.heatonresearch.com/forum

These resources should be helpful to you as you progress through this book.

Structure of this Book

The first chapter, "Neural Network Activation", shows how the output from a neural network is calculated. Before you can find out how to train and evaluate a neural network, you must understand how a neural network produces its output.

Chapter 2, "Error Calculation", demonstrates how to evaluate the output from a neural network. Neural networks begin with random weights. Training adjusts these weights to produce meaningful output.

Chapter 3, "Understanding Derivatives", focuses on a very important Calculus topic. Derivatives, and partial derivatives, are used by several neural network training methods. This chapter will introduce you to those aspects of derivatives that are needed for this book.

Chapter 4, "Training with Backpropagation", shows you how to apply knowledge from Chapter 3 towards training a neural network. Backpropagation is one of the oldest training techniques for neural networks. There are newer, and much superior, training methods available. However, understanding backpropagation provides a very important foundation for resilient propagation (RPROP), quick propagation (QPROP) and the Levenberg Marquardt Algorithm (LMA).

Chapter 5, "Faster Training with RPROP", introduces resilient propagation, which builds upon backpropagation to provide much quicker training times.

Chapter 6, "Weight Initialization", shows how neural networks are given their initial random weights. Some sets of random weights perform better than others. This chapter looks at several, less than random, weight initialization methods.

Chapter 7, "LMA Training", introduces the Levenberg Marquardt Algorithm. LMA is the most mathematically intense training method in this book.
LMA can sometimes offer very rapid training for a neural network.

Chapter 8, "Self-Organizing Maps", shows how to create a clustering neural network. The Self-Organizing Map (SOM) can be used to group data. The structure of the SOM is similar to the feedforward neural networks seen in this book.

Chapter 9, "Normalization", shows how numbers are normalized for neural networks. Neural networks typically require that input and output numbers be in the range of 0 to 1, or -1 to 1. This chapter shows how to transform numbers into that range.

Chapter 1: Neural Network Activation

• Summation
• Calculating Activation
• Activation Functions
• Bias Neurons

In this chapter, you will find out how to calculate the output for a feedforward neural network. Most neural networks are in some way based on the feedforward neural network. Learning how this simple neural network is calculated will form the foundation for understanding training, as well as other more complex features of neural networks.

Several mathematical terms will be introduced in this chapter. You will be shown summation notation and simple mathematical formula notation. We will begin with a review of the summation operator.

Understanding the Summation Operator

In this section, we will take a quick look at the summation operator. The summation operator, represented by the capital Greek letter sigma, can be seen in Equation 1.1.

Equation 1.1: The Summation Operator

x = Σ (i=1 to 10) i

The above equation is a summation. If you are unfamiliar with sigma notation, it is essentially the same thing as a programming for loop. Figure 1.1 shows Equation 1.1 reduced to pseudocode.

Figure 1.1: Summation Operator to Code

x = 0
for i = 1 to 10
  x = x + i
next i

As you can see, the summation operator is very similar to a for loop. The information just below the sigma symbol specifies the starting value and the indexing variable. The information above the sigma specifies the limit of the loop.
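Equation 1.1's summation can be checked directly in code. The following Python sketch (my own, not from the book) mirrors the pseudocode in Figure 1.1:

```python
# Sum the integers from the starting value (i = 1) up to the limit (10),
# exactly as the sigma notation in Equation 1.1 describes.
x = 0
for i in range(1, 11):
    x = x + i

print(x)  # 55
```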
The information to the right of sigma specifies the value that is being summed.

Calculating a Neural Network

We will begin by looking at how a neural network calculates its output. You should already know the structure of a neural network from the resources included in this book's introduction. Consider a neural network such as the one in Figure 1.2.

Figure 1.2: A Simple Neural Network

This neural network has one output neuron. As a result, it will have one output value. To calculate the value of this output neuron (O1), we must calculate the activation for each of the inputs into O1. The inputs that feed into O1 are H1, H2 and B2. The activation for B2 is simply 1.0, because it is a bias neuron. However, H1 and H2 must be calculated independently. To calculate H1 and H2, the activations of I1, I2 and B1 must be considered. Though H1 and H2 share the same inputs, they will not calculate to the same activation. This is because they have different weights. In the above diagram, the weights are represented by lines.

First, we must find out how one activation calculation is done. This same activation calculation can then be applied to the other activation calculations. We will examine how H1 is calculated. Figure 1.3 shows only the inputs to H1.

Figure 1.3: Calculating H1's Activation

We will now examine how to calculate H1. This relatively simple equation is shown in Equation 1.2.

Equation 1.2: Calculate H1

h1 = A( Σ (c=1 to n) i[c] * w[c] )

To understand Equation 1.2, we can first look at the variables that go into it. For the above equation we have three input values, described by the variable i. The three input values are the input values of I1, I2 and B1. I1 and I2 are simply the input values with which the neural network was provided to compute the output. B1 is always 1, because it is the bias neuron. There are also three weight values considered: w1, w2 and w3. These are the weighted connections between H1 and the previous layer.
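As a concrete sketch of Equation 1.2, the snippet below computes H1's activation in Python. The book's figure does not give numeric weights or inputs, so the values here are hypothetical, chosen only for illustration:

```python
import math

# Hypothetical values, for illustration only.
i = [0.5, 0.8, 1.0]     # I1, I2, and B1 (the bias is always 1.0)
w = [0.1, -0.2, 0.3]    # weights from I1, I2 and B1 into H1

# Sum each input times its respective weight (the sigma in Equation 1.2).
s = sum(w[c] * i[c] for c in range(3))

# Apply an activation function A; here, the sigmoid.
h1 = 1.0 / (1.0 + math.exp(-s))
print(round(h1, 4))
```

The same two steps, weighted sum then activation, repeat for every neuron in the network after the input layer.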
Therefore, the variables to this equation are:

i[1] = first input value to the neural network
i[2] = second input value to the neural network
i[3] = 1, the bias
w[1] = weight from I1 to H1
w[2] = weight from I2 to H1
w[3] = weight from B1 to H1
n = 3, the number of connections

Though the bias neuron is not really part of the input array, a value of one is always placed into the input array for the bias neuron. Treating the bias as a forward-only neuron makes the calculation much easier. To understand Equation 1.2, we will consider it as pseudocode.

double w[3]    // the weights
double i[3]    // the input values
double sum = 0 // the sum

// perform the summation (sigma)
for c = 0 to 2
  sum = sum + (w[c] * i[c])
next c

// apply the activation function
sum = A(sum)

Here, we sum up each of the inputs times its respective weight. Finally, this sum is passed to an activation function. Activation functions are a very important concept in neural network programming. In the next section, we will examine activation functions.

Activation Functions

Activation functions are very commonly used in neural networks. They serve several important functions for a neural network. The primary reason to use an activation function is to introduce nonlinearity to the neural network. Without this nonlinearity, a neural network could do little to learn nonlinear functions. The output that we expect neural networks to learn is rarely linear.

The two most common activation functions are the sigmoid and hyperbolic tangent activation functions. The hyperbolic tangent activation function is the more common of these two, as it has a number range from -1 to 1, compared to the sigmoid function, which ranges only from 0 to 1.

Equation 1.3: The Hyperbolic Tangent Function

tanh(x) = (e^(2x) - 1) / (e^(2x) + 1)

The hyperbolic tangent function is actually a trigonometric function. However, our use for it has nothing to do with trigonometry.
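Equation 1.3 can be implemented directly from its exponential form. This short Python sketch (mine, not the book's code) evaluates it and notes how it behaves at the edges of its range:

```python
import math

def tanh_activation(x):
    # Equation 1.3: (e^(2x) - 1) / (e^(2x) + 1)
    e2x = math.exp(2.0 * x)
    return (e2x - 1.0) / (e2x + 1.0)

# Moderate inputs fall inside (-1, 1); large inputs are
# squashed toward the edges of that range.
print(tanh_activation(0.5))    # about 0.4621
print(tanh_activation(-10.0))  # very close to -1
```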
This function was chosen for the shape of its graph. You can see a graph of the hyperbolic tangent function in Figure 1.4.

Figure 1.4: The Hyperbolic Tangent Function

Notice that the range is from -1 to 1. This allows it to accept a much wider range of numbers. Also notice how values beyond -1 to 1 are quickly scaled. This provides a consistent range of numbers for the network.

Now we will look at the sigmoid function. You can see this in Equation 1.4.

Equation 1.4: The Sigmoid Function

sigmoid(x) = 1 / (1 + e^(-x))

The sigmoid function is also called the logistic function. Typically it does not perform as well as the hyperbolic tangent function. However, if the values in the training data are all positive, it can perform well. The graph for the sigmoid function is shown in Figure 1.5.

Figure 1.5: The Sigmoid Function

As you can see, it scales numbers to 1.0. It also has a range that only includes positive numbers. It is less general purpose than hyperbolic tangent, but it can be useful. The sigmoid function rarely outperforms the hyperbolic tangent function.

Bias Neurons

You may be wondering why bias values are even needed. The answer is that bias values allow a neural network to output a value of zero even when the input is near one. Adding a bias allows the output of the activation function to be shifted to the left or right on the x-axis. To understand this, consider a simple neural network where a single input neuron I1 is directly connected to an output neuron O1. The network shown in Figure 1.6 has no bias.

Figure 1.6: A Bias-less Connection

This network's output is computed by multiplying the input (x) by the weight (w). The result is then passed through an activation function. In this case, we are using the sigmoid activation function. Consider the output of the sigmoid function for the following four weights.

sigmoid(0.5 * x)
sigmoid(1.0 * x)
sigmoid(1.5 * x)
sigmoid(2.0 * x)

Given the above weights, the output of the sigmoid will be as seen in Figure 1.7.

Figure 1.7: Adjusting Weights

Changing the weight w alters the "steepness" of the sigmoid function. This allows the neural network to learn patterns. However, what if you wanted the network to output 0 when x is a value other than 0, such as 3? Simply changing the steepness of the sigmoid will not accomplish this. You must be able to shift the entire curve to the right. That is the purpose of bias. Adding a bias neuron causes the neural network to appear as in Figure 1.8.

Figure 1.8: A Biased Connection

Now we can calculate with the bias neuron present. We will calculate for several bias weights.

sigmoid(1 * x + 1 * 1)
sigmoid(1 * x + 0.5 * 1)
sigmoid(1 * x + 1.5 * 1)
sigmoid(1 * x + 2 * 1)

This produces the following plot, seen in Figure 1.9.

Figure 1.9: Adjusting Bias

As you can see, the entire curve now shifts.

Chapter Summary

This chapter demonstrated how a feedforward neural network calculates output. The output of a neural network is determined by calculating each successive layer after the input layer. The final output of the neural network eventually reaches the output layer.

Neural networks make use of activation functions. An activation function provides nonlinearity to the neural network. Because most of the data that a neural network seeks to learn is nonlinear, the activation functions must be nonlinear. An activation function is applied after the weights and activations have been multiplied.

Most neural networks have bias neurons. Bias is an important concept for neural networks. Bias neurons are added to every non-output layer of the neural network. Bias neurons are different than ordinary neurons in two very important ways. First, the output from a bias neuron is always one. Second, a bias neuron has no inbound connections.
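The shift that a bias weight produces is easy to verify numerically. In this sketch (my own values, assumed for illustration), no choice of weight alone can move the sigmoid's midpoint away from x = 0, while a bias weight of 3 moves it to x = -3:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Without bias, every weight still gives 0.5 at x = 0:
# changing w only changes the steepness, never the midpoint.
print(sigmoid(0.5 * 0.0), sigmoid(2.0 * 0.0))  # 0.5 0.5

# With a bias weight of 3, the whole curve shifts:
# the midpoint (output 0.5) now sits at x = -3.
print(sigmoid(1 * -3 + 3 * 1))  # 0.5
```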
The constant value of one allows the layer to respond with nonzero values even when the input to the layer is zero. This can be very important for certain data sets.

Neural networks will output values determined by the weights of the connections. These weights are usually set to random initial values. Training is the process in which these random weights are adjusted to produce meaningful results. We need a way for the neural network to measure the effectiveness of the neural network. This measure is called error calculation. Error calculation is discussed in the next chapter.

Chapter 2: Error Calculation Methods

• Understanding Error Calculation
• The Error Function
• Error Calculation Methods
• How the Error is Used

In this chapter, we will find out how to calculate errors for a neural network. When performing supervised training, a neural network's actual output must be compared against the ideal output specified in the training data. The difference between actual and ideal output is the error of the neural network.

Error calculation occurs at two levels. First, there is the local error. This is the difference between the actual output of one individual neuron and the ideal output that was expected. The local error is calculated using an error function. The local errors are aggregated together to form a global error. The global error is the measurement of how well a neural network performs on the entire training set.

There are several different means by which a global error can be calculated. The global error calculation methods discussed in this chapter are listed below.

• Sum of Squares Error (ESS)
• Mean Square Error (MSE)
• Root Mean Square (RMS)

Usually, you will use MSE. MSE is the most common means of calculating errors for a neural network. Later in the book, we will look at when to use ESS. The Levenberg Marquardt Algorithm (LMA), which will be covered in Chapter 7, requires ESS. Lastly, RMS can be useful in certain situations.
RMS can be useful in electronics and signal processing.

The Error Function

We will start by looking at the local error. The local error comes from the error function. The error function is fed the actual and ideal outputs for a single output neuron. The error function then produces a number that represents the error of that output neuron. Training methods will seek to minimize this error.

This book will cover two error functions. The first is the standard linear error function, which is the most commonly used function. The second is the arctangent error function that is introduced by the Quick Propagation training method. Arctangent error functions and Quick Propagation will be discussed in Chapter 4, "Training with Backpropagation". This chapter will focus on the standard linear error function. The formula for the linear error function can be seen in Equation 2.1.

Equation 2.1: The Linear Error Function

E = (i - a)

The linear error function is very simple. The error is the difference between the ideal (i) and actual (a) outputs from the neural network. The only requirement of the error function is that it produce an error that you would like to minimize. For an example of this, consider a neural network output neuron that produced 0.9 when it should have produced 0.8. The error for this neural network would be the difference between 0.8 and 0.9, which is 0.1.

In some cases, you may not provide an ideal output to the neural network and still use supervised training. In this case, you would write an error function that somehow evaluates the output of the neural network for the given input. This evaluation error function would need to assign some sort of a score to the neural network. A higher number would indicate less desirable output, while a lower number would indicate more desirable output. The training process would attempt to minimize this score.
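Equation 2.1 in code is a one-liner. The sketch below (mine, not the book's) reproduces the worked example of a neuron that output 0.9 when 0.8 was expected:

```python
def linear_error(ideal, actual):
    # Equation 2.1: E = (i - a)
    return ideal - actual

# The book's example: ideal 0.8, actual 0.9.
e = linear_error(0.8, 0.9)
print(abs(e))  # roughly 0.1 (floating point makes it inexact)
```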
Calculating Global Error

Now that we have found out how to calculate the local error, we will move on to global error. MSE error calculation is the most common, so we will begin with that. You can see the equation that is used to calculate MSE in Equation 2.2.

Equation 2.2: MSE Error Calculation

MSE = (1/n) * Σ (i=1 to n) E_i^2

As you can see, the above equation makes use of the local error (E) that we defined in the last section. Each local error is squared and summed. The resulting sum is then divided by the total number of cases. In this way, the MSE error is similar to a traditional average, except that each local error is squared. The squaring negates the effect of some errors being positive and others being negative. This is because a positive number squared is a positive number, just as a negative number squared is also a positive number. If you are unfamiliar with the summation operator, shown as a capital Greek letter sigma, refer to Chapter 1.

The MSE error is typically written as a percentage. The goal is to decrease this error percentage as training progresses. To see how this is used, consider the following program output.

Beginning training...
Iteration #1 Error: 51.023786%  Target Error: 1.000000%
Iteration #2 Error: 49.659291%  Target Error: 1.000000%
Iteration #3 Error: 43.140471%  Target Error: 1.000000%
Iteration #4 Error: 29.820891%  Target Error: 1.000000%
Iteration #5 Error: 29.457086%  Target Error: 1.000000%
Iteration #6 Error: 19.421585%  Target Error: 1.000000%
Iteration #7 Error: 2.160925%   Target Error: 1.000000%
Iteration #8 Error: 0.432104%   Target Error: 1.000000%
Input=0.0000,0.0000, Actual=0.0091, Ideal=0.0000
Input=1.0000,0.0000, Actual=0.9793, Ideal=1.0000
Input=0.0000,1.0000, Actual=0.9472, Ideal=1.0000
Input=1.0000,1.0000, Actual=0.0731, Ideal=0.0000
Machine Learning Type: feedforward
Machine Learning Architecture: ?:B->SIGMOID->4:B->SIGMOID->?
Training Method: lma
Training Args:

The above shows a program learning the XOR operator. Notice how the MSE error drops in each iteration? Finally, by iteration eight the error is below one percent, and training stops.

Other Error Calculation Methods

Though MSE is the most common method of calculating global error, it is not the only method. In this section, we will look at two other global error calculation methods.

Sum of Squares Error

The sum of squares method (ESS) uses a similar formula to the MSE error method. However, ESS does not divide by the number of elements. As a result, the ESS is not a percent. It is simply a number that is larger depending on how severe the error is. Equation 2.3 shows the ESS error formula.

Equation 2.3: Sum of Squares Error

ESS = (1/2) * Σ (i=1 to n) E_i^2

As you can see above, the sum is not divided by the number of elements. Rather, the sum is simply divided in half. This results in an error that is not a percent, but instead a total of the errors. Squaring the errors eliminates the effect of positive and negative errors. Some training methods require that you use ESS. The Levenberg Marquardt Algorithm (LMA) requires that the error calculation method be ESS. LMA will be covered in Chapter 7, "LMA Training".

Root Mean Square Error

The Root Mean Square (RMS) error method is very similar to the MSE method previously discussed. The primary difference is that the square root of the mean is taken.
You can see the RMS formula in Equation 2.4.

Equation 2.4: Root Mean Square Error

RMS = sqrt( (1/n) * Σ E^2 )

Root mean square error will always be higher than MSE. The following output shows the calculated error for all three error calculation methods. All three cases use the same actual and ideal values.

Trying from 1.00 to 1.00
Actual: [0.36, 0.07, 0.55, 0.05, 0.37, 0.34, 0.72, 0.10, 0.41, 0.32]
Ideal:  [0.37, 0.06, 0.51, 0.06, 0.36, 0.35, 0.67, 0.09, 0.43, 0.33]
Error (ESS): 0.00312453
Error (MSE): 0.062491%
Error (RMS): 2.499810%

RMS is not used very often for neural network error calculation. RMS was originally created in the field of electrical engineering. I myself have not used RMS a great deal. Many research papers involving RMS show it being used for waveform analysis.

Chapter Summary

Neural networks start with random values for weights. These networks are then trained until a set of weights is found that provides output from the neural network that closely matches the ideal values from the training data. For training to progress, a means is needed to evaluate the degree to which the actual output from the neural network matches the ideal output expected of the neural network.

This chapter began by introducing the concepts of local and global error. Local error is the error used to measure the difference between the actual and ideal output of an individual output neuron. This error is calculated using an error function. Error functions are only used to calculate local error. Global error is the total error of the neural network across all output neurons and training set elements.

Three different techniques were presented in this chapter for the calculation of global error. Mean Square Error (MSE) is the most commonly used technique. Sum of Squares Error (ESS) is used by some training methods to calculate error. Root Mean Square (RMS) can be used to calculate the error for certain applications.
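The three global error calculations summarized above can be sketched in a few lines of Java. This is an illustrative sketch, not code from any particular library; the class and method names are my own:

```java
// Illustrative sketch of the three global error measures from this chapter.
public class ErrorCalc {

    // MSE (Equation 2.2): the mean of the squared local errors.
    public static double mse(double[] actual, double[] ideal) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double e = ideal[i] - actual[i];   // local (linear) error
            sum += e * e;
        }
        return sum / actual.length;
    }

    // ESS (Equation 2.3): half the sum of squared errors; not divided by n.
    public static double ess(double[] actual, double[] ideal) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double e = ideal[i] - actual[i];
            sum += e * e;
        }
        return sum / 2.0;
    }

    // RMS (Equation 2.4): the square root of the MSE.
    public static double rms(double[] actual, double[] ideal) {
        return Math.sqrt(mse(actual, ideal));
    }
}
```

Note how the three measures relate: RMS is always the square root of MSE, and ESS is (n/2) times MSE, so all three rank a given set of errors the same way.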
RMS was created for the field of electrical engineering for waveform analysis.

The next chapter will introduce a mathematical concept known as derivatives. Derivatives come from Calculus and will be used to analyze the error functions and adjust the weights to minimize this error. In this book, we will learn about several propagation training techniques. All propagation training techniques use derivatives to calculate update values for the weights of the neural network.

Chapter 3: Derivatives

• Slope of a Line
• What is a Derivative
• Partial Derivatives
• Chain Rule

In this chapter, we will look at the mathematical concept of a derivative. Derivatives are used in many aspects of neural network training. The next few chapters will focus on training a neural network, and a basic understanding of derivatives will be useful to help you properly understand them. The concept of a derivative is central to an understanding of Calculus. The topic of derivatives is very large and could easily consume several chapters. I am only going to explain those aspects of differentiation that are important to the understanding of neural network training. If you are already familiar with differentiation you can safely skim, or even skip, this chapter.

Calculating the Slope of a Line

The slope of a line is a numerical quality that tells you the direction and steepness of the line. In this section, we will see how to calculate the slope of a straight line. In the next section, we will find out how to calculate the slope of a curved line at a single point. The slope of a line is defined as the "rise" over the "run", or the change in y over the change in x. The slope of a line can be written in the form of Equation 3.1.

Equation 3.1: The Slope of a Straight Line

m = (y2 - y1) / (x2 - x1)

This can be visualized graphically as in Figure 3.1.

Figure 3.1: Slope of a Line

We could easily calculate the slope of the above line using Equation 3.1.
Filling in the numbers for the two points we have on the line produces the following:

(8 - 3) / (6 - 1) = 1

The slope of this line is one. This is a positive slope. When a line has a positive slope, it goes up from left to right. When a line has a negative slope, it goes down from left to right. When a line is horizontal, the slope is 0, and when the line is vertical, the slope is undefined. Figure 3.2 shows several slopes for comparison.

Figure 3.2: Several Slopes (positive slope, negative slope, zero slope)

A straight line, such as the ones seen above, can be written in slope-intercept form. Equation 3.2 shows the slope-intercept form of an equation.

Equation 3.2: Slope Intercept Form

y = mx + b

Where m is the slope of the line and b is the y-intercept, which is the y-coordinate of the point where the line crosses the y-axis. To see this in action, consider the chart of the following equation:

f(x) = 2x + 3

This equation can be seen graphically in Figure 3.3.

Figure 3.3: The Graph of 2x+3

As you can see from the above diagram, the line intercepts the y-axis at 3.

What is a Derivative?

In the last section, we saw that we can easily calculate the slope of any straight line. But we will very rarely work with a simple straight line. Usually, we are faced with the sort of curves that we saw in the last chapter when we examined the activation functions of neural networks. A derivative allows us to find the slope of a curve at one point. Consider this simple equation:

Equation 3.3: X Squared

f(x) = x^2

You can see Equation 3.3 graphed in Figure 3.4.

Figure 3.4: Graph of X Squared

In the above chart, we would like to obtain the derivative at 1.5. The chart of x squared is given by the u-shaped line. The slope at 1.5 is given by the straight line that just barely touches the u-shaped line at 1.5. This straight line is called a tangent line. If we take the derivative of Equation 3.3, we are left with an equation that will provide us with the slope of Equation 3.3 at any point x.
It is relatively easy to derive such an equation. To see how this is done, consider Figure 3.5.

Figure 3.5: Calculate Slope at X

Here we are given a point at (x, y). This is the point for which we would like to find the derivative. However, we need two points to calculate a slope. So, we create a second point that is equal to x and y plus delta x and delta y. You can see this imaginary line in Figure 3.6.

Figure 3.6: Slope of a Secant Line

The imaginary line above is called a secant line. The slope of this secant line is close to the slope at (x, y), but it is not the exact number. As delta x and delta y become closer to zero, the slope of the secant line becomes closer to the instantaneous slope at (x, y). We can use this fact to write an equation that is the derivative of Equation 3.3. Before we look specifically at Equation 3.3, we will look at the general case of how to find a derivative for any function f(x). This formula uses a constant h that defines a second point by adding h to x. The smaller that h becomes, the more accurate a value we are given for the slope of the line at x. Equation 3.4 shows the slope of the secant line between x and x + h.

Equation 3.4: The Slope of the Secant Line

m = ( f(x + h) - f(x) ) / ( (x + h) - x )

The derivative is equal to the above equation as h approaches zero. This is shown in Equation 3.5.

Equation 3.5: The Derivative of f(x) as a Slope

f'(x) = lim[h->0] ( f(x + h) - f(x) ) / ( (x + h) - x )

You should take note of two things about the above formula. Notice the apostrophe after the f? This designates the function as a derivative, and is pronounced "f prime". The second is the word lim. This designates a limit. The arrow at the bottom specifies "as h approaches zero". When taking a limit, sometimes the limit is undefined at the value it approaches. Therefore, the limit is the value either at, or close to, the value the limit is approaching.
In many cases, the limit can be determined simply by solving the formula with the approached value substituted for x. The above formula can be simplified by removing the redundant x terms in the denominator. This results in Equation 3.6, which is the definition of a derivative as a slope.

Equation 3.6: Derivative Formula

f'(x) = lim[h->0] ( f(x + h) - f(x) ) / h

Now we have a formula for a derivative, and we will see a simple application of the formula in the next section. We can use this equation to finally determine the derivative for Equation 3.3. Equation 3.3 is simply the function of x squared.

f(x) = x^2

Using the formula from the last section, it is easy to take the derivative of a formula such as x squared. Equation 3.7 shows Equation 3.6 modified to use x squared in place of f(x).

Equation 3.7: Derivative of X Squared (step 1)

f'(x) = lim[h->0] ( (x + h)^2 - x^2 ) / h

Equation 3.7 can be expanded using a simple algebraic rule that allows us to expand the term (x + h) squared. This results in Equation 3.8.

Equation 3.8: Derivative of X Squared (step 2)

f'(x) = lim[h->0] ( x^2 + 2xh + h^2 - x^2 ) / h

This allows us to remove the redundant x squared terms in the numerator, giving us Equation 3.9.

Equation 3.9: Derivative of X Squared (step 3)

f'(x) = lim[h->0] ( 2xh + h^2 ) / h

We can also cancel out the h term in the numerator with the same term in the denominator. This leaves us with Equation 3.10.

Equation 3.10: Derivative of X Squared (step 4)

f'(x) = lim[h->0] ( 2x + h )

We can now evaluate the limit at zero. This produces the final formula for the derivative, shown in Equation 3.11.

Equation 3.11: Final Derivative of X Squared

f'(x) = 2x

The above equation could be found in the front cover of many Calculus textbooks. Simple derivative formulas like this are useful for converting common equations into derivative form. Calculus textbooks usually have these derivative formulas listed in a table. Using this table, more complex derivatives can be obtained.
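The limit in Equation 3.6 can also be checked numerically: as h shrinks, the secant slope of f(x) = x^2 closes in on 2x. A minimal sketch (the class name is mine, not from the book's software):

```java
import java.util.function.DoubleUnaryOperator;

// Numeric check of the derivative-as-limit definition (Equation 3.6).
public class SecantSlope {

    // Slope of the secant line between x and x + h (Equation 3.4).
    public static double slope(DoubleUnaryOperator f, double x, double h) {
        return (f.applyAsDouble(x + h) - f.applyAsDouble(x)) / h;
    }
}
```

For f(x) = x^2 at x = 1.5, h = 0.1 gives a secant slope of about 3.1, and h = 0.001 gives about 3.001, closing in on the true derivative 2x = 3.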
I will not review how to obtain the derivative of any arbitrary function. Generally, when I want to take the derivative of an arbitrary function I use a program called R. R can be obtained from the R Project for Statistical Computing at this URL:

http://www.r-project.org/

The following R command could be used to find the derivative of x squared.

D(expression(x^2), "x")

For a brief tutorial on R, visit the following URL:

http://www.heatonresearch.com/wiki/Brief_R_Tutorial

Using Partial Derivatives

So far, we have only seen "total derivatives". A partial derivative of a function of several variables is the derivative of the function with respect to one of those variables. All other variables will be held constant. This differs from a total derivative, in which all variables are allowed to vary. The partial derivative of a function f with respect to the variable z is variously denoted by these forms:

f_z, D_z f, ∂_z f, ∂f/∂z

The form that will be used in this book is shown here.

∂f/∂z

For an example of partial derivatives, consider the following function f that has more than one variable:

z = f(x, y) = x^2 + xy + y^2

The derivative of z with respect to x is given as follows. The variable y is treated as a constant.

∂z/∂x = 2x + y

Partial derivatives are an important concept for neural networks. We will typically take the partial derivative of the error of a neural network with respect to each of the weights. This will be covered in greater detail in the next chapter.

Using the Chain Rule

There are many different rules in Calculus to allow you to take derivatives manually. We just saw an example of the power rule. This rule states that given the equation:

f(x) = x^n

the derivative of f(x) will be as follows:

f'(x) = n * x^(n-1)

This allows you to quickly take the derivative of any power. There are many other derivative rules, and they are very useful to know. However, if you do not wish to learn manual differentiation, you can generally get by without it by using a program such as R.
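A partial derivative can be estimated numerically by nudging only the variable of interest while holding the others fixed. The two-variable function below is an example of my own choosing; the hold-one-variable-constant idea is the point:

```java
// Numeric estimate of a partial derivative: only x is nudged; y is held constant.
public class PartialDerivative {

    // An example function of two variables: f(x, y) = x^2 + x*y + y^2.
    static double f(double x, double y) {
        return x * x + x * y + y * y;
    }

    // Estimate of the partial derivative of f with respect to x.
    public static double dfdx(double x, double y, double h) {
        return (f(x + h, y) - f(x, y)) / h;
    }
}
```

Analytically, treating y as a constant gives a partial derivative of 2x + y for this example, so at the point (1, 2) the numeric estimate should approach 4.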
However, there is one more rule that is very useful to know. This rule is called the chain rule. The chain rule deals with composite functions. A composite function is nothing more than one function taking the result of a second function as input. This may sound complex, but programmers make use of composite functions all the time. Here is an example of a composite function call in Java.

System.out.println(Math.pow(3, 2));

This is a composite function because we take the result of the function pow and feed it to println. Mathematically, we write this as follows. Imagine we had functions f and g. If we wished to pass the value of 5 to f, and then pass the result of f on to g, we would use the expression:

(g ∘ f)(5)

The chain rule of calculus gives us a relatively easy way to calculate the derivative of the composite of two functions. The chain rule is given in Equation 3.12.

Equation 3.12: The Chain Rule

(f ∘ g)'(x) = f'(g(x)) * g'(x)

The chain rule will be very valuable in calculating the derivative across an entire neural network. A neural network is very much a composite function. In a typical three layer neural network, the output of the input layer flows to the hidden layer and finally to the output layer.

Chapter Summary

In this chapter we took a look at derivatives. Derivatives are a core concept in Calculus. A derivative is defined as the slope of a curved line at one individual value of x. The derivative can also be thought of as the instantaneous rate of change at the point x. Derivatives can be calculated manually or by using a software package such as R.

Derivatives are very important for neural network training. The derivatives of activation functions are used to calculate the error gradient with respect to individual weights. Various training algorithms make use of these gradients to determine how best to update the neural network weights. In the next chapter, we will look at backpropagation.
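The chain rule is easy to verify numerically. The sketch below uses a hypothetical pair of functions of my own choosing, comparing the chain-rule product f'(g(x)) * g'(x) against a direct secant estimate of the composite's slope:

```java
// Numeric verification of the chain rule (Equation 3.12).
public class ChainRuleCheck {

    static double f(double x) { return x * x; }      // f(x) = x^2
    static double fPrime(double x) { return 2 * x; } // f'(x) = 2x
    static double g(double x) { return 3 * x + 1; }  // g(x) = 3x + 1
    static double gPrime(double x) { return 3.0; }   // g'(x) = 3

    // (f o g)'(x) computed via the chain rule.
    public static double byChainRule(double x) {
        return fPrime(g(x)) * gPrime(x);
    }

    // (f o g)'(x) estimated directly from a small secant.
    public static double bySecant(double x, double h) {
        return (f(g(x + h)) - f(g(x))) / h;
    }
}
```

At x = 2 both approaches give a slope of 42: g(2) = 7, f'(7) = 14, and 14 * 3 = 42.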
Backpropagation is a training algorithm that adjusts the weights of neural networks to produce more desirable output from the neural network. Backpropagation works by calculating the partial derivative of the error function with respect to each of the weights.

Chapter 4: Backpropagation

• Understanding Gradients
• Calculating Gradients
• Understanding Backpropagation
• Momentum and Learning Rate

So far, we have only looked at how to calculate the output from a neural network. The output from the neural network is a result of applying the input to the neural network across the weights of several layers. In this chapter, we will find out how these weights are adjusted to produce outputs that are closer to the desired output. This process is called training. Training is an iterative process. To make use of training, you perform multiple training iterations. These training iterations are intended to lower the global error of the neural network. Global and local error were discussed in Chapter 2.

Understanding Gradients

The first step is to calculate the gradients of the neural network. The gradients are used to calculate the slope, or gradient, of the error function for a particular weight. A weight is a connection between two neurons. Calculating the gradient of the error function allows the training method to know whether it should increase or decrease the weight. There are a number of different training methods that make use of gradients. These training methods are called propagation training. This book will discuss the following propagation training methods:

• Backpropagation
• Resilient Propagation
• Quick Propagation

This chapter will focus on using the gradients to train the neural network using backpropagation. The next few chapters will cover the other propagation methods.

What is a Gradient?

First of all, let's look at what a gradient is. Basically, training is a search.
You are searching for the set of weights that will cause the neural network to have the lowest global error for a training set. If we had an infinite amount of computation resources, we would simply try every possible combination of weights and see which one provided the absolute best global error. Because we do not have unlimited computing resources, we have to use some sort of shortcut. Essentially, all neural network training methods are really a kind of shortcut. Each training method is a clever way of finding an optimal set of weights without doing an impossibly exhaustive search.

Consider a chart that shows the global error of a neural network for each possible weight. This graph might look something like Figure 4.1.

Figure 4.1: Gradient

Looking at this chart, you can easily see the optimal weight. The optimal weight is the location where the line has the lowest y value. The problem is that we do not get to see the entire graph; that would require the exhaustive search mentioned above. We only see the error for the current value of the weight. But we can determine the slope of the error curve at a particular weight. In the above chart we see the slope of the error curve at 1.5. The slope is given by the straight line that just barely touches the error curve at 1.5. This slope is the gradient. In this case, the slope is 0.5622. The gradient is the instantaneous slope of the error function at the specified weight. This uses the same definition for the derivative that we learned in Chapter 3. The gradient is given by the derivative of the error curve at that point. This line tells us something about how steep the error function is at the given weight. Used with a training technique, this can provide some insight into how the weight should be adjusted for a lower error. Now that we have seen what a gradient is, in the next section, we will find out how to calculate a gradient.

Calculating Gradients

We will now look at how to calculate the gradient. We will calculate an individual gradient for each weight. I will show you the equations as well as how to apply them to an actual neural network with real numbers. The neural network that we will use is shown in Figure 4.2.

Figure 4.2: An XOR Network

The neural network above is a typical three layer feedforward network like the ones we have seen before. The circles indicate neurons. The lines connecting the circles are the weights. The rectangles in the middle of the connections give the weight for each connection. The problem that we now face is how to calculate the partial derivative for the output of each neuron. This can be accomplished using a method based on the chain rule of Calculus. The chain rule was discussed in Chapter 3.

We will begin with one training set element. For Figure 4.2 we are providing an input of [1, 0] and expecting an output of [1]. You can see that the input is applied on the above figure, as the first input neuron has an input value of 1.0 and the second input neuron has an input value of 0.0. This input feeds through the network and eventually produces an output. The exact process by which the output and sums are calculated was covered in Chapter 1. Backpropagation has both a forward and backward pass. The forward pass occurred when the output of the neural network was calculated. We will calculate the gradients only for this item in the training set. Other items in the training set will have different gradients. We will discuss how the gradients for each individual training set element are combined later in this chapter, when we talk about "Batch and Online Training". We are now ready to calculate the gradients. There are several steps involved in calculating the gradients for each weight. These steps are summarized here.
• Calculate the error, based on the ideal of the training set
• Calculate the node delta for the output neurons
• Calculate the node delta for the interior neurons
• Calculate the individual gradients

These steps will be covered in the following sections.

Calculating the Node Deltas

The first step is to calculate a constant value for every node, or neuron, in the neural network. We will start with the output nodes and work our way backwards through the neural network; this is where the term backpropagation comes from. We initially calculate the errors for the output neurons and propagate these errors backwards through the neural network. The value that we will calculate for each node is called the node delta. The term layer delta is also sometimes used to describe this value. Layer delta describes the fact that these deltas are calculated one layer at a time. The method for calculating the node deltas differs depending on whether you are calculating for an output or interior node. The output neurons are, obviously, all output nodes. The hidden and input neurons are the interior nodes. The equation to calculate the node delta is provided in Equation 4.1.

Equation 4.1: Calculating the Node Deltas

δ_i = E * f'(s_i),                    for output nodes
δ_i = f'(s_i) * Σ_k ( w_ik * δ_k ),   for interior nodes

Here s_i is the weighted sum fed into neuron i, f' is the derivative of the activation function, and the sum for an interior node runs over the connections leading out of neuron i to the neurons k in the next layer.

We will calculate the node delta for all hidden and non-bias neurons. There is no need to calculate the node delta for the input and bias neurons. Even though the node delta can easily be calculated for input and bias neurons using the above equation, these values are not needed for the gradient calculation. As you will soon see, gradient calculation for a weight only looks at the neuron that the weight is connected to. Bias and input neurons are only the beginning point for a connection. They are never the end point. We will begin by using the formula for the output neurons. You will notice that the formula uses a value E. This is the error for this output neuron.
You can see how to calculate E from Equation 4.2.

Equation 4.2: The Error Function

E = (a - i)

You may recall a similar equation, Equation 2.1, from Chapter 2. This is the error function. Here, we subtract the ideal from the actual. For the neural network provided in Figure 4.2, this can be written like this:

E = 0.75 - 1.00 = -0.25

Now that we have E, we can calculate the node delta for the first (and only) output node. Filling in Equation 4.1, we get the following:

(-0.25) * dA(1.1254) = 0.185 * -0.25 = -0.05

The value of -0.05 will be used for the node delta of the output neuron. In the above equation, dA represents the derivative of the activation function. For this example, we are using a sigmoid activation function. The sigmoid activation function is shown in Equation 4.3.

Equation 4.3: The Sigmoid Function

s(x) = 1 / (1 + e^(-x))

The derivative of the sigmoid function is shown in Equation 4.4.

Equation 4.4: The Derivative of the Sigmoid Function

s'(x) = s(x) * (1 - s(x))

Now that the node delta has been calculated for the output neuron, we should calculate it for the interior neurons as well. The equation to calculate the node delta for interior neurons was provided in Equation 4.1. Below, we apply this for the first hidden neuron:

dA(sum of H1) * (O1 Node Delta * Weight of H1 -> O1)

You will notice that Equation 4.1 called for summing over the connections leading out of hidden neuron one. Because hidden neuron one connects only to output neuron one, there is only one value to sum. This value is the product of the output neuron one node delta and the weight between hidden neuron one and output neuron one. Filling in actual values for the above expression, we are left with the following:

dA(-0.53) * (-0.05 * 0.22) = -0.0025

The value -0.0025 is the node delta for the first hidden neuron. Calculating the second hidden neuron follows exactly the same form as above.
The second neuron would be computed like this:

dA(sum of H2) * (O1 Node Delta * Weight of H2 -> O1)

Plugging in actual numbers, we find this:

dA(1.05) * (-0.05 * 0.58) = -0.0055

The value of -0.0055 is the node delta for the second hidden neuron. As previously explained, there is no reason to calculate the node delta for either the bias neurons or the input neurons. We now have every node delta needed to calculate a gradient for each weight in the neural network. Calculation of the individual gradients will be discussed in the next section.

Calculating the Individual Gradients

We can now calculate the individual gradients. Unlike the node deltas, only one equation is used to calculate the actual gradient. A gradient is calculated using Equation 4.5.

Equation 4.5: Individual Gradient

∂E/∂w_ik = o_i * δ_k

The above equation calculates the partial derivative of the error (E) with respect to each individual weight. The partial derivatives are the gradients. Partial derivatives were discussed in Chapter 3. To determine an individual gradient, multiply the node delta for the target neuron (k) by the output of the source neuron (i). To calculate the gradient for the weight from H1 to O1, the following values would be used:

output(h1) * nodeDelta(o1) = (0.37 * -0.05) = -0.01677

(The small discrepancy from the rounded product, -0.0185, comes from rounding in the displayed intermediate values; -0.01677 is computed from the unrounded numbers.) It is important to note that in the above equation, we are multiplying by the output of hidden 1, not the sum. When dealing directly with a derivative you should supply the sum. Otherwise, you would be indirectly applying the activation function twice. In the above equation, we are not dealing directly with the derivative, so we use the regular node output. The node output has already had the activation function applied.

Once the gradients are calculated, the individual positions of the weights no longer matter. We can simply think of the weights and gradients as single dimensional arrays.
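The node-delta and gradient arithmetic from the last two sections can be reproduced in a short sketch. The class and method names here are mine; the formulas are Equations 4.1, 4.3, 4.4, and 4.5 with the sigmoid activation:

```java
// Reproduces the worked node-delta and gradient arithmetic.
public class BackpropMath {

    // Sigmoid activation (Equation 4.3).
    public static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Sigmoid derivative in terms of the neuron's sum (Equation 4.4).
    public static double sigmoidDerivative(double x) {
        double s = sigmoid(x);
        return s * (1.0 - s);
    }

    // Node delta for an output neuron: error times activation derivative.
    public static double outputNodeDelta(double error, double sum) {
        return error * sigmoidDerivative(sum);
    }

    // Node delta for a hidden neuron with a single outbound connection.
    public static double hiddenNodeDelta(double sum, double outboundWeight,
                                         double downstreamDelta) {
        return sigmoidDerivative(sum) * (outboundWeight * downstreamDelta);
    }

    // Individual gradient (Equation 4.5): source output times target delta.
    public static double gradient(double sourceOutput, double targetDelta) {
        return sourceOutput * targetDelta;
    }
}
```

For the worked example, outputNodeDelta(-0.25, 1.1254) is about -0.046, which the text rounds to -0.05; the small differences from the chapter's printed values come from rounding in the displayed inputs.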
The individual training methods that we will look at will treat all weights and gradients equally. It does not matter if a weight is from an input neuron or an output neuron. It is only important that the correct weight is used with the correct gradient. The ordering of these weight and gradient arrays is arbitrary. However, Encog uses the following order for the above neural network:

Weight/Gradient 0: Hidden 1 -> Output 1
Weight/Gradient 1: Hidden 2 -> Output 1
Weight/Gradient 2: Bias 2 -> Output 1
Weight/Gradient 3: Input 1 -> Hidden 1
Weight/Gradient 4: Input 2 -> Hidden 1
Weight/Gradient 5: Bias 1 -> Hidden 1
Weight/Gradient 6: Input 1 -> Hidden 2
Weight/Gradient 7: Input 2 -> Hidden 2
Weight/Gradient 8: Bias 1 -> Hidden 2

The various learning algorithms will make use of the weight and gradient arrays. It is also important to note that these are two separate arrays. There is a weight array, and there is a gradient array. Both arrays will be exactly the same length. Training a neural network is nothing more than adjusting the weights to provide desirable output. In this chapter, we will see how backpropagation uses the gradient array to modify the weight array.

Applying Backpropagation

Backpropagation is a simple training method that uses the calculated gradients of a neural network to adjust the weights of the neural network. This is a form of gradient descent, as we are descending down the gradients to lower values. As these weights are adjusted, the neural network should produce more desirable output. The global error of the neural network should fall as it is trained. Before we can examine the backpropagation weight update process, we must look at two different ways that the gradients can be calculated.

Batch and Online Training

We have already covered how to calculate the gradients for an individual training set element.
Earlier in this chapter, we saw how we could calculate the gradients for a case where the neural network was given an input of [1, 0] and an output of [1] was expected. This works fine for a single training set element. However, most training sets have many elements. There are two different ways to handle multiple training set elements. These two approaches are called online and batch training.

Online training implies that you modify the weights after every training set element. Using the gradients that you obtained for the first training set element, you calculate and apply a change to the weights. Training progresses to the next training set element and also calculates an update to the neural network. This training continues until every training set element has been used. At this point, one iteration, or epoch, of training has completed.

Batch training also makes use of every training set element. However, the weights are not updated for every training set element. Rather, the gradients for each training set element are summed. Once every training set element has been used, the neural network weights can be updated. At this point, the iteration is considered complete. Sometimes, a batch size will be set. For example, you might have a training set size of 10,000 elements. You might choose to update the weights of the neural network every 1,000 elements. This would cause the neural network weights to be updated 10 times during the training iteration.

Online training was the original method used for backpropagation. However, online training is inefficient, because the neural network must be constantly updated. Additionally, online training is very difficult to implement in a multithreaded manner that can take advantage of multicore processors. For these reasons, batch training is generally preferable to online training.

Backpropagation Weight Update

We are now ready to update the weights.
As previously mentioned, we will treat the weights and gradients as a single dimensional array. Given these two arrays, we are ready to calculate the weight update for an iteration of backpropagation training. The formula to update the weights for backpropagation is shown in Equation 4.6.

Equation 4.6: Backpropagation Weight Update

Δw(t) = ε * (∂E/∂w(t)) + α * Δw(t-1)

The above equation calculates the change in weight for each element in the weight array. You will also notice that the above equation calls for the weight change from the previous iteration (t-1). These values must be kept in another array. The above equation calculates the weight delta to be the product of the gradient and the learning rate (represented by epsilon, ε). Additionally, the product of the previous weight change and the momentum value (represented by alpha, α) is added.

The learning rate and momentum are two parameters that must be provided to the backpropagation algorithm. Choosing values for learning rate and momentum is very important to the performance of the training. Unfortunately, the process for determining learning rate and momentum is mostly trial and error. The learning rate scales the gradient and can slow down or speed up learning. A learning rate below one will slow down learning. For example, a learning rate of 0.5 would decrease every gradient by 50%. A learning rate above 1.0 would speed up training. In reality, the learning rate is almost always below one. Choosing a learning rate that is too high will cause your neural network to fail to converge. A neural network that is failing to converge will generally have a high global error that simply bounces around, rather than converging to a low value. Choosing a learning rate that is too low will cause the neural network to take a great deal of time to converge. Like the learning rate, the momentum is also a scaling factor.
Momentum determines the percent of the previous iteration's weight change that should be applied to this iteration. Momentum is optional. If you do not want to use momentum, just specify a value of zero.

Momentum is a technique that was added to backpropagation to help the training find its way out of local minima. Local minima are low points on the error graph that are not the true global minimum. Backpropagation has a tendency to find its way into a local minimum and not find its way back out again. This causes the training to converge to a higher, undesirable error. Momentum gives the neural network some force in its current direction and may allow it to force through a local minimum.

We are now ready to plug values into Equation 4.6 and calculate a weight delta. We will calculate a weight change for the first iteration using the neural network that we previously used in this chapter. So far, we have only calculated gradients for one training set element. There are still three other training set elements to calculate for. We will sum all four gradients together to apply batch training. The batch gradient is calculated by summing the individual gradients, which are listed here:

-0.01677795762852307
0.05554301180824532
-0.021940533165555356
0.05861906411700882

Earlier in the chapter we calculated the first gradient, listed above. If you would like to see the calculation for the others, this information can be found at the following URL:

http://www.heatonresearch.com/wiki/BackPropagation

Summing all of these gradients produces this batch update gradient:

0.07544358513197481

For this neural network, we will use a learning rate of 0.7 and a momentum of 0.3. These are just arbitrary values. However, they do work well for training an XOR neural network. Plugging these values into Equation 4.6 results in this equation:

delta = (0.7 * 0.0754) + (0.3 * 0.0) = 0.052810509592382364

This is the first training iteration, so the previous delta value is 0.0.
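The calculation above can be checked with a short script. Note that the signs on the per-element gradients are my reconstruction: the mixed signs are what is required for the four values to sum to the stated batch gradient.

```python
# Equation 4.6 plus the batch-gradient sum from the walkthrough above.
# The signs on the per-element gradients are reconstructed (mixed signs
# are needed for the sum to match the stated batch gradient).

def weight_delta(gradient, prev_delta, epsilon=0.7, alpha=0.3):
    # Equation 4.6: learning rate times gradient, plus momentum times
    # the previous iteration's weight change.
    return epsilon * gradient + alpha * prev_delta

per_element = [
    -0.01677795762852307,
     0.05554301180824532,
    -0.021940533165555356,
     0.05861906411700882,
]
batch_gradient = sum(per_element)   # about 0.0754435851

# First iteration: no previous delta, so momentum contributes nothing.
delta = weight_delta(batch_gradient, 0.0)
print(delta)  # about 0.0528105
```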
The momentum has no effect on the first iteration. This delta value will be added to the weight to alter the neural network for the first training iteration. All of the other weights in this neural network will be updated in the same way, according to their calculated gradients. This first training iteration will lower the neural network's global error slightly. Additional training iterations will lower the error further. The following program output shows the convergence of this neural network.

Epoch #1 Error: 0.3100155809627523
Epoch #2 Error: 0.2909988918032235
Epoch #3 Error: 0.2712902750837602
Epoch #4 Error: 0.2583119003843881
Epoch #5 Error: 0.2523050561276289
Epoch #6 Error: 0.2502986971902545
Epoch #7 Error: 0.2498182295192154
Epoch #8 Error: 0.24974245650541688
Epoch #9 Error: 0.24973458893806627
Epoch #10 Error: 0.24972923906975902
...
Epoch #578 Error: 0.010002702374503777
Epoch #579 Error: 0.009947830890527089

Neural Network Results:
1.0, 0.0, actual=0.9040102333814147, ideal=1.0
0.0, 0.0, actual=0.09892634022671229, ideal=0.0
0.0, 1.0, actual=0.904020682439766, ideal=1.0
1.0, 1.0, actual=0.10659032105865764, ideal=0.0

Each iteration, or epoch, decreases the error. Once the error drops below one percent, the training stops. You can also see the output from the neural network for the XOR data. The answers are not exactly correct, but it is very clear that the two training cases that should be 1.0 are much closer to 1.0 than the others are.

Chapter Summary

In this chapter you were introduced to backpropagation. Backpropagation is one of the oldest and most commonly used training algorithms available for neural networks.
Backpropagation works by calculating a gradient value for each weight in the network. Many other training methods also make use of these gradient values. There is one gradient for each weight in the neural network.

Calculation of the gradients is a step-by-step process. The first step in calculating the gradients is to calculate the error for each of the outputs of the neural network. This error is for one training set element. The gradients for all training set elements may be batched together later in the process. Once the error for the output layer has been established, you can go on to calculate values for each of the output neurons. These values are called the node deltas for each of the output neurons. We must calculate the node deltas for the output layer first. We calculate the node deltas for each layer of the neural network, working our way backwards to the input layer. This is why this technique is called backpropagation. Once the node deltas have been calculated, it is very easy to calculate the gradients. At the end, you will have all of the gradients for one training set element.

If you are using online training, you will now use these gradients to apply a change to the weights of the neural network. If you are using batch training, you will sum the gradients from each of the training set elements into a single set of gradients for the entire training set.

Backpropagation must be provided with a learning rate and momentum. Both of these are configuration items that will have an important effect on the training speed of your neural network. The learning rate specifies how fast the weights should be updated. Too high a learning rate will cause a network to become unstable. Too low a learning rate will cause the neural network to take too long to train. Momentum allows the neural network to escape local minima. Local minima are low points in the error graph that are not the true global minimum.
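The gradient-calculation steps summarized in this chapter (output error, output node deltas, hidden node deltas working backwards, then gradients) can be sketched for one training element on a tiny sigmoid network. This is a simplified sketch, not the book's code: bias neurons are omitted for brevity, and the weight layout and names are my own.

```python
import math

# One backward pass on a 2-2-1 sigmoid network (no bias, for brevity).
# Steps follow the summary: error -> output delta -> hidden deltas
# (working backwards) -> gradients.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def node_deltas_and_gradients(x, ideal, w_hidden, w_output):
    # Forward pass.
    h_sums = [sum(wi * xi for wi, xi in zip(row, x)) for row in w_hidden]
    h = [sigmoid(s) for s in h_sums]
    o = sigmoid(sum(wi * hi for wi, hi in zip(w_output, h)))

    # Step 1: error for the output; Step 2: output node delta.
    error = ideal - o
    delta_o = error * o * (1.0 - o)

    # Step 3: hidden node deltas, working backwards from the output.
    delta_h = [w_output[i] * delta_o * h[i] * (1.0 - h[i])
               for i in range(len(h))]

    # Step 4: gradients follow directly from the node deltas.
    grad_output = [delta_o * hi for hi in h]
    grad_hidden = [[delta_h[i] * xi for xi in x] for i in range(len(h))]
    return grad_hidden, grad_output

g_h, g_o = node_deltas_and_gradients([1.0, 0.0], 1.0,
                                     [[0.1, 0.2], [0.3, 0.4]], [0.5, 0.6])
```

These per-element gradients are what an online trainer would apply immediately, or what a batch trainer would sum across the training set.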
Choosing values for the learning rate and momentum can be tricky. Often it is just a matter of trial and error. The Resilient Propagation training method (RPROP) requires no parameters to be set. Further, RPROP often trains much faster than backpropagation. RPROP will be covered in the next chapter.

Chapter 5: Faster Training with RPROP

• Understanding Error Calculation
• The Error Function
• Error Calculation Methods
• How is the Error Used

In the last chapter, we looked at backpropagation. Backpropagation is one of the oldest and most popular methods for training a neural network. Unfortunately, backpropagation is also one of the slowest methods for training a neural network. In this chapter, we will take a look at a faster training technique called resilient propagation, or RPROP.

RPROP works very much like backpropagation. Both backpropagation and RPROP must first calculate the gradients for the weights of the neural network. Where backpropagation and RPROP differ is in the way in which the gradients are used. In a simple XOR example, it typically takes backpropagation over 500 iterations to converge to a solution with an error rate of below one percent. It will usually take RPROP around 30 to 100 iterations to accomplish the same thing. This large increase in performance is one reason why RPROP is a very popular training algorithm.

Another factor in the popularity of RPROP is that there are no necessary training parameters to the RPROP algorithm. When you make use of backpropagation, you must specify the learning rate and momentum. These two parameters can have a huge impact on the effectiveness of your training. RPROP does include a few training parameters, but they can almost always be left at their default settings.

There are several variants of the RPROP algorithm. Here are some of the variants:

• RPROP+
• RPROP-
• iRPROP+
• iRPROP-

This book will focus on classic RPROP,
as described in a 1994 paper by Martin Riedmiller entitled "Rprop - Description and Implementation Details". The other four variants described above are relatively minor adaptations of classic RPROP. For more information about all variants of RPROP, you can visit the following URL:

http://www.heatonresearch.com/wiki/RPROP

In the next sections, we will see how the RPROP algorithm is implemented.

RPROP Arguments

As previously mentioned, one advantage RPROP has over backpropagation is that no training arguments need to be provided in order for RPROP to be used. That is not to say that there are no configuration settings for RPROP. The configuration settings for RPROP do not usually need to be changed from their defaults. However, if you really want to change them, there are several configuration settings that you can set for RPROP training. These configuration settings are:

• Initial Update Values
• Maximum Step

As you will see in the next section, RPROP keeps an array of update values for the weights. This determines how large of a change will be made to each weight. This is something like the learning rate in backpropagation, only much better. There is an update value for every weight in the neural network. This allows the update values to be fine-tuned to each individual weight as training progresses. Some backpropagation algorithms will vary the learning rate and momentum as learning progresses. The RPROP approach is much better because, unlike backpropagation, it does not simply use a single learning rate for the entire neural network.

These update values must start from somewhere. The "initial update values" argument defines this. By default, this argument is set to a value of 0.1. As a general rule, this default should never be changed. One possible exception to this is in a neural network that has already been trained.
If the neural network is already trained, then some of the initial update values are going to be too strong for the neural network. The neural network will regress for many iterations before it is able to improve. An already trained neural network may benefit from a much smaller initial update. Another approach for an already trained neural network is to save the update values once training stops and use them for the new training. This will allow you to resume training without the initial spike in errors that you would normally see when resuming resilient propagation training. This method will only work if you are resuming resilient propagation. If you were previously training the neural network with a different training algorithm, then you will not have an array of update values to restore from.

As training progresses, the gradients will be used to adjust the update values up and down. The "maximum step" argument defines the maximum upward step size that can be taken over the update values. The default value for the maximum step argument is 50. It is unlikely that you will need to change the value of this argument.

As well as arguments defined for RPROP, there are also constants. These constants are simply values that are kept by RPROP during processing. These values are never changed. The constants are:

• Delta Minimum (1e-6)
• Negative Eta (0.5)
• Positive Eta (1.2)
• Zero Tolerance (1e-16)

Delta minimum specifies the minimum value that any of the update values can go to. This is important, because if an update value were to go to zero, it would never be able to increase beyond zero. Negative and positive eta will be described in the next sections. The zero tolerance defines how close a number should be to zero before that number is considered equal to zero. In computer programming, it is typically considered bad practice to directly compare a floating point number to zero. This is because the number would have to be exactly equal to zero.
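The arguments and constants above might be declared as follows. This is a sketch in my own naming; only the numeric values come from the text.

```python
# RPROP arguments (rarely changed from their defaults).
INITIAL_UPDATE = 0.1
MAX_STEP = 50.0

# RPROP constants (never changed during processing).
DELTA_MIN = 1e-6        # floor for the update values
ETA_MINUS = 0.5         # negative eta
ETA_PLUS = 1.2          # positive eta
ZERO_TOLERANCE = 1e-16  # how close to zero counts as zero

def is_zero(x):
    # Never compare a float to zero directly; use the tolerance.
    return abs(x) < ZERO_TOLERANCE

print(is_zero(1e-20), is_zero(0.1))  # True False
```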
Data Structures

There are several data structures that must be kept in memory while RPROP training is performed. These structures are all arrays of floating point numbers. They are summarized here:

• Current Update Values
• Last Weight Change Values
• Current Weight Change Values
• Current Gradient Values
• Previous Gradient Values

The current update values array holds the current update values for the training. If you wish to be able to resume training at some point, this is the array that must be stored. There is one update value per weight. These update values cannot go below the delta minimum constant. Likewise, these update values cannot exceed the maximum step argument.

The last weight change value must also be tracked. Backpropagation kept this value for momentum. RPROP uses this value in a different way than backpropagation. We will see how this array is used in the next section. The current weight change is kept long enough to change the weights and then is copied to the previous weight change. The current and previous gradients are needed too. RPROP is particularly interested in when the sign changes from the previous gradient to the current gradient. This indicates that an action must be taken with regard to the update values. This is covered in the next section.

Understanding RPROP

In the last few sections, the arguments, constants and data structures necessary for RPROP were covered. In this section, we will see exactly how to run through an iteration of RPROP. In the next section, we will apply real numbers to RPROP and see how training iterations progress for an XOR training. We will train exactly the same network that we used with backpropagation. This will give us a good idea of the difference in performance of backpropagation compared to RPROP.

When we talked about backpropagation, we mentioned two weight update methods: online and batch. RPROP does not support online training. All weight updates used with RPROP will be performed in batch mode.
Because of this, each iteration of RPROP will receive gradients that are the sum of the individual gradients for each training set element. This is consistent with using backpropagation in batch mode.

There are three distinct steps in an RPROP iteration. They are covered in the next three sections.

Determine Sign Change of Gradient

At this point, we should have the gradients. These gradients are nearly exactly the same as the gradients calculated by the backpropagation algorithm. The only difference is that RPROP uses a gradient that is the inverse of the backpropagation gradient. This is easy enough to adjust: simply place a negative operator in front of every backpropagation gradient. Because the same process is used to obtain gradients in both RPROP and backpropagation, we will not repeat it here. To learn how to calculate a gradient, refer to Chapter 4.

The first step is to compare the gradient of the current iteration to the gradient of the previous iteration. If there is no previous iteration, then we can assume that the previous gradient was zero. To determine whether the gradient sign has changed, we will use the sign (sgn) function. The sgn function is defined in Equation 5.1.

Equation 5.1: The Sign Function (sgn)

$$\operatorname{sgn}(x) = \begin{cases} -1 & \text{if } x < 0 \\ 0 & \text{if } x = 0 \\ 1 & \text{if } x > 0 \end{cases}$$

The sgn function returns the sign of the number provided. If x is less than zero, the result is -1. If x is greater than zero, then the result is 1. If x is equal to zero, then the result is zero. I usually implement the sign function to use a tolerance for zero, since it is nearly impossible for floating point operations to hit zero precisely on a computer.

To determine whether the gradient has changed sign, Equation 5.2 is used.

Equation 5.2: Determine Gradient Sign Change

$$c = \frac{\partial E^{(t-1)}}{\partial w_{ij}} \cdot \frac{\partial E^{(t)}}{\partial w_{ij}}$$

Equation 5.2 will result in a constant c. This value is evaluated to be negative, positive, or close to zero. A negative value for c indicates that the sign has changed.
A positive value indicates no change in sign for the gradient. A value near zero indicates that at least one of the two gradients was zero, or nearly so. To see all three of these outcomes, consider the following situations:

-1 * 1 = -1 (negative; changed from negative to positive)
1 * 1 = 1 (positive; no change in sign)
0.000001 * -0.000001 ≈ 0 (near zero; almost changed signs, but treated as no change)

Now that we have calculated the constant c, which gives some indication of sign change, we can calculate the weight change. This is covered in the next section.

Calculate Weight Change

Now that we have the change in sign of the gradient, we can see what happens in each of the three cases mentioned in the previous section. These three cases are summarized in Equation 5.3.

Equation 5.3: Calculate RPROP Weight Change

$$\Delta w_{ij}^{(t)} = \begin{cases} -\Delta_{ij}^{(t)} & \text{if } c > 0 \\ +\Delta_{ij}^{(t)} & \text{if } c < 0 \\ 0 & \text{otherwise} \end{cases}$$

This equation calculates the actual weight change for each iteration. If the value of c is positive, then the weight change will be the negative of the weight update value. Similarly, if the value of c is negative, the weight change will be the positive of the weight update value. Finally, if the value of c is near zero, then there will be no weight change.

Modify Update Values

The weight update values seen in the previous section are used in each iteration to update the weights of the neural network. There is a separate weight update value for every weight in the neural network. This works much better than a single learning rate for the entire neural network, such as was used in backpropagation. These weight update values are modified during each training iteration, as seen in Equation 5.4.

Equation 5.4: Modify Update Values

$$\Delta_{ij}^{(t)} = \begin{cases} \eta^{+}\,\Delta_{ij}^{(t-1)} & \text{if } c > 0 \\ \eta^{-}\,\Delta_{ij}^{(t-1)} & \text{if } c < 0 \\ \Delta_{ij}^{(t-1)} & \text{otherwise} \end{cases}$$

The weight update values are modified in a way that is very similar to how the weights themselves are modified.
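A direct transcription of Equations 5.1 through 5.4 might look like the sketch below. The function names are my own, the constants follow the defaults listed earlier, and the sign evaluation of c is folded into a helper; classic RPROP implementations differ in small details, so treat this as an illustration of the three cases rather than a reference implementation.

```python
# Equations 5.1-5.4 as code: sgn with a zero tolerance, the sign-change
# indicator c, and the per-weight update rules, with clamping to the
# delta-minimum and maximum-step limits from the text.

ZERO_TOLERANCE = 1e-16
ETA_PLUS, ETA_MINUS = 1.2, 0.5
DELTA_MIN, MAX_STEP = 1e-6, 50.0

def sgn(x, tol=ZERO_TOLERANCE):
    # Equation 5.1, treating anything inside the tolerance as zero.
    if abs(x) < tol:
        return 0
    return -1 if x < 0 else 1

def sign_change(prev_gradient, current_gradient):
    # Equation 5.2, evaluated to -1, 0, or +1.
    return sgn(prev_gradient * current_gradient)

def rprop_step(c, update_value):
    """Equations 5.3 and 5.4 for one weight.

    Returns (new_update_value, weight_change)."""
    if c > 0:
        new_update = min(update_value * ETA_PLUS, MAX_STEP)
        change = -new_update
    elif c < 0:
        new_update = max(update_value * ETA_MINUS, DELTA_MIN)
        change = +new_update
    else:
        new_update = update_value
        change = 0.0
    return new_update, change

c = sign_change(-1.0, 1.0)   # the gradient flipped sign, so c is -1
print(rprop_step(c, 0.1))    # the update value shrinks to 0.05
```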
Just like the weights, the weight update values are changed based on the value c, as previously calculated. If the value of c is positive, then the weight update value will be multiplied by the value of positive eta. Similarly, if the value of c is negative, the weight update value will be multiplied by negative eta. Finally, if the value of c is near zero, then there will be no change to the weight update value.

RPROP Update Examples

We will now look at a few iterations of an RPROP training of a neural network. The critical stats on this neural network are listed here:

Layers: 3 (input, hidden, output)
Input Neurons: 2
Hidden Neurons: 2
Output Neurons: 1
Activation Function: Sigmoid
Bias Neurons: Yes
Total Weights: 9

The initial random weights for this neural network are given here:

Weight 0: H1->O1: 0.22791948943117624
Weight 1: H2->O1: 0.581714099641357
Weight 2: B2->O1: 0.7792991203673414
Weight 3: I1->H1: -0.06782947598673161
Weight 4: I2->H1: 0.22341077197888182
Weight 5: B1->H1: -0.4635107399577998
Weight 6: I1->H2: 0.9487814395569221
Weight 7: I2->H2: 0.46158711646254804
Weight 8: B1->H2: 0.09750161997450091

The neural network will be trained for the XOR operator, just as was done in Chapter 4. However, in this chapter RPROP will be used in place of backpropagation.

Training Iteration #1

For the first training iteration, the weight update values are all set to 0.1. This is their default starting point. We begin by calculating the gradients of each of the weights. This is the same gradient calculation as was performed for backpropagation, so the calculation will not be covered here.
The gradients are provided here:

Gradient 0: H1->O1: 0.07544358513197481
Gradient 1: H2->O1: -0.12346935587390481
Gradient 2: B2->O1: -0.18705713395934637
Gradient 3: I1->H1: 0.005292326734004241
Gradient 4: I2->H1: 0.0049016107791925246
Gradient 5: B1->H1: 0.010264148244428655
Gradient 6: I1->H2: -0.0068740307089347
Gradient 7: I2->H2: -0.005236293038814788
Gradient 8: B1->H2: -0.02103299467864286

Now that we have the gradients, we must calculate the value of c from the previous gradients. There are no previous gradients, so their values are all zero. The calculation of c is shown here:

c 0: H1->O1: sgn(0.0 * 0.07544358513197481) = 0
c 1: H2->O1: sgn(0.0 * -0.12346935587390481) = 0
c 2: B2->O1: sgn(0.0 * -0.18705713395934637) = 0
c 3: I1->H1: sgn(0.0 * 0.005292326734004241) = 0
c 4: I2->H1: sgn(0.0 * 0.0049016107791925246) = 0
c 5: B1->H1: sgn(0.0 * 0.010264148244428655) = 0
c 6: I1->H2: sgn(0.0 * -0.0068740307089347) = 0
c 7: I2->H2: sgn(0.0 * -0.005236293038814788) = 0
c 8: B1->H2: sgn(0.0 * -0.02103299467864286) = 0

From this we can determine each of the weight change values. The value of c is zero in all cases. Because of this, the magnitude of each weight change is simply the update value, applied with the same sign as that weight's gradient. This results in a weight change of plus or minus 0.1 for every weight.

Weight Change 0: +0.1
Weight Change 1: -0.1
Weight Change 2: -0.1
Weight Change 3: +0.1
Weight Change 4: +0.1
Weight Change 5: +0.1
Weight Change 6: -0.1
Weight Change 7: -0.1
Weight Change 8: -0.1

This leaves us with the following weights:

Weight 0: H1->O1: 0.3279194894311762
Weight 1: H2->O1: 0.48171409964135703
Weight 2: B2->O1: 0.6792991203673414
Weight 3: I1->H1: 0.03217052401326839
Weight 4: I2->H1: 0.3234107719788818
Weight 5: B1->H1: -0.3635107399577998
Weight 6: I1->H2: 0.8487814395569221
Weight 7: I2->H2: 0.36158711646254804
Weight 8: B1->H2: -0.0024983800254990973

This ends the first iteration of RPROP.
Some implementations of RPROP will suppress any weight changes for the first training iteration to allow it to "initialize". However, skipping the weight update is generally unnecessary, as the weights are starting from random values anyhow. The first iteration is now complete. The first iteration serves as little more than an initialization iteration.

Training Iteration #2

We begin the second iteration by again calculating the gradients of each of the weights. The gradients will have changed, because the underlying weights changed. We will use these new gradient values to calculate a new value of c for each weight.

c 0: H1->O1: sgn(0.07544358513197481 * 0.07564714780823276) = 1
c 1: H2->O1: sgn(-0.12346935587390481 * -0.10495082682420408) = 1
c 2: B2->O1: sgn(-0.18705713395934637 * -0.16712652502209419) = 1
c 3: I1->H1: sgn(0.005292326734004241 * 0.007147520399328029) = 1
c 4: I2->H1: sgn(0.0049016107791925246 * 0.00657604229900621) = 1
c 5: B1->H1: sgn(0.010264148244428655 * 0.013445893781261988) = 1
c 6: I1->H2: sgn(-0.0068740307089347 * -0.006335334269910348) = 1
c 7: I2->H2: sgn(-0.005236293038814788 * -0.004772389693042953) = 1
c 8: B1->H2: sgn(-0.02103299467864286 * -0.01641086135590903) = 1

From this we can determine each of the weight change values. The value of c is one in all cases, which means that none of the gradients' signs changed. Because of this, each update value grows by positive eta, from 0.1 to 0.12, and each weight again changes by the full update value in the direction of its gradient. This results in a weight change of plus or minus 0.12 for every weight.

Weight Change 0: +0.12
Weight Change 1: -0.12
Weight Change 2: -0.12
Weight Change 3: +0.12
Weight Change 4: +0.12
Weight Change 5: +0.12
Weight Change 6: -0.12
Weight Change 7: -0.12
Weight Change 8: -0.12

This results in the following weights:

0.4479194894311762
0.36171409964135703
0.5592991203673414
0.1521705240132684
0.4434107719788818
-0.24351073995779982
0.7287814395569221
0.24158711646254805
-0.12249838002549909

Everything continues to move in the same direction as in the first iteration.
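While a gradient keeps its sign from one iteration to the next, its update value is simply multiplied by positive eta each time, so it grows geometrically. A small standalone simulation of that growth (my own variable names):

```python
# Growth of one update value across sign-stable iterations: multiplied
# by positive eta (1.2) every time, exactly as in iterations 1 and 2.

ETA_PLUS = 1.2

update = 0.1                  # the default initial update value
history = [update]
for _ in range(4):            # a few iterations with no sign change
    update *= ETA_PLUS
    history.append(update)

print(history)  # grows geometrically: 0.1, 0.12, 0.144, ...
```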
The update values have thus moved up from 0.1 to 0.12, which is the direction specified by positive eta. This ends iteration 2. As you can see, the training will continue on its present path until one of the gradients changes sign. The update values will continue to grow, and the weights will continue to change by a value based on the sign of their gradient and the weight update value. This continues until iteration #8, when the update values begin to take on different values.

Training Iteration #8

We begin the eighth iteration by again calculating the gradients of each of the weights. The weights have changed over the iterations that we skipped. This will cause the gradients to change. We will use the previous gradients along with these new gradients to calculate a new value of c for each weight.

c 0: H1->O1: sgn(0.024815885942307825 * 0.009100949661316388) = 1
c 1: H2->O1: sgn(-0.023266017866306714 * -0.0091094801332018) = 1
c 2: B2->O1: sgn(-0.045150613856680816 * -0.01879394429346648) = 1
c 3: I1->H1: sgn(7.967367771930835E-4 * -0.001991948196987281) = -1
c 4: I2->H1: sgn(8.920300028002447E-4 * -0.0015368274160319773) = -1
c 5: B1->H1: sgn(0.005021027721309641 * 0.0011945995248252954) = 1
c 6: I1->H2: sgn(-0.0010888731437352726 * -3.140593778629495E-4) = 1
c 7: I2->H2: sgn(-7.180424572079871E-4 * -1.4136606514778948E-4) = 1
c 8: B1->H2: sgn(-0.002215756957751366 * -7.264418447125096E-4) = 1

As you can see, there are two weights that now have a -1 value for c. This indicates that the gradient sign has changed. This means we have passed a local minimum, and the error is now climbing. We must do something to stop this increase in error for weights three and four. Instead of multiplying the weight update value by positive eta, we will now use negative eta for weights three and four.
This will scale the update value back from 0.04319 to 0.021599. As previously discussed, the value for negative eta is a constant defined to be 0.5. At the end of this iteration, the weight update values are:

Update 0: H1->O1: 0.05183999999999999
Update 1: H2->O1: 0.05183999999999999
Update 2: B2->O1: 0.05183999999999999
Update 3: I1->H1: 0.021599999999999998
Update 4: I2->H1: 0.021599999999999998
Update 5: B1->H1: 0.05183999999999999
Update 6: I1->H2: 0.05183999999999999
Update 7: I2->H2: 0.05183999999999999
Update 8: B1->H2: 0.05183999999999999

As you can see, all of the update values have continued increasing except for those of weights three and four. These two have been scaled back, and will receive a much smaller update in the next iteration. Additionally, the last gradient for each of these two weights is set to zero. This will prevent a modification of the weight update values in the next training iteration. This can be accomplished because c will be zero in the next step, since the previous gradient is zero.

This process will continue until the global error of the neural network falls to an acceptable level. You can see the complete training process here.

Epoch #1 Error: 0.3100155809627523
Epoch #2 Error: 0.2888003866116162
Epoch #3 Error: 0.267380775814409
Epoch #4 Error: 0.25242444534566344
Epoch #5 Error: 0.25517114662144347
Epoch #6 Error: 0.25242444534566344
Epoch #7 Error: 0.2508797332883249
Epoch #8 Error: 0.25242444534566344
Epoch #9 Error: 0.2509114256134314
...
Epoch #127 Error: 0.029681452838256468
Epoch #128 Error: 0.026157454894821013
Epoch #129 Error: 0.023541442841907054
Epoch #130 Error: 0.0253591989944982
Epoch #131 Error: 0.020825411676740083
Epoch #132 Error: 0.01754524879617848
Epoch #133 Error: 0.015171808565942009
Epoch #134 Error: 0.012948657050164597
Epoch #135 Error: 0.011092515418846417
Epoch #136 Error: 0.009750156492866442

This same set of weights took 579 iterations with backpropagation. As you can see, RPROP outperforms backpropagation by a considerable margin. With more advanced random weight generation, the number of iterations needed for RPROP can be brought down even further. This will be demonstrated in the next chapter.

Chapter Summary

In this chapter, we saw the resilient propagation (RPROP) training method. This training method is much more efficient than backpropagation. Considerably fewer training iterations are necessary for RPROP compared to backpropagation.

RPROP works by keeping an array of update values used to modify the weights of the neural network. The gradients are not used to update the weights of the neural network directly. Rather, the gradients influence how the weight update values are changed for each iteration. Only the sign of the gradient is used to determine the direction in which to take the update values. This makes for a considerable improvement over backpropagation.

So far, we have randomized each neural network with purely random numbers. This does not always lead to the fastest training times. As we will see in the next chapter, we can make modifications to the random numbers with which the neural network is initialized. These small changes will yield faster training times.

Chapter 6: Weight Initialization

• Ranged Random Weights
• Trained and Untrained Neural Networks
• Nguyen-Widrow Weight Initialization

Neural networks must start with their weights initialized to random numbers. These random weights provide the training algorithms with a starting point for the weights. If all of the weights of a neural network were set to zero, the neural network would never train to an acceptable error level. This is because a zeroed weight matrix would place the neural network into a local minimum from which it can never escape.
Often, the weights of neural networks are simply initialized with random numbers between a specific range. The range -1 to +1 is very popular. These random weights will change as the neural network is trained to produce acceptable outputs. In the next section, we will find out what trained and untrained neural networks look like. Later in this chapter, we will see that these random weights can be modified to help the neural network to train faster. The Nguyen-Widrow weight initialization algorithm is a popular technique to adjust these starting weights. This algorithm puts the weights into a position that is much more conducive to training. This means that you need fewer training iterations to get the neural network to an acceptable error rate.

Looking at the Weights

In previous chapters, we looked at the weights of a neural network as an array of numbers. You can't typically glance at a weight array and see any sort of meaningful pattern. However, if the weights are represented graphically, patterns begin to emerge. One common way to view the weights of a neural network is using a special type of chart called a histogram. You've probably seen histograms many times before. A histogram is a chart made up of vertical bars that count the number of occurrences in a population. Figure 6.1 is a histogram showing the popularity of operating systems. The y-axis shows the number of occurrences of each of the groups in the x-axis.

Figure 6.1: Histogram of OS Popularity (from Wikipedia), "Usage share of web client operating systems: May 2011"

We can use a histogram to look at the weights of a neural network. You can typically tell a trained from an untrained neural network by looking at this histogram. Figure 6.2 shows a trained neural network.

Figure 6.2: A Trained Neural Network

A neural network histogram uses the same concept as the operating system histogram shown earlier.
The y-axis specifies how many weights fell into the ranges specified by the numbers on the x-axis. This allows you to see the distribution of the weights. Most trained neural networks will look something like the above chart. Their weights will be very tightly clustered around zero. A trained neural network will typically look like a very narrow Gaussian curve.

Range Randomization

In the last section, we saw what a trained neural network looks like in a weight histogram. Untrained neural networks can have a variety of appearances. The appearance of the weight histogram will be determined by the weight initialization method used.

Figure 6.3: A Ranged Randomization

Range randomization produces a very simple looking chart. The more weights there are, the flatter the top will be. This is because the random number generator should give you an even distribution of numbers. If you are randomizing to the range of -1 to 1, you would expect to have approximately the same number of weights above zero as below.

Using Nguyen-Widrow

We will now look at the Nguyen-Widrow weight initialization method. The Nguyen-Widrow method starts out just like the range randomized method. Random values are chosen between -0.5 and +0.5. However, a special algorithm is employed to modify the weights. The histogram of a Nguyen-Widrow weight initialization looks like Figure 6.4.

Figure 6.4: The Nguyen-Widrow Initialization (weights and biases)

As you can see, the Nguyen-Widrow initialization has a very distinctive pattern. There is a large distribution of weights between -0.5 and 0.5. It gradually rises and then rapidly falls off at around -3.0 and +3.0.

Performance of Nguyen-Widrow

You may be wondering how much advantage there is to using Nguyen-Widrow. Take a look at the average number of training iterations needed to train a neural network initialized by range randomization and Nguyen-Widrow.
Average iterations needed (lower is better):

Range random:  502.86
Nguyen-Widrow: 454.88

As you can see from the above information, Nguyen-Widrow outperforms the range randomizer.

Implementing Nguyen-Widrow

The technique was invented by Derrick Nguyen and Bernard Widrow. It was first introduced in their paper, "Improving the learning speed of 2-layer neural networks by choosing initial values of the adaptive weights", in the Proceedings of the International Joint Conference on Neural Networks, 3:21-26, 1990. To implement a Nguyen-Widrow randomization, first initialize the neural network with random weight values in the range -0.5 to +0.5. This is exactly the same technique as was described earlier in this chapter for the ranged random numbers.

The Nguyen-Widrow randomization technique is efficient because it assigns each hidden neuron to a range of the problem. To do this, we must map the input neurons to the hidden neurons. We calculate a value, called beta, that establishes these ranges. You can see the calculation of beta in Equation 6.1.

Equation 6.1: Calculation of Beta

beta = 0.7 * h ^ (1 / i)

The variable h represents the number of hidden neurons in the first hidden layer, whereas the variable i represents the number of input neurons. We will calculate the weights, taking each hidden neuron one at a time. For each hidden neuron, we calculate the Euclidean norm of all inbound weights to the current hidden neuron. This is done using Equation 6.2.

Equation 6.2: Calculation of the Euclidean Norm

n = sqrt(w1^2 + w2^2 + ... + wk^2)

Beta will stay the same for every hidden neuron. However, the norm must be recalculated for each hidden neuron. Once the beta and norm values have been calculated, the random weights can be adjusted. Equation 6.3 shows how a weight is adjusted using the previously calculated values.

Equation 6.3: Updating the Weights

new w = (beta * w) / n

All inbound weights to the current hidden neuron are adjusted using the same norm. This process is repeated for each hidden neuron.
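Putting Equations 6.1 through 6.3 together, a minimal sketch of the initialization for the first hidden layer might look like the following (the function name is my own; the bias weight is included in each neuron's norm, matching the worked example later in this chapter):

```python
import math
import random

def nguyen_widrow_layer(input_count, hidden_count, rng=random):
    """Sketch of Nguyen-Widrow initialization for the first hidden layer.

    Returns one weight list per hidden neuron, covering its inbound
    connections from the input neurons plus the bias neuron."""
    # Equation 6.1: beta = 0.7 * h^(1/i)
    beta = 0.7 * hidden_count ** (1.0 / input_count)
    layer = []
    for _ in range(hidden_count):
        # Start exactly like range randomization: values in [-0.5, +0.5].
        weights = [rng.uniform(-0.5, 0.5) for _ in range(input_count + 1)]
        # Equation 6.2: Euclidean norm of this neuron's inbound weights.
        norm = math.sqrt(sum(w * w for w in weights))
        # Equation 6.3: new weight = (beta * old weight) / norm.
        layer.append([beta * w / norm for w in weights])
    return layer

random.seed(3)
layer = nguyen_widrow_layer(input_count=2, hidden_count=2)
```

Because every inbound weight is scaled by beta / norm, each hidden neuron's inbound weight vector ends up with a Euclidean norm exactly equal to beta.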
You may have noticed that we are only specifying how to calculate the weights between the input layer and the first hidden layer. The Nguyen-Widrow method does not specify how to calculate the weights between a hidden layer and the output layer. Likewise, the Nguyen-Widrow method does not specify how to calculate the weights between multiple hidden layers. All weights outside of the input layer and first hidden layer are simply initialized to a value between -0.5 and +0.5.

Nguyen-Widrow in Action

We will now walk through the random weight initialization for a small neural network. We will look at a neural network that has two input neurons and a single output neuron. There is a single hidden layer with two neurons. Bias neurons are present on the input and hidden layers. We begin by initializing the weights to random numbers in the range -0.5 to +0.5. The starting weights are shown here.

Weight 0: H1->O1: 0.23773012320107711
Weight 1: H2->O1: 0.2200753094723884
Weight 2: B2->O1: 0.12169691073037914
Weight 3: I1->H1: 0.5172524211645029
Weight 4: I2->H1: 0.5258712726818855
Weight 5: B1->H1: 0.8891383322123643
Weight 6: I1->H2: 0.007687742622070948
Weight 7: I2->H2: 0.48985643968339754
Weight 8: B1->H2: 0.6610227585583137

First we must calculate beta, which is given in Equation 6.1. This calculation is shown here.

Beta = 0.7 * (hiddenNeurons ^ (1.0 / inputCount))

Filling in the variables, we have:

Beta = 0.7 * (2.0 ^ (1.0 / 2.0)) = 0.9899

We are now ready to modify the weights. We will only modify weights three through eight. Weights zero through two are not covered by the Nguyen-Widrow algorithm and are simply set to random values between -0.5 and +0.5. The weights will be recalculated in two phases. First, we will recalculate all of the weights from the bias and input neurons to hidden neuron one. To do this, we must calculate the Euclidean norm, or magnitude, for hidden neuron one. From Equation 6.2,
we have the following:

Norm Hidden 1 = sqrt((weight3)^2 + (weight4)^2 + (weight5)^2)

Filling in the weights, we get:

Norm Hidden 1 = sqrt((0.5172^2) + (0.5258^2) + (0.8891^2)) = 1.155

We will now look at how to calculate the first weight:

New Weight = (beta * Old Weight) / Norm Hidden 1

Filling in values, we have:

New Weight = (0.9899 * 0.5172) / 1.155 = 0.4432
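The worked example above can be checked with a few lines of arithmetic (the variable names are my own):

```python
import math

# Inbound weights to hidden neuron 1, as listed in the example
# (weight 3: I1->H1, weight 4: I2->H1, weight 5: B1->H1).
w3 = 0.5172524211645029
w4 = 0.5258712726818855
w5 = 0.8891383322123643

# Equation 6.1: beta = 0.7 * hiddenNeurons ^ (1 / inputCount).
beta = 0.7 * 2.0 ** (1.0 / 2.0)          # ~= 0.9899

# Equation 6.2: Euclidean norm of hidden neuron 1's inbound weights.
norm_hidden1 = math.sqrt(w3 ** 2 + w4 ** 2 + w5 ** 2)   # ~= 1.155

# Equation 6.3: adjust the first weight.
new_w3 = (beta * w3) / norm_hidden1      # ~= 0.4432
```

The same norm would then be reused to adjust weights 4 and 5, before the process repeats with a fresh norm for hidden neuron two.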