Manual on Presentation of Data and Control Chart Analysis, 7th Edition
ASTM Committee E11 on Quality and Statistics
This comprehensive manual assists in the development of supportive data and analysis when preparing standard test methods, specifications, and practices. It provides the latest information regarding statistical and quality control methods and their applications. This is the 7th revision of this popular manual, first published in 1933 as STP 15. It is an excellent teaching and reference tool for data analysis and complements work needed for ISO quality control requirements.
PART 1 discusses frequency distributions, simple statistical measures, and the presentation, in concise form, of the essential information contained in a single set of n observations.
PART 2 examines the problem of expressing limits of uncertainty for various statistical measures, together with some working rules for rounding off observed results to an appropriate number of significant figures.
PART 3 covers the control chart method for the analysis of observational data obtained from a series of samples, and for detecting lack of statistical control of quality.
New material includes:
Discussions of whole number frequency distributions, empirical percentiles, and order statistics.
Additional material focusing on the risks involved in the decision-making process based on data, and tests for assessing evidence of nonrandom behavior in process control charts.
The use of the s(rms) statistic has been minimized in favor of the sample standard deviation to reduce confusion as to their use.
Categories:
Mathematics\\Algorithms and Data Structures
Year:
2002
Edition:
7
Language:
english
Pages:
135 / 141
ISBN 10:
0803120931
ISBN 13:
9780803145498
Manual on Presentation of Data and Control Chart Analysis
7th Edition
Prepared by COMMITTEE E11 ON QUALITY AND STATISTICS
Stock #: MNL7A
Revision of Special Technical Publication (STP) 15D

ASTM International, 100 Barr Harbor Drive, PO Box C700, West Conshohocken, PA 19428-2959

Library of Congress Cataloging-in-Publication Data
Manual on presentation of data and control chart analysis / prepared by the Committee E11 on statistical control.
(ASTM manual series ; MNL 7)
Includes bibliographical references.
ISBN 0-8031-1289-0
1. Materials--Testing--Handbooks, manuals, etc. 2. Quality control--Statistical methods--Handbooks, manuals, etc. I. ASTM Committee E11 on Statistical Methods. II. Series.
TA410.M355 1989    620.1'1'0287--dc20    89-18047 CIP

Copyright ©2002 by ASTM International, West Conshohocken, PA. Prior editions copyrighted 1995 and earlier, by ASTM International. All rights reserved. This material may not be reproduced or copied, in whole or in part, in any printed, mechanical, electronic, film, or other distribution and storage media, without consent of the publisher.

ASTM International Photocopy Rights
Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by ASTM International for users registered with the Copyright Clearance Center (CCC) Transactional Reporting Service, provided the base fee of $2.50 per copy, plus $0.50 per page is paid directly to CCC, 222 Rosewood Dr., Danvers, MA 01923; Phone: (508) 750-8400; Fax: (508) 740-4744; online: http://www.copyright.com/. For those organizations that have been granted a photocopy license by CCC, a separate system of payment has been arranged. The fee code for users of the Transactional Reporting Service is 0-8031-1289-0 95 $2.50 + .50.
Printed in Bridgeport, NJ, February 2002

Foreword

THIS ASTM Manual on Presentation of Data and Control Chart Analysis is the sixth revision of the original ASTM Manual on Presentation of Data first published in 1933. This sixth revision was prepared by the ASTM E11.10 Subcommittee on Sampling and Data Analysis, which serves the ASTM Committee E11 on Quality and Statistics.

Contents

Preface

PART 1: Presentation of Data
  Summary
  Recommendations for Presentation of Data
  Glossary of Symbols Used in Part 1
  Introduction
    1. Purpose
    2. Type of Data Considered
    3. Homogeneous Data
    4. Typical Examples of Physical Data
  Ungrouped Whole Number Distribution
    5. Ungrouped Distribution
    6. Empirical Percentiles and Order Statistics
  Grouped Frequency Distributions
    7. Introduction
    8. Definitions
    9. Choice of Bin Boundaries
    10. Number of Bins
    11. Rules for Constructing Bins
    12. Tabular Presentation
    13. Graphical Presentation
    14. Cumulative Frequency Distribution
    15. "Stem and Leaf" Diagram
    16. "Ordered Stem and Leaf" Diagram and Box Plot
  Functions of a Frequency Distribution
    17. Introduction
    18. Relative Frequency
    19. Average (Arithmetic Mean)
    20. Other Measures of Central Tendency
    21. Standard Deviation
    22. Other Measures of Dispersion
    23. Skewness, g1
    23a. Kurtosis, g2
    24. Computational Tutorial
  Amount of Information Contained in p, X̄, s, g1, and g2
    25. Summarizing the Information
    26. Several Values of Relative Frequency, p
    27. Single Percentile of Relative Frequency, p
    28. Average X̄ Only
    29. Average X̄ and Standard Deviation s
    30. Average X̄, Standard Deviation s, Skewness g1, and Kurtosis g2
    31. Use of Coefficient of Variation Instead of the Standard Deviation
    32. General Comment on Observed Frequency Distributions of a Series of ASTM Observations
    33. Summary: Amount of Information Contained in Simple Functions of the Data
  Essential Information
    34. Introduction
    35. What Functions of the Data Contain the Essential Information
    36. Presenting X̄ Only Versus Presenting X̄ and s
    37. Observed Relationships
    38. Summary: Essential Information
  Presentation of Relevant Information
    39. Introduction
    40. Relevant Information
    41. Evidence of Control
  Recommendations
    42. Recommendations for Presentation of Data
  References for Part 1

PART 2: Presenting Plus or Minus Limits of Uncertainty of an Observed Average
  Glossary of Symbols Used in Part 2
    1. Purpose
    2. The Problem
    3. Theoretical Background
    4. Computation of Limits
    5. Experimental Illustration
    6. Presentation of Data
    7. One-Sided Limits
    8. General Comments on the Use of Confidence Limits
    9. Number of Places to be Retained in Computation and Presentation
  Supplements
    A. Presenting Plus or Minus Limits of Uncertainty for σ: Normal Distribution
    B. Presenting Plus or Minus Limits of Uncertainty for p′
  References for Part 2

PART 3: Control Chart Method of Analysis and Presentation of Data
  Glossary of Terms and Symbols Used in Part 3
  General Principles
    1. Purpose
    2. Terminology and Technical Background
    3. Two Uses
    4. Breaking up Data into Rational Subgroups
    5. General Technique in Using Control Chart Method
    6. Control Limits and Criteria of Control
  Control: No Standard Given
    7. Introduction
    8. Control Charts for Averages, X̄, and for Standard Deviations, s: Large Samples
    9. Control Charts for Averages, X̄, and for Standard Deviations, s: Small Samples
    10. Control Charts for Averages, X̄, and for Ranges, R: Small Samples
    11. Summary, Control Charts for X̄, s, and R: No Standard Given
    12. Control Charts for Attributes Data
    13. Control Chart for Fraction Nonconforming, p
    14. Control Chart for Number of Nonconforming Units, np
    15. Control Chart for Nonconformities per Unit, u
    16. Control Chart for Number of Nonconformities, c
    17. Summary, Control Charts for p, np, u, and c: No Standard Given
  Control with Respect to a Given Standard
    18. Introduction
    19. Control Charts for Averages X̄, and for Standard Deviation, s
    20. Control Chart for Ranges, R
    21. Summary, Control Charts for X̄, s, and R: Standard Given
    22. Control Charts for Attributes Data
    23. Control Chart for Fraction Nonconforming, p
    24. Control Chart for Number of Nonconforming Units, np
    25. Control Chart for Nonconformities per Unit, u
    26. Control Chart for Number of Nonconformities, c
    27. Summary, Control Charts for p, np, u, and c: Standard Given
  Control Charts for Individuals
    28. Introduction
    29. Control Chart for Individuals, X: Using Rational Subgroups
    30. Control Chart for Individuals, X: Using Moving Ranges
  Examples
    31. Illustrative Examples: Control, No Standard Given
      Example 1: Control Charts for X̄ and s, Large Samples of Equal Size (Section 8A)
      Example 2: Control Charts for X̄ and s, Large Samples of Unequal Size (Section 8B)
      Example 3: Control Charts for X̄ and s, Small Samples of Equal Size (Section 9A)
      Example 4: Control Charts for X̄ and s, Small Samples of Unequal Size (Section 9B)
      Example 5: Control Charts for X̄ and R, Small Samples of Equal Size (Section 10A)
      Example 6: Control Charts for X̄ and R, Small Samples of Unequal Size (Section 10B)
      Example 7: Control Charts for p, Samples of Equal Size (Section 13A), and np, Samples of Equal Size (Section 14)
      Example 8: Control Chart for p, Samples of Unequal Size (Section 13B)
      Example 9: Control Charts for u, Samples of Equal Size (Section 15A), and c, Samples of Equal Size (Section 16A)
      Example 10: Control Chart for u, Samples of Unequal Size (Section 15B)
      Example 11: Control Charts for c, Samples of Equal Size (Section 16A)
    32. Illustrative Examples: Control With Respect to a Given Standard
      Example 12: Control Charts for X̄ and s, Large Samples of Equal Size (Section 19)
      Example 13: Control Charts for X̄ and s, Large Samples of Unequal Size (Section 19)
      Example 14: Control Chart for X̄ and s, Small Samples of Equal Size (Section 19)
      Example 15: Control Chart for X̄ and s, Small Samples of Unequal Size (Section 19)
      Example 16: Control Charts for X̄ and R, Small Samples of Equal Size (Sections 19 and 20)
      Example 17: Control Charts for p, Samples of Equal Size (Section 23), and np, Samples of Equal Size (Section 24)
      Example 18: Control Chart for p (Fraction Nonconforming), Samples of Unequal Size (Section 23)
      Example 19: Control Chart for p (Fraction Rejected), Total and Components, Samples of Unequal Size (Section 23)
      Example 20: Control Chart for u, Samples of Unequal Size (Section 25)
      Example 21: Control Charts for c, Samples of Equal Size (Section 26)
    33. Illustrative Examples: Control Chart for Individuals
      Example 22: Control Chart for Individuals, X: Using Rational Subgroups, Samples of Equal Size, No Standard Given, Based on X̄ and R̄ (Section 29)
      Example 23: Control Chart for Individuals, X: Using Rational Subgroups, Standard Given, Based on μ0 and σ0 (Section 29)
      Example 24: Control Charts for Individuals, X, and Moving Range, MR, of Two Observations, No Standard Given, Based on X̄ and the Mean Moving Range (Section 30A)
      Example 25: Control Charts for Individuals, X, and Moving Range, MR, of Two Observations, Standard Given, Based on μ0 and σ0 (Section 30B)
  Supplements
    A. Mathematical Relations and Tables of Factors for Computing Control Chart Lines
    B. Explanatory Notes
  References for Part 3
  Selected Papers on Control Chart Techniques

Appendix: List of Some Related Publications on Quality Control
Index
INTRODUCTORY INFORMATION

PREFACE

THIS Manual on the Presentation of Data and Control Chart Analysis (MNL 7) was prepared by ASTM's Committee E11 on Quality and Statistics to make available to the ASTM International membership, and others, information regarding statistical and quality control methods, and to make recommendations for their application in the engineering work of the Society. The quality control methods considered herein are those methods that have been developed on a statistical basis to control the quality of product through the proper relation of specification, production, and inspection as parts of a continuing process.

The purposes for which the Society was founded (the promotion of knowledge of the materials of engineering, and the standardization of specifications and the methods of testing) involve at every turn the collection, analysis, interpretation, and presentation of quantitative data. Such data form an important part of the source material used in arriving at new knowledge and in selecting standards of quality and methods of testing that are adequate, satisfactory, and economic, from the standpoints of the producer and the consumer.

Broadly, the three general objects of gathering engineering data are to discover: (1) physical constants and frequency distributions, (2) the relationships, both functional and statistical, between two or more variables, and (3) causes of observed phenomena.
Under these general headings, the following more specific objectives in the work of ASTM International may be cited:

(a) to discover the distributions of quality characteristics of materials which serve as a basis for setting economic standards of quality, for comparing the relative merits of two or more materials for a particular use, for controlling quality at desired levels, for predicting what variations in quality may be expected in subsequently produced material; to discover the distributions of the errors of measurement for particular test methods, which serve as a basis for comparing the relative merits of two or more methods of testing, for specifying the precision and accuracy of standard tests, for setting up economical testing and sampling procedures;
(b) to discover the relationship between two or more properties of a material, such as density and tensile strength; and
(c) to discover physical causes of the behavior of materials under particular service conditions; to discover the causes of nonconformance with specified standards in order to make possible the elimination of assignable causes and the attainment of economic control of quality.

Problems falling in these categories can be treated advantageously by the application of statistical methods and quality control methods. This Manual limits itself to several of the items mentioned under (a). PART 1 discusses frequency distributions, simple statistical measures, and the presentation, in concise form, of the essential information contained in a single set of n observations. PART 2 discusses the problem of expressing ± limits of uncertainty for various statistical measures, together with some working rules for rounding off observed results to an appropriate number of significant figures.
PART 3 discusses the control chart method for the analysis of observational data obtained from a series of samples, and for detecting lack of statistical control of quality.

The present Manual is the sixth revision of earlier work on the subject. The original ASTM Manual on Presentation of Data, STP 15, issued in 1933, was prepared by a special committee of former Subcommittee IX on Interpretation and Presentation of Data of ASTM Committee E1 on Methods of Testing. In 1935, Supplement A on Presenting ± Limits of Uncertainty of an Observed Average and Supplement B on "Control Chart" Method of Analysis and Presentation of Data were issued. These were combined with the original manual and the whole, with minor modifications, was issued as a single volume in 1937. The personnel of the Manual Committee that undertook this early work were: H. F. Dodge, W. C. Chancellor, J. T. McKenzie, R. F. Passano, H. G. Romig, R. T. Webster, and A. E. R. Westman. They were aided in their work by the ready cooperation of the Joint Committee on the Development of Applications of Statistics in Engineering and Manufacturing (sponsored by ASTM International and the American Society of Mechanical Engineers (ASME)) and especially of the chairman of the Joint Committee, W. A. Shewhart. The nomenclature and symbolism used in this early work were adopted in 1941 and 1942 in the American War Standards on Quality Control (Z1.1, Z1.2, and Z1.3) of the American Standards Association, and its Supplement B was reproduced as an appendix with one of these standards.

In 1946, ASTM Technical Committee E11 on Quality Control of Materials was established under the chairmanship of H. F. Dodge, and the manual became its responsibility. A major revision was issued in 1951 as ASTM Manual on Quality Control of Materials, STP 15C. The Task Group that undertook the revision of PART 1 consisted of R. F. Passano, Chairman, H. F. Dodge, A. C. Holman, and J. T. McKenzie.
The same task group also revised PART 2 (the old Supplement A), and the task group for revision of PART 3 (the old Supplement B) consisted of A. E. R. Westman, Chairman, H. F. Dodge, A. I. Peterson, H. G. Romig, and L. E. Simon. In this 1951 revision, the term "confidence limits" was introduced and constants for computing 0.95 confidence limits were added to the constants for 0.90 and 0.99 confidence limits presented in prior printings. Separate treatment was given to control charts for "number of defectives," "number of defects," and "number of defects per unit," and material on control charts for individuals was added. In subsequent editions, the term "defective" has been replaced by "nonconforming unit" and "defect" by "nonconformity" to agree with definitions adopted by the American Society for Quality Control in 1978. (See the American National Standard, ANSI/ASQC A1-1987, Definitions, Symbols, Formulas and Tables for Control Charts.)

There were two more printings of ASTM STP 15C, one in 1956 and a second in 1960. The first added the ASTM Recommended Practice for Choice of Sample Size to Estimate the Average Quality of a Lot or Process (E 122) as an Appendix. This recommended practice had been prepared by a task group of ASTM Committee E11 consisting of A. G. Scroggie, Chairman, C. A. Bicking, W. E. Deming, H. F. Dodge, and S. B. Littauer. This Appendix was removed from that edition because it is revised more often than the main text of this Manual. The current version of E 122, as well as of other relevant ASTM International publications, may be procured from ASTM International. (See the list of references at the back of this Manual.)

In the 1960 printing, a number of minor modifications were made by an ad hoc committee consisting of Harold Dodge, Chairman, Simon Collier, R. H. Ede, R. J. Hader, and E. G. Olds.
The principal change in ASTM STP 15C introduced in ASTM STP 15D was the redefinition of the sample standard deviation to be $s = \sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 / (n - 1)}$. This change required numerous changes throughout the Manual in mathematical equations and formulas, tables, and numerical illustrations. It also led to a sharpening of distinctions between sample values, universe values, and standard values that were not formerly deemed necessary.

New material added in ASTM STP 15D included the following items. The sample measure of kurtosis, g2, was introduced. This addition led to a revision of Table 8 and Section 34 of PART 1. In PART 2, a brief discussion of the determination of confidence limits for a universe standard deviation and a universe proportion was included. The Task Group responsible for this fourth revision of the Manual consisted of A. J. Duncan, Chairman, R. A. Freund, F. E. Grubbs, and D. C. McCune.

In the twenty-two years between the appearance of ASTM STP 15D and Manual on Presentation of Data and Control Chart Analysis, 6th Edition, there were two reprintings without significant changes. In that period a number of misprints and minor inconsistencies were found in ASTM STP 15D. Among these were a few erroneous calculated values of control chart factors appearing in tables of PART 3. While all of these errors were small, the mere fact that they existed suggested a need to recalculate all tabled control chart factors. This task was carried out by A. T. A. Holden, a student at the Center for Quality and Applied Statistics at the Rochester Institute of Technology, under the general guidance of Professor E. G. Schilling of Committee E11. The tabled values of control chart factors have been corrected where found in error. In addition, some ambiguities and inconsistencies between the text and the examples on attribute control charts have received attention.
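The practical effect of the redefined sample standard deviation can be seen numerically. The sketch below is only an illustration: it assumes that the redefined s uses the divisor n − 1 and that the root-mean-square deviation, s(rms), uses the divisor n, and the data values are hypothetical, chosen for easy arithmetic.

```python
import math
import statistics

# Hypothetical sample of n = 8 observations (illustration only).
observations = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(observations)
mean = sum(observations) / n
sum_sq_dev = sum((x - mean) ** 2 for x in observations)

# Sample standard deviation s: divisor n - 1 (assumed STP 15D definition).
s = math.sqrt(sum_sq_dev / (n - 1))

# Root-mean-square deviation s(rms): divisor n.
s_rms = math.sqrt(sum_sq_dev / n)

# The standard library computes the same two quantities.
assert math.isclose(s, statistics.stdev(observations))
assert math.isclose(s_rms, statistics.pstdev(observations))

print(f"n = {n}, average = {mean}")
print(f"s = {s:.4f}, s(rms) = {s_rms:.4f}")
```

Under these assumptions the two measures differ by the factor sqrt(n / (n − 1)), so the distinction matters most for small samples.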
A few changes were made to bring the Manual into better agreement with contemporary statistical notation and usage. The symbol μ (Greek "mu") has replaced X̄′ (and X̿′) for the universe average of measurements (and of sample averages of those measurements). At the same time, the symbol σ has replaced σ′ as the universe value of standard deviation. This entailed replacing σ by s(rms) to denote the sample root-mean-square deviation. Replacing the universe values p′, u′, and c′ by Greek letters was thought worse than leaving them as they are. Section 33, PART 1, on distributional information conveyed by Chebyshev's inequality, has been revised.

Summary of changes in definitions and notations:

  MNL 7                    STP 15D
  μ, σ, p′, u′, c′         X̄′, σ′, p′, u′, c′          (universe values)
  μ0, σ0, p0, u0, c0       X̄0′, σ0′, p0′, u0′, c0′     (standard values)

In the twelve-year period since this Manual was revised again, three developments were made that had an increasing impact on the presentation of data and control chart analysis. The first was the introduction of a variety of new tools of data analysis and presentation. The effect to date of these developments is not fully reflected in PART 1 of this edition of the Manual, but an example of the "stem and leaf" diagram is now presented in Section 15. Manual on Presentation of Data and Control Chart Analysis, 6th Edition from the first has embraced the idea that the control chart is an all-important tool for data analysis and presentation. To integrate properly the discussion of this established tool with the newer ones presents a challenge beyond the scope of this revision.

The second development of recent years strongly affecting the presentation of data and control chart analysis is the greatly increased capacity, speed, and availability of personal computers and sophisticated hand calculators.
The computer revolution has not only enhanced capabilities for data analysis and presentation, but has enabled techniques of high-speed real-time data-taking, analysis, and process control, which years ago would have been unfeasible, if not unthinkable. This has made it desirable to include some discussion of practical approximations for control chart factors for rapid if not real-time application. Supplement A has been considerably revised as a result. (The issue of approximations was raised by Professor A. L. Sweet of Purdue University.) The approximations presented in this Manual presume the computational ability to take squares and square roots of rational numbers without using tables. Accordingly, the Table of Squares and Square Roots that appeared as an Appendix to ASTM STP 15D was removed from the previous revision. Further discussion of approximations appears in Notes 8 and 9 of Supplement B, PART 3. Some of the approximations presented in PART 3 appear to be new and assume mathematical forms suggested in part by unpublished work of Dr. D. L. Jagerman of AT&T Bell Laboratories on the ratio of gamma functions with near arguments.

The third development has been the refinement of alternative forms of the control chart, especially the exponentially weighted moving average chart and the cumulative sum ("cusum") chart. Unfortunately, time was lacking to include discussion of these developments in the fifth revision, although references are given. The assistance of S. J. Amster of AT&T Bell Laboratories in providing recent references to these developments is gratefully acknowledged.

Manual on Presentation of Data and Control Chart Analysis, 6th Edition by Committee E11 was initiated by M. G. Natrella with the help of comments from A. Bloomberg, J. T. Bygott, B. A. Drew, R. A. Freund, E. H. Jebe, B. H. Levine, D. C. McCune, R. C. Paule, R. F. Potthoff, E. G. Schilling, and R. R. Stone. The revision was completed by R. B. Murphy and R. R. Stone with further comments from A. J. Duncan, R. A. Freund, J. H. Hooper, E. H. Jebe, and T. D. Murphy.

Manual on Presentation of Data and Control Chart Analysis, 7th Edition has been directed at bringing the discussions around the various methods covered in PART 1 up to date, especially in the areas of whole number frequency distributions, empirical percentiles, and order statistics. As an example, an extension of the stem-and-leaf diagram has been added which is termed an "ordered stem-and-leaf," which makes it easier to locate the quartiles of the distribution. These quartiles, along with the maximum and minimum values, are then used in the construction of a box plot.

In PART 3, additional material has been included to discuss the idea of risk, namely, the alpha (α) and beta (β) risks involved in the decision-making process based on data; and tests for assessing evidence of nonrandom behavior in process control charts.

Also, use of the s(rms) statistic has been minimized in this revision in favor of the sample standard deviation s to reduce confusion as to their use. Furthermore, the graphics and tables throughout the text have been repositioned so that they appear closer to their discussion in the text.

Manual on Presentation of Data and Control Chart Analysis, 7th Edition by Committee E11 was initiated and led by Dean V. Neubauer, Chairman of the E11.10 Subcommittee on Sampling and Data Analysis that oversees this document. Additional comments from Steve Luko, Charles Proctor, Paul Selden, Greg Gould, Frank Sinibaldi, Ray Mignogna, Neil Ullman, Thomas D. Murphy, and R. B. Murphy were instrumental in the vast majority of the revisions made in this sixth revision. Thanks must also be given to Kathy Dernoga and Monica Siperko of the ASTM International New Publications department for their efforts in the publication of this edition.
Presentation of Data

PART 1 IS CONCERNED solely with presenting information about a given sample of data. It contains no discussion of inferences that might be made about the population from which the sample came.

SUMMARY

Bearing in mind that no rules can be laid down to which no exceptions can be found, the committee believes that if the recommendations presented are followed, the presentations will contain the essential information for a majority of the uses made of ASTM data.

RECOMMENDATIONS FOR PRESENTATION OF DATA

Given a sample of n observations of a single variable obtained under the same essential conditions:

1. Present as a minimum, the average, the standard deviation, and the number of observations. Always state the number of observations.
2. Also, present the values of the maximum and minimum observations. Any collection of observations may contain mistakes. If errors occur in the collection of the data, then correct the data values, but do not discard or change any other observations.
3. The average and standard deviation are sufficient to describe the data, particularly so when they follow a Normal distribution. To see how the data may depart from a Normal distribution, prepare the grouped frequency distribution and its histogram. Also, calculate skewness, g1, and kurtosis, g2.
4. If the data seem not to be normally distributed, then one should consider presenting the median and percentiles (discussed in Section 6), or consider a transformation to make the distribution more normally distributed. The advice of a statistician should be sought to help determine which, if any, transformation is appropriate to suit the user's needs.
5. Present as much evidence as possible that the data were obtained under controlled conditions.
6. Present relevant information on precisely (a) the field of application within which the measurements are believed valid and (b) the conditions under which they were made.

GLOSSARY OF SYMBOLS USED IN PART 1

f     Observed frequency (number of observations) in a single bin of a frequency distribution
g1    Sample coefficient of skewness, a measure of skewness, or lopsidedness, of a distribution
g2    Sample coefficient of kurtosis
n     Number of observed values (observations)
p     Sample relative frequency or proportion: the ratio of the number of occurrences of a given type to the total possible number of occurrences, or the ratio of the number of observations in any stated interval to the total number of observations; sample fraction nonconforming for measured values: the ratio of the number of observations lying outside specified limits (or beyond a specified limit) to the total number of observations
R     Sample range, the difference between the largest observed value and the smallest observed value
s     Sample standard deviation
s²    Sample variance
cv    Sample coefficient of variation, a measure of relative dispersion based on the standard deviation (see Sect. 31)
X     Observed values of a measurable characteristic; specific observed values are designated X₁, X₂, X₃, etc. in order of measurement, and X(1), X(2), X(3), etc. in order of their size, where X(1) is the smallest or minimum observation and X(n) is the largest or maximum observation in a sample of observations; also used to designate a measurable characteristic
X̄     Sample average or sample mean, the sum of the n observed values in a sample divided by n

NOTE
The sample proportion p is an example of a sample average in which each observation is either a 1, the occurrence of a given type, or a 0, the nonoccurrence of the same type. The sample average is then exactly the ratio, p, of the total number of occurrences to the total number possible in the sample, n.

If reference is to be made to the population from which a given sample came, the following symbols should be used.

γ₁    Population skewness defined as the expected value (see Note) of (X − μ)³ divided by σ³. It is spelled and pronounced "gamma one."
γ₂    Population coefficient of kurtosis defined as the amount by which the expected value (see Note) of (X − μ)⁴ divided by σ⁴ exceeds or falls short of 3; it is spelled and pronounced "gamma two."
μ     Population average or universe mean defined as the expected value (see Note) of X; thus E(X) = μ, spelled "mu" and pronounced "mew."
p′    Population relative frequency
σ     Population standard deviation, spelled and pronounced "sigma."
σ²    Population variance defined as the expected value (see Note) of the square of a deviation from the universe mean; thus E[(X − μ)²] = σ²
CV    Population coefficient of variation defined as the population standard deviation divided by the population mean, also called the relative standard deviation, or relative error (see Sect. 31)

NOTE
If a set of data is homogeneous in the sense of Section 3 of PART 1, it is usually safe to apply statistical theory and its concepts, like that of an expected value, to the data to assist in its analysis and interpretation. Only then is it meaningful to speak of a population average or other characteristic relating to a population (relative) frequency distribution function of X. This function commonly assumes the form of f(x), which is the probability (relative frequency) of an observation having exactly the value X, or the form of f(x)dx, which is the probability an observation has a value between x and x + dx. Mathematically the expected value of a function of X, say h(X), is defined as the sum (for discrete data) or integral (for continuous data) of that function times the probability of X and written E[h(X)]. For example, if the probability of X lying between x and x + dx based on continuous data is f(x)dx, then the expected value is $\int h(x) f(x)\,dx = E[h(X)]$.
Sample statistics, like $\bar{X}$, $s^2$, $g_1$, and $g_2$, also have expected values in most practical cases, but these expected values relate to the population frequency distribution of entire samples of n observations each, rather than of individual observations. The expected value of $\bar{X}$ is $\mu$, the same as that of an individual observation regardless of the population frequency distribution of X, and $E(s^2) = \sigma^2$ likewise, but $E(s)$ is less than $\sigma$ in all cases and its value depends on the population distribution of X.

INTRODUCTION

1. Purpose

PART 1 of the Manual discusses the application of statistical methods to the problem of: (a) condensing the information contained in a sample of observations, and (b) presenting the essential information in a concise form more readily interpretable than the unorganized mass of original data.

Attention will be directed particularly to quantitative information on measurable characteristics of materials and manufactured products. Such characteristics will be termed quality characteristics.

2. Type of Data Considered

Consideration will be given to the treatment of a sample of n observations of a single variable. Figure 1 illustrates two general types: (a) the first type is a series of n observations representing single measurements of the same quality characteristic of n similar things, and (b) the second type is a series of n observations representing n measurements of the same quality characteristic of one thing.

The observations in Figure 1 are denoted as $X_i$, where i = 1, 2, 3, ..., n. Generally, the subscript will represent the time sequence in which the observations were taken from a process or measurement. In this sense, we may consider the order of the data in Table 1 as being represented in a time-ordered manner.

FIG. 1. Two general types of data. (First type: n things, one observation on each thing. Second type: one thing, n observations.)
Data of the first type are commonly gathered to furnish information regarding the distribution of the quality of the material itself, having in mind possibly some more specific purpose, such as the establishment of a quality standard or the determination of conformance with a specified quality standard; for example, 100 observations of transverse strength on 100 bricks of a given brand. Data of the second type are commonly gathered to furnish information regarding the errors of measurement for a particular test method; for example, 50 micrometer measurements of the thickness of a test block.

NOTE: The quality of a material in respect to some particular characteristic, such as tensile strength, is better represented by a frequency distribution function than by a single-valued constant. The variability in a group of observed values of such a quality characteristic is made up of two parts: variability of the material itself, and the errors of measurement. In some practical problems, the error of measurement may be large compared with the variability of the material; in others, the converse may be true. In any case, if one is interested in discovering the objective frequency distribution of the quality of the material, consideration must be given to correcting the errors of measurement. (This is discussed in Ref. 1, pp. 379-384, in the seminal book on control chart methodology by Walter A. Shewhart.)

3. Homogeneous Data

While the methods here given may be used to condense any set of observations, the results obtained by using them may be of little value from the standpoint of interpretation unless the data are good in the first place and satisfy certain requirements.

To be useful for inductive generalization, any sample of observations that is treated as a single group for presentation purposes should represent a series of measurements, all made under essentially the same test conditions, on a material or product, all of which has been produced under essentially the same conditions.

If a given sample of data consists of two or more subportions collected under different test conditions or representing material produced under different conditions, it should be considered as two or more separate subgroups of observations, each to be treated independently in the analysis. Merging of such subgroups, representing significantly different conditions, may lead to a condensed presentation that will be of little practical value. Briefly, any sample of observations to which these methods are applied should be homogeneous.

In the illustrative examples of PART 1, each sample of observations will be assumed to be homogeneous, that is, observations from a common universe of causes. The analysis and presentation by control chart methods of data obtained from several samples or capable of subdivision into subgroups on the basis of relevant engineering information is discussed in PART 3 of this Manual. Such methods enable one to determine whether for practical purposes a given sample of observations may be considered to be homogeneous.

4. Typical Examples of Physical Data

Table 1 gives three typical sets of observations. Each of these datasets represents measurements on a sample of units or specimens selected in a random manner to provide information about the quality of a larger quantity of material: the general output of one brand of brick, a production lot of galvanized iron sheets, and a shipment of hard drawn copper wire. Consideration will be given to ways of arranging and condensing these data into a form better adapted for practical use.

TABLE 1. Three groups of original data.

(a) Transverse Strength of 270 Bricks of a Typical Brand, psi [a]

 860  920 1320 1100 1200  830  920  850  920  820
1250 1100  940 1040 1480 1190 1080 1180  830 1380
1390 1100  960  820 1360  730  830  980 1120 1090
1170 2010 1160 1330 1090  890  930  880  790 1100
1010 1130 1260 1260 1140 1050  860  850 1080 1070
 900  890  970  700  820  880 1150 1060  980 1180
 950 1110 1270 1010  890  270 1310 1070 1020 1170
 920  960 1020 1180  740  860 1240 1020 1030 1290
 870  820  990 1030 1100  740 1130 1000 1080 1000
1000 1150  860  810 1070 1630  670 1330 1150  700
 880  910  870 1170 1340  800  840 1080 1060 1230
1040  980  940 1240 1110 1020 1100 1060  990  840
1060 1170  970  790  690 1020 1070  820  890  580
 960  860  800  990  870  660 1040  820 1180 1350
 950  900  760 1080  830  890  970 1100 1220 1020
1100 1380 1090 1010 1380 1380 1030  830  850  900
 950  890 1010 1000 1150 1360  860  880  730  910
 890 1030 1060 1400  850 1010 1080  970 1110  780
1100  920  800 1140  970  890 1010 1120 1070 1100
 800  710  880  780  940 1240 1190  910  870  810
 960  870  910 1180 1190 1050 1230 1150  630  780
 710 1020 1300  990  880  750  970  980  940  780
 760  910  990  870 1230 1100 1240  940  860 1090
 830 1040 1510  740 1150 1000 1140 1030  700  920
 860  950  860  720 1080  840 1070  800  570  800
1180 1000  920  650 1610 1180  980  830  460  730
1030  860  800 1050 1070 1400 1010  970  980  900

(b) Weight of Coating of 100 Sheets of Galvanized Iron Sheets, oz/ft^2 [b]

1.467 1.623 1.520 1.767 1.550 1.533 1.377 1.373 1.637 1.460
1.627 1.537 1.533 1.337 1.603 1.373 1.457 1.660 1.323 1.647
1.603 1.577 1.603 1.383 1.730 1.700 1.600 1.603 1.477 1.513
1.533 1.593 1.503 1.600 1.543 1.567 1.490 1.550 1.577 1.483
1.600 1.577 1.323 1.620 1.473 1.420 1.450 1.337 1.440 1.557
1.480 1.477 1.550 1.637 1.570 1.617 1.477 1.750 1.497 1.717
1.563 1.393 1.647 1.620 1.530 1.470 1.337 1.580 1.493 1.563
1.543 1.567 1.670 1.473 1.633 1.763 1.573 1.537 1.420 1.513
1.437 1.350 1.530 1.383 1.457 1.443 1.473 1.433 1.637 1.500
1.607 1.423 1.573 1.753 1.467 1.563 1.503 1.550 1.647 1.690

(c) Breaking Strength of Ten Specimens of 0.104-in. Hard-Drawn Copper Wire, lb [c]

578 572 570 568 572 570 570 572 576 584

[a] Measured to the nearest 10 psi. Test method used was ASTM Method of Testing Brick and Structural Clay (C 67). Data from ASTM Manual for Interpretation of Refractory Test Data, 1935, p. 83.
[b] Measured to the nearest 0.01 oz/ft^2 of sheet, averaged for three spots. Test method used was ASTM Triple Spot Test of Standard Specifications for Zinc-Coated (Galvanized) Iron or Steel Sheets (A 93). This has been discontinued and was replaced by ASTM Specification for General Requirements for Steel Sheet, Zinc-Coated (Galvanized) by the Hot-Dip Process (A 525). Data from laboratory tests.
[c] Measured to the nearest 2 lb. Test method used was ASTM Specification for Hard-Drawn Copper Wire (B 1). Data from inspection report.

FIG. 2. Showing graphically the ungrouped frequency distribution of a set of observations. Each dot represents one brick, data of Table 2(a). (Horizontal axis: Transverse Strength, psi, from 500 to 2000.)

UNGROUPED WHOLE NUMBER DISTRIBUTION

5. Ungrouped Distribution

An arrangement of the observed values in ascending order of magnitude will be referred to in the Manual as the ungrouped frequency distribution of the data, to distinguish it from the grouped frequency distribution defined in Section 8.
A further adjustment in the scale of the ungrouped distribution produces the whole number distribution. For example, the data of Table 1(a) were multiplied by $10^{-1}$, and those of Table 1(b) by $10^{3}$, while those of Table 1(c) were already whole numbers. If the data carry digits past the decimal point, just round until a tie (one observation equals some other) appears and then scale to whole numbers. Table 2 presents ungrouped frequency distributions for the three sets of observations given in Table 1.

Figure 2 shows graphically the ungrouped frequency distribution of Table 2(a). In the graph, there is a minor grouping in terms of the unit of measurement. For the data of Fig. 2, it is the "rounding-off" unit of 10 psi. It is rarely desirable to present data in the manner of Table 1 or Table 2. The mind cannot grasp in its entirety the meaning of so many numbers; furthermore, greater compactness is required for most of the practical uses that are made of data.

6. Empirical Percentiles and Order Statistics

As should be apparent, the ungrouped whole number distribution may differ from the original data by a scale factor (some power of ten), by some rounding, and by having been sorted from smallest to largest. These features should make it easier to convert from an ungrouped to a grouped frequency distribution. More importantly, they allow calculation of the order statistics that will aid in finding ranges of the distribution wherein lie specified proportions of the observations. A collection of observations is often seen as only a sample from a potentially huge population of observations, and one aim in studying the sample may be to say what proportions of values in the population lie in certain ranges. This is done by calculating the percentiles of the distribution. We will see there are a number of ways to do this, but we begin by discussing order statistics and empirical estimates of percentiles.
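The whole number distribution of Section 5, from which the order statistics are taken, can be sketched in a few lines. This is only an illustration: the function name `whole_number_distribution` is hypothetical, the samples are the first five values of Table 1(a) and Table 1(b) plus all of Table 1(c), and the scale factor of 0.1 for the brick data is assumed from the 10 psi rounding unit.

```python
def whole_number_distribution(values, scale):
    """Scale each observation, round to a whole number, and sort ascending."""
    return sorted(round(v * scale) for v in values)


# First five observations of Table 1(a) and Table 1(b); all ten of Table 1(c).
bricks_psi = [860, 920, 1320, 1100, 1200]
coating_oz_ft2 = [1.467, 1.623, 1.520, 1.767, 1.550]
wire_lb = [578, 572, 570, 568, 572, 570, 570, 572, 576, 584]

# Scale factors: 10^-1 (assumed), 10^3, and 1 (already whole numbers).
bricks_whole = whole_number_distribution(bricks_psi, 1e-1)
coating_whole = whole_number_distribution(coating_oz_ft2, 1e3)
wire_whole = whole_number_distribution(wire_lb, 1)

print(bricks_whole)
print(coating_whole)
print(wire_whole)
```

Rounding after scaling also absorbs the small floating-point error that multiplying by 0.1 or 1000 can introduce.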
A glance at Table 2 gives some information not readily observed in the original data set of Table 1. The data in Table 2 are arranged in increasing order of magnitude. When we arrange any data set like this, the resulting ordered sequence of values is referred to as the order statistics. Such ordered arrangements are often of value in the initial stages of an analysis. In this context, we use subscript notation and write $X_{(i)}$ to denote the ith order statistic. For a sample of n values the order statistics are $X_{(1)} \le X_{(2)} \le X_{(3)} \le \dots \le X_{(n)}$. The index i is sometimes called the rank of the data point to which it is attached. For a sample size of n values, the first order statistic is the smallest or minimum value and has rank 1. We write this as $X_{(1)}$. The nth order statistic is the largest or maximum value and has rank n. We write this as $X_{(n)}$. The ith order statistic is written as $X_{(i)}$, for $1 \le i \le n$. For the breaking strength data in Table 2(c), the order statistics are: $X_{(1)} = 568$, $X_{(2)} = 570$, ..., $X_{(10)} = 584$.

When ranking the data values, we may find some that are the same. In this situation, we say that a matched set of values constitutes a tie. The proper rank assigned to values that make up the tie is calculated by averaging the ranks that would have been determined by the procedure above in the case where each value was different from the others. For example, there are many ties present in Table 2. The rank associated with the three values of 700 would be the average of the ranks as if they were 700, 701, and 702, respectively. In other words, we see that the values of 700 occur in the 10th, 11th, and 12th positions, or represented as $X_{(10)}$, $X_{(11)}$, and $X_{(12)}$, respectively, if they were unequal. Thus, the value of 700 should carry a rank equal to (10 + 11 + 12)/3 = 11, and each value is specified as $X_{(11)}$.

TABLE 2. Ungrouped frequency distributions in tabular form.

(a) Transverse Strength, psi (data of Table 1(a))

 270  460  570  580  630   650  660  670  690  700   700  700  710  710  720   730  730  730  740  740   740  750  760  760  780
 780  780  780  790  790   800  800  800  800  800   800  800  810  810  820   820  820  820  820  820   820  830  830  830  830
 830  830  830  840  840   840  850  850  850  850   860  860  860  860  860   860  860  860  860  860   870  870  870  870  870
 870  880  880  880  880   880  880  890  890  890   890  890  890  890  890   900  900  900  900  910   910  910  910  910  920
 920  920  920  920  920   930  940  940  940  940   940  950  950  950  950   960  960  960  960  970   970  970  970  970  970
 970  980  980  980  980   980  980  990  990  990   990  990 1000 1000 1000  1000 1000 1000 1010 1010  1010 1010 1010 1010 1010
1020 1020 1020 1020 1020  1020 1020 1030 1030 1030  1030 1030 1030 1040 1040  1040 1040 1050 1050 1050  1060 1060 1060 1060 1060
1070 1070 1070 1070 1070  1070 1070 1080 1080 1080  1080 1080 1080 1080 1090  1090 1090 1090 1100 1100  1100 1100 1100 1100 1100
1100 1100 1100 1100 1110  1110 1110 1120 1120 1130  1130 1140 1140 1140 1150  1150 1150 1150 1150 1150  1160 1170 1170 1170 1170
1180 1180 1180 1180 1180  1180 1180 1190 1190 1190  1200 1220 1230 1230 1230  1240 1240 1240 1240 1250  1260 1260 1270 1290 1300
1310 1320 1330 1330 1340  1350 1360 1360 1380 1380  1380 1380 1390 1400 1400  1480 1510 1610 1630 2010

The order statistics can be used for a variety of purposes, but it is for estimating the percentiles that they are used here.
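The averaging rule for tied ranks described above can be sketched as follows. The function name `average_ranks` is hypothetical, and the input is only the 15 smallest values of Table 2(a), which is enough to reproduce the rank of 11 for the three values of 700.

```python
def average_ranks(values):
    """Return (ordered values, ranks); tied values share the average rank."""
    ordered = sorted(values)
    ranks = [0.0] * len(ordered)
    start = 0
    while start < len(ordered):
        # Extend 'end' to the last position holding the same value.
        end = start
        while end + 1 < len(ordered) and ordered[end + 1] == ordered[start]:
            end += 1
        # Positions are 1-based, so the tie occupies start+1 .. end+1.
        shared_rank = ((start + 1) + (end + 1)) / 2
        for j in range(start, end + 1):
            ranks[j] = shared_rank
        start = end + 1
    return ordered, ranks


# The 15 smallest transverse strengths from Table 2(a).
smallest = [270, 460, 570, 580, 630, 650, 660, 670, 690,
            700, 700, 700, 710, 710, 720]
ordered, ranks = average_ranks(smallest)
for value, rank in zip(ordered, ranks):
    print(value, rank)
```

Because these are the 15 smallest observations, the ranks printed here are the same ranks the values carry in the full sample of 270.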
A percentile is a value that divides a distribution to leave a given fraction of the observations less than that value. For example, the 50th percentile, typically referred to as the median, is a value such that half of the observations exceed it and half are below it. The 75th percentile is a value such that 25% of the observations exceed it and 75% are below it. The 90th percentile is a value such that 10% of the observations exceed it and 90% are below it.

To aid in understanding the formulas that follow, consider finding the percentile that best corresponds to a given order statistic. Although there are several answers to this question, one of the simplest is to realize that a sample of size n will partition the distribution from which it came into n+1 compartments, as illustrated in the following figure.

FIG. 3. Any distribution is partitioned into n+1 compartments with a sample of n.

In Figure 3, the sample size is n = 4; the sample values are denoted as a, b, c, and d. The sample presumably comes from some distribution as the figure suggests. Although we do not know the exact locations that the sample values correspond to along the true distribution, we observe that the four values divide the distribution into 5 roughly equal compartments. Each compartment will contain some percentage of the area under the curve so that the sum of each of the percentages is 100%. Assuming that each compartment contains the same area, the probability a value will fall into any compartment is $100[1/(n+1)]\%$.

Similarly, we can compute the percentile that each value represents by $100[i/(n+1)]\%$, where i = 1, 2, ..., n. If we ask what percentile is the first order statistic among the four values, we estimate the answer as the $100[1/(4+1)]\% = 20\%$, or 20th percentile. This is because, on average, each of the compartments in Figure 3 will include approximately 20% of the distribution. Since there are n + 1 = 4 + 1 = 5 compartments in the figure, each compartment is worth 20%. The generalization is obvious. For a sample of n values, the percentile corresponding to the ith order statistic is $100[i/(n+1)]\%$, where i = 1, 2, ..., n.

For example, if n = 24 and we want to know which percentiles are best represented by the 1st and 24th order statistics, we can calculate the percentile for each order statistic. For $X_{(1)}$, the percentile is 100(1)/(24+1) = 4th; and for $X_{(24)}$, the percentile is 100(24)/(24+1) = 96th.

For the illustration in Figure 3, the point a corresponds to the 20th percentile, point b to the 40th percentile, point c to the 60th percentile, and point d to the 80th percentile. It is not difficult to extend this application. From the figure it appears that the interval defined by $a \le x \le d$ should enclose, on average, 60% of the distribution of X.

We now extend these ideas to estimate the distribution percentiles. For the coating weights in Table 2(b), the sample size is n = 100. The estimate of the 50th percentile, or sample median, is the number lying halfway between the 50th and 51st order statistics ($X_{(50)} = 1.537$ and $X_{(51)} = 1.543$, respectively). Thus, the sample median is (1.537 + 1.543)/2 = 1.540. Note that the middlemost values may be the same (tie). When the sample size is an even number, the sample median will always be taken as halfway between the middle two order statistics. Thus, if the sample size is 250, the median is taken as $(X_{(125)} + X_{(126)})/2$. If the sample size is an odd number, the median is taken as the middlemost order statistic. For example, if the sample size is 13, the sample median is taken as $X_{(7)}$. Note that for an odd numbered sample size, n, the index corresponding to the median will be i = (n+1)/2.

We can generalize the estimation of any percentile by using the following convention.
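Before turning to that convention, the two rules just described (the percentile 100 i/(n+1) attached to the ith order statistic, and the even/odd rule for the sample median) can be sketched as below. The function names are hypothetical, and the median is demonstrated on the ten breaking strengths of Table 1(c) rather than on the full coating data.

```python
def order_statistic_percentile(i, n):
    """Percentile best represented by the i-th order statistic of n values."""
    return 100.0 * i / (n + 1)


def sample_median(values):
    """Middle order statistic (odd n) or mean of the middle two (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # X((n+1)/2) in 1-based terms
    return (ordered[mid - 1] + ordered[mid]) / 2  # (X(n/2) + X(n/2 + 1)) / 2


# Figure 3 (n = 4) and the n = 24 example from the text.
print(order_statistic_percentile(1, 4))
print(order_statistic_percentile(1, 24), order_statistic_percentile(24, 24))

# Breaking strengths of Table 1(c), n = 10 (even).
wire_lb = [578, 572, 570, 568, 572, 570, 570, 572, 576, 584]
print(sample_median(wire_lb))

# An odd-sized sample of 13 values: the median is the 7th order statistic.
print(sample_median(list(range(1, 14))))
```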
Let p be a proportion, so that for the 50th percentile p equals 0.50, for the 25th percentile p = 0.25, for the 10th percentile p = 0.10, and so forth. To specify a percentile we need only specify p. An estimated percentile will correspond to an order statistic or weighted average of two adjacent order statistics. First, compute an approximate rank using the formula i = (n+1)p. If i is an integer, then the 100pth percentile is estimated as $X_{(i)}$ and we are done. If i is not an integer, then drop the decimal portion and keep the integer portion of i. Let k be the retained integer portion and r be the dropped decimal portion (note: $0 < r < 1$). The estimated 100pth percentile is computed from the formula $X_{(k)} + r\,(X_{(k+1)} - X_{(k)})$.

TABLE 2. Continued.

(b) Weight of Coating, oz/ft^2 (data of Table 1(b))

1.323 1.323 1.337 1.337 1.337  1.350 1.373 1.373 1.377 1.383  1.383 1.393 1.420 1.420 1.423  1.433 1.437 1.440 1.443 1.450
1.457 1.457 1.460 1.467 1.467  1.470 1.473 1.473 1.473 1.477  1.477 1.477 1.480 1.483 1.490  1.493 1.497 1.500 1.503 1.503
1.513 1.513 1.520 1.530 1.530  1.533 1.533 1.533 1.537 1.537  1.543 1.543 1.550 1.550 1.550  1.550 1.557 1.563 1.563 1.563
1.567 1.567 1.570 1.573 1.573  1.577 1.577 1.577 1.580 1.593  1.600 1.600 1.600 1.603 1.603  1.603 1.603 1.607 1.617 1.620
1.620 1.623 1.627 1.633 1.637  1.637 1.637 1.647 1.647 1.647  1.660 1.670 1.690 1.700 1.717  1.730 1.750 1.753 1.763 1.767

(c) Breaking Strength, lb (data of Table 1(c))

568 570 570 570 572  572 572 576 578 584

Consider the transverse strengths with n = 270 and let us find the 2.5th and 97.5th percentiles. For the 2.5th percentile, p = 0.025. The approximate rank is computed as i = (270+1)(0.025) = 6.775. Since this is not an integer, we see that k = 6 and r = 0.775. Thus, the 2.5th percentile is estimated by $X_{(6)} + r\,(X_{(7)} - X_{(6)})$, which is 650 + 0.775(660 - 650) = 657.75. For the 97.5th percentile, the approximate rank is i = (270+1)(0.975) = 264.225.
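The rank convention i = (n+1)p with interpolation between adjacent order statistics can be sketched as follows. The function name `estimate_percentile` is hypothetical, and the handling of ranks that fall below 1 or at or above n (returning the minimum or maximum) is an assumption, since the text does not cover those cases. The demonstration uses the ten breaking strengths of Table 2(c) only to keep the listing short.

```python
def estimate_percentile(ordered, p):
    """Estimate the 100p-th percentile from values sorted in ascending order."""
    n = len(ordered)
    i = (n + 1) * p          # approximate rank
    k = int(i)               # retained integer portion
    r = i - k                # dropped decimal portion
    if k < 1:                # assumption: clamp to the smallest observation
        return ordered[0]
    if k >= n:               # assumption: clamp to the largest observation
        return ordered[-1]
    # ordered[k - 1] is X(k) and ordered[k] is X(k+1) in 1-based notation.
    return ordered[k - 1] + r * (ordered[k] - ordered[k - 1])


wire_sorted = [568, 570, 570, 570, 572, 572, 572, 576, 578, 584]
for p in (0.10, 0.50, 0.90):
    print(p, estimate_percentile(wire_sorted, p))
```

When r is zero the interpolation term vanishes and the function returns $X_{(k)}$ itself, so the integer-rank case needs no separate branch.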
Here again, i is not an integer and so we use k = 264 and r = 0.225; however, notice that both $X_{(264)}$ and $X_{(265)}$ are equal to 1400. In this case, the value 1400 becomes the estimate.

GROUPED FREQUENCY DISTRIBUTIONS

7. Introduction

Merely grouping the data values may condense the information contained in a set of observations. Such grouping involves some loss of information but is often useful in presenting engineering data. In the following sections, both tabular and graphical presentation of grouped data will be discussed.

8. Definitions

A grouped frequency distribution of a set of observations is an arrangement that shows the frequency of occurrence of the values of the variable in ordered classes.

The interval, along the scale of measurement, of each ordered class is termed a bin.

The frequency for any bin is the number of observations in that bin. The frequency for a bin divided by the total number of observations is the relative frequency for that bin.

Table 3 illustrates how the three sets of observations given in Table 1 may be organized into grouped frequency distributions. The recommended form of presenting tabular distributions is somewhat more compact, however, as shown in Table 4. Graphical presentation is used in Fig. 4 and discussed in detail in Section 13.

9. Choice of Bin Boundaries

It is usually advantageous to make the bin intervals equal. It is recommended that, in general, the bin boundaries be chosen halfway between two possible observations. By choosing bin boundaries in this way, certain difficulties of classification and computation are avoided (see Ref. 2, pp. 73-76). With this choice, the bin boundary values will usually have one more significant figure (usually a 5) than the values in the original data. For example, in Table 3(a), observations were recorded to the nearest 10 psi, hence the bin boundaries were placed at 225, 375, etc., rather than at 220, 370, etc., or 230, 380, etc. Likewise, in Table 3(b), observations were recorded to the nearest 0.01 oz/ft^2, hence bin boundaries were placed at 1.275, 1.325, etc., rather than at 1.28, 1.33, etc.

10. Number of Bins

The number of bins in a frequency distribution should preferably be between 13 and 20. (For a discussion of this point, see Ref. 1, p. 69, and Ref. 18, pp. 9-12.) Sturges' rule is to make the number of bins equal to $1 + 3.3\log_{10}(n)$. If the number of observations is, say, less than 250, as few as 10 bins may be of use. When the number of observations is less than 25, a frequency distribution of the data is generally of little value from a presentation standpoint, as for example the 10 observations in Table 3(c). In general, the outline of a frequency distribution when presented graphically is more irregular when the number of bins is larger. This tendency is illustrated in Fig. 4.

11. Rules for Constructing Bins

After getting the ungrouped whole number distribution, one can use a number of popular computer programs to automatically construct a histogram. For example, a spreadsheet program, e.g., Excel [1], can be used by selecting the Histogram item from the Analysis ToolPak menu. Alternatively, you can do it manually by applying the following rules:

• The number of bins (or "cells" or "levels") is set equal to NL = CEIL(2.1 log(n)), where n is the sample size, log is the natural logarithm (for n = 270 this gives NL = 12, the number of bins in Table 3(a)), and CEIL is the ceiling operation (the CEILING function in Excel), which rounds a decimal number up to the next whole number, e.g., CEIL(4.1) is 5.
• Compute the bin interval as LI = CEIL(RG/NL), where RG = LW - SW, and LW is the largest whole number and SW is the smallest among the n observations.
• Find the stretch adjustment as SA = CEIL((NL*LI - RG)/2). Set the start boundary at START = SW - SA - 0.5 and then add LI successively NL times to get the bin boundaries. Average successive pairs of boundaries to get the bin midpoints.
• Having defined the bins, the last step is to count the whole numbers in each bin and thus record the grouped frequency distribution as the bin midpoints with the frequencies in each.
• The user may improve upon the rules, but they will produce a useful starting point and do obey the general principles of construction of a frequency distribution.

[1] Excel is a trademark of Microsoft Corporation.

Figure 5 illustrates a convenient method of classifying observations into bins when the number of observations is not large. For each observation, a mark is entered in the proper bin. These marks are grouped in fives as the tallying proceeds, and the completed tabulation itself, if neatly done, provides a good picture of the frequency distribution.

If the number of observations is, say, over 250, and accuracy is essential, the use of a computer may be preferred.

TABLE 3. Three examples of grouped frequency distribution, showing bin midpoints and bin boundaries.

(a) Transverse strength, psi (data of Table 1(a))

Bin Midpoint   Bin Boundaries   Observed Frequency
     300        225 to  375             1
     450        375 to  525             1
     600        525 to  675             6
     750        675 to  825            38
     900        825 to  975            80
    1050        975 to 1125            83
    1200       1125 to 1275            39
    1350       1275 to 1425            17
    1500       1425 to 1575             2
    1650       1575 to 1725             2
    1800       1725 to 1875             0
    1950       1875 to 2025             1
                      Total           270

(b) Weight of coating, oz/ft^2 (data of Table 1(b))

Bin Midpoint   Bin Boundaries    Observed Frequency
    1.300      1.275 to 1.325            2
    1.350      1.325 to 1.375            6
    1.400      1.375 to 1.425            7
    1.450      1.425 to 1.475           14
    1.500      1.475 to 1.525           14
    1.550      1.525 to 1.575           22
    1.600      1.575 to 1.625           17
    1.650      1.625 to 1.675           10
    1.700      1.675 to 1.725            3
    1.750      1.725 to 1.775            5
                       Total           100

(c) Breaking strength, lb (data of Table 1(c))

Bin Midpoint   Bin Boundaries   Observed Frequency
     568         567 to 569             1
     570         569 to 571             3
     572         571 to 573             3
     574         573 to 575             0
     576         575 to 577             1
     578         577 to 579             1
     580         579 to 581             0
     582         581 to 583             0
     584         583 to 585             1
                      Total            10

TABLE 4. Four methods of presenting a tabular frequency distribution (data of Table 1(a)).

(a) Frequency and (b) Relative Frequency (expressed in percentages)

Transverse       (a) Number of Bricks       (b) Percentage of Bricks
Strength, psi    Having Strength Within     Having Strength Within
                 Given Limits               Given Limits
 225 to  375              1                         0.4
 375 to  525              1                         0.4
 525 to  675              6                         2.2
 675 to  825             38                        14.1
 825 to  975             80                        29.6
 975 to 1125             83                        30.7
1125 to 1275             39                        14.5
1275 to 1425             17                         6.3
1425 to 1575              2                         0.7
1575 to 1725              2                         0.7
1725 to 1875              0                         0.0
1875 to 2025              1                         0.4
Total                   270                       100.0
Number of observations = 270

(c) Cumulative Frequency and (d) Cumulative Relative Frequency (expressed in percentages)

Transverse       (c) Number of Bricks       (d) Percentage of Bricks
Strength, psi    Having Strength less       Having Strength less
                 than Given Values          than Given Values
     375                  1                         0.4
     525                  2                         0.8
     675                  8                         3.0
     825                 46                        17.1
     975                126                        46.7
    1125                209                        77.4
    1275                248                        91.9
    1425                265                        98.2
    1575                267                        98.9
    1725                269                        99.6
    1875                269                        99.6
    2025                270                       100.0
Number of observations = 270

NOTE: "Number of observations" should be recorded with tables of relative frequencies.

FIG. 4. Illustrating increased irregularity with larger number of cells, or bins. (Left panel: using 12 cells, Table 3(a). Right panel: using 19 cells. Horizontal axes: 500 to 2000.)

12. Tabular Presentation

Methods of presenting tabular frequency distributions are shown in Table 4. To make a frequency tabulation more understandable, relative frequencies may be listed as well as actual frequencies. If only relative frequencies are given, the table cannot be regarded as complete unless the total number of observations is recorded.

Confusion often arises from failure to record bin boundaries correctly. Of the four methods, A to D, illustrated for strength measurements made to the nearest 10 lb, only Methods A and B are recommended (Table 5). Method C gives no clue as to how observed values of 2100, 2200, etc., which fell exactly at bin boundaries, were classified. If such values were consistently placed in the next higher bin, the real bin boundaries are those of Method A. Method D is liable to misinterpretation since strengths were measured to the nearest 10 lb only.
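The bin-construction rules of Section 11 can be sketched in code. This is a sketch under stated assumptions: the function names are hypothetical, and `log` in the rule for NL is taken as the natural logarithm because that choice reproduces the 12 bins of Table 3(a). The brick data enter through their whole-number extremes (27 and 201, in units of 10 psi), and the counting step is shown on the ten breaking strengths of Table 1(c).

```python
import math


def construct_bins(n, smallest, largest):
    """Apply the Section 11 rules; inputs are in whole-number units."""
    rg = largest - smallest                    # RG = LW - SW
    nl = math.ceil(2.1 * math.log(n))          # NL, natural log (assumption)
    li = math.ceil(rg / nl)                    # LI, bin interval
    sa = math.ceil((nl * li - rg) / 2)         # SA, stretch adjustment
    start = smallest - sa - 0.5                # START boundary
    boundaries = [start + k * li for k in range(nl + 1)]
    midpoints = [(lo + hi) / 2 for lo, hi in zip(boundaries, boundaries[1:])]
    return boundaries, midpoints


def count_in_bins(whole_numbers, boundaries):
    """Count the whole numbers falling between successive boundaries."""
    counts = [0] * (len(boundaries) - 1)
    for value in whole_numbers:
        for k in range(len(counts)):
            if boundaries[k] < value < boundaries[k + 1]:
                counts[k] += 1
                break
    return counts


# Brick data of Table 1(a): n = 270, whole numbers run from 27 to 201.
brick_boundaries, brick_midpoints = construct_bins(270, 27, 201)
print(brick_boundaries)

# Wire data of Table 1(c): n = 10, already whole numbers.
wire_lb = [578, 572, 570, 568, 572, 570, 570, 572, 576, 584]
wire_boundaries, wire_midpoints = construct_bins(len(wire_lb), min(wire_lb), max(wire_lb))
wire_counts = count_in_bins(wire_lb, wire_boundaries)
print(wire_midpoints, wire_counts)
```

Because every boundary ends in .5 and every observation is a whole number, no observation can fall exactly on a boundary, which is the point of the START = SW - SA - 0.5 rule. For the brick data the boundaries start at 23.5 and step by 15, that is, 235 psi in steps of 150 psi.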
FIG. 5. Method of classifying observations. Data of Table 1(a). (Tally chart: one mark per observation, grouped in fives, entered against each bin of transverse strength from 225 to 375 psi up to 1875 to 2025 psi; the resulting frequencies are 1, 1, 6, 38, 80, 83, 39, 17, 2, 2, 0, 1, total 270.)

TABLE 5. Methods A through D illustrated for strength measurements to the nearest 10 lb.

RECOMMENDED

METHOD A                            METHOD B
Strength, lb    Number of           Strength, lb    Number of
                Observations                        Observations
1995 to 2095         1              2000 to 2090         1
2095 to 2195         3              2100 to 2190         3
2195 to 2295        17              2200 to 2290        17
2295 to 2395        36              2300 to 2390        36
2395 to 2495        82              2400 to 2490        82
etc.              etc.              etc.              etc.

NOT RECOMMENDED

METHOD C                            METHOD D
Strength, lb    Number of           Strength, lb    Number of
                Observations                        Observations
2000 to 2100         1              2000 to 2099         1
2100 to 2200         3              2100 to 2199         3
2200 to 2300        17              2200 to 2299        17
2300 to 2400        36              2300 to 2399        36
2400 to 2500        82              2400 to 2499        82
etc.              etc.              etc.              etc.

FIG. 6. Graphical presentations of a frequency distribution. Data of Table 1(a) as grouped in Table 3(a). (Panels, top to bottom: frequency bar chart, bars centered on cell midpoints; alternate form of frequency bar chart, line erected at cell midpoints; frequency polygon; frequency histogram, columns erected on cells. Horizontal axis: Transverse Strength, psi, 500 to 2000, with cell boundaries and cell midpoints marked. Vertical axes: frequency and relative frequency in percent.)

13. Graphical Presentation

Using a convenient horizontal scale for values of the variable and a vertical scale for bin frequencies, frequency distributions may be reproduced graphically in several ways as shown in Fig. 6.
The frequency bar chart is obtained by erecting a series of bars, centered on the bin midpoints, with each bar having a height equal to the bin frequency. An alternate form of frequency bar chart may be constructed by using lines rather than bars. The distribution may also be shown by a series of points or circles representing bin frequencies plotted at bin midpoints. The frequency polygon is obtained by joining these points by straight lines. Each endpoint is joined to the base at the next bin midpoint to close the polygon.

Another form of graphical representation of a frequency distribution is obtained by placing along the graduated horizontal scale a series of vertical columns, each having a width equal to the bin width and a height equal to the bin frequency. Such a graph, shown at the bottom of Fig. 6, is called the frequency histogram of the distribution. In the histogram, if bin widths are arbitrarily given the value 1, the area enclosed by the steps represents frequency exactly, and the sides of the columns designate bin boundaries.

The same charts can be used to show relative frequencies by substituting a relative frequency scale, such as that shown in Fig. 6. It is often advantageous to show both a frequency scale and a relative frequency scale. If only a relative frequency scale is given on a chart, the number of observations should be recorded as well.

14. Cumulative Frequency Distribution

Two methods of constructing cumulative frequency polygons are shown in Fig. 7. Points are plotted at bin boundaries. The upper chart gives cumulative frequency and relative cumulative frequency plotted on an arithmetic scale. This type of graph is often called an ogive or "s" graph. Its use is discouraged mainly because it is usually difficult to interpret the tail regions.

The lower chart shows a preferable method, in which the relative cumulative frequencies are plotted on a normal probability scale. A Normal distribution (see Fig. 14) will plot cumulatively as a straight line on this scale. Such graphs can be drawn to show the number of observations either "less than" or "greater than" the scale values. (Graph paper with one dimension graduated in terms of the summation of Normal law distribution has been described in Refs. 3 and 18.) The cumulative percents must be adjusted so that none of them equals or exceeds 100%, because the probability scale reaches only 99.9% on most available probability plotting papers. Two adjustments that work for estimating cumulative percents are [cumulative frequency/(n + 1)] and [(cumulative frequency − 0.5)/n].

Fig. 7—Graphical presentations of a cumulative frequency distribution. Data of Table 4: (a) using arithmetic scale for frequency, and (b) using probability scale for relative frequency. Horizontal axis: transverse strength, psi.

For some purposes, the number of observations having a value "less than" or "greater than" particular scale values is of more importance than the frequencies for particular bins. A table of such frequencies is termed a cumulative frequency distribution. The "less than" cumulative frequency distribution is formed by recording the frequency of the first bin, then the sum of the first and second bin frequencies, then the sum of the first, second, and third bin frequencies, and so on.

Because of the tendency for the grouped distribution to become irregular when the number of bins increases, it is sometimes preferable to calculate percentiles from the cumulative frequency distribution rather than from the order statistics. This is recommended as n passes the hundreds and reaches the thousands of observations.
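As a rough illustration, the "less than" cumulative distribution, the two adjusted plotting positions, and a percentile read from the cumulative polygon by linear interpolation might be sketched as follows. The sketch assumes that the bin frequencies printed in Fig. 6 pair with bin boundaries running from 235 to 2035 psi in steps of 150; the names `boundaries`, `frequencies`, and `percentile` are illustrative only and are not part of the Manual.

```python
from itertools import accumulate

# Assumed inputs: bin boundaries 235, 385, ..., 2035 psi and the bin
# frequencies printed in Fig. 6 (pairing the two is an assumption here).
boundaries = [235 + 150 * i for i in range(13)]
frequencies = [1, 1, 6, 38, 80, 83, 39, 17, 2, 2, 0, 1]
n = sum(frequencies)

# "Less than" cumulative frequency at each upper bin boundary.
cumulative = list(accumulate(frequencies))

# Two adjusted plotting positions; both stay strictly below 100 percent.
position_a = [c / (n + 1) for c in cumulative]
position_b = [(c - 0.5) / n for c in cumulative]


def percentile(p):
    """Interpolate linearly in the cumulative polygon for a fraction 0 < p < 1."""
    target = p * n
    counts = [0] + cumulative  # cumulative count at each boundary
    for i in range(1, len(boundaries)):
        if counts[i - 1] < target <= counts[i]:
            fraction = (target - counts[i - 1]) / (counts[i] - counts[i - 1])
            width = boundaries[i] - boundaries[i - 1]
            return boundaries[i - 1] + fraction * width
    raise ValueError("p must lie strictly between 0 and 1")


print(round(percentile(0.025), 2), round(percentile(0.975), 2))
```

With exact counts the two percentiles come out near 653.75 psi and 1419.56 psi; working from relative frequencies rounded to four decimals shifts the first figure slightly. Empty bins are skipped by the strict inequality, so the interpolation never divides by zero.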
The method of calculation can easily be illustrated geometrically by using Table 4(d), Cumulative Relative Frequency, and the problem of getting the 2.5th and 97.5th percentiles.

We first define the cumulative relative frequency function, F(x), from the bin boundaries and the cumulative relative frequencies. It is just a sequence of straight lines connecting the points (X = 235, F(235) = 0.000), (X = 385, F(385) = 0.0037), (X = 535, F(535) = 0.0074), and so on up to (X = 2035, F(2035) = 1.000). The function can be seen in Fig. 7(a), which uses an arithmetic scale for percent. A horizontal line at height 0.025 will cut the curve between X = 535 and X = 685, where the curve rises from 0.0074 to 0.0296. The full vertical distance is 0.0296 − 0.0074 = 0.0222, and the portion lacking is 0.0250 − 0.0074 = 0.0176, so this cut will occur at (0.0176/0.0222) × 150 + 535 = 653.9 psi. The horizontal at 97.5% cuts the curve at 1419.5 psi.

15. Stem and Leaf Diagram

It is sometimes quick and convenient to construct a "stem and leaf" diagram, which has the appearance of a histogram turned on its side. This kind of diagram does not require choosing explicit bin widths or boundaries.

The first step is to reduce the data to two or three-digit numbers by: (1) dropping constant initial or final digits, like the final zeros in Table 1(a) or the initial ones in Table 1(b); (2) removing the decimal points; and finally, (3) rounding the results after (1) and (2), to two or three-digit numbers we can call coded observations. For instance, if the initial ones and the decimal points in the data of Table 1(b) are dropped, the coded observations run from 323 to 767, spanning 445 successive integers.

If forty successive integers per class interval are chosen for the coded observations in this example, there would be 12 intervals (445/40, rounded up); if thirty successive integers, then 15 intervals; and if twenty successive integers, then 23 intervals. The choice of 12 or 23 intervals is outside of the recommended interval from 13 to 20. While either of these might nevertheless be chosen for convenience, the flexibility of the stem and leaf procedure is best shown by choosing thirty successive integers per interval, perhaps the least convenient choice of the three possibilities.

Each of the resulting 15 class intervals for the coded observations is distinguished by a first digit and a second. The third digits of the coded observations do not indicate to which intervals they belong and are therefore not needed to construct a stem and leaf diagram in this case. But the first digit may change (by one) within a single class interval. For instance, the first class interval with coded observations beginning with 32, 33, or 34 may be identified by 3(234) and the second class interval by 3(567), but the third class interval includes coded observations with leading digits 38, 39, and 40. This interval may be identified by 3(89)4(0). The intervals, identified in this manner, are listed in the left column of Fig. 8. Each coded observation is set down in turn to the right of its class interval identifier in the diagram using as a symbol its second digit, in the order (from left to right) in which the original observations occur in Table 1(b).

In spite of the complication of changing some first digits within some class intervals, this stem and leaf diagram is quite simple to construct. In this particular case, the diagram reveals "wings" at both ends of the diagram.

First (and
second) Digit:    Second Digits Only
3(234)            32233
3(567)            7775
3(89)4(0)         898
4(123)            22332
4(456)            66554546
4(789)            798787797977
5(012)            210100
5(345)            53333455534335
5(678)            677776866776
5(9)6(01)         000090010
6(234)            23242342334
6(567)            67
6(89)7(0)         09
7(123)            31
7(456)            6565

FIG. 8—Stem and leaf diagram of data from Table 1(b) with groups based on triplets of first and second decimal digits.
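The construction just described can be sketched in a few lines. This is only an illustration under stated assumptions: the coded observations are three-digit integers, the first interval starts at 320, each interval spans thirty successive integers, and the sample values below are hypothetical (they are not the data of Table 1(b)). The function names `stem_label` and `stem_and_leaf` are likewise invented for this sketch.

```python
def stem_label(start, width=30):
    """Label a class interval of coded values, e.g. start=380 gives '3(89)4(0)'."""
    groups = {}
    for lead in range(start // 10, (start + width) // 10):
        first, second = divmod(lead, 10)
        groups.setdefault(first, []).append(str(second))
    return "".join(f"{first}({''.join(seconds)})" for first, seconds in groups.items())


def stem_and_leaf(coded, low=320, width=30, ordered=False):
    """Return one display line per class interval; each leaf is the second digit."""
    rows = {}
    for value in coded:
        start = low + (value - low) // width * width
        rows.setdefault(start, []).append(value)
    lines = []
    for start in range(low, max(coded) + 1, width):
        members = rows.get(start, [])
        if ordered:
            members = sorted(members)  # ascending values within each leaf
        leaves = "".join(str(value // 10 % 10) for value in members)
        lines.append(f"{stem_label(start, width):<10} {leaves}".rstrip())
    return lines


# Hypothetical coded observations, listed in their order of occurrence.
sample = [323, 335, 402, 398, 451, 467, 512, 530, 549, 590, 601, 612]
for line in stem_and_leaf(sample):
    print(line)
```

Passing `ordered=True` sorts the values inside each interval before the leaves are written, which is the only change needed to obtain an ordered diagram. The sketch assumes `low` and `width` are multiples of 10 and that no coded value is smaller than `low`.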
As this example shows, the procedure does not require choosing a precise class interval width or boundary values. At least as important is the protection against plotting and counting errors afforded by using clear, simple numbers in the construction of the diagram, which is in effect a histogram on its side. For further information on stem and leaf diagrams see Refs. 4 and 18.

16. Ordered Stem and Leaf Diagram and Box Plot

The stem and leaf diagram can be extended to one that is ordered. The ordering pertains to the ascending sequence of values within each "leaf." The purpose of ordering the leaves is to make the determination of the quartiles an easier task. The quartiles represent the 25th, 50th (median), and 75th percentiles of the frequency distribution. They are found by the method discussed in Section 6. In Fig. 8a, the quartiles for the data are bold and underlined. The quartiles are used to construct another graphic called a box plot.

The "box" is formed by the 25th and 75th percentiles, the center of the data is dictated by the 50th percentile (median), and "whiskers" are formed by extending a line from either side of the box to the minimum, X(1), point and to the maximum, X(n), point. Fig. 8b shows the box plot for the data from Table 1(b). For further information on box plots, see Ref. 18.

First (and
second) Digit:    Second Digits Only
3(234)            22333
3(567)            5777
3(89)4(0)         889
4(123)            22233
4(456)            44555666
4(789)            777777788999
5(012)            000112
5(345)            33333334455555
5(678)            666667777778
5(9)6(01)         90000001
6(234)            22223333444
6(567)            67
6(89)7(0)         90
7(123)            13
7(456)            5566

FIG. 8a—Ordered stem and leaf diagram of data from Table 1(b) with groups based on triplets of first and second decimal digits. The 25th, 50th, and 75th percentiles (the quartiles) are shown in bold type and are underlined in the printed figure.

FIG. 8b—Box plot of data from Table 1(b). Values marked: minimum 1.323, 25th percentile 1.4678, median 1.540, 75th percentile 1.6030, maximum 1.767.

The information contained in the data may also be summarized by presenting a tabular grouped frequency distribution, if the number of observations is large.
A graphical presentation of a distribution makes it possible to visualize the nature and extent of the observed variation.

While some condensation is effected by presenting grouped frequency distributions, further reduction is necessary for most of the uses that are made of ASTM data. This need can be fulfilled by means of a few simple functions of the observed distribution, notably, the average and the standard deviation.

FUNCTIONS OF A FREQUENCY DISTRIBUTION

17. Introduction

In the problem of condensing and summarizing the information contained in the frequency distribution of a sample of observations, certain functions of the distribution are useful. For some purposes, a statement of the relative frequency within stated limits is all that is needed. For most purposes, however, two salient characteristics of the distribution, which are illustrated in Fig. 9a, are: (a) the position on the scale of measurement, that is, the value about which the observations have a tendency to center, and (b) the spread or dispersion of the observations about the central value.

A third characteristic of some interest, but of less importance, is the skewness or lack of symmetry, that is, the extent to which the observations group themselves more on one side of the central value than on the other (see Fig. 9b).

A fourth characteristic is "kurtosis," which relates to the tendency for a distribution to have a sharp peak in the middle and excessive frequencies on the tails as compared with the Normal distribution or, conversely, to be relatively flat in the middle with little or no tails (see Fig. 10).

FIG. 9a—Illustrating two salient characteristics of distributions—position and spread. Panels: different positions, same spread; same position, different spreads; different positions, different spreads. Horizontal axis: scale of measurement.

FIG. 9b—Illustrating a third characteristic of frequency distributions—skewness, and particular values of skewness, $g_1$. Panels: negative skewness ($g_1 = -1.00$), symmetric ($g_1 = 0$), positive skewness ($g_1 = +1.00$). Horizontal axis: scale of measurement.

FIG. 10—Illustrating the kurtosis of a frequency distribution and particular values of $g_2$. Panels: leptokurtic, mesokurtic ($g_2 = 0$), platykurtic.

Several representative sample measures are available for describing these characteristics, but by far the most useful are the arithmetic mean $\bar{X}$, the standard deviation $s$, the skewness factor $g_1$, and the kurtosis factor $g_2$, all of which are algebraic functions of the observed values. Once the numerical values of these particular measures have been determined, the original data may usually be dispensed with and two or more of these values presented instead.

The four characteristics of the distribution of a sample of observations just discussed are most useful when the observations form a single heap with a single peak frequency not located at either extreme of the sample values. If there is more than one peak, a tabular or graphical representation of the frequency distribution conveys information the above four characteristics do not.

18. Relative Frequency

The relative frequency p within stated limits on the scale of measurement is the ratio of the number of observations lying within those limits to the total number of observations.

In practical work, this function has its greatest usefulness as a measure of fraction nonconforming, in which case it is the fraction, p, representing the ratio of the number of observations lying outside specified limits (or beyond a specified limit) to the total number of observations.

19. Average (Arithmetic Mean)

The average (arithmetic mean) is the most widely used measure of central tendency. The term average and the symbol $\bar{X}$ will be used in this Manual to represent the arithmetic mean of a sample of numbers.
The average, $\bar{X}$, of a sample of n numbers, $X_1, X_2, \ldots, X_n$, is the sum of the numbers divided by n, that is

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \qquad (1)$$

where the expression $\sum_{i=1}^{n} X_i$ means "the sum of all values of X, from $X_1$ to $X_n$, inclusive."

Considering the n values of X as specifying the positions on a straight line of n particles of equal weight, the average corresponds to the center of gravity of the system. The average of a series of observations is expressed in the same units of measurement as the observations, that is, if the observations are in pounds, the average is in pounds.

20. Other Measures of Central Tendency

The geometric mean of a sample of n numbers, $X_1, X_2, \ldots, X_n$, is the nth root of their product, that is

$$\text{geometric mean} = \sqrt[n]{X_1 X_2 \cdots X_n} \qquad (2)$$

or

$$\log(\text{geometric mean}) = \frac{\log X_1 + \log X_2 + \cdots + \log X_n}{n} \qquad (3)$$

Equation 3, obtained by taking logarithms of both sides of Eq 2, provides a convenient method for computing the geometric mean using the logarithms of the numbers.

NOTE
The distribution of some quality characteristics is such that a transformation, using logarithms of the observed values, gives a substantially Normal distribution. When this is true, the transformation is distinctly advantageous, for (in accordance with Section 29) much of the total information can be presented by two functions, the average, $\bar{X}$, and the standard deviation, $s$, of the logarithms of the observed values. The problem of transformation is, however, a complex one that is beyond the scope of this Manual. See Ref. 18.

The median of the frequency distribution of n numbers is the middlemost value.

The mode of the frequency distribution of n numbers is the value that occurs most frequently. With grouped data, the mode may vary due to the choice of the interval size and the starting points of the bins.

21. Standard Deviation

The standard deviation is the most widely used measure of dispersion for the problems considered in PART 1 of the Manual.

For a sample of n numbers, $X_1, X_2, \ldots, X_n$, the sample standard deviation is commonly defined by the formula

$$s = \sqrt{\frac{(X_1-\bar{X})^2 + (X_2-\bar{X})^2 + \cdots + (X_n-\bar{X})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n}(X_i-\bar{X})^2}{n-1}} \qquad (4)$$

where $\bar{X}$ is defined by Eq 1. The quantity $s^2$ is called the sample variance.

The standard deviation of any series of observations is expressed in the same units of measurement as the observations, that is, if the observations are in pounds, the standard deviation is in pounds. (Variances would be measured in pounds squared.)

A frequently more convenient formula for the computation of s is

$$s = \sqrt{\frac{\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2 / n}{n-1}} \qquad (5)$$

but care must be taken to avoid excessive rounding error when $\bar{X}$ is large relative to $s$, since the two terms in the numerator are then nearly equal.

NOTE
A useful quantity related to the standard deviation is the root-mean-square deviation

$$s_{(\mathrm{rms})} = \sqrt{\frac{\sum_{i=1}^{n}(X_i-\bar{X})^2}{n}} = s\sqrt{\frac{n-1}{n}}$$

22. Other Measures of Dispersion

The coefficient of variation, cv, of a sample of n numbers, is the ratio (sometimes the coefficient is expressed as a percentage) of their standard deviation, $s$, to their average $\bar{X}$. It is given by

$$cv = \frac{s}{\bar{X}} \qquad (6)$$

The coefficient of variation is an adaptation of the standard deviation, which was developed by Prof. Karl Pearson to express the variability of a set of numbers on a relative scale rather than on an absolute scale. It is thus a dimensionless number. Sometimes it is called the relative standard deviation, or relative error.

The average deviation of a sample of n numbers, $X_1, X_2, \ldots, X_n$, is the average of the absolute values of the deviations of the numbers from their average $\bar{X}$, that is

$$\text{average deviation} = \frac{\sum_{i=1}^{n} \left|X_i-\bar{X}\right|}{n} \qquad (7)$$

where the symbol $|\;|$ denotes the absolute value of the quantity enclosed.

The range R of a sample of n numbers is the difference between the largest number and the smallest number of the sample. One computes R from the order statistics as $R = X_{(n)} - X_{(1)}$. This is the simplest measure of dispersion of a sample of observations.

23. Skewness—$g_1$

A useful measure of the lopsidedness of a sample frequency distribution is the coefficient of skewness $g_1$.

The coefficient of skewness $g_1$ of a sample of n numbers, $X_1, X_2, \ldots, X_n$, is defined by the expression $g_1 = k_3/s^3$, where $k_3$ is the third k-statistic as defined by R. A. Fisher. The k-statistics were devised to serve as the moments of small sample data. The first moment is the mean, the second is the variance, and the third is the average of the cubed deviations, and so on. Thus, $k_1 = \bar{X}$, $k_2 = s^2$, and

$$k_3 = \frac{n}{(n-1)(n-2)}\sum_{i=1}^{n}(X_i-\bar{X})^3$$

Notice that when n is large

$$g_1 \approx \frac{\sum_{i=1}^{n}(X_i-\bar{X})^3}{n s^3} \qquad (8)$$

This measure of skewness is a pure number and may be either positive or negative. For a symmetrical distribution, $g_1$ is zero. In general, for a nonsymmetrical distribution, $g_1$ is negative if the long tail of the distribution extends to the left, towards smaller values on the scale of measurement, and is positive if the long tail extends to the right, towards larger values on the scale of measurement. Figure 9b shows three unimodal distributions with different values of $g_1$.

23a. Kurtosis—$g_2$

The peakedness and tail excess of a sample frequency distribution is generally measured by the coefficient of kurtosis $g_2$.

The coefficient of kurtosis $g_2$ for a sample of n numbers, $X_1, X_2, \ldots, X_n$, is defined by the expression $g_2 = k_4/s^4$ and

$$k_4 = \frac{n(n+1)}{(n-1)(n-2)(n-3)}\sum_{i=1}^{n}(X_i-\bar{X})^4 - \frac{3(n-1)^2}{(n-2)(n-3)}\,s^4$$

Notice that when n is large

$$g_2 \approx \frac{\sum_{i=1}^{n}(X_i-\bar{X})^4}{n s^4} - 3 \qquad (9)$$

Again this is a dimensionless number and may be either positive or negative. Generally, when a distribution has a sharp peak, thin shoulders, and heavy tails relative to the bell-shaped distribution characterized by the Normal distribution, $g_2$ is positive. When a distribution is flat-topped with thin tails, relative to the Normal distribution, $g_2$ is negative. Inverse relationships do not necessarily follow.
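A minimal sketch of these measures, assuming the k-statistic forms of $k_3$ and $k_4$ with divisors (n − 1)(n − 2) and (n − 1)(n − 2)(n − 3), is given below. The function name `summary_statistics` is hypothetical, and the four-value sample 0, 4, 0, 0 is an artificial one chosen so that every result is a whole number.

```python
import math


def summary_statistics(values):
    """Return the average, s, cv, g1 and g2 of a sample (needs n >= 4 and s > 0)."""
    n = len(values)
    if n < 4:
        raise ValueError("g2 needs at least four observations")
    mean = sum(values) / n
    deviations = [x - mean for x in values]
    s2 = sum(d ** 2 for d in deviations) / (n - 1)  # k2, the sample variance
    s = math.sqrt(s2)
    # Third and fourth k-statistics (assumed Fisher forms).
    k3 = n * sum(d ** 3 for d in deviations) / ((n - 1) * (n - 2))
    k4 = (n * (n + 1) * sum(d ** 4 for d in deviations)
          / ((n - 1) * (n - 2) * (n - 3))
          - 3 * (n - 1) ** 2 * s2 ** 2 / ((n - 2) * (n - 3)))
    return {
        "mean": mean,
        "s": s,
        "cv": s / mean,          # undefined when the average is zero
        "g1": k3 / s ** 3,
        "g2": k4 / s2 ** 2,
    }


result = summary_statistics([0, 4, 0, 0])
print(result)
```

For this sample the average is 1, $s$ is 2, $g_1$ is 2, and $g_2$ is 4. Deviations are taken from the average before squaring, which avoids the rounding problem that the shortcut formula for $s$ can run into.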
We cannot definitely infer anything about the shape of a distribution from knowledge of $g_2$ unless we are willing to assume some theoretical curve, say a Pearson curve, as being appropriate as a graduation formula (see Fig. 14 and Section 30). A distribution with a positive $g_2$ is said to be leptokurtic. One with a negative $g_2$ is said to be platykurtic. A distribution with $g_2 = 0$ is said to be mesokurtic. Figure 10 gives three unimodal distributions with different values of $g_2$.

24. Computational Tutorial

The method of computation can best be illustrated with an artificial example for n = 4 with $X_1 = 0$, $X_2 = 4$, $X_3 = 0$, and $X_4 = 0$. Please first verify that $\bar{X} = 1$. The deviations from this mean are found as −1, 3, −1, and −1. The sum of the squared deviations is thus 12 and $s^2 = 4$. The sum of cubed deviations is −1 + 27 − 1 − 1 = 24, and thus $k_3 = 16$. Now we find $g_1 = 16/8 = 2$. Please verify that $g_2 = 4$. Since both $g_1$ and $g_2$ are positive, we can say that the distribution is both skewed to the right and leptokurtic relative to the Normal distribution.

Of the many measures that are available for describing the salient characteristics of a sample frequency distribution, the average $\bar{X}$, the standard deviation $s$, the skewness $g_1$, and the kurtosis $g_2$ are particularly useful for summarizing the information contained therein. So long as one uses them only as rough indications of uncertainty, we list approximate sampling standard deviations of the quantities $\bar{X}$, $s^2$, $g_1$, and $g_2$, as

$$SE(\bar{X}) = s/\sqrt{n}, \quad SE(s^2) = s^2\sqrt{2/(n-1)}, \quad SE(g_1) = \sqrt{6/n}, \quad SE(g_2) = \sqrt{24/n},$$

respectively.

When using a computer software calculation, the ungrouped whole number distribution values will lead to less round off in the printed output and are simple to scale back to original units. The results for the data from Table 2 are given in Table 6.

Table 6. Summary Statistics for Three Sets of Data

Datasets                       $\bar{X}$    $s$       $g_1$    $g_2$
Transverse Strength, psi       999.8        201.8     0.611    2.567
Weight of Coating, oz/ft^2     1.535        0.1038    0.013    0.291
Breaking Strength, lb          573.2        4.826     1.419    1.797

AMOUNT OF INFORMATION CONTAINED IN $p$, $\bar{X}$, $s$, $g_1$, AND $g_2$

25. Summarizing the Information

Given a sample of n observations, $X_1, X_2, X_3, \ldots, X_n$, of some quality characteristic, how can we present concisely information by means of which the observed distribution can be closely approximated, that is, so that the percentage of the total number, n, of observations lying within any stated interval from, say, X = a to X = b, can be approximated?

The total information can be presented only by giving all of the observed values. It will be shown, however, that much of the total information is contained in a few simple functions, notably the average $\bar{X}$, the standard deviation $s$, the skewness $g_1$, and the kurtosis $g_2$.

26. Several Values of Relative Frequency, p

By presenting, say, 10 to 20 values of relative frequency p, corresponding to stated bin intervals and also the number n of observations, it is possible to give practically all of the total information in the form of a tabular grouped frequency distribution. If the ungrouped distribution has any peculiarities, however, the choice of bins may have an important bearing on the amount of information lost by grouping.

NOTE
For the purposes of PART 1 of this Manual, the curves of Figs. 11 and 12 may be taken to represent frequency histograms with small bin widths and based on large samples. In a frequency histogram, such as that shown at the bottom of Fig. 6, let the percentage relative frequency between any two bin boundaries be represented by the area of the histogram between those boundaries, the total area being 100 percent. Since the bins are of uniform width, the relative frequency in any bin is then proportional to the height of that bin and may be read on the vertical scale to the right.
If the sample size is increased and the bin width reduced, a histogram in which the relative frequency is measured by area approaches as a limit the frequency distribution of the population, which in many cases can be represented by a smooth curve. The relative frequency between any two values is then represented by the area under the curve and between ordinates erected at those values. Because of the method of generation, the ordinate of the curve may be regarded as a curve of relative frequency density. This is analogous to the representation of the variation of density along a rod of uniform cross section by a smooth curve. The weight between any two points along the rod is proportional to the area under the curve between the two ordinates and we may speak of the density (that is, weight density) at any point but not of the weight at any point.

27. Single Percentile of Relative Frequency, p

If we present but a percentile value, $Q_p$, of relative frequency p, such as the fraction of the total number of observed values falling outside of a specified limit and also the number n of observations, the portion of the total information presented is very small. This follows from the fact that quite dissimilar distributions may have identically the same percentile value as illustrated in Fig. 11.

FIG. 11—Quite different distributions may have the same percentile value of p, fraction of total observations below specified limit. Marked on the figure: specified limit (min.), $Q_p$.

28. Average $\bar{X}$ Only

If we present merely the average, $\bar{X}$, and number, n, of observations, the portion of the total information presented is very small. Quite dissimilar distributions may have identically the same value of $\bar{X}$ as illustrated in Fig. 12.

FIG. 12—Quite different distributions may have the same average. Marked on the figure: average $\bar{X}_1 = \bar{X}_2 = \bar{X}_3$.

In fact, no single one of the five functions, $Q_p$, $\bar{X}$, $s$, $g_1$, or $g_2$, presented alone, is generally capable of giving much of the total information in the original distribution. Only by presenting two or three of these functions can a fairly complete description of the distribution generally be made.

An exception to the above statement occurs when theory and observation suggest that the underlying law of variation is a distribution for which the basic characteristics are all functions of the mean. For example, "life" data "under controlled conditions" sometimes follows a negative exponential distribution. For this, the cumulative relative frequency is given by the equation

$$F(X) = 1 - e^{-X/\theta}, \qquad 0 < X < \infty \qquad (14)$$

This is a single parameter distribution for which the mean and standard deviation both equal $\theta$. That the negative exponential distribution is the underlying law of variation can be checked by noting whether values of 1 − F(X) for the sample data tend to plot as a straight line on ordinary semilogarithmic paper. In such a situation, knowledge of $\bar{X}$ will, by taking $\theta = \bar{X}$ in Eq 14 and using tables of the exponential function, yield a fitting formula from which estimates can be made of the percentage of cases lying between any two specified values of X. Presentation of $\bar{X}$ and n is sufficient in such cases provided they are accompanied by a statement that there are reasons to believe that X has a negative exponential distribution.

29. Average $\bar{X}$ and Standard Deviation $s$

These two functions contain some information even if nothing is known about the form of the observed distribution, and contain much information when certain conditions are satisfied. For example, more than $1 - 1/k^2$ of the total number n of observations lie within the closed interval $\bar{X} \pm ks$ (where k is not less than 1).

This is Chebyshev's inequality and is shown graphically in Fig. 13. The inequality holds true of any set of finite numbers regardless of how they were obtained.
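The bound can be tabulated and compared with an observed sample along the following lines. This is a sketch only: the function names are invented for illustration and the sample is hypothetical, chosen to include one extreme value.

```python
import math


def chebyshev_minimum_percent(k):
    """Lower bound, in percent, on the share of observations within k s of the average."""
    if k < 1:
        raise ValueError("the bound is stated for k not less than 1")
    return 100.0 * (1.0 - 1.0 / k ** 2)


def observed_percent_within(values, k):
    """Percent of the sample lying in the closed interval average +/- k s."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    inside = sum(1 for x in values if abs(x - mean) <= k * s)
    return 100.0 * inside / n


# Hypothetical sample with one extreme value.
sample = [2, 4, 4, 4, 5, 5, 7, 9, 30]
for k in (1.5, 2.0, 2.5, 3.0):
    print(k, round(chebyshev_minimum_percent(k), 1),
          round(observed_percent_within(sample, k), 1))
```

For k = 2, 2.5, and 3 the bound evaluates to 75.0, 84.0, and 88.9 percent. The observed percentage is never below the bound, however skewed the sample, which is what makes the inequality usable when nothing is known about the form of the distribution.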
Thus if $\bar{X}$ and $s$ are presented, we may say at once that more than 75 percent of the numbers lie within the interval $\bar{X} \pm 2s$; stated in another way, less than 25 percent of the numbers differ from $\bar{X}$ by more than 2s. Likewise, more than 88.9 percent lie within the interval $\bar{X} \pm 3s$, etc. Table 7 indicates the conformance with Chebyshev's inequality of the three sets of observations given in Table 1.

FIG. 13—Percentage of the total observations lying within the interval $\bar{X} \pm ks$ always exceeds the percentage given on this chart. Marked values include 75.00, 88.89, and 93.75 percent (k = 2, 3, and 4).

TABLE 7. Comparison of observed percentages and Chebyshev's minimum percentages of the total observations lying within given intervals.

                        CHEBYSHEV'S MINIMUM       OBSERVED PERCENTAGES(a)
                        OBSERVATIONS LYING        DATA OF       DATA OF       DATA OF
INTERVAL,               WITHIN THE GIVEN          TABLE 1(a)    TABLE 1(b)    TABLE 1(c)
$\bar{X} \pm ks$        INTERVAL $\bar{X} \pm ks$ (n = 270)     (n = 100)     (n = 10)
$\bar{X} \pm 2.0s$      75.0                      96.7          94            90
$\bar{X} \pm 2.5s$      84.0                      97.8          100           90
$\bar{X} \pm 3.0s$      88.9                      98.5          100           100

(a) Data of Table 1(a): $\bar{X}$ = 1000, s = 202; data of Table 1(b): $\bar{X}$ = 1.535, s = 0.105; data of Table 1(c): $\bar{X}$ = 573.2, s = 4.58.

To determine approximately just what percentages of the total number of observations lie within given limits, as contrasted with minimum percentages within those limits, requires additional information of a restrictive nature. If we present $\bar{X}$, $s$, and n, and are able to add the information "data obtained under controlled conditions," then it is possible to make such estimates satisfactorily for limits spaced equally above and below $\bar{X}$.

The applicability of the Normal law rests on two converging arguments. One is mathematical and proves that the distribution of a sample mean approaches the Normal law as the sample size increases, no matter what the shape of the distribution of each of the separate observations (provided its variance is finite).
The other is that experience with many, many sets of data shows that more of them approximate the Normal law than any other distribution. In the field of statistics, the mathematical result is known as the central limit theorem.

What is meant technically by "controlled conditions" is discussed by Shewhart (see Ref. 1) and is beyond the scope of this Manual. Among other things, the concept of control includes the idea of homogeneous data, that is, a set of observations resulting from measurements made under the same essential conditions and representing material produced under the same essential conditions. It is sufficient for present purposes to point out that if data are obtained under "controlled conditions," it may be assumed that the observed frequency distribution can, for most practical purposes, be graduated by some theoretical curve, say, by the Normal law or by one of the nonnormal curves belonging to the system of frequency curves developed by Karl Pearson. (For an extended discussion of Pearson curves, see Ref. 5.) Two of these are illustrated in Fig. 14.

The Pearson system of curves was derived by supposing a smooth curve with a gradual approach to the horizontal axis at one or both sides.

The Normal distribution's fit to the set of data may be checked roughly by plotting the cumulative data on Normal probability paper (see Section 14). Sometimes if the original data do not appear to follow the Normal law, some transformation of the data, such as log X, will be approximately normal.

Thus, the phrase "data obtained under controlled conditions" is taken to be the equivalent of the more mathematical assertion that "the functional form of the distribution may be represented by some specific curve." However, conformance of the shape of a frequency distribution with some curve should by no means be taken as a sufficient criterion for control.

FIG. 14—A frequency distribution of observations obtained under controlled conditions will usually have an outline that conforms to the Normal law or a nonnormal Pearson frequency curve. Panels: bell-shaped Normal law; examples of two Pearson nonnormal frequency curves.

FIG. 15—Normal law integral diagram giving percentage of total area under Normal law curve falling within the range $\mu \pm k\sigma$. This diagram is also useful in probability and sampling problems, expressing the upper (percentage) scale values in decimals to represent "probability." Marked percentages include 68.27, 95.45, and 99.73 (k = 1, 2, and 3); marked values of k include 0.6745, 1.282, 1.645, 1.960, 2.326, and 2.576.

Generally for controlled conditions, the percentage of the total observations in the original sample lying within the interval $\bar{X} \pm ks$ may be determined approximately from the chart of Fig. 15, which is based on the Normal law integral. The approximation may be expected to be better the larger the number of observations. Table 8 compares the observed percentages of the total number of observations lying within several symmetrical intervals about $\bar{X}$ with those estimated from a knowledge of $\bar{X}$ and $s$, for the three sets of observations given in Table 1.

TABLE 8. Comparison of observed percentages and theoretical estimated percentages of the total observations lying within given intervals.

                        THEORETICAL ESTIMATED       OBSERVED PERCENTAGES
                        PERCENTAGES(a) OF TOTAL     DATA OF       DATA OF       DATA OF
INTERVAL,               OBSERVATIONS LYING WITHIN   TABLE 1(a)    TABLE 1(b)    TABLE 1(c)
$\bar{X} \pm ks$        THE GIVEN INTERVAL          (n = 270)     (n = 100)     (n = 10)
$\bar{X} \pm 0.6745s$   50.0                        52.2          54            70
$\bar{X} \pm 1.0s$      68.3                        76.3          72            80
$\bar{X} \pm 1.5s$      86.6                        89.3          84            90
$\bar{X} \pm 2.0s$      95.5                        96.7          94            90
$\bar{X} \pm 2.5s$      98.7                        97.8          100           90
$\bar{X} \pm 3.0s$      99.7                        98.5          100           100

(a) Use Fig. 15 with $\bar{X}$ and $s$ as estimates of $\mu$ and $\sigma$.

30. Average $\bar{X}$, Standard Deviation $s$, Skewness $g_1$, and Kurtosis $g_2$

If the data are obtained under "controlled conditions" and if a Pearson curve is assumed appropriate as a graduation formula, the presentation of $g_1$ and $g_2$ in addition to $\bar{X}$ and $s$ will contribute further information. They will give no immediate help in determining the percentage of the total observations lying within a symmetric interval about the average $\bar{X}$, that is, in the interval $\bar{X} \pm ks$. What they do is to help in estimating observed percentages (in a sample already taken) in an interval whose limits are not equally spaced above and below $\bar{X}$.

If a Pearson curve is used as a graduation formula, some of the information given by $g_1$ and $g_2$ may be obtained from Table 9, which is taken from Table 42 of the Biometrika Tables for Statisticians. For $\beta_1 = g_1^2$ and $\beta_2 = g_2 + 3$, this table gives values of $k_L$ for use in estimating the lower 2.5 percent of the data and values of $k_U$ for use in estimating the upper 2.5 percent point. More specifically, it may be estimated that 2.5 percent of the cases are less than $\bar{X} - k_L s$ and 2.5 percent are greater than $\bar{X} + k_U s$. Put another way, it may be estimated that 95 percent of the cases are between $\bar{X} - k_L s$ and $\bar{X} + k_U s$. Table 42 of the Biometrika Tables for Statisticians also gives values of $k_L$ and $k_U$ for 0.5, 1.0, and 5.0 percent points.

Example
For a sample of 270 observations of the transverse strength of bricks, the sample distribution is shown in Fig. 5. From the sample values of $g_1 = 0.61$ and $g_2 = 2.57$, we take $\beta_1 = g_1^2 = (0.61)^2 = 0.37$ and $\beta_2 = g_2 + 3 = 2.57 + 3 = 5.57$. Thus, from Tables 9(a) and (b) we may estimate that approximately 95 percent of the 270 cases lie between $\bar{X} - k_L s$ and $\bar{X} + k_U s$, or between 1000 − 1.801 (201.8) = 636.6 and 1000 + 2.17 (201.8) = 1437.7. The actual percentage of the 270 cases in this range is 96.3 percent (see Table 2(a)).

Notice that using just $\bar{X} \pm 1.96s$ gives the interval 604.3 to 1395.3, which actually includes 95.9% of the cases versus a theoretical percentage of 95%.
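The Normal law percentages and the interval arithmetic of the example can be checked with a short sketch. It assumes that the chart of Fig. 15 corresponds to the error function identity P(|Z| ≤ k) = erf(k/√2); the function names are illustrative, and the symmetric interval is evaluated with the average 999.8 listed in Table 6.

```python
import math


def normal_percent_within(k):
    """Percent of a Normal distribution lying within k standard deviations of its mean."""
    return 100.0 * math.erf(k / math.sqrt(2.0))


def interval(mean, s, k_lower, k_upper):
    """Limits mean - k_lower * s and mean + k_upper * s."""
    return mean - k_lower * s, mean + k_upper * s


for k in (0.6745, 1.0, 1.5, 2.0, 2.5, 3.0):
    print(f"k = {k:<6} {normal_percent_within(k):6.2f} percent")

# Pearson-curve limits from the example, with k_L = 1.801 and k_U = 2.17 as printed.
pearson_limits = interval(1000.0, 201.8, 1.801, 2.17)
# Symmetric Normal-law limits using the average from Table 6.
symmetric_limits = interval(999.8, 201.8, 1.96, 1.96)
print(pearson_limits, symmetric_limits)
```

With $k_U$ = 2.17 as printed, the upper Pearson limit evaluates to about 1437.9 rather than 1437.7, which suggests the tabled multiplier carried one more digit than is shown. Similarly, the function returns about 98.8 percent for k = 2.5, where Table 8 lists 98.7, presumably a value read from the chart.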
The reason we prefer the Pearson curve interval arises from knowing that the g1 = 0.61 value has a standard error of 0.15 (= √(6/270)) and is thus about four standard errors above zero. That is, if future data come from the same conditions it is highly probable that they will also be skewed. The 604.3 to 1395.3 interval is symmetric about the mean, while the 636.6 to 1437.7 interval is offset in line with the anticipated skewness. Recall that the interval based on the order statistics was 657.8 to 1400 and that from the cumulative frequency distribution was 653.9 to 1419.5.

When computing the median, all methods will give essentially the same result, but we need to choose among the methods when estimating a percentile near the extremes of the distribution.

TABLE 9. Lower and upper 2.5 percent points kL and kU of the standardized deviate (X − μ)/σ, given by Pearson frequency curves for designated values of β1 (estimated as equal to g1²) and β2 (estimated as equal to g2 + 3).
(a) Lower kL (see footnote a)

β2 \ β1  0.00  0.01  0.03  0.05  0.10  0.15  0.20  0.30  0.40  0.50  0.60  0.70  0.80  0.90  1.00
  1.8    1.65
  2.0    1.76  1.68  1.62  1.56
  2.2    1.83  1.76  1.71  1.66  1.57  1.49  1.41
  2.4    1.88  1.82  1.77  1.73  1.65  1.58  1.51  1.39
  2.6    1.92  1.86  1.82  1.78  1.71  1.64  1.58  1.47  1.37
  2.8    1.94  1.89  1.85  1.82  1.76  1.70  1.65  1.55  1.45  1.35
  3.0    1.96  1.91  1.87  1.84  1.79  1.74  1.69  1.60  1.52  1.42  1.33
  3.2    1.97  1.93  1.89  1.86  1.81  1.77  1.72  1.65  1.57  1.49  1.40  1.32  1.24
  3.4    1.98  1.94  1.90  1.88  1.83  1.79  1.75  1.68  1.61  1.54  1.46  1.39  1.31  1.23
  3.6    1.99  1.95  1.91  1.89  1.85  1.81  1.77  1.71  1.65  1.58  1.51  1.44  1.38  1.30  1.23
  3.8    1.99  1.95  1.92  1.90  1.86  1.82  1.79  1.73  1.67  1.62  1.56  1.49  1.43  1.36  1.29
  4.0    1.99  1.96  1.93  1.91  1.87  1.84  1.81  1.75  1.70  1.64  1.59  1.53  1.47  1.41  1.35
  4.2    2.00  1.96  1.93  1.91  1.88  1.84  1.82  1.76  1.72  1.67  1.62  1.56  1.51  1.45  1.40
  4.4    2.00  1.96  1.94  1.92  1.88  1.85  1.83  1.78  1.73  1.69  1.64  1.59  1.54  1.49  1.44
  4.6    2.00  1.96  1.94  1.92  1.89  1.86  1.83  1.79  1.75  1.70  1.66  1.62  1.57  1.52  1.47
  4.8    2.00  1.97  1.94  1.93  1.89  1.87  1.84  1.80  1.76  1.72  1.68  1.64  1.59  1.55  1.50
  5.0    2.00  1.97  1.94  1.93  1.90  1.87  1.85  1.81  1.77  1.73  1.69  1.65  1.61  1.57  1.53

(b) Upper kU

β2 \ β1  0.00  0.01  0.03  0.05  0.10  0.15  0.20  0.30  0.40  0.50  0.60  0.70  0.80  0.90  1.00
  1.8    1.65
  2.0    1.76  1.82  1.86  1.89
  2.2    1.83  1.89  1.93  1.96  2.00  2.04  2.06
  2.4    1.88  1.94  1.98  2.01  2.05  2.08  2.11  2.15
  2.6    1.92  1.97  2.01  2.03  2.08  2.11  2.14  2.18  2.22
  2.8    1.94  1.99  2.03  2.05  2.09  2.13  2.15  2.20  2.24  2.27
  3.0    1.96  2.01  2.04  2.06  2.10  2.13  2.16  2.21  2.25  2.28  2.32
  3.2    1.97  2.02  2.05  2.07  2.11  2.14  2.16  2.21  2.25  2.29  2.32  2.35  2.38
  3.4    1.98  2.02  2.05  2.07  2.11  2.14  2.16  2.21  2.25  2.28  2.32  2.35  2.38  2.41
  3.6    1.99  2.02  2.05  2.07  2.11  2.14  2.16  2.20  2.24  2.28  2.31  2.34  2.37  2.41  2.44
  3.8    1.99  2.03  2.05  2.07  2.11  2.13  2.16  2.20  2.24  2.27  2.30  2.33  2.36  2.40  2.43
  4.0    1.99  2.03  2.05  2.07  2.11  2.13  2.15  2.19  2.23  2.26  2.29  2.32  2.35  2.38  2.41
  4.2    2.00  2.03  2.05  2.07  2.10  2.13  2.15  2.19  2.22  2.25  2.28  2.31  2.34  2.37  2.40
  4.4    2.00  2.03  2.05  2.07  2.10  2.13  2.15  2.18  2.22  2.25  2.28  2.31  2.33  2.36  2.39
  4.6    2.00  2.03  2.05  2.07  2.10  2.12  2.14  2.18  2.21  2.24  2.27  2.30  2.32  2.35  2.38
  4.8    2.00  2.03  2.05  2.07  2.10  2.12  2.14  2.17  2.21  2.23  2.26  2.29  2.31  2.34  2.36
  5.0    2.00  2.03  2.05  2.07  2.09   [?]   [?]  2.17  2.20  2.23  2.25   [?]  2.30   [?]  2.35

NOTES—This table was reproduced from Biometrika Tables for Statisticians, Vol. 1, p. 207, with the kind permission of the Biometrika Trust. The Biometrika Tables also give the lower and upper 0.5, 1.0, and 5 percent points. Use for a large sample only, say n > 250. Take μ = X̄ and σ = s.
(a) When g1 > 0, the skewness is taken to be positive, and the deviates for the lower percentage points are negative.

As a first step, one should scan the data to assess its approach to the Normal law. We suggest dividing g1 and g2 by their standard errors and, if either ratio exceeds 3, looking to see if there is an outlier. An outlier is an observation so small or so large that there are no other observations near it and so extreme that persons familiar with the measurements can assert that such an extreme value will not arise in the future under ordinary conditions. A glance at Fig. 2 suggests the presence of outliers, but we must suppose that the second criterion was not satisfied. If any observations seem to be outliers, then discard them.

If n is very large, say n > 10000, then use the percentile estimator based on the order statistics. If the ratios are both below 3, then use the Normal law for smaller sample sizes. If n is between 1000 and 10000 but the ratios suggest skewness and/or kurtosis, then use the cumulative frequency function. For smaller sample sizes and evidence of skewness and/or kurtosis, use the Pearson system curves. Obviously, these are rough guidelines, and the user must adapt them to the actual situation by trying alternative calculations and then judging the most reasonable.

NOTE ON TOLERANCE LIMITS

In Sections 33 through 34, the percentages of X values estimated to be within a specified range pertain only to the given sample of data which is being represented succinctly by selected statistics, X̄, s, etc. The Pearson curves used to derive these percentages are used simply as graduation formulas for the histogram of the sample data. The aim of Sections 33 to 34 is to indicate how much information about the sample is given by X̄, s, g1, and g2. It should be carefully noted that in an analysis of this kind the selected ranges of X and associated percentages are not to be confused with what in the statistical literature are called "tolerance limits."

[...] are respectively the largest and smallest values in the sample. If the population distribution is known to be Normal it might also be said, with a 0.90 probability of being right, that 99 percent of the values of the population will lie in the interval X̄ ± 2.703s. Further information on statistical tolerances of this kind is presented in Refs. 6, 7, and 18.

31. Use of Coefficient of Variation Instead of the Standard Deviation

So far as quantity of information is concerned, the presentation of the sample coefficient of variation, cv, together with the average, X̄, is equivalent to presenting the sample standard deviation, s, and the average, X̄, since s may be computed directly from the values of cv = s/X̄ and X̄. In fact, the sample coefficient of variation (multiplied by 100) is merely the sample standard deviation, s, expressed as a percentage of the average, X̄. The coefficient of variation is sometimes useful in presentations whose purpose is