DK4712_C001.fm Page 3 Tuesday, January 31, 2006 11:49 AM
Introduction to Chemometrics
quickly with low noise, rather than measuring its absorbance at a single
wavelength. By properly considering the distribution of multiple variables simultaneously, we obtain more information than could be obtained by considering each
variable individually. This is one of the so-called multivariate advantages. The additional information comes to us in the form of correlation. When we look at one variable
at a time, we neglect the correlation between variables, and hence miss part of the picture.
A recent paper by Bro described four additional advantages of multivariate
methods compared with univariate methods [1]. Noise reduction is possible when
multiple redundant variables are analyzed simultaneously by proper multivariate
methods. For example, low-noise factors can be obtained when principal component
analysis is used to extract a few meaningful factors from UV spectra measured at
hundreds of wavelengths. Another important multivariate advantage is that partially
selective measurements can be used, and by use of proper multivariate methods,
results can be obtained free of the effects of interfering signals. A third advantage
is that false samples can be easily discovered, for example in spectroscopic analysis.
For any well-characterized chemometric method, aliquots of material measured in
the future should be properly explained by linear combinations of the training set
or calibration spectra. If new, foreign materials are present that give spectroscopic
signals slightly different from the expected ingredients, these can be detected in the
spectral residuals and the corresponding aliquot flagged as an outlier or “false
sample.” The advantages of chemometrics are often the consequence of using multivariate methods. The reader will find these and other advantages highlighted
throughout the book.
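The false-sample check described above is easy to sketch numerically. The example below is illustrative only (the "calibration spectra" are synthetic Gaussian bands, and the noise levels are arbitrary choices), but it shows how an aliquot containing a foreign component leaves a large residual when fitted as a linear combination of the expected ingredient spectra:

```python
import numpy as np

rng = np.random.default_rng(0)
wavelengths = np.linspace(400, 700, 100)

# Hypothetical calibration spectra of the two expected ingredients
# (synthetic Gaussian bands standing in for measured pure spectra).
s1 = np.exp(-((wavelengths - 500) / 30) ** 2)
s2 = np.exp(-((wavelengths - 600) / 25) ** 2)
S = np.column_stack([s1, s2])

def residual_norm(spectrum, basis):
    """Fit the spectrum as a linear combination of the basis spectra
    (least squares) and return the norm of the unexplained residual."""
    coeffs, *_ = np.linalg.lstsq(basis, spectrum, rcond=None)
    return np.linalg.norm(spectrum - basis @ coeffs)

# A normal aliquot: a mixture of the two ingredients plus a little noise.
good = 0.7 * s1 + 0.3 * s2 + rng.normal(0, 0.005, wavelengths.size)

# A "false sample": the same mixture plus a weak foreign band near 450 nm.
foreign = 0.2 * np.exp(-((wavelengths - 450) / 10) ** 2)
bad = 0.7 * s1 + 0.3 * s2 + foreign + rng.normal(0, 0.005, wavelengths.size)

print(residual_norm(good, S))  # small: explained by the calibration spectra
print(residual_norm(bad, S))   # much larger: flag as an outlier
```

In practice the threshold for flagging an aliquot would be set from the residuals of the calibration set itself; here the point is simply that the foreign band cannot be absorbed by any linear combination of the expected spectra.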
1.4 HOW TO USE THIS BOOK
This book is suitable for use as an introductory textbook in chemometrics or for use
as a self-study guide. Each of the chapters is self-contained, and together they cover
many of the main areas of chemometrics. The early chapters cover tutorial topics
and fundamental concepts, starting with a review of basic statistics in Chapter 2,
including hypothesis testing. The aim of Chapter 2 is to review suitable protocols
for the planning of experiments and the analysis of the data, primarily from a
univariate point of view. Topics covered include defining a research hypothesis, and
then implementing statistical tools that can be used to determine whether the stated
hypothesis is found to be true. Chapter 3 builds on the concept of the univariate
normal distribution and extends it to the multivariate normal distribution. An example
is given showing the analysis of near infrared spectral data for raw material testing,
where two degradation products were detected at 0.5% to 1% by weight. Chapter 4
covers principal component analysis (PCA), one of the workhorse methods of
chemometrics. This is a topic that all basic or introductory courses in chemometrics
should cover. Chapter 5 covers the topic of multivariate calibration, including partial
least-squares, one of the single most common application areas for chemometrics.
Multivariate calibration refers generally to mathematical methods that transform an
instrument's response to give an estimate of a more informative chemical or physical
variable, e.g., the concentration of a target analyte. Together, Chapters 3, 4, and 5 form the introductory
core material of this book.
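As a rough sketch of this idea (ordinary least squares on synthetic data, not the partial least-squares method of Chapter 5), the calibration and prediction steps can be illustrated in a few lines of NumPy. The response profiles and concentration values below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical pure-component response profiles at five wavelengths.
t_analyte = np.array([0.8, 1.5, 0.2, 0.0, 0.6])
t_interf = np.array([0.1, 0.3, 1.2, 0.9, 0.2])

# Training set: 20 specimens with random analyte and interferent levels.
c_train = rng.uniform(0, 10, 20)
i_train = rng.uniform(0, 10, 20)
R_train = (np.outer(c_train, t_analyte) + np.outer(i_train, t_interf)
           + rng.normal(0, 0.02, (20, 5)))

# Calibration: estimate a regression vector mapping response -> concentration.
b, *_ = np.linalg.lstsq(R_train, c_train, rcond=None)

# Prediction: a new specimen with analyte at 4.2 plus the interferent.
r_new = 4.2 * t_analyte + 7.0 * t_interf + rng.normal(0, 0.02, 5)
print(r_new @ b)  # close to 4.2; the interfering signal is accounted for
```

Because the interferent varied in the training set, the fitted regression vector is nearly orthogonal to its profile, which is one way of seeing the "partially selective measurements" advantage mentioned above.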
© 2006 by Taylor & Francis Group, LLC
Practical Guide to Chemometrics
The remaining chapters of the book introduce some of the advanced topics of
chemometrics. The coverage is fairly comprehensive, in that these chapters cover
some of the most important advanced topics. Chapter 6 presents the concept of
robust multivariate methods. Robust methods are insensitive to the presence of
outliers. Most of the methods described in Chapter 6 can tolerate data sets contaminated with up to 50% outliers without detrimental effects. Descriptions of algorithms
and examples are provided for robust estimators of the multivariate normal distribution, robust PCA, and robust multivariate calibration, including robust PLS. As
such, Chapter 6 provides an excellent follow-up to Chapters 3, 4, and 5.
Chapter 7 covers the advanced topic of nonlinear multivariate model estimation,
with its primary examples taken from chemical kinetics. Chapter 8 covers the
important topic of experimental design. While its position in the arrangement of this
book comes somewhat late, we feel it will be much easier for the reader or student
to recognize important applications of experimental design by following chapters
on calibration and nonlinear model estimation. Chapter 9 covers the topic of multivariate classification and pattern recognition. These types of methods are designed
to seek relationships that describe the similarity or dissimilarity between diverse
groups of data, thereby revealing common properties among the objects in a data
set. With proper multivariate approaches, a large number of features can be studied
simultaneously. Examples of applications in this area of chemometrics include the
identification of the source of pollutants, detection of unacceptable raw materials,
classification of unlabeled pharmaceutical products for clinical trials through intact
blister packs, detection of the presence or absence of disease in a patient, and food
quality testing, to name a few.
Chapter 10, Signal Processing and Digital Filtering, is concerned with mathematical methods that are intended to enhance signals by decreasing the contribution
of noise. In this way, the “true” signal can be recovered from a signal distorted by
other effects. Chapter 11, Multivariate Curve Resolution, describes methods for the
mathematical resolution of multivariate data sets from evolving systems into descriptive models showing the contributions of pure constituents. The ability to correctly
recover pure concentration profiles and spectra for each of the components in the
system depends on the degree of overlap among the pure profiles of the different
components and the specific way in which the regions of these profiles are overlapped.
Chapter 12 describes three-way calibration methods, an active area of research in
chemometrics. Chapter 12 includes descriptions of methods such as the generalized
rank annihilation method (GRAM) and parallel factor analysis (PARAFAC). The main
advantage of three-way calibration methods is their ability to estimate analyte concentrations in the presence of unknown, uncalibrated spectral interferents. Chapter 13
reviews some of the most active areas of research in chemometrics.
1.4.1 SOFTWARE APPLICATIONS
Our experience in learning chemometrics and teaching it to others has demonstrated repeatedly that people learn new techniques by using them to solve interesting problems. For this reason, many of the contributing authors to this book
have chosen to illustrate their chemometric methods with examples using
Microsoft® Excel, MATLAB, or other powerful computer applications. For many
research groups in chemometrics, MATLAB has become a workhorse research tool,
and numerous public-domain MATLAB software packages for doing chemometrics
can be found on the World Wide Web. MATLAB is an interactive computing environment that takes the drudgery out of using linear algebra to solve complicated
problems. It integrates computer graphics, numerical analysis, and matrix computations into one simple-to-use package. The package is available on a wide range
of personal computers and workstations, including IBM-compatible and Macintosh
computers. It is especially well-suited to solving complicated matrix equations using
a simple “algebra-like” notation. Because some of the authors have chosen to use
MATLAB, we are able to provide you with some example programs. The equivalent
programs in BASIC, Pascal, FORTRAN, or C would be too long and complex for
illustrating the examples in this book. It will also be much easier for you to experiment with the methods presented in this book by trying them out on your data sets
and modifying them to suit your special needs. Those who want to learn more about
MATLAB should consult the manuals shipped with the program and numerous web
sites that present tutorials describing its use.
1.5 GENERAL READING ON CHEMOMETRICS
A growing number of books, some of a specialized nature, are available on chemometrics. A brief summary of the more general texts is given here as guidance for
the reader. Each chapter, however, has its own list of selected references.
JOURNALS
1. Journal of Chemometrics (Wiley) — Good for fundamental papers and applications
of advanced algorithms.
2. Chemometrics and Intelligent Laboratory Systems (Elsevier) — Good for
conference information; has a tutorial approach and is not too mathematically heavy.
3. Papers on chemometrics can also be found in many of the more general analytical
journals, including: Analytica Chimica Acta, Analytical Chemistry, Applied Spectroscopy, Journal of Near Infrared Spectroscopy, Journal of Process Control, and Technometrics.
BOOKS
1. Adams, M.J., Chemometrics in Analytical Spectroscopy, 2nd ed., The Royal Society
of Chemistry: Cambridge. 2004.
2. Beebe, K.R., Pell, R.J., and Seasholtz, M.B. Chemometrics: A Practical Guide., John
Wiley & Sons: New York. 1998.
3. Box, G.E.P., Hunter, W.G., and Hunter, J.S. Statistics for Experimenters. John Wiley
& Sons: New York. 1978.
4. Brereton, R.G. Chemometrics: Data Analysis for the Laboratory and Chemical Plant.
John Wiley & Sons: Chichester, U.K. 2002.
5. Draper, N.R. and Smith, H.S. Applied Regression Analysis, 2nd ed., John Wiley &
Sons: New York. 1981.
6. Jackson, J.E. A User’s Guide to Principal Components. John Wiley & Sons: New
York. 1991.
7. Jolliffe, I.T. Principal Component Analysis. Springer-Verlag: New York. 1986.
8. Kowalski, B.R., Ed. NATO ASI Series. Series C, Mathematical and Physical Sciences,
Vol. 138: Chemometrics, Mathematics, and Statistics in Chemistry. Dordrecht; Lancaster:
Published in cooperation with NATO Scientific Affairs Division [by] Reidel, 1984.
9. Kowalski, B.R., Ed. Chemometrics: Theory and Application. ACS Symposium Series
52. American Chemical Society: Washington, DC. 1977.
10. Malinowski, E.R. Factor Analysis in Chemistry. 2nd ed., John Wiley & Sons: New
York. 1991.
11. Martens, H. and Næs, T. Multivariate Calibration. John Wiley & Sons: Chichester,
U.K. 1989.
12. Massart, D.L., Vandeginste, B.G.M., Buydens, L.M.C., De Jong, S., Lewi, P.J., and
Smeyers-Verbeke, J. Handbook of Chemometrics and Qualimetrics, Part A and B.
Elsevier: Amsterdam. 1997.
13. Miller, J.C. and Miller, J.N. Statistics and Chemometrics for Analytical Chemistry,
4th ed., Prentice Hall: Upper Saddle River N.J. 2000.
14. Otto, M. Chemometrics: Statistics and Computer Application in Analytical Chemistry.
John Wiley & Sons-VCH: New York. 1999.
15. Press, W.H.; Teukolsky, S.A., Flannery, B.P., and Vetterling, W.T. Numerical Recipes
in C. The Art of Scientific Computing, 2nd ed., Cambridge University Press: New
York. 1992.
16. Sharaf, M.A., Illman, D.L., and Kowalski, B.R. Chemical Analysis, Vol. 82: Chemometrics. John Wiley & Sons: New York. 1986.
REFERENCES
1. Bro, R., Multivariate calibration. What is in chemometrics for the analytical chemist?
Analytica Chimica Acta, 2003. 500(1-2): 185–194.
DK4712_C002.fm Page 7 Thursday, March 2, 2006 5:04 PM
2 Statistical Evaluation of Data
Anthony D. Walmsley
CONTENTS
Introduction
2.1 Sources of Error
2.1.1 Some Common Terms
2.2 Precision and Accuracy
2.3 Properties of the Normal Distribution
2.4 Significance Testing
2.4.1 The F-test for Comparison of Variance (Precision)
2.4.2 The Student t-Test
2.4.3 One-Tailed or Two-Tailed Tests
2.4.4 Comparison of a Sample Mean with a Certified Value
2.4.5 Comparison of the Means from Two Samples
2.4.6 Comparison of Two Methods with Different Test Objects or Specimens
2.5 Analysis of Variance
2.5.1 ANOVA to Test for Differences Between Means
2.5.2 The Within-Sample Variation (Within-Treatment Variation)
2.5.3 Between-Sample Variation (Between-Treatment Variation)
2.5.4 Analysis of Residuals
2.6 Outliers
2.7 Robust Estimates of Central Tendency and Spread
2.8 Software
2.8.1 ANOVA Using Excel
Recommended Reading
References
INTRODUCTION
Typically, one of the main errors made in analytical chemistry and chemometrics
is that the chemical experiments are performed with no prior plan or design. It is
often the case that a researcher arrives with a pile of data and asks “what does it
mean?” to which the answer is usually “well what do you think it means?” The
weakness in collecting data without a plan is that one can quite easily acquire
data that are simply not relevant. For example, one may wish to compare a new
method with a traditional method, which is common practice, and so aliquots or
test materials are tested with both methods and then the data are used to test which
method is the best (Note: for “best” we mean the most suitable for a particular
task, in most cases “best” can cover many aspects of a method from highest purity,
lowest error, smallest limit of detection, speed of analysis, etc. The “best” method
can be defined for each case). However, this is not a direct comparison, as the
new method will typically be one in which the researchers have a high degree of
domain experience (as they have been developing it), meaning that it is an optimized method, but the traditional method may be one they have little experience
with, and so is more likely to be nonoptimized. Therefore, the question you have
to ask is, “Will simply testing objects with both methods result in data that can
be used to compare which is the better method, or will the data simply infer that
the researchers are able to get better results with their method than the traditional
one?” Without some design and planning, a great deal of effort can be wasted and
mistakes can be easily made. It is unfortunately very easy to compare an optimized
method with a nonoptimized method and hail the new technique as superior, when
in fact, all that has been deduced is an inability to perform both techniques to the
same standard.
Practical science should not start with collecting data; it should start with a
hypothesis (or several hypotheses) about a problem or technique, etc. With a set of
questions, one can plan experiments to ensure that the data collected is useful in
answering those questions. Prior to any experimentation, there needs to be a consideration of the analysis of the results, to ensure that the data being collected are
relevant to the questions being asked. One of the desirable outcomes of a structured
approach is that one may find that some variables in a technique have little influence
on the results obtained, and as such, can be left out of any subsequent experimental
plan, which results in the necessity for less rather than more work.
Traditionally, data was a single numerical result from a procedure or assay; for
example, the concentration of the active component in a tablet. However, with
modern analytical equipment, these results are more often a spectrum, such as a
mid-infrared spectrum for example, and so the use of multivariate calibration models
has ﬂourished. This has led to more complex statistical treatments because the result
from a calibration needs to be validated rather than just a single value recorded. The
quality of calibration models needs to be tested, as does the robustness, all adding
to the complexity of the data analysis. In the same way that the spectroscopist relies
on the spectra obtained from an instrument, the analyst must rely on the results
obtained from the calibration model (which may be based on spectral data); therefore,
the rigor of testing must be at the same high standard as that of the instrument
manufacturer. The quality of any model is very dependent on the test specimens
used to build it, and so sampling plays a very important part in analytical methodology. Obtaining a good representative sample or set of test specimens is not easy
without some prior planning, and in cases where natural products or natural materials
are used or where no design is applicable, it is critical to obtain a representative
sample of the system.
The aim of this chapter is to demonstrate suitable protocols for the planning of
experiments and the analysis of the data. The important question to keep in mind
is, “What is the purpose of the experiment and what do I propose as the outcome?”
Usually, defining the question takes greater effort than performing any analysis.
Defining the question is more technically termed defining the research hypothesis,
following which the statistical tools can be used to determine whether the stated
hypothesis is found to be true.
One can consider the application of statistical tests and chemometric tools to be
somewhat akin to torture—if you perform it long enough your data will tell you
anything you wish to know—but most results obtained from torturing your data are
likely to be very unstable. A light touch with the correct tools will produce a much
more robust and usable result than heavy-handed tactics ever will. Statistics, like
torture, benefit from the correct use of the appropriate tool.
2.1 SOURCES OF ERROR
Experimental science is in many cases a quantitative subject that depends on
numerical measurements. A numerical measurement is almost totally useless
unless it is accompanied by some estimate of the error or uncertainty in the
measurement. Therefore, one must get into the habit of estimating the error or
degree of uncertainty each time a measurement is made. Statistics are a good way
to describe some types of error and uncertainty in our data. Generally, one can
consider that simple statistics are a numerical measure of “common sense” when
it comes to describing errors in data. If a measurement seems rather high compared
with the rest of the measurements in the set, statistics can be employed to give a
numerical estimate as to how high. This means that one must not use statistics
blindly, but must always relate the results of a given statistical test to the data
to which the test has been applied, and relate the results to existing knowledge of
the measurement. For example, if you calculate the mean height of a group of
students, and the mean is returned as 296 cm, or more than 8 ft, then you must
consider that unless your class is a basketball team, the mean should not be so
high. The outcome should thus lead you to consider the original data, or that an
error has occurred in the calculation of the mean.
One needs to be extremely careful about errors in data, as the largest error will
always dominate. If there is a large error in a reference method, for example, small
measurement errors will be superseded by the reference errors. For example, if one
used a bench-top balance accurate to one hundredth of a gram to weigh out one
gram of substance to standardize a reagent, the resultant standard will have an
accuracy of only one part per hundred (1%), which is usually considered to be poor for
analytical data.
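A quick calculation makes the point that the largest error always dominates. Assuming the errors are independent so that they combine in quadrature (a standard result, not derived in this chapter; the smaller error value below is an illustrative choice):

```python
import math

# Hypothetical error budget: a bench-top balance reading to 0.01 g
# alongside a much smaller error from elsewhere in the procedure.
weighing_error = 0.01  # g
other_error = 0.0005   # g (illustrative value)

# Independent random errors combine in quadrature, so the combined
# uncertainty is essentially just the largest single contribution.
combined = math.sqrt(weighing_error ** 2 + other_error ** 2)
print(round(combined, 4))          # 0.01

# Weighing out 1 g on that balance fixes the relative error near 1%.
print(100 * weighing_error / 1.0)  # 1.0 (percent, one part per hundred)
```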
Statistics must not be viewed as a method of making sense out of bad data, as
the results of any statistical test are only as good as the data to which they are
applied. If the data are poor, then any statistical conclusion that can be made will
also be poor.
Experimental scientists generally consider there to be three types of error:
1. Gross error is caused, for example, by an instrumental breakdown such
as a power failure, a lamp failing, severe contamination of the specimen
or a simple mislabeling of a specimen (in which the bottle’s contents are
not as recorded on the label). The presence of gross errors renders an
experiment useless. The most easily applied remedy is to repeat the
experiment. However, it can be quite difficult to detect these errors, especially if no replicate measurements have been made.
2. Systematic error arises from imperfections in an experimental procedure,
leading to a bias in the data, i.e., the errors all lie in the same direction
for all measurements (the values are all too high or all too low). These
errors can arise due to a poorly calibrated instrument or by the incorrect
use of volumetric glassware. The errors that are generated in this way can
be either constant or proportional. When the data are plotted and viewed,
this type of error can usually be discovered, i.e., the intercept on the
y-axis for a calibration is much greater than zero.
3. Random error (commonly referred to as noise) produces results that are
spread about the average value. The greater the degree of randomness,
the larger the spread. Statistics are often used to describe random errors.
Random errors are typically ones that we have no control over, such as
electrical noise in a transducer. These errors affect the precision or reproducibility of the experimental results. The goal is to have small random
errors that lead to good precision in our measurements. The precision of
a method is determined from replicate measurements taken at a similar
time.
2.1.1 SOME COMMON TERMS
Accuracy: An experiment that has small systematic error is said to be accurate,
i.e., the measurements obtained are close to the true values.
Precision: An experiment that has small random errors is said to be precise,
i.e., the measurements have a small spread of values.
Within-run: This refers to a set of measurements made in succession in the
same laboratory using the same equipment.
Between-run: This refers to a set of measurements made at different times,
possibly in different laboratories and under different circumstances.
Repeatability: This is a measure of within-run precision.
Reproducibility: This is a measure of between-run precision.
Mean, Variance, and Standard Deviation: Three common statistics can be
calculated very easily to give a quick understanding of the quality of a
dataset and can also be used for a quick comparison of new data with some
prior datasets. For example, one can compare the mean of the dataset with
the mean from a standard set. These are very useful exploratory statistics,
they are easy to calculate, and can also be used in subsequent data analysis
tools. The arithmetic mean is a measure of the average or central tendency
of a set of data and is usually denoted by the symbol x̄. The value for the
mean is calculated by summing the data and then dividing this sum by the
number of values (n):

\bar{x} = \frac{\sum x_i}{n} \qquad (2.1)
The variance in the data, a measure of the spread of a set of data, is related to
the precision of the data. For example, the larger the variance, the larger the spread
of data and the lower the precision of the data. Variance is usually given the symbol
s² and is defined by the formula:

s^2 = \frac{\sum (x_i - \bar{x})^2}{n} \qquad (2.2)
The standard deviation of a set of data, usually given the symbol s, is the square
root of the variance. The difference between standard deviation and variance is that
the standard deviation has the same units as the data, whereas the variance is in units
squared. For example, if the measured unit for a collection of data is in meters (m)
then the unit for the standard deviation is m and the unit for the variance is m². For
large values of n, the population standard deviation is calculated using the formula:

s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}} \qquad (2.3)
If the standard deviation is to be estimated from a small set of data, it is more
appropriate to calculate the sample standard deviation, denoted by the symbol ŝ,
which is calculated using the following equation:

\hat{s} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \qquad (2.4)
The relative standard deviation (or coefficient of variation), a dimensionless
quantity (often expressed as a percentage), is a measure of the relative error, or noise
in some data. It is calculated by the formula:
\mathrm{RSD} = \frac{s}{\bar{x}} \qquad (2.5)
When making some analytical measurements of a quantity (x), for example the
concentration of lead in drinking water, all the results obtained will contain some
© 2006 by Taylor & Francis Group, LLC
DK4712_C002.fm Page 12 Thursday, March 2, 2006 5:04 PM
12
Practical Guide to Chemometrics
random errors; therefore, we need to repeat the measurement a number of times (n).
The standard error of the mean, which is a measure of the error in the final answer,
is calculated by the formula:

s_M = \frac{s}{\sqrt{n}} \qquad (2.6)
It is good practice when presenting your results to use the following representation:
\bar{x} \pm \frac{s}{\sqrt{n}} \qquad (2.7)
Suppose the boiling points of six impure ethanol specimens were measured using
a digital thermometer and found to be: 78.9, 79.2, 79.4, 80.1, 80.3, and 80.9°C. The
mean of the data, x̄, is 79.8°C and the standard deviation, s, is 0.692°C. With
n = 6, the standard error, s_M, is found to be 0.282°C; thus the true temperature of
the impure ethanol is in the range 79.8 ± 0.282°C (n = 6).
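The worked example above can be reproduced directly from Equations 2.1 to 2.6 using only the Python standard library. Note that the printed values are rounded to three decimal places, so the last digit differs slightly from the truncated figures quoted in the text:

```python
import math

# Boiling points of six impure ethanol specimens (deg C), from the text.
data = [78.9, 79.2, 79.4, 80.1, 80.3, 80.9]
n = len(data)

mean = sum(data) / n                                             # Equation 2.1
var_pop = sum((x - mean) ** 2 for x in data) / n                 # Equation 2.2
s_pop = math.sqrt(var_pop)                                       # Equation 2.3
s_hat = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))  # Equation 2.4
s_m = s_pop / math.sqrt(n)                                       # Equation 2.6

print(round(mean, 1))   # 79.8
print(round(s_pop, 3))  # 0.693 (the text truncates this to 0.692)
print(round(s_m, 3))    # 0.283 (quoted as 0.282 in the text)
print(round(s_hat, 3))  # 0.759, the sample standard deviation
```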
2.2 PRECISION AND ACCURACY
The ability to perform the same analytical measurements to provide precise and
accurate results is critical in analytical chemistry. The quality of the data can be
determined by calculating the precision and accuracy of the data. Various bodies have
attempted to define precision. One commonly cited definition is from the International
Union of Pure and Applied Chemistry (IUPAC), which defines precision as “relating
to the variations between variates, i.e., the scatter between variates” [1]. Accuracy can
be defined as the ability of the measured results to match the true value for the data.
From this point of view, the standard deviation is a measure of precision and the mean
is a measure of the accuracy of the collected data. In an ideal situation, the data would
have both high accuracy and precision (i.e., very close to the true value and with a
very small spread). The four common scenarios that relate to accuracy and precision
are illustrated in Figure 2.1. In many cases, it is not possible to obtain high precision
and accuracy simultaneously, so common practice is to be more concerned with the
precision of the data rather than the accuracy. Accuracy, or the lack of it, can be
compensated in other ways, for example by using aliquots of a reference material, but
low precision cannot be corrected once the data has been collected.
To determine precision, we need to know something about the manner in which
data is customarily distributed. For example, high precision (i.e., the data are very
close together) produces a very narrow distribution, while low precision (i.e., the
data are spread far apart) produces a wide distribution. Assuming that the data are
normally distributed (which holds true for many cases and can be used as an
approximation in many other cases) allows us to use the well understood mathematical distribution known as the normal or Gaussian error distribution. The advantage
to using such a model is that we can compare the collected data with a well
understood statistical model to determine the precision of the data.
FIGURE 2.1 The four common scenarios that illustrate accuracy and precision in data: (a)
precise but not accurate, (b) accurate but not precise, (c) inaccurate and imprecise, and (d)
accurate and precise.
Although the standard deviation gives a measure of the spread of a set of results
about the mean value, it does not indicate the way in which the results are distributed.
To understand this, a large number of results are needed to characterize the distribution. Rather than think in terms of a few data points (for example, six data points)
we need to consider, say 500 data points, so the mean, x , is an excellent estimate
of the true mean or population mean, µ. The spread of a large number of collected
data points will be affected by the random errors in the measurement (i.e., the
sampling error and the measurement error) and this will cause the data to follow
the normal distribution. This distribution is shown in Equation 2.8:
y = \frac{\exp[-(x - \mu)^2 / 2\sigma^2]}{\sigma\sqrt{2\pi}} \qquad (2.8)
where µ is the true mean (or population mean), x is the measured data, and σ is the
true standard deviation (or the population standard deviation). The shape of the
distribution can be seen in Figure 2.2, where it can be clearly seen that the smaller
the spread of the data, the narrower the distribution curve.
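Equation 2.8 is easy to evaluate directly. The short sketch below (plain Python, with the symbols as in the text) confirms that halving σ doubles the peak height: a smaller spread gives a narrower, taller curve:

```python
import math

def normal_pdf(x, mu, sigma):
    """Equation 2.8: the normal (Gaussian) error distribution."""
    return (math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
            / (sigma * math.sqrt(2 * math.pi)))

# Height of the curve at its center x = mu for two spreads.
print(round(normal_pdf(0.0, 0.0, 1.0), 3))  # 0.399 for sigma = 1
print(round(normal_pdf(0.0, 0.0, 0.5), 3))  # 0.798 for sigma = 0.5
```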
It is common to measure only a small number of objects or aliquots, and so one
has to rely upon the central limit theorem to see that a small set of data will behave
in the same manner as a large set of data. The central limit theorem states that “as
the size of a sample increases (number of objects or aliquots measured), the data
will tend towards a normal distribution.” If we consider the following case:
y = x_1 + x_2 + \dots + x_n \qquad (2.9)
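The behavior described by the central limit theorem can be checked with a small simulation: sums of n uniform random variables (as in Equation 2.9), each individually far from normal, pile up into a bell-shaped distribution with mean n/2 and variance n/12. The sample sizes and seed below are arbitrary choices for illustration:

```python
import random
import statistics

random.seed(0)  # arbitrary seed, for reproducibility

# Each y is a sum of n uniform(0, 1) variables, as in Equation 2.9.
n = 12
sums = [sum(random.random() for _ in range(n)) for _ in range(20000)]

# CLT prediction for these sums: mean n/2 = 6 and variance n/12 = 1.
print(statistics.mean(sums))      # ~6.0
print(statistics.variance(sums))  # ~1.0

# For a normal distribution, about 68% of values fall within one sigma.
within_1sigma = sum(abs(y - 6.0) < 1.0 for y in sums) / len(sums)
print(within_1sigma)              # ~0.68
```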