Statistics for Environmental Engineers
Preface to 2nd Edition
This second edition, like the first, is about how to generate informative data and how to extract information
from data. The short-chapter format of the first edition has been retained. The goal is for the reader to
be able to “dip in” where the case study or the statistical method stimulates interest without having to
study the book from front to back, or in any particular order.
Thirteen new chapters deal with experimental design, selecting the sample size for an experiment,
time series modeling and forecasting, transfer function models, weighted least squares, laboratory quality
assurance, standard and specialty control charts, and tolerance and prediction intervals. The chapters on
regression, parameter estimation, and model building have been revised. The chapters on transformations,
simulation, and error propagation have been expanded.
It is important to encourage engineers to see statistics as a professional tool. One way to do this is to
show them examples similar to those faced in one’s own work. For most of the examples in this book,
the environmental engineer will have a good idea how the test specimens were collected and how the
measurements were made. This creates a relevance and reality that makes it easier to understand special
features of the data and the potential problems associated with the data analysis.
Exercises for self-study and classroom use have been added to all chapters. A solutions manual is
available to course instructors. It will not be possible to cover all 54 chapters in a one-semester course,
but the instructor can select chapters that match the knowledge level and interest of a particular class.
Statistics and environmental engineering share the burden of having a special vocabulary, and students
have some early frustration in both subjects until they become familiar with the special language.
Learning both languages at the same time is perhaps expecting too much. Readers who have prerequisite
knowledge of both environmental engineering and statistics will find the book easily understandable.
Those who have had an introductory environmental engineering course but who are new to statistics, or
vice versa, can use the book effectively if they are patient about vocabulary.
We have not tried to discuss every statistical method that is used to interpret environmental data. To
do so would be impossible. Likewise, we cannot mention every environmental problem that involves
statistics. The statistical methods selected for discussion are those that have been useful in our work,
which is environmental engineering in the areas of water and wastewater treatment, industrial pollution
control, and environmental modeling. If your special interest is air pollution control, hydrology, or geostatistics, your work may require statistical methods that we have not discussed. Some topics have been
omitted precisely because you can find an excellent discussion in other books. We hope that whatever
kind of environmental engineering work you do, this book will provide clear and useful guidance on
data collection and analysis.
P. Mac Berthouex
Madison, Wisconsin
Linfield C. Brown
Medford, Massachusetts
© 2002 By CRC Press LLC
The Authors
Paul Mac Berthouex is Emeritus Professor of civil and environmental engineering at the University of Wisconsin-Madison, where he has been on the faculty since 1971. He received his M.S. in sanitary engineering from the University of Iowa in 1964 and his Ph.D. in civil engineering from the University of Wisconsin-Madison in 1970. Professor Berthouex has taught a wide range of environmental engineering courses, and in 1975 and 1992 was the recipient of the Rudolph Hering Medal of the American Society of Civil Engineers for the most valuable contribution to the environmental branch of the engineering profession. Most recently, he served on the Government of India's Central Pollution Control Board. In addition to Statistics for Environmental Engineers, 1st Edition (1994, Lewis Publishers), Professor Berthouex has written books on air pollution and pollution control. He has been the author or co-author of approximately 85 articles in refereed journals.
Linfield C. Brown is Professor of civil and environmental engineering at Tufts University, where he has been on the faculty since 1970. He received his M.S. in environmental health engineering from Tufts University in 1966 and his Ph.D. in sanitary engineering from the University of Wisconsin-Madison in 1970. Professor Brown teaches courses on water quality monitoring, water and wastewater chemistry, industrial waste treatment, and pollution prevention, and serves on the U.S. Environmental Protection Agency's Environmental Models Subcommittee of the Science Advisory Board. He is a Task Group Member of the American Society of Civil Engineers' National Subcommittee on Oxygen Transfer Standards, and has served on the Editorial Board of the Journal of Hazardous Wastes and Hazardous Materials. In addition to Statistics for Environmental Engineers, 1st Edition (1994, Lewis Publishers), Professor Brown has been the author or co-author of numerous publications on environmental engineering, water quality monitoring, and hazardous materials.
Table of Contents
1 Environmental Problems and Statistics
2 A Brief Review of Statistics
3 Plotting Data
4 Smoothing Data
5 Seeing the Shape of a Distribution
6 External Reference Distributions
7 Using Transformations
8 Estimating Percentiles
9 Accuracy, Bias, and Precision of Measurements
10 Precision of Calculated Values
11 Laboratory Quality Assurance
12 Fundamentals of Process Control Charts
13 Specialized Control Charts
14 Limit of Detection
15 Analyzing Censored Data
16 Comparing a Mean with a Standard
17 Paired t-Test for Assessing the Average of Differences
18 Independent t-Test for Assessing the Difference of Two Averages
19 Assessing the Difference of Proportions
20 Multiple Paired Comparisons of k Averages
21 Tolerance Intervals and Prediction Intervals
22 Experimental Design
23 Sizing the Experiment
24 Analysis of Variance to Compare k Averages
25 Components of Variance
26 Multiple Factor Analysis of Variance
27 Factorial Experimental Designs
28 Fractional Factorial Experimental Designs
29 Screening of Important Variables
30 Analyzing Factorial Experiments by Regression
31 Correlation
32 Serial Correlation
33 The Method of Least Squares
34 Precision of Parameter Estimates in Linear Models
35 Precision of Parameter Estimates in Nonlinear Models
36 Calibration
37 Weighted Least Squares
38 Empirical Model Building by Linear Regression
39 The Coefficient of Determination, R²
40 Regression Analysis with Categorical Variables
41 The Effect of Autocorrelation on Regression
42 The Iterative Approach to Experimentation
43 Seeking Optimum Conditions by Response Surface Methodology
44 Designing Experiments for Nonlinear Parameter Estimation
45 Why Linearization Can Bias Parameter Estimates
46 Fitting Models to Multiresponse Data
47 A Problem in Model Discrimination
48 Data Adjustment for Process Rationalization
49 How Measurement Errors Are Transmitted into Calculated Values
50 Using Simulation to Study Statistical Problems
51 Introduction to Time Series Modeling
52 Transfer Function Models
53 Forecasting Time Series
54 Intervention Analysis
Appendix — Statistical Tables
1 Environmental Problems and Statistics
There are many aspects of environmental problems: economic, political, psychological, medical, scientific,
and technological. Understanding and solving such problems often involves certain quantitative aspects,
in particular the acquisition and analysis of data. Treating these quantitative problems effectively involves
the use of statistics. Statistics can be viewed as the prescription for making the quantitative learning process
effective.
When one is confronted with a new problem, a two-part question of crucial importance is, “How will
using statistics help solve this problem and which techniques should be used?” Many different substantive
problems arise and many different statistical techniques exist, ranging from making simple plots of data
to iterative model building and parameter estimation.
Some problems can be solved by subjecting the available data to a particular analytical method. More
often the analysis must be stepwise. As Sir Ronald Fisher said, “…a statistician ought to strive above all
to acquire versatility and resourcefulness, based on a repertoire of tried procedures, always aware that
the next case he wants to deal with may not fit any particular recipe.”
Doing statistics on environmental problems can be like coaxing a stubborn animal. Sometimes small
steps, often separated by intervals of frustration, are the only way to progress at all. Even when the data contain bountiful information, it may be discovered in bits and at intervals.
The goal of statistics is to make that discovery process efficient. Analyzing data is part science, part
craft, and part art. Skills and talent help, experience counts, and tools are necessary. This book illustrates
some of the statistical tools that we have found useful; they will vary from problem to problem. We
hope this book provides some useful tools and encourages environmental engineers to develop the
necessary craft and art.
Statistics and Environmental Law
Environmental laws and regulations are about toxic chemicals, water quality criteria, air quality criteria, and so on, but they are also about statistics because they are laced with statistical terminology and concepts. For example, the limit of detection is a statistical concept used by chemists. In environmental biology, acute and chronic toxicity criteria are developed from complex data collection and statistical estimation procedures, safe and adverse conditions are differentiated through statistical comparison of control and exposed populations, and cancer potency factors are estimated by extrapolating models that have been fitted to dose-response data.
As an example, the Wisconsin laws on toxic chemicals in the aquatic environment specifically mention the following statistical terms: geometric mean, ranks, cumulative probability, sums of squares, least squares regression, data transformations, normalization of geometric means, coefficient of determination, standard F-test at a 0.05 level, representative background concentration, representative data, arithmetic average, upper 99th percentile, probability distribution, log-normal distribution, serial correlation, mean, variance, standard deviation, standard normal distribution, and Z value. The U.S. EPA guidance documents on statistical analysis of bioassay test data mention arc-sine transformation, probit analysis, non-normal distribution, Shapiro-Wilks test, Bartlett's test, homogeneous variance, heterogeneous variance, replicates, t-test with Bonferroni adjustment, Dunnett's test, Steel's rank test, and Wilcoxon rank sum test. Terms mentioned in EPA guidance documents on groundwater monitoring at RCRA sites include ANOVA, tolerance units, prediction intervals, control charts, confidence intervals, Cohen's adjustment, nonparametric ANOVA, test of proportions, alpha error, power curves, and serial correlation. Air pollution standards and regulations also rely heavily on statistical concepts and methods.
One burden of these environmental laws is a huge investment in collecting environmental data. No
nation can afford to invest huge amounts of money in programs and designs that are generated from
badly designed sampling plans or by laboratories that have insufficient quality control. The cost of poor
data is not only the price of collecting the sample and making the laboratory analyses, but also the investments wasted on remedies for non-problems and the damage done to the environment when real problems are not detected. One way to eliminate these inefficiencies in the environmental measurement system is
to learn more about statistics.
Truth and Statistics
Intelligent decisions about the quality of our environment, how it should be used, and how it should be
protected can be made only when information in suitable form is put before the decision makers. They,
of course, want facts. They want truth. They may grow impatient when we explain that at best we can
only make inferences about the truth. “Each piece, or part, of the whole of nature is always merely an
approximation to the complete truth, or the complete truth so far as we know it.…Therefore, things
must be learned only to be unlearned again or, more likely, to be corrected” (Feynman, 1995).
By making carefully planned measurements and using them properly, our level of knowledge is
gradually elevated. Unfortunately, regardless of how carefully experiments are planned and conducted,
the data produced will be imperfect and incomplete. The imperfections are due to unavoidable random
variation in the measurements. The data are incomplete because we seldom know, let alone measure,
all the influential variables. These difficulties, and others, prevent us from ever observing the truth exactly.
The relation between truth and inference in science is similar to that between guilty and not guilty in
criminal law. A verdict of not guilty does not mean that innocence has been proven; it means only that
guilt has not been proven. Likewise the truth of a hypothesis cannot be firmly established. We can only
test to see whether the data dispute its likelihood of being true. If the hypothesis seems plausible, in light
of the available data, we must make decisions based on the likelihood of the hypothesis being true. Also,
we assess the consequences of judging a true, but unproven, hypothesis to be false. If the consequences
are serious, action may be taken even when the scientific facts have not been established. Decisions to
act without scientific agreement fall into the realm of mega-tradeoffs, otherwise known as politics.
Statistics are numerical values that are calculated from imperfect observations. A statistic estimates a quantity that we need to know about but cannot observe directly. Using statistics should help us move toward the truth, but it cannot guarantee that we will reach it, nor will it tell us whether we have done so. It can help us make scientifically honest statements about the likelihood of certain hypotheses being true.
The Learning Process
Richard Feynman said (1995), “The principle of science, the definition almost, is the following. The
test of all knowledge is experiment. Experiment is the sole judge of scientific truth. But what is the
course of knowledge? Where do the laws that are to be tested come from? Experiment itself helps to
produce these laws, in the sense that it gives us hints. But also needed is imagination to create from
these hints the great generalizations — to guess at the wonderful, simple, but very strange patterns beneath
them all, and then to experiment again to check whether we have made the right guess.”
An experiment is like a window through which we view nature (Box, 1974). Our view is never perfect.
The observations that we make are distorted. The imperfections that are included in observations are
“noise.” A statistically efficient design reveals the magnitude and characteristics of the noise. It increases
the size and improves the clarity of the experimental window. Using a poor design is like seeing blurred
shadows behind the window curtains or, even worse, like looking out the wrong window.
Learning is an iterative process, the key elements of which are shown in Figure 1.1. The cycle begins with expression of a working hypothesis, which is typically based on a priori knowledge about the system. The hypothesis is usually stated in the form of a mathematical model that will be tuned to the present application while at the same time being placed in jeopardy by experimental verification. Whatever form the hypothesis takes, it must be probed and given every chance to fail as data become available. Hypotheses that are not “put to the test” are like good intentions that are never implemented. They remain hypothetical.
Learning progresses most rapidly when the experimental design is statistically sound. If it is poor, so
little will be learned that intelligent revision of the hypothesis and the data collection process may be
impossible. A statistically efficient design may literally let us learn more from eight well-planned experimental trials than from 80 that are badly placed. Good designs usually involve studying several variables
simultaneously in a group of experimental runs (instead of changing one variable at a time). Iterating
between data collection and data analysis provides the opportunity for improving precision by shifting
emphasis to different variables, making repeated observations, and adjusting experimental conditions.
We strongly prefer working with experimental conditions that are statistically designed. It is comparatively easy to arrange designed experiments in the laboratory. Unfortunately, in studies of natural systems
and treatment facilities it may be impossible to manipulate the independent variables to create conditions
of special interest. A range of conditions can be observed only by spacing observations or field studies over
a long period of time, perhaps several years. We may need to use historical data to assess changes that
have occurred over time and often the available data were not collected with a view toward assessing
these changes. A related problem is not being able to replicate experimental conditions. These are huge
stumbling blocks and it is important for us to recognize how they block our path toward discovery of
the truth. Hopes for successfully extracting information from such historical data are not often fulfilled.
Special Problems
Introductory statistics courses commonly deal with linear models and assume that available data are
normally distributed and independent. There are some problems in environmental engineering where
these fundamental assumptions are satisfied. Often the data are not normally distributed, they are serially
or spatially correlated, or nonlinear models are needed (Berthouex et al., 1981; Hunter, 1977, 1980, 1982).
Some specific problems encountered in data acquisition and analysis are:
FIGURE 1.1 Nature is viewed through the experimental window. Knowledge increases by iterating between experimental design, data collection, and data analysis. In each cycle the engineer may formulate a new hypothesis, add or drop variables, change experimental settings, and try new methods of data analysis. (The figure shows the cycle: define problem, hypothesis, design experiment, experiment, data, analysis, and deduction, with paths to redefine the hypothesis, redesign the experiment, or collect more data until the problem is solved; nature supplies the true models, true variables, and true values.)
Aberrant values. Values that stand out from the general trend are fairly common. They may occur
because of gross errors in sampling or measurement. They may be mistakes in data recording. If we think
only in these terms, it becomes too tempting to discount or throw out such values. However, rejecting
any value out of hand may lead to serious errors. Some early observers of stratospheric ozone concentrations failed to detect the hole in the ozone layer because their computer had been programmed to screen
incoming data for “outliers.” The values that defined the hole in the ozone layer were disregarded. This
is a reminder that rogue values may be real. Indeed, they may contain the most important information.
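The ozone episode can be illustrated with a minimal sketch in Python (the readings and screening limits below are invented for illustration, not the actual satellite data or processing rules): a filter that rejects values outside a preset "plausible" window throws away exactly the observations that carry the news.

```python
# Hypothetical sketch: automatic screening of incoming readings against a
# preset "plausible" window. All numbers are invented for illustration.
def screen(values, lo=250.0, hi=350.0):
    """Keep only readings inside the preset window; reject the rest."""
    return [v for v in values if lo <= v <= hi]

# Stable readings followed by a genuine drop (arbitrary units).
readings = [300, 305, 298, 302, 301, 299, 150, 148, 152]
kept = screen(readings)
discarded = [v for v in readings if v not in kept]
print(discarded)  # the real signal is exactly what gets thrown away
```

The point is not that range checks are wrong, but that rejected values deserve inspection rather than silent deletion.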
Censored data. Great effort and expense are invested in measurements of toxic and hazardous
substances that should be absent or else be present in only trace amounts. The analyst handles many
specimens for which the concentration is reported as “not detected” or “below the analytical method
detection limit.” This method of reporting censors the data at the limit of detection and condemns all
lower values to be qualitative. This manipulation of the data creates severe problems for the data analyst
and the person who needs to use the data to make decisions.
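A minimal sketch (with an invented detection limit and invented concentrations) shows one reason censoring troubles the analyst: the common practice of substituting 0, DL/2, or DL for each "not detected" result gives three different sample means, and nothing in the data says which is closest to the truth.

```python
# Hypothetical data: a detection limit of 1.0 and three censored results.
DL = 1.0
reported = ["<DL", 1.2, "<DL", 3.5, "<DL", 2.1, 1.4]

def mean_with_substitution(reported, substitute):
    """Replace each censored value with a fixed substitute, then average."""
    values = [substitute if v == "<DL" else v for v in reported]
    return sum(values) / len(values)

for sub in (0.0, DL / 2, DL):
    print(f"substitute {sub}: mean = {mean_with_substitution(reported, sub):.3f}")
```

Chapter 15 treats methods that handle censored values more honestly than any fixed substitution.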
Large amounts of data (which are often observational data rather than data from designed experiments). Every treatment plant, river basin authority, and environmental control agency has accumulated a mass of multivariate data in filing cabinets or computer databases. Most of this is happenstance data.
It was collected for one purpose; later it is considered for another purpose. Happenstance data are
often ill suited for model building. They may be ill suited for detecting trends over time or for testing
any hypothesis about system behavior because (1) the record is not consistent and comparable from
period to period, (2) all variables that affect the system have not been observed, and (3) the range of
variables has been restricted by the system’s operation. In short, happenstance data often contain
surprisingly little information. No amount of analysis can extract information that does not exist.
Large measurement errors. Many biological and chemical measurements have large measurement errors, despite the usual care that is taken with instrument calibration, reagent preparation, and personnel training. There are efficient statistical methods to deal with random errors. Replicate measurements can be used to estimate the random variation, averaging can reduce its effect, and other methods can compare the random variation with possible real changes in a system. Systematic errors (bias) cannot be removed or reduced by averaging.
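The distinction can be sketched in a few lines (the true value, bias, and error size below are hypothetical): averaging many replicates washes out the random error, but the average still lands on the biased value.

```python
import random
import statistics

# Hypothetical instrument: true value 10.0, systematic bias +0.5,
# random error with standard deviation 1.0.
random.seed(42)
true_value, bias = 10.0, 0.5
replicates = [true_value + bias + random.gauss(0.0, 1.0) for _ in range(2000)]

avg = statistics.mean(replicates)
print(f"average of {len(replicates)} replicates: {avg:.2f}")  # near 10.5, not 10.0
```

Only an independent check against a standard of known value, not more averaging, can reveal the bias.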
Lurking variables. Sometimes important variables are not measured, for a variety of reasons. Such
variables are called lurking variables. The problems they can cause are discussed by Box (1966) and
Joiner (1981). A related problem occurs when a truly influential variable is carefully kept within a narrow
range with the result that the variable appears to be insignificant if it is used in a regression model.
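The restricted-range effect can be sketched with an invented model (y = 2 + 1.5x plus noise; all numbers are hypothetical): the same truly influential variable, observed once over a wide range and once held in a narrow band, yields a precisely estimated slope in the first case and a slope swamped by its own standard error in the second.

```python
import random

# Hypothetical sketch: fit y = 2 + 1.5 x + noise with x varied widely,
# then with x confined to a narrow band. All numbers are invented.
def fit_slope(x, y):
    """Least-squares slope and its standard error."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    resid = [yi - my - b * (xi - mx) for xi, yi in zip(x, y)]
    s2 = sum(r ** 2 for r in resid) / (n - 2)
    return b, (s2 / sxx) ** 0.5

random.seed(5)
def observe(x_values):
    y = [2.0 + 1.5 * xi + random.gauss(0.0, 1.0) for xi in x_values]
    return fit_slope(x_values, y)

results = {}
for label, xs in (("wide", [random.uniform(0.0, 10.0) for _ in range(30)]),
                  ("narrow", [random.uniform(4.9, 5.1) for _ in range(30)])):
    b, se = observe(xs)
    results[label] = (b, se)
    print(f"{label} x-range: slope = {b:.2f}, std. error = {se:.2f}")
```

With the narrow range, a regression would declare the variable "insignificant" even though it drives the system.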
Nonconstant variance. The error associated with measurements is often nearly proportional to the
magnitude of their measured values rather than approximately constant over the range of the measured
values. Many measurement procedures and instruments introduce this property.
Nonnormal distributions. We are strongly conditioned to think of data being symmetrically distributed
about their average value in the bell shape of the normal distribution. Environmental data seldom have
this distribution. A common asymmetric distribution has a long tail toward high values.
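A short simulation (assuming a lognormal model, a common choice for concentration-like data, with invented parameters) illustrates the asymmetry: the long right tail pulls the mean well above the median.

```python
import random
import statistics

# Simulate lognormally distributed "concentrations" (assumed model,
# invented parameters) and compare mean with median.
random.seed(7)
data = [random.lognormvariate(0.0, 1.0) for _ in range(10_000)]

mean = statistics.mean(data)
median = statistics.median(data)
print(f"mean = {mean:.2f}, median = {median:.2f}")  # mean exceeds median
```

For such data the arithmetic average is dominated by a few high values, which is one reason transformations (Chapter 7) are so often useful.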
Serial correlation. Many environmental data occur as a sequence of measurements taken over time
or space. The order of the data is critical. In such data, it is common that the adjacent values are not
statistically independent of each other because the natural continuity over time (or space) tends to make
neighboring values more alike than randomly selected values. This property, called serial correlation,
violates the assumptions on which many statistical procedures are based. Even low levels of serial
correlation can distort estimation and hypothesis testing procedures.
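The property can be sketched with simulated data (an assumed first-order carryover model with invented parameters): when each value retains 70% of its predecessor, the lag-1 sample autocorrelation comes out near 0.7, far from the zero that independence-based procedures assume.

```python
import random

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation coefficient."""
    n = len(x)
    m = sum(x) / n
    num = sum((x[t] - m) * (x[t - 1] - m) for t in range(1, n))
    den = sum((v - m) ** 2 for v in x)
    return num / den

# Simulated series: each value carries over 70% of the previous one
# plus fresh random noise (assumed model, invented parameters).
random.seed(1)
phi = 0.7
x = [0.0]
for _ in range(5000):
    x.append(phi * x[-1] + random.gauss(0.0, 1.0))

print(f"lag-1 autocorrelation: {lag1_autocorr(x):.2f}")
```

Checking an estimate like this before applying a t-test or control chart is a cheap safeguard against the distortions described above.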
Complex cause-and-effect relationships. The systems of interest — the real systems in the field — are
affected by dozens of variables, including many that cannot be controlled, some that cannot be measured
accurately, and probably some that are unidentified. Even if the known variables were all controlled, as
we try to do in the laboratory, the physics, chemistry, and biochemistry of the system are complicated
and difficult to decipher. Even a system that is driven almost entirely by inorganic chemical reactions
can be difficult to model (for example, because of chemical complexation and amorphous solids formation). The situation has been described by Box and Luceño (1997): “All models are wrong but some are
useful.” Our ambition is usually short of trying to discover all causes and effects. We are happy if we
can find a useful model.
The Aim of this Book
Learning statistics is not difficult, but engineers often dislike their introductory statistics course. One
reason may be that the introductory course is largely a sterile examination of textbook data, usually
from a situation of which they have no intimate knowledge or deep interest. We hope this book, by
presenting statistics in a familiar context, will make the subject more interesting and palatable.
The book is organized into short chapters, each dealing with one essential idea that is usually developed
in the context of a case study. We hope that using statistics in relevant and realistic examples will make
it easier to understand peculiarities of the data and the potential problems associated with its analysis.
The goal is for each chapter to stand alone so the book need not be studied from front to back, or in any other particular order. This is not always possible, but the reader is encouraged to “dip in”
where the subject of the case study or the statistical method stimulates interest.
Most chapters have the following format:
• Introduction to the general kind of engineering problem and the statistical method to be discussed.
• Case Study introduces a specific environmental example, including actual data.
• Method gives a brief explanation of the statistical method that is used to prepare the solution to the case study problem. Statistical theory has been kept to a minimum. Sometimes it is condensed to an extent that reference to another book is mandatory for a full understanding. Even when the statistical theory is abbreviated, the objective is to explain the broad concept sufficiently for the reader to recognize situations when the method is likely to be useful, even if all the details required for its correct application are not yet understood.
• Analysis shows how the data suggest and influence the method of analysis and gives the solution. Many solutions are developed in detail, but we do not always show all calculations. Most problems were solved using commercially available computer programs (e.g., MINITAB, SYSTAT, Statview, and EXCEL).
• Comments provide guidance to other chapters and statistical methods that could be useful in analyzing a problem of the kind presented in the chapter. We also attempt to expose the sensitivity of the statistical method to assumptions and to recommend alternate techniques that might be used when the assumptions are violated.
• References to selected articles and books are given at the end of each chapter. Some cover the statistical methodology in greater detail while others provide additional case studies.
• Exercises provide additional data sets, models, or conceptual questions for self-study or classroom use.
Summary
To gain from what statistics offer, we must proceed with an attitude of letting the data reveal the critical
properties and of selecting statistical methods that are appropriate to deal with these properties. Environmental data often have troublesome characteristics. If this were not so, this book would be unnecessary. All useful methods would be published in introductory statistics books. This book has the objective
of bringing together, primarily by means of examples and exercises, useful methods with real data and
real problems. Not all useful statistical methods are included and not all widely encountered problems
are discussed. Some problems are omitted because they are given excellent coverage in other books
(e.g., Gilbert, 1987). Still, we hope the range of material covered will contribute to improving the state-of-the-practice of statistics in environmental engineering and will provide guidance to relevant publications in statistics and engineering.