shahid lecture-1- mkag1273

Upload: nchakori

Post on 18-Jan-2017

Page 1: Shahid Lecture-1- MKAG1273

STATISTICAL HYDROLOGY MAL1303/MKAG1273
Summarizing Data

Dr. Shamsuddin Shahid
Associate Professor

Department of Hydraulics and Hydrology
Faculty of Civil Engineering

Room No.: M46-332; Phone: 07-5531624; Mobile: 0182051586

Email: [email protected]

11/23/2015 Shamsuddin Shahid, FKA, UTM

Page 2: Shahid Lecture-1- MKAG1273

Office Location: C09-Room No. 219

Page 3: Shahid Lecture-1- MKAG1273

What is Statistics

Statistics is the science of Learning from Data.

Statistics is the science of collecting data and analyzing (modeling) data for the purpose of decision making and scientific discovery when the available information is both limited and variable.

Page 4: Shahid Lecture-1- MKAG1273

Descriptive Statistics and Inferential Statistics

Descriptive statistics refers to statistical techniques used to summarize and describe a data set. Measures of central tendency, such as mean and median, and of dispersion, such as range and standard deviation, are the main descriptive statistics. Displays of data, such as histograms and box-plots, are also considered techniques of descriptive statistics.

Inferential statistics, or statistical induction, means the use of statistics to make inferences concerning some unknown aspect of a population. Some examples of inferential statistics methods include hypothesis testing, linear regression, and prediction.

Page 5: Shahid Lecture-1- MKAG1273

Statistical Hydrology

In statistical hydrology, we generally use a four-step process for learning from data:

(1) Defining the problem
(2) Collecting the data
(3) Summarizing the data
(4) Analyzing data, interpreting the analyses, and communicating results.

Page 6: Shahid Lecture-1- MKAG1273

Defining the Problem

Defining the problem or the question to be addressed, for example:

• What is the condition of water quality in the river?
• Is there a relationship between nitrogen fertilizer use and groundwater quality?
• What is the effect of rainfall on river discharge in a particular river basin?
• What is the main factor responsible for water quality deterioration in a river?
• Has there been any change in rainfall in the last decade?

Page 7: Shahid Lecture-1- MKAG1273

Collecting the Data

Proposing a study or experiment to collect meaningful data to answer the problem.

• Which variables have to be measured?

• How many data points/samples have to be collected?

• How should the samples be selected?

• Etc.

The most crucial element in data collection is the manner in which the sample is selected.

Page 8: Shahid Lecture-1- MKAG1273

Population and Samples

Population
The population is the total set of measurements which could hypothetically be taken from the entity being studied. It is the set of all measurements of interest. For example, the trace element composition of all stream sediments in an area, the groundwater depth at any point of a catchment, etc.

The Statistical Sample
A sample is any subset of measurements selected from the population. The word "sample" can cause confusion because it means different things in water resources and in statistics. A water sample usually means one bottle of water. In statistics, a sample is the set of data that is actually available for analysis.

Page 9: Shahid Lecture-1- MKAG1273

Population and Samples

A sample is a subset of the population.

A sample should be collected in such a way that it represents the whole population.

Page 10: Shahid Lecture-1- MKAG1273

Sample Size

• No general rule for sample size can be given.
• It depends on the degree of intricacy of the problem being addressed.
• Many times, the sample size is possible to know in advance.
• In many cases, hydrologists collect a reasonable number of samples and analyze them to see how accurately they represent the population. Accordingly, they go for the next set of sample collection.
• However, in many cases samples cannot be collected according to need.
• If the sample size is not adequate, different (non-parametric) methods are used to analyze the data.
• Whatever the case may be, the sample size should never be less than 6.

Page 11: Shahid Lecture-1- MKAG1273

Sampling Technique

In hydrology, one of the main decisions about sampling a population is where to sample. Sampling techniques can be classified as below:

1. Random Sampling
2. Stratified Sampling
3. Uniform Sampling
4. Regular Sampling
5. Clustered Sampling
6. Traverse Sampling

Page 12: Shahid Lecture-1- MKAG1273

Random Sampling

Consider a sample of n measurements selected from a population containing N measurements (N > n). The sample is said to be a random sample if every different sample of size n from the population has an equal probability of being selected.

Sample data selected in a nonrandom fashion are frequently distorted by a selection bias. A selection bias exists whenever there is a systematic tendency to overrepresent or underrepresent some part of the population.
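
The definition of a random sample can be illustrated with a short Python sketch. The well depths and the function name draw_random_sample are hypothetical and are not taken from the lecture; the sketch only relies on the standard-library call random.sample, which draws without replacement so that every subset of size n is equally likely.

```python
import random


def draw_random_sample(population, n, seed=None):
    """Draw n measurements without replacement; every subset of size n is equally likely."""
    if n > len(population):
        raise ValueError("sample size n must not exceed population size N")
    rng = random.Random(seed)  # a seed is used only to make the illustration repeatable
    return rng.sample(population, n)


# Hypothetical population: groundwater depth (m) at N = 10 wells.
depths = [4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4, 3.9, 5.8, 4.7]
sample = draw_random_sample(depths, n=4, seed=1)
print(sample)
```

Choosing wells by convenience (for example, only those near a road) would not satisfy this definition and could introduce the selection bias described above.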

Page 13: Shahid Lecture-1- MKAG1273

Stratified Sampling

The whole population is divided into a number of groups or strata according to a property, and each group is sampled separately. This type of sampling is called Stratified Sampling.

Suppose we want to study the transmissivity of groundwater in a catchment. Transmissivity of groundwater depends heavily on local geology. Therefore, we divide the catchment according to geological units and sample within each unit to study the impact of geology on transmissivity.
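
A minimal sketch of stratified sampling, assuming the population has already been split into strata. The geological unit names, well identifiers and the function stratified_sample are invented for illustration; drawing the same number of units from every stratum is one possible design, not necessarily the one intended in the lecture.

```python
import random


def stratified_sample(units_by_stratum, n_per_stratum, seed=None):
    """Draw a separate random sample inside every stratum."""
    rng = random.Random(seed)
    chosen = {}
    for stratum, units in units_by_stratum.items():
        k = min(n_per_stratum, len(units))  # a small stratum cannot give more than it has
        chosen[stratum] = rng.sample(units, k)
    return chosen


# Hypothetical wells grouped by geological unit.
wells = {
    "alluvium": ["A1", "A2", "A3", "A4", "A5"],
    "sandstone": ["S1", "S2", "S3"],
    "granite": ["G1", "G2"],
}
chosen = stratified_sample(wells, n_per_stratum=2, seed=7)
for unit, picked in chosen.items():
    print(unit, picked)
```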

Page 14: Shahid Lecture-1- MKAG1273

Systematic Sampling

Uniform Sampling: Planned by randomization within grid squares.

Regular or gridded: Planned on a rectangular or triangular grid.
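
The difference between the two layouts can be sketched in Python as follows. This is an illustration under stated assumptions: a rectangular grid of nx by ny squares with side length cell (all hypothetical parameters), a regular design that samples the centre of each square, and a uniform design that places one random point inside each square. Triangular grids are not covered.

```python
import random


def regular_grid(nx, ny, cell):
    """Regular sampling: one point at the centre of every grid square."""
    return [((i + 0.5) * cell, (j + 0.5) * cell)
            for i in range(nx) for j in range(ny)]


def uniform_grid(nx, ny, cell, seed=None):
    """Uniform sampling: one randomly placed point inside every grid square."""
    rng = random.Random(seed)
    return [((i + rng.random()) * cell, (j + rng.random()) * cell)
            for i in range(nx) for j in range(ny)]


regular_points = regular_grid(nx=3, ny=2, cell=100.0)
uniform_points = uniform_grid(nx=3, ny=2, cell=100.0, seed=3)
print(regular_points)
print(uniform_points)
```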

Page 15: Shahid Lecture-1- MKAG1273

Systematic Sampling

1. Clustered: It is focused on a patchy distribution.

2. Traverse: Often forced by access and exposure constraints or logistics.

Page 16: Shahid Lecture-1- MKAG1273

Sample Technique

Page 17: Shahid Lecture-1- MKAG1273

Data Quality

Garbage in – garbage out.

A hydrologist must be confident about data quality before processing the data.

Selection of the data analysis technique depends on the quality of the data.

Use of any measurement device must be accompanied by awareness of precision and accuracy.

Page 18: Shahid Lecture-1- MKAG1273

Data Quality

Precision - A measurement is precise if repeated measurements of the same entity are similar.

Accuracy – A measurement is accurate if it is close to the true value. In water resources, the true value is usually unknown, although there are standards that can be used for calibrating analytical equipment.
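
One possible way to put numbers on these two ideas, assuming repeated readings of a calibration standard with a known reference value: the offset of the average reading from the reference indicates accuracy, and the scatter of the readings indicates precision. The readings, the 10.0 mg/L reference and the function name bias_and_spread are all hypothetical.

```python
import statistics


def bias_and_spread(readings, reference_value):
    """Return (bias, spread): mean offset from the reference and sample standard deviation."""
    bias = statistics.mean(readings) - reference_value  # closer to 0 means more accurate
    spread = statistics.stdev(readings)                 # smaller means more precise
    return bias, spread


# Hypothetical repeated readings (mg/L) of a 10.0 mg/L calibration standard.
instrument_a = [10.1, 9.9, 10.0, 10.2, 9.8]    # centred on the reference value
instrument_b = [11.0, 11.1, 10.9, 11.0, 11.0]  # tightly grouped but offset by about 1 mg/L

bias_a, spread_a = bias_and_spread(instrument_a, 10.0)
bias_b, spread_b = bias_and_spread(instrument_b, 10.0)
print(f"A: bias={bias_a:.2f}, spread={spread_a:.2f}")
print(f"B: bias={bias_b:.2f}, spread={spread_b:.2f}")
```

In this made-up case instrument B is the more precise of the two but the less accurate, which shows that the two properties are independent.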

Page 19: Shahid Lecture-1- MKAG1273

Data Type

There are a variety of categories of data which may be encountered in Hydrology. It is important to know the data type before selecting the appropriate data analysis technique.

Ratio scale data
Ordinary measurements such as amount of rainfall, depth of groundwater level, etc. This is the best quality and most versatile data type.

Interval scale data
Interval scale data differ from ratio scale data in that the zero point is not a fundamental termination of the scale. The classical example of interval scale data is temperature measured in Centigrade.

Page 20: Shahid Lecture-1- MKAG1273

Data Type

Ordinal Scale Data
This category is of considerably lower quality than the Ratio or Interval Scale Data. The only purpose of the scale is to place observations in relative order. Consequently, it is not valid to apply addition or subtraction, or division, to ordinal scale data. Non-parametric methods are used to analyze Ordinal Scale Data.

Nominal or Categorical Data
Information is sometimes recorded in the form of names. For example, a flood is sometimes recorded as normal flood, severe flood, very severe flood, etc. Sometimes, drought is recorded according to its occurrence, such as drought occurring in 1974, 1981, 1993, 1990, 2008, etc.

Page 21: Shahid Lecture-1- MKAG1273

Data Type

Directional Data
Data that are expressed as an angle. Examples of directional data: direction of a cyclone, direction of surface runoff, etc. Directional data require special methods of analysis as the numerical values cycle around through 360 degrees.
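
A small Python sketch shows why ordinary averaging fails for angles. It uses the standard circular mean (average the unit vectors, then take the angle of the result), which is one common special method and is not named in the lecture. The three directions are hypothetical.

```python
import math


def circular_mean_deg(angles_deg):
    """Mean direction in degrees, computed from the averaged sine and cosine components."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    # If both sums are zero the mean direction is undefined; that case is not handled here.
    return math.degrees(math.atan2(s, c)) % 360.0


# Hypothetical directions (degrees from north), all close to north.
directions = [350.0, 10.0, 30.0]
arithmetic = sum(directions) / len(directions)
circular = circular_mean_deg(directions)
print(f"arithmetic mean = {arithmetic:.1f}, circular mean = {circular:.1f}")
```

The arithmetic mean of these values points roughly south-east even though every observation is close to north, while the circular mean stays near north.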

Closed Data
There are lower and upper limits for this type of data. These are data in the form of percentage, parts per million (ppm), etc. Such data require cautious treatment, especially in bivariate and multivariate methods, because variables are fundamentally interdependent.

Page 22: Shahid Lecture-1- MKAG1273

Continuous and Discrete Types of Variable

Discrete Variable
When observations on a quantitative random variable can assume only a countable number of values, the variable is called a discrete random variable.

Example: Number of days in a year with rainfall > 20 mm, number of floods in a year, etc.

Continuous Variable
When observations on a quantitative random variable can assume any one of an uncountable number of values, the variable is called a continuous random variable. A continuous variable can take on any value over some range.

Example: Daily rainfall, river discharge, groundwater depth, etc.

Page 23: Shahid Lecture-1- MKAG1273

Summarization of Data

To analyze any collection of water resources data appropriately, the first consideration must be the characteristics of the data themselves. Therefore, first we need to know the common characteristics of the water resources data we want to study. Summarization of data means making a summary of the data in forms which convey their important characteristics.

For example, if we want to analyze the rainfall data of a station, first we need to know:

1. What is the average rainfall at that station?
2. How does the rainfall vary at the station?
3. How is the rainfall distributed over the year?
4. How often does heavy rainfall occur?

Page 24: Shahid Lecture-1- MKAG1273

Summarization of Data

Summarization of data means making a summary of the data in forms which convey their important characteristics, such as:

• A measure of the center of the data,
• A measure of spread or variability,
• A measure of the symmetry of the data distribution,
• Percentiles of the data (estimates of extremes)
• Etc.

Page 25: Shahid Lecture-1- MKAG1273

Measurement of the center of the data

The Mean and Median are the two most commonly used measures to describe the center of the data.

Besides these, the Mode, Geometric Mean and Trimmed Mean are also sometimes used to summarize some water resources data.

We need to understand:

• What are the properties of these measures?

• When should one be employed over the other?

Page 26: Shahid Lecture-1- MKAG1273

The Mean - classical measurement of data

The mean of a set of measurements is defined to be the sum of the measurements divided by the total number of measurements.

The sample mean is denoted by the symbol $\bar{X}$ (x-bar) and the population mean is denoted by $\mu$ (mu).

Page 27: Shahid Lecture-1- MKAG1273

Computation of Mean - Example

Discharge in a river was measured every three days in the month of January, as below. What is the mean value of river discharge in January?

Page 28: Shahid Lecture-1- MKAG1273

Measurement of the center of the data: Mean

Page 29: Shahid Lecture-1- MKAG1273

The Mean

The influence of an observation on the overall mean is the distance between the observation and the mean computed excluding that observation.

Therefore, the influence of observation 4.0 on the overall mean is (4.0 – 3.3) = 0.7
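
The influence of an observation, as defined above, can be computed with a few lines of Python. The discharge values below are hypothetical (the table from the slide is not available in this transcript), and influence_on_mean is an illustrative name.

```python
def influence_on_mean(values, index):
    """Observation minus the mean of all the other observations."""
    others = values[:index] + values[index + 1:]
    if not others:
        raise ValueError("need at least two observations")
    return values[index] - sum(others) / len(others)


# Hypothetical discharge measurements (cumec); the last value is unusually large.
discharge = [2.0, 3.0, 4.0, 11.0]
for i, value in enumerate(discharge):
    print(value, round(influence_on_mean(discharge, i), 2))
```

The large observation has by far the largest influence, which is why a single extreme value can shift the mean noticeably.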

Page 30: Shahid Lecture-1- MKAG1273

Influence of observation on the Mean

Mean Discharge = 3.4 cumec

• All observations do not have the same influence on the mean.

• Observations closer to the mean have less influence on the mean compared to observations far from the mean.

Page 31: Shahid Lecture-1- MKAG1273

Influence of observation on the Mean

Mean discharge in the month of January is 6.1 cumec.
If we remove the extreme value 34.8 (outlier) from the observations, the mean discharge = 3.2 cumec.

Page 32: Shahid Lecture-1- MKAG1273

The Mean: Summary

1. It is the arithmetic average of the measurements in a data set.
2. There is only one mean for a data set.
3. Its value is influenced by extreme measurements; trimming can help to reduce the degree of influence.
4. Means of subsets can be combined to determine the mean of the complete data set.
5. It is applicable to quantitative data only.
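
Property 4 can be made concrete: if each subset's mean and size are known, the overall mean is the size-weighted average of the subset means. The sketch below assumes hypothetical subset values, and combined_mean is an illustrative name.

```python
def combined_mean(subsets):
    """Overall mean from (subset_mean, subset_size) pairs."""
    total_n = sum(n for _, n in subsets)
    if total_n == 0:
        raise ValueError("subsets contain no observations")
    return sum(mean * n for mean, n in subsets) / total_n


# Hypothetical: 3 measurements averaging 2.0 cumec and 1 measurement of 4.0 cumec.
overall = combined_mean([(2.0, 3), (4.0, 1)])
print(overall)
```

Note that simply averaging the two subset means (2.0 and 4.0) would give 3.0, which is wrong unless the subsets have equal size.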

Page 33: Shahid Lecture-1- MKAG1273

The Median

The median of a set of measurements is defined to be the middle value when the measurements are arranged from lowest to highest.
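
A minimal Python sketch of this definition follows. For an even number of measurements it averages the two middle values, which is a common convention assumed here rather than stated on the slide. The discharge values are hypothetical.

```python
def sample_median(values):
    """Middle value of the ordered data; mean of the two middle values when n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("no data")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical discharge data (cumec) with and without an extreme value.
with_outlier = [2.5, 3.0, 3.2, 3.6, 34.8]
without_outlier = [2.5, 3.0, 3.2, 3.6, 4.0]
print(sample_median(with_outlier), sum(with_outlier) / len(with_outlier))
print(sample_median(without_outlier), sum(without_outlier) / len(without_outlier))
```

Replacing the extreme value changes the mean considerably while the median stays the same.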

Page 34: Shahid Lecture-1- MKAG1273

The Median

Page 35: Shahid Lecture-1- MKAG1273

The Median

No change in Median

The median is only minimally affected by the magnitude of a single observation. Therefore, the median is often known as a resistant measure of central value.

This resistance to the effect of a change in value or the presence of outlying observations is often a desirable property in water resources data analysis.

Page 36: Shahid Lecture-1- MKAG1273

The Median: Summary

1. It is the central value; 50% of the measurements lie above it and 50% fall below it.
2. There is only one median for a data set.
3. It is less influenced by extreme measurements.
4. Medians of subsets cannot be combined to determine the median of the complete data set.
5. For grouped data, its value is rather stable even when the data are organized into different categories.
6. It is applicable to quantitative data only.

Page 37: Shahid Lecture-1- MKAG1273

Preference of Mean or Median

When a summary value is desired that is not strongly influenced by a few extreme observations, the median is preferable to the mean.

One such example is the chemical concentration in water one might expect to find over many streams in a given region. Using the median, one sample with unusually high concentration has no greater effect on the estimate than one with low concentration. The mean concentration may be pulled towards the outlier, and be higher than concentrations found in most of the samples.

Page 38: Shahid Lecture-1- MKAG1273

Measurement of the center of the data: Mode

The mode is the most frequently observed value. It is far more applicable for:

• Discrete data
• Grouped data
• Data which are recorded only as falling into a finite number of categories

• It is very easy to obtain.
• It is less applicable for continuous data.
• When applied to continuous data, it gives a poor measure of center.
• Its value often depends on the arbitrary grouping of those data.

Page 39: Shahid Lecture-1- MKAG1273

Measurement of the center of the data: Mode

Let us consider a location where rainfall greater than 100 mm in a day is considered a heavy rainfall day. The data show the number of heavy rainfall days in different years over the time period 2000-2010. How frequently does heavy rainfall occur at that location?

Mean = 3.1 days
Median = 3 days
Mode = 4 days

It is more meaningful to say that heavy rainfall usually happens 4 times in a year.
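
The three measures can be compared in Python with the standard statistics module. The yearly counts below are NOT the data from the slide (that table is not available in this transcript); they are invented values for eleven years, chosen only so that the summaries come out close to the figures quoted above. statistics.multimode (Python 3.8 or later) returns every most frequent value, so it also covers multi-modal data.

```python
import statistics

# Hypothetical number of heavy rainfall days per year for 11 years (invented values).
heavy_days = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4]

mean_days = statistics.mean(heavy_days)
median_days = statistics.median(heavy_days)
modes = statistics.multimode(heavy_days)  # list of all most frequent values

print(f"mean = {mean_days:.1f}, median = {median_days}, mode(s) = {modes}")

# A data set with two equally frequent values has two modes.
print(statistics.multimode([1, 1, 2, 2, 3]))
```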

Page 40: Shahid Lecture-1- MKAG1273

Measurement of the center of the data: Mode

Mode is not influenced by extreme measurements.

Heavy rainfall days in different years over the time period 2000-2010

Page 41: Shahid Lecture-1- MKAG1273

Measurement of the center of the data: Mode

Heavy rainfall days in different years over the time period 2000-2010

There can be more than one mode for a data set. This type of data set is called multi-modal data.

Page 42: Shahid Lecture-1- MKAG1273

The Mode: Summary

1. It is the most frequent or probable measurement in the data set.
2. There can be more than one mode for a data set.
3. It is not influenced by extreme measurements.
4. Modes of subsets cannot be combined to determine the mode of the complete data set.
5. For grouped data its value can change depending on the categories used.
6. It is applicable for both qualitative and quantitative data.

Page 43: Shahid Lecture-1- MKAG1273

Geometric Mean

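The geometric mean is a type of mean which indicates the central tendency of a set of numbers by using the product of their values: it is the n-th root of the product of n values.

• Geometric mean applies only to positive numbers.
• It is less influenced by extreme values.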
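
A short Python sketch of the geometric mean, computed through logarithms (the exponential of the mean of the logs), which is mathematically equivalent to the n-th root of the product and avoids overflow for long data sets. The concentration values are hypothetical.

```python
import math


def geometric_mean(values):
    """n-th root of the product of n positive values, computed via logarithms."""
    if not values:
        raise ValueError("no data")
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean applies only to positive numbers")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical concentrations spanning two orders of magnitude.
concentrations = [1.0, 10.0, 100.0]
print(geometric_mean(concentrations))              # close to 10
print(sum(concentrations) / len(concentrations))   # arithmetic mean, pulled up by 100
```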

Page 44: Shahid Lecture-1- MKAG1273

Trimmed Mean

Trimmed mean drops the highest and lowest extreme values and averages the rest. For example, a 5% trimmed mean drops the highest 5% and the lowest 5% of the measurements and averages the rest. Similarly, a 25% trimmed mean drops the highest and the lowest 25% of the measurements and averages the rest.
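
The definition can be sketched in Python as below. One detail is an assumption of this sketch: when the trimming fraction times n is not a whole number, the number of values dropped from each end is rounded down. Other rounding conventions exist and give slightly different results. The data are hypothetical.

```python
def trimmed_mean(values, proportion):
    """Drop the lowest and highest `proportion` of the ordered data and average the rest."""
    if not values:
        raise ValueError("no data")
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # values dropped from each end (rounded down)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical data with one extreme value.
data = [1.0, 2.0, 3.0, 4.0, 100.0]
print(trimmed_mean(data, 0.20))  # drops 1.0 and 100.0
print(sum(data) / len(data))     # ordinary mean for comparison
```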

Page 45: Shahid Lecture-1- MKAG1273

Trimmed Mean

14 samples are collected to monitor the Nitrate concentration in groundwater as below:

20% trimmed mean of the data:

Lower limit of window = 14 × 0.20 = 2.8 ≈ 3
Upper limit of window = 14 × 0.80 = 11.2 ≈ 11

Trimmed mean = 2.8

Page 46: Shahid Lecture-1- MKAG1273

Measurement of Spread

The degree of variation of data values can be diagnostic in water resources. For example, variation in annual rainfall tells us how reliable rainfall is in an area.

It is also important to obtain a measure of the variability of values in a population in order to assess the number of observations we need to draw useful conclusions and to assess the reliability of our estimates of parameters.

Measures used to show the data variability are:

1. Range
2. p-th percentile
3. Interquartile Range
4. Deviation
5. Variance
6. Standard deviation
7. Coefficient of variation

Page 47: Shahid Lecture-1- MKAG1273

Measurement of Spread: Range

The range is defined as the difference between the largest and the smallest measurements of the set.

Range is the simplest but least useful measure of data variation.

For grouped data, because we do not know the individual measurements, the range is taken to be the difference between the upper limit of the last interval and the lower limit of the first interval.

Range = 4.8 – 2.5 = 2.3

Page 48: Shahid Lecture-1- MKAG1273

Measurement of Spread: p-th percentile

The p-th percentile of a set of n measurements (arranged in order of magnitude) is that value that has at most p% of the measurements below it and at most (100 - p)% above it.

Page 49: Shahid Lecture-1- MKAG1273

Measurement of Spread: Interquartile Range

The interquartile range (IQR) of a set of measurements is defined to be the difference between the upper and lower quartiles; that is,

IQR = 75th percentile - 25th percentile
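
The range, quartiles and IQR can be computed with Python's standard statistics module, as in the sketch below. The choice of method="inclusive" in statistics.quantiles is an assumption of this example; several percentile conventions are in use and they can give slightly different quartiles for small samples. The data and the function name spread_summary are hypothetical.

```python
import statistics


def spread_summary(values):
    """Range, quartiles and interquartile range of a data set."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "range": max(values) - min(values),
        "q1": q1,       # 25th percentile
        "median": q2,   # 50th percentile
        "q3": q3,       # 75th percentile
        "iqr": q3 - q1,
    }


# Hypothetical measurements.
summary = spread_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(summary)
```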

Page 50: Shahid Lecture-1- MKAG1273

Measurement of Spread: Deviation

Deviation defines how far a particular measurement deviates from the mean of the set of measurements.

The deviations of the measurements are computed by using the formula

deviation of $X_i$ = $X_i - \bar{X}$

Page 51: Shahid Lecture-1- MKAG1273

Measurement of Spread: Variance

A more easily interpreted function of the deviations involves the sum of the squared deviations of the measurements from their mean. This measure is called the variance.

The variance of a set of n measurements X1, X2, … Xn with mean $\bar{X}$ is the sum of the squared deviations divided by n – 1:

$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

We have special symbols to denote the sample and population variances. The symbol $s^2$ represents the sample variance, and the corresponding population variance is denoted by the symbol $\sigma^2$.

Page 52: Shahid Lecture-1- MKAG1273

Measurement of Spread: Standard Deviation

The standard deviation of a set of measurements is defined to be the square root of the variance.

One reason for defining the standard deviation is that it yields a measure of variability having the same units of measurement as the original data, whereas the units for variance are the square of the measurement units.

s denotes the sample standard deviation and $\sigma$ denotes the corresponding population standard deviation.
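
The sample variance (sum of squared deviations from the mean, divided by n – 1) and the sample standard deviation (its square root) can be computed as in this Python sketch. The data values are hypothetical and the function names are illustrative.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two measurements")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_std(values):
    """Square root of the sample variance; same units as the data."""
    return math.sqrt(sample_variance(values))


# Hypothetical measurements with mean 5.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(sample_variance(data), sample_std(data))
```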

Page 53: Shahid Lecture-1- MKAG1273

Measurement of Spread: Group Data

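We need a simple modification of the formula to calculate the sample variance when only grouped data are available.

If Xi and fi denote the midpoint and frequency, respectively, for the i-th class interval, then the variance of grouped data is:

$$s^2 = \frac{\sum_i f_i (X_i - \bar{X})^2}{n - 1}, \qquad n = \sum_i f_i, \qquad \bar{X} = \frac{\sum_i f_i X_i}{n}$$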
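
A sketch of the grouped-data calculation in Python, assuming the usual form in which each class midpoint is weighted by its frequency and the divisor is n – 1, where n is the total frequency. The midpoints and frequencies below are hypothetical.

```python
def grouped_variance(midpoints, frequencies):
    """Approximate sample variance from class midpoints and class frequencies."""
    if len(midpoints) != len(frequencies):
        raise ValueError("midpoints and frequencies must have the same length")
    n = sum(frequencies)
    if n < 2:
        raise ValueError("need at least two observations in total")
    mean = sum(f * x for x, f in zip(midpoints, frequencies)) / n
    return sum(f * (x - mean) ** 2 for x, f in zip(midpoints, frequencies)) / (n - 1)


# Hypothetical classes: two observations near 1.0 and two near 3.0.
variance = grouped_variance([1.0, 3.0], [2, 2])
print(variance)
```

Because every observation in a class is represented by the class midpoint, the result is an approximation of the variance of the ungrouped data.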

Page 54: Shahid Lecture-1- MKAG1273

Distribution of Data

• Means and variances are ways to describe a distribution of data.

• But all the details of the data cannot be understood with only the mean and variance.

• A histogram is one of the most effective ways to depict a frequency distribution.

• To know the details of the data distribution we need to draw the frequency histogram and the relative frequency histogram.

• Frequency is the number of times a variable takes on a value in a particular group.

• Note that any variable has a frequency distribution.

Page 55: Shahid Lecture-1- MKAG1273

Distribution of Data

Total annual rainfall (in millimeters) at a station situated in the sub-tropical region is given below.

940 2118 1684 1415 1111 1303 1526 1667
1182 1301 1995 1692 2293 1903 870 1678
1647 1236 1180 1515 1650 1614 1384 1845
1315 1879 2326 1539 2110 2038 1976 1117
1710 2066 1295 1814 1265 1419 1751 1582
1446 1468 1810 1471 2102 1895 2474 1645
1769 2119

Page 56: Shahid Lecture-1- MKAG1273

Distribution of Data

Minimum Rainfall = 870 mm
Maximum Rainfall = 2474 mm
Range = 1604 mm
Group Interval = Range / (number of groups - 1) = 1604 / (9 – 1) ≈ 200

Groups: 801 – 1000; 1001 – 1200; 1201 – 1400; 1401 – 1600; 1601 – 1800; 1801 – 2000; 2001 – 2200; 2201 – 2400; 2401 – 2600
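
The grouping step can be sketched in Python as follows. The 50 rainfall values are the annual totals listed on the slide, as read from this transcript (digits that ran together at the row ends have been split into four-digit values). The group edges follow the nine groups given above; the variable names are illustrative.

```python
from collections import Counter

# Annual rainfall totals (mm) as read from the slide.
rainfall_mm = [
    940, 2118, 1684, 1415, 1111, 1303, 1526, 1667,
    1182, 1301, 1995, 1692, 2293, 1903, 870, 1678,
    1647, 1236, 1180, 1515, 1650, 1614, 1384, 1845,
    1315, 1879, 2326, 1539, 2110, 2038, 1976, 1117,
    1710, 2066, 1295, 1814, 1265, 1419, 1751, 1582,
    1446, 1468, 1810, 1471, 2102, 1895, 2474, 1645,
    1769, 2119,
]

LOWER_EDGE = 801   # lower limit of the first group
WIDTH = 200        # group interval
N_GROUPS = 9


def group_label(index):
    """Text label such as '801-1000' for the group with the given index."""
    low = LOWER_EDGE + index * WIDTH
    return f"{low}-{low + WIDTH - 1}"


# Count how many values fall in each group.
counts = Counter((value - LOWER_EDGE) // WIDTH for value in rainfall_mm)
frequency = [counts.get(i, 0) for i in range(N_GROUPS)]

# Print a simple text histogram.
for i, count in enumerate(frequency):
    print(f"{group_label(i):>9}  {count:2d}  {'#' * count}")
```

Changing N_GROUPS and WIDTH shows how the shape of the frequency table depends on the chosen grouping.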

Page 57: Shahid Lecture-1- MKAG1273

Distribution of Data: Histogram

• A histogram is used to graphically summarize the distribution of a data set.
• A histogram divides the range of values in a data set into intervals.
• Over each interval is placed a bar whose height represents the frequency of data values in the interval.

Page 58: Shahid Lecture-1- MKAG1273

Distribution of Data

Number of groups is 17

Because of the arbitrariness in the choice of number of intervals, starting value, and length of intervals, histograms can be made to take on different shapes for the same set of data, especially for small data sets. Histograms are most useful for describing data sets when the number of data points is fairly large, say 50 or more.

Page 59: Shahid Lecture-1- MKAG1273

Distribution of Data

Number of groups is 5

When the number of data points is relatively small and the number of intervals is large, the histogram fluctuates too much; that is, it responds to a very few data values. This results in a graph that is not a realistic depiction of the histogram for the whole population. When the number of class intervals is too small, most of the patterns or trends in the data are not displayed.

Page 60: Shahid Lecture-1- MKAG1273

• Frequencies can be absolute (when the frequency provided is the actual count of the occurrences) or relative (when they are normalized by dividing the absolute frequency by the total number of observations, giving values in the range [0, 1]).

• Relative frequencies are particularly useful if you want to compare distributions drawn from two different sources (i.e. when the numbers of observations from each source may be different).
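The normalization from absolute to relative frequencies can be sketched as follows. The two count lists are hypothetical (invented for illustration), as is the function name `relative_frequencies`; the point is only that sources with 20 and 100 observations become directly comparable.

```python
def relative_frequencies(counts):
    """Divide each absolute frequency by the total number of observations."""
    total = sum(counts)
    if total == 0:
        raise ValueError("at least one observation is required")
    return [c / total for c in counts]


# Hypothetical absolute frequencies over the same four class intervals.
station_a = [2, 5, 9, 4]        # 20 observations
station_b = [10, 24, 46, 20]    # 100 observations

rel_a = relative_frequencies(station_a)
rel_b = relative_frequencies(station_b)

for a, b in zip(rel_a, rel_b):
    print(f"{a:.2f}  {b:.2f}")
```

The relative frequencies of each source sum to 1, so the two columns can be compared class by class even though the sample sizes differ.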

Absolute and Relative Frequency Distribution


Page 61: Shahid Lecture-1- MKAG1273

Distribution of Data

We can compare two different samples (or populations) by examining their relative frequency histograms even if the samples (populations) are of different sizes, because we use proportions rather than frequencies in a relative frequency histogram.


Page 62: Shahid Lecture-1- MKAG1273

Distribution of Data


Page 63: Shahid Lecture-1- MKAG1273

Distribution of Data


Page 64: Shahid Lecture-1- MKAG1273

Distribution of Data

This rule applies to data sets with roughly a "mound-shaped" histogram, that is, a histogram that has a single peak, is symmetrical, and tapers off gradually in the tails.


Page 65: Shahid Lecture-1- MKAG1273

Distribution of Data

M0 – Sample mean
Md – Sample median
µ – Population mean
TM – Trimmed mean


Page 66: Shahid Lecture-1- MKAG1273

Coefficient of variation measures the variability in the values in a population relative to the magnitude of the population mean. In a process or population with mean µ and standard deviation σ, the coefficient of variation is defined as,

$$CV = \frac{\sigma}{\mu}$$

Coefficient of Variation


Page 67: Shahid Lecture-1- MKAG1273

Coefficient of Variation shows how much the hydrological variables vary.

For example, if we calculate the Coefficient of Variation of an annual rainfall time series, it will indicate the reliability of rainfall. If the value is very high, rainfall varies widely from year to year and is less reliable.

Coefficient of Variation

In both cases the annual average rainfall is 580 mm. However, rainfall in the second area is more reliable than in the first.
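The comparison can be reproduced numerically. The sketch below uses two invented five-year rainfall series that both average 580 mm; the values are hypothetical and are not the data behind the slide's figure. The coefficient of variation is taken as standard deviation divided by mean, with the sample standard deviation used as an estimate (an assumption).

```python
import statistics


def coefficient_of_variation(values):
    """Return standard deviation divided by mean (sample estimate)."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.stdev(values) / mean


# Hypothetical annual rainfall (mm) for two areas, both averaging 580 mm.
area_1 = [300, 860, 420, 740, 580]
area_2 = [560, 600, 570, 590, 580]

cv_1 = coefficient_of_variation(area_1)
cv_2 = coefficient_of_variation(area_2)
print(f"Area 1: mean = {statistics.mean(area_1)} mm, CV = {cv_1:.2f}")
print(f"Area 2: mean = {statistics.mean(area_2)} mm, CV = {cv_2:.2f}")
```

The area with the smaller coefficient of variation has the more reliable rainfall, even though both means are identical.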


Page 68: Shahid Lecture-1- MKAG1273

Measures of Skewness and Kurtosis

• A fundamental task in many statistical analyses is to characterize the location and variability of a data set (measures of central tendency vs. measures of dispersion)

• Both measures tell us nothing about the shape of the distribution

• A further characterization of the data includes skewness and kurtosis

• The histogram is an effective graphical technique for showing both the skewness and kurtosis of a data set


Page 69: Shahid Lecture-1- MKAG1273

The term skewness refers to the lack of symmetry.

Skewness measures the degree of asymmetry exhibited by the data

If skewness equals zero, the histogram is symmetric about the mean

$$\text{skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n s^3}$$

Skewness


Page 70: Shahid Lecture-1- MKAG1273

Consider the two distributions in the figure just below. Within each graph, the bars on the right side of the distribution taper differently than the bars on the left side. These tapering sides are called tails, and they provide a visual means for determining which of the two kinds of skewness a distribution has:

1. negative skew: The left tail is longer; the mass of the distribution is concentrated on the right of the figure. The distribution is said to be left-skewed, left-tailed, or skewed to the left.

2. positive skew: The right tail is longer; the mass of the distribution is concentrated on the left of the figure. The distribution is said to be right-skewed, right-tailed, or skewed to the right.

Skewness


Page 71: Shahid Lecture-1- MKAG1273

Skewness


Page 72: Shahid Lecture-1- MKAG1273

Skewness in a data series may be observed not only graphically but by simple inspection of the values.

For instance, consider the numeric sequence (49, 50, 51), whose values are evenly distributed around a central value of (50). We can transform this sequence into a negatively skewed distribution by adding a value far below the mean, as in e.g. (40, 49, 50, 51).

Similarly, we can make the sequence positively skewed by adding a value far above the mean, as in e.g. (49, 50, 51, 60).
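The three sequences can be checked numerically. This is a minimal sketch that assumes skewness is computed as the sum of cubed deviations from the mean divided by n·s³, with s taken as the sample standard deviation (an assumption; using the population standard deviation changes the magnitude but not the sign).

```python
import statistics


def skewness(values):
    """Sum of cubed deviations from the mean, divided by n * s**3."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (assumed)
    return sum((x - mean) ** 3 for x in values) / (n * s ** 3)


symmetric = [49, 50, 51]
low_value_added = [40, 49, 50, 51]
high_value_added = [49, 50, 51, 60]

for name, data in [("symmetric", symmetric),
                   ("low value added", low_value_added),
                   ("high value added", high_value_added)]:
    print(f"{name}: skewness = {skewness(data):+.3f}")
```

The sign of the result matches the description: zero for the symmetric sequence, negative when a value far below the mean is added, and positive when a value far above the mean is added.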

Skewness


Page 73: Shahid Lecture-1- MKAG1273

Skewness

Positive skewness
There are more observations below the mean than above it. This occurs when the mean is greater than the median.

Negative skewness
There are a small number of low observations and a large number of high ones. This occurs when the median is greater than the mean.


Page 74: Shahid Lecture-1- MKAG1273

Skewness of a distribution cannot always be determined simply by inspection.

As a rule of thumb:
If Mean > Median, the skew is positive.
If Mean < Median, the skew is negative.
If Mean = Median, the skew is zero.

Skewness


Page 75: Shahid Lecture-1- MKAG1273

• One measure of absolute skewness is the difference between the mean and the mode. Such a measure is not truly meaningful on its own because it depends on the units of measurement.

• The simplest measure of skewness is Pearson's coefficient of skewness:

Skewness

$$\text{Pearson's coefficient of skewness} = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}}$$
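As a small worked illustration, Pearson's coefficient (mean minus mode, divided by the standard deviation) can be computed as below. The data values are invented, the function name `pearson_mode_skewness` is hypothetical, and the sample standard deviation is assumed.

```python
import statistics


def pearson_mode_skewness(values):
    """(mean - mode) / standard deviation, using the sample standard deviation."""
    mean = statistics.mean(values)
    mode = statistics.mode(values)
    return (mean - mode) / statistics.stdev(values)


# Hypothetical data with a single clear mode (3) and one high value.
data = [2, 3, 3, 3, 4, 9]
coefficient = pearson_mode_skewness(data)
print(f"mean = {statistics.mean(data)}, mode = {statistics.mode(data)}, "
      f"coefficient = {coefficient:.3f}")
```

Because the difference is divided by the standard deviation, the result is dimensionless and does not depend on the units of measurement. Here the mean (4) exceeds the mode (3), so the coefficient is positive.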


Page 76: Shahid Lecture-1- MKAG1273

• The skewness coefficient varies between -3.0 and +3.0.
• There is no universally accepted range of skewness for judging the distribution of data.
• Some people say that, as a rule of thumb, -1 to +1 is acceptable (-2 to +2 is often used too) for a normal distribution.

Skewness


Page 77: Shahid Lecture-1- MKAG1273

Kurtosis measures how peaked the histogram is

The kurtosis of a normal distribution is 0 (zero)

Kurtosis characterizes the relative peakedness or flatness of a distribution compared to the normal distribution

$$\text{kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n s^4} - 3$$

Kurtosis


Page 78: Shahid Lecture-1- MKAG1273

• Platykurtic – When the kurtosis < 0, the frequencies throughout the curve are closer to equal (i.e., the curve is flatter and wider). Thus, negative kurtosis indicates a relatively flat distribution

• Leptokurtic – When the kurtosis > 0, there are high frequencies in only a small part of the curve (i.e., the curve is more peaked). Thus, positive kurtosis indicates a relatively peaked distribution

• Kurtosis is based on the size of a distribution's tails. Negative kurtosis (platykurtic) – distributions with short tails. Positive kurtosis (leptokurtic) – distributions with relatively long tails

[Figure: leptokurtic and platykurtic curves]
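The sign convention can be illustrated with two invented data sets, one flat and one peaked. This sketch assumes kurtosis is the sum of fourth-power deviations divided by n·s⁴, minus 3, with s taken as the sample standard deviation (an assumption); the data are hypothetical.

```python
import statistics


def excess_kurtosis(values):
    """Sum of fourth-power deviations divided by n * s**4, minus 3."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (assumed)
    return sum((x - mean) ** 4 for x in values) / (n * s ** 4) - 3


# Hypothetical data: evenly spread values vs. values piled up at the centre.
flat = list(range(1, 11))        # 1, 2, ..., 10
peaked = [5] * 8 + [0, 10]       # mostly 5, with two distant values

print(f"flat:   {excess_kurtosis(flat):+.2f}")    # negative: platykurtic
print(f"peaked: {excess_kurtosis(peaked):+.2f}")  # positive: leptokurtic
```

The evenly spread values give a negative result, while the data concentrated at one value with two distant observations give a positive result.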

Kurtosis


Page 79: Shahid Lecture-1- MKAG1273

The coefficient of kurtosis is the most important measure of kurtosis. It is based on the second and fourth moments:

Kurtosis

$$\beta_2 = \frac{\mu_4}{\mu_2^2}$$

Where,

Second moment: $$\mu_2 = \frac{\sum f (x - \bar{x})^2}{N}$$

Fourth moment: $$\mu_4 = \frac{\sum f (x - \bar{x})^4}{N}$$

• If β2 - 3 > 0, the distribution is leptokurtic.
• If β2 - 3 < 0, the distribution is platykurtic.
• If β2 - 3 = 0, the distribution is mesokurtic (normal).

A kurtosis value of +/-1 is considered very good for most uses, but +/-2 is also usually acceptable.
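For grouped data, the coefficient of kurtosis can be computed from a frequency table as the fourth moment about the mean divided by the square of the second moment. The sketch below is illustrative only: the class values and frequencies are invented, and the function name is hypothetical.

```python
def moment_coefficient_of_kurtosis(values, freqs):
    """Return mu4 / mu2**2 from a frequency table (values with frequencies)."""
    n_total = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n_total
    mu2 = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / n_total
    mu4 = sum(f * (x - mean) ** 4 for x, f in zip(values, freqs)) / n_total
    return mu4 / mu2 ** 2


# Hypothetical class mid-points and their frequencies.
class_values = [1, 2, 3, 4, 5]
frequencies = [1, 4, 6, 4, 1]

beta_2 = moment_coefficient_of_kurtosis(class_values, frequencies)
print(f"beta_2 = {beta_2}, beta_2 - 3 = {beta_2 - 3}")
```

For this table the mean is 3, the second moment is 1 and the fourth moment is 2.5, so the coefficient is 2.5. Subtracting 3 gives -0.5, which is negative, so this hypothetical distribution would be classed as platykurtic.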


Page 80: Shahid Lecture-1- MKAG1273

Chebyshev's theorem
According to Chebyshev's theorem,

At least (1 - 1/k²) of the measurements will fall within

[Mean – k*SD] to [Mean + k*SD], for any k > 1. For example, with k = 2, at least 75% of the measurements lie within two standard deviations of the mean.

Empirical rule
Given a set of n measurements possessing a mound-shaped histogram, then

the interval x̄ ± s contains approximately 68% of the measurements
the interval x̄ ± 2s contains approximately 95% of the measurements
the interval x̄ ± 3s contains approximately 99.7% of the measurements.
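Chebyshev's bound can be compared with the fraction actually observed in a data set. The sketch below uses the standard statement of the theorem (at least 1 - 1/k² of the values lie within k standard deviations of the mean, for k > 1) and the six values from the outlier examples in this lecture. The function names are hypothetical, and the population standard deviation is used so that the bound holds exactly for the data set itself.

```python
import statistics


def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2


def fraction_within(values, k):
    """Observed fraction of values within k population standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    inside = sum(1 for x in values if abs(x - mean) <= k * sd)
    return inside / len(values)


data = [9, 10, 10, 10, 11, 50]
for k in (2, 3):
    print(f"k = {k}: bound = {chebyshev_lower_bound(k):.3f}, "
          f"observed = {fraction_within(data, k):.3f}")
```

Unlike the empirical rule, the Chebyshev bound requires no mound-shaped histogram, which is why it is weaker (75% rather than about 95% for k = 2).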

Chebyshev’s Rule


Page 81: Shahid Lecture-1- MKAG1273


Page 82: Shahid Lecture-1- MKAG1273

Outlier

An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.

An outlier, or outlying observation, is one that appears to deviate markedly from other members of the sample in which it occurs.

Outliers can have many anomalous causes:

• A physical apparatus for taking measurements may have suffered a transient malfunction.
• There may have been an error in data transmission or transcription.
• Outliers arise due to changes in system behaviour, fraudulent behaviour, human error, instrument error, or simply through natural deviations in populations.
• A sample may have been contaminated with elements from outside the population being examined.


Page 83: Shahid Lecture-1- MKAG1273

Identification of Outliers

There is no rigid mathematical definition of what constitutes an outlier. Determining whether or not an observation is an outlier is ultimately a subjective exercise.

Type 1 – Determine the outliers with no prior knowledge of the data. This is essentially a learning approach. The approach processes the data as a static distribution, pinpoints the most remote points, and flags them as potential outliers.

Type 2 – Use model-based methods which assume that the data are from a normal distribution, and identify observations which are deemed "unlikely" based on mean and standard deviation.

• Chauvenet's criterion
• Grubbs' test
• Dixon's Q test


Page 84: Shahid Lecture-1- MKAG1273

Chauvenet's criterion

• A value is measured experimentally in several trials as 9, 10, 10, 10, 11, and 50.

• The mean is 16.7 and the standard deviation 16.34.

• Value 50 differs from 16.7 by 33.3, slightly more than two standard deviations.

• The probability of taking data more than two standard deviations from the mean is roughly 0.05.

• Six measurements were taken, so the statistic value (data size multiplied by the probability) is 0.05×6 = 0.3.

• Because 0.3 < 0.5, according to Chauvenet's criterion, the measured value of 50 should be discarded (leaving a new mean of 10, with standard deviation 0.7).
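The steps above can be sketched in code. This is a minimal illustration, assuming normally distributed data and the sample standard deviation; the function name `chauvenet_outliers` is hypothetical. The two-sided normal tail probability is computed exactly with `math.erfc` rather than rounded to 0.05, so the statistic for the value 50 comes out somewhat below the 0.3 obtained with the rounded probability, with the same conclusion.

```python
import math
import statistics


def chauvenet_outliers(values, threshold=0.5):
    """Return the values for which n * P(|Z| > z) falls below the threshold."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    flagged = []
    for x in values:
        z = abs(x - mean) / sd
        # Two-sided tail probability of a standard normal variable.
        probability = math.erfc(z / math.sqrt(2))
        if n * probability < threshold:
            flagged.append(x)
    return flagged


trials = [9, 10, 10, 10, 11, 50]
outliers = chauvenet_outliers(trials)
kept = [x for x in trials if x not in outliers]

print("discarded:", outliers)
print(f"new mean = {statistics.mean(kept)}, "
      f"new standard deviation = {statistics.stdev(kept):.1f}")
```

Only the value 50 is flagged; the remaining five values have a mean of 10 and a standard deviation of about 0.7, as stated above.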


Page 85: Shahid Lecture-1- MKAG1273

Grubbs' test detects one outlier at a time.

If Gcalculated > Gtable, then reject the questionable point.

Grubbs' test

Example: 9, 10, 10, 10, 11, and 50
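The test statistic for this example can be computed as sketched below. The definition used here is the commonly quoted one, G = max|x_i - x̄| / s with s the sample standard deviation; it is an assumption because the formula itself is not reproduced in this transcript. The critical value Gtable is not computed: it must be read from a published table for the chosen sample size and significance level, and the parameter `g_table` is a hypothetical name for that value.

```python
import statistics


def grubbs_statistic(values):
    """Return the most extreme value and its G = |x - mean| / s."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    suspect = max(values, key=lambda x: abs(x - mean))
    return suspect, abs(suspect - mean) / sd


def grubbs_reject(values, g_table):
    """Apply the decision rule: reject the suspect if G exceeds g_table."""
    _, g = grubbs_statistic(values)
    return g > g_table


trials = [9, 10, 10, 10, 11, 50]
suspect, g_calculated = grubbs_statistic(trials)
print(f"suspect value = {suspect}, G calculated = {g_calculated:.2f}")
```

Because the test detects one outlier at a time, it would be repeated on the reduced data set after a point is rejected.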


Page 86: Shahid Lecture-1- MKAG1273

Grubbs' test


Page 87: Shahid Lecture-1- MKAG1273

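To apply a Q test, arrange the data in order of increasing values and calculate Q as defined:

$$Q = \frac{\text{gap}}{\text{range}}$$

Where gap is the absolute difference between the outlier in question and the closest number to it, and range is the difference between the largest and smallest values.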

If Qcalculated > Qtable then reject the questionable point.

Dixon's Q test, or simply the Q test

Example: 9, 10, 10, 10, 11, and 50
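For this example, Q can be computed as sketched below, taking Q as the gap divided by the range (largest minus smallest value). The function name `dixon_q` is hypothetical, and the sketch simply tests whichever end of the sorted data has the larger gap. The critical value Qtable is not included and must be taken from a published table for the sample size and confidence level in use.

```python
def dixon_q(values):
    """Return the suspect end value and Q = gap / range for sorted data."""
    data = sorted(values)
    data_range = data[-1] - data[0]
    if data_range == 0:
        raise ValueError("all values are identical; Q is undefined")
    gap_low = data[1] - data[0]      # gap for the smallest value
    gap_high = data[-1] - data[-2]   # gap for the largest value
    if gap_high >= gap_low:
        return data[-1], gap_high / data_range
    return data[0], gap_low / data_range


trials = [9, 10, 10, 10, 11, 50]
suspect, q_calculated = dixon_q(trials)
print(f"suspect value = {suspect}, Q calculated = {q_calculated:.3f}")
```

Here the gap is 50 - 11 = 39 and the range is 50 - 9 = 41, so Q is about 0.951.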


Page 88: Shahid Lecture-1- MKAG1273

Dixon's Q test


Page 89: Shahid Lecture-1- MKAG1273

Common Characteristics of Water Resources Data

1. A lower bound of zero. No negative values are possible.
2. Outliers occur regularly; outliers on the high side are especially common in water resources data.
3. Non-normal distribution of data.
4. Positive skewness is common.
5. Data reported only as below or above some threshold (censored data). Examples include concentrations below one or more detection limits, annual floods above a given level, etc.
6. Seasonal patterns. Values tend to be higher or lower in certain seasons of the year.
7. Positive autocorrelation. Consecutive observations tend to be strongly correlated with each other. High values tend to follow high values and low values tend to follow low values.
8. Dependence on other uncontrolled variables. Water discharge from a well depends strongly on hydraulic conductivity, sediment grain size, or some other variable.


Page 90: Shahid Lecture-1- MKAG1273
