shahid lecture-7- mkag1273


Page 1: Shahid Lecture-7- MKAG1273

Dr. Shamsuddin Shahid

Department of Hydraulics and Hydrology

Faculty of Civil Engineering, Universiti Teknologi Malaysia

Room No.: M46-332; Phone: 07-5531624; Mobile: 0182051586 Email: [email protected]

MAL1303: STATISTICAL HYDROLOGY

Non-parametric Regression

11/23/2015 Shamsuddin Shahid, FKA, UTM


Page 2: Shahid Lecture-7- MKAG1273

Simple Linear Regression: Revisited


Page 3: Shahid Lecture-7- MKAG1273

Test of Significance of Slope

Null Hypothesis, H0: There is no change, m = 0

Alternative Hypothesis, HA: There is a change, m ≠ 0

If |t(calculated)| > t(critical, α, n-2), the null hypothesis is rejected and the change is significant.

Here, t(calculated) = 3.59 and t(critical, 0.05, 10) = 2.23.

As t(calculated) > t(critical, 0.05, 10), the null hypothesis is rejected. The change is significant.

A change in rainfall of 1 mm corresponds to a change in discharge of 1.08 cumec, at the 95% level of confidence.


Page 4: Shahid Lecture-7- MKAG1273

Test of Significance of Intercept

Null Hypothesis, H0: The intercept is zero, c = 0

Alternative Hypothesis, HA: The intercept is not zero, c ≠ 0

If |t(calculated)| > t(critical, α, n-2), the null hypothesis is rejected and the intercept is significantly different from zero.

Here, t(calculated) = 0.11 and t(critical, 0.05, 10) = 2.23.

As t(calculated) < t(critical, 0.05, 10), the null hypothesis CANNOT BE rejected. The intercept is NOT significantly different from zero.

It can be commented that discharge is not significantly different from zero at the 95% level of confidence when rainfall is zero.
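The slope and intercept tests can be sketched in plain Python. This is a minimal illustration, not the lecture's own calculation: the rainfall and discharge values are hypothetical (the lecture's data table did not survive extraction), the function name is made up for this sketch, and the critical value 2.23 is the one quoted in the lecture for α = 0.05 and n − 2 = 10.

```python
import math

def slope_intercept_tests(x, y, t_critical):
    """Fit y = m*x + c by ordinary least squares and t-test m and c against zero."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    ss_x = sum((xi - x_mean) ** 2 for xi in x)
    ss_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    m = ss_xy / ss_x
    c = y_mean - m * x_mean
    residuals = [yi - (m * xi + c) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in residuals) / (n - 2)          # residual variance
    se_m = math.sqrt(s2 / ss_x)                           # standard error of slope
    se_c = math.sqrt(s2 * (1 / n + x_mean ** 2 / ss_x))   # standard error of intercept
    t_m, t_c = m / se_m, c / se_c
    return {
        "slope": m,
        "intercept": c,
        "t_slope": t_m,
        "t_intercept": t_c,
        "slope_significant": abs(t_m) > t_critical,
        "intercept_significant": abs(t_c) > t_critical,
    }

# Hypothetical rainfall (mm) and discharge (cumec) for 12 observations
rainfall = [10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
discharge = [12, 18, 20, 29, 31, 40, 41, 50, 57, 58, 66, 72]
result = slope_intercept_tests(rainfall, discharge, t_critical=2.23)
print(result)
```

With real data, the critical value would be looked up for the chosen α and n − 2 degrees of freedom.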


Page 5: Shahid Lecture-7- MKAG1273

Residuals

The difference between the actual observation and the predicted value is called the residual.


Page 6: Shahid Lecture-7- MKAG1273

Distribution of Residuals

Residuals should be normally distributed.


Page 7: Shahid Lecture-7- MKAG1273

Distribution of Residuals

Distribution of Residuals for the present example.


Page 8: Shahid Lecture-7- MKAG1273

Abnormal Distribution of Residuals


Page 9: Shahid Lecture-7- MKAG1273

Leverage

Leverage is a measure of an "outlier" in the x direction. It is a function of the distance from the i-th x value to the middle (mean) of the x values used in the regression.


Page 10: Shahid Lecture-7- MKAG1273

Leverage

A high leverage point is one where hi > 3p/n, where p is the number of coefficients in the model (p = 2 in simple linear regression: b0 and b1).

All hi are less than 3p/n (3*2/12 = 0.5)
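The leverage check can be sketched as below. The lecture's formula for hi did not survive extraction; this sketch assumes the standard expression for simple linear regression, hi = 1/n + (xi − x̄)² / Σ(xj − x̄)², and uses hypothetical x values in which the last point lies far from the rest.

```python
def leverage(x):
    """Leverage h_i of each x value in a simple linear regression (assumed formula)."""
    n = len(x)
    x_mean = sum(x) / n
    ss_x = sum((xi - x_mean) ** 2 for xi in x)
    return [1 / n + (xi - x_mean) ** 2 / ss_x for xi in x]

def high_leverage_points(x, p=2):
    """Return (index, h_i) pairs for points with h_i > 3p/n."""
    threshold = 3 * p / len(x)
    return [(i, h) for i, h in enumerate(leverage(x)) if h > threshold]

# Hypothetical x values (n = 12, so the threshold is 3*2/12 = 0.5)
x = [10, 12, 14, 15, 17, 18, 20, 21, 23, 25, 26, 80]
print(high_leverage_points(x))
```

A useful sanity check is that the leverages always sum to p, the number of coefficients.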


Page 11: Shahid Lecture-7- MKAG1273

Leverage

One hi is more than 3p/n (3*2/12 = 0.5)


Page 12: Shahid Lecture-7- MKAG1273

Measures of Outliers in the y Direction

One measure of outliers in the y direction is the standardized residual, esi.

An extreme outlier is one for which |esi| > 3. There should be only an average of 3 of these in 1,000 observations if the residuals are normally distributed.

|esi| > 2 should occur about 5 times in 100 observations if normally distributed.

More than this number indicates that the residuals do not have a normal distribution.

Where,
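The definition of esi shown on the slide did not survive extraction. A commonly used definition, assumed in this sketch, is esi = ei / (s·√(1 − hi)), where ei is the ordinary residual, s is the residual standard error and hi is the leverage. The data are hypothetical, with one deliberately displaced y value.

```python
import math

def standardized_residuals(x, y):
    """Standardized residuals e_i / (s * sqrt(1 - h_i)) for a simple linear regression."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    ss_x = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / ss_x
    intercept = y_mean - slope * x_mean
    residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(e * e for e in residuals) / (n - 2))
    leverages = [1 / n + (xi - x_mean) ** 2 / ss_x for xi in x]
    return [e / (s * math.sqrt(1 - h)) for e, h in zip(residuals, leverages)]

# Hypothetical data: y is roughly 2x, except the sixth point, which is shifted upward
x = list(range(1, 13))
y = [2.3, 3.8, 6.4, 7.9, 10.2, 27.0, 14.1, 15.7, 18.3, 19.9, 22.2, 23.8]
e_s = standardized_residuals(x, y)
flagged = [i for i, e in enumerate(e_s) if abs(e) > 2]
print(flagged)
```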


Page 13: Shahid Lecture-7- MKAG1273

Measures of Outliers in the y Direction

An extreme outlier is one for which |esi| > 3.

|esi| > 2 should occur about 5 times in 100 observations if normally distributed.


Page 14: Shahid Lecture-7- MKAG1273

Measures of Influence of Outliers

Observations with high influence are those which have both high leverage and large outliers. These exert a stronger influence on the position of the regression line than other observations.

The two most widely used methods to measure the influence of an outlier on the regression equation are:

1. Cook's D

2. DFFITS


Page 15: Shahid Lecture-7- MKAG1273

Cook's D Method

"Cook's D" is one of the most widely used measures of influence.

The i-th observation is considered to have high influence if

Di > F(p+1,n−p) at α=0.05

where p is the number of coefficients.

For Simple Linear Regression (SLR) with more than about 30 observations, the critical value for Di would be about 2.4.


Page 16: Shahid Lecture-7- MKAG1273

DFFITS Method

DFFITS is a more robust method for diagnosing influence.

An observation is considered to have high influence if |DFFITSi| exceeds a critical value that depends on p and n.


Page 17: Shahid Lecture-7- MKAG1273

Measures of Influence of Outliers

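Cook's D: F(p+1, n−p) at α = 0.05 is F(3, 10) = 3.7. Every Di is less than 3.7.

DFFITS: 2√(pn) = 2√(2×12) = 9.80. All DFFITS values are less than 9.80. (The criterion is more commonly stated as 2√(p/n), which gives 2√(2/12) = 0.82 for this example.)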
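Both influence measures can be sketched together. The lecture's formulas were images and are not in the transcript, so this sketch assumes the standard definitions Di = ei²·hi / (p·s²·(1 − hi)²) and DFFITSi = ei·√hi / (s(i)·(1 − hi)), where s(i) is the residual standard error with observation i deleted. It compares Cook's D against the F value 3.7 quoted above and DFFITS against the commonly cited 2√(p/n); the data are hypothetical, with the last point both far out in x and off the line.

```python
import math

def influence_measures(x, y, p=2):
    """Return (Cook's D, DFFITS) lists for a simple linear regression (assumed formulas)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    ss_x = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / ss_x
    intercept = y_mean - slope * x_mean
    residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in residuals) / (n - p)
    leverages = [1 / n + (xi - x_mean) ** 2 / ss_x for xi in x]
    cooks_d, dffits = [], []
    for e, h in zip(residuals, leverages):
        cooks_d.append(e * e * h / (p * s2 * (1 - h) ** 2))
        # Residual variance with this observation deleted
        s2_deleted = ((n - p) * s2 - e * e / (1 - h)) / (n - p - 1)
        dffits.append(e * math.sqrt(h) / (math.sqrt(s2_deleted) * (1 - h)))
    return cooks_d, dffits

# Hypothetical data: y is roughly 2x, except the last point (x = 30, y = 30)
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 30]
y = [2.3, 3.8, 6.4, 7.9, 10.2, 11.8, 14.1, 15.7, 18.3, 19.9, 22.2, 30.0]
cooks_d, dffits = influence_measures(x, y)
dffits_limit = 2 * math.sqrt(2 / len(x))
for i, (d, f) in enumerate(zip(cooks_d, dffits)):
    if d > 3.7 or abs(f) > dffits_limit:
        print(f"observation {i}: D = {d:.2f}, DFFITS = {f:.2f}")
```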


Page 18: Shahid Lecture-7- MKAG1273

Abnormal Distribution of Residuals


Page 19: Shahid Lecture-7- MKAG1273

Alternative Methods for Regression

Situations such as the above frequently arise where the assumptions of constant variance and normality of residuals required by Ordinary Least Squares (OLS) regression are not satisfied, and transformations to remedy this are either not possible or not desirable.

In these situations, alternative methods are better for fitting lines to data. These include:

• Nonparametric rank-based methods

• Minimizing residual variation

• Smooths


Page 20: Shahid Lecture-7- MKAG1273

Kendall-Theil Robust Line

Kendall-Theil is a non-parametric, rank-based method.

Related to the Kendall tau rank correlation, it is a robust nonparametric line applicable when Y is linearly related to X.

The advantages of the Kendall-Theil method over OLS regression are:

• The Kendall-Theil line does not depend on the normality of residuals for the validity of significance tests

• It is not strongly affected by outliers


Page 21: Shahid Lecture-7- MKAG1273

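Kendall-Theil Robust Line

The Kendall-Theil method also tries to find the best-fit line:

Y = mX + C

where the slope m is the median of all possible pairwise slopes (Yj − Yi)/(Xj − Xi), i < j,

and the intercept is C = Ymedian − m × Xmedian.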
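A minimal sketch of the Kendall-Theil fit, assuming the slope is the median of all pairwise slopes and the intercept is median(Y) − slope × median(X), which is consistent with the worked example in this lecture. The function name and the data are hypothetical; the last y value is a deliberate outlier to show that the line barely reacts to it.

```python
from itertools import combinations
from statistics import median

def kendall_theil_line(x, y):
    """Return (slope, intercept, sorted pairwise slopes) of the Kendall-Theil line."""
    slopes = sorted(
        (yj - yi) / (xj - xi)
        for (xi, yi), (xj, yj) in combinations(zip(x, y), 2)
        if xj != xi  # pairs with tied x values have no defined slope
    )
    slope = median(slopes)
    intercept = median(y) - slope * median(x)
    return slope, intercept, slopes

# Hypothetical data: y = 3x + 1 exactly, except the last value
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [4, 7, 10, 13, 16, 19, 22, 100]
slope, intercept, slopes = kendall_theil_line(x, y)
print(len(slopes), slope, intercept)
```

For n = 8 points there are n(n − 1)/2 = 28 pairwise slopes, so the median is the average of the 14th and 15th ranked slopes.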


Page 22: Shahid Lecture-7- MKAG1273

Kendall-Theil Robust Line: Example


Page 23: Shahid Lecture-7- MKAG1273

Kendall-Theil Robust Line: Example

0.412 0.595 0.729 0.739 0.750 0.787 0.795 0.812 0.817 0.839 0.856 0.882
0.890 0.937 0.985 1.000 1.000 1.010 1.038 1.053 1.063 1.077 1.220 1.228
1.393 1.500 1.897 2.222

Median is the average of 14th and 15th slopes, i.e., (0.937+0.985)/2 = 0.961


Page 24: Shahid Lecture-7- MKAG1273

Kendall-Theil Robust Line: Example

C = 49.9 – (0.961 * 47.5) = 4.25

Y = 0.961X + 4.25


Page 25: Shahid Lecture-7- MKAG1273

Kendall-Theil Robust Line: Test of Significance

The test for significance of the Kendall-Theil linear relationship:

H0: m = 0

HA: m ≠ 0

The steps to test the significance are:

1. Calculate S as the sum of the algebraic signs of all possible pairwise slopes.

2. Find the significance value from the table using S and n.

3. Decide significance.
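Step 1 can be sketched as follows. The function name and data are hypothetical; the p-value lookup in step 2 still needs the table of Kendall's S, which is not reproduced here. The sign of each pairwise slope equals the sign of (Xj − Xi)(Yj − Yi), so no division is needed, and pairs tied in X or Y contribute zero.

```python
from itertools import combinations

def kendall_s(x, y):
    """Sum of the algebraic signs of all pairwise slopes (Kendall's S)."""
    s = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        product = (xj - xi) * (yj - yi)
        s += (product > 0) - (product < 0)  # +1, -1, or 0 for ties
    return s

# Hypothetical data with one discordant pair
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [5, 9, 8, 14, 17, 21, 24, 30]
print(kendall_s(x, y))
```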


Page 26: Shahid Lecture-7- MKAG1273

Kendall-Theil Robust Line: Test of Significance

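All 28 pairwise slopes listed in the example (n(n−1)/2 = 28 for n = 8) are positive and none are negative. Therefore,

S = 28 – 0 = 28

n = 8

For n = 8, S = 28 is the largest possible value, with an exact one-tailed probability of 1/8! ≈ 0.000025.

Two-tailed test: Significance = 2 × 0.000025 = 0.00005 (Significant)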


Page 27: Shahid Lecture-7- MKAG1273

Kendall-Theil Robust Line: Test of Significance


Page 28: Shahid Lecture-7- MKAG1273

Confidence Interval of Y


Page 29: Shahid Lecture-7- MKAG1273

Confidence Interval for Theil Slope

The method for calculating the confidence interval of the slope depends on sample size. For small sample sizes, tabulated values are used.

1. For small sample sizes, the table is used to find the critical value Xu having a p-value nearest to α/2.

2. This critical value is then used to compute the ranks Ru and Rl corresponding to the slope values at the upper and lower confidence limits for the slope.


Page 30: Shahid Lecture-7- MKAG1273

Kendall-Theil Robust Line: Confidence Interval

0.412 0.595 0.729 0.739 0.750 0.787 0.795 0.812 0.817 0.839 0.856 0.882
0.890 0.937 0.985 1.000 1.000 1.010 1.038 1.053 1.063 1.077 1.220 1.228
1.393 1.500 1.897 2.222

There are 28 slopes. The median is the average of the 14th and 15th slopes, i.e., (0.937+0.985)/2 = 0.961


Page 31: Shahid Lecture-7- MKAG1273

Kendall-Theil Robust Line: Confidence Interval

To determine a confidence interval for the slope at the 95% level of confidence (α = 0.05), the tabled critical value Xu nearest to α/2 = 0.025 for n = 8 is found to be 16 (p = 0.031).

With N = 28 pairwise slopes, therefore,

Ru = (28 + 16)/2 = 22

Rl = [(28 - 16)/2] + 1 = 7


Page 32: Shahid Lecture-7- MKAG1273

Kendall-Theil Robust Line: Confidence Interval

0.412 0.595 0.729 0.739 0.750 0.787 0.795 0.812 0.817 0.839 0.856 0.882
0.890 0.937 0.985 1.000 1.000 1.010 1.038 1.053 1.063 1.077 1.220 1.228
1.393 1.500 1.897 2.222

Ru = 22; Rl = 7 (for N = 28 slopes and Xu = 16)

Median = 0.961, with the confidence interval running from the 7th to the 22nd ranked slope: 0.795 to 1.077
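The rank calculation can be sketched as below, using the lecture's small-sample formulas Ru = (N + Xu)/2 and Rl = (N − Xu)/2 + 1, the 28 pairwise slopes listed in the example, and the tabled value Xu = 16. The function name is hypothetical, and Xu must still be read from the table.

```python
def theil_slope_interval(sorted_slopes, x_u):
    """Confidence limits for the Theil slope from the ranked pairwise slopes.

    sorted_slopes: pairwise slopes in ascending order
    x_u: tabled critical value of Kendall's S with p-value nearest alpha/2
    """
    n_slopes = len(sorted_slopes)
    r_u = (n_slopes + x_u) // 2       # rank of the upper limit
    r_l = (n_slopes - x_u) // 2 + 1   # rank of the lower limit
    # Ranks are 1-based; list indices are 0-based
    return sorted_slopes[r_l - 1], sorted_slopes[r_u - 1], r_l, r_u

slopes = [
    0.412, 0.595, 0.729, 0.739, 0.750, 0.787, 0.795, 0.812, 0.817, 0.839,
    0.856, 0.882, 0.890, 0.937, 0.985, 1.000, 1.000, 1.010, 1.038, 1.053,
    1.063, 1.077, 1.220, 1.228, 1.393, 1.500, 1.897, 2.222,
]
lower, upper, r_l, r_u = theil_slope_interval(slopes, x_u=16)
print(r_l, r_u, lower, upper)
```

The two ranks are symmetric about the median: the lower limit is the 7th slope from the bottom and the upper limit is the 7th from the top.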


Page 33: Shahid Lecture-7- MKAG1273

Kendall-Theil Robust Line: Confidence Interval

When, n 20


Page 34: Shahid Lecture-7- MKAG1273

Regression: Non-parametric

Sen’s Slope Method


Page 35: Shahid Lecture-7- MKAG1273

Example: Sen’s Slope Method

Net change is 1.6


Page 36: Shahid Lecture-7- MKAG1273

Weighted Least Squares (WLS)

With WLS, each squared residual is weighted by some weight factor in such a way that observations with greater variance have lesser weight.


Page 37: Shahid Lecture-7- MKAG1273

Weighted Least Squares (WLS)

With WLS, X and Y are weighted by,

Where,

and c is a constant, commonly taken as 3; S is the IQR of the residuals.
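The weight formulas on this slide did not survive extraction. One common choice that fits the surviving definitions (a constant c and a scale S equal to the IQR of the residuals) is the bisquare weight: ui = ei / (c·S), with wi = (1 − ui²)² when |ui| ≤ 1 and wi = 0 otherwise. The sketch below assumes that form; it may not be the exact function used in the lecture, and the residuals are hypothetical.

```python
from statistics import quantiles

def bisquare_weights(residuals, c=3.0):
    """Bisquare weights from residuals, scaled by c times their IQR (assumed form)."""
    q1, _, q3 = quantiles(residuals, n=4)
    scale = q3 - q1  # IQR of the residuals; must be non-zero
    weights = []
    for e in residuals:
        u = e / (c * scale)
        weights.append((1 - u * u) ** 2 if abs(u) <= 1 else 0.0)
    return weights

# Hypothetical residuals from a first OLS fit; the last one is an outlier
residuals = [-1.0, -0.5, 0.0, 0.4, 0.9, 12.0]
weights = bisquare_weights(residuals)
print([round(w, 3) for w in weights])
```

In an iterated scheme, these weights would be used to refit the line, the residuals recomputed, and the process repeated until the coefficients stop changing.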


Page 38: Shahid Lecture-7- MKAG1273

Weighted Least Squares (WLS)


Page 39: Shahid Lecture-7- MKAG1273

Weighted Least Squares (WLS)


Page 40: Shahid Lecture-7- MKAG1273

Smoothing

1. Smoothing is an exploratory technique, having no simple equation or significance tests associated with it.

2. The most common smooths estimate the center of the data: the conditional mean or median of Y as X changes.

3. The lack of an equation is a strength in the sense that a smooth is not constrained by some prior assumption as to the mathematical function of the relationship.


Page 41: Shahid Lecture-7- MKAG1273

Moving Average

• It computes an average of the last m consecutive observations

• In contrast to modeling in terms of a mathematical equation, the moving average merely smooths the fluctuations in the data.

• A moving average works well when the data have

– a fairly linear trend

– a definite rhythmic pattern of fluctuations
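A trailing m-point moving average, as described in the first bullet, can be sketched in a few lines. The function name and the rainfall series are hypothetical.

```python
def moving_average(values, m):
    """Average of the last m consecutive observations, for each position from m onward."""
    if m < 1 or m > len(values):
        raise ValueError("m must be between 1 and the number of observations")
    return [sum(values[i - m + 1:i + 1]) / m for i in range(m - 1, len(values))]

# Hypothetical annual rainfall series (mm) smoothed with a 3-year window
rainfall = [1200, 1350, 1100, 1500, 1250, 1400, 1300]
print(moving_average(rainfall, 3))
```

The smoothed series is m − 1 values shorter than the input, because the first average needs m observations.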


Page 42: Shahid Lecture-7- MKAG1273

Example of Moving Average


Page 43: Shahid Lecture-7- MKAG1273

Example of Moving Average
