Regression Diagnostics

The strength of a regression model, however powerful it may appear on a computer screen, rests upon its assumptions being reasonably valid in practice. The least squares regression model is a powerful research tool, but when applied inappropriately it can easily produce nonsensical results. This danger is particularly acute now that we have easy access to powerful computers which allow us to run a multitude of regressions at virtually no cost. Unfortunately, there can be a tendency for the researcher to arrive through trial and error at a regression which looks good despite possibly fragile foundations. Good data analysis, however, requires that we take modelling, and the assumptions upon which it rests, seriously. With the wide availability of computer graphics, it is now possible to explore the properties of our models rather more fully. Regression diagnostics allow us to determine whether there is anything strange about any of the observations. Data which are in some way strange can arise in several ways:

1. There may be gross errors in either the response or the explanatory variables.
2. The linear model may be inadequate to describe the systematic structure of the data.
3. It may be more appropriate to analyse the data in another scale, e.g. logarithmic.
4. It may be that the error distribution of the response variable is not normal.
The kinds of questions we want to investigate, so as to check whether the model assumptions are reasonably valid in practice, are as follows:

1. Does the relation between Y and the Xs follow a linear pattern?
2. Are the residuals approximately normally distributed?
3. Are the residuals reasonably homoscedastic?
4. Are the residuals autocorrelated?
5. Do all data points contribute roughly equally to determining the position of the line, or do some points exert undue influence on the outcome?
6. Are there any outlying data points which clearly do not fit the general pattern?

A number of plots can be produced, but it has to be remembered that they are not always sufficiently powerful and they can be misleading. Such plots are as follows:
1. Scatter plot of y v xi. Use with care, but may suggest non-linearity.
2. Residuals/standardised residuals v xi. The presence of a curvilinear relationship suggests that a higher-order term, e.g. quadratic, should be added to the model, or that a transformation, such as a log, should be considered. Can indicate the existence of outliers, structural breaks and non-constant variance of the error term, i.e. heteroscedasticity.
3. Residuals v explanatory variables not in the model. The presence of a relationship would suggest that the explanatory variable should be included in the model.
4. Residuals v ŷ (the fitted values). If the variance of the residuals changes with the predicted values, then heteroscedasticity is indicated. Outliers, non-linearity and structural breaks may also be indicated.
5. Residuals v time. In the case of time series data, correlation between the error terms can be detected, suggesting the presence of autocorrelation. This may indicate missing variable(s) in the model.
6. Variables v time. A problem associated with non-stationary variables, and frequently faced by econometricians when dealing with time series data, is the spurious regression problem. If at least one of the explanatory variables in a regression equation is non-stationary, in the sense that it displays a distinct stochastic trend, it is very likely that the dependent variable in the equation will display a similar trend. If such a problem is detected, then error correction models (ECM) and cointegration analysis will have to be considered.
7. Normal plot of the residuals. The use of normality plots can help detect abnormalities in the data and the model. If the model is correctly specified, then the residuals should look like a sample from a normal distribution.

Note: Any systematic pattern in the residuals of a regression equation should be regarded as suggestive of the possibility of misspecification.
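To make the list concrete, here is a minimal Python sketch, an addition to these notes (the original course used MINITAB), that produces four of the plots above for a simulated data set using statsmodels and matplotlib. The data and variable names are purely illustrative.

```python
# Sketch: common residual diagnostic plots (items 1, 2, 4 and 7 above).
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 100)    # simulated linear relation

X = sm.add_constant(x)                        # add an intercept column
model = sm.OLS(y, X).fit()
resid = model.resid
fitted = model.fittedvalues

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].scatter(x, y)                      # 1. y v x: check linearity
axes[0, 0].set_title("y v x")
axes[0, 1].scatter(x, resid)                  # 2. residuals v x: curvature, outliers
axes[0, 1].axhline(0, color="grey")
axes[0, 1].set_title("residuals v x")
axes[1, 0].scatter(fitted, resid)             # 4. residuals v fitted: heteroscedasticity
axes[1, 0].axhline(0, color="grey")
axes[1, 0].set_title("residuals v fitted y")
sm.qqplot(resid, line="s", ax=axes[1, 1])     # 7. normal plot of the residuals
axes[1, 1].set_title("normal Q-Q plot")
plt.tight_layout()
plt.show()
```

Under a correctly specified model, the residual plots should show a patternless horizontal band and the normal plot should be close to a straight line.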
Normality Tests

Recall that one of the assumptions of the classical regression model is that the errors are normally distributed about their zero mean. The assumption is necessary if the inferential aspects of classical regression (t tests, F tests, etc.) are to be valid in small samples. There are several tests of normality that can be used.

1. Histogram of residuals. A simple graphical device, but rather subjective.
2. Normal probability plot. A comparatively simple graphical device. MINITAB will produce normal scores by means of the NSCORES command. These can be plotted against the residuals, and an approximately straight line should be produced if the residuals are normally distributed. There should also be an extremely high correlation between the residuals and the nscores; statistical tables will be required to check the significance of the correlation.
3. Normal probability tests produced by MINITAB: the Anderson-Darling normality test, the Ryan-Joiner normality test and the Kolmogorov-Smirnov normality test. These tests are constructed using different assumptions about the data (for details see the HELP facility within MINITAB). They all take the null hypothesis to be one of normality. Therefore, normality of the residuals will be rejected if the quoted p-value is smaller than the significance level, α. (Rough analogues of these tests, and of the JB test below, are sketched in code after this list.)
4. The Jarque-Bera test for normality. A test of normality which is found in a number of econometric packages is the Jarque-Bera (JB) test. This is an asymptotic, or large-sample, test and is based on the OLS residuals. The test hinges on the values of skewness and kurtosis, which for a normal distribution are 0 and 3 respectively. These are measured by the third and fourth moments of the OLS residuals.

Under a null hypothesis of normally distributed disturbances, we have skewness $\alpha_3 = 0$ and kurtosis $\alpha_4 = 3$. It can be shown that

$$Z_3 = \alpha_3 \sqrt{\frac{n}{6}} \qquad \text{and} \qquad Z_4 = (\alpha_4 - 3)\sqrt{\frac{n}{24}}$$

both have a standard normal distribution in large samples. Therefore, $Z_3^2 + Z_4^2$ will have a $\chi^2$ distribution with 2 degrees of freedom. Hence we test

$$H_0: \text{residuals are normally distributed, i.e. } \alpha_3 = 0 \text{ and } \alpha_4 = 3$$
$$H_1: \text{residuals are not normally distributed, i.e. } \alpha_3 \neq 0 \text{ or } \alpha_4 \neq 3 \text{ (or both)}$$

We reject $H_0$ at the $\alpha\%$ significance level if

$$JB = Z_3^2 + Z_4^2 = \frac{n}{6}\,\alpha_3^2 + \frac{n}{24}\,(\alpha_4 - 3)^2 = n\left[\frac{\alpha_3^2}{6} + \frac{(\alpha_4 - 3)^2}{24}\right] \ge \chi^2_{\alpha}(2)$$

where $n$ is the sample size, $\alpha_3$ is skewness and $\alpha_4$ is kurtosis. Now, from the sample of residuals,

$$m_2 = \frac{\sum e_i^2}{n}, \qquad m_3 = \frac{\sum e_i^3}{n}, \qquad m_4 = \frac{\sum e_i^4}{n}$$

and it can be shown that $\alpha_3^2 = m_3^2 / m_2^3$ and $\alpha_4 = m_4 / m_2^2$.
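The tests above are quoted for MINITAB, but they are easy to reproduce elsewhere. Below is a minimal Python sketch, added to these notes, that runs rough analogues of the three MINITAB tests with scipy and then computes the JB statistic directly from the moment formulas above. scipy's Shapiro-Wilk test stands in for Ryan-Joiner, to which it is closely related, and the simulated residuals are illustrative only.

```python
# Sketch: normality tests on OLS residuals (scipy analogues of the
# MINITAB tests above, plus the Jarque-Bera statistic by hand).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
e = rng.normal(0, 1, 200)             # stand-in for OLS residuals
e = e - e.mean()                      # OLS residuals average zero when a constant is fitted
n = len(e)
alpha = 0.05                          # significance level

# Anderson-Darling: reject normality if the statistic exceeds the critical value
ad = stats.anderson(e, dist="norm")
print("A-D:", ad.statistic, "> 5% crit?", ad.statistic > ad.critical_values[2])

# Shapiro-Wilk (closely related to MINITAB's Ryan-Joiner correlation test)
w, p_sw = stats.shapiro(e)
print("Shapiro-Wilk p =", p_sw, "reject?", p_sw < alpha)

# Kolmogorov-Smirnov against a normal with estimated mean and sd
d, p_ks = stats.kstest(e, "norm", args=(e.mean(), e.std(ddof=1)))
print("K-S p =", p_ks, "reject?", p_ks < alpha)

# Jarque-Bera from the sample moments m2, m3, m4 of the residuals
m2 = np.sum(e**2) / n
m3 = np.sum(e**3) / n
m4 = np.sum(e**4) / n
skew = m3 / m2**1.5                   # alpha_3
kurt = m4 / m2**2                     # alpha_4
jb = n * (skew**2 / 6 + (kurt - 3) ** 2 / 24)
p_jb = stats.chi2.sf(jb, df=2)        # JB ~ chi-square(2) under H0
print("JB =", jb, "p =", p_jb, "reject?", p_jb < alpha)
print(stats.jarque_bera(e))           # scipy's built-in version agrees
```

Note that the K-S variant here estimates the mean and standard deviation from the data, which tends to make the quoted p-value too large; MINITAB's version applies a Lilliefors-type correction for this.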
Outliers, Leverage and Influence

In regression analysis, you should always beware of points which do not fit the general pattern or exert undue influence on the outcome of our numerical summaries. There are three types of data points which should concern us: an outlier, a point of high leverage, and an influential point.

An outlier in a regression is a data point which has a large residual (usually more than three standard deviations from the mean of zero).

A point of high leverage can be defined thus: a data point has high leverage if it is extreme in the X direction, i.e. it is a disproportionate distance away from the middle of the range of the X values. Such points can exert undue influence on the outcome of an OLS regression line. They are capable of exerting a strong pull on the slope of the regression line; whether they do so or not is another matter.

An influential point is a point which, if removed from the sample, would markedly change the position of the least squares regression line. Hence, influential data points pull the regression line in their direction. Note that influential data points do not necessarily produce large residuals, that is, they are not always outliers as well, although they can be. It is precisely because they draw the regression line towards themselves that they may end up with small residuals. Conversely, an outlier is not necessarily an influential point, particularly when it is a point with little leverage.

In general we note: outliers are not necessarily influential, but they can be (depending on leverage); high leverage points are not always influential; and influential points are not necessarily outliers.

The presence of outliers or of influential points often gives us a clear signal that our model is probably misspecified. In terms of visual displays, outliers can be spotted with residual plots, whereas influential points really need scatter plots, which are not always so meaningful when dealing with several explanatory variables. Apart from these graphical methods, we can also use some special statistics designed to detect outliers, points of leverage and points of influence.
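As a concrete illustration (again a Python sketch added to these notes, with simulated data), the leverage values are the diagonal of the hat matrix H = X(X′X)⁻¹X′, and statsmodels collects leverage together with influence measures, such as Cook's distance, a common numerical summary of influence not discussed above, in a single influence object:

```python
# Sketch: leverage and influence diagnostics via the hat matrix.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 30)
x[0] = 25.0                            # one point far out in the X direction
y = 1.0 + 0.8 * x + rng.normal(0, 1, 30)

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

# Leverage: diagonal of H = X (X'X)^{-1} X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
print("largest leverage:", h.max(), "at observation", h.argmax())

# statsmodels collects the same quantities, plus influence measures
infl = fit.get_influence()
print("hat values match:", np.allclose(h, infl.hat_matrix_diag))
print("Cook's distance of that point:", infl.cooks_distance[0][h.argmax()])
```

Here observation 0 has high leverage by construction; whether it is also influential depends on how far its y value pulls the fitted line, which is exactly the distinction drawn above.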
Studentised Residuals

A good way to detect outliers is to investigate one observation at a time, using an OLS regression with the relevant observation excluded, and testing whether the prediction error for that observation is significantly large. This can be done most easily by including an observation-specific dummy variable. For example, to investigate the i th observation in a data set, we define a dummy variable taking the value one for the i th observation and zero for all other observations. If we include this dummy in the OLS regression, its coefficient will equal the required prediction error. To test the prediction error for significance, we can examine its t ratio. This t ratio is referred to as a studentised residual. It has a Student's t distribution with (T − 1 − K − 1) degrees of freedom, where T is the number of observations and K is the number of explanatory variables.

We define the studentised residual $e_i^*$ as

$$e_i^* = \frac{e_i}{s_{(i)}\sqrt{1 - h_i}} = \frac{s}{s_{(i)}}\,e_i'$$

where $s_{(i)}$ is the standard error estimate of the regression fitted after deleting the i th observation, $h_i$ is a measure of leverage, and $e_i' = e_i / (s\sqrt{1 - h_i})$ is the standardised residual. Unusual Y values will clearly stand out.
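A short sketch (simulated data, added to these notes) verifying the dummy-variable construction numerically: the t ratio on an observation-specific dummy coincides with the externally studentised residual that statsmodels reports.

```python
# Sketch: studentised residual via an observation-specific dummy variable.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 40)
y = 3.0 + 1.5 * x + rng.normal(0, 1, 40)
y[7] += 6.0                            # make observation 7 an outlier

X = sm.add_constant(x)

# Dummy equal to 1 for observation i, 0 elsewhere
i = 7
d = np.zeros(len(y))
d[i] = 1.0
fit_dummy = sm.OLS(y, np.column_stack([X, d])).fit()
t_dummy = fit_dummy.tvalues[-1]        # t ratio on the dummy

# statsmodels' externally studentised residuals use the same
# leave-one-out standard error s(i)
e_star = sm.OLS(y, X).fit().get_influence().resid_studentized_external[i]

print("dummy t ratio:        ", t_dummy)
print("studentised residual: ", e_star)   # the two agree
```

Repeating this for each observation in turn, or simply reading off resid_studentized_external, flags any observation whose studentised residual is large relative to the t distribution.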