Quantitative Methods:

Module 12:  Advanced Regression Analysis

 

Introduction

Simple Regression Analysis

 

y = a + bx

Multiple regression (more than one right-hand side variable) and non-linear regression (equations other than y = a + bx)

Statistical tests to be employed, alongside visual ones, in evaluating and using regression and correlation analyses.

 

 

Multiple Regression Analysis

The basic idea is extended from simple linear regression to two or more variables on the right-hand side. For example, there may be three x variables:

 

y = A + Bx + Cz + Dt

 

There are three independent variables, x, z and t; their coefficients are B, C and D, and the constant is A. This is still a linear equation: there are no squared, cubed, logarithmic, etc. terms.

When additional variables are added the coefficients of existing variables will be re-estimated. 
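The re-estimation effect can be seen in a small sketch. It uses made-up data and a hypothetical helper, fit_least_squares, that solves the least-squares normal equations by plain Gaussian elimination; a statistics package would normally do this step. Only two x variables are used to keep it short.

```python
def fit_least_squares(rows, y):
    """Return [constant, coef_1, ..., coef_k] minimising the sum of squared residuals."""
    X = [[1.0] + list(r) for r in rows]   # the leading 1 carries the constant term
    p = len(X[0])
    # Normal equations (X'X) beta = X'y, held as an augmented matrix
    M = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(p):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][p] / M[i][i] for i in range(p)]


x = [1, 2, 3, 4, 5, 6]
z = [2, 1, 4, 3, 6, 5]
y = [1 + 2 * xi + 3 * zi for xi, zi in zip(x, z)]   # made-up data: y = 1 + 2x + 3z exactly

simple = fit_least_squares([(xi,) for xi in x], y)   # y = A + Bx
both = fit_least_squares(list(zip(x, z)), y)         # y = A + Bx + Cz

print("x only: ", [round(c, 3) for c in simple])
print("x and z:", [round(c, 3) for c in both])
```

With these invented numbers the coefficient of x is about 4.49 when z is left out and 2 once z is included, because x was previously standing in for part of the effect of z.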

 

Similarities and Differences Between Simple and Multiple

As in simple regression, the sum of squared residuals is minimized. Inevitably, the formulae for calculating the coefficients are more complicated.

Similarities

a) Substitution of x values in the regression equation to make predictions

b) F test to measure closeness of fit

c) Checking of residuals for randomness

d) Use of SE(Pred) to measure accuracy.

Differences

a) Adjustment of correlation coefficient to allow for degrees of freedom

b) t test to determine variables to leave out

c) Check for collinearity.

 

Technique: The sum of squared residuals is minimized as in simple regression. Finding coefficients is more complicated, but no big deal when computers are used.

 

Scatter Diagrams

 

It is not possible to draw in two dimensions a scatter diagram involving several variables. Therefore, more than one scatter diagram will need to be drawn: one for the y variable combined with each of the x variables. This will allow you to gain an approximate idea of whether the variables are related.

 

Correlation Coefficient

R-bar squared is used instead of R squared. It is adjusted to allow for the degrees of freedom lost when additional x variables are included.
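The notes do not spell out the adjustment. A minimal sketch, assuming the usual textbook form R-bar squared = 1 - (1 - R squared)(n - 1)/(n - k - 1), with n observations and k x variables; the function name and the numbers are made up for illustration.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjust R squared for the k + 1 degrees of freedom used by the regression."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# Made-up example: R squared of 0.90 from 20 observations and 3 x variables
r_bar_sq = adjusted_r_squared(0.90, n=20, k=3)
print(round(r_bar_sq, 5))
```

The adjusted figure is always a little lower than R squared, and the gap widens as more x variables are added to a small data set.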

 

Collinearity

 

Occurs when two or more of the x variables are highly correlated with each other. In these circumstances, the collinear variables are contributing essentially the same information to the regression.

 

Test: Check the correlation coefficients for all x variables taken in pairs. If any of the coefficients are high, the corresponding two variables are collinear.
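The pairwise check can be sketched as follows. The data are invented, and the cut-off of 0.8 is purely an assumption for illustration, since the notes say only that the coefficient should be "high".

```python
from itertools import combinations
import math


def correlation(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    s_uv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    s_uu = sum((a - u_bar) ** 2 for a in u)
    s_vv = sum((b - v_bar) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


# Made-up x variables; x2 moves almost in step with x1
x_vars = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 11],
    "x3": [5, 1, 4, 2, 3],
}
THRESHOLD = 0.8   # hypothetical cut-off for "high"

flagged = []
for (name_a, a), (name_b, b) in combinations(x_vars.items(), 2):
    r = correlation(a, b)
    print(name_a, name_b, round(r, 3))
    if abs(r) > THRESHOLD:
        flagged.append((name_a, name_b))

print("Possibly collinear pairs:", flagged)
```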

 

Remedies:

a) Use only one of the variables (which one is a subjective call)

b) Amalgamate the variables (for instance, add them together if this has meaning)

c) Substitute one of the variables with another that has similar meaning but a lower correlation with the remaining one of the pair.

The test and the remedies are not precise. It is more important to be aware of the problem and the restrictions it places on interpretation than to delve into the technicalities lying behind it.

 

Dummy Variables

Variables that are in two-category form (1/0, yes/no) can be represented by a dummy variable and included in the regression. The coefficient represents the amount of change in y associated with the category coded 1; its sign shows whether this is an increase or a decrease.
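A small sketch with made-up data shows how the coefficient of a dummy variable reads. When the dummy is the only x variable, the fitted coefficient is simply the difference between the mean of y in the two categories.

```python
# Made-up data: d = 1 if the observation is in the "yes" category, else 0
d = [0, 0, 0, 1, 1, 1]
y = [10, 12, 11, 15, 17, 16]

n = len(d)
d_bar = sum(d) / n
y_bar = sum(y) / n

# Simple least-squares regression of y on the dummy: y = a + b*d
b = sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y)) / sum((di - d_bar) ** 2 for di in d)
a = y_bar - b * d_bar

mean_no = sum(yi for di, yi in zip(d, y) if di == 0) / d.count(0)
mean_yes = sum(yi for di, yi in zip(d, y) if di == 1) / d.count(1)

print("constant a:", a)          # mean of y when d = 0
print("coefficient b:", b)       # shift in y when d = 1
print("difference in group means:", mean_yes - mean_no)
```

A positive b indicates that y is higher in the category coded 1, a negative b that it is lower.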

 

Non-Linear Regression Analysis

The relationship between variables may not always be of a linear form. In non-linear regression the right-hand side variables may appear in squared, cubed, logarithmic, etc. form, and a scatter diagram between the y variable and an x variable need no longer be an approximate straight line.

 

The major ways of carrying out a non-linear regression rely on converting or transforming the non-linear relationship so that it can be treated as if it were linear.

 

Curvilinear regression

In curvilinear regression, squared, cubed, etc. terms in an x variable are treated as separate variables in a multiple regression.

Technique:

1) Take a non-linear equation: y = a + bx + cx^2

2) Treat the equation as if it were in the form y = a + bx + cz, where z = x^2

3) Run a multiple regression to determine the coefficients a, b and c.

4) After estimation, x^2 is restored to the equation for predictive purposes.
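The four steps above can be sketched in a few lines. The data are made up, and solve_normal_equations is a hypothetical helper that fits a multiple regression by Gaussian elimination; any regression routine could take its place.

```python
def solve_normal_equations(rows, y):
    """Least-squares fit; returns [constant, coef_1, ..., coef_k]."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    M = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(p)]
    for c in range(p):                      # Gauss-Jordan with partial pivoting
        pivot = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(p):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [u - f * v for u, v in zip(M[r], M[c])]
    return [M[i][p] / M[i][i] for i in range(p)]


# Step 1: made-up data that follow y = 2 + 3x + 0.5x^2 exactly
x = [0, 1, 2, 3, 4, 5]
y = [2 + 3 * xi + 0.5 * xi ** 2 for xi in x]

# Step 2: treat x^2 as a separate variable z
z = [xi ** 2 for xi in x]

# Step 3: multiple regression of y on x and z
a, b, c = solve_normal_equations(list(zip(x, z)), y)

# Step 4: restore x^2 when predicting
def predict(x_new):
    return a + b * x_new + c * x_new ** 2

print(round(a, 4), round(b, 4), round(c, 4))
print(round(predict(6), 4))
```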

 

Transformations

Transform exponential functions into linear ones by taking logarithms (natural logarithms, so that log e = 1):

y = a*e^(bx)

log y = log (a*e^(bx))

log y = log a + b*x

This new equation has the form Y = A + BX, where:

Y = log y, A = log a, B = b, X = x

 

Summarized: Two variables, x and y, are thought to be related by an exponential function. The y variable is transformed by taking its logarithm. Linear regression is applied to log y and x as if they were two new variables. The coefficient estimates obtained from this linear regression can be translated back into the coefficients of the exponential function, which can then be used to forecast.

The underlying relationship between y and x is not itself being altered. There is no sense in which a curved relationship is being forced to be linear; the same relationship is simply expressed in a different mathematical form. The ‘trick’ of expressing the same relationship in a different form makes it possible to use regression analysis.
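A minimal sketch of the transformation, assuming natural logarithms and made-up data that follow y = 2*e^(0.3x) exactly. The variable names mirror the Y = A + BX notation used above; forecast is a hypothetical helper.

```python
import math

x = [0, 1, 2, 3, 4]
y = [2 * math.exp(0.3 * xi) for xi in x]   # made-up data: y = 2*e^(0.3x)

# Transform: Y = log y (natural log); X is just x
Y = [math.log(yi) for yi in y]

# Simple linear regression of Y on x
n = len(x)
x_bar = sum(x) / n
Y_bar = sum(Y) / n
B = sum((xi - x_bar) * (Yi - Y_bar) for xi, Yi in zip(x, Y)) / sum((xi - x_bar) ** 2 for xi in x)
A = Y_bar - B * x_bar

# Translate back to the coefficients of the exponential function
a = math.exp(A)   # because A = log a
b = B             # because B = b


def forecast(x_new):
    return a * math.exp(b * x_new)


print(round(a, 4), round(b, 4), round(forecast(5), 4))
```

The regression is run on the transformed values, but the forecast is made with the original exponential form.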

 

Sometimes it may be necessary to try several transformations, using the squares, square roots, reciprocals, logarithms, etc. of one or both variables, to find the scatter diagram of transformed variables that looks most like a straight line. Carrying out regression analyses with several types of transformation to find the one with the best statistical results (highest R squared, random residuals) can be dangerous. It is usually possible eventually to come up with a satisfactory result, but there may be no sound reason for it and the relationship may be purely associative. It is best to base the choice of transformation on logic or prior knowledge. A trial and error approach to transformation should not be ruled out, but it should be used with caution.

 

Statistical Basis of Regression and Correlation

The statistical basis of regression and correlation allows all aspects of the relationship to be tested statistically rather than visually. For example, the randomness of residuals can be tested more precisely, on the basis of sampling theory.

Calculations are made from sample data. This is the basis of significance tests of whether the hypothesis of a straight-line relationship in the population is true. It is also the basis for determining the accuracy of regression predictions.

Measuring Closeness of Fit

Correlation measures the strength of a relationship, using the correlation coefficient (r), its square (R squared) or, in multiple regression, R-bar squared.

ANOVA can also be used to do this, as it can be shown that:

Mean Square (Regression) / Mean Square (Error) has an F-distribution. Here’s how you do it.

ANOVA Technique

 

1) Take mean of x values

2) Take mean of y values

3) Find regression equation

4) Find correlation coefficient

5) Find Residuals and Best Fit Values

6) Find SS, SS(Regression) and SS(Residual) as follows:

SS = Σ (y - mean of y)^2

SS(Regression) = Σ (best fit value - mean of y)^2

SS(Residual) = Σ (Residual)^2

Total SS = SS(Regression) + SS(Residual)

 

ANOVA Table

 

Variation                          Degrees of Freedom   Sums of Squares   Mean Square                       F
Explained by Regression            k                    SS(regression)    MST = SS(regression)/k            MST/MSE
Error or Unexplained (residuals)   n - k - 1            SS(residuals)     MSE = SS(residuals)/(n - k - 1)
Total                              n - 1                SS

k: 1 for each independent variable.

n - k - 1: number of observations, less 1 degree of freedom for each of the k coefficients, less 1 for the mean of y.
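The entries of the ANOVA table can be worked through for a small made-up data set with one x variable (k = 1). This is only a sketch of the arithmetic; the resulting F ratio would still need to be compared with the critical value from F tables for k and n - k - 1 degrees of freedom.

```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]      # made-up data
n, k = len(x), 1

# Steps 1 to 3: means and regression equation y = a + bx
x_bar = sum(x) / n
y_bar = sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

# Step 5: best fit values and residuals
fitted = [a + b * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Step 6: sums of squares
ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_regression = sum((fi - y_bar) ** 2 for fi in fitted)
ss_residual = sum(e ** 2 for e in residuals)

# Mean squares and F ratio, as in the table
ms_regression = ss_regression / k
ms_error = ss_residual / (n - k - 1)
f_ratio = ms_regression / ms_error

print(round(ss_total, 4), round(ss_regression, 4), round(ss_residual, 4))
print(round(f_ratio, 4))
```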

 

 

 

Testing that Residuals are Random

Regression hypothesizes a true linear relationship, with deviations from it caused by minor random disturbances. Not only is the randomness part of the hypothesis, it is also intuitively reasonable. If the residuals are not random, they must have a pattern. If there is a pattern, the linear model is not adequate and should be revised or altered to incorporate it. If the residuals are random, the linear model must be the best pattern that can be obtained from the data.

 

The runs test is a common example of a statistical test for randomness. A run is a group of consecutive residuals with the same sign.

The test is based on an ‘expected’ number of runs, which is the number of runs that would be most likely to occur if the residuals were random. The expected number of runs can be calculated using the basic ideas of probability. It is compared with the observed number of runs counted in the residuals. If the actual number differs from the expected number by a large margin, the residuals will be assumed to be non-random.

A significance test will indicate whether any difference between the observed and expected numbers of runs is sufficient to reject the hypothesis that the residuals are random.

 

Technique:

1) Hypothesis: residuals are random

2) Evidence: sample of residuals

3) Significance: 5%

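4) Count the number of positive residuals and the number of negative residuals; note these as n1 and n2. Count the number of runs.

5) Compare the observed number of runs with the critical values in a runs test table. If the observed number of runs is below the lower critical value or above the upper critical value, then the hypothesis is rejected.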

 

If n1 or n2 is larger than 20, tables can't be used. Instead, use:

Mean = 2n1n2/(n1 + n2) + 1

SD = SQRT( 2n1n2(2n1n2 - n1 - n2) / ((n1 + n2)^2 (n1 + n2 - 1)) )

Critical values are not taken from tables; they are 2 SDs above and below the mean (for 5% significance).
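The large-sample version of the runs test can be sketched as below. Here n1 and n2 are the numbers of positive and negative residuals, as the mean and SD formulas require. The function name, the handling of zero residuals and the example data are assumptions made for illustration.

```python
import math


def runs_test(residuals, n_sd=2.0):
    """Runs test using the mean +/- 2 SD rule (roughly 5% significance).

    Assumes at least one positive and one negative residual.
    Residuals that are exactly zero are skipped (an assumption).
    """
    signs = [r > 0 for r in residuals if r != 0]
    n1 = sum(signs)               # number of positive residuals
    n2 = len(signs) - n1          # number of negative residuals
    # A new run starts every time the sign changes
    runs = 1 + sum(1 for prev, cur in zip(signs, signs[1:]) if prev != cur)
    mean = 2 * n1 * n2 / (n1 + n2) + 1
    sd = math.sqrt(2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
                   / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    lower, upper = mean - n_sd * sd, mean + n_sd * sd
    return {"n1": n1, "n2": n2, "runs": runs, "mean": mean, "sd": sd,
            "random": lower <= runs <= upper}


# Made-up residuals that alternate in sign every time: far too many runs
alternating = [1.0, -1.0] * 25
result = runs_test(alternating)
print(result)
```

In this invented example there are 50 runs against an expected 26, so the hypothesis of randomness is rejected. Too few runs (long stretches of the same sign) would lead to rejection in the same way.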

 

Deciding Which Variables to Retain

 

Variables are retained if they are significant, that is, if they have a significant effect on the y variable.

 

Technique:

 

1) The hypothesis is that the true population coefficient for the variable is 0.

2) The evidence is the set of observations from which the regression coefficients have been estimated. As well as the coefficient of the variable in question, its standard error will also have been calculated.

3) The significance level is usually 5%.

4) Degrees of freedom are:

 

n - k - 1, where n = number of observations, and k = number of x variables in the regression.

 

k + 1 degrees of freedom are lost because of the need to estimate the coefficients of the k variables and the constant term. The critical t value is t.025 for n - k - 1 degrees of freedom, found from the t tables.

5) The observed t value is tObs = (Coefficient estimate - 0)/(Standard error of coefficient)

6) If tObs exceeds t.025 in absolute value, then the hypothesis is rejected: the variable does have a significant effect. If it is less, then the hypothesis is accepted and the variable is eliminated.
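Steps 5 and 6 amount to a one-line calculation and a comparison. In this sketch the coefficient estimates and standard errors are made up, retain_variable is a hypothetical helper, and the critical value is read from t tables rather than computed (2.228 is t.025 for 10 degrees of freedom).

```python
def retain_variable(coefficient, standard_error, t_critical):
    """Return the observed t value and whether the variable should be kept."""
    t_obs = (coefficient - 0) / standard_error
    return t_obs, abs(t_obs) > t_critical


T_CRITICAL = 2.228   # t.025 for n - k - 1 = 10 degrees of freedom, from t tables

# Made-up coefficient estimates and standard errors for two x variables
t1, keep1 = retain_variable(1.2, 0.4, T_CRITICAL)
t2, keep2 = retain_variable(0.5, 0.4, T_CRITICAL)

print(round(t1, 3), keep1)   # significant: retain
print(round(t2, 3), keep2)   # not significant: candidate for elimination
```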

 

Accuracy of Predictions

A common use of a regression equation is to make predictions.

The forecast corresponding to an x value is found by putting the x value into the equation and calculating y. This y value is called a point estimate.

Residuals are the basis for measuring the accuracy of a prediction. Residuals are the historical differences between the actual values and the regression line. The assumption is that future values will differ from the regression line by similar amounts. The scatter of the residuals, measured by their standard error, is therefore an indication of forecasting accuracy.

The overall uncertainty in a prediction comes from two areas:

1) What the regression does not deal with: the residuals

2) What the regression does deal with: error in the estimation of the regression model coefficients.

Both are combined in the standard error of predicted values, SE(Pred).

 

If there are more than 30 points, then 95% of future values are likely to be within +/- 2 SE(Pred) of the point estimate. (If there are fewer than 30, then a t value, found from tables with the appropriate degrees of freedom, applies instead of 2.)

 

This interval is known as the forecast interval. A 95% confidence level is usually sufficient.

The forecast interval is different for each set of x values and is a vitally important measure: it shows whether predictions are sufficiently accurate to be useful in decisions. The forecast interval is the final determinant of whether the regression model is satisfactory.
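Given a point estimate and its SE(Pred), the forecast interval is simple arithmetic. The sketch below takes SE(Pred) as an input, normally read from the computer printout, rather than calculating it; the function name and numbers are made up.

```python
def forecast_interval(point_estimate, se_pred, multiplier=2.0):
    """95% forecast interval: point estimate +/- multiplier * SE(Pred).

    Use multiplier=2 with more than 30 points; with fewer, pass the
    t value from tables for the appropriate degrees of freedom.
    """
    half_width = multiplier * se_pred
    return point_estimate - half_width, point_estimate + half_width


# Made-up forecast of 100 with SE(Pred) = 5
low, high = forecast_interval(100.0, 5.0)

# Same forecast from a small sample with 10 degrees of freedom (t.025 = 2.228)
low_t, high_t = forecast_interval(100.0, 5.0, multiplier=2.228)

print(low, high)
print(round(low_t, 2), round(high_t, 2))
```

Whether an interval of this width is acceptable depends on the decision the forecast is meant to support.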

 

 

Regression Analysis Summary

The steps below assume the problem to be tackled is one of making predictions, time-related or cross-sectional, rather than just establishing the existence of a relationship, which is a solely statistical application of regression.

 

1) Propose a tentative Model

Use scatter diagrams and prior knowledge to decide what the regression equation might be. Make tentative decisions as to which x variables to include and what transformations to use to handle any curvature. Take into account the type of data available.

 

2) Run the Regression and check closeness of fit

Use R-bar squared, and possibly an ANOVA table or computer printout, to check that a sufficiently high proportion of the original variation in the y variable has been explained.

 

3) Check the residuals

They should be random. A scatter diagram of residuals against fitted values demonstrates this visually, or a runs test will permit the check to be made statistically.

 

4) Decide whether any x variables could be discarded

A t test on each x variable coefficient will indicate whether the variable has a significant effect on the y variable. If not, discard it, using the test in conjunction with prior knowledge.

 

5) Check for collinearity

A correlation matrix for all the x variables shows which are collinear and therefore have unreliable coefficient estimates.

 

6) Decide if the regression estimates are accurate enough for the decision

 

SE(Pred) is the basis for calculating confidence intervals for predictions. These can be contrasted with the accuracy required for the decision at hand.

7) If necessary, formulate a new regression model

If any of the checks give unsatisfactory results, it is necessary to return to stage 1 and try again with a new model.

 

Key Message from the module

1) Multiple regression analysis makes the extension beyond simple regression. It allows changes in one variable (the y variable) to be explained by changes in several other variables (the x variables).

Multiple regression analysis is based on the same principle, the least-squares criterion, as simple regression. However, the addition of the extra x variables does bring about added complications.

Similarities

a) Substitution of x values in the regression equation to make predictions

b) F test to measure closeness of fit

c) Checking of residuals for randomness

d) Use of SE(Pred) to measure accuracy.

Differences

a) Adjustment of correlation coefficient to allow for degrees of freedom

b) t test to determine variables to leave out

c) Check for collinearity.

2) Regression can also model ‘curved’ relationships between variables. This is done by transforming one or more variables so that the equation can be handled as if it were linear. The possibilities are wide, allowing a variety of non-linear relationships to be modeled through regression.

Caution must be taken: trying one transformation after another until a satisfactory regression equation is found carries the risk of causality being forgotten. Regression should be used to confirm prior beliefs, rather than to find a ‘belief’. The latter may lead to many purely associative relationships between variables. You must always ask ‘Is the regression sensible?’

Strike the correct balance between statistical and non-statistical factors. For example, the t test for inclusion of variables in a multiple regression should be taken carefully into account, but not to the exclusion of other factors. The profusion of complex data produced by regression analyses can promote a spurious sense of accuracy and a spurious sense of the importance of the statistical aspects.