Quantitative Methods:
Module 12: Advanced Regression Analysis
Simple Regression Analysis
y = a + bx
Multiple regression (more than one right-hand side variable) and non-linear regression (equations other than y = a + bx)
Statistical tests to be employed, alongside visual ones, in evaluating and using regression and correlation analyses.
The basic idea is extended from simple linear regression to two or more variables on the right-hand side. For example, there may be three x variables:
y = A + Bx + Cz + Dt
The three independent variables are x, z and t; their coefficients are B, C and D; the constant is A. This is still a linear equation: there are no squared, cubed, logarithmic etc. terms.
When additional variables are added, the coefficients of the existing variables will be re-estimated.
As before, the sum of squared residuals is minimized. Inevitably the formulae for calculating the coefficients are more complicated.
As in simple regression:
a) Substitution of x values in the regression equation to make predictions
b) F test to measure closeness of fit
c) Checking of residuals for randomness
d) Use of SE(Pred) to measure accuracy.
Additional in multiple regression:
a) Adjustment of the correlation coefficient to allow for degrees of freedom
b) t test to determine variables to leave out
c) Check for collinearity.
Technique: The sum
of squared residuals is minimized as in simple regression. Finding coefficients
is more complicated, but no big deal when computers are used.
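A minimal sketch of what the computer does, assuming the coefficients are found by solving the normal equations (one common way of minimizing the sum of squared residuals). The function names and the data are made up for illustration; the data are generated exactly from y = 2 + 3x - z + 0.5t so the answer can be checked.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_multiple_regression(rows, y):
    """Return [A, B, C, ...] minimizing the sum of squared residuals.

    rows holds one list of x values per observation.
    """
    design = [[1.0] + list(r) for r in rows]  # leading 1 gives the constant A
    p = len(design[0])
    # Normal equations: (X'X) * coefficients = X'y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Made-up observations of (x, z, t); y follows y = 2 + 3x - z + 0.5t exactly.
rows = [[1, 2, 3], [2, 1, 0], [3, 5, 2], [4, 0, 1], [5, 3, 6], [6, 2, 2]]
y = [2 + 3 * x - 1 * z + 0.5 * t for x, z, t in rows]
coefficients = fit_multiple_regression(rows, y)
print([round(c, 6) for c in coefficients])  # A, B, C, D
```

In practice a statistics package would be used instead; the point is only that the criterion is the same as in simple regression.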
Scatter Diagrams
It is not possible to draw in two dimensions a scatter diagram involving several variables. Therefore, more than one scatter diagram will need to be drawn: a scatter diagram for the y variable combined with each of the x variables in turn. This will allow you to gain an approximate idea of whether the variables are related.
Correlation Coefficient
“R bar squared” is used instead of R squared. It is adjusted to allow for the number of x variables in the regression (the degrees of freedom they use up).
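The notes do not give the adjustment itself. A short sketch, assuming the usual textbook formula R-bar squared = 1 - (1 - R squared)(n - 1)/(n - k - 1), with n observations and k x variables (the function name and the figures are illustrative only):

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjust R squared for n observations and k x variables."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than estimated coefficients")
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# Illustrative figures: R squared of 0.90 from 20 observations and 3 x variables.
r_bar_squared = adjusted_r_squared(0.90, n=20, k=3)
print(r_bar_squared)
```

The adjusted value is always a little below R squared, and the gap widens as more x variables are added to the same data.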
Collinearity
Occurs when two or more of the x variables are highly correlated with each other. In these circumstances, the two variables are contributing essentially the same information to the regression.
Test: Check the correlation coefficients for all x variables taken in pairs. If any of the coefficients is high, the corresponding two variables are collinear.
Remedies:
a) Use only one of the variables (which one? It is a subjective call)
b) Amalgamate the variables (for instance, add them together if this has meaning)
c) Substitute one of the variables with another that has a similar meaning but a lower correlation with the remaining one of the pair.
The test and remedies are not precise. It is more important to be aware of the problem and the restrictions it places on interpretation than to delve into the technicalities lying behind it.
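The pairwise test can be sketched as follows. The notes only say a "high" coefficient signals collinearity, so the 0.8 cut-off, the function names and the data are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(u, v):
    """Correlation coefficient between two equal-length lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    suv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    suu = sum((a - mean_u) ** 2 for a in u)
    svv = sum((b - mean_v) ** 2 for b in v)
    return suv / sqrt(suu * svv)


def collinear_pairs(x_variables, threshold=0.8):
    """Return (name, name, r) for every pair of x variables with |r| >= threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(x_variables.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:  # threshold is a judgement call, not a rule
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


x_variables = {
    "x": [1, 2, 3, 4, 5, 6],
    "z": [2.1, 3.9, 6.2, 8.1, 9.8, 12.0],  # roughly 2x, so nearly the same information
    "t": [5, 1, 4, 2, 6, 3],
}
flagged = collinear_pairs(x_variables)
print(flagged)
```

Here only the pair (x, z) is flagged; which of the two to keep remains the subjective call described under Remedies.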
Variables that take a two-category form (1/0, yes/no) can be represented by a dummy variable and included in the regression. The coefficient represents the amount by which y changes when the category is present; its sign shows whether the change is an increase or a decrease.
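A small illustration with a single dummy variable, using invented sales figures with and without a promotion. With only the dummy on the right-hand side, the constant comes out as the average of the 0 group and the coefficient as the difference between the two group averages.

```python
def simple_regression(x, y):
    """Least-squares estimates (a, b) for y = a + bx."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data: weekly sales without (0) and with (1) a promotion.
promotion = [0, 0, 0, 0, 1, 1, 1, 1]
sales = [20, 22, 19, 23, 27, 30, 28, 31]

a, b = simple_regression(promotion, sales)
print(a, b)  # a: average sales without promotion; b: change when promotion = 1
```

A positive b indicates an increase associated with the category, a negative b a decrease.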
The relationship between variables may not always be of a linear form. In non-linear regression the right-hand side variables may appear in squared, cubed, logarithmic etc. form, so a scatter diagram between the y variable and an x variable need no longer be an approximate straight line.
The major ways of carrying out a non-linear regression rely on converting or transforming the non-linear equation so that it can be treated as if it were linear.
Curvilinear regression: squared, cubed etc. terms in an x variable are treated as separate variables in a multiple regression.
Technique:
1) Take a non-linear equation: y = a + bx + cx^2
2) Treat the equation as if it were in the form y = a + bx + cz, where z = x^2
3) Run a multiple regression to determine the coefficients a, b and c.
4) After estimation, x^2 is restored to the equation for predictive purposes.
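The four steps above can be sketched as below. This is only an illustration: the helper names are invented, the regression is solved through the normal equations, and the data are generated exactly from y = 1 + 2x + 0.5x^2 so that the recovered coefficients can be checked.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_two_variable_regression(x, z, y):
    """Least-squares estimates (a, b, c) for y = a + bx + cz."""
    design = [[1.0, xi, zi] for xi, zi in zip(x, z)]
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(3)] for i in range(3)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(3)]
    return tuple(solve_linear_system(xtx, xty))


# Step 1: data that follow a curve (made up for the example).
x = [0, 1, 2, 3, 4, 5]
y = [1 + 2 * xi + 0.5 * xi ** 2 for xi in x]

# Step 2: treat x^2 as a separate variable z.
z = [xi ** 2 for xi in x]

# Step 3: multiple regression of y on x and z.
a, b, c = fit_two_variable_regression(x, z, y)


# Step 4: restore x^2 for prediction.
def predict(x_new):
    return a + b * x_new + c * x_new ** 2


print(round(a, 6), round(b, 6), round(c, 6), round(predict(6), 6))
```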
Transform exponential functions into log functions:
y = a*e^(bx)
log y = log (a*e^(bx))
log y = log a + b*x (taking natural logarithms)
This new equation has the form Y = A + BX, where:
Y = log y, A = log a, B = b, X = x
Summarized: Two variables, x and y, are, it is thought, related by an exponential function. The y variable is transformed by taking its logarithm. Linear regression is applied to log y and x as if they were two new variables. The coefficient estimates obtained from this linear regression can be translated back into the coefficients of the exponential function, which can then be used to forecast.
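A sketch of the whole procedure, assuming natural logarithms are used (so that a = e^A and b = B). The function name and the data, generated exactly from y = 3e^(0.4x), are illustrative.

```python
from math import exp, log


def fit_exponential(x, y):
    """Estimate a and b in y = a * e^(bx) by regressing log y on x."""
    log_y = [log(v) for v in y]  # Y = log y (all y values must be positive)
    n = len(x)
    mean_x, mean_log_y = sum(x) / n, sum(log_y) / n
    sxy = sum((xi - mean_x) * (yi - mean_log_y) for xi, yi in zip(x, log_y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    big_b = sxy / sxx                    # B in Y = A + BX
    big_a = mean_log_y - big_b * mean_x  # A in Y = A + BX
    return exp(big_a), big_b             # translate back: a = e^A, b = B


x = [0, 1, 2, 3, 4]
y = [3 * exp(0.4 * xi) for xi in x]

a, b = fit_exponential(x, y)
forecast = a * exp(b * 5)  # forecast for x = 5 using the exponential form
print(round(a, 6), round(b, 6), round(forecast, 4))
```

If logarithms to base 10 were used instead, the translation back would differ (a = 10^A, and B would no longer equal b directly).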
The underlying relationship between y and x is not itself being altered. There is no sense in which a curved relationship is being forced to be linear. The relationship is simply being expressed in a different mathematical form: the ‘trick’ of expressing the same relationship in a different form makes it possible to use regression analysis.
Sometimes it may be necessary to try several transformations, using the squares, square roots, reciprocals, logarithms etc. of one or both variables, to find the scatter diagram of transformed variables that looks most like a straight line. Carrying out regression analyses with several types of transformation to find the one with the best statistical results (highest R-squared, random residuals) can be dangerous. It is usually possible, eventually, to come up with a satisfactory result, but there may be no sound reason for it and the relationship may be purely associative. It is best to base the choice of transformation on logic or prior knowledge. A trial and error approach to transformation should not be ruled out, but it should be used with caution.
The statistical basis of regression and correlation allows all aspects of the relationship to be tested statistically rather than visually. For example, the randomness of residuals can be tested more precisely. It is based on sampling: the calculations are made from sample data. This is the basis of significance tests of whether the hypothesis of a straight-line relationship in the population is true, and the basis for determining the accuracy of regression predictions.
Correlation measures the strength of a relationship, using the correlation coefficient (r), its square (R^2) or, in multiple regression, R-bar squared.
ANOVA can be used to test this, as it can be shown that Mean Square (Regression) / Mean Square (Error) has an F-distribution. Here’s how you do it.
ANOVA
Technique
1) Take mean of x values
2) Take mean of y values
3) Find regression equation
4) Find correlation coefficient
5) Find Residuals and Best Fit Values
6) Find SS, SS(Regression) and SS(Residual) as follows:
SS = Σ (y - mean of y)^2
SS(Regression) = Σ (fitted y - mean of y)^2
SS(Residual) = Σ (Residual)^2
Total SS = SS(Regression) + SS(Residual)
ANOVA Table
| Variation | Degrees of Freedom | Sums of Squares | Mean Square | F |
|---|---|---|---|---|
| Explained by Regression | k (1 for each independent variable) | SS(regression) | MST = SS(regression)/k | MST/MSE |
| Error or Unexplained (residuals) | n - k - 1 (number of observations, less 1 df for each coefficient, less 1 for y-bar) | SS(residual) | MSE = SS(residual)/(n - k - 1) | |
| Total | n - 1 | SS | | |
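The ANOVA steps and table can be sketched for a simple regression (k = 1) as below. The data and the function name are invented for illustration; the resulting F value would still have to be compared with the critical value from F tables for k and n - k - 1 degrees of freedom.

```python
def anova_simple_regression(x, y):
    """Sums of squares, mean squares and F for y = a + bx (k = 1)."""
    n, k = len(x), 1
    mean_x, mean_y = sum(x) / n, sum(y) / n              # steps 1 and 2
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    a = mean_y - b * mean_x                               # step 3
    fitted = [a + b * xi for xi in x]                     # step 5
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    ss_total = sum((yi - mean_y) ** 2 for yi in y)        # step 6
    ss_regression = sum((fi - mean_y) ** 2 for fi in fitted)
    ss_residual = sum(e ** 2 for e in residuals)
    ms_regression = ss_regression / k
    ms_error = ss_residual / (n - k - 1)
    return {
        "df_regression": k,
        "df_error": n - k - 1,
        "ss_total": ss_total,
        "ss_regression": ss_regression,
        "ss_residual": ss_residual,
        "ms_regression": ms_regression,
        "ms_error": ms_error,
        "f": ms_regression / ms_error,
    }


# Made-up observations lying close to a straight line.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
table = anova_simple_regression(x, y)
for name, value in table.items():
    print(name, round(value, 4))
```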
Regression hypothesizes a true linear relationship, with deviations from it caused by minor random disturbances. Not only is the randomness part of the hypothesis, it is also intuitively reasonable. If the residuals are not random, they must have a pattern. If there is a pattern, the linear model is not adequate and should be revised or altered to incorporate it. If they are random, the linear model must be the best pattern that can be obtained from the data.
The runs test is a common example of a statistical test for randomness. A run is a group of consecutive residuals with the same sign.
The test is based on an ‘expected’ number of runs, which is the number of runs that would be most likely to occur if the residuals were random. The expected number of runs can be calculated using the basic ideas of probability.
This is compared with the observed number of runs counted in the residuals. If the actual number differs from the expected number by a large margin, the residuals will be assumed non-random.
A significance test will indicate whether any difference between the observed and expected numbers of runs is sufficient to reject the hypothesis that the residuals are random.
Technique:
1) Hypothesis: residuals are random
2) Evidence: sample of residuals
3) Significance: 5%
4) Count the number of runs, and note the numbers of positive and negative residuals as n1 and n2.
5) Compare with the runs test table. If the number of observed runs is less than the lower critical value (or greater than the upper one), the hypothesis is rejected.
If n1 or n2 is larger than 20, tables can’t be used. Instead, use:
Mean = 2n1n2/(n1 + n2) + 1
SD = SQRT( 2n1n2(2n1n2 - n1 - n2) / ((n1 + n2)^2 (n1 + n2 - 1)) )
In this case the critical values are not taken from tables; they are 2 SDs above and below the mean (for 5% significance).
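A sketch of the large-sample version of the test, using the Mean and SD formulas above with n1 and n2 taken as the numbers of positive and negative residuals. The function name is invented, and dropping residuals that are exactly zero is an assumption of this sketch, not something the notes specify.

```python
from math import sqrt


def runs_test(residuals):
    """Large-sample runs test at roughly 5% significance (mean +/- 2 SD)."""
    signs = [r > 0 for r in residuals if r != 0]  # assumption: ignore exact zeros
    n1 = sum(signs)                # number of positive residuals
    n2 = len(signs) - n1           # number of negative residuals
    # A new run starts every time the sign changes.
    runs = 1 + sum(1 for prev, cur in zip(signs, signs[1:]) if prev != cur)
    mean = 2 * n1 * n2 / (n1 + n2) + 1
    sd = sqrt(2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
              / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    lower, upper = mean - 2 * sd, mean + 2 * sd
    return {
        "runs": runs, "n1": n1, "n2": n2,
        "mean": mean, "sd": sd, "lower": lower, "upper": upper,
        "random": lower <= runs <= upper,
    }


# Made-up residuals with an obvious pattern: 25 positive then 25 negative.
residuals = [1.0] * 25 + [-1.0] * 25
result = runs_test(residuals)
print(result)
```

With only 2 runs against an expected 26, the hypothesis of randomness is rejected for this example.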
Variables are retained if they are significant, that is, if
they have a significant effect on the y variable.
Technique:
1) The hypothesis is that the true population coefficient for the variable is 0.
2) The evidence is the set of observations from which the regression coefficients have been estimated. As well as the coefficient of the variable in question, its standard error will also have been calculated.
3) The significance level is usually 5%.
4) Degrees of freedom are n - k - 1, where n = number of observations and k = number of x variables in the regression. k + 1 degrees of freedom are lost because of the need to estimate the coefficients of the k variables and the constant term. The critical t value is t.025 for n - k - 1 degrees of freedom, found from the t tables.
5) The observed t value is tObs = (Coefficient estimate - 0) / Standard error of coefficient
6) If tObs exceeds t.025 (ignoring sign), the hypothesis is rejected: the variable does have a significant effect. If it is less, the hypothesis is accepted and the variable is eliminated.
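Steps 5 and 6 amount to a one-line calculation. In the sketch below the coefficient estimates and standard errors are invented, and the critical value 2.086 is the t.025 figure from standard t tables for 20 degrees of freedom (for example n = 24, k = 3); the function name is illustrative.

```python
def t_test_coefficient(estimate, standard_error, critical_t):
    """Return (tObs, significant) for the hypothesis that the true coefficient is 0."""
    t_obs = (estimate - 0) / standard_error
    # Compare the size of tObs with the critical value, ignoring its sign.
    return t_obs, abs(t_obs) > critical_t


critical_t = 2.086  # t.025 with n - k - 1 = 20 degrees of freedom, from t tables

t_keep, keep = t_test_coefficient(1.8, 0.6, critical_t)   # hypothetical variable 1
t_drop, drop = t_test_coefficient(-0.5, 0.4, critical_t)  # hypothetical variable 2
print(round(t_keep, 3), keep)
print(round(t_drop, 3), drop)
```

The first variable would be retained; the second is a candidate for elimination, subject to prior knowledge about whether it ought to matter.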
A common use of a regression equation is to make predictions. The forecast corresponding to an x value is found by putting the x value into the equation and calculating y. The y value is called a point estimate.
Residuals are the basis for measuring the accuracy of a prediction. Residuals are the historical differences between the actual values and the regression line. Future values will differ from the regression line by similar amounts. The scatter of the residuals, measured by their standard error, is therefore an indication of forecasting accuracy.
The overall uncertainty in a prediction comes from two areas:
1) What the regression does not deal with: the residuals
2) What the regression does deal with: error in the estimation of the regression model coefficients.
Both are combined in the standard error of predicted values, SE(Pred).
If there are more than 30 points, then 95% of future values are likely to be within +/- 2 SE(Pred) of the point estimate. (If there are fewer than 30, a t value, found from tables with the appropriate degrees of freedom, is used in place of 2.)
This interval is known as the forecast interval.
The question is whether the forecast interval gives a sufficient level of accuracy at 95% confidence.
SE(Pred) is different for each different set of x values and is a vitally important measure: predictions must be sufficiently accurate to be useful in decisions. The forecast interval is the final determinant of whether the regression model is satisfactory.
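The notes do not reproduce the SE(Pred) formula. The sketch below assumes the standard formula for simple regression, SE(Pred) = s * SQRT(1 + 1/n + (x0 - mean of x)^2 / Σ(x - mean of x)^2), where s is the standard error of the residuals; the multiple regression version is more involved and is normally taken from computer output. The function name and the 32 data points are invented.

```python
from math import sqrt


def forecast_interval(x, y, x_new, multiplier=2.0):
    """Point estimate, SE(Pred) and forecast interval for a simple regression.

    multiplier=2.0 suits more than 30 points; with fewer, pass the t value
    from tables for n - 2 degrees of freedom instead.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    a = mean_y - b * mean_x
    ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s = sqrt(ss_residual / (n - 2))  # scatter of the residuals
    # Assumed textbook formula: residual scatter plus coefficient uncertainty.
    se_pred = s * sqrt(1 + 1 / n + (x_new - mean_x) ** 2 / sxx)
    point = a + b * x_new
    return point, se_pred, (point - multiplier * se_pred, point + multiplier * se_pred)


# Made-up data: 32 points around y = 5 + 2x with a small repeating disturbance.
x = list(range(1, 33))
y = [5 + 2 * xi + 0.5 * ((7 * xi) % 5 - 2) for xi in x]

point, se_pred, interval = forecast_interval(x, y, x_new=35)
_, se_centre, _ = forecast_interval(x, y, x_new=16.5)
print(round(point, 3), round(se_pred, 3), [round(v, 3) for v in interval])
```

SE(Pred) is smallest for x values near the mean of the data and grows as the prediction moves away from it, which is why it differs for each set of x values.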
The steps below assume the problem to be tackled is one of making predictions, time related or cross-sectional, rather than just establishing the existence of a relationship, which is a solely statistical application of regression.
| Step | What to do |
|---|---|
| 1) Propose a tentative model | Use scatter diagrams and prior knowledge to decide what the regression equation might be. Make tentative decisions as to which x variables to include and what transformations to use to handle any curvature. Consider the type of data available. |
| 2) Run the regression and check closeness of fit | Use R-bar squared, and possibly an ANOVA table, from the computer printout to check that a sufficiently high proportion of the original variation in the y variable has been explained. |
| 3) Check the residuals | They should be random. A scatter diagram of residuals against fitted values demonstrates this visually, or a runs test will permit the check to be made statistically. |
| 4) Decide whether any x variables could be discarded | A t test on each x variable coefficient will indicate whether the variable has a significant effect on the y variable. If not, discard it, in conjunction with prior knowledge. |
| 5) Check for collinearity | A correlation matrix for all x variables will show which are collinear and therefore have unreliable coefficient estimates. |
| 6) Decide if the regression estimates are accurate enough for the decision | SE(Pred) is the basis for calculating confidence intervals for predictions. These can be contrasted with the decision at hand. |
| 7) If necessary, formulate a new regression model | If any of the checks gives unsatisfactory results, it is necessary to return to stage 1) and try again with a new model. |
1) Multiple regression analysis makes the extension beyond simple regression. It allows changes in one variable (the y variable) to be explained by changes in several other variables (the x variables).
Multiple regression analysis is based on the same principle, the least-squares criterion, as simple regression. However, the addition of the extra x variables does bring about added complications.
As in simple regression:
a) Substitution of x values in the regression equation to make predictions
b) F test to measure closeness of fit
c) Checking of residuals for randomness
d) Use of SE(Pred) to measure accuracy.
Added complications:
a) Adjustment of the correlation coefficient to allow for degrees of freedom
b) t test to determine variables to leave out
c) Check for collinearity.
2) Non-linear regression analysis deals with ‘curved’ relationships between variables. This is done by transforming one or more variables so that the equation can be handled as if it were linear. The possibilities are wide, allowing a variety of non-linear relationships to be modeled through regression.
Caution must be taken: trying transformations until a satisfactory regression equation is found may result in the risk of causality being forgotten.
Regression should confirm prior beliefs, rather than be used to find a ‘belief’. The latter may lead to many purely associative relationships between variables. You must always ask ‘Is the regression sensible?’
Strike the correct balance between statistical and non-statistical factors. For example, the t test for the inclusion of variables in a multiple regression should be taken carefully into account, but not to the exclusion of other factors. The profusion of complex data produced by regression analyses can promote a spurious sense of accuracy and a spurious sense of the importance of the statistical aspects.