Quantitative Methods:
Module 11: Regression and Correlation
Regression and correlation are concerned with relationships between variables. They investigate whether a variable is statistically related to one or more other variables thought to cause changes in it.
Regression: determining the mathematical formula relating the variables.
Correlation: measuring the strength of the relationship.
Regression shows what the connection is;
correlation shows whether the connection is strong enough to
merit using it.
Correlation can be used to reinforce intuitive evidence.
The term ‘regression analysis’ is often used to include both regression and correlation.
Regression finds the formula for the scatter diagram;
correlation will indicate the strength of the relationship.
Simple linear regression: simple means involving only two variables; linear means that the relationship is some form of straight line (as opposed to a curve).
The equation for a straight line as used in the text is y = a + bx.
Forecasting is frequently based on regression analysis. The variable to be forecast is regressed against another variable thought to ‘cause’ changes in it.
Correlation can measure the strength of the relationship; regression can determine the formula linking the two sets of numbers.
Data are cross-sectional when the observations relate to different people at a single point in time.
Data are time-series when each observation relates to a different point in time.
When high values of one variable are associated with low values of the other, and vice versa, the correlation is said to be negative. When high is associated with high and low with low, the correlation is positive.
Simple linear regression refers to the case where two variables, when plotted on a graph, have an (approximately) straight line relationship.
Mathematical equation:
y = a + bx (bx stands for b multiplied by x)
where y and x are variables; a and b are fixed numbers or constants.
Simple linear regression means finding the values of a and b which provide the best connection between the two variables in
y = a + bx
a) a is the intercept: the value of y at which the line crosses the y axis. This can be verified by noting that y = a when x = 0.
b) b is the slope: the change in y for a change in x of one unit. The slope can be positive or negative.
Determining the equation of a straight line amounts to finding the values of a and b. Once they are known, the line is completely determined.
The difference between the actual and fitted y values is the residual:
Residual = Actual y value - Fitted y value
The best straight line is the one that makes the residuals as small as possible. There are three approaches:
1) Choose the line for which the sum of the residuals is a minimum compared with any other line drawn through the points. Positive and negative residuals cancel each other, so a poor line can still score well on this measure.
2) Make the sum of the absolute values of the residuals as small as possible.
3) Make the sum of the squared residuals as small as possible. This is called the least squares method, and it is the one used to find the ‘best’ straight line through a set of points: values for a and b are chosen such that Sum (residuals squared) is a minimum.
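To see how the three criteria behave, they can be compared on a small data set. The sketch below uses made-up numbers (not from the text) and two candidate lines: a horizontal line through the average of y, and the line y = 2.2 + 0.6x, which is the least squares line for these particular points. The function name `criteria` is illustrative only.

```python
# Illustrative data only; these five points are not taken from the text.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]


def criteria(a, b):
    """Return the three candidate measures of fit for the line y = a + b*x."""
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]  # actual - fitted
    return (
        sum(residuals),                   # 1) sum of residuals
        sum(abs(e) for e in residuals),   # 2) sum of absolute residuals
        sum(e * e for e in residuals),    # 3) sum of squared residuals
    )


flat = criteria(4.0, 0.0)   # horizontal line through the average of y
best = criteria(2.2, 0.6)   # least squares line for these points

print("flat line:", flat)
print("best line:", best)
```

For both lines the plain sum of residuals is zero apart from rounding, because positive and negative residuals cancel, so the first criterion cannot tell the lines apart. The second and third criteria both give a smaller value for the sloping line.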
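Formula:
y = a + bx
where
$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$
and
$$a = \bar{y} - b\bar{x}$$
(\bar{x} and \bar{y} are the average values of x and y.)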
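As a rough sketch of the least squares calculation, the slope can be computed as the sum of the products of the x and y deviations from their averages, divided by the sum of the squared x deviations; the intercept is then the average of y minus the slope times the average of x. The function name `fit_line` and the data are assumptions made for illustration, not taken from the text.

```python
def fit_line(xs, ys):
    """Least squares estimates (a, b) for the line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b


# Illustrative data only; not taken from the text.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_line(xs, ys)
print(f"y = {a:.2f} + {b:.2f}x")
```

For these five points the calculation gives a = 2.2 and b = 0.6.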
Regression finds the ‘best’ line for a set of points but does not reveal whether a line is a good representation of the points. Correlation fills this gap. It helps to decide whether, in view of the closeness (or otherwise) of the points to a line, the regression line is likely to be of any practical use. It does this by quantifying the strength of the linear relationship: it measures whether, overall, the points are close to a straight line.
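The actual measure of the strength of the relationship is called the correlation coefficient (denoted by r). The formula by which r is calculated is:
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$$
Correlation coefficients close to 0 indicate a weak or non-existent linear relationship; values close to +1 or -1 indicate a strong one.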
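A minimal sketch of the correlation coefficient calculation follows, using the usual deviations-from-the-average form. The function name `correlation` and both data sets are made up for illustration. The second data set is chosen so that y rises and falls symmetrically as x increases, which leaves no linear relationship.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation coefficient r between two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Illustrative data only; not taken from the text.
xs = [1, 2, 3, 4, 5]
r_related = correlation(xs, [2, 4, 5, 4, 5])     # points roughly along a line
r_unrelated = correlation(xs, [3, 1, 4, 1, 3])   # symmetric, no linear trend

print(round(r_related, 3), round(r_unrelated, 3))
```

The first value comes out at about 0.77, while the second is zero apart from rounding.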
Now consider the square of the correlation coefficient, written with a capital letter: R². Before carrying out a regression analysis one can measure the total variation in the y variable by:
$$\text{Total variation} = \sum (y - \bar{y})^2$$
This measures the extent to which y varies from its average value.
Part of the total variation in y can be thought of as being ‘caused’ by the x variable. Regression and correlation are used to investigate the extent to which changes in y are affected by changes in x. The variation in y that is ‘caused’ by x is called the explained variation.
$$\text{Explained variation} = \sum (\text{Fitted } y - \bar{y})^2$$
$$\text{Unexplained variation} = \sum (\text{Residuals})^2$$
Total variation = Explained variation + Unexplained variation
Correlation coefficient squared:
R² = Explained variation / Total variation
When R² = 1, explained variation = total variation and unexplained variation = 0.
Because the variation is measured in squared terms it is
labeled R-squared.
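The split of the variation can be checked numerically. The sketch below fits a least squares line to made-up data (not from the text), then computes the total variation as the sum of squared deviations of y from its average, the explained variation as the sum of squared deviations of the fitted values from that average, and the unexplained variation as the sum of squared residuals. All variable names are illustrative. Note that the total equals explained plus unexplained only for the least squares line, not for an arbitrary line.

```python
# Illustrative data only; not taken from the text.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least squares slope and intercept for y = a + b*x.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
a = y_bar - b * x_bar

fitted = [a + b * x for x in xs]

total = sum((y - y_bar) ** 2 for y in ys)
explained = sum((f - y_bar) ** 2 for f in fitted)
unexplained = sum((y - f) ** 2 for y, f in zip(ys, fitted))

r_squared = explained / total
print(round(total, 3), round(explained, 3), round(unexplained, 3))
print("R-squared:", round(r_squared, 3))
```

Here the total variation of 6 splits into 3.6 explained and 2.4 unexplained, so R-squared is 0.6.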
The essence of correlation is that when a regression analysis is carried out, the variation in the y variable is split into two parts:
a) Explained
b) Unexplained
The correlation coefficient squared tells what proportion of the original variation in y has been explained by associating y with x. The higher the proportion, the stronger the correlation. In most cases a value of 0.75 or more is regarded as highly satisfactory and 0.50-0.75 as adequate; below 0.50 there are serious doubts.
After the regression line is found, the residuals can be found. (Plug in the x values, produce fitted y values, and subtract the fitted y’s from the actual y’s.)
Residuals should be random, meaning that they should have no pattern or order. Checking the residuals for randomness helps decide whether a linear equation is an adequate means of expressing the connection between y and x.
The first way is visual: the residuals can be inspected directly or, alternatively, a scatter diagram can be drawn plotting the residuals against the fitted y values. This shows the size of the residuals and any pattern in them. A visual test is all that is necessary to detect an obvious pattern.
Frequently the result is not clear-cut: there may be a hint of a pattern but it is not definite. Statistical tests of randomness are a more precise approach.
The steps in a regression analysis are:
1) Inspect the scatter diagram
2) Calculate the regression coefficients
3) Calculate the correlation coefficient
4) Check the residuals for randomness
A scatter diagram is a first check that the analysis makes
sense.
When each residual is related to the previous residual,
there is said to be serial correlation. This problem occurs particularly in time series data where there
may be some time-related cycle.
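One simple way to look for serial correlation, offered here as an illustration and not as the text's own method, is to calculate the correlation between each residual and the one before it. The function name `lag1_correlation` and the residual values are hypothetical.

```python
from math import sqrt


def lag1_correlation(residuals):
    """Correlation between each residual and the residual before it."""
    prev = residuals[:-1]   # first to second-to-last residual
    curr = residuals[1:]    # second to last residual
    n = len(prev)
    p_bar = sum(prev) / n
    c_bar = sum(curr) / n
    spc = sum((p - p_bar) * (c - c_bar) for p, c in zip(prev, curr))
    spp = sum((p - p_bar) ** 2 for p in prev)
    scc = sum((c - c_bar) ** 2 for c in curr)
    return spc / sqrt(spp * scc)


# Hypothetical residuals, in time order, that drift in a slow cycle.
cyclical = [1.0, 0.8, 0.5, -0.2, -0.7, -1.0, -0.6, 0.1, 0.6, 0.9]
lag1 = lag1_correlation(cyclical)
print(round(lag1, 2))
```

A value well above zero, as here, suggests that each residual tends to resemble the previous one. A value near zero is what random residuals would be expected to give.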
When the residuals vary in size at different parts of the line, this is referred to as heteroscedasticity. It occurs particularly in cross-sectional data where the size of the residuals is related to the x value, for example when relating profitability to company size.
If there is a pattern a visual test with a scatter diagram
is usually sufficient to detect it.
There are statistical tests for detecting patterns. The runs
test is the most common.
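The notes do not spell out the runs test, so the sketch below assumes the standard Wald-Wolfowitz form applied to the signs of the residuals. A run is an unbroken sequence of residuals with the same sign. Too few runs suggests the residuals cluster, and too many suggests they alternate. The function name `runs_test` and the residual values are hypothetical, and the normal approximation used for z is only rough for a sample this small.

```python
from math import sqrt


def runs_test(residuals):
    """Return (runs, expected_runs, z) for the signs of the residuals.

    Assumes at least one positive and one negative residual.
    """
    signs = [e > 0 for e in residuals if e != 0]   # drop exact zeros
    n_pos = sum(signs)
    n_neg = len(signs) - n_pos
    n = n_pos + n_neg
    # A new run starts whenever the sign differs from the previous one.
    runs = 1 + sum(1 for s, t in zip(signs, signs[1:]) if s != t)
    expected = 2 * n_pos * n_neg / n + 1
    variance = (2 * n_pos * n_neg * (2 * n_pos * n_neg - n)) / (n ** 2 * (n - 1))
    z = (runs - expected) / sqrt(variance)
    return runs, expected, z


# Hypothetical residuals in observation order.
residuals = [1.0, 0.8, 0.5, -0.2, -0.7, -1.0, -0.6, 0.1, 0.6, 0.9]
runs, expected, z = runs_test(residuals)
print(runs, round(expected, 2), round(z, 2))
```

Here there are only 3 runs against an expected 5.8, giving a z value of roughly -1.97, which points towards a pattern in the residuals.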
a) Correlation implies association, not causality. This is the largest single source of confusion and error.
b) Guard against spurious regressions. (Using the text example, you can have the same thing appear on both sides of an equation, but in different forms, so the correlation is naturally high.)
c) Extrapolation: avoid going too far outside the range of the data.
d) Regression applies only to single sets of data. A high correlation coefficient can hide the fact that two separate straight lines would be more appropriate. Be familiar with the data.
e) Do not be misled into being over-precise. Least squares has been applied to y on x; an x on y regression would nearly always produce a different line with a different slope. (The correlation coefficient, r, would be the same, however.)
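The point about y on x versus x on y can be illustrated with a short sketch on made-up data (not from the text). The helper names `slope` and `correlation` are illustrative. The x on y line is rewritten in terms of y so that its slope can be compared on the same axes.

```python
from math import sqrt


def slope(us, vs):
    """Least squares slope from regressing vs on us."""
    n = len(us)
    u_bar = sum(us) / n
    v_bar = sum(vs) / n
    suv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    suu = sum((u - u_bar) ** 2 for u in us)
    return suv / suu


def correlation(us, vs):
    """Correlation coefficient; the formula is symmetric in us and vs."""
    # The product of the two regression slopes equals r squared, and
    # r takes the sign of the slope.
    b_uv = slope(us, vs)
    b_vu = slope(vs, us)
    r = sqrt(b_uv * b_vu)
    return r if b_uv >= 0 else -r


# Illustrative data only; not taken from the text.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b_yx = slope(xs, ys)          # y on x: y = a + b_yx * x
b_xy = slope(ys, xs)          # x on y: x = c + b_xy * y
slope_from_x_on_y = 1 / b_xy  # the x on y line redrawn with y as a function of x

print(round(b_yx, 3), round(slope_from_x_on_y, 3))
print(round(correlation(xs, ys), 3), round(correlation(ys, xs), 3))
```

For these points the y on x slope is 0.6, while the x on y line, drawn on the same axes, has a slope of 1.0. The correlation coefficient is about 0.77 whichever way round the variables are taken.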
Regression and correlation help in predicting and understanding relationships in data. They have a wide range of applications: economics, sales forecasting, budgeting, costing, human resource planning, corporate planning.
Users of regression can allow the statistics to dominate. Major errors arise when the wider non-statistical issues have been neglected.
It is important to ask penetrating questions about the way regression and correlation are being applied.
Broad principles and managerial issues are as important as the technical, statistical aspects.
Knowledge of statistical principles is necessary not in order to do the regression analyses but as a passport to a legitimate place in the discussion.