Quantitative Methods:

Module 11:  Regression and Correlation

Introduction

Regression and correlation are concerned with relationships between variables.  They investigate whether a variable is related statistically to one or more other variables thought to cause changes in it.

Regression determines the mathematical formula relating the variables.

Correlation measures the strength of the relationship.

Regression shows what the connection is; correlation shows whether the connection is strong enough to merit using it.

Correlation can reinforce intuitive evidence.

The term ‘regression analysis’ is often used to include both regression and correlation.

Regression finds the formula for the line through the scatter diagram; correlation indicates the strength of the relationship.

In simple linear regression, ‘simple’ means involving only two variables; ‘linear’ means that the relationship is some form of straight line (as opposed to a curve).

The equation for a straight line as used in the text is

y = a + bx

Applications

Forecasting

Forecasting is frequently based on regression analysis.  The variable to be forecast is regressed against another variable thought to ‘cause’ changes in it.

Correlation can measure the strength of this relationship; regression can determine the formula linking the two sets of numbers.
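
As a rough sketch of forecasting by regression, the Python below fits a line by the least squares method covered later in this module and uses it to forecast one value. The advertising and sales figures, and the function name least_squares, are hypothetical and chosen only for illustration.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

# Hypothetical figures: advertising spend (x) and sales (y).
advertising = [10, 20, 30, 40, 50]
sales = [25, 45, 60, 85, 100]

a, b = least_squares(advertising, sales)

# Forecast sales for a spend of 35, which lies inside the range of the data.
forecast = a + b * 35
print(f"sales = {a:.2f} + {b:.2f} * advertising; forecast at 35: {forecast:.2f}")
```

The forecast is only as good as the relationship behind it, which is why the strength of the correlation is checked as well.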

Explaining

Data are cross-sectional when the observations relate to different people at one point in time.

Data are time series when each observation relates to a different point in time.

When high values of one variable are associated with low values of the other, and vice versa, the correlation is said to be negative.  When high is associated with high and low with low, the correlation is positive.

 

Mathematical Preliminaries

Simple linear regression refers to the case where two variables, when plotted on a graph, have an (approximately) straight line relationship.

Mathematical equation: y = a + bx (bx stands for b multiplied by x)

where y and x are variables; a and b are fixed numbers or constants.

Simple linear regression means finding the values of a and b which provide the best connection between the two variables.

 

The Equation of a Straight Line

 y = a + bx

 

a)     a is the intercept: the value of y at which the line crosses the y axis.  This can be verified by noting that y = a when x = 0.

b)     b is the slope: the change in y for a change in x of one unit.

Slope can be positive or negative. 

Determining the equation of a straight line amounts to finding the values of a and b.  Once these are known, the line is completely determined.  Linear regression is the task of finding the values of a and b which provide the best connection between the two variables.
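
A minimal Python sketch of the roles of a and b, assuming arbitrary illustrative values for the two constants (the function name line is hypothetical):

```python
def line(a, b, x):
    """Return y = a + b*x for intercept a and slope b."""
    return a + b * x

a, b = 2.0, 0.5  # hypothetical constants, chosen only for illustration

# The intercept is the value of y when x = 0.
intercept = line(a, b, 0)

# The slope is the change in y when x increases by one unit.
slope = line(a, b, 4) - line(a, b, 3)
```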

 

 

Residuals

The difference between the actual and fitted y values is the residual.

 

Residual = Actual y value - Fitted y value
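
The calculation can be sketched in Python as follows. The data points and the values of a and b are hypothetical; the line here is assumed rather than fitted.

```python
# Hypothetical data and line, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 6, 9, 12]
a, b = 1.0, 2.0  # assumed intercept and slope, not fitted values

# Fitted y values come from the line y = a + bx.
fitted = [a + b * x for x in xs]

# Residual = actual y value minus fitted y value.
residuals = [y - f for y, f in zip(ys, fitted)]
```

A positive residual means the actual point lies above the line; a negative residual means it lies below.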

 

Simple Linear Regression

The best straight line is the one that makes the residuals as small as possible.  There are three approaches:

1)     Choose the line for which the sum of the residuals is a minimum compared with any other line drawn through the points.

2)     Make the sum of the absolute values of the residuals as small as possible.

3)     Use the least squares method, the usual way of finding the ‘best’ straight line through a set of points.  Values for a and b are chosen such that the sum of the squared residuals is as small as possible.

 

Sum (residuals squared) is a minimum.
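
The three criteria can be compared with a short Python sketch. The data and the two candidate lines are hypothetical: one is a flat line through the average of y, the other is the least squares line for these points. Both have residuals that sum to zero, because positive and negative residuals cancel, so the first criterion cannot tell them apart; the squared criterion can.

```python
def criteria(a, b, xs, ys):
    """Return the three candidate measures of fit for the line y = a + b*x."""
    res = [y - (a + b * x) for x, y in zip(xs, ys)]
    return {
        "sum": sum(res),                      # approach 1
        "sum_abs": sum(abs(r) for r in res),  # approach 2
        "sum_sq": sum(r * r for r in res),    # approach 3 (least squares)
    }

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

flat = criteria(4.0, 0.0, xs, ys)    # horizontal line through the average of y
sloped = criteria(2.2, 0.6, xs, ys)  # least squares line for these points

print(flat)
print(sloped)
```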

 

Formula:

 

y = a + bx

 

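where  $b = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$

and  $a = \bar{y} - b\bar{x}$

($\bar{x}$ and $\bar{y}$ are the averages of the x and y values.)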
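
A minimal Python sketch of the standard least squares calculation, in which b is the sum of products of deviations from the averages divided by the sum of squared x deviations, and a is the average of y minus b times the average of x. The function name and the data are hypothetical.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n  # average of the x values
    y_bar = sum(ys) / n  # average of the y values
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares(xs, ys)
print(f"y = {a:.2f} + {b:.2f}x")
```

The calculation breaks down if every x value is the same, since the divisor is then zero and no slope can be determined.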

 

Correlation

Regression finds the ‘best’ line for a set of points but does not reveal whether a line is a good representation of the points.  Correlation fills this gap.  It helps to decide whether, in view of the closeness (or otherwise) of the points to a line, the regression line is likely to be of any practical use.  It does this by quantifying the strength of the linear relationship.  It measures whether, overall, the points are close to a straight line.

Quantifies the strength of a linear relationship.

 

Correlation Coefficient

The actual measure of the strength of the relationship is called the correlation coefficient (denoted by r).  The formula by which r is calculated is:

 

$r = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$

 

Correlation coefficients lie between -1 and +1.  Close to 0, they indicate a weak or non-existent linear relationship; close to -1 or +1, a strong one.
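
The behaviour of r can be sketched in Python using the standard product-moment formula. The function name and all three data sets are hypothetical, chosen to give a perfect positive, a perfect negative and a non-existent linear relationship.

```python
import math

def correlation(xs, ys):
    """Return the correlation coefficient r between xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data sets, for illustration only.
r_positive = correlation([1, 2, 3], [2, 4, 6])  # points exactly on a rising line
r_negative = correlation([1, 2, 3], [6, 4, 2])  # points exactly on a falling line
r_none = correlation([1, 2, 3], [1, 2, 1])      # no linear relationship
```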

Consider the square of the correlation coefficient, written with a capital letter: $R^2$.  Before carrying out a regression analysis, one can measure the total variation in the y variable by:

 

Total variation = $\sum (y - \bar{y})^2$

 

This measures the extent to which y varies from its average value.

Part of the total variation in y can be thought of as being ‘caused’ by the x variable.  Regression and correlation are used to investigate the extent to which changes in y are affected by changes in x.  The variation in y that is ‘caused’ by x is called the explained variation.

 

Explained variation = $\sum (\text{Fitted } y - \bar{y})^2$

 

Unexplained variation = $\sum (\text{Residuals})^2$

 

Total Variation = Explained variation + Unexplained variation

 

Correlation coefficient squared:

$R^2$ = Explained variation / Total variation

 

When $R^2$ = 1,

Explained variation = total variation,

unexplained variation = 0.

 

Because the variation is measured in squared terms it is labeled R-squared. 
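
The split of the variation can be checked numerically. The Python below is a sketch on hypothetical data: it fits the least squares line, then computes the total variation (squared deviations of y from its average), the explained variation (squared deviations of the fitted values from the average of y) and the unexplained variation (squared residuals).

```python
def least_squares(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares(xs, ys)
y_bar = sum(ys) / len(ys)
fitted = [a + b * x for x in xs]

total = sum((y - y_bar) ** 2 for y in ys)
explained = sum((f - y_bar) ** 2 for f in fitted)
unexplained = sum((y - f) ** 2 for y, f in zip(ys, fitted))

# Proportion of the variation in y explained by x.
r_squared = explained / total
print(total, explained, unexplained, r_squared)
```

For these figures the explained and unexplained parts add up to the total, as they do for any least squares line; the same is not guaranteed for a line chosen in some other way.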

The essence of correlation is that when a regression analysis is carried out, the variation in the y variable is split into two parts:

a)     Explained

b)     Unexplained

The correlation coefficient squared tells what proportion of the original variation in y has been explained by associating y with x.  The higher the proportion, the stronger the correlation.  In most cases 0.75 or more is regarded as highly satisfactory and 0.50-0.75 as adequate; below 0.50 there are serious doubts.

 

Checking the residuals

 

After the regression line is found, the residuals can be calculated. (Plug in x values, produce fitted y values, and subtract fitted y’s from actual y’s.)

 

Residuals should be random, meaning that they should have no pattern or order.  This determines whether a linear equation is an adequate means of expressing the connection between y and x.

Checking for randomness can be done in two ways.  The first is visual: for example, a scatter diagram can be drawn plotting the residuals against the fitted y values.  This shows the size of the residuals and any pattern in them.  A visual test is all that is necessary to detect an obvious pattern.  Frequently, however, the situation is not clear-cut: there may be a hint of a pattern but it is not definite.  Statistical tests of randomness are a more precise approach.
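
As an illustration of an obvious pattern, the Python sketch below fits a straight line to hypothetical data that actually follow a curve (y equals x squared). The residuals are positive at both ends and negative in the middle, even though the proportion of variation explained is high.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

# Hypothetical curved data: y is x squared, so a straight line is the wrong shape.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]

a, b = least_squares(xs, ys)
fitted = [a + b * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

# Pairs of (fitted y, residual): the points a residual scatter diagram would show.
for f, r in zip(fitted, residuals):
    print(f"fitted {f:6.1f}   residual {r:5.1f}")

y_bar = sum(ys) / len(ys)
r_squared = 1 - sum(r * r for r in residuals) / sum((y - y_bar) ** 2 for y in ys)
```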

 

 

Four Steps in Regression and Correlation

 

1) Inspect the scatter diagram

2) Calculate the regression coefficients

3) Calculate the correlation coefficient

4) Check the residuals for randomness

 

A scatter diagram is a first check that the analysis makes sense.

 

Examining Residuals

 

When each residual is related to the previous residual, there is said to be serial correlation.   This problem occurs particularly in time series data where there may be some time-related cycle.
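
One simple way to look for serial correlation, offered here as a sketch rather than as the method used in the text, is to correlate each residual with the one before it. The residuals below are hypothetical and follow a deliberate cycle; the function names are also hypothetical.

```python
import math

def correlation(xs, ys):
    """Return the correlation coefficient r between xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def lag_one_correlation(residuals):
    """Correlate each residual with the previous residual."""
    return correlation(residuals[:-1], residuals[1:])

# Hypothetical residuals in time order, showing a cycle.
residuals = [1, 2, 3, 2, 1, -1, -2, -3, -2, -1]
lag_r = lag_one_correlation(residuals)
print(f"lag-one correlation of residuals: {lag_r:.3f}")
```

A value well away from zero suggests that each residual is related to the previous one; random residuals would give a value near zero.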

 

When the residuals vary in size at different parts of the line, this is referred to as heteroscedasticity.  It occurs particularly in cross-sectional data, where the size of the residuals may be related to the x value, for example when relating profitability to company size.

 

If there is a pattern, a visual test with a scatter diagram is usually sufficient to detect it.

 

There are statistical tests for detecting patterns. The runs test is the most common.
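
A sketch of a runs test in Python, assuming the usual form based on the signs of the residuals and a normal approximation (the text may present it differently). A run is an unbroken sequence of residuals with the same sign; too few runs suggests a pattern. The function name and residuals are hypothetical, and the approximation is rough for a sample this small.

```python
import math

def runs_test(residuals):
    """Return (runs, expected_runs, z) for the signs of the residuals."""
    signs = [r > 0 for r in residuals if r != 0]  # ignore residuals of exactly zero
    n1 = sum(signs)           # number of positive residuals
    n2 = len(signs) - n1      # number of negative residuals
    n = n1 + n2
    # A new run starts whenever the sign changes.
    runs = 1 + sum(1 for prev, cur in zip(signs, signs[1:]) if prev != cur)
    expected = 2 * n1 * n2 / n + 1
    variance = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    z = (runs - expected) / math.sqrt(variance)
    return runs, expected, z

# Hypothetical residuals in order: all the positives come first, then all the negatives.
residuals = [1, 2, 1, 2, 1, -1, -2, -1, -2, -1]
runs, expected, z = runs_test(residuals)
print(f"runs = {runs}, expected = {expected:.1f}, z = {z:.2f}")
```

Here there are only 2 runs against an expected 6, and z is well below -1.96, so randomness would be rejected at the conventional 5 per cent level.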

 

Reservations about Regression and Correlation

 

a) Correlation implies association, not causality.  This is the largest single source of confusion and error.

b) Guard against spurious regressions. (Using the text example, you can have the same thing appear on both sides of an equation, but in different forms, so correlation is naturally high.)

c) Beware of extrapolation: going too far outside the range of the data.

d) Regression applies only to single sets of data.  A high correlation coefficient can hide the fact that two separate straight lines would be more appropriate.  Be familiar with the data.

e) Least squares should not be treated as over-precise.  It has been applied to y on x; an x on y regression would nearly always produce a different line with a different slope. (The correlation coefficient, r, would be the same, however.)
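
The last reservation can be illustrated with a Python sketch on hypothetical data. It computes the slope of the y on x regression and of the x on y regression, then rewrites the second line with y as a function of x so the two slopes can be compared directly. The function and variable names are hypothetical.

```python
def deviation_sums(xs, ys):
    """Return (sxy, sxx, syy): sums of products and squares of deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy, sxx, syy

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
sxy, sxx, syy = deviation_sums(xs, ys)

slope_y_on_x = sxy / sxx   # slope b in y = a + bx
slope_x_on_y = sxy / syy   # slope d in x = c + dy

# The x on y line, rearranged so that y is expressed in terms of x.
x_on_y_as_y_slope = 1 / slope_x_on_y

# r squared is the same whichever way round the regression is done.
r_squared = sxy ** 2 / (sxx * syy)
print(slope_y_on_x, x_on_y_as_y_slope, r_squared)
```

The two lines coincide only when the points lie exactly on a straight line, that is, when r is +1 or -1.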

 

Key Message from the Module

Regression and correlation are used for predicting and understanding relationships in data.  They have a wide range of applications: economics, sales forecasting, budgeting, costing, human resource planning, corporate planning.

Users of regression can allow the statistics to dominate.  Major errors arise when the wider non-statistical issues have been neglected.

Managers should ask penetrating questions about the way regression and correlation are being applied.

Broad principles and managerial issues are as important as the technical, statistical aspects.

Knowledge of statistical principles is necessary not in order to do the regression analyses but as a passport to a legitimate place in the discussion.