Regression analysis is a statistical technique for examining whether a dependent variable can be predicted from one or more independent variables. For example, statisticians often conduct regression analysis to see whether college students’ GPAs can be predicted from SAT scores and/or high school GPAs.
Simple linear regression
In simple linear regression, the goal is to explore whether there is a linear relationship between two variables, so that one variable can be predicted from the other. The illustration below shows a positive correlation between bar tip takings and temperature: bar tip takings can largely be predicted from temperature.
Y = β0 + β1X + ε
Sum of Squares Total (SST): The total variation of the dependent variable (bar tip takings) around its mean
Sum of Squares due to Regression (SSR): The part of the variation of the dependent variable that can be explained by the linear regression model
Sum of Squares due to Error (SSE): The part of the variation of the dependent variable that cannot be explained by the linear regression model
R square: The ratio of SSR to SST, that is, the proportion of the total variation in Y that is explained by the variation in X. The higher the R square, the better the linear regression model fits the sample data.
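The definitions above can be checked numerically with a minimal Python sketch. It fits a line by ordinary least squares and splits the total variation into its explained and unexplained parts. The function names and the five data points are made up for illustration and are not the data behind the article's chart.

```python
def fit_simple_regression(x, y):
    """Return (b0, b1): the least-squares intercept and slope."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def sums_of_squares(x, y):
    """Return (SST, SSR, SSE, R-square) for the fitted line."""
    b0, b1 = fit_simple_regression(x, y)
    mean_y = sum(y) / len(y)
    fitted = [b0 + b1 * xi for xi in x]
    sst = sum((yi - mean_y) ** 2 for yi in y)               # total variation
    ssr = sum((fi - mean_y) ** 2 for fi in fitted)          # explained part
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained part
    return sst, ssr, sse, ssr / sst


# Made-up illustrative data (an assumption, not taken from the article).
temperature = [1.0, 2.0, 3.0, 4.0, 5.0]
tips = [2.0, 4.0, 5.0, 4.0, 5.0]

sst, ssr, sse, r_square = sums_of_squares(temperature, tips)
print(sst, ssr, sse, r_square)
```

For these made-up numbers, SSR and SSE add up to SST (up to floating-point rounding) and R square comes out at about 0.6, so the line explains roughly 60% of the variation in the illustrative tips.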
It is worth noting that, because any given linear regression model (e.g., Ŷ = -353.11 + 123.54X) is estimated from sample data, we can never obtain the true regression line, Y = β0 + β1X + ε, which would require data on the whole population. Accordingly, e denotes the error in the sample data, while ε denotes the error in the population data.
Degrees of Freedom and Adjusted R-square
To understand the concept of degrees of freedom in regression analysis, let’s start with the following question:
Question: What is the minimum number of observations required to estimate this regression: Yi = β0 + β1Xi + εi?
To estimate anything in a statistical analysis, we need variation in the data. In the Yi = β0 + β1Xi + εi case, 2 observations are enough to draw a line relating X and Y, but that line passes through both points exactly and leaves no variation from which to estimate the error. We therefore need at least 3 observations to make an estimate. When n = 2 there are 0 degrees of freedom, so when n = 3 we have 1 degree of freedom. Similarly, we have 2 degrees of freedom when n = 4.
Now let’s increase the number of independent variables to 2.
Yi = β0 + β1X1i + β2X2i + εi
Question: What is the minimum number of observations required to estimate this regression:
Yi = β0 + β1X1i + β2X2i + εi
Here we need at least 3 observations just to establish the Yi = β0 + β1X1i + β2X2i + εi relationship, because three points are required to define a plane. We then need at least 4 observations to have 1 degree of freedom.
We can now generalize this pattern into a formula for the number of degrees of freedom in regression analysis:
df = n - k - 1
where n is the number of observations and k is the number of independent variables
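As a quick check, the formula can be written as a small Python function and applied to the cases discussed above. The function name is illustrative, and the last call uses assumed values (25 observations, 3 independent variables) purely as an example.

```python
def degrees_of_freedom(n, k):
    """Residual degrees of freedom for n observations and k independent variables."""
    if n < k + 1:
        raise ValueError("need at least k + 1 observations to fit the model")
    return n - k - 1


print(degrees_of_freedom(2, 1))   # 0: two points define the line exactly
print(degrees_of_freedom(3, 1))   # 1
print(degrees_of_freedom(4, 1))   # 2
print(degrees_of_freedom(4, 2))   # 1: two independent variables
print(degrees_of_freedom(25, 3))  # 21: assumed example values
```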
As we can see in the illustration above, as df decreases, especially when more independent variables are added to the formula, R-square will increase (or at least not decrease), whether or not the added variables are genuinely useful.
Therefore, we need to adjust R-square to correct for the effect of the decreased df when the number of independent variables increases. In other words, as k increases, adjusted R-square will tend to decrease unless the added variables improve the fit enough to offset the lost degrees of freedom, reflecting the reduced power of the regression model.
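The adjustment is commonly written as adjusted R-square = 1 - (1 - R-square)(n - 1)/(n - k - 1). The text does not state the formula, so the Python sketch below assumes this standard definition. The function name and the R-square values passed in are made up to show the effect.

```python
def adjusted_r_square(r_square, n, k):
    """Adjusted R-square for n observations and k independent variables."""
    df = n - k - 1
    if df <= 0:
        raise ValueError("need more observations than k + 1")
    return 1 - (1 - r_square) * (n - 1) / df


# Made-up values: two extra variables raise R-square only slightly.
one_variable = adjusted_r_square(0.60, n=25, k=1)
three_variables = adjusted_r_square(0.61, n=25, k=3)
print(one_variable, three_variables)
```

With these illustrative inputs, R-square rises from 0.60 to 0.61 but adjusted R-square falls from about 0.583 to about 0.554, because the small gain in fit does not make up for the two degrees of freedom that were used.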
Regression Output Explained
The example used in this section explores whether a country’s latitude affects its medal tally, using data observed at a Winter Olympics. The number of medals won by each country is Y, the dependent variable, while latitude, average elevation, and log population are the 3 independent variables.
Below is a simple illustration that shows a linear relationship between medals and latitude. If β1 is 0, it means latitude has zero effect on medals.
We always need to keep in mind that we use sample data to estimate the true relationship in the population. When a pattern is clearly observed in the sample, we can be highly confident that the true relationship exists.
In the following illustration, the upper example shows a strong relationship, and its confidence interval lies well above zero. In contrast, the lower example shows a relationship that is hard to discern, and its confidence interval includes zero.
In the Winter Olympics sample, we are exploring the relationship between the number of medals and three independent variables: latitude, average elevation, and log population.
Below is the output of the regression analysis.
The ANOVA Section
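First of all, let's take a look at the ANOVA section of the output.
MS = SS/df, where MS is the mean square and SS is the sum of squares. The F statistic is calculated as MSmodel/MSresidual.
F3,21 = 146.42 / 45.45 = 3.22. This exceeds the critical value of about 3.07 for an F distribution with 3 and 21 degrees of freedom, so H0 (that all three slope coefficients are zero) is rejected at the 0.05 significance level.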
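The F-test arithmetic above can be reproduced with a short Python sketch. The two mean squares are the ones quoted in the text. The function names are illustrative, and the critical value of about 3.07 is an approximate table value for F with 3 and 21 degrees of freedom at the 0.05 level, hard-coded here as an assumption rather than computed.

```python
def mean_square(sum_of_squares, df):
    """MS = SS / df."""
    return sum_of_squares / df


def f_statistic(ms_model, ms_residual):
    """F = MSmodel / MSresidual."""
    return ms_model / ms_residual


# Mean squares as quoted in the text.
ms_model = 146.42
ms_residual = 45.45

# Approximate table value for F(3, 21) at alpha = 0.05 (assumed, not computed).
F_CRITICAL_05 = 3.07

f_value = f_statistic(ms_model, ms_residual)
reject_h0 = f_value > F_CRITICAL_05
print(round(f_value, 2), reject_h0)
```

The margin is narrow: the statistic of 3.22 only just clears the 0.05 threshold, so the same model would not be judged significant at the stricter 0.01 level.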
The Variable Section
The coefficient column gives the estimated intercept and slopes, which together form the regression formula for calculating the dependent variable from the 3 independent variables.
Once we have the regression formula, we can use it to predict the total number of medals for a country.
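A prediction of this kind can be sketched in Python as follows. The coefficients and the country's values below are placeholders invented for illustration. They are NOT the figures from the regression output, which should be substituted in. The function name is likewise hypothetical.

```python
def predict_medals(latitude, elevation, log_population, coefficients):
    """Apply Y-hat = b0 + b1*latitude + b2*elevation + b3*log_population."""
    b0, b_latitude, b_elevation, b_log_population = coefficients
    return (b0
            + b_latitude * latitude
            + b_elevation * elevation
            + b_log_population * log_population)


# HYPOTHETICAL coefficients (intercept, latitude, elevation, log population).
# Replace them with the values from the coefficient column of the output.
hypothetical_coefficients = (-10.0, 0.5, 0.002, 1.0)

# HYPOTHETICAL country: latitude 60, average elevation 500, log population 7.
predicted = predict_medals(60.0, 500.0, 7.0, hypothetical_coefficients)
print(predicted)
```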
As statisticians, we always want to evaluate whether each independent variable's coefficient is significant at the .05 or .01 level.