Regression analysis can be more complex than what is covered in the "Dive deep to regression analysis" article. For example, relationships are sometimes non-linear (e.g., quadratic or reciprocal). Occasionally variable values need to be converted to logarithms. When dealing with categorical independent variables (X), we need to create dummy variables and examine the effects of interaction terms. In addition, we need a logistic regression model when dealing with a categorical dependent variable (Y).

Let's use the Jay Bob Car Sale data set to demonstrate how to run regression analysis from a simple model to more complex models.

## Model 1

Regression Model 1: **price_{i} = β_{0} + β_{1}Age_{i} + β_{2}Odometer_{i} + ε_{i}**

Price_{i} = 4615.90 + 98.92Age_{i} - 23.03Odometer_{i}

For every additional year in age, the car can be expected to increase in price by $98.92, on average, holding odometer constant.

For every additional thousand km in odometer, the price is expected to decrease by $23.03, on average, holding age constant.
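Model 1's fitted equation can be evaluated directly. The sketch below is illustrative (the function name is made up, and the units follow the interpretations above: age in years, odometer in thousands of km):

```python
# Predicted price from Model 1's fitted coefficients.
# Units assumed from the text: age in years, odometer in thousands of km.
def predict_price_model1(age, odometer):
    return 4615.90 + 98.92 * age - 23.03 * odometer

# A 10-year-old car with an odometer reading of 80 (thousand km):
price = predict_price_model1(10, 80)  # 4615.90 + 989.20 - 1842.40 = 3762.70
```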

## Model 2

Regression Model 2: **price_{i} = β_{0} + β_{1}Age_{i} + β_{2}Age^{2}_{i} + β_{3}(1/Odometer_{i}) + ε_{i}**

Price_{i} = 8809.03 - 429.63Age_{i} + 7.32Age^{2}_{i} + 1946.46(1/Odometer_{i})

Model 2 explains much more of the variance: R^{2} = 0.59, compared with R^{2} = 0.17 for Model 1.

The problem with Model 2 is that its coefficients are hard to interpret. This brings us to logarithms (logs), a way of rescaling skewed data.

| Original Scale | Log with base 10 |
| --- | --- |
| 10 | log(10) = 1 |
| 100 | log(100) = 2 |
| 1000 | log(1000) = 3 |
| 10000 | log(10000) = 4 |

| Original Scale | Log with base e |
| --- | --- |
| 2.718 | ln(2.718) = 1 |
| 7.389 | ln(7.389) = 2 |
| 20.086 | ln(20.086) = 3 |
| 54.598 | ln(54.598) = 4 |
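The pattern in both tables can be verified with Python's `math` module: each multiplicative step in the original scale becomes one additive unit on the log scale, which is why logs tame heavily skewed variables.

```python
import math

# Base-10 logs: each factor of 10 adds one unit.
base10 = [math.log10(x) for x in (10, 100, 1000, 10000)]   # 1, 2, 3, 4

# Natural logs: each factor of e adds one unit.
natural = [math.log(math.e ** k) for k in (1, 2, 3, 4)]    # 1, 2, 3, 4
```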

## Model 3

Regression Model 3: **price_{i} = β_{0} + β_{1}Age_{i} + β_{2}Age^{2}_{i} + β_{3}ln(Odometer_{i}) + ε_{i}**

Price_{i} = 11863.25 - 365.58Age_{i} + 6.63Age^{2}_{i} - 1079.42 ln(Odometer_{i})

Every unit increase in the natural log of the odometer reading decreases the price by $1079.42, on average, holding age constant.

The coefficient of ln(odometer) can also be interpreted in the following way:

Every 1% increase in the odometer reading decreases the price by about $10.79 (the coefficient divided by 100), on average, holding age constant.
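This level-log arithmetic is easy to check: the exact price change from a 1% increase in odometer is β·ln(1.01), and the usual shortcut is β/100. A sketch, where `beta` is Model 3's ln(Odometer) coefficient:

```python
import math

# Level-log model: price change from a 1% odometer increase.
beta = -1079.42
exact = beta * math.log(1.01)   # exact change, ~ -10.74 dollars
approx = beta / 100             # common shortcut, -10.79 dollars
```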

## Model 4

Regression Model 4: **ln(price_{i}) = β_{0} + β_{1}Age_{i} + β_{2}Age^{2}_{i} + β_{3}ln(Odometer_{i}) + ε_{i}**

Ln(Price_{i}) = 9.39 - 0.05Age_{i} + 0.001Age^{2}_{i} - 0.20 ln(Odometer_{i})

Every 1% increase in the odometer reading decreases the price by about 0.20%, on average, holding age constant. In a log-log model, the coefficient is an elasticity.
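The log-log elasticity can be checked the same way: a 1% odometer increase multiplies price by 1.01 raised to the coefficient.

```python
# Log-log model: percent price change from a 1% odometer increase.
beta = -0.20
pct_change = (1.01 ** beta - 1) * 100   # close to -0.2, i.e. a ~0.2% drop
```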

## Categorical Independent Variables (X)

**A simple situation is a binary independent variable:**

Pinkslip = 1 if car has roadworthy certificate

Pinkslip = 0 if car doesn't have roadworthy certificate

Let's try a simple model: **price_{i} = β_{0} + β_{1}pinkslip_{i} + ε_{i}**

price_{i} = 3978.26 + 1625.24(pinkslip_{i})

On average, a car with a pink slip sells for $1625.24 more than a car without one.

## Model 5

Regression Model 5: **ln(price_{i}) = β_{0} + β_{1}Age_{i} + β_{2}Age^{2}_{i} + β_{3}ln(Odometer_{i}) + β_{4}(pinkslip_{i}) + ε_{i}**

Ln(Price_{i}) = 9.24 - 0.05Age_{i} + 0.001Age^{2}_{i} - 0.20 ln(Odometer_{i}) + 0.16(pinkslip_{i})

On average, holding the other variables constant, a car with a pink slip sells for about 16% more than a car without one.

**Another situation is that an independent variable has multiple categorical values.**

Let's create a new variable, ageCat, based on the age variable:

ageCat = 1 when age <= 5

ageCat = 2 when 5 < age <= 15

ageCat = 3 when 15 < age < 35

ageCat = 4 when age >= 35

Next, we create one dummy variable per category:

ageCat1 = 1 if ageCat = 1, and ageCat1 = 0 otherwise

ageCat2 = 1 if ageCat = 2, and ageCat2 = 0 otherwise

ageCat3 = 1 if ageCat = 3, and ageCat3 = 0 otherwise

ageCat4 = 1 if ageCat = 4, and ageCat4 = 0 otherwise
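The bucketing and dummy coding above can be sketched as follows (the function names are illustrative, not from the data set):

```python
# Bucket a car's age into the four categories defined above.
def age_category(age):
    if age <= 5:
        return 1
    elif age <= 15:
        return 2
    elif age < 35:
        return 3
    return 4

# One dummy per category: [ageCat1, ageCat2, ageCat3, ageCat4].
def age_dummies(age):
    cat = age_category(age)
    return [1 if cat == k else 0 for k in (1, 2, 3, 4)]
```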

Regression Model 5a: **ln(price_{i}) = β_{0} + β_{1}ageCat1_{i} + β_{2}ageCat2_{i} + β_{3}ageCat3_{i} + β_{4}ageCat4_{i} + β_{5}ln(Odometer_{i}) + β_{6}(pinkslip_{i}) + ε_{i}**

## Model 6

But ageCat1 = 1 - ageCat2 - ageCat3 - ageCat4, so we must drop one dummy to avoid perfect collinearity. We drop ageCat1, which then serves as the baseline category for ageCat2, ageCat3, and ageCat4. In fact, any of the dummies can be dropped; we drop ageCat1 because it is a natural baseline. Another common rule is to drop the dummy variable with the most cases.

**ln(price_{i}) = β_{0} + β_{1}ageCat2_{i} + β_{2}ageCat3_{i} + β_{3}ageCat4_{i} + β_{4}ln(Odometer_{i}) + β_{5}(pinkslip_{i}) + ε_{i}**

Ln(price_{i}) = 8.95 - 0.13(ageCat2_{i}) - 0.73(ageCat3_{i}) + 0.47(ageCat4_{i}) - 0.22 ln(Odometer_{i}) + 0.34(pinkslip_{i})

On average, when holding all other variables constant, a car in age category 2 (ageCat2) commands a price roughly 13% lower than a car in age category 1 (the baseline).

Similarly, a car in age category 3 (ageCat3) commands a price roughly 73% lower than a car in age category 1, holding all other variables constant.

And a car in age category 4 (ageCat4) commands a price roughly 47% higher than a car in age category 1, holding all other variables constant.
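Reading a log-model coefficient directly as a percent is an approximation; the exact effect of a dummy is 100·(e^β − 1), and the gap matters for large coefficients such as −0.73. A quick check:

```python
import math

# Exact percent effect of each dummy coefficient in a ln(price) model.
# The raw coefficient is only a good approximation when |beta| is small.
exact_effects = {b: (math.exp(b) - 1) * 100 for b in (-0.13, -0.73, 0.47)}
# -0.13 -> about -12%; -0.73 -> about -52%; 0.47 -> about +60%
```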

## Interaction Terms

Assume we want to build a model to explain the salary of all Google employees.

Salary_{i} = β_{0} + β_{1}(employeeAge_{i}) + β_{2}(collegeDegree_{i}) + ε_{i}

Salary_{i} = β_{0} + β_{1}(employeeAge_{i}) + β_{2}(collegeDegree_{i}) + β_{3}(employeeAge_{i} × collegeDegree_{i}) + ε_{i}

Interaction terms are required when x1 affects the relationship between x2 and y. In the Google employee salary case, employee age affects the relationship between collegeDegree and salary. *A common misconception is that interaction terms are required when x1 and x2 are correlated.*

Model 7: **ln(price_{i}) = β_{0} + β_{1}ageCat2_{i} + β_{2}ageCat3_{i} + β_{3}ageCat4_{i} + β_{4}ln(Odometer_{i}) + β_{5}(pinkslip_{i}) + β_{6}(ageCat4_{i} × pinkslip_{i}) + ε_{i}**

Ln(price_{i}) = 9.13 - 0.18(ageCat2_{i}) - 0.80(ageCat3_{i}) - 0.39(ageCat4_{i}) - 0.21 ln(Odometer_{i}) + 0.12(pinkslip_{i}) + 1.37(ageCat4_{i} × pinkslip_{i})

Note: the interaction term is now highly significant (p = 0.003).

**Interpretation**:

For cars less than 35 years old, having a pink slip increases the price by about 12%, holding all other variables constant.

For cars 35 years old or older, having a pink slip increases the price by about 149%, holding all other variables constant: 149% = 137% + 12%.
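The 149% figure comes from adding the two log-scale coefficients before converting to a percent:

```python
# Model 7: total pink-slip effect for an ageCat4 car is the main effect
# plus the interaction coefficient (both on the log-price scale).
pinkslip_main = 0.12
age4_interaction = 1.37

effect_under_35 = pinkslip_main                    # ~12%
effect_35_plus = pinkslip_main + age4_interaction  # 1.49 -> ~149%
```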

## Categorical Dependent Variable (Y)

Now we would like to use a car's sold status as the dependent variable:

Sold = 1 if a car has been sold

Sold = 0 if a car has not been sold

If we simply ran a linear regression of Sold on the predictors (a linear probability model), the model just wouldn't look right. For example, the predicted probability of selling a $40,000 car would come out negative.

When we instead model ln(p/(1-p)), the log-odds, the transformed outcome is unbounded and roughly symmetric, so a linear model makes sense.

ln(p/(1-p)) = β0 + β1(price) + β2(pinkslip) + ε

When y is either 0 or 1, this regression model is called **binomial logistic regression**.

ln(p/(1-p)) = 0.40 - 0.17(price) + 1.55(pinkslip)

Odds ratio for price (in units of $1000): e^{-0.17} = 0.84

Odds ratio for pinkslip: e^{1.55} = 4.71

For a $1000 increase in price, the log-odds of selling the car decreases by 0.17, on average, holding all other variables constant.

A better way: for a $1000 increase in price, the odds of selling the car decrease by 16%, on average, holding all else constant.

1 - 0.84 = 0.16 = 16%

Cars with a pink slip have 4.71 times the odds of being sold compared to cars without pink slip, on average, holding all else constant.

Remember: "chance" = "probability". However, both chance and probability are different from ODDS.

For example, if the probability of raining tomorrow is 20%, what is the odds of having rain tomorrow?

ODDS = p/(1-p) = 0.20/(1-0.20) = 0.20/0.80 = 0.25, or 1:4, meaning 1 chance of rain vs. 4 chances of no rain.
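The rain example in code, with conversions in both directions:

```python
# Convert a probability to odds: p / (1 - p).
def odds_from_prob(p):
    return p / (1 - p)

# Convert odds back to a probability: odds / (1 + odds).
def prob_from_odds(odds):
    return odds / (1 + odds)

rain_odds = odds_from_prob(0.20)   # 0.25, i.e. odds of 1:4
```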

**a) Find the probability that a $4500 car with a pink slip will sell.**

Ln(p/(1-p)) = 0.40 - 0.17(4.5) + 1.55(1) = 1.185

p/(1-p) = e^{1.185} = 3.27

p = 3.27 - 3.27p

p = 3.27/4.27 = 0.766

Answer: The probability of sale for a car priced at $4500 with a pink slip is 76.6%.

**b) Find the probability that a $4500 car without a pink slip will sell.**

Ln(p/(1-p)) = 0.40 - 0.17(4.5) + 1.55(0) = -0.365

p/(1-p) = e^{-0.365} = 0.69

p = 0.69 -0.69p

p = 0.69/1.69 = 0.41

Answer: The probability of sale for a car priced at $4500 without a pink slip is 41%.
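Parts a) and b) can be computed in one step with the inverse-logit (sigmoid) function, which solves the p/(1-p) algebra automatically:

```python
import math

# Predicted sale probability from the fitted logistic model
# (price in units of $1000), via the inverse-logit transform.
def sell_probability(price_thousands, pinkslip):
    log_odds = 0.40 - 0.17 * price_thousands + 1.55 * pinkslip
    return 1 / (1 + math.exp(-log_odds))

p_with = sell_probability(4.5, 1)     # about 0.766
p_without = sell_probability(4.5, 0)  # about 0.41
```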

c) Calculate the odds in a) and b) and find the odds ratio.

| | a) | b) |
| --- | --- | --- |
| Car | $4500 with pink slip | $4500 without pink slip |
| Prob | 0.766 | 0.41 |
| Odds | 0.766/(1-0.766) = 3.27 | 0.41/(1-0.41) = 0.69 |

Odds ratio = 3.27/0.69 = 4.71

The pink slip odds ratio can also be calculated from the coefficient (1.55):

Odds ratio for pink slip: e^{1.55} = 4.71

95% C.I. for the odds ratio of pink slip, found by exponentiating the endpoints of the coefficient's interval: e^{1.55 ± 1.96 × 0.53} = [1.67, 13.31]

Similarly, the odds ratio for price is: e^{-0.17} = 0.84

95% C.I. for the odds ratio of price: e^{-0.17 ± 1.96 × 0.06} = [0.75, 0.95]
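The standard way to get a confidence interval for an odds ratio is to form the interval on the coefficient (log-odds) scale and then exponentiate its endpoints, rather than adding ±1.96·SE to the odds ratio itself. A sketch using the coefficients and standard errors above:

```python
import math

# 95% CI for an odds ratio: build the CI on the log-odds scale,
# then exponentiate both endpoints.
def odds_ratio_ci(beta, se, z=1.96):
    return (math.exp(beta - z * se), math.exp(beta + z * se))

pinkslip_or_ci = odds_ratio_ci(1.55, 0.53)   # roughly (1.67, 13.31)
price_or_ci = odds_ratio_ci(-0.17, 0.06)     # roughly (0.75, 0.95)
```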

## Final Interpretation

Logistic regression was used to analyze the relationship between price and pink slip on the probability of car sale.

It was found that, holding price constant, cars with a pink slip had 4.71 times the odds of sale of cars without one (95% CI for the odds ratio [1.67, 13.31]).

It was also found that, holding pink slip constant, the odds of sale decreased by about 16% (95% CI for the odds ratio [0.75, 0.95]) for each $1000 increase in price.