* Load the auto.dta data set shipped with Stata's example library
sysuse auto.dta
describe
* Summarize auto data
summarize
* Notice the mean of mpg is 21.2973. We might refer to the mean
* as the naive predictor because it does not use the information
* available in the explanatory variables we are considering here.
* The main question is: "Do any of these models predict better
* on independent observations than the sample mean of 21.2973?"
* Models that do predict better than the sample mean give
* us what is called "lift". We of course prefer models that
* provide a lot of "lift".
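* As an illustrative sketch (not part of the original analysis), the
* in-sample root mean squared error of the naive mean predictor can be
* computed directly; a model with "lift" should beat this benchmark.
quietly summarize mpg
generate double mpg_dev2 = (mpg - r(mean))^2
quietly summarize mpg_dev2
display "RMSE of naive mean predictor = " sqrt(r(mean))
drop mpg_dev2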
* Summarize in detail the factor variable rep78
summarize i.rep78
* Summarize in detail the factor variable rep78 including the base case
summarize ibn.rep78
* Run regression of mpg on weight
regress mpg weight
* Produce a residual-versus-predictor plot (residuals vs. weight)
rvpplot weight
* Next three statements conduct 3 separate tests for heteroscedasticity
estat hettest
estat hettest, iid
estat hettest, fstat
* Run OLS but with heteroscedasticity robust standard errors
regress mpg weight, vce(robust)
* generate a squared variable and an interaction variable
generate weight2 = weight*weight
generate weight_length = weight*length
* A quadratic regression on weight
regress mpg weight weight2
estat hettest
regress mpg weight weight2, vce(robust)
regress mpg weight weight2 weight_length, vce(robust)
* Now let us generate the correlation matrix of all of the continuous
* explanatory variables and the Variance Inflation Factors (VIFs) associated
* with the full model containing all of the explanatory variables.
* As we see below, many of the pairwise correlations between the
* explanatory variables are quite high, and 6 of the explanatory variables
* have VIFs above the benchmark value of 5. Therefore, eliminating some of the
* collinear explanatory variables via variable selection methods may
* be desirable from a predictive-accuracy perspective.
* For the definition of VIF see https://en.wikipedia.org/wiki/Variance_inflation_factor
* VIF = 1/(1 - R(j)^2) where R(j)^2 is the coefficient of determination
* obtained by regressing the j-th explanatory variable on the rest of the
* explanatory variables.
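* A sketch of computing one VIF by hand, assuming the full set of
* regressors used in the model below: regress the j-th explanatory
* variable (here, weight) on the remaining ones and apply VIF = 1/(1 - R^2).
* The result should match the weight entry reported by estat vif.
quietly regress weight weight2 weight_length price rep78 headroom trunk length turn displacement gear_ratio foreign
display "VIF for weight = " 1/(1 - e(r2))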
pwcorr mpg weight weight2 weight_length rep78 headroom trunk length turn displacement gear_ratio
regress mpg weight weight2 weight_length price rep78 headroom trunk length turn displacement gear_ratio foreign
estat vif
* Use the user-written command vselect (install via "ssc install vselect")
* to search for the best model under several criteria
vselect mpg weight weight2 weight_length price rep78 headroom trunk length turn displacement gear_ratio foreign, best
vselect mpg weight weight2 weight_length price rep78 headroom trunk length turn displacement gear_ratio foreign, back bic
vselect mpg weight weight2 weight_length price rep78 headroom trunk length turn displacement gear_ratio foreign, forward bic
* Now let us examine the signs and significance of the coefficients in the
* competing models chosen by the various variable selection techniques
* Best model under the adjusted R-squared criterion; BIC = 386.3016
regress mpg foreign weight weight2 rep78 turn gear_ratio price weight_length
* Best model under the Mallows Cp criterion; BIC = 376.4264
regress mpg foreign rep78 gear_ratio length
* Best model under the AIC criterion; BIC = 380.4026
regress mpg foreign weight weight2 rep78 gear_ratio weight_length
* Best model under the AICC criterion; BIC = 376.4242
regress mpg foreign rep78 gear_ratio weight_length
* Best model under the BIC criterion; BIC = 375.2161
regress mpg weight
* Model chosen by backward selection under the BIC criterion; BIC = 379.1498
* (the same model is chosen by forward selection under BIC)
regress mpg weight2 weight_length rep78 gear_ratio foreign
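* As a hedged aside: the information criteria quoted in the comments above
* can be verified after any of these regressions with estat ic, which
* reports AIC and BIC for the model currently in memory, e.g.:
quietly regress mpg weight
estat ic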