* Load the auto.dta data set from the Stata example library and describe it
sysuse auto.dta
describe
* Summarize the auto data
summarize
* Notice that the mean of mpg is 21.2973. We might refer to the mean
* as the naive predictor because it does not use the information
* available in the explanatory variables we are considering here.
* The main question is "Do any of these models predict better
* on independent observations than the sample mean of 21.2973?"
* The models that do predict better than the sample mean give
* us what is called "lift". We of course prefer models that
* provide a lot of "lift".
* Summarize in detail the factor variable rep78
summarize i.rep78
* Summarize in detail the factor variable rep78, including the base case
summarize ibn.rep78
* Run a regression of mpg on weight
regress mpg weight
* Produce a plot of the residuals versus weight
rvpplot weight
* The next three statements conduct three separate tests for heteroscedasticity
estat hettest
estat hettest, iid
estat hettest, fstat
* Run OLS but with heteroscedasticity-robust standard errors
regress mpg weight, vce(robust)
* Generate a squared variable and an interaction variable
generate weight2 = weight*weight
generate weight_length = weight*length
* A quadratic regression on weight
regress mpg weight weight2
estat hettest
regress mpg weight weight2, vce(robust)
regress mpg weight weight2 weight_length, vce(robust)
* Now let us generate the correlation matrix of all of the continuous
* explanatory variables and the variance inflation factors (VIFs) associated
* with the full model containing all of the explanatory variables.
* As we see below, many of the pairwise correlations between the
* explanatory variables are quite high, and 6 of the explanatory variables
* have VIFs above the benchmark value of 5. Therefore, eliminating some of the
* collinear explanatory variables via variable selection methods might
* be desirable from a prediction-accuracy perspective.
* For the definition of VIF see https://en.wikipedia.org/wiki/Variance_inflation_factor
* VIF = 1/(1 - R(j)^2), where R(j)^2 is the coefficient of determination
* obtained by regressing the j-th explanatory variable on the rest of the
* explanatory variables.
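* A minimal sketch of the VIF formula computed by hand for one regressor,
* weight: regress weight on the remaining explanatory variables of the full
* model and take 1/(1 - R^2). The choice of weight is only illustrative; the
* value should reproduce the weight entry reported by estat vif below.
quietly regress weight weight2 weight_length price rep78 headroom trunk length turn displacement gear_ratio foreign
display "VIF for weight = " 1/(1 - e(r2))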
pwcorr mpg weight weight2 weight_length rep78 headroom trunk length turn displacement gear_ratio
regress mpg weight weight2 weight_length price rep78 headroom trunk length turn displacement gear_ratio foreign
estat vif
* Use the downloaded procedure "vselect" to search for the best model
vselect mpg weight weight2 weight_length price rep78 headroom trunk length turn displacement gear_ratio foreign, best
vselect mpg weight weight2 weight_length price rep78 headroom trunk length turn displacement gear_ratio foreign, back bic
vselect mpg weight weight2 weight_length price rep78 headroom trunk length turn displacement gear_ratio foreign, forward bic
* Now let us examine the signs and significance of the coefficients in the
* competing models chosen by the various variable selection techniques.
* Best option using the R2ADJ criterion, BIC = 386.3016
regress mpg foreign weight weight2 rep78 turn gear_ratio price weight_length
* Best option using Mallows' Cp criterion, BIC = 376.4264
regress mpg foreign rep78 gear_ratio length
* Best option using the AIC criterion, BIC = 380.4026
regress mpg foreign weight weight2 rep78 gear_ratio weight_length
* Best option using the AICC criterion, BIC = 376.4242
regress mpg foreign rep78 gear_ratio weight_length
* Best option using the BIC criterion, BIC = 375.2161
regress mpg weight
* Backward option using the BIC criterion, BIC = 379.1498
* (same model as chosen by the forward option with the BIC criterion)
regress mpg weight2 weight_length rep78 gear_ratio foreign
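* A minimal sketch of the out-of-sample "lift" question raised at the top:
* fit a model on a random training half and compare its holdout RMSE with
* the holdout RMSE of the training-sample mean of mpg. The seed, the 50/50
* split, and the use of the BIC-selected model (mpg on weight) are
* illustrative assumptions, not part of the analysis above.
set seed 12345
generate train = runiform() < 0.5
quietly regress mpg weight if train
predict mpg_hat_model, xb
quietly summarize mpg if train
generate mpg_hat_mean = r(mean)
* Squared prediction errors on the holdout observations only
generate sqerr_model = (mpg - mpg_hat_model)^2 if !train
generate sqerr_mean = (mpg - mpg_hat_mean)^2 if !train
quietly summarize sqerr_model
display "Holdout RMSE, regression of mpg on weight: " sqrt(r(mean))
quietly summarize sqerr_mean
display "Holdout RMSE, training-sample mean:        " sqrt(r(mean))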