/* This program was written by a team of students in the Eco 5385 class
   ("Data Mining Techniques for Economists") at SMU during the first summer
   session of 2006.  The team members were Stefan Avdjiev, Abdullah Kumas,
   Aditi Roy, and Jayjit Roy.  The program was written as a "special project"
   assigned to the Ph.D. students in the class, with leadership provided by my
   teaching assistant Stefan Avdjiev.  The purpose of the program is to
   demonstrate that, when building econometric models through multiple testing
   (e.g. multiple regressions chosen by best subset selection methods), it is
   best to check the validity of such chosen models by "scoring" their
   predictions on an independent data set and comparing their accuracy to the
   accuracy of predictions produced by some kind of naive benchmark that
   involves no economic variables whatsoever.  In the statistical and data
   mining literature, this is called examining the efficacy of a proposed
   model by cross-validation. */

/* The program below generates 1000 samples of data, each sample consisting of
   500 observations.  The data consist of one dependent variable labeled y and
   20 explanatory (input) variables labeled X1 through X20.  The y and X1
   through X20 variables are generated as INDEPENDENT N(0,1) random variables,
   so any chosen multiple regression of y on an intercept and variables X1
   through X20 is a "spurious" regression by construction.  To demonstrate the
   usefulness of cross-validation in ferreting out spurious regressions, each
   of the 1000 samples is divided into two parts: the first 250 observations
   are designated as the training (estimation) data set and the second 250
   observations are designated as the validation (out-of-sample) data set
   reserved for validation scoring.  That is, in any one sample, the "best"
   multiple regression is chosen by a best subset selection procedure (here,
   selection = backward) based on the first 250 observations, and the
   estimated model is then used to score (predict) the remaining 250
   observations on the dependent variable y.  The measure of prediction
   accuracy is chosen to be mean square error (MSE), though other measures
   such as mean absolute error (MAE) would do just as well.  The "naive"
   benchmark is the sample mean of y computed from the training data set
   (i.e. the first 250 observations), which is then used to predict the
   validation data set values of y.  Again the measure of prediction accuracy
   is MSE. */

/* To show the spuriousness of the best subset multiple regressions in each of
   the 1000 samples, the program compares the MSE of the chosen multiple
   regression with the MSE of the training set sample mean and records a 1 if
   the chosen multiple regression has the smaller MSE and a 0 otherwise.
   After running through all 1000 samples and scoring the outcome of each,
   the program provides two outputs: "scoresum", the number of times (out of
   1000) that the best subset multiple linear regression "beat" the training
   set mean in predicting the validation data set y's (based on MSE), and
   "scoreavg" = scoresum/1000, the proportion of samples in which the best
   subset multiple linear regression outperformed the training set mean.  One
   would expect this proportion to be low because the sample mean is the best
   predictor, given that the explanatory variables X1 through X20 have been
   generated independently of the dependent variable y.  Also, as a separate
   output, the results of the simulation are reported sample by sample. */
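/* In symbols, for each sample the two validation-set accuracy measures
   compared below (called MSSR and MSSRbar in the IML code) are

      MSEreg  = (1/250) * sum over the 250 validation obs of (y - yhat)**2
      MSEmean = (1/250) * sum over the 250 validation obs of (y - ybar)**2

   where yhat is the prediction from the best subset regression estimated on
   the training half of the sample and ybar is the training set sample mean
   of y.  The program records score = 1 for a sample whenever MSEreg < MSEmean. */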
/* Students can repeat this experiment using different random number seeds and
   different best subset selection techniques.  In addition to the backward
   selection method, SAS PROC REG supports other best subset selection methods,
   including "selection = forward", "selection = stepwise", "selection = adjrsq",
   and "selection = cp".  Like backward selection, the forward and stepwise
   selection techniques are "directed" searches and do not take long to execute
   on each sample; a commented-out stepwise example appears just after the PROC
   REG step below.  In contrast, the adjusted R-squared (adjrsq) and Mallows'
   Cp (cp) searches are comprehensive (all-subset) searches and, given 20
   regressors, take a long time to run, especially when iterating over 1000
   separate samples.  Thus, it is not advised to run this simulation using the
   comprehensive search methods adjrsq or cp.  Otherwise, students should test
   the proposition that the results produced by this simulation are not
   significantly affected by the choice of seeds or by the choice among the
   "directed" best subset selection techniques backward, forward, and stepwise. */

/* Tom Fomby, Dept. of Economics, SMU, Dallas, TX 75275 */
/* June, 2006 */

Options nodate nonumber;

data train val;
/* Initializing the random number seeds */
seedy=127389;
seedx1=748492;    seedx2=564321;    seedx3=54321;     seedx4=254321;    seedx5=6154321;
seedx6=6454321;   seedx7=67584921;  seedx8=62554321;  seedx9=6546321;   seedx10=6543721;
seedx11=6549321;  seedx12=6543021;  seedx13=6543221;  seedx14=65433218; seedx15=6543261;
seedx16=65432816; seedx17=6543291;  seedx18=6543210;  seedx19=6543212;  seedx20=6543215;
/* Generating 1000 samples of 500 observations each, with each sample divided
   into two parts: the first 250 observations form the training data set and
   the second 250 observations form the validation data set. */
do sample = 1 to 1000;
   do time = 1 to 500;
      y = rannor(seedy+sample);
      x1 = rannor(seedx1+sample);
      x2 = rannor(seedx2+sample);
      x3 = rannor(seedx3+sample);
      x4 = rannor(seedx4+sample);
      x5 = rannor(seedx5+sample);
      x6 = rannor(seedx6+sample);
      x7 = rannor(seedx7+sample);
      x8 = rannor(seedx8+sample);
      x9 = rannor(seedx9+sample);
      x10 = rannor(seedx10+sample);
      x11 = rannor(seedx11+sample);
      x12 = rannor(seedx12+sample);
      x13 = rannor(seedx13+sample);
      x14 = rannor(seedx14+sample);
      x15 = rannor(seedx15+sample);
      x16 = rannor(seedx16+sample);
      x17 = rannor(seedx17+sample);
      x18 = rannor(seedx18+sample);
      x19 = rannor(seedx19+sample);
      x20 = rannor(seedx20+sample);
      if time <= 250 then output train;
      if time > 250 then output val;
   end;
end;
run;

/* Running the best subset regressions sample by sample */
proc reg data=train outest=trainco noprint;
   model y = x1-x20/selection=backward;
   by sample;
run;
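/* For example, to rerun the experiment with a different directed search, the
   PROC REG step above can be replaced by the commented-out step below, which
   uses stepwise selection (forward selection works the same way). */
/*
proc reg data=train outest=trainco noprint;
   model y = x1-x20/selection=stepwise;
   by sample;
run;
*/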
/* Trimming the trainco data set and assigning a value of 0 to the coefficient
   of each variable dropped by the selection procedure (PROC REG leaves such
   coefficients missing in the OUTEST= data set). */
data trainco;
set trainco (keep = Intercept x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14
                    x15 x16 x17 x18 x19 x20 sample);
If Intercept=. then Intercept=0;
If x1=. then x1=0;   If x2=. then x2=0;   If x3=. then x3=0;   If x4=. then x4=0;
If x5=. then x5=0;   If x6=. then x6=0;   If x7=. then x7=0;   If x8=. then x8=0;
If x9=. then x9=0;   If x10=. then x10=0; If x11=. then x11=0; If x12=. then x12=0;
If x13=. then x13=0; If x14=. then x14=0; If x15=. then x15=0; If x16=. then x16=0;
If x17=. then x17=0; If x18=. then x18=0; If x19=. then x19=0; If x20=. then x20=0;
run;

/* Renaming the coefficients */
data trainco1;
set trainco(rename=(x1=b1 x2=b2 x3=b3 x4=b4 x5=b5 x6=b6 x7=b7 x8=b8 x9=b9 x10=b10
                    x11=b11 x12=b12 x13=b13 x14=b14 x15=b15 x16=b16 x17=b17 x18=b18
                    x19=b19 x20=b20));
run;

/* Renaming the y variable from the training data set */
data train;
set train (rename=(y=ytrain));
run;

/* Duplicating each sample's coefficient vector 250 times, once for each
   validation observation in that sample */
data newtrainco1;
set trainco1;
do count = 1 to 250;
   output newtrainco1;
end;
run;

/* Horizontally concatenating (one-to-one merging) the validation data with the
   duplicated coefficients */
data combo;
merge val newtrainco1;
run;

/* Starting PROC IML */
proc iml;
use combo;
/* Defining a column vector of ones */
ones=J(250000,1,1);
/* Reading the variables into PROC IML */
read all var{Intercept} into Intercept;
read all var{x1} into x1;   read all var{x2} into x2;   read all var{x3} into x3;
read all var{x4} into x4;   read all var{x5} into x5;   read all var{x6} into x6;
read all var{x7} into x7;   read all var{x8} into x8;   read all var{x9} into x9;
read all var{x10} into x10; read all var{x11} into x11; read all var{x12} into x12;
read all var{x13} into x13; read all var{x14} into x14; read all var{x15} into x15;
read all var{x16} into x16; read all var{x17} into x17; read all var{x18} into x18;
read all var{x19} into x19; read all var{x20} into x20;
read all var{b1} into b1;   read all var{b2} into b2;   read all var{b3} into b3;
read all var{b4} into b4;   read all var{b5} into b5;   read all var{b6} into b6;
read all var{b7} into b7;   read all var{b8} into b8;   read all var{b9} into b9;
read all var{b10} into b10; read all var{b11} into b11; read all var{b12} into b12;
read all var{b13} into b13; read all var{b14} into b14; read all var{b15} into b15;
read all var{b16} into b16; read all var{b17} into b17; read all var{b18} into b18;
read all var{b19} into b19; read all var{b20} into b20;
read all var{y} into y;
/* Predicting y in the validation data set using the coefficients estimated on
   the training data set (# is elementwise multiplication, so each validation
   observation is multiplied by its own sample's coefficients) */
yhat = ones#Intercept + b1#x1 + b2#x2 + b3#x3 + b4#x4 + b5#x5 + b6#x6 + b7#x7
       + b8#x8 + b9#x9 + b10#x10 + b11#x11 + b12#x12 + b13#x13 + b14#x14
       + b15#x15 + b16#x16 + b17#x17 + b18#x18 + b19#x19 + b20#x20;
/* Finding the residuals */
residual = (y-yhat);
/* Finding the squared residuals */
sqres = residual##2;
/* Creating a new data set that can be used outside of PROC IML */
create newcombo var{yhat residual sqres};
append;
quit;

/* Horizontally concatenating the 3 data sets */
data finalcombo;
merge combo newcombo train;
run;

/* Starting PROC IML */
proc iml;
use finalcombo;
/* Reading the variables into PROC IML */
read all var{sqres} into sqres;
read all var{ytrain} into ytrain;
read all var{y} into y;
SR = J(250,1000,0);   /* Initializing a matrix of squared residuals */
SSR = J(1,1000,0);    /* Initializing a vector of sums of squared residuals */
MSSR = J(1,1000,0);   /* Initializing a vector of means of squared residuals */
/* Looping over samples to compute the best subset linear regression MSSR
   (validation-set MSE) sample by sample */
do i = 1 to 1000;
   SR[,i] = sqres[(i-1)*250+1:250*i]; /* Squared residuals for sample i */
   SSR[i] = sum(SR[,i]);              /* Sum of squared residuals for sample i */
   MSSR[i] = SSR[i]/250;              /* Mean squared residual (MSE) for sample i */
end;
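/* Note on the indexing used above and below: the observations are stacked
   sample by sample, so rows (i-1)*250+1 through i*250 of sqres, y, and ytrain
   all belong to sample i (validation rows for sqres and y, training rows for
   ytrain). */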
/* Evaluating the alternative (benchmark) estimator, i.e. the sample mean of
   the training data set */
/* Initializing the vectors and matrices needed for the evaluation of the
   second predictor */
ysum = J(1,1000,0);
ybar = J(1,1000,0);
error = J(250,1000,0);
errorsq = J(250,1000,0);
SSRbar = J(1,1000,0);
MSSRbar = J(1,1000,0);
/* Looping over samples to compute the training set sample mean MSSR
   (validation-set MSE) sample by sample */
do i = 1 to 1000;
   ysum[i] = sum(ytrain[(i-1)*250+1:250*i]);
   ybar[i] = ysum[i]/250;
   error[,i] = (y[(i-1)*250+1:250*i]-ybar[i]);
   errorsq[,i] = error[,i]##2;
   SSRbar[i] = sum(errorsq[,i]);
   MSSRbar[i] = SSRbar[i]/250;
end;
/* Initializing the score vector and the scoresum variable */
score = J(1,1000,0);
scoresum = 0;
/* Creating an indicator variable (score) that takes the value 1 when the best
   subset linear regression predictor outperforms the sample mean predictor */
do i = 1 to 1000;
   if MSSR[i] < MSSRbar[i] then score[i] = 1;
   else score[i] = 0;
end;
/* Computing the number and proportion of samples in which the best subset
   linear regression predictor outperforms the sample mean predictor */
scoresum = sum(score);
scoreavg = scoresum/1000;
/* Printing out scoresum and scoreavg */
title1 'The total number of times (scoresum) out of 1000 samples that the best subset multiple';
title2 'linear regression beat (in MSE) the training set sample mean in predicting';
title3 'the validation set values of the dependent variable y.';
title4 'Scoreavg is the proportion of samples in which the best subset multiple linear';
title5 'regression beat the training set sample mean in the validation data sets.';
print scoresum scoreavg;
create funcombo var{SSR MSSR MSSRbar score};
append;
quit;

/* Reporting the actual results for each sample */
/* Variables are renamed slightly to make them more self-explanatory */
data funcombo;
set funcombo;
SSRreg = SSR;
MSSRreg = MSSR;
MSSRybar = MSSRbar;
run;

title 'Results by Sample';
proc print data=funcombo obs='sample';
   var SSRreg MSSRreg MSSRybar Score;
run;
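/* Optional follow-up for students (an illustrative sketch, not required by the
   assignment): compare the average size of the regression MSE and the
   benchmark MSE across the 1000 samples. */
title 'Average validation-set MSEs across the 1000 samples';
proc means data=funcombo mean std;
   var MSSRreg MSSRybar;
run;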