* Data obtained from http://www.ats.ucla.edu/
* Entering High School students make program choices among general program
* (prog = 1), academic program (prog = 2), and vocational program (prog = 3).
* Their choice is likely to depend on aptitude scores and socioeconomic
* status, among other variables.
* ses = socioeconomic status, socst = social studies test score,
* schtyp = school type
* The higher the math, reading, writing, science, and social studies test
* scores, the greater the academic aptitude. The socioeconomic
* status is probably coded as 1 = lowest and 3 = highest.
use c:\data\hsb2.dta
describe
summarize
gen female_math = female*math
* Build a model by a form of backward elimination: start with all of the
* candidate explanatory variables and then eliminate the ones that are not
* statistically significant. The tests below are of zero restrictions
* across the two equations in the system.
* Note: the i.race and i.ses notation expands the categorical variables race and
* ses into separate indicator variables so that we can examine the significance
* of the separate parts. The simple test statements below do not cover
* factor-variable terms such as i.ses across equations, so those are tested later.
mlogit prog read math socst science write female female_math schtyp i.race i.ses, baseoutcome(2) nolog
test read
test math
test socst
test science
test write
test female
test female_math
test schtyp
* Tested one at a time, the variables read, write, female, and female_math are
* each not significant across the two equations at the 10% significance level,
* so we drop them and re-estimate.
mlogit prog math socst science schtyp i.race i.ses, baseoutcome(2) nolog
test math
test socst
test science
test schtyp
* The variables math, socst, science, and schtyp are all highly significant.
* Now we test the significance of i.race and i.ses by breaking the categorical
* variables into explicit indicator variables, so that we can test their
* joint significance across the two multinomial equations.
gen race1 = (race == 1)
gen race2 = (race == 2)
gen race3 = (race == 3)
gen race4 = (race == 4)
gen ses1 = (ses == 1)
gen ses2 = (ses == 2)
gen ses3 = (ses == 3)
mlogit prog math socst science schtyp race2 race3 race4 ses2 ses3, baseoutcome(2) nolog
test race2 race3 race4
test ses2 ses3
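* Possible shortcut (an assumption on our part, not used in the original analysis):
* testparm accepts factor-variable notation, so the joint tests on race and ses
* may be obtainable without hand-built dummies. For a multiple-equation model it
* should test across both equations by default; check the constraint list in
* the output to confirm. The model is the same as the dummy-variable one above.
quietly mlogit prog math socst science schtyp i.race i.ses, baseoutcome(2)
testparm i.race
testparm i.ses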
* In the above model, race is not statistically significant, while ses is.
* We therefore move on to estimate our final equation.
mlogit prog math socst science schtyp ses2 ses3, baseoutcome(2) nolog
test math
test socst
test science
test schtyp
test ses2 ses3
* Every remaining variable is now significant across the two equations.
* Let us now report the final equation in relative-risk ratio (rrr) form
mlogit prog math socst science schtyp ses2 ses3, rrr baseoutcome(2) nolog
* Predict the probability of each program choice and compare with the actual frequencies
quietly mlogit prog math socst science schtyp ses2 ses3, baseoutcome(2)
predict pmlogit1 pmlogit2 pmlogit3, pr
summarize pmlogit* prog, separator(3)
list pmlogit* in 1/10
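* Sanity check (a sketch, not part of the original analysis): the three predicted
* probabilities should sum to one for every student. psum is a scratch variable
* introduced here only for this check and dropped afterwards.
generate double psum = pmlogit1 + pmlogit2 + pmlogit3
assert abs(psum - 1) < 1e-5 if !missing(psum)
drop psum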
* Create Classification Table and get accuracy rate
egen pred_max = rowmax(pmlogit*)
generate pred_choice = .
forvalues i = 1/3 {
    replace pred_choice = `i' if (pred_max == pmlogit`i')
}
local prog_label: value label prog
label values pred_choice `prog_label'
tabulate pred_choice prog
* Tabulate the various prog choices for calculating lift of classifier (mlogit)
tabulate prog
* Accuracy rate = (10 + 87 + 29)/200 = 0.63
* In comparison, naively classifying every student into the majority class
* (prog = 2) would give an accuracy rate of 52.5%.
* See the previous tabulation of the dependent variable, prog.
* Thus, the mlogit classifier provides a LIFT of 63/52.5 = 1.2.
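* The accuracy and lift figures above can also be computed rather than read off
* the tables. A minimal sketch, assuming pred_choice and prog are defined as
* above; the variable correct and the locals n, base, and acc are illustrative
* names that do not appear in the original analysis.
generate byte correct = (pred_choice == prog) if !missing(pred_choice, prog)
quietly summarize correct
local acc = r(mean)
* Baseline: share of the most frequent program category.
quietly count if !missing(prog)
local n = r(N)
local base = 0
forvalues k = 1/3 {
    quietly count if prog == `k'
    local base = max(`base', r(N)/`n')
}
display "Accuracy = " %5.3f `acc' "  Baseline = " %5.3f `base' "  Lift = " %4.2f `acc'/`base'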