* Unordered Multinomial Choice problem
* Source: Greene and Hensher (1997), downloaded from William Greene, Econometric Analysis website
* Time = terminal waiting time, 0 for car
* Invc = In-vehicle cost component
* Invt = the amount of time spent traveling
* GC = Generalized cost measure
* Hinc = Household income
* Psize = Party size in mode chosen
* Data are in "long form": one row per individual per mode, with choice-specific characteristics varying across the modes each individual faces
* First Mode = air, Second Mode = train, Third Mode = bus, Fourth Mode = car

* Individual-specific variables: hinc, psize.
* Choice-specific variables: gc, invc, invt, time.
* 210 individuals and 840 observations.

clear all
set more off

insheet using "C:\data\travel.csv"
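
* Optional check (an addition, not in the original file) that the loaded data
* match the description above: 840 rows, four alternatives per individual, and
* exactly one chosen mode per individual. This is a sketch that assumes the CSV
* provides id and a 0/1 indicator mode, as the commands below imply.
* The helper variable nchosen is introduced here and dropped again.
assert _N == 840
bysort id: assert _N == 4
bysort id: egen nchosen = total(mode)
assert nchosen == 1
drop nchosen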

* Use "car" as base alternative
* A conditional logit model with only choice-specific variables (gc and time)
asclogit mode gc time, case(id) alternatives(travelmode) basealternative(car) nolog 

* A conditional logit model with both choice-specific variables (gc and time) and an individual-specific variable (hinc)
asclogit mode gc time, case(id) alternatives(travelmode) casevars(hinc) basealternative(car) nolog 

* A conditional logit model with both choice-specific variables (gc and time) and individual-specific variables (hinc and psize)
asclogit mode gc time, case(id) alternatives(travelmode) casevars(hinc psize) basealternative(car) nolog 

* Predicted probabilities of choice of each mode and compare to actual freqs
predict pasclogit, pr
table travelmode, contents(mean mode mean pasclogit sd pasclogit) cellwidth(15)
drop pasclogit

* Reshape the dataset into wide form.
reshape wide mode gc time invc invt, i(id) j(travelmode air train bus car) string
gen mode =.
replace mode = 1 if (modeair == 1)
replace mode = 2 if (modetrain == 1)
replace mode = 3 if (modebus == 1)
replace mode = 4 if (modecar == 1)
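
* Optional sketch (an addition, not in the original file): attach value labels
* to mode so that tabulations show mode names instead of the codes 1-4.
* The label name modelbl is an arbitrary choice made here; the coding follows
* the replace statements above.
label define modelbl 1 "air" 2 "train" 3 "bus" 4 "car"
label values mode modelbl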

* Tabulate the dependent variable (mode)
tabulate mode

* Table of household income by travel mode
table mode, contents(N hinc mean hinc sd hinc)

* Table of terminal time by travel mode
table mode, contents(mean timeair mean timetrain mean timebus mean timecar)

* Multinomial logit with base outcome alternative 4 (car)
mlogit mode hinc, baseoutcome(4)
mlogit mode hinc psize, baseoutcome(4)

* Relative-risk ratio estimates (exponentiated coefficients) - Multinomial logit with base outcome alternative 4 (car)
mlogit mode hinc, rrr baseoutcome(4)

* Predict probabilities of choice of each mode and compare to actual freqs
quietly mlogit mode hinc, baseoutcome(4)
predict pmlogit1 pmlogit2 pmlogit3 pmlogit4, pr
summarize pmlogit* modeair modetrain modebus modecar, separator(4)
list pmlogit* in 1/10

* Create Classification Table and get accuracy rate

egen pred_max = rowmax(pmlogit*)
generate pred_choice = .
forvalues i = 1/4 {
    replace pred_choice = `i' if (pred_max == pmlogit`i')
}
local mode_label: value label mode
label values pred_choice `mode_label'
tabulate pred_choice mode

* The tabulation above shows that the model predicts only two of the four
* classes: train (2) and car (4).
* Accuracy rate = (46 + 40)/210 = 0.4095
* For comparison, a naive classifier that always predicts the majority class
* (train = class 2) would be correct for 30.0% of the observations.
* See the earlier tabulation of the dependent variable (mode).
* The mlogit classifier therefore gives a lift of 40.95/30.0 = 1.365.
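
* A minimal sketch (an addition, not part of the original file) that computes
* the accuracy rate, the majority-class baseline, and the lift directly from
* the data instead of reading them off the tabulation. The names correct, acc,
* and base are introduced here for illustration only.
generate byte correct = (pred_choice == mode) if !missing(pred_choice, mode)
quietly summarize correct
scalar acc = r(mean)

* Baseline: share of the largest observed class, i.e. the accuracy obtained by
* always predicting that class
quietly count if !missing(mode)
local n = r(N)
scalar base = 0
forvalues i = 1/4 {
    quietly count if mode == `i'
    scalar base = max(scalar(base), r(N)/`n')
}

display "Accuracy = " %6.4f scalar(acc) "  Baseline = " %6.4f scalar(base) "  Lift = " %6.3f scalar(acc)/scalar(base)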