GSoC Statsmodels - Discrete choice models

Saturday, 14 September 2013

Conditional Logit Model - An example

Over the last few days I've been working on an example that shows how to estimate a conditional logit model with our new code. Since the IPython notebook has become popular, I decided to try it. You can see it here:

http://nbviewer.ipython.org/6564526

I hope you enjoy it. Please let me know if you have any comments.

The soft "pencils down" date is two days away (the hard "pencils down" deadline is 23 September) and my code for Nested Logit models still has a lot of bugs, so I have to fix it over the next few days. To those who are in the same situation: happy fixing!

Some references:

IPython notebook: http://ipython.org/ipython-doc/dev/interactive/notebook.html
Short instructions from statsmodels to build an example like this: https://github.com/statsmodels/statsmodels/wiki/Examples#user-contributions

Saturday, 31 August 2013

Progress Report Update

These past weeks I have been working on how to provide the initial data for our models. I thought a dictionary would be an easy way to do it. But it isn't... I want to preserve the order in which the data is inserted, and a regular dict in Python 2.7 does not guarantee it (and pprint shows it sorted by key)!
But, after hours of pain, something came to my rescue: a dictionary that remembers insertion order! OrderedDict is supported from Python 2.7 on [1].
This example gives a quick idea of how it works:

Case 1: numbers for dict keys


>>> from collections import OrderedDict
>>> from pprint import pprint
>>> 
>>> V = {
...     '1': ['time',  'cost'],
...     '2': ['time',  'cost'],
...     '3': ['time'],
...      }
>>> pprint (V)
{'1': ['time', 'cost'], '2': ['time', 'cost'], '3': ['time']}
>>> 
>>> V = OrderedDict((
...         ('1', ['time',  'cost']),
...         ('2', ['time',  'cost']),
...         ('3', ['time'])
...         ))
>>> pprint (V)
OrderedDict([('1', ['time', 'cost']), ('2', ['time', 'cost']), ('3', ['time'])])

Case 2: strings for dict keys


>>> from collections import OrderedDict
>>> from pprint import pprint
>>> 
>>> V = {
...     'train': ['time',  'cost'],
...     'bus':   ['time',  'cost'],
...     'car':   ['time'],
...      }
>>> pprint (V)
{'bus': ['time', 'cost'], 'car': ['time'], 'train': ['time', 'cost']}
>>> 
>>> V = OrderedDict((
...         ('train', ['time',  'cost']),
...         ('bus',   ['time',  'cost']),
...         ('car',   ['time'])
...         ))
>>> pprint (V)
OrderedDict([('train', ['time', 'cost']), ('bus', ['time', 'cost']), ('car', ['time'])])

In case 1 (numeric strings as dictionary keys) both outputs list the keys in the same order, because the insertion order happens to coincide with the sorted order. With the string keys of case 2 they differ: only the ordered dictionary preserves the insertion order.
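As a minimal sketch of why the order matters for a model specification (the names alt_index and param_names are hypothetical, chosen only for this illustration), the position of each alternative and the labels of the parameters can be derived directly from the insertion order:

```python
from collections import OrderedDict

# Hypothetical specification: alternative -> variables in its utility function
V = OrderedDict((
    ('train', ['time', 'cost']),
    ('bus',   ['time', 'cost']),
    ('car',   ['time']),
))

# The position of each alternative follows the insertion order
alt_index = OrderedDict((alt, i) for i, alt in enumerate(V))

# Flat list of (alternative, variable) pairs, e.g. to label parameters
param_names = [(alt, var) for alt, variables in V.items() for var in variables]

print(list(alt_index.items()))
print(param_names)
```

With a container that does not guarantee order, the index assigned to each alternative could change between runs or Python versions.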

[1] https://pypi.python.org/pypi/ordereddict

Monday, 19 August 2013

Progress Report

These past weeks I wrote a results class, some tests and some examples for the Conditional Logit model. The model still has some dirty code and some hacks, so I have to work on it this week. I have also been working on the Nested Logit model in parallel but, so far, it is only a draft on paper.

One interesting thing learned: how to rename a file with git.

There is a rename command (git mv):

git mv <oldname> <newname>

Note that the effect is the same as removing the file and adding another one with a different name and the same content.
This command is also used to move a file from one location to another:

git mv <source> <destination>

One interesting thing found:

A useful git visual cheatsheet: http://ndpsoftware.com/git-cheatsheet.htm

And one motivating thing! We received a message on the statsmodels mailing list from someone interested in the Conditional Logit model. He gave us some tips for the implementation and offered to be a beta tester! That was wonderful!

Thursday, 1 August 2013

Quick status - Plans for Next Weeks

These past weeks I've been working on a type of Multinomial Logit model in which the variables can vary over alternatives. That model is also called the conditional logit model.
You can view the code on my github: https://github.com/AnaMP/statsmodels/compare/clogit
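To give a rough idea of what such a model computes, here is a minimal sketch of conditional logit choice probabilities for a single decision maker. It is not the code in the branch above: the function name clogit_probs, the attribute values and the coefficients are all made up for illustration.

```python
import math

def clogit_probs(attributes, beta):
    """Choice probabilities for one decision maker.

    attributes: dict alternative -> list of attribute values
                (the values vary over alternatives)
    beta:       list of coefficients shared by all alternatives
    """
    # Systematic utility of each alternative: V_j = sum_k beta_k * x_jk
    v = {alt: sum(b * x for b, x in zip(beta, xs))
         for alt, xs in attributes.items()}
    v_max = max(v.values())  # subtract the maximum for numerical stability
    expv = {alt: math.exp(val - v_max) for alt, val in v.items()}
    total = sum(expv.values())
    return {alt: e / total for alt, e in expv.items()}

# Made-up travel time and cost for three transport modes
attributes = {'train': [1.0, 3.0], 'bus': [1.5, 2.0], 'car': [0.8, 5.0]}
beta = [-1.2, -0.3]  # hypothetical coefficients for time and cost
probs = clogit_probs(attributes, beta)
print(probs)
```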

The next couple of weeks are going to be spent working on the Nested Logit model. The model is the same as the conditional logit except that it captures correlations between alternatives by partitioning the choice set into nests.

I'll write about this in more detail over the next weeks on the statsmodels wiki:
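As a sketch of the idea, the textbook two-level nested logit probabilities can be written as below. This assumes the usual formulation with one dissimilarity parameter per nest (called lam here); the function name, the nests and the utilities are hypothetical and this is not the project's implementation.

```python
import math

def nested_logit_probs(v, nests, lam):
    """v: dict alternative -> systematic utility
    nests: dict nest -> list of alternatives in that nest
    lam:   dict nest -> dissimilarity parameter of that nest
    """
    # Sum of exp(V_j / lam_k) over the alternatives of each nest k
    nest_sum = {k: sum(math.exp(v[j] / lam[k]) for j in alts)
                for k, alts in nests.items()}
    denom = sum(nest_sum[k] ** lam[k] for k in nests)
    probs = {}
    for k, alts in nests.items():
        for i in alts:
            probs[i] = (math.exp(v[i] / lam[k])
                        * nest_sum[k] ** (lam[k] - 1.0) / denom)
    return probs

# Made-up utilities; train and bus share a "public" nest
v = {'train': -2.1, 'bus': -2.4, 'car': -2.46}
nests = {'public': ['train', 'bus'], 'private': ['car']}
lam = {'public': 0.5, 'private': 1.0}
probs = nested_logit_probs(v, nests, lam)
print(probs)
```

When every dissimilarity parameter equals 1, the expression collapses to the plain logit probabilities, which is one way to see the nested model as a generalization.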

https://github.com/statsmodels/statsmodels/wiki/DCM:-Discrete-choice-models

Stay tuned!

Friday, 12 July 2013

Method for understanding why people make the choices that they do

Discrete choice models (DCM) provide a framework with which to examine and predict how people's choices are influenced by their personal characteristics and by the different attributes of the alternatives available to them.

The models have been used in very different sectors. Some examples are: transport (which mode of transport, car, bus or rail, to take to work; behavioural responses to tolls), health (patients' preferences for treatment alternatives or the choice among alternative healthcare providers) and consumer demand (which car to buy or which postal service to use). They are also used to examine choices made by organizations, such as firms or government agencies.

DCM was developed in the late 1970s. One of its parents is Daniel McFadden, who won the 2000 Nobel Prize in Economics for his theories and methods for analyzing discrete choice. In its beginnings, Moshe Ben-Akiva published a Ph.D. dissertation on the subject, and Jordan Louviere and David Bunch helped develop the original designs for DCM choice experiments. In the 1990s DCM was reworked and gained strong momentum.

A discrete choice model is one in which decision makers choose among a set of alternatives, or choice set. The decision maker obtains a certain level of utility from each alternative. DCMs are usually derived in a random utility model (RUM) framework in which decision makers are assumed to be utility maximizers. The utility that decision maker n obtains from alternative j is U_nj, j = 1, ..., J. The decision maker chooses the alternative with the highest utility: choose alternative i if and only if U_ni > U_nj ∀ j ≠ i. The probability that decision maker n chooses alternative i is then simply P_ni = Prob(U_ni > U_nj ∀ j ≠ i). However, the utility is known only to the decision maker, not to the analyst. Because there are aspects of utility that the researcher does not or cannot observe, utility is decomposed as U_nj = V_nj + ε_nj, where V_nj, which depends on some attributes of the alternatives and of the decision maker, is the systematic component of the decision maker's utility, and ε_nj, which captures the factors that influence utility but are not in V_nj, is the stochastic component.
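The random utility idea can be illustrated with a small simulation. The sketch below assumes, as an extra assumption not stated in the text, that the stochastic components are independent standard Gumbel draws, which is the case that leads to the logit formula. The utilities and function names are made up.

```python
import math
import random

def gumbel(rng):
    """One draw from the standard Gumbel distribution."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def simulate_choice_shares(v, n_draws=20000, seed=0):
    """Share of draws in which each alternative has the highest utility."""
    rng = random.Random(seed)
    counts = {alt: 0 for alt in v}
    for _ in range(n_draws):
        # U_nj = V_nj + eps_nj for every alternative j
        u = {alt: val + gumbel(rng) for alt, val in v.items()}
        counts[max(u, key=u.get)] += 1
    return {alt: c / n_draws for alt, c in counts.items()}

# Made-up systematic utilities for three alternatives
v = {'train': -2.1, 'bus': -2.4, 'car': -2.46}
shares = simulate_choice_shares(v)

# Closed-form logit probabilities for comparison
total = sum(math.exp(x) for x in v.values())
logit = {alt: math.exp(x) / total for alt, x in v.items()}
print(shares)
print(logit)
```

The simulated shares should be close to the logit probabilities, up to sampling noise.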

Wednesday, 3 July 2013

A Discrete Choice Modeling Framework

My project proposal to Statsmodels has been accepted for this year's Google Summer of Code. My work is related to Discrete Choice Models (DCM) based on the random utility maximization (RUM) approach. First, we will work on the multinomial logit and nested logit algorithms and, then, on mixed logit algorithms. You can see my final proposal here.

GSoC 2013 has already started and we are working on implementing a model in which the variables can vary over alternatives, a type of multinomial logit model also called conditional logit, and looking for a new dataset to use in the tests and examples.

We are also working on an outline with use cases, properties and references for the main DCMs based on RUM. In addition, we are listing the statistical software packages and source code available for DCM estimation. You can see it here.

Any feedback, comments, and suggestions will be highly appreciated!

If you want to collaborate on it, you are welcome. Please email me and I'll send you a link to edit the document.

Thursday, 2 May 2013

GSOC 2013. Project Proposal Information

Organization

Python Software Foundation

Proposal Title

Statsmodels: Discrete choice models

Proposal Abstract

The aim of this project is to add discrete choice models to statsmodels and fill a gap in the set of discrete models that are currently available. Statsmodels is a BSD licensed Python package for estimation of many different statistical models.

Multinomial Logit and Nested Logit models have been the workhorse for discrete choice models since the 1970s despite their limitations. They are still the best choice for simpler models. But, thanks to the increased feasibility of computer-intensive simulation approaches, it is now possible to estimate more complex models. Mixed Logit models are gaining attention and use since they can accommodate random taste variation across users or consumers and correlation across alternatives. Furthermore, Mixed Logit models make it possible to use mixed types of data (revealed and stated preferences) or data from different sources.

This project proposes, first, to work on the currently implemented Multinomial Logit and Nested Logit algorithms and, then, to implement Mixed Logit algorithms. I also propose to implement a flexible model specification and several supporting functions for the summary of the model, the result statistics, and statistical tests to check heteroscedasticity, the nesting structure and the random parameters of the model. Working on the model specification will provide users with a user-friendly way to define even complex discrete choice models.

Proposal Detailed Description

As stated, several supporting functions will be improved or implemented: first, a function that returns a summary of the model and the main result statistics; and second, three general statistical tests (the Wald test, the Lagrange multiplier test and the likelihood ratio test), to be used as:

Test of heteroscedasticity.
Test about the nesting structure.
Test of random parameters.
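As a rough sketch of one of these tests, the likelihood ratio statistic compares the log-likelihoods of a restricted and an unrestricted model. The example below handles only the case of a single restriction (one degree of freedom), so that the p-value can be computed with the standard library; the function name and the log-likelihood values are made up for illustration.

```python
import math

def likelihood_ratio_test_1df(llf_restricted, llf_unrestricted):
    """Likelihood ratio test for a single restriction (1 degree of freedom)."""
    stat = 2.0 * (llf_unrestricted - llf_restricted)
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value

# Made-up log-likelihoods, e.g. a multinomial logit (restricted) against
# a nested logit with one extra nest parameter (unrestricted)
stat, p_value = likelihood_ratio_test_1df(-1050.0, -1047.0)
print(stat, p_value)
```

For more than one restriction the chi-square distribution with the matching degrees of freedom would be needed instead.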

I am planning to test the implemented algorithms against other implementations or software with similar model estimation functions, such as mlogit (a package for R), Biogeme and Nlogit, and to implement examples based on the main references on the topic. I am familiar with those packages and have used them before.

From the beginning, I will implement unit tests that verify the implemented algorithms against benchmark results to ensure the correctness of the results. This will make it possible to test the algorithms automatically and catch any error introduced by future modifications of the code.

The task in all parts of the project is to write the statistical models and the related supporting functionality like plots and statistical tests. Depending on the time it takes to implement the primary goals, additional work of this list or proposed by the community could be done.

I plan to communicate with my mentor by email on a weekly basis to set weekly mini-goals and discuss regular code reviews. I will also remain in constant touch with my mentor and the Statsmodels community through IRC and the mailing lists.

I will write weekly update blog posts, and at least two posts showing code snippets that use the package, on http://gsocstatsmodels.blogspot.com.es/. The code will also be pushed regularly to the GitHub repository.

Timeline

Community Bonding Period (May 28 – Jun 16). Familiarize myself with the Statsmodels codebase and the community, the version control system, and the documentation and test systems used. Start to work on the list of models and supporting functionality.

June 17. Begin coding

Weeks 1-2 (June 17 – 28). Study the Statsmodels codebase to get familiar with it and write unit tests for the current Multinomial Logit and Nested Logit algorithms.
Weeks 3-6 (July 1 – 19). Implement flexible model specification and supporting functions for the summary of the model, the result statistics and the three statistical tests.
Week 7 (July 22 – 26). Clean code, improve unit tests and documentation for Multinomial Logit and Nested Logit.
Week 8 (July 29 – August 2). Write blog posts showing code snippets that use the package. Submit mid-term evaluation.

August 2. Mid-term evaluations deadline

Week 9 (August 5 – 9). Start work on mixed logit. Implement a prototype of the required functions, methods or classes that will set the base for implementing the algorithms.
Week 10 (August 12 – 16). Implement a basic algorithm for mixed logit and test it against other available implementations / software.
Week 11 (August 19 – 23). Optimize the implemented algorithm, trying to achieve the best performance and precision possible.
Week 12 (August 26 – 30). Implement unit tests and documentation for the mixed logit algorithms.
Week 13 (September 2 – 13). Finish up any pending code corrections, tests and bug fixes.
Week 14 (September 16 – 20). Clean code, refine unit tests and documentation for the whole project.
Week 15 (September 23 – 27). New blog posts showing code snippets that use the package, and a small white paper covering the investigation, coding and documentation. Submit final evaluations to Google.

September 27. Final evaluation deadline

References

Ben-Akiva, M. and S.R. Lerman. (1985) Discrete Choice Analysis. Theory and Application to Travel Demand. The MIT Press. Cambridge, Massachusetts.

Bierlaire, M. (2003) BIOGEME: A free package for the estimation of discrete choice models. Proceedings of the 3rd Swiss Transportation Research Conference, Ascona, Switzerland.

Croissant, Y. (2010) mlogit: Multinomial Logit Model. R package version 0.1-5.

Domencich, T. and D. McFadden (1972) A Disaggregated Behavioral Model of Urban Travel Demand. Report No. CRA-156-2. Charles River Associates, Inc. Cambridge, Massachusetts.

Hensher, D.A. and W.H. Greene. (2003) The Mixed Logit model: The state of practice. Transportation 30, 133-176.

Hensher, D.A., W.H. Greene and J.M Rose. (2005) Applied Choice Analysis. Cambridge University Press.

Louviere, J.J., D.A. Hensher and J.D. Swait. (2000) Stated Choice Methods: Analysis and Application. Cambridge University Press. Cambridge.

McFadden, D. (2000) Disaggregate behavioral travel demand's RUM guide. A 30-year retrospective. International Association of Travel Behavior Analysts. Brisbane, Australia.

Orro, A. (2006) Modelos de elección discreta en transportes con coeficientes aleatorios. Doctoral thesis. University of A Coruña, A Coruña. Abertis chair. Barcelona.

Ortúzar, J. de D. and L. G. Willumsen. (2001) Modelling Transport. Third edition. Wiley and Sons.

Train, K. (2003) Discrete Choice Methods with Simulation. Cambridge University Press.

Zeileis, A. and Y. Croissant. (2010) Extended Model Formulas in R: Multiple Parts and Multiple Responses. Journal of Statistical Software, 34, 1-13.