Thursday, May 2, 2013

GSoC 2013. Project Proposal Information


Organization
Python Software Foundation

Proposal Title 
Statsmodels: Discrete choice models

Proposal Abstract
The aim of this project is to add discrete choice models to statsmodels and fill a gap in the set of discrete models that are currently available. Statsmodels is a BSD-licensed Python package for estimating many different statistical models.
Multinomial Logit and Nested Logit models have been the workhorses of discrete choice modelling since the 1970s despite their limitations, and they are still the best choice for simpler models. Thanks to the increased feasibility of computer-intensive simulation approaches, however, it is now possible to estimate more complex models. Mixed Logit models are gaining attention and use because they can accommodate random taste variation across users or consumers and correlation across alternatives. Furthermore, Mixed Logit models make it possible to combine different types of data (revealed and stated preferences) or data from different sources.
This project proposes, first, to work on the currently implemented Multinomial Logit and Nested Logit algorithms and, then, to implement Mixed Logit algorithms. I also propose to implement flexible model specification and several supporting functions: a summary of the model, the result statistics, and statistical tests to check heteroscedasticity, the nesting structure and the random parameters of the model. Working on the model specification will give users a user-friendly way to define even complex discrete choice models.
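As a rough illustration of the simulation approach behind Mixed Logit, the sketch below averages Multinomial Logit probabilities over random draws of a taste coefficient. It is only a sketch under stated assumptions: a single alternative-specific attribute, one normally distributed coefficient, and hypothetical function names (`mnl_probs`, `simulated_mixed_logit_probs`) that are not part of the statsmodels API.

```python
import numpy as np


def mnl_probs(utilities):
    """Multinomial Logit probabilities; the last axis indexes alternatives."""
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = utilities - utilities.max(axis=-1, keepdims=True)
    expu = np.exp(shifted)
    return expu / expu.sum(axis=-1, keepdims=True)


def simulated_mixed_logit_probs(x, beta_mean, beta_sd, n_draws=1000, seed=0):
    """Simulated choice probabilities with one random coefficient.

    Assumption for this sketch: utility of alternative j is beta * x[j],
    with beta ~ Normal(beta_mean, beta_sd) varying across decision makers.
    """
    rng = np.random.default_rng(seed)
    betas = rng.normal(beta_mean, beta_sd, size=n_draws)  # shape (n_draws,)
    utilities = betas[:, None] * x[None, :]               # shape (n_draws, n_alt)
    # Average the logit probabilities over the draws of beta.
    return mnl_probs(utilities).mean(axis=0)


# Made-up attribute values (e.g. cost) for three alternatives.
x = np.array([1.0, 2.0, 3.0])
probs = simulated_mixed_logit_probs(x, beta_mean=-0.5, beta_sd=0.3)
print(probs)
```

With `beta_sd=0.0` every draw is identical, so the simulated probabilities collapse to the plain Multinomial Logit ones, which is a convenient sanity check.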


Proposal Detailed Description
As stated, several supporting functions will be improved or implemented. The first is a function that returns a summary of the model and the principal result statistics. The second is a set of three general statistical tests (the Wald test, the Lagrange multiplier test and the likelihood ratio test), which will be used to:
  • Test for heteroscedasticity.
  • Test the nesting structure.
  • Test the random parameters.
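Of the three tests, the likelihood ratio test is the simplest to sketch: it compares the maximized log-likelihoods of a restricted and an unrestricted model. The example below is a minimal sketch, not the planned statsmodels implementation; the function name is hypothetical, the log-likelihood values are made up for illustration, and it assumes the usual chi-square reference distribution applies.

```python
from scipy.stats import chi2


def likelihood_ratio_test(llf_restricted, llf_unrestricted, df):
    """Return the LR statistic and its chi-square p-value.

    df is the number of restrictions, i.e. the difference in the number
    of free parameters between the two nested models.
    """
    stat = 2.0 * (llf_unrestricted - llf_restricted)
    p_value = chi2.sf(stat, df)  # upper-tail probability
    return stat, p_value


# Made-up log-likelihoods: e.g. a Multinomial Logit (restricted) against
# a Nested Logit with two extra nesting parameters (unrestricted).
stat, p_value = likelihood_ratio_test(-1050.0, -1042.5, df=2)
print(stat, p_value)
```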
I plan to test the implemented algorithms against other software that offers similar model estimation functions, such as mlogit (a package for R), Biogeme and Nlogit, and to implement examples based on the main references on the topic. I am familiar with these packages and have used them before.
From the beginning, I will implement unit tests that verify the implemented algorithms against benchmark results to ensure the correctness of the results. This will make it possible to test the algorithms automatically and catch any error introduced by future modifications of the code.
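A benchmark-based unit test could look roughly like the sketch below. In the project the expected values would come from a reference implementation such as mlogit, Biogeme or Nlogit; here, as an assumption for illustration only, hand-computed probabilities stand in for the benchmark, and `mnl_probs` is a hypothetical helper rather than a statsmodels function.

```python
import numpy as np
from numpy.testing import assert_allclose


def mnl_probs(utilities):
    """Multinomial Logit probabilities; the last axis indexes alternatives."""
    shifted = utilities - utilities.max(axis=-1, keepdims=True)
    expu = np.exp(shifted)
    return expu / expu.sum(axis=-1, keepdims=True)


def test_mnl_probs_against_benchmark():
    # Utilities log(1), log(2), log(3) give exp(V) = 1, 2, 3, so the
    # probabilities are 1/6, 2/6 and 3/6 by hand calculation.
    utilities = np.log([1.0, 2.0, 3.0])
    expected = np.array([1.0 / 6.0, 1.0 / 3.0, 1.0 / 2.0])
    assert_allclose(mnl_probs(utilities), expected, rtol=1e-12)


# A test runner would normally collect this; it is called directly here.
test_mnl_probs_against_benchmark()
```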
The task in all parts of the project is to write the statistical models and the related supporting functionality, such as plots and statistical tests. Depending on the time it takes to implement the primary goals, additional work from this list or proposed by the community could be done.
I plan to communicate with my mentor by email on a weekly basis to set weekly mini-goals and discuss regular code reviews. I will also remain in constant touch with my mentor and the Statsmodels community through IRC and the mailing lists.
I will write weekly update blog posts, plus at least two posts with code snippets showing how to use the package, at http://gsocstatsmodels.blogspot.com.es/. The code will also be pushed regularly to the GitHub repository.

Timeline
  • Community Bonding Period (May 28 – June 16). Become familiar with the Statsmodels codebase and community, the version control system, and the documentation and test systems in use. Start working on the list of models and supporting functionality.
June 17. Begin coding
  • Weeks 1-2 (June 17 – 28). Study the Statsmodels codebase to get familiar with it and write unit tests for the current Multinomial Logit and Nested Logit algorithms.
  • Weeks 3-6 (July 1 – 19). Implement flexible model specification and supporting functions for the summary of the model, the result statistics and the three statistical tests.
  • Week 7 (July 22 – 26). Clean code, improve unit tests and documentation for Multinomial Logit and Nested Logit.
  • Week 8 (July 29 – August 2). Write blog posts with code snippets showing how to use the package. Submit mid-term evaluation.
August 2. Mid-term evaluations deadline
  • Week 9 (August 5 – 9). Start work on Mixed Logit. Implement a prototype of the required functions, methods or classes that will form the base for implementing the algorithms.
  • Week 10 (August 12 – 16). Implement a basic algorithm for Mixed Logit and test it against other available implementations and software.
  • Week 11 (August 19 – 23). Optimize the implemented algorithm trying to achieve the best performance and precision possible.
  • Week 12 (August 26 – 30). Implement unit tests and documentation for the Mixed Logit algorithms.
  • Week 13 (September 2 – 13). Finish any pending code corrections, tests and bug fixes.
  • Week 14 (September 16 – 20). Clean code, refine unit tests and documentation for the whole project.
  • Week 15 (September 23 – 27). Write new blog posts with code snippets showing how to use the package, and a small white paper covering the investigation, coding and documentation. Submit final evaluations to Google.
September 27. Final evaluation deadline
References
Ben-Akiva, M. and S.R. Lerman (1985) Discrete Choice Analysis: Theory and Application to Travel Demand. The MIT Press, Cambridge, Massachusetts.
Bierlaire, M. (2003) BIOGEME: A free package for the estimation of discrete choice models. Proceedings of the 3rd Swiss Transportation Research Conference, Ascona, Switzerland.
Croissant, Y. (2010) mlogit: Multinomial Logit Model. R package version 0.1-5.
Domencich, T. and D. McFadden (1972) A Disaggregated Behavioral Model of Urban Travel Demand. Report No. CRA-156-2. Charles River Associates, Inc., Cambridge, Massachusetts.
Hensher, D.A. and W.H. Greene (2003) The Mixed Logit model: The state of practice. Transportation 30, 133-176.
Hensher, D.A., W.H. Greene and J.M. Rose (2005) Applied Choice Analysis. Cambridge University Press.
Louviere, J.J., D.A. Hensher and J.D. Swait (2000) Stated Choice Methods: Analysis and Application. Cambridge University Press, Cambridge.
McFadden, D. (2000) Disaggregate behavioral travel demand's RUM side: A 30-year retrospective. International Association of Travel Behavior Analysts, Brisbane, Australia.
Orro, A. (2006) Modelos de elección discreta en transportes con coeficientes aleatorios. Doctoral thesis. University of A Coruña, A Coruña. Abertis chair, Barcelona.
Ortúzar, J. de D. and L.G. Willumsen (2001) Modelling Transport. Third edition. Wiley and Sons.
Train, K. (2003) Discrete Choice Methods with Simulation. Cambridge University Press.
Zeileis, A. and Y. Croissant (2010) Extended Model Formulas in R: Multiple Parts and Multiple Responses. Journal of Statistical Software 34, 1-13.