Generation of synthetic datasets for discrete choice analysis
Abstract: Despite the widespread use of synthetic data in discrete choice analysis, little is known about how the methodology used to generate synthetic datasets influences the properties of parameter estimates and the validity of results based on these estimates. That is, there are two potential sources of bias when using synthetic discrete choice data: (1) bias due to the method used to generate the dataset and (2) bias due to parameter estimation. The primary objective of this study is to examine bias due to the underlying data generation method. This study compares three methods for generating synthetic datasets and uses design of experiments and analysis of variance methods to investigate the ability to recover estimates for “true” logsum parameters for nested logit models. The method that uses nested logit probabilities to generate the chosen alternative results in unbiased parameter estimates. The method that is based on Gumbel error component approximations reveals that while the error components themselves are unbiased, subtle empirical identification problems can arise when these error components are combined with synthetically generated utility functions. The method that is based on normal error component approximations reveals that all logsum coefficients are biased upwards; the bias increases dramatically for nests that have a low choice frequency and is most pronounced for nests with high correlations among alternatives. Based on the results of the analysis, several recommendations for the generation of synthetic datasets for discrete choice analyses are provided.
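The first method named in the abstract, drawing the chosen alternative directly from nested logit probabilities, can be illustrated with a minimal sketch. The abstract does not give the authors' procedure, so everything here is an assumption: a two-level nested logit in which utilities are divided by the nest's logsum parameter, inverse-CDF sampling with uniform draws, and the hypothetical function names `nested_logit_probs` and `draw_choices`.

```python
import numpy as np


def nested_logit_probs(V, nests, mu):
    """Choice probabilities for a two-level nested logit (illustrative only).

    V     : (n_obs, n_alt) array of systematic utilities
    nests : list of lists of alternative indices, one list per nest
    mu    : logsum parameter for each nest, assumed to lie in (0, 1]
    """
    V = np.asarray(V, dtype=float)
    mu = np.asarray(mu, dtype=float)
    logsums = np.empty((V.shape[0], len(nests)))
    cond = []  # P(alternative | nest) for each nest
    for m, alts in enumerate(nests):
        scaled = V[:, alts] / mu[m]
        shift = scaled.max(axis=1, keepdims=True)  # guard against overflow
        expv = np.exp(scaled - shift)
        denom = expv.sum(axis=1, keepdims=True)
        cond.append(expv / denom)
        logsums[:, m] = (shift + np.log(denom))[:, 0]

    # Marginal nest probabilities: softmax over mu_m * logsum_m.
    nest_util = logsums * mu
    nest_util = nest_util - nest_util.max(axis=1, keepdims=True)
    p_nest = np.exp(nest_util)
    p_nest = p_nest / p_nest.sum(axis=1, keepdims=True)

    # Joint probability = P(nest) * P(alternative | nest).
    P = np.zeros_like(V)
    for m, alts in enumerate(nests):
        P[:, alts] = cond[m] * p_nest[:, [m]]
    return P


def draw_choices(P, rng):
    """Sample one chosen alternative per row by inverting the row-wise CDF."""
    cdf = np.cumsum(P, axis=1)
    u = rng.random((P.shape[0], 1))
    # Count how many CDF entries lie below the draw; clip for rounding error.
    return np.minimum((u > cdf).sum(axis=1), P.shape[1] - 1)
```

With all logsum parameters set to 1, these probabilities reduce to the multinomial logit, which gives a quick sanity check before generating a dataset with `draw_choices(P, np.random.default_rng(seed))`.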
Garrow, Laurie A. (author) / Bodea, Tudor D. (author) / Lee, Misuk (author)
Transportation ; 37
2009
Article (Journal)
English
Generation of synthetic datasets for discrete choice analysis | Online Contents | 2009
Reverse discrete choice models | Online Contents | 1999
Modeling spatial discrete choice | Online Contents | 2010
Wiley | 2012