2019 NCTS Summer Course: Art and Practice of Regression Trees and Forests
10:00 - 12:00 on Thursdays, June 20 - July 4, 2019
Lecture Room B, 4th Floor, The 3rd General Building, NTHU

Wei-Yin Loh (University of Wisconsin–Madison)

Yu-Shan Shih (National Chung Cheng University)


Regression tree and forest methods have greatly improved in the last decade. Their ease of use, prediction accuracy, execution speed, and interpretability make them essential tools for machine learning and data analysis. The course teaches how to use the tools effectively and efficiently in practice. It follows an example-focused style, with each example chosen to illustrate particular weaknesses of traditional solutions and to show how tree methods overcome them and yield new insights. Examples include a large consumer survey with hundreds of variables and substantial amounts of missing values; cancer and diabetes randomized trials with censored and longitudinal responses for precision medicine; and observational studies of high-school dropouts and Alzheimer’s patients. Learning highlights are (1) how trees deal with missing values without requiring imputation, (2) how importance scores help with variable selection, and (3) how to perform post-selection inference with the bootstrap. To encourage hands-on training, the presentation is interwoven with live demos of free software. No commercial software is required. Specific algorithmic techniques are discussed where appropriate but no systematic presentation of entire algorithms is given. Attendees should have experience with linear and logistic regression. Instructions for software and dataset downloads will be given in advance.
The course may be subtitled "Classification and Regression Trees by Example." Specially selected real datasets are used to motivate and illustrate particular difficulties faced by traditional techniques and to show how tree methods overcome them in new ways. The target audience is statisticians, data scientists, and researchers in business, government, industry, and academia. The course should be particularly useful for those who need to explore and analyze complex datasets with many variables and missing values and who want to learn to use the free classification and regression tree software.
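Learning highlight (1) in the course description, handling missing values without imputation, can be illustrated with a minimal sketch. The Python code below is a toy under stated assumptions, not the algorithm used by GUIDE, CART, or the course software: it fits a single regression split and tries sending the rows with a missing predictor to each side, keeping whichever choice gives the smaller residual sum of squares. The function names and the toy data are hypothetical.

```python
import math

def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def is_missing(v):
    """Treat None and NaN as missing."""
    return v is None or (isinstance(v, float) and math.isnan(v))

def best_split_with_missing(x, y):
    """Search thresholds on x; rows with missing x are sent, as a group,
    to whichever side yields the smaller total SSE.
    Returns (threshold, missing_goes_left, total_sse), or None if x has
    fewer than two distinct observed values."""
    observed = [(xi, yi) for xi, yi in zip(x, y) if not is_missing(xi)]
    missing_y = [yi for xi, yi in zip(x, y) if is_missing(xi)]
    cut_points = sorted({xi for xi, _ in observed})
    best = None
    # Candidate thresholds are midpoints between consecutive observed values.
    for lo, hi in zip(cut_points, cut_points[1:]):
        threshold = (lo + hi) / 2
        left = [yi for xi, yi in observed if xi <= threshold]
        right = [yi for xi, yi in observed if xi > threshold]
        for missing_left in (True, False):
            if missing_left:
                total = sse(left + missing_y) + sse(right)
            else:
                total = sse(left) + sse(right + missing_y)
            if best is None or total < best[2]:
                best = (threshold, missing_left, total)
    return best

# Toy data (made up for illustration): x is missing for two high-y rows.
x = [1.0, 2.0, 3.0, None, 8.0, 9.0, float("nan")]
y = [1.0, 1.2, 0.9, 5.1, 5.0, 4.8, 5.2]
threshold, missing_left, total = best_split_with_missing(x, y)
print(threshold, missing_left, round(total, 4))
```

No value is ever filled in for the missing entries; the split only decides which branch those rows follow. Real tree software uses more refined rules, so this should be read as one possible strategy rather than a description of any particular package.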
1. Estimation of population mean income from a consumer expenditure survey. Aims are to show that (i) popular missing value methods such as multiple imputation are grossly inadequate when the amount of missing values is substantial and the variables number in the hundreds, (ii) regression tree and forest models can be constructed easily without missing value imputation, and (iii) tree algorithms can quickly order the predictor variables in terms of their predictive importance. Material based on Loh et al. (2017). Data from Bureau of Labor Statistics.
2. Classification of peptide sequences. Introduce concepts of node impurity in classification tree models with categorical predictors. Compare CART (Breiman et al. 1984), GUIDE (Loh 2002, 2009), and random forests (Breiman 2001) in their importance scoring of variables. Compare GUIDE with neural networks on predictive accuracy.
3. Birthweight data. Introduce concepts of class priors and misclassification costs. Show how to build classification tree models from data with rare events or highly unbalanced classes. Data from Centers for Disease Control.
4. College tuition data. Build quantile regression tree models to estimate upper percentiles of tuition in U.S. colleges. Also build single regression tree models that simultaneously predict tuition cost and graduation rate. Data from U.S. News & World Report.
5. Hourly wages of high-school dropouts. Show the deficiencies of traditional linear mixed models. Build regression tree models for data with time-varying and longitudinal responses. Material based on Loh and Zheng (2013). Data from Singer and Willett (2003).
6. Alzheimer’s disease data. Cluster response trajectories of Alzheimer’s patients using patient baseline measurements. Compare results with those obtained with traditional clustering methods. Data from Alzheimer’s Disease Neuroimaging Initiative (ADNI).
7. Breast cancer randomized trial. Discuss problems with subgroup identification for differential treatment effects using traditional proportional hazards models. Compare regression tree solutions that adjust for local linear effects of prognostic variables. Material based on Loh et al. (2015, 2017). Data from Schumacher et al. (1994).
8. Type II diabetes randomized trial. Identify subgroups with differential treatment effects for longitudinal response data. Material based on Loh et al. (2016). Data from Eli Lilly.
9. Mortality from cardiovascular disease. Use classification tree models for matching and propensity scoring to estimate the effect of hypertensive treatment in observational data. Compare results with those from logistic regression. Data from NIH Framingham Heart Study.
10. Post-selection inference. Regression tree methods have long been considered useful only for exploratory purposes, because no methods of statistical inference existed for them. The difficulty lies in adjusting for the many algorithmic steps employed in the search for splits. This changed very recently with the development of a bootstrap calibration technique that yields a theoretically justifiable method for constructing confidence intervals for subgroup means in the terminal nodes of a tree. Material based on Loh (1987, 2016, 2017).
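The bootstrap calibration idea mentioned in topic 10 can be sketched in a much simpler setting. The Python code below is an illustrative simplification, not the post-selection procedure for tree nodes from the cited papers: it calibrates the nominal level of a normal-theory interval for a plain sample mean by estimating the interval's coverage from bootstrap resamples and picking the smallest nominal level on a grid whose estimated coverage reaches the desired one. The grid, function names, and data are assumptions made for illustration.

```python
import random
import statistics
from statistics import NormalDist

def normal_ci(sample, level):
    """Normal-theory confidence interval for the mean at a nominal level."""
    n = len(sample)
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * se, mean + z * se

def bootstrap_coverage(sample, level, n_boot=1000, seed=0):
    """Fraction of bootstrap resamples whose interval covers the original
    sample mean, which plays the role of the truth in the bootstrap world."""
    rng = random.Random(seed)  # fixed seed: identical resamples for every level
    target = statistics.fmean(sample)
    hits = 0
    for _ in range(n_boot):
        resample = rng.choices(sample, k=len(sample))
        lo, hi = normal_ci(resample, level)
        if lo <= target <= hi:
            hits += 1
    return hits / n_boot

# Hypothetical grid of nominal levels to search, in increasing order.
GRID = (0.80, 0.85, 0.90, 0.95, 0.975, 0.99, 0.995, 0.999)

def calibrated_level(sample, desired=0.95, grid=GRID):
    """Smallest nominal level on the grid whose bootstrap-estimated coverage
    is at least `desired`; falls back to the largest grid level."""
    for level in grid:
        if bootstrap_coverage(sample, level) >= desired:
            return level
    return grid[-1]

# Made-up, right-skewed sample for illustration only.
data = [0.2, 0.4, 0.5, 0.7, 0.9, 1.1, 1.4, 1.8, 2.5, 3.9, 6.3, 11.0]
level = calibrated_level(data)
low, high = normal_ci(data, level)
print(f"calibrated nominal level: {level}, interval: ({low:.2f}, {high:.2f})")
```

Reusing the same seed for every candidate level means the resamples are identical across levels, so the estimated coverage is non-decreasing in the nominal level and the grid search is well defined.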
Breiman, L. (2001). "Random Forests," Machine Learning, vol. 45, 5-32.
Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). "Classification and Regression Trees," CRC Press.
Loh, W.-Y. (1987). "Calibrating confidence coefficients," Journal of the American Statistical Association, vol. 82, 155-162.
Loh, W.-Y. (2002). "Regression trees with unbiased variable selection and interaction detection," Statistica Sinica, vol. 12, 361-386.
Loh, W.-Y. (2009). "Improving the precision of classification trees," Annals of Applied Statistics, vol. 3, 1710-1737.
Loh, W.-Y. (2014). "Fifty years of classification and regression trees (with discussion)," International Statistical Review, vol. 82, 329-370.
Loh, W.-Y., Eltinge, J. and Cho, M.-J. (2017). "Classification and regression trees and forests for incomplete data from sample surveys," Statistica Sinica, in press.
Loh, W.-Y., Fu, H., Man, M., Champion, V. and Yu, M. (2016). "Identification of subgroups with differential treatment effects for longitudinal and multiresponse variables," Statistics in Medicine, vol. 35, 4837-4855.
Loh, W.-Y., He, X. and Man, M. (2015). "A regression tree approach to identifying subgroups with differential treatment effects," Statistics in Medicine, vol. 34, 1818-1833.
Loh, W.-Y., Man, M. and Wang, S. (2017). "Subgroups from regression trees with adjustment for prognostic effects and postselection inference," Statistics in Medicine, in press.
Loh, W.-Y. and Zheng, W. (2013). "Regression trees for longitudinal and multiresponse data," Annals of Applied Statistics, vol. 7, 495-522.
Schumacher, M., Bastert, G., Bojar, H., Hübner, K., Olschewski, M., Sauerbrei, W., Schmoor, C., Beyerle, C., Neumann, R. L. A. and Rauschecker, H. F. (1994). "Randomized 2×2 trial evaluating hormonal treatment and the duration of chemotherapy in node-positive breast cancer patients," Journal of Clinical Oncology, vol. 12, 2086-2093.
Singer, J. D. and Willett, J. B. (2003). "Applied Longitudinal Data Analysis," Oxford University Press.

Poster: events_3_171190504363095919.pdf
