ncts

Outline:

The course may be subtitled “Classification and Regression Trees by Example.” Specially selected real datasets motivate and illustrate particular difficulties faced by traditional techniques and show how tree methods overcome them. Live demos of free software are interwoven in the presentation to encourage hands-on training. No commercial software is required. The target audience is statisticians, data scientists, and researchers in business, government, industry, and academia. The course should be particularly useful for those who need to explore and analyze complex datasets with many variables and missing values and who want to learn to use the free classification and regression tree software.

Examples:

1. Estimation of population mean income from a consumer expenditure survey. The aims are to show that (i) popular missing value methods such as multiple imputation are grossly inadequate when the amount of missing values is substantial and the variables number in the hundreds, (ii) regression tree and forest models can be constructed easily without missing value imputation, and (iii) tree algorithms can quickly order the predictor variables in terms of their predictive importance. Material based on Loh, Eltinge and Cho (2017). Data from Bureau of Labor Statistics.

2. Classification of peptide sequences. Introduce concepts of node impurity in classification tree models with categorical predictors. Compare CART (Breiman et al. 1984), GUIDE (Loh 2002, 2009), and random forests (Breiman 2001) in their importance scoring of variables. Compare GUIDE with neural networks on predictive accuracy.

3. Birthweight data. Introduce concepts of class priors and misclassification costs. Show how to build classification tree models from data with rare events or highly unbalanced classes. Data from Centers for Disease Control.

4. College tuition data. Build quantile regression tree models to estimate upper percentiles of tuition in U.S. colleges. Also build single regression tree models that simultaneously predict tuition cost and graduation rate. Data from U.S. News & World Report.

5. Hourly wages of high-school dropouts. Show the deficiencies of traditional linear mixed models. Build regression tree models for data with time-varying and longitudinal responses. Material based on Loh and Zheng (2013). Data from Singer and Willett (2003).

6. Alzheimer’s disease data. Cluster response trajectories of Alzheimer’s patients using patient baseline measurements. Compare results with those obtained with traditional clustering methods. Data from Alzheimer’s Disease Neuroimaging Initiative (ADNI).

7. Breast cancer randomized trial. Discuss problems with subgroup identification for differential treatment effects using traditional proportional hazards models. Compare regression tree solutions that adjust for local linear effects of prognostic variables. Material based on Loh, He and Man (2015) and Loh, Man and Wang (2017). Data from Schumacher et al. (1994).

8. Type II diabetes randomized trial. Identify subgroups with differential treatment effects for longitudinal response data. Material based on Loh et al. (2016). Data from Eli Lilly.

9. Mortality from cardiovascular disease. Use classification tree models for matching and propensity scoring to estimate the effect of hypertensive treatment in observational data. Compare results with those from logistic regression. Data from NIH Framingham Heart Study.

10. Post-selection inference. Regression tree methods have long been considered useful only for exploratory purposes because methods of statistical inference for them did not exist. The difficulty lies in adjusting for the many algorithmic steps employed in the search for splits. This changed recently with the development of a bootstrap calibration technique that yields a theoretically justifiable method for constructing confidence intervals for subgroup means in the terminal nodes of a tree. Material based on Loh (1987, 2016, 2017).
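
As a small illustration of the node impurity concept named in Example 2, the sketch below computes the Gini index (the impurity measure used in CART, Breiman et al. 1984) and the impurity decrease from splitting on a categorical predictor. The data, the variable names `residue` and `cleaved`, and the helper functions are hypothetical and made up for illustration; they are not taken from the course datasets. GUIDE selects split variables by a different procedure, so this should not be read as a description of that algorithm.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(values, labels, left_categories):
    """Decrease in Gini impurity when cases whose categorical value lies in
    left_categories go to the left child and all other cases go right."""
    left = [y for x, y in zip(values, labels) if x in left_categories]
    right = [y for x, y in zip(values, labels) if x not in left_categories]
    n = len(labels)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted_children


# Hypothetical toy data: an amino-acid residue at one position (categorical
# predictor) and a binary class label. Values are invented for illustration.
residue = ["A", "A", "G", "G", "L", "L"]
cleaved = [1, 1, 0, 0, 1, 0]

print("parent impurity:", gini(cleaved))
print("decrease for split {A} vs rest:", impurity_decrease(residue, cleaved, {"A"}))
print("decrease for split {G} vs rest:", impurity_decrease(residue, cleaved, {"G"}))
```

With these toy values the parent node is evenly mixed, so its Gini impurity is 0.5, and either split shown sends a pure pair of cases to the left child. An exhaustive search would evaluate every two-way partition of the categories in this manner and keep the one with the largest decrease.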

References:

Breiman, L. (2001). “Random Forests,” Machine Learning, vol. 45, 5-32.

Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). “Classification and Regression Trees,” CRC Press.

Loh, W.-Y. (1987). “Calibrating confidence coefficients,” Journal of the American Statistical Association, vol. 82, 155-162.

Loh, W.-Y. (2002). “Regression trees with unbiased variable selection and interaction detection,” Statistica Sinica, vol. 12, 361-386.

Loh, W.-Y. (2009). “Improving the precision of classification trees,” Annals of Applied Statistics, vol. 3, 1710-1737.

Loh, W.-Y. (2014). “Fifty years of classification and regression trees (with discussion),” International Statistical Review, vol. 82, 329-370.

Loh, W.-Y., Eltinge, J. and Cho, M.-J. (2017). “Classification and regression trees and forests for incomplete data from sample surveys,” Statistica Sinica, in press.

Loh, W.-Y., Fu, H., Man, M., Champion, V. and Yu, M. (2016). “Identification of subgroups with differential treatment effects for longitudinal and multiresponse variables,” Statistics in Medicine, vol. 35, 4837-4855.

Loh, W.-Y., He, X. and Man, M. (2015). “A regression tree approach to identifying subgroups with differential treatment effects,” Statistics in Medicine, vol. 34, 1818-1833.

Loh, W.-Y., Man, M. and Wang, S. (2017). “Subgroups from regression trees with adjustment for prognostic effects and postselection inference,” Statistics in Medicine, in press.

Loh, W.-Y. and Zheng, W. (2013). “Regression trees for longitudinal and multiresponse data,” Annals of Applied Statistics, vol. 7, 495-522.

Schumacher, M., Bastert, G., Bojar, H., Hübner, K., Olschewski, M., Sauerbrei, W., Schmoor, C., Beyerle, C., Neumann, R. L. A. and Rauschecker, H. F. (1994). “Randomized 2×2 trial evaluating hormonal treatment and the duration of chemotherapy in node-positive breast cancer patients,” Journal of Clinical Oncology, vol. 12, 2086-2093.

Singer, J. D. and Willett, J. B. (2003). “Applied Longitudinal Data Analysis,” Oxford University Press.