Econ 3466 Essay Samples
Type of paper: Essay
Topic: Model, Selection, Crime, Murder, Illiteracy, Population, Frost, Value
Pages: 2
Words: 550
Published: 2020/12/14
Assignment 6: Model Selection
This exercise on model selection was done using the state.x77 dataset included in R. This dataset contains 50 observations of 8 variables: Population (population estimate as of July 1, 1975), Income (per capita income in 1974), Illiteracy (1970, percent of population), Life Expectancy (life expectancy in years, between 1969 and 1971), Murder (murder and non-negligent manslaughter rate per 100,000 population in 1976), HSGrad (percent high-school graduates in 1970), Frost (mean number of days with minimum temperature below freezing between 1931–1960, in capital or large city), and Area (land area in square miles).
Cleaning the data and selection of outcome variable: My student number is 0297127, so observations 27 and 12 were deleted. The R expression "0297127 %% 7 + 1" then returned the number 6, which corresponds to "Frost" as the dependent variable.
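The selection rule is simple modular arithmetic (R's `%%` operator is `%` in Python); a minimal sketch that reproduces only the index, with the mapping to "Frost" taken from the assignment's own variable numbering:

```python
# Outcome-variable selection rule: student number mod 7, plus 1.
# R's modulo operator is %%; Python's is %.
student_number = 297127  # 0297127 read as an integer (leading zero dropped)

index = student_number % 7 + 1
print(index)  # 6 -> mapped to "Frost" under the assignment's variable numbering
```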
An unrestricted model (U model) was fitted with Frost as the outcome variable and the rest of the variables as predictors. At a 5% significance level, roughly 1 in every 20 truly null coefficients will appear significant by chance alone. The results of this unrestricted analysis show that the coefficient of Illiteracy was not significantly different from zero and had the highest p-value of all predictors, so it was removed first: among the predictors, it is the one whose estimated coefficient is most likely to differ from zero by accident (see Table 1).
A first restricted model (R.1 model) was fitted without Illiteracy. This first restricted model explains less variance than the unrestricted model (adjusted R² of 53.38% in the U model vs. 33.6% in the R.1 model). Adjusted R² is preferable to raw R² because it takes the number of included predictors into account. As more predictors are added to a model, raw R² can only go up, so it is important to penalize this value by the number of predictors to get a more realistic estimate of the amount of variance explained. Second, all predictor coefficients changed considerably, which means that the parameter estimates are not robust to the restriction. Furthermore, the explained variance and the predictor coefficients changed so much that deleting Illiteracy from this model (or, conversely, purposively keeping it in) might introduce model selection bias (Zucchini, 58). Model selection bias arises when the same data are used both to select the best-fitting model from a set of candidates and to test hypotheses about the estimated parameters of that best-fitting model; doing so inflates the Type I error, so this practice is not recommended (Zucchini, 58).
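The penalty described above has a standard closed form, adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), for n observations and p predictors. A sketch using this textbook formula (the inputs below are illustrative, not the essay's actual estimates):

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared: penalizes raw R-squared by the number of
    predictors p, given n observations (standard textbook formula)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# With 48 observations (50 states minus the two deleted rows) and 7
# predictors, a hypothetical raw R-squared of 0.60 shrinks noticeably:
print(round(adjusted_r2(0.60, 48, 7), 3))  # ~0.53, below the raw value
```

Because the penalty grows with p, adding a useless predictor can lower adjusted R² even though raw R² never decreases.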
In the first restricted model, Murder was the predictor whose coefficient had the highest p-value, so it was removed in the second restricted model (R.2 model). The adjusted R² is now even lower (14.07%), and the regression coefficients changed again. The same rationale applied earlier to Illiteracy applies to the removal of Murder.
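The removal rule used in both restricted models (drop the predictor whose coefficient has the largest p-value, if it is not significant) is one step of backward elimination. A generic sketch; the p-values below are placeholders, not the essay's fitted values:

```python
def drop_least_significant(p_values, alpha=0.05):
    """One step of backward elimination: return the predictor with the
    largest coefficient p-value if it exceeds alpha, else None."""
    name, p = max(p_values.items(), key=lambda kv: kv[1])
    return name if p > alpha else None

# Hypothetical p-values for illustration only:
pvals = {"Population": 0.01, "Income": 0.20, "Illiteracy": 0.85, "Murder": 0.60}
print(drop_least_significant(pvals))  # Illiteracy: largest p-value, not significant
```

Repeating this step until every remaining coefficient is significant yields the nested sequence U, R.1, R.2 described above.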
Finally, an iterative process was performed to find a final restricted model (R.3 model). The goal of this process was to find a parsimonious model: one explaining the highest amount of variance with the fewest predictors, using the p-value as the selection criterion. Illiteracy and Murder seemed to explain a large amount of variance, so they were re-included. Then, each remaining variable was re-included one at a time until a final model was obtained. This final model (R.3 model) includes Population, Illiteracy, Murder and Life Expectancy as predictors, and explains 52.5% of the total variance in Frost (Table 1).
Finally, a correlation matrix is included in Table 2, where it can be seen that the outcome variable (Frost) correlates moderately with its predictors, while the predictors do not correlate highly with each other. The highest correlation coefficient observed between predictors is 0.328, between Population and Murder, which is still not very high. These observations support the principle of avoiding multicollinearity when building and selecting a model: a predictor that is highly collinear with other predictors in the model is redundant, so it should be avoided.
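The multicollinearity check described above amounts to computing pairwise Pearson correlations among predictors and flagging large ones. A self-contained sketch with made-up data (not the state.x77 values), using a common rule-of-thumb cutoff:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy predictors for illustration; x2 tracks x1 closely, x3 is noise.
predictors = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.1, 1.9, 3.2, 3.8, 5.1],
    "x3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
names = list(predictors)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        r = pearson(predictors[names[i]], predictors[names[j]])
        if abs(r) > 0.8:  # rule-of-thumb collinearity warning threshold
            print(names[i], names[j], round(r, 3))
```

In the essay's data, no predictor pair crosses such a threshold (the maximum is 0.328), so no predictor is redundant on collinearity grounds.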
Works cited
Zucchini, Walter. "An Introduction to Model Selection." Journal of Mathematical Psychology, vol. 44, no. 1, 2000, pp. 41-61.