
Figure 9.5: Ridge regression illustrated. The least squares estimate is at the center of the ellipse while the ridge regression estimate is the point on the ellipse closest to the origin. The ellipse is a contour of equal density of the posterior probability, which in this case will be comparable to a confidence ellipse. λ controls the size of the ellipse: the larger λ is, the larger the ellipse will be.
Another way of looking at this is to suppose we place some upper bound on βᵀβ and then compute the least squares estimate of β subject to this restriction. Use of Lagrange multipliers leads to ridge regression. The choice of λ corresponds to the choice of upper bound in this formulation.
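To spell out the connection (a sketch of my own; the bound c and the Lagrangian notation below are not taken from the text):

\min_{\beta} \; (y - X\beta)^T (y - X\beta) \quad \text{subject to} \quad \beta^T \beta \le c

L(\beta, \lambda) = (y - X\beta)^T (y - X\beta) + \lambda (\beta^T \beta - c)

\partial L / \partial \beta = 0 \;\Rightarrow\; \hat{\beta} = (X^T X + \lambda I)^{-1} X^T y

so each choice of the bound c corresponds to some value of λ, and conversely.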
λ may be chosen by automatic methods, but it is probably safest to plot the values of β̂ as a function of λ. You should pick the smallest value of λ that produces stable estimates of β̂.
We demonstrate the method on the Longley data. λ = 0 corresponds to least squares, while we see that as λ → ∞, β̂ → 0.
> library(MASS)
> gr <- lm.ridge(Employed ~ ., longley, lambda = seq(0, 0.1, 0.001))
> matplot(gr$lambda,t(gr$coef),type="l",xlab=expression(lambda),
ylab=expression(hat(beta)))
> abline(h=0,lwd=2)
Figure 9.6: Ridge trace plot for the Longley data. The vertical line is the Hoerl-Kennard choice of λ. The topmost curve represents the coefficient for Year. The dashed line that starts well below zero but ends above zero is for GNP.
The ridge trace plot is shown in Figure 9.6.
Various automatic selections for λ are available:
> select(gr)
modified HKB estimator is 0.0042754
modified L-W estimator is 0.032295
smallest value of GCV at 0.003
> abline(v=0.00428)
The Hoerl-Kennard (the originators of ridge regression) choice of λ has been shown on the plot, but I would prefer a larger value of 0.03. For this choice of λ, the β̂'s are
> gr$coef[,gr$lam == 0.03]
GNP.deflator GNP Unemployed Armed.Forces Population Year
0.22005 0.76936 -1.18941 -0.52234 -0.68618 4.00643
in contrast to the least squares estimates of
> gr$coef[,1]
GNP.deflator GNP Unemployed Armed.Forces Population Year
0.15738 -3.44719 -1.82789 -0.69621 -0.34420 8.43197
Note that these values are based on centered and scaled predictors, which explains the difference from previous fits. Consider the change in the coefficient for GNP. For the least squares fit, the effect of GNP on the response (the number of people employed) is negative. This is counter-intuitive, since we'd expect the effect to be positive. The ridge estimate is positive, which is more in line with what we'd expect.
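(A side note not in the original text: if coefficients on the original scale of the predictors are wanted, my understanding of the MASS interface is that the coef() method for "ridgelm" objects rescales gr$coef and adds an intercept, so something like the following should recover them.)

> coef(gr)[gr$lambda == 0.03, ]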
Ridge regression estimates of coefficients are biased. Bias is undesirable but there are other considerations. The mean squared error can be decomposed in the following way:

E(\hat{\beta} - \beta)^2 = (E(\hat{\beta} - \beta))^2 + E(\hat{\beta} - E\hat{\beta})^2
Thus the mean-squared error of an estimate can be represented as the square of the bias plus the variance.
Sometimes a large reduction in the variance may be obtained at the price of an increase in the bias. If the MSE is reduced as a consequence, then we may be willing to accept some bias. This is the trade-off that ridge regression makes: a reduction in variance at the price of an increase in bias. This is a common dilemma.
Chapter 10
Variable Selection
Variable selection is intended to select the "best" subset of predictors. But why bother?
1. We want to explain the data in the simplest way: redundant predictors should be removed. The principle of Occam's Razor states that among several plausible explanations for a phenomenon, the simplest is best. Applied to regression analysis, this implies that the smallest model that fits the data is best.
2. Unnecessary predictors will add noise to the estimation of other quantities that we are interested in.
Degrees of freedom will be wasted.
3. Collinearity is caused by having too many variables trying to do the same job.
4. Cost: if the model is to be used for prediction, we can save time and/or money by not measuring
redundant predictors.
Prior to variable selection:
1. Identify outliers and influential points - maybe exclude them at least temporarily.
2. Add in any transformations of the variables that seem appropriate.
10.1 Hierarchical Models
Some models have a natural hierarchy. For example, in polynomial models, x^2 is a higher order term than x. When selecting variables, it is important to respect the hierarchy. Lower order terms should not be removed from the model before higher order terms in the same variable. There are two common situations where this arises:
Polynomial models. Consider the model

y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon

Suppose we fit this model and find that the regression summary shows that the term in x is not significant but the term in x^2 is. If we then removed the x term, our reduced model would become

y = \beta_0 + \beta_2 x^2 + \epsilon
but suppose we then made a scale change x → x + a; the model would then become

y = \beta_0 + \beta_2 a^2 + 2\beta_2 a x + \beta_2 x^2 + \epsilon

The first order x term has now reappeared. Scale changes should not make any important change to the model, but in this case an additional term has been added. This is not good. This illustrates why we should not remove lower order terms in the presence of higher order terms. We would not want interpretation to depend on the choice of scale. Removal of the first order term here corresponds to the hypothesis that the predicted response is symmetric about, and has an optimum at, x = 0. Often this hypothesis is not meaningful and should not be considered. Only when this hypothesis makes sense in the context of the particular problem could we justify the removal of the lower order term. (An R illustration of this scale dependence appears at the end of this subsection.)
Models with interactions. Consider the second order response surface model:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{12} x_1 x_2

We would not normally consider removing the x1x2 interaction term without simultaneously considering the removal of the x1^2 and x2^2 terms. A joint removal would correspond to the clearly meaningful comparison of a quadratic surface and a linear one. Just removing the x1x2 term would correspond to a surface that is aligned with the coordinate axes. This is hard to interpret and should not be considered unless some particular meaning can be attached. Any rotation of the predictor space would reintroduce the interaction term and, as with the polynomials, we would not ordinarily want our model interpretation to depend on the particular basis for the predictors.
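Returning to the polynomial case, here is a small R illustration of my own (simulated data, not from the text) of the scale dependence described above: the reduced model without the linear term gives different fitted values when the origin of x is shifted, while the full quadratic does not.

set.seed(1)
x <- seq(-2, 4, length.out = 50)
y <- 1 + 2 * x + 3 * x^2 + rnorm(50)
r1 <- lm(y ~ I(x^2))                       # reduced model, original origin
r2 <- lm(y ~ I((x - 1)^2))                 # reduced model after the shift x -> x - 1
max(abs(fitted(r1) - fitted(r2)))          # clearly nonzero: the fit depends on the origin
f1 <- lm(y ~ x + I(x^2))                   # full quadratic, original origin
f2 <- lm(y ~ I(x - 1) + I((x - 1)^2))      # full quadratic after the same shift
max(abs(fitted(f1) - fitted(f2)))          # essentially zero: the fit is unchanged by the shift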
10.2 Stepwise Procedures
Backward Elimination
This is the simplest of all variable selection procedures and can be easily implemented without special
software. In situations where there is a complex hierarchy, backward elimination can be run manually while
taking account of what variables are eligible for removal.
1. Start with all the predictors in the model.
2. Remove the predictor with the highest p-value greater than α_crit.
3. Refit the model and go to step 2.
4. Stop when all p-values are less than α_crit.
The α_crit is sometimes called the "p-to-remove" and does not have to be 5%. If prediction performance is the goal, then a 15-20% cut-off may work best, although methods designed more directly for optimal prediction should be preferred.
10.2.1 Forward Selection
This just reverses the backward method.
1. Start with no variables in the model.
2. For all predictors not in the model, check their p-value if they are added to the model. Choose the one with the lowest p-value less than α_crit.
3. Continue until no new predictors can be added.
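A matching sketch for forward selection (again my own code, under the same assumptions about df as above), using add1() to obtain the p-value each candidate predictor would have if it were added:

forward <- function(df, alpha.crit = 0.05) {
  fit <- lm(y ~ 1, data = df)                            # 1. start with no predictors
  scope <- reformulate(setdiff(names(df), "y"))          # all candidate terms
  repeat {
    cand <- add1(fit, scope = scope, test = "F")[-1, ]   # drop the <none> row
    if (nrow(cand) == 0 || min(cand[, "Pr(>F)"]) >= alpha.crit) break
    best <- rownames(cand)[which.min(cand[, "Pr(>F)"])]  # 2. smallest p-value below the cut-off
    fit <- update(fit, as.formula(paste(". ~ . +", best)))
  }
  fit                                                    # 3. no further predictors can be added
}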
