PROC REG, Version 6 Enhancements

Michael Friendly
SCS Short Course


PROC REG

The REG procedure has been enhanced considerably in Version 6. The most important additions are:

Interactive statements

A number of new statements in PROC REG support fitting a regression model interactively. To make this possible, the RUN; statement in SAS Version 6 causes all statements submitted up to that point to be executed immediately but, unlike in Version 5, does not end the PROC step.

This means you can submit a model, look at the output, then submit additional statements without starting over. A procedure is ended by another PROC step, a DATA step, or the QUIT; statement.
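
A minimal sketch of this behaviour, assuming a data set named CLASS with variables WEIGHT, AGE, and HEIGHT (the names used in the examples later in this section):

proc reg data=class;
   model weight = height;
 run;                 /* model is fitted and printed; PROC REG stays active */
   model weight = age height;
 run;                 /* a second model, fitted in the same PROC step */
 quit;                /* ends the PROC REG step */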

The interactive statements in Version 6 are:

ADD
Adds variable(s) to the current model.
DELETE
Deletes variable(s) from the current model.
PRINT
Prints the ANOVA table or other output requested by print options of the MODEL statement (e.g., PARTIAL, VIF, COLLIN). Implicitly causes the current model to be refitted.
PLOT
Plots data set variables and/or diagnostic measures calculated for each observation (residuals, leverage, etc.).
PAINT
Selects observations to be "painted" or highlighted in a scatterplot. The plots are produced by the next PLOT statement.
REWEIGHT
Excludes specific observations from the analysis (gives them 0 weight) or changes the weights of the observations used.
REFIT
Explicitly requests that the current model be refitted.

Note: A number of the interactive statements have a lasting effect. For example, when you REWEIGHT or PAINT an observation, that action continues until you explicitly undo it with another statement.

Example: Interactive analysis

The following example shows how the interactive statements are used in a regression analysis predicting a person's WEIGHT from their AGE and HEIGHT. Note that a RUN; statement follows each set of interactive statements, causing the results of the preceding statements to be displayed. The form of the REWEIGHT statement used here is REWEIGHT condition. All observations satisfying the condition have their weight set to zero, and so are ignored in the analysis.
                      /* Interactive Analysis */
proc reg data=class;
   model weight = age height;
 run;                 /* fit initial model */
   delete age;        /* delete AGE from model */
   print;             /* ANOVA summary and parameter estimates */
 run;
   add age;           /* put AGE back in */
   plot r.*p.;        /* plot residual * predicted */
 run;
   reweight r.>20;    /* refit, ignoring obs with residual > 20 */
   plot;              /* plot residual * predicted again */
 run;
Note that the same plot specification (R. * P.) is used again by the second PLOT statement, which names no variables. In the same way, the REWEIGHT condition (R. > 20) would continue to apply in subsequent steps until it was changed with a new REWEIGHT statement.
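
The statements below sketch how such a lasting REWEIGHT condition might be changed or undone. The UNDO and ALLOBS keywords and the WEIGHT= and RESET options come from the documented REWEIGHT syntax rather than from the example above, so check that they are available in your release. The value 0.5 is an arbitrary illustrative weight.

proc reg data=class;
   model weight = age height;
 run;
   reweight r.>20;              /* exclude obs with residual > 20 */
   print;                       /* refit and print without those obs */
 run;
   reweight undo;               /* cancel the most recent REWEIGHT */
   reweight r.>20 / weight=0.5; /* downweight instead of excluding */
   print;
 run;
   reweight allobs / reset;     /* restore original weights for all obs */
   print;
 run;
 quit;

The PAINT statement is documented with the same UNDO and ALLOBS / RESET forms for restoring plotting symbols.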

Plots

The PLOT statement in PROC REG prints scatter plots of variables listed in the MODEL or VAR statements, or of any of the calculated regression measures that can be named on the OUTPUT statement. In the PLOT statement, a calculated measure is written as its keyword followed by a period; for example, STUDENT. is the studentized residual and P. is the predicted value.
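
As a brief sketch of this notation, the requests below plot one pair of data set variables and several pairs involving calculated measures. R., P., STUDENT., H., and COOKD. denote the residual, predicted value, studentized residual, leverage, and Cook's D; the CLASS data set is assumed to be the one used in the other examples in this section.

proc reg data=class;
   model weight = age height / noprint;
   plot weight*height;          /* data set variables, vertical*horizontal */
   plot r.*p. student.*h.;      /* two plots from one PLOT statement */
   plot cookd.*age;             /* a diagnostic measure against a predictor */
 run;
 quit;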

The example below produces two sets of scatter plots using the PAINT statement to identify individual points. The first plot, shown in Figure 2, identifies the observation with name='Henry' and all observations with large absolute studentized residuals. The plotting symbols are specified by the SYMBOL= option in each PAINT statement.

                     /* ---Painting Scatter Plots--- */
proc reg data=class;
   model weight=age height / noprint;
run;
   paint name='Henry'                   /* identify Henry with 'H'     */
       / symbol='H';
   paint student.>=2 or student.<=-2    /* identify obs with large abs */
       / symbol='$';                    /* studentized residuals       */
   plot student.*p.;             /* plot studentized residual vs. yhat */
run;
   paint student.>=1 / symbol='p';
   paint student.<1 and student.>-1
       / symbol='s';
   paint student.<=-1 / symbol='n';
   plot student.*p. cookd.*h. / hplots=2 vplots=2;
run;


          ---+-----+-----+-----+-----+-----+-----+-----+-----+-----+---
 STUDENT  |                                                           |
          |                                                           |
        3 +                                                           +
          |                                                           |
          |                                                           |
          |                                                           |
          |                                                           |
          |                                                           |
          |                                    $                      |
        2 +                                                           +
S         |                                                           |
t         |                                                           |
u         |                                                           |
d         |                                                           |
e         |                                           1               |
n         |                        1                              1   |
t       1 +                                  1                        +
i         |                                                           |
z         |                                                           |
e         |                   11                                      |
d         |                                                           |
          |                    1                                      |
R         |                                                           |
e       0 +                  1               1                        +
s         |                                                           |
i         |                                   H                       |
d         |                          1               2                |
u         |                                                           |
a         |      1                                                    |
l         |                                                           |
       -1 +                                                           +
          |                                      1        1           |
          |                                                           |
          |                                1                          |
          |                                     1                     |
          |                                                           |
          |                                                           |
       -2 +                                                           +
          ---+-----+-----+-----+-----+-----+-----+-----+-----+-----+---
            50    60    70    80    90    100   110   120   130   140

                     Predicted Value of WEIGHT     PRED

Figure 2: Painting observations

The second PLOT statement in the example above produces two small scatter plots on a single page, as shown in Figure 3. The PAINT statements identify observations with large positive residuals by the character 'p', those with large negative residuals by the character 'n', and those with small absolute residuals by the character 's'. The options HPLOTS=2 VPLOTS=2 allow for four plots on a page, of which two are actually used. The same plotting symbols are used in both plots, so you can relate the observations in the two plots.



     --+----+----+----+----+----+--        -+-----+-----+-----+-----+--
     |                            |        |                          |
   4 +                            +   1.00 +                          +
     |                            |        |                          |
     |                            |        |                          |
     |                            |        |                    p     |
S    |                  p         | C 0.75 +                          +
T  2 +                            + O      |                          |
U    |                            | O      |                          |
D    |             p       p    p | K      |                          |
E    |                 s          | D 0.50 +                          +
N    |           s                |        |                          |
T  0 +          s      s          +        |                          |
     |             s   s  s       |        |                          |
     |     s                      |   0.25 +                          +
     |                   n  n     |        |                p         |
     |                n n         |        |            n  s          |
  -2 +                            +        |    n pp n         s      |
     |                            |   0.00 +    sss s   s             +
     --+----+----+----+----+----+--        -+-----+-----+-----+-----+--
      40   60   80   100  120  140         0.0   0.1   0.2   0.3   0.4

                PRED                                 H

Figure 3: Painting observations in two plots
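
Since HPLOTS=2 VPLOTS=2 leaves room for four plots, a variation that fills the whole page might look like the sketch below. The two extra plot requests (studentized residuals against AGE and HEIGHT) are illustrative choices, not part of the original example.

proc reg data=class;
   model weight = age height / noprint;
   plot student.*p. cookd.*h.
        student.*age student.*height
        / hplots=2 vplots=2;    /* 2 x 2 grid, all four panels used */
 run;
 quit;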

Model-selection methods

In Version 5, forward, backward, and stepwise selection of predictor variables was carried out by PROC STEPWISE. Methods for evaluating the best possible subsets of predictors using R² or Mallows' Cp were carried out by PROC RSQUARE.

These methods are all provided in PROC REG in Version 6, using the SELECTION= option on the MODEL statement. In addition, a modification of the MODEL statement syntax and a GROUPNAMES= option allow groups of variables to be entered or removed as a whole in stepwise methods.
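
A sketch of the grouped syntax, assuming the FITNESS data set used later in this section: variables enclosed in braces are treated as a single group that enters or leaves the model together, and GROUPNAMES= supplies a label for each group in the printed output. Grouping the three pulse variables, and the labels themselves, are illustrative choices rather than part of the original notes.

proc reg data=fitness;
   model oxy = runtime age weight {runpulse maxpulse rstpulse}
       / selection=stepwise
         groupnames='runtime' 'age' 'weight' 'pulse';
 run;
 quit;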

Note: For compatibility across versions, if PROC STEPWISE or PROC RSQUARE is requested in Version 6, PROC REG with the appropriate model-selection method is actually used.

Stepwise methods

For forward, backward, or stepwise selection, specify SELECTION=FORWARD, SELECTION=BACKWARD, or SELECTION=STEPWISE on the MODEL statement.
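
As a minimal sketch, the three methods could be compared in one step by giving each MODEL statement its own label. The labels FWD, BWD, and STEP are arbitrary, and the FITNESS data set and its variables are assumed to be those used in the example later in this section.

proc reg data=fitness;
   fwd:  model oxy = runtime age weight runpulse maxpulse rstpulse
             / selection=forward;
   bwd:  model oxy = runtime age weight runpulse maxpulse rstpulse
             / selection=backward;
   step: model oxy = runtime age weight runpulse maxpulse rstpulse
             / selection=stepwise;
 run;
 quit;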

For each of these methods, you can use the options START= and STOP= to specify the smallest and largest number of variables in a given model. The BEST= option specifies the maximum number of models to be printed for a given number of predictors.

The following example uses the RSQUARE criterion to identify the best four models at each subset size from two to five predictors for the fitness data.

proc reg data=fitness;
   model oxy = runtime age weight runpulse maxpulse rstpulse
       / selection=rsquare
         start=2 stop=5 best=4;