RSQUARE Procedure

The RSQUARE procedure selects optimal subsets of independent variables in a multiple regression analysis. Regression coefficients and a variety of statistics useful for model selection can be printed or output to a SAS data set. In SAS Version 6 and later, the RSQUARE procedure is subsumed by PROC REG.

SPECIFICATIONS

The following statements control the RSQUARE procedure:

   PROC RSQUARE options;
      MODEL dependents=independents/options;
      FREQ variable;
      WEIGHT variable;
      BY variables;

There must be one or more MODEL statements. The FREQ, WEIGHT, and BY statements can appear only once. The MODEL, FREQ, WEIGHT, and BY statements can appear in any order.
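
As a minimal sketch of how these statements fit together, the step below assumes a hypothetical data set HOUSES with a dependent variable PRICE and four regressors SQFT, AGE, ROOMS, and LOT. These names are illustrative only and are not part of the procedure.

   /* HOUSES and its variables are hypothetical */
   PROC RSQUARE DATA=HOUSES;
      /* one dependent variable, four candidate regressors */
      MODEL PRICE=SQFT AGE ROOMS LOT;
   RUN;

Because there are fewer than 11 regressors and SELECT= is not specified, all possible subsets of the four regressors would be evaluated.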

PROC RSQUARE Statement

   PROC RSQUARE options;

The following options can be specified in the PROC statement:

DATA=SASdataset
names the SAS data set to be used. The data set can be an ordinary SAS data set or a TYPE=CORR, COV, or SSCP data set. If the DATA= option is omitted, RSQUARE uses the most recently created SAS data set.
SIMPLE|S
prints means and standard deviations for every variable listed in a MODEL statement.
CORR|C
prints the correlation matrix for all variables in the analysis.
NOINT
suppresses the intercept term from all models.
NOPRINT
suppresses the regression printout.
OUTEST=SASdataset
creates a TYPE=EST data set containing model-selection statistics and parameter estimates for the selected models.

The options listed in the MODEL Statement section can also be used in the PROC RSQUARE statement. Any option specified in the PROC statement applies to every MODEL statement except those in which you specify a different value of the option. Optional statistics will appear in the OUTEST= data set only if the corresponding options are specified in the PROC statement.
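
The interaction between PROC-level and MODEL-level options can be sketched as follows. The data set HOUSES, the output data set EST, and the variables PRICE, TAX, SQFT, AGE, ROOMS, and LOT are assumed names chosen for illustration.

   /* CP and SELECT=5 are given on the PROC statement, so they   */
   /* apply to every MODEL statement, and Cp is written to the   */
   /* OUTEST= data set. All data set and variable names here     */
   /* are hypothetical.                                          */
   PROC RSQUARE DATA=HOUSES OUTEST=EST CP SELECT=5;
      /* uses the PROC-level SELECT=5 */
      MODEL PRICE=SQFT AGE ROOMS LOT;
      /* overrides SELECT= for this model only */
      MODEL TAX=SQFT AGE ROOMS LOT / SELECT=2;
   RUN;

If CP appeared only after the slash in a MODEL statement, the Cp values would be computed for that model but would not appear in EST.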

MODEL Statement


   label: MODEL dependents=independents/options;

The MODEL statement specifies the variables to use for one or more subset regression analyses. On the left side of the equal sign list one or more dependent variables; on the right side of the equal sign list one or more independent variables (regressors). The label is optional.

When more than one dependent variable is used, RSQUARE performs a separate analysis for each dependent variable. No multivariate analyses are performed.

Any number of MODEL statements can follow the PROC RSQUARE statement.
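
A short sketch of labeled MODEL statements with more than one dependent variable is shown below. The labels SMALL and LARGE, the data set HOUSES, and all variable names are hypothetical.

   PROC RSQUARE DATA=HOUSES;
      /* two dependent variables: PRICE and TAX are each */
      /* analyzed separately against SQFT and AGE        */
      SMALL: MODEL PRICE TAX=SQFT AGE;
      /* a second, independent subset analysis */
      LARGE: MODEL PRICE=SQFT AGE ROOMS LOT;
   RUN;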

The following options can appear in either the PROC RSQUARE statement or any MODEL statement after the slash (/):

SELECT=n
specifies the maximum number of subset models of each size to be printed or output to the OUTEST= data set. If SELECT= is used without the B option, the variables in each model are listed in order of inclusion instead of the order in which they appear in the MODEL statement. If SELECT= is omitted and there are fewer than 11 regressors, all possible subsets are evaluated. If SELECT= is omitted and there are more than 10 regressors, the number of subsets selected is at most equal to the number of regressors. A small value of SELECT= greatly reduces the CPU time required for large problems.
INCLUDE=i
requests that the first i variables after the equal sign in the MODEL statement be included in every regression model. By default, no variables are required to appear in every model.
START=n
specifies the smallest number of regressors to be reported in a subset model. The default value is one more than the value specified by the INCLUDE= option, or one if INCLUDE= is omitted.
STOP=n
specifies the largest number of regressors to be reported in a subset model. The default is the number of regressors listed in the MODEL statement.
ADJRSQ
computes R-square adjusted for degrees of freedom for each model selected.
CP
computes Mallows' Cp statistic for each model selected.
JP
computes Jp, the estimated mean square error of prediction, for each model selected, assuming that the values of the regressors are fixed and that the model is correct. The Jp statistic is also called the final prediction error (FPE).
MSE
computes the mean square error for each model selected.
SSE
computes the error sum of squares for each model selected.
B
computes estimated regression coefficients for each model selected.
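
The options above can be combined after the slash, as in the following sketch. The data set HOUSES and the six regressors are assumed names used only for illustration.

   /* All data set and variable names are hypothetical */
   PROC RSQUARE DATA=HOUSES;
      MODEL PRICE=SQFT AGE ROOMS LOT BATHS GARAGE
            / INCLUDE=1   /* SQFT is forced into every model     */
              START=2     /* smallest subset size reported       */
              STOP=4      /* largest subset size reported        */
              SELECT=3    /* at most 3 models of each size       */
              CP ADJRSQ   /* extra model-selection statistics    */
              B;          /* print the regression coefficients   */
   RUN;

Here START=2 matches the default implied by INCLUDE=1 and is written out only for clarity.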

FREQ Statement

   FREQ variable;

If a variable in your data set represents the frequency of occurrence for the other values in the observation, include the variable's name in a FREQ statement. The procedure then treats the data set as if each observation appears n times, where n is the value of the FREQ variable for the observation. The total number of observations is considered equal to the sum of the values of the FREQ variable when the procedure determines degrees of freedom for significance probabilities.
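
As an illustration, suppose a hypothetical data set GROUPED holds one observation per distinct combination of values, with a variable COUNT recording how many times each combination occurred. Both names are assumptions made for this sketch.

   /* GROUPED, COUNT, and the model variables are hypothetical */
   PROC RSQUARE DATA=GROUPED;
      MODEL PRICE=SQFT AGE ROOMS;
      /* an observation with COUNT=3 is treated as three */
      /* identical observations                          */
      FREQ COUNT;
   RUN;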

WEIGHT Statement

   WEIGHT variable;

A WEIGHT statement names a variable in the input data set whose values are relative weights for a weighted least-squares fit. If the weight value is proportional to the reciprocal of the variance for each observation, then the weighted estimates are the best linear unbiased estimates (BLUE).
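
A weighted fit of this kind might be set up as in the sketch below. It assumes a hypothetical data set HOUSES that already contains a variable ERRVAR holding a value proportional to the error variance of each observation; the DATA step, the data set WTD, and the variable W are likewise illustrative.

   /* Build a weight equal to the reciprocal of the assumed */
   /* variance; all names here are hypothetical             */
   DATA WTD;
      SET HOUSES;
      W=1/ERRVAR;
   PROC RSQUARE DATA=WTD;
      MODEL PRICE=SQFT AGE ROOMS;
      WEIGHT W;
   RUN;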

The WEIGHT and FREQ statements have similar effects, except in the calculation of degrees of freedom.

BY Statement

   BY variables;

A BY statement can be used with PROC RSQUARE to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables. If your input data set is not sorted in ascending order, use the SORT procedure with a similar BY statement to sort the data, or, if appropriate, use the BY statement options NOTSORTED or DESCENDING.
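
The sort-then-analyze pattern described above can be sketched as follows, assuming a hypothetical data set HOUSES with a grouping variable REGION.

   /* HOUSES, REGION, and the model variables are hypothetical */
   PROC SORT DATA=HOUSES;
      BY REGION;
   PROC RSQUARE DATA=HOUSES;
      MODEL PRICE=SQFT AGE ROOMS;
      /* one separate subset analysis per value of REGION */
      BY REGION;
   RUN;

If the data were instead sorted in descending order of REGION, the statement BY DESCENDING REGION; would be used in PROC RSQUARE.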