PROC CORRESP

PROC CORRESP is a new procedure in Version 6 for correspondence analysis of categorical data. Correspondence analysis is a technique related to principal components analysis that finds a multidimensional representation of the association between the row and column categories of a two-way contingency table. The dimensions account for the greatest proportion of the chi² for association between the row and column categories, just as principal components account for maximum variance.

Correspondence analysis is an exploratory technique for categorical data. It is designed to show how the data differ from independence. Confirmatory methods, such as those provided by PROC FREQ and PROC CATMOD, should be used for hypothesis testing.

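PROC CORRESP can read two kinds of input: raw categorical data, whose classification variables are named in the TABLES statement, and a contingency table, whose columns are named in the VAR statement.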

An output data set from PROC CORRESP contains scores for the row and column categories on two or more dimensions. Graphical display of these scores provides a useful aid to interpretation.

Specifications

The following statements are used with PROC CORRESP:
PROC CORRESP DATA=SAS-data-set <options>;
   TABLES row-variables <,column-variables>;     raw data input
   VAR variables;                                contingency table input

   ID variable;
   BY variables;
   SUPPLEMENTARY variables;
   WEIGHT variable;

Example

The example below examines the numbers of Ph.D. degrees awarded in the United States in six areas during the years 1973 to 1978. The data set PHDS is input in the form of a contingency table, where the variables are years and observations are the six discipline areas.
data phds;
   input science $1-13 y1973-y1978;
   datalines;
Life          4489 4303 4402 4350 4266 4361
Physical      4101 3800 3749 3572 3410 3234
Social        3354 3286 3344 3278 3137 3008
Behavioral    2444 2587 2749 2878 2960 3049
Engineering   3338 3144 2959 2791 2641 2432
Mathematics   1222 1196 1149 1003  959  959
;

A correspondence analysis is carried out by the statements below. The ID statement specifies the labels for the rows of the table. An output data set RESULTS is produced containing the row and column dimensions.

proc corresp data=phds out=results rp short;
   var y1973-y1978;
   id science;
run;

The printed output from the CORRESP procedure is shown in Figure 1. The total chi² of 383.9 on 25 degrees of freedom is highly significant, indicating differences in the profiles of Ph.D.s across the various disciplines. The breakdown of the chi² shows that over 96% of the total chi² is explained by the first dimension, indicating that the association is essentially one-dimensional.



                 The Correspondence Analysis Procedure

                  Inertia and Chi-Square Decomposition

    Singular  Principal Chi-
    Values    Inertias  Squares Percents   19   38   57   76   95
                                        ----+----+----+----+----+--
    0.05845   0.00342   368.653  96.04% *************************
    0.00861   0.00007     7.995   2.08% *
    0.00694   0.00005     5.197   1.35%
    0.00414   0.00002     1.852   0.48%
    0.00122   0.00000     0.160   0.04%
              -------   -------
              0.00356   383.856 (Degrees of Freedom = 25)

                            Row Coordinates

                                     Dim1          Dim2

                Life             0.025813      0.008097
                Physical         -.041273      -.002420
                Social           0.001352      -.011413
                Behavioral       0.110006      -.001299
                Engineering      -.070379      -.003671
                Mathematics      -.063942      0.022762

                           Column Coordinates

                                  Dim1          Dim2

                   Y1973      -.084027      0.003252
                   Y1974      -.050893      0.002939
                   Y1975      -.014823      0.000793
                   Y1976      0.024241      -.012926
                   Y1977      0.051249      -.008190
                   Y1978      0.086413      0.014276

Figure 1: PROC CORRESP output for PHDS data
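
The entries in the first row of the decomposition can be checked by hand. As a minimal sketch (the variable names DF, INERTIA1, and PCT1 are illustrative and not part of the procedure), the DATA step below recomputes the degrees of freedom for a 6 by 6 table, the first principal inertia as the square of the first singular value, and the first dimension's share of the total chi², using only values printed in Figure 1:

```sas
data _null_;
   /* degrees of freedom: (rows - 1) * (columns - 1) for 6 disciplines by 6 years */
   df = (6 - 1) * (6 - 1);
   /* a principal inertia is the square of the corresponding singular value */
   inertia1 = 0.05845 ** 2;
   /* percent of the total chi-square accounted for by the first dimension */
   pct1 = 100 * 368.653 / 383.856;
   /* write the three results to the SAS log */
   put df= inertia1= 7.5 pct1= 6.2;
run;
```

The log should show the values 25, 0.00342, and 96.04, which agree with Figure 1.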

The pattern of association may be interpreted by plotting the output dimensions in the RESULTS data set. Since PROC PLOT can only use one-character plotting symbols, a DATA step is used to change the point symbols for years to the last digit of the year.

data results;
   set results;
   /* create year plot character from last digit */
   if _type_ = 'VAR' then science = substr(science,5,1);
run;

proc plot data=results;
   plot dim2 * dim1 = science / box
        vspace=5 hspace=10
        haxis=-0.15 to 0.15 by 0.05
        vaxis=-0.05 to 0.05 by 0.05;
run;
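
The plot is shown in Figure 2. It may be seen that the years increase steadily from left to right, so Dimension 1 corresponds to the change in the discipline profiles over time: Behavioral science, at the far right, gained share of Ph.D.s over the period, while Engineering, Mathematics, and Physical science, at the left, lost share.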


             Plot of DIM2*DIM1.  Symbol is value of SCIENCE.


      ---+---------+---------+---------+---------+---------+---------+---
 DIM2 |                                                                 |
      |                                                                 |
 0.05 +                                                                 +
      |                                                                 |
      |                                                                 |
      |                   M                                             |
      |                                     L           8               |
 0.00 +               3  E   4 P    5                        B          +
      |                                S    6    7                      |
      |                                                                 |
      |                                                                 |
      |                                                                 |
-0.05 +                                                                 +
      |                                                                 |
      ---+---------+---------+---------+---------+---------+---------+---
       -0.15     -0.10     -0.05     0.00      0.05      0.10      0.15

                                     DIM1

Figure 2: Plot of correspondence analysis solution
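
The example above uses contingency table input with the VAR statement. As a hedged sketch of the other input form, the statements below rearrange PHDS so that each observation is one cell of the table, then name the two classification variables in a TABLES statement and supply the cell counts with a WEIGHT statement. The data set names PHDLONG and RESULTS2 and the variables YEAR, COUNT, and I are assumptions introduced for illustration. The final step runs the confirmatory chi² test with PROC FREQ on the same data.

```sas
/* one observation per discipline-year cell (names are illustrative) */
data phdlong;
   set phds;
   array y{6} y1973-y1978;
   do i = 1 to 6;
      year  = 1972 + i;   /* 1973, ..., 1978 */
      count = y{i};       /* number of Ph.D.s in this cell */
      output;
   end;
   keep science year count;
run;

/* raw data input: rows from SCIENCE, columns from YEAR, counts as weights */
proc corresp data=phdlong out=results2 short;
   tables science, year;
   weight count;
run;

/* confirmatory test of independence for the same two-way table */
proc freq data=phdlong;
   tables science*year / chisq;
   weight count;
run;
```

Because the same 6 by 6 table is analyzed, the inertias and total chi² should agree with Figure 1. The categories may be listed in a different order, and the signs of the coordinates on a dimension may be reversed.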