Correspondence analysis is an exploratory technique for categorical data. It is designed to show how the data depart from independence. Confirmatory methods, such as those provided by PROC FREQ and PROC CATMOD, should be used for hypothesis testing.
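PROC CORRESP can read two kinds of input: raw category responses, which are cross-tabulated with the TABLES statement, and an existing contingency table, whose columns are named in the VAR statement.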
An output data set from PROC CORRESP contains scores for the row and column categories on two or more dimensions. Graphical display of these scores provides a useful aid to interpretation.
PROC CORRESP DATA=SAS-data-set;
   TABLES row-variables <,column-variables>;    raw data input
   VAR variables;                               contingency table input
   ID variable;
   BY variables;
   SUPPLEMENTARY variables;
   WEIGHT variable;
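The raw data form of input, using the TABLES statement, can be sketched as follows. This is a minimal illustration only: the data set SURVEY, its variables REGION and PRODUCT, the twelve made-up observations, and the output data set COORDS are all hypothetical names and values chosen for the example.

```sas
/* Hypothetical raw data: one observation per respondent. */
data survey;
   input region $ product $ @@;
datalines;
East A  East A  East B  East C
West A  West B  West B  West C
North A North B North C North C
;

/* PROC CORRESP cross-tabulates REGION (rows) by PRODUCT (columns)
   itself, then analyzes the resulting 3 x 3 table. */
proc corresp data=survey out=coords short;
   tables region, product;
run;
```

With contingency table input, by contrast, the table already exists in the data set and the VAR statement simply names its columns.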
data phds;
   input science $1-13 y1973-y1978;
datalines;
Life          4489 4303 4402 4350 4266 4361
Physical      4101 3800 3749 3572 3410 3234
Social        3354 3286 3344 3278 3137 3008
Behavioral    2444 2587 2749 2878 2960 3049
Engineering   3338 3144 2959 2791 2641 2432
Mathematics   1222 1196 1149 1003  959  959
;
A correspondence analysis is carried out by the statements below. The VAR statement names the columns of the table, and the ID statement specifies the variable that labels its rows. The OUT= option creates an output data set, RESULTS, containing the row and column coordinates on each dimension.
proc corresp data=phds out=results rp short;
   var y1973-y1978;
   id science;
run;
The printed output from the CORRESP procedure is shown in Figure 1. The total chi-square of 383.86 on 25 degrees of freedom is highly significant, indicating that the disciplines differ in their profiles of Ph.D.s across the years. The breakdown of the chi-square shows that over 96% of the total is accounted for by the first dimension, so the association is essentially one-dimensional.
               The Correspondence Analysis Procedure

               Inertia and Chi-Square Decomposition

   Singular  Principal     Chi-
    Values    Inertias   Squares  Percents   19   38   57   76   95
                                           ----+----+----+----+----+--
   0.05845    0.00342    368.653   96.04%  *************************
   0.00861    0.00007      7.995    2.08%  *
   0.00694    0.00005      5.197    1.35%
   0.00414    0.00002      1.852    0.48%
   0.00122    0.00000      0.160    0.04%
              -------    -------
              0.00356    383.856  (Degrees of Freedom = 25)

                         Row Coordinates

                              Dim1        Dim2

            Life          0.025813    0.008097
            Physical      -.041273    -.002420
            Social        0.001352    -.011413
            Behavioral    0.110006    -.001299
            Engineering   -.070379    -.003671
            Mathematics   -.063942    0.022762

                        Column Coordinates

                              Dim1        Dim2

            Y1973         -.084027    0.003252
            Y1974         -.050893    0.002939
            Y1975         -.014823    0.000793
            Y1976         0.024241    -.012926
            Y1977         0.051249    -.008190
            Y1978         0.086413    0.014276

Figure 1: PROC CORRESP output for PHDS data
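Because correspondence analysis is exploratory, the total chi-square it reports can be cross-checked with a confirmatory procedure. The following is a minimal sketch, assuming the PHDS data set created earlier; the names PHDS_LONG, YR, I, YEAR and COUNT are illustrative and do not appear in the original analysis. The table is first reshaped to one observation per cell so that PROC FREQ can read the counts through a WEIGHT statement.

```sas
/* Reshape the 6 x 6 table to one observation per cell.
   PHDS_LONG, YR, I, YEAR and COUNT are illustrative names. */
data phds_long;
   set phds;
   array yr{6} y1973-y1978;
   do i = 1 to 6;
      year  = 1972 + i;     /* 1973, ..., 1978 */
      count = yr{i};        /* cell frequency  */
      output;
   end;
   keep science year count;
run;

/* Pearson chi-square test of independence for the same table */
proc freq data=phds_long;
   weight count;
   tables science*year / chisq norow nocol nopercent;
run;
```

The Pearson chi-square printed by PROC FREQ should agree with the total in Figure 1, since both are computed from the same 6 x 6 table with (6-1) x (6-1) = 25 degrees of freedom.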
The pattern of association may be interpreted by plotting the output dimensions in the RESULTS data set. Since PROC PLOT can use only single-character plotting symbols, it plots each discipline with the first letter of its name; a DATA step is used to change the point symbols for years to the last digit of the year.
data results;
   set results;
   /* create year plot character from last digit */
   if _type_ = 'VAR' then science = substr(science,5,1);
run;

proc plot data=results;
   plot dim2 * dim1 = science / box
        vspace=5 hspace=10
        haxis=-0.15 to 0.15 by 0.05
        vaxis=-0.05 to 0.05 by 0.05;
run;
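The plot is shown in Figure 2. The years increase steadily from left to right, so Dimension 1 is essentially a time dimension. Disciplines with positive scores on Dimension 1 (most notably Behavioral) increased their share of Ph.D.s over the period, while those with negative scores (Engineering, Mathematics and Physical) lost share.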
            Plot of DIM2*DIM1.   Symbol is value of SCIENCE.

      ---+---------+---------+---------+---------+---------+---------+---
DIM2  |                                                                 |
      |                                                                 |
 0.05 +                                                                 +
      |                                                                 |
      |                                                                 |
      |                   M                                             |
      |                                     L           8               |
 0.00 +               3  E   4 P    5                        B          +
      |                                S    6    7                      |
      |                                                                 |
      |                                                                 |
      |                                                                 |
-0.05 +                                                                 +
      |                                                                 |
      ---+---------+---------+---------+---------+---------+---------+---
       -0.15     -0.10     -0.05      0.00      0.05      0.10      0.15
                                     DIM1
Figure 2: Plot of correspondence analysis solution
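On a SAS release that includes ODS Graphics, a similar display can be sketched with PROC SGPLOT, which labels each point with the full value of SCIENCE and so is not limited to single-character symbols. This is an illustrative alternative rather than part of the original analysis. It assumes the RESULTS data set created by the OUT= option, in which the _TYPE_ variable takes the values 'OBS' for row points and 'VAR' for column points. If the relabeling DATA step has already been run, the year points carry their one-digit labels.

```sas
/* Sketch only: plot row and column points from the OUT= data set. */
proc sgplot data=results;
   where _type_ in ('OBS', 'VAR');          /* skip the inertia record */
   scatter x=dim1 y=dim2 / group=_type_ datalabel=science;
   refline 0 / axis=x;                      /* reference lines at the origin */
   refline 0 / axis=y;
   xaxis values=(-0.15 to 0.15 by 0.05);    /* same axis ranges as the PROC PLOT call */
   yaxis values=(-0.05 to 0.05 by 0.05);
run;
```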