Unsupervised Learning Methods ◾ 83

◾◾ New: Factor pattern plots before and after factor rotation are generated using the new SAS 9.2 statistical graphics feature.

◾◾ New: Plots for assessing the significance and the nature of factor loadings are generated using the new SAS 9.2 statistical graphics feature.

◾◾ New: Confidence interval estimates for factor loadings are provided when ML factor analysis is used.

◾◾ Biplot displays showing the interrelationship between the principal component or factor scores and the correlations among the multiattributes are produced for all combinations of selected principal components or factors.

◾◾ Options for saving the output tables and graphics in WORD, HTML, PDF, and TXT formats are available.

Software requirements for using the FACTOR2 macro are the following:

◾◾ SAS/BASE, SAS/STAT, SAS/GRAPH, and SAS/IML must be licensed and installed at your site.

◾◾ SAS version 9.1.3 or later is required for full utilization.

4.7.1 Steps Involved in Running the FACTOR2 Macro

1. Create or open a temporary SAS dataset from n × p coordinate data containing p correlated continuous variables and n observations. If a coordinate (n × p) dataset is not available and only a correlation matrix is available, then create a special correlation SAS dataset (see Figure 4.15).

2. Open the FACTOR2.SAS macro-call file in the SAS EDITOR window. Instructions for downloading the macro-call and sample data files from this book’s Web site are given in Appendix 1. Click the RUN icon to submit the macro-call file FACTOR2.SAS and open the MACRO-CALL window called FACTOR2 (Figure 4.1).

Special note to SAS Enterprise Guide (EG) CODE window users: Because the user-friendly SAS macro application included in this book uses SAS WINDOW/DISPLAY commands, which are not compatible with SAS EG, open the traditional FACTOR2 macro-call file included in the \dmsas2e\maccal\nodisplay\ folder in the SAS editor. Read the instructions given in Appendix 3 regarding the use of the traditional macro-call files in the SAS EG/SAS Learning Edition (LE) code window.

3. Input the appropriate parameters in the macro-call window by following the instructions provided in the FACTOR2 macro help file in Appendix 2. Users can choose either the scatter plot analysis option or the PCA/EFA analysis option. Options for checking multivariate normality assumptions and detecting the presence of outliers are also available. After inputting all the required macro parameters, check that the cursor is in the last input field, and then hit the ENTER key (not the RUN icon) to submit the macro.

© 2010 by Taylor and Francis Group, LLC

K10535_Book.indb 83

5/18/10 3:36:56 PM

84 ◾ Statistical Data Mining Using SAS Applications

Figure 4.1 Screen copy of FACTOR2 macro-call window showing the macro-call parameters required for performing PCA.

4. Examine the LOG window (only in DISPLAY mode) for any macro execution errors. If you see any errors in the LOG window, activate the EDITOR window, resubmit the FACTOR2.SAS macro-call file, check your macro input values, and correct any input errors.

5. Save the output files. If no errors are found in the LOG window, activate the EDITOR window, resubmit the FACTOR2.SAS macro-call file, and change the macro input value from DISPLAY to any other desired format. The PCA or EFA SAS output files and exploratory graphs can then be saved in the user-specified format in the user-specified folder.

4.7.2 Case Study 1: Principal Component Analysis of 1993 Car Attribute Data

4.7.2.1 Study Objectives

1. Variable reduction: Reduce the six highly correlated multiattributes in the coordinate data to fewer dimensions (2 or 3) without losing much of the variation in the dataset.

2. Scoring observations: Group or rank the observations in the dataset based on composite scores generated by an optimally weighted linear combination of the original variables.

3. Interrelationships: Investigate the interrelationship between the observations and the multiattributes, and group similar observations and similar variables.

4.7.2.2 Data Descriptions

Data name:                SAS dataset CARS93

Multiattributes:          Y2: Midprice
                          Y4: City gas mileage/gallon
                          X4: HP
                          X8: Passenger capacity
                          X11: Width of the vehicle
                          X15: Physical weight

Number of observations:   92

Car93: Data source [18]:  Lock, R. H. (1993)

Open the FACTOR2.SAS macro-call file in the SAS EDITOR window, and click RUN to open the FACTOR2 macro-call window (Figure 4.1). Input the appropriate macro-input values by following the suggestions given in the help file (Appendix 2).

Exploratory analysis: Input Y2, Y4, X4, X8, X11, and X15 as the multiattributes in macro-call field #3. Input YES in macro-call field #2 to perform data exploration and create a scatter plot matrix. (PCA will not be performed when you choose to run data exploration.) Submit the FACTOR2 macro, and SAS will output descriptive statistics, correlation matrices, and scatter plot matrices. Only selected output and graphics generated by the FACTOR2 macro are interpreted in the following text.

The descriptive simple statistics of all multiattributes generated by the SAS PROC CORR are presented in Table 4.2. The number of observations (N) per variable is useful in checking for missing values for any given attribute and providing information on the size of the n × p coordinate data. The estimates of central tendency (mean) and the variability (standard deviation) provide information on the nature of multiattributes that can be used to decide whether to use standardized or unstandardized data in the PCA analysis. The minimum and the maximum values describe the range of variation in each attribute and help to check for any extreme outliers.
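To illustrate how the standard deviations can inform the choice between standardized and unstandardized data, the following minimal Python sketch computes each attribute's share of the total variance on the raw scale. It is not part of the FACTOR2 macro; the standard deviations are copied from Table 4.2, and the variable names (`std_dev`, `variance_share`) are illustrative assumptions.

```python
# Standard deviations copied from Table 4.2 (raw, unstandardized scale).
std_dev = {
    "Y2": 9.65,     # midrange price (in $1000)
    "Y4": 5.61,     # city MPG
    "X4": 52.37,    # HP (maximum)
    "X8": 1.038,    # passenger capacity (persons)
    "X11": 3.77,    # car width (inches)
    "X15": 589.89,  # weight (pounds)
}

# Variance is the squared standard deviation.
variance = {name: sd ** 2 for name, sd in std_dev.items()}
total_variance = sum(variance.values())

# Share of the total raw-scale variance contributed by each attribute.
variance_share = {name: v / total_variance for name, v in variance.items()}

# Print the attributes from the largest share to the smallest.
for name, share in sorted(variance_share.items(), key=lambda kv: -kv[1]):
    print(f"{name:>4}: {share:.4f}")
```

On the raw scale, weight (X15) alone contributes more than 99% of the total variance simply because it is measured in pounds, so an unstandardized PCA would largely reproduce that one variable. This is one reason to prefer standardized data when the attributes are measured in very different units.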

Table 4.2 PROC CORR Output and Simple Statistics: FACTOR2 Macro

Variable     N      Mean   Std Dev       Sum   Minimum   Maximum   Label
Y2          93     19.50      9.65      1814      7.40     61.90   Midrange price (in $1000)
Y4          93     22.36      5.61      2080     15.00     46.00   City MPG (miles per gallon by EPA rating)
X4          93    143.82     52.37     13376     55.00    300.00   HP (maximum)
X8          93      5.08     1.038    473.00      2.00      8.00   Passenger capacity (persons)
X11         93     69.37      3.77      6452     60.00     78.00   Car width (inches)
X15         93      3073    589.89    285780      1695      4105   Weight (pounds)

The degree of linear association among the variables, measured by the Pearson correlation coefficient (r), and its statistical significance are presented in Table 4.3. The absolute value of r ranged from nearly 0 (0.009) to 0.87. The statistical significance of r varied from no correlation (p-value: 0.9298) to a highly significant correlation (p-value < 0.0001). Among the 15 possible pairs of correlations, 13 pairs were highly significant, indicating that these data are suitable for performing PCA. The scatter plot matrix among the six attributes presented in Figure 4.2 reveals the strength of correlation, the presence of any outliers, and the nature of bidirectional variation. In addition, each scatter plot shows the linear regression line, the 95% mean confidence interval band, and a horizontal line (Y-bar line), which passes through the mean of the Y-variable. If this Y-bar line intersects the confidence band lines, that is, the confidence band region does not enclose the Y-bar line, then the correlation between the X and Y variables is statistically significant. For example, among the 15 scatter plots presented in Figure 4.2, only in two scatter plots (Y2 versus X8; X4 versus X8) did the Y-bar line not intersect the confidence band. Only these two correlations are not statistically significant (Table 4.3).
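The significance pattern described above can be checked approximately with the usual t test for a Pearson correlation, t = r·sqrt((n − 2)/(1 − r²)) with n − 2 degrees of freedom. The Python sketch below is an illustration only, not the computation performed by PROC CORR inside the macro. It assumes n = 93 (the N reported in Table 4.2) and uses 1.986 as an approximate two-sided 5% critical value of t with 91 degrees of freedom; both constants and all function names are assumptions made for this example.

```python
import math

N_OBS = 93          # observations per variable (N in Table 4.2)
T_CRITICAL = 1.986  # assumed approximate two-sided 5% critical t value, 91 df


def t_statistic(r: float, n: int) -> float:
    """t statistic for testing H0: correlation = 0 (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def is_significant(r: float, n: int = N_OBS, t_crit: float = T_CRITICAL) -> bool:
    """True when |t| exceeds the critical value (two-sided test, 5% level)."""
    return abs(t_statistic(r, n)) > t_crit


# Selected correlation coefficients reported in Table 4.3.
selected = {"X11-X15": 0.87, "Y4-X15": -0.84, "Y4-X8": -0.41,
            "Y2-X8": 0.05, "X4-X8": 0.009}

for pair, r in selected.items():
    t = t_statistic(r, N_OBS)
    print(f"{pair:>8}: r = {r:+.3f}, t = {t:+.2f}, significant = {is_significant(r)}")
```

Under these assumptions only the two pairs involving X8 with Y2 and X4 fall below the critical value, which agrees with the two nonsignificant correlations identified in the text.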

To perform PCA, input Y2, Y4, X4, X8, X11, and X15 as the multiattributes in macro-call field #3. Leave macro-call field #2 blank to perform PCA. (PCA will not be performed when you input YES to data exploration.) Input the appropriate macro-input values by following the suggestions given in the help file (Appendix 2).

Table 4.3 Pearson Correlation Coefficients and Their Statistical Significance Levels (p-values): PROC CORR Output from FACTOR2 Macro

Variable and label                                     Y2        Y4        X4        X8       X11       X15
Y2   Midrange price (in $1000)                          1    −0.59a      0.78      0.05      0.45      0.64
                                                           <0.0001b   <0.0001    0.5817   <0.0001   <0.0001
Y4   City MPG (miles per gallon by EPA rating)      −0.59         1     −0.67     −0.41     −0.72     −0.84
                                                  <0.0001             <0.0001   <0.0001   <0.0001   <0.0001
X4   HP (maximum)                                    0.78     −0.67         1     0.009      0.64      0.73
                                                  <0.0001   <0.0001              0.9298   <0.0001   <0.0001
X8   Passenger capacity (persons)                    0.05     −0.41     0.009         1      0.48      0.55
                                                   0.5817   <0.0001    0.9298             <0.0001   <0.0001
X11  Car width (inches)                              0.45     −0.72      0.64      0.48         1      0.87
                                                  <0.0001   <0.0001   <0.0001   <0.0001             <0.0001
X15  Weight (pounds)                                 0.64     −0.84      0.73      0.55      0.87         1
                                                  <0.0001   <0.0001   <0.0001   <0.0001   <0.0001

a Correlation coefficient (first entry in each cell).

b Statistical significance, p-value (second entry in each cell).

In PCA, the number of standardized multiattributes defines the number of eigenvalues. An eigenvalue greater than 1 indicates that the PC accounts for more of the variance than any one of the original variables in standardized data. This can be confirmed by visually examining the improved scree plot (Figure 4.3) of eigenvalues and the parallel analysis of eigenvalues. This enhanced scree plot shows the rate of change in the magnitude of the eigenvalues for an increasing number of PCs. The point in the scree plot at which the rate of decline levels off indicates the optimum number of PCs to extract. Also, the intersection point between the scree plot and the parallel analysis plot reveals that the first two eigenvalues, which account for about 86% of the total variation, could be retained as the significant PCs (Figure 4.3).
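The eigenvalue-greater-than-1 rule can be applied directly to the eigenvalues reported in Table 4.4. The Python sketch below is only an illustration with assumed variable names; it does not reproduce the parallel analysis, which requires eigenvalues from simulated random data that are not listed in the text.

```python
# Eigenvalues of the correlation matrix, copied from Table 4.4.
eigenvalues = [3.97215807, 1.20826159, 0.37760751,
               0.26395117, 0.12224066, 0.05578100]

# Eigenvalue-greater-than-1 rule: keep PCs that explain more variance
# than a single standardized variable (whose variance is 1).
retained = [i + 1 for i, ev in enumerate(eigenvalues) if ev > 1.0]

# Decline between consecutive eigenvalues, i.e. what the scree plot shows
# (this matches the "Difference" column of Table 4.4).
declines = [a - b for a, b in zip(eigenvalues, eigenvalues[1:])]

print("Retained PCs:", retained)
for i, d in enumerate(declines, start=1):
    print(f"Decline from PC{i} to PC{i + 1}: {d:.4f}")
```

The first two PCs pass the rule, and the decline becomes small after the second eigenvalue, which is consistent with retaining two PCs.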

If the data are standardized, that is, normalized to zero mean and unit standard deviation, the sum of the eigenvalues is equal to the number of variables used. The magnitude of an eigenvalue is usually expressed as a percentage of the total variance. The information in Table 4.4 indicates that the first eigenvalue accounts for about 66% of the variation, the second for 20%, and the proportions drop off gradually for the rest of the eigenvalues. Cumulatively, the first two eigenvalues together account for 86% of the variation in the dataset. A two-dimensional view (of the six-dimensional dataset) can be created by projecting all data points onto the plane defined by the axes of the first two PCs. This two-dimensional view will retain 86% of the information from the six-dimensional plot.

Figure 4.2 Scatter plot matrix illustrating the degree of linear correlation among the six attributes, derived using the SAS macro FACTOR2.

Figure 4.3 PCA scree plot (New: ODS Graphics feature) illustrating the relationship between the number of PCs and the rate of decline of the eigenvalues, and the parallel analysis plot, derived using the SAS macro FACTOR2. Panels: Scree Plot; Variance Explained (proportion and cumulative proportion); Scree Plot and Parallel Analysis (eigenvalue versus number of PCs; e = scree plot, p = parallel analysis plot).

Table 4.4 Eigenvalues in Principal Component Analysis: PROC FACTOR Output from FACTOR2 Macro

      Eigenvalue (a)   Difference   Proportion   Cumulative
1     3.97215807       2.76389648   0.6620       0.6620
2     1.20826159       0.83065407   0.2014       0.8634
3     0.37760751       0.11365635   0.0629       0.9263
4     0.26395117       0.14171050   0.0440       0.9703
5     0.12224066       0.06645966   0.0204       0.9907
6     0.05578100       -            0.0093       1.0000

(a) Eigenvalues of the correlation matrix: Total = 6, average = 1.
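The Proportion and Cumulative columns of Table 4.4 follow directly from the eigenvalues. As a minimal sketch (plain Python with illustrative variable names, not the PROC FACTOR implementation), each proportion is the eigenvalue divided by the sum of all eigenvalues, which equals the number of standardized variables:

```python
from itertools import accumulate

# Eigenvalues of the correlation matrix, copied from Table 4.4.
eigenvalues = [3.97215807, 1.20826159, 0.37760751,
               0.26395117, 0.12224066, 0.05578100]

# For standardized data the eigenvalues sum to the number of variables (6).
total = sum(eigenvalues)

# Proportion of total variance explained by each PC, and the running total.
proportion = [ev / total for ev in eigenvalues]
cumulative = list(accumulate(proportion))

print(f"Sum of eigenvalues: {total:.4f}")
for k, (p, c) in enumerate(zip(proportion, cumulative), start=1):
    print(f"PC{k}: proportion = {p:.4f}, cumulative = {c:.4f}")
```

The cumulative value for the second PC is the share of the variation retained by the two-dimensional projection.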

The new variables PC1 and PC2 are linear combinations of the six standardized variables, and the magnitude of the eigenvalues accounts for the variation in the new PC scores. The eigenvectors presented in Table 4.5 provide the weights for transforming the six standardized variables into PCs. For example, PC1 is derived by performing the following linear transformation using these eigenvectors.

PC1 = 0.37781 Y2 − 0.44702 Y4 + 0.41786 X4 + 0.23403 X8 + 0.43847 X11 + 0.48559 X15

The sum of the squared eigenvector elements for a given PC is equal to one.
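The unit-length property can be verified numerically from the eigenvectors in Table 4.5. The following Python sketch (illustrative names only) also checks that the two eigenvectors are orthogonal, a standard property of the eigenvectors of a correlation matrix; because the table values are rounded to five decimals, the results match only approximately.

```python
# Eigenvector elements copied from Table 4.5, in the order
# Y2, Y4, X4, X8, X11, X15.
eigvec1 = [0.37781, -0.44702, 0.41786, 0.23403, 0.43847, 0.48559]
eigvec2 = [-0.44215, -0.05055, -0.42666, 0.75256, 0.19308, 0.12758]


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


# Sum of squared elements of each eigenvector (expected: about 1).
norm1_sq = dot(eigvec1, eigvec1)
norm2_sq = dot(eigvec2, eigvec2)

# Dot product between the two eigenvectors (expected: about 0).
cross = dot(eigvec1, eigvec2)

print(f"Sum of squares, eigenvector 1: {norm1_sq:.5f}")
print(f"Sum of squares, eigenvector 2: {norm2_sq:.5f}")
print(f"Dot product of eigenvectors 1 and 2: {cross:.5f}")
```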

Table 4.5 Eigenvectors in PCA Analysis: PROC FACTOR Output from FACTOR2 Macro

Variable  Label                                        Eigenvector 1   Eigenvector 2
Y2        Midrange price (in $1000)                          0.37781        −0.44215
Y4        City MPG (miles per gallon by EPA rating)         −0.44702        −0.05055
X4        HP (maximum)                                       0.41786        −0.42666
X8        Passenger capacity (persons)                       0.23403         0.75256
X11       Car width (inches)                                 0.43847         0.19308
X15       Weight (pounds)                                    0.48559         0.12758

Table 4.6 Principal Component (PC) Loadings for the First Two PC: PROC FACTOR Output from FACTOR2 Macro

Variable  Label                                        FACTOR (PC) 1   FACTOR (PC) 2
Y2        Midrange price (in $1000)                          0.75298        −0.48602
Y4        City MPG (miles per gallon by EPA rating)         −0.89092        −0.05557
X4        HP (maximum)                                       0.83281        −0.46899
X8        Passenger capacity (persons)                       0.46643         0.82722
X11       Car width (inches)                                 0.87388         0.21224
X15       Weight (pounds)                                    0.96780         0.14024

PC loadings presented in Table 4.6 are the correlation coefficients between the first two PC scores and the original variables. They measure the importance of each variable in accounting for the variability in the PC. That is, the larger the loadings in absolute terms, the more influential the variables are in forming the new PC, and vice versa. A high correlation between PC1 and midrange price, city MPG, HP, car width, and weight indicates that these variables are associated with the direction of the maximum amount of variation in this dataset. The first PC loading patterns suggest that heavy, big, very powerful, and highly priced cars are less fuel efficient. A strong correlation between passenger capacity and PC2 indicates that PC2 is mainly attributed to the passenger capacity of the vehicle, which is responsible for the next largest variation in the data perpendicular to PC1. A visual display of the degree and the direction of PC loadings is presented in Figure 4.4. The regression plot between PC scores and the original variables derived using the SAS macro FACTOR2 displays the statistical significance of the linear association between the original variable and the derived PC scores (Figure 4.5).
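For a PCA of the correlation matrix, each loading equals the corresponding eigenvector element multiplied by the square root of the PC's eigenvalue. The Python sketch below uses this standard relationship to recompute the loadings from Tables 4.4 and 4.5 so they can be compared with Table 4.6. It is a check written for illustration, with assumed variable names, and not a description of how the macro computes its output.

```python
import math

# Eigenvalues of PC1 and PC2 (Table 4.4).
eigenvalues = (3.97215807, 1.20826159)

# Eigenvector elements for PC1 and PC2 (Table 4.5).
eigenvectors = {
    "Y2": (0.37781, -0.44215),
    "Y4": (-0.44702, -0.05055),
    "X4": (0.41786, -0.42666),
    "X8": (0.23403, 0.75256),
    "X11": (0.43847, 0.19308),
    "X15": (0.48559, 0.12758),
}

# Loading = eigenvector element * sqrt(eigenvalue of that PC).
loadings = {
    var: tuple(w * math.sqrt(ev) for w, ev in zip(weights, eigenvalues))
    for var, weights in eigenvectors.items()
}

for var, (l1, l2) in loadings.items():
    print(f"{var:>4}: PC1 loading = {l1:+.5f}, PC2 loading = {l2:+.5f}")
```

The recomputed values agree with Table 4.6 to about four decimal places; small differences come from rounding in the published eigenvectors.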

Table 4.7 presents a partial list of the first two PC scores, which are computed as linear combinations of the standardized variables using the eigenvectors as the weights. The cars that have low (negative) PC1 scores are less expensive, small, and less powerful, but they are highly fuel efficient. Similarly, expensive, large, and powerful cars with low fuel efficiency are listed at the end of the table with large positive PC1 scores.
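As a rough sketch of how such a score arises, the Python example below standardizes each attribute with the means and standard deviations from Table 4.2 and applies the PC1 eigenvector from Table 4.5. The two cars are hypothetical, invented for illustration, and are not rows of CARS93. The function returns the unscaled weighted sum; if the macro reports scores rescaled to unit variance, the values would differ by a constant factor, but their sign and ordering would not change.

```python
# (mean, standard deviation) for each attribute, from Table 4.2.
stats = {
    "Y2": (19.50, 9.65),
    "Y4": (22.36, 5.61),
    "X4": (143.82, 52.37),
    "X8": (5.08, 1.038),
    "X11": (69.37, 3.77),
    "X15": (3073.0, 589.89),
}

# PC1 eigenvector elements, from Table 4.5.
pc1_weights = {"Y2": 0.37781, "Y4": -0.44702, "X4": 0.41786,
               "X8": 0.23403, "X11": 0.43847, "X15": 0.48559}


def pc1_score(car):
    """Unscaled PC1 score: eigenvector-weighted sum of the car's z-scores."""
    return sum(pc1_weights[v] * (car[v] - mean) / sd
               for v, (mean, sd) in stats.items())


# Hypothetical cars (illustrative values only).
small_car = {"Y2": 8.4, "Y4": 46.0, "X4": 55.0,
             "X8": 4.0, "X11": 63.0, "X15": 1695.0}
large_car = {"Y2": 36.0, "Y4": 18.0, "X4": 210.0,
             "X8": 6.0, "X11": 77.0, "X15": 4000.0}

print(f"Small, efficient car: PC1 = {pc1_score(small_car):+.2f}")
print(f"Large, powerful car:  PC1 = {pc1_score(large_car):+.2f}")
```

The small, fuel-efficient car receives a negative PC1 score and the large, powerful car a positive one, matching the interpretation given above.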

A biplot display of both PC (PC1 and PC2) scores and PC loadings (Figure 4.6) is very effective in studying the relationships within observations, between variables, and the interrelationship between observations and the variables. The X and Y axes of the biplot represent the standardized PC1 and PC2 scores, respectively. In order to display the relationships among the variables, the PC loading values for each PC are overlaid on the same plot after being multiplied by the maximum value of the corresponding PC scores. For example, PC1 loading values are multiplied by the maximum value of the PC1 scores, and the PC2 loadings are multiplied by the maximum value of the PC2 scores. This transformation places both the variables and the observations on the same scale in the biplot display, since the range of the PC loadings is usually narrower than that of the PC scores.
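The rescaling described above can be sketched in a few lines of Python. The loadings are taken from Table 4.6, whereas the PC scores are a small set of hypothetical values, because the full score table is not reproduced here. The names `scores` and `arrow_tips` are illustrative assumptions.

```python
# PC loadings from Table 4.6: (PC1 loading, PC2 loading).
loadings = {
    "Y2": (0.75298, -0.48602),
    "Y4": (-0.89092, -0.05557),
    "X4": (0.83281, -0.46899),
    "X8": (0.46643, 0.82722),
    "X11": (0.87388, 0.21224),
    "X15": (0.96780, 0.14024),
}

# Hypothetical standardized (PC1, PC2) scores for a handful of cars.
scores = [(-2.4, 0.3), (-1.1, -0.8), (0.2, 1.9), (1.5, -1.2), (2.6, 0.7)]

# Maximum observed score on each PC.
max_pc1 = max(pc1 for pc1, _ in scores)
max_pc2 = max(pc2 for _, pc2 in scores)

# Multiply each loading by the maximum score of its PC so that the
# variable vectors share the scale of the observation points.
arrow_tips = {var: (l1 * max_pc1, l2 * max_pc2)
              for var, (l1, l2) in loadings.items()}

for var, (x, y) in arrow_tips.items():
    print(f"{var:>4}: vector from (0, 0) to ({x:+.3f}, {y:+.3f})")
```

Because every loading lies between −1 and 1, each rescaled vector stays within the largest score on its axis, so the variables and the observations can be read from the same plot.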

Figure 4.4 Factor loadings plot (New: statistical graphics feature) and the regression plot between PC scores and the original variables, derived using the SAS macro FACTOR2. Panels: Factor (PC) Loadings; Regression Plots of Factor Scores and Attributes.

Figure 4.5 Assessing the significance and the nature of factor loadings (New: statistical graphics feature), derived using the SAS macro FACTOR2. Panels: Factor 1 (66.2%); Factor 2 (20.1%).

Only cars having larger (> 75th percentile) or smaller (< 25th percentile) PC scores are identified by their ID numbers on the biplot, to avoid crowding it with too many ID values. Cars with similar characteristics are displayed together in the biplot observational space since they have similar PC1 and PC2 scores. For example, small compact cars with relatively higher gas mileage such as “Geo Metro (ID 12)” and “Ford Fiesta (ID 7)” are grouped close together. Similarly, cars with different attributes are displayed far apart since their PC1, PC2, or both PC scores are different. For example, small compact cars with relatively higher gas mileage such as “Geo Metro (ID 12)” and large expensive cars with relatively lower gas mileage such as “Lincoln Town Car (ID 42)” are separated far apart.
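The percentile-based labeling rule can be sketched as follows. This Python example uses hypothetical IDs and PC1 scores and applies the rule to a single PC; how the macro combines the rule across PC1 and PC2 is not stated in the text, so that part is left out. The function name `ids_to_label` is an assumption made for this illustration.

```python
import statistics


def ids_to_label(scores_by_id):
    """Return IDs whose score is below the 25th or above the 75th percentile."""
    q1, _median, q3 = statistics.quantiles(scores_by_id.values(), n=4)
    return sorted(i for i, s in scores_by_id.items() if s < q1 or s > q3)


# Hypothetical ID -> PC1 score pairs (not taken from CARS93).
pc1_scores = {1: -2.1, 2: -0.9, 3: -0.3, 4: 0.0,
              5: 0.2, 6: 0.4, 7: 1.1, 8: 2.3}

labeled = ids_to_label(pc1_scores)
print("IDs shown on the biplot:", labeled)
```

Observations in the middle half of the score distribution remain unlabeled points, which keeps the central region of the biplot readable.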

The correlations among the multivariate attributes used in the PCA are revealed by the angles between any two PC loading vectors. For each variable, a PC loading vector is created by connecting the X-Y origin (0,0) and the multiplied value of the PC1 and PC2 loadings in the biplot. The angles between any two variable vectors will be:
