
174 ◾ Statistical Data Mining Using SAS Application

5. Save _score_ and regest datasets for future use: These two datasets are created and saved as temporary SAS datasets in the work folder, and they are also exported to Excel worksheets and saved in the user-specified output folder. The _score_ dataset contains the observed variables; the predicted scores, including those for observations with a missing response value; the residuals; and the confidence-interval estimates. This dataset can be used as the base for developing scorecards for each observation. The second SAS dataset, regest, contains the parameter estimates that can be used in the RSCORE macro for scoring different datasets containing the same variables.

6. If-then analysis and lift charts: Perform IF-THEN analysis and construct a lift chart to estimate the differences in the predicted response when one of the continuous predictor variables is fixed at a given value.
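To illustrate how stored parameter estimates can score new data and support an if-then comparison, the following Python sketch applies a dictionary of coefficients to new observations and reports the change in the predicted response when one predictor is fixed at a given value. It is not the RSCORE macro: the coefficient values, the choice of predictors, and the function names `predict` and `if_then_difference` are hypothetical and serve only to show the arithmetic.

```python
# Hypothetical parameter estimates standing in for the contents of a
# regest-style dataset; these numbers are illustrative, not fitted values.
estimates = {"Intercept": 2.0, "X4": 0.125, "X7": 0.5}


def predict(estimates, row):
    """Return the linear prediction for one observation (a dict of predictor values)."""
    return estimates["Intercept"] + sum(
        coef * row[name] for name, coef in estimates.items() if name != "Intercept"
    )


def if_then_difference(estimates, rows, variable, fixed_value):
    """Change in predicted response per row when `variable` is fixed at `fixed_value`."""
    differences = []
    for row in rows:
        fixed_row = dict(row)              # copy so the original row is untouched
        fixed_row[variable] = fixed_value  # the "IF" condition
        differences.append(predict(estimates, fixed_row) - predict(estimates, row))
    return differences


# Hypothetical new observations containing the same variables as the model.
new_data = [{"X4": 100.0, "X7": 15.0}, {"X4": 200.0, "X7": 20.0}]
scores = [predict(estimates, row) for row in new_data]
changes = if_then_difference(estimates, new_data, "X4", 150.0)
print(scores)
print(changes)
```

Sorting the observations by their predicted change would give the kind of ordering a lift chart displays.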

Multiple Linear Regression Analysis of 1993 Car Attribute Data

Data name: SAS dataset CARS93

Multiattributes:
Y2   Midprice
X1   Air bags (0 = none, 1 = driver only, 2 = driver and passenger)
X2   Number of cylinders
X3   Engine size (liters)
X4   HP (maximum)
X5   RPM (revs per minute at maximum HP)
X6   Engine revolutions per mile (in highest gear)
X7   Fuel tank capacity (gallons)
X8   Passenger capacity (persons)
X9   Car length (inches)
X10  Wheelbase (inches)
X11  Car width (inches)
X12  U-turn space (feet)
X13  Rear seat room (inches)
X14  Luggage capacity (cubic feet)
X15  Weight (pounds)

Number of observations: 92

Car93 data source: Lock (reference 38)

© 2010 by Taylor and Francis Group, LLC

K10535_Book.indb 174

5/18/10 3:37:24 PM

Supervised Learning Methods ◾ 175

5.12.1.1 Step 1: Preliminary Model Selection

Open the REGDIAG2.SAS macro-call file in the SAS EDITOR window and click RUN to open the REGDIAG2 macro-call window (Figure 5.4). Input the appropriate macro-input values by following the suggestions given in the help file (Appendix 2). Leave the group variable option blank because all the predictors used are continuous. Leave macro field #14 BLANK to skip regression diagnostics and to run MLR.

◾ Special note to SAS Enterprise Guide (EG) code window users: The user-friendly SAS macro applications included in this book use SAS WINDOW/DISPLAY commands, which are not compatible with SAS EG. Instead, open the traditional REGDIAG macro-call file included in \dmsas2e\maccal\nodisplay\ in the SAS editor. Read the instructions given in Appendix 3 regarding the use of the traditional macro-call files in the SAS EG/SAS Learning Edition (LE) code window.

Model selection: Variable selection using the MAXR2 selection method: The REGDIAG2 macro evaluates all possible regression models using the MAXR2 selection method and outputs the best two models for each subset size (Table 5.1). Because 15 continuous predictor variables were used in the model selection, the full model had 15 predictors, and subsets of 15 different sizes (1 to 15 predictors) are possible. By comparing the R², R²(adj), RMSE, C(p), and AIC values between the full model and all subsets, we can conclude that the 6-variable subset model is superior to all other subsets.

The Mallows C(p) statistic measures the total squared error for a subset, which equals the total error variance plus the bias introduced by not including the important variables in the subset. The C(p) plot (Figure 5.5) shows the C(p) statistic against the number of predictor variables for the full model and the best two models for each subset. The RMSE statistic for the full model and the best two regression models in each subset is also shown in the C(p) plot, and the diameter of the bubbles is proportional to the magnitude of RMSE. Dropping any variable from the six-variable model is not recommended because the C(p), RMSE, and AIC values increase sharply: for the best five-variable model, C(p) rises from 2.30 to 7.93 and AIC from 227.96 to 234.43 (Table 5.1). These results clearly indicate that the C(p), RMSE, and AIC statistics are better indicators for variable selection than R² and R²(adj). Thus, the C(p) plot and the summary table of model selection statistics produced by the REGDIAG2 macro can be used effectively in selecting the best subset in regression models with many (5 to 25) predictor variables.
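The relationship among RMSE, C(p), AIC, and SBC can be checked numerically. The Python sketch below uses the standard least-squares definitions, C(p) = SSE_p / MSE_full - (n - 2p), AIC = n ln(SSE_p / n) + 2p, and SBC = n ln(SSE_p / n) + p ln(n), where p counts the intercept; it is an assumption that these are the definitions behind Table 5.1. The sample size n = 81 is also an assumption: it is not stated in the text (the data description lists 92 observations) but is the value that makes the adjusted R-square and C(p) columns of Table 5.1 agree with one another, presumably because observations with missing values were excluded. The function name `selection_stats` is hypothetical.

```python
import math


def selection_stats(rmse_subset, k_subset, rmse_full, n_obs):
    """Compute C(p), AIC, and SBC for a subset model from reported RMSE values.

    k_subset is the number of predictors; p = k_subset + 1 includes the intercept.
    """
    p = k_subset + 1
    sse = rmse_subset ** 2 * (n_obs - p)   # SSE recovered from RMSE = sqrt(SSE / (n - p))
    mse_full = rmse_full ** 2              # error variance estimate from the full model
    log_term = n_obs * math.log(sse / n_obs)
    return {
        "cp": sse / mse_full - (n_obs - 2 * p),
        "aic": log_term + 2 * p,
        "sbc": log_term + p * math.log(n_obs),
    }


N_OBS = 81  # assumed number of complete cases, inferred from Table 5.1

# Best six-variable model (RMSE 3.91941) against the full 15-variable model (RMSE 4.05026).
stats = selection_stats(3.91941, 6, 4.05026, N_OBS)
print({name: round(value, 2) for name, value in stats.items()})
```

Under these assumptions the results are close to the six-variable row of Table 5.1 (C(p) 2.2959, AIC 227.9613, SBC 244.72243).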

LASSO, the new model selection method implemented in the new SAS procedure GLMSELECT, is also used in the REGDIAG2 macro to screen all listed predictor variables and to examine and visualize the contribution of each predictor in the model selection. Two informative diagnostic plots (Figures 5.6 and 5.7) generated by the ODS graphics feature in GLMSELECT can be used to visualize the importance of the predictor variables. The fit criteria plot (Figure 5.6) displays the trend plots of six model selection criteria versus the number of model parameters, and


Table 5.1 Macro REGDIAG2: Best Two Subsets in All Possible MAXR2 Selection Method

Number     R-Square  Adjusted  C(p)     AIC       Root MSE  SBC        Variables in Model
in Model             R-Square
1          0.6670    0.6627    48.5856  266.1241  5.10669   270.91297  X4
1          0.6193    0.6145    66.5603  276.9593  5.45992   281.74819  X15
2          0.7006    0.6929    37.8970  259.4966  4.87278   266.67998  X4 X7
2          0.6996    0.6919    38.2671  259.7618  4.88076   266.94510  X1 X4
3          0.7364    0.7261    26.4105  251.1920  4.60207   260.76981  X4 X11 X15
3          0.7276    0.7169    29.7340  253.8557  4.67837   263.43350  X4 X10 X11
4          0.7710    0.7589    15.3699  241.8019  4.31775   253.77414  X1 X2 X11 X15
4          0.7666    0.7544    16.9950  243.3118  4.35818   255.28403  X1 X4 X11 X15
5          0.7960    0.7824    7.9336   234.4305  4.10214   248.79718  X1 X2 X4 X7 X11
5          0.7943    0.7805    8.5844   235.1128  4.11945   249.47949  X1 X2 X7 X11 X15
6          0.8162    0.8013    2.2959   227.9613  3.91941   244.72243  X1 X2 X4 X7 X10 X11
6          0.8079    0.7924    5.4282   231.5424  4.00701   248.30352  X1 X2 X4 X7 X11 X15
7          0.8188    0.8015    3.3136   228.8049  3.91809   247.96048  X1 X2 X4 X6 X7 X10 X11
7          0.8185    0.8011    3.4231   228.9346  3.92123   248.09024  X1 X2 X4 X7 X10 X11 X15
8          0.8245    0.8050    3.1778   228.2320  3.88305   249.78208  X1 X2 X4 X6 X7 X10 X11 X15
8          0.8208    0.8009    4.5708   229.9193  3.92370   251.46934  X1 X2 X4 X7 X10 X11 X12 X15
9          0.8259    0.8038    4.6640   229.6007  3.89509   253.54524  X1 X2 X4 X6 X7 X10 X11 X12 X15
9          0.8248    0.8026    5.0546   230.0812  3.90666   254.02565  X1 X2 X4 X6 X7 X8 X10 X11 X15
10         0.8261    0.8013    6.5653   231.4789  3.91986   257.81784  X1 X2 X4 X6 X7 X9 X10 X11 X12 X15
10         0.8261    0.8013    6.5721   231.4873  3.92007   257.82622  X1 X2 X4 X6 X7 X8 X10 X11 X12 X15
11         0.8266    0.7989    8.4032   233.2784  3.94328   262.01177  X1 X2 X4 X6 X7 X8 X10 X11 X12 X13 X15
11         0.8265    0.7988    8.4254   233.3058  3.94395   262.03921  X1 X2 X4 X6 X7 X8 X9 X10 X11 X12 X15
12         0.8270    0.7964    10.2462  235.0837  3.96740   266.21150  X1 X2 X4 X6 X7 X8 X9 X10 X11 X12 X13 X15
12         0.8268    0.7963    10.3043  235.1558  3.96917   266.28364  X1 X2 X4 X6 X7 X8 X10 X11 X12 X13 X14 X15
13         0.8273    0.7938    12.1050  236.9082  3.99257   270.43044  X1 X2 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X15
13         0.8272    0.7937    12.1621  236.9792  3.99432   270.50152  X1 X2 X4 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15
14         0.8276    0.7911    14.0000  238.7775  4.01946   274.69420  X1 X2 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15
14         0.8273    0.7907    14.1049  238.9081  4.02270   274.82480  X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X15
15         0.8276    0.7878    16.0000  240.7775  4.05026   279.08865  X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15

[Figure 5.5: bubble plot titled "CP/P-Ratio & RMSE (area of the bubble) Plot". Vertical axis: Cp/_P_ratio (0 to 30). Horizontal axis: Number of Predictor Variables (1 to 15). Each bubble is labeled with the RMSE of the corresponding model.]

Figure 5.5 Model selection using SAS macro REGDIAG2: CP plot for selecting the best subset model.

in this example, all six criteria identify the 13-parameter model as the best model. However, beyond six variables, no substantial gain was noted. The coefficient progression plot displayed in Figure 5.7 shows the stability of the standardized regression coefficients as new variables are added at each model-selection step. Multicollinearity among the predictor variables was not evident because all the standardized regression coefficients have absolute values less than 1. The following six variables, X4, X7, X1, X2, X13, and X11, were identified as the largest contributors in the model selection sequence. Although X15 was included in the second step, it was later excluded from the model. These features enable analysts to identify the most important variables and help them perform further investigations.

Because this model-selection step includes only the linear effects of the variables, it is recommended that this step be used as a preliminary model selection step rather than the final concluding step. Furthermore, the REGDIAG2 macro also has a feature for selecting the best-candidate models using AICC and SBC (Tables 5.2 and 5.3). Next, we examine the predictor variables selected in the best-candidate models.


[Figure 5.6: panel plot titled "Fit Criteria for Y2" with six panels (AICC, AIC, Adj R-Sq, SBC, C(p), and BIC), each plotted against Step (0 to 15). Legend: best criterion value; step selected by SBC.]

Figure 5.6 Model selection using SAS macro REGDIAG2: Fit criteria plots derived from using the ODS graphics feature in GLMSELECT procedure.

Both the minimum AICC and the minimum SBC criteria identified the same six-variable model (X1, X2, X4, X7, X11, and X10) as the best model. The first five variables were also selected as the best contributing variables by the LASSO method (Figure 5.6), and the CP method picked the same six variables as the best model (Table 5.1). The ΔSBC criterion is very conservative and picked only one model as the best candidate, whereas the ΔAICC criterion identified five models as the best candidates. The standardized regression coefficients of the predictors in the best-candidate models were very stable, indicating that the impact of multicollinearity is minimal. Based on the preliminary model selection step, the variables X1, X2, X4, X7, X11, and X10 were identified as the best linear predictors, and we can proceed with the second step of the analysis.
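The ΔAICC values of the five candidate models can be turned into relative weights of evidence. The Python sketch below uses the usual Akaike-weight calculation: each model's relative likelihood is exp(-Δ/2), and the weights are these likelihoods divided by their sum. It is an assumption, not something stated in the text, that the W_AICCR and W_AICC columns of Table 5.2 were computed this way, although the reported values are numerically consistent with it. The function name `akaike_weights` is hypothetical.

```python
import math


def akaike_weights(deltas):
    """Return (relative likelihoods, normalized weights) for a list of delta values."""
    relative = [math.exp(-delta / 2.0) for delta in deltas]
    total = sum(relative)
    weights = [value / total for value in relative]
    return relative, weights


# DELTA_AICC values of the five best-candidate models as reported in Table 5.2.
delta_aicc = [0.00000, 1.12177, 1.24024, 1.37000, 1.93826]
relative, weights = akaike_weights(delta_aicc)

for delta, rel, weight in zip(delta_aicc, relative, weights):
    print(f"delta={delta:.5f}  relative={rel:.5f}  weight={weight:.5f}")
```

Read this way, the best model carries about one-third of the total weight among the five candidates, which is why the ΔAICC criterion keeps several models under consideration.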

5.12.1.2 Step 2: Graphical Exploratory Analysis and Regression Diagnostic Plots

Open the REGDIAG2.SAS macro-call file in the SAS EDITOR window and click RUN to open the REGDIAG2 macro-call window (Figure 5.8). Input the appropriate macro-input values by following the suggestions given in the help file


[Figure 5.7: two stacked panels titled "Coefficient Progression for Y2". Upper panel: Standardized Coefficient (−0.50 to 0.50), with traces labeled X2, X1, X10, X6, X9, X8, X12, and X11. Lower panel: SBC (260 to 360), with the selected step marked. Horizontal axis: Effect Sequence (Intercept, 1+X4, 2+X15, 3+X7, 4+X1, 5+X2, 6+X13, 7-X15, 8+X11, 9+X15, 10+X10, 11+X6, 12+X12, 13+X8, 14+X14, 15+X9, 16+X5, 17+X3).]

Figure 5.7 Model selection using SAS macro REGDIAG2: Standardized regression coefficient and SBC progression plots by model selection steps derived from using the ODS graphics feature in GLMSELECT procedure.

(Appendix 2). Leave the group variable option blank because all the predictors used are continuous. Input YES in macro field #14 to request additional regression diagnostic plots using the predictor variables selected in step 1.

The three model selection plots (the CP plot in Figure 5.9, the fit criteria plot in Figure 5.10, and the coefficient progression plot in Figure 5.11) for the six predictor variables selected in step 1 (X1, X2, X4, X7, X11, and X10) further confirmed that these are the best linear predictors under all model selection criteria. Thus, in the second step, data exploration and diagnostic plot analysis were carried out using these six predictor variables.

Simple linear regression and augmented partial residual (APR) plots for all six predictor variables are presented in Figure 5.12. The linear/quadratic regression parameter estimates for the simple and multiple linear regressions and their significance levels are also displayed in the titles of the APR plots. The simple linear regression line describes the relationship between the response and a given predictor variable in a simple linear regression. The APR line shows the quadratic regression effect of the ith predictor on the response variable after accounting for


Table 5.2 Macro REGDIAG2: Standardized Regression Coefficient Estimates and the Several Model Selection Criteria for the Best-Candidate Models in All Possible MAXR2 Selection Methods Using the Selection Criterion Delta AICC < 2

Dependent variable: Y2. Each column is one candidate model; a period (.) marks a predictor that is not in the model.

Model 1   Model 2   Model 3   Model 4   Model 5   Term
19.3219   19.4307   19.3455   19.3628   19.3221   Intercept
2.44057   2.21502   2.38315   2.36426   2.49279   X1 Air bags (0 = none, 1 = driver only, 2 = driver and passenger)
3.14933   3.31102   3.37800   3.02281   3.10444   X2 Number of cylinders
.         .         .         .         .         X3 Engine size (liters)
3.53493   2.66918   3.52546   3.03817   3.56846   X4 HP (maximum)
.         .         .         .         .         X5 RPM (revs per minute at maximum HP)
.         1.27614   0.77161   .         .         X6 Engine revolutions per mile (in highest gear)
3.31439   2.33768   3.25036   2.80552   3.32973   X7 Fuel tank capacity (gallons)
.         .         .         .         .         X8 Passenger capacity (persons)
.         .         .         .         .         X9 Car length (inches)
2.80894   2.22848   2.98053   2.30391   2.90899   X10 Wheelbase (inches)
−6.1088   −5.9390   −5.7254   −6.3801   −5.7885   X11 Car width (inches)
.         .         .         .         −0.523    X12 U-turn space (feet)
.         .         .         .         .         X13 Rear seat room (inches)
.         .         .         .         .         X14 Luggage capacity (cu ft)
.         3.10061   .         1.81189   .         X15 Weight (pounds)
7         9         8         8         8         Number of parameters in model
0.80133   0.80500   0.80147   0.80115   0.79975   Adjusted r-squared
244.722   249.782   247.960   248.090   248.658   Schwarz’s Bayesian criterion
229.279   230.401   230.519   230.649   231.217   AICC
0.00000   1.12177   1.24024   1.37000   1.93826   DELTA_AICC
0.00000   5.05964   3.23805   3.36781   3.93607   DELTA_SBC
0.33421   0.19074   0.17977   0.16847   0.12681   W_AICC
1.00000   0.57070   0.53788   0.50409   0.37941   W_AICCR

Table 5.3 Macro REGDIAG2: Standardized Regression Coefficient Estimates and the Several Model-Selection Criteria for the Best-Candidate Models in All Possible MAXR2 Selection Methods Using the Selection Criterion Delta SBC < 2

Dependent variable: Y2. A period (.) marks a predictor that is not in the model.

Model 1   Term
19.3219   Intercept
2.44057   X1 Air bags (0 = none, 1 = driver only, 2 = driver and passenger)
3.14933   X2 Number of cylinders
.         X3 Engine size (liters)
3.53493   X4 HP (maximum)
.         X5 RPM (revs per minute at maximum HP)
.         X6 Engine revolutions per mile (in highest gear)
3.31439   X7 Fuel tank capacity (gallons)
.         X8 Passenger capacity (persons)
.         X9 Car length (inches)
2.80894   X10 Wheelbase (inches)
−6.1088   X11 Car width (inches)
.         X12 U-turn space (feet)
.         X13 Rear seat room (inches)
.         X14 Luggage capacity (cu ft)
.         X15 Weight (pounds)
7         Number of parameters in model
0.80133   Adjusted r-squared
244.722   Schwarz’s Bayesian criterion
229.279   AICC
0         DELTA_AICC
0         DELTA_SBC
1         W_SBC
1         W_SBCR

Figure 5.8 Screen copy of REGDIAG2 macro-call window showing the macro-call parameters required for performing regression diagnostic plots in MLR.

the linear effects of other predictors on the response. The APR plot is very effective in detecting significant outliers and nonlinear relationships. Significant outliers and/or influential observations are identified and marked on the APR plot if the absolute STUDENT value exceeds 2.5 or the DFFITS statistic exceeds 1.5. These influence statistics are derived from the MLR model involving all predictor variables. If the correlations among all predictor variables are negligible, the simple and the partial regression lines should have similar slopes.
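The flagging rule described above can be written as a simple filter. The Python sketch below marks an observation when its absolute STUDENT value exceeds 2.5 or its DFFITS value exceeds 1.5, applying the DFFITS cutoff exactly as worded in the text. The diagnostic values are hypothetical, and the function name `flag_influential` is an illustration rather than part of the REGDIAG2 macro.

```python
def flag_influential(student, dffits, student_cutoff=2.5, dffits_cutoff=1.5):
    """Return the indices of observations that meet either flagging condition."""
    flagged = []
    for index, (s, d) in enumerate(zip(student, dffits)):
        # Flag when |STUDENT| is large or DFFITS exceeds its cutoff.
        if abs(s) > student_cutoff or d > dffits_cutoff:
            flagged.append(index)
    return flagged


# Hypothetical diagnostics for four observations (not taken from the CARS93 data).
student_residuals = [0.4, -2.8, 1.1, 2.2]
dffits_values = [0.1, -0.9, 1.7, 0.3]

flagged = flag_influential(student_residuals, dffits_values)
print(flagged)
```

Here the second observation is flagged by its studentized residual and the third by its DFFITS value; an analyst who prefers to screen on the absolute DFFITS value would change only the second condition.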

The APR plots for the six predictor variables showed significant linear relationships between the six predictors and median price. A big difference in the magnitude of the partial (adjusted) and the simple (unadjusted) regression effects on median price was clearly evident for all six predictors (Figure 5.12). The quadratic effects of all six predictor variables on the median price were not significant at the 5% level. Five significant outliers were also detected in these APR plots.

Partial leverage (PL) plots for all six predictor variables are presented in Figure 5.13. The PL display shows three curves: (a) the horizontal reference line that goes through the response variable mean, (b) the partial regression line, which quantifies the slope of the partial regression coefficient of the ith variable in the MLR, and (c) the 95% confidence band for the partial regression line. The partial regression parameter estimates for the ith variable in the multiple linear regression
