

C. Mao et al.

On the other hand, by Bayes' theorem [8], P(C|X) can be computed as Equation (4), where P(C) and P(X) are the respective prior probabilities of C and X, and P(X|C) is the conditional probability of X given C.

P(C|X) = P(X|C) P(C) / P(X)    (4)

Equation (4) can be extended to local conditions, where each term is estimated under the local condition δ(X). We then obtain the Bayesian formula under the local condition as Equation (5).

P(C|(X, δ(X))) = P(X|(C, δ(X))) P(C|δ(X)) / P(X|δ(X))    (5)

The Bayesian classifier [10] maximizes P(C|X) according to formula (4) and estimates it over the whole dataset, while our method maximizes P(C|X) according to formulas (3) and (5) and estimates it in a local area around the query sample. Under the assumption that near neighbors represent the properties of a query sample better than more distant samples, estimating the LPP by formula (5) is more reasonable than by formula (4).

To maximize P(C|(X, δ(X))) according to formula (5), note that P(X|δ(X)) is constant for all classes, so only P(X|(C, δ(X)))P(C|δ(X)) needs to be maximized. The optimization problem can then be transformed to

ω = arg max_C P(X|(C, δ(X))) P(C|δ(X)).    (6)

2.2  Local Distribution Estimation

Given an arbitrary query sample X and a distance metric, its k nearest neighbors can be obtained from the training set. In this paper, we call the set of the k nearest neighbors the k-neighborhood of sample X and denote it by δk(X). To solve the optimization problem (6), the two terms P(X|(C, δ(X))) and P(C|δ(X)), which are relevant to the local distribution of class C, should be estimated based on δk(X) for each class.
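As an illustration (our sketch, not the authors' code), δk(X) can be obtained by a brute-force Euclidean search over the training set; the names `train_X`, `train_y` and `query` are ours:

```python
import numpy as np

def k_neighborhood(train_X, train_y, query, k):
    """Return the k nearest training samples to `query` (Euclidean distance),
    together with their class labels, i.e. delta_k(X)."""
    dists = np.linalg.norm(train_X - query, axis=1)  # distance to every training sample
    idx = np.argsort(dists)[:k]                      # indices of the k closest samples
    return train_X[idx], train_y[idx]
```

For large training sets a k-d tree or ball tree would replace the brute-force scan, but the result is the same neighborhood.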

P(C|δ(X)) gives the probability of a sample belonging to class C given that the sample is in the neighborhood of X. If there are Nj samples from class Cj in the k-neighborhood of X, then P(C|δ(X)) can be estimated by

P(Cj | δk(X)) = Nj / k.    (7)

P(X|(C, δ(X))) gives the probability of a sample being equal to X given that the sample is from class C and is in the neighborhood of X; for continuous attributes, this can be regarded as the local probability density of class C at point X.

To estimate P(X|(C, δ(X))) accurately, we consider only continuous attributes in our method; the estimation of P(X|(C, δ(X))) then becomes a problem of probability density estimation in a local area. In our method, we assume that the samples in the neighborhood follow a Gaussian distribution with mean μ and covariance matrix Σ, whose density f is defined by Equation (8).

f(X; μ, Σ) = (1 / √((2π)^d |Σ|)) e^(−0.5 (X−μ)^T Σ^(−1) (X−μ))    (8)

where d is the dimension of the data. Thus, for δk(X) and a specified class Cj, we have

P(X|(Cj, δk(X))) ∝ f(X; μCj, ΣCj)    (9)

where μCj and ΣCj respectively denote the mean and the covariance matrix of class Cj in δk(X).
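Equation (8) can be evaluated directly; the sketch below (ours, not the authors' code) stores only the diagonal of Σ, matching the diagonal-covariance assumption the method adopts:

```python
import numpy as np

def gaussian_density(x, mu, var):
    """Multivariate normal density of Eq. (8) with a diagonal covariance;
    `var` holds the diagonal entries of Sigma."""
    d = x.shape[0]
    norm = (2 * np.pi) ** (-d / 2) * np.prod(var) ** (-0.5)  # 1 / sqrt((2*pi)^d |Sigma|)
    quad = np.sum((x - mu) ** 2 / var)                       # (x-mu)^T Sigma^{-1} (x-mu)
    return norm * np.exp(-0.5 * quad)
```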

Then we need to estimate the mean μ and the covariance matrix Σ from δk(X) for each class. In our approach, to ensure the covariance matrix is positive definite, we adopt the naive assumption of local class-conditional independence, namely that within the local area an attribute of each class does not correlate with the other attributes; that is, the covariance matrix Σ is a diagonal matrix. If there are Nj samples from class Cj in δk(X), denoted by Xi^Cj (i = 1, ..., Nj), the two parameters, the mean (μCj) and the covariance matrix (ΣCj), can be estimated through maximum likelihood estimation by the following Formulae (10) and (11) [15].

μ̂Cj = (1/Nj) ∑_{i=1}^{Nj} Xi^Cj    (10)

Σ̂Cj = diag( (1/Nj) ∑_{i=1}^{Nj} (Xi^Cj − μ̂Cj)(Xi^Cj − μ̂Cj)^T )    (11)

where diag(·) converts a square matrix to a diagonal matrix with the same

diagonal elements.
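The estimators (10) and (11) amount to a per-class sample mean and a diagonal sample covariance over the neighborhood; a sketch (ours), where `eps` is our addition to keep variances strictly positive when a class has very few neighbors:

```python
import numpy as np

def local_mle(samples, eps=1e-9):
    """Maximum-likelihood mean (Eq. 10) and diagonal covariance (Eq. 11)
    estimated from the neighbours of one class."""
    mu = samples.mean(axis=0)                         # Eq. (10)
    var = ((samples - mu) ** 2).mean(axis=0) + eps    # diagonal of Sigma, Eq. (11)
    return mu, var
```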

Then, we plug the mean (μ) and covariance matrix (Σ) estimated from Formulae (10) and (11) into Equation (8) to compute f(X; μCj, ΣCj), and from it estimate P(X|(Cj, δk(X))) via Formula (9).

2.3  Classification Rules

As k is constant for all classes, according to Formulae (7) and (9), the classification problem defined in (6) can be transformed into an optimization problem, finally formulated as Formula (12):

ω = arg max_{j=1,...,NC} { Nj · f(X; μCj, ΣCj) }    (12)

where NC is the total number of classes, and f(·), μCj and ΣCj are given by Formulae (8), (10) and (11), respectively.

According to the aforementioned process, the LD-kNN approach classifies a query sample by the LPP estimated from the local distribution, calculated according to Bayes' theorem in the local area. The query sample is then labeled with the class having the maximum LPP.
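Putting the pieces together, the decision rule of Formula (12) can be sketched as follows. This is a minimal reimplementation from the description above, not the authors' code; log-densities are used for numerical stability, and the small `eps` guarding degenerate variances is our addition:

```python
import numpy as np

def ld_knn_predict(train_X, train_y, query, k, eps=1e-9):
    """Classify `query` by Eq. (12): argmax_j  N_j * f(X; mu_j, Sigma_j),
    with mu_j and the diagonal Sigma_j estimated inside the k-neighbourhood."""
    dists = np.linalg.norm(train_X - query, axis=1)
    idx = np.argsort(dists)[:k]                       # the k-neighbourhood delta_k(X)
    nb_X, nb_y = train_X[idx], train_y[idx]
    dim = train_X.shape[1]
    best_cls, best_score = None, -np.inf
    for c in np.unique(nb_y):
        pts = nb_X[nb_y == c]
        n_j = len(pts)
        mu = pts.mean(axis=0)                         # Eq. (10)
        var = ((pts - mu) ** 2).mean(axis=0) + eps    # diagonal covariance, Eq. (11)
        log_f = (-0.5 * dim * np.log(2 * np.pi)       # log of Eq. (8)
                 - 0.5 * np.sum(np.log(var))
                 - 0.5 * np.sum((query - mu) ** 2 / var))
        score = np.log(n_j) + log_f                   # log of N_j * f(...), Eq. (12)
        if score > best_score:
            best_cls, best_score = c, score
    return best_cls
```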

2.4  Related Methods

The traditional V-kNN classifies the query sample only by the number of nearest neighbors from each class in the k-neighborhood (i.e., Nj for the j-th class). Compared with the V-kNN rule, LD-kNN takes into account the local probability density around the query sample, f(X; μCj, ΣCj), in addition to the number Nj. For different classes, the local probability densities are not always the same and may play a significant role in classification.

Another classification method related to LD-kNN is Bayesian classification. A Bayesian classifier assigns the query sample to the class with the highest posterior probability, which is estimated from the global distribution, whereas LD-kNN estimates the posterior probability from the local distribution around the query sample. The Naive Bayesian Classification (NBC) method can be considered a special case of LD-kNN with k approaching the size of the dataset. Thus, LD-kNN can be more effective and comprehensive for a specific query sample.

In actuality, the LD-kNN method may be viewed as a compromise between the nearest neighbor rule and the Bayesian method. The parameter k controls the locality in LD-kNN: when k is close to 1, LD-kNN approaches the nearest neighbor rule, and when k is large and equal to the size of the dataset, the local area extends to the whole dataset, in which case LD-kNN becomes a Bayesian classifier. Thus, LD-kNN may combine the advantages of the two classifiers and become a more effective and comprehensive classification method.

As for CAP and LPC, they consider an equal number of nearest neighbors for each class and base the classification on the nearest center. In terms of Equation (12), CAP and LPC use a constant Nj for all classes, and the other term, f(X; μCj, ΣCj), is estimated only from the center of the Nj samples in each class. Thus, CAP and LPC can be viewed as special cases of LD-kNN.

3  Experiments

3.1  The Datasets

In our experimentation we have selected 15 real datasets from the well-known UCI Irvine repository of machine learning datasets [1]. The selected datasets include six two-class problems and nine multi-class problems, and vary in terms of their domain, size, and complexity. Since the probability density estimation applies only to continuous attributes, we take only continuous attributes into account in our experiments. Table 1 summarizes the relevant information for these datasets; for more information, please refer to http://archive.ics.uci.edu/ml.

3.2  Experimental Settings

Before classification, to prevent attributes with an initially large range from inducing bias by out-weighing attributes with initially smaller ranges, we use z-score normalization to linearly transform each of the numeric attributes of a dataset to mean value 0 and standard deviation 1 by v' = (v − μA)/σA, where μA and σA are the mean and standard deviation, respectively, of attribute A.

Table 1. Some Information about the datasets

datasets       #Instances   #Attributes   #Classes
Abalone        4177         7             3
Australian     690          6             2
Breast         106          9             6
Bupa Liver     345          6             2
Dermatology    366          33            6
Glass          214          9             6
ILPD           583          9             2
Iris           150          4             3
Letters        20000        16            26
Pageblock      5473         10            5
Sonar          208          60            2
Spambase       4601         57            2
spectf         267          44            2
Vehicle        846          18            4
Wine           178          13            3
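The z-score step can be sketched as follows (our illustration, not the authors' exact procedure; applying the training-set statistics to the test set is one common choice):

```python
import numpy as np

def zscore(train, test):
    """Standardise each attribute: v' = (v - mu_A) / sigma_A, using the
    training mean/std; the test data reuse the training statistics."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    sigma[sigma == 0] = 1.0   # guard against constant attributes
    return (train - mu) / sigma, (test - mu) / sigma
```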

In order to achieve an impartial evaluation, we have employed six competing classifiers to test the performance of alternative approaches and to provide a comparative analysis evaluating the effectiveness of our LD-kNN algorithm. These competing classifiers include base classifiers (V-kNN, DW-kNN [5] and NBC) and state-of-the-art classifiers (CAP [13], LPC [16] and SVM [2]).

For kNN-type classifiers, we use the Euclidean distance to measure the distance between two samples when searching for the nearest neighbors. In addition, since the parameter k in kNN-type classifiers indicates the number of nearest neighbors, we use the average number of nearest neighbors per class (denoted by kpc) to indicate the neighborhood size; i.e., kpc · NC nearest neighbors are searched, where NC is the number of classes.

To assess the generalization capacity, i.e., the ability of a classifier to classify previously unseen samples, the training and test samples should be independent. In our research we use stratified 5-fold cross-validation to estimate the misclassification rate of a classifier on each dataset. The data are stratified into 5 folds; 4 folds constitute the training set, with the remaining fold used as the test set. Training and testing are performed 5 times, each time with a different test fold and the corresponding training set. To avoid bias, the 5-fold cross-validation process is applied to each dataset 10 times, and the average misclassification rate (AMR) is calculated to evaluate the performance of the classifier.
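This evaluation protocol corresponds closely to scikit-learn's `RepeatedStratifiedKFold`; a sketch (ours), with `KNeighborsClassifier` standing in for the classifier under test:

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier  # stand-in for LD-kNN

def average_misclassification_rate(X, y, clf, repeats=10):
    """Stratified 5-fold CV repeated `repeats` times; returns the AMR in %."""
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=repeats, random_state=0)
    errs = []
    for tr, te in cv.split(X, y):
        clf.fit(X[tr], y[tr])
        errs.append(1.0 - clf.score(X[te], y[te]))  # misclassification rate per fold
    return 100.0 * np.mean(errs)
```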


Table 2. The AMR (%) of the seven methods with corresponding stds on the 15 UCI datasets (the best recognition performance on each dataset is shown in bold face)

datasets      LD-kNN       V-kNN        DW-kNN       CAP          LPC          SVM          NBC
Abalone       35.26±0.23   34.92±0.20   34.95±0.28   35.52±0.26   35.54±0.43   34.52±0.08   41.55±0.09
Australian    24.49±0.83   24.57±0.47   24.55±0.59   25.00±0.80   25.12±0.65   24.22±0.34   27.87±0.46
Breast        30.19±2.63   33.77±1.45   31.89±1.62   30.38±0.57   30.00±1.18   41.13±2.00   35.38±1.21
Bupaliver     31.86±1.36   34.43±1.28   34.00±1.37   32.09±1.74   32.93±1.22   29.83±1.02   48.06±0.71
Dermatology   1.75±0.44    3.93±0.28    3.83±0.24    2.81±0.41    2.84±0.22    2.70±0.35    7.46±0.59
Glass         26.73±1.47   31.26±1.50   28.83±1.82   28.27±1.75   29.72±1.11   30.61±1.06   59.67±2.73
ILPD          29.97±1.34   28.54±0.61   28.99±0.83   30.67±1.36   30.81±1.25   29.11±0.54   44.80±0.49
Iris          3.67±0.33    3.87±0.65    3.80±0.60    3.73±0.68    3.73±0.68    4.07±0.31    4.33±0.58
Letter        5.08±0.09    8.02±0.08    5.29±0.07    3.93±0.05    4.14±0.10    5.54±0.08    35.75±0.07
Pageblock     3.27±0.10    3.30±0.06    3.19±0.06    3.16±0.13    3.17±0.11    3.99±0.09    13.12±1.39
Sonar         11.30±1.40   14.47±1.42   14.33±1.23   11.68±1.96   14.47±1.42   17.45±1.81   31.11±1.01
Spambase      7.77±0.17    8.60±0.10    7.85±0.20    7.97±0.21    8.13±0.23    6.80±0.12    18.24±0.16
spectf        20.00±1.56   19.40±0.71   20.60±0.92   20.22±0.75   20.45±1.24   21.09±0.71   32.62±1.15
Vehicle       24.04±1.05   28.53±0.86   27.86±0.83   23.96±1.20   24.36±0.73   24.20±0.82   53.95±0.74
Wine          0.84±0.38    2.30±0.69    2.13±0.42    1.85±0.67    1.46±0.45    1.97±0.55    2.58±0.45
Average AMR   17.08        18.66        18.14        17.42        17.79        18.48        30.43
Average Rank  2.13         4.63         3.80         2.90         3.80         3.80         6.93

4  Results and Discussion

The parameter kpc is an important factor that can affect the performance of LD-kNN. If kpc is too small, the estimation of the local distribution may be unstable; however, if it is too large, there will be many distant neighbors that may have an adverse effect on the local distribution estimation. To investigate the influence of the parameter kpc on the classification results of kNN-type classifiers, we tune kpc as an integer in the range 1 to 30 for each dataset, perform the classification tasks, and obtain the corresponding AMR for each kpc value. This procedure guides our selection of the parameter kpc for classification.

Fig. 1 shows the performance curves with respect to kpc of the five kNN-type methods on several real datasets. Because different real datasets usually have different distributions, the AMR curves with respect to kpc for LD-kNN are usually different. These performance curves show that, on average, the LD-kNN method can be quite effective for these real problems, and confirm that a modest kpc usually achieves more effective performance for LD-kNN.

We use the lowest AMR over kpc values from 1 to 30 to evaluate the performance of each kNN-type classifier. Following experimental testing, we obtained a comparative performance for our posited approach against the alternative approaches. The classification results on each dataset for all the classifiers are shown in Table 2 in terms of AMR with the corresponding standard deviations (stds).

From the results in Table 2 we can see that LD-kNN offers the best performance on 5 datasets, more than any other classifier; this is an improvement over the alternative classifiers. The overall average AMR and rank of LD-kNN on these datasets are 17.08% and 2.13 respectively, lower than those of all other classifiers, which means that the proposed LD-kNN may be more effective than the other classifiers for these datasets.

Fig. 1. The performance curves (AMR with respect to kpc) of the five kNN-type methods (LD-kNN, V-kNN, DW-kNN, CAP, LPC) on different real datasets: (a) Bupaliver, (b) Iris, (c) Sonar, (d) Spambase, (e) Vehicle, (f) Wine. [Plots not reproduced in this text version.]

To evaluate the statistical significance of the difference between LD-kNN and each of the other classifiers, we performed a Wilcoxon signed rank test [12] between LD-kNN and each of them. The p-values of the tests between LD-kNN and V-kNN, DW-kNN, CAP, LPC, SVM and NBC are 0.0103, 0.0125, 0.0181, 0.0103, 0.1876 and 0.0001 respectively; all are less than 0.05 except that of SVM. Combined with the result that the average AMR of the LD-kNN method is the lowest among these classifiers, it can be seen that, at the 5% significance level, LD-kNN outperforms the other classifiers and is comparable with SVM in terms of AMR.

Fig. 2. The rm distributions of the different methods (LD-kNN, V-kNN, DW-kNN, CAP, LPC, SVM, NBC), shown as box plots. [Plot not reproduced in this text version.]
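The signed-rank comparison can be reproduced in outline with SciPy; the per-dataset AMR values below are made-up placeholders (ours), not the paper's numbers:

```python
from scipy.stats import wilcoxon

# Hypothetical per-dataset AMRs for two classifiers (illustrative values only).
amr_a = [35.3, 24.5, 30.2, 31.9, 1.8, 26.7, 30.0, 3.7]
amr_b = [41.6, 27.9, 35.4, 48.1, 7.5, 59.7, 44.8, 4.3]

stat, p = wilcoxon(amr_a, amr_b)  # paired, non-parametric signed-rank test
print(p < 0.05)                   # prints True: significant at the 5% level
```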

To evaluate how well a particular method performs on average over all the problems considered, we address the issue of robustness. Following the method designed by Friedman [7], we quantify the robustness of a classifier m by the ratio rm of its error rate em to the smallest error rate over all the methods being compared in a particular application, i.e. rm = em / min_{1≤k≤7} ek. The optimal method m* for an application has rm* = 1, and all other methods have a greater ratio; the greater the ratio, the worse the performance of the corresponding method for that application among the compared methods. Thus, the distribution of rm for each method, over all the datasets, provides information concerning its robustness. We illustrate the distribution of rm for each method over the 15 datasets by box plots in Fig. 2, where it is clear that the spread of rm for LD-kNN is narrow and close to 1.0, which demonstrates that the LD-kNN method performs very robustly over these datasets.
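Friedman's robustness ratio is straightforward to compute; a sketch (ours) over a hypothetical error-rate matrix:

```python
import numpy as np

def robustness_ratios(error_rates):
    """error_rates: (n_datasets, n_methods) array of e_m.
    Returns r_m = e_m / min_k e_k per dataset (Friedman's ratio)."""
    e = np.asarray(error_rates, dtype=float)
    return e / e.min(axis=1, keepdims=True)  # row-wise ratio to the best method
```

The column-wise distributions of the returned ratios are what the box plots in Fig. 2 summarize.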

From the above analysis, it can be seen that LD-kNN performs better than the other classifiers in terms of overall AMR. Considering the kNN-type classifiers, DW-kNN improves on the traditional V-kNN by distance weighting, while CAP and LPC improve the kNN method by local centering. LD-kNN is a more comprehensive method that considers the nearest neighbor set integrally through its local distribution; thus it is reasonable to conclude that among the kNN-type classifiers LD-kNN performs best, followed by CAP, LPC, DW-kNN and V-kNN.

Nearest Neighbor Method Based on Local Distribution


The SVM, as an advanced and highly respected algorithm, can also achieve performance comparable with LD-kNN on certain classification problems; however, the performance of LD-kNN is more robust across applications than that of SVM. That is, SVM may perform effectively on certain datasets but badly on others, and is not as stable as LD-kNN on the experimental datasets. NBC performs badly in the experimental classification tasks, principally because the class-conditional independence assumption is too severe for practical problems.

The LD-kNN can be viewed as a Bayesian classification method, as it is predicated on Bayes' theorem. Since the classification is based on the maximum posterior probability, the LD-kNN classifier can in theory achieve the Bayes error rate. Additionally, as a kNN-type classifier, LD-kNN inherits the advantages of the kNN method. Thus, it may be intuitively anticipated that LD-kNN can perform much more effectively than NBC and other kNN-type classifiers in most cases.

5  Conclusion

We have introduced the concept of local distribution into kNN methods for classification. The proposed LD-kNN method essentially considers the k nearest neighbors of the query sample as several integral sets grouped by class label, and then estimates the local distribution of these sets to obtain the LPP for each class; the query sample is then classified according to the maximum LPP. This approach provides a simple mechanism for quantifying the probability that the query sample belongs to each class and has been shown to present several advantages. The experimental results demonstrate the effectiveness and robustness of LD-kNN and show its potential superiority.

A significant step in the proposed method is the estimation of the local distribution. In our experiments, we assume that the local probability distribution of the instances of each class can be modeled as a Gaussian distribution. However, the Gaussian assumption may not always be appropriate for all practical problems; other probability distribution estimation methods are available, such as Gaussian mixture models [19] and kernel density estimation [6]. Different local distribution estimation methods for LD-kNN may produce different results. For a particular classification problem in a specific domain of interest, various methods may be tested to achieve good results; this represents a future direction for our research.

References

1. Bache, K., Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml
2. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, 27:1–27:27 (2011). http://www.csie.ntu.edu.tw/~cjlin/libsvm
3. Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13(1), 21–27 (1967)
4. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. John Wiley & Sons (2012)
5. Dudani, S.: The distance-weighted k-nearest-neighbor rule. IEEE Transactions on Systems, Man and Cybernetics 4, 325–327 (1976)
6. Duong, T.: ks: Kernel density estimation and kernel discriminant analysis for multivariate data in R. Journal of Statistical Software 21(7), 1–16 (2007)
7. Friedman, J., et al.: Flexible metric nearest neighbor classification. Unpublished manuscript, available by anonymous FTP from playfair.stanford.edu (see pub/friedman/README) (1994)
8. Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B.: Bayesian Data Analysis. CRC Press (2013)
9. Govindarajan, M., Chandrasekaran, R.: Evaluation of k-nearest neighbor classifier performance for direct marketing. Expert Systems with Applications 37(1), 253–258 (2010)
10. Han, J., Kamber, M., Pei, J.: Data Mining: Concepts and Techniques. Morgan Kaufmann (2006)
11. Hand, D., Mannila, H., Smyth, P.: Principles of Data Mining. MIT Press (2001)
12. Hollander, M., Wolfe, D.A.: Nonparametric Statistical Methods. John Wiley & Sons, NY (1999)
13. Hotta, S., Kiyasu, S., Miyahara, S.: Pattern recognition using average patterns of categorical k-nearest neighbors. In: Proceedings of the 17th International Conference on Pattern Recognition, ICPR 2004, vol. 4, pp. 412–415. IEEE (2004)
14. Kononenko, I., Kukar, M.: Machine Learning and Data Mining. Elsevier (2007)
15. Lehmann, E.L., Casella, G.: Theory of Point Estimation, vol. 31. Springer (1998)
16. Li, B., Chen, Y., Chen, Y.: The nearest neighbor algorithm of local probability centers. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 38(1), 141–154 (2008)
17. Magnussen, S., McRoberts, R.E., Tomppo, E.O.: Model-based mean square error estimators for k-nearest neighbour predictions and applications using remotely sensed data for forest inventories. Remote Sensing of Environment 113(3), 476–488 (2009)
18. Mitani, Y., Hamamoto, Y.: A local mean-based nonparametric classifier. Pattern Recognition Letters 27(10), 1151–1159 (2006)
19. Reynolds, D.: Gaussian mixture models. In: Encyclopedia of Biometrics, pp. 659–663 (2009)
20. Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A., Liu, B., Philip, S.Y., et al.: Top 10 algorithms in data mining. Knowledge and Information Systems 14(1), 1–37 (2008)

Immune Centroids Over-Sampling Method for Multi-Class Classification

Xusheng Ai1, Jian Wu1(B), Victor S. Sheng2, Pengpeng Zhao1, Yufeng Yao1, and Zhiming Cui1

1 The Institute of Intelligent Information Processing and Application, Soochow University, Suzhou 215006, China
jianwu@suda.edu.cn
2 Department of Computer Science, University of Central Arkansas, Conway 72035, USA

Abstract. To improve the classification performance of imbalanced learning, a novel over-sampling method, Global Immune Centroids Over-Sampling (Global-IC), based on an immune network, is proposed. Global-IC generates a set of representative immune centroids to broaden the decision regions of small class spaces. The representative immune centroids are regarded as synthetic examples in order to resolve the imbalance problem. We utilize an artificial immune network to generate synthetic examples on clusters with high data densities. This approach addresses a shortcoming of synthetic minority over-sampling techniques, which lack any reflection of the grouping of training examples. Our comprehensive experimental results show that Global-IC achieves better performance than renowned multi-class resampling methods.

Keywords: Resampling · Immune network · Over-sampling · Imbalanced learning · Synthetic examples

1  Introduction

The class imbalance problem typically occurs when there are many more instances of some classes than of others in multi-class classification. Recently, reports from both academia and industry indicate that the imbalanced class distribution of a data set poses a serious difficulty to most classification algorithms, which assume a relatively balanced distribution. Furthermore, identifying rare objects is of crucial importance; in many real-world applications, the classification performance on the small classes is the major concern in determining the quality of a classification model.

In the research community of imbalanced learning, almost all reported solutions are designed for binary classification. However, multi-class imbalanced learning problems appear frequently, and identifying the concept of each class in these problems is usually equally important. When multiple classes are present in an application domain, solutions proposed for binary classification problems may

not be directly applicable, or may achieve a lower performance than expected.

© Springer International Publishing Switzerland 2015
T. Cao et al. (Eds.): PAKDD 2015, Part I, LNAI 9077, pp. 251–263, 2015. DOI: 10.1007/978-3-319-18038-0_20

For example, solutions at the data level suffer from an increased search space, and solutions at the algorithm level become more complicated, since they must consider the small classes, whose corresponding concepts are difficult to learn. Additionally, learning from multiple classes itself implies a difficulty, since the boundaries among the classes may overlap, which degrades the learning performance.

There is much existing research on multi-class imbalance learning. However, most existing work transfers multi-class imbalance learning into binary learning using different class decomposition schemes and then applies existing binary imbalance learning solutions. These decomposition approaches help reuse the existing binary imbalance learning solutions, but they have their own shortcomings, which will be discussed in the related work in the next section. To overcome these shortcomings, in this paper we present a novel global multi-class imbalance learning approach that does not need to transfer the multi-class problem into binary ones. This approach is based on immune network theory and utilizes an aiNet model [3] to generate immune centroids for the clusters of each small class that have high data density; we call it global immune centroids over-sampling (denoted Global-IC). Specifically, our novel approach Global-IC resamples each small class by introducing immune centroids of the clusters of the examples belonging to that class. Our experimental results show that Global-IC achieves better performance compared with existing methods.

The rest of this paper is organized as follows. We review related work in

Section 2. Section 3 presents our proposed over-sampling method Global-IC.

Our experimental results and comparisons are shown in Section 4. Finally, we

conclude this paper in Section 5.

2  Related Work

As mentioned before, most existing solutions for multi-class imbalance classification problems use class decomposition schemes to convert a multi-class classification problem into multiple binary classification problems, and then apply binary imbalance techniques to each binary problem. For example, Tan et al. [4] used both one-vs-all (OVA) [2] and one-vs-one (OVO) [1] schemes to break a multi-class problem down into binary problems, and then built rule-based learners to improve the coverage of minority class examples. Zhao [20] used OVA to convert a multi-class problem into multiple binary problems, and then used under-sampling and SMOTE [5] techniques to overcome the imbalance issues. Liao [6] investigated a variety of over-sampling and under-sampling techniques with OVA for a weld flaw classification problem. Chen et al. [7] proposed an approach that used OVA to convert a multi-class classification problem into binary problems and then applied advanced resampling methods to rebalance the data of each binary problem. All these methods are based on multi-class decomposition, which oversimplifies the original multi-class problem. It is obvious that each individual classifier learned
