

A. Bernstein et al.

where V(Z) is an explicitly written q×p matrix and

$$h_{KNR}(Z) = \frac{1}{K(Z)} \sum_{j=1}^{n} K(Z, Z_j)\, h_j$$

is the standard kernel nonparametric regression estimator for h(Z) based on the preliminary values hj ∈ hn of the vectors h(Zj) at the sample points Zj, j = 1, 2, … , n. The value (5) minimizes the residual $\sum_{j=1}^{n} K(Z, Z_j)\,|Z_j - Z - H(Z)(h_j - h)|^2$ over h and gives the new solution for the Manifold embedding problem.
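The kernel regression term can be illustrated with a minimal sketch. It assumes a Gaussian ('heat') kernel with a hypothetical `bandwidth` parameter in place of the data-based kernel K(Z, Z′) of the paper; the function names are illustrative only.

```python
import numpy as np

def heat_kernel(z, samples, bandwidth):
    """Gaussian weights K(z, Z_j); this kernel choice is an assumption."""
    sq_dist = np.sum((samples - z) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def knr_estimate(z, samples, values, bandwidth=0.5):
    """Kernel-weighted mean: sum_j K(z, Z_j) h_j / sum_j K(z, Z_j).

    samples: (n, p) array of sample points Z_j
    values:  (n, q) array of preliminary values h_j
    """
    w = heat_kernel(z, samples, bandwidth)
    return w @ values / w.sum()
```

Because the estimate is a convex combination of the h_j, it reproduces constant values exactly and interpolates smoothly between neighbouring samples.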

3.5 Tangent Bundle Reconstruction Step

This step gives a new solution for the Manifold Reconstruction problem and
consists of several sequentially executed stages.

First, the data-based kernel k(y, y′) on the FS Yθ = h(M) is constructed to satisfy

the approximate equalities k(h(Z), h(Z′)) ≈ K(Z, Z′) for near points Z, Z′ ∈ M. The

linear space L*(y) ∈ Grass(p, q) depending on y ∈ Yθ which meets the condition

L*(h(Z)) ≈ LPCA(Z) is also constructed in this stage.

Then, a p×q matrix G(y) depending on y ∈ Yθ is constructed to meet the conditions G(h(Z)) ≈ H(Z). To do this, the p×q matrix G(y) for an arbitrary point y ∈ Yθ is chosen to minimize over G the quadratic form $\sum_{j=1}^{n} k(y, y_j)\,\|G - H(Z_j)\|_F^2$ under the constraint Span(G(y)) = L*(y). A solution to this problem is given by the formula

$$G(y) = \pi^*(y) \times \frac{1}{k(y)} \sum_{j=1}^{n} k(y, y_j)\, H(Z_j), \qquad (6)$$

where π*(y) is the projector onto the linear space L*(y) and $k(y) = \sum_{j=1}^{n} k(y, y_j)$.
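As a rough illustration of formula (6), the sketch below averages the sample matrices with given kernel weights and then projects onto a subspace. The inputs `weights`, `H_samples` and `basis` are hypothetical stand-ins for k(y, yj), the matrices H at the sample points, and any basis of L*(y); how they are obtained is outside this sketch.

```python
import numpy as np

def tangent_matrix(weights, H_samples, basis):
    """Projector onto span(basis) applied to the kernel-weighted mean of H_j.

    weights:   (n,) kernel weights k(y, y_j)
    H_samples: (n, p, q) stack of p x q matrices H_j
    basis:     (p, q) matrix whose columns span L*(y)
    """
    Q, _ = np.linalg.qr(basis)            # orthonormal basis of L*(y)
    projector = Q @ Q.T                   # pi*(y), a p x p projector
    H_mean = np.tensordot(weights, H_samples, axes=1) / weights.sum()
    return projector @ H_mean
```

Applying the projector a second time leaves the result unchanged, which is a quick sanity check that the columns of G(y) lie in L*(y).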

Finally, the mapping g is built to meet the conditions g(h(Z)) ≈ Z and Jg(y) = G(y), which provide the manifold and tangent proximities, respectively. To do this, the vector g(y) for an arbitrary point y ∈ Yθ is chosen to minimize over g the quadratic form $\sum_{j=1}^{n} k(y, y_j)\,|Z_j - g - G(y)(y_j - y)|^2$. The solution to this problem is

$$g(y) = \frac{1}{k(y)} \sum_{j=1}^{n} k(y, y_j)\, Z_j + G(y) \times \left( y - \frac{1}{k(y)} \sum_{j=1}^{n} k(y, y_j)\, y_j \right), \qquad (7)$$

where the first term is the standard kernel nonparametric regression estimator for g(y) based
on the values Zj of the vector g(y) at the sample feature points yj, j = 1, 2, … , n.
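A minimal sketch of the reconstruction formula (7) follows, assuming a Gaussian kernel on the feature space with a hypothetical `bandwidth` (the paper constructs its own data-based kernel k) and assuming the matrix G(y) is supplied by the caller.

```python
import numpy as np

def reconstruct(y, feats, points, G_y, bandwidth=0.5):
    """Kernel regression term plus first-order correction, as in (7).

    y:      (q,) query feature point
    feats:  (n, q) sample feature points y_j
    points: (n, p) sample points Z_j
    G_y:    (p, q) matrix G(y), assumed given
    """
    w = np.exp(-np.sum((feats - y) ** 2, axis=1) / (2.0 * bandwidth ** 2))
    w = w / w.sum()                        # normalized weights k(y, y_j) / k(y)
    return w @ points + G_y @ (y - w @ feats)
```

The correction term cancels the bias of the plain kernel average on linear data: if Zj = A yj + b and G(y) = A, the function returns A y + b exactly, whatever the weights.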

4 Solution of the Regression Task

Consider the TBML problem for the unknown manifold M = M(f) (1) considered as

the data manifold in which the data set Zn ⊂ M (3) is the sample from the manifold.

Applying the GSE to this problem, we get the embedding mapping h(Z) (5) from the

DM M to the FS Y = h(M), the reconstruction mapping g(y) (7) from the FS Y to the

RP and its Jacobian G(y) = Jg(y) (6).

Manifold Learning in Regression Tasks

A splitting of the p-dimensional vector $Z = \begin{pmatrix} Z_x \\ Z_u \end{pmatrix}$ into the q-dimensional vector Zx and the m-dimensional vector Zu implies the corresponding partitions

$$g(y) = \begin{pmatrix} g_x(y) \\ g_u(y) \end{pmatrix} \quad \text{and} \quad G(y) = \begin{pmatrix} G_x(y) \\ G_u(y) \end{pmatrix}$$

of the vector g(y) ∈ Rp and the p×q matrix G(y); the q×q and m×q matrices Gx(y) and Gu(y) are the Jacobian matrices of the mappings gx(y) and gu(y).

From the representation $Z = \begin{pmatrix} x \\ f(x) \end{pmatrix}$ it follows that the mapping h*(Z) = Zx = x determines a parameterization of the DM. Thus, we have two parameterizations of the DM: the GSE-parameterization y = h(Z) (5) and the parameterization x = h*(Z), which are linked together by an unknown one-to-one ‘reparameterization’ mapping y = ϕ(x). After constructing the mapping ϕ(x) from the sample, the learned function

$$f_{MLR}(x) = g_u(\phi(x)) \qquad (8)$$

is chosen as an MLR-solution of the regression task ensuring the proximity fMLR(x) ≈ f(x); the matrix $J_{f,MLR}(x) = G_u(\phi(x)) \times G_x^{-1}(\phi(x))$ is an estimator of the Jacobian Jf(x).

The simplest way to find ϕ(x) is to choose the mapping y = ϕ(x) as a solution of the equation gx(y) = x; the Jacobian matrix Jϕ(x) of this solution is $(G_x(\phi(x)))^{-1}$. By construction, the values ϕ(xj) = yj and $J_\phi(x_j) = G_x^{-1}(y_j)$ are known at the input points {xj, j = 1, 2, … , n}; the value ϕ(x) for an arbitrary point x ∈ X is chosen to minimize over y the quadratic form $\sum_{j=1}^{n} K_E(x, x_j)\,|y - y_j - G_x^{-1}(y_j)(x - x_j)|^2$, where KE(x, x′) is the ‘heat’ Euclidean kernel in $\mathbb{R}^q$. The solution of this problem is

$$\phi(x) = \frac{1}{K_E(x)} \sum_{j=1}^{n} K_E(x, x_j) \left\{ y_j + G_x^{-1}(y_j)(x - x_j) \right\}, \qquad K_E(x) = \sum_{j=1}^{n} K_E(x, x_j).$$

Denote $y_j^* = y_j + \Delta_j$, where Δj is a correction term; then $g(y_j^*) \approx g(y_j) + G(y_j) \times \Delta_j$. This term is chosen to minimize the principal term $|g(y_j) + G(y_j) \times \Delta_j - Z_j|^2$ in the squared error $|g(y_j^*) - Z_j|^2$, and $\Delta_j = G^-(y_j) \times (Z_j - g(y_j))$ is the standard least squares solution of this minimization problem; here $G^-(y_j) = (G^T(y_j) \times G(y_j))^{-1} \times G^T(y_j)$.
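The least squares correction can be sketched in a few lines. The function name and argument names are illustrative; the normal equations are solved directly rather than forming the inverse explicitly, which assumes G has full column rank.

```python
import numpy as np

def correction(G, Z, g_val):
    """Least squares solution Delta = (G^T G)^{-1} G^T (Z - g).

    G:     (p, q) matrix G(y_j), assumed to have full column rank
    Z:     (p,) sample point Z_j
    g_val: (p,) reconstructed value g(y_j)
    """
    return np.linalg.solve(G.T @ G, G.T @ (Z - g_val))
```

When the residual Z − g lies exactly in the column space of G, the correction recovers the generating coefficients; otherwise it gives the coefficients of the orthogonal projection of the residual onto that space.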

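Another version of the ‘reparameterization’ mapping,

$$\phi(x) = \frac{1}{K_E(x)} \sum_{j=1}^{n} K_E(x, x_j) \left\{ y_j + G^-(y_j)\,(Z_j - g(y_j)) + G_x^{-1}(y_j)(x - x_j) \right\},$$

is determined by the choice of the points $y_j^* = y_j + \Delta_j$ as the values of the function ϕ(x) at the
input points xj, j = 1, 2, … , n.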

5 Results of Numerical Experiments

The function $f(x) = \sin(30 (x - 0.9)^4) \times \cos(2(x - 0.9)) + (x - 0.9)/2$, x ∈ [0, 1] (Fig. 1(a)), which was used in [34] to demonstrate a drawback of the kernel nonparametric
regression (kriging) estimator with stationary kernel (sKNR) fsKNR, was selected to
compare the estimator fsKNR and the proposed MLR estimator fMLR (8). The kernel
bandwidths were optimized for both methods.


Fig. 1. Reconstruction of the function (a) by the sKNR-estimator (b) and the MLR-estimator (c). Panels: (a) original function; (b) sKNR-estimator; (c) MLR-estimator.

The same training data set consisting of n = 100 randomly and uniformly distributed points on the interval [0, 1] was used for constructing the estimators fsKNR and
fMLR (Fig. 1(b) and Fig. 1(c)). The mean squared errors MSEsKNR = 0.0024 and
MSEMLR = 0.0014 were calculated for both estimators on the uniform grid consisting
of 1001 points on the interval [0, 1]. Visual comparison of the graphs shows that the
proposed MLR-method constructs an essentially smoother estimator for the original
function.
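The experimental protocol can be sketched as follows. This is only an outline under stated assumptions: a plain Nadaraya-Watson estimator with a Gaussian kernel stands in for the sKNR estimator, the bandwidth 0.02 and the random seed are arbitrary and not tuned, and the MLR estimator itself is not implemented, so the printed error is not expected to match the values reported above.

```python
import numpy as np

def f(x):
    """Test function of this section."""
    return np.sin(30.0 * (x - 0.9) ** 4) * np.cos(2.0 * (x - 0.9)) + (x - 0.9) / 2.0

def nw_estimate(x_query, x_train, y_train, bandwidth):
    """Nadaraya-Watson estimate with a stationary Gaussian kernel (assumed stand-in)."""
    diff = x_query[:, None] - x_train[None, :]
    w = np.exp(-0.5 * (diff / bandwidth) ** 2)
    return (w @ y_train) / w.sum(axis=1)

rng = np.random.default_rng(0)                 # arbitrary seed
x_train = rng.uniform(0.0, 1.0, size=100)      # n = 100 uniform random points
y_train = f(x_train)
grid = np.linspace(0.0, 1.0, 1001)             # uniform grid of 1001 points
y_hat = nw_estimate(grid, x_train, y_train, bandwidth=0.02)  # bandwidth not optimized
mse = float(np.mean((y_hat - f(grid)) ** 2))
print(f"MSE of the stand-in estimator: {mse:.4f}")
```

A single stationary bandwidth has to serve both the rapidly oscillating region near x = 0 and the flat region near x = 0.9, which is the drawback the test function was designed to expose.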

Acknowledgments. The study was performed in the IITP RAS exclusively by the grant from
the Russian Science Foundation (project № 14-50-00150).

References

1. Vapnik, V.: Statistical Learning Theory. John Wiley, New-York (1998)

2. James, G., Witten, D., Hastie, T., Tibshirani, R.: An Introduction to Statistical Learning with Applications in R. Springer Texts in Statistics, New-York

3. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Springer (2009)

4. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2007)

5. Deng, L., Yu, D.: Deep Learning: Methods and Applications. NOW Publishers, Boston (2014)

6. Breiman, L.: Random Forests. Machine Learning 45(1), 5–32 (2001)

7. Friedman, J.H.: Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics 29(5), 1189–1232 (2001)

8. Rasmussen, C.E., Williams, C.: Gaussian Processes for Machine Learning. MIT Press, Cambridge (2006)

9. Belyaev, M., Burnaev, E., Kapushev, Y.: Gaussian process regression for structured data sets. To appear in Proceedings of the SLDS 2015, London, England, UK (2015)

10. Burnaev E., Panov M.: Adaptive design of experiments based on gaussian processes. To appear in Proceedings of the SLDS 2015, London, England, UK (2015)

11. Loader, C.: Local Regression and Likelihood. Springer, New York (1999)

12. Vejdemo-Johansson, M.: Persistent homology and the structure of data. In: Topological Methods for Machine Learning, an ICML 2014 Workshop, Beijing, China, June 25 (2014). http://topology.cs.wisc.edu/MVJ1.pdf

13. Carlsson, G.: Topology and Data. Bull. Amer. Math. Soc. 46, 255–308 (2009)


14. Edelsbrunner, H., Harer, J.: Computational Topology: An Introduction. Amer. Mathematical Society (2010)

15. Cayton, L.: Algorithms for manifold learning. Univ of California at San Diego (UCSD),

Technical Report CS2008-0923, pp. 541-555. Citeseer (2005)

16. Huo, X., Ni, X., Smith, A.K.: Survey of manifold-based learning methods. In: Liao, T.W.,

Triantaphyllou, E. (eds.) Recent Advances in Data Mining of Enterprise Data,

pp. 691–745. World Scientific, Singapore (2007)

17. Ma, Y., Fu, Y. (eds.): Manifold Learning Theory and Applications. CRC Press, London

(2011)

18. Bernstein, A.V., Kuleshov, A.P.: Tangent bundle manifold learning via grassmann&stiefel

eigenmaps. In: arXiv:1212.6031v1 [cs.LG], pp. 1-25, December 2012

19. Bernstein, A.V., Kuleshov, A.P.: Manifold Learning: generalizing ability and tangent

proximity. International Journal of Software and Informatics 7(3), 359–390 (2013)

20. Kuleshov, A., Bernstein, A.: Manifold learning in data mining tasks. In: Perner, P. (ed.)

MLDM 2014. LNCS, vol. 8556, pp. 119–133. Springer, Heidelberg (2014)

21. Kuleshov, A., Bernstein, A., Yanovich, Yu.: Asymptotically optimal method in Manifold

estimation. In: Márkus, L., Prokaj, V. (eds.) Abstracts of the XXIX-th European Meeting

of Statisticians, July 20-25, Budapest, p. 325 (2013)

22. Genovese, C.R., Perone-Pacifico, M., Verdinelli, I., Wasserman, L.: Minimax Manifold
Estimation. Journal of Machine Learning Research 13, 1263–1291 (2012)

23. Kuleshov, A.P., Bernstein, A.V.: Cognitive Technologies in Adaptive Models of Complex

Plants. Information Control Problems in Manufacturing 13(1), 1441–1452 (2009)

24. Bunte, K., Biehl, M., Hammer B.: Dimensionality reduction mappings. In: Proceedings of

the IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2011), pp.

349-356. IEEE, Paris (2011)

25. Lee, J.A., Verleysen, M.: Quality assessment of dimensionality reduction: Rank-based criteria. Neurocomputing 72(7–9), 1431–1443 (2009)

26. Saul, L.K., Roweis, S.T.: Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research 4, 119–155 (2003)

27. Saul, L.K., Roweis, S.T.: Nonlinear dimensionality reduction by locally linear embedding.

Science 290, 2323–2326 (2000)

28. Zhang, Z., Zha, H.: Principal Manifolds and Nonlinear Dimension Reduction via Local

Tangent Space Alignment. SIAM Journal on Scientific Computing 26(1), 313–338 (2005)

29. Hamm, J., Lee, D.D.: Grassmann discriminant analysis: A unifying view on subspace-based learning. In: Proceedings of the 25th International Conference on Machine Learning
(ICML 2008), pp. 376–383 (2008)

30. Tyagi, H., Vural, E., Frossard, P.: Tangent space estimation for smooth embeddings of

riemannian manifold. In: arXiv:1208.1065v2 [stat.CO], pp. 1-35, May 17 (2013)

31. Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15, 1373–1396 (2003)

32. Bengio, Y., Monperrus, M.: Non-local manifold tangent learning. In: Advances in Neural

Information Processing Systems, vol. 17, pp. 129-136. MIT Press, Cambridge (2005)

33. Dollár, P., Rabaud, V., Belongie, S.: Learning to traverse image manifolds. In: Advances

in Neural Information Processing Systems, vol. 19, pp. 361-368. MIT Press, Cambridge

(2007)

34. Xiong, Y., Chen, W., Apley, D., Ding, X.: A Nonstationary Covariance-Based Kriging

Method for Metamodeling in Engineering Design. International Journal for Numerical Methods in Engineering 71(6), 733–756 (2007)

Random Projection Towards the Baire Metric

for High Dimensional Clustering

Fionn Murtagh¹ (✉) and Pedro Contreras²

¹ Goldsmiths University of London, London, UK
fmurtagh@acm.org

² Thinking Safe Ltd., Egham, UK
pedro.contreras@acm.org

Abstract. For high dimensional clustering and proximity finding, also

referred to as high dimension and low sample size data, we use random

projection with the following principle. With the greater probability of

close-to-orthogonal projections, compared to orthogonal projections, we

can use rank order sensitivity of projected values. Our Baire metric,

divisive hierarchical clustering, is of linear computation time.

Keywords: Big data · Ultrametric topology · Hierarchical clustering · Binary rooted tree · Computational complexity

1 Introduction

In [18], we provide background on (1) taking high dimensional data into a consensus random projection, and then (2) endowing the projected values with the Baire

metric, which is simultaneously an ultrametric. The resulting regular 10-way tree

is a divisive hierarchical clustering. Any hierarchical clustering can be considered

as an ultrametric topology on the objects that are clustered.

In [18], we describe the context for the use of the following data. 34,352 proposal details related to awards made by a research funding agency were indexed

in Apache Solr, and MLT (“more like this”) scores were generated by Solr for the

top 100 matching proposals. A selection of 10,317 of these proposals constituted

the set that was studied.

Using a regular 10-way tree, Figure 1 shows the hierarchy produced, with

nodes colour-coded (a rainbow 10-colour lookup table was used), and with the

root (a single colour, were it shown), comprising all clusters, to the bottom. The

terminals of the 8-level tree are at the top.

The first Baire layer of clusters, displayed as the bottom level in Figure 1,

was found to have 10 clusters (8 of which are very evident, visually.) The next

Baire layer has 87 clusters (the maximum possible for this 10-way tree is 100),

and the third Baire layer has 671 clusters (maximum possible: 1000).

In this article we look further at the use of random projection which empowers

this linear hierarchical clustering in very high dimensional spaces.

© Springer International Publishing Switzerland 2015
A. Gammerman et al. (Eds.): SLDS 2015, LNAI 9047, pp. 424–431, 2015.
DOI: 10.1007/978-3-319-17091-6_37

Baire Metric for High Dimensional Clustering

425

Fig. 1. The mean projection vector of 99 random projection vectors is used. Abscissa:

the 10118 (non-empty) documents are sorted (by random projection value). Ordinate:

each of 8 digits comprising random projection values. A rainbow colour coding is used

for display.

1.1 Random Projection in Order to Cluster High Dimensional Data

In [3] we show with a range of applications how we can (1) construct a Baire

hierarchy from our data, and (2) use this Baire hierarchy for data clustering. In

astronomy using Sloan Digital Sky Survey data, the aim was nearest neighbour

regression such that photometric redshift could be mapped into spectrometric

redshift. In chemoinformatics using chemical compound data, we first addressed

the problem of high dimensional data by mapping onto a one-dimensional axis.

When this axis was a consensus of rankings, we found this approach, experimentally, to be stable and robust.

We continue our study of this algorithm pipeline here. Our focus is on the
random projection approach, which we use. In this article, having first reviewed
the theory of random projection, we note that the conventional random projection approach is not what we use. In brief, the conventional random projection
approach is used to find a subspace of dimension greater than 1. In the conventional approach, the aim is that proximity relations be respected in that low
dimensional fit to the given cloud of points. In our random projection approach,
we seek a consensus 1-dimensional mapping of the data, that represents relative
proximity. We use the mean projection vector as this consensus.

Since we then induce a hierarchy on our 1-dimensional data, we are relying

on the known fact that a hierarchy can be perfectly scaled in one dimension. See

[11] for the demonstration of this. Such a 1-dimensional scaling of a hierarchy

is not unique. This 1-dimensional scaling of our given cloud of points is what

will give us our Baire hierarchy. The mean of 1-dimensional random projections

is what we use, and we then endow that with the Baire hierarchy. We do not

require uniqueness of the mean 1-dimensional random projection.

Relative to what has been termed as conventional random projection (e.g.,

in [13]), our use of random projection is therefore non-conventional.

2 Dimensionality Reduction by Random Projection

It is a well known fact that traditional clustering methods do not scale well in
very high dimensional spaces. A standard and widely used approach when dealing
with high dimensionality is to apply a dimensionality reduction technique. This
consists of finding a mapping F relating the input data from the space $\mathbb{R}^d$ to a
lower-dimension feature space $\mathbb{R}^k$. We can denote this as follows:

$$F(x) : \mathbb{R}^d \to \mathbb{R}^k \qquad (1)$$

A statistically optimal way of reducing dimensionality is to project the data
onto a lower dimensional orthogonal subspace. Principal Component Analysis
(PCA) is a very widely used way to do this. It uses a linear transformation to
form a simplified dataset retaining the characteristics of the original data. PCA
does this by means of choosing the attributes that best preserve the variance of
the data. This is a good solution when the data allows these calculations, but
PCA as well as other dimensionality reduction techniques remain computationally expensive. Eigenreduction is of cubic computational complexity, where, due
to the dual spaces, this is power 3 in the minimum of the size of the observation
set or of the attribute set.

Conventional random projection [2,5–8,13,14,23] is the finding of a low
dimensional embedding of a point set, such that the distortion of any pair of
points is bounded by a function of the lower dimensionality.

The theoretical support for random projection can be found in the Johnson-Lindenstrauss Lemma [9]. It states that a set of points in
a high dimensional Euclidean space can be projected into a low dimensional
Euclidean space such that the squared distance between any two points changes by a
factor between 1 − ε and 1 + ε, where ε ∈ (0, 1).

Lemma 1. For any 0 < ε < 1 and any integer n, let k be a positive integer such
that

$$k \ge 4\,(\varepsilon^2/2 - \varepsilon^3/3)^{-1} \ln n. \qquad (2)$$

Then for any set V of n points in $\mathbb{R}^d$, there is a map $f : \mathbb{R}^d \to \mathbb{R}^k$ such that
for all u, v ∈ V,

$$(1 - \varepsilon)\,\|u - v\|^2 \le \|f(u) - f(v)\|^2 \le (1 + \varepsilon)\,\|u - v\|^2.$$

Furthermore, this map can be found in randomized polynomial time.
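The bound (2) is easy to evaluate numerically. The helper below is a small sketch (the function name is illustrative) that returns the smallest integer k satisfying the inequality for a given number of points and distortion ε.

```python
import math

def jl_min_dim(n_points, eps):
    """Smallest integer k with k >= 4 ln(n) / (eps^2/2 - eps^3/3)."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    return math.ceil(4.0 * math.log(n_points) / (eps ** 2 / 2.0 - eps ** 3 / 3.0))
```

The bound depends only on the number of points and on ε, not on the original dimension d, which is what makes random projection attractive for very high dimensional data.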

The original proof of this lemma was further simplified by Dasgupta and
Gupta [4]; see also Achlioptas [1] and Vempala [23].
We have mentioned that the optimal way to reduce dimensionality is to have
orthogonal vectors in R. However, to map the given vectors into an orthogonal
basis is computationally expensive. Vectors having random directions might be
sufficiently close to orthogonal. Additionally this helps solve the problem of
data sparsity in high dimensional spaces, as discussed in [19].

Thus, in random projection the original d-dimensional data is projected to a
k-dimensional subspace (k ≪ d), using a random k × d matrix R. Mathematically
this can be described as follows:

$$X^{RP}_{k \times N} = R_{k \times d}\, X_{d \times N} \qquad (3)$$

where $X_{d \times N}$ is the original set with d-dimensionality and N observations.

Computationally speaking random projection is simple. Forming the random

matrix R and projecting the d × N data matrix X into the k dimensions is of

order O(dkN ). If X is sparse with c nonzero entries per column, the complexity

is of order O(ckN ). (This is so, if we use an index of the nonzero entries in X.)
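A minimal dense sketch of the projection in (3) is given below. It assumes a standard Gaussian matrix R and adds a 1/√k scaling so that squared distances are preserved in expectation; both choices, like the function name and the `seed` parameter, are assumptions made for illustration.

```python
import numpy as np

def random_project(X, k, seed=0):
    """Project the d x N data matrix X to k dimensions with a Gaussian matrix.

    Returns (1 / sqrt(k)) * R @ X, where R is k x d with iid N(0, 1) entries.
    The dense product costs O(d k N) operations.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[0]
    R = rng.standard_normal((k, d))
    return (R @ X) / np.sqrt(k)
```

For sparse data the same product would be formed from the nonzero entries only, which gives the O(ckN) cost mentioned above.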

3 Random Projection

Random mapping of a high dimensional cloud of points into a lower dimensional
subspace, with very little distortion of (dis)similarity, is described in [12]. Kaski in
this work finds that a 100-dimensional random mapping of 6000-dimensional data is
“almost as good”. The background to this is as follows. Consider points $g, h \in \mathbb{R}^\nu$,
i.e. ν-dimensional, that will be mapped, respectively, onto $x, y \in \mathbb{R}^m$, by means
of a random linear transformation with the following characteristics. We will have
m ≪ ν. Random matrix, R, is of dimensions m × ν. Each column, $r_j \in \mathbb{R}^m$,
will be required to be of unit norm: $\|r_j\|_2 = 1\ \forall j$. We have x = Rg, and we
have $x = \sum_{j=1}^{\nu} g_j r_j$, where $g_j$ is the jth coordinate of $g \in \mathbb{R}^\nu$, in this mapping
of $\mathbb{R}^\nu \to \mathbb{R}^m$. While by construction, the space $\mathbb{R}^\nu$ has an orthonormal coordinate
system, there is no guarantee that all $x, y \in \mathbb{R}^m$ that are mapped in this way are in
an orthonormal coordinate system. But (Kaski cites Hecht-Nielsen), the number of
almost orthogonal directions in a coordinate system, that is determined at random in
a high dimensional space, is very much greater than the number of orthogonal directions. Consider the case of the $r_j$ being orthonormal. Then X = RG for target
vectors in matrix X, and source vectors in matrix G. Therefore $R^T X = R^T R G = G$
(where $R^T$ is R transpose), and hence $X^T X = G^T R^T R G = G^T G$. Since invariance
of Euclidean distance holds in the source and in the target spaces, this is Parseval’s
relation. (For the Parseval invariance relation, see e.g. [20], p. 45.) Since the $r_j$
will not form an orthonormal system, we look instead, [12], at the effect of the linear transformation, given by R, on similarities (between source space vectors and
between target space vectors).

For $g, h \in \mathbb{R}^\nu$ mapped onto $x, y \in \mathbb{R}^m$, consider the scalar product: $x^T y = (Rg)^T (Rh) = g^T R^T R h$. Looking at the closeness to orthonormal column vectors in
R makes us express: $R^T R = I + \epsilon$, where $\epsilon_{ij} = r_i^T r_j$ for i ≠ j, and $\epsilon_{ii} = 0$ because the $r_i$
are of unit norm. The diagonal of $R^T R$ is the identity: $\mathrm{diag}(R^T R) = \{\|r_i\|^2, 1 \le i \le \nu\} = I$. Now, take components of each $r_j$, call them $r_{jl}$, 1 ≤ l ≤ m. Our
initial choice of random R used (in its vector components) $r_{jl}$ that are iid zero-mean Gaussian. Then normalization implies (vectors): $r_j \sim$ iid N(0, 1).
It follows that the orientation is uniformly distributed. We have $E[\epsilon_{ij}] = 0\ \forall i, j$,
where E is the expected value. Recalling that $\epsilon_{ij} = r_i^T r_j$ and $r_i, r_j$ are of unit
norm, therefore $\epsilon_{ij}$ is the correlation between the two vectors.

If the dimensionality m is sufficiently large, then $\epsilon_{ij} \sim N(0, \sigma^2)$, with $\sigma^2 \approx 1/m$. This comes from a result (stated in [12] to be due to Fisher), that $\frac{1}{2} \ln \frac{1 + \epsilon_{ij}}{1 - \epsilon_{ij}}$ is normally distributed with variance 1/(m − 3) if m is the number of samples in
the estimate. If the foregoing is linearized around 0, then $\sigma^2 \approx 1/m$ for large m.
Further discussion is provided by Kaski [12].
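The claim that the off-diagonal entries of the Gram matrix of the columns have mean 0 and variance close to 1/m can be checked empirically. The sketch below draws Gaussian columns, normalizes them to unit norm, and returns the inner products of all distinct column pairs; the function name and `seed` parameter are illustrative.

```python
import numpy as np

def offdiag_inner_products(m, nu, seed=0):
    """Inner products r_i^T r_j (i < j) of nu random unit-norm columns in R^m."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((m, nu))
    R /= np.linalg.norm(R, axis=0)        # unit-norm columns r_j
    gram = R.T @ R                        # I plus the off-diagonal deviations
    return gram[np.triu_indices(nu, k=1)]

eps = offdiag_inner_products(m=100, nu=300)
print(eps.mean(), eps.var() * 100)        # compare with 0 and with m * (1/m) = 1
```

Increasing m shrinks the spread of these inner products, which is the sense in which random directions become nearly orthogonal in high dimensions.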

In summary, the random matrix values are Gaussian distributed, and the
column vectors are normalized. This makes the column vectors Gaussian
vectors with zero mean and unit standard deviation. Distortion of the variances/covariances, relative to orthogonality of these random projections, has
approximate variance 2/m, where m is the dimension of the random projection subspace (or the
number of random projections). With a sufficient number of random projections,
i.e. a sufficiently high dimensionality of the target subspace, the distortion variance becomes small. In that case, we have close to orthogonal unit norm vectors
being mapped into orthogonal unit norm vectors in a much reduced dimensionality subspace.

In [13], the use of column components rj in R that are Gaussian is termed

the case of conventional random projections. This includes rj in R being iid

with 0 mean and constant variance, or just iid Gaussian with 0 mean. The latter

is pointed to as the only necessary condition for preserving pairwise distances.

There is further discussion in [13] of 0 mean, unit variance, and fourth moment

equal to 3. It is however acknowledged that “a uniform distribution is easier to
generate than normals, but the analysis is more difficult”. In taking further the
work of [1], sparse random projections are used: the elements of R are chosen from
{−1, 0, 1} with different (symmetric in sign) probabilities. In further development
of this, very sparse, sign random projection is studied by [13].

While we also use a large number of random projections, and thereby we also
are mapping into a random projection subspace, nonetheless our methodology
(in [3,17], and used as the entry point in the work described in this article) is
different. We will now describe how and why our methodology is different.

4 Implementation of Algorithm

In our implementation, (i) we take one random projection axis at a time. (ii) By

means of maximum value of the projection vector (10317 projected values on a

random axis), we rescale so that projection values are in the closed/open interval,

[0, 1). This we do to avoid having a single projection value equal to 1. (iii) We

cumulatively add these rescaled projection vectors. (iv) We take the mean vector

of the, individually rescaled, projection vectors. That mean vector then is what

we use to endow it with the Baire metric. Now consider our processing pipeline,

as just described, in the following terms.

1. Take a cloud of 10317 points in a 34352-dimensional space. (This sparse

matrix has density 0.285%; the maximum value is 3.218811, and the minimum value is 0.)

2. Our linear transformation, R, maps these 10317 points into a 99-dimensional

space. R consists of uniformly distributed random values (and the column

vectors of R are not normalized).

3. The projections are rescaled to be between 0 and 1 on these new axes. I.e.

projections are in the (closed/open) interval [0, 1).

4. By the central limit theorem, and by the concentration (data piling) effect of
high dimensions [10,21], we have as dimension m → ∞: pairwise distances
become equidistant; orientation tends to be uniformly distributed. We find
also: the norms of the target space axes are Gaussian distributed; and as
typifies sparsified data, the norms of the 10317 points in the 99-dimensional
target space are distributed as a negative exponential or a power law.

5. We find: (i) correlation between any pair of our random projections is greater
than 0.98894, and most are greater than 0.99; (ii) correlation between the first
principal component loadings and our mean random projection is 0.9999996;
and the correlation between the first principal component loadings and each
of our input random projections is greater than 0.99; (iii) correlations between
the second and subsequent principal component loadings are close to 0.
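The processing steps above can be sketched as follows. This is an illustration under assumptions: the data matrix is dense and nonnegative with at least one nonzero entry, and each projection is divided by slightly more than its maximum (a hypothetical choice of rescaling) so that all values fall in [0, 1). The function name and `seed` parameter are not from the paper.

```python
import numpy as np

def mean_rescaled_projection(X, n_proj=99, seed=0):
    """Mean of individually rescaled 1-D random projections.

    X: (n_points, d) nonnegative data matrix (assumption).
    Each axis has uniformly distributed, unnormalized coefficients.
    """
    rng = np.random.default_rng(seed)
    total = np.zeros(X.shape[0])
    for _ in range(n_proj):
        r = rng.uniform(size=X.shape[1])           # one random axis at a time
        p = X @ r                                  # projected values on this axis
        total += p / (p.max() * (1.0 + 1e-9))      # rescale into [0, 1) (assumed rule)
    return total / n_proj                          # mean projection vector
```

The returned vector plays the role of the consensus 1-dimensional mapping, whose decimal digits are then read off to obtain the Baire hierarchy.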

In summary, we have the following. We do not impose unit norm on the
column vectors of our random linear mapping, R. The norms of the initial coordinate system are distributed as negative exponential, and the linear mapping
into the subspace gives norms of the subspace coordinate system that are Gaussian distributed. We find very high correlation (0.99 and above, with just a few
instances of 0.9889 and above) between all of the following: pairs of projected
(through linear mapping with uniformly distributed values) vectors; projections
on the first, and only the first, principal component of the subspace; the mean
set of projections among sets of projections on all subspace axes. For computational convenience, we use the latter, the mean subspace set of projections for
endowing it with the Baire metric.

With reference to other important work in [12,21,22] which uses conventional
random projection, the following may be noted. Our objective is less to determine
or model cluster properties as they are in very high dimensions, than it is to
extract useful analytics by “re-representing” the data. That is to say, we are
having our data coded (or encoded) in a different way. (In [15], discussion is
along the lines of alternatively encoded data being subject to the same general
analysis method. This is as compared to the viewpoint of having a new analysis
method developed for each variant of the data.)

Traditional approaches to clustering, [16], use pairwise distances, between

adjacent clusters of points; or clusters are formed by assigning to cluster centres.

A direct reading of a partition is the approach pursued here. Furthermore, we

determine these partitions level by level (of digit precision). The hierarchy, or

tree, results from our set of partitions. This is different from the traditional

(bottom-up, usually agglomerative) process where the sequence of partitions of

the data result from the hierarchy. See [3] for further discussion.

To summarize: in the traditional approach, the hierarchy is built, and then

the partition of interest is determined from it. In our new approach, a set of

partitions is built, and then the hierarchy is determined from them.
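Reading partitions level by level from digit precision can be sketched as below. The sketch assumes values in [0, 1) and truncates them to `depth` decimal digits; the function names are illustrative, and floating-point truncation of decimal digits is only approximate for values that are not exactly representable.

```python
import numpy as np

def baire_layers(values, depth=8):
    """Cluster labels per layer: layer l groups values sharing their first l digits."""
    codes = np.floor(np.asarray(values) * 10 ** depth).astype(np.int64)
    return [codes // 10 ** (depth - level) for level in range(1, depth + 1)]

def cluster_counts(values, depth=8):
    """Number of non-empty clusters in each layer of the regular 10-way tree."""
    return [len(np.unique(labels)) for labels in baire_layers(values, depth)]
```

Each layer refines the previous one because its labels share the earlier digits as a prefix, so the nested partitions define the tree without any pairwise distance computation; one pass over the values per layer suffices.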

5 Conclusions

We determine a hierarchy from a set of – random projection based – partitions.

As we have noted above, the traditional hierarchy forming process first determines the hierarchical data structure, and then derives the partitions from it.
One justification for our work is interest in big data analytics, and therefore

having a top-down, rather than bottom-up hierarchy formation process. Such

hierarchy construction processes can be also termed, respectively, divisive and

agglomerative.

In this article, we have described how our work has many innovative features.

References

1. Achlioptas, D.: Database-friendly random projections: Johnson-Lindenstrauss with

binary coins. Journal of Computer and System Sciences 66(4), 671–687 (2003)

2. Bingham, E., Mannila, H.: Random projection in dimensionality reduction: applications to image and text data. In: Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining, pp. 245–250. ACM, New York

(2001)

3. Contreras, P., Murtagh, F.: Fast, linear time hierarchical clustering using the Baire

metric. Journal of Classification 29, 118–143 (2012)

4. Dasgupta, S., Gupta, A.: An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms 22(1), 60–65 (2003)

5. Dasgupta, S.: Experiments with random projection. In: Proceedings of
the 16th Conference on Uncertainty in Artificial Intelligence, pp. 143–151.
Morgan Kaufmann, San Francisco (2000)

6. Deegalla, S., Boström, H.: Reducing high-dimensional data by principal component analysis vs. random projection for nearest neighbor classification. In: ICMLA
2006: Proceedings of the 5th International Conference on Machine Learning and
Applications, pp. 245–250. IEEE Computer Society, Washington DC (2006)
