
S.K. Gupta et al.

[Figure 1 appears here. Its panels show Hinton plots of task parameters and task relatedness over feature index × task index, performance (RMSE and explained variance R²) against subspace dimension (K), and the cost function against iteration number; see the caption below for panels (a)–(i).]

Fig. 1. Experimental results for synthetic data. (a) True task parameters; (b) task parameters estimated by MTRL [16]; (c) task relatedness estimated by MTRL, shown as a Hinton plot (green denotes positive values and red denotes negative values); (d) task parameters estimated by the proposed MR-MTL; (e) performance variation of MR-MTL w.r.t. subspace dimension; (f) convergence plot for the MR-MTL algorithm; (g)–(i) task relatedness (Ω1, Ω2, Ω3) estimated by MR-MTL for the features in the first, second and third basis respectively, shown as Hinton plots. The first basis is about features 1–3, the second basis is about features 4–6 and the third basis is about features 7–9.

Collaborating Diﬀerently on Diﬀerent Topics


group have the same parameters and are thus strongly correlated. Given these

tasks, our idea is to create multiple relationships across task groups by using

feature-dependent task relationships in various forms: positive relationship, negative relationship and no relationship. Figure 1 (a) depicts the simulated task

parameters (i.e. β) for all the tasks along 9 features. Along the ﬁrst three features, task group-2 and task group-3 are positively related but both are unrelated

to task group-1. Similarly, along the next three features, task group-1 and task

group-3 are positively related but both are unrelated to task group-2. Finally,

along the last three features, task group-1 and task group-2 are negatively related

but both are unrelated to task group-3. Given these task parameters, feature vectors are randomly drawn from a 9-dimensional multivariate Gaussian distribution as xti ∼ N(0, I). The corresponding target yti is randomly drawn as yti ∼ N(βtᵀxti, 0.1).
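The synthetic setup above can be sketched as follows. The group sizes (10 tasks per group), the number of samples per task, and the β magnitudes are illustrative assumptions, not values stated in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, n = 30, 9, 100   # tasks, features, samples per task (n = 100 is an assumption)

# Three task groups of 10 tasks each (group sizes assumed); each group is
# active on one 3-feature block, mirroring the relations described above.
beta = np.zeros((T, D))
beta[10:20, 0:3] = 0.5   # features 1-3: groups 2 and 3 positively related,
beta[20:30, 0:3] = 0.5   #               group 1 unrelated (zero weights)
beta[0:10, 3:6] = 0.5    # features 4-6: groups 1 and 3 positively related
beta[20:30, 3:6] = 0.5
beta[0:10, 6:9] = 0.5    # features 7-9: groups 1 and 2 negatively related
beta[10:20, 6:9] = -0.5

X = rng.normal(size=(T, n, D))                       # x_ti ~ N(0, I)
noise = rng.normal(scale=np.sqrt(0.1), size=(T, n))  # 0.1 read as noise variance
y = np.einsum('tnd,td->tn', X, beta) + noise         # y_ti ~ N(beta_t^T x_ti, 0.1)
```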

We randomly split the synthetic dataset into two parts, using 70% of the instances for training and the remainder for testing. We run our proposed MR-MTL algorithm and compare its performance against a related baseline, MTRL, for illustration

purposes. Figure 1 (b) and (d) show the task parameters estimated by MTRL

and the proposed MR-MTL respectively. Clearly the task parameter estimates of

MR-MTL are much closer to the true task parameters (Figure 1 (a)). The better

estimates by MR-MTL can be explained by looking at the task relationships

learnt by both methods, shown as Hinton plots in Figure 1 (c) for MTRL and Figure 1 (g)–(i) for MR-MTL. The task relationship learnt by MTRL is averaged across all 9 features, overestimating the unrelatedness while underestimating the strong relatedness. In contrast, the proposed MR-MTL accurately estimates task relationships by using three separate feature groups (one feature group represented by each basis of the subspace, as we use K = 3), thus learning one task relationship matrix for each feature group.

This added flexibility gives MR-MTL a finer, differential level of control in the joint modeling of tasks along different features. We use the held-out test set to evaluate the performance of MR-MTL and compare it with MTRL in

Table 1. The reported results are averaged over 40 randomly generated datasets, along with the corresponding standard errors. As seen from the table, MR-MTL clearly outperforms both STL and MTRL with respect to two evaluation metrics: explained variance (R²) and root mean square error (RMSE). Due to the presence of different task relationships in the data, MTRL is unable to estimate the task relationships and thus performs worse than STL. The performance variation of MR-MTL with respect to subspace dimension (K) is shown in Figure 1 (e): the best performance is achieved at K = 3, and the performance degrades only slowly with increasing values of K. An example of the convergence behavior of the proposed MR-MTL is shown in Figure 1 (f); the algorithm converges quickly, within 50 iterations.
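The evaluation protocol above uses a 70/30 random split and two metrics, RMSE and explained variance (R²). A minimal sketch using the standard formulas (the authors' exact R² definition may differ in detail):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def explained_variance(y_true, y_pred):
    """R^2-style explained variance: 1 - Var(residuals) / Var(targets)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(1.0 - np.var(y_true - y_pred) / np.var(y_true))

def random_split(n_instances, train_frac, rng):
    """Random train/test index split, e.g. train_frac=0.7 for a 70/30 split."""
    perm = rng.permutation(n_instances)
    cut = int(train_frac * n_instances)
    return perm[:cut], perm[cut:]
```

Repeating `random_split` 40 times and averaging the two metrics reproduces the aggregation scheme used for the reported tables.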

3.2

Experiments with Real Data

We use the following classification and regression datasets.


Landmine Data (Classification): This dataset is created from radar images collected from 19 landmine fields. It is a benchmark dataset widely used for multi-task learning. Each data instance is a 9-dimensional representation of an image, formed by concatenating different image-based features. The task is to detect images with landmines. Treating each landmine field as a task, we jointly model them via multi-task learning. For each task we randomly split the data into two parts: 30% of the instances for training and the remainder for testing. The results are averaged over 40 training-test splits.

Acute Myocardial Infarction (AMI) Data (Classification): This dataset is collected from a hospital in Australia (ethics approval #12/83). It contains records of patients who visited the hospital during 2007-2011 with AMI as the primary reason for admission. The cohort is first divided into two main AMI types, STEMI and Non-STEMI, each of which is further divided into 4 subcohorts based on the major intervention administered (coronary artery bypass surgery, coronary artery stenting, other intervention, or no intervention at all), resulting in a total of 8 subcohorts. The task is to predict readmission within the first 30 days of discharge due to any heart-related medical emergency. Out of the original 8 subcohorts only 5 are chosen, as they have at least 2 positive examples per year. In the selected subcohorts, the total number of patients varied from 50 to 182 per year. The features used are patient demographics (gender, age, occupation) and health status in terms of Elixhauser comorbidities [17], aggregated over 3 time scales: 1 month, 3 months and 1 year prior to the AMI admission. Evaluation is performed progressively, with patients from 2009, 2010 and 2011 used for testing whilst all patient data from before the test year is used for training.

Computer Survey Data (Regression): This dataset [6] contains ratings of 20 computers by 190 students based on 13 binary features (cf. Figure 2). Each rating lies between 0 and 10, indicating the likelihood of buying a computer. We treat the ratings by each student as a task, thus having a total of 190 tasks. As these tasks are related, we jointly model them under the multi-task learning setting. Following [15], we use the first 15 computer ratings for training and test on the ratings of the last 5 computers.

SARCOS Data (Regression): This data relates to an inverse dynamics problem for a seven degrees-of-freedom SARCOS anthropomorphic robot arm. The task is to map from a 21-dimensional input space (7 joint positions, 7 joint velocities, 7 joint accelerations) to the corresponding 7 joint torques, giving rise to 7 mapping tasks. For this dataset, 100 random examples are sampled for training and another 400 are sampled randomly for testing; this demonstrates the efficacy of multi-task learning algorithms for small data. We average the performance over 40 random training-test splits.

Experimental Results. Table 2 presents a comparison of our proposed MR-MTL algorithm with STL and other baseline MTL algorithms on the Landmine dataset, in terms of both AUC and F1. The predictive performance of MR-MTL for different numbers of feature subsets (K=1, 2, and 3) is also reported. Clearly, MR-MTL with K=2 (AUC 0.775, F1 0.880) outperforms all other methods by a good margin. The closest performer is MTRL (AUC 0.760, F1 0.872), whilst the other methods trail further behind. The Landmine dataset contains tasks that can be broadly divided into two groups based on whether a task is a landmine detection problem in a foliated region or in a desert region. Interestingly, MR-MTL also found K=2 to be the best for this dataset.

Table 3 presents a similar comparison of performance on the AMI dataset. Predictive performance under three different training-test scenarios is presented. For all these settings, K=2 is found to give the best performance for MR-MTL. For the test year 2009, MR-MTL closely follows MTRL in terms of AUC and GMTL in terms of F1. For the two other test years, MR-MTL convincingly outperforms all other methods in terms of both AUC and F1. For both these scenarios, the AUC is above 0.6 and the F1 is above 0.75, whilst those of the other methods are much lower. MR-MTL also improves gradually as more training data becomes available for later test years, whilst the other methods behave erratically.

Table 2. Comparative AUC and F1 of MR-MTL against baseline methods on the Landmine dataset. Training and test splits are generated randomly, with 30% for training and the rest for testing. Averages over 40 such splits are reported; corresponding standard errors are given in brackets.

                 STL      MR-MTL                     MTRL     MTFL     GMTL
                          K=1      K=2      K=3
AUC (std err)    0.734    0.664    0.775    0.757    0.760    0.733    0.720
                 (0.002)  (0.007)  (0.003)  (0.002)  (0.001)  (0.002)  (0.002)
F1 (std err)     0.853    0.795    0.880    0.873    0.872    0.847    0.839
                 (0.013)  (0.012)  (0.008)  (0.008)  (0.008)  (0.012)  (0.014)

Table 3. Comparative AUC and F1 of MR-MTL on the AMI dataset against baseline methods. Testing is performed progressively at 2009, 2010, and 2011, with the corresponding past years' data used for training.

Training years  Test year  Measure  STL    MR-MTL (K=2)  MTRL   MTFL   GMTL
2007-08         2009       AUC      0.507  0.584         0.588  0.570  0.487
                           F1       0.517  0.568         0.518  0.452  0.613
2007-09         2010       AUC      0.558  0.606         0.521  0.539  0.552
                           F1       0.676  0.781         0.492  0.576  0.669
2007-10         2011       AUC      0.545  0.614         0.588  0.554  0.535
                           F1       0.683  0.826         0.502  0.723  0.599

Tables 4 and 5 present results on the two regression datasets, namely the Computer and SARCOS datasets. For the Computer dataset, MR-MTL with K=3 performs (RMSE


Table 4. Comparative RMSE and explained variance (R²) of MR-MTL on the Computer dataset against the baselines. Rating data from the first 15 computers are used for training and the remaining 5 for testing. MR-MTL is evaluated at three different numbers of latent bases (K=2, 3 and 4).

                           STL    MR-MTL               MTRL   MTFL   GMTL
                                  K=2    K=3    K=4
RMSE                       2.085  1.711  1.664  1.673  1.766  2.056  2.638
Explained Variance (R²)    0.238  0.309  0.318  0.317  0.291  0.220  0.160

Table 5. Comparative RMSE and explained variance (R²) of MR-MTL on the SARCOS dataset with respect to the baseline methods. 100 randomly selected data points are used for training and 1400 for testing. Average performance over 40 such random experiments is reported; respective standard errors are given in brackets.

                           STL      MR-MTL                     MTRL     MTFL     GMTL
                                    K=5      K=6      K=7
RMSE                       3.449    3.257    3.248    3.218    6.945    4.722    3.496
(std err)                  (0.025)  (0.019)  (0.017)  (0.018)  (0.032)  (0.030)  (0.025)
Explained Variance (R²)    0.823    0.798    0.818    0.829    0.379    0.640    0.821
(std err)                  (0.001)  (0.003)  (0.003)  (0.002)  (0.003)  (0.003)  (0.002)

1.664, R² 0.318) the best, followed by MTRL (RMSE 1.766, R² 0.291). All other baselines have higher RMSE values. To further illustrate the behavior of MR-MTL, we present the basis vectors corresponding to K=3 in Fig. 2 (a). The three basis vectors capture three different groupings of features. The first basis (U1) captures a positive preference for high performance (CPU speed, RAM size) along with a positive preference for having a CD-ROM. The second basis (U2) captures a positive preference for a CD-ROM, but a non-preference for higher CPU speed with larger cache. The third basis (U3) captures the price of the unit as a major factor. Fig. 2 (b) shows histograms of task relatedness along the different bases. It is interesting to note that task relatedness along U3, whose major factor is price, shows a higher prevalence of positive relatedness (the histogram for U3 is skewed to the positive side), implying that many raters weigh price similarly. This is intuitive, since price is always a major factor in consumer spending. The histogram for U1 has a high peak around zero, implying that preference for high performance and CD-ROM is more independent in nature. Conversely, the highest disagreement among the raters is observed along U2.

For the SARCOS dataset, MR-MTL with K=7 performs best (RMSE 3.198, R² 0.832), followed by GMTL (RMSE 3.349, R² 0.821). The other baseline methods have considerably higher RMSE and lower R² values. For this dataset the tasks are only weakly related; therefore, other MTL methods, which regularize strongly,

[Figure 2 appears here: (a) subspace basis matrix; (b) histogram of task relatedness.]

Fig. 2. Illustration of results of MR-MTL with K=3 on the Computer dataset. (a) Subspace basis matrix with bases U1, U2 and U3; only weights with absolute value greater than 0.1 are shown. (b) Histogram of task relatedness with respect to each basis.

performed worse, whereas MR-MTL with K=7 offers the right balance between flexibility and regularization, leading to better performance.

4

Conclusion

We have presented a novel multi-task learning framework that allows joint modeling of tasks based on multiple relationships between them, where each relationship is independently defined on a set of semantically related features. This helps in modeling scenarios where task-to-task relationships differ across feature sets, or where tasks have slightly different feature sets. To model multiple task relatednesses, we learn several feature subsets using a low-dimensional subspace and use a task covariance matrix to capture the task relationships (both positive and negative) along each feature subset. We formulate the model as an optimization problem and derive an efficient solution. Using both synthetic and real datasets, we demonstrate that the performance of the proposed model is better than that of several state-of-the-art multi-task learning algorithms.

References

1. Kang, Z., Grauman, K., Sha, F.: Learning with whom to share in multi-task feature learning. In: International Conference on Machine Learning, pp. 521–528 (2011)
2. Saha, B., Gupta, S., Phung, D., Venkatesh, S.: Multiple task transfer learning with small sample sizes. Knowledge and Information Systems (2014). doi:10.1007/s10115-015-0821-z
3. Xue, Y., Liao, X., Carin, L., Krishnapuram, B.: Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research 8, 35–63 (2007)
4. Zhou, J., Liu, J., Narayan, V.A., Ye, J.: Modeling disease progression via multi-task learning. NeuroImage 78, 233–248 (2013)
5. Lin, H., Baracos, V., Greiner, R., Chun-nam, Y.: Learning patient-specific cancer survival distributions as a sequence of dependent regressors. In: Advances in Neural Information Processing Systems, pp. 1845–1853 (2011)
6. Argyriou, A., Evgeniou, T., Pontil, M.: Convex multi-task feature learning. Machine Learning 73(3), 243–272 (2008)
7. Kumar, A., Daumé III, H.: Learning task grouping and overlap in multi-task learning. In: International Conference on Machine Learning (ICML) (2012)
8. Rai, P., Daumé, H.: Infinite predictor subspace models for multitask learning. In: International Conference on Artificial Intelligence and Statistics, pp. 613–620 (2010)
9. Evgeniou, T., Pontil, M.: Regularized multi-task learning. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 109–117. ACM (2004)
10. Evgeniou, T., Micchelli, C.A., Pontil, M.: Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 615–637 (2005)
11. Caruana, R.: Multitask learning. Machine Learning 28(1), 41–75 (1997)
12. Jacob, L., Vert, J.-P., Bach, F.R.: Clustered multi-task learning: a convex formulation. In: Advances in Neural Information Processing Systems, pp. 745–752 (2009)
13. Zhou, J., Chen, J., Ye, J.: Clustered multi-task learning via alternating structure optimization. In: Advances in Neural Information Processing Systems, pp. 702–710 (2011)
14. Passos, A., Rai, P., Wainer, J., Daumé, H.: Flexible modeling of latent task structures in multitask learning. In: International Conference on Machine Learning, pp. 1103–1110 (2012)
15. Gupta, S., Phung, D., Venkatesh, S.: Factorial multi-task learning: a Bayesian nonparametric approach. In: International Conference on Machine Learning, pp. 657–665 (2013)
16. Zhang, Y., Yeung, D.-Y.: A convex formulation for learning task relationships in multi-task learning. In: Uncertainty in Artificial Intelligence, pp. 733–742 (2010)
17. Elixhauser, A., Steiner, C., Harris, D.R., Coffey, R.M.: Comorbidity measures for use with administrative data. Medical Care 36(1), 8–27 (1998)

Multi-Task Metric Learning on Network Data

Chen Fang(B) and Daniel N. Rockmore

Computer Science Department, Dartmouth College, Hanover, NH 03755, USA

{chenfang,rockmore}@cs.dartmouth.edu

Abstract. Multi-task learning (MTL) has been shown to improve prediction performance in a number of diﬀerent contexts by learning models

jointly on multiple diﬀerent, but related tasks. In this paper, we propose

to do MTL on general network data, which provide an important context

for MTL. We ﬁrst show that MTL on network data is a common problem

that has many concrete and valuable applications. Then, we propose a

metric learning approach that can eﬀectively exploit correlation across

multiple tasks and networks. The proposed approach builds on structural

metric learning and intermediate parameterization, and has an efficient implementation via stochastic gradient descent. In experiments, we challenge it with two common real-world applications: citation prediction

for Wikipedia articles and social circle prediction in Google+. The proposed method achieves promising results and exhibits good convergence

behavior.

Keywords: Multi-task learning · Metric learning · Social network · Link

prediction

1

Introduction

Multi-task learning (MTL) [2,3,6,7,21] considers the problem of learning models

jointly and simultaneously over multiple, diﬀerent but related tasks. Compared

to single-task learning (STL), which learns a model for each task independently

using only task speciﬁc data, MTL leverages all available data and shares knowledge among tasks, thereby resulting in better model generalization and prediction performance. The underlying principle of MTL is that highly correlated

tasks can beneﬁt from each other via joint training, but additional care should

be taken to respect the distinct nature of each task, i.e., it is usually inappropriate to pool all available data and learn a single model for all tasks.

Despite the popularity and value of MTL, most MTL methods are developed

for tasks on i.i.d. data. Standard examples include phoneme recognition [14] and

image recognition [19]. Explicitly correlated data, often represented in the form of a network, is widely available: examples include social networks, citation networks and influence networks. Such data provides a rich source of new application contexts for MTL.

Due to the diversity and variation in networks (e.g., multi-relational links or

multi-category entities/nodes), various tasks can be performed and often a rich

correlation exists between them. In the following, we give two common scenarios

where there is abundant correlation between tasks and it is beneﬁcial to apply

© Springer International Publishing Switzerland 2015
T. Cao et al. (Eds.): PAKDD 2015, Part I, LNAI 9077, pp. 317–329, 2015.
DOI: 10.1007/978-3-319-18038-0_25


MTL to exploit it. (These scenarios are also the settings for the experiments

using real-world data that we present in Section 4).

Scenario 1: Article Citation Prediction

The citation prediction problem has been studied extensively [1,8–10,18]. People

either build a predictive model for a uniﬁed network [10] (i.e., a citation network

that contains papers across all subject areas) or build predictive models for each

area independently [16]. Since article content and citation pattern varies across

diﬀerent areas, the former methodology ignores the diﬀerence between areas.

However, some areas, while labeled as diﬀerent are still related, in the sense of

both content and citation pattern. Thus the latter methodology fails to exploit

the correlation among subject areas. For example, computer science and electrical engineering articles may be classiﬁed or tagged as diﬀerent areas, but in

many cases they may still have much in common, or at least have signiﬁcant similarity or overlap. In this case, to build predictive models for citations, a learning

algorithm that is capable of utilizing these overlaps and explicit commonalities

has advantages over traditional methods.

Scenario 2: Social Circle Prediction

Members of online social networks tend to categorize their links to followers/

followees. For example, many social networking platforms enable coarse-scale

categorizations such as “family members,” or “friends and colleagues.” Finer

gradations allow for categorizations such as colleagues at particular companies

or classmates at speciﬁc schools. A person’s social circle, studied in [11], is the

ego network of a social network user (or “ego”). This is the (star-shaped) subgraph on “ego” and all of ego’s followers comprising all the links joining ego

to ego’s followers that belong to the same category. Given a friend or stranger,

the goal of social circle prediction is to assign him/her to appropriate social

circles. Because some social circles are related to each other (e.g., family members and childhood friends may share some common informative features such

as geographical proximity), advantages may very well accrue if the relatedness

of the entities is used for the various predictions, instead of building a predictive

assignment model for each social circle independently.

As these scenarios suggest, correlations commonly exist among tasks on network data and there should be signiﬁcant advantages to developing methods that

can leverage them. Different from i.i.d. data, network data not only has attributes

(metadata) associated with each entity (node), but also rich structural information, mainly encoded in the links. Therefore, we employ structural learning to

exploit both attributes and structure of networks. Speciﬁcally, we adopt structure

preserving metric learning (SPML) [16], which was originally developed for single-task learning on networks. Our proposed method, MT-SPML, empowers SPML

with the ability of doing MTL over multiple tasks and networks. SPML learns a

single Mahalanobis distance metric on node attributes for a single task by using

network structure as supervision, so that the learned distance function encodes

the structure. Our method learns Mahalanobis distance metrics jointly over all

tasks. More precisely, it learns a common metric for all tasks and one metric for


each individual task. The common metric construction follows the methodology

of shared intermediate parameterization [7,12], which allows sharing knowledge

between tasks. While a task-specific metric alone captures task-specific information, the common and task-specific metrics work together, when combined, to preserve the connectivity structure of the corresponding network. The learned metrics of SPML and MT-SPML are useful for many tasks on networks, one of which is predicting future link patterns.

eﬃcient online methods similar to OASIS [4] and PEGASOS [15] via stochastic

gradient descent. Finally, MT-SPML is designed for general networks and thus can be applied in a wide variety of problems. In experiments, in order to

demonstrate the advantages of MTL on network data, we apply MT-SPML to two

common real-world prediction problems (citation prediction and social circle prediction), and achieve promising results for link prediction.
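The combination of a common metric with task-specific metrics described above can be sketched as follows. Note that the additive coupling (M0 + Mq) used here is an illustrative assumption in the spirit of shared intermediate parameterization, not necessarily the paper's exact formulation:

```python
import numpy as np

def task_distance_sq(x1, x2, M0, Mq):
    """Squared distance for task q under a shared metric M0 combined with a
    task-specific metric Mq. Additive coupling (M0 + Mq) is assumed for
    illustration only."""
    d = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return float(d @ (M0 + Mq) @ d)
```

With Mq = 0 this reduces to the shared metric alone; a nonzero Mq then deforms the shared geometry for task q only, which is how task-specific information can coexist with knowledge shared across tasks.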

2

Related Work

MTL is a popular research topic and has been studied extensively and systematically for i.i.d. data. To name a few, Yu et al. [21] applied hierarchical Bayesian

modeling for text categorization. Evgeniou et al. [7] extended Support Vector

Machines (SVMs) to MTL via parameter sharing. Following [7], Parameswaran

et al. [12] proposed the multi-task version of large margin nearest-neighbor metric learning [20]. However, there have been only a few works focusing on MTL

Qi et al. carefully designed a mechanism to sample across networks to predict

missing links in a target network. Our paper diﬀers from it in several ways.

First, we aim at improving the prediction performance on all networks, while [13] targets a specific network and uses other networks as additional sources.

Second, MT-SPML learns a joint embedding of both attribute features and network topological structure. Thus, the learned metrics can predict link patterns

solely from node attributes, while [13] combines attribute features linearly with hand-constructed local structure information such as the number of shared

neighbors between nodes. This suﬀers from the well-known “cold start” problem

when structure information is limited (e.g. new nodes).

3

Our Approach

In this section, we ﬁrst cover the technical details of SPML and then those of

MT-SPML.

3.1

Notations and Preliminaries

Given a network on n nodes we represent it as a pair G = (X, A), where

X ∈ Rd×n represents the node attributes and A ∈ Rn×n is the binary adjacency matrix, whose entry Aij indicates the linkage information between node


i and node j. Recall that a Mahalanobis distance is parameterized by a positive semidefinite (PSD) matrix M ∈ Rd×d, where M ⪰ 0. The corresponding distance function is defined as dM(xi, xj) = (xi − xj)ᵀM(xi − xj). This is equivalent to the existence of a linear transformation matrix L on the feature space such that M = LᵀL. Given a metric M, to predict the structure pattern of X we adopt a simple k-nearest-neighbor algorithm, denoted C, meaning that each node is connected to its top-k nearest neighbors under the defined metric. Mathematically, we say M is structure preserving, or that it preserves A, if C(X, M) closely approximates A.

Let G = {G1 , G2 , . . . , GQ } denote a set of networks. Each individual network

Gq has its own Xq and Aq . We use q to index the network so that Aqij stands

for element (i, j) in Aq . Similarly, xqi represents the feature of node i in Xq . In

algorithms, we will use a superscript to index over iteration, e.g., Mk refers to

the k-th iteration of M under the relevant iterative process.
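The distance function dM and the connectivity algorithm C can be sketched as follows (assuming the squared form of the Mahalanobis distance given above; ties in the k-NN step are broken arbitrarily):

```python
import numpy as np

def mahalanobis_sq(X, M):
    """All pairwise squared Mahalanobis distances for the columns of X (d x n):
    d_M(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)."""
    G = X.T @ M @ X                       # Gram matrix under M
    sq = np.diag(G)
    return sq[:, None] + sq[None, :] - 2.0 * G

def knn_connect(X, M, k):
    """Connectivity algorithm C(X, M): link each node to its k nearest
    neighbors under the metric M (directed links, no self-links)."""
    D = mahalanobis_sq(X, M)
    np.fill_diagonal(D, np.inf)
    A = np.zeros_like(D)
    nearest = np.argsort(D, axis=1)[:, :k]
    A[np.arange(D.shape[0])[:, None], nearest] = 1.0
    return A
```

In these terms, M is structure preserving when `knn_connect(X, M, k)` closely matches the observed adjacency matrix A.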

3.2

SPML

The goal of SPML is to learn M from a network G = (X, A), such that M preserves A. This problem has a semidefinite max-margin learning formulation,

    min_{M ⪰ 0}  (λ/2) ‖M‖²_F + ξ                                         (1)

subject to the following constraints:

    ∀ i, j :  dM(xi, xj) ≥ (1 − Aij) max_l (Ail dM(xi, xl)) + 1 − ξ.      (2)

In Eq.(1), ‖·‖F denotes the Frobenius norm; it acts as a regularizer on M, with λ the corresponding weight parameter. The key piece for achieving structure preservation is the set of linear constraints in Eq.(2). These essentially enforce that, from node i, the distances to all disconnected nodes must be larger than the distance to the furthest connected node. Thus, when the constraints in Eq.(2) are all satisfied, C(X, M) will exactly reproduce A. Furthermore, to allow for violations (with penalty), the slack variable ξ is introduced.
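A small checker makes the structure-preserving condition concrete. This is an illustrative sketch, not the authors' code; it counts violations of Eq.(2) restricted to disconnected pairs, with the slack ξ set to zero:

```python
import numpy as np

def eq2_violations(X, A, M):
    """Count violated constraints of Eq.(2) with slack = 0: for every node i,
    each disconnected node j must lie at least margin 1 beyond the farthest
    connected neighbor of i."""
    n = A.shape[0]
    diff = X[:, :, None] - X[:, None, :]            # d x n x n differences
    D = np.einsum('dij,de,eij->ij', diff, M, diff)  # squared Mahalanobis
    violations = 0
    for i in range(n):
        linked = np.flatnonzero(A[i])
        if linked.size == 0:
            continue
        margin = D[i, linked].max() + 1.0
        for j in np.flatnonzero(A[i] == 0):
            if j != i and D[i, j] < margin:
                violations += 1
    return violations
```

When this returns 0, the k-nearest-neighbor construction C(X, M) reproduces A exactly, as noted above.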

With the many constraints in Eq.(2), optimizing Eq.(1) becomes infeasible when the network has even a few hundred nodes. But rewriting the problem as follows makes possible the use of stochastic subgradient descent (see Algorithm 1):

    f(M) = (λ/2) ‖M‖²_F + (1/|S|) Σ_{(i,j,l)∈S} max(ΔM(xi, xj, xl) + 1, 0)      (3)

where ΔM(xi, xj, xl) = dM(xi, xl) − dM(xi, xj) and S = {(i, j, l) | Ail = 1 ∧ Aij = 0}. Thus, inclusion of the triplet (i, j, l) means that there is a link between node i and node l, but not between i and j. The subgradient of Eq.(3) can be calculated as
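Since the objective in Eq.(3) decomposes over triplets, one stochastic step can be sketched as below. Using the squared-distance form of dM, the subgradient of an active hinge term is (xi − xl)(xi − xl)ᵀ − (xi − xj)(xi − xj)ᵀ, plus λM from the regularizer. The uniform triplet sampling and the PSD projection here are illustrative assumptions in the spirit of PEGASOS-style solvers, not necessarily the paper's exact algorithm:

```python
import numpy as np

def spml_sgd_step(M, X, A, lam, eta, rng):
    """One stochastic subgradient step on f(M) in Eq.(3) (illustrative sketch).

    Samples a triplet (i, j, l) with A[i, l] = 1 and A[i, j] = 0, takes a
    subgradient step on the hinge term, and projects M back onto the PSD cone.
    """
    n = A.shape[0]
    i = int(rng.integers(n))
    linked = np.flatnonzero(A[i])
    unlinked = np.flatnonzero((A[i] == 0) & (np.arange(n) != i))
    if linked.size == 0 or unlinked.size == 0:
        return M
    l = int(rng.choice(linked))
    j = int(rng.choice(unlinked))

    def d(a, b):  # squared Mahalanobis distance under the current M
        v = X[:, a] - X[:, b]
        return v @ M @ v

    if d(i, l) - d(i, j) + 1.0 > 0:      # hinge active: Delta_M + 1 > 0
        vl = X[:, i] - X[:, l]
        vj = X[:, i] - X[:, j]
        grad = lam * M + np.outer(vl, vl) - np.outer(vj, vj)
        M = M - eta * grad
        w, V = np.linalg.eigh((M + M.T) / 2.0)   # project onto the PSD cone
        M = (V * np.maximum(w, 0.0)) @ V.T
    return M
```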
