




[Figure omitted: fraction of retweets (0–0.5) plotted against path length (1–10), one curve per topic: No Topic, House, Xiao Mi, Li Yang, He Bei]

Fig. 4. Path length distribution of retweets of each topic

Table 3. The information of different types of users

Users          Avg. path length   Avg. burst duration (days)   Avg. life duration (days)
Top 100        1.86               2.73                         10.63
Top 100-1000   2.17               3.05                         18.34
Normal         3.04               3.77                         28.46



Burst Pattern vs. Users. We examine whether the type of user who posts a tweet affects its burst pattern. Table 3 gives a general comparison of three types of users: top 100, top 100-1000, and normal users. We define the top 100 users as those who rank among the top 100 in terms of the total number of retweets each user receives. There are significant differences in path length, peak time, and duration among the three types of users. Tweets from the top 100 most influential users have much shorter path lengths, burst durations, and life durations than those from the top 100-1000 and normal users. Figure 5 shows the path length distribution for each type of user under study.



[Figure omitted: fraction of retweets (0–0.5) plotted against path length (1–10), one curve per user type: Top Users, Second Top Users, Normal Users]

Fig. 5. Path length distribution of retweets from three types of users






We can observe that the proportions of retweets of tweets authored by the top 100 users with path lengths 1 and 2 are 46% and 39%, respectively, which are much higher than the corresponding proportions for tweets from the top 100-1000 and normal users. This phenomenon shows that the top 100 most influential users can propagate their messages more quickly on the microblogging site than other users.



4 Burst Prediction



We are interested in the following prediction problem: given a tweet with known information about its content, its user profile, the followship topology, and the observed retweet sequence in the first 12 hours, can we predict whether the tweet will have multiple bursts later in its life cycle?

One challenge is what kind of features we can extract from the known information and how useful they are for burst prediction. In our study, we extract 178 features from the a priori known information of a tweet (i.e., its topics, user profile, followship topology, and its observed retweet sequence in the first 12 hours). The extracted features can be roughly grouped into two main classes: user-related and tweet-related.

In the user-related class, we extract features from the profile of the user who posts the original tweet. For example, we extract the number of the user's immediate followees, the number of two-hop followees, the number of tweets the user has authored, the average number of retweets received in the first 12 hours over all of the user's tweets, and the numbers of the user's tweets with no, single, and multiple bursts.
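As an illustration of how the network-based counts among these features could be computed, the following is a minimal sketch; it assumes the followship topology is available as a mapping from each user ID to the set of users that user follows, and the mapping and user IDs shown are hypothetical.

def count_followees(followees, user_id):
    """Count immediate and two-hop followees of a user.

    followees is assumed to map each user ID to the set of user IDs
    that the user follows (the followship topology).
    """
    one_hop = followees.get(user_id, set())
    two_hop = set()
    for v in one_hop:
        two_hop |= followees.get(v, set())
    # Exclude the user and the direct followees from the two-hop set.
    two_hop -= one_hop | {user_id}
    return len(one_hop), len(two_hop)

# Tiny hypothetical followship graph: u1 follows u2 and u3, and so on.
graph = {"u1": {"u2", "u3"}, "u2": {"u4"}, "u3": {"u4", "u5"}}
print(count_followees(graph, "u1"))   # -> (2, 2): u4 and u5 are two hops away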

In the tweet-related class, we extract features such as the tweet's post time, the first retweeting time, the presence/absence of hot topics in the tweet, the presence/absence of hot topics in its retweets, the presence/absence of @users in the tweet, the presence/absence of @users in its retweets, the number of retweets containing @users, the number of @users in its retweets, etc. For each tweet, we also build a retweet tree from its observed retweet sequence in the first 12 hours and extract features such as the maximum width, the maximum height, the number of retweet users, and the average path length.
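A sketch of how these retweet-tree features could be derived from the observed retweet sequence follows; the (retweeting_user, source_user) pair representation is an assumption made for illustration, not the paper's actual data format.

from collections import defaultdict, deque

def retweet_tree_features(root_user, retweets):
    """Compute max width, max height, number of retweet users, and average
    path length of the retweet tree observed in the first 12 hours.

    retweets is assumed to be a list of (retweeting_user, source_user)
    pairs, where source_user is the user the message was retweeted from.
    """
    children = defaultdict(list)
    for user, source in retweets:
        children[source].append(user)

    depths = {}                        # depth of each node; the root has depth 0
    queue = deque([(root_user, 0)])
    while queue:
        node, d = queue.popleft()
        depths[node] = d
        for child in children.get(node, []):
            queue.append((child, d + 1))

    level_sizes = defaultdict(int)
    for d in depths.values():
        level_sizes[d] += 1

    path_lengths = [d for node, d in depths.items() if node != root_user]
    max_width = max(level_sizes.values())
    max_height = max(depths.values())
    n_retweet_users = len(depths) - 1
    avg_path_length = sum(path_lengths) / len(path_lengths) if path_lengths else 0.0
    return max_width, max_height, n_retweet_users, avg_path_length

# Example: u1 posts; u2 and u3 retweet u1; u4 retweets u3.
print(retweet_tree_features("u1", [("u2", "u1"), ("u3", "u1"), ("u4", "u3")]))
# -> (2, 2, 3, 1.33...)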

In our experiment, we exclude from the Sina Weibo dataset those records in which the original tweet's user ID could not be found in the followship network. Finally, we build a training data set with 30,084 tweets with no multi-burst and 30,030 tweets with multi-burst.

We run a suite of 7 classifiers: Logistic Regression (LR), Random Forest (RF), Decision Tree (DT), Naive Bayes (NB), Support Vector Machine (SVM), Stochastic Gradient Descent (SGD), and k-Nearest Neighbor (kNN). We apply 10-fold cross-validation for each classifier. The accuracy results are shown in Figure 6. We can observe that Random Forest, Decision Tree, k-Nearest Neighbor, and Logistic Regression achieve good prediction results in terms of accuracy (higher than 72%).
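A minimal scikit-learn sketch of this evaluation protocol follows, assuming the 178 features and multi-burst labels have already been assembled into arrays X and y; the random placeholder data and default hyperparameters are ours, not the paper's.

import numpy as np
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(),
    "DT": DecisionTreeClassifier(),
    "NB": GaussianNB(),
    "SVM": SVC(),
    "SGD": SGDClassifier(),
    "kNN": KNeighborsClassifier(),
}

# X: (n_tweets, 178) feature matrix; y: 1 if the tweet has multiple bursts.
# Random placeholder data stands in for the real 60,114-tweet training set.
rng = np.random.default_rng(0)
X = rng.random((200, 178))
y = rng.integers(0, 2, size=200)

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")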

We then analyze the effect of each feature on prediction, taking the logistic regression coefficient as the measure of effect. A regression coefficient represents the change in the logit for each unit change in the feature: the larger the absolute value of the coefficient, the greater the feature's effect.






[Figure omitted: bar chart of average accuracy (0–0.9) for each of the seven classifiers]

Fig. 6. Accuracy of classifiers: Logistic Regression (LR), Random Forest (RF), Decision Tree (DT), Naive Bayes (NB), Support Vector Machine (SVM), Stochastic Gradient Descent (SGD), and k-Nearest Neighbor (kNN)

[Figure omitted: logistic coefficient (scale ×10⁻⁵, roughly −4 to 4) plotted against feature index (0–180)]

Fig. 7. Logistic coefficient of features



Formally, we can use the likelihood ratio test or the Wald statistic to assess the significance of an individual feature. Our results show that only 20 features have relatively large coefficient values. Figure 7 plots the logistic regression coefficient of each feature, where the X-axis indexes the features and the Y-axis shows each feature's coefficient value. We list the top 5 most significant features in Table 4.

Table 4. Top 5 most significant features (PL1 denotes path length 1)

Index  Meaning                                                        Coefficient
121    Avg no of PL1 retweets of user's all tweets                    3.95E-05
87     Avg no of PL1 retweets (first 12h) of user's no-burst tweets   3.51E-05
50     Avg no of retweets (first 12h) of user's multi-burst tweets    -3.37E-05
84     Avg no of retweets (first 12h) of user's no-burst tweets       3.14E-05
82     Avg no of retweets of user's no-burst retweets                 -2.86E-05






We can see that the average number of path-length-1 retweets over all of the user's tweets is the most significant feature, with a coefficient value of 3.95E-05. In our future work, we will conduct a detailed correlation analysis and examine prediction performance after removing redundant features.
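A sketch of how such a coefficient-based ranking could be reproduced with scikit-learn follows; the placeholder data and the use of unstandardized coefficients mirror the description above but are not taken from the paper's implementation.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data standing in for the 178-feature matrix and burst labels.
rng = np.random.default_rng(0)
X = rng.random((200, 178))
y = rng.integers(0, 2, size=200)

model = LogisticRegression(max_iter=1000).fit(X, y)
coefs = model.coef_.ravel()

# Rank features by the absolute value of their logistic regression coefficient;
# a larger |coefficient| means a larger change in the logit per unit change in
# the (unstandardized) feature.
top = np.argsort(-np.abs(coefs))[:5]
for idx in top:
    print(f"feature {idx}: coefficient = {coefs[idx]:+.2e}")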



5 Related Work



Examining retweet behavior has been an active research area recently [7–9,12,13]. For example, the authors in [7] study the coverage prediction of retweets, i.e., the number of times that a particular message posted by a user will be retweeted. In [13], the authors examine various factors such as user, message, and time, and propose a factor graph model to predict whether a user will retweet a message. The authors in [9] study why people retweet and examine the anti-homophily phenomenon. In [8], the authors examine the use of log-linear modeling to identify multi-way interactions between retweeting and various features such as power ratio, link structure, and users' profile information. In [12], the authors analyze the ways in which hashtags spread on Twitter and show that widely-used hashtags on different topics spread in significantly different ways.

Change detection models [1,4] provide a standard approach to detecting deviations from a baseline. Usually we assume the mean and variance of a distribution representing normal behavior and the mean and variance of another distribution representing abnormal behavior, and measure deviations from normal using the generalized likelihood ratio. For example, in [4], the authors assume both distributions are Gaussian with the same variance and that the change is reflected in the mean of the observations. In this context, they apply the generalized likelihood ratio to score changes from the baseline.
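For concreteness, here is an illustrative sketch of that Gaussian mean-shift case: for a window of n observations with sample mean m, baseline mean mu0, and known variance sigma^2, the generalized likelihood ratio score reduces to n * (m - mu0)^2 / (2 * sigma^2). This is our own simplified formulation, not necessarily the exact scoring used in [4].

import numpy as np

def glr_mean_shift_score(window, mu0, sigma):
    """Generalized likelihood ratio score for a shift in the mean of a
    Gaussian with known variance: n * (mean(window) - mu0)^2 / (2 * sigma^2).
    Larger scores indicate a larger deviation from the baseline mean mu0."""
    window = np.asarray(window, dtype=float)
    n = window.size
    return n * (window.mean() - mu0) ** 2 / (2.0 * sigma ** 2)

# Example: baseline retweet rate modeled as N(5, 2^2); a window of elevated counts.
print(glr_mean_shift_score([9, 11, 10, 12], mu0=5.0, sigma=2.0))  # large score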

Techniques for finding burst patterns in data streams have also been presented in [6,11,15,16]. In [6], the author examines bursty structure in temporal text streams (e.g., emails or blogs), studying how word frequencies change over time; bursty words are defined as those with significantly higher frequency than others. The stream is modeled with an infinite-state automaton, in which bursts appear naturally as state transitions. In [16], the authors examine point monitoring and aggregate monitoring in time series data streams and design a new structure, called the Shifted Wavelet Tree, for elastic burst monitoring. In [15], the authors propose a family of data structures based on the Shifted Binary Tree for elastic burst detection and develop a heuristic search algorithm to find an efficient structure for a given input. In [11], the authors study how to detect, characterize, and classify bursts in the user query logs of large-scale e-commerce systems. They build several models that continually detect new bursts with minimal computation and provide a mechanism to rank the identified bursts based on factors such as burst concentration, burst intensity, and burst interestingness. They also propose several quantities to rank bursts, including duration, mass, arrival rate, span ratio, momentum, and concentration of the burst, and apply unsupervised learning techniques to classify the bursts based on their patterns.






Although extensive work has been done in related fields on mining various temporal patterns, very little work has been done to detect and predict interesting burst patterns from large-scale retweet sequence data.

Message propagation can be regarded as a social contagion process, and there has been research on rumor propagation [5,10,14]. In [14], the author studies the dynamics of an epidemic-like model for the spread of a rumor on a small-world network. In [10], the authors study the dynamics of a generic rumor model on complex scale-free topologies and investigate the impact of the interaction rules on the efficiency and reliability of the rumor process. In [5], the authors apply the susceptible-infectious-recovered and susceptible-infectious-susceptible models to study the spreading process in complex networks. However, very little of this work addresses the detection and prediction of burst patterns.



6 Conclusion



In this paper, we have proposed the use of Cantelli's inequality to identify bursts from retweet sequence data. With Cantelli's inequality, we do not need to assume a distribution for the retweet sequence data and can still identify bursts efficiently. We conducted a complete empirical study of burst patterns using Sina Weibo data and examined what factors affect bursts. We extracted various features from users' profiles, the followship topology, and message topics, and investigated whether and how accurately we can predict bursts using various classifiers based on the extracted features. Our empirical evaluation shows that burst prediction is feasible with appropriately extracted features and classifiers.
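As a minimal sketch of the idea, Cantelli's (one-sided Chebyshev) inequality, P(X - mu >= k*sigma) <= 1/(1 + k^2), yields a distribution-free burst threshold mu + k*sigma once k is chosen so that the bound falls below a desired false-alarm level alpha. The baseline estimation and windowing details below are chosen for illustration and are not necessarily those used in the paper.

import numpy as np

def cantelli_threshold(baseline, alpha=0.05):
    """Burst threshold from Cantelli's (one-sided Chebyshev) inequality:
    P(X - mu >= k * sigma) <= 1 / (1 + k^2). Choosing the smallest k with
    1 / (1 + k^2) <= alpha gives mu + k * sigma as a distribution-free
    threshold for the retweet counts."""
    baseline = np.asarray(baseline, dtype=float)
    mu, sigma = baseline.mean(), baseline.std()
    k = np.sqrt(1.0 / alpha - 1.0)       # solves 1 / (1 + k^2) = alpha
    return mu + k * sigma

# Example: estimate mu and sigma from an early, non-bursty part of the
# sequence, then flag later windows whose count exceeds the threshold.
baseline = [3, 2, 4, 3, 5, 4, 2, 3, 3, 4]
later = [4, 5, 40, 38, 4, 45, 3]
thr = cantelli_threshold(baseline, alpha=0.05)
bursts = [i for i, c in enumerate(later) if c > thr]
print(round(thr, 1), bursts)             # threshold and indices of burst windows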

In our future work, we will investigate various regression analysis methods [3] on the extracted features to predict when a tweet produces its first burst as well as subsequent bursts. We will analyze the causes of bursts by matching them against external events that might have triggered them. We will also study how to classify bursts based on their shapes, durations, and derived characteristics, and will examine various burst characteristics such as burst concentration, burst intensity, and burst interestingness. We will study how the window size affects burst detection and categorization. Finally, we will study the use of topic modeling [2] to analyze tweet content and automatically identify the topics of every tweet.

Acknowledgments. The authors would like to thank anonymous reviewers for their valuable comments and suggestions. This work was supported in part by U.S. National Science Foundation (CCF-1047621), U.S. National Institute of Health (1R01GM103309), and the Chancellor's Special Fund from UNC Charlotte.



References

1. Basseville, M., Nikiforov, I., et al.: Detection of abrupt changes: theory and application, vol. 104. Prentice Hall, Englewood Cliffs (1993)

2. Blei, D., Ng, A., Jordan, M.: Latent Dirichlet allocation. JMLR 3, 993–1022 (2003)






3. Cohen, J., Cohen, P.: Applied multiple regression/correlation analysis for the

behavioral sciences. Lawrence Erlbaum (1975)

4. Curry, C., Grossman, R., Locke, D., Vejcik, S., Bugajski, J.: Detecting changes in

large data sets of payment card data: a case study. In: KDD, pp. 1018–1022. ACM

(2007)

5. Kitsak, M., Gallos, L., Havlin, S., Liljeros, F., Muchnik, L., Stanley, H., Makse, H.:

Identification of influential spreaders in complex networks. Nature Physics 6(11),

888–893 (2010)

6. Kleinberg, J.: Bursty and hierarchical structure in streams. Data Mining and

Knowledge Discovery 7(4), 373–397 (2003)

7. Luo, Z., Wang, Y., Wu, X.: Predicting retweeting behavior based on autoregressive

moving average model. In: Wang, X.S., Cruz, I., Delis, A., Huang, G. (eds.) WISE

2012. LNCS, vol. 7651, pp. 777–782. Springer, Heidelberg (2012)

8. Luo, Z., Wu, X., Cai, W., Peng, D.: Examining multi-factor interactions in

microblogging based on log-linear modeling. In: ASONAM (2012)

9. Macskassy, S.A., Michelson, M.: Why do people retweet? anti-homophily wins the

day! In: ICWSM (2011)

10. Moreno, Y., Nekovee, M., Pacheco, A.: Dynamics of rumor spreading in complex

networks. Physical Review E 69(6), 066130 (2004)

11. Parikh, N., Sundaresan, N.: Scalable and near real-time burst detection from e-commerce queries. In: KDD, pp. 972–980. ACM (2008)

12. Romero, D.M., Meeder, B., Kleinberg, J.: Differences in the mechanics of information diffusion across topics: idioms, political hashtags, and complex contagion on

twitter. In: WWW, pp. 695–704. ACM (2011)

13. Yang, Z., Guo, J., Cai, K., Tang, J., Li, J., Zhang, L., Su, Z.: Understanding

retweeting behaviors in social networks. In: CIKM, pp. 1633–1636. ACM (2010)

14. Zanette, D.: Dynamics of rumor propagation on small-world networks. Physical

Review E 65(4), 041908 (2002)

15. Zhang, X., Shasha, D.: Better burst detection. In: ICDE, pp. 146–146. IEEE (2006)

16. Zhu, Y., Shasha, D.: Efficient elastic burst detection in data streams. In: KDD,

pp. 336–345. ACM (2003)



#FewThingsAboutIdioms: Understanding Idioms and Its Users in the Twitter Online Social Network

Koustav Rudra¹(B), Abhijnan Chakraborty¹, Manav Sethi¹, Shreyasi Das¹, Niloy Ganguly¹, and Saptarshi Ghosh²,³

¹ Department of CSE, Indian Institute of Technology Kharagpur, Kharagpur, India
  koustav.rudra@cse.iitkgp.ernet.in
² Max Planck Institute for Software Systems, Kaiserslautern, Germany
³ Department of CST, Indian Institute of Engineering Science and Technology Shibpur, Howrah, India



Abstract. To help users find popular topics of discussion, Twitter periodically publishes 'trending topics' (trends), the most discussed keywords (e.g., hashtags) at a certain point of time. Inspection of the trends over several months reveals that while most of the trends are related to events in the off-line world, such as popular television shows, sports events, or emerging technologies, a significant fraction are not related to any topic or event in the off-line world. Such trends are usually known as idioms; examples include #4WordsBeforeBreakup and #10thingsIHateAboutYou. We perform the first systematic measurement study of Twitter idioms. We find that tweets related to a particular idiom normally do not cluster around any particular topic or event. There is a set of users in Twitter who predominantly discuss idioms (common, not-so-popular, but active users who mostly use Twitter as a conversational platform), as opposed to other users who primarily discuss topical content. The implication of these findings is that within a single online social network, the activities of users may have very different semantics; thus, tasks like community detection and recommendation may not be accomplished perfectly using a single universal algorithm. Specifically, we run two (link-based and content-based) community detection algorithms on the Twitter social network, and show that idiom-oriented users are clustered better by one while topical users are clustered better by the other. Finally, we build a novel service which shows trending idioms and recommends idiom users to follow.



1 Introduction



Twitter is now considered more of an 'information network' than a social network [6], and almost the entire focus of the research community has been on 'topical' content in Twitter, such as tweets / hashtags related to sports, technology, or emergency situations in the off-line world [2].







Table 1. Percentage of Twitter trends collected over ten months, classified into nine different categories that were identified by human volunteers (details in Section 2), with a few example trends for each.

Category       %    Example trends
Entertainment  33%  #5sosonKiis, #IWishICould, #Austinonidol
Sports         30%  #argentinavsholanda, #lakers, #bravsger
Idioms          9%  #WhenIWasATeenager, #FactsaboutMe, I get angry when
Technology      8%  #iphone6, #galaxy4, AppleWatch, ios8
Politics        5%  #tcot, #pjnet, #obama, #gaza
Business        5%  #amazon, #AlibabaIPO, #FedReserve
Religion        3%  #EidMubarak, #jesus, #citrt
Health          2%  #Ebola, #Who, #breastcancer
Others          5%  #garlicparmpizza, #filipino, cheesecake, pizza is healthy



However, a closer inspection of the Twitter trending topics ('trends' in short), keywords periodically declared by Twitter to be the most discussed at that point in time, indicates some exceptions to this view and provides the motivation for the present study.

We collected US trends over a duration of 10 months (January – October 2014) using the Twitter API at 15-minute intervals. This gave about 18,500 distinct trending topics during this period. We then developed a classifier, Odin¹, and classified the trends into multiple categories such as sports, entertainment, and technology; these broad categories were identified by human volunteers (details in Section 2). Table 1 shows the distribution of the trends across the broad categories. While most of the categories are topical and related to events in the off-line world, it can be observed that a special category, known as idioms², regularly becomes trending. Examples of idioms include #4WordsBeforeBreakup and #11ThingsAboutYou, and apparently these are not related to any topic or event in the off-line world.

The frequent presence of such trends is intriguing: it raises the question of whether their dynamics, as well as the users discussing them, are similar to those of their topical counterparts. To understand the dynamics, we collected tweets related to hundreds of idioms and the users who discuss them, and conducted a detailed measurement study. We find that the tweets containing idioms are mainly conversational in nature; for instance, they hardly contain URLs. On investigating the users who post these tweets (the idiom-users), we find that they are mostly general and active Twitter users, as opposed to the popular experts / celebrities who usually drive topics such as politics and entertainment. The idiom-users maintain close friendships among themselves and interact on diverse issues with their friends.

¹ Named after the God of Wisdom according to Norse mythology; details in Section 2.
² In this study, we follow the definition of idioms given by [13]: an idiom is a keyword representing a conversational theme on Twitter, consisting of a concatenation of at least two common words which does not include names of people, places, music albums, etc.






Thus, the study reveals that hidden within the information network of Twitter there is a social network of users who regularly have "non-topical" conversations among themselves.

Such an inference has far-reaching implications. It essentially means that multiple dominant dynamics are present in the same social network, so standard tasks like community detection and recommendation cannot be done using a one-parameter-fits-all approach. An algorithm to identify (recommend) topical groups might fail to identify (recommend) idiom-users. To test this proposition, we run two community detection algorithms: one identifying topical groups [2] and the other, Infomap [14], detecting communities using link structure. We find that the idiom-users are well identified by Infomap while the topical groups are better identified by [2]. This establishes that different approaches to tasks such as clustering may have different utilities in a heterogeneous online social network. Further, considering that all existing recommender services are specifically meant to recommend topical experts, we develop a service, Idiomatic, where one can easily follow popular idiom-users, see recent and past trending idioms, and post tweets using them.



2 Classification of Trends



In order to perform a large-scale study of idioms and of trends related to topics / events in the off-line world, we built an automatic classifier, Odin, to distinguish trends based on whether they are idioms or related to some topic. Note that some prior studies [7,21] have also attempted to classify trends (not necessarily into the same categories found by Odin), utilizing the textual contents of the tweets containing the trends. However, tweets (restricted to 140 characters) often contain informal language and abbreviations, which potentially results in lower classification accuracy [21]. Hence, we adopt a different approach that combines both tweets and related web documents and uses several web-based knowledge engines to perform the classification. Odin classifies a given trend following the steps presented below.

2.1 Preprocessing



Segmentation: Trends often consist of multiple words [13]. Recognizing the component words is easy for multi-word phrases and hashtags written in CamelCase style (e.g., #WorldCupSoccer), but very difficult for trends which simply have the words juxtaposed without any separation (e.g., #everythingididntsay). Since it is important to identify the individual words which make up a trend in order to understand its topic, trends need to be segmented into their component words. Odin follows a modified version of the Viterbi Algorithm [1], which uses a model of word distribution to calculate the most probable character sequence forming a word. Odin computes the word distribution from the Google n-gram corpus (https://books.google.com/ngrams).






Given a trend, Odin segments it into its constituent words based on these calculated probability estimates (details omitted for brevity).
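The following is a minimal sketch of this kind of probabilistic segmentation; it uses a plain unigram dynamic program with a toy word distribution rather than Odin's actual modified Viterbi implementation or the full Google n-gram counts.

import math

def segment(text, unigram_prob, max_word_len=20):
    """Viterbi-style dynamic-programming segmentation of a hashtag-like
    string into its most probable word sequence under a unigram model.
    Unseen substrings get a tiny smoothing probability so that the
    segmentation never fails."""
    text = text.lower().lstrip("#")
    n = len(text)
    best = [(-math.inf, 0)] * (n + 1)     # (best log-probability, split point) per prefix
    best[0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            word = text[j:i]
            score = best[j][0] + math.log(unigram_prob.get(word, 1e-12))
            if score > best[i][0]:
                best[i] = (score, j)
    words, i = [], n
    while i > 0:                          # walk the split points backwards
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return list(reversed(words))

# Toy word distribution; the real system estimates word probabilities
# from the Google n-gram corpus.
probs = {"everything": 0.02, "every": 0.03, "thing": 0.03,
         "i": 0.05, "didnt": 0.01, "say": 0.04}
print(segment("#everythingididntsay", probs))  # -> ['everything', 'i', 'didnt', 'say']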

Categorization of Related Web Documents: Odin queries different Web search engines (e.g., Google, Bing) with the segmented trend to get a large set of web-pages relevant to the given trend. Often the tweets containing the trend include URLs, which become another source of related web-pages.³ For a given trend, Odin collects all the web-pages pointed to by the tweets and returned by the search engines; a set of category keywords is then extracted for these collected web-pages using the NLP-based AlchemyAPI web service (www.alchemyapi.com).

Entity Extraction and Categorization: Sometimes the trend contains names of people, organisations or locations (e.g., #EMABiggestFansJustinBieber), detecting which can give a clear idea of the category of the trend. Similarly, the web documents and the tweets associated with a particular trend contain many such named entities. Odin extracts these entities using AlchemyAPI and then queries Freebase (www.freebase.com) for the 'notable type' of each named entity (e.g., according to Freebase, the notable type of 'Justin Bieber' is '/music/artist').

2.2 Classification



At the end of the preprocessing steps, for a given trend, Odin has collected the categories of the related web documents and the notable types of the related named entities. Treating the numbers of web documents and named entities in the various categories as features, Odin uses a Support Vector Machine (SVM) classifier with a Radial Basis Function kernel to classify a particular trend into one of the 9 categories shown in Table 1.
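A minimal scikit-learn sketch of this classification step follows; the feature dimensionality, the feature scaling, and the synthetic placeholder data are our assumptions, since the text only specifies an SVM with an RBF kernel over the category-count features.

import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

CATEGORIES = ["Entertainment", "Sports", "Idioms", "Technology", "Politics",
              "Business", "Religion", "Health", "Others"]

# Placeholder feature matrix: per trend, counts of related web documents and
# named entities falling into the various category keywords / notable types.
rng = np.random.default_rng(0)
X = rng.integers(0, 20, size=(575, 18)).astype(float)   # 575 labelled trends
y = rng.integers(0, len(CATEGORIES), size=575)          # ground-truth labels

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print(f"10-fold accuracy: {scores.mean():.3f}")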

Training Data Preparation: To create the training data, three human volunteers (regular users of Twitter who are not authors of this paper) were asked to manually inspect 700 distinct trends collected during the first two weeks of January 2014 (along with tweets containing these trends) and classify the trends into different categories. The volunteers identified the nine broad categories shown in Table 1, such as Entertainment, Sports, Technology, and Idioms (following the definition of idioms in [13]). For 575 of the 700 trends, all three volunteers agreed upon a particular category. We created the training data considering this unanimous categorization as the ground truth.

Classification Performance: Standard 10-fold cross-validation on the data for these 575 trends showed that Odin attains 77.15% accuracy in predicting trend categories, which is good considering that this is a complex nine-class classification task.

³ URLs leading to social media sites like Facebook, Twitter, and Instagram are ignored, since these pages usually do not have much content to help the topic categorization process.






Table 2. Statistics of data collected

Property                                              Idiom   Sports  Entertainment  Technology
Number of trends                                      150     150     150            150
Total #tweets containing the trends (millions)        6.205   6.787   6.967          6.105
Mean #tweets per trend                                41,369  45,257  46,455         40,721
Total #distinct users posting the trend (millions)    2.74    2.71    1.90           1.75
Mean #distinct users per trend                        18,315  18,098  12,725         11,705

3 Dataset



Since most of the Twitter trends were related to the three topics entertainment, sports, and technology (see Table 1), we decided to focus on idioms and on trends related to these three topics; the trends related to any of these three topics are collectively referred to as 'topical trends'. For each of the trends belonging to the four selected categories, we collected as many tweets containing the trend as possible using the Twitter search API. To get a better understanding of the trends, in the analysis presented in later sections we used only those trends for which we were able to collect more than 30,000 tweets. To maintain uniformity across categories, we finally selected a set of 150 trends for each category (the original distribution is stated in Table 1).

For each of the 600 selected trends, we further collected detailed statistics about all the users who posted a tweet containing any of the selected trends, including their profile details, social links, and recently posted tweets. Table 2 summarizes the statistics of the data collected for the trends of the four categories.



4 Comparing Idioms and Topical Trends



In this section, we compare how idioms and topical trends are discussed in the Twitter social network, and compare the users who frequently discuss them.

4.1 How Trends Are Discussed in Twitter



We first analyze how the trends of different categories are, in general, discussed in Twitter. For a given trend t, we consider all tweets containing t and measure what percentage of these tweets contain other hashtags (apart from t itself) and URLs. Figure 1 shows the mean values of the percentage of tweets containing other hashtags and URLs, where the mean values are computed over all trends of a particular category. Statistical measures like the two-sample KS test and the Mann-Whitney U test with significance level 0.05 show that there is a significant difference in the distribution of these mean values among the four categories. Expectedly, we find that the topical trends are much more likely to be accompanied by other hashtags and URLs.
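A sketch of how such a comparison could be carried out with SciPy follows; the per-trend URL percentages below are synthetic placeholders standing in for the values computed from the collected tweets, so the printed p-values are purely illustrative.

import numpy as np
from scipy.stats import ks_2samp, mannwhitneyu

def pct_with_urls(tweets):
    """Percentage of tweets in one trend that contain at least one URL.
    tweets is assumed to be a list of tweet texts for that trend."""
    hits = sum(1 for t in tweets if "http://" in t or "https://" in t)
    return 100.0 * hits / len(tweets)

print(pct_with_urls(["check https://t.co/x", "just a tweet"]))  # -> 50.0

# Placeholder per-trend percentages for two categories (one value per trend);
# in the study these come from the 150 trends of each category.
rng = np.random.default_rng(0)
idiom_pct = rng.uniform(0, 10, size=150)     # idioms: few tweets carry URLs
topical_pct = rng.uniform(15, 45, size=150)  # topical trends: many more URLs

ks_stat, ks_p = ks_2samp(idiom_pct, topical_pct)
u_stat, u_p = mannwhitneyu(idiom_pct, topical_pct, alternative="two-sided")
print(f"KS p={ks_p:.3g}, Mann-Whitney p={u_p:.3g}")  # both well below 0.05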


