
Z. Luo et al.

Fig. 4. Path length distribution of retweets of each topic (x-axis: the path length, 1 to 10; y-axis: the fraction, 0 to 0.5; series: No Topic, House, Xiao Mi, Li Yang, He Bei)

Table 3. The information of different types of users

Users          Avg. path length   Avg. burst duration (days)   Avg. life duration (days)
Top 100        1.86               2.73                         10.63
Top 100-1000   2.17               3.05                         18.34
Normal         3.04               3.77                         28.46

Burst Pattern vs. Users. We examine whether different types of users who post tweets affect the burst pattern. Table 3 shows the general comparison of three types of users: top 100, top 100-1000, and normal users. We define the top 100 users as those who rank among the top 100 in terms of the total number of retweets each user receives. We can see there are significant differences in terms of path length, peak time, and duration time among the three types of users. Tweets from the top 100 most influential users have much shorter path lengths, burst durations, and tweet life durations than those from the top 100-1000 and normal users. Figure 5 shows the path length distribution for each type of user under study.

Fig. 5. Path length distribution of retweets from three types of users (x-axis: the path length, 1 to 10; y-axis: the fraction, 0 to 0.5; series: Top Users, Second Top Users, Normal Users)

On Burst Detection and Prediction in Retweeting Sequence

We can observe that the proportions of retweets with path lengths 1 and 2 for tweets authored by the top 100 users are 46% and 39%, respectively, which are much higher than the corresponding proportions for tweets from the top 100-1000 and normal users. This phenomenon shows that the top 100 most influential users can propagate their messages more quickly in the microblogging site than other users.
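The path length distributions discussed above can be computed with a few lines of code. The following is a minimal sketch, assuming each retweet has already been assigned a path length (its hop distance from the original tweet); the function name and the toy data are hypothetical and do not come from the paper.

```python
from collections import Counter


def path_length_distribution(path_lengths, max_length=10):
    """Return the fraction of retweets at each path length 1..max_length."""
    counts = Counter(path_lengths)
    total = len(path_lengths)
    if total == 0:
        return {k: 0.0 for k in range(1, max_length + 1)}
    return {k: counts.get(k, 0) / total for k in range(1, max_length + 1)}


# Toy data (hypothetical): path lengths of ten retweets of one user group.
toy_lengths = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
distribution = path_length_distribution(toy_lengths)
print(distribution[1], distribution[2])
```

Plotting one such distribution per user group would give a figure of the same kind as Fig. 5.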

4 Burst Prediction

We are interested in the following prediction problem: given a tweet with known information about its content, its user profile, the followship topology, and the observed retweet sequence in the first 12 hours, can we predict whether the tweet will exhibit multi-burst in the rest of its life cycle?

One challenge here is what kind of features we can extract from the known information and how useful they are for burst prediction. In our study, we extract 178 features from the a priori known information of a tweet (i.e., its topics, user profile, followship topology, and its observed retweet sequence in the first 12 hours). The extracted features can be roughly grouped into two main classes: user-related and tweet-related.

In the user-related class, we extract features from the profile of the user who posts the original tweet. For example, we extract the number of the user's immediate followees, the number of the user's two-hop followees, the number of tweets the user has authored, the average number of retweets received in the first 12 hours over all of the user's tweets, and the numbers of the user's tweets with no, single, and multiple bursts.

In the tweet-related class, we extract features such as the tweet's post time, first retweeting time, the presence/absence of hot topics in the tweet, the presence/absence of hot topics in its retweets, the presence/absence of @users in the tweet, the presence/absence of @users in its retweets, the number of retweets containing @users and the number of @users in its retweets, etc. For each tweet, we also build a retweet tree from its observed retweet sequence in the first 12 hours and extract features such as the maximum width, the maximum height, the number of retweet users, and the average path length.
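The retweet-tree features named above can be illustrated with a short sketch. It assumes the observed retweet sequence is available as (parent user, retweeting user) pairs and that width means the number of users at one depth level; both the input format and the function name are assumptions made for illustration, not the paper's implementation.

```python
from collections import defaultdict, deque


def retweet_tree_features(root, edges):
    """Compute tree features from (parent_user, retweeting_user) pairs."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    # Breadth-first traversal assigns each user its depth (path length).
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in depth:  # skip repeated users so the result stays a tree
                depth[child] = depth[node] + 1
                queue.append(child)

    retweet_depths = [d for user, d in depth.items() if user != root]
    if not retweet_depths:
        return {"max_width": 0, "max_height": 0,
                "num_users": 0, "avg_path_length": 0.0}

    level_sizes = defaultdict(int)
    for d in retweet_depths:
        level_sizes[d] += 1
    return {
        "max_width": max(level_sizes.values()),    # largest number of users on one level
        "max_height": max(retweet_depths),         # longest path from the original tweet
        "num_users": len(retweet_depths),          # distinct retweeting users
        "avg_path_length": sum(retweet_depths) / len(retweet_depths),
    }


# Toy tree (hypothetical): a -> b, a -> c, b -> d, d -> e
features = retweet_tree_features("a", [("a", "b"), ("a", "c"), ("b", "d"), ("d", "e")])
print(features)
```

Users that cannot be reached from the original author are ignored in this sketch; a real pipeline would need to decide how to treat such records.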

In our experiment, we exclude from the Sina Weibo dataset those records in

which the original tweets’ user ID could not be found in the followship network.

Finally, we build a training data set with 30,084 tweets with no multi-burst and

30,030 tweets with multi-burst.

We run a suite of 7 classifiers: Logistic Regression (LR), Random Forest (RF), Decision Tree (DT), Naive Bayes (NB), Support Vector Machine (SVM), Stochastic Gradient Descent (SGD), and k-Nearest Neighbor (kNN). We use 10-fold cross-validation for each classifier. The accuracy results are shown in Figure 6. We can observe that Random Forest, Decision Tree, k-Nearest Neighbor, and Logistic Regression achieve good prediction results in terms of accuracy (higher than 72%).
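The evaluation protocol can be sketched as follows. The paper does not state which implementation was used, so this is only an illustration of 10-fold cross-validation in plain Python; the 1-nearest-neighbor classifier is a simple stand-in for the seven classifiers, and all function names and the toy data are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def nearest_neighbor_predict(train_x, train_y, x):
    """Predict the label of x with 1-nearest neighbor (squared Euclidean distance)."""
    best = min(range(len(train_x)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)))
    return train_y[best]


def cross_val_accuracy(features, labels, k=10, seed=0):
    """Average accuracy over k folds; each fold serves once as the test set."""
    accuracies = []
    for fold in k_fold_indices(len(features), k, seed):
        held_out = set(fold)
        train_x = [features[i] for i in range(len(features)) if i not in held_out]
        train_y = [labels[i] for i in range(len(labels)) if i not in held_out]
        correct = sum(
            nearest_neighbor_predict(train_x, train_y, features[i]) == labels[i]
            for i in fold
        )
        accuracies.append(correct / len(fold))
    return sum(accuracies) / len(accuracies)


# Toy, well-separated data (hypothetical): 1 = multi-burst, 0 = no multi-burst.
toy_x = [[float(i), 0.0] for i in range(20)] + [[float(i) + 100.0, 0.0] for i in range(20)]
toy_y = [0] * 20 + [1] * 20
accuracy = cross_val_accuracy(toy_x, toy_y, k=10)
print(accuracy)
```

The sketch assumes at least k samples so that no fold is empty. Averaging per-fold accuracy, as done here, corresponds to the "average accuracy" axis of Fig. 6.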

Fig. 6. Accuracy of Classifiers: Logistic Regression (LR), Random Forest (RF), Decision Tree (DT), Naive Bayes (NB), Support Vector Machine (SVM), Stochastic Gradient Descent (SGD), and k-Nearest Neighbor (kNN) (x-axis: classifier; y-axis: average accuracy, 0 to 0.9)

We then analyze the effect of each feature on prediction. We take the logistic regression coefficient as the effect. The regression coefficients represent the change in the logit for each unit change in the feature. The larger the absolute value of the coefficient is, the greater the effect of the feature. Formally, we can use the likelihood ratio test or the Wald statistic to assess the significance of an individual feature. Our results show that there are only 20 features with relatively large coefficient values. Figure 7 plots the logistic regression coefficient for each feature, where the X-axis represents different features and the Y-axis shows each feature's coefficient value. We list the top 5 most significant features in Table 4. We can see that the average number of retweets with path length 1 over all of the user's tweets is the most significant feature, with a coefficient value of 3.95E-05. In our future work, we will conduct detailed correlation analysis and examine prediction performance after removing redundant features.

Fig. 7. Logistic Coefficient of Features (x-axis: feature index, 0 to 180; y-axis: logistic coefficient, -4 to 4, in units of 10^-5)

Table 4. Top 5 most significant features (PL1 denotes path length 1)

Index   Meaning                                                         Coefficient
121     Avg no of PL1 retweets of user's all tweets                     3.95E-05
87      Avg no of PL1 retweets (first 12h) of user's no-burst tweets    3.51E-05
50      Avg no of retweets (first 12h) of user's multi-burst tweets     -3.37E-05
84      Avg no of retweets (first 12h) of user's no-burst tweets        3.14E-05
82      Avg no of retweets of user's no-burst retweets                  -2.86E-05
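The coefficient interpretation and the two significance tests mentioned in this section can be written out explicitly. The block below uses standard textbook forms; the symbols (p for the probability that a tweet has multi-burst, x_i for the i-th feature, L for the likelihood) are notation assumed here, not taken from the paper.

```latex
% Logistic regression over the 178 extracted features x_1, ..., x_{178}
\[
\operatorname{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \sum_{i=1}^{178} \beta_i x_i
\]
% A unit increase in x_j (all other features fixed) adds \beta_j to the logit,
% i.e. multiplies the odds by e^{\beta_j}. For the largest coefficient in Table 4:
\[
\frac{\mathrm{odds}(x_j + 1)}{\mathrm{odds}(x_j)} = e^{\beta_j},
\qquad e^{3.95 \times 10^{-5}} \approx 1.0000395
\]
% Wald statistic for feature j (asymptotically chi-square with 1 degree of freedom
% under H_0: \beta_j = 0)
\[
z_j = \frac{\hat{\beta}_j}{\operatorname{SE}(\hat{\beta}_j)},
\qquad z_j^2 \sim \chi^2_1
\]
% Likelihood ratio statistic comparing the full model with the model without feature j
\[
G_j = -2\left(\ln L_{-j} - \ln L_{\mathrm{full}}\right) \sim \chi^2_1
\]
```

Because a coefficient is measured per unit of its feature, coefficients of features on very different scales are not directly comparable in magnitude, which is one reason the formal tests are useful alongside the raw values.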

5 Related Work

Examining retweet behavior has been an active research area recently [7–9,12,13]. For example, the authors in [7] study the coverage prediction of retweets, i.e., the number of times that a particular message posted by a user will be retweeted. In [13], the authors examine various factors such as user, message, and time and propose a factor graph model to predict whether a user will retweet a message. The authors in [9] study why people retweet and examine the anti-homophily phenomenon. In [8], the authors examine the use of log-linear modeling to identify multi-way interactions between retweet and various features such as power ratio, link structure and users' profile information. In [12], the authors analyze the ways in which hashtags spread on Twitter and show that widely-used hashtags on different topics spread significantly differently.

Change detection models [1,4] provide a standard approach to detecting deviations from baseline. Usually we assume the mean and variance of a distribution representing normal behavior and the mean and variance of another distribution representing behavior that is abnormal. We can measure deviations from normal using the generalized likelihood ratio. For example, in [4], the authors assume both distributions are Gaussian with the same variance and the change is reflected in the mean of the observations. In this context, they apply the generalized likelihood ratio to score changes from baseline.
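For the Gaussian, equal-variance case described above, the generalized log-likelihood ratio has a simple closed form: maximizing over the unknown shifted mean gives n(x̄ − μ0)² / (2σ²) for a window of n observations with sample mean x̄. The sketch below implements that textbook formula; it is not claimed to be the exact scoring procedure of [4], and the function name and toy values are hypothetical.

```python
def glr_mean_shift_score(window, baseline_mean, sigma):
    """Generalized log-likelihood ratio for a mean shift in Gaussian data.

    Assumes a known baseline mean and a known standard deviation `sigma`
    shared by the baseline and the shifted distribution.
    """
    n = len(window)
    if n == 0:
        raise ValueError("window must contain at least one observation")
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    window_mean = sum(window) / n
    return n * (window_mean - baseline_mean) ** 2 / (2.0 * sigma ** 2)


# Toy example (hypothetical): baseline mean 0, unit variance.
score_shifted = glr_mean_shift_score([1.0, 1.0, 1.0, 1.0], baseline_mean=0.0, sigma=1.0)
score_baseline = glr_mean_shift_score([0.0, 0.0, 0.0, 0.0], baseline_mean=0.0, sigma=1.0)
print(score_shifted, score_baseline)
```

A larger score means the window is harder to explain with the baseline mean; a window would be flagged once the score exceeds a chosen threshold.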

Techniques for finding burst patterns in data streams have also been presented in [6,11,15,16]. In [6], the authors examine bursty structure in temporal text streams (e.g., emails or blogs). They examine how word frequencies change over time. Bursty words are defined as those words with significantly higher frequency than others. They propose to model the stream using an infinite-state automaton, in which bursts appear naturally as state transitions. In [16], the authors examine point monitoring and aggregate monitoring in time series data streams and design a new structure, called the Shifted Wavelet Tree, for elastic burst monitoring. In [15], the authors propose a family of data structures based on the Shifted Binary Tree for elastic burst detection and develop a heuristic search algorithm to find an efficient structure given the input. In [11], the authors study how to detect, characterize and classify bursts in user query logs of large scale e-commerce systems. The authors build several models that continually detect newer bursts with minimal computation and provide a mechanism to rank the identified bursts based on a number of factors such as burst concentration, burst intensity and burst interestingness. They also propose several quantities to rank bursts including duration of burst, mass of burst, arrival rate for burst, span ratio, momentum of burst, and concentration of burst, and apply unsupervised learning techniques to classify the bursts based on their patterns. Although extensive work has been done in related fields for mining various temporal patterns, we notice that very little work has been done to detect and predict interesting burst patterns from large-scale retweet sequence data.

Message propagation can be regarded as a social contagion process. There has been research on rumor propagation [5,10,14]. In [14], the authors study the dynamics of an epidemic-like model for the spread of a rumor on a small-world network. In [10], the authors study the dynamics of a generic rumor model on complex scale-free topologies and investigate the impact of the interaction rules on the efficiency and reliability of the rumor process. In [5], the authors apply the susceptible-infectious-recovered and susceptible-infectious-susceptible models to study the spreading process in complex networks. However, we notice that very little work has been done to detect and predict burst patterns.

6 Conclusion

In this paper, we have proposed the use of Cantelli's inequality to identify bursts from retweet sequence data. With Cantelli's inequality, we do not need to assume the distribution of the retweet sequence data and can still identify bursts efficiently. We conducted a complete empirical study of burst patterns using Sina Weibo data and examined what factors affect bursts. We extracted various features from users' profiles, followship topology, and message topics and investigated whether and how accurately we can predict bursts using various classifiers based on the extracted features. Our empirical evaluation results show that burst prediction is feasible with appropriately extracted features and classifiers.
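To make the role of Cantelli's inequality concrete: for any distribution with mean μ and standard deviation σ, P(X − μ ≥ kσ) ≤ 1/(1 + k²) for k > 0, so a count above μ + kσ is unlikely under normal behavior without any distributional assumption. The sketch below turns this into a simple threshold rule. It is an illustration under stated assumptions (per-window retweet counts, mean and standard deviation estimated from the whole sequence, a chosen bound `alpha`), not the detection procedure defined earlier in the paper; all names and data are hypothetical.

```python
import math
import statistics


def cantelli_threshold(counts, alpha=0.1):
    """Return mean + k*std, with k chosen so the Cantelli bound 1/(1+k^2) equals alpha."""
    mean = statistics.mean(counts)
    std = statistics.pstdev(counts)
    k = math.sqrt(1.0 / alpha - 1.0)
    return mean + k * std


def detect_bursts(counts, alpha=0.1):
    """Return indices of windows whose retweet count exceeds the Cantelli threshold."""
    threshold = cantelli_threshold(counts, alpha)
    return [i for i, c in enumerate(counts) if c > threshold]


# Toy retweet counts per time window (hypothetical).
toy_counts = [2, 3, 2, 4, 3, 50, 3, 2, 4, 3, 2, 3]
burst_windows = detect_bursts(toy_counts, alpha=0.1)
print(burst_windows)
```

With alpha = 0.1 the multiplier is k = 3, so only the window with 50 retweets exceeds the threshold in this toy sequence.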

In our future work, we will investigate various regression analysis methods [3] on extracted features to predict when a tweet produces its first burst as well as following bursts. We will analyze the bursts to identify their causes by matching them with external events that might have caused them. We will also study how to classify bursts based upon their shapes, durations, and derived burst characteristics. We will examine various burst characteristics such as burst concentration, burst intensity and burst interestingness. We will study how the window size affects burst detection and categorization. Finally, we will study the use of topic modeling [2] to analyze tweet content and automatically identify the topics of every tweet.

Acknowledgments. The authors would like to thank the anonymous reviewers for their valuable comments and suggestions. This work was supported in part by the U.S. National Science Foundation (CCF-1047621), the U.S. National Institutes of Health (1R01GM103309), and the Chancellor's Special Fund from UNC Charlotte.

References

1. Basseville, M., Nikiforov, I., et al.: Detection of abrupt changes: theory and application, vol. 104. Prentice Hall, Englewood Cliffs (1993)
2. Blei, D., Ng, A., Jordan, M.: Latent dirichlet allocation. JMLR 3, 993–1022 (2003)
3. Cohen, J., Cohen, P.: Applied multiple regression/correlation analysis for the behavioral sciences. Lawrence Erlbaum (1975)
4. Curry, C., Grossman, R., Locke, D., Vejcik, S., Bugajski, J.: Detecting changes in large data sets of payment card data: a case study. In: KDD, pp. 1018–1022. ACM (2007)
5. Kitsak, M., Gallos, L., Havlin, S., Liljeros, F., Muchnik, L., Stanley, H., Makse, H.: Identification of influential spreaders in complex networks. Nature Physics 6(11), 888–893 (2010)
6. Kleinberg, J.: Bursty and hierarchical structure in streams. Data Mining and Knowledge Discovery 7(4), 373–397 (2003)
7. Luo, Z., Wang, Y., Wu, X.: Predicting retweeting behavior based on autoregressive moving average model. In: Wang, X.S., Cruz, I., Delis, A., Huang, G. (eds.) WISE 2012. LNCS, vol. 7651, pp. 777–782. Springer, Heidelberg (2012)
8. Luo, Z., Wu, X., Cai, W., Peng, D.: Examining multi-factor interactions in microblogging based on log-linear modeling. In: ASONAM (2012)
9. Macskassy, S.A., Michelson, M.: Why do people retweet? anti-homophily wins the day! In: ICWSM (2011)
10. Moreno, Y., Nekovee, M., Pacheco, A.: Dynamics of rumor spreading in complex networks. Physical Review E 69(6), 066130 (2004)
11. Parikh, N., Sundaresan, N.: Scalable and near real-time burst detection from ecommerce queries. In: KDD, pp. 972–980. ACM (2008)
12. Romero, D.M., Meeder, B., Kleinberg, J.: Differences in the mechanics of information diffusion across topics: idioms, political hashtags, and complex contagion on twitter. In: WWW, pp. 695–704. ACM (2011)
13. Yang, Z., Guo, J., Cai, K., Tang, J., Li, J., Zhang, L., Su, Z.: Understanding retweeting behaviors in social networks. In: CIKM, pp. 1633–1636. ACM (2010)
14. Zanette, D.: Dynamics of rumor propagation on small-world networks. Physical Review E 65(4), 041908 (2002)
15. Zhang, X., Shasha, D.: Better burst detection. In: ICDE, pp. 146–146. IEEE (2006)
16. Zhu, Y., Shasha, D.: Efficient elastic burst detection in data streams. In: KDD, pp. 336–345. ACM (2003)

#FewThingsAboutIdioms: Understanding Idioms and Its Users in the Twitter Online Social Network

Koustav Rudra1(B), Abhijnan Chakraborty1, Manav Sethi1, Shreyasi Das1, Niloy Ganguly1, and Saptarshi Ghosh2,3

1 Department of CSE, Indian Institute of Technology Kharagpur, Kharagpur, India
koustav.rudra@cse.iitkgp.ernet.in
2 Max Planck Institute for Software Systems, Kaiserslautern, Germany
3 Department of CST, Indian Institute of Engineering Science and Technology Shibpur, Howrah, India

Abstract. To help users find popular topics of discussion, Twitter periodically publishes ‘trending topics’ (trends) which are the most discussed keywords (e.g., hashtags) at a certain point of time. Inspection of the trends over several months reveals that while most of the trends are related to events in the off-line world, such as popular television shows, sports events, or emerging technologies, a significant fraction are not related to any topic / event in the off-line world. Such trends are usually known as idioms, examples being #4WordsBeforeBreakup, #10thingsIHateAboutYou etc. We perform the first systematic measurement study on Twitter idioms. We find that tweets related to a particular idiom normally do not cluster around any particular topic or event. There is a set of users in Twitter who predominantly discuss idioms (common, not-so-popular, but active users who mostly use Twitter as a conversational platform), as opposed to other users who primarily discuss topical contents. The implication of these findings is that within a single online social network, activities of users may have very different semantics; thus, tasks like community detection and recommendation may not be accomplished perfectly using a single universal algorithm. Specifically, we run two (link-based and content-based) algorithms for community detection on the Twitter social network, and show that idiom-oriented users get clustered better in one while topical users get clustered better in the other. Finally, we build a novel service which shows trending idioms and recommends idiom users to follow.

1 Introduction

Twitter is now considered more of an ‘information network’ than a social network [6] and almost the entire focus of the research community has been on ‘topical’ content in Twitter, such as tweets / hashtags related to sports or technology or emergency situations in the off-line world [2]. However, a closer inspection of the Twitter trending topics (‘trends’ in short), i.e., keywords periodically declared by Twitter as being the most discussed at that point in time, indicates some exceptions to this view, and provides the motivation for the present study.

© Springer International Publishing Switzerland 2015
T. Cao et al. (Eds.): PAKDD 2015, Part I, LNAI 9077, pp. 108–121, 2015.
DOI: 10.1007/978-3-319-18038-0_9

#FewThingsAboutIdioms: Understanding Idioms and Its Users

Table 1. Percentage of Twitter trends collected over ten months, and classified into nine different categories that were identified by human volunteers (details in Section 2). Also given are a few examples of trends.

Category        %     Example trends
Entertainment   33%   #5sosonKiis, #IWishICould, #Austinonidol
Sports          30%   #argentinavsholanda, #lakers, #bravsger
Idioms          ...   ... angry when
Technology      8%    #iphone6, #galaxy4, AppleWatch, ios8
Politics        5%    #tcot, #pjnet, #obama, #gaza
...             5%    #amazon, #AlibabaIPO, #FedReserve
Religion        3%    #EidMubarak, #jesus, #citrt
Health          2%    #Ebola, #Who, #breastcancer
Others          5%    #garlicparmpizza, #filipino, cheesecake, pizza is healthy

We collected US trends over a duration of 10 months (January – October, 2014) using the Twitter API at 15-minute intervals. This gave about 18,500 distinct trending topics during this period. We then developed a classifier, Odin¹, and classified the trends into multiple categories such as sports, entertainment, technology etc.; these broad categories were identified by human volunteers (details in Section 2). Table 1 shows the distribution of the trends in the different broad categories. While most of the categories are topical and related to events in the off-line world, it can be observed that a special category, known as idioms², regularly becomes trending. Examples of idioms include #4WordsBeforeBreakup and #11ThingsAboutYou; these are apparently not related to any topic or event in the off-line world.

The frequent presence of such trends is intriguing: it raises the question whether their dynamics as well as the users discussing such trends are similar to those of the topical counterparts. To understand the dynamics, we collected tweets related to hundreds of idioms and the users who discuss them, and conducted a detailed measurement study. We find that the tweets containing idioms are mainly conversational in nature; for instance, they hardly contain URLs. On investigating the users who post the tweets (the idiom-users), we find that they are mostly general and active Twitter users, as opposed to being popular experts / celebrities who usually drive topics such as politics and entertainment. The idiom-users maintain close friendships among themselves and interact on diverse issues with their friends. Thus, the study reveals that hidden within the information network of Twitter, there is a social network of users who regularly have “non-topical” conversations among themselves.

¹ Named after the God of Wisdom according to Norse mythology; details in Section 2.
² In this study, we follow the definition of idioms given by [13]: an idiom is a keyword representing a conversational theme on Twitter, consisting of a concatenation of at least two common words which does not include names of people, places or music albums etc.

Such an inference has far-reaching implications. It essentially means that multiple dominant dynamics are present in the same social network, so the standard tasks like community detection, recommendation, and so on, cannot be done using a one-parameter-fits-all approach. An algorithm to identify (recommend) topical groups might fail to identify (recommend) idiom-users. To test this proposition, we run two community detection algorithms: one identifying topical groups [2] and the other, Infomap [14], which detects communities using link structure. We find that the idiom-users are well identified by Infomap while the topical groups are better identified by [2]. This establishes that different approaches for tasks such as clustering may have different utilities in a heterogeneous online social network. Further, considering that all existing recommender services are specifically meant to recommend topical experts, we develop a service Idiomatic where one can easily follow popular idiom-users, see the recent and past trending idioms and post tweets using them.

2 Classification of Trends

In order to perform a large scale study on idioms and the trends related to topics / events in the off-line world, we built an automatic classifier, Odin, to distinguish particular trends based on whether they are idioms or related to some topic. Note that some prior studies [7,21] have also attempted to classify trends (not necessarily into the same categories found by Odin), utilizing the textual contents of the tweets containing the trends. However, tweets (restricted to 140 characters) often contain informal language and abbreviations, which potentially results in lower classification accuracy [21]. Hence, we adopt a different approach that combines both tweets and related web documents and uses several web-based knowledge engines to perform the classification. Odin classifies a given trend following the steps presented below.

2.1 Preprocessing

Segmentation: Trends often consist of multiple words [13]. Recognizing the words is easy for multi-word phrases and hashtags written in CamelCase style (e.g., #WorldCupSoccer), but is very difficult for trends which simply have the words juxtaposed without any separation (e.g., #everythingididntsay). Since it is important to identify the individual words which make up a trend in order to understand its topic, trends need to be segmented into the component words. Odin follows a modified version of the Viterbi Algorithm [1], which uses a model of word distribution to calculate the most probable character sequence forming a word. Odin computes the [...] Given a trend, Odin segments the trend into its constituent words based on these calculated probability estimates (details omitted for brevity).
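Since the details of Odin's segmentation are omitted, the following is only a sketch of the standard Viterbi-style dynamic program for word segmentation with a unigram word model, not Odin's modified version. The word probabilities, the penalty for unknown strings, and the function name are all assumptions made for illustration.

```python
import math


def segment(text, word_probs, max_word_len=20, unknown_prob=1e-10):
    """Split `text` into the word sequence with the highest unigram log-probability."""
    n = len(text)
    best_score = [0.0] + [-math.inf] * n  # best_score[i]: best log-prob of text[:i]
    back = [0] * (n + 1)                  # back[i]: start index of the last word in text[:i]
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in word_probs:
                log_prob = math.log(word_probs[word])
            else:
                # Unknown strings are penalized once per character.
                log_prob = len(word) * math.log(unknown_prob)
            score = best_score[start] + log_prob
            if score > best_score[end]:
                best_score[end] = score
                back[end] = start

    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Toy unigram model (hypothetical probabilities).
toy_probs = {"everything": 0.01, "every": 0.01, "thing": 0.01,
             "i": 0.05, "did": 0.02, "didnt": 0.01, "say": 0.02}
segmented = segment("everythingididntsay", toy_probs)
print(segmented)
```

The leading `#` of a hashtag would be stripped before calling the function. In practice the unigram probabilities would be estimated from a large corpus rather than written by hand.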

Categorization of Related Web Documents: Odin searches different Web search engines (e.g., Google, Bing) with the segmented trend, to get a large set of web-pages relevant to the given trend. Often the tweets containing the trend have URLs, which become another source for getting related web-pages.³ For a given trend, Odin collects all the web-pages pointed to from the tweets and returned by the search engines; a set of category keywords is then extracted for these collected web-pages using the NLP-based AlchemyAPI web service (www.alchemyapi.com).

Entity Extraction and Categorization: Sometimes the trend contains names of people, organisations or locations (e.g., #EMABiggestFansJustinBieber); detecting these might give a clear idea of the category of the trend. Similarly, the web documents and the tweets associated with a particular trend have many such named entities present in them. Odin extracts such entities using AlchemyAPI and then queries Freebase (www.freebase.com) to know the ‘notable type’ of such named entities (e.g., according to Freebase, the notable type for ‘Justin Bieber’ is ‘/music/artist’).

2.2 Classification

At the end of the preprocessing steps, for a given trend, Odin collects the categories of the related web documents and the notable types of the related named entities. Treating the number of web documents and named entities in the various categories as features, Odin uses a Support Vector Machine (SVM) classifier with a Radial Basis Function kernel to classify a particular trend into one of the 9 categories shown in Table 1.

Training Data Preparation: For creation of training data, three human volunteers (regular users of Twitter, who are not authors of this paper) were asked to manually inspect 700 distinct trends collected during the first two weeks of January 2014 (along with tweets containing these trends), and classify the trends into different categories. The volunteers identified the nine broad categories shown in Table 1, such as Entertainment, Sports, Technology, and Idioms (following the definition of idioms in [13]). Out of the 700 trends, all three volunteers agreed upon a particular category for 575 trends. We created the training data considering this unanimous categorization as the ground truth.
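The unanimous-agreement filter described above is simple to express in code. The sketch below assumes the volunteers' labels are stored as one list per trend; the data layout, the function name, and the toy labels (including the disagreement shown) are hypothetical.

```python
def unanimous_labels(annotations):
    """Keep only the trends on which every annotator chose the same category."""
    ground_truth = {}
    for trend, labels in annotations.items():
        if labels and len(set(labels)) == 1:
            ground_truth[trend] = labels[0]
    return ground_truth


# Toy annotations from three volunteers (hypothetical labels).
toy_annotations = {
    "#lakers": ["Sports", "Sports", "Sports"],
    "#iphone6": ["Technology", "Technology", "Technology"],
    "#IWishICould": ["Entertainment", "Idioms", "Entertainment"],
}
training_labels = unanimous_labels(toy_annotations)
print(training_labels)
```

Applied to the paper's setting, such a filter would retain the 575 of 700 trends on which all three volunteers agreed and discard the rest.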

Classification Performance: Standard 10-fold cross-validation on the data of the 575 trends showed that Odin attains 77.15% accuracy in predicting trend categories, which is good considering that it is a complex nine-class classification

³ [...] since these pages usually do not have much content to help the topic categorization process.

Table 2. Statistics of data collected

Property                                             Idiom    Sports   Entertainment   Technology
Number of trends                                     150      150      150             150
Total #tweets containing the trends (millions)       6.205    6.787    6.967           6.105
Mean #tweets per trend                               41,369   45,257   46,455          40,721
Total #distinct users posting the trend (millions)   2.74     2.71     1.90            1.75
Mean #distinct users per trend                       18,315   18,098   12,725          11,705

3 Dataset

Since most of the Twitter trends were related to the three topics entertainment, sports, and technology (see Table 1), we decided to focus on idioms and trends related to these three topics; the trends related to any of these three topics are collectively referred to as ‘topical trends’. For each of the trends belonging to the four selected categories, we collected as many tweets containing the trend as possible using the Twitter search API. To get a better understanding about the trends, in our analysis as presented in later sections, we used only those trends for which we were able to collect more than 30,000 tweets. To maintain uniformity across categories, we finally selected a set of 150 trends related to each of the categories (the actual distribution is stated in Table 1).

For each of the 600 selected trends, we further collected detailed statistics about all the users (including their profile details, social links and recently posted tweets) who posted a tweet containing any of the selected trends. Table 2 summarizes the statistics of the data collected for the trends of the four categories.

4 Comparing Idioms and Topical Trends

In this section, we compare how idioms and topical trends are discussed in the

Twitter social network, and the users who discuss them frequently.

4.1 How Trends Are Discussed in Twitter

We first analyze how the trends of different categories are, in general, discussed in Twitter. For a given trend t, we consider all tweets containing t, and measure what percentage of these tweets contain other hashtags (apart from t itself), and URLs.
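The two percentages defined above can be computed as in the following sketch. It assumes the tweets containing the trend are available as plain text strings and uses simple regular expressions for hashtags and URLs; the patterns, the function name, and the toy tweets are assumptions and not the paper's tooling.

```python
import re

HASHTAG_PATTERN = re.compile(r"#\w+")
URL_PATTERN = re.compile(r"https?://\S+")


def co_occurrence_percentages(trend, tweets):
    """Return (% of tweets with a hashtag other than `trend`, % of tweets with a URL)."""
    if not tweets:
        return 0.0, 0.0
    trend_lower = trend.lower()
    with_other_hashtag = 0
    with_url = 0
    for text in tweets:
        # Remove URLs first so that URL fragments such as ".../#top" are not counted as hashtags.
        text_without_urls = URL_PATTERN.sub(" ", text)
        hashtags = {tag.lower() for tag in HASHTAG_PATTERN.findall(text_without_urls)}
        if hashtags - {trend_lower}:
            with_other_hashtag += 1
        if URL_PATTERN.search(text):
            with_url += 1
    total = len(tweets)
    return 100.0 * with_other_hashtag / total, 100.0 * with_url / total


# Toy tweets for one trend (hypothetical).
toy_tweets = [
    "Go #Lakers!",
    "#lakers win #NBA http://example.com/a",
    "watching #lakers tonight",
    "#lakers recap https://example.com/b",
]
percentages = co_occurrence_percentages("#lakers", toy_tweets)
print(percentages)
```

Averaging these per-trend percentages over all trends of a category would give the per-category mean values that the analysis compares.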

Figure 1 shows mean values of the percentage of tweets containing other hashtags and URLs, where the mean values are computed over all trends of a particular category. Statistical tests such as the two-sample KS test and the Mann-Whitney U test with significance level 0.05 show that there is a significant difference in the distribution of the mean values among the four categories. As expected, we find that the topical trends are much more likely to be accompanied by other hashtags and URLs
