Z. Luo et al.
Fig. 4. Path length distribution of retweets of each topic (X-axis: path length)
Table 3. The information of different types of users: avg. path length, avg. burst duration (days), and avg. life duration (days)
Burst Pattern vs. Users. We examine whether the type of user who posts a tweet affects its burst pattern. Table 3 shows a general comparison of three types of users: top 100, top 100-1000, and normal users. We define the top 100 users as those who rank among the top 100 in terms of the total number of retweets each user receives. We can see that there are significant differences in path length, peak time, and duration among the three types of users. Tweets from the top 100 most influential users have much shorter path lengths, burst durations, and tweet life durations than those from the top 100-1000 and normal users.
Figure 5 shows the path length distribution for each type of user under study.
We can observe that the proportions of retweets of those tweets authored by the
Fig. 5. Path length distribution of retweets from three types of users (X-axis: path length)
On Burst Detection and Prediction in Retweeting Sequence
top 100 users with path length 1 and 2 are 46% and 39%, respectively, which are much higher than the corresponding proportions for tweets from the top 100-1000 and normal users. This phenomenon shows that the top 100 most influential users can propagate their messages more quickly in the microblogging site than the other types of users.
We are interested in the following prediction problem: given a tweet with known
information about its content, its user proﬁle, the followship topology, and the
observed retweet sequence in the first 12 hours, can we predict whether the tweet will have a multi-burst later in its life cycle?
One challenge here is what kind of features we can extract from the known
information and how useful they are for burst prediction. In our study, we extract
178 features from the a priori known information of a tweet (i.e., its topics, user
proﬁle, followship topology, and its observed retweet sequence in the ﬁrst 12
hours). The extracted features can be roughly grouped into two main classes:
user-related and tweet-related.
In the user-related class, we extract features from the proﬁle of the user who
posts the original tweet. For example, we extract the number of his immediate
followees, the number of his two-hop followees, the number of tweets the user
has authored, the average number of retweets received in the ﬁrst 12 hours for
all his tweets, and the numbers of tweets with no, single, and multiple bursts.
In the tweet-related class, we extract the features such as the tweet’s post
time, ﬁrst retweeting time, the presence/absence of hot topics in the tweet, the
presence/absence of hot topics in its retweets, the presence/absence of @users in
the tweet, the presence/absence of @users in its retweets, the number of retweets
containing @users and the number of @users in its retweets, etc. For each tweet,
we also build a retweet tree from its observed retweet sequence in the ﬁrst 12
hours and extract features such as the maximum width, the maximum height,
the number of retweet users, and the average path length.
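A minimal sketch of how such tree-shape features might be computed is shown below; the edge-list representation and function name are our own assumptions, not the paper's implementation:

```python
from collections import defaultdict, deque

def retweet_tree_features(edges, root):
    """Compute tree-shape features of a retweet cascade.

    edges: list of (parent_user, child_user) retweet edges observed in
    the first 12 hours; root is the original tweet's author.  Returns
    the maximum width, maximum height (depth), number of retweet users,
    and average path length (depth) over the retweeting nodes.
    """
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    depth = {root: 0}
    width_per_level = defaultdict(int)
    queue = deque([root])
    while queue:                      # breadth-first traversal
        node = queue.popleft()
        width_per_level[depth[node]] += 1
        for c in children[node]:
            depth[c] = depth[node] + 1
            queue.append(c)

    retweeters = [n for n, d in depth.items() if d > 0]
    max_height = max(depth.values())
    max_width = max(width_per_level.values())
    avg_path_len = sum(depth[n] for n in retweeters) / len(retweeters)
    return max_width, max_height, len(retweeters), avg_path_len
```

For example, the cascade `[("u0", "u1"), ("u0", "u2"), ("u1", "u3")]` rooted at `u0` has width 2, height 2, three retweet users, and average path length 4/3.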
In our experiment, we exclude from the Sina Weibo dataset those records in
which the original tweets’ user ID could not be found in the followship network.
Finally, we build a training data set with 30,084 tweets with no multi-burst and
30,030 tweets with multi-burst.
We run a suite of seven classifiers: Logistic Regression (LR), Random Forest (RF), Decision Tree (DT), Naive Bayes (NB), Support Vector Machine (SVM), Stochastic Gradient Descent (SGD), and k-Nearest Neighbor (kNN). We perform 10-fold cross-validation for each classifier. The accuracy results are shown in Figure 6. We
can observe that Random Forest, Decision Tree, k-Nearest Neighbor, and Logistic
Regression achieve good prediction results in terms of accuracy (higher than 72%).
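The experiment above can be sketched with scikit-learn; the random feature matrix below stands in for the real 178-dimensional data, so only the pipeline, not the reported accuracies, is meaningful:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# X: one row of 178 features per tweet; y: 1 = multi-burst, 0 = no multi-burst.
rng = np.random.default_rng(0)        # stand-in for the real Sina Weibo data
X = rng.normal(size=(200, 178))
y = rng.integers(0, 2, size=200)

classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(),
    "DT": DecisionTreeClassifier(),
    "NB": GaussianNB(),
    "SVM": SVC(),
    "SGD": SGDClassifier(),
    "kNN": KNeighborsClassifier(),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```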
We then analyze the effect of each feature on prediction, taking the logistic regression coefficient as the measure of effect. The regression coefficients represent the change in the logit for each unit change in the feature: the larger the absolute value of the coefficient, the greater the feature's effect. Formally, we can
Fig. 6. Accuracy of Classifiers: Logistic Regression (LR), Random Forest (RF), Decision
Tree (DT), Naive Bayes (NB), Support Vector Machine (SVM), Stochastic Gradient
Descent (SGD), and k-Nearest Neighbor (kNN)
Fig. 7. Logistic Coeﬃcient of Features
use the likelihood ratio test or the Wald statistic to assess the signiﬁcance of an
individual feature. Our results show that there are only 20 features with relatively large coeﬃcient values. Figure 7 plots the logistic regression coeﬃcient for
each feature where X-axis represents diﬀerent features and Y-axis shows each
feature's coefficient value. We list the top 5 most significant features in Table 4. We
can see that the average number of retweets with path length 1 over all the user's
Table 4. Top 5 most signiﬁcant features (PL1 denotes path length 1)
1. Avg no of PL1 retweets of user's all tweets
2. Avg no of PL1 retweets (first 12h) of user's no-burst tweets
3. Avg no of retweets (first 12h) of user's multi-burst tweets
4. Avg no of retweets (first 12h) of user's no-burst tweets
5. Avg no of retweets of user's no-burst retweets
tweets is the most significant feature, with coefficient value 3.95E-05. In our
future work, we will conduct detailed correlation analysis and examine prediction
performance after removing those redundant features.
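Ranking features by the absolute value of their logistic regression coefficients, as described above, could look like the following sketch; the synthetic data and placeholder feature names are our own, not the paper's:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))
# Make feature 2 genuinely predictive so it earns a large coefficient.
logits = 2.0 * X[:, 2] + 0.1 * rng.normal(size=500)
y = (logits > 0).astype(int)

feature_names = [f"f{i}" for i in range(6)]   # placeholders for the 178 names
model = LogisticRegression(max_iter=1000).fit(X, y)

# Rank features by the absolute value of their logistic coefficient.
ranking = sorted(zip(feature_names, model.coef_[0]),
                 key=lambda p: abs(p[1]), reverse=True)
top5 = ranking[:5]
print(top5[0][0])   # the most influential feature
```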
Examining retweet behavior has been an active research area recently [7-9,12,13]. For example, the authors in [7] study the coverage prediction of retweets, i.e., the number of times that a particular message posted by a user will be retweeted. In [13], the authors examine various factors such as user, message, and time, and propose a factor graph model to predict whether a user will retweet a message. The authors in [9] study why people retweet and examine the anti-homophily phenomenon. In [8], the authors examine the use of log-linear modeling to identify multi-way interactions between retweet and various features such as power ratio, link structure, and users' profile information. In [12], the authors analyze the ways in which hashtags spread on Twitter and show that widely-used hashtags on different topics spread in significantly different ways.
Change detection models [1,4] provide a standard approach to detecting deviations from a baseline. Usually we assume the mean and variance of a distribution representing normal behavior and the mean and variance of another distribution representing abnormal behavior, and measure deviations from normal using the generalized likelihood ratio. For example, in [4], the authors assume both distributions are Gaussian with the same variance and that the change is reflected in the mean of the observations. In this context, they apply the generalized likelihood ratio to score changes from the baseline.
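For the Gaussian same-variance setting just described, the generalized likelihood ratio has a simple closed form; the sketch below is our own illustration, not code from the cited work:

```python
import numpy as np

def glr_mean_shift(window, mu0, sigma):
    """Generalized likelihood ratio score for a mean shift.

    Both the baseline and the changed distribution are assumed Gaussian
    with the same (known) variance; only the mean changes.  Maximizing
    the likelihood of the changed mean reduces the GLR statistic to
    n * (mean(window) - mu0)^2 / (2 * sigma^2).
    """
    window = np.asarray(window, dtype=float)
    n = len(window)
    return n * (window.mean() - mu0) ** 2 / (2 * sigma ** 2)

baseline = [0.1, -0.2, 0.05, 0.0]     # behaves like N(0, 1)
shifted = [3.1, 2.8, 3.3, 2.9]        # mean has clearly moved
print(glr_mean_shift(baseline, mu0=0.0, sigma=1.0))  # near zero
print(glr_mean_shift(shifted, mu0=0.0, sigma=1.0))   # large: a clear change
```

A change is declared when the score exceeds a chosen threshold; the threshold trades off false alarms against detection delay.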
Techniques for finding burst patterns in data streams have also been presented in [6,11,15,16]. In [6], the authors examine bursty structure in temporal text streams (e.g., emails or blogs), studying how word frequencies change over time. Bursty words are defined as those with significantly higher frequency than others. They propose to model the stream using an infinite-state automaton, in which bursts appear naturally as state transitions.
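A drastically simplified, two-state version of such a burst automaton can be decoded with a small Viterbi program; the rates and costs below are our own toy choices, not Kleinberg's exact parameterization:

```python
import math

def kleinberg_two_state(gaps, s=3.0, gamma=1.0):
    """Two-state burst automaton decoded with Viterbi.

    gaps: inter-arrival times between events.  State 0 emits gaps at the
    base rate, state 1 ('burst') at s times that rate; entering the
    burst state costs gamma * ln(n).  Returns the cheapest state per gap.
    """
    n = len(gaps)
    base_rate = n / sum(gaps)
    rates = [base_rate, s * base_rate]

    def emit_cost(state, gap):
        lam = rates[state]
        return -(math.log(lam) - lam * gap)   # -log of exponential density

    trans = gamma * math.log(n)               # cost of the 0 -> 1 move
    cost = [[0.0, float("inf")]] + [[0.0, 0.0] for _ in range(n)]
    back = [[0, 0] for _ in range(n + 1)]
    for t, gap in enumerate(gaps, start=1):
        for j in (0, 1):
            prev_cost, prev_state = min(
                (cost[t - 1][i] + (trans if (i, j) == (0, 1) else 0.0), i)
                for i in (0, 1))
            cost[t][j] = prev_cost + emit_cost(j, gap)
            back[t][j] = prev_state

    state = 0 if cost[n][0] <= cost[n][1] else 1   # backtrack cheapest path
    states = [0] * n
    for t in range(n, 0, -1):
        states[t - 1] = state
        state = back[t][state]
    return states

# Ten slow gaps, five fast gaps (the burst), ten slow gaps.
labels = kleinberg_two_state([1.0] * 10 + [0.1] * 5 + [1.0] * 10)
```

With these parameters the five fast gaps are labeled as the burst state, while the surrounding slow gaps stay in the base state.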
In [16], the authors examine point monitoring and aggregate monitoring in time series data streams and design a new structure, called the Shifted Wavelet Tree, for elastic burst monitoring. In [15], the authors propose a family of data structures based on the Shifted Binary Tree for elastic burst detection and develop a heuristic search algorithm to find an efficient structure given the input. In [11], the authors study how to detect, characterize, and classify bursts in the user query logs of large-scale e-commerce systems. They build several models that continually detect newer bursts with minimal computation and provide a mechanism to rank the identified bursts based on factors such as burst concentration, burst intensity, and burst interestingness. They also propose several quantities for ranking bursts, including duration, mass, arrival rate, span ratio, momentum, and concentration, and apply unsupervised learning techniques to classify the bursts based on their patterns. Although extensive work has been done in related fields for mining various
temporal patterns, we notice that very little work has been done to detect and
predict interesting burst patterns from large-scale retweet sequence data.
Message propagation can be regarded as a social contagion process. There has been research on rumor propagation [5,10,14]. In [14], the authors study the dynamics of an epidemic-like model for the spread of a rumor on a small-world network. In [10], the authors study the dynamics of a generic rumor model on complex scale-free topologies and investigate the impact of the interaction rules on the efficiency and reliability of the rumor process. In [5], the authors apply the susceptible-infectious-recovered and susceptible-infectious-susceptible models to study the spreading process in complex networks. However, we notice that very little work has been done to detect and predict burst patterns.
In this paper, we have proposed the use of Cantelli's inequality to identify bursts from retweet sequence data. With Cantelli's inequality, we do not need to assume a distribution for the retweet sequence data and can still identify bursts efficiently. We conducted a complete empirical study of burst patterns using Sina Weibo data and examined what factors affect bursts. We extracted various features from users' profiles, followship topology, and message topics, and investigated whether, and how accurately, we can predict bursts using various classifiers based on the extracted features. Our empirical evaluation results show that burst prediction is feasible with appropriately extracted features.
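To illustrate the idea, the sketch below flags bursts in a toy retweet-count sequence using Cantelli's one-sided bound; the rolling-baseline windowing is our assumption, since the paper's exact scheme is not spelled out in this excerpt:

```python
import math
import statistics

def detect_bursts(counts, window=5, alpha=0.05):
    """Flag bursty time bins using Cantelli's inequality, which holds
    for *any* distribution:

        P(X >= mu + k * sigma) <= 1 / (1 + k^2)

    Choosing k = sqrt(1/alpha - 1) guarantees that, under the baseline,
    a bin exceeds the threshold with probability at most alpha.  The
    baseline mean and std are estimated over the preceding `window`
    bins (a rolling baseline is our own simplification).
    """
    k = math.sqrt(1 / alpha - 1)
    bursts = []
    for i in range(window, len(counts)):
        past = counts[i - window:i]
        mu = statistics.mean(past)
        sigma = statistics.pstdev(past)
        if counts[i] > mu + k * sigma:
            bursts.append(i)
    return bursts

counts = [3, 2, 4, 3, 2, 50, 48, 3, 2, 1]   # toy retweets per hour
print(detect_bursts(counts))                 # → [5]
```

The bound is distribution-free, which is exactly why no assumption about the retweet sequence's distribution is needed.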
In our future work, we will investigate various regression analysis methods [3] on the extracted features to predict when a tweet produces its first burst as well as subsequent bursts. We will analyze the causes of bursts by matching them against external events that might have triggered them. We will also study how to classify bursts based upon their shapes, durations, and derived burst characteristics, and will examine characteristics such as burst concentration, burst intensity, and burst interestingness. We will study how the window size affects burst detection and categorization. Finally, we will study the use of topic modeling [2] to analyze tweet content and automatically identify the topics of every tweet.
Acknowledgments. The authors would like to thank anonymous reviewers for their
valuable comments and suggestions. This work was supported in part by U.S. National
Science Foundation (CCF-1047621), U.S. National Institute of Health (1R01GM103309),
and the Chancellor’s Special Fund from UNC Charlotte.
1. Basseville, M., Nikiforov, I., et al.: Detection of abrupt changes: theory and application, vol. 104. Prentice Hall, Englewood Cliﬀs (1993)
2. Blei, D., Ng, A., Jordan, M.: Latent dirichlet allocation. JMLR 3, 993–1022 (2003)
3. Cohen, J., Cohen, P.: Applied multiple regression/correlation analysis for the
behavioral sciences. Lawrence Erlbaum (1975)
4. Curry, C., Grossman, R., Locke, D., Vejcik, S., Bugajski, J.: Detecting changes in
large data sets of payment card data: a case study. In: KDD, pp. 1018–1022. ACM
5. Kitsak, M., Gallos, L., Havlin, S., Liljeros, F., Muchnik, L., Stanley, H., Makse, H.:
Identification of influential spreaders in complex networks. Nature Physics 6(11),
888–893 (2010)
6. Kleinberg, J.: Bursty and hierarchical structure in streams. Data Mining and
Knowledge Discovery 7(4), 373–397 (2003)
7. Luo, Z., Wang, Y., Wu, X.: Predicting retweeting behavior based on autoregressive
moving average model. In: Wang, X.S., Cruz, I., Delis, A., Huang, G. (eds.) WISE
2012. LNCS, vol. 7651, pp. 777–782. Springer, Heidelberg (2012)
8. Luo, Z., Wu, X., Cai, W., Peng, D.: Examining multi-factor interactions in
microblogging based on log-linear modeling. In: ASONAM (2012)
9. Macskassy, S.A., Michelson, M.: Why do people retweet? anti-homophily wins the
day! In: ICWSM (2011)
10. Moreno, Y., Nekovee, M., Pacheco, A.: Dynamics of rumor spreading in complex
networks. Physical Review E 69(6), 066130 (2004)
11. Parikh, N., Sundaresan, N.: Scalable and near real-time burst detection from ecommerce queries. In: KDD, pp. 972–980. ACM (2008)
12. Romero, D.M., Meeder, B., Kleinberg, J.: Diﬀerences in the mechanics of information diﬀusion across topics: idioms, political hashtags, and complex contagion on
twitter. In: WWW, pp. 695–704. ACM (2011)
13. Yang, Z., Guo, J., Cai, K., Tang, J., Li, J., Zhang, L., Su, Z.: Understanding
retweeting behaviors in social networks. In: CIKM, pp. 1633–1636. ACM (2010)
14. Zanette, D.: Dynamics of rumor propagation on small-world networks. Physical
Review E 65(4), 041908 (2002)
15. Zhang, X., Shasha, D.: Better burst detection. In: ICDE, pp. 146–146. IEEE (2006)
16. Zhu, Y., Shasha, D.: Eﬃcient elastic burst detection in data streams. In: KDD,
pp. 336–345. ACM (2003)
#FewThingsAboutIdioms: Understanding Idioms and Its Users in the Twitter Online Social Network
Koustav Rudra1(B), Abhijnan Chakraborty1, Manav Sethi1, Shreyasi Das1,
Niloy Ganguly1, and Saptarshi Ghosh2,3
1 Department of CSE, Indian Institute of Technology Kharagpur, Kharagpur, India
2 Max Planck Institute for Software Systems, Kaiserslautern, Germany
3 Department of CST, Indian Institute of Engineering Science and Technology Shibpur, Howrah, India
Abstract. To help users ﬁnd popular topics of discussion, Twitter periodically publishes ‘trending topics’ (trends) which are the most discussed
keywords (e.g., hashtags) at a certain point of time. Inspection of the
trends over several months reveals that while most of the trends are
related to events in the oﬀ-line world, such as popular television shows,
sports events, or emerging technologies, a significant fraction is not
related to any topic / event in the oﬀ-line world. Such trends are usually
known as idioms, examples being #4WordsBeforeBreakup, #10thingsIHateAboutYou etc. We perform the ﬁrst systematic measurement study
on Twitter idioms. We ﬁnd that tweets related to a particular idiom
normally do not cluster around any particular topic or event. There is a set of users in Twitter who predominantly discuss idioms – common, not-so-popular, but active users who mostly use Twitter as a conversational platform – as opposed to other users who primarily discuss topical
contents. The implication of these ﬁndings is that within a single online
social network, activities of users may have very diﬀerent semantics; thus,
tasks like community detection and recommendation may not be accomplished perfectly using a single universal algorithm. Speciﬁcally, we run
two (link-based and content-based) algorithms for community detection
on the Twitter social network, and show that idiom-oriented users get clustered better in one while topical users get clustered better in the other. Finally, we build a novel service which shows trending idioms and recommends idiom users to follow.
Twitter is now considered more of an 'information network' than a social network, and almost the entire focus of the research community has been on 'topical' content in Twitter, such as tweets / hashtags related to sports, technology, or emergency situations in the off-line world. However, a closer inspection of
the Twitter trending topics (‘trends’ in short) – keywords periodically declared
c Springer International Publishing Switzerland 2015
T. Cao et al. (Eds.): PAKDD 2015, Part I, LNAI 9077, pp. 108–121, 2015.
DOI: 10.1007/978-3-319-18038-0 9
#FewThingsAboutIdioms: Understanding Idioms and Its Users
Table 1. Percentage of Twitter trends collected over ten months, and classified into
nine different categories that were identified by human volunteers (details in Section 2).
Also given are a few examples of trends.

Category       %    Example trends
Entertainment  33%  #5sosonKiis, #IWishICould, #Austinonidol
Sports         30%  #argentinavsholanda, #lakers, #bravsger
Idioms         9%   #WhenIWasATeenager, #FactsaboutMe, I get
Technology     8%   #iphone6, #galaxy4, AppleWatch, ios8
Politics       5%   #tcot, #pjnet, #obama, #gaza
Business       5%   #amazon, #AlibabaIPO, #FedReserve
Religion       3%   #EidMubarak, #jesus, #citrt
Health         2%   #Ebola, #Who, #breastcancer
Food           5%   #garlicparmpizza, #filipino, cheesecake, pizza is healthy
by Twitter as being the most discussed at that point in time – indicates some
exceptions to this view, and provides the motivation for the present study.
We collected US trends over a duration of 10 months (January – October,
2014) using the Twitter API at 15-minute intervals. This gave about 18,500 distinct trending topics during this period. We then developed a classifier Odin1 and classified the trends into multiple categories such as sports, entertainment,
technology etc. – these broad categories were identiﬁed by human volunteers
(details in Section 2). Table 1 shows the distribution of the trends in the different broad categories. While most of the categories are topical and related to
events in the off-line world, it can be observed that a special category, known as idioms2, regularly becomes trending. Examples of idioms include #4WordsBeforeBreakup and #11ThingsAboutYou; apparently these are not related to any topic or event in the off-line world.
The frequent presence of such trends is intriguing – it raises the question
whether their dynamics as well as the users discussing such trends are similar
to those of the topical counterparts. To understand the dynamics, we collected
tweets related to hundreds of idioms and the users who discuss them, and conducted a detailed measurement study. We ﬁnd that the tweets containing idioms
are mainly conversational in nature; for instance, they hardly contain URLs.
On investigating the users who post the tweets (the idiom-users), we ﬁnd that
they are mostly general and active Twitter users, as opposed to being popular
experts / celebrities who usually drive topics such as politics and entertainment.
The idiom-users maintain close friendships among themselves and interact on
diverse issues with their friends. Thus, the study unfurls that hidden within the
1 Named after the God of Wisdom in Norse mythology; details in Section 2.
2 In this study, we follow the definition of idioms given by  – an idiom is a keyword
representing a conversational theme on Twitter, consisting of a concatenation of at
least two common words which does not include names of people, places or music
K. Rudra et al.
information network of Twitter, there is a social network of users who regularly
have “non-topical” conversations among themselves.
Such an inference has far-reaching implications. It essentially means that
multiple dominant dynamics are present in the same social network – so the
standard tasks like community detection, recommendation, and so on, cannot
be done using a one-parameter-ﬁts-all approach. An algorithm to identify (recommend) topical groups might fail to identify (recommend) idiom-users. To test
this proposition, we run two community detection algorithms – one identifying
topical groups  and the other, Infomap  which detects communities using
link structure. We ﬁnd that the idiom-users are well identiﬁed by Infomap while
the topical groups are better identified by the topic-based approach. This establishes that different
approaches for tasks such as clustering may have diﬀerent utilities in a heterogeneous online social network. Further, considering that all existing recommender
services are speciﬁcally meant to recommend topical experts, we develop a service Idiomatic where one can easily follow popular idiom-users, see the recent
and past trending idioms and post tweets using them.
Classification of Trends
In order to perform a large scale study on idioms and the trends related to
topics / events in the oﬀ-line world, we built an automatic classiﬁer Odin, to
distinguish particular trends based on whether they are idioms or related to some
topic. Note that some prior studies [7,21] have also attempted to classify trends
(not necessarily into the same categories found by Odin), utilizing the textual
contents of the tweets containing the trends. However, tweets (restricted to 140
characters) often contain informal language and abbreviations which potentially
results in lower classiﬁcation accuracy . Hence, we adopt a diﬀerent approach
that combines both tweets and related web documents and uses several webbased knowledge engines to perform the classiﬁcation. Odin classiﬁes a given
trend following the steps presented below.
Segmentation: Trends often consist of multiple words; recognizing the component words is easy for multi-word phrases and hashtags written in CamelCase style (e.g., #WorldCupSoccer), but very difficult for trends which simply have the words juxtaposed without any separation (e.g., #everythingididntsay). Since it is important to identify the individual words which make up a trend in order to understand its topic, trends need to be segmented into their component words. Odin follows a modified version of the Viterbi Algorithm, which uses a model of word distribution to calculate the most probable character sequence forming a word. Odin computes the word distribution from the Google n-gram corpus (https://books.google.com/ngrams).
Given a trend, Odin segments it into its constituent words based on these calculated probability estimates (details omitted for brevity).
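A stripped-down version of such a most-probable segmentation can be written as a dynamic program; the toy unigram table below stands in for the Google n-gram statistics that Odin actually uses:

```python
import math

# Toy unigram probabilities; Odin derives these from the Google n-gram
# corpus, which we only mimic here.
WORD_PROB = {
    "everything": 1e-4, "every": 2e-4, "thing": 3e-4,
    "i": 1e-2, "didnt": 1e-4, "say": 1e-3, "did": 5e-4, "nt": 1e-7,
}
UNKNOWN = 1e-10   # smoothing for out-of-vocabulary fragments

def segment(text):
    """Most probable segmentation of a hashtag into words (a Viterbi-style
    dynamic program over split points)."""
    n = len(text)
    best = [(-math.inf, 0)] * (n + 1)   # (log-prob, previous split point)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - 12), end):   # cap word length at 12
            word = text[start:end]
            lp = best[start][0] + math.log(WORD_PROB.get(word, UNKNOWN))
            if lp > best[end][0]:
                best[end] = (lp, start)
    # Backtrack the chosen split points.
    words, pos = [], n
    while pos > 0:
        start = best[pos][1]
        words.append(text[start:pos])
        pos = start
    return list(reversed(words))

print(segment("everythingididntsay"))   # → ['everything', 'i', 'didnt', 'say']
```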
Categorization of Related Web Documents: Odin searches diﬀerent Web
search engines (e.g., Google, Bing) with the segmented trend, to get a large set of
web-pages relevant to the given trend. Often the tweets containing the trend have
URLs, which become another source for getting related web-pages.3 For a given
trend, Odin collects all the web-pages pointed from the tweets and returned
by the search engines; and then a set of category keywords are extracted for
these collected web-pages using the NLP-based AlchemyAPI web service (www.
Entity Extraction and Categorization: Sometimes a trend contains names of people, organisations, or locations (e.g., #EMABiggestFansJustinBieber), and detecting these can give a clear idea of the trend's category. Similarly, the web documents and the tweets associated with a particular trend have many such named entities present in them. Odin extracts such entities using AlchemyAPI and then queries Freebase (www.freebase.com) to learn the 'notable type' of each named entity (e.g., according to Freebase, the notable type for 'Justin Bieber' is '/music/artist').
At the end of these preprocessing steps, for a given trend, Odin has collected the categories of the related web documents and the notable types of the related named entities. Treating the numbers of web documents and named entities in the various categories as features, Odin uses a Support Vector Machine (SVM) classifier with a Radial Basis Function kernel to classify a particular trend into one of the nine categories shown in Table 1.
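The final classification step might be sketched as follows with scikit-learn; the feature dictionaries and category counts are invented for illustration, not taken from Odin's real data:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Each trend is described by counts of web-document categories and
# named-entity 'notable types' (toy values; the real counts come from
# AlchemyAPI and Freebase).
trends = [
    {"cat:sports": 12, "type:/sports/team": 4},
    {"cat:sports": 9, "type:/sports/athlete": 6},
    {"cat:music": 11, "type:/music/artist": 5},
    {"cat:music": 8, "type:/music/artist": 7},
]
labels = ["Sports", "Sports", "Entertainment", "Entertainment"]

# SVM with an RBF kernel, as in the text; DictVectorizer turns the
# count dictionaries into a feature matrix.
clf = make_pipeline(DictVectorizer(), SVC(kernel="rbf", gamma="scale"))
clf.fit(trends, labels)
print(clf.predict([{"cat:sports": 10, "type:/sports/team": 3}]))
```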
Training Data Preparation: To create the training data, three human volunteers (regular users of Twitter, who are not authors of this paper) were asked
to manually inspect 700 distinct trends collected during the ﬁrst two weeks
of January 2014 (along with tweets containing these trends), and classify the
trends into diﬀerent categories. The volunteers identiﬁed the nine broad categories shown in Table 1, such as Entertainment, Sports, Technology, Idioms
(following the deﬁnition of idioms in ). Out of the 700 trends, all three volunteers agreed upon a particular category for 575 trends. We created the training
data considering this unanimous categorization as the ground truth.
Classification Performance: Standard 10-fold cross-validation on the data of the 575 trends showed that Odin attains 77.15% accuracy in predicting trend categories, which is good considering that it is a complex nine-class classification task.
3 URLs leading to social media sites like Facebook, Twitter, and Instagram are ignored, since these pages usually do not have much content to help the topic categorization.
Table 2. Statistics of data collected for each of the four categories (Idiom, Sports, Entertainment, Technology): number of trends; total #tweets containing the trends (millions); mean #tweets per trend; total #distinct users posting the trend; mean #distinct users per trend.
Since most of the Twitter trends were related to the three topics entertainment,
sports, and technology (see Table 1), we decided to focus on idioms and trends
related to these three topics; the trends related to any of these three topics are
collectively referred to as ‘topical trends’. For each of the trends belonging to
the four selected categories, we collected as many tweets containing the trend
as possible using the Twitter search API. To get a better understanding about
the trends, in our analysis as presented in later sections, we used only those
trends for which we were able to collect more than 30,000 tweets. To maintain
uniformity across categories, we ﬁnally selected a set of 150 trends related to
each of the categories (the actual distribution is stated in Table 1).
For each of the 600 selected trends, we further collected detailed statistics
about all the users (including their proﬁle details, social links and recently posted
tweets) who posted a tweet containing any of the selected trends. Table 2 summarizes the statistics of the data collected for the trends of the four categories.
Comparing Idioms and Topical Trends
In this section, we compare how idioms and topical trends are discussed in the
Twitter social network, and the users who discuss them frequently.
How Trends Are Discussed in Twitter
We ﬁrst analyze how the trends of diﬀerent categories are, in general, discussed in
Twitter. For a given trend t, we consider all tweets containing t, and measure what
percentage of these tweets contain other hashtags (apart from t itself), and URLs.
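Computing these per-trend percentages is straightforward; the whitespace tokenization and toy tweets below are our own simplification of the real tweet objects:

```python
def trend_stats(tweets, trend):
    """Percentage of a trend's tweets that carry other hashtags or URLs.

    tweets: list of tweet texts containing `trend` (a toy representation;
    the study works on full tweet objects from the Twitter API).
    """
    other_hashtag = sum(
        any(tok.startswith("#") and tok.lower() != trend.lower()
            for tok in t.split())
        for t in tweets)
    url = sum("http" in t for t in tweets)
    n = len(tweets)
    return 100.0 * other_hashtag / n, 100.0 * url / n

tweets = [
    "#lakers great game #nba",
    "watch highlights http://example.com #lakers",
    "#lakers unbelievable finish",
]
hashtag_pct, url_pct = trend_stats(tweets, "#lakers")
```

Here one of the three tweets carries another hashtag and one carries a URL, so both percentages are 33.3%.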
Figure 1 shows mean values of the percentage of tweets containing other hashtags and URLs, where the mean values are computed over all trends of a particular
category. Statistical tests, namely the two-sample KS test and the Mann-Whitney U test at significance level 0.05, show that there is a significant difference in the distribution of the mean values among the four categories. Expectedly, we find that the
topical trends are much more likely to be accompanied by other hashtags and URLs