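Engineering field: Software Engineering
Applicant: jiang

Modeling Method Study on the Influence of Social Network Based on Micro-blog

Thesis Submitted to Tsinghua University in partial fulfillment of the requirement for the professional degree of Master of Software Engineering

by
Ma Xingjun
(Software Engineering)

Thesis Supervisor: Associate Professor Li Chunping

November, 2014

Abstract (Chinese version)

With the popularity of Twitter and the accumulation of massive amounts of data, how to filter these data sensibly and mine meaningful content from them has become an active research topic. Although the goals of data mining vary, feature selection and model construction are the key factors that affect the research. This thesis is based on the micro-blogging platform Twitter. It studies in detail the temporal features of users, that is, how their activity level varies over time, and uses these features to improve existing influence models. First, we carried out a statistical analysis of the temporal features of Twitter users and clustered their temporal patterns with the K-SC clustering algorithm. The statistical analysis and the clustering results show that the temporal patterns of Twitter users fall into six distinct types.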


[...] the number of times the friend has been listed, and the recipient's activity level at the time the Tweet was posted. Finally, building on the conclusions of the response prediction analysis and the idea behind the PageRank algorithm, this thesis proposes a new influence ranking model: TTR (Temporal-Topic Rank). In the experimental evaluation we verify that TTR performs better and is more flexible than TR (TwitterRank) and PR (PageRank) in three different influence ranking tasks. In the global influence ranking task, TTR not only identifies highly influential users but can also, by adjusting the penalty factor C, highlight among them the key users who affect information propagation. In the friend recommendation task, TTR outperforms TR in 6 of 7 scenarios and also outperforms PR in 6 scenarios; in the two scenarios where TR performs worst, TTR can greatly improve recommendation quality by adjusting the penalty factor. In addition, in the task of finding the most influential friend, the results of TTR are closer to the actual situation than those of TR and PR.

Keywords: Temporal Feature; Response Prediction; Influence Ranking; Social Network

Abstract

As Twitter has grown in popularity, more and more people have become active on it and massive amounts of data have accumulated. How to select proper features from these data and use them to mine valuable information has become a hot research topic. Although research targets differ, feature selection and modeling method are the two critical factors that affect today's research. Twitter features can be divided into four dimensions: social, topic, spatial, and temporal. The temporal dimension consists of the user temporal feature and the tweet temporal feature. The user temporal feature refers to the variation patterns of user behavior such as posting, retweeting, and replying, while the tweet temporal feature refers to the analogous variation patterns of tweets. This thesis studies the user temporal feature comprehensively and further investigates its influence on existing models.

First, using statistical methods, we computed the variation patterns of Twitter users' activity level and then clustered those patterns into six common classes with the K-SC clustering algorithm. The experimental results indicate that there are six distinct patterns, each representing one type of variation in user activity. They also show that the peak periods of the Twitter users' pattern curves are 12:00 to 23:00 on Monday and Tuesday and 20:00 to 23:00 at the weekend. Meanwhile, by computing the delay and look-back distributions of all responses, we find that 72% of retweets and 83% of replies take place within one hour. Moreover, 80% of retweets are made by looking back over no more than 100 historical tweets, and the figure for replies is 85%. These findings not only reveal users' reading habits but also indicate that tweets have a limited period of validity.

Furthermore, we investigated the influence of the user temporal feature on the response prediction model. Based on the new feature space, our new model can predict whether a user will retweet or reply to one of his or her friends' tweets with an accuracy of 80.55%. Compared with Petrovic's model, the new model achieves a balanced F-score of 86.6, better than Petrovic's 46.6. Beyond accuracy and performance, we ranked all the features according to their contributions to the model. The ranking shows that the top three features are the topic similarity between the friend and the follower, the number of times the friend has been listed, and the time feature of the follower.

In the third part of our research, we incorporate the user time feature into one of the current influence ranking models, TR (TwitterRank). The resulting model is denoted TTR (Temporal-Topic Rank). The new ranking results substantiate the assumption that the user time feature has a considerable effect on existing models: the correlation coefficient between the TTR and TR rankings is only 0.7627. Although different ranking models may suit different ranking tasks, the discrepancy between TR and TTR at least demonstrates that the user time feature has a measurable influence on TR.

Keywords: Temporal Pattern; Response Prediction; Influence Ranking; Social Network

Chapter 1 Introduction

1.1 Background and Significance

In recent years, the rapid growth of the Internet industry has produced many large social networks, such as Twitter and Facebook, and these platforms have an ever greater influence on people's lives. Twitter is a typical representative of these networks: it allows users to follow other users without their consent. This feature has attracted more and more users, who read, post, retweet, and reply to all kinds of information on Twitter, which makes Twitter more like a news medium [1]. According to statistics published by Twitter, as of October 2013 Twitter had more than 200 million active users, who posted more than 400 million Tweets per day. The large number of active users and the rapid spread of information have given Twitter great influence in many events, such as the 2008 U.S.

