LeoTask: a fast, reliable, and extensible framework for computational research (a reliable, lightweight, multi-core MapReduce framework)

Source code: https://github.com/mleoking/LeoTask

LeoTask is particularly well suited to writing long-running, multi-parameter computation and simulation programs. It automatically traverses the combined parameter space of multiple parameters, runs tasks in parallel, aggregates statistics, finds optimal values, formats result output, and produces high-quality plots (via Gnuplot). Without any extra code, LeoTask automatically backs up the data a program aggregates in parallel; after a server fault (a reboot, a power failure, and so on), the program can resume its parallel run from the point of interruption.

LeoTask

LeoTask is a parallel task running and results aggregation (MapReduce) framework. It is a free and open-source project designed to facilitate running computationally intensive tasks. The framework implements the MapReduce model, allocating tasks to the multiple cores of a computer and aggregating results according to an XML-based configuration file. The framework includes mechanisms to automatically recover applications from interruptions caused by accidents (e.g. a power cut). Applications using the framework can continue running after an interruption without losing their calculated results.

Features:
- Automatic parallel parameter space exploration.
- Flexible configuration-based result aggregation.
- Programming model focusing only on the key logic.
- Reliable automatic interruption recovery.
- Ultra lightweight: a Jar of about 300 KB.

Utilities:
- Dynamic, cloneable network structures: a node, a link, a network, a network set (within which networks can overlap with each other), and multiplex networks.
- Integration with Gnuplot: hybrid programming with Gnuplot, and output of statistical results as Gnuplot scripts.
- Network generation according to common network models: random networks, scale-free networks, etc.
- DelimitedReader: a sophisticated reader that explores CSV (Comma-Separated Values) files like a database.
- Fast random number generator based on the Mersenne Twister algorithm.
- Versatile curve fitter and function value optimizer (minimizer).

Example Application:

Please refer to the project's introduction for building an example application using the framework.
Code (RollDice.java):

    // Package taken from the class name used in rolldice.xml;
    // the import path of Task is an assumption about the LeoTask source layout.
    package org.leores.task.app;

    import org.leores.task.Task;

    public class RollDice extends Task {
        public Integer nSide; // Number of dice sides
        public Integer nDice; // Number of dice to roll
        public Integer sum;   // Sum of the results of the nDice dice

        // Accept only parameter combinations with positive values.
        public boolean prepTask() {
            boolean rtn = nSide > 0 && nDice > 0;
            return rtn;
        }

        // Reset the running sum before each repeat.
        public void beforeRept() {
            super.beforeRept();
            sum = 0;
        }

        // One step rolls one die; the repeat ends once iStep exceeds nDice.
        public boolean step() {
            boolean rtn = iStep <= nDice;
            if (rtn) {
                sum += (int) (rand.nextDouble() * nSide + 1);
            }
            return rtn;
        }
    }

Configuration (rolldice.xml):

    <Tasks>
        <name val="task-rolldice"/><usage val="0.9"/><nRepeats val="2000"/><checkInterval val="4"/>
        <variables class="org.leores.task.app.RollDice">
            <nSide val="2;4;6"/>
            <nDice val="2:1:5"/><!--from 2 to 5 with a step of 1, i.e. 2;3;4;5 -->
        </variables>
        <statistics>
            <members>
                <i><info val="Fig1%pltm+@afterRept@"/><valVar val="sum;#$sum$/$nDice$#"/><parVars val="nSide;nDice"/></i>
                <i><info val="Fig2%plt+@afterRept@"/><valVar val="sum"/><parVars val="nSide"/></i>
                <i><info val="Fig3%plt+@afterRept@"/><valVar val="sum"/><parVars val="nDice"/></i>
            </members>
        </statistics>
    </Tasks>

Before running the example application, please install Java and include the directory of the java command in the system's PATH environment variable. Windows users can alternatively download and install (install.bat) the all-in-one runtime environment package: LeoTaskRunEnv.

Change the current directory to the Demo folder and then execute the following command:

    java -jar leotask.jar -load=rolldice.xml

If you are using MS Windows, you can also execute rolldice.bat.

References:

Changwang Zhang, Shi Zhou, Benjamin M. Chain (2015). LeoTask: a fast, flexible and reliable framework for computational research, arXiv:1501.01678.
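To make the RollDice example above more concrete, here is a minimal plain-Java sketch of the parameter sweep that rolldice.xml describes: every combination of nSide in {2, 4, 6} and nDice in {2, 3, 4, 5}, with 2000 repeats each, reporting the mean of sum. It does not use LeoTask at all. The class name RollDiceSweepSketch, its helper methods, and the fixed random seed are assumptions made for illustration, and the sketch runs serially with no checkpointing, whereas the framework is described as doing this in parallel with automatic recovery.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Plain-Java illustration only; none of these names belong to LeoTask.
class RollDiceSweepSketch {

    // Expands a "from:step:to" range such as 2:1:5 into 2, 3, 4, 5.
    static List<Integer> expandRange(int from, int step, int to) {
        List<Integer> values = new ArrayList<>();
        for (int v = from; v <= to; v += step) {
            values.add(v);
        }
        return values;
    }

    // One repeat: roll nDice dice with nSide sides and return the sum,
    // using the same per-die formula as RollDice.step().
    static int rollOnce(Random rand, int nSide, int nDice) {
        int sum = 0;
        for (int i = 0; i < nDice; i++) {
            sum += (int) (rand.nextDouble() * nSide + 1);
        }
        return sum;
    }

    // Mean of the sum over nRepeats independent repeats.
    static double meanSum(Random rand, int nSide, int nDice, int nRepeats) {
        long total = 0;
        for (int r = 0; r < nRepeats; r++) {
            total += rollOnce(rand, nSide, nDice);
        }
        return (double) total / nRepeats;
    }

    public static void main(String[] args) {
        Random rand = new Random(42);              // fixed seed: an assumption, for repeatability
        int[] sides = {2, 4, 6};                   // mirrors nSide val="2;4;6"
        List<Integer> dice = expandRange(2, 1, 5); // mirrors nDice val="2:1:5"
        for (int nSide : sides) {
            for (int nDice : dice) {
                double mean = meanSum(rand, nSide, nDice, 2000); // mirrors nRepeats val="2000"
                System.out.printf("nSide=%d nDice=%d mean(sum)=%.3f%n", nSide, nDice, mean);
            }
        }
    }
}
```

The expected mean for each combination is nDice * (nSide + 1) / 2, which gives a quick sanity check on whatever the framework plots.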
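Just before the Spring Festival, on January 28, 2014, an article by the TransLab team at the University of Brasília, "Querying dynamic communities in online social networks", was published online in issue 15(2) of the Journal of Zhejiang University-SCIENCE C (Computers & Electronics), the English-language C edition of the journal. The article's abstract, originally in Chinese, is appended below.

The Journal of Zhejiang University is published in three editions, A, B, and C, all indexed by SCI. It is primarily a print publication that is also released online. According to the journal's editor Zhai Ziyang (a ScienceNet blogger), the staff rushed to hand the final version to the Zhejiang University printing house before the Spring Festival holiday began on the 28th. The printed issue would not reach readers until after mid-February. In today's networked era, however, the editors put the issue's articles online before the end of that same working day, so authors, readers, and editors could all see them right away.

One of TransLab's research directions is social networks, and I also enjoy studying how websites behave. In the first few days after our article went online, its view and download counts were not high: over the 4 days up to the 31st it accumulated 60 views and 23 downloads (see www.zju.edu.cn/jzus/current.php), which is normal enough.

On February 3, during the lunch break, I checked again how the article was spreading and found that downloads had risen tenfold, to 235. Figure 1 shows the view and download statistics for the issue's 8 articles 9 days after going online; the averages are 168 views and 162 downloads.

Figure 1. Views and downloads of the 8 articles in Journal of Zhejiang University-SCIENCE C, issue 15(2), 9 days after going online. Screenshot from: http://www.zju.edu.cn/jzus/current.php

During this period our team did not push the article link to any forum or mailing list. Puzzling over this, I checked my publication list on Google Scholar and, as expected, Google Scholar had just indexed the article and attached a direct link to the PDF hosted by the journal (see Figure 2). Evidently Google Scholar actively supports open journals, and it is an important resource for such journals' visibility online.

Figure 2. Google Scholar's record for the article, with the download link to the journal's site.

Interestingly, the journal site at that time showed only 60 views for the article, which did not match the download count. On February 4 I posted the article's details on my ResearchGate page, and the next day the view count had risen to more than 130. I cannot yet be sure that this was ResearchGate's effect, but this fairly successful site for scientists probably does help online dissemination. Matters of social networks are often hard to pin down, and many questions deserve closer study.

For comparison, I also looked at downloads for issue 1 of 2014 of the English-language open journal Social Networking, of which I am editor-in-chief. The issue was published on January 23 by Scientific Research Publishing (SCIRP) with 7 articles, and the average number of downloads within 15 days of going online was 116. It seems the new journal Social Networking still cannot compare with the Journal of Zhejiang University, so carefully edited by Zhang Yuehong. Judged only by per-article downloads, however, one paper in this issue deserves special mention: "The Impact of Online Networks and Big Data in Life Sciences", by Ruchita Gujarathi and Fabricio F. Costa of DataGenno Interactive Research, a company based in the United States and Brazil, was downloaded 410 times within 15 days.

The analysis above shows once again that a traditional print journal of ordinary standing could hardly expect nearly 200 readers within ten days. It is only in today's networked world that high-quality, novel scholarly work can achieve so much through open journals. We look forward to the Journal of Zhejiang University, Social Networking, and other open journals reaching new heights.

Appendix: abstract of "Querying dynamic communities in online social networks", Journal of Zhejiang University-SCIENCE C

Querying dynamic communities in online social networks
Li Weigang, Edans F. O. Sandes, Jianya Zheng, Alba C. M. A. de Melo, Lorna Uden
http://www.zju.edu.cn/jzus/article.php?doi=10.1631/jzus.C1300281

(Abstract) This article studies the formation of dynamic communities in online social networks, which is characterized by real-time interaction, bursts of information, and rapid propagation, and points out that finding useful information within communities promptly in a big-data environment is a challenging task for the field. It draws on a logical model that describes user relationships (the Follow Model), combines it with the MapReduce concept, and discusses algorithms that implement a MapFollowee and ReduceFollower mechanism on Hadoop. The functions of the Follow Model describe the relationships among microblog users concisely and accurately, and they have three properties: reverse relation and symmetry, extensibility, and composability. Flexible use of these properties yields the two classes of query algorithms proposed in the article: reverse relation queries and high-order relation queries.

The article studies online social networks, in particular how dynamic communities form on the Twitter and Sina Weibo platforms, and proposes a logical model of the relationships between users, the Follow Model. Combining this model with the MapReduce idea, it proposes parallel algorithms for querying information about these dynamic communities. A practical test that queries information about two communities on Twitter demonstrates the effectiveness of the algorithms in a big-data environment.

References:

Li Weigang, Edans F. O. Sandes, Zheng Jianya, A. C. M. A. de Melo, Lorna Uden (2014). Querying dynamic communities in online social networks. J ZHEJIANG U-SCI C, v. 15, p. 81-90. http://www.zju.edu.cn/jzus/article.php?doi=10.1631/jzus.C1300281

R. Gujarathi and F. Costa (2014). The Impact of Online Networks and Big Data in Life Sciences. Social Networking, 3, 58-64. doi: 10.4236/sn.2014.31007. http://www.scirp.org/journal/sn/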
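The abstract above mentions a MapFollowee and ReduceFollower mechanism for reverse relation queries. The following is a minimal in-memory Java sketch of that idea, under the assumption that the map step re-keys each "follower follows followee" edge by its followee and the reduce step gathers the followers per key. It is not the article's Hadoop implementation; the class ReverseRelationSketch, the Follow type, and the sample user names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// In-memory illustration of a map/reduce-style reverse relation query.
// All names here are hypothetical; this is not the authors' Hadoop code.
class ReverseRelationSketch {

    // A follow edge: "follower" follows "followee".
    static final class Follow {
        final String follower;
        final String followee;

        Follow(String follower, String followee) {
            this.follower = follower;
            this.followee = followee;
        }
    }

    // "Map" step: re-key each edge by its followee, emitting (followee, follower).
    static List<String[]> mapFollowee(List<Follow> edges) {
        List<String[]> pairs = new ArrayList<>();
        for (Follow e : edges) {
            pairs.add(new String[] {e.followee, e.follower});
        }
        return pairs;
    }

    // "Reduce" step: gather all followers emitted under the same followee key.
    static Map<String, Set<String>> reduceFollower(List<String[]> pairs) {
        Map<String, Set<String>> followers = new TreeMap<>();
        for (String[] p : pairs) {
            followers.computeIfAbsent(p[0], k -> new TreeSet<>()).add(p[1]);
        }
        return followers;
    }

    public static void main(String[] args) {
        List<Follow> edges = new ArrayList<>();
        edges.add(new Follow("alice", "carol"));
        edges.add(new Follow("bob", "carol"));
        edges.add(new Follow("carol", "alice"));
        // Prints, for each followed user, the set of users who follow them.
        System.out.println(reduceFollower(mapFollowee(edges)));
    }
}
```

On a real cluster the grouping by key would be done by the framework's shuffle phase rather than by a single map in memory; the sketch only shows the data flow.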
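On January 2, 2013, at the start of the new year, I began posting a series of three articles titled "Weibo Stories" (微博de故事) on my ScienceNet blog, presenting in blog form the latest results of the TransLab team at the University of Brasília from a year of research on online social networks, especially Weibo and Twitter. Because of the social and economic impact of social networks in the Internet era, research on Twitter and on Sina and Tencent Weibo is a hot topic. Many research institutions, university research centers, and well-known scholars around the world have invested people and resources in studying this new phenomenon.

The first post, "Weibo Stories: Physicists' Criticism of Their Computer Science Peers", took the dispute between physicists working on complex networks and computer scientists as its starting point. It pointed out that although many scholars study microblogs, no mathematical (logical) model yet felt "elegant" enough to capture microblogging, in particular to describe the various relationships among users on the platform. Thanks to the kindness of ScienceNet readers and editors, the post has so far received 5110 views, comments from 10 bloggers, and recommendations from 27 readers. Zhou Tao and other scholars hold that a good microblog model is the foundation for developing efficient query algorithms and recommender systems.

Edans Sandes and others in our team proposed the Follow Model, a basic description of the logical relationships and behavioral characteristics among users on a microblog platform, grounded in graph theory and the logical relations used in artificial intelligence. The posts "The Retweeting Philosophy of Weibo (Parts 1 and 2)" describe it in detail; for the original paper see [1]. Over the past year, peers' assessments of the Follow Model have been mixed, and some scholars have raised criticisms. While grateful for these, I also believe that a mathematical model of practical significance, and its applications, must be tested through long practice.

Encouragingly, the team built on this work in "Weibo Stories: The Relationship Committed Adjacency Matrix", extending the Follow Model to an adjacency matrix that carries relationship semantics. By the Follow Model's definition, $A_{in}$ is the follower adjacency matrix (Follower AM), and its transpose is the followee adjacency matrix (Followee AM), that is, $A_{out} = A_{in}^{T}$. Some may say this is obvious, but it is nonetheless the first work to state it outright; for the paper see [2].

The post "Weibo Stories: Finding Friends with MapReduce" shows how to combine the Follow Model with the well-known MapReduce model and use the powerful Hadoop parallel computing system to find dynamic communities in massive social networks and then rank them [3]. The essence of the problem, of course, is finding a needle in a haystack: discovering the "persons of interest" in a social network. It is worth mentioning that one reviewer assessed the algorithm as follows: Facebook already publishes celebrity rankings, so why drag MapReduce into this? That is a pity; getting some scholars to appreciate big data is really not easy. The algorithm must first extract dynamic communities, as quickly as possible, from the intricate relationships among all users of the social network. Note that this means more than 500 million users and more than ten billion connection and behavior relationships. Even for the Weibo or Twitter platforms themselves this is no easy matter. The algorithm developed by the team combines the Follow Model with the MapReduce idea, which is exactly what the task calls for.

Thanks to the 9th Chinese Network Science Forum (2013), and in particular to its organizers Fang Jinqing and Wang Binghong, the TransLab team was able to present this research on online social networks formally, comprehensively, and systematically. As the main outcome of attending the forum, the article "Logical model and parallel querying in online social networks" [4] has been published in Complex Systems and Complexity Science, 2013, issue 2. It is the team's Chinese-language synthesis of the Follow Model and its applications, and it marks my return to publishing in a Chinese domestic journal after 13 years.

Two days after this post went out came more good news: another article, "Research on the Micro-blog Mechanism and Re-posting Prediction" [5], written with Zhang Yaming and doctoral student Tang Chaosheng of Yanshan University, was published in the August 2013 issue of the Journal of the China Society for Scientific and Technical Information, whose editor-in-chief is Wu Yishan and whose managing editor is Ma Lan. The article summarizes and analyzes recent work in China and abroad on microblog mechanisms and retweeting, and also briefly introduces the Follow Model and its applications.

Comments and corrections from fellow bloggers and peers are welcome.

References:

[1] Edans F. Sandes, Li Weigang, Alba C. de Melo: Logical Model of Relationship for Online Social Networks and Performance Optimizing of Queries - WISE 2012 Challenge - T1: Performance Track Scalability Winner. 13th International Conference on Web Information System Engineering - WISE 2012: 726-736.

[2] Li Weigang, Zheng Jianya, Liu Yang, Retweeting Prediction Using Relationship Committed Adjacency Matrix, in II Brazilian Workshop on Social Network Analysis and Mining (BraSNAM 2013, CSBC 2013), 1561-1572.

[3] Zheng Jianya, Li Weigang and Lorna Uden, Top-X Querying in Online Social Networks with MapReduce Solution, accepted at the 2013 Eighth International Knowledge Management in Organizations Conference: Social and Big Data Computing for Knowledge Management (KMO2013), Taiwan.

[4] Li Weigang, Zheng Jianya, Logical model and parallel querying in online social networks. Complex Systems and Complexity Science, v.10, p.77-87, 2013.

[5] Zhang Yaming, Tang Chaosheng and Li Weigang, Research on the Micro-blog Mechanism and Re-posting Prediction, Journal of the China Society for Scientific and Technical Information, 32(8), 868-867, 2013.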
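The transpose relationship between the follower and followee adjacency matrices described above can be illustrated with a small Java sketch. The indexing convention is an assumption made here, not taken from the cited paper: aIn[i][j] == 1 means that user j follows user i, so row i of the follower matrix lists the followers of user i. The class FollowMatrixSketch and its methods are hypothetical names.

```java
// Illustration of the follower and followee adjacency matrices.
// Assumed convention (not from the cited paper): aIn[i][j] == 1 means user j follows user i.
class FollowMatrixSketch {

    // Returns the transpose of a rectangular matrix.
    static int[][] transpose(int[][] a) {
        int rows = a.length;
        int cols = rows == 0 ? 0 : a[0].length;
        int[][] t = new int[cols][rows];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                t[j][i] = a[i][j];
            }
        }
        return t;
    }

    // Sum of one row: number of followers in aIn, number of followees in aOut.
    static int rowSum(int[][] a, int row) {
        int sum = 0;
        for (int value : a[row]) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Three users 0, 1, 2: users 1 and 2 follow user 0; user 0 follows user 2.
        int[][] aIn = {
            {0, 1, 1},
            {0, 0, 0},
            {1, 0, 0}
        };
        int[][] aOut = transpose(aIn); // followee matrix: aOut[i][j] == 1 means i follows j
        for (int u = 0; u < aIn.length; u++) {
            System.out.println("user " + u + ": followers=" + rowSum(aIn, u)
                    + ", followees=" + rowSum(aOut, u));
        }
    }
}
```

Under this convention a single stored matrix answers both "who follows u" (a row of the follower matrix) and "whom does u follow" (a row of its transpose).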
Part 1: Algorithms.

Generally two categories:

1) Apriori-based
- AGM: Inokuchi et al. (PKDD'00)
- FSG: Kuramochi and Karypis (ICDM'01)
- FFSM: Huan et al. (ICDM'03) and SPIN: Huan et al. (KDD'04)

2) Pattern-growth
- MoFa: Borgelt and Berthold (ICDM'02)
- gSpan: Yan and Han (ICDM'02)
- Gaston: Nijssen and Kok (KDD'04)

Part 2: MapReduce/BSP implementations.

Here are some open source graph processing frameworks. Most follow the Bulk Synchronous Parallel (BSP) model, where vertices send messages to each other in a series of iterations called supersteps.

1. Giraph, a Java BSP implementation that runs on existing Hadoop clusters and provides fault tolerance for both the workers and the master using ZooKeeper. Giraph came out of Yahoo! and is now an Apache Incubator project.
2. Apache Hama, a pure BSP computing framework on top of HDFS (Hadoop Distributed File System) for massive scientific computations such as matrix, graph, and network algorithms. It is also an Apache Incubator project and works with ZooKeeper.
3. GraphLab, a new, asynchronous programming model in C++ from Carnegie Mellon.
4. Phoebus, an Erlang BSP implementation that can use HDFS for storage.
5. Golden Orb, another Java BSP implementation on Hadoop.
6. Signal/Collect, a Scala BSP implementation from the University of Zurich. Signal/Collect also supports extensions including an asynchronous mode.
7. Spark, a Scala framework from UC Berkeley that aims to balance efficiency and fault tolerance by providing immutable, distributed, in-memory collections. There is a BSP implementation on top of Spark called Bagel.
8. PEGASUS, a pure MapReduce implementation on Hadoop.

Problem domain:
- Hama: a pure BSP implementation that targets general computation, not only graph processing.
- Giraph: similar to Hama, but more specifically for graphs, such as PageRank, shared connections, personalization-based popularity, etc. Giraph runs as a map-only job on Hadoop.
- GraphLab: supports machine learning and graph computation, such as Jacobi, Gaussian Belief Propagation (GaBP), conjugate gradient, matrix operations, collaborative filtering, clustering, and belief propagation algorithms.
- Phoebus, Golden Orb, Bagel: based on Google Pregel. They currently support basic graph computation.
- Signal/Collect: supports synchronous and asynchronous algorithms on graphs, such as shortest path, belief propagation, and coloring.
- PEGASUS: supports computation of degree, PageRank, Random Walk with Restart (RWR), radius, and connected components.

Reference:
Giraph: http://incubator.apache.org/giraph/
Hama: http://incubator.apache.org/hama/
GraphLab: http://graphlab.org/
Phoebus: https://github.com/xslogic/phoebus
Golden Orb: http://www.goldenorbos.org/
Signal/Collect: http://code.google.com/p/signal-collect/
Spark: http://spark-project.org/
Bagel (BSP on Spark): https://github.com/mesos/spark/wiki/Bagel-Programming-Guide
PEGASUS: www.cs.cmu.edu/~pegasus/
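The superstep pattern shared by most of the frameworks listed above can be sketched in a few lines of single-threaded Java. The example propagates the maximum vertex value through a graph, a common introductory vertex-centric computation: in each superstep a vertex reads the messages delivered at the previous barrier, updates its value, and sends its new value to its neighbors only if it changed. This is a simulation under simplifying assumptions (one process, in-memory inboxes), not the API of Giraph, Hama, or any other framework named here; BspMaxValueSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Single-threaded simulation of BSP supersteps; not any framework's real API.
class BspMaxValueSketch {

    // Creates one empty message list per vertex.
    static List<List<Integer>> emptyInboxes(int n) {
        List<List<Integer>> boxes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            boxes.add(new ArrayList<>());
        }
        return boxes;
    }

    // Runs supersteps until no vertex sends a message and returns how many were run.
    // neighbors[v] lists the vertices that v sends to; values is updated in place.
    static int run(int[][] neighbors, int[] values) {
        int n = values.length;
        List<List<Integer>> inbox = emptyInboxes(n);
        boolean[] changed = new boolean[n];
        Arrays.fill(changed, true); // first superstep: every vertex announces its value
        int superstep = 0;
        while (true) {
            List<List<Integer>> outbox = emptyInboxes(n);
            boolean anySent = false;
            for (int v = 0; v < n; v++) {
                // Compute phase: fold in messages delivered at the previous barrier.
                for (int msg : inbox.get(v)) {
                    if (msg > values[v]) {
                        values[v] = msg;
                        changed[v] = true;
                    }
                }
                // Communication phase: only vertices whose value changed send messages.
                if (changed[v]) {
                    for (int u : neighbors[v]) {
                        outbox.get(u).add(values[v]);
                        anySent = true;
                    }
                    changed[v] = false;
                }
            }
            superstep++;
            if (!anySent) {
                return superstep;
            }
            inbox = outbox; // barrier: messages become visible only in the next superstep
        }
    }

    public static void main(String[] args) {
        // Undirected path 0 - 1 - 2 - 3, stored as lists of outgoing neighbors.
        int[][] neighbors = {{1}, {0, 2}, {1, 3}, {2}};
        int[] values = {3, 6, 2, 1};
        int supersteps = run(neighbors, values);
        System.out.println(Arrays.toString(values) + " after " + supersteps + " supersteps");
    }
}
```

Keeping the outbox separate from the inbox is what models the synchronization barrier: nothing sent during a superstep can be read until the next one begins.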