The evolution of data science and big data research: A bibliometric analysis
Daphne R. Raban · Avishag Gordon

In this study, the evolution of the Big Data (BD) and Data Science (DS) literatures and the relationship between the two are analyzed using bibliometric indicators that help establish the course taken by publications in these research areas before and after the concepts were formed. We observe a surge in BD publications alongside a gradual increase in DS publications. Interestingly, a new stream of publications emerges that combines the BD and DS concepts. We evaluate the three literature streams using various bibliometric indicators, including research areas and their origin, central journals, the countries producing and funding research and startup organizations, citation dynamics, dispersion and author commitment. We find that BD and DS have differing academic origins and different leading publications. Of the two terms, BD is more salient, possibly catalyzed by the strong acceptance of the pre-coordinated term by the research community, by intensive citation activity, and also, we observe, by generous funding from Chinese sources. Overall, the DS literature serves as a theory base for BD publications.

DATA
The data in this study were drawn from the Clarivate Analytics Web of Science (WoS) Core Collection, 2019. This is a selective index of good-quality publications. The search was conducted on titles and abstracts of scientific peer-reviewed publications (N=41,961 for BD, N=244,695 for DS, N=3,552 for interchangeable use).
Publications containing BD and DS as pre-coordinated concepts were retrieved from 2006 to March 2019, including publications that use these terms interchangeably (N=7938 for BD, N=2648 for DS, N=242 for interchangeable use). ......

Methodology
The retrieved set of publications was analyzed to discover overall productivity, current research areas and their origin, central journals and citation patterns, the countries producing and funding research and startup organizations (Hartmann et al. 2016).

The dynamics of BD and DS over time were examined with bibliometric indicators, including "highly cited" papers and the immediacy index. Highly cited papers are those that received a high number of citations, usually within the most recent 10 years or less, depending on the discipline. The highly cited papers indicator was devised to avoid the bias from the large citation counts that researchers accumulate over a very long publication history. The immediacy index is calculated by dividing the citations received within the year of publication by the number of publications in that year. The immediacy index indicates, to a large extent, the journal impact (Tomer 1986; Yue et al. 2004), and is also considered to be an indication of the "research front" of a science field (Meadows 1998: 61).

The immediacy index was complemented by the Price Index, which measures the citations to publications from the last five years as a share of the total number of citations per topic, and so examines the aging of the literature. "Ageing patterns can be characterized as a combination of phases of maturation and decline in citation processes" (Glänzel et al. 2016: 2169). The three indicators were used to reveal which concept, BD or DS, is more in use. Intensive usage in a field indicates a dynamic and promising science field. The Price Index was measured for two years: 2010, when all three literatures already existed, and 2018, the more recent year, to observe the dynamics of the three trends.
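The two citation indicators described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: the function names, the input format (a list of publication years of the cited items) and all numbers are hypothetical, and treating "the last five years" as a publication age of at most five years is an assumption about the window boundary.

```python
from typing import Sequence


def immediacy_index(citations_in_year: int, publications_in_year: int) -> float:
    """Citations received in a year by items published in that same year,
    divided by the number of items published in that year."""
    if publications_in_year <= 0:
        raise ValueError("publications_in_year must be positive")
    return citations_in_year / publications_in_year


def price_index(cited_years: Sequence[int], citing_year: int, window: int = 5) -> float:
    """Share of citations that go to publications at most `window` years old.

    Assumption: a cited item counts as recent when its age in calendar
    years (citing_year - publication year) is between 0 and `window`.
    """
    if not cited_years:
        raise ValueError("cited_years must not be empty")
    recent = sum(1 for year in cited_years if 0 <= citing_year - year <= window)
    return recent / len(cited_years)


# Hypothetical numbers, for illustration only.
example_immediacy = immediacy_index(citations_in_year=30, publications_in_year=120)
example_price = price_index(
    cited_years=[2018, 2017, 2016, 2012, 2009, 2018, 2014, 2013],
    citing_year=2018,
)
print(example_immediacy, example_price)
```

A Price Index close to 1 would mean that most citations go to recent work, which is how the text reads a fast-moving research front; a low value would point to an older, slower-aging literature.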
Dispersion in the fields of BD and DS was calculated by comparing the number of publications yielded by searches by topic to the number yielded by the same searches by title. A high percentage of dispersion indicates a field with a small cohesive literature core (Tal and Gordon 2017). Another test was the commitment of authors to the research field, which indicates regularity and constancy among authors who are not "one-time visitors" to the field. Such authors could help create theories and paradigms in the research area and maintain continuity in the research field (González-Alcaide et al. 2016; Gordon 2007).

Results (selected figures and tables)
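"Bilingual Logic Based on a System of Scientific Methods" (talk title, UNILOG2018 international conference on universal logic) and "New Directions in Natural Language Processing" (talk title, ICIS2018 International Conference on Intelligence Science), together with "Two Major Classes of Formalization Strategies", which I reported on earlier and which is an improved version of the talk given at the Shanghai Computing Center, can form the three key cores of major national and international projects on which many parties and universities could collaborate in the future. (With this basic theory and method, and with a network tool platform that all teachers and students can join, namely a new-generation generalized bilingual development environment, as support, all kinds of invention, creation, discovery and innovation will have a matching virtual-reality world in which to develop freely.) - Zou Xiaohui (Searle Technology)

Appendix 1: "Two Major Classes of Formalization Strategies", the Shanghai Computing Center talk: title and abstract
Abstract: This paper expounds two major classes of formalization strategies for machine translation. The first class involves programming languages and English (a natural language). The second class involves three kinds of bilingual cooperative transformation: binary and decimal numbers, decimal numbers and Chinese characters, and Chinese and English (interchangeable); it belongs to the study of formalization and its extensions. The result is that the second class of formalization strategies is brought to the fore. Its significance is that it reveals the theoretical basis of that class and provides generalized bilingual information processing technology for knowledge systems engineering across disciplines, linguistics included, which helps computer users whose native language is not English to improve the language environment of human-computer dialogue.
Keywords: machine translation; formalization; bilingual information processing
Classification number: TP391.2
两大类形式化方略.pdf http://xuewen.cnki.net/CJFD-JYRJ201309055.html

Appendix 2: ICIS2018 International Conference on Intelligence Science: talk title and abstract
2017-09-07 ZouXiaohui and ZouShunpeng: Logic of sequence and position.pdf

Appendix 3: UNILOG2018 international logic conference: talk title and abstract
11. New Approaches to Natural Language Understanding
Xiaohui Zou, Sino-American Searle Research Center
Abstract: This talk aims to disclose the know-how of launching a new generation of excellent courses and to develop a learning environment in which human-computer collaboration can optimize expert knowledge acquisition. The method is to form a teaching environment that integrates online and offline activity through a technical platform of cloud classrooms, cloud offices and cloud conference rooms. Taking Chinese, English, classical texts and summary abstracts as examples, it uses a human-computer coordination mechanism to produce the new generation of quality courses. Its characteristics are as follows: teachers and students can use the text analysis method to do fine processing of the same knowledge module, in Chinese or in English alone, and by selecting keywords, terminology and knowledge modules they can acquire knowledge through menu selection. The fine processing of a module can adopt a mass-production method that works online first, achieves complete coverage, and accurately grasps each language point, knowledge point and original point, as well as their combinations. This method can finish fine processing instantly for any text segment.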
The result is a learning environment that enables human-computer collaboration to optimize expert knowledge acquisition. Its significance is that this learning-environment software project, based on the National Excellent Courses, is already owned by Peking University and is built on the numbers-words chessboard, which introduces the knowledge mass-production mode for the fine processing of textual knowledge modules.

Biography
Xiaohui Zou, male, from Chengdu, Sichuan Province, is a chief researcher working in the Project Team on the New Generation of Excellent Courses of the Natural Science Foundation at Peking University and at the Sino-American Searle Research Center. He is head of the group on collaborative intelligent education research, deputy director of the Artificial Intelligence Committee of the China Branch of the International Information Research Society, and assistant director of its Education Information Professional Committee. His research direction is language, information and intelligence science.
http://www.intsci.ac.cn/icis2018/committees.jsp
http://www.intsci.ac.cn/icis2018/speaker.jsp
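Set aside the usual arguments for data visualization, such as making results presentable to your boss.

This post draws on Wikipedia: https://en.m.wikipedia.org/wiki/Anscombe%27s_quartet

The famous Anscombe's quartet consists of four datasets that share the same summary statistics but have different x and y values. Fitting a simple linear regression to each of them gives the same result, yet the accuracy of the fit is questionable: for some datasets it works, for others it is simply wrong.

Property                                                Value               Accuracy
Mean of x                                               9                   exact
Sample variance of x                                    11                  exact
Mean of y                                               7.50                to 2 decimal places
Sample variance of y                                    4.125               plus/minus 0.003
Correlation between x and y                             0.816               to 3 decimal places
Linear regression line                                  y = 3.00 + 0.500x   to 2 and 3 decimal places, respectively
Coefficient of determination of the linear regression   0.67                to 2 decimal places

Look closely: the fits for the second and fourth plots are plainly wrong. The fit for the third plot is roughly right but inaccurate, because it contains an outlier that could clearly be dropped. Only the fit for the first plot is correct.

So during data exploration it is worth running a simple check on whether the data actually suit the model you plan to use. The model matters, but data quality matters more.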
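The table can be checked numerically. The sketch below hard-codes the four datasets as they are commonly listed for Anscombe's quartet (the values are not given in this post, so treat them as an outside input) and computes each statistic by hand with the standard library; `summarize` and `QUARTET` are illustrative names.

```python
from math import sqrt

# Anscombe's quartet; datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the summary statistics from the table for one dataset."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx                      # least-squares slope
    intercept = mean_y - slope * mean_x    # least-squares intercept
    r = sxy / sqrt(sxx * syy)              # Pearson correlation
    return {
        "mean_x": mean_x,
        "var_x": sxx / (n - 1),            # sample variance
        "mean_y": mean_y,
        "var_y": syy / (n - 1),
        "r": r,
        "slope": slope,
        "intercept": intercept,
        "r2": r * r,                       # coefficient of determination
    }


for name, (x, y) in QUARTET.items():
    stats = summarize(x, y)
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

All four rows come out nearly identical, which is the point: the numbers alone cannot tell a clean linear trend from a curve or from a single outlier, so plotting the data or inspecting the residuals is the simple check the post recommends.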
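Second interdisciplinary, cross-field and cross-industry seminar accompanying the Third International Conference on Intelligence Science

Topic: The relationship between the growing field of data science and the cooperative processing of three types of information

Speakers in the academic dialogue: Peng Yonghong (Chair of the Big Data Task Force of the IEEE Computational Intelligence Society, Associate Editor of IEEE Transactions on Big Data, Chair Professor of Data Science at the University of Sunderland, UK) and Zou Xiaohui (Assistant Director of the IS4SI-CC Education Informatization Committee, Deputy Director of its Artificial Intelligence Committee, Director of the Sino-American Searle Research Center)

Time: 2017-12-21, 2:00 pm to 5:00 pm (chaired by Prof. Zou Xiaohui) and 6:30 pm to 9:30 pm (chaired by Prof. Ma Jinwen). The online JionNet remote link for both sessions is hosted by Prof. Lin Jianxiang (JionNet remote technical support is provided by the Peking University Center for Teaching and Faculty Development).

Venue: Room 1365, Science Building No. 1, School of Mathematical Sciences, Peking University

Disciplines represented by attendees: data science, information science, intelligence science, noetic science, brain science, mathematics, computer science, software engineering, computational linguistics, cognitive psychology, educational technology and many others.

The Dushuren film crew will record audio and video throughout.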
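Time: 2017.6.29 (Thursday) 15:00-17:00
Venue: Lecture Hall 106, School of Economics and Management, Nanjing University of Science and Technology
Speaker: He Daqing (Professor, University of Pittsburgh)

Abstract:
Social scientists have long shared data, yet most systematic and comprehensive studies are based on qualitative data, and little is known about how social scientists share qualitative data. This talk presents three case studies we conducted on qualitative data sharing practices in the social sciences. Our research builds on two existing conceptual frameworks, Knowledge Infrastructure (KI) and the Theory of Remote Scientific Collaboration (TORSC), and pursues three research goals: the current state of qualitative data sharing practices in the social sciences, the determining factors in participants' qualitative data sharing behavior, and the concrete data management practices at the world's largest social science data infrastructure. Our results show that, first, early-career social scientists have low awareness of data sharing and are not active in it; second, social scientists' preferences for sharing research outputs are related to methodology rather than to raw experimental data; third, perceived technical support and rewards are strong predictors of qualitative data sharing behavior. Finally, we summarize best practices for data sharing in the social sciences and offer suggestions for building a sustainable data sharing environment in the social sciences and other fields.

About the speaker:
Dr. He Daqing is a professor at the School of Computing and Information (iSchool) of the University of Pittsburgh, where he chairs the iSchool's doctoral program committee in Library and Information Science. Prof. He received his PhD in artificial intelligence from the University of Edinburgh, Scotland. Before joining the University of Pittsburgh in 2004, he held research positions at Robert Gordon University in Scotland and at the University of Maryland in the United States, among others. His research focuses on information retrieval (monolingual and multilingual), information access on social networks, adaptive Web systems and user modeling, interactive retrieval interface design, Web log mining and analysis, and research data management.

Dr. He has been principal or co-principal investigator on more than ten research projects, including projects funded by the US National Science Foundation, the US Defense Advanced Research Projects Agency, the University of Pittsburgh and other institutions. He has published more than 120 papers in internationally recognized journals and conferences, including Journal of the Association for Information Science and Technology, Information Processing and Management, ACM Transactions on Information Systems, Journal of Information Science, ACM SIGIR, CIKM, WWW and CSCW. He also serves on the program committees of more than twenty major international conferences in information retrieval and Web technologies, reviews for several leading international journals in the field, and sits on the editorial boards of the SCI-indexed journals Information Processing and Management, Internet Research and Aslib Journal of Information Management.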
============================================================
Welcome to Complex Systems and Data Science Weekly
============================================================
Complex Systems and Data Science Weekly (Issue 1, 2014-03-09)
Leading researchers and research institutions in data science (big data processing): a summary (3rd revision)

1 Jeffrey David Ullman (Stanford University)
http://infolab.stanford.edu/~ullman/
Author of both Mining of Massive Datasets (published in Chinese as "Big Data: Large-Scale Internet Data Mining and Distributed Processing") and Compilers: Principles, Techniques, and Tools.

2 Anand Rajaraman
First author of Mining of Massive Datasets.

3 Jim Gray
http://research.microsoft.com/en-us/um/people/gray/
Sadly, he disappeared at sea and has never been found. A great loss!

4 Andrew Ng (吴恩达), Stanford University
Speaks Chinese; a seriously impressive researcher.
http://ai.stanford.edu/~ang/

5 Daphne Koller, Stanford University
http://ai.stanford.edu/~koller/index.html

6 Michael I. Jordan, University of California, Berkeley
http://www.cs.berkeley.edu/~jordan/

7 David M. Blei, Princeton University
http://www.cs.princeton.edu/~blei/

8 Geoffrey E. Hinton, University of Toronto
The godfather of neural networks and a leading figure in deep learning.
http://www.cs.toronto.edu/~hinton/