ScienceNet.cn (科学网)


Tag: 第四范式 (The Fourth Paradigm)


Related blog posts

The Fourth Paradigm: Scientific Research Based on Big Data
Popularity 27  lionbin  2015-10-26 17:08
Turing Award winner Jim Gray, the father of the relational database, was also a sailing enthusiast. On January 28, 2007, he went missing at sea while sailing his boat. Just 17 days earlier, on January 11, at the NRC-CSTB (National Research Council - Computer Science and Telecommunications Board) meeting in Mountain View, California, he had delivered his last public lecture, on a revolution in scientific methods, proposing that scientific research falls into four paradigms (a paradigm being a norm that must be followed, or a routine that everyone uses): experimental induction, model-based deduction, computational simulation, and data-intensive scientific discovery. The last of these, the "data-intensive" paradigm, is what we now call "scientific big data."

The earliest scientific research was characterized mainly by recording and describing natural phenomena and is called "experimental science" (the first paradigm). From primitive fire-drilling to the early stage of scientific development in the Renaissance represented by Galileo, it opened the door to modern science.

Such research, however, was clearly constrained by the experimental conditions of its time and could hardly reach a more precise understanding of natural phenomena. Scientists therefore began to simplify their experimental models as much as possible, stripping away complicating interference and keeping only the key factors (hence the puzzling stipulations we meet when studying physics: "sufficiently smooth," "a sufficiently long time," "sufficiently thin air"), and then to generalize by derivation and calculation. This is the second paradigm. It remained essentially flawless until the end of the 19th century: Newton's three laws successfully explained classical mechanics, Maxwell's theory successfully explained electromagnetism, and the edifice of classical physics stood magnificent. But the quantum mechanics and relativity that followed were driven mainly by theoretical work, with extraordinary reasoning and intricate calculation outpacing experimental design; and as the difficulty and cost of verifying theories climbed ever higher, scientific research began to feel out of its depth.

In the mid-20th century, von Neumann proposed the architecture of the modern electronic computer, and the practice of simulating scientific experiments on computers spread rapidly. Complex phenomena could be simulated, and ever more complex phenomena deduced from the simulations, with nuclear test simulation and weather forecasting as typical examples. As computer simulation increasingly replaced experiment and gradually became a routine research method, it became the third paradigm.

The future direction of science, though, is that with the explosive growth of data, computers will not merely run simulations but will also analyze and summarize data and arrive at theories. The data-intensive paradigm deserves to be separated from the third paradigm as a distinct paradigm of scientific research. In other words, the work once done by scientists such as Newton and Einstein could in the future be done entirely by computers. This mode of scientific research is called the fourth paradigm.

Both the third and the fourth paradigm use computers to compute, so what distinguishes them? Most researchers today understand the third paradigm very well: in their work they are forever being pressed by advisers, reviewers, and even themselves with "What is the scientific question?" and "What is the scientific hypothesis?" That is, one first proposes a candidate theory, then collects data, then verifies the theory by computation. The big-data-based fourth paradigm works the other way around: a large body of data exists first, and computation over it yields theory that was previously unknown. In his book Big Data: A Revolution That Will Transform How We Live, Work and Think (published in Chinese as 《大数据时代》), Viktor Mayer-Schönberger states explicitly that the greatest shift of the big-data era is giving up the craving for causality and attending to correlation instead: it is enough to know "what," without needing to know "why." This overturns habits of thought humans have held for millennia and, it is claimed, poses an entirely new challenge to how we understand and communicate with the world. Humans instinctively reason about causal links between things and are not especially sensitive to data-based correlations; computers, by contrast, can hardly grasp causality on their own, yet excel at correlation analysis. Seen this way, the distinction is clear: the third paradigm is "human brain + computer," with the human brain in the leading role, while the fourth paradigm is "computer + human brain," with the computer in the leading role.

This claim has, unsurprisingly, met with much opposition, on the grounds that it leads scientific research astray. From the standpoint of writing a scientific paper, a manuscript that offers nothing but analysis of correlations in the data, with no concrete causal interpretation, is generally regarded as mere data piling and will not get published. Yet uncovering causal links between things is, in most cases, extremely difficult. The causal links we humans derive are always based on past understanding: we obtain a "deterministic" decomposition of the mechanism and then build a new model to reason from. But that past experience and common sense may be incomplete, and important variables may have been overlooked, whether deliberately or not.

Here is an example everyone can relate to. We are all watching the smog these days. We want to know: how does smog form, and how can it be prevented? First, weather stations are set up at a number of "representative" sites to collect meteorological parameters related to smog formation. According to current mechanistic understanding, smog formation depends not only on emission sources and atmospheric chemistry but also on meteorological factors such as terrain, wind direction, temperature, and humidity. Even this limited set of parameters already exceeds routine monitoring capacity, so it is simplified: parameters that seem unimportant are dropped and only a few simple ones are kept. But might those seemingly unimportant parameters play a decisive role under certain specific conditions? And once the spatial heterogeneity of the different parameters is taken into account, is the spatial distribution of these weather stations reasonable, and is it sufficient? From this point of view, truly scientific predictions may only become possible with far more comprehensive data. That is the starting point of the fourth paradigm, and perhaps the quickest and most practical route to a solution.

How, then, would fourth-paradigm research actually proceed? Raised years ago, the question might have sounded like fantasy, but in today's era of ubiquitous mobile devices and rapidly advancing sensors, the trend seems to be within sight. Our phones can already measure temperature and humidity and fix their position in space; before long there may be sensors that monitor atmospheric chemistry and PM2.5. Such mobile monitoring endpoints greatly increase the spatial coverage of measurement while generating massive volumes of data; using those data to work out the causes of smog, and ultimately to predict it, may not be far off (a minimal sketch of this correlation-first style of analysis follows at the end of this post).

The appearance of data on this scale not only exceeds the comprehension of ordinary people but also poses an enormous challenge to computer science itself. When the data volume of such large-scale computation exceeds 1 PB, traditional storage subsystems can no longer keep up with the read/write demands of massive data processing, and the bottleneck of data-transfer I/O bandwidth becomes ever more acute (reading 1 PB just once at, say, 200 MB/s of sequential throughput would take a single drive on the order of two months). Simply chopping the data into blocks for processing does not meet the needs of data-intensive computing and runs against the very purpose of big-data analysis. So the biggest problem facing much concrete research today is not a shortage of data but an excess of it, with no idea how to handle it. The technologies currently in sight, such as supercomputers, computing clusters, super-scale distributed databases, and Internet-based cloud computing, do not appear to resolve the core of these contradictions. Computer science awaits a new revolution!
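To make the correlation-first workflow in the smog example above concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the variable names, units, and the synthetic relationships among them are stand-ins for real monitoring data, not measurements from any actual network. The point is only the shape of the analysis: rank which observed variables co-vary most strongly with PM2.5, answering "what" correlates while leaving "why" to later mechanistic work.

```python
# Fourth-paradigm-style toy analysis on hypothetical sensor data:
# rank variables by how strongly they co-vary with PM2.5, with no
# causal mechanism assumed in advance.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # number of synthetic sensor readings

# Hypothetical mobile-sensor readings (illustrative values only).
humidity    = rng.uniform(20, 95, n)    # %
wind_speed  = rng.gamma(2.0, 1.5, n)    # m/s
temperature = rng.normal(15, 8, n)      # deg C
pressure    = rng.normal(1013, 6, n)    # hPa

# Synthetic PM2.5 that happens to rise with humidity and fall with wind speed.
pm25 = 60 + 0.4 * humidity - 8.0 * wind_speed + rng.normal(0, 10, n)

variables = {
    "humidity": humidity,
    "wind_speed": wind_speed,
    "temperature": temperature,
    "pressure": pressure,
}

# Rank variables by absolute Pearson correlation with PM2.5.
ranked = sorted(
    ((name, np.corrcoef(values, pm25)[0, 1]) for name, values in variables.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)

for name, r in ranked:
    print(f"{name:12s} r = {r:+.2f}")
```

A third-paradigm study would run in the opposite direction: start from a hypothesized formation mechanism, then collect just enough data to confirm or refute it.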
Personal category: Reading Top Journals Together | 69652 reads | 57 comments
Does the Data Deluge Spell the End of Scientific Theory?!
Popularity 3  xcfcn  2013-1-28 21:21
The End of Theory: The Data Deluge Makes the Scientific Method Obsolete
http://www.wired.com/science/discoveries/magazine/16-07/pb_theory

"All models are wrong, but some are useful." So proclaimed statistician George Box 30 years ago, and he was right. But what choice did we have? Only models, from cosmological equations to theories of human behavior, seemed to be able to consistently, if imperfectly, explain the world around us. Until now. Today companies like Google, which have grown up in an era of massively abundant data, don't have to settle for wrong models. Indeed, they don't have to settle for models at all.

Sixty years ago, digital computers made information readable. Twenty years ago, the Internet made it reachable. Ten years ago, the first search engine crawlers made it a single database. Now Google and like-minded companies are sifting through the most measured age in history, treating this massive corpus as a laboratory of the human condition. They are the children of the Petabyte Age.

The Petabyte Age is different because more is different. Kilobytes were stored on floppy disks. Megabytes were stored on hard disks. Terabytes were stored in disk arrays. Petabytes are stored in the cloud. As we moved along that progression, we went from the folder analogy to the file cabinet analogy to the library analogy to — well, at petabytes we ran out of organizational analogies.

At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order but of dimensionally agnostic statistics. It calls for an entirely different approach, one that requires us to lose the tether of data as something that can be visualized in its totality. It forces us to view data mathematically first and establish a context for it later. For instance, Google conquered the advertising world with nothing more than applied mathematics. It didn't pretend to know anything about the culture and conventions of advertising — it just assumed that better data, with better analytical tools, would win the day. And Google was right.

Google's founding philosophy is that we don't know why this page is better than that one: If the statistics of incoming links say it is, that's good enough. No semantic or causal analysis is required. That's why Google can translate languages without actually "knowing" them (given equal corpus data, Google can translate Klingon into Farsi as easily as it can translate French into German). And why it can match ads to content without any knowledge or assumptions about the ads or the content.

Speaking at the O'Reilly Emerging Technology Conference this past March, Peter Norvig, Google's research director, offered an update to George Box's maxim: "All models are wrong, and increasingly you can succeed without them."

This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.

The big target here isn't advertising, though. It's science. The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works.

This is the way science has worked for hundreds of years. Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.

But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete.

Consider physics: Newtonian models were crude approximations of the truth (wrong at the atomic level, but still useful). A hundred years ago, statistically based quantum mechanics offered a better picture — but quantum mechanics is yet another model, and as such it, too, is flawed, no doubt a caricature of a more complex underlying reality. The reason physics has drifted into theoretical speculation about n-dimensional grand unified models over the past few decades (the "beautiful story" phase of a discipline starved of data) is that we don't know how to run the experiments that would falsify the hypotheses — the energies are too high, the accelerators too expensive, and so on.

Now biology is heading in the same direction. The models we were taught in school about "dominant" and "recessive" genes steering a strictly Mendelian process have turned out to be an even greater simplification of reality than Newton's laws. The discovery of gene-protein interactions and other aspects of epigenetics has challenged the view of DNA as destiny and even introduced evidence that environment can influence inheritable traits, something once considered a genetic impossibility. In short, the more we learn about biology, the further we find ourselves from a model that can explain it.

There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

The best practical example of this is the shotgun gene sequencing by J. Craig Venter. Enabled by high-speed sequencers and supercomputers that statistically analyze the data they produce, Venter went from sequencing individual organisms to sequencing entire ecosystems. In 2003, he started sequencing much of the ocean, retracing the voyage of Captain Cook. And in 2005 he started sequencing the air. In the process, he discovered thousands of previously unknown species of bacteria and other life-forms.

If the words "discover a new species" call to mind Darwin and drawings of finches, you may be stuck in the old way of doing science. Venter can tell you almost nothing about the species he found. He doesn't know what they look like, how they live, or much of anything else about their morphology. He doesn't even have their entire genome. All he has is a statistical blip — a unique sequence that, being unlike any other sequence in the database, must represent a new species. This sequence may correlate with other sequences that resemble those of species we do know more about. In that case, Venter can make some guesses about the animals — that they convert sunlight into energy in a particular way, or that they descended from a common ancestor. But besides that, he has no better model of this species than Google has of your MySpace page. It's just data.

By analyzing it with Google-quality computing resources, though, Venter has advanced biology more than anyone else of his generation.

This kind of thinking is poised to go mainstream. In February, the National Science Foundation announced the Cluster Exploratory, a program that funds research designed to run on a large-scale distributed computing platform developed by Google and IBM in conjunction with six pilot universities. The cluster will consist of 1,600 processors, several terabytes of memory, and hundreds of terabytes of storage, along with the software, including IBM's Tivoli and open source versions of Google File System and MapReduce. Early CluE projects will include simulations of the brain and the nervous system and other biological research that lies somewhere between wetware and software.

Learning to use a "computer" of this scale may be challenging. But the opportunity is great: The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.

There's no reason to cling to our old ways. It's time to ask: What can science learn from Google?

Chris Anderson (canderson@wired.com) is the editor in chief of Wired.
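The article names the Google File System and MapReduce as the software underpinning this statistics-first computing. As a rough illustration of the MapReduce programming model only, and not of the distributed Google/IBM systems described above, here is a single-process word count in Python written as an explicit map step and reduce step; spread across thousands of machines, computations of exactly this shape are what let corpus statistics stand in for hand-built models.

```python
# Toy, single-process illustration of the MapReduce programming model
# (not the distributed systems mentioned in the article).
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

def map_phase(documents: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def reduce_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce step: group the pairs by key and sum the counts per word."""
    counts: Dict[str, int] = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = [
    "all models are wrong",
    "all models are wrong but some are useful",
]
word_counts = reduce_phase(map_phase(docs))
print(sorted(word_counts.items(), key=lambda kv: -kv[1]))
```

Both phases are embarrassingly parallel (map runs independently per document, reduce independently per key), which is what lets the same pattern scale from a laptop to a petabyte-scale corpus.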
Personal category: Miscellaneous | 1788 reads | 3 comments
About The Fourth Paradigm: Data-Intensive Scientific Discovery
Popularity 1  dsc70  2011-5-10 12:33

The Fourth Paradigm: Data-Intensive Scientific Discovery, edited by Tony Hey and colleagues and published by Microsoft Research, is no longer a new book (it appeared in 2009). There are said to be organizations in China arranging a Chinese translation, but I have yet to see one. Even so, for many people in China working in natural science research, the science of science, scholarly communication, or research management, it is a book worth reading. Although the English edition is freely available (http://research.microsoft.com/en-us/collaboration/fourthparadigm/), we are sharing its table of contents here with interested readers.

The Fourth Paradigm: Data-Intensive Scientific Discovery
Edited by Tony Hey, Stewart Tansley, and Kristin Tolle

Contents

xiii  Foreword (Gordon Bell)
xix   Jim Gray on eScience: A Transformed Scientific Method (Tony Hey, Stewart Tansley, and Kristin Tolle, eds.)

1. Earth and Environment
3    Introduction (Dan Fay)
5    Gray's Laws: Database-Centric Computing in Science (Alexander S. Szalay, José A. Blakeley)
13   The Emerging Science of Environmental Applications (Jeff Dozier, William B. Gail)
21   Redefining Ecological Science Using Data (James R. Hunt, Dennis D. Baldocchi, Catharine van Ingen)
27   A 2020 Vision for Ocean Science (John R. Delaney, Roger S. Barga)
39   Bringing the Night Sky Closer: Discoveries in the Data Deluge (Alyssa A. Goodman, Curtis G. Wong)
45   Instrumenting the Earth: Next-Generation Sensor Networks and Environmental Science (Michael Lehning, Nicholas Dawes, Mathias Bavay, Marc Parlange, Suman Nath, Feng Zhao)

2. Health and Wellbeing
55   Introduction (Simon Mercer)
57   The Healthcare Singularity and the Age of Semantic Medicine (Michael Gillam, Craig Feied, Jonathan Handler, Eliza Moody, Ben Shneiderman, Catherine Plaisant, Mark Smith, John Dickason)
65   Healthcare Delivery in Developing Countries: Challenges and Potential Solutions (Joel Robertson, Del DeHart, Kristin Tolle, David Heckerman)
75   Discovering the Wiring Diagram of the Brain (Jeff W. Lichtman, R. Clay Reid, Hanspeter Pfister, Michael F. Cohen)
83   Toward a Computational Microscope for Neurobiology (Eric Horvitz, William Kristan)
91   A Unified Modeling Approach to Data-Intensive Healthcare (Iain Buchan, John Winn, Chris Bishop)
99   Visualization in Process Algebra Models of Biological Systems (Luca Cardelli, Corrado Priami)

3. Scientific Infrastructure
109  Introduction (Daron Green)
111  A New Path for Science? (Mark R. Abbott)
117  Beyond the Tsunami: Developing the Infrastructure to Deal with Life Sciences Data (Christopher Southan, Graham Cameron)
125  Multicore Computing and Scientific Discovery (James Larus, Dennis Gannon)
131  Parallelism and the Cloud (Dennis Gannon, Dan Reed)
137  The Impact of Workflow Tools on Data-Centric Research (Carole Goble, David De Roure)
147  Semantic eScience: Encoding Meaning in Next-Generation Digitally Enhanced Science (Peter Fox, James Hendler)
153  Visualization for Data-Intensive Science (Charles Hansen, Chris R. Johnson, Valerio Pascucci, Claudio T. Silva)
165  A Platform for All That We Know: Creating a Knowledge-Driven Research Infrastructure (Savas Parastatidis)

4. Scholarly Communication
175  Introduction (Lee Dirks)
177  Jim Gray's Fourth Paradigm and the Construction of the Scientific Record (Clifford Lynch)
185  Text in a Data-Centric World (Paul Ginsparg)
193  All Aboard: Toward a Machine-Friendly Scholarly Communication System (Herbert Van de Sompel, Carl Lagoze)
201  The Future of Data Policy (Anne Fitzgerald, Brian Fitzgerald, Kylie Pappalardo)
209  I Have Seen the Paradigm Shift, and It Is Us (John Wilbanks)
215  From Web 2.0 to the Global Database (Timo Hannay)

Final Thoughts
223  The Way Forward (Craig Mundie)
227  Conclusions (Tony Hey, Stewart Tansley, Kristin Tolle)
230  Next Steps
231  Acknowledgments
235  A Few Words About Jim...
237  Glossary
241  Index
16358 reads | 1 comment
