科学网 › 标签 › Learning › 相关日志

标签: Learning

相关日志

人工智能中的“深度学习”露馅了？: 热度 4 zlyang 2020-6-11 15:14; 人工智能中的“深度学习” 露馅了？露馅 lòu xiàn：是一个汉语词汇，比喻不肯让人知道而隐瞒的事物暴露出来。如：物证面前，谎言露馅了。 https://baike.baidu.com/item/%E9%9C%B2%E9%A6%85 Deep Learning ？张钹，中国科学院院士，信息技术学部 http://casad.cas.cn/sourcedb_ad_cas/zw2/ysxx/xxjskxb/200906/t20090624_1807697.html 张钹老师的照片。找不到照片的出处了。感谢有关人员！一、清华大学，2020-01-15，清华人工智能研究院院长张钹：深度学习的钥匙丢在黑暗角落 https://www.thepaper.cn/newsDetail_forward_5513995 https://news.tsinghua.edu.cn/info/1013/68575.htm 深度学习应用于模式识别虽然可以在大数据的训练中学到正确的分类，却很容易受到恶意干扰、欺骗和攻击。将狮子识别为图书馆、把雪山认作一只狗、停止标志识别被当成限速标志…… 此类深度学习系统被“忽悠”的案例层出不穷，如果发生在自动驾驶场景，就可能产生严重后果。二、科学网，2017-08-09，张钹院士：人工智能超过人类只是特定意义上的可能 http://news.sciencenet.cn/htmlnews/2017/8/384661.shtm 为什么机器下围棋能够超过人类？为什么人工智能在图像识别的某些方面会超过人类？我认为有三大法宝：第一是数据，第二是计算资源，第三则是算法。这就是深度学习成功的三大法宝。日常生活中，人们常常感慨大数据的力量，计算资源的力量，但是没有看到背后算法的重要性。比如AlphaGo能够在两三周的时间内，学到几千万个棋局，能够自己和自己下围棋，靠的正是强化学习算法。三、2018- 03-20，张钹院士：深度学习优势与短板中国AI机遇和挑战（最新演讲实录） https://blog.csdn.net/weixin_34198881/article/details/89750898 我们要解决小样本甚至零样本学习的问题，小样本学习就是用很少的样本学习和训练，然后就可以推广到应用。比如小孩学习一个马或者牛的概念，只要看一下马或牛，甚至看一下马的图片就能认识真正的马，计算机不行，得把所有情况所有背景下的马都得让它看，要看成千上万个它才能识别。我们看一下为什么机器学习的效率这么低，还要使用那么多样本，比如用这张图告诉（机器）说这里有一只猫，这个猫在这里面信息流占了多少比重呢？我们有计算过是1.1%，也就是说提供的这个样本只有1%左右有用，99%没有用，因为提供这个照片告诉它这里是一只猫，计算机根本不知道猫在哪儿，所以这就迫使人们必须用大量的样本，告诉它这是猫，在草地的猫，在另外的背景里猫会变成这样，要用各式各样的样本在不同背景下的猫去训练它，它才能认识，只有跟它相近的背景、相近的角度拍下的猫它才认识，如果背景变了，猫拍摄的角度变了它也不认识了，所以这是它的一个根本性的问题，它不理解，但人是看了这个猫就理解这个猫。四、2017-10-07，四位人工智能界的泰斗大牛关于人工智能的理解与预言 https://www.sohu.com/a/196585244_650579 深度学习目前有两个很难克服的重要缺点： 1、鲁棒性差。机器学习过的内容，和没学习过的内容，在识别效果方面差距太大。例如一个模式识别系统，经过训练可以很好地识别马、牛、羊。你给它一块石头，它有可能认为是马。 2、机器数据输入和输出结果差距太大。人的智能是举一反三，而机器是举一百反一。给几百万的数据，识别几万个目标。这和人类是背道而驰的。所以，现在的人工智能还有很长的路要走。五，2018-06-30，张钹院士：走向真正的人工智能 | CCF-GAIR 2018 https://www.sohu.com/a/238591807_114877?_f=index_pagerecom_94 必须回答下面三个问题：第一，什么叫做真正的人工智能？我们的目标是什么？第二，为什么我们需要真正的人工智能？第三，我们如何走向真正的人工智能？周穆王西巡狩，路遇匠人名偃师。翌日偃师谒见王，偕来一个假人。「趋步俯仰，信人也」。「领其颅，则歌合律；捧其手，则舞应节。千变万化，惟意所适。王以为实人也，与盛姫内御并观之，技将终，倡者瞬其目而招王之左右侍妾。王大怒，要杀这个偃师。偃师大慑，立剖其倡者以示王，皆傅会革、木、胶、漆、白、黑、丹、青之所为。穆王始悦，诏贰车载之以归。六、徐匡迪之问《科技日报》，2019年06月24日，星期一，第08版：AI实验室，正视短板加大核心算法等关键基础研究投入 http://digitalpaper.stdaily.com/http_www.kjrb.com/kjrb/html/2019-06/24/content_424121.htm?div=-1 里的“ 加大对核心算法等关键基础研究的投入 ”，后来以“ 中国有多少数学家投入到人工智能的基础算法研究中？ ”方式，被称作“ 徐匡迪之问 ”。见：《上海科技报》2019年5月17日第003版的“聚焦基础算法，让“徐匡迪之问”有解”。 http://www.duob.cn/cont/812/206828.html http://www.duob.cn/FileUploads/pdf/190517/kj05173.pdf 相关链接： 2010-08-27，11年前的记忆：人脑复杂性的估计及其哲学意义 http://blog.sciencenet.cn/blog-107667-356704.html 杨正瓴. 人脑有多复杂？《百科知识》，1997, 7(总第216期): 39 – 40. http://www.cnki.com.cn/Article/CJFDTotal-BKZS199707022.htm 杨正瓴，林孔元. 人类智能模拟的“第2类数学（智能数学）”方法的哲学研究，《哲学研究》，1999, (4): 44 – 50. http://www.cqvip.com/QK/80454X/19994/1002190349.html http://www.cnki.com.cn/Article/CJFDTotal-ZXYJ199904005.htm 2019-02-28，往日（1）：小样本数理统计学与“压缩感知 Compressed sensing” http://blog.sciencenet.cn/blog-107667-1164730.html 2018-08-18，“大数据”时期，更渴望“小样本数理统计学” http://blog.sciencenet.cn/blog-107667-1129894.html 感谢您的指教！感谢您指正以上任何错误！感谢您提供更多的相关资料！; 3376 次阅读|18 个评论

RL资源网站（2.27.2020）: ChenChengHUST 2020-2-27 19:44; Python: https://blog.csdn.net/u014119694/article/details/76095796 RL： http://incompleteideas.net/book/code/code2nd.html RL的b站网课： https://www.bilibili.com/video/av32149008?p=1; 个人分类: RL学习|2 次阅读|0 个评论

Install: tensorflow2.0/1.4 -keras GPU windows: sunsett 2019-10-14 22:28; Readme 20191014 ====================== 准备入门keras，后面将会记录过程中遇到的问题与心得 === reference https://visualstudio.microsoft.com/vs/older-downloads/ https://developer.nvidia.com/cuda-downloads https://docs.nvidia.com/deeplearning/sdk/cudnn-install/ https://tensorflow.google.cn/ https://keras.io/ ===== workflow 需要注意的是版本匹配，具体参考 https://tensorflow.google.cn/install/source_windows （英文版！）看到这张表，感觉已经成功了一大半，接下来的步骤就是：安装vs2017（或更高版本） cuda 10.0以及库文件cudnn7.4 python 3.6：虽然官方文档显示 3.7也可以，pip指令可能找不到很多库文件 pycharm: 本人常用的IDE, 比较简洁，入门快速 tensorflow-gpu版本安装，注意新版本已经将gpu与cpu版本集成，故只安装gpu版本即可。并且，tensorflow已经继承了keras,故不需要额外安装 keras库，且目前最新版本的keras库可能并没有很好的集成tensorflow,故建议使用tensorflow已有的keras库。因此，加载方式一般为 tensorflow.keras，调用方式一般为 tf.keras 版本1.14是稳定版，2.0最新版：本身测试了一下2.0，有许多函数需要修改，为了方便，还是装了14版本的测试：可参考 https://keras.io/ 选择一个实例，进行测试另外numpy库可能会报错，因为版本过高（1.17.0），可以装1.16.0; 个人分类: tensorflow|3658 次阅读|0 个评论

机器学习常用方法网址（第一批汇集）: zlyang 2019-9-28 19:06; 机器学习常用方法网址（第一批汇集）（1）2016-09-26，30分钟学会用scikit-learn的基本回归方法（线性、决策树、SVM、KNN）和集成方法（随机森林，Adaboost和GBRT） https://blog.csdn.net/u010900574/article/details/52666291 竟然KNN这个计算效能最差的算法效果最好（2）2018-12-26，机器学习的几种方法（knn，逻辑回归，SVM，决策树，随机森林，极限随机树，集成学习，Adaboost，GBDT） https://blog.csdn.net/fanzonghao/article/details/85260775 （3） 2018-02-13，Machine Learning: 十大机器学习算法 https://zhuanlan.zhihu.com/p/33794257 （4）2018-12-28，机器学习 – machine learning | ML https://easyai.tech/ai-definition/machine-learning/ 15种经典机器学习算法算法训练方式线性回归监督学习逻辑回归监督学习线性判别分析监督学习决策树监督学习朴素贝叶斯监督学习 K邻近监督学习学习向量量化监督学习支持向量机监督学习随机森林监督学习 AdaBoost 监督学习高斯混合模型非监督学习限制波尔兹曼机非监督学习 K-means 聚类非监督学习最大期望算法非监督学习（5）2017-06-01，深度学习笔记——基于传统机器学习算法（LR、SVM、GBDT、RandomForest）的句子对匹配方法 https://blog.csdn.net/mpk_no1/article/details/72836042 从准确率上来看，随机森林的效果最好。时间上面，SVM耗时最长。（6）2016-08-04，8种常见机器学习算法比较 https://www.leiphone.com/news/201608/WosBbsYqyfwcDNa4.html 通常情况下：【GBDT=SVM=RF=Adaboost=Other…】（7）2016-07-21，各分类方法应用场景逻辑回归，支持向量机，随机森林，GBT，深度学习 https://www.quora.com/What-are-the-advantages-of-different-classification-algorithms （8）2018-12-10，LR、SVM、RF、GBDT、XGBoost和LightGbm比较 https://www.cnblogs.com/x739400043/p/10098659.html （9）2018-03-07，机器学习（三十六）——XGBoost, LightGBM, Parameter Server https://antkillerfarm.github.io/ml/2018/03/07/Machine_Learning_36.html （10）2019-07-29，深度学习（三十七）——CenterNet, Anchor-Free, NN Quantization https://blog.csdn.net/antkillerfarm/article/details/97623139 （11）2019-05-13，机器学习该怎么入门？ https://www.zhihu.com/question/20691338 （12）2018-12-28，机器学习、人工智能、深度学习是什么关系？ https://easyai.tech/ai-definition/machine-learning/ （13）2019-09-27，机器学习 https://www.zhihu.com/topic/19559450/hot 2019-09-26， ALBERT A Lite BERT for Self-supervised Learning of Language Representations https://arxiv.org/abs/1909.11942 （14）2016，机器学习 Machine-Learning https://github.com/JustFollowUs/Machine-Learning （15） Bradley Efron, Trevor Hastie, 2016-08, Computer Age Statistical Inference: Algorithms, Evidence and Data Science https://web.stanford.edu/~hastie/CASI/ （16）2019-06-05，张钹院士：人工智能技术已进入第三代 https://blog.csdn.net/cf2SudS8x8F0v/article/details/90986936 相关链接： 2016-09-01，支持向量机 Support Vector Machine 程序网址 http://blog.sciencenet.cn/blog-107667-1000087.html 2016-09-01，Crosswavelet and Wavelet Coherence 小波分析的程序网址 http://blog.sciencenet.cn/blog-107667-1000091.html 2019-07-27，威布尔分布 Weibull Distribution 资源网页搜集 http://blog.sciencenet.cn/blog-107667-1191323.html 2016-09-01，极限学习机 Extreme Learning Machines (ELM) 程序网址 http://blog.sciencenet.cn/blog-107667-1000094.html 2019-09-27，极值分布 Extreme Values Distribution 相关网页 http://blog.sciencenet.cn/blog-107667-1199726.html 2019-09-22，模糊数学：扎德“解模糊”、卡尔曼“模糊化”（博文网页汇集） http://blog. sciencenet.cn/blog-107667-1199064.html 2018-08-26,估计 Largest Lyapunov exponent 的 matlab 程序搜集（网址） http://blog.sciencenet.cn/blog-107667-1131215.html 2018-08-18 ，“大数据”时期，更渴望“小样本数理统计学” http://blog.sciencenet.cn/blog-107667-1129894.html 2017-07-11，布拉德利·埃弗龙（Bradley Efron）：2005年美国国家科学奖章得主（统计学） http://blog.sciencenet.cn/blog-107667-1065714.html 感谢您的指教！感谢您指正以上任何错误！感谢您提供更多的相关资料！; 个人分类: 风电功率预测|3860 次阅读|3 个评论

Learning History II A student’s Guide to The National Exper: 黄安年 2019-3-14 15:35; Learning History II A student’s Guide to The National Experience 【Judith M. Walter等著《国家的经历，学习历史指南 II 》， 19 93 年第八版】【黄安年个人藏书书目（美国问题英文部分编号4 33 】黄安年辑黄安年的博客 /2019 年 3 月 14 日发布（第 21203 号）自2019年起，笔者将通过博客陆续发布个人收藏的全部图书书目，目前先发布美国问题英文书目，已经超过432单独编号，不分出版时间先后与图书类别。这里发布的是 Judith M. Walter 等著 Learning History II A student’s Guide to The National Experience Part Two The History of The United States Since 1865 ( 《国家的经历，学习历史指南 II 》, 第二部分 1865年以来的美国史) ，Harcount Brace Jovanovich College Publishers 19 93 年第八版, 390 页。照片15张拍自该书 1 ， 2 ， 3 ， 4, 5, 6, 7, 8, 9, 10, 11 ， 12 ， 13 ， 14 ， 15 ，; 个人分类: 个人藏书书目|1231 次阅读|0 个评论

【AI泥沙龙笔记：热议周教授提出的深度突破的三大条件】: 热度 1 liwei999 2018-4-18 15:41; 李：上周，周志华教授作为神秘AI大咖嘉宾，请到京东的AI峰会做了个主题演讲。有意思的是他讲到的三点。他的讲演主题是“ 满足这三大条件，可以考虑不用深度神经网络 ”： 1. 有逐层的处理；2 有特征的内部变化； 3. 有足够的模型复杂度。这就有意思了。我们符号派所说的深度解析（deep parsing）和主流当红的深度学习（deep learning），在这三点上，是英雄所见还是殊途同归？不知道这种“巧合”是不是有些牵强，或者是非主流丑小鸭潜意识对主流白天鹅的“攀附”？总之，学士大满贯的周教授的这个总结不仅字字珠玑，深入本质，而且非常受用。他是说深度神经的突破，根本原因是由于上面三条。所以，反过来论证说，既然如此，如果有了这三条，其他模型未尝不能突破，或者其他模型可以匹敌或弥补深度神经。陈：有了dl，谁还费力想其它的李：周教授就是“费力”想其他的人。他指出了深度神经的缺陷：1 调参的困扰；2. 可重复性差；3. 模型复杂度不能随数据自动适应；4. 理论分析难；5. 黑箱；6. 依赖海量标注。由于这些问题的存在，并不是每一个AI任务都合适用深度神经。对于同一个任务，也不是每一个AI团队都可以重复AI大咖的成绩。毛：谁说每个AI任务都合适用深度神经了？DL只是补上缺失的一环。李：没人明说，无数人这么 assume 毛：应该说，无数人这么 misunderstand。李：哈，我称之为“迷思”：misconception 毛：反正是mis-something 李：从我的导师辈就开始的无数探索和实践，最后得出了自然语言的解析和理解必须多层进行的结论。虽然这与教科书，与乔姆斯基相悖。陈：小孩好像从不这么理解李：以前论过的：鉴于自然语言的结构复杂性，文句的深度解析和理解很难在单层的系统一蹴而就，自浅而深的多层管式系统于是成为一个很有吸引力的策略。多年的实践表明，多层系统有利于模块化开发和维护，为深度解析的工程化和实用化开辟了道路。但多层系统面临一个巨大的挑战，这个挑战来自于语言中的并不鲜见的相互依赖的歧义现象。多层了以后，很多不可解的问题，变得可解了。论解析的深度和应对复杂现象和结构能力，多层系统与单层系统完全不可同日而语。30多年前，我的导师做的解析系统是四、五层。但是多层的思路已经萌芽，而且方法论得到确认。最近20多年，我自己的摸索和尝试，发现大约是 50-100 层这个区间比较从容和自如。这不是因为语言中表现出来的递归结构需要这么多层，如果只是为了对付真实语言的递归，五六层也足够了。多层的必要性为的是要有足够的厚度及其动态的中间表达，去容纳从词法分析、实体识别、（嵌套）短语分析、单句分析、复句分析乃至跨句分析（篇章分析）以及从形式分析、语义分析到语用分析的全谱。当然，这么多层能够顺利推展，前提是要找到解决多层系统面临的挑战的有效方法，即：对相互依赖现象的化解之策。如何在多层系统中确保“负负得正”而不是“错误放大”（error propagation）（【立委科普：管式系统是错误放大还是负负得正？】）？如何应对 nondeterministic 结果的多层组合爆炸？如果采用 deterministic 的结果，多层的相互依赖陷阱如何规避？我们论过的“休眠唤醒”的创新就是其中一个对策（【立委科普：结构歧义的休眠唤醒演义】）。毛：乔老爷没说不能多层啊。递归与多层不就是一回事？李：他的递归是在一层里面 parse 的，CFG chart parsing 是教科书里面的文法学派的经典算法。毛：这只是形式和实质的区别。我觉得只是深度优先与宽度优先的区别。李：他鼓吹 CFG 的递归特性，正是因为他不懂得或不屑认真对待多层叠加的道路。后者理论上的确不够漂亮。多少有些“凑”的意思，太多工程的味道，模块化的味道，补丁摞补丁的味道，这不符合乔老爷的口味，但实践中比他的递归论要强得多。CFG 能做到的，叠加和拓展了的 FSAs 全部可以做到，但是叠加的 FSAs 所能达到的深度和能力，CFG 却望尘莫及。递归算个啥事儿嘛，不过是在多层里n次循环调用而已。多层所解决的问题比递归结构的挑战要广得多，包括困扰parsing界很久的“伪歧义”问题（【李白雷梅59：自动句法分析中的伪歧义泥潭】）。毛：我倒也是更赞同你说的 FSA，但是认为本质上没有什么不同，不同的只是方法。李：这是第一个英雄所见，或殊途同归。深度神经现在几百层了，deep parsing 也 50-100 层了。不是不能超过 100 层，而是确实没有这个必要。迄今还没有发现语言现象复杂到需要超过百层的符号逻辑。毛：这两个多层，性质是不一样的。李：所以我说这种比对可能“牵强”。但哲学上有诸多相通之处，的确二者都是很 deep 的，有厚度。那边叫隐藏层，反正我是搞不懂。这边倒是小葱拌豆腐，一清二白的，不说老妪能解吧，但这些个符号逻辑的层次，至少可以对语言学家，领域专家，还有AI哲学家像毛老和群主，还有AI工程大咖利人，可以对你们这些“老人”讲清楚的。这就是我说的，所谓符号逻辑，就是人类自己跟自己玩一个游戏，其中的每一个步骤都是透明的，可解释的。符号派的旗号可以是“模拟”人脑的思维逻辑，其实这个旗号也就是个旗号而已。模拟不摸拟，这一点已经不重要了，关键是效果。何况鬼知道人的语言认知是不是这么乏味、死板、机械，拼拼凑凑，还不如玩家家呢（如果人类思维真的是符号派所模型的那个样子，其实感觉人类蛮可怜的）。毛：大多数人的思维可能还没有这么复杂。李：但这种游戏般的模拟，在实践中的好处是显然的，它利于开发（自己能跟自己玩的那些游戏规则有助于步骤的梳理，以便各个击破），容易维护和debug（比较容易知道是哪一层的错误，或哪几层有修复的机会及其各自的利弊）. 马：越是层次的思维越是更容易模拟，符号派模拟的是高层次的。毛：对，就是缺了低层次这一环，才需要由DL来补上。郭： @毛德操，周志华这次演讲，还特别强调了深度之于广度的核心差异，那就是他的第二条：每层都是在不同特征维度上。他从两个角度阐明这点。一，至少在1989年，大家就已经知道，在无限逼近任意连续可微函数这件事上，只要宽度足够，单隐含层就好。多层貌似并非必要，或者说多层并没有提高“表达力”。但是，单层系统，从来没能达到同规模多层系统的学习和泛化能力。二，多层，就可以有结构。譬如resnet，可以在不同层面选取综合不同维度的特征，可以有多信息流。这条，貌似隐含地说了，人的干预还是重要的。李：是的，周教授强调的第二点是特征逐层更新。深度学习之前的系统是在同一个静态特征集上work的，包括最像符号逻辑的决策树模型。而深度之所以 deep，之所以有效和powerful，是与特征的变化更新分不开的，这个道理不难理解。深度的系统不可能在静态的特征上发力，或者说，特征静态也就没有深度的必要了。深度系统是一个接力赛的过程，是一浪推一浪的。这一点在我们的实践中是预设的，当成不言而喻的公理。我们的深度解析，起点就是词典特征和形态特征，随着从浅层到深层的逐层推进，每一步处理都是在更新特征：根据各种角度的上下文条件，不断增加新特征，消除过时的旧特征，或细化已有的特征。后面一层层就这样在越来越优化的特征上，逐步取得对于语言的结构解析和理解。毛：深度优先与广度优先，没有绝对的好坏或强弱，要看具体的应用。在NLP中也许是广度优先好一些。乔姆斯基讲的是专门针对 CFG 的，你那个实际上已经越出了这个范畴。李：特征是动态的，反映了搜素空间不断缩小，是真理不断逼近的认知过程。很难想象一个系统在一个静态特征的平面可以达到对于复杂语言现象的深度解析。马：在某些特殊情况下，已经证明层数少，需要指数级的增加神经元才可以达到层数深的效果。而神经元的增加又加大了计算复杂性，对数据量的要求更大。毛：如果上下文相关，那么分层恐怕确实更灵活一些。李：这就是我说的乔老爷把“power”这个日常用词术语化以后，实际上给人带来了巨大的误导：他的更 “powerful” 的递归 CFG 比二等公民的 less powerful 的 FSA 所多出来的 “power” 不过就是在单层系统里面可以处理一些递归结构而已。而把一批 FSAs 一叠加，其 power 立马超越 CFG。总之，特征不断更新是深度解析的题中应有之义。而这一点又恰好与深度神经不谋而合，殊途同归了。周教授眼毒啊。教授的第三点，关于深度系统需要足够的模型复杂度，我不大有把握可以做一个合适的比对。直觉上，由于分而治之由浅入深的多层系统对于组合爆炸的天然应对能力，如果我们假想我们有一种超自然的能力能够把一个 50 层的解析系统，完全碾压到一个平面，那将是一个多大的 network，遮天蔽日，大到难以想象！马：符号表示的复杂性可以说是无穷大吧？模型的复杂度指表达能力？太复杂又容易过拟合李：周说的是，因为不知道多复杂合适，所以得先弄得很复杂，然后再降低复杂度。他把这个说成是深度神经的一个缺陷。郭：周志华特别强调，他的“复杂度”，不是指“表达力”(“单层多层同样的表达力，但多层可以复杂的多”)。他没给定义，但举了resnet作为例子，并且明确提了“特征信息流的数目”，还说了：多层，但特征信息流动单一的，也没有复杂度。回顾周说的这三条，李维的 deep parser 条条符合！有逐层的处理 -- 李维的，少说也有50层吧！有特征的内部变化 -- 李维的，每层都在不同的维度/颗粒度/角度，用不同的特征/属性，产生新的特征/属性有足够的模型复杂度 -- 李维的，也有明显的“复杂度”(周志华强调，“复杂度”，不是指“表达力”。过度的“表达力”，往往是负面的)。李维的，不仅有传统的 linguistics motivated 概念/特征/属性，也广泛采用“大数据”(基于统计的)。最近也开始利用“AI”(基于分布式表示的)。还有一点，周志华多次强调(我认为是作为“三条件”必然推论的)，“深度学习，关键是深度，但不一定要 '端到端' ”。他更强调(至少是我的理解)，为了端到端，一味追求可微可导，是本末倒置。深度学习，中间有不可微不可导的特征/存储，应该是允许甚至是必要的。对这一点，李维的“休眠唤醒”，大概也可算是 remotely related. 白：拉倒。带前后条件的FSA早已不是纯种的FSA，只是拿FSA说事儿而已，真实的能力早已超过FSA几条街。毛：这就对了。其实，自然语言哪里是 CFG 可以套得上的。李：我其实不想拿 FSA 或 FSA++ 说事儿，听上去就那么低端小气不上档次。可总得有个名儿吧，白老师帮助起个名字？教给实习生的时候，我说你熟悉 regex 吧，这就好比是个大号的 regex，可实习生一上手说不对呀这比 regex 大太多了。这套 formalism 光 specs，已经厚厚一摞了，的确太超过。要害是剔除了没有线性算法的递归能力。毛：记得白老师提过毛毛虫的说法，我还说了句“毛毛虫的长度大于CFG的直径”。（【白硕– 穿越乔家大院寻找“毛毛虫”】）白：有cat，有subcat，还拿这些东西的逻辑组合构成前后条件，还有优先级。有相谐性，有远距离雷达，有实例化程度不等带来的优先级设定。哪个FSA有这么全套的装备？陈：基于规则，遇到长句子一般必死李：非规则的找个不死的瞧瞧。再看看规则的怎么个死法。反正是死。看谁死得优雅。你出一组长句子，找一个学习的 parser，然后咱们可以比较一下死的形态。白：先说任务是啥，再说死活。李：我是说利人的腔调，极具代表性，那种典型的“成见/偏见”（【 W. Li T. Tang: 主流的傲慢与偏见：规则系统与机器学习】）。马：人家DL端到端，不做parser。现在有人做从语音直接到文本的翻译，不过效果还不行，主要可能是数据问题李：苹果梨子如何比较死活。毛：乔老爷的CFG不应该算入AI，那只是形式语言的解析。陈：确实都死。。。但一个死了也没法解释，不要解释。另一个就得思考哪个规则出问题了毛：人也好不到哪里，只不过人不死，只是懵了。李： 😄懵了就是人造死，artificial death 马：规则的好处是，你说什么不行？我马上可以加一个规则。这就是我前面说的复杂性无穷。😄即表达能力无穷白：假设任务是从文本抽取一堆关系，放进知识图谱。假设任务是根据用户反馈，把错的对话改对，同时对的对话不错。陈：抽取这个很重要，很多理解的问题其实是抽取问题。比如，阅读问答题毛：我还是相信多层符号会赢。李：从文本抽取关系谁更行，需要假设同等资源的投入才好比。我以前一直坚信多层符号，现在有些犹疑了，主要是标注人工太便宜了。到了标注车间，简直就是回到了卓别林的《摩登时代》，生产线上的标注“白领”面对源源不断的数据，马不停蹄地标啊标啊，那真不是人干的活儿啊，重复、单调、乏味，没看见智能，只看见人工，甭管数据有多冗余和灰色。这就是当今主流“人工智能”的依托，让人唏嘘。当然，另一方面看，这是当今AI在取代了很多人工岗位后，难得地给社会创造就业机会呢，将功补过，多多益善，管他什么工作，凡是创造就业机会的，一律应予鼓励。毛： @wei 这不正好是训练条件反射吗陈：反正智能的事都让机器去做了，人就只好做些低级如标注的活了白：问题是啥叫符号？基于字节？字符？基于词已经是符号了吧。是不是要退到茹毛饮血，连词也不分，才算非符号。否则都是站在符号肩膀上毛：我认为可以这样来类比: 一个社会经验丰富、老江湖的文盲，跟一个教授，谁能理解更多的语句。我想，除那些江湖切口和黑话，还有些需要“锣鼓听声，说话听音”的暗示以外，一定是教授能理解更多的语句。而且，即使是江湖切口黑话，也能慢慢加到教授的知识库中。李：都是站在符号肩膀上。然而，符号系统的实质不是符号，而是显性的可解释的符号逻辑。就是那套自己跟自己玩系统内部能够自圆其说有过程有因果链条的针对符号及其动态特征做处理的算法。相对于建立在符号和特征基础上的不可解释的学习系统，很多时候这些系统被归结为一个分类问题，就是用原子化的类别符号作为语言落地的端对端目标。如果一个落地场景需要10个分类，只要定义清晰界限相对分明，你就找一批大学生甚至 crowd source 给一批在家的家庭妇女标注好了，一个类标它百万千万，然后深度训练。要是需要100个分类，也可以这么办，虽然标注的组织工作和质量控制要艰难得多，好在大唐最不缺的就是人工。可是，如果落地场景需要一千个、一万个不同侧面的分类，标注和学习的路线就难以为继了。白：结果是一个集合，已经比较复杂了。结果是关系集合，又更加复杂。让人类标注，好不到哪儿去。标注一个关系集合，等价于标注一个结构。【相关】周志华：满足这三大条件，可以考虑不用深度神经网络周志华最新演讲：深度学习为什么深？【立委科普：结构歧义的休眠唤醒演义】【立委科普：歧义parsing的休眠唤醒机制再探】【白硕– 穿越乔家大院寻找“毛毛虫”】【科研笔记：NLP “毛毛虫” 笔记，从一维到二维】【泥沙龙笔记：NLP 专门语言是规则系统的斧头】【新智元：理论家的围墙和工程师的私货】乔姆斯基批判泥沙龙笔记：再聊乔老爷的递归陷阱泥沙龙笔记：骨灰级砖家一席谈，真伪结构歧义的对策（2/2) 《自然语言是递归的么？》语言创造简史【立委科普：管式系统是错误放大还是负负得正？】【李白雷梅59：自动句法分析中的伪歧义泥潭】【 W. Li T. Tang: 主流的傲慢与偏见：规则系统与机器学习】《一日一析系列》【语义计算：李白对话录系列】【置顶：立委NLP博文一览】《朝华午拾》总目录; 个人分类: 立委科普|12638 次阅读|1 个评论

[转载]Understanding LSTM Networks: tsingguo 2017-6-7 16:58; Recurrent Neural Networks http://colah.github.io/posts/2015-08-Understanding-LSTMs/ Humans don’t start their thinking from scratch every second. As you read this essay, you understand each word based on your understanding of previous words. You don’t throw everything away and start thinking from scratch again. Your thoughts have persistence. Traditional neural networks can’t do this, and it seems like a major shortcoming. For example, imagine you want to classify what kind of event is happening at every point in a movie. It’s unclear how a traditional neural network could use its reasoning about previous events in the film to inform later ones. Recurrent neural networks address this issue. They are networks with loops in them, allowing information to persist. Recurrent Neural Networks have loops. In the above diagram, a chunk of neural network, $A$ , looks at some input $x_t$ and outputs a value $h_t$ . A loop allows information to be passed from one step of the network to the next. These loops make recurrent neural networks seem kind of mysterious. However, if you think a bit more, it turns out that they aren’t all that different than a normal neural network. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. Consider what happens if we unroll the loop: An unrolled recurrent neural network. This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists. They’re the natural architecture of neural network to use for such data. And they certainly are used! In the last few years, there have been incredible success applying RNNs to a variety of problems: speech recognition, language modeling, translation, image captioning… The list goes on. I’ll leave discussion of the amazing feats one can achieve with RNNs to Andrej Karpathy’s excellent blog post, The Unreasonable Effectiveness of Recurrent Neural Networks . But they really are pretty amazing. Essential to these successes is the use of “LSTMs,” a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version. Almost all exciting results based on recurrent neural networks are achieved with them. It’s these LSTMs that this essay will explore. The Problem of Long-Term Dependencies One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames might inform the understanding of the present frame. If RNNs could do this, they’d be extremely useful. But can they? It depends. Sometimes, we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in “the clouds are in the sky ,” we don’t need any further context – it’s pretty obvious the next word is going to be sky. In such cases, where the gap between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information. But there are also cases where we need more context. Consider trying to predict the last word in the text “I grew up in France… I speak fluent French .” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large. Unfortunately, as that gap grows, RNNs become unable to learn to connect the information. In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn them. The problem was explored in depth by Hochreiter (1991) and Bengio, et al. (1994) , who found some pretty fundamental reasons why it might be difficult. Thankfully, LSTMs don’t have this problem! LSTM Networks Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter Schmidhuber (1997) , and were refined and popularized by many people in following work. 1 They work tremendously well on a large variety of problems, and are now widely used. LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn! All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer. The repeating module in a standard RNN contains a single layer. LSTMs also have this chain like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way. The repeating module in an LSTM contains four interacting layers. Don’t worry about the details of what’s going on. We’ll walk through the LSTM diagram step by step later. For now, let’s just try to get comfortable with the notation we’ll be using. In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denote its content being copied and the copies going to different locations. !--To be a bit more explicit, we can split up each line into lines carrying individual scalar values: -- The Core Idea Behind LSTMs The key to LSTMs is the cell state, the horizontal line running through the top of the diagram. The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged. The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates. Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation. The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!” An LSTM has three of these gates, to protect and control the cell state. Step-by-Step LSTM Walk Through The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It looks at $h_{t-1}$ and $x_t$ , and outputs a number between $0$ and $1$ for each number in the cell state $C_{t-1}$ . A $1$ represents “completely keep this” while a $0$ represents “completely get rid of this.” Let’s go back to our example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject. The next step is to decide what new information we’re going to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values, $\tilde{C}_t$ , that could be added to the state. In the next step, we’ll combine these two to create an update to the state. In the example of our language model, we’d want to add the gender of the new subject to the cell state, to replace the old one we’re forgetting. It’s now time to update the old cell state, $C_{t-1}$ , into the new cell state $C_t$ . The previous steps already decided what to do, we just need to actually do it. We multiply the old state by $f_t$ , forgetting the things we decided to forget earlier. Then we add $i_t*\tilde{C}_t$ . This is the new candidate values, scaled by how much we decided to update each state value. In the case of the language model, this is where we’d actually drop the information about the old subject’s gender and add the new information, as we decided in the previous steps. Finally, we need to decide what we’re going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through $\tanh$ (to push the values to be between $-1$ and $1$ ) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to. For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that’s what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows next. Variants on Long Short Term Memory What I’ve described so far is a pretty normal LSTM. But not all LSTMs are the same as the above. In fact, it seems like almost every paper involving LSTMs uses a slightly different version. The differences are minor, but it’s worth mentioning some of them. One popular LSTM variant, introduced by Gers Schmidhuber (2000) , is adding “peephole connections.” This means that we let the gate layers look at the cell state. The above diagram adds peepholes to all the gates, but many papers will give some peepholes and not others. Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what we should add new information to, we make those decisions together. We only forget when we’re going to input something in its place. We only input new values to the state when we forget something older. A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014) . It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular. These are only a few of the most notable LSTM variants. There are lots of others, like Depth Gated RNNs by Yao, et al. (2015) . There’s also some completely different approach to tackling long-term dependencies, like Clockwork RNNs by Koutnik, et al. (2014) . Which of these variants is best? Do the differences matter? Greff, et al. (2015) do a nice comparison of popular variants, finding that they’re all about the same. Jozefowicz, et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks. Conclusion Earlier, I mentioned the remarkable results people are achieving with RNNs. Essentially all of these are achieved using LSTMs. They really work a lot better for most tasks! Written down as a set of equations, LSTMs look pretty intimidating. Hopefully, walking through them step by step in this essay has made them a bit more approachable. LSTMs were a big step in what we can accomplish with RNNs. It’s natural to wonder: is there another big step? A common opinion among researchers is: “Yes! There is a next step and it’s attention!” The idea is to let every step of an RNN pick information to look at from some larger collection of information. For example, if you are using an RNN to create a caption describing an image, it might pick a part of the image to look at for every word it outputs. In fact, Xu, et al. (2015) do exactly this – it might be a fun starting point if you want to explore attention! There’s been a number of really exciting results using attention, and it seems like a lot more are around the corner… Attention isn’t the only exciting thread in RNN research. For example, Grid LSTMs by Kalchbrenner, et al. (2015) seem extremely promising. Work using RNNs in generative models – such as Gregor, et al. (2015) , Chung, et al. (2015) , or Bayer Osendorfer (2015) – also seems very interesting. The last few years have been an exciting time for recurrent neural networks, and the coming ones promise to only be more so!; 个人分类: 深度学习|1651 次阅读|0 个评论

【李白之38：叫ＮＬＰ太沉重】: liwei999 2017-4-14 22:15; 没有规则的文法是怎么回事儿？白: “这件事非他莫属”，这种情况下“他”填谁的坑？李：填坑，从句法角度没有疑问，“这件事”是句法主语，“非他莫属”是谓语。如果主语是行为，采纳董老师的处理，把句法主语转为逻辑谓语，把句法谓语中的“他”提出来作为其逻辑语义的施事。如果主语不是行为，那么可以相应做一些逻辑语义表达（ｓｅｍａｎｔｉｃ　ｒｅｐｒｅｓｅｎｔａｔｉｏｎ）的调整，其中之一是，把该名词的“标配”动词作为省略成分提出来，“事儿”的标配就是“处理”或“做”（ＤＯ）。然后逻辑施事照旧。 “这件事非他莫属” ＝＝（只有）他（能）ＤＯ（这件事儿）这些个鸡零狗碎的处置，说到底都是自己跟自己玩儿。这里的所谓语义表达和语义落地，在ｐａｔｔｅｒｎ确定之后，我们其实心里都明白其涵义了，只不过需要用一种容易记忆容易处理的方式把“语义”表达出来，让人类看着舒服。其实这都是小事儿，属于　ｐａｒｓｉｎｇ　的　ｓｉｄｅ　ｅｆｆｅｃｔｓ，怎么方便怎么来，无一定之规，系统内部自足即可。关键不在作为　ｓｉｄｅ　ｅｆｆｅｃｔｓ　的　ｏｕｔｐｕｔ，而在于什么样的　ｐａｔｔｅｒｎ　、什么样的条件与　ｉｎｐｕｔ　匹配合适，匹配完了既然并无歧义，总能找到一个语义表达的出路。后面的考量也就是为了“好看”而已（就好比软件工程中很多内部数据结构表达出来要ｐｒｅｔｔｙ　ｐｒｉｎｔ一样）。白: “拉小提琴他最拿手了”“去北京他最合适了”“喝白酒他二两就醉了” 这是一种很常见的格式，不是因熟语而发明的，熟语只不过往这上面靠而已。李: 看电影他爱打瞌睡做报告他出口成章举样例他偷梁换柱。白: 在我的体系里，这是一种已经在局部填了坑的萝卜，在满足一定条件的更大范围内又被再利用而已。冯: 这是紧缩复句。白: 在我这里是merge的一种。两个互不隶属的谓词的坑共享萝卜，都是merge。冯: merge也就是紧缩了。白: 如果前面只是NP，就降格为状语：“这件事他最拿手了”当中“这件事“为状语。N降格为S+ 李: 白老师的句法里面，“填坑”对应的是文法的ａｒｇ，“修饰”对应的是ｍｏｄ，“合并”貌似对应的是ｃｏｎｊ或者ｃｏｍｐ，其余两个针对ｔｏｋｅｎ自己的操作，不对应ｄｅｐｅｎｄｅｎｃｙ的结构关系。这与词驱动ＨＰＳＧ有相当的吻合之处。白: 可以认为有一个隐性的token 李: ＨＰＳＧ　也声称只有词的结构表达，没有一条条文法规则，只有几个　ｓｃｈｅｍａｔａ　或叫　Ｐｒｉｎｃｉｐｌｅｓ，其中一个是针对　ａｒｇ　的　连接原则。另一个针对的是　ｍｏｄ。白: 他还是用PSG作拐棍儿。我连小词都是负载结构的。李: 如果仔细看那个根据原则而来的ｓｃｈｅｍａｔａ，基本没啥内容，就是一点最抽象的关系限制。说的是，如果一个　ｔｏｋｅｎ　要填另一个　ｔｏｋｅｎ　在词典的　ｓｕｂｃａｔ　ｐａｔｔｅｒｎ　里面规定的坑，除了所有规定的　ｆｅａｔｕｒｅｓ　必须能　ｕｎｉｆｙ　外（这个可以比喻为情投意合的自由恋爱），另外还有一点原则性限制（这个可以比喻为婚姻原则：譬如传统的婚姻原则必须是异性之间的结合，否则不发证书，也就是原则层面不允许结合，即便双方情投意合），加上一些子结构数据的　ｓｈａｒｉｎｇ　的规定。这个跟白老师声称没有规则，只有子范畴，以及根据子范畴的　ｐａｒｓｉｎｇ－ｒｕｎｎｅｒ　的抽象算法是同样的精神。当然，ＨＰＳＧ叠床架屋的数据结构以及ＰＳＧ与生俱来的组合爆炸低效率以及伪歧义困扰，也许已经被白老师解决了。白: 我没有组合爆炸。复杂特征集不是好东西，扔。李: 我同意。可是一开始用会入迷。逻辑上很清晰、细致和美丽。可以把语言的任何单位模型成一个非常飘逸的玉人一般。白: 严格限定只解决谁跟谁有关系，若非搂草打兔子顺手，绝不碰是什么关系。逻辑主宾语分不出来是本分，分出来是情分。李: 这个分不分就是一个阶段的问题。本质上是所有的　ａｒｇｓ　都是　ａｒｇ，这个上位概念是一致的，ａｒｇ１　还是　ａｒｇ２，还是　ａｒｇ３，ｓｕｂｃａｔ　可以进一步去规约。白: 隐性介词不知道什么格是本分，顺手安一个非他莫属的格标记是情分。李: 到了语义层面，必须去进一步去区分。白: 在检查相谐性的时候，有些角色已经跑不了了。这就叫搂草打兔子但是还有漏网之鱼，句法层面不应该care。比如“这场火多亏了消防队来得及时”里面，“这场火”前面有个隐性介词，知道这一点就够了。是啥介词管他呢。李: 有人问，没有规则怎么能叫文法呢？这里面的ｔｒｉｃｋ就在，并不是没有规则，而是规则隐藏在词典里面了。本质上是词驱动的规则集，构成了词典主义的文法。如果这种规则的隐藏，不以大家通常习惯的显性的产生式（ｐｒｏｄｕｃｔｉｏｎｓ）的形式表现，而是以一套　ｌｅｘｉｃａｌ　ｆｅａｔｕｒｅｓ　来表达，无论是　ＨＰＳＧ　那种非常精细繁复的　ｔｙｐｅｄ　ｆｅａｔｕｒｅ　ｓｔｒｕｃｔｕｒｅ（表达形式是所谓　ＡＶＭ，　Ａｔｔｒｉｂｕｔｅ　Ｖａｌｕｅ　Ｍａｔｒｉｘ），还是白老师那种简省的原子化（ａｔｏｍｉｃ）的　ｃａｔ　或　ｓｕｂｃａｔ的标注，那么就给人一个本文法无规则的“假象”。没研究过这类ｐａｒｓｉｎｇ的人可能还是疑惑：无论如何，没有规则，只有ｆｅａｔｕｒｅｓ，那怎么做ｐａｒｓｉｎｇ呢？这事儿说玄也玄，说白了就是一层窗户纸。任何一个　ｐａｒｓｅｒ　都是要对ｉｎｐｕｔ文句做操作的，这是无论声称有规则还是无规则的系统，都必须要有的一个部分，我们通常称之为ｒｕｎｎｅｒ，可以形象地比喻成一个ｓｃａｎｎｅｒ。诀窍就在这个ｒｕｎｎｅｒ是怎样在ｒｕｎ（ｐａｒｓｅ）文句呢？在产生式的显性符号规则体系里，靠的就是对这些规则的解释（ｉｎｔｅｒｐｒｅｔａｔｉｏｎ）或编译（ｃｏｍｐｉｌａｔｉｏｎ）。这个过程比较直观、可解析。在隐藏了过程性产生式规则的体系里面，ｒｕｎｎｅｒ怎么工作呢？回答是靠文法或模型自然不错，但太笼统。说就是靠词典里那些ｆｅａｔｕｒｅｓ的标注，先判你一个不及格，因为你没说明标注怎么转化为解析器（或自动机）的，里面还缺了啥。歇口气儿，群里面的后学（ＮＬＰ的ｆｒｅｓｈ博士啊博士后啊啥的）不妨当成一个家庭作业，试着回答一下这个问题。不要以为ＮＬＰ就是神经，或深度神经。除了神经，就没有ＮＬＰ了。Ｐａｒｓｉｎｇ　是　ＮＬＰ　的皇冠，如果只懂神经，不懂　ｐａｒｓｉｎｇ的基本原理和理论，你可能在工作市场上大卖，你也可能确实用神经做过ｐａｒｓｉｎｇ的工作，但ＡＩ这股热浪过后，你会发现叫自己是ＮＬＰｅｒ太沉重。当然，我们这些还没神经的人，其实也是叫ＮＬＰ太沉重，不过是倚老卖老罢了，反正我们终归是要退场的人。现如今不神经的话，都不好意思说自己是ＡＩ圈的人。认真说，隔行如隔山，行内也隔山，这是ＡＩ里面的真实写照。不隔山的全能的人有没有？肯定有，群主白老师就是。但９０＋％的大牛都不是，这也是事实。 Nick: 伟哥这是被人欺负了吗？李: 哈，欺负个球啊。信笔写　想到哪里写到哪里，神经＝ＮＬＰ的感慨而已。炼到我这功夫早已百毒不侵了，谁欺负谁呀。把一个算法上升到一个领域，这是概念混乱。这种怪相你学ＡＩ历史的应该给个说法。邓: 我觉得咱们开放心态看待会比较好，一个算法如果让一个行业上了一个大台阶，是会在一个历史时期称为这个行业的代名词的。听其言察其行。阮: 要么理论上碾压，要么实验或系统验证。要么更有权有势。。三点都不占，就只能忍了。马: 不为了深度而深度，也不不为了深度而不深度。我们组基本没有神经。李: 我脑袋是ｐａｒｓｅ不过来了，这口令绕的。当然可以为神经欢呼　鼓吹　毕竟人家有实绩在那儿（图像、语音、机器神译）。问题不在其崇高的行业地位，而在于这种地位所带来的一种默认意识：(1) 只要是ＡＩ和ＮＬＰ的任务，神经不仅仅是首选，而是是必需；(2) 甚至是，如果你不神经，就会异样地看着你：怎么可以不神经呢？(3) 还有很多问题，根本没有任何证明，神经可以ｗｏｒｋ，但是行内和行外的压倒意识是：(i) 神经一定ｗｏｒｋ，(ii) 而且非神经一定不ｗｏｒｋ，或不值一提。陈: 讨论的人也很神经李: 这种观念是如此之深厚顽固，以至于你即便可以证明非神经也一样工作，或者工作得更好，也基本没有人听。这时候你才知道所谓科学家，其实一大半不如没文化的暴发户，因为暴发户至少有常识，懂得白猫黑猫，暴发户没有被洗脑。这个观察不是从神经流行才有的，至少２０多年的体会了。阮: 科学，越年轻越好，大家喜欢看到新方法而已，目前确实没看到其他新方法。李: 一个人不可能抗拒一个世界。所以生存之道不是试图说服、试图证明，这些基本都没用。生存之道就是：（１）与暴发户为伍，不问姓社还是姓资。挖煤给钱就行。（２）另一个生存之道就是：挂羊头卖狗肉，努力靠拢，不管真心还是假意。无论是否真能融合，永远把羊头当成菩萨。【相关】《朝华午拾：在美国写基金申请的酸甜苦辣》围脖：一个人对抗一个世界，理性主义大师Lenat 教授 .. 【语义计算：李白对话录系列】中文处理 Parsing 【置顶：立委NLP博文一览】《朝华午拾》总目录; 个人分类: 立委科普|3459 次阅读|0 个评论

做文献笔记的好软件: 热度 1 bear1991 2017-3-22 17:35; 平时读文献遇到一个问题，就是做笔记只能在PDF上做，读得多了就很容易忘记，笔记也很难归纳起来，不自觉就降低了生产力。在这里推荐一款国产软件，就是知网开发的CNKI-E learning。话说这软件一开始做的挺好，特别是2.0版本的，那时候还叫做E-Study,后来改成 E-learning，是3.0版本的，现在改回去，叫做 E-Study，绕来绕去，也不知道有什么特殊意义。后来升级到3.0，用起来就有点坑了，界面很难受，用起来还卡，读个PDF文献拖拽不动。像这样的软件，应该是以基本功能为定向，保留其特色就好了。一开始的2.0版本就不错，现在最新版的用起来可能有点功能过剩，比如它参考文献引用的功能我觉得就是多余，用endnote就足够了。后来我也试过Docear,是国外产的一个以思维导图为基础的文献笔记管理软件，不过试用了一下就卸载了，思维导图很麻烦，一篇文献伸出无数小树枝，看起来都头疼。很多人比较鄙视国产软件，现在的国产软件甚至还不如以前的国产软件做的好。国外软件一般做的比较精巧，国产软件一会儿卡、一会死机难免让用户觉得难受。如果 CNKI-E learning这款软件认认真真开发的话，真的帮助我们解决了文献笔记这个麻烦，想必让使用者出点赞助费也是可以的吧。比较实用的功能及注意事项：（1）直接在PDF上做笔记。方法类似word的批注功能，做完笔记后E-study直接帮你把笔记整理好，可以直接导出到word中，格式是各文献标题+笔记标题+笔记时间+笔记内容。甚至可以直接按照标签分类导出笔记，极大的帮助用户整理笔记；（2）为研究目标新建学习单元。目前直接在软件中建立学习单元就可以了，不必要在硬盘中新建，否则时间长了很难找。如果要把学习单元转移到其他电脑使用，方法一是右键直接导出学习单元（笔记、文献、cel均同步导出），方法二是使用sync功能，会直接同步；（3）文件会越来越多。跟endnote一样，总是把已有的PDF文献添加到它自己的库中，而原来自己存放路径下的文献并没有删除，时间长了易造成重复。因此可以把其他杂乱的PDF文件删除，只保留它的库中的PDF即可；本文也算帮这个软件做点宣传，免得好东西默默无闻的躲在角落。希望不涉及侵权问题。; 个人分类: 软件使用|23531 次阅读|2 个评论

Newest GNMT: time to witness the miracle of Google Translate: liwei999 2016-10-4 09:22; Wei: Recently, the microblogging (wechat) community is full of hot discussions and testing on the newest annoucement of the Google Translate breakthrough in its NMT (neural network-based machine translation) offering, claimed to have achieved significant progress in data quality and readability. Sounds like a major breakthrough worthy of attention and celebration. The report says: Ten years ago, we released Google Translate, the core algorithm behind this service is PBMT: Phrase-Based Machine Translation. Since then, the rapid development of machine intelligence has given us a great boost in speech recognition and image recognition, but improving machine translation is still a difficult task. Today, we announced the release of the Google Neural Machine Translation (GNMT) system, which utilizes state-of-the-art training techniques to maximize the quality of machine translation so far. For a full review of our findings, please see our paper Google`s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.A few years ago, we began using RNN (Recurrent Neural Networks) to directly learn the mapping of an input sequence (such as a sentence in a language) to an output sequence (the same sentence in another language). The phrase-based machine learning (PBMT) breaks the input sentences into words and phrases, and then largely interprets them independently, while NMT interprets the entire sentence of the input as the basic unit of translation . A few years ago, we began using RNN (Recurrent Neural Networks) to directly learn the mapping of an input sequence (such as a sentence in a language) to an output sequence (the same sentence in another language). The phrase-based machine learning (PBMT) breaks the input sentences into words and phrases, and then largely interprets them independently, while NMT interprets the entire sentence of the input as the basic unit of translation .The advantage of this approach is that compared to the previous phrase-based translation system, this method requires less engineering design. When it was first proposed, the accuracy of the NMT on a medium-sized public benchmark The advantage of this approach is that compared to the previous phrase-based translation system, this method requires less engineering design. When it was first proposed, the accuracy of the NMT on a medium-sized public benchmark data set was comparable to that of a phrase-based translation system. Since then, researchers have proposed a number of techniques to improve NMT, including modeling external alignment models to handle rare words, using attention to align input and output words, and word decomposition into smaller units to cope with rare words. Despite these advances, the speed and accuracy of NMT has not been able to meet the requirements of a production system such as Google Translate. Our new paper describes how to overcome many of the challenges of making NMT work on very large data sets and how to build a system that is both fast and accurate enough to deliver a better translation experience for Google users and services. ............ Using side-by-side comparisons of human assessments as a standard, the GNMT system translates significantly better than the previous phrase-based production system. With the help of bilingual human assessors, we found in sample sentences from Wikipedia and the news website that GNMT reduced translational errors by 55% to 85% or more in the translation of multiple major pairs of languages. In addition to publishing this research paper today, we have also announced that GNMT will be put into production in a very difficult language pair (Chinese-English) translation. Now, the Chinese-English translations of the Google Translate for mobile and web versions have been translated at 100% using the GNMT machine - about 18 million translations per day. GNMT's production deployment uses our open machine learning tool suite TensorFlow and our Tensor Processing Units (TPUs), which provide sufficient computational power to deploy these powerful GNMT models, meeting Google Translate strict latency requirements for products. Chinese-to-English translation is one of the more than 10,000 language pairs supported by Google Translate. In the coming months, we will continue to extend our GNMT to far more language pairs. GNMT translated from Google Translate achieves a major breakthrough ! As an old machine translation researcher, this temptation cannot be resisted. I cannot wait to try this latest version of the Google Translate for Chinese-English. Previously I tried Google Chinese-to-English online translation multiple times, the overall quality was not very readable and certainly not as good as its competitor Baidu. With this newest breakthrough using deep learning with neural networks, it is believed to get close to human translation quality. I have a few hundreds of Chinese blogs on NLP, waiting to be translated as a try. I was looking forward to this first attempt in using Google Translate for my Science Popularization blog titled Introduction to NLP Architecture . My adventure is about to start. Now is the time to witness the miracle, if miracle does exist. Dong: I hope you will not be disappointed. I have jokingly said before: the rule-based machine translation is a fool, the statistical machine translation is a madman, and now I continue to ridicule: neural machine translation is a liar (I am not referring to the developers behind NMT). Language is not a cat face or the like, just the surface fluency does not work, the content should be faithful to the original! Wei: Let us experience the magic, please listen to this translated piece of my blog: This is my Introduction to NLP Architecture fully automatically translated by Google Translate yesterday (10/2/2016) and fully automatically read out without any human interference. I have to say, this is way beyond my initial expectation and belief. Listen to it for yourself, the automatic speech generation of this science blog of mine is amazingly clear and understandable. If you are an NLP student, you can take it as a lecture note from a seasoned NLP practitioner (definitely clearer than if I were giving this lecture myself, with my strong accent). The original blog was in Chinese and I used the newest Google Translate claimed to be based on deep learning using sentence-based translation as well as character-based techniques. Prof. Dong, you know my background and my original doubtful mindset. However, in the face of such a progress, far beyond our original imagination limits for automatic translation in terms of both quality and robustness when I started my NLP career in MT training 30 years ago, I have to say that it is a dream come true in every sense of it. Dong: In their terminology, it is less adequate, but more fluent. Machine translation has gone through three paradigm shifts. When people find that it can only be a good information processing tool, and cannot really replace the human translation, they would choose the less costly. Wei: In any case, this small test is revealing to me. I am still feeling overwhelmed to see such a miracle live. Of course, what I have just tested is the formal style, on a computer and NLP topic, it certainly hit its sweet spot with adequate training corpus coverage. But compared with the pre-NN time when I used both Google SMT and Baidu SMT to help with my translation, this breakthrough is amazing. As a senior old school practitioner of rule-based systems, I would like to pay deep tribute to our nerve-network colleagues. These are a group of extremely genius crazy guys. I would like to quote Jobs' famous quotation here: “Here's to the crazy ones. The misfits. The rebels. The troublemakers. The round pegs in the square holes. The ones who see things differently. They're not fond of rules. And they have no respect for the status quo. You can quote them, disagree with them, glorify or vilify them. About the only thing you can't do is ignore them. Because they change things. They push the human race forward. And while some may see them as the crazy ones, we see genius. Because the people who are crazy enough to think they can change the world, are the ones who do.” @Mao, this counts as my most recent feedback to the Google scientists and their work. Last time, about a couple of months ago when they released their parser, proudly claimed to be the most accurate parser in the world , I wrote a blog to ridicule them after performing a serious, apples-to-apples comparison with our own parser . This time, they used the same underlying technology to announce this new MT breakthrough with similar pride, I am happily expressing my deep admiration for their wonderful work. This contrast of my attitudes looks a bit weird, but it actually is all based on facts of life. In the case of parsing, this school suffers from lacking naturally labeled data which they would make use of in perfecting the quality, especially when it has to port to new domains or genres beyond the news corpora. After all, what exists in the language sea involves corpora of raw text with linear strings of words, while the corresponding parse trees are only occasional, artificial objects made by linguists in a limited scope by nature (e.g. PennTree, or other news-genre parse trees by the Google annotation team). But MT is different, it is a unique NLP area with almost endless, high-quality, naturally-occurring labeled data in the form of human translation, which has never stopped since ages ago. Mao: @wei That is to say, you now embrace or endorse a neuron-based MT, a change from your previous views? Wei: Yes I do embrace and endorse the practice. But I have not really changed my general view wrt the pros and cons between the two schools in AI and NLP . They are complementary and, in the long run, some way of combining the two will promise a world better than either one alone. Mao: What is your real point? Wei: Despite biases we are all born with more or less by human nature, conditioned by what we have done and where we come from in terms of technical background, we all need to observe and respect the basic facts. Just listen to the audio of their GSMT translation by clicking the link above, the fluency and even faithfulness to my original text has in fact out-performed an ordinary human translator, in my best judgment. If an interpreter does not have sufficient knowledge of my domain, if I give this lecture in a classroom, and ask an average interpreter to translate on the spot for me, I bet he will have a hard time performing better than the Google machine listed above (of course, human translation gurus are an exception). This miracle-like fact has to be observed and acknowledged. On the other hand, as I said before, no matter how deep the learning reaches, I still do not see how they can catch up with the quality of my deep parsing in the next few years when they have no way of magically having access to a huge labeled data of trees they depend on, especially in the variety of different domains and genres. They simply cannot make bricks without straw (as an old Chinese saying goes, even the most capable housewife can hardly cook a good meal without rice). Because in the natural world, there are no syntactic trees and structures for them to learn from, there are only linear sentences. The deep learning breakthrough seen so far is still mainly supervised learning, which has almost an insatiable appetite for massive labeled data, forming its limiting knowledge bottleneck. Mao: I'm confused. Which one do you believe stronger? Who is the world's No. 0? Wei: Parsing-wise, I am happy to stay as No. 0 if Google insists on their being No. 1 in the world. As for MT, it is hard to say, from what I see, between their breakthrough and some highly sophisticated rule-based MT systems out there. But what I can say is, at a high level, the trends of the mainstream statistical MT winning the space both in the industry as well as in academia over the old school rule-based MT are more evident today than before. This is not to say that the MT rule system is no longer viable, or going to an end. There are things which SMT cannot beat rule MT. For examples, certain types of seemingly stupid mistakes made by GNMT (quite some laughable examples of totally wrong or opposite translation have been illustrated in this salon in the last few days) are almost never seen in rule-based MT systems. Dong: here is my try of GNMT from Chinese to English: 学习上，初二是一个分水岭，学科数量明显增多，学习方法也有所改变，一些学生能及时调整适应变化，进步很快，由成绩中等上升为优秀。但也有一部分学生存在畏难情绪，将心思用在学习之外，成绩迅速下降，对学习失去兴趣，自暴自弃，从此一蹶不振，这样的同学到了初三往往很难有所突破，中考的失利难以避免。 Learning, the second of a watershed, the number of subjects significantly significantly, learning methods have also changed, some students can adjust to adapt to changes in progress, progress quickly, from the middle to rise to outstanding. But there are some students there is Fear of hard feelings, the mind used in the study, the rapid decline in performance, loss of interest in learning, self-abandonment, since the devastated, so the students often difficult to break through the third day, Mao: This translation cannot be said to be good at all. Wei: Right, that is why it calls for an objective comparison to answer your previous question. Currently, as I see, the data for the social media and casual text are certainly not enough, hence the translation quality of online messages is still not their forte. As for the previous textual sample Prof. Dong showed us above, Mao said the Google translation is not of good quality as expected. But even so, I still see impressive progress made there. Before the deep learning time, the SMT results from Chinese to English is hardly readable, and now it can generally be read loud to be roughly understood. There is a lot of progress worth noting here. Ma: In the fields with big data, in recent years, DL methods are by leaps and bounds. I know a number of experts who used to be biased against DL have changed their views when seeing the results. However, DL in the IR field is still basically not effective so far, but there are signs of slowly penetrating IR. Dong: The key to NMT is looking nice. So for people who do not understand the original source text, it sounds like a smooth translation. But isn't it a liar if a translation is losing its faithfulness to the original? This is the Achille's heel of NMT. Ma: @Dong, I think all statistical methods have this aching point. Wei: Indeed, there are respective pros and cons. Today I have listened to the Google translation of my blog three times and am still amazed at what they have achieved. There are always some mistakes I can pick here and there. But to err is human, not to say a machine, right? Not to say the community will not stop advancing and trying to correct mistakes. From the intelligibility and fluency perspectives, I have been served super satisfactorily today. And this occurs between two languages without historical kinship whatsoever. Dong: Some leading managers said to me years ago, In fact, even if machine translation is only 50 percent correct, it does not matter. The problem is that it cannot tell me which half it cannot translate well. If it can, I can always save half the labor, and hire a human translator to only translate the other half. I replied that I am not able to make a system do that. Since then I have been concerned about this issue, until today when there is a lot of noise of MT replacing the human translation anytime from now. It's kinda like having McDonald's then you say you do not need a fine restaurant for French delicacy. Not to mention machine translation today still cannot be compared to McDonald's. Computers, with machine translation and the like, are in essence a toy given by God for us human to play with. God never agrees to permit us to be equipped with the ability to copy ourselves. Why GNMT first chose language pairs like Chinese-to-English, not the other way round to showcase? This is very shrewd of them. Even if the translation is wrong or missing the points, the translation is usually fluent at least in this new model, unlike the traditional model who looks and sounds broken, silly and erroneous. This is the characteristics of NMT, it is selecting the greatest similarity in translation corpus. As a vast number of English readers do not understand Chinese, it is easy to impress them how great the new MT is, even for a difficult language pair. Wei: Correct. A closer look reveals that this breakthrough lies more on fluency of the target language than the faithfulness to the source language, achieving readability at cost of accuracy. But this is just a beginning of a major shift. I can fully understand the GNMT people's joy and pride in front of a breakthrough like this. In our career, we do not always have that type of moment for celebration. Deep parsing is the NLP's crown. Yet to see how they can beat us in handling domains and genres lacking labeled data. I wish them good luck and the day they prove they make better parsers than mine would be the day of my retirement. It does not look anything like this day is drawing near, to my mind. I wish I were wrong, so I can travel the world worry-free, knowing that my dream has been better realized by my colleagues. Thanks to Google Translate at https://translate.google.com/ for helping to translate this Chinese blog into English, which was post-edited by myself. Wei’s Introduction to NLP Architecture Translated by Google OVERVIEW OF NATURAL LANGUAGE PROCESSING NLP White Paper: Overview of Our NLP Core Engine Introduction to NLP Architecture It is untrue that Google SyntaxNet is the world’s most accurate parser Announcing SyntaxNet: The World’s Most Accurate Parser Goes Open Is Google SyntaxNet Really the World’s Most Accurate Parser? Dr Li's NLP Blog in English; 个人分类: 立委科普|6087 次阅读|0 个评论

Wei's Introduction to NLP Architecture Translated by Google: liwei999 2016-10-2 11:37; Introduction to NLP Architecture by Dr. Wei Li (fully automatically translated by Google Translate) The automatic speech generation of this science blog of mine is attached here, it is amazingly clear and understandable, if you are an NLP student, you can listen to it as a lecture note from a seasoned NLPer (definitely clearer than if I were giving this lecture myself with my strong accent) by folloing the link below: Wei’s Introduction to NLP Architecture To preserve the original translation, nothing is edited below. I will write another blog to post-edit it to make this an official NLP architecture introduction to the audiences perused and honored by myself, the original writer. But for time being, it is completely unedited, thanks to the newly launched Google Translate service from Chinese into English at https://translate.google.com/ For the natural language processing (NLP) and its application, the system architecture is the core issue, I blog which gave four NLP system architecture diagram, now one by one to be a brief . I put the NLP system from the core engine to the application, is divided into four stages, corresponding to the four frame diagram. At the bottom of the core is deep parsing, is the natural language of the bottom-up layer of automatic analyzer, this work is the most difficult, but it is the vast majority of NLP system based technology. The purpose of parsing is to structure unstructured languages. The face of the ever-changing language, only structured, and patterns can be easily seized, the information we go to extract semantics to solve. This principle began to be the consensus of (linguistics) when Chomsky proposed the transition from superficial structure to deep structure after the linguistic revolution of 1957. A tree is not only the arcs that express syntactic relationships, but also the nodes of words or phrases that carry various information. Although the importance of the tree, but generally can not directly support the product, it is only the internal expression of the system, as a language analysis and understanding of the carrier and semantic landing for the application of the core support. The next layer is the extraction layer (extraction), as shown above. Its input is the tree, the output is filled in the content of the templates, similar to fill in the form: is the information needed for the application, pre-defined a table out, so that the extraction system to fill in the blank, the statement related words or phrases caught out Sent to the table in the pre-defined columns (fields) to go. This layer has gone from the original domain-independent parser into the face-to-face, application-oriented and product-demanding tasks. It is worth emphasizing that the extraction layer is domain-oriented semantic focus, while the previous analysis layer is domain-independent. Therefore, a good framework is to do a very thorough analysis of logic, in order to reduce the burden of extraction. In the depth analysis of the logical semantic structure to do the extraction, a rule is equivalent to the extraction of thousands of surface rules of language. This creates the conditions for the transfer of the domain. There are two types of extraction, one is the traditional information extraction (IE), the extraction of fact or objective information: the relationship between entities, entities involved in different entities, such as events, can answer who dis what when and where When and where to do what) and the like. This extraction of objective information is the core technology and foundation of the knowledge graph which can not be renewed nowadays. After completion of IE, the next layer of information fusion (IF) can be used to construct the knowledge map. Another type of extraction is about subjective information, public opinion mining is based on this kind of extraction. What I have done over the past five years is this piece of fine line of public opinion to extract (not just praise classification, but also to explore the reasons behind the public opinion to provide the basis for decision-making). This is one of the hardest tasks in NLP, much more difficult than IE in objective information. Extracted information is usually stored in a database. This provides fragmentation information for the underlying excavation layer. Many people confuse information extraction and text mining, but in fact this is two levels of the task. Extraction is the face of a language tree, from a sentence inside to find the information you want. The mining face is a corpus, or data source as a whole, from the language of the forest inside the excavation of statistical value information. In the information age, the biggest challenge we face is information overload, we have no way to exhaust the information ocean, therefore, must use the computer to dig out the information from the ocean of critical intelligence to meet different applications. Therefore, mining rely on natural statistics, there is no statistics, the information is still out of the chaos of the debris, there is a lot of redundancy, mining can integrate them. Many systems do not dig deep, but simply to express the information needs of the query as an entrance, real-time (real time) to extract the relevant information from the fragmentation of the database, the top n results simply combined, and then provide products and user. This is actually a mining, but is a way to achieve a simple search mining directly support the application. In fact, in order to do a good job of mining, there are a lot of work to do, not only can improve the quality of existing information. Moreover, in-depth, you can also tap the hidden information, that is not explicitly expressed in the metadata information, such as the causal relationship between information found, or other statistical trends. This type of mining was first done in traditional data mining because the traditional mining was aimed at structural data such as transaction records, making it easy to mine implicit associations (eg, people who buy diapers often buy beer , The original is the father of the new people's usual behavior, such information can be excavated to optimize the display and sale of goods). Nowadays, natural language is also structured to extract fragments of intelligence in the database, of course, can also do implicit association intelligence mining to enhance the value of intelligence. The fourth architectural diagram is the NLP application layer. In this layer, analysis, extraction, mining out of the various information can support different NLP products and services. From the Q A system to the dynamic mapping of the knowledge map (Google search search star has been able to see this application), from automatic polling to customer intelligence, from intelligent assistants to automatic digest and so on. This is my overall understanding of the basic architecture of NLP. Based on nearly 20 years in the industry to do NLP product experience. 18 years ago, I was using a NLP structure diagram to the first venture to flicker, investors themselves told us that this is million dollar slide. Today's explanation is to extend from that map to expand from. Days unchanged Road is also unchanged. Where previously mentioned the million-dollar slide story. Clinton said that during the reign of 2000, the United States to a great leap forward in Internet technology, known as. Com bubble, a time of hot money rolling, all kinds of Internet startups are sprang up. In such a situation, the boss decided to hot to find venture capital, told me to achieve our prototype of the language system to do an introduction. I then draw the following three-tier structure of a NLP system diagram, the bottom is the parser, from shallow to deep, the middle is built on parsing based on information extraction, the top of the main categories are several types of applications, including Q A system. Connection applications and the following two language processing is the database, used to store the results of information extraction, these results can be applied at any time to provide information. This architecture has not changed much since I made it 15 years ago, although the details and icons have been rewritten no less than 100 times. The architecture diagram in this article is about one of the first 20 editions. Off the core engine (background), does not include the application (front). Saying that early in the morning by my boss sent to Wall Street angel investors, by noon to get his reply, said he was very interested. Less than two weeks, we got the first $ 1 million angel investment check. Investors say that this is a million dollar slide, which not only shows the threshold of technology, but also shows the great potential of the technology. Pre - Knowledge Mapping: The Structure of Information Extraction Engine 【Related】 Pre - Knowledge Mapping: The Architecture of Information Extraction Engine 【Essay contest: a dream come true 】 OVERVIEW OF NATURAL LANGUAGE PROCESSING NLP White Paper: Overview of Our NLP Core Engine White Paper of NLP Engine Zhaohua afternoon pick up directory retrieved 10/1/2016 from https://translate.google.com/ translated from http://blog.sciencenet.cn/blog-362400-981742.html; 个人分类: 立委科普|4735 次阅读|0 个评论

Once upon a time, we were publishing like crazy ……: liwei999 2016-9-2 05:04; List of 23 NLP Publications (Cymfony Period) Once upon a time, we were publishing like crazy …… as if we were striving for tenure faculty R. Srihari, W. Li and X. Li. 2006. Question Answering Supported by Multiple Levels of Information Extraction. a book chapter in T. Strzalkowski S. Harabagiu (eds.), Advances in Open- Domain Question Answering. Springer, 2006, ISBN:1-4020-4744-4. http://link.springer.com/chapter/10.1007%2F978-1-4020-4746-6_11 R. Srihari, W. Li, C. Niu and T. Cornell. 2006. InfoXtract: A Customizable Intermediate Level Information Extraction Engine. Journal of Natural Language Engineering, 12(4), 1-37 http://journals.cambridge.org/action/displayAbstract?fromPage=onlineaid=1513012 This paper focuses on IE tasks designed to support information discovery applications. It defines new IE tasks such as entity profiles, and concept-based general events which represent realistic goals in terms of what can be accomplished in the near-term as well as providing useful, actionable information. C. Niu, W. Li, R. Srihari, H. Li. 2005. Word Independent Context Pair Classification Model For Word Sense Disambiguation. Proceedings of Ninth Conference on Computational Natural Language Learning (CoNLL-2005) W05-0605 C. Niu, W. Li and R. Srihari. 2004. Weakly Supervised Learning for Cross-document Person Name Disambiguation Supported by Information Extraction. In Proceedings of ACL 2004. ACL 2004 Niu Li Srihari 372_pdf_2-col C. Niu, W. Li, R. Srihari, H. Li and L. Christ. 2004. Context Clustering for Word Sense Disambiguation Based on Modeling Pairwise Context Similarities. In Proceedings of Senseval-3 Workshop. ACL 2004 Context Clustering for WSD niu1 C. Niu, W. Li, J. Ding, and R. Rohini. 2004. Orthographic Case Restoration Using Supervised Learning Without Manual Annotation. International Journal of Artificial Intelligence Tools, Vol. 13, No. 1, 2004. IJAIT 2004 Niu, Li, Ding, and Srihari caseR (7) Cheng Niu, Wei Li and Rohini Srihari 2004. A Bootstrapping Approach to Information Extraction Domain Porting. ATEM-2004: The AAAI-04 Workshop on Adaptive Text Extraction and Mining. San Jose. (PDF) WS104NiuC W. Li, X. Zhang, C. Niu, Y. Jiang, and R. Srihari. 2003. An Expert Lexicon Approach to Identifying English Phrasal Verbs. In Proceedings of ACL 2003. Sapporo, Japan. pp. 513-520. ACL 2003 Li, Zhang, Niu, Jiang and Srihari 2003 PhrasalVerb_ACL2003_submitted C. Niu, W. Li, J. Ding, and R. Srihari 2003. A Bootstrapping Approach to Named Entity Classification using Successive Learners. In Proceedings of ACL 2003. Sapporo, Japan. pp. 335-342. ACL 2003 Niu, Li, Ding and Srihari 2003 ne-acl2003 W. Li, R. Srihari, C. Niu, and X. Li. 2003. Question Answering on a Case Insensitive Corpus. In Proceedings of Workshop on Multilingual Summarization and Question Answering – Machine Learning and Beyond (ACL-2003 Workshop). Sapporo, Japan. pp. 84-93. ACL 2003 Workshop Li, Srihari, Niu and Li 2003 QA-workshopl2003_final C. Niu, W. Li, J. Ding, and R.K. Srihari. 2003. Bootstrapping for Named Entity Tagging using Concept-based Seeds. In Proceedings of HLT/NAACL 2003. Companion Volume, pp. 73-75, Edmonton, Canada. NAACL 2003 Niu, Li, Ding and Srihari 2003 ne_submitted R. Srihari, W. Li, C. Niu and T. Cornell. 2003. InfoXtract: A Customizable Intermediate Level Information Extraction Engine. In Proceedings of HLT/NAACL 2003 Workshop on Software Engineering and Architecture of Language Technology Systems (SEALTS). pp. 52-59, Edmonton, Canada. NAACL 2003 Workshop InfoXtract SEALTS paper2 H. Li, R. Srihari, C. Niu, and W. Li. 2003. InfoXtract Locatio Normalization: A Hybrid Approach to Geographic References in Information Extraction. In Proceedings of HLT/NAACL 2003 Workshop on Analysis of Geographic References. Edmonton, Canada. NAACL 2003 Workshop Li, Srihari, Niu and Li 2003 CymfonyLoc_final W. Li, R. Srihari, C. Niu, and X. Li 2003. Entity Profile Extraction from Large Corpora. In Proceedings of Pacific Association for Computational Linguistics 2003 (PACLING03). Halifax, Nova Scotia, Canada. PACLING 2003 Li, Srihari, Niu and Li 2003 Entity Profile profile_PACLING_final_submitted C. Niu, W. Li, R. Srihari, and L. Crist 2003. Bootstrapping a Hidden Markov Model for Relationship Extraction Using Multi-level Contexts. In Proceedings of Pacific Association for Computational Linguistics 2003 (PACLING03). Halifax, Nova Scotia, Canada. PACLING 2003 Niu, Li, Srihari and Crist 2003 CE Bootstrapping PACLING03_15_final C. Niu, Z. Zheng, R. Srihari, H. Li, and W. Li 2003. Unsupervised Learning for Verb Sense Disambiguation Using Both Trigger Words and Parsing Relations. In Proceedings of Pacific Association for Computational Linguistics 2003 (PACLING03). Halifax, Nova Scotia, Canada. PACLING 2003 Niu, Zheng, Srihari, Li and Li 2003 Verb Sense Identification PACLING_14_final C. Niu, W. Li, J. Ding, and R.K. Srihari 2003. Orthographic Case Restoration Using Supervised Learning Without Manual Annotation. In Proceedings of the Sixteenth International FLAIRS Conference, St. Augustine, FL, May 2003, pp. 402-406. FLAIRS 2003 Niu, Li, Ding and Srihari 2003 FLAIRS03CNiu R. Srihari and W. Li 2003. Rapid Domain Porting of an Intermediate Level Information Extraction Engine. In Proceedings of International Conference on Natural Language Processing 2003. ICON2003 paper FINAL H. Li, R. Srihari, C. Niu and W. Li 2002. Location Normalization for Information Extraction. In Proceedings of the 19th International Conference on Computational Linguistics (COLING-2002). Taipei, Taiwan. COLING 2002 Li, Srihari, Niu and Li 2002 coling2002LocNZ W. Li, R. Srihari, X. Li, M. Srikanth, X. Zhang and C. Niu 2002. Extracting Exact Answers to Questions Based on Structural Links. In Proceedings of Multilingual Summarization and Question Answering (COLING-2002 Workshop). Taipei, Taiwan. COLING 2002 Workshop Li et al CymfonyQA_final R. Srihari, and W. Li. 2000. A Question Answering System Supported by Information Extraction. In Proceedings of ANLP 2000. Seattle. ANLP 2000 Srihari and Li 2000 anlp9l R. Srihari, C. Niu and W. Li. 2000. A Hybrid Approach for Named Entity and Sub-Type Tagging. In Proceedings of ANLP 2000. Seattle. ANLP 2000 Srihari, Niu and Li 2000 anlp105_final9 R. Srihari and W. Li. 1999. Question Answering Supported by Information Extraction. In Proceedings of TREC-8. Washington cymfony Other publications: SBIR Final Reports W. Li R. Srihari. 2003. Flexible Information Extraction Learning Algorithm (Phase 2), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. W. Li R. Srihari. 2001. Intermediate-Level Event Extraction for Temporal and Spatial Analysis and Visualization (Phase 1), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. W. Li R. Srihari. 2000. A Domain Independent Event Extraction Toolkit (Phase 2), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. W. Li R. Srihari. 2000. Flexible Information Extraction Learning Algorithm (Phase 1), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. W. Li R. Srihari 2003. Automated Verb Sense Identification (Phase I), Final Techinical Report, U.S. DoD SBIR (Navy), Contract No. N00178-02-C-3073 (2002-2003) R. Srihari W. Li 2003. Fusion of Information from Diverse, Textual Media: A Case Restoration Approach (Phase I), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. Contract No. F30602-02-C-0156 (2002-2003) R. Srihari, W. Li C. Niu 2004. A Large Scale Knowledge Repository and Information Discovery Portal Derived from Information Extraction (Phase 1), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. (2003-2004) R. Srihari W. Li 2003. An Automated Domain Porting Toolkit for Information Extraction (Phase I), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. Contract No. F30602-02-C-0057 (2002-2003) T. Cornell, R. Srihari W. Li 2004. Automatically Time Stamping Events in Unrestricted Text (Phase I), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. (2003-2004) Overview of Natural Language Processing Dr. Wei Li’s English Blog on NLP; 个人分类: 立委科普|2324 次阅读|0 个评论

极限学习机 Extreme Learning Machines (ELM) 程序网址: 热度 1 zlyang 2016-9-1 10:18; 极限学习机 Extreme Learning Machines (ELM) 程序网址 http://www.ntu.edu.sg/home/egbhuang/ E xtreme L earning M achines (ELM) : Filling the Gap between Frank Rosenblatt's Dream and John von Neumann's Puzzle - Network architectures : a homogenous hierarchical learning machine for partially or fully connected multi layers / single layer of (artifical or biological) networks with almost any type of practical (artifical) hidden nodes (or bilogical neurons). - Learning theories : Learning can be made without iteratively tuning (articial) hidden nodes (or biological neurons). - Learning algorithms : General, unifying and universal (optimization based) learning frameworks for compression, feature learning, clustering, regression and classification. Basic steps: 1) Learning are made layer wise ( in white box ) 2) Randomly generate ( any nonliear piecewise ) hidden neurons or inheritate hidden neuorns from ancestors 3) Learn the output weights in each hidden layer ( with application based optimization constraints ) 2013 www.extreme-learning-machines.org. All rights reserved. Best view with Google Chrome 感谢您提供更多的可下载程序！感谢您的指教！感谢您指正以上任何错误！; 个人分类: 风电功率预测|5198 次阅读|2 个评论

On Hand-crafted Myth of Knowledge Bottleneck: liwei999 2016-8-8 04:39; In my article “ Pride and Prejudice of Main Stream “, the first myth listed as top 10 misconceptions in NLP is as follows: Rule-based system faces a knowledge bottleneck of hand-crafted development while a machine learning system involves automatic training (implying no knowledge bottleneck). While there are numerous misconceptions on the old school of rule systems , this hand-crafted myth can be regarded as the source of all. Just take a review of NLP papers, no matter what are the language phenomena being discussed, it’s almost cliche to cite a couple of old school work to demonstrate superiority of machine learning algorithms, and the reason for the attack only needs one sentence, to the effect that the hand-crafted rules lead to a system “difficult to develop” (or “difficult to scale up”, “with low efficiency”, “lacking robustness”, etc.), or simply rejecting it like this, “literature , and have tried to handle the problem in different aspects, but these systems are all hand-crafted”. Once labeled with hand-crafting, one does not even need to discuss the effect and quality. Hand-craft becomes the rule system’s “original sin”, the linguists crafting rules, therefore, become the community’s second-class citizens bearing the sin. So what is wrong with hand-crafting or coding linguistic rules for computer processing of languages? NLP development is software engineering. From software engineering perspective, hand-crafting is programming while machine learning belongs to automatic programming. Unless we assume that natural language is a special object whose processing can all be handled by systems automatically programmed or learned by machine learning algorithms, it does not make sense to reject or belittle the practice of coding linguistic rules for developing an NLP system. For consumer products and arts, hand-craft is definitely a positive word: it represents quality or uniqueness and high value, a legit reason for good price. Why does it become a derogatory term in NLP? The root cause is that in the field of NLP, almost like some collective hypnosis hit in the community, people are intentionally or unintentionally lead to believe that machine learning is the only correct choice. In other words, by criticizing, rejecting or disregarding hand-crafted rule systems, the underlying assumption is that machine learning is a panacea, universal and effective, always a preferred approach over the other school. The fact of life is, in the face of the complexity of natural language, machine learning from data so far only surfaces the tip of an iceberg of the language monster (called l ow-hanging fruit by Church in K. Church: A Pendulum Swung Too Far ), far from reaching the goal of a complete solution to language understanding and applications. There is no basis to support that machine learning alone can solve all language problems, nor is there any evidence that machine learning necessarily leads to better quality than coding rules by domain specialists (e.g. computational grammarians). Depending on the nature and depth of the NLP tasks, hand-crafted systems actually have more chances of performing better than machine learning, at least for non-trivial and deep level NLP tasks such as parsing, sentiment analysis and information extraction (we have tried and compared both approaches). In fact, the only major reason why they are still there, having survived all the rejections from mainstream and still playing a role in industrial practical applications, is the superior data quality, for otherwise they cannot have been justified for industrial investments at all. the “forgotten” school: why is it still there? what does it have to offer? The key is the excellent data quality as advantage of a hand-crafted system, not only for precision, but high recall is achievable as well. quote from On Recall of Grammar Engineering Systems In the real world, NLP is applied research which eventually must land on the engineering of language applications where the results and quality are evaluated. As an industry, software engineering has attracted many ingenious coding masters, each and every one of them gets recognized for their coding skills, including algorithm design and implementation expertise, which are hand-crafting by nature. Have we ever heard of a star engineer gets criticized for his (manual) programming? With NLP application also as part of software engineering, why should computational linguists coding linguistic rules receive so much criticism while engineers coding other applications get recognized for their hard work? Is it because the NLP application is simpler than other applications? On the contrary, many applications of natural language are more complex and difficult than other types of applications (e.g. graphics software, or word processing apps). The likely reason to explain the different treatment between a general purpose programmer and a linguist knowledge engineer is that the big environment of software engineering does not involve as much prejudice while the small environment of NLP domain is deeply biased, with belief that the automatic programming of an NLP system by machine learning can replace and outperform manual coding for all language projects. For software engineering in general, (manual) programming is the norm and no one believes that programmers’ jobs can be replaced by automatic programming in any time foreseeable. Automatic programming, a concept not rare in science fiction for visions like machines making machines, is currently only a research area, for very restricted low-level functions. Rather than placing hope on automatic programming, software engineering as an industry has seen a significant progress on work of the development infrastructures, such as development environment and a rich library of functions to support efficient coding and debugging. Maybe in the future one day, applications can use more and more of automated code to achieve simple modules, but the full automation of constructing any complex software project is nowhere in sight. By any standards, natural language parsing and understanding (beyond shallow level tasks such as classification, clustering or tagging) is a type of complex tasks. Therefore, it is hard to expect machine learning as a manifestation of automatic programming to miraculously replace the manual code for all language applications. The application value of hand-crafting a rule system will continue to exist and evolve for a long time, disregarded or not. “Automatic” is a fancy word. What a beautiful world it would be if all artificial intelligence and natural languages tasks could be accomplished by automatic machine learning from data. There is, naturally, a high expectation and regard for machine learning breakthrough to help realize this dream of mankind. All this should encourage machine learning experts to continue to innovate to demonstrate its potential, and should not be a reason for the pride and prejudice against a competitive school or other approaches. Before we embark on further discussions on the so-called rule system’s knowledge bottleneck defect, it is worth mentioning that the word “automatic” refers to the system development, not to be confused with running the system. At the application level, whether it is a machine-learned system or a manual system coded by domain programmers (linguists), the system is always run fully automatically, with no human interference. Although this is an obvious fact for both types of systems, I have seen people get confused so to equate hand-crafted NLP system with manual or semi-automatic applications. Is hand-crafting rules a knowledge bottleneck for its development? Yes, there is no denying or a need to deny that. The bottleneck is reflected in the system development cycle. But keep in mind that this “bottleneck” is common to all large software engineering projects, it is a resources cost, not only introduced by NLP. From this perspective, the knowledge bottleneck argument against hand-crafted system cannot really stand, unless it can be proved that machine learning can do all NLP equally well, free of knowledge bottleneck: it might be not far from truth for some special low-level tasks, e.g. document classification and word clustering, but is definitely misleading or incorrect for NLP in general, a point to be discussed below in details shortly. Here are the ballpark estimates based on our decades of NLP practice and experiences. For shallow level NLP tasks (such as Named Entity tagging, Chinese segmentation), a rule approach needs at least three months of one linguist coding and debugging the rules, supported by at least half an engineer for tools support and platform maintenance, in order to come up with a decent system for initial release and running. As for deep NLP tasks (such as deep parsing, deep sentiments beyond thumbs-up and thumbs-down classification), one should not expect a working engine to be built up without due resources that at least involve one computational linguist coding rules for one year, coupled with half an engineer for platform and tools support and half an engineer for independent QA (quality assurance) support. Of course, the labor resources requirements vary according to the quality of the developers (especially the linguistic expertise of the knowledge engineers) and how well the infrastructures and development environment support linguistic development. Also, the above estimates have not included the general costs, as applied to all software applications, e.g. the GUI development at app level and operations in running the developed engines. Let us present the scene of the modern day rule-based system development. A hand-crafted NLP rule system is based on compiled computational grammars which are nowadays often architected as an integrated pipeline of different modules from shallow processing up to deep processing. A grammar is a set of linguistic rules encoded in some formalism, which is the core of a module intended to achieve a defined function in language processing, e.g. a module for shallow parsing may target noun phrase (NP) as its object for identification and chunking . What happens in grammar engineering is not much different from other software engineering projects. As knowledge engineer, a computational linguist codes a rule in an NLP-specific language, based on a development corpus. The development is data-driven, each line of rule code goes through rigid unit tests and then regression tests before it is submitted as part of the updated system for independent QA to test and feedback. The development is an iterative process and cycle where incremental enhancements on bug reports from QA and/or from the field (customers) serve as a necessary input and step towards better data quality over time. Depending on the design of the architect, there are all types of information available for the linguist developer to use in crafting a rule’s conditions, e.g. a rule can check any elements of a pattern by enforcing conditions on (i) word or stem itself (i.e. string literal, in cases of capturing, say, idiomatic expressions), and/or (ii) POS (part-of-speech, such as noun, adjective, verb, preposition), (iii) and/or orthography features (e.g. initial upper case, mixed case, token with digits and dots), and/or (iv) morphology features (e.g. tense, aspect, person, number, case, etc. decoded by a previous morphology module), (v) and/or syntactic features (e.g. verb subcategory features such as intransitive, transitive, ditransitive), (vi) and/or lexical semantic features (e.g. human, animal, furniture, food, school, time, location, color, emotion). There are almost infinite combinations of such conditions that can be enforced in rules’ patterns. A linguist’s job is to code such conditions to maximize the benefits of capturing the target language phenomena, a balancing art in engineering through a process of trial and error. Macroscopically speaking, the rule hand-crafting process is in its essence the same as programmers coding an application, only that linguists usually use a different, very high-level NLP-specific language, in a chosen or designed formalism appropriate for modeling natural language and framework on a platform that is geared towards facilitating NLP work. Hard-coding NLP in a general purpose language like Java is not impossible for prototyping or a toy system. But as natural language is known to be a complex monster, its processing calls for a special formalism (some form or extension of Chomsky’s formal language types) and an NLP-oriented language to help implement any non-toy systems that scale. So linguists are trained on the scene of development to be knowledge programmers in hand-crafting linguistic rules. In terms of different levels of languages used for coding, to an extent, it is similar to the contrast between programmers in old days and the modern software engineers today who use so-called high-level languages like Java or C to code. Decades ago, programmers had to use assembly or machine language to code a function. The process and workflow for hand-crafting linguistic rules are just like any software engineers in their daily coding practice, except that the language designed for linguists is so high-level that linguistic developers can concentrate on linguistic challenges without having to worry about low-level technical details of memory allocation, garbage collection or pure code optimization for efficiency, which are taken care of by the NLP platform itself. Everything else follows software development norms to ensure the development stay on track, including unit testing, baselines construction and monitoring, regressions testing, independent QA, code reviews for rules’ quality, etc. Each level language has its own star engineer who masters the coding skills. It sounds ridiculous to respect software engineers while belittling linguistic engineers only because the latter are hand-crafting linguistic code as knowledge resources. The chief architect in this context plays the key role in building a real life robust NLP system that scales. To deep-parse or process natural language, he/she needs to define and design the formalism and language with the necessary extensions, the related data structures, system architecture with the interaction of different levels of linguistic modules in mind (e.g. morpho-syntactic interface), workflow that integrate all components for internal coordination (including patching and handling interdependency and error propagation) and the external coordination with other modules or sub-systems including machine learning or off-shelf tools when needed or felt beneficial. He also needs to ensure efficient development environment and to train new linguists into effective linguistic “coders” with engineering sense following software development norms (knowledge engineers are not trained by schools today). Unlike the mainstream machine learning systems which are by nature robust and scalable, hand-crafted systems’ robustness and scalability depend largely on the design and deep skills of the architect. The architect defines the NLP platform with specs for its core engine compiler and runner, plus the debugger in a friendly development environment. He must also work with product managers to turn their requirements into operational specs for linguistic development, in a process we call semantic grounding to applications from linguistic processing. The success of a large NLP system based on hand-crafted rules is never a simple accumulation of linguistics resources such as computational lexicons and grammars using a fixed formalism (e.g. CFG) and algorithm (e.g. chart-parsing). It calls for seasoned language engineering masters as architects for the system design. Given the scene of practice for NLP development as describe above, it should be clear that the negative sentiment association with “hand-crafting” is unjustifiable and inappropriate. The only remaining argument against coding rules by hands comes down to the hard work and costs associated with hand-crafted approach, so-called knowledge bottleneck in the rule-based systems. If things can be learned by a machine without cost, why bother using costly linguistic labor? Sounds like a reasonable argument until we examine this closely. First, for this argument to stand, we need proof that machine learning indeed does not incur costs and has no or very little knowledge bottleneck. Second, for this argument to withstand scrutiny, we should be convinced that machine learning can reach the same or better quality than hand-crafted rule approach. Unfortunately, neither of these necessarily hold true. Let us study them one by one. As is known to all, any non-trivial NLP task is by nature based on linguistic knowledge, irrespective of what form the knowledge is learned or encoded. Knowledge needs to be formalized in some form to support NLP, and machine learning is by no means immune to this knowledge resources requirement. In rule-based systems, the knowledge is directly hand-coded by linguists and in case of (supervised) machine learning, knowledge resources take the form of labeled data for the learning algorithm to learn from (indeed, there is so-called unsupervised learning which needs no labeled data and is supposed to learn from raw data, but that is research-oriented and hardly practical for any non-trivial NLP, so we leave it aside for now). Although the learning process is automatic, the feature design, the learning algorithm implementation, debugging and fine-tuning are all manual, in addition to the requirement of manual labeling a large training corpus in advance (unless there is an existing labeled corpus available, which is rare; but machine translation is a nice exception as it has the benefit of using existing human translation as labeled aligned corpora for training). The labeling of data is a very tedious manual job. Note that the sparse data challenge represents the need of machine learning for a very large labeled corpus. So it is clear that knowledge bottleneck takes different forms, but it is equally applicable to both approaches. No machine can learn knowledge without costs, and it is incorrect to regard knowledge bottleneck as only a defect for the rule-based system. One may argue that rules require expert skilled labor, while the labeling of data only requires high school kids or college students with minimal training. So to do a fair comparison of the costs associated, we perhaps need to turn to Karl Marx whose “Das Kapital” has some formula to help convert simple labor to complex labor for exchange of equal value: for a given task with the same level of performance quality (assuming machine learning can reach the quality of professional expertise, which is not necessarily true), how much cheap labor needs to be used to label the required amount of training corpus to make it economically an advantage? Something like that. This varies from task to task and even from location to location (e.g. different minimal wage laws), of course. But the key point here is that knowledge bottleneck challenges both approaches and it is not the case believed by many that machine learning learns a system automatically with no or little cost attached. In fact, things are far more complicated than a simple yes or no in comparing the costs as costs need also to be calculated in a larger context of how many tasks need to be handled and how much underlying knowledge can be shared as reusable resources. We will leave it to a separate writing for the elaboration of the point that when put into the context of developing multiple NLP applications, the rule-based approach which shares the core engine of parsing demonstrates a significant saving on knowledge costs than machine learning. Let us step back and, for argument’s sake, accept that coding rules is indeed more costly than machine learning, so what? Like in any other commodities, hand-crafted products may indeed cost more, they also have better quality and value than products out of mass production. For otherwise a commodity society will leave no room for craftsmen and their products to survive. This is common sense, which also applies to NLP. If not for better quality, no investors will fund any teams that can be replaced by machine learning. What is surprising is that there are so many people, NLP experts included, who believe that machine learning necessarily performs better than hand-crafted systems not only in costs saved but also in quality achieved. While there are low-level NLP tasks such as speech processing and document classification which are not experts’ forte as we human have much more restricted memory than computers do, deep NLP involves much more linguistic expertise and design than a simple concept of learning from corpora to expect superior data quality. In summary, the hand-crafted rule defect is largely a misconception circling around wildly in NLP and reinforced by the mainstream, due to incomplete induction or ignorance of the scene of modern day rule development. It is based on the incorrect assumption that machine learning necessarily handles all NLP tasks with same or better quality but less or no knowledge bottleneck, in comparison with systems based on hand-crafted rules. Note: This is the author’s own translation, with adaptation, of part of our paper which originally appeared in Chinese in Communications of Chinese Computer Federation (CCCF), Issue 8, 2013 Domain portability myth in natural language processing Pride and Prejudice of NLP Main Stream K. Church: A Pendulum Swung Too Far , Linguistics issues in Language Technology, 2011; 6(5) Wintner 2009. What Science Underlies Natural Language Engineering? Computational Linguistics, Volume 35, Number 4 Pros and Cons of Two Approaches: Machine Learning vs Grammar Engineering Overview of Natural Language Processing Dr. Wei Li’s English Blog on NLP; 个人分类: 立委科普|4304 次阅读|0 个评论

Google SyntaxNet is NOT the “world’s most accurate parser": 热度 1 liwei999 2016-7-25 18:02; As we all know, natural language parsing is fairly complex but instrumental in Natural Language Understanding (NLU) and its applications . We also know that a breakthrough to 90%+ accuracy for parsing is close to human performance and is indeed an achievement to be proud of. Nevertheless, following the common sense, we all have learned that you got to have greatest guts to claim the “most” for anything without a scope or other conditions attached, unless it is honored by authoritative agencies such as Guinness. For Google’s claim of “the world’s most accurate parser”, we only need to cite one system out-performing theirs to prove its being untrue or misleading. We happen to have built one . For a long time, we know that our English parser is near human performance in data quality, and is robust, fast and scales up to big data in supporting real life products. For the approach we take, i.e. the approach of grammar engineering, which is the other “school” from the mainstream statistical parsing, this was just a natural result based on the architect’s design and his decades of linguistic expertise. In fact, our parser reached near-human performance over 5 years ago, at a point of diminishing returns, hence we decided not to invest heavily any more in its further development. Instead, our focus was shifted to its applications in supporting open-domain question answering and fine-grained deep sentiment analysis for our products, as well as to the multilingual space . So a few weeks ago when Google announced SyntaxNet , I was bombarded by the news cited to me from all kinds of channels as well as many colleagues of mine, including my boss and our marketing executives. All are kind enough to draw my attention to this “newest breakthrough in NLU” and seem to imply that we should work harder, trying to catch up with the giant. In my mind, there has never been doubt that the other school has a long way before they can catch us. But we are in information age, and this is the power of Internet: eye-catching news from or about a giant, true or misleading, instantly spreads to all over the world. So I felt the need to do some study, not only to uncover the true picture of this space, but more importantly, also to attempt to educate the public and the young scholars coming to this field that there have always been and will always be two schools of NLU and AI (Artificial Intelligence). These two schools actually have their respective pros and cons , they can be complementary and hybrid , but by no means can we completely ignore or replace one by the other. Plus, how boring a world would become if there were only one approach, one choice, one voice, especially in core cases of NLU such as parsing (as well as information extraction and sentiment analysis , among others) where the “select approach” does not perform nearly as well as the forgotten one. So I instructed a linguist who was not involved in the development of the parser to benchmark both systems as objectively as possible, and to give an apples-to-apples comparison of their respective performance. Fortunately, the Google SyntaxNet outputs syntactic dependency relationships and ours is also mainly a dependency parser . Despite differences in details or naming conventions, the results are not difficult to contrast and compare based on linguistic judgment. To make things simple and fair, we fragment a parse tree of an input sentence into binary dependency relations and let the testor linguist judge; once in doubt, he will consult another senior linguist to resolve, or to put on hold if believed to be in gray area, which is rare. Unlike some other areas of NLP tasks, e.g. sentiment analysis, where there is considerable space of gray area or inter-annotator disagreement, parsing results are fairly easy to reach consensus among linguists. Despite the different format such results are embodied in by two systems (an output sample is shown below), it is not difficult to make a direct comparison of each dependency in the sentence tree output of both systems. (To be stricter on our side, a patched relationship called Next link used in our results do not count as a legit syntactic relation in testing.) SyntaxNet output: 1.Input: President Barack Obama endorsed presumptive Democratic presidential nominee Hillary Clinton in a web video Thursday . Netbase output: Benchmarking was performed in two stages as follows. Stage 1, we select English formal text in the news domain, which is SyntaxNet’s forte as it is believed to have much more training data in news than in other styles or genres. The announced 94% accuracy in news parsing is indeed impressive. In our case, news is not the major source of our development corpus because our goal is to develop a domain-independent parser to support a variety of genres of English text for real life applications on text such as social media (informal text) for sentiment analysis, as well as technology papers (formal text) for answering how questions. We randomly select three recent news article for this testing, with the following links. (1) http://www.cnn.com/2016/06/09/politics/president-barack-obama-endorses-hillary-clinton-in-video/ (2) Part of news from: http://www.wsj.com/articles/nintendo-gives-gamers-look-at-new-zelda-1465936033 (3) Part of news from: http://www.cnn.com/2016/06/15/us/alligator-attacks-child-disney-florida/ Here are the benchmarking results of parsing the above for the news genre: (1) Google SyntaxNet: F-score= 0.94 (tp for true positive, fp for false positive, tn for true negative; P for Precision, R for Recall, and F for F-score) P = tp/(tp+fp) = 1737/(1737+104) = 1737/1841 = 0.94 R = tp/(tp+tn) = 1737/(1737+96) = 1737/1833 = 0.95 F= 2* = 2* = 2*(0.893/1.89) = 0.94 (2) Netbase parser: F-score = 0.95 P = tp/(tp+fp) = 1714/(1714+66) = 1714/1780 = 0.96 R = tp/(tp+tn) = 1714/(1714+119) = 1714/1833 = 0.94 F = 2* = 2* = 2*(0.9024/1.9) = 0.95 So the Netbase parser is about 2 percentage points better than Google SyntaxNet in precision but 1 point lower in recall. Overall, Netbase is slightly better than Google in the precision-recall combined measures of F-score. As both parsers are near the point of diminishing returns for further development, there is not too much room for further competition. Stage 2 , we select informal text, from social media Twitter to test a parser’s robustness in handling “degraded text”: as is expected, degraded text will always lead to degraded performance (for a human as well as a machine), but a robust parser should be able to handle it with only limited degradation. If a parser can only perform well in one genre or one domain and the performance drastically falls in other genres, then this parser is not of much use because most genres or domains do not have as large labeled data as the seasoned news genre . With this knowledge bottleneck, a parser is severely challenged and limited in its potential to support NLU applications. After all, parsing is not the end, but a means to turn unstructured text into structures to support semantic grounding to various applications in different domains . We randomly select 100 tweets from Twitter for this testing, with some samples shown below. 1.Input: RT @ KealaLanae : ima leave ths here. https : //t.co/FI4QrSQeLh2.Input: @ WWE_TheShield12 I do what I want jk I ca n’t kill you .10.Input: RT @ blushybieber : Follow everyone who retweets this , 4 mins 20.Input: RT @ LedoPizza : Proudly Founded in Maryland. @ Budweiser might have America on their cans but we think Maryland Pizza sounds better 30.Input: I have come to enjoy Futbol over Football 40.Input: @ GameBurst That ‘s not meant to be rude. Hard to clarify the joke in tweet form . 50.Input: RT @ undeniableyella : I find it interesting , people only talk to me when they need something … 60.Input: Petshotel Pet Care Specialist Jobs in Atlanta , GA # Atlanta # GA # jobs # jobsearch https : //t.co/pOJtjn1RUI 70.Input: FOUR ! BUTTLER nailed it past the sweeper cover fence to end the over ! # ENG – 91/6 -LRB- 20 overs -RRB- . # ENGvSL https : //t.co/Pp8pYHfQI8 79..Input: RT @ LenshayB : I need to stop spending money like I ‘m rich but I really have that mentality when it comes to spending money on my daughter 89.Input: RT MarketCurrents : Valuation concerns perk up again on Blue Buffalo https : //t.co/5lUvNnwsjA , https : //t.co/Q0pEHTMLie 99.Input: Unlimited Cellular Snap-On Case for Apple iPhone 4/4S -LRB- Transparent Design , Blue/ https : //t.co/7m962bYWVQ https : //t.co/N4tyjLdwYp 100.Input: RT @ Boogie2988 : And some people say , Ethan ‘s heart grew three sizes that day. Glad to see some of this drama finally going away. https : //t.co/4aDE63Zm85 Here are the benchmarking results for the social media Twitter: (1) Google SyntaxNet: F-score = 0.65 P = tp/(tp+fp) = 842/(842+557) = 842/1399 = 0.60 R = tp/(tp+tn) = 842/(842+364) = 842/1206 = 0.70 F = 2* = 2* = 2*(0.42/1.3) = 0.65 Netbase parser: F-score = 0.80 P = tp/(tp+fp) = 866/(866+112) = 866/978 = 0.89 R = tp/(tp+tn) = 866/(866+340) = 866/1206 = 0.72 F = 2* = 2* = 2*(0.64/1.61) = 0.80 For the above benchmarking results, we leave it to the next blog for interesting observations and more detailed illustration, analyses and discussions. To summarize, our real life production parser beats Google’s research system SyntaxtNet in both formal news text (by a small margin as we both are already near human performance) and informal text, with a big margin of 15 percentage points. Therefore, it is safe to conclude that Google’s SytaxNet is by no means “world’s most accurate parser”, in fact, it has a long way to get even close to the Netbase parser in adapting to the real world English text of various genres for real life applications. Is Google SyntaxNet Really the World’s Most Accurate Parser? Announcing SyntaxNet: The World’s Most Accurate Parser Goes Open K. Church: “A Pendulum Swung Too Far” , Linguistics issues in Language Technology, 2011; 6(5) Pros and Cons of Two Approaches: Machine Learning vs Grammar Engineering Introduction of Netbase NLP Core Engine Overview of Natural Language Processing Dr. Wei Li’s English Blog on NLP 发布; 个人分类: 立委科普|6036 次阅读|1 个评论

[转载]C++ By Example(三) References (Greg Perry英文版)【清晰PDF版: lcj2212916 2016-5-5 19:41; You do not have to understand the concepts in this appendix to become well-versed in C++. You can master C++, however, only if you spend some time learning about the behind-the-scenes roles played by binary numbers. The material presented here is not difficult, but many programmers do not take the time to study it; hence, there are a handful of C++ masters who learn this material and understand how C++ works “under the hood,” and there are those who will never master the language as they could. You should take the time to learn about addressing, binary numbers, and hexadecimal numbers. These fundamental principles are presented here for you to learn, and although a working knowledge of C++ is possible without knowing them, they greatly enhance your C++ skills (and your skills in every other programming language). After reading this appendix, you will better understand why different C++ data types hold different ranges of numbers. You also will see the importance of being able to represent hexadecimal numbers in C++, and you will better understand C++ array and pointer addressing. 下载地址： http://edu.ctfile.com/info/W5N57733 http://www.yimuhe.com/file-3045045.html; 986 次阅读|0 个评论

分享deep learning大牛Yoshua Bengio的《deep learning》最终版: lhj701 2016-4-7 21:54; 分享deep learning大牛Yoshua Bengio的《deep learning》最终版印刷前公开最终版，原来的版本才500多页，这个版本800多页，值得分享。（pdf 太大而传不上科学网）百度云网盘下载地址： http://pan.baidu.com/s/1qYIeAJ U 作者： Ian Goodfellow Yoshua Bengio Aaron Courville 网站： http://www.deeplearningbook.org/ 目录内容： Table of Contents Acknowledgements Notation 1 Introduction Part I: Applied Math and Machine Learning Basics 2 Linear Algebra 3 Probability and Information Theory 4 Numerical Computation 5 Machine Learning Basics Part II: Modern Practical Deep Networks 6 Deep Feedforward Networks 7 Regularization 8 Optimization for Training Deep Models 9 Convolutional Networks 10 Sequence Modeling: Recurrent and Recursive Nets 11 Practical Methodology 12 Applications Part III: Deep Learning Research 13 Linear Factor Models 14 Autoencoders 15 Representation Learning 16 Structured Probabilistic Models for Deep Learning 17 Monte Carlo Methods 18 Confronting the Partition Function 19 Approximate Inference 20 Deep Generative Models Bibliography Index; 个人分类: 大数据|24607 次阅读|0 个评论

机器学习-deep learning reading 网上链接: lhj701 2016-2-23 15:19; 机器学习-deep learning reading 网上链接 Deep learning Reading List Following is a growing list of some of the materials i found on the web for Deep Learning beginners. Free Online Books Deep Learning by Yoshua Bengio, Ian Goodfellow and Aaron Courville Neural Networks and Deep Learning by Michael Nielsen Deep Learning by Microsoft Research Deep Learning Tutorial by LISA lab, University of Montreal Courses Machine Learning by Andrew Ng in Coursera Neural Networks for Machine Learning by Geoffrey Hinton in Coursera Neural networks class by Hugo Larochelle from Université de Sherbrooke Deep Learning Course by CILVR lab @ NYU CS231n: Convolutional Neural Networks for Visual Recognition On-Going CS224d: Deep Learning for Natural Language Processing Going to start Video and Lectures How To Create A Mind By Ray Kurzweil - Is a inspiring talk Deep Learning, Self-Taught Learning and Unsupervised Feature Learning By Andrew Ng Recent Developments in Deep Learning By Geoff Hinton The Unreasonable Effectiveness of Deep Learning by Yann LeCun Deep Learning of Representations by Yoshua bengio Principles of Hierarchical Temporal Memory by Jeff Hawkins Machine Learning Discussion Group - Deep Learning w/ Stanford AI Lab by Adam Coates Making Sense of the World with Deep Learning By Adam Coates Demystifying Unsupervised Feature Learning By Adam Coates Visual Perception with Deep Learning By Yann LeCun Papers ImageNet Classification with Deep Convolutional Neural Networks Using Very Deep Autoencoders for Content Based Image Retrieval Learning Deep Architectures for AI CMU’s list of papers Tutorials UFLDL Tutorial 1 UFLDL Tutorial 2 Deep Learning for NLP (without Magic) A Deep Learning Tutorial: From Perceptrons to Deep Networks WebSites deeplearning.net deeplearning.stanford.edu Datasets MNIST Handwritten digits Google House Numbers from street view CIFAR-10 and CIFAR-100 IMAGENET Tiny Images 80 Million tiny images Flickr Data 100 Million Yahoo dataset Berkeley Segmentation Dataset 500 Frameworks Caffe Torch7 Theano cuda-convnet Ccv NuPIC DeepLearning4J Miscellaneous Google Plus - Deep Learning Community Caffe Webinar 100 Best Github Resources in Github for DL Word2Vec Caffe DockerFile TorontoDeepLEarning convnet Vision data sets Fantastic Torch Tutorial My personal favourite. Also check out gfx.js Torch7 Cheat sheet 原文链接：http://jmozah.github.io/links/#rd; 个人分类: 大数据|4033 次阅读|0 个评论

一切声称用机器学习做社会媒体舆情挖掘的系统，都值得怀疑: liwei999 2015-11-21 03:51; 一切声称用主流机器学习方法做社会媒体舆情挖掘的系统，都值得怀疑。捉襟见肘不堪应用是基本现状。原因是如此显然，机器学习在短消息主导的社会媒体面前失效了。短消息根本就没有足够密度的数据点（所谓 keyword density）供机器学习施展。巧妇且难为无米之炊，这是一袋子词的方法论决定的，再大的训练集也难以克服这个局限。没有语言学的结构分析，这是不可逾越的挑战。 I have articulated this point in various previous posts or blogs before, but the world is so dominated by the mainstream that it does not seem to carry far. So let me make it simple to be understood: The sentiment classification approach based on bag of words (BOW) model, so far the dominant approach in the mainstream for sentiment analysis, simply breaks in front of social media. The major reason is simple: the social media posts are full of short messages which do not have the keyword density required by a classifier to make the proper sentiment decision. The precision ceiling for this line of work in real life social media is found to be 60%, far behind the widely acknowledged precision minimum 80% for a usable extraction system. Trusting a machine learning classifier to perform social media sentiment is not much better than flipping a coin. So let us get straight. From now on, whoever claims the use of machine learning for social media mining of public opinions and sentiments is likely to be a trap (unless it is verified to have involved parsing of linguistic structures or patterns, which so far has never been heard of in practical systems based on machine learning). Fancy visualizations may make the mining results look real and attractive but they are just not trustable at all. 【补记】朋友截屏了朋友圈，说这是一竿子打翻一船人的架势。但关于这一点，实在没有办法，无论中文还是西文，短消息压倒多数是移动时代社交媒体的现实, 总须有人揭出社交媒体大数据挖掘背后的事实真相。 BOW 面对短消息束手无策，是不争的事实，不会因为这是最简便 available 的主流方法，多数人用它，它就在不适合它的场所突然显灵了。不 work 就是不 work，这一路突破不了60%的精度瓶颈，离公认的可用精度门槛80%遥不可及，这是方法论决定的。 Related Posts: Pros and Cons of Two Approaches: Machine Learning and Grammar Engineering Coarse-grained vs. fine-grained sentiment analysis 舆情挖掘系统独立验证的意义 2015-11-22 【立委科普：NLP 中的一袋子词是什么】 2015-11-27 【置顶：立委科学网博客NLP博文一览（定期更新版）】; 个人分类: 立委科普|6640 次阅读|0 个评论

《泥沙龙笔记：deep，情到深处仍孤独》: liwei999 2015-10-28 03:05; 我: 《泥沙龙铿锵三人行：句法语义纠缠论》且喷且整理。白: 这篇语言学的味道太浓了，不知道的还以为我是文科生呢。我：最近看到AI和deep learning顶尖大腕坐而论道，豪气冲天。可见行业之热，大数据带来的机遇和资源。只是一条，说模式匹配（pattern matching）是毒药，我就不明白，这个结论怎么这么快就下了。你得先把 deep learning 弄到 parsing 上成功了可以匹敌模式匹配以后再下断语吧，也让咱们吃了一辈子毒药已经五毒不侵了的人服气一点不是？再说，模式匹配可以玩的花样多去了，绝不是乔姆斯基当年批判的单层的有限状态一样的东西了（正如在DL兴起之前敏斯基批判单层神经网络一样），怎么就能 jump to conclusion？ speech 和 image，咱服输投降，不跟你玩，text 咱还要看两年，才知道这 deep 的家伙到底能做到多deep，是不是能超过多层模式匹配的deep？如果这一仗 deep learning 真地功德圆满，就像当年统计MT 打败了规则MT一样，我就真地洗手不干NLP，乐见其成，回家专门整理家庭录像和老古董诗词校注去了。白: 伟哥，淡定洪: 别介！不过你可以让你女儿做 deep learning，万一要败，也败在女儿手下。 Philip: 好主意，不然伟哥难以释怀我: good idea, 要败，也败在女儿手下。 Xinan: 我觉得做PARSING的人没有把当前的计算能力挖掘出来。人家都GPU用上了，你们能把多核用起来吗？白: 我可以断定，统计思路向深层NLP发展的必由之路就是RNN。RNN的deep与视觉/图像用的CNN的deep，有本质差别。用RNN学出正则表达式或者更强表达力的东东是自然而然的事情。因为它考虑了一个更根本的东西，时序。做到相当于多层有限状态机的能力，毫不奇怪。如果工程上再做一些贴近实际的适配，会更好。此外，RNN天然可硬件级并行，做出的专用硬件比通用机上软件实现的FSA，性能只会更好。我: 白老师，这 RNN 是不是姓 DL ？白: 是这杆大旗下的一支。但是RNN用纯学习的方法获得，尚有难关，但用编译的方法获得，我已经走通了。在编译基础上再做小规模的学习，是make sense的。我: 那么 training 的 corpus 怎么弄呢？干脆我来提供 training corpus 要多少给多少。白: 哈哈。我: 为的就是：取法乎上仅得其中。当然也可以想见，取法乎中可得其上，这不是科幻。因为我自己就做过 “句法自学习”的实验，确实可以通过 self-learning 自我提高的，就是 overhead 太大，反正我没本事让它实用化，虽然理论上的路子在实验中是走通了。白: 可以认为，之前，统计和浅层是同义词，规则和深层是同义词，但是，随着RNN/LSTM这些东东的出现，这同义词都是老黄历了。所以，复旦的美女教授教导我们说，不要一提统计就断定是浅层。是吧？黄: 现在做浅层，发不了论文，所以越来越深。我: 统计也的确在浅层里扎了太久。主要还是以前的低枝果实太多了。在水牛城的时候，我的搭档牛博士尝试过做统计深层，他用我提供的parsing 做基础，去做 IE 抽取，试图突破关键词技术的质量瓶颈，当时的有限的试验是：（1）很难， keywords are often hard to beat ；（2）但不是完全没有希望。白: 我说的编译算统计还是规则，我自己都糊涂，输入是规则，输出是权值。雷: rnn是胡子眉毛一把抓吗？白: 不是。很多是“可解释的”。我: 白老师的编译不就是 symbolic rule learning 的一种么？提供规则模板，让统计从数据中去学具体的规则。甚至 Brill 的那套也是这个思路。当然，设计规则模板里面牵涉到的语言学，比单纯加上一些简单的 features ，要深。白: @wei 不是，我不学规则，只是把规则实现为RNN。雷: 是不是一个矩阵呢？白: 差不多。雷: 内有各种特征，这些特征上上下下的都有？白: 隐节点和FSA的状态有的一比。输入，输出节点都是可解释的。隐节点的可解释程度不差于有限状态自动机状态的可解释程度。我: 还是不知道你的训练集从哪里来，如何扩大你的训练集？或者只需要有限的训练集，然后加上大量的没有labelled的数据？白: 训练是下一步的事情。准确实现规则，regex已经可以了。我: sparse data 会成为瓶颈么？不过你的起点就是规则的话，这个问题也许不那么严重黄: @wei 是的，现有带标数据不够，严重不够。白: 现在想的是，稍微把毛毛虫的身材再撑胖一点，覆盖力再强大一点。但是不允许突破线性复杂度。 @黄所以不能白手起家学习。先有点啥垫底。我: @黄你要多少带标数据都可以给你，你赢了，给我一点儿 credit 就可以了。阮: 如果能够用无监督的方法，在弱可用数据上学习就好了。我: 无监督，除了做clustering，谈何容易。黄: 您老有多少句parsed sentence？@wei 我: 没有上限，不过是让机器跑几天，输出grammar tree 的XML而已。认真说来，用我们的自动标注做底子是一个出路，就看能不能通过大数据青出于蓝了，不是不可能。其实，我们的手工系统，有时候为了照顾 recall 是对 precision 做了牺牲的，我完全可以 cut 掉那些照顾性规则，做一个接近 100% precision 的标注来，漏掉的不算。这样我的已经标注的东西可以超过人工的水平，因为人会打瞌睡，系统不会。譬如我可以自动标注70%，准确率达到 95%，剩下的 30% 再去找人做标注，或者不管它，以后系统用缺省的办法或 smoothing 啥的来对付。这应该是可行的。黄: 您正式release吧，我们引用您。白: 伟哥，你这是拿规则系统训练统计系统，整下来，统计还是超不过规则。我: 所以我说，你需要青出于蓝呀。以前想过给 LDC ，后来就算了，毕竟还是需要 costs，公司也没看到好处（他们不懂学界，只要人家用了，用得多了，这就是 marketing 啊）。雷: 现在目下公开的中文标注文本，除了penn的ctb，还有什么？ ctb中还是有不少错误的。黄: @wei 您就辛苦些，包装包装给LDC吧。雷: ctb也是收费的。白: 我是深度不学习，而是深度编译。我: 我是不深度学习（DL），而是深度分析（deep parsing）。雷: 学习没深度。我: 咱这帮三教九流不同背景来的，成群口相声了。雷: @白，编译=置换？白: 不是置换雷: 那是什么？白: 从句法规则映射到权值。让相应的网络在实际跑起来的时候，做的动作恰好是分析。雷: 再把权值映射到规则？白: 不了。雷: 画个图？ Nick: @wei 你这篇码出来就叫情到深处仍孤独。我：yeh, deep，情到deep仍孤独 whether deep learning or deep parsing 【置顶：立委科学网博客NLP博文一览（定期更新版）】; 个人分类: 立委科普|5114 次阅读|0 个评论

钩沉：Early arguments for a hybrid model for NLP and IE: liwei999 2015-10-25 01:01; January 2000 On Hybrid Model Pre-Knowledge-Graph Profile Extraction Research via SBIR (3) This section presents the feasibility study conducted in Phase I of the proposed hybrid model for Level-2 and Level‑3 IE. This study is based on literature review and supported by extensive experiments and prototype implementation. This model complements corpus-based machine learning by hand-coded FST rules. The essential argument for this strategy is that by combining machine learning methods with an FST rule-based system, the system is able to exploit the best of both paradigms while overcoming their respective weaknesses. This approach was intended to meet the demand of the designed system for processing unrestricted real life text. 2.2.1 Hybrid Approach It was proposed that FST hand-crafted rules combine with corpus-based learning in all major modules of Textract . More precisely, each module M will consist of two sub-modules M1 and M2, i.e. FST model and trained model. The former serves as a preprocessor, as shown below. M1: FST Sub-module ˉ M2: trained Sub-module The trained model M2 has two features: (i) adaptive training ; (ii) structure-based training. In a pipeline architecture, the output of the previous module is the input of the succeeding module. If the succeeding module is a trained model, there are two types of training: adaptive training or non-adaptive training. In adaptive training, the input in the training phase is exactly the same as the input in the application phase. That is, the possibly imperfect output from the previous module is the input for training even if the previous module may make certain mistakes. This type of training “adapts” the model to imperfect input and the trained model will be more robust and results in some necessary adjustment. In contrast, a naive non-adaptive training is often conducted based on a perfect, often artificial input. The assumption is that the previous module is a continuously improving module and will be able to provide near-perfect output for the next module. There are pros and cons for both adaptive and non-adaptive methods. Non-adaptive training is suitable for the case when the training time is significantly long and in the case where the previous module is simple and reaches high precision. In contrast, an adaptively trained model has to be re-trained each time the previous module(s) undergo some major changes. Otherwise, the performance will be seriously affected. This imposes stringent requirements on training time and algorithm efficiency. Since the machine learning tools, which Cymfony has developed in-house, are very efficient, Textract can afford to adopt the more flexible training method using adaptive input. Adaptive training provides the rationale for placing the FST model before the trained model. The development of the FST sub-module M1 and the trained sub-module M2 can be done independently. When the time comes for the integration of M1 and M2 for better performance, it suffices to re-train M2 on the output of M1. The flexible adaptive training capabilities make this design viable, as verified inthe prototype implementation of Textract2.0/CE . In contrast, if M1 were placed after M2, the development of hand-crafted rules for M1 would have to wait until M2 is implemented. Otherwise, many rules may have to be re-written and re-debugged, which is not desirable. The second issue is structure-based training. Natural language is structural by nature; any sophisticated high level IE can hardly be successful based on linear strings of tokens. In order to capture CE/GEphenomena, traditional n -gram training with a window size of n linear tokens is not sufficient. Sentences can be long where the related entities are far apart, not to mention the long distance phenomena in linguistics. Without structure based training, no matter how large the window size one chooses, generalized rules cannot be effectively learned. However, once the training is based on linguistic structures, the distance between the entities becomes tractable. In fact, as linguistic structures are hierarchical, we need to perform multi-level training in order to capture CE/GE. For CE, it has been found during the Phase I research that three levels of training are necessary. Each level of training should be supported by the corresponding natural language parser. The remainder of this section presents the feasibility study and arguments for the choice of an FST rule based system to complement the corpus-based machine learning models. 2.2.2 FST Grammars The most attractive feature of the FST formalism lies in its superior time and space efficiency. Applying FST basically depends linearly on the input size of the text. This is in contrast to the more pervasive formalism used in NLP, namely, Context Free Grammars. This theoretical time/space efficiency has been verified through the extensive use of Cymfony’s proprietary FST Toolkit in the following applications of Textract implementation: (i) tokenizer; (ii) FST-based rules for capturing NE; (iii) FST representation of lexicons (lexical transducers); (iv) experiments in FST local grammars for shallow parsing; and (v) local CE/GEgrammars in FST. For example, the Cymfony shallow parser has been benchmarked to process 460 MB of text an hour on a 450 MHz Pentium II PC running Windows NT. There is a natural combination of FST-based grammars and lexical approaches to natural language phenomena . In order for IE grammars/rules to perform well, the lexical approach must be employed. In fact, the NE/CE/GE grammars which have been developed in Phase I have demonstrated a need for the lexical approach. Take CE as an example. In order to capture a certain CE relationship, say affiliation , the corresponding rules need to check patterns involving specific verbs and/or prepositions, say work for/hired by , which denote this relationship in English. The GE grammar, which aims at decoding the key semantic relationships in the argument structure in stead of surface syntactic relationships, has also demonstrated the need to involve considerable level of lexical constraints. Efficiency is always an important consideration indeveloping a large-scale deployable software system. Efficiency is particularly required for lexical grammars since lexical grammars are usually too large for efficient processing using conventional, more powerful grammar formalisms (e.g. Context Free Grammar formalism). Cymfony is convinced through extensive experiments that the FST technology is an outstanding tool to tackle this efficiency problem. It was suggested that a set of cascaded FST grammars could simulate sophisticated natural language parsing. This use of FST application has already successfully been applied to the Textract shallow parsing and local CE/GE extraction. There are a number of success stories of FST-based rule systems in the field of IE. For example, the commercial NE system NetOwl relies heavily on FST pattern matching rules . SRI also applied a very efficient FST local grammar for the shallow parsing of basic noun phrases and verb groups in order to support IE tasks . More recently, Universite Paris VII/LADL has successfully applied FST technology to one specified information extraction/retrieval task; that system can extract information on-the-fly about one's occupation from huge amounts of free text. The system is able to answer questions which conventional retrieval systems cannot handle, e.g. W ho is the minister of culture in France? Finally, it has also been proven by many research programs such as , and INTEX , as well as Cymfony , that an FST-based rule system is extremely efficient. In addition, FST is a convenient tool for capturing linguistic phenomena, especially for idioms and semi-productive expressions that are abundant in natural languages. As Hobbs says, “languages in general are very productive in the construction of short, multiword fixed phrases and proper names employing specialized microgrammars”. However, a purely FST-based rule system suffers from the same disadvantage in knowledge acquisition as that for all handcrafted rule systems. After all, the FST rules or local grammars have to be encoded by human experts, imposing this traditional labor-intensive problem in developing large scale systems. The conclusion is that while FST overcomes a number of shortcomings of the traditional rule based system (in particular the efficiency problem), it does not relieve the dependence on highly skilled human labor. Therefore, approaches for automatic machine learning techniques are called for. 2.2.3 Machine Learning The appeal of corpus-based machine learning in language modeling lies mainly in its automatic training/learning capabilities,hence significantly reducing the cost for hand-coding rules. Compared with rule based systems, there are definite advantages of corpus-based learning: · automatic knowledge acquisition: results in fast development time since the system discovers regularity automatically when given sufficient correctly annotated data · robustness: since knowledge/rules are learned directly from corpus · acceptable speed : in general, there is little run-time processing; the knowledge/rules obtained in the training phase can be stored in efficient data structures for run-time lookup · portability : a domain shift only requires the truthing of new data; new knowledge/rules will be automatically learned with no need to change any part of the program or control BBN has recently implemented an integrated, fully trainable model, SIFT, applied to IE . This system performs the tasks of linguistic processing (POS tagging, syntactic parsing and semantic relationship identification), TE and TR as well as NE, all at once. They have reported 83.49% F-measures for TE and 71.23% F-measures for TR, a result close to those of the best systems in MUC-7. In addition, their successful experiment in making use of the Penn Treebank for training the initial syntactic parser significantly reduces the cost of human annotation. There is no doubt that their effort is a significant progress in this field. It demonstrates the state-of-the-art in applying grammar induction to Level-2 IE. However, there are two potentially serious problems with their approach. The first is the lack of efficiency in applying the model. As they acknowledge, the speed of the system is rather slow. In terms of efficiency, the CKY‑based parsing algorithm they use is not comparable with algorithms for formalisms based on the finite state scheme (e.g. FST, Viterbi for HMM). This limiting factor is due to the inherent nature of the learned grammar based on the CFG formalism. To overcome this problem, rule induction has been explored in the direction of learning FST style grammars for local CE/GEextraction instead of CFG. The second problem is with their integrated approach. Because everything is integrated in one process, it is extremely difficult to trace where a problem lies, making debugging difficult. It is believed that a much more secure way is to follow the conventional practice of modularizing the NLP/IE process in different tasks and sub-tasks, as Cymfony has proposed in the Textract architecture design: POS tagging, shallow parsing, co-referencing, full parsing, pragmatic filtering, NE, CE,GE. Along this line, it is easy to find directly whether a particular degradation in performance is due to poor support from co-referencing or from mistakes in shallow parsing, for example. Performance benchmarking can be measured for each module; efforts to improve the performance of each individual module will contribute to the improvement of the overall system performance. 2.2.4 Drawbacks of Corpus-based Learning The following drawbacks motivate the proposed idea of building a hybrid system/module, complementing the automatic corpus-based learning by handcrafted grammars in FST. · ‘Sparse data’ problem : this is recognized as a bottle-neck for all corpus-based models . Unfortunately, a practical solution to this problem (e.g. smoothing or back-off techniques) often results in a model much less sophisticated than traditional rule-based systems. · ‘Local maxima’ problem : even if the training corpus is large and sufficiently representative, the training program can result in a poor model because training got stuck in a local maximum and failed to find the global peak . This is an inherent problem with the standard training algorithms for both HMM (i.e. forward-backward algorithm ) and CFG grammar induction ( inside-outside algorithm ). This problem can be very serious when there is no extra information applied to guide the training process. · computational complexity problem : It is often the case that there is a trade-off between expressive power/prior knowledge/constraints in the templates and feasibility. Usually, the more sophisticateda model or rule template is, the more the minimum requirement for a corpus increases, often up to an unrealistic level of training complexity. To extend the length of the string to be examined (e.g. from bigram to trigram), or to add more features (or categories/classes) for a template to be able to make reference to, usually means an enormous jump in such requirement. Otherwise, the system suffers from more serious sparse data effect. In many cases, the limitation imposed on the training complexity makes some research ideas unattainable, which in turn limits the performance power. · potentially very high cost for manual annotationof corpus: that is why Cymfony has proposed as one important direction for future research to explore the combination of supervised training and unsupervised training. Among the above four problems, the sparse data problem is believed to be most serious. To a considerable extent, the success of a system depends on how this problem is addressed. In general, there are three ways to minimize the negative effect of sparse data, discussed below. The first is to condition the probabilities/rules on fewer elements, e.g. to back off from N-gram model to (N-1)-gram model. This remedy is clearly a sacrifice of the power and therefore is not a viable option for sophisticated NLP/IE tasks. The second approach is to condition the probabilities/rules on appropriate levels of linguistic structures (e.g. basic phrase level) instead of surface based linear tokens. The research in the CE prototyping showed this to be one the most promising ways of handling the sparse data problem. This approach calls for a reliable natural language parser to establish the necessary structural foundation for conducting structure-based adaptive learning. The shallow parser which Cymfony has built using the FST engine and an extensively tested manual grammar has been tested to perform with 90.5% accuracy. The third method is to condition the probabilities/rules on more general features, e.g. using syntactic categories (e.g. POS) or semantic classes (e.g. the results from semantic lexicon; or from word clustering training) instead of the token literal. This is also a proven effective means for overcoming this bottleneck. However,there is considerable difficulty in applying this approach due to the high degree of lexical ambiguity widespread in natural languages. As for the ‘local maxima’ problem, the proposed hybrid approach in integrating handcrafted FST rules and the automatic grammar learner promises a solution. The learned model can be re-trained using the FST component as a ‘seed’ to guide the learning. In general, the more constraints and heuristics that are given to the initial statistical model for training, the better the chance for the training algorithm to result in the global maximum. It is believed that a handcrafted grammar is the most effective of such constraints since it embodies human linguistic knowledge. 2.2.5 Feasibility and Advantages of Hybrid Approach In fact, the feasibility of such collaboration between a handcrafted rule system (FST in this case) and a corpus-based system has already been verified for all the major types of models: · For transformation based systems, Brill's training algorithm ensures that the input to the system can be either a randomly tagged text ( naive initial state ) or a text tagged by another module with the same function ( sophisticated initial state ) . Using the POS tagging as an example, the input to the transformation-based tagger can be either a text randomly tagged or a text tagged by another POS tagger. The shift in the input sources only requires re-training the system; nothing in the algorithm and the annotated corpus need to be changed. · In the case of rule induction, the FST-based grammar can serve as a ‘seed’ to effectively constrain/guide the learning process in overcoming the ‘local maxima’ problem. In general, a better initial estimate of the parameters gives the learning procedure a chance to obtain better results when many local maximal points exist . It is proven by experiments conducted by Briscoe Waegner that even with a very crude handcrafted grammar of only seven binary-branching rules (e.g. PP -- P NP) to start with, a much better grammar is automatically learned than the one using the same approach without a grammar ‘seed’. Another more interesting experiment they conducted gives the following encouraging results. Given the seed of an artificial grammar that can only parse 25% of the 50,00-word corpus, the training program is able to produce a grammar capable of parsing 75% the corpus. This demonstrates the feasibility of combining handcrafted grammar and automatic grammar induction in line with the general approach proposed above: FST rules before statistical model. · When the trained sub-module is an HMM, Cymfony has verified its feasibility through extensive experiments in implementing the hybrid NE tagger, Textract 1.0 . Cymfony first implemented an NE system purely on HMM bi-gram learning, and found there were weaknesses. Due to sparse data problem, although time and numerical NEs are expressed in very predictable patterns, there was considerable amount of mistagging. Later this problem was addressed by FST rules which are good at capturing these patterns. The FST pattern rules for NE serve as a preprocessor. As a result, Textract1.0 achieved significant performance enhancement (from 85% F-measures raised to 93%). The advantages of this proposed hybrid approach are summarized below: · strict modularity : the proposal of combining FST rules and statistical models makes the system more modular as each major module is now divided into two sub-modules. Of course, adaptive re-training is necessary in the later stage of integrating the two sub-modules but it is not a burden as the process is automatic and in principle, it does not require modifications in the algorithm or the training corpus. · enhanced performance : due to the complementary nature of handcrafted and machine-learning systems. · flexible ratio of sub-modules : one module may have a large trained model and a small FST component, or the other way around, depending on the nature of a given task, i.e. how well the FST approach or the learning approach applies to the task. One is free to decide how to allocate more effort and resources to develop one component or the other. If we judge that for Task One, automatic learning is most effective, we are free to decide that more effort and resources should be used to develop the trained module M2 for this task (and less effort for the FST module M1). In other words, the relative size or contribution of M1 versus M2 is flexible,e.g. M1=20% and M2=80%. Technology developed for the proposed information extraction system and its application has focused on six specific areas: (i) machine learning toolkit, (ii) CE, (iii),CO (iv) GE, (v) QA and (vi) truthing and evaluation. The major accomplishments in these areas from the Phase I research are presented in the following sections. In fact, it is also the case in the development of a pure statistical system: repeated training and testing is the normal practice of adjusting the model in the effort for performance improvement and debugging. It is possible that one module is based exclusively on FST rules, i.e. M1=100% and M2=0%, or completely on a learned model, i.e. M1=0% and M2=100% so long as its performance is deemed good enough or the overhead of combining the FST grammar and the learned model outweighs the slight gain in performance. In fact, some minor modules like Tokenizer and POS Tagger can produce very reliable results using only one approach. REFERENCES Abney, S.P. 1991. Parsingby Chunks, Principle-Based Parsing: Computation and Psycholinguistics ,Robert C. Berwick, Steven P. Abney, Carol Tenny, eds. Kluwer Academic Publishers, Boston, MA, pp.257-278. Appelt, D.E. et al. 1995. SRI International FASTUS System MUC-6 TestResults and Analysis. Proceedings ofMUC-6 , Morgan Kaufmann Publishers, San Mateo, CA Beckwith, R. et al. 1991. WordNet: A Lexical Database Organized on Psycholinguistic Principles. Lexicons: Using On-line Resources to build a Lexicon , Uri Zernik,editor, Lawrence Erlbaum, Hillsdale, NJ. Bikel, D.M. et al .,1997. Nymble: a High-Performance Learning Name-finder. Proceedings ofthe Fifth Conference on Applied Natural Language Processing , MorganKaufmann Publishers, pp. 194-201. Brill, E., 1995.Transformation-based Error-Driven Learning and Natural language Processing: A Case Study in Part-of-Speech Tagging, Computational Linguistics , Vol.21,No.4, pp. 227-253 Briscoe, T. Waegner,N., 1992. Robust Stochastic Parsing Using the Inside-Outside Algorithm. WorkshopNotes, Statistically-Based NLP Techniques , AAAI, pp. 30-53 Charniak, E. 1994. Statistical Language Learning , MIT Press, Cambridge, MA. Chiang, T-H., Lin, Y-C. Su, K-Y. 1995. Robust Learning, Smoothing, and Parameter Tying on Syntactic Ambiguity Resolution, Computational Linguistics , Vol.21,No.3, pp. 321-344. Chinchor, N. Marsh,E. 1998. MUC-7 Information Extraction Task Definition (version 5.1), Proceedingsof MUC-7 Darroch, J.N. Ratcliff, D. 1972. Generalized iterative scaling for log-linear models. TheAnnals of Mathematical Statistics, pp. 1470-1480. Grishman, R., 1997.TIPSTER Architecture Design Document Version 2.3. Technical report, DARPA. Hobbs, J.R. 1993. FASTUS: A System for Extracting Informationfrom Text, Proceedings of the DARPA workshop on Human Language Technology , Princeton, NJ, pp. 133-137. Krupka, G.R. Hausman, K. 1998. IsoQuest Inc.: Description of the NetOwl (TM) ExtractorSystem as Used for MUC-7, Proceedings of MUC-7 Lin, D. 1998. Automatic Retrieval and Clustering of Similar Words, Proceedings of COLING-ACL '98 , Montreal, pp. 768-773. Miller, S. et al .,1998. BBN: Description of the SIFT System as Used for MUC-7. Proceedings of MUC-7 Mohri, M. 1997.Finite-State Transducers in Language and Speech Processing, ComputationalLinguistics , Vol.23, No.2, pp.269-311. Mooney, R.J. 1999. Symbolic Machine Learning for NaturalLanguage Processing. Tutorial Notes, ACL ’99 . MUC-7, 1998. Proceedings of the Seventh MessageUnderstanding Conference (MUC-7), published on the websitehttp://www.muc.saic.com/ Pine, C. 1996. Statement-of-Work (SOW) for The Intelligence Analyst Associate (IAA)Build 2, Contract for IAA Build 2, USAF, AFMC, RomeLaboratory. Rilof, E. Jones, R.1999. Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping, Proceedings of the Sixteenth a National Conference on Artificial Intelligence (AAAI-99) Rosenfeld, R. 1994. Adaptive Statistical Language Modeling. PhD thesis, Carnegie Mellon University. Senellart, J. 1998. Locating Noun Phrases with Finite StateTransducers, Proceedings of COLING-ACL '98 , Montreal, pp. 1212-1219. Silberztein, M. 1998.Tutorial Notes: Finite State Processing with INTEX, COLING-ACL'98, Montreal(also available at http://www.ladl.jussieu.fr) Srihari, R. 1998. A Domain Independent Event Extraction Toolkit, AFRL-IF-RS-TR-1998-152 Final Technical Report, published by Air Force Research Laboratory, Information Directorate,Rome Research Site, New York Yangarber, R. Grishman, R. 1998. NYU: Description of the Proteus/PET System as Used for MUC-7ST, Proceedings of MUC-7 Pre-Knowledge-Graph Profile Extraction Research via SBIR (1) 2015-10-24 Pre-Knowledge-Graph Profile Extraction Research via SBIR (2) 2015-10-24 朝华午拾：在美国写基金申请的酸甜苦辣 - 科学网【置顶：立委科学网博客NLP博文一览（定期更新版）】; 个人分类: 立委科普|5135 次阅读|0 个评论

机器学习与R语言书中R程序问题: waterbridge7 2015-9-15 10:18; 1. 4.2.2中数据读取时有问题: str(sms_raw$type) 后有三个factor。原因是数据中有一行格式有问题，用其他行替换掉即可。 2. 4.2.3中 sms_dtm - DocumentTermMatrix(corpus_clean) 此句前面要加 corpus_clean - tm_map(corpus_clean, PlainTextDocument) 出处： http://www.zhihu.com/question/29114787/answer/44383827 3. 6.2.3中ins_model3 可能打字错误应该为 ins_model 没有3; 个人分类: 交流|1010 次阅读|0 个评论

Torch/Lua Material for Deep Learning: qixianbiao 2015-7-19 19:06; Torch/Lua Material for Deep Learning Lua --Lua is a powerful, fast, lightweight, embeddable scripting language. The core code for the interpreter of Lua is very short. Lua tutorial: Learn Lua in an hour -- https://www.youtube.com/watch?v=S4eNl1rA1Ns Learn Lua in one video --- https://www.youtube.com/watch?v=iMacxZQMPXs Lua more --- https://www.youtube.com/watch?v=Us46grT9wsAindex=1list=PL0o3fqwR2CsWg_ockSMN6FActmMOJ70t_ Torch7(Basic Library) -nn -Tensor -image -math -random -CmdLine -More library in Torch, timer Torch7(More) -Five simple examples -torch/tutorials examples -torch demos --https://github.com/torch/demos -Artificial and robotic vision (introduction to Torch7 and lua, and some examples) -torch Cheatsheet -Machine Learning with Torch7; 个人分类: 科研|6381 次阅读|0 个评论

Why Hybrid?: liwei999 2015-6-19 00:19; Before we start discussing the topic of a hybrid NLP (Natural Language Processing) system, let us look at the concept of hybrid from our life experiences. I was driving a classical Camry for years and had never thought of a change to other brands because as a vehicle, there was really nothing to complain. Yes, style is old but I am getting old too, who beats whom? Until one day a few years ago when we needed to buy a new car to retire my damaged Camry. My daughter suggested hybrid, following the trend of going green. So I ended up driving a Prius ever since and have fallen in love with it. It is quiet, with bluetooth and line-in, ideal for my iPhone music enjoyment. It has low emission and I finally can say bye to smog tests. It at least saves 1/3 gas. We could have gained all these benefits by purchasing an expensive all-electronic car but I want the same feel of power at freeway and dislike the concept of having to charge the car too frequently. Hybrid gets the best of both worlds for me now, and is not that more expensive. Now back to NLP. There are two major approaches to NLP, namely machine learning and grammar engineering (or hand-crafted rule system). As mentioned in previous posts, each has its own strengths and limitations, as summarized below. In general, a rule system is good at capturing a specific language phenomenon (trees) while machine learning is good at representing the general picture of the phenomena (forest). As a result, it is easier for rule systems to reach high precision but it takes a long time to develop enough rules to gradually raise the recall. Machine learning, on the other hand, has much higher recall, usually with compromise in precision or with a precision ceiling. Machine learning is good at simple, clear and coarse-grained task while rules are good at fine-grained tasks. One example is sentiment extraction. The coarse-grained task there is sentiment classification of documents (thumbs-up thumbs down), which can be achieved fast by a learning system. The fine-grained task for sentiment extraction involves extraction of sentiment details and the related actionable insights, including association of the sentiment with an object, differentiating positive/negative emotions from positive/negative behaviors, capturing the aspects or features of the object involved, decoding the motivation or reasons behind the sentiment,etc. In order to perform sophisticated tasks of extracting such details and actionable insights, rules are a better fit. The strength for machine learning lies in its retraining ability. In theory, the algorithm, once developed and debugged, remains stable and the improvement of a learning system can be expected once a larger and better quality corpus is used for retraining (in practice, retraining is not always easy: I have seen famous learning systems deployed in client basis for years without being retrained for various reasons). Rules, on the other hand, need to be manually crafted and enhanced. Supervised machine learning is more mature for applications but it requires a large labelled corpus. Unsupervised machine learning only needs raw corpus, but it is research oriented and more risky in application. A promising approach is called semi-supervised learning which only needs a small labelled corpus as seeds to guide the learning. We can also use rules to generate the initial corpus or seeds for semi-supervised learning. Both approaches involve knowledge bottlenecks. Rule systems's bottleneck is the skilled labor, it requires linguists or knowledge engineers to manually encode each rule in NLP, much like a software engineer in the daily work of coding. The biggest challenge to machine learning is the sparse data problem, which requires a very large labelled corpus to help overcome. The knowledge bottleneck for supervised machine learning is the labor required for labeling such a large corpus. We can build a system to combine the two approaches to complement each other. There are different ways of combining the two approaches in a hybrid system. One example is the practice we use in our product, where the results of insights are structured in a back-off model: high precision results from rules are ranked higher than the medium precision results returned by statistical systems or machine learning. This helps the system to reach configurable balance between precision and recall. When labelled data are available (e.g. the community has already built the corpus, or for some tasks, the public domain has the data, e.g. sentiment classification of movie reviews can use the review data with users' feedback on 5-star scale), and when the task is simple and clearly defined, using machine learning will greatly speed up the development of a capability. Not every task is suitable for both approaches. (Note that suitability is in the eyes of beholder: I have seen many passionate ML specialists willing to try everything in ML irrespective of the nature of the task: as an old saying goes, when you have a hammer, everything looks like a nail.) For example, machine learning is good at document classification whilerules are mostly powerless for such tasks. But for complicated tasks such as deep parsing, rules constructed by linguists usually achieve better performance than machine learning. Rules also perform better for tasks which have clear patterns, for example, identifying data items like time,weight, length, money, address etc. This is because clear patterns can be directly encoded in rules to be logically complete in coverage while machine learning based on samples still has a sparse data challenge. When designing a system, in addition to using a hybrid approach for some tasks, for other tasks, we should choose the most suitable approach depending on the nature of the tasks. Other aspects of comparison between the two approaches involve the modularization and debugging in industrial development. A rule system can be structured as a pipeline of modules fairly easily so that a complicated task is decomposed into a series of subtasks handled by different levels of modules. In such an architecture, a reported bug is easy to localize and fix by adjusting the rules in the related module. Machine learning systems are based on the learned model trained from the corpus. The model itself, once learned, is often like a black-box (even when the model is represented by a list of symbolic rules as results of learning, it is risky to manually mess up with the rules in fixing a data quality bug). Bugs are supposed to be fixable during retraining of the model based on enhanced corpus and/or adjusting new features. But re-training is a complicated process which may or may not solve the problem. It is difficultto localize and directly handle specific reported bugs in machine learning. To conclude, Hybrid gets the best of both worlds . Due to the complementary nature for pros/cons of the two basic approaches to NLP, a hybrid system involving both approaches is desirable, worth more attention and exploration. There are different ways of combining the two approaches in a system, including a back-off model using rulles for precision and learning for recall, semi-supervised learning using high precision rules to generate initial corpus or “seeds”, etc.. Related posts： Comparison of Pros and Cons of Two NLP Approaches Is Google ranking based on machine learning ? 《立委随笔：语言自动分析的两个路子》《立委随笔：机器学习和自然语言处理》【置顶：立委科学网博客NLP博文一览（定期更新版）】; 个人分类: 立委科普|3984 次阅读|0 个评论

Deep Learning 资源: hardman 2015-5-1 00:36; 下面为Deep Learning 的相关资源，希望能给对此有兴趣的提供帮助。一、Deep Learning 方面几个大牛的主页 1、提出者Geoffrey E. Hinton的主页： http://www.cs.toronto.edu/~hinton/ 2、美籍华裔大牛Andrew Ng 的主页： http://robotics.stanford.edu/~ang/ 二、部分资源 1、 Rachel-Zhang 的博客(有许多机器学习的资源)： http://blog.csdn.net/abcjennifer/article/details/27170627 2、Neural Networks and Deep Learning： http://neuralnetworksanddeeplearning.com/ 3、A Deep Learning Tutorial: From Perceptrons to Deep Networks： http://www.toptal.com/machine-learning/an-introduction-to-deep-learning-from-perceptrons-to-deep-networks 4、Deep Learning： https://developer.nvidia.com/deep-learning 5、Deep Learning 教程翻译： http://blog.sina.com.cn/s/blog_46d0a3930101h6nf.html 6、Toronto Deep Learning Demo： http://deeplearning.cs.toronto.edu/ 7、Deep Learning学习笔记： http://blog.csdn.net/zouxy09/article/category/1387932 8、百度深度学习实验室： http://idl.baidu.com/index.html 9、CMU的Deep Learning课程： http://deeplearning.cs.cmu.edu/ 10、Deep Learning and Shallow Learning： http://freemind.pluskid.org/machine-learning/deep-learning-and-shallow-learning/ 11、阐述无监督特征学习和深度学习的主要观点的教程： http://ufldl.stanford.edu/wiki/index.php/UFLDL 12、Deep Learning学习体会： http://www.cnblogs.com/JackOne/archive/2013/02/19/DeepLearning-FirstBoold.html 13、Deep Learning Methods for Vision： http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/ 14、Deep Learning的一个比较全面的网站： http://deeplearning.net/ 15、Deep Learning的视频教程： A tutorial on deep learning - Video; 个人分类: DeepLearning|3392 次阅读|0 个评论

Python for Deep Learning: 热度 1 qixianbiao 2015-4-12 19:35; My road to learn python for deep learning. Preface: before I started to learn python for deep learning, the author is extremely familiar with the theory of the deep learning. Meanwhile, the author is also familiar with two deep learning toolboxs (caffe and matconvnet). Since I am very interested in python language, thus I decided to learn python for deep learning. For deep learning theory, I recommend the following materials: Michael Nielsen: Neural Network and Deep Learning, a online book(now updated to Chapter five). Remarks: this book is awesome, I spent two days finishing this book. Interestingly, I got the feeling when I read the Pattern Recognition and Machine Learing(PRML). Geoffrey E. Hinton: Neural Network for Deep Learning, the course in coursera. https://www.coursera.org/course/neuralnets you can find many materials from the internet such as the deep learning course from stanford. I do not intend to mention them all. What I have done: To learn to use deep learning toolbox written in Python, such as Theano or Torch, you need to learn Python language well. If you have good knowledge to C++ and Matlab, you will find it not so hard to learn Python. Python Skills, I recommend : Google Python Class: https://developers.google.com/edu/python/, if you can access youtube, it will be good. you can watch the video. https://www.youtube.com/watch?v=tKTZoB2Vjuklist=PLC8825D0450647509 Coursera course: https://www.coursera.org/course/pythonlearn Numpy, Scipy, and matlabplot: https://www.youtube.com/watch?v=oYTs9HwFGbY Python Imaging Library: A tutorial post by RootOfTheNull, https://www.youtube.com/watch?v=dkrXgzuZk3klist=PL1H1sBF1VAKXCayO4KZqSmym2Z_sn6Pha A very good tutorial for Theano by Alec Radford : Introduction to Deep Learning with Python https://www.youtube.com/watch?v=S75EdAcXHKkindex=1list=PL9Nq-Q1jocNvwnoyUIQFA1SAR9dwakhxa Until now, for python, I have read above-mentioned content. My future plan is to follow two projects, 1: Kaggle competions on Detecting the Local of Keypoints on Face Images. The Link: https://www.kaggle.com/c/facial-keypoints-detection A very blog to Using CNN to detect facial keypoints tutorial: http://danielnouri.org/notes/2014/12/17/using-convolutional-neural-nets-to-detect-facial-keypoints-tutorial/ 2: Kaggle competions on Predict Ocean Health, one Plankton at a time The Link: http://www.kaggle.com/c/datasciencebowl The rank 1st method: http://benanne.github.io/2015/03/17/plankton.html I have summitted one result to the facial keypoint detection and randed 3rd. I will continue updating the content in future when I learn more. I do not check the language, pay attention to the contest and ignore the typos.; 个人分类: 科研|6204 次阅读|1 个评论

[转载]深度学习、机器学习与模式识别三者之间的区别: machinelearn 2015-3-30 16:10; Lets take a close look at three relatedterms (Deep Learning vs Machine Learning vs Pattern Recognition), and see howthey relate to some of the hottest tech-themes in 2015 (namely Robotics andArtificial Intelligence). In our short journey through jargon, you shouldacquire a better understanding of how computer vision fits in, as well as gainan intuitive feel for how the machine learning zeitgeist has slowly evolvedover time. Fig 1. Putting a human inside a computeris not Artificial Intelligence (Photo from WorkFusionBlog ) If you look around, you'll see no shortage of jobs at high-tech startupslooking for machine learning experts. While only a fraction of them are lookingfor Deep Learning experts, I bet most of these startups can benefit from eventhe most elementary kind of data scientist. So how do you spot a futuredata-scientist? You learn how they think. The three highly-relatedlearning buzz words “Pattern recognition,” “machinelearning,” and “deep learning” represent three different schools ofthought. Pattern recognition is the oldest (and as a term is quiteoutdated). Machine Learning is the most fundamental (one of the hottest areasfor startups and research labs as of today, early 2015). And DeepLearning is the new, the big, the bleeding-edge -- we’re not even close tothinking about the post-deep-learning era . Just take a look at thefollowing Google Trends graph. You'll see that a) Machine Learning isrising like a true champion, b) Pattern Recognition started as synonymous withMachine Learning, c) Pattern Recognition is dying, and d) Deep Learning is newand rising fast. 1. Pattern Recognition: The birth ofsmart programs Pattern recognition was a term popularin the 70s and 80s. Theemphasis was on getting a computer program to do something “smart” likerecognize the character 3. And it really took a lot ofcleverness and intuition to build such a program. Just think of 3vs B and 3 vs 8. Back in the day, itdidn’t really matter how you did it as long as there was no human-in-a-boxpretending to be a machine. (See Figure 1) So if your algorithm wouldapply some filters to an image, localize some edges, and apply morphologicaloperators, it was definitely of interest to the pattern recognitioncommunity. Optical Character Recognition grew out of this community andit is fair to call “Pattern Recognition” as the “Smart Signal Processingof the 70s, 80s, and early 90s. Decision trees, heuristics, quadraticdiscriminant analysis, etc all came out of this era. Pattern Recognition becomesomething CS folks did, and not EE folks. One of the most popular booksfrom that time period is the infamous Duda Hart PatternClassification book and is still a great starting point for youngresearchers. But don't get too caught up in the vocabulary, it's a bitdated. The character 3 partitionedinto 16 sub-matrices. Custom rules, custom decisions, and customsmart programs used to be all the rage. SeeOCR Page . Quiz : The most popular Computer Vision conference is called CVPRand the PR stands for Pattern Recognition. Can you guess the year of thefirst CVPR conference? 2. Machine Learning: Smart programs canlearn from examples Sometime in the early 90s people startedrealizing that a more powerful way to build pattern recognition algorithms isto replace an expert (who probably knows way too much about pixels) with data(which can be mined from cheap laborers). So you collect a bunch of faceimages and non-face images, choose an algorithm, and wait for the computationsto finish. This is the spirit of machine learning. Machine Learningemphasizes that the computer program (or machine) must do some work after it isgiven data. The Learning step is made explicit. And believeme, waiting 1 day for your computations to finish scales better than invitingyour academic colleagues to your home institution to design some classificationrules by hand. What is Machine Learningfrom DrNatalia Konstantinova's Blog . The most important part of thisdiagram are the Gears which suggests thatcrunching/working/computing is an important step in the ML pipeline. As Machine Learning grew into a majorresearch topic in the mid 2000s, computer scientists began applying these ideasto a wide array of problems. No longer was it only character recognition,cat vs. dog recognition, and other “recognize a pattern inside an array ofpixels” problems. Researchers started applying Machine Learning to Robotics(reinforcement learning, manipulation, motion planning, grasping), to genomedata, as well as to predict financial markets. Machine Learning wasmarried with Graph Theory under the brand “Graphical Models,” every roboticsexpert had no choice but to become a Machine Learning Expert, and MachineLearning quickly became one of the most desired and versatile computing skills . However Machine Learning says nothing about the underlyingalgorithm. We've seen convex optimization, Kernel-based methods, SupportVector Machines, as well as Boosting have their winning days. Togetherwith some custom manually engineered features, we had lots of recipes, lots ofdifferent schools of thought, and it wasn't entirely clear how a newcomershould select features and algorithms. But that was all about tochange... Further reading: To learn more about the kinds of features that were used inComputer Vision research see my blog post: Fromfeature descriptors to deep learning: 20 years of computer vision . 3. Deep Learning: one architecture torule them all Fast forward to today and what we’reseeing is a large interest in something called Deep Learning. The most popularkinds of Deep Learning models, as they are using in large scale imagerecognition tasks, are known as Convolutional Neural Nets, or simplyConvNets. ConvNet diagram from TorchTutorial Deep Learning emphasizes the kind ofmodel you might want to use (e.g., a deep convolutional multi-layer neuralnetwork) and that you can use data fill in the missing parameters. Butwith deep-learning comes great responsibility. Because you are starting witha model of the world which has a high dimensionality, you really need a lot ofdata (big data) and a lot of crunching power (GPUs). Convolutions are usedextensively in deep learning (especially computer vision applications), and thearchitectures are far from shallow. If you're starting out with Deep Learning, simply brush up on some elementaryLinear Algebra and start coding. I highly recommend AndrejKarpathy's Hacker's guideto Neural Networks . Implementing your own CPU-based backpropagationalgorithm on a non-convolution based problem is a good place to start. There are still lots of unknowns. The theory of why deep learning works isincomplete, and no single guide or book is better than true machine learningexperience. There are lots of reasons why Deep Learning is gainingpopularity, but Deep Learning is not going to take over the world. Aslong as you continue brushing up on your machine learning skills, your job issafe. But don't be afraid to chop these networks in half, slice 'n dice atwill, and build software architectures that work in tandem with your learningalgorithm. The Linux Kernel of tomorrow might run on Caffe (one of the most popular deeplearning frameworks), but great products will always need great vision, domainexpertise, market development, and most importantly: human creativity. Other related buzz-words Big-data is the philosophy of measuring allsorts of things, saving that data, and looking through it forinformation. For business, this big-data approach can give you actionableinsights. In the context of learning algorithms, we’ve only startedseeing the marriage of big-data and machine learning within the past fewyears. Cloud-computing , GPUs , DevOps ,and PaaS providers have made large scale computing within reach ofthe researcher and ambitious everyday developer. Artificial Intelligence is perhaps the oldest term, the most vague,and the one that was gone through the most ups and downs in the past 50 years.When somebody says they work on Artificial Intelligence, you are either goingto want to laugh at them or take out a piece of paper and write down everythingthey say. Further reading: My 2011 Blog post ComputerVision is Artificial Intelligence . Conclusion Machine Learning is here to stay. Don't think about it as PatternRecognition vs Machine Learning vs Deep Learning, just realize that each termemphasizes something a little bit different. But the search continues. Go ahead and explore. Break something. We will continue building smartersoftware and our algorithms will continue to learn, but we've only begun toexplore the kinds of architectures that can truly rule-them-all. If you're interested in real-time visionapplications of deep learning, namely those suitable for robotic and homeautomation applications, then you should check out what we've been buildingat vision.ai . Hopefully in a few days, I'll be ableto say a little bit more. :-); 个人分类: 科研笔记|4633 次阅读|0 个评论

[转载]神经网络的10个常见误解: machinelearn 2015-3-30 15:46; 10 Common Misconceptions about Neural Networks As a computer scientist, I often get asked about neural networks because people would like to use them but often don't know how to go about it. Alternatively, they may have tried to use them but were disappointed in the results. Neural Networks don't have to be hard to use, and when used correctly they can produce superior results to other classes of predictive models such as regression analysis and decision tree induction . In quantitative finance, neural networks are most often used for time-series forecasting, proprietary trading signal generation, fully automated trading (decision making), financial modelling, derivatives pricing, credit risk assessments, pattern matching, and classification of securities. This article will discuss some of the theory behind neural networks. Neural networks are not models of the human brain Neural networks are not just a weak form of statistics Neural networks come in many different architectures Size matters, but bigger isn't always better Many training algorithms exist for neural networks Neural networks do not always require a lot of data Neural networks cannot be trained on any data Neural networks may need to be retrained Neural networks are not black boxes Neural networks are not hard to implement 1. Neural networks are not models of the human brain The human brain is a mystery and many scientists don't agree on how it works. Two popular theories of the brain are the grandmother cell theory and the distributed representation theory. In the first theory individual neurons are capable of representing complex concepts such as your grandmother or Jennifer Aniston . In the second theory neurons are believed to be much more simple. Artificial neural networks are inspired by the second theory and consist of many simple statistical functions connected together to form a network. Personally I support the belief that biological neurons are a lot more complex than artificial ones, and that their information capacity is larger as well. A single neuron in the brain is an incredibly complex machine that even today we don’t understand. A single “neuron” in a neural network is an incredibly simple mathematical function that captures a minuscule fraction of the complexity of a biological neuron. So to say neural networks mimic the brain, that is true at the level of loose inspiration, but really artificial neural networks are nothing like what the biological brain does. - Andrew Ng Another big difference between the brain and neural networks is size and organization. Human brains contain many more neurons and synapses than neural network and they are self-organizing and adaptive. Neural networks, by comparison, are organized according to an architecture. In other words, neural networks are not self-organizing in the same sense as the brain. The only exception to this is adaptive neural networks which are discussed later on in this article. So what does that mean? Think of it this way: a neural network is inspired by the brain in the same way that the Olympic stadium in Beijing is inspired by a bird's nest. That does not mean that the Olympic stadium is-a bird's nest, it means that some elements of birds nests are present in the design of the stadium. In other words, elements of the brain are present in the design of neural networks but they are a lot less similar than you might think. In my opinion, neural networks are actually more closely related to statistical methods like curve fitting and regression analysis than the human brain. In the context of quantitative finance I think it is important to remember that, because whilst it may sound cool to say that something is 'inspired by the brain', this statement may result unrealistic expectations or fear. For more info see my LinkedIn article 'No! Artificial Intelligence is not an existential threat'. Some very interesting views of the brain as created by state of the art brain imagine techniques. Click on the image for more information. 2. Neural networks aren't a weak form of statistics Neural networks have more in common with statistical methods like curve fitting and regression analysis than with the human brain. This is because they approximate a complex non-linear function between inputs and outputs. This fact has led s ome academics and industry professionals to view neural networks as a weak form of statistics. This misconception exists because soft computing techniques, such as neural networks, are defined as algorithms capable of finding good inexact solutions to intractable problems. However, t his does not mean that the resulting function is less accurate, it simply means that it is impossible to know what the real-world function is or whether one even exists. The mechanisms underlying a neural networks are in fact statistical and it is possible to reason about them using statistics. Neural networks are created by combining artificial neurons, called perceptrons . A perceptron contains a function, called an activation function, which maps inputs to outputs. In statistical terms, we can think of a perceptron as a multiple linear regression. The following activation functions are commonly used in neural networks: linear function, step function, ramp function, Sigmoid function, hyperbolic tangent function, and a Gaussian function. The type of activation function is very important, as it asserts requirements on the data being fed into the neural network (see misconception #7). This diagram shows six of the most popular activation functions which can be used in perceptrons making the neural network. Each perceptron in the network is adjusted such that the mean the classification error of the network on a set of known data points is minimized. How each perceptron is adjusted is determined by an optimization algorithm e.g. gradient descent. This process is called training. A perceptron is only able to linearly separate the training data points. This diagram illustrates how a single layer perceptron acts as a linear classifier of a data set. By combining multiple perceptrons together in a network structure, we are in effect creating a complex function which allows a neural network to achieve non-linear separability of the training data points. A network consisting of a single 'layer' of perceptrons is equivalent to a multiple linear regression , a statistical model for determining the relationship between two or more explanatory variables and a response variable by fitting a linear equation to observed data. This diagram illustrates how multiple perceptrons can be connected to create multiple linear regressions on a data set. The neural network is an evolution on this because it consists of multiple layers. In fact, a feed forward neural network is also called a multi-layer perceptron in some circles. In this model, the outputs from one layer of perceptrons form the inputs to the following layer of perceptrons. Layering enables neural networks to learn complex relationships. In trading, a regression between a set of technical indicators (input layer) and the future price of a security (output layer), might not exist. However, a neural network might find a complex relationship between regressions done on the inputs (hidden layer) and the future prices of the security. This diagram shows the general architecture of a multi-layer perceptron. This is the standard architecture for most neural network implementations. To summarize, neural networks are based on a strong statistical foundation and stating that a neural network is just a weak form of statistics is incorrect. I will not pretend that I have covered the full theoretical foundation of neural networks, therefore, I highly recommend this brilliant paper to inspired readers interested in the statistics of neural networks. 3. Neural networks come in many architectures A neural network's performance is directly linked to it's architecture yet most practitioners have only ever used the feedforward neural network. This architecture consists of three layers, an input layer, a hidden layer, and an output layer. Whilst this is a generally good architecture, many others exist which may be better suited to the problem. There are many many neural network architectures out there, so an exhaustive list is outside of the scope of this article. Here are some popular ones, Partially recurrent networks - some connections flow backwards, in other words, feed back loops exist in the network. These networks are believed to perform better on time series data. Therefore they are especially relevant for trading strategies. This diagram shows three popular recurrent Neural Network Architectures namely the Elman neural network, the Jordan neural network, and the Hopfield single-layer neural network. Boltzmann neural network - these were the first neural networks capable of learning internal representations, and solving very difficult combinatoric problems. When constrained they can prove more efficient than traditional neural networks. This diagram shows how different Boltzmann Machines with connections between the different nodes can significantly affect the results of the neural network (graphs to the right of the networks) Deep neural networks - neural networks with multiple hidden layers. Deep neural networks are currently at the forefront of research. Essentially deep neural networks consist of many hidden layers which are trained independently usually using Stochastic Gradient Descent. A great site for deep learning resources is DeepLearning.net . This diagram shows a deep neural network which consists of multiple hidden layers. Adaptive neural networks - neural networks which simultaneously adapt and optimize their architectures whilst learning by either growing the architecture or shrinking it. Adaptive neural networks have been shown to perform well for forecasting time series events. This diagram shows two different types of adaptive neural network architectures. The left image is a cascade neural network and the right image is a self-organizing map. I believe that these represent the future of neural networks, because the architecture of a network determines what it can approximate. If the architecture is sub-optimal, then the network will never perform optimally regardless of how many perceptrons or connections it has. Another benefit of optimal architectures is improved information capacity . Neural networks with higher information capacity require less perceptrons to fit a complex function. Given that the larger a network becomes, the harder it is to train, this benefit can be very useful. Radial basis networks - although not a different type of architecture in the sense of perceptrons and connections, radial basis functions make use of radial basis functions as their activation functions, these are real valued functions whose output depends on the distance from a particular point. The most commonly used radial basis functions is the Gaussian distribution. This diagram shows how curve fitting can be done using radial basis functions Because radial basis functions can take on much more complex forms, they were originally used for performing function interpolation. As such, a radial basis function neural network can have a much higher information capacity. Radial basis functions are also used in the kernel of a Support Vector Machine . For more information take a look at this presentation . In summary various neural network architectures exist. The performance of one neural network is often vastly superior to another, therefore for quantitative analysts interested in using neural networks, an important first step is to decide which architecture(s) you want to test. 4. Size matters, but bigger isn't always better Having selected an architecture (excluding an adaptive architecture), one must then decide how large or small the neural network should be. To illustrate the point, assume I have selected a feed-forward neural network with three layers. How many inputs should I have? How many hidden perceptrons should I have? And how many outputs are required? There are many old programmer's tales which state that you must have between 10 and 20 perceptrons for high dimensional problems. The truth is that every problem is unique, and that the best technique for finding the optimal size of any architecture is to empirically test various sizes. It is bad practice to stick with one initial guess and hope it works. That having been said, you can use a modified version of Occam's razor . Which is a scientific heuristic for finding good hypotheses. In this case we use it to reason about neural network architectures. Simpler architectures are defined as ones with fewer perceptrons and fewer connections between them. For an interesting discussion on simplicity, read this . When you have two competing neural networks which make the same predictions, the one with the simpler architecture will generalize better. - Occam's razor for neural networks 5. Many training algorithms exist for neural networks The learning algorithm of a neural network tries to optimize the neural network's weights until some stopping condition has been met. This condition is normally when the neural network can predict the outcome of the training data set to an acceptable level of accuracy but could also be when the computational budget has been exhausted. The most common learning algorithm is backpropagation which uses stochastic gradient descent . This algorithm consists of two phases: The feedforward pass - the training data set is passed through the network and the output from the neural network is recorded, and Backward propagation - the error signal is passed back through the network and the weights of the neural network are optimized using gradient descent. The are some problems with this approach. Adjusting all the weights at once can result in a significant movement of the neural network in weight space, the gradient descent algorithm is slow, and is susceptible to local minima . Assuming the neural network objective function contains local minima (this has been debated in recent years), it may make sense to use an optimization algorithm which is less sensitive to local minima. Such algorithms are called global optimization algorithms. Two popular global optimization algorithms are the Particle Swarm Optimization (PSO) and the Genetic Algorithm (GA). Here is how they can be used to train neural networks: Neural network vector representation - by encoding the neural network as a vector of weights, each representing the weight of a connection in the neural network, we can train neural networks using most meta-heuristic search algorithms. This technique does not work well with deep neural networks because the vectors become too large. This diagram illustrates how a neural network can be represented in a vector notation and related to the concept of a search space or fitness landscape. Particle Swarm Optimization - to train a neural network using a PSO we construct a population / swarm of those neural networks. Each neural network is represented as a vector of weights and is adjusted according to it's position from the global best particle and it's personal best. The fitness function is calculated as the sum-squared error of the reconstructed neural network after completing one feedforward pass of the training data set. The main consideration with this approach is the velocity of the weight updates. This is because if the weights are adjusted too quickly, the sum-squared error of the neural networks will stagnate and no learning will occur. This diagram shows how particles are attracted to one another in a single swarm Particle Swarm Optimization algorithm. Genetic Algorithm - to train a neural network using a genetic algorithm we first construct a population of vector represented neural networks. Then we apply the three genetic operators on that population to evolve better and better neural networks. These three operators are, Selection - Using the sum-squared error of each network calculated after one feedforward pass, we rank the population of neural networks. The top x% of the population are selected to 'survive' to the next generation and be used for crossover. Crossover - The top x% of the population's genes are allowed to cross over with one another. This process forms 'offspring'. In context, each offspring will represent a new neural network with weights from both of the 'parent' neural networks. Mutation - this operator is required to maintain genetic diversity in the population. A small percentage of the population are selected to undergo mutation. Some of the weights in these neural networks will be adjusted randomly within a particular range. This algorithm shows the selection, crossover, and mutation genetic operators being applied to a population of neural networks represented as vectors. In addition to these population-based metaheuristic search algorithms, other algorithms have been used to train of neural networks including backpropagation with added momentum, differential evolution , Levenberg Marquardt , simulated annealing , and many more . 6. Neural networks do not always require a lot of data Neural networks can use one of three learning strategies namely a supervised learning strategy, an unsupervised learning strategy, or a reinforcement learning strategy. The principal difference between these three strategies is the amount of labelled data that they require. Labelled data is training data for which the correct output is known upfront. Supervised learning - these strategies require at least two datasets, a training set which consists of inputs with the expected output, and a generalization set which consists of inputs without the expected output. Over-fitting is the term given to a neural network which has 'learnt' the noise in the test set too well and cannot generalize well on unseen data. Unsupervised learning - these strategies discover hidden structures (such as Markov chains) in unlabeled data. They are based on well known statistical techniques such as density estimation , principal component analysis , Hebbian learning, and clustering algorithms. Because unsupervised learning does not need any labelled data, the neural network can be applied to under-formulated problems where the correct output is not known. An example is the Google neural network which used unsupervised learning to discover cats without having any prior knowledge of them i.e. no sets of cat images were used to train the network. Two common unsupervised neural networks are the self organizing map (SOM), and adaptive resonance theory (ART). A SOM is a multi-dimensional scaling method for projecting high dimensional search spaces onto a two dimensional grid. A SOM produced a 'heat map' which can be analysed to identify similar patterns and their underlying characteristics. A self-organizing map showing U.S. Congress voting patterns visualized in Synapse. The first two boxes show clustering and distances while the remaining ones show the component planes. Red means a yes vote while blue means a no vote in the component planes (except the party component where red is Republican and blue is Democratic). This thesis describes an unsupervised learning strategy using the particle swarm optimization algorithm and a neural network to discover favourable technical market indicators and trading strategies all starting with zero expert knowledge of the securities or technical indicators. Another interesting application of SOM's is in colouring time segments of stock charts to represent which market patterns they represent. This website provides a detailed tutorial and code snippets for implementing the idea for improved Forex trading strategies. Reinforcement learning - these strategies are based on the simple premise of rewarding neural networks for good behaviours and punishing them for bad behaviours. This strategy lends itself to trading because good decisions and bad decisions are easy to quantifiable in terms of existing metrics and profit or loss. It also requires no training data. Reinforcement learning strategies consist of three components. A policy which specifies how the neural network will make decisions e.g. using technical and fundamental indicators. A reward function which distinguishes good from bad e.g. making vs. losing money. And a value function which specifies the long term goal e.g. a high Sortino ratio. This diagram shows how a neural network can be either negatively or positively reinforced. 7. Neural networks cannot be trained on any data In my opinion this is the worst misconception about neural networks. Many people who try to use neural networks do not properly pre-process the data being fed into the neural network. The result of this is that the neural network will under perform. Data normalization, removal of redundant information, and outlier removal should be done to improve performance. Data normalization - neural networks consist of various perceptrons linked together through weighted connections. Each perceptron contains an activation function which each have an ' active range ' excepting radial basis functions. Inputs into the neural network should be scaled within this range so that the activation function's outputs are different for each input. Consider a neural network trading system which receives indicators about a set of securities as inputs and outputs whether each security should be bought or sold. One of the inputs is the price of the security and we are using the Sigmoid activation function. However, most of the securities cost between 5$ and 15$ per share and the output of the Sigmoid function approaches 1.0. So the output of the Sigmoid function will be be 1.0 for all securities, all of the perceptrons will 'fire' and the neural network will not learn. Neural networks trained on unprocessed data produce models where 'the lights are on but nobody's home' The active range of the Sigmoid function is -sqrt(3) to sqrt(3), so the prices of each security should be scaled to fit within that range. The other consideration is how to handle securities like Berkshire Hathaway which cost $190,000. Would feeding this securities' price into our trading system impact it's ability to correctly classify $5 - $15 securities? Outlier removal - an outlier is value that is much smaller or larger than most of the other values in some set of data. Outliers can cause problems with statistical techniques like regression analysis and curve fitting because when the model tries to 'accommodate' the outlier, performance of the model across all other data deteriorates. Consider the illustration below, This diagram shows the effect of removing an outlier from the training data for a linear regression. The results are comparable for neural networks. Image source: https://statistics.laerd.com/statistical-guides/img/pearson-6.png The illustration shows that trying to accommodate an outlier into the linear regression model results in a poor fits of the data set. The effect of outliers on non-linear regression models, including neural networks, is similar. Therefore it is good practice is to remove outliers from the training data set. This is a challenge itself, this tutorial and paper discuss existing techniques. Remove redundancy - larger neural networks are less likely to generalize well, which is one reason why simpler networks are desirable. Another reason is that they are faster, more efficient, and easier to 'decipher'. Removing redundant inputs can simplify your neural network. Different inputs can share mutual information about the problem. Mutual information is the degree of dependence between two inputs. If this is high, then the two variables will be strongly correlated with one another. This means that the amount of unique information presented by either input is small, and the less significant input can be removed. Adaptive neural networks will automatically prune redundant connections, perceptrons, and inputs. For fixed architecture, this process requires some data-pre-processing. Measuring the correlation between each pair of inputs and performing a sensitivity analysis between the inputs and the expected outputs can both help identify redundant inputs. 8. Neural networks may need to be retrained Neural networks tend to stop working over time. That having been said, you would be wrong to assume that this is a poor reflection of neural networks. It is actually an accurate reflection of the world we live in. The world is constantly changing. This is especially true for financial markets because the underlying mechanisms, market participants, are not predictable. Investors emotions' drive markets to bubble and then burst. What I find interesting is that chaos theory originated from the study of weather, whose underlying mechanisms are well understood. What does that says about financial markets? Crowding also contributes the dynamic nature of financial markets, for more information read The Crisis of Crowding . Dynamic environments, such as financial markets, are extremely difficult for neural networks to model. Two approaches are either to keep retraining the neural network over-time, or to use a dynamic neural network. Dynamic neural networks 'track' changes to the environment over time and adjust their architecture and weights accordingly. They are adaptive over time. For dynamic problems, multi-solution meta-heuristic optimization algorithms can be used to track changes to local optima over time. One such algorithm is the multi-swarm optimization algorithm, a derivative of the particle swarm optimization. Genetic algorithms with enhanced diversity or memory have also been shown to be robust in dynamic environments. The illustration below demonstrates how a genetic algorithm evolves over time to find new optima in a dynamic environment. This illustration also happens to mimic trade crowding which is when market participants crowd a profitable trading strategy, thereby exhausting trading opportunities causing the trade to become less profitable over time. This animated image shows a dynamic fitness landscape (search space) change over time. Image source: http://en.wikipedia.org/wiki/Fitness_landscape 9. Neural networks are not black boxes By itself a neural network is a black-box . This presents problems for people wanting to use them. For example, fund managers wouldn't know how a neural network makes trading decisions, so it is impossible to assess the risks of the trading strategies learned by the neural network. Similarly, banks using neural networks for credit risk modelling would not be able to justify why a customer has a particular credit rating, which is a regulatory requirement. That having been said, state of the art rule-extraction algorithms have been developed to vitrify some neural network architectures. These algorithms extract knowledge from the neural networks as either mathematical expressions, symbolic logic, fuzzy logic, or decision trees. This image shows a neural network as a black box and how it related to rule extraction techniques. Mathematical rules - algorithms have been developed which can extract multiple linear regression lines from neural networks. The problem with these techniques is that the rules are often still difficult to understand, therefore these do not solve the 'black-box' problem. Propositional logic - propositional logic is a branch of mathematical logic which deals with operations done on discrete valued variables. These variables, such as A or B, are often either TRUE or FALSE, but they could occupy values within a discrete range e.g. {BUY,HOLD,SELL}. Logical operations can then be applied to those variables such as OR, AND, and XOR. The results are called predicates which can also be quantified over sets using the exists or for-all quantifiers. This is the difference between predicate and propositional logic. If we had a simple neural network which Price (P), Simple Moving Average (SMA), and Exponential Moving Average (EMA) as inputs and we extracted a trend following strategy from the neural network in propositional logic, we might get rules like this, Fuzzy logic - fuzzy logic is where probability and propositional logic meet. The problem with propositional logic is that is deals in absolutes e.g. BUY or SELL, TRUE or FALSE, 0 or 1. Therefore for traders there is no way to determine the confidence of these results. Fuzzy logic overcomes this limitation by introducing a membership function which specifies how much a variable belongs to a particular domain. For example, a company (GOOG) might belong 0.7 to the domain {BUY} and 0.3 to the domain {SELL}. Combinations of neural networks and fuzzy logic are called Neuro-Fuzzy systems . This research survey discusses the various fuzzy rule extraction techniques which exist for neural networks. Decision trees - decision trees are data structures which show decision making under various conditions or given certain information. This article I wrote describes how to evolve security analysis decision trees using genetic programming . Decision tree induction is the term given to the process of extracting decision trees from neural networks. An example of a simple trading strategy represented using a decision tree. The triangular boxes represent decision nodes, these could be to BUY, HOLD, or SELL a company. Each box represents a tuple of indicator, inequality,= value=. An example might be sma,, 25 or ema, =, 30=. 10. Neural networks are not hard to implement Speaking from personal experience, neural networks are quite difficult to code from scratch. Luckily for us, there are many existing open source and proprietary packages which contain implementations of different types of neural networks. However, for advanced topics, such as rule extraction, custom development is unavoidable. Encog - is an easy to use library containing implementations of many machine learning algorithms and neural networks. Encog is particularly nice because it offers an API which allows users to define new algorithms for training and creating adaptive neural networks. PyBrain - is a modular Machine Learning Library for Python which contains implementations of various neural networks. Python is great for financial modelling because it can be combined with statistical packages such as Pandas , and SciPy . SAS Enterprise Miner - SAS is a proprietary statistical programming language used across the financial services industry. The SAS Enterprise Miner module contains an implementation of various neural networks and decision tree classification structures. Scikit Learn - is another open source machine learning library for the Python programming language. Again, Python is great for financial modelling because it can be combined with statistical packages such as Pandas , and SciPy . For readers interested in using neural networks, I recommend using an existing package. A general rule of thumb is that 'off the shelf' packages, whether open source or proprietary, contain less bugs and produce more reliable results than custom developed applications. That said, there is no better way to learn neural networks than to code one. For more useful tools and applications check out my Tools for Computational Finance page. Conclusion Neural networks are a class of powerful machine learning algorithms. They are based on solid statistical foundations and have been applied successfully in financial models as well as in trading strategies for many years. Despite this, they have a bad reputation in industry caused by the many unsuccessful attempts to use them in practice. In most cases, unsuccessful neural network implementations can be traced back to inappropriate neural network design decisions and general misconceptions about how they work. This article aims to articulate some of these misconceptions in the hopes that they might help individuals implementing neural networks meet with success. For readers interested in getting more information, I have found the following books to be quite instructional when it comes to neural networks and their role in finance and trading. Lastly, if you managed to read all 4,500 words of this article congratulations. Please contact me or comment below if you have any specific questions, suggestions, or corrections. Thank you. 博文来源：http://www.stuartreid.co.za/misconceptions-about-neural-networks/; 个人分类: 科研笔记|4105 次阅读|0 个评论

Deep Learning For image restoration: machinelearn 2015-3-30 15:10; 关于采用deep learning进行图像复原的论文： Image denoising Deep convolutional neural network for image deconvolution Image denoising: Can plain Neural Networks compete with BM3D? Image denoising and inpainting with deep neural networks Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion Image denoising with multi-layer perceptrons Adaptive multi-column deep neural networks with application to robust image denoising Robust image denoising with multi-column deep neural networks Image Denoising with Rectified Linear Units super-resolutin Learning a deep convolutional network for image super-resolution Image Super-Resolution Using Deep Convolutional Networks Image Super-Resolution with Fast Approximate Convolutional Sparse Coding Deep Network Cascade for Image Super-resolution deblurring Image Deblurring Using Back Propagation Neural Network; 个人分类: 科研笔记|11596 次阅读|0 个评论

《概率图模型：原理与技术》译者序: 热度 23 王飞跃 2015-3-28 14:21; 《概率图模型：原理与技术》译者序王飞跃由清华大学出版社出版的《概率图模型：原理与技术》将于近期发布，敬请关注京东、当当、亚马逊的图书信息。 -------------------------------------------------------------------------------------------- 译者序美国斯坦福大学教授 Daphne Koller 和以色列希伯来大学教授 Nir Friedman 的专著《概率图模型：原理与技术（ P robabilistic Graphical Models: Principles and Techniques ）》是机器学习和人工智能领域一部里程碑式的著作。本书内容十分丰富，作者以前所未有的广度和深度，对概率图模型这一领域的基础知识和最新进展进行了全面描述和总结。显然，这本书是对 Judea Pearl 教授关于不确定情况下贝叶斯网络推理与决策方面开创性工作之权威和重要的拓广，是本领域专家学者和研究生的最佳参考书和教科书之一。近几年来，人工智能和机器学习的研究与应用已成为全球性的科技热点，此书的出版恰当其时、正当其用，并且已产生了十分积极的影响。将一本近 1200 页的英文原著翻译成中文，的确是一次刻骨铭心的难忘经历，有时甚至感觉自已比原著的第二作者 Friedman 更有资格被称为 “ 烤焦的人 ” （英文就是 Fried man ）！此书内容广博，而且作者特别善于见微知著，长篇细论，即便是单纯地读完此书亦需要坚韧的毅力，何况将其译成中文！用 Koller 自己在邮件中的话说： “ 这是一次英雄般的努力！ ” 对于英文原著国内外已有许多评论，我已无需锦上添花。在此，只希望简要回顾一下自己为何如此钟情于此书以及本书五年的翻译历程，权当为序。自提出社会计算这一新的研究领域之后，我一直试图寻求能够支撑社会计算的理论框架与解析方法。 2004 年正式提出基于人工社会、计算实验和平行执行的复杂系统计算方法 ACP 框架，同时也将其作为社会计算的一般框架，因为社会计算所涉及的社会问题，显然是典型的复杂系统问题。但社会计算用于量化分析的一般解析方法是什么？在哪里？这在当时是一个首要的研究问题。自 2007 年起，作为副主编，我不断地推动《 IEEE 智能系统（ IEEE Intelligent Systems ）》杂志将社会计算列为主题，并先后组织了社会计算、社会学习、社会媒体、社会安全等相关专题专刊的征稿与出版，试图通过这些工作寻找完整的社会计算框架、方法及应用的学术 “ 生态 ” 体系。这些工作，得到了时任主编 James Hendler 教授的热情支持，他自己与 Tim Berners-Lee 还相应地提出了社会机器（ Social Machines ）的研究方向。我于 2008 年底接任主编，次年的年度编委会之后， Jim 打电话问我对 Koller 和 Friedman 的《概率图模型》的看法如何，可否在 IEEE Intelligent Systems 上组织一篇书评。但当时我并没有看过此书，因此建议由他来找人写。 2010 年春，我们实验室的王珏研究员再次让我注意此书，认为这本书会将对机器学习的研究产生重要影响，建议我组织大家学习讨论，并组织人员翻译。在此情况下，我才买了英文原版书，一看此书一千多页，比一块厚砖头还厚，当即就对能否找到时间翻译产生怀疑，表示不可能有时间从事这项工作。然而，粗读此书后，特别是从网络搜索了相关基于概率图模型的研究论文之后，我下定决心组织人员翻译此书，并立即让正在从事相关研究的博士生周建英着手写相关综述（见周建英、王飞跃、曾大军， “ 分层 Dirichlet 过程及其应用综述 ” ，自动化学报 , 第 37 卷 4 期 , 389-407, 2011 ）。促使我态度转变的原因主要有三点：首先，本书是按照 “ 知识表示 — 学习推理 — 自适应决策 ” 这一大框架组织的，与我自己的 ACP （即：人工社会 / 组织 / 系统 + 计算实验 + 平行执行）方法之主旨完全一致。显然，基于概率图的表示是人工组织建模方法的特殊形式，机器学习和相关推理是计算实验的智能化方式，而自适应决策则更是平行执行的闭环反馈化的具体实现。实际上，我一直希望把计算实验与机器学习打通，从而使统计实验方法升华并与最新的机器学习方法，特别是与集成学习（ ensemble learning ）结合起来。其次，概率图模型方法已在文本分类、观点分析、行为分析、舆情计算、车辆分类、交通路况分析以及智能控制上有了初步应用，这些都是我所带领的团队当时正在从事的研究及开发工作，而我正苦寻有何方法能将这些工作的基础统一起来，从而实现从基础、研究到应用的一体化整合，概率图模型为我开启了一个十分具有前景和希望的途径。唯一的缺憾，是我对此书的内容组织和写作方式有所保留，觉得虽然总体合理流畅，而且还用了 “ 技巧框盒（ skill boxes ） ” 、 “ 案例框盒 ( case study boxes) ” 、 “ 概念框盒（ concept boxes ） ” 等让人一目了然印象深刻的表述形式，但对一般研究人员，特别是伴随微博微信成长起来的 QQ 新一代，此书相对难读，对某些人甚至基本无法读下去。因此，我希望通过细品此书，能够写出一本自己的书，就叫《可能图模型：原理与方法》，作为社会计算、平行方法和知识自动化的基础教材，统一我们团队学生之基本功的培育和修炼。有了这些想法，立即付诸行动。为了尽快完成翻译，我准备尝试利用自己一直全力推动的 “ 人肉搜索 ” 众包形式，希望通过社会媒体号召专业和社会人士参与此次工作，也为将来此书中文版之发行打下一个良好基础，就算是一种 “ 计算广告 ” 和 “ 精准投放 ” 实习吧！首先，我在实验室组织了十多名学生参与此次工作，作为众包的 “ 种子 ” ；再请王珏老师给大家系统化地上课，讲解此书；希望在此基础上，再挑选一批人，进行翻译，其余的参加翻译的讨论，并将翻译讨论后的结果分布于网上，发动专业和社会人士，进行评论、修正和改进，以 “ 众包 ” 方式完成此书的中文翻译。王珏老师对此给予我极大地支持，并安排了他的学生韩素青和张军平等参与此次工作。与此同时，我也开始联系此书的版权事宜。王珏老师推荐了清华大学出版社薛慧老师作为联系人，我也找到了正邀请我写作《社会计算》的 MIT 出版社的 Jane 女士，询问翻译版权事宜。 Jane 立即回信，表示感兴趣，同时将回信抄送 Koller 和 Friedman ，并表示她将参加 2010 年 5 月在 Alaska 召开的 IEEE 机器人与自动化国际大会（ ICRA ）大会，因我是 IEEE ICRA2010 的国际程序委员会委员，希望见面细谈。其实自己原本不想去 Alaska ，但为了尽快顺利促成此事，我决定还是去一趟。对于此次会面，我印象极深，一是刚下飞机就接到国内电话，社会计算工程团队发生重大事件，让我十分担心；二是刚进大会酒店 Hilton 大厅，一眼就看到左边酒吧里同事 JM 正与一位女士坐在一起品酒交谈， JM 起身与我打招呼，并介绍他的女伴，原来正是 Jane ！版权事宜很顺利， Jane 说她刚开始时担心翻译质量，因为此书很有难度，现在放心了。而且，对清华大学出版社也很了解，可又担心此文的中文版权是否有繁体与简体的冲突，因为繁体中文版权已授予一家台湾公司。对此，我只能介绍了她与清华大学出版社薛慧直接联系，似乎后来的版权事宜还算顺利。没有想到的是，接下来的翻译工作极不顺畅。参与学习班的许多同学提交的初稿可以用一个词来描述： “ 惨不忍睹 ” 。错误理解、望文生义，甚至将意思满拧，比比皆是，有时让人哭笑不得，完全可以用 “ 天马行空，独往独来 ” 来形容。让我难以相信，这是出自完成大学甚至硕士学业的学生之手。我一度想把这些翻译放到网上，让大家开心一下，或许会吸引更多的人士参与此书翻译的众包工作。但王珏和几位同事的意见完全打消了我的想法，也使我放弃了以 “ 众包 ” 方式进行翻译的打算。主要原因是当时希望能趁大家还在热切期盼此书的时候出版中文版，而众包翻译这种形式从未尝试过，结果如何不能妄断，万一比学生翻译的还差怎么办？何时才能完成？就是完成之后，署名和其他出版问题如何解决？最后，决定由我主笔并组织翻译工作，分翻译、统稿、审校、修正、清样校对五个阶段进行，邀请实验室毕业的韩素青、张军平、周建英和杨剑以及王立威和孙仕亮等机器学习领域的一线年轻研究人员辅助。在 2012 年以前，我只能用零星时间从事翻译工作。 2011 年底的一场大病，让我在之后有了充裕的时间和精力一边修养一边翻译修改。特别是在北京西郊与山林相伴的日日夜夜，效率极高，终于在 2012 年夏初初步完成了此书的翻译和统稿。必须说明的是，本项工作是集体努力的结果。参与人员五十余人，多为我和王珏老师的历届学生。首先，非常感谢韩素青博士在翻译和统稿过程付出的巨大努力和心血，她的坚持使我打消了一度放弃此项目的想法。此外，我团队的学生王友忠、王坤峰、王凯、叶佩军、田滨、成才、吕宜生、任延飞、孙涛、苏鹏、李叶、李林静、李泊、李晓晨、沈栋、宋东平、张柱、陈松航、陈诚、周建英、赵学亮、郝春辉、段伟、顾原、徐文聪、彭景、葛安生等参与了本书的翻译，北京大学的王立威教授、复旦大学的张军平教授、华东师范大学的孙仕亮教授、北京工业大学的杨剑教授、公安部第三研究所的周建英博士、中国科学院自动化研究所的王坤峰博士参与了审校，我的学生王坤峰、田滨、李叶、李泊、苟超、姚彦洁等参与了修正，最后，王坤峰带领我团队的学生王晓、亢文文、朱燕燕、刘玉强、刘裕良、杨坚、陈亚冉、陈圆圆、苟超、赵一飞、段艳杰、姚彦洁等完成了清样的校对和通读，在此我向他们深表谢意。还有许多其他同学和同事在不同阶段参与了本项工作，十分感谢他们的贡献，抱歉无法在此一一具名。在此书的翻译过程中，还得到 Koller 教授的帮助。 2011 年 8 月上旬，我在旧金山的 AAAI 年会上与她相见，讨论了翻译事宜。 Koller 表示可以让她的两位中国学生参与翻译，我还同他们联系过，但除了几个名词和一些修正问题，并没有太劳驾其学生。 Koller 提供的详细勘读表和网站信息，对我们翻译的校正很有帮助。今年四月，我赴美参加主编会议，本计划去旧金山与 Koller 见面确定翻译的最后一些细节，不想因病作罢，只能通过邮件进行。此书的翻译还让我与斯坦福大学人工智能实验室的三位主任有了较深的来往，而且三位分别是当今世界上最成功 MOOC 网络大学 Coursera 和 Udacity 的创始人。当 Koller 因和吴恩达创办 Coursera 而辞去 AI 主任之后， Sebastian Thrun 接任，那时我恰好与他合作组织 IJCAI2013 北京大会。 2011 年他来京得知我们正在翻译《概率图模型》后，希望也能翻译他的《概率机器人（ Probabilistic Robotics ）》。自己虽然教授了 20 年的机器人课程，但再无精力和时间做此类工作，只能安排实验室其他研究人员承担。但是他的中文翻译版权已被转让，只好作罢。后来 Thrun 辞去斯坦福和谷歌的工作，创办 Udacity ，接任实验室主任的，正是后来加入百度的吴恩达博士，十分感谢他表示愿为《概率图模型》中文版的推广而尽力。从初春北京的西山到五月阿拉斯加的海滨，从夏雨中长沙的跨线桥到烈日下图森的仙人掌，从秋枫叶飘的旧金山到海浪冲沙的海牙，从深秋风凉的青岛石老人再回到初冬消失在雾霾里的北京高楼， ...... 本书的翻译伴我度过了五年的风风雨雨，差点成了完不成的任务（ Mission Impossible ！）。今日落稿，顿觉释然，除了感谢自己的学生与同事之外，我必须特别感谢清华大学出版社的薛慧女士，感谢她在整个翻译过程中的热心和耐心。最后，希望把本书的中文版献给王珏老师，没有他就没有本项目的开始。更重要的是，希望本书中文版的完成，能使他早日从疾病中康复！中国科学院自动化研究所复杂系统管理与控制国家重点实验室国防科技大学军事计算实验与平行系统技术研究中心王飞跃 2012 年秋初记于长沙跨线桥居， 2013 年初补记于北京国奥新村， 2014 年初春再记于美国图森 Catalina 山居，同年深秋重记于青岛石老人海滨。又记：十分不幸的是， 2014 年 12 月 3 日传来噩耗：王珏老师因病逝世！相识相知相助 21 年，内心悲痛，无以言喻。特别是王珏老师生前没能看到本书中文版的正式出版，遗憾之余，深感自责。鉴于他和他学生的巨大帮助，我曾多次同他商谈，希望将他列为译者之一，但每次他都坚决拒绝；最后无奈，曾托薛慧拿着出版合同请他签字，但依然被拒绝。唯感欣慰的是， 12 月 2 日下午，在他神志清醒的最后时刻，我们见了最后一面。他去世后，当日我即电邮 Daphne Koller ，告她先前不曾知晓的王珏老师，还有他对中国机器学习的重要贡献以及在翻译其专著过程中所起的关键作用，希望她在中文版的序言里有所表述。英文如下： Prof. Jue Wang, a pioneer in ML in China and a research scientist in my lab, died of cancer today at age 66. He was a big promoter of your book and without his strong push behind, I might not have determined to do the translation in the first place. Many of outstanding young ML researchers in China are his former students and they have given me a huge support during the translation and proofreading of your book. So I would like you to say a few words about his effortin your preface. 可以告慰王珏老师的是， Koller 教授在其序言里恰如其分地表示了对他的贡献之衷心感谢。本书中文版的最终出版，就是对王珏老师的纪念！ 2014 年 12 月 9 日于北京科技会堂 -------------------------------------------------------------------------------------------- 图书目录致谢插图目录算法目录专栏目录第 1 章引言 1.1 动机 1.2 结构化概率模型 1.2.1 概率图模型 1.2.2 表示、推理、学习 1.3 概述和路线图 1.3.1 各章的概述 1.3.2 读者指南 1.3.3 与其他学科的联系 1.4 历史注记第 2 章基础知识 2.1 概率论 2.1.1 概率分布 2.1.2 概率中的基本概念 2.1.3 随机变量与联合分布 2.1.4 独立性与条件独立性 2.1.5 查询一个分布 2.1.6 连续空间 2.1.7 期望与方差 2.2 图 2.2.1 节点与边 2.2.2 子图 2.2.3 路径与迹 2.2.4 圈与环 2.3 相关文献 2.4 练习第Ⅰ部分表示第 3 章贝叶斯网表示 3.1 独立性性质的利用 3.1.1 随机变量的独立性 3.1.2 条件参数化方法 3.1.3 朴素贝叶斯模型 3.2 贝叶斯网 3.2.1 学生示例回顾 3.2.2 贝叶斯网的基本独立性 3.2.3 图与分布 3.3 图中的独立性 3.3.1 d- 分离 3.3.2 可靠性与完备性 3.3.3 d- 分离算法 3.3.4 I- 等价 3.4 从分布到图 3.4.1 最小 I-Map 3.4.2 P-Map 3.4.3 发现 P-Map* 3.5 小结 3.6 相关文献 3.7 习题第 4 章无向图模型 4.1 误解示例 4.2 参数化 4.2.1 因子 4.2.2 吉布斯分布与马尔可夫网 4.2.3 简化的马尔可夫网 4.3 马尔可夫网的独立性 4.3.1 基本独立性 4.3.2 独立性回顾 4.3.3 从分布到图 4.4 参数化回顾 4.4.1 细粒度参数化方法 4.4.2 过参数化 4.5 贝叶斯网与马尔可夫网 4.5.1 从贝叶斯网到马尔可夫网 4.5.2 从马尔可夫网到贝叶斯网 4.5.3 弦图 4.5.4 I- 等价 4.6 部分有向图 4.6.1 条件随机场 4.6.2 链图模型 * 4.7 小结与讨论 4.8 相关文献 4.9 习题第 5 章局部概率模型 5.1 CPD 表 5.2 确定性 CPD 5.2.1 表示 5.2.2 依赖性 5.3 上下文特定的 CPD 5.3.1 表示 5.3.2 独立性 5.4 因果影响的独立性 5.4.1 noisy-or 模型 5.4.2 广义线性模型 5.4.3 一般公式化表示 5.4.4 独立性 5.5 连续变量 5.5.1 混合模型 5.6 条件贝叶斯网 5.7 小结 5.8 相关文献 5.9 习题第 6 章基于模板的表示 6.1 引言 6.2 时序模型 6.2.1 基本假设 6.2.2 动态贝叶斯网 6.2.3 状态观测模型 6.3 模板变量与模板因子 6.4 对象 - 关系领域的有向概率模型 6.4.1 plate 模型 6.4.2 概率关系模型 6.5 无向表示 6.6 结构不确定性 * 6.6.1 关系不确定性 6.6.2 对象不确定性 6.7 小结 6.8 相关文献 6.9 习题第 7 章高斯网络模型 7.1 多元高斯分布 7.2.1 基本参数化方法 7.2.2 高斯分布的运算 7.2.3 高斯分布的独立性 7.2 高斯贝叶斯网 7.3 高斯马尔可夫随机场 7.4 小结 7.5 相关文献 7.6 练习第 8 章指数族 8.1 引言 8.2 指数族 8.2.1 线性指数族 8.3 因子化的指数族 (factored exponential families) 8.3.1 积分布 (product distributions) 8.3.2 贝叶斯网络 8.4 熵和相对熵 8.4.1 熵 8.4.2 相对熵 8.5 投影 8.5.1 比较 8.5.2 M- 投影 8.5.3 I- 投影 8.6 小结 8.7 相关文献 8.8 习题第Ⅱ部分推理第 9 章精确推理：变量消除 9.1 复杂性分析 9.1.1 精确推理分析 9.1.2 近似推理分析 9.2 变量消除：基本思路 9.3 变量消除 9.3.1 基本消除 9.3.2 证据处理 9.4 复杂性与图结构：变量消除 9.4.1 简单分析 9.4.2 图论分析 9.4.3 寻找消除排序 * 9.5 条件作用 * 9.5.1 条件作用算法 9.5.2 条件作用与变量消除 9.5.3 图论分析 9.5.4 改进的条件作用算法 9.6 用结构 CPD 推理 * 9.6.1 因果影响的独立性 9.6.2 上下文特定的独立性 9.6.3 讨论 9.7 小结与讨论 9.8 相关文献 9.9 习题第 10 章精确推理：团树 10.1 变量消除与团树 10.1.1 聚类图 10.1.2 团树 10.2 信息传递：和积 10.2.1 团树中的变量消除 10.2.2 团树校准 10.2.3 作为分布的校准团树 10.3 信息传递：信念更新 10.3.1 使用除法的信息传递 10.3.2 和 - 积与信息 - 更新消息的等价性 10.3.3 回答查询 10.4 构建一个团树 10.4.1 源自变量消除的团树 10.4.2 来自弦图的团树 10.5 小结 10.6 相关文献 10.7 习题第 11 章推理优化 11.1 引言 11.1.1 再议精确推理 * 11.1.2 能量泛函 11.1.3 优化能量泛函 11.2 作为优化的精确推理 11.2.1 不动点刻画 . 11.2.2 推理优化 11.3 基于传播的近似 11.3.1 一个简单的例子 11.3.2 聚类图信念传播 11.3.3 聚类图信念传播的性质 11.3.4 收敛性分析 * 11.3.5 构建聚类图 11.3.6 变分分析 11.3.7 其他熵近似 * 11.3.8 讨论 11.4 用近似信息传播 * 11.4.1 因子分解的消息 11.4.2 近似消息计算 11.4.3 用近似消息推理 11.4.4 期望传播 11.4.5 变分分析 11.4.6 讨论 11.5 结构化的变分近似 11.5.1 平均场近似 11.5.2 结构化的近似 11.5.3 局部变分法 * 11.6 小结与讨论 11.7 相关文献 11.8 习题第 12 章基于粒子的近似推理 12.1 前向采样 12.1.1 从贝叶斯网中采样 12.1.2 误差分析 12.1.3 条件概率查询 12.2 似然加权与重要性采样 12.2.1 似然加权：直觉 12.2.2 重要性采样 12.2.3 贝叶斯网的重要性采样 12.2.4 重要性采样回顾 12.3 马尔可夫链的蒙特卡罗方法 12.3.1 吉布斯采样算法 12.3.2 马尔可夫链 12.3.3 吉布斯采样回顾 12.3.4 一马尔可夫链的一个更广泛的类 * 12.3.5 利用马尔可夫链 12.4 坍塌的粒子 12.4.1 坍塌的似然加权 * 12.4.2 坍塌的 MCMC 12.5 确定性搜索方法 * 12.6 小结 12.7 相关文献 12.8 习题第 13 章最大后验推断 13.1 综述 13.1.1 计算复杂性 13.1.2 求解方法综述 13.2 （边缘） MAP 的变量消除 13.2.1 最大 - 积变量消除 13.2.2 找到最可能的取值 13.2.3 边缘 MAP 的变量消除 * 13.3 团树中的最大 - 积 13.3.1 计算最大边缘 13.3.2 作为再参数化的信息传递 13.3.3 最大边缘解码 13.4 多圈聚类图中的最大 - 积信念传播 13.4.1 标准最大 - 积消息传递 13.4.2 带有计数的最大 - 积 BP* 13.4.3 讨论 13.5 作为线性优化问题的 MAP* 13.5.1 整数规划的公式化 13.5.2 线性规划松弛 13.5.3 低温极限 13.6 对 MAP 使用图割 13.6.1 使用图割的推理 13.6.2 非二元变量 13.7 局部搜索算法 * 13.8 小结 13.9 相关文献 13.10 习题第 14 章混合网络中的推理 14.1 引言 14.1.1 挑战 14.1.2 离散化 14.1.3 概述 14.2 高斯网络中的变量消除 14.2.1 标准型 14.2.2 和 - 积算法 14.2.3 高斯信念传播 14.3 混合网 14.3.1 面临的困难 14.3.2 混合高斯网络的因子运算 14.3.3 CLG 网络的 EP 14.3.4 一个“准确的” CLG 算法 * 14.4 非线性依赖 14.4.1 线性化 14.4.2 期望传播与高斯近似 14.5 基于粒子的近似方法 14.5.1 在连续空间中采样 14.5.2 贝叶斯网中的前向采样 14.5.3 MCMC 方法 14.5.4 坍塌的粒子 14.5.4 非参数信息传递 14.6 小结与讨论 14.7 相关文献 14.8 习题第 15 章在时序模型中推理 15.1 推理任务 15.2 精确推理 15.2.1 基于状态观测模型的滤波 15.2.2 作为团树传播的滤波 15.2.3 DBN 中的团树推理 15.2.4 纠缠 15.3 近似推理 15.3.1 核心思想 15.3.2 因子分解的信念状态方法 15.3.3 粒子滤波 15.2.4 确定性搜索技术 15.4 混合 DBN 15.4.1 连续模型 15.4.2 混合模型 15.5 小结 15.6 相关文献 15.7 习题第Ⅲ部分学习第 16 章图模型学习：概述 16.1 动机 16.2 学习目标 16.2.1 密度估计 16.2.2 具体的预测任务 16.2.3 知识发现 16.3 作为优化的学习 16.3.1 经验风险与过拟合 16.3.2 判别式与生成式训练 16.4 学习任务 16.4.1 模型限制 16.4.2 数据的可观测性 16.4.3 学习任务的分类 16.5 相关文献第 17 章参数估计 17.1 最大似然估计 17.1.1 图钉的例子 17.1.2 最大似然准则 17.2 贝叶斯网的 MLE 17.2.1 一个简单的例子 17.2.2 全局似然分解 17.2.3 条件概率分布表 17.2.4 高斯贝叶斯网 * 17.2.5 作为 M- 投影的最大似然估计 * 17.3 贝叶斯参数估计 17.3.1 图钉例子的回顾 17.3.2 先验与后验 . 17.4 贝叶斯网络中的贝叶斯参数估计 17.4.1 参数独立性与全局分解 17.4.2 局部分解 17.4.3 贝叶斯网络学习的先验分布 17.4.4 MAP 估计 * 17.5 学习具有共享参数的模型 17.5.1 全局参数共享 17.5.2 局部参数共享 17.5.3 具有共享参数的贝叶斯推理 17.5.4 层次先验 * 17.6 泛化分析 * 17.6.1 渐进性分析 . 17.6.2 PAC 界 17.7 小结 17.8 相关文献 17.9 习题第 18 章贝叶斯网络中的结构学习 18.1 引言 18.1.1 问题定义 18.1.2 方法概述 18.2 基于约束的方法 18.2.1 基本框架 18.2.2 独立性测试 . 18.3 结构得分 18.3.1 似然得分 18.3.2 贝叶斯得分函数 18.3.3 单个变量的边缘似然 18.3.4 贝叶斯网的贝叶斯得分 18.3.5 理解贝叶斯得分 18.3.6 先验 18.3.7 得分等价性 * 18.4 结构搜索 18.4.1 学习树结构网 18.4.2 给定序 18.4.3 一般的图 18.4.4 用等价类学习 * 18.5 贝叶斯模型平均 * 18.5.1 基本理论 18.5.2 基本给定序的模型平均 18.5.3 一般的情况 . 18.6 关于额外结构学习模型 18.6.1 关于局部结构学习 18.6.2 学习模板模型 18.7 小结与讨论 18.8 相关文献 18.9 习题第 19 章部分观测数据 19.1 基础知识 19.1.1 数据的似然和观测模型 19.1.2 观测机制的解耦 19.1.3 似然函数 19.1.4 可识别性 19.2 参数估计 19.2.1 梯度上升方法 19.2.2 期望最大化（ EM ） 19.2.3 比较：梯度上升与 EM 19.2.4 近似推断 * 19.3 使用不完全数据的贝叶斯学习 * 19.3.1 概述 19.3.2 MCMC 采样 19.3.3 变分贝叶斯学习 19.4 结构学习 19.4.1 得分的结构 19.4.2 结构搜索 19.4.3 结构的 EM 19.5 用隐变量学习模型 19.5.1 隐变量的信息内容 19.5.2 确定基数 19.5.3 引入隐变量 19.6 小结 19.7 相关文献 19.8 习题第 20 章学习无向模型 20.1 概述 20.2 似然函数 20.2.1 一个例子 20.2.2 似然函数的形式 20.2.3 似然函数的性质 20.3 最大（条件）似然参数估计 20.3.1 最大似然估计 20.3.2 条件训练模型 20.3.3 用缺失数据学习 20.3.4 最大熵和最大似然 * 20.4 参数先验与正则化 20.4.1 局部先验 20.4.2 全局先验 20.5 用近似推理学习 20.5.1 信念传播 20.5.2 基于 MAP 的学习 * 20.6 替代目标 20.6.1 伪似然及其推广 20.6.2 对比优化准则 20.7 结构学习 20.7.1 使用独立性检验的结构学习 . 20.7.2 基于得分的学习：假设空间 . 20.7.3 目标函数 20.7.4 优化任务 20.7.5 评价模型的改变 20.8 小结 20.9 相关文献 20.10 习题第Ⅳ部分行为与决策第 21 章因果关系 21.1 动机与概述 21.1.1 条件作用与干预 21.1.2 相关关系和因果关系 21.2 因果关系模型 21.3 结构因果关系的可识别性 . 21.3.1 查询简化规则 21.3.2 迭代的查询简化 21.4 机制与响应变量 * 21.5 功能因果模型中的部分可识别性 * 21.6 仅事实查询 * 21.6.1 成对的网络 21.6.2 仅事实查询的界 21.7 学习因果模型 21.7.1 学习没有混合因素的因果模型 21.7.2 从干预数据中学习 21.7.3 处理隐变量 * 21.7.4 学习功能因果关系模型 * 21.8 小结 21.9 相关文献 21.10 习题第 22 章效用和决策 22.1 基础：期望效用最大化 22.1.1 非确定性决策制订 22.1.2 理论证明 * 22.2 效用曲线 22.2.1 货币效用 22.2.2 风险态度 22.2.3 合理性 22.3 效用的获取 22.3.1 效用获取过程 22.3.2 人类生命的效用 22.4 复杂结果的效用 22.4.1 偏好和效用独立性 * 22.4.2 加法独立性特性 22.5 小结 22.6 相关文献 22.7 习题第 23 章结构化决策问题 23.1 决策树 23.1.1 表示 23.1.2 逆向归纳算法 23.2 影响图 23.2.1 基本描述 23.2.2 决策规则 23.2.3 时间与记忆 23.2.4 语义与最优性准则 23.3 影响图的逆向归纳 23.3.1 影响图的决策树 23.3.2 求和 - 最大化 - 求和规则 23.4 期望效用的计算 23.4.1 简单的变量消除 23.4.2 多个效用变量：简单的方法 23.4.3 广义变量消除 * 23.5 影响图中的最优化 23.5.1 最优化一个单一的决策规则 23.5.2 迭代优化算法 23.5.3 策略关联与全局最优性 * 23.6 忽略无关的信息 * 23.7 信息的价值 23.7.1 单一观测 23.7.2 多重观测 23.8 小结 23.9 相关文献 23.10 习题第 24 章结束语附录 A 背景材料 A.1 信息论 . A.1.1 压缩和熵 A.1.2 条件熵与信息 A.1.3 相对熵和分布距离 A.2 收敛界 A.2.1 中心极限定理 A.2.2 收敛界 A.3 算法与算法的复杂性 A.3.1 基本图算法 A.3.2 算法复杂性分析 A.3.3 动态规划 A.3.4 复杂度理论 A.4 组合优化与搜索 A.4.1 优化问题 A.4.2 局部搜索 A.4.3 分支定界搜索 A.5 连续最优化 A.5.1 连续函数最优解的刻画 A.5.2 梯度上升方法 A.5.3 约束优化 A.5.4 凸对偶性参考文献符号索引符号缩写; 个人分类: 科研记事|44754 次阅读|23 个评论

[转载]Unsupervised Feature Learning and Deep Learning: zhileiliu 2015-1-13 09:56; Description: This tutorial will teach you the main ideas of Unsupervised Feature Learning and Deep Learning. By working through it, you will also get to implement several feature learning/deep learning algorithms, get to see them work for yourself, and learn how to apply/adapt these ideas to new problems. This tutorial assumes a basic knowledge of machine learning (specifically, familiarity with the ideas of supervised learning, logistic regression, gradient descent). If you are not familiar with these ideas, we suggest you go to this Machine Learning course and complete sections II, III, IV (up to Logistic Regression) first. Sparse Autoencoder Neural Networks Backpropagation Algorithm Gradient checking and advanced optimization Autoencoders and Sparsity Visualizing a Trained Autoencoder Sparse Autoencoder Notation Summary Exercise:Sparse Autoencoder Vectorized implementation Vectorization Logistic Regression Vectorization Example Neural Network Vectorization Exercise:Vectorization Preprocessing: PCA and Whitening PCA Whitening Implementing PCA/Whitening Exercise:PCA in 2D Exercise:PCA and Whitening Softmax Regression Softmax Regression Exercise:Softmax Regression Self-Taught Learning and Unsupervised Feature Learning Self-Taught Learning Exercise:Self-Taught Learning Building Deep Networks for Classification From Self-Taught Learning to Deep Networks Deep Networks: Overview Stacked Autoencoders Fine-tuning Stacked AEs Exercise: Implement deep networks for digit classification Linear Decoders with Autoencoders Linear Decoders Exercise:Learning color features with Sparse Autoencoders Working with Large Images Feature extraction using convolution Pooling Exercise:Convolution and Pooling Note : The sections above this line are stable. The sections below are still under construction, and may change without notice. Feel free to browse around however, and feedback/suggestions are welcome. Miscellaneous MATLAB Modules Style Guide Useful Links Miscellaneous Topics Data Preprocessing Deriving gradients using the backpropagation idea Advanced Topics : Sparse Coding Sparse Coding Sparse Coding: Autoencoder Interpretation Exercise:Sparse Coding ICA Style Models Independent Component Analysis Exercise:Independent Component Analysis Others Convolutional training Restricted Boltzmann Machines Deep Belief Networks Denoising Autoencoders K-means Spatial pyramids / Multiscale Slow Feature Analysis Tiled Convolution Networks Material contributed by: Andrew Ng, Jiquan Ngiam, Chuan Yu Foo, Yifan Mai, Caroline Suen; 1043 次阅读|0 个评论

[转载]Deep Learning（深度学习）学习笔记: zhileiliu 2014-12-17 13:07; 目录：一、概述二、背景三、人脑视觉机理四、关于特征 4.1、特征表示的粒度 4.2、初级（浅层）特征表示 4.3、结构性特征表示 4.4、需要有多少个特征？五、Deep Learning的基本思想六、浅层学习（Shallow Learning）和深度学习（Deep Learning）七、Deep learning与Neural Network 八、Deep learning训练过程 8.1、传统神经网络的训练方法 8.2、deep learning训练过程九、Deep Learning的常用模型或者方法 9.1、AutoEncoder自动编码器 9.2、Sparse Coding稀疏编码 9.3、Restricted Boltzmann Machine(RBM)限制波尔兹曼机 9.4、Deep BeliefNetworks深信度网络 9.5、Convolutional Neural Networks卷积神经网络十、总结与展望十一、参考文献和Deep Learning学习资源一、概述 Artificial Intelligence，也就是人工智能，就像长生不老和星际漫游一样，是人类最美好的梦想之一。虽然计算机技术已经取得了长足的进步，但是到目前为止，还没有一台电脑能产生“自我”的意识。是的，在人类和大量现成数据的帮助下，电脑可以表现的十分强大，但是离开了这两者，它甚至都不能分辨一个喵星人和一个汪星人。图灵（图灵，大家都知道吧。计算机和人工智能的鼻祖，分别对应于其著名的“图灵机”和“图灵测试”）在 1950 年的论文里，提出图灵试验的设想，即，隔墙对话，你将不知道与你谈话的，是人还是电脑。这无疑给计算机，尤其是人工智能，预设了一个很高的期望值。但是半个世纪过去了，人工智能的进展，远远没有达到图灵试验的标准。这不仅让多年翘首以待的人们，心灰意冷，认为人工智能是忽悠，相关领域是“伪科学”。但是自 2006 年以来，机器学习领域，取得了突破性的进展。图灵试验，至少不是那么可望而不可及了。至于技术手段，不仅仅依赖于云计算对大数据的并行处理能力，而且依赖于算法。这个算法就是，Deep Learning。借助于 Deep Learning 算法，人类终于找到了如何处理“抽象概念”这个亘古难题的方法。 2012年6月，《纽约时报》披露了Google Brain项目，吸引了公众的广泛关注。这个项目是由著名的斯坦福大学的机器学习教授Andrew Ng和在大规模计算机系统方面的世界顶尖专家JeffDean共同主导，用16000个CPU Core的并行计算平台训练一种称为“深度神经网络”（DNN，Deep Neural Networks）的机器学习模型（内部共有10亿个节点。这一网络自然是不能跟人类的神经网络相提并论的。要知道，人脑中可是有150多亿个神经元，互相连接的节点也就是突触数更是如银河沙数。曾经有人估算过，如果将一个人的大脑中所有神经细胞的轴突和树突依次连接起来，并拉成一根直线，可从地球连到月亮，再从月亮返回地球），在语音识别和图像识别等领域获得了巨大的成功。项目负责人之一Andrew称：“我们没有像通常做的那样自己框定边界，而是直接把海量数据投放到算法中，让数据自己说话，系统会自动从数据中学习。”另外一名负责人Jeff则说：“我们在训练的时候从来不会告诉机器说：‘这是一只猫。’系统其实是自己发明或者领悟了“猫”的概念。” 2012年11月，微软在中国天津的一次活动上公开演示了一个全自动的同声传译系统，讲演者用英文演讲，后台的计算机一气呵成自动完成语音识别、英中机器翻译和中文语音合成，效果非常流畅。据报道，后面支撑的关键技术也是DNN，或者深度学习（DL，DeepLearning）。 2013年1月，在百度年会上，创始人兼CEO李彦宏高调宣布要成立百度研究院，其中第一个成立的就是“深度学习研究所”（IDL，Institue of Deep Learning）。为什么拥有大数据的互联网公司争相投入大量资源研发深度学习技术。听起来感觉deeplearning很牛那样。那什么是deep learning？为什么有deep learning？它是怎么来的？又能干什么呢？目前存在哪些困难呢？这些问题的简答都需要慢慢来。咱们先来了解下机器学习（人工智能的核心）的背景。二、背景机器学习（Machine Learning）是一门专门研究计算机怎样模拟或实现人类的学习行为，以获取新的知识或技能，重新组织已有的知识结构使之不断改善自身的性能的学科。机器能否像人类一样能具有学习能力呢？1959年美国的塞缪尔(Samuel)设计了一个下棋程序，这个程序具有学习能力，它可以在不断的对弈中改善自己的棋艺。4年后，这个程序战胜了设计者本人。又过了3年，这个程序战胜了美国一个保持8年之久的常胜不败的冠军。这个程序向人们展示了机器学习的能力，提出了许多令人深思的社会问题与哲学问题（呵呵，人工智能正常的轨道没有很大的发展，这些什么哲学伦理啊倒发展的挺快。什么未来机器越来越像人，人越来越像机器啊。什么机器会反人类啊，ATM是开第一枪的啊等等。人类的思维无穷啊）。机器学习虽然发展了几十年，但还是存在很多没有良好解决的问题：例如图像识别、语音识别、自然语言理解、天气预测、基因表达、内容推荐等等。目前我们通过机器学习去解决这些问题的思路都是这样的（以视觉感知为例子）：从开始的通过传感器（例如CMOS）来获得数据。然后经过预处理、特征提取、特征选择，再到推理、预测或者识别。最后一个部分，也就是机器学习的部分，绝大部分的工作是在这方面做的，也存在很多的paper和研究。而中间的三部分，概括起来就是特征表达。良好的特征表达，对最终算法的准确性起了非常关键的作用，而且系统主要的计算和测试工作都耗在这一大部分。但，这块实际中一般都是人工完成的。靠人工提取特征。截止现在，也出现了不少NB的特征（好的特征应具有不变性（大小、尺度和旋转等）和可区分性）：例如Sift的出现，是局部图像特征描述子研究领域一项里程碑式的工作。由于SIFT对尺度、旋转以及一定视角和光照变化等图像变化都具有不变性，并且SIFT具有很强的可区分性，的确让很多问题的解决变为可能。但它也不是万能的。然而，手工地选取特征是一件非常费力、启发式（需要专业知识）的方法，能不能选取好很大程度上靠经验和运气，而且它的调节需要大量的时间。既然手工选取特征不太好，那么能不能自动地学习一些特征呢？答案是能！Deep Learning就是用来干这个事情的，看它的一个别名UnsupervisedFeature Learning，就可以顾名思义了，Unsupervised的意思就是不要人参与特征的选取过程。那它是怎么学习的呢？怎么知道哪些特征好哪些不好呢？我们说机器学习是一门专门研究计算机怎样模拟或实现人类的学习行为的学科。好，那我们人的视觉系统是怎么工作的呢？为什么在茫茫人海，芸芸众生，滚滚红尘中我们都可以找到另一个她（因为，你存在我深深的脑海里，我的梦里我的心里我的歌声里……）。人脑那么NB，我们能不能参考人脑，模拟人脑呢？（好像和人脑扯上点关系的特征啊，算法啊，都不错，但不知道是不是人为强加的，为了使自己的作品变得神圣和高雅。）近几十年以来，认知神经科学、生物学等等学科的发展，让我们对自己这个神秘的而又神奇的大脑不再那么的陌生。也给人工智能的发展推波助澜。三、人脑视觉机理 1981 年的诺贝尔医学奖，颁发给了 David Hubel（出生于加拿大的美国神经生物学家）和TorstenWiesel，以及 Roger Sperry。前两位的主要贡献，是“发现了视觉系统的信息处理”：可视皮层是分级的：我们看看他们做了什么。1958 年，DavidHubel 和Torsten Wiesel 在 JohnHopkins University，研究瞳孔区域与大脑皮层神经元的对应关系。他们在猫的后脑头骨上，开了一个3 毫米的小洞，向洞里插入电极，测量神经元的活跃程度。然后，他们在小猫的眼前，展现各种形状、各种亮度的物体。并且，在展现每一件物体时，还改变物体放置的位置和角度。他们期望通过这个办法，让小猫瞳孔感受不同类型、不同强弱的刺激。之所以做这个试验，目的是去证明一个猜测。位于后脑皮层的不同视觉神经元，与瞳孔所受刺激之间，存在某种对应关系。一旦瞳孔受到某一种刺激，后脑皮层的某一部分神经元就会活跃。经历了很多天反复的枯燥的试验，同时牺牲了若干只可怜的小猫，David Hubel 和Torsten Wiesel 发现了一种被称为“方向选择性细胞（Orientation Selective Cell）”的神经元细胞。当瞳孔发现了眼前的物体的边缘，而且这个边缘指向某个方向时，这种神经元细胞就会活跃。这个发现激发了人们对于神经系统的进一步思考。神经-中枢-大脑的工作过程，或许是一个不断迭代、不断抽象的过程。这里的关键词有两个，一个是抽象，一个是迭代。从原始信号，做低级抽象，逐渐向高级抽象迭代。人类的逻辑思维，经常使用高度抽象的概念。例如，从原始信号摄入开始（瞳孔摄入像素 Pixels），接着做初步处理（大脑皮层某些细胞发现边缘和方向），然后抽象（大脑判定，眼前的物体的形状，是圆形的），然后进一步抽象（大脑进一步判定该物体是只气球）。这个生理学的发现，促成了计算机人工智能，在四十年后的突破性发展。总的来说，人的视觉系统的信息处理是分级的。从低级的V1区提取边缘特征，再到V2区的形状或者目标的部分等，再到更高层，整个目标、目标的行为等。也就是说高层的特征是低层特征的组合，从低层到高层的特征表示越来越抽象，越来越能表现语义或者意图。而抽象层面越高，存在的可能猜测就越少，就越利于分类。例如，单词集合和句子的对应是多对一的，句子和语义的对应又是多对一的，语义和意图的对应还是多对一的，这是个层级体系。敏感的人注意到关键词了：分层。而Deep learning的deep是不是就表示我存在多少层，也就是多深呢？没错。那Deep learning是如何借鉴这个过程的呢？毕竟是归于计算机来处理，面对的一个问题就是怎么对这个过程建模？因为我们要学习的是特征的表达，那么关于特征，或者说关于这个层级特征，我们需要了解地更深入点。所以在说Deep Learning之前，我们有必要再啰嗦下特征（呵呵，实际上是看到那么好的对特征的解释，不放在这里有点可惜，所以就塞到这了）。接上因为我们要学习的是特征的表达，那么关于特征，或者说关于这个层级特征，我们需要了解地更深入点。所以在说Deep Learning之前，我们有必要再啰嗦下特征（呵呵，实际上是看到那么好的对特征的解释，不放在这里有点可惜，所以就塞到这了）。四、关于特征特征是机器学习系统的原材料，对最终模型的影响是毋庸置疑的。如果数据被很好的表达成了特征，通常线性模型就能达到满意的精度。那对于特征，我们需要考虑什么呢？ 4.1、特征表示的粒度学习算法在一个什么粒度上的特征表示，才有能发挥作用？就一个图片来说，像素级的特征根本没有价值。例如下面的摩托车，从像素级别，根本得不到任何信息，其无法进行摩托车和非摩托车的区分。而如果特征是一个具有结构性（或者说有含义）的时候，比如是否具有车把手（handle），是否具有车轮（wheel），就很容易把摩托车和非摩托车区分，学习算法才能发挥作用。 4.2、初级（浅层）特征表示既然像素级的特征表示方法没有作用，那怎样的表示才有用呢？ 1995 年前后，Bruno Olshausen和 David Field 两位学者任职 Cornell University，他们试图同时用生理学和计算机的手段，双管齐下，研究视觉问题。他们收集了很多黑白风景照片，从这些照片中，提取出400个小碎片，每个照片碎片的尺寸均为 16x16 像素，不妨把这400个碎片标记为 S, i = 0,.. 399。接下来，再从这些黑白风景照片中，随机提取另一个碎片，尺寸也是 16x16 像素，不妨把这个碎片标记为 T。他们提出的问题是，如何从这400个碎片中，选取一组碎片，S , 通过叠加的办法，合成出一个新的碎片，而这个新的碎片，应当与随机选择的目标碎片 T，尽可能相似，同时，S 的数量尽可能少。用数学的语言来描述，就是： Sum_k (a * S ) -- T, 其中 a 是在叠加碎片 S 时的权重系数。为解决这个问题，Bruno Olshausen和 David Field 发明了一个算法，稀疏编码（Sparse Coding）。稀疏编码是一个重复迭代的过程，每次迭代分两步： 1）选择一组 S ，然后调整 a ，使得Sum_k (a * S ) 最接近 T。 2）固定住 a ，在 400 个碎片中，选择其它更合适的碎片S’ ，替代原先的 S ，使得Sum_k (a * S’ ) 最接近 T。经过几次迭代后，最佳的 S 组合，被遴选出来了。令人惊奇的是，被选中的 S ，基本上都是照片上不同物体的边缘线，这些线段形状相似，区别在于方向。 Bruno Olshausen和 David Field 的算法结果，与 David Hubel 和Torsten Wiesel 的生理发现，不谋而合！也就是说，复杂图形，往往由一些基本结构组成。比如下图：一个图可以通过用64种正交的edges（可以理解成正交的基本结构）来线性表示。比如样例的x可以用1-64个edges中的三个按照0.8,0.3,0.5的权重调和而成。而其他基本edge没有贡献，因此均为0 。另外，大牛们还发现，不仅图像存在这个规律，声音也存在。他们从未标注的声音中发现了20种基本的声音结构，其余的声音可以由这20种基本结构合成。 4.3、结构性特征表示小块的图形可以由基本edge构成，更结构化，更复杂的，具有概念性的图形如何表示呢？这就需要更高层次的特征表示，比如V2，V4。因此V1看像素级是像素级。V2看V1是像素级，这个是层次递进的，高层表达由底层表达的组合而成。专业点说就是基basis。V1取提出的basis是边缘，然后V2层是V1层这些basis的组合，这时候V2区得到的又是高一层的basis。即上一层的basis组合的结果，上上层又是上一层的组合basis……（所以有大牛说Deep learning就是“搞基”，因为难听，所以美其名曰Deep learning或者Unsupervised Feature Learning）直观上说，就是找到make sense的小patch再将其进行combine，就得到了上一层的feature，递归地向上learning feature。在不同object上做training是，所得的edge basis 是非常相似的，但object parts和models 就会completely different了（那咱们分辨car或者face是不是容易多了）：从文本来说，一个doc表示什么意思？我们描述一件事情，用什么来表示比较合适？用一个一个字嘛，我看不是，字就是像素级别了，起码应该是term，换句话说每个doc都由term构成，但这样表示概念的能力就够了嘛，可能也不够，需要再上一步，达到topic级，有了topic，再到doc就合理。但每个层次的数量差距很大，比如doc表示的概念-topic（千-万量级）-term（10万量级）-word（百万量级）。一个人在看一个doc的时候，眼睛看到的是word，由这些word在大脑里自动切词形成term，在按照概念组织的方式，先验的学习，得到topic，然后再进行高层次的learning。 4.4、需要有多少个特征？我们知道需要层次的特征构建，由浅入深，但每一层该有多少个特征呢？任何一种方法，特征越多，给出的参考信息就越多，准确性会得到提升。但特征多意味着计算复杂，探索的空间大，可以用来训练的数据在每个特征上就会稀疏，都会带来各种问题，并不一定特征越多越好。好了，到了这一步，终于可以聊到Deep learning了。上面我们聊到为什么会有Deep learning（让机器自动学习良好的特征，而免去人工选取过程。还有参考人的分层视觉处理系统），我们得到一个结论就是Deep learning需要多层来获得更抽象的特征表达。那么多少层才合适呢？用什么架构来建模呢？怎么进行非监督训练呢？接上好了，到了这一步，终于可以聊到Deep learning了。上面我们聊到为什么会有Deep learning（让机器自动学习良好的特征，而免去人工选取过程。还有参考人的分层视觉处理系统），我们得到一个结论就是Deep learning需要多层来获得更抽象的特征表达。那么多少层才合适呢？用什么架构来建模呢？怎么进行非监督训练呢？五、Deep Learning的基本思想假设我们有一个系统S，它有n层（S1,…Sn），它的输入是I，输出是O，形象地表示为： I =S1=S2=…..=Sn = O，如果输出O等于输入I，即输入I经过这个系统变化之后没有任何的信息损失（呵呵，大牛说，这是不可能的。信息论中有个“信息逐层丢失”的说法（信息处理不等式），设处理a信息得到b，再对b处理得到c，那么可以证明：a和c的互信息不会超过a和b的互信息。这表明信息处理不会增加信息，大部分处理会丢失信息。当然了，如果丢掉的是没用的信息那多好啊），保持了不变，这意味着输入I经过每一层Si都没有任何的信息损失，即在任何一层Si，它都是原有信息（即输入I）的另外一种表示。现在回到我们的主题Deep Learning，我们需要自动地学习特征，假设我们有一堆输入I（如一堆图像或者文本），假设我们设计了一个系统S（有n层），我们通过调整系统中参数，使得它的输出仍然是输入I，那么我们就可以自动地获取得到输入I的一系列层次特征，即S1，…, Sn。对于深度学习来说，其思想就是对堆叠多个层，也就是说这一层的输出作为下一层的输入。通过这种方式，就可以实现对输入信息进行分级表达了。另外，前面是假设输出严格地等于输入，这个限制太严格，我们可以略微地放松这个限制，例如我们只要使得输入与输出的差别尽可能地小即可，这个放松会导致另外一类不同的Deep Learning方法。上述就是Deep Learning的基本思想。六、浅层学习（Shallow Learning）和深度学习（Deep Learning）浅层学习是机器学习的第一次浪潮。 20世纪80年代末期，用于人工神经网络的反向传播算法（也叫Back Propagation算法或者BP算法）的发明，给机器学习带来了希望，掀起了基于统计模型的机器学习热潮。这个热潮一直持续到今天。人们发现，利用BP算法可以让一个人工神经网络模型从大量训练样本中学习统计规律，从而对未知事件做预测。这种基于统计的机器学习方法比起过去基于人工规则的系统，在很多方面显出优越性。这个时候的人工神经网络，虽也被称作多层感知机（Multi-layer Perceptron），但实际是种只含有一层隐层节点的浅层模型。 20世纪90年代，各种各样的浅层机器学习模型相继被提出，例如支撑向量机（SVM，Support Vector Machines）、 Boosting、最大熵方法（如LR，Logistic Regression）等。这些模型的结构基本上可以看成带有一层隐层节点（如SVM、Boosting），或没有隐层节点（如LR）。这些模型无论是在理论分析还是应用中都获得了巨大的成功。相比之下，由于理论分析的难度大，训练方法又需要很多经验和技巧，这个时期浅层人工神经网络反而相对沉寂。深度学习是机器学习的第二次浪潮。 2006年，加拿大多伦多大学教授、机器学习领域的泰斗Geoffrey Hinton和他的学生RuslanSalakhutdinov在《科学》上发表了一篇文章，开启了深度学习在学术界和工业界的浪潮。这篇文章有两个主要观点：1）多隐层的人工神经网络具有优异的特征学习能力，学习得到的特征对数据有更本质的刻画，从而有利于可视化或分类；2）深度神经网络在训练上的难度，可以通过“逐层初始化”（layer-wise pre-training）来有效克服，在这篇文章中，逐层初始化是通过无监督学习实现的。当前多数分类、回归等学习方法为浅层结构算法，其局限性在于有限样本和计算单元情况下对复杂函数的表示能力有限，针对复杂分类问题其泛化能力受到一定制约。深度学习可通过学习一种深层非线性网络结构，实现复杂函数逼近，表征输入数据分布式表示，并展现了强大的从少数样本集中学习数据集本质特征的能力。（多层的好处是可以用较少的参数表示复杂的函数）深度学习的实质，是通过构建具有很多隐层的机器学习模型和海量的训练数据，来学习更有用的特征，从而最终提升分类或预测的准确性。因此，“深度模型”是手段，“特征学习”是目的。区别于传统的浅层学习，深度学习的不同在于：1）强调了模型结构的深度，通常有5层、6层，甚至10多层的隐层节点；2）明确突出了特征学习的重要性，也就是说，通过逐层特征变换，将样本在原空间的特征表示变换到一个新特征空间，从而使分类或预测更加容易。与人工规则构造特征的方法相比，利用大数据来学习特征，更能够刻画数据的丰富内在信息。七、Deep learning与Neural Network 深度学习是机器学习研究中的一个新的领域，其动机在于建立、模拟人脑进行分析学习的神经网络，它模仿人脑的机制来解释数据，例如图像，声音和文本。深度学习是无监督学习的一种。深度学习的概念源于人工神经网络的研究。含多隐层的多层感知器就是一种深度学习结构。深度学习通过组合低层特征形成更加抽象的高层表示属性类别或特征，以发现数据的分布式特征表示。 Deep learning本身算是machine learning的一个分支，简单可以理解为neural network的发展。大约二三十年前，neural network曾经是ML领域特别火热的一个方向，但是后来确慢慢淡出了，原因包括以下几个方面： 1）比较容易过拟合，参数比较难tune，而且需要不少trick； 2）训练速度比较慢，在层次比较少（小于等于3）的情况下效果并不比其它方法更优；所以中间有大约20多年的时间，神经网络被关注很少，这段时间基本上是SVM和boosting算法的天下。但是，一个痴心的老先生Hinton，他坚持了下来，并最终（和其它人一起Bengio、Yann.lecun等）提成了一个实际可行的deep learning框架。 Deep learning与传统的神经网络之间有相同的地方也有很多不同。二者的相同在于deep learning采用了神经网络相似的分层结构，系统由包括输入层、隐层（多层）、输出层组成的多层网络，只有相邻层节点之间有连接，同一层以及跨层节点之间相互无连接，每一层可以看作是一个logistic regression模型；这种分层结构，是比较接近人类大脑的结构的。而为了克服神经网络训练中的问题，DL采用了与神经网络很不同的训练机制。传统神经网络中，采用的是back propagation的方式进行，简单来讲就是采用迭代的算法来训练整个网络，随机设定初值，计算当前网络的输出，然后根据当前输出和label之间的差去改变前面各层的参数，直到收敛（整体是一个梯度下降法）。而deep learning整体上是一个layer-wise的训练机制。这样做的原因是因为，如果采用back propagation的机制，对于一个deep network（7层以上），残差传播到最前面的层已经变得太小，出现所谓的gradient diffusion（梯度扩散）。这个问题我们接下来讨论。八、Deep learning训练过程 8.1、传统神经网络的训练方法为什么不能用在深度神经网络 BP算法作为传统训练多层网络的典型算法，实际上对仅含几层网络，该训练方法就已经很不理想。深度结构（涉及多个非线性处理单元层）非凸目标代价函数中普遍存在的局部最小是训练困难的主要来源。 BP算法存在的问题：（1）梯度越来越稀疏：从顶层越往下，误差校正信号越来越小；（2）收敛到局部最小值：尤其是从远离最优区域开始的时候（随机值初始化会导致这种情况的发生）；（3）一般，我们只能用有标签的数据来训练：但大部分的数据是没标签的，而大脑可以从没有标签的的数据中学习； 8.2、deep learning训练过程如果对所有层同时训练，时间复杂度会太高；如果每次训练一层，偏差就会逐层传递。这会面临跟上面监督学习中相反的问题，会严重欠拟合（因为深度网络的神经元和参数太多了）。 2006年，hinton提出了在非监督数据上建立多层神经网络的一个有效方法，简单的说，分为两步，一是每次训练一层网络，二是调优，使原始表示x向上生成的高级表示r和该高级表示r向下生成的x'尽可能一致。方法是： 1）首先逐层构建单层神经元，这样每次都是训练一个单层网络。 2）当所有层训练完后，Hinton使用wake-sleep算法进行调优。将除最顶层的其它层间的权重变为双向的，这样最顶层仍然是一个单层神经网络，而其它层则变为了图模型。向上的权重用于“认知”，向下的权重用于“生成”。然后使用Wake-Sleep算法调整所有的权重。让认知和生成达成一致，也就是保证生成的最顶层表示能够尽可能正确的复原底层的结点。比如顶层的一个结点表示人脸，那么所有人脸的图像应该激活这个结点，并且这个结果向下生成的图像应该能够表现为一个大概的人脸图像。Wake-Sleep算法分为醒（wake）和睡（sleep）两个部分。 1）wake阶段：认知过程，通过外界的特征和向上的权重（认知权重）产生每一层的抽象表示（结点状态），并且使用梯度下降修改层间的下行权重（生成权重）。也就是“如果现实跟我想象的不一样，改变我的权重使得我想象的东西就是这样的”。 2）sleep阶段：生成过程，通过顶层表示（醒时学得的概念）和向下权重，生成底层的状态，同时修改层间向上的权重。也就是“如果梦中的景象不是我脑中的相应概念，改变我的认知权重使得这种景象在我看来就是这个概念”。 deep learning训练过程具体如下： 1）使用自下上升非监督学习（就是从底层开始，一层一层的往顶层训练）：采用无标定数据（有标定数据也可）分层训练各层参数，这一步可以看作是一个无监督训练过程，是和传统神经网络区别最大的部分（这个过程可以看作是feature learning过程）：具体的，先用无标定数据训练第一层，训练时先学习第一层的参数（这一层可以看作是得到一个使得输出和输入差别最小的三层神经网络的隐层），由于模型capacity的限制以及稀疏性约束，使得得到的模型能够学习到数据本身的结构，从而得到比输入更具有表示能力的特征；在学习得到第n-1层后，将n-1层的输出作为第n层的输入，训练第n层，由此分别得到各层的参数； 2）自顶向下的监督学习（就是通过带标签的数据去训练，误差自顶向下传输，对网络进行微调）：基于第一步得到的各层参数进一步fine-tune整个多层模型的参数，这一步是一个有监督训练过程；第一步类似神经网络的随机初始化初值过程，由于DL的第一步不是随机初始化，而是通过学习输入数据的结构得到的，因而这个初值更接近全局最优，从而能够取得更好的效果；所以deep learning效果好很大程度上归功于第一步的feature learning过程。接上九、Deep Learning的常用模型或者方法 9.1、AutoEncoder自动编码器 Deep Learning最简单的一种方法是利用人工神经网络的特点，人工神经网络（ANN）本身就是具有层次结构的系统，如果给定一个神经网络，我们假设其输出与输入是相同的，然后训练调整其参数，得到每一层中的权重。自然地，我们就得到了输入I的几种不同表示（每一层代表一种表示），这些表示就是特征。自动编码器就是一种尽可能复现输入信号的神经网络。为了实现这种复现，自动编码器就必须捕捉可以代表输入数据的最重要的因素，就像PCA那样，找到可以代表原信息的主要成分。具体过程简单的说明如下： 1）给定无标签数据，用非监督学习学习特征：在我们之前的神经网络中，如第一个图，我们输入的样本是有标签的，即（input, target），这样我们根据当前输出和target（label）之间的差去改变前面各层的参数，直到收敛。但现在我们只有无标签数据，也就是右边的图。那么这个误差怎么得到呢？如上图，我们将input输入一个encoder编码器，就会得到一个code，这个code也就是输入的一个表示，那么我们怎么知道这个code表示的就是input呢？我们加一个decoder解码器，这时候decoder就会输出一个信息，那么如果输出的这个信息和一开始的输入信号input是很像的（理想情况下就是一样的），那很明显，我们就有理由相信这个code是靠谱的。所以，我们就通过调整encoder和decoder的参数，使得重构误差最小，这时候我们就得到了输入input信号的第一个表示了，也就是编码code了。因为是无标签数据，所以误差的来源就是直接重构后与原输入相比得到。 2）通过编码器产生特征，然后训练下一层。这样逐层训练：那上面我们就得到第一层的code，我们的重构误差最小让我们相信这个code就是原输入信号的良好表达了，或者牵强点说，它和原信号是一模一样的（表达不一样，反映的是一个东西）。那第二层和第一层的训练方式就没有差别了，我们将第一层输出的code当成第二层的输入信号，同样最小化重构误差，就会得到第二层的参数，并且得到第二层输入的code，也就是原输入信息的第二个表达了。其他层就同样的方法炮制就行了（训练这一层，前面层的参数都是固定的，并且他们的decoder已经没用了，都不需要了）。 3）有监督微调：经过上面的方法，我们就可以得到很多层了。至于需要多少层（或者深度需要多少，这个目前本身就没有一个科学的评价方法）需要自己试验调了。每一层都会得到原始输入的不同的表达。当然了，我们觉得它是越抽象越好了，就像人的视觉系统一样。到这里，这个AutoEncoder还不能用来分类数据，因为它还没有学习如何去连结一个输入和一个类。它只是学会了如何去重构或者复现它的输入而已。或者说，它只是学习获得了一个可以良好代表输入的特征，这个特征可以最大程度上代表原输入信号。那么，为了实现分类，我们就可以在AutoEncoder的最顶的编码层添加一个分类器（例如罗杰斯特回归、SVM等），然后通过标准的多层神经网络的监督训练方法（梯度下降法）去训练。也就是说，这时候，我们需要将最后层的特征code输入到最后的分类器，通过有标签样本，通过监督学习进行微调，这也分两种，一个是只调整分类器（黑色部分）：另一种：通过有标签样本，微调整个系统：（如果有足够多的数据，这个是最好的。end-to-end learning端对端学习）一旦监督训练完成，这个网络就可以用来分类了。神经网络的最顶层可以作为一个线性分类器，然后我们可以用一个更好性能的分类器去取代它。在研究中可以发现，如果在原有的特征中加入这些自动学习得到的特征可以大大提高精确度，甚至在分类问题中比目前最好的分类算法效果还要好！ AutoEncoder存在一些变体，这里简要介绍下两个： Sparse AutoEncoder稀疏自动编码器：当然，我们还可以继续加上一些约束条件得到新的Deep Learning方法，如：如果在AutoEncoder的基础上加上L1的Regularity限制（L1主要是约束每一层中的节点中大部分都要为0，只有少数不为0，这就是Sparse名字的来源），我们就可以得到Sparse AutoEncoder法。如上图，其实就是限制每次得到的表达code尽量稀疏。因为稀疏的表达往往比其他的表达要有效（人脑好像也是这样的，某个输入只是刺激某些神经元，其他的大部分的神经元是受到抑制的）。 Denoising AutoEncoders降噪自动编码器：降噪自动编码器DA是在自动编码器的基础上，训练数据加入噪声，所以自动编码器必须学习去去除这种噪声而获得真正的没有被噪声污染过的输入。因此，这就迫使编码器去学习输入信号的更加鲁棒的表达，这也是它的泛化能力比一般编码器强的原因。DA可以通过梯度下降算法去训练。接上 9.2、Sparse Coding稀疏编码如果我们把输出必须和输入相等的限制放松，同时利用线性代数中基的概念，即O = a1*Φ1 + a2*Φ2+….+ an*Φn， Φi是基，ai是系数，我们可以得到这样一个优化问题： Min |I – O|，其中I表示输入，O表示输出。通过求解这个最优化式子，我们可以求得系数ai和基Φi，这些系数和基就是输入的另外一种近似表达。因此，它们可以用来表达输入I，这个过程也是自动学习得到的。如果我们在上述式子上加上L1的Regularity限制，得到： Min |I – O| + u*(|a1| + |a2| + … + |an |) 这种方法被称为Sparse Coding。通俗的说，就是将一个信号表示为一组基的线性组合，而且要求只需要较少的几个基就可以将信号表示出来。“稀疏性”定义为：只有很少的几个非零元素或只有很少的几个远大于零的元素。要求系数 ai 是稀疏的意思就是说：对于一组输入向量，我们只想有尽可能少的几个系数远大于零。选择使用具有稀疏性的分量来表示我们的输入数据是有原因的，因为绝大多数的感官数据，比如自然图像，可以被表示成少量基本元素的叠加，在图像中这些基本元素可以是面或者线。同时，比如与初级视觉皮层的类比过程也因此得到了提升（人脑有大量的神经元，但对于某些图像或者边缘只有很少的神经元兴奋，其他都处于抑制状态）。稀疏编码算法是一种无监督学习方法，它用来寻找一组“超完备”基向量来更高效地表示样本数据。虽然形如主成分分析技术（PCA）能使我们方便地找到一组“完备”基向量，但是这里我们想要做的是找到一组“超完备”基向量来表示输入向量（也就是说，基向量的个数比输入向量的维数要大）。超完备基的好处是它们能更有效地找出隐含在输入数据内部的结构与模式。然而，对于超完备基来说，系数ai不再由输入向量唯一确定。因此，在稀疏编码算法中，我们另加了一个评判标准“稀疏性”来解决因超完备而导致的退化（degeneracy）问题。（详细过程请参考：UFLDL Tutorial稀疏编码）比如在图像的Feature Extraction的最底层要做Edge Detector的生成，那么这里的工作就是从Natural Images中randomly选取一些小patch，通过这些patch生成能够描述他们的“基”，也就是右边的8*8=64个basis组成的basis，然后给定一个test patch, 我们可以按照上面的式子通过basis的线性组合得到，而sparse matrix就是a，下图中的a中有64个维度，其中非零项只有3个，故称“sparse”。这里可能大家会有疑问，为什么把底层作为Edge Detector呢？上层又是什么呢？这里做个简单解释大家就会明白，之所以是Edge Detector是因为不同方向的Edge就能够描述出整幅图像，所以不同方向的Edge自然就是图像的basis了……而上一层的basis组合的结果，上上层又是上一层的组合basis……（就是上面第四部分的时候咱们说的那样） Sparse coding分为两个部分： 1）Training阶段：给定一系列的样本图片，我们需要学习得到一组基，也就是字典。稀疏编码是k-means算法的变体，其训练过程也差不多（EM算法的思想：如果要优化的目标函数包含两个变量，如L(W, B)，那么我们可以先固定W，调整B使得L最小，然后再固定B，调整W使L最小，这样迭代交替，不断将L推向最小值。EM算法可以见我的博客：“ 从最大似然到EM算法浅解 ”）。训练过程就是一个重复迭代的过程，按上面所说，我们交替的更改a和Φ使得下面这个目标函数最小。每次迭代分两步： a）固定字典Φ ，然后调整a ，使得上式，即目标函数最小（即解LASSO问题）。 b）然后固定住a ，调整Φ ，使得上式，即目标函数最小（即解凸QP问题）。不断迭代，直至收敛。这样就可以得到一组可以良好表示这一系列x的基，也就是字典。 2）Coding阶段：给定一个新的图片x，由上面得到的字典，通过解一个LASSO问题得到稀疏向量 a 。这个稀疏向量就是这个输入向量x的一个稀疏表达了。例如：接上注：下面的两个Deep Learning方法说明需要完善，但为了保证文章的连续性和完整性，先贴一些上来，后面再修改好了。 9.3、Restricted Boltzmann Machine (RBM)限制波尔兹曼机假设有一个二部图，每一层的节点之间没有链接，一层是可视层，即输入数据层（v)，一层是隐藏层(h)，如果假设所有的节点都是随机二值变量节点（只能取0或者1值），同时假设全概率分布p(v,h)满足Boltzmann 分布，我们称这个模型是Restricted BoltzmannMachine (RBM)。下面我们来看看为什么它是Deep Learning方法。首先，这个模型因为是二部图，所以在已知v的情况下，所有的隐藏节点之间是条件独立的（因为节点之间不存在连接），即p(h|v)=p(h1|v)…p(hn|v)。同理，在已知隐藏层h的情况下，所有的可视节点都是条件独立的。同时又由于所有的v和h满足Boltzmann 分布，因此，当输入v的时候，通过p(h|v) 可以得到隐藏层h，而得到隐藏层h之后，通过p(v|h)又能得到可视层，通过调整参数，我们就是要使得从隐藏层得到的可视层v1与原来的可视层v如果一样，那么得到的隐藏层就是可视层另外一种表达，因此隐藏层可以作为可视层输入数据的特征，所以它就是一种Deep Learning方法。如何训练呢？也就是可视层节点和隐节点间的权值怎么确定呢？我们需要做一些数学分析。也就是模型了。联合组态（jointconfiguration）的能量可以表示为：而某个组态的联合概率分布可以通过Boltzmann 分布（和这个组态的能量）来确定：因为隐藏节点之间是条件独立的（因为节点之间不存在连接），即：然后我们可以比较容易（对上式进行因子分解Factorizes）得到在给定可视层v的基础上，隐层第j个节点为1或者为0的概率：同理，在给定隐层h的基础上，可视层第i个节点为1或者为0的概率也可以容易得到：给定一个满足独立同分布的样本集：D={ v (1), v (2),…, v (N)}，我们需要学习参数θ={W,a,b}。我们最大化以下对数似然函数（最大似然估计：对于某个概率模型，我们需要选择一个参数，让我们当前的观测样本的概率最大）：也就是对最大对数似然函数求导，就可以得到L最大时对应的参数W了。如果，我们把隐藏层的层数增加，我们可以得到Deep Boltzmann Machine(DBM)；如果我们在靠近可视层的部分使用贝叶斯信念网络（即有向图模型，当然这里依然限制层中节点之间没有链接），而在最远离可视层的部分使用Restricted Boltzmann Machine，我们可以得到DeepBelief Net（DBN）。 9.4、Deep Belief Networks深信度网络 DBNs是一个概率生成模型，与传统的判别模型的神经网络相对，生成模型是建立一个观察数据和标签之间的联合分布，对P(Observation|Label)和 P(Label|Observation)都做了评估，而判别模型仅仅而已评估了后者，也就是P(Label|Observation)。对于在深度神经网络应用传统的BP算法的时候，DBNs遇到了以下问题：（1）需要为训练提供一个有标签的样本集；（2）学习过程较慢；（3）不适当的参数选择会导致学习收敛于局部最优解。 DBNs由多个限制玻尔兹曼机（Restricted Boltzmann Machines）层组成，一个典型的神经网络类型如图三所示。这些网络被“限制”为一个可视层和一个隐层，层间存在连接，但层内的单元间不存在连接。隐层单元被训练去捕捉在可视层表现出来的高阶数据的相关性。首先，先不考虑最顶构成一个联想记忆（associative memory）的两层，一个DBN的连接是通过自顶向下的生成权值来指导确定的，RBMs就像一个建筑块一样，相比传统和深度分层的sigmoid信念网络，它能易于连接权值的学习。最开始的时候，通过一个非监督贪婪逐层方法去预训练获得生成模型的权值，非监督贪婪逐层方法被Hinton证明是有效的，并被其称为对比分歧（contrastive divergence）。在这个训练阶段，在可视层会产生一个向量v，通过它将值传递到隐层。反过来，可视层的输入会被随机的选择，以尝试去重构原始的输入信号。最后，这些新的可视的神经元激活单元将前向传递重构隐层激活单元，获得h（在训练过程中，首先将可视向量值映射给隐单元；然后可视单元由隐层单元重建；这些新可视单元再次映射给隐单元，这样就获取新的隐单元。执行这种反复步骤叫做吉布斯采样）。这些后退和前进的步骤就是我们熟悉的Gibbs采样，而隐层激活单元和可视层输入之间的相关性差别就作为权值更新的主要依据。训练时间会显著的减少，因为只需要单个步骤就可以接近最大似然学习。增加进网络的每一层都会改进训练数据的对数概率，我们可以理解为越来越接近能量的真实表达。这个有意义的拓展，和无标签数据的使用，是任何一个深度学习应用的决定性的因素。在最高两层，权值被连接到一起，这样更低层的输出将会提供一个参考的线索或者关联给顶层，这样顶层就会将其联系到它的记忆内容。而我们最关心的，最后想得到的就是判别性能，例如分类任务里面。在预训练后，DBN可以通过利用带标签数据用BP算法去对判别性能做调整。在这里，一个标签集将被附加到顶层（推广联想记忆），通过一个自下向上的，学习到的识别权值获得一个网络的分类面。这个性能会比单纯的BP算法训练的网络好。这可以很直观的解释，DBNs的BP算法只需要对权值参数空间进行一个局部的搜索，这相比前向神经网络来说，训练是要快的，而且收敛的时间也少。 DBNs的灵活性使得它的拓展比较容易。一个拓展就是卷积DBNs（Convolutional Deep Belief Networks(CDBNs)）。DBNs并没有考虑到图像的2维结构信息，因为输入是简单的从一个图像矩阵一维向量化的。而CDBNs就是考虑到了这个问题，它利用邻域像素的空域关系，通过一个称为卷积RBMs的模型区达到生成模型的变换不变性，而且可以容易得变换到高维图像。DBNs并没有明确地处理对观察变量的时间联系的学习上，虽然目前已经有这方面的研究，例如堆叠时间RBMs，以此为推广，有序列学习的dubbed temporal convolutionmachines，这种序列学习的应用，给语音信号处理问题带来了一个让人激动的未来研究方向。目前，和DBNs有关的研究包括堆叠自动编码器，它是通过用堆叠自动编码器来替换传统DBNs里面的RBMs。这就使得可以通过同样的规则来训练产生深度多层神经网络架构，但它缺少层的参数化的严格要求。与DBNs不同，自动编码器使用判别模型，这样这个结构就很难采样输入采样空间，这就使得网络更难捕捉它的内部表达。但是，降噪自动编码器却能很好的避免这个问题，并且比传统的DBNs更优。它通过在训练过程添加随机的污染并堆叠产生场泛化性能。训练单一的降噪自动编码器的过程和RBMs训练生成模型的过程一样。接上 9.5、Convolutional Neural Networks卷积神经网络卷积神经网络是人工神经网络的一种，已成为当前语音分析和图像识别领域的研究热点。它的权值共享网络结构使之更类似于生物神经网络，降低了网络模型的复杂度，减少了权值的数量。该优点在网络的输入是多维图像时表现的更为明显，使图像可以直接作为网络的输入，避免了传统识别算法中复杂的特征提取和数据重建过程。卷积网络是为识别二维形状而特殊设计的一个多层感知器，这种网络结构对平移、比例缩放、倾斜或者共他形式的变形具有高度不变性。 CNNs是受早期的延时神经网络（TDNN）的影响。延时神经网络通过在时间维度上共享权值降低学习复杂度，适用于语音和时间序列信号的处理。 CNNs是第一个真正成功训练多层网络结构的学习算法。它利用空间关系减少需要学习的参数数目以提高一般前向BP算法的训练性能。CNNs作为一个深度学习架构提出是为了最小化数据的预处理要求。在CNN中，图像的一小部分（局部感受区域）作为层级结构的最低层的输入，信息再依次传输到不同的层，每层通过一个数字滤波器去获得观测数据的最显著的特征。这个方法能够获取对平移、缩放和旋转不变的观测数据的显著特征，因为图像的局部感受区域允许神经元或者处理单元可以访问到最基础的特征，例如定向边缘或者角点。 1）卷积神经网络的历史 1962年Hubel和Wiesel通过对猫视觉皮层细胞的研究，提出了感受野(receptive field)的概念，1984年日本学者Fukushima基于感受野概念提出的神经认知机(neocognitron)可以看作是卷积神经网络的第一个实现网络，也是感受野概念在人工神经网络领域的首次应用。神经认知机将一个视觉模式分解成许多子模式（特征），然后进入分层递阶式相连的特征平面进行处理，它试图将视觉系统模型化，使其能够在即使物体有位移或轻微变形的时候，也能完成识别。通常神经认知机包含两类神经元，即承担特征抽取的S-元和抗变形的C-元。S-元中涉及两个重要参数，即感受野与阈值参数，前者确定输入连接的数目，后者则控制对特征子模式的反应程度。许多学者一直致力于提高神经认知机的性能的研究：在传统的神经认知机中，每个S-元的感光区中由C-元带来的视觉模糊量呈正态分布。如果感光区的边缘所产生的模糊效果要比中央来得大，S-元将会接受这种非正态模糊所导致的更大的变形容忍性。我们希望得到的是，训练模式与变形刺激模式在感受野的边缘与其中心所产生的效果之间的差异变得越来越大。为了有效地形成这种非正态模糊，Fukushima提出了带双C-元层的改进型神经认知机。 Van Ooyen和Niehuis为提高神经认知机的区别能力引入了一个新的参数。事实上，该参数作为一种抑制信号，抑制了神经元对重复激励特征的激励。多数神经网络在权值中记忆训练信息。根据Hebb学习规则，某种特征训练的次数越多，在以后的识别过程中就越容易被检测。也有学者将进化计算理论与神经认知机结合，通过减弱对重复性激励特征的训练学习，而使得网络注意那些不同的特征以助于提高区分能力。上述都是神经认知机的发展过程，而卷积神经网络可看作是神经认知机的推广形式，神经认知机是卷积神经网络的一种特例。 2）卷积神经网络的网络结构卷积神经网络是一个多层的神经网络，每层由多个二维平面组成，而每个平面由多个独立神经元组成。图：卷积神经网络的概念示范：输入图像通过和三个可训练的滤波器和可加偏置进行卷积，滤波过程如图一，卷积后在C1层产生三个特征映射图，然后特征映射图中每组的四个像素再进行求和，加权值，加偏置，通过一个Sigmoid函数得到三个S2层的特征映射图。这些映射图再进过滤波得到C3层。这个层级结构再和S2一样产生S4。最终，这些像素值被光栅化，并连接成一个向量输入到传统的神经网络，得到输出。一般地，C层为特征提取层，每个神经元的输入与前一层的局部感受野相连，并提取该局部的特征，一旦该局部特征被提取后，它与其他特征间的位置关系也随之确定下来；S层是特征映射层，网络的每个计算层由多个特征映射组成，每个特征映射为一个平面，平面上所有神经元的权值相等。特征映射结构采用影响函数核小的sigmoid函数作为卷积网络的激活函数，使得特征映射具有位移不变性。此外，由于一个映射面上的神经元共享权值，因而减少了网络自由参数的个数，降低了网络参数选择的复杂度。卷积神经网络中的每一个特征提取层（C-层）都紧跟着一个用来求局部平均与二次提取的计算层（S-层），这种特有的两次特征提取结构使网络在识别时对输入样本有较高的畸变容忍能力。 3）关于参数减少与权值共享上面聊到，好像CNN一个牛逼的地方就在于通过感受野和权值共享减少了神经网络需要训练的参数的个数。那究竟是啥的呢？下图左：如果我们有1000x1000像素的图像，有1百万个隐层神经元，那么他们全连接的话（每个隐层神经元都连接图像的每一个像素点），就有1000x1000x1000000=10^12个连接，也就是10^12个权值参数。然而图像的空间联系是局部的，就像人是通过一个局部的感受野去感受外界图像一样，每一个神经元都不需要对全局图像做感受，每个神经元只感受局部的图像区域，然后在更高层，将这些感受不同局部的神经元综合起来就可以得到全局的信息了。这样，我们就可以减少连接的数目，也就是减少神经网络需要训练的权值参数的个数了。如下图右：假如局部感受野是10x10，隐层每个感受野只需要和这10x10的局部图像相连接，所以1百万个隐层神经元就只有一亿个连接，即10^8个参数。比原来减少了四个0（数量级），这样训练起来就没那么费力了，但还是感觉很多的啊，那还有啥办法没？我们知道，隐含层的每一个神经元都连接10x10个图像区域，也就是说每一个神经元存在10x10=100个连接权值参数。那如果我们每个神经元这100个参数是相同的呢？也就是说每个神经元用的是同一个卷积核去卷积图像。这样我们就只有多少个参数？？只有100个参数啊！！！亲！不管你隐层的神经元个数有多少，两层间的连接我只有100个参数啊！亲！这就是权值共享啊！亲！这就是卷积神经网络的主打卖点啊！亲！（有点烦了，呵呵）也许你会问，这样做靠谱吗？为什么可行呢？这个……共同学习。好了，你就会想，这样提取特征也忒不靠谱吧，这样你只提取了一种特征啊？对了，真聪明，我们需要提取多种特征对不？假如一种滤波器，也就是一种卷积核就是提出图像的一种特征，例如某个方向的边缘。那么我们需要提取不同的特征，怎么办，加多几种滤波器不就行了吗？对了。所以假设我们加到100种滤波器，每种滤波器的参数不一样，表示它提出输入图像的不同特征，例如不同的边缘。这样每种滤波器去卷积图像就得到对图像的不同特征的放映，我们称之为Feature Map。所以100种卷积核就有100个Feature Map。这100个Feature Map就组成了一层神经元。到这个时候明了了吧。我们这一层有多少个参数了？100种卷积核x每种卷积核共享100个参数=100x100=10K，也就是1万个参数。才1万个参数啊！亲！（又来了，受不了了！）见下图右：不同的颜色表达不同的滤波器。嘿哟，遗漏一个问题了。刚才说隐层的参数个数和隐层的神经元个数无关，只和滤波器的大小和滤波器种类的多少有关。那么隐层的神经元个数怎么确定呢？它和原图像，也就是输入的大小（神经元个数）、滤波器的大小和滤波器在图像中的滑动步长都有关！例如，我的图像是1000x1000像素，而滤波器大小是10x10，假设滤波器没有重叠，也就是步长为10，这样隐层的神经元个数就是(1000x1000 )/ (10x10)=100x100个神经元了，假设步长是8，也就是卷积核会重叠两个像素，那么……我就不算了，思想懂了就好。注意了，这只是一种滤波器，也就是一个Feature Map的神经元个数哦，如果100个Feature Map就是100倍了。由此可见，图像越大，神经元个数和需要训练的权值参数个数的贫富差距就越大。需要注意的一点是，上面的讨论都没有考虑每个神经元的偏置部分。所以权值个数需要加1 。这个也是同一种滤波器共享的。总之，卷积网络的核心思想是将：局部感受野、权值共享（或者权值复制）以及时间或空间亚采样这三种结构思想结合起来获得了某种程度的位移、尺度、形变不变性。 4）一个典型的例子说明一种典型的用来识别数字的卷积网络是LeNet-5（效果和paper等见这）。当年美国大多数银行就是用它来识别支票上面的手写数字的。能够达到这种商用的地步，它的准确性可想而知。毕竟目前学术界和工业界的结合是最受争议的。那下面咱们也用这个例子来说明下。 LeNet-5共有7层，不包含输入，每层都包含可训练参数（连接权重）。输入图像为32*32大小。这要比 Mnist数据库（一个公认的手写数据库）中最大的字母还大。这样做的原因是希望潜在的明显特征如笔画断电或角点能够出现在最高层特征监测子感受野的中心。我们先要明确一点：每个层有多个Feature Map，每个Feature Map通过一种卷积滤波器提取输入的一种特征，然后每个Feature Map有多个神经元。 C1层是一个卷积层（为什么是卷积？卷积运算一个重要的特点就是，通过卷积运算，可以使原信号特征增强，并且降低噪音），由6个特征图Feature Map构成。特征图中每个神经元与输入中5*5的邻域相连。特征图的大小为28*28，这样能防止输入的连接掉到边界之外（是为了BP反馈时的计算，不致梯度损失，个人见解）。C1有156个可训练参数（每个滤波器5*5=25个unit参数和一个bias参数，一共6个滤波器，共(5*5+1)*6=156个参数），共156*(28*28)=122,304个连接。 S2层是一个下采样层（为什么是下采样？利用图像局部相关性的原理，对图像进行子抽样，可以减少数据处理量同时保留有用信息），有6个14*14的特征图。特征图中的每个单元与C1中相对应特征图的2*2邻域相连接。S2层每个单元的4个输入相加，乘以一个可训练参数，再加上一个可训练偏置。结果通过sigmoid函数计算。可训练系数和偏置控制着sigmoid函数的非线性程度。如果系数比较小，那么运算近似于线性运算，亚采样相当于模糊图像。如果系数比较大，根据偏置的大小亚采样可以被看成是有噪声的“或”运算或者有噪声的“与”运算。每个单元的2*2感受野并不重叠，因此S2中每个特征图的大小是C1中特征图大小的1/4（行和列各1/2）。S2层有12个可训练参数和5880个连接。图：卷积和子采样过程：卷积过程包括：用一个可训练的滤波器fx去卷积一个输入的图像（第一阶段是输入的图像，后面的阶段就是卷积特征map了），然后加一个偏置bx，得到卷积层Cx。子采样过程包括：每邻域四个像素求和变为一个像素，然后通过标量Wx+1加权，再增加偏置bx+1，然后通过一个sigmoid激活函数，产生一个大概缩小四倍的特征映射图Sx+1。所以从一个平面到下一个平面的映射可以看作是作卷积运算，S-层可看作是模糊滤波器，起到二次特征提取的作用。隐层与隐层之间空间分辨率递减，而每层所含的平面数递增，这样可用于检测更多的特征信息。 C3层也是一个卷积层，它同样通过5x5的卷积核去卷积层S2，然后得到的特征map就只有10x10个神经元，但是它有16种不同的卷积核，所以就存在16个特征map了。这里需要注意的一点是：C3中的每个特征map是连接到S2中的所有6个或者几个特征map的，表示本层的特征map是上一层提取到的特征map的不同组合（这个做法也并不是唯一的）。（看到没有，这里是组合，就像之前聊到的人的视觉系统一样，底层的结构构成上层更抽象的结构，例如边缘构成形状或者目标的部分）。刚才说C3中每个特征图由S2中所有6个或者几个特征map组合而成。为什么不把S2中的每个特征图连接到每个C3的特征图呢？原因有2点。第一，不完全的连接机制将连接的数量保持在合理的范围内。第二，也是最重要的，其破坏了网络的对称性。由于不同的特征图有不同的输入，所以迫使他们抽取不同的特征（希望是互补的）。例如，存在的一个方式是：C3的前6个特征图以S2中3个相邻的特征图子集为输入。接下来6个特征图以S2中4个相邻特征图子集为输入。然后的3个以不相邻的4个特征图子集为输入。最后一个将S2中所有特征图为输入。这样C3层有1516个可训练参数和151600个连接。 S4层是一个下采样层，由16个5*5大小的特征图构成。特征图中的每个单元与C3中相应特征图的2*2邻域相连接，跟C1和S2之间的连接一样。S4层有32个可训练参数（每个特征图1个因子和一个偏置）和2000个连接。 C5层是一个卷积层，有120个特征图。每个单元与S4层的全部16个单元的5*5邻域相连。由于S4层特征图的大小也为5*5（同滤波器一样），故C5特征图的大小为1*1：这构成了S4和C5之间的全连接。之所以仍将C5标示为卷积层而非全相联层，是因为如果LeNet-5的输入变大，而其他的保持不变，那么此时特征图的维数就会比1*1大。C5层有48120个可训练连接。 F6层有84个单元（之所以选这个数字的原因来自于输出层的设计），与C5层全相连。有10164个可训练参数。如同经典神经网络，F6层计算输入向量和权重向量之间的点积，再加上一个偏置。然后将其传递给sigmoid函数产生单元i的一个状态。最后，输出层由欧式径向基函数（Euclidean Radial Basis Function）单元组成，每类一个单元，每个有84个输入。换句话说，每个输出RBF单元计算输入向量和参数向量之间的欧式距离。输入离参数向量越远，RBF输出的越大。一个RBF输出可以被理解为衡量输入模式和与RBF相关联类的一个模型的匹配程度的惩罚项。用概率术语来说，RBF输出可以被理解为F6层配置空间的高斯分布的负log-likelihood。给定一个输入模式，损失函数应能使得F6的配置与RBF参数向量（即模式的期望分类）足够接近。这些单元的参数是人工选取并保持固定的（至少初始时候如此）。这些参数向量的成分被设为-1或1。虽然这些参数可以以-1和1等概率的方式任选，或者构成一个纠错码，但是被设计成一个相应字符类的7*12大小（即84）的格式化图片。这种表示对识别单独的数字不是很有用，但是对识别可打印ASCII集中的字符串很有用。使用这种分布编码而非更常用的“1 of N”编码用于产生输出的另一个原因是，当类别比较大的时候，非分布编码的效果比较差。原因是大多数时间非分布编码的输出必须为0。这使得用sigmoid单元很难实现。另一个原因是分类器不仅用于识别字母，也用于拒绝非字母。使用分布编码的RBF更适合该目标。因为与sigmoid不同，他们在输入空间的较好限制的区域内兴奋，而非典型模式更容易落到外边。 RBF参数向量起着F6层目标向量的角色。需要指出这些向量的成分是+1或-1，这正好在F6 sigmoid的范围内，因此可以防止sigmoid函数饱和。实际上，+1和-1是sigmoid函数的最大弯曲的点处。这使得F6单元运行在最大非线性范围内。必须避免sigmoid函数的饱和，因为这将会导致损失函数较慢的收敛和病态问题。 5）训练过程神经网络用于模式识别的主流是有指导学习网络，无指导学习网络更多的是用于聚类分析。对于有指导的模式识别，由于任一样本的类别是已知的，样本在空间的分布不再是依据其自然分布倾向来划分，而是要根据同类样本在空间的分布及不同类样本之间的分离程度找一种适当的空间划分方法，或者找到一个分类边界，使得不同类样本分别位于不同的区域内。这就需要一个长时间且复杂的学习过程，不断调整用以划分样本空间的分类边界的位置，使尽可能少的样本被划分到非同类区域中。卷积网络在本质上是一种输入到输出的映射，它能够学习大量的输入与输出之间的映射关系，而不需要任何输入和输出之间的精确的数学表达式，只要用已知的模式对卷积网络加以训练，网络就具有输入输出对之间的映射能力。卷积网络执行的是有导师训练，所以其样本集是由形如：（输入向量，理想输出向量）的向量对构成的。所有这些向量对，都应该是来源于网络即将模拟的系统的实际“运行”结果。它们可以是从实际运行系统中采集来的。在开始训练前，所有的权都应该用一些不同的小随机数进行初始化。“小随机数”用来保证网络不会因权值过大而进入饱和状态，从而导致训练失败；“不同”用来保证网络可以正常地学习。实际上，如果用相同的数去初始化权矩阵，则网络无能力学习。训练算法与传统的BP算法差不多。主要包括4步，这4步被分为两个阶段：第一阶段，向前传播阶段： a）从样本集中取一个样本(X,Yp)，将X输入网络； b）计算相应的实际输出Op。在此阶段，信息从输入层经过逐级的变换，传送到输出层。这个过程也是网络在完成训练后正常运行时执行的过程。在此过程中，网络执行的是计算（实际上就是输入与每层的权值矩阵相点乘，得到最后的输出结果）： Op=Fn（…（F2（F1（XpW（1））W（2））…）W（n））第二阶段，向后传播阶段 a）算实际输出Op与相应的理想输出Yp的差； b）按极小化误差的方法反向传播调整权矩阵。 6）卷积神经网络的优点卷积神经网络CNN主要用来识别位移、缩放及其他形式扭曲不变性的二维图形。由于CNN的特征检测层通过训练数据进行学习，所以在使用CNN时，避免了显式的特征抽取，而隐式地从训练数据中进行学习；再者由于同一特征映射面上的神经元权值相同，所以网络可以并行学习，这也是卷积网络相对于神经元彼此相连网络的一大优势。卷积神经网络以其局部权值共享的特殊结构在语音识别和图像处理方面有着独特的优越性，其布局更接近于实际的生物神经网络，权值共享降低了网络的复杂性，特别是多维输入向量的图像可以直接输入网络这一特点避免了特征提取和分类过程中数据重建的复杂度。流的分类方式几乎都是基于统计特征的，这就意味着在进行分辨前必须提取某些特征。然而，显式的特征提取并不容易，在一些应用问题中也并非总是可靠的。卷积神经网络，它避免了显式的特征取样，隐式地从训练数据中进行学习。这使得卷积神经网络明显有别于其他基于神经网络的分类器，通过结构重组和减少权值将特征提取功能融合进多层感知器。它可以直接处理灰度图片，能够直接用于处理基于图像的分类。卷积网络较一般神经网络在图像处理方面有如下优点： a）输入图像和网络的拓扑结构能很好的吻合；b）特征提取和模式分类同时进行，并同时在训练中产生；c）权重共享可以减少网络的训练参数，使神经网络结构变得更简单，适应性更强。 7）小结 CNNs中这种层间联系和空域信息的紧密关系，使其适于图像处理和理解。而且，其在自动提取图像的显著特征方面还表现出了比较优的性能。在一些例子当中，Gabor滤波器已经被使用在一个初始化预处理的步骤中，以达到模拟人类视觉系统对视觉刺激的响应。在目前大部分的工作中，研究者将CNNs应用到了多种机器学习问题中，包括人脸识别，文档分析和语言检测等。为了达到寻找视频中帧与帧之间的相干性的目的，目前CNNs通过一个时间相干性去训练，但这个不是CNNs特有的。呵呵，这部分讲得太啰嗦了，又没讲到点上。没办法了，先这样的，这样这个过程我还没有走过，所以自己水平有限啊，望各位明察。需要后面再改了，呵呵。接上十、总结与展望 1）Deep learning总结深度学习是关于自动学习要建模的数据的潜在（隐含）分布的多层（复杂）表达的算法。换句话来说，深度学习算法自动的提取分类需要的低层次或者高层次特征。高层次特征，一是指该特征可以分级（层次）地依赖其他特征，例如：对于机器视觉，深度学习算法从原始图像去学习得到它的一个低层次表达，例如边缘检测器，小波滤波器等，然后在这些低层次表达的基础上再建立表达，例如这些低层次表达的线性或者非线性组合，然后重复这个过程，最后得到一个高层次的表达。 Deep learning能够得到更好地表示数据的feature，同时由于模型的层次、参数很多，capacity足够，因此，模型有能力表示大规模数据，所以对于图像、语音这种特征不明显（需要手工设计且很多没有直观物理含义）的问题，能够在大规模训练数据上取得更好的效果。此外，从模式识别特征和分类器的角度，deep learning框架将feature和分类器结合到一个框架中，用数据去学习feature，在使用中减少了手工设计feature的巨大工作量（这是目前工业界工程师付出努力最多的方面），因此，不仅仅效果可以更好，而且，使用起来也有很多方便之处，因此，是十分值得关注的一套框架，每个做ML的人都应该关注了解一下。当然，deep learning本身也不是完美的，也不是解决世间任何ML问题的利器，不应该被放大到一个无所不能的程度。 2）Deep learning未来深度学习目前仍有大量工作需要研究。目前的关注点还是从机器学习的领域借鉴一些可以在深度学习使用的方法，特别是降维领域。例如：目前一个工作就是稀疏编码，通过压缩感知理论对高维数据进行降维，使得非常少的元素的向量就可以精确的代表原来的高维信号。另一个例子就是半监督流行学习，通过测量训练样本的相似性，将高维数据的这种相似性投影到低维空间。另外一个比较鼓舞人心的方向就是evolutionary programming approaches（遗传编程方法），它可以通过最小化工程能量去进行概念性自适应学习和改变核心架构。 Deep learning还有很多核心的问题需要解决：（1）对于一个特定的框架，对于多少维的输入它可以表现得较优（如果是图像，可能是上百万维）？（2）对捕捉短时或者长时间的时间依赖，哪种架构才是有效的？（3）如何对于一个给定的深度学习架构，融合多种感知的信息？（4）有什么正确的机理可以去增强一个给定的深度学习架构，以改进其鲁棒性和对扭曲和数据丢失的不变性？（5）模型方面是否有其他更为有效且有理论依据的深度模型学习算法？探索新的特征提取模型是值得深入研究的内容。此外有效的可并行训练算法也是值得研究的一个方向。当前基于最小批处理的随机梯度优化算法很难在多计算机中进行并行训练。通常办法是利用图形处理单元加速学习过程。然而单个机器GPU对大规模数据识别或相似任务数据集并不适用。在深度学习应用拓展方面，如何合理充分利用深度学习在增强传统学习算法的性能仍是目前各领域的研究重点。十一、参考文献和Deep Learning学习资源（持续更新……）先是机器学习领域大牛的微博：@余凯_西二旗民工；@老师木；@梁斌penny；@张栋_机器学习；@邓侃；@大数据皮东；@djvu9…… （1）Deep Learning http://deeplearning.net/ （2）Deep Learning Methods for Vision http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/ （3）Neural Network for Recognition of Handwritten Digits http://www.codeproject.com/Articles/16650/Neural-Network-for-Recognition-of-Handwritten-Digi （4）Training a deep autoencoder or a classifier on MNIST digits http://www.cs.toronto.edu/~hinton/MatlabForSciencePaper.html （5）Ersatz：deep neural networks in the cloud http://www.ersatz1.com/ （6）Deep Learning http://www.cs.nyu.edu/~yann/research/deep/ （7）Invited talk A Tutorial on Deep Learning by Dr. Kai Yu (余凯) http://vipl.ict.ac.cn/News/academic-report-tutorial-deep-learning-dr-kai-yu （8）CNN - Convolutional neural network class http://www.mathworks.cn/matlabcentral/fileexchange/24291 （9）Yann LeCun's Publications http://yann.lecun.com/exdb/publis/index.html#lecun-98 （10） LeNet-5, convolutional neural networks http://yann.lecun.com/exdb/lenet/index.html （11） Deep Learning 大牛Geoffrey E. Hinton's HomePage http://www.cs.toronto.edu/~hinton/ （12）Sparse coding simulation software http://redwood.berkeley.edu/bruno/sparsenet/ （13）Andrew Ng's homepage http://robotics.stanford.edu/~ang/ （14）stanford deep learning tutorial http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial （15）「深度神经网络」（deep neural network）具体是怎样工作的 http://www.zhihu.com/question/19833708?group_id=15019075#1657279 （16）A shallow understanding on deep learning http://blog.sina.com.cn/s/blog_6ae183910101dw2z.html （17）Bengio's Learning Deep Architectures for AI http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf （18）andrew ng's talk video: http://techtalks.tv/talks/machine-learning-and-ai-via-brain-simulations/57862/ （19）cvpr 2012 tutorial： http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/tutorial_p2_nnets_ranzato_short.pdf （20）Andrew ng清华报告听后感 http://blog.sina.com.cn/s/blog_593af2a70101bqyo.html （21）Kai Yu：CVPR12 Tutorial on Deep Learning Sparse Coding （22）Honglak Lee：Deep Learning Methods for Vision （23）Andrew Ng ：Machine Learning and AI via Brain simulations （24）Deep Learning 【2,3】 http://blog.sina.com.cn/s/blog_46d0a3930101gs5h.html （25）deep learning这件小事…… http://blog.sina.com.cn/s/blog_67fcf49e0101etab.html （26）Yoshua Bengio, U. Montreal：Learning Deep Architectures （27）Kai Yu：A Tutorial on Deep Learning （28）Marc'Aurelio Ranzato：NEURAL NETS FOR VISION （29）Unsupervised feature learning and deep learning http://blog.csdn.net/abcjennifer/article/details/7804962 （30）机器学习前沿热点–Deep Learning http://elevencitys.com/?p=1854 （31）机器学习——深度学习(Deep Learning) http://blog.csdn.net/abcjennifer/article/details/7826917 （32）卷积神经网络 http://wenku.baidu.com/view/cd16fb8302d276a200292e22.html （33）浅谈Deep Learning的基本思想和方法 http://blog.csdn.net/xianlingmao/article/details/8478562 （34）深度神经网络 http://blog.csdn.net/txdb/article/details/6766373 （35）Google的猫脸识别:人工智能的新突破 http://www.36kr.com/p/122132.html （36）余凯，深度学习-机器学习的新浪潮，Technical News程序天下事 http://blog.csdn.net/datoubo/article/details/8577366 （37）Geoffrey Hinton：UCLTutorial on: Deep Belief Nets （38）Learning Deep Boltzmann Machines http://web.mit.edu/~rsalakhu/www/DBM.html （39）Efficient Sparse Coding Algorithm http://blog.sina.com.cn/s/blog_62af19190100gux1.html （40）Itamar Arel, Derek C. Rose, and Thomas P. Karnowski： Deep Machine Learning—A New Frontier in Artificial Intelligence Research （41）Francis Quintal Lauzon：An introduction to deep learning （42）Tutorial on Deep Learning and Applications （43）Boltzmann神经网络模型与学习算法 http://wenku.baidu.com/view/490dcf748e9951e79b892785.html （44）Deep Learning 和 Knowledge Graph 引爆大数据革命 http://blog.sina.com.cn/s/blog_46d0a3930101fswl.html; 5690 次阅读|0 个评论

[转载]深度学习论文笔记：OverFeat: zhuwei3014 2014-11-17 23:02; 原文转自： http://blog.csdn.net/chenli2010/article/details/25204241 今天我们要谈论的文章为： OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. ICLR2014. 这是大牛Yann LeCun小组的文章。 openreview中有下载链接和讨论： http://openreview.net/document/cb1bf585-d8ae-4de7-aa0c-f47cdc763d8d#cb1bf585-d8ae-4de7-aa0c-f47cdc763d8d 引言：对于分类问题而言，一个常用的增加训练样本的方法是将训练样本随机移动一个小的位移，或者，等价的，在图像中随机取一些大的图像块。然后以这些图像块为输入训练分类模型。在测试阶段，可以采用滑窗的方法对每一个图像块进行分类，然后组合这些分类结果，得到一个置信度更高的类别标签。这种技巧被广泛运用于机器学习算法中，例如： cuda-convnet库的wiki： https://code.google.com/p/cuda-convnet/wiki/TrainingNet 瑞士一个研究组的文章： Multi-column Deep Neural Networks for Image Classiﬁcation . CVPR2012. 对于检测和定位问题，最自然（也是最常用的方法）就是采用滑窗对每一个图像块进行检测，从而确定目标物体的位置。以上解决分类、检测和定位的方法有一个共同的地方，就是需要一个滑窗对整幅图像进行密集采样，然后处理每一个采样得到的图像块。传统的处理这些图像块的方法是一个接一个处理。但是，CNN有更便捷的做法。文章内容： CNN模型：与Hinton小组的经典论文：ImageNet Classiﬁcation with Deep Convolutional Neural Networks. NIPS2012. 结构一致。区别在于：a) 训练时输入大小固定，测试时用多尺度输入（具体方法见下文）；b) 没有对比度归一化；c) 采用没有overlap的max pooling；d) 前面两层的feature map的个数比上述论文中多。模型结构如上图所示。其中，1-5层为feature extraction layers，为三个任务（recognition、localization 和detection）共用的部分，6-9层为classifier layers。训练过程的一些参数为：min-batch的大小为128， momentum = 0.6，weight decay = 1.0e-5，weight初始化为从(μ, σ) = (0, 0.01)的随机采样，等等。密集采样（滑窗）支持：对于一个训练好的CNN来说，CNN的结构（如CNN的层数、每一层feature map的个数，卷积层的kernel size 等等）是固定的，但是，每一层的feature map的大小是可以改变的。当测试样本和训练样本大小相同时，CNN 最后一层的每一个节点分别输出一个0~1的实数，代表测试样本属于某一类的概率；当测试样本比训练样本大时， CNN最后一层每一个节点的输出为一个矩阵，矩阵中的每一个元素表示对应的图像块属于某一类的概率，其结果相当于通过滑窗从图像中进行采样，然后分别对采样到的图像块进行操作：如上图所示，当输入图像为16x16时，output层的每一个节点输出为一个2x2的矩阵，相当于在原图像上的四个角分别采样了一个14x14的图像块，然后分别对每一个图像块进行处理。多尺度支持：传统的检测/定位算法是固定输入图像不变，采用不同大小的滑窗来支持不同尺度的物体。对于CNN来说，滑窗的大小就是训练时输入图像的大小，是不可以改变的。那么，CNN支持多尺度的办法就是，固定滑窗的大小，改变输入图像的大小。具体来说，对于一幅给定的待处理的图像，将图像分别resize到对应的尺度上，然后，在每一个尺度上执行上述的密集采样的算法，最后，将所有尺度上的结果combine起来，得到最终的结果。然而，对于上文所述的密集采样算法来说，采样的间隔（也就是滑窗的stride）等于执行所有卷积和pooling 操作后图像缩小的倍数，也就是所有卷积层和pooling层的stride的乘积。如果这个值较大的话，采样就显得 sparse，图像中的某些位置就无法采样得到。这时，或者减小卷积层和pooling层的stride，或者干脆去掉某个 pooling层，或者采用某种方法替代某个pooling层。文章采用的方法如上图所示。对于stride=3的pooling层，先做常规的pooling（Δ=0），再往右移一格做 pooling（Δ=1），再往右移一格做pooling（Δ=2）。这样做的结果相当于，在相邻的两个滑窗的中间进行插值；也可以理解为，将滑窗的stride减小了三分之一。这样，既实现了pooling操作，也减小了取样的间隔。分类问题测试过程：训练一个CNN并确定图像的六个尺度。给定一幅待识别的图像，先将图像resize到这六个不同的尺度，然后，将这六种尺度的图像以及对它们进行水平翻转后的图像（总共12幅图像）送到CNN中，产生12个CNN输出的结果。然后，对每一个CNN输出的结果进行空间域的非极大值抑制。（例如，对于一个类别数为1000的CNN来说，其输出为1000个矩阵。第m个矩阵的位于(x, y)位置的元素表示点(x, y)对应的图像块属于类别m的概率。对这个输出结果进行非极大值抑制，即对矩阵中所有元素求最大值。）然后，对于所有的非极大值抑制的结果（总共12个）按类别求平均值，然后取均值的Top-1或者Top-5。定位问题的模型和测试过程：定位问题的模型也是一个CNN，1-5层作为特征提取层和分类问题完全一样，后面接两个全连接层，组成 regressor network。训练时，前面5层的参数由classification network给定，只需要训练后面的两个全连接层。这个 regressor network的输出就是一个bounding box，也就是说，如果将一幅图像或者一个图像块送到这个regressor network中，那么，这个regressor network输出一个相对于这个图像或者图像块的区域，这个区域中包含感兴趣的物体。这个regressor network的最后一层是class specific的，也就是说，对于每一个class，都需要训练单独最后一层。这样，假设类别数有1000，则这个regressor network输出1000个bounding box，每一个bounding box对应一类。对于定位问题，测试时，在每一个尺度上同时运行classification network和regressor network。这样，对于每一个尺度来说，classification network给出了图像块的类别的概率分布，regressor network进一步为每一类给出了一个bounding box，这样，对于每一个bounding box，就有一个置信度与之对应。最后，综合这些信息，给出定位结果。（最近颇忙，后面的内容还是烦劳各位看论文吧。）; 个人分类: DL论文笔记|19278 次阅读|0 个评论

Caffe 实例测试一： MNIST: 热度 1 zhuwei3014 2014-11-13 19:19; 参考： http://caffe.berkeleyvision.org/gathered/examples/mnist.html 原笔记在印象笔记中，复制过来格式就乱了，原笔记地址： http://app.yinxiang.com/l/AARGaZ2rlLdEn7ikhvPcGA7hd7_EnB2OlIo/ caffe安装目录：home/user/caffe-master cd caffe-master download MNIST dataset 注意：新版caffe都需要从根目录上执行，不然可能会遇到这个错误： ./create_mnist.sh: 16: ./create_mnist.sh: build/examples/mnist/convert_mnist_data.bin: not found ./data/mnist/get_mnist.sh 生成mnist-train-leveldb/ 和 mnist-test-leveldb/文件夹，这里包含了LDB格式的数据集 ./examples/mnist/create_mnist.sh LeNet: caffe中用的模型结构是非常著名的手写体识别模型LeNet，唯一的区别是把其中的sigmoid激活函数换成了ReLU，整个结构中包含两个convolution layer、两个pooling layer和两个fully connected layer。结构定义在$caffe-master/examples/mnist/lenet_train_test.prototxt中。定义MNIST Network：该结构定义在lenet_train_test.prototxt中，需要对google protobuf有一定了解并且看过Caffe中protobuf的定义，其定义在$caffe-master/src/caffe/proto/caffe.proto。 protobuf简介： protobuf是google公司的一个开源项目，主要功能是把某种数据结构的信息以某种格式保存及传递，类似微软的XML，但是效率较高。目前提供C++、java和python的API。 protobuf简介：http://blog.163.com/jiang_tao_2010/blog/static/12112689020114305013458/ 使用实例：http://www.ibm.com/developerworks/cn/linux/l-cn-gpb/ 编写LeNet的protobuf： 1、首先给定一个名字 name: LeNet 2、数据层从生成的lmdb文件中读取MNIST数据，定义数据层如下： layers { name: mnist type: DATA data_param { source: mnist_train_lmdb backend: LMDB batch_size: 64 scale: 0.00390625 //归一化到 Creating Layer conv1I1203 net.cpp:76] conv1 - dataI1203 net.cpp:101] conv1 - conv1I1203 net.cpp:116] Top shape: 20 24 24I1203 net.cpp:127] conv1 needs backward computation. // 显示连接信息和输出信息 I1203 net.cpp:142] Network initialization done.I1203 solver.cpp:36] Solver scaffolding done.I1203 solver.cpp:44] Solving LeNet I1203 solver.cpp:204] Iteration 100, lr = 0.00992565 // learning rate printed every 100 iterationsI1203 solver.cpp:66] Iteration 100, loss = 0.26044 // loss function ...I1203 solver.cpp:84] Testing netI1203 solver.cpp:111] Test score #0: 0.9785 // test accuracy every 1000 iterations I1203 solver.cpp:111] Test score #1: 0.0606671 // test loss function I1203 solver.cpp:84] Testing netI1203 solver.cpp:111] Test score #0: 0.9897I1203 solver.cpp:111] Test score #1: 0.0324599I1203 solver.cpp:126] Snapshotting to lenet_iter_10000I1203 solver.cpp:133] Snapshotting solver state to lenet_iter_10000.solverstateI1203 solver.cpp:78] Optimization Done. 最终训练完的模型存储为一个二进制的protobuf文件： lenet_iter_10000 Caffe默认的训练是在GPU上的，但是如果想改为CPU上，只需要在lenet_solver.prototxt中修改一行： # solver mode: CPU or GPUsolver_mode: CPU; 个人分类: Caffe|27676 次阅读|1 个评论

Ubuntu+CUDA6.5+Caffe安装配置汇总: 热度 3 zhuwei3014 2014-11-7 17:23; 感谢欧新宇的分享，此配置贴大部分参考他的博客。 http://ouxinyu.github.io/Blogs/20140723001.html 此贴历经坎坷，一入DL深似海啊，配个caffe玩玩足足折腾了我半个多月，就在我想放弃之时，峰回路转，成功了，其中心酸只有自己知道啊。起初安装ubuntu因为引导问题折腾了一三天左右，然后各种方式安装及引导ubuntu手到擒来，半小时搞定。然后开始在笔记本安装cuda，出现各种问题不说，最终无法进入GUI折腾了近一个星期，其间重装了不下15次系统，最后才发现好像是optimus双显卡的问题，也罢，直接卸载也不愿在笔记本上折腾了。再然后在单显卡台机上安装cuda，非常顺利，接着开始安装各种依赖库及配置环境，遇到问题疯狂google（百度真心不行啊），这样折腾了有一个星期之久，最终遇到一个连google都搜不到的问题，思考半晌，考虑放弃这么高大上的东西了。在不忍删除辛苦安装的系统用再生龙备份之余，想想试试ubuntu12.04，结果虽然遇到不少问题，都曲折的解决了，最终一天终于搞定了，一把鼻涕一把泪啊！我想应该是属于比较倒霉的，跟着别人的教程按部就班，但是每一步都出现问题，本人又是linux新手，出现问题只能google，所以浪费了太长时间。现在回头想想，好像真的没什么地方很难解决的，理应一天时间搞完的，最多有一些版本之间的冲突，真心觉得时间花费的有点不值。总之，成功了，也是醉了。。。简单介绍一下：Caffe，一种Convolutional Neural Network的工具包，和Alex的cuda-convnet功能类似，但各有特点。都是使用C++ CUDA进行底层编辑，Python进行实现，原作不属于Ubuntu 12，也有大神发布了Windows版，但其他相关资料较少，不适合新手使用，所以还是Ubuntu的比较适合新手。本人为Linux新手，安装ubuntu和cuda折腾了一个多星期，起初是因为ubuntu安装导致引导失效，中途每次都需要手动引导进入系统，然后安装cuda失败后用ultraISO制作U盘启动重新安装才恢复正常。至于安装过程可以参考： http://blog.sciencenet.cn/home.php？mod=spaceuid=1583812do=blogid=839793 一、CUDA Toolkit的安装和调试这里其实可以参考nVidia 官方提供的CUDA安装手册，全英文的，我就是参考这个文档完成后面的配置和验证工作。https://developer.nvidia.com/rdp/cuda-65-rc-toolkit-download#linux。一般要输入你的用户名和密码，就是下载6.5的那个账号。 1、Verify You Have a CUDA-Capable GPU 执行下面的操作，然后验证硬件支持GPU CUDA，只要型号存在于 https://developer.nvidia.com/cuda-gpus ，就没问题了 $ lspci | grep -i nvidia 2、Verify You Have a Supported Version of Linux $ uname -m cat /etc/*release 重点是“ x86_64 ”这一项，保证是x86架构，64bit系统 3、Verify the System Has gcc Installed $ gcc --version 4、Download the NVIDIA CUDA Toolkit 下载地址： https://developer.nvidia.com/cuda-toolkit 在根目录下新建cuda_install文件夹，把run文件放进去 mkdir cuda_install 验证地址： https://developer.nvidia.com/rdp/cuda-rc-checksums $ md5sum filename 例如：md5sum cuda_6.5.14_linux_64.run，然后与官网核对 5、安装必要的一些库和头文件文件 sudo apt-get install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev 如果有依赖冲突的，建议分开安装。 6、Handle Conflicting Installation Methods 根据官网介绍，之前安装的版本都会有冲突的嫌疑所以，之前安装的Toolkit和Drievers就得卸载，屏蔽，等等（因为我是新系统，没有安装过nvidia驱动，所以此步可以省略） sudo apt-get --purge remove nvidia* 7、Graphical Interface Shutdown 退出GUI，也就是X-Win界面，操作方法是：同时按：CTRL+ALT+F1（F2-F6），切换到TTY1-6命令行模式。关闭桌面服务： $ sudo stop lightdm 8、Interaction with Nouveau 这是卡住本人将近一个星期的问题，我原来用的是笔记本，双显卡，装了不下二十次，不管按照何种方法，最终装完cuda之后图形界面就只剩下墙纸，只有鼠标可以动，进不了桌面还打不开终端，最后换了一个台式机，半天所有东西全部搞定。原来以为是nouveau过于顽固，怎么样都卸不掉，之后顿悟，可能是optimus显卡问题，默认3D渲染由nvidia独显完成，而2D渲染由intel集显完成，但是我的机子是华硕的，BIOS里面无法关闭集显（貌似thinkpad可以），所以没有进一步尝试，反正台式机环境搭好了。如果遇到以上问题，可以移步： http://wenku.baidu.com/link?url=hjEIoYx-spMxyrU-zy057bOBb4dtYUc7s6bj8CM-TTJ4-QPQTmc9KX3DQ0ZZCfhJpkar0To8y54Cc2gR8LwTOLRCQ8TS4iUUPXavaw7o2Eu 可能装了这个bumblebee显卡调节程序可能解决问题，也可以参考此贴： http://www.cnblogs.com/bsker/archive/2011/10/03/2198423.html 还有一个之前没找到的帖子，白白浪费了那么长时间。。。用prime解决这个问题 http://www.cnblogs.com/zhcncn/p/3989572.html 百度经验也有： http://jingyan.baidu.com/article/046a7b3efe8c58f9c27fa98b.html Nouveau是一个开源的显卡驱动，Ubuntu 14.04 默认安装了，但是它在nvidia驱动安装过程中会有冲突，所以要禁用它。以下是欧新宇同学的过程，反正我按照这个没有成功，大家可以试试，因为在第三步中我的boot文件夹里没有initramfs，只有initrd，重新生成initrd貌似不起作用，这就是linux新手的悲哀，出了问题完全不知道原因。如果有高人指点一下，小弟感激不尽！（1）将nouveau添加到黑名单，防止它启动 $ cd /etc/modprobe.d $ sudo vi nvidia-graphics-drivers.conf 写入：blacklist nouveau 保存并退出: wq! 检查： $ cat nvidia-graphics-drivers.conf （2）对于：/etc/default/grub，添加到末尾。 $ sudo vi /etc/default/grub 末尾写入：rdblacklist=nouveau nouveau.modeset=0 保存并退出: wq! 检查： $ cat /etc/default/grub （3）官网提供的操作：(感觉上这一小步，可以略过，不执行，执行了也会报错) $ sudo mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img 然后重新生成initrd文件 $ sudo dracut /boot/initramfs-$(uname -r).img $(uname -r) $ sudo update-initramfs -u 上面那条是nVidia官方提供的命令，不知道为什么在我这里会提示dracut是不存在的命令，也许是版本问题，或者少了什么包，不过无所谓，第二条命令也可以搞定，应该是一样的功能。我试过在ubuntu12.04下安装，只要修改/etc/modprobe.d/blacklist.conf就可以解决问题，可是ubuntu14.04中这个文件是只读的，所以我就给它添加了写的权限，强制修改了。 sudo chmod +w /etc/modprobe.d/blacklist.conf sudo vi /etc/modprobe.d/blacklist.conf 在里面加入: blacklist nouveau options nouveau modeset=0 由于试过很多种方法，最终是哪种方法成功禁用了nouveau，说实话我还真不记得了，大家可以互相交流。测试nouveau是否被禁用成功很简单：（1）重启之后明显感觉画质变差（2） lsmod | grep nouveau ，如果显示为空，那么就是卸载成功了。 9、Installation 默认情况下，可以跳过显卡驱动的安装，直接安装CUDA，因为它包含了Drivers，Toolkit和Sample三个部分，但是如果出现问题，可以尝试二次安装CUDA或者利用官方的显卡驱动，来进行处理。GTX显卡驱动的下载地址如下（Tesla版的驱动，请大家自己去nVidia的官网下载）：下载地址： http://www.geforce.cn/drivers $ sudo sh ./NVIDIA-Linux-x86_64-340.24.run (Optional) 切换到cuda_6.5.14_linux_64.run 所在的目录，然后执行安装命令： $sudo cd cuda_install $ sudo sh cuda_6.5.11_rc_linux_64.run 再次提醒，安装前一定要执行 md5sum ，如果不一样会导致安装的Sumary里显示Driver成功，Toolkit和Samples失败，需要重新下载run文件。这里会一路问你各种问题，基本上就是Accept-yes-Enter-yes-Enter-yes-Enter，接受协议，安装的默认位置确认。 10、驱动装完了，可以回到GUI界面了 $ sudo start lightdm （在这里又出现问题，开机重启后进不了GUI，估计是显卡版本有问题，本机配置Nvidia Quadro K600显卡，官网下载专用驱动，按以上步骤重新安装，在CUDA安装过程中的第一步提示是否安装显卡驱动选择no）检查显卡是否安装成功可以用命令 sudo apt-get install mesa-utils glxinfo | grep -i nvidia 11、POST-INSTALLATION ACTIONS 这一步就是验证一下安装是否正确，编译和完成以下CUDA自带的程序，建议做一下~ （1）Environment Setup $ export PATH=/usr/local/cuda-6.5/bin:$PATH $ export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64:$LD_LIBRARY_PATH 环境变量配置完，使用nvcc -V命令检查cuda是否安装正确，这里开始使用普通用户操作，始终提示没有安装cuda toolkit，最后改到root用户下就显示成功了。（2）(Optional) Install Writable Samples $ cuda-install-samples-6.5.sh 安装到Home下，搞定了之后可以在GUI下调整一下，主要是前面的要求，会有一个Sample的文件夹 NVIDIA_CUDA-6.5_Samples在Home的根目录下就ok了。因为后面编译测试各方面什么的方便。其实如果之前安装CUDA驱动和Toolkit一切正常，这一步基本可以省略，应该会自动建立，但检查一下无妨。（3）Verify the Installation a. 验证驱动的版本，其实主要是保证驱动程序已经安装正常了 $ cat /proc/driver/nvidia/version b. Compiling the Examples $ nvcc -V 不出意外的话应该会提示，nvcc没有安装，其实就是，nvidia-cuda-toolkit的编译器没有安装完整，总之，根据提示继续就好了 $ sudo apt-get install nvidia-cuda-toolkit 这里安装完，就可以编译了，切换目录到~/NVIDIA_CUDA-6.5_Samples： $ cd /home/username/NVIDIA_CUDA-6.5_Samples $ make c. Running the Binaries 运行编译好的文件，例如看看设备的基本信息和带宽信息： $ cd /bin/x86_64/linux/release $ ./deviceQuery $ ./bandwidthTest PS：如果测试的时候出现说运行版驱动和实际驱动不符，原因可能是因为后面安装的nvidia-cuda-toolkit更新了配置文件，所以和原始的Cuda-Samples的配置或者是驱动程序有变化，所以检测无法编译通过。考虑下面的解决方法：（1）卸载现有驱动 $ sudo nvidia-installer --uninstall （2）下载合适版本的驱动，并安装：下载地址： http://www.geforce.cn/drivers $ sudo sh ./NVIDIA-Linux-x86_64-340.24.run （3）重装CUDA Toolkit $ sudo sh cuda_6.5.11_rc_linux_64.run Nvidia Cuda安装结束二、Caffe的安装和测试对于Caffe的安装严格遵照官网的要求来： http://caffe.berkeleyvision.org/installation.html 1、安装BLAS 这里可以选择（ATLAS，MKL或者OpenBLAS），我这里使用MKL，首先下载并安装英特尔® 数学内核库 Linux* 版MKL，下载链接是： https://software.intel.com/en-us/intel-education-offerings ，可以下载Student版的，先申请，然后会立马收到一个邮件（里面有安装序列号），打开照着下载就行了。下载完之后，要把文件解压到home文件夹，或者其他的ext4的文件系统中。接下来是安装过程，先授权，然后安装： $ tar zxvf parallel_studio_xe_2015.tgz （如果你是直接拷贝压缩文件过来的） $ chmod a+x /home/username/ parallel_studio_xe_2015 -R $cd parallel_studio_xe_2015 $ sudo ./install_GUI.sh 然后进入图形安装模式，跟windows差不多，其中序列号就是邮箱发过来的那个。这里使用root权限安装。 $ sudo passwd root 2、MKL与CUDA的环境设置文件夹切换到/etc/ld.so.conf.d，并进行如下操作（1）新建intel_mkl.conf，并编辑之： $ cd /etc/ld.so.conf.d $ sudo vi intel_mkl.conf 加入：/opt/intel/lib/intel64 /opt/intel/mkl/lib/intel64 （2）新建cuda.conf，并编辑之： $ sudo vi cuda.conf 加入：/usr/local/cuda/lib64 /lib （3）完成lib文件的链接操作，执行： $ sudo ldconfig -v （这里我按照这样的方法最终编译出现cblas找不到的问题，应该是MKL安装有问题，但是又没办法解决，最终我就按照官网的方法安装了ATLAS sudo apt-get install libatlas-base-dev 一句话就搞定，虽然性能可能比不上MKL，但是将就着能用就行。） 3、安装OpenCV （1）这里我用他的方法发现报错，所以按照依赖包以正常方式安装 sudo apt-get install build-essential libgtk2.0-dev libavcodec-dev libavformat-dev libjpeg62-dev libtiff4-dev cmake libswscale-dev libjasper-dev 这里libtiff4-dev出现依赖错误，于是分开安装就解决了。（2）根据官网提示，还要安装python，于是： sudo apt-get install python-pip sudo apt-get install python-dev sudo apt-get install python-numpy （3）下载官网opencv压缩包，我下载的是opencv-3.0.0-alpha.zip，移动到主目录下，解压： unzip opencv-3.0.0-alpha 然后执行以下命令： cd opencv-3.0.0-alpha mkdir release cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local .. (这里可能会遇到CMakeList.txt找不到的问题，把“..”换成CMakeList.txt的所在目录opencv-3.0.0-alpha就可以了） make sudo make install 这个过程时间比较久，耐心等待。。。下面配置library，打开/etc/ld.so.conf.d/opencv.conf，加入/usr/local/lib： sudo su vi /etc/ld.so.conf.d/opencv.conf sudo ldconfig -v 然后更改变量： sudo gedit /etc/bash.bashrc 添加： PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig export PKG_CONFIG_PATH 至此opencv安装配置完成，最后随便写个hello.cpp，包含#include opencv2/core/core.hpp进行测试，在命令行输入： g++ hello.cpp -o hello `pkg-config --cflags --libs opencv` 编译不报错就说明配置正确，其中有个问题弄了半天要注意，这个命令中的单引号不是平常的单引号，而是键盘上tab键上面那个符号。（在14.04下opencv安装还算顺利，但是后来转到12.04下安装opencv-3.0.0出现一堆错误，折腾了很长时间，最终换成opencv-2.4.9，很快解决中途可能会遇到这个错误： opencv- 2.4 . 9 /modules/gpu/src/nvidia/core/NCVPixelOperations.hpp( 51 ): error: a storage class is not allowed in an explicit specialization 参考 http://code.opencv.org/issues/3814 ，重新下载 NCVPixelOperations.hpp 取代opencv2.4.9中的即可。如果遇到这个错误，参考 http://www.foreverlee.net/ /usr/bin/ld: cannot find -lcufft /usr/bin/ld: cannot find -lnpps /usr/bin/ld: cannot find -lnppi /usr/bin/ld: cannot find -lnppc /usr/bin/ld: cannot find -lcudart 编译命令改为 g++ -L /usr/local/cuda/lib64/ hello.cpp -o hello `pkg-config --cflags --libs opencv` ） 4、安装其他依赖项（1） Google Logging Library（glog），下载地址： https://code.google.com/p/google-glog/ ，然后解压安装： $ tar zxvf glog-0.3.3.tar.gz $cd glog-0.3.3 $ ./configure $ make $ sudo make install 如过没有权限就chmod a+x glog-0.3.3 -R , 或者索性 chmod 777 glog-0.3.3 -R , 装完之后，这个文件夹就可以kill了。（历经坎坷，最终ubuntu14.04由于不知名错误实在无法解决，投入到了ubuntu12.02的怀抱。这里需要安装另外两个依赖项：gflags、lmdb。不装之后编译会出问题。参考： http://www.shwley.com/index.php/archives/52/ # glog wget https://google-glog.googlecode.com/files/glog- 0.3 .3 .tar.gz tar zxvf glog- 0.3 .3 .tar.gz cd glog- 0.3 .3 ./configure make make install # gflags wget https://github.com/schuhschuh/gflags/archive/master.zip unzip master.zip cd gflags-master mkdir build cd build export CXXFLAGS = -fPIC cmake .. make VERBOSE = 1 make sudo make install # lmdb git clone git://gitorious.org/mdb/mdb.git cd mdb/libraries/liblmdb make make install ）（2）其他依赖项，确保都成功 $ sudo apt-get install libprotobuf-dev libleveldb-dev libsnappy-dev libopencv-dev libboost-all-dev libhdf5-serial-dev 如果安装过程中出现错误，E: Sub-process /usr/bin/dpkg returned an error code (1)，可能是因为sudo apt-get install出现到意外，不用着急，可以试试这个解决办法：（我没有遇到这个问题） $ cd /var/lib/dpkg $ sudo mv info info.bak $ sudo mkdir info $ sudo apt-get --reinstall install libprotobuf-dev libleveldb-dev libsnappy-dev libopencv-dev libboost-all-dev libhdf5-serial-dev 如果使用的是2014年9月之后的新版Caffe,对于ubuntu 14.04来说，需要安装以下依赖文件： $ sudo apt-get install libgflags-dev libgoogle-glog-dev liblmdb-dev protobuf-compiler 5、安装Caffe并测试 1. 切换到Caffe的下载文件夹，然后执行： $ cp Makefile.config.example Makefile.config 修改新生成的Makefile.config文件，修改“ BLAS := mkl ”（我这里装的就是ATLAS，所以不用改，使用默认配置就行）。希望使用nVidia开发的cuDNN来加速Caffe模型运算的同学，在安装完cuDNN之后，确保Makefile.config文件中的 USE_CUDNN := 1 处于启用状态。幸运的是，新版的Caffe已经默认集成了cuDNN的库文件，不需要做额外的设置了。 cuDNN的安装方法如下： cuDNN Introdution and Download 下载cuDNN之后解压，进入解压后的文件夹： $ sudo cp cudnn.h /usr/local/include $ sudo cp libcudnn.so /usr/local/lib $ sudo cp libcudnn.so.6.5 /usr/local/lib $ sudo cp libcudnn.so.6.5.18 /usr/local/lib 链接cuDNN的库文件 $ sudo ln -sf /usr/local/lib/libcudnn.so.6.5.18 /usr/local/lib/libcudnn.so.6.5 不做链接，可能会出现这个报错：“./build/tools/caffe: error while loading shared libraries: libcudnn.so.6.5: cannot open shared object file: No such file or directory”那是因为cuDNN没有链接成功，只能做一下硬链接。下面可以编译caffe-master了！！！ $ make all $ make test $ make runtest 这里出现libcudnn.so.6.5：cannot open shared object file，查看LD_LIBRARY_PATH发现环境变量没问题，折腾了半天发现cuda的配置文件没有加进去，就是上面安装MKL时候的cuda.conf忘了写了。错误Fixed： 1. 如果提示： make: protoc: 命令未找到，那是因为protoc没有安装，安装一下就好了。 $ sudo apt-get install protobuf-c-compiler protobuf-compiler 2. (该问题已经在9月以后的Caffe中得到作者修复)提示“src/caffe/util/math_functions.cu(140): error: calling a host function(std::signbit ) from a globalfunction(caffe::sgnbit_kernel ) is not allowed” 解决办法：修改 ./include/caffe/util/math_functions.hpp 224行删除(注释)：using std::signbit; 修改：DEFINE_CAFFE_CPU_UNARY_FUNC(sgnbit, y = signbit(x )); 为：DEFINE_CAFFE_CPU_UNARY_FUNC(sgnbit, y = std::signbit(x )); 得到作者，大神Yangqing Jia的回复，解决方法如上，没有二致。六、使用MNIST数据集进行测试 Caffe默认情况会安装在$CAFFE_ROOT，就是解压到那个目录，例如：$ home/username/caffe-master，所以下面的工作，默认已经切换到了该工作目录。下面的工作主要是，用于测试Caffe是否工作正常，不做详细评估。具体设置请参考官网：http://caffe.berkeleyvision.org/gathered/examples/mnist.html 1. 数据预处理可以用下载好的数据集，也可以重新下载，我网速快，这里就偷懒直接下载了，具体操作如下： $ cd data/mnist $ sudo sh ./get_mnist.sh 2. 重建LDB文件，就是处理二进制数据集为Caffe识别的数据集，以后所有的数据，包括jpe文件都要处理成这个格式 $ cd examples/mnist $ sudo sh ./create_mnist.sh 生成mnist-train-leveldb/ 和 mnist-test-leveldb/文件夹，这里包含了LDB格式的数据集 PS: 这里可能会遇到一个报错信息： Creating lmdb... ./create_mnist.sh: 16: ./create_mnist.sh: build/examples/mnist/convert_mnist_data.bin: not found 解决方法是，直接到Caffe-master的根目录执行，实际上新版的Caffe，基本上都得从根目录执行。 ~/caffe-master$ sudo sh examples/mnist/create_mnist.sh 3. 训练mnist $ sudo sh examples/mnist/train_lenet.sh 至此，Caffe安装的所有步骤完结，下面是一组简单的数据对比，实验来源于MNIST数据集，主要是考察一下不同系统下CPU和GPU的性能。可以看到明显的差别了，虽然MNIST数据集很简单，相信复杂得数据集，差别会更大，Ubuntu+GPU是唯一的选择了。测试平台：i7-4770K/16G/GTX 770/CUDA 6.5 MNIST Windows8.1 on CPU：620s MNIST Windows8.1 on GPU：190s MNIST Ubuntu 14.04 on CPU：270s MNIST Ubuntu 14.04 on GPU：160s MNIST Ubuntu 14.04 on GPU with cuDNN：35s Cifar10_full on GPU wihtout cuDNN：73m45s = 4428s　（Iteration 70000） Cifar10_full on GPU with cuDNN：20m7s = 1207s　（Iteration 70000）; 个人分类: Caffe|39747 次阅读|4 个评论

Wiley Health Learning限时免费在线培训【皮肤病学】: WileyChina 2014-9-11 17:54; Wiley Health Learning 提供连续的教育活动，帮助您的临床实践。我们的最前沿的临床学习活动是来自最受信赖出版物里的循证医学内容，便于您将来的使用。只要您在线，就可以开始、暂存或者完成学习活动；如果您想获取证书，只需要到电子商店完成任务。 Wiley Health Learning 给您高质量教育，提高您的医疗水平。目前我们向亚太地区的皮肤病学家提供免费在线培训项目： “Guidelines for management of androgenetic alopecia based onBASP classification–the Asian consensus committee guideline” （基于BASP分类的雄激素源性脱发的管理指南——亚洲共识委员会指南） Journal of the European Academy of Dermatology and Venereology . August, 2013 活动类型：基于期刊的继续医学教育注册并开始浏览我们在WileyHealthLearning上推出的所有学习计划！了解有关在Wiley Health Learning创建您自己的网络学习计划的更多信息！浏览本页面，了解 Wiley Health Learning 上新的学习活动的更新内容！; 个人分类: Health Science|1684 次阅读|0 个评论

Neural Networks, Manifolds, and Topology: bigdataage 2014-8-12 13:13; Neural Networks, Manifolds, and Topology Posted on April 6, 2014 ( colah's blog ) topology, neural networks, deep learning, manifold hypothesis Recently, there’s been a great deal of excitement and interest in deep neural networks because they’ve achieved breakthrough results in areas such as computer vision. 1 However, there remain a number of concerns about them. One is that it can be quite challenging to understand what a neural network is really doing. If one trains it well, it achieves high quality results, but it is challenging to understand how it is doing so. If the network fails, it is hard to understand what went wrong. While it is challenging to understand the behavior of deep neural networks in general, it turns out to be much easier to explore low-dimensional deep neural networks – networks that only have a few neurons in each layer. In fact, we can create visualizations to completely understand the behavior and training of such networks. This perspective will allow us to gain deeper intuition about the behavior of neural networks and observe a connection linking neural networks to an area of mathematics called topology. A number of interesting things follow from this, including fundamental lower-bounds on the complexity of a neural network capable of classifying certain datasets. A Simple Example Let’s begin with a very simple dataset, two curves on a plane. The network will learn to classify points as belonging to one or the other. The obvious way to visualize the behavior of a neural network – or any classification algorithm, for that matter – is to simply look at how it classifies every possible data point. We’ll start with the simplest possible class of neural network, one with only an input layer and an output layer. Such a network simply tries to separate the two classes of data by dividing them with a line. That sort of network isn’t very interesting. Modern neural networks generally have multiple layers between their input and output, called “hidden” layers. At the very least, they have one. Diagram of a simple network from Wikipedia As before, we can visualize the behavior of this network by looking at what it does to different points in its domain. It separates the data with a more complicated curve than a line. With each layer, the network transforms the data, creating a new representation . 2 We can look at the data in each of these representations and how the network classifies them. When we get to the final representation, the network will just draw a line through the data (or, in higher dimensions, a hyperplane). In the previous visualization, we looked at the data in its “raw” representation. You can think of that as us looking at the input layer. Now we will look at it after it is transformed by the first layer. You can think of this as us looking at the hidden layer. Each dimension corresponds to the firing of a neuron in the layer. The hidden layer learns a representation so that the data is linearly seperable Continuous Visualization of Layers In the approach outlined in the previous section, we learn to understand networks by looking at the representation corresponding to each layer. This gives us a discrete list of representations. The tricky part is in understanding how we go from one to another. Thankfully, neural network layers have nice properties that make this very easy. There are a variety of different kinds of layers used in neural networks. We will talk about tanh layers for a concrete example. A tanh layer tanh ( W x + b ) consists of: A linear transformation by the “weight” matrix W A translation by the vector b Point-wise application of tanh. We can visualize this as a continuous transformation, as follows: The story is much the same for other standard layers, consisting of an affine transformation followed by pointwise application of a monotone activation function. We can apply this technique to understand more complicated networks. For example, the following network classifies two spirals that are slightly entangled, using four hidden layers. Over time, we can see it shift from the “raw” representation to higher level ones it has learned in order to classify the data. While the spirals are originally entangled, by the end they are linearly separable. On the other hand, the following network, also using multiple layers, fails to classify two spirals that are more entangled. It is worth explicitly noting here that these tasks are only somewhat challenging because we are using low-dimensional neural networks. If we were using wider networks, all this would be quite easy. (Andrej Karpathy has made a nice demo based on ConvnetJS that allows you to interactively explore networks with this sort of visualization of training!) Topology of tanh Layers Each layer stretches and squishes space, but it never cuts, breaks, or folds it. Intuitively, we can see that it preserves topological properties. For example, a set will be connected afterwards if it was before (and vice versa). Transformations like this, which don’t affect topology, are called homeomorphisms. Formally, they are bijections that are continuous functions both ways. Theorem : Layers with N inputs and N outputs are homeomorphisms, if the weight matrix, W , is non-singular. (Though one needs to be careful about domain and range.) Proof : Let’s consider this step by step: Let’s assume W has a non-zero determinant. Then it is a bijective linear function with a linear inverse. Linear functions are continuous. So, multiplying by W is a homeomorphism. Translations are homeomorphisms tanh (and sigmoid and softplus but not ReLU) are continuous functions with continuous inverses. They are bijections if we are careful about the domain and range we consider. Applying them pointwise is a homemorphism Thus, if W has a non-zero determinant, our layer is a homeomorphism. ∎ This result continues to hold if we compose arbitrarily many of these layers together. Topology and Classification A is red, B is blue Consider a two dimensional dataset with two classes A , B ⊂ R 2 : A = { x | d ( x , 0 ) 1 / 3 } B = { x | 2 / 3 d ( x , 0 ) 1 } Claim : It is impossible for a neural network to classify this dataset without having a layer that has 3 or more hidden units, regardless of depth. As mentioned previously, classification with a sigmoid unit or a softmax layer is equivalent to trying to find a hyperplane (or in this case a line) that separates A and B in the final represenation. With only two hidden units, a network is topologically incapable of separating the data in this way, and doomed to failure on this dataset. In the following visualization, we observe a hidden representation while a network trains, along with the classification line. As we watch, it struggles and flounders trying to learn a way to do this. For this network, hard work isn’t enough. In the end it gets pulled into a rather unproductive local minimum. Although, it’s actually able to achieve ∼ 80 % classification accuracy. This example only had one hidden layer, but it would fail regardless. Proof : Either each layer is a homeomorphism, or the layer’s weight matrix has determinant 0. If it is a homemorphism, A is still surrounded by B , and a line can’t separate them. But suppose it has a determinant of 0: then the dataset gets collapsed on some axis. Since we’re dealing with something homeomorphic to the original dataset, A is surrounded by B , and collapsing on any axis means we will have some points of A and B mix and become impossible to distinguish between. ∎ If we add a third hidden unit, the problem becomes trivial. The neural network learns the following representation: With this representation, we can separate the datasets with a hyperplane. To get a better sense of what’s going on, let’s consider an even simpler dataset that’s 1-dimensional: A = B = ∪ Without using a layer of two or more hidden units, we can’t classify this dataset. But if we use one with two units, we learn to represent the data as a nice curve that allows us to separate the classes with a line: What’s happening? One hidden unit learns to fire when x − 1 2 and one learns to fire when x 1 2 . When the first one fires, but not the second, we know that we are in A. The Manifold Hypothesis Is this relevant to real world data sets, like image data? If you take the manifold hypothesis really seriously, I think it bears consideration. The manifold hypothesis is that natural data forms lower-dimensional manifolds in its embedding space. There are both theoretical 3 and experimental 4 reasons to believe this to be true. If you believe this, then the task of a classification algorithm is fundamentally to separate a bunch of tangled manifolds. In the previous examples, one class completely surrounded another. However, it doesn’t seem very likely that the dog image manifold is completely surrounded by the cat image manifold. But there are other, more plausible topological situations that could still pose an issue, as we will see in the next section. Links And Homotopy Another interesting dataset to consider is two linked tori, A and B . Much like the previous datasets we considered, this dataset can’t be separated without using n + 1 dimensions, namely a 4 th dimension. Links are studied in knot theory, an area of topology. Sometimes when we see a link, it isn’t immediately obvious whether it’s an unlink (a bunch of things that are tangled together, but can be separated by continuous deformation) or not. A relatively simple unlink. If a neural network using layers with only 3 units can classify it, then it is an unlink. (Question: Can all unlinks be classified by a network with only 3 units, theoretically?) From this knot perspective, our continuous visualization of the representations produced by a neural network isn’t just a nice animation, it’s a procedure for untangling links. In topology, we would call it an ambient isotopy between the original link and the separated ones. Formally, an ambient isotopy between manifolds A and B is a continuous function F : × X → Y such that each F t is a homeomorphism from X to its range, F 0 is the identity function, and F 1 maps A to B . That is, F t continuously transitions from mapping A to itself to mapping A to B . Theorem : There is an ambient isotopy between the input and a network layer’s representation if: a) W isn’t singular, b) we are willing to permute the neurons in the hidden layer, and c) there is more than 1 hidden unit. Proof : Again, we consider each stage of the network individually: The hardest part is the linear transformation. In order for this to be possible, we need W to have a positive determinant. Our premise is that it isn’t zero, and we can flip the sign if it is negative by switching two of the hidden neurons, and so we can guarantee the determinant is positive. The space of positive determinant matrices is path-connected , so there exists p : → G L n ( R ) 5 such that p ( 0 ) = I d and p ( 1 ) = W . We can continually transition from the identity function to the W transformation with the function x → p ( t ) x , multiplying x at each point in time t by the continuously transitioning matrix p ( t ) . We can continually transition from the identity function to the b translation with the function x → x + t b . We can continually transition from the identity function to the pointwise use of σ with the function: x → ( 1 − t ) x + t σ ( x ) . ∎ I imagine there is probably interest in programs automatically discovering such ambient isotopies and automatically proving the equivalence of certain links, or that certain links are separable. It would be interesting to know if neural networks can beat whatever the state of the art is there. (Apparently determining if knots are trivial is NP. This doesn’t bode well for neural networks.) The sort of links we’ve talked about so far don’t seem likely to turn up in real world data, but there are higher dimensional generalizations. It seems plausible such things could exist in real world data. Links and knots are 1 -dimensional manifolds, but we need 4 dimensions to be able to untangle all of them. Similarly, one can need yet higher dimensional space to be able to unknot n -dimensional manifolds. All n -dimensional manifolds can be untangled in 2 n + 2 dimensions. 6 (I know very little about knot theory and really need to learn more about what’s known regarding dimensionality and links. If we know a manifold can be embedded in n-dimensional space, instead of the dimensionality of the manifold, what limit do we have?) The Easy Way Out The natural thing for a neural net to do, the very easy route, is to try and pull the manifolds apart naively and stretch the parts that are tangled as thin as possible. While this won’t be anywhere close to a genuine solution, it can achieve relatively high classification accuracy and be a tempting local minimum. It would present itself as very high derivatives on the regions it is trying to stretch, and sharp near-discontinuities. We know these things happen. 7 Contractive penalties, penalizing the derivatives of the layers at data points, are the natural way to fight this. 8 Since these sort of local minima are absolutely useless from the perspective of trying to solve topological problems, topological problems may provide a nice motivation to explore fighting these issues. On the other hand, if we only care about achieving good classification results, it seems like we might not care. If a tiny bit of the data manifold is snagged on another manifold, is that a problem for us? It seems like we should be able to get arbitrarily good classification results despite this issue. (My intuition is that trying to cheat the problem like this is a bad idea: it’s hard to imagine that it won’t be a dead end. In particular, in an optimization problem where local minima are a big problem, picking an architecture that can’t genuinely solve the problem seems like a recipe for bad performance.) Better Layers for Manipulating Manifolds? The more I think about standard neural network layers – that is, with an affine transformation followed by a point-wise activation function – the more disenchanted I feel. It’s hard to imagine that these are really very good for manipulating manifolds. Perhaps it might make sense to have a very different kind of layer that we can use in composition with more traditional ones? The thing that feels natural to me is to learn a vector field with the direction we want to shift the manifold: And then deform space based on it: One could learn the vector field at fixed points (just take some fixed points from the training set to use as anchors) and interpolate in some manner. The vector field above is of the form: F ( x ) = v 0 f 0 ( x ) + v 1 f 1 ( x ) 1 + f 0 ( x ) + f 1 ( x ) Where v 0 and v 1 are vectors and f 0 ( x ) and f 1 ( x ) are n-dimensional gaussians. This is inspired a bit by radial basis functions . K-Nearest Neighbor Layers I’ve also begun to think that linear separability may be a huge, and possibly unreasonable, amount to demand of a neural network. In some ways, it feels like the natural thing to do would be to use k-nearest neighbors (k-NN). However, k-NN’s success is greatly dependent on the representation it classifies data from, so one needs a good representation before k-NN can work well. As a first experiment, I trained some MNIST networks (two-layer convolutional nets, no dropout) that achieved ∼ 1 % test error. I then dropped the final softmax layer and used the k-NN algorithm. I was able to consistently achieve a reduction in test error of 0.1-0.2%. Still, this doesn’t quite feel like the right thing. The network is still trying to do linear classification, but since we use k-NN at test time, it’s able to recover a bit from mistakes it made. k-NN is differentiable with respect to the representation it’s acting on, because of the 1/distance weighting. As such, we can train a network directly for k-NN classification. This can be thought of as a kind of “nearest neighbor” layer that acts as an alternative to softmax. We don’t want to feedforward our entire training set for each mini-batch because that would be very computationally expensive. I think a nice approach is to classify each element of the mini-batch based on the classes of other elements of the mini-batch, giving each one a weight of 1/(distance from classification target). 9 Sadly, even with sophisticated architecture, using k-NN only gets down to 5-4% test error – and using simpler architectures gets worse results. However, I’ve put very little effort into playing with hyper-parameters. Still, I really aesthetically like this approach, because it seems like what we’re “asking” the network to do is much more reasonable. We want points of the same manifold to be closer than points of others, as opposed to the manifolds being separable by a hyperplane. This should correspond to inflating the space between manifolds for different categories and contracting the individual manifolds. It feels like simplification. Conclusion Topological properties of data, such as links, may make it impossible to linearly separate classes using low-dimensional networks, regardless of depth. Even in cases where it is technically possible, such as spirals, it can be very challenging to do so. To accurately classify data with neural networks, wide layers are sometimes necessary. Further, traditional neural network layers do not seem to be very good at representing important manipulations of manifolds; even if we were to cleverly set weights by hand, it would be challenging to compactly represent the transformations we want. New layers, specifically motivated by the manifold perspective of machine learning, may be useful supplements. (This is a developing research project. It’s posted as an experiment in doing research openly. I would be delighted to have your feedback on these ideas: you can comment inline or at the end. For typos, technical errors, or clarifications you would like to see added, you are encouraged to make a pull request on github .) Acknowledgments Thank you to Yoshua Bengio, Michael Nielsen, Dario Amodei, Eliana Lorch, Jacob Steinhardt, and Tamsyn Waterhouse for their comments and encouragement. This seems to have really kicked off with Krizhevsky et al. , (2012) , who put together a lot of different pieces to achieve outstanding results. Since then there’s been a lot of other exciting work. ↩ These representations, hopefully, make the data “nicer” for the network to classify. There has been a lot of work exploring representations recently. Perhaps the most fascinating has been in Natural Language Processing: the representations we learn of words, called word embeddings, have interesting properties. See Mikolov et al. (2013) , Turian et al. (2010) , and, Richard Socher’s work . To give you a quick flavor, there is a very nice visualization associated with the Turian paper. ↩ A lot of the natural transformations you might want to perform on an image, like translating or scaling an object in it, or changing the lighting, would form continuous curves in image space if you performed them continuously. ↩ Carlsson et al. found that local patches of images form a klein bottle. ↩ G L n ( R ) is the set of invertible n × n matrices on the reals, formally called the general linear group of degree n . ↩ This result is mentioned in Wikipedia’s subsection on Isotopy versions . ↩ See Szegedy et al. , where they are able to modify data samples and find slight modifications that cause some of the best image classification neural networks to misclasify the data. It’s quite troubling. ↩ Contractive penalties were introduced in contractive autoencoders. See Rifai et al. (2011) . ↩ I used a slightly less elegant, but roughly equivalent algorithm because it was more practical to implement in Theano: feedforward two different batches at the same time, and classify them based on each other. ↩ 原文： http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/; 7670 次阅读|0 个评论

Deep big simple neural nets excel on hand-written digit...: zhuwei3014 2014-7-24 05:14; 2010_Deep big simple neural nets excel on hand-written digit recognition 首先看这篇文章： 2003_Best practice for convolutional neural networks applied to visual document analysis 这篇文章中心思想： Expanding training set with elastic deformation and affine transformation 本文的中心思想是运用仿射变换和形变扩大数据集，然后运用GPU加速训练庞大的简单神经网络，从而获得比较好的结果。实验结果：结论： 1. convolutional network is better than MLP 2. Elastic deformations have a considerable impact on performance both for 2-layer MLP and convolutional network 3. Cross-entropy seems better and faster than mean square error(MSE); 个人分类: DL论文笔记|2706 次阅读|0 个评论

An analysis of single-layer network in unsupervised...: zhuwei3014 2014-7-24 05:11; 2011_An analysis of single-layer network in unsupervised feature learning 此文详细研究了单层网络的几个因素，分析很到位，对初学者帮助较大。本文主要研究影响单层模型的几个因素：隐层节点数量、局部接收域大小、步长、白化。主要分析对比了四个单层结构： sparse RBM sparse auto-encoder K-means clustering Gaussian Mixture Model（GMM）实验数据集：CIFAR-10、NORB 实验结果：; 个人分类: DL论文笔记|2906 次阅读|0 个评论

What is the best multi-stage architecture for OR？: zhuwei3014 2014-7-24 05:08; ICCV_2009_What is the best multi-stage architecture for object recognition? 个人认为这是一篇很好的分析型文章，思路浅显易懂，实验验证也很充分。本文主要在于研究三个问题： 1. 滤波器后面的非线性对识别率有什么影响？ 2. 有监督学习与无监督学习，以及hard-wired与random的filter有什么影响？ 3. 两层结构是否优于单层结构？本文用PSD无监督学习方法，对几个因素进行组合比较，主要有一下几种组合：卷积层filter bank layer：修正层rectification layer：局部对比度归一层local contrast normalization layer：最大值池化或均值池化max-pooling or subsampling layer：四种结构：、、、训练方法： Random Features and Supervised Classifier： R and RR Unsupervised Features, Supervised Classifier： U and UU Random Features, Global Supervised Refinement： R+ and R+R+ Unsupervised Feature, Global Supervised Refinement： U+ and U+U+ 实验数据：Caltech-101 实验对比结果图： On NORB dataset：; 个人分类: DL论文笔记|3457 次阅读|0 个评论

Sparse DBN-Sparse deep belief net model for visual area V2: zhuwei3014 2014-7-24 05:04; NIPS_2007_Sparse deep belief net model for visual area V2 这篇文章主要讲的是sparse DBN。很多比较学习算法的结果与V1区域相似的工作，但是没有与大脑视觉体系更深层次的比较，比如V2、V4，这篇文章量化的比较了sparse DBN与V2学习的特征，V2的结果引用自这篇文章： M. Ito and H. Komatsu. Representation of angles embedded within contour stimuli in area v2 of macaque monkeys. The Journal of Neuroscience, 24(13):3313–3324, 2004. 1. Introduction J. H. van Hateren and A. van der Schaaf. Independent component filters of natural images compared with simple cells in primary visual cortex. Proc.R.Soc.Lond. B, 265:359–366, 1998. 这篇文章研究表明在自然图像中ICA学习到的filter与V1中简单细胞的局部接受野非常相似。 2. Biological comparison 2.1 Features in early visual cortex: area V1 V1中简单细胞的局部接收野是localized, oriented, bandpass filters that resemble gabor filters. 2.2 Features in visual cortex area V2 J. B. Levitt, D. C. Kiper, and J. A. Movshon. Receptive fields and functional architecture of macaque v2. Journal of Neurophysiology, 71(6):2517–2542, 1994. 这篇文章的研究暗示了area V2 may serve as a place where different channels of visual information are integrated. 接下来讲解了第一节中那篇文章对V2中细胞的选择性的分析。 3. Algorithm 3.1 Sparse RBM Gaussian RBM能量函数： compute conditional probability distribution：是高斯密度。加入稀疏惩罚，最终优化问题变为：其中是给定数据的conditional expectation，是regularization constant，p是一个常数控制稀疏程度。 3.2 Learning deep networks using sparse RBM 跟DBN的思想一致，本文学习了含有两个隐层的网络。 4. Visualization 4.1 Learning strokes from handwritten digits 首先PCA降维到69维，然后用69-200结构学习出的结果： 4.2 Learning from natural images 用http://hlab.phys.rug.nl/imlib/index.html的自然图片学习，从2000张图片中抽取100000个14*14的patches，200个patches作为一个mini-batch，用196-400的结构学习得到的结果类似V1： 4.3 Learning a two-layer model of natural images using sparse RBMs 5. Evaluation experiments; 个人分类: DL论文笔记|3619 次阅读|0 个评论

2010_Tutorial_Feature Learning for Image Classification笔记: zhuwei3014 2014-7-24 04:59; 这篇 ECCV2010上的 tutorial，由余凯和Andrew Ng两位大神做的，我在此把ppt中一些摘要整理一下，供大家参考。 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% The quality of visual features is crucial for a wide range of computer vision topics, e.g., scene classification, object recognition, and object detection, which are very popular in recent computer vision venues. All these image classification tasks have traditionally relied on hand-crafted features to try to capture the essence of different visual patterns. Fundamentally, a long-term goal in AI research is to build intelligent systems that can automatically learn meaningful feature representations from a massive amount of image data. We believe a comprehensive coverage of the latest advances on image feature learning will be of broad interest to ECCV attendees. The primary objective of this tutorial is to introduce a paradigm of feature learning from unlabeled images, with an emphasis on applications to supervised image classification. We provide a comprehensive coverage of recently developed algorithms for learning powerful sparse nonlinear features , and showcase their superior performance on a number of challenging image classification benchmarks, including Caltech101, PASCAL, and the recent large-scale problem ImageNet. Furthermore, we describe deep learning and a variety of deep learning algorithms, which learn rich feature hierarchies from unlabeled data and can capture complex invariance in visual patterns. 1. Introduction where do we get low-level representation from? 2. State-of-the-art Image Classification Methods (1) Features (2) Discriminative Methods a. bag of words Issues: Spatial information is missed. b. Spatial Pyramid Pooling (3) Generative Model 3. Image Classification Using Sparse Coding V1中的处理和Gabor小波变换类似，做边缘检测。 Sparse Coding大致意思是寻找一组基向量，使得所有input数据可以用这组基向量线性表示出来，而系数大部分为0，因此称为稀疏。这个方法假设边缘是一个场景最基本的元素，用这种方法可以得到一个比像素更简洁的，更高层的表示。主要步骤：但是这样依然不如SIFT，有三个方法改善：与SIFT结合就是把输入数据变为SIFT descriptors。与K-means方法比较，发现sparse coding就是K-means的一种soft版本，任何时候需要用k-means得到字典的时候，用sparse coding都会提升效果。 4. Advanced Topics on Image Classification Using Sparse Coding (1) Why sparse coding helps classification? A topic model view to sparse coding: A geometric view to sparse coding: 接下来讲述了SC在MNIST上的实验数据，当SC得到最小误差的时候，学习到的基向量就像数字，启发：研究数据中的几何特征可能对分类有帮助。 Local Coordinate Coding Applications: (2) Recent Advances in Sparse Coding for Image Classification 5. Learning Feature Hierachies and Deep Learning; 个人分类: DL论文笔记|3101 次阅读|0 个评论

SRBM-Modeling image patches with a directed hierarchy of MRF: zhuwei3014 2014-7-24 04:47; 2007_Modeling image patches with a directed hierarchy of Markov random fileds 这篇文章主要介绍了Semi-RBM，也就是半受限玻尔兹曼机，在RBM中加入可视层单元之间的连接，并且使用mean-field approximation。 1. Introduction 2.Learning deep belief nets: An overview 2.1 Restricted Boltzmann Machines 2.2 Compositions of experts 第一个隐层最终指向可视层形成图模型，那是因为我们保留了p(v|h)，但是丢弃了第一层RBM产生的p(h)。 3. Semi-restricted Boltzmann Machines SRBM就是加入可视层单元之间的连接，这样能量函数为：更新规则为： 4. Inference in a directed hierarchy of MRF's 5. Whitening without waiting 如果可视层单元的连接保证了重建的分布中的pairwise correlation，那么CD算法将会忽略这些相关性，因为CD算法是学习两个分布之间的区别。一副图，表示了是否有lateral connection学习到的特征： A图是普通RBM学习到的特征，B图是SRBM学习到的特征，输入层每个单元都与其他783个单元连接，只有局部有连接的connection可以产生大的weight。局部连接的像素产生一种中间亮四周暗的特征。 6. Modeling patches of natural images 6.1 Adapting Restricted Boltzmann machines to real-valued data 使用mean-field approximation，SRBM的学习方法以及运用到实数数据上与普通RBM一样。两个DBN模型，一个有lateral connection，一个没有，Van Hateren数据集，150000*20*20 patches，零均值归一化，ZCA。模型结构：400-2000-500-1000，参数具体见论文。; 个人分类: DL论文笔记|2628 次阅读|0 个评论

Efficient Learning of Sparse Representations with EB model: zhuwei3014 2014-7-24 04:39; 2006_Efficient Learning of Sparse Representations with an Energy-based Model 此文也是Deep learning三大breakthrough文章之一，其实就是稀疏的autoencoder，以下为摘要。 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 模型结构图：模型组成部分： encoder Sparsifying Logistic decoder 能量函数为：其中， Sparsifying Logistic是一个非线性模块，它的输入输出如下： i是code的第i个component，控制稀疏度，控制输出的饱和度（softness）另一种观点是类似于sigmoid函数，右边除以，得到：学习过程： Loss function：实验： 1. Feature Extraction from natural image patches dataset: Berkeley segmentation dataset 2. Feature Extraction from handwritten numerals 3. Learning Local Features for MNIST dataset 用文中叙述的方法预训练LeNet-5的第一层，将网络结构改为50-50-200-10，用5*5的image patches训练，得到50维的稀疏表示，用此参数初始化CNN的第一层。; 个人分类: DL论文笔记|3141 次阅读|0 个评论

NIPS_2006: Greedy layer-wise learning of deep networks笔记: zhuwei3014 2014-7-23 18:28; 这篇文章相信学习DL的人都知道，deep learning在2006年三大breakthrough文章之一，其主要思想是逐层贪婪训练方法，以下为论文部分摘抄及翻译。 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Problem: To train deep networks, gradient-based optimization starting from random initialization appears to often get stuck in poor solutions. 1. Introduction 对于shallow architecture模型，如SVM，对于d个输入，要有2^d个样本，才足够训练模型。当d增大的时候，这就产生了维数灾难问题。而多层神经网络能够避免这个问题： boolean functions (such as the function that computes the multiplication of two numbers from their d-bit representation) expressible by O(logd) layers of combinatorial logic with O(d) elements in eachlayer may require O(2^d)elements when expressed with only 2 layers. Three important aspects: 1. Pre-training one layer at a time in a greedy way. 2. Using unsupervised learning at each layer in order to preserve information from the inputs. 3. Fine-tuning the whole network with respect to the ultimate criterion of interest. 2. DBN 2.1 RBM 2.2 Gibbs Markov Chain and log-likelyhood gradient in a RBM RBMUpdate Algorithm 2.3 Greedy layer-wise training of a DBN 每当训练完一层RBM，向上叠加一层RBM，且输入层使用下一层RBM的输出作为输入。使用RBM中隐层的posterior distribution作为DBN中可视层的posterior distribution。贪婪学习的动机是一个部分DBN对最底层的表示比单个RBM要好。 TrainUnsupervisedDBN 其中i为层数 2.4 Fine-tuning wake-sleep algorithm or mean-field approximation TrainSupervisedDBN 这里C为squared error or cross entropy DBMSupervisedFineTuning 3. Extension to continuous-valued inputs 将输入向量进行归一化，转换为(0,1)区间的数，把它当做是二值单元变成1的概率，然后用RBM的方法进行训练。这种方法对灰度像素是有效的，但是可能对其他形式的输入无效。 4.Understand why layer-wise strategy works TrainGreedyAutoEncodingDeepNet n为各层单元数 TrainGreedySupervisedDeepNet Experiment 2 shows that greedy unsupervised layer-wise pre-training gives much better results than the standard way to train a deep network (with no greedy pre-training) or a shallownetwork, and that, without pre-training, deep networks tend to perform worse than shallow networks. 同样supervised pretraining要比unsupervised效果差，因为它太贪婪，可能的解释是在学习到的隐层表示中，它抛弃了目标的一些信息。 Experiment 3 将最顶层设置为只有20个单元，因为在实验2中training error都很小，基本看不出pretraining对optimization的帮助，那是因为即使没有很好的初始化，最底层和最高层还是组成了一个标准的浅层网络，他们能够保留足够的输入信息去适应训练集，但是对生成没有什么帮助。由实验室结果可以看出这个假设是正确的。 Continuous training of all layers of a DBN 我们希望连续的训练一个DBN而不是每次加一层，再决定迭代次数来训练。 To achieve this it is suf_cient to insert a line in TrainUnsupervisedDBN , so that RBMupdate is called on all the layers and the stochastic hidden values are propagated all the way up. The advantage is that we can now have a single stopping criterion (for the whole network). 具体怎么做文章没有细说。 5. Dealing with uncooperative input distributions 当输入分布于目标不太相关的情况下，例如x~p(x), p is Gaussian and target y=f(x)+noise,f=sinus，这时候在p和f之间没有特别的相关性，这时候无监督贪婪预训练就帮不上忙。这时候在训练每层时可以用一个混合的训练规则，结合无监督和有监督。 TrainPartiallySupervisedLayer; 个人分类: DL论文笔记|5841 次阅读|0 个评论

13-Representation Learning-A Review and New Perspectives笔记: zhuwei3014 2014-7-23 18:15; 此文又是Bengio大神的综述，2013年，内容更新，与09年那篇综述侧重点稍有不同，同样，需要完全读懂需要很好的基础和时间，反正我是很多地方懵懵懂懂，姑且将自己的翻译和摘取贴出来大家互相交流，如有不对的地方欢迎指正。 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 1. Introduction 2. Why should we care about learning representation? Although depth is an important part of the story, many other priors are interesting and can be conveniently captured by a learner when the learning problem is cast as one of learning a representation. 接下来几个领域的应用： Speech Recognition and Signal Processing Object Recognition MNIST最近的记录：Ciresan et al. (2012) 0.27% Rifai et al. (2011c) 0.81% ImageNet：Krizhevsky et al.(2012) 15.3% Natural Language Processing Multi-Task and Transfer Learning, Domain Adaptation 3. What makes a representation good? 3.1 Priors for representation learning in AI a. Smoothness b. Multiple explanatory factors: 数据形成的分布是由一些隐含factors生成的 c. A hierarchical organization of explanatory factors：变量的层次结构 d. Semi-supervised learning e. Shared factors across tasks f. Manifolds:流形学习，主要用于AE的研究 g. Natural clustering h. Temporal and spatial coherence i. Sparsity 这些priors都可以用来帮助learner学习representation。 3.2 Smoothness and the Curse of Dimensionality 机器学习局部非参数学习例如核函数，它们只能获取局部泛化，假设目标函数是足够平滑的，但这不足以解决维度灾难。这些叫smoothness based learner和线性模型仍然可以用在这些学习到的表示的顶层。实际上，学习一个表示和核函数等同于学习一个核。 3.3 Distributed representations 一个one-hot表示，例如传统的聚类算法、高斯混合模型。最近邻算法、高斯SVM等的结果都需要O(N)个参数来区分O(N)个输入区域。但是像 RBM、sparse coding、auto-encoder或者多层神经网络可以用O(N)个参数表示个输入区域。它们就是分布式表示。 3.4 Depth and abstraction 深度带来了两个明显的提高：促进特征的重复利用、潜在的获取更抽象的高层特征。 re-use：深度回路的一个重要性质是path的数量，它会相对于深度呈指数级增长。我们可以改变每个节点的定义来改变回路的深度。典型的计算包括：weighted,sum, product, artificial neuron model,computation of a kernel, or logic gates. Abstraction and invariance 更抽象的概念可以依靠less抽象的概念构造。例如在CNN中，通过pooling来构造这种抽象，更抽象的概念一般是对于大部分输入的局部变化具有不变性。 3.5 Disentangle factors of variation 最鲁棒的特征学习是尽可能多的解决factors，抛弃尽可能少的信息，类似降维。 3.6 What are good criteria for learning representations? 在分类中，目标很明显就是最小化错误分类，但是表示学习不一样，这是一个值得思考的问题。 4. Building deep representations 06年在特征学习方面有一个突破，中心思想就是贪婪单层非监督预训练，学习层次化的特征，每次一层，用非监督学习来学习组合之前学习的transformation。单层训练模式也可以用于监督训练，虽然效果不如非监督训练，但比没有预训练要好。下面几种方法组合单层网络来构成监督模型：层叠RBM形成DBN ，但是怎么去估计最大似然来优化生产模型目前还不是很清楚，一种选择是wake-sleep算法。另一种方法是结合RBM参数到DBM 中，基本上就是减半。还有一种方法是层叠RBM或者AE形成深层自编码器，另一种方法训练深层结构是iterative construction of a free energy function 5. Single-layer learning modules 在特征学习方面，有两条主线：一个根源于PGM，一个根源于NN。根本上这两个的区别在于每一层是描述为PGM还是computational graph，简单的说，就是隐层节点便是latent random variables还是computational nodes。 RBM是在PGM这一边，AE在NN这一边。在RBM中，用score matching训练模型本质上与AE的规则化重建目标是想通的。 PCA三种解释： a. 与PM相关，例如probabilistic PCA、factor analysis和传统的多元变量高斯分布。 b. 它本质上和线性AE相同 c. 它可以看成是线性流形学习的一种形式但是线性特征的表达能力有限，他不能层叠老获取更抽象的表示，因为组合线性操作最后还是产生线性操作。 6. Probabilistic Models Learning is conceived in term of estimating a set of model parameters that (locally) maximizes the likelihood of the training data with respect to the distribution over these latent variables. 6.1 Directed Graphical Models 6.1.1 Explaining away 一个事件先验独立的因素在给出了观察后变得不独立了，这样即使h是离散的，后验概率P(h|x)也变的intractable。 6.1.2 Probabilistic interpretation of PCA 6.1.3 Sparse coding 稀疏编码和PCA的区别在于包含一个惩罚来保证稀疏性。 Laplace分布（等同于L1惩罚）产生稀疏表示。与RBM和AE比较，稀疏编码中的推理包含了一个内在循环的优化也就增加了计算复杂度，稀疏编码中的编码对每一个样本是一个free variable，在那种意义上潜在的编码器是非参数的。 6.2 Undirected Graphical Models 无向图模型，也叫马尔科夫随机场（MRF），一种特殊形式叫BM，其能力方程： 6.2.1 RBM 6.3 Generalization of the RBM to real-valued data 最简单的方法是Gaussian RBM，但是 Ranzato2010 给出了更好的方法来model自然图像，叫做 mean and covariance RBM（mcRBM），这是covariance RBM和GRBM的结合。还介绍了mPoT模型。 Courville2011 提出了ssRBM在CIFAR-10上取得了很好的表现。这三个模型都用来model real-valued data，隐层单元不仅编码了数据的conditional mean，还是conditional covariance。不同于训练方法，这几个的区别在于怎么encode conditional covariance。 6.4 RBM parameter estimation 6.4.1 Contrastive Divergence 6.4.2 Stochastic Maximum Likelihood（SML/PCD）在每一步梯度更新时，不像CD算法中positive phase中gibbs sampling每次都更新，而是用前一次更新过的状态。但是当权值变大，估计的分布就有更多的峰值，gibbs chain就需要很长的时间mix。 Tieleman2009 提出一种fast-weight PCD（FPCD）算法用来解决这个问题。 6.4.3 Pseudolikelihood，Ratio-matching and other Inductive Principles Marlin2010 比较了CD和SML。 7. Direct Encoding: Learning a Parametric Map from Input to Representation 7.1 Auto-Encoders 特征提取函数和重构误差函数根据应用领域不同而不同。对于无界的输入，用线性解码器和平方重构误差。对于之间的输入，使用sigmoid函数。如果是二值输入，可以用binary cross-entropy误差。 binary cross-entropy：线性编码和解码得到类似于PCA的subspace，当使用sigmoid非线性编码器的时候也一样，但是权值如果是tied就不同了。如果编码器和解码器都使用sigmoid非线性函数，那么他得到类似binary RBM的特征。一个不同点在于RBM用一个单一的矩阵而AE在编码和解码使用不同的权值矩阵。在实际中AE的优势在于它定义了一个简单的tractable优化目标函数，可以用来监视进程。 7.2 Regularized Auto-encoders 传统的AE就像PCA一样，作为一种数据降维的方法，因此用的是 bottleneck 。而 RBM和sparse coding倾向于over-complete表示，这可能导致AE太简单（直接复制输入到特征）。这就需要regularization，可以把它看成让表示对输入的变化不太敏感。 7.2.1 Sparse Auto-encoders 当直接惩罚隐层单元时，有很多变体，但没有文章比较他们哪一个更好。虽然L1 penalty看上去最自然，但很少SAE的文章用到。一个类似的规则是student-t penalty 。详见UFLDL。 7.2.2 Denoising Auto-encoders Vincent2010 中考虑了对灰度图加入等向高斯噪声、椒盐噪声。 7.2.3 Contractive Auto-encoders Refai2011 提出了CAE与DAE动机一样，也是要学习鲁棒的特征，通过加入一个analytic contractive penalty term。令Jacobian matrix ，CAE的目标函数为：是一个超参数控制规则强度。如果是sigmoid encoder：三个与DAE不同的地方，也有紧密的联系，有小部分噪声的DAE可以看做是一种CAE，当惩罚项在全体重构误差上而不仅仅是encoder上。另外Refai又提出了CAE+H，加入了第三项使得与靠近：注意 CAE学习到的表示倾向于饱和而不是稀疏，也就是说大部分隐层单元靠近极值，它们的导数也很小。不饱和的单元很少，对输入敏感，与权值一起构成一组基表示样本的局部变化。在CAE中，权值矩阵是tied。 7.2.4 Predictive Sparse Decomposition（PSD） (Kavukcuoglu et al., 2008) 提出了sparse coding和autoencoder的变种，叫PSD。 PSD是sparse coding和AE的一个变种，目标函数为： PSD可以看做是sparse coding的近似，只是多了一个约束，可以将稀疏编码近似为一个参数化的编码器。 8.Representation Learning As Manifold Learning PCA就是一种线性流形算法。 8.1 Learning a parametric mapping based on a neighborhood graph 8.2 Learning a non-linear manifold through a coding scheme 8.3 Leveraging the modeled tangent spaces 9. Connections between probabilistic and direct encoding models 标准的概率模型框架讲训练准则分解为两个：log-likelihood 和prior 。 9.1 PSD：a probabilistic interpretation PSD是一个介于概率模型和直接编码方法的表示学习算法。RBM因为在隐层单元连接的限制，也是一个交叉。但是DBM中就没有这些性质。 9.2 Regularized Auto-encoder Capture Local Statistics of the Density 规则化的AE训练准则不同于标准的似然估计因为他们需要一种prior，因此他们是data-dependent。 (Vincent2011) A connection between score matching and denoising autoencoders Bengio2012_Implicit density estimation by local moment matching to sample from auto- encoders (Refai2012)_A generative process for sampling contractive auto-encoders 规则项需要学习到的表示尽可能的对输入不敏感，然而在训练集上最小化重构误差又迫使表示包含足够的信息来区分他们。 9.3 Learning Approximate Inference 9.4 Sampling Challenge MCMC sampling在学习过程中效率很低，因为the models of the learned distribution become sharper, makeing mixing between models very slow. Bengio2012 表明深层表示可以帮助mixing。 9.5 Evaluating and Monitoring Performance 一般情况下我们在学习到的特征上加一层简单的分类器，但是最后的分类器可能计算量更大（例如fine-tuning一般比学习特征需要更多次数的迭代）。更重要的是这样可能给出特征的不完全的评价。对于AE和sparse coding来说，可以用测试集上的重构误差来监视。对于RBM和一些BM， Murray2008 提出用 Annealed Importance Sampling 来估计partition function。RBM的另一种（ Desjardins2011 ）是在训练中跟踪partition function，这个可以用来early stopping和减少普通AIS的计算量。 10. Global Training of Deep Models 10.1 On the Training of Deep Architectures 第一种实现就是非监督或者监督单层训练， Erhan2010 解释了为什么单层非监督预训练有帮助。这些也与对中间层表示有指导作用的算法有联系，例如Semi-supervised Embedding（ Weston2008 ）。在Erhan2010中，非监督预训练的作用主要体现在regularization effect和optimization effect上。前者可以用stacked RBM或者AE实验证明。后者就很难分析因为不管底层的特征有没有用，最上面两层也可以过拟合训练集。改变优化的数值条件对训练深层结构影响很大，例如改变初始化范围和改变非线性函数（ Glorot2010 ）。第一个是梯度弥散问题，这个难题促使了二阶方法的研究，特别是 Hessian-free second-order方法（ Marten2010 ）。Cho2011 提出了一种RBM自适应的学习速率。 Glorot2011 表明了sparse rectifying units也可以影响训练和生成的表现。 Ciresan2010 表明有大量标注数据，再合理的初始化和选择非线性函数，有监督的训练深层网络也可以很成功。这就加强这种假设，当有足够的标注样本时，无监督训练只是作为一种prior。 Krizhevsky2012 中运用了很多技术，未来的工作希望是确定哪些元素最重要，怎么推广到别的任务中。 10.2 Joint Training of Deep Boltzmann Machines Salakhutdinov and Hinton (2009) 提出了DBM。 DBM的能量方程： DBM相当于在Boltzmann Machine的能量方程中U=0，并且V和W之间稀疏连接结构。 10.2.1 Mean-field approximate inference 因为隐层单元之间有联系，所以后验概率变得intractable，上文中用了mean-field approximation。比如一个有两层隐层的DBM，我们希望用近似后验，使得最小。 10.2.2 Training Deep Boltzmann Machines 在训练DBM时，与训练RBM最主要的不同在于，不直接最大似然，而是选择参数来最大化lower-bound。具体见论文。 11. Building-in Invariance 11.1 Augmenting the dataset with known input deformations Ciresan2010 中使用了一些小的放射变换来扩大训练集MNIST，并取得了很好的结果。 11.2 Convolution and Pooling (Le Roux et al., 2008a) 研究了图片中的2D拓扑。论述这种结构与哺乳类动物的大脑在目标识别上的相同点： Serre et al., 2007 ：Robust object recognition with cortex-like mechanisms DiCarlo et al., 2012 ：How does the brain solve visual object recognition? Pooling的重要性： Boureau2010 ：A theoretical analysis of feature pooling in vision algorithms Boureau2011 ：Ask the locals：multi-way local pooling for image recognition 一个成功的pooling的变种是L2 Pooling： Le2010 ：Tiled convolutional neural networks Kavukcuoglu2009 ：Learning invariant features through topographic filter maps Kavukcuoglu2010 ：Learning convolutional feature hierarchies for visual recognition Patch-based Training : (Coates and Ng, 2011) ：The importance of encoding versus training with sparse coding and vector quantization。这篇文章compared several feature learners with patch-based training and reached state- of-the-art results on several classification benchmarks。文中发现最后的结果与简单的K-means聚类方法类似，可能是因为patch本来就是低维的，例如边缘一般来说就是6*6，不需要分布式的表示。 Convolutional and tiled-convolutional training: convolutional RBMs： Desjardins and Bengio, 2008 ：Empirical evaluation of convolutional RBMs for vision Lee，2009 ：Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations Taylor，2010 ：Convolutional learning of spatio-temporal features a convolutional version of sparse coding: Zeiler2010 : Deconvolutional networks tiled convolutional networks: GregorLeCun,2010 : Emergence of complex- like cells in a temporal product network with local receptive field Le,2010 : Tiled convolutional neural networks Alternatives to pooling: (Scattering operations) Mallat,2011: Group invariant scattering BrunaMallat,2011 : Classification with scattering operators 11.3 Temporal Coherence and slow features 11.4 Algorithms to Disentanle Facotors of Variantion Hinton，2011 ：Transforming auto-encoders 这篇文章take advantage of some of the factors of variation known to exist in the data。 Erhan，2010 ：Understanding representations learned in deep architectures 这篇文章有实验表明DBN的第二层倾向于比第一层更具有不变性。 12. Conclusion 本文讲述了三种表示学习方法：概率模型、基于重构的算法和几何流形学习方法。 Practical Guide and Guidelines： Hinton，2010 ：Practical guide to train Restricted Bolzmann Machine Bengio，2012 ：Practical recommendations for gradient- based training of deep architectures Snoek2012 ：Practical Bayesian Optimization of Machine Learning Algorithms; 个人分类: DL论文笔记|11097 次阅读|0 个评论

2009_Learning Deep Architecture for AI论文笔记: 热度 2 zhuwei3014 2014-7-23 17:57; 刚开通的博客，突然觉得什么都记在印象笔记中不如分享来的重要，在过程中也许还可以和同领域的人们交流，这就算是处女贴吧。从印象笔记贴过来，排版什么的可能会有些乱，重新整理太耗时间，请大家见谅。今年开始研究近年来比较火热的deep learning，对于我来说是一个全新的领域，基本上是零基础，所以这篇综述性质的文章对于我的帮助是很大的，不过要看懂这篇文章，不仅需要很多时间，而且需要对神经网络有一点基础。这篇文章是一个长文，涉及的范围也比较广，有很多地方我还没有搞懂，本文只是对这篇文章的总结及摘要性质，并没有夹杂我个人的观点，有些地方翻译不对或内容理解错误，请大家帮忙指点出来。 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 1. Introduction we assume that the computational machinery necessary to express complex behaviors requires highly varying mathematical functions, i.e. mathematical functions that are highly non-linear in terms of raw sensory inputs, and display a very large number of variations. If a machine captured the factors that explain the statistical variations in the data, and how they interact to generate the kind of data we observe, we would be able to say that the machine understands those aspects of the world covered by these factors of variation. 1.1 How do we train deep architectures? Automatically learning features at multiple levels of abstraction allows a system to learn complex functions mapping the input to the output directly from data, without depending completely on human-crafted features. Depth of architecture refers to the number of levels of composition of non-linear operations in the function learned. the mammal brain is organized in a deep architecture with a given input percept represented at multiple levels of abstraction, each level corresponding to a different area of cortex. This is particularly clear in the primate visual system (Serre et al., 2007), with its sequence of processing stages: detection of edges, primitive shapes, and moving up to gradually more complex visual shapes. Something that can be considered a breakthrough happened in 2006: DBN, autoencoder... apparently exploiting the same principle: guiding the training of intermediate levels of representation using unsupervised learning, which can be performed locally at each level . 1.2 Intermediate Representations: Sharing Features and Abstractions Across Tasks These algorithms can be seen as learning to transform one representation (the output of the previous stage) into another, at each step maybe disentangling better the factors of variations underlying the data. 每一层的特征都不是相互独立的，他们构成了一个分布表示： the information is not localized in a particular neuron but distributed across many.大脑中的表示是稀疏的，大概只有1-4%的神经在同一个时间激活。 Even though statistical efficiency is not necessarily poor when the number of tunable parameters is large, good generalization can be obtained only when adding some form of prior (e.g. that smaller values of the parameters are preferred) Exploiting the underlying commonalities between these tasks and between the concepts they require has been the focus of research on multi-task learning. Consider a multi-task setting in which there are different outputs for different tasks, all obtained from a shared pool of high-level features. 2. Theoretical Advantages of Deep Architectures 这一节讲了学习深度结构的动机和一些结构深度的解释。有一些函数不能被浅层结构有效的表达，因为可调整元素的数量。 For a fixed number of training examples, and short of other sources of knowledge injected in the learning algorithm, we would expect that compact representations of the target function 2 would yield better generalization.具体的说一个可以被k层结构表示的函数如果要用k-1层结构表示需要增加指数级的计算元素。 To formalize the notion of depth of architecture, one must introduce the notion of a set of computational elements . Theoretical results suggest that it is not the absolute number of levels that matters, but the number of levels relative to how many are required to represent efficiently the target function (with some choice of set of computational elements). 2.1 Computational Complexity 最基本的结论就是如果一个函数可以被一个深层结构简明的表示，如果用不够深的结构来表示的话它需要一个很大的结构。举了逻辑门函数的例子，但是这些理论并不能证明别的一些函数（比如AI中的一些task）需要深层结构，也不能证明这些限制适用于其他回路。但是这引发了我们的思考，普通的浅层网络是不是不能有效的表示复杂函数。 Results such as the above theorem also suggest that there might be no universally right depth : each function (i.e. each task) might require a particular minimum depth (for a given set of computational elements). 2.2 Informal Arguments We say that a function is highly-varying when a piecewise approximation (e.g., piecewise-constant or piecewise-linear) of that function would require a large number of pieces. 深层结构是很多运算的组合，应该都可以用一个非常大的2层结构来表示。 To conclude, a number of computational complexity results strongly suggest that functions that can be compactly represented with a depth k architecture could require a very large number of elements in order to be represented by a shallower architecture. 3. Local vs Non-Local Generalization 3.1 The Limits of Matching Local Templates Local estimators不适合用来学习highly-varying functions,即使他们能够用深层结构有效的表示。 An estimator that is local in input space obtains good generalization for a new input x by mostly exploiting training examples in the neighborhood of x. Local estimators 直接或间接的将输入空间划分为区域，每个区域都需要不同的参数来表示目标函数，所以当需要很多区域的时候，参数也会变的很多。根据局部模板匹配的结构可以看成两层结构，第一层是模板匹配层，第二层是分类层。最典型的例子是Kernel machine：和构成第二层，在第一层，核函数将输入x和训练样本xi进行匹配。著名的kernel machine有SVM和Gaussian Process。Kernel Machine得到generalization是通过找出smooth prior： the assumption that the target function is smooth or can be well approximated with a smooth function. 如果没有关于task的先验知识，就不能设计合适的kernel，这就刺激了很多研究。 (Salakhutdinov Hinton, 2008) Gaussian Process Kernel Machine可以用DBN来学习特征空间来提高效率。深层结构的学习算法可以看成是为kernel machine学习好的特征空间的方法。考虑目标函数中的方向v上下震荡，对于高斯kernel machine，需要的数据量对于目标函数需要学习的震荡的数量是线性增长的。 For a maximally varying function such as the parity function, the number of examples necessary to achieve some error rate with a Gaussian kernel machine is exponential in the input dimension. 对于一个仅仅依靠先验知识目标函数式局部平滑的学习者，学习一个在一个方向上符号变化很多的函数式很困难的。对于高维复杂的任务，如果一个曲线有很多variation并且这些variation各自没有关联，那么局部的estimator应该是最好的算法。但是在AI中我们假设目标函数中存在一些潜在的规律，所以需要寻找variation的更加compact的representation，从而可能导致更好的generalization。 Most of these unsupervised and semi-supervised algorithms using local estimators rely on the neighborhood graph: a graph with one node per example and arcs between near neighbors. 接下来举了一个流形学习的例子：图中同一个目标4的一组图片，通过旋转和缩小，形成了一个低维的流形。因为流形是局部平滑的，所以原则上可以用linear patches去估计局部，每个patch与流形相切。但是如果流形过度弯曲，这些patch就需要很小，并且数量也指数级增长。考虑基于neighborhood graph的半监督学习算法。labeled examples需要和variations of interst一样多才行，当decision surface变化太多的话就不行了。 Theoretical analysis(Bengio et al.,2009)说明了对于特定的函数要得到一个给定的error rate需要的数据量是呈指数级的。empirical result表明决策树的泛化能力是会随着variations的增多而降低的。 Ensembles of trees：They add a third level to the architecture which allows the model to discriminate among a number of regions exponential in the number of parameters. 3.2 Learning Distributed Representations 一个简单的local representation for 是一个N比特的向量r(i),其中有一个1和N-1个0。同样的distributed representation是一个的向量，更加简洁的表示。在分布式表示中，输入的特征不是相互独立的，但是有可能统计独立，举例来说，聚类算法不是分布式的因为cluster必须mutually exclusive，但是PCA和ICA是分布式表示。（这句没完全理解）监督学习例如多层神经网络和无监督学习例如Boltzmann machine，都是学习分布式的内在表示，他们的目标在于让学习算法学习构成分布式表示的特征。 4. Neural Networks for Deep Architecture 4.1 Multi-Layer Neural Networks 每一层是上一层特征的非线性函数sigmoid或者tanh的激活值，参数有偏置b和权值w，通过最小化最后一层与目标的误差，更新参数学习。最后一层也可以用其他表示，例如softmax，计算激活值的比例，最后一层的输出hi就是P(Y=i|x)的估计，这种情况下经常用negative conditional log-likelihood，也就是-logP (Y=y|x)作为损失函数。 4.2 The Challenge of Training Deep Neural Networks 基于随机梯度的算法用在深层网络上时容易陷入局部最优，当随机初始化时，深层网络的结果可能比浅层结构更差。06年Hinton开始使用pre-training得到更好的结果，这些文字都发现了greedy layer-wise unsupervised学习算法：首先用无监督算法训练第一层，产生了第一层的初始值，然后用第一层的输出作为第二层的输入，训练第二层。全部训练完以后，用监督算法进行fine-tune。这些基于RBM和auto-encoder的算法的相同点是layer-local unsupervised criteria，也就是the idea that injecting an unsupervised training signal at each layer may help to guide the parameters of that layer towards better regions in parameter space. In Weston et al. (2008), the neural networks are trained using pairs of examples (x, ˜x), which are either supposed to be “neighbors” (or of the same class) or not. A local training criterion is defined at each layer that pushes the intermediate representations hk(x) and hk(˜x) either towards each other or away fromeach other , according to whether x and ˜x are supposed to be neighbors or not. 同样的方法已经被用于无监督流形学习算法。 Bergstra and Bengio (2010) exploit the temporal constancy of high-level abstraction to provide an unsupervised guide to intermediate layers: 连续帧可能包含同样的目标。那么这样的改进是因为better optimization还是better regularization呢？ Erhan(2009)的实验中指出对于同样的训练误差，有非监督预训练的测试误差会更低。非监督pre-training可以看做是一种regularizer/prior：约束了参数空间。Bengio(2007)的实验指出poor tuning of the lower layersmight be responsible for the worse results without pre-training。当增加隐层节点的数量时训练误差可以降到0。当最顶层被限制为很少，如果没有pre-training，那么training和test error都降很大，因为最上面两层可以看做是正常的两层网络，如果最顶层足够大，它已经足够fit训练集，因此需要pre-training使下面的层better optimized，小的最顶层才能yield better generalization。如果顶层节点足够多，即使底层训练的不好训练误差也可以很低，但是这样generalization可能比浅层网络更差。当训练误差低而测试误差高，我们叫做overfitting。因为pre-training的效果，它也可以被看做是一种data-dependent regularizer。当训练集的size很小，虽然非监督学习可以提高test error，但是training error会变大。（为什么？）用Gaussian process或SVM代替最上面两层可以降低training error，但是如果如果低层没有足够优化，那么对generalization还是没有帮助。另外一种可以产生better generalization的事一种regularization：with unsupervised pre-training, the lower layers are constrained to capture regularities of the input distribution. One way to reconcile the optimization and regularization viewpointsmight be to consider the truly online setting. 这种情况下， online gradient descent是一个随机优化过程，如果无监督pre-training仅仅是一个regularization，可以预期有一个无限的训练集，那么有没有pre-training都将收敛到同一个程度。为了解释这个问题，用了一个’infinite MNIST’ dataset (Loosli, Canu, Bottou, 2007)，结果发现pre-trained的3层网络明显收敛到更低的error，也就是说pre-training不仅仅是一种regularizer，而且是一种寻找最优最小值的方法。为什么低层更难优化呢？上述表明反向传播的梯度可能不够使得参数转移到别的region，他们容易陷入局部最优。The gradient becomes less informative about the required changes in the parameters as we move back towards the lower layers, or that the error function becomes too ill-conditioned for gradient descent to escape these apparent local minima. 4.3 Unsupervised Learning for Deep Architecture 非监督学习找到了输入数据的一种统计规律的表示。 PCA和ICA可能不适合因为他们不能处理overcomplete case，也就是outputs比inputs更多的情况。另外，stack linear projections仍然是一个线性变换（例如2层PCA），不是构造深层结构。另一种动机去研究非监督学习：它可以是一种把问题分解为子问题的方法，每个子问题对应于不同level的abstraction。第一层可以提取显著信息，但是由于limited capacity，他只是低层特征，然后下一层用这些低层特征作为输入，那么就可以提取稍微高一点的特征。但是这里用梯度下降法的话，又会产生梯度弥散的问题。 4.4 Deep Generative Architectures 除了给监督算法做预训练，非监督算法还可以学习分布和生成样本。生成模型一般用图模型表示。Sigmoid belief net是一种多层生成模型，使用variatonal approximations训练。DBN与sigmoid belief net有点相似，除了最上面两层，DBN的最上面两层是RBM，一种无向图模型。 4.5 Convolutional Neural Networks 虽然深层网络使用监督算法训练很困难，但是有一个例外，就是CNN。原因有两个猜想： 1. The small fan-in of these neurons (few inputs per neuron) helps gradients to propagate through so many layers without diffusing so much as to become useless. 2. The hierarchical local connectivity structure is a very strong prior that is particularly appropriate for vision tasks, and sets the parameters of the whole network in a favorable region (with all non-connections corresponding to zero weight) from which gradient-based optimization works well . 4.6 Auto-Encoders Auto-Encoder和RBM之间也有一些联系， auto-encoder training approximates RBM training by Contrastive Divergence. 如果有一层线性隐层，那么k个隐层单元学习将input投影到前k个重要成分，有点类似PCA。如果隐层是非线性的，那么就具有捕捉输入分布的multi-modal aspects 。一个很重要的问题是如果没有其他约束，一个有n维输入和至少n维编码的自编码器，可能只学习到恒等方程（很多编码没有用，例如仅仅copy输入）。Bengio(2007)实验表明，当用随机梯度下降法，overcomplete(隐层节点比输入节点多)非线性自编码器可以产生有用的表示。一个简单的解释是early stopping有点类似L2约束。对于连续输入的重建，一个非线性自编码器在第一层需要很小的权值(to bring the non-linearity of the hidden units in their linear regime)，第二层需要很大的权值。对于二值输入，也需要很大的权值去完全最小化重构误差。除了constraining encoder by explicit or implicit regularization of the weights, 另一种策略是添加noise。这就是本质上RBM所做的。另一种策略是 sparsity constraint 。这些方法生成的权值矩阵与V1,V2神经元观察的结果类似(Lee, Ekanadham, Ng, 2008)。 sparsity和regularization为了避免学习identity，但是减少了capacity，而RBM既有很大的capacity，也不会学习identity，因为它是捕捉输入数据的统计结构。有一种auto-encoder的变体denoising auto-encoder具有和RBM类似的特性。 5 Energy-Based Models and Boltzmann Machines 5.1 Energy-Based Models and Products of Experts Energy-based model引入了能量的定义，energy-based概率模型可以用一个能量函数来定义一个概率分布：任何一个概率分布都可以用能量模型计算，归一化元素Z被称为partition function：在product of experts formulation中，energy定义为： 5.1.1 Introducing Hidden Variables an observed part x, a hidden part h: marginal: map this to energy function: with and 用代表这个模型的参数，那么log-likelihood gradient：因此average log-likelyhood gradient：为训练集empirical distribution，为expectation under model’s distribution P 也可以将能量写成一个关于某一个hidden unit的和：例如在RBM中，FreeEnergy和分子都是可以求出的：表示被赋予所有值的和，如果h是连续值，那么sum可以用integral取代。 5.1.2 Conditional Energy-Based Models 计算这个partition function很难，如果我们的最终目的是决定y给定x，那么我们不需要求出联合分布P(x,y)，而只要求P(y|x)：这类方法可以应用于判别RBM(Discriminative RBM)。 5.2 Boltzmann Machine Boltzmann machine是一种特殊形式含有隐变量的能量模型，能量函数是一个二次多项式：和分别为连接x和h的偏置，权值、、分别对应一对连接，U和V是对称的，在大部分模型中对角线为0。非0的对角线可以用来获得其他变种，例如Gaussian代替二项式单元。由于隐层单元的相互作用，上述FreeEnergy的计算方法在这里不适用，但是可以用MCMC采样方法，推导如下：很容易计算，因此如果可以sample from P(h|x)和P(x,h)，那么就可以得到unbiased stachastic estimator of the log-likelihood gradient。 Hinton(1986)介绍了一下术语：在 positive phase ，x作为输入，sample h from x；在 negative phase ，x和h都采样，理想中是从模型自身中采样。 Gibbs sampling是一种近似采样，N个随机变量的联合分布是通过N个子采样得到的，，每次从N-1个随机变量里采样Si，经过无穷步采样逐渐收敛到P(S)。以下是怎么在Boltzmann machine中运用Gibbs sampling，并举了一个例子，可惜没看懂，等用到的时候再看。因为对于每个样本都需要两条MCMC链（一个positive phase，一个negative phase），所以计算量很大，故这种方法被BP算法取代。但是接下来的Contrastive Divergence算法又成功运用了这种方法。 5.3 Restricted Boltzmann Machine RBM是DBN的building block，它当中的、都为0，因为层间没有连接。Energy function：输入的Free Energy： Conditional probability P(h|x)：在binary 输入的case中，因为x和h在能量函数中是对称的，因此在Hinton（2006），binomial input units are used to encode pixel gray levels in input images as if they were the probability of a binary event. 在MNIST训练集上很有效，但是在其他case中不行。在Bengio（2007）的实验中描述了当输入是连续值时Gaussian input对于binomial input的优势。虽然RBM可能不能像BM那样有效的表示某些分布，但是它可以表示任何离散分布，只要有足够的隐层单元。Le Roux Bengio(2008)的实验中表明除非RBM已经完美的表示了训练集分布，增加一个隐层单元总能提高log-likelihood。一个RBM也可以看做是multi-clustering，每一个隐层单元产生一个2-region partition of the input space。 The sum over the exponential number of possible hidden-layer configurations of an RBM can also beseen as a particularly interesting form of mixture, with an exponential number of components (with respect to the number of hidden units and of parameters)：例如，如果P(x|h)被选择为Gaussian，这就是一个Gaussian mixture，h有n bits就有个成分。但是这些成分不能独立tuned，因为它们share parameters。Gaussian均值通过一个线性函数得到，也就是说每个隐层单元在均值中都贡献一个。 5.3.1 Gibbs Sampling in RBMs Gibbs sampling在RBM中在每步中有两个子步骤：第一sample h from x，第二sample a new x from h。随着样本的增加，模型分布与训练分布越来越相似。如果我们从模型分布开始，那么一步就可以收敛，所以从训练样本的经验分布开始保证只需要很少的步骤就可以收敛。 5.4 Contrastive Divergence 5.4.1 Justifying Contrastive Divergence 在这个算法中，我们需要做的第一个approximation就是用单个样本代替所有可能输入的平均值。在k-step Contrastive Divergency中，包含了第二个近似：是k步后的最后一个样本。when ，the bias goes away. 当模型分布非常接近于经验分布的话，也就是时，start the chain from x(a sample) the chain has converged，我们只需要一步就得到了unbiased sample 。结果表明即使k=1都能得到很好的结果。一种方法解释CD算法就是这是一种locally around样本x1的对log-likelihood gradient的一种近似。LeCun(2006)指出EBM的训练算法中最重要的就是使得observed inputs的能量最小，这里指的就是FreeEnergy。CD算法中的contrast指的是一个真实的训练样本和一个链中的样本的对比。 5.4.2 Alternatives to Contrastive Divergence Tieleman(2008) ; Salakhutdinov Hinton(2009)提出了persistent MCMC for the negative phase。主要思想很简单： keep a background MCMC chain. . .xt → ht → xt+1 → ht+1 . . . to obtain the negative phase samples 。不同于CD-k中run a short chain，在做出近似时忽略参数在不断变化的事实，也就是说，we do not run a separate chain for each value of the parameters。因为参数变化实际上很慢，所以这种近似效果很好，但是 the trade-off with CD-1 is that the variance is larger but the bias is smaller .（这个方法实际上还没有看懂！！！）另一种方法是Score Matching (Hyv¨arinen, 2005, 2007b, 2007a) ，这是一种用来训练EBM的方法，它可以求出能量，但不是分母那个归一化常数Z。一个密度的score function是，这个函数与归一化常数Z没有关系，这个方法的基本思想是 match the score function of the model with the score function of the empirical density ，然后比较两个score function的difference。（需要结合paper细看） 5.4.3 Truncations of the Log-Likelihood Gradient in Gibbs-Chain Models（纯数学推导，具体要细看paper ） Bengio and Delalleau (2009) 给出了一个定理： Theorem 5.1. Consider the converging Gibbs chain x1 ⇒ h1 ⇒ x2 ⇒ h2 . . . starting at data point x1. The log-likelihood gradient can be written and the final term converges to zero as time goes to infinit 。所以truncating the chain to k steps用这样的一个近似：这就是CD-k算法，这就告诉我们CD-k算法的bias就是，这个bias会随着k的增加而减少，所以增加CD-k的步骤会更快更好的收敛。当用x1来初始化Markov chain，第一步相对于x1就向着正确的方向移动了，也就是说，roughly going down the energy landscape from x1. CD-1是进行了两次采样，如果只进行一次呢？那么分析log-likelihood gradient expansion：用average configuration 来代替：忽略后一项（ WHY ），我们得到了右边作为update direction，这就是 reconstruction error，典型地用来训练自编码器：所以我们发现 truncation of the chain得到了第一个近似是大概的重构误差，然后稍微更好的近似是CD-1 ，重构误差也是在训练RBM时用来跟踪整个过程的。 5.4.4 Model Samples Are Negative Examples（数学解释没有看懂）在boltzmann machine和CD算法中，一个很重要的元素就是 the ability to sample from the model 。极大似然准则想要在训练样本上得到很高的相似，而在其他地方很低。如果我们已经有一个模型，那么where the model puts high probability(represented by samples)和where the training examples are指出了怎么来改变这个模型。如果我们可以用一个decision surface（决策面）分离训练样本和模型样本，我们可以这样增加likelihood：减少决策面中有更多训练样本那一侧的能量函数值，增加另一面。以下用数学推导证明了如果可以增加一个classifier分离训练样本和模型样本的能力，就可以增加这个模型的log-likelihood，将更大的可能面放在训练样本一侧。实际中，我们可以用一个分类器，这个分类器的判别函数像生成模型的free energy那样定义，再假设可以从模型中采样，可以达到这样的目标。 6. Greedy Layer-Wise Training of Deep Architectures 6.1 Layer-Wise Training of Deep Belief Networks 一个含有l层的DBM构成联合分布如下：是visible-given-hidden conditional distribution，是最上面一层RBM的joint distribution。代表posteriors，除了最上一层因为最上面一层是RBM，可以exact inference。具体算法见paper： Greedy layer-wise learning of deep networks 6.2 Training Stacked Auto-Encoders Training的方法与DBN类似： 1、训练第一层最小化重构误差 2、用隐层输出作为下一层的输入，训练第二层 3、迭代第2步 4、最后一个隐层的输出作为一个有监督层的输入，初始化参数 5、用监督训练准则fine-tune所有的参数与DBN的对比实验表明，DBN一般比SAE有优势，可能是因为 CD-k is closer to the log-likelihood gradient than the reconstruction error gradient 。但是因为no sampling is involved，所以reconstruction error has less variance than CD-k。 SAE的优势在于每一层的任何参数化都有可能，但是概率图模型中例如CD算法或者其他tractable estimators of the log-likelihood gradient能被应用的有限。SAE的缺点在于它不是生成模型，生成模型中sample可以被量化的检查学习到了什么，例如可视化。 6.3 Semi-Supervised and Partially Supervised Training 除了上述的先unsupervised training，再supervised training，还有其他的两者结合的方法。 Bengio(2007) 提出了partially supervised training，这个方法在输入分布P(X)与P(Y|X)不是强烈相关的情况下有用，具体看paper。还有一种self-taught learning，( Lee, Battle, Raina,Ng, 2007 ; Raina et al., 2007) 7. Variants of RBMs and Auto-Encoders 7.1 Sparse Representations in Auto-Encoders and RBMs 1. Why a sparse representation? 几种解释sparsity的观点，具体见paper。 Ranzato(2008) 2. Sparse Auto-Encoder and Sparse Coding 第一个成功在深度结构中发掘出稀疏表示的是 Ranzato(2006) 。第二年同一个组介绍了一个变种，based on a Student-t prior。还有一种方法与computational neuroscience有关，它包含了两层sparse RBMs （Lee 2008）。在压缩感知力sparsity是通过加入L1惩罚，也就是说，当h是稀疏的时候输入x以很低的L2 reconstructed error重构。这里。就像有向图模型，sparse coding表现有点像explaining away：不同的configuration竞争，从中选取一个，别的都关掉。 The advantage is that if a cause is much more probable than the other, than it is the one that we want to highlight. The disadvantage is that it makes the resulting codes somewhat unstable, in the sense that small perturbations of the input x could give rise to very different values of the optimal code h. 为了解决稳定性问题和fine-tuning的问题，Bagnell, Bradley(2009)提出用一个softer近似取代L1惩罚，也就是说many very small coefficients, without actually converging to 0. sparse auto-encoder和sparse RBMs不存在这些问题：computational complexity (of inferring the codes), stability of the inferred codes, and numerical stability and computational cost of computing gradients on the first layer in the context of global fine-tuning of a deep architecture. 一些介于它们之间的SAE在(Ranzato et al., 2007, 2007; RanzatoLeCun, 2007;Ranzato et al., 2008)中被提出，他们提出let the codes h be free，but include a parametric encoder and a penalty for the difference between the free non-parametric codes h and the outputs of the parametric encoder. 在实验中，encoder只是一个affine transformation接着一个non-linearity(like the sigmoid)，decoder是线性的(as in sparse coding)。 7.2 Denoising Auto-Encoders DAE是AE的一个随机版本，它的输入被stochasitically corrupted，但是uncorrupted输入还是作为重构的目标。它做两件事，第一，encode输入，第二，消除corruption的影响。第二件只能通过捕获输入数据中的statistical dependency完成。 Vincent(2008) 提出 corruption操作为随机初始化多达一半的输入为0。一个recuurent版本早在Seung(1998)就被提出，用AE来denoising实际上在 (LeCun, 1987; Gallinari, LeCun, Thiria, Fogelman-Soulie, 1987)被提出。DAE因此展示了这个策略用在无监督预训练上的成功，并且与生成模型连接。一个DAE有趣的性质是它相当于一个生成模型，另一个有趣的性质是it naturally lends ifself to data with missing values or multi-modal data。 7.3 Lateral Connections RBM可以slightly restricted，通过在显示层加入一些lateral connections。这样sampling h仍然很简单，但是sampling x将会复杂一点。 Osindero and Hinton (2008) 中的结果表明基于这种模块的DBN比传统DBN效果更好。这种横向连接捕获了pairwise dependencies，让隐层捕获更高层的dependency。这样第一层就相当于一种whitening，作为一种预处理。这样的优势在于隐层表示的更高层的factors不需要编码所有的局部细节，这些细节横向连接可以捕获。 7.4 Conditional RBMs and Temporal RBMs A Conditional RBM is an RBM where some of the parameters are not free but are instead parametrized functions of a conditioning random variable. Taylor and Hinton(2009) 提出了context-dependent RBMs，隐层的参数c是一个context variable z的affine function。这是一个temporal RBM的例子，双箭头表明一个RBM，虚箭头表明conditional dependency。这个想法成功运用到了human motion中。 7.5 Factored RBMs 运用在language Model中，不太了解。 7.6 Generalizing RBMs and Contrastive Divergence (待补充) 8. Stochastic Variational Bounds for Joint Optimization of DBN Layers 以下Q代表RBM，P代表DBN。运用jensen 不等式可以使DBN的似然对数lower bounded。首先运用到、、，于是改写：指Q(h|x)的entropy，根据non-negativity of the KL divergence得到不等式：当P和Q相同时就说等号。在DBN中用P表示probability，在RBM中用Q表示。在第一层RBM中Q=P，但事实上不可能相等，因为在DBN中第一个隐层P(h)是由上层决定的。 8.1 Unfolding RBMs into Infinite Directed Belief Networks 上式证明greedy training procedure之前，需要先建立DBN中和RBM中的关系。当第二层RBM的权值是第一层的transpose时，这两者相等。另一种方法看这个问题，将一个gibbs sampling无限的链看成无限有向图with tied weights，这种无限有向图其实相当于RBM。一个2层的DBN，第二层的权值等于第一层权值的transpose，这个2层DBN相当于单个RBM。 8.2 Variational Justification of Greedy Layer-wise Training 下面来证明增加一层RBM可以提高DBN的likelihood。首先按照上述构造一个2层等价DBN，也就是权值矩阵是第一层的逆矩阵，固定第一层的两个条件概率，提高，增加KL项。开始KL项是0，entropy项不依靠DBN中的，所以的增加可以增加logP(x)。因为KL项和entropy项的非负性，进一步训练第二层RBM可以increase a lower bound。因此训练第二层RBM来最大化： If there was no constraint on P(h1), the maximizer of the above training criterion would be its “empirical” or target distribution：同理可证增加第三层也是这样。增加的一层RBM的size和weight的限制不是必须的，用前一层的转置矩阵去初始化权值是否会增加训练速度需要实验证明。注意在训练最顶层时，不能保证会单调增加。当lower bound持续增加时，实际的log-likelihood将会减小。这需要KL项减少，这个一般不可能，因为当DBN中的越来越脱离RBM中的，那么和也会越来越分离，导致KL变大。当训练第二层时，就会从慢慢向移动。但并不是第二层RBM从任何参数开始训练都会增加likelihood ，（以下举了一个反例，没有看明白）Consider the case where the first RBMhas very large hidden biases, so that ，but large weights and small visible offsets so that ，i.e., the hidden vector is copied to the visible units . When initializing the second RBM with the transpose of the weights of the first RBM, the training likelihood of the second RBM cannot be improved, nor can the DBN likelihood。当第二层RBM从一个不好的配置开始训练，那么将会向移动，导致KL变小。另一种解释就是第二层RBM的训练分布是第一层生成的，相当于进行了一次gibbs采样，我们知道gibbs采样越多越可以准确的得到真实的数据。 When we train within this greedy layer-wise procedure an RBM that will not be the top-level level of a DBN, we are not taking into account the fact that more capacity will be added later to improve the prior on the hidden units. Le Roux and Bengio (2008) 提出了一种方法替换CD算法来训练RBM，实验表明训练第一个RBM用KL divergense可以更好的优化DBN。但是这种方法intractable，因为需要计算隐层所有configuration的和。 8.3 Joint Unsupervised Training of All the Layers 8.3.1 The wake-sleep algorithm 在wake-sleep算法中，向上的recognition parameter和向下的generative parameter是分离的。主要思想如下： 1、wake阶段：用x生成h~Q(h|x)，用这个（h，x）作为fully observed data训练P(x|h)和P(h)，相当于对做了一次随机梯度。 2、sleep阶段：sample (h,x) from P(x,h)，然后用它作为observed data训练Q(h|x)，这相当于对做了一次随机梯度。 8.3.2 Transforming the DBN into a Boltzmann Machine 当每一层作为RBM初始化之后，DBN就转变成了一个deep boltzmann machine。因为在BM中每个单元接收上面和下面的输入， (Salakhutdinov Hinton, 2009) 提出将RBM的权值二等分来初始化DBM。 9. Looking Forward 9.1 Global Optimization Strategies 9.2 Why Unsupervised Learning is Important 1. Scarcity of labeled examples 2. Unknown future tasks 3. Once a good high-level representation is learned, other learning tasks could be much easy. 4. Layer-wise unsupervised learning 5. Unsupervised learning could put the parameters of a supervised or reinforcement learning machine in a region from which gradient descent (local optimization) would yield good solutions 6. The extra constraints imposed on the optimization by requiring the model to capture not only the input-to-target dependency but also the statistical regularities of the input distributionmight be helpful in avoiding some poorly generalizing apparent local minima. In general extra constraints may also create more local minima, 但是无监督预训练可以减少训练和测试误差，说明预训练将参数移动到最好的representation的参数空间附近。 Deep architectures have typically been used to construct a supervised classifier, and in that case the unsupervised learning component can clearly be seen as a regularizer or a prior that forces the resulting parameters to make sense not only to model classes given inputs but also to capture the structure of the input distribution. 9.3 Open questions 1、为什么基于梯度的深层网络用随机初始化经常不成功？ 2、用CD算法训练的RBM能否保留输入的信息（因为它不像自编码器那样，可能丢失一些重要信息），如果不能怎么修改？ 3、在CD算法中gibbs sampling的步骤需要调整吗？ 4、Persistent CD算法值得挖掘？ 5、除了重构误差，还有别的方法监视DBN或者RBM的训练吗？ 6、RBM和AE能否用某种形式的sparse penalty提高效果？ 7、SAE和SDAE有没有概率论的解释？; 个人分类: DL论文笔记|8655 次阅读|2 个评论

Comparison of Pros and Cons of Two NLP Approaches: liwei999 2014-6-19 17:19; So it is time to compare and summarize the pros and cons of the two basic NLP (Natural Language Processing) approaches and show where they are complementary to each other. Some notes: 1. In text processing, majority of basic robust machine learning is based on keywords, so-called BOW (bag-of-word) model although there is research of machine learning that goes beyond keywords. It actutally utilizes n-gram (mostly bigram or trigram) linear word sequence to simulate the language structure. 2. Grammar engineering is mostly a hand-crafted rule system based on linguistic structures (often represented internally as a grammar tree), to simulate the linguistic parsing in human mind. 3. Machine learning is good at viewing the forest (tasks such as document classification or word clustering from a corpus; and it fails in short messages) while rules are good at examining each tree (sentence-level tasks such as parsing and extraction; and it handles short messages well). This is understandable. Document or corpus contains a fairly big bag of keywords, making it easy for machine to learn statistical clues of the words for a given task. Short messages do not have enough data points for a machine learning system to use as evidence. On the other hand, grammar rules decode the linguistic relationships between words to understand the sentence, therefore it is good at handling short messages. 4. In general, a machine learning system based on keyword statistics is recall-oriented while a rule system is precision-oriented. They are complementary in these two core metrics of data quality. Each rule may only cover a tiny portion of the language phenomena, but once it captures it, it is usually precise. It is easy to develop a highly precise rule system but the recall typically only picks up incrementally in accordance with the number of rules developed. Because keyword based machine learning has no knowledge of sentence structures, at best its ngram evidence indirectly simulates languiage structure, it usually cannot reach high precision, but as long as the training corpus is sizable, good recall can be expected by the nature of underlying keyword statistics and the disregard for structural constraints. 5. Machine learning is known for its robustness and scalability as its algorithms are based on science (e.g. MaxEnt is based on information theory) that can be repeated and rigidly tested (of course, like any application areas, there are tricks and know-how to make things work or fail too in practice). The development is also fast once the labeled corpus is available (which is often not easy in practice) because there are off-shelf tools in open source and tons of documentation and literature in the community for proven ML algorithms. 6. Grammar engineering on the other hand tends to depend more on the expertise of the designer and developer for being robust and scalable. It requires deep skills and secret source which may only be accumulated based on years of successes as well as lessons learned. It is not purely a scientific undertaking but more of a blancing art in architect, design and development. To a degree, this is like chefs for Chinese cooking: with the same materials and the assumably the same recipe, one chef's dish can taste a lot better or different from that of another chef. Recipe only gives a framework while the monster of great taste is in the details of know-how. It is not easily repeatable across developers but the same master can repeatedly make the best quality dishes/systems. 7. The knowledge bottleneck is reflected in both machine learning systems and in grammar systems. A decent machine learning system requires a large hand-labeled corpus (research oriented unsupervised learning systems do not need manual annotation, but they are often not practical either). There is consensus in the community that the quality of machine learning usually depends more on the data than on the algorithms. On the other hand, the bottleneck of grammar engineering lies in skilled designer (data scientist) and well-trained domain developers (computational linguists), who are often in short supply today. 8. Machine learning is good at coarse-grained specific task (typical example is classification) while grammar engineering is good at fine-grained analysis and detailed insight extraction. Their respective strengths make them highly complementary in certain application scenarios because as information consumers, users often demand both coarse-grained overview as well as details of actionable intelligence. 9. One big big problem of a machine learning system is the difficulty to fix a reported quality bug. This is because the learned model is usually a black box and no direct human interference is allowed or even possible to address a specific problem unless the model is re-trained with new corpus and/or new features. In the latter case, there is no guarantee that the specific problem we want to solve will be addressed well by re-training as the learning process needs to blance all features in a unified model. This issue is believed to be the major reason why the Google search ranking algorithm favors hand-crafted functions over machine learning because their objective of better user experience can hardly by achieved by a black box model . 10. Grammar system is much more transparent in the language understanding process. The modern grammar systems are all designed with careful modularization so that each specific quality bug can be traced to the corresponding module of the system for fine-tuning. The effect is direct, immediate and can be incrementally accumulated for overall perforamcece enhancement. 11. From the perspective of the NLP depth, at least for the current state of the art, machine learning seems to do shallow NLP work fairly well while grammar engineering can go much deeper in linguistic parsing to achieve deep analytics and insights. (The on-going deep learning research program might get machine learning to some level deeper than before, but it is yet to see how effective it can do real deep NLP and how deep it can go, especially in the area of text processing and understanding.) Related blogs: why hybrid? on machine learning vs. hand-coded rules in NLP 再谈机器学习和手工系统：人和机器谁更聪明能干？【置顶：立委科学网博客NLP博文一览（定期更新版）】; 个人分类: 立委科普|11183 次阅读|0 个评论

[转载] Is Google ranking based on machine learning？: liwei999 2014-6-18 17:21; Quora has a question with discussions on Why is machine learning used heavily for Google's ad ranking and less for their search ranking? A lot of people I've talked to at Google have told me that the ad ranking system is largely machine learning based, while search ranking is rooted in functions that are written by humans using their intuition (with some components using machine learning). Surprise? Contrary to what many people have believed, Google search consists of hand-crafted functions using heuristics. Why? 479 One very popular reply there is from Edmond Lau , Ex-Google Search Quality Engineer who said something which we have been experiencing and have indicated over and over in my past blogs on Machine Learning vs. Rule System, i.e. it is very difficult to debug an ML system for specific observed quality bugs while the rule system, if designed modularly, is easy to control for fine-tuning: From what I gathered while I was there, Amit Singhal , who heads Google's core ranking team, has a philosophical bias against using machine learning in search ranking. My understanding for the two main reasons behind this philosophy is: In a machine learning system, it's hard to explain and ascertain why a particular search result ranks more highly than another result for a given query. The explainability of a certain decision can be fairly elusive; most machine learning algorithms tend to be black boxes that at best expose weights and models that can only paint a coarse picture of why a certain decision was made. Even in situations where someone succeeds in identifying the signals that factored into why one result was ranked more highly than other, it's difficult to directly tweak a machine learning-based system to boost the importance of certain signals over others in isolated contexts. The signals and features that feed into a machine learning system tend to only indirectly affect the output through layers of weights, and this lack of direct control means that even if a human can explain why one web page is better than another for a given query, it can be difficult to embed that human intuition into a system based on machine learning. Rule-based scoring metrics, while still complex, provide a greater opportunity for engineers to directly tweak weights in specific situations. From Google's dominance in web search, it's fairly clear that the decision to optimize for explainability and control over search result rankings has been successful at allowing the team to iterate and improve rapidly on search ranking quality. The team launched 450 improvements in 2008 , and the number is likely only growing with time. Ads ranking, on the other hand, tends to be much more of an optimization problem where the quality of two ads are much harder to compare and intuit than two web page results. Whereas web pages are fairly distinctive and can be compared and rated by human evaluators on their relevance and quality for a given query , the short three- or four-line ads that appear in web search all look fairly similar to humans. It might be easy for a human to identify an obviously terrible ad, but it's difficult to compare two reasonable ones: Branding differences, subtle textual cues, and behavioral traits of the user, which are hard for humans to intuit but easy for machines to identify, become much more important. Moreover, different advertisers have different budgets and different bids, making ad ranking more of a revenue optimization problem than merely a quality optimization problem. Because humans are less able to understand the decision behind an ads ranking decision that may work well empirically, explainability and control -- both of which are important for search ranking -- become comparatively less useful in ads ranking, and machine learning becomes a much more viable option. Jackie Bavaro , Google PM for 3 years Suggest Bio Votes by Piaw Na ( Worked at Google ) , Marc Bodnick , Alex Clemmer , Tudor Achim , and 92 more . Edmond Lau's answer is great, but I wanted to add one more important piece of information. When I was on the search team at Google (2008-2010), many of the groups in search were moving away from machine learning systems to the rules-based systems. That is to say that Google Search used to use more machine learning, and then went the other direction because the team realized they could make faster improvements to search quality with a rules based system. It's not just a bias, it's something that many sub-teams of search tried out and preferred. I was the PM for Images, Video, and Local Universal - 3 teams that focus on including the best results when they are images, videos, or places. For each of those teams I could easily understand and remember how the rules worked. I would frequently look at random searches and their results and think Did we include the right Images for this search? If not, how could we have done better?. And when we asked that question, we were usually able to think of signals that would have helped - try it yourself. The reasons why *you* think we should have shown a certain image are usually things that Google can actually figure out. Upvote • Comment • Share • Thank • Report • Written 10 Apr, 2013 Anonymous Votes by Edmond Lau ( Ex-Google Search Quality Engineer ) , Bin Lu ( Software Engineer at Google ) , Keith Rabois , Vu Ha , and 34 more . Part of the answer is legacy, but a bigger part of the answer is the difference in objectives, scope and customers of the two systems. The customer for the ad-system is the advertiser (and by proxy, Google's sales dept). If the machine-learning system does a poor job, the advertisers are unhappy and Google makes less money. Relatively speaking, this is tolerable to Google. The system has an objective function ($) and machine learning systems can be used when they can work with an objective function to optimize. The total search-space (# of ads) is also much much smaller. The search ranking system has a very subjective goal - user happiness. CTR, query volume etc. are very inexact metrics for this goal, especially on the fringes (i.e. query terms that are low-volume/volatile). While much of the decisioning can be automated, there are still lots of decisions that need human intuition. To tell whether site A better than site B for topic X with limited behavioural data is still a very hard problem. It degenerates into lots of little messy rules and exceptions that tries to impose a fragile structure onto human knowledge, that necessarily needs tweaking. An interesting question is - is the Google search index (and associated semantic structures) catching up (in size and robustness) to the subset of the corpus of human knowledge that people are interested in and searching for ? My guess is that right now, the gap is probably growing - i.e. interesting/search-worthy human knowledge is growing faster than Google's index.. Amit Singhal's job is probably getting harder every year. By extension, there are opportunities for new search providers to step into the increasing gap with unique offerings. p.s: I used to manage an engineering team for a large search provider (many years ago). 【置顶：立委科学网博客NLP博文一览（定期更新版）】; 个人分类: 立委科普|3747 次阅读|0 个评论

从新皮质层Neocortex开始: wolfewu 2014-2-7 22:41; 声明：这篇文章摘自Dileep George的博士学位论文《How the brain might work》 Neocortex新大脑皮层（拉丁语，表示“新皮层” 或者“新外壳”），也被称作为neopallium（“新壁炉架”）或iscortex（“外壳等价物”），是哺乳动物大脑的一部分。它是大脑两个半球的外层，由六层结构组成，各层用I到VI标记（VI是最里层，I是最外层）。新大脑皮层是大脑皮层的一部分（与古皮层和原皮层连在一起，而这两个皮层是边缘系统的一部分）。它涉及到更高级的功能（相对于大脑的基本功能）例如感观知觉，运动指令的产生，空间推理，理性思考和语言。 ——Wikipedia 如果从机器学习的观点来看待对大脑的研究，就会发现不论机器学习以及分类算法设计得多么牛X，这些方法，即先训练参数化自适应模型再通过某种准则调整参数来完成新的分类任务的方法，都隐含着一些机器学习的基本问题。 “没有免费的午餐”（No free Lunch, NFL定理），没有学习算法内在地比其他学习算法优越。如果说解决一个具体问题的算法比其它的优越，它只不过因为这个算法探究了适合这个问题的假设条件。同一个算法不可能很好地解决与原先假设完全不同的问题。这就意味着，为了使算法有效地解决问题，机器学习的研究者必须将目标问题领域的先验知识嵌入到最初的模型结构中。嵌入的先验知识越多，训练起来就越容易。这是否意味着我们需要为每一个试图用机器学习来解决的新问题创造一个新的模型？这实在是太费力了。话说回来，人类和其他哺乳动物解决问题的方式是不一样的。人类会学习，能够适应原来没有遇到过的新问题。很多研究者都猜想新大脑皮层是使用相同的基本算法来学习不同的模式。这意味着我们学习听觉、视觉、体觉感知以及语言都是同一种算法在运作。很多研究发现都支持这样的观点——通用皮层算法是存在的。将通用皮层算法和NFL定理组合在一起，就可以得到机器学习的重要推论，并可由此创造新的智能机器。表面上看，NFL定理看起来给存在通用皮层算法的想法带来了不少的麻烦。一种机制或者算法如何才能很好地就完成听觉、视觉以及语言这些不同的任务？答案来自NFL本身，也就是我们需要探究的问题的前提假设。也就是说，通用皮层算法与NFL定理是可以相互保持一致的，只要我们能够发现学习视觉、听觉和语言的相同基本假设。如果说大脑皮层善于使用通用机制学习各种各样的任务，那么在这些看起来完全不同的任务之间就一定存在着某种通用的东西。生物的进化一定发现了这种通用性的东西，而新大脑皮层就是它的操作主体。从NFL定理出发可以得出这样的结论：一种统一的学习理论在本质上就是一种统一的假设理论。如果输入是从未遇到过的信息，学习机用于预测输出的假设集合就被认为是学习算法的“归纳偏向”。我们所做的假设越多，学习就越容易。然而，我们所做的假设越多，我们所能解决的问题数量就越少。如果想要设计解决一大类问题的算法，我们所要回答的就是：什么样的基本假设是足够具体的，以使得在合理的时间内实现学习；而它（们）又足够一般以能解决一大类问题？这就是要研究新大脑皮层所需要做的。对新大脑皮层实施反向工程是个令人望而生畏的问题。如何搜寻这些统一假设？脑中有许多解剖学和生理学的细节；如何知道那个重要，那个不重要？什么只不过是生物学上的细节，因为只有神经元在起作用；什么才是绝对不能漏掉的重要计算原则？同时研究新大脑皮层和现实世界是个不错的策略。研究新大脑皮层的解剖学和生理学结构会为找到关于它所作出的假设的本质提供重要线索。研究新大脑皮层的组织结构，就需要寻找与学习的观点有关的一般原则。因此，只需要选择那些能够在现实世界中找到对应物的一般原则。如果新大脑皮层中的一种组织结构特性与现实世界的一种组织结构特性（也就是一般原则）相匹配，就有理由肯定找到了一条与学习的观点有关的一般原则。记忆-预测框架 Jeff Hawkins 在《人工智能的未来》中从生物学观点和计算的观点出发，提出了大脑皮层运作理论，称之为记忆-预测框架。其主要观点如下：新大脑皮层为输入其中的空间和时间模式创建了一个模型。创建这个模型的目标是为了预测输入的下一个模式。大脑皮层是由一种称为规范皮层回路的基本计算单元反复复制构成。从计算观点来讲，这样的规范回路可以看做是被反复复制若干次的节点。皮层被组织成为层次结构，这意味着上面所说的节点彼此连接形成了树形的层次结构。皮层的功能就是对它所“看到”的世界进行建模。这个模型是一种时空层次结构，结构中的每一个节点都存储了模式和序列。它随后就用于对输入进行预测。新大脑皮层以无监督方式对现实世界建模。结果中的每一个节点都存储了大量的模式和序列。皮层就是依靠这些模式来进行模式识别。一个节点的输出是用它所学习过的模式序列表达的。消息在层次结构中向上和向下传递以识别和分辨信息，并且还在时间上向前传播以预测下一个输入的模式。记忆-预测框架的数学表达形式就是层次瞬时记忆（Hierarchical Temporal Memory, HTM）。; 4560 次阅读|0 个评论

初识 Hierarchical Temporal Memory: wolfewu 2014-2-7 22:20; 这是一篇我在几年前对Numenta公司的网页上的一些内容的翻译。虽然原网页已经变了，但是对认识Hierarchical Temporal Memory（HTM）多少有些帮助吧。维基百科比较详细的介绍，网址是 http://en.wikipedia.org/wiki/Hierarchical_temporal_memory Hierarchical Temporal Memmory:这个名字里面都是些什么呢？选择这个名字的原因是： H ierarchical ——HTMs（复数的）被组织成为一种树形的节点层次结构。各个节点实现了一个学习与记忆的函数，也就是说，每个节点中封装了一种算法。低层节点接受大量的输入，并将处理过的信息作为输入送到接下来的一层。以这种方式，随着信息沿着层次结构逐层上传，它被HTM网络处理得越来越抽象。 T emporal ——在训练过程中，必须用目标随着时间变化的观点来表述HTM应用。在图片应用的训练中，先从上到下然后从左到右地描述图像，就好像图像正随着时间运动。请注意，非持久性（Temporal，我的理解为照片在每一时刻所处的位置都不应该一样）的要素非常重要：所设计的算法本身就期望目标随着时间渐渐发生变化（也就是说这是算法的需要）。 M emory ——一个HTM应用会分为两个阶段，可以分别认为是“训练记忆阶段”和“使用记忆阶段”。在训练过程中，HTM学着从它接受到的输入中识别模式。这样单独地训练层次结构中的每一层。完全训练好的HTM网络中的每一层都知道——都记得——它自己世界中的目标。在推导过程中，当HTM网络前摆着新的目标时，HTM网络就可以确定一个目标是某个已知目标的似然函数。 HTM技术与传统的计算机应用 HTM应用与传统的计算机应用是不同的。传统上每一个程序解决一个具体问题例如处理电邮或是分析数据。与此相反，HTM算法可以被训练来解决不领域不同类型的问题。程序员为HTM网络准备数据并训练HTM网络。受过训的网络可以分析新的信息并对其采取行动（暂时这样翻，原文是act on it）。 HTM技术与现有的数据建模技术乍一看，许多其他的数据建模技术都看起来像HTM技术，然而却不具备HTM的所用功能。HTMs与众不同的特征是：一个HTM网络即处理时间信息又处理空间信息。受过训的HTM网络可以产生数据，也就是说可以在时间上向前预测。有着很坚实的生物学基础。在 Jeff Hawkins 的《人工智能的未来》这本书中，他对这个话题讨论得更为详细。他指出了传统技术如专家系统和神经网络与HTM的不同之处：所有其他技术都试图模拟人类行为，而唯有HTM是基于“人脑是如何工作的”这样一套理论的。适合HTM应用的问题成功应用HTM的关键在于把它用于一个适合HTM的问题并对该问题进行恰如其分的阐述。《适合HTMs的问题》白皮书对这个问题进行了详细探讨。这一节概述几个最为重要的观点。有些问题并不是很适合HTMs：HTM并不适于解决可以用一组离散的规则就能解决的问题。HTM更适宜于解决含混不清和嘈杂领域的问题。需要具体时间的问题（如乐曲识别）并不适于这一代的HTMs（看来还需要进化）。 HTM最适合于这样的问题：用于建模的数据是从时变（随时间变化的）causes的层次结构中产生的。这里的一个cause是指一个产生HTM输入数据的目标（好像可以理解为因果关系的“因”）。这个问题必须同时包含空间和时间成分。例如白皮书中讨论过的汽车监视系统、波浪采样等应用，后者用监视器的输入来对河流的状态进行分类。这两个例子都包含了时间元素：车辆监视系统中，传感器跟踪例如引擎温度等的信息。波浪采样中，传感器跟踪河流的温度。两个例子都包含了一个空间层次结构：车辆监视系统中，不同的系统组成了更大的子系统，这些子系统又组成了汽车。波浪采样中，河流状态包含了一个个体温度的集合。就好像人类大脑，HTM网络需要充分的训练数据。有了这些充分的数据，HTM网络就可以很好地处理含混和嘈杂问题，并且能使用从不同的来源信息（温度，速度等）。它们同样不需要数据上的对称。总而言之，HTM所应用的领域需要有一个内在的层次结构，并且数据必须有时间和空间的关联。同时还要有以时间序组织起来的充分数据以供训练。; 8144 次阅读|0 个评论

对HTM白皮书中文译本的个人勘误（1）: wolfewu 2014-2-7 21:48; 声明：最近看到了Numenta的网站上有了CLA皮质学习算法中文版本的白皮书(白皮书的网页链接是 http://numenta.org/cla-white-paper.html )，看了下实在是不敢恭维翻译的质量，所以自己将其中的部分内容翻译整理了一下。文中的相关图片全部引自英文版的白皮书。欢迎大牛们批评指正~~ 约定：我们用Hierarchy 来直接表示层级组织结构，用region表示区域/层级，用level来表示hierachical中的等级什么是HTM？（我的另一篇文章也做了介绍 http://blog.sciencenet.cn/blog-1245419-765449.html ）是指Hierarchical Temproal Memory, 直译是层次化的时间记忆。是一种模拟人脑皮层结构的新型神经体系结构，与现有的用数学模型描述的ANN有很大的区别。但是他们的研究目的都是一样的——实现人工智能 HTM原理在这部分我们会涉及到HTM的一些核心原理，要说明几个问题：为什么hierarchical是重要的，HTM 的region是如何构建的，为什么数据要以sparse distributed representation的形式储存，为什么基于时间的信息是关键。层级组织结构 (Hierarchy) 一个HTM是由按层级排列的多个region所组成的。region是HTM中的主要记忆和预测单元，我们将会在下面的部分详细介绍。通常而言，每个region代表Herarchical中的一个level。信息沿着Hierachy中的level上升总是在进行聚合（聚合本身就是一个复杂的问题），下一个level的多个子元素将被聚合到上一个level的一个子元素中。然而，由于反馈连接的存在，信息也会随着level的下降而不断分流。（level和region基本上是同意词。当描述region的内在功能时我们使用“region”，档描述region对于整个Hierarchy的影响时我们使用“level”一词。图 1.1： 4 level的HTM的例子，每个level就是一个region，在level间，region内以及HTM的输入输出都存在数据通信我们完全有可能将多个分支HTM网络组合到一起来使用（见图1.2）。这种结构可以利用来自多个数据源和感受器提供的数据（也就是所谓的Multimodal的数据）。举个例子，一个分支HTM网络可能是处理音频信息而另一个分支HTM网络可能是处理视觉信息。在每一个分支HTM网络内存在聚合，而各个分支HTM的输出信息只向上聚合（分之间并没有信息传输）。图1.2：对不同感受器的网络进行聚合组织成Hierarchy的好处就是高效率。它显著减少了训练时间和存储空间，因为每一level中已学会的模式将以一种新奇的方式被组合在一起到更高的level中得到重用。我们以视觉为例进行说明。在人的大脑中，Hierarchy的最低level储存着许多有关视场中极小区域的信息，例如“边”和“角”。它们是许多视觉对象的基本构件。我们称之为模式。这些低level的模式将会在中level内重组成为更复杂的模式，例如曲线和条纹。一条弧线可以是耳朵的轮廓，方向盘的上部，咖啡杯的杯把。这些中level模式被进一步组合起来用于表征高level的（视觉）对象特征，例如脑袋、汽车、房子。因此，学习一个新的高level的（视觉）对象时，不需要重学它的构件，因为这些构件是在高level的（视觉）对象之间共享的。再例如，想想当你学习一个新单词的时候，你不需要再去学习字母、音节、因素等这些低level的语言构建了。在hierarchy内共享表征（也就是前面所说的构件）也导致了对“期望行为”的泛化。啥意思呢？比如，当你看见一只新的动物，如果你看到了它的嘴和牙齿，你会预测这个动物用它的嘴来进食，并有可能会咬你。Hierarchy使得你看到的新对象可以继承其子构件的已知特性。（“期望行为“的泛化在哪里呢？在上例中，”嘴“、”牙齿“这些构件的”吃“和”咬“的行为被泛化了，所以就能够用于预测新视觉对象的行为。这种泛化是一个对低level构件进行”抽象“的过程，这是后话） HTM中的一个level能学习多少内容呢？或者换句话说，需要多少个level才够用呢？我需要在每level的存储空间与level的数量间进行权衡。幸运地是，在给定输入的统计量和分配所得的资源数量后，HTM可以自动地在各level上学习到与这些给定数据相适应的最佳表征。如果你为某一层分配了较多的空间，那一层将构建更大更复杂的表征，这也就意味着需要更少的层级。如果你为某层分配了较少的空间，那一层将构建较小、较简单的表征，这也就意味着可能需要更多的层级。（说些题外话，类似的问题在传统的ANN中也是存在的，只不过扁平网络需要比较大的隐含层，大概是深度的指数级倍数，而较为深的网络又难以训练成功，至少在Deep Learning 出现前是这样）到此目前为止，我们已经描述了许多较难的问题，例如视觉推导（”推导“类似于模式识别）。但是许多有价值的的问题比视觉要简单，而且用一个HTM的region就足以解决。例如，我们用一个HTM网络来预测一个浏览网站的人下一步会点击哪里。这个问题涉及到使用web点击数据流来训练HTM网络。在这个问题中，有着很少甚至没有空间上的Hierarchy，求解这个问题的过程主要是发现与时间相关的统计量，也就是说通过识别典型的用户模式来预测用户的下一步会点击哪里。总而言之，hierarchy能减少训练时间，减少存储空间，并引入了一种泛化的形式。尽管如此，有的时候一个简单的HTM的region就能解决许多简单的预测问题。（后文请查阅本系列的第2篇，讲述有关region的原理）; 4647 次阅读|0 个评论

learning to rank代码调试: 热度 1 maoxianmenlian 2014-1-30 00:46; 最近在研究learning to rank算法，然后就找到了一个宝贝： https://github.com/JK-SUN/cikm12-vs-cf-sourcecode 于是我就通过百度网盘把代码数据下载下来，然后在运行时就报了这个错误： raceback (most recent call last): File G:\software\CIKM-SourceCode\sourcecode\weighted_KendallTauRank_specialty_degree_sim.py, line 169, in module main('user_user_sim_eachmovie_weight.data') File G:\software\CIKM-SourceCode\sourcecode\weighted_KendallTauRank_specialty_degree_sim.py, line 119, in main results=pprocess.pmap(calculate_tf,sequence_tf,limit) File G:\software\CIKM-SourceCode\sourcecode\pprocess.py, line 917, in pmap mymap = Map(limit=limit) File G:\software\CIKM-SourceCode\sourcecode\pprocess.py, line 675, in __init__ Exchange.__init__(self, *args, **kw) File G:\software\CIKM-SourceCode\sourcecode\pprocess.py, line 277, in __init__ self.poller = select.poll() AttributeError: 'module' object has no attribute 'poll' 是由select.poll()引起的，经检查才知道这个接口仅限于Unix操作系统，windows上不能用。 http://docs.python.org/2/library/select.html 真是可惜，本以为python在哪个操作系统都能用，看来该用Linux还是要用Linux，拿windows来碰运气纯粹是浪费时间。找到错误原因了，可以睡而无憾了。; 个人分类: 推荐系统|5337 次阅读|2 个评论

Five major learning promotors: hustfliee 2014-1-18 15:56; 1. Innate learning programs 2. Repeatition of information 3. Excitement 4. Eating carbohydrates 5. 8-9 hours of sleep; 2552 次阅读|0 个评论

[转载]加州大学圣迭戈分校学习算法研讨会 deep learning 深度学习: hestendelin 2013-12-1 21:48; 摘自： http://cseweb.ucsd.edu/~dasgupta/254-deep/ CSE 254: Seminar on Learning Algorithms Time TuTh 3.30-5 in CSE 2154 Instructor: Sanjoy Dasgupta Office hours TBA in EBU3B 4138 This quarter the theme of CSE 254 is deep learning . Prerequisite: CSE 250AB. The first couple of lectures will be an overview of basic material. Thereafter, in each class meeting, a student will give a talk lasting about 60 minutes presenting a technical paper (or several papers) in detail. In questions during the talk, and in the final 20 minutes, all seminar participants will discuss the paper and the issues raised by it. Date Presenter Paper Slides Jan 10 Sanjoy Introduction Jan 12 Sanjoy Hopfield nets Jan 17 Sanjoy Markov random fields, Gibbs sampling, simulated annealing Jan 19 Sanjoy Deep belief nets as autoencoders and classifiers Jan 24 Brian Task-driven dictionary learning here Jan 26 Vicente A quantitative theory of immediate visual recognition here Jan 31 Emanuele Convolutional deep belief networks here Feb 2 Nakul Restricted Boltzmann machines: learning , and hardness of inference here Feb 7 Craig The independent components of natural scenes are edge filters here Feb 9 No class: ITA conference at UCSD Feb 14 Janani Deep learning via semi-supervised embedding here Feb 16 Stefanos A unified architecture for natural language processing here Feb 21 Hourieh An analysis of single-layer networks in unsupervised feature learning here Feb 23 Ozgur Emergence of simple-cell receptive properties by learning a sparse code for natural images here Feb 28 Matus Representation power of neural networks: Barron , Cybenko , Kolmogorov here Mar 1 Frederic Reinforcement learning on slow features of high-dimensional input streams Mar 6 Dibyendu, Sreeparna Learning deep energy models and What is the best multistage architecture for object recognition? here Mar 8 No class: Sanjoy out of town Mar 13 Bryan Inference of sparse combinatorial-control networks here Mar 15 Qiushi Weighted sums of random kitchen sinks here This is a four unit course in which the work consists of oral presentations. The procedure for each student presentation is as follows: · One week in advance: Finish a draft of Latex/Powerpoint that present clearly the work in the paper. Make an appointment with me to discuss the draft slides. And email me the slides. · Several days in advance: Meet for about one hour to discuss improving the slides, and how to give a good presentation. · Day of presentation: Give a good presentation with confidence, enthusiasm, and clarity. · Less than three days afterwards: Make changes to the slides suggested by the class discussion, and email me the slides in PDF, two slides per page, for publishing. Try to make your PDF file less than one megabyte. Please read, reflect upon, and follow these presentation guidelines , courtesy of Prof Charles Elkan. Presentations will be evaluated, in a friendly way but with high standards, using this feedback form . Here is a preliminary list of papers .; 个人分类: 深度学习|2490 次阅读|0 个评论

[转载]CMU2013 DeepLearning Course 卡内基梅隆大学深度学习课程: hestendelin 2013-11-30 10:55; Deep LearningInstructor: Bhiksha Raj COURSE NUMBER -- MLD: 10805 LTI: 11-785 (Lab) / 11-786 (Seminar) Timings: 1:30 p.m. -- 2:50 p.m. Days: Mondays and Wednesdays Location: GHC 4211 Website: http://deeplearning.cs.cmu.edu Credits: 10-805 and 11-786 are 6-credit seminar courses. 11-785 is a 12-credit lab course. Students who register for 11-785 will be required to complete all lab exercises. IMPORTANT: LTI students are requested to switch to the 11-XXX courses. All students desiring 12 credits must register for 11-785. Instructor: Bhiksha Raj Contact: email:bhiksha@cs.cmu.edu, Phone:8-9826, Office: GHC6705 Office hours: 3.30-5.00 Mondays. You may also meet me at other times if I'm free. TA: Anders Oland Contact: email:anderso@cs.cmu.edu, Office: GHC7709 Office hours: 12:30-2:00 Fridays. Deep learning algorithms attempt to learn multi-level representations of data, embodying a hierarchy of factors that may explain them. Such algorithms have been demonstrated to be effective both at uncovering underlying structure in data, and have been successfully applied to a large variety of problems ranging from image classification, to natural language processing and speech recognition. In this course students will learn about this resurgent subject. The course presents the subject through a series of seminars, which will explore it from its early beginnings, and work themselves to some of the state of the art. The seminars will cover the basics of deep learning and the underlying theory, as well as the breadth of application areas to which it has been applied, as well as the latest issues on learning from very large amounts of data. Although the concept of deep learning has been applied to a number of different models, we will concentrate largely, although not entirely, on the connectionist architectures that are most commonly associated with it. Students who participate in the course are expected to present at least one paper on the topic to the class. Presentations are expected to be thorough and, where applicable, illustrated through experiments and simulations conducted by the student. Students are registered for the lab course must also complete all lab exercises. Labs Lab 1 is up Lab 1: Perceptrons and MLPs Data sets Due: 18 Sep 2013 Lab 2 is up Lab 1: The effect of increasing network depth Data set Due: 17 Oct 2013 Papers and presentations Date Topic/paper Author Presenter Additional Links 28 Aug 2013 Introduction Bhiksha Raj Intelligent Machinery Alan Turing Subhodeep Moitra 4 Sep 2013 Bain on Neural Networks. Brain and Cognition 33:295-305, 1997 Alan L. Wilkes and Nicholas J. Wade Lars Mahler McCulloch, W.S. Pitts, W.H. (1943). A Logical Calculus of the Ideas Immanent in Nervous Activity, Bulletin of Mathematical Biophysics, 5:115-137, 1943. W.S. McCulloch and W.H. Pitts Kartik Goyal Michael Marsalli's tutorial on the McCulloch and Pitts Neuron 9 Sep 2013 The Perceptron: A Probalistic Model For Information Storage And Organization In The Brain. Psychological Review 65 (6): 386.408, 1958. F. Rosenblatt Daniel Maturana ?? Chapter from “The organization of Behavior”, 1949. D. O. Hebb Sonia Todorova 11 Sep 2013 The Widrow Hoff learning rule (ADALINE and MADALINE). Widrow Pallavi Baljekar ?? Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks 2 (6): 459.473, 1989. T. Sanger Khoa Luu A simplified Neuron model as a principal component analyzer, by Erkki Oja 16 Sep 2013 Learning representations by back-propagating errors. Nature323(6088): 533.536 Rumelhart et al. Ahmed Hefny Chapter by Rumelhart, Hinton and Williams Backpropagation through time: what it does and how to do it., P. Werbos, Proc. IEEE 1990 A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm, IEEE Intl. Conf. on Neural Networks, 1993 M. Riedmiller, H. Braun Danny (ZhenZong) Lan 18 Sep 2013 Neural networks and physical systems with emergent collective computational abilities, Proc. Natl. Acad. Sciences, Vol 79, 2554-2558, 1982 J. J. Hopfield Prasanna Muthukumar The self-organizing map. Proc. IEEE, Vol 79, 1464:1480, 1990 Teuvo Kohonen Fatma Faruq 23 Sep 2013 Phoneme recognition using time-delay neural networks, IEEE trans. Acoustics, Speech Signal Processing, Vol 37(3), March 1989 A. Waibel et al. Chen Chen A tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the echo state network approach, GMD Report 159, German National Research Center for Information Technology, 2002 Herber Jaeger Shaowei Wang 25 Sep 2013 Bidirectional recurrent neural networks, IEEE transactions on signal processing, Vol 45(11), Nov. 1997 M. Schuster and K. Paliwal Felix Juefei Xu Long short-term memory. Neural Computation, 9(8):1735.1780, 1997 S. Hochreiter and J. Schmidhuber Dougal Sutherland 30 Sep 2013 A learning algorithm for Boltzmann machines, Cognitive Science, 9, 147-169, 1985 D. Ackley, G. Hinton, T. Sejnowski Siyuan Improved simulated annealing, Boltzmann machine, and attributed graph matching, EURASIP Workshop on Neural Networks, vol 412, LNCS, Springer, pp: 151-160, 1990 Lei Xu, Erkii Oja. Ran Chen 2 Oct 2013 Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position, Pattern Recognition Vol. 15(6), pp. 455-469, 1982 K. Fukushima, S. Miyake Sam Thomson Shift invariance and the Neocognitron, E. Barnard and D. Casasent, Neural Networks Vol 3(4), pp. 403-410, 1990 Face recognition: A convolutional neural-network approach, IEEE transactions on Neural Networks, Vol 8(1), pp98-113, 1997 S. Lawrence, C. L. Giles, A. C. Tsoi, A. D. Back Hoang Ngan Le Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis, P.Y.Simard, D. Steinkraus, J.C. Platt, Prc. Document analysis and recognition, 2003 Gradient based learning applied to document recognition, Y. LeCun, L. Bottou, Y. Bengio, P. Haffner. Proceedings of the IEEE, November 1998, pp. 1-43 7 Oct 2013 On the problem of local minima in backpropagation, IEEE tran. Pattern Analysis and Machine Intelligence, Vol 14(1), 76-86, 1992 M. Gori, A. Tesi Jon Smereka Learning long-term dependencies with gradient descent is difficult, IEEE trans. Neural Networks, Vol 5(2), pp 157-166, 1994 Y. Bengio, P. Simard, P. Frasconi Keerthiram Murugesan Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, in A Field Guide to Dynamical Recurrent Neural Networks, IEEE Press , 2001 Backpropagation is sensitive to initial conditions, J. F. Kolen and J. B. Pollack, Advances in Neural Information Processing Systems, pp 860-867, 1990 9 Oct 2013 Multilayer feedforward networks are universal approximators, Neural Networks, Vol:2(3), 359-366, 1989 K. Hornik, M. Stinchcombe, H. White Sonia Todorova Approximations by superpositions of a sigmoidal function, G. Cybenko, Mathematics of control, signals and systems, Vol:2, pp. 303-314, 1989 On the approximation realization of continuous mappings by neural networks, K. Funahashi, Neural Networks, Vol. 2(3), pp. 183-192, 1989 Universal approximation bounds for superpositions of a sigmoidal function, A. R. Barron, IEEE Trans. on Info. Theory, Vol 39(3), pp. 930-945, 1993 On the expressive power of deep architectures, Proc. 14th intl. conf. on discovery science, 2011 Y. Bengio and O. Delalleau Prasanna Muthukumar Scaling learning algorithms towards AI, Y. Bengio and Y. LeCunn, in Large Scale Kernel Machines , Eds. Bottou, Chappelle, DeCoste, Weston, 2007 Shallow vs. Deep sum product networks, O. Dellaleau and Y. Bengio, Advances in Neural Information Processing Systems, 2011 14 Oct 2013 Information processing in dynamical systems: Foundations of Harmony theory; In Parallel Distributed Processing: Explorations in the microstructure of cognition , Rumelhart and McLelland eds., 1986 Paul Smolensky Kathy Brigham Geometry of the restricted Boltzmann machine, M. A. Cueto, J. Morton, B. Sturmfels, Contemporary Mathematics, Vol. 516., pp. 135-153, 2010 Exponential family harmoniums with and application to information retrieval, Advances in Neural Information Processing Systems (NIPS), 2004 M. Welling, M. Rosen-Zvi, G. Hinton Ankur Gandhe Continuous restricted Boltzmann machine with an implementable training algorithm, H. Chen and A. F. Muray, IEE proceedings on Vision, Image and Signal Processing, Vol. 150(3), pp. 153-158, 2003 Diffusion networks, product of experts, and factor analysis, T. K. Marks and J. R. Movellan, 3rd Intl. Conf. on Independent Component Analysis and Signal Separation, 2001 16 Oct 2013 Distributed optimization of deeply nested systems. Unpublished manuscript, Dec. 24, 2012, arXiv:1212.5921 M. Carrera-Perpiñan and W. Wang M. Carrera-Perpiñan 21 Oct 2013 Training products of experts by minimizing contrastive divergence, Neural Computation, Vol. 14(8), pp. 1771-1800, 2002 G. Hinton Yuxiong Wang On contrastive divergence learning, M. Carrera-Perpinñan, AI and Statistics, 2005 Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient, T. Tieleman, International conference on Machine learning (ICML), pp. 1064-1071, 2008 An Analysis of Contrastive Divergence Learning in Gaussian Boltzmann Machines, Chris Williams, Felix Agakov, Tech report, University of Edinburgh, 2002 Justifying and generalizing contrastive divergence, Y. Bengio, O. Delalleau, Neural Computation, Vol. 21(6), pp. 1601-1621, 2009 23 Oct 2013 A fast learning algorithm for deep belief networks, Neural Computation, Vol. 18, No. 7, Pages 1527-1554, 2006. G. Hinton, S. Osindero, Y.-W. Teh Aaron Wise Reducing the dimensionality of data with Neural Networks, G. Hinton and R. Salakhutidnov, Science, Vol. 313. no. 5786, pp. 504 - 507, 28 July 2006 Greedy layer-wise training of deep networks, Neural Information Processing Systems (NIPS), 2007. Y. Bengio, P. Lamblin, D. Popovici and H. Larochelle Ahmed Hefny Efficient Learning of Sparse Overcomplete Representations with an Energy-Based Model, M. Ranzato, C.S. Poultney, S. Chopra, Y. Lecunn, Neural Information Processing Systems (NIPS), 2006. 28 Oct 2013 Imagenet classification with deep convolutional neural networks, NIPS 2012 A. Krizhevsky, I. Sutskever, G. Hinton Danny Lan Convolutional recursive deep learning for 3D object classification, R. Socher, B. Huval, B. Bhat, C. Manning, A. Ng, NIPS 2012 Multi-column deep neural networks for image classification, D. Ciresan, U. Meier and J. Schmidhuber, CVPR 2012 Learning hierarchial features for scene labeling, IEEE transactions on pattern analysis and machine intelligence, Vol 35(8), pp. 1915-1929, 2012 C. Couprie, L. Najman, Y. LeCun Jon Smereka Learning convolutional feature hierarchies for visual recognition, K. Laukcuoglu,P. Sermanet, Y-Lan Boureau, K. Gregor, M. Mathieu, Y. LeCun, NIPS 2010 30 Oct 2013 Statistical language models based on neural networks, PhD dissertation, Brno, 2012, chapters 3 and 6 T. Mikolov, Fatma Faruq Semi-supervised recursive autoencoders for predicting sentiment R. Socher, J. Pennington, E. Huang, A. Ng and C. Manning Yueran Yuan Dynamic pooling and unfoloding recursive autoencoders for paraphrase detection, R. Socher, E. Huang, J. Pennington, A. Ng, C. Manning, EMNLP 2011 Joint learning of words and meaning representation for open-text semantic parsing, A.Bodes, X. Glorot, J. Weston, Y. Bengio, AISTATS 2012 4 Nov 2013 Supervised sequence labelling with recurrent neural networks, PhD dissertation, T. U. Munchen, 2008, Chapters 4 and 7 A. Graves, Georg Schoenherr Speech recognition with deep recurrent neural networks, A. Graves, A.-. Mohamed, G. Hinton, ICASSP 2013 Deep neural networks for acoustic modeling in speech recognition: the shared view of four research groups, IEEE Signal Processing Magazine, Vold 29(6), pp 82-97, 2012. G. Hinton et al. Daniel Maturana 6 Nov 2013 Modeling Documents with a Deep Boltzmann Machine, UAI 2013 N. Srivastava, R. Salakhutidinov, G. Hinton Siyuan Generating text with Recurrent Neural Networks, I. Sutskever, J. Martens, G. Hinton, ICML 2011 Word representations: A simple and general method for semi-supervised learning, ACL 2010 J. Turian, L. Ratinov, Y. Bengio Sam Thomson 11 Nov 2013 An empirical evaluation of deep architectures on problems with many factors or variables, ICML 2007 H. Larochelle, D. Erhan, A. Courville, J. Bergstra, Y. Bengio Ran Chen The Difficulty of Training Deep Architectures and the Effect of Unsupervised Pre-Training, AISTATS 2009 D. Erhan, P.-A. Manzagol, Y. Bengio, S. Bengio, P. Vincent Ankur Gandhe 13 Nov 2013 Extracting and Composing Robust Features with Denoising Autoencoders, ICML 2008 P. Vincent, H. Larochelle, Y. Bengio, P.-A. Manzgool Pallavi Baljekar Improving neural networks by preventing co-adaptation of feature detectors, G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sustskever, R. R. Salakhutdinov Subhodeep Moitra 18 Nov 2013 A theory of deep learning architectures for sensory perception: the ventral stream, Fabio Anselmi, Joel Z Leibo, Lorenzo Rosasco, Jim Mutch, Andrea Tacchetti, Tomaso Poggio Dipan Pal 20 Nov 2013 No more pesky learning rates, ICML 2013 Tom Schaul, Sixin Zhang and Yann LeCun Georg Shoenherr No more pesky learning rates: supplementary material On the importance of initialization and momentum in deep learning, JMLR 28(3): 1139.1147, 2013 Ilya Sutskever, James Martens, George Dahl, Geoffrey Hinton Kartik Goyal Supplementary material for paper 25 Nov 2013 Guest lecture Quoc Le 27 Nov 2013 A multi-layer sparse coding network learns contour coding from natural images Neural Networks Research Centre, Vision Research 42(12): 1593-1605, 2002 Patrik O. Hoyer and Aapo Hyvarinen Sparse Feature Learning for Deep Belief Networks, NIPS 2007 Marc.Aurelio Ranzato Y-Lan Boureau, Yann LeCun Sparse deep belief net model for visual area V2, NIPS 2007 Honglak Lee Chaitanya Ekanadham Andrew Y. Ng Deep Sparse Rectifier Neural Networks, JMLR 16: 315-323, 2011 Xavier Glorot, Antoine Bordes, Yoshua Bengio To be arranged Exploring strategies for training deep neural networks, Journal of Machine Learning Research, Vol. 1, pp 1-40, 2009 H. Larochelle, Y. Bengio, J. Louradour, P. Lamblin Why Does Unsupervised Pre-training Help Deep Learning?, AISTATS 2010 D. Erhan, A. Courville, Y. Bengio, P. Vincent Understanding the difficulty of training deep feedforward neural networks, AISTATS 2010 X. Glorot and Y. Bengio A Provably Efficient Algorithm for Training Deep Networks, arXiv:1304.7045 , 2013 R. Livni, S. Shalev-Schwartz, O. Shamir; 个人分类: 深度学习|6905 次阅读|1 个评论

Python3.3.2_LearningPython(5rd)_学习笔记_Chapter3: a582234617 2013-10-13 13:51; 1. cmd 不是 DOS ，切换盘符的方法：要进其他盘直接打盘符：就行了，如 e ：回车。 cd 是进入目录的命令 2. 双击图标只有一个黑窗口一闪而过，这并不是一个 bug ，只是这个文件飞快的执行完成并退出了，除非你的电脑很旧很慢，否则你是看不清它的，解决办法很简单，在 3.x 版本里在文件最后加上一行 input() ，在 2.x 里加上一行 raw_input ，再双击图标，输出结果就会显示在屏幕上不动了，再点击一下 Enter ，它就会退出。 3. 如果 import 没有导入或报告有错误，那么不能使用 reload ，除非重新导入 4. 模块就是不同变量名的封装，被认作是命名空间，封装里的名字被称为属性。 5.Import 获得模块和它的属性， from 获得文件变量名的副本 6.内置的 dir 函数可以获得模块内部的可用变量名的列表，我们代码里定义的变量在最后显示。; 个人分类: Python3.3.2学习笔记|1170 次阅读|0 个评论

[转载]Facebook begin deep learning after google 深度学习: hestendelin 2013-9-24 14:46; Facebook Launches Advanced AI Effort to Find Meaning in Your Posts A technique called deep learning could help Facebook understand its users and their data better. By Tom Simonite on September 20, 2013 Facebook ’s piles of data on people’s lives could allow it to push the boundaries of what can be done with the emerging AI technique known as deep learning . Facebook is set to get an even better understanding of the 700 million people who use the social network to share details of their personal lives each day. A new research group within the company is working on an emerging and powerful approach to artificial intelligence known as deep learning , which uses simulated networks of brain cells to process data. Applying this method to data shared on Facebook could allow for novel features and perhaps boost the company’s ad targeting. Deep learning has shown potential as the basis for software that could work out the emotions or events described in text even if they aren’t explicitly referenced, recognize objects in photos, and make sophisticated predictions about people’s likely future behavior. The eight-person group , known internally as the AI team, only recently started work, and details of its experiments are still secret. But Facebook’s chief technology officer , Mike Schroepfer , will say that one obvious way to use deep learning is to improve the news feed, the personalized list of recent updates he calls Facebook’s “ killer app .” The company already uses conventional machine learning techniques to prune the 1,500 updates that average Facebook users could possibly see down to 30 to 60 that are judged most likely to be important to them. Schroepfer says Facebook needs to get better at picking the best updates because its users are generating more data and using the social network in different ways. “The data set is increasing in size, people are getting more friends, and with the advent of mobile, people are online more frequently,” Schroepfer told MIT Technology Review . “It’s not that I look at my news feed once at the end of the day; I constantly pull out my phone while I’m waiting for my friend or I’m at the coffee shop. We have five minutes to really delight you.” Shroepfer says deep learning could also be used to help people organize their photos or choose which is the best one to share on Facebook . In looking into deep learning , Facebook follows its competitors Google and Microsoft , which have used the approach to impressive effect in the past year. Google has hired and acquired leading talent in the field (see “ 10 Breakthrough Technologies 2013: Deep Learning ”), and last year it created software that taught itself to recognize cats and other objects by reviewing stills from YouTube videos. The underlying technology was later used to slash the error rate of Google’s voice recognition services (see “ Google’s Virtual Brain Goes to Work ”). Meanwhile, researchers at Microsoft have used deep learning to build a system that translates speech from English to Mandarin Chinese in real time (see “ Microsoft Brings Star Trek’s Voice Translator to Life ”). Chinese Web giant Baidu also recently established a Silicon Valley research lab to work on deep learning . Less complex forms of machine learning have underpinned some of the most useful features developed by major technology companies in recent years, such as spam detection systems and facial recognition in images. The largest companies have now begun investing heavily in deep learning because it can deliver significant gains over those more established techniques, says Elliot Turner , founder and CEO of AlchemyAPI , which rents access to its own deep learning software for text and images. “Research into understanding images, text, and language has been going on for decades, but the typical improvement a new technique might offer was a fraction of a percent,” he says. “In tasks like vision or speech, we’re seeing 30 percent-plus improvements with deep learning .” The newer technique also allows much faster progress in training a new piece of software, says Turner. Conventional forms of machine learning are slower because before data can be fed into learning software, experts must manually choose which features of it the software should pay attention to, and they must label the data to signify, for example, that certain images contain cars. Deep learning systems can learn with much less human intervention because they can figure out for themselves which features of the raw data are most significant. They can even work on data that hasn’t been labeled, as Google’s cat-recognizing software did. Systems able to do that typically use software that simulates networks of brain cells, known as neural nets, to process data. They require more powerful collections of computers to run. Facebook’s AI group will work on applications that can help the company’s products as well as on more general research that will be made public, says Srinivas Narayanan , an engineering manager at Facebook who’s helping to assemble the new group. He says one way Facebook can help advance deep learning is by drawing on its recent work creating new types of hardware and software to handle large data sets (see “ Inside Facebook’s Not-So-Secret New Data Center ”). “It’s both a software and a hardware problem together; the way you scale these networks requires very deep integration of the two,” he says. Facebook hired deep learning expert Marc’Aurelio Ranzato away from Google for its new group. Other members include Yaniv Taigman , cofounder of the facial recognition startup Face.com (see “ When You’re Always a Familiar Face ”); computer vision expert Lubomir Bourdev ; and veteran Facebook engineer Keith Adams . 原文： http://www.technologyreview.com/news/519411/facebook-launches-advanced-ai-effort-to-find-meaning-in-your-posts/; 个人分类: 深度学习|2374 次阅读|0 个评论

脑与deep learning读书会第二期视频和讲稿: xiaoda99 2013-8-2 20:56; 　题目：Descriptive, Mechanistic and Interpretive Models of Primary Visual Cortex 　　主讲人：肖达，北京邮电大学计算机学院教师。　　袁行远，前淘宝网数据挖掘与并行计算高级算法工程师，现辞职休假中。　　　　提纲：　　1.Descriptive models (What): 　　 * Responses of a Neuron in an Intact Cat Brain, (视频: Hubel Wiesel - Cortical Neuron - V1 http://v.youku.com/v_show/id_XNDc0MTkxODc2.html ) 　　 * Contrast sensitivity of Human 　　 * Receptive Fields and Edges Detection Program Demo 　　2.Machanistic Models (How): 　　 * Oriented Receptive Fields and Position-Less Receptive Fields 　　 * Fourier Decomposition hypothesis 　　 * Build Self-Organizing Map for V1 　　3.Interpretive Models (Why): 　　 * What is the Best Multi-Stage Architecture for Object Recognition 　　4.The columnar organization of the neocortex and its implication for computer vision 　　　　参考文献：　　【NB】Matteo Carandini (2012) Area V1. Scholarpedia, 7(7):12105. http://www.scholarpedia.org/article/Area_V1 　　【NB】【CM】Carandini M, et al. (2005) Do we know what the early visual system does? Journal of Neuroscience, 25:10577-10597. 　　【NB】Douglas, RJ and Martin, KAC (2007) Recurrent neuronal circuits in the neocortex. Current Opinion in Biology, 17:496-500. 　　【NB】Douglas, RJ and Martin, KAC (2010) Canonical cortical circuits. Chapter 2 in Handbook of Brain Microcircuits 15-21. 　　【ML】Kevin Jarrett, Koray Kavukcuoglu, Marc’Aurelio Ranzato, and Yann LeCun. (2009) What is the Best Multi-Stage Architecture for Object Recognition? in Proc. International Conference on Computer Vision (ICCV’09). 　　(文章前的标签代表类型，NB=神经生物学发现，CM=计算模型，ML=机器学习算法，SP=统计物理。) 多贝视频： http://www.duobei.com/room/3011311368 讲稿：　　袁行远，What do we know about V1 　　 http://vdisk.weibo.com/s/u4Vws15JLvz_z 　　　　代码演示网址 http://www.demogng.de/ 　　　　肖达，Modular organization of neocortex and its implication for computer vision 　　 http://vdisk.weibo.com/s/u4Vws15JLvz_l; 3768 次阅读|0 个评论

[转载]manifold learning 流形学习: 热度 1 aaa0 2013-7-27 02:24; manifold learning 流形学习流形学习简介流形学习是个很广泛的概念。这里我主要谈的是自从2000年以后形成的流形学习概念和其主要代表方法。自从2000年以后，流形学习被认为属于非线性降维的一个分支。众所周知，引导这一领域迅速发展的是2000年Science杂志上的两篇文章: Isomap and LLE (Locally Linear Embedding)。 1. 流形学习的基本概念那流形学习是什莫呢？为了好懂，我尽可能应用少的数学概念来解释这个东西。所谓流形（manifold）就是一般的几何对象的总称。比如人，有中国人、美国人等等；流形就包括各种维数的曲线曲面等。和一般的降维分析一样，流形学习把一组在高维空间中的数据在低维空间中重新表示。和以往方法不同的是，在流形学习中有一个假设，就是所处理的数据采样于一个潜在的流形上，或是说对于这组数据存在一个潜在的流形。对于不同的方法，对于流形性质的要求各不相同，这也就产生了在流形假设下的各种不同性质的假设，比如在Laplacian Eigenmaps中要假设这个流形是紧致黎曼流形等。对于描述流形上的点，我们要用坐标，而流形上本身是没有坐标的，所以为了表示流形上的点，必须把流形放入外围空间（ambient space）中，那末流形上的点就可以用外围空间的坐标来表示。比如R^3中的球面是个2维的曲面，因为球面上只有两个自由度，但是球面上的点一般是用外围R^3空间中的坐标表示的，所以我们看到的R^3中球面上的点有3个数来表示的。当然球面还有柱坐标球坐标等表示。对于R^3中的球面来说，那末流形学习可以粗略的概括为给出R^3中的表示，在保持球面上点某些几何性质的条件下，找出找到一组对应的内蕴坐标（intrinsic coordinate）表示，显然这个表示应该是两维的，因为球面的维数是两维的。这个过程也叫参数化（parameterization）。直观上来说，就是把这个球面尽量好的展开在通过原点的平面上。在PAMI中，这样的低维表示也叫内蕴特征（intrinsic feature）。一般外围空间的维数也叫观察维数，其表示也叫自然坐标（外围空间是欧式空间）表示,在统计中一般叫observation。了解了流形学习的这个基础，那末流形学习中的一些是非也就很自然了，这个下面穿插来说。由此，如果你想学好流形学习里的方法，你至少要了解一些微分流形和黎曼几何的基本知识。 2. 代表方法 a) Isomap。 Josh Tenenbaum的Isomap开创了一个数据处理的新战场。在没有具体说Isomap之前，有必要先说说MDS（Multidimensional Scaling）这个方法。我们国内的很多人知道PCA，却很多人不知道MDS。PCA和MDS是相互对偶的两个方法。MDS就是理论上保持欧式距离的一个经典方法，MDS最早主要用于做数据的可视化。由于MDS得到的低维表示中心在原点，所以又可以说保持内积。也就是说，用低维空间中的内积近似高维空间中的距离。经典的MDS方法，高维空间中的距离一般用欧式距离。 Isomap就是借窝生蛋。他的理论框架就是MDS，但是放在流形的理论框架内，原始的距离换成了流形上的测地线（geodesic）距离。其它一模一样。所谓的测地线，就是流形上加速度为零的曲线，等同于欧式空间中的直线。我们经常听到说测地线是流形上两点之间距离最短的线。其实这末说是不严谨的。流形上两点之间距离最短的线是测地线，但是反过来不一定对。另外，如果任意两个点之间都存在一个测地线，那末这个流形必须是连通的邻域都是凸的。Isomap就是把任意两点的测地线距离（准确地说是最短距离）作为流形的几何描述，用MDS理论框架理论上保持这个点与点之间的最短距离。在Isomap中，测地线距离就是用两点之间图上的最短距离来近似的，这方面的算法是一般计算机系中用的图论中的经典算法。如果你曾细致地看过Isomap主页上的matlab代码，你就会发现那个代码的实现复杂度远超与实际论文中叙述的算法。在那个代码中，除了论文中写出的算法外，还包括了 outlier detection和embedding scaling。这两样东西，保证了运行他们的程序得到了结果一般来说相对比较理想。但是，这在他们的算法中并没有叙述。如果你直接按照他论文中的方法来实现，你可以体会一下这个结果和他们结果的差距。从此我们也可以看出，那几个作者做学问的严谨态度，这是值得我们好好学习的。另外比较有趣的是，Tenenbaum根本不是做与数据处理有关算法的人，他是做计算认知科学（computational cognition science）的。在做这个方法的时候，他还在stanford，02年就去了MIT开创一派，成了CoCoSci 的掌门人，他的组成长十分迅速。但是有趣的是，在Isomap之后，他包括他在MIT带的学生就从来再也没有做过类似的工作。其原因我今年夏天有所耳闻。他在今年参加 UCLA Alan Yuille 组织的一个summer school上说，（不是原文，是大意）我们经常忘了做研究的原始出发点是什莫。他做Isomap就是为了找一个好的visual perception的方法，他还坚持了他的方向和信仰，computational cognition，他没有随波逐流。而由他引导起来的 manifold learning 却快速的发展成了一个新的方向。这是一个值得我们好好思考的问题。我们做一个东西，选择一个研究方向究竟是为了什莫。你考虑过吗？（当然，此问题也在问我自己） b) LLE (Locally linear Embedding) LLE 在作者写出的表达式看,是个具有十分对称美的方法. 这种看上去的对称对于启发人很重要。LLE的思想就是，一个流形在很小的局部邻域上可以近似看成欧式的，就是局部线性的。那末，在小的局部邻域上，一个点就可以用它周围的点在最小二乘意义下最优的线性表示。LLE把这个线性拟合的系数当成这个流形局部几何性质的刻画。那末一个好的低维表示，就应该也具有同样的局部几何，所以利用同样的线性表示的表达式，最终写成一个二次型的形式，十分自然优美。注意在LLE出现的两个加和优化的线性表达，第一个是求每一点的线性表示系数的。虽然原始公式中是写在一起的，但是求解时，是对每一个点分别来求得。第二个表示式，是已知所有点的线性表示系数，来求低维表示（或嵌入embedding）的，他是一个整体求解的过程。这两个表达式的转化正好中间转了个弯，使一些人困惑了，特别后面一个公式写成一个二次型的过程并不是那末直观，很多人往往在此卡住，而阻碍了全面的理解。我推荐大家去精读 Saul 在JMLR上的那篇LLE的长文。那篇文章无论在方法表达还是英文书写，我认为都是精品，值得好好玩味学习。另外值得强调的是，对于每一点处拟合得到的系数归一化的操作特别重要，如果没有这一步，这个算法就没有效果。但是在原始论文中，他们是为了保持数据在平行移动下embedding不变。 LLE的matlab代码写得简洁明了，是一个样板。在此有必要提提Lawrence Saul这个人。在Isomap和LLE的作者们中，Saul算是唯一一个以流形学习（并不限于）为研究对象开创学派的人。Saul早年主要做参数模型有关的算法。自从LLE以后，坐阵UPen创造了一个个佳绩。主要成就在于他的两个出色学生，Kilian Weinberger和 Fei Sha，做的方法。拿了很多奖，在此不多说，可以到他主页上去看。Weinberger把学习核矩阵引入到流形学习中来。他的这个方法在流形学习中影响到不是很显著，却是在 convex optimization 中人人得知。Fei Sha不用多说了，machine learning中一个闪亮的新星，中国留学生之骄傲。现在他们一个在Yahoo,一个在Jordan手下做PostDoc。 c) Laplacian Eigenmaps 要说哪一个方法被做的全面，那莫非LE莫属。如果只说LE这个方法本身，是不新的，许多年前在做mesh相关的领域就开始这莫用。但是放在黎曼几何的框架内，给出完整的几何分析的，应该是Belkin和Niyogi（LE作者）的功劳。 LE的基本思想就是用一个无向有权图来描述一个流形，然后通过用图的嵌入（graph embedding）来找低维表示。说白了，就是保持图的局部邻接关系的情况把这个图从高维空间中重新画在一个低维空间中（graph drawing）。在至今为止的流行学习的典型方法中，LE是速度最快、效果相对来说不怎莫样的。但是LE有一个其他方法没有的特点，就是如果出现outlier情况下，它的鲁棒性（robustness）特别好。后来Belkin和Niyogi又分析了LE的收敛性。大家不要忽视这个问题，很重要。鼓励有兴趣数学功底不错的人好好看看这篇文章。 d) Hessian Eigenmaps 如果你对黎曼几何不懂，基本上看不懂这个方法。又加作者表达的抽象，所以绝大多数人对这个方法了解不透彻。在此我就根据我自己的理解说说这个方法。这个方法有两个重点：（1）如果一个流形是局部等距（isometric）欧式空间中一个开子集的，那末它的Hessian矩阵具有d+1维的零空间。（2）在每一点处，Hessian系数的估计。首先作者是通过考察局部Hessian的二次型来得出结论的，如果一个流形局部等距于欧式空间中的一个开子集，那末由这个流形 patch到开子集到的映射函数是一个线性函数，线性函数的二次混合导数为零，所以局部上由Hessian系数构成的二次型也为零，这样把每一点都考虑到，过渡到全局的Hessian 矩阵就有d+1维的零空间，其中一维是常函数构成的，也就是1向量。其它的d维子空间构成等距坐标。这就是理论基础的大意，当然作者在介绍的时候，为了保持理论严谨，作了一个由切坐标到等距坐标的过渡。另外一个就是局部上Hessian系数的估计问题。我在此引用一段话： If you approximate a function f(x) by a quadratic expansion f(x) = f(0) + (grad f)^T x + x^T Hf x + rem then the hessian is what you get for the quadratic component. So simply over a given neighborhood, develop the operator that approximates a function by its projection on 1, x_1,…,x_k, x_1^2,…,x_k^2, x_1*x_2,… ,x_{k-1}*x_{k}. Extract the component of the operator that delivers the projection on x_1^2,…,x_k^2, x_1*x_2,… ,x_{k-1}*x_{k}. dave 这段话是我在初学HE时候，写信问Dave Donoho，他给我的回信。希望大家领会。如果你了解了上述基本含义，再去细看两遍原始论文，也许会有更深的理解。由于HE牵扯到二阶导数的估计，所以对噪声很敏感。另外，HE的原始代码中在计算局部切坐标的时候，用的是奇异值分解(SVD)，所以如果想用他们的原始代码跑一下例如图像之类的真实数据，就特别的慢。其实把他们的代码改一下就可以了，利用一般PCA的快速计算方法，计算小尺寸矩阵的特征向量即可。还有，在原始代码中，他把Hessian系数归一化了，这也就是为什莫他们叫这个方法为 Hessian LLE 的原因之一。 Dave Dohono是学术界公认的大牛，在流形学习这一块，是他带着他的一个学生做的，Carrie Grimes。现在这个女性研究员在Google做 project leader，学术界女生同学的楷模 : ) e) LTSA (Local tangent space alignment) 很荣幸，这个是国内学者（浙江大学数学系的老师ZHANG Zhenyue）为第一作者做的一个在流行学习中最出色的方法。由于这个方法是由纯数学做数值分析出身的老师所做，所以原始论文看起来公式一大堆，好像很难似的。其实这个方法非常直观简单。象 Hessian Eigenmaps 一样，流形的局部几何表达先用切坐标，也就是PCA的主子空间中的坐标。那末对于流形一点处的切空间，它是线性子空间，所以可以和欧式空间中的一个开子集建立同构关系，最简单的就是线性变换。在微分流形中，就叫做切映射 (tangential map)，是个很自然很基础的概念。把切坐标求出来，建立出切映射，剩下的就是数值计算了。最终这个算法划归为一个很简单的跌代加和形式。如果你已经明白了MDS，那末你就很容易明白，这个算法本质上就是MDS的从局部到整体的组合。这里主要想重点强调一下，那个论文中使用的一个从局部几何到整体性质过渡的alignment技术。在spectral method（特征分解的）中，这个alignment方法特别有用。只要在数据的局部邻域上你的方法可以写成一个二次项的形式，就可以用。其实LTSA最早的版本是在02年的DOCIS上。这个alignment方法在02年底Brand的 charting a manifold 中也出现，隐含在Hessian Eigenmaps中。在HE中，作者在从局部的Hessian矩阵过渡到全局的Hessian矩阵时，用了两层加号，其中就隐含了这个 alignment方法。后来国内一个叫 ZHAO Deli 的学生用这个方法重新写了LLE，发在Pattern Recognition上，一个短文。可以预见的是，这个方法还会被发扬光大。 ZHA Hongyuan 后来专门作了一篇文章来分析 alignment matrix 的谱性质，有兴趣地可以找来看看。 f) MVU (Maximum variance unfolding) 这个方法刚发出来以后，名字叫做Semi-definite Embedding (SDE)。构建一个局部的稀疏欧式距离矩阵以后，作者通过一定约束条件（主要是保持距离）来学习到一个核矩阵，对这个核矩阵做PCA就得到保持距离的 embedding，就这莫简单。但是就是这个方法得了多少奖，自己可以去找找看。个人观点认为，这个方法之所以被如此受人赏识，无论在vision还是在learning，除了给流形学习这一领域带来了一个新的解决问题的工具之外，还有两个重点，一是核方法（kernel），二是半正定规划（semi- definite programming），这两股风无论在哪个方向（learning and Vision）上都吹得正猛。 g) S-Logmaps aa 这个方法不太被人所知，但是我认为这个是流形学习发展中的一个典型的方法（其实其他还有很多人也这莫认为）。就效果来说，这个方法不算好，说它是一个典型的方法，是因为这个方法应用了黎曼几何中一个很直观的性质。这个性质和法坐标(normal coordinate)、指数映射(exponential map)和距离函数(distance function)有关。如果你了解黎曼几何，你会知道，对于流形上的一条测地线，如果给定初始点和初始点处测地线的切方向，那莫这个测地线就可以被唯一确定。这是因为在这些初始条件下，描述测地线的偏微分方程的解是唯一的。那末流形上的一条测地线就可以和其起点处的切平面上的点建立一个对应关系。我们可以在这个切平面上找到一点，这个点的方向就是这个测地线在起点处的切方向，其长度等于这个测地线上的长。这样的一个对应关系在局部上是一一对应的。那末这个在切平面上的对应点在切平面中就有一个坐标表示，这个表示就叫做测地线上对应点的法坐标表示（有的也叫指数坐标）。那末反过来，我们可以把切平面上的点映射到流形上，这个映射过程就叫做指数映射（Logmap就倒过来）。如果流形上每一个点都可以这样在同一个切平面上表示出来，那末我们就可以得到保持测地线长度的低维表示。如果这样做得到，流形必须可以被单坐标系统所覆盖。如果给定流形上的采样点，如果要找到法坐标，我们需要知道两个东西，一是测地线距离，二是每个测地线在起点处的切方向。第一个东西好弄，利用Isomap中的方法直接就可以解决，关键是第二个。第二个作者利用了距离函数的梯度，这个梯度和那个切方向是一个等价的关系，一般的黎曼几何书中都有叙述。作者利用一个局部切坐标的二次泰勒展开来近似距离函数，而距离是知道的，就是测地线距离，局部切坐标也知道，那末通过求一个简单的最小二乘问题就可以估计出梯度方向。如果明白这个方法的几何原理，你再去看那个方法的结果，你就会明白为什莫在距离中心点比较远的点的embedding都可以清楚地看到在一条条线上，效果不太好。 bb 最近这个思想被北大的一个年轻的老师 LIN Tong 发扬光大，就是ECCV‘06上的那篇，还有即将刊登出的TPAMI上的 Riemannian Manifold Learning，实为国内研究学者之荣幸。Lin的方法效果非常好，但是虽然取名叫Riemannian，没有应用到黎曼几何本身的性质，这样使他的方法更容易理解。 Lin也是以一个切空间为基准找法坐标，这个出发点和思想和Brun（S-Logmaps）的是一样的。但是Lin全是在局部上操作的，在得出切空间原点处局部邻域的法坐标以后，Lin采用逐步向外扩展的方法找到其他点的法坐标，在某一点处，保持此点到它邻域点的欧式距离和夹角，然后转化成一个最小二乘问题求出此点的法坐标，这样未知的利用已知的逐步向外扩展。说白了就像缝网一样，从几个临近的已知点开始，逐渐向外扩散的缝。效果好是必然的。有人做了个好事情，做了个系统，把几个方法的matlab代码放在了一起 http://www.math.umn.edu/~wittman/mani/ 以上提到方法论文，都可以用文中给出的关键词借助google.com找到。 3. 基本问题和个人观点流形学习现在还基本处于理论探讨阶段，在实际中难以施展拳脚，不过在图形学中除外。我就说说几个基本的问题。 a. 谱方法对噪声十分敏感。希望大家自己做做实验体会一下，流形学习中谱方法的脆弱。 b. 采样问题对结果的影响。 c. 收敛性 d. 一个最尴尬的事情莫过于，如果用来做识别，流形学习线性化的方法比原来非线性的方法效果要好得多，如果用原始方法做识别，那个效果叫一个差。也正因为此，使很多人对流形学习产生了怀疑。原因方方面面 : ) e. 把偏微分几何方法引入到流形学习中来是一个很有希望的方向。这样的工作在最近一年已经有出现的迹象。 f. 坦白说，我已不能见庐山真面目了，还是留给大家来说吧结尾写得有点草率，实在是精疲力尽了，不过还好主体部分写完。 4. 结束语做学问的人有很多种，有的人学问做得很棒，但是独善其身者居多；有的人还谈不上做学问总想兼济天下。小弟不幸成了后一种人，总觉才学疏浅，力不从心，让各位见笑了。今天一位朋友（filestorm）给我分享《列子御风》的故事，很受教育。鄙人功力不及二层，心却念是非，口却言利害，实在惭愧。但又想主动尽自己绵薄之力为“兼济天下”做点事 : )，遂呈拙文于此（包括以前两篇人脸识别的），难免出现偏颇，无论好话坏话，还请各位尽管说来。; 4286 次阅读|1 个评论

deep learning学习推荐网址: a6657266 2013-7-15 20:29; 已经接触了deep learning快俩月了，但还是一点进展没有……还是没找到方法，一直关注python的写法，但是真正的learning实质并不懂~~所以还是找找专家们的说法吧，争取能搞清楚搞清楚…… http://ufldl.stanford.edu/wiki/index.php/UFLDL_Recommended_Readings 这个上面列了学习deep learning需要补充的基础知识……发现自己就是0哇……加油吧！但是开始的开始，我需要学习 http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning 再学习 http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial 再按照reading list上面推荐的挨个学习吧！！接下来呢，一定要参照 http://www.cs.toronto.edu/~hinton/ 大牛牛发表的文章好好学习嗯，这些够我看一通一通的了…… 看完这些，再看 http://www.cnblogs.com/tornadomeet/tag/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0/ 是不应该就会简单易懂了……期待啊……; 4852 次阅读|0 个评论

A novel situated social-personalized learning approach: xianglee 2013-7-13 22:55; Description of the Research Plan Title: A novel situated social-personalized learning approach Keywords: e-learning, personalized learning, knowledge space construction, situatedknowledge representation, learning path analysis, cognitive learning process, ande-learning assistant agent Objectives: This research program is to develop an intelligent E-LearningAssistant (ELA) agent for personalized learning. A novel situated social-personalizedlearning approach is proposed in this proposal. This research plan covers threeaspects: The first one is to develop a situated knowledge representation andorganization (SKRO) model. The second is to research nonlinear personallearning path or behavior model. The third is to research social-personalizedlearning mechanism based on machine learning algorithms. The details of thesethree aspects are as follows: 1. Research on situated knowledge representation and organization model Knowledge representationand organization method is the fundamental element for building an e-learningsystem. Here, enlightened by the belief of cognitive learning and pedagogy, i.e.“knowledge is situated and learning is also situated”. A situated knowledgerepresentation and organization (SKRO) model is proposed for building asituated e-learning environment. The SKRO model not only can construct acontextual relationship of knowledge, but also can build a map of differentkinds of heterogeneous knowledge (e.g. text, audio, video, animation, imagesand others). The SKRO model is also a modeling foundation of personal learningprocess and how they learn. 2. Research on nonlinear personal learning path or behavior model With the SKRO model, the concept ofconstructive memory is proposed for storing learning content about thelearners’ situations, goals, and learned knowledge. More important, all theselearned knowledge will be recorded according to a time series and a specificsequence of learning activities. An individual learning path (actually shouldbe a map, not a path) is modeled here to record his learning process andcognitive process. At last, the personal learning space (PLS) is constructedwith personal learned knowledge and nonlinear learning path/behavior. 3. The social-personalized learning mechanism based on machine learningalgorithms The thirdresearch highlight is that we hold a belief that one’s learning process andcognitive process may be helpful to others. Therefore, a novelsocial-personalized learning mechanism based on machine learning algorithms isproposed for personalized learning. The social-personalized learning mechanismmeans that intelligent ELA agent can sense social learning processes (i.e. eachlearner’s sequence of learning activities in any order) to analyze therelevance of knowledge. After that, it constructs a situated knowledgerelevance model through employing statistic-based machine learning algorithms. Sucha model can tell one learner what others’ learning process are and how theylearn through analyzing personal and social learning path. At the same time, thesocial-personalized learning mechanism can also tell the learner what knowledgeI should learn now according to his situation and goals. The framework of intelligent E-Learning Assistant(ELA) agent is as following:; 个人分类: AI|3471 次阅读|0 个评论

[转载]Deep learning 学习笔记: yyjsunny 2013-7-8 15:02; http://blog.csdn.net/zouxy09/article/details/8775360; 个人分类: 学习心得|0 个评论

[转载]直推学习（transductive learning）: scott89 2013-7-2 20:11; Transductive Learning：从彼个例到此个例，有点象英美法系，实际案例直接结合过往的判例进行判决。关注具体实践。 Inductive Learning：从多个个例归纳出普遍性，再演绎到个例，有点象大陆法系，先对过往的判例归纳总结出法律条文，再应用到实际案例进行判决。从有限的实际样本中，企图归纳出普遍真理，倾向形而上，往往会不由自主地成为教条。在传统的监督学习中，学习器通过对大量有标记的（labeled）训练例进行学习，从而建立模型用于预测未见示例的标记。这里的“标记”（label）是指示例所对应的输出，在分类问题中标记就是示例的类别，而在回归问题中标记就是示例所对应的实值输出。随着数据收集和存储技术的飞速发展，收集大量未标记的（unlabeled）示例已相当容易，而获取大量有标记的示例则相对较为困难，因为获得这些标记可能需要耗费大量的人力物力。例如在计算机辅助医学图像分析中，可以从医院获得大量的医学图像作为训练例，但如果要求医学专家把这些图像中的病灶都标识出来，则往往是不现实的。事实上，在真实世界问题中通常存在大量的未标记示例，但有标记示例则比较少，尤其是在一些在线应用中这一问题更加突出。例如，在进行Web网页推荐时，需要用户标记出哪些网页是他感兴趣的，很少会有用户愿意花大量的时间来提供标记，因此有标记的网页示例比较少，但Web上存在着无数的网页，它们都可作为未标记示例来使用。目前，利用未标记示例的主流学习技术主要有三大类，即半监督学习（semi-supervised learning）、直推学习（transductive learning）和主动学习（active learning）。这三类技术都是试图利用大量的未标记示例来辅助对少量有标记示例的学习，但它们的基本思想却有显著的不同。在半监督学习中，学习器试图自行利用未标记示例，即整个学习过程不需人工干预，仅基于学习器自身对未标记示例进行利用。直推学习与半监督学习的相似之处是它也是由学习器自行利用未标记示例，但不同的是，直推学习假定未标记示例就是测试例，即学习的目的就是在这些未标记示例上取得最佳泛化能力。换句话说，半监督学习考虑的是一个“开放世界”，即在进行学习时并不知道要预测的示例是什么，而直推学习考虑的则是一个“封闭世界”，在学习时已经知道了需要预测哪些示例。实际上，直推学习这一思路直接来源于统计学习理论，并被一些学者认为是统计学习理论对机器学习思想的最重要的贡献1。其出发点是不要通过解一个困难的问题来解决一个相对简单的问题。V. Vapnik认为，经典的归纳学习假设期望学得一个在整个示例分布上具有低错误率的决策函数，这实际上把问题复杂化了，因为在很多情况下，人们并不关心决策函数在整个示例分布上性能怎么样，而只是期望在给定的要预测的示例上达到最好的性能。后者比前者简单，因此，在学习过程中可以显式地考虑测试例从而更容易地达到目的。这一思想在机器学习界目前仍有争议，但直推学习作为一种重要的利用未标记示例的技术，则已经受到了众多学者的关注。主动学习和前面两类技术不同，它假设学习器对环境有一定的控制能力，可以“主动地”向学习器之外的某个“神谕”(oracle)2 进行查询来获得训练例的标记。因此，在主动学习中，学习器自行挑选出一些未标记示例并通过神谕查询获得这些示例的标记，然后再将这些有标记示例作为训练例来进行常规的监督学习，而其技术难点则在于如何使用尽可能少的查询来获得强泛化能力。对比半监督学习、直推学习和主动学习可以看出，后者在利用未标记示例的过程中需要与外界进行交互，而前两者则完全依靠学习器自身，正因为此，也有一些研究者将直推学习作为一种半监督学习技术来进行研究。以上内容参考： 1. http://shiyueyangyang.blog.163.com/blog/static/79144804201021410479895/ 2. http://blog.sina.com.cn/s/blog_5212bec30100xtad.html 3. 《半监督学习中的协同训练风范》--周志华（南京大学）; 9463 次阅读|0 个评论

和国外同行讨论deep learning被郁闷到: 热度 1 xiaoda99 2013-6-23 07:14; 跟ICML'13一篇paper（Maxout Networks）的作者讨论一个bio-inspired deep learning的想法，得到的回复是：你的想法没啥硬伤，关键是能不能跑出更好的结果。 Other people, including me, having worked on fairly similar things and didn't really get them to work. That doesn't mean it's impossible though; just that you might need to try a few different tricks to go with it that we haven't tried. In general it's very hard to predict theoretically how well a deep learning method will work in advance. You just have to get your hands dirty and try a lot of them. 虽说“实践是检验真理的唯一标准”没错，但还是觉得，现阶段的DL不像科学，更像炼金术！接下来打算好好玩玩pylearn2和cuda-convnet了。; 6034 次阅读|1 个评论

脑与deep learning读书会第一期: xiaoda99 2013-6-23 06:52; 题目：Overview: deep architectures in brain and machine 提纲 1. An overview of primate visual pathways 2. What problems does the visual system solve? An object recognition perspective 3. A very general introduction to deep learning and some personal comments 4. Neural representation benchmark: Brain vs Machine 讲稿 http://www.kuaipan.cn/file/id_2602161770890478.htm; 3057 次阅读|0 个评论

[转载]Top 10 Challenges of Blended learning: isping 2013-6-17 21:40; Technical Challenges (技术挑战) 1. （如何）确保参与者能够能够成功地使用技术（Ensuring participants can successfully use the technology） 2.（如何）抵制盲目使用技术的冲动（Resisting the urge to use technology simply because it is available） Organizational Challenges (管理挑战) 3. （如何）克服“混合学习不如传统课堂培训有效”的想法（Overcoming the idea that blended learning is not as effective as traditional classroom training） 4. （如何）重新定位教辅人员的角色（Redefining the role of the facilitator） 5. （如何）管理和监测参与者进程（Managing and monitoring participant progress） Instructional Challenges (教学挑战) 6. （如何）重视如何教，而不是简单的教什么（Looking at how to teach, not just what to teach） 7.（如何）匹配最佳的传播媒介和最佳的性能目标（Matching the best delivery medium to the performance objective） 8. （如何）保持在线服务互动，而不仅仅回复参与者（Keeping online offerings interactive rather than just “talking at” participants） 9. （如何）确保参与者的承诺和后续的“非实况直播”（ Ensuring participant commitment and follow-through with “non-live” elements.） 10.（如何）确保混合教学的各个元素相互配合（Ensuring all the elements of the blend are coordinated）文章来源：“Top 10 Challenges of Blended learning”, by Jennifer Hofmann; 5 次阅读|0 个评论

海报：脑与deep learning读书会: xiaoda99 2013-6-5 15:36; 最近要在集智俱乐部组织一次脑与deep learning的读书会活动。 http://www.douban.com/event/19029525/ 简介 Deep learning是一类基于多层神经网络的非常有前景的机器学习算法。最早提出的多层结构神经网络的灵感源泉之一就是神经生理学家Hubel Wiesel上世纪60年代基于实验发现提出的视觉系统信息处理的层次模型。遗憾的是，当前主流机器学习研究者对生物脑工作机理的认识似乎仍然停留在上世纪60年代。在神经科学和机器学习都取得长足进展的今天，二者的结合一定会碰撞出新的火花。本次读书会结合神经生理学和计算神经科学的最新进展，探索脑是怎样deep learning的。以高级哺乳动物的视觉通路为主线，沿着V1-V2-V4-IT/MT-海马的层次结构，看每一层都学到了什么，以及相关的计算模型，探讨不同层间的差异和共性。并和一些机器学习里的deep learning算法和系统作为对照。本次读书会以研读并讨论神经生物学、计算神经科学和机器学习领域的经典文献为主，希望参与者： 1）对探索脑的工作机理感兴趣 1）乐于承担主讲，能比较轻松的阅读专业英文文献并抓住核心给大家讲解清楚。 2）本读书主要是探索生物脑学习和机器学习背后的共性和普适原理，而非deep learning算法的科普讲座。所以，欢迎如下背景的童鞋加入：生物学、神经科学、物理学、计算机、机器学习，尤其欢迎各种交叉学科。大纲 Overview: deep architecture in brain and machine（1次） Early visual system (retinal ganglion cell, LGN, V1), canonical cortical circuits（1.5次） Learning features(selectivity) sparse coding, cortical maps（0.5次） Learning transformations(invariance)（1次） V2（1次） V4, shape perception（1次） IT, object face recognition（1次）海马体，记忆，睡眠（1次）视觉系统的发育和进化，低等动物的视觉（1次）脑波、Neuronal oscillation and synchrony（1次）; 3377 次阅读|0 个评论

[转载]Deep learning总结: jane8008 2013-5-31 12:23; 转载来源： http://blog.csdn.net/zouxy09/article/details/8782018 十、总结与展望 1）Deep learning总结深度学习是关于自动学习要建模的数据的潜在（隐含）分布的多层（复杂）表达的算法。换句话来说，深度学习算法自动的提取分类需要的低层次或者高层次特征。高层次特征，一是指该特征可以分级（层次）地依赖其他特征，例如：对于机器视觉，深度学习算法从原始图像去学习得到它的一个低层次表达，例如边缘检测器，小波滤波器等，然后在这些低层次表达的基础上再建立表达，例如这些低层次表达的线性或者非线性组合，然后重复这个过程，最后得到一个高层次的表达。 Deep learning能够得到更好地表示数据的feature，同时由于模型的层次、参数很多，capacity足够，因此，模型有能力表示大规模数据，所以对于图像、语音这种特征不明显（需要手工设计且很多没有直观物理含义）的问题，能够在大规模训练数据上取得更好的效果。此外，从模式识别特征和分类器的角度，deep learning框架将feature和分类器结合到一个框架中，用数据去学习feature，在使用中减少了手工设计feature的巨大工作量（这是目前工业界工程师付出努力最多的方面），因此，不仅仅效果可以更好，而且，使用起来也有很多方便之处，因此，是十分值得关注的一套框架，每个做ML的人都应该关注了解一下。当然，deep learning本身也不是完美的，也不是解决世间任何ML问题的利器，不应该被放大到一个无所不能的程度。 2）Deep learning未来深度学习目前仍有大量工作需要研究。目前的关注点还是从机器学习的领域借鉴一些可以在深度学习使用的方法，特别是降维领域。例如：目前一个工作就是稀疏编码，通过压缩感知理论对高维数据进行降维，使得非常少的元素的向量就可以精确的代表原来的高维信号。另一个例子就是半监督流行学习，通过测量训练样本的相似性，将高维数据的这种相似性投影到低维空间。另外一个比较鼓舞人心的方向就是evolutionary programming approaches（遗传编程方法），它可以通过最小化工程能量去进行概念性自适应学习和改变核心架构。 Deep learning还有很多核心的问题需要解决：（1）对于一个特定的框架，对于多少维的输入它可以表现得较优（如果是图像，可能是上百万维）？（2）对捕捉短时或者长时间的时间依赖，哪种架构才是有效的？（3）如何对于一个给定的深度学习架构，融合多种感知的信息？（4）有什么正确的机理可以去增强一个给定的深度学习架构，以改进其鲁棒性和对扭曲和数据丢失的不变性？（5）模型方面是否有其他更为有效且有理论依据的深度模型学习算法？探索新的特征提取模型是值得深入研究的内容。此外有效的可并行训练算法也是值得研究的一个方向。当前基于最小批处理的随机梯度优化算法很难在多计算机中进行并行训练。通常办法是利用图形处理单元加速学习过程。然而单个机器GPU对大规模数据识别或相似任务数据集并不适用。在深度学习应用拓展方面，如何合理充分利用深度学习在增强传统学习算法的性能仍是目前各领域的研究重点。十一、参考文献和Deep Learning学习资源（持续更新…… ）先是机器学习领域大牛的微博：@余凯_西二旗民工；@老师木；@梁斌penny；@张栋_机器学习；@邓侃；@大数据皮东；@djvu9…… （1）Deep Learning http://deeplearning.net/ （2）Deep Learning Methods for Vision http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/ （3）Neural Network for Recognition of Handwritten Digits http://www.codeproject.com/Articles/16650/Neural-Network-for-Recognition-of-Handwritten-Digi （4）Training a deep autoencoder or a classifier on MNIST digits http://www.cs.toronto.edu/~hinton/MatlabForSciencePaper.html （5）Ersatz：deep neural networks in the cloud http://www.ersatz1.com/ （6）Deep Learning http://www.cs.nyu.edu/~yann/research/deep/ （7）Invited talk A Tutorial on Deep Learning by Dr. Kai Yu (余凯) http://vipl.ict.ac.cn/News/academic-report-tutorial-deep-learning-dr-kai-yu （8）CNN - Convolutional neural network class http://www.mathworks.cn/matlabcentral/fileexchange/24291 （9）Yann LeCun's Publications http://yann.lecun.com/exdb/publis/index.html#lecun-98 （10） LeNet-5, convolutional neural networks http://yann.lecun.com/exdb/lenet/index.html （11） Deep Learning 大牛Geoffrey E. Hinton's HomePage http://www.cs.toronto.edu/~hinton/ （12）Sparse coding simulation software http://redwood.berkeley.edu/bruno/sparsenet/ （13）Andrew Ng's homepage http://robotics.stanford.edu/~ang/ （14）stanford deep learning tutorial http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial （15）「深度神经网络」（deep neural network）具体是怎样工作的 http://www.zhihu.com/question/19833708?group_id=15019075#1657279 （16）A shallow understanding on deep learning http://blog.sina.com.cn/s/blog_6ae183910101dw2z.html （17）Bengio's Learning Deep Architectures for AI http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf （18）andrew ng's talk video: http://techtalks.tv/talks/machine-learning-and-ai-via-brain-simulations/57862/ （19）cvpr 2012 tutorial： http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/tutorial_p2_nnets_ranzato_short.pdf （20）Andrew ng清华报告听后感 http://blog.sina.com.cn/s/blog_593af2a70101bqyo.html （21）Kai Yu：CVPR12 Tutorial on Deep Learning Sparse Coding （22）Honglak Lee：Deep Learning Methods for Vision （23）Andrew Ng ：Machine Learning and AI via Brain simulations （24）Deep Learning 【2,3】 http://blog.sina.com.cn/s/blog_46d0a3930101gs5h.html （25）deep learning这件小事…… http://blog.sina.com.cn/s/blog_67fcf49e0101etab.html （26）Yoshua Bengio, U. Montreal：Learning Deep Architectures （27）Kai Yu：A Tutorial on Deep Learning （28）Marc'Aurelio Ranzato：NEURAL NETS FOR VISION （29）Unsupervised feature learning and deep learning http://blog.csdn.net/abcjennifer/article/details/7804962 （30）机器学习前沿热点–Deep Learning http://elevencitys.com/?p=1854 （31）机器学习——深度学习(Deep Learning) http://blog.csdn.net/abcjennifer/article/details/7826917 （32）卷积神经网络 http://wenku.baidu.com/view/cd16fb8302d276a200292e22.html （33）浅谈Deep Learning的基本思想和方法 http://blog.csdn.net/xianlingmao/article/details/8478562 （34）深度神经网络 http://blog.csdn.net/txdb/article/details/6766373 （35）Google的猫脸识别:人工智能的新突破 http://www.36kr.com/p/122132.html （36）余凯，深度学习-机器学习的新浪潮，Technical News程序天下事 http://blog.csdn.net/datoubo/article/details/8577366 （37）Geoffrey Hinton：UCLTutorial on: Deep Belief Nets （38）Learning Deep Boltzmann Machines http://web.mit.edu/~rsalakhu/www/DBM.html （39）Efficient Sparse Coding Algorithm http://blog.sina.com.cn/s/blog_62af19190100gux1.html （40）Itamar Arel, Derek C. Rose, and Thomas P. Karnowski： Deep Machine Learning—A New Frontier in Artificial Intelligence Research （41）Francis Quintal Lauzon：An introduction to deep learning （42）Tutorial on Deep Learning and Applications （43）Boltzmann神经网络模型与学习算法 http://wenku.baidu.com/view/490dcf748e9951e79b892785.html （44）Deep Learning 和 Knowledge Graph 引爆大数据革命 http://blog.sina.com.cn/s/blog_46d0a3930101fswl.html （45）……; 个人分类: 文本分类|3295 次阅读|0 个评论

Multi-Task Least-Squares Support Vector Machines: xiaohai2008 2013-5-30 09:25; @ARTICLE{XAQZ14, author = {Xu, Shuo and An, Xin and Qiao, Xiaodong and Zhu, Lijun}, title = {Multi-Task Least-Squares Support Vector Machines}, journal = {Multimedia Tools and Applications}, year = {2014}, volume = {71}, number = {2}, pages = {699--715}, issn = {1380-7501}, abstract = {There are often the underlying cross relatedness amongst multiple tasks, which is discarded directly by traditional single-task learning methods. Since multi-task learning can exploit these relatedness to further improve the performance, it has attracted extensive attention in many domains including multimedia. It has been shown through a meticulous empirical study that the generalization performance of Least-Squares Support Vector Machine (LS-SVM) is comparable to that of SVM. In order to generalize LS-SVM from single-task to multi-task learning, inspired by the regularized multi-task learning (RMTL), this study proposes a novel multi-task learning approach, multi-task LS-SVM (MTLS-SVM). Similar to LS-SVM, one only solves a convex linear system in the training phrase, too. What's more, we unify the classification and regression problems in an efficient training algorithm, which effectively employs the Krylow methods. Finally, experimental results on \emph{school} and \emph{dermatology} validate the effectiveness of the proposed approach.}, doi = { 10.1007/s11042-013-1526-5 }, keywords = {Multi-Task Learning \sep Least-Square Support Vector Machine (LS-SVM) \sep Multi-Task LS-SVM (MTLS-SVM) \sep Krylow Methods}, } 全文见： XAQZ14.pdf; 个人分类: 机器学习|4886 次阅读|0 个评论

[转载]Hinton wants to take deep learning to next level 深度学习: hestendelin 2013-5-21 20:03; 原址： http://www.wired.com/wiredenterprise/2013/05/hinton/ Computer Brain Escapes Google’s X Lab to Supercharge Search BY ROBERT MCMILLAN 05.20.13 6:30 AM Geoffrey Hinton (right), Alex Krizhevsky, and Ilya Sutskever (left) will do machine-learning work at Google. Photo: U of T Two years ago Stanford professor Andrew Ng joined Google’s X Lab, the research group that’s given us Google Glass and the company’s driverless cars . His mission: to harness Google’s massive data centers and build artificial intelligence systems on an unprecedented scale. He ended up working with one of Google’s top engineers to build the world’s largest neural network; A kind of computer brain that can learn about reality in much the same way that the human brain learns new things. Ng’s brain watched YouTube videos for a week and taught itself which ones were about cats. It did this by breaking down the videos into a billion different parameters and then teaching itself how all the pieces fit together. But there was more. Ng built models for processing the human voice and Google StreetView images. The company quickly recognized this work’s potential and shuffled it out of X Labs and into the Google Knowledge Team. Now this type of machine intelligence — called deep learning — could shake up everything from Google Glass, to Google Image Search to the company’s flagship search engine. It’s the kind of research that a Stanford academic like Ng could only get done at a company like Google, which spends billions of dollars on supercomputer-sized data centers each year. “At the time I joined Google, the biggest neural network in academia was about 1 million parameters,” remembers Ng. “At Google, we were able to build something one thousand times bigger.” Ng stuck around until Google was well on its way to using his neural network models to improve a real-world product: its voice recognition software. But last summer, he invited an artificial intelligence pioneer named Geoffrey Hinton to spend a few months in Mountain View tinkering with the company’s algorithms. When Android’s Jellly Bean release came out last year, these algorithms cut its voice recognition error rate by a remarkable 25 percent. In March, Google acquired Hinton’s company. Now Ng has moved on (he’s running an online education company called Coursera), but Hinton says he wants to take this deep learning work to the next level. A first step will be to build even larger neural networks than the billion-node networks he worked on last year. “I’d quite like to explore neural nets that are a thousand times bigger than that,” Hinton says. “When you get to a trillion , you’re getting to something that’s got a chance of really understanding some stuff.” Hinton thinks that building neural network models about documents could boost Google Search in much the same way they helped voice recognition. “Being able to take a document and not just view it as, “It’s got these various words in it,” but to actually understand what it’s about and what it means,” he says. “That’s most of AI, if you can solve that.” Test images labeled by Hinton’s brain. Image: Geoff Hinton Hinton already has something to build on. Google’s knowledge graph : a database of nearly 600 million entities. When you search for something like “ The Empire State Building ,” the knowledge graph pops up all of that information to the right of your search results. It tells you that the building is 1,454 feet tall and was designed by William F. Lamb. Google uses the knowledge graph to improve its search results, but Hinton says that neural networks could study the graph itself and then both cull out errors and improve other facts that could be included. Image search is another promising area. “‘Find me an image with a cat wearing a hat.’ You should be able to do that fairly soon,” Hinton says. Hinton is the right guy to take on this job. Back in the 1980s he developed the basic computer models used in neural networking. Just two months ago, Google paid an undisclosed sum to acquire Hinton’s artificial intelligence company , DNNresearch , and now he’s splitting his time between his University of Toronto teaching job, and working for Jeff Dean on ways to make Google’s products smarter at the company’s Mountain View campus. In the past five years, there’s been a mini-boom in neural networking as researchers have harnessed the power of graphics processors (GPUs) to build out ever-larger neural networks that can quickly learn from extremely large sets of data. “Until recently… if you wanted to learn to recognize a cat, you had to go and label tens of thousands of pictures of cats,” says Ng. “And it was just a pain to find so many pictures of cats and label then.” Now with “unsupervised learning algorithms,” like the ones Ng used in his YouTube cat work, the machines can learn without the labeling, but to build the really large neural networks, Google had to first write code that would work on such a large number of machines, even when one of the systems in the network stopped working. It typically takes a large number of computers sifting through a large amount of data to train the neural network model. The YouTube cat model, for example, was trained on 16,000 chip cores . But once that was hammered out, it too k just 100 cores to be able to spot cats on YouTube. Google’s data centers are based on Intel Xeon processors, but the company has started to tinker with GPUs because they are so much more efficient at this neural network processing work, Hinton says. Google is even testing out a D-Wave quantum computer , a system that Hinton hopes to try out in the future. But before then, he aims to test out his trillion-node neural network. “People high up in Google I think are very committed to getting big neural networks to work very well,” he says.; 个人分类: 深度学习|3132 次阅读|0 个评论

分享一些deep learning的资源: xiaoda99 2013-5-8 09:54; 都是来自Bengio组的。 1 2篇paper 【1】 Deep Learning of Representations: Looking Forward , Yoshua Bengio, Université de Montréal, arXiv report 1305.0445, 2013 一篇最新综述，写得很棒。【2】Maxout Networks . Ian J. Goodfellow , David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. To appear in ICML 2013 . 一个和dropout搭配使用做更好的model averaging的方法，特点也和dropout类似：做法超简单，效果却很好。 2 http://www.kaggle.com/c/challenges-in-representation-learning-the-black-box-learning-challenge/forums Kaggle的一个竞赛：Challenges in Representation Learning: The Black Box Learning Challenge 论坛里有参赛者关于各种算法和开源工具包的讨论，例如： http://www.kaggle.com/c/challenges-in-representation-learning-the-black-box-learning-challenge/forums/t/4491/dropout-maxout-and-deep-neural-networks 3 http://ift6266h13.wordpress.com/ Bengio的一个课程（Representation Learning）的资料。有很多学生写的课程笔记和问题讨论，例如： http://ift6266h13.wordpress.com/tag/pierre-luc-carrier/ 4 http://plcift6266.wordpress.com/ 上面课程一个学生的course project实验报告。每篇报告中详细描述了他尝试的各种最新deep learning算法的的实验过程和结果分析，这个可比paper中的结果有价值多了，你懂的。我觉得，把这些跟下来，想上手做些东西就不会有太多障碍了，比光看paper好很多。; 4957 次阅读|0 个评论

brain & deep learning reading list: 热度 1 xiaoda99 2013-5-3 15:19; 文献列表（更新）因为文献太多一次读书会不可能面面俱到，采取follow每个领域最重要的1~2个研究者的最具代表性的工作的方式，挑选出下面文章重点研读。文章前的标签代表类型，NB=神经生物学发现，CM=计算模型，ML=机器学习算法。 Overview: deep architecture in brain and machine（1次） James DiCarlo 【NB】Chris I. Baker (2004) Visual Processing in the Primate Brain. In Handbook of Psychology, Biological Psychology, Wiley. 【NB】【CM】DiCarlo JJ, Zoccolan D, Rust NC. (2012) How does the brain solve visual object recognition? Neuron, 73(3):415-34. 【CM】Cadieu CF, et al. (2013) The Neural Representation Benchmark and its Evaluation on Brain and Machine. International Conference on Learning Representations (ICLR) 2013. Early visual system (retinal ganglion cell, LGN, V1), canonical cortical circuits（1.5次） Matteo Carandini, Rodney Douglas 【NB】Matteo Carandini (2012) Area V1. Scholarpedia, 7(7):12105. http://www.scholarpedia.org/article/Area_V1 【NB】【CM】Carandini M, et al. (2005) Do we know what the early visual system does? Journal of Neuroscience, 25:10577-10597. 【NB】Douglas, RJ and Martin, KAC (2007) Recurrent neuronal circuits in the neocortex. Current Opinion in Biology, 17:496-500. 【NB】Douglas, RJ and Martin, KAC (2010) Canonical cortical circuits. Chapter 2 in Handbook of Brain Microcircuits 15-21. 【ML】Kevin Jarrett, Koray Kavukcuoglu, Marc’Aurelio Ranzato, and Yann LeCun. (2009) What is the Best Multi-Stage Architecture for Object Recognition? in Proc. International Conference on Computer Vision (ICCV’09). learning features(selectivity) sparse coding, cortical maps（0.5次） Bruno Olshausen 【CM】Olshausen, B. A., Field, D. J. (1997). Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research, 37 (23), 3311-3325. 【CM】Bednar JA. (2012) Building a mechanistic model of the development and function of the primary visual cortex. Journal of Physiology (Paris), 106:194-211. learning transformations(invariance)（1次） Aapo Hyvarinen, Yan Karklin 【CM】Hyvarinen, A. and Hoyer, P. (2001). A two-layer sparse coding model learns simple and complex cell receptive fields and topography from natural images. Vision Research, 41(18):2413–2423. 【CM】Karklin, Y., Lewicki, M. S. (2009). Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457(7225), 83-85. 【CM】Adelson E.H. and Bergen J.R. (1985) Spatiotemporal energy models for the perception of motion. Journal Opt. Soc. Am. 【ML】Ian J. Goodfellow, Quoc V. Le, Andrew M. Saxe, Honglak Lee, and Andrew Y. Ng. (2009) Measuring invariances in deep networks. Advances in Neural Information Processing Systems (NIPS). 补充【ML】Q.V. Le, et, al. Building high-level features using large scale unsupervised learning. ICML, 2012. V2（1次）【NB】Lawrence C. Sincich and Jonathan C. Horton (2005) The Circuitry of V1 and V2: Integration of Color, Form, and Motion. Annu. Rev. Neurosci. 28:303–26. 【NB】Roe AW, Lu HD, Chen G (2008) Functional architecture of area V2. Encyclopedia of Neuroscience (Squire L, ed.). Elsevier, Oxford, UK. 【CM】Cadieu C.F. Olshausen B.A. (2012) Learning Intermediate-Level Representations of Form and Motion from Natural Movies. Neural Computation. 【ML】Zou, W.Y., Zhu, S., Ng, A., and Yu, K. (2012) Deep learning of invariant features via simulated fixations in video. In Advances in Neural Information Processing Systems (NIPS). 【CM】Gutmann MU Hyvarinen A (2013) A three-layer model of natural image statistics. Journal of Physiology-Paris. 专题讨论：learning mid-level features（形式待定）【ML】Memisevic, R., Exarchakis, G. (2013) Learning invariant features by harnessing the aperture problem. International Conference on Machine Learning (ICML). 【ML】Kihyuk Sohn, Guanyu Zhou, Chansoo Lee, and Honglak Lee. (2013) Learning and Selecting Features Jointly with Point-wise Gated Boltzmann Machines。 Proceedings of the 30th International Conference on Machine Learning (ICML). 【ML】Roni Mittelman, Honglak Lee, Benjamin Kuipers, and Silvio Savarese. (2013) Weakly Supervised Learning of Mid-Level Features with Beta-Bernoulli Process Restricted Boltzmann Machines. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). V4, shape perception（1次） Charles E. Connor 【NB】Pasupathy, A. Connor, C.E. (2002) Population coding of shape in area V4. Nature Neuroscience 5: 1332-1338. 【NB】Connor, C.E. (2007) Transformation of shape information in the ventral pathway. Current Opinion in Neurobiology 17: 140-147. 【NB】【CM】Roe AW, et al. (2012) Towards a unified theory of visual area V4. Neuron 74(2):12-29. 【NB】【CM】Cadieu C, Kouh M, Pasupathy A, Connor CE, Riesenhuber M, Poggio T. (2007) A model of V4 shape selectivity and invariance. Journal of Neurophysiology, 98(3), 1733-50. IT, object face recognition（1次） Keiji Tanaka, Doris Tsao 【NB】Charles G. Gross (2008) Inferior temporal cortex. Scholarpedia, 3(12):7294. http://www.scholarpedia.org/article/Inferior_temporal_cortex 【NB】Tanaka, K. (1996). Inferotemporal cortex and object vision. Annual Review of Neuroscience, 19, 109–139. 【NB】【CM】Tsao DY, Livingstone, MS. (2008) Mechanisms for face perception. Annual Review of Neuroscience, 31: 411-438. 【NB】【CM】Tsao D.Y., Cadieu C. and Livingstone M. (2010) Object Recognition: Physiological and Computational Insights. Chapter 24 in Primate Neuroethology. Edited by M. Platt and A. Ghazanfar. Oxford University Press. 海马体，记忆，睡眠（1次）待补充视觉系统的发育和进化，低等动物的视觉（1次） Jon Kaas The Evolution Of The Visual System In Primates 待补充 Neuronal oscillation and synchrony（1次） Christoph von der Malsburg, Markus Siegel 【NB】Siegel M., Donner T. H., Engel A. K. (2012) Spectral fingerprints of large-scale neuronal interactions. Nature Reviews Neuroscience 13:121-134 【NB】Varela F (2001) The brainweb: Phase synchronization and large-scale integration. Nature Reviews Neuroscience 2, 229-239. 【CM】Donner T. H., Siegel M. (2011) A framework for local cortical oscillation patterns. Trends in Cognitive Sciences 15(5): 191-199 【CM】von der Malsburg C. (1999) The What and Why of Binding: The Modeler’s Perspective. Neuron, Vol. 24, 95–104.; 4446 次阅读|5 个评论

关于广义智能系统，以及临界态、层次的猜想: 热度 1 xiaoda99 2013-5-3 15:12; 最近和jake和计算士的讨论，脑子里的信息结构经过头脑风暴的随机扰动后，不经意间自发整合成有机体，焕发出全新的活力。茅塞顿开，十分兴奋！把想到的东西记下来，供大家进一步讨论。最近一年多以来，我主要纠结于三个问题： 1）从非平衡态统计物理的视角，能量冲的冲刷怎样产生自然界的有序结构，涌现出生命？或与之类比的，信息流的冲刷怎样产生脑的有序结构，涌现出智能？ 2）自然界和大脑中广泛存在的无标度和分形现象，证明这些复杂系统处于临界态（参见我之前写的自然图像的制尺度无关性和文章）。复杂性和临界态有什么关系？ 3）另一方面，自然界和大脑中广泛存在层次结构，复杂性和层次有什么关系？昨天突然意识到，这三个问题是有紧密联系的。 Chapter 1 让我们还是从玻尔兹曼机（BM）开始。我们知道，BM学习算法有正学习（记忆）和逆学习（遗忘）两个阶段，前者加强两个同时发放的神经元间的突触连接，使系统趋向有序，对应于减小系统在感知态的能量E，后者削弱两个同时发放的神经元间的突触连接，使系统趋向无序，对应于增大系统在平衡态的熵S，或者理解为通过引入随机噪声避免过拟合。整个过程看做通过消耗自由能F（= E - S）做功塑造网络结构。因此，BM是信息流冲刷产生有序结构的一个原型。 Chapter 2 BM的最为反直觉也是最深刻的启示是：逆学习（遗忘）有举足轻重的作用，它让系统在辛苦的学习之后进行消化和调整，以维持一个相对“健康”的状态而不至于因为不间断的学习而累疯掉。这里一个容易想到的问题是：逆学习要做到什么程度？导致无序的遗忘和导致有序的学习间怎样协同？在原始的BM学习算法中，正学习和逆学习同时进行，遗忘作用相对记忆的强弱（对突触连接权重的改变量）完全由输入数据决定。但是在实际学习系统（如大脑）中，两个过程可能并不同时发生，而且遗忘作用的相对强弱也很难由输入数据精确控制。不妨假设你的大脑在白天学习，晚上睡觉的时候遗忘。忘的太少，容易产生过拟合；把所有白天学的都忘掉，让整个网络完全无序，显然也不是想要的结果。你的大脑怎么知道遗忘多少恰到好处呢？也许，大脑聪明的通过遗忘把自己调整某个最佳状态，以适合第二天继续学习。这个状态既不能太有序，也不能太无序，对，临界态！ Chapter 3 可是最关键的问题是：临界态究竟好在哪里，让大脑孜孜以求呢？可以从两个方面解释： 1）我们已经知道，大脑接受的很多输入数据（如自然图像、语言等），本身就具有分形和临界特性，表现为尺度无关、Zipf律等。以这样的数据长期训练大脑，使它已经把输入是临界的作为先验，即使在不清楚实际输入统计特性的情况下，也尽可能使自身也处于临界以和外部世界保持同构。 2）大脑对临界态的偏好，还有个更精巧的解释。处于临界态的系统有个非常重要的性质，就是经过某种变换（编码）后得到的新系统（位于一个更高的层次）仍然处于临界态，用重整化群的语言来说，它是这种变换下的非平凡不动点（参看jake的文章）。如果把学到的表示当做输入数据本身，再学一层。这新一层的表示仍将是临界的，于是可以在其上再学一层。。。一直做下去，就得到一个无穷尽的层次结构。大脑每学一层，就对输入数据有了更深一层的理解。而上面提到的变换，就是对信息的重新编码和表示。对，这就是deep learning！上面的故事虽然只是梗概，但我们已经看到流产生有序结构、临界态、层次这三个角色的轮番出场和互相配合。如果这个看似天方夜谭的故事是真实的（我希望并倾向认为它是），我们可以作出下面两个猜想： 1）如果把信息流等价于能量流，包含分形（即具有临界特性）的信息流的能量是（理论上的）无穷大。 2）以这样的信息流冲刷（训练）一个无序系统，同时通过系统自身的最大熵趋势把系统调节在临界态，最终将演化出一个（理论上）无穷多层的结构。; 2655 次阅读|1 个评论

和jake关于重整化群和deep learning的讨论备忘: xiaoda99 2013-5-3 15:10; Xiao Da xiaoda99@gmail.com Mar 23 to jake 我这几天学习了些重整化群，发现它的基本思想和deep learning非常相似，但一个非常大的局限是粗粒化过程会丢失信息不可逆，因此不能作为一个generative model。昨天关于这个问题突然有了一个想法，想和你探讨一下。关键问题是如何将粗粒化过程变为可逆，使重整化群不仅能通过尺度变换分析一个已有状态是否处于临界，还能无中生有产生临界状态。具体来说，设原始状态为s，经过一步粗粒化后的状态为s'。现给定s',求s。这样选取s，使它同时满足两种约束： 1）粗粒化约束：s按给定规则粗粒化后能够得到s‘（硬约束）， 2）哈密顿量约束：s受其自身哈密顿量中相邻结点coupling的约束，使相邻结点取值趋近相同，使哈密顿量最小（软约束）。其中硬约束可以通过对不满足粗粒化约束的状态赋予极高的能量值实现。系统在这两种约束共同作用下弛豫，达到的稳态即为s。该逆粗粒化操作是可以迭代的。一旦得到s，我们可以继续逆重整化流而上，得到s的前一个状态。有了这个逆粗粒化，就可以干一件很好玩的事情。我们知道,Ising模型从一个稍微偏离临界态的状态开始（TTc），通过不断粗粒化操作，将趋于高温不动点。现在把这个过程逆过来，自上而下构造这样一个多层网络，最顶层为完全无序的状态（相当于温度无穷大，关联长度为0），每向下一层，就进行一步逆粗粒化，关联长度增大。随着层数增加，关联长度逐渐增大至无穷，最底层将逼近临界态。注意这个模型从完全无序状态产生准临界态只要2个条件：1）哈密顿量约束的强度远小于粗粒化约束强度；2）迭代层数足够多。并不需要调节coupling strength和T的比值到某个特定临界值，因此这可以看做一个自组织临界模型。我觉得这是个简单但有趣的想法，但不确定上面的推导对不对，尤其是那个逆粗粒化能否实现。很想听听你的看法。还有这个模型到底有没有价值？ Xiao Da xiaoda99@gmail.com Mar 27 to jake 昨天跟你讨论完思路又清晰了不少。发现其实可以先不去管自然界和脑网络产生临界现象的真实机制，就从一个非常具体的问题入手：具有什么统计特性的数据适合（或者不适合）用deep learning学习feature？为什么每一层都能学到有意义的feature？基本思路就是：把deep learning和对Ising模型做实空间重整化操作产生的层次结构作类比，每次重整化对应一层，最底层就是原始数据。如果数据不处于临界态（即特征尺度有限，没有长程关联），经过有限次重整化后必然收敛到平凡不动点，此时数据变成完全随机（特征尺度为0），也就是经过若干层以后就学不到有意义的feature了。只有具有长程关联的临界态数据，才能经过无限次重整化后始终保持长程关联，每层都能学到有意义的feature。因此，可以猜想这样的结论：数据具有长程关联（相关函数随距离幂律衰减）是适合deep learning的充要条件。上面推理隐含了一个局域作用假设：定义了空间和距离后，无论在哪一层次，实体只和它邻近的实体相互作用。这个问题本身是有实际意义的，并且可以实验验证。例如可以拿一些不同类型的数据，用现有的deep learning算法学习，测定相关函数随层数的变化，观察重整化流是趋近平凡不动点还是维持在非平凡不动点。当然，要解决的一个问题是怎样定义和测量相关函数。上述讨论是理想情况，现实数据可能不会严格临界，而且尺度有限，学习层数（重整化次数）必然有限。所以还可以讨论趋近临界态的程度和deep learning网络的最优层数间的关系。至于为什么很多实际数据恰好就长成这样，其背后的产生机制是否可以用相同的层次网络结构描述，就属于另一个层面上的问题了，先留着吧。 jake Mar 28 to me 嗯，思路很赞。我说说我的想法。我其实在博士以前都是在走人工智能、机器学习的路子，但是之所以后来没再走下去，而是跳出来走复杂系统的路子，就是因为感觉计算机科学里面有一大类是物理问题，即处于某种状态的复杂系统问题。所以，开始研究流系统，就是因为感觉机器学习就是一个流的问题。虽然现在离开始的初衷已经有点远了。所以，我帮你进一步明确你研究的对象：你研究的虽然是Deep learning，一种人工的系统，但是要把它当作一个客观的物理对象，就是要回答何种刺激能得到何种响应。我觉得这种切入点是最重要的，至于是不是一定具有长城相关的数据适合deep learning，还是要画上一个问号。当然，我们期待大概是这样。不妨学学19世纪的卡诺，他从科学的角度重新看待热机问题，抽象出了热力学。你可以重新从物理的角度审视机器学习问题，抽象出了XX。 BTW，给你的那本书就是这种研究的典范。 Xiao Da xiaoda99@gmail.com Mar 28 to jake 机器学习在仿大脑，而大脑其实在仿自然，它本身就是个非平衡态物理系统，被信息流的冲刷产生有序结构。自从意识到这点以来，我就一直试图走从物理角度研究机器学习和神经科学的路子，从最开始的最快耗散自由能、最大熵产生，到后来的尺度无关、相变、临界、层次结构，这些关键词也一直是我一年多来思考的主线。可现在回过头来自己走的这条路，虽然看了很多东西，也有很多想法，可一年多并没有取得什么实质进展，感觉荒废了时间。我怀疑是不是抽象得太过了，总想搞出一个“大统一”理论，反而什么产出都没有。而且这条路走得好孤单，几乎找不到同路人讨论。可能是我做研究更依赖感性直觉而不是理性思维，不接触大量具体的实例就形不成直觉，也就提不出新想法。。我以前一直期望从物理里挖出一个现成理论，直接拿过来推广到机器学习和神经科学就可以解决问题，就像当时香农把Boltzmann的东西拿过来推广搞出信息论一样。现在越来越觉得，复杂系统里很多基本问题物理学家并没有回答，我找的理论可能还没有出现。所以现在打算转变一下，从deep learning这个具体实例入手，再结合神经科学上的新近发现，走抽象和具体结合的路。这样能和更多的人讨论，也期待能有些实际产出。当然长远来看我还是相信总有一天这些东西都会统一到物理上，也许最后解决问题的是物理学家。; 2648 次阅读|0 个评论

拓尔思施总"非结构化大数据处理技术及行业应用"报告会听后感: wuxiaolananhui 2013-4-27 09:16; 在 4 月 25 日下午，导师带领我们一起听了施水才总监的题为 “ 非结构化大数据处理技术及行业应用 ” 的学术报告。施总从 “非结构化大数据的理解”，“非结构化大数据的关键技术”及“拓尔思针对大数据的应用” 三个方面给我们介绍了大数据，整场报告异常精彩。在第一个方面的讲解中，让我真实地体会到了非结构化大数据时代的来临，它已开始上升到国家战略层面，给我们带了很多商业机会，如就业、工作等，尤其对我们情报学专业来说，带来了一个不错的就业和发展机遇。当然，大数据时代的出现也带来了挑战，尤其是信息公开、信息安全及个人隐私等方面；在第二个方面，施总介绍了大数据所需的关键技术，施总讲了“与云计算结合，云是基础设施架构，大数据时灵魂资产；分析、挖掘是手段；发现和预测是最终目标”的论断，在这个方面，我体会到对大数据的存储、管理、查询和理解的难点，除自然语言处理技术及机器学习外，大数据更需要结合新技术（流技术， NoSQL ），尤其是图像、视频的 DeepLearning 。最后一个方面，施总给我们讲述了托尔思针对大数据的一些投资与应用，并现场给我们演示了他公司的微博热点分析及跟踪系统，系统可视化界面不仅直观、漂亮 , 也让我们大开眼界，原来社会上在大数据方面已经研究得这么好！; 3515 次阅读|0 个评论

[转载]scikit-learn: machine learning in Python: chuangma2006 2013-4-27 07:18; scikit-learn is a Python module integrating classique machine learning algorithmes in the tightly-nit world of scientific Python packages. It aims to provide simple and efficient solutions to learning problems that are accessible to everybody and reusable in various contexts: machine-learning as a versatile tool for science and engineering. website: http://scikit-learn.org/dev/index.html; 个人分类: Python|3187 次阅读|0 个评论

[转载]Deep Learning Toolbox 深度学习工具箱丹麦科技大学 DTU: hestendelin 2013-4-25 21:36; 原址: http://www.mathworks.cn/matlabcentral/fileexchange/38310-deep-learning-toolboxwatching=38310 Description PLEASE GO TO https://github.com/rasmusbergpalm/DeepLearnToolbox FOR NEWEST VERSION DeepLearnToolbox A Matlab toolbox for Deep Learning. Deep Learning is a new subfield of machine learning that focuses on learning deep hierarchical models of data. It is inspired by the human brain's apparent deep (layered, hierarchical) architecture. A good overview of the theory of Deep Learning theory is Learning Deep Architectures for AI For a more informal introduction, see the following videos by Geoffrey Hinton and Andrew Ng. The Next Generation of Neural Networks (Hinton, 2007) Recent Developments in Deep Learning (Hinton, 2010) Unsupervised Feature Learning and Deep Learning (Ng, 2011) If you use this toolbox in your research please cite: Prediction as a candidate for learning deep hierarchical models of data (Palm, 2012) Directories included in the toolbox NN/ - A library for Feedforward Backpropagation Neural Networks CNN/ - A library for Convolutional Neural Networks DBN/ - A library for Deep Belief Networks SAE/ - A library for Stacked Auto-Encoders CAE/ - A library for Convolutional Auto-Encoders util/ - Utility functions used by the libraries data/ - Data used by the examples tests/ - unit tests to verify toolbox is working For references on each library check REFS.md Required Products MATLAB MATLAB release MATLAB 7.11 (R2010b); 个人分类: 深度学习|11137 次阅读|0 个评论

利用小波变换的EEG的压缩传感: 热度 4 oliviazhang 2013-4-13 13:54; 转载自我在blogspot.com上的博文： http://marchonscience.blogspot.com/2013/04/compressed-sensing-of-eeg-using-dwt-as.html Since my paper Compressed Sensing of EEG for Wireless Telemonitoring with Low Energy Consumption and Inexpensive Hardware (IEEE T-BME, vol.60, no.1, 2013) has been published, lots of people asked me how to do the compressed sensing of EEG using wavelets. Their problem was that Matlab has no function to generate the DWT basis matrix (i.e. the matrix D in my paper). One has to generate such matrices using other wavelet toolboxes. Now I updated my codes, where I gave a guide to generate such dictionary matrices using the wavelab ( http://www-stat.stanford.edu/ ~wavelab/ ) , and there is a demo to show how to use a DWT basis matrix as the dictionary matrix for compressed sensing of EEG ( demo_useDWT.m ) . The codes can be downloaded at here . 如果上面的代码连接不对（可能需要“翻墙”），可来email向我索要。; 3089 次阅读|8 个评论

深度学习(Deep Learning, DL)的相关资料总结: 热度 3 bigdataage 2013-4-12 19:14; 深度学习(Deep Learning，DL)的相关资料总结有人认为DL是人工智能的一场革命，貌似很NB。要好好学学。 0 第一人（提出者）好像是由加拿大多伦多大学计算机系( Department of Computer Science , University of Toronto ) 的教授Geoffrey E. Hinton于2006年提出。其个人网站是： http://www.cs.toronto.edu/~hinton/ science上的那篇论文： http://www.sciencemag.org/content/313/5786/504.full 1 中文的资料（不含论文）：百度百科 http://baike.baidu.com/view/9964678.htm CSDN博客-，机器学习——深度学习(Deep Learning) http://blog.csdn.net/abcjennifer/article/details/7826917 深度学习(Deep Learning)综述 http://www.cnblogs.com/ysjxw/archive/2011/10/08/2201819.html “深度学习”是人工智能的一场革命吗？ http://article.yeeyan.org/view/371738/341235 科学家称，深度学习是硅谷科技企业的未来 http://www.36kr.com/p/175229.html 深度学习(Deep Learning)算法简介 http://hi.baidu.com/yimizizizi/item/4d32615787772a05e6c4a5e1 程序员杂志201302:深度学习——机器学习的新浪潮 http://blog.csdn.net/datoubo/article/details/8577366 机器学习、大数据、深度学习、数据挖掘、统计、决策和风险分析、概率和模糊逻辑的常见问题解答 http://blog.csdn.net/yihaizhiyan/article/details/8266045 百度深度学习研究院 Deep Learning 教程翻译 http://blog.sina.com.cn/s/blog_46d0a3930101h6nf.html Deep Learning（深度学习）学习笔记整理系列 http://blog.csdn.net/zouxy09/article/details/8775360 Deep learning 非常好的中文学习笔记 http://www.cnblogs.com/tornadomeet/archive/2013/03/14/2959138.html Deep Learning入门之路一、二 http://blog.sina.com.cn/s/blog_9b75a293010176km.html http://blog.sina.com.cn/s/blog_9b75a29301017dd5.html UFLDL教程 http://deeplearning.stanford.edu/wiki/ 机器学习前沿热点–Deep Learning http://blog.sciencenet.cn/blog-315535-663215.html UFLDL教程（中文版） http://deeplearning.stanford.edu/wiki/index.php/UFLDL%E6%95%99%E7%A8%8B Deep Learning学习（开篇） http://www.cnblogs.com/JackOne/archive/2013/02/19/DeepLearning-FirstBoold.html 深度学习：推进人工智能的梦想 http://www.csdn.net/article/2013-05-29/2815479 2 英文的资料（不含论文）： http://deeplearning.net/ (内容很多很丰富) http://en.wikipedia.org/wiki/Deep_learning http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/ http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial （很好） http://www.iro.umontreal.ca/~pift6266/H10/notes/deepintro.html http://reading-group.net.technion.ac.il/2012/11/27/deep-learning-introduction/ http://www.socher.org/index.php/DeepLearningTutorial/DeepLearningTutorial Reading List： http://deeplearning.net/reading-list/ Learning Deep Architectures for AI： http://ishare.iask.sina.com.cn/f/36110040.html 3 论文（中文和英文）：论浅层学习与深度学习深度学习研究综述深度学习结构和算法比较分析英文的看Geoffrey Hinton和Andrew Ng的论文就够了： http://www.cs.toronto.edu/~hinton/papers.html http://ai.stanford.edu/~ang/papers.php; 23116 次阅读|3 个评论

[转载]Corpus use and learning to translate: carldy 2013-4-11 18:00; Corpususe and learning to translate GuyAston 1999.Textus 12: 289-314. Given the factthat total bi-directional correspondences are extremely rare phenomena, we often have to search forsecond-best matches, and that means we have to select one of several alternatives, namely the one that fits thecontext best. It should bepossible to come up with this match if the translator consults a large corpus and, by identifying thecontext pattern in question, finds the lexical unit that would `naturally' be used in such asituation. All it needs is an operational definition of context and context pattern. (Teubert 1996:241) 0. Introduction Apaper on the learning of translation must espouse some view of what translationinvolves. So let me state some premises. Following Joseph (1998), I take itthat translation involves interpreting a source text (ST), and then generatinga target text (TT) in another language which strategically directs its intendedaudience to an interpretation of it - generally one which in certain respectsmatches the interpretation given to the source text. From this point of view -substantially corresponding, as I understand it, to Nida's notion ofdynamic equivalence (1964) - would-be translators must developinterpretative and strategic competencies which they may well lack in at leastone of the languages involved for a particular task, since translators arerarely balanced bilinguals, nor always specialists in the discourse domain inquestion. In addition, translating - like editing - calls for the ability toelaborate, compare and evaluate different strategies and interpretations in thelight of externally-defined contextual restrictions. Translators typically workunder commission, where specific target audiences, and specific interpretationsof the source and/or of the target text are implied (Reiss 1981). Thetranslator thus needs resources which can suggest possible and probableinterpretations of the ST, which can indicate effective strategies forachieving particular interpretations of the TT, and which can facilitate the evaluationof alternative strategies and interpretations. Varantola (1997) suggests thatas much as 50% of the time spent on a translation can be dedicated toconsulting reference materials. In this paper I review the roles which can beplayed by electronic corpora in improving the quality and speed of thetranslation process, in helping would-be translators to develop theirinterpretative and strategic competence, and in developing their sensitivity tothe issues involved. While in no way wishing to suggest that electronic corporaare a touchstone to resolve the translator's many problems, I believe that theycan satisfy three significant criteria for translation instruments: se facilitano il processo eportano ad una migliore qualita' del prodotto, anche attraverso un aumentodelle possibilita' di scelta dell'utente; se offrono occasioni diapprendimento linguistico e metalinguistico; se permettono lo sviluppo di unacapacita' tecnica e critica nei confronti di simili strumenti. (Aston 1996: 308) Theyachieve these objectives by providing collections of helpful informationwhich facilitate their decision-making and make them feel more secure abouttheir choices (Varantola 1997), allowing better and/or faster solutionsto be obtained; by offering numerous opportunities for learning the language,the domain, and about the translation process; and by allowing the user to playan active role in their development and exploitation. 1. Types ofcorpora Interestin corpora in the field of translation has been from two main perspectives,descriptive and practical. On the one hand, scholars have designed and analysedcorpora of translations, comparing these with corpora of original texts inorder to establish the characteristics peculiar to translations in particularSL-TL combinations (Gellerstam 1996), and indeed possible universalsdistinguishing translated texts (Baker 1998, Laviosa 1998). On the other hand,there has been a growing interest in corpora as aids in the processes of humanand machine translation - their role which is my primary concern here. For thispurpose, three main types of corpora have been proposed as relevant: Monolingualcorpora consist of texts in a single language, which may be either the sourceor the target language of a given translation. While general monolingualcorpora include texts of a wide variety of types, specialized monolingualcorpora are restricted to a particular genre and/or topic domain. In eithercase, the corpus attempts to provide a sample of a particular textualpopulation, which ideally also reflects the variability of that population(Biber 1993). Wheremonolingual corpora of similar design are available for two or more languages,they may be treated as components of a single comparable corpus. With a fewexceptions (note 1), comparable corpora are currently specialized, with thetexts belonging to genres or domains which are sociolinguistically similar ineach of the cultures involved (in terms of participation framework, function,and topic), and have similar variabilities. Parallelcorpora also have components in two or more languages, consisting of originaltexts and their translations. Again, most parallel corpora are specialized.They take two main forms (figure 1): Figure 1: Comparable and parallelcorpora language A language B Comparable A. specialized corpus B. specialized corpus of same design Parallel A. specialized corpus B. translations of texts contained in A Unidirectional Parallel A1.specialized corpus B1.specialized corpus of same design as A1 Bidirectional A2.translations of B1 B2.translations of A1 .Unidirectionalparallel corpora consist of texts in one language along with translations ofthose texts into another language (or languages). Since the corpus in languageA is by definition restricted to texts which have been translated into languageB, this will not generally allow the textual population in language A to berepresentatively sampled (Aijmer et al 1996). The criteria to be adopted inselecting the translations to be included in the language B component are alsodebatable - for instance, whether these should be filtered forquality in some way. The two components are typically aligned on aparagraph-by- paragraph or sentence-by-sentence basis: that is to say,information is added to each sentence or paragraph of each text which indicatesthe corresponding sentence or paragraph in the parallel text in the othercomponent (note 2). (For a review of alignment procedures, see http://www.lpl.univ-aix.fr/projects/arcade/ ) .Bidirectionalor reciprocal parallel corpora contain four components: source texts inlanguage A and their aligned translations in language B, and source texts inlanguage B and their aligned translations in language A. They thereby combinethe characteristics of unidirectional parallel corpora with those of comparablecorpora: if the same design criteria are employed for both languages, theyinclude comparable collections of original texts in the two languages (A1 andB1), as well as comparable collections of translated texts in the two languages(A2 and B2). They additionally allow comparisons between original andtranslated texts in the same language (A1 and A2; B1 and B2: Johansson andEbeling 1996). Inthis paper I discuss the relevance of each of these types of corpus for thetrainee translator. In addition, I shall consider the role of ad hoc corpora,i.e. corpora compiled on the fly by the translator in order toinvestigate a specific problem encountered during a particular translation. 2. Uses ofdifferent corpora 2.1Monolingual general corpora Theobvious way in which corpora can help translators is as reference tools, ascomplements to traditional dictionaries and grammars. Thus the first sentenceof Bruce Chatwin's Utz (1989a: 7) reads as follows: An hour before dawn on March7th 1974, Kaspar Joachim Utz died of a second and long-expected stroke, in hisapartment at No. 5 Sirok Street, overlooking the Old Jewish Cemetery in Prague. Letme focus on just one problem here, the translation of overlooking. If weexamine the occasions where the word apartment occurs in the vicinity ofoverlooking in the 100-million word British National Corpus (http://info.ox.ac.uk/bnc),we find that apartments typically overlook mountains, rivers, oceans, ports,squares and gardens - all views which seem positively connotated. On the fewoccasions where what is overlooked is ugly, irony appears to be intended, as in: Not only do theytolerate the fast-food shops serving up nutriment that top breeders wouldn'trecommend for Fido, they go as far as purchasing two expensive weeks in agruesome timeshare apartment, and sit smoking all day on a balcony overlooking the A9. Thecorpus data thus suggests that overlooking has a positive semantic prosody(Sinclair 1991, Louw 1993) - a fact which is unmentioned by dictionaries, andmight even be overlooked by a translator whose native language was English. Itaids interpretation of the ST, raising the problem of whether Chatwin intendsthe Prague cemetery to be seen by the reader as a beauty spot, or whether he isbeing ironic - or indeed, whether he simply aims at ambiguity in this respect. Acorpus can also help the translator evaluate - or indeed come up with - apossible translation for this sentence. The Italian translation of Utz (Chatwin1989b: 9) renders it as: Il 7 marzo 1974,un'ora prima dell'alba, nel suo appartamento di via Sirok 5 che dava sul vecchio cimitero ebraico di Praga,Kaspar Utz mori' di un secondo colpo da tempo previsto. Doesthe choice of dava su share thepositive connotations of overlooking, and allow a similar, possibly ironic,interpretation? In a small (2 million word) collection of Italian literarytexts, we find the following instances of dava su: Lei non si vedeva. Ma il soggiorno dava su una veranda da cui una scaletta o negli onesti. La finestra di mezzo dava su un balcone di ferro. Concentr finestrone, dai vetri impolverati, che dava su di uno spalto esterno, da cui si la vasca. Al chiaro di una finestra che dava su un cortile interno, lesensazio be potuto uscire subito dalla porta che dava sul sottoponte. Ma, quasi a prende Muovendosi davanti alla vetrata che dava sul parco, il Bocchi vide i globi omandante aveva una grande finestra che dava sul pozzo a lume; di fronte, con un Bocchi abitavain un piccolo attico che dava sul Lungoparma, nel punto in cui Thesecitations offer little evidence that dava su has a distinctive prosody, andmake it doubtful that this translation could be interpreted as ironic. Datafrom monolingual corpora may thus support interpretative and strategichypotheses, or suggest that they should be rejected. They may also suggestalternative hypotheses. In the English corpus, overlooking tends to beassociated with a particular set of collocates (garden, sea, hills, squareetc.). If we search the Italian corpus for occurrences of equivalents to thesecollocates (giardino, mare, montagna, piazza, etc.) in the vicinity of wordslike appartamento, camera, casa and finestra, we find citations such as thefollowing: se vuole possoprenotarle una camera per domani stesso, una bella e linda cameretta con vistasul mare, vita sana, bagni di alghe, talassoterapia, Thiscitation suggests another possible translation of overlooking, namely con vistasu. As we did with dava su, we can now test this against the corpus in order tosee whether it is positively connotated, and whether there is evidence of itsbeing used ironically - whether, that is, it occurs in similar contexts tooverlooking. Amonolingual general corpus also provides a rich language learning environment.Even if the dava su hypothesis is rejected, the process of doing so allows theuser to learn much which may be of value in the future. Unlike the dictionary,a concordance leaves it to the user to work out how an expression is used fromthe data. This typically calls for deeper processing than does consulting adictionary, thereby increasing the probability of learning (Hultsijn 1992). Inmore general terms, by drawing attention to the different ways expressions aretypically used and with what frequencies, corpora can make learners moresensitive to issues of phraseology, register and frequency, which are poorlydocumented by other tools. Corporaalso allow much unpredictable, incidental learning. Almost any concordance islikely to contain unknown or unfamiliar uses, which may be noticed and exploredby the user who is prepared to go off at a tangent to follow them up(Bernardini 1997, in press). Looking through the occurrences of dava su, Inoticed the unfamiliar expression pozzo a lume. While I can roughly understandits meaning from the context, I may be able to get a better idea of its use andfrequency by generating a concordance of all its occurrences in the corpus. Astranslation aids, however, monolingual general corpora pose a number ofdifficulties: .Itmay be difficult to locate and select an appropriate corpus. Reference corporasuch as the Bank of English and the British National Corpus are sufficientlylarge and well-balanced to document the range of uses of all but the rarestlexical items in British English: there are, for example, 767 occurrences ofoverlooking in the BNC. But no similar corpus is yet publicly available forItalian, nor for American English. The Italian data cited above were taken froma relatively small (2 million word) collection of contemporary literary texts,put together from the Internet. The limited size and representativeness of thiscollection makes it much more difficult to identify and to evaluateregularities of use: there are only 8 occurrences of dava su, and it isdebatable how far the intertextual background against which a translation ofUtz should be interpreted is a purely literary one. .Itmay be difficult to retrieve appropriate instances from the corpus.Overlooking, for example, is polysemous, meaning either looking out ontoor ignoring. There is no way, using currently available corpora andconcordancing software, that it is possible to find just one of these sensesand exclude the other. Roughly 10% of the occurrences of overlooking in the BNChave the ignoring sense, and these must be excluded manually inorder to effectively analyse the semantic prosody of the looking outonto sense. In the case of dava su, not only do other senses (such asthat of gave up) have to be excluded, but morphological variants ofeach word should probably be investigated (da/dava/danno/davanosu/sul/sull'/sullo/sulla/sugli/ sulle), many of them polysemic, as shouldoccurrences where the component words are separated by, for example, adverbials(dava direttamente su). Considering such factors will tend to reduce theprecision of any search, making it more likely that spurioussolutions will be found which require manual deletion (note 3). .Itmay be difficult to match the data to the translation. Whether evaluating ausage in the ST, or a candidate translation in the TT, the user is unlikely tofind examples which precisely match the required context. Analysingconcordances requires identification, classification and generalization toestablish recurrent patterns and to relate these to particular contextualfeatures (Johns 1991: 4), and these procedures require training and practice.Faced with a concordance of overlooking, the learner will need to group useswith different senses (e.g. animate and inanimate subjects and/or physical andabstract objects: overlooking the problem vs overlooking the park), and to drawinferences as to what features are shared by a particular group. Since thisobliges the user to discriminate and attend to uses which differ from thatoccurring in the source or target text, the process will be time-consuming, andarguably dispersive in terms of the translation at hand - even if rewarding inlanguage learning terms, where greater understanding of the different uses ofoverlooking may be a valuable by-product. 2.2 Specializedand comparable corpora Thesedifficulties can be reduced by using corpora which are specialized, that is,which consist only of texts of a type similar to the ST and/or the desired TT.Such corpora may be extractable as sub-corpora from large general ones - thoughonly limited specialization can be obtained without compromisingrepresentativeness (Sinclair 1991) - or they may be specifically collected - aninvestment which may be well worth the effort where the translator foreseesdoing a number of similar translations in the future, and which is therefore auseful exercise for any translator training course (Maia 1997). Specializedcorpora can be seen as a development of the tradition of usingparallel texts in translation - i.e. collections of texts of thesame kind as the ST and/or TT (Haartman 1980; Williams 1996) - with electronicformat enabling more rapid and systematic searching of larger quantities oftext. Such corpora are particularly useful for the investigation of forms andmeanings which are typical of that type of text (in particular terminology, butalso features of register and text structure: Gavioli, forthcoming; Zanettin,forthcoming), and as an environment in which to prepare for work which has tobe carried out under time constraints, such as speech interpreting. Varantola(1997) underlines how specialized corpora have high reassurancevalue, particularly where the TT is in the translator's L2, insofar asthey illustrate similar contexts to those of the translation being worked on. Wherespecialized corpora have to be constructed by the user, this involves designdecisions as to what texts to include and why. One of our early experiences inForl withspecialized corpora involved learners who were translating material for theMelozzo centenary exhibition into English, for which we compiled English andItalian corpora from CD-ROMs of the National Gallery and of the Uffizi. Eachcorpus contained texts of similar types describing artists and their works,genres, schools and technical aspects for a lay public. While limited in size(under 100,000 words each), their specialization and authoritativeness madethem appropriate resources for the task, and given their similar composition,the two corpora could also be treated as comparable. Today, corpora of texts ofthis type could also be compiled from the Internet (Pearson, forthcoming).Clients are also a potential source of relevant specialized texts. Withrespect to general monolingual corpora, specialized ones are easier to handleand in many ways more informative. In particular: .Itis easy for the user to become familiar with the texts included in a smallspecialized corpus, facilitating the interrogation of that corpus and theinterpretation of data from it (Aston 1997) - a familiarity which will befurther enhanced if the corpus has been compiled by the user (Maia 1997). .Aspecialized corpus can provide figures concerning lexical density andrepetition for texts of that type, in the shape of standardized type/tokenratios, numbers of types accounting for different percentages of tokens, etc.The user can compare means and variances of these figures with values for theST and/or TT, to see how well the latter match the norms of the corpus. .Concordancesare less likely to contain spurious citations. Insofar as the frequency ofdifferent senses varies according to text-type, the likelihood of encounteringother senses of polysemous items may be reduced (if we simply restrict a searchfor overlooking to the subcorpus of fiction in the BNC, the proportion ofexamples of the ignore sense drops by 50%). Gavioli and Zanettin(1997) provide a clear example of this phenomenon. Faced with the ST phraseetilisti con o senza marcatori HBV in a medical research article on hepatitisC, the initial translation hypothesis was alcoholics with or without HBVmarkers. However a search for markers in a specialized corpus of similararticles in English revealed the recurrent positive/negative for HBV markers.This would not have emerged from the BNC, where there are fewer texts of thistype, and where other senses of the word markers predominate (in reference toexaminations, pens, and linguistics). The viability of such a corpus, however,depends on whether the intertextual background for the use of an expression canbe confidently limited to a single text-type in this way. .Specializedcorpora also provide more assistance in formulating translation hypotheses. Thegreater precision provided by a specialized corpus allows us to extend theprinciple of using collocates to identify possible equivalents to completetexts or segments of texts. Where an expression in the ST is used in relationto a particular person or concept in the field in question, it may be possibleto locate possible equivalents by searching for references to that same personor concept in the TL corpus and then reading the surrounding text. Software canfacilitate this process: Wordsmith Tools (Scott 1997) shows which texts andparts of those texts contain most occurrences of a particular form. Remainingwithin the field of hepatitis C research, let us say that we wish to find anappropriate translation for the ST's casi mortali per insufficienza epatica.Given that all the texts in the corpus deal with hepatitis C, we can guess thatthose where death is most frequently mentioned are most likely to be relevant.The following table shows the files where the word death is most frequent: N File Words Hits per 1,000 words 1 sx.11 1,436 17 11.84 2 rx.11 965 11 11.40 3 mx.11 914 8 8.75 4 rx.7 870 3 3.45 Readingthe first of these, we find the expression fatal liver disease - a translationhypothesis which we can then investigate using the entire corpus. .Specializedcorpora facilitate analyses related to textual macrostructure. Insofar as thetexts in the corpus share similar structures and functions, it is easier torelate occurrences to particular functions and positions in texts (Aston 1997). .Incidentallearning from texts and citations from a specialized corpus is more likely tobe relevant to the task at hand, or at any rate to come in useful for furthertranslations of a similar nature (Zanettin, forthcoming). For instance, aconcordance of panel painting in the National Gallery corpus includesreferences to types of panel paintings, such as tondi, and to the techniqueswhereby they were created, such as pastiglia - terms which may well proveuseful at other points of an art history translation. .Aspecialized corpus provides a useful means of learning about an area in whichthe translator needs to work and its textual conventions. Key concepts can belocated manually in wordlists, or a wordlist from a specialized corpus can becompared with one from a general corpus in order to highlight the distinctivefeatures of the former (Wordsmith Tools carries out such comparisons for bothsingle words and phrases). If the corpus is comparable, a candidate list ofterms in one language can be matched with one for the other language to createa terminology bank. Whilemost work involving specialized corpora as translation aids has used TL corpora(Bowker 1998, Varantola 1997, Friedbichler and Friedbichler 1997), wherecomparable specialized corpora are available, these can also be used toinvestigate the SL and the ST, particularly where the conventions of the latterare relatively unfamiliar, as a means to identify routine and non-routine uses.Comparable corpora seem particularly useful for learning purposes, as a meansof exploring a particular text- type in both languages prior to engaging intranslation. 2.3 Corpusconstruction Sincespecialized corpora for a particular text-type are rarely availableoff-the-shelf, the translator needs to learn to construct such corpora - anexperience which will develop awareness of their potential validity andreliability. Collecting a reasonably representative set of texts of aparticular type requires a preliminary survey of the textual population and ofits variability, as well as of the authoritativeness of candidate texts.Friedbichler and Friedbichler (1997) recommend selecting texts which have beensubject to peer review, and which are where possible widely cited in thespecialist literature (note 4); Varantola (1997) recommends avoiding textswritten by non-native speakers. It isclear that for any specialized corpus, the greater the variability of the text-type to be represented, the larger the corpus should be. In general, the largerthe better, though there is clearly a point where the returns on expansiondiminish. Friedbichler and Friedbichler (1997) suggest that for English,authoritative specialized corpora of 500,000 to 5 million words (according tothe variability of the text-type) should provide solutions to 97% of thetranslator's questions. In what follows, a number of criteria for evaluatingspecialized corpora are proposed: in each case, the smaller the value thebetter. .Thesmaller the type/token ratio, the more lexically repetitive the corpus, andhence the better documented the types it contains. A ratio of 2% means thateach word-type occurs, on average, 50 times every 1000 words in the corpus. .Whileindicating the extent of documentation of the types contained in the corpus,the type/token ratio gives no indication of whether those occurring in asimilar text from outside the corpus will be documented. This probability canbe assessed from the ratio of hapax legomena (word-types which occur only oncein the corpus) to the total number of tokens: an HL/tokens ratio of 2% meansthat when reading a new text, an undocumented type is likely to be encountered,on average, every 50 words. .TheHL/tokens ratio does not however consider the variability of the text-type.This can be assessed by considering the proportion of word-types that occur inonly one text in the corpus. This provides a further indication of thelikelihood, in any new text, of encountering new types. A proportion of 20%means that in any similar text, 20% of its word-types will on average beundocumented. Allthese measures are a function of variability within and across texts, and ofcorpus size (and in the case of the last measure, also of text size): a smallbut homogeneous corpus of weather reports may well have lower values than amuch larger one of tourist guides. Values will also depend on the language ofthe texts: given the greater morphological complexity of the language, Italiancorpora tend to have higher values than English ones (note 5). Thetranslator can use measures such as these to assess the reliability of aparticular specialized corpus and hence to determine its required size. Valuesobtained on the last two measures can also be compared with the actualproportions of undocumented types encountered in the ST and/or TT, as anindication of the goodness- of-fit of the corpus for the text inquestion. Thisfit will rarely be perfect, and in any case no specialized corpus is everlikely to document all the problems posed by a particular text. Specializedtexts also use non- specialized language, and the intertextual background onwhich they draw will rarely be simply that of the text-type in question. Therethus remains the need to recognize where general monolingual corpora should becalled on, or where it may be useful to compile a corpus ad hoc to analyse aparticular problem. 2.4 Ad hoccorpora Specializedcorpora will rarely document every word in an ST or TT, even if they are likelyto provide a much fuller documentation for features typical of that text typethan large general ones. One learner using a comparable specialized corpus oncancer of the colon in order to translate an English research article intoItalian was completely nonplussed by an allusion in the ST to the holy plane,for which she could find no explanation or equivalent. In such cases, relevantinformation may be obtainable from a large monolingual corpus or, failing that,CD-ROMs or the Internet. We can in fact use the Internet to compile corpora adhoc, using search engines to find all the texts containing particularexpressions. Since the world-wide web is an ever- changing entity of dubiousauthority whose overall composition is unknown, considerable care must howeverbe exercised in selecting texts and drawing inferences (Pearson, forthcoming). Thevalue of such ad hoc corpora can be illustrated by an example from Bertacciniand Aston (forthcoming), which focusses on the translation into English of aFrench newspaper article which contained the word clochemerlesques. Searcheswere made for clochemerl* in a CD-ROM of Le Monde, and using the Altavistasearch engine on the Internet (http://altavista.digital.com). Together, theseturned up 20 French texts, analysis of which allowed for a fairly confidentinterpretation of the ST: Clochemerle was a comic novel by G. Chevallier whichridiculed factionism in village politics, apparently well-enough known as anarchetype of petty factionism to be alluded to without explanation by Frenchjournalists. Howcould it be translated in English? Searches for English examples of clochemerl*on the Internet, and in CD-ROMs of The Independent and The Daily Telegraph,suggested that Clochemerle was far from equally familiar to a British public,and that it was if anything associated with public conveniences. Did anyarchetype in British culture have similar associations to the French one? One possibilitywhich came to mind was Gulliver's Travels, and the conflict in Lilliput betweenBig- and Little-endians as to the right way to crack an egg. However, furthersearches provided no evidence that reference to Lilliput, or to big/little-endian, would have these associations for a general reader (the formerseemed associated exclusively with size, and the latter were terms in computerarchitecture). The final (if less than fully satisfactory) solution was localsquabbling, whose derogatory connotations were confirmed by a study of thesemantic prosody of squabbl* in the BNC. Insuch cases, an ad hoc corpus is clearly better than none, though very time-consuming to compile. Friedbichler and Friedbichler (1997) suggest that to becost- effective, searches using corpora should not exceed an average of tenseconds: so the use of ad hoc corpora must be limited to a very smallproportion of the problems posed by any translation. 2.5 Parallelcorpora Afurther limit of monolingual and comparable corpora as translation tools is thedifficulty of generating hypotheses as to possible translations. The user mustrely on known or suspected equivalences as heuristics to retrieve similarcontexts in a TL corpus, providing a specification which is both sufficientlygeneral to recall a range of possibilities, and sufficiently precise to limitthe number of spurious hits. S/he must then verify that the citations retrievedare in fact sufficiently similar to those of the ST and/or the SL corpus. Theseprocedures are both time-consuming and error-prone: an expression in the TLcorpus may occur in a similar context to one in the SL corpus, yet in fact meansomething different. For example, in attempting to translate the phrase loopileostomy in a medical research article, Ferri (1999: 64) illustrates how asearch for similar contexts in the TL found ileostomia su bacchetta. Withoutdetailed medical knowledge, she initially assumed this term to be equivalent,while it is in fact hyponymous. Greatercertainty as to the equivalence of particular expressions can be obtained byusing parallel corpora, consisting of original texts and their translations,where these are similar to the ST and TT. If the corpus is aligned, andsuitable software is available, the user can locate all the occurrences of anyexpression along with the corresponding sentences in the other language. Thereis however a dearth of parallel corpora for English and Italian, and relativelylittle parallel concordancing software for the PC (though see Barlow 1995,Woolls 1997). The examples which follow were extracted using Multiconcord(Woolls 1997), from its sample collection of different language versions ofdiscussions in the European Parliament. This material has many limits, since wedo not know which version constitutes the original text, and which atranslation, or indeed a translation of a translation (Lauridson 1996).Nevertheless, it can illustrate how a parallel corpus may provide a means ofidentifying translation hypotheses in a specialized environment. Thefollowing concordance shows occurrences of the word establish and itsequivalents in Italian (some citations are abbreviated for reasons of space): We support the Socialist Group's demand forthe President to establish a committeeas soon as possible to conduct such a review. Condividiamo la richiesta del gruppo socialista in basealla quale il Presidente dovrebbe istituire quanto prima un comitato per larealizzazione di questa modifica. if we are to guarantee the quality andcompetitiveness of the European tourist industry, we shall have also to develop new forms of synergy with otherCommunity policies, bringing in all ofthe interested parties in an effort to establish the conditions favourable tothe development of the Union's tourist enterprises per garantire la qualita' ela competitivita' dell'industria europea del turismo, occorre inoltresviluppare nuove sinergie con le altre politiche comunitarie, coinvolgendotutte le parti interessate al fine di creare le condizioni favorevoli allosviluppo delle imprese turistiche dell'Unione Thus we need to establish a coherentEuropean tourism policy which adds value above and beyond Member State leveland against which we can judge and monitor the very considerable sums of moneywhich are spent through other EU funds ed e' quindi necessario realizzare unapolitica europea per il turismo globale, che aggiunga valore al di sopra ed oltreil livello di Stato Membro e rispetto alla quale possiamo valutare econtrollare le notevoli somme di denaro che vengono spese attraverso altrifondi europei It is vital at this point that we establishdiplomatic relations and therefore a dialogue with the current Kabulauthorities, Si rivela indispensabile in questo momento, instaurare relazionidiplomatiche e quindi un dialogo con le attuali autorita' di Kabul, It must put an end to the inconsistenciesand finally establish a clear and independent foreign policy, at lastshouldering its responsibilities, without hesitation and avoidinginconsistencies. Metta fine alle sue contraddizioni e elabori finalmente unapolitica estera chiara, autonoma, si assuma finalmente le sue responsabilita',senza tentennamenti e senza contraddizioni. We must ask the Union to establish whetherthe proposals made by these countries under the aegis of IGADD will be able tobring about a solution and if so to give them our support. Invitiamo l'Unione averificare se le proposte avanzate da questi Stati nell'ambito dell'IGAD sianotali da favorire una soluzione e, in caso positivo, la sollecitiamo a dare ilsuo sostegno. We need morespecific signs and we need clearer evidence that the Belarus Government doesindeed want to establish a free and more democratic society. Ci servono segnipiu' precisi, cosi' come deve essere precisa l'intenzione del governo bielorussodi instaurare a tutti gli effetti un sistema libero e democratico. Thisillustrates a wide range of possible equivalents to establish: avviare, creare,elaborare, ginstaurare, realizzare, verificare. For the translator of anEnglish text of this kind, it thus suggests a range of hypotheses which can befurther investigated using a general or specialized TL corpus. Notall expressions are paralleled by such a wide variety of equivalents. One ofthe most frequent lexical words in the Italian component of the corpus isrelazione. The parallel English term is invariably report (unlike the Britishparliamentary paper). In contrast, under a third of the occurrences of anotherfrequent word, favore, are paralleled by favour: parallel to votare a favore diwe find vote for; parallel to accogliamo con favore, we welcome. The corpussuggests equivalents for technical terms, and a wider variety of possibletranslations for sub-technical lexis than are likely to be found in a bilingualdictionary, particularly at a phraseological level. It may also highlightsyntactic contrasts, including differences in the organization of the text intosentences and paragraphs. Usingsuch a corpus can also have a positive impact on learning. Where a variety ofparallel realizations are encountered, this may help learners to distinguishbetween different contexts of use, and reduce their tendency to think in termsof one-to-one equivalence, as Ulrych (1997) illustrates in respect of parallelEnglish realizations to ossia. More general problems may also be faced:Danielsson and Mühlenbock (forthcoming) illustrate how a parallel corpus cancast light on translation strategies for proper names, showing whether theseare transcribed, translated, clarified or simplified. Johns (forthcoming)proposes a number of types of exercises using parallel concordances, forinstance by blanking out the search word in language A and asking learners toinfer it from the parallel citations provided in language B. Sinceparallel concordances provide translations of each occurrence, citations aremore likely to be immediately understandable for the user, diminishing thedifficulties of retrieval and risks of misinterpretation associated withmonolingual and comparable corpora. For the same reason, the scope forincidental learning may be increased. However, notwithstanding their apparentface validity, parallel corpora also introduce new dangers deriving from theassumption that parallel occurrences are effectively equivalent. It isnecessary to ask whether the translations in the corpus are reliable andauthoritative (note 6), and to bear in mind that the use of translations toidentify equivalents inevitably implies reduc the target language toa mirror image of the source language (Teubert 1996: 250) - or the SL toa mirror image of the TL: There is, for instance, no directT E in English for the German word Schadenfreude Therefore, we will rarely find occurrences of Schadenfreude in Germantranslations of English texts. Generally speaking, translations in language Bwill contain `grosso modo' only those lexical items which count as TEs foritems of the vocabulary of language A. The same is true for syntax. The`impersonal passive' (e.g. Es wurde viel getrunken, literally `It was drunk alot') is a fairly common syntactic construction in German for which there is noequivalent in English. (Teubert1996: 247) Usingtranslations as models for the TT thus risks reproducing those features oftranslationese which have been identified by workers using corporain descriptive translation studies: normalization, simplification,explicitation (Baker 1993, 1998), sanitization by reducingconnotational meanings (Kenny 1998), increased cohesion (Over†s 1998), andlower lexical density, higher mean sentence length, and higher proportions ofhigh-frequency words (Laviosa 1998). Gellerstam (1996) shows how translationsinto Swedish of English texts carry over many features of English vocabulary,syntax, and rhetoric when compared with comparable Swedish originals; Gavioliand Zanettin (1997) illustrate some similar features in Italian translationsfrom English. Using parallel corpora seems likely to reinforce such tendencies(though it is of course possible that they may increase learners' awareness ofthese features, and hence their conscious control of them: Ulrych 1997). Theunreliability of the translations in parallel corpora makes it advisable to usethem in conjunction with monolingual or comparable corpora, so that, forinstance, a translation hypothesis derived from a parallel corpus can be testedagainst a collection of original texts in the language in question. The idealparallel corpus, from this point of view, will be bidirectional or reciprocal(cf 1 above), allowing the user to see whether occurrences found intranslations into language B are also found in original texts in language B,and whether these are translated into language A in the manner encountered inoriginal texts in language A. Such a corpus combines the advantages of aparallel corpus with those of a comparable one: from this point of view,bidirectional English-Italian corpora would seem an important area for futureresearch and development. Such corpora are however considerably more difficultto design and compile than comparable ones, given the need to create comparablecollections of texts which have been translated, and to align the texts andtranslations prior to use. Given the amount of work involved, they are likelyto be relatively unspecialized in order to extend their range of application(see e.g. the English-Norwegian parallel corpus: Johansson and Hofland 1993).Consequently there is still likely to be a role for comparable andunidirectional parallel corpora of a more specialized nature. One form of thelatter may be compiled by the specialized translator (or their client), drawingon the texts that s/he has (had) translated in the past (cf note 5 above). Itshould be noted en passant that parallel concordancing software can also beused to analyse a single text and its translation. This is potentially a usefultool for translators to check and evaluate their own translations. Alignedversions of the ST and TT can be used to see whether a particular term in theST has been translated consistently in the TT, or whether (given the tendencyof translations to be less lexically varied than their source texts) aparticular expression in the TT corresponds to a variety of expressions in theST. Type/token ratios and lexical density measures for the ST and TT can alsobe compared, and evaluated by comparison with those found in comparable orparallel corpora of similar texts. 3. Conclusions Thereis as yet little hard empirical evidence to demonstrate the effectiveness ofcorpora as translation and as learning tools. Williams (1996) found a 40%improvement in the recovery of correct equivalents when paralleltexts were used as translation aids as opposed to bilingual dictionaries, andone might expect these results to be matched or bettered with largercollections of texts in electronic format, and the aid of retrieval software.In a pilot experiment Bowker (1998) found that learners using a specializedcorpus of texts in the target language (their L1) showed greater correct termchoice and idiomaticity than a matched group using bilingual dictionariesalone. On the other hand, Bernardini and Aston (forthcoming) found that on twotranslation tasks into the L2, learners using monolingual L2 dictionariesperformed better than matched groups using a general L2 monolingual corpus.While learners seem to a large extent enthusiastic about using corpora, itremains to be shown just in what respects, and under what conditions, theirperformance as translators may improve as a consequence: we cannot for instanceexclude the idea that training with corpora may also improve dictionary usage,by instilling greater attention to collocation and register. No research that Iam aware of has yet attempted to compare the effectiveness of different typesof corpora, or of different learner approaches to them; yet more difficult tomeasure are the overall effects of corpus use on learning, be this in terms ofgeneral linguistic knowledge and ability, or as relating to a specializedtext-type. Inthis climate of empirical uncertainty, arguments for and against the use ofcorpora in translator training must be of a theoretical nature, and can resortat best to anecdotal evidence. Where available and accessible, appropriatecorpora appear able to provide better and faster solutions to many of thetranslator's problems in a unified environment, with positive effects onlearning. They make possible more idiomatic, native-like interpretations ofsource texts and a use of more idiomatic, native-like strategies in targettexts. It is our experience at Forli' that few trainee-translators who haveused corpora would wish to be without them, notwithstanding (or because of?) theinvestment in time and effort required to compile corpora and to learn how touse them, and we expect that as the number of available corpora and thequantity of suitable software increases, the use of corpora for translation andtranslator-training will gather further momentum, with a growth in itscost-effectiveness. Notes 1.The Parole project aims to produce general comparable corpora for all thelanguages of the EU (http://www.ilc.pi.cnr.it/parole/parole.ht ml). 2.Parallel corpora can be extended to include multiple languages (Woolls 1997),or multiple translations of each text (Ulrych 1997, Malmkjaer 1998). As thevalue of such extensions seems more descriptive than pedagogic, I shall notdiscuss them here. 3.In the gave up sense, su is of course an adverb rather than a preposition. Ifthe corpus used is tagged with part-of-speech codes (as is the case with theBNC and the Bank of English), it may be possible to avoid unwanted senses bysearching for a specific part of speech, e.g. dare su=PRP (or an equivalentformalism). Part-of-speech tagging may also facilitate analysis, enabling thedata to be sorted by part-of-speech code. 4.Bowker (1998) and Pearson (1996, 1998) argue that where specialised corpora areused to train translators in a specialised field, they should include a rangeof different types of text - expert, instructional, and popularised. The lattertypes, they argue, are likely to explain terms and concepts which are taken forgranted in expert texts. However, it is important not to confuse these types inthe corpus, since we would not, for example, expect divulgative texts to havethe same collocational and colligational regularities as specialist ones, norto contain the same range of terms as the latter. Where the corpus is used totranslate a specific text, the appropriate component should be given priority. 5.King (1997: 396) compares the number of types in translations of Le petitprince with the French original: scoring the latter as 100, figures for Englishand for Italian are 83 and 107 respectively. 6.This may, for instance, be dubious if all the translations in the corpus havebeen produced by the same translator, as is often the case withtranslation memory systems. References Aijmer,K., B. Altenberg and M. Johansson (eds.), 1996, Languages in contrast , Lund University Press, Lund. Aston,G., 1996, Traduzione e tecnologia, in G. Cortese (a cura di), Tradurre i linguaggi settoriali ,Edizioni Cortina, Torino, pp. 293-310. Aston,G., 1997, Small and large corpora in language learning, inLewandowska- Tomaszczyk and Melia, pp. 51-62. Aston,G. (ed.), forthcoming, Learning withcorpora . Baker,M., 1993, Corpus linguistics and translation studies: implications andapplications, in Baker et al., pp. 233-250. Baker,M., 1998, R‚explorer la langue de la traduction: une approche parcorpus, Meta , 43/4, pp.480-485. Baker,M., G. Francis and E. Tognini-Bonelli (eds.), 1993, Text and technology: in honour of John Sinclair , Benjamins,Amsterdam. Barlow,M., 1995, ParaConc: a concordancer for parallel texts, Computers texts, 10. Bernardini,S., 1997, A `trainee' translator's perspective on corpora,available online, http://www.sslmit.unibo.it/cultpaps/trainee.htm Bernardini,S., in press, Competence, capacity,corpora, CLUEB, Bologna . Bernardini,S. and G. Aston, forthcoming, Do corpora actually helptranslators?. Bertaccini,F. and G. Aston, forthcoming, Exploring cultural connotations throughadhoc corpora, in Aston (forthcoming). Biber,D., 1993, Representativeness in corpus design, Literary and linguistic computing , 8/4, pp. 243-257. Bowker,L., 1998, Using specialized monolingual native-language corpora as atranslation resource: a pilot study, Meta , 43/4, pp. 631-651. Burnard,L. and T. McEnery (eds.), forthcoming, Papers from TALC 98 (provisional title),Peter Lang, Bern. Chatwin,B., 1989a, Utz , Pan, London. Chatwin,B., 1989b, Utz , trans. D. Mazzone,Adelphi, Milano. Danielsson,P., and K. Mhlenbock, forthcoming, Retrieval of name translations inparallel corpora, in Burnard and McEnery. Ferri,S., 1999, Uso di piccoli corpora comparabili per la traduzione medica,unpublished dissertation, SSLMIT, Forli'. Friedbichler,I. and M. Friedbichler, 1997, The potential of domain-specific target-language corpora for the translator's workbench, available online,http://www.sslmit.unibo.it/cultpaps/fried.htm Gavioli,L., forthcoming, Corpora and the concordancer in learning ESP: anexperiment in a course of interpreters and translators, in G. Azzaro andM. Ulrych (eds.), Anglistica e ....: metodi e percorsi comparatistici nellelingue, culture e letterature di origine europea. Volume II: Transitilinguistici e culturali, EUT, Trieste. Gavioli,L. and F. Zanettin, 1997, Comparable corpora and translation: a pedagogicperspective, available online,http://www.sslmit.unibo.it/cultpaps/laura-fede.htm Gellerstam,M., 1996, Translations as a source for cross-linguistic studies, inAijmer et al., pp. 53-62. Hartmann,R.R.K., 1980, Contrastive textology: comparative discourse analysis in appliedlinguistics, Julius Gross Verlag, Heidelberg. Hultsijn,J.H., 1992, Retention of inferred and given word meanings: experiments inincidental vocabulary learning, in P.J.L. Arnaud and H. B‚joint (eds.),Vocabulary and applied linguistics, Macmillan, London, pp. 113- 125. Johansson,S. and J. Ebeling, 1996, Exploring the English-Norwegian parallelcorpus, in C. Percy, C.F. Meyer and I. Lancashire (eds.), Synchroniccorpus linguistics, Rodopi, Amsterdam, pp. 3-15. Johansson,S. and K. Hofland, 1994, Towards an English-Norwegian parallelcorpus, in U. Fries, G. Tottie and P. Schneider (eds.), Creating andusing English language corpora, Rodopi, Amsterdam, pp. 25-37. Johns,T., 1991, Should you be persuaded: two examples of data-drivenlearning, in T. Johns and P. King (eds.), Classroom concordancing, (ELRjournal, 4), Centre for English language studies, Birmingham, pp. 1- 16. Johns,T., forthcoming, Reciprocal learning: a practical application of parallelconcordancing'. Joseph,J.E., 1998, Why isn't translation impossible?, in S. Hunston (ed.),Language at work, BAAL/Multilingual Matters, Clevedon, pp. 98- 108. Kenny,D., 1998, Creatures of habit? What translators usually do withwords, Meta, 43/4, pp. 515-523. King,P., 1997, Parallel corpora for translator training, in Lewandowska-Tomaszczyk and Melia, pp. 393-402. Lauridsen,K., 1996, Text corpora and contrastive linguistics: which type of corpusfor which type of analysis?, in Aijmer et al., pp. 63-71. Laviosa,S., 1998, Core patterns of lexical use in a comparable corpus of Englishnarrative prose, Meta, 43/4, pp. 557-570. Lewandowska-Tomaszczyk,B. and P.J. Melia (eds.), 1997, PALC'97: practical applications in languagecorpora, Lodz University Press, Lodz. Louw,B., 1993, Irony in the text or insincerity in the writer? The diagnosticpotential of semantic prosodies, in Baker et al., pp. 157-176. Maia,B., 1997, Do-it-yourself corpora... with a little bit of help from yourfriends!, in Lewandowska-Tomaszczyk and Melia, pp. 403-410. Malmkjaer,K., 1998, Love thy neighbour: will parallel corpora endear linguists totranslators?, Meta, 43/4, pp. 534-541. Nida,E., 1964, Towards a science of translating: with special reference toprinciples and procedures in Bible translating, E.J. Brill, Leiden. Over†s,L., 1998, In search of the third code: an investigation of norms inliterary translation, Meta, 43/4, pp. 571-588. Pearson,J., 1996, Teaching terminology using electronic resources, in S.Botley, J. Glass, T. McEnery and A. Wilson (eds.), Proceedings of Teaching andlanguage corpora 1996, UCREL, Lancaster, pp. 203-216. Pearson,J., 1998, Terms in context, Benjamins, Amsterdam. Pearson,J., forthcoming, Surfing the internet: teaching students to choose theirtexts wisely, in Burnard and McEnery. Reiss,K., 1981, Type, kind and individuality of text: decision making intranslation, Poetics today, 2/4, pp. 121-131. Scott,M., 1997, Wordsmith Tools (ver. 2.0), Oxford University Press, Oxford. Sinclair,J.M., 1991, Corpus, concordance, collocation, Oxford University Press, Oxford. Teubert,W., 1996, Comparable or parallel corpora?, International jounral oflexicography, 9/3, pp. 238-264. Ulrych,M., 1997, The impact of multilingual parallel concordancing ontranslation, in Lewandowska-Tomaszczyk and Melia, pp. 421-435. Varantola,K., 1997, Translators, dictionaries and text corpora, available online,http://www.sslmit.unibo.it/cultpaps/varanto.htm Williams,I.A., 1996, A translator's reference needs: dictionaries or paralleltexts, Target, 8, pp. 277-299. Woolls,D., 1997, MultiConc (ver. 1.0), CFL Software Development, Birmingham. Zanettin,F., 1998, Bilingual comparable corpora and the training oftranslators, Meta, 43/4, pp. 616-630. Zanettin,F., forthcoming, Swimming in words: corpora, translation and languagelearning, in Aston (forthcoming). http://www.sslmit.unibo.it/~guy/textus.htm; 个人分类: 语料库与翻译学研究 Corpus-based Translation Studi|2961 次阅读|0 个评论

Transfer Naive Bayes learning: lzhx171 2013-4-7 14:54; （本周把LNP的实验做完，发现效果很不好，查阅相关文献后，发现LNP因为类重叠与数据分布不平衡问题会使LNP构造图时由于选择的邻居不合理而影响分类性能。）本周主要研究了迁移学习的论文，下面讲一下TransferNaiveBayeslearning，首先假设是源数据和目标数据特征属性空间相同的条件下进行。 1 ）传统 NaiveBayes 分类模型如下：（1） 2 ）搜集目标集特征信息与训练集特征信息关系 m为测试集实例数，j为特征索引，定义max和min两个集合。定义h 定义si Si表示，第i个训练集实例与测试集的相似程度，由此相似度来为之后进行加权。 3）利用数据重力为训练集加权（2） wi是第i个实例的权重，利用物理学上重力的定义，定义第i个实例与整个测试集的联系程度。这样先验概率P(c)可改写为：条件概率可改写为：代入1）中的公式即可。最后的算法如下：这个算法有一个问题就是需要对数据进行归一化处理，论文上没有提到。; 5 次阅读|0 个评论

great scholars & professors on RBM: justinzhao 2013-4-6 08:25; (1) Hinton Geoffrey: http://www.cs.toronto.edu/~hinton/ father of RBM, it's him to make the RBM trainable in practice. (2) Andrew Ng: http://ai.stanford.edu/~ang/ Great professor and great speaker. His student helped to popularize the deep belief network (3) Honglak Lee: http://web.eecs.umich.edu/~honglak/ It's him to win the best application paper award of ICML 2009. Currently he works on how to model invariance using RBM. (4) Ruslan Salakhutdinov: http://www.utstat.toronto.edu/~rsalakhu/ He is student of Prof. Hinto,and his major contribution is introduction of deep boltzmann machine. Prof.Hinto coined deep belief network. There two kinds of networks share some similarity, both belonging to deep architectures. (5) Graham Taylor: http://www.uoguelph.ca/~gwtaylor/ He is also the student of Prof. Hinton, and his major contribution is the introduction of gated boltzmann machine, which makes generate gray scale images possible. (6) Hugo Larochelle: http://www.dmi.usherb.ca/~larocheh/index_en.html Again he is Prof. Hinto's student, and his major contribution is applying RBM to model attentionla data. (7) Mark Ranzato: http://www.cs.toronto.edu/~ranzato/ He finished his Ph.D under Prof. Yann Lecun, and spent two-years' postdoc under Prof. Hinton. His contribution is introduction of one duplicate image to model covariance among neighboring pixels. (8) Roland Memisevic: http://www.iro.umontreal.ca/~memisevr/ He modeled temporal data using RBM. Now he found a faculty position in the University of Montreal. (9) Yoshua Bengio: http://www.iro.umontreal.ca/~bengioy/yoshua_en/index.html Great professor. His work of 'Learning Deep Structure for AI' is a must-read. (10) Yann Lecun: http://yann.lecun.com/ He is a legend. He disregarded CV guys.He is super smart, and his work may revolutionize object recognition. (11) Rob Fergus: http://cs.nyu.edu/~fergus/ NYU guy, who rejected when I applied for him. Anyway, a genius, I love him. (12) Kai Yu: http://www.dbs.ifi.lmu.de/~yu_k/ He inspired me why whitening doesn't make data independent. Sincere thanks to him. These professor are those who I am most familiar with. However, with emergence of deep belief network and deep boltzmann machine, there are so many other scholars. You may find a list from 2012 UCLA Deep Learning Summer School: http://www.ipam.ucla.edu/programs/gss2012/; 个人分类: 读书日记|4365 次阅读|0 个评论

learning multiple layers of representation总结（5）: mimizhu2003 2013-3-20 22:54; 具有快速准确推理的非线性模型：介绍受限波尔兹曼机（RBM），并指出它是寻找用于深度、有向产生式模型的高效学习算法的关键。像素具有二值的图像能用RBM的隐层来建模像素间的更高阶的相互关系。为了从训练图像集合中学习一个好的特征检测子的集合，先将像素i和特征检测子j之间的初始权值置为0；然后使用下式重复更新每个权值wij：为学习率；V h 是当特征检测子受训练集中的图像驱动时像素i和特征检测子j 同时出现的频率；V h 当特征检测子受重构后的图像驱动时像素i和特征检测子j同时出现的频率。一个相似的学习规则也能用于偏差。给定一个训练图像，设定每个特征检测子为1的概率为： P(hj=1)= )其中为logistic函数，bj是偏差，Vi是像素i的二值状态；一旦为隐单元选择了二值状态，则通过用如下概率设定每个像素的值为1将产生一个图像的重构： P(vi=1)= 在此，所学习的权值和偏差直接根据上两式决定了条件分布P(h|v)和P(v|h)；非直接的，权值和偏差定义了联合和边缘概率P(v,h)和P(v),P(h)。从联合概率中采样是困难的，但是可以使用轮流的Gibbs采样。如果Gibbs采样时间足够长，网络将达到热平衡。 RBM的两个优点：首先，推理容易；给定可视矢量，在隐矢量上的后验分布因式分解为每个隐单元独立分布的乘积；因此为了从后验中获得采样，只需简单按照相应的概率打开每个隐单元；其次，通过堆叠RBM，很容易一次一层的学习深度有向网络。; 个人分类: 文章读后感|6165 次阅读|0 个评论

learning multiple layers of representation 总结(4): mimizhu2003 2013-3-20 21:33; 关于用于多层产生式模型的近似推理：作者指出：除了考虑产生每个训练样例的log概率，还应考虑推理过程的准确度；如果其他都一样，我们希望近似推理方法尽可能精确；我们会选择一个模型也许它产生数据的概率不是最高，但是却能有对隐表示的更加准确的推理。因此很有意义的是当最大化观察数据的log概率时，对每个训练样例使用不精确推理作为一个惩罚项。这也将导致一个新的目标函数，它更加容易最大化，并且在产生训练数据的log概率上有一个变化的低边界（lower-bound）。对于处理复杂产生式模型的推理问题的一个标准方法就是通过优化一个变化边界来学习。; 个人分类: 文章读后感|5616 次阅读|0 个评论

learning multiple layers of representation 总结(2): mimizhu2003 2013-3-15 21:51; 关于产生式模型：由于模型被很强的约束，所以后验分布能被高效和准确的推理。作者首先介绍了几个产生式模型：因子分析（factor analysis）：有一个单隐层的高斯隐变量，它们对可视变量的关系是线性的，独立高斯噪声被添加给每个可视变量。给定一个可视变量，不可能推出产生它的因子的确切的状态，但是很容易推出因子的高斯后验分布的均值和协方差，这足以使得模型的参数被增强。独立成份分析（Indenpendent component analysis）:是因子分析的进一步泛化，允许非高斯的隐变量，但是通过消除在可见变量上的观察噪声以及约束隐层变量和可见层变量的数目相等而维护了推理的简单。这些约束也使得后验分布最终聚到一个点，因为对于每一个可见变量，仅仅一套隐变量能确切的产生它。混合模型（Mixture model）:每个数据矢量被假定仅仅由混合中的一个成份分布产生，并且它在每个成份分布下很容易计算密度。作者指出，如果因子分析被泛化来允许非高斯隐变量，它能建模低层视觉感知域的发展。然而，如果不强加额外的约束（如在独立成份分析中的约束），它将不容易推理甚至很难表示给定隐变量下的后验分布。这是由于一种称作为explain away的现象造成的。; 7081 次阅读|0 个评论

Hinton文章learning multiple layers of representation总结: mimizhu2003 2013-3-2 22:33; 前两天看的文章，总结如下：在文章的开始，提出的思想是：不同于以往学习一个分类器的目标，而是希望学习一个生成模型（generative model）。作者指出，在一个神经网络中，如果它既包含自底向上的“识别”连接，也包含自顶向下的“生成”连接，就可以使用自底向上的过程来识别数据，而使用自顶向下的过程来产生数据。如果神经元是随机的，重复自顶向下的过程，将产生一个数据矢量的完整的分布。这表明，通过调整自顶向下的连接权值，来使得网络产生训练数据的概率达到最大，则训练数据将被驻留（reside in）在自顶向下的权值中。可以用RBM(受限波尔兹曼机)的隐层来建模二值图像像素间的高阶相互关系。为了从训练图像集中学习一套好的特征检测子，像素i和特征检测子j间的初始权值被置为0，使用两对相互关系之间的差异迭代地更新权值，即像素i和特征检测子j之间同时出现的频率，一是受训练图像驱动时的，二是受重构后的图像驱动时的，相似的学习规则也可用于偏差（bias）。一旦RBM的隐层确定，我们就产生了一个对训练图像的重构。通过组合RBM来学习多层特征。; 个人分类: 文章读后感|6065 次阅读|0 个评论

Machine learning for Hackers笔记（2）: yangleicq 2013-3-1 14:45; I have almost gone through the first chapter. This chapter emphasize on how to use R to clean the data, to sort the data to suit special need and visualize the data. I was following the author step by step in his work. I understand most contents except for code for graphs, those ggplot2 code. So I felt I have met with the bottle neck if I wish to journey on: I need to familiarize myself with the graphical system in MLFH: the ggplot2 package. That's why I decide to deviate a while and figure out how to use ggplot2 at first.; 0 个评论

欢迎报名参加Skype讨论组（Dictionary Learning): 热度 3 oliviazhang 2013-2-25 18:30; 准备近期就开始组织一个研读Dictionary Learning论文的讨论组。每周一次，大概１－２个小时，采用Skype的方式。研读的论文主要侧重Dictionary Learning/Sparse Representation，尤其是supervised dictionary learning等。欢迎报名参加（拟５人）。条件是需要在这个方向上发表至少一篇论文。详细情况请发信给我:zhangzlacademy( a t )gmail.com。请勿发站内信件。; 4381 次阅读|3 个评论

[转载]机器学习前沿热点–Deep Learning: lysciart 2013-2-19 16:44; 引言：神经网络（ N eural N etwork）与支持向量机（ S upport V ector M achines，SVM）是统计学习的代表方法。可以认为神经网络与支持向量机都源自于感知机（Perceptron）。感知机是1958年由Rosenblatt发明的线性分类模型。感知机对线性分类有效，但现实中的分类问题通常是非线性的。神经网络与支持向量机（包含核方法）都是非线性分类模型。1986年，Rummelhart与McClelland发明了神经网络的学习算法 B ack P ropagation。后来，Vapnik等人于1992年提出了支持向量机。神经网络是多层（通常是三层）的非线性模型，支持向量机利用核技巧把非线性问题转换成线性问题。神经网络与支持向量机一直处于“竞争”关系。 Scholkopf是Vapnik的大弟子，支持向量机与核方法研究的领军人物。据Scholkopf说，Vapnik当初发明支持向量机就是想"干掉"神经网络（He wanted to kill Neural Network)。支持向量机确实很有效，一段时间支持向量机一派占了上风。近年来，神经网络一派的大师Hinton又提出了神经网络的Deep Learning算法（2006年），使神经网络的能力大大提高，可与支持向量机一比。 Deep Learning假设神经网络是多层的，首先用Boltzman Machine（非监督学习）学习网络的结构，然后再通过Back Propagation（监督学习）学习网络的权值。关于Deep Learning的命名，Hinton曾开玩笑地说: I want to call SVM shallow learning. (注：shallow 有肤浅的意思)。其实Deep Learning本身的意思是深层学习，因为它假设神经网络有多层。总之，Deep Learning是值得关注的统计学习新算法。深度学习（Deep Learning）是ML研究中的一个新的领域，它被引入到ML中使ML更接近于其原始的目标：AI。查看 a brief introduction to Machine Learning for AI 和 an introduction to Deep Learning algorithms . 深度学习是关于学习多个表示和抽象层次，这些层次帮助解释数据，例如图像，声音和文本。对于更多的关于深度学习算法的知识，可以参看： The monograph or review paper Learning Deep Architectures for AI (Foundations Trends in Machine Learning, 2009). The ICML 2009 Workshop on Learning Feature Hierarchies webpage has a list of references . The LISA public wiki has a reading list and a bibliography . Geoff Hinton has readings from last year’s NIPS tutorial . 这篇综述主要是介绍一些最重要的深度学习算法，并将演示如何用 Theano 来运行它们。 Theano是一个python库，使得写深度学习模型更加容易，同时也给出了一些关于在GPU上训练它们的选项。这个算法的综述有一些先决条件。首先你应该知道一个关于python的知识，并熟悉numpy。由于这个综述是关于如何使用Theano，你应该先阅读 Theano basic tutorial 。一旦你完成这些，阅读我们的 Getting Started 章节---它将介绍概念定义，数据集，和利用随机梯度下降来优化模型的方法。纯有监督学习算法可以按照以下顺序阅读： Logistic Regression - using Theano for something simple Multilayer perceptron - introduction to layers Deep Convolutional Network - a simplified version of LeNet5 无监督和半监督学习算法可以用任意顺序阅读(auto-encoders可以被独立于RBM/DBM地阅读)： Auto Encoders, Denoising Autoencoders - description of autoencoders Stacked Denoising Auto-Encoders - easy steps into unsupervised pre-training for deep nets Restricted Boltzmann Machines - single layer generative RBM model Deep Belief Networks - unsupervised generative pre-training of stacked RBMs followed by supervised fine-tuning 关于mcRBM模型，也有一篇新的关于从能量模型中抽样的综述： HMC Sampling - hybrid (aka Hamiltonian) Monte-Carlo sampling with scan() 上文翻译自 http://deeplearning.net/tutorial/ 深度学习是机器学习研究中的一个新的领域，其动机在于建立、模拟人脑进行分析学习的神经网络，它模仿人脑的机制来解释数据，例如图像，声音和文本。深度学习是无监督学习的一种。深度学习的概念源于人工神经网络的研究。含多隐层的多层感知器就是一种深度学习结构。深度学习通过组合低层特征形成更加抽象的高层表示属性类别或特征，以发现数据的分布式特征表示。深度学习的概念由Hinton等人于2006年提出。基于深信度网(DBN)提出非监督贪心逐层训练算法，为解决深层结构相关的优化难题带来希望，随后提出多层自动编码器深层结构。此外Lecun等人提出的卷积神经网络是第一个真正多层结构学习算法，它利用空间相对关系减少参数数目以提高训练性能。一、Deep Learning的前世今生图灵在 1950 年的论文里，提出图灵试验的设想，即，隔墙对话，你将不知道与你谈话的，是人还是电脑。这无疑给计算机，尤其是人工智能，预设了一个很高的期望值。但是半个世纪过去了，人工智能的进展，远远没有达到图灵试验的标准。这不仅让多年翘首以待的人们，心灰意冷，认为人工智能是忽悠，相关领域是“伪科学”。 2008 年 6 月，“连线”杂志主编，Chris Anderson 发表文章，题目是 “理论的终极，数据的泛滥将让科学方法过时”。并且文中还引述经典著作 “人工智能的现代方法”的合著者，时任 Google 研究总监的 Peter Norvig 的言论，说 “一切模型都是错的。进而言之，抛弃它们，你就会成功” 。言下之意，精巧的算法是无意义的。面对海量数据，即便只用简单的算法，也能得到出色的结果。与其钻研算法，不如研究云计算，处理大数据。如果这番言论，发生在 2006 年以前，可能我不会强力反驳。但是自 2006 年以来，机器学习领域，取得了突破性的进展。图灵试验，至少不是那么可望而不可即了。至于技术手段，不仅仅依赖于云计算对大数据的并行处理能力，而且依赖于算法。这个算法就是，Deep Learning。借助于 Deep Learning 算法，人类终于找到了如何处理 “抽象概念”这个亘古难题的方法。于是学界忙着延揽相关领域的大师。Alex Smola 加盟 CMU，就是这个背景下的插曲。悬念是 Geoffrey Hinton 和 Yoshua Bengio 这两位牛人，最后会加盟哪所大学。 Geoffrey Hinton 曾经转战 Cambridge、CMU，目前任教University of Toronto。相信挖他的名校一定不少。 Yoshua Bengio 经历比较简单，McGill University 获得博士后，去 MIT 追随 Mike Jordan 做博士后。目前任教 University of Montreal。 Deep Learning 引爆的这场革命，不仅学术意义巨大，而且离钱很近，实在太近了。如果把相关技术难题比喻成一座山，那么翻过这座山，山后就是特大露天金矿。技术难题解决以后，剩下的事情，就是动用资本和商业的强力手段，跑马圈地了。于是各大公司重兵集结，虎视眈眈。Google 兵分两路，左路以 Jeff Dean 和 Andrew Ng 为首，重点突破 Deep Learning 等等算法和应用。 Jeff Dean 在 Google 诸位 Fellows 中，名列榜首，GFS 就是他的杰作。Andrew Ng 本科时，就读 CMU，后来去 MIT 追随 Mike Jordan。Mike Jordan 在 MIT 人缘不好，后来愤然出走 UC Berkeley。Andrew Ng 毫不犹豫追随导师，也去了 Berkeley。拿到博士后，任教 Stanford，是 Stanford 新生代教授中的佼佼者，同时兼职 Google。 Google 右路军由 Amit Singhal 领军，目标是构建 Knowledge Graph 基础设施。 1996 年 Amit Singhal 从 Cornell University 拿到博士学位后，去 Bell Lab 工作，2000 年加盟 Google。据说他去 Google 面试时，对 Google 创始人 Sergey Brian 说，“Your engine is excellent, but let me rewirte it!” 换了别人，说不定一个大巴掌就扇过去了。但是 Sergey Brian 大人大量，不仅不怪罪小伙子的轻狂，反而真的让他从事新一代排名系统的研发。Amit Singhal 目前任职 Google 高级副总裁，掌管 Google 最核心的业务，搜索引擎。 Google 把王牌中之王牌，押宝在 Deep Learning 和 Knowledge Graph 上，目的是更快更大地夺取大数据革命的胜利果实。 Reference Turing Test. http://en.wikipedia.org/wiki/Turing_test The End of Theory: The Data Deluge Makes the Scientific Method Obsolete http://www.wired.com/science/discoveries/magazine/16-07/pb_theory Introduction to Deep Learning. http://en.wikipedia.org/wiki/Deep_learning Interview with Amit Singhal, Google Fellow. http://searchengineland.com/interview-with-amit-singhal-google-fellow-121342 原文链接： http://blog.sina.com.cn/s/blog_46d0a3930101fswl.html 作者微博： http://weibo.com/kandeng#1360336038853 二、Deep Learning的基本思想和方法实际生活中，人们为了解决一个问题，如对象的分类（对象可是是文档、图像等），首先必须做的事情是如何来表达一个对象，即必须抽取一些特征来表示一个对象，如文本的处理中，常常用词集合来表示一个文档，或把文档表示在向量空间中（称为VSM模型），然后才能提出不同的分类算法来进行分类；又如在图像处理中，我们可以用像素集合来表示一个图像，后来人们提出了新的特征表示，如SIFT，这种特征在很多图像处理的应用中表现非常良好，特征选取得好坏对最终结果的影响非常巨大。因此，选取什么特征对于解决一个实际问题非常的重要。然而，手工地选取特征是一件非常费力、启发式的方法，能不能选取好很大程度上靠经验和运气；既然手工选取特征不太好，那么能不能自动地学习一些特征呢？答案是能！Deep Learning就是用来干这个事情的，看它的一个别名Unsupervised Feature Learning，就可以顾名思义了，Unsupervised的意思就是不要人参与特征的选取过程。因此，自动地学习特征的方法，统称为Deep Learning。 1）Deep Learning的基本思想假设我们有一个系统S，它有n层（S1,…Sn），它的输入是I，输出是O，形象地表示为： I =S1=S2=…..=Sn = O，如果输出O等于输入I，即输入I经过这个系统变化之后没有任何的信息损失，保持了不变，这意味着输入I经过每一层Si都没有任何的信息损失，即在任何一层Si，它都是原有信息（即输入I）的另外一种表示。现在回到我们的主题Deep Learning，我们需要自动地学习特征，假设我们有一堆输入I（如一堆图像或者文本），假设我们设计了一个系统S（有n层），我们通过调整系统中参数，使得它的输出仍然是输入I，那么我们就可以自动地获取得到输入I的一系列层次特征，即S1，…, Sn。另外，前面是假设输出严格地等于输入，这个限制太严格，我们可以略微地放松这个限制，例如我们只要使得输入与输出的差别尽可能地小即可，这个放松会导致另外一类不同的Deep Learning方法。上述就是Deep Learning的基本思想。 2）Deep Learning的常用方法 a). AutoEncoder 最简单的一种方法是利用人工神经网络的特点，人工神经网络（ANN）本身就是具有层次结构的系统，如果给定一个神经网络，我们假设其输出与输入是相同的，然后训练调整其参数，得到每一层中的权重，自然地，我们就得到了输入I的几种不同表示（每一层代表一种表示），这些表示就是特征，在研究中可以发现，如果在原有的特征中加入这些自动学习得到的特征可以大大提高精确度，甚至在分类问题中比目前最好的分类算法效果还要好！这种方法称为AutoEncoder。当然，我们还可以继续加上一些约束条件得到新的Deep Learning方法，如如果在AutoEncoder的基础上加上L1的Regularity限制（L1主要是约束每一层中的节点中大部分都要为0，只有少数不为0，这就是Sparse名字的来源），我们就可以得到Sparse AutoEncoder方法。 b). Sparse Coding 如果我们把输出必须和输入相等的限制放松，同时利用线性代数中基的概念，即O = w1*B1 + W2*B2+….+ Wn*Bn， Bi是基，Wi是系数，我们可以得到这样一个优化问题： Min |I – O| 通过求解这个最优化式子，我们可以求得系数Wi和基Bi，这些系数和基础就是输入的另外一种近似表达，因此，它们可以特征来表达输入I，这个过程也是自动学习得到的。如果我们在上述式子上加上L1的Regularity限制，得到： Min |I – O| + u*(|W1| + |W2| + … + |Wn|) 这种方法被称为Sparse Coding。 c) Restrict Boltzmann Machine (RBM) 假设有一个二部图，每一层的节点之间没有链接，一层是可视层，即输入数据层（ v )，一层是隐藏层( h )，如果假设所有的节点都是二值变量节点（只能取0或者1值），同时假设全概率分布p( v, h )满足Boltzmann 分布，我们称这个模型是Restrict Boltzmann Machine (RBM)。下面我们来看看为什么它是Deep Learning方法。首先，这个模型因为是二部图，所以在已知 v 的情况下，所有的隐藏节点之间是条件独立的，即p( h | v ) =p( h 1| v )…..p( h n| v )。同理，在已知隐藏层 h 的情况下，所有的可视节点都是条件独立的，同时又由于所有的v和h满足Boltzmann 分布，因此，当输入 v 的时候，通过p( h | v ) 可以得到隐藏层 h ，而得到隐藏层 h 之后，通过p(v|h) 又能得到可视层，通过调整参数，我们就是要使得从隐藏层得到的可视层 v1 与原来的可视层 v 如果一样，那么得到的隐藏层就是可视层另外一种表达，因此隐藏层可以作为可视层输入数据的特征，所以它就是一种Deep Learning方法。如果，我们把隐藏层的层数增加，我们可以得到Deep Boltzmann Machine (DBM)；如果我们在靠近可视层的部分使用贝叶斯信念网络（即有向图模型，当然这里依然限制层中节点之间没有链接），而在最远离可视层的部分使用Restrict Boltzmann Machine，我们可以得到Deep Belief Net （DBN）。当然，还有其它的一些Deep Learning 方法，在这里就不叙述了。总之，Deep Learning能够自动地学习出数据的另外一种表示方法，这种表示可以作为特征加入原有问题的特征集合中，从而可以提高学习方法的效果，是目前业界的研究热点。原文链接： http://blog.csdn.net/xianlingmao/article/details/8478562 三、深度学习(Deep Learning)算法简介查看最新论文 Yoshua Bengio, Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, 2(1), 2009 深度(Depth) 从一个输入中产生一个输出所涉及的计算可以通过一个流向图(flow graph)来表示：流向图是一种能够表示计算的图，在这种图中每一个节点表示一个基本的计算并且一个计算的值(计算的结果被应用到这个节点的孩子节点的值)。考虑这样一个计算集合，它可以被允许在每一个节点和可能的图结构中，并定义了一个函数族。输入节点没有孩子，输出节点没有父亲。对于表达的流向图，可以通过一个有两个输入节点和的图表示，其中一个节点通过使用和作为输入(例如作为孩子)来表示；一个节点仅使用作为输入来表示平方；一个节点使用和作为输入来表示加法项(其值为 )；最后一个输出节点利用一个单独的来自于加法节点的输入计算SIN。这种流向图的一个特别属性是深度(depth)：从一个输入到一个输出的最长路径的长度。传统的前馈神经网络能够被看做拥有等于层数的深度(比如对于输出层为隐层数加1)。SVMs有深度2(一个对应于核输出或者特征空间，另一个对应于所产生输出的线性混合)。深度架构的动机学习基于深度架构的学习算法的主要动机是：不充分的深度是有害的；大脑有一个深度架构；认知过程是深度的；不充分的深度是有害的在许多情形中深度2就足够(比如logical gates, formal neurons, sigmoid-neurons, Radial Basis Function units like in SVMs)表示任何一个带有给定目标精度的函数。但是其代价是：图中所需要的节点数(比如计算和参数数量)可能变的非常大。理论结果证实那些事实上所需要的节点数随着输入的大小指数增长的函数族是存在的。这一点已经在logical gates, formal neurons 和rbf单元中得到证实。在后者中Hastad说明了但深度是d时，函数族可以被有效地(紧地)使用O(n)个节点(对于n个输入)来表示，但是如果深度被限制为d-1，则需要指数数量的节点数O(2^n)。我们可以将深度架构看做一种因子分解。大部分随机选择的函数不能被有效地表示，无论是用深地或者浅的架构。但是许多能够有效地被深度架构表示的却不能被用浅的架构高效表示(see the polynomials example in the Bengio survey paper )。一个紧的和深度的表示的存在意味着在潜在的可被表示的函数中存在某种结构。如果不存在任何结构，那将不可能很好地泛化。大脑有一个深度架构例如，视觉皮质得到了很好的研究，并显示出一系列的区域，在每一个这种区域中包含一个输入的表示和从一个到另一个的信号流(这里忽略了在一些层次并行路径上的关联，因此更复杂)。这个特征层次的每一层表示在一个不同的抽象层上的输入，并在层次的更上层有着更多的抽象特征，他们根据低层特征定义。需要注意的是大脑中的表示是在中间紧密分布并且纯局部：他们是稀疏的：1%的神经元是同时活动的。给定大量的神经元，任然有一个非常高效地(指数级高效)表示。认知过程看起来是深度的人类层次化地组织思想和概念；人类首先学习简单的概念，然后用他们去表示更抽象的；工程师将任务分解成多个抽象层次去处理；学习/发现这些概念(知识工程由于没有反省而失败？)是很美好的。对语言可表达的概念的反省也建议我们一个稀疏的表示：仅所有可能单词/概念中的一个小的部分是可被应用到一个特别的输入(一个视觉场景)。学习深度架构的突破 2006年前，尝试训练深度架构都失败了：训练一个深度有监督前馈神经网络趋向于产生坏的结果(同时在训练和测试误差中)，然后将其变浅为1(1或者2个隐层)。 2006年的3篇论文改变了这种状况，由Hinton的革命性的在深度信念网(Deep Belief Networks, DBNs)上的工作所引领： Hinton, G. E., Osindero, S. and Teh, Y., A fast learning algorithm for deep belief nets .Neural Computation 18:1527-1554, 2006 Yoshua Bengio, Pascal Lamblin, Dan Popovici and Hugo Larochelle, Greedy Layer-Wise Training of Deep Networks , in J. Platt et al. (Eds), Advances in Neural Information Processing Systems 19 (NIPS 2006), pp. 153-160, MIT Press, 2007 Marc’Aurelio Ranzato, Christopher Poultney, Sumit Chopra and Yann LeCun Efficient Learning of Sparse Representations with an Energy-Based Model , in J. Platt et al. (Eds), Advances in Neural Information Processing Systems (NIPS 2006), MIT Press, 2007 在这三篇论文中以下主要原理被发现：表示的无监督学习被用于(预)训练每一层；在一个时间里的一个层次的无监督训练，接着之前训练的层次。在每一层学习到的表示作为下一层的输入；用无监督训练来调整所有层(加上一个或者更多的用于产生预测的附加层)； DBNs在每一层中利用用于表示的无监督学习RBMs。Bengio et al paper 探讨和对比了RBMs和auto-encoders(通过一个表示的瓶颈内在层预测输入的神经网络)。Ranzato et al paper在一个convolutional架构的上下文中使用稀疏auto-encoders(类似于稀疏编码)。Auto-encoders和convolutional架构将在以后的课程中讲解。从2006年以来，大量的关于深度学习的论文被发表，一些探讨了其他原理来引导中间表示的训练，查看 Learning Deep Architectures for AI 原文链接： http://www.cnblogs.com/ysjxw/archive/2011/10/08/2201782.html 四、拓展学习推荐 Deep Learning 经典阅读材料： The monograph or review paper Learning Deep Architectures for AI (Foundations Trends in Machine Learning, 2009). The ICML 2009 Workshop on Learning Feature Hierarchies webpage has a list of references . The LISA public wiki has a reading list and a bibliography . Geoff Hinton has readings from last year’s NIPS tutorial . Deep Learning工具—— Theano ： Theano 是deep learning的Python库，要求首先熟悉Python语言和numpy，建议读者先看 Theano basic tutorial ，然后按照 Getting Started 下载相关数据并用gradient descent的方法进行学习。学习了Theano的基本方法后，可以练习写以下几个算法：有监督学习： Logistic Regression - using Theano for something simple Multilayer perceptron - introduction to layers Deep Convolutional Network - a simplified version of LeNet5 无监督学习： Auto Encoders, Denoising Autoencoders - description of autoencoders Stacked Denoising Auto-Encoders - easy steps into unsupervised pre-training for deep nets Restricted Boltzmann Machines - single layer generative RBM model Deep Belief Networks - unsupervised generative pre-training of stacked RBMs followed by supervised fine-tuning 最后呢，推荐给大家基本ML的书籍： Chris Bishop, “Pattern Recognition and Machine Learning”, 2007 Simon Haykin, “Neural Networks: a Comprehensive Foundation”, 2009 (3rd edition) Richard O. Duda, Peter E. Hart and David G. Stork, “Pattern Classification”, 2001 (2nd edition) 原文链接： http://blog.csdn.net/abcjennifer/article/details/7826917 五、应用实例 1、计算机视觉 ImageNet Classification with Deep Convolutional Neural Networks , Alex Krizhevsky, Ilya Sutskever, Geoffrey E Hinton, NIPS 2012. Learning Hierarchical Features for Scene Labeling , Clement Farabet, Camille Couprie, Laurent Najman and Yann LeCun, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013. Learning Convolutional Feature Hierachies for Visual Recognition , Koray Kavukcuoglu, Pierre Sermanet, Y-Lan Boureau, Karol Gregor, Michaeuml;l Mathieu and Yann LeCun, Advances in Neural Information Processing Systems (NIPS 2010), 23, 2010. 2、语音识别微软研究人员通过与hintion合作，首先将RBM和DBN引入到语音识别声学模型训练中，并且在大词汇量语音识别系统中获得巨大成功，使得语音识别的错误率相对减低30%。但是，DNN还没有有效的并行快速算法，目前，很多研究机构都是在利用大规模数据语料通过GPU平台提高DNN声学模型的训练效率。在国际上，IBM、google等公司都快速进行了DNN语音识别的研究，并且速度飞快。国内方面，科大讯飞、百度、中科院自动化所等公司或研究单位，也在进行深度学习在语音识别上的研究。 3、自然语言处理等其他领域很多机构在开展研究，但目前深度学习在自然语言处理方面还没有产生系统性的突破。六、参考链接： 1. http://baike.baidu.com/view/9964 ... enter=deep+learning 2. http://www.cnblogs.com/ysjxw/archive/2011/10/08/2201819.html 3. http://blog.csdn.net/abcjennifer/article/details/7826917 本文转载来自 http://elevencitys.com/?p=1854 Stanford大学的Deep Learning 和 tutorial： http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial; 个人分类: 机器学习|21732 次阅读|0 个评论

小样本数据下传统贝叶斯网络学习算法的不足: 热度 1 missjia2005 2013-2-18 20:17; 现有的贝叶斯网络学习算法侧重于对数据本身的分析，如网络结构学习中的数据与模型匹配度分析(Score-based method)，结点间依赖关系分析(C onstraint -based method)，以及两种分析的混合。然而在小样本数据下，由于样本有限，这些分析往往不具有足够的统计显著性，因此传统的结构学习算法并不适用。（To be continue） This work has already been published in IJAR: DOI: http://dx.doi.org/10.1016/j.ijar.2014.02.008 If you have any questions, please do not hesitate to contact me. Cheers,; 个人分类: Research notes|1280 次阅读|2 个评论

sparse bayesian learning: yuaihua 2013-1-9 15:59; 稀疏贝叶斯学习算法最近看了张智林的博客稀疏贝叶斯学习文章关键总结如下。稀疏贝叶斯学习（Sparse Bayesian Learning, SBL）最初作为一种机器学习算法由Tipping于2001年前后提出，随后被引入到稀疏信号恢复/压缩感知领域 . 压缩感知的基本模型可描述为: y=Ax+v 其中A为N×M的感知矩阵，Y为N×1维压缩信号，x为M维待求的解向量，v为未知的噪声向量. 为求解x，SBL假设x中的每个元素都服从一个参数化的均值为０,方差为γ的高斯分布。 v是N(0，σ2）分布。可以得到p(y|σ2,γ)= p(x|γ)= 通过最大期望E(y,x|σ2,γ) 得到参数更新公式。在矩阵逆运算利用伪逆近似求解。; 个人分类: 稀疏贝叶斯学习|7861 次阅读|0 个评论

a few useful things to know about machine learning: justinzhao 2012-12-15 17:51; (1) A few useful things to know about machine learning (2) the attached paper is translation of this paper; 个人分类: 读书日记|3928 次阅读|0 个评论

key works on manifold learning & dimension reduction: 热度 1 justinzhao 2012-12-15 15:55; (1) LLE (Prof.Roweis, Unfortunately, he passes away): Nonlinear dimensionality reduction by locally linear embedding (2) ISOMAP: A global geometric framework for nonlinear dimensionality reduction (3) eigenmap: Laplacian eigenmaps for dimensionality reduction and data representation (4) neural network (Prof. Geoffrey Hinton): Reducing the dimensionality of data with neural networks (5) survey paper (Prof. Yoshua Bengio): Out-of-sample extensions for lle, isomap , mds, eigenmaps, and spectral clustering; 个人分类: 读书日记|3130 次阅读|2 个评论

great figures in deep learning: justinzhao 2012-12-15 15:21; http://www.smarttypes.org/blog/deep_learning; 个人分类: 读书日记|1949 次阅读|0 个评论

Online Feature Selection with Streaming Features: jingyanwang 2012-12-12 09:35; Session: Online Feature Selection with Streaming Features Reading: Online Feature Selection with Streaming Features Reviewing: Affinity Learning with Diffusion on Tensor Product Graph Writing: Joint-ViVo – Revs 2 Travel to shanghai: 5038 CNY; 3 次阅读|0 个评论

[CV源码分享] OpenPR开源代码项目: wuhuaiyu 2012-12-3 17:56; 欢迎大家访问OpenPR主页： http://www.openpr.org.cn ，并提出意见和建议！同时，OpenPR也期待您分享您的代码！ OpenPR, stands for Open Pattern Recognition project and is intended to be an open source platform for sharing algorithms of image processing, computer vision, natural language processing, pattern recognition, machine learning and the related fields. Code released by OpenPR is under BSD license, and can be freely used for education and academic research. OpenPR is currently supported by the National Laboratory of Pattern Recognition, Institution of Automation, Chinese Academy of Sciences. Thresholding program This is demo program on global thresholding for image of bright small objects, such as aircrafts in airports. the program include four method, otsu,2D-Tsallis,PSSIM, Smoothnees Method. Authorschen xueyun E-mail xueyun.chen@nlpr.ia.ac.cn Principal Component Analysis Based on Nonparametric Max... In this paper, we propose an improved principal component analysis based on maximum entropy (MaxEnt) preservation, called MaxEnt-PCA, which is derived from a Parzen window estimation of Renyi’s quadratic entropy. Instead of minimizing the reconstruction ... AuthorsRan He E-mail rhe@nlpr.ia.ac.cn Metropolis–Hastings algorithm Metropolis-Hastings alogrithm is a Markov chain Monte Carlo method for obtaining a sequence of random samples from a probability distribution for which direct sampling is difficult. Thi sequence can be used to approximate the distribution. AuthorsGong Xing E-mail xgong@nlpr.ia.ac.cn Tagssampling, distribution Maximum Correntropy Criterion for Robust Face Recogniti... This code is developed based on Uriel Roque's active set algorithm for the linear least squares problem with nonnegative variables in: Portugal, L.; Judice, J.; and Vicente, L. 1994. A comparison of block pivoting and interior-point algorithms for linear ... AuthorsRan HE E-mail rhe@nlpr.ia.ac.cn Tagspattern recognition Naive Bayes EM Algorithm OpenPR-NBEM is an C++ implementation of Naive Bayes Classifier, which is a well-known generative classification algorithm for the application such as text classification. The Naive Bayes algorithm requires the probabilistic distribution to be discrete. Op ... AuthorsRui XIA E-mail rxia@nlpr.ia.ac.cn Tagspattern recognition, natural language processing, text classification Local Binary Pattern This is a class to calculate histogram of LBP (local binary patterns) from an input image, histograms of LBP-TOP (local binary patterns on three orthogonal planes) from an image sequence, histogram of the rotation invariant VLBP (volume local binary patte ... AuthorsJia WU E-mail jwu@nlpr.ia.ac.cn Tagscomputer vision, image processing, pattern recognition Two-stage Sparse Representation This program implements a novel robust sparse representation method, called the two-stage sparse representation (TSR), for robust recognition on a large-scale database. Based on the divide and conquer strategy, TSR divides the procedure of robust recogni ... AuthorsRan HE E-mail rhe@dlut.edu.cn Tagspattern recognition CMatrix Class It's a C++ program for symmetric matrix diagonalization, inversion and principal component anlaysis(PCA). The matrix diagonalization function can also be applied to the computation of singular value decomposition (SVD), Fisher linear discriminant analysis ... AuthorsChenglin LIU E-mail liucl@nlpr.ia.ac.cn Tagspattern recognition P3P(Perspective 3-Points) Solver This is a implementation of the classic P3P(Perspective 3-Points) algorithm problem solution in the Ransac paper "M. A. Fischler, R. C. Bolles. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartogr ... AuthorsZhaopeng GU E-mail zpgu@nlpr.ia.ac.cn TagsComputer Vision, PNP, Extrinsic Calibration Linear Discriminant Function Classifier This program is a C++ implementation of Linear Discriminant Function Classifier. Discriminant functions such as perceptron criterion, cross entropy (CE) criterion, and least mean square (LMS) criterion (all for multi-class classification problems) are sup ... AuthorsRui Xia E-mail rxia@nlpr.ia.ac.cn Tagslinear classifier, discriminant function Naive Bayes Classifier This program is a C++ implementation of Naive Bayes Classifier, which is a well-known generative classification algorithm for the application such as text classification. The Naive Bayes algorithm requires the probabilistic distribution to be discrete. Th ... AuthorsRui Xia E-mail rxia@nlpr.ia.ac.cn Tagspattern recognition, natural language processing, text classification OpenCV Based Extended Kalman Filter Frame A simple and clear OpenCV based extended Kalman filter(EKF) abstract class implementation,absolutely following standard EKF equations. Special thanks to the open source project of KFilter1.3. It is easy to inherit it to implement a variable state and me ... AuthorsZhaopeng GU E-mail zpgu@nlpr.ia.ac.cn TagsComputer Vision, EKF, INS Supervised Latent Semantic Indexing Supervised Latent Semantic Indexing(SLSI) is an supervised feature transformation method. The algorithms in this package are based on the iterative algorithm of Latent Semantic Indexing. AuthorsMingbo Wang E-mail mb.wang@nlpr.ia.ac.cn SIFT Extractor This program is used to extract SIFT points from an image. AuthorsZhenhui Xu E-mail zhxu@nlpr.ia.ac.cn Tagscomputer vision OpenPR-0.0.2 Scilab Pattern Recognition Toolbox is a toolbox developed for Scilab software, and is used in pattern recognition, machine learning and the related field. It is developed for the purpose of education and research. AuthorsJia Wu E-mail jiawu83@gmail.com Tagspattern recognition Layer-Based Dependency Parser LDPar is an efficient data-driven dependency parser. You can train your own parsing model on treebank data and parse new data using the induced model. AuthorsPing Jian E-mail pjian@nlpr.ia.ac.cn Tagsnatural language processing Probabilistic Latent Semantic Indexing AuthorsMingbo Wang E-mail mbwang@nlpr.ia.ac.cn Calculate Normalized Information Measures The toolbox is to calculate normalized information measures from a given m by (m+1) confusion matrix for objective evaluations of an abstaining classifier. It includes total 24 normalized information measures based on three groups of definitions, that is, ... AuthorsBaogang Hu E-mail hubg@nlpr.ia.ac.cn Quasi-Dense Matching This program is used to find point matches between two images. The procedure can be divided into two parts: 1) use SIFT matching algorithm to find sparse point matches between two images. 2) use "quasi-dense propagation" algorithm to get "quasi-dense" p ... AuthorsZhenhui Xu E-mail zhxu@nlpr.ia.ac.cn Agglomerative Mean-Shift Clustering Mean-Shift (MS) is a powerful non-parametric clustering method. Although good accuracy can be achieved, its computational cost is particularly expensive even on moderate data sets. For the purpose of algorithm speedup, an agglomerative MS clustering metho ... AuthorsXiao-Tong Yuan E-mail xtyuan@nlpr.ia.ac.cn Histograms of Oriented Gradients (HOG) Feature Extracti... This program is used to extract HOG(histograms of oriented gradients) features from images. The integral histogram is used for fast histogram extraction. Both APIs and binary utility are provided. AuthorsLiang-Liang He E-mail llhe@nlpr.ia.ac.cn 相关PPT下载详见 “视觉计算研究论坛”「SIGVC BBS」： http://www.sigvc.org/bbs/thread-272-1-1.html; 5166 次阅读|0 个评论

[CV论文读讲] [2012.9.19]Paper reading on transfer learning: wuhuaiyu 2012-12-2 17:03; 讲者：杨杰报告时间:2012.9.19 文章信息:paper#1 Kaizhu Huang, Zenglin Xu,Irwin King, Michael R.Lyu, Colin Campbell: Supervised Self-taught Learning: Actively transferring knowledge from unlabeled data. IJCNN 2009 paper#2 X. Yu and Y. Aloimonos,Attribute-based Transfer Learning for Object Categorization with Zero or One Training Example, ECCV2010 文章简介: Paper#1 Problem: Self-taught Learning学出来的基分类效果有限 Motivation:Self-taught Learning基础上作了一个监督化形式的模型，将self taught learning三步整合到一起。上一次组会也提到通过无标签学出来的基对于分类不一定是帮助的，所以huang这篇文章正是基于这点，将监督标签信息加进来用来指导基的学习，这样改进分类性能。 Model:具体体现在目标函数，将基的组合对原数据拟合以及SVM分类面的学习放在一起。在模型的优化方面，由于问题是非凸的，但是如果固定一部分参数，则可以将问题转化为一个迭代优化的两个子问题，并且这两个子问题是凸问题 Paper#2 Problem:这篇文章要解决的问题是zero/one shot learning problem，即是待分类标签的数据在训练数据中并没有同标签的数据，或者这样的数据很少。作者是在Animals with Attributes(AwA)作了一个基于属性的方法，这里的属性是数据集上已经人工提取出来的一些较为直观的特点，比如马是四条腿，四条腿就是一个属性，所有数据共用一个属性集。 Motivation:作者基于此并且结合author topic model提出了attribute model，可以去学习attribute和topic之间的概率关联关系，从而将此作为先验，或者利用学得的参数来合成一些人工数据，来模拟待分类标签数据。相关PPT下载详见 “视觉计算研究论坛”「SIGVC BBS」： http://www.sigvc.org/bbs/thread-107-1-1.html; 3513 次阅读|0 个评论

[CV论文读讲] transfer learning论文列表: 热度 2 wuhuaiyu 2012-12-2 10:22; 经过咨询杨强组学生以及跟刘绍国和汪凌峰师兄讨论，冒昧对下一轮paper reading（transfer learning）提几点建议。 1. 综述：第一次由汪凌峰师兄讲解A Survey on Transfer Learning，综述性质的文章，给我们科普性的介绍TL。文章写的很好，不长，还是综述性质的，所以推荐大家在讲解之前都仔细阅读，上路之前有一个方向。 2. 理论部分：建议选取07-08年的ICML和NIPS上的文章；推荐Daiwen Yuan的文章；不推荐师弟们讲述。 3. 应用部分：根据各自的专业方向选择，不过为了考虑整体利益，建议选取三大顶级会议（ICCV，CVPR，ECCV）上的应用文章，尽量选择偏图像应用的、有代码的文章。 a. 论文列表。list地址： http://www1.i2r.astar.edu.sg/~jspan/conferenceTL.html 。另外我和刘绍国师兄会尽快选取一些文章分类放在服务器上面，还有每篇文章的简要介绍以及推荐程度。假如有人选定哪篇文章，请在列表中该文章title后面加上你的大名，比方A Survey on Transfer Learning（汪凌峰）。 b. 代码。code地址： http://www.cse.ust.hk/TL/index.html 。推荐TL matlab工具包： http://multitask.cs.berkeley.edu/ 。（@师弟：下面一些观点是我目前的个人经验，仅限于新手，可能是片面极端的：我们不是搞专业算法的，所以我们了解TL或者说Sparse只有两个目的：(1)他们能在哪些情况中work的比较好，为什么能work好，这样我们在碰到类似场景时能去借用和改用；（2）有哪些主要算法，有没有相应代码可以快速实现，这样我们动手就快一点。通常来说，我们理论基础不够深厚的情况下一般是尝试多种算法跑出好结果才会去思考为什么这种算法好；另外我一般会思考TL或者Sparse能否有助于我解决自己的问题，但很可能有想法也不去实现，不去实现的原因是想法实现时间成本较大，所以搁置了。建议一开始就去熟悉TL matlab工具包，因为也许以后能记住的就只有算法的思想和怎么跑这些代码了。）相关PPT下载详见 “视觉计算研究论坛”「SIGVC BBS」： http://www.sigvc.org/bbs/thread-58-1-1.html; 6413 次阅读|2 个评论

hysplit learning: cwjwang 2012-11-28 09:45; http://www.meteozone.com/tutorial/html/traj_freq.html; 3151 次阅读|0 个评论

[转载]机器学习几本书：my list of cool machine learning books: jiandanjinxin 2012-11-27 16:09; http://matpalm.com/blog/2010/08/06/my-list-of-cool-machine-learning-books/ 0) " Machine Learning: a Probabilistic Perspective " byKevin Patrick Murphy Now available amazon.com and other vendors. Electronic versions (e.g., for Kindle) will be available later in the Fall. Table of contents Chapter 1 (Introduction) Information for instructors from MIT Press . If you are an official instructor, you can request an e-copy, which can help you decide if the book is suitable for your class. You can also request the solutions manual. Errata Matlab software All the figures , together with matlab code to generate them 1) "programming collective intelligence" by toby segaran if you know nothing about machine learning and haven't done maths since high school then this is the book for you. it's a fantastically accesible introduction to the field. includes almost no theory and explains algorithms using actual python implementations. 2) "data mining" by witten and frank this book covers quite a bit more than programming c.i. while still being extremely practical (ie very few formula). about a fifth of the book is dedicated to weka, a machine learning workbench which was written by the authors. apart from the weka section this book has no code. i made a little screencast on weka awhile back if you're after a summary. 3) "introduction to data mining" by tan, steinbach and kumar covers almost the same material as the witten/frank text but delves a little bit deeper and with more rigour. includes no code (none of the books do from now on) with algorithms described by formula. has a number of appendices on linear algebra, probability, statistics etc so that you can read up if you're a bit rusty or new to the fields (the witten/frank text lack these). some people might argue having both of these books is a waste since they cover so much of the same ground but i've always found multiple explanations from different authors to be a great way to help understand a topic. i read the witten/frank text first and am glad i did but if i could only keep one i'd keep this one. intermission at this point you've probably got enough mental firepower to handle some of the uni level machine learning course notes that are floating about online. if you're keen to get a better foundation of the maths side of things it'd be worth working through andrew ng's lecture series on machine learning. (20 hours of a second year stanford course on machine learning) i also found andrew moore's lecture slides really great. (they do though require a reasonable understanding of the basics) 4) "foundations of statistical natural language processing" by manning and schutze not a machine learning book as such but great for learning to deal with one of the most common types of data around; text. since most of machine learning theory is about maths (ie numbers) this is awesome in helping to understanding how to deal with text in a mathematical context. 5) "introduction to machine learning" by ethem alpaydin covers generally the same sort of topics as the data mining books but with much more rigour and theory (derivations, proofs, etc). i think this is a good thing though since understanding how things work at a low level gives you the ability to tweak and modify as required. loads more formulas but again with appendixs that introduce the basics in enough detail to get by. 6) "all of statistics" by larry wasserman by this stage you'll probably have an appreciation of how important statistics is for this domain and it might be worth foccussing on it for a bit. personally i found this book to be a great read and though i've only read certain sections in depth i'm looking forward to when i get a chance to work through it cover to cover 7) "the elements of statistical learning" by hastie, tibshirani and friedman. with a bit more stats under your belt you might have a chance of getting through this one; the most complex of the lot. this book is absolutely beautifully presented and now that it's FREE to download you've got no reason not to have a crack at it. a remarkable piece of work and one i've yet to get through fully cover to cover, it's quite hardcore and right on the border of my level of understanding ( which makes it perfect for me :P ) ps. books i haven't read that are in the mail "machine learning" by tom mitchell have been wanting to read this one for awhile, i'm a big fan of tom mitchell , but couldn't justify the cost however just found out the other day the paperback is a third of the price of the hardback i was looking at!! the book's in the mail "pattern recognition and machine learning" by chris bishop all of a sudden seemed like everyone was reading this but me so it was time to jump on the bandwagon 《模式分类》如果是计算机、物理背景的，先看 Bishop的Machine Learning and Pattern Recognition ，然后看T. Hastie的 Elements of Statistical Learning 如果是数学、统计背景的，调转个顺序就可以了。Bishop的那本太厚推荐Jordan的统计学习的课件，全面，难度适中 http://www.cs.berkeley.edu/~jordan/courses/281B-spring04/ 如果实在对英文没兴趣，可以看看李航的那本统计学习，比较基础如果仅仅想看看这方面的应用情景，推荐吴军的数学之美以上内容转自 http://www.zhizhihu.com/html/y2012/4019.html; 2974 次阅读|0 个评论

[转载]Reinventing the Classroom: whyhoo 2012-11-26 12:37; At a time of rising interest in new forms of teaching to effect greater learning, Harvard Magazine asked Harry Lewis, Gordon McKay professor of computer science , to recount how he rethought his—and his students’—roles in creating a new course, and what he learned from teaching it. ~The Editors Computer science is booming at Harvard (and across the country). The number of concentrators has nearly tripled in five years. For decades, most of our students have been converts; barely a third of recent CS graduates intended to study the field when they applied to college. But sometime in 2010, we realized that this boom was different from those of earlier years, when many of our students came to computer science from mathematics, physics, and engineering. Today many seem to be coming from the life sciences, social sciences, and humanities. Never having studied formal mathematics, these students were struggling in our mathematically demanding courses. Their calculus and linear algebra courses did not teach them the math that is used to reason about computer programs: logic, proofs, probability, and counting (figuring out how many poker hands have two pairs, for example). Without these tools they could become good computer programmers, but they couldn’t become computer scientists at all. It was time to create a new course to fill in the background. I’ve developed big courses like CS 50, our introduction to the field. Courses for specialists, like CS 121 (“Introduction to the Theory of Computation”) and CS 124 (“Data Structures and Algorithms”), the theory courses in the CS concentration. A lecture course mixing math and public policy—my “Bits” course, part of the Core and General Education curricula. Even a freshman seminar for 12, outside my professional expertise: on amateur athletics—really a social history of sports in America, heavily laced with Harvardiana. So I figured I knew how to create courses. They always come out well—at least by the standard that I can’t possibly do a worse job than the previous instructor! This time was different. Figuring out the right topics was the easy part. I polled faculty about their upper-level courses and asked them what math they wished their students knew. I looked at the websites of courses at competing institutions, and called some former students who teach those courses to get the real story. (College courses are no more likely to work as advertised than anything else described in a catalog.) Thus was born CS 20, “Discrete Mathematics for Computer Science.” But once I knew what I needed to teach, I started worrying. Every good course I have ever taught (or taken, for that matter) had a narrative. CS 121 is the story of computability, a century-long intellectual history as well as a beautiful suite of mathematical results. “Bits” is the drama of information freedom, the liberation of ideas from the physical media used to store and convey them (see “Study Card” ). CS 20, on the other hand, risked being more like therapy—so many treatments of this followed by so many doses of that, all nauseating. “It’s good for you” is not a winning premise for a course. And what if students did not show up for class? I had no desire to develop another set of finely crafted slides to be delivered to another near-empty lecture hall. I’ll accept the blame for the declining attendance. My classes are generally video-recorded for an Extension School audience. I believe that if the videos exist, then all my students should have them—and they should have my handouts too. In fact, I think I should share as much of these materials with the world as Harvard’s business interests permit. I could think of ways to force students to show up (not posting my slide decks, or administering unannounced quizzes, for example). But those would be tricks, devices to evade the truth: the digital explosion has changed higher education. In the digital world, there is no longer any reason to use class time to transfer the notes of the instructor to the notes of the student (without passing through the brain of either, as Mark Twain quipped). Instead, I should use the classroom differently. So I decided to change the bargain with my students. Attendance would be mandatory. Homework would be daily. There would be a reading assignment for every class. But when they got to class, they would talk to each other instead of listening to me. In class, I would become a coach helping students practice rather than an oracle spouting truths. We would “flip the classroom,” as they say: students would prepare for class in their rooms, and would spend their classroom time doing what we usually call “homework”—solving problems. And they would solve problems collaboratively, sitting around tables in small groups. Students would learn to learn from each other, and the professor would stop acting as though his job was to train people to sit alone and think until they came up with answers. A principal objective of the course would be not just to teach the material but to persuade these budding computer scientists that they could learn it. It had to be a drawing-in course, a confidence-building course, not a weeding-out course. I immediately ran into one daunting obstacle: there was no place to teach such a course. Every classroom big enough to hold 40 or 50 students was set up on the amphitheater plan perfected in Greece 2,500 years ago. Optimal for a performer addressing an audience; pessimal, as computer scientists would say, for students arguing with each other. The School of Engineering and Applied Sciences (SEAS) had not a single big space with a flat floor and doors that could be closed. Several other SEAS professors also wanted to experiment with their teaching styles, and in the fall of 2011 we started talking about designs. In remarkably short order by Harvard standards, SEAS made a dramatic decision. It would convert some underutilized library space on the third floor of Pierce Hall to a flat-floor classroom. In this prototype there would be minimal technology, just a projection system. Thanks to some heroic work by architects and engineers, the whole job was done between the end of classes in December and the start of classes in late January 2012. The space is bright, open, and intentionally low-tech. The room features lots of whiteboards, some fixed to the walls and others rolling on casters, and small paisley-shaped tables, easily rearranged to accommodate two, four, or six seats. Electric cables run underneath a raised floor and emerge here and there like hydras, sprouting multiple sockets for student laptops, which never seem to have working batteries. A few indispensable accouterments were needed—lots of wireless Internet connectivity; push-of-a-button shades to cover the spectacular skylight; and a guarantee from the building manager that the room would be restocked daily with working whiteboard markers. About 40 brave souls showed up to be the guinea pigs in what I told them would be an experiment. To make the point about how the course would work, I gave on day one not the usual hour-long synopsis of the course and explanation of grading percentages, but a short substantial talk on the “pigeonhole principle”: If every pigeon goes in a pigeonhole and there are more pigeons than pigeonholes, some pigeonhole must have at least two pigeons. I then handed out a problem for the tables to solve using that principle, right then and there: prove that if you pick any 10 points from the area of a 1 x 1 square, then some two of them must be separated by no more than one-third of the square root of two. They got it, and they all came back for the next class, some with a friend or two. (Try it yourself—and remember, it helps to have someone else to work with!) After a few fits and starts, the course fell into a rhythm. We met Mondays, Wednesdays, and Fridays from 10 to 11 a.m. The course material was divided into bite-sized chunks, one topic per day. For each topic I created a slide presentation, which was the basis for a 20-minute mini-lecture I recorded on my laptop while sitting at home. The video and the slides were posted on the course website by the end of a given class so students could view them at their convenience before the next class. I also assigned 10 to 20 pages of reading from relevant sources that were free online. (A standard text for this material costs $218.67, and I just couldn’t ask students to spend that kind of money.) The students, in turn, had to answer some short questions online to prove they had done the reading and watched the video before showing up for class. Once in class, I worked one problem and then passed out copies of a sheet posing three or four others. Students worked in groups of four around tables, and each table wrote its solution on a whiteboard. A teaching fellow (TF), generally a junior or senior concentrating in math or computer science, coached and coaxed, and when a table declared it had solved a problem, finally called on a student to explain and defend the group’s solution. (This protocol provided an incentive for the members of a group to explain the solution to each other before one of them was called on.) At the end of the class, we posted the solutions to all the in-class problems, and also posted real homework problems, to be turned in at the beginning of the next class. We took attendance, and we collected the homework submissions at the beginning of class, to make sure people showed up on time. I had serious doubts about whether this protocol would actually work. Required attendance is countercultural at Harvard, as is daily homework to be submitted in class. And education requires the trust of the students. To learn anything, they have to believe the professors know what they are doing. I really didn’t, though I had observed a master teacher, Albert Meyer ’63, Ph.D. ’72, MIT’s Hitachi America professor of engineering, utilize this style with great skill. There was also the choppiness, the lack of a dramatic story line for the whole course. I took the cheap way out of that problem—I threw in some personal war stories related to the material. How Bill Gates ’77, LL.D. ’07, as a sophomore, cracked a problem I gave him about counting pancake flips and published a paper about it called “Bounds for Sorting By Prefix Reversal.” How Mark Zuckerberg ’06 put me at the center of his prototype social-network graph (so pay attention to graph theory, students, you never know when it might come in handy!). With no camera on me, I used the intimacy of the classroom for topical gossip—including updates on the five varsity athletes taking the course, three of them on teams that won Ivy championships during the term. Student feedback was gratifyingly positive. Anonymous responses to my questionnaire included “I’ve found this to be the most helpful teaching method at Harvard” and “Oh my goodness, the in-class problem-solving is beautiful! We need more of it.” Even the negative comments were positive. One student said, “The TFs are great. Professor Lewis’s teaching is not good. …I find it more useful to…talk to the TFs than listen to his lectures.” Fine, I thought to myself, I’ll talk less. My TFs have always been better teachers than I am, anyway, and lots of them are top professors now, so this is par for the course. My favorite: “You might say the class is a kind of start-up, and that its niche is the ‘class as context for active, engaging, useful, and fun problem-solving’ (as opposed to ‘class as context for sitting, listening, and being bored’).” Yes! Discrete mathematics as entrepreneurial educational disruption! What have we learned from the whole CS 20 experiment? Thirty-three topic units were a lot to prepare—each includes a slide deck, a recorded lecture, a selection of readings, a set of in-class problems, and homework exercises. The trickiest part was coordinating the workflow and getting everything at the right difficulty level—manageable within our severe time constraints, but hard enough to be instructive. Fortunately, my head TF, Michael Gelbart, a Princeton grad and a Ph.D. candidate in biophysics, is an organizational and pedagogical genius. When our homework problems were too hard and students became collectively discouraged or angry, we pacified the class with an offering of cupcakes or doughnut holes. We kept the classroom noncompetitive—we gave the normal sorts of exams, but students were not graded on their in-class performance, provided they showed up. That created an atmosphere of trust and support, but in-class problem-solving is pedagogically inefficient: I could have “covered” a lot more material if I were lecturing rather than confronting, in every class, students’ (mis)understanding of the material! Harvard’s class schedule, which allots three class hours per week for every course, is an anachronism of the lecture era; for this course we really need more class time for practice, drill, and testing. I relearned an old cultural lesson in a more international Harvard. Thirty-five years ago I learned the hard way never to assign an exam problem that required knowing the rules of baseball, because (who knew?) in most of the world children don’t grow up talking about innings and batting averages. This year I learned (happily, this time, before I made up the final exam) that there are places where children aren’t taught about hearts and diamonds, because card games are considered sinful. I also responded to some familiar student objections. Having weathered storms of protest in 1995 over randomizing the Houses, I anticipated that students would prefer to pick their own table-mates, but (true to type) I decided that mixing up the groups would make for greater educational dynamism. It worked, but next time I will go one step further. I will re-scramble the groups halfway through the course, so everyone can exchange their newly acquired problem-solving strategies with new partners. With a good set of recorded lectures and in-class problems now in hand, the class could be scaled pretty easily; we could offer multiple sections at different hours of the day, if we could get the classroom space and hire enough conscientious, articulate, mathematically mature undergraduate assistants. Fortunately, the Harvard student body includes a great many of the latter, and I owe a lot of thanks to those who assisted me this year—Ben Adlam, Paul Handorff, Abiola Laniyonu, and Rachel Zax—as well as to Albert Meyer and my colleague Paul Bamberg ’63, senior lecturer on mathematics, who gave me good advice and course materials to adapt for CS 20. I had the added satisfaction, as a longtime distance-education buff, of finding out that this experience could be replicated online. With the support of Henry Leitner, Ph.D. ’82, associate dean in the Division of Continuing Education and senior lecturer in computer science, we tried, and seem to have succeeded. In CSci E-120, offered this spring through the Harvard Extension School, a group of adventurous students, physically spread out from California to England, replicated the CS 20 “active learning” experience. They watched the same lectures and did the same reading on their own time. They “met” together synchronously for three hours per week (in the early evening for some, and the early morning for others). Web conferencing software allowed them to form virtual “tables” of four students each. Each “table” collaborated to solve problems by text-chatting and by scribbling on a shared virtual “whiteboard” using a tablet and stylus. My prize assistant, Deborah Abel ’01, “wandered” among the rooms just as the teaching fellows were doing in the physical space of my Pierce Hall classroom. Most of all, the course was for me an adventure in the co-evolution of education and technology—indeed, of life and technology. The excitement of computing created the demand for the course in the first place. The new teaching style was a response to the flood of digital content—and to my stubborn, libertarian refusal to dam it up. The course couldn’t have been done without digital infrastructure—five years ago I could not have recorded videos, unassisted and on my own time, for students to watch on theirs. The distance version of the course is an exercise in cyber-mediated intercontinental collaboration. Yet in the Harvard College classroom, almost nothing is digital. It is all person-to-person-to-person, a cacophony of squeaky markers and chattering students, assistants, and professor, above which every now and then can be heard those most joyous words, “Oh! I get it now!” 原文见 http://harvardmagazine.com/2012/09/reinventing-the-classroom; 个人分类: 教育|1923 次阅读|0 个评论

MLA 2012 Notes (2) - Transfer Learning and Applications: jujumao 2012-11-25 18:52; 2.TransferLearningandApplications Speaker ：杨强 HKUST 教授， GeneralchairofSIGKDD, 华为诺亚方舟实验室主任杨强还是讲的迁移学习，不过却很有趣，迁移学习可以看做强化学习的一种，将一个领域习得的知识用于另一个领域，用于不同 domain 的 knowledgetransfer ，由 labeleddata 扩展到 unlabeleddataset ，对于一些标注代价昂贵且不易获得的数据集很有用，应用较广泛。一个简单的例子是，我们要针对中文的网页训练一个数据集，但是我们手上没有或只有少量的中文网页数据，却有充足的标注好的英文的数据集。目标还是一样，对未标注的中文数据集进行分类，这就是所谓的跨语言分类（ cross-languageclassification ）如图，显然这有别于传统的文本分类，因为在这里训练数据和测试数据是两种不同的语言。 Scenario 就是这样，那我们应该怎么办呢？在所有异质迁移学习任务中，我们首先需要建立两种异质特征空间的联系，在这里我们有充足的英文语料，但目标是对只有很少标注语料的中文文档进行分类，所以需要对两种语言的数据集建立关联， bridge 两种特征空间（（注：在文本分类里，特征空间就是词汇空间）），所以一个直观 solution 是想办法获得两种特征空间（这里表现为中文和英文词汇）的共现（ co-occurrence ）数据（例如我们手上有个字典，将英文 term 和中文 termmatch 起来），通过这种方法，我们可以根据他们的共现频率去估计两种数据集在特征（词汇）层面的迁移概率。因而能利用训练得到的英文分类器去分类标注中文文档，进而迭代获得更多的标注数据。另一个有趣的场景是，如何利用文本训练数据集去对图像做分类？众所周知图片的标注数据集相对于含有特定主题的文本而言是很难获得的。同时在实验中对一个很意思的命题进行了论证：所谓一图抵千言，真的是这样么？准确么？那么一幅图究竟能代表多少字呢？方法还是一样的，首先需要发现图像和文本特征空间的关联，找到沟通他们的桥梁。好在这种关联并不难以找到，通过爬取一些社交网站（如 flickr 等）上，人们对图片打上的标签（ tag ）。获得图像和对应标签注释文本之间的共现数据，从而建立起图像特征空间和文档特征空间之间的关联。在实验中，如果要达到 75% 的分类准确率，需要 100 多个标注图片，而如果使用文档（ document ）的话，要达到相同的准确率需要 200 多个文档。因此我们可以得出 eachImage=2textdocuments=1000words （ statisticallyperdocumentshasaverageof500words ）即所谓的一图抵千字！; 4159 次阅读|0 个评论

纽约时报上关于Deep Learning的研究进展的报道: 热度 1 oliviazhang 2012-11-25 17:11; 其中一些deep learning的应用确实让我打开眼界。全文连接： Scientists See Promise in Deep-Learning Programs 以前看过一些人做人工神经网络的动力学行为的分析。可惜仅仅只拿一些简单的没有多少实际用途的人工神经网络来分析（比如最简单的PCA神经网络）。如果能对Deep Learning这种太有潜力的网络做动力学行为分析，那将是多么有意义的研究工作啊。当然，Deep Learning确实很复杂。但是我希望那些分析神经网络动力学行为的人，不要一天到晚就只分析最简单的PCA,MCA这种网络。如果真要分析，分析分析sparse PCA, kernel PCA这样的神经网络更要有意义些。; 个人分类: Deep Learning|7301 次阅读|4 个评论

[转载]machine learning non-linear: SVM, Neural Network: genesquared 2012-11-8 15:59; Support vector machine - Wikipedia, the free encyclopedia en.wikipedia.org/.../Support_vector_ machin ... - 网页快照 - 翻译此页 In machine learning , support vector machines (SVMs, also support vector ... SVMs can efficiently perform non-linear classification using what is called the kernel ... Formal definition - History - Motivation - Linear SVM Machine Learning : Do there exist non-linear online (stochastic ... www.quora.com/ Machine - Learning /Do-there-exist- no ... - 翻译此页 Not sure weather I got it right but ... Artificial Neural Networks (ANN) are able to capture non-linear hypothesis. Furthermore, extensions to ANN incorporate ... Machine Learning scianta.com/technology/ machinelearning .htm - 网页快照 - 翻译此页 Linear and Non-Linear Regression is a machine learning technique for fitting a curve to a collection of data. The algebraic formula for the curve is a model of the ... Foundations of Machine Learning Lecture 5 www.cs.nyu.edu/~mohri/mls/lecture_5.pdf - 翻译此页文件格式: PDF/Adobe Acrobat - 快速查看作者：M Mohri - 相关文章 Mehryar Mohri - Foundations of Machine Learning . Motivation. Non-linear decision boundary. Efficient computation of inner products in high dimension. Flexible ... Machine Learning Learning highly non-linear functions www.cs.cmu.edu/~epxing/Class/10701.../lecture5.pdf - 翻译此页文件格式: PDF/Adobe Acrobat - 快速查看 1. Machine Learning . Neural Networks. Eric Xing. 10-701/15-781, Fall 2011. 781, Fall 2011. Lecture 5, September 26, 2011. Reading: Chap. 5 CB. 1. © Eric Xing ... Machine Learning Learning non-linear functions www.cs.cmu.edu/~epxing/Class/10701.../lecture6.pdf - 翻译此页文件格式: PDF/Adobe Acrobat - 快速查看 Lecture 6, February 4, 2008. Reading: Chap. 1.6, CB Chap 3, TM. Learning non-linear functions f: X → Y. ○. X (vector of) continuous and/or discrete vars. ○ ... machine learning - non linear svm kernel dimension - Stack Overflow stackoverflow.com/.../ non-linear -svm-kernel... - 网页快照 - 翻译此页 3 个回答-10月22日 I have some problems with understanding the kernels for non-linear ... The transformation usually increases the number of dimensions of your ... Investigation of expert rule bases, logistic regression, and non-linear ... www.ncbi.nlm.nih.gov/pubmed/19474477 - 翻译此页作者：MC Prosperi - 2009 - 被引用次数：17 - 相关文章 Investigation of expert rule bases, logistic regression, and non-linear machine learning techniques for predicting response to antiretroviral treatment. Prosperi ... Original article Investigation of expert rule bases, logistic regression ... www.intmedpress.com/serveFile.cfm?sUID...847b... - 翻译此页文件格式: PDF/Adobe Acrobat 作者：MCF Prosperi - 2009 - 被引用次数：17 - 相关文章 developed through machine learning methods. Methods: The aim of the study was to investigate linear and non-linear statistical learning models for classifying ... machine learning - Non-linear (e.g. RBF kernel) SVM with SCAD ... stats.stackexchange.com/.../ non-linear -e-g-r... - 网页快照 - 翻译此页 1 Mar 2012 – Is there one? I think there's a penalizedSVM package in R but it looks to use a linear kernel. Can't quite tell from the documentation. If it's linear ...; 2317 次阅读|0 个评论

[转载]MKL(multi-kernel learning)在图像分类中的应用: kistaria 2012-10-18 21:47; 分类在搜索引擎中的应用非常广泛，这种分类属性可以方便在rank过程中针对不同类别实现不同的策略，来更好满足用户需求。本人接触分类时间并不长，在刚用SVM做分类的时候对一个现象一直比较困惑，看到大家将各种不同类型特征，拼接在一起，组成庞大的高维特征向量，送给SVM，得到想要的分类准确率，一直不明白这些特征中，到底是哪些特征在起作用，哪些特征组合在一起才是最佳效果，也不明白为啥这些特征就能够直接拼在一起，是否有更好的拼接方式？后来了解到核函数以及多核学习的一些思想，临时抱佛脚看了点，对上面的疑问也能够作一定解释，正好拿来和大家一起探讨探讨，也望大家多多指点。本文探讨的问题所列举的实例主要是围绕项目中的图像分类展开，涉及SVM在分类问题中的特征融合问题。扩展开来对其他类型分类问题，理论上也适用。关键词： SVM 特征融合核函数多核学习 2基本概念阐述 SVM：支持向量机，目前在分类中得到广泛的应用特征融合：主要用来描述各种不同的特征融合方式，常见的方式有前期融合，就是前面所描述的将各个特征拼接在一起，后期融合本文后面会提到核函数：SVM遇到线性不可分问题时，可以通过核函数将向量映射到高维空间，在高维空间线性可分多核学习：在利用SVM进行训练时，会涉及核函数的选择问题，譬如线性核，rbf核等等，多核即为融合几种不同的核来训练。 3应用背景在图片搜索中，会出现这样的一类badcase，图像的内容和描述图像的文本不一致，经常会有文本高相关，而图像完全不一致的情况。解决这类问题的一个思路就是综合利用图像的内容分类属性和文本的query分类属性，看两者的匹配程度做相应策略。 4分类方法的选取下面就可以谈到本文的重点啦，那是如何对图像分类的呢？对分类熟悉的同学，马上可能要说出，这还不easy，抽取各种特征，然后一拼接，随便找个分类器，设定几个参数，马上分类模型文件就出来啦，80%准确率没问题。那这个方法确实不错也可行，但是有没有可以改进的地方呢？这里可能先要说明下图像分类的一些特殊性。图像的分类问题跟一般的分类问题方法本质上没太大差异，主要差异体现在特征的抽取上以及特征的计算时间上。图像特征的抽取分为两部分，一部分是针对通用图像的特征，还有一部分则是针对特定类别抽取的特征。这些特征与普通的文本特征不一致的地方在于，一个图像特征由于存在分块、采样、小波变换等，可能维度就已经很高。譬如常见的MPEG-7标准中提到的一些特征，边缘直方图150维，颜色自相关特征512维等。在分类过程中，如果将这些特征拼接在一起直接就可能过千维，但是实际在标注样本时，人工标注的正负样本也才几千张，所以在选择分类器时，挑选svm，该分类器由于可以在众多分类面中选择出最优分界面，以及在小样本的学习中加入惩罚因子产生一定软边界，可以有效规避overfitting。在特征的计算时间上，由于图像处理涉及的矩阵计算过多，一个特征的计算时间慢的可以达到0.3秒，所以如何挑选出既有效又快速的特征也非常重要。 5两种特征融合方式的比较那刚才的方法有什么问题呢？仔细想想，大致存在以下几点问题： 1. 你所提取的所有特征，全部串在一起，一定合适么？如果我想知道哪些特征组合在一起效果很好，该怎么办？ 2. 用svm进行学习时，不同的特征最适合的核函数可能不一样，那我全部特征向量串在一起，我该如何选择核函数呢？ 3. 参数的选取。不同的特征即使使用相同的核，可能最适合的参数也不一样，那么如何解决呢？ 4. 全部特征都计算，计算时间的花销也是挺大的对于刚才的问题，如果用前期融合，可能是用下面方式来解决： 1. 根据经验，觉得在样本中可能表现不错的特征加进来，至于组合么，全部串在一起，或者选几个靠谱的串一起，慢慢试验，慢慢调，看哪些特征有改进就融合在一起 2. 也是根据经验，选取普遍表现不错的RBF核，总之结果应该不会差 3. 交叉验证是用来干嘛的？验证调优参数呗，全部特征融合在一起，再来调，尽管验证时间长，不要紧，反正模型是离线训练的，多调会也没关系。那是否有更好的选择方案呢？多核学习(MKL)可能是个不错的选择，该方法属于后期融合的一种，通过对不同的特征采取不同的核，对不同的参数组成多个核，然后训练每个核的权重，选出最佳核函数组合来进行分类。先看下简单的理论描述：普通SVM的分类函数可表示为：其中为待优化参数，物理意义即为支持向量样本权重，用来表示训练样本属性，正样本或者负样本，为计算内积的核函数，为待优化参数。其优化目标函数为：其中用来描述分界面到支持向量的宽度，越大，则分界面宽度越小。C用来描述惩罚因子，而则是用来解决不可分问题而引入的松弛项。在优化该类问题时，引入拉格朗日算子，该类优化问题变为：其中待优化参数在数学意义上即为每个约束条件的拉格朗日系数。而MKL则可认为是针对SVM的改进版，其分类函数可描述为：其中，表示第K个核函数，则为对应的核函数权重。其对应的优化函数可以描述为：在优化该类问题时，会两次引入拉格朗日系数，参数与之前相同，可以理解为样本权重，而则可理解为核函数的权重，其数学意义即为对每个核函数引入的拉格朗日系数。具体的优化过程就不描述了，不然就成翻译论文啦~，大家感兴趣的可以看后面的参考文档。通过对比可知，MKL的优化参数多了一层其物理意义即为在该约束条件下每个核的权重。 Svm的分类函数形似上是类似于一个神经网络，输出由中间若干节点的线性组合构成，而多核学习的分类函数则类似于一个比svm更高一级的神经网络，其输出即为中间一层核函数的输出的线性组合。其示意图如下：在上图中，左图为普通SVM示例，而全图则为MKL示例。其中为训练样本，而为不同的核函数，为支持向量权重（假设三个训练样本均为支持向量），为核权重，y为最终输出分类结果。 6实验过程：以实际对地图类别的分类为例，目前用于分类的特征有A，B，C，D，E，F，G（分别用字母代表某特征），这些特征每个的维数平均几百维。准备工作： 1. 人工标注地图类别正负样本，本次标注正样本176张，负样本296张 2. 提取正负训练样本图片的A~G各个特征 3. 归一化特征 4. 为每个特征配置对应的核函数，以及参数工具： Shogun工具盒： http://www.shogun-toolbox.org/ ，其中关于该工具的下载，安装，使用实例都有详细说明。该工具除了提供多核学习接口之外，几乎包含所有机器学习的工具，而且有多种语言源码，非常方便使用。结果测试：经过大约5分钟左右的训练，输出训练模型文件，以及包含的核函数权重、准确率。在该实例中，7个特征分别用七个核，其权重算出来为： 0.048739 0.085657 0.00003 0.331335 0.119006 0.00000 0.415232，最终在测试样本上准确率为：91.6% 为了节省特征抽取的时间，考虑去掉权重较小的特征A、C、F，拿剩下4个核训练，几分钟后，得到核函数权重如下： 0.098070 0.362655 0.169014 0.370261，最终在测试样本上准确率为：91.4% 在这次训练中，就可以节约抽取A、C、F特征的训练时间，并且很快知道哪些特征组合在一起会有较好的结果。实验的几点说明： 1. 该类别的分类，因为样本在几百的时候就已经达到不错效果，所以选取数目较少。 2. 该实验是针对每个特征选择一个核，且每个核配置固定参数，实际中如果时间允许，可以考虑每个特征选不同核，同一核可以选取不同参数，这样可以得到稍微更好的结果。参考文章： Large Scale Multiple Kernel Learning SimpleMKL Representing shape with a spatial pyramid kernel 参考代码： http://www.shogun-toolbox.org/doc/cn/current/libshogun_examples.html ， 7个人经验与总结： 1. 多核学习在解释性上比传统svm要强。多核学习可以明显的看到各个子核中哪些核在起作用，哪些核在一起合作效果比较好。 2. 关于参数优化。曾经做过实验，关于同一特征选用同一核，但是不同参数，组合成多个核，也可以提升分类准确率。 3. 多核学习相比前期特征融合在性能上会有3%~5%左右的提升。 4. 通过特征选择，可以节约特征计算时间。 by wenshilei; 244 次阅读|0 个评论

基于抽样的缺陷预测: lzhx171 2012-10-14 18:25; 本周看了一篇12年发表在ASE上的文章，文章名为Sample-based software defect prediction with active and semi-supervised learning. 翻译为基于抽样方法的主动学习和半监督学习软件缺陷预测。文中作者描述了三种方法，一是利用传统的机器学习方法随机抽样训练，二是利用半监督学习（semi-supervised learning）训练器随机抽样，三是利用主动半监督学习训练主动抽样（active sampling）的结果，并提出一种叫ACoForest的算法进行主动抽样。本文核心是为主动学习+半监督学习，以及融合这二者的提出的ACoForest算法。为了描述这个算法，我们先描述一般抽样的半监督学习算法（CoForest，这个算法在07年时由本文作者提出，并应用于医学诊断当中）。给定带标签集合L，和为标记集合U，首先利用带标签的训练集初始化N个随机树，接着在每次迭代中用N-1个随机树集成训练预测标记U中数据，并将可信度较高的实例加入训练集L ，中，对L ，随机抽样，使其满足一定的条件，然后由带标签集合L和新标记的实例集合L ，进行优化，直到迭代中没有任何一个随机树变化为止。（红色标记部分的需要满足一定的条件，这个在以前报告中讲过）。以上为CoForest方法。在优化随机树时，应该选取最有助于优化的算法，这样可以减小训练集而同时提高精确率，因此在进行优化前（上段蓝色字体），选取N个随机树最有争议（说明所含信息多）的前M组数据，再进行之后的过程，这个就是ACoForest。这个算法利用了主动学习及半监督学习的优点，使得每个随机树收敛的更快。在实验部分，作者比较了ACoForest的方法，几乎在所有数据集上F1值都好于CoForest。此算法的新颖之处在于训练集的获取上，个人认为就是要找到含有有效信息最多的数据集合，作者通过他们之前提出的一种基于分歧的半监督学习方法，根据多分类器对每个数据的分歧程度来说明一个数据是否值得作为训练集。可以说这是一篇作者在他们之前研究基础上的一种应用延伸。; 201 次阅读|0 个评论

[转载]MKL(multi-kernel learning)在图像分类中的应用: hailuo0112 2012-10-10 10:04; 1摘要分类在搜索引擎中的应用非常广泛，这种分类属性可以方便在rank过程中针对不同类别实现不同的策略，来更好满足用户需求。本人接触分类时间并不长，在刚用SVM做分类的时候对一个现象一直比较困惑，看到大家将各种不同类型特征，拼接在一起，组成庞大的高维特征向量，送给SVM，得到想要的分类准确率，一直不明白这些特征中，到底是哪些特征在起作用，哪些特征组合在一起才是最佳效果，也不明白为啥这些特征就能够直接拼在一起，是否有更好的拼接方式？后来了解到核函数以及多核学习的一些思想，临时抱佛脚看了点，对上面的疑问也能够作一定解释，正好拿来和大家一起探讨探讨，也望大家多多指点。本文探讨的问题所列举的实例主要是围绕项目中的图像分类展开，涉及SVM在分类问题中的特征融合问题。扩展开来对其他类型分类问题，理论上也适用。关键词： SVM 特征融合核函数多核学习 2基本概念阐述 SVM：支持向量机，目前在分类中得到广泛的应用特征融合：主要用来描述各种不同的特征融合方式，常见的方式有前期融合，就是前面所描述的将各个特征拼接在一起，后期融合本文后面会提到核函数：SVM遇到线性不可分问题时，可以通过核函数将向量映射到高维空间，在高维空间线性可分多核学习：在利用SVM进行训练时，会涉及核函数的选择问题，譬如线性核，rbf核等等，多核即为融合几种不同的核来训练。 3应用背景在图片搜索中，会出现这样的一类badcase，图像的内容和描述图像的文本不一致，经常会有文本高相关，而图像完全不一致的情况。解决这类问题的一个思路就是综合利用图像的内容分类属性和文本的query分类属性，看两者的匹配程度做相应策略。 4分类方法的选取下面就可以谈到本文的重点啦，那是如何对图像分类的呢？对分类熟悉的同学，马上可能要说出，这还不easy，抽取各种特征，然后一拼接，随便找个分类器，设定几个参数，马上分类模型文件就出来啦，80%准确率没问题。那这个方法确实不错也可行，但是有没有可以改进的地方呢？这里可能先要说明下图像分类的一些特殊性。图像的分类问题跟一般的分类问题方法本质上没太大差异，主要差异体现在特征的抽取上以及特征的计算时间上。图像特征的抽取分为两部分，一部分是针对通用图像的特征，还有一部分则是针对特定类别抽取的特征。这些特征与普通的文本特征不一致的地方在于，一个图像特征由于存在分块、采样、小波变换等，可能维度就已经很高。譬如常见的MPEG-7标准中提到的一些特征，边缘直方图150维，颜色自相关特征512维等。在分类过程中，如果将这些特征拼接在一起直接就可能过千维，但是实际在标注样本时，人工标注的正负样本也才几千张，所以在选择分类器时，挑选svm，该分类器由于可以在众多分类面中选择出最优分界面，以及在小样本的学习中加入惩罚因子产生一定软边界，可以有效规避overfitting。在特征的计算时间上，由于图像处理涉及的矩阵计算过多，一个特征的计算时间慢的可以达到0.3秒，所以如何挑选出既有效又快速的特征也非常重要。 5两种特征融合方式的比较那刚才的方法有什么问题呢？仔细想想，大致存在以下几点问题： 1. 你所提取的所有特征，全部串在一起，一定合适么？如果我想知道哪些特征组合在一起效果很好，该怎么办？ 2. 用svm进行学习时，不同的特征最适合的核函数可能不一样，那我全部特征向量串在一起，我该如何选择核函数呢？ 3. 参数的选取。不同的特征即使使用相同的核，可能最适合的参数也不一样，那么如何解决呢？ 4. 全部特征都计算，计算时间的花销也是挺大的对于刚才的问题，如果用前期融合，可能是用下面方式来解决： 1. 根据经验，觉得在样本中可能表现不错的特征加进来，至于组合么，全部串在一起，或者选几个靠谱的串一起，慢慢试验，慢慢调，看哪些特征有改进就融合在一起 2. 也是根据经验，选取普遍表现不错的RBF核，总之结果应该不会差 3. 交叉验证是用来干嘛的？验证调优参数呗，全部特征融合在一起，再来调，尽管验证时间长，不要紧，反正模型是离线训练的，多调会也没关系。那是否有更好的选择方案呢？多核学习(MKL)可能是个不错的选择，该方法属于后期融合的一种，通过对不同的特征采取不同的核，对不同的参数组成多个核，然后训练每个核的权重，选出最佳核函数组合来进行分类。先看下简单的理论描述：普通SVM的分类函数可表示为：其中为待优化参数，物理意义即为支持向量样本权重，用来表示训练样本属性，正样本或者负样本，为计算内积的核函数，为待优化参数。其优化目标函数为：其中用来描述分界面到支持向量的宽度，越大，则分界面宽度越小。C用来描述惩罚因子，而则是用来解决不可分问题而引入的松弛项。在优化该类问题时，引入拉格朗日算子，该类优化问题变为：其中待优化参数在数学意义上即为每个约束条件的拉格朗日系数。而MKL则可认为是针对SVM的改进版，其分类函数可描述为：其中，表示第K个核函数，则为对应的核函数权重。其对应的优化函数可以描述为：在优化该类问题时，会两次引入拉格朗日系数，参数与之前相同，可以理解为样本权重，而则可理解为核函数的权重，其数学意义即为对每个核函数引入的拉格朗日系数。具体的优化过程就不描述了，不然就成翻译论文啦~，大家感兴趣的可以看后面的参考文档。通过对比可知，MKL的优化参数多了一层其物理意义即为在该约束条件下每个核的权重。 Svm的分类函数形似上是类似于一个神经网络，输出由中间若干节点的线性组合构成，而多核学习的分类函数则类似于一个比svm更高一级的神经网络，其输出即为中间一层核函数的输出的线性组合。其示意图如下：在上图中，左图为普通SVM示例，而全图则为MKL示例。其中为训练样本，而为不同的核函数，为支持向量权重（假设三个训练样本均为支持向量），为核权重，y为最终输出分类结果。 6实验过程：以实际对地图类别的分类为例，目前用于分类的特征有A，B，C，D，E，F，G（分别用字母代表某特征），这些特征每个的维数平均几百维。准备工作： 1. 人工标注地图类别正负样本，本次标注正样本176张，负样本296张 2. 提取正负训练样本图片的A~G各个特征 3. 归一化特征 4. 为每个特征配置对应的核函数，以及参数工具： Shogun工具盒： http://www.shogun-toolbox.org/ ，其中关于该工具的下载，安装，使用实例都有详细说明。该工具除了提供多核学习接口之外，几乎包含所有机器学习的工具，而且有多种语言源码，非常方便使用。结果测试：经过大约5分钟左右的训练，输出训练模型文件，以及包含的核函数权重、准确率。在该实例中，7个特征分别用七个核，其权重算出来为： 0.048739 0.085657 0.00003 0.331335 0.119006 0.00000 0.415232，最终在测试样本上准确率为：91.6% 为了节省特征抽取的时间，考虑去掉权重较小的特征A、C、F，拿剩下4个核训练，几分钟后，得到核函数权重如下： 0.098070 0.362655 0.169014 0.370261，最终在测试样本上准确率为：91.4% 在这次训练中，就可以节约抽取A、C、F特征的训练时间，并且很快知道哪些特征组合在一起会有较好的结果。实验的几点说明： 1. 该类别的分类，因为样本在几百的时候就已经达到不错效果，所以选取数目较少。 2. 该实验是针对每个特征选择一个核，且每个核配置固定参数，实际中如果时间允许，可以考虑每个特征选不同核，同一核可以选取不同参数，这样可以得到稍微更好的结果。参考文章： Large Scale Multiple Kernel Learning SimpleMKL Representing shape with a spatial pyramid kernel 参考代码： http://www.shogun-toolbox.org/doc/cn/current/libshogun_examples.html ， 7个人经验与总结： 1. 多核学习在解释性上比传统svm要强。多核学习可以明显的看到各个子核中哪些核在起作用，哪些核在一起合作效果比较好。 2. 关于参数优化。曾经做过实验，关于同一特征选用同一核，但是不同参数，组合成多个核，也可以提升分类准确率。 3. 多核学习相比前期特征融合在性能上会有3%~5%左右的提升。 4. 通过特征选择，可以节约特征计算时间。 by wenshilei 转自： http://stblog.baidu-tech.com/?p=1272; 个人分类: SVM|4100 次阅读|0 个评论

[转载][CODE] manifold learning matlab code 一个流行学习的matlab代: why196 2012-10-6 16:25; 一个流行学习demo，并且有源代码里面实现的代码有一下文章： MDS Michael Lee's MDS code ISOMAP J.B. Tenenbaum, V. de Silva and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, vol. 290, pp. 2319--2323, 2000. LLE L. K. Saul and S. T. Roweis. Think Globally, Fit Locally: Unsupervised Learning of Low Dimensional Manifolds . Journal of Machine Learning Research, v4, pp. 119-155, 2003. Hessian LLE D. L. Donoho and C. Grimes. Hessian Eigenmaps: new locally linear embedding techniques for high-dimensional data . Technical Report TR-2003-08, Department of Statistics, Stanford University, 2003. Laplacian Eigenmap M. Belkin, P. Niyogi, Laplacian Eigenmaps for Dimensionality Reduction and Data Representation , Neural Computation, June 2003; 15 (6):1373-1396. Diffusion Maps Nadler, Lafon, Coifman, and Kevrekidis. Diffusion maps, spectral clustering and reaction coordinates of dynamical systems LTSA Zhenyue Zhang and Hongyuan Zha. Principal Manifolds and Nonlinear Dimension Reduction via Tangent Space Alignment. SIAM Journal of Scientific Computing, 2004, 26 (1): 313-338. 原始链接： http://www.math.ucla.edu/~wittman/mani/index.html 本文引用地址： http://blog.sciencenet.cn/blog-722391-583977.html; 1521 次阅读|0 个评论

[转载]多任务学习方法及两个Multi-task learning(多任务学习)的: why196 2012-10-6 16:22; 条件：有n个任务t=1,2,...,n,每个任务给予m个样本：（xt1,yt1）,...,(xtm,ytm)。目的:得出一个X到Y的函数ft，t=1，2，...，n。当这些任务是相关的，联合的任务学习应该比单独学习每个任务的效果好，特别当每个任务的数据相当少的时候，在这种情况下，独自学习是很不成功的。在这个过程中主要用到了传递的功能：1）通过n个任务学习得到的好的概括结果能够传递到一个新的任务上，2）通过新任务t‘的一些数据，{(xt'1,yt'1),....,xt'l,yt'l},学习函数ft'，3）从n个任务中学习到的共同结构或者共同特征确实能够“传递”到新任务中来。 4）传递是人类智能的一个很重要的特征。在多任务学习中，当这些任务是相关的，联合的任务学习应该比单独学习每个任务的效果好，特别当每个任务的数据相当少的时候，在这种情况下，独自学习是很不成功的。 Convex Multi-task Feature Learning 是一篇比较经典的文章，代码点击这里可以下载。还有一篇是Multi-Task Feature Learning Via Efficient l2-1 Norm Minimization, 点击这里可以下载。这篇文章的最后一位作者就是JiePing Ye, 是LDA的大牛，2D LDA和GLDA就是他提出来的，而且他的主页上面公布了不少的源代码，有兴趣的可以看一看~; 2217 次阅读|0 个评论

[转载]近两年顶级会议上关于Distance Metric Learning的paper清单: rockycxy 2012-9-9 17:05; ICML 2012 Maximum Margin Output Coding Information-theoretic Semi-supervised Metric Learning via Entropy Regularization A Hybrid Algorithm for Convex Semidefinite Optimization Information-Theoretical Learning of Discriminative Clusters for Unsupervised Domain Adaptation Similarity Learning for Provably Accurate Sparse Linear Classification ICML 2011 Learning Discriminative Fisher Kernels Learning Multi-View Neighborhood Preserving Projections CVPR 2012 Order Determination and Sparsity-Regularized Metric Learning for Adaptive Visual Tracking Non-sparse Linear Representations for Visual Tracking with Online Reservoir Metric Learning Unsupervised Metric Fusion by Cross Diffusion Learning Hierarchical Similarity Metrics Large Scale Metric Learning from Equivalence Constraints Neighborhood Repulsed Metric Learning for Kinship Verification Learning Robust and Discriminative Multi-Instance Distance for Cost Effective Video Classification PCCA: a new approach for distance learning from sparse pairwise constraints Group Action Induced Distances for Averaging and Clustering Linear Dynamical Systems with Applications to the Analysis of Dynamic Visual Scenes CVPR 2011 A Scalable Dual Approach to Semidefinite Metric Learning AdaBoost on Low-Rank PSD Matrices for Metric Learning with Applications in Computer Aided Diagnosis Adaptive Metric Differential Tracking (HUST) Tracking Low Resolution Objects by Metric Preservation (HUST) ACM MM 2012 Optimal Semi-Supervised Metric Learning for Image Retrieval Low Rank Metric Learning for Social Image Retrieval Activity-Based Person Identification Using Sparse Coding and Discriminative Metric Learning Deep Nonlinear Metric Learning with Independent Subspace Analysis for Face Verification ACM MM 2011 Biased Metric Learning for Person-Independent Head Pose Estimation ICCV 2011 Learning Mixtures of Sparse Distance Metrics for Classification and Dimensionality Reduction Unsupervised Metric Learning for Face Identification in TV Video Random Ensemble Metrics for Object Recognition Learning Nonlinear Distance Functions using Neural Network for Regression with Application to Robust Human Age Estimation Learning parameterized histogram kernels on the simplex manifold for image and action classification ECCV 2012 Metric Learning for Large Scale Image Classification: Generalizing to New Classes at Near-Zero Cost Dual-force Metric Learning for Robust Distractor Resistant Tracker Learning to Match Appearances by Correlations in a Covariance Metric Space Image Annotation Using Metric Learning in Semantic Neighbourhoods Measuring Image Distances via Embedding in a Semantic Manifold Supervised Earth Mover’s Distance Learning and Its Computer Vision Applications Learning Class-to-Image Distance via Large Margin and L1-norm Regularization Labeling Images by Integrating Sparse Multiple Distance Learning and Semantic Context Modeling IJCAI 2011 Distance Metric Learning Under Covariate Shift Learning a Distance Metric by Empirical Loss Minimization AAAI 2011 Efficiently Learning a Distance Metric for Large Margin Nearest Neighbor Classification NIPS 2011 Learning a Distance Metric from a Network Learning a Tree of Metrics with Disjoint Visual Features Metric Learning with Multiple Kernels KDD 2012 Random Forests for Metric Learning with Implicit Pairwise Position Dependence WSDM 2011 Mining Social Images with Distance Metric Learning for Automated Image Tagging; 个人分类: 个人总结|7360 次阅读|0 个评论

BMJ Online learning: xupeiyang 2012-6-26 05:38; Dear Colleague, BMJ Learning is celebrating success following the completion of two million learning modules. To mark this milestone, we're throwing open our 1,000-strong collection of modules for free, for everyone, for one week. To thank you for helping us hit the two million mark, you can access hundreds of modules not usually available - free of charge any time during the week of July 2-9th. Whether it's our new, animated procedure based modules, journal-related CPD or anything from the A-Z of our offer (from Abdominal pain to Whooping cough), there's something extra for everyone. Here are some modules suitable forhealthcare professionalsto keep you busy in the mean time: Alcohol withdrawal: managing patients in the emergency department Arterial blood gases: a guide to interpretation Addison’s disease Upper gastrointestinal bleeding: a guide to diagnosis and management of non-variceal bleeding Acute kidney injury: a guide to diagnosis and treatment Best wishes, Dr. Helen Morant Editor, Online learning http://learning.bmj.com/learning/info/twomillionthmodule.html?utm_source=Adestrautm_medium=emailutm_campaign=2994utm_content=Celebrate%20with%20us%20-%20over%201%2C000%20free%20modules%20for%20one%20weekutm_term=BMJ%20LearningCampaign+name=SP%20250612%20healthcare%20professions%20weekly%20alert%20fre; 个人分类: 信息交流|2739 次阅读|0 个评论

[CODE] manifold learning matlab code 一个流行学习的matlab代: 热度 1 qianli8848 2012-6-20 08:53; 一个流行学习demo，并且有源代码里面实现的代码有一下文章： MDS Michael Lee's MDS code ISOMAP J.B. Tenenbaum, V. de Silva and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, vol. 290, pp. 2319--2323, 2000. LLE L. K. Saul and S. T. Roweis. Think Globally, Fit Locally: Unsupervised Learning of Low Dimensional Manifolds . Journal of Machine Learning Research, v4, pp. 119-155, 2003. Hessian LLE D. L. Donoho and C. Grimes. Hessian Eigenmaps: new locally linear embedding techniques for high-dimensional data . Technical Report TR-2003-08, Department of Statistics, Stanford University, 2003. Laplacian Eigenmap M. Belkin, P. Niyogi, Laplacian Eigenmaps for Dimensionality Reduction and Data Representation , Neural Computation, June 2003; 15 (6):1373-1396. Diffusion Maps Nadler, Lafon, Coifman, and Kevrekidis. Diffusion maps, spectral clustering and reaction coordinates of dynamical systems LTSA Zhenyue Zhang and Hongyuan Zha. Principal Manifolds and Nonlinear Dimension Reduction via Tangent Space Alignment. SIAM Journal of Scientific Computing, 2004, 26 (1): 313-338. 原始链接： http://www.math.ucla.edu/~wittman/mani/index.html; 个人分类: CODE|10519 次阅读|1 个评论

邹晓辉：如何在大学教育层面开展阅读和写作的国家项目?NWP: geneculture 2012-6-15 22:40; 邹晓辉：如何在大学教育层面开展阅读和写作的国家项目？回答这个问题，至少要考虑： 1.当前大学在母语和外语以及自然语言和程序语言乃至通用俗语与专用术语这三类双语教学及研究甚至计算机交互三个方面的现状调查； 2.如何借鉴国内外这方面成功的做法及事例？ 3.怎样搭建基于三类双语教学及研究甚至计算机交互网络平台？附录： NWP 国家写作项目（ http://www.nwp.org/ ） Writing is Essential Writing is essential to communication, learning, and citizenship . It is the currency of the new workplace and global economy. Writing helps us convey ideas, solve problems, and understand our changing world. Writing is a bridge to the future . About NWP Our Mission The National Writing Project focuses the knowledge, expertise, and leadership of our nation's educators on sustained efforts to improve writing and learning for all learners. Our Vision Writing in its many forms is the signature means of communication in the 21st century. The NWP envisions a future where every person is an accomplished writer, engaged learner, and active participant in a digital, interconnected world . Who We Are Unique in breadth and scale, the NWP is a network of sites anchored at colleges and universities and serving teachers across disciplines and at all levels , early childhood through university . We provide professional development, develop resources, generate research, and act on knowledge to improve the teaching of writing and learning in schools and communities . The National Writing Project believes that access to high-quality educational experiences is a basic right of all learners and a cornerstone of equity. We work in partnership with institutions, organizations, and communities to develop and sustain leadership for educational improvement. Throughout our work, we value and seek diversity—our own as well as that of our students and their communities—and recognize that practice is strengthened when we incorporate multiple ways of knowing that are informed by culture and experience. A Network of University-Based Sites Co-directed by faculty from the local university and from K–12 schools, nearly 200 local sites serve all 50 states, the District of Columbia, Puerto Rico, and the U.S. Virgin Islands. Sites work in partnership with area school districts to offer high-quality professional development programs for educators . NWP continues to add new sites each year, with the goal of placing a writing project site within reach of every teacher in America. The network now includes two associated international sites . A Successful Model Customized for Local Needs NWP sites share a national program model , adhering to a set of shared principles and practices for teachers’ professional development , and offering programs that are common across the network. In addition to developing a leadership cadre of local teachers (called “ teacher-consultants ”) through invitational summer institutes , NWP sites design and deliver customized inservice programs for local schools, districts, and higher education institutions, and they provide a diverse array of continuing education and research opportunities for teachers at all levels. National research studies have confirmed significant gains in writing performance among students of teachers who have participated in NWP programs. The NWP is the only federally funded program that focuses on the teaching of writing . Support for the NWP is provided by the U.S. Department of Education , foundations, individuals, corporations, universities, and K-12 schools. NWP Core Principles The core principles at the foundation of NWP’s national program model are: Teachers at every level—from kindergarten through college—are the agents of reform ; universities and schools are ideal partners for investing in that reform through professional development . Writing can and should be taught , not just assigned, at every grade level. Professional development programs should provide opportunities for teachers to work together to understand the full spectrum of writing development across grades and across subject areas . Knowledge about the teaching of writing comes from many sources : theory and research, the analysis of practice, and the experience of writing . Effective professional development programs provide frequent and ongoing opportunities for teachers to write and to examine theory, research, and practice together systematically. There is no single right approach to teaching writing; however, some practices prove to be more effective than others. A reflective and informed community of practice is in the best position to design and develop comprehensive writing programs . Teachers who are well informed and effective in their practice can be successful teachers of other teachers as well as partners in educational research, development, and implementation. Collectively, teacher-leaders are our greatest resource for educational reform. http://www.nwp.org/cs/public/print/doc/about.csp; 个人分类: 双语信息处理|11 次阅读|0 个评论

Call For Paper of Special Session, IWSIS2012: xuqingzheng 2012-3-31 15:45; 2012 International Workshop on Swarm Intelligent Systems (IWSIS2012) http://www1.tyust.edu.cn/yuanxi/yjjg/iwsis2012/iwsis2012.htm Special Session- Recent Advances on Opposition-Based Learning Applications Session Chair: Dr. Qingzheng Xu Department of Military Electronic Engineering, Xi’an Communication Institute, China Scope: Diverse forms of opposition are already existent virtually everywhere around us and the interplay between entities and opposite entities is apparently fundamental for maintaining universal balance. However, it seems that there is a gap regarding oppositional thinking in engineering, mathematics and computer science. A better understanding of opposition could potentially establish new search, reasoning, optimization and learning schemes with a wide range of applications. The main idea of opposition-based learning (OBL) is to consider opposite estimates, actions or states as an attempt to increase the coverage of the solution space and to reduce exploration time. OBL has already been applied to reinforcement learning, differential evolution, artificial neural network, particle swarm optimization, ant colony optimization, and genetic algorithm, etc. Example applications include large scale optimization problem, multi-objective optimization, traveling salesman problem, data mining, nonlinear system identification, image processing and understanding. However, finding killer applications for OBL is still a hard task that is heavily pursued. The objective of this special session is to bring together the state-of-art research results and industrial applications on this topic. Contributed papers must be the original work of the authors and should not have been published or under consideration by other journals or conferences. Topics of primary interest include, but are not limited to: l Motivation and theory of opposition-based learning l Opposition-based optimization techniques l Reasoning and search strategies in opposition-based computing l Real-world applications in signal processing, pattern recognition, image understanding, robotics, social networking, etc. l Other methodologies and applications associated with opposition-based learning Submission and review process: Submissions should follow the IWSIS2012 manuscript format described in the Workshop Web site at http://www1.tyust.edu.cn/yuanxi/yjjg/iwsis2012/iwsis2012.htm . All the papers must be submitted electronically in PDF format only via email to Dr. Qingzheng Xu at xuqingzheng@hotmail.com . All the submitted papers will be strictly peer reviewed by at least two anonymous reviewers. Based on the reports by the reviewers, the final decision on papers submitted to this Special Session will be taken by General Chairs of IWSIS2012, Prof. Zhihua Cui and Prof. Jianchao Zeng. All accepted papers will be published in some EI journals as regular papers : Important dates: Submission Date: April 20, 2012 Acceptance Date: May 20, 2012 Registration Date: June 1, 2012 Final version Date: June 1, 2012 Publication Date: All accepted papers will be published in EI-indexed international journals within the late of 2012 and early of 2013; 个人分类: 写作投稿指南|5048 次阅读|0 个评论

[转载]关于马尔科夫随机场MRF的思考: ciwei020621 2012-3-19 10:11; 关于马尔科夫随机场MRF的思考 Markov Random Fields(MRF)是undirected graph的概率表示，下面说说它在computer vision中的应用。 MRF应用在视觉中，相当于一个Labeling问题，更具体点，是通过MAP inference来确定图中每个节点的label。MRF相比其他方法的优势是：1）提供了一种principled method来对Prior knowledge建模，2）MRF可以很容易用定量的方法描述contextual information。因此，相比其它pixel-based, 或local-based 方法，它可以考虑到环境知识的影响，如果建立的图模型得当，进而可能获得全局最优解释，这样正是向human vision更靠近了一步。说到MRF的Inference，首先必须有 graph construction, parameter learning,最后才是Inference，图的创建一般是对问题本身的建模，比如在image restoration 和image segmentation中，常用到4-neighborhood或8-neighborhood的pairwise模型，这样，4-或8-相邻的像素中间便用边连接，这样的模型就是paradigmatic pairwise Markov model,如果要加入高阶（=3)的potential，相当于我们引入了更多的约束，比如：connectivity 约束、非基督分类结果的约束..., 说到非监督分类结果的约束，要注意的是：一定是其他分类方法，而非MRF本身的分类结果累构成新的约束。 MRF中参数学习方法在此略过，后续补充。下面重点说MRF inference问题，即解求能量函数最小能量的问题。对于经典的只有unary 和 binary potential的MRF模型，graph cut已经能够在Linear time内进行求解，如果加入更高的potential,虽然问题本身可能变成了NP-hard，仍然有很多近似算法，比如Loopy belief propagation(LBP)，tree-reweighted message passing(TRW),Metropolis-Hastings,MCMC等等。撇开这些方法，其实问题的实质是 energy minimization，值得一提的是CVPR现在有个workshop是专门讨论这个问题的，叫做energy minimization methods in computer vision and pattern recognition (EMMcvpr). 组织者是：Yuri Boykov（UWO）, Fredrik Kahl, Victor Lempitsky,Vladimir Kolmogorov(UCL), Olga Veksler(UWO),个个都活跃在算法、离散数学、变分法、图论等领域的前沿，所以大家对能量最小化的优化方法感兴趣的可以经常关注这些教授的主页，他们也主要做计算机视觉方面的应用，所以他们的publication会多数发表在ICCV,CVPR,ECCV,NIPS上面。上面粗要的概述了MRF的三个步骤，下面谈谈自己对MRF的看法： 1）虽然它是通过MAP获得整体的maximal likely solution,但是，如果对问题本身建模时，只考虑到了局部约束，则：MRF得到的结果仍然是局部的，比如考虑最经典的影像分割，如果只用到pairwise Model，则很可能把一个完整的物体分割成了2块甚至更多块。2）当引入更多的constaints时，不同的约束如果非独立，而通常，我们使用Gibbs Distribution来描述Markov Network,不同的potential是相加的关系，转化到概率，则是相乘，就是不独立的两个clique直接相乘了，所以此时存在参数的冗余（redundency),所以Parameter learning时，要解决这个问题（这个问题在测量数据处理中叫参数的显著性检测）。下面分享一些好的资料： A tutorial of MRF on CV: Markov random field models in computer vision (E CCV 94, S.Z. Li) MRF inference, Prof. Yuri: http://www.csd.uwo.ca/~yuri/ Prof. Richard Szeliski : A comparativestudy of energy minimization methods for markovrandom fields . Richard 的计算机视觉教材 Computer Vision: algorithms and applications 正在被全球很多大学使用，有取代 Computer Vision：a modern approach, multiview geometry in computer vision 等权威教材的趋势。 code: http://vision.middlebury.edu/MRF/; 7982 次阅读|0 个评论

introduction to reinforcement learning: jiangdm 2012-3-1 11:11; Contents 1 source code 2 source code: Reinforcement Learning Toolbox: http://www.igi.tugraz.at/ril-toolbox/general/overview.html RL in Wiki: http://en.wikipedia.org/wiki/Reinforcement_learning#References 强化学习分类： -- 函数估计 function approximation -- 分层强化学习 hierarchical reinforcement learning -- 关系强化学习 relational reinforcement learning -- Bayes logic network -- Logic decision tree survey: A Survey of Reinforcement Learning in Relational Domains.pdf 耿直《因果挖掘的若干统计方法》 -- 井底之蛙 -- 替罪羔羊 -- 盲人摸象 -- 纲举目张 -- 寻根问底 + 顺藤摸瓜分类学习：模型选择、模型集成、正则化技术(regularization) 多示例多标记学习; 个人分类: AI & ML|1 次阅读|0 个评论

[转载]Future work on social tagging: timy 2012-2-28 00:31; Future work on social tagging The results from evaluations of social tags by experienced indexers in MELT highlighted a number of interesting issues that need further validation and investigation. Social tagging, as a feature in a conventional learning resource repository, is a very new phenomenon and it will take time before those interested in this approach have well developed evaluation methodologies and tools in this new context. Nevertheless, the MELT analysis shows that: Tags that expert indexers don't understand mostly constitute ‘noise’, but there are exceptions to this (see 2). Some tags travel across languages; i.e. people understand them even if they do not speak the language. These “travel well” tags can support retrieval in a multilingual context by facilitating the cross-border retrieval of resources. Some tags are understood only by a sub-group of users (e.g. “esl” = English as a Second Language) enhancing cross-border use and adding value for these sub-groups, but mostly they constitute ‘noise’ to others. Some tags correspond to descriptors in the LRE Thesaurus and can be used as indexing keywords for a resource, especially when the existing indexing is poor or the tag represents a narrower term. ”Thesaurus tags” can be used to determine the language equivalences between keywords, and affinities between tags and indexing keywords. Thesaurus terms could be used in order to determine affinities between tags, thus helping describe resources, as well as retrieval of resources in multiple languages. Tags can lead to interesting non-descriptors in the thesaurus and thus facilitate and enhance multilinguality. Tags can help enrich the thesaurus in terms of suggesting new descriptors based on how users have used tags to describe resources. Lots of food for thought here that can be investigated further once a critical mass of user-generated tags has been accumulated as a result of many thousands of teachers using the public version of the LRE portal in 2009. Look out for a fast expanding LRE tag cloud! The LRE tag cloud in May 2009; 个人分类: 信息组织|0 个评论

科研的境界：追求极致: 热度 1 xbjin 2012-2-12 21:11; 昨天，我看了一篇发表在机器学习顶级杂志JRML上的文章，总共55页，作者在2007年就提出了算法的一般框架发表在KDD会议上，完整长文2010年终于发表了！才读了2/3，不得不佩服作者的认真和追求极致的态度！作者把这个一般算法用到各个领域，不管是机器学习的理论领域包括struct learning, ranking learning,classification, regression learning还是应用领域human action recognition等，甚至包括当前比较热的算法的并行化执行都考虑到了，全文引经据典，把当前涉及的各个热点全部用了一遍，最后的1/3便是算法的收敛性证明。无论是理论，还是应用，叫你无话可说！倒是应了小沈阳的那句话，“走别人的路，让别人无路可走！”你要是还想用这个算法框架，不管是做理论，还是做应用，都得引用他的文章，因为他已经做过了！汗颜！照在中国当前只求量不求质的思维，这55页发10篇会议文章都有可能！人家全部浓缩到一篇文章里面了！让我想起了一个笑话：有一个美国数学教授有一个很好的命题，当他带第一个博士时，他让学生证n=1时成立，然后学生顺利毕业了，当带第2个学生时，他又让学生证n=2时成立，第二个博士生也证明了n=3，但是第4个博士生让他郁闷了，因为他证明了n为任意自然数，命题均成立！他不能再让学生做这个了！我所看的这篇文章的作者在某种程度上像最后的那个博士生，因为他把就当前这个算法框架下所涉及到的问题基本上都考虑到了！这就是做科研的境界：追求极致！; 4044 次阅读|1 个评论

machine learning introduction: jiangdm 2012-2-6 09:35; 书籍： 1. 入门好书《Programming Collective Intelligence》，培养兴趣是最重要的一环，一上来看大部头很容易被吓走的:P 2. Peter Norvig 的《AI, Modern Approach 2nd》（无争议的领域经典），上次讨论中 Shenli 使我开始看这本书了，建议有选择的看，部头还是太大了，比如我先看里面的概率推理部分的。 3. 《The Elements of Statistical Learning》，数学性比较强，可以做参考了。 4. 《Foundations of Statistical Natural Language Processing》，自然语言处理领域公认经典。 5. 《Data Mining, Concepts and Techniques》，华裔科学家写的书，相当深入浅出。 6. 《Managing Gigabytes》，信息检索经典。 7. 《Information Theory：Inference and Learning Algorithms》，参考书吧，比较深。相关数学基础（参考，不是拿来通读的）：《矩阵分析》，Roger Horn。矩阵分析领域无争议的经典。《概率论及其应用》，威廉·费勒。也是极牛的书。线性代数的？《Nonlinear Programming, 2nd》非线性规划的参考书。《Convex Optimization》凸优化的参考书 Bishop Pattern Recognition and Machine Learning 工具： 1. Weka （或知识发现大奖的数据挖掘开源工具集，非常华丽 Tom Michell 主页上 Andrew Moore 周志华统计机器学习概论 .ppt 强化学习.ppt 机器学习研究进展.ppt 机器学习及其挑战.ppt AdvanceML.ppt 机器学习研究及最新进展.pdf; 个人分类: ML|1 次阅读|0 个评论

强化学习 + agent: jiangdm 2012-2-2 16:44; ##一种新的多智能体Q 学习算法郭锐吴敏彭军彭姣曹卫华自动化学报 Vol. 33, No. 4 2007 年4 月摘要针对非确定马尔可夫环境下的多智能体系统, 提出了一种新的多智能体Q 学习算法. 算法中通过对联合动作的统计来学习其它智能体的行为策略, 并利用智能体策略向量的全概率分布保证了对联合最优动作的选择. 同时对算法的收敛性和学习性能进行了分析. 该算法在多智能体系统RoboCup 中的应用进一步表明了算法的有效性与泛化能力. 关键词: 多智能体, 增强学习, Q 学习 1 引言机器学习根据反馈的不同将学习分为: --监督学习 -- 非监督学习: 增强学习(Reinforcement learning) {一种以反馈为输入的自适应学习方法} 多智能体系统(Multi-agent systems) : 合作进化学习 Mini-Max Q学习算法 FOF Q 学习算法: 竞争 + 合作 2 多智能体Q 学习 2.1 多智能体Q 学习思想增强学习 + 多智能体系统. the arising difficulties: -- 首先必须改进增强学习所依据的环境模型 -- 再者, 在多智能体系统中, 学习智能体应学习其它智能体的策略, 系统当前状态到下一状态的变迁由学习智能体与其它智能体的动作决定, -- 2.2 多智能体Q 学习算法学习策略期望累计折扣回报引入多个智能体的行为: 迭代: 多智能体Q 学习算法: 2.3 算法收敛性和有效性分析 2.3.1 算法收敛性证明 2.3.2 算法有效性分析 PAC 准则 3 学习算法在RoboCup 中的应用 RoboCup 机器人仿真足球比赛 4 结论 I comment: I will study reinforcement learing in MAS in the next step. the Q-learing algorithm for MAS will provide some insight and references to my future study. 一种新的多智能体Q学习算法.pdf this paper written by the same correspond author, which emphasize the architecture of agent in MAS: 一种新的多智能体系统结构及其在RoboCup中的应用.pdf *** ##一种新颖的多agent强化学习方法周浦城, 洪炳镕, 黄庆成电子学报 2006-8 摘要 : 提出了一种综合了模块化结构、利益分配学习以及对手建模技术的多agen t强化学习方法, 利用模块化学习结构来克服状态空间的维数灾问题, 将Q􀀁学习与利益分配学习相结合以加快学习速度, 采用基于观察的对手建模来预测其他agent的动作分布. 追捕问题的仿真结果验证了所提方法的有效性. 关键词: 多agen t学习; Q学习; 利益分配学习; 模块化结构; 对手建模 1 引言多agen t系统 + 强化学习: 1) 一种方式将多agent系统作为单个学习agent, 借助单agent强化学习维数灾问题. 2) 另一种方式系统中每个agent拥有各自独立的强化学习机制. 由于多个agent协同学习, Note：对比协同进化 2 强化学习 2.1 Q学习 Q-learning: 一类似于动态规划的强化学习方法 2.2 利益分配学习 Profit Sharing, PS 学习: 强化学习算法 3 提出的模块强化学习方法学习agen t由三种模块构成: (1)学习模块LM, 实现强化学习算法. (2) 对手模块OM, 用于通过观察其他agent的动作来得到其动作分布估计, 以便评估自身动作值. (3) 仲裁模块MM, 用来综合学习模块和对手模块的结果. 3.1 对手建模 MAS: agent的动作效果 = 外界环境 + Other agents动作影响. 3.2 混合强化学习算法 author idea: PS-learning + Q-learning 3.3 仲裁模块决策 3.4 多agent学习过程 4 仿真 4.1 追捕问题 I comment: one of authors is Prof. 洪炳镕, whose robotic dance group attend the festival celebration of Spring in 2012. from various information source, I believe he and his team may be develop some program to collaborate robot dance and the kernel of system also come from the commerical product of foreign company. 一种新颖的多agent强化学习方法.pdf; 个人分类: ML|3 次阅读|0 个评论

增强学习综述: jiangdm 2012-1-27 16:32; *** ##Reinforcement Learning: A Survey Leslie Pack Kaelbling，Michael L. Littman and Andrew W. Moore Journal of Arti cial Intelligence Research ,1996 Abstract This paper surveys the eld of reinforcement learning from a computer-science per- spective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the eld and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but di ers considerably in the details and in the use of the word \reinforcement. The paper discusses central issues of reinforcement learning, including trading o exploration and exploitation, establishing the foundations of the eld via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning. 1. Introduction two main strategies for solving reinforcement-learning problems. 1) to search in the space of behaviors in order to find one that performs well in the environment. approach: genetic algorithms and genetic programming as well as some more novel search techniques 2) to use statistical techniques and dynamic programming methods to estimate the utility of taking actions in states of the world. the structure of this paper: 1) Section 1 is devoted to establishing notation and describing the basic reinforcement-learning model. 2) Section 2 explains the trade-off between exploration and exploitation and presents some solutions to the most basic case of reinforcement-learning problems 3) Section 3 considers the more general problem in which rewards can be delayed in time from the actions that were crucial to gaining them. 4) Section 4 considers some classic model-free algorithms for reinforcement learning from delayed reward: adaptive heuristic critic, TD( ) and Q-learning. 5) Section 5 demonstrates a continuum of algorithms that are sensitive to the amount of computation an agent can perform between actual steps of action in the environment. 6) Section 6 describes Generalization - the cornerstone of mainstream machine learning research 7) Section 7 considers the problems that arise when the agent does not have complete perceptual access to the state of the environment. 8) Section 8 catalogs some of reinforcement learning's successful applications. 9) Finally, Section 9 concludes with some speculations about important open problems and the future of reinforcement learning. 1.1 Reinforcement-Learning Model Formally, the model consists of a discrete set of environment states, S; a discrete set of agent actions, A; and a set of scalar reinforcement signals; typically f0; 1g, or the real numbers. The agent's job : to find a policy ,mapping states to actions, that maximizes some long-run measure of reinforcement Reinforcement learning vs supervised learning 1) reinforcement learning has no presentation of input/output pairs 2) on-line performance is important: the evaluation of the system is often concurrent with learning. reinforcement learning vs search and planning issues in AI 1.2 Models of Optimal Behavior the crucial problem: what reinforce learning model of optimality will be? how the agent should take the future into account in the decisions it makes about how to behave now? three models: 1) The finite-horizon model 2) The infinite-horizon discounted model 3) average-reward model 1.3 Measuring Learning Performance several incompatible measures 1) Eventual convergence to optimal 2) Speed of convergence to optimality 3) Regret 1.4 Reinforcement Learning and Adaptive Control Adaptive control vs Reinforcement Learning Adaptive control: parameter estimation problem 2. Exploitation versus Exploration: The Single-State Case reinforcement learning vs supervised learning: reinforcement-learner must explicitly explore its environment. Problem: the k-armed bandit problem (K臂赌博机问题) 参考：《基于信任和K臂赌博机问题选择多问题协商对象》，软件学报，2006 the structure of this section: i) Section 2.1 discusses three solutions to the basic one-state bandit problem that have formal correctness results. ii) Section 2.2 presents three techniques that have had wide use in practice, 2.1 Formally Justi ed Techniques 2.1.1 Dynamic-Programming Approach Bayesian reasoning 2.1.2 Gittins Allocation Indices 2.1.3 Learning Automata 2.2 Ad-Hoc Techniques 2.2.1 Greedy Strategies 2.2.2 Randomized Strategies Boltzmann exploration （玻尔兹曼方法） 2.2.3 Interval-based Techniques 2.3 More General Problems 3. Delayed Reward 3.1 Markov Decision Processes An MDP consists of a set of states S, a set of actions A, a reward function R a state transition function T Definition： The model is Markov if the state transitions are independent of any previous environment states or agent actions. 3.2 Finding a Policy Given a Model tool：dynamic programming 3.2.1 Value Iteration 3.2.2 Policy Iteration 3.2.3 Enhancement to Value Iteration and Policy Iteration 3.2.4 Computational Complexity 4. Learning an Optimal Policy: Model-free Methods there are two ways to proceed. Model-free: Learn a controller without learning a model. Model-based: Learn a model, and use it to derive a controller. i）Section 4 examines model-free learning, ii）Section 5 examines model-based methods. The biggest problem facing a reinforcement-learning agent ： temporal credit assignment. i）How do we know whether the action just taken is a good one？ ii）when it might have farreaching effects? temporal difference methods： to adjust the estimated value of a state based on the immediate reward and the estimated value of the next state. 4.1 Adaptive Heuristic Critic and TD( ) two components: 1）a critic (labeled AHC), 2）a reinforcement-learning component (labeled RL). 4.2 Q-learning 4.3 Model-free Learning With Average Reward 5. Computing Optimal Policies by Learning Models 5.1 Certainty Equivalent Methods 5.2 Dyna 5.3 Prioritized Sweeping / Queue-Dyna 5.4 Other Model-Based Methods 6. Generalization 6.1 Generalization over Input the goal: examines approaches to generating actions or evaluations as a function of a description of the agent's current state. 6.1.1 Immediate Reward CRBP The idea behind this training ru le : whenever an action fails to generate reward, crbp will try to generate an action that is different from the current choice. ARC REINFORCE Algorithms Logic-Based Methods 6.1.2 Delayed Reward Adaptive Resolution Models Decision Trees Variable Resolution Dynamic Programming PartiGame Algorithm 6.2 Generalization over Actions 6.3 Hierarchical Methods 6.3.1 Feudal Q-learning 6.3.2 Compositional Q-learning 6.3.3 Hierarchical Distance to Goal 7. Partially Observable Environments 7.1 State-Free Deterministic Policies 7.2 State-Free Stochastic Policies 7.3 Policies with Internal State The only way to behave truly e ectively in a wide-range of environments : to use memory of previous actions and observations to disambiguate the current state a variety of approaches to learning policies with internal state 1) Recurrent Q-learning to use a recurrent neural net work to learn Q values. 2) Classi er Systems bucket brigade algorithm 3) Finite-history-window Approach to restore the Markov property is to allow decisions to be based on the history of recent observations and perhaps actions. 4) POMDP Approach use hidden Markov model (HMM) techniques to learn a model of the environment 8. Reinforcement Learning Applications a data point to questions such as: How important is optimal exploration? Can we break the learning period into exploration phases and exploitation phases? What is the most useful model of long-term reward: Finite horizon? Discounted? Infinite horizon? How much computation is available between agent decisions and how should it be used? What prior knowledge can we build into the system, and which algorithms are capable of using that knowledge? 8.1 Game Playing 8.2 Robotics and Control 9. Conclusions Reinforcement learning A survey.pdf 个人点评：只了解其形，并未理解其精髓附上一篇中文综述，可以与本文对照学习：强化学习研究综述.pdf 强化学习研究进展.doc 参考资料：强化学习史忠植.ppt *** 多agent teamwork研究综述李静，陈兆乾，陈世福，徐殿祥计算机研究与发展，2003 modify history 1) 2012-2-24 摘要： Teamwork在许多动态、复杂的多Agent环境中占据越来越重要的地位，是目前人工智能界研究的热点之一,通过对多Agent Teamwork的研究现状、关键技术和发展趋势进行综述和讨论，试图勾画出目前Team work研究的脉络、重点及其发展趋向.主要内容包括：Teamwork研究的背景；TeamworkL的研究方法以及典型的Teamwork模型；Teamwork模型的特点以及关键技术；Teamwork的应用领域以及进一步研究的方向关键词: 多Agent系统；Teamwork模型 1 引言共享心理模型 Teamwork:指多Agent间协作、联合行动以确保团队以一致性（coherent）方式运作的过程 Teamwork: 1) cooperation 2) collaboration 3) coordination the goal of this paper: 介绍Teamwork 的两种主要研究方法,重点介绍了目前Teamwork 研究的热点问题,即构建Teamwork 模型的相关理论和技术,勾画出目前Teamwork 研究的重要方面、关键技术及其发展趋势 2 多agent Teamwork研究的主要方法多agent环境中的Teamwork模型两个目标: 1) 通过定义团队结构和团队运作过程来构建有效的Teamwork 2) 要求团队中的agent能够灵活地适应不断变化的环境多agent研究主要有两种方法: 1) 一以Teamwork理论为基础的基于知识、规划的方法，该方法主要是联合意图的建立，典型代表模型：STEAM； 2) 另一基于行为的方法，主要实现agent间灵活的行为选择，产生具有容错性、灵活性、可靠性的行为，但不具有规划能力，典型代表结构是: ALLIANCE 3 基于知识、规划的方法 Teamwork characteristic: ①联合行动的相互承诺,即没有队友的参与,Agent 不能单独放弃承诺; ②相互支持,即必须主动帮助队友; ③相互响应,即如果有需要的话,能够接管队友的任务 Teamwork理论包括:联合意图和共享规划 3.1 Jonit Intetion Framework 3.2 Shared plan 3.3 STEAM模型 4 基于行为（behavior based）的方法基于行为的方法:基于反应式方法基础上发展起来的，它吸取了反应式结构缺少内部状态表示，不能够看到过去和未来的缺点，保留了反应式结构实时反应、鲁棒性、可扩展性等优势，成为物理多机器人系统中普遍采用的结构. 基于行为方法的Teamwork可以分为两类：( 数学收敛 ) 1) 群类型协作（swarm type cooperation）方法 2) 意图类型协作（intentional cooperation）方法 4.1 ALLIANCE 5 Teamwork模型的特点 4个基本特点：（1）强壮性和容错性（2）实时反应性（3）灵活性（4）持久稳固性 6 构建Teamwork模型的关键技术（1）协商协商技术可以分为3类： i) 基于博弈论的协商； ii) 基于规划的协商 iii) 由人参与的和复杂人工智能方法的协商（2）通信 Teamwork模型通信机制的实质: 通过通信的方法解决团队队员间的协作和冲突等问题（3）信念推理信念推理技术是一个富有挑战的领域: 包括逻辑、基于事例的推理（CBR）、信念修正、多agent规划、基于模型的推理（MBR）、优化和博弈论. （4）规划（5）学习冲突问题是Teamwork模型中普遍存在的问题，因此要求Teamwork模型能够提供学习机制，以使Agent能够从自己过去的失败经验中不断学习，增强对环境的适应性 7 Teamwork模型的应用概况 7.1 机器人世界杯足球赛（RoboCup）浙江大学 2003年RoboCup 8 结论个人点评: 理解成 multi-agent AI planing，用于入门: 多AgentTeamwork研究综述.pdf the correspond author similar paper: 基于多Agent的Teamwork研究综述.pdf 另有一篇强化学习综述：强化学习综述.pdf related paper: 多Agent系统合作与协调机制研究综述.pdf 并行学习神经网络集成方法.pdf 陈世福是周志华 2003年全国优秀博士学位论文的指导老师，文章可关注 *** Book: 《Reinforce Learning: An Introduction》 MIT 1998 http://webdocs.cs.ualberta.ca/~sutton/book/the-book.html PDF: RL-3.pdf (this website has Lisp code for reinforce learning) MATLAB example: Netlog reinfore learning example: Reinforcement Learning Wargame.nlogo UML分析 (visio) : reinforce_learning_netlogo.vsd Approximate Dynamic Programming Chapter 6: Approximate Dynamic Programming.pdf ##面向多机器人系统的增强学习研究进展综述吴军, 徐昕, 王健, 贺汉根控制与决策 2011-11 摘要: 基于增强学习的多机器人系统优化控制是近年来机器人学与分布式人工智能的前沿研究领域. 多机器人系统具有分布、异构和高维连续空间等特性, 使得面向多机器人系统的增强学习的研究面临着一系列挑战, 为此, 对其相关理论和算法的研究进展进行了系统综述. 首先, 阐述了多机器人增强学习的基本理论模型和优化目标; 然后, 在对已有学习算法进行对比分析的基础上, 重点探讨了多机器人增强学习理论与应用研究中的困难和求解思路, 给出了若干典型问题和应用实例; 最后, 对相关研究进行了总结和展望. 关键词: 多机器人系统；多智能体；增强学习；随机对策；马氏决策过程 1 引言增强学习(RL): 一种不依赖于环境模型和先验知识的机器学习方法, 通过试错和延时回报机制, 结合自适应动态规划方法, 能够不断优化控制策略, 为系统自适应外界环境变化提供了可行方案单智能体增强学习(SARL) 多机器人增强学习(MRRL) 多智能体增强学习(MARL) 2 多机器人增强学习的理论框架 2.1 模型框架基础 MRRL模型框架: -- 应用于独立增强学习的马氏决策过程(MDPs) 模型 -- 应用于协同增强学习的随机对策(SGs) 模型. 2.1.1 MDPs 模型数学模型基础MDPs. 2.1.2 SGs 模型矩阵对策 2.2 学习任务类型 MARL任务分为静态任务和动态任务. 2.3 理论和方法基础 2.4 均衡解概念 Nash 均衡 2.5 学习目标 3 多机器人增强学习方法的分类研究 3.1 多机器人增强学习的分类 3.2 各类方法研究现状 3.2.1 集中式多机器人增强学习方法 3.2.2 分布式独立多机器人增强学习方法 3.2.3 基于CISG 的多机器人增强学习方法 3.2.4 基于ZSSG 的多机器人增强学习方法 Minimax-Q算法: 采用Minimax原理 3.2.5 基于GSSG的多机器人增强学习方法 4 多机器人增强学习的难点及发展趋势 4.1 多智能体增强学习的固有难题难题: 1) 维数灾难问题. 2) 信度分配问题. MARL的信度分配问题: 时间信度分配和结构信度分配两方面 3) 多均衡点协同选择问题 4.2 物理系统引入的约束限制开发合理的样本采集策略快速遍历学习空间, 同时开发能高效利用有限学习样本的快速学习算法已成为迫切需要. 4.3 多机器人增强学习的发展趋势分析 5 典型应用领域 6 结论与思考 I comment: 面向多机器人系统的增强学习研究进展综述.pdf; 个人分类: AI & ML|0 个评论

new: jiangdm 2012-1-7 20:01; new; 个人分类: CHI|1 次阅读|0 个评论

Danbury Music Learning Center 的 Musical History Map: 黄安年 2011-12-14 09:43; Danbury Music Learning Center 的 Musical History Map 黄安年文黄安年的博客 / 2011 年 12 月 13 日 ( 美东时间 ) 发布美国康州 Danbury Music Learning Center 集聚了一批富有教学经验热心普及音乐事业的音乐教学工作者 , 位于 Danbury 主街的音乐学习中心教学现场条件简陋 , 在地下室进行教学 , 但是专业素质很强 , 崇尚敬业。在地下室的走廊上有一张 Musical History Map ，十分引人瞩目。让人们对于音乐史上代表性人物对音乐发展历史的影响有清晰的了解。照片 5 张是近日拍摄的。; 个人分类: 教育改革思考(07-11)|7324 次阅读|0 个评论

相似度估计的距离学习：Distance Learning for Similarity Estim: jingyanwang 2011-12-12 15:17; Distance Learning for Similarity Estimation Jie Yu; Amores, J.; Sebe, N.; Radeva, P.;Qi Tian; PatternAnalysis and Machine Intelligence, IEEE Transactions on Volume: 30 , Issue:3 Digital Object Identifier: 10.1109/TPAMI.2007.70714 Publication Year: 2008 , Page(s): 451 - 462 Cited by: 7 IEEE JOURNALS Abstract | Full Text: PDF (2181KB) Distance Learning for Similarity Estimation_reading.doc Distance Learning for Similarity Estimation.pdf Jie Yu，这大哥不容易，从东华大学去了UTSA，最后去了柯达。他这篇文章实际上也是BoostMap的变种，只是首次提出了利用不同的feature和不同的distance的组合来做弱分类器。; 个人分类: RED|2835 次阅读|0 个评论

review: Statistical relational learning of trust: jiangdm 2011-11-23 21:36; Statistical relational learning of trust Achim Rettinger · Matthias Nickles · Volker Tresp Mach Learn (2011) Abstract The learning of trust and distrust is a crucial aspect of social interaction among autonomous, mentally-opaque agents. In this work, we address the learning of trust based on past observations and context information. We argue that from the truster’s point of view trust is best expressed as one of several relations that exist between the agent to be trusted (trustee) and the state of the environment . Besides attributes expressing trustworthiness, additional relations might describe commitments made by the trustee with regard to the current situation, like: a seller offers a certain price for a specific product. We show how to implement and learn context-sensitive trust using statistical relational learning in form of a Dirichlet process mixture model called Infinite Hidden Relational Trust Model (IHRTM). The practicability and effectiveness of our approach is evaluated empirically on user-ratings gathered from eBay. Our results suggest that (i) the inherent clustering achieved in the algorithm allows the truster to characterize the structure of a trust-situation and provides meaningful trust assessments; (ii) utilizing the collaborative filtering effect associated with relational data does improve trust assessment performance; (iii) by learning faster and transferring knowledge more effectively we improve cold start performance and can cope better with dynamic behavior in open multiagent systems. The later is demonstrated with interactions recorded from a strategic two-player negotiation scenario. Keywords: Relational learning · Computational trust · Social computing · Infinite hidden relational models · Initial trust 1 Introduction computational trust the existing problem: 1)lack the ability to take context sufficiently into account when trying to predict future behavior of interacting agents 2) not able to transfer knowledge gained in a specific context to a related context recommendations or social trust networks, cognitive and game theoretic models; 个人分类: ML|0 个评论

Danbury Music Learning Center举行汇报演出掠影（2011-11-19 5: 黄安年 2011-11-20 22:16; Danbury Music Learning Center 举行汇报演出掠影（ 2011-11-19 5 ： 00PM ）黄安年文黄安年的博客 / 2011 年 11 月 20 日 ( 美东时间 ) 发布 11 月 19 日 5:00PM ，在 Danbury Music Learning Center 学习的小朋友，在 Danbury 音乐厅举行了另一场汇报演出，集中汇报在专业老师指导下的联系钢琴的成果 , 家长们均参加观看汇报演出 , 孩子们穿着盛装分别登台表演。我们的小外孙参加下午 5 ： 00 的演出。由于他的年龄最小 , 所以第一个汇报演出。照片 24 张 , 是 5 ： 00-5 ： 50PM 拍摄的。; 个人分类: 教育改革思考(07-11)|2363 次阅读|0 个评论

Danbury Music Learning Center举行汇报演出掠影（2011-11-19 3: 黄安年 2011-11-20 11:29; Danbury Music Learning Center 举行汇报演出掠影（ 2011-11-19 3 ： 00PM ）黄安年文黄安年的博客 / 2011 年 11 月 20 日 ( 美东时间 ) 发布 11月19日下午 3:00PM 和 5:00PM ，在 Danbury Music Learning Center 学习的小朋友，在 Danbury 音乐厅举行了两场汇报演出，集中汇报在专业老师指导下的学习成果 , 家长们均参加观看汇报演出 , 孩子们穿着盛装分别登台表演。我们的大外孙和小外孙分别参加下午 3 ： 00 和 5 ： 00 的两场演出。照片 29 张 , 是 3 ： 00-3 ： 40PM 拍摄的。; 个人分类: 教育改革思考(07-11)|2355 次阅读|0 个评论

Danbury Music Learning Center举行汇报演出: 黄安年 2011-11-20 10:56; Danbury Music Learning Center 举行汇报演出黄安年文黄安年的博客 / 2011 年 11 月 19 日 ( 美东时间 ) 发布在 Danbury Music Learning Center 学习的各个不同程度小朋友每年两度举行学习回报演出 , 今天下午 3:00PM 和 5:00PM 分别有两次钢琴汇报演出 , 集中汇报在专业老师指导下的学习成果 , 家长们均参加观看汇报演出 , 孩子们穿着盛装分别登台表演。我们的大外孙和小外孙分别参加下午 3 ： 00 和 5 ： 00 的两场演出。照片 8 张中前两张是两场演出的合影 , 后 6 张分别是演出节目单。; 个人分类: 教育改革思考(07-11)|2643 次阅读|0 个评论

[转载]马尔文。明斯基（Marvin Minsky，1927-）的一个有趣访谈: 老李 2011-11-5 13:56; 笔者注：这篇访谈是John Brockman对明斯基的一个采访，明斯基是MIT教授，也是人工智能领域的大牛，看看他说点什么还是有意思的，闲话少说，上文！还是简单说两句为好，这个访谈反映了明斯基对于人类意识的一些看法。之所以出现对于意识解释的混乱现象，是因为大脑内部太复杂，可能有40-50种运作机制，而这些都是我们目前还没有办法清楚了解与解决的。他也否定了经验的主观特性。经验感觉是否可以还原？这是哲学家们感兴趣的话题，物理主义的解释是不能令人满意的。一百年前如此，今天也如此。即便意识是一个大手提箱，也不能什么都装啊？ CONSCIOUSNESS IS A BIG SUITCASE A Talk with Marvin Minsky MINSKY: My goal is making machines that can think-by understanding how people think. One reason why we find this hard to do is because our old ideas about psychology are mostly wrong. Most words we use to describe our minds (like "consciousness", "learning", or "memory") are suitcase-like jumbles of different ideas. Those old ideas were formed long ago, before 'computer science' appeared. It was not until the 1950s that we began to develop better ways to help think about complex processes. Computer science is not really about computers at all, but about ways to describe processes. As soon as those computers appeared, this became an urgent need. Soon after that we recognized that this was also what we'd need to describe the processes that might be involved in human thinking, reasoning, memory, and pattern recognition, etc. JB: You say 1950, but wouldn't this be preceded by the ideas floating around the Macy Conferences in the '40s? MINSKY: Yes, indeed. Those new ideas were already starting to grow before computers created a more urgent need. Before programming languages, mathematicians such as Emil Post, Kurt G鱠el, Alonzo Church, and Alan Turing already had many related ideas. In the 1940s these ideas began to spread, and the Macy Conference publications were the first to reach more of the technical public. In the same period, there were similar movements in psychology, as Sigmund Freud, Konrad Lorenz, Nikolaas Tinbergen, and Jean Piaget also tried to imagine advanced architectures for 'mental computation.' In the same period, in neurology, there were my own early mentors-Nicholas Rashevsky, Warren McCulloch and Walter Pitts, Norbert Wiener, and their followers-and all those new ideas began to coalesce under the name 'cybernetics.' Unfortunately, that new domain was mainly dominated by continuous mathematics and feedback theory. This made cybernetics slow to evolve more symbolic computational viewpoints, and the new field of Artificial Intelligence headed off to develop distinctly different kinds of psychological models. JB: Gregory Bateson once said to me that the cybernetic idea was the most important idea since Jesus Christ. MINSKY: Well, surely it was extremely important in an evolutionary way. Cybernetics developed many ideas that were powerful enough to challenge the religious and vitalistic traditions that had for so long protected us from changing how we viewed ourselves. These changes were so radical as to undermine cybernetics itself. So much so that the next generation of computational pioneers-the ones who aimed more purposefully toward Artificial Intelligence-set much of cybernetics aside. Let's get back to those suitcase-words (like intuition or consciousness) that all of us use to encapsulate our jumbled ideas about our minds. We use those words as suitcases in which to contain all sorts of mysteries that we can't yet explain. This in turn leads us to regard these as though they were "things" with no structures to analyze. I think this is what leads so many of us to the dogma of dualism-the idea that 'subjective' matters lie in a realm that experimental science can never reach. Many philosophers, even today, hold the strange idea that there could be a machine that works and behaves just like a brain, yet does not experience consciousness. If that were the case, then this would imply that subjective feelings do not result from the processes that occur inside brains. Therefore (so the argument goes) a feeling must be a nonphysical thing that has no causes or consequences. Surely, no such thing could ever be explained! The first thing wrong with this "argument" is that it starts by assuming what it's trying to prove. Could there actually exist a machine that is physically just like a person, but has none of that person's feelings? "Surely so," some philosophers say. "Given that feelings cannot not be physically detected, then it is 'logically possible' that some people have none." I regret to say that almost every student confronted with this can find no good reason to dissent. "Yes," they agree. "Obviously that is logically possible. Although it seems implausible, there's no way that it could be disproved." The next thing wrong is the unsupported assumption that this is even "logically possible." To be sure of that, you'd need to have proved that no sound materialistic theory could correctly explain how a brain could produce the processes that we call "subjective experience." But again, that's just what we were trying to prove. What do those philosophers say when confronted by this argument? They usually answer with statements like this: "I just can't imagine how any theory could do that." That fallacy deserves a name-something like "incompetentium". Another reason often claimed to show that consciousness can't be explained is that the sense of experience is 'irreducible.' "Experience is all or none. You either have it or you don't-and there can't be anything in between. It's an elemental attribute of mind-so it has no structure to analyze." There are two quite different reasons why "something" might seem hard to explain. One is that it appears to be elementary and irreducible-as seemed Gravity before Einstein found his new way to look at it. The opposite case is when the 'thing' is so much more complicated than you imagine it is, that you just don't see any way to begin to describe it. This, I maintain, is why consciousness seems so mysterious. It is not that there's one basic and inexplicable essence there. Instead, it's precisely the opposite. Consciousness, instead, is an enormous suitcase that contains perhaps 40 or 50 different mechanisms that are involved in a huge network of intricate interactions. The brain, after all, is built by processes that involve the activities of several tens of thousands of genes. A human brain contains several hundred different sub-organs, each of which does somewhat different things. To assert that any function of such a large system is irreducible seems irresponsible-until you're in a position to claim that you understand that system. We certainly don't understand it all now. We probably need several hundred new ideas-and we can't learn much from those who give up. We'd do better to get back to work. Why do so many philosophers insist that "subjective experience is irreducible"? Because, I suppose, like you and me, they can look at an object and "instantly know" what it is. When I look at you, I sense no intervening processes. I seem to "see" you instantly. The same for almost every word you say: I instantly seem to know what it means. When I touch your hand, you "feel it directly." It all seems so basic and immediate that there seems no room for analysis. The feelings of being seem so direct that there seems to be nothing to be explained. I think this is what leads those philosophers to believe that the connections between seeing and feeling must be inexplicable. Of course we know from neurology that there are dozens of processes that intervene between the retinal image and the structures that our brains then build to represent what we think we see. That idea of a separate world for 'subjective experience' is just an excuse for the shameful fact that we don't have adequate theories of how our brains work. This is partly because those brains have evolved without developing good representations of those processes. Indeed, there probably are good evolutionary reasons why we did not evolve machinery for accurate "insights" about ourselves. Our most powerful ways to solve problems involve highly serial processes-and if these had evolved to depend on correct representations of how they, themselves work, our ancestors would have thought too slowly to survive. 说明：文中图片就是明斯基（2008）的照片，来自网络，没有任何商业目的，仅供欣赏，特此致谢！; 3594 次阅读|0 个评论

The first two weeks in Sydney: qiyin546 2011-10-17 20:01; Since I have been Sydney two weeks ago, nearly everything is unknown for me, I need to change and broaden myself to adapte the new word, so that to quickly get a ecological inche for me and make sure the upcoming learning fitness, that is might be the “evolution”process. During those days, several training items have ocurred on me. Firstly, I need to follow the new work and rest timetable. We have three hours’s time difference between Sydney and China, that I will start to work whereas all Chinese friends are still in sleep. Three hours is not so much, but forget a long term time rhythm and set up a new one is not so easy, so much Chinese news and things always remind me that it is not time to work and not time to sleep, I always go to sleep very late althouth I am very tired. The second items is the food, not only the food content, but the food habit. People here do not pay so much on lunch but the dinner, that is opposite to China. For lunch, teachers and students do not eat very much. There is a Kitchen in the research building, professors and students could cook their lunch there for free. That is a good way to save time and money, I like it. boys with a hamburg and few yoghourt, girls with some fruit, somebody even have nothing given not being so hungery. They generally do not have specific rest time at noon. By contrast, their dinner time will be very colorful, and eat too much. I do not adapt this absolutely at the beginning, and I always need to eat too much for lunch so that to delay the afternoon hungery time, and will be very tired aftern the dinner time. The third training item relates to the traffic. This is an very important item to broaden living space, and is also the priority to control time. The straight proximity from my temporay rent house to research building was 2.5 km, but I need to walk nearly 4 km. I need to choose the most convenient bus and train line. Bicycle is impossible here, from one aspect, there is no space for bicycle, few person ride bicycle, from another aspect, there is specific for bicycle, you need to wear helmet and specical bicycle dress, otherwise you will be fired. Train here is convenient, but the management system is so different,they have so many different train lines, and each line is consistent with specific platform, you need to know your traveling plan from traffict websit of New South Wales (131500.com). From this website, you could control your travelling time presise to minutes, I have be good at this now. Up to now, the daily training has been finished, I have adjust myself to the new world now. I could control my daily work and life time, and know where I should go if I what to buy something, I also know how to prepare the lunch food so that to join in the students at noon. Next step for me is science traning, keep going. Lunch time. The girl is laura, a volunteer from germany. The boy is one Ph.D candidate.; 个人分类: 感悟生活|1716 次阅读|0 个评论

why hybrid? on machine learning vs. hand-coded rules in NLP: 热度 1 liwei999 2011-10-8 04:00; Before we start discussing the topic of a hybrid NLP (Natural Language Processing) system, let us look at the concept of hybrid from our life experiences. I was driving a classical Camry for years and had never thought of a change to other brands because as a vehicle, there was really nothing to complain. Yes, style is old but I am getting old too, who beats whom? Until one day a few years ago when we needed to buy a new car to retire my damaged Camry. My daughter suggested hybrid, following the trend of going green. So I ended up driving a Prius ever since and fallen in love with it. It is quiet, with bluetooth and line-in, ideal for my iPhone music enjoyment. It has low emission and I finally can say bye to smog tests. It at least saves 1/3 gas. We could have gained all these benefits by purchasing an expensive all-electronic car but I want the same feel of power at freeway and dislike the concept of having to charge the car too frequently. Hybrid gets the best of both worlds for me now, and is not that more expensive. Now back to NLP. There are two major approaches to NLP, namely machine learning and grammar engineering (or hand-crafted rule system). As mentioned in previous posts, each has its own strengths and limitations, as summarized below. In general, a rule system is good at capturing a specific language phenomenon (trees) while machine learning is good at representing the general picture of the phenomena (forest). As a result, it is easier for rule systems to reach high precision but it takes a long time to develop enough rules to gradually raise the recall. Machine learning, on the other hand, has much higher recall, usually with compromise in precision or with a precision ceiling. Machine learning is good at simple, clear and coarse-grained task while rules are good at fine-grained tasks. One example is sentiment extraction. The coarse-grained task there is sentiment classification of documents (thumbs-up thumbs down), which can be achieved fast by a learning system. The fine-grained task for sentiment extraction involves extraction of sentiment details and the related actionable insights, including association of the sentiment with an object, differentiating positive/negative emotions from positive/negative behaviors, capturing the aspects or features of the object involved, decoding the motivation or reasons behind the sentiment,etc. In order to perform sophisticated tasks of extracting such details and actionable insights, rules are a better fit. The strength for machine learning lies in its retraining ability. In theory, the algorithm, once developed and debugged, remains stable and the improvement of a learning system can be expected once a larger and better quality corpus is used for retraining (in practice, retraining is not always easy: I have seen famous learning systems deployed in client basis for years without being retrained for various reasons). Rules, on the other hand, need to be manually crafted and enhanced. Supervised machine learning is more mature for applications but it requires a large labelled corpus. Unsupervised machine learning only needs raw corpus, but it is research oriented and more risky in application. A promising approach is called semi-supervised learning which only needs a small labelled corpus as seeds to guide the learning. We can also use rules to generate the initial corpus or seeds for semi-supervised learning. Both approaches involve knowledge bottlenecks. Rule systems's bottleneck is the skilled labor, it requires linguists or knowledge engineers to manually encode each rule in NLP, much like a software engineer in the daily work of coding. The biggest challenge to machine learning is the sparse data problem, which requires a very large labelled corpus to help overcome. The knowledge bottleneck for supervised machine learning is the labor required for labeling such a large corpus. We can build a system to combine the two approaches to complement each other. There are different ways of combining the two approaches in a hybrid system. One example is the practice we use in our product, where the results of insights are structured in a back-off model: high precision results from rules are ranked higher than the medium precision results returned by statistical systems or machine learning. This helps the system to reach configurable balance between precision and recall. When labelled data are available (e.g. the community has already built the corpus, or for some tasks, the public domain has the data, e.g. sentiment classification of movie reviews can use the review data with users' feedback on 5-star scale), and when the task is simple and clearly defined, using machine learning will greatly speed up the development of a capability. Not every task is suitable for both approaches. (Note that suitability is in the eyes of beholder: I have seen many passionate ML specialists willing to try everything in ML irrespective of the nature of the task: as an old saying goes, when you have a hammer, everything looks like a nail.) For example, machine learning is good at document classification whilerules are mostly powerless for such tasks. But for complicated tasks such as deep parsing, rules constructed by linguists usually achieve better performance than machine learning. Rules also perform better for tasks which have clear patterns, for example, identifying data items like time,weight, length, money, address etc. This is because clear patterns can be directly encoded in rules to be logically complete in coverage while machine learning based on samples still has a sparse data challenge. When designing a system, in addition to using a hybrid approach for some tasks, for other tasks, we should choose the most suitable approach depending on the nature of the tasks. Other aspects of comparison between the two approaches involve the modularization and debugging in industrial development. A rule system can be structured as a pipeline of modules fairly easily so that a complicated task is decomposed into a series of subtasks handled by different levels of modules. In such an architecture, a reported bug is easy to localize and fix by adjusting the rules in the related module. Machine learning systems are based on the learned model trained from the corpus. The model itself, once learned, is often like a black-box (even when the model is represented by a list of symbolic rules as results of learning, it is risky to manually mess up with the rules in fixing a data quality bug). Bugs are supposed to be fixable during retraining of the model based on enhanced corpus and/or adjusting new features. But re-training is a complicated process which may or may not solve the problem. It is difficultto localize and directly handle specific reported bugs in machine learning. To conclude, due to the complementary nature for pros/cons of the two basic approaches to NLP, a hybrid system involving both approaches is desirable, worth more attention and exploration. There are different ways of combining the two approaches in a system, including a back-off model using rulles for precision and learning for recall, semi-supervised learning using high precision rules to generate initial corpus or “seeds”, etc.. Related posts： Comparison of Pros and Cons of Two NLP Approaches Is Google ranking based on machine learning ? 《立委随笔：语言自动分析的两个路子》《立委随笔：机器学习和自然语言处理》【置顶：立委科学网博客NLP博文一览（定期更新版）】; 个人分类: 立委科普|8723 次阅读|1 个评论

new: jiangdm 2011-10-7 23:14; new; 个人分类: CHI|2562 次阅读|0 个评论

美国的课外补习学校Sylvan Learning在Danbury: 热度 1 黄安年 2011-10-2 21:39; 美国的课外补习学校 Sylvan Learning 在 Danbury 黄安年文黄安年的博客 /2011 年 10 月 2 日 ( 美东时间 ) 发布美国的 Sylvan Learning C enters 在美国、加拿大等地有 900 多处 ( 请见以下简介 ) ，我把她称为美国的课外补习学校，在 Danbury ， CT 也有这样一所学校，设备和条件都不错 , 学校老师是在附近学校任教的一线有经验教师中兼任的，来这里补习是一对一的，一次课程 2 个小时 , 30 美元，相当于人民币 200 元 , 和现在国内教教一对一需要 100 元一小时 , 绝对价格相等，可见过内的家教费要高得多 , 而且环境远不如美国。国内的家教市场需要逐步走上规模化是大趋势。目前像学而思的网络化教学就很有影响。 ––– -------------------------------------------------------------------------------------------------------- Sylvan Learning is the leading provider of tutoring and supplemental education services to students of all ages and skill levels. At Sylvan, our warm and caring tutors tailor individualized learning plans that build the skills, habits and attitudes students need to succeed in school and in life. Affordable tutoring instruction is available in math, reading, writing, study skills, homework help, test prep and more at more than 900 learning centers in the United States, Canada and abroad. http://tutoring.sylvanlearning.com/index.cfm About Our Tutoring Programs At our centers, Sylvan trained and certified instructors provide highly personalized instruction in reading , math , writing , study skills , homework help , SAT*/ACT prep and state test prep . With Sylvan’s proven process and teaching methods, our students build the lasting skills, habits and attitudes they need to succeed in school - and in life. http://tutoring.sylvanlearning.com/sylvan_about_us.cfm; 个人分类: 教育改革思考(07-11)|6959 次阅读|1 个评论

review: 马尔可夫逻辑网络研究: jiangdm 2011-9-24 22:51; 马尔可夫逻辑网络研究徐从富, 郝春亮, 苏保君, 楼俊杰软件学报， 2011 摘要: 马尔可夫逻辑网络是将马尔可夫网络与一阶逻辑相结合的一种统计关系学习模型,在自然语言处理、复杂网络、信息抽取等领域都有重要的应用前景.较为全面、深入地总结了马尔可夫逻辑网络的理论模型、推理、权重和结构学习,最后指出了马尔可夫逻辑网络未来的主要研究方向. 关键词: Markov 逻辑网;统计关系学习;概率图模型;推理;权重学习;结构学习 the hard work of AI: 如何有效地处理复杂性和不确定性等问题? -- 统计关系学习(statistical relational learning, SRL) -- 概率图模型(probabilistic graphical model,PGM) 统计关系学习：通过集成关系/逻辑表示、概率推理、不确定性处理、机器学习和数据挖掘等方法,以获取关系数据中的似然模型概率图模型：一种通用化的不确定性知识表示和处理方法 -- 贝叶斯网络(Bayesian networks) -- 隐马尔可夫模型(hidden Markov model) -- 马尔可夫决策过程(Markov decision process) -- 神经网络(neural network) idea: 统计关系学习(尤其是关系/逻辑表示) + 概率图模型马尔可夫逻辑网络(Markov logic networks）= Markov网 + 一阶逻辑 Markov 网常用近似推理算法： 1 Markov 逻辑网 1.1 Markov网和一阶逻辑 Markov 网: Markov 随机场(Markov random field,MRF) 1.2 Markov逻辑网的定义和示例定义: Markov 逻辑网 2 Markov 逻辑网的推理 -- 概率图模型推理的基本问题: 计算边缘概率、条件概率以及对于最大可能存在状态的推理 -- Markov 逻辑网推理: 生成的闭Markov 网 2.1 最大可能性问题 MaxWalkSAT 算法 LazySAT 算法 2.2 边缘概率和条件概率概率图模型一重要推理形式: 计算边缘概率和条件概率,通常采用MCMC 算法、BP 算法等 3 Markov 逻辑网的学习 3.1 参数学习 3.1.1 伪最大似然估计 3.1.2 判别训练训练Markov 逻辑网权重的高效算法: VP(voted perceptron)算法、CD(contrastive divergence)算法 3.2 结构学习 3.2.1 评价标准结构学习两个难题: 一是如何搜索潜在的结构; 二是如何为搜索到的结构建立起一个评价标准,即如何筛选出最优结构. 3.2.2 自顶而下的结构学习 4 算法比较和分析 4.1 与基于Bayesian网的统计关系学习算法比较基于 Bayesian 网的SRL 算法: 传统Bayesian 网的基础上进行扩展的SRL 方法 4.2 与基于随机文法的统计关系学习算法比较 4.3 与基于HMM的统计关系学习算法比较 5 Markov 逻辑网的应用 5.1 应用概况 5.2 应用举例 6 述评 Markov 逻辑网: 1) 将传统的一阶谓词逻辑与当前主流的统计学习方法有机地结合起来 2) 填补了AI 等领域中存在的高层与底层之间的巨大鸿沟. -- 一阶谓词逻辑更适用于高层知识的表示与推理 -- 基于概率统计的机器学习方法则擅长于对底层数据进行统计学习. Open problem: (1) 增强算法的学习能力,使其可以从缺值数据中学习; (2) 提高真值闭从句的计算速度,解决结构学习算法效率的瓶颈问题; (3) 从一阶逻辑和Markov 网这两个方面完善Markov 逻辑网的理论; (4) 增强Markov 逻辑网模型的实用性,从而更好地解决实际应用问题. 马尔可夫链蒙特卡洛(Markov chain Monte Carlo,简称MCMC)方法信念传播算法推理学习个人点评：没看懂，可关注,问其与 Bayesian Network什么关系呢？马尔可夫逻辑网络研究.pdf; 个人分类: AI & ML|7274 次阅读|0 个评论

review: 李群机器学习研究综述: 热度 1 jiangdm 2011-9-24 21:56; 《李群机器学习研究综述》，李凡长　何书萍　钱旭培计算机学报，2010 摘　要　文中简述了李群机器学习的相关研究内容，包括李群机器学习的概念、公理假设、代数学习模型、几何学习模型、Dynkin图的几何学习算法、量子群、辛群分类器的设计、轨道生成学习算法等．关键词:　李群机器学习；公理假设；李群；分类器李群机器学习（Lie Group Machine Learning, LML) 李群机器学 vs 流形学习个人点评：文章作为综述，弱了。文章层次和分类不清楚，首先摘要没写好。李群机器学习研究综述.pdf; 个人分类: AI & ML|4995 次阅读|1 个评论

review: ELIQoS: 一种高效节能、与位置无关的传感器网络服务质量: jiangdm 2011-9-19 23:26; ELIQoS: 一种高效节能、与位置无关的传感器网络服务质量毛莺池龚海刚刘明陈道蓄谢立计算机研究与发 2006 摘要如何保证在覆盖足够的监测区域的同时延长网络的寿命是无线传感器网络所面临的最重要问题之一，广泛采用的策略是选出工作节点以满足应用期望的服务质量（即覆盖率），同时关闭其他冗余节点L分析了随机部署网络在已知监测区域大小和节点感知范围情况下，无需节点位置信息，应用期望的服务质量与所需的工作节点数量之间的数学关系L在此基础上提出了一种高效节能、与位置无关的传感器网络服务质量协议（ELIQoS），协议根据节点能量大小，选取最少的工作节点满足应用期望的服务质量L实验结果表明，ELIQoS协议不仅可以有效地提供满足应用期望的服务质量，而且可以减少能量消耗，实现能耗负载均衡关键词: 无线传感器网络；服务质量；覆盖；节能；状态调度 ELIQoS 一种高效节能、与位置无关的传感器网络服务质量协议.pdf; 个人分类: Network|1 次阅读|0 个评论

成都vlpr2011总结: 热度 1 guhuxiang 2011-9-2 02:14; 8.16日就返回北京了，但是因为各种事情没有来得及总结这次成都VLPR2011之行，又到月末总结时，顺便回顾一下，主要从时间流水帐、知识回顾、心得体会、感谢四个方面。成都vlpr2011官方网站： http://www.uestcrobot.net/vlpr2011/program.htm 。第一部分：时间流水帐。由于开始申请要求是有一篇顶级会议或期刊的论文，所以没申请，后来改为一般会议期刊或者牛导师推荐，6月28号左右我和实验室另外两个同学开始申请，主要准备简历和推荐信、研究方向简述三方面。7月13号收到录取通知，不过被gmail当成垃圾邮件了，幸好在deadline7月15号晚上两点左右发现了，没有错过这次难忘之旅。 8月5号与倆外三个参会着坐动车从北京出发前往成都，6号上午11点到电子科大，1点左右注册，发现同去的自动化所的有14个人，算是除了电子科大，参会人数最多的机构了。这次summer school总共100个正式学员，加交钱参会的和志愿者一共150人左右。下午去宽窄巷逛了一次，然后在那边喝茶休息，然后最大的感受就是成都适合中老年生活，用同去的人的话说就是宠物狗都懒得动弹，最后统计十条以上的狗发现还真没有一条活动的宠物狗。晚上七点坐车前往清水河校区，办理入住手续，很累，睡觉。 8月7号到13号正式上课，每天上午八点半到十二点，一般拖堂十五分钟，下午一点半到六点。都是大牛，不能睡觉，一天八个多小时的高强度的学习，人很疲惫。好在有上下午都有十五分钟的茶歇，能休息一下，问大牛一些问题，认识一般同学，吃点东西，四全其美。晚上要做team project，讨论、做PPT等，一天下来的确很累。到最后两三天，人数成指数下降，好一些人去成都峨眉山、都江堰等地方完了，有的睡觉、访友去了，到13号一帮小牛们讲课的时候不到一半人了。 7号晚上有一个迎新自助宴会，13号晚上结束宴会，期间颁team project的奖，还第一次近距离接触了功夫茶、川剧、变脸等。14号上午在大熊猫基地第一次近距离接触了传说中的国宝熊猫，的确萌的不行! ，还无意中发现了一只几乎没人发现的懒熊猫。下午参观了金沙遗址，晚上第一第一次进了酒吧：音乐房子，传说成都的最好的酒吧？第二部分：知识回顾。这一部分不好写，因为我是研一的菜鸟新人，在参会的正式学员里面我年级可能是最小的之一；加之七天下来，堆在脑袋里的东西太多了，所以很难总结。我从几个我感兴趣的几个方向挑几个写写，希望老鸟們不吝赐教。这次参会的大牛们有 Stefano Soatto 、 Mubarak Shah 、 ma yi等，详情参考： http://www.uestcrobot.net/vlpr2011/speakers.htm ，很遗憾没有见到女强人feifei lee。我从三个方面进行知识回顾。 1、综述。这次vlpr2011的主题多元化了，不仅仅局限在vision领域，还包括了图像检索、多媒体、社会计算等。由于互联网数据量的急剧增加，现在大规模搜索貌似越来越火了，来的9位大牛就有4位跟大规模数据相关，在其中我还听到了推荐系统这个熟悉的东东。Jiebo Luo在讲课中提到了几句比较有意思的话：1、data is king! 2、KNN is all you need! 3、No reason, just because my database is bigger than you! 从这些话里面可以看出现在的大规模数据的确算是霸气外露了！其他的就是1、mayi 的low-rank matrics、sparse等，是这次summer school的高潮部分了，茶歇的时候还问问题的成群结队，可怜的ma只吃了一块西瓜……； 2、ji qiang的Propabilistic graphical model； 3、zhuowen tu 的监督、半监督学习等； 4、Mubarak Shah的visual crowdsurveillance； 5、Stefano Soatt 的group theory，水平不行，很遗憾没有听懂。 2、感兴趣的部分。9号mayi一个人讲了一整天，他的low-rank matrics方面的工作的确做的非常出色，讲解的也非常富有激情，算是收益最大的一天，可惜到现在还没提供lecture的ppt，甚遗憾！其中他反复提到了他原来的那个附有争议的惊人结论：特征选择在高维空间可能不再重要是有前提的。他还提到他的数据集不方便公开，但是可以帮忙测试，以后有着方面需求的尽管找他，嘿嘿。他的工作这里： http://www.cvchina.info/2010/06/01/sparse-representation-vector-matrix-tensor-1/ 总结的很好，我就不再赘述。ji qiang原来来NLPR做过一次讲座，所以这次讲座听的算是最懂的一次讲座了，虽然用两个小时讲了一个学期的东西。他的演讲文档在这： http://www.uestcrobot.net/vlpr2011/program.htm .结构性非常好的一个报告，以后有时间可以尝试用图画出主要的内容来。Zhuowen Tu的研究方向可能是我最感兴趣的部分了，讲的也非常有逻辑性，娓娓道来，可惜那天实在太困了，只听了大概。 3、workshop的报告。计算所的Shiguang Shan 总结了Face recognition方面的工作，他们组也做了两篇比较有意思的文章。其中的Half Face Recognition让人耳目一新。曾经上过Shan老师的人机交互的课，这次在会议上也聊的不错，推荐感兴趣的可以去他们组，呵呵。Junsong Yuan总结了一下他们组的工作，Human Action Detection and Recognition方面的工作也做的很漂亮，还有video pattern mining也比较有意思。西电的Xinbo Gao讲了Sketch-photo Recognition，比较有意思。这些牛们的报告都可以在这找到： http://www.uestcrobot.net/vlpr2011/program.htm 。第三部分：心得体会。 1、这次成都VLPR2011之行最大的收获可能就是近距离接触一些大牛们，了解他们的研究方向、研究方式了。Ma老师算这次会议的主角了，不过讲的的确很好，无论是他做的报告还是paneldiscussion上面的告诫之言。前段时间有几个大牛们闲来无事在ICCV上面发了一篇关于数据库有偏性的workshop文章，引起轰动。在paneldiscussion上面有人问到这个问题，他的解答就是没有无偏的数据库，只有有偏的科研人，的确经典！ Zhuowen Tu也给我很深刻的印象，低调的华丽，总有一种让人安静下来的思考的魅力，以后也许我要学学他那种安静下来思考，把问题搞通弄透的执着！ just name a few。 2、认识一些牛逼的师兄师姐们。去那边的人都比你牛，学员里面有老师、工作的人、各个年级的博士，自己趴在牛人堆里面学习是一种愉快的经历。还认识一个北航的校友，因为错过申请时间跑过去交钱听课！一起同行的跟我同一年级的一北大哥们自己创办了公司、去年一年还跟师兄合作发了十五六篇论文，包括IJCV、IJCAI、MM等牛文，不得不让人感慨牛人各有各的牛法，不牛的人各有各的理由！ 3、以后选择研究方向尽量考虑两个方向，要么选择一门技术，然后通吃应用，比方Ma yi的sparse，Zhuowen Tu通透机器学习各种算法；要么选择一门很有意思的应用，尝试各种算法。不要换题目，做深做到同一方向你是最牛的几个人之一或者你们组是最牛的组之一，这样才能混的好点。不要发水文，发一篇有影响力的文章即可。 for i=1:inf 一流的老师-招一流的学生-选择一流的方向-营造一流的环境 end 4、英语能力的重要性。7号那天自助餐，在车上跟一个MIT的看上去像中国人的哥们聊天，我用蹩脚的英语问他能否讲中文，他用流利的汉语回答说可以，然后普通话说的比我还标准，瀑汗。后来知道他爷爷在上海，父母去了美国，所以他英语和汉语都很好，所以感觉他吃的很开，在最后的team project里也很占优势。有一组由于是用中文讲的，虽然讲的很有轰动性，结果只拿了安慰讲，的确很悲惨。我在回答问题的时候用英文讲了老半天，问问题的人直接说没听懂，让我汗颜…… 第四部分：感谢。 team project就是把一百位正式学院分成十组，每组十人，但有些组员因为各种原因没参加，一般每组有七到八个左右。每个组在三个晚上讨论一个创新的方案，越有创意的越好。我们的主题是Easy conference，怎么让以后的会议变得更加人性化，让参会者更容易更多的获取各方面的信息。最后的team project我们组拿到了第一名，3000的奖金，的确出乎我的意外。这是我们组领奖时的合照。在这要感谢项翔师兄（右五）对我提出idea的补充，然后在演示时电脑崩溃时，非常及时的站出来用流利的英语介绍了整个Idea，然后还组织了一次情景扮演，完善PPT等等，是我们组获胜的最大功臣。也要感谢冯柏岚师兄（左五）的一针见血般的建议；笑星蔡烜（左六）、陈见耸师兄（左一）和刘颖璐师姐的搞笑演出；敖欢欢（左四）和张志梅老师（左六）一直以来的工作和建议。整体来说，这次VLPR之行还是收获蛮大的，有机会的话可以去去，应该会不虚此行的。; 6794 次阅读|1 个评论

review: Knowledge Discovery in Databases: An Overview: jiangdm 2011-8-4 15:52; 《Knowledge Discovery in Databases: An Overview》， William J. Frawley, Gregory Piatetsky-Shapiro, and Christopher J. Matheus， AAAI ，1992 Abstract： After a decade of fundamental interdisciplinary research in machine learning, the spadework in this field has been done; the 1990s should see the widespread exploitation of knowledge discovery as an aid to assembling knowledge bases. The contributors to the AAAI Press book \emph{Knowledge Discovery in Databases} were excited at the potential benefits of this research. The editors hope that some of this excitement will communicate itself to AI Magazine readers of this article the goal of this article： This article presents an overview of the state of the art in research on knowledge discovery in databases. We analyze Knowledge Discovery and define it as the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. We then compare and contrast database, machine learning, and other approaches to discovery in data. We present a framework for knowledge discovery and examine problems in dealing with large, noisy databases, the use of domain knowledge, the role of the user in the discovery process, discovery methods, and the form and uses of discovered knowledge. We also discuss application issues, including the variety of existing applications and propriety of discovery in social databases. We present criteria for selecting an application in a corporate environment. In conclusion, we argue that discovery in databases is both feasible and practical and outline directions for future research, which include better use of domain knowledge, efficient and incremental algorithms, interactive systems, and integration on multiple levels. 个人点评：一篇老些的经典数据挖掘综述，个人认为本文两个入脚点：一是Machine Learning (Table 1,2)，二是文中 Figure 1 Knowledge Discovery in Databases Overview.pdf beamer_Knowledge_Discovery_Database_Overview.pdf beamer_Knowledge_Discovery_Database_Overview.tex; 个人分类: AI & ML|1 次阅读|0 个评论

[转载]Slides of Machine Learning Summer School: timy 2011-6-17 10:36; From: http://mlss2011.comp.nus.edu.sg/index.php?n=Site.Slides MLSS 2011 Machine Learning Summer School 13-17 June 2011, Singapore Slides Speaker Topic Slides Chiranjib Bhattacharyya Kernel Methods Slides ( pdf ) Wray Buntine Introduction to Machine Learning Slides ( pdf ) Zoubin Ghahramani Gaussian Processes, Graphical Model Structure Learning Slides (Part 1 pdf , Part 2 pdf , Part 3 pdf ) Stephen Gould Markov Random Fields for Computer Vision Slides (Part 1 pdf , Part 2 pdf , Part 3 pdf )]] Marko Grobelnik How We Represent Text? ...From Characters to Logic Slides ( pptx ) David Hardoon Multi-Source Learning; Theory and Application Slides ( pdf ) Mark Johnson Probabilistic Models for Computational Linguistics Slides (Part 1 pdf , Part 2 pdf , Part 3 pdf ) Wee Sun Lee Partially Observable Markov Decision Processes Slides ( pdf , pptx ) Hang Li Learning to Rank Slides ( pdf ) Sinno Pan Qiang Yang Transfer Learning Slides (Part1 pptx Part 2 pdf ) Tomi Silander Introduction to Graphical Models Slides ( pdf ) Yee Whye Teh Bayesian Nonparametrics Slides ( pdf ) Ivor Tsang Feature Selection using Structural SVM and its Applications Slides ( pdf ) Max Welling Learning in Markov Random Fields Slides ( pdf , pptx ); 个人分类: 机器学习|4265 次阅读|0 个评论

[转载]Classical Paper List on ML and NLP: wqfeng 2011-3-25 12:40; Classical Paper List on Machine Learning and Natural Language Processing from Zhiyuan Liu Hidden Markov Models Rabiner, L. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. (Proceedings of the IEEE 1989) Freitag and McCallum, 2000, Information Extraction with HMM Structures Learned by Stochastic Optimization, (AAAI'00) Maximum Entropy Adwait R. A Maximum Entropy Model for POS tagging, (1994) A. Berger, S. Della Pietra, and V. Della Pietra. A maximum entropy approach to natural language processing. (CL'1996) A. Ratnaparkhi. Maximum Entropy Models for Natural Language Ambiguity Resolution. PhD thesis, University of Pennsylvania, 1998. Hai Leong Chieu, 2002. A Maximum Entropy Approach to Information Extraction from Semi-Structured and Free Text, (AAAI'02) MEMM McCallum et al., 2000, Maximum Entropy Markov Models for Information Extraction and Segmentation, (ICML'00) Punyakanok and Roth, 2001, The Use of Classifiers in Sequential Inference. (NIPS'01) Perceptron McCallum, 2002 Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms (EMNLP'02) Y. Li, K. Bontcheva, and H. Cunningham. Using Uneven-Margins SVM and Perceptron for Information Extraction. (CoNLL'05) SVM Z. Zhang. Weakly-Supervised Relation Classification for Information Extraction (CIKM'04) H. Han et al. Automatic Document Metadata Extraction using Support Vector Machines (JCDL'03) Aidan Finn and Nicholas Kushmerick. Multi-level Boundary Classification for Information Extraction (ECML'2004) Yves Grandvalet, Johnny MariÃ , A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification. (NIPS' 05) CRFs J. Lafferty et al. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. (ICML'01) Hanna Wallach. Efficient Training of Conditional Random Fields. MS Thesis 2002 Taskar, B., Abbeel, P., and Koller, D. Discriminative probabilistic models for relational data. (UAI'02) Fei Sha and Fernando Pereira. Shallow Parsing with Conditional Random Fields. (HLT/NAACL 2003) B. Taskar, C. Guestrin, and D. Koller. Max-margin markov networks. (NIPS'2003) S. Sarawagi and W. W. Cohen. Semi-Markov Conditional Random Fields for Information Extraction (NIPS'04) Brian Roark et al. Discriminative Language Modeling with Conditional Random Fields and the Perceptron Algorithm (ACL'2004) H. M. Wallach. Conditional Random Fields: An Introduction (2004) Kristjansson, T.; Culotta, A.; Viola, P.; and McCallum, A. Interactive Information Extraction with Constrained Conditional Random Fields. (AAAI'2004) Sunita Sarawagi and William W. Cohen. Semi-Markov Conditional Random Fields for Information Extraction. (NIPS'2004) John Lafferty, Xiaojin Zhu, and Yan Liu. Kernel Conditional Random Fields: Representation and Clique Selection. (ICML'2004) Topic Models Thomas Hofmann. Probabilistic Latent Semantic Indexing. (SIGIR'1999). David Blei, et al. Latent Dirichlet allocation. (JMLR'2003). Thomas L. Griffiths, Mark Steyvers. Finding Scientific Topics. (PNAS'2004). POS Tagging J. Kupiec. Robust part-of-speech tagging using a hidden Markov model. (Computer Speech and Language'1992) Hinrich Schutze and Yoram Singer. Part-of-Speech Tagging using a Variable Memory Markov Model. (ACL'1994) Adwait Ratnaparkhi. A maximum entropy model for part-of-speech tagging. (EMNLP'1996) Noun Phrase Extraction E. Xun, C. Huang, and M. Zhou. A Unified Statistical Model for the Identification of English baseNP. (ACL'00) Named Entity Recognition Andrew McCallum and Wei Li. Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-enhanced Lexicons. (CoNLL'2003). Moshe Fresko et al. A Hybrid Approach to NER by MEMM and Manual Rules, (CIKM'2005). Chinese Word Segmentation Fuchun Peng et al. Chinese Segmentation and New Word Detection using Conditional Random Fields, COLING 2004. Document Data Extraction Andrew McCallum, Dayne Freitag, and Fernando Pereira. Maximum entropy Markov models for information extraction and segmentation. (ICML'2000). David Pinto, Andrew McCallum, etc. Table Extraction Using Conditional Random Fields. SIGIR 2003. Fuchun Peng and Andrew McCallum. Accurate Information Extraction from Research Papers using Conditional Random Fields. (HLT-NAACL'2004) V. Carvalho, W. Cohen. Learning to Extract Signature and Reply Lines from Email. In Proc. of Conference on Email and Spam (CEAS'04) 2004. Jie Tang, Hang Li, Yunbo Cao, and Zhaohui Tang, Email Data Cleaning, SIGKDD'05 P. Viola, and M. Narasimhan. Learning to Extract Information from Semi-structured Text using a Discriminative Context Free Grammar. (SIGIR'05) Yunhua Hu, Hang Li, Yunbo Cao, Dmitriy Meyerzon, Li Teng, and Qinghua Zheng, Automatic Extraction of Titles from General Documents using Machine Learning, Information Processing and Management, 2006 Web Data Extraction Ariadna Quattoni, Michael Collins, and Trevor Darrell. Conditional Random Fields for Object Recognition. (NIPS'2004) Yunhua Hu, Guomao Xin, Ruihua Song, Guoping Hu, Shuming Shi, Yunbo Cao, and Hang Li, Title Extraction from Bodies of HTML Documents and Its Application to Web Page Retrieval, (SIGIR'05) Jun Zhu et al. Mutual Enhancement of Record Detection and Attribute Labeling in Web Data Extraction. (SIGKDD 2006) Event Extraction Kiyotaka Uchimoto, Qing Ma, Masaki Murata, Hiromi Ozaku, and Hitoshi Isahara. Named Entity Extraction Based on A Maximum Entropy Model and Transformation Rules. (ACL'2000) GuoDong Zhou and Jian Su. Named Entity Recognition using an HMM-based Chunk Tagger (ACL'2002) Hai Leong Chieu and Hwee Tou Ng. Named Entity Recognition: A Maximum Entropy Approach Using Global Information. (COLING'2002) Wei Li and Andrew McCallum. Rapid development of Hindi named entity recognition using conditional random fields and feature induction. ACM Trans. Asian Lang. Inf. Process. 2003 Question Answering Rohini K. Srihari and Wei Li. Information Extraction Supported Question Answering. (TREC'1999) Eric Nyberg et al. The JAVELIN Question-Answering System at TREC 2003: A Multi-Strategh Approach with Dynamic Planning. (TREC'2003) Natural Language Parsing Leonid Peshkin and Avi Pfeffer. Bayesian Information Extraction Network. (IJCAI'2003) Joon-Ho Lim et al. Semantic Role Labeling using Maximum Entropy Model. (CoNLL'2004) Trevor Cohn et al. Semantic Role Labeling with Tree Conditional Random Fields. (CoNLL'2005) Kristina toutanova, Aria Haghighi, and Christopher D. Manning. Joint Learning Improves Semantic Role Labeling. (ACL'2005) Shallow parsing Ferran Pla, Antonio Molina, and Natividad Prieto. Improving text chunking by means of lexical-contextual information in statistical language models. (CoNLL'2000) GuoDong Zhou, Jian Su, and TongGuan Tey. Hybrid text chunking. (CoNLL'2000) Fei Sha and Fernando Pereira. Shallow Parsing with Conditional Random Fields. (HLT-NAACL'2003) Acknowledgement Dr. Hang Li , for original paper list.; 个人分类: 模式识别|3020 次阅读|0 个评论

[转载]Clustering - Encyclopedia of Machine Learning (2010): 热度 1 timy 2011-2-14 23:47; From: http://www.springerlink.com/content/g37847m78178l645/fulltext.html Encyclopedia of Machine Learning Springer Science+Business Media, LLC2011 10.1007/978-0-387-30164-8_124 ClaudeSammut and GeoffreyI.Webb Clustering Clustering is a type of unsupervised learning in which the goal is to partition a set of examples into groups called clusters. Intuitively, the examples within a cluster are more similar to each other than to examples from other clusters. In order to measure the similarity between examples, clustering algorithms use various distortion or distance measures . There are two major types clustering approaches: generative and discriminative. The former assumes a parametric form of the data and tries to find the model parameters that maximize the probability that the data was generated by the chosen model. The latter represents graph-theoretic approaches that compute a similarity matrix defined over the input data. Cross References Categorical Data Clustering Cluster Editing Cluster Ensembles Clustering from Data Streams Constrained Clustering Consensus Clustering Correlation Clustering Cross-Language Document Clustering Density-Based Clustering Dirichlet Process Document Clustering Evolutionary Clustering Graph Clustering k-means clustering k-mediods clustering Model-Based Clustering Partitional Clustering Projective Clustering Sublinear Clustering; 个人分类: 机器学习|3319 次阅读|2 个评论

[转载]Symp. on Learning Language Models from Multilingual: timy 2011-1-11 22:48; From: http://www.cs.york.ac.uk/aig/LLMMC/ Symposium on Learning Language Models from Multilingual Corpora (LLMMC) Part of the AISB 2011 Convention , 4-7 April 2011. Call for Papers International organizations like the UN and the EU, news agencies, and companies operating internationally are producing large volumes of texts in different languages. As a result, large publicly-available parallel paragraph- or sentence-aligned corpora have been created for many language pairs, e.g., French-English, Chinese-English or Arabic-English. The multilingual nature of the EU has given rise to many documents available in all or many of its official languages, which have been been assembled in multi-lingual parallel corpora such as Europarl (11 languages, 34-55M words for each) and JRC-Acquis (22 languages, 11-22M words for each). These parallel corpora have been used, both monolingually and multilingually, for a variety of NLP tasks, including but not limited to machine translation, cross-lingual information retrieval, word sense disambiguation, semantic relation extraction, named entity recognition, POS tagging, and syntactic parsing. With the advent of Internet, there has been also an explosion in the availability of semi-parallel multilingual online resources like Wikipedia that have been used for similar tasks and have a big potential for future exploration and research. In this symposium, we are interested in explicit models, usable and verifiable by humans, which could be used for either translation or for modelling individual languages, e.g., as applied to morphology, where the available translations can help identify word forms of the same lexical entry in a given language, or lexical semantics, where parallel corpora can help extract instances of relations like synonymy and hypernymy, which are essential for building thesauri and ontologies. The main purpose of the symposium will be to gather and disseminate the best ideas in this new area. Thus, we welcome review and position papers alongside original submissions. A considerable part of this one-day symposium will be dedicated to discussions to encourage the formations of new collaborations and consortia. Duration: a one-day symposium. Important Dates: Call for papers: December 13, 2010 Submissions: January 19, 2011 Notification: February 14, 2011 Submission of camera-ready versions: February 28, 2011 Symposium: April 6, 2011 Organizers: Dimitar Kazakov, The University of York, UK (kazakov AT cs DOT york DOT ac DOT uk) Preslav Nakov, National University of Singapore, Singapore (preslav DOT nakov AT gmail DOT com) Ahmad R. Shahid, The University of York, UK (ahmad AT cs DOT york DOT ac DOT uk) Program Committee: Graeme Blackwood, University of Cambridge, UK Phil Blunsom, University of Oxford, UK Francis Bond, Nanyang Technological University, Singapore Yee-Seng Chan, University of Illinois at Urbana-Champaign, USA Daniel Dahlmeier, National University of Singapore, Singapore Marc Dymetman, Xerox Research Centre Europe, France Andreas Eisele, Directorate-General for Translation, Luxembourg Michel Galley, Stanford University, USA Kuzman Ganchev, University of Pennsylvania, USA Corina R Girju, University of Illinois at Urbana-Champaign, USA Philipp Koehn, University of Edinburgh, UK Krista Lagus, Aalto University School of Science and Technology, Finland Wei Lu, National University of Singapore, Singapore Elena Paskaleva, Bulgarian Academy of Sciences, Bulgaria Katerina Pastra, Institute for Language and Speech Processing, Greece Khalil Sima'an, University of Amsterdam, The Netherlands Ralf Steinberger, Joint Research Centre, Italy Joerg Tiedemann, Uppsala University, Sweden Marco Turchi, Joint Research Centre, Italy Jaakko Vyrynen, Aalto University School of Science and Technology, Finland; 个人分类: 同行交流|3256 次阅读|0 个评论

learning to rank on graph: petrelli 2010-9-29 11:03; 个人分类: mlj|45 次阅读|0 个评论

[转载]learning by doing: jianfengmao 2010-9-9 17:27; One must learn by doing the thing; for though you think you know it, you have no certainty until you try. Sophocles ~ 450 B.C.; 个人分类: R and Statistics|1927 次阅读|0 个评论

A Fast Algorithm for Learning a Ranking Function from Large-Scale Data Sets: petrelli 2010-9-6 17:15; 位置:E:\petrelli\study\ML\paper\PAMI @article{raykar2008fast, title={{A fast algorithm for learning a ranking function from large-scale data sets}}, author={Raykar, V.C. and Duraiswami, R. and Krishnapuram, B.}, journal={IEEE transactions on pattern analysis and machine intelligence}, volume={30}, number={7}, pages={1158--1170}, year={2008}, publisher={Citeseer} } 基本内容: 文章利用sigmoid funcion 来作为原问题的近似loss function, 然后利用conjugate gradient algortihm求解,直接求解效率很低,于是文章利用erfc函数做了个近似,得到一个快速算法. 贡献: 主要用来解决Preference ranking的问题. 文章提出的算法我感觉也蛮有局限性的,而且相比目前最快的算法,我觉得不会比他提出的差.; 个人分类: 科研笔记|86 次阅读|0 个评论

读The Elements of Statistical learning: petrelli 2010-9-5 20:29; Chapter 2 Overview of Supervised learning 2.1 几个常用且意义相同的术语: inputs在统计类的文献中,叫做predictors, 但经典叫法是independently variables,在模式识别中,叫做feature. outputs,叫做responses, 经典叫法是dependently variables. 2.2 给出了回归和分类问题的基本定义 2.3 介绍两类简单的预测方法: Least square 和 KNN: Least square产生的linear decision boundary的特点: low variance but potentially high bias; KNN wiggly and unstabla,也就是high variance and low bias. 这一段总结蛮经典: A large subset of the most popular techniques in use today are variants of these two simple procedures. In fact 1-nearest-neighbor, the simplest of all, captures a large percentage of the market for low-dimensional problems. The following list describes some ways in which these simple procedures have been enhanced: ~ Kernel methods use weights that decrease smoothly to zero with distance from the target point, ather than the eective 0=1 weights used by k-nearest neighbors. ~In high-dimensional spaces the distance kernels are modified to emphasize some variable more than others. ~Local regression fits linear models by locally weighted least squares rather than fitting constants locally. ~Linear models fit to a basis expansion of the original inputs allow arbitrarily complex models. ~Projection pursuit and neural network models consist of sums of non-linearly transformed linear models. 2.4 统计决策的理论分析看不进去,没怎么看懂,明天看新内容前再看一遍,今天看的内容 p35-p43. 2.5节讨论了local methods KNN在高维特征下的问题, 在维数增大的情况下,要选取r部分的样本,所需要的边长接近1,这样会导致variance非常高. 2.6节分为统计模型,监督学习介绍和函数估计的方法来介绍,统计模型给出一般问的统计概率模型,监督学习说明了用训练样例来拟合函数,函数估计介绍了常用的参数估计,选取使目标函数最大的参数作为估计. 2.7 介绍了structured regression methods,它能解决某些情况下不好解决的问题. 2.8 一些估计器的介绍: 2.8.1 通过roughness penalty, 实质就是regularized methods,通过penalty 项限制函数空间的复杂度. 2.8.2 kernel methods and local regression kermel function实际上和local neighbor方法类似,kernel反映了样本间的距离 2.8.3 basis functions and Dictionary methods 从dictionary中选出若干basis functions叠加作为得到的function. 单层前反馈神经网络和boosting 还有MARS,MART都属于这一类方法. 2.9 模型选择和bias, variance的折中往往模型的复杂度越高(例如regularizer控制项越小), bias越低但是variance越高. 造成训练错误率很低但是测试错误率很高. 反之亦然. 简图2.11 看到61页.主要讲了解回归问题的若干线性方法, 首先是基本回归问题,然后介绍多回归,多输出,接着说subset selection, forward stepwise/stagewise selection(两种的区别是后者更新时不会对其他变量做调整). 3.4 shrinkage methods 便是加入regularizer来smooth化,因为subset selection后的数据偏离散. 如果用平方则是ridge regression, 如果用绝对值就是lasso,还有一种变形least angle regression,和lasso很相关,明天再看看吧.也就是61页到97页的内容. 补充:3.3节对linear regression问题中约束对应的p-norm进行了分析,当p=1.2(文中q表示这里的p)是和elastic net penalty外形很相似,但事实上前者光滑,后者sharp(non-differentiable), (可微意味着无穷阶可导). 3.4节 Least Angle Regression(LAR),和lasso几乎相同,但是在非零取值为0时,相应的变量要从active set中移出,重新计算direction. 3.5节讨论了principal component regression 和partial least squares的方法, 应该可以理解为降维,将原来的d维数据映射到m(md)上面再求解. 3.6 讨论了selection 和 shrinkage方法的比较,貌似的优化的方向选择的不同; 3.7多元输出的selection和shrinkage 3.8Lasso更多的讨论和路径算法 : 基本的优化形式loss+penalty, loss 和penalty的不同造成了关于lasso之类的很多讨论. 另外有提到线性规划用单纯形法求解,记录一下怕将来需要看线性规划的东西没有方向. 3.9 计算代价的分析 Chapter 4 解分类问题的线性方法 4.1 介绍了线性决策边界为线性方法 4.2 indicator matrix的线性回归, 4.3 LDS linear discrimant analysis, 假设每一类为多元高斯分布如下, 在利用到概率密度给出分类的条件概率时,若概率密度函数中的协方差矩阵均相同就引出了LDA. 文章接着对LDA的各种情形和计算方式进行了讨论. 4.4 p137 明天重新过一遍,结束第4章; 个人分类: 科研笔记|6269 次阅读|0 个评论

Deep Learning: openmind 2010-7-28 14:56; 前几天一个R友提到deep learning. 因为我一直在关注NN的东西，所以，今天google了一把。NN，准确的说是ANN，难道开始因为Deep learning又热了？ NN 一直在 AI 领域兴风作浪，从1960年的Perceptron,到1986年的 BP 算法【在此期间，SVM等SLT也是硕果累累】，再到2006年 Hinton等人在DBN中成功运用了deep learnning之后，如今的NN又开热了。无容置疑，强大的计算能力使得DL成为可能。试问，NN 的下一个热点又将是什么呢？谁又是第二个Hinton呢? 看看，如何将你NN也DL一把？如此以来，grants 和 theses，papers 就有了，嘻嘻。更多可以参考： http://deeplearning.net/; 个人分类: 科普天地|536 次阅读|0 个评论

国际医学继续教育信息资源 BMJ Learning: xupeiyang 2010-7-16 06:28; 请见： http://learning.bmj.com/learning/channel-home.html Dear Prof Xu Peiyang, For every course you complete on BMJ Learning you can print a certificate of completion as proof for accreditation. The following courses have been recommended by other primary care doctors on BMJ Learning. If you are not working in primary care please update your details to ensure you only receive relevant communication. Welcome to BMJ Learning BMJ Learning is the world's largest and most trusted independent online learning service for medical professionals. We offer over500 peer reviewed, evidence based learning modules and our service is constantly updated. Train and test your knowledge and skills today. Accreditation of BMJ Learning courses is provided by several international authorities - including DHA, HAAD, EBAC, MMA, CME, RNZCGP, KIMS, and others . Please contact your relevant College or Association for information, or to request that they accredit BMJ Learning if they do not already.; 个人分类: 信息交流|3290 次阅读|0 个评论

[转载]Unsupervised feature learning and Deep Learning: xrtang 2010-7-10 17:10; 浏览EMNLP2010的网站，看到Andrew Ng将要做的一个报告的介绍，其中说到了特征学习和深度学习问题。最近在为知识库缺乏发愁。介绍中提到集中从无标注数据中学习特征的方法，在这里记录下来，以备后用。 1. sparse coding 链接：http://www.scholarpedia.org/article/Sparse_coding 2. ICA algorithm Independent componential analysis 3. Deep belief networks 链接：http://www.scholarpedia.org/article/Deep_belief_networks; 个人分类: 未分类|6310 次阅读|0 个评论

[转载]Change:March/April 2010: In This Issue: geneculture 2010-5-4 02:43; Frame of Reference: Open Access Starts with You by Lori A. Goetsch Federal legislation now requires the deposit of some taxpayer-funded research in open-access repositoriesthat is, sites where scholarship and research are made freely available over the Internet. The National Institutes of Health's open-access policy requires submission of NIH-funded research to PubMed Central, and there is proposed legislationthe Federal Research Public Access Act of 2009that extends this requirement to research funded by 11 other federal agencies. Academic Researchers Speak by Inger Bergom, Jean Waltman, Louise August and Carol Hollenshead Non-tenure-track (NTT) research faculty are perhaps the most under-recognized group of academic professionals on our campuses today, despite their increasingly important role within the expanding academic research enterprise. The American Association for the Advancement of Science reports that the amount of federal spending on RD has more than doubled since 1976. The government now spends about $140 billion yearly on RD, and approximately $30 billion of this amount goes to universities each year in the form of grants and contracts. Taking Teaching to (Performance) Task: Linking Pedagogical and Assessment Practices by Marc Chun Imagine a typical student taking an average set of courses. She has to complete a laboratory write-up for chemistry, write a research paper for linguistics, finish a problem set for mathematics, cram for a pop quiz in religious studies, and write an essay for her composition class. Her professors almost exclusively lecture (which, it's been said, is a way for information to travel from an instructor's lecture notes to the student's notebook without engaging the brains of either). And somehow she is supposed to not only learn the course content but also develop the critical thinking skills her college touts as central to its mission. Why Magic Bullets Don't Work by David F. Feldon We always tell our students that there are no shortcuts, that important ideas are nuanced, and that recognizing subtle distinctions is an essential critical-thinking skill. Mastery of a discipline, we know, requires careful study and necessarily slow, evolutionary changes in perspective. Then we look around for the latest promising trend in teaching and jump in with both feet, expecting it to transform our students, our courses, and our outcomes. Alternatively, we sniff disdainfully at the current educational fad and proudly stand by the instructional traditions of our disciplines or institutions, secure in our knowledge that the tried and true has a wisdom of its own. This reductive stance is a natural one. As university faculty who work within disciplines, we have each chosen a slice of human knowledge about which we are passionate, and we often settle on the most expedient (but sound) answer to the question of how to teach so that we can move on to the interesting issues and problems that led us to pursue academic careers in the first place. Further, the professional demands on us and the rewards for our work generally do not align with high levels of sustained effort invested in teaching. However, what we tell students about mastering our respective disciplines are the same truths that apply to finding effective instructional strategies: The devil is always in the details, and nuance is critical. Yet in our desire to do right by our students and still invest the bulk of our efforts in teaching content, we put our faith in over-simplified generalizations that never seem to realize the full benefits that they promise. There have been many sweeping statements made regarding the best ways to teach students in the 21st century. Two of the most au courant are traditional lectures are ineffective and internet-based technologies help students learn. There is empirical evidence to support the truth in each of these statements, truebut only if they meet specific parameters, which rarely carry over from their origins in educational research to guide their implementation in practice. Are lectures bad for learning? When we look beyond the rhetoric surrounding instructional practices to examine data, it turns out that bad lectures do limit students' learning and motivation. However, good lectures can be inspiring and have a positiveeven transformativeimpact on student outcomes. Given this unenlightening information, the real question becomes, What differentiates a good lecture from a bad one? Good lectures share a number of key properties with any type of effective instruction. They begin by establishing the relevance of the material for students through explicit connections with their goals or interests. They activate prior knowledge by connecting new content with what students already know and understand or problems with which they are currently grappling. They present information in a clear and straightforward manner that does not require disproportionate effort to translate into terms and concepts meaningful to students. They limit the information presented to a small number of core ideas that are thoroughly but not redundantly explained. Studies that systematically control the relevant features of lectures find significant learning benefits for students when these principles are implemented. However, the large-scale correlative studies of instructional format and student achievement that report negative outcomes for lectures do not control for or even ask about the presence or absence of these features. Thus it may be that the negative findings are a more accurate reflection of generally lackluster or ill-informed implementation of this teaching technique than a condemnation of the technique itself. Of course, simply knowing or even applying these general principles for effective lecturing does not guarantee positive results. Students enter courses with differing backgrounds, levels of prior knowledge, goals, and interests. Given that each of the guidelines above explicitly frames practice in terms of characteristics that vary by learner, the underlying challenge is to find ways to connect with the broadest cross-section of students and find supplemental or alternate means of connecting with those who do not fit that mold. Many instructors succeed at this through the use of assignments that require students to grapple with problems prior to the lecture. Others use clickers to stimulate engagement and structure situations in which the information presented is salient. However, the effective use of such practices involves understanding the students at whom the course is targeted. Is technology good for learning? Both the definitions and the uses of instructional technology are highly varied, so conversations about its benefits and limitations also tend to rely on overly broad generalizations. The two major foci of these discussions currently are game/simulation-based learning and so-called Web 2.0 technologies that allow users to interact with each other via the internet and to contribute content of various types directly to websites. Advocates claim that these applications are important for improving student learning outcomes; they enhance relevance for students by engaging them through the generationally preferred medium of digital media and provide them with opportunities to actively engage with a course's content. While there are indeed instances where such benefits are realized, they are not reflected in comprehensive literature reviews or meta-analyses of the research. There is a simple explanation for this: not all uses of a technology are created equal. The key features that drive engagement and learning pertain to the designs that underlie the technology rather than to the technology itself. When games and other digital learning environments are developed in accordance with principles of effective instruction, they achieve positive results. But they do not yield better results than less sophisticated instructional delivery systems that use the same instructional designs. Why? Because the active ingredients that affect students' learning are the same in both cases. One of the most durable descriptions of this phenomenon is Richard E. Clark's grocery truck metaphor: Media are mere vehicles that deliver instruction but do not influence student achievement any more than the truck that delivers our groceries causes changes in our nutrition (Clark, 1983, p. 445). What the new media do offer are tools for interacting with instructors, peers, and content in ways that are not affordable or possible otherwise. When these interactions offer opportunities to observe or manipulate information and phenomena in meaningful ways, they can facilitate learning. Generally, the features that are most helpful for students include enabling the representation of concepts at multiple levels of abstraction (e.g., via concrete representation, abstract functional models or mathematical models), providing opportunities for more extensive practice than would otherwise be possible and offering immediate feedback to direct further learning efforts. While they are potentially valuable learning tools, such technologies need to be designed in such a way that they are not confusing or overwhelming for the students who will use them. With any software, there is a learning curve for mastering the interface used to interact with it. To the extent that the interface functions in a standard way, students will be able to draw on previous technology experiences in using it. However, if it is significantly different from familiar interfaces, they will need to invest substantial effort in mastering its use before getting to content-related learning. The greater the departure from familiar software environments, the steeper the learning curve. Thus the technology itself can act as a learning impediment for students with limited technology backgrounds. It may be the case that the potential learning benefits offered outweigh the cognitive costs, but it should not be assumed without evidence that this will be the case. The role of cognition There are two threads linking effective lectures and effective technology use. The first is consideration of what students bring to the table in terms of goals, interests, and prior knowledge. The second is the deliberate management of the opportunities for students to engage with content in order to focus their investment of mental effort on key ideas. In educational research, a powerful framework for considering these factors jointly is cognitive load theory (CLT). When games and other digital learning environments are developed in accordance with principles of effective instruction, they achieve positive results. But they do not yield better results than less sophisticated instructional delivery systems that use the same instructional designs. CLT operates under the central premise that learners are only capable of attending to a finite amount of information at a given time due to the limited capacity of the working (short-term) memory system. So it is necessary to carefully manage the flow of information with which learners must grapple. It is likely that anyone who has taken an introductory course in educational or cognitive psychology will have heard of George Miller's (1956) magical number that people can only process seven information elements at a time, plus or minus two. However, what many people do not know is that this number is probably a substantial overestimate. Miller obtained his finding by asking people to listen to strings of random numbers and recite them back as accurately as possible. These numbers were not linked to any context, and he assumed that they were ubiquitous placeholders for any type of information that people might need to process. What did not occur to Miller is that people use strings of numbers for many everyday tasks and have developed memory strategies to retain them. Think, for example, of how you remember a telephone number or your social security number; most people group the digits into two or three chunks (e.g., XXX-XXXX or XXX-XX-XXXX). It is these chunks that occupy space in working memory and help to organize the information so that it does not get lost. Subsequent research holds that the upper limit of our short-term memory is actually closer to four information pieces or chunks. Given these tight bandwidth constraints, how do human beings handle any complex taskespecially one that has more than four discrete elements? To simplify, we handle the task-relevant information much as we would a phone number: we divide it into meaningful units based on our knowledge of the content and task structure. The more knowledge we have about a task, situation, or content area, the more efficiently and adaptively we are able to map discrete pieces of information onto schemas. These schemas are the abstract representations of our knowledge that serve as integrated templates for rapidly organizing the relevant facets of a situation. With deeper, more meaningful, and more interconnected knowledge, our schemas become more refined, nuanced, and capable of encoding increasing amounts of incoming information as a single chunk. Information that would occupy only one chunk for an advanced learner might be viewed by a novice as several discrete pieces of information. Cognitive load is conceptualized as the number of separate chunks (schemas) processed concurrently in working memory while learning or performing a task, plus the resources necessary to process the interactions between them. Therefore a given learning task may impose different levels of cognitive load for different individuals based on their levels of relevant prior knowledge. Cognitive load is experienced as mental effort; novices need to invest a great deal of effort to accomplish a task that an expert might be able to handle with virtually none, because they lack sufficiently complex schemas. When cognitive load (the information to be processed) exceeds working memory's capacity to process it, students have substantial difficulties. The most straightforward effect is that they are unable to learn or solve problems. However, other problematic outcomes can also occur. First, students may revert to using older or less effortful approaches to the problem that impose a less heavy load on working memory. This means that previously held misconceptions or erroneous approaches may be brought to bear, reinforcing knowledge that is counter to the material they are trying to learn. Second, students may default to pursuing less effortful goals. In other words, they may procrastinate. In such situations, thinking about the whole of a complex task may be so overwhelming that students turn to more manageable activities: checking their email, cleaning their desks, or taking on whatever other chores do not exceed their processing ability. (Rumor has it that faculty have similar experiences.) For this reason, one of the strategies for overcoming procrastination is to reduce the magnitude of a goal by breaking a large task into its component parts and dealing with only one piece at a time. This limits the complexity of the task faced, which reduces the cognitive load it imposes to manageable levels. Managing cognitive load in teaching In order to optimize the benefits of instruction, CLT prioritizes available information according to the type of cognitive load it imposes. Intrinsic load represents the inherent complexity of the material to be learned. The higher the number of components and the more those components interact, the greater the intrinsic load of the content. Extraneous load represents information in the instructional environment that occupies working memory space without contributing to comprehension or the successful solving of the problem presented. Germane load is the effort invested in the necessary instructional scaffolding and in learning concepts that facilitate further content learning. Cognitive load is conceptualized as the number of separate chunks (schemas) processed concurrently in working memory while learning or performing a task, plus the resources necessary to process the interactions between them. In this context, scaffolding refers to the cognitive support of learning that is provided during instruction. Just as a physical scaffold provides temporary support to a building that is under construction, with the intent that it will be removed when the structure is able to support itself, an instructional scaffold provides necessary cognitive assistance for learners until they are able to practice the full task without help. Extensive instruction typically provides multiple levels of support that are removed gradually to facilitate the ongoing development of proficiency. Processing the information provided as scaffolding imposes cognitive load. However, to the extent that it prevents the cognitive overload that would otherwise result for a learner struggling with new material, it is cost beneficial. Thus, the three driving principles of CLT are: 1) present content to students with appropriate prior knowledge so that the intrinsic load of the material to be learned does not occupy all the available working memory, 2) eliminate extraneous load, and 3) judiciously impose germane load to support learning. For any instructional situation, the goal is to ensure that intrinsic, extraneous, and germane load combined do not exceed working memory capacity. But how can we manage this? Although we do not control the innate complexity of the material we teach, we can assess the prior knowledge of our students to ensure they understand prerequisite concepts. If they have schemas in place to facilitate the processing of the new concept, their intrinsic load is lower than if they need to grapple with every nuance of the material without the benefit of appropriate chunking strategies. This is an opportunity to effectively use technology. The use of clickers during lectures or short online assessments to be completed prior to attending class can provide a quick picture of which necessary elements students have in place before a new concept is introduced. If they lack the prerequisite knowledge, then the instructor should teach or provide that material first in order to prevent the advanced material from exceeding students' ability to process it. The good news about extraneous load is that it should be eliminated whenever possible rather than managed. In fact, there are a number of simple and straightforward principles for doing so in instructional materials as well as in the classroom. Some have to do with the information presented. For example, ancillary information that is not directly on point should be eliminated. This includes things like biographies of historic figures in science texts when the instructional objective is to teach a theory or procedure. While it may be an interesting human-interest story to consider whether or not an apple really fell on Newton's head, processing that information detracts from the working memory available to understand gravitational theory or how to solve problems using the law of gravity. Other practices target the presentation of information. For example, it is better to integrate explanatory text into a diagram than to keep it separate, because the cognitive load of mentally integrating the information can be avoided when they are collocated. On the other hand, reading aloud the text that students are looking at forces redundant processing of the same information and impedes their ability to retain the material. Because sensory information enters working memory through modality-specific pathways, which themselves have limited bandwidth, it is helpful for information to be distributed across modalities wherever possible. It is also helpful for all necessary information to enter working memory at approximately the same time. Thus, the first example uses linguistic and visual information together, which distributes the information across modalities and avoids the unnecessary load of holding the information from the diagram in working memory while searching for the appropriate text or vice versa. In contrast, the second example overloads the pathway that handles verbal information because it simultaneously delivers read and spoken information. It also requires that information from the text be held in working memory while the speech is processed, because people typically read to themselves much more quickly than words are read aloud. Germane load is a highly complicated issue. Building scaffolds for learning imposes cognitive load. Novices being introduced to material for the first time need a great deal of explicit instruction, using very small chunks of information, to deeply process new information or problem-solving strategies. As they acquire more knowledge and skill, though, the external scaffolding which initially helped them becomes unnecessary and redundant. If such learning supports are not eliminated for those students, they cease to facilitate learning as germane load and begin to hinder it as extraneous load. This expertise reversal effect is the biggest challenge for developing effective instruction, because students do not all attain the same level of comprehension at the same time. What is germane and helpful load for one student may be extraneous and harmful for another. Effective Practices The keys to applying cognitive load theory effectively in a course are advance planning and the ongoing monitoring of students' progress. Because the central premise of CLT is to optimize the allocation of students' working memory resources for mastering particular information, it is vital to identify very specifically what the instructional objectives are for the course as a whole and for each class meeting or module. If we cannot be precise about what we want students to know and be able to do, we will not be able to structure their experiences to help them accomplish this. Next, we need to sequence the objectives so as to present material in the order in which it is needed. If some topics build on others in the course, the prerequisite pieces should be taught before they are needed. For example, we should teach processes and procedures in the same sequence that students will perform them, so that work products from preceding steps can be used in subsequent steps. If the concepts, knowledge, or skills being taught do not have an inherent sequence, then it is generally most effective to order them from simplest to most complex. Once we have figured out what content needs to be taught and the appropriate progression of topics, it is most helpful to students when we let them in on the secret. Trying to impose order on disconnected information is highly effortful. If we simply turn students loose on the material without presenting clearly what they should be trying to get from it and how it fits into the larger picture of the course's content, much of their cognitive resources will be allocated to figuring out what information is important (extraneous load) rather than focusing on constructing the knowledge necessary to meet our learning objectives. Although the logic of the course content and sequence may be obvious to us as knowledgeable instructors and content experts, our students arrive without the benefit of the schemas we have developed. Regardless of their previous experiences (or perhaps because of them), they sincerely appreciate knowing up front what they will be learning, what is expected of them, how they will be assessed, and how all of these elements fit together. When these components of the course are unclear, students invest substantial effort in figuring them out. Further, they may reach incorrect conclusions, which leads to more extraneous effort as they work at cross purposes to the course. Having mapped out the information in the course, we also need to determine how well students comprehend any knowledge on which later course content depends. This does not mean that we must burden our students (and ourselves) with exams or large assignments every week. Instead, we can use lightweight, rapid assessments that are not formally graded but are attuned to the key concepts upon which the new material draws. These can include short online surveys on the content that must be submitted a few days before class, quick check-in conversations as class begins, or multiple-choice questions on key issues that students must respond to using personal response systems (clickers). These tools are most effective when students are accountable for submitting a response but not for the accuracy of their answers. The purpose is to inform the instruction we provide rather than to increase students' anxiety (i.e., emotionally invoked extraneous load) about not knowing a correct answer. If students generally have a strong grasp of the prerequisite material, the likelihood of cognitive overload will be small, less scaffolding will be needed, and they can move directly into problem-solving. But if their understanding is weak, it will be important to review the prior material in detail, structure the new content as much as possible, and move slowly through it. When introducing problem-solving procedures to novices, providing worked examples is a very helpful practice. This involves demonstrating and explaining the reasoning processes that are involved in solving a class of problems, using a representative example. This helps to manage cognitive load effectively in several ways. When a problem is taken on, there are two sources of potential load for a learner. The first is the need to structure the information provided to effectively frame and analyze the problem. The second is the application of appropriate problem-solving strategies. The worked example both demonstrates problem-framing and provides a concrete model of an appropriate problem-solving strategy. sincerely appreciate knowing up front what they will be learning, what is expected of them, how they will be assessed, and how all of these elements fit together. This reduces the degree of uncertainty under which the students are working on three fronts. First, it allows them to map concrete instances onto relevant schemas, facilitating effective chunking. Second, it reduces their reliance on highly effortful trial-and-error attempts to identify productive solutions, which substantially increase cognitive load and time spent without providing any learning advantage. Last, it breaks the procedure down into distinguishable steps that can be considered in smaller, more manageable chunks. After walking through a full example, an excellent way to help students practice without getting overloaded is to provide a partially worked example and ask them to pick up where the completed part of the example leaves off. Having them practice the last steps first ensures that all aspects of the strategy to be learned are practiced. In complex, open-ended problems, students can get off track midway through an exercise and never have the opportunity to practice its later elements. As students become proficient in the later steps, they can be given problems with fewer steps completed for them. In this way, instructors can effectively control the overall level of cognitive load imposed by the problem and ramp up to full problems after students have developed effective schemas and chunking strategies. Practice makes perfect As students encounter repeated instances of problem types during their learning, their strategies become more nuanced (to accommodate small differences between the problems) and less effortful to execute. As they practice, their skills require less and less conscious monitoring, which reduces the level of cognitive load that problem-solving imposes. This lets them efficiently address problems of increasing complexity. Experts are able to solve problems beyond the scope of what laymen can handle precisely because their core problem-solving procedures impose virtually no load on working memory. Therefore, they can assimilate very subtle nuances and much more complex problem features with their extra cognitive capacity. The benefits of practice are just as powerful for teachers as they are for students. Teaching effectively and using cognitive load theory to guide practice is challenging. It requires the focused consideration of many details regarding our students, their knowledge, and our instructional goals. But with sustained effort, careful observations of what seems to yield more efficient and effective learning, and a willingness to make changes as necessary, these practices become less effortful. This frees up our own working memory resources to use for addressing both further complexities in addressing the learning needs of our students and the subtleties of our own disciplinary passions. Resources 1. Bernard, R. M., Abrami, P.C., Lou, Y., Borokhovski, E., Wade, A., Wozney, L., Wallet, P. A., Fiset, M. and Huang, B. (2004) How does distance education compare with classroom instruction? A meta-analysis of the empirical literature. Review of Educational Research 74 :3, pp. 379-439. 2. Bernard, R. M., Abrami, P. C., Borokhovski, E., Wade, C. A., Tamim, R. M., Surkes, M. A. and Bethel, E. C. (2009) A meta-analysis of three types of interaction treatments in distance education. Review of Educational Research 79 :3, pp. 1243-1289. 3. Clark, R. C., Nguyen, F. and Sweller, J. (2005) Efficiency in learning: Evidence-based guidelines to manage cognitive load , John Wiley Sons, San Francisco. 4. Clark, R. E. (2001) Learning from media: Arguments, analysis, and evidence , Information Age Publishing, Charlotte, NC. 5. Cowan, N. (2000) The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences 24 , pp. 87-185. 6. Feldon, D. F. (2007) Cognitive load in the classroom: The double-edged sword of automaticity. Educational Psychologist 42 :3, pp. 123-137. 7. Kalyuga, S., Ayres, P., Chandler, P. and Sweller, J. (2003) The expertise reversal effect. Educational Psychologist 38 :1, pp. 23-31. 8. Mayer, R. E. (2009) Multimedia learning , 2 Cambridge University Press, New York. 9. Miller, G. A. (1956) The magical number seven, plus or minus two: Some limits on our capacity for processing information. The Psychological Review 63 , pp. 81-97. 10. Schwartz, D. L. and Bransford, J. D. (1998) A time for telling. Cognition Instruction 16 :4, pp. 475-522. 11. van Merrinboer, J. J. G. and Sweller, J. (2005) Cognitive load theory and complex learning: Recent developments and future directions. Educational Psychology Review 17 :2, pp. 147-177. David Feldon is an assistant professor of STEM education and educational psychology at the University of Virginia. His research examines the development of expertise in science, technology, engineering, and mathematics through a cognitive lens. He also studies the effects of expertise on instructors' abilities to teach effectively within their disciplines. http://www.changemag.org/index.html Editorial: Motivating Learning by Margaret A. Miller Knowing how students learn and solve problems informs us how we should organise their learning environment and without such knowledge, the effectiveness of instructional designs is likely to be random . John Sweller (Instructional Science 32: 931, 2004.) I've written in the past about the things we want students to learn, how we help them learn, and about resistance (mine and virtually everyone else's) to change. In this issue, those concerns converge. Determining what we want students to learn is the amazingly difficult first step in developing assessments of that learning, as the article by Dary Erwin and Joe DeFillippo demonstrates. And Marc Chun talks about linking teaching, learning, assessment, and the ultimate use of higher-order thinking skills by both teaching and assessing those skills through tasks that mimic how they will be used in real life. But what particularly intrigues me is the connection between cognition and change. Educational psychologists have developed a number of constructs to explain how the mind works. In this issue, David Feldon suggests that a familiarity with cognitive load theory can be a big help in developing effective pedagogies, for example, a framework we see invoked in Carl Wieman's attempts to improve science instruction. But there is other knowledge about human cognitive architecture that can also be useful as we think about teaching and learning. For instance, the human cognitive default is to solve problems with as small a mental investment as possible; we typically retreat to earlier mental models and quicker and less effortful automated problem-solving strategies when new information threatens to overwhelm us. So as Feldon suggests, teachers need to find some way to keep the investment low enough and the cognitive load light enough that those mechanisms don't come into play. We can also exploit the fact that we're more likely to try to solve problems in areas that are important to us by showing students the relevance of what we're teaching to their lives and concerns. But given the fundamentally conservative nature of human cognition, perhaps the question should be, why doesn't the whole learning system grind to a halt? In a way, it's remarkable that we ever learn anything at all. I remember that when my son was about a year old, he developed the locomotive strategy of scooting around on his knees (it beat crawling, since he could carry things). Once he had built up calluses thick enough to protect those knees, it was a remarkably efficient way to get from point A to point B, and it halved the height from which he would fall if something went wrong. I remember thinking at the time, what will ever motivate him to get up on his hind legs and wobble around when a misstep would cause him to fall from twice the height? What will prompt him, in short, to face the perils of change when things work so well and comfortably for him as they are? Come to think of it, our bipedal walk is a great metaphor for our alternation between imbalance and stability. The act of walking, researchers have discovered, is a continual falling forward, regaining our balance, then falling forward again. Something impels us to lift that foot and risk the fall, then we consolidate our new position momentarily, then we lift that foot and fall again, and so on. At the species level, there are clearly advantages in the impulse to generate, test out, and practice both old and new survival strategies (e.g., bipedalism) that can give one an evolutionary edge. But what lies on top of that drive for individual students? How do we motivate them to lift one foot and put it down a little ahead, let us help them organize and consolidate their momentary new equilibrium, and then lift the other? I think the answer can be found by looking not at learning in school but at spontaneous learning, particularly during play. When they play, children seem to be motivated by several things. Curiosity, for one. Another stimulus is wanting to master the environment (a bone-deep tendency, crucial to the human race's survival, that is as dangerous as fire when out of control but as just as life-giving when contained), which is why children need plenty of free play where they make up the rules (as opposed to playing board games or participating in sports). A third stimulus may be the desire to imitate and take one's place among trusted and admired others, either peers or adults. Those tendencies don't need to be lost as one ages, as the success of Elderhostel attests to, although Grandgrindian schooling can certainly grind them down. So our job as teachers may be to stand in what Vygotsky called the zone of proximal development, the stage in their cognitive growth that students haven't quite gotten to yet, and beckon them forward into what for them is uncharted but possibly alluring territory (the ending of Huckleberry Finn floats into my mind, where Huck tells Jim that it's time to light out for the territories, or the song by Jacques Brel in which he mentions his childhood longing for le Far West). We motivate students to make that leap by stimulating their curiosity about the subject; by showing our own passion for it; by lessening the dangers of the move as we, knowing what their current maps look like, show them the path from there to here and how to organize their understanding of the new landscape; and by giving them as much control as possible over the learning environment. But more: I point you to Matt Procino's account (in the Listening to Students in the previous issue) of taking over a class in child development. He modeled for students the very behavior he wanted them to exhibit in life as a result of what they learned in his class by soliciting sometimes uncomfortable feedback as he learned how to teach. Similarly, he had earlier let his Outward Bound students see that he too was afraid of the challenges he was asking them to take on but that they could summon the courage to do so becausesee?he was doing it. From the point of view of the students, an admired other gave them two things to imitate: not only how you scale a cliff but how you deal with the fear of scaling a cliff. People generally can't be dragged or whipped into forward movement; they'll run back to their earlier spot of equilibrium the minute the threat (of bad grades, for instance) stops. I know that I plant my feet stubbornly whenever I feel bullied (leading one professorwho tried to argue me into liking Wordsworth's Michael, a poem I detest to this dayto say to me in exasperation, Miss Miller, why are you sometimes so dense ?) But I'm apt to leap joyfully ahead when beckoned by someone I trust and admire into knowledge that he or she is passionate about. And I want to be among the people who inhabit that new zone. That's why, at the end of a successful dissertation defense, I always say to the newly minted PhD, Welcome to the community of scholars.; 个人分类: 高等教育学|149 次阅读|0 个评论

英语单词的象形记忆法（2）: Bernie 2010-2-3 12:40; Figuration of English Words in Outlook and Sound Introduction The idea had been in my brain for more than 20 years to example words before I published it on my blog on 22th Sept. 2007. The list of letter meanings is the central body of the idea. The 26 letters each has its own meaning based on its outlook or on a derivation summarized from thousands of words. However, the meaning of a letter shows flexibility, and letter O, for example, looks not only a round blur puzzle but also a circle area and a ring to link two parts into a new word as well. We need the flexibility to figure out the meaning of million words individually. The list also shows the meanings of only some two-letter groups because meanings of the other groups as word have been defined in common dictionaries or as word fixes in textbooks. The meaning of each group has a typical way of flexibility: one comes from the list and the other is comes from the meanings of its constitutional letters. Cu, for example, means “cumulate” in the list. Cu can also mean “cut down” because C means “cut” and U means “down” in the list. It is generally not necessary to define meanings of any letter groups having more than three letters because you can find their meanings out of a dictionary. The idea is that you may like to hold a copy of the list in one of your hands to figure out meaning of any a word in your own story of figuration. After some days you will not need the list any more except a few occasional checks because the list is easy to remember. You may need to know that meaning of consonant letters plays a more important role than the vowels and a vowel letter usually behaves like an emphasis to the meaning of the consonant or consonants in front of it without much its own meaning, i.e. a vowel with its preceding consonant or consonants usually makes only one meaning. A very important skill is to break a word into letter groups. A consonant with its next following vowel should be a group but how to break connected consonants and vowels, and even to separate sub-word groups are optional. There are no rules but skills. Sometimes, you need to add or to take off a letter in letter groups to explain a word because people did so when they create words for shortness or good look. For example, dispirit=dis+spirit; distress=dis+stress; account=a+count and applause=a+plause. Let us begin with the word “love”. Why is “love” consisted of these four specific letters? Why love means to load venom or change to your heart or to others? Is it the real story that the word “love” was invented many many years ago in that way? It does not matter but the figuration may be interesting and helpful to remember it. Do you believe that the idea is a miracle? You can find that the idea and the list work well for thousands and thousands words. It is even true that your stories of figuration for many words are exactly the same as they are in the textbooks of word origin in prefixes, suffixes and stems from Greek, Latin and European colloquies including old English or in a text book of etymology. The principle is that any word came to its present meaning in its reason of formation or figuration and shall we find it? The history lost the figuration and shall we discuss and make an agreement to define it now? The idea is intended to supply people a way to remember words easily and funnily. However, who cares about your stories of the figuration being true or not in history of word origin provided they help you to remember words and they are interesting. Your word stories may not the same as mine or you may be surprised to find out yours are exactly the same of mine. However most importantly, if you share them with people, you may find that you are special and genius. If you share the stories of figuration from one of your friends to compare with yours, you may find out that your friend is special and he/she is a friend of yours in sake of nature and personality. To exchange stories of a specific word or a group of words would be an interesting game in an intimate circle of friends confidentially. You will remember words in your own best way by the figuration learning and practice. A word looks no longer hard and tedious but active, affective and interesting. Every one will become a writer to author a personal word storybook like a diary book and it may be published soon. If you would like to join my work, especially a native English speaker and a language specialist, if you would like to help me or join me to publish a book of the word stories in a form of a dictionary, would you please feel free to contact me by email: ypzong@mail.neu.edu.cn or feed back on my blog? Thank you!; 个人分类: 未分类|4756 次阅读|1 个评论

[转载]Ontology Construction[ZZ]: timy 2010-2-3 00:59; From: http://cgi.cse.unsw.edu.au/~handbookofnlp/index.php?n=Chapter16.Chapter16 Ontology Construction Philipp Cimiano, Paul Buitelaar and Johanna Vlker In this chapter we provide an overview of the current state-of-the-art in ontology construction with an emphasis on NLP-related issues such as text-driven ontology engineering and ontology learning methods. In order to put these methods into the broader perspective of knowledge engineering applications of this work, we also present a discussion of ontology research itself, in its philosophical origins and historic background as well as in terms of methodologies in ontology engineering. Bibtex Citation @incollection{Cimiano-handbook10, author = {Philipp Cimiano and Paul Buitelaar and Johanna V\{o}lker}, title = {Ontology Construction}, booktitle = {Handbook of Natural Language Processing, Second Edition}, editor = {Nitin Indurkhya and Fred J. Damerau}, publisher = {CRC Press, Taylor and Francis Group}, address = {Boca Raton, FL}, year = {2010}, note = {ISBN 978-1420085921} } Online Resources Ontologies General and upper-level ontologies CYC http://www.opencyc.org DOLCE http://www.loa-cnr.it/DOLCE.html SUMO http://www.ontologyportal.org Linguistic ontologies OntoWordnet http://www.loa-cnr.it/Papers/ODBASE-WORDNET.pdf Swinto, LingInfo Domain-specific ontologies (publicly available) in some example domains Biomedical Foundational Model of Anatomy http://sig.biostr.washington.edu/projects/fm/AboutFM.html Gene Ontology http://www.geneontology.org Repository of biomedical ontologies http://bioportal.bioontology.org Business/Financial XBRL ontology http://xbrlontology.com Geography Geonames ontology http://www.geonames.org/ontology/ Ontology Repositories and Search Engines Swoogle http://swoogle.umbc.edu/ Watson http://watson.kmi.open.ac.uk OntoSelect http://olp.dfki.de/ontoselect/ Oyster Ontology Development Ontology Development 101 http://ksl.stanford.edu/people/dlm/papers/ontology-tutorial-noy-mcguinness-abstract.html Ontology Editors Protg http://protege.stanford.edu NeOn Toolkit http://www.neon-toolkit.org Swoop http://www.mindswap.org/2004/SWOOP/ TopRaid Composer Ontology Engineering Methodologies DILIGENT http://semanticweb.org/wiki/DILIGENT HCOME http://semanticweb.org/wiki/HCOME METHONTOLOGY http://semanticweb.org/wiki/METHONTOLOGY OTK methodology http://semanticweb.org/wiki/OTK_methodology Ontology Learning Tools Text2Onto http://www.neon-toolkit.org/wiki/index.php/Text2Onto OntoLearn http://lcl.di.uniroma1.it/tools.jsp OntoLT http://olp.dfki.de/OntoLT/OntoLT.htm; 个人分类: 自然语言处理|3809 次阅读|0 个评论

[转载][JMIR] Learning in a Virtual World: Experience With Using Second Life for Me: xupeiyang 2010-1-26 08:12; Journal of Medical Internet Research (JMIR) Volume 12 (2010) * Impact Factor (2008): 3.6 - Ranked top (#1/20) in the Medical Informatics and second (#2/62) in the Health Services Research category * http://www.jmir.org/2010 Content Alert, 25 Jan 2010 ==================================== ================================= UPCOMING ISSUE Volume 12, Issue 1 http://www.jmir.org/2010/1 ================================= The following article(s) has/have just been published in the UPCOMING JMIR issue (Volume 12 / Issue 1): (articles are still being added for this issue) Original Papers ------------------ Learning in a Virtual World: Experience With Using Second Life for Medical Education John Wiecha, Robin Heyden, Elliot Sternthal, Mario Merialdi J Med Internet Res 2010 (Jan 23); 12(1):e1 HTML (open access): http://www.jmir.org/2010/1/e1/ PDF (members only): http://www.jmir.org/2010/1/e1/PDF Background: Virtual worlds are rapidly becoming part of the educational technology landscape. Second Life (SL) is one of the best known of these environments. Although the potential of SL has been noted for health professions education, a search of the worlds literature and of the World Wide Web revealed a limited number of formal applications of SL for this purpose and minimal evaluation of educational outcomes. Similarly, the use of virtual worlds for continuing health professional development appears to be largely unreported. Objective: Our objectives were to: 1) explore the potential of a virtual world for delivering continuing medical education (CME) designed for post-graduate physicians; 2) determine possible instructional designs for using SL for CME; 3) understand the limitations of SL for CME; 4) understand the barriers, solutions, and costs associated with using SL, including required training; 5) measure participant learning outcomes and feedback. Methods: We designed and delivered a pilot postgraduate medical education program in the virtual world, Second Life. Our objectives were to: (1) explore the potential of a virtual world for delivering continuing medical education (CME) designed for physicians; (2) determine possible instructional designs using SL for CME; (3) understand the limitations of SL for CME; (4) understand the barriers, solutions, and costs associated with using SL, including required training; and (5) measure participant learning outcomes and feedback. We trained and enrolled 14 primary care physicians in an hour-long, highly interactive event in SL on the topic of type 2 diabetes. Participants completed surveys to measure change in confidence and performance on test cases to assess learning. The post survey also assessed participantsattitudestoward the virtual learning environment. Results: Of the 14 participant physicians, 12 rated the course experience, 10 completed the pre and post confidence surveys, and 10 completed both the pre and post case studies. On a seven-point Likert scale (1, strongly disagree to 7, strongly agree), participants mean reported confidence increased from pre to post SL event with respect to: selecting insulin for patients with type 2 diabetes (pre = 4.9 to post = 6.5,P= .002); initiating insulin (pre = 5.0 to post = 6.2,P= .02); and adjusting insulin dosing (pre = 5.2 to post = 6.2,P= .02). On test cases, the percent of participants providing a correct insulin initiation plan increased from 60% (6 of 10) pre to 90% (9 of 10) post (P= .2), and the percent of participants providing correct initiation of mealtime insulin increased from 40% (4 of 10) pre to 80% (8 of 10) post (P= .09). All participants (12 of 12) agreed that this experience in SL was an effective method of medical education, that the virtual world approach to CME was superior to other methods of online CME, that they would enroll in another such event in SL, and that they would recommend that their colleagues participate in an SL CME course. Only 17% (2 of 12) disagreed with the statement that this potential Second Life method of CME is superior to face-to-face CME. Conclusions: The results of this pilot suggest that virtual worlds offer the potential of a new medical education pedagogy to enhance learning outcomes beyond that provided by more traditional online or face-to-face postgraduate professional development activities. Obvious potential exists for application of these methods at the medical school and residency levels as well. ============================================ Please support JMIR by becoming a member today! JMIR needs to raise $100k to support software upgrades http://www.jmir.org/support.htm Memberships start at $4.92 per month ============================================ 2008 Impact Factor 3.6 confirms JMIR as THE top ranked health informatics, health services research and health policy journal for the Internet age. See http://www.jmir.org/announcement/view/24 for the new journal impact factors released in June 2009 ! ============================================ This is not an unsolicited email. You are receiving this email because you subscribed to JMIR content alerts. To unsubscribe from content alerts please log in at http://www.jmir.org/user/profile and uncheck the checkbox New issue published email notifications. If you lost your password, please go to http://www.jmir.org/login/lostPassword. ________________________________________________________________________ Journal of Medical Internet Research - The leading peer-reviewed ehealth journal - Open Access - Fast Review - Impact Factor: 3.6 *** JMIR is now ranked the number one (#1/20) in the Medical Informatics and second (#2/62) in the Health Services Research category! http://www.jmir.org; 个人分类: 信息检索|1878 次阅读|0 个评论

List of Conferences and Workshops Where Transfer Learning Paper Appear: timy 2009-11-6 10:49; From: http://www.cse.ust.hk/~sinnopan/conferenceTL.htm List of Conferences and Workshops Where Transfer Learning Paper Appear This webpage will be updated regularly. Main Conferences Machine Learning and Artificial Intelligence Conferences AAAI 2008 Transfer Learning via Dimensionality Reduction Transferring Localization Models across Space Transferring Localization Models over Time Transferring Multi-device Localization Models using Latent Multi-task Learning Text Categorization with Knowledge Transfer from Heterogeneous Data Sources Zero-data Learning of New Tasks 2007 Transferring Naive Bayes Classifiers for Text Classification Mapping and Revising Markov Logic Networks for Transfer Learning Measuring the Level of Transfer Learning by an AP Physics Problem-Solver 2006 Using Homomorphisms to Transfer Options across Continuous Reinforcement Learning Domains Value-Function-Based Transfer for Reinforcement Learning Using Structure Mapping IJCAI 2009 Transfer Learning Using Task-Level Features with Application to Information Retrieval Transfer Learning from Minimal Target Data by Mapping across Relational Domains Domain Adaptation via Transfer Component Analysis Knowledge Transfer on Hybrid Graph Manifold Alignment without Correspondence Robust Distance Metric Learning with Auxiliary Knowledge Can Movies and Books Collaborate? Cross-Domain Collaborative Filtering for Sparsity Reduction Exponential Family Sparse Coding with Application to Self-taught Learning 2007 Learning and Transferring Action Schemas General Game Learning Using Knowledge Transfer Building Portable Options: Skill Transfer in Reinforcement Learning Transfer Learning in Real-Time Strategy Games Using Hybrid CBR/RL An Experts Algorithm for Transfer Learning Transferring Learned Control-Knowledge between Planners Effective Control Knowledge Transfer through Learning Skill and Representation Hierarchies Efficient Bayesian Task-Level Transfer Learning ICML 2009 Deep Transfer via Second-Order Markov Logic Feature Hashing for Large Scale Multitask Learning A Convex Formulation for Learning Shared Structures from Multiple Tasks EigenTransfer: A Unified Framework for Transfer Learning Domain Adaptation from Multiple Sources via Auxiliary Classifiers Transfer Learning for Collaborative Filtering via a Rating-Matrix Generative Model 2008 Bayesian Multiple Instance Learning: Automatic Feature Selection and Inductive Transfer Multi-Task Learning for HIV Therapy Screening Self-taught Clustering Manifold Alignment using Procrustes Analysis Automatic Discovery and Transfer of MAXQ Hierarchies Transfer of Samples in Batch Reinforcement Learning Hierarchical Kernel Stick-Breaking Process for Multi-Task Image Analysis Multi-Task Compressive Sensing with Dirichlet Process Priors A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning 2007 Boosting for Transfer Learning Self-taught Learning: Transfer Learning from Unlabeled Data Robust Multi-Task Learning with t-Processes Multi-Task Learning for Sequential Data via iHMMs and the Nested Dirichlet Process Cross-Domain Transfer for Reinforcement Learning Learning a Meta-Level Prior for Feature Relevance from Multiple Related Tasks Multi-Task Reinforcement Learning: A Hierarchical Bayesian Approach The Matrix Stick-Breaking Process for Flexible Multi-Task Learning Asymptotic Bayesian Generalization Error When Training and Test Distributions Are Different Discriminative Learning for Differing Training and Test Distributions 2006 Autonomous Shaping: Knowledge Transfer in Reinforcement Learning Constructing Informative Priors using Transfer Learning NIPS 2008 Clustered Multi-Task Learning: A Convex Formulation Multi-task Gaussian Process Learning of Robot Inverse Dynamics Transfer Learning by Distribution Matching for Targeted Advertising Translated Learning: Transfer Learning across Different Feature Spaces An empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis Domain Adaptation with Multiple Sources 2007 Learning Bounds for Domain Adaptation Transfer Learning using Kolmogorov Complexity: Basic Theory and Empirical Evaluations A Spectral Regularization Framework for Multi-Task Structure Learning Multi-task Gaussian Process Prediction Semi-Supervised Multitask Learning Gaussian Process Models for Link Analysis and Transfer Learning Multi-Task Learning via Conic Programming Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation 2006 Correcting Sample Selection Bias by Unlabeled Data Dirichlet-Enhanced Spam Filtering based on Biased Samples Analysis of Representations for Domain Adaptation Multi-Task Feature Learning AISTAT 2009 A Hierarchical Nonparametric Bayesian Approach to Statistical Language Model Domain Adaptation 2007 Kernel Multi-task Learning using Task-specific Features Inductive Transfer for Bayesian Network Structure Learning ECML/PKDD 2009 Relaxed Transfer of Different Classes via Spectral Partition Feature Selection by Transfer Learning with Linear Regularized Models Semi-Supervised Multi-Task Regression 2008 Actively Transfer Domain Knowledge An Algorithm for Transfer Learning in a Heterogeneous Environment Transferred Dimensionality Reduction Modeling Transfer Relationships between Learning Tasks for Improved Inductive Transfer Kernel-Based Inductive Transfer 2007 Graph-Based Domain Mapping for Transfer Learning in General Games Bridged Refinement for Transfer Learning Transfer Learning in Reinforcement Learning Problems Through Partial Policy Recycling Domain Adaptation of Conditional Probability Models via Feature Subsetting 2006 Skill Acquisition via Transfer Learning and Advice Taking COLT 2009 Online Multi-task Learning with Hard Constraints Taking Advantage of Sparsity in Multi-Task Learning Domain Adaptation: Learning Bounds and Algorithms 2008 Learning coordinate gradients with multi-task kernels Linear Algorithms for Online Multitask Classification 2007 Multitask Learning with Expert Advice 2006 Online Multitask Learning UAI 2009 Bayesian Multitask Learning with Latent Hierarchies Multi-Task Feature Learning Via Efficient L2,1-Norm Minimization 2008 Convex Point Estimation using Undirected Bayesian Transfer Hierarchies Data Mining Conferences KDD 2009 Cross Domain Distribution Adaptation via Kernel Mapping Extracting Discriminative Concepts for Domain Adaptation in Text Mining 2008 Spectral domain-transfer learning Knowledge transfer via multiple model local structure mapping 2007 Co-clustering based Classification for Out-of-domain Documents 2006 Reverse Testing: An Efficient Framework to Select Amongst Classifiers under Sample Selection Bias ICDM 2008 Unsupervised Cross-domain Learning by Interaction Information Co-clustering Using Wikipedia for Co-clustering Based Cross-domain Text Classification SDM 2008 Type-Independent Correction of Sample Selection Bias via Structural Discovery and Re-balancing Direct Density Ratio Estimation for Large-scale Covariate Shift Adaptation 2007 On Sample Selection Bias and Its Efficient Correction via Model Averaging and Unlabeled Examples Probabilistic Joint Feature Selection for Multi-task Learning Application Conferences SIGIR 2009 Mining Employment Market via Text Block Detection and Adaptive Cross-Domain Information Extraction Knowledge transformation for cross-domain sentiment classification 2008 Topic-bridged PLSA for cross-domain text classification 2007 Cross-Lingual Query Suggestion Using Query Logs of Different Languages 2006 Tackling Concept Drift by Temporal Inductive Transfer Constructing Informative Prior Distributions from Domain Knowledge in Text Classification Building Bridges for Web Query Classification WWW 2009 Latent Space Domain Transfer between High Dimensional Overlapping Distributions 2008 Can Chinese web pages be classified with English data source? ACL 2009 Transfer Learning, Feature Selection and Word Sense Disambiguation Graph Ranking for Sentiment Transfer Multi-Task Transfer Learning for Weakly-Supervised Relation Extraction Cross-Domain Dependency Parsing Using a Deep Linguistic Grammar Heterogeneous Transfer Learning for Image Clustering via the SocialWeb 2008 Exploiting Feature Hierarchy for Transfer Learning in Named Entity Recognition Multi-domain Sentiment Classification Active Sample Selection for Named Entity Transliteration Mining Wiki Resources for Multilingual Named Entity Recognition Multi-Task Active Learning for Linguistic Annotations 2007 Domain Adaptation with Active Learning for Word Sense Disambiguation Frustratingly Easy Domain Adaptation Instance Weighting for Domain Adaptation in NLP Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification Self-Training for Enhancement and Domain Adaptation of Statistical Parsers Trained on Small Datasets 2006 Estimating Class Priors in Domain Adaptation for Word Sense Disambiguation Simultaneous English-Japanese Spoken Language Translation Based on Incremental Dependency Parsing and Transfer CVPR 2009 Domain Transfer SVM for Video Concept Detection Boosted Multi-Task Learning for Face Verification With Applications to Web Image and Video Search 2008 Transfer Learning for Image Classification with Sparse Prototype Representations Workshops NIPS 2005 Workshop - Inductive Transfer: 10 Years Later NIPS 2005 Workshop - Interclass Transfer NIPS 2006 Workshop - Learning when test and training inputs have different distributions AAAI 2008 Workshop - Transfer Learning for Complex Tasks; 个人分类: 机器学习|7608 次阅读|0 个评论

ZZ: 迁移学习（ Transfer Learning ）: timy 2009-11-6 09:46; 转载于： http://apex.sjtu.edu.cn/apex_wiki/Transfer%20Learning 迁移学习（ Transfer Learning ）薛贵荣在传统的机器学习的框架下，学习的任务就是在给定充分训练数据的基础上来学习一个分类模型；然后利用这个学习到的模型来对测试文档进行分类与预测。然而，我们看到机器学习算法在当前的Web挖掘研究中存在着一个关键的问题：一些新出现的领域中的大量训练数据非常难得到。我们看到Web应用领域的发展非常快速。大量新的领域不断涌现，从传统的新闻，到网页，到图片,再到博客、播客等等。传统的机器学习需要对每个领域都标定大量训练数据，这将会耗费大量的人力与物力。而没有大量的标注数据，会使得很多与学习相关研究与应用无法开展。其次，传统的机器学习假设训练数据与测试数据服从相同的数据分布。然而，在许多情况下，这种同分布假设并不满足。通常可能发生的情况如训练数据过期。这往往需要我们去重新标注大量的训练数据以满足我们训练的需要，但标注新数据是非常昂贵的，需要大量的人力与物力。从另外一个角度上看，如果我们有了大量的、在不同分布下的训练数据，完全丢弃这些数据也是非常浪费的。如何合理的利用这些数据就是迁移学习主要解决的问题。迁移学习可以从现有的数据中迁移知识，用来帮助将来的学习。迁移学习（Transfer Learning）的目标是将从一个环境中学到的知识用来帮助新环境中的学习任务。因此，迁移学习不会像传统机器学习那样作同分布假设。我们在迁移学习方面的工作目前可以分为以下三个部分：同构空间下基于实例的迁移学习，同构空间下基于特征的迁移学习与异构空间下的迁移学习。我们的研究指出，基于实例的迁移学习有更强的知识迁移能力，基于特征的迁移学习具有更广泛的知识迁移能力，而异构空间的迁移具有广泛的学习与扩展能力。这几种方法各有千秋。 1.同构空间下基于实例的迁移学习基于实例的迁移学习的基本思想是，尽管辅助训练数据和源训练数据或多或少会有些不同，但是辅助训练数据中应该还是会存在一部分比较适合用来训练一个有效的分类模型，并且适应测试数据。于是，我们的目标就是从辅助训练数据中找出那些适合测试数据的实例，并将这些实例迁移到源训练数据的学习中去。在基于实例的迁移学习方面，我们推广了传统的 AdaBoost 算法，提出一种具有迁移能力的boosting算法：Tradaboosting ，使之具有迁移学习的能力，从而能够最大限度的利用辅助训练数据来帮助目标的分类。我们的关键想法是，利用boosting的技术来过滤掉辅助数据中那些与源训练数据最不像的数据。其中，boosting的作用是建立一种自动调整权重的机制，于是重要的辅助训练数据的权重将会增加，不重要的辅助训练数据的权重将会减小。调整权重之后，这些带权重的辅助训练数据将会作为额外的训练数据，与源训练数据一起从来提高分类模型的可靠度。基于实例的迁移学习只能发生在源数据与辅助数据非常相近的情况下。但是，当源数据和辅助数据差别比较大的时候，基于实例的迁移学习算法往往很难找到可以迁移的知识。但是我们发现，即便有时源数据与目标数据在实例层面上并没有共享一些公共的知识，它们可能会在特征层面上有一些交集。因此我们研究了基于特征的迁移学习，它讨论的是如何利用特征层面上公共的知识进行学习的问题。 2.同构空间下基于特征的迁移学习在基于特征的迁移学习研究方面，我们提出了多种学习的算法，如CoCC算法，TPLSA算法，谱分析算法与自学习算法等。其中利用互聚类算法产生一个公共的特征表示，从而帮助学习算法。我们的基本思想是使用互聚类算法同时对源数据与辅助数据进行聚类，得到一个共同的特征表示，这个新的特征表示优于只基于源数据的特征表示。通过把源数据表示在这个新的空间里，以实现迁移学习。应用这个思想，我们提出了基于特征的有监督迁移学习与基于特征的无监督迁移学习。 2.1 基于特征的有监督迁移学习我们在基于特征的有监督迁移学习方面的工作是基于互聚类的跨领域分类，这个工作考虑的问题是：当给定一个新的、不同的领域，标注数据及其稀少时，如何利用原有领域中含有的大量标注数据进行迁移学习的问题。在基于互聚类的跨领域分类这个工作中，我们为跨领域分类问题定义了一个统一的信息论形式化公式，其中基于互聚类的分类问题的转化成对目标函数的最优化问题。在我们提出的模型中，目标函数被定义为源数据实例，公共特征空间与辅助数据实例间互信息的损失。 2.2 基于特征的无监督迁移学习：自学习聚类我们提出的自学习聚类算法属于基于特征的无监督迁移学习方面的工作。这里我们考虑的问题是：现实中可能有标记的辅助数据都难以得到，在这种情况下如何利用大量无标记数据辅助数据进行迁移学习的问题。自学习聚类的基本思想是通过同时对源数据与辅助数据进行聚类得到一个共同的特征表示，而这个新的特征表示由于基于大量的辅助数据，所以会优于仅基于源数据而产生的特征表示，从而对聚类产生帮助。上面提出的两种学习策略（基于特征的有监督迁移学习与无监督迁移学习）解决的都是源数据与辅助数据在同一特征空间内的基于特征的迁移学习问题。当源数据与辅助数据所在的特征空间中不同时，我们还研究了跨特征空间的基于特征的迁移学习，它也属于基于特征的迁移学习的一种。３　异构空间下的迁移学习：翻译学习我们提出的翻译学习致力于解决源数据与测试数据分别属于两个不同的特征空间下的情况。在中，我们使用大量容易得到的标注过文本数据去帮助仅有少量标注的图像分类的问题，如上图所示。我们的方法基于使用那些用有两个视角的数据来构建沟通两个特征空间的桥梁。虽然这些多视角数据可能不一定能够用来做分类用的训练数据，但是，它们可以用来构建翻译器。通过这个翻译器，我们把近邻算法和特征翻译结合在一起，将辅助数据翻译到源数据特征空间里去，用一个统一的语言模型进行学习与分类。引文： . Wenyuan Dai, Yuqiang Chen, Gui-Rong Xue, Qiang Yang, and Yong Yu. Translated Learning: Transfer Learning across Different Feature Spaces. Advances in Neural Information Processing Systems 21 (NIPS 2008), Vancouver, British Columbia, Canada, December 8-13, 2008. . Xiao Ling, Wenyuan Dai, Gui-Rong Xue, Qiang Yang, and Yong Yu. Spectral Domain-Transfer Learning. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2008), Pages 488-496, Las Vegas, Nevada, USA, August 24-27, 2008. . Wenyuan Dai, Qiang Yang, Gui-Rong Xue and Yong Yu. Self-taught Clustering. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML 2008), pages 200-207, Helsinki, Finland, 5-9 July, 2008. . Gui-Rong Xue, Wenyuan Dai, Qiang Yang and Yong Yu. Topic-bridged PLSA for Cross-Domain Text Classification. In Proceedings of the Thirty-first International ACM SIGIR Conference on Research and Development on Information Retrieval (SIGIR2008), pages 627-634, Singapore, July 20-24, 2008. . Xiao Ling, Gui-Rong Xue, Wenyuan Dai, Yun Jiang, Qiang Yang and Yong Yu. Can Chinese Web Pages be Classified with English Data Source? In Proceedings the Seventeenth International World Wide Web Conference (WWW2008), Pages 969-978, Beijing, China, April 21-25, 2008. . Xiao Ling, Wenyuan Dai, Gui-Rong Xue and Yong Yu. Knowledge Transferring via Implicit Link Analysis. In Proceedings of the Thirteenth International Conference on Database Systems for Advanced Applications (DASFAA 2008), Pages 520-528, New Delhi, India, March 19-22, 2008. . Wenyuan Dai, Gui-Rong Xue, Qiang Yang and Yong Yu. Co-clustering based Classification for Out-of-domain Documents. In Proceedings of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2007), Pages 210-219, San Jose, California, USA, Aug 12-15, 2007. . Wenyuan Dai, Gui-Rong Xue, Qiang Yang and Yong Yu. Transferring Naive Bayes Classifiers for Text Classification. In Proceedings of the Twenty-Second National Conference on Artificial Intelligence (AAAI 2007), Pages 540-545, Vancouver, British Columbia, Canada, July 22-26, 2007. . Wenyuan Dai, Qiang Yang, Gui-Rong Xue and Yong Yu. Boosting for Transfer Learning. In Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML 2007), Pages 193-200, Corvallis, Oregon, USA, June 20-24, 2007. . Dikan Xing, Wenyuan Dai, Gui-Rong Xue and Yong Yu. Bridged Refinement for Transfer Learning. In Proceedings of the Eleventh European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2007), Pages 324-335, Warsaw, Poland, September 17-21, 2007. （Best Student Paper Award） . Xin Zhang, Wenyuan Dai, Gui-Rong Xue and Yong Yu. Adaptive Email Spam Filtering based on Information Theory. In Proceedings of the Eighth International Conference on Web Information Systems Engineering (WISE 2007), Pages 159170, Nancy, France, December 3-7, 2007. Transfer Learning (2009-10-29 03:03:46由 grxue 编辑); 个人分类: 机器学习|7241 次阅读|0 个评论

12 3 4 5 6 7 8 9 下一页

帐号		自动登录	找回密码
密码			注册

关闭 安全验证

标签: Learning

相关日志

关闭安全验证