Learning History II: A Student’s Guide to The National Experience [Judith M. Walter et al., Learning History II: A Student’s Guide to The National Experience, 8th edition, 1993] [No. 433 in Huang Annian’s personal book catalog, English-language titles on American studies] Compiled by Huang Annian; posted on Huang Annian’s blog, March 14, 2019 (post No. 21203). Beginning in 2019, I will gradually publish on this blog the catalog of my entire personal book collection, starting with the English-language titles on American studies; more than 432 individually numbered items have been published so far, in no particular order of publication date or subject category. Published here is Judith M. Walter et al., Learning History II: A Student’s Guide to The National Experience, Part Two: The History of The United States Since 1865, Harcourt Brace Jovanovich College Publishers, 8th edition, 1993, 390 pages. Fifteen photos are taken from the book.
Recurrent Neural Networks http://colah.github.io/posts/2015-08-Understanding-LSTMs/ Humans don’t start their thinking from scratch every second. As you read this essay, you understand each word based on your understanding of previous words. You don’t throw everything away and start thinking from scratch again. Your thoughts have persistence. Traditional neural networks can’t do this, and it seems like a major shortcoming. For example, imagine you want to classify what kind of event is happening at every point in a movie. It’s unclear how a traditional neural network could use its reasoning about previous events in the film to inform later ones. Recurrent neural networks address this issue. They are networks with loops in them, allowing information to persist. Recurrent Neural Networks have loops. In the above diagram, a chunk of neural network, \(A\), looks at some input \(x_t\) and outputs a value \(h_t\). A loop allows information to be passed from one step of the network to the next. These loops make recurrent neural networks seem kind of mysterious. However, if you think a bit more, it turns out that they aren’t all that different from a normal neural network. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. Consider what happens if we unroll the loop: An unrolled recurrent neural network. This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists. They’re the natural architecture of neural networks to use for such data. And they certainly are used! In the last few years, there has been incredible success in applying RNNs to a variety of problems: speech recognition, language modeling, translation, image captioning… The list goes on. I’ll leave discussion of the amazing feats one can achieve with RNNs to Andrej Karpathy’s excellent blog post, The Unreasonable Effectiveness of Recurrent Neural Networks. But they really are pretty amazing.
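The loop and its unrolling can be sketched directly. The minimal example below is an illustrative NumPy sketch (the weight names, sizes, and initialization are assumptions, not from the essay): the same cell \(A\) is applied at every time step, passing the hidden state from each copy of the network to the next.

```python
import numpy as np

def rnn_step(W_hh, W_xh, b_h, h_prev, x_t):
    """One step of a vanilla RNN: the cell reads the input x_t and the
    previous hidden state, and emits the next hidden state h_t."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

def rnn_unroll(W_hh, W_xh, b_h, h0, xs):
    """Unrolling the loop: the SAME weights are applied at every time
    step, each copy passing its hidden state to its successor."""
    h = h0
    hs = []
    for x_t in xs:
        h = rnn_step(W_hh, W_xh, b_h, h, x_t)
        hs.append(h)
    return hs

# Illustrative sizes and random inputs (assumptions for the sketch).
rng = np.random.default_rng(0)
hidden, inp, steps = 4, 3, 5
W_hh = rng.standard_normal((hidden, hidden)) * 0.1
W_xh = rng.standard_normal((hidden, inp)) * 0.1
b_h = np.zeros(hidden)
xs = [rng.standard_normal(inp) for _ in range(steps)]
hs = rnn_unroll(W_hh, W_xh, b_h, np.zeros(hidden), xs)
print(len(hs))  # one hidden state per input step
```

Because the same `W_hh`, `W_xh`, and `b_h` are reused at every step, the unrolled chain really is just the loop written out, which is why the chain-like picture and the loop picture describe the same network.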
Essential to these successes is the use of “LSTMs,” a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version. Almost all exciting results based on recurrent neural networks are achieved with them. It’s these LSTMs that this essay will explore. The Problem of Long-Term Dependencies One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames to inform the understanding of the present frame. If RNNs could do this, they’d be extremely useful. But can they? It depends. Sometimes, we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in “the clouds are in the sky,” we don’t need any further context – it’s pretty obvious the next word is going to be sky. In such cases, where the gap between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information. But there are also cases where we need more context. Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large. Unfortunately, as that gap grows, RNNs become unable to learn to connect the information. In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn them. The problem was explored in depth by Hochreiter (1991) and Bengio, et al.
(1994), who found some pretty fundamental reasons why it might be difficult. Thankfully, LSTMs don’t have this problem! LSTM Networks Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work. They work tremendously well on a large variety of problems, and are now widely used. LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn! All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer. The repeating module in a standard RNN contains a single layer. LSTMs also have this chain-like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way. The repeating module in an LSTM contains four interacting layers. Don’t worry about the details of what’s going on. We’ll walk through the LSTM diagram step by step later. For now, let’s just try to get comfortable with the notation we’ll be using. In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations. The Core Idea Behind LSTMs The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.
The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged. The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates. Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation. The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!” An LSTM has three of these gates, to protect and control the cell state. Step-by-Step LSTM Walk Through The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It looks at \(h_{t-1}\) and \(x_t\) , and outputs a number between \(0\) and \(1\) for each number in the cell state \(C_{t-1}\) . A \(1\) represents “completely keep this” while a \(0\) represents “completely get rid of this.” Let’s go back to our example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject. The next step is to decide what new information we’re going to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values, \(\tilde{C}_t\) , that could be added to the state. In the next step, we’ll combine these two to create an update to the state. 
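The figures that accompanied this walk-through carried the corresponding equations; reconstructed in the essay’s notation (with each \(W\) and \(b\) being the learned weights and biases of a sigmoid or tanh layer, and \([h_{t-1}, x_t]\) their concatenated input), the forget gate, input gate, and candidate values are:

\[ f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \]
\[ i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \]
\[ \tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \]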
In the example of our language model, we’d want to add the gender of the new subject to the cell state, to replace the old one we’re forgetting. It’s now time to update the old cell state, \(C_{t-1}\), into the new cell state \(C_t\). The previous steps already decided what to do, we just need to actually do it. We multiply the old state by \(f_t\), forgetting the things we decided to forget earlier. Then we add \(i_t*\tilde{C}_t\). These are the new candidate values, scaled by how much we decided to update each state value. In the case of the language model, this is where we’d actually drop the information about the old subject’s gender and add the new information, as we decided in the previous steps. Finally, we need to decide what we’re going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through \(\tanh\) (to push the values to be between \(-1\) and \(1\)) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to. For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that’s what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows next. Variants on Long Short Term Memory What I’ve described so far is a pretty normal LSTM. But not all LSTMs are the same as the above. In fact, it seems like almost every paper involving LSTMs uses a slightly different version. The differences are minor, but it’s worth mentioning some of them. One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding “peephole connections.” This means that we let the gate layers look at the cell state.
The above diagram adds peepholes to all the gates, but many papers will give some peepholes and not others. Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what we should add new information to, we make those decisions together. We only forget when we’re going to input something in its place. We only input new values to the state when we forget something older. A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular. These are only a few of the most notable LSTM variants. There are lots of others, like Depth Gated RNNs by Yao, et al. (2015). There are also some completely different approaches to tackling long-term dependencies, like Clockwork RNNs by Koutnik, et al. (2014). Which of these variants is best? Do the differences matter? Greff, et al. (2015) do a nice comparison of popular variants, finding that they’re all about the same. Jozefowicz, et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks. Conclusion Earlier, I mentioned the remarkable results people are achieving with RNNs. Essentially all of these are achieved using LSTMs. They really work a lot better for most tasks! Written down as a set of equations, LSTMs look pretty intimidating. Hopefully, walking through them step by step in this essay has made them a bit more approachable. LSTMs were a big step in what we can accomplish with RNNs. It’s natural to wonder: is there another big step? A common opinion among researchers is: “Yes!
There is a next step and it’s attention!” The idea is to let every step of an RNN pick information to look at from some larger collection of information. For example, if you are using an RNN to create a caption describing an image, it might pick a part of the image to look at for every word it outputs. In fact, Xu, et al. (2015) do exactly this – it might be a fun starting point if you want to explore attention! There have been a number of really exciting results using attention, and it seems like a lot more are around the corner… Attention isn’t the only exciting thread in RNN research. For example, Grid LSTMs by Kalchbrenner, et al. (2015) seem extremely promising. Work using RNNs in generative models – such as Gregor, et al. (2015), Chung, et al. (2015), or Bayer & Osendorfer (2015) – also seems very interesting. The last few years have been an exciting time for recurrent neural networks, and the coming ones promise to only be more so!
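Collecting the step-by-step walk-through into code, a single LSTM step might look like the following minimal NumPy sketch. Only the gate structure follows the essay; the parameter shapes, initialization, and variable names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(params, h_prev, C_prev, x_t):
    """One LSTM step, following the walk-through: forget gate,
    input gate + candidate values, cell-state update, output gate."""
    z = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])       # what to forget
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])       # what to update
    C_tilde = np.tanh(params["W_C"] @ z + params["b_C"])   # candidate values
    C_t = f_t * C_prev + i_t * C_tilde                     # new cell state
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])       # what to output
    h_t = o_t * np.tanh(C_t)                               # filtered cell state
    return h_t, C_t

# Illustrative sizes and random parameters (assumptions for the sketch).
rng = np.random.default_rng(1)
hidden, inp = 4, 3
params = {}
for name in ("f", "i", "C", "o"):
    params[f"W_{name}"] = rng.standard_normal((hidden, hidden + inp)) * 0.1
    params[f"b_{name}"] = np.zeros(hidden)

h, C = np.zeros(hidden), np.zeros(hidden)
for _ in range(5):
    h, C = lstm_step(params, h, C, rng.standard_normal(inp))
```

Note how the cell state \(C_t\) is touched only by a pointwise multiply and a pointwise add, the conveyor-belt property described above, while the hidden state \(h_t\) is a gated, squashed view of it.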
Wei: Recently, the microblogging (WeChat) community has been full of hot discussion and testing of the newest announcement of the Google Translate breakthrough in its NMT (neural machine translation) offering, claimed to have achieved significant progress in quality and readability. Sounds like a major breakthrough worthy of attention and celebration. The report says: Ten years ago, we released Google Translate; the core algorithm behind this service is PBMT: Phrase-Based Machine Translation. Since then, the rapid development of machine intelligence has given us a great boost in speech recognition and image recognition, but improving machine translation is still a difficult task. Today, we announced the release of the Google Neural Machine Translation (GNMT) system, which utilizes state-of-the-art training techniques to achieve the best machine translation quality so far. For a full review of our findings, please see our paper Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. A few years ago, we began using RNNs (Recurrent Neural Networks) to directly learn the mapping of an input sequence (such as a sentence in a language) to an output sequence (the same sentence in another language). Phrase-based machine translation (PBMT) breaks the input sentences into words and phrases, and then largely interprets them independently, while NMT treats the entire input sentence as the basic unit of translation.
The advantage of this approach is that, compared to the previous phrase-based translation system, this method requires less engineering design. When it was first proposed, the accuracy of NMT on a medium-sized public benchmark data set was comparable to that of a phrase-based translation system. Since then, researchers have proposed a number of techniques to improve NMT, including modeling external alignment models to handle rare words, using attention to align input and output words, and breaking words into smaller units to cope with rare words. Despite these advances, the speed and accuracy of NMT had not been able to meet the requirements of a production system such as Google Translate. Our new paper describes how to overcome the many challenges of making NMT work on very large data sets and how to build a system that is both fast and accurate enough to deliver a better translation experience for Google users and services. … Using side-by-side comparisons by human assessors as a standard, the GNMT system translates significantly better than the previous phrase-based production system. With the help of bilingual human assessors, we found in sample sentences from Wikipedia and news websites that GNMT reduced translation errors by 55% to 85% or more for several major language pairs. In addition to publishing this research paper today, we have also announced that GNMT is being put into production for a very difficult language pair: Chinese-to-English translation.
Now, the Chinese-to-English translations on the mobile and web versions of Google Translate are produced 100% by the GNMT machine, about 18 million translations per day. GNMT's production deployment uses our open machine learning tool suite TensorFlow and our Tensor Processing Units (TPUs), which provide sufficient computational power to deploy these powerful GNMT models while meeting Google Translate's strict latency requirements. Chinese-to-English is one of the more than 10,000 language pairs supported by Google Translate. In the coming months, we will continue to extend GNMT to many more language pairs. GNMT in Google Translate achieves a major breakthrough! As an old machine translation researcher, I cannot resist this temptation. I cannot wait to try this latest version of Google Translate for Chinese-English. Previously I had tried Google's Chinese-to-English online translation multiple times; the overall quality was not very readable and certainly not as good as its competitor Baidu's. With this newest breakthrough using deep learning with neural networks, it is believed to come close to human translation quality. I have a few hundred Chinese blogs on NLP waiting to be translated as a trial. I was looking forward to this first attempt at using Google Translate on my science-popularization blog titled Introduction to NLP Architecture. My adventure is about to start. Now is the time to witness the miracle, if a miracle does exist. Dong: I hope you will not be disappointed. I have jokingly said before: rule-based machine translation is a fool, statistical machine translation is a madman, and now I continue the ridicule: neural machine translation is a liar (I am not referring to the developers behind NMT). Language is not a cat face or the like; surface fluency alone does not work, and the content should be faithful to the original!
Wei: Let us experience the magic. Please listen to this translated piece of my blog: This is my Introduction to NLP Architecture, fully automatically translated by Google Translate yesterday (10/2/2016) and fully automatically read out without any human intervention. I have to say, this is way beyond my initial expectations. Listen to it for yourself; the automatic speech generation of this science blog of mine is amazingly clear and understandable. If you are an NLP student, you can take it as a lecture note from a seasoned NLP practitioner (definitely clearer than if I were giving this lecture myself, with my strong accent). The original blog was in Chinese, and I used the newest Google Translate, claimed to be based on deep learning using sentence-based as well as character-based techniques. Prof. Dong, you know my background and my originally doubtful mindset. However, in the face of such progress, far beyond the limits we imagined for automatic translation in terms of both quality and robustness when I started my NLP career in MT 30 years ago, I have to say that it is a dream come true in every sense. Dong: In their terminology, it is less adequate, but more fluent. Machine translation has gone through three paradigm shifts. When people find that it can only be a good information-processing tool, and cannot really replace human translation, they will choose the less costly option. Wei: In any case, this small test is revealing to me. I am still feeling overwhelmed to see such a miracle live. Of course, what I have just tested is formal style, on a computer and NLP topic, so it certainly hit a sweet spot with adequate training-corpus coverage. But compared with the pre-NN time when I used both Google SMT and Baidu SMT to help with my translation, this breakthrough is amazing. As a senior old-school practitioner of rule-based systems, I would like to pay deep tribute to our neural-network colleagues.
These are a group of crazy geniuses. I would like to quote Jobs' famous words here: “Here's to the crazy ones. The misfits. The rebels. The troublemakers. The round pegs in the square holes. The ones who see things differently. They're not fond of rules. And they have no respect for the status quo. You can quote them, disagree with them, glorify or vilify them. About the only thing you can't do is ignore them. Because they change things. They push the human race forward. And while some may see them as the crazy ones, we see genius. Because the people who are crazy enough to think they can change the world, are the ones who do.” @Mao, this counts as my most recent feedback to the Google scientists and their work. Last time, about a couple of months ago when they released their parser, proudly claimed to be the most accurate parser in the world, I wrote a blog to ridicule them after performing a serious, apples-to-apples comparison with our own parser. This time, they used the same underlying technology to announce this new MT breakthrough with similar pride, and I am happily expressing my deep admiration for their wonderful work. This contrast in my attitudes looks a bit weird, but it is in fact all based on the facts of life. In the case of parsing, this school suffers from a lack of naturally occurring labeled data that it could use to perfect quality, especially when it has to port to new domains or genres beyond the news corpora. After all, what exists in the language sea is corpora of raw text with linear strings of words, while the corresponding parse trees are only occasional, artificial objects made by linguists in a limited scope by nature (e.g., the Penn Treebank, or other news-genre parse trees by the Google annotation team). But MT is different: it is a unique NLP area with almost endless, high-quality, naturally occurring labeled data in the form of human translation, which has never stopped being produced.
Mao: @wei That is to say, you now embrace or endorse neural-network-based MT, a change from your previous views? Wei: Yes, I do embrace and endorse the practice. But I have not really changed my general view wrt the pros and cons of the two schools in AI and NLP. They are complementary and, in the long run, some way of combining the two promises a world better than either one alone. Mao: What is your real point? Wei: Despite the biases we are all born with more or less by human nature, conditioned by what we have done and where we come from in terms of technical background, we all need to observe and respect the basic facts. Just listen to the audio of their GNMT translation by clicking the link above: the fluency and even the faithfulness to my original text have in fact outperformed an ordinary human translator's, in my best judgment. If I gave this lecture in a classroom and asked an average interpreter without sufficient knowledge of my domain to translate on the spot for me, I bet he would have a hard time performing better than the Google machine listed above (of course, human translation gurus are an exception). This miracle-like fact has to be observed and acknowledged. On the other hand, as I said before, no matter how deep the learning reaches, I still do not see how they can catch up with the quality of my deep parsing in the next few years when they have no way of magically gaining access to the huge labeled data of trees they depend on, especially across the variety of different domains and genres. They simply cannot make bricks without straw (or, as an old Chinese saying goes, even the most capable housewife can hardly cook a good meal without rice). Because in the natural world, there are no syntactic trees and structures for them to learn from; there are only linear sentences.
The deep learning breakthrough seen so far is still mainly supervised learning, which has an almost insatiable appetite for massive labeled data, forming its limiting knowledge bottleneck. Mao: I'm confused. Which one do you believe is stronger? Who is the world's No. 1? Wei: Parsing-wise, I am happy to stay as No. 0 if Google insists on their being No. 1 in the world. As for MT, it is hard to say, from what I see, between their breakthrough and some highly sophisticated rule-based MT systems out there. But what I can say is that, at a high level, the trend of mainstream statistical MT winning out over old-school rule-based MT, both in industry and in academia, is more evident today than before. This is not to say that the MT rule system is no longer viable, or coming to an end. There are things in which SMT cannot beat rule-based MT. For example, certain types of seemingly stupid mistakes made by GNMT (quite a few laughable examples of totally wrong or opposite translations have been shown in this salon over the last few days) are almost never seen in rule-based MT systems. Dong: Here is my try of GNMT from Chinese to English: 学习上,初二是一个分水岭,学科数量明显增多,学习方法也有所改变,一些学生能及时调整适应变化,进步很快,由成绩中等上升为优秀。但也有一部分学生存在畏难情绪,将心思用在学习之外,成绩迅速下降,对学习失去兴趣,自暴自弃,从此一蹶不振,这样的同学到了初三往往很难有所突破,中考的失利难以避免。 Learning, the second of a watershed, the number of subjects significantly significantly, learning methods have also changed, some students can adjust to adapt to changes in progress, progress quickly, from the middle to rise to outstanding. But there are some students there is Fear of hard feelings, the mind used in the study, the rapid decline in performance, loss of interest in learning, self-abandonment, since the devastated, so the students often difficult to break through the third day, Mao: This translation cannot be said to be good at all. Wei: Right, that is why it calls for an objective comparison to answer your previous question.
Currently, as I see it, the data for social media and casual text are certainly not enough, hence the translation quality of online messages is still not their forte. As for the textual sample Prof. Dong showed us above, Mao said the Google translation is not of good quality, as expected. But even so, I still see impressive progress there. Before the deep learning era, SMT results from Chinese to English were hardly readable; now they can generally be read aloud and roughly understood. There is a lot of progress worth noting here. Ma: In fields with big data, DL methods have advanced by leaps and bounds in recent years. I know a number of experts who used to be biased against DL but changed their views when they saw the results. However, DL in the IR field is still basically not effective so far, though there are signs of it slowly penetrating IR. Dong: The key to NMT is looking nice. So for people who do not understand the original source text, it sounds like a smooth translation. But isn't it a liar if a translation loses its faithfulness to the original? This is the Achilles' heel of NMT. Ma: @Dong, I think all statistical methods have this weak point. Wei: Indeed, there are respective pros and cons. Today I have listened to the Google translation of my blog three times and am still amazed at what they have achieved. There are always some mistakes I can pick out here and there. But to err is human, not to say a machine, right? Not to mention that the community will not stop advancing and trying to correct mistakes. From the intelligibility and fluency perspectives, I have been served super satisfactorily today. And this occurs between two languages without any historical kinship whatsoever. Dong: Some leading managers said to me years ago: In fact, even if machine translation is only 50 percent correct, it does not matter. The problem is that it cannot tell me which half it cannot translate well.
If it could, I could always save half the labor and hire a human translator to translate only the other half. I replied that I was not able to make a system do that. Since then I have been concerned about this issue, until today, when there is a lot of noise about MT replacing human translation any time now. It's kind of like saying that once you have McDonald's you do not need a fine restaurant for French delicacies. Not to mention that machine translation today still cannot be compared to McDonald's. Computers, with machine translation and the like, are in essence a toy given by God for us humans to play with. God never agreed to equip us with the ability to copy ourselves. Why did GNMT first choose language pairs like Chinese-to-English, and not the other way round, to showcase? This is very shrewd of them. Even if the translation is wrong or misses the point, the translation is usually at least fluent in this new model, unlike the traditional models, which look and sound broken, silly and erroneous. This is characteristic of NMT: it selects the greatest similarity in the translation corpus. As a vast number of English readers do not understand Chinese, it is easy to impress them with how great the new MT is, even for a difficult language pair. Wei: Correct. A closer look reveals that this breakthrough lies more in the fluency of the target language than in faithfulness to the source language, achieving readability at the cost of accuracy. But this is just the beginning of a major shift. I can fully understand the GNMT people's joy and pride in the face of a breakthrough like this. In our careers, we do not often have that type of moment for celebration. Deep parsing is NLP's crown. It remains to be seen how they can beat us in handling domains and genres lacking labeled data. I wish them good luck, and the day they prove they make better parsers than mine will be the day of my retirement. It does not look like that day is drawing near, to my mind.
I wish I were wrong, so I can travel the world worry-free, knowing that my dream has been better realized by my colleagues. Thanks to Google Translate at https://translate.google.com/ for helping to translate this Chinese blog into English, which was post-edited by myself. Wei’s Introduction to NLP Architecture Translated by Google OVERVIEW OF NATURAL LANGUAGE PROCESSING NLP White Paper: Overview of Our NLP Core Engine Introduction to NLP Architecture It is untrue that Google SyntaxNet is the world’s most accurate parser Announcing SyntaxNet: The World’s Most Accurate Parser Goes Open Is Google SyntaxNet Really the World’s Most Accurate Parser? Dr Li's NLP Blog in English
Introduction to NLP Architecture by Dr. Wei Li (fully automatically translated by Google Translate) The automatic speech generation of this science blog of mine is attached here; it is amazingly clear and understandable. If you are an NLP student, you can listen to it as a lecture note from a seasoned NLPer (definitely clearer than if I were giving this lecture myself with my strong accent) by following the link below: Wei’s Introduction to NLP Architecture To preserve the original translation, nothing is edited below. I will write another blog to post-edit it into an official NLP architecture introduction for the audience, reviewed by myself, the original writer. But for the time being, it is completely unedited, thanks to the newly launched Google Translate service from Chinese into English at https://translate.google.com/ For the natural language processing (NLP) and its application, the system architecture is the core issue, I blog which gave four NLP system architecture diagram, now one by one to be a brief . I put the NLP system from the core engine to the application, is divided into four stages, corresponding to the four frame diagram. At the bottom of the core is deep parsing, is the natural language of the bottom-up layer of automatic analyzer, this work is the most difficult, but it is the vast majority of NLP system based technology. The purpose of parsing is to structure unstructured languages. The face of the ever-changing language, only structured, and patterns can be easily seized, the information we go to extract semantics to solve. This principle began to be the consensus of (linguistics) when Chomsky proposed the transition from superficial structure to deep structure after the linguistic revolution of 1957. A tree is not only the arcs that express syntactic relationships, but also the nodes of words or phrases that carry various information.
Despite the importance of the tree, it generally cannot directly support a product; it is only the system's internal representation, serving as the carrier of language analysis and understanding and the core support for semantic grounding to applications. The next layer up is the extraction layer, as shown above. Its input is the tree and its output is filled templates, similar to filling in a form: a table is pre-defined with the information the application needs, and the extraction system fills in the blanks, catching the relevant words or phrases in the sentence and sending them to the pre-defined columns (fields) of the table. This layer has moved from the original domain-independent parser to tasks that are domain-facing, application-oriented and product-driven. It is worth emphasizing that the extraction layer focuses on domain-oriented semantics, while the analysis layer below it is domain-independent. Therefore, a good framework does very thorough analysis of the logical semantics in order to lighten the burden on extraction: one rule extracting from the logical semantic structure of deep analysis is equivalent to thousands of extraction rules over surface language. This also creates the conditions for porting across domains. There are two types of extraction. One is traditional information extraction (IE), extracting facts or objective information: relations between entities, events that entities are involved in, and the like, capable of answering who did what, when and where. This extraction of objective information is the core technology and foundation of the knowledge graph, which is all the rage nowadays. After IE is completed, the next layer, information fusion (IF), can be used to construct the knowledge graph. The other type of extraction concerns subjective information; public opinion (sentiment) mining is based on this kind of extraction.
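The template-filling idea above can be made concrete with a toy sketch. The mini "parse" triples, relation names and slots below are my hypothetical illustrations, not the author's actual engine:

```python
# Hypothetical sketch: fill a predefined "who did what where" template
# from a toy dependency-style parse of (head, relation, dependent) triples.

def fill_template(parse):
    """Catch words from the parse and send them to pre-defined fields."""
    slots = {"who": None, "did": None, "what": None, "where": None}
    for head, rel, dep in parse:
        slots["did"] = head              # the governing verb of the event
        if rel == "subject":
            slots["who"] = dep
        elif rel == "object":
            slots["what"] = dep
        elif rel == "location":
            slots["where"] = dep
    return slots

# Toy parse of "Google announced SyntaxNet in Mountain View"
parse = [
    ("announced", "subject", "Google"),
    ("announced", "object", "SyntaxNet"),
    ("announced", "location", "Mountain View"),
]
event = fill_template(parse)
```

Because the rule works on parse relations rather than surface word order, a passive or scrambled variant of the sentence would fill the same template, which is the leverage of extracting from deep structure that the text describes.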
What I have done over the past five years is this kind of fine-grained public opinion extraction (not just thumbs-up/thumbs-down classification, but also digging out the reasons behind the opinions to provide a basis for decision making). This is one of the hardest tasks in NLP, much more difficult than IE of objective information. The extracted information is usually stored in a database. This provides the fragmented information for the mining layer above. Many people confuse information extraction with text mining, but in fact these are tasks at two different levels. Extraction faces a single language tree, finding the information you want from within one sentence. Mining faces a corpus, or the data sources as a whole, excavating statistically valuable information from the forest of language. In the information age, the biggest challenge we face is information overload; we have no way to exhaust the ocean of information. Therefore, we must use computers to dig critical intelligence out of that ocean to meet the needs of different applications. Hence mining naturally relies on statistics; without statistics, the extracted information remains chaotic fragments with much redundancy, and mining can integrate them. Many systems do not dig deep: they simply express an information need as a query, retrieve the relevant information in real time from the database of fragments, combine the top n results, and serve them to products and users. This is in fact also a kind of mining, but a simple, search-style mining that directly supports an application. In fact, to do mining well there is a lot of work to be done, which not only can improve the quality of the existing information; going deeper, one can also mine hidden intelligence, that is, information not explicitly expressed in the metadata, such as discovered causal relations or other statistical trends.
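Mining implicit associations over structured records, as discussed above, can be sketched with a toy co-occurrence count. The basket data below is invented purely for illustration:

```python
# Toy sketch of association mining over structured records:
# count how often pairs of items co-occur across transaction baskets.
from collections import Counter
from itertools import combinations

baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"bread", "milk"},
    {"diapers", "beer", "bread"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):  # all unordered item pairs
        pair_counts[pair] += 1

top_pair, support = pair_counts.most_common(1)[0]  # strongest association
```

Once language has been structured into database fragments by extraction, the same kind of counting applies to entities and events instead of shopping items.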
This type of mining was first done in traditional data mining, because traditional mining was aimed at structured data such as transaction records, making it easy to mine implicit associations (e.g., people who buy diapers often also buy beer; it turns out this is the typical behavior of new fathers, and such discovered information can be used to optimize the display and sale of goods). Nowadays, natural language is also structured, with fragments of intelligence extracted into a database, so implicit-association mining can likewise be done to enhance the value of the intelligence. The fourth architecture diagram is the NLP application layer. In this layer, the various information produced by analysis, extraction and mining can support different NLP products and services: from question answering (QA) systems to the dynamic display of the knowledge graph (already visible in Google search), from automatic polling to customer intelligence, from intelligent assistants to automatic summarization, and so on. This is my overall understanding of the basic architecture of NLP, based on nearly 20 years of experience building NLP products in industry. Eighteen years ago, I used an NLP architecture diagram in my first venture pitch; the investor himself told us that this was a million-dollar slide. Today's explanation extends from that diagram. As the saying goes, as heaven stays unchanged, so does the Way. I have previously told the million-dollar-slide story. During Clinton's presidency around 2000, the United States saw a great leap forward in Internet technology, known as the dot-com bubble: hot money was rolling in, and all kinds of Internet startups sprang up. In that climate, my boss decided to strike while the iron was hot and seek venture capital, and asked me to prepare an introduction to the prototype of our language system.
I then drew the following three-tier NLP system architecture diagram: the bottom is the parser, from shallow to deep; the middle is information extraction built on top of parsing; and the top shows the main categories of applications, including QA systems. Connecting the applications and the two language-processing layers below is a database, used to store the results of information extraction, which can be supplied to the applications at any time. This architecture has not changed much since I made it 15 years ago, although the details and icons have been redone no fewer than 100 times. The architecture diagram in this article is roughly the 20th edition. It covers the core engine (back end) and does not include the applications (front end). The slide was sent by my boss to a Wall Street angel investor early in the morning, and by noon we had his reply saying he was very interested. In less than two weeks, we received our first $1 million angel investment check. The investor said that this was a million-dollar slide, which not only showed the technical threshold but also the great potential of the technology. 【Related】 Pre-Knowledge-Graph: The Architecture of an Information Extraction Engine. Translated from http://blog.sciencenet.cn/blog-362400-981742.html, retrieved 10/1/2016 via https://translate.google.com/
List of 23 NLP Publications (Cymfony Period)
Once upon a time, we were publishing like crazy … as if we were striving for tenure.
1. R. Srihari, W. Li and X. Li. 2006. Question Answering Supported by Multiple Levels of Information Extraction. Book chapter in T. Strzalkowski & S. Harabagiu (eds.), Advances in Open-Domain Question Answering. Springer, 2006, ISBN: 1-4020-4744-4. http://link.springer.com/chapter/10.1007%2F978-1-4020-4746-6_11
2. R. Srihari, W. Li, C. Niu and T. Cornell. 2006. InfoXtract: A Customizable Intermediate Level Information Extraction Engine. Journal of Natural Language Engineering, 12(4), 1-37. http://journals.cambridge.org/action/displayAbstract?fromPage=onlineaid=1513012 (This paper focuses on IE tasks designed to support information discovery applications. It defines new IE tasks such as entity profiles and concept-based general events, which represent realistic goals in terms of what can be accomplished in the near term as well as providing useful, actionable information.)
3. C. Niu, W. Li, R. Srihari and H. Li. 2005. Word Independent Context Pair Classification Model for Word Sense Disambiguation. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005). W05-0605.
4. C. Niu, W. Li and R. Srihari. 2004. Weakly Supervised Learning for Cross-document Person Name Disambiguation Supported by Information Extraction. In Proceedings of ACL 2004.
5. C. Niu, W. Li, R. Srihari, H. Li and L. Christ. 2004. Context Clustering for Word Sense Disambiguation Based on Modeling Pairwise Context Similarities. In Proceedings of the Senseval-3 Workshop, ACL 2004.
6. C. Niu, W. Li, J. Ding and R. Srihari. 2004. Orthographic Case Restoration Using Supervised Learning Without Manual Annotation. International Journal of Artificial Intelligence Tools, Vol. 13, No. 1, 2004.
7. C. Niu, W. Li and R. Srihari. 2004. A Bootstrapping Approach to Information Extraction Domain Porting. In ATEM-2004: The AAAI-04 Workshop on Adaptive Text Extraction and Mining, San Jose.
8. W. Li, X. Zhang, C. Niu, Y. Jiang and R. Srihari. 2003. An Expert Lexicon Approach to Identifying English Phrasal Verbs. In Proceedings of ACL 2003, Sapporo, Japan, pp. 513-520.
9. C. Niu, W. Li, J. Ding and R. Srihari. 2003. A Bootstrapping Approach to Named Entity Classification Using Successive Learners. In Proceedings of ACL 2003, Sapporo, Japan, pp. 335-342.
10. W. Li, R. Srihari, C. Niu and X. Li. 2003. Question Answering on a Case Insensitive Corpus. In Proceedings of the Workshop on Multilingual Summarization and Question Answering – Machine Learning and Beyond (ACL-2003 Workshop), Sapporo, Japan, pp. 84-93.
11. C. Niu, W. Li, J. Ding and R.K. Srihari. 2003. Bootstrapping for Named Entity Tagging Using Concept-based Seeds. In Proceedings of HLT/NAACL 2003, Companion Volume, pp. 73-75, Edmonton, Canada.
12. R. Srihari, W. Li, C. Niu and T. Cornell. 2003. InfoXtract: A Customizable Intermediate Level Information Extraction Engine. In Proceedings of the HLT/NAACL 2003 Workshop on Software Engineering and Architecture of Language Technology Systems (SEALTS), pp. 52-59, Edmonton, Canada.
13. H. Li, R. Srihari, C. Niu and W. Li. 2003. InfoXtract Location Normalization: A Hybrid Approach to Geographic References in Information Extraction. In Proceedings of the HLT/NAACL 2003 Workshop on Analysis of Geographic References, Edmonton, Canada.
14. W. Li, R. Srihari, C. Niu and X. Li. 2003. Entity Profile Extraction from Large Corpora. In Proceedings of the Pacific Association for Computational Linguistics 2003 (PACLING03), Halifax, Nova Scotia, Canada.
15. C. Niu, W. Li, R. Srihari and L. Crist. 2003. Bootstrapping a Hidden Markov Model for Relationship Extraction Using Multi-level Contexts. In Proceedings of PACLING03, Halifax, Nova Scotia, Canada.
16. C. Niu, Z. Zheng, R. Srihari, H. Li and W. Li. 2003. Unsupervised Learning for Verb Sense Disambiguation Using Both Trigger Words and Parsing Relations. In Proceedings of PACLING03, Halifax, Nova Scotia, Canada.
17. C. Niu, W. Li, J. Ding and R.K. Srihari. 2003. Orthographic Case Restoration Using Supervised Learning Without Manual Annotation. In Proceedings of the Sixteenth International FLAIRS Conference, St. Augustine, FL, May 2003, pp. 402-406.
18. R. Srihari and W. Li. 2003. Rapid Domain Porting of an Intermediate Level Information Extraction Engine. In Proceedings of the International Conference on Natural Language Processing 2003 (ICON 2003).
19. H. Li, R. Srihari, C. Niu and W. Li. 2002. Location Normalization for Information Extraction. In Proceedings of the 19th International Conference on Computational Linguistics (COLING-2002), Taipei, Taiwan.
20. W. Li, R. Srihari, X. Li, M. Srikanth, X. Zhang and C. Niu. 2002. Extracting Exact Answers to Questions Based on Structural Links. In Proceedings of Multilingual Summarization and Question Answering (COLING-2002 Workshop), Taipei, Taiwan.
21. R. Srihari and W. Li. 2000. A Question Answering System Supported by Information Extraction. In Proceedings of ANLP 2000, Seattle.
22. R. Srihari, C. Niu and W. Li. 2000. A Hybrid Approach for Named Entity and Sub-Type Tagging. In Proceedings of ANLP 2000, Seattle.
23. R. Srihari and W. Li. 1999. Question Answering Supported by Information Extraction. In Proceedings of TREC-8, Washington.
Other publications: SBIR Final Reports
- W. Li & R. Srihari. 2003. Flexible Information Extraction Learning Algorithm (Phase 2). Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.
- W. Li & R. Srihari. 2001. Intermediate-Level Event Extraction for Temporal and Spatial Analysis and Visualization (Phase 1). Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.
- W. Li & R. Srihari. 2000. A Domain Independent Event Extraction Toolkit (Phase 2). Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.
- W. Li & R. Srihari. 2000. Flexible Information Extraction Learning Algorithm (Phase 1). Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.
- W. Li & R. Srihari. 2003. Automated Verb Sense Identification (Phase I). Final Technical Report, U.S. DoD SBIR (Navy), Contract No. N00178-02-C-3073 (2002-2003).
- R. Srihari & W. Li. 2003. Fusion of Information from Diverse, Textual Media: A Case Restoration Approach (Phase I). Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. Contract No. F30602-02-C-0156 (2002-2003).
- R. Srihari, W. Li & C. Niu. 2004. A Large Scale Knowledge Repository and Information Discovery Portal Derived from Information Extraction (Phase 1). Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. (2003-2004)
- R. Srihari & W. Li. 2003. An Automated Domain Porting Toolkit for Information Extraction (Phase I). Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. Contract No. F30602-02-C-0057 (2002-2003).
- T. Cornell, R. Srihari & W. Li. 2004. Automatically Time Stamping Events in Unrestricted Text (Phase I). Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. (2003-2004)
Extreme Learning Machines (ELM) — program site: http://www.ntu.edu.sg/home/egbhuang/ Extreme Learning Machines (ELM): Filling the Gap between Frank Rosenblatt's Dream and John von Neumann's Puzzle. - Network architectures: a homogeneous hierarchical learning machine for partially or fully connected multi-layer / single-layer (artificial or biological) networks with almost any type of practical (artificial) hidden nodes (or biological neurons). - Learning theories: learning can be done without iteratively tuning the (artificial) hidden nodes (or biological neurons). - Learning algorithms: general, unifying and universal (optimization-based) learning frameworks for compression, feature learning, clustering, regression and classification. Basic steps: 1) learning is done layer-wise (in white box); 2) randomly generate (any nonlinear piecewise) hidden neurons, or inherit hidden neurons from ancestors; 3) learn the output weights in each hidden layer (with application-based optimization constraints). © 2013 www.extreme-learning-machines.org. All rights reserved. Thank you for providing more downloadable programs! Thank you for your advice! Thank you for pointing out any errors in the above!
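The three basic steps above can be sketched as a minimal single-hidden-layer ELM for regression. The hidden-layer size, random seed, weight scale and sigmoid activation below are my own assumptions for illustration, not prescribed by the ELM site:

```python
# Minimal ELM sketch: random, never-tuned hidden neurons; output weights
# solved analytically by least squares (Moore-Penrose pseudo-inverse).
import numpy as np

def elm_fit(X, T, n_hidden, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=4.0, size=(X.shape[1], n_hidden))  # step 2: random input weights
    b = rng.normal(scale=4.0, size=n_hidden)                # step 2: random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))                  # hidden-layer outputs (sigmoid)
    beta = np.linalg.pinv(H) @ T                            # step 3: analytic output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

# Toy regression: fit y = x^2 on [0, 1] with no iterative training at all
X = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
T = X ** 2
W, b, beta = elm_fit(X, T, n_hidden=40)
mse = float(np.mean((elm_predict(X, W, b, beta) - T) ** 2))
```

Note that only `beta` is learned; the hidden layer is frozen at its random initialization, which is exactly the "no iterative tuning of hidden neurons" claim in the text.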
In my article “Pride and Prejudice of Main Stream”, the first myth listed among the top 10 misconceptions in NLP is as follows: a rule-based system faces a knowledge bottleneck of hand-crafted development, while a machine learning system involves automatic training (implying no knowledge bottleneck). While there are numerous misconceptions about the old school of rule systems, this hand-crafted myth can be regarded as the source of them all. Just review the NLP papers: no matter what language phenomena are being discussed, it is almost a cliché to cite a couple of old-school works to demonstrate the superiority of machine learning algorithms, and the reason for the attack needs only one sentence, to the effect that hand-crafted rules lead to a system “difficult to develop” (or “difficult to scale up”, “with low efficiency”, “lacking robustness”, etc.), or simply a rejection like this: “systems in the literature have tried to handle the problem in different aspects, but these systems are all hand-crafted”. Once labeled with hand-crafting, one does not even need to discuss the effect and quality. Hand-craft becomes the rule system's “original sin”, and the linguists crafting rules therefore become the community's second-class citizens bearing that sin. So what is wrong with hand-crafting, or coding linguistic rules for computer processing of languages? NLP development is software engineering. From the software engineering perspective, hand-crafting is programming, while machine learning belongs to automatic programming. Unless we assume that natural language is a special object whose processing can all be handled by systems automatically programmed or learned by machine learning algorithms, it does not make sense to reject or belittle the practice of coding linguistic rules for developing an NLP system. For consumer products and arts, hand-craft is definitely a positive word: it represents quality or uniqueness and high value, a legitimate reason for a good price.
Why does it become a derogatory term in NLP? The root cause is that in the field of NLP, almost as if some collective hypnosis had hit the community, people are intentionally or unintentionally led to believe that machine learning is the only correct choice. In other words, by criticizing, rejecting or disregarding hand-crafted rule systems, the underlying assumption is that machine learning is a panacea, universal and effective, always the preferred approach over the other school. The fact of life is, in the face of the complexity of natural language, machine learning from data has so far only surfaced the tip of the iceberg of the language monster (called low-hanging fruit by Church in K. Church: A Pendulum Swung Too Far), far from reaching the goal of a complete solution to language understanding and applications. There is no basis to support the claim that machine learning alone can solve all language problems, nor is there any evidence that machine learning necessarily leads to better quality than coding rules by domain specialists (e.g. computational grammarians). Depending on the nature and depth of the NLP tasks, hand-crafted systems actually have a better chance of performing well than machine learning, at least for non-trivial and deep-level NLP tasks such as parsing, sentiment analysis and information extraction (we have tried and compared both approaches). In fact, the only major reason why they are still there, having survived all the rejections from the mainstream and still playing a role in industrial practical applications, is their superior data quality, for otherwise they could not have been justified for industrial investment at all. The “forgotten” school: why is it still there? What does it have to offer? The key is the excellent data quality of a hand-crafted system as its advantage: not only precision, but high recall is achievable as well.
Quote from On Recall of Grammar Engineering Systems: In the real world, NLP is applied research which eventually must land on the engineering of language applications, where the results and quality are evaluated. As an industry, software engineering has attracted many ingenious coding masters, each and every one of whom is recognized for their coding skills, including algorithm design and implementation expertise, which are hand-crafting by nature. Have we ever heard of a star engineer being criticized for his (manual) programming? With NLP applications also part of software engineering, why should computational linguists coding linguistic rules receive so much criticism while engineers coding other applications get recognized for their hard work? Is it because NLP applications are simpler than other applications? On the contrary, many natural language applications are more complex and difficult than other types of applications (e.g. graphics software or word-processing apps). The likely explanation for the different treatment of a general-purpose programmer and a linguist knowledge engineer is that the big environment of software engineering does not involve as much prejudice, while the small environment of the NLP domain is deeply biased, with the belief that the automatic programming of an NLP system by machine learning can replace and outperform manual coding for all language projects. For software engineering in general, (manual) programming is the norm, and no one believes that programmers' jobs can be replaced by automatic programming in any foreseeable time. Automatic programming, a concept not rare in science fiction for visions like machines making machines, is currently only a research area, for very restricted low-level functions.
Rather than placing hope on automatic programming, software engineering as an industry has seen significant progress in development infrastructure, such as development environments and rich libraries of functions to support efficient coding and debugging. Maybe one day in the future, applications will be able to use more and more automated code for simple modules, but the full automation of constructing any complex software project is nowhere in sight. By any standard, natural language parsing and understanding (beyond shallow-level tasks such as classification, clustering or tagging) is a type of complex task. Therefore, it is hard to expect machine learning, as a manifestation of automatic programming, to miraculously replace manual code for all language applications. The application value of hand-crafting a rule system will continue to exist and evolve for a long time, disregarded or not. “Automatic” is a fancy word. What a beautiful world it would be if all artificial intelligence and natural language tasks could be accomplished by automatic machine learning from data. There is, naturally, a high expectation of and regard for a machine learning breakthrough to help realize this dream of mankind. All this should encourage machine learning experts to continue to innovate to demonstrate its potential, and should not be a reason for pride and prejudice against a competing school or other approaches. Before we embark on further discussion of the so-called rule system's knowledge-bottleneck defect, it is worth mentioning that the word “automatic” refers to the system development, not to be confused with running the system. At the application level, whether it is a machine-learned system or a manual system coded by domain programmers (linguists), the system always runs fully automatically, with no human interference.
Although this is an obvious fact for both types of systems, I have seen people get confused and equate a hand-crafted NLP system with manual or semi-automatic applications. Is hand-crafting rules a knowledge bottleneck for development? Yes; there is no denying, nor any need to deny, that. The bottleneck is reflected in the system development cycle. But keep in mind that this “bottleneck” is common to all large software engineering projects: it is a resource cost, not something introduced only by NLP. From this perspective, the knowledge-bottleneck argument against hand-crafted systems cannot really stand, unless it can be proved that machine learning can do all of NLP equally well, free of any knowledge bottleneck. That might not be far from the truth for some special low-level tasks, e.g. document classification and word clustering, but it is definitely misleading or incorrect for NLP in general, a point to be discussed in detail shortly. Here are some ballpark estimates based on our decades of NLP practice and experience. For shallow-level NLP tasks (such as named entity tagging or Chinese segmentation), a rule approach needs at least three months of one linguist coding and debugging the rules, supported by at least half an engineer for tools support and platform maintenance, in order to come up with a decent system for initial release. As for deep NLP tasks (such as deep parsing, or deep sentiment beyond thumbs-up/thumbs-down classification), one should not expect a working engine to be built without due resources: at least one computational linguist coding rules for one year, coupled with half an engineer for platform and tools support and half an engineer for independent QA (quality assurance) support. Of course, the labor requirements vary with the quality of the developers (especially the linguistic expertise of the knowledge engineers) and with how well the infrastructure and development environment support linguistic development.
Also, the above estimates do not include the general costs that apply to all software applications, e.g. GUI development at the app level and operations in running the developed engines. Let us present the scene of modern-day rule-based system development. A hand-crafted NLP rule system is based on compiled computational grammars, which are nowadays often architected as an integrated pipeline of different modules from shallow processing up to deep processing. A grammar is a set of linguistic rules encoded in some formalism; it is the core of a module intended to achieve a defined function in language processing, e.g. a module for shallow parsing may target the noun phrase (NP) as its object for identification and chunking. What happens in grammar engineering is not much different from other software engineering projects. As a knowledge engineer, a computational linguist codes a rule in an NLP-specific language, based on a development corpus. The development is data-driven: each line of rule code goes through rigid unit tests and then regression tests before it is submitted as part of the updated system for independent QA to test and give feedback on. The development is an iterative process and cycle, where incremental enhancements on bug reports from QA and/or from the field (customers) serve as a necessary input and a step towards better data quality over time. Depending on the design of the architect, there are all types of information available for the linguist developer to use in crafting a rule's conditions; e.g. a rule can check any element of a pattern by enforcing conditions on (i) the word or stem itself (i.e. the string literal, for capturing, say, idiomatic expressions), and/or (ii) POS (part of speech, such as noun, adjective, verb, preposition), and/or (iii) orthography features (e.g. initial upper case, mixed case, token with digits and dots), and/or (iv) morphology features (e.g. tense, aspect, person, number, case, etc.
decoded by a previous morphology module), and/or (v) syntactic features (e.g. verb subcategorization features such as intransitive, transitive, ditransitive), and/or (vi) lexical semantic features (e.g. human, animal, furniture, food, school, time, location, color, emotion). There are almost infinite combinations of such conditions that can be enforced in rules' patterns. A linguist's job is to code such conditions to maximize the benefit of capturing the target language phenomena, a balancing art in engineering achieved through a process of trial and error. Macroscopically speaking, the rule hand-crafting process is in essence the same as programmers coding an application, except that linguists usually use a different, very high-level, NLP-specific language, in a chosen or designed formalism appropriate for modeling natural language, within a framework on a platform geared towards facilitating NLP work. Hard-coding NLP in a general-purpose language like Java is not impossible for prototyping or a toy system. But as natural language is known to be a complex monster, its processing calls for a special formalism (some form or extension of Chomsky's formal language types) and an NLP-oriented language to help implement any non-toy system that scales. So linguists are trained on the scene of development to be knowledge programmers hand-crafting linguistic rules. In terms of the different levels of language used for coding, it is to an extent similar to the contrast between programmers in the old days and the modern software engineers who use so-called high-level languages like Java or C to code. Decades ago, programmers had to use assembly or machine language to code a function.
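To make the condition types above concrete, here is a hypothetical sketch, not the author's actual formalism: a "rule" as a sequence of feature-bundle conditions matched against tagged tokens, mixing orthography, POS and syntactic subcategory conditions:

```python
# Hypothetical rule-formalism sketch: a rule is a sequence of condition
# dicts; a token matches a condition if every required feature agrees.

def token_matches(token, conds):
    return all(token.get(feat) == val for feat, val in conds.items())

def rule_matches(rule, tokens, start):
    if start + len(rule) > len(tokens):   # not enough tokens left
        return False
    return all(token_matches(tok, conds)
               for tok, conds in zip(tokens[start:], rule))

# Rule: a capitalized noun followed by a transitive verb.
rule = [
    {"pos": "NOUN", "orth": "init_upper"},     # POS + orthography conditions
    {"pos": "VERB", "subcat": "transitive"},   # POS + syntactic subcategory
]

sentence = [  # toy output of tokenizer + tagger + lexicon lookup
    {"word": "Google", "pos": "NOUN", "orth": "init_upper"},
    {"word": "acquired", "pos": "VERB", "subcat": "transitive"},
    {"word": "DeepMind", "pos": "NOUN", "orth": "init_upper"},
]
```

A real formalism would add regular-operator patterns, actions on matches, and many more feature dimensions, but the principle of combining conditions across feature types is the same.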
The process and workflow for hand-crafting linguistic rules are just like any software engineer's daily coding practice, except that the language designed for linguists is so high-level that linguistic developers can concentrate on linguistic challenges without having to worry about low-level technical details such as memory allocation, garbage collection or pure code optimization for efficiency, which are taken care of by the NLP platform itself. Everything else follows software development norms to ensure the development stays on track, including unit testing, baseline construction and monitoring, regression testing, independent QA, code reviews for rules' quality, etc. Each level of language has its own star engineers who master its coding skills. It sounds ridiculous to respect software engineers while belittling linguistic engineers only because the latter hand-craft linguistic code as knowledge resources. The chief architect in this context plays the key role in building a real-life, robust NLP system that scales. To deep-parse or process natural language, he/she needs to define and design the formalism and language with the necessary extensions, the related data structures, and the system architecture with the interaction of different levels of linguistic modules in mind (e.g. the morpho-syntactic interface), a workflow that integrates all components for internal coordination (including patching and handling interdependency and error propagation), and the external coordination with other modules or sub-systems, including machine learning or off-the-shelf tools when needed or felt beneficial. He also needs to ensure an efficient development environment and to train new linguists into effective linguistic “coders” with an engineering sense following software development norms (knowledge engineers are not trained by schools today).
Unlike mainstream machine learning systems, which are by nature robust and scalable, a hand-crafted system's robustness and scalability depend largely on the design and deep skills of the architect. The architect defines the NLP platform, with specs for its core engine compiler and runner, plus the debugger in a friendly development environment. He must also work with product managers to turn their requirements into operational specs for linguistic development, in a process we call semantic grounding from linguistic processing to applications. The success of a large NLP system based on hand-crafted rules is never a simple accumulation of linguistic resources such as computational lexicons and grammars using a fixed formalism (e.g. CFG) and algorithm (e.g. chart parsing). It calls for seasoned language engineering masters as architects for the system design. Given the scene of practice for NLP development as described above, it should be clear that the negative sentiment associated with “hand-crafting” is unjustifiable and inappropriate. The only remaining argument against coding rules by hand comes down to the hard work and costs associated with the hand-crafted approach, the so-called knowledge bottleneck of rule-based systems. If things can be learned by a machine without cost, why bother using costly linguistic labor? This sounds like a reasonable argument until we examine it closely. First, for this argument to stand, we need proof that machine learning indeed does not incur costs and has no, or very little, knowledge bottleneck. Second, for this argument to withstand scrutiny, we should be convinced that machine learning can reach the same or better quality than the hand-crafted rule approach. Unfortunately, neither of these necessarily holds true. Let us study them one by one. As is known to all, any non-trivial NLP task is by nature based on linguistic knowledge, irrespective of the form in which that knowledge is learned or encoded.
Knowledge needs to be formalized in some form to support NLP, and machine learning is by no means immune to this knowledge-resource requirement. In rule-based systems, the knowledge is directly hand-coded by linguists; in the case of (supervised) machine learning, knowledge resources take the form of labeled data for the learning algorithm to learn from. (There is, indeed, so-called unsupervised learning, which needs no labeled data and is supposed to learn from raw data, but that remains research-oriented and is hardly practical for any non-trivial NLP task, so we leave it aside for now.) Although the learning process is automatic, the feature design, the learning-algorithm implementation, debugging and fine-tuning are all manual, in addition to the requirement of manually labeling a large training corpus in advance (unless an existing labeled corpus is available, which is rare; machine translation is a nice exception, as it can use existing human translations as labeled aligned corpora for training). The labeling of data is a very tedious manual job. Note that the sparse-data challenge reflects machine learning's need for a very large labeled corpus. So it is clear that the knowledge bottleneck takes different forms, but it applies equally to both approaches. No machine can learn knowledge without cost, and it is incorrect to regard the knowledge bottleneck as a defect only of rule-based systems. One may argue that rules require expert skilled labor, while the labeling of data requires only high school kids or college students with minimal training.
So to do a fair comparison of the associated costs, we perhaps need to turn to Karl Marx, whose Das Kapital offers a formula for converting simple labor to complex labor in the exchange of equal value: for a given task at the same level of performance quality (assuming machine learning can reach the quality of professional expertise, which is not necessarily true), how much cheap labor is needed to label the required amount of training corpus before machine learning becomes economically advantageous? Something like that. This varies from task to task and even from location to location (e.g. different minimum-wage laws), of course. But the key point is that the knowledge bottleneck challenges both approaches; it is not the case, as many believe, that machine learning produces a system automatically with little or no cost attached. In fact, things are far more complicated than a simple yes or no, as costs also need to be calculated in the larger context of how many tasks need to be handled and how much of the underlying knowledge can be shared as reusable resources. We will leave it to a separate write-up to elaborate the point that, in the context of developing multiple NLP applications, the rule-based approach, which shares the core parsing engine across applications, demonstrates significant savings in knowledge costs over machine learning. Let us step back and, for argument's sake, accept that coding rules is indeed more costly than machine learning. So what? As with other commodities, hand-crafted products may indeed cost more, but they also have better quality and value than products of mass production; otherwise a commodity society would leave no room for craftsmen and their products to survive. This is common sense, and it also applies to NLP. If not for better quality, no investor would fund a team that could be replaced by machine learning.
What is surprising is that so many people, NLP experts included, believe that machine learning necessarily outperforms hand-crafted systems, not only in costs saved but also in quality achieved. While there are low-level NLP tasks, such as speech processing and document classification, that are not the experts' forte (we humans have far more restricted memory than computers do), deep NLP involves much more linguistic expertise and design than a simple concept of learning from corpora can be expected to deliver. In summary, the alleged hand-crafted-rule defect is largely a misconception circulating widely in NLP and reinforced by the mainstream, due to incomplete induction or ignorance of the scene of modern-day rule development. It rests on the incorrect assumption that machine learning necessarily handles all NLP tasks with the same or better quality and with less of a knowledge bottleneck than systems based on hand-crafted rules.
Note: This is the author's own translation, with adaptation, of part of our paper which originally appeared in Chinese in Communications of the Chinese Computer Federation (CCCF), Issue 8, 2013.
Related:
Domain portability myth in natural language processing
Pride and Prejudice of NLP Main Stream
K. Church: A Pendulum Swung Too Far, Linguistic Issues in Language Technology, 2011; 6(5)
Wintner 2009. What Science Underlies Natural Language Engineering? Computational Linguistics, Volume 35, Number 4
Pros and Cons of Two Approaches: Machine Learning vs Grammar Engineering
Overview of Natural Language Processing
Dr. Wei Li's English Blog on NLP
As we all know, natural language parsing is fairly complex but instrumental for Natural Language Understanding (NLU) and its applications. We also know that a breakthrough to 90%+ accuracy in parsing is close to human performance and is indeed an achievement to be proud of. Nevertheless, common sense tells us that it takes considerable guts to claim the "most" of anything without a scope or other conditions attached, unless the claim is honored by an authority such as Guinness. For Google's claim of "the world's most accurate parser", we only need to cite one system that outperforms theirs to show the claim untrue or misleading. We happen to have built one. For a long time we have known that our English parser is near human performance in data quality, and that it is robust, fast and scales up to big data in support of real-life products. For the approach we take, i.e. grammar engineering, the "other school" from mainstream statistical parsing, this was simply a natural result of the architect's design and his decades of linguistic expertise. In fact, our parser reached near-human performance over five years ago, at a point of diminishing returns, and we therefore decided not to invest heavily in its further development. Instead, our focus shifted to its applications in supporting open-domain question answering and fine-grained deep sentiment analysis for our products, as well as to the multilingual space. So a few weeks ago, when Google announced SyntaxNet, I was bombarded with the news from all kinds of channels and by many colleagues, including my boss and our marketing executives. All were kind enough to draw my attention to this "newest breakthrough in NLU" and seemed to imply that we should work harder to catch up with the giant. In my mind, there has never been any doubt that the other school has a long way to go before they can catch us.
But we live in the information age, and this is the power of the Internet: eye-catching news from or about a giant, true or misleading, instantly spreads all over the world. So I felt the need to do some study, not only to uncover the true picture of this space, but more importantly to try to educate the public and the young scholars entering this field that there have always been, and will always be, two schools of NLU and AI (Artificial Intelligence). These two schools have their respective pros and cons; they can be complementary and hybrid, but by no means can we completely ignore one or replace one with the other. Besides, how boring the world would become if there were only one approach, one choice, one voice, especially in core areas of NLU such as parsing (as well as information extraction and sentiment analysis, among others) where the "select" approach does not perform nearly as well as the forgotten one. So I instructed a linguist who was not involved in the development of our parser to benchmark both systems as objectively as possible, and to give an apples-to-apples comparison of their respective performance. Fortunately, Google's SyntaxNet outputs syntactic dependency relationships, and ours is also mainly a dependency parser. Despite differences in details and naming conventions, the results are not difficult to contrast and compare based on linguistic judgment. To keep things simple and fair, we fragment the parse tree of an input sentence into binary dependency relations and let the testing linguist judge; when in doubt, he consults another senior linguist to resolve the question, or puts the case on hold if it is believed to lie in a gray area, which is rare. Unlike some other NLP tasks, e.g. sentiment analysis, where there is considerable gray area and inter-annotator disagreement, parsing results are fairly easy to reach consensus on among linguists.
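The testing scheme just described, fragmenting each parse tree into binary dependency relations and comparing the two systems relation by relation, can be sketched as follows. This is a minimal illustration only: the sentence, the relation labels and the dictionary-based parse representation are hypothetical, not either system's actual output format.

```python
def fragment(parse):
    """Fragment a parse, given as {dependent: (head, relation)}, into a set of binary relations."""
    return {(head, rel, dep) for dep, (head, rel) in parse.items()}

# Hypothetical parses of "Obama endorsed Clinton" from two systems.
parser_a = {"Obama": ("endorsed", "subj"), "Clinton": ("endorsed", "obj")}
parser_b = {"Obama": ("endorsed", "subj"), "Clinton": ("Obama", "obj")}  # wrong head for "Clinton"

agreed = fragment(parser_a) & fragment(parser_b)    # relations both systems assert
disputed = fragment(parser_a) ^ fragment(parser_b)  # relations that go to the linguist judge
```

Relations on which the systems agree need no adjudication; only the symmetric difference has to be judged against the gold reading.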
Despite the different formats in which the two systems embody their results (an output sample is shown below), it is not difficult to make a direct comparison of each dependency in both systems' sentence-tree output. (To be stricter on our side, a patched relationship called the Next link, used in our results, does not count as a legitimate syntactic relation in testing.) SyntaxNet output: 1. Input: President Barack Obama endorsed presumptive Democratic presidential nominee Hillary Clinton in a web video Thursday . Netbase output: Benchmarking was performed in two stages, as follows. In Stage 1, we selected English formal text in the news domain, which is SyntaxNet's forte, as it is believed to have much more training data for news than for other styles or genres. The announced 94% accuracy in news parsing is indeed impressive. In our case, news is not the major source of our development corpus, because our goal is a domain-independent parser supporting a variety of genres of English text for real-life applications, such as social media (informal text) for sentiment analysis, as well as technology papers (formal text) for answering how-questions. We randomly selected three recent news articles for this test, with the following links.
(1) http://www.cnn.com/2016/06/09/politics/president-barack-obama-endorses-hillary-clinton-in-video/
(2) Part of news from: http://www.wsj.com/articles/nintendo-gives-gamers-look-at-new-zelda-1465936033
(3) Part of news from: http://www.cnn.com/2016/06/15/us/alligator-attacks-child-disney-florida/
Here are the benchmarking results of parsing the above for the news genre (tp for true positives, fp for false positives, fn for false negatives; P for precision, R for recall, and F for F-score):
(1) Google SyntaxNet: F-score = 0.94
P = tp/(tp+fp) = 1737/(1737+104) = 1737/1841 = 0.94
R = tp/(tp+fn) = 1737/(1737+96) = 1737/1833 = 0.95
F = 2*P*R/(P+R) = 2*(0.94*0.95)/(0.94+0.95) = 2*(0.893/1.89) = 0.94
(2) Netbase parser: F-score = 0.95
P = tp/(tp+fp) = 1714/(1714+66) = 1714/1780 = 0.96
R = tp/(tp+fn) = 1714/(1714+119) = 1714/1833 = 0.94
F = 2*P*R/(P+R) = 2*(0.96*0.94)/(0.96+0.94) = 2*(0.9024/1.90) = 0.95
So the Netbase parser is about 2 percentage points better than Google SyntaxNet in precision but 1 point lower in recall. Overall, Netbase is slightly better than Google on the combined precision-recall measure, the F-score. As both parsers are near the point of diminishing returns for further development, there is not much room left for competition here. In Stage 2, we selected informal text from the social medium Twitter to test a parser's robustness in handling "degraded text". As expected, degraded text always leads to degraded performance (for a human as well as for a machine), but a robust parser should handle it with only limited degradation. If a parser performs well only in one genre or domain and its performance falls drastically in other genres, it is not of much use, because most genres and domains do not have labeled data as large as the seasoned news genre. With this knowledge bottleneck, a parser is severely challenged and limited in its potential to support NLU applications.
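The F-score arithmetic above can be reproduced with a small helper using the standard definitions (with fn denoting false negatives in the recall denominator):

```python
def prf(tp, fp, fn):
    """Precision, recall and F-score from raw counts (fn = false negatives)."""
    p = tp / (tp + fp)          # precision: correct relations among those emitted
    r = tp / (tp + fn)          # recall: correct relations among those in the gold standard
    f = 2 * p * r / (p + r)     # harmonic mean of precision and recall
    return p, r, f

# Counts reported above for the news-genre benchmark.
p1, r1, f1 = prf(tp=1737, fp=104, fn=96)   # Google SyntaxNet
p2, r2, f2 = prf(tp=1714, fp=66, fn=119)   # Netbase parser
```

Note that the published figures round P and R to two decimals before combining them, so an exact computation can differ from the printed F by up to a point in the last digit.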
After all, parsing is not an end in itself, but a means to turn unstructured text into structures that support semantic grounding to various applications in different domains. We randomly selected 100 tweets from Twitter for this test, with some samples shown below.
1. Input: RT @ KealaLanae : ima leave ths here. https://t.co/FI4QrSQeLh
2. Input: @ WWE_TheShield12 I do what I want jk I ca n't kill you .
10. Input: RT @ blushybieber : Follow everyone who retweets this , 4 mins
20. Input: RT @ LedoPizza : Proudly Founded in Maryland. @ Budweiser might have America on their cans but we think Maryland Pizza sounds better
30. Input: I have come to enjoy Futbol over Football
40. Input: @ GameBurst That 's not meant to be rude. Hard to clarify the joke in tweet form .
50. Input: RT @ undeniableyella : I find it interesting , people only talk to me when they need something …
60. Input: Petshotel Pet Care Specialist Jobs in Atlanta , GA # Atlanta # GA # jobs # jobsearch https://t.co/pOJtjn1RUI
70. Input: FOUR ! BUTTLER nailed it past the sweeper cover fence to end the over ! # ENG – 91/6 -LRB- 20 overs -RRB- . # ENGvSL https://t.co/Pp8pYHfQI8
79. Input: RT @ LenshayB : I need to stop spending money like I 'm rich but I really have that mentality when it comes to spending money on my daughter
89. Input: RT MarketCurrents : Valuation concerns perk up again on Blue Buffalo https://t.co/5lUvNnwsjA , https://t.co/Q0pEHTMLie
99. Input: Unlimited Cellular Snap-On Case for Apple iPhone 4/4S -LRB- Transparent Design , Blue/ https://t.co/7m962bYWVQ https://t.co/N4tyjLdwYp
100. Input: RT @ Boogie2988 : And some people say , Ethan 's heart grew three sizes that day. Glad to see some of this drama finally going away. https://t.co/4aDE63Zm85
Here are the benchmarking results for social media (Twitter):
(1) Google SyntaxNet: F-score = 0.65
P = tp/(tp+fp) = 842/(842+557) = 842/1399 = 0.60
R = tp/(tp+fn) = 842/(842+364) = 842/1206 = 0.70
F = 2*P*R/(P+R) = 2*(0.60*0.70)/(0.60+0.70) = 2*(0.42/1.30) = 0.65
(2) Netbase parser: F-score = 0.80
P = tp/(tp+fp) = 866/(866+112) = 866/978 = 0.89
R = tp/(tp+fn) = 866/(866+340) = 866/1206 = 0.72
F = 2*P*R/(P+R) = 2*(0.89*0.72)/(0.89+0.72) = 2*(0.6408/1.61) = 0.80
We leave interesting observations on these benchmarking results, with more detailed illustration, analyses and discussion, to the next blog. To summarize, our real-life production parser beats Google's research system SyntaxNet on both formal news text (by a small margin, as both are already near human performance) and informal text, the latter by a big margin of 15 percentage points. It is therefore safe to conclude that Google's SyntaxNet is by no means "the world's most accurate parser"; in fact, it has a long way to go to even get close to the Netbase parser in adapting to real-world English text of various genres for real-life applications.
Related:
Is Google SyntaxNet Really the World's Most Accurate Parser?
Announcing SyntaxNet: The World's Most Accurate Parser Goes Open Source
K. Church: "A Pendulum Swung Too Far", Linguistic Issues in Language Technology, 2011; 6(5)
Pros and Cons of Two Approaches: Machine Learning vs Grammar Engineering
Introduction of Netbase NLP Core Engine
Overview of Natural Language Processing
Dr. Wei Li's English Blog on NLP
You do not have to understand the concepts in this appendix to become well-versed in C++. You can fully master C++, however, only if you spend some time learning about the behind-the-scenes role played by binary numbers. The material presented here is not difficult, but many programmers do not take the time to study it; hence, there are the handful of C++ masters who have learned this material and understand how C++ works "under the hood," and there are those who will never master the language as fully as they could. You should take the time to learn about addressing, binary numbers, and hexadecimal numbers. These fundamental principles are presented here for you to learn, and although a working knowledge of C++ is possible without them, they greatly enhance your C++ skills (and your skills in every other programming language). After reading this appendix, you will better understand why different C++ data types hold different ranges of numbers. You will also see the importance of being able to represent hexadecimal numbers in C++, and you will better understand C++ array and pointer addressing.
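The point about type ranges can be made concrete. With n bits, an unsigned integer spans 0 through 2^n − 1, and a two's-complement signed integer spans −2^(n−1) through 2^(n−1) − 1; each hexadecimal digit stands for exactly four bits. A quick sketch (in Python rather than C++, for brevity; the C++ widths mentioned in the comments are typical, not guaranteed by the standard):

```python
def int_range(bits, signed=True):
    """Value range of an integer of the given bit width (two's complement if signed)."""
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1

# Typical C++ widths: char = 8 bits, short = 16, int = 32.
print(int_range(8))                 # signed char range: (-128, 127)
print(int_range(16, signed=False))  # unsigned short range: (0, 65535)

# Each hex digit encodes exactly four bits, so 0xFF is the eight bits 11111111.
print(hex(255), bin(0xFF))          # 0xff 0b11111111
```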
Sharing the final version of deep learning authority Yoshua Bengio's Deep Learning. This is the final version released before printing; the earlier draft was just over 500 pages, while this version runs over 800 pages and is well worth sharing. (The PDF is too large to upload to ScienceNet.) Baidu cloud download link: http://pan.baidu.com/s/1qYIeAJU
Authors: Ian Goodfellow, Yoshua Bengio, Aaron Courville
Website: http://www.deeplearningbook.org/
Table of Contents:
Acknowledgements
Notation
1 Introduction
Part I: Applied Math and Machine Learning Basics
2 Linear Algebra
3 Probability and Information Theory
4 Numerical Computation
5 Machine Learning Basics
Part II: Modern Practical Deep Networks
6 Deep Feedforward Networks
7 Regularization
8 Optimization for Training Deep Models
9 Convolutional Networks
10 Sequence Modeling: Recurrent and Recursive Nets
11 Practical Methodology
12 Applications
Part III: Deep Learning Research
13 Linear Factor Models
14 Autoencoders
15 Representation Learning
16 Structured Probabilistic Models for Deep Learning
17 Monte Carlo Methods
18 Confronting the Partition Function
19 Approximate Inference
20 Deep Generative Models
Bibliography
Index
Machine learning: deep learning reading links
Deep Learning Reading List
The following is a growing list of some of the materials I found on the web for deep learning beginners.
Free Online Books: Deep Learning by Yoshua Bengio, Ian Goodfellow and Aaron Courville; Neural Networks and Deep Learning by Michael Nielsen; Deep Learning by Microsoft Research; Deep Learning Tutorial by LISA lab, University of Montreal
Courses: Machine Learning by Andrew Ng on Coursera; Neural Networks for Machine Learning by Geoffrey Hinton on Coursera; Neural networks class by Hugo Larochelle of Université de Sherbrooke; Deep Learning Course by the CILVR lab @ NYU; CS231n: Convolutional Neural Networks for Visual Recognition (ongoing); CS224d: Deep Learning for Natural Language Processing (about to start)
Videos and Lectures: How To Create A Mind by Ray Kurzweil (an inspiring talk); Deep Learning, Self-Taught Learning and Unsupervised Feature Learning by Andrew Ng; Recent Developments in Deep Learning by Geoff Hinton; The Unreasonable Effectiveness of Deep Learning by Yann LeCun; Deep Learning of Representations by Yoshua Bengio; Principles of Hierarchical Temporal Memory by Jeff Hawkins; Machine Learning Discussion Group - Deep Learning w/ Stanford AI Lab by Adam Coates; Making Sense of the World with Deep Learning by Adam Coates; Demystifying Unsupervised Feature Learning by Adam Coates; Visual Perception with Deep Learning by Yann LeCun
Papers: ImageNet Classification with Deep Convolutional Neural Networks; Using Very Deep Autoencoders for Content Based Image Retrieval; Learning Deep Architectures for AI; CMU's list of papers
Tutorials: UFLDL Tutorial 1; UFLDL Tutorial 2; Deep Learning for NLP (without Magic); A Deep Learning Tutorial: From Perceptrons to Deep Networks
Websites: deeplearning.net; deeplearning.stanford.edu
Datasets: MNIST handwritten digits; Google House Numbers from Street View; CIFAR-10 and CIFAR-100; IMAGENET; Tiny Images (80 million tiny images); Flickr Data (100 Million Yahoo dataset); Berkeley Segmentation Dataset 500
Frameworks: Caffe; Torch7; Theano; cuda-convnet; Ccv; NuPIC; DeepLearning4J
Miscellaneous: Google Plus Deep Learning Community; Caffe Webinar; 100 Best Github Resources for DL; Word2Vec; Caffe DockerFile; TorontoDeepLearning convnet; vision data sets; Fantastic Torch Tutorial (my personal favourite; also check out gfx.js); Torch7 cheat sheet
Original link: http://jmozah.github.io/links/#rd
Any system that claims to use mainstream machine learning for social media opinion mining deserves skepticism; falling far short of practical usability is the current state of affairs. The reason is plain: machine learning breaks down on the short messages that dominate social media. Short messages simply do not have enough data-point density (so-called keyword density) for machine learning to work with. Even the cleverest cook cannot make a meal without rice: this limitation is determined by the bag-of-words methodology itself, and no training set, however large, can overcome it. Without linguistic structural analysis, the challenge is insurmountable. I have articulated this point in various previous posts and blogs, but the world is so dominated by the mainstream that it does not seem to carry far. So let me make it simple: the sentiment classification approach based on the bag-of-words (BOW) model, so far the dominant mainstream approach to sentiment analysis, simply breaks down in front of social media. The major reason is simple: social media are full of short messages, which do not have the keyword density a classifier requires to make a proper sentiment decision. The precision ceiling for this line of work in real-life social media has been found to be around 60%, far below the widely acknowledged minimum precision of 80% for a usable extraction system. Trusting a machine-learning classifier with social media sentiment is not much better than flipping a coin. So let us be straight about it: from now on, any claim of using machine learning for social media mining of public opinions and sentiments is likely a trap (unless it is verified to involve parsing of linguistic structures or patterns, which so far is unheard of in practical machine-learning systems). Fancy visualizations may make the mining results look real and attractive, but they simply cannot be trusted.
[Postscript] A friend sent me a screenshot from WeChat, saying this reads like knocking over a whole boatload of people with one swing. On this point, however, there is really no way around it: whether in Chinese or in Western languages, short messages are the overwhelming majority in mobile-era social media, and someone has to expose the truth behind social media big-data mining. That BOW is helpless in the face of short messages is an indisputable fact; it will not suddenly start working where it does not fit just because it is the most convenient, most widely used mainstream method. What does not work does not work: this line of work cannot break the 60% precision ceiling and remains far from the accepted 80% usability threshold, and that is determined by the methodology.
Related Posts: Pros and Cons of Two Approaches: Machine Learning and Grammar Engineering; Coarse-grained vs. fine-grained sentiment analysis; 舆情挖掘系统独立验证的意义 2015-11-22; 【立委科普:NLP 中的一袋子词是什么】 2015-11-27; 【置顶:立委科学网博客NLP博文一览(定期更新版)】
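The keyword-density point can be illustrated with a toy sketch. The tiny sentiment lexicon and the two texts below are hypothetical, and a real BOW classifier is of course statistical rather than a simple lexicon counter; the point is only how few sentiment-bearing data points a short message offers such a model.

```python
# Hypothetical mini-lexicon of sentiment-bearing keywords.
POSITIVE = {"great", "love", "excellent", "amazing"}
NEGATIVE = {"bad", "hate", "terrible", "awful"}

def bow_signal(text):
    """Number of sentiment-bearing tokens a bag-of-words model can count on."""
    tokens = text.lower().split()
    return sum(t in POSITIVE or t in NEGATIVE for t in tokens)

long_review = ("I love this phone , the screen is great and the camera is "
               "excellent , though the battery life is bad on long trips .")
short_post = "meh , not what I hoped for"

# The long review yields several keyword data points; the short post yields
# zero, even though a human reads its (negative) sentiment instantly.
print(bow_signal(long_review), bow_signal(short_post))
```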
January 2000
On Hybrid Model: Pre-Knowledge-Graph Profile Extraction Research via SBIR (3)
This section presents the feasibility study conducted in Phase I of the proposed hybrid model for Level-2 and Level-3 IE. The study is based on a literature review and supported by extensive experiments and a prototype implementation. The model complements corpus-based machine learning with hand-coded FST rules. The essential argument for this strategy is that, by combining machine learning methods with an FST rule-based system, the system can exploit the best of both paradigms while overcoming their respective weaknesses. The approach was intended to meet the designed system's demand for processing unrestricted real-life text.
2.2.1 Hybrid Approach
It was proposed that FST hand-crafted rules be combined with corpus-based learning in all major modules of Textract. More precisely, each module M consists of two sub-modules M1 and M2, i.e. an FST model and a trained model. The former serves as a preprocessor, as shown below.
M1: FST sub-module → M2: trained sub-module
The trained model M2 has two features: (i) adaptive training; (ii) structure-based training. In a pipeline architecture, the output of the previous module is the input of the succeeding module. If the succeeding module is a trained model, there are two types of training: adaptive and non-adaptive. In adaptive training, the input in the training phase is exactly the same as the input in the application phase. That is, the possibly imperfect output from the previous module is the input for training, even if the previous module makes certain mistakes. This type of training "adapts" the model to imperfect input; the trained model is more robust and makes the necessary adjustments. In contrast, naive non-adaptive training is often conducted on perfect, often artificial input.
The assumption there is that the previous module is continuously improving and will eventually provide near-perfect output for the next module. There are pros and cons for both adaptive and non-adaptive methods. Non-adaptive training is suitable when the training time is significantly long and the previous module is simple and reaches high precision. In contrast, an adaptively trained model has to be re-trained each time the previous module(s) undergo major changes; otherwise performance is seriously affected. This imposes stringent requirements on training time and algorithm efficiency. Since the machine learning tools Cymfony has developed in-house are very efficient, Textract can afford to adopt the more flexible training method using adaptive input. Adaptive training provides the rationale for placing the FST model before the trained model. The development of the FST sub-module M1 and the trained sub-module M2 can proceed independently. When the time comes to integrate M1 and M2 for better performance, it suffices to re-train M2 on the output of M1. The flexible adaptive training capability makes this design viable, as verified in the prototype implementation of Textract 2.0/CE. In contrast, if M1 were placed after M2, the development of hand-crafted rules for M1 would have to wait until M2 was implemented; otherwise many rules might have to be re-written and re-debugged, which is not desirable. The second issue is structure-based training. Natural language is structural by nature; no sophisticated high-level IE can succeed on linear strings of tokens alone. To capture CE/GE phenomena, traditional n-gram training with a window of n linear tokens is not sufficient. Sentences can be long, with the related entities far apart, not to mention the long-distance phenomena of linguistics.
Without structure-based training, no matter how large a window size one chooses, generalized rules cannot be learned effectively. Once training is based on linguistic structures, however, the distance between the entities becomes tractable. In fact, as linguistic structures are hierarchical, we need to perform multi-level training to capture CE/GE. For CE, the Phase I research found that three levels of training are necessary, each supported by the corresponding natural language parser. The remainder of this section presents the feasibility study and the arguments for choosing an FST rule-based system to complement the corpus-based machine learning models.
2.2.2 FST Grammars
The most attractive feature of the FST formalism is its superior time and space efficiency. Applying an FST is basically linear in the size of the input text. This contrasts with the more pervasive formalism used in NLP, namely Context Free Grammars. This theoretical time/space efficiency has been verified through the extensive use of Cymfony's proprietary FST Toolkit in the following applications of the Textract implementation: (i) the tokenizer; (ii) FST-based rules for capturing NE; (iii) FST representation of lexicons (lexical transducers); (iv) experiments in FST local grammars for shallow parsing; and (v) local CE/GE grammars in FST. For example, the Cymfony shallow parser has been benchmarked at processing 460 MB of text per hour on a 450 MHz Pentium II PC running Windows NT. There is a natural combination of FST-based grammars and lexical approaches to natural language phenomena. For IE grammars/rules to perform well, the lexical approach must be employed. In fact, the NE/CE/GE grammars developed in Phase I have demonstrated the need for the lexical approach. Take CE as an example.
To capture a certain CE relationship, say affiliation, the corresponding rules need to check patterns involving specific verbs and/or prepositions, say work for / hired by, which denote this relationship in English. The GE grammar, which aims at decoding the key semantic relationships of the argument structure instead of surface syntactic relationships, has also demonstrated the need for a considerable level of lexical constraints. Efficiency is always an important consideration in developing a large-scale deployable software system. It is particularly required for lexical grammars, which are usually too large for efficient processing under conventional, more powerful grammar formalisms (e.g. the Context Free Grammar formalism). Cymfony is convinced through extensive experiments that FST technology is an outstanding tool for tackling this efficiency problem. It has been suggested that a set of cascaded FST grammars can simulate sophisticated natural language parsing. This use of FSTs has already been applied successfully to Textract shallow parsing and local CE/GE extraction. There are a number of success stories of FST-based rule systems in the field of IE. For example, the commercial NE system NetOwl relies heavily on FST pattern-matching rules. SRI also applied a very efficient FST local grammar to the shallow parsing of basic noun phrases and verb groups in support of IE tasks. More recently, Université Paris VII/LADL successfully applied FST technology to a specific information extraction/retrieval task; that system can extract information on the fly about a person's occupation from huge amounts of free text, answering questions which conventional retrieval systems cannot handle, e.g. Who is the minister of culture in France? Finally, it has also been shown by research programs such as INTEX, as well as by Cymfony, that an FST-based rule system is extremely efficient.
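The kind of lexically anchored CE rule described above, keyed to specific verbs and prepositions such as works for or hired by, can be sketched with a plain regular expression. This is an illustration only, not Cymfony's actual FST formalism, and the proper-name pattern is deliberately naive.

```python
import re

# A toy "affiliation" pattern: PERSON (works for | hired by) ORG,
# anchored on the lexical items that denote the relation in English.
AFFILIATION = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"   # naive capitalized-name shape
    r"(?: was)? (?:works? for|hired by) "
    r"(?P<org>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"
)

def extract_affiliations(text):
    """Return (person, org) pairs matched by the affiliation pattern."""
    return [(m.group("person"), m.group("org")) for m in AFFILIATION.finditer(text)]
```

Such regex rules illustrate the lexical anchoring, but unlike true FSTs they are applied one pattern at a time rather than compiled into a single cascaded transducer.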
In addition, FST is a convenient tool for capturing linguistic phenomena, especially the idioms and semi-productive expressions that are abundant in natural languages. As Hobbs says, "languages in general are very productive in the construction of short, multiword fixed phrases and proper names employing specialized microgrammars". However, a purely FST-based rule system suffers from the same disadvantage in knowledge acquisition as all handcrafted rule systems: the FST rules or local grammars have to be encoded by human experts, imposing the traditional labor-intensive burden of developing large-scale systems. The conclusion is that while FST overcomes a number of shortcomings of traditional rule-based systems (in particular the efficiency problem), it does not relieve the dependence on highly skilled human labor. Therefore, automatic machine learning techniques are called for.
2.2.3 Machine Learning
The appeal of corpus-based machine learning in language modeling lies mainly in its automatic training/learning capability, which significantly reduces the cost of hand-coding rules. Compared with rule-based systems, corpus-based learning has definite advantages:
· automatic knowledge acquisition: fast development time, since the system discovers regularities automatically when given sufficient correctly annotated data
· robustness: knowledge/rules are learned directly from the corpus
· acceptable speed: in general there is little run-time processing; the knowledge/rules obtained in the training phase can be stored in efficient data structures for run-time lookup
· portability: a domain shift only requires the truthing of new data; new knowledge/rules are learned automatically with no need to change any part of the program or control
BBN has recently implemented an integrated, fully trainable model, SIFT, applied to IE.
This system performs the tasks of linguistic processing (POS tagging, syntactic parsing and semantic relationship identification), TE and TR as well as NE, all at once. BBN reported F-measures of 83.49% for TE and 71.23% for TR, results close to those of the best systems in MUC-7. In addition, their successful experiment in using the Penn Treebank to train the initial syntactic parser significantly reduces the cost of human annotation. There is no doubt that their effort is significant progress in this field; it demonstrates the state of the art in applying grammar induction to Level-2 IE. However, there are two potentially serious problems with their approach. The first is the lack of efficiency in applying the model. As they acknowledge, the system is rather slow. In terms of efficiency, the CKY-based parsing algorithm they use is not comparable to algorithms for formalisms based on the finite state scheme (e.g. FST, or Viterbi for HMM). This limiting factor is inherent in a learned grammar based on the CFG formalism. To overcome this problem, rule induction has been explored in the direction of learning FST-style grammars for local CE/GE extraction instead of CFG. The second problem is their integrated approach. Because everything is integrated in one process, it is extremely difficult to trace where a problem lies, making debugging difficult. A much more secure way, it is believed, is to follow the conventional practice of modularizing the NLP/IE process into tasks and sub-tasks, as Cymfony has proposed in the Textract architecture design: POS tagging, shallow parsing, co-referencing, full parsing, pragmatic filtering, NE, CE, GE. Along this line, it is easy to determine directly whether a particular degradation in performance is due, for example, to poor support from co-referencing or to mistakes in shallow parsing.
Performance benchmarking can be measured for each module; efforts to improve the performance of each individual module will contribute to the improvement of the overall system.

2.2.4 Drawbacks of Corpus-based Learning

The following drawbacks motivate the proposed idea of building a hybrid system/module, complementing automatic corpus-based learning with handcrafted grammars in FST.

· ‘Sparse data’ problem: this is recognized as a bottleneck for all corpus-based models. Unfortunately, practical solutions to this problem (e.g. smoothing or back-off techniques) often result in a model much less sophisticated than traditional rule-based systems.
· ‘Local maxima’ problem: even if the training corpus is large and sufficiently representative, the training program can produce a poor model because training got stuck in a local maximum and failed to find the global peak. This is an inherent problem with the standard training algorithms for both HMM (the forward-backward algorithm) and CFG grammar induction (the inside-outside algorithm). The problem can be very serious when no extra information is applied to guide the training process.
· Computational complexity problem: there is often a trade-off between the expressive power/prior knowledge/constraints in the templates and feasibility. Usually, the more sophisticated a model or rule template is, the larger the minimum corpus required, often up to an unrealistic level of training complexity. Extending the length of the string to be examined (e.g. from bigram to trigram), or adding more features (or categories/classes) for a template to reference, usually means an enormous jump in this requirement; otherwise the system suffers from a more serious sparse data effect. In many cases, the limitation imposed on the training complexity makes some research ideas unattainable, which in turn limits the achievable performance.
· Potentially very high cost of manual corpus annotation: this is why Cymfony has proposed, as one important direction for future research, exploring the combination of supervised and unsupervised training.

Among the above four problems, the sparse data problem is believed to be the most serious. To a considerable extent, the success of a system depends on how this problem is addressed. In general, there are three ways to minimize the negative effect of sparse data, discussed below. The first is to condition the probabilities/rules on fewer elements, e.g. to back off from an N-gram model to an (N-1)-gram model. This remedy clearly sacrifices power and is therefore not a viable option for sophisticated NLP/IE tasks. The second approach is to condition the probabilities/rules on appropriate levels of linguistic structure (e.g. the basic phrase level) instead of surface-based linear tokens. The research in the CE prototyping showed this to be one of the most promising ways of handling the sparse data problem. This approach calls for a reliable natural language parser to establish the necessary structural foundation for conducting structure-based adaptive learning. The shallow parser which Cymfony has built, using the FST engine and an extensively tested manual grammar, has been measured to perform at 90.5% accuracy. The third method is to condition the probabilities/rules on more general features, e.g. using syntactic categories (e.g. POS) or semantic classes (e.g. the results from a semantic lexicon, or from word clustering training) instead of the literal token. This is also a proven means of overcoming the bottleneck. However, there is considerable difficulty in applying this approach due to the high degree of lexical ambiguity widespread in natural languages. As for the ‘local maxima’ problem, the proposed hybrid approach integrating handcrafted FST rules and the automatic grammar learner promises a solution.
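As a concrete illustration of the first remedy mentioned above (backing off from an N-gram to an (N-1)-gram model), here is a minimal sketch in Python. It uses the simple "stupid backoff" discount rather than a principled smoothing scheme such as Katz back-off; the toy corpus, function names, and the alpha value are all illustrative.

```python
from collections import Counter

def train(tokens):
    """Count unigrams and bigrams from a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def backoff_prob(w_prev, w, unigrams, bigrams, alpha=0.4):
    """Stupid-backoff style estimate: use the bigram if it was seen,
    otherwise back off to a discounted unigram estimate."""
    total = sum(unigrams.values())
    if bigrams[(w_prev, w)] > 0:
        return bigrams[(w_prev, w)] / unigrams[w_prev]
    return alpha * unigrams[w] / total

tokens = "the cat sat on the mat the cat ran".split()
uni, bi = train(tokens)
print(backoff_prob("the", "cat", uni, bi))  # seen bigram: 2/3
print(backoff_prob("mat", "ran", uni, bi))  # unseen bigram: backs off to unigram
```

The point of the sketch is only the shape of the remedy: an unseen bigram does not get zero probability, but the price is that the estimate degenerates to the less informative unigram model.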
The learned model can be re-trained using the FST component as a ‘seed’ to guide the learning. In general, the more constraints and heuristics that are given to the initial statistical model for training, the better the chance that the training algorithm reaches the global maximum. A handcrafted grammar is believed to be the most effective such constraint, since it embodies human linguistic knowledge.

2.2.5 Feasibility and Advantages of the Hybrid Approach

In fact, the feasibility of such collaboration between a handcrafted rule system (FST in this case) and a corpus-based system has already been verified for all the major types of models:

· For transformation-based systems, Brill's training algorithm ensures that the input to the system can be either a randomly tagged text (naive initial state) or a text tagged by another module with the same function (sophisticated initial state). Using POS tagging as an example, the input to the transformation-based tagger can be either a randomly tagged text or a text tagged by another POS tagger. The shift in input source only requires re-training the system; nothing in the algorithm or the annotated corpus needs to be changed.
· In the case of rule induction, the FST-based grammar can serve as a ‘seed’ to effectively constrain/guide the learning process and overcome the ‘local maxima’ problem. In general, a better initial estimate of the parameters gives the learning procedure a chance to obtain better results when many local maxima exist. Experiments conducted by Briscoe and Waegner show that even starting from a very crude handcrafted grammar of only seven binary-branching rules (e.g. PP → P NP), a much better grammar is learned automatically than with the same approach without a grammar ‘seed’. Another, more interesting, experiment of theirs gives the following encouraging results.
Given the seed of an artificial grammar that can only parse 25% of the 50,00-word corpus, the training program is able to produce a grammar capable of parsing 75% of the corpus. This demonstrates the feasibility of combining handcrafted grammar and automatic grammar induction in line with the general approach proposed above: FST rules before the statistical model.
· When the trained sub-module is an HMM, Cymfony has verified feasibility through extensive experiments in implementing the hybrid NE tagger, Textract 1.0. Cymfony first implemented an NE system purely on HMM bi-gram learning and found weaknesses: due to the sparse data problem, although time and numerical NEs are expressed in very predictable patterns, there was a considerable amount of mistagging. This problem was later addressed by FST rules, which are good at capturing such patterns; the FST pattern rules for NE serve as a preprocessor. As a result, Textract 1.0 achieved a significant performance enhancement (F-measure raised from 85% to 93%).

The advantages of this proposed hybrid approach are summarized below:

· strict modularity: combining FST rules and statistical models makes the system more modular, as each major module is now divided into two sub-modules. Of course, adaptive re-training is necessary in the later stage of integrating the two sub-modules, but it is not a burden, as the process is automatic and, in principle, requires no modifications to the algorithm or the training corpus.
· enhanced performance: due to the complementary nature of handcrafted and machine-learning systems.
· flexible ratio of sub-modules: one module may have a large trained model and a small FST component, or the other way around, depending on the nature of a given task, i.e. how well the FST approach or the learning approach applies to the task. One is free to decide how to allocate more effort and resources to developing one component or the other.
If we judge that for Task One automatic learning is most effective, we are free to decide that more effort and resources should be devoted to developing the trained module M2 for this task (and less to the FST module M1). In other words, the relative size or contribution of M1 versus M2 is flexible, e.g. M1=20% and M2=80%. Such adaptive re-training is also normal practice in the development of a pure statistical system: repeated training and testing is how one adjusts the model for performance improvement and debugging. It is even possible for a module to be based exclusively on FST rules, i.e. M1=100% and M2=0%, or completely on a learned model, i.e. M1=0% and M2=100%, so long as its performance is deemed good enough or the overhead of combining the FST grammar and the learned model outweighs the slight gain in performance. In fact, some minor modules like the Tokenizer and POS Tagger can produce very reliable results using only one approach.

Technology developed for the proposed information extraction system and its application has focused on six specific areas: (i) machine learning toolkit, (ii) CE, (iii) CO, (iv) GE, (v) QA and (vi) truthing and evaluation. The major accomplishments in these areas from the Phase I research are presented in the following sections.

REFERENCES

Abney, S.P. 1991. Parsing by Chunks. Principle-Based Parsing: Computation and Psycholinguistics, Robert C. Berwick, Steven P. Abney, Carol Tenny, eds. Kluwer Academic Publishers, Boston, MA, pp. 257-278.
Appelt, D.E. et al. 1995. SRI International FASTUS System MUC-6 Test Results and Analysis. Proceedings of MUC-6, Morgan Kaufmann Publishers, San Mateo, CA.
Beckwith, R. et al. 1991. WordNet: A Lexical Database Organized on Psycholinguistic Principles. Lexicons: Using On-line Resources to Build a Lexicon, Uri Zernik, editor, Lawrence Erlbaum, Hillsdale, NJ.
Bikel, D.M. et al. 1997. Nymble: a High-Performance Learning Name-finder.
Proceedings of the Fifth Conference on Applied Natural Language Processing, Morgan Kaufmann Publishers, pp. 194-201.
Brill, E. 1995. Transformation-based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging. Computational Linguistics, Vol. 21, No. 4, pp. 227-253.
Briscoe, T. and Waegner, N. 1992. Robust Stochastic Parsing Using the Inside-Outside Algorithm. Workshop Notes, Statistically-Based NLP Techniques, AAAI, pp. 30-53.
Charniak, E. 1994. Statistical Language Learning. MIT Press, Cambridge, MA.
Chiang, T-H., Lin, Y-C. and Su, K-Y. 1995. Robust Learning, Smoothing, and Parameter Tying on Syntactic Ambiguity Resolution. Computational Linguistics, Vol. 21, No. 3, pp. 321-344.
Chinchor, N. and Marsh, E. 1998. MUC-7 Information Extraction Task Definition (version 5.1). Proceedings of MUC-7.
Darroch, J.N. and Ratcliff, D. 1972. Generalized Iterative Scaling for Log-linear Models. The Annals of Mathematical Statistics, pp. 1470-1480.
Grishman, R. 1997. TIPSTER Architecture Design Document Version 2.3. Technical report, DARPA.
Hobbs, J.R. 1993. FASTUS: A System for Extracting Information from Text. Proceedings of the DARPA Workshop on Human Language Technology, Princeton, NJ, pp. 133-137.
Krupka, G.R. and Hausman, K. 1998. IsoQuest Inc.: Description of the NetOwl (TM) Extractor System as Used for MUC-7. Proceedings of MUC-7.
Lin, D. 1998. Automatic Retrieval and Clustering of Similar Words. Proceedings of COLING-ACL '98, Montreal, pp. 768-773.
Miller, S. et al. 1998. BBN: Description of the SIFT System as Used for MUC-7. Proceedings of MUC-7.
Mohri, M. 1997. Finite-State Transducers in Language and Speech Processing. Computational Linguistics, Vol. 23, No. 2, pp. 269-311.
Mooney, R.J. 1999. Symbolic Machine Learning for Natural Language Processing. Tutorial Notes, ACL '99.
MUC-7, 1998. Proceedings of the Seventh Message Understanding Conference (MUC-7), published on the website http://www.muc.saic.com/
Pine, C. 1996.
Statement-of-Work (SOW) for The Intelligence Analyst Associate (IAA) Build 2, Contract for IAA Build 2, USAF, AFMC, Rome Laboratory.
Riloff, E. and Jones, R. 1999. Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).
Rosenfeld, R. 1994. Adaptive Statistical Language Modeling. PhD thesis, Carnegie Mellon University.
Senellart, J. 1998. Locating Noun Phrases with Finite State Transducers. Proceedings of COLING-ACL '98, Montreal, pp. 1212-1219.
Silberztein, M. 1998. Tutorial Notes: Finite State Processing with INTEX. COLING-ACL '98, Montreal (also available at http://www.ladl.jussieu.fr)
Srihari, R. 1998. A Domain Independent Event Extraction Toolkit. AFRL-IF-RS-TR-1998-152 Final Technical Report, published by Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.
Yangarber, R. and Grishman, R. 1998. NYU: Description of the Proteus/PET System as Used for MUC-7 ST. Proceedings of MUC-7.
Torch/Lua Material for Deep Learning

Lua -- Lua is a powerful, fast, lightweight, embeddable scripting language. The core code of the Lua interpreter is very short.

Lua tutorials:
- Learn Lua in an hour -- https://www.youtube.com/watch?v=S4eNl1rA1Ns
- Learn Lua in one video -- https://www.youtube.com/watch?v=iMacxZQMPXs
- More Lua -- https://www.youtube.com/watch?v=Us46grT9wsAindex=1list=PL0o3fqwR2CsWg_ockSMN6FActmMOJ70t_

Torch7 (basic libraries): nn, Tensor, image, math, random, CmdLine, and more (e.g. timer)

Torch7 (more):
- Five simple examples
- torch/tutorials examples
- torch demos -- https://github.com/torch/demos
- Artificial and robotic vision (introduction to Torch7 and Lua, with some examples)
- Torch Cheatsheet
- Machine Learning with Torch7
Before we start discussing the topic of hybrid NLP (Natural Language Processing) systems, let us look at the concept of hybrid from our life experience. I drove a classic Camry for years and had never thought of changing to another brand, because as a vehicle there was really nothing to complain about. Yes, the style is old, but I am getting old too; who beats whom? Then one day a few years ago we needed to buy a new car to retire my damaged Camry. My daughter suggested a hybrid, following the trend of going green. So I have ended up driving a Prius ever since and have fallen in love with it. It is quiet, with Bluetooth and line-in, ideal for my iPhone music enjoyment. It has low emissions, and I can finally say goodbye to smog tests. It saves at least 1/3 on gas. We could have gained all these benefits by purchasing an expensive all-electric car, but I want the same feeling of power on the freeway, and I dislike the idea of having to charge the car too frequently. Hybrid gets the best of both worlds for me, and is not that much more expensive.

Now back to NLP. There are two major approaches to NLP, namely machine learning and grammar engineering (hand-crafted rule systems). As mentioned in previous posts, each has its own strengths and limitations, summarized below. In general, a rule system is good at capturing a specific language phenomenon (the trees) while machine learning is good at representing the general picture of the phenomena (the forest). As a result, it is easier for rule systems to reach high precision, but it takes a long time to develop enough rules to gradually raise the recall. Machine learning, on the other hand, has much higher recall, usually with a compromise in precision or a precision ceiling. Machine learning is good at simple, clear-cut, coarse-grained tasks, while rules are good at fine-grained tasks. One example is sentiment extraction.
The coarse-grained task there is sentiment classification of documents (thumbs-up vs. thumbs-down), which can be achieved quickly by a learning system. The fine-grained task of sentiment extraction involves extracting sentiment details and the related actionable insights, including associating the sentiment with an object, differentiating positive/negative emotions from positive/negative behaviors, capturing the aspects or features of the object involved, decoding the motivation or reasons behind the sentiment, etc. For such sophisticated tasks of extracting details and actionable insights, rules are a better fit. The strength of machine learning lies in its retraining ability. In theory, the algorithm, once developed and debugged, remains stable, and improvement of a learning system can be expected once a larger and better-quality corpus is used for retraining (in practice, retraining is not always easy: I have seen famous learning systems deployed at client sites for years without being retrained, for various reasons). Rules, on the other hand, need to be manually crafted and enhanced. Supervised machine learning is more mature for applications, but it requires a large labelled corpus. Unsupervised machine learning only needs a raw corpus, but it is research-oriented and riskier in applications. A promising middle way is semi-supervised learning, which only needs a small labelled corpus as seeds to guide the learning. We can also use rules to generate the initial corpus or seeds for semi-supervised learning. Both approaches involve knowledge bottlenecks. A rule system's bottleneck is skilled labor: it requires linguists or knowledge engineers to manually encode each rule in NLP, much like a software engineer in the daily work of coding. The biggest challenge for machine learning is the sparse data problem, which requires a very large labelled corpus to overcome.
The knowledge bottleneck for supervised machine learning is thus the labor required to label such a large corpus. We can build a system that combines the two approaches so that they complement each other. There are different ways of combining the two approaches in a hybrid system. One example is the practice we use in our product, where the insight results are structured in a back-off model: high-precision results from rules are ranked higher than the medium-precision results returned by statistical systems or machine learning. This helps the system reach a configurable balance between precision and recall. When labelled data are available (e.g. the community has already built the corpus, or, for some tasks, the public domain has the data; for instance, sentiment classification of movie reviews can use review data with users' feedback on a 5-star scale), and when the task is simple and clearly defined, using machine learning will greatly speed up the development of a capability. Not every task is suitable for both approaches. (Note that suitability is in the eye of the beholder: I have seen many passionate ML specialists willing to try everything in ML irrespective of the nature of the task; as an old saying goes, when you have a hammer, everything looks like a nail.) For example, machine learning is good at document classification, while rules are mostly powerless for such tasks. But for complicated tasks such as deep parsing, rules constructed by linguists usually achieve better performance than machine learning. Rules also perform better for tasks which have clear patterns, for example identifying data items like time, weight, length, money, address, etc. This is because clear patterns can be directly encoded in rules to be logically complete in coverage, while machine learning based on samples still faces a sparse data challenge.
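The back-off combination described above (high-precision rule results ranked above statistical results) can be sketched roughly as follows. This is a toy sketch: the span format and the two stand-in extractor functions are illustrative, not our product's actual interfaces.

```python
def combine(text, rule_extract, ml_extract):
    """Back-off combination: keep every high-precision rule result,
    and accept an ML result only if no rule result covers the same span."""
    results = list(rule_extract(text))
    covered = {span for span, _ in results}
    for span, label in ml_extract(text):
        if span not in covered:
            results.append((span, label))
    return results

# Toy extractors standing in for the real rule and ML components.
def rule_extract(text):
    return [((0, 4), "ORG")]                      # high precision, low recall

def ml_extract(text):
    return [((0, 4), "PER"), ((10, 15), "LOC")]   # higher recall

print(combine("...", rule_extract, ml_extract))
# -> [((0, 4), 'ORG'), ((10, 15), 'LOC')]
```

The rule result wins on the overlapping span, while the ML result fills the gap the rules missed; this is how precision and recall get balanced in a back-off design.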
When designing a system, in addition to using a hybrid approach for some tasks, for other tasks we should choose the most suitable single approach depending on the nature of the task. Other aspects of comparison between the two approaches involve modularization and debugging in industrial development. A rule system can fairly easily be structured as a pipeline of modules, so that a complicated task is decomposed into a series of subtasks handled by different levels of modules. In such an architecture, a reported bug is easy to localize and fix by adjusting the rules in the related module. Machine learning systems are based on a model learned from the corpus. The model itself, once learned, is often like a black box (even when the model is represented by a list of symbolic rules as the result of learning, it is risky to manually tamper with those rules when fixing a data quality bug). Bugs are supposed to be fixable during retraining of the model based on an enhanced corpus and/or adjusted features. But retraining is a complicated process which may or may not solve the problem. It is difficult to localize and directly handle specific reported bugs in machine learning. To conclude: hybrid gets the best of both worlds. Due to the complementary pros and cons of the two basic approaches to NLP, a hybrid system involving both approaches is desirable and worth more attention and exploration. There are different ways of combining the two approaches in a system, including a back-off model using rules for precision and learning for recall, and semi-supervised learning using high-precision rules to generate an initial corpus or “seeds”.

Related posts:
Comparison of Pros and Cons of Two NLP Approaches
Is Google ranking based on machine learning?
Wei's Notes: Two Paths to Automatic Language Analysis (in Chinese)
Wei's Notes: Machine Learning and Natural Language Processing (in Chinese)
My road to learning Python for deep learning.

Preface: before I started to learn Python for deep learning, I was already very familiar with the theory of deep learning, as well as with two deep learning toolboxes (Caffe and MatConvNet). Since I am very interested in the Python language, I decided to learn Python for deep learning. For deep learning theory, I recommend the following materials:

Michael Nielsen: Neural Networks and Deep Learning, an online book (currently updated through chapter five). Remarks: this book is awesome; I spent two days finishing it. Interestingly, I got the same feeling as when I read Pattern Recognition and Machine Learning (PRML).
Geoffrey E. Hinton's neural networks course on Coursera: https://www.coursera.org/course/neuralnets
You can find many more materials on the internet, such as the deep learning course from Stanford; I do not intend to mention them all.

What I have done: to learn to use a deep learning toolbox written in Python, such as Theano or Torch, you need to learn the Python language well. If you have a good knowledge of C++ and Matlab, you will find it not so hard to learn Python. For Python skills, I recommend:

Google Python Class: https://developers.google.com/edu/python/; if you can access YouTube, you can watch the videos: https://www.youtube.com/watch?v=tKTZoB2Vjuklist=PLC8825D0450647509
Coursera course: https://www.coursera.org/course/pythonlearn
NumPy, SciPy, and matplotlib: https://www.youtube.com/watch?v=oYTs9HwFGbY
Python Imaging Library: a tutorial series by RootOfTheNull, https://www.youtube.com/watch?v=dkrXgzuZk3klist=PL1H1sBF1VAKXCayO4KZqSmym2Z_sn6Pha
A very good tutorial for Theano by Alec Radford: Introduction to Deep Learning with Python, https://www.youtube.com/watch?v=S75EdAcXHKkindex=1list=PL9Nq-Q1jocNvwnoyUIQFA1SAR9dwakhxa

So far, for Python, I have gone through the above-mentioned content.
My future plan is to follow two projects:

1. The Kaggle competition on detecting the location of keypoints on face images. Link: https://www.kaggle.com/c/facial-keypoints-detection
A very good blog tutorial on using CNNs to detect facial keypoints: http://danielnouri.org/notes/2014/12/17/using-convolutional-neural-nets-to-detect-facial-keypoints-tutorial/
2. The Kaggle competition Predict Ocean Health, One Plankton at a Time. Link: http://www.kaggle.com/c/datasciencebowl
The first-ranked method: http://benanne.github.io/2015/03/17/plankton.html

I have submitted one result to the facial keypoint detection competition and ranked 3rd. I will continue updating this content as I learn more. I have not proofread the language; please focus on the content and ignore the typos.
Let's take a close look at three related terms (Deep Learning vs Machine Learning vs Pattern Recognition), and see how they relate to some of the hottest tech themes of 2015 (namely Robotics and Artificial Intelligence). In our short journey through jargon, you should acquire a better understanding of how computer vision fits in, as well as gain an intuitive feel for how the machine learning zeitgeist has slowly evolved over time.

Fig 1. Putting a human inside a computer is not Artificial Intelligence (Photo from WorkFusion Blog)

If you look around, you'll see no shortage of jobs at high-tech startups looking for machine learning experts. While only a fraction of them are looking for Deep Learning experts, I bet most of these startups can benefit from even the most elementary kind of data scientist. So how do you spot a future data scientist? You learn how they think. The three highly related learning buzzwords “pattern recognition,” “machine learning,” and “deep learning” represent three different schools of thought. Pattern recognition is the oldest (and as a term is quite outdated). Machine Learning is the most fundamental (one of the hottest areas for startups and research labs as of today, early 2015). And Deep Learning is the new, the big, the bleeding edge -- we're not even close to thinking about the post-deep-learning era. Just take a look at the following Google Trends graph. You'll see that a) Machine Learning is rising like a true champion, b) Pattern Recognition started as synonymous with Machine Learning, c) Pattern Recognition is dying, and d) Deep Learning is new and rising fast.

1. Pattern Recognition: The birth of smart programs

Pattern recognition was a term popular in the 70s and 80s. The emphasis was on getting a computer program to do something “smart” like recognize the character 3. And it really took a lot of cleverness and intuition to build such a program. Just think of 3 vs B and 3 vs 8.
Back in the day, it didn't really matter how you did it as long as there was no human-in-a-box pretending to be a machine. (See Figure 1.) So if your algorithm would apply some filters to an image, localize some edges, and apply morphological operators, it was definitely of interest to the pattern recognition community. Optical Character Recognition grew out of this community, and it is fair to call “Pattern Recognition” the “Smart Signal Processing” of the 70s, 80s, and early 90s. Decision trees, heuristics, quadratic discriminant analysis, etc. all came out of this era. Pattern Recognition became something CS folks did, and not EE folks. One of the most popular books from that time period is the infamous Duda and Hart Pattern Classification book, and it is still a great starting point for young researchers. But don't get too caught up in the vocabulary, it's a bit dated.

The character 3 partitioned into 16 sub-matrices. Custom rules, custom decisions, and custom smart programs used to be all the rage. See OCR Page.

Quiz: The most popular Computer Vision conference is called CVPR, and the PR stands for Pattern Recognition. Can you guess the year of the first CVPR conference?

2. Machine Learning: Smart programs can learn from examples

Sometime in the early 90s people started realizing that a more powerful way to build pattern recognition algorithms is to replace an expert (who probably knows way too much about pixels) with data (which can be mined from cheap laborers). So you collect a bunch of face images and non-face images, choose an algorithm, and wait for the computations to finish. This is the spirit of machine learning. Machine Learning emphasizes that the computer program (or machine) must do some work after it is given data. The Learning step is made explicit. And believe me, waiting one day for your computations to finish scales better than inviting your academic colleagues to your home institution to design some classification rules by hand.
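To make the contrast concrete, the kind of hand-designed feature the pattern-recognition era relied on, such as the "16 sub-matrices" representation of the character 3 pictured above, can be sketched like this (the bitmap, grid size, and function name are illustrative):

```python
# Partition a binary character image into a 4x4 grid and count the "on"
# pixels in each cell -- a hand-designed feature vector in the spirit of
# the pattern-recognition era. The 8x8 bitmap is a rough, made-up "3".
bitmap = [
    "..####..",
    ".....##.",
    "......#.",
    "..###...",
    "......#.",
    ".....##.",
    "..####..",
    "........",
]

def grid_features(img, cells=4):
    n = len(img)
    step = n // cells
    feats = []
    for r in range(0, n, step):
        for c in range(0, n, step):
            feats.append(sum(row[c:c + step].count("#")
                             for row in img[r:r + step]))
    return feats  # 16 pixel counts, one per sub-matrix

print(grid_features(bitmap))
```

The machine-learning move described above was precisely to stop hand-crafting such features and decision rules, and instead let an algorithm fit them from labeled examples.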
What is Machine Learning, from Dr Natalia Konstantinova's blog. The most important part of this diagram are the Gears, which suggest that crunching/working/computing is an important step in the ML pipeline.

As Machine Learning grew into a major research topic in the mid 2000s, computer scientists began applying these ideas to a wide array of problems. No longer was it only character recognition, cat vs. dog recognition, and other “recognize a pattern inside an array of pixels” problems. Researchers started applying Machine Learning to Robotics (reinforcement learning, manipulation, motion planning, grasping), to genome data, as well as to predicting financial markets. Machine Learning was married with Graph Theory under the brand “Graphical Models,” every robotics expert had no choice but to become a Machine Learning expert, and Machine Learning quickly became one of the most desired and versatile computing skills. However, Machine Learning says nothing about the underlying algorithm. We've seen convex optimization, kernel-based methods, Support Vector Machines, and Boosting all have their winning days. Together with some custom manually engineered features, we had lots of recipes, lots of different schools of thought, and it wasn't entirely clear how a newcomer should select features and algorithms. But that was all about to change...

Further reading: To learn more about the kinds of features that were used in Computer Vision research, see my blog post: From feature descriptors to deep learning: 20 years of computer vision.

3. Deep Learning: one architecture to rule them all

Fast forward to today, and what we're seeing is a large interest in something called Deep Learning. The most popular kinds of Deep Learning models, as they are used in large-scale image recognition tasks, are known as Convolutional Neural Nets, or simply ConvNets.
ConvNet diagram from the Torch Tutorial

Deep Learning emphasizes the kind of model you might want to use (e.g., a deep convolutional multi-layer neural network) and that you can use data to fill in the missing parameters. But with deep learning comes great responsibility. Because you are starting with a model of the world which has a high dimensionality, you really need a lot of data (big data) and a lot of crunching power (GPUs). Convolutions are used extensively in deep learning (especially in computer vision applications), and the architectures are far from shallow. If you're starting out with Deep Learning, simply brush up on some elementary Linear Algebra and start coding. I highly recommend Andrej Karpathy's Hacker's guide to Neural Networks. Implementing your own CPU-based backpropagation algorithm on a non-convolution based problem is a good place to start. There are still lots of unknowns. The theory of why deep learning works is incomplete, and no single guide or book is better than true machine learning experience. There are lots of reasons why Deep Learning is gaining popularity, but Deep Learning is not going to take over the world. As long as you continue brushing up on your machine learning skills, your job is safe. But don't be afraid to chop these networks in half, slice 'n dice at will, and build software architectures that work in tandem with your learning algorithm. The Linux Kernel of tomorrow might run on Caffe (one of the most popular deep learning frameworks), but great products will always need great vision, domain expertise, market development, and most importantly: human creativity.

Other related buzzwords

Big data is the philosophy of measuring all sorts of things, saving that data, and looking through it for information. For business, this big-data approach can give you actionable insights. In the context of learning algorithms, we've only started seeing the marriage of big data and machine learning within the past few years.
Cloud computing, GPUs, DevOps, and PaaS providers have put large-scale computing within reach of the researcher and the ambitious everyday developer. Artificial Intelligence is perhaps the oldest term, the most vague, and the one that has gone through the most ups and downs over the past 50 years. When somebody says they work on Artificial Intelligence, you are either going to want to laugh at them or take out a piece of paper and write down everything they say.

Further reading: My 2011 blog post Computer Vision is Artificial Intelligence.

Conclusion

Machine Learning is here to stay. Don't think of it as Pattern Recognition vs Machine Learning vs Deep Learning; just realize that each term emphasizes something a little bit different. But the search continues. Go ahead and explore. Break something. We will continue building smarter software, and our algorithms will continue to learn, but we've only begun to explore the kinds of architectures that can truly rule them all. If you're interested in real-time vision applications of deep learning, namely those suitable for robotic and home automation applications, then you should check out what we've been building at vision.ai. Hopefully in a few days, I'll be able to say a little bit more. :-)
10 Common Misconceptions about Neural Networks As a computer scientist, I often get asked about neural networks because people would like to use them but often don't know how to go about it. Alternatively, they may have tried to use them but were disappointed in the results. Neural Networks don't have to be hard to use, and when used correctly they can produce superior results to other classes of predictive models such as regression analysis and decision tree induction . In quantitative finance, neural networks are most often used for time-series forecasting, proprietary trading signal generation, fully automated trading (decision making), financial modelling, derivatives pricing, credit risk assessments, pattern matching, and classification of securities. This article will discuss some of the theory behind neural networks. Neural networks are not models of the human brain Neural networks are not just a weak form of statistics Neural networks come in many different architectures Size matters, but bigger isn't always better Many training algorithms exist for neural networks Neural networks do not always require a lot of data Neural networks cannot be trained on any data Neural networks may need to be retrained Neural networks are not black boxes Neural networks are not hard to implement 1. Neural networks are not models of the human brain The human brain is a mystery and many scientists don't agree on how it works. Two popular theories of the brain are the grandmother cell theory and the distributed representation theory. In the first theory individual neurons are capable of representing complex concepts such as your grandmother or Jennifer Aniston . In the second theory neurons are believed to be much more simple. Artificial neural networks are inspired by the second theory and consist of many simple statistical functions connected together to form a network. 
Personally I support the belief that biological neurons are a lot more complex than artificial ones, and that their information capacity is larger as well. A single neuron in the brain is an incredibly complex machine that even today we don't understand. A single "neuron" in a neural network is an incredibly simple mathematical function that captures a minuscule fraction of the complexity of a biological neuron. So to say neural networks mimic the brain, that is true at the level of loose inspiration, but really artificial neural networks are nothing like what the biological brain does. - Andrew Ng Another big difference between the brain and neural networks is size and organization. Human brains contain many more neurons and synapses than neural networks, and they are self-organizing and adaptive. Neural networks, by comparison, are organized according to an architecture. In other words, neural networks are not self-organizing in the same sense as the brain. The only exception to this is adaptive neural networks, which are discussed later on in this article. So what does that mean? Think of it this way: a neural network is inspired by the brain in the same way that the Olympic stadium in Beijing is inspired by a bird's nest. That does not mean that the Olympic stadium is a bird's nest; it means that some elements of birds' nests are present in the design of the stadium. In other words, elements of the brain are present in the design of neural networks, but they are a lot less similar than you might think. In my opinion, neural networks are actually more closely related to statistical methods like curve fitting and regression analysis than to the human brain. In the context of quantitative finance I think it is important to remember that, because whilst it may sound cool to say that something is 'inspired by the brain', this statement may result in unrealistic expectations or fear. For more info see my LinkedIn article 'No!
Artificial Intelligence is not an existential threat'. Some very interesting views of the brain as created by state-of-the-art brain imaging techniques. Click on the image for more information. 2. Neural networks aren't a weak form of statistics Neural networks have more in common with statistical methods like curve fitting and regression analysis than with the human brain. This is because they approximate a complex non-linear function between inputs and outputs. This fact has led some academics and industry professionals to view neural networks as a weak form of statistics. This misconception exists because soft computing techniques, such as neural networks, are defined as algorithms capable of finding good inexact solutions to intractable problems. However, this does not mean that the resulting function is less accurate; it simply means that it is impossible to know what the real-world function is or whether one even exists. The mechanisms underlying a neural network are in fact statistical, and it is possible to reason about them using statistics. Neural networks are created by combining artificial neurons, called perceptrons. A perceptron contains a function, called an activation function, which maps inputs to outputs. In statistical terms, we can think of a perceptron as a multiple linear regression. The following activation functions are commonly used in neural networks: the linear function, step function, ramp function, Sigmoid function, hyperbolic tangent function, and Gaussian function. The type of activation function is very important, as it asserts requirements on the data being fed into the neural network (see misconception #7). This diagram shows six of the most popular activation functions which can be used in the perceptrons making up a neural network. Each perceptron in the network is adjusted such that the mean classification error of the network on a set of known data points is minimized.
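As a quick illustration, the six activation functions listed above can be written as one-line NumPy functions. The ramp here is one common clipped-linear variant; exact definitions of the step and ramp functions vary between texts:

```python
import numpy as np

# The six activation functions named above, as plain NumPy one-liners.
linear   = lambda z: z
step     = lambda z: np.where(z >= 0, 1.0, 0.0)
ramp     = lambda z: np.clip(z, 0.0, 1.0)   # one common clipped-linear variant
sigmoid  = lambda z: 1.0 / (1.0 + np.exp(-z))
tanh     = np.tanh
gaussian = lambda z: np.exp(-z ** 2)

# Evaluate each over a small range to see its shape
z = np.linspace(-3, 3, 7)
for name, f in [("linear", linear), ("step", step), ("ramp", ramp),
                ("sigmoid", sigmoid), ("tanh", tanh), ("gaussian", gaussian)]:
    print(name, np.round(f(z), 2))
```

Note how the bounded functions (step, ramp, Sigmoid, tanh, Gaussian) saturate outside a narrow input band, which is exactly why input scaling matters (misconception #7).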
How each perceptron is adjusted is determined by an optimization algorithm, e.g. gradient descent. This process is called training. A perceptron is only able to linearly separate the training data points. This diagram illustrates how a single-layer perceptron acts as a linear classifier of a data set. By combining multiple perceptrons together in a network structure, we are in effect creating a complex function which allows a neural network to achieve non-linear separability of the training data points. A network consisting of a single 'layer' of perceptrons is equivalent to a multiple linear regression, a statistical model for determining the relationship between two or more explanatory variables and a response variable by fitting a linear equation to observed data. This diagram illustrates how multiple perceptrons can be connected to create multiple linear regressions on a data set. The neural network is an evolution of this because it consists of multiple layers. In fact, a feedforward neural network is also called a multi-layer perceptron in some circles. In this model, the outputs from one layer of perceptrons form the inputs to the following layer of perceptrons. Layering enables neural networks to learn complex relationships. In trading, a regression between a set of technical indicators (input layer) and the future price of a security (output layer) might not exist. However, a neural network might find a complex relationship between regressions done on the inputs (hidden layer) and the future prices of the security. This diagram shows the general architecture of a multi-layer perceptron. This is the standard architecture for most neural network implementations. To summarize, neural networks are based on a strong statistical foundation, and stating that a neural network is just a weak form of statistics is incorrect.
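The layering idea can be sketched in a few lines: each layer of perceptrons feeds the next. The layer sizes (3 indicator inputs, 5 hidden units, 1 output) and the use of a Sigmoid everywhere are illustrative assumptions:

```python
import numpy as np

# Sketch of a multi-layer perceptron forward pass: each layer's
# outputs become the next layer's inputs.
rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

layers = [(rng.normal(size=(3, 5)), np.zeros(5)),   # input  -> hidden
          (rng.normal(size=(5, 1)), np.zeros(1))]   # hidden -> output

def forward(x, layers):
    a = x
    for W, b in layers:
        a = sigmoid(a @ W + b)  # each layer is a bank of perceptrons
    return a

indicators = np.array([0.2, -0.5, 1.0])  # e.g. three scaled technical indicators
print(forward(indicators, layers))
```

Stacking more `(W, b)` pairs into `layers` gives a deeper network without changing the forward-pass code at all.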
I will not pretend that I have covered the full theoretical foundation of neural networks; instead, I highly recommend this brilliant paper to readers interested in the statistics of neural networks. 3. Neural networks come in many architectures A neural network's performance is directly linked to its architecture, yet most practitioners have only ever used the feedforward neural network. This architecture consists of three layers: an input layer, a hidden layer, and an output layer. Whilst this is a generally good architecture, many others exist which may be better suited to the problem. There are many neural network architectures out there, so an exhaustive list is outside the scope of this article. Here are some popular ones: Partially recurrent networks - some connections flow backwards; in other words, feedback loops exist in the network. These networks are believed to perform better on time series data. Therefore they are especially relevant for trading strategies. This diagram shows three popular recurrent neural network architectures, namely the Elman neural network, the Jordan neural network, and the Hopfield single-layer neural network. Boltzmann neural networks - these were the first neural networks capable of learning internal representations and solving very difficult combinatoric problems. When constrained they can prove more efficient than traditional neural networks. This diagram shows how different Boltzmann Machines with connections between the different nodes can significantly affect the results of the neural network (graphs to the right of the networks). Deep neural networks - neural networks with multiple hidden layers. Deep neural networks are currently at the forefront of research. Essentially deep neural networks consist of many hidden layers which are trained independently, usually using Stochastic Gradient Descent. A great site for deep learning resources is DeepLearning.net.
This diagram shows a deep neural network which consists of multiple hidden layers. Adaptive neural networks - neural networks which simultaneously adapt and optimize their architectures whilst learning, by either growing or shrinking the architecture. Adaptive neural networks have been shown to perform well for forecasting time series events. This diagram shows two different types of adaptive neural network architectures. The left image is a cascade neural network and the right image is a self-organizing map. I believe that these represent the future of neural networks, because the architecture of a network determines what it can approximate. If the architecture is sub-optimal, then the network will never perform optimally regardless of how many perceptrons or connections it has. Another benefit of optimal architectures is improved information capacity. Neural networks with higher information capacity require fewer perceptrons to fit a complex function. Given that the larger a network becomes, the harder it is to train, this benefit can be very useful. Radial basis networks - although not a different type of architecture in the sense of perceptrons and connections, radial basis networks use radial basis functions as their activation functions; these are real-valued functions whose output depends on the distance from a particular point. The most commonly used radial basis function is the Gaussian distribution. This diagram shows how curve fitting can be done using radial basis functions. Because radial basis functions can take on much more complex forms, they were originally used for performing function interpolation. As such, a radial basis function neural network can have a much higher information capacity. Radial basis functions are also used in the kernel of a Support Vector Machine. For more information take a look at this presentation. In summary, various neural network architectures exist.
The performance of one neural network is often vastly superior to another; therefore, for quantitative analysts interested in using neural networks, an important first step is to decide which architecture(s) you want to test. 4. Size matters, but bigger isn't always better Having selected an architecture (excluding an adaptive architecture), one must then decide how large or small the neural network should be. To illustrate the point, assume I have selected a feed-forward neural network with three layers. How many inputs should I have? How many hidden perceptrons should I have? And how many outputs are required? There are many old programmers' tales which state that you must have between 10 and 20 perceptrons for high-dimensional problems. The truth is that every problem is unique, and that the best technique for finding the optimal size of any architecture is to empirically test various sizes. It is bad practice to stick with one initial guess and hope it works. That having been said, you can use a modified version of Occam's razor, a scientific heuristic for finding good hypotheses. In this case we use it to reason about neural network architectures. Simpler architectures are defined as ones with fewer perceptrons and fewer connections between them. For an interesting discussion on simplicity, read this. When you have two competing neural networks which make the same predictions, the one with the simpler architecture will generalize better. - Occam's razor for neural networks 5. Many training algorithms exist for neural networks The learning algorithm of a neural network tries to optimize the neural network's weights until some stopping condition has been met. This condition is normally when the neural network can predict the outcome of the training data set to an acceptable level of accuracy, but could also be when the computational budget has been exhausted. The most common learning algorithm is backpropagation, which uses stochastic gradient descent.
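A minimal sketch of the empirical sizing procedure combined with the Occam's-razor tie-break above: among candidate sizes whose validation errors are effectively the same, prefer the simplest. The (hidden size, validation error) pairs and the tolerance are made-up numbers, standing in for errors you would measure by training each candidate:

```python
# Occam's-razor model selection sketch: each pair is
# (hidden layer size, measured validation error) -- illustrative values.
candidates = [(2, 0.31), (5, 0.12), (10, 0.11), (20, 0.11), (50, 0.13)]

best_err = min(err for _, err in candidates)
tolerance = 0.01  # treat errors within this band as "the same predictions"

# Among the statistically indistinguishable candidates, pick the simplest
simplest = min(size for size, err in candidates if err <= best_err + tolerance)
print(simplest)  # -> 5
```

Here sizes 5, 10, and 20 all fall within tolerance of the best error, so the heuristic selects the 5-unit network, which should generalize best of the three.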
This algorithm consists of two phases: the feedforward pass, in which the training data set is passed through the network and the output from the neural network is recorded, and backward propagation, in which the error signal is passed back through the network and the weights of the neural network are optimized using gradient descent. There are some problems with this approach. Adjusting all the weights at once can result in a significant movement of the neural network in weight space, the gradient descent algorithm is slow, and it is susceptible to local minima. Assuming the neural network objective function contains local minima (this has been debated in recent years), it may make sense to use an optimization algorithm which is less sensitive to local minima. Such algorithms are called global optimization algorithms. Two popular global optimization algorithms are Particle Swarm Optimization (PSO) and the Genetic Algorithm (GA). Here is how they can be used to train neural networks: Neural network vector representation - by encoding the neural network as a vector of weights, each representing the weight of a connection in the neural network, we can train neural networks using most meta-heuristic search algorithms. This technique does not work well with deep neural networks because the vectors become too large. This diagram illustrates how a neural network can be represented in vector notation and related to the concept of a search space or fitness landscape. Particle Swarm Optimization - to train a neural network using a PSO we construct a population / swarm of those neural networks. Each neural network is represented as a vector of weights and is adjusted according to its position from the global best particle and its personal best. The fitness function is calculated as the sum-squared error of the reconstructed neural network after completing one feedforward pass of the training data set. The main consideration with this approach is the velocity of the weight updates.
This is because if the weights are adjusted too quickly, the sum-squared error of the neural networks will stagnate and no learning will occur. This diagram shows how particles are attracted to one another in a single-swarm Particle Swarm Optimization algorithm. Genetic Algorithm - to train a neural network using a genetic algorithm we first construct a population of vector-represented neural networks. Then we apply the three genetic operators on that population to evolve better and better neural networks. These three operators are: Selection - using the sum-squared error of each network calculated after one feedforward pass, we rank the population of neural networks. The top x% of the population are selected to 'survive' to the next generation and be used for crossover. Crossover - the top x% of the population's genes are allowed to cross over with one another. This process forms 'offspring'. In context, each offspring will represent a new neural network with weights from both of the 'parent' neural networks. Mutation - this operator is required to maintain genetic diversity in the population. A small percentage of the population are selected to undergo mutation. Some of the weights in these neural networks will be adjusted randomly within a particular range. This diagram shows the selection, crossover, and mutation genetic operators being applied to a population of neural networks represented as vectors. In addition to these population-based metaheuristic search algorithms, other algorithms have been used to train neural networks, including backpropagation with added momentum, differential evolution, Levenberg-Marquardt, simulated annealing, and many more. 6. Neural networks do not always require a lot of data Neural networks can use one of three learning strategies, namely a supervised learning strategy, an unsupervised learning strategy, or a reinforcement learning strategy.
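The vector encoding and the three genetic operators above can be sketched as follows. The toy data set, network size, population size, and mutation rate are all illustrative assumptions, not a production recipe:

```python
import numpy as np

# Genetic-algorithm training sketch: each network is a flat weight
# vector; selection, crossover, and mutation evolve the population.
rng = np.random.default_rng(2)
X = rng.normal(size=(32, 2))
y = np.tanh(X[:, :1] - X[:, 1:])          # toy target function

n_in, n_hid = 2, 4
dim = n_in * n_hid + n_hid + n_hid + 1    # W1, b1, W2, b2 flattened

def decode(v):
    i = 0
    W1 = v[i:i + n_in * n_hid].reshape(n_in, n_hid); i += n_in * n_hid
    b1 = v[i:i + n_hid]; i += n_hid
    W2 = v[i:i + n_hid].reshape(n_hid, 1); i += n_hid
    return W1, b1, W2, v[i:]

def sse(v):                                # fitness: one feedforward pass
    W1, b1, W2, b2 = decode(v)
    out = np.tanh(np.tanh(X @ W1 + b1) @ W2 + b2)
    return float(np.sum((out - y) ** 2))

pop = rng.normal(size=(30, dim))
best0 = min(sse(v) for v in pop)
for gen in range(100):
    pop = pop[np.argsort([sse(v) for v in pop])]   # rank by fitness
    parents = pop[:10]                             # selection: top third survives
    kids = []
    for _ in range(20):
        a, b = parents[rng.integers(10, size=2)]
        mask = rng.random(dim) < 0.5               # uniform crossover
        child = np.where(mask, a, b)
        child = child + rng.normal(scale=0.1, size=dim) * (rng.random(dim) < 0.1)  # mutation
        kids.append(child)
    pop = np.vstack([parents, kids])

print(round(best0, 2), "->", round(sse(pop[0]), 2))
```

Because the top-ranked parents always survive (elitism), the best fitness is monotone non-increasing across generations; the same encode/decode scaffolding would also serve a PSO trainer.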
The principal difference between these three strategies is the amount of labelled data that they require. Labelled data is training data for which the correct output is known upfront. Supervised learning - these strategies require at least two datasets: a training set which consists of inputs with the expected output, and a generalization set which consists of inputs without the expected output. Over-fitting is the term given to a neural network which has 'learnt' the noise in the training set too well and cannot generalize well on unseen data. Unsupervised learning - these strategies discover hidden structures (such as Markov chains) in unlabelled data. They are based on well-known statistical techniques such as density estimation, principal component analysis, Hebbian learning, and clustering algorithms. Because unsupervised learning does not need any labelled data, the neural network can be applied to under-formulated problems where the correct output is not known. An example is the Google neural network which used unsupervised learning to discover cats without having any prior knowledge of them, i.e. no labelled sets of cat images were used to train the network. Two common unsupervised neural networks are the self-organizing map (SOM) and adaptive resonance theory (ART). A SOM is a multi-dimensional scaling method for projecting high-dimensional search spaces onto a two-dimensional grid. A SOM produces a 'heat map' which can be analysed to identify similar patterns and their underlying characteristics. A self-organizing map showing U.S. Congress voting patterns visualized in Synapse. The first two boxes show clustering and distances while the remaining ones show the component planes. Red means a yes vote while blue means a no vote in the component planes (except the party component, where red is Republican and blue is Democratic).
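The SOM projection just described can be sketched with a simple update rule: find the grid cell closest to each sample, then pull it and its neighbours toward the sample. The grid size, learning-rate decay, and neighbourhood width are illustrative assumptions:

```python
import numpy as np

# Minimal self-organizing map sketch: project 3-D points onto a
# 2-D grid of weight vectors.
rng = np.random.default_rng(5)
data = rng.random((200, 3))                 # e.g. three scaled indicators
grid = rng.random((10, 10, 3))              # 10x10 map of weight vectors

rows, cols = np.indices((10, 10))
for t, x in enumerate(data):
    lr = 0.5 * (1 - t / len(data))          # decaying learning rate
    radius = 3.0 * (1 - t / len(data)) + 1e-9  # shrinking neighbourhood
    # Best matching unit: the grid cell closest to the sample
    d = np.linalg.norm(grid - x, axis=2)
    bi, bj = np.unravel_index(np.argmin(d), d.shape)
    # Pull the BMU and its neighbours toward the sample
    dist2 = (rows - bi) ** 2 + (cols - bj) ** 2
    influence = np.exp(-dist2 / (2 * radius ** 2))
    grid += lr * influence[..., None] * (x - grid)

print(grid.shape)
```

After training, colouring each grid cell by one component of its weight vector gives exactly the kind of component-plane 'heat map' described above.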
This thesis describes an unsupervised learning strategy using the particle swarm optimization algorithm and a neural network to discover favourable technical market indicators and trading strategies, all starting with zero expert knowledge of the securities or technical indicators. Another interesting application of SOMs is in colouring time segments of stock charts to represent which market patterns they represent. This website provides a detailed tutorial and code snippets for implementing the idea for improved Forex trading strategies. Reinforcement learning - these strategies are based on the simple premise of rewarding neural networks for good behaviours and punishing them for bad behaviours. This strategy lends itself to trading because good decisions and bad decisions are easy to quantify in terms of existing metrics and profit or loss. It also requires no labelled training data. Reinforcement learning strategies consist of three components: a policy which specifies how the neural network will make decisions, e.g. using technical and fundamental indicators; a reward function which distinguishes good from bad, e.g. making vs. losing money; and a value function which specifies the long-term goal, e.g. a high Sortino ratio. This diagram shows how a neural network can be either negatively or positively reinforced. 7. Neural networks cannot be trained on any data In my opinion this is the worst misconception about neural networks. Many people who try to use neural networks do not properly pre-process the data being fed into the neural network. The result is that the neural network will under-perform. Data normalization, removal of redundant information, and outlier removal should all be done to improve performance. Data normalization - neural networks consist of various perceptrons linked together through weighted connections. Each perceptron contains an activation function, and each activation function has an 'active range' (excepting radial basis functions).
Inputs into the neural network should be scaled within this range so that the activation function's outputs are different for each input. Consider a neural network trading system which receives indicators about a set of securities as inputs and outputs whether each security should be bought or sold. One of the inputs is the price of the security and we are using the Sigmoid activation function. However, most of the securities cost between $5 and $15 per share, so the output of the Sigmoid function approaches 1.0 for all securities, all of the perceptrons will 'fire', and the neural network will not learn. Neural networks trained on unprocessed data produce models where 'the lights are on but nobody's home'. The active range of the Sigmoid function is -sqrt(3) to sqrt(3), so the prices of each security should be scaled to fit within that range. The other consideration is how to handle securities like Berkshire Hathaway which cost $190,000 per share. Would feeding this security's price into our trading system impact its ability to correctly classify $5 - $15 securities? Outlier removal - an outlier is a value that is much smaller or larger than most of the other values in some set of data. Outliers can cause problems with statistical techniques like regression analysis and curve fitting because when the model tries to 'accommodate' the outlier, the performance of the model across all other data deteriorates. Consider the illustration below. This diagram shows the effect of removing an outlier from the training data for a linear regression. The results are comparable for neural networks. Image source: https://statistics.laerd.com/statistical-guides/img/pearson-6.png The illustration shows that trying to accommodate an outlier into the linear regression model results in a poor fit of the data set. The effect of outliers on non-linear regression models, including neural networks, is similar.
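A sketch of the scaling and outlier handling just described, under illustrative assumptions: a uniform $5-$15 price sample with a single Berkshire-like outlier, a simple 3-standard-deviation cutoff, and min-max scaling into the Sigmoid's active range:

```python
import numpy as np

# Price pre-processing sketch: outlier removal, then scaling into the
# Sigmoid's active range [-sqrt(3), sqrt(3)].
rng = np.random.default_rng(3)
prices = np.append(rng.uniform(5, 15, size=50), 190_000.0)  # one Berkshire-like outlier

# 1. Outlier removal: drop points far from the median relative to the
#    spread (a simple, common heuristic; many others exist).
z = np.abs(prices - np.median(prices)) / prices.std()
cleaned = prices[z < 3]

# 2. Min-max scale the surviving prices into the active range
lo, hi = -np.sqrt(3), np.sqrt(3)
scaled = lo + (cleaned - cleaned.min()) / (cleaned.max() - cleaned.min()) * (hi - lo)

print(len(prices) - len(cleaned), "outlier(s) removed")
print(round(float(scaled.min()), 3), round(float(scaled.max()), 3))
```

With the outlier removed first, the $5-$15 prices spread across the whole active range instead of being crushed into a sliver of it.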
Therefore it is good practice to remove outliers from the training data set. This is a challenge in itself; this tutorial and paper discuss existing techniques. Remove redundancy - larger neural networks are less likely to generalize well, which is one reason why simpler networks are desirable. Another reason is that they are faster, more efficient, and easier to 'decipher'. Removing redundant inputs can simplify your neural network. Different inputs can share mutual information about the problem. Mutual information is the degree of dependence between two inputs. If this is high, then the two variables will be strongly correlated with one another. This means that the amount of unique information presented by either input is small, and the less significant input can be removed. Adaptive neural networks will automatically prune redundant connections, perceptrons, and inputs. For fixed architectures, this process requires some data pre-processing. Measuring the correlation between each pair of inputs and performing a sensitivity analysis between the inputs and the expected outputs can both help identify redundant inputs. 8. Neural networks may need to be retrained Neural networks tend to stop working over time. That having been said, you would be wrong to assume that this is a poor reflection on neural networks. It is actually an accurate reflection of the world we live in. The world is constantly changing. This is especially true for financial markets, because the underlying mechanisms, the market participants, are not predictable. Investors' emotions drive markets to bubble and then burst. What I find interesting is that chaos theory originated from the study of weather, whose underlying mechanisms are well understood. What does that say about financial markets? Crowding also contributes to the dynamic nature of financial markets; for more information read The Crisis of Crowding.
Dynamic environments, such as financial markets, are extremely difficult for neural networks to model. Two approaches are either to keep retraining the neural network over time, or to use a dynamic neural network. Dynamic neural networks 'track' changes to the environment over time and adjust their architecture and weights accordingly. They are adaptive over time. For dynamic problems, multi-solution meta-heuristic optimization algorithms can be used to track changes to local optima over time. One such algorithm is the multi-swarm optimization algorithm, a derivative of particle swarm optimization. Genetic algorithms with enhanced diversity or memory have also been shown to be robust in dynamic environments. The illustration below demonstrates how a genetic algorithm evolves over time to find new optima in a dynamic environment. This illustration also happens to mimic trade crowding, which is when market participants crowd a profitable trading strategy, thereby exhausting trading opportunities and causing the trade to become less profitable over time. This animated image shows a dynamic fitness landscape (search space) changing over time. Image source: http://en.wikipedia.org/wiki/Fitness_landscape 9. Neural networks are not black boxes By itself a neural network is a black box. This presents problems for people wanting to use them. For example, fund managers wouldn't know how a neural network makes trading decisions, so it is impossible to assess the risks of the trading strategies learned by the neural network. Similarly, banks using neural networks for credit risk modelling would not be able to justify why a customer has a particular credit rating, which is a regulatory requirement. That having been said, state-of-the-art rule-extraction algorithms have been developed to 'vitrify' (make transparent) some neural network architectures. These algorithms extract knowledge from the neural networks as either mathematical expressions, symbolic logic, fuzzy logic, or decision trees.
This image shows a neural network as a black box and how it relates to rule extraction techniques. Mathematical rules - algorithms have been developed which can extract multiple linear regression lines from neural networks. The problem with these techniques is that the rules are often still difficult to understand, therefore they do not solve the 'black-box' problem. Propositional logic - propositional logic is a branch of mathematical logic which deals with operations done on discrete-valued variables. These variables, such as A or B, are often either TRUE or FALSE, but they could occupy values within a discrete range e.g. {BUY, HOLD, SELL}. Logical operations can then be applied to those variables, such as OR, AND, and XOR. The results are called predicates, which can also be quantified over sets using the exists or for-all quantifiers. This is the difference between predicate and propositional logic. If we had a simple neural network which took Price (P), Simple Moving Average (SMA), and Exponential Moving Average (EMA) as inputs, and we extracted a trend-following strategy from the neural network in propositional logic, we might get rules like this: Fuzzy logic - fuzzy logic is where probability and propositional logic meet. The problem with propositional logic is that it deals in absolutes, e.g. BUY or SELL, TRUE or FALSE, 0 or 1. Therefore for traders there is no way to determine the confidence of these results. Fuzzy logic overcomes this limitation by introducing a membership function which specifies how much a variable belongs to a particular domain. For example, a company (GOOG) might belong 0.7 to the domain {BUY} and 0.3 to the domain {SELL}. Combinations of neural networks and fuzzy logic are called Neuro-Fuzzy systems. This research survey discusses the various fuzzy rule extraction techniques which exist for neural networks. Decision trees - decision trees are data structures which show decision making under various conditions or given certain information.
This article I wrote describes how to evolve security analysis decision trees using genetic programming. Decision tree induction is the term given to the process of extracting decision trees from neural networks. An example of a simple trading strategy represented using a decision tree: the triangular boxes represent decision nodes, which could be to BUY, HOLD, or SELL a company. Each box represents a tuple of (indicator, inequality, value); an example might be (SMA, >, 25) or (EMA, <=, 30). 10. Neural networks are not hard to implement Speaking from personal experience, neural networks are quite difficult to code from scratch. Luckily for us, there are many existing open source and proprietary packages which contain implementations of different types of neural networks. However, for advanced topics, such as rule extraction, custom development is unavoidable. Encog - an easy-to-use library containing implementations of many machine learning algorithms and neural networks. Encog is particularly nice because it offers an API which allows users to define new algorithms for training and creating adaptive neural networks. PyBrain - a modular machine learning library for Python which contains implementations of various neural networks. Python is great for financial modelling because it can be combined with statistical packages such as Pandas and SciPy. SAS Enterprise Miner - SAS is a proprietary statistical programming language used across the financial services industry. The SAS Enterprise Miner module contains implementations of various neural networks and decision tree classification structures. Scikit-learn - another open source machine learning library for the Python programming language. Again, Python is great for financial modelling because it can be combined with statistical packages such as Pandas and SciPy. For readers interested in using neural networks, I recommend using an existing package.
A general rule of thumb is that 'off the shelf' packages, whether open source or proprietary, contain fewer bugs and produce more reliable results than custom-developed applications. That said, there is no better way to learn neural networks than to code one. For more useful tools and applications check out my Tools for Computational Finance page. Conclusion Neural networks are a class of powerful machine learning algorithms. They are based on solid statistical foundations and have been applied successfully in financial models as well as in trading strategies for many years. Despite this, they have a bad reputation in industry caused by the many unsuccessful attempts to use them in practice. In most cases, unsuccessful neural network implementations can be traced back to inappropriate neural network design decisions and general misconceptions about how they work. This article aims to articulate some of these misconceptions in the hope that they might help individuals implementing neural networks meet with success. For readers interested in getting more information, I have found the following books to be quite instructional when it comes to neural networks and their role in finance and trading. Lastly, if you managed to read all 4,500 words of this article, congratulations. Please contact me or comment below if you have any specific questions, suggestions, or corrections. Thank you. Source: http://www.stuartreid.co.za/misconceptions-about-neural-networks/
Papers on image restoration using deep learning:
Image denoising:
- Deep convolutional neural network for image deconvolution
- Image denoising: Can plain Neural Networks compete with BM3D?
- Image denoising and inpainting with deep neural networks
- Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion
- Image denoising with multi-layer perceptrons
- Adaptive multi-column deep neural networks with application to robust image denoising
- Robust image denoising with multi-column deep neural networks
- Image Denoising with Rectified Linear Units
Super-resolution:
- Learning a deep convolutional network for image super-resolution
- Image Super-Resolution Using Deep Convolutional Networks
- Image Super-Resolution with Fast Approximate Convolutional Sparse Coding
- Deep Network Cascade for Image Super-resolution
Deblurring:
- Image Deblurring Using Back Propagation Neural Network
Description: This tutorial will teach you the main ideas of Unsupervised Feature Learning and Deep Learning. By working through it, you will also get to implement several feature learning/deep learning algorithms, get to see them work for yourself, and learn how to apply/adapt these ideas to new problems. This tutorial assumes a basic knowledge of machine learning (specifically, familiarity with the ideas of supervised learning, logistic regression, gradient descent). If you are not familiar with these ideas, we suggest you go to this Machine Learning course and complete sections II, III, IV (up to Logistic Regression) first. Sparse Autoencoder Neural Networks Backpropagation Algorithm Gradient checking and advanced optimization Autoencoders and Sparsity Visualizing a Trained Autoencoder Sparse Autoencoder Notation Summary Exercise:Sparse Autoencoder Vectorized implementation Vectorization Logistic Regression Vectorization Example Neural Network Vectorization Exercise:Vectorization Preprocessing: PCA and Whitening PCA Whitening Implementing PCA/Whitening Exercise:PCA in 2D Exercise:PCA and Whitening Softmax Regression Softmax Regression Exercise:Softmax Regression Self-Taught Learning and Unsupervised Feature Learning Self-Taught Learning Exercise:Self-Taught Learning Building Deep Networks for Classification From Self-Taught Learning to Deep Networks Deep Networks: Overview Stacked Autoencoders Fine-tuning Stacked AEs Exercise: Implement deep networks for digit classification Linear Decoders with Autoencoders Linear Decoders Exercise:Learning color features with Sparse Autoencoders Working with Large Images Feature extraction using convolution Pooling Exercise:Convolution and Pooling Note : The sections above this line are stable. The sections below are still under construction, and may change without notice. Feel free to browse around however, and feedback/suggestions are welcome. 
Miscellaneous MATLAB Modules Style Guide Useful Links Miscellaneous Topics Data Preprocessing Deriving gradients using the backpropagation idea Advanced Topics : Sparse Coding Sparse Coding Sparse Coding: Autoencoder Interpretation Exercise:Sparse Coding ICA Style Models Independent Component Analysis Exercise:Independent Component Analysis Others Convolutional training Restricted Boltzmann Machines Deep Belief Networks Denoising Autoencoders K-means Spatial pyramids / Multiscale Slow Feature Analysis Tiled Convolution Networks Material contributed by: Andrew Ng, Jiquan Ngiam, Chuan Yu Foo, Yifan Mai, Caroline Suen
Wiley Health Learning offers continuing education activities to support your clinical practice. Our cutting-edge learning activities draw on evidence-based content from the most trusted publications, ready for your future use. As long as you are online, you can start, save, or complete a learning activity; to obtain a certificate, simply complete the task in the e-store. Wiley Health Learning brings you high-quality education to raise your standard of care. We are currently offering dermatologists in the Asia-Pacific region a free online training program: "Guidelines for management of androgenetic alopecia based on BASP classification – the Asian consensus committee guideline", Journal of the European Academy of Dermatology and Venereology, August 2013. Activity type: journal-based continuing medical education. Register and start browsing all the learning programs we offer on WileyHealthLearning! Learn more about creating your own e-learning program on Wiley Health Learning! Check this page for updates on new learning activities on Wiley Health Learning!
Neural Networks, Manifolds, and Topology Posted on April 6, 2014 ( colah's blog ) topology, neural networks, deep learning, manifold hypothesis Recently, there’s been a great deal of excitement and interest in deep neural networks because they’ve achieved breakthrough results in areas such as computer vision. 1 However, there remain a number of concerns about them. One is that it can be quite challenging to understand what a neural network is really doing. If one trains it well, it achieves high quality results, but it is challenging to understand how it is doing so. If the network fails, it is hard to understand what went wrong. While it is challenging to understand the behavior of deep neural networks in general, it turns out to be much easier to explore low-dimensional deep neural networks – networks that only have a few neurons in each layer. In fact, we can create visualizations to completely understand the behavior and training of such networks. This perspective will allow us to gain deeper intuition about the behavior of neural networks and observe a connection linking neural networks to an area of mathematics called topology. A number of interesting things follow from this, including fundamental lower-bounds on the complexity of a neural network capable of classifying certain datasets. A Simple Example Let’s begin with a very simple dataset, two curves on a plane. The network will learn to classify points as belonging to one or the other. The obvious way to visualize the behavior of a neural network – or any classification algorithm, for that matter – is to simply look at how it classifies every possible data point. We’ll start with the simplest possible class of neural network, one with only an input layer and an output layer. Such a network simply tries to separate the two classes of data by dividing them with a line. That sort of network isn’t very interesting. 
Modern neural networks generally have multiple layers between their input and output, called “hidden” layers. At the very least, they have one. Diagram of a simple network from Wikipedia As before, we can visualize the behavior of this network by looking at what it does to different points in its domain. It separates the data with a more complicated curve than a line. With each layer, the network transforms the data, creating a new representation . 2 We can look at the data in each of these representations and how the network classifies them. When we get to the final representation, the network will just draw a line through the data (or, in higher dimensions, a hyperplane). In the previous visualization, we looked at the data in its “raw” representation. You can think of that as us looking at the input layer. Now we will look at it after it is transformed by the first layer. You can think of this as us looking at the hidden layer. Each dimension corresponds to the firing of a neuron in the layer. The hidden layer learns a representation so that the data is linearly separable. Continuous Visualization of Layers In the approach outlined in the previous section, we learn to understand networks by looking at the representation corresponding to each layer. This gives us a discrete list of representations. The tricky part is in understanding how we go from one to another. Thankfully, neural network layers have nice properties that make this very easy. There are a variety of different kinds of layers used in neural networks. We will talk about tanh layers for a concrete example. A tanh layer tanh ( W x + b ) consists of: a linear transformation by the “weight” matrix W ; a translation by the vector b ; point-wise application of tanh. We can visualize this as a continuous transformation, as follows: The story is much the same for other standard layers, consisting of an affine transformation followed by pointwise application of a monotone activation function.
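The three steps of a tanh layer are only a few lines of NumPy; a sketch (the weight matrix and bias below are arbitrary illustrative values):

```python
import numpy as np

# A tanh layer: h = tanh(W x + b), i.e. a linear transformation,
# a translation, and a pointwise nonlinearity, applied in that order.
W = np.array([[1.0, -0.5],
              [0.3,  0.8]])   # "weight" matrix: linear transformation
b = np.array([0.1, -0.2])     # bias vector: translation

def tanh_layer(x):
    return np.tanh(W @ x + b)

x = np.array([0.5, 1.0])
print(tanh_layer(x))  # each component lands in (-1, 1)
```

Since every component of the output lies in (−1, 1), the layer squishes all of space into a bounded region while the affine part stretches and rotates it.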
We can apply this technique to understand more complicated networks. For example, the following network classifies two spirals that are slightly entangled, using four hidden layers. Over time, we can see it shift from the “raw” representation to higher level ones it has learned in order to classify the data. While the spirals are originally entangled, by the end they are linearly separable. On the other hand, the following network, also using multiple layers, fails to classify two spirals that are more entangled. It is worth explicitly noting here that these tasks are only somewhat challenging because we are using low-dimensional neural networks. If we were using wider networks, all this would be quite easy. (Andrej Karpathy has made a nice demo based on ConvnetJS that allows you to interactively explore networks with this sort of visualization of training!) Topology of tanh Layers Each layer stretches and squishes space, but it never cuts, breaks, or folds it. Intuitively, we can see that it preserves topological properties. For example, a set will be connected afterwards if it was before (and vice versa). Transformations like this, which don’t affect topology, are called homeomorphisms. Formally, they are bijections that are continuous functions both ways. Theorem : Layers with N inputs and N outputs are homeomorphisms, if the weight matrix, W , is non-singular. (Though one needs to be careful about domain and range.) Proof : Let’s consider this step by step: Let’s assume W has a non-zero determinant. Then it is a bijective linear function with a linear inverse. Linear functions are continuous. So, multiplying by W is a homeomorphism. Translations are homeomorphisms. tanh (and sigmoid and softplus but not ReLU) are continuous functions with continuous inverses. They are bijections if we are careful about the domain and range we consider. Applying them pointwise is a homeomorphism. Thus, if W has a non-zero determinant, our layer is a homeomorphism. 
∎ This result continues to hold if we compose arbitrarily many of these layers together. Topology and Classification A is red, B is blue Consider a two dimensional dataset with two classes A , B ⊂ R 2 : A = { x | d ( x , 0 ) < 1/3 } B = { x | 2/3 < d ( x , 0 ) < 1 } Claim : It is impossible for a neural network to classify this dataset without having a layer that has 3 or more hidden units, regardless of depth. As mentioned previously, classification with a sigmoid unit or a softmax layer is equivalent to trying to find a hyperplane (or in this case a line) that separates A and B in the final representation. With only two hidden units, a network is topologically incapable of separating the data in this way, and doomed to failure on this dataset. In the following visualization, we observe a hidden representation while a network trains, along with the classification line. As we watch, it struggles and flounders trying to learn a way to do this. For this network, hard work isn’t enough. In the end it gets pulled into a rather unproductive local minimum. It is, though, able to achieve ∼ 80 % classification accuracy. This example only had one hidden layer, but it would fail regardless. Proof : Either each layer is a homeomorphism, or the layer’s weight matrix has determinant 0. If it is a homeomorphism, A is still surrounded by B , and a line can’t separate them. But suppose it has a determinant of 0: then the dataset gets collapsed on some axis. Since we’re dealing with something homeomorphic to the original dataset, A is surrounded by B , and collapsing on any axis means we will have some points of A and B mix and become impossible to distinguish between. ∎ If we add a third hidden unit, the problem becomes trivial. The neural network learns the following representation: With this representation, we can separate the datasets with a hyperplane. 
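The dataset in the claim, a disk A surrounded by an annulus B, is easy to generate and check; a sketch (sampling radii uniformly within each class's band is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_annulus(n, r_min, r_max):
    """Sample n points whose distance from the origin lies in [r_min, r_max)."""
    r = rng.uniform(r_min, r_max, n)
    theta = rng.uniform(0, 2 * np.pi, n)
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)

A = sample_annulus(200, 0.0, 1/3)    # A = {x : d(x, 0) < 1/3}
B = sample_annulus(200, 2/3, 1.0)    # B = {x : 2/3 < d(x, 0) < 1}

# Sanity check: every point of A is strictly inside B's inner radius,
# so A is surrounded by B and no single line can separate the classes.
print(np.linalg.norm(A, axis=1).max() < np.linalg.norm(B, axis=1).min())  # True
```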
To get a better sense of what’s going on, let’s consider an even simpler dataset that’s 1-dimensional: A = [ −1/3 , 1/3 ] B = [ −1 , −2/3 ] ∪ [ 2/3 , 1 ] Without using a layer of two or more hidden units, we can’t classify this dataset. But if we use one with two units, we learn to represent the data as a nice curve that allows us to separate the classes with a line: What’s happening? One hidden unit learns to fire when x > −1/2 and one learns to fire when x > 1/2 . When the first one fires, but not the second, we know that we are in A. The Manifold Hypothesis Is this relevant to real world data sets, like image data? If you take the manifold hypothesis really seriously, I think it bears consideration. The manifold hypothesis is that natural data forms lower-dimensional manifolds in its embedding space. There are both theoretical 3 and experimental 4 reasons to believe this to be true. If you believe this, then the task of a classification algorithm is fundamentally to separate a bunch of tangled manifolds. In the previous examples, one class completely surrounded another. However, it doesn’t seem very likely that the dog image manifold is completely surrounded by the cat image manifold. But there are other, more plausible topological situations that could still pose an issue, as we will see in the next section. Links And Homotopy Another interesting dataset to consider is two linked tori, A and B . Much like the previous datasets we considered, this dataset can’t be separated without using n + 1 dimensions, namely a 4 th dimension. Links are studied in knot theory, an area of topology. Sometimes when we see a link, it isn’t immediately obvious whether it’s an unlink (a bunch of things that are tangled together, but can be separated by continuous deformation) or not. A relatively simple unlink. If a neural network using layers with only 3 units can classify it, then it is an unlink. (Question: Can all unlinks be classified by a network with only 3 units, theoretically?) 
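The 1-dimensional example can be verified with hand-set weights: two steep sigmoid units, one thresholding at −1/2 and one at 1/2 (the steepness constant k is an arbitrary choice for this sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden(x, k=50.0):
    """Two hand-set hidden units: h1 fires when x > -1/2, h2 when x > 1/2."""
    h1 = sigmoid(k * (x + 0.5))
    h2 = sigmoid(k * (x - 0.5))
    return h1, h2

def in_A(x):
    # We are in A exactly when the first unit fires but the second does not.
    h1, h2 = hidden(x)
    return h1 > 0.5 and h2 < 0.5

print(in_A(0.0))    # True:  0 lies in A = [-1/3, 1/3]
print(in_A(-0.8))   # False: -0.8 lies in B
print(in_A(0.8))    # False: 0.8 lies in B
```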
From this knot perspective, our continuous visualization of the representations produced by a neural network isn’t just a nice animation, it’s a procedure for untangling links. In topology, we would call it an ambient isotopy between the original link and the separated ones. Formally, an ambient isotopy between manifolds A and B is a continuous function F : [ 0 , 1 ] × X → Y such that each F t is a homeomorphism from X to its range, F 0 is the identity function, and F 1 maps A to B . That is, F t continuously transitions from mapping A to itself to mapping A to B . Theorem : There is an ambient isotopy between the input and a network layer’s representation if: a) W isn’t singular, b) we are willing to permute the neurons in the hidden layer, and c) there is more than 1 hidden unit. Proof : Again, we consider each stage of the network individually: The hardest part is the linear transformation. In order for this to be possible, we need W to have a positive determinant. Our premise is that it isn’t zero, and we can flip the sign if it is negative by switching two of the hidden neurons, and so we can guarantee the determinant is positive. The space of positive determinant matrices is path-connected , so there exists p : [ 0 , 1 ] → G L n ( R ) 5 such that p ( 0 ) = I d and p ( 1 ) = W . We can continuously transition from the identity function to the W transformation with the function x → p ( t ) x , multiplying x at each point in time t by the continuously transitioning matrix p ( t ) . We can continuously transition from the identity function to the b translation with the function x → x + t b . We can continuously transition from the identity function to the pointwise use of σ with the function: x → ( 1 − t ) x + t σ ( x ) . ∎ I imagine there is probably interest in programs automatically discovering such ambient isotopies and automatically proving the equivalence of certain links, or that certain links are separable. 
It would be interesting to know if neural networks can beat whatever the state of the art is there. (Apparently determining if knots are trivial is NP. This doesn’t bode well for neural networks.) The sort of links we’ve talked about so far don’t seem likely to turn up in real world data, but there are higher dimensional generalizations. It seems plausible such things could exist in real world data. Links and knots are 1 -dimensional manifolds, but we need 4 dimensions to be able to untangle all of them. Similarly, one can need yet higher dimensional space to be able to unknot n -dimensional manifolds. All n -dimensional manifolds can be untangled in 2 n + 2 dimensions. 6 (I know very little about knot theory and really need to learn more about what’s known regarding dimensionality and links. If we know a manifold can be embedded in n-dimensional space, instead of the dimensionality of the manifold, what limit do we have?) The Easy Way Out The natural thing for a neural net to do, the very easy route, is to try and pull the manifolds apart naively and stretch the parts that are tangled as thin as possible. While this won’t be anywhere close to a genuine solution, it can achieve relatively high classification accuracy and be a tempting local minimum. It would present itself as very high derivatives on the regions it is trying to stretch, and sharp near-discontinuities. We know these things happen. 7 Contractive penalties, penalizing the derivatives of the layers at data points, are the natural way to fight this. 8 Since these sort of local minima are absolutely useless from the perspective of trying to solve topological problems, topological problems may provide a nice motivation to explore fighting these issues. On the other hand, if we only care about achieving good classification results, it seems like we might not care. If a tiny bit of the data manifold is snagged on another manifold, is that a problem for us? 
It seems like we should be able to get arbitrarily good classification results despite this issue. (My intuition is that trying to cheat the problem like this is a bad idea: it’s hard to imagine that it won’t be a dead end. In particular, in an optimization problem where local minima are a big problem, picking an architecture that can’t genuinely solve the problem seems like a recipe for bad performance.) Better Layers for Manipulating Manifolds? The more I think about standard neural network layers – that is, with an affine transformation followed by a point-wise activation function – the more disenchanted I feel. It’s hard to imagine that these are really very good for manipulating manifolds. Perhaps it might make sense to have a very different kind of layer that we can use in composition with more traditional ones? The thing that feels natural to me is to learn a vector field with the direction we want to shift the manifold: And then deform space based on it: One could learn the vector field at fixed points (just take some fixed points from the training set to use as anchors) and interpolate in some manner. The vector field above is of the form: F ( x ) = ( v 0 f 0 ( x ) + v 1 f 1 ( x ) ) / ( 1 + f 0 ( x ) + f 1 ( x ) ) where v 0 and v 1 are vectors and f 0 ( x ) and f 1 ( x ) are n-dimensional Gaussians. This is inspired a bit by radial basis functions . K-Nearest Neighbor Layers I’ve also begun to think that linear separability may be a huge, and possibly unreasonable, amount to demand of a neural network. In some ways, it feels like the natural thing to do would be to use k-nearest neighbors (k-NN). However, k-NN’s success is greatly dependent on the representation it classifies data from, so one needs a good representation before k-NN can work well. As a first experiment, I trained some MNIST networks (two-layer convolutional nets, no dropout) that achieved ∼ 1 % test error. I then dropped the final softmax layer and used the k-NN algorithm. 
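That vector field can be sketched concretely. Here f0 and f1 are isotropic Gaussian bumps centered at two anchor points; the anchors, widths, and shift vectors below are illustrative assumptions, not learned values:

```python
import numpy as np

# Two anchor points with associated shift vectors v0, v1.
a0, v0 = np.array([-1.0, 0.0]), np.array([0.0,  0.5])
a1, v1 = np.array([ 1.0, 0.0]), np.array([0.0, -0.5])

def bump(x, center, width=1.0):
    """Unnormalized isotropic Gaussian f_i(x)."""
    return np.exp(-np.sum((x - center) ** 2) / (2 * width ** 2))

def F(x):
    """F(x) = (v0 f0(x) + v1 f1(x)) / (1 + f0(x) + f1(x))."""
    f0, f1 = bump(x, a0), bump(x, a1)
    return (v0 * f0 + v1 * f1) / (1 + f0 + f1)

def deform(x, t=1.0):
    """Shift a point along the vector field, deforming space around the anchors."""
    return x + t * F(x)

print(deform(np.array([-1.0, 0.0])))  # pushed upward, toward v0
```

In a trainable layer, the anchors and shift vectors would be parameters; the 1 in the denominator keeps the field bounded far from every anchor.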
I was able to consistently achieve a reduction in test error of 0.1-0.2%. Still, this doesn’t quite feel like the right thing. The network is still trying to do linear classification, but since we use k-NN at test time, it’s able to recover a bit from mistakes it made. k-NN is differentiable with respect to the representation it’s acting on, because of the 1/distance weighting. As such, we can train a network directly for k-NN classification. This can be thought of as a kind of “nearest neighbor” layer that acts as an alternative to softmax. We don’t want to feedforward our entire training set for each mini-batch because that would be very computationally expensive. I think a nice approach is to classify each element of the mini-batch based on the classes of other elements of the mini-batch, giving each one a weight of 1/(distance from classification target). 9 Sadly, even with sophisticated architecture, using k-NN only gets down to 4-5% test error – and using simpler architectures gets worse results. However, I’ve put very little effort into playing with hyper-parameters. Still, I really aesthetically like this approach, because it seems like what we’re “asking” the network to do is much more reasonable. We want points of the same manifold to be closer than points of others, as opposed to the manifolds being separable by a hyperplane. This should correspond to inflating the space between manifolds for different categories and contracting the individual manifolds. It feels like simplification. Conclusion Topological properties of data, such as links, may make it impossible to linearly separate classes using low-dimensional networks, regardless of depth. Even in cases where it is technically possible, such as spirals, it can be very challenging to do so. To accurately classify data with neural networks, wide layers are sometimes necessary. 
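The in-batch nearest-neighbor classification described above can be sketched in NumPy (forward pass only; a real layer would backpropagate through the distances, and the 1/distance weighting follows the text):

```python
import numpy as np

def soft_knn_predict(reps, labels, n_classes, eps=1e-8):
    """Classify each mini-batch element from the OTHER elements' labels,
    weighting each neighbor's vote by 1/distance in representation space."""
    d = np.linalg.norm(reps[:, None, :] - reps[None, :, :], axis=-1)
    w = 1.0 / (d + eps)             # 1/distance voting weights
    np.fill_diagonal(w, 0.0)        # an element never votes for itself
    onehot = np.eye(n_classes)[labels]
    scores = w @ onehot             # summed neighbor weights per class
    return scores.argmax(axis=1)

# Two tight clusters in representation space, one per class.
reps = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array([0, 0, 1, 1])
print(soft_knn_predict(reps, labels, 2))  # -> [0 0 1 1]
```

Because the weights are smooth in the representations, the same scores could feed a loss and be trained end to end, as the text proposes.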
Further, traditional neural network layers do not seem to be very good at representing important manipulations of manifolds; even if we were to cleverly set weights by hand, it would be challenging to compactly represent the transformations we want. New layers, specifically motivated by the manifold perspective of machine learning, may be useful supplements. (This is a developing research project. It’s posted as an experiment in doing research openly. I would be delighted to have your feedback on these ideas: you can comment inline or at the end. For typos, technical errors, or clarifications you would like to see added, you are encouraged to make a pull request on GitHub.) Acknowledgments Thank you to Yoshua Bengio, Michael Nielsen, Dario Amodei, Eliana Lorch, Jacob Steinhardt, and Tamsyn Waterhouse for their comments and encouragement. This seems to have really kicked off with Krizhevsky et al. (2012) , who put together a lot of different pieces to achieve outstanding results. Since then there’s been a lot of other exciting work. ↩ These representations, hopefully, make the data “nicer” for the network to classify. There has been a lot of work exploring representations recently. Perhaps the most fascinating has been in Natural Language Processing: the representations we learn of words, called word embeddings, have interesting properties. See Mikolov et al. (2013) , Turian et al. (2010) , and, Richard Socher’s work . To give you a quick flavor, there is a very nice visualization associated with the Turian paper. ↩ A lot of the natural transformations you might want to perform on an image, like translating or scaling an object in it, or changing the lighting, would form continuous curves in image space if you performed them continuously. ↩ Carlsson et al. found that local patches of images form a Klein bottle. ↩ G L n ( R ) is the set of invertible n × n matrices on the reals, formally called the general linear group of degree n . 
↩ This result is mentioned in Wikipedia’s subsection on Isotopy versions . ↩ See Szegedy et al. , where they are able to modify data samples and find slight modifications that cause some of the best image classification neural networks to misclassify the data. It’s quite troubling. ↩ Contractive penalties were introduced in contractive autoencoders. See Rifai et al. (2011) . ↩ I used a slightly less elegant, but roughly equivalent algorithm because it was more practical to implement in Theano: feedforward two different batches at the same time, and classify them based on each other. ↩ Source: http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
2010_Deep big simple neural nets excel on hand-written digit recognition First, see this paper: 2003_Best practices for convolutional neural networks applied to visual document analysis. Its central idea: expanding the training set with elastic deformations and affine transformations. The central idea of the 2010 paper is to enlarge the dataset with affine transformations and elastic deformations, and then use GPU acceleration to train large but simple neural networks, which yields strong results. Experimental results: (figures not reproduced in this copy). Conclusions: 1. A convolutional network is better than an MLP. 2. Elastic deformations have a considerable impact on performance, both for the 2-layer MLP and for the convolutional network. 3. Cross-entropy seems better and faster than mean squared error (MSE).
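Elastic deformation, the augmentation both papers rely on, displaces each pixel by a smoothed random field. A sketch in the spirit of Simard et al. (the α and σ values are typical choices for MNIST-sized images, not taken from the papers):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, alpha=34.0, sigma=4.0, seed=0):
    """Displace each pixel by a Gaussian-smoothed random displacement field.
    sigma controls the smoothness of the field, alpha its magnitude."""
    rng = np.random.default_rng(seed)
    shape = image.shape
    dx = gaussian_filter(rng.uniform(-1, 1, shape), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, shape), sigma) * alpha
    y, x = np.meshgrid(np.arange(shape[0]), np.arange(shape[1]), indexing="ij")
    coords = np.stack([y + dy, x + dx])
    return map_coordinates(image, coords, order=1, mode="reflect")

digit = np.zeros((28, 28))
digit[10:18, 12:16] = 1.0          # toy "stroke" standing in for a digit
warped = elastic_deform(digit)     # a plausible new training example
print(digit.shape == warped.shape)  # True
```

Applying this with fresh random fields each epoch effectively multiplies the size of the training set.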
ICCV_2009_What is the best multi-stage architecture for object recognition? Personally I find this a very good analytical paper: the ideas are easy to follow and the experimental validation is thorough. It studies three questions: 1. What effect does the non-linearity following the filters have on recognition accuracy? 2. What are the effects of supervised versus unsupervised learning, and of hard-wired versus random filters? 3. Is a two-stage architecture better than a single-stage one? Using the PSD unsupervised learning method, the paper compares combinations of the following components: a filter bank layer, a rectification layer, a local contrast normalization layer, and a max-pooling or subsampling (average-pooling) layer. Four architectures are compared (the layer-combination notation is not reproduced in this copy). Training methods: Random Features and Supervised Classifier: R and RR; Unsupervised Features, Supervised Classifier: U and UU; Random Features, Global Supervised Refinement: R+ and R+R+; Unsupervised Features, Global Supervised Refinement: U+ and U+U+. Dataset: Caltech-101. Comparison results: (figure not reproduced). On the NORB dataset: (figure not reproduced).
NIPS_2007_Sparse deep belief net model for visual area V2 This paper is mainly about the sparse DBN. Much prior work compares what learning algorithms produce with area V1 of the visual cortex, but little compares with deeper stages of the visual system such as V2 and V4. This paper quantitatively compares the features learned by a sparse DBN with those of V2; the V2 data are taken from: M. Ito and H. Komatsu. Representation of angles embedded within contour stimuli in area V2 of macaque monkeys. The Journal of Neuroscience, 24(13):3313–3324, 2004. 1. Introduction J. H. van Hateren and A. van der Schaaf. Independent component filters of natural images compared with simple cells in primary visual cortex. Proc. R. Soc. Lond. B, 265:359–366, 1998. That study showed that the filters ICA learns from natural images closely resemble the localized receptive fields of simple cells in V1. 2. Biological comparison 2.1 Features in early visual cortex: area V1. The receptive fields of V1 simple cells are localized, oriented, bandpass filters that resemble Gabor filters. 2.2 Features in visual cortex area V2 J. B. Levitt, D. C. Kiper, and J. A. Movshon. Receptive fields and functional architecture of macaque V2. Journal of Neurophysiology, 71(6):2517–2542, 1994. That work suggests that area V2 may serve as a place where different channels of visual information are integrated. The paper then reviews the analysis of V2 cell selectivity from the study cited in Section 1. 3. Algorithm 3.1 Sparse RBM. The Gaussian RBM energy function and the conditional probability distribution (a Gaussian density) are given (equations not reproduced in this copy). Adding a sparsity penalty, the final optimization problem becomes a regularized maximum-likelihood problem whose penalty involves the conditional expectation of the hidden units given the data, a regularization constant, and a constant p controlling the degree of sparsity. 3.2 Learning deep networks using the sparse RBM. Following the DBN recipe, the paper learns a network with two hidden layers. 4. Visualization 4.1 Learning strokes from handwritten digits. The data are first reduced to 69 dimensions with PCA, then a 69-200 architecture is trained. 4.2 Learning from natural images. Using natural images from http://hlab.phys.rug.nl/imlib/index.html, 100,000 14×14 patches are extracted from 2,000 images; with 200 patches per mini-batch, a 196-400 architecture learns features resembling V1. 4.3 Learning a two-layer model of natural images using sparse RBMs 5. Evaluation experiments
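The equations dropped from this copy can be reconstructed from the paper (Lee, Ekanadham & Ng, 2008). Modulo notation and scaling conventions, the Gaussian RBM energy and the sparsity-penalized objective are:

```latex
% Gaussian RBM energy (real-valued visible units v, binary hidden units h):
E(\mathbf{v}, \mathbf{h}) = \frac{1}{2\sigma^2}\sum_i (v_i - b_i)^2
  - \frac{1}{\sigma^2}\sum_{i,j} v_i w_{ij} h_j - \sum_j c_j h_j

% Sparse RBM objective: negative log-likelihood plus a penalty pulling each
% hidden unit's mean activation toward the target sparsity level p:
\min_{W,\,b,\,c}\; -\sum_{l=1}^{m} \log \sum_{\mathbf{h}} P(\mathbf{v}^{(l)}, \mathbf{h})
  + \lambda \sum_j \Big( p - \frac{1}{m}\sum_{l=1}^{m}
      \mathbb{E}\big[h_j \mid \mathbf{v}^{(l)}\big] \Big)^2
```

Here λ is the regularization constant and E[h_j | v^(l)] is the conditional expectation given the data mentioned in the notes above.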
This ECCV 2010 tutorial was given by Kai Yu and Andrew Ng; below I collect some highlights from the slides for reference. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% The quality of visual features is crucial for a wide range of computer vision topics, e.g., scene classification, object recognition, and object detection, which are very popular in recent computer vision venues. All these image classification tasks have traditionally relied on hand-crafted features to try to capture the essence of different visual patterns. Fundamentally, a long-term goal in AI research is to build intelligent systems that can automatically learn meaningful feature representations from a massive amount of image data. We believe a comprehensive coverage of the latest advances on image feature learning will be of broad interest to ECCV attendees. The primary objective of this tutorial is to introduce a paradigm of feature learning from unlabeled images, with an emphasis on applications to supervised image classification. We provide a comprehensive coverage of recently developed algorithms for learning powerful sparse nonlinear features, and showcase their superior performance on a number of challenging image classification benchmarks, including Caltech101, PASCAL, and the recent large-scale problem ImageNet. Furthermore, we describe deep learning and a variety of deep learning algorithms, which learn rich feature hierarchies from unlabeled data and can capture complex invariance in visual patterns. 1. Introduction Where do we get the low-level representation from? 2. State-of-the-art Image Classification Methods (1) Features (2) Discriminative Methods a. bag of words. Issue: spatial information is lost. b. Spatial Pyramid Pooling (3) Generative Model 3. 
Image Classification Using Sparse Coding. Processing in V1 resembles a Gabor wavelet transform, performing edge detection. The rough idea of sparse coding is to find a set of basis vectors such that all input data can be expressed as linear combinations of them with coefficients that are mostly zero, hence "sparse". The method assumes that edges are the most basic elements of a scene, and it yields a representation that is more compact and higher-level than raw pixels. Main steps: (see slides). On its own this still does not match SIFT; three improvements help, and combining with SIFT means using SIFT descriptors as the input data. Compared with K-means, sparse coding turns out to be a soft version of K-means: whenever K-means would be used to build a dictionary, sparse coding improves the result. 4. Advanced Topics on Image Classification Using Sparse Coding (1) Why does sparse coding help classification? A topic-model view of sparse coding; a geometric view of sparse coding. The slides then present SC experiments on MNIST: when SC reaches its minimum error, the learned basis vectors look like digits. The lesson: studying the geometric structure of the data may help classification. Local Coordinate Coding. Applications: (2) Recent Advances in Sparse Coding for Image Classification 5. Learning Feature Hierarchies and Deep Learning
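The sparse coding objective sketched above — express each input x as Dα with a sparse α, i.e. minimize ½‖x − Dα‖² + λ‖α‖₁ — can be solved with a few iterations of ISTA (iterative soft-thresholding). A minimal sketch; the dictionary and λ are illustrative, not from the tutorial:

```python
import numpy as np

def ista(x, D, lam=0.1, n_iter=200):
    """Solve min_a 0.5*||x - D a||^2 + lam*||a||_1 by iterative
    soft-thresholding: gradient step on the quadratic term, then shrink."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)
        z = a - grad / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return a

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)             # unit-norm basis vectors
x = 2.0 * D[:, 3] - 1.5 * D[:, 17]         # signal built from two basis vectors
a = ista(x, D)
print(np.count_nonzero(np.abs(a) > 1e-3))  # most coefficients are driven to zero
```

The soft-threshold step is what makes the code sparse: it is exactly a "soft" assignment, which is why the slides describe sparse coding as a soft version of K-means.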
Everyone studying DL knows this paper, one of the three breakthrough deep learning papers of 2006. Its main idea is greedy layer-wise training. Excerpts and translated notes follow. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Problem: To train deep networks, gradient-based optimization starting from random initialization appears to often get stuck in poor solutions. 1. Introduction For shallow-architecture models such as SVMs, with d inputs one may need on the order of 2^d samples to train the model adequately; as d grows this becomes the curse of dimensionality. Multi-layer neural networks can avoid this problem: boolean functions (such as the function that computes the multiplication of two numbers from their d-bit representation) expressible by O(log d) layers of combinatorial logic with O(d) elements in each layer may require O(2^d) elements when expressed with only 2 layers. Three important aspects: 1. Pre-training one layer at a time in a greedy way. 2. Using unsupervised learning at each layer in order to preserve information from the inputs. 3. Fine-tuning the whole network with respect to the ultimate criterion of interest. 2. DBN 2.1 RBM 2.2 Gibbs Markov chain and log-likelihood gradient in an RBM. RBMUpdate algorithm. 2.3 Greedy layer-wise training of a DBN. Each time an RBM finishes training, another RBM is stacked on top of it, taking the output of the RBM below as its input. The posterior distribution of the hidden layer of the lower RBM is used as the posterior distribution of the visible layer in the DBN. The motivation for greedy learning is that a partial DBN represents the lowest layer better than a single RBM does. TrainUnsupervisedDBN (where i is the layer index). 2.4 Fine-tuning: the wake-sleep algorithm or a mean-field approximation. TrainSupervisedDBN, where C is squared error or cross-entropy. DBNSupervisedFineTuning 3. Extension to continuous-valued inputs. Normalize the input vector into the interval (0,1) and treat each value as the probability that a binary unit equals 1, then train with the usual RBM procedure. This works for gray-scale pixels but may fail for other kinds of input. 4. Understanding why the layer-wise strategy works. TrainGreedyAutoEncodingDeepNet (n is the number of units per layer). TrainGreedySupervisedDeepNet. Experiment 2 shows that greedy unsupervised layer-wise pre-training gives much better results than the standard way to train a deep network (with no greedy pre-training) or a shallow network, and that, without pre-training, deep networks tend to perform worse than shallow networks. 
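The RBMUpdate step referenced in these notes is, in the paper, contrastive divergence with one Gibbs step (CD-1). A sketch for binary units (the learning rate and layer sizes are arbitrary; a real implementation would loop this over mini-batches):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, b, c, lr=0.1):
    """One CD-1 step for a binary RBM: up, down, up, then nudge the
    parameters toward the data statistics and away from the model's."""
    ph0 = sigmoid(v0 @ W + c)                         # P(h=1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample hidden state
    pv1 = sigmoid(h0 @ W.T + b)                       # reconstruction P(v=1 | h0)
    ph1 = sigmoid(pv1 @ W + c)                        # hidden given reconstruction
    W += lr * (np.outer(v0, ph0) - np.outer(pv1, ph1))
    b += lr * (v0 - pv1)
    c += lr * (ph0 - ph1)
    return W, b, c

n_vis, n_hid = 6, 4
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b, c = np.zeros(n_vis), np.zeros(n_hid)
v = np.array([1.0, 1, 0, 0, 1, 0])
W, b, c = cd1_update(v, W, b, c)
print(W.shape)  # (6, 4)
```

Greedy layer-wise training then repeats this at each level, feeding the hidden activations of a trained RBM upward as the visible data for the next one.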
Likewise, supervised pre-training performs worse than unsupervised pre-training because it is too greedy; a possible explanation is that the hidden representation it learns throws away some information about the target. Experiment 3 restricts the top layer to only 20 units. In Experiment 2 the training errors were all small, so the benefit of pre-training for optimization was hard to see: even without good initialization, the bottom and top layers together still form a standard shallow network that can retain enough input information to fit the training set, though this does not help generalization. The experimental results bear this hypothesis out. Continuous training of all layers of a DBN: instead of adding one layer at a time and choosing a number of training iterations for each, we would like to train the whole DBN continuously. To achieve this it is sufficient to insert a line in TrainUnsupervisedDBN, so that RBMupdate is called on all the layers and the stochastic hidden values are propagated all the way up. The advantage is that we can now have a single stopping criterion (for the whole network). The paper does not give further details. 5. Dealing with uncooperative input distributions. When the input distribution is only weakly related to the target, e.g. x ~ p(x) with p Gaussian and target y = f(x) + noise with f a sinusoid, there is no particular relation between p and f, and unsupervised greedy pre-training cannot help. In that case each layer can be trained with a mixed rule combining the unsupervised and supervised criteria. TrainPartiallySupervisedLayer
This article is another survey by Bengio, from 2013 — newer in content, with a slightly different emphasis from the 2009 survey. As before, fully understanding it requires a solid foundation and plenty of time; many parts are still hazy to me. I post my translation and excerpts here for discussion; corrections are welcome.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1. Introduction

2. Why should we care about learning representations?
Although depth is an important part of the story, many other priors are interesting and can be conveniently captured by a learner when the learning problem is cast as one of learning a representation.
Applications in several areas follow:
Speech Recognition and Signal Processing
Object Recognition — recent MNIST records: Ciresan et al. (2012) 0.27%, Rifai et al. (2011c) 0.81%; ImageNet: Krizhevsky et al. (2012) 15.3%
Natural Language Processing
Multi-Task and Transfer Learning, Domain Adaptation

3. What makes a representation good?
3.1 Priors for representation learning in AI
a. Smoothness
b. Multiple explanatory factors: the data distribution is generated by a set of underlying factors
c. A hierarchical organization of explanatory factors: a hierarchy over the variables
d. Semi-supervised learning
e. Shared factors across tasks
f. Manifolds: manifold learning, used mainly in research on auto-encoders
g. Natural clustering
h. Temporal and spatial coherence
i. Sparsity
All of these priors can help a learner learn representations.
3.2 Smoothness and the Curse of Dimensionality
Local non-parametric learners such as kernel machines achieve only local generalization: they assume the target function is sufficiently smooth, which is not enough to defeat the curse of dimensionality. Such smoothness-based learners and linear models can still be useful on top of a learned representation; in fact, learning a representation to feed a kernel machine amounts to learning a kernel.
3.3 Distributed representations
A one-hot representation — as in traditional clustering algorithms, Gaussian mixtures, nearest-neighbor methods, or Gaussian SVMs — needs O(N) parameters to distinguish O(N) input regions. In contrast, RBMs, sparse coding, auto-encoders, and multi-layer neural networks can distinguish an exponential number of input regions (up to O(2^N)) with O(N) parameters. These are distributed representations.
3.4 Depth and abstraction
Depth brings two clear benefits: it promotes the re-use of features, and it can produce increasingly abstract features in the higher layers.
Re-use: an important property of deep circuits is the number of paths, which grows exponentially with depth. The depth of a circuit changes with the definition of what each node computes; typical computational elements include weighted sums, products, artificial neurons, kernel evaluations, or logic gates.
Abstraction and invariance: more abstract concepts can be constructed from less abstract ones. In a CNN, for example, this abstraction is built through pooling; more abstract concepts are generally invariant to most local changes of the input.
3.5 Disentangling factors of variation
The most robust feature learning disentangles as many factors as possible while discarding as little information as possible, somewhat like dimensionality reduction.
3.6 What are good criteria for learning representations?
In classification the objective is clearly to minimize misclassification, but representation learning has no such obvious criterion — an open question worth thinking about.
4. Building deep representations
2006 brought a breakthrough in feature learning. The central idea is greedy layer-wise unsupervised pre-training: learn a hierarchy of features one layer at a time, using unsupervised learning at each layer to compose the transformations learned so far. The single-layer scheme can also be used with supervised training; this works less well than unsupervised pre-training, but still better than no pre-training at all.
Several ways to combine single-layer networks into a deep supervised model:
- Stack RBMs into a DBN. How to estimate and optimize the likelihood of the resulting generative model is still unclear; one option is the wake-sleep algorithm.
- Combine the RBM parameters into a DBM, essentially by halving them.
- Stack RBMs or auto-encoders into a deep auto-encoder.
- Another way to train a deep architecture is the iterative construction of a free energy function.

5. Single-layer learning modules
Feature learning has two main lineages: one rooted in probabilistic graphical models (PGMs), the other in neural networks (NNs). Fundamentally, the difference is whether each layer is described as a PGM or as a computational graph — in short, whether the hidden units are latent random variables or computational nodes. The RBM sits on the PGM side, the auto-encoder on the NN side. Training an RBM by score matching is essentially equivalent to the auto-encoder's regularized reconstruction objective.
Three interpretations of PCA:
a. It is related to probabilistic models such as probabilistic PCA, factor analysis, and the traditional multivariate Gaussian.
b. It is essentially the same as a linear auto-encoder.
c. It can be seen as a form of linear manifold learning.
But linear features have limited expressive power: they cannot be stacked to obtain more abstract representations, since a composition of linear operations is still a linear operation.

6. Probabilistic Models
Learning is conceived in terms of estimating a set of model parameters that (locally) maximizes the likelihood of the training data with respect to the distribution over these latent variables.
6.1 Directed Graphical Models
6.1.1 Explaining away: factors that are a priori independent become dependent once an observation is given; as a result, the posterior P(h|x) becomes intractable even when h is discrete.
6.1.2 Probabilistic interpretation of PCA
6.1.3 Sparse coding
Sparse coding differs from PCA in adding a penalty that enforces sparsity; a Laplace prior (equivalent to an L1 penalty) yields sparse representations. Compared with RBMs and auto-encoders, inference in sparse coding involves an inner optimization loop, which adds computational cost; the code for each example is a free variable, so the implicit encoder is in that sense non-parametric.
6.2 Undirected Graphical Models
Undirected graphical models are also called Markov random fields (MRFs). A special form is the Boltzmann machine, with energy function (reconstructed here in the paper's standard notation):
Energy(x,h) = -(1/2) x'Ux - (1/2) h'Vh - x'Wh - b'x - d'h
6.2.1 RBM
6.3 Generalization of the RBM to real-valued data
The simplest approach is the Gaussian RBM, but Ranzato (2010) gives better models of natural images: the mean and covariance RBM (mcRBM), a combination of a covariance RBM and a GRBM; the mPoT model is also introduced. Courville (2011) proposed the ssRBM, which performs very well on CIFAR-10. All three model real-valued data with hidden units that encode not only the conditional mean of the data but also its conditional covariance; beyond the training procedure, they differ in how they encode that conditional covariance.
6.4 RBM parameter estimation
6.4.1 Contrastive Divergence
6.4.2 Stochastic Maximum Likelihood (SML/PCD)
At each gradient update, instead of restarting the Gibbs chain of the positive phase from the data as in CD, the chain continues from its state after the previous update. But as the weights grow, the estimated distribution develops sharper modes and the Gibbs chain takes a long time to mix; Tieleman (2009) proposed fast-weight PCD (FPCD) to address this.
6.4.3 Pseudolikelihood, Ratio-matching, and Other Inductive Principles
Marlin (2010) compares CD and SML.

7. Direct Encoding: Learning a Parametric Map from Input to Representation
7.1 Auto-Encoders
The feature-extraction function and reconstruction loss vary with the application domain: for unbounded inputs, a linear decoder with squared reconstruction error; for inputs in (0,1), a sigmoid output; for binary inputs, the binary cross-entropy loss.
A linear encoder and decoder learn the same subspace as PCA. This also holds with a sigmoid nonlinearity in the encoder, but not when the weights are tied. If both encoder and decoder use a sigmoid nonlinearity, the auto-encoder learns features similar to a binary RBM's; one difference is that the RBM uses a single weight matrix while the AE may use different matrices for encoding and decoding. In practice the AE's advantage is that it defines a simple, tractable optimization objective that can be used to monitor progress.
7.2 Regularized Auto-encoders
The traditional AE, like PCA, is used for dimensionality reduction and therefore relies on a bottleneck. RBMs and sparse coding instead favor over-complete representations, which for an AE can make the task trivial (simply copying the input into the features). Regularization is therefore needed; it can be seen as making the representation insensitive to variations of the input.
7.2.1 Sparse Auto-encoders
There are many variants that directly penalize the hidden units, but no paper has compared which works best. Although the L1 penalty seems most natural, few SAE papers use it; a related rule is the Student-t penalty. See UFLDL for details.
7.2.2 Denoising Auto-encoders
Vincent (2010) considers adding isotropic Gaussian noise and salt-and-pepper noise to gray-scale images.
7.2.3 Contractive Auto-encoders
Rifai (2011) proposed the CAE with the same motivation as the DAE — learning robust features — but via an analytic contractive penalty term. With the encoder Jacobian J(x) = dh/dx, the CAE objective is the reconstruction error plus lambda * ||J(x)||_F^2, where lambda is a hyperparameter controlling the strength of the regularizer. For a sigmoid encoder this penalty has a closed form:
||J(x)||_F^2 = sum_j (h_j (1 - h_j))^2 ||W_j||^2
Three points distinguish the CAE from the DAE, although the two are closely related: a DAE with small noise can be seen as a kind of CAE whose penalty acts on the whole reconstruction function rather than on the encoder alone. Rifai also proposed CAE+H, which adds a third term pushing h(x) and h(x + epsilon) close together.
Note that CAE representations tend to be saturated rather than sparse: most hidden units sit near their extreme values and have tiny derivatives; the few unsaturated units are sensitive to the input and, together with their weight vectors, form a basis for the local variations of the example. In the CAE the weight matrix is tied.
7.2.4 Predictive Sparse Decomposition (PSD)
Kavukcuoglu et al. (2008) proposed PSD, a variant of sparse coding and auto-encoders. (Its objective is missing from my notes.) PSD can be seen as an approximation of sparse coding with one extra constraint: the sparse code must be well approximated by a parametric encoder.

8. Representation Learning as Manifold Learning
PCA is a linear manifold-learning algorithm.
8.1 Learning a parametric mapping based on a neighborhood graph
8.2 Learning a non-linear manifold through a coding scheme
8.3 Leveraging the modeled tangent spaces
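For a sigmoid encoder the contractive penalty has the closed form quoted above, so no numerical Jacobian is needed at training time. A small check of that formula against finite differences (the helper name is mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(x, W, b):
    """Frobenius norm of the encoder Jacobian for h = sigmoid(W x + b):
    ||J(x)||_F^2 = sum_j (h_j (1 - h_j))^2 * ||W_j||^2,
    the analytic penalty the CAE adds to the reconstruction loss."""
    h = sigmoid(W @ x + b)
    return float(np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1)))
```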
9. Connections between probabilistic and direct encoding models
The standard probabilistic framework decomposes the training criterion into two parts: the log-likelihood log P(x|h) and the prior log P(h).
9.1 PSD: a probabilistic interpretation
PSD is a representation-learning algorithm intermediate between probabilistic models and direct encoding methods. The RBM is also such a hybrid, because of the restriction on connections among hidden units — a property the DBM does not share.
9.2 Regularized Auto-encoders Capture Local Statistics of the Density
The training criteria of regularized AEs differ from standard likelihood because they require a kind of prior; they are therefore data-dependent.
(Vincent 2011) A connection between score matching and denoising autoencoders
(Bengio 2012) Implicit density estimation by local moment matching to sample from auto-encoders
(Rifai 2012) A generative process for sampling contractive auto-encoders
The regularizer asks the learned representation to be as insensitive to the input as possible, while minimizing reconstruction error on the training set forces the representation to keep enough information to distinguish the examples.
9.3 Learning Approximate Inference
9.4 Sampling Challenge
MCMC sampling becomes inefficient during learning because the models of the learned distribution become sharper, making mixing between modes very slow. Bengio (2012) shows that deep representations can help mixing.
9.5 Evaluating and Monitoring Performance
One usually adds a simple classifier on top of the learned features, but the final classifier can be computationally demanding (e.g., fine-tuning typically needs many more iterations than feature learning), and, more importantly, it may give an incomplete evaluation of the features.
For AEs and sparse coding, the reconstruction error on a test set can be monitored. For RBMs and some BMs, Murray (2008) proposed Annealed Importance Sampling to estimate the partition function; an alternative for RBMs (Desjardins 2011) tracks the partition function during training, which supports early stopping and reduces the cost of ordinary AIS.
10. Global Training of Deep Models
10.1 On the Training of Deep Architectures
The first realization was single-layer unsupervised or supervised pre-training. Erhan (2010) explains why single-layer unsupervised pre-training helps; this also connects to algorithms that guide the intermediate representations, such as Semi-supervised Embedding (Weston 2008).
In Erhan (2010), the effect of unsupervised pre-training is analyzed as both a regularization effect and an optimization effect. The former can be demonstrated experimentally with stacked RBMs or AEs; the latter is hard to isolate, because whether or not the lower-level features are useful, the top two layers alone can overfit the training set.
Changing the numerical conditions of the optimization has a large effect on training deep architectures, e.g., changing the initialization range and the choice of nonlinearity (Glorot 2010). The vanishing-gradient problem motivated research on second-order methods, in particular Hessian-free optimization (Martens 2010). Cho (2011) proposed an adaptive learning rate for RBMs. Glorot (2011) shows that sparse rectifying units also affect training and the resulting performance.
Ciresan (2010) shows that with plenty of labeled data, a sensible initialization, and a good choice of nonlinearity, purely supervised training of deep networks can succeed. This reinforces the hypothesis that when enough labeled examples are available, unsupervised pre-training acts only as a prior. Krizhevsky (2012) combines many techniques; future work should determine which elements matter most and how they generalize to other tasks.
10.2 Joint Training of Deep Boltzmann Machines
Salakhutdinov and Hinton (2009) proposed the DBM. For two hidden layers its energy function is (reconstructed in the standard notation, biases omitted):
E(v, h1, h2) = -v'W h1 - h1'V h2
i.e., a Boltzmann machine with U = 0 and a sparse, layered connectivity structure between V and W.
10.2.1 Mean-field approximate inference
Because of the interactions among hidden units across layers, the posterior becomes intractable; the paper uses a mean-field approximation. For a DBM with two hidden layers, we approximate the posterior P(h1, h2 | v) by a factorized Q(h1, h2) chosen to minimize KL(Q || P).
10.2.2 Training Deep Boltzmann Machines
The main difference from RBM training is that one does not maximize the likelihood directly but instead chooses parameters to maximize a lower bound; see the paper for details.

11. Building-in Invariance
11.1 Augmenting the dataset with known input deformations
Ciresan (2010) enlarges the MNIST training set with small affine transformations, with very good results.
11.2 Convolution and Pooling
Le Roux et al. (2008a) studied the 2D topology of images.
On the parallels between this structure and object recognition in the mammalian brain:
Serre et al., 2007: Robust object recognition with cortex-like mechanisms
DiCarlo et al., 2012: How does the brain solve visual object recognition?
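The two fixed-point equations of mean-field inference for a two-hidden-layer DBM can be sketched directly. A sketch of the standard updates under the bias-free energy above (the function name is mine; Salakhutdinov & Hinton's version also includes bias terms):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_field_posterior(v, W, V, n_iters=50):
    """Mean-field inference for a 2-hidden-layer DBM with energy
    E(v,h1,h2) = -v'W h1 - h1'V h2 (biases omitted).
    Iterates the fixed-point equations
        mu1 = sigmoid(W'v + V mu2),  mu2 = sigmoid(V' mu1),
    which minimize KL(Q || P) over a factorized Q."""
    mu1 = np.full(W.shape[1], 0.5)
    mu2 = np.full(V.shape[1], 0.5)
    for _ in range(n_iters):
        mu1 = sigmoid(W.T @ v + V @ mu2)   # layer 1 sees data below and mu2 above
        mu2 = sigmoid(V.T @ mu1)           # layer 2 sees only mu1 below
    return mu1, mu2
```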
The importance of pooling:
Boureau 2010: A theoretical analysis of feature pooling in vision algorithms
Boureau 2011: Ask the locals: multi-way local pooling for image recognition
A successful variant of pooling is L2 pooling:
Le 2010: Tiled convolutional neural networks
Kavukcuoglu 2009: Learning invariant features through topographic filter maps
Kavukcuoglu 2010: Learning convolutional feature hierarchies for visual recognition
Patch-based training:
(Coates and Ng, 2011): The importance of encoding versus training with sparse coding and vector quantization. This paper compared several feature learners with patch-based training and reached state-of-the-art results on several classification benchmarks. It finds results similar to simple k-means clustering, perhaps because patches are inherently low-dimensional — an edge is typically about 6x6 — so a distributed representation is not needed.
Convolutional and tiled-convolutional training:
Convolutional RBMs:
Desjardins and Bengio, 2008: Empirical evaluation of convolutional RBMs for vision
Lee, 2009: Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations
Taylor, 2010: Convolutional learning of spatio-temporal features
A convolutional version of sparse coding:
Zeiler 2010: Deconvolutional networks
Tiled convolutional networks:
Gregor and LeCun, 2010: Emergence of complex-like cells in a temporal product network with local receptive fields
Le, 2010: Tiled convolutional neural networks
Alternatives to pooling (scattering operators):
Mallat, 2011: Group invariant scattering
Bruna and Mallat, 2011: Classification with scattering operators
11.3 Temporal Coherence and Slow Features
11.4 Algorithms to Disentangle Factors of Variation
Hinton, 2011: Transforming auto-encoders — takes advantage of some of the factors of variation known to exist in the data.
Erhan, 2010: Understanding representations learned in deep architectures — experiments show that the second layer of a DBN tends to be more invariant than the first.
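L2 pooling, cited above, replaces the max or mean of a pooling region with the square root of the sum of squares of its units. A minimal sketch over non-overlapping k x k regions (the function name is mine):

```python
import numpy as np

def l2_pool(feature_map, k=2):
    """L2 pooling: each pooled unit is sqrt(sum of squares) of the
    units in its non-overlapping k x k region (contrast with max or
    average pooling)."""
    h, w = feature_map.shape
    fm = feature_map[: h - h % k, : w - w % k]     # crop to a multiple of k
    blocks = fm.reshape(h // k, k, w // k, k)
    return np.sqrt((blocks ** 2).sum(axis=(1, 3)))
```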
12. Conclusion
The article covers three families of representation-learning methods: probabilistic models, reconstruction-based algorithms, and geometric manifold-learning methods.
Practical guides and guidelines:
Hinton, 2010: A practical guide to training Restricted Boltzmann Machines
Bengio, 2012: Practical recommendations for gradient-based training of deep architectures
Snoek, 2012: Practical Bayesian Optimization of Machine Learning Algorithms
I just opened this blog, and it suddenly struck me that keeping everything in Evernote matters less than sharing it — along the way I might also get to exchange ideas with people in the same field. Consider this my first post. It is pasted over from Evernote, so the formatting may be messy; re-doing the layout would take too much time, so please bear with me.
This year I started studying deep learning, which has become quite hot in recent years. It is an entirely new field for me — essentially zero background — so this survey has been a great help, although understanding it requires not only a lot of time but also some grounding in neural networks. It is a long paper covering a wide range of topics, and there are many places I have not yet understood. What follows is only a summary and digest of the paper, without my own opinions mixed in; if any translation or understanding is wrong, please help point it out.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1. Introduction
We assume that the computational machinery necessary to express complex behaviors requires highly varying mathematical functions, i.e. mathematical functions that are highly non-linear in terms of raw sensory inputs, and display a very large number of variations. If a machine captured the factors that explain the statistical variations in the data, and how they interact to generate the kind of data we observe, we would be able to say that the machine understands those aspects of the world covered by these factors of variation.
1.1 How do we train deep architectures?
Automatically learning features at multiple levels of abstraction allows a system to learn complex functions mapping the input to the output directly from data, without depending completely on human-crafted features. Depth of architecture refers to the number of levels of composition of non-linear operations in the function learned. The mammal brain is organized in a deep architecture, with a given input percept represented at multiple levels of abstraction, each level corresponding to a different area of cortex. This is particularly clear in the primate visual system (Serre et al., 2007), with its sequence of processing stages: detection of edges, primitive shapes, and moving up to gradually more complex visual shapes.
Something that can be considered a breakthrough happened in 2006: DBNs, autoencoders... apparently exploiting the same principle: guiding the training of intermediate levels of representation using unsupervised learning, which can be performed locally at each level.
1.2 Intermediate Representations: Sharing Features and Abstractions Across Tasks
These algorithms can be seen as learning to transform one representation (the output of the previous stage) into another, at each step perhaps disentangling better the factors of variation underlying the data. The features at each level are not mutually independent; together they form a distributed representation: the information is not localized in a particular neuron but distributed across many. Representations in the brain are sparse: only about 1-4% of neurons are active at any one time.
Even though statistical efficiency is not necessarily poor when the number of tunable parameters is large, good generalization can be obtained only when adding some form of prior (e.g. that smaller values of the parameters are preferred). Exploiting the underlying commonalities between tasks and between the concepts they require has been the focus of research on multi-task learning. Consider a multi-task setting in which there are different outputs for different tasks, all obtained from a shared pool of high-level features.

2. Theoretical Advantages of Deep Architectures
This section covers the motivations for learning deep architectures and some interpretations of architectural depth. Some functions cannot be represented efficiently by shallow architectures, in terms of the number of tunable elements. For a fixed number of training examples, and short of other sources of knowledge injected in the learning algorithm, we would expect that compact representations of the target function would yield better generalization. Concretely, a function that can be represented by a k-layer architecture may require exponentially more computational elements when represented with only k-1 layers.
To formalize the notion of depth of architecture, one must introduce the notion of a set of computational elements. Theoretical results suggest that it is not the absolute number of levels that matters, but the number of levels relative to how many are required to represent efficiently the target function (with some choice of set of computational elements).
2.1 Computational Complexity
The basic result: if a function can be represented compactly by a deep architecture, it may need a very large architecture when represented by an insufficiently deep one. The paper gives the example of logic-gate circuits; these theorems neither prove that other functions (such as those needed for AI tasks) require deep architectures, nor that the limitations apply to other families of circuits. But they do raise the question of whether ordinary shallow networks can represent complex functions efficiently.
Results such as the above theorem also suggest that there might be no universally right depth: each function (i.e.
each task) might require a particular minimum depth (for a given set of computational elements).
2.2 Informal Arguments
We say that a function is highly-varying when a piecewise approximation (e.g., piecewise-constant or piecewise-linear) of that function would require a large number of pieces. A deep architecture is a composition of many operations, any of which could in principle be represented by a very large two-layer architecture. To conclude, a number of computational complexity results strongly suggest that functions that can be compactly represented with a depth-k architecture could require a very large number of elements in order to be represented by a shallower architecture.

3. Local vs Non-Local Generalization
3.1 The Limits of Matching Local Templates
Local estimators are ill-suited to learning highly-varying functions, even when those functions can be represented efficiently by a deep architecture. An estimator that is local in input space obtains good generalization for a new input x by mostly exploiting training examples in the neighborhood of x. Local estimators implicitly or explicitly partition the input space into regions, each of which needs its own parameters to represent the target function; when many regions are needed, the number of parameters grows accordingly.
Architectures based on local template matching can be seen as two-layer: the first layer is the template-matching layer, the second the classification layer. The canonical example is the kernel machine
f(x) = b + sum_i alpha_i K(x, x_i)
where b and the alpha_i form the second layer, and in the first layer the kernel K matches the input x against the training examples x_i. The best-known kernel machines are the SVM and the Gaussian process. Kernel machines generalize by exploiting the smoothness prior: the assumption that the target function is smooth or can be well approximated with a smooth function. Without prior knowledge about the task one cannot design a suitable kernel, which has motivated much research.
(Salakhutdinov and Hinton, 2008): a Gaussian-process kernel machine can be improved by using a DBN to learn the feature space. Learning algorithms for deep architectures can be seen as ways of learning a good feature space for a kernel machine.
Consider a target function that oscillates up and down along some direction v: for a Gaussian kernel machine, the amount of data required grows linearly with the number of oscillations to be learned. For a maximally varying function such as the parity function, the number of examples necessary to achieve some error rate with a Gaussian kernel machine is exponential in the input dimension.
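The two-layer "template matching" view of the kernel machine f(x) = b + sum_i alpha_i K(x, x_i) can be made concrete. Here alpha and b are taken as given (they would normally come from SVM or GP training), and the helper name is mine:

```python
import numpy as np

def rbf_kernel_machine(X_train, alpha, b, sigma):
    """Two-layer view of a kernel machine: layer 1 matches x against
    every stored training template x_i with a Gaussian kernel;
    layer 2 is just the weighted sum b + alpha . K."""
    def predict(x):
        K = np.exp(-np.sum((X_train - x) ** 2, axis=1) / (2 * sigma ** 2))
        return b + alpha @ K
    return predict
```

Note that the number of "templates" (and hence parameters) grows with the number of training examples — exactly the local-generalization cost the text describes.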
For a learner relying only on the prior that the target function is locally smooth, learning a function whose sign changes many times along one direction is difficult. For high-dimensional, complex tasks, if a curve has many variations and those variations are unrelated to one another, a local estimator may well be the best algorithm. But in AI we assume that the target function has underlying regularities, so we look for a more compact representation of its variations, which can lead to better generalization.
Most of these unsupervised and semi-supervised algorithms using local estimators rely on the neighborhood graph: a graph with one node per example and arcs between near neighbors.
A manifold-learning example follows: a set of images of the same digit 4, obtained by rotating and shrinking it, forms a low-dimensional manifold. Because the manifold is locally smooth, it can in principle be approximated locally by linear patches, each tangent to the manifold. But if the manifold bends strongly, the patches must be small, and their number grows exponentially.
Consider semi-supervised learning based on the neighborhood graph: one needs as many labeled examples as there are variations of interest, which fails when the decision surface varies too much. Theoretical analysis (Bengio et al., 2009) shows that for certain functions, the number of examples needed to reach a given error rate grows exponentially; empirical results show that the generalization of decision trees degrades as the number of variations increases.
Ensembles of trees: they add a third level to the architecture which allows the model to discriminate among a number of regions exponential in the number of parameters.
3.2 Learning Distributed Representations
A simple local representation of an integer i in {1, ..., N} is an N-bit vector r(i) with a single 1 and N-1 zeros; a distributed representation of the same integer is a vector of about log2(N) bits — a much more compact encoding. In a distributed representation the features are not mutually exclusive, although they may be statistically independent. For example, clustering is not distributed, because the clusters must be mutually exclusive, whereas PCA and ICA produce distributed representations. (I have not completely understood this point.)
Supervised learners such as multi-layer neural networks and unsupervised learners such as Boltzmann machines both learn distributed internal representations; their aim is to have the learning algorithm discover the features that make up a distributed representation.

4. Neural Networks for Deep Architectures
4.1 Multi-Layer Neural Networks
Each layer applies a nonlinearity (sigmoid or tanh) to an affine function of the previous layer's features, with biases b and weights W as parameters; learning updates the parameters to minimize the error between the last layer and the target. The output layer can also take other forms, e.g. softmax, which normalizes the activations into proportions so that output h_i estimates P(Y=i|x); in that case the negative conditional log-likelihood, -log P(Y=y|x), is commonly used as the loss.
4.2 The Challenge of Training Deep Neural Networks
Stochastic-gradient training of deep networks easily gets stuck in poor local optima: with random initialization, a deep network may end up worse than a shallow one. In 2006, Hinton and others began using pre-training to obtain better results. These works all discovered the greedy layer-wise unsupervised recipe: first train the first layer with an unsupervised algorithm, producing its initial weights; then use the first layer's output as input for training the second layer; and so on. Once all layers are trained, fine-tune with a supervised algorithm. What the RBM-based and auto-encoder-based variants have in common is a layer-local unsupervised criterion — the idea that injecting an unsupervised training signal at each layer may help to guide the parameters of that layer towards better regions in parameter space.
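The contrast in 3.2 between local one-hot codes (N parameters distinguish N regions) and distributed codes (N features distinguish exponentially many regions) can be illustrated by counting the distinct binary codes that N random linear-threshold features assign to a point cloud. A toy illustration; the names are mine:

```python
import numpy as np

def count_regions_distributed(points, n_features, seed=0):
    """Each feature is a random linear threshold unit; a point's code
    is its sign pattern across all features. The number of distinct
    codes is the number of input regions the distributed
    representation can tell apart (up to 2**n_features), whereas a
    one-hot code with n_features entries tells apart only n_features."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(points.shape[1], n_features))
    b = rng.normal(size=n_features)
    codes = (points @ W + b > 0)
    return len({tuple(c) for c in codes})
```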
In Weston et al. (2008), the neural networks are trained using pairs of examples (x, x~) which are either supposed to be "neighbors" (or of the same class) or not. A local training criterion is defined at each layer that pushes the intermediate representations hk(x) and hk(x~) either towards each other or away from each other, according to whether x and x~ are supposed to be neighbors or not. The same approach has been used in unsupervised manifold learning. Bergstra and Bengio (2010) exploit the temporal constancy of high-level abstractions to provide an unsupervised guide to intermediate layers: consecutive frames are likely to contain the same object.
Is this improvement due to better optimization or to better regularization? Erhan (2009)'s experiments show that for the same training error, the test error is lower with unsupervised pre-training. Unsupervised pre-training can be seen as a regularizer/prior: it constrains the parameter space. Bengio (2007)'s experiments suggest that poor tuning of the lower layers might be responsible for the worse results without pre-training. When the number of hidden units is increased, training error can be driven to zero. When the top hidden layer is constrained to be small, training and test error both degrade badly without pre-training: the top two layers can be seen as an ordinary two-layer network, and if the top layer is large enough, it can already fit the training set on its own. Pre-training is then needed so that the lower layers are better optimized and a small top layer can yield better generalization.
With enough top-layer units, training error can be low even when the lower layers are trained poorly, but generalization may then be worse than for a shallow network; low training error with high test error is what we call overfitting. Because of this effect, pre-training can also be seen as a data-dependent regularizer.
When the training set is small, unsupervised pre-training improves test error even though it increases training error. (Why? — presumably because, acting as a regularizer, it restricts what the lower layers can fit.) Replacing the top two layers with a Gaussian process or an SVM can lower the training error, but if the lower layers are not sufficiently optimized it still does not help generalization. Another way unsupervised pre-training can produce better generalization is as a regularizer: with unsupervised pre-training, the lower layers are constrained to capture regularities of the input distribution.
One way to reconcile the optimization and regularization viewpoints might be to consider the truly online setting.
In that setting, online gradient descent is a stochastic optimization procedure. If unsupervised pre-training were only a regularizer, then with an effectively infinite training set, networks with and without pre-training should converge to the same level. To test this, an "infinite MNIST" dataset was used (Loosli, Canu and Bottou, 2007); the pre-trained 3-layer network clearly converges to a lower error, which means pre-training is not merely a regularizer but also a way of finding a better minimum.
Why are the lower layers harder to optimize? The above suggests that the back-propagated gradient may be insufficient to move the parameters into another region, so they easily get stuck in apparent local minima: the gradient becomes less informative about the required changes in the parameters as we move back towards the lower layers, or the error function becomes too ill-conditioned for gradient descent to escape these apparent local minima.
4.3 Unsupervised Learning for Deep Architectures
Unsupervised learning finds a representation of the statistical regularities of the input. PCA and ICA may be unsuitable because they cannot handle the overcomplete case — more outputs than inputs. Moreover, stacking linear projections (e.g. two layers of PCA) is still a linear transformation and does not build a deep architecture.
Another motivation for studying unsupervised learning: it can decompose the problem into sub-problems, each corresponding to a different level of abstraction. The first layer can extract salient information, but because of its limited capacity these are only low-level features; the next layer, taking those low-level features as input, can extract slightly higher-level ones. Training this stack with plain gradient descent, however, again runs into the vanishing-gradient problem.
4.4 Deep Generative Architectures
Besides pre-training supervised models, unsupervised algorithms can also learn a distribution and generate samples. Generative models are usually expressed as graphical models. The sigmoid belief net is a multi-layer generative model trained with variational approximations. The DBN is similar to a sigmoid belief net except for its top two layers, which form an RBM — an undirected graphical model.
4.5 Convolutional Neural Networks
Although deep networks are hard to train with supervised algorithms, there is one exception: the CNN. Two conjectured reasons:
1. The small fan-in of these neurons (few inputs per neuron) helps gradients to propagate through so many layers without diffusing so much as to become useless.
2. The hierarchical local connectivity structure is a very strong prior that is particularly appropriate for vision tasks, and sets the parameters of the whole network in a favorable region (with all non-connections corresponding to zero weight) from which gradient-based optimization works well.
4.6 Auto-Encoders
There are also connections between auto-encoders and RBMs: auto-encoder training approximates RBM training by Contrastive Divergence.
With a single linear hidden layer, k hidden units learn to project the input onto its first k principal components, much like PCA. If the hidden layer is non-linear, the auto-encoder can capture multi-modal aspects of the input distribution.
An important issue: without other constraints, an auto-encoder with n-dimensional input and a code of dimension at least n could merely learn the identity function (many codes would be useless, simply copying the input). Bengio (2007)'s experiments show that in practice, trained with stochastic gradient descent, overcomplete non-linear auto-encoders (more hidden units than inputs) can still produce useful representations. A simple explanation is that early stopping acts somewhat like an L2 penalty.
To reconstruct continuous inputs, a non-linear auto-encoder needs small weights in the first layer (to bring the non-linearity of the hidden units into their linear regime) and large weights in the second layer; for binary inputs, large weights are also needed to fully minimize the reconstruction error.
Besides constraining the encoder by explicit or implicit regularization of the weights, another strategy is to add noise — which is essentially what the RBM does — and yet another is a sparsity constraint. The weight matrices these methods produce resemble what is observed in V1 and V2 neurons (Lee, Ekanadham and Ng, 2008).
Sparsity and regularization avoid learning the identity by reducing capacity, whereas the RBM can have large capacity and still not learn the identity, because it captures the statistical structure of the input. One variant of the auto-encoder, the denoising auto-encoder, shares this property with the RBM.

5. Energy-Based Models and Boltzmann Machines
5.1 Energy-Based Models and Products of Experts
Energy-based probabilistic models define a probability distribution through an energy function:
P(x) = e^(-Energy(x)) / Z
Any probability distribution can be written this way; the normalizer Z is called the partition function:
Z = sum_x e^(-Energy(x))
In the products-of-experts formulation, the energy is a sum of expert terms: Energy(x) = sum_i f_i(x).
5.1.1 Introducing Hidden Variables
With an observed part x and a hidden part h, the marginal is P(x) = sum_h P(x, h). Mapping this to an energy function:
P(x) = e^(-FreeEnergy(x)) / Z, with FreeEnergy(x) = -log sum_h e^(-Energy(x,h)) and Z = sum_x e^(-FreeEnergy(x)).
With theta denoting the model's parameters, the log-likelihood gradient is
d log P(x)/d theta = -d FreeEnergy(x)/d theta + sum_x~ P(x~) d FreeEnergy(x~)/d theta
so the average log-likelihood gradient is
E_P^[d log P(x)/d theta] = -E_P^[d FreeEnergy(x)/d theta] + E_P[d FreeEnergy(x)/d theta]
where P^ is the empirical (training) distribution and E_P is the expectation under the model's distribution P.
The energy can also be written as a sum of terms each involving a single hidden unit, Energy(x,h) = -beta(x) + sum_i gamma_i(x, h_i). Then, as in the RBM, the free energy (and the numerator of P(x)) is tractable:
FreeEnergy(x) = -beta(x) - sum_i log sum_{h_i} e^(-gamma_i(x, h_i))
where the inner sum runs over all values of h_i (an integral if h is continuous).
5.1.2 Conditional Energy-Based Models
Computing the partition function is hard; but if the final goal is to decide y given x, we do not need the joint P(x,y), only
P(y|x) = e^(-Energy(x,y)) / sum_y' e^(-Energy(x,y'))
This kind of approach is used in the Discriminative RBM.
5.2 Boltzmann Machines
The Boltzmann machine is an energy-based model with hidden variables whose energy function is a second-order polynomial:
Energy(x,h) = -b'x - c'h - h'Wx - x'Ux - h'Vh
Here b and c are the biases of x and h; the weights W, U, V each connect a pair of units; U and V are symmetric with, in most models, zero diagonals. Non-zero diagonals can be used to obtain variants, e.g. Gaussian instead of binomial units.
Because of the interactions among hidden units, the FreeEnergy computation above does not apply here, but an MCMC sampling approach can be used; the conditional terms are easy to compute, so if we can sample from P(h|x) and from P(x,h), we obtain an unbiased stochastic estimator of the log-likelihood gradient. Hinton (1986) introduced the terminology: in the positive phase, x is clamped as input and h is sampled given x; in the negative phase, both x and h are sampled, ideally from the model itself.
Gibbs sampling is an approximate sampling scheme: the joint distribution of N random variables S = (S_1, ..., S_N) is sampled through N sub-steps, each sampling one S_i given the other N-1 variables; after enough steps the chain gradually converges to P(S).
The paper then shows how to apply Gibbs sampling in a Boltzmann machine, with an example — which unfortunately I did not understand; I will return to it when needed.
Because each example requires two MCMC chains (one for the positive phase, one for the negative phase), the computational cost is high, which is why the approach was displaced by back-propagation. The Contrastive Divergence algorithm below, however, revives it successfully.
5.3 Restricted Boltzmann Machines
The RBM is the building block of the DBN. In it, U and V are both zero, since there are no within-layer connections. Energy function:
Energy(x,h) = -b'x - c'h - h'Wx
Free energy of an input:
FreeEnergy(x) = -b'x - sum_i log(1 + e^(c_i + W_i x))
The conditional factorizes: P(h|x) = prod_i P(h_i|x), and in the binary case P(h_i = 1|x) = sigmoid(c_i + W_i x). Because x and h play symmetric roles in the energy function, likewise P(x_j = 1|h) = sigmoid(b_j + W'_j h).
In Hinton (2006), binomial input units are used to encode pixel gray levels in input images as if they were the probability of a binary event. This works well on the MNIST training set but not in other cases; Bengio (2007)'s experiments describe the advantage of Gaussian input units over binomial ones when the inputs are continuous-valued.
Although an RBM may not represent some distributions as efficiently as a general BM, it can represent any discrete distribution given enough hidden units. Le Roux and Bengio (2008)'s experiments show that unless the RBM already represents the training distribution perfectly, adding a hidden unit always improves the log-likelihood.
An RBM can also be seen as a multi-clustering: each hidden unit creates a 2-region partition of the input space. The sum over the exponential number of possible hidden-layer configurations of an RBM can also be seen as a particularly interesting form of mixture, with an exponential number of components (with respect to the number of hidden units and of parameters). For example, if P(x|h) is chosen Gaussian, this is a Gaussian mixture with 2^n components for n bits of h — but the components cannot be tuned independently, because they share parameters: the Gaussian mean is obtained through a linear function of h, so each hidden unit h_i contributes its own column W_i to the mean.
5.3.1 Gibbs Sampling in RBMs
Gibbs sampling in an RBM alternates two sub-steps at each step: first sample h from x, then sample a new x from h. As the chain runs, the sample distribution approaches the model distribution. If we started from the model distribution, a single step would already give a converged sample; starting from the empirical distribution of the training data therefore ensures that only a small number of steps is needed to converge.
5.4 Contrastive Divergence
5.4.1 Justifying Contrastive Divergence
The first approximation in this algorithm is to replace the average over all possible inputs by a single sample. k-step Contrastive Divergence adds a second approximation: the model-side term is estimated with x~_k, the last sample after k Gibbs steps. As k goes to infinity, the bias goes away.
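The binary-RBM quantities used throughout this section — the tractable free energy and the CD-1 update built from one Gibbs step — can be sketched together. The formulas are the standard ones quoted in the text; the helper names and learning rate are mine:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def free_energy(x, W, b, c):
    """FreeEnergy(x) = -b'x - sum_i log(1 + exp(c_i + W_i x)) for a
    binary RBM with Energy(x,h) = -b'x - c'h - h'Wx; P(x) is
    proportional to exp(-FreeEnergy(x))."""
    return -b @ x - np.sum(np.logaddexp(0.0, c + W @ x))

def cd1_update(x, W, b, c, lr=0.1, rng=None):
    """One CD-1 step. Positive phase: h ~ P(h|x) at the data point x.
    Negative phase: one Gibbs step x -> h -> x1, with the
    reconstruction x1 as the negative sample. Updates W, b, c in place."""
    if rng is None:
        rng = np.random.default_rng()
    ph = sigmoid(c + W @ x)                    # P(h=1|x)
    h = (rng.random(ph.shape) < ph) * 1.0      # sampled hidden state
    px1 = sigmoid(b + W.T @ h)                 # P(x=1|h)
    x1 = (rng.random(px1.shape) < px1) * 1.0   # negative sample
    ph1 = sigmoid(c + W @ x1)
    W += lr * (np.outer(ph, x) - np.outer(ph1, x1))
    b += lr * (x - x1)
    c += lr * (ph - ph1)
    return x1
```

Tracking the reconstruction error ||x1 - x|| over updates is exactly the monitoring heuristic mentioned in 5.4.3 below.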
When the model distribution is very close to the empirical distribution, i.e. P is approximately P^, then starting the chain from x (a training sample) means the chain has already converged, and one step suffices to obtain an unbiased sample.
It turns out that even k = 1 often gives good results. One way to interpret CD is as an approximation of the log-likelihood gradient locally around the training example x1. LeCun (2006) points out that the most important element of EBM training algorithms is to make the energy of the observed inputs small — here, the free energy. The "contrast" in Contrastive Divergence is the contrast between a real training sample and a sample from the chain.
5.4.2 Alternatives to Contrastive Divergence
Tieleman (2008) and Salakhutdinov and Hinton (2009) proposed a persistent MCMC chain for the negative phase. The idea is simple: keep a background MCMC chain ...xt -> ht -> xt+1 -> ht+1... to obtain the negative-phase samples. Unlike CD-k, which runs a short chain for each update and whose approximation ignores the fact that the parameters keep changing, here we do not run a separate chain for each value of the parameters. Because the parameters actually change slowly, this approximation works well; the trade-off with CD-1 is that the variance is larger but the bias is smaller. (I have not fully understood this method yet!)
Another alternative is Score Matching (Hyvarinen, 2005, 2007a, 2007b), a way of training EBMs in which the energy can be computed but not the normalizing constant Z. The score function of a density p(x) is psi(x) = d log p(x)/dx, which does not depend on Z. The basic idea is to match the score function of the model with the score function of the empirical density, minimizing the difference between the two. (Needs a closer reading of the paper.)
5.4.3 Truncations of the Log-Likelihood Gradient in Gibbs-Chain Models (pure derivation; see the paper for details)
Bengio and Delalleau (2009) give the following theorem:
Theorem 5.1. Consider the converging Gibbs chain x1 => h1 => x2 => h2 ... starting at data point x1. The log-likelihood gradient can be written as a series whose final term converges to zero as time goes to infinity.
Truncating the chain to k steps gives an approximation that is exactly the CD-k update; this tells us that the bias of CD-k is the neglected remainder term, which shrinks as k increases, so using more steps in CD-k converges faster and better. When the Markov chain is initialized at x1, even the first step already moves in the right direction relative to x1 — roughly going down the energy landscape from x1.
CD-1 performs two samplings; what if we perform only one? Analyzing the log-likelihood gradient expansion, replacing the sampled h1 by its average configuration E[h1|x1], and neglecting the remaining term (why?), the right-hand side becomes the update direction of a reconstruction error — the criterion typically used to train auto-encoders. So truncating the chain yields, as a first approximation, roughly the reconstruction error, and as the next, slightly better approximation, CD-1. Reconstruction error is also what is used to monitor progress when training an RBM.
5.4.4 Model Samples Are Negative Examples (I did not follow the mathematical argument.)
A crucial element of Boltzmann machines and the CD algorithm is the ability to sample from the model. The maximum-likelihood criterion wants high probability on the training examples and low probability elsewhere. Given a model, where the model puts high probability (represented by samples) and where the training examples are together indicate how the model should be changed.
If we can separate training samples from model samples with a decision surface, we can increase the likelihood by decreasing the value of the energy function on the side of the decision surface with more training samples, and increasing it on the other side. The paper then proves that if one can improve a classifier's ability to separate training samples from model samples, one can improve the model's log-likelihood, moving probability mass onto the side of the training samples. In practice, this can be achieved with a classifier whose discriminant function is defined like the free energy of a generative model, under the assumption that one can sample from the model.

6. Greedy Layer-Wise Training of Deep Architectures
6.1 Layer-Wise Training of Deep Belief Networks
A DBN with l layers defines the joint distribution (writing h^0 = x):
P(x, h^1, ..., h^l) = P(h^(l-1), h^l) * prod_{k=0}^{l-2} P(h^k | h^(k+1))
where the P(h^k | h^(k+1)) are the visible-given-hidden conditionals of the intermediate RBMs, and P(h^(l-1), h^l) is the joint distribution of the top-level RBM. Exact inference of the posteriors is intractable, except at the top level, which is an RBM.
For the algorithm, see the paper: Greedy layer-wise learning of deep networks.
6.2 Training Stacked Auto-Encoders
Training parallels the DBN:
1. Train the first layer to minimize reconstruction error.
2. Use the hidden-layer output as input to train the second layer.
3. Iterate step 2.
4. Use the output of the last hidden layer as input to a supervised layer and initialize its parameters.
5. Fine-tune all parameters with the supervised training criterion.
Comparative experiments show that DBNs generally have an edge over SAEs, perhaps because CD-k is closer to the log-likelihood gradient than the reconstruction error gradient. On the other hand, because no sampling is involved, the reconstruction-error gradient has less variance than CD-k.
The advantage of SAEs is that any parametrization of each layer is possible, whereas in probabilistic graphical models, only parametrizations for which CD or other tractable estimators of the log-likelihood gradient apply can be used. The disadvantage of SAEs is that they are not generative models; with a generative model, samples can be drawn to check qualitatively what has been learned, e.g. through visualization.
6.3 Semi-Supervised and Partially Supervised Training
Besides unsupervised-then-supervised training, there are other ways of combining the two. Bengio (2007) proposed partially supervised training, useful when the input distribution P(X) and P(Y|X) are not strongly related; see the paper. There is also self-taught learning (Lee, Battle, Raina and Ng, 2007; Raina et al., 2007).

7. Variants of RBMs and Auto-Encoders
7.1 Sparse Representations in Auto-Encoders and RBMs
1. Why a sparse representation?
Several viewpoints explain sparsity; see the paper and Ranzato (2008).
2. Sparse Auto-Encoders and Sparse Coding
The first success at extracting sparse representations in a deep architecture was Ranzato (2006); the following year the same group introduced a variant based on a Student-t prior. Another approach, related to computational neuroscience, stacks two sparse RBMs (Lee 2008).
In compressed sensing, sparsity is obtained through an L1 penalty: the code h is chosen so that the input x is reconstructed with low L2 error while h is sparse, i.e.
h* = argmin_h ||x - W h||^2 + lambda * ||h||_1
Like directed graphical models, sparse coding exhibits a kind of explaining away: different configurations compete, one is selected, and the others are shut off. The advantage is that if one cause is much more probable than the others, it is the one that we want to highlight. The disadvantage is that it makes the resulting codes somewhat unstable, in the sense that small perturbations of the input x could give rise to very different values of the optimal code h.
To address the stability problem and the fine-tuning problem, Bagnell and Bradley (2009) proposed replacing the L1 penalty with a softer approximation, which yields many very small coefficients without actually converging to 0.
Sparse auto-encoders and sparse RBMs do not suffer from these problems: computational complexity (of inferring the codes), stability of the inferred codes, and numerical stability and computational cost of computing gradients on the first layer in the context of global fine-tuning of a deep architecture. Some intermediate SAE variants were proposed in (Ranzato et al., 2007, 2007; Ranzato and LeCun, 2007; Ranzato et al., 2008): let the codes h be free, but include a parametric encoder and a penalty for the difference between the free non-parametric codes h and the outputs of the parametric encoder.
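The per-example inference of the sparse code h — the inner optimization loop the text attributes to sparse coding — can be sketched with ISTA, a standard proximal-gradient method for the L1-penalized objective above. ISTA is not named in these notes, and the helper name is mine:

```python
import numpy as np

def sparse_code(x, W, lam=0.1, n_iters=200):
    """Infer h* = argmin_h ||x - W h||^2 + lam*||h||_1 by ISTA:
    a gradient step on the reconstruction term followed by
    soft-thresholding (the proximal operator of the L1 penalty)."""
    L = 2 * np.linalg.norm(W, 2) ** 2   # Lipschitz constant of the smooth part
    step = 1.0 / L
    h = np.zeros(W.shape[1])
    for _ in range(n_iters):
        g = 2 * W.T @ (W @ h - x)       # gradient of ||x - W h||^2
        h = h - step * g
        h = np.sign(h) * np.maximum(np.abs(h) - step * lam, 0.0)
    return h
```

Note that this loop must be run for every example at both training and test time — exactly the computational overhead, relative to a parametric encoder, that motivates PSD and the sparse auto-encoders above.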
In the experiments, the encoder is just an affine transformation followed by a non-linearity (like the sigmoid), and the decoder is linear (as in sparse coding).
7.2 Denoising Auto-Encoders
The DAE is a stochastic version of the AE: the input is stochastically corrupted, but the uncorrupted input is still used as the reconstruction target. It therefore does two things: first, encode the input; second, undo the effect of the corruption — and the second can only be done by capturing the statistical dependencies in the input. Vincent (2008) proposed a corruption operation that randomly sets up to half of the inputs to zero.
A recurrent version was proposed as early as Seung (1998), and using auto-encoders for denoising was actually proposed in (LeCun, 1987; Gallinari, LeCun, Thiria and Fogelman-Soulie, 1987). The DAE thus demonstrates the success of this strategy for unsupervised pre-training, and connects it to generative models.
One interesting property of the DAE is that it is equivalent to a generative model; another is that it naturally lends itself to data with missing values or multi-modal data.
7.3 Lateral Connections
An RBM can be made slightly less restricted by adding some lateral connections in the visible layer: sampling h remains simple, but sampling x becomes a little more complex. The results in Osindero and Hinton (2008) show that DBNs built from such modules work better than ordinary DBNs. The lateral connections capture pairwise dependencies, letting the hidden layer capture higher-order dependencies; the first layer then acts as a kind of whitening, a preprocessing step. The advantage is that the higher-level factors in the hidden representation need not encode all the local details, which the lateral connections capture instead.
7.4 Conditional RBMs and Temporal RBMs
A Conditional RBM is an RBM where some of the parameters are not free but are instead parametrized functions of a conditioning random variable. Taylor and Hinton (2009) proposed context-dependent RBMs in which the hidden biases c are an affine function of a context variable z. This is an example of a temporal RBM (in the figure, double arrows denote an RBM and dashed arrows denote conditional dependency). The idea was successfully applied to human-motion modeling.
7.5 Factored RBMs
Applied in language modeling; I am not familiar with them.
7.6 Generalizing RBMs and Contrastive Divergence (to be completed)
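Vincent (2008)'s masking corruption can be sketched directly (the helper name is mine); a DAE would then be trained to map the corrupted x~ back to the clean x:

```python
import numpy as np

def masking_corruption(x, fraction=0.5, rng=None):
    """Corruption process of Vincent (2008): set a randomly chosen
    fraction of the input components to zero. The clean x stays the
    reconstruction target; only the corrupted copy is encoded."""
    if rng is None:
        rng = np.random.default_rng()
    x_tilde = x.copy()
    n_zero = int(fraction * x.size)
    idx = rng.choice(x.size, size=n_zero, replace=False)
    x_tilde[idx] = 0.0
    return x_tilde
```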
8. Stochastic Variational Bounds for Joint Optimization of DBN Layers
Below, Q refers to the RBM and P to the DBN. Jensen's inequality yields a lower bound on the DBN's log-likelihood. Writing P(x) = sum_{h1} P(x, h1) and introducing Q(h1|x), one can rewrite:
log P(x) = KL(Q(h1|x) || P(h1|x)) + H_{Q(h1|x)} + sum_{h1} Q(h1|x) [log P(h1) + log P(x|h1)]
where H_{Q(h1|x)} denotes the entropy of Q(h1|x). By the non-negativity of the KL divergence,
log P(x) >= H_{Q(h1|x)} + sum_{h1} Q(h1|x) [log P(h1) + log P(x|h1)]
with equality when P and Q are identical. For the first-level RBM one would like Q(h1|x) = P(h1|x), but in fact they cannot be equal, because in the DBN the prior P(h1) over the first hidden layer is determined by the upper layers.
8.1 Unfolding RBMs into Infinite Directed Belief Networks
Before proving that the greedy training procedure improves the bound, one must relate P(h1) in the DBN to the corresponding marginal Q(h1) of the RBM: the two are equal when the second-level RBM's weight matrix is the transpose of the first level's. Another way to see this is to view an infinite Gibbs chain as an infinite directed graphical model with tied weights; such an infinite directed graph is in fact equivalent to an RBM. In particular, a 2-layer DBN whose second-layer weights are the transpose of its first-layer weights is equivalent to a single RBM.
8.2 Variational Justification of Greedy Layer-wise Training
We now show that adding an RBM layer can improve the DBN's likelihood. Construct the equivalent 2-layer DBN as above (second weight matrix equal to the transpose of the first), fix the two conditionals of the first level, and improve P(h1). Initially the KL term is 0 and the entropy term does not depend on P(h1), so an increase of the remaining term increases log P(x); by the non-negativity of the KL and entropy terms, further training of the second-level RBM keeps increasing a lower bound. The second-level RBM is therefore trained to maximize
sum_x P^(x) sum_{h1} Q(h1|x) log P(h1)
If there were no constraint on P(h1), the maximizer of this training criterion would be its "empirical" or target distribution P*(h1) = sum_x P^(x) Q(h1|x). The same argument shows that adding a third level also helps. The constraints on the size and weights of the added RBM layer are not essential; whether initializing its weights with the transpose of the previous layer's actually speeds up training is a question for experiments.
Note that when training the top-level RBM there is no guarantee that log P(x) increases monotonically: the lower bound keeps increasing, but the actual log-likelihood can decrease. That would require the KL term to decrease, which in general does not happen: as the DBN prior P(h1) drifts away from the RBM marginal Q(h1), the posteriors P(h1|x) and Q(h1|x) also drift apart, making the KL term larger. As the second level is trained, P(h1) moves gradually from Q(h1) toward P*(h1).
But the likelihood is not improved from every starting configuration of the second RBM. (A counter-example follows, which I did not fully understand.) Consider the case where the first RBM has very large hidden biases, so that its hidden units take one fixed configuration regardless of the input, but large weights and small visible offsets, so that the hidden vector is copied to the visible units. When initializing the second RBM with the transpose of the weights of the first RBM, the training likelihood of the second RBM cannot be improved, nor can the DBN likelihood. When the second RBM starts from such a poor configuration, training moves P(h1) toward P*(h1), making the KL term smaller.
Another interpretation: the training distribution seen by the second-level RBM is generated by the first level, which amounts to one step of Gibbs sampling — and we know that more Gibbs steps approximate the true data distribution more and more accurately.
When we train within this greedy layer-wise procedure an RBM that will not be the top level of a DBN, we are not taking into account the fact that more capacity will be added later to improve the prior on the hidden units.
Le Roux and Bengio (2008) proposed a method to replace the CD algorithm for training RBMs; experiments show that training the first RBM with a KL-divergence criterion can optimize the DBN better. But this method is intractable, because it requires summing over all configurations of the hidden layer. 8.3 Joint Unsupervised Training of All the Layers 8.3.1 The wake-sleep algorithm In the wake-sleep algorithm, the upward recognition parameters and the downward generative parameters are separate. The main idea: 1. Wake phase: use x to generate h ~ Q(h|x), and use this (h, x) as fully observed data to train P(x|h) and P(h). This amounts to one stochastic-gradient step on log P(x, h). 2. Sleep phase: sample (h, x) from P(x, h), then use it as observed data to train Q(h|x). This amounts to one stochastic-gradient step on log Q(h|x). 8.3.2 Transforming the DBN into a Boltzmann Machine After each layer has been initialized as an RBM, the DBN can be transformed into a deep Boltzmann machine. Because in a BM each unit receives input from both above and below, (Salakhutdinov & Hinton, 2009) proposed halving the RBM weights when initializing the DBM. 9. Looking Forward 9.1 Global Optimization Strategies 9.2 Why Unsupervised Learning is Important 1. Scarcity of labeled examples. 2. Unknown future tasks. 3. Once a good high-level representation is learned, other learning tasks can become much easier. 4. Layer-wise unsupervised learning. 5. Unsupervised learning can put the parameters of a supervised or reinforcement learning machine in a region from which gradient descent (local optimization) yields good solutions. 6. The extra constraints imposed on the optimization by requiring the model to capture not only the input-to-target dependency but also the statistical regularities of the input distribution might be helpful in avoiding some poorly generalizing apparent local minima. In general, extra constraints may also create more local minima, but unsupervised pre-training reduces both training and test error, suggesting that pre-training moves the parameters close to a region of parameter space corresponding to good representations. Deep architectures have typically been used to construct a supervised classifier, and in that case the unsupervised learning component can clearly be seen as a regularizer, or a prior, that forces the resulting parameters not only to model classes given inputs but also to capture the structure of the input distribution. 9.3 Open questions 1. Why is gradient-based training of deep networks from random initialization often unsuccessful? 2. Can an RBM trained with CD retain all the information of the input (unlike an auto-encoder, it might lose some important information)? If not, how can this be fixed? 3. Does the number of Gibbs sampling steps in the CD algorithm need tuning? 4. Is Persistent CD worth exploring further? 5. Besides reconstruction error, are there other ways to monitor the training of a DBN or RBM?
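The two wake-sleep phases can be sketched for a one-hidden-layer sigmoid belief net as follows. This is an illustrative toy under my own naming and with simple delta-rule updates, not code from the cited papers: recognition parameters (R, r) and generative parameters (G, g, gb) are kept separate, exactly as the two phases require.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class WakeSleep:
    """One-hidden-layer sigmoid belief net trained with wake-sleep."""

    def __init__(self, nv, nh, lr=0.05):
        self.R = rng.normal(0, 0.1, (nv, nh))  # recognition weights x -> h
        self.r = np.zeros(nh)
        self.G = rng.normal(0, 0.1, (nh, nv))  # generative weights h -> x
        self.g = np.zeros(nv)
        self.gb = np.zeros(nh)                 # generative prior bias on h
        self.lr = lr

    def wake(self, x):
        # sample h ~ Q(h|x); treat (h, x) as fully observed data and take a
        # stochastic-gradient step on log P(x, h) w.r.t. generative params
        q = sigmoid(x @ self.R + self.r)
        h = (rng.random(q.shape) < q).astype(float)
        p = sigmoid(h @ self.G + self.g)       # P(x|h)
        self.G += self.lr * np.outer(h, x - p)
        self.g += self.lr * (x - p)
        self.gb += self.lr * (h - sigmoid(self.gb))  # fit the prior P(h)

    def sleep(self):
        # sample (h, x) ~ P; take a stochastic-gradient step on log Q(h|x)
        ph = sigmoid(self.gb)
        h = (rng.random(ph.shape) < ph).astype(float)
        px = sigmoid(h @ self.G + self.g)
        x = (rng.random(px.shape) < px).astype(float)
        q = sigmoid(x @ self.R + self.r)
        self.R += self.lr * np.outer(x, h - q)
        self.r += self.lr * (h - q)

    def reconstruct(self, x):
        # recognition pass followed by a mean-field generative pass
        q = sigmoid(x @ self.R + self.r)
        return sigmoid(q @ self.G + self.g)
```

Note the asymmetry: the wake phase only ever sees real data x, while the sleep phase trains the recognition weights purely on the network's own "dreams".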
6. Can RBMs and auto-encoders be improved with some form of sparsity penalty? 7. Is there a probabilistic interpretation of SAEs and SDAEs (stacked (denoising) auto-encoders)?
So it is time to compare and summarize the pros and cons of the two basic NLP (Natural Language Processing) approaches and show where they complement each other. Some notes: 1. In text processing, the majority of basic, robust machine learning is based on keywords, the so-called BOW (bag-of-words) model, although there is research on machine learning that goes beyond keywords. Such work typically utilizes n-gram (mostly bigram or trigram) linear word sequences to approximate language structure. 2. Grammar engineering is mostly a hand-crafted rule system based on linguistic structures (often represented internally as a grammar tree), designed to simulate linguistic parsing in the human mind. 3. Machine learning is good at viewing the forest (tasks such as document classification or word clustering from a corpus; it fails on short messages) while rules are good at examining each tree (sentence-level tasks such as parsing and extraction; they handle short messages well). This is understandable. A document or corpus contains a fairly big bag of keywords, making it easy for a machine to learn statistical clues of the words for a given task. Short messages do not have enough data points for a machine learning system to use as evidence. Grammar rules, on the other hand, decode the linguistic relationships between words to understand the sentence, and are therefore good at handling short messages. 4. In general, a machine learning system based on keyword statistics is recall-oriented while a rule system is precision-oriented. They are complementary on these two core metrics of data quality. Each rule may only cover a tiny portion of the language phenomena, but whatever it captures, it usually captures precisely. It is easy to develop a highly precise rule system, but recall typically only picks up incrementally with the number of rules developed.
Because keyword-based machine learning has no knowledge of sentence structure (at best its n-gram evidence indirectly simulates language structure), it usually cannot reach high precision; but as long as the training corpus is sizable, good recall can be expected thanks to the underlying keyword statistics and the disregard for structural constraints. 5. Machine learning is known for its robustness and scalability, as its algorithms are grounded in science (e.g., MaxEnt is based on information theory) and can be repeated and rigorously tested (of course, as in any application area, there are tricks and know-how that make things work or fail in practice). Development is also fast once a labeled corpus is available (which is often not easy in practice), because there are off-the-shelf open-source tools and plenty of documentation and literature in the community for proven ML algorithms. 6. Grammar engineering, on the other hand, tends to depend more on the expertise of the designer and developers for robustness and scalability. It requires deep skills and secret sauce that may only be accumulated through years of successes as well as lessons learned. It is not a purely scientific undertaking but more of a balancing act in architecture, design, and development. To a degree, this is like chefs in Chinese cooking: with the same ingredients and presumably the same recipe, one chef's dish can taste much better than, or simply different from, another's. The recipe only gives a framework; the secret of great taste is in the details of know-how. It is not easily repeatable across developers, but the same master can repeatedly make the best-quality dishes/systems. 7. The knowledge bottleneck shows up in both machine learning systems and grammar systems. A decent machine learning system requires a large hand-labeled corpus (research-oriented unsupervised learning systems do not need manual annotation, but they are often not practical either).
There is consensus in the community that the quality of machine learning usually depends more on the data than on the algorithms. The bottleneck of grammar engineering, on the other hand, lies in skilled designers (data scientists) and well-trained domain developers (computational linguists), who are often in short supply today. 8. Machine learning is good at coarse-grained specific tasks (a typical example is classification) while grammar engineering is good at fine-grained analysis and detailed insight extraction. Their respective strengths make them highly complementary in certain application scenarios, because as information consumers, users often demand both a coarse-grained overview and the details of actionable intelligence. 9. One big problem of a machine learning system is the difficulty of fixing a reported quality bug. This is because the learned model is usually a black box, and no direct human interference is allowed, or even possible, to address a specific problem unless the model is re-trained with a new corpus and/or new features. In the latter case, there is no guarantee that the specific problem we want to solve will be addressed well by re-training, as the learning process needs to balance all features in a unified model. This issue is believed to be the major reason why the Google search ranking algorithm favors hand-crafted functions over machine learning: their objective of better user experience can hardly be achieved with a black-box model. 10. A grammar system is much more transparent in the language understanding process. Modern grammar systems are all designed with careful modularization, so that each specific quality bug can be traced to the corresponding module of the system for fine-tuning. The effect is direct and immediate, and can be accumulated incrementally for overall performance enhancement. 11.
From the perspective of NLP depth, at least at the current state of the art, machine learning seems to do shallow NLP work fairly well, while grammar engineering can go much deeper in linguistic parsing to achieve deep analytics and insights. (The ongoing deep learning research program might get machine learning somewhat deeper than before, but it remains to be seen how effectively it can do real deep NLP and how deep it can go, especially in the area of text processing and understanding.) Related blogs: why hybrid? on machine learning vs. hand-coded rules in NLP; More on machine learning vs. hand-crafted systems: which is smarter and more capable, man or machine? (in Chinese) [Pinned: index of the author's NLP blog posts on ScienceNet (updated regularly)]
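Points 1 and 4 above can be made concrete with a toy sketch (illustrative only, not from any production system): a bag-of-words/n-gram feature extractor of the kind a keyword-based learner consumes, next to a single precision-oriented hand-crafted rule.

```python
import re

def ngram_features(text, n=2):
    """Unigram + n-gram features: the BOW-style evidence an ML system sees."""
    toks = text.lower().split()
    feats = set(toks)  # unigrams (plain bag of words)
    for i in range(len(toks) - n + 1):
        feats.add(" ".join(toks[i : i + n]))  # linear word sequences
    return feats

def rule_negative(text):
    """One hand-crafted rule: negation within two words of a positive
    adjective. It covers a tiny slice of the language, but precisely."""
    return bool(re.search(r"\b(not|never)\s+(\w+\s+)?(good|great|useful)\b",
                          text.lower()))
```

For "this parser is not very good", the rule fires on the structural pattern "not ... good", while a pure bag of words containing the keyword "good" would pull a keyword model toward the opposite label; that is the precision/recall asymmetry described above in miniature.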
Quora has a question with discussions on: Why is machine learning used heavily for Google's ad ranking and less for their search ranking? "A lot of people I've talked to at Google have told me that the ad ranking system is largely machine learning based, while search ranking is rooted in functions that are written by humans using their intuition (with some components using machine learning)." Surprise? Contrary to what many people have believed, Google search consists of hand-crafted functions using heuristics. Why? One very popular reply there is from Edmond Lau, Ex-Google Search Quality Engineer, who said something we have been experiencing and have indicated over and over in past blogs on Machine Learning vs. Rule System, i.e. it is very difficult to debug an ML system for specific observed quality bugs, while a rule system, if designed modularly, is easy to control and fine-tune: From what I gathered while I was there, Amit Singhal, who heads Google's core ranking team, has a philosophical bias against using machine learning in search ranking. My understanding of the two main reasons behind this philosophy is: In a machine learning system, it's hard to explain and ascertain why a particular search result ranks more highly than another result for a given query. The explainability of a certain decision can be fairly elusive; most machine learning algorithms tend to be black boxes that at best expose weights and models that can only paint a coarse picture of why a certain decision was made. Even in situations where someone succeeds in identifying the signals that factored into why one result was ranked more highly than another, it's difficult to directly tweak a machine learning-based system to boost the importance of certain signals over others in isolated contexts.
The signals and features that feed into a machine learning system tend to only indirectly affect the output through layers of weights, and this lack of direct control means that even if a human can explain why one web page is better than another for a given query, it can be difficult to embed that human intuition into a system based on machine learning. Rule-based scoring metrics, while still complex, provide a greater opportunity for engineers to directly tweak weights in specific situations. From Google's dominance in web search, it's fairly clear that the decision to optimize for explainability and control over search result rankings has been successful at allowing the team to iterate and improve rapidly on search ranking quality. The team launched 450 improvements in 2008, and the number is likely only growing with time. Ads ranking, on the other hand, tends to be much more of an optimization problem where the quality of two ads is much harder to compare and intuit than two web page results. Whereas web pages are fairly distinctive and can be compared and rated by human evaluators on their relevance and quality for a given query, the short three- or four-line ads that appear in web search all look fairly similar to humans. It might be easy for a human to identify an obviously terrible ad, but it's difficult to compare two reasonable ones: branding differences, subtle textual cues, and behavioral traits of the user, which are hard for humans to intuit but easy for machines to identify, become much more important. Moreover, different advertisers have different budgets and different bids, making ad ranking more of a revenue optimization problem than merely a quality optimization problem.
Because humans are less able to understand the decisions behind an ads ranking system that may work well empirically, explainability and control -- both of which are important for search ranking -- become comparatively less useful in ads ranking, and machine learning becomes a much more viable option. Jackie Bavaro, Google PM for 3 years: Edmond Lau's answer is great, but I wanted to add one more important piece of information. When I was on the search team at Google (2008-2010), many of the groups in search were moving away from machine learning systems to rules-based systems. That is to say, Google Search used to use more machine learning, and then went the other direction because the team realized they could make faster improvements to search quality with a rules-based system. It's not just a bias; it's something that many sub-teams of search tried out and preferred. I was the PM for Images, Video, and Local Universal - 3 teams that focus on including the best results when they are images, videos, or places. For each of those teams I could easily understand and remember how the rules worked. I would frequently look at random searches and their results and think "Did we include the right Images for this search? If not, how could we have done better?" And when we asked that question, we were usually able to think of signals that would have helped - try it yourself. The reasons why *you* think we should have shown a certain image are usually things that Google can actually figure out. (Written 10 Apr 2013) Anonymous: Part of the answer is legacy, but a bigger part of the answer is the difference in objectives, scope and customers of the two systems.
The customer for the ad system is the advertiser (and by proxy, Google's sales dept). If the machine-learning system does a poor job, the advertisers are unhappy and Google makes less money. Relatively speaking, this is tolerable to Google. The system has an objective function ($), and machine learning systems can be used when there is an objective function to optimize. The total search space (# of ads) is also much, much smaller. The search ranking system has a very subjective goal - user happiness. CTR, query volume, etc. are very inexact metrics for this goal, especially on the fringes (i.e., query terms that are low-volume/volatile). While much of the decisioning can be automated, there are still lots of decisions that need human intuition. To tell whether site A is better than site B for topic X with limited behavioural data is still a very hard problem. It degenerates into lots of little messy rules and exceptions that try to impose a fragile structure onto human knowledge, and that necessarily need tweaking. An interesting question is: is the Google search index (and associated semantic structures) catching up (in size and robustness) to the subset of the corpus of human knowledge that people are interested in and searching for? My guess is that right now the gap is probably growing - i.e., interesting/search-worthy human knowledge is growing faster than Google's index. Amit Singhal's job is probably getting harder every year. By extension, there are opportunities for new search providers to step into the increasing gap with unique offerings. P.S.: I used to manage an engineering team for a large search provider (many years ago).
Recently I have been studying learning-to-rank algorithms, and I found a gem: https://github.com/JK-SUN/cikm12-vs-cf-sourcecode. I downloaded the code and data via Baidu Netdisk, and when running it I got this error:

Traceback (most recent call last):
  File "G:\software\CIKM-SourceCode\sourcecode\weighted_KendallTauRank_specialty_degree_sim.py", line 169, in <module>
    main('user_user_sim_eachmovie_weight.data')
  File "G:\software\CIKM-SourceCode\sourcecode\weighted_KendallTauRank_specialty_degree_sim.py", line 119, in main
    results=pprocess.pmap(calculate_tf,sequence_tf,limit)
  File "G:\software\CIKM-SourceCode\sourcecode\pprocess.py", line 917, in pmap
    mymap = Map(limit=limit)
  File "G:\software\CIKM-SourceCode\sourcecode\pprocess.py", line 675, in __init__
    Exchange.__init__(self, *args, **kw)
  File "G:\software\CIKM-SourceCode\sourcecode\pprocess.py", line 277, in __init__
    self.poller = select.poll()
AttributeError: 'module' object has no attribute 'poll'

The error is caused by select.poll(). On inspection, it turns out this interface is only available on Unix-like systems and cannot be used on Windows: http://docs.python.org/2/library/select.html. What a pity. I had assumed Python worked the same on every operating system; it seems that when Linux is called for, Linux it must be, and trying my luck on Windows was just a waste of time. Now that the cause of the error is found, I can sleep in peace.
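For new code (as opposed to patching pprocess, which hard-codes select.poll()), a portable alternative on Python 3.4+ is the standard-library selectors module, which picks the best I/O multiplexing mechanism the platform offers (epoll/kqueue/poll on Unix, select on Windows) instead of hard-coding poll. The socket setup below is only a demonstration:

```python
import select
import selectors
import socket

# Guard any direct use of poll(): it exists only on Unix-like systems,
# which is exactly the AttributeError hit above on Windows.
have_poll = hasattr(select, "poll")

# selectors.DefaultSelector chooses the best mechanism available on the
# platform, so the same code runs on both Windows and Linux.
sel = selectors.DefaultSelector()
srv = socket.socket()
srv.bind(("127.0.0.1", 0))   # any free port, for demonstration only
srv.listen()
srv.setblocking(False)
key = sel.register(srv, selectors.EVENT_READ, data="accept")
events = sel.select(timeout=0)  # non-blocking poll; empty when nothing is ready
sel.unregister(srv)
srv.close()
sel.close()
```

Since pprocess itself calls select.poll() directly, running it on Linux, as the post concludes, remains the practical route.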
Taken from: http://cseweb.ucsd.edu/~dasgupta/254-deep/
CSE 254: Seminar on Learning Algorithms
Time: TuTh 3:30-5 in CSE 2154. Instructor: Sanjoy Dasgupta. Office hours: TBA in EBU3B 4138.
This quarter the theme of CSE 254 is deep learning. Prerequisite: CSE 250AB. The first couple of lectures will be an overview of basic material. Thereafter, in each class meeting, a student will give a talk lasting about 60 minutes, presenting a technical paper (or several papers) in detail. In questions during the talk, and in the final 20 minutes, all seminar participants will discuss the paper and the issues raised by it.
Schedule (date: presenter, paper; "slides" marks posted slides):
Jan 10: Sanjoy, Introduction
Jan 12: Sanjoy, Hopfield nets
Jan 17: Sanjoy, Markov random fields, Gibbs sampling, simulated annealing
Jan 19: Sanjoy, Deep belief nets as autoencoders and classifiers
Jan 24: Brian, Task-driven dictionary learning (slides)
Jan 26: Vicente, A quantitative theory of immediate visual recognition (slides)
Jan 31: Emanuele, Convolutional deep belief networks (slides)
Feb 2: Nakul, Restricted Boltzmann machines: learning, and hardness of inference (slides)
Feb 7: Craig, The independent components of natural scenes are edge filters (slides)
Feb 9: No class: ITA conference at UCSD
Feb 14: Janani, Deep learning via semi-supervised embedding (slides)
Feb 16: Stefanos, A unified architecture for natural language processing (slides)
Feb 21: Hourieh, An analysis of single-layer networks in unsupervised feature learning (slides)
Feb 23: Ozgur, Emergence of simple-cell receptive field properties by learning a sparse code for natural images (slides)
Feb 28: Matus, Representation power of neural networks: Barron, Cybenko, Kolmogorov (slides)
Mar 1: Frederic, Reinforcement learning on slow features of high-dimensional input streams
Mar 6: Dibyendu, Sreeparna, Learning deep energy models and What is the best multistage architecture for object recognition? (slides)
Mar 8: No class: Sanjoy out of town
Mar 13: Bryan, Inference of sparse combinatorial-control networks (slides)
Mar 15: Qiushi, Weighted sums of random kitchen sinks (slides)
This is a four-unit course in which the work consists of oral presentations. The procedure for each student presentation is as follows:
· One week in advance: finish a draft of the LaTeX/PowerPoint slides that clearly present the work in the paper. Make an appointment with me to discuss the draft slides, and email me the slides.
· Several days in advance: meet for about one hour to discuss improving the slides and how to give a good presentation.
· Day of presentation: give a good presentation with confidence, enthusiasm, and clarity.
· Less than three days afterwards: make the changes to the slides suggested by the class discussion, and email me the slides in PDF, two slides per page, for publishing. Try to make your PDF file less than one megabyte.
Please read, reflect upon, and follow these presentation guidelines, courtesy of Prof. Charles Elkan. Presentations will be evaluated, in a friendly way but with high standards, using this feedback form. Here is a preliminary list of papers.
Deep Learning
Instructor: Bhiksha Raj
Course numbers -- MLD: 10805; LTI: 11-785 (Lab) / 11-786 (Seminar)
Timings: 1:30 p.m. -- 2:50 p.m. Days: Mondays and Wednesdays. Location: GHC 4211
Website: http://deeplearning.cs.cmu.edu
Credits: 10-805 and 11-786 are 6-credit seminar courses. 11-785 is a 12-credit lab course. Students who register for 11-785 will be required to complete all lab exercises. IMPORTANT: LTI students are requested to switch to the 11-XXX courses. All students desiring 12 credits must register for 11-785.
Instructor: Bhiksha Raj. Contact: email: bhiksha@cs.cmu.edu, Phone: 8-9826, Office: GHC 6705. Office hours: 3:30-5:00 Mondays. You may also meet me at other times if I'm free.
TA: Anders Oland. Contact: email: anderso@cs.cmu.edu, Office: GHC 7709. Office hours: 12:30-2:00 Fridays.
Deep learning algorithms attempt to learn multi-level representations of data, embodying a hierarchy of factors that may explain them. Such algorithms have been demonstrated to be effective at uncovering underlying structure in data, and have been successfully applied to a large variety of problems ranging from image classification to natural language processing and speech recognition. In this course students will learn about this resurgent subject. The course presents the subject through a series of seminars, which will explore it from its early beginnings and work up to some of the state of the art. The seminars will cover the basics of deep learning and the underlying theory, the breadth of application areas to which it has been applied, and the latest issues on learning from very large amounts of data. Although the concept of deep learning has been applied to a number of different models, we will concentrate largely, although not entirely, on the connectionist architectures most commonly associated with it. Students who participate in the course are expected to present at least one paper on the topic to the class.
Presentations are expected to be thorough and, where applicable, illustrated through experiments and simulations conducted by the student. Students registered for the lab course must also complete all lab exercises.
Labs
Lab 1 is up. Lab 1: Perceptrons and MLPs. Data sets. Due: 18 Sep 2013
Lab 2 is up. Lab 2: The effect of increasing network depth. Data set. Due: 17 Oct 2013
Papers and presentations (date; topic/paper; author; presenter; additional links):
28 Aug 2013
- Introduction. Bhiksha Raj
- Intelligent Machinery. Alan Turing. Presenter: Subhodeep Moitra
4 Sep 2013
- Bain on Neural Networks. Brain and Cognition 33:295-305, 1997. Alan L. Wilkes and Nicholas J. Wade. Presenter: Lars Mahler
- A Logical Calculus of the Ideas Immanent in Nervous Activity, Bulletin of Mathematical Biophysics, 5:115-137, 1943. W.S. McCulloch and W.H. Pitts. Presenter: Kartik Goyal. Additional: Michael Marsalli's tutorial on the McCulloch and Pitts neuron
9 Sep 2013
- The Perceptron: A Probabilistic Model For Information Storage And Organization In The Brain. Psychological Review 65(6): 386-408, 1958. F. Rosenblatt. Presenter: Daniel Maturana
- Chapter from "The Organization of Behavior", 1949. D. O. Hebb. Presenter: Sonia Todorova
11 Sep 2013
- The Widrow-Hoff learning rule (ADALINE and MADALINE). Widrow. Presenter: Pallavi Baljekar
- Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks 2(6): 459-473, 1989. T. Sanger. Presenter: Khoa Luu. Additional: A simplified neuron model as a principal component analyzer, Erkki Oja
16 Sep 2013
- Learning representations by back-propagating errors. Nature 323(6088): 533-536. Rumelhart et al. Presenter: Ahmed Hefny. Additional: chapter by Rumelhart, Hinton and Williams; Backpropagation through time: what it does and how to do it, P. Werbos, Proc. IEEE 1990
- A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm, IEEE Intl. Conf. on Neural Networks, 1993. M. Riedmiller and H. Braun. Presenter: Danny (ZhenZong) Lan
18 Sep 2013
- Neural networks and physical systems with emergent collective computational abilities, Proc. Natl. Acad. Sciences, Vol 79, 2554-2558, 1982. J. J. Hopfield. Presenter: Prasanna Muthukumar
- The self-organizing map. Proc. IEEE, Vol 79, 1464-1480, 1990. Teuvo Kohonen. Presenter: Fatma Faruq
23 Sep 2013
- Phoneme recognition using time-delay neural networks, IEEE Trans. Acoustics, Speech and Signal Processing, Vol 37(3), March 1989. A. Waibel et al. Presenter: Chen Chen
- A tutorial on training recurrent neural networks, covering BPTT, RTRL, EKF and the echo state network approach, GMD Report 159, German National Research Center for Information Technology, 2002. Herbert Jaeger. Presenter: Shaowei Wang
25 Sep 2013
- Bidirectional recurrent neural networks, IEEE Transactions on Signal Processing, Vol 45(11), Nov. 1997. M. Schuster and K. Paliwal. Presenter: Felix Juefei Xu
- Long short-term memory. Neural Computation, 9(8): 1735-1780, 1997. S. Hochreiter and J. Schmidhuber. Presenter: Dougal Sutherland
30 Sep 2013
- A learning algorithm for Boltzmann machines, Cognitive Science, 9, 147-169, 1985. D. Ackley, G. Hinton, T. Sejnowski. Presenter: Siyuan
- Improved simulated annealing, Boltzmann machine, and attributed graph matching, EURASIP Workshop on Neural Networks, Vol 412, LNCS, Springer, pp. 151-160, 1990. Lei Xu, Erkki Oja. Presenter: Ran Chen
2 Oct 2013
- Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position, Pattern Recognition Vol 15(6), pp. 455-469, 1982. K. Fukushima, S. Miyake. Presenter: Sam Thomson. Additional: Shift invariance and the Neocognitron, E. Barnard and D. Casasent, Neural Networks Vol 3(4), pp. 403-410, 1990
- Face recognition: A convolutional neural-network approach, IEEE Transactions on Neural Networks, Vol 8(1), pp. 98-113, 1997. S. Lawrence, C. L. Giles, A. C. Tsoi, A. D. Back. Presenter: Hoang Ngan Le. Additional: Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis, P. Y. Simard, D. Steinkraus, J. C. Platt, Proc. Document Analysis and Recognition, 2003; Gradient-based learning applied to document recognition, Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Proceedings of the IEEE, November 1998, pp. 1-43
7 Oct 2013
- On the problem of local minima in backpropagation, IEEE Trans. Pattern Analysis and Machine Intelligence, Vol 14(1), 76-86, 1992. M. Gori, A. Tesi. Presenter: Jon Smereka
- Learning long-term dependencies with gradient descent is difficult, IEEE Trans. Neural Networks, Vol 5(2), pp. 157-166, 1994. Y. Bengio, P. Simard, P. Frasconi. Presenter: Keerthiram Murugesan. Additional: Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, in A Field Guide to Dynamical Recurrent Neural Networks, IEEE Press, 2001; Backpropagation is sensitive to initial conditions, J. F. Kolen and J. B. Pollack, Advances in Neural Information Processing Systems, pp. 860-867, 1990
9 Oct 2013
- Multilayer feedforward networks are universal approximators, Neural Networks, Vol 2(3), 359-366, 1989. K. Hornik, M. Stinchcombe, H. White. Presenter: Sonia Todorova. Additional: Approximations by superpositions of a sigmoidal function, G. Cybenko, Mathematics of Control, Signals and Systems, Vol 2, pp. 303-314, 1989; On the approximate realization of continuous mappings by neural networks, K. Funahashi, Neural Networks, Vol 2(3), pp. 183-192, 1989; Universal approximation bounds for superpositions of a sigmoidal function, A. R. Barron, IEEE Trans. on Info. Theory, Vol 39(3), pp. 930-945, 1993
- On the expressive power of deep architectures, Proc. 14th Intl. Conf. on Discovery Science, 2011. Y. Bengio and O. Delalleau. Presenter: Prasanna Muthukumar. Additional: Scaling learning algorithms towards AI, Y. Bengio and Y. LeCun, in Large Scale Kernel Machines, Eds. Bottou, Chapelle, DeCoste, Weston, 2007; Shallow vs. deep sum-product networks, O. Delalleau and Y. Bengio, Advances in Neural Information Processing Systems, 2011
14 Oct 2013
- Information processing in dynamical systems: Foundations of Harmony Theory; in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Rumelhart and McClelland eds., 1986. Paul Smolensky. Presenter: Kathy Brigham. Additional: Geometry of the restricted Boltzmann machine, M. A. Cueto, J. Morton, B. Sturmfels, Contemporary Mathematics, Vol 516, pp. 135-153, 2010
- Exponential family harmoniums with an application to information retrieval, Advances in Neural Information Processing Systems (NIPS), 2004. M. Welling, M. Rosen-Zvi, G. Hinton. Presenter: Ankur Gandhe. Additional: Continuous restricted Boltzmann machine with an implementable training algorithm, H. Chen and A. F. Murray, IEE Proceedings on Vision, Image and Signal Processing, Vol 150(3), pp. 153-158, 2003; Diffusion networks, products of experts, and factor analysis, T. K. Marks and J. R. Movellan, 3rd Intl. Conf. on Independent Component Analysis and Signal Separation, 2001
16 Oct 2013
- Distributed optimization of deeply nested systems. Unpublished manuscript, Dec. 24, 2012, arXiv:1212.5921. M. Carreira-Perpiñán and W. Wang. Presenter: M. Carreira-Perpiñán
21 Oct 2013
- Training products of experts by minimizing contrastive divergence, Neural Computation, Vol 14(8), pp. 1771-1800, 2002. G. Hinton. Presenter: Yuxiong Wang. Additional: On contrastive divergence learning, M. Carreira-Perpiñán, AI and Statistics, 2005; Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient, T. Tieleman, International Conference on Machine Learning (ICML), pp. 1064-1071, 2008; An Analysis of Contrastive Divergence Learning in Gaussian Boltzmann Machines, Chris Williams, Felix Agakov, tech report, University of Edinburgh, 2002; Justifying and generalizing contrastive divergence, Y. Bengio, O. Delalleau, Neural Computation, Vol 21(6), pp. 1601-1621, 2009
23 Oct 2013
- A fast learning algorithm for deep belief networks, Neural Computation, Vol 18(7), pp. 1527-1554, 2006. G. Hinton, S. Osindero, Y.-W. Teh. Presenter: Aaron Wise. Additional: Reducing the dimensionality of data with neural networks, G. Hinton and R. Salakhutdinov, Science, Vol 313(5786), pp. 504-507, 28 July 2006
- Greedy layer-wise training of deep networks, Neural Information Processing Systems (NIPS), 2007. Y. Bengio, P. Lamblin, D. Popovici and H. Larochelle. Presenter: Ahmed Hefny. Additional: Efficient Learning of Sparse Overcomplete Representations with an Energy-Based Model, M. Ranzato, C. S. Poultney, S. Chopra, Y. LeCun, Neural Information Processing Systems (NIPS), 2006
28 Oct 2013
- ImageNet classification with deep convolutional neural networks, NIPS 2012. A. Krizhevsky, I. Sutskever, G. Hinton. Presenter: Danny Lan. Additional: Convolutional-recursive deep learning for 3D object classification, R. Socher, B. Huval, B. Bhat, C. Manning, A. Ng, NIPS 2012; Multi-column deep neural networks for image classification, D. Ciresan, U. Meier and J. Schmidhuber, CVPR 2012
- Learning hierarchical features for scene labeling, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol 35(8), pp. 1915-1929, 2012. C. Couprie, L. Najman, Y. LeCun. Presenter: Jon Smereka. Additional: Learning convolutional feature hierarchies for visual recognition, K. Kavukcuoglu, P. Sermanet, Y-Lan Boureau, K. Gregor, M. Mathieu, Y. LeCun, NIPS 2010
30 Oct 2013
- Statistical language models based on neural networks, PhD dissertation, Brno, 2012, chapters 3 and 6. T. Mikolov. Presenter: Fatma Faruq
- Semi-supervised recursive autoencoders for predicting sentiment. R. Socher, J. Pennington, E. Huang, A. Ng and C. Manning. Presenter: Yueran Yuan. Additional: Dynamic pooling and unfolding recursive autoencoders for paraphrase detection, R. Socher, E. Huang, J. Pennington, A. Ng, C. Manning, EMNLP 2011; Joint learning of words and meaning representations for open-text semantic parsing, A. Bordes, X. Glorot, J. Weston, Y. Bengio, AISTATS 2012
4 Nov 2013
- Supervised sequence labelling with recurrent neural networks, PhD dissertation, T. U. München, 2008, chapters 4 and 7. A. Graves. Presenter: Georg Schoenherr. Additional: Speech recognition with deep recurrent neural networks, A. Graves, A. Mohamed, G. Hinton, ICASSP 2013
- Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups, IEEE Signal Processing Magazine, Vol 29(6), pp. 82-97, 2012. G. Hinton et al. Presenter: Daniel Maturana
6 Nov 2013
- Modeling Documents with a Deep Boltzmann Machine, UAI 2013. N. Srivastava, R. Salakhutdinov, G. Hinton. Presenter: Siyuan. Additional: Generating text with recurrent neural networks, I. Sutskever, J. Martens, G. Hinton, ICML 2011
- Word representations: A simple and general method for semi-supervised learning, ACL 2010. J. Turian, L. Ratinov, Y. Bengio. Presenter: Sam Thomson
11 Nov 2013
- An empirical evaluation of deep architectures on problems with many factors of variation, ICML 2007. H. Larochelle, D. Erhan, A. Courville, J. Bergstra, Y. Bengio. Presenter: Ran Chen
- The Difficulty of Training Deep Architectures and the Effect of Unsupervised Pre-Training, AISTATS 2009. D. Erhan, P.-A. Manzagol, Y. Bengio, S. Bengio, P. Vincent. Presenter: Ankur Gandhe
13 Nov 2013
- Extracting and Composing Robust Features with Denoising Autoencoders, ICML 2008. P. Vincent, H. Larochelle, Y. Bengio, P.-A. Manzagol. Presenter: Pallavi Baljekar
- Improving neural networks by preventing co-adaptation of feature detectors. G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R. R. Salakhutdinov. Presenter: Subhodeep Moitra
18 Nov 2013
- A theory of deep learning architectures for sensory perception: the ventral stream. Fabio Anselmi, Joel Z. Leibo, Lorenzo Rosasco, Jim Mutch, Andrea Tacchetti, Tomaso Poggio. Presenter: Dipan Pal
20 Nov 2013
- No more pesky learning rates, ICML 2013. Tom Schaul, Sixin Zhang and Yann LeCun. Presenter: Georg Schoenherr. Additional: No more pesky learning rates: supplementary material
- On the importance of initialization and momentum in deep learning, JMLR 28(3): 1139-1147, 2013. Ilya Sutskever, James Martens, George Dahl, Geoffrey Hinton. Presenter: Kartik Goyal. Additional: supplementary material for the paper
25 Nov 2013
- Guest lecture: Quoc Le
27 Nov 2013
- A multi-layer sparse coding network learns contour coding from natural images, Vision Research 42(12): 1593-1605, 2002. Patrik O. Hoyer and Aapo Hyvärinen
- Sparse Feature Learning for Deep Belief Networks, NIPS 2007. Marc'Aurelio Ranzato, Y-Lan Boureau, Yann LeCun
- Sparse deep belief net model for visual area V2, NIPS 2007. Honglak Lee, Chaitanya Ekanadham, Andrew Y. Ng
- Deep Sparse Rectifier Neural Networks, JMLR 16: 315-323, 2011. Xavier Glorot, Antoine Bordes, Yoshua Bengio
To be arranged
- Exploring strategies for training deep neural networks, Journal of Machine Learning Research, Vol. 1, pp. 1-40, 2009. H. Larochelle, Y. Bengio, J. Louradour, P. Lamblin
- Why Does Unsupervised Pre-training Help Deep Learning?, AISTATS 2010. D. Erhan, A. Courville, Y. Bengio, P. Vincent
- Understanding the difficulty of training deep feedforward neural networks, AISTATS 2010. X. Glorot and Y. Bengio
- A Provably Efficient Algorithm for Training Deep Networks, arXiv:1304.7045, 2013. R. Livni, S. Shalev-Shwartz, O. Shamir
Facebook Launches Advanced AI Effort to Find Meaning in Your Posts
A technique called deep learning could help Facebook understand its users and their data better.
By Tom Simonite on September 20, 2013
Facebook's piles of data on people's lives could allow it to push the boundaries of what can be done with the emerging AI technique known as deep learning. Facebook is set to get an even better understanding of the 700 million people who use the social network to share details of their personal lives each day. A new research group within the company is working on an emerging and powerful approach to artificial intelligence known as deep learning, which uses simulated networks of brain cells to process data. Applying this method to data shared on Facebook could allow for novel features and perhaps boost the company's ad targeting. Deep learning has shown potential as the basis for software that could work out the emotions or events described in text even if they aren't explicitly referenced, recognize objects in photos, and make sophisticated predictions about people's likely future behavior. The eight-person group, known internally as the AI team, only recently started work, and details of its experiments are still secret. But Facebook's chief technology officer, Mike Schroepfer, will say that one obvious way to use deep learning is to improve the news feed, the personalized list of recent updates he calls Facebook's "killer app." The company already uses conventional machine learning techniques to prune the 1,500 updates that average Facebook users could possibly see down to 30 to 60 that are judged most likely to be important to them. Schroepfer says Facebook needs to get better at picking the best updates because its users are generating more data and using the social network in different ways.
"The data set is increasing in size, people are getting more friends, and with the advent of mobile, people are online more frequently," Schroepfer told MIT Technology Review. "It's not that I look at my news feed once at the end of the day; I constantly pull out my phone while I'm waiting for my friend or I'm at the coffee shop. We have five minutes to really delight you." Schroepfer says deep learning could also be used to help people organize their photos or choose which is the best one to share on Facebook. In looking into deep learning, Facebook follows its competitors Google and Microsoft, which have used the approach to impressive effect in the past year. Google has hired and acquired leading talent in the field (see "10 Breakthrough Technologies 2013: Deep Learning"), and last year it created software that taught itself to recognize cats and other objects by reviewing stills from YouTube videos. The underlying technology was later used to slash the error rate of Google's voice recognition services (see "Google's Virtual Brain Goes to Work"). Meanwhile, researchers at Microsoft have used deep learning to build a system that translates speech from English to Mandarin Chinese in real time (see "Microsoft Brings Star Trek's Voice Translator to Life"). Chinese Web giant Baidu also recently established a Silicon Valley research lab to work on deep learning. Less complex forms of machine learning have underpinned some of the most useful features developed by major technology companies in recent years, such as spam detection systems and facial recognition in images. The largest companies have now begun investing heavily in deep learning because it can deliver significant gains over those more established techniques, says Elliot Turner, founder and CEO of AlchemyAPI, which rents access to its own deep learning software for text and images.
"Research into understanding images, text, and language has been going on for decades, but the typical improvement a new technique might offer was a fraction of a percent," he says. "In tasks like vision or speech, we're seeing 30 percent-plus improvements with deep learning." The newer technique also allows much faster progress in training a new piece of software, says Turner. Conventional forms of machine learning are slower because before data can be fed into learning software, experts must manually choose which features of it the software should pay attention to, and they must label the data to signify, for example, that certain images contain cars. Deep learning systems can learn with much less human intervention because they can figure out for themselves which features of the raw data are most significant. They can even work on data that hasn't been labeled, as Google's cat-recognizing software did. Systems able to do that typically use software that simulates networks of brain cells, known as neural nets, to process data. They require more powerful collections of computers to run. Facebook's AI group will work on applications that can help the company's products as well as on more general research that will be made public, says Srinivas Narayanan, an engineering manager at Facebook who's helping to assemble the new group. He says one way Facebook can help advance deep learning is by drawing on its recent work creating new types of hardware and software to handle large data sets (see "Inside Facebook's Not-So-Secret New Data Center"). "It's both a software and a hardware problem together; the way you scale these networks requires very deep integration of the two," he says. Facebook hired deep learning expert Marc'Aurelio Ranzato away from Google for its new group.
Other members include Yaniv Taigman, cofounder of the facial recognition startup Face.com (see "When You're Always a Familiar Face"); computer vision expert Lubomir Bourdev; and veteran Facebook engineer Keith Adams. Source: http://www.technologyreview.com/news/519411/facebook-launches-advanced-ai-effort-to-find-meaning-in-your-posts/
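The "figure out features for themselves" behavior the article describes can be illustrated with a toy linear autoencoder. This is my own minimal numpy sketch, unrelated to Facebook's actual systems: trained only to reconstruct unlabeled data, the network discovers the one direction in the data that matters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unlabeled toy data: 200 points scattered near a 1-D direction in 5-D space.
latent = rng.normal(size=(200, 1))
X = latent @ rng.normal(size=(1, 5)) + 0.05 * rng.normal(size=(200, 5))

# Linear autoencoder with a single hidden unit: compress to a 1-D code,
# then reconstruct. The only training signal is reconstruction error --
# no labels anywhere.
W_enc = rng.normal(scale=0.1, size=(5, 1))
W_dec = rng.normal(scale=0.1, size=(1, 5))
lr = 0.02
for _ in range(5000):
    code = X @ W_enc                      # encode
    err = code @ W_dec - X                # reconstruction error
    grad_dec = code.T @ err / len(X)      # gradient of squared loss w.r.t. decoder
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

# After training, the network has found the significant direction on its own.
mse = float(np.mean(((X @ W_enc) @ W_dec - X) ** 2))
print(f"reconstruction MSE: {mse:.4f}")
```

Deep learning stacks many such feature-discovering layers; this single linear layer just makes the unsupervised principle visible.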
Title: Descriptive, Mechanistic and Interpretive Models of Primary Visual Cortex
Speakers: Xiao Da, lecturer, School of Computer Science, Beijing University of Posts and Telecommunications; Yuan Xingyuan, formerly a senior algorithm engineer for data mining and parallel computing at Taobao, currently on a career break.
Outline:
1. Descriptive models (What): * Responses of a Neuron in an Intact Cat Brain (video: Hubel Wiesel - Cortical Neuron - V1, http://v.youku.com/v_show/id_XNDc0MTkxODc2.html ) * Contrast sensitivity of humans * Receptive fields and edge detection program demo
2. Mechanistic models (How): * Oriented receptive fields and position-less receptive fields * Fourier decomposition hypothesis * Building a self-organizing map for V1
3. Interpretive models (Why): * What is the Best Multi-Stage Architecture for Object Recognition
4. The columnar organization of the neocortex and its implications for computer vision
References:
【NB】Matteo Carandini (2012) Area V1. Scholarpedia, 7(7):12105. http://www.scholarpedia.org/article/Area_V1
【NB】【CM】Carandini M, et al. (2005) Do we know what the early visual system does? Journal of Neuroscience, 25:10577-10597.
【NB】Douglas, RJ and Martin, KAC (2007) Recurrent neuronal circuits in the neocortex. Current Opinion in Neurobiology, 17:496-500.
【NB】Douglas, RJ and Martin, KAC (2010) Canonical cortical circuits. Chapter 2 in Handbook of Brain Microcircuits, 15-21.
【ML】Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato, and Yann LeCun. (2009) What is the Best Multi-Stage Architecture for Object Recognition? In Proc. International Conference on Computer Vision (ICCV'09).
(The tag before each reference marks its type: NB = neurobiological findings, CM = computational model, ML = machine learning algorithm, SP = statistical physics.)
Duobei video: http://www.duobei.com/room/3011311368
Slides: Yuan Xingyuan, What do we know about V1, http://vdisk.weibo.com/s/u4Vws15JLvz_z
Code demo site: http://www.demogng.de/
Xiao Da, Modular organization of neocortex and its implication for computer vision, http://vdisk.weibo.com/s/u4Vws15JLvz_l
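The oriented receptive fields in part 2 of the outline are classically modeled as Gabor filters: a sinusoidal grating under a Gaussian envelope. As a quick illustration (my own numpy sketch, not the speakers' demo code), an odd-phase Gabor responds strongly to an edge at its preferred orientation and barely at all to the orthogonal one:

```python
import numpy as np

def gabor(size=21, wavelength=6.0, theta=0.0, sigma=3.0):
    """Odd (sine-phase) Gabor filter: a sinusoidal grating under a Gaussian
    envelope, the standard descriptive model of a V1 simple-cell receptive
    field. theta sets the preferred orientation."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_t = x * np.cos(theta) + y * np.sin(theta)   # rotate coordinates
    envelope = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    return envelope * np.sin(2 * np.pi * x_t / wavelength)

# A vertical step edge drives the vertically tuned filter strongly and the
# horizontally tuned one hardly at all: orientation selectivity in miniature.
image = np.zeros((21, 21))
image[:, 11:] = 1.0                               # vertical edge
resp_vert = abs(np.sum(gabor(theta=0.0) * image))
resp_horiz = abs(np.sum(gabor(theta=np.pi / 2) * image))
print(resp_vert, resp_horiz)
```

A bank of such filters at several orientations and scales is essentially the first stage of the multi-stage architectures discussed in part 3.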
Description of the Research Plan
Title: A novel situated social-personalized learning approach
Keywords: e-learning, personalized learning, knowledge space construction, situated knowledge representation, learning path analysis, cognitive learning process, and e-learning assistant agent
Objectives: This research program is to develop an intelligent E-Learning Assistant (ELA) agent for personalized learning. A novel situated social-personalized learning approach is proposed in this proposal. The research plan covers three aspects: the first is to develop a situated knowledge representation and organization (SKRO) model; the second is to research a nonlinear personal learning path or behavior model; the third is to research a social-personalized learning mechanism based on machine learning algorithms. The details of these three aspects are as follows:
1. Research on the situated knowledge representation and organization model
A knowledge representation and organization method is the fundamental element for building an e-learning system. Here, guided by the belief of cognitive learning and pedagogy that "knowledge is situated and learning is also situated," a situated knowledge representation and organization (SKRO) model is proposed for building a situated e-learning environment. The SKRO model can not only construct contextual relationships among knowledge, but also build a map of different kinds of heterogeneous knowledge (e.g. text, audio, video, animation, images and others). The SKRO model is also a modeling foundation for the personal learning process and for how learners learn.
2. Research on a nonlinear personal learning path or behavior model
With the SKRO model, the concept of constructive memory is proposed for storing learning content about the learners' situations, goals, and learned knowledge. More importantly, all this learned knowledge will be recorded according to a time series and a specific sequence of learning activities.
An individual learning path (actually a map rather than a path) is modeled here to record the learner's learning process and cognitive process. Finally, the personal learning space (PLS) is constructed from the personal learned knowledge and the nonlinear learning path/behavior.
3. The social-personalized learning mechanism based on machine learning algorithms
The third research highlight is our belief that one learner's learning process and cognitive process may be helpful to others. Therefore, a novel social-personalized learning mechanism based on machine learning algorithms is proposed for personalized learning. The social-personalized learning mechanism means that the intelligent ELA agent can sense social learning processes (i.e. each learner's sequence of learning activities, in any order) to analyze the relevance of knowledge. After that, it constructs a situated knowledge relevance model by employing statistics-based machine learning algorithms. Such a model can tell one learner what others' learning processes are and how they learn, by analyzing personal and social learning paths. At the same time, the social-personalized learning mechanism can also tell the learner what knowledge he should learn next according to his situation and goals. The framework of the intelligent E-Learning Assistant (ELA) agent is as follows:
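The proposal stops short of implementation details. Purely as a hypothetical sketch (every name here, such as `KnowledgeItem` and `relevance_counts`, is mine, not the authors'), the heterogeneous knowledge map, the time-ordered learning path, and a crude statistics-based relevance model might look like:

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class KnowledgeItem:
    """One node of the situated knowledge map; `kind` marks the medium
    (text, audio, video, ...), `links` its contextual relationships."""
    name: str
    kind: str
    links: set = field(default_factory=set)

class LearnerSpace:
    """A learner's personal learning space: a time-ordered activity path."""
    def __init__(self):
        self.path = []

    def record_activity(self, item_name):
        self.path.append(item_name)      # the sequence matters, not just membership

def relevance_counts(spaces):
    """'Social' relevance: how often two items are studied consecutively
    across all learners' recorded paths."""
    counts = defaultdict(int)
    for s in spaces:
        for a, b in zip(s.path, s.path[1:]):
            counts[frozenset((a, b))] += 1
    return counts

# A tiny heterogeneous knowledge map and two learners' paths.
km = {name: KnowledgeItem(name, kind) for name, kind in [
    ("intro-video", "video"), ("loops-text", "text"),
    ("loops-quiz", "quiz"), ("recursion-text", "text")]}
km["intro-video"].links.add("loops-text")   # a contextual relationship

alice, bob = LearnerSpace(), LearnerSpace()
for step in ("intro-video", "loops-text", "loops-quiz"):
    alice.record_activity(step)
for step in ("intro-video", "loops-text", "recursion-text"):
    bob.record_activity(step)

counts = relevance_counts([alice, bob])
print(counts[frozenset(("intro-video", "loops-text"))])  # → 2
```

An ELA agent along the proposal's lines would replace the co-occurrence count with a proper statistical model, but the data shapes are the same: a typed knowledge graph plus per-learner activity sequences.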
I discussed a bio-inspired deep learning idea with an author of an ICML'13 paper (Maxout Networks). The reply I got was, in effect: there is nothing fatally wrong with your idea; the question is whether it can produce better results. "Other people, including me, have worked on fairly similar things and didn't really get them to work. That doesn't mean it's impossible though; just that you might need to try a few different tricks to go with it that we haven't tried. In general it's very hard to predict theoretically how well a deep learning method will work in advance. You just have to get your hands dirty and try a lot of them." Granted, "practice is the sole criterion for testing truth," but I still feel that deep learning at this stage looks less like science and more like alchemy! Next I plan to spend some quality time with pylearn2 and cuda-convnet.
Title: Overview: deep architectures in brain and machine
Outline:
1. An overview of primate visual pathways
2. What problems does the visual system solve? An object recognition perspective
3. A very general introduction to deep learning and some personal comments
4. Neural representation benchmark: Brain vs Machine
Slides: http://www.kuaipan.cn/file/id_2602161770890478.htm
Technical Challenges
1. Ensuring participants can successfully use the technology
2. Resisting the urge to use technology simply because it is available
Organizational Challenges
3. Overcoming the idea that blended learning is not as effective as traditional classroom training
4. Redefining the role of the facilitator
5. Managing and monitoring participant progress
Instructional Challenges
6. Looking at how to teach, not just what to teach
7. Matching the best delivery medium to the performance objective
8. Keeping online offerings interactive rather than just "talking at" participants
9. Ensuring participant commitment and follow-through with "non-live" elements
10. Ensuring all the elements of the blend are coordinated
Source: "Top 10 Challenges of Blended Learning," by Jennifer Hofmann
Reposted from: http://blog.csdn.net/zouxy09/article/details/8782018
X. Summary and outlook
1) Deep learning in summary
Deep learning refers to algorithms that automatically learn multi-level (complex) representations of the latent (hidden) distribution of the data being modeled. In other words, deep learning algorithms automatically extract the low-level and high-level features needed for classification. "High-level" here means that a feature can depend hierarchically on other features. In machine vision, for example, a deep learning algorithm learns a low-level representation of the raw image, such as edge detectors or wavelet filters, then builds further representations on top of those, such as linear or nonlinear combinations of them, and repeats this process until it obtains a high-level representation.
Deep learning yields features that represent the data better, and because the model has many layers and many parameters, its capacity suffices to represent large-scale data. It therefore achieves better results on large training sets for problems such as images and speech, whose features are not obvious (they require manual design and often have no intuitive physical meaning). Moreover, from the pattern-recognition viewpoint of features plus classifier, the deep learning framework combines the feature extractor and the classifier in a single framework and learns the features from data, greatly reducing the enormous workload of hand-engineering features (currently where industry engineers spend the most effort). So it is not only more effective but also more convenient to use; it is a framework well worth attention, and everyone working in ML should become familiar with it.
Of course, deep learning itself is not perfect, nor is it a cure-all for every ML problem, and it should not be inflated into something omnipotent.
2) The future of deep learning
A great deal of research on deep learning remains to be done. The current focus is still on borrowing methods from other areas of machine learning that can be applied in deep learning, especially dimensionality reduction. One example is sparse coding, which draws on compressed-sensing theory to reduce the dimensionality of high-dimensional data so that a vector with very few non-zero elements can accurately represent the original high-dimensional signal. Another example is semi-supervised manifold learning, which measures the similarity between training samples and projects that similarity structure from the high-dimensional space into a low-dimensional one. A further encouraging direction is evolutionary programming approaches, which can perform conceptual adaptive learning and modify core architectures by minimizing an engineered energy.
Deep learning still has many core problems to solve:
(1) For a given framework, for inputs of how many dimensions does it perform well (for images, perhaps millions of dimensions)?
(2) Which architectures are effective for capturing short-term and long-term temporal dependencies?
(3) How can a given deep learning architecture fuse information from multiple modalities?
(4) What are the right mechanisms for strengthening a given deep learning architecture so as to improve its robustness and its invariance to distortions and missing data?
(5) Are there other deep-model learning algorithms that are more effective and better grounded in theory?
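The sparse-coding direction mentioned above, representing a high-dimensional signal by a code with very few non-zero entries, can be sketched in a few lines of numpy using the classic iterative soft-thresholding (ISTA) scheme. This is a toy illustration of the general idea, not any of the specific algorithms the post cites:

```python
import numpy as np

def ista(D, x, lam=0.1, lr=0.1, steps=500):
    """Sparse coding by iterative soft-thresholding (ISTA):
    minimize 0.5*||D @ a - x||^2 + lam*||a||_1 over the code a."""
    a = np.zeros(D.shape[1])
    for _ in range(steps):
        a = a - lr * D.T @ (D @ a - x)    # gradient step on the data fit
        a = np.sign(a) * np.maximum(np.abs(a) - lr * lam, 0.0)  # shrink to zero
    return a

rng = np.random.default_rng(1)
D = rng.normal(size=(8, 12))
D /= np.linalg.norm(D, axis=0)     # overcomplete dictionary: 12 atoms in 8-D
true_code = np.zeros(12)
true_code[[3, 9]] = [1.5, -2.0]    # the signal is built from just two atoms
x = D @ true_code

a = ista(D, x)
print("non-zero entries in recovered code:", int(np.count_nonzero(a)))
```

The soft-threshold step is what produces exact zeros: most of the twelve code entries end up switched off, and the dominant surviving entries correspond to the atoms that actually generated the signal.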
Exploring new feature-extraction models is a topic worth deeper study. Effective parallelizable training algorithms are another direction worth investigating: the current mini-batch stochastic gradient optimization algorithms are hard to parallelize across multiple machines. The usual remedy is to accelerate learning with graphics processing units, but a single machine's GPU is not adequate for large-scale recognition tasks or similarly sized data sets. On the applications side, how to make full use of deep learning to strengthen the performance of traditional learning algorithms remains a research focus in every field.
XI. References and deep learning resources (continuously updated...)
First, the Weibo accounts of leading machine learning researchers: @余凯_西二旗民工; @老师木; @梁斌penny; @张栋_机器学习; @邓侃; @大数据皮东; @djvu9......
(1) Deep Learning http://deeplearning.net/
(2) Deep Learning Methods for Vision http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/
(3) Neural Network for Recognition of Handwritten Digits http://www.codeproject.com/Articles/16650/Neural-Network-for-Recognition-of-Handwritten-Digi
(4) Training a deep autoencoder or a classifier on MNIST digits http://www.cs.toronto.edu/~hinton/MatlabForSciencePaper.html
(5) Ersatz: deep neural networks in the cloud http://www.ersatz1.com/
(6) Deep Learning http://www.cs.nyu.edu/~yann/research/deep/
(7) Invited talk: A Tutorial on Deep Learning by Dr. Kai Yu http://vipl.ict.ac.cn/News/academic-report-tutorial-deep-learning-dr-kai-yu
(8) CNN - Convolutional neural network class http://www.mathworks.cn/matlabcentral/fileexchange/24291
(9) Yann LeCun's Publications http://yann.lecun.com/exdb/publis/index.html#lecun-98
(10) LeNet-5, convolutional neural networks http://yann.lecun.com/exdb/lenet/index.html
(11) Geoffrey E. Hinton's homepage http://www.cs.toronto.edu/~hinton/
(12) Sparse coding simulation software http://redwood.berkeley.edu/bruno/sparsenet/
(13) Andrew Ng's homepage http://robotics.stanford.edu/~ang/
(14) Stanford deep learning tutorial http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial
(15) How exactly do "deep neural networks" work (Zhihu) http://www.zhihu.com/question/19833708?group_id=15019075#1657279
(16) A shallow understanding on deep learning http://blog.sina.com.cn/s/blog_6ae183910101dw2z.html
(17) Bengio's Learning Deep Architectures for AI http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf
(18) Andrew Ng's talk video: http://techtalks.tv/talks/machine-learning-and-ai-via-brain-simulations/57862/
(19) CVPR 2012 tutorial: http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/tutorial_p2_nnets_ranzato_short.pdf
(20) Impressions of Andrew Ng's Tsinghua talk http://blog.sina.com.cn/s/blog_593af2a70101bqyo.html
(21) Kai Yu: CVPR12 Tutorial on Deep Learning Sparse Coding
(22) Honglak Lee: Deep Learning Methods for Vision
(23) Andrew Ng: Machine Learning and AI via Brain simulations
(24) Deep Learning [2, 3] http://blog.sina.com.cn/s/blog_46d0a3930101gs5h.html
(25) This little thing called deep learning... http://blog.sina.com.cn/s/blog_67fcf49e0101etab.html
(26) Yoshua Bengio, U. Montreal: Learning Deep Architectures
(27) Kai Yu: A Tutorial on Deep Learning
(28) Marc'Aurelio Ranzato: NEURAL NETS FOR VISION
(29) Unsupervised feature learning and deep learning http://blog.csdn.net/abcjennifer/article/details/7804962
(30) A frontier of machine learning: Deep Learning http://elevencitys.com/?p=1854
(31) Machine learning: deep learning http://blog.csdn.net/abcjennifer/article/details/7826917
(32) Convolutional neural networks http://wenku.baidu.com/view/cd16fb8302d276a200292e22.html
(33) A brief discussion of the basic ideas and methods of deep learning http://blog.csdn.net/xianlingmao/article/details/8478562
(34) Deep neural networks http://blog.csdn.net/txdb/article/details/6766373
(35) Google's cat recognition: a new breakthrough in artificial intelligence http://www.36kr.com/p/122132.html
(36) Kai Yu, Deep learning: the new wave of machine learning, Technical News http://blog.csdn.net/datoubo/article/details/8577366
(37) Geoffrey Hinton: UCL Tutorial on Deep Belief Nets
(38) Learning Deep Boltzmann Machines http://web.mit.edu/~rsalakhu/www/DBM.html
(39) Efficient Sparse Coding Algorithm http://blog.sina.com.cn/s/blog_62af19190100gux1.html
(40) Itamar Arel, Derek C. Rose, and Thomas P. Karnowski: Deep Machine Learning - A New Frontier in Artificial Intelligence Research
(41) Francis Quintal Lauzon: An introduction to deep learning
(42) Tutorial on Deep Learning and Applications
(43) Boltzmann neural network models and learning algorithms http://wenku.baidu.com/view/490dcf748e9951e79b892785.html
(44) Deep Learning and Knowledge Graph ignite the big data revolution http://blog.sina.com.cn/s/blog_46d0a3930101fswl.html
(45) ......
@ARTICLE{XAQZ14, author = {Xu, Shuo and An, Xin and Qiao, Xiaodong and Zhu, Lijun}, title = {Multi-Task Least-Squares Support Vector Machines}, journal = {Multimedia Tools and Applications}, year = {2014}, volume = {71}, number = {2}, pages = {699--715}, issn = {1380-7501}, abstract = {There are often underlying cross relatednesses amongst multiple tasks, which are discarded directly by traditional single-task learning methods. Since multi-task learning can exploit these relatednesses to further improve the performance, it has attracted extensive attention in many domains including multimedia. It has been shown through a meticulous empirical study that the generalization performance of Least-Squares Support Vector Machine (LS-SVM) is comparable to that of SVM. In order to generalize LS-SVM from single-task to multi-task learning, inspired by the regularized multi-task learning (RMTL), this study proposes a novel multi-task learning approach, multi-task LS-SVM (MTLS-SVM). Similar to LS-SVM, one only solves a convex linear system in the training phase, too. What's more, we unify the classification and regression problems in an efficient training algorithm, which effectively employs the Krylov methods. Finally, experimental results on \emph{school} and \emph{dermatology} validate the effectiveness of the proposed approach.}, doi = {10.1007/s11042-013-1526-5}, keywords = {Multi-Task Learning \sep Least-Square Support Vector Machine (LS-SVM) \sep Multi-Task LS-SVM (MTLS-SVM) \sep Krylov Methods}, } Full text: XAQZ14.pdf
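As the abstract notes, LS-SVM training reduces to a single convex linear system. A minimal single-task sketch in numpy (my own illustration of one common LS-SVM formulation; the paper's MTLS-SVM generalizes this system across multiple tasks):

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """Gaussian (RBF) kernel matrix between row-vector sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def train_lssvm(X, y, C=10.0):
    """LS-SVM training is one symmetric linear solve:
    [[0, 1^T], [1, K + I/C]] @ [b, alpha] = [0, y]."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf(X, X) + np.eye(n) / C
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]                  # bias b, dual weights alpha

def predict(X_train, b, alpha, X_new):
    return np.sign(rbf(X_new, X_train) @ alpha + b)

# Toy two-class problem: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)
b, alpha = train_lssvm(X, y)
acc = float(np.mean(predict(X, b, alpha, X) == y))
print("training accuracy:", acc)
```

Unlike the standard SVM's quadratic program, every training point gets a (generally non-zero) dual weight here; the trade-off for the simple linear solve is the loss of sparsity in the support vectors.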
Source: http://www.wired.com/wiredenterprise/2013/05/hinton/
Computer Brain Escapes Google's X Lab to Supercharge Search
BY ROBERT MCMILLAN, 05.20.13, 6:30 AM
Geoffrey Hinton (right), Alex Krizhevsky, and Ilya Sutskever (left) will do machine-learning work at Google. Photo: U of T
Two years ago Stanford professor Andrew Ng joined Google's X Lab, the research group that's given us Google Glass and the company's driverless cars. His mission: to harness Google's massive data centers and build artificial intelligence systems on an unprecedented scale. He ended up working with one of Google's top engineers to build the world's largest neural network: a kind of computer brain that can learn about reality in much the same way that the human brain learns new things. Ng's brain watched YouTube videos for a week and taught itself which ones were about cats. It did this by breaking down the videos into a billion different parameters and then teaching itself how all the pieces fit together. But there was more. Ng built models for processing the human voice and Google StreetView images. The company quickly recognized this work's potential and shuffled it out of X Labs and into the Google Knowledge Team. Now this type of machine intelligence, called deep learning, could shake up everything from Google Glass, to Google Image Search, to the company's flagship search engine. It's the kind of research that a Stanford academic like Ng could only get done at a company like Google, which spends billions of dollars on supercomputer-sized data centers each year. "At the time I joined Google, the biggest neural network in academia was about 1 million parameters," remembers Ng. "At Google, we were able to build something one thousand times bigger." Ng stuck around until Google was well on its way to using his neural network models to improve a real-world product: its voice recognition software.
But last summer, he invited an artificial intelligence pioneer named Geoffrey Hinton to spend a few months in Mountain View tinkering with the company's algorithms. When Android's Jelly Bean release came out last year, these algorithms cut its voice recognition error rate by a remarkable 25 percent. In March, Google acquired Hinton's company. Now Ng has moved on (he's running an online education company called Coursera), but Hinton says he wants to take this deep learning work to the next level. A first step will be to build even larger neural networks than the billion-node networks he worked on last year. "I'd quite like to explore neural nets that are a thousand times bigger than that," Hinton says. "When you get to a trillion, you're getting to something that's got a chance of really understanding some stuff." Hinton thinks that building neural network models about documents could boost Google Search in much the same way they helped voice recognition. "Being able to take a document and not just view it as, 'It's got these various words in it,' but to actually understand what it's about and what it means," he says. "That's most of AI, if you can solve that." Test images labeled by Hinton's brain. Image: Geoff Hinton. Hinton already has something to build on: Google's knowledge graph, a database of nearly 600 million entities. When you search for something like "The Empire State Building," the knowledge graph pops up all of that information to the right of your search results. It tells you that the building is 1,454 feet tall and was designed by William F. Lamb. Google uses the knowledge graph to improve its search results, but Hinton says that neural networks could study the graph itself and then both cull out errors and improve other facts that could be included. Image search is another promising area. "'Find me an image with a cat wearing a hat.' You should be able to do that fairly soon," Hinton says. Hinton is the right guy to take on this job.
Back in the 1980s he developed the basic computer models used in neural networking. Just two months ago, Google paid an undisclosed sum to acquire Hinton's artificial intelligence company, DNNresearch, and now he's splitting his time between his University of Toronto teaching job and working for Jeff Dean on ways to make Google's products smarter at the company's Mountain View campus. In the past five years, there's been a mini-boom in neural networking as researchers have harnessed the power of graphics processors (GPUs) to build out ever-larger neural networks that can quickly learn from extremely large sets of data. "Until recently... if you wanted to learn to recognize a cat, you had to go and label tens of thousands of pictures of cats," says Ng. "And it was just a pain to find so many pictures of cats and label them." Now with "unsupervised learning algorithms," like the ones Ng used in his YouTube cat work, the machines can learn without the labeling, but to build the really large neural networks, Google had to first write code that would work on such a large number of machines, even when one of the systems in the network stopped working. It typically takes a large number of computers sifting through a large amount of data to train the neural network model. The YouTube cat model, for example, was trained on 16,000 chip cores. But once that was hammered out, it took just 100 cores to be able to spot cats on YouTube. Google's data centers are based on Intel Xeon processors, but the company has started to tinker with GPUs because they are so much more efficient at this neural network processing work, Hinton says. Google is even testing out a D-Wave quantum computer, a system that Hinton hopes to try out in the future. But before then, he aims to test out his trillion-node neural network. "People high up in Google I think are very committed to getting big neural networks to work very well," he says.
Reading list (updated)
Because the literature is far too large for the reading group to cover everything, we follow the most representative work of the one or two most important researchers in each area, and have picked out the papers below for close reading. The tag before each paper marks its type: NB = neurobiological findings, CM = computational model, ML = machine learning algorithm.

Overview: deep architectures in brain and machine (1 session) - James DiCarlo
【NB】Chris I. Baker (2004) Visual Processing in the Primate Brain. In Handbook of Psychology, Biological Psychology, Wiley.
【NB】【CM】DiCarlo JJ, Zoccolan D, Rust NC. (2012) How does the brain solve visual object recognition? Neuron, 73(3):415-34.
【CM】Cadieu CF, et al. (2013) The Neural Representation Benchmark and its Evaluation on Brain and Machine. International Conference on Learning Representations (ICLR) 2013.

Early visual system (retinal ganglion cell, LGN, V1), canonical cortical circuits (1.5 sessions) - Matteo Carandini, Rodney Douglas
【NB】Matteo Carandini (2012) Area V1. Scholarpedia, 7(7):12105. http://www.scholarpedia.org/article/Area_V1
【NB】【CM】Carandini M, et al. (2005) Do we know what the early visual system does? Journal of Neuroscience, 25:10577-10597.
【NB】Douglas, RJ and Martin, KAC (2007) Recurrent neuronal circuits in the neocortex. Current Opinion in Neurobiology, 17:496-500.
【NB】Douglas, RJ and Martin, KAC (2010) Canonical cortical circuits. Chapter 2 in Handbook of Brain Microcircuits, 15-21.
【ML】Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato, and Yann LeCun. (2009) What is the Best Multi-Stage Architecture for Object Recognition? In Proc. International Conference on Computer Vision (ICCV'09).

Learning features (selectivity): sparse coding, cortical maps (0.5 session) - Bruno Olshausen
【CM】Olshausen, B. A., Field, D. J. (1997). Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research, 37(23):3311-3325.
【CM】Bednar JA. (2012) Building a mechanistic model of the development and function of the primary visual cortex. Journal of Physiology (Paris), 106:194-211.

Learning transformations (invariance) (1 session) - Aapo Hyvarinen, Yan Karklin
【CM】Hyvarinen, A. and Hoyer, P. (2001). A two-layer sparse coding model learns simple and complex cell receptive fields and topography from natural images. Vision Research, 41(18):2413-2423.
【CM】Karklin, Y., Lewicki, M. S. (2009). Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457(7225):83-85.
【CM】Adelson E.H. and Bergen J.R. (1985) Spatiotemporal energy models for the perception of motion. Journal Opt. Soc. Am.
【ML】Ian J. Goodfellow, Quoc V. Le, Andrew M. Saxe, Honglak Lee, and Andrew Y. Ng. (2009) Measuring invariances in deep networks. Advances in Neural Information Processing Systems (NIPS).
Supplementary:
【ML】Q.V. Le, et al. Building high-level features using large scale unsupervised learning. ICML, 2012.

V2 (1 session)
【NB】Lawrence C. Sincich and Jonathan C. Horton (2005) The Circuitry of V1 and V2: Integration of Color, Form, and Motion. Annu. Rev. Neurosci. 28:303-326.
【NB】Roe AW, Lu HD, Chen G (2008) Functional architecture of area V2. Encyclopedia of Neuroscience (Squire L, ed.). Elsevier, Oxford, UK.
【CM】Cadieu C.F., Olshausen B.A. (2012) Learning Intermediate-Level Representations of Form and Motion from Natural Movies. Neural Computation.
【ML】Zou, W.Y., Zhu, S., Ng, A., and Yu, K. (2012) Deep learning of invariant features via simulated fixations in video. Advances in Neural Information Processing Systems (NIPS).
【CM】Gutmann MU, Hyvarinen A (2013) A three-layer model of natural image statistics. Journal of Physiology-Paris.

Special discussion: learning mid-level features (format to be decided)
【ML】Memisevic, R., Exarchakis, G. (2013) Learning invariant features by harnessing the aperture problem. International Conference on Machine Learning (ICML).
【ML】Kihyuk Sohn, Guanyu Zhou, Chansoo Lee, and Honglak Lee. (2013) Learning and Selecting Features Jointly with Point-wise Gated Boltzmann Machines. Proceedings of the 30th International Conference on Machine Learning (ICML).
【ML】Roni Mittelman, Honglak Lee, Benjamin Kuipers, and Silvio Savarese. (2013) Weakly Supervised Learning of Mid-Level Features with Beta-Bernoulli Process Restricted Boltzmann Machines. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

V4, shape perception (1 session) - Charles E. Connor
【NB】Pasupathy, A., Connor, C.E. (2002) Population coding of shape in area V4. Nature Neuroscience 5:1332-1338.
【NB】Connor, C.E. (2007) Transformation of shape information in the ventral pathway. Current Opinion in Neurobiology 17:140-147.
【NB】【CM】Roe AW, et al. (2012) Towards a unified theory of visual area V4. Neuron 74(2):12-29.
【NB】【CM】Cadieu C, Kouh M, Pasupathy A, Connor CE, Riesenhuber M, Poggio T. (2007) A model of V4 shape selectivity and invariance. Journal of Neurophysiology, 98(3):1733-1750.

IT, object and face recognition (1 session) - Keiji Tanaka, Doris Tsao
【NB】Charles G. Gross (2008) Inferior temporal cortex. Scholarpedia, 3(12):7294. http://www.scholarpedia.org/article/Inferior_temporal_cortex
【NB】Tanaka, K. (1996). Inferotemporal cortex and object vision. Annual Review of Neuroscience, 19:109-139.
【NB】【CM】Tsao DY, Livingstone MS. (2008) Mechanisms for face perception. Annual Review of Neuroscience, 31:411-438.
【NB】【CM】Tsao D.Y., Cadieu C. and Livingstone M. (2010) Object Recognition: Physiological and Computational Insights. Chapter 24 in Primate Neuroethology, M. Platt and A. Ghazanfar (eds.), Oxford University Press.

Hippocampus, memory, sleep (1 session)
To be added.

Development and evolution of the visual system; vision in lower animals (1 session) - Jon Kaas
The Evolution of the Visual System in Primates
More to be added.

Neuronal oscillation and synchrony (1 session) - Christoph von der Malsburg, Markus Siegel
【NB】Siegel M., Donner T. H., Engel A. K. (2012) Spectral fingerprints of large-scale neuronal interactions. Nature Reviews Neuroscience 13:121-134.
【NB】Varela F (2001) The brainweb: Phase synchronization and large-scale integration. Nature Reviews Neuroscience 2:229-239.
【CM】Donner T. H., Siegel M. (2011) A framework for local cortical oscillation patterns. Trends in Cognitive Sciences 15(5):191-199.
【CM】von der Malsburg C. (1999) The What and Why of Binding: The Modeler's Perspective. Neuron, 24:95-104.
scikit-learn is a Python module integrating classical machine learning algorithms in the tightly-knit world of scientific Python packages. It aims to provide simple and efficient solutions to learning problems that are accessible to everybody and reusable in various contexts: machine learning as a versatile tool for science and engineering. Website: http://scikit-learn.org/dev/index.html
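As a minimal sketch of the style of API this describes (assuming a recent scikit-learn installation; the dataset and estimator here are my own arbitrary choices, not prescribed by the project blurb):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a bundled toy dataset, split it, fit an estimator, and score it:
# the fit/predict/score pattern is shared across scikit-learn estimators.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # mean accuracy on held-out data
```

The same fit/score interface applies to classifiers, regressors and clusterers alike, which is what makes the library "reusable in various contexts".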
Original page: http://www.mathworks.cn/matlabcentral/fileexchange/38310-deep-learning-toolbox

Description

PLEASE GO TO https://github.com/rasmusbergpalm/DeepLearnToolbox FOR NEWEST VERSION

DeepLearnToolbox: a Matlab toolbox for Deep Learning. Deep Learning is a new subfield of machine learning that focuses on learning deep hierarchical models of data. It is inspired by the human brain's apparent deep (layered, hierarchical) architecture. A good overview of the theory of Deep Learning is Learning Deep Architectures for AI. For a more informal introduction, see the following videos by Geoffrey Hinton and Andrew Ng:
The Next Generation of Neural Networks (Hinton, 2007)
Recent Developments in Deep Learning (Hinton, 2010)
Unsupervised Feature Learning and Deep Learning (Ng, 2011)

If you use this toolbox in your research, please cite: Prediction as a candidate for learning deep hierarchical models of data (Palm, 2012).

Directories included in the toolbox:
NN/ - A library for Feedforward Backpropagation Neural Networks
CNN/ - A library for Convolutional Neural Networks
DBN/ - A library for Deep Belief Networks
SAE/ - A library for Stacked Auto-Encoders
CAE/ - A library for Convolutional Auto-Encoders
util/ - Utility functions used by the libraries
data/ - Data used by the examples
tests/ - Unit tests to verify the toolbox is working

For references on each library, check REFS.md.

Required products: MATLAB, release 7.11 (R2010b).
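For orientation, the kind of model the NN/ library covers can be sketched in a few lines of NumPy (my illustration only, not the toolbox's API): a feedforward network with one hidden layer, trained by backpropagation on XOR.

```python
import numpy as np

# Tiny feedforward net trained by backpropagation on XOR; a hypothetical
# illustration of what feedforward-backprop libraries implement.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
W1, b1 = rng.normal(0, 1, (2, 8)), np.zeros(8)
W2, b2 = rng.normal(0, 1, (8, 1)), np.zeros(1)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for _ in range(5000):
    h = sigmoid(X @ W1 + b1)                 # forward pass, hidden layer
    out = sigmoid(h @ W2 + b2)               # forward pass, output layer
    d_out = (out - y) * out * (1 - out)      # output-layer error signal
    d_h = (d_out @ W2.T) * h * (1 - h)       # error backpropagated to hidden layer
    W2 -= 0.5 * h.T @ d_out; b2 -= 0.5 * d_out.sum(0)   # gradient descent
    W1 -= 0.5 * X.T @ d_h;   b1 -= 0.5 * d_h.sum(0)

mse = float(np.mean((out - y) ** 2))  # should fall well below the 0.25 chance level
```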
Reposted from my post on blogspot.com: http://marchonscience.blogspot.com/2013/04/compressed-sensing-of-eeg-using-dwt-as.html

Since my paper Compressed Sensing of EEG for Wireless Telemonitoring with Low Energy Consumption and Inexpensive Hardware (IEEE T-BME, vol. 60, no. 1, 2013) was published, many people have asked me how to do compressed sensing of EEG using wavelets. Their problem was that Matlab has no function to generate the DWT basis matrix (i.e. the matrix D in my paper); one has to generate such matrices using other wavelet toolboxes. I have now updated my code, adding a guide to generating such dictionary matrices using WaveLab (http://www-stat.stanford.edu/~wavelab/), and a demo showing how to use a DWT basis matrix as the dictionary matrix for compressed sensing of EEG (demo_useDWT.m). The code can be downloaded here. If the download link above does not work (it may require getting around the firewall), email me for the code.
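For readers without WaveLab at hand, the role of the matrix D can be illustrated with a hand-built orthonormal Haar DWT matrix in NumPy (my own sketch; the paper's D is built from WaveLab's wavelet filters, not necessarily Haar). The signal x is represented as x = D·θ with θ sparse, and for an orthonormal analysis matrix W the synthesis dictionary D is simply its transpose:

```python
import numpy as np

def haar_step(n):
    """Single-level orthonormal Haar analysis matrix for even n."""
    h = np.zeros((n, n))
    for k in range(n // 2):
        h[k, 2 * k] = h[k, 2 * k + 1] = 1 / np.sqrt(2)   # pairwise averages
        h[n // 2 + k, 2 * k] = 1 / np.sqrt(2)            # pairwise differences
        h[n // 2 + k, 2 * k + 1] = -1 / np.sqrt(2)
    return h

def haar_dwt_matrix(n, levels):
    """Multi-level orthonormal Haar DWT analysis matrix W (theta = W @ x)."""
    W = np.eye(n)
    size = n
    for _ in range(levels):
        step = np.eye(n)
        step[:size, :size] = haar_step(size)  # transform only the approximation part
        W = step @ W
        size //= 2
    return W

W = haar_dwt_matrix(256, 4)   # e.g. a 256-sample EEG epoch, 4 decomposition levels
D = W.T                       # orthonormal, so the dictionary is the transpose
```

Because W is orthonormal, D @ W is the identity, so D can be passed directly as the dictionary matrix to a sparse recovery solver.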
Corpus use and learning to translate
Guy Aston
1999. Textus 12: 289-314.

Given the fact that total bi-directional correspondences are extremely rare phenomena, we often have to search for second-best matches, and that means we have to select one of several alternatives, namely the one that fits the context best. It should be possible to come up with this match if the translator consults a large corpus and, by identifying the context pattern in question, finds the lexical unit that would `naturally' be used in such a situation. All it needs is an operational definition of context and context pattern. (Teubert 1996: 241)

0. Introduction

A paper on the learning of translation must espouse some view of what translation involves. So let me state some premises. Following Joseph (1998), I take it that translation involves interpreting a source text (ST), and then generating a target text (TT) in another language which strategically directs its intended audience to an interpretation of it - generally one which in certain respects matches the interpretation given to the source text. From this point of view - substantially corresponding, as I understand it, to Nida's notion of dynamic equivalence (1964) - would-be translators must develop interpretative and strategic competencies which they may well lack in at least one of the languages involved for a particular task, since translators are rarely balanced bilinguals, nor always specialists in the discourse domain in question. In addition, translating - like editing - calls for the ability to elaborate, compare and evaluate different strategies and interpretations in the light of externally-defined contextual restrictions. Translators typically work under commission, where specific target audiences, and specific interpretations of the source and/or of the target text are implied (Reiss 1981).
The translator thus needs resources which can suggest possible and probable interpretations of the ST, which can indicate effective strategies for achieving particular interpretations of the TT, and which can facilitate the evaluation of alternative strategies and interpretations. Varantola (1997) suggests that as much as 50% of the time spent on a translation can be dedicated to consulting reference materials. In this paper I review the roles which can be played by electronic corpora in improving the quality and speed of the translation process, in helping would-be translators to develop their interpretative and strategic competence, and in developing their sensitivity to the issues involved. While in no way wishing to suggest that electronic corpora are a touchstone to resolve the translator's many problems, I believe that they can satisfy three significant criteria for translation instruments:

se facilitano il processo e portano ad una migliore qualita' del prodotto, anche attraverso un aumento delle possibilita' di scelta dell'utente; se offrono occasioni di apprendimento linguistico e metalinguistico; se permettono lo sviluppo di una capacita' tecnica e critica nei confronti di simili strumenti. (Aston 1996: 308)

They achieve these objectives by providing collections of helpful information which facilitate their decision-making and make them feel more secure about their choices (Varantola 1997), allowing better and/or faster solutions to be obtained; by offering numerous opportunities for learning the language, the domain, and about the translation process; and by allowing the user to play an active role in their development and exploitation.

1. Types of corpora

Interest in corpora in the field of translation has been from two main perspectives, descriptive and practical.
On the one hand, scholars have designed and analysed corpora of translations, comparing these with corpora of original texts in order to establish the characteristics peculiar to translations in particular SL-TL combinations (Gellerstam 1996), and indeed possible universals distinguishing translated texts (Baker 1998, Laviosa 1998). On the other hand, there has been a growing interest in corpora as aids in the processes of human and machine translation - their role which is my primary concern here. For this purpose, three main types of corpora have been proposed as relevant:

Monolingual corpora consist of texts in a single language, which may be either the source or the target language of a given translation. While general monolingual corpora include texts of a wide variety of types, specialized monolingual corpora are restricted to a particular genre and/or topic domain. In either case, the corpus attempts to provide a sample of a particular textual population, which ideally also reflects the variability of that population (Biber 1993).

Where monolingual corpora of similar design are available for two or more languages, they may be treated as components of a single comparable corpus. With a few exceptions (note 1), comparable corpora are currently specialized, with the texts belonging to genres or domains which are sociolinguistically similar in each of the cultures involved (in terms of participation framework, function, and topic), and have similar variabilities.

Parallel corpora also have components in two or more languages, consisting of original texts and their translations. Again, most parallel corpora are specialized. They take two main forms (figure 1):

Figure 1: Comparable and parallel corpora

                     language A                  language B
  Comparable         A. specialized corpus       B. specialized corpus of same design
  Parallel
    Unidirectional   A. specialized corpus       B. translations of texts contained in A
    Bidirectional    A1. specialized corpus      B1. specialized corpus of same design as A1
                     A2. translations of B1      B2. translations of A1

Unidirectional parallel corpora consist of texts in one language along with translations of those texts into another language (or languages). Since the corpus in language A is by definition restricted to texts which have been translated into language B, this will not generally allow the textual population in language A to be representatively sampled (Aijmer et al 1996). The criteria to be adopted in selecting the translations to be included in the language B component are also debatable - for instance, whether these should be filtered for quality in some way. The two components are typically aligned on a paragraph-by-paragraph or sentence-by-sentence basis: that is to say, information is added to each sentence or paragraph of each text which indicates the corresponding sentence or paragraph in the parallel text in the other component (note 2). (For a review of alignment procedures, see http://www.lpl.univ-aix.fr/projects/arcade/.)

Bidirectional or reciprocal parallel corpora contain four components: source texts in language A and their aligned translations in language B, and source texts in language B and their aligned translations in language A. They thereby combine the characteristics of unidirectional parallel corpora with those of comparable corpora: if the same design criteria are employed for both languages, they include comparable collections of original texts in the two languages (A1 and B1), as well as comparable collections of translated texts in the two languages (A2 and B2). They additionally allow comparisons between original and translated texts in the same language (A1 and A2; B1 and B2: Johansson and Ebeling 1996).

In this paper I discuss the relevance of each of these types of corpus for the trainee translator. In addition, I shall consider the role of ad hoc corpora, i.e.
corpora compiled on the fly by the translator in order to investigate a specific problem encountered during a particular translation.

2. Uses of different corpora

2.1 Monolingual general corpora

The obvious way in which corpora can help translators is as reference tools, as complements to traditional dictionaries and grammars. Thus the first sentence of Bruce Chatwin's Utz (1989a: 7) reads as follows:

An hour before dawn on March 7th 1974, Kaspar Joachim Utz died of a second and long-expected stroke, in his apartment at No. 5 Sirok Street, overlooking the Old Jewish Cemetery in Prague.

Let me focus on just one problem here, the translation of overlooking. If we examine the occasions where the word apartment occurs in the vicinity of overlooking in the 100-million word British National Corpus (http://info.ox.ac.uk/bnc), we find that apartments typically overlook mountains, rivers, oceans, ports, squares and gardens - all views which seem positively connotated. On the few occasions where what is overlooked is ugly, irony appears to be intended, as in:

Not only do they tolerate the fast-food shops serving up nutriment that top breeders wouldn't recommend for Fido, they go as far as purchasing two expensive weeks in a gruesome timeshare apartment, and sit smoking all day on a balcony overlooking the A9.

The corpus data thus suggests that overlooking has a positive semantic prosody (Sinclair 1991, Louw 1993) - a fact which is unmentioned by dictionaries, and might even be overlooked by a translator whose native language was English. It aids interpretation of the ST, raising the problem of whether Chatwin intends the Prague cemetery to be seen by the reader as a beauty spot, or whether he is being ironic - or indeed, whether he simply aims at ambiguity in this respect. A corpus can also help the translator evaluate - or indeed come up with - a possible translation for this sentence.
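The kind of proximity search used here (one word occurring "in the vicinity of" another) can be sketched in a few lines of Python; the whitespace tokenization and window size are my own simplifications, not how BNC concordancing software actually works:

```python
# Toy collocation search: report windows where word_b occurs within
# `window` tokens of word_a (real concordancers also handle tagging,
# lemmatization and sentence boundaries).
def near(tokens, word_a, word_b, window=5):
    hits = []
    for i, tok in enumerate(tokens):
        if tok == word_a:
            ctx = tokens[max(0, i - window): i + window + 1]
            if word_b in ctx:
                hits.append(" ".join(ctx))
    return hits

text = "a gruesome timeshare apartment and a balcony overlooking the A9"
hits = near(text.split(), "apartment", "overlooking")
```

Run over the whole corpus, collecting the words found inside such windows, this yields exactly the collocate lists (mountains, rivers, gardens...) discussed above.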
The Italian translation of Utz (Chatwin 1989b: 9) renders it as:

Il 7 marzo 1974, un'ora prima dell'alba, nel suo appartamento di via Sirok 5 che dava sul vecchio cimitero ebraico di Praga, Kaspar Utz mori' di un secondo colpo da tempo previsto.

Does the choice of dava su share the positive connotations of overlooking, and allow a similar, possibly ironic, interpretation? In a small (2 million word) collection of Italian literary texts, we find the following instances of dava su:

Lei non si vedeva. Ma il soggiorno dava su una veranda da cui una scaletta
o negli onesti. La finestra di mezzo dava su un balcone di ferro. Concentr
finestrone, dai vetri impolverati, che dava su di uno spalto esterno, da cui si
la vasca. Al chiaro di una finestra che dava su un cortile interno, le sensazio
be potuto uscire subito dalla porta che dava sul sottoponte. Ma, quasi a prende
Muovendosi davanti alla vetrata che dava sul parco, il Bocchi vide i globi
omandante aveva una grande finestra che dava sul pozzo a lume; di fronte, con un
Bocchi abitava in un piccolo attico che dava sul Lungoparma, nel punto in cui

These citations offer little evidence that dava su has a distinctive prosody, and make it doubtful that this translation could be interpreted as ironic. Data from monolingual corpora may thus support interpretative and strategic hypotheses, or suggest that they should be rejected. They may also suggest alternative hypotheses. In the English corpus, overlooking tends to be associated with a particular set of collocates (garden, sea, hills, square etc.). If we search the Italian corpus for occurrences of equivalents to these collocates (giardino, mare, montagna, piazza, etc.) in the vicinity of words like appartamento, camera, casa and finestra, we find citations such as the following:

se vuole posso prenotarle una camera per domani stesso, una bella e linda cameretta con vista sul mare, vita sana, bagni di alghe, talassoterapia,

This citation suggests another possible translation of overlooking, namely con vista su. As we did with dava su, we can now test this against the corpus in order to see whether it is positively connotated, and whether there is evidence of its being used ironically - whether, that is, it occurs in similar contexts to overlooking.

A monolingual general corpus also provides a rich language learning environment. Even if the dava su hypothesis is rejected, the process of doing so allows the user to learn much which may be of value in the future. Unlike the dictionary, a concordance leaves it to the user to work out how an expression is used from the data. This typically calls for deeper processing than does consulting a dictionary, thereby increasing the probability of learning (Hulstijn 1992). In more general terms, by drawing attention to the different ways expressions are typically used and with what frequencies, corpora can make learners more sensitive to issues of phraseology, register and frequency, which are poorly documented by other tools.

Corpora also allow much unpredictable, incidental learning. Almost any concordance is likely to contain unknown or unfamiliar uses, which may be noticed and explored by the user who is prepared to go off at a tangent to follow them up (Bernardini 1997, in press). Looking through the occurrences of dava su, I noticed the unfamiliar expression pozzo a lume. While I can roughly understand its meaning from the context, I may be able to get a better idea of its use and frequency by generating a concordance of all its occurrences in the corpus.

As translation aids, however, monolingual general corpora pose a number of difficulties:

- It may be difficult to locate and select an appropriate corpus.
Reference corpora such as the Bank of English and the British National Corpus are sufficiently large and well-balanced to document the range of uses of all but the rarest lexical items in British English: there are, for example, 767 occurrences of overlooking in the BNC. But no similar corpus is yet publicly available for Italian, nor for American English. The Italian data cited above were taken from a relatively small (2 million word) collection of contemporary literary texts, put together from the Internet. The limited size and representativeness of this collection makes it much more difficult to identify and to evaluate regularities of use: there are only 8 occurrences of dava su, and it is debatable how far the intertextual background against which a translation of Utz should be interpreted is a purely literary one.

- It may be difficult to retrieve appropriate instances from the corpus. Overlooking, for example, is polysemous, meaning either looking out onto or ignoring. There is no way, using currently available corpora and concordancing software, that it is possible to find just one of these senses and exclude the other. Roughly 10% of the occurrences of overlooking in the BNC have the ignoring sense, and these must be excluded manually in order to effectively analyse the semantic prosody of the looking out onto sense. In the case of dava su, not only do other senses (such as that of gave up) have to be excluded, but morphological variants of each word should probably be investigated (da/dava/danno/davano su/sul/sull'/sullo/sulla/sugli/sulle), many of them polysemic, as should occurrences where the component words are separated by, for example, adverbials (dava direttamente su). Considering such factors will tend to reduce the precision of any search, making it more likely that spurious solutions will be found which require manual deletion (note 3).

- It may be difficult to match the data to the translation. Whether evaluating a usage in the ST, or a candidate translation in the TT, the user is unlikely to find examples which precisely match the required context. Analysing concordances requires identification, classification and generalization to establish recurrent patterns and to relate these to particular contextual features (Johns 1991: 4), and these procedures require training and practice. Faced with a concordance of overlooking, the learner will need to group uses with different senses (e.g. animate and inanimate subjects and/or physical and abstract objects: overlooking the problem vs overlooking the park), and to draw inferences as to what features are shared by a particular group. Since this obliges the user to discriminate and attend to uses which differ from that occurring in the source or target text, the process will be time-consuming, and arguably dispersive in terms of the translation at hand - even if rewarding in language learning terms, where greater understanding of the different uses of overlooking may be a valuable by-product.

2.2 Specialized and comparable corpora

These difficulties can be reduced by using corpora which are specialized, that is, which consist only of texts of a type similar to the ST and/or the desired TT. Such corpora may be extractable as sub-corpora from large general ones - though only limited specialization can be obtained without compromising representativeness (Sinclair 1991) - or they may be specifically collected - an investment which may be well worth the effort where the translator foresees doing a number of similar translations in the future, and which is therefore a useful exercise for any translator training course (Maia 1997). Specialized corpora can be seen as a development of the tradition of using parallel texts in translation - i.e. collections of texts of the same kind as the ST and/or TT (Haartman 1980; Williams 1996) - with electronic format enabling more rapid and systematic searching of larger quantities of text.
Such corpora are particularly useful for the investigation of forms and meanings which are typical of that type of text (in particular terminology, but also features of register and text structure: Gavioli, forthcoming; Zanettin, forthcoming), and as an environment in which to prepare for work which has to be carried out under time constraints, such as speech interpreting. Varantola (1997) underlines how specialized corpora have high reassurance value, particularly where the TT is in the translator's L2, insofar as they illustrate similar contexts to those of the translation being worked on.

Where specialized corpora have to be constructed by the user, this involves design decisions as to what texts to include and why. One of our early experiences in Forlì with specialized corpora involved learners who were translating material for the Melozzo centenary exhibition into English, for which we compiled English and Italian corpora from CD-ROMs of the National Gallery and of the Uffizi. Each corpus contained texts of similar types describing artists and their works, genres, schools and technical aspects for a lay public. While limited in size (under 100,000 words each), their specialization and authoritativeness made them appropriate resources for the task, and given their similar composition, the two corpora could also be treated as comparable. Today, corpora of texts of this type could also be compiled from the Internet (Pearson, forthcoming). Clients are also a potential source of relevant specialized texts.

With respect to general monolingual corpora, specialized ones are easier to handle and in many ways more informative. In particular:

- It is easy for the user to become familiar with the texts included in a small specialized corpus, facilitating the interrogation of that corpus and the interpretation of data from it (Aston 1997) - a familiarity which will be further enhanced if the corpus has been compiled by the user (Maia 1997).

- A specialized corpus can provide figures concerning lexical density and repetition for texts of that type, in the shape of standardized type/token ratios, numbers of types accounting for different percentages of tokens, etc. The user can compare means and variances of these figures with values for the ST and/or TT, to see how well the latter match the norms of the corpus.

- Concordances are less likely to contain spurious citations. Insofar as the frequency of different senses varies according to text-type, the likelihood of encountering other senses of polysemous items may be reduced (if we simply restrict a search for overlooking to the subcorpus of fiction in the BNC, the proportion of examples of the ignore sense drops by 50%). Gavioli and Zanettin (1997) provide a clear example of this phenomenon. Faced with the ST phrase etilisti con o senza marcatori HBV in a medical research article on hepatitis C, the initial translation hypothesis was alcoholics with or without HBV markers. However a search for markers in a specialized corpus of similar articles in English revealed the recurrent positive/negative for HBV markers. This would not have emerged from the BNC, where there are fewer texts of this type, and where other senses of the word markers predominate (in reference to examinations, pens, and linguistics). The viability of such a corpus, however, depends on whether the intertextual background for the use of an expression can be confidently limited to a single text-type in this way.

- Specialized corpora also provide more assistance in formulating translation hypotheses. The greater precision provided by a specialized corpus allows us to extend the principle of using collocates to identify possible equivalents to complete texts or segments of texts.
Where an expression in the ST is used in relation to a particular person or concept in the field in question, it may be possible to locate possible equivalents by searching for references to that same person or concept in the TL corpus and then reading the surrounding text. Software can facilitate this process: Wordsmith Tools (Scott 1997) shows which texts and parts of those texts contain most occurrences of a particular form. Remaining within the field of hepatitis C research, let us say that we wish to find an appropriate translation for the ST's casi mortali per insufficienza epatica. Given that all the texts in the corpus deal with hepatitis C, we can guess that those where death is most frequently mentioned are most likely to be relevant. The following table shows the files where the word death is most frequent:

  N   File    Words   Hits   per 1,000 words
  1   sx.11   1,436     17   11.84
  2   rx.11     965     11   11.40
  3   mx.11     914      8    8.75
  4   rx.7      870      3    3.45

Reading the first of these, we find the expression fatal liver disease - a translation hypothesis which we can then investigate using the entire corpus.

- Specialized corpora facilitate analyses related to textual macrostructure. Insofar as the texts in the corpus share similar structures and functions, it is easier to relate occurrences to particular functions and positions in texts (Aston 1997).

- Incidental learning from texts and citations from a specialized corpus is more likely to be relevant to the task at hand, or at any rate to come in useful for further translations of a similar nature (Zanettin, forthcoming). For instance, a concordance of panel painting in the National Gallery corpus includes references to types of panel paintings, such as tondi, and to the techniques whereby they were created, such as pastiglia - terms which may well prove useful at other points of an art history translation.

- A specialized corpus provides a useful means of learning about an area in which the translator needs to work and its textual conventions. Key concepts can be located manually in wordlists, or a wordlist from a specialized corpus can be compared with one from a general corpus in order to highlight the distinctive features of the former (Wordsmith Tools carries out such comparisons for both single words and phrases). If the corpus is comparable, a candidate list of terms in one language can be matched with one for the other language to create a terminology bank.

While most work involving specialized corpora as translation aids has used TL corpora (Bowker 1998, Varantola 1997, Friedbichler and Friedbichler 1997), where comparable specialized corpora are available, these can also be used to investigate the SL and the ST, particularly where the conventions of the latter are relatively unfamiliar, as a means to identify routine and non-routine uses. Comparable corpora seem particularly useful for learning purposes, as a means of exploring a particular text-type in both languages prior to engaging in translation.

2.3 Corpus construction

Since specialized corpora for a particular text-type are rarely available off-the-shelf, the translator needs to learn to construct such corpora - an experience which will develop awareness of their potential validity and reliability. Collecting a reasonably representative set of texts of a particular type requires a preliminary survey of the textual population and of its variability, as well as of the authoritativeness of candidate texts. Friedbichler and Friedbichler (1997) recommend selecting texts which have been subject to peer review, and which are where possible widely cited in the specialist literature (note 4); Varantola (1997) recommends avoiding texts written by non-native speakers.

It is clear that for any specialized corpus, the greater the variability of the text-type to be represented, the larger the corpus should be. In general, the larger the better, though there is clearly a point where the returns on expansion diminish.
Friedbichler and Friedbichler (1997) suggest that for English, authoritative specialized corpora of 500,000 to 5 million words (according to the variability of the text-type) should provide solutions to 97% of the translator's questions. In what follows, a number of criteria for evaluating specialized corpora are proposed: in each case, the smaller the value the better.

- The smaller the type/token ratio, the more lexically repetitive the corpus, and hence the better documented the types it contains. A ratio of 2% means that each word-type occurs, on average, 50 times every 1000 words in the corpus.

- While indicating the extent of documentation of the types contained in the corpus, the type/token ratio gives no indication of whether those occurring in a similar text from outside the corpus will be documented. This probability can be assessed from the ratio of hapax legomena (word-types which occur only once in the corpus) to the total number of tokens: an HL/tokens ratio of 2% means that when reading a new text, an undocumented type is likely to be encountered, on average, every 50 words.

- The HL/tokens ratio does not however consider the variability of the text-type. This can be assessed by considering the proportion of word-types that occur in only one text in the corpus. This provides a further indication of the likelihood, in any new text, of encountering new types. A proportion of 20% means that in any similar text, 20% of its word-types will on average be undocumented.

All these measures are a function of variability within and across texts, and of corpus size (and in the case of the last measure, also of text size): a small but homogeneous corpus of weather reports may well have lower values than a much larger one of tourist guides. Values will also depend on the language of the texts: given the greater morphological complexity of the language, Italian corpora tend to have higher values than English ones (note 5).
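The three measures just listed are straightforward to compute; a sketch in Python (toy whitespace tokenization and my own function names, for illustration only):

```python
from collections import Counter

def corpus_measures(texts):
    """texts: list of token lists, one list per corpus text."""
    tokens = [tok for text in texts for tok in text]
    counts = Counter(tokens)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    # number of corpus texts each word-type appears in
    doc_freq = Counter(tok for text in texts for tok in set(text))
    single_text_types = sum(1 for c in doc_freq.values() if c == 1)
    return {
        "type_token_ratio": len(counts) / len(tokens),
        "hapax_token_ratio": hapaxes / len(tokens),
        "single_text_type_proportion": single_text_types / len(counts),
    }

m = corpus_measures([
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
])
```

On a real corpus one would compute the type/token ratio in the standardized form the text mentions (averaged over fixed-size chunks, e.g. 1,000 words), since the raw ratio falls as corpus size grows.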
The translator can use measures such as these to assess the reliability of a particular specialized corpus and hence to determine its required size. Values obtained on the last two measures can also be compared with the actual proportions of undocumented types encountered in the ST and/or TT, as an indication of the goodness-of-fit of the corpus for the text in question.

This fit will rarely be perfect, and in any case no specialized corpus is ever likely to document all the problems posed by a particular text. Specialized texts also use non-specialized language, and the intertextual background on which they draw will rarely be simply that of the text-type in question. There thus remains the need to recognize where general monolingual corpora should be called on, or where it may be useful to compile a corpus ad hoc to analyse a particular problem.

2.4 Ad hoc corpora

Specialized corpora will rarely document every word in an ST or TT, even if they are likely to provide a much fuller documentation for features typical of that text type than large general ones. One learner using a comparable specialized corpus on cancer of the colon in order to translate an English research article into Italian was completely nonplussed by an allusion in the ST to the holy plane, for which she could find no explanation or equivalent. In such cases, relevant information may be obtainable from a large monolingual corpus or, failing that, CD-ROMs or the Internet. We can in fact use the Internet to compile corpora ad hoc, using search engines to find all the texts containing particular expressions. Since the world-wide web is an ever-changing entity of dubious authority whose overall composition is unknown, considerable care must however be exercised in selecting texts and drawing inferences (Pearson, forthcoming).
Thevalue of such ad hoc corpora can be illustrated by an example from Bertacciniand Aston (forthcoming), which focusses on the translation into English of aFrench newspaper article which contained the word clochemerlesques. Searcheswere made for clochemerl* in a CD-ROM of Le Monde, and using the Altavistasearch engine on the Internet (http://altavista.digital.com). Together, theseturned up 20 French texts, analysis of which allowed for a fairly confidentinterpretation of the ST: Clochemerle was a comic novel by G. Chevallier whichridiculed factionism in village politics, apparently well-enough known as anarchetype of petty factionism to be alluded to without explanation by Frenchjournalists. Howcould it be translated in English? Searches for English examples of clochemerl*on the Internet, and in CD-ROMs of The Independent and The Daily Telegraph,suggested that Clochemerle was far from equally familiar to a British public,and that it was if anything associated with public conveniences. Did anyarchetype in British culture have similar associations to the French one? One possibilitywhich came to mind was Gulliver's Travels, and the conflict in Lilliput betweenBig- and Little-endians as to the right way to crack an egg. However, furthersearches provided no evidence that reference to Lilliput, or to big/little-endian, would have these associations for a general reader (the formerseemed associated exclusively with size, and the latter were terms in computerarchitecture). The final (if less than fully satisfactory) solution was localsquabbling, whose derogatory connotations were confirmed by a study of thesemantic prosody of squabbl* in the BNC. Insuch cases, an ad hoc corpus is clearly better than none, though very time-consuming to compile. 
Friedbichler and Friedbichler (1997) suggest that to be cost-effective, searches using corpora should not exceed an average of ten seconds: so the use of ad hoc corpora must be limited to a very small proportion of the problems posed by any translation.

2.5 Parallel corpora

A further limit of monolingual and comparable corpora as translation tools is the difficulty of generating hypotheses as to possible translations. The user must rely on known or suspected equivalences as heuristics to retrieve similar contexts in a TL corpus, providing a specification which is both sufficiently general to recall a range of possibilities, and sufficiently precise to limit the number of spurious hits. S/he must then verify that the citations retrieved are in fact sufficiently similar to those of the ST and/or the SL corpus. These procedures are both time-consuming and error-prone: an expression in the TL corpus may occur in a similar context to one in the SL corpus, yet in fact mean something different. For example, in attempting to translate the phrase loop ileostomy in a medical research article, Ferri (1999: 64) illustrates how a search for similar contexts in the TL found ileostomia su bacchetta. Without detailed medical knowledge, she initially assumed this term to be equivalent, while it is in fact hyponymous. Greater certainty as to the equivalence of particular expressions can be obtained by using parallel corpora, consisting of original texts and their translations, where these are similar to the ST and TT. If the corpus is aligned, and suitable software is available, the user can locate all the occurrences of any expression along with the corresponding sentences in the other language. There is however a dearth of parallel corpora for English and Italian, and relatively little parallel concordancing software for the PC (though see Barlow 1995, Woolls 1997).
The examples which follow were extracted using Multiconcord (Woolls 1997), from its sample collection of different language versions of discussions in the European Parliament. This material has many limits, since we do not know which version constitutes the original text, and which a translation, or indeed a translation of a translation (Lauridsen 1996). Nevertheless, it can illustrate how a parallel corpus may provide a means of identifying translation hypotheses in a specialized environment. The following concordance shows occurrences of the word establish and its equivalents in Italian (some citations are abbreviated for reasons of space):

We support the Socialist Group's demand for the President to establish a committee as soon as possible to conduct such a review.
Condividiamo la richiesta del gruppo socialista in base alla quale il Presidente dovrebbe istituire quanto prima un comitato per la realizzazione di questa modifica.

if we are to guarantee the quality and competitiveness of the European tourist industry, we shall have also to develop new forms of synergy with other Community policies, bringing in all of the interested parties in an effort to establish the conditions favourable to the development of the Union's tourist enterprises
per garantire la qualita' e la competitivita' dell'industria europea del turismo, occorre inoltre sviluppare nuove sinergie con le altre politiche comunitarie, coinvolgendo tutte le parti interessate al fine di creare le condizioni favorevoli allo sviluppo delle imprese turistiche dell'Unione

Thus we need to establish a coherent European tourism policy which adds value above and beyond Member State level and against which we can judge and monitor the very considerable sums of money which are spent through other EU funds
ed e' quindi necessario realizzare una politica europea per il turismo globale, che aggiunga valore al di sopra ed oltre il livello di Stato Membro e rispetto alla quale possiamo valutare e controllare le notevoli somme di denaro che
vengono spese attraverso altri fondi europei

It is vital at this point that we establish diplomatic relations and therefore a dialogue with the current Kabul authorities,
Si rivela indispensabile in questo momento, instaurare relazioni diplomatiche e quindi un dialogo con le attuali autorita' di Kabul,

It must put an end to the inconsistencies and finally establish a clear and independent foreign policy, at last shouldering its responsibilities, without hesitation and avoiding inconsistencies.
Metta fine alle sue contraddizioni e elabori finalmente una politica estera chiara, autonoma, si assuma finalmente le sue responsabilita', senza tentennamenti e senza contraddizioni.

We must ask the Union to establish whether the proposals made by these countries under the aegis of IGADD will be able to bring about a solution and if so to give them our support.
Invitiamo l'Unione a verificare se le proposte avanzate da questi Stati nell'ambito dell'IGAD siano tali da favorire una soluzione e, in caso positivo, la sollecitiamo a dare il suo sostegno.

We need more specific signs and we need clearer evidence that the Belarus Government does indeed want to establish a free and more democratic society.
Ci servono segni piu' precisi, cosi' come deve essere precisa l'intenzione del governo bielorusso di instaurare a tutti gli effetti un sistema libero e democratico.

This illustrates a wide range of possible equivalents to establish: avviare, creare, elaborare, instaurare, realizzare, verificare. For the translator of an English text of this kind, it thus suggests a range of hypotheses which can be further investigated using a general or specialized TL corpus. Not all expressions are paralleled by such a wide variety of equivalents. One of the most frequent lexical words in the Italian component of the corpus is relazione. The parallel English term is invariably report (unlike the British parliamentary paper).
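A parallel concordance lookup of this kind reduces to a simple filter over aligned sentence pairs. The sketch below uses invented English-Italian pairs, not the Multiconcord sample collection:

```python
# Minimal parallel-concordance lookup over aligned sentence pairs.
# The aligned pairs are invented for illustration.
aligned = [
    ("We must establish a committee.", "Dobbiamo istituire un comitato."),
    ("We want to establish diplomatic relations.",
     "Vogliamo instaurare relazioni diplomatiche."),
    ("The report was approved.", "La relazione e' stata approvata."),
]

def parallel_concordance(pairs, word):
    """Return (source, target) pairs whose source sentence contains word."""
    return [(en, it) for en, it in pairs if word in en.lower().split()]

for en, it in parallel_concordance(aligned, "establish"):
    print(en, "|", it)
```

Real alignment software additionally has to segment and pair the sentences of the two texts; here that work is assumed to have been done already.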
In contrast, under a third of the occurrences of another frequent word, favore, are paralleled by favour: parallel to votare a favore di we find vote for; parallel to accogliamo con favore, we welcome. The corpus suggests equivalents for technical terms, and a wider variety of possible translations for sub-technical lexis than are likely to be found in a bilingual dictionary, particularly at a phraseological level. It may also highlight syntactic contrasts, including differences in the organization of the text into sentences and paragraphs. Using such a corpus can also have a positive impact on learning. Where a variety of parallel realizations are encountered, this may help learners to distinguish between different contexts of use, and reduce their tendency to think in terms of one-to-one equivalence, as Ulrych (1997) illustrates in respect of parallel English realizations of ossia. More general problems may also be faced: Danielsson and Mühlenbock (forthcoming) illustrate how a parallel corpus can cast light on translation strategies for proper names, showing whether these are transcribed, translated, clarified or simplified. Johns (forthcoming) proposes a number of types of exercises using parallel concordances, for instance by blanking out the search word in language A and asking learners to infer it from the parallel citations provided in language B. Since parallel concordances provide translations of each occurrence, citations are more likely to be immediately understandable for the user, diminishing the difficulties of retrieval and risks of misinterpretation associated with monolingual and comparable corpora. For the same reason, the scope for incidental learning may be increased. However, notwithstanding their apparent face validity, parallel corpora also introduce new dangers deriving from the assumption that parallel occurrences are effectively equivalent.
It is necessary to ask whether the translations in the corpus are reliable and authoritative (note 6), and to bear in mind that the use of translations to identify equivalents inevitably implies reducing the target language to a mirror image of the source language (Teubert 1996: 250) - or the SL to a mirror image of the TL:

There is, for instance, no direct TE [translation equivalent] in English for the German word Schadenfreude. Therefore, we will rarely find occurrences of Schadenfreude in German translations of English texts. Generally speaking, translations in language B will contain 'grosso modo' only those lexical items which count as TEs for items of the vocabulary of language A. The same is true for syntax. The 'impersonal passive' (e.g. Es wurde viel getrunken, literally 'It was drunk a lot') is a fairly common syntactic construction in German for which there is no equivalent in English. (Teubert 1996: 247)

Using translations as models for the TT thus risks reproducing those features of translationese which have been identified by workers using corpora in descriptive translation studies: normalization, simplification, explicitation (Baker 1993, 1998), sanitization by reducing connotational meanings (Kenny 1998), increased cohesion (Øverås 1998), and lower lexical density, higher mean sentence length, and higher proportions of high-frequency words (Laviosa 1998). Gellerstam (1996) shows how translations into Swedish of English texts carry over many features of English vocabulary, syntax, and rhetoric when compared with comparable Swedish originals; Gavioli and Zanettin (1997) illustrate some similar features in Italian translations from English. Using parallel corpora seems likely to reinforce such tendencies (though it is of course possible that they may increase learners' awareness of these features, and hence their conscious control of them: Ulrych 1997).
The unreliability of the translations in parallel corpora makes it advisable to use them in conjunction with monolingual or comparable corpora, so that, for instance, a translation hypothesis derived from a parallel corpus can be tested against a collection of original texts in the language in question. The ideal parallel corpus, from this point of view, will be bidirectional or reciprocal (cf. 1 above), allowing the user to see whether occurrences found in translations into language B are also found in original texts in language B, and whether these are translated into language A in the manner encountered in original texts in language A. Such a corpus combines the advantages of a parallel corpus with those of a comparable one: from this point of view, bidirectional English-Italian corpora would seem an important area for future research and development. Such corpora are however considerably more difficult to design and compile than comparable ones, given the need to create comparable collections of texts which have been translated, and to align the texts and translations prior to use. Given the amount of work involved, they are likely to be relatively unspecialized in order to extend their range of application (see e.g. the English-Norwegian parallel corpus: Johansson and Hofland 1994). Consequently there is still likely to be a role for comparable and unidirectional parallel corpora of a more specialized nature. One form of the latter may be compiled by the specialized translator (or their client), drawing on the texts that s/he has (had) translated in the past (cf. note 5 above). It should be noted en passant that parallel concordancing software can also be used to analyse a single text and its translation. This is potentially a useful tool for translators to check and evaluate their own translations.
Aligned versions of the ST and TT can be used to see whether a particular term in the ST has been translated consistently in the TT, or whether (given the tendency of translations to be less lexically varied than their source texts) a particular expression in the TT corresponds to a variety of expressions in the ST. Type/token ratios and lexical density measures for the ST and TT can also be compared, and evaluated by comparison with those found in comparable or parallel corpora of similar texts.

3. Conclusions

There is as yet little hard empirical evidence to demonstrate the effectiveness of corpora as translation and as learning tools. Williams (1996) found a 40% improvement in the recovery of correct equivalents when parallel texts were used as translation aids as opposed to bilingual dictionaries, and one might expect these results to be matched or bettered with larger collections of texts in electronic format, and the aid of retrieval software. In a pilot experiment Bowker (1998) found that learners using a specialized corpus of texts in the target language (their L1) showed greater correct term choice and idiomaticity than a matched group using bilingual dictionaries alone. On the other hand, Bernardini and Aston (forthcoming) found that on two translation tasks into the L2, learners using monolingual L2 dictionaries performed better than matched groups using a general L2 monolingual corpus. While learners seem to a large extent enthusiastic about using corpora, it remains to be shown just in what respects, and under what conditions, their performance as translators may improve as a consequence: we cannot for instance exclude the idea that training with corpora may also improve dictionary usage, by instilling greater attention to collocation and register.
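The ST/TT comparison suggested above can be sketched as follows. The sentences are invented, and lexical density is approximated here with a small English stopword list rather than a part-of-speech tagger:

```python
# Type/token ratio and a rough lexical-density estimate for an ST and TT.
# The stopword list is a toy stand-in for a proper function-word list.
import re

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "was", "it"}

def tokens(text):
    return re.findall(r"[a-zà-ú']+", text.lower())

def type_token_ratio(text):
    toks = tokens(text)
    return len(set(toks)) / len(toks)

def lexical_density(text):
    """Content words / all words, using the toy stopword list."""
    toks = tokens(text)
    return sum(t not in STOPWORDS for t in toks) / len(toks)

st = "The committee was established to review the policy."
tt = "Il comitato e' stato istituito per rivedere la politica."
print(type_token_ratio(st), type_token_ratio(tt))
print(lexical_density(st))  # English ST only
```

Computing the density of the Italian TT would of course require an Italian stopword list; the type/token ratio, by contrast, is language-independent.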
No research that I am aware of has yet attempted to compare the effectiveness of different types of corpora, or of different learner approaches to them; yet more difficult to measure are the overall effects of corpus use on learning, be this in terms of general linguistic knowledge and ability, or as relating to a specialized text-type. In this climate of empirical uncertainty, arguments for and against the use of corpora in translator training must be of a theoretical nature, and can resort at best to anecdotal evidence. Where available and accessible, appropriate corpora appear able to provide better and faster solutions to many of the translator's problems in a unified environment, with positive effects on learning. They make possible more idiomatic, native-like interpretations of source texts and a use of more idiomatic, native-like strategies in target texts. It is our experience at Forli' that few trainee-translators who have used corpora would wish to be without them, notwithstanding (or because of?) the investment in time and effort required to compile corpora and to learn how to use them, and we expect that as the number of available corpora and the quantity of suitable software increases, the use of corpora for translation and translator-training will gather further momentum, with a growth in its cost-effectiveness.

Notes

1. The Parole project aims to produce general comparable corpora for all the languages of the EU (http://www.ilc.pi.cnr.it/parole/parole.html).

2. Parallel corpora can be extended to include multiple languages (Woolls 1997), or multiple translations of each text (Ulrych 1997, Malmkjaer 1998). As the value of such extensions seems more descriptive than pedagogic, I shall not discuss them here.

3. In the gave up sense, su is of course an adverb rather than a preposition.
If the corpus used is tagged with part-of-speech codes (as is the case with the BNC and the Bank of English), it may be possible to avoid unwanted senses by searching for a specific part of speech, e.g. dare su=PRP (or an equivalent formalism). Part-of-speech tagging may also facilitate analysis, enabling the data to be sorted by part-of-speech code.

4. Bowker (1998) and Pearson (1996, 1998) argue that where specialised corpora are used to train translators in a specialised field, they should include a range of different types of text - expert, instructional, and popularised. The latter types, they argue, are likely to explain terms and concepts which are taken for granted in expert texts. However, it is important not to confuse these types in the corpus, since we would not, for example, expect divulgative texts to have the same collocational and colligational regularities as specialist ones, nor to contain the same range of terms as the latter. Where the corpus is used to translate a specific text, the appropriate component should be given priority.

5. King (1997: 396) compares the number of types in translations of Le petit prince with the French original: scoring the latter as 100, figures for English and for Italian are 83 and 107 respectively.

6. This may, for instance, be dubious if all the translations in the corpus have been produced by the same translator, as is often the case with translation memory systems.

References

Aijmer, K., B. Altenberg and M. Johansson (eds.), 1996, Languages in contrast, Lund University Press, Lund.
Aston, G., 1996, Traduzione e tecnologia, in G. Cortese (a cura di), Tradurre i linguaggi settoriali, Edizioni Cortina, Torino, pp. 293-310.
Aston, G., 1997, Small and large corpora in language learning, in Lewandowska-Tomaszczyk and Melia, pp. 51-62.
Aston, G. (ed.), forthcoming, Learning with corpora.
Baker, M., 1993, Corpus linguistics and translation studies: implications and applications, in Baker et al., pp. 233-250.
Baker, M., 1998, Réexplorer la langue de la traduction: une approche par corpus, Meta, 43/4, pp. 480-485.
Baker, M., G. Francis and E. Tognini-Bonelli (eds.), 1993, Text and technology: in honour of John Sinclair, Benjamins, Amsterdam.
Barlow, M., 1995, ParaConc: a concordancer for parallel texts, Computers & texts, 10.
Bernardini, S., 1997, A 'trainee' translator's perspective on corpora, available online, http://www.sslmit.unibo.it/cultpaps/trainee.htm
Bernardini, S., in press, Competence, capacity, corpora, CLUEB, Bologna.
Bernardini, S. and G. Aston, forthcoming, Do corpora actually help translators?.
Bertaccini, F. and G. Aston, forthcoming, Exploring cultural connotations through ad hoc corpora, in Aston (forthcoming).
Biber, D., 1993, Representativeness in corpus design, Literary and linguistic computing, 8/4, pp. 243-257.
Bowker, L., 1998, Using specialized monolingual native-language corpora as a translation resource: a pilot study, Meta, 43/4, pp. 631-651.
Burnard, L. and T. McEnery (eds.), forthcoming, Papers from TALC 98 (provisional title), Peter Lang, Bern.
Chatwin, B., 1989a, Utz, Pan, London.
Chatwin, B., 1989b, Utz, trans. D. Mazzone, Adelphi, Milano.
Danielsson, P., and K. Mühlenbock, forthcoming, Retrieval of name translations in parallel corpora, in Burnard and McEnery.
Ferri, S., 1999, Uso di piccoli corpora comparabili per la traduzione medica, unpublished dissertation, SSLMIT, Forli'.
Friedbichler, I. and M. Friedbichler, 1997, The potential of domain-specific target-language corpora for the translator's workbench, available online, http://www.sslmit.unibo.it/cultpaps/fried.htm
Gavioli, L., forthcoming, Corpora and the concordancer in learning ESP: an experiment in a course of interpreters and translators, in G. Azzaro and M. Ulrych (eds.), Anglistica e ....: metodi e percorsi comparatistici nelle lingue, culture e letterature di origine europea. Volume II: Transiti linguistici e culturali, EUT, Trieste.
Gavioli, L. and F.
Zanettin, 1997, Comparable corpora and translation: a pedagogic perspective, available online, http://www.sslmit.unibo.it/cultpaps/laura-fede.htm
Gellerstam, M., 1996, Translations as a source for cross-linguistic studies, in Aijmer et al., pp. 53-62.
Hartmann, R.R.K., 1980, Contrastive textology: comparative discourse analysis in applied linguistics, Julius Gross Verlag, Heidelberg.
Hulstijn, J.H., 1992, Retention of inferred and given word meanings: experiments in incidental vocabulary learning, in P.J.L. Arnaud and H. Béjoint (eds.), Vocabulary and applied linguistics, Macmillan, London, pp. 113-125.
Johansson, S. and J. Ebeling, 1996, Exploring the English-Norwegian parallel corpus, in C. Percy, C.F. Meyer and I. Lancashire (eds.), Synchronic corpus linguistics, Rodopi, Amsterdam, pp. 3-15.
Johansson, S. and K. Hofland, 1994, Towards an English-Norwegian parallel corpus, in U. Fries, G. Tottie and P. Schneider (eds.), Creating and using English language corpora, Rodopi, Amsterdam, pp. 25-37.
Johns, T., 1991, Should you be persuaded: two examples of data-driven learning, in T. Johns and P. King (eds.), Classroom concordancing, (ELR journal, 4), Centre for English language studies, Birmingham, pp. 1-16.
Johns, T., forthcoming, Reciprocal learning: a practical application of parallel concordancing.
Joseph, J.E., 1998, Why isn't translation impossible?, in S. Hunston (ed.), Language at work, BAAL/Multilingual Matters, Clevedon, pp. 98-108.
Kenny, D., 1998, Creatures of habit? What translators usually do with words, Meta, 43/4, pp. 515-523.
King, P., 1997, Parallel corpora for translator training, in Lewandowska-Tomaszczyk and Melia, pp. 393-402.
Lauridsen, K., 1996, Text corpora and contrastive linguistics: which type of corpus for which type of analysis?, in Aijmer et al., pp. 63-71.
Laviosa, S., 1998, Core patterns of lexical use in a comparable corpus of English narrative prose, Meta, 43/4, pp. 557-570.
Lewandowska-Tomaszczyk, B. and P.J.
Melia (eds.), 1997, PALC'97: practical applications in language corpora, Lodz University Press, Lodz.
Louw, B., 1993, Irony in the text or insincerity in the writer? The diagnostic potential of semantic prosodies, in Baker et al., pp. 157-176.
Maia, B., 1997, Do-it-yourself corpora... with a little bit of help from your friends!, in Lewandowska-Tomaszczyk and Melia, pp. 403-410.
Malmkjaer, K., 1998, Love thy neighbour: will parallel corpora endear linguists to translators?, Meta, 43/4, pp. 534-541.
Nida, E., 1964, Towards a science of translating: with special reference to principles and procedures in Bible translating, E.J. Brill, Leiden.
Øverås, L., 1998, In search of the third code: an investigation of norms in literary translation, Meta, 43/4, pp. 571-588.
Pearson, J., 1996, Teaching terminology using electronic resources, in S. Botley, J. Glass, T. McEnery and A. Wilson (eds.), Proceedings of Teaching and language corpora 1996, UCREL, Lancaster, pp. 203-216.
Pearson, J., 1998, Terms in context, Benjamins, Amsterdam.
Pearson, J., forthcoming, Surfing the internet: teaching students to choose their texts wisely, in Burnard and McEnery.
Reiss, K., 1981, Type, kind and individuality of text: decision making in translation, Poetics today, 2/4, pp. 121-131.
Scott, M., 1997, Wordsmith Tools (ver. 2.0), Oxford University Press, Oxford.
Sinclair, J.M., 1991, Corpus, concordance, collocation, Oxford University Press, Oxford.
Teubert, W., 1996, Comparable or parallel corpora?, International journal of lexicography, 9/3, pp. 238-264.
Ulrych, M., 1997, The impact of multilingual parallel concordancing on translation, in Lewandowska-Tomaszczyk and Melia, pp. 421-435.
Varantola, K., 1997, Translators, dictionaries and text corpora, available online, http://www.sslmit.unibo.it/cultpaps/varanto.htm
Williams, I.A., 1996, A translator's reference needs: dictionaries or parallel texts, Target, 8, pp. 277-299.
Woolls, D., 1997, MultiConc (ver. 1.0), CFL Software Development, Birmingham.
Zanettin, F., 1998, Bilingual comparable corpora and the training of translators, Meta, 43/4, pp. 616-630.
Zanettin, F., forthcoming, Swimming in words: corpora, translation and language learning, in Aston (forthcoming).

http://www.sslmit.unibo.it/~guy/textus.htm
(1) Geoffrey Hinton: http://www.cs.toronto.edu/~hinton/ The father of the RBM; it was he who made the RBM trainable in practice. (2) Andrew Ng: http://ai.stanford.edu/~ang/ A great professor and a great speaker. His students helped to popularize the deep belief network. (3) Honglak Lee: http://web.eecs.umich.edu/~honglak/ He won the best application paper award at ICML 2009. He currently works on modeling invariance using RBMs. (4) Ruslan Salakhutdinov: http://www.utstat.toronto.edu/~rsalakhu/ A student of Prof. Hinton; his major contribution is the introduction of the deep Boltzmann machine. Prof. Hinton coined the term deep belief network; the two kinds of network share some similarity, both belonging to deep architectures. (5) Graham Taylor: http://www.uoguelph.ca/~gwtaylor/ Also a student of Prof. Hinton; his major contribution is the introduction of the gated Boltzmann machine, which makes generating gray-scale images possible. (6) Hugo Larochelle: http://www.dmi.usherb.ca/~larocheh/index_en.html Again a student of Prof. Hinton; his major contribution is applying RBMs to model attentional data. (7) Marc'Aurelio Ranzato: http://www.cs.toronto.edu/~ranzato/ He finished his Ph.D. under Prof. Yann LeCun and spent two years as a postdoc under Prof. Hinton. His contribution is the introduction of a duplicate of the image to model covariance among neighboring pixels. (8) Roland Memisevic: http://www.iro.umontreal.ca/~memisevr/ He modeled temporal data using RBMs, and has now found a faculty position at the University of Montreal. (9) Yoshua Bengio: http://www.iro.umontreal.ca/~bengioy/yoshua_en/index.html A great professor; his 'Learning Deep Architectures for AI' is a must-read. (10) Yann LeCun: http://yann.lecun.com/ A legend. He paid no heed to the mainstream computer-vision crowd; he is super smart, and his work may revolutionize object recognition. (11) Rob Fergus: http://cs.nyu.edu/~fergus/ An NYU professor, who rejected my application to work with him. A genius all the same; I admire him.
(12) Kai Yu: http://www.dbs.ifi.lmu.de/~yu_k/ He showed me why whitening does not make data independent; sincere thanks to him. These professors are the ones I am most familiar with. However, with the emergence of the deep belief network and the deep Boltzmann machine, there are many other scholars as well. You can find a list from the 2012 UCLA Deep Learning Summer School: http://www.ipam.ucla.edu/programs/gss2012/
I have almost finished the first chapter. The chapter emphasizes how to use R to clean data, sort it to suit particular needs, and visualize it. I followed the author step by step through his work, and I understood most of the content except the code for the graphs, the ggplot2 code. So I feel I have hit a bottleneck if I wish to journey on: I need to familiarize myself with the graphical system used in MLFH, the ggplot2 package. That is why I have decided to take a detour and first figure out how to use ggplot2.
I am planning to organize a reading group on Dictionary Learning papers in the near future. It will meet once a week, for about 1-2 hours, via Skype. The readings will focus on Dictionary Learning / Sparse Representation, and especially on supervised dictionary learning. Applications are welcome (about 5 participants planned); the requirement is that you have published at least one paper in this area. For details, please email me at zhangzlacademy( a t )gmail.com. Please do not send messages through the site's internal mail.
Introduction: Neural Networks and Support Vector Machines (SVMs) are the representative methods of statistical learning. Both can be regarded as descending from the Perceptron, the linear classification model invented by Rosenblatt in 1958. The perceptron handles linear classification, but real-world classification problems are usually nonlinear. Neural networks and SVMs (with kernel methods) are both nonlinear classification models. In 1986, Rumelhart and McClelland introduced the Back Propagation learning algorithm for neural networks; Vapnik and colleagues then proposed the SVM in 1992. A neural network is a multi-layer (usually three-layer) nonlinear model, while the SVM uses the kernel trick to turn a nonlinear problem into a linear one. Neural networks and SVMs have long been in "competition". Schölkopf, Vapnik's foremost student and a leading figure in SVM and kernel-method research, says that Vapnik originally invented the SVM precisely to "kill" neural networks ("He wanted to kill Neural Network"). SVMs are indeed very effective, and for a while the SVM camp had the upper hand. In recent years, however, Hinton, a master of the neural-network camp, proposed the Deep Learning algorithm (2006), which greatly improved the capability of neural networks and made them once again competitive with SVMs. Deep Learning assumes a multi-layer neural network: it first learns the structure of the network with a Boltzmann Machine (unsupervised learning), and then learns the network's weights via Back Propagation (supervised learning). On the name, Hinton once joked: "I want to call SVM shallow learning." Deep Learning itself simply means learning with deep, multi-layer networks. In short, Deep Learning is a new statistical learning algorithm worth watching.

Deep Learning is a new area of machine learning research, introduced to move ML closer to its original goal: AI. See "a brief introduction to Machine Learning for AI" and "an introduction to Deep Learning algorithms". Deep learning is about learning multiple levels of representation and abstraction that help make sense of data such as images, sound, and text. For more on deep learning algorithms, see: the monograph or review paper Learning Deep Architectures for AI (Foundations & Trends in Machine Learning, 2009); the ICML 2009 Workshop on Learning Feature Hierarchies webpage, which has a list of references; the LISA public wiki, which has a reading list and a bibliography; and Geoff Hinton's readings from last year's NIPS tutorial.
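The perceptron mentioned above can be sketched as a minimal linear classifier trained with Rosenblatt's update rule. The toy data below are invented for illustration:

```python
# Rosenblatt's perceptron: a linear classifier trained by correcting
# misclassified examples. Toy, linearly separable data (AND-like).
import numpy as np

def train_perceptron(X, y, epochs=10):
    """y in {-1, +1}; returns weights w and bias b."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # misclassified (or on the boundary)
                w += yi * xi             # Rosenblatt update
                b += yi
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w, b = train_perceptron(X, y)
preds = np.sign(X @ w + b)
print(preds)
```

For linearly separable data like this, the update rule is guaranteed to converge; for nonlinear problems it fails, which is exactly the limitation the passage above describes.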
This tutorial introduces some of the most important deep learning algorithms and demonstrates how to run them with Theano. Theano is a Python library that makes writing deep learning models easier, and also offers options for training them on a GPU. The tutorial has some prerequisites: you should know a little Python and be familiar with numpy. Since the tutorial is about using Theano, you should first read the Theano basic tutorial. Once you have done that, read the Getting Started chapter, which introduces the concept definitions, the datasets, and the method of optimizing models by stochastic gradient descent. The purely supervised learning algorithms can be read in this order: Logistic Regression - using Theano for something simple; Multilayer perceptron - introduction to layers; Deep Convolutional Network - a simplified version of LeNet5. The unsupervised and semi-supervised learning algorithms can be read in any order (the auto-encoders can be read independently of the RBM/DBN material): Auto Encoders, Denoising Autoencoders - description of autoencoders; Stacked Denoising Auto-Encoders - easy steps into unsupervised pre-training for deep nets; Restricted Boltzmann Machines - single layer generative RBM model; Deep Belief Networks - unsupervised generative pre-training of stacked RBMs followed by supervised fine-tuning. For the mcRBM model, there is also a new tutorial on sampling from energy models: HMC Sampling - hybrid (aka Hamiltonian) Monte-Carlo sampling with scan(). (The above is translated from http://deeplearning.net/tutorial/)

Deep learning is a new field of machine learning research. Its motivation is to build neural networks that simulate the human brain's analytical learning, interpreting data such as images, sound, and text by mimicking the mechanisms of the brain. Deep learning is a kind of unsupervised learning. The concept grew out of research on artificial neural networks: a multi-layer perceptron with several hidden layers is one kind of deep learning structure. Deep learning combines low-level features to form more abstract high-level representations (attribute categories or features), in order to discover distributed feature representations of the data. The concept was proposed by Hinton et al. in 2006, together with an unsupervised greedy layer-by-layer training algorithm based on the Deep Belief Network (DBN), which brought hope for solving the optimization problems associated with deep structures; multi-layer auto-encoder deep structures were proposed soon afterwards. In addition, the convolutional neural network proposed by LeCun et al. was the first true multi-layer structure learning algorithm, which uses spatial relative relationships to reduce the number of parameters and improve training performance.

1. The past and present of Deep Learning

In his 1950 paper, Turing proposed the idea of the Turing test: conversing through a wall, you cannot tell whether your interlocutor is a human or a computer. This undoubtedly set a very high expectation for computers, and for artificial intelligence in particular. But half a century later, progress in AI fell far short of the Turing test. This not only disappointed those who had waited for years, but led some to regard AI as a hoax and its field as a "pseudo-science". In June 2008, Chris Anderson, editor-in-chief of Wired, published an article titled "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete", quoting Peter Norvig, co-author of the classic Artificial Intelligence: A Modern Approach and then Google's Director of Research, to the effect that all models are wrong, and that, going further, you will succeed by abandoning them. The implication is that sophisticated algorithms are meaningless: faced with massive data, even simple algorithms produce excellent results, so rather than studying algorithms one should study cloud computing and big-data processing. Had these remarks been made before 2006, I might not have objected strongly. But since 2006 the field of machine learning has made breakthrough progress: the Turing test, at least, no longer seems so far out of reach. And the technical means depend not only on cloud computing's parallel processing of big data, but also on an algorithm. That algorithm is Deep Learning. With the help of Deep Learning, humanity has finally found a way to deal with the age-old problem of "abstract concepts". The academic world is thus busy recruiting the masters of the field; Alex Smola joining CMU is an episode of this background. The remaining suspense is which universities the two leading figures, Geoffrey Hinton and Yoshua Bengio, will finally join. Geoffrey Hinton has previously worked at
Cambridge and CMU, and currently teaches at the University of Toronto; no doubt plenty of famous universities are trying to recruit him. Yoshua Bengio's career is simpler: after receiving his doctorate from McGill University, he went to MIT for a postdoc under Mike Jordan. He currently teaches at the University of Montreal. The revolution ignited by Deep Learning is not only of great academic significance; it is also very close to money, extremely close. If the associated technical difficulties are a mountain, then beyond that mountain lies a giant open-pit gold mine: once the technical problems are solved, what remains is to stake out territory with the powerful instruments of capital and commerce. So the big companies have massed their forces and are eyeing the prize. Google has split its forces into two columns. The left column, headed by Jeff Dean and Andrew Ng, concentrates on breakthroughs in Deep Learning and related algorithms and applications. Jeff Dean ranks first among Google's Fellows; GFS is his masterpiece. Andrew Ng did his undergraduate work at CMU and then went to MIT to follow Mike Jordan. When Jordan, on bad terms at MIT, left in anger for UC Berkeley, Ng followed his advisor without hesitation. After his PhD he joined the Stanford faculty, where he is one of the outstanding professors of the new generation, while also working part-time at Google. Google's right column is commanded by Amit Singhal, whose objective is to build the Knowledge Graph infrastructure. After receiving his PhD from Cornell University in 1996, Amit Singhal worked at Bell Labs and joined Google in 2000. The story goes that at his Google interview he told founder Sergey Brin, "Your engine is excellent, but let me rewrite it!" From anyone else, that might have earned a slap in the face, but Brin was magnanimous: far from blaming the young man for his audacity, he really did let him work on the development of a new-generation ranking system. Amit Singhal is now a Senior Vice President at Google, in charge of its most central business, the search engine. Google has staked its trump card of trump cards on Deep Learning and the Knowledge Graph, aiming to seize the fruits of the big-data revolution faster and on a larger scale.

Reference
Turing Test. http://en.wikipedia.org/wiki/Turing_test
The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. http://www.wired.com/science/discoveries/magazine/16-07/pb_theory
Introduction to Deep Learning. http://en.wikipedia.org/wiki/Deep_learning
Interview with Amit Singhal, Google Fellow.
http://searchengineland.com/interview-with-amit-singhal-google-fellow-121342
Original post: http://blog.sina.com.cn/s/blog_46d0a3930101fswl.html
Author's Weibo: http://weibo.com/kandeng#1360336038853

2. The basic ideas and methods of Deep Learning

In real life, to solve a problem such as classifying objects (which may be documents, images, and so on), the first thing one must do is decide how to represent an object, i.e. extract some features to represent it. In text processing, for instance, a document is often represented as a set of words, or represented in a vector space (the VSM model), and only then can different classification algorithms be applied. Similarly, in image processing an image can be represented as a set of pixels; later, new feature representations such as SIFT were proposed, and they perform very well in many image-processing applications. How well the features are chosen has an enormous influence on the final result, so choosing the right features for a practical problem is extremely important. However, selecting features by hand is laborious and heuristic, and how well it works depends largely on experience and luck. If manual feature selection is unsatisfactory, can we learn features automatically? The answer is yes, and that is exactly what Deep Learning does. One of its other names, Unsupervised Feature Learning, says it all: "unsupervised" means that no human takes part in the feature-selection process. Methods that learn features automatically are thus collectively called Deep Learning.

1) The basic idea of Deep Learning

Suppose we have a system S with n layers (S1, ..., Sn), input I and output O, represented schematically as I => S1 => S2 => ... => Sn => O. If the output O equals the input I, i.e. the input passes through the system with no loss of information whatsoever, then no information is lost at any layer Si either: every layer Si is just another representation of the original information (the input I). Returning to our topic, Deep Learning: we want to learn features automatically. Suppose we have a collection of inputs I (a pile of images or texts, say) and we design a system S with n layers. By adjusting the parameters of the system so that its output is still the input I, we automatically obtain a series of hierarchical features of the input I, namely S1, ..., Sn. We assumed above that the output is strictly equal to the input. This constraint is too strict; we can relax it slightly, requiring only that the difference between input and output be as small as possible. This relaxation leads to another, different family of Deep Learning methods. That is the basic idea of Deep Learning.

2) Common Deep Learning methods

a) AutoEncoder. The simplest method exploits the characteristics of artificial neural networks. An ANN is itself a hierarchical system. If we take a neural network, assume its output equals its input, and train it to adjust its parameters, we obtain the weights of every layer. Naturally, we then have several different representations of the input I (each layer is one representation), and these representations are features. Research has found that adding these automatically learned features to the original ones can greatly improve accuracy, even beating the current best classification algorithms on some classification problems! This method is called the AutoEncoder. We can of course add further constraints to obtain new Deep Learning methods; for example, adding an L1 regularity constraint on top of the AutoEncoder (L1 essentially constrains most of the nodes in each layer to be zero, with only a few nonzero, which is the origin of the name "Sparse") yields the Sparse AutoEncoder method. b)
Sparse Coding

If we relax the requirement that the output equal the input, and use the linear-algebra notion of a basis, i.e. O = W1*B1 + W2*B2 + ... + Wn*Bn, where the Bi are basis vectors and the Wi are coefficients, we get the optimization problem

Min |I - O|

Solving it yields the coefficients Wi and the bases Bi, which together form another approximate representation of the input; they can therefore serve as features expressing the input I, and this representation, too, is learned automatically. Adding an L1 regularity constraint on top gives

Min |I - O| + u*(|W1| + |W2| + ... + |Wn|)

This method is called Sparse Coding.

c) Restricted Boltzmann Machine (RBM)

Take a bipartite graph with no links between the nodes within each layer. One layer is the visible layer, i.e. the input data layer (v), and the other is the hidden layer (h). If all nodes are binary variables (taking only the values 0 or 1), and the joint distribution p(v, h) is a Boltzmann distribution, we call the model a Restricted Boltzmann Machine (RBM). Let us see why it is a Deep Learning method. Because the model is a bipartite graph, the hidden nodes are conditionally independent given v, i.e. p(h|v) = p(h1|v) ... p(hn|v); likewise, the visible nodes are conditionally independent given the hidden layer h. Since all v and h follow a Boltzmann distribution, when an input v is given we can obtain the hidden layer h through p(h|v), and from h we can obtain the visible layer through p(v|h). If, by adjusting the parameters, we make the visible layer v1 reconstructed from the hidden layer identical to the original visible layer v, then the hidden layer is another representation of the visible layer. The hidden layer can thus serve as features of the visible input data, so the RBM is a Deep Learning method.

If we increase the number of hidden layers, we get a Deep Boltzmann Machine (DBM). If we use a Bayesian belief network (a directed graphical model, still with no links between nodes within a layer) in the part near the visible layer, and a Restricted Boltzmann Machine in the part farthest from the visible layer, we get a Deep Belief Net (DBN).

There are other Deep Learning methods, which we will not describe here. In short, Deep Learning automatically learns another representation of the data, which can be added as features to the original feature set of a problem to improve the performance of learning methods; it is currently a hot research topic in industry.

Original post: http://blog.csdn.net/xianlingmao/article/details/8478562

III. A Brief Introduction to Deep Learning Algorithms

See the latest paper: Yoshua Bengio, Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, 2(1), 2009

Depth

The computation involved in producing an output from an input can be represented by a flow graph: a graph in which each node represents an elementary computation and a value (the result of the computation, applied to the values at the children of that node). Consider the set of computations allowed at each node, together with the possible graph structures; this defines a family of functions. Input nodes have no children; output nodes have no parents.

The flow graph for the expression sin(a^2 + b/a) can be represented by a graph with two input nodes a and b: one node takes a and b as inputs (i.e. as children) to represent b/a; one node takes only a as input to represent the square a^2; one node takes a^2 and b/a as inputs to represent the addition term (whose value is a^2 + b/a); and a final output node computes the sine, using a single input coming from the addition node.

A particular property of such flow graphs is their depth: the length of the longest path from an input to an output. A traditional feedforward neural network can be seen as having depth equal to its number of layers (the number of hidden layers plus 1 for the output layer). SVMs have depth 2 (one level for the kernel outputs or the feature space, and another for the linear combination producing the output).

Motivations for deep architectures

The main motivations for studying learning algorithms based on deep architectures are:

- insufficient depth can hurt;
- the brain has a deep architecture;
- cognitive processes are deep.

Insufficient depth can hurt

In many cases depth 2 is enough (e.g. with logical gates, formal neurons, sigmoid neurons, or Radial Basis Function units as in
SVMs) to represent any function to a given target accuracy. But the cost can be that the number of nodes required in the graph (and hence the amount of computation and the number of parameters) may grow very large. Theoretical results show that there exist families of functions for which the required number of nodes grows exponentially with the input size. This has been proved for logical gates, formal neurons, and RBF units. In the latter case, Hastad showed that when the depth is d, a family of functions can be represented efficiently (compactly) with O(n) nodes (for n inputs), but that if the depth is restricted to d-1, an exponential number of nodes, O(2^n), is required.

We can regard a deep architecture as a kind of factorization. Most randomly chosen functions cannot be represented efficiently, whether with a deep or a shallow architecture. But many functions that can be represented efficiently by a deep architecture cannot be represented efficiently by a shallow one (see the polynomials example in the Bengio survey paper). The existence of a compact, deep representation implies some structure in the underlying function to be represented. If there were no structure at all, it would be impossible to generalize well.

The brain has a deep architecture

The visual cortex, for example, is well studied and shows a sequence of regions, each containing a representation of the input and signals flowing from one region to the next (this ignores the connections along parallel paths at some levels, so the reality is more complicated). Each level of this feature hierarchy represents the input at a different level of abstraction, with more abstract features further up the hierarchy, defined in terms of the lower-level ones.

Note that the representations in the brain are in between densely distributed and purely local: they are sparse, with about 1% of neurons active simultaneously. Given the huge number of neurons, this is still a very efficient (exponentially efficient) representation.

Cognitive processes seem to be deep

- Humans organize thoughts and concepts hierarchically.
- Humans first learn simple concepts and then use them to represent more abstract ones.
- Engineers decompose a task into multiple levels of abstraction.

It would be nice to learn / discover these concepts (did knowledge engineering fail for lack of introspection?). Introspection on linguistically expressible concepts also suggests a sparse representation: only a small fraction of all possible words/concepts apply to a particular input (say, a visual scene).

Breakthrough in learning deep architectures

Before 2006, attempts to train deep architectures failed: training a deep supervised feedforward neural network tended to yield worse results (in both training and test error) than shallow ones (with 1 or 2 hidden layers).

Three papers changed that in 2006, spearheaded by Hinton's revolutionary work on Deep Belief Networks (DBNs):

Hinton, G. E., Osindero, S. and Teh, Y., A fast learning algorithm for deep belief nets. Neural Computation 18:1527-1554, 2006

Yoshua Bengio, Pascal Lamblin, Dan Popovici and Hugo Larochelle, Greedy Layer-Wise Training of Deep Networks, in J. Platt et al. (Eds), Advances in Neural Information Processing Systems 19 (NIPS 2006), pp. 153-160, MIT Press, 2007

Marc'Aurelio Ranzato, Christopher Poultney, Sumit Chopra and Yann LeCun, Efficient Learning of Sparse Representations with an Energy-Based Model, in J. Platt et al.
(Eds), Advances in Neural Information Processing Systems (NIPS 2006), MIT Press, 2007

The following key principles appear in all three papers:

- unsupervised learning of representations is used to (pre-)train each layer;
- unsupervised training is done one layer at a time, each layer on top of the previously trained ones, with the representation learned at each layer used as input to the next layer;
- supervised training is then used to fine-tune all the layers (together with one or more additional layers dedicated to producing predictions).

The DBNs use RBMs for the unsupervised learning of representations at each layer. The Bengio et al. paper explores and compares RBMs and auto-encoders (neural networks that predict their own input through a representational bottleneck in an internal layer). The Ranzato et al. paper uses sparse auto-encoders (similar to sparse coding) in the context of a convolutional architecture. Auto-encoders and convolutional architectures will be covered in later lectures.

Since 2006 a large number of papers on deep learning have been published, some exploring other principles to guide the training of the intermediate representations; see Learning Deep Architectures for AI.

Original post: http://www.cnblogs.com/ysjxw/archive/2011/10/08/2201782.html

IV. Recommended Further Study

Classic Deep Learning reading material:

The monograph or review paper Learning Deep Architectures for AI (Foundations and Trends in Machine Learning, 2009).
The ICML 2009 Workshop on Learning Feature Hierarchies webpage has a list of references.
The LISA public wiki has a reading list and a bibliography.
Geoff Hinton has readings from last year's NIPS tutorial.

Deep Learning tool, Theano: Theano is a Python library for deep learning. It requires familiarity with Python and numpy first; readers are advised to work through the Theano basic tutorial, then follow Getting Started to download the relevant datasets and practice learning with gradient descent.

Once you know the basics of Theano, try writing the following algorithms:

Supervised learning:
Logistic Regression - using Theano for something simple
Multilayer perceptron - introduction to layers
Deep Convolutional Network - a simplified version of LeNet5

Unsupervised learning:
Auto Encoders, Denoising Autoencoders - description of autoencoders
Stacked Denoising Auto-Encoders - easy steps into unsupervised pre-training for deep nets
Restricted Boltzmann Machines - single layer generative RBM model
Deep Belief Networks - unsupervised generative pre-training of stacked RBMs followed by supervised fine-tuning

Finally, some recommended ML books:
Chris Bishop, "Pattern Recognition and Machine Learning", 2007
Simon Haykin, "Neural Networks: a Comprehensive Foundation", 2009 (3rd edition)
Richard O. Duda, Peter E. Hart and David G.
Stork, "Pattern Classification", 2001 (2nd edition)

Original post: http://blog.csdn.net/abcjennifer/article/details/7826917

V. Application Examples

1. Computer vision

ImageNet Classification with Deep Convolutional Neural Networks, Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, NIPS 2012.
Learning Hierarchical Features for Scene Labeling, Clement Farabet, Camille Couprie, Laurent Najman and Yann LeCun, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.
Learning Convolutional Feature Hierarchies for Visual Recognition, Koray Kavukcuoglu, Pierre Sermanet, Y-Lan Boureau, Karol Gregor, Michaël Mathieu and Yann LeCun, Advances in Neural Information Processing Systems (NIPS 2010), 23, 2010.

2. Speech recognition

Working with Hinton, Microsoft researchers were the first to bring RBMs and DBNs into the training of acoustic models for speech recognition, with great success on large-vocabulary systems: the recognition error rate dropped by about 30% relative. DNNs still lack efficient parallel training algorithms, however, so many research groups are currently using large speech corpora on GPU platforms to improve the training efficiency of DNN acoustic models. Internationally, IBM, Google and other companies have moved quickly into DNN-based speech recognition research; in China, iFlytek, Baidu, the Institute of Automation of the Chinese Academy of Sciences and other companies and research institutes are also studying deep learning for speech recognition.

3. Natural language processing and other fields

Many groups are working on it, but deep learning has not yet produced a systematic breakthrough in natural language processing.

VI. Reference links:

1. http://baike.baidu.com/view/9964 ... enter=deep+learning
2. http://www.cnblogs.com/ysjxw/archive/2011/10/08/2201819.html
3. http://blog.csdn.net/abcjennifer/article/details/7826917

Reposted from http://elevencitys.com/?p=1854
Stanford's Deep Learning wiki and tutorial: http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial
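The AutoEncoder described in section II (train a network so that its output reproduces its input, then keep the hidden layer as a learned feature representation) can be sketched in a few dozen lines of plain numpy. This is a minimal toy illustration, not code from any of the papers or tutorials above; the data, network sizes, learning rate, and iteration count are all arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy data: 200 samples of 8-dimensional inputs with 3-dimensional structure.
Z = rng.normal(size=(200, 3))
X = sigmoid(Z @ rng.normal(size=(3, 8)))

n_in, n_hid = 8, 3
W1 = rng.normal(scale=0.1, size=(n_in, n_hid))   # encoder weights
b1 = np.zeros(n_hid)
W2 = rng.normal(scale=0.1, size=(n_hid, n_in))   # decoder weights
b2 = np.zeros(n_in)

losses = []
lr = 0.5
for _ in range(2000):
    H = sigmoid(X @ W1 + b1)          # hidden representation = learned features
    R = sigmoid(H @ W2 + b2)          # reconstruction of the input
    err = R - X
    losses.append(0.5 * np.mean(np.sum(err ** 2, axis=1)))
    # Backpropagate the reconstruction error through both sigmoid layers.
    dR = err * R * (1 - R) / len(X)
    dW2 = H.T @ dR
    db2 = dR.sum(axis=0)
    dH = (dR @ W2.T) * H * (1 - H)
    dW1 = X.T @ dH
    db1 = dH.sum(axis=0)
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

print(losses[0], losses[-1])  # reconstruction error should drop
```

Because the hidden layer is narrower than the input, the network is forced to find a compressed representation; it is H (not the reconstruction R) that would be fed to a classifier as automatically learned features.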
Existing Bayesian network learning algorithms focus on analyzing the data itself: in structure learning, for instance, they measure how well the data match a model (score-based methods), analyze the dependencies between nodes (constraint-based methods), or combine the two analyses (hybrid methods). With small-sample data, however, these analyses often lack sufficient statistical significance because the samples are limited, so traditional structure-learning algorithms do not apply.

(To be continued)

This work has already been published in IJAR: DOI: http://dx.doi.org/10.1016/j.ijar.2014.02.008

If you have any questions, please do not hesitate to contact me. Cheers,
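Score-based structure learning, mentioned above, rates each candidate network structure by how well it fits the data; a common choice is a BIC-style score (maximized log-likelihood minus a complexity penalty). The sketch below, for binary variables only, is our own toy illustration of the idea, not the method of the IJAR paper; note that with only a handful of samples the score gap between structures shrinks, which is exactly the small-sample problem described above:

```python
import math
import random
from collections import Counter

random.seed(0)

# Toy binary data: B depends strongly on A; C is independent noise.
data = []
for _ in range(300):
    a = random.random() < 0.5
    b = a if random.random() < 0.9 else not a
    c = random.random() < 0.5
    data.append({"A": int(a), "B": int(b), "C": int(c)})

def family_loglik(child, parents, rows):
    """Maximized log-likelihood of one node given its parents (binary case)."""
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in rows)
    marg = Counter(tuple(r[p] for p in parents) for r in rows)
    return sum(n * math.log(n / marg[pa]) for (pa, _), n in joint.items())

def bic(structure, rows):
    """structure maps node -> parent list; BIC = loglik - (#params/2) * log N."""
    n = len(rows)
    score = 0.0
    for child, parents in structure.items():
        score += family_loglik(child, parents, rows)
        # one free parameter per configuration of the parents (binary child)
        score -= 0.5 * (2 ** len(parents)) * math.log(n)
    return score

s_true = {"A": [], "B": ["A"], "C": []}   # contains the real A -> B edge
s_empty = {"A": [], "B": [], "C": []}     # fully independent model
print(bic(s_true, data), bic(s_empty, data))
```

With 300 samples the structure containing the true A -> B edge scores higher; shrink the dataset to a few dozen rows and the penalty term starts to dominate the likelihood gain.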
(1) LLE (Prof. Sam Roweis, who sadly passed away): Nonlinear dimensionality reduction by locally linear embedding
(2) Isomap: A global geometric framework for nonlinear dimensionality reduction
(3) Eigenmap: Laplacian eigenmaps for dimensionality reduction and data representation
(4) Neural network (Prof. Geoffrey Hinton): Reducing the dimensionality of data with neural networks
(5) Survey paper (Prof. Yoshua Bengio): Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering
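The papers listed above are nonlinear refinements of the classic linear baseline, PCA, which reduces dimensionality by projecting the centered data onto the top eigenvectors of its covariance matrix. A standard numpy sketch for comparison (not code from any of the listed papers; the toy data are our own):

```python
import numpy as np

rng = np.random.default_rng(2)

# 500 points that live (up to small noise) on a 2-D plane inside R^10.
latent = rng.normal(size=(500, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(500, 10))

def pca(X, k):
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)
    vals, vecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    comps = vecs[:, ::-1][:, :k]         # top-k principal directions
    return Xc @ comps, comps

Y, comps = pca(X, 2)
# Reconstruct from the 2-D embedding and measure the relative error.
Xc = X - X.mean(axis=0)
rel_err = np.linalg.norm(Xc - Y @ comps.T) / np.linalg.norm(Xc)
print("relative reconstruction error:", rel_err)
```

LLE, Isomap, and Laplacian eigenmaps replace this single global covariance eigenproblem with eigenproblems built from local neighborhood relations, which lets them unroll curved manifolds that PCA, being linear, cannot.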
Session: Online Feature Selection with Streaming Features
Reading: Online Feature Selection with Streaming Features
Reviewing: Affinity Learning with Diffusion on Tensor Product Graph
Writing: Joint-ViVo – Revs 2
Travel to Shanghai: 5038 CNY
Everyone is welcome to visit the OpenPR homepage, http://www.openpr.org.cn, and to offer comments and suggestions! OpenPR also looks forward to you sharing your code!

OpenPR stands for the Open Pattern Recognition project and is intended to be an open source platform for sharing algorithms of image processing, computer vision, natural language processing, pattern recognition, machine learning and the related fields. Code released by OpenPR is under the BSD license, and can be freely used for education and academic research. OpenPR is currently supported by the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences.

Thresholding program
This is a demo program on global thresholding for images of bright small objects, such as aircraft at airports. The program includes four methods: Otsu, 2D-Tsallis, PSSIM, and the Smoothness method.
Authors: Chen Xueyun  E-mail: xueyun.chen@nlpr.ia.ac.cn

Principal Component Analysis Based on Nonparametric Max...
In this paper, we propose an improved principal component analysis based on maximum entropy (MaxEnt) preservation, called MaxEnt-PCA, which is derived from a Parzen window estimation of Renyi's quadratic entropy. Instead of minimizing the reconstruction ...
Authors: Ran He  E-mail: rhe@nlpr.ia.ac.cn

Metropolis–Hastings algorithm
The Metropolis–Hastings algorithm is a Markov chain Monte Carlo method for obtaining a sequence of random samples from a probability distribution for which direct sampling is difficult. This sequence can be used to approximate the distribution.
Authors: Gong Xing  E-mail: xgong@nlpr.ia.ac.cn  Tags: sampling, distribution

Maximum Correntropy Criterion for Robust Face Recogniti...
This code is developed based on Uriel Roque's active set algorithm for the linear least squares problem with nonnegative variables in: Portugal, L.; Judice, J.; and Vicente, L. 1994. A comparison of block pivoting and interior-point algorithms for linear ...
Authors: Ran He  E-mail: rhe@nlpr.ia.ac.cn  Tags: pattern recognition

Naive Bayes EM Algorithm
OpenPR-NBEM is a C++ implementation of the Naive Bayes Classifier, a well-known generative classification algorithm for applications such as text classification. The Naive Bayes algorithm requires the probabilistic distribution to be discrete. Op ...
Authors: Rui Xia  E-mail: rxia@nlpr.ia.ac.cn  Tags: pattern recognition, natural language processing, text classification

Local Binary Pattern
This is a class to calculate the histogram of LBP (local binary patterns) from an input image, histograms of LBP-TOP (local binary patterns on three orthogonal planes) from an image sequence, and the histogram of the rotation-invariant VLBP (volume local binary patte ...
Authors: Jia Wu  E-mail: jwu@nlpr.ia.ac.cn  Tags: computer vision, image processing, pattern recognition

Two-stage Sparse Representation
This program implements a novel robust sparse representation method, called the two-stage sparse representation (TSR), for robust recognition on a large-scale database. Based on the divide-and-conquer strategy, TSR divides the procedure of robust recogni ...
Authors: Ran He  E-mail: rhe@dlut.edu.cn  Tags: pattern recognition

CMatrix Class
A C++ program for symmetric matrix diagonalization, inversion and principal component analysis (PCA). The matrix diagonalization function can also be applied to the computation of singular value decomposition (SVD), Fisher linear discriminant analysis ...
Authors: Chenglin Liu  E-mail: liucl@nlpr.ia.ac.cn  Tags: pattern recognition

P3P (Perspective 3-Points) Solver
This is an implementation of the solution to the classic P3P (Perspective 3-Points) problem from the RANSAC paper "M. A. Fischler, R. C. Bolles. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartogr ...
Authors: Zhaopeng Gu  E-mail: zpgu@nlpr.ia.ac.cn  Tags: Computer Vision, PNP, Extrinsic Calibration

Linear Discriminant Function Classifier
This program is a C++ implementation of a Linear Discriminant Function Classifier. Discriminant functions such as the perceptron criterion, cross entropy (CE) criterion, and least mean square (LMS) criterion (all for multi-class classification problems) are sup ...
Authors: Rui Xia  E-mail: rxia@nlpr.ia.ac.cn  Tags: linear classifier, discriminant function

Naive Bayes Classifier
This program is a C++ implementation of the Naive Bayes Classifier, a well-known generative classification algorithm for applications such as text classification. The Naive Bayes algorithm requires the probabilistic distribution to be discrete. Th ...
Authors: Rui Xia  E-mail: rxia@nlpr.ia.ac.cn  Tags: pattern recognition, natural language processing, text classification

OpenCV Based Extended Kalman Filter Frame
A simple and clear OpenCV-based extended Kalman filter (EKF) abstract class implementation, strictly following the standard EKF equations. Special thanks to the open source project KFilter1.3. It is easy to inherit from it to implement a variable state and me ...
Authors: Zhaopeng Gu  E-mail: zpgu@nlpr.ia.ac.cn  Tags: Computer Vision, EKF, INS

Supervised Latent Semantic Indexing
Supervised Latent Semantic Indexing (SLSI) is a supervised feature transformation method. The algorithms in this package are based on the iterative algorithm of Latent Semantic Indexing.
Authors: Mingbo Wang  E-mail: mb.wang@nlpr.ia.ac.cn

SIFT Extractor
This program is used to extract SIFT points from an image.
Authors: Zhenhui Xu  E-mail: zhxu@nlpr.ia.ac.cn  Tags: computer vision

OpenPR-0.0.2
The Scilab Pattern Recognition Toolbox is a toolbox developed for the Scilab software, and is used in pattern recognition, machine learning and the related fields. It is developed for the purpose of education and research.
Authors: Jia Wu  E-mail: jiawu83@gmail.com  Tags: pattern recognition

Layer-Based Dependency Parser
LDPar is an efficient data-driven dependency parser. You can train your own parsing model on treebank data and parse new data using the induced model.
Authors: Ping Jian  E-mail: pjian@nlpr.ia.ac.cn  Tags: natural language processing

Probabilistic Latent Semantic Indexing
Authors: Mingbo Wang  E-mail: mbwang@nlpr.ia.ac.cn

Calculate Normalized Information Measures
The toolbox calculates normalized information measures from a given m by (m+1) confusion matrix for objective evaluations of an abstaining classifier. It includes a total of 24 normalized information measures based on three groups of definitions, that is, ...
Authors: Baogang Hu  E-mail: hubg@nlpr.ia.ac.cn

Quasi-Dense Matching
This program is used to find point matches between two images. The procedure can be divided into two parts: 1) use the SIFT matching algorithm to find sparse point matches between the two images; 2) use the "quasi-dense propagation" algorithm to get "quasi-dense" p ...
Authors: Zhenhui Xu  E-mail: zhxu@nlpr.ia.ac.cn

Agglomerative Mean-Shift Clustering
Mean-Shift (MS) is a powerful non-parametric clustering method. Although good accuracy can be achieved, its computational cost is particularly expensive even on moderate data sets. For the purpose of algorithm speedup, an agglomerative MS clustering metho ...
Authors: Xiao-Tong Yuan  E-mail: xtyuan@nlpr.ia.ac.cn

Histograms of Oriented Gradients (HOG) Feature Extracti...
This program is used to extract HOG (histograms of oriented gradients) features from images. The integral histogram is used for fast histogram extraction. Both APIs and a binary utility are provided.
Authors: Liang-Liang He  E-mail: llhe@nlpr.ia.ac.cn

The related PPT slides can be downloaded from the Visual Computing Research Forum (SIGVC BBS): http://www.sigvc.org/bbs/thread-272-1-1.html
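As a flavor of the algorithms catalogued above, the Metropolis–Hastings entry can be illustrated in a few lines: draw dependent samples from an unnormalized density using a symmetric random-walk proposal, accepting each move with probability min(1, ratio of target densities). This is a generic textbook sketch in Python, unrelated to the actual OpenPR C++ code; the target density, step size, and burn-in length are arbitrary choices:

```python
import math
import random

random.seed(0)

def target(x):
    """Unnormalized density of a standard normal."""
    return math.exp(-0.5 * x * x)

def metropolis_hastings(n_samples, step=1.0, burn_in=1000):
    x = 0.0
    samples = []
    for i in range(n_samples + burn_in):
        proposal = x + random.gauss(0.0, step)   # symmetric random walk
        # Accept with probability min(1, target(proposal) / target(x)).
        if random.random() < target(proposal) / target(x):
            x = proposal
        if i >= burn_in:
            samples.append(x)
    return samples

s = metropolis_hastings(20000)
mean = sum(s) / len(s)
var = sum((v - mean) ** 2 for v in s) / len(s)
print(mean, var)
```

Because the proposal is symmetric, the Hastings correction term cancels and only the target-density ratio remains; the empirical mean and variance of the chain approximate those of the standard normal (0 and 1).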
Speaker: Yang Jie
Date: 2012-09-19

Papers:
Paper #1: Kaizhu Huang, Zenglin Xu, Irwin King, Michael R. Lyu, Colin Campbell: Supervised Self-taught Learning: Actively transferring knowledge from unlabeled data. IJCNN 2009
Paper #2: X. Yu and Y. Aloimonos, Attribute-based Transfer Learning for Object Categorization with Zero or One Training Example, ECCV 2010

Summaries:

Paper #1
Problem: the bases learned by Self-taught Learning give only limited classification performance.
Motivation: the authors build a supervised variant on top of Self-taught Learning, integrating the three steps of self-taught learning into a single model. As noted at the last group meeting, bases learned from unlabeled data do not necessarily help classification; Huang's paper starts from exactly this point and brings supervised label information in to guide the learning of the bases, thereby improving classification performance.
Model: concretely, the objective function combines fitting the original data with a combination of bases and learning the SVM classification surface, in one formulation. For optimization, the problem is non-convex, but if part of the parameters is fixed it can be converted into two sub-problems optimized by alternating iteration, and both sub-problems are convex.

Paper #2
Problem: this paper tackles the zero/one-shot learning problem, where the training data contain no examples with the same label as the data to be classified, or very few such examples. The authors propose an attribute-based method on the Animals with Attributes (AwA) dataset; the attributes are fairly intuitive characteristics already manually annotated on the dataset (for example, a horse has four legs, so "four legs" is an attribute), and all the data share one attribute set.
Motivation: building on this, and combining it with the author-topic model, the authors propose an attribute model that learns the probabilistic associations between attributes and topics. These associations can then serve as priors, or the learned parameters can be used to synthesize artificial data that simulate the classes with no labeled examples.

The related PPT slides can be downloaded from the Visual Computing Research Forum (SIGVC BBS): http://www.sigvc.org/bbs/thread-107-1-1.html
http://matpalm.com/blog/2010/08/06/my-list-of-cool-machine-learning-books/

0) "Machine Learning: a Probabilistic Perspective" by Kevin Patrick Murphy
Now available from amazon.com and other vendors. Electronic versions (e.g., for Kindle) will be available later in the Fall.
Table of contents
Chapter 1 (Introduction)
Information for instructors from MIT Press. If you are an official instructor, you can request an e-copy, which can help you decide if the book is suitable for your class. You can also request the solutions manual.
Errata
Matlab software
All the figures, together with matlab code to generate them

1) "programming collective intelligence" by toby segaran
if you know nothing about machine learning and haven't done maths since high school then this is the book for you. it's a fantastically accessible introduction to the field. includes almost no theory and explains algorithms using actual python implementations.

2) "data mining" by witten and frank
this book covers quite a bit more than programming c.i. while still being extremely practical (ie very few formulas). about a fifth of the book is dedicated to weka, a machine learning workbench which was written by the authors. apart from the weka section this book has no code. i made a little screencast on weka awhile back if you're after a summary.

3) "introduction to data mining" by tan, steinbach and kumar
covers almost the same material as the witten/frank text but delves a little bit deeper and with more rigour. includes no code (none of the books do from now on) with algorithms described by formulas. has a number of appendices on linear algebra, probability, statistics etc so that you can read up if you're a bit rusty or new to those fields (the witten/frank text lacks these). some people might argue having both of these books is a waste since they cover so much of the same ground but i've always found multiple explanations from different authors to be a great way to help understand a topic.
i read the witten/frank text first and am glad i did, but if i could only keep one i'd keep this one.

intermission
at this point you've probably got enough mental firepower to handle some of the uni-level machine learning course notes that are floating about online. if you're keen to get a better foundation of the maths side of things it'd be worth working through andrew ng's lecture series on machine learning (20 hours of a second-year stanford course on machine learning). i also found andrew moore's lecture slides really great (they do though require a reasonable understanding of the basics).

4) "foundations of statistical natural language processing" by manning and schutze
not a machine learning book as such but great for learning to deal with one of the most common types of data around: text. since most of machine learning theory is about maths (ie numbers) this is awesome in helping to understand how to deal with text in a mathematical context.

5) "introduction to machine learning" by ethem alpaydin
covers generally the same sort of topics as the data mining books but with much more rigour and theory (derivations, proofs, etc). i think this is a good thing though since understanding how things work at a low level gives you the ability to tweak and modify as required. loads more formulas but again with appendices that introduce the basics in enough detail to get by.

6) "all of statistics" by larry wasserman
by this stage you'll probably have an appreciation of how important statistics is for this domain and it might be worth focusing on it for a bit. personally i found this book to be a great read and though i've only read certain sections in depth i'm looking forward to when i get a chance to work through it cover to cover.

7) "the elements of statistical learning" by hastie, tibshirani and friedman.
with a bit more stats under your belt you might have a chance of getting through this one; the most complex of the lot.
this book is absolutely beautifully presented and now that it's FREE to download you've got no reason not to have a crack at it. a remarkable piece of work and one i've yet to get through fully cover to cover, it's quite hardcore and right on the border of my level of understanding (which makes it perfect for me :P)

ps. books i haven't read that are in the mail

"machine learning" by tom mitchell
have been wanting to read this one for awhile, i'm a big fan of tom mitchell, but couldn't justify the cost. however just found out the other day the paperback is a third of the price of the hardback i was looking at!! the book's in the mail.

"pattern recognition and machine learning" by chris bishop
all of a sudden seemed like everyone was reading this but me so it was time to jump on the bandwagon.

On Pattern Classification (《模式分类》): if you come from a computer science or physics background, read Bishop's Machine Learning and Pattern Recognition first, then T. Hastie's Elements of Statistical Learning; if you come from mathematics or statistics, just reverse the order. Bishop's book is rather thick, so Jordan's statistical learning course notes are recommended instead: comprehensive and of moderate difficulty. http://www.cs.berkeley.edu/~jordan/courses/281B-spring04/
If you really have no appetite for English, have a look at Li Hang's book on statistical learning (《统计学习方法》), which is fairly basic. If you just want a sense of the application scenarios, Wu Jun's The Beauty of Mathematics (《数学之美》) is recommended.

The above is reposted from http://www.zhizhihu.com/html/y2012/4019.html
At a time of rising interest in new forms of teaching to effect greater learning, Harvard Magazine asked Harry Lewis, Gordon McKay professor of computer science , to recount how he rethought his—and his students’—roles in creating a new course, and what he learned from teaching it. ~The Editors Computer science is booming at Harvard (and across the country). The number of concentrators has nearly tripled in five years. For decades, most of our students have been converts; barely a third of recent CS graduates intended to study the field when they applied to college. But sometime in 2010, we realized that this boom was different from those of earlier years, when many of our students came to computer science from mathematics, physics, and engineering. Today many seem to be coming from the life sciences, social sciences, and humanities. Never having studied formal mathematics, these students were struggling in our mathematically demanding courses. Their calculus and linear algebra courses did not teach them the math that is used to reason about computer programs: logic, proofs, probability, and counting (figuring out how many poker hands have two pairs, for example). Without these tools they could become good computer programmers, but they couldn’t become computer scientists at all. It was time to create a new course to fill in the background. I’ve developed big courses like CS 50, our introduction to the field. Courses for specialists, like CS 121 (“Introduction to the Theory of Computation”) and CS 124 (“Data Structures and Algorithms”), the theory courses in the CS concentration. A lecture course mixing math and public policy—my “Bits” course, part of the Core and General Education curricula. Even a freshman seminar for 12, outside my professional expertise: on amateur athletics—really a social history of sports in America, heavily laced with Harvardiana. So I figured I knew how to create courses. 
They always come out well—at least by the standard that I can’t possibly do a worse job than the previous instructor! This time was different. Figuring out the right topics was the easy part. I polled faculty about their upper-level courses and asked them what math they wished their students knew. I looked at the websites of courses at competing institutions, and called some former students who teach those courses to get the real story. (College courses are no more likely to work as advertised than anything else described in a catalog.) Thus was born CS 20, “Discrete Mathematics for Computer Science.” But once I knew what I needed to teach, I started worrying. Every good course I have ever taught (or taken, for that matter) had a narrative. CS 121 is the story of computability, a century-long intellectual history as well as a beautiful suite of mathematical results. “Bits” is the drama of information freedom, the liberation of ideas from the physical media used to store and convey them (see “Study Card” ). CS 20, on the other hand, risked being more like therapy—so many treatments of this followed by so many doses of that, all nauseating. “It’s good for you” is not a winning premise for a course. And what if students did not show up for class? I had no desire to develop another set of finely crafted slides to be delivered to another near-empty lecture hall. I’ll accept the blame for the declining attendance. My classes are generally video-recorded for an Extension School audience. I believe that if the videos exist, then all my students should have them—and they should have my handouts too. In fact, I think I should share as much of these materials with the world as Harvard’s business interests permit. I could think of ways to force students to show up (not posting my slide decks, or administering unannounced quizzes, for example). But those would be tricks, devices to evade the truth: the digital explosion has changed higher education. 
In the digital world, there is no longer any reason to use class time to transfer the notes of the instructor to the notes of the student (without passing through the brain of either, as Mark Twain quipped). Instead, I should use the classroom differently. So I decided to change the bargain with my students. Attendance would be mandatory. Homework would be daily. There would be a reading assignment for every class. But when they got to class, they would talk to each other instead of listening to me. In class, I would become a coach helping students practice rather than an oracle spouting truths. We would “flip the classroom,” as they say: students would prepare for class in their rooms, and would spend their classroom time doing what we usually call “homework”—solving problems. And they would solve problems collaboratively, sitting around tables in small groups. Students would learn to learn from each other, and the professor would stop acting as though his job was to train people to sit alone and think until they came up with answers. A principal objective of the course would be not just to teach the material but to persuade these budding computer scientists that they could learn it. It had to be a drawing-in course, a confidence-building course, not a weeding-out course. I immediately ran into one daunting obstacle: there was no place to teach such a course. Every classroom big enough to hold 40 or 50 students was set up on the amphitheater plan perfected in Greece 2,500 years ago. Optimal for a performer addressing an audience; pessimal, as computer scientists would say, for students arguing with each other. The School of Engineering and Applied Sciences (SEAS) had not a single big space with a flat floor and doors that could be closed. Several other SEAS professors also wanted to experiment with their teaching styles, and in the fall of 2011 we started talking about designs. In remarkably short order by Harvard standards, SEAS made a dramatic decision. 
It would convert some underutilized library space on the third floor of Pierce Hall to a flat-floor classroom. In this prototype there would be minimal technology, just a projection system. Thanks to some heroic work by architects and engineers, the whole job was done between the end of classes in December and the start of classes in late January 2012. The space is bright, open, and intentionally low-tech. The room features lots of whiteboards, some fixed to the walls and others rolling on casters, and small paisley-shaped tables, easily rearranged to accommodate two, four, or six seats. Electric cables run underneath a raised floor and emerge here and there like hydras, sprouting multiple sockets for student laptops, which never seem to have working batteries. A few indispensable accouterments were needed—lots of wireless Internet connectivity; push-of-a-button shades to cover the spectacular skylight; and a guarantee from the building manager that the room would be restocked daily with working whiteboard markers. About 40 brave souls showed up to be the guinea pigs in what I told them would be an experiment. To make the point about how the course would work, I gave on day one not the usual hour-long synopsis of the course and explanation of grading percentages, but a short substantial talk on the “pigeonhole principle”: If every pigeon goes in a pigeonhole and there are more pigeons than pigeonholes, some pigeonhole must have at least two pigeons. I then handed out a problem for the tables to solve using that principle, right then and there: prove that if you pick any 10 points from the area of a 1 x 1 square, then some two of them must be separated by no more than one-third of the square root of two. They got it, and they all came back for the next class, some with a friend or two. (Try it yourself—and remember, it helps to have someone else to work with!) After a few fits and starts, the course fell into a rhythm. 
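The 10-points puzzle above is a direct pigeonhole argument: cut the unit square into a 3 x 3 grid of nine cells of side 1/3; with 10 points and only 9 cells, some cell must hold two points, and two points inside a 1/3 x 1/3 cell are at most its diagonal, sqrt(2)/3 (one-third of the square root of two), apart. A quick numerical sanity check of the bound (our own illustration, not part of the course materials):

```python
import math
import random

random.seed(0)

BOUND = math.sqrt(2) / 3          # diagonal of a (1/3 x 1/3) cell

def min_pairwise_distance(points):
    return min(math.dist(p, q)
               for i, p in enumerate(points)
               for q in points[i + 1:])

# Any 10 points in the unit square must contain a pair within BOUND.
for _ in range(1000):
    pts = [(random.random(), random.random()) for _ in range(10)]
    assert min_pairwise_distance(pts) <= BOUND
print("all trials satisfied the bound:", BOUND)
```

The random trials can only corroborate the bound, not prove it; the pigeonhole argument is what guarantees it for every possible placement of the 10 points.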
We met Mondays, Wednesdays, and Fridays from 10 to 11 a.m. The course material was divided into bite-sized chunks, one topic per day. For each topic I created a slide presentation, which was the basis for a 20-minute mini-lecture I recorded on my laptop while sitting at home. The video and the slides were posted on the course website by the end of a given class so students could view them at their convenience before the next class. I also assigned 10 to 20 pages of reading from relevant sources that were free online. (A standard text for this material costs $218.67, and I just couldn’t ask students to spend that kind of money.) The students, in turn, had to answer some short questions online to prove they had done the reading and watched the video before showing up for class. Once in class, I worked one problem and then passed out copies of a sheet posing three or four others. Students worked in groups of four around tables, and each table wrote its solution on a whiteboard. A teaching fellow (TF), generally a junior or senior concentrating in math or computer science, coached and coaxed, and when a table declared it had solved a problem, finally called on a student to explain and defend the group’s solution. (This protocol provided an incentive for the members of a group to explain the solution to each other before one of them was called on.) At the end of the class, we posted the solutions to all the in-class problems, and also posted real homework problems, to be turned in at the beginning of the next class. We took attendance, and we collected the homework submissions at the beginning of class, to make sure people showed up on time. I had serious doubts about whether this protocol would actually work. Required attendance is countercultural at Harvard, as is daily homework to be submitted in class. And education requires the trust of the students. To learn anything, they have to believe the professors know what they are doing. 
I really didn’t, though I had observed a master teacher, Albert Meyer ’63, Ph.D. ’72, MIT’s Hitachi America professor of engineering, utilize this style with great skill. There was also the choppiness, the lack of a dramatic story line for the whole course. I took the cheap way out of that problem—I threw in some personal war stories related to the material. How Bill Gates ’77, LL.D. ’07, as a sophomore, cracked a problem I gave him about counting pancake flips and published a paper about it called “Bounds for Sorting By Prefix Reversal.” How Mark Zuckerberg ’06 put me at the center of his prototype social-network graph (so pay attention to graph theory, students, you never know when it might come in handy!). With no camera on me, I used the intimacy of the classroom for topical gossip—including updates on the five varsity athletes taking the course, three of them on teams that won Ivy championships during the term. Student feedback was gratifyingly positive. Anonymous responses to my questionnaire included “I’ve found this to be the most helpful teaching method at Harvard” and “Oh my goodness, the in-class problem-solving is beautiful! We need more of it.” Even the negative comments were positive. One student said, “The TFs are great. Professor Lewis’s teaching is not good. …I find it more useful to…talk to the TFs than listen to his lectures.” Fine, I thought to myself, I’ll talk less. My TFs have always been better teachers than I am, anyway, and lots of them are top professors now, so this is par for the course. My favorite: “You might say the class is a kind of start-up, and that its niche is the ‘class as context for active, engaging, useful, and fun problem-solving’ (as opposed to ‘class as context for sitting, listening, and being bored’).” Yes! Discrete mathematics as entrepreneurial educational disruption! What have we learned from the whole CS 20 experiment? 
Thirty-three topic units were a lot to prepare—each includes a slide deck, a recorded lecture, a selection of readings, a set of in-class problems, and homework exercises. The trickiest part was coordinating the workflow and getting everything at the right difficulty level—manageable within our severe time constraints, but hard enough to be instructive. Fortunately, my head TF, Michael Gelbart, a Princeton grad and a Ph.D. candidate in biophysics, is an organizational and pedagogical genius. When our homework problems were too hard and students became collectively discouraged or angry, we pacified the class with an offering of cupcakes or doughnut holes. We kept the classroom noncompetitive—we gave the normal sorts of exams, but students were not graded on their in-class performance, provided they showed up. That created an atmosphere of trust and support, but in-class problem-solving is pedagogically inefficient: I could have “covered” a lot more material if I were lecturing rather than confronting, in every class, students’ (mis)understanding of the material! Harvard’s class schedule, which allots three class hours per week for every course, is an anachronism of the lecture era; for this course we really need more class time for practice, drill, and testing. I relearned an old cultural lesson in a more international Harvard. Thirty-five years ago I learned the hard way never to assign an exam problem that required knowing the rules of baseball, because (who knew?) in most of the world children don’t grow up talking about innings and batting averages. This year I learned (happily, this time, before I made up the final exam) that there are places where children aren’t taught about hearts and diamonds, because card games are considered sinful. I also responded to some familiar student objections. 
Having weathered storms of protest in 1995 over randomizing the Houses, I anticipated that students would prefer to pick their own table-mates, but (true to type) I decided that mixing up the groups would make for greater educational dynamism. It worked, but next time I will go one step further. I will re-scramble the groups halfway through the course, so everyone can exchange their newly acquired problem-solving strategies with new partners. With a good set of recorded lectures and in-class problems now in hand, the class could be scaled pretty easily; we could offer multiple sections at different hours of the day, if we could get the classroom space and hire enough conscientious, articulate, mathematically mature undergraduate assistants. Fortunately, the Harvard student body includes a great many of the latter, and I owe a lot of thanks to those who assisted me this year—Ben Adlam, Paul Handorff, Abiola Laniyonu, and Rachel Zax—as well as to Albert Meyer and my colleague Paul Bamberg ’63, senior lecturer on mathematics, who gave me good advice and course materials to adapt for CS 20. I had the added satisfaction, as a longtime distance-education buff, of finding out that this experience could be replicated online. With the support of Henry Leitner, Ph.D. ’82, associate dean in the Division of Continuing Education and senior lecturer in computer science, we tried, and seem to have succeeded. In CSci E-120, offered this spring through the Harvard Extension School, a group of adventurous students, physically spread out from California to England, replicated the CS 20 “active learning” experience. They watched the same lectures and did the same reading on their own time. They “met” together synchronously for three hours per week (in the early evening for some, and the early morning for others). Web conferencing software allowed them to form virtual “tables” of four students each. 
Each “table” collaborated to solve problems by text-chatting and by scribbling on a shared virtual “whiteboard” using a tablet and stylus. My prize assistant, Deborah Abel ’01, “wandered” among the rooms just as the teaching fellows were doing in the physical space of my Pierce Hall classroom. Most of all, the course was for me an adventure in the co-evolution of education and technology—indeed, of life and technology. The excitement of computing created the demand for the course in the first place. The new teaching style was a response to the flood of digital content—and to my stubborn, libertarian refusal to dam it up. The course couldn’t have been done without digital infrastructure—five years ago I could not have recorded videos, unassisted and on my own time, for students to watch on theirs. The distance version of the course is an exercise in cyber-mediated intercontinental collaboration. Yet in the Harvard College classroom, almost nothing is digital. It is all person-to-person-to-person, a cacophony of squeaky markers and chattering students, assistants, and professor, above which every now and then can be heard those most joyous words, “Oh! I get it now!” Original article: http://harvardmagazine.com/2012/09/reinventing-the-classroom
A manifold-learning demo, with source code. The code implements the methods from the following papers:
MDS: Michael Lee's MDS code.
ISOMAP: J. B. Tenenbaum, V. de Silva, and J. C. Langford. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, vol. 290, pp. 2319-2323, 2000.
LLE: L. K. Saul and S. T. Roweis. Think Globally, Fit Locally: Unsupervised Learning of Low Dimensional Manifolds. Journal of Machine Learning Research, vol. 4, pp. 119-155, 2003.
Hessian LLE: D. L. Donoho and C. Grimes. Hessian Eigenmaps: New Locally Linear Embedding Techniques for High-Dimensional Data. Technical Report TR-2003-08, Department of Statistics, Stanford University, 2003.
Laplacian Eigenmap: M. Belkin and P. Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation, June 2003; 15(6):1373-1396.
Diffusion Maps: Nadler, Lafon, Coifman, and Kevrekidis. Diffusion Maps, Spectral Clustering and Reaction Coordinates of Dynamical Systems.
LTSA: Zhenyue Zhang and Hongyuan Zha. Principal Manifolds and Nonlinear Dimension Reduction via Tangent Space Alignment. SIAM Journal of Scientific Computing, 2004, 26(1):313-338.
Original link: http://www.math.ucla.edu/~wittman/mani/index.html
Cited from: http://blog.sciencenet.cn/blog-722391-583977.html
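The first method in the list, classical MDS, is simple enough to reproduce without the demo: double-center the squared distance matrix and take the top eigenvectors. Below is a self-contained NumPy sketch (my own illustration, not the demo's code; the test data is made up):

```python
# A self-contained sketch of classical (Torgerson) MDS, the first method in
# the list above.  NumPy only; variable names are mine, not the demo's.
import numpy as np

def classical_mds(D, k=2):
    """Embed points in R^k from an n x n matrix of pairwise Euclidean distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered (Gram) matrix
    w, V = np.linalg.eigh(B)              # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]         # keep the k largest
    L = np.sqrt(np.maximum(w[idx], 0.0))  # clip tiny negatives from round-off
    return V[:, idx] * L                  # n x k embedding

# Points that already lie in the plane should be recovered up to rotation
# and reflection, so pairwise distances are preserved exactly.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = classical_mds(D, k=2)
D2 = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
print(np.allclose(D, D2, atol=1e-6))  # True: distances are preserved
```

The nonlinear methods in the list (ISOMAP, LLE, Laplacian Eigenmaps, LTSA) can be viewed as variations on this eigendecomposition step applied to neighborhood-graph quantities instead of raw distances.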
Metric learning papers by venue:
ICML 2012
Maximum Margin Output Coding
Information-theoretic Semi-supervised Metric Learning via Entropy Regularization
A Hybrid Algorithm for Convex Semidefinite Optimization
Information-Theoretical Learning of Discriminative Clusters for Unsupervised Domain Adaptation
Similarity Learning for Provably Accurate Sparse Linear Classification
ICML 2011
Learning Discriminative Fisher Kernels
Learning Multi-View Neighborhood Preserving Projections
CVPR 2012
Order Determination and Sparsity-Regularized Metric Learning for Adaptive Visual Tracking
Non-sparse Linear Representations for Visual Tracking with Online Reservoir Metric Learning
Unsupervised Metric Fusion by Cross Diffusion
Learning Hierarchical Similarity Metrics
Large Scale Metric Learning from Equivalence Constraints
Neighborhood Repulsed Metric Learning for Kinship Verification
Learning Robust and Discriminative Multi-Instance Distance for Cost Effective Video Classification
PCCA: A New Approach for Distance Learning from Sparse Pairwise Constraints
Group Action Induced Distances for Averaging and Clustering Linear Dynamical Systems with Applications to the Analysis of Dynamic Visual Scenes
CVPR 2011
A Scalable Dual Approach to Semidefinite Metric Learning
AdaBoost on Low-Rank PSD Matrices for Metric Learning with Applications in Computer Aided Diagnosis
Adaptive Metric Differential Tracking (HUST)
Tracking Low Resolution Objects by Metric Preservation (HUST)
ACM MM 2012
Optimal Semi-Supervised Metric Learning for Image Retrieval
Low Rank Metric Learning for Social Image Retrieval
Activity-Based Person Identification Using Sparse Coding and Discriminative Metric Learning
Deep Nonlinear Metric Learning with Independent Subspace Analysis for Face Verification
ACM MM 2011
Biased Metric Learning for Person-Independent Head Pose Estimation
ICCV 2011
Learning Mixtures of Sparse Distance Metrics for Classification and Dimensionality Reduction
Unsupervised Metric Learning for Face Identification in TV Video
Random Ensemble Metrics for Object Recognition
Learning Nonlinear Distance Functions Using Neural Network for Regression with Application to Robust Human Age Estimation
Learning Parameterized Histogram Kernels on the Simplex Manifold for Image and Action Classification
ECCV 2012
Metric Learning for Large Scale Image Classification: Generalizing to New Classes at Near-Zero Cost
Dual-force Metric Learning for Robust Distractor Resistant Tracker
Learning to Match Appearances by Correlations in a Covariance Metric Space
Image Annotation Using Metric Learning in Semantic Neighbourhoods
Measuring Image Distances via Embedding in a Semantic Manifold
Supervised Earth Mover's Distance Learning and Its Computer Vision Applications
Learning Class-to-Image Distance via Large Margin and L1-norm Regularization
Labeling Images by Integrating Sparse Multiple Distance Learning and Semantic Context Modeling
IJCAI 2011
Distance Metric Learning Under Covariate Shift
Learning a Distance Metric by Empirical Loss Minimization
AAAI 2011
Efficiently Learning a Distance Metric for Large Margin Nearest Neighbor Classification
NIPS 2011
Learning a Distance Metric from a Network
Learning a Tree of Metrics with Disjoint Visual Features
Metric Learning with Multiple Kernels
KDD 2012
Random Forests for Metric Learning with Implicit Pairwise Position Dependence
WSDM 2011
Mining Social Images with Distance Metric Learning for Automated Image Tagging
Dear Colleague, BMJ Learning is celebrating success following the completion of two million learning modules. To mark this milestone, we're throwing open our 1,000-strong collection of modules for free, for everyone, for one week. To thank you for helping us hit the two million mark, you can access hundreds of modules not usually available, free of charge, any time during the week of July 2-9th. Whether it's our new, animated procedure-based modules, journal-related CPD, or anything from the A-Z of our offer (from Abdominal pain to Whooping cough), there's something extra for everyone. Here are some modules suitable for healthcare professionals to keep you busy in the meantime: Alcohol withdrawal: managing patients in the emergency department; Arterial blood gases: a guide to interpretation; Addison's disease; Upper gastrointestinal bleeding: a guide to diagnosis and management of non-variceal bleeding; Acute kidney injury: a guide to diagnosis and treatment. Best wishes, Dr. Helen Morant, Editor, Online learning http://learning.bmj.com/learning/info/twomillionthmodule.html?utm_source=Adestrautm_medium=emailutm_campaign=2994utm_content=Celebrate%20with%20us%20-%20over%201%2C000%20free%20modules%20for%20one%20weekutm_term=BMJ%20LearningCampaign+name=SP%20250612%20healthcare%20professions%20weekly%20alert%20fre
邹晓辉 (Zou Xiaohui): How can a national reading-and-writing project be carried out at the university level? Answering this question requires considering at least:
1. A survey of the current state of university teaching and research in three kinds of bilingual pairing (native vs. foreign language, natural vs. programming language, and general vernacular vs. specialized terminology), and of human-computer interaction;
2. How to draw on successful practices and cases in this area at home and abroad;
3. How to build a networked platform for these three kinds of bilingual teaching and research, and for human-computer interaction.
Appendix: NWP, the National Writing Project (http://www.nwp.org/)
Writing is Essential
Writing is essential to communication, learning, and citizenship. It is the currency of the new workplace and global economy. Writing helps us convey ideas, solve problems, and understand our changing world. Writing is a bridge to the future.
About NWP
Our Mission
The National Writing Project focuses the knowledge, expertise, and leadership of our nation's educators on sustained efforts to improve writing and learning for all learners.
Our Vision
Writing in its many forms is the signature means of communication in the 21st century. The NWP envisions a future where every person is an accomplished writer, engaged learner, and active participant in a digital, interconnected world.
Who We Are
Unique in breadth and scale, the NWP is a network of sites anchored at colleges and universities and serving teachers across disciplines and at all levels, early childhood through university. We provide professional development, develop resources, generate research, and act on knowledge to improve the teaching of writing and learning in schools and communities. The National Writing Project believes that access to high-quality educational experiences is a basic right of all learners and a cornerstone of equity. We work in partnership with institutions, organizations, and communities to develop and sustain leadership for educational improvement. Throughout our work, we value and seek diversity, our own as well as that of our students and their communities, and recognize that practice is strengthened when we incorporate multiple ways of knowing that are informed by culture and experience.
A Network of University-Based Sites
Co-directed by faculty from the local university and from K-12 schools, nearly 200 local sites serve all 50 states, the District of Columbia, Puerto Rico, and the U.S. Virgin Islands. Sites work in partnership with area school districts to offer high-quality professional development programs for educators. NWP continues to add new sites each year, with the goal of placing a writing project site within reach of every teacher in America. The network now includes two associated international sites.
A Successful Model Customized for Local Needs
NWP sites share a national program model, adhering to a set of shared principles and practices for teachers' professional development, and offering programs that are common across the network. In addition to developing a leadership cadre of local teachers (called "teacher-consultants") through invitational summer institutes, NWP sites design and deliver customized inservice programs for local schools, districts, and higher education institutions, and they provide a diverse array of continuing education and research opportunities for teachers at all levels. National research studies have confirmed significant gains in writing performance among students of teachers who have participated in NWP programs. The NWP is the only federally funded program that focuses on the teaching of writing. Support for the NWP is provided by the U.S. Department of Education, foundations, individuals, corporations, universities, and K-12 schools.
NWP Core Principles
The core principles at the foundation of NWP's national program model are:
Teachers at every level, from kindergarten through college, are the agents of reform; universities and schools are ideal partners for investing in that reform through professional development.
Writing can and should be taught, not just assigned, at every grade level.
Professional development programs should provide opportunities for teachers to work together to understand the full spectrum of writing development across grades and across subject areas.
Knowledge about the teaching of writing comes from many sources: theory and research, the analysis of practice, and the experience of writing.
Effective professional development programs provide frequent and ongoing opportunities for teachers to write and to examine theory, research, and practice together systematically.
There is no single right approach to teaching writing; however, some practices prove to be more effective than others. A reflective and informed community of practice is in the best position to design and develop comprehensive writing programs.
Teachers who are well informed and effective in their practice can be successful teachers of other teachers as well as partners in educational research, development, and implementation. Collectively, teacher-leaders are our greatest resource for educational reform.
http://www.nwp.org/cs/public/print/doc/about.csp
2012 International Workshop on Swarm Intelligent Systems (IWSIS2012) http://www1.tyust.edu.cn/yuanxi/yjjg/iwsis2012/iwsis2012.htm
Special Session: Recent Advances on Opposition-Based Learning Applications
Session Chair: Dr. Qingzheng Xu, Department of Military Electronic Engineering, Xi'an Communication Institute, China
Scope: Diverse forms of opposition exist virtually everywhere around us, and the interplay between entities and opposite entities is apparently fundamental for maintaining universal balance. However, there seems to be a gap regarding oppositional thinking in engineering, mathematics, and computer science. A better understanding of opposition could potentially establish new search, reasoning, optimization, and learning schemes with a wide range of applications. The main idea of opposition-based learning (OBL) is to consider opposite estimates, actions, or states as an attempt to increase coverage of the solution space and to reduce exploration time. OBL has already been applied to reinforcement learning, differential evolution, artificial neural networks, particle swarm optimization, ant colony optimization, genetic algorithms, and more. Example applications include large-scale optimization, multi-objective optimization, the traveling salesman problem, data mining, nonlinear system identification, and image processing and understanding. However, finding killer applications for OBL remains a hard task that is heavily pursued. The objective of this special session is to bring together state-of-the-art research results and industrial applications on this topic. Contributed papers must be the original work of the authors and must not have been published or be under consideration by other journals or conferences.
Topics of primary interest include, but are not limited to:
- Motivation and theory of opposition-based learning
- Opposition-based optimization techniques
- Reasoning and search strategies in opposition-based computing
- Real-world applications in signal processing, pattern recognition, image understanding, robotics, social networking, etc.
- Other methodologies and applications associated with opposition-based learning
Submission and review process: Submissions should follow the IWSIS2012 manuscript format described on the workshop website at http://www1.tyust.edu.cn/yuanxi/yjjg/iwsis2012/iwsis2012.htm . All papers must be submitted electronically, in PDF format only, by email to Dr. Qingzheng Xu at xuqingzheng@hotmail.com . All submitted papers will be strictly peer reviewed by at least two anonymous reviewers. Based on the reviewers' reports, the final decision on papers submitted to this special session will be taken by the general chairs of IWSIS2012, Prof. Zhihua Cui and Prof. Jianchao Zeng. All accepted papers will be published in EI-indexed journals as regular papers.
Important dates:
Submission: April 20, 2012
Acceptance: May 20, 2012
Registration: June 1, 2012
Final version: June 1, 2012
Publication: accepted papers will be published in EI-indexed international journals in late 2012 and early 2013
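The core OBL idea described in the session scope (evaluate the opposite of each candidate, a + b - x on a search interval [a, b], to cover the solution space faster) can be sketched in a few lines. The random-search wrapper and the objective function below are my own illustration, not material from the workshop:

```python
# Minimal sketch of opposition-based learning (OBL) inside random search:
# for each random candidate x in [a, b], also evaluate its opposite a + b - x
# and keep whichever is better.  Objective and parameters are illustrative.
import random

def obl_random_search(f, a, b, iters=200, seed=0):
    rng = random.Random(seed)
    best_x, best_fx = None, float("inf")
    for _ in range(iters):
        x = rng.uniform(a, b)
        xo = a + b - x                 # the opposite candidate
        for c in (x, xo):              # evaluate both, keep the better one
            fc = f(c)
            if fc < best_fx:
                best_x, best_fx = c, fc
    return best_x, best_fx

# Minimize (x - 3)^2 on [-10, 10]; the minimizer is x = 3.
x, fx = obl_random_search(lambda v: (v - 3.0) ** 2, -10.0, 10.0)
print(abs(x - 3.0) < 0.5)  # True: the search lands near the optimum
```

The same pairing trick is what OBL variants of differential evolution and particle swarm optimization apply to their populations.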
Future work on social tagging
The results from evaluations of social tags by experienced indexers in MELT highlighted a number of interesting issues that need further validation and investigation. Social tagging, as a feature in a conventional learning resource repository, is a very new phenomenon, and it will take time before those interested in this approach have well-developed evaluation methodologies and tools in this new context. Nevertheless, the MELT analysis shows that:
1. Tags that expert indexers don't understand mostly constitute 'noise', but there are exceptions to this (see 2).
2. Some tags travel across languages; i.e., people understand them even if they do not speak the language. These "travel well" tags can support retrieval in a multilingual context by facilitating the cross-border retrieval of resources.
3. Some tags are understood only by a sub-group of users (e.g. "esl" = English as a Second Language), enhancing cross-border use and adding value for these sub-groups, but mostly constituting 'noise' to others.
4. Some tags correspond to descriptors in the LRE Thesaurus and can be used as indexing keywords for a resource, especially when the existing indexing is poor or the tag represents a narrower term. "Thesaurus tags" can be used to determine the language equivalences between keywords, and affinities between tags and indexing keywords.
5. Thesaurus terms could be used to determine affinities between tags, thus helping describe resources as well as retrieve them in multiple languages.
6. Tags can lead to interesting non-descriptors in the thesaurus and thus facilitate and enhance multilinguality.
7. Tags can help enrich the thesaurus by suggesting new descriptors based on how users have used tags to describe resources.
Lots of food for thought here that can be investigated further once a critical mass of user-generated tags has been accumulated as a result of many thousands of teachers using the public version of the LRE portal in 2009.
Look out for a fast-expanding LRE tag cloud! [Figure: the LRE tag cloud in May 2009]
##A New Multi-Agent Q-Learning Algorithm (一种新的多智能体Q学习算法)
郭锐, 吴敏, 彭军, 彭姣, 曹卫华. Acta Automatica Sinica (自动化学报), Vol. 33, No. 4, April 2007.
Abstract: For multi-agent systems in non-deterministic Markov environments, a new multi-agent Q-learning algorithm is proposed. The algorithm learns the other agents' behavior policies from statistics over joint actions, and uses the full probability distribution over the agents' policy vectors to guarantee selection of the jointly optimal action. The convergence and learning performance of the algorithm are analyzed, and its application in the RoboCup multi-agent system further demonstrates its effectiveness and generalization ability.
Keywords: multi-agent, reinforcement learning, Q-learning
1 Introduction
Machine learning, classified by the feedback available: supervised learning; unsupervised learning; reinforcement learning (an adaptive learning method that takes feedback as input).
Multi-agent systems (MAS): cooperative co-evolutionary learning; the Minimax-Q learning algorithm; the FoF (friend-or-foe) Q-learning algorithm: competition + cooperation.
2 Multi-Agent Q-Learning
2.1 The idea: reinforcement learning + multi-agent systems. The arising difficulties: first, the environment model that reinforcement learning relies on must be revised; second, in a multi-agent system the learning agent should learn the other agents' policies, since the transition from the current state to the next is determined jointly by the actions of the learning agent and of the other agents.
2.2 The algorithm: learning policy; expected cumulative discounted return; incorporating the behavior of multiple agents; iteration; the multi-agent Q-learning algorithm.
2.3 Convergence and effectiveness analysis: 2.3.1 convergence proof; 2.3.2 effectiveness analysis (PAC criterion).
3 Application in RoboCup: the RoboCup simulated robot soccer competition.
4 Conclusion
My comment: I will study reinforcement learning in MAS as the next step; this Q-learning algorithm for MAS provides some insight and references for my future study. 一种新的多智能体Q学习算法.pdf
A paper by the same corresponding author, emphasizing the agent architecture in MAS: 一种新的多智能体系统结构及其在RoboCup中的应用.pdf
***
##A Novel Multi-Agent Reinforcement Learning Method (一种新颖的多agent强化学习方法)
周浦城, 洪炳镕, 黄庆成. Acta Electronica Sinica (电子学报), August 2006.
Abstract: A multi-agent reinforcement learning method is proposed that combines a modular structure, profit-sharing learning, and opponent modeling. The modular learning structure overcomes the curse of dimensionality in the state space; combining Q-learning with profit-sharing learning speeds up learning; and observation-based opponent modeling predicts the action distributions of the other agents. Simulations on the pursuit problem verify the effectiveness of the proposed method.
Keywords: multi-agent learning; Q-learning; profit-sharing learning; modular structure; opponent modeling
1 Introduction
Multi-agent systems + reinforcement learning: 1) one approach treats the multi-agent system as a single learning agent and applies single-agent RL, which runs into the curse of dimensionality; 2) the other gives each agent its own independent RL mechanism, with the agents learning cooperatively. (Note: compare with co-evolution.)
2 Reinforcement Learning
2.1 Q-learning: an RL method similar to dynamic programming.
2.2 Profit-sharing (PS) learning: a reinforcement learning algorithm.
3 The Proposed Modular RL Method
The learning agent consists of three modules: (1) a learning module (LM), which implements the RL algorithm; (2) an opponent module (OM), which estimates the other agents' action distributions from observation so the agent can evaluate its own action values; (3) a mediator module (MM), which combines the outputs of the learning and opponent modules.
3.1 Opponent modeling. In a MAS, the effect of an agent's action = the external environment + the influence of the other agents' actions.
3.2 The hybrid RL algorithm. The authors' idea: PS-learning + Q-learning.
3.3 Decision making in the mediator module.
3.4 The multi-agent learning process.
4 Simulation
4.1 The pursuit problem.
My comment: one of the authors is Prof. 洪炳镕, whose robotic dance group took part in the 2012 Spring Festival celebration.
From various sources, I believe he and his team may have developed a program to choreograph the robot dance, and that the kernel of the system also comes from a foreign company's commercial product. 一种新颖的多agent强化学习方法.pdf
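The common thread of the two papers above, estimating the other agents' policies from observed (joint-)action statistics, reduces in its simplest form to an empirical-frequency opponent model. The sketch below is my own minimal illustration of that idea, not code from either paper:

```python
# Sketch: model another agent's policy as the empirical frequency of its
# observed actions, the simplest form of the joint-action-statistics /
# opponent-modeling idea in the papers above.  The toy data is mine.
from collections import Counter

class OpponentModel:
    def __init__(self, actions):
        self.actions = list(actions)
        # Start each count at 1 (Laplace smoothing) so unseen actions
        # keep nonzero probability.
        self.counts = Counter({a: 1 for a in self.actions})

    def observe(self, action):
        self.counts[action] += 1

    def policy(self):
        total = sum(self.counts.values())
        return {a: self.counts[a] / total for a in self.actions}

model = OpponentModel(["left", "right"])
for a in ["left", "left", "right", "left"]:
    model.observe(a)
p = model.policy()
print(p["left"] > p["right"])  # True: "left" was observed more often
```

A learning agent can then weight its own Q-values by this estimated distribution when choosing a best response, which is essentially what the joint-action statistics in the first paper and the opponent module (OM) in the second provide.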
*** ##Reinforcement Learning: A Survey
Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Journal of Artificial Intelligence Research, 1996.
Abstract This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
1. Introduction
Two main strategies for solving reinforcement-learning problems: 1) search in the space of behaviors for one that performs well in the environment (genetic algorithms and genetic programming, as well as some more novel search techniques); 2) use statistical techniques and dynamic programming methods to estimate the utility of taking actions in states of the world.
The structure of this paper: 1) Section 1 is devoted to establishing notation and describing the basic reinforcement-learning model.
2) Section 2 explains the trade-off between exploration and exploitation and presents some solutions to the most basic case of reinforcement-learning problems. 3) Section 3 considers the more general problem in which rewards can be delayed in time from the actions that were crucial to gaining them. 4) Section 4 considers some classic model-free algorithms for reinforcement learning from delayed reward: adaptive heuristic critic, TD(λ), and Q-learning. 5) Section 5 demonstrates a continuum of algorithms that are sensitive to the amount of computation an agent can perform between actual steps of action in the environment. 6) Section 6 describes generalization, the cornerstone of mainstream machine learning research. 7) Section 7 considers the problems that arise when the agent does not have complete perceptual access to the state of the environment. 8) Section 8 catalogs some of reinforcement learning's successful applications. 9) Finally, Section 9 concludes with some speculations about important open problems and the future of reinforcement learning.
1.1 Reinforcement-Learning Model
Formally, the model consists of a discrete set of environment states S, a discrete set of agent actions A, and a set of scalar reinforcement signals, typically {0, 1} or the real numbers. The agent's job: to find a policy, mapping states to actions, that maximizes some long-run measure of reinforcement.
Reinforcement learning vs. supervised learning: 1) reinforcement learning has no presentation of input/output pairs; 2) on-line performance is important: the evaluation of the system is often concurrent with learning.
Reinforcement learning vs. search and planning issues in AI.
1.2 Models of Optimal Behavior
The crucial problem: what model of optimality should reinforcement learning adopt? How should the agent take the future into account in the decisions it makes about how to behave now?
Three models: 1) the finite-horizon model; 2) the infinite-horizon discounted model; 3) the average-reward model.
1.3 Measuring Learning Performance
Several incompatible measures: 1) eventual convergence to optimality; 2) speed of convergence to optimality; 3) regret.
1.4 Reinforcement Learning and Adaptive Control
Adaptive control vs. reinforcement learning. Adaptive control: a parameter estimation problem.
2. Exploitation versus Exploration: The Single-State Case
Reinforcement learning vs. supervised learning: a reinforcement learner must explicitly explore its environment.
Problem: the k-armed bandit problem. (Reference: 《基于信任和K臂赌博机问题选择多问题协商对象》, Journal of Software, 2006.)
The structure of this section: i) Section 2.1 discusses three solutions to the basic one-state bandit problem that have formal correctness results; ii) Section 2.2 presents three techniques that have had wide use in practice.
2.1 Formally Justified Techniques
2.1.1 Dynamic-Programming Approach: Bayesian reasoning
2.1.2 Gittins Allocation Indices
2.1.3 Learning Automata
2.2 Ad-Hoc Techniques
2.2.1 Greedy Strategies
2.2.2 Randomized Strategies: Boltzmann exploration
2.2.3 Interval-based Techniques
2.3 More General Problems
3. Delayed Reward
3.1 Markov Decision Processes
An MDP consists of a set of states S, a set of actions A, a reward function R, and a state transition function T. Definition: the model is Markov if the state transitions are independent of any previous environment states or agent actions.
3.2 Finding a Policy Given a Model
Tool: dynamic programming.
3.2.1 Value Iteration
3.2.2 Policy Iteration
3.2.3 Enhancements to Value Iteration and Policy Iteration
3.2.4 Computational Complexity
4. Learning an Optimal Policy: Model-free Methods
There are two ways to proceed. Model-free: learn a controller without learning a model. Model-based: learn a model and use it to derive a controller. i) Section 4 examines model-free learning; ii) Section 5 examines model-based methods.
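Of the ad-hoc bandit techniques in Section 2.2, Boltzmann exploration is concrete enough to sketch: actions are sampled with probability proportional to exp(Q(a)/T), so a high temperature T explores almost uniformly while a low one exploits the best arm. The Q-values and temperatures below are illustrative, not from the survey:

```python
# Boltzmann (softmax) exploration for a k-armed bandit: sample arm a with
# probability proportional to exp(Q[a] / T).  Q-values and temperatures
# here are made up for illustration.
import math, random

def boltzmann_probs(Q, T):
    m = max(q / T for q in Q)                  # subtract max for stability
    w = [math.exp(q / T - m) for q in Q]
    s = sum(w)
    return [x / s for x in w]

def boltzmann_select(Q, T, rng=random):
    return rng.choices(range(len(Q)), weights=boltzmann_probs(Q, T))[0]

Q = [1.0, 2.0, 0.5]
hot = boltzmann_probs(Q, T=10.0)   # near-uniform: mostly exploration
cold = boltzmann_probs(Q, T=0.1)   # concentrates on the argmax: exploitation
print(cold[1] > hot[1])  # True: low temperature favors the best arm
```

Annealing T from high to low over time is a common way to shift smoothly from exploration to exploitation with this rule.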
The biggest problem facing a reinforcement-learning agent: temporal credit assignment. i) How do we know whether the action just taken is a good one? ii) How do we know when it might have far-reaching effects?
Temporal difference methods: adjust the estimated value of a state based on the immediate reward and the estimated value of the next state.
4.1 Adaptive Heuristic Critic and TD(λ)
Two components: 1) a critic (labeled AHC); 2) a reinforcement-learning component (labeled RL).
4.2 Q-learning
4.3 Model-free Learning With Average Reward
5. Computing Optimal Policies by Learning Models
5.1 Certainty Equivalent Methods
5.2 Dyna
5.3 Prioritized Sweeping / Queue-Dyna
5.4 Other Model-Based Methods
6. Generalization
6.1 Generalization over Input
The goal: examine approaches to generating actions or evaluations as a function of a description of the agent's current state.
6.1.1 Immediate Reward
CRBP. The idea behind this training rule: whenever an action fails to generate reward, CRBP will try to generate an action that is different from the current choice.
ARC; REINFORCE algorithms; logic-based methods.
6.1.2 Delayed Reward
Adaptive resolution models; decision trees; variable resolution dynamic programming; the PartiGame algorithm.
6.2 Generalization over Actions
6.3 Hierarchical Methods
6.3.1 Feudal Q-learning
6.3.2 Compositional Q-learning
6.3.3 Hierarchical Distance to Goal
7. Partially Observable Environments
7.1 State-Free Deterministic Policies
7.2 State-Free Stochastic Policies
7.3 Policies with Internal State
The only way to behave truly effectively in a wide range of environments: use memory of previous actions and observations to disambiguate the current state. A variety of approaches to learning policies with internal state:
1) Recurrent Q-learning: use a recurrent neural network to learn Q values.
2) Classi er Systems bucket brigade algorithm 3) Finite-history-window Approach to restore the Markov property is to allow decisions to be based on the history of recent observations and perhaps actions. 4) POMDP Approach use hidden Markov model (HMM) techniques to learn a model of the environment 8. Reinforcement Learning Applications a data point to questions such as: How important is optimal exploration? Can we break the learning period into exploration phases and exploitation phases? What is the most useful model of long-term reward: Finite horizon? Discounted? Infinite horizon? How much computation is available between agent decisions and how should it be used? What prior knowledge can we build into the system, and which algorithms are capable of using that knowledge? 8.1 Game Playing 8.2 Robotics and Control 9. Conclusions Reinforcement learning A survey.pdf 个人点评: 只了解其形,并未理解其精髓 附上一篇中文综述,可以与本文对照学习: 强化学习研究综述.pdf 强化学习研究进展.doc 参考资料: 强化学习 史忠植.ppt *** 多agent teamwork研究综述 李静,陈兆乾,陈世福,徐殿祥 计算机研究与发展 ,2003 modify history 1) 2012-2-24 摘要: Teamwork在许多动态、复杂的多Agent环境中占据越来越重要的地位,是目前人工智能界研究的热点之一,通过对多Agent Teamwork的研究现状、关键技术和发展趋势进行综述和讨论,试图勾画出目前Team work研究的脉络、重点及其发展趋向.主要内容包括:Teamwork研究的背景;TeamworkL的研究方法以及典型的Teamwork模型;Teamwork模型的特点以及关键技术;Teamwork的应用领域以及进一步研究的方向 关键词: 多Agent系统;Teamwork模型 1 引言 共享心理模型 Teamwork:指多Agent间协作、联合行动以确保团队以一致性(coherent)方式运作的过程 Teamwork: 1) cooperation 2) collaboration 3) coordination the goal of this paper: 介绍Teamwork 的两种主要研究方法,重点介绍了目前Teamwork 研究的热点问题,即构建Teamwork 模型的相关理论和技术,勾画出目前Teamwork 研究的重要方面、关键技术及其发展趋势 2 多agent Teamwork研究的主要方法 多agent环境中的Teamwork模型 两个目标: 1) 通过定义团队结构和团队运作过程来构建有效的Teamwork 2) 要求团队中的agent能够灵活地适应不断变化的环境 多agent研究主要有 两种方法: 1) 一以Teamwork理论为基础的基于知识、规划的方法,该方法主要是联合意图的建立, 典型代表模型:STEAM; 2) 另一基于行为的方法,主要实现agent间灵活的行为选择,产生具有容错性、灵活性、可靠性的行为,但不具有规划能力,典型代表结构是: ALLIANCE 3 基于知识、规划的方法 Teamwork characteristic: ①联合行动的相互承诺,即没有队友的参与,Agent 不能单独放弃承诺; ②相互支持,即必须主动帮助队友; ③相互响应,即如果有需要的话,能够接管队友的任务 Teamwork理论包括:联合意图和共享规划 3.1 Jonit Intetion Framework 3.2 Shared plan 3.3 
STEAM model
4 Behavior-based methods
Behavior-based methods grew out of reactive architectures: they address the reactive architecture's weaknesses (no internal state representation, no view of the past or future) while keeping its strengths (real-time response, robustness, scalability), and have become the architecture most commonly adopted in physical multi-robot systems.
Behavior-based teamwork falls into two classes (with mathematical convergence results): 1) swarm-type cooperation; 2) intentional cooperation.
4.1 ALLIANCE
5 Characteristics of teamwork models. Four basic characteristics: (1) robustness and fault tolerance; (2) real-time responsiveness; (3) flexibility; (4) persistence.
6 Key techniques for building teamwork models
(1) Negotiation. Negotiation techniques fall into three classes: i) game-theoretic negotiation; ii) planning-based negotiation; iii) negotiation involving humans and complex AI methods.
(2) Communication. The essence of a teamwork model's communication mechanism: resolving cooperation and conflict problems among team members through communication.
(3) Belief reasoning. Belief reasoning is a challenging area, covering logic, case-based reasoning (CBR), belief revision, multi-agent planning, model-based reasoning (MBR), optimization, and game theory.
(4) Planning
(5) Learning. Conflict is a problem common to all teamwork models, so a teamwork model should provide a learning mechanism that lets agents learn from their past failures and keep improving their adaptation to the environment.
7 Applications of teamwork models
7.1 RoboCup robot soccer. Zhejiang University, RoboCup 2003
8 Conclusions
My comment: I read this as a survey of multi-agent AI planning, useful for getting started: 多AgentTeamwork研究综述.pdf
A similar paper by the corresponding author: 基于多Agent的Teamwork研究综述.pdf
Another reinforcement-learning survey: 强化学习综述.pdf
Related papers: 多Agent系统合作与协调机制研究综述.pdf, 并行学习神经网络集成方法.pdf
Chen Shifu (陈世福) was the advisor of Zhou Zhihua (周志华), whose dissertation was named a 2003 National Excellent Doctoral Dissertation; his papers are worth following.
***
Book: Reinforcement Learning: An Introduction, MIT Press, 1998
http://webdocs.cs.ualberta.ca/~sutton/book/the-book.html
PDF: RL-3.pdf (this website has Lisp code for reinforcement learning)
MATLAB example. NetLogo reinforcement-learning example: Reinforcement Learning Wargame.nlogo
UML analysis (Visio): reinforce_learning_netlogo.vsd
Approximate Dynamic Programming, Chapter 6: Approximate Dynamic Programming.pdf
## A Survey of Progress in Reinforcement Learning for Multi-Robot Systems (面向多机器人系统的增强学习研究进展综述), Wu Jun, Xu Xin, Wang Jian, He Hangen, Control and Decision (控制与决策), 2011-11
Abstract: Optimal control of multi-robot systems based on reinforcement learning is a recent frontier of robotics and distributed artificial intelligence. Multi-robot systems are distributed and heterogeneous and live in high-dimensional continuous spaces, which confronts reinforcement learning for such systems with a series of challenges; this paper systematically surveys progress on the relevant theory and algorithms. It first presents the basic theoretical models and optimization objectives of multi-robot reinforcement learning; then, building on a comparative analysis of existing learning algorithms, it focuses on the difficulties in the theory and application of multi-robot reinforcement learning and approaches to resolving them, giving several typical problems and application examples; finally, it summarizes the research and looks ahead.
Keywords: multi-robot systems; multi-agent; reinforcement learning; stochastic games; Markov decision processes
1 Introduction
Reinforcement learning (RL): a machine-learning method that does not depend on an environment model or prior knowledge; through trial and error and delayed reward, combined with adaptive dynamic programming, it continually optimizes the control policy, providing a feasible way for a system to adapt to changes in its environment.
Single-agent reinforcement learning (SARL); multi-robot reinforcement learning (MRRL); multi-agent reinforcement learning (MARL)
2 The theoretical framework of multi-robot reinforcement learning
2.1 Basic model frameworks. MRRL model frameworks: the Markov decision process (MDP) model used for independent reinforcement learning, and the stochastic game (SG) model used for cooperative reinforcement learning.
2.1.1 The MDP model: MDPs as the mathematical foundation.
2.1.2 The SG model: matrix games.
2.2 Types of learning tasks. MARL tasks divide into static tasks and dynamic tasks.
2.3 Theoretical and methodological foundations
2.4 Equilibrium solution concepts: Nash equilibrium
2.5 Learning objectives
3 Classification of multi-robot reinforcement-learning methods
3.1 Classification of multi-robot reinforcement learning
3.2 State of the art for each class
3.2.1 Centralized multi-robot reinforcement-learning methods
3.2.2 Distributed independent multi-robot reinforcement-learning methods
3.2.3 CISG-based multi-robot reinforcement-learning methods
3.2.4 ZSSG-based multi-robot reinforcement-learning methods. The Minimax-Q algorithm: applies the minimax principle.
3.2.5 GSSG-based multi-robot reinforcement-learning methods
4 Difficulties and trends in multi-robot reinforcement learning
4.1 Inherent difficulties of multi-agent reinforcement learning: 1) the curse of dimensionality; 2) the credit-assignment problem (in MARL, both temporal and structural credit assignment); 3) coordinated selection among multiple equilibria.
4.2 Constraints introduced by physical systems. Developing sound sample-collection strategies that traverse the learning space quickly, together with fast learning algorithms that make efficient use of limited samples, has become an urgent need.
4.3 Trends in multi-robot reinforcement learning
5 Typical application domains
6 Conclusions and reflections
My comment: 面向多机器人系统的增强学习研究进展综述.pdf
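The tabular Q-learning update that both surveys above discuss (and the exploration-versus-exploitation trade-off they raise) can be sketched in a few lines. This is a minimal illustration only: the toy chain environment, the parameter values, and all names below are invented for the example and are not taken from either survey.

```python
import random

# Toy chain MDP: states 0..4; action 0 = left, action 1 = right.
# Reaching state 4 yields reward 1 and ends the episode.
N_STATES, GOAL = 5, 4
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1   # step size, discount, exploration rate

Q = [[0.0, 0.0] for _ in range(N_STATES)]
rng = random.Random(0)

def step(state, action):
    """Deterministic transition; reward only for reaching the goal."""
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    return nxt, (1.0 if nxt == GOAL else 0.0)

def choose(state):
    """Epsilon-greedy selection, breaking ties between equal Q values at random."""
    if rng.random() < EPSILON or Q[state][0] == Q[state][1]:
        return rng.randrange(2)
    return 0 if Q[state][0] > Q[state][1] else 1

for _ in range(500):                     # episodes
    state = 0
    while state != GOAL:
        action = choose(state)
        nxt, reward = step(state, action)
        # Q-learning (temporal-difference) update:
        #   Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[state][action] += ALPHA * (reward + GAMMA * max(Q[nxt]) - Q[state][action])
        state = nxt

# The learned greedy policy should move right in every non-goal state.
policy = [0 if Q[s][0] > Q[s][1] else 1 for s in range(GOAL)]
print(policy)   # -> [1, 1, 1, 1]
```

The bracketed expression is exactly the temporal-difference error described in Section 4 of the Kaelbling survey: the estimated value of a state-action pair is nudged toward the immediate reward plus the discounted estimated value of the next state.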
Statistical relational learning of trust, Achim Rettinger, Matthias Nickles, Volker Tresp, Machine Learning (2011)
Abstract: The learning of trust and distrust is a crucial aspect of social interaction among autonomous, mentally-opaque agents. In this work, we address the learning of trust based on past observations and context information. We argue that from the truster's point of view trust is best expressed as one of several relations that exist between the agent to be trusted (trustee) and the state of the environment. Besides attributes expressing trustworthiness, additional relations might describe commitments made by the trustee with regard to the current situation, for instance: a seller offers a certain price for a specific product. We show how to implement and learn context-sensitive trust using statistical relational learning in the form of a Dirichlet process mixture model called the Infinite Hidden Relational Trust Model (IHRTM). The practicability and effectiveness of our approach is evaluated empirically on user ratings gathered from eBay. Our results suggest that (i) the inherent clustering achieved in the algorithm allows the truster to characterize the structure of a trust situation and provides meaningful trust assessments; (ii) utilizing the collaborative filtering effect associated with relational data does improve trust assessment performance; (iii) by learning faster and transferring knowledge more effectively we improve cold-start performance and can cope better with dynamic behavior in open multiagent systems. The latter is demonstrated with interactions recorded from a strategic two-player negotiation scenario.
Keywords: Relational learning · Computational trust · Social computing · Infinite hidden relational models · Initial trust
1 Introduction
Computational trust, the existing problems: 1) existing approaches lack the ability to take context sufficiently into account when trying to predict the future behavior of interacting agents; 2) they are not able to transfer knowledge gained in a specific context to a related context.
Related lines of work: recommendation and social trust networks; cognitive and game-theoretic models.
Editor's note: this is an interview of Marvin Minsky by John Brockman. Minsky is an MIT professor and one of the giants of artificial intelligence, so it is interesting to see what he has to say. A few words before the text itself: the interview reflects Minsky's views on human consciousness. The confusion surrounding explanations of consciousness arises, he argues, because the inside of the brain is too complex, with perhaps 40 to 50 mechanisms at work, none of which we can yet clearly understand or resolve. He also rejects the subjective character of experience. Whether the feel of experience can be reduced is a question philosophers care about, and the physicalist answer has never satisfied them, a hundred years ago or today. Even if consciousness is a big suitcase, it cannot hold everything.
CONSCIOUSNESS IS A BIG SUITCASE
A Talk with Marvin Minsky
MINSKY: My goal is making machines that can think, by understanding how people think. One reason why we find this hard to do is because our old ideas about psychology are mostly wrong. Most words we use to describe our minds (like "consciousness", "learning", or "memory") are suitcase-like jumbles of different ideas. Those old ideas were formed long ago, before 'computer science' appeared. It was not until the 1950s that we began to develop better ways to help think about complex processes. Computer science is not really about computers at all, but about ways to describe processes. As soon as those computers appeared, this became an urgent need. Soon after that we recognized that this was also what we'd need to describe the processes that might be involved in human thinking, reasoning, memory, pattern recognition, etc.
JB: You say 1950, but wouldn't this be preceded by the ideas floating around the Macy Conferences in the '40s?
MINSKY: Yes, indeed. Those new ideas were already starting to grow before computers created a more urgent need. Before programming languages, mathematicians such as Emil Post, Kurt Gödel, Alonzo Church, and Alan Turing already had many related ideas. In the 1940s these ideas began to spread, and the Macy Conference publications were the first to reach more of the technical public. In the same period, there were similar movements in psychology, as Sigmund Freud, Konrad Lorenz, Nikolaas Tinbergen, and Jean Piaget also tried to imagine advanced architectures for 'mental computation.'
In the same period, in neurology, there were my own early mentors-Nicholas Rashevsky, Warren McCulloch and Walter Pitts, Norbert Wiener, and their followers-and all those new ideas began to coalesce under the name 'cybernetics.' Unfortunately, that new domain was mainly dominated by continuous mathematics and feedback theory. This made cybernetics slow to evolve more symbolic computational viewpoints, and the new field of Artificial Intelligence headed off to develop distinctly different kinds of psychological models. JB: Gregory Bateson once said to me that the cybernetic idea was the most important idea since Jesus Christ. MINSKY: Well, surely it was extremely important in an evolutionary way. Cybernetics developed many ideas that were powerful enough to challenge the religious and vitalistic traditions that had for so long protected us from changing how we viewed ourselves. These changes were so radical as to undermine cybernetics itself. So much so that the next generation of computational pioneers-the ones who aimed more purposefully toward Artificial Intelligence-set much of cybernetics aside. Let's get back to those suitcase-words (like intuition or consciousness) that all of us use to encapsulate our jumbled ideas about our minds. We use those words as suitcases in which to contain all sorts of mysteries that we can't yet explain. This in turn leads us to regard these as though they were "things" with no structures to analyze. I think this is what leads so many of us to the dogma of dualism-the idea that 'subjective' matters lie in a realm that experimental science can never reach. Many philosophers, even today, hold the strange idea that there could be a machine that works and behaves just like a brain, yet does not experience consciousness. If that were the case, then this would imply that subjective feelings do not result from the processes that occur inside brains. 
Therefore (so the argument goes) a feeling must be a nonphysical thing that has no causes or consequences. Surely, no such thing could ever be explained! The first thing wrong with this "argument" is that it starts by assuming what it's trying to prove. Could there actually exist a machine that is physically just like a person, but has none of that person's feelings? "Surely so," some philosophers say. "Given that feelings cannot be physically detected, it is 'logically possible' that some people have none." I regret to say that almost every student confronted with this can find no good reason to dissent. "Yes," they agree. "Obviously that is logically possible. Although it seems implausible, there's no way that it could be disproved." The next thing wrong is the unsupported assumption that this is even "logically possible." To be sure of that, you'd need to have proved that no sound materialistic theory could correctly explain how a brain could produce the processes that we call "subjective experience." But again, that's just what we were trying to prove. What do those philosophers say when confronted by this argument? They usually answer with statements like this: "I just can't imagine how any theory could do that." That fallacy deserves a name, something like "incompetentium". Another reason often claimed to show that consciousness can't be explained is that the sense of experience is 'irreducible.' "Experience is all or none. You either have it or you don't, and there can't be anything in between. It's an elemental attribute of mind, so it has no structure to analyze." There are two quite different reasons why "something" might seem hard to explain. One is that it appears to be elementary and irreducible, as gravity seemed before Einstein found his new way to look at it. The opposite case is when the 'thing' is so much more complicated than you imagine that you just don't see any way to begin to describe it.
This, I maintain, is why consciousness seems so mysterious. It is not that there's one basic and inexplicable essence there. Instead, it's precisely the opposite. Consciousness, instead, is an enormous suitcase that contains perhaps 40 or 50 different mechanisms that are involved in a huge network of intricate interactions. The brain, after all, is built by processes that involve the activities of several tens of thousands of genes. A human brain contains several hundred different sub-organs, each of which does somewhat different things. To assert that any function of such a large system is irreducible seems irresponsible-until you're in a position to claim that you understand that system. We certainly don't understand it all now. We probably need several hundred new ideas-and we can't learn much from those who give up. We'd do better to get back to work. Why do so many philosophers insist that "subjective experience is irreducible"? Because, I suppose, like you and me, they can look at an object and "instantly know" what it is. When I look at you, I sense no intervening processes. I seem to "see" you instantly. The same for almost every word you say: I instantly seem to know what it means. When I touch your hand, you "feel it directly." It all seems so basic and immediate that there seems no room for analysis. The feelings of being seem so direct that there seems to be nothing to be explained. I think this is what leads those philosophers to believe that the connections between seeing and feeling must be inexplicable. Of course we know from neurology that there are dozens of processes that intervene between the retinal image and the structures that our brains then build to represent what we think we see. That idea of a separate world for 'subjective experience' is just an excuse for the shameful fact that we don't have adequate theories of how our brains work. 
This is partly because those brains have evolved without developing good representations of those processes. Indeed, there probably are good evolutionary reasons why we did not evolve machinery for accurate "insights" about ourselves. Our most powerful ways to solve problems involve highly serial processes, and if these had evolved to depend on correct representations of how they themselves work, our ancestors would have thought too slowly to survive.
Note: the picture in the original post is a photo of Minsky (2008) from the web, used for appreciation only and with no commercial purpose; thanks to the source.
Since I arrived in Sydney two weeks ago, nearly everything has been unknown to me. I need to change and broaden myself to adapt to the new world, to quickly find an ecological niche and be fit for the learning ahead; that might be the "evolution" process. During these days, several training items have come my way. First, I need to follow a new work and rest timetable. There is a three-hour time difference between Sydney and China, so I start work while all my Chinese friends are still asleep. Three hours is not much, but forgetting a long-held rhythm and building a new one is not so easy; so much Chinese news keeps reminding me that it is not time to work and not time to sleep, and I always go to bed very late even though I am very tired. The second item is food, not only its content but the habits around it. People here do not put much into lunch, but rather into dinner, the opposite of China. For lunch, teachers and students do not eat very much. There is a kitchen in the research building where professors and students can cook their lunch for free, which is a good way to save time and money; I like it. Boys come with a hamburger and some yoghurt, girls with some fruit, and some people have nothing at all if they are not so hungry. They generally take no fixed rest at noon. By contrast, dinner time is very colorful, and they eat a lot. I could not adapt to this at all at the beginning: I always had to eat a big lunch to push back the afternoon hunger, and then felt very tired after dinner. The third training item is transport. This is very important for broadening one's living space, and is also the key to controlling time. The straight-line distance from my temporary rented house to the research building was 2.5 km, but I need to walk nearly 4 km, so I need to choose the most convenient bus and train lines.
Cycling is impossible here. On one hand, there is no space for bicycles and few people ride them; on the other, there are specific rules for cyclists: you must wear a helmet and proper cycling clothes, otherwise you will be fined. The train is convenient, but the management system is very different: there are many different lines, and each line is tied to a specific platform, so you need to plan your trip on the New South Wales transport website (131500.com). With that website you can plan your travel time to the minute, and I am good at this now. So far, the daily training is finished, and I have adjusted myself to the new world. I can manage my daily work and life time, I know where to go if I want to buy something, and I know how to prepare lunch so that I can join the students at noon. The next step for me is scientific training. Keep going.
Lunch time. The girl is Laura, a volunteer from Germany. The boy is a Ph.D. candidate.
Before we start discussing the topic of a hybrid NLP (Natural Language Processing) system, let us look at the concept of hybrid from our life experiences. I drove a classic Camry for years and had never thought of changing to another brand, because as a vehicle there was really nothing to complain about. Yes, the style is old, but I am getting old too; who beats whom? Then one day a few years ago we needed to buy a new car to retire my damaged Camry. My daughter suggested a hybrid, following the trend of going green. So I have ended up driving a Prius ever since and have fallen in love with it. It is quiet, with Bluetooth and line-in, ideal for my iPhone music enjoyment. It has low emissions, and I can finally say goodbye to smog tests. It saves at least a third of the gas. We could have gained all these benefits by purchasing an expensive all-electric car, but I want the same feeling of power on the freeway and dislike the idea of having to charge the car too frequently. A hybrid gets me the best of both worlds, and is not that much more expensive. Now back to NLP. There are two major approaches to NLP, namely machine learning and grammar engineering (hand-crafted rule systems). As mentioned in previous posts, each has its own strengths and limitations, as summarized below. In general, a rule system is good at capturing a specific language phenomenon (the trees) while machine learning is good at representing the general picture of the phenomena (the forest). As a result, it is easier for rule systems to reach high precision, but it takes a long time to develop enough rules to gradually raise the recall. Machine learning, on the other hand, has much higher recall, usually with a compromise in precision or with a precision ceiling. Machine learning is good at simple, clear, coarse-grained tasks while rules are good at fine-grained tasks. One example is sentiment extraction.
The coarse-grained task there is sentiment classification of documents (thumbs up, thumbs down), which can be achieved fast by a learning system. The fine-grained task of sentiment extraction involves extracting sentiment details and the related actionable insights, including associating the sentiment with an object, differentiating positive/negative emotions from positive/negative behaviors, capturing the aspects or features of the object involved, decoding the motivation or reasons behind the sentiment, etc. For sophisticated tasks of extracting such details and actionable insights, rules are a better fit. The strength of machine learning lies in its retraining ability. In theory, the algorithm, once developed and debugged, remains stable, and improvement of a learning system can be expected once a larger and better-quality corpus is used for retraining (in practice, retraining is not always easy: I have seen famous learning systems deployed at client sites for years without being retrained, for various reasons). Rules, on the other hand, need to be manually crafted and enhanced. Supervised machine learning is more mature for applications, but it requires a large labelled corpus. Unsupervised machine learning only needs a raw corpus, but it is research-oriented and riskier in application. A promising middle path is semi-supervised learning, which only needs a small labelled corpus as seeds to guide the learning. We can also use rules to generate the initial corpus or seeds for semi-supervised learning. Both approaches face knowledge bottlenecks. A rule system's bottleneck is the skilled labor: it requires linguists or knowledge engineers to manually encode each rule in NLP, much like a software engineer in the daily work of coding. The biggest challenge to machine learning is the sparse-data problem, which requires a very large labelled corpus to overcome.
The knowledge bottleneck for supervised machine learning is the labor required to label such a large corpus. We can build a system that combines the two approaches so that they complement each other. There are different ways of combining them in a hybrid system. One example is the practice we use in our product, where the resulting insights are structured in a back-off model: high-precision results from rules are ranked higher than the medium-precision results returned by statistical systems or machine learning. This lets the system reach a configurable balance between precision and recall. When labelled data are available (e.g. the community has already built the corpus, or, for some tasks, the public domain has the data; e.g. sentiment classification of movie reviews can use review data with users' feedback on a 5-star scale), and when the task is simple and clearly defined, using machine learning will greatly speed up the development of a capability. Not every task is suitable for both approaches. (Note that suitability is in the eye of the beholder: I have seen many passionate ML specialists willing to try everything in ML irrespective of the nature of the task; as the old saying goes, when you have a hammer, everything looks like a nail.) For example, machine learning is good at document classification while rules are mostly powerless for such tasks. But for complicated tasks such as deep parsing, rules constructed by linguists usually achieve better performance than machine learning. Rules also perform better for tasks which have clear patterns, for example identifying data items like time, weight, length, money, address, etc. This is because clear patterns can be directly encoded in rules to be logically complete in coverage, while machine learning based on samples still faces a sparse-data challenge.
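The back-off combination just described can be sketched as follows. This is a minimal illustration, not the product's actual code: the rule patterns, the toy stand-in for a statistical classifier, and all function names are invented for the example.

```python
import re

def rule_sentiment(text):
    """High-precision hand-written rules; return None when no rule fires (abstain)."""
    if re.search(r"\b(love|excellent|fantastic)\b", text, re.I):
        return "positive"
    if re.search(r"\b(hate|terrible|awful)\b", text, re.I):
        return "negative"
    return None  # low recall, high precision

def ml_sentiment(text):
    """Toy stand-in for a learned classifier: broad coverage, lower precision.
    Counts cue words; ties default to positive."""
    score = sum(w in text.lower() for w in ("good", "nice", "like")) \
          - sum(w in text.lower() for w in ("bad", "poor", "boring"))
    return "positive" if score >= 0 else "negative"

def hybrid_sentiment(text):
    """Back-off model: trust the rules when they fire, else fall back to the learner."""
    return rule_sentiment(text) or ml_sentiment(text)

print(hybrid_sentiment("I love this camera"))   # rule fires -> positive
print(hybrid_sentiment("a bad, boring movie"))  # rules abstain -> learner -> negative
```

The design point is the abstention: because the rules return None whenever they are unsure, their high precision is preserved, and recall is supplied by the fallback; widening or narrowing the rule layer moves the system along the precision-recall trade-off.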
When designing a system, in addition to using a hybrid approach for some tasks, for other tasks we should choose the most suitable approach depending on their nature. Other aspects of comparison between the two approaches involve modularization and debugging in industrial development. A rule system can fairly easily be structured as a pipeline of modules, so that a complicated task is decomposed into a series of subtasks handled by different levels of modules. In such an architecture, a reported bug is easy to localize and fix by adjusting the rules in the relevant module. Machine learning systems are based on a model trained from the corpus. The model itself, once learned, is often a black box (even when the model is represented by a list of symbolic rules produced by the learning, it is risky to manually mess with those rules when fixing a data-quality bug). Bugs are supposed to be fixable during retraining of the model, based on an enhanced corpus and/or adjusted features. But retraining is a complicated process which may or may not solve the problem. It is difficult to localize and directly handle specific reported bugs in machine learning. To conclude, given the complementary pros and cons of the two basic approaches to NLP, a hybrid system involving both is desirable and deserves more attention and exploration. There are different ways of combining the two approaches in one system, including a back-off model using rules for precision and learning for recall, and semi-supervised learning using high-precision rules to generate the initial corpus or "seeds".
Related posts: Comparison of Pros and Cons of Two NLP Approaches; Is Google ranking based on machine learning?; 《立委随笔:语言自动分析的两个路子》; 《立委随笔:机器学习和自然语言处理》
【置顶:立委科学网博客NLP博文一览(定期更新版)】
America's After-School Tutoring School: Sylvan Learning in Danbury
By Huang Annian; posted on Huang Annian's blog, October 2, 2011 (US Eastern time)
Sylvan Learning Centers number more than 900 across the United States, Canada, and elsewhere (see the introduction below); I call them America's after-school tutoring schools. There is one in Danbury, CT, with good facilities and conditions; its teachers are experienced front-line teachers from nearby schools working part time. Tutoring here is one-on-one: a session lasts two hours and costs 30 US dollars, roughly 200 RMB. One-on-one tutoring in China now costs about 100 RMB per hour, so the absolute prices are equal; this shows that, relative to income, tutoring fees in China are much higher, while the environment falls far short of America's. Gradually scaling up is the clear trend for China's home-tutoring market; online teaching such as Xueersi's is already quite influential.
---
Sylvan Learning is the leading provider of tutoring and supplemental education services to students of all ages and skill levels. At Sylvan, our warm and caring tutors tailor individualized learning plans that build the skills, habits and attitudes students need to succeed in school and in life. Affordable tutoring instruction is available in math, reading, writing, study skills, homework help, test prep and more at more than 900 learning centers in the United States, Canada and abroad. http://tutoring.sylvanlearning.com/index.cfm
About Our Tutoring Programs
At our centers, Sylvan trained and certified instructors provide highly personalized instruction in reading, math, writing, study skills, homework help, SAT*/ACT prep and state test prep. With Sylvan's proven process and teaching methods, our students build the lasting skills, habits and attitudes they need to succeed in school - and in life. http://tutoring.sylvanlearning.com/sylvan_about_us.cfm
Knowledge Discovery in Databases: An Overview, William J. Frawley, Gregory Piatetsky-Shapiro, and Christopher J. Matheus, AAAI, 1992
Abstract: After a decade of fundamental interdisciplinary research in machine learning, the spadework in this field has been done; the 1990s should see the widespread exploitation of knowledge discovery as an aid to assembling knowledge bases. The contributors to the AAAI Press book Knowledge Discovery in Databases were excited at the potential benefits of this research. The editors hope that some of this excitement will communicate itself to AI Magazine readers of this article.
The goal of this article: present an overview of the state of the art in research on knowledge discovery in databases. We analyze knowledge discovery and define it as the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. We then compare and contrast database, machine learning, and other approaches to discovery in data. We present a framework for knowledge discovery and examine problems in dealing with large, noisy databases, the use of domain knowledge, the role of the user in the discovery process, discovery methods, and the form and uses of discovered knowledge. We also discuss application issues, including the variety of existing applications and the propriety of discovery in social databases. We present criteria for selecting an application in a corporate environment. In conclusion, we argue that discovery in databases is both feasible and practical and outline directions for future research, which include better use of domain knowledge, efficient and incremental algorithms, interactive systems, and integration on multiple levels.
My comment: one of the older classic data-mining surveys. I see two entry points into the paper: one is machine learning (Tables 1 and 2), the other is Figure 1.
Knowledge Discovery in Databases Overview.pdf
beamer_Knowledge_Discovery_Database_Overview.pdf
beamer_Knowledge_Discovery_Database_Overview.tex
From: http://mlss2011.comp.nus.edu.sg/index.php?n=Site.Slides
MLSS 2011, Machine Learning Summer School, 13-17 June 2011, Singapore
Slides (speaker: topic):
Chiranjib Bhattacharyya: Kernel Methods. Slides (pdf)
Wray Buntine: Introduction to Machine Learning. Slides (pdf)
Zoubin Ghahramani: Gaussian Processes; Graphical Model Structure Learning. Slides (Part 1 pdf, Part 2 pdf, Part 3 pdf)
Stephen Gould: Markov Random Fields for Computer Vision. Slides (Part 1 pdf, Part 2 pdf, Part 3 pdf)
Marko Grobelnik: How We Represent Text? From Characters to Logic. Slides (pptx)
David Hardoon: Multi-Source Learning: Theory and Application. Slides (pdf)
Mark Johnson: Probabilistic Models for Computational Linguistics. Slides (Part 1 pdf, Part 2 pdf, Part 3 pdf)
Wee Sun Lee: Partially Observable Markov Decision Processes. Slides (pdf, pptx)
Hang Li: Learning to Rank. Slides (pdf)
Sinno Pan and Qiang Yang: Transfer Learning. Slides (Part 1 pptx, Part 2 pdf)
Tomi Silander: Introduction to Graphical Models. Slides (pdf)
Yee Whye Teh: Bayesian Nonparametrics. Slides (pdf)
Ivor Tsang: Feature Selection using Structural SVM and its Applications. Slides (pdf)
Max Welling: Learning in Markov Random Fields. Slides (pdf, pptx)
Classical Paper List on Machine Learning and Natural Language Processing, from Zhiyuan Liu
Hidden Markov Models
Rabiner, L. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. (Proceedings of the IEEE 1989)
Freitag and McCallum. Information Extraction with HMM Structures Learned by Stochastic Optimization. (AAAI'00)
Maximum Entropy
Adwait Ratnaparkhi. A Maximum Entropy Model for POS Tagging. (1994)
A. Berger, S. Della Pietra, and V. Della Pietra. A Maximum Entropy Approach to Natural Language Processing. (CL'1996)
A. Ratnaparkhi. Maximum Entropy Models for Natural Language Ambiguity Resolution. PhD thesis, University of Pennsylvania, 1998.
Hai Leong Chieu. A Maximum Entropy Approach to Information Extraction from Semi-Structured and Free Text. (AAAI'02)
MEMM
McCallum et al. Maximum Entropy Markov Models for Information Extraction and Segmentation. (ICML'00)
Punyakanok and Roth. The Use of Classifiers in Sequential Inference. (NIPS'01)
Perceptron
Collins. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. (EMNLP'02)
Y. Li, K. Bontcheva, and H. Cunningham. Using Uneven-Margins SVM and Perceptron for Information Extraction. (CoNLL'05)
SVM
Z. Zhang. Weakly-Supervised Relation Classification for Information Extraction. (CIKM'04)
H. Han et al. Automatic Document Metadata Extraction using Support Vector Machines. (JCDL'03)
Aidan Finn and Nicholas Kushmerick. Multi-level Boundary Classification for Information Extraction. (ECML'2004)
Yves Grandvalet and Johnny Mariéthoz. A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification. (NIPS'05)
CRFs
J. Lafferty et al. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. (ICML'01)
Hanna Wallach. Efficient Training of Conditional Random Fields. MS thesis, 2002.
Taskar, B., Abbeel, P., and Koller, D. Discriminative Probabilistic Models for Relational Data.
(UAI'02)
Fei Sha and Fernando Pereira. Shallow Parsing with Conditional Random Fields. (HLT/NAACL 2003)
B. Taskar, C. Guestrin, and D. Koller. Max-Margin Markov Networks. (NIPS'2003)
S. Sarawagi and W. W. Cohen. Semi-Markov Conditional Random Fields for Information Extraction. (NIPS'04)
Brian Roark et al. Discriminative Language Modeling with Conditional Random Fields and the Perceptron Algorithm. (ACL'2004)
H. M. Wallach. Conditional Random Fields: An Introduction. (2004)
Kristjansson, T., Culotta, A., Viola, P., and McCallum, A. Interactive Information Extraction with Constrained Conditional Random Fields. (AAAI'2004)
John Lafferty, Xiaojin Zhu, and Yan Liu. Kernel Conditional Random Fields: Representation and Clique Selection. (ICML'2004)
Topic Models
Thomas Hofmann. Probabilistic Latent Semantic Indexing. (SIGIR'1999)
David Blei et al. Latent Dirichlet Allocation. (JMLR'2003)
Thomas L. Griffiths and Mark Steyvers. Finding Scientific Topics. (PNAS'2004)
POS Tagging
J. Kupiec. Robust Part-of-Speech Tagging Using a Hidden Markov Model. (Computer Speech and Language'1992)
Hinrich Schutze and Yoram Singer. Part-of-Speech Tagging Using a Variable Memory Markov Model. (ACL'1994)
Adwait Ratnaparkhi. A Maximum Entropy Model for Part-of-Speech Tagging. (EMNLP'1996)
Noun Phrase Extraction
E. Xun, C. Huang, and M. Zhou. A Unified Statistical Model for the Identification of English BaseNP. (ACL'00)
Named Entity Recognition
Andrew McCallum and Wei Li. Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons. (CoNLL'2003)
Moshe Fresko et al. A Hybrid Approach to NER by MEMM and Manual Rules. (CIKM'2005)
Chinese Word Segmentation
Fuchun Peng et al. Chinese Segmentation and New Word Detection Using Conditional Random Fields. (COLING 2004)
Document Data Extraction
Andrew McCallum, Dayne Freitag, and Fernando Pereira. Maximum Entropy Markov Models for Information Extraction and Segmentation. (ICML'2000)
David Pinto, Andrew McCallum, et al. Table Extraction Using Conditional Random Fields. (SIGIR 2003)
Fuchun Peng and Andrew McCallum. Accurate Information Extraction from Research Papers Using Conditional Random Fields. (HLT-NAACL'2004)
V. Carvalho and W. Cohen. Learning to Extract Signature and Reply Lines from Email. In Proc. of the Conference on Email and Anti-Spam (CEAS'04), 2004.
Jie Tang, Hang Li, Yunbo Cao, and Zhaohui Tang. Email Data Cleaning. (SIGKDD'05)
P. Viola and M. Narasimhan. Learning to Extract Information from Semi-Structured Text Using a Discriminative Context-Free Grammar. (SIGIR'05)
Yunhua Hu, Hang Li, Yunbo Cao, Dmitriy Meyerzon, Li Teng, and Qinghua Zheng. Automatic Extraction of Titles from General Documents Using Machine Learning. Information Processing and Management, 2006.
Web Data Extraction
Ariadna Quattoni, Michael Collins, and Trevor Darrell. Conditional Random Fields for Object Recognition. (NIPS'2004)
Yunhua Hu, Guomao Xin, Ruihua Song, Guoping Hu, Shuming Shi, Yunbo Cao, and Hang Li. Title Extraction from Bodies of HTML Documents and Its Application to Web Page Retrieval. (SIGIR'05)
Jun Zhu et al. Mutual Enhancement of Record Detection and Attribute Labeling in Web Data Extraction. (SIGKDD 2006)
Event Extraction
Kiyotaka Uchimoto, Qing Ma, Masaki Murata, Hiromi Ozaku, and Hitoshi Isahara. Named Entity Extraction Based on a Maximum Entropy Model and Transformation Rules. (ACL'2000)
GuoDong Zhou and Jian Su. Named Entity Recognition Using an HMM-Based Chunk Tagger. (ACL'2002)
Hai Leong Chieu and Hwee Tou Ng. Named Entity Recognition: A Maximum Entropy Approach Using Global Information. (COLING'2002)
Wei Li and Andrew McCallum. Rapid Development of Hindi Named Entity Recognition Using Conditional Random Fields and Feature Induction. ACM Trans. Asian Lang. Inf. Process.
2003 Question Answering Rohini K. Srihari and Wei Li. Information Extraction Supported Question Answering. (TREC'1999) Eric Nyberg et al. The JAVELIN Question-Answering System at TREC 2003: A Multi-Strategh Approach with Dynamic Planning. (TREC'2003) Natural Language Parsing Leonid Peshkin and Avi Pfeffer. Bayesian Information Extraction Network. (IJCAI'2003) Joon-Ho Lim et al. Semantic Role Labeling using Maximum Entropy Model. (CoNLL'2004) Trevor Cohn et al. Semantic Role Labeling with Tree Conditional Random Fields. (CoNLL'2005) Kristina toutanova, Aria Haghighi, and Christopher D. Manning. Joint Learning Improves Semantic Role Labeling. (ACL'2005) Shallow parsing Ferran Pla, Antonio Molina, and Natividad Prieto. Improving text chunking by means of lexical-contextual information in statistical language models. (CoNLL'2000) GuoDong Zhou, Jian Su, and TongGuan Tey. Hybrid text chunking. (CoNLL'2000) Fei Sha and Fernando Pereira. Shallow Parsing with Conditional Random Fields. (HLT-NAACL'2003) Acknowledgement Dr. Hang Li , for original paper list.
From: http://www.springerlink.com/content/g37847m78178l645/fulltext.html
Encyclopedia of Machine Learning, Springer Science+Business Media, LLC 2011. 10.1007/978-0-387-30164-8_124
Claude Sammut and Geoffrey I. Webb (eds.)
Clustering
Clustering is a type of unsupervised learning in which the goal is to partition a set of examples into groups called clusters. Intuitively, the examples within a cluster are more similar to each other than to examples from other clusters. In order to measure the similarity between examples, clustering algorithms use various distortion or distance measures. There are two major types of clustering approaches: generative and discriminative. The former assumes a parametric form of the data and tries to find the model parameters that maximize the probability that the data was generated by the chosen model. The latter represents graph-theoretic approaches that compute a similarity matrix defined over the input data.
Cross References: Categorical Data Clustering, Cluster Editing, Cluster Ensembles, Clustering from Data Streams, Constrained Clustering, Consensus Clustering, Correlation Clustering, Cross-Language Document Clustering, Density-Based Clustering, Dirichlet Process, Document Clustering, Evolutionary Clustering, Graph Clustering, k-Means Clustering, k-Medoids Clustering, Model-Based Clustering, Partitional Clustering, Projective Clustering, Sublinear Clustering
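The distance-measure idea above can be made concrete with the simplest partitional algorithm, k-means (Lloyd's algorithm). This is a minimal pure-Python sketch on made-up 2-D data; the deterministic initialization and function name are illustrative choices, not from the encyclopedia entry (practical implementations use k-means++ initialization).

```python
def kmeans(points, k, iters=100):
    """Minimal Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = list(points[:k])  # simple deterministic init; k-means++ is the usual choice
    clusters = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        new = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:  # converged: assignments no longer change
            break
        centroids = new
    return centroids, clusters

# Two well-separated groups in 2-D.
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centroids, clusters = kmeans(data, k=2)
```

On this toy data the algorithm recovers the two groups, with centroids near (0.1, 0.1) and (5.0, 5.03).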
From: http://www.cs.york.ac.uk/aig/LLMMC/
Symposium on Learning Language Models from Multilingual Corpora (LLMMC)
Part of the AISB 2011 Convention, 4-7 April 2011.
Call for Papers
International organizations like the UN and the EU, news agencies, and companies operating internationally are producing large volumes of texts in different languages. As a result, large publicly-available parallel paragraph- or sentence-aligned corpora have been created for many language pairs, e.g., French-English, Chinese-English or Arabic-English. The multilingual nature of the EU has given rise to many documents available in all or many of its official languages, which have been assembled in multi-lingual parallel corpora such as Europarl (11 languages, 34-55M words for each) and JRC-Acquis (22 languages, 11-22M words for each). These parallel corpora have been used, both monolingually and multilingually, for a variety of NLP tasks, including but not limited to machine translation, cross-lingual information retrieval, word sense disambiguation, semantic relation extraction, named entity recognition, POS tagging, and syntactic parsing. With the advent of the Internet, there has also been an explosion in the availability of semi-parallel multilingual online resources like Wikipedia that have been used for similar tasks and have great potential for future exploration and research. In this symposium, we are interested in explicit models, usable and verifiable by humans, which could be used for either translation or for modelling individual languages, e.g., as applied to morphology, where the available translations can help identify word forms of the same lexical entry in a given language, or lexical semantics, where parallel corpora can help extract instances of relations like synonymy and hypernymy, which are essential for building thesauri and ontologies. The main purpose of the symposium will be to gather and disseminate the best ideas in this new area.
Thus, we welcome review and position papers alongside original submissions. A considerable part of this one-day symposium will be dedicated to discussions to encourage the formation of new collaborations and consortia.
Duration: a one-day symposium.
Important Dates:
Call for papers: December 13, 2010
Submissions: January 19, 2011
Notification: February 14, 2011
Submission of camera-ready versions: February 28, 2011
Symposium: April 6, 2011
Organizers:
Dimitar Kazakov, The University of York, UK (kazakov AT cs DOT york DOT ac DOT uk)
Preslav Nakov, National University of Singapore, Singapore (preslav DOT nakov AT gmail DOT com)
Ahmad R. Shahid, The University of York, UK (ahmad AT cs DOT york DOT ac DOT uk)
Program Committee:
Graeme Blackwood, University of Cambridge, UK
Phil Blunsom, University of Oxford, UK
Francis Bond, Nanyang Technological University, Singapore
Yee-Seng Chan, University of Illinois at Urbana-Champaign, USA
Daniel Dahlmeier, National University of Singapore, Singapore
Marc Dymetman, Xerox Research Centre Europe, France
Andreas Eisele, Directorate-General for Translation, Luxembourg
Michel Galley, Stanford University, USA
Kuzman Ganchev, University of Pennsylvania, USA
Corina R Girju, University of Illinois at Urbana-Champaign, USA
Philipp Koehn, University of Edinburgh, UK
Krista Lagus, Aalto University School of Science and Technology, Finland
Wei Lu, National University of Singapore, Singapore
Elena Paskaleva, Bulgarian Academy of Sciences, Bulgaria
Katerina Pastra, Institute for Language and Speech Processing, Greece
Khalil Sima'an, University of Amsterdam, The Netherlands
Ralf Steinberger, Joint Research Centre, Italy
Joerg Tiedemann, Uppsala University, Sweden
Marco Turchi, Joint Research Centre, Italy
Jaakko Väyrynen, Aalto University School of Science and Technology, Finland
Location: E:\petrelli\study\ML\paper\PAMI
@article{raykar2008fast,
  title     = {A fast algorithm for learning a ranking function from large-scale data sets},
  author    = {Raykar, V.C. and Duraiswami, R. and Krishnapuram, B.},
  journal   = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
  volume    = {30},
  number    = {7},
  pages     = {1158--1170},
  year      = {2008},
  publisher = {Citeseer}
}
Summary: The paper uses a sigmoid function as a smooth surrogate for the original loss function and solves the problem with a conjugate gradient algorithm. Since the direct solution is inefficient, the paper approximates it with the erfc function, yielding a fast algorithm.
Contribution: Mainly addresses the preference-ranking problem. The proposed algorithm feels rather limited to me, and I suspect the fastest existing algorithms are no worse than the one proposed here.
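The sigmoid-surrogate idea can be sketched as follows. This is only an illustration of replacing the pairwise 0/1 ranking loss with a smooth sigmoid and minimizing it by plain gradient descent on made-up data; it is not the paper's algorithm, which uses conjugate gradient plus an erfc-based fast approximation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pairwise_sigmoid_loss(w, pairs, X):
    """Smooth surrogate for the pairwise 0/1 ranking loss.

    pairs: list of (i, j) meaning item i should rank above item j.
    The hard loss counts violated pairs; sigmoid(-margin) approximates that count."""
    loss = 0.0
    for i, j in pairs:
        margin = sum(wk * (xi - xj) for wk, xi, xj in zip(w, X[i], X[j]))
        loss += sigmoid(-margin)  # near 1 when the pair is misordered, near 0 otherwise
    return loss

def grad(w, pairs, X):
    g = [0.0] * len(w)
    for i, j in pairs:
        d = [xi - xj for xi, xj in zip(X[i], X[j])]
        margin = sum(wk * dk for wk, dk in zip(w, d))
        s = sigmoid(-margin)
        coef = -s * (1.0 - s)  # derivative of sigmoid(-margin) w.r.t. the margin
        for k in range(len(w)):
            g[k] += coef * d[k]
    return g

# Tiny illustration: two features, item 0 should outrank item 1.
X = [[2.0, 1.0], [1.0, 2.0]]
pairs = [(0, 1)]
w = [0.0, 0.0]
for _ in range(200):
    g = grad(w, pairs, X)
    w = [wk - 0.5 * gk for wk, gk in zip(w, g)]  # plain gradient descent
```

After training, the learned weights order the pair correctly (the margin w·(x0 − x1) is positive), and the smooth loss falls well below its starting value of 0.5.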
Chapter 2 Overview of Supervised Learning
2.1 Several common and equivalent terms: in the statistics literature the inputs are called predictors, classically independent variables, and in pattern recognition, features. The outputs are called responses, classically dependent variables.
2.2 Gives the basic definitions of regression and classification problems.
2.3 Introduces two simple prediction methods, least squares and KNN. The linear decision boundary produced by least squares has low variance but potentially high bias; KNN is wiggly and unstable, i.e., high variance and low bias. This summary is classic:
"A large subset of the most popular techniques in use today are variants of these two simple procedures. In fact 1-nearest-neighbor, the simplest of all, captures a large percentage of the market for low-dimensional problems. The following list describes some ways in which these simple procedures have been enhanced:
~ Kernel methods use weights that decrease smoothly to zero with distance from the target point, rather than the effective 0/1 weights used by k-nearest neighbors.
~ In high-dimensional spaces the distance kernels are modified to emphasize some variables more than others.
~ Local regression fits linear models by locally weighted least squares rather than fitting constants locally.
~ Linear models fit to a basis expansion of the original inputs allow arbitrarily complex models.
~ Projection pursuit and neural network models consist of sums of non-linearly transformed linear models."
2.4 Theoretical analysis of statistical decision making. I couldn't get into it and didn't really understand it; I'll reread it before moving on tomorrow. Today I covered pp. 35-43.
2.5 Discusses the problem local methods such as KNN face with high-dimensional features: as the dimension grows, capturing a fraction r of the samples requires an edge length close to 1, which drives the variance very high.
2.6 Organized around statistical models, an introduction to supervised learning, and function approximation: the first gives the general probabilistic model, the second explains fitting a function from training examples, and the third introduces common parameter estimation, choosing the parameters that maximize the objective.
2.7 Introduces structured regression methods, which can handle problems that are otherwise hard to solve.
2.8 Several classes of estimators:
2.8.1 Roughness penalty, essentially regularized methods: a penalty term restricts the complexity of the function space.
2.8.2 Kernel methods and local regression: a kernel function works much like a local neighborhood; the kernel reflects the distance between samples.
2.8.3 Basis functions and dictionary methods: select several basis functions from a dictionary and superpose them to obtain the fitted function. Single-hidden-layer feed-forward neural networks, boosting, MARS, and MART all belong to this class.
2.9 Model selection and the bias-variance trade-off: the more complex the model (e.g., the smaller the regularization term), the lower the bias but the higher the variance, which produces very low training error yet very high test error, and vice versa.
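The least-squares versus KNN contrast in 2.3 can be made concrete with a tiny k-nearest-neighbor classifier. This is an illustrative pure-Python sketch with made-up data and labels, not code from the book: with k = 1 the classifier follows the training data exactly (low bias, high variance), while a larger k averages over neighbors and smooths the decision boundary.

```python
def knn_predict(x, train, k=1):
    """Classify x by majority vote among its k nearest training points.

    train is a list of ((features...), label) pairs with 0/1 labels."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = sorted(train, key=lambda pt: dist(pt[0], x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 >= k else 0  # majority vote, ties broken toward 1

# Two classes in 2-D; k=1 memorizes, k=3 smooths.
train = [((0.0, 0.0), 0), ((0.2, 0.1), 0), ((1.0, 1.0), 1), ((0.9, 1.1), 1)]
pred_near_zero = knn_predict((0.1, 0.1), train, k=1)
pred_near_one = knn_predict((0.95, 1.0), train, k=3)
```

On this toy set the query near the origin is classified 0 and the query near (1, 1) is classified 1.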
Sketched Figure 2.11; read up to page 61. The main topic is several linear methods for solving regression problems: first the basic regression problem, then multiple regression and multiple outputs, then subset selection and forward stepwise/stagewise selection (the difference being that the latter does not adjust the other variables when updating).
3.4 Shrinkage methods add a regularizer to smooth the fit, since subset selection is a discrete procedure. A squared penalty gives ridge regression; an absolute-value penalty gives the lasso; a further variant is least angle regression, which is closely related to the lasso. I'll look at it again tomorrow. That covers pages 61 to 97.
Supplement: Section 3.3 analyzes the p-norms corresponding to the constraints in the linear regression problem (the book writes q for what I call p here). At p = 1.2 the shape looks very similar to the elastic net penalty, but in fact the former is smooth while the latter is sharp (non-differentiable).
Section 3.4, Least Angle Regression (LAR): almost identical to the lasso, except that when a nonzero coefficient hits zero, the corresponding variable is removed from the active set and the direction is recomputed.
Section 3.5 discusses principal component regression and partial least squares, which can be understood as dimensionality reduction: map the original d-dimensional data to m dimensions (m < d) and then solve.
3.6 Compares selection and shrinkage methods; the difference seems to lie in the optimization directions they choose.
3.7 Selection and shrinkage with multivariate outputs.
3.8 More discussion of the lasso and path algorithms: the basic optimization form is loss + penalty, and different choices of loss and penalty account for much of the discussion around the lasso. It also mentions solving linear programs with the simplex method; noting it here in case I need to study linear programming later and have no starting point.
3.9 Analysis of computational cost.
Chapter 4 Linear Methods for Classification
4.1 Introduces linear methods as those with linear decision boundaries.
4.2 Linear regression of an indicator matrix.
4.3 LDA, linear discriminant analysis: assume each class follows a multivariate Gaussian; when the class densities are used to form the posterior class probabilities, assuming all classes share the same covariance matrix yields LDA. The chapter then discusses the various cases and ways of computing LDA.
4.4 p. 137. Will go through it again tomorrow and finish Chapter 4.
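The ridge form of shrinkage in 3.4 has a closed-form solution, w = (X'X + lam*I)^{-1} X'y. A minimal pure-Python sketch on made-up data (not from the book) shows the shrinkage effect directly: raising the penalty pulls the coefficient toward zero.

```python
def ridge(X, y, lam):
    """Closed-form ridge regression via the normal equations,
    w = (X'X + lam*I)^{-1} X'y, solved by Gaussian elimination.
    Pure Python; fine for the tiny illustrative problems here."""
    d = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(d)]
    # Forward elimination with partial pivoting.
    for col in range(d):
        piv = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, d):
            f = A[r][col] / A[col][col]
            for c in range(col, d):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    w = [0.0] * d
    for i in range(d - 1, -1, -1):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, d))) / A[i][i]
    return w

# Data roughly following y = 2x with a little noise.
X = [[1.0], [2.0], [3.0]]
y = [2.0, 4.1, 5.9]
w0 = ridge(X, y, lam=0.0)   # lam = 0 recovers ordinary least squares
w1 = ridge(X, y, lam=10.0)  # a heavy penalty shrinks the coefficient toward zero
```

Here w0 is about 1.99 (near the true slope of 2) while w1 drops to about 1.16, which is the bias-variance trade-off of 2.9 in action.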
See: http://learning.bmj.com/learning/channel-home.html
Dear Prof Xu Peiyang,
For every course you complete on BMJ Learning you can print a certificate of completion as proof for accreditation. The following courses have been recommended by other primary care doctors on BMJ Learning. If you are not working in primary care please update your details to ensure you only receive relevant communication.
Welcome to BMJ Learning
BMJ Learning is the world's largest and most trusted independent online learning service for medical professionals. We offer over 500 peer reviewed, evidence based learning modules and our service is constantly updated. Train and test your knowledge and skills today. Accreditation of BMJ Learning courses is provided by several international authorities, including DHA, HAAD, EBAC, MMA, CME, RNZCGP, KIMS, and others. Please contact your relevant College or Association for information, or to request that they accredit BMJ Learning if they do not already.
Frame of Reference: Open Access Starts with You by Lori A. Goetsch
Federal legislation now requires the deposit of some taxpayer-funded research in open-access repositories, that is, sites where scholarship and research are made freely available over the Internet. The National Institutes of Health's open-access policy requires submission of NIH-funded research to PubMed Central, and there is proposed legislation, the Federal Research Public Access Act of 2009, that extends this requirement to research funded by 11 other federal agencies.
Academic Researchers Speak by Inger Bergom, Jean Waltman, Louise August and Carol Hollenshead
Non-tenure-track (NTT) research faculty are perhaps the most under-recognized group of academic professionals on our campuses today, despite their increasingly important role within the expanding academic research enterprise. The American Association for the Advancement of Science reports that the amount of federal spending on R&D has more than doubled since 1976. The government now spends about $140 billion yearly on R&D, and approximately $30 billion of this amount goes to universities each year in the form of grants and contracts.
Taking Teaching to (Performance) Task: Linking Pedagogical and Assessment Practices by Marc Chun
Imagine a typical student taking an average set of courses. She has to complete a laboratory write-up for chemistry, write a research paper for linguistics, finish a problem set for mathematics, cram for a pop quiz in religious studies, and write an essay for her composition class. Her professors almost exclusively lecture (which, it's been said, is a way for information to travel from an instructor's lecture notes to the student's notebook without engaging the brains of either). And somehow she is supposed to not only learn the course content but also develop the critical thinking skills her college touts as central to its mission.
Why Magic Bullets Don't Work by David F. Feldon
We always tell our students that there are no shortcuts, that important ideas are nuanced, and that recognizing subtle distinctions is an essential critical-thinking skill. Mastery of a discipline, we know, requires careful study and necessarily slow, evolutionary changes in perspective. Then we look around for the latest promising trend in teaching and jump in with both feet, expecting it to transform our students, our courses, and our outcomes. Alternatively, we sniff disdainfully at the current educational fad and proudly stand by the instructional traditions of our disciplines or institutions, secure in our knowledge that the tried and true has a wisdom of its own. This reductive stance is a natural one. As university faculty who work within disciplines, we have each chosen a slice of human knowledge about which we are passionate, and we often settle on the most expedient (but sound) answer to the question of how to teach so that we can move on to the interesting issues and problems that led us to pursue academic careers in the first place. Further, the professional demands on us and the rewards for our work generally do not align with high levels of sustained effort invested in teaching. However, what we tell students about mastering our respective disciplines are the same truths that apply to finding effective instructional strategies: The devil is always in the details, and nuance is critical. Yet in our desire to do right by our students and still invest the bulk of our efforts in teaching content, we put our faith in over-simplified generalizations that never seem to realize the full benefits that they promise. There have been many sweeping statements made regarding the best ways to teach students in the 21st century. Two of the most au courant are "traditional lectures are ineffective" and "internet-based technologies help students learn."
There is empirical evidence to support the truth of each of these statements, but only if they meet specific parameters, which rarely carry over from their origins in educational research to guide their implementation in practice.
Are lectures bad for learning?
When we look beyond the rhetoric surrounding instructional practices to examine data, it turns out that bad lectures do limit students' learning and motivation. However, good lectures can be inspiring and have a positive, even transformative, impact on student outcomes. Given this unenlightening information, the real question becomes: what differentiates a good lecture from a bad one? Good lectures share a number of key properties with any type of effective instruction. They begin by establishing the relevance of the material for students through explicit connections with their goals or interests. They activate prior knowledge by connecting new content with what students already know and understand or problems with which they are currently grappling. They present information in a clear and straightforward manner that does not require disproportionate effort to translate into terms and concepts meaningful to students. They limit the information presented to a small number of core ideas that are thoroughly but not redundantly explained. Studies that systematically control the relevant features of lectures find significant learning benefits for students when these principles are implemented. However, the large-scale correlative studies of instructional format and student achievement that report negative outcomes for lectures do not control for or even ask about the presence or absence of these features. Thus it may be that the negative findings are a more accurate reflection of generally lackluster or ill-informed implementation of this teaching technique than a condemnation of the technique itself.
Of course, simply knowing or even applying these general principles for effective lecturing does not guarantee positive results. Students enter courses with differing backgrounds, levels of prior knowledge, goals, and interests. Given that each of the guidelines above explicitly frames practice in terms of characteristics that vary by learner, the underlying challenge is to find ways to connect with the broadest cross-section of students and find supplemental or alternate means of connecting with those who do not fit that mold. Many instructors succeed at this through the use of assignments that require students to grapple with problems prior to the lecture. Others use clickers to stimulate engagement and structure situations in which the information presented is salient. However, the effective use of such practices involves understanding the students at whom the course is targeted.
Is technology good for learning?
Both the definitions and the uses of instructional technology are highly varied, so conversations about its benefits and limitations also tend to rely on overly broad generalizations. The two major foci of these discussions currently are game/simulation-based learning and so-called Web 2.0 technologies that allow users to interact with each other via the internet and to contribute content of various types directly to websites. Advocates claim that these applications are important for improving student learning outcomes; they enhance relevance for students by engaging them through the generationally preferred medium of digital media and provide them with opportunities to actively engage with a course's content. While there are indeed instances where such benefits are realized, they are not reflected in comprehensive literature reviews or meta-analyses of the research. There is a simple explanation for this: not all uses of a technology are created equal.
The key features that drive engagement and learning pertain to the designs that underlie the technology rather than to the technology itself. When games and other digital learning environments are developed in accordance with principles of effective instruction, they achieve positive results. But they do not yield better results than less sophisticated instructional delivery systems that use the same instructional designs. Why? Because the active ingredients that affect students' learning are the same in both cases. One of the most durable descriptions of this phenomenon is Richard E. Clark's grocery truck metaphor: Media are mere vehicles that deliver instruction but do not influence student achievement any more than the truck that delivers our groceries causes changes in our nutrition (Clark, 1983, p. 445). What the new media do offer are tools for interacting with instructors, peers, and content in ways that are not affordable or possible otherwise. When these interactions offer opportunities to observe or manipulate information and phenomena in meaningful ways, they can facilitate learning. Generally, the features that are most helpful for students include enabling the representation of concepts at multiple levels of abstraction (e.g., via concrete representation, abstract functional models or mathematical models), providing opportunities for more extensive practice than would otherwise be possible and offering immediate feedback to direct further learning efforts. While they are potentially valuable learning tools, such technologies need to be designed in such a way that they are not confusing or overwhelming for the students who will use them. With any software, there is a learning curve for mastering the interface used to interact with it. To the extent that the interface functions in a standard way, students will be able to draw on previous technology experiences in using it. 
However, if it is significantly different from familiar interfaces, they will need to invest substantial effort in mastering its use before getting to content-related learning. The greater the departure from familiar software environments, the steeper the learning curve. Thus the technology itself can act as a learning impediment for students with limited technology backgrounds. It may be the case that the potential learning benefits offered outweigh the cognitive costs, but it should not be assumed without evidence that this will be the case.
The role of cognition
There are two threads linking effective lectures and effective technology use. The first is consideration of what students bring to the table in terms of goals, interests, and prior knowledge. The second is the deliberate management of the opportunities for students to engage with content in order to focus their investment of mental effort on key ideas. In educational research, a powerful framework for considering these factors jointly is cognitive load theory (CLT). CLT operates under the central premise that learners are only capable of attending to a finite amount of information at a given time due to the limited capacity of the working (short-term) memory system. So it is necessary to carefully manage the flow of information with which learners must grapple. It is likely that anyone who has taken an introductory course in educational or cognitive psychology will have heard of George Miller's (1956) magical number that people can only process seven information elements at a time, plus or minus two. However, what many people do not know is that this number is probably a substantial overestimate.
Miller obtained his finding by asking people to listen to strings of random numbers and recite them back as accurately as possible. These numbers were not linked to any context, and he assumed that they were ubiquitous placeholders for any type of information that people might need to process. What did not occur to Miller is that people use strings of numbers for many everyday tasks and have developed memory strategies to retain them. Think, for example, of how you remember a telephone number or your social security number; most people group the digits into two or three chunks (e.g., XXX-XXXX or XXX-XX-XXXX). It is these chunks that occupy space in working memory and help to organize the information so that it does not get lost. Subsequent research holds that the upper limit of our short-term memory is actually closer to four information pieces or chunks. Given these tight bandwidth constraints, how do human beings handle any complex task, especially one that has more than four discrete elements? To simplify, we handle the task-relevant information much as we would a phone number: we divide it into meaningful units based on our knowledge of the content and task structure. The more knowledge we have about a task, situation, or content area, the more efficiently and adaptively we are able to map discrete pieces of information onto schemas. These schemas are the abstract representations of our knowledge that serve as integrated templates for rapidly organizing the relevant facets of a situation. With deeper, more meaningful, and more interconnected knowledge, our schemas become more refined, nuanced, and capable of encoding increasing amounts of incoming information as a single chunk. Information that would occupy only one chunk for an advanced learner might be viewed by a novice as several discrete pieces of information.
Cognitive load is conceptualized as the number of separate chunks (schemas) processed concurrently in working memory while learning or performing a task, plus the resources necessary to process the interactions between them. Therefore a given learning task may impose different levels of cognitive load for different individuals based on their levels of relevant prior knowledge. Cognitive load is experienced as mental effort; novices need to invest a great deal of effort to accomplish a task that an expert might be able to handle with virtually none, because they lack sufficiently complex schemas. When cognitive load (the information to be processed) exceeds working memory's capacity to process it, students have substantial difficulties. The most straightforward effect is that they are unable to learn or solve problems. However, other problematic outcomes can also occur. First, students may revert to using older or less effortful approaches to the problem that impose a less heavy load on working memory. This means that previously held misconceptions or erroneous approaches may be brought to bear, reinforcing knowledge that is counter to the material they are trying to learn. Second, students may default to pursuing less effortful goals. In other words, they may procrastinate. In such situations, thinking about the whole of a complex task may be so overwhelming that students turn to more manageable activities: checking their email, cleaning their desks, or taking on whatever other chores do not exceed their processing ability. (Rumor has it that faculty have similar experiences.) For this reason, one of the strategies for overcoming procrastination is to reduce the magnitude of a goal by breaking a large task into its component parts and dealing with only one piece at a time. This limits the complexity of the task faced, which reduces the cognitive load it imposes to manageable levels. 
Managing cognitive load in teaching
In order to optimize the benefits of instruction, CLT prioritizes available information according to the type of cognitive load it imposes. Intrinsic load represents the inherent complexity of the material to be learned. The higher the number of components and the more those components interact, the greater the intrinsic load of the content. Extraneous load represents information in the instructional environment that occupies working memory space without contributing to comprehension or the successful solving of the problem presented. Germane load is the effort invested in the necessary instructional scaffolding and in learning concepts that facilitate further content learning. In this context, scaffolding refers to the cognitive support of learning that is provided during instruction. Just as a physical scaffold provides temporary support to a building that is under construction, with the intent that it will be removed when the structure is able to support itself, an instructional scaffold provides necessary cognitive assistance for learners until they are able to practice the full task without help. Extensive instruction typically provides multiple levels of support that are removed gradually to facilitate the ongoing development of proficiency. Processing the information provided as scaffolding imposes cognitive load. However, to the extent that it prevents the cognitive overload that would otherwise result for a learner struggling with new material, it is cost beneficial.
Thus, the three driving principles of CLT are: 1) present content to students with appropriate prior knowledge so that the intrinsic load of the material to be learned does not occupy all the available working memory, 2) eliminate extraneous load, and 3) judiciously impose germane load to support learning. For any instructional situation, the goal is to ensure that intrinsic, extraneous, and germane load combined do not exceed working memory capacity. But how can we manage this? Although we do not control the innate complexity of the material we teach, we can assess the prior knowledge of our students to ensure they understand prerequisite concepts. If they have schemas in place to facilitate the processing of the new concept, their intrinsic load is lower than if they need to grapple with every nuance of the material without the benefit of appropriate chunking strategies. This is an opportunity to effectively use technology. The use of clickers during lectures or short online assessments to be completed prior to attending class can provide a quick picture of which necessary elements students have in place before a new concept is introduced. If they lack the prerequisite knowledge, then the instructor should teach or provide that material first in order to prevent the advanced material from exceeding students' ability to process it. The good news about extraneous load is that it should be eliminated whenever possible rather than managed. In fact, there are a number of simple and straightforward principles for doing so in instructional materials as well as in the classroom. Some have to do with the information presented. For example, ancillary information that is not directly on point should be eliminated. This includes things like biographies of historic figures in science texts when the instructional objective is to teach a theory or procedure. 
While it may be an interesting human-interest story to consider whether or not an apple really fell on Newton's head, processing that information detracts from the working memory available to understand gravitational theory or how to solve problems using the law of gravity. Other practices target the presentation of information. For example, it is better to integrate explanatory text into a diagram than to keep it separate, because the cognitive load of mentally integrating the information can be avoided when they are collocated. On the other hand, reading aloud the text that students are looking at forces redundant processing of the same information and impedes their ability to retain the material. Because sensory information enters working memory through modality-specific pathways, which themselves have limited bandwidth, it is helpful for information to be distributed across modalities wherever possible. It is also helpful for all necessary information to enter working memory at approximately the same time. Thus, the first example uses linguistic and visual information together, which distributes the information across modalities and avoids the unnecessary load of holding the information from the diagram in working memory while searching for the appropriate text or vice versa. In contrast, the second example overloads the pathway that handles verbal information because it simultaneously delivers read and spoken information. It also requires that information from the text be held in working memory while the speech is processed, because people typically read to themselves much more quickly than words are read aloud. Germane load is a highly complicated issue. Building scaffolds for learning imposes cognitive load. Novices being introduced to material for the first time need a great deal of explicit instruction, using very small chunks of information, to deeply process new information or problem-solving strategies. 
As they acquire more knowledge and skill, though, the external scaffolding which initially helped them becomes unnecessary and redundant. If such learning supports are not eliminated for those students, they cease to facilitate learning as germane load and begin to hinder it as extraneous load. This expertise reversal effect is the biggest challenge for developing effective instruction, because students do not all attain the same level of comprehension at the same time. What is germane and helpful load for one student may be extraneous and harmful for another.
Effective Practices
The keys to applying cognitive load theory effectively in a course are advance planning and the ongoing monitoring of students' progress. Because the central premise of CLT is to optimize the allocation of students' working memory resources for mastering particular information, it is vital to identify very specifically what the instructional objectives are for the course as a whole and for each class meeting or module. If we cannot be precise about what we want students to know and be able to do, we will not be able to structure their experiences to help them accomplish this. Next, we need to sequence the objectives so as to present material in the order in which it is needed. If some topics build on others in the course, the prerequisite pieces should be taught before they are needed. For example, we should teach processes and procedures in the same sequence that students will perform them, so that work products from preceding steps can be used in subsequent steps. If the concepts, knowledge, or skills being taught do not have an inherent sequence, then it is generally most effective to order them from simplest to most complex. Once we have figured out what content needs to be taught and the appropriate progression of topics, it is most helpful to students when we let them in on the secret. Trying to impose order on disconnected information is highly effortful.
If we simply turn students loose on the material without presenting clearly what they should be trying to get from it and how it fits into the larger picture of the course's content, much of their cognitive resources will be allocated to figuring out what information is important (extraneous load) rather than focusing on constructing the knowledge necessary to meet our learning objectives. Although the logic of the course content and sequence may be obvious to us as knowledgeable instructors and content experts, our students arrive without the benefit of the schemas we have developed. Regardless of their previous experiences (or perhaps because of them), they sincerely appreciate knowing up front what they will be learning, what is expected of them, how they will be assessed, and how all of these elements fit together. When these components of the course are unclear, students invest substantial effort in figuring them out. Further, they may reach incorrect conclusions, which leads to more extraneous effort as they work at cross purposes to the course. Having mapped out the information in the course, we also need to determine how well students comprehend any knowledge on which later course content depends. This does not mean that we must burden our students (and ourselves) with exams or large assignments every week. Instead, we can use lightweight, rapid assessments that are not formally graded but are attuned to the key concepts upon which the new material draws. These can include short online surveys on the content that must be submitted a few days before class, quick check-in conversations as class begins, or multiple-choice questions on key issues that students must respond to using personal response systems (clickers). These tools are most effective when students are accountable for submitting a response but not for the accuracy of their answers. 
The purpose is to inform the instruction we provide rather than to increase students' anxiety (i.e., emotionally invoked extraneous load) about not knowing a correct answer. If students generally have a strong grasp of the prerequisite material, the likelihood of cognitive overload will be small, less scaffolding will be needed, and they can move directly into problem-solving. But if their understanding is weak, it will be important to review the prior material in detail, structure the new content as much as possible, and move slowly through it. When introducing problem-solving procedures to novices, providing worked examples is a very helpful practice. This involves demonstrating and explaining the reasoning processes that are involved in solving a class of problems, using a representative example. This helps to manage cognitive load effectively in several ways. When a problem is taken on, there are two sources of potential load for a learner. The first is the need to structure the information provided to effectively frame and analyze the problem. The second is the application of appropriate problem-solving strategies. The worked example both demonstrates problem-framing and provides a concrete model of an appropriate problem-solving strategy. This reduces the degree of uncertainty under which the students are working on three fronts. First, it allows them to map concrete instances onto relevant schemas, facilitating effective chunking. Second, it reduces their reliance on highly effortful trial-and-error attempts to identify productive solutions, which substantially increase cognitive load and time spent without providing any learning advantage. Last, it breaks the procedure down into distinguishable steps that can be considered in smaller, more manageable chunks.
After walking through a full example, an excellent way to help students practice without getting overloaded is to provide a partially worked example and ask them to pick up where the completed part of the example leaves off. Having them practice the last steps first ensures that all aspects of the strategy to be learned are practiced. In complex, open-ended problems, students can get off track midway through an exercise and never have the opportunity to practice its later elements. As students become proficient in the later steps, they can be given problems with fewer steps completed for them. In this way, instructors can effectively control the overall level of cognitive load imposed by the problem and ramp up to full problems after students have developed effective schemas and chunking strategies.

Practice makes perfect

As students encounter repeated instances of problem types during their learning, their strategies become more nuanced (to accommodate small differences between the problems) and less effortful to execute. As they practice, their skills require less and less conscious monitoring, which reduces the level of cognitive load that problem-solving imposes. This lets them efficiently address problems of increasing complexity. Experts are able to solve problems beyond the scope of what laymen can handle precisely because their core problem-solving procedures impose virtually no load on working memory. Therefore, they can assimilate very subtle nuances and much more complex problem features with their extra cognitive capacity. The benefits of practice are just as powerful for teachers as they are for students. Teaching effectively and using cognitive load theory to guide practice is challenging. It requires the focused consideration of many details regarding our students, their knowledge, and our instructional goals.
But with sustained effort, careful observations of what seems to yield more efficient and effective learning, and a willingness to make changes as necessary, these practices become less effortful. This frees up our own working memory resources to use for addressing both further complexities in addressing the learning needs of our students and the subtleties of our own disciplinary passions.

Resources

1. Bernard, R. M., Abrami, P. C., Lou, Y., Borokhovski, E., Wade, A., Wozney, L., Wallet, P. A., Fiset, M. and Huang, B. (2004) How does distance education compare with classroom instruction? A meta-analysis of the empirical literature. Review of Educational Research 74:3, pp. 379-439.
2. Bernard, R. M., Abrami, P. C., Borokhovski, E., Wade, C. A., Tamim, R. M., Surkes, M. A. and Bethel, E. C. (2009) A meta-analysis of three types of interaction treatments in distance education. Review of Educational Research 79:3, pp. 1243-1289.
3. Clark, R. C., Nguyen, F. and Sweller, J. (2005) Efficiency in learning: Evidence-based guidelines to manage cognitive load. John Wiley & Sons, San Francisco.
4. Clark, R. E. (2001) Learning from media: Arguments, analysis, and evidence. Information Age Publishing, Charlotte, NC.
5. Cowan, N. (2000) The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences 24, pp. 87-185.
6. Feldon, D. F. (2007) Cognitive load in the classroom: The double-edged sword of automaticity. Educational Psychologist 42:3, pp. 123-137.
7. Kalyuga, S., Ayres, P., Chandler, P. and Sweller, J. (2003) The expertise reversal effect. Educational Psychologist 38:1, pp. 23-31.
8. Mayer, R. E. (2009) Multimedia learning, 2nd ed. Cambridge University Press, New York.
9. Miller, G. A. (1956) The magical number seven, plus or minus two: Some limits on our capacity for processing information. The Psychological Review 63, pp. 81-97.
10. Schwartz, D. L. and Bransford, J. D. (1998) A time for telling.
Cognition and Instruction 16:4, pp. 475-522.
11. van Merriënboer, J. J. G. and Sweller, J. (2005) Cognitive load theory and complex learning: Recent developments and future directions. Educational Psychology Review 17:2, pp. 147-177.

David Feldon is an assistant professor of STEM education and educational psychology at the University of Virginia. His research examines the development of expertise in science, technology, engineering, and mathematics through a cognitive lens. He also studies the effects of expertise on instructors' abilities to teach effectively within their disciplines.

http://www.changemag.org/index.html

Editorial: Motivating Learning
by Margaret A. Miller

"Knowing how students learn and solve problems informs us how we should organise their learning environment and without such knowledge, the effectiveness of instructional designs is likely to be random." - John Sweller (Instructional Science 32: 9-31, 2004)

I've written in the past about the things we want students to learn, how we help them learn, and about resistance (mine and virtually everyone else's) to change. In this issue, those concerns converge. Determining what we want students to learn is the amazingly difficult first step in developing assessments of that learning, as the article by Dary Erwin and Joe DeFillippo demonstrates. And Marc Chun talks about linking teaching, learning, assessment, and the ultimate use of higher-order thinking skills by both teaching and assessing those skills through tasks that mimic how they will be used in real life. But what particularly intrigues me is the connection between cognition and change. Educational psychologists have developed a number of constructs to explain how the mind works. In this issue, David Feldon suggests that a familiarity with cognitive load theory can be a big help in developing effective pedagogies; we see that framework invoked, for example, in Carl Wieman's attempts to improve science instruction.
But there is other knowledge about human cognitive architecture that can also be useful as we think about teaching and learning. For instance, the human cognitive default is to solve problems with as small a mental investment as possible; we typically retreat to earlier mental models and quicker and less effortful automated problem-solving strategies when new information threatens to overwhelm us. So as Feldon suggests, teachers need to find some way to keep the investment low enough and the cognitive load light enough that those mechanisms don't come into play. We can also exploit the fact that we're more likely to try to solve problems in areas that are important to us by showing students the relevance of what we're teaching to their lives and concerns. But given the fundamentally conservative nature of human cognition, perhaps the question should be, why doesn't the whole learning system grind to a halt? In a way, it's remarkable that we ever learn anything at all. I remember that when my son was about a year old, he developed the locomotive strategy of scooting around on his knees (it beat crawling, since he could carry things). Once he had built up calluses thick enough to protect those knees, it was a remarkably efficient way to get from point A to point B, and it halved the height from which he would fall if something went wrong. I remember thinking at the time, what will ever motivate him to get up on his hind legs and wobble around when a misstep would cause him to fall from twice the height? What will prompt him, in short, to face the perils of change when things work so well and comfortably for him as they are? Come to think of it, our bipedal walk is a great metaphor for our alternation between imbalance and stability. The act of walking, researchers have discovered, is a continual falling forward, regaining our balance, then falling forward again. 
Something impels us to lift that foot and risk the fall, then we consolidate our new position momentarily, then we lift that foot and fall again, and so on. At the species level, there are clearly advantages in the impulse to generate, test out, and practice both old and new survival strategies (e.g., bipedalism) that can give one an evolutionary edge. But what lies on top of that drive for individual students? How do we motivate them to lift one foot and put it down a little ahead, let us help them organize and consolidate their momentary new equilibrium, and then lift the other? I think the answer can be found by looking not at learning in school but at spontaneous learning, particularly during play. When they play, children seem to be motivated by several things. Curiosity, for one. Another stimulus is wanting to master the environment (a bone-deep tendency, crucial to the human race's survival, that is as dangerous as fire when out of control but just as life-giving when contained), which is why children need plenty of free play where they make up the rules (as opposed to playing board games or participating in sports). A third stimulus may be the desire to imitate and take one's place among trusted and admired others, either peers or adults. Those tendencies don't need to be lost as one ages, as the success of Elderhostel attests, although Gradgrindian schooling can certainly grind them down. So our job as teachers may be to stand in what Vygotsky called the zone of proximal development, the stage in their cognitive growth that students haven't quite gotten to yet, and beckon them forward into what for them is uncharted but possibly alluring territory (the ending of Huckleberry Finn floats into my mind, where Huck tells Jim that it's time to light out for the territories, or the song by Jacques Brel in which he mentions his childhood longing for le Far West).
We motivate students to make that leap by stimulating their curiosity about the subject; by showing our own passion for it; by lessening the dangers of the move as we, knowing what their current maps look like, show them the path from there to here and how to organize their understanding of the new landscape; and by giving them as much control as possible over the learning environment. But more: I point you to Matt Procino's account (in Listening to Students in the previous issue) of taking over a class in child development. He modeled for students the very behavior he wanted them to exhibit in life as a result of what they learned in his class by soliciting sometimes uncomfortable feedback as he learned how to teach. Similarly, he had earlier let his Outward Bound students see that he too was afraid of the challenges he was asking them to take on but that they could summon the courage to do so because (see?) he was doing it. From the point of view of the students, an admired other gave them two things to imitate: not only how you scale a cliff but how you deal with the fear of scaling a cliff. People generally can't be dragged or whipped into forward movement; they'll run back to their earlier spot of equilibrium the minute the threat (of bad grades, for instance) stops. I know that I plant my feet stubbornly whenever I feel bullied (leading one professor, who tried to argue me into liking Wordsworth's Michael, a poem I detest to this day, to say to me in exasperation, "Miss Miller, why are you sometimes so dense?"). But I'm apt to leap joyfully ahead when beckoned by someone I trust and admire into knowledge that he or she is passionate about. And I want to be among the people who inhabit that new zone. That's why, at the end of a successful dissertation defense, I always say to the newly minted PhD, "Welcome to the community of scholars."
Figuration of English Words in Outlook and Sound

Introduction

The idea had been in my mind for more than 20 years before I published it on my blog on 22nd September 2007. The list of letter meanings is the central body of the idea. Each of the 26 letters has its own meaning, based either on its appearance or on a derivation summarized from thousands of words. However, the meaning of a letter is flexible: the letter O, for example, looks not only like a round blurry puzzle but also like an enclosed circle, or like a ring linking two parts into a new word. We need this flexibility to figure out the meanings of millions of words individually. The list also gives meanings for only some two-letter groups, because the other groups are already defined as words in common dictionaries or as affixes in textbooks. The meaning of each group is flexible in a typical way: one meaning comes from the list, and another comes from the meanings of its constituent letters. Cu, for example, means "cumulate" in the list; Cu can also mean "cut down", because C means "cut" and U means "down" in the list. It is generally unnecessary to define meanings for letter groups of more than three letters, because you can find their meanings in a dictionary. The idea is that you hold a copy of the list in one hand and figure out the meaning of any word through your own story of figuration. After a few days you will no longer need the list, except for an occasional check, because it is easy to remember. You should know that the consonants carry more of the meaning than the vowels: a vowel usually just emphasizes the meaning of the consonant or consonants in front of it, without much meaning of its own, i.e. a vowel together with its preceding consonant or consonants usually carries a single meaning. A very important skill is to break a word into letter groups.
A consonant with the vowel that follows it should form a group, but how to break up runs of consonants and vowels, and even how to separate sub-word groups, is up to you. There are no rules, only skills. Sometimes you need to add or remove a letter in a letter group to explain a word, because people did the same when they created words, for brevity or for good looks. For example, dispirit = dis + spirit; distress = dis + stress; account = a + count; and applause = a + plause. Let us begin with the word "love". Why does "love" consist of these four specific letters? Why does love mean to load venom, or change, into your heart or another's? Is that really how the word "love" was invented many, many years ago? It does not matter; the figuration may still be interesting and helpful for remembering it. Do you believe the idea is a miracle? You can find that the idea and the list work well for thousands and thousands of words. Your stories of figuration for many words may even match exactly what textbooks of word origin say about prefixes, suffixes, and stems from Greek, Latin, and European languages including Old English, or what a textbook of etymology says. The principle is that every word came to its present meaning through some process of formation or figuration; shall we try to find it? History has lost the original figuration; shall we discuss it and agree on a definition now? The idea is intended to give people a way to remember words easily and enjoyably. In any case, who cares whether your stories of figuration are true to the history of word origins, provided they help you remember words and are interesting? Your word stories may not be the same as mine, or you may be surprised to find that yours are exactly the same as mine. Most importantly, if you share them with people, you may find that you are special, even a genius.
If you compare a friend's stories of figuration with your own, you may find that your friend is special too, and a friend of yours by nature and personality. Exchanging stories about a specific word or a group of words could be an interesting, confidential game within an intimate circle of friends. Through figuration learning and practice, you will remember words in the way that suits you best. A word no longer looks hard and tedious, but lively, evocative, and interesting. Everyone can become a writer, authoring a personal word storybook like a diary, and it may even be published. If you would like to join my work (especially if you are a native English speaker or a language specialist), or would like to help me publish a book of these word stories in the form of a dictionary, please feel free to contact me by email at ypzong@mail.neu.edu.cn or leave feedback on my blog. Thank you!
From: http://cgi.cse.unsw.edu.au/~handbookofnlp/index.php?n=Chapter16.Chapter16

Ontology Construction
Philipp Cimiano, Paul Buitelaar and Johanna Völker

In this chapter we provide an overview of the current state-of-the-art in ontology construction with an emphasis on NLP-related issues such as text-driven ontology engineering and ontology learning methods. In order to put these methods into the broader perspective of knowledge engineering applications of this work, we also present a discussion of ontology research itself, in its philosophical origins and historic background as well as in terms of methodologies in ontology engineering.

Bibtex Citation

@incollection{Cimiano-handbook10,
  author = {Philipp Cimiano and Paul Buitelaar and Johanna V\"{o}lker},
  title = {Ontology Construction},
  booktitle = {Handbook of Natural Language Processing, Second Edition},
  editor = {Nitin Indurkhya and Fred J. Damerau},
  publisher = {CRC Press, Taylor and Francis Group},
  address = {Boca Raton, FL},
  year = {2010},
  note = {ISBN 978-1420085921}
}

Online Resources

Ontologies

General and upper-level ontologies
CYC http://www.opencyc.org
DOLCE http://www.loa-cnr.it/DOLCE.html
SUMO http://www.ontologyportal.org

Linguistic ontologies
OntoWordnet http://www.loa-cnr.it/Papers/ODBASE-WORDNET.pdf
Swinto, LingInfo

Domain-specific ontologies (publicly available) in some example domains
Biomedical:
Foundational Model of Anatomy http://sig.biostr.washington.edu/projects/fm/AboutFM.html
Gene Ontology http://www.geneontology.org
Repository of biomedical ontologies http://bioportal.bioontology.org
Business/Financial:
XBRL ontology http://xbrlontology.com
Geography:
Geonames ontology http://www.geonames.org/ontology/

Ontology Repositories and Search Engines
Swoogle http://swoogle.umbc.edu/
Watson http://watson.kmi.open.ac.uk
OntoSelect http://olp.dfki.de/ontoselect/
Oyster

Ontology Development
Ontology Development 101 http://ksl.stanford.edu/people/dlm/papers/ontology-tutorial-noy-mcguinness-abstract.html
Ontology Editors
Protégé http://protege.stanford.edu
NeOn Toolkit http://www.neon-toolkit.org
Swoop http://www.mindswap.org/2004/SWOOP/
TopBraid Composer

Ontology Engineering Methodologies
DILIGENT http://semanticweb.org/wiki/DILIGENT
HCOME http://semanticweb.org/wiki/HCOME
METHONTOLOGY http://semanticweb.org/wiki/METHONTOLOGY
OTK methodology http://semanticweb.org/wiki/OTK_methodology

Ontology Learning Tools
Text2Onto http://www.neon-toolkit.org/wiki/index.php/Text2Onto
OntoLearn http://lcl.di.uniroma1.it/tools.jsp
OntoLT http://olp.dfki.de/OntoLT/OntoLT.htm
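As a concrete illustration of the text-driven ontology learning that the chapter surveys, here is a minimal sketch of Hearst-style lexico-syntactic pattern extraction in Python. The regular expression, sample sentence, and function name are illustrative assumptions, not code from the chapter or from tools such as Text2Onto or OntoLearn.

```python
import re

# Toy "NP such as NP, NP and/or NP" pattern: the noun before "such as" is a
# candidate superclass (hypernym), the listed nouns are candidate subclasses.
SUCH_AS = re.compile(
    r"(\w+)\s+such\s+as\s+((?:\w+,\s*)*\w+(?:\s+(?:and|or)\s+\w+)?)"
)

def extract_isa(text):
    """Return (subclass, superclass) candidate pairs found via 'such as'."""
    pairs = []
    for hypernym, hyponyms in SUCH_AS.findall(text):
        for hyp in re.split(r",\s*|\s+and\s+|\s+or\s+", hyponyms):
            if hyp:
                pairs.append((hyp.lower(), hypernym.lower()))
    return pairs

text = "The corpus mentions vehicles such as cars, trucks and bicycles."
print(extract_isa(text))
# prints: [('cars', 'vehicles'), ('trucks', 'vehicles'), ('bicycles', 'vehicles')]
```

Real ontology learning systems add part-of-speech tagging, many more patterns, and statistical filtering of the candidate pairs; this sketch only shows the core pattern-matching idea.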
Journal of Medical Internet Research (JMIR) Volume 12 (2010)
* Impact Factor (2008): 3.6 - Ranked top (#1/20) in the Medical Informatics and second (#2/62) in the Health Services Research category
* http://www.jmir.org/2010 Content Alert, 25 Jan 2010

=================================
UPCOMING ISSUE Volume 12, Issue 1
http://www.jmir.org/2010/1
=================================

The following article(s) has/have just been published in the UPCOMING JMIR issue (Volume 12 / Issue 1): (articles are still being added for this issue)

Original Papers
------------------

Learning in a Virtual World: Experience With Using Second Life for Medical Education
John Wiecha, Robin Heyden, Elliot Sternthal, Mario Merialdi
J Med Internet Res 2010 (Jan 23); 12(1):e1
HTML (open access): http://www.jmir.org/2010/1/e1/
PDF (members only): http://www.jmir.org/2010/1/e1/PDF

Background: Virtual worlds are rapidly becoming part of the educational technology landscape. Second Life (SL) is one of the best known of these environments. Although the potential of SL has been noted for health professions education, a search of the world's literature and of the World Wide Web revealed a limited number of formal applications of SL for this purpose and minimal evaluation of educational outcomes. Similarly, the use of virtual worlds for continuing health professional development appears to be largely unreported.

Objective: Our objectives were to: 1) explore the potential of a virtual world for delivering continuing medical education (CME) designed for post-graduate physicians; 2) determine possible instructional designs for using SL for CME; 3) understand the limitations of SL for CME; 4) understand the barriers, solutions, and costs associated with using SL, including required training; 5) measure participant learning outcomes and feedback.

Methods: We designed and delivered a pilot postgraduate medical education program in the virtual world, Second Life.
We trained and enrolled 14 primary care physicians in an hour-long, highly interactive event in SL on the topic of type 2 diabetes. Participants completed surveys to measure change in confidence and performance on test cases to assess learning. The post survey also assessed participants' attitudes toward the virtual learning environment.

Results: Of the 14 participant physicians, 12 rated the course experience, 10 completed the pre and post confidence surveys, and 10 completed both the pre and post case studies. On a seven-point Likert scale (1, strongly disagree to 7, strongly agree), participants' mean reported confidence increased from pre to post SL event with respect to: selecting insulin for patients with type 2 diabetes (pre = 4.9 to post = 6.5, P = .002); initiating insulin (pre = 5.0 to post = 6.2, P = .02); and adjusting insulin dosing (pre = 5.2 to post = 6.2, P = .02). On test cases, the percent of participants providing a correct insulin initiation plan increased from 60% (6 of 10) pre to 90% (9 of 10) post (P = .2), and the percent of participants providing correct initiation of mealtime insulin increased from 40% (4 of 10) pre to 80% (8 of 10) post (P = .09). All participants (12 of 12) agreed that this experience in SL was an effective method of medical education, that the virtual world approach to CME was superior to other methods of online CME, that they would enroll in another such event in SL, and that they would recommend that their colleagues participate in an SL CME course.
Only 17% (2 of 12) disagreed with the statement that this Second Life method of CME is superior to face-to-face CME.

Conclusions: The results of this pilot suggest that virtual worlds offer the potential of a new medical education pedagogy to enhance learning outcomes beyond that provided by more traditional online or face-to-face postgraduate professional development activities. Obvious potential exists for application of these methods at the medical school and residency levels as well.
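The abstract above reports pre/post confidence comparisons with P values but does not name the statistical test in this excerpt. As an illustration only, here is how a paired t statistic for such pre/post Likert ratings could be computed with Python's standard library; the `paired_t` helper and all the ratings below are invented for the sketch, not the study's data or method.

```python
import math
from statistics import mean, stdev

def paired_t(pre, post):
    """Paired t statistic and degrees of freedom (df = n - 1) for
    pre/post scores from the same participants.

    Illustrative only: the study's actual test is not stated in this
    excerpt, and the scores used below are invented.
    """
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n)), n - 1

# Invented confidence ratings (1-7 Likert) for 10 participants.
pre  = [5, 4, 6, 5, 4, 5, 6, 5, 4, 5]
post = [6, 6, 7, 6, 6, 6, 7, 6, 6, 6]
t, df = paired_t(pre, post)
print(round(t, 2), df)  # prints: 8.51 9; look up P against a t table with df = 9
```

With small samples of ordinal Likert data, a non-parametric alternative such as the Wilcoxon signed-rank test is also commonly used; the sketch only shows the mechanics of a paired comparison.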
From: http://www.cse.ust.hk/~sinnopan/conferenceTL.htm

List of Conferences and Workshops Where Transfer Learning Papers Appear
This webpage will be updated regularly.

Main Conferences

Machine Learning and Artificial Intelligence Conferences

AAAI
2008
Transfer Learning via Dimensionality Reduction
Transferring Localization Models across Space
Transferring Localization Models over Time
Transferring Multi-device Localization Models using Latent Multi-task Learning
Text Categorization with Knowledge Transfer from Heterogeneous Data Sources
Zero-data Learning of New Tasks
2007
Transferring Naive Bayes Classifiers for Text Classification
Mapping and Revising Markov Logic Networks for Transfer Learning
Measuring the Level of Transfer Learning by an AP Physics Problem-Solver
2006
Using Homomorphisms to Transfer Options across Continuous Reinforcement Learning Domains
Value-Function-Based Transfer for Reinforcement Learning Using Structure Mapping

IJCAI
2009
Transfer Learning Using Task-Level Features with Application to Information Retrieval
Transfer Learning from Minimal Target Data by Mapping across Relational Domains
Domain Adaptation via Transfer Component Analysis
Knowledge Transfer on Hybrid Graph
Manifold Alignment without Correspondence
Robust Distance Metric Learning with Auxiliary Knowledge
Can Movies and Books Collaborate? Cross-Domain Collaborative Filtering for Sparsity Reduction
Exponential Family Sparse Coding with Application to Self-taught Learning
2007
Learning and Transferring Action Schemas
General Game Learning Using Knowledge Transfer
Building Portable Options: Skill Transfer in Reinforcement Learning
Transfer Learning in Real-Time Strategy Games Using Hybrid CBR/RL
An Experts Algorithm for Transfer Learning
Transferring Learned Control-Knowledge between Planners
Effective Control Knowledge Transfer through Learning Skill and Representation Hierarchies
Efficient Bayesian Task-Level Transfer Learning

ICML
2009
Deep Transfer via Second-Order Markov Logic
Feature Hashing for Large Scale Multitask Learning
A Convex Formulation for Learning Shared Structures from Multiple Tasks
EigenTransfer: A Unified Framework for Transfer Learning
Domain Adaptation from Multiple Sources via Auxiliary Classifiers
Transfer Learning for Collaborative Filtering via a Rating-Matrix Generative Model
2008
Bayesian Multiple Instance Learning: Automatic Feature Selection and Inductive Transfer
Multi-Task Learning for HIV Therapy Screening
Self-taught Clustering
Manifold Alignment using Procrustes Analysis
Automatic Discovery and Transfer of MAXQ Hierarchies
Transfer of Samples in Batch Reinforcement Learning
Hierarchical Kernel Stick-Breaking Process for Multi-Task Image Analysis
Multi-Task Compressive Sensing with Dirichlet Process Priors
A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning
2007
Boosting for Transfer Learning
Self-taught Learning: Transfer Learning from Unlabeled Data
Robust Multi-Task Learning with t-Processes
Multi-Task Learning for Sequential Data via iHMMs and the Nested Dirichlet Process
Cross-Domain Transfer for Reinforcement Learning
Learning a Meta-Level Prior for Feature Relevance from Multiple Related Tasks
Multi-Task Reinforcement Learning: A Hierarchical Bayesian Approach
The Matrix Stick-Breaking Process for Flexible Multi-Task Learning
Asymptotic Bayesian Generalization Error When Training and Test Distributions Are Different
Discriminative Learning for Differing Training and Test Distributions
2006
Autonomous Shaping: Knowledge Transfer in Reinforcement Learning
Constructing Informative Priors using Transfer Learning

NIPS
2008
Clustered Multi-Task Learning: A Convex Formulation
Multi-task Gaussian Process Learning of Robot Inverse Dynamics
Transfer Learning by Distribution Matching for Targeted Advertising
Translated Learning: Transfer Learning across Different Feature Spaces
An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis
Domain Adaptation with Multiple Sources
2007
Learning Bounds for Domain Adaptation
Transfer Learning using Kolmogorov Complexity: Basic Theory and Empirical Evaluations
A Spectral Regularization Framework for Multi-Task Structure Learning
Multi-task Gaussian Process Prediction
Semi-Supervised Multitask Learning
Gaussian Process Models for Link Analysis and Transfer Learning
Multi-Task Learning via Conic Programming
Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation
2006
Correcting Sample Selection Bias by Unlabeled Data
Dirichlet-Enhanced Spam Filtering based on Biased Samples
Analysis of Representations for Domain Adaptation
Multi-Task Feature Learning

AISTATS
2009
A Hierarchical Nonparametric Bayesian Approach to Statistical Language Model Domain Adaptation
2007
Kernel Multi-task Learning using Task-specific Features
Inductive Transfer for Bayesian Network Structure Learning

ECML/PKDD
2009
Relaxed Transfer of Different Classes via Spectral Partition
Feature Selection by Transfer Learning with Linear Regularized Models
Semi-Supervised Multi-Task Regression
2008
Actively Transfer Domain Knowledge
An Algorithm for Transfer Learning in a Heterogeneous Environment
Transferred Dimensionality Reduction
Modeling Transfer Relationships between Learning Tasks for Improved Inductive Transfer
Kernel-Based Inductive Transfer
2007
Graph-Based Domain Mapping for Transfer Learning in General Games
Bridged Refinement for Transfer Learning
Transfer Learning in Reinforcement Learning Problems Through Partial Policy Recycling
Domain Adaptation of Conditional Probability Models via Feature Subsetting
2006
Skill Acquisition via Transfer Learning and Advice Taking

COLT
2009
Online Multi-task Learning with Hard Constraints
Taking Advantage of Sparsity in Multi-Task Learning
Domain Adaptation: Learning Bounds and Algorithms
2008
Learning coordinate gradients with multi-task kernels
Linear Algorithms for Online Multitask Classification
2007
Multitask Learning with Expert Advice
2006
Online Multitask Learning

UAI
2009
Bayesian Multitask Learning with Latent Hierarchies
Multi-Task Feature Learning Via Efficient L2,1-Norm Minimization
2008
Convex Point Estimation using Undirected Bayesian Transfer Hierarchies

Data Mining Conferences

KDD
2009
Cross Domain Distribution Adaptation via Kernel Mapping
Extracting Discriminative Concepts for Domain Adaptation in Text Mining
2008
Spectral domain-transfer learning
Knowledge transfer via multiple model local structure mapping
2007
Co-clustering based Classification for Out-of-domain Documents
2006
Reverse Testing: An Efficient Framework to Select Amongst Classifiers under Sample Selection Bias

ICDM
2008
Unsupervised Cross-domain Learning by Interaction Information Co-clustering
Using Wikipedia for Co-clustering Based Cross-domain Text Classification

SDM
2008
Type-Independent Correction of Sample Selection Bias via Structural Discovery and Re-balancing
Direct Density Ratio Estimation for Large-scale Covariate Shift Adaptation
2007
On Sample Selection Bias and Its Efficient Correction via Model Averaging and Unlabeled Examples
Probabilistic Joint Feature Selection for Multi-task Learning

Application Conferences

SIGIR
2009
Mining Employment Market via Text Block Detection and Adaptive Cross-Domain Information Extraction
Knowledge transformation for cross-domain sentiment classification
2008
Topic-bridged PLSA for cross-domain text classification
2007
Cross-Lingual Query Suggestion Using Query Logs of Different Languages
2006
Tackling Concept Drift by Temporal Inductive Transfer
Constructing Informative Prior Distributions from Domain Knowledge in Text Classification
Building Bridges for Web Query Classification

WWW
2009
Latent Space Domain Transfer between High Dimensional Overlapping Distributions
2008
Can Chinese web pages be classified with English data source?

ACL
2009
Transfer Learning, Feature Selection and Word Sense Disambiguation
Graph Ranking for Sentiment Transfer
Multi-Task Transfer Learning for Weakly-Supervised Relation Extraction
Cross-Domain Dependency Parsing Using a Deep Linguistic Grammar
Heterogeneous Transfer Learning for Image Clustering via the Social Web
2008
Exploiting Feature Hierarchy for Transfer Learning in Named Entity Recognition
Multi-domain Sentiment Classification
Active Sample Selection for Named Entity Transliteration
Mining Wiki Resources for Multilingual Named Entity Recognition
Multi-Task Active Learning for Linguistic Annotations
2007
Domain Adaptation with Active Learning for Word Sense Disambiguation
Frustratingly Easy Domain Adaptation
Instance Weighting for Domain Adaptation in NLP
Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification
Self-Training for Enhancement and Domain Adaptation of Statistical Parsers Trained on Small Datasets
2006
Estimating Class Priors in Domain Adaptation for Word Sense Disambiguation
Simultaneous English-Japanese Spoken Language Translation Based on Incremental Dependency Parsing and Transfer

CVPR
2009
Domain Transfer SVM for Video Concept Detection
Boosted Multi-Task Learning for Face Verification With Applications to Web Image and Video Search
2008
Transfer Learning for Image Classification with Sparse Prototype Representations

Workshops
NIPS 2005 Workshop -
Inductive Transfer: 10 Years Later NIPS 2005 Workshop - Interclass Transfer NIPS 2006 Workshop - Learning when test and training inputs have different distributions AAAI 2008 Workshop - Transfer Learning for Complex Tasks
Reposted from: http://apex.sjtu.edu.cn/apex_wiki/Transfer%20Learning

Transfer Learning (迁移学习)

Gui-Rong Xue (薛贵荣)

In the traditional machine learning framework, the learning task is to train a classification model from abundant labeled training data and then use the learned model to classify and predict test documents. However, machine learning algorithms face a key problem in current Web mining research: in many newly emerging domains, large amounts of labeled training data are very hard to obtain. Web applications evolve rapidly, and new domains keep appearing, from traditional news to web pages, to images, to blogs and podcasts. Traditional machine learning would require labeling a large training set for each of these domains, at enormous cost in human effort, and without large labeled data sets much learning-related research and many applications cannot proceed. Second, traditional machine learning assumes that the training data and the test data follow the same distribution. In many situations this assumption does not hold; a common case is that the training data have simply gone out of date. This usually forces us to relabel large amounts of training data, which is again very expensive. Seen from another angle, if we already possess large amounts of training data drawn from different distributions, discarding them entirely is wasteful. How to make good use of such data is the main problem transfer learning addresses: it transfers knowledge from existing data to help future learning. The goal of transfer learning is to apply knowledge learned in one environment to the learning task in a new environment; unlike traditional machine learning, it therefore does not make the identical-distribution assumption.

Our work on transfer learning currently falls into three parts: instance-based transfer learning in a homogeneous feature space, feature-based transfer learning in a homogeneous feature space, and transfer learning across heterogeneous feature spaces. Our research shows that instance-based transfer has the strongest knowledge-transfer ability, feature-based transfer has a broader knowledge-transfer ability, and heterogeneous-space transfer has the widest ability to learn and generalize; each approach has its own strengths.

1. Instance-based transfer learning in a homogeneous space

The basic idea of instance-based transfer learning is that, even though the auxiliary training data differ more or less from the source training data, some portion of the auxiliary data should still be suitable for training an effective classification model that fits the test data. The goal is therefore to pick out the auxiliary instances that suit the test data and transfer them into the learning on the source training data. Along this line we extended the classical AdaBoost algorithm into a boosting algorithm with transfer ability, TrAdaBoost, which exploits the auxiliary training data as fully as possible to help classification on the target. The key idea is to use boosting to filter out the auxiliary instances that are least like the source training data. Boosting supplies an automatic weight-adjustment mechanism: the weights of useful auxiliary instances increase, while the weights of unhelpful ones decrease. After reweighting, the weighted auxiliary data serve as additional training data and, together with the source training data, improve the reliability of the classification model.

Instance-based transfer works only when the source data and the auxiliary data are very similar. When they differ substantially, instance-based algorithms often fail to find transferable knowledge. We observed, however, that even when the source and target data share no common knowledge at the instance level, they may still intersect at the feature level. We therefore studied feature-based transfer learning, which asks how to exploit knowledge shared at the feature level.

2. Feature-based transfer learning in a homogeneous space

For feature-based transfer learning we have proposed several algorithms, including CoCC, TPLSA, a spectral-analysis algorithm, and a self-taught learning algorithm. Several of these use a co-clustering algorithm to produce a common feature representation that assists the learner. The basic idea is to cluster the source data and the auxiliary data simultaneously, obtaining a shared feature representation that is better than one derived from the source data alone; representing the source data in this new space realizes the transfer. Applying this idea, we proposed both supervised and unsupervised feature-based transfer learning.

2.1 Supervised feature-based transfer learning

Our work on supervised feature-based transfer is co-clustering-based cross-domain classification. It addresses the following question: given a new, different domain in which labeled data are extremely scarce, how can the abundant labeled data of an existing domain be used for transfer learning? In this work we give a unified information-theoretic formulation of cross-domain classification, in which co-clustering-based classification is cast as the optimization of an objective function. In our model, the objective is defined as the loss of mutual information among the source instances, the common feature space, and the auxiliary instances.

2.2 Unsupervised feature-based transfer learning: self-taught clustering

Our self-taught clustering algorithm is a piece of work on unsupervised feature-based transfer learning. Here the question is: in practice even labeled auxiliary data may be hard to obtain, so how can large amounts of unlabeled auxiliary data be used for transfer learning? The basic idea of self-taught clustering is to cluster the source data and the auxiliary data simultaneously to obtain a common feature representation; because this new representation is based on a large amount of auxiliary data, it is better than a representation derived from the source data alone, and it therefore helps the clustering.

The two learning strategies above (supervised and unsupervised feature-based transfer learning) both address feature-based transfer in which the source data and the auxiliary data lie in the same feature space. For the case where the source and auxiliary data lie in different feature spaces, we have also studied feature-based transfer across feature spaces, another variant of feature-based transfer learning.

3. Transfer learning across heterogeneous spaces: translated learning

Our proposed translated learning targets the situation where the source data and the test data belong to two different feature spaces. In that work we use large amounts of easily obtained labeled text data to help an image-classification task that has only a few labels, as shown in the figure above. Our method builds a bridge between the two feature spaces using data that carry both views. Although such two-view data may not themselves be usable as training data for the classifier, they can be used to build a translator. Through this translator we combine a nearest-neighbor method with feature translation, translate the auxiliary data into the source feature space, and perform learning and classification with a unified language model.

References:

Wenyuan Dai, Yuqiang Chen, Gui-Rong Xue, Qiang Yang, and Yong Yu. Translated Learning: Transfer Learning across Different Feature Spaces. Advances in Neural Information Processing Systems 21 (NIPS 2008), Vancouver, British Columbia, Canada, December 8-13, 2008.

Xiao Ling, Wenyuan Dai, Gui-Rong Xue, Qiang Yang, and Yong Yu. Spectral Domain-Transfer Learning. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2008), Pages 488-496, Las Vegas, Nevada, USA, August 24-27, 2008.

Wenyuan Dai, Qiang Yang, Gui-Rong Xue and Yong Yu. Self-taught Clustering. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML 2008), Pages 200-207, Helsinki, Finland, July 5-9, 2008.

Gui-Rong Xue, Wenyuan Dai, Qiang Yang and Yong Yu. Topic-bridged PLSA for Cross-Domain Text Classification. In Proceedings of the Thirty-first International ACM SIGIR Conference on Research and Development on Information Retrieval (SIGIR 2008), Pages 627-634, Singapore, July 20-24, 2008.

Xiao Ling, Gui-Rong Xue, Wenyuan Dai, Yun Jiang, Qiang Yang and Yong Yu. Can Chinese Web Pages be Classified with English Data Source? In Proceedings of the Seventeenth International World Wide Web Conference (WWW 2008), Pages 969-978, Beijing, China, April 21-25, 2008.

Xiao Ling, Wenyuan Dai, Gui-Rong Xue and Yong Yu. Knowledge Transferring via Implicit Link Analysis.
In Proceedings of the Thirteenth International Conference on Database Systems for Advanced Applications (DASFAA 2008), Pages 520-528, New Delhi, India, March 19-22, 2008.

Wenyuan Dai, Gui-Rong Xue, Qiang Yang and Yong Yu. Co-clustering based Classification for Out-of-domain Documents. In Proceedings of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2007), Pages 210-219, San Jose, California, USA, August 12-15, 2007.

Wenyuan Dai, Gui-Rong Xue, Qiang Yang and Yong Yu. Transferring Naive Bayes Classifiers for Text Classification. In Proceedings of the Twenty-Second National Conference on Artificial Intelligence (AAAI 2007), Pages 540-545, Vancouver, British Columbia, Canada, July 22-26, 2007.

Wenyuan Dai, Qiang Yang, Gui-Rong Xue and Yong Yu. Boosting for Transfer Learning. In Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML 2007), Pages 193-200, Corvallis, Oregon, USA, June 20-24, 2007.

Dikan Xing, Wenyuan Dai, Gui-Rong Xue and Yong Yu. Bridged Refinement for Transfer Learning. In Proceedings of the Eleventh European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2007), Pages 324-335, Warsaw, Poland, September 17-21, 2007. (Best Student Paper Award)

Xin Zhang, Wenyuan Dai, Gui-Rong Xue and Yong Yu. Adaptive Email Spam Filtering based on Information Theory. In Proceedings of the Eighth International Conference on Web Information Systems Engineering (WISE 2007), Pages 159-170, Nancy, France, December 3-7, 2007.

Transfer Learning (last edited 2009-10-29 03:03:46 by grxue)
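The instance-reweighting idea behind TrAdaBoost described in Section 1 can be sketched roughly as below. This is a simplified illustration, not the authors' published code: the function names (`tradaboost`, `fit_stump`), the decision-stump weak learner, the restriction to binary labels, and the use of every round in the final vote (the published algorithm votes with only the later learners) are all assumptions made here for brevity. The essential mechanism is visible, though: auxiliary instances that the weak learner gets wrong are down-weighted (filtered out as "unlike" the target), while misclassified target instances are up-weighted as in ordinary AdaBoost.

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump: pick the best (feature, threshold, polarity)."""
    best_err, best_rule = np.inf, None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (0, 1):
                pred = (X[:, j] > thr).astype(int) ^ pol
                err = w[pred != y].sum()
                if err < best_err:
                    best_err, best_rule = err, (j, thr, pol)
    j, thr, pol = best_rule
    return lambda Z: (Z[:, j] > thr).astype(int) ^ pol

def tradaboost(X_aux, y_aux, X_tgt, y_tgt, n_rounds=10):
    n_a, n_t = len(X_aux), len(X_tgt)
    X = np.vstack([X_aux, X_tgt])
    y = np.concatenate([y_aux, y_tgt])        # binary labels in {0, 1}
    w = np.ones(n_a + n_t) / (n_a + n_t)      # uniform initial weights
    # fixed shrink factor for misclassified auxiliary instances
    beta_aux = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_a) / n_rounds))
    models = []
    for _ in range(n_rounds):
        w = w / w.sum()
        h = fit_stump(X, y, w)                # weak learner on weighted data
        wrong = h(X) != y
        # training error is measured on the target portion only
        eps = w[n_a:][wrong[n_a:]].sum() / w[n_a:].sum()
        eps = min(max(eps, 1e-10), 0.499)
        beta_t = eps / (1.0 - eps)
        # auxiliary instances: shrink weight when misclassified (filter unlike data)
        w[:n_a] *= np.where(wrong[:n_a], beta_aux, 1.0)
        # target instances: grow weight when misclassified (standard boosting)
        w[n_a:] *= np.where(wrong[n_a:], 1.0 / beta_t, 1.0)
        models.append((h, np.log(1.0 / beta_t)))
    def predict(Xq):
        score = sum(alpha * (2 * h(Xq) - 1) for h, alpha in models)
        return (score > 0).astype(int)
    return predict
```

In a toy run with a mostly consistent auxiliary set plus one conflicting auxiliary point, the conflicting point is repeatedly misclassified and its weight decays geometrically by `beta_aux`, so later rounds are dominated by the data that actually match the target distribution.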