科学网


Tag: Rule


Related blog posts

Hund's Rule (洪特规则)
HpJi 2019-11-26 20:18
Hund's rule can be stated in several ways, for example:

1. When electrons occupy orbitals of equal energy (degenerate orbitals), they spread over as many different orbitals as possible with parallel spins; this arrangement gives the lowest total energy.
2. Electrons in orbitals with the same n and the same l occupy orbitals with different m values as far as possible, with parallel spins.

From Hund's rule one can determine: (1) the ground state has the largest S; (2) for the same S, the state with the largest L has the lower energy; (3) for the same S and L, the energy is lowest with the largest J (for a more-than-half-filled shell) or the smallest J (for a less-than-half-filled shell).
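
As a worked illustration, here is a minimal Python sketch (not from the original post; the function name and structure are my own) that applies the three rules mechanically to a single partially filled subshell to obtain the ground-state S, L and J:

# A minimal sketch that applies Hund's three rules to one partially filled
# subshell: maximize S, then maximize L, then take J = |L - S| for a
# less-than-half-filled shell and J = L + S otherwise.
from fractions import Fraction

def hund_ground_term(l, n_electrons):
    """Return (S, L, J) for n_electrons in a subshell of angular momentum l."""
    capacity = 2 * (2 * l + 1)
    assert 0 < n_electrons <= capacity
    m_values = list(range(l, -l - 1, -1))       # m_l = l, l-1, ..., -l
    # Rule 1: fill different orbitals with parallel (spin-up) electrons first.
    spin_up = min(n_electrons, 2 * l + 1)
    spin_down = n_electrons - spin_up
    S = Fraction(spin_up - spin_down, 2)
    # Rule 2: among those arrangements, maximize L = |sum of occupied m_l|.
    L = abs(sum(m_values[:spin_up]) + sum(m_values[:spin_down]))
    # Rule 3: J depends on whether the shell is more or less than half filled.
    J = abs(L - S) if n_electrons < capacity / 2 else L + S
    return S, L, J

# Example: a p^2 configuration (e.g. carbon) gives S=1, L=1, J=0, the 3P0 term.
print(hund_ground_term(1, 2))
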
On Recall of Rule-based Systems
liwei999 2016-8-1 13:34
After I showed the benchmarking results of SyntaxNet and our rule system based on grammar engineering, many people seemed surprised that the rule system beats the newest deep-learning based parser in data quality. I then got asked many questions; one of them is:

Q: We know that rules crafted by linguists are good at precision. How about recall?

This question deserves an in-depth discussion and a serious answer, because it touches the core of the viability of the forgotten school: why is it still there? What does it have to offer? The key is the excellent data quality of a hand-crafted system: not only precision, but also high recall is achievable. Before we elaborate, here was my quick answer to the question above:

1. Unlike precision, recall is not rules' forte, but there are ways to enhance it.
2. To enhance recall without compromising precision, one needs to develop more rules, organize the rules in a hierarchy, and organize the grammars in a pipeline, so recall is a function of time.
3. To enhance recall with limited compromise in precision, one can fine-tune the rules to loosen their conditions.

Let me address these points by presenting the scene of action for this linguistic art in its engineering craftsmanship.

A rule system is based on compiled computational grammars. A grammar is a set of linguistic rules encoded in some formalism. What happens in grammar engineering is not much different from other software engineering projects. As a knowledge engineer, a computational linguist codes a rule in an NLP-specific language, based on a development corpus. The development is data-driven: each line of rule code goes through rigid unit tests and then regression tests before it is submitted as part of the updated system. Depending on the architect's design, all kinds of information are available for the linguist developer to use in crafting a rule's conditions. A rule can check any element of a pattern by enforcing conditions on:

(i) the word or stem itself (i.e. the string literal, e.g. for capturing idiomatic expressions), and/or
(ii) POS (part-of-speech, such as noun, adjective, verb, preposition), and/or
(iii) orthography features (e.g. initial upper case, mixed case, token with digits and dots), and/or
(iv) morphology features (e.g. tense, aspect, person, number, case, etc., decoded by a previous morphology module), and/or
(v) syntactic features (e.g. verb subcategory features such as intransitive, transitive, ditransitive), and/or
(vi) lexical semantic features (e.g. human, animal, furniture, food, school, time, location, color, emotion).

There are almost infinitely many combinations of such conditions that can be enforced in rules' patterns. A linguist's job is to use such conditions to maximize the benefit in capturing the target language phenomena, through a process of trial and error.

Given this description of grammar engineering, what we expect to see in the initial stage of grammar development is a system that is precision-oriented by nature. Each rule developed is geared towards a target linguistic phenomenon based on the data observed in the development corpus: conditions can be as tight as one wants them to be, ensuring precision. But no single rule, or small set of rules, can cover all the phenomena, so recall is low in the beginning stage. To push things to the extreme: if a rule system is based on only one grammar consisting of only one rule, it is not difficult to quickly develop a system with 100% precision but very poor recall.
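
To make the idea of pattern conditions concrete, here is a hypothetical Python sketch of a tight, precision-oriented rule that checks the kinds of conditions listed above (literal, POS, lexical features). It is illustrative only, not the actual formalism or engine described in this post; all names are made up.

# A hypothetical illustration: a rule is a sequence of token constraints, each
# of which may check the literal string, the POS tag, or lexical/morphological
# features decoded by earlier modules.

def match_token(token, constraint):
    """token: dict like {'word': 'teacher', 'pos': 'NOUN', 'features': {'human'}}"""
    if 'word' in constraint and token['word'].lower() != constraint['word']:
        return False
    if 'pos' in constraint and token['pos'] != constraint['pos']:
        return False
    if 'features' in constraint and not constraint['features'] <= token['features']:
        return False
    return True

def match_rule(tokens, pattern):
    """Try the pattern at every start position; return the first matching span."""
    n, m = len(tokens), len(pattern)
    for start in range(n - m + 1):
        if all(match_token(tokens[start + i], pattern[i]) for i in range(m)):
            return (start, start + m)
    return None

# A tight rule: human noun + transitive verb + noun.
rule = [
    {'pos': 'NOUN', 'features': {'human'}},
    {'pos': 'VERB', 'features': {'transitive'}},
    {'pos': 'NOUN'},
]
sentence = [
    {'word': 'teacher', 'pos': 'NOUN', 'features': {'human'}},
    {'word': 'reads',   'pos': 'VERB', 'features': {'transitive'}},
    {'word': 'books',   'pos': 'NOUN', 'features': set()},
]
print(match_rule(sentence, rule))   # -> (0, 3)
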
But what good is a system that is precise but has no coverage? So a linguist is trained to generalize. In fact, most linguists are over-trained in school for theorizing and generalization before they get involved in industrial software development. In my own experience training new linguists to become knowledge developers, I often have to de-train this aspect of their education by enforcing strict procedures of data-driven and regression-free development. As a result, the system will generalize only to the extent allowed while maintaining a target precision, say 90% or above.

It is a balancing art, and experienced linguists are better at it than new graduates. Out of the explosive space of possible conditions, one will only test the most likely combinations, based on linguistic knowledge and judgement, in order to reach the desired precision with maximized recall of the target phenomena. For a given rule, it is always possible to increase recall at the expense of precision by dropping some conditions or replacing a strict condition with a loose one (e.g. checking a feature instead of a literal, or checking a general feature such as noun instead of a narrow feature such as human). When a rule is fine-tuned with proper conditions for the desired balance of precision and recall, the linguist developer moves on to another rule to cover more of the space of the target phenomena.

So as development goes on, and more data from the development corpus come onto the developer's radar, more rules are developed to cover more and more phenomena, much like silkworms eating mulberry leaves. This is incremental enhancement, fairly typical of software development cycles for new releases. Most of the time, newly developed rules will overlap with existing rules, but their logical OR points to an enlarged conquered territory. It is hard work, but recall gradually, and naturally, picks up with time while precision is maintained, until it hits the long tail with diminishing returns.

There are two caveats worth discussing for people who are curious about this seasoned school of grammar engineering.

First, not all rules are equal. A non-toy rule system often provides a mechanism for organizing rules in a hierarchy, for better quality as well as easier maintenance: after all, a grammar that is hard to understand and difficult to maintain has little prospect for debugging and incremental enhancement. Typically, a grammar has some general rules at the top which serve as defaults and cover the majority of phenomena well, but make mistakes on the exceptions, which are not rare in natural language. As is known to all, natural language is such a monster that almost no rule is without exceptions. Remember high school grammar class, where our teacher taught us grammar rules. For example, one rule says that a bare verb cannot be used as the predicate with a third-person-singular subject; the subject must agree with the predicate in person and number, marked by adding -s to the verb: hence She leaves instead of *She leave. But soon we found exceptions in sentences like The teacher demanded that she leave. This exception to the original rule occurs only in object clauses following certain main-clause verbs such as demand, labeled by linguists as the subjunctive mood. This more restricted rule needs to work together with the more general rule to yield a better formulated grammar.
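
As an illustration of that specific-over-general idea (again a hypothetical Python sketch, not the actual grammar formalism), the subjunctive exception can be consulted before the general agreement rule, which then applies only as the default:

# Illustrative only: the subjunctive exception is tried first, and the general
# third-person-singular agreement rule serves as the fallback default.

SUBJUNCTIVE_TRIGGERS = {'demand', 'demanded', 'insist', 'insisted', 'suggest', 'suggested'}

def agreement_ok(main_verb, subject_person, subject_number, verb_form):
    """verb_form is 'bare' or '3sg' (the verb + -s form)."""
    # Specific rule: in an object clause after a subjunctive trigger,
    # the bare verb is correct even with a 3rd-person-singular subject.
    if main_verb in SUBJUNCTIVE_TRIGGERS:
        return verb_form == 'bare'
    # General (default) rule: 3rd-person-singular subjects require the -s form.
    if subject_person == 3 and subject_number == 'sg':
        return verb_form == '3sg'
    return verb_form == 'bare'

# "She leaves."                          -> general rule applies
print(agreement_ok(None, 3, 'sg', '3sg'))         # True
# "The teacher demanded that she leave." -> specific rule overrides the default
print(agreement_ok('demanded', 3, 'sg', 'bare'))  # True
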
Likewise, in building a computational grammar for automatic parsing or other NLP tasks, we need to handle a spectrum of rules with different degrees of generalization in order to achieve good data quality with balanced precision and recall. Rather than adding more and more restrictions to keep a general rule from overkilling the exceptions, it is more elegant and practical to organize the rules in a hierarchy, so that general rules are applied only as defaults after more specific rules have been tried, or equivalently, specific rules are applied to overturn or correct the results of general rules. Thus, most real-life formalisms are equipped with a hierarchy mechanism to help linguists develop computational grammars that model the human linguistic capability in language analysis and understanding.

The second point related to the recall of a rule system is so significant, yet so often neglected, that it cannot be over-emphasized; it deserves a separate write-up of its own, so I will only present a concise conclusion here. It concerns multiple levels of parsing, which can significantly enhance recall for both parsing and parsing-supported NLP applications. In a multi-level rule system, each level is one module of the system, involving one grammar. Lower-level grammars help build local structures (e.g. basic Noun Phrases), performing shallow parsing. Systems designed this way are not only good for modularized engineering, but also great for recall, because shallow parsing shortens the distance between words that hold syntactic relations (including long-distance relations), and the lower-level constructions clear the way for higher-level rules to generalize in covering linguistic phenomena.
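
A toy sketch of such a multi-level design (illustrative Python only, not the actual engine described in this post): a shallow NP-chunking module runs first, and a higher-level rule then operates over the shorter sequence of chunks.

# Each module is a grammar applied to the output of the previous one, so lower
# levels (e.g. basic NP chunking) shorten the spans that higher-level rules see.

def np_chunker(units):
    """Very naive shallow parser: merge DET + NOUN into a single NP unit."""
    out, i = [], 0
    while i < len(units):
        if i + 1 < len(units) and units[i][1] == 'DET' and units[i + 1][1] == 'NOUN':
            out.append((units[i][0] + ' ' + units[i + 1][0], 'NP'))
            i += 2
        else:
            out.append(units[i])
            i += 1
    return out

def svo_rule(units):
    """Higher-level rule over chunks: NP + VERB + NP -> a subject-verb-object relation."""
    for i in range(len(units) - 2):
        if units[i][1] == 'NP' and units[i + 1][1] == 'VERB' and units[i + 2][1] == 'NP':
            return ('SVO', units[i][0], units[i + 1][0], units[i + 2][0])
    return None

pipeline = [np_chunker]          # lower levels first; deeper grammars would follow
tokens = [('the', 'DET'), ('teacher', 'NOUN'), ('reads', 'VERB'), ('a', 'DET'), ('book', 'NOUN')]
for module in pipeline:
    tokens = module(tokens)
print(svo_rule(tokens))          # ('SVO', 'the teacher', 'reads', 'a book')
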
In summary, a parser based on grammar engineering can reach very high precision, and there are proven, effective ways of enhancing its recall as well. High recall can be achieved if enough time and expertise are invested in the development. In the case of parsing, as shown by our test results, our seasoned English parser is good at both precision (96% vs. SyntaxNet 94%) and recall (94% vs. SyntaxNet 95%, only 1 percentage point lower) on the news genre, and on social media it is robust enough to beat SyntaxNet in both precision (89% vs. SyntaxNet 60%) and recall (72% vs. SyntaxNet 70%).

Related links and references:

* Is Google SyntaxNet Really the World's Most Accurate Parser?
* It is untrue that Google SyntaxNet is the "world's most accurate parser"
* R. Srihari, W. Li, C. Niu, T. Cornell. InfoXtract: A Customizable Intermediate Level Information Extraction Engine. Journal of Natural Language Engineering, 12(4), 1-37, 2006.
* K. Church. "A Pendulum Swung Too Far". Linguistics Issues in Language Technology, 2011, 6(5).
* Pros and Cons of Two Approaches: Machine Learning vs Grammar Engineering
* Introduction of Netbase NLP Core Engine
* Overview of Natural Language Processing
* Dr. Wei Li's English Blog on NLP
[Repost] Lipinski's Rule of Five (Lipinski 五规则)
yqx1985 2010-7-13 16:49
In 1997, Lipinski analyzed the structural features of 2,287 drug molecules, essentially all of which had passed Phase I clinical trials. The analysis showed that a drug molecule with good absorption and permeability should obey the following rules: (1) no more than 5 hydrogen-bond donors (hydrogen atoms attached to N or O); (2) no more than 10 hydrogen-bond acceptors (N and O atoms); (3) a relative molecular mass below 500; (4) an octanol-water partition coefficient (log P) below 5. These criteria are generally called the Rule of Five, and they have been widely used for the preliminary screening of compound databases. Lipinski's Rule of Five is probably the simplest principle for assessing pharmacokinetic properties, because the structural features it captures are closely related to a molecule's permeability and absorption.

Lipinski's Rule of Five states that, in general, an orally active drug has no more than one violation of the following criteria:

* Not more than 5 hydrogen bond donors (nitrogen or oxygen atoms with one or more hydrogen atoms)
* Not more than 10 hydrogen bond acceptors (nitrogen or oxygen atoms)
* A molecular weight under 500 g/mol
* A partition coefficient log P less than 5

Note that all numbers are multiples of five, which is the origin of the rule's name. To evaluate druglikeness better, the rule has spawned many extensions, for example one from a 1999 paper by Ghose et al.:

* Partition coefficient log P in the -0.4 to +5.6 range
* Molar refractivity from 40 to 130
* Molecular weight from 160 to 480
* Number of heavy atoms from 20 to 70

Over the past decade, Lipinski's profiling tool for druglikeness has led scientists to extend such profiling to lead-like properties of compounds, in the hope that a better starting point in early discovery can save time and cost.
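
A minimal Python sketch (not from the original post) that counts Rule-of-Five violations from precomputed descriptors; in practice the four values would come from a cheminformatics toolkit such as RDKit.

def rule_of_five_violations(h_donors, h_acceptors, mol_weight, log_p):
    """Return the number of Lipinski criteria the molecule violates."""
    violations = 0
    if h_donors > 5:        # more than 5 hydrogen-bond donors
        violations += 1
    if h_acceptors > 10:    # more than 10 hydrogen-bond acceptors
        violations += 1
    if mol_weight > 500:    # molecular weight over 500 g/mol
        violations += 1
    if log_p > 5:           # octanol-water partition coefficient over 5
        violations += 1
    return violations

def passes_rule_of_five(h_donors, h_acceptors, mol_weight, log_p):
    """Lipinski: an orally active drug usually has no more than one violation."""
    return rule_of_five_violations(h_donors, h_acceptors, mol_weight, log_p) <= 1

# Example with aspirin-like descriptor values (approximate, for illustration only).
print(passes_rule_of_five(h_donors=1, h_acceptors=4, mol_weight=180.2, log_p=1.2))  # True
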
