Me: This is how I teach my students NLP and AI: there is no intelligence inside artificial intelligence, and no knowledge inside knowledge systems. It is all a game you play with yourself, and the whole point is that, while playing with yourself, you try to play in a way that seems logical, natural, and convenient, and that is easy to remember and maintain.

Student: I followed the first part, but the AI part left me a bit lost.

Me: That's fine. Understanding the first part is what matters. The rest is philosophy, and philosophy need not be fully understood. If you understood everything, how would I, your mentor, make a living?

Student: What is the proper way to assign features to function words?

Me: Function words can be enumerated, so in principle they need no features at all; "proper" hardly comes into it. It depends on how you use them. If using them feels right, it is right; if it feels awkward or causes trouble, it isn't. And if you never use them, the question never arises: assigning them or not makes no difference, because they are never touched. That is entirely possible. Say you always write WORD-level rules for such a word, never giving feature-level rules a chance to match it; then the features are mere decoration, and "proper or not" is moot.

Student: That makes sense. There are only a handful of such words anyway; WORD rules will do, no need to agonize over features.

Me: Now you are starting to get it.

Student: Only by talking with you, teacher, do I get these moments; otherwise I would just dig myself into dead ends.

Me: Everyone is like that. Only after walking into n dead ends do you see the light at dead end n+1. Anyone who sees the light without ever hitting a dead end is not a genius but a dead man.

Student: How should knowledge representation in NLP, including dictionary features, be designed?

Me: From representing lexical features in the dictionary up to representing syntax, semantics, and logic, there is mostly no black-and-white standard answer. You assign things a certain way because it looks reasonable and is easy to remember; otherwise you feel uncomfortable, or you cannot remember it. More importantly, once the features are assigned, the rules become easy to write: natural, concise, general, and easy to maintain.

Almost everything is coordination: you assign, you use; no one is in between, no intelligence, no god. As long as it makes sense to you (not to others), so that you know what you are doing; as long as it is natural and easy to remember; as long as you find it convenient to use certain features in rules, and the rules are easy to read and easy to maintain. In principle you can assign anything to any word, or choose not to assign. What goes around comes around. You play with yourself; the computer knows nothing; features are just 0s and 1s. WHAT GOES AROUND COMES AROUND: that is NLP in an integrated system, whether it refers to POS, chunking, SVO, or logical form. It is there to make your job easy and yourself comfortable. You have no need to make others happy, unless your system is a middleware commodity serving your clients. If your NLP and your NLP apps are within your own control, they are integrated in your system, in your own architecture, and everything is internal coordination. This is my lecture on NLP Architecture for Dummies.

Bai: Who is "you"? An individual, a team, a company?

Me: Good question. In most cases it is the architect: he has the say. Sometimes it can be a bit democratic, if the architect wants to motivate his team — for example, with naming rights.

Bai: The architect of the whole system, or just of the NLP corner?
Me: A bit of knowledge gets named f1 or f2; that is arbitrary, and the main consideration is mnemonic: features must be easy to remember. But sometimes we let a team member decide a name. The practice often makes the team happy — wow, I get to act like God; I get to decide one drop of the sea in the system language...

Bai: Wei, you still haven't answered my last question: the architect of the whole system, or just of the NLP corner?

Me: The former, because we are talking about NLP and NLP apps in an integrated system. Here "apps" does not mean products but semantic grounding. Beyond grounding there is still a product layer, including UI and the rest, which is no longer our concern; grounding is simply the interface to the product. An NLP core engine seamlessly connected to NLP grounding is a design to die for. If the connection has seams — two teams, two architects, even two companies — then the finger-pointing never ends, the mess never gets cleaned up, and nothing big gets done. NLP and NLP products can and should be separated, but NLP and NLP grounding are best kept together. NLP grounding includes (1) IE, (2) MT, (3) dialogue (mapping), (4) QA, (5) ...: layered internally but undivided externally. That is what seamless connection means. One could say off-the-shelf components are deadly, and component technology has little future. Choosing off-the-shelf or licensed components is usually an act of last resort, when you temporarily lack the ability or conditions to build your own; there is also a respectable-sounding excuse — don't reinvent the wheel — but in the end you only hurt yourself. We have hurt ourselves this way several times, and only after much suffering arrived at this "insight of a decade", stated before: in industrial NLP, self-sufficiency is king.

Bai: That depends on what kind of experts a company has. Different experts, different models.

Me: It also depends on the era: twenty years from now, perhaps one will do NLP grounding just as well without being self-sufficient.

【相关】
【立委科普:NLP 联络图】
【立委科普:自然语言系统架构简说】
自给自足是NLP王道
【置顶:立委科学网博客NLP博文一览(定期更新版)】
《朝华午拾》总目录
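The "you assign, you use" point in the dialogue above can be made concrete with a toy sketch. This is a hypothetical illustration, not the author's actual system: feature names like `animate` are arbitrary mnemonics, and a literal WORD rule can always bypass features entirely, which is why an unused feature is neither proper nor improper.

```python
# A minimal sketch of arbitrary lexical features and two kinds of rule
# conditions: WORD=literal (enumeration) vs FEATURE=f (generalization).
# The lexicon entries and feature names are illustrative assumptions.

LEXICON = {
    "of":  {"features": set()},             # function word: enumerable, no features needed
    "dog": {"features": {"N", "animate"}},  # content word: features let rules generalize
    "run": {"features": {"V"}},
}

def match(token, condition):
    """A rule condition is either 'WORD=literal' or 'FEATURE=f'."""
    kind, value = condition.split("=")
    if kind == "WORD":
        return token == value
    return value in LEXICON.get(token, {}).get("features", set())

# A WORD rule fires on the word itself; a FEATURE rule covers a whole class.
assert match("of", "WORD=of")           # function word handled by enumeration
assert match("dog", "FEATURE=animate")  # content word handled by its features
assert not match("of", "FEATURE=N")     # no features assigned -> nothing to match
```

If every rule mentioning "of" is a WORD rule, any features on "of" are dead weight, exactly as the dialogue says: assigned or not, they are never touched.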
For natural language processing (NLP) and its applications, system architecture is the core issue. In my post 【立委科普:NLP 联络图】 I gave four framework diagrams for the architecture of an NLP system; here I walk through them one by one.

I divide an NLP system, from core engine up to applications, into four stages, corresponding to the four diagrams. At the bottom, and most central, is deep parsing: the automatic analyzer that works bottom-up over natural language, layer by layer. This is the hardest part of the work, but it is the foundational technology of the great majority of NLP systems.

The purpose of parsing is to turn unstructured language into structure. Facing the endless variety of language expression, only once it is structured do patterns become easy to capture, information easy to extract, and semantics easy to resolve. This has been a consensus of (computational) linguistics ever since Chomsky's 1957 revolution in linguistics proposed the transformation from surface structure to deep structure. A parse tree consists not only of the arcs expressing syntactic relations but also of the leaves (nodes), the words or phrases loaded with all kinds of information. Important as it is, the tree generally cannot support a product directly; it is only the system's internal representation, the vehicle of language analysis and understanding, and the core support for grounding semantics into applications.

The next layer up is extraction, as shown in the figure above. Its input is parse trees; its output is filled templates, much like filling in a form: for the intelligence an application needs, a table is defined in advance, and the extraction system fills in the blanks, pulling the relevant words or phrases out of the sentences into the predefined fields. This layer has moved on from the domain-independent parser to tasks oriented to a domain, an application, and product requirements.

It is worth emphasizing that the extraction layer is domain-oriented and semantically focused, while the analysis layer below it is domain-independent. A good architecture therefore pushes the analysis deep and logical, so as to lighten the load on extraction. Extracting on top of the logical-semantic structures produced by deep analysis, one extraction rule can be equivalent to hundreds or thousands of surface-level rules. That is what makes domain porting feasible.

There are two broad kinds of extraction. One is traditional information extraction (IE), which extracts facts — objective intelligence: entities, relations between entities, events involving entities — answering questions like "who did what, when, and where". This extraction of objective intelligence is the core technology and foundation of the now red-hot knowledge graph. Once IE is done, add the fusion step (IF: information fusion) from the mining layer that follows, and a knowledge graph can be built. The other kind extracts subjective intelligence; sentiment mining is built on it. This is what I have focused on over the past five years: fine-grained sentiment extraction (not just polarity classification, but also mining the reasons behind the sentiment, to provide evidence for decision making). It is one of the hardest tasks in NLP, much harder than objective IE. The extracted information is normally stored in some database, supplying fragment intelligence to the mining layer.

Many people confuse extraction (information extraction) with the next step, mining (text mining), but they are tasks at two different levels. Extraction faces one language tree at a time, hunting for the desired intelligence sentence by sentence. Mining faces a corpus, a data source as a whole, digging statistically valuable intelligence out of the language forest. In the information age, our greatest challenge is information overload; since we cannot exhaust the ocean of information, we must let computers mine the key intelligence out of it for different applications. Mining therefore naturally depends on statistics: without statistics, the extracted information remains a disorderly, highly redundant pile of fragments, which mining can consolidate.

Many systems do not mine deeply; they simply take the query expressing an information need as the entry point and, in real time, merge the top n results from the database of extracted, fragmentary intelligence and hand them to the product and the user. This too is mining, albeit a simple mining implemented by retrieval that directly supports the application.

In fact, much more can be done to mine well. Not only can mining consolidate and improve the quality of existing intelligence; done deeply, it can also surface hidden intelligence, i.e., intelligence not explicitly expressed in the metadata, such as causal relations between pieces of intelligence, or other statistical trends. This kind of mining was first done in traditional data mining, because traditional mining targets structured data such as transaction records, which makes hidden associations easy to dig out (the classic example: people who buy diapers often buy beer too — it turned out to be the habitual behavior of new fathers, and such intelligence helps optimize product placement and sales). Now that natural language, too, has been structured into extracted fragment intelligence in databases, the same mining of hidden associations can be done to raise the value of the intelligence.

The fourth architecture diagram is the NLP application (apps) layer. At this layer, the intelligence produced by analysis, extraction, and mining supports various NLP products and services: from question answering to dynamic browsing of knowledge graphs (visible already when searching for a celebrity on Google), from automatic polling to customer intelligence, from intelligent assistants to automatic summarization, and so on.

That is my overall account of the basic NLP architecture, based on nearly 20 years of building NLP products in industry. Eighteen years ago, it was with one NLP architecture diagram that I talked our way into our first round of venture capital; the investor himself told us it was a "million dollar slide". Today's account is extended and expanded from that diagram.

As heaven does not change, neither does the Way.
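The four-layer flow described above — parse, extract into predefined fields, store, then mine over the whole collection — can be sketched in miniature. Everything here is a hypothetical simplification for illustration: the "parser" is a stand-in that only handles a toy subject-verb-object pattern, and the template fields are invented.

```python
# Toy sketch of the architecture: parsing -> extraction (template filling)
# -> database of fragments -> mining (statistics over the corpus).

from collections import Counter

def parse(sentence):
    """Stand-in deep parser: returns a flat SVO structure for a toy
    'Subject Verb Object.' sentence. A real parser is far deeper."""
    subj, verb, obj = sentence.rstrip(".").split()[:3]
    return {"S": subj, "V": verb, "O": obj}

def extract(tree, fields=("who", "did_what", "to_whom")):
    """Extraction layer: map parse nodes into predefined template fields,
    like filling a form with slots drawn from the structure."""
    return dict(zip(fields, (tree["S"], tree["V"], tree["O"])))

corpus = ["Google acquired DeepMind.", "Google acquired Fitbit."]
database = [extract(parse(s)) for s in corpus]   # storage of fragments

# Mining layer: statistics over extracted fragments, not over raw sentences.
actor_counts = Counter(rec["who"] for rec in database)
assert actor_counts["Google"] == 2               # corpus-level intelligence
```

Note how mining only ever touches the database, never the sentences: extraction works tree by tree, mining works over the forest, which is exactly the division of labor argued for above.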
I have told this million-dollar-slide story somewhere before. Around 2000, during the Clinton years, America saw a great leap forward of internet technology, known as the .com bubble; for a while hot money poured in, and internet startups sprang up like bamboo shoots after rain. Against this backdrop, my boss decided to strike while the iron was hot and seek venture capital, and asked me to prepare an introduction to the language-system prototype we had built. So I drew the three-layer NLP architecture diagram below: at the bottom the parser, from shallow to deep; in the middle, information extraction built on parsing; at the top, several major classes of applications, including question answering. Connecting the applications with the two language-processing layers below is a database storing the extraction results, ready at any time to feed intelligence to the applications. Since I proposed it 15 years ago, this architecture has seen no major change, though its details and illustrations have been redrawn no fewer than a hundred times; the diagram in this post is roughly one of the first twenty versions, and this version covers only the core engine (back end), not the applications (front end). The diagram was emailed by my boss early one morning to an angel investor on Wall Street; by noon we had his reply expressing strong interest. Within two weeks, we received our first angel-investment check, for one million dollars. The investor said the diagram was brilliant — "this is a million dollar slide" — because it showed both the technology's barrier to entry and its enormous potential.

from 科学网—前知识图谱钩沉: 信息抽取引擎的架构

【相关】
【立委科普:NLP 联络图 (之一)】
前知识图谱钩沉: 信息抽取引擎的架构
【立委科普:自然语言parsers是揭示语言奥秘的LIGO式探测仪】
【征文参赛:美梦成真】
《OVERVIEW OF NATURAL LANGUAGE PROCESSING》
《NLP White Paper: Overview of Our NLP Core Engine》
White Paper of NLP Engine
《朝华午拾》总目录
【置顶:立委科学网博客NLP博文一览(定期更新版)】
2.2.2 System Background: InfoXtract

InfoXtract (Li and Srihari 2003; Srihari et al. 2000) is a domain-independent and domain-portable, intermediate-level IE engine. Figure 4 illustrates the overall architecture of the engine, which will be explained in detail shortly. The outputs of InfoXtract have been designed with information discovery in mind. Specifically, there is an attempt to:

· Merge information about the same entity into a single profile. While NE provides very local information, an entity profile which consolidates all mentions of an entity in a document is much more useful.
· Normalize information wherever possible; this includes time and location normalization. Recent work has also focused on mapping key verbs into verb synonym sets reflecting the general meaning of the action word.
· Extract generic events in a bottom-up fashion, as well as map them to specific event types in a top-down manner.

Figure 4. InfoXtract Engine Architecture

A description of the increasingly sophisticated IE outputs from the InfoXtract engine is given below:

· NE: Named Entity objects represent key items such as proper names of person, organization, product, location, and target; contact information such as address, email, phone number, and URL; time expressions; and numerical expressions such as date, year, and various measurements (weight, money, percentage, etc.).
· CE: Correlated Entity objects capture relationship mentions between entities, such as the affiliation relationship between a person and his employer. The results are consolidated into the information object Entity Profile (EP) based on co-reference and alias support.

· EP: Entity Profiles are rich, complex information objects that collect entity-centric information, in particular all the CE relationships that a given entity is involved in and all the events this entity is involved in. This is achieved through document-internal fusion and cross-document fusion of related information, based on support from co-reference, including alias association. Work is in progress to enhance the fusion by correlating the extracted information with information in a user-provided existing database.

· GE: General Events are verb-centric information objects representing 'who did what to whom, when, and where' at the logical level. Concept-based GE (CGE) further requires that participants of events be filled by EPs instead of NEs, and that the other values of the GE slots (the action, time, and location) be disambiguated and normalized.

· PE: Predefined Events are domain-specific or user-defined events of a specific event type, such as Product Launch and Company Acquisition in the business domain. They represent a simplified version of MUC ST. InfoXtract provides a toolkit that allows users to define and write their own PEs based on automatically generated PE rule templates.

The linguistic modules serve as the underlying support system for the different levels of IE. This support system involves almost all major linguistic areas: orthography, morphology, syntax, semantics, discourse, and pragmatics. A brief description of the linguistic modules is given below.

· Preprocessing: This component handles file-format conversion, text zoning, and tokenization. The task of text zoning is to identify and distinguish metadata, such as title, author, etc., from normal running text.
The task of tokenization is to convert the incoming linear string of characters from the running text into a token list; this forms the basis for subsequent linguistic processing.

· Word Analysis: This component includes word-level orthographical analysis (capitalization, symbol combination, etc.) and morphological analysis such as stemming. It also includes part-of-speech (POS) tagging, which distinguishes, e.g., a noun from a verb based on contextual clues. An optional HMM-based Case Restoration module is called when performing case-insensitive QA (Li et al. 2003a).

· Phrase Analysis: This component, also called shallow parsing, undertakes basic syntactic analysis and establishes simple, un-embedded linguistic structures such as basic noun phrases (NP), verb groups (VG), and basic prepositional phrases (PP). This is a key linguistic module, providing the building blocks for subsequent dependency linkages between phrases.

· Sentence Analysis: This component, also called deep parsing, decodes the underlying dependency trees that embody logical relationships such as V-S (verb-subject), V-O (verb-object), and H-M (head-modifier). The InfoXtract deep parser transforms various patterns, such as active patterns and passive patterns, into the same logical form, with the argument structure at its core. This involves a considerable amount of semantic analysis. The decoded structures are crucial for supporting structure-based grammar development and/or structure-based machine learning for relationship and event extraction.

· Discourse Analysis: This component studies structure across sentence boundaries. One key task of discourse analysis is to decode the co-reference (CO) links between pronouns (he, she, it, etc.) or other anaphors (this company, that lady) and their antecedent named entities. A special type of CO task is 'Alias Association', which links International Business Machine with IBM and Bill Clinton with William Clinton.
The results support information merging and consolidation for profiles and events.

· Pragmatic Analysis: This component distinguishes important, relevant information from unimportant, irrelevant information based on lexical resources, structural patterns, and contextual clues.

Lexical Resources

The InfoXtract engine uses various lexical resources, including the following:

· General English dictionaries available in electronic form, providing the basis for syntactic information. The Oxford Advanced Learners' Dictionary (OALD) is used extensively.
· Specialized glossaries for person names, location names, organization names, products, etc.
· Specialized semantic dictionaries reflecting words that denote person, organization, etc. For example, doctor corresponds to person; church corresponds to organization. This is especially useful in QA. Both WordNet and custom thesauri are used in InfoXtract.
· Statistical language models for Named Entity tagging (retrainable for new domains).

InfoXtract exploits a large number of lexical resources. Separating lexicon modules from grammars brings three advantages: (i) high speed, due to indexing-based lookup; (ii) sharing of lexical resources by multiple grammar modules; (iii) convenience in managing grammars and lexicons. InfoXtract uses two approaches to lexical disambiguation. The first is a traditional feature-based grammatical/machine-learning approach, in which semantic features are assigned to lexical entries and subsequently used by the grammatical modules. The second approach involves expert lexicons, which are discussed in the next section.

Intermediate-Level Event Extraction for Temporal and Spatial Analysis and Visualization (SBIR Phase 2)
Wei Li, Ph.D., Principal Investigator
Rohini K. Srihari, Ph.D., Co-Principal Investigator
Contract No.
F30602-01-C-0035, September 2003

《朝华午拾:创业之路》
前知识图谱钩沉: 信息体理论 2015-10-31
前知识图谱钩沉,信息抽取任务由浅至深的定义 2015-10-30
前知识图谱钩沉,关于事件的抽取 2015-10-30
SVO as General Events 2015-10-25
Pre-Knowledge-Graph Profile Extraction Research via SBIR 2015-10-24
《知识图谱的先行:从 Julian Hill 说起》 2015-10-24
朝华午拾:在美国写基金申请的酸甜苦辣 - 科学网
【置顶:立委科学网博客NLP博文一览(定期更新版)】
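As a closing illustration of the Entity Profile fusion described in the InfoXtract excerpt above: mentions linked by alias support are merged into a single entity-centric profile. This is a hedged toy sketch under invented assumptions (the alias table, the triple format, and the profile layout are all illustrative), not the actual InfoXtract implementation.

```python
# Toy sketch of document-internal fusion into an Entity Profile (EP):
# mentions of "IBM" and "International Business Machine" collapse into
# one profile via alias association. Alias table is an illustrative stub.

ALIASES = {"IBM": "International Business Machine",
           "Bill Clinton": "William Clinton"}

def canonical(name):
    """Map an alias to its canonical form, or return the name unchanged."""
    return ALIASES.get(name, name)

def build_profiles(mentions):
    """mentions: list of (entity_name, relation, value) triples.
    Returns {canonical_name: {relation: set_of_values}}."""
    profiles = {}
    for name, rel, val in mentions:
        ep = profiles.setdefault(canonical(name), {})
        ep.setdefault(rel, set()).add(val)
    return profiles

mentions = [("IBM", "type", "organization"),
            ("International Business Machine", "employs", "John Smith")]
eps = build_profiles(mentions)
assert list(eps) == ["International Business Machine"]  # one consolidated EP
assert eps["International Business Machine"]["type"] == {"organization"}
```

The payoff is exactly what the excerpt claims: NE gives only local mentions, while the consolidated EP collects all relations attached to one entity under one key.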