科学网


Tag: extraction


Related blog posts

Question answering of the past and present
liwei999 2016-10-5 06:06
1. A previous life

The traditional question answering (QA) system is an application of Artificial Intelligence (AI). It is usually confined to a very narrow and specialized domain, basically made up of a hand-crafted knowledge base with a natural language interface. As the domain is narrow, the vocabulary is very limited, and its pragmatic ambiguity can be kept effectively under control. Questions are highly predictable, or close to a closed set, and the rules for the corresponding answers are fairly straightforward. Well-known projects in the 1960s include LUNAR, a QA system specializing in answering questions about the geological analysis of the lunar samples collected from the Apollo Moon landings. SHRDLU is another famous QA expert system in AI history; it simulated the operation of a robot in a toy blocks world. The robot could answer questions about the geometric state of the toys and follow language instructions for its operation. These early AI explorations seemed promising, revealing a fairy-tale world of scientific fantasy and greatly stimulating our curiosity and imagination. Nevertheless, in essence, these were just toy systems confined to the laboratory, without much practical value. As the focus of artificial intelligence grew narrower and narrower (although some expert systems reached a practical level, the majority of AI work based on common sense and knowledge reasoning could not get beyond the lab), the corresponding QA systems failed to render meaningful results. Some conversational systems (chatterbots) developed along the way became popular online toys for children (I remember at one time, when my daughter was young, she was very fond of surfing the Internet to find various chatbots, sometimes deliberately asking tricky questions for fun). Recent years have seen a revival of this tradition by industry giants, with some flavor of it in Siri, and a much greater emphasis in Microsoft's Xiaoice (Little Ice).

2. Rebirth

Industrial open-domain QA systems are another story; they came into existence with the Internet boom and the popularity of search engines. Specifically, open-domain QA was born in 1999, when TREC-8 (the Eighth Text Retrieval Conference) decided to add a natural language QA track to its competition, funded by the US Department of Defense's DARPA program and administered by the United States National Institute of Standards and Technology (NIST), thus giving birth to this emerging QA community. Its opening remarks calling for participation in the competition were very impressive, to this effect: users have questions, and they need answers. Search engines claim to be doing information retrieval, yet what they return is not an answer to the question but links to thousands of possibly related files. Answers may or may not be in the returned documents; in any case, people are compelled to read the documents in order to find the answers. A QA system in our vision is to solve this key problem of information need. For QA, the input is a natural language question and the output is the answer; it is that simple.

It is worth introducing some background on academia as well as industry at the time open-domain QA was born. From the academic point of view, artificial intelligence in the traditional sense was no longer popular, replaced by large-scale corpus-based machine learning and statistical research.
Linguistic rules still play a role in the field of natural language, but only as a complement to mainstream machine learning. The so-called intelligent knowledge systems based purely on knowledge or common sense reasoning have largely been put on hold by academic scholars (except for a few, such as Dr. Douglas Lenat with his Cyc). In the academic community before the birth of open-domain question answering, there was a very important development, namely the birth and rise of a new area called Information Extraction (IE), again a child of DARPA. Traditional natural language understanding (NLU) faces the entire ocean of language, trying to analyze each sentence in search of a complete semantic representation of all its parts. IE is different: it is task-driven, aiming only at the defined information targets and leaving the rest aside. For example, the IE template for a conference may be defined to fill in a set of pre-defined slots about the event, very much like filling in the blanks in a student's reading comprehension test. The idea of task-driven semantics for IE shortens the distance between language technology and practicality, allowing researchers to focus on optimizing a system for the defined tasks rather than trying to swallow the language monster in one bite.

By 1999, the IE community had held seven annual competitions (MUC-7: the Seventh Message Understanding Conference); the tasks of the area, the approaches, and their limitations at the time were all relatively clear. The most mature part of information extraction technology was the so-called Named Entity (NE) tagging, including the identification of names of people, locations and organizations as well as the tagging of time, percentages, etc. The state-of-the-art systems, whether using machine learning or hand-crafted rules, reached combined precision-recall scores (F-measures) of 90+%, close to the quality of human performance. This first-of-its-kind technological advance in a young field turned out to play a key role in the new generation of open-domain QA.

In industry, by 1999, search engines had grown rapidly with the popularity of the Internet, and search algorithms based on keyword matching and page ranking were quite mature. Unless there was a methodological revolution, the keyword search field seemed to have almost reached its limit, and there was an increasing call for going beyond basic keyword search. Users were dissatisfied with search results in the form of links; they needed more granular results, at least paragraphs (snippets) instead of URLs, preferably in the form of direct short answers to the questions in mind. Although the direct answer was still a dream waiting for the open-domain QA era, full-text search had more and more frequently adopted paragraph retrieval instead of simple document URLs as a common practice in the industry, and search results changed from simple links to web pages to snippets with the keywords highlighted.

In such a favorable environment in industry and academia, open-domain question answering came onto the stage of history. NIST organized its first competition, requiring participating QA systems to provide the exact answer to each question, with a short answer of no more than 50 bytes in length and a long answer of no more than 250 bytes. Here are some sample questions from the first QA track:

Who was the first American in space?
Where is the Taj Mahal?
In what year did Joe DiMaggio compile his 56-game hitting streak?
3. Short-lived prosperity

What were the results and significance of this first open-domain QA competition? It should be said that the results were impressive, a milestone in QA history. The best systems (including ours) achieved a correct rate of more than 60%; that is, for every three questions, the system could search the given corpus and return two correct answers. This was a very encouraging result for a first attempt at an open-domain task. At the time of the dot-com heyday, the IT industry was eager to move this latest research into information products and revolutionize search. There were a lot of interesting stories after that (see my related blog post in Chinese: the road to entrepreneurship), eventually leading to the historic AI event of IBM Watson beating humans in Jeopardy.

The timing, and everything prepared by then by the organizers, the search industry and academia, all contributed to the QA systems' seemingly miraculous results. NIST emphasized well-formed natural language questions as the appropriate input (i.e. English questions, see above), rather than the traditional simple and short keyword queries. These questions tend to be long, well suited to leveraging paragraph search. For the competition's sake, the organizers ensured that each question asked indeed had an answer in the given corpus. As a result, the text archive contained statements similar to the designed questions, which increased the odds of sentence matching in paragraph retrieval (Watson's later practice shows that, from a big data perspective, similar statements containing answers are bound to appear in the text as long as a question is naturally long). Imagine if there were only one or two keywords: it would be extremely difficult to identify relevant paragraphs and statements that contain answers. Of course, finding the relevant paragraphs or statements is not sufficient for this task, but it effectively narrows the scope of the search, creating a good condition for pinpointing the short answers required.

At this point the relatively mature technology of named entity tagging from the information extraction community kicked in. In order to achieve objectivity and consistency in administering the QA competition, the organizers deliberately selected only questions that are relatively simple and straightforward: questions about names, time or location (so-called factoid questions). This practice aligns closely with the named entity task, making the first step into open-domain QA a smooth one and returning very encouraging results as well as a shining prospect to the world. For example, for the question In what year did Joe DiMaggio compile his 56-game hitting streak?, the paragraph or sentence search could easily find text statements similar to the following: Joe DiMaggio's 56 game hitting streak was between May 15, 1941 and July 16. An NE system tags 1941 as time with no problem, and the asking point for time in parsing the wh-phrase in what year is also not difficult to decode. Therefore, an exact answer to the exact question seems magically retrieved from the sea of documents to satisfy the user, like a needle found in a haystack. Following roughly the same approach, equipped with gigantic computing power for parallel processing of big data, IBM Watson beat humans in the Jeopardy live show 11 years later, in front of a nationwide TV audience, stimulating the entire nation's imagination with awe for this technological advance.
From the QA research perspective, IBM's victory in the show was in fact an expected, natural outcome, more of an engineering scale-up showcase than a research breakthrough, as the basic approach of snippet + NE + asking-point had long been proven. A retrospect shows that adequate QA systems for factoid questions invariably combine a solid Named Entity module with a question parser for identifying asking points. As long as there is IE-indexed big data behind them, with information redundancy as its nature, factoid QA is a very tractable task.

4. State of the art

The year 1999 witnessed the academic community's initial success in the first open-domain QA track as a new frontier of the retrieval world. We also benefited from that event as a winner, soon securing a venture capital injection of $10 million from Wall Street. It was an exciting time, shortly after AskJeeves' initial success in presenting a natural language interface online (though they did not have the QA technology for handling a huge archive and retrieving exact answers automatically; instead they used human editors behind the scenes to update the answer database). A number of QA start-ups were funded, and we were all expecting to create a new era in the information revolution. Unfortunately, the good times did not last: the Internet bubble soon burst, and the IT industry fell into the abyss of depression. Investors tightened their purse strings, the QA heat soon declined to freezing point, and QA almost disappeared from the industry (except in giants' labs such as IBM Watson; in our case, we shifted from QA to mining online brand intelligence for enterprise clients). No one in the mainstream believed in this technology anymore. Compared with traditional keyword indexing and searching, open-domain QA is not as robust, and it had yet to scale up to really big data to show its power. The focus of the search industry shifted from depth back to breadth, concentrating on indexing coverage, including the so-called deep web. While the development of QA systems nearly went extinct in industry, this emerging field stayed deeply rooted in the academic community and developed into an important branch, with increasing natural language research from universities and research labs. IBM later solved the scale-up challenge, as a precursor of the current big data architectural breakthroughs.

At the same time, scholars began to summarize the various types of questions that challenge QA. A common classification is based on identifying the type of question by its asking point. Many of us still remember our high school language classes, where the teacher stressed the 6 WHs for reading comprehension: who / what / when / where / how / why (Who did what, when, where, how and why?). Once the answers to these questions are clear, the central story of an article is in hand. As a simulation of human reading comprehension, a QA system is designed to answer these key WH questions as well. It is worth noting that these WH questions are of different difficulty levels, depending on the type of asking point (one major goal of question parsing is to identify the key need in a question, what we call asking point identification, usually based on parsing wh-phrases and other question clues). Asking points for which an entity is an appropriate answer, such as who / when / where, make for relatively easy questions (i.e. factoid questions).
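To make the snippet + NE + asking-point recipe concrete, here is a minimal, hypothetical Python sketch. The asking-point table, the toy NE tagger and the pre-ranked snippet list are stand-ins for the real modules described above, not an actual system.

import re

ASKING_POINTS = {            # wh-phrase -> expected answer (NE) type
    "who": "PERSON",
    "when": "TIME",
    "what year": "TIME",
    "where": "LOCATION",
}

def asking_point(question):
    q = question.lower()
    # check longer wh-phrases first so "what year" is preferred over shorter ones
    for phrase in sorted(ASKING_POINTS, key=len, reverse=True):
        if phrase in q:
            return ASKING_POINTS[phrase]
    return "UNKNOWN"

def tag_entities(snippet):
    # toy NE tagger: 4-digit years as TIME, capitalized bigrams as PERSON
    entities = [(m.group(), "TIME") for m in re.finditer(r"\b(?:19|20)\d{2}\b", snippet)]
    entities += [(m.group(), "PERSON") for m in re.finditer(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", snippet)]
    return entities

def answer(question, snippets):
    target = asking_point(question)
    for snippet in snippets:                       # snippets assumed pre-ranked by keyword overlap
        for text, ne_type in tag_entities(snippet):
            if ne_type == target:
                return text
    return None

snippets = ["Joe DiMaggio's 56 game hitting streak was between May 15, 1941 and July 16."]
print(answer("In what year did Joe DiMaggio compile his 56-game hitting streak?", snippets))   # 1941

With IE-indexed big data behind it, the same skeleton scales: the snippet list becomes a paragraph index and the toy tagger becomes a full NE module.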
Another type of question is not simply answerable by an entity, such as what-is / how / why; there is consensus that answering such questions is a much more challenging task than factoid questions. A brief introduction to these three types of tough questions and their solutions is presented below as a showcase of the current state of the art, to conclude this overview of the QA journey.

What/who is X? This is the so-called definition question, such as What is iPad II? or Who is Bill Clinton? Such a question is typically very short: after the wh-word and the stop word is are stripped in question parsing, what is left is just a name or a term as input to the QA system. Such an input is detrimental to a traditional keyword retrieval system, as it ends up with too many hits, from which the system can only pick the documents with the highest keyword density or page rank as returns. From the QA perspective, the minimal requirement for answering this question is a definition statement of the form X is a .... Since any entity or object stands in multiple relationships with other entities and is involved in various events described in the corpus, a better answer to a definition question involves a summary of the entity with links to its key associated relations and events, giving a profile of the entity. Such technology exists and, in fact, has been partly deployed today. It is called the knowledge graph, supported by underlying information extraction and fusion. The state-of-the-art solution for this type of question is best illustrated by Google's deployment of its knowledge graph in handling short queries for movie stars or other VIPs.

The next challenge is the how-question, asking for a way of solving a problem or doing something, e.g. How can we increase bone density? How to treat a heart attack? This type of question calls for a summary of all types of solutions, such as medicines, experts, procedures, or recipes. A simple phrase is usually not a good answer and is bound to miss the variety of possible solutions needed to satisfy the information need of the users (often product designers, scientists or patent lawyers), who are typically in the stage of prior-art research and literature review for a conceived solution in mind. We developed such a system, based on deep parsing and information extraction, to answer open-domain how-questions comprehensively in the product called Illumin8, deployed by Elsevier for quite some years. (Powerful as it is, unfortunately, it did not end up as a commercial success from a revenue perspective.)

The third difficult question type is why. People ask why-questions to find the cause or motive behind a phenomenon, whether an event or an opinion. For example: why do people like or dislike our product Xyz? There might be thousands of different reasons behind a sentiment or opinion. Some reasons are explicitly expressed (I love the new iPhone 7 because of its greatly enhanced camera) and many more appear in implicit expressions (just replaced my iPhone, it sucks in battery life). An adequate QA system should be equipped with the ability to mine the corpus and to summarize and rank the key reasons for the user. In the last 5 years, we have developed a customer insight product that can answer the why-questions behind public opinions and sentiments on any topic by mining the entire social media space.
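As a toy illustration of the definition-question strategy just described (an is-a statement plus the entity's key relations drawn from a knowledge-graph-style profile), here is a hypothetical sketch; the hard-coded profile stands in for a real IE-built knowledge graph.

PROFILES = {
    "bill clinton": {
        "isa": "former President of the United States",
        "relations": {"spouse": "Hillary Clinton", "born": "1946", "party": "Democratic"},
    },
}

def answer_definition(question):
    # strip "what is" / "who is" and punctuation to recover the target entity
    entity = (question.lower()
              .replace("what is", "").replace("who is", "")
              .strip(" ?").strip())
    profile = PROFILES.get(entity)
    if profile is None:
        return "No profile found."
    facts = "; ".join(f"{k}: {v}" for k, v in profile["relations"].items())
    return f"{entity.title()} is a {profile['isa']} ({facts})."

print(answer_definition("Who is Bill Clinton?"))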
Since I came to Silicon Valley 9 years ago, I have been lucky, and proud, to have had the chance to design and develop QA systems for answering these widely acknowledged challenging questions. Two products, answering open-domain how-questions and why-questions in addition to deep sentiment analysis, have been developed and deployed to global customers. Our deep parsing and IE platform is also equipped with the capability to construct a deep knowledge graph to help answer definition questions, but unlike Google with its huge platform for search needs, we have not yet identified a commercial opportunity to deploy that capability in the market.

This piece of writing first appeared in 2011 in my personal blog, with only limited revisions since. Thanks to Google Translate at https://translate.google.com/ for providing a quick basis, which was post-edited by myself; the original is at 【问答系统的前生今世】.

http://en.wikipedia.org/wiki/Question_answering
The Anti-Eliza Effect, New Concept in AI
【问答系统的前生今世】 (in Chinese)
Knowledge graph and open-domain QA (1) (in Chinese)
Knowledge graph and how-question QA (2) (in Chinese)
【Ask Jeeves and its million-dollar idea for human interface】 (in Chinese)
From the archives: SVO as General Events
liwei999 2015-10-25 04:50
Traditional Information Extraction (IE) following the MUC standards mainly targeted domain-specific Scenario Templates (ST) for text mining of event scenarios. These have several limitations, including being too narrow and too challenging to be practically applicable. Therefore, in the course of our early IE research via SBIRs, in addition to proposing and implementing entity-centric relationship extraction in Entity Profiles, which are roughly equivalent to the core part of the current knowledge graph concept, as Principal Investigator I proposed a new event type called General Events (GE) based on SVO (Subject-Verb-Object) parsing. This idea also proved to be significant progress in terms of connecting event-centric profiles and entity-centric profiles into a better knowledge graph. Some of the early ideas and design reported 15 years ago are re-presented below as a token of historical review of my career path in this fascinating area.

January 2000

On General Events

GE aims at extracting open-ended general events (level 3 IE) to support intelligent querying for information like who did what (to whom), when and where. The feasibility study of this task has been conducted and is reported below. GE is a more ambitious goal than CE, but it would have an even higher payoff. As the extracted general events are open-ended, they are expected to be able to support various applications ranging from question answering and data visualization to automatic summarization, hence maximally satisfying the information need of the human agent. GE extraction is a sophisticated IE task which requires the most NLP support. In addition to all the existing modules required for CE (tokenization, POS tagging, shallow parsing, co-referencing), two more NLP modules are identified as crucial for the successful development of GE: semantic full parsing and pragmatic filtering. A semantic full parser aims at mapping surface structures into logical forms (semantic structures) via sentence-level analysis. This is different from shallow parsing, which only identifies isolated linguistic structures like NP, VG, etc. As the GE template in the Textract design is essentially a type of semantic representation, the requirement of a semantic full parser is fairly obvious. Traditional full parsers are based on powerful grammar formalisms like CFG, but they are often too inefficient for industrial application. The Textract full parser is based on the less powerful but more efficient FST formalism. Besides the proven advantage of efficiency, the use of FST for full parsing can be naturally built on top of the shallow parsing (also FST-based) results. This makes the grammar development much less complicated. As part of the feasibility study, a small grammar has been tested for semantic parsing based on the shallow parsing results. The following are two of the sentences used for this experiment:

John Smith from California founded Xyz Company in 1970. This company was later acquired by Microsoft in 1985.

These sentences go through the processing of all the existing modules, from tokenization, POS tagging, NE, shallow parsing and CO to CE. The processing results are represented in the Textract internal data structure. The relevant information is shown below in text format.
0|NP :NePerson\where_from=1\affiliation=3
1|PP :NeLocation
2|VG :ACTIVE\PAST
3|NP :organization_w\head=0
4|PP :NeTime
5|PERIOD
6 :organization_w\mother_org=8\coreference=3,0
7|VG :PASSIVE/PAST
8| :NeOrganization
9 :NeTime
10|PERIOD

As seen, each unit has a unique identifier assigned by the system for supporting the CE and CO links. NE has provided NE features like NePerson, NeOrganization, NeTime, etc. Shallow parsing provides basic linguistic structures like NP, PP and VG (plus features like ACTIVE voice, PASSIVE voice, PAST tense). CE has identified relationships like affiliation, where_from, head and mother_org. The existing non-deterministic CO has linked the anaphor This company with its potential antecedents Xyz Company (correct) and John Smith (wrong). The above results are the assumed input to the semantic parser. For this experiment, the following two rules, one for active sentences and one for passive, were formulated to support the conceived semantic parsing:

0|NP * 1|VG(ACTIVE) 2|NP 3|PP(TIME) == 1:argument1=0\argument2=2\time=3
0|NP * 1|VG(PASSIVE) 2|PP(by) 3|PP(TIME) == 1:argument1=2\argument2=0\time=3

After compiling the above rules into a transducer, the FST runner serves as a parser in applying this transducer to the sample text and outputs the semantic structures as shown below:

PREDICATE : found
ARGUMENT1 : John Smith
ARGUMENT2 : Xyz Company
TIME : in 1970

PREDICATE : acquire
ARGUMENT1 : Microsoft
ARGUMENT2 : ‘This company’
TIME : in 1985

After merging based on the co-reference links, the second template is updated to:

PREDICATE : acquire
ARGUMENT1 : Microsoft
ARGUMENT2 : {Xyz Company, John Smith}
TIME : in 1985

This style of semantic representation shares the same structure with the defined GE template. This experiment demonstrates one important point: there is no fundamental difference between CE grammars (CE3 in particular) and the grammar required for semantic parsing. The rules are strikingly similar; they share the same goal of mapping surface structures into some form of semantic representation (a CE template or a GE template). They both rely on the same infrastructure (tools/mechanisms, basic NLP/IE support, etc.) which Cymfony has built over the years. The difference lies in the content of the rules, not in the form of the rules. Because CE relationships are pre-defined, CE rules are often keyword based. For example, the keywords checked in CE rules for the relationship affiliation include work for, join, hired by, etc. On the other hand, the rules for semantic parsing to support GE are more abstract. Due to the open-endedness of the GE design, the grammar only needs to check the category and sub-category (information like intransitive, transitive, di-transitive, etc.) of a verb, instead of the word literal, in order to build the semantic structures. Popular on-line lexical resources like the Oxford Advanced Learners' Dictionary and the Longman Dictionary of Contemporary English provide very reliable sub-categorization information for English verbs. The similarity between grammars for CE extraction and grammars for GE extraction is an important factor in terms of the feasibility of the proposed GE task. Since developing semantic parsing rules for GE (whether hand-coded, machine-learned or hybrid) does not require anything beyond what is needed for developing CE rules, there is a considerable degree of transferability from the CE feasibility to GE feasibility.
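For readers who prefer running code to rule notation, the following hypothetical Python sketch re-implements the gist of the active-voice rule above: match shallow-parse chunks against an NP ... VG(ACTIVE) NP PP(TIME) pattern and emit a predicate-argument (GE-style) template. The chunk annotations are hand-simplified, and the matcher skips intervening chunks anywhere, a looser reading of '*' than in the original FST rule.

CHUNKS = [  # (label, features, text) for "John Smith from California founded Xyz Company in 1970."
    ("NP", {"NePerson"}, "John Smith"),
    ("PP", {"NeLocation"}, "from California"),
    ("VG", {"ACTIVE", "PAST"}, "founded"),
    ("NP", {"organization"}, "Xyz Company"),
    ("PP", {"NeTime"}, "in 1970"),
]

ACTIVE_PATTERN = [("NP", None), ("VG", "ACTIVE"), ("NP", None), ("PP", "NeTime")]

def match_ge(chunks, pattern):
    slots, i = [], 0
    for label, feature in pattern:
        while i < len(chunks):
            c_label, c_feats, c_text = chunks[i]
            i += 1
            if c_label == label and (feature is None or feature in c_feats):
                slots.append(c_text)
                break
        else:                        # pattern element not found: no match
            return {}
    subj, pred, obj, time = slots
    return {"PREDICATE": pred, "ARGUMENT1": subj, "ARGUMENT2": obj, "TIME": time}

print(match_ge(CHUNKS, ACTIVE_PATTERN))
# {'PREDICATE': 'founded', 'ARGUMENT1': 'John Smith', 'ARGUMENT2': 'Xyz Company', 'TIME': 'in 1970'}

The coreference-based merge shown above would then be one more pass, replacing an anaphoric ARGUMENT2 with its antecedents.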
The same techniques proven to be effective for CE hand-coded rules and for automatic CE rule learning are expected to be equally applicable to the GE prototype development. In particular, rule induction via structure-based adaptive training promises a solution to sentence-level semantic analysis for both CE3 learning and GE learning. Note the similarity of these rules to the sample rule in the CE3 grammar shown previously.

【Pinned: Index of Wei Li's NLP blog posts on ScienceNet (updated periodically)】
A Forerunner of the Knowledge Graph: Starting from Julian Hill
liwei999 2015-10-24 01:36
【Liwei's note】 The research I led 15 years ago was the forerunner of the now wildly popular "knowledge graph". The term knowledge graph was in fact popularized by the search industry, which borrowed the research community's concepts of Entity Relationships and Entity Profiles. From a 2007 blog recollection: 《朝华午拾:信息抽取笔记 — Julian Hill Entity Profile 的形成》 (memoir notes on information extraction: the making of the Julian Hill Entity Profile).

In my research career there have been some interesting episodes, and the story of Julian Hill is one of them; it became the classic example our research group used to promote the concept and functionality of the so-called Entity Profile.

That was seven or eight years ago. I had been in the field of information extraction for less than two years, leading two IE research projects at the same time, and spending my days pondering the architecture and conceptual system of information extraction. At that time, the research community had defined several sub-tasks for IE, forming a preliminary hierarchy of information objects: Named Entity tagging (NE), aimed at identifying names of people, places and organizations; Template Element (TE), the members of an event template, consisting of an entity name plus its descriptors, for example entity name: Microsoft, category: company, descriptor: software giant; Template Relation (TR), reflecting relationships between entities, such as employment or the relationship between an organization and its location; Coreference (CO), used to link entity mentions that refer to the same object, such as the proper name Microsoft and referring expressions like this company or it; and Scenario Template (ST), which defines the information of a domain-specific event. For example, a management succession event requires automatically extracting and fusing the relevant information to fill in a template with slots such as the company involved, the incoming person, the outgoing person, the time and the reason.

Within this hierarchy of IE units I vaguely felt something was missing. NE and TR are mention-level extraction: what is extracted are language fragments within the scope of individual sentences. ST, by contrast, calls for profile-level extraction: the extracted fragments must go through a process of information fusion before filling the event template, because the information for one template often spans sentences and must be pulled together with the help of coreference (CO). The incoming executive may be described in one sentence while the outgoing one is mentioned later: Abc Inc., an industry leader in speech processing, is reported today to appoint John Smith as its new CEO. Mr. Smith was MBA from Harvard, with 14 years of executive experiences in the IT industry. He will replace Peter Lee who was founder of the company but was recently hospitalized for his heart attack. The fused template looks roughly like this:

Event: management succession
Company involved:
Position: CEO
Incoming person:
Outgoing person:
Time: today
Reason: was recently hospitalized for his heart attack

Although named entities and entity relations, and even the events themselves, all carry information related to entities, the entity itself had no corresponding profile-style template in the extraction hierarchy. Interestingly, when filling an event template, the scheme used Template Elements (TE) to fill the various roles participating in the event; yet TE is an information-poor unit that contains only the entity's descriptors, with relationship information and event participation outside its scope. Why couldn't the representation of an entity also stand on its own as a profile-style, information-rich template? Evidently the standard setters valued events more than entities, and everything revolved around filling complex event templates.

I felt that the individually tagged entity names and the individually extracted relations and events were like loose beads: only when strung together could they serve as the information unit representing an entity of the physical world. I gradually came to appreciate the importance of this yet-unnamed concept, much as a personal resume matters to a human resources manager. Such automatically generated resumes might open a breakthrough for IE applications, because fusing entity information is more tractable than the event fusion defined and emphasized by the research community, and the importance of entities is self-evident in many application areas. In counter-terrorism, for instance, blacklisted terrorist suspects form a watch-list, and any information about them is a potential intelligence source.

Along this line of thinking, information should have two centers, entity-centric and event-centric, and hence two paths of extraction and fusion leading to two kinds of profile-style information units; beyond the event profile, the entity profile has its own distinct value. Caring about the physical world, we are interested not only in the events that happen but often just as much in the people and organizations involved. Sometimes we observe the world through events, analyzing causes and effects and seeking countermeasures; at other times we start from entities, tracking the relationships among them and their influence on events.

I discussed this idea with my boss. She appreciated it and strongly agreed that we should take entity fusion as the breakthrough point: event fusion had drawn much criticism in the community for being too difficult, with the precision of the best systems always hovering around 50%-60% (at least 70%-80% is generally considered necessary for practical application), and because of the domain dependence of events such systems port poorly. In short, a profile-level event task like ST was unrealistic in those days.

My design was eventually expressed as a conceptual hierarchy of information units. At that point the concepts were basically clear. We initially named this information unit the Correlated Entity (CE), and finally settled on Entity Profile (EP). The next step was a feasibility study and a design blueprint, and eventually to lead my research group to build a prototype system. In drafting the blueprint, the first problem I ran into was the lack of a compelling example; I knew that a good example beats many abstract arguments. But the power of the entity profile concept depends on richness of information, and at that time the research was still a preliminary validation on single documents; cross-document extraction and fusion had not yet begun. Our data were news reports, and most news stories are short: each report typically mentions a given person or organization only in passing, leaving too few facts to extract and fuse into a respectable entity profile with links to other entities, and thus no way to show the concept's power.

I browsed through news articles one by one, hoping for a miracle. The effort paid off: in the New York Times archive I finally found an obituary-style story reporting the death of Julian Hill, the inventor of nylon. Because it reported the death of a famous person, the whole article reviewed his life, with ample material to extract into a resume-like profile. I treasured the find. Excerpts from the report:

Julian Hill, a research chemist whose accidental discovery of a tough, taffylike compound revolutionized everyday life after it proved its worth in warfare and courtship, died on Sunday in Hockessin, Del. He was 91. Hill died at the Cokesbury Village retirement community, where he had lived in recent years with his wife of 62 years, Polly. ………… Julian Werner Hill was born in St. Louis, graduated from Washington University there in 1924 and earned a doctorate in organic chemistry from the Massachusetts Institute of Technology in 1928. His wife recalled on Wednesday that his doctoral studies were delayed a year because he was stricken with scarlet fever.
Hill played the violin and was an accomplished squash player and figure-skater until his early 40s, when an attack of polio weakened one leg, his wife said. Before his retirement from Du Pont in 1964, Hill supervised the company's program of aid to universities for research in physics and chemistry. …………

Using this article as the example, I then worked out on paper, step by step, the concrete process of entity information extraction and fusion, and envisioned its applications. The extraction steps are illustrated as follows:

(1) Named entity tagging

PERSON, a research chemist whose accidental discovery of a tough, taffylike compound revolutionized everyday life after it proved its worth in warfare and courtship, died on DATE in TOWN. He was 91. PERSON died at the [Cokesbury Village] TOWN retirement community, where he had lived in recent years with his wife of DURATION, PERSON. ……… PERSON was born in TOWN, graduated from SCHOOL there in YEAR and earned a doctorate in organic chemistry from the SCHOOL in YEAR. ………

(2) Relationship extraction:

Position: research chemist ← Julian Hill
Age: 91 ← Hill
Birthplace: St. Louis ← Julian Werner Hill
Employer: Du Pont Co. ← Julian Werner
Alma mater: Washington University ← Julian
Alma mater: Massachusetts Institute of Technology ← Julian
Spouse: Polly ← Julian Hill
Specialty: an accomplished squash player and figure-skater ← Julian

(3) Event extraction:

Death event
Who: Julian Werner Hill
When: Sunday
Where: Hockessin, Del.

Invention event
Who: Julian Hill
What: nylon
When: 1930s

Graduation event
School: Washington University
When: 1924
Where: St. Louis
………

(4) Entity profile:

【Julian Hill Profile】
Name: Julian Werner Hill
Age: 91
Gender: MALE
Position: research chemist
Employer: [DuPont Co.]
Education: [Washington University]; [Massachusetts Institute of Technology]
Spouse: [Polly]
Children: [Louisa Spottswood]; [Joseph]; [Jefferson]
Specialty: an accomplished squash player and figure-skater
Related events: death event; invention event; graduation event; ………

The applications designed for the entity profile at the time were as follows. Chase-the-rabbit browsing: while reading a text, pointing the mouse at any entity pops up its entity profile, and browsing among related entity profiles is effortless, just a click on the link of the target entity. Temporal visualization of the information (information visualization). Geographic visualization of the information (information visualization).

Research in this direction went smoothly, which gave me the capital to lobby the sponsors of our government projects. Our government program manager was the head of the information extraction group at a government laboratory, sharp and capable, good at steering the overall direction of the program. We got along well and collaborated happily for eight years. Every year, when a team of experts sent by her superiors came to review the laboratory's work, it was her most nervous moment, and I always did my best to help her present the highlights of our research as the report card of the projects she funded. When she drew up the long-term plans and funding priorities for research programs she often consulted me, and several of her topics came from descriptions I provided. She told me her worry: they had funded a batch of projects in this area from the very beginning of information extraction research and had poured in a great deal of money, yet little progress had been made toward real applications. If this continued, she would have a hard time answering to her own superiors. She urgently needed a breakthrough to prove that the field was not just talk on paper but applied research that could solve real problems. I took the opportunity to pitch my entity profile concept and practice to her, explaining that the entity profile was exactly what she had been looking for: an extraction target of real application value that was neither too hard nor too easy. She later called it intermediate IE, distinguishing it from the shallow IE problems already solved in principle (such as named entity tagging) and from deep IE, the event profile, which lacked application feasibility. Later, after seeing the prototype system we built, she embraced the concept and told me: there is no doubt that you guys earn the credit of pushing this significant area along.

With her push, the government's Rome laboratory finally put Cross-document Entity Profile extraction out for open bidding as a major information extraction project, and the project description quoted my research reports at length. Unfortunately, the bidding rules for this multi-million-dollar project required that the bidder hold a clearance, and the company I worked for had many foreign employees like me, so it did not qualify as a prime bidder, though it could share the pie as a sub-contractor to a prime. For a while several prime bidders came to us, hoping to form an exclusive alliance to improve their odds. After discussing it with us, the CEO decided not to bet on any single one and insisted on signing no exclusive alliance (the company that had long collaborated with us was displeased, but they had no extraction technology of their own and needed us, so they went along), treating all prime bidders equally and agreeing to work with whoever won, providing the entity profile extraction engine to support the winner's development of entity profile applications. In the end that long-time partner did win the bid; their manager immediately recruited staff and, full of ambition and counting on our backing, got ready for a big push.

Looking back on this research journey, my intuition and my sense for application prospects held up rather well. Systems developed along the entity profile line have since gone into use, and one can foresee more and more similar applications in different domains, benefiting more and more information seekers. As expected, the first system to apply the concept successfully was in the human resources field, automatically collecting and fusing personal resumes; see http://www.zoominfo.com (a side note: ZoomInfo caught my attention as soon as it opened. I first searched for myself, Wei Li, and found my resume ranked first among the many people with the same name, probably thanks to my title of senior manager. When I checked again recently my information was still there, but it had nearly sunk to the bottom, presumably because the system's coverage has grown so much in the last couple of years and there are too many big names, next to whom I am a small fish.)

In the medical domain there are two successful applications. One, based on the PubMed literature, uses text mining to automatically extract researchers' affiliations, collaborators, contact information, specialties, research topics and so on, and consolidates them into an intelligence service for finding experts and authorities (http://www.authoratory.com/index.htm). All of this is public information, but apart from senior insiders who can properly judge an expert, ordinary users trying to compare the authority of experts easily get lost in the ocean of literature. By classifying and fusing the intelligence, this system lets the search for experts rest on rich data and reduces guesswork, a very meaningful application. The other successful use of the profile concept in medicine is MedStory (http://www.medstory.com/). They are not limited to entities in the narrow sense; instead they gather and link information for all kinds of concepts in the medical domain, including diseases, drugs, treatments, experts and so on, and provide profile-style browsing with a clean, attractive user interface that leaves a strong impression. Their success also benefits from the richness and completeness of the concept hierarchies in the health domain, where ready-made terminology bases are available. Microsoft has decided to acquire MedStory.
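A toy sketch of the fusion step walked through above (from mention-level relations plus coreference to an entity profile); the tuples are hand-simplified from the Julian Hill example and the code is illustrative only.

from collections import defaultdict

# (entity alias, slot, value) tuples as a mention-level extractor might emit them
RELATIONS = [
    ("Julian Hill", "position", "research chemist"),
    ("Hill", "age", "91"),
    ("Julian Werner Hill", "birthplace", "St. Louis"),
    ("Julian", "alma_mater", "Washington University"),
    ("Julian", "alma_mater", "Massachusetts Institute of Technology"),
    ("Julian Hill", "spouse", "Polly"),
]

# coreference: every alias maps to one canonical entity id
COREF = {alias: "Julian Werner Hill"
         for alias in ("Julian Hill", "Hill", "Julian Werner Hill", "Julian")}

def build_profiles(relations, coref):
    profiles = defaultdict(lambda: defaultdict(set))
    for alias, slot, value in relations:
        canonical = coref.get(alias, alias)
        profiles[canonical][slot].add(value)      # fusion = de-duplicated union per slot
    return profiles

for name, profile in build_profiles(RELATIONS, COREF).items():
    print(name)
    for slot, values in profile.items():
        print(f"  {slot}: {'; '.join(sorted(values))}")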
Recently a multinational company seeking cooperation proposed investing heavily in an entity profile system for their own domain, which they call a "360 view of an entity". Their senior executives are very bullish on this application's prospects, and some of their data analysts call it a dream product.

Many concepts of great application value rest on very simple principles. Hyperlink analysis, one of the cornerstones of the Google search engine, is based on treating links between web pages as citations of papers: the more a page is cited, the more popular and authoritative it is, and the higher it should rank in search. The entity profile concept is the same: in defining an automatic information unit, it simply mimics the structure of the personal resume and the company profile that already exist in real life. Simple as the concept is, its power is great. As one who took part in and pushed forward the theory and practice of this concept, I feel gratified and proud to have watched it become so quickly and widely accepted and applied. Julian Hill, the inventor of nylon, thus became my classic case for explaining the concept. (Written August 5, 2007)

【Postscript】 On the "knowledge graph" research of the pre-knowledge-graph era, my 17 SBIR final reports contain very detailed discussions. About 10 years ago I planned to write an Introduction to Information Extraction based on those reports and had even lined up a publisher, but things took another turn and the book was never actually written and published. Working in industry, the wish to sit quietly and write a book seems a bit of a luxury. Riding today's knowledge graph craze, as an early mover I plan to dig some material out of the historical archives and post it on the blog, as notes retrieved from the past:

Pre-Knowledge-Graph Profile Extraction Research via SBIR (1) 2015-10-24
Pre-Knowledge-Graph Profile Extraction Research via SBIR (2) 2015-10-24
Early arguments for a hybrid model for NLP and IE 2015-10-25
SVO as General Events 2015-10-25

【Related】
【立委科普:信息抽取】 (Liwei's popular science: Information Extraction)
朝华午拾:在美国写基金申请的酸甜苦辣 (The ups and downs of writing grant proposals in America) - ScienceNet
《朝华午拾:创业之路》 (The road to entrepreneurship)
《泥沙龙笔记:搜索和知识图谱的话题》 (Salon notes: on search and knowledge graphs) 2015-10-23
前知识图谱钩沉: 信息抽取引擎的架构 (Pre-knowledge-graph archives: the architecture of an information extraction engine)
【Pinned: Index of Wei Li's NLP blog posts on ScienceNet (updated periodically)】
10/22/2015, SUPP Building, Texas State University-San Marcos
nyp 2015-10-23 02:57
Salon notes (泥沙龙笔记): parsing vs. classification and IE
liwei999 2015-7-7 02:28
Lei: @wei The next step is to show where the results of deep parsing beat traditional IE (information extraction); there needs to be a solid proof.

Wei: I have answered this before, though perhaps without spelling it out. As I said in a blog post, for simple factoid questions, the ones that can be answered with an entity (when, where, who and the like), deep parsing alone can already support a good question answering system. For complex questions (e.g. how and why) or domain-specific questions, it is better to do IE first, store the candidate answers in a database, and then support QA from there. In principle, the result of parsing is a domain-independent syntactic tree: the advantage is that one representation handles everything, the disadvantage is that it is not easy to consolidate. IE is different. IE is a pre-defined template at the pragmatic level (in essence a pragmatic tree), which goes deeper than the syntactic tree, and the mapping from syntactic tree to pragmatic tree already creates better conditions for pragmatics-oriented consolidation. In general the two are complementary: IE-backed question answering or search is the most precise, but it cannot handle information that was not extracted in advance; deep parsing needs no prior definition and can serve as the backoff for IE. Ultimately the question is whether the step from syntactic tree to pragmatic tree is done at offline indexing time or at on-the-fly retrieval time. The former leverages IE, the latter relies directly on SVO. The former requires linguists and domain data experts, who can anticipate the various ways the same extraction target may be expressed; the latter faces (power) users directly, with no experts needed. Put another way: if a class of questions has many forms of expression and the redundancy is insufficient, then IE is required. If redundancy is high and the basic expression forms are easy to nail down (SVO and the like; for product launch information, for instance, an SVO driven by the key verbs release and launch: Company "launch/release" Product), then deep parsing is more than enough. From an IR perspective the backoff model looks like this:

Input: query (or question)
Output:
1. if the IE store has an answer, then return the answer with confidence
2. else if an SVO can be matched (by a power user) to retrieve answers, then return the answer with confidence
3. else provide keyword search results

Lei: @wei Yes, that is factbase material. But, say, for document classification, how much better is this than the traditional methods? Traditional document classification takes the content words, builds a huge matrix, and does cosine similarity.

Wei: Lei, what you are describing is classification, not parsing. Classification is coarse-grained work; it has wide application but cannot answer questions. Parsing is fine-grained analysis and can answer questions precisely.

Lei: @wei Right. What I am asking is how to use parsing results for classification. First, can classification be done better this way; second, how should it be done to be better and more useful?

Wei: For classification, keyword ngrams are still the most robust, no question about it, especially when the objects being classified are articles or paragraphs. But for short messages on Twitter or Weibo, keyword classification largely fails (see 【立委科普:基于关键词的舆情分类系统面临挑战】, on the challenges facing keyword-based sentiment classification) and parsing is needed.

Lei: For example, through parsing, can one obtain the main content words (the key words of an article)?

Wei: Using parsing for document classification is, on the one hand, a bit like killing a chicken with a butcher's knife; on the other hand, in the end it still has to lean on statistics (see 手工规则系统的软肋在文章分类, on document classification being the weak spot of hand-crafted rule systems), because the result of parsing is a tree, not a forest. To see the whole forest you at least have to do statistics over the trees, and the simplest statistic is a majority vote.

Lei: If, after (deep) parsing an article, you take the main content words, the other words can be thrown away. Then classification could be more precise, and you could also classify the classes, multi-level classification.

Wei: Just to get the content words you do not need parsing; filtering out stop words gets you most of the way there. The benefit of parsing is that it yields syntactic relations, but using syntactic relations properly in document classification is not easy. Document classification is by nature an ngram density problem. Whoever coined the name ngram put gram right in it: it was not named nword, ntoken or nterm but ngram, arguably because when n words (usually just two or three, n=2 or n=3; any more and you hit sparse data with little statistical value) form an ordered string rather than a bag (set), the most important factor of grammar, word order, is implicitly brought in. It is in effect a coarse simulation of grammar. Ngrams come directly from the corpus, which guarantees that they include open-ended strings and are not limited to the dictionary. Seen this way, ngrams are grammar fragmented and minimized; conditional probabilities then string the fragmented ngrams back together. Break it apart first, then re-assemble, to simulate the effect of parsing. Almost all the achievements of statistical NLP are based on this principle of simulating grammar.

Lei: In traditional classification, bag of words is still the mainstream, right?

Wei: The words in bag of words are not just single words; they include free combinations of ngrams, so it too is a model of parsing.

Bai: IE and deep parsing are actually connected. Using a few statistically significant hops, a parse tree can be stitched into an IE tree. As long as ngrams are allowed to fire across gaps, parsing can be simulated indirectly: a constituent's "energy" determines how far a sliding window it can reach into.

Lei: The power of parsing should be that it reads out the main concepts of an article and the relations between those concepts. When submitting a paper one is asked to supply some keywords, but the keywords an author supplies are limited.

Wei: With the principle of ngrams clarified, let us come back to what deep parsing can do for classification. Deep parsing without the help of statistics is actually not suited to document classification, because parsing sees the trees and not the forest, while classification demands the forest view. Still, one can imagine that doing statistics over the output of deep parsing could, in theory, yield better classification. For instance, once SVO is parsed it can at least be treated as a higher-order ngram: SV is a bigram, VO is a bigram, SVO is a trigram.

Lei: @wei Yes. After good deep parsing, the statistics over content words and relations become purely descriptive: counting, say, how many S's are the same, how many O's are the same, and computing the conceptual distribution of those S's and O's. All of this assumes that deep parsing has produced good analyses. Grab the main things, cut the secondary ones, and you get better classification?

Wei: But such an ngram is no longer based on a simple linear sequence; it can cover long-distance dependencies, so in theory it should better reflect the semantic category of an article. Using such ngrams together with statistical methods should, in theory, give better classification. In practice, though, it can backfire, because parsing is never as robust as keywords; if keywords are also used as a fallback, the system grows more complex, and doing the smoothing well is not an easy matter.

Bai: Conversely, if deep parsing in turn counts on the knowledge level for disambiguation, you get a "livelock". How not to fall into livelock is a test of wisdom.

Wei: Professor Bai, it is perfectly natural for deep parsing to count on the knowledge level for disambiguation. It is natural because syntax cannot predict the endlessly varied pragmatic scenarios. If syntax could predict pragmatics, it would lose its generalization, which would actually be worse. The power of linguistics rests precisely on the fact that syntactic structures (sentence patterns) are finite and to a large extent universal, shared across languages: the argument structure of all the world's languages is SVOC (subject-verb-object-complement) and its variants. Pragmatic trees, on the other hand, differ by domain, by product, even by user, and cannot be unified at all. Render unto language what is language's, and unto pragmatics what is pragmatics'. That division of labor is the most economical and scientific.
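The three-step backoff model given earlier in this exchange can be sketched as follows; the IE store, SVO index and document list are toy stand-ins, and the entries are hard-coded for illustration only.

IE_STORE = {("Apple", "launch", "product"): "Apple launched the iPhone 7 on Sep 7, 2016."}
SVO_INDEX = [("Apple", "release", "iOS 10")]          # (subject, verb, object) triples from deep parsing
DOCUMENTS = ["Apple also previewed new features for the Apple Watch."]

def answer(query_svo, keywords):
    # 1. pre-extracted IE store (offline indexing): most precise
    if query_svo in IE_STORE:
        return "IE", IE_STORE[query_svo]
    # 2. on-the-fly SVO matching over deep-parsed text
    subj, verb, _ = query_svo
    for s, v, o in SVO_INDEX:
        if s == subj and v == verb:
            return "SVO", f"{s} {v} {o}"
    # 3. last resort: plain keyword retrieval
    hits = [d for d in DOCUMENTS if any(k.lower() in d.lower() for k in keywords)]
    return "keywords", hits

print(answer(("Apple", "release", "product"), ["Apple", "release"]))
# ('SVO', 'Apple release iOS 10')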
To sum up: deep parsing is best at supporting information extraction, because extraction requires seeing each tree clearly. Using parsing for classification is tricky, because classification requires seeing the forest, and seeing the forest is what statistics is good at; ngrams/keywords usually do the job, and robustly. In theory parsing can lift classification to the next level; in practice it is hard to say, and it remains a research topic rather than a mature, practical technique.

【Related posts】
手工规则系统的软肋在文章分类 (Document classification is the weak spot of hand-crafted rule systems)
【立委科普:基于关键词的舆情分类系统面临挑战】 (Keyword-based sentiment classification systems face challenges)
【Pinned: Index of Wei Li's NLP blog posts on ScienceNet (updated periodically)】
A CRF-Based System for Recognizing Chemical Entities
xiaohai2008 2013-10-11 15:25
@INPROCEEDINGS{XAZZ+13,
  author    = {Xu, Shuo and An, Xin and Zhu, Lijun and Zhang, Yunliang and Zhang, Haodong},
  title     = {A {CRF}-Based System for Recognizing Chemical Entities in Biomedical Literature},
  booktitle = {Proceedings of the 4th BioCreative Challenge Evaluation Workshop},
  year      = {2013},
  volume    = {2},
  pages     = {152--157},
  abstract  = {One of tasks of the BioCreative IV competition, the CHEMDNER task, includes two subtasks: CEM and CDI. We participated in the later subtask, and developed a CEM recognition system on the basis of CRF approach and some open-source NLP toolkits. Our system processing pipeline consists of three major components: pre-processing (sentence detection, tokenization), recognition (CRF-based approach), and post-processing (rule-based approach and format conversion).},
  keywords  = {Chemical Compound and Drug Name Recognition \sep Conditional Random Fields (CRFs) \sep Entity Mention \sep Rule-based Approach},
}

Full text: CEM.pdf
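As a generic illustration of the kind of token features typically fed to a CRF for chemical entity mention recognition, here is a small Python sketch; these features are illustrative only and are not the feature set used in the cited system.

import re

def token_features(tokens, i):
    w = tokens[i]
    return {
        "word.lower": w.lower(),
        "prefix3": w[:3],
        "suffix3": w[-3:],
        "has_digit": any(c.isdigit() for c in w),
        "has_hyphen": "-" in w,
        # crude word shape: lowercase -> x, uppercase -> X, digits -> d, others kept
        "shape": re.sub(r"\d", "d", re.sub(r"[A-Z]", "X", re.sub(r"[a-z]", "x", w))),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

tokens = "Aspirin inhibits cyclooxygenase-1 .".split()
features = [token_features(tokens, i) for i in range(len(tokens))]
print(features[2])    # features for the chemical-like token "cyclooxygenase-1"

Feature dictionaries of this form, paired with BIO labels, are what a CRF sequence tagger would be trained on; a rule-based post-processing pass, as the abstract describes, can then repair boundary errors and convert the output format.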
[Repost] Extracting text from Wikipedia articles - Summary
timy 2010-9-9 11:28
Dear CORPORA Mailing list Members, I would like to thank very much everybody who replied to my question and to post a summary of the responses I received. Best regards, Irina Temnikova PhD Student in Computational Linguistics Editorial Assistant of the Journal of Natural Language Engineering Research Group in Computational Linguistics Research Institute of Information and Language Processing University of Wolverhampton, UK ============= Question: Dear CORPORA mailing list members, Do any of you know of any tool for extracting text specifically from Wikipedia articles, besides those for extracting text from HTML pages? I only need the title and the text, without any of the formal elements present in every Wikipedia article (such as From Wikipedia, the free encyclopedia, This article is about .., , the list of languages,Main article:,Categories:) and without Contents, See also, References, Notes and External links. Can you give me any suggestions? ============= Answers: ------------- Roman Klinger wrote: Users can add arbitrary HTML code. If you want to interpret that (to get the plain text) you could use the text based web browser lynx, which can dump to a text file. That works quite well, but is a HTML extraction method you excluded. Another approach a colleague pointed me to and told me to work -- I did not try it by myself -- is described here: http://evanjones.ca/software/wikipedia2text.html ------------- Goran Rakic wrote: Some time ago I have used a Python script by Antonio Fuschetto. This script can work on a Wikipedia database dump (XML file) from http://download.wikimedia.org and knows how to process individual articles, strip all Wiki tags and provide a plain text output. Google shows me that the script was available from http://medialab.di.unipi.it/wiki/Wikipedia_Extractor but this site currently seems to be down. You can download a slightly modified version from http://alas.matf.bg.ac.rs/~mr04069/WikiExtractor.py To run the script against the downloaded database dump, pass it as a standard input using shell redirection. Change the process_page() method to fit your need. ------------- Srinivas Gokavarapu wrote: This is a tool for extracting information from wikipedia. http://wikipedia-miner.sourceforge.net/ Have a look at it. ------------- Nitin Madnani wrote: I recently did this. I downloaded the freebase wikipedia extraction (google that) and used BeautifulSoup to extract just the text part. It was a couple of days' work at the most. ------------- Trevor Jenkins wrote: Your requirements are rather specific. But as (the English language) WikiPedia uses a consistent markup scheme with those formal elements named (either by explicit id or implicit class names in attributes) you might be able to strip out just the textual content by running a XSLT stylesheet processor over the download files and delete the junk you don't want. ------------- Eros Zanchetta wrote: I recommend Antonio Fuschetto's WikiExtractor too: I used it recently to create a corpus of texts extracted from Wikipedia and it worked like a charm. 
As Goran Rakic said the site is currently down, but you can download the original script from here (this is a temporary link, don't count on this to stay online long): http://sslmit.unibo.it/~eros/WikiExtractor.py.gz You'll need to download the XML dump from the wikipedia repository and run the script on it, something like this: bunzip2 -c enwiki-latest-pages-articles.xml.bz2 | WikiExtractor.py ------------- Hartmut Oldenbrger wrote: besides regarding singular, possibly costly tools, you should consider more strongly enduring, free open source means: R is a very high script programming language, apt for text manipulation, and processing, mathematical, and statistical analysis, rich graphical output, controllable by several graphical user interfaces. Meanwhile R is a lingua franca, available for almost all computer systems at http://cran.at.r-project.org/ It has multi-language documentation, a journal, mailing-lists, user conferences for the worldwide experts, and users. For your purpose within the ~2500 packages for application, there is http://cran.at.r-project.org/web/packages/tm/vignettes/tm.pdf giving the entrance for text mining, and corpus analysis. After installing R, and 'tm', it will give you a basis for your scientific development(s). For me, it is an amazing enlightening experience since 1996/7 for developing, and work. ------------- Cyrus Shaoul wrote: I am not sure if this helps you, but I have extracted the text for the English version of Wikipedia (in April of this year) using the WikiExtractor http://medialab.di.unipi.it/wiki/Wikipedia_Extractor toolset and created a 990 million word corpus that is freely available on my web site: http://www.psych.ualberta.ca/~westburylab/downloads/westburylab.wikicorp.download.html ------------- Matthias Richter wrote: My answer is perl and the xmldump, but there is a degree of nastyness in the details and it depends on what one expects from the quality of the results. There is also Wikiprep from Evgeny Gabrilovich floating around that didn't exist then and that I didn't look at yet (but they are using it at Leipzig now for producing WP2010 corpora), And finally http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html may be a worthwhile source for readning and tinkering. ------------- Anas Tawileh wrote: Check this tool out (WP2TXT: Wikipedia to Text Converter): http://wp2txt.rubyforge.org/ ------------- Gemma Boleda wrote: we've developed a Java-based parser to do just this. It is available for download at: http://www.lsi.upc.edu/~nlp/wikicorpus ------------- Raphael Rubino wrote: I have modified this one http://www.u.arizona.edu/~jjberry/nowiki-xml2txt.py to output trectext format which is xml, maybe the original one is good for you. ------------- Sven Hartrumpf wrote: We did this with the additional requirement that headings and paragraph starts are still marked up. We tested our tool only on the German Wikipedia (dewiki-20100603-pages-articles.xml); sample results can be seen here: http://ki220.fernuni-hagen.de/wikipedia/de/20100603/ ------------ Constantin Orasan wrote: There is a version of Palinka which has a plugin to import Wikipedia articles. That version is available directly from the author. http://clg.wlv.ac.uk/projects/PALinkA/ ------------ Torsten Zesch wrote: the Java Wikipedia Library (JWPL) contains a parser for the MediaWiki syntax that allows you (among other things) to access the plain-text of a Wikipedia article: http://www.ukp.tu-darmstadt.de/software/jwpl/ ===================
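For completeness, here is a minimal, standard-library-only Python sketch of the same idea: pulling page titles and raw text out of a MediaWiki XML dump and stripping a few common markup patterns. It is only a toy; for real work the dedicated tools listed above (e.g. WikiExtractor) are the better choice. The filename is the decompressed form of the dump mentioned above.

import re
import xml.etree.ElementTree as ET

def strip_markup(wikitext):
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                    # simple (non-nested) templates
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)     # [[target|label]] -> label
    text = re.sub(r"'{2,}", "", text)                                 # bold / italic quote marks
    text = re.sub(r"<[^>]+>", "", text)                               # residual HTML/ref tags
    return text

def iter_articles(dump_path):
    title = None
    for _, elem in ET.iterparse(dump_path, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]          # drop the XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "text" and elem.text:
            yield title, strip_markup(elem.text)
            elem.clear()                           # keep memory use bounded on large dumps

for title, text in iter_articles("enwiki-latest-pages-articles.xml"):
    print(title, "|", text[:80].replace("\n", " "))
    break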
[Repost] Dimension reduction (from Wikipedia)
machinelearn 2010-7-4 10:56
In statistics, dimension reduction is the process of reducing the number of random variables under consideration, and can be divided into feature selection and feature extraction. Feature selection approaches try to find a subset of the original variables (also called features or attributes).Three strategies are filter (e.g. information gain) ,wrapper (e.g. search guided by the accuracy) and embedded approaches. Feature extraction transforms the data in the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in principal component analysis (PCA), but many nonlinear dimensionality reduction techniques also exist. The main linear technique for dimensionality reduction, principal component analysis, performs a linear mapping of the data to a lower dimensional space in such a way, that the variance of the data in the low-dimensional representation is maximized. In practice, the correlation matrix of the data is constructed and the eigenvectors on this matrix are computed. The eigenvectors that correspond to the largest eigenvalues (the principal components) can now be used to reconstruct a large fraction of the variance of the original data. Moreover, the first few eigenvectors can often be interpreted in terms of the large-scale physical behavior of the system. The original space (with dimension of the number of points) has been reduced (with data loss, but hopefully retaining the most important variance) to the space spanned by a few eigenvectors. Principal component analysis can be employed in a nonlinear way by means of the kernel trick. The resulting techniques are capable of constructing nonlinear mappings that maximize the variance in the data. The resulting technique is entitled Kernel PCA. Other prominent nonlinear techniques include manifold learning techniques such as locally linear embedding (LLE), Hessian LLE, Laplacian eigenmaps, and LTSA. These techniques construct a low-dimensional data representation using a cost function that retains local properties of the data, and can be viewed as defining a graph-based kernel for Kernel PCA. More recently, techniques have been proposed that, instead of defining a fixed kernel, try to learn the kernel using semidefinite programming. The most prominent example of such a technique is maximum variance unfolding (MVU). The central idea of MVU is to exactly preserve all pairwise distances between nearest neighbors (in the inner product space), while maximizing the distances between points that are not nearest neighbors. An alternative approach to neighborhood preservation is through the minimization of a cost function that measures differences between distances in the input and output spaces. Important examples of such techniques include classical multidimensional scaling (which is identical to PCA), Isomap (which uses geodesic distances in the data space), diffusion maps (which uses diffusion distances in the data space), t-SNE (which minimizes the divergence between distributions over pairs of points), and curvilinear component analysis. A different approach to nonlinear dimensionality reduction is through the use of autoencoders, a special kind of feed-forward neural networks with a bottle-neck hidden layer. The training of deep encoders is typically performed using a greedy layer-wise pre-training (e.g., using a stack of Restricted Boltzmann machines) that is followed by a fine-tuning stage based on backpropagation.
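The linear case described above (eigendecomposition of the data's covariance/correlation matrix, followed by projection onto the leading eigenvectors) fits in a few lines of NumPy; the toy data below is made up for illustration.

import numpy as np

def pca(X, n_components):
    Xc = X - X.mean(axis=0)                    # center each feature
    cov = np.cov(Xc, rowvar=False)             # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return Xc @ top                            # low-dimensional representation

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])   # correlated toy data
print(pca(X, n_components=1).shape)            # (200, 1): most variance kept in one dimension

Kernel PCA, LLE and the other nonlinear methods mentioned above replace the covariance matrix with a kernel or neighborhood-based matrix but follow the same eigendecomposition pattern.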
SBIR Grants
liwei999 2010-2-19 05:52
Final Reports for Small Business Innovation Research (SBIR) Projects Principal Investigator (PI) or Co-Principal Investigator (Co-PI): Dr. Wei Li Submitted (to appear) Li, W. and R. Srihari 2005. Intermediate-Level Event Extraction for Temporal and Spatial Analysis and Visualization, Phase 2 Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. (forthcoming) Li, W. and R. Srihari 2005. Automated Verb Sense Identification, Phase 1 Final Technical Report, Navy SBIR. (forthcoming) Li, W., R. Srihari and C. Niu 2006. Automated Verb Sense Identification, Phase 2 Final Technical Report, Navy SBIR. (forthcoming) Published (1) Srihari, R., W. Li and C. Niu 2005. An Intelligence Discovery Portal Based on Cross-document Extraction and Text Mining, Phase 1 Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. Abstract: KEYWORDS: Information Extraction Corpus-Level IE Information Fusion Text Mining Knowledge Discovery This effort addresses two major enhancements to current information extraction (IE) technology. The first concerns the development of higher levels of IE, at the corpus level, and finally across corpora including structured data. The second objective concerns text mining from a rich IE repository assimilated from multiple corpora. IE is only a means to an end, which is the discovery of hidden trends and patterns that are implicit in large volumes of text. This effort was based on Cymfony’s document-level IE system, InfoXtract. A fusion component was developed to assimilate information extracted across multiple documents. Text mining experiments were conducted on the resulting rich knowledge repository. Finally, the design of an intelligence discovery portal (IDP) prototype led to the consolidation of the developed technology into an intuitive web-based application. (2) Srihari, R. and W. Li 2005. Fusion of Information from Diverse, Textual Media: A Case Restoration Approach, Phase 1 Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. Abstract KEYWORDS: Information Fusion Case Restoration Information Extraction Multimedia Information Extraction Named Entity Tagging Event detection Relationship detection Fusion of information in diverse text media containing case-insensitive information was explored, based on a core Information Extraction (IE) system capable of processing case-sensitive text. The core engine was adapted to handle diverse, case-insensitive information e.g. e-mail, chat, newsgroups, broadcast transcripts, HUMINT intelligence documents. The fusion system assimilates information extracted from text with that in structured knowledge bases. Traditional IE for case-insensitive text is limited to the named entity (NE) stage, e.g. retraining an NE tagger on case insensitive text. We explored case restoration, whereby statistical models and rules are used to recover case-sensitive form. Thus, the core IE system was not modified. IE systems are fully exploited if their output is consolidated with knowledge in relational databases. This calls for natural language processing and reasoning, including entity co-reference and event co-reference. Consolidation permits database change detection and alerts. Feedback to the core IE system exploits information in knowledge bases thereby fusing information. Information analysts and decision makers will benefit since this effort extends the utility of IE. 
A viable solution has many applications, including business intelligence systems that use large knowledge-bases of companies, products, people and projects. Updating these knowledge-bases from chat, newsgroups and multimedia broadcast transcripts would be enabled. A specific commercial application focused on brand perception and monitoring will benefit. Knowledge management systems would benefit from the ability to assimilate information in web documents and newsgroups with structured information. Military applications stem from the fact that analysts need to consolidate an abundance of information. (3) Srihari, R. and W. Li 2004. An Automated Domain Porting Toolkit for Information Extraction, Phase 1 Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. Abstract KEYWORDS: Domain Porting Customization Information Extraction Natural Language Processing Unsupervised Machine Learning Bootstrapping Example-based Rule Writing Information extraction (IE) systems provide critical assistance to both intelligence analysts as well as business analysts in the process of assimilating information from a multitude of electronic documents. This task seeks to investigate the feasibility of developing an automated, domain porting toolkit that could be used to customize generic information extraction for a specific domain of interest. Customization is required at various levels of IE in order to fully exploit domain characteristics. These levels include (i) lexicon customization, (ii) acquiring specialized glossaries of names of people, organizations, locations, etc. which assist in the process of tagging key named entities (NE), (iii) detecting important relationships between key entities, e.g. the headquarters of a given organization, and (iv) detecting significant events, e.g., transportation of chemicals. Due to the superior performance derived through customization, many have chosen to develop handcrafted IE systems which can be applied only to a single domain, such as the insurance and medical industries. The approach taken here is based on the existence of a robust, domain-independent IE engine that can continue to be enhanced, independent of any specific domain. This effort describes an attempt to develop a complete platform for automated customization of such a core engine to a specific domain or corpus. Such an approach facilitates both rapid domain porting as well as cost savings since linguists are not required. Developing such a domain porting toolkit calls for basic research in unsupervised machine learning techniques. Our structure-based training approach, which leverages output from the core IE engine, is already comparable in performance to the best unsupervised learning methods, and is expected to significantly exceed it with further research. A bootstrap approach using initial seeds is described. It is necessary to learn both lists of words (lexicons) as well as rule templates so that all levels of IE are customized. The final deliverables include: (i) new algorithms for structure-based bootstrap learning, (ii) a prototype model for domain porting of both lexicons and rule templates demonstrated on an intelligence domain and (iii) the design of a complete automated domain porting toolkit, including user-friendly graphical interfaces. (4) Li, W. R. Srihari. 2003. Flexible Information Extraction Learning Algorithm, Phase 2 Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. 
Use the citation below to notify others of the report’s availability: Li, W. R. Srihari. 2003. Flexible Information Extraction Learning Algorithm, Phase 2 Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. Abstract: This research seeks to develop a working prototype for both shallow-level and intermediate-level information extraction (IE) by effectively employing machine learning techniques. Machine learning algorithms represent a cutting edge’ approach to tasks involving natural language processing (NLP) and information extraction. Currently, IE systems using machine learning have been restricted to low-level, shallow extraction tasks such as named entity tagging and simple event extraction. In terms of methodology, the majority of systems rely mainly on supervised learning that requires a sizable manually annotated corpus. To address these problems, a hybrid IE prototype InfoXtract’ that combines both machine learning and rule-based approaches has been developed. This prototype is capable of extracting named entities, correlated entity relationships and general events. To showcase the use of IE in applications, an IE-based Question Answering prototype has been implemented. In addition to the use of the proven techniques of supervised learning, unsupervised learning research has been explored in lexical knowledge acquisition in support of IE. A machine learning toolkit/platform that supports both supervised and unsupervised learning has also been developed. These achievements have laid a solid foundation for enhancing IE capabilities and for deploying the developed technology in IE applications. Full Text Availability: View Full Text (pdf) File: /UL/b292996.pdf Size: 793.8 KB Accession Number: ADB292996 Citation Status: Active Citation Classification: Unclassified Field(s) Group(s): 050800 - PSYCHOLOGY Corporate Author: CYMFONY NET INC WILLIAMSVILLE NY Unclassified Title: Flexible Information Extraction Learning Algorithm Title Classification: Unclassified Descriptive Note: Final technical rept. May 2000-Apr 2003 Personal Author(s): Li, Wei Srihari, Rohini K. Report Date: Jul 2003 Media Count: 65 Page(s) Cost: $9.60 Contract Number: F30602-00-C-0037 Report Number(s): AFRL-IF-RS-TR-2003-157 XC-AFRL-IF-RS Project Number: 3005 Task Number: 91 Monitor Acronym: AFRL-IF-RS XC Monitor Series: TR-2003-157 AFRL-IF-RS Report Classification: Unclassified Supplementary Note: The original document contains color images. Distribution Statement: Distribution authorized to U.S. Gov’t. agencies only; Specific Authority; Jul 2003. Other requests shall be referred to Air Force Research Lab., Attn: IFEA, Rome, NY 13441-4114., Availability: This document is not available from DTIC in microfiche. Descriptors: *ALGORITHMS, *LEARNING MACHINES, *INFORMATION RETRIEVAL, METHODOLOGY, PROTOTYPES, PLATFORMS, LOW LEVEL, EXTRACTION, KNOWLEDGE BASED SYSTEMS, FOUNDATIONS(STRUCTURES), NATURAL LANGUAGE, INFORMATION PROCESSING, TOOL KITS, LEXICOGRAPHY, SHALLOW DEPTH Identifiers: SBIR(SMALL BUSINESS INNOVATION RESEARCH), SBIR REPORTS, PE65502F, WUAFRL30059109 Abstract Classification: Unclassified Distribution Limitation(s): 03 - U.S. GOVT. ONLY; DOD CONTROLLED 26 - NOT AVAILABLE IN MICROFICHE Source Serial: F Source Code: 432812 Document Location: DTIC Creation Date: 19 Nov 2003 (5) Li, W. R. Srihari. 2001. 
(5) Li, W. and R. Srihari. 2001. Intermediate-Level Event Extraction for Temporal and Spatial Analysis and Visualization, Phase 1 Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

Abstract: This task seeks to develop a system for intermediate-level event extraction with emphasis on time/location normalization. Currently, only keyword-based, shallow Information Extraction (IE), mainly the identification of named entities and simple events, is available for deployment. There is an acute demand for concept-based, intermediate-level extraction of events and their associated time and location information. The results of this effort can be leveraged in applications such as information visualization, fusion, and data mining. Cymfony Inc. has assessed the technical feasibility of concept-based, intermediate-level general event extraction (C-GE) by effectively employing a flexible approach integrating statistical models, handcrafted grammars and procedures encapsulating specialized logic. This intermediate-level IE system C-GE aims at 'translating' the language-specific, keyword-based representation of IE results into a type of 'interlingua' based mainly on concepts. More precisely, the key verb of a shallow event will be mapped onto a concept cluster (e.g., kill/murder/shoot to death → {kill, put to death}), and the time and location of the event will be normalized (e.g., last Saturday → 1999-01-30). Extracting concept-based, general events from free text requires the application of 'cutting edge' Natural Language Processing (NLP) technology. The approach Cymfony proposes consists of a blend of machine learning techniques, cascaded application of handcrafted Finite State Transducer (FST) rules, and procedural modules. This flexible approach of combining different techniques and methods exploits the best of different paradigms depending on the specific task being handled. The work implemented by Cymfony under this Small Business Innovation Research (SBIR) Phase I grant includes the C-GE system architecture, the detailed task definition of C-GE, the implementation of a prototype time normalization module, the implementation of an alias association procedure inside NE (Named Entity tagging), an enhanced machine learning tool for tasks like co-reference (CO), the development of semantic parsing grammars for shallow events, and research on lexical clustering and sense tagging. These accomplishments make the feasibility study reliable and provide a solid foundation for future system development.

Distribution Statement: Distribution authorized to U.S. Gov't. agencies only; Specific Authority.
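As an illustration of the two normalization steps named in the abstract above, mapping a key verb onto a concept cluster and resolving a relative time expression against a reference date, here is a minimal, hypothetical Python sketch. The cluster table, weekday logic and function names are assumptions made for exposition; they are not the C-GE time normalization module itself.

# Hypothetical sketch: verb-to-concept-cluster mapping and relative time normalization.
from datetime import date, timedelta

# Toy concept-cluster table: surface verbs -> a shared concept label (assumed).
CONCEPT_CLUSTERS = {
    "kill": "KILL", "murder": "KILL", "shoot to death": "KILL",
    "buy": "ACQUIRE", "purchase": "ACQUIRE", "acquire": "ACQUIRE",
}

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def normalize_verb(verb):
    """Map a surface verb onto its concept label; fall back to the verb itself."""
    return CONCEPT_CLUSTERS.get(verb.lower(), verb.upper())

def normalize_time(expr, reference):
    """Resolve expressions like 'last Saturday' to an ISO date,
    relative to a known reference date (e.g., the article's dateline)."""
    words = expr.lower().split()
    if len(words) == 2 and words[0] == "last" and words[1] in WEEKDAYS:
        target = WEEKDAYS.index(words[1])
        # Distance back to the most recent such weekday, strictly in the past.
        delta = (reference.weekday() - target) % 7 or 7
        return (reference - timedelta(days=delta)).isoformat()
    return expr  # fall back to the surface form if no rule applies

# With an article dated 1999-02-04, 'last Saturday' resolves to 1999-01-30,
# matching the example given in the abstract.
print(normalize_verb("murder"))                           # KILL
print(normalize_time("last Saturday", date(1999, 2, 4)))  # 1999-01-30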
(6) Li, W. and R. Srihari. 2000. A Domain Independent Event Extraction Toolkit, Phase 2 Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

Abstract: The proliferation of electronic documents has created an information overload. It has become necessary to develop automated tools for sorting through the mass of unrestricted text and for extracting relevant information. Currently, advanced Natural Language Processing (NLP) tools are not widely available for such commercial applications across domains. A commercially viable solution to this problem could have a tremendous impact on automating information access for human agents. Cymfony has assessed the technical feasibility of domain-independent information extraction. Regardless of a document's domain, it has proven possible to extract key information objects accurately (with over 90% precision and recall). These objects include data items such as dates, locations, addresses, individual or organization names, etc. More significantly, multiple relationships and general events involving the identified items can also be identified fairly reliably (over 80% precision and recall for pre-defined relationships and over 70% for general events). Multiple relationships between entities are reflected in the task definition of the Correlated Entity (CE) template. A CE template presents profile information about an entity; for example, the CE template for a person entity represents a miniature resume of that person. A general event (GE) template is an argument structure centered around a verb notion with its arguments (logical subject, logical object, etc.) plus the associated information of time (or frequency) and location. The implementation of such a domain-independent information extractor requires the application of robust natural language processing tools. The approach used for this effort consists of a unique blend of statistical processing and finite state transducer (FST) based grammar analysis. Statistical approaches were used for their demonstrated robustness and domain portability. For text processing, Cymfony has also developed FST technology to model natural language grammar at multiple levels. As the basis for the grammar modeling, Cymfony has implemented an FST Toolkit. Cymfony has achieved the two proposed design objectives: (i) domain portability: the information extraction system Cymfony has developed can be applied to different domains with minimal changes; (ii) user-friendliness: with an intuitive user interface, non-expert users can access the extracted information easily. The work implemented by Cymfony under this SBIR Phase II grant on domain-independent information extraction includes the conceptual design of the system architecture, named entity tagging, FST grammar modeling, and an integrated end-to-end prototype system involving all the modules. These accomplishments provide a solid foundation for further commercial development and exploitation.
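For concreteness, here is a minimal sketch of what the CE and GE templates described above might look like as plain data structures. The field names follow the abstract's wording (a "miniature resume" for an entity; a verb-centered argument structure with time and location), but they are illustrative assumptions, not the actual Textract/InfoXtract template definitions, and the example values are invented.

# Hypothetical sketch of CE and GE templates as data structures.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CorrelatedEntity:
    """CE template: a profile ('miniature resume') of one entity (assumed fields)."""
    name: str
    affiliation: Optional[str] = None
    position: Optional[str] = None
    address: Optional[str] = None
    email: Optional[str] = None
    aliases: List[str] = field(default_factory=list)

@dataclass
class GeneralEvent:
    """GE template: a verb-centered argument structure plus time and location."""
    predicate: str                           # concept label of the key verb
    logical_subject: Optional[str] = None    # who
    logical_object: Optional[str] = None     # (to) whom / what
    time: Optional[str] = None               # normalized, e.g. '1999-01-30'
    location: Optional[str] = None

# Example instances built from a hypothetical, invented sentence.
person = CorrelatedEntity(name="Julia Hill", affiliation="Cymfony Inc.",
                          position="analyst", aliases=["J. Hill"])
event = GeneralEvent(predicate="ACQUIRE", logical_subject="Cymfony Inc.",
                     logical_object="an NLP toolkit", time="1999-01-30",
                     location="Buffalo, NY")
print(person)
print(event)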
(7) Li, W. and R. Srihari. 2000. Flexible Information Extraction Learning Algorithm, Phase 1 Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

Abstract: The proliferation of electronic documents has created an information overload. It has become necessary to develop automated tools for quickly extracting key information from the mass of documents. The popularity of news clipping services, advanced WWW search engines and software agents illustrates the need for such tools. Currently, only shallow Information Extraction (IE), mainly the identification of named entities, is available for commercial applications. There is an acute demand for high-level extraction of relationships and events in situations where massive amounts of natural language text are involved. Cymfony Inc. has assessed the technical feasibility of domain-independent, high-level information extraction by effectively employing machine learning techniques. A hierarchical, modular system named Textract has been proposed for high-level as well as low-level IE tasks. In the Textract architecture, high-level IE consists of two main modules/tasks: Correlated Entity (CE) extraction and General Event (GE) extraction. CE extracts pre-defined multiple relationships between entities, such as "affiliation", "position", "address", and "email" for a person entity. GE is designed to extract open-ended key events to provide information on who did what (to whom), when and where. These relationships and events may be contained within sentence boundaries or span a discourse of running text. The application of Textract/IE to the task of natural language Question Answering (QA) has also been explored. A unique, hybrid approach has been employed, combining the best of both paradigms, namely machine learning and rule-based systems using finite state transducers (FST). The latter has the advantage of being intuitive as well as efficient; however, knowledge acquisition is laborious and incomplete, especially when domain portability is involved. Machine learning techniques address this deficiency through automated learning from an annotated corpus. Statistical techniques such as Hidden Markov Models, maximum entropy and rule induction have been examined for possible use in different tasks and module development of this effort. The work implemented by Cymfony under this SBIR Phase I grant includes the IE system architecture, task definitions, machine learning toolkit development, FST grammar modeling for relationship/event extraction, implementation of the Textract/CE prototype, implementation of the Textract/QA prototype based on IE results, and a detailed simulation involving all the modules up to general event extraction. These accomplishments make the feasibility study reliable and provide a solid foundation for future system development.

~~~~~~~~~~~

The subject technical reports have been added to the Technical Reports database at DTIC and are now available to others. The citations above provide the information requesters need to access these reports at http://www.dtic.mil/.
Publication Record (立委发表记录)
liwei999 2010-2-19 05:44
Publications

Srihari, R., W. Li and X. Li. 2006. Question Answering Supported by Multiple Levels of Information Extraction. Book chapter in T. Strzalkowski and S. Harabagiu (eds.), Advances in Open-Domain Question Answering. Springer, 2006, ISBN 1-4020-4744-4.
Srihari, R., W. Li, C. Niu and T. Cornell. 2006. InfoXtract: A Customizable Intermediate Level Information Extraction Engine. Journal of Natural Language Engineering, 12(4), 1-37, 2006.
Niu, C., W. Li, R. Srihari, and H. Li. 2005. Word Independent Context Pair Classification Model for Word Sense Disambiguation. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005).
Srihari, R., W. Li, L. Crist and C. Niu. 2005. Intelligence Discovery Portal based on Corpus Level Information Extraction. In Proceedings of the 2005 International Conference on Intelligence Analysis Methods and Tools.
Niu, C., W. Li and R. Srihari. 2004. Weakly Supervised Learning for Cross-document Person Name Disambiguation Supported by Information Extraction. In Proceedings of ACL 2004.
Niu, C., W. Li, R. Srihari, H. Li and L. Crist. 2004. Context Clustering for Word Sense Disambiguation Based on Modeling Pairwise Context Similarities. In Proceedings of the Senseval-3 Workshop.
Niu, C., W. Li, J. Ding, and R. Srihari. 2004. Orthographic Case Restoration Using Supervised Learning Without Manual Annotation. International Journal of Artificial Intelligence Tools, Vol. 13, No. 1, 2004.
Niu, C., W. Li and R. Srihari. 2004. A Bootstrapping Approach to Information Extraction Domain Porting. AAAI-2004 Workshop on Adaptive Text Extraction and Mining (ATEM), California.
Srihari, R., W. Li and C. Niu. 2004. Corpus-level Information Extraction. In Proceedings of the International Conference on Natural Language Processing (ICON 2004), Hyderabad, India.
Li, W., X. Zhang, C. Niu, Y. Jiang, and R. Srihari. 2003. An Expert Lexicon Approach to Identifying English Phrasal Verbs. In Proceedings of ACL 2003. Sapporo, Japan. pp. 513-520.
Niu, C., W. Li, J. Ding, and R. Srihari. 2003. A Bootstrapping Approach to Named Entity Classification Using Successive Learners. In Proceedings of ACL 2003. Sapporo, Japan. pp. 335-342.
Li, W., R. Srihari, C. Niu, and X. Li. 2003. Question Answering on a Case Insensitive Corpus. In Proceedings of the Workshop on Multilingual Summarization and Question Answering - Machine Learning and Beyond (ACL-2003 Workshop). Sapporo, Japan. pp. 84-93.
Niu, C., W. Li, J. Ding, and R.K. Srihari. 2003. Bootstrapping for Named Entity Tagging Using Concept-based Seeds. In Proceedings of HLT/NAACL 2003, Companion Volume, pp. 73-75, Edmonton, Canada.
Srihari, R., W. Li, C. Niu and T. Cornell. 2003. InfoXtract: A Customizable Intermediate Level Information Extraction Engine. In Proceedings of the HLT/NAACL 2003 Workshop on Software Engineering and Architecture of Language Technology Systems (SEALTS). pp. 52-59, Edmonton, Canada.
Li, H., R. Srihari, C. Niu, and W. Li. 2003. InfoXtract Location Normalization: A Hybrid Approach to Geographic References in Information Extraction. In Proceedings of the HLT/NAACL 2003 Workshop on Analysis of Geographic References. Edmonton, Canada.
Li, W., R. Srihari, C. Niu, and X. Li. 2003. Entity Profile Extraction from Large Corpora. In Proceedings of the Pacific Association for Computational Linguistics 2003 (PACLING03). Halifax, Nova Scotia, Canada.
Niu, C., W. Li, R. Srihari, and L. Crist. 2003. Bootstrapping a Hidden Markov Model for Relationship Extraction Using Multi-level Contexts. In Proceedings of the Pacific Association for Computational Linguistics 2003 (PACLING03). Halifax, Nova Scotia, Canada.
Niu, C., Z. Zheng, R. Srihari, H. Li, and W. Li. 2003. Unsupervised Learning for Verb Sense Disambiguation Using Both Trigger Words and Parsing Relations. In Proceedings of the Pacific Association for Computational Linguistics 2003 (PACLING03). Halifax, Nova Scotia, Canada.
Niu, C., W. Li, J. Ding, and R.K. Srihari. 2003. Orthographic Case Restoration Using Supervised Learning Without Manual Annotation. In Proceedings of the Sixteenth International FLAIRS Conference, St. Augustine, FL, May 2003, pp. 402-406.
Srihari, R. and W. Li. 2003. Rapid Domain Porting of an Intermediate Level Information Extraction Engine. In Proceedings of the International Conference on Natural Language Processing 2003.
Srihari, R., C. Niu, W. Li, and J. Ding. 2003. A Case Restoration Approach to Named Entity Tagging in Degraded Documents. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), Edinburgh, Scotland, Aug. 2003.
Li, H., R. Srihari, C. Niu and W. Li. 2002. Location Normalization for Information Extraction. In Proceedings of the 19th International Conference on Computational Linguistics (COLING-2002). Taipei, Taiwan.
Li, W., R. Srihari, X. Li, M. Srikanth, X. Zhang and C. Niu. 2002. Extracting Exact Answers to Questions Based on Structural Links. In Proceedings of Multilingual Summarization and Question Answering (COLING-2002 Workshop). Taipei, Taiwan.
Srihari, R. and W. Li. 2000. A Question Answering System Supported by Information Extraction. In Proceedings of ANLP 2000. Seattle.
Srihari, R., C. Niu and W. Li. 2000. A Hybrid Approach for Named Entity and Sub-Type Tagging. In Proceedings of ANLP 2000. Seattle.
Li, W. 2000. On Chinese Parsing Without Using a Separate Word Segmenter. In Communication of COLIPS 10 (1). pp. 19-68. Singapore.
Srihari, R. and W. Li. 1999. Information Extraction Supported Question Answering. In Proceedings of TREC-8. Washington.
Srihari, R., M. Srikanth, C. Niu, and W. Li. 1999. Use of Maximum Entropy in Back-off Modeling for a Named Entity Tagger. In Proceedings of the HKK Conference, Waterloo, Canada.
Li, W. 1997. Chart Parsing Chinese Character Strings. In Proceedings of the Ninth North American Conference on Chinese Linguistics (NACCL-9). Victoria, Canada.
Li, W. 1996. Interaction of Syntax and Semantics in Parsing Chinese Transitive Patterns. In Proceedings of the International Chinese Computing Conference (ICCC'96). Singapore.
Li, W. and P. McFetridge. 1995. Handling Chinese NP Predicate in HPSG. In Proceedings of PACLING-II, Brisbane, Australia.
Liu, Z., A. Fu, and W. Li. 1992. Machine Translation System Based on Expert Lexicon Techniques. In Zhaoxiong Chen (ed.), Progress in Machine Translation Research, pp. 231-242. Dianzi Gongye Publishing House, Beijing. (刘倬,傅爱平,李维 (1992). 基于词专家技术的机器翻译系统,"机器翻译研究新进展",陈肇雄编辑,电子工业出版社,第 231-242 页,北京)
Li, Uej (Wei). 1991. Lingvistikaj trajtoj de la lingvo internacia Esperanto. In Serta gratulatoria in honorem Juan Régulo, Vol. IV. pp. 707-723. La Laguna: Universidad de La Laguna. http://blog.sciencenet.cn/blog-362400-285729.html
Li, W. and Z. Liu. 1990. Approach to Lexical Ambiguities in Machine Translation. In Journal of Chinese Information Processing. Vol. 4, No. 1. pp. 1-13. Beijing. (李维,刘倬 (1990). 机器翻译词义辨识对策,《中文信息学报》,1990年第一期,第 1-13 页,北京) (Abstract published in Computer World, 1989/7/26.)
Liu, Z., A. Fu, and W. Li. 1989. JFY-IV Machine Translation System. In Proceedings of Machine Translation SUMMIT II. pp. 88-93, Munich.
Li, W. 1988. E-Ch/A Machine Translation System and Its Synthesis in the Target Languages Chinese and English. In Journal of Chinese Information Processing. Vol. 2, No. 1. pp. 56-60. Beijing. (李维 (1988). E-Ch/A 机器翻译系统及其对目标语汉语和英语的综合,《中文信息学报》,1988年第一期,第 56-60 页,北京)
Li, W. 1988. Lingvistikaj Trajtoj de Esperanto kaj Ghia Mashin-traktado. El Popola Chinio, 1988. Beijing.
Li, W. 1988. An Experiment of Automatic Translation from Esperanto into Chinese and English. World Science and Technology 1988, No. 1, STEA sub Academia Sinica. pp. 17-20, Beijing.
Liu, Y. and W. Li. 1987. Babelo Estos Nepre Konstruita. El Popola Chinio, 1987. Beijing. (Also presented at the First Conference of Esperanto in China, 1985, Kunming.)
Li, W. 1986. Automatika Tradukado el la Internacia Lingvo en la Chinan kaj Anglan Lingvojn. grkg/Humankybernetik, Band 27, Heft 4. pp. 147-152, Germany.

Other Publications

Chinese Dependency Syntax
SBIR Grants (17 Final Reports published internally)
Ph.D. Thesis: The Morpho-Syntactic Interface in a Chinese Phrase Structure Grammar
M.A. Thesis (in Chinese): 世界语到汉语和英语的自动翻译试验–EChA机器翻译系统概述 (Experiments in Automatic Translation from Esperanto into Chinese and English: An Overview of the EChA Machine Translation System)
《立委科普:Machine Translation》 (encoded in Chinese GB)
Li, W. 1997. Outline of an HPSG-style Chinese Reversible Grammar. Vancouver, Canada.
Li, W. 1995. Esperanto Inflection and Its Interface in HPSG. In Proceedings of the 11th North West Linguistics Conference (NWLC), Victoria, Canada.
Li, W. 1994. Survey of Esperanto Inflection System. In Proceedings of the 10th North West Linguistics Conference (NWLC), Burnaby, Canada.
C.V. (立委英文履历)
liwei999 2010-2-19 05:41
WEI LI
Email: liwei AT sidepark DOT org
Homepage: http://www.sciencenet.cn/m/user_index1.aspx?typeid=128262&userid=362400

(1) Qualifications

Dr. Li is a computational linguist with years of work experience in Natural Language Processing (NLP). His background combines a solid research track record with substantial industrial software development experience. He is now Chief Scientist in a US company, leading the technology team in developing the core engine for sentiment extraction and text analytics for the company's consumer insight and business search products. Dr. Li led the NLP team that solved the problem of answering how-questions, which became the technology foundation for the launch of the research product serving the technology community. He then directed the team in automatic sentiment analysis and solved the problem of answering why-questions, an effort that resulted in the launch of the product for extracting consumer insights from social media. He is currently leading multilingual NLP efforts and work on identifying demographic information for social media IDs. In his previous position, Dr. Li was Principal Investigator (PI) at Cymfony on 17 federal grants from DoD SBIR (Air Force and Navy) contracts in the area of NLP/IE (Information Extraction). These efforts led to the development and deployment of the InfoXtract engine and a suite of products, including Cymfony's BrandDashboard, Harmony and Influence, as well as Janya's Semantex engine for government deployment. Dr. Li led the effort that won the first competition at TREC-8 (Text Retrieval Conference 1999) in its natural language Question Answering (QA) track. He has published extensively in refereed journals and high-profile international conferences such as ACL and COLING, in the areas of question answering, parsing, word sense disambiguation, information extraction and knowledge discovery.

(2) Employment

2005.11 - present: Chief Scientist
Dr. Wei Li leads the development of Netbase's core research and natural language processing (NLP) team. Major responsibilities: direct R&D; natural language parsing; technology transfer; business information extraction; sentiment analysis.
Architect and key developer of the NLP platform for parsing English into logical forms
Architect and key developer for question answering and business information extraction based on parsing
Designed and directed the development of sentiment analysis in the Benefit Frame, Problem Frame, 360 Frame, and Preference Frame
Supported technology transfer into product features in three lines of commercially deployed products

1997.11 - 2005.11: Vice President for R&D/NLP, Cymfony Inc. / Janya Inc. (Cymfony spin-off since 2005.08); Principal Research Scientist since 01/1998; VP since 09/1999
Dr. Wei Li led the development of Cymfony/Janya's core research and natural language processing (NLP) team. Major responsibilities: direct R&D; write grant proposals; transfer technology; develop linguistic modules.
Chief architect for the core technology InfoXtract for broad-coverage NLP and Information Extraction (IE): designed and developed the key modules for parsing, relationship extraction and event extraction
Instrumental in helping to close the seed funding and the first round of financing of over 11 million dollars in 2000, and in developing a tiny 2-staff company (when I joined it in 1996) into a 60+ staff technology company in the IT (Information Technology) sector of US industry, with offices in Buffalo, Boston and Bangalore (India) before the spin-off
Responsible for technology transfer: designed the key features of brand tagging, message tracking and quote extraction for the Cymfony flagship product Brand Dashboard(TM)
Cymfony has been nominated several times for the US Small Business Administration Prime Contractor of the Year Award for its outstanding government work
Cymfony's commercial product has won numerous awards, including the Measurement Standard's Third Annual Product of the Year Award, Finalist for the MITX Awards 2004, Finalist for the 19th Annual Codie Award, and the 2003 Massachusetts Interactive Media Council (MIMC) Awards
Cymfony has been named among the 100 Companies That Matter in Knowledge Management by KMWorld, together with other industry leaders
Principal Investigator (PI) or Co-PI for 17 SBIR (Small Business Innovation Research Phase 1, Phase 2 and Enhancement) grants (about eight million dollars) from the US DoD (Department of Defense) in the area of intelligent information retrieval and extraction:
PI, Fusion of Entity Information from Textual Data Sources (Phase I, $100,000), U.S. DoD SBIR (AF), Contract No. FA8750-05-C-0163 (2005)
PI, Automated Verb Sense Identification (Phase II, $750,000), U.S. DoD SBIR (Navy), Contract No. N00178-03-C-1047 (2003-2005)
PI, Automated Verb Sense Identification (Phase I, $100,000), U.S. DoD SBIR (Navy), Contract No. N00178-02-C-3073 (2002-2003)
Co-PI, An Automated Domain Porting Toolkit for Information Extraction (Phase II, $750,000; Enhancement, $830,000), U.S. DoD SBIR (AF), Contract No. F30602-03-C-0044 (2003-2006)
Co-PI, An Automated Domain Porting Toolkit for Information Extraction (Phase I, $100,000), U.S. DoD SBIR (AF), Contract No. F30602-02-C-0057 (2002-2003)
Co-PI, A Large Scale Knowledge Repository and Information Discovery Portal Derived from Information Extraction (Phase II, $750,000), U.S. DoD SBIR (AF) (2004-2006)
Co-PI, A Large Scale Knowledge Repository and Information Discovery Portal Derived from Information Extraction (Phase I, $100,000), U.S. DoD SBIR (AF) (2003-2004)
Co-PI, Automatically Time Stamping Events in Unrestricted Text (Phase I, $100,000), U.S. DoD SBIR (AF) (2003-2004)
Co-PI, Fusion of Information from Diverse, Textual Media: A Case Restoration Approach (Phase I, $100,000), U.S. DoD SBIR (AF), Contract No. F30602-02-C-0156 (2002-2003)
PI, Intermediate-Level Event Extraction for Temporal and Spatial Analysis and Visualization (Phase II, $750,000; Enhancement, $500,000), U.S. DoD SBIR (AF), Contract No. F30602-01-C-0035 (2001-2003)
PI, Intermediate-Level Event Extraction for Temporal and Spatial Analysis and Visualization (Phase I, $100,000), U.S. DoD SBIR (AF), Contract No. F30502-00-C-0090 (2000-2001)
PI, Flexible Information Extraction Learning Algorithm (Phase II, $750,000; Enhancement, $500,000), U.S. DoD SBIR (AF), Contract No. F30602-00-C-0037 (2000-2002)
PI, Flexible Information Extraction Learning Algorithm (Phase I, $100,000), U.S. DoD SBIR (AF), Contract No. F30602-99-C-0102 (1999-2000)
PI, A Domain Independent Event Extraction Toolkit (Phase II, $750,000), U.S. DoD SBIR (AF), Contract No. F30602-98-C-0043 (1998-2000)

1986-1991: Assistant Researcher, Institute of Linguistics, CASS (Chinese Academy of Social Sciences)
R&D for the JFY Machine Translation Engine from English to Chinese (using COBOL)

1988-1991: Senior Engineer, Gaoli Software Company
Instrumental in turning the research prototype JFY into a real-life software product, GLMT, for English-to-Chinese Machine Translation
Trained and supervised lexicographers in building up a lexicon of 60,000 entries
Supervised the testing of thousands of lexicon rules
GLMT 1.0 successfully marketed in 1992
GLMT won numerous prizes, including the Silver Medal at INFORMATICS'92 (Singapore 1992), the Gold Medal for Electronic Products at the Chinese Science and Technology Exhibition (Beijing, 1992) and various other software prizes (Beijing 1992-1995)
Technology partially transferred to VTECH Electronics Ltd in a pocket electronic translator product

1988: Contract grammarian, BSO Software Company, Utrecht, The Netherlands
Chinese Dependency Syntax Project, for use in multi-lingual MT

(3) Education

2001 PhD in Computational Linguistics, Simon Fraser University, Canada. Thesis: The Morpho-syntactic Interface in a Chinese Phrase Structure Grammar
1992 PhD candidate in Computational Linguistics, CCL/UMIST, UK
1986 M.A. in Machine Translation, Graduate School of Chinese Academy of Social Sciences. Thesis: Automatic Translation from Esperanto to English and Chinese

(4) Prizes and Honors

2001 Outstanding Achievement Award, Department of Linguistics, Simon Fraser University (awarded to the best PhD graduates from the department)
1995-1997 G.R.E.A.T. Award, Science Council, B.C., Canada (an industry-based grant funding the effort to bridge my Ph.D. research with local industrial needs)
1997 President's Research Stipend, SFU, Canada
1996 Travel grant for attending ICCC'96 in Singapore, from ICCC'96
1995 Graduate Fellowship (merit-based), SFU, Canada
1992 Software Second Prize (Aiping Fu and Wei Li), Chinese Academy of Social Sciences, for machine translation database software
1991 Sino-British Friendship Scholarship, supporting my PhD program in the UK (a prestigious scholarship awarded to young Chinese scientists for overseas training in England through a nation-wide competition, administered jointly by the British Council, the Sir Pao Foundation and the Education Ministry of China)

(5) Professional Activities

Editor, International Editorial Board, Journal of Chinese Language and Computing
Industrial Advisor, supervising over 20 graduate student interns from SUNY/Buffalo (since 1998)
Reviewer, Second International Joint Conference on Natural Language Processing (IJCNLP-05)
Member, Program Committee, 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP 2004)
Reviewer, Mark Maybury (ed.), New Directions in Question Answering, The AAAI Press, 2003
Member, Program Committee, The 17th Pacific Asia Conference on Language, Information and Computation (PACLIC17), 2003
Member, Program Committee, 20th International Conference on Computer Processing of Oriental Languages (ICCPOL2003), 2003
Panelist, Multilingual Summarization and Question Answering (COLING-2002 Workshop)
Invited talk, 'Information Extraction and Natural Language Applications', National Key Lab for NLP, Qinghua University, Beijing, Feb. 2001
Member, Association for Computational Linguistics (ACL)
Member, American Association for Artificial Intelligence (AAAI)

(6) Languages

English: fluent
Chinese: native
French: intermediate (learned 3 years)
Esperanto: fluent (published in Esperanto)
Russian: elementary (learned 1 year)

(7) Publications

A complete list of publications is available online at http://www.sciencenet.cn/m/user_content.aspx?id=295975
