电脑的中文处理业界有很多广为流传、似是而非的迷思。在今后的随笔系列中,准备提出来分别讨论。

迷思之一:切词(又叫分词,word segmentation)是中文(或东方语言)处理特有的前提,因为中文书写不分词。

切词作为中文处理的一个先行环节,是为了模块化开发的方便,这一点不错。但它根本就不特有。任何自然语言处理都有一个先行环节,叫 tokenization,就是把输入的字符串分解成为词汇单位:无论何种书面语,没有这个环节,辞典的词汇信息就无以附着,在词汇类别的基础上的有概括性的进一步句法语义分析就不能进行。中文切词不过是这个通用的 tokenization 的一个案例而已,没有什么“特有”的问题。

有说:中文书写不分词,汉字一个挨一个,词之间没有显性标识,而西文是用 space(空格)来分词的,因此分词是中文处理的特有难题。这话并不确切,语言学上错误更多。具体来说:

1. 单字词没有切分问题:汉语词典的词,虽然以多字词为多数,但也有单字词,特别是那些常用的功能词(连词、介词、叹词等)。对于单字词,书面汉语显然是有显性标志的,其标志就是字与字的自然分界(如果以汉字作为语言学分析的最小单位,语言学上叫语素,其 tokenization 极其简单:在双字节编码(如 GB2312)中,每两个字节为一个汉字),无需 space。

2. 多字词是复合词,与其说“切”词,不如说“组”词:现代汉语的多字词(如:利率)是复合词,本质上与西文的复合词(e.g. interest rate)没有区别,space 并不能解决复合词的分界问题。事实上,多字词的识别既可以看成是从输入语句(汉字串)中“切”出来的,也可以看成是由单字组合抱团而来的,二者等价。无论中西,复合词抱团都主要靠查词典来解决,而不是靠自然分界(如 space)来解决(德语的名词复合词算是西文中的一个例外,封闭类复合词只要 space 就可以了,开放类复合词则需要进一步切词,叫 decompounding)。如果复合词的左边界或者右边界有歧义问题(譬如:“天下”的边界可能歧义,e.g. 今天 下 了 一 场 雨;英语复合副词 in particular 的右边界可能有歧义:e.g. in particular cases),无论中西,这种歧义都需要上下文的帮助才能解决。从手段上看,中文的多字词切词并无任何特别之处,英语 tokenization 用以识别复合词 People's Republic of China 和 in particular 的方法,同样适用于中文切词。

咱们换一个角度来看这个问题。根据用不用词典,tokenization 可以分两种。不用词典的 tokenization 一般被认为是一个比较 trivial 的机械过程,在西文是见 space 或标点就切一刀(其实也不是那么 trivial,因为那个讨厌的西文句点是非常歧义的)。据说汉语没有 space,因此必须另做一个特有的切词模块。其实对于第一种 tokenization,汉语比英语更加简单,因为汉字作为语素(morpheme)本身就是自然的切分单位,在双字节编码中一个汉字两个字节,每两个字节切一刀即可。理论上讲,词法句法分析完全可以直接建立在汉字的基础之上,无需一个汉语“特有”的切词模块。Note that 多数西文分析系统在 tokenization 和 POS 以后都有一个 chunking 的模块,做基本短语抱团的工作(如:Base NP)。中文处理通常也有这么一个抱团的阶段。完全可以把组字成词和组词成短语当作同质的抱团工作来处理,跳过所谓的切词。Chunking of words into phrases is by nature no different from chunking of morphemes (characters) into words. Parsing with no “word segmentation” is thus possible.
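上面说“切”词与“组”词二者等价。用一个极简的 Python 草图可以示意这一点(词表、函数名均为演示而虚构,真实系统的词典和消歧策略复杂得多):从左向右做词表最长匹配,结果既可以看作从汉字串中“切”出词,也可以看作由单字“组”成词。

```python
# 最长匹配(forward maximum matching)的组词/切词示意。
# LEXICON 是演示用的虚构小词表。
LEXICON = {"今天", "天下", "利率", "下", "了"}

def fmm(text, lexicon=LEXICON, max_len=4):
    """从左向右,每次取词表中能匹配的最长字串;单字永远可以成词。"""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + length]
            if word in lexicon or length == 1:
                tokens.append(word)
                i += length
                break
    return tokens

# “天下”的潜在歧义在这里被左起最长匹配自然化解:
# fmm("今天下了一场雨") → ['今天', '下', '了', '一', '场', '雨']
```

当然,遇到真正需要上下文才能化解的边界歧义,这种机械办法就不够了,这一点中西文并无二致。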
当然,从实际操作层面看,专设一个切词模块有其便利之处。

再看由词典支持的 tokenization,这种 tokenization 才是我们通常讲的切词。说它是中文处理特有的步骤,其实是误解,因为西文处理复合词也一样用到它。除了实验室的 toy system,很难想象一个像样的西文处理系统可以不借助词典、而指望抽象规则来对付所有的复合词:事实上,对于封闭类复合词,即便抽象的词法规则可以使部分复合词抱团,也不如词典的参与来得直接和有益,理由就是复合词的词典信息更少歧义,对于后续处理更加有利。汉语的复合词“利率”与英语的复合词 “interest rate” 本质上是同样的基于词典的问题,并没有什么“特有”之处。

【相关博文】
《立委科普:应该立法禁止分词研究 :=)》
【置顶:立委科学网博客NLP博文一览(定期更新版)】
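上文说英语 tokenization 识别复合词的方法同样适用于中文切词。作为补充,同一套最长匹配逻辑用在已按 space 切开的英语词串上,就是复合词抱团(词组表为演示虚构):

```python
# 用与中文切词相同的最长匹配逻辑,在英语词串上抱团复合词。
# PHRASES 是演示用的虚构词组表。
PHRASES = {"interest rate", "in particular", "People's Republic of China"}

def chunk(words, phrases=PHRASES, max_len=5):
    """对词串从左向右做词组最长匹配;匹配不上的词保持单词。"""
    out, i = [], 0
    while i < len(words):
        for length in range(min(max_len, len(words) - i), 0, -1):
            cand = " ".join(words[i:i + length])
            if cand in phrases or length == 1:
                out.append(cand)
                i += length
                break
    return out
```

例如 chunk("the interest rate rose".split()) 得到 ['the', 'interest rate', 'rose']:可见 space 只解决了单词的切分,复合词抱团仍然要靠词典,中西文在这一点上确实同质。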
与业内老友的对话:在‘用’字上狠下功夫

耳边响起了林副主席关于系统开发的谆谆教导:

Quote
带着问题做,活做活用,做用结合,急用先做,立竿见影,在‘用’字上狠下功夫。
from: http://blog.sciencenet.cn/home.php?mod=space&uid=362400&do=blog&id=510567

这是从与朋友的内部交流中得来的。赶的是编造名人名言的时髦。

~~~~~~~~~~~~

在我发文【坚持四项基本原则,开发鲁棒性NLP系统】以后,有业内资深老友表示非常有意思,建议我把NLP方面的博文系列汇集加工,可以考虑出书:

Quote
A good 经验之谈. Somehow it reminds me this -- 带着问题学,活学活用,学用结合,急用先学,立竿见影,在‘用’字上狠下功夫。
You made a hidden preamble -- a given type of application in a given domain. A recommendation: expand your blog a bit as a series, heading to a book. My friend 吴军 did that quite successfully. Of course with statistics background. So he approached NLP from math perspective -- 数学之美系列.
You have very good thoughts and raw material. Just you need to put a bit more time to make your writing more approachable -- I am commenting on comments like 学习不了 and 读起来鸭梨很大. I know you said: 有时候想,也不能弄得太可读了,都是多年的经验,后生想学的话,也该吃点苦头。:=) But as you already put in the efforts, why not make it more approachable? The issue is, even if I am willing to 吃点苦头, I still don't know where to start 吃苦头, IF I have never built a real-life NLP system. For example, 词汇主义 by itself is enough for an article. You need to mention its opponents and its history to put it into context. Then you need to give some examples.

文章千古事,网上涂鸦岂敢出书?这倒不是妄自菲薄,主要是出书太麻烦,跟不上这个时代。

我回道:

吴军's series are super popular. When I first read one of his articles on the Google Blackboard, recommended by a friend, I was amazed how well he structured and carried the content. It is intriguing.

(边注:当然,他那篇谈 Page Rank 的文章有偏颇,给年轻人一种印象,IT 事业的成功是由技术主宰的,而实际上技术永远是第二位的。对于所谓高技术企业,没有技术是万万不行的,但企业成功的关键却不是技术,这是显而易见的事实了。)

For me, to be honest, I do not aim that high. Never bothered polishing things to pursue perfection although I did make an effort to try to link my stuff into a series for the convenience of cross reference between the related pieces. There are missing links which I know I want to write about but which sort of depend on my mood or time slots.
I guess I am just not pressed and motivated to do the writing part. Popularizing the technology is only a side effect of the blogging hobby at times. The way I prove myself is to show that I will be able to build products worth millions, or even hundreds of millions of dollars.

网上的文字都是随兴之所至,我从来不写命题作文,包括我自己的命题。有时候兴趣来了,就说自己下一篇打算写什么什么,算是自我命题,算是动了某个话题的心思。可是过了两天,一个叉打过去,没那个兴致和时间了,也就作罢。赶上什么写什么,这就是上网的心态。平时打工已经够累了,上网绝不给自己增加负担。

So far I have been fairly straightforward on what I write about. If there is a readability issue, it is mainly due to my lack of time. Young people should be able to benefit from my writings especially once they start getting their hands dirty in building up a system.

Your discussion is fun. You can see and appreciate things hidden behind my work more than other readers. After all, you have published in THE CL and you have almost terminated the entire segmentation as a scientific area. Seriously, it is my view that there is not much to do there after your work on tokenization both in theory and practice.

I feel some urgency now for having to do Chinese NLP asap. Not many people have been through as much as I have, so I am in a position to potentially build a much more powerful system to make an impact on Chinese NLP, and hopefully on the IT landscape as well. But time passes fast. That is why my focus is on Chinese processing now, day and night. I am also keeping my hands dirty with a couple of European languages, but they are less challenging and exciting.
RE: 切词当然是第一关。这个没弄好,其他的免谈

现如今中文自动分析的瓶颈早已不是切词了。
日期: 12/05/2011 15:43:43

半个世纪折腾进去无数的人力了。是 overdone,很大程度上是科研财主(sponsors)和科学家共同的失察。应该立法禁止切词(word segmentation or tokenization)研究(kidding :=)),至少是禁止用纳税人钱财做这个研究。

海量词库可以解决切词 90% 以上的问题。统计模型可以解决几个百分点。硬写规则或者 heuristics 也可以达到类似的效果。再往上,多一个百分点少一个百分点又有什么关系?对于应用没有什么影响,as long as things can be patched and incrementally enhanced over time. 或者任其错误下去(上帝允许系统的不完美),或者在后面的句法分析中 patch。很多人夸大了管式系统的错误放大问题(所谓 error propagation in a pipeline system),他们忽略了系统的容错能力(robustness through adaptive modules:负负可以得正),这当然要看系统设计者的经验和智慧了。

中文处理在切词之后,有人做了一些短语识别(譬如 Base NP 抱团)和专有名词识别(Named Entity Tagging),再往下就乏善可陈了。深入不下去是目前的现状。我要做的就是镜子说的“点入”。先下去再说,做一个 end-to-end system,直接支持某个 app,用到大数据(big data)上,让数据制导,让数据说话。同时先用上再说,至少尽快显示其初步的 value,而不是十年磨一剑。

【相关博文】
再谈应该立法禁止切词研究 2015-06-30
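上面说词库之上再加统计模型,还能多解决几个百分点。统计消歧的原理可以用一个极简的一元词频动态规划示意(词频数字纯属虚构,仅为演示):在所有切分路径中选词频乘积最大的一条。

```python
import math

# 虚构的一元词频(相对频率),仅为演示
PROBS = {"研究": 0.10, "研究生": 0.06, "生命": 0.05, "命": 0.01}
OOV = 1e-8  # 未登录单字的平滑概率

def best_seg(text, probs=PROBS, max_len=4):
    """动态规划:best[i] 记录 text[:i] 的最优(对数)得分与切分路径。"""
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for i in range(1, len(text) + 1):
        for length in range(1, min(max_len, i) + 1):
            word = text[i - length:i]
            p = probs.get(word, OOV if length == 1 else None)
            if p is None or best[i - length][1] is None:
                continue
            score = best[i - length][0] + math.log(p)
            if score > best[i][0]:
                best[i] = (score, best[i - length][1] + [word])
    return best[len(text)][1]

# 经典歧义“研究生命”:“研究/生命”的词频积高于“研究生/命”
```

这类模型把绝大多数常见歧义打发掉以后,剩下的零头正是正文所说的:可以 patch,可以留给后面的句法分析,不值得再立项攻关。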
话说这苹果真是能折腾,一个技术课题硬是折腾成大众话题,弄得满世界都在谈论苹果爱疯的贴身小蜜“死日”(Siri,没追踪来源,但瞧这名字起的),说是她无所不能,能听得懂主人的心思,自动打理各项事务,从天气预报,到提供股票信息,甚至做笔记。不服不行,人家就是把这个科幻世界的机器人功能产品化了,挑起了大众的好奇心。虽然毁誉参半,批评者与追星者一样多,还是为语言技术扬了名。这不,圣诞节到了,调查表明,美国青少年最喜欢的圣诞礼品有三:(1)礼物券,也就是钱,爱怎么花自己定当然好;(2)时装(爱美之心);(3)苹果产品(因为那是时髦的代名词)。

前些时候,与朋友谈到死日,我说它有三大来源:首先是语言技术,包括语音识别和文句分析。语音识别做了很多年了,据说技术相当成熟可用了(语音虽然是我的近邻,但隔行如隔山,我就不评论了)。文句分析(这可是我的老本行)当然有难度,但是因为死日是目标制导,即从目标 app 反推自然语言的问句表达法,所以分析难度大为降低,基本上是 tractable 的(见《立委随笔:非常折服苹果的技术转化能力》)。第二个来源是当年 AskJeeves 借以扬名的 million-dollar idea(见【IT风云掌故:金点子起家的 AskJeeves】),巧妙运用预知的问题模板,用粗浅的文句分析技术对应上去,反问用户,从而做到以不变应万变,克服机器理解的困难。最近有人问死日:Where can I park the car? 死日就反问道:you asked about park as in a public park, or parking for your vehicle? 虽然问句表明了这位贴身小蜜是绣花枕头,徒有其表,理解能力很有限,但是对于主人(用户)来说,在两个选项中肯定一个不过是举“口”之劳的事情。第三个来源就是所谓聊天系统,网上有不少类似的玩具(见【立委科普:问答系统的前生今世】第一部分),它是当年面临绝路的老 AI 留下的两大遗产之一(另一个遗产是所谓专家系统)。

最近摆弄汉语自动分析,有老友批评得很到位:

Quote
俺斗胆评论一下,您的系统长项应该在于自然语言理解。至于语法树,应该是小儿科。韩愈说“句读之不知,惑之不解”。语法树的作用在于“知句读”,而您的系统应该强调“解惑”。俺感觉照现在的发展速度,一个能够真正通过图灵检验的系统应该离我们不远了。虽然现在已经有系统号称能通过,但是都是聊天系统,干的本身就是不着调的工作。离真正意义的图灵检验还有距离。

是小儿科,可是很多人弄不了这小儿科呢。
日期: 12/05/2011 13:41:30

从 high level 看,从100年后看,说小儿科也差不多。但是你所谓的解惑,离开现实太远。一般来说,机器擅长分析、抽取和挖掘,上升到预测和解惑还有很长的路,除非预测是挖掘的简单延伸,解惑就是回答黑白分明的问题。

聊天系统,干的本身就是不着调的工作,一点儿不错,那是所谓 old AI 的残余。不过,即便如此,我在苹果 Siri 的三个来源(1. 自然语言技术:语音和文字;2. AskJeeves 模板技术;3. 所谓 AI 聊天系统)中也看到了它的影子,它是有实用价值的,价值在于制造没有理解下的人工智能的假象。

昨天甜甜秀给我看:Dad, somebody asked Siri: what are you wearing? Guess how he replies? 这种 trick,即便知道是假的,也让人感觉到设计者的一份幽默。那天在苹果 iPhone 4s 展示会上,临结束全场哄堂大笑,原来苹果经理最后问了一个问题:Who are you? Siri 扭着细声答道:I am your humble assistant. 面对难以实现的人工智能,来点儿幽默似的假的人工智能,也是一种智慧。

相关篇什:
《立委随笔:非常折服苹果的技术转化能力。。。》
《从新版iPhone发布,看苹果和微软技术转化能力的天壤之别》
科学网—【立委科普:问答系统的前生今世】
科学网—《立委随笔:人工“智能”》
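上文说 Siri 借用 AskJeeves 式的问题模板:把有限的 app 功能做成模板,匹配上就分发,匹配不上就反问用户,以不变应万变。下面是一个玩具级示意(意图表、触发词都是我虚构的,与苹果的实际实现无关):

```python
import re

# 虚构的意图模板表:每个意图对应一组触发词
INTENTS = [
    (re.compile(r"\b(weather|forecast|rain)\b", re.I), "weather"),
    (re.compile(r"\b(stock|share price)\b", re.I), "stocks"),
    (re.compile(r"\b(note|remind)\b", re.I), "notes"),
]

def route(utterance):
    """匹配上模板就走对应 app;匹配不上就反问用户。"""
    for pattern, intent in INTENTS:
        if pattern.search(utterance):
            return intent
    return "clarify"  # 反问:did you mean ... ?
```

像 "Where can I park the car?" 这样落在模板之外的问句,就会走到 "clarify" 分支,正对应正文里死日反问 park 是公园还是停车的那一幕。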
机器八卦:Text Mining and Intelligence Discovery (13219)
Posted by: liwei999
Date: June 10, 2006 10:07PM

犀角提议,干脆用机器挖掘吧。我不想吓唬大家,但是,理论上说,除非你不冒泡,言多必失,机器八卦,比人工挖掘,可能揭示出你的更多特征。好在该技术还不成熟。

Text mining 是我这几年的研究重点之一,简单介绍一下。我们上贴用的是自然语言(英语,汉语),它们只是一串串字符,称作 unstructured text,不是真的没有结构,而是结构是隐含的(语法结构、语义结构),需要 NLU(Natural Language Understanding)技术中的 parsing 才能使其结构化。为什么要结构化?你想啊,千变万化的字符串组合,表达各种意义,如果不结构化,怎么从中有效地抽取信息(IE: information extraction),并挖掘出有价值的 intelligence(所谓 intelligence discovery)呢?

当然,也有人不用结构去提取和挖掘,所谓 keyword-based information extraction and text mining,一些浅层的信息和情报也可能这样被提取/挖掘出来。这就好比大家用 Google 搜索,Google 并不懂你的 query,在 Google 眼中,那不过是一串串互不相干、没有结构的 words (search terms),但是由于网上有海量的带有很大 redundancy 的信息,东方不亮西方亮,查询结果往往很不错。Nevertheless, search 也好,IE 和 text mining 也好,其最终突破在于 NLU。

Text mining 这个术语从 data mining 而来,后者通常指从数据库里面的有结构的数据中挖掘出规律来(hidden correlations and patterns)。Data mining 是个比较成熟的、已在实际应用中的技术,它能挖掘出对于 target marketing 很有价值的情报来。比较 data mining 和 text mining 可以知道,前者的成熟是建立在数据的结构化(数据库一般是人工建立和输入的)基础之上。因此,要想提高 text mining 的可用度,重点还是把 unstructured text 转化成结构化的 representation。这就是我们一辈子也研究不完的题目了。

分析主谓宾及其修饰语关系(decoding Subject-Verb-Object, or SVO),是自然语言自动分析(Natural Language Parsing)的主要任务。SVO parsing 做好了,就为语言理解打好了基础。在此基础上做信息抽取(IE: Information Extraction)和文本挖掘(Text Mining)就事半功倍了。

信息抽取和文本挖掘的区别是,前者提取的是“事实”(facts),即文本中 explicitly 表达出来的东东(比如我曾说过我籍贯安徽,是世界语者,喜欢红楼梦,爱好音乐等等),而后者是挖掘文本中没有明说的 hidden relationships, patterns and trends。
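上面说 SVO parsing 做好了,信息抽取就事半功倍。用一个手工标好的依存结构可以示意从 parse 结果中取 SVO 三元组(这里的数据结构是为演示虚构的,真实系统的 parse 表示要复杂得多):

```python
def extract_svo(parse):
    """parse: [(index, word, head_index, deprel)],head_index 为 0 表示根节点。
    对每个主语(nsubj),找它所依附的动词,再找该动词的宾语(dobj/obj)。"""
    words = {i: w for i, w, _, _ in parse}
    triples = []
    for i, w, h, rel in parse:
        if rel == "nsubj":
            verb = h
            objs = [w2 for i2, w2, h2, r2 in parse
                    if h2 == verb and r2 in ("dobj", "obj")]
            for obj in objs:
                triples.append((w, words[verb], obj))
    return triples

# 对 “立委 喜欢 红楼梦” 的(虚构)依存分析结果:
PARSE = [(1, "立委", 2, "nsubj"), (2, "喜欢", 0, "root"), (3, "红楼梦", 2, "dobj")]
```

extract_svo(PARSE) 得到 [("立委", "喜欢", "红楼梦")]:这样的三元组正是从 unstructured text 通向结构化 representation 的第一步。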
所以信息抽取可以充当文本挖掘的基础:根据已知事实挖掘隐含的联系、规律和走向,真地是八卦了,基于科学基础上的八卦。将来有一天,机器很有可能挖掘出这样一条爆炸性信息来:本坛网友某某某有同性恋倾向。那可比网络上的“人言”厉害,这是有“科学”根据的预测啊。真地是跳到黄河洗不清了。

总结如下:Natural Language Parsing -- Information Extraction -- Text Mining

--------
立委名言:如果生活能重来,我应该从事新闻采编。

钻这个牛角的意思 (13320)
Posted by: seeit
Date: June 11, 2006 07:25PM

还有一个问题请教立委,网上的信息可信度太低,text mining 如何考虑可信度?不同可信度的信息组合的最终结论的可信度又如何控制?头都想大了。

---------------------------------------------------------------------

这是个大家都头大的问题。 (13322)
Posted by: liwei999
Date: June 11, 2006 07:59PM

样本不够的时候没有什么好办法吧。比如李四一共才冒泡100次,而且刻意隐瞒、歪曲、半真半假。在这样的样本上是挖掘不出可信的情报来的,挖掘出的情报也只能做参考。但是,如果样本很大,就可以过滤掉噪音和不实信息(deconflicting),前提是人天生不是时时事事在说谎(这个前提统计学上是成立的)。

情报挖掘由于 domain dependent,样本有可能有限:比如老友论坛,一共也不到两万帖,就是加上隔壁读书论坛,也不到20万帖子的存档吧。对于 domain independent 的知识习得(knowledge acquisition),海量存档提供了极好的过滤基础。去年还是前年,Google 一位仁兄就发了一篇 paper,谈怎样从海量存档中获取 entity ontology 的知识。方法很简单,却非常有效,他只用了两个 patterns:

1. E1, E2, ..., and other C
   e.g. desks, tables, chairs and other furniture

2. C such as E1, E2, ..., and En
   e.g. furniture such as desks, tables and chairs

E is supposed to be an entity noun, C should be a category noun of the entities.

这两个简单的语言 patterns,只是英语用来表达实体上下位概念的常用说法,还有更多的说法没有概括进来,所以 recall(查全率)是不够的。这两个 patterns 的精确度(precision)也还有问题,error rate 大概导致 3-5% 的噪音/不实信息。可是 Google 数据量大啊,只要运算速度跟上来,海量数据可以弥补查全率的不足(由于 redundancy),而且也过滤了噪音(threshold 设置高一点就成)。其结果出奇的好:

furniture: desk, table, chair, bench, bookshelf, ...
US States: California, Washington, Texas, New York, ...
dictators: Saddam Hussein, Castro, Jiang Zemin, Kim Jong Il, ...
etc. etc.
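上面这两个上下位 patterns,可以用正则粗略示意如下(正则写法是我的假设性草图,并非那篇 paper 的原实现;真实系统还要靠海量冗余和阈值过滤噪音):

```python
import re

def hyponym_pairs(text):
    """套用上文的两个 patterns,从文本中抽取 (实体, 类别) 对。"""
    pairs = []
    # Pattern 2: "C such as E1, E2 and En"
    for m in re.finditer(r"(\w+) such as ((?:\w+, )*\w+ and \w+)", text):
        for e in re.split(r", | and ", m.group(2)):
            pairs.append((e, m.group(1)))
    # Pattern 1: "E1, E2, ... and other C"
    for m in re.finditer(r"((?:\w+, )*\w+) and other (\w+)", text):
        for e in m.group(1).split(", "):
            pairs.append((e, m.group(2)))
    return pairs
```

对 "furniture such as desks, tables and chairs" 和 "desks, tables, chairs and other furniture" 两种说法,都能得到 desks/tables/chairs 上属 furniture 的三对知识。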
而所谓 semantic web,就是在源头上解决问题, (13238)
Posted by: liwei999
Date: June 11, 2006 01:23AM

在网页编制发布时就人工参与地结构化了。简单地说,就是让我们搞自然语言的人失业。但 semantic web 的主旨并不是表达 domain independent 的语言分析(主谓宾什么的),而是表达 domain-dependent 的语义 ontology,直接抓住网页的核心内容。

Ontology 是知识表达体系,我们 NLP(natural language processing)/NLU(natural language understanding)/IE(information extraction)的目的就是 decode unstructured text,把内容 map 到预定的 ontology 上。这是从目标上看。在 decoding 过程中,也有用到知识库,通常是 lexicalized thesaurus 什么的(比如 WordNet),这里面的知识也是成体系的,也包含 ontology。

我们做 NLP/NLU 的人,并非为分析而分析,parsing unstructured text 的主要目标是为 information extraction 和 text mining 服务。主谓宾之类只是手段,而非目的。Semantic web 的理想(或幻想)就是在源头上把目的达到。从这个意义上说,它是在抢我们的饭碗。

当然,人类用自然语言随处可见,不大可能都愿意麻烦或有条件走 semantic web 所指的路。所以,担心没有活干,是没必要的。
上周信笔涂鸦写了个不伦不类的科普(【立委科普:从产业角度说说NLP这个行当】),写完自我感觉尚可,于是毛遂自荐要求加精:“自顶一哈:不用谦虚,这个应该加精。也不枉我费了大半天的时辰。”本来是玩笑话,没成想科网的编辑MM在两小时内就真地加精上首页了。前几周还在抱怨,怕被编辑打入另册,正琢磨献花还是金币以求青睐,没想到这么快就峰回路转,春暖花开。响鼓不用重敲,原来还是要发奋码字才行,花言巧语的不行。得,一鼓作气,再码两篇。

言归正传,第一篇先介绍一下问答系统(Question Answering system)的来龙去脉。第二篇专事讲解问答系统中的三大难题 What、How 与 Why。

一 前生

传统的问答系统是人工智能(AI: Artificial Intelligence)领域的一个应用,通常局限于一个非常狭窄专门的领域,基本上是由人工编制的知识库加上一个自然语言接口而成。由于领域狭窄,词汇总量很有限,其语言和语用的歧义问题可以得到有效的控制。问题是可以预测的,甚至是封闭的集合,合成相应的答案自然有律可循。著名的项目有上个世纪60年代研制的 LUNAR 系统,专事回答有关阿波罗登月带回的月球岩石样本的地质分析问题。SHRDLU 是另一个基于人工智能的专家系统,模拟的是机器人在玩具积木世界中的操作,机器人可以回答这个玩具世界的几何状态的问题,并听从语言指令进行合法操作。

这些早期的AI探索看上去很精巧,揭示了一个有如科学幻想的童话世界,启发人的想象力和好奇心,但是本质上这些都是局限于实验室的玩具系统(toy systems),完全没有实用的可能和产业价值。随着作为领域的人工智能之路越走越窄(部分专家系统虽然达到了实用,基于常识和知识推理的系统则举步维艰),寄生其上的问答系统也基本无疾而终。倒是有一些机器与人的对话交互系统(chatterbots)一路发展下来至今,成为孩子们的网上玩具(我的女儿就很喜欢上网找机器人对话,有时故意问一些刁钻古怪的问题,程序应答对路的时候,就夸奖它一句,但更多的时候是看着机器人出丑而哈哈大笑。不过,我个人相信这个路子还大有潜力可挖,把语言学与心理学知识交融,应该可以编制出质量不错的机器人心理治疗师。其实在当今的高节奏高竞争的时代,很多人面对压力需要舒缓,很多时候只是需要一个忠实的倾听者,这样的系统可以帮助满足这个社会需求。要紧的是要消除使用者“对牛弹琴”的先入为主的偏见,或者设法巧妙隐瞒机器人的身份,使得对话可以敞开心扉。扯远了,打住。)

二 重生

产业意义上的开放式问答系统完全是另一条路子,它是随着互联网的发展以及搜索引擎的普及应运而生的。准确地说,开放式问答系统诞生于1999年,那一年搜索业界的第八届年会(TREC-8:Text REtrieval Conference)决定增加一个问答系统的竞赛,由美国国防部有名的 DARPA 项目资助,美国国家标准局组织实施,从而催生了这一新兴的问答系统及其 community。问答系统竞赛的广告词写得非常精彩,恰到好处地指出搜索引擎的不足,确立了问答系统在搜索领域的价值定位。记得是这样写的(大体):

用户有问题,他们需要答案。搜索引擎声称自己做的是信息检索(information retrieval),其实检索出来的并不是所求信息,而只是成千上万相关文件的链接(URLs),答案可能在也可能不在这些文件中。无论如何,总是要求人去阅读这些文件,才能寻得答案。问答系统正是要解决这个信息搜索的关键问题。对于问答系统,输入的是问题,输出的是答案,就是这么简单。

说到这里,有必要先介绍一下开放式问答系统诞生时候的学界与业界的背景。从学界看,传统意义上的人工智能已经不再流行,代之而来的是大规模真实语料库基础上的机器学习和统计研究。语言学意义上的规则系统仍在自然语言领域发挥作用,作为机器学习的补充,而纯粹基于知识和推理的所谓智能规则系统基本被学界抛弃(除了少数学者的执着,譬如 Douglas Lenat 的 Cyc)。学界在开放式问答系统诞生之前还有一个非常重要的发展,就是信息抽取(Information Extraction)专业方向及其 community 的发展壮大。与传统的自然语言理解(Natural Language
Understanding)面对整个语言的海洋,试图分析每个语句求其语义不同,信息抽取是任务制导,任务之外的语义没有抽取的必要和价值:每个任务定义为一个预先设定的所求信息的表格,譬如,会议这个事件的表格需要填写会议主题、时间、地点、参加者等信息,类似于测试学生阅读理解的填空题。这样的任务制导的思路一下子缩短了语言技术与实用的距离,使得研究人员可以集中精力按照任务指向来优化系统,而不是从前那样面面俱到,试图一口吞下语言这个大象。到1999年,信息抽取的竞赛及其研讨会已经举行了七届(MUC-7:Message Understanding Conference),也是美国DARPA项目的资助产物(如果说DARPA引领了美国信息产业研究及其实用化的潮流,一点儿也不过誉),这个领域的任务、方法与局限也比较清晰了。发展得最成熟的信息抽取技术是所谓实体名词的自动标注(Named Entity:NE tagging),包括人名、地名、机构名、时间、百分比等等。其中优秀的系统无论是使用机器学习的方法,还是编制语言规则的方法,其查准率查全率的综合指标都已高达90%左右,接近于人工标注的质量。这一先行的年轻领域的技术进步为新一代问答系统的起步和开门红起到了关键的作用。 到1999年,从产业来看,搜索引擎随着互联网的普及而长足发展,根据关键词匹配以及页面链接为基础的搜索算法基本成熟定型,除非有方法学上的革命,关键词检索领域该探索的方方面面已经差不多到头了。由于信息爆炸时代对于搜索技术的期望永无止境,搜索业界对关键词以外的新技术的呼声日高。用户对粗疏的搜索结果越来越不满意,社会需求要求搜索结果的细化(more granular results),至少要以段落为单位(snippet)代替文章(URL)为单位,最好是直接给出答案,不要拖泥带水。虽然直接给出答案需要等待问答系统的研究成果,但是从全文检索细化到段落检索的工作已经在产业界实行,搜索的常规结果正从简单的网页链接进化到 highlight 了搜索关键词的一个个段落。 新式问答系统的研究就在这样一种业界急切呼唤、学界奠定了一定基础的形势下,走上历史舞台。美国标准局的测试要求系统就每一个问题给出最佳的答案,有短答案(不超过50字节)与长答案(不超过250字节)两种。下面是第一次问答竞赛的试题样品: Who was the first American in space? Where is the Taj Mahal? In what year did Joe DiMaggio compile his 56-game hitting streak? 三 昙花 这次问答系统竞赛的结果与意义如何呢?应该说是结果良好,意义重大。最好的系统达到60%多的正确率,就是说每三个问题,系统可以从语言文档中大海捞针一样搜寻出两个正确答案。作为学界开放式系统的第一次尝试,这是非常令人鼓舞的结果。当时正是 dot com 的鼎盛时期,IT 业界渴望把学界的这一最新研究转移到信息产品中,实现搜索的革命性转变。里面有很多有趣的故事,参见我的相关博文: 《朝华午拾:创业之路》 。 回顾当年的工作,可以发现是组织者、学界和业界的天时地利促成了问答系统奇迹般的立竿见影的效果。美国标准局在设计问题的时候,强调的是自然语言的问题(English questions,见上),而不是简单的关键词 queries,其结果是这些问句偏长,非常适合做段落检索。为了保证每个问题都有答案,他们议定问题的时候针对语言资料库做了筛选。这样一来,文句与文本必然有相似的语句对应,客观上使得段落匹配(乃至语句匹配)命中率高(其实,只要是海量文本,相似的语句一定会出现)。设想如果只是一两个关键词,寻找相关的可能含有答案的段落和语句就困难许多。当然找到对应的段落或语句,只是大大缩小了寻找答案的范围,不过是问答系统的第一步,要真正锁定答案,还需要进一步细化,pinpoint 到语句中那个作为答案的词或词组。这时候,信息抽取学界已经成熟的实名标注技术正好顶上来。为了力求问答系统竞赛的客观性,组织者有意选择那些答案比较单纯的问题,譬如人名、时间、地点等。这恰好对应了实名标注的对象,使得先行一步的这项技术有了施展身手之地。譬如对于问题 “In what year did Joe DiMaggio compile his 56-game hitting streak?”,段落语句搜索很容易找到类似下列的文本语句:Joe DiMaggio's 56 game hitting streak was between May 15, 1941 and July 16, 1941. 
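上述“段落检索 + 实名标注”的组合,可以用一个玩具管线示意(停用词表和正则都是演示用的假设;真实系统用的是成熟的 NE tagger 和索引,不是这几行):

```python
import re

STOP = {"in", "what", "year", "did", "his", "the", "is", "was"}

def terms(s):
    """粗糙的词项化:取小写字母数字串。"""
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def best_passage(question, passages):
    """第一步:按问句关键词重叠度挑出最可能含答案的段落(snippet)。"""
    q = terms(question) - STOP
    return max(passages, key=lambda p: len(q & terms(p)))

def pinpoint_year(snippet):
    """第二步:用时间表达式的实名标注规则,从段落里锁定年份答案。"""
    m = re.search(r"\b(1[89]\d\d|20\d\d)\b", snippet)
    return m.group(1) if m else None

PASSAGES = [
    "The Taj Mahal is located in Agra, India.",
    "Joe DiMaggio's 56 game hitting streak was between May 15, 1941 and July 16, 1941.",
]
```

对问题 "In what year did Joe DiMaggio compile his 56-game hitting streak?",先挑中 DiMaggio 那个段落,再锁定 "1941",正是正文所说 snippet+NE 的两步走。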
实名标注系统也很容易锁定 1941 这个时间单位。An exact answer to the exact question,答案就这样在海量文档中被搜得,好像大海捞针一般神奇。沿着这个路子,11 年后的 IBM 花生研究中心成功地研制出打败人脑的电脑问答系统,获得了电视智能大奖赛 Jeopardy! 的冠军(见报道 COMPUTER CRUSHES HUMAN 'JEOPARDY!' CHAMPS ) ,在全美观众面前大大地出了一次风头,有如当年电脑程序第一次赢得棋赛冠军那样激动人心。 当年成绩较好的问答系统,都不约而同地结合了实名标注与段落搜索的技术: 证明了只要有海量文档,snippet+NE 技术可以自动搜寻回答简单的问题。 四 现状 1999 年的学界在问答系统上初战告捷,我们作为成功者也风光一时,下自成蹊,业界风险投资商蜂拥而至。很快拿到了华尔街千万美元的风险资金,当时的感觉真地好像是在开创工业革命的新纪元。可惜好景不长,互联网泡沫破灭,IT 产业跌入了萧条的深渊,久久不能恢复。投资商急功近利,收紧银根,问答系统也从业界的宠儿变成了弃儿(见 《朝华午拾 - 水牛风云》 )。主流业界没人看好这项技术,比起传统的关键词索引和搜索,问答系统显得不稳定、太脆弱(not robust),也很难 scale up, 业界的重点从深度转向广度,集中精力增加索引涵盖面,包括所谓 deep web。问答系统的研制从业界几乎绝迹,但是这一新兴领域却在学界发芽生根,不断发展着,成为自然语言研究的一个重要分支。IBM 后来也解决了 scale up (用成百上千机器做分布式并行处理)和适应性培训的问题,为赢得大奖赛做好了技术准备。同时,学界也开始总结问答系统的各种类型。一种常见的分类是根据问题的种类。 我们很多人都在中学语文课上,听老师强调过阅读理解要抓住几个WH的重要性:who/what/when/where/how/why(Who did what when, where, how and why?). 抓住了这些WH,也就抓住了文章的中心内容。作为对人的阅读理解的仿真,设计问答系统也正是为了回答这些WH的问题。值得注意的是,这些 WH 问题有难有易,大体可以分成两类:有些WH对应的是实体专名,譬如 who/when/where,回答这类问题相对容易,技术已经成熟。另一类问题则不然,譬如what/how/why,回答这样的问题是对问答学界的挑战。简单介绍一下这三大难题如下。 What is X?类型的问题是所谓定义问题,譬如 What is iPad II? (也包括作为定义的who:Who is Bill Clinton?) 
。这一类问题的特点是问题短小,除去问题词 What 与联系词 is 以外(搜索界叫 stop words,是搜索前应该滤去的,问答系统则在搜索前利用它理解问题的类型),只有一个 X 作为输入,非常不利于传统的关键词检索。回答这类问题最低的要求是一个有外延和种属的定义语句(而不是一个词或词组)。由于任何人或物体都是处在与其他实体的多重关系之中(还记得么,马克思说人是社会关系的总和),要想真正了解这个实体,比较完美地回答这个问题,一个简单的定义是不够的,最好要把这个实体的所有关键信息集中起来,给出一个全方位的总结(就好比是人的履历表与公司的简介一样),才可以说是真正回答了 What/Who is X 的问题。显然,做到这一步不容易,传统的关键词搜索完全无能为力,倒是深度信息抽取可以帮助达到这个目标,要把散落在文档各处的所有关键信息抽取出来,加以整合才有希望(【立委科普:信息抽取】)。

How 类型的问题也不好回答,它搜寻的是解决方案。同一个问题,往往有多种解决方案,譬如治疗一个疾病,可以用各类药品,也可以用其他疗法。因此,比较完美地回答这个 How 类型的问题也就成为问答界公认的难题之一。

Why 类型的问题,是要寻找一个现象的缘由或动机。这些原因有显性表达,更多的则是隐性表达,而且几乎所有的原因都不是简单的词或短语可以表达清楚的,找到这些答案,并以合适的方式整合给用户,自然是一个很大的难题。

下一个姐妹篇《立委科普:自动回答 How 与 Why 的问题》准备详细谈谈后两个难题。这篇已经太长,收住吧。希望读者您不觉得太枯燥,如果有所收获,则幸甚。谢谢您的阅览。

参考文献:
http://en.wikipedia.org/wiki/Question_answering

相关博文:
《新智元笔记:知识图谱和问答系统:开题(1)》 2015-12-21
《新智元笔记:知识图谱和问答系统:how-question QA(2)》 2015-12-22
【立委科普:从产业角度说说NLP这个行当】
《朝华午拾:创业之路》
《朝华午拾 - 水牛风云》
【立委科普:信息抽取】
《朝华午拾:信息抽取笔记》
《立委随笔:机器学习和自然语言处理》

回答: 历史闲话太多,需要更多的细节。
大多数科普读者也就是听个故事
作者: 立委 (*)
日期: 04/23/2011 15:58:42

如果能激发大学生的好奇心,把科研与产业结合的激动人心的情绪传达给年轻人和后来者,就达到目的了。至于知识传播和技术细节都是其次。问答的文献也汗牛充栋了,光 wiki 和综述也不少了,寻求细节的文字随处可见。

也许强调机器进步与软件进步的对比会更有些可读性。尤其是机器进步带来的革命。最好能给出具体的事例来。比如过去编程计算,算出一个结果要三天,而今天3秒都不用。

----------
就“是”论事儿,就“事儿”论是,就“事儿”论“事儿”。
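上文按 WH 疑问词把问题分出难易。问答系统的第一步往往正是这样一个问题分类器:把疑问词映射到期望的答案类型。下面是个最简的假设性草图(类型名是随手起的,真实系统的分类要细得多):

```python
# 疑问词 → 期望答案类型;顺序重要:"what year" 要在 "what" 之前尝试
WH_TYPES = [
    ("who", "PERSON"),
    ("what year", "DATE"),
    ("when", "DATE"),
    ("where", "LOCATION"),
    ("how", "METHOD"),
    ("why", "REASON"),
    ("what", "DEFINITION"),
]

def question_type(question):
    """在问句中找第一个命中的疑问词,返回其期望答案类型。"""
    q = " " + question.lower() + " "
    for wh, answer_type in WH_TYPES:
        if " " + wh + " " in q:
            return answer_type
    return "UNKNOWN"
```

who/when/where 映射到成熟的实名类型,回答相对容易;落到 DEFINITION/METHOD/REASON 的,正是上文所说的 What/How/Why 三大难题。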
wow,听上去比伟哥的发明还要伟大,I never knew this side of NLP。

我一辈子干的就是自然语言处理这行,即 NLP(Natural Language Processing),最近才知道它还有 seductive 的一面。不过,我特别喜欢这个广告:

Quote
NLP is not magic, but the results you can get sometimes seem almost magical.
(“NLP 不是魔术,但是,其结果有时几乎就是魔术一般神奇。”)
http://www.confidencenow.com/nlp-seduction.htm

真地这么神么?是的,我们的NLP技术就是如此。至于,是否迷惑异性也有这么神,就不得而知了。

老友说:此NLP非彼NLP也。有着能迷惑异性的 seductive 一面的NLP是“Neuro-Linguistic Programming”(神经语言程序学),指的是一种心理疗法,详见wiki如下:

Quote
Neuro-linguistic programming (NLP) is an approach to psychotherapy and organizational change based on a model of interpersonal communication chiefly concerned with the relationship between successful patterns of behaviour and the subjective experiences (esp. patterns of thought) underlying them, and a system of alternative therapy based on this which seeks to educate people in self-awareness and effective communication, and to change their patterns of mental and emotional behaviour.
http://en.wikipedia.org/wiki/Neuro-linguistic_programming

感觉就是一种克服心理障碍的疗法。类似于教结巴讲话:很多结巴主要不是生理性障碍,而是心理障碍,越急越结巴。很多腼腆的人,见到异性就脸红的人也是如此,需要克服心理障碍才能自如。
Useful Tools Information Retrieval Lemur/Indri The Lemur Toolkit for Language Modeling and Information Retrieval http://www.lemurproject.org/ Indri: Lemur's latest search engine Lucene/Nutch Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. http://lucene.apache.org/ http://www.nutch.org/ WGet GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive commandline tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc. http://www.gnu.org/software/wget/wget.html Natural Language Processing EGYPT: A Statistical Machine Translation Toolkit http://www.clsp.jhu.edu/ws99/projects/mt/ GIZA++ (Statistical Machine Translation) http://www.fjoch.com/GIZA++.html GIZA++ is an extension of the program GIZA (part of the SMT toolkit EGYPT) which was developed by the Statistical Machine Translation team during the summer workshop in 1999 at the Center for Language and Speech Processing at Johns-Hopkins University (CLSP/JHU). GIZA++ includes a lot of additional features. The extensions of GIZA++ were designed and written by Franz Josef Och. PHARAOH (Statistical Machine Translation) http://www.isi.edu/licensed-sw/pharaoh/ a beam search decoder for phrase-based statistical machine translation models OpenNLP: http://opennlp.sourceforge.net/ MINIPAR by Dekang Lin (Univ. of Alberta, Canada) MINIPAR is a broad-coverage parser for the English language. An evaluation with the SUSANNE corpus shows that MINIPAR achieves about 88% precision and 80% recall with respect to dependency relationships. MINIPAR is very efficient, on a Pentium II 300 with 128MB memory, it parses about 300 words per second. http://www.cs.ualberta.ca/~lindek/minipar.htm WordNet http://wordnet.princeton.edu/ WordNet is an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. 
English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets. WordNet was developed by the Cognitive Science Laboratory at Princeton University under the direction of Professor George A. Miller (Principal Investigator). HowNet http://www.keenage.com/ HowNet is an on-line common-sense knowledge base unveiling inter-conceptual relations and inter-attribute relations of concepts as connoting in lexicons of the Chinese and their English equivalents. Statistical Language Modeling Toolkit http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html The CMU-Cambridge Statistical Language Modeling toolkit is a suite of UNIX software tools to facilitate the construction and testing of statistical language models. SRI Language Modeling Toolkit www.speech.sri.com/projects/srilm/ SRILM is a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation. It has been under development in the SRI Speech Technology and Research Laboratory since 1995. ReWrite Decoder http://www.isi.edu/licensed-sw/rewrite-decoder/ The ISI ReWrite Decoder Release 1.0.0a by Daniel Marcu and Ulrich Germann. It is a program that translates from one natural languge into another using statistical machine translation. GATE (General Architecture for Text Engineering) http://gate.ac.uk/ A Java Library for Text Engineering Machine Learning YASMET: Yet Another Small MaxEnt Toolkit (Statistical Machine Learning) http://www.fjoch.com/YASMET.html LibSVM http://www.csie.ntu.edu.tw/~cjlin/libsvm/ LIBSVM is an integrated software for support vector classification, (C-SVC, nu-SVC ), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM ). It supports multi-class classification. 
SVM Light 由cornell的Thorsten Joachims在dortmund大学时开发,成为LibSVM之后最为有名的SVM软件包。开源,用C语言编写,用于ranking问题 http://svmlight.joachims.org/ CLUTO http://www-users.cs.umn.edu/~karypis/cluto/ a software package for clustering low- and high-dimensional datasets CRF++ http://chasen.org/~taku/software/CRF++/ Yet Another CRF toolkit for segmenting/labelling sequential data CRF(Conditional Random Fields),由HMM/MEMM发展起来,广泛用于IE、IR、NLP领域 SVM Struct http://www.cs.cornell.edu/People/tj/svm_light/svm_struct.html SVMstruct is a Support Vector Machine (SVM) algorithm for predicting multivariate outputs. It performs supervised learning by approximating a mapping h: X -- Y using labeled training examples (x1,y1), ..., (xn,yn). Unlike regular SVMs, however, which consider only univariate predictions like in classification and regression, SVMstruct can predict complex objects y like trees, sequences, or sets. Examples of problems with complex outputs are natural language parsing, sequence alignment in protein homology detection, and markov models for part-of-speech tagging. SVMstruct can be thought of as an API for implementing different kinds of complex prediction algorithms. Currently, we have implemented the following learning tasks: SVMmulticlass: Multi-class classification. Learns to predict one of k mutually exclusive classes. This is probably the simplest possible instance of SVMstruct and serves as a tutorial example of how to use the programming interface. SVMcfg: Learns a weighted context free grammar from examples. Training examples (e.g. for natural language parsing) specify the sentence along with the correct parse tree. The goal is to predict the parse tree of new sentences. SVMalign: Learning to align sequences. Given examples of how sequence pairs align, the goal is to learn the substitution matrix as well as the insertion and deletion costs of operations so that one can predict alignments of new sequences. SVMhmm: Learns a Markov model from examples. Training examples (e.g. 
for part-of-speech tagging) specify the sequence of words along with the correct assignment of tags (i.e. states). The goal is to predict the tag sequences for new sentences. Misc Notepad++ 一个开源编辑器,支持C#,perl,CSS等几十种语言的关键字,功能可与新版的UltraEdit,Visual Studio .NET媲美 http://notepad-plus.sourceforge.net WinMerge : 用于文本内容比较,找出不同版本的两个程序的差异 winmerge.sourceforge.net/ OpenPerlIDE : 开源的perl编辑器,内置编译、逐行调试功能 open-perl-ide.sourceforge.net/ ps: 论起编辑器偶见过的最好的还是VS.NET了,在每个function前面有+/-号支持expand/collapse,支持区域copy/cut/paste,使用ctrl+ c/ctrl+x/ctrl+v可以一次选取一行,使用ctrl+k+c/ctrl+k+u可以comment/uncomment多行,还有还有...... Visual Studio .NET is really kool:D Berkeley DB http://www.sleepycat.com/ Berkeley DB不是一个关系数据库,它被称做是一个嵌入式数据库:对于c/s模型来说,它的client和server共用一个地址空间。由于数据库最初是从文件系统中发展起来的,它更像是一个key-value pair的字典型数据库。而且数据库文件能够序列化到硬盘中,所以不受内存大小限制。BDB有个子版本Berkeley DB XML,它是一个xml数据库:以xml文件形式存储数据?BDB已被包括microsoft、google、HP、ford、motorola等公司嵌入到自己的产品中去了 Berkeley DB (libdb) is a programmatic toolkit that provides embedded database support for both traditional and client/server applications. It includes b+tree, queue, extended linear hashing, fixed, and variable-length record access methods, transactions, locking, logging, shared memory caching, database recovery, and replication for highly available systems. DB supports C, C++, Java, PHP, and Perl APIs. It turns out that at a basic level Berkeley DB is just a very high performance, reliable way of persisting dictionary style data structures - anything where a piece of data can be stored and looked up using a unique key. The key and the value can each be up to 4 gigabytes in length and can consist of anything that can be crammed in to a string of bytes, so what you do with it is completely up to you. The only operations available are store this value under this key, check if this key exists and retrieve the value for this key so conceptually it's pretty simple - the complicated stuff all happens under the hood. 
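上面说 Berkeley DB 本质上是一个可持久化的 key-value 字典,只有 store/check/retrieve 三种操作。用 Python 标准库的 dbm 模块(接口风格与 BDB 的字典式用法相近,这里仅作类比示意,并非 BDB 本身)可以体会这种用法:

```python
import dbm
import os
import tempfile

# 在临时目录建一个 key-value 库,写入后重新打开读取
path = os.path.join(tempfile.mkdtemp(), "demo.db")

with dbm.open(path, "c") as db:   # "c": 不存在则创建
    db["interest rate"] = "利率"   # store this value under this key
    db["hello"] = "world"

with dbm.open(path, "r") as db:   # 重新打开,数据已持久化到磁盘
    assert db["hello"] == b"world"  # retrieve:取回的是 bytes
```

key 和 value 都按字节串存取(str 会自动编码),怎么解释完全由应用自己决定,正是上文所说 "what you do with it is completely up to you"。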
case study:
Ask Jeeves uses Berkeley DB to provide an easy-to-use tool for searching the Internet.
Microsoft uses Berkeley DB for the Groove collaboration software.
AOL uses Berkeley DB for search tool meta-data and other services.
Hitachi uses Berkeley DB in its directory services server product.
Ford uses Berkeley DB to authenticate partners who access Ford's Web applications.
Hewlett Packard uses Berkeley DB in several products, including storage, security and wireless software.
Google uses Berkeley DB High Availability for Google Accounts.
Motorola uses Berkeley DB to track mobile units in its wireless radio network products.

LaTeX
LaTeX, written as LaTeX in plain text, is a document preparation system for the TeX typesetting program. It offers programmable desktop publishing features and extensive facilities for automating most aspects of typesetting and desktop publishing, including numbering and cross-referencing, tables and figures, page layout, bibliographies, and much more. LaTeX was originally written in 1984 by Leslie Lamport and has become the dominant method for using TeX; few people write in plain TeX anymore. The current version is LaTeX2e.
中文套装可以在 http://www.ctex.org 找到
http://learn.tsinghua.edu.cn:8080/2001315450/comp.html by 王垠

EditPlus
http://www.editplus.com/
EditPlus is an Internet-ready 32-bit text editor, HTML editor and programmer's editor for Windows. While it can serve as a good replacement for Notepad, it also offers many powerful features for Web page authors and programmers.
EditPlus 当前最新版本是2.21,BrE 和 AmE 的 spell checker 需要单独下载安装包安装。

GVim: Vi IMproved
http://www.vim.org/index.php
Vim is an advanced text editor that seeks to provide the power of the de-facto Unix editor 'Vi', with a more complete feature set. It's useful whether you're already using vi or using a different editor. Users of Vim 5 should consider upgrading to Vim 6, which is greatly enhanced since Vim 5.
Vim is often called a programmer's editor, and so useful for programming that many consider it an entire IDE. It's not just for programmers, though. Vim is perfect for all kinds of text editing, from composing email to editing configuration files. 普通windows用户可以从这个链接下载ftp://ftp.vim.org/pub/vim/pc/gvim64.exe Cygwin : GNU + Cygnus + Windows http://www.cygwin.com/ Cygwin is a Linux-like environment for Windows. It consists of two parts: A DLL (cygwin1.dll) which acts as a Linux API emulation layer providing substantial Linux API functionality. A collection of tools, which provide Linux look and feel. MinGW: Minimalistic GNU for Windows http://www.mingw.org/ MinGW: A collection of freely available and freely distributable Windows specific header files and import libraries combined with GNU toolsets that allow one to produce native Windows programs that do not rely on any 3rd-party C runtime DLLs. 在windows下编译、移植unix/linux平台的软件。cygwin相当于在windows系统层上模拟了一个POSIX-compliant的layer(库文件是cygwin1.dll);而mingw则是使用 windows自身的库文件(msvcrt.dll)实现了一些符合POSIX spec的功能,并不是完全POSIX-compliant。mingw其实是cygwin的一个branch,由于它没有实现linux api的模拟层,所以开销要比cygwin低些。 CutePDF Writer http://www.cutepdf.com Portable Document format (PDF) is the de facto standard for the secure and reliable distribution and exchange of electronic documents and forms around the world. CutePDF Writer (formerly CutePDF Printer) is the free version of commercial PDF creation software. CutePDF Writer installs itself as a printer subsystem. This enables virtually any Windows applications (must be able to print) to create professional quality PDF documents - with just a push of a button! 比起acrobat来,一大优点就是它是免费的。而且一般word图表、公式的转换效果很好,what you see is what you get,哈哈。可能需要ps2pdf converter,在该站点有链接提供下载 R http://www.r-project.org/ R is a language and environment for statistical computing and graphics. 
It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.

R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity. One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.

R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.

R 统计软件与 MATLAB 类似,都是用在科学计算领域的。不同的是它是开源的东东 :)

From: http://kapoc.blogdriver.com/kapoc/1268927.html
from: http://www.comp.nus.edu.sg/~kwang/MiscTools.html
http://gump-bean.javaeye.com/category/74134?show_full=true
Our ACL short paper was rejected, which is a bit depressing. I had assumed that a short paper is, by design, the early report of work in progress, so the experiments need not be fully complete. Apparently I was wrong. Because of the page limit, the experimental section was not described in much detail and the experiments were not fully developed; I thought a preliminary confirmation of our basic hypothesis would suffice. The three reviewers, however, were unanimous:

(1) "the description of the experiment on Semeval 2007 needs more details. It's not clear how the training set expansion is done and the different systems are defined. This could be the more interesting part of the paper but is vaguely described in one paragraph and consequently difficult to understand. A comparison with the features used by other Semeval participants would help to understand the contribution of the proposed technique."

(2) "All in all, the results over ngrams are interesting, but the application to WSD needs more work."

(3) "It will be more interesting to see your comparison for several languages --- currently I find your results too limited for ACL."

And, once again, the problem of writing in a foreign language was exposed, even though I had found someone to proofread the paper:

"The paper would really profit from an English native speaker for proof-reading. It is partially very hard to understand and some sentences just don't make much sense. This goes beyond the standard number of mistakes that are unavoidable for non-native speakers."
ICGEC-2010-IS20
The Fourth International Conference on Genetic and Evolutionary Computing
December 13-15, 2010, ShenZhen, China
http://bit.kuas.edu.tw/~icgec10/

Session Title: Natural Language Processing and Intelligent Computation (NLPIC)

Call for Papers

We are organizing an invited session on Natural Language Processing and Intelligent Computation for ICGEC-2010, which will be held in Shenzhen, China on December 13-15, 2010. We expect that individuals and research institutions in the areas of both Intelligent Computation and NLP will pay attention to this session, which may help boost both areas. The topics of the session include, but are not limited to:
1. Genetic algorithms for natural language processing;
2. Genetic algorithms for speech processing;
3. Computational intelligence and semantic computation;
4. Application issues of NLP-based computational intelligence;
5. Other topics of relevance to computational intelligence and NLP applications.

Important Dates
Deadline for paper submission: May 31, 2010
Notification: July 31, 2010
Deadline for camera-ready paper submission: August 31, 2010

Paper Submission
Papers are invited from prospective authors with interest in the related areas. Each paper should follow the IEEE paper format (DOC, LaTeX formatting macros, PDF) with title, authors' names, affiliations and email addresses, an abstract of up to 150 words, and a two-column body of 4 single-spaced pages with font size 10 pt. All papers must be submitted electronically in PDF format only and be e-mailed to Dr.
Peng Jin at jandp@pku.edu.cn

For any questions, please feel free to contact the following organizers:

Session Organizers

Yao Liu
Associate Professor
Institute of Scientific and Technical Information of China
No. 15 Fuxing Road, Haidian District, Beijing 100038, China
E-mail: liuy@istic.ac.cn
Tel: 086-010-58882053

Peng Jin
Assistant Professor, Doctor
School of Computer Science, Leshan Normal University
No. 778 Binhe Rd., Shizhong District, 614004, Leshan, Sichuan, China
E-mail: jandp@pku.edu.cn
Tel: 086-8332276382-622
Publications

Srihari, R., W. Li and X. Li. 2006. Question Answering Supported by Multiple Levels of Information Extraction. Book chapter in T. Strzalkowski & S. Harabagiu (eds.), Advances in Open-Domain Question Answering. Springer, 2006, ISBN: 1-4020-4744-4. online info
Srihari, R., W. Li, C. Niu and T. Cornell. 2006. InfoXtract: A Customizable Intermediate Level Information Extraction Engine. Journal of Natural Language Engineering, 12(4), 1-37, 2006. online info
Niu, C., W. Li, R. Srihari, and H. Li. 2005. Word Independent Context Pair Classification Model for Word Sense Disambiguation. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005).
Srihari, R., W. Li, L. Crist and C. Niu. 2005. Intelligence Discovery Portal based on Corpus Level Information Extraction. In Proceedings of 2005 International Conference on Intelligence Analysis Methods and Tools.
Niu, C., W. Li and R. Srihari. 2004. Weakly Supervised Learning for Cross-document Person Name Disambiguation Supported by Information Extraction. In Proceedings of ACL 2004.
Niu, C., W. Li, R. Srihari, H. Li and L. Crist. 2004. Context Clustering for Word Sense Disambiguation Based on Modeling Pairwise Context Similarities. In Proceedings of the Senseval-3 Workshop.
Niu, C., W. Li, J. Ding, and R. Srihari. 2004. Orthographic Case Restoration Using Supervised Learning Without Manual Annotation. International Journal of Artificial Intelligence Tools, Vol. 13, No. 1, 2004.
Niu, C., W. Li and R. Srihari. 2004. A Bootstrapping Approach to Information Extraction Domain Porting. AAAI-2004 Workshop on Adaptive Text Extraction and Mining (ATEM), California.
Srihari, R., W. Li and C. Niu. 2004. Corpus-level Information Extraction. In Proceedings of International Conference on Natural Language Processing (ICON 2004), Hyderabad, India.
Li, W., X. Zhang, C. Niu, Y. Jiang, and R. Srihari. 2003. An Expert Lexicon Approach to Identifying English Phrasal Verbs. In Proceedings of ACL 2003.
Sapporo, Japan. pp. 513-520.
Niu, C., W. Li, J. Ding, and R. Srihari. 2003. A Bootstrapping Approach to Named Entity Classification using Successive Learners. In Proceedings of ACL 2003. Sapporo, Japan. pp. 335-342.
Li, W., R. Srihari, C. Niu, and X. Li. 2003. Question Answering on a Case Insensitive Corpus. In Proceedings of Workshop on Multilingual Summarization and Question Answering - Machine Learning and Beyond (ACL-2003 Workshop). Sapporo, Japan. pp. 84-93.
Niu, C., W. Li, J. Ding, and R.K. Srihari. 2003. Bootstrapping for Named Entity Tagging using Concept-based Seeds. In Proceedings of HLT/NAACL 2003. Companion Volume, pp. 73-75, Edmonton, Canada.
Srihari, R., W. Li, C. Niu and T. Cornell. 2003. InfoXtract: A Customizable Intermediate Level Information Extraction Engine. In Proceedings of HLT/NAACL 2003 Workshop on Software Engineering and Architecture of Language Technology Systems (SEALTS). pp. 52-59, Edmonton, Canada.
Li, H., R. Srihari, C. Niu, and W. Li. 2003. InfoXtract Location Normalization: A Hybrid Approach to Geographic References in Information Extraction. In Proceedings of HLT/NAACL 2003 Workshop on Analysis of Geographic References. Edmonton, Canada.
Li, W., R. Srihari, C. Niu, and X. Li. 2003. Entity Profile Extraction from Large Corpora. In Proceedings of Pacific Association for Computational Linguistics 2003 (PACLING03). Halifax, Nova Scotia, Canada.
Niu, C., W. Li, R. Srihari, and L. Crist. 2003. Bootstrapping a Hidden Markov Model for Relationship Extraction Using Multi-level Contexts. In Proceedings of Pacific Association for Computational Linguistics 2003 (PACLING03). Halifax, Nova Scotia, Canada.
Niu, C., Z. Zheng, R. Srihari, H. Li, and W. Li. 2003. Unsupervised Learning for Verb Sense Disambiguation Using Both Trigger Words and Parsing Relations. In Proceedings of Pacific Association for Computational Linguistics 2003 (PACLING03). Halifax, Nova Scotia, Canada.
Niu, C., W. Li, J. Ding, and R.K. Srihari. 2003.
Orthographic Case Restoration Using Supervised Learning Without Manual Annotation. In Proceedings of the Sixteenth International FLAIRS Conference, St. Augustine, FL, May 2003, pp. 402-406.
Srihari, R. and W. Li. 2003. Rapid Domain Porting of an Intermediate Level Information Extraction Engine. In Proceedings of International Conference on Natural Language Processing 2003.
Srihari, R., C. Niu, W. Li, and J. Ding. 2003. A Case Restoration Approach to Named Entity Tagging in Degraded Documents. In Proceedings of International Conference on Document Analysis and Recognition (ICDAR), Edinburgh, Scotland, Aug. 2003.
Li, H., R. Srihari, C. Niu and W. Li. 2002. Location Normalization for Information Extraction. In Proceedings of the 19th International Conference on Computational Linguistics (COLING-2002). Taipei, Taiwan.
Li, W., R. Srihari, X. Li, M. Srikanth, X. Zhang and C. Niu. 2002. Extracting Exact Answers to Questions Based on Structural Links. In Proceedings of Multilingual Summarization and Question Answering (COLING-2002 Workshop). Taipei, Taiwan.
Srihari, R. and W. Li. 2000. A Question Answering System Supported by Information Extraction. In Proceedings of ANLP 2000. Seattle.
Srihari, R., C. Niu and W. Li. 2000. A Hybrid Approach for Named Entity and Sub-Type Tagging. In Proceedings of ANLP 2000. Seattle.
Li, W. 2000. On Chinese parsing without using a separate word segmenter. In Communication of COLIPS 10 (1). pp. 19-68. Singapore.
Srihari, R. and W. Li. 1999. Information Extraction Supported Question Answering. In Proceedings of TREC-8. Washington.
Srihari, R., M. Srikanth, C. Niu, and W. Li. 1999. Use of Maximum Entropy in Back-off Modeling for a Named Entity Tagger. In Proceedings of HKK Conference, Waterloo, Canada.
Li, W. 1997. Chart Parsing Chinese Character Strings. In Proceedings of the Ninth North American Conference on Chinese Linguistics (NACCL-9). Victoria, Canada.
Li, W. 1996.
Interaction of Syntax and Semantics in Parsing Chinese Transitive Patterns. In Proceedings of International Chinese Computing Conference (ICCC'96). Singapore.
Li, W. and P. McFetridge. 1995. Handling Chinese NP Predicate in HPSG. In Proceedings of PACLING-II, Brisbane, Australia.
Liu, Z., A. Fu, and W. Li. 1992. Machine Translation System Based on Expert Lexicon Techniques. In Zhaoxiong Chen (ed.), Progress in Machine Translation Research, pp. 231-242. Dianzi Gongye Publishing House, Beijing. (刘倬,傅爱平,李维 (1992). 基于词专家技术的机器翻译系统,《机器翻译研究新进展》,陈肇雄编辑,电子工业出版社,第 231-242 页,北京)
Li, Uej (Wei). 1991. Lingvistikaj trajtoj de la lingvo internacia Esperanto. In Serta gratulatoria in honorem Juan Régulo, Vol. IV. pp. 707-723. La Laguna: Universidad de La Laguna. http://blog.sciencenet.cn/blog-362400-285729.html
Li, W. and Z. Liu. 1990. Approach to Lexical Ambiguities in Machine Translation. In Journal of Chinese Information Processing, Vol. 4, No. 1. pp. 1-13. Beijing. (李维,刘倬 (1990). 机器翻译词义辨识对策,《中文信息学报》,1990年第一期,第 1-13 页,北京) (Its abstract published in Computer World, 1989/7/26)
Liu, Z., A. Fu, and W. Li. 1989. JFY-IV Machine Translation System. In Proceedings of Machine Translation SUMMIT II. pp. 88-93, Munich.
Li, W. 1988. E-Ch/A Machine Translation System and Its Synthesis in the Target Languages Chinese and Esperanto. In Journal of Chinese Information Processing, Vol. 2, No. 1. pp. 56-60. Beijing. (李维 (1988). E-Ch/A 机器翻译系统及其对目标语汉语和英语的综合,《中文信息学报》,1988年第一期,第 56-60 页,北京)
Li, W. 1988. Lingvistikaj Trajtoj de Esperanto kaj Ghia Mashin-traktado. El Popola Chinio. 1988. Beijing.
Li, W. 1988. An Experiment of Automatic Translation from Esperanto into Chinese and English. World Science and Technology, 1988, No. 1, STEA sub Academia Sinica. pp. 17-20, Beijing.
Liu, Y. and W. Li. 1987. Babelo Estos Nepre Konstruita. El Popola Chinio. 1987. Beijing. (also presented at the First Conference of Esperanto in China, 1985, Kunming)
Li, W. 1986.
Automatika Tradukado el la Internacia Lingvo en la Chinan kaj Anglan Lingvojn. grkg/Humankybernetik, Band 27, Heft 4. pp. 147-152, Germany.

Other Publications

Chinese Dependency Syntax
SBIR Grants (17 final reports, published internally)
Ph.D. Thesis: The Morpho-Syntactic Interface in a Chinese Phrase Structure Grammar
M.A. Thesis (in Chinese): 世界语到汉语和英语的自动翻译试验 – EChA机器翻译系统概述 (Experiments in Automatic Translation from Esperanto into Chinese and English: An Overview of the EChA Machine Translation System)
《立委科普:Machine Translation》 (encoded in Chinese GB)
Li, W. 1997. Outline of an HPSG-style Chinese Reversible Grammar. Vancouver, Canada.
Li, W. 1995. Esperanto Inflection and Its Interface in HPSG. In Proceedings of the 11th North West Linguistics Conference (NWLC), Victoria, Canada.
Li, W. 1994. Survey of Esperanto Inflection System. In Proceedings of the 10th North West Linguistics Conference (NWLC), Burnaby, Canada.
WEI LI
Email: liwei AT sidepark DOT org
Homepage: http://www.sciencenet.cn/m/user_index1.aspx?typeid=128262&userid=362400

(1) Qualifications

Dr. Li is a computational linguist with years of work experience in Natural Language Processing (NLP). His background combines a solid research track record with substantial industrial software development experience. He is now Chief Scientist in a US company, leading the technology team in developing the core engine for sentiment extraction and text analytics for the company's consumer-insight and business-search products. Dr. Li led the NLP team that solved the problem of answering how-questions, the technology foundation for the launch of the research product serving the technology community. After that, he directed the team in automatic sentiment analysis and solved the problem of answering why-questions, an effort that resulted in the launch of the product for extracting consumer insights from social media. He is currently leading multilingual NLP efforts as well as work on identifying demographic information for social media IDs.

In his previous job, Dr. Li was Principal Investigator (PI) at Cymfony on 17 federal grants under DoD SBIR (Air Force and Navy) contracts in the area of NLP/IE (Information Extraction). These efforts led to the development and deployment of the InfoXtract engine and a suite of products, including Cymfony's BrandDashboard, Harmony and Influence, as well as Janya's Semantex engine for government deployment. Dr. Li led the effort that won the first competition in the natural language Question Answering (QA) track at TREC-8 (Text Retrieval Conference, 1999). Dr. Li has published extensively in refereed journals and high-profile international conferences such as ACL and COLING, in the areas of question answering, parsing, word sense disambiguation, information extraction and knowledge discovery.

(2) Employment

2005.11 - present: Chief Scientist
Dr.
Wei Li leads the development of Netbase's core research and natural language processing (NLP) team. Major responsibilities: direct R&D; natural language parsing; technology transfer; business information extraction; sentiment analysis.
- Architect and key developer of the NLP platform for parsing English into logical forms
- Architect and key developer for question answering and business information extraction based on parsing
- Designed and directed the development of sentiment analysis in the Benefit Frame, Problem Frame, 360 Frame, and Preference Frame
- Supports technology transfer into product features in three lines of commercially deployed products

1997.11 - 2005.11: Vice President for R&D/NLP, Cymfony Inc. / Janya Inc. (Cymfony spin-off since 2005.08); Principal Research Scientist since 01/1998; VP since 09/1999
Dr. Wei Li led the development of Cymfony/Janya's core research and natural language processing (NLP) team. Major responsibilities: direct R&D; write grant proposals; transfer technology; develop linguistic modules.
- Chief architect of the core technology InfoXtract for broad-coverage NLP and Information Extraction (IE): designed and developed the key modules for parsing, relationship extraction and event extraction
- Instrumental in helping to close the seed funding and the first round of financing of over 11 million dollars in 2000, and in growing a tiny 2-staff company, when I joined it in 1996, into a 60+ staff technology company in the US IT (Information Technology) sector, with offices in Buffalo, Boston and Bangalore (India) before the spin-off
- Responsible for technology transfer: designed the key features brand tagging, message tracking and quote extraction for the Cymfony flagship product Brand Dashboard(TM)
- Cymfony was nominated several times for the US Small Business Administration Prime Contractor of the Year Award for its outstanding government work
- Cymfony's commercial product has won numerous awards, including The Measurement Standard's Third Annual Product of the Year Award, Finalist for the MITX Awards 2004, Finalist for the 19th Annual Codie Award, and the 2003 Massachusetts Interactive Media Council (MIMC) Awards
- Cymfony has been named one of the 100 Companies That Matter in Knowledge Management by KMWorld, together with other industry leaders
- Principal Investigator (PI) or Co-PI for 17 SBIR (Small Business Innovation Research Phase 1, Phase 2 and Enhancement) grants (about eight million dollars) from the US DoD (Department of Defense) in the area of intelligent information retrieval and extraction

PI, Fusion of Entity Information from Textual Data Sources (Phase I, $100,000), U.S. DoD SBIR (AF), Contract No. FA8750-05-C-0163 (2005)
PI, Automated Verb Sense Identification (Phase II, $750,000), U.S. DoD SBIR (Navy), Contract No. N00178-03-C-1047 (2003-2005)
PI, Automated Verb Sense Identification (Phase I, $100,000), U.S. DoD SBIR (Navy), Contract No.
N00178-02-C-3073 (2002-2003)
Co-PI, An Automated Domain Porting Toolkit for Information Extraction (Phase II, $750,000; Enhancement, $830,000), U.S. DoD SBIR (AF), Contract No. F30602-03-C-0044 (2003-2006)
Co-PI, An Automated Domain Porting Toolkit for Information Extraction (Phase I, $100,000), U.S. DoD SBIR (AF), Contract No. F30602-02-C-0057 (2002-2003)
Co-PI, A Large Scale Knowledge Repository and Information Discovery Portal Derived from Information Extraction (Phase II, $750,000), U.S. DoD SBIR (AF) (2004-2006)
Co-PI, A Large Scale Knowledge Repository and Information Discovery Portal Derived from Information Extraction (Phase I, $100,000), U.S. DoD SBIR (AF) (2003-2004)
Co-PI, Automatically Time Stamping Events in Unrestricted Text (Phase I, $100,000), U.S. DoD SBIR (AF) (2003-2004)
Co-PI, Fusion of Information from Diverse, Textual Media: A Case Restoration Approach (Phase I, $100,000), U.S. DoD SBIR (AF), Contract No. F30602-02-C-0156 (2002-2003)
PI, Intermediate-Level Event Extraction for Temporal and Spatial Analysis and Visualization (Phase II, $750,000; Enhancement, $500,000), U.S. DoD SBIR (AF), Contract No. F30602-01-C-0035 (2001-2003)
PI, Intermediate-Level Event Extraction for Temporal and Spatial Analysis and Visualization (Phase I, $100,000), U.S. DoD SBIR (AF), Contract No. F30502-00-C-0090 (2000-2001)
PI, Flexible Information Extraction Learning Algorithm (Phase II, $750,000; Enhancement, $500,000), U.S. DoD SBIR (AF), Contract No. F30602-00-C-0037 (2000-2002)
PI, Flexible Information Extraction Learning Algorithm (Phase I, $100,000), U.S. DoD SBIR (AF), Contract No. F30602-99-C-0102 (1999-2000)
PI, A Domain Independent Event Extraction Toolkit (Phase II, $750,000), U.S. DoD SBIR (AF), Contract No.
F30602-98-C-0043 (1998-2000)

1986-1991: Assistant Researcher, Institute of Linguistics, CASS (Chinese Academy of Social Sciences)
R&D on the JFY English-to-Chinese Machine Translation Engine project (using COBOL)

1988-1991: Senior Engineer, Gaoli Software Company
- Instrumental in turning the research prototype JFY into GLMT, a real-life software product for English-to-Chinese Machine Translation
- Trained and supervised lexicographers in building up a lexicon of 60,000 entries
- Supervised the testing of thousands of lexicon rules
- GLMT 1.0 was successfully marketed in 1992
- GLMT won numerous prizes, including a Silver Medal at INFORMATICS'92 (Singapore, 1992), a Gold Medal for Electronic Products at the Chinese Science and Technology Exhibition (Beijing, 1992), and various other software prizes (Beijing, 1992-1995)
- Technology partially transferred to VTECH Electronics Ltd for its pocket electronic translator product

1988: Contract grammarian, BSO Software Company, Utrecht, The Netherlands
Chinese Dependency Syntax Project, for use in multilingual MT

(3) Education

2001 PhD in Computational Linguistics, Simon Fraser University, Canada
Thesis: The Morpho-syntactic Interface in a Chinese Phrase Structure Grammar
1992 PhD candidate in Computational Linguistics, CCL/UMIST, UK
1986 M.A. in Machine Translation, Graduate School of Chinese Academy of Social Sciences
Thesis: Automatic Translation from Esperanto to English and Chinese

(4) Prizes and Honors

2001 Outstanding Achievement Award, Department of Linguistics, Simon Fraser University (awarded to the best PhD graduates from the department)
1995-1997 G.R.E.A.T. Award, Science Council, B.C., CANADA (an industry-based grant, funding the effort to bridge my Ph.D.
research with the local industrial needs)
1997 President's Research Stipend, SFU, CANADA
1996 Travel grant for attending ICCC in Singapore, from ICCC'96
1995 Graduate Fellowship (merit-based), SFU, CANADA
1992 Software Second Prize (Aiping Fu and Wei Li), Chinese Academy of Social Sciences, for machine translation database software
1991 Sino-British Friendship Scholarship, supporting my PhD program in the UK (a prestigious scholarship awarded to young Chinese scientists for overseas training in England through a nationwide competition, administered jointly by the British Council, the Sir Pao Foundation and the Education Ministry of China)

(5) Professional Activities

Editor, International Editorial Board, Journal of Chinese Language and Computing
Industrial Advisor, supervising over 20 graduate student interns from SUNY/Buffalo (since 1998)
Reviewer, Second International Joint Conference on Natural Language Processing (IJCNLP-05)
Member, Program Committee, 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP 2004)
Reviewer, Mark Maybury (ed.) New Directions in Question Answering, The AAAI Press, 2003
Member, Program Committee, The 17th Pacific Asia Conference on Language, Information and Computation (PACLIC17), 2003
Member, Program Committee, 20th International Conference on Computer Processing of Oriental Languages (ICCPOL2003), 2003
Panelist, Multilingual Summarization and Question Answering (COLING-2002 Workshop)
Invited talk, "Information Extraction and Natural Language Applications", National Key Lab for NLP, Qinghua University, Beijing, Feb.
2001

(6) Languages

English: fluent
Chinese: native
French: intermediate (3 years of study)
Esperanto: fluent (has published in Esperanto)
Russian: elementary (1 year of study)

(7) Publications

A complete list of publications is available online at http://www.sciencenet.cn/m/user_content.aspx?id=295975