Chinese morphology and syntax: characters form words as words form sentences (字组词 与 词组句, or phrases 短语):
1. The boundary between the two is not clear-cut
2. The rules are similar
3. Compounding: "small syntax", a BIG part of Chinese structures
4. Pipeline steps with adaptive development and patches can handle it; modular development is key to keeping a complex system easy to debug and maintain

System-internal coordination:
1. Many problems can be resolved by coordination inside the system: there is no absolute right or wrong, only what fits better and is easier to maintain, e.g. the two-subject phenomena
2. Layer the system in the large, patch it in the small: this answers the 2 major counter-arguments against pipelines (inter-module dependency; error propagation), so there is no need for all-or-nothing decisions on word segmentation, POS tagging, etc.

Two-subject structures:
我身体好 ("I body good": my health is good)
三星手机屏幕清晰,价格合理 ("Samsung phone screen sharp, price reasonable")
Linguists give different analyses; each has its points and perspectives, and each is valid:
(1) S1+S2+Pred, or Topic+Subj+Pred
(2) "NP1-modifier NP2 + Pred" = NP1+de+NP2 + Pred
(3) NP1 + Pred(NP2+AP): the predicate-compounding analysis
No need to argue; use whichever analysis is convenient. No absolute right or wrong, just different perspectives. The issue is largely system-internal: the parse representation is not the goal, IE is; any tree works as long as it is consistent and supports IE.

Segmenting words (切词) vs. composing words (组词)
Segmentation is an organic part of the system:
1. Precision is not the only criterion: a real story
2. Configurability and ease of debugging matter most
3. Don't put the cart before the horse: two negatives can make a positive; adaptive development vs. pipeline error propagation

A large lexicon is the fundamental remedy:
1. A boundary lexicon: the bigger the better (even though any linguistic lexicon is finite)
2. One purpose of segmentation is semantic tagging: HowNet

Combine segmentation with composition:
1. listable items
2. open-ended items
Word segmentation research should be banned by law :=)

Does it all await a theoretical breakthrough in Chinese grammar?
The methods and tools of Western-language parsing:
1. are usable for Chinese
2. collocations: phrasal verbs at the morpho-syntactic interface
3. need extension, e.g. reduplication unification: 聊聊天; 说说话

The so-called parataxis (意合性) of Chinese:
1. The grammar is relatively elastic
2. Ellipsis is frequent:
(1) 对于这件事, 依我的看法, 我们应该听其自然。("As for this matter, in my view, we should let it take its course.")
(2) 这件事我的看法应该听其自然。(the elliptical version of the same sentence)

The difficulty of parsing:
1. The Chinese mountain is a steep slope: more fine-grained rules are needed (POS, sub-POS, lexical features, word-driven rules); more lexical features are needed (HowNet); a lazy man's approach won't work
2. The English slope is much gentler

Chinese NLP Myth No. 3: substantial progress in Chinese processing awaits a theoretical breakthrough in Chinese grammar

Is word sense disambiguation (WSD) the bottleneck of NLP applications?? No; structural ambiguity is more serious.
1. Keep some non-deterministic paths, following the keep-ambiguity-untouched principle
2. Combine statistics with the rule system

NLP Myth No. 4: WSD is the bottleneck of NLP applications
Uphold the four cardinal principles to develop robust NLP systems
[Research notes: NLP's "sea-of-words" tactics]
Beyond Precision and Recall:
Big-data redundancy helps not only recall but also precision
Instance-based recall at the extraction level vs. concept-based recall at the mining level (the latter is what matters to users)
Our language system reads and analyzes fifty million posts a day, a processing volume of 1.5 billion words
Community benchmarks vs. industry benchmarks; users' experiences

Sentiment Mining based on Chinese Parsing
Thank You / Q&A
Part 1 of the Taipei talk slides: http://blog.sciencenet.cn/blog-362400-677352.html
[Pinned: an overview of Wei Li's NLP posts on ScienceNet (periodically updated)]
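The "keep some non-deterministic paths" idea in the slides above can be sketched in code. This is a hypothetical illustration, not the author's actual system: an early module enumerates every candidate segmentation licensed by a toy lexicon instead of committing to one, and a later, better-informed stage selects among the surviving paths.

```python
# Hypothetical sketch of the "keep ambiguity untouched" principle
# (illustrative only; the lexicon and the scoring are toy stand-ins).

def segment_candidates(text, lexicon):
    """Enumerate every way to cover `text` with lexicon words;
    single characters are always allowed as a fallback."""
    if not text:
        return [[]]
    results = []
    for j in range(1, len(text) + 1):
        prefix = text[:j]
        if j == 1 or prefix in lexicon:
            for rest in segment_candidates(text[j:], lexicon):
                results.append([prefix] + rest)
    return results

def resolve_later(candidates):
    """A downstream stage decides, here with a toy preference
    for fewer (hence longer) words."""
    return min(candidates, key=len)

lexicon = {"今天", "天下", "下雨"}
cands = segment_candidates("今天下雨", lexicon)
# The ambiguous span survives as multiple paths, e.g.
# ['今天', '下雨'] and ['今', '天下', '雨'], until a module
# with enough context is in a position to choose.
best = resolve_later(cands)  # → ['今天', '下雨']
```

The design point is exactly the balance discussed in the slides: the pipeline keeps its modular stages, but no early stage is forced to make a decision it cannot make reliably.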
UPDATE: The time and place of my April 1 talk in Beijing are now confirmed. My thanks to Prof. Sun of the Chinese Information Processing Society for the invitation and arrangements, and to the senior scholar Prof. Dong Zhendong for his suggestion and recommendation.

The location is: Room 334, 3rd floor, Building 5, Institute of Software, Chinese Academy of Sciences, Zhongguancun South 4th Street. Time: 10:00-12:00. It's better to take the subway; the nearest station is 知春路 (Zhichunlu) on Line 13.

I will be passing through Beijing on April 1, but this is no April Fools' joke :=). The exact venue and other details will be updated here once confirmed.

Sentiment Mining from Chinese Social Media in the Big Data Age
by Wei Li, Ph.D. in Computational Linguistics

In this information age of big data, social media such as WeiBo (micro-blog, the Chinese twitter) are more and more influential. The popularity of mobile devices such as smartphones makes it possible for anyone to share observations, experiences, opinions and sentiments anytime, anywhere on social networks such as WeiXin (WeChat). The social media big data from WeiBo, WeiXin, customer review sites, blogs and forums are like a gold mine of intelligence yet to be mined. They come in the form of natural language (Chinese in this case) and contain intelligence on public opinion and consumer sentiment about any topic, brand or product. Automated sentiment mining via Natural Language Processing (NLP) is a must if we (or businesses) do not want to be overwhelmed by the information overload.

Dr. Li's talk will present the design philosophy behind such a sentiment mining system, which he designed and led a team to develop. He will first discuss the value and scope of NLP in sentiment extraction and mining, the pros and cons of rule-based systems versus learning-based classification, and the different levels of sentiment mining that answer different information needs. He will then demonstrate a list of real-life hot topics from Chinese social media as mined by the system, showing the value and future of big data and NLP in areas like automated surveys and social media listening and monitoring for consumer insights.
Sentiment Mining from Chinese Social Media in the Big Data Age (大数据时代中文社会媒体的舆情挖掘)
Dr. Wei Li

With the arrival of the big data age, social media such as WeiBo grow ever more influential. The spread of smartphones and other mobile devices lets ordinary people share what they see, think and feel anytime, anywhere (e.g. via WeiXin). The social media big data of WeiBo, WeiXin, blogs and forums are like mountains of gold rich in intelligence, waiting to be mined. Facing big data, anyone who does not want to drown in the information explosion must resort to automatic means, above all natural language technology that can automatically extract and mine public sentiment.

Dr. Li's talk is based on the customer sentiment extraction and mining system whose development he has led. The talk has two parts. The first discusses the scope of natural language technology in sentiment extraction, compares the pros and cons of statistical classification methods versus rule-based systems, and lays out the hierarchy of levels in sentiment analysis. The second part demonstrates, through a series of hot topics from social media, the value and prospects of big data mining.

Dear Prof. Li, ...... the title and abstract of your talk in Chinese or English, and a simple CV. How about 10:00-12:00 am?

About Dr. Li
A hands-on computational linguist with nearly 30 years of professional experience in Natural Language Processing (NLP), Dr. Li has a track record of making NLP work robustly. He has built three large-scale NLP systems, all transformed into real-life, globally distributed products. He is now Chief Scientist of a fast-growing Silicon Valley company which serves global Fortune 500 companies with consumer insights and social media monitoring.

[Related event: academic talk in Taipei on Chinese grammatical analysis]
Chinese Turing Tests?? Challenging my Chinese dependency parser with puns. The real thing is: structural ambiguity is detectable, but not easily decodable. As for puns, forget it! Do you remember the last time you yourself, an intelligent being designed by almighty God, were puzzled by a pun?

RE: 立委, here come the Turing-test questions for your analysis tool:

大学里有两种人不谈恋爱:一种是谁都看不上,另一种是谁都看不上。
("There are two kinds of people in college who don't date: one kind fancies nobody, and the other kind nobody fancies." The two written clauses are character-for-character identical; only the reading differs.)

After parsing, the two clauses turned out to unify: is the machine really stumped??

Posted by: 立委 Date: 10/11/2012 17:55:00

But (as Mirror says, the most fearsome word in the world is "but"), note how the same string 是谁都看不上 is in fact analyzed: two readings come out.

Reading 1 is bracketed as 【是谁】【都看不上】: 谁 (who) is the logical object (Undergoer) of 是.
Reading 2 is 【是】【谁都看不上】: 谁 (who) is the logical subject (Actor) of 看不上.

Ha, not so dumb after all, my baby. Of course, for one and the same string in one and the same position, one cannot expect the machine to output two different results: practical parsing technology has never gone beyond sentence-level context to decode syntactic structure.

Reportedly there are more Chinese "Turing tests" of this kind:

大学里有两种人最容易被甩:一种人不知道什么【叫做】爱,一种人不知道什么叫【做爱】。
这些人都是原先喜欢一个人,后来喜欢一个人。

An old friend points out that the beauty of the last sentence lies not in segmentation but in the placement of stress; the machine can only give up.

Of course these are all playful puns that can spin even humans dizzy, and people building real-life systems need not be distracted by them. Among actual language phenomena there is plenty of low-hanging fruit; many tractable problems remain untouched by most systems, so the thankless job of teaching machines to recognize puns does not even make the list.

[Wikipedia: Turing test] http://en.wikipedia.org/wiki/Turing_test
《立委科普:机器可以揭开双关语神秘的面纱》 (Science popularization: can machines unveil the mystery of puns?)
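The two readings of the ambiguous string discussed above can at least be written down explicitly. A minimal, hypothetical sketch (not the author's parser) of how a sentence-level analyzer might store both logical analyses, since nothing below the discourse level can choose between them:

```python
# Hypothetical representation of the two analyses of 是谁都看不上
# (illustrative only). A parser restricted to sentence-level context
# can enumerate the readings but cannot pick one: resolving the pun
# needs discourse context (or a human).

READINGS = {
    "是谁都看不上": [
        {   # Reading 1: 【是谁】【都看不上】
            "bracketing": ["是谁", "都看不上"],
            "logical_role": ("谁", "Undergoer", "是"),
        },
        {   # Reading 2: 【是】【谁都看不上】
            "bracketing": ["是", "谁都看不上"],
            "logical_role": ("谁", "Actor", "看不上"),
        },
    ],
}

def analyses(span):
    """Return all stored analyses of a span (empty if unknown)."""
    return READINGS.get(span, [])

assert len(analyses("是谁都看不上")) == 2
```

Listing both readings is the honest output here; forcing a single parse at this level would be exactly the kind of premature disambiguation the keep-ambiguity-untouched principle warns against.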
"Early stage researchers" should not let my praise go to their heads either. As the saying 门道 suggests, there is the door (门) and there is the way (道), and knowing the door is not the same as knowing the way. Getting to the door takes only intelligence and insight; mastering the way takes patience, experience, time, the tempering of fighting on through defeat after defeat, and luck besides. Three feet of ice does not form in one day's cold.

On Thu, Dec 29, 2011 G wrote:

As you titled yourself an early stage researcher, I'd recommend you a recent dialog on something related: http://blog.sciencenet.cn/home.php?mod=space&uid=362400&do=blog&id=523458. He has a point as an experienced practitioner. I quote him here as overall he is negative on what you are going to work on [note: referring to word segmentation research]. And I agree with him that it's time to shift focus to parsing.

2011/12/29 G: Continuation of the dialog, but with an early stage researcher. FYI, as I actually recommended your blogs to him in place of my PhD thesis :)

On Dec 29, 2011, M wrote:

Hi Dr. G,

I just read Liwei's posts and your comments. I partly agree with Liwei's arguments. I think it's just a different perspective on one of the core problems in NLP: disambiguation. Usually, beginners take the pipeline architecture for granted, i.e. segmentation - POS tagging - chunking - parsing, etc. However, given that the ultimate goal is to predict the overall syntactic structures of sentences, the early stages of disambiguation can be considered pruning of the exponential number of possible parse trees. In this sense, Liwei is correct. Since ambiguity is the enemy, it is the system designer's choice what architecture to use and/or when to resolve it. I guess recently many other people in NLP have also realized (and might even widely agree on) the disadvantages of pipeline architectures, which explains why there have been so many "joint learning of X and Y" papers in the past 5 years. In Chinese word segmentation, there are also attempts at doing word segmentation and parsing in one go, which seems promising to me. On the other hand, I think your comments are quite to the point. Current applications mostly utilize very shallow NLP information, so accurate tokenizers/POS taggers/chunkers have their own value.
As for the interaction between linguistic theory and computational linguistics, I think it is quite similar to the relationship between other pairs of science and engineering. Basically, science sets the upper bound of engineering, but at any given level of scientific achievement, engineering by itself still has a huge space of possibilities. Moreover, in this specific case of our interest, CL itself may serve as a tool to advance linguistic theory, as the corpus-based study of linguistics seems to be an inevitable trend.

From: Wei Li
Date: Fri, Dec 30, 2011

He is indeed a very promising young researcher who is willing to think and to air his own opinions. I did not realize that the effect of my series is that I appear to be against the pipeline architecture. In fact I am all for it, as it is the proven, solid architecture for modular engineering development. Of course, having read only my three recent posts, it is not surprising that he got that impression. There is something deeper here: a balance between the pipeline structure and the keep-ambiguity-untouched principle. Making that relationship clear is not easy, but there is a way to do it, based on the experience of adaptive development (another important principle).

[Related post] An old friend in the profession lambasts my "Myths" series for disrupting the NLP order; I stand my ground
The Chinese-processing industry is full of widely circulated, plausible-sounding myths. In the coming series of informal notes, I plan to take them up one by one.

Myth No. 1: word segmentation (切词, also called 分词) is a prerequisite unique to Chinese (or East Asian language) processing, because written Chinese does not delimit words.

Word segmentation as a preliminary step in Chinese processing exists for the convenience of modular development; that much is true. But there is nothing unique about it. Every natural language pipeline has a preliminary step called tokenization, which decomposes the input character string into lexical units. Whatever the writing system, without this step the lexicon's information has nothing to attach to, and further syntactic-semantic analysis that generalizes over lexical categories cannot proceed. Chinese word segmentation is merely one instance of this universal tokenization; there is no "unique" problem here.

Some say: written Chinese does not separate words, characters run one after another with no explicit word boundaries, whereas Western languages delimit words with spaces, so segmentation is a difficulty unique to Chinese processing. This is inaccurate, and linguistically it is even more wrong. Specifically:

1. Single-character words pose no segmentation problem. Although most entries in a Chinese dictionary are multi-character words, there are also single-character words, notably the frequent function words (conjunctions, prepositions, interjections, etc.). For single-character words, written Chinese obviously does have an explicit boundary marker: the natural boundary between characters (if the Chinese character is taken as the minimal unit of linguistic analysis, the morpheme, then its tokenization is trivial: in a double-byte encoding, every two bytes is one character). No space is needed.

2. Multi-character words are compounds: rather than "cutting" words out, we are "composing" them. A modern Chinese multi-character word such as 利率 is a compound, in essence no different from a Western compound (e.g. interest rate), and spaces do not solve the compound-boundary problem either. In fact, identifying a multi-character word can equally be viewed as "cutting" it out of the input character string or as grouping single characters into a chunk; the two are equivalent. In Chinese and Western languages alike, compounds chunk together mainly via dictionary lookup, not via natural delimiters such as spaces (German noun compounding is the one exception among Western languages: its closed-class compounds can still be captured by space-based tokenization plus the lexicon, while open-class compounds require further splitting, known as decompounding). If the left or right boundary of a compound is ambiguous (e.g. the boundary of 天下 may be ambiguous: 今天 下 了 一 场 雨 "it rained today"; the right boundary of the English compound adverb in particular may be ambiguous: in particular cases), then in Chinese and Western languages alike the ambiguity requires context to resolve. In terms of technique, Chinese multi-character segmentation is in no way special: the methods English tokenization uses to recognize compounds like People's Republic of China and in particular apply equally to Chinese segmentation.

Let us look at the problem from another angle. Depending on whether a dictionary is used, tokenization comes in two kinds. Tokenization without a dictionary is generally considered a fairly trivial mechanical process: in Western languages, cut at every space or punctuation mark (actually not so trivial, because the pesky Western period is highly ambiguous). Chinese, it is said, has no spaces, so a dedicated segmentation module must be built. In fact, for this first kind of tokenization Chinese is even simpler than English, because the Chinese character, as a morpheme, is itself a natural unit of segmentation: one character is two bytes, so cut every two bytes. In principle, morphological and syntactic analysis could be built directly on characters, with no Chinese-"specific" segmentation module at all. Note that most Western-language analysis systems have a chunking module after tokenization and POS tagging, which groups basic phrases (e.g. Base NP). Chinese processing usually has such a chunking stage too. Composing characters into words and composing words into phrases can perfectly well be treated as the same kind of chunking, skipping so-called word segmentation altogether. Chunking words into phrases is by nature no different from chunking morphemes (characters) into words. Parsing with no "word segmentation" is thus possible.
Of course, at the practical level, a dedicated segmentation module has its conveniences.

Now consider dictionary-supported tokenization. This second kind of tokenization is what we usually mean by word segmentation, and calling it a step unique to Chinese processing is a misunderstanding, because Western-language processing uses it for compounds just the same. Outside of laboratory toy systems, it is hard to imagine a serious Western-language system coping with all compounds by abstract rules alone, without the help of a dictionary. Indeed, for closed-class compounds, even where abstract morphological rules can chunk some compounds together, dictionary participation is more direct and more beneficial, because the dictionary information of a compound is less ambiguous and thus more useful to downstream processing. The Chinese compound 利率 and the English compound "interest rate" are in essence the same dictionary-based problem; there is nothing "unique" here.

[Related post] 《立委科普: 应该立法禁止分词研究 :=) 》 (Science popularization: word segmentation research should be banned by law :=))
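The point that Chinese 利率 and English "interest rate" are one and the same dictionary-lookup problem can be illustrated with a single routine. A minimal sketch under toy assumptions: greedy longest match is just the simplest stand-in for real compound identification, and real systems would disambiguate boundaries with context.

```python
# A minimal sketch: one greedy longest-match routine chunks Chinese
# characters into words AND English tokens into compounds, illustrating
# that dictionary-based compounding is not Chinese-"specific".
# (Toy lexicons; illustrative only.)

def longest_match(units, lexicon, sep=""):
    """Greedily group minimal units (characters, or space-delimited
    tokens) into the longest entries found in the lexicon; a single
    unit is always a valid fallback."""
    out, i = [], 0
    while i < len(units):
        for j in range(len(units), i, -1):  # try longest span first
            candidate = sep.join(units[i:j])
            if j - i == 1 or candidate in lexicon:
                out.append(candidate)
                i = j
                break
    return out

# Chinese: characters are the natural minimal units (no spaces needed).
print(longest_match(list("调整利率"), {"调整", "利率"}))
# → ['调整', '利率']

# English: space-delimited tokens are the minimal units; the very same
# lookup finds the multi-token compound.
print(longest_match("the interest rate rose".split(), {"interest rate"}, sep=" "))
# → ['the', 'interest rate', 'rose']
```

The only difference between the two calls is the choice of minimal unit and separator; the chunking logic, and the reliance on the lexicon, are identical.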
A dialogue with an old friend in the industry: working really hard on the word "use"

In my ears rings Vice Chairman Lin's earnest instruction on system development:

Quote: Develop with problems in mind; develop flexibly and use flexibly; combine developing with using; develop what is urgently needed first; get immediate results; work really hard on the word "use".

from: http://blog.sciencenet.cn/home.php?mod=space&uid=362400&do=blog&id=510567

This came out of a private exchange with a friend, following the fashion of fabricating celebrity quotes.

~~~~~~~~~~~~

After I posted [Uphold the four cardinal principles to develop robust NLP systems], a senior friend in the field found it very interesting and suggested that I collect and polish my NLP posts into a series, perhaps even a book:

Quote: A good talk from experience (经验之谈). Somehow it reminds me of this: "Study with problems in mind; study flexibly and apply flexibly; combine studying with applying; study what is urgently needed first; get immediate results; work really hard on the word 'apply'." You made a hidden preamble -- a given type of application in a given domain. A recommendation: expand your blog a bit as a series, heading to a book. My friend 吴军 did that quite successfully, of course with a statistics background, so he approached NLP from a math perspective -- the 数学之美 (The Beauty of Mathematics) series. You have very good thoughts and raw material; you just need to put a bit more time into making your writing more approachable -- I am commenting on reader comments like "学习不了" ("can't follow it") and "读起来鸭梨很大" ("stressful to read"). I know you said: "Sometimes I think it should not be made too readable either; this is many years of experience, and if the younger generation wants to learn it, they should suffer a little. :=)" But as you have already put in the effort, why not make it more approachable? The issue is, even if I am willing to suffer a little, I still don't know where to start suffering IF I have never built a real-life NLP system. For example, 词汇主义 (lexicalism) by itself is enough for an article: you need to mention its opponents and its history to put it into context, and then give some examples.

Writing is a matter for the ages; how would I dare turn online scribbles into a book? This is not false modesty; mainly, publishing a book is too much trouble and cannot keep up with the times.

I replied:

吴军's series is super popular. When I first read one of his articles on the Google Blackboard, recommended by a friend, I was amazed at how well he structured and carried the content. It is intriguing. (Side note: his piece on PageRank is somewhat one-sided, though. It gives young people the impression that success in the IT business is driven by technology, when in fact technology always comes second. For a so-called high-tech enterprise, having no technology is fatal, but the key to the enterprise's success is not the technology; that is an obvious fact by now.)

For me, to be honest, I do not aim that high. I never bothered polishing things to pursue perfection, although I did make an effort to link my pieces into a series for convenient cross-reference between related posts. There are missing links that I know I want to write about, but they sort of depend on my mood and time slots.
I guess I am just not pressed and motivated enough to do the writing part. Popularizing the technology is only an occasional side effect of the blogging hobby. The way I prove myself is to show that I can build products worth millions, or even hundreds of millions, of dollars.

What I write online follows my whims; I never write to an assigned topic, not even one I assign myself. Sometimes, when inspiration strikes, I announce what I plan to write next, a self-assigned topic, a sign that I have been mulling it over. Then two days later something else cuts in, the mood and the time are gone, and I drop it. I write about whatever comes along; that is the mindset of being online. The day job is tiring enough; I will never let the internet add to the burden.

So far I have been fairly straightforward about what I write. If there is a readability issue, it is mainly due to my lack of time. Young people should be able to benefit from my writings, especially once they start getting their hands dirty building a system.

Your comments are fun. You can see and appreciate things hidden behind my work more than other readers can. After all, you have published in THE Computational Linguistics journal, and you have almost terminated the entire segmentation field as a scientific area. Seriously, it is my view that there has not been much left to do there since your work on tokenization, in both theory and practice.

I feel some urgency now about having to do Chinese NLP asap. Not many people have been through as much as I have, so I am in a position to potentially build a much more powerful system and make an impact on Chinese NLP, and hopefully on the IT landscape as well. But time passes fast. That is why my focus is on Chinese processing now, day and night. I am also keeping my hands dirty with a couple of European languages, but they are less challenging and exciting.
Curriculum Vitae of Wei Li

(1) Work Experience

2006.11-present: Chief Scientist and architect, designer of the natural language platform and core technology. The NLP platform he designed supports a new-generation search engine for the enterprise market, mining business intelligence from the internet, including product and technology information, customer feedback, etc. The product has been adopted by the research and marketing departments of multiple Fortune 500 companies, proving value that other search engines and tools can hardly replace.

1997/11-2006/03: Cymfony Inc., R&D Department, Buffalo, New York. Principal Research Scientist; Vice President of NLP (from 1999). Wrote research grant proposals that won 18 US government Small Business Innovative Research (SBIR) grants, serving as their Principal Investigator (PI or co-PI), to research and develop a new generation of Information Extraction (IE) technology based on Natural Language Processing (NLP). The technology is embodied in Cymfony's InfoXtract(TM) software series: the InfoXtract NLP/IE engine, its component technologies, lexicon and grammar resources, a Finite State Transducer Toolkit, a Machine Learning Toolkit, and the development platform. The software products built on top of it, Brand Dashboard and Digital Consumer Insight, scan and process thousands of media sources in real time, automatically extract the key information in brand coverage, and filter and consolidate it into data that comprehensively reflect brand trends, giving large enterprises decision support for protecting name brands (their intangible assets) with a breadth and statistical precision that manual analysis can hardly match.

In 2000, helped bring in eleven million dollars of Wall Street high-tech venture capital, growing Cymfony from a company of two or three employees doing general internet business into a high-tech SME with over 70 employees, offices in three locations (Boston and Buffalo in the US, plus a subsidiary in Mumbai, India), professional management, and an IT marketing plan.

In 1999, directed Cymfony's R&D team in the Question Answering track of the Eighth Text REtrieval Conference (TREC-8), judged by the US National Institute of Standards and Technology (NIST), winning first place.

Cymfony's technology and growth were covered by various media, including Fortune, The Wall Street Journal, The Buffalo News, and the Chinese-language World Journal. For its outstanding performance across a series of SBIR projects, Cymfony was nominated for the 2002 US Small Business Administration Prime Contractor of the Year Award.

1987-1991: Institute of Linguistics, Chinese Academy of Social Sciences, Beijing. Assistant Researcher, working on foreign-language-to-Chinese machine translation, natural language processing, and Chinese information processing.

1988-1991: Gaoli Software Company, Beijing. Senior Engineer (part-time), researching and developing the Gaoli English-Chinese machine translation system GLMT. Main work: developed and debugged 800 machine grammar rules; designed and implemented the system's semantic module and background knowledge base; trained and supervised an eight-person team in building a machine translation dictionary of over 60,000 entries and an expert-lexicon rule base of over 10,000 lexical rules; pushed Gaoli to productize GLMT 1.0 (1992). The MT technology was successfully transferred into the pocket electronic dictionary product line of VTech, Hong Kong. GLMT passed official appraisal in the Beijing New Technology Development Zone in January 1992; it won the Beijing Science and Technology Progress Award, the silver award for applied computer software at the INFORMATICS'92 international expo in Singapore, and the gold award for the electronics industry at the 2nd China Science and Technology Expo in 1992, and was included in the Torch Program.

1988: Under a contract with the Dutch software company BSO, wrote a formal dependency syntax of Chinese in service of multilingual machine translation, which was well received.

(2) Education

2001: Ph.D. in computational linguistics, Simon Fraser University, Canada. Dissertation: "The Morpho-syntactic Interface in a Chinese Phrase Structure Grammar." The formal Chinese grammar was successfully applied in an experimental bidirectional English-Chinese machine translation system, demonstrating that a single grammar can serve both the analysis and the generation of Chinese in a bidirectional system. During the Ph.D., served repeatedly as Research Assistant in the Natural Language Lab of the computing science department and as Teaching Assistant or Sessional Instructor in the linguistics department.

1991-1992: Ph.D. candidate, Centre for Computational Linguistics, UMIST (CCL/UMIST), Manchester, UK.

1986: M.A. in machine translation, Department of Linguistics, Graduate School of the Chinese Academy of Social Sciences. Thesis: "Automatic Translation from Esperanto into English and Chinese," one of the few domestic explorations of a one-to-many machine translation system.

1982: B.A. in English, Department of Foreign Languages, Anqing Normal College.

(3) Awards

2001: Outstanding Achievement Award, Department of Linguistics, Simon Fraser University (given to the department's best Ph.D. graduates).
1995-1997: G.R.E.A.T. Award, Science Council of British Columbia, Canada, which promotes the pairing of applied Ph.D. projects with local high-tech companies.
1997: President's Research Stipend.
1996: Special travel grant from the ICCC conference, Singapore, to present a paper.
1995: Graduate Fellowship.
1992: Second-prize software award of the Chinese Academy of Social Sciences, shared with Fu Aiping, for a machine translation database application.
1991: Sino-British Friendship Scholarship (jointly funded by the Chinese Ministry of Education, the British Council, and the Pao Yu-kong Foundation) for study in the UK.

(4) Other Professional Activities

2002-2005: International editorial board member, Journal of Chinese Language and Computing, Singapore.
1998-2004: Industrial Advisor, supervising more than 20 Ph.D. or Master's candidates (from the computer science or linguistics departments of SUNY Buffalo) in summer internship projects with industrial application prospects.

(5) Publications

Srihari, R., W. Li and X. Li. 2006. Question Answering Supported by Multiple Levels of Information Extraction. Book chapter in T. Strzalkowski & S. Harabagiu (eds.), Advances in Open-Domain Question Answering. Springer, 2006. ISBN: 1-4020-4744-4.
Srihari, R., W. Li, C. Niu and T. Cornell. 2006. InfoXtract: A Customizable Intermediate Level Information Extraction Engine. Journal of Natural Language Engineering, 12(4), 1-37, 2006.
Niu, C., W. Li, R. Srihari, and H. Li. 2005. Word Independent Context Pair Classification Model for Word Sense Disambiguation. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005).
Srihari, R., W. Li, L. Crist and C. Niu. 2005. Intelligence Discovery Portal based on Corpus Level Information Extraction. In Proceedings of the 2005 International Conference on Intelligence Analysis Methods and Tools.
Niu, C., W. Li and R. Srihari. 2004. Weakly Supervised Learning for Cross-document Person Name Disambiguation Supported by Information Extraction. In Proceedings of ACL 2004.
Niu, C., W. Li, R. Srihari, H. Li and L. Christ. 2004.
Context Clustering for Word Sense Disambiguation Based on Modeling Pairwise Context Similarities. In Proceedings of the Senseval-3 Workshop.
Niu, C., W. Li, J. Ding, and R. Srihari. 2004. Orthographic Case Restoration Using Supervised Learning Without Manual Annotation. International Journal of Artificial Intelligence Tools, Vol. 13, No. 1, 2004.
Niu, C., W. Li and R. Srihari. 2004. A Bootstrapping Approach to Information Extraction Domain Porting. AAAI-2004 Workshop on Adaptive Text Extraction and Mining (ATEM), California.
Srihari, R., W. Li and C. Niu. 2004. Corpus-level Information Extraction. In Proceedings of the International Conference on Natural Language Processing (ICON 2004), Hyderabad, India.
Li, W., X. Zhang, C. Niu, Y. Jiang, and R. Srihari. 2003. An Expert Lexicon Approach to Identifying English Phrasal Verbs. In Proceedings of ACL 2003. Sapporo, Japan. pp. 513-520.
Niu, C., W. Li, J. Ding, and R. Srihari. 2003. A Bootstrapping Approach to Named Entity Classification using Successive Learners. In Proceedings of ACL 2003. Sapporo, Japan. pp. 335-342.
Li, W., R. Srihari, C. Niu, and X. Li. 2003. Question Answering on a Case Insensitive Corpus. In Proceedings of the Workshop on Multilingual Summarization and Question Answering - Machine Learning and Beyond (ACL-2003 Workshop). Sapporo, Japan. pp. 84-93.
Niu, C., W. Li, J. Ding, and R.K. Srihari. 2003. Bootstrapping for Named Entity Tagging using Concept-based Seeds. In Proceedings of HLT/NAACL 2003. Companion Volume, pp. 73-75, Edmonton, Canada.
Srihari, R., W. Li, C. Niu and T. Cornell. 2003. InfoXtract: A Customizable Intermediate Level Information Extraction Engine. In Proceedings of the HLT/NAACL 2003 Workshop on Software Engineering and Architecture of Language Technology Systems (SEALTS). pp. 52-59, Edmonton, Canada.
Li, H., R. Srihari, C. Niu, and W. Li. 2003. InfoXtract Location Normalization: A Hybrid Approach to Geographic References in Information Extraction.
In Proceedings of the HLT/NAACL 2003 Workshop on Analysis of Geographic References. Edmonton, Canada.
Li, W., R. Srihari, C. Niu, and X. Li. 2003. Entity Profile Extraction from Large Corpora. In Proceedings of the Pacific Association for Computational Linguistics 2003 (PACLING03). Halifax, Nova Scotia, Canada.
Niu, C., W. Li, R. Srihari, and L. Crist. 2003. Bootstrapping a Hidden Markov Model for Relationship Extraction Using Multi-level Contexts. In Proceedings of the Pacific Association for Computational Linguistics 2003 (PACLING03). Halifax, Nova Scotia, Canada.
Niu, C., Z. Zheng, R. Srihari, H. Li, and W. Li. 2003. Unsupervised Learning for Verb Sense Disambiguation Using Both Trigger Words and Parsing Relations. In Proceedings of the Pacific Association for Computational Linguistics 2003 (PACLING03). Halifax, Nova Scotia, Canada.
Niu, C., W. Li, J. Ding, and R.K. Srihari. 2003. Orthographic Case Restoration Using Supervised Learning Without Manual Annotation. In Proceedings of the Sixteenth International FLAIRS Conference, St. Augustine, FL, May 2003, pp. 402-406.
Srihari, R. and W. Li. 2003. Rapid Domain Porting of an Intermediate Level Information Extraction Engine. In Proceedings of the International Conference on Natural Language Processing 2003.
Srihari, R., C. Niu, W. Li, and J. Ding. 2003. A Case Restoration Approach to Named Entity Tagging in Degraded Documents. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), Edinburgh, Scotland, Aug. 2003.
Li, H., R. Srihari, C. Niu and W. Li. 2002. Location Normalization for Information Extraction. In Proceedings of the 19th International Conference on Computational Linguistics (COLING-2002). Taipei, Taiwan.
Li, W., R. Srihari, X. Li, M. Srikanth, X. Zhang and C. Niu. 2002. Extracting Exact Answers to Questions Based on Structural Links. In Proceedings of Multilingual Summarization and Question Answering (COLING-2002 Workshop). Taipei, Taiwan.
Srihari, R. and W. Li. 2000.
A Question Answering System Supported by Information Extraction. In Proceedings of ANLP 2000. Seattle.
Srihari, R., C. Niu and W. Li. 2000. A Hybrid Approach for Named Entity and Sub-Type Tagging. In Proceedings of ANLP 2000. Seattle.
Li, W. 2000. On Chinese parsing without using a separate word segmenter. In Communication of COLIPS 10 (1). pp. 19-68. Singapore.
Srihari, R. and W. Li. 1999. Information Extraction Supported Question Answering. In Proceedings of TREC-8. Washington.
Srihari, R., M. Srikanth, C. Niu, and W. Li. 1999. Use of Maximum Entropy in Back-off Modeling for a Named Entity Tagger. In Proceedings of the HKK Conference, Waterloo, Canada.
Li, W. 1997. Chart Parsing Chinese Character Strings. In Proceedings of the Ninth North American Conference on Chinese Linguistics (NACCL-9). Victoria, Canada.
Li, W. 1996. Interaction of Syntax and Semantics in Parsing Chinese Transitive Patterns. In Proceedings of the International Chinese Computing Conference (ICCC'96). Singapore.
Li, W. and P. McFetridge. 1995. Handling Chinese NP Predicate in HPSG. In Proceedings of PACLING-II, Brisbane, Australia.
Uej Li. 1991. Lingvistikaj trajtoj de la lingvo internacia Esperanto. In Serta gratulatoria in honorem Juan Régulo, Vol. IV. pp. 707-723. La Laguna: Universidad de La Laguna.
Liu, Z., A. Fu, and W. Li. 1989. JFY-IV Machine Translation System. In Proceedings of Machine Translation SUMMIT II. pp. 88-93, Munich.
刘倬,傅爱平,李维 (1992). 基于词专家技术的机器翻译系统,《机器翻译研究新进展》,陈肇雄编辑,电子工业出版社,第 231-242 页,北京.
李维,刘倬 (1990). 机器翻译词义辨识对策,《中文信息学报》,1990年第一期,第 1-13 页,北京.
刘倬,傅爱平,李维 (1989). JFY-IV 机器翻译系统概要,《中文信息学报》,1989年第四期,第 1-10 页,北京.
李维 (1988). E-Ch/A 机器翻译系统及其对目标语汉语和英语的综合,《中文信息学报》,1988年第一期,第 56-60 页,北京.
Other publications (omitted)