科学网


Tag: HPSG


Related blog posts

【Li-Bai Dialogues No. 41: Gui 冒 VP 的风险 (Gui runs the risk of VP)】
liwei999 2017-4-28 17:31
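Bai: “这些国家的统治者必须变革,不然就是在冒被一脚踢开的风险。” (The rulers of these countries must reform, or else they are running the risk of being kicked out.)
1. 冒……风险 (run ... risk) is a separable word (离合词).
2. 风险 (risk) is a noun of the "N/S" type; it does not back-fill into its attributive clause.
3. 被 (BEI) is promoted from N+ to N and takes one of the two slots provided by 一脚踢开 (kick out).
4. The antecedent 这些国家的统治者 (the rulers of these countries) fills the other slot provided by 一脚踢开.

Li: 【冒VP的风险 (running the risk of VP)】 Chinese separable words are one kind of frame construction. What the XP inside a separable word is gets decided by the separable word itself; we can regard it as stipulated by the word's subcat template. In this case a VP is required. The separable word 冒-险 (冒 - 之|的 - 险|风险) is itself a verb-object VP, so we end up with two VPs, an inner one and an outer one: “Gui 冒杀头之险” (Gui runs the risk of being beheaded). The subcat says the following.

1. Gui 冒险 (Gui takes a risk).

2. Gui 杀头: this actually means being beheaded. 杀-头 (behead, literally kill-head) is itself a separable word, and what it takes inside should be an NP. Once externalized, that NP becomes the syntactic subject and the logical object, the so-called implicit passive: Gui杀头 == Gui被杀头 == 把Gui杀头 == 杀Gui的头 == 对Gui杀头. This is what linguistics is about: micro-linguistics, directed by subcat. Subcat is the interface between what is language-specific and what is universal.

3. The relation between the two VPs is of course also determined by the outer separable word 冒-险. Specifically, the inner VP is an appositive of the outer VP and supplies the content of 冒险: what risk? 杀头之险 (the risk of being beheaded). This appositive originates from the form in which the inner VP is an attributive of the object inside the outer VP. As the separable word is dynamically composed into a verb-object compound, the appositive attributive of the object is carried along with it (the attributive turns into an adverbial: the head unit has grown larger, but its nature as a modifier is unchanged). This phenomenon is common to verb-object separable words; another example is 洗个痛快的澡 == 痛快洗澡 (take a thoroughly satisfying bath == bathe to one's heart's content).

4. The remaining syntactic and semantic odds and ends are still determined by the subcat of the outer VP. For instance, the inner VP is a non-predicate VP, so it cannot take syntactic (or morphological) tense-aspect forms, and semantically it expresses an infinitive. The outer VP is of course a predicate VP and can, for example, take the progressive aspect: “Gui正在冒杀头之险” (Gui is running the risk of being beheaded).

To sum up: subcat can carry very rich content and very complex stipulations; it links syntactic forms (patterns) to their corresponding semantics. Fortunately, subcat is determined entirely by lexical entries, so however complex and fiddly it gets, from a lexicalist point of view it is not hard to keep under control.

In theory, this complexity is best described by a complex feature structure for subcat (a SUBCAT typed feature structure). The example above, with its syntactic and semantic constraints and their interface to logical semantics, can be expressed transparently and in fine detail in complex feature structures such as those of HPSG. For someone playing with symbolic logic in the ivory tower, this is the heaven of symbolic logic: the specific and the universal, lexicon and grammar, syntax and semantics all in harmony, with beauty everywhere. That is why, as I said before, one can get hooked on HPSG. Below are a few illustrations of HPSG complex feature structures, showing the elegance of unification behind all the stacked layers:

[Figures: HPSG feature structure diagrams]

But in the end we abandoned complex feature structures, for the sake of linear-time speed, simplicity, multiple layers, modularity and ease of maintenance. In short, for worldly convenience we bade farewell to the ideal heaven of symbols.

【Related】
【语义计算:李白对话录系列】
中文处理
Parsing
【置顶:立委NLP博文一览】
《朝华午拾》总目录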
Chart Parsing Chinese Character Strings
liwei999 2016-9-20 20:59
【Liwei's note】 I have been sorting through my old writings lately, digging up the past. Some arguments can be made to sound perfectly convincing and still hit a wall in practice. In theory, chart parsing can resolve segmentation ambiguity. In practice, the maddening problem of spurious ambiguity, together with the restrictions that single-layer parsing places on deep parsing (not to mention speed) and the lack of room to maneuver, means that for real-life systems the approach not only costs more than it gains but is essentially a dead end. Conversely, the multi-layer system design I criticized back then does have the defects I described, but they are not without remedies, and the design cannot be dismissed with a single sentence such as "segmentation without a grammar always falls short."

W. Li. 1997. Chart Parsing Chinese Character Strings. In Proceedings of the Ninth North American Conference on Chinese Linguistics (NACCL-9). Victoria, Canada.

Chart Parsing Chinese Character Strings

Wei LI
Simon Fraser University, Burnaby B.C. V5A 1S6 CANADA (lio@sfu.ca)

ABSTRACT

This paper examines problems in word identification for a Chinese natural language processing system and presents our solution to these problems. In conventional systems, written Chinese parsing takes two steps: (1) a segmentation preprocessor for word identification (segmenter); (2) a grammar parsing the string of identified words. Morphological analysis, when required, as in the case of productive word formation, has to be incorporated in the segmenter. This matches the conventional morphology-before-syntax architecture. We will demonstrate the theoretical defect of this architecture when applied to Chinese. This leads to the conclusion that the segmentation approach, despite being the mainstream in Chinese computational morphology, is in general not adequate for the task of Chinese word identification. To solve this problem, a full grammar should be made available. Therefore, we take an alternative one-step approach. We have implemented an integrated grammar of morphology and syntax for directly parsing a string of Chinese characters, building both morphological and syntactic structures. Compared with the conventional two-step approach, our strategy has advantages in resolving ambiguity in word identification and in handling productive word formation.

Introduction

A written Chinese sentence is a string of characters with no blanks to mark word boundaries.
In conventional systems, Chinese parsing takes two steps, as shown in Figure 1: (1) a segmentation preprocessor (called a segmenter) for word identification; (2) a word-based parsing grammar, building syntactic structures (Feng 1996; Chen & Liu 1992). In contrast, we take an alternative one-step approach, as shown in Figure 2. We have implemented a grammar named W‑CPSG (for Wei's Chinese Phrase Structure Grammar). W‑CPSG integrates morphology and syntax for character-based parsing, building both morphological and syntactic structures.

In the two-step architecture, the purpose of the segmenter is to properly identify a string of words to feed syntax. This is not an easy task because segmentation ambiguity may be involved. For example, given a string of 4 Chinese characters 研究生命, the segmentation ambiguity is shown in (1.a) and (1.b) below.

(1.) 研究生命
(a) 研究生 | 命
    graduate student | life or destiny
(b) 研究 | 生命
    study | life

Resolving the above ambiguity in the segmenter is a hopeless job because such ambiguity is syntactically conditioned. For sentences like 研究生命金贵 (life for graduate students is precious), (1.a) is the right identification. For the phrase 研究生命起源 (to study the origin of life), (1.b) is right. So far there are no segmenters which can handle this properly and guarantee right word segmentation (Feng 1996). In fact, there can never be such segmenters as long as a grammar is not brought in. This is a theoretical defect of all Chinese analysis systems in the conventional architecture. We have solved this problem in our morphology-syntax integrated W‑CPSG. Word identification in our design becomes a by-product of parsing instead of a pre-condition for parsing.

In the text below, Section 2 investigates problems with the conventional two-step approach. In Section 3, we will present the W‑CPSG one-step approach and demonstrate how W‑CPSG parsing solves these problems.
The following is a list for abbreviations used in this paper. A (Adjective); AF (Affix); BM (Bound Morpheme); CLA (Classifier); CLAP (Classifier Phrase); DE (Chinese particle introducing a modifier of noun); DEP (DE Phrase); DE3 (Chinese particle introducing a modifier of result or capability); DET (Determiner); LE (Chinese perfective aspect marker); N (Noun); NP (Noun Phrase); P (Preposition); PP (Prepositional Phrase); S (Sentence); V (Verb); VP (Verb Phrase); Vt (Transitive Verb) Problems Challenging Segmenters In general, there are two basic problems for segmenters, namely, segmentation ambiguity and productive word formation. 2.1. segmentation ambiguity This sub-section studies the segmentation ambiguity for Chinese word identification. We indicate that this ambiguity is structural in nature. Therefore it should be captured by structural trees via parsing. We conclude that a parsing grammar is indispensable in the resolution of the segmentation ambiguity. Behind all segmenters are procedure based segmentation algorithms. Most proposals are some modified versions of large-lexicon based matching algorithms. As an underlying hypothesis, a longer match overrides a shorter match, hence the name maximum match . Decided by the direction of the procedure, i.e. whether the segmentation proceeds from left (the beginning of a string) to right (the end of the string) or from right to left, we have two general types of maximum match: (1) FMM (Forward Maximum Match) algorithm; (2) BMM (Backward Maximum Match) algorithm (Feng 1996). According to Liang 1987, segmenters have trouble with cases involving the segmentation ambiguity. There are two types of segmentation ambiguity: the cross ambiguity (AB|C vs. A|BC) and the embedded ambiguity (AB vs. A|B). To detect possible ambiguity, many researchers use the technique of combining the FMM algorithm and the BMM algorithm. When the output of FMM and BMM are different, there must be some ambiguity involved. 
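As a rough illustration of the two matching procedures described above, the following Python sketch runs FMM and BMM over a toy lexicon and flags a string as possibly ambiguous when the two outputs differ. The lexicon contents, the `max_len` window and the fallback that treats an unmatched single character as a word are assumptions made for this example; they are not taken from any particular segmenter.

```python
# Toy lexicon (an assumption for illustration only).
LEXICON = {"研究", "研究生", "生命", "命", "起源", "金贵",
           "他", "会", "烤", "白薯", "烤白薯"}


def fmm(text, lexicon, max_len=4):
    """Forward maximum match: scan left to right, longest lexicon word first."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is accepted as a fallback word.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def bmm(text, lexicon, max_len=4):
    """Backward maximum match: scan right to left, longest lexicon word first."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.insert(0, piece)
                j -= size
                break
    return words


def detect_ambiguity(text, lexicon):
    """Return (outputs_differ, fmm_words, bmm_words)."""
    forward, backward = fmm(text, lexicon), bmm(text, lexicon)
    return forward != backward, forward, backward


if __name__ == "__main__":
    for sentence in ("研究生命起源", "他会烤白薯"):
        print(sentence, detect_ambiguity(sentence, LEXICON))
```

With this lexicon the cross-ambiguous string 研究生命起源 is flagged because the two directions disagree, whereas for 他会烤白薯 both directions pick the longer entry 烤白薯, so nothing is flagged even though an alternative segmentation exists.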
The following table lists the cases associated with the FMM and BMM combined approach. The following 3 examples all contain a cross ambiguity sub-string 研究生命 with 2 segmentation possibilities: 研究生 | 命 and 研究 | 生命. Example (4.) is a genuinely ambiguous case. Genuinely ambiguous sentences cannot be disambiguated within the sentence boundary, rendering multiple readings.

(2.) case 1: 研究生命金贵。
(a) 研究生 | 命 | 金贵 (FMM: correct)
    graduate student | life | precious
    Life for graduate students is precious.
(b) * 研究 | 生命 | 金贵 (BMM: incorrect)
    study | life | precious

(3.) case 2: 研究生命起源。
(a) * 研究生 | 命 | 起源 (FMM: incorrect)
    graduate-student | life | origin
(b) 研究 | 生命 | 起源 (BMM: correct)
    study | life | origin
    to study the origin of life

(4.) case 3: 研究生命不好。
(a) 研究生 | 命 | 不 | 好 (FMM: correct)
    graduate student | destiny | not | good
    The destiny of graduate students is not good.
(b) 研究 | 生命 | 不 | 好 (BMM: correct)
    study | life | not | good
    It is not good to study life.

The following example is a complicated case of cross ambiguity, involving more than 2 ways of segmentation. Both the FMM segmentation 出现 | 在世 | 界 and the BMM segmentation 出 | 现在 | 世界 are wrong. A third segmentation 出现 | 在 | 世界 is right.

(5.) case 4: 出现在世界东方。
(a) * 出现 | 在世 | 界 | 东方 (FMM: incorrect)
    appear | be-alive | BM | east
(b) * 出 | 现在 | 世界 | 东方 (BMM: incorrect)
    out | now | world | east
(c) 出现 | 在 | 世界 | 东方 (correct)
    appear | at | world | east
    to appear in the east of the world

In the following examples (6.) through (8.), 烤白薯 involves embedded ambiguity. As separate words, the verb 烤 (bake) and the NP 白薯 (sweet potato) form a VP. As a whole, it is a compound noun 烤白薯 (baked sweet potato). In cases of embedded ambiguity, FMM and BMM always make the same segmentation, namely AB instead of A|B. It may be the only right choice, as seen in (6.). It may be wrong, as shown in (7.). It may be only half right, as in the case of genuine ambiguity shown in (8.).

(6.)
case 5: 他吃烤白薯。
(a) 他 | 吃 | 烤白薯 (FMM & BMM: correct)
    he | eat | baked sweet potato
    He eats baked sweet potatoes.
(b) * 他 | 吃 | 烤 | 白薯 (incorrect)
    he | eat | bake | sweet potato

(7.) case 6: 他会烤白薯。
(a) * 他 | 会 | 烤白薯 (FMM & BMM: incorrect)
    he | can | baked sweet potato
(b) 他 | 会 | 烤 | 白薯 (correct)
    he | can | bake | sweet potato
    He can bake sweet potatoes.

(8.) case 7: 他喜欢烤白薯。
(a) 他 | 喜欢 | 烤白薯 (FMM & BMM: correct)
    he | like | baked sweet potato
    He likes baked sweet potatoes.
(b) 他 | 喜欢 | 烤 | 白薯 (correct)
    he | like | bake | sweet potato
    He likes baking sweet potatoes.

Comparing the above examples, we see that there are severe limitations to the FMM-BMM combined approach. First, it only serves the purpose of ambiguity detection (when the results of FMM and BMM do not match), and contributes nothing to its resolution. It has no way to tell which segmentation is right (compare case 1 and case 2), and, worse still, whether both are right (case 3) or wrong (case 4). Second, even when the results of FMM and BMM do match, this by no means guarantees right segmentation (case 6). Third, as far as detection is concerned, it is limited to cross ambiguity. Embedded ambiguity is a blind area for this way of detection (case 6 and case 7). This is because the maximum match hypothesis underlying the FMM and BMM segmentation algorithms directly contradicts the phenomenon of embedded ambiguity.

In the face of ambiguity, how do people judge which segmentation is right in the first place? It really depends on whether we can understand the sentence or phrase based on the segmentation. In computational linguistics, this is equivalent to whether the segmented string can be parsed by a grammar. The segmentation ambiguity is one type of structural ambiguity, not in essence different from typical structural ambiguity like, say, PP attachment ambiguity.
In fact, the PP attachment problem is a counterpart of the cross ambiguity in English syntax, as shown below.

(9.) Cross ambiguity in PP attachment: V NP PP
(a) [[V NP] PP]
(b) [V [NP PP]]

Therefore, like English PP attachment, Chinese word segmentation ambiguity should also be captured by a parsing grammar. A parser resolves the ambiguity if it can, or detects the ambiguity in the form of multiple parses when it cannot. As shall be demonstrated in Section 3, wrong segmentation will not lead to a parse. Right segmentation results in at least one successful parse. In any case, at least a parser (hence a grammar on which the parser is based) is required for proper word identification.

The important thing is that the ambiguity in word identification is a grammatical problem. The attempt to solve this problem without a grammar is bound to be crippled. Since traditional segmentation algorithms are non-grammatical in nature, they are theoretically not equipped for handling such ambiguity. A sequential model of segmenter-before-grammar attempts to do what it is not yet able to do. This is the theoretical defect of almost all existing segmentation approaches.

(10.) Conclusion for 2.1.
The segmentation ambiguity in word identification is one type of structural ambiguity. In order to solve this problem, a parsing grammar is indispensable.

2.2. productive word formation

Unless morphological analysis is incorporated, lexicon-match based segmenters will have trouble with new words produced by Chinese productive word formation, including reduplication, derivation and the formation of proper names. When the morphology component is incorporated in the segmenter, the two-step design becomes a variant of the conventional morphology-before-syntax architecture. But this architecture is not effective when the segmentation ambiguity is at issue. In the following, we investigate reduplication, derivation and proper names one by one.
In each case, we find that there is always a possible involvement of the segmentation ambiguity. This problem cannot be solved by a morphology component independent of syntax. We therefore propose a grammar incorporating both morphology and syntax. 2.2.1. reduplication Reduplication in Chinese serves various grammatical and/or lexical functions. Not all reduplications pose challenges to segmentation algorithms. Assume that a word consists of 2 characters AB, reduplication of the type AB -- ABAB is no problem. What becomes a problem for word segmentation is the reduplication of the type AB -- AABB or its variants like AB -- AAB. For example, a two-morpheme verb with verb-object relation at the level of morphology has the following way of reduplication. (11.) Verb Reduplication: AB -- AAB (for diminutive use) 分心 (get distracted) -- 分分心 (get distracted a bit) 让他分分心。 让 | 他 | 分分心 let | he | get distracted a bit Let him relax a while. It seems that reduplication is a simple process which can be handled by incorporating some procedure-based function calls in the segmentation algorithm. If a 3-character string, say 分分心, cannot be found in the lexicon, the reduplication procedure will check whether the first 2 characters are the same, and if yes, delete one of them and consult the lexicon again. But, such expansion of the segmentation algorithm is powerless when the segmentation ambiguity is involved. For example, it is wrong to regard 分分心 as of reduplication in the following sentence. (12.) 这件事十分分心。 (a) * 这 | 件 | 事 | 十 | 分分心 this | CLA | thing | ten | get distracted a bit (b) 这 | 件 | 事 | 十分 | 分心 this | CLA | thing | very | distracting This thing is very distracting. 2.2.2. derivation In Contemporary Mandarin, there have come to be a few morphemes functioning similarly to English affixes, e.g. 可 (-able) turns a transitive verb into an adjective. (13.) 
可 (-able) + Vt -- A
可 (-able) + 读 (Vt: read) -- 可读 (A: readable)
这本书非常可读。
这 | 本 | 书 | 非常 | 可读
this | CLA | book | very | readable
This book is very readable.

The suffix 性 works just like '-ness', changing an adjective into an abstract noun. The derived noun 可读性 (readability) in the following example, similar to its English counterpart, involves a process of double affixation.

(14.) A + 性 (-ness) -- N
可 (-able) + 读 (Vt: read) -- 可读 (A: readable)
可读 (A: readable) + 性 (-ness) -- 可读性 (N: readability)
这本书的可读性
这 | 本 | 书 | 的 | 可读性
this | CLA | book | DE | readability
this book's readability

The suffix 头 can change a transitive verb into an abstract noun, adding to it the meaning worth-of.

(15.) Vt + 头 (AF: worth of) -- N
吃 (Vt: eat) + 头 (AF: worth of) -- 吃头 (N: worth of eating)
这道菜没有吃头
这 | 道 | 菜 | 没有 | 吃头
this | CLA | dish | not-have | worth-of-eating
This dish is not worth eating.

It is not difficult to incorporate these derivation rules in the segmenter for morphological analysis. But, as in the case of reduplication, there is always a danger of wrongly applying the rules due to possible ambiguity. For example, 吃头 is a sub-string with embedded ambiguity. It can be either a derived noun 'worth of eating' or two separate words, as seen in the following example.

(16.) 他饿得能吃头牛。
(a) * 他 | 饿 | 得 | 能 | 吃头 | 牛
    he | hungry | DE3 | can | worth-of-eating | ox
(b) 他 | 饿 | 得 | 能 | 吃 | 头 | 牛
    he | hungry | DE3 | can | eat | CLA | ox
    He is so hungry that he can eat an ox.

2.2.3. proper name

Proper names are of 2 major types: (1) Chinese names; (2) transliterated foreign names. In this paper, we only target the identification of Chinese names and leave the problem of transliterated foreign names for further research (Li, 1997b).

A Chinese human name usually consists of a family name followed by a given name. Chinese family names form a clear-cut closed set. A given name is usually either one character or two characters.
For example, the late Chinese chairman 毛泽东 (Mao Zedong) used to have another name 李得胜 (Li Desheng). In the lexicon, 李 is a registered family name. Both 得胜 and 胜 mean 'win'. This may lead to 3 ways of word segmentation: (1) 李得胜; (2) 李 | 得胜; (3) 李得 | 胜, as seen in the following examples.

(17.) 李得胜了。
(a) 李 | 得胜 | 了
    Li | win | LE
    Li won.
(b) 李得 | 胜 | 了
    Li De | win | LE
    Li De won.
(c) * 李得胜 | 了
    Li Desheng | LE

(18.) 李得胜胜了。
(a) * 李 | 得胜 | 胜 | 了
    Li | win | win | LE
(b) * 李得 | 胜 | 胜 | 了
    Li De | win | win | LE
(c) 李得胜 | 胜 | 了
    Li Desheng | win | LE
    Li Desheng won.

Since a given name like 得胜 is an arbitrary string of 1 or 2 characters, the morphological analysis of the full name should start with the family name, which can optionally combine with any 1 or 2 characters to form the candidate proper names 李, 李得 and 李得胜. In other words, the family name serves as the left boundary of a full name and the length is used to determine candidates. The right segmentation can only be made via sentence analysis, as shown in the above examples.

Most Chinese place names are made of 1 to 3 characters, for example, 芜湖市 (Wuhu City), 南陵县 (Nanling County). The arbitrariness of these names makes any sub-string of n characters (0 < n < 4) in the sentence a suspect. Fortunately, in most cases we may find boundary indicators of these names, like 省 (province), 市 (city), 县 (county), etc. Once the boundary indicator is located, a technique similar to the use of the Chinese family name to identify the given name can be applied to select candidate place names for verification through grammatical analysis.

In general, there is always a possibility of ambiguity in the formation of all types of proper names.

(19.) Conclusion for 2.2.
Due to the possible involvement of ambiguity, a parsing grammar for morphological analysis as well as for sentence analysis is required for the proper identification of the words produced by Chinese productive word formation.
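The candidate-generation step for personal names described in 2.2.3 can be sketched in a few lines of Python. This is only a minimal sketch: the one-entry `FAMILY_NAMES` set and the function name `name_candidates` are hypothetical, and single-character family names are assumed for simplicity.

```python
# Toy closed set of family names (assumption: single-character names only).
FAMILY_NAMES = {"李"}


def name_candidates(text, start, family_names=FAMILY_NAMES):
    """Return candidate personal names beginning at position `start`.

    The family name is the left boundary; it may stand alone or combine
    with the next 1 or 2 characters, whatever those characters are.
    """
    if text[start:start + 1] not in family_names:
        return []
    return [text[start:start + 1 + extra]
            for extra in range(3)
            if start + 1 + extra <= len(text)]


if __name__ == "__main__":
    print(name_candidates("李得胜了", 0))
```

The function only proposes candidates. Choosing among them is left to sentence analysis, which is the point the section makes.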
W‑CPSG Grammatical Approach

This section presents the W‑CPSG approach to Chinese word identification and morphological analysis. We will demonstrate how a parser based on W‑CPSG solves the problems of word identification ambiguity and productive word formation.

3.1. rationale of W‑CPSG approach

There have been a number of word identification algorithms based on both morphological and syntactic information (see the surveys in Feng 1996 and Sun & Huang 1996). Most such approaches do not use a self-contained grammar to parse the complete sentence. They are confined to the conventional two-step process of the segmentation-before-grammar design. As long as the word identification procedure is independent of a parsing grammar, it is extremely difficult to make full use of grammatical information to resolve ambiguity in word identification. Careful tuning and sophisticated design improve the precision but will not change the theoretical defect of all such approaches.

Chen & Liu acknowledge the limitation of their approach due to the lack of a grammar. “However”, they say, “it is almost impossible to apply real world knowledge nor to check the grammatical validity at this stage” (Chen & Liu 1992, p.105). Why impossible at this stage? Because these segmentation systems are based on the two-step architecture and the grammar is not yet available! As we have demonstrated, the final judgment for proper word identification can hardly be made until the whole sentence is parsed, hence the requirement of a full grammar. Such systems are therefore forced to compromise, deciding how much grammatical information to involve according to how much word identification precision they can afford to sacrifice. Needless to say, there is significant double labor between such a word segmentation procedure and the following stage of parsing. As more and more grammatical information is used to achieve better precision, the overhead of this double labor becomes more serious.
We consider the double labor to be one strong argument against the two-step approach. If enough grammatical information is incorporated, it is essentially equivalent to a grammar, and the segmenter will be equivalent to a parser. Then why two grammars, one for word identification and one for sentence parsing? Why not combine them? That is exactly what we are proposing in W‑CPSG: a one-step approach based on an integrated grammar, eliminating the necessity of a segmentation preprocessor.

3.2. W‑CPSG character-based parsing

W‑CPSG (Li 1997a, 1997b) is a lexicalized Chinese unification grammar. The work on W‑CPSG is taken in the spirit of the modern linguistic theory Head-driven Phrase Structure Grammar (Pollard & Sag 1994). W‑CPSG consists of two parts: a minimized general grammar and an enriched lexicon. The general grammar only contains a handful of PS (phrase structure) rules, covering complement structure, modifier structure, conjunctive structure and morphological structure. This is the nature of lexicalized grammars. PS rules in such grammars are very abstract. Essentially, all they say is one thing: 2 signs can combine so long as the lexicon so indicates.

The lexicon houses lexical entries with their linguistic description in feature structures. Potential morphological structures as well as potential syntactic structures are lexically encoded. In syntax, a word expects another sign to form a phrase. In morphology, a morpheme expects another sign to form a word. For example, the prefix 可 (-able) expects a transitive verb to form an adjective. The morphological PS rule will build the morphological structure when a transitive verb does appear after the prefix 可 (-able) in the input string.

We now illustrate how W‑CPSG parses a string of Chinese characters by a sample parsing chart. The prototype of W‑CPSG was written in ALE, a grammar compiler developed on top of Prolog by Carpenter & Penn (1994).
ALE compiles W‑CPSG into a Chinese parser, a Prolog program ready to accept a string of characters for analysis. W‑CPSG parse tree embodies both morphological analysis and syntactic analysis, as shown below. This is so-called bottom-up parsing. It starts with lexicon look-up. Edges 1 through 7 are lexical edges. Other edges are phrasal edges. Each edge represents a sign, i.e. a character (morpheme), a word, a phrase or a sentence. Lexical edges result from a successful match between the signs in the input string and the entries in the lexicon during lexicon look-up. For example, 可 (-able), 读 (read) and 性 (-ness) are all registered entries in the lexicon, so they get matched and shown by edge 5, edge 6 and edge 7. Words produced by productive word formation present themselves as phrasal edges, e.g. edge ((5+6)+7) for 可读性 (readability). For the sake of concise illustration, we only show two pieces of information for the signs in the chart, namely category and interpretation with a delimiting colon (lexical edges are only labeled for either category or interpretation). The parser attempts to combine the signs according to PS rules in the grammar until parses are found. A parse is an edge which ranges over the whole string. The parse ((((1+2)+3)+4)+((5+6)+7)) for (20.) represents a binary structural tree based on the W‑CPSG analysis, as shown below. 3.3. ambiguity resolution in word identification Given the resources of a phrase structure grammar like W‑CPSG, a parser based on standard chart parsing algorithms can handle both the cross ambiguity and the embedded ambiguity provided that a match algorithm based on exhaustive lookup instead of maximum match is adopted for lexicon lookup. All candidate words in the input string are presented to the parser for judgment. Ambiguous segmentation becomes a natural part of parsing: different ways of segmentation add different edges, a successful parse always embodies right identification. 
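To make the idea of exhaustive lookup feeding a chart concrete, here is a small Python sketch of a bottom-up chart recognizer over character positions. It is a toy under stated assumptions: the lexicon, the atomic category labels and the three binary rules are invented for illustration, and a plain CKY-style loop stands in for the unification-based W‑CPSG grammar and its ALE-compiled parser.

```python
from collections import defaultdict

# Hypothetical toy lexicon: word -> set of atomic categories.
LEXICON = {
    "出": {"V"}, "出现": {"V"}, "现在": {"N"}, "在": {"P"},
    "在世": {"V"}, "世界": {"N"}, "界": {"BM"}, "东方": {"N"},
}
# Hypothetical binary rules: (left category, right category) -> parent.
RULES = {("N", "N"): "NP", ("P", "NP"): "PP", ("V", "PP"): "VP"}


def lexical_edges(text, lexicon):
    """Exhaustive lookup: every substring found in the lexicon becomes an edge."""
    edges = []
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            for cat in sorted(lexicon.get(text[i:j], ())):
                edges.append((i, j, cat, (text[i:j],)))
    return edges


def parse(text, lexicon, rules, goal):
    """Return the word sequences of all edges of category `goal` spanning `text`."""
    chart = defaultdict(set)  # (start, end) -> {(category, word sequence)}
    for i, j, cat, words in lexical_edges(text, lexicon):
        chart[(i, j)].add((cat, words))
    n = len(text)
    for width in range(2, n + 1):            # build wider spans from narrower ones
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):         # split point
                for lcat, lwords in list(chart[(i, k)]):
                    for rcat, rwords in list(chart[(k, j)]):
                        parent = rules.get((lcat, rcat))
                        if parent:
                            chart[(i, j)].add((parent, lwords + rwords))
    return sorted(words for cat, words in chart[(0, n)] if cat == goal)


if __name__ == "__main__":
    print(parse("出现在世界东方", LEXICON, RULES, "VP"))
```

In this toy setting the words proposed by maximum match in either direction (在世, 界, 出, 现在) are all present as lexical edges, but only the sequence 出现 | 在 | 世界 | 东方 ends up inside an edge that spans the whole string, so the segmentation falls out of the parse.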
In other words, word identification in our design becomes a by-product of parsing instead of a pre-condition for parsing. The following example of the complicated cross ambiguity illustrates how the W‑CPSG parser resolves ambiguity. As seen, both the FMM segmentation (represented by the edge sequence 8-9-5-10) and the BMM segmentation (represented by 1-11-12-10) are in the chart as a result of exhaustive lexicon lookup. They are proved to be wrong because they do not lead to a successful parse according to the grammar. As a by-product, the final parse (8+(3+(12+10))) automatically embodies rightly identified word sequence 8-3-12-10, i.e. 出现 (appear) |在 (at) |世界 (world) |东方 (east). Exhaustive lookup also makes an embedded ambiguity sub-string like 烤红薯 no longer a blind area for word identification, as shown in (22.) below. All the candidate words in the sub-string including 烤 (bake), 红薯 (sweet potato), 烤红薯 (baked sweet potato) are added to the chart as lexical edges (edge 4, edge 8 and edge 10). This is a case of genuine ambiguity, resulting in 2 parses corresponding to 2 readings. The first parse (1+(7+10)) identifies the word sequence 他|喜欢|烤红薯, and the second parse (1+(9+(4+8))) a different sequence 他|喜欢|烤|红薯. Edge 7 and edge 9 represent two lexical entries for the verb 喜欢 (like), with different syntactic expectation (categorization). One expects an NP object, notated in the chart by likeNP , and the other expects a VP complement, notated by likeVP . We now illustrate how Chinese proper names are identified in W‑CPSG parsing. In the W‑CPSG lexicon, Chinese family name is encoded to optionally expect the given name. Due to the arbitrariness of given names, no other constraint except for the length (either 1 character or 2 characters) is specified in the expectation. Therefore, we have three candidates for proper names in the following example, namely 李 (Li), 李得 (Li De), 李得胜 (Li Desheng), represented respectively by edge 1, edge (1+2) and the NP edge (1+5). 
The first two candidates contribute to two valid parses while the third does not, hence the identification of the word sequences 李|得胜|了 and 李得|胜|了. Now we add one more character 胜 (win) to form a new sentence, as shown in (24.) below. The first two candidate proper names 李 (Li) and 李得 (Li De) no longer lead to parses. But the third candidate 李得胜 (Li Desheng) becomes part of the parse as a subject NP. The parse (((1+6)+4)+5) corresponds to the identification of the only valid word sequence 李得胜|胜|了. Finally, we give an example to demonstrate how W‑CPSG handles reduplication in parsing and word identification. The sample sentence to be processed by the parser is 让他分分心 (Let him relax a while), involving the AB--AAB type verb reduplication for diminutive use. In most lexicons, 分心 (distract-heart: get distracted) is a registered 2-morpheme verb with internal morphological verb-object relation. Therefore, the reduplication is considered morphological. But in Chinese syntax, we also have a general verb reduplication rule of the type A--AA for diminutive use, for example, 看(look) -- 看看(have a look). This morphological verb reduplication rule AB--AAB and the syntactic verb reduplication rule A--AA are essentially the same rule in Chinese grammar. 分心 sits in the gray area between morphology and syntax. It looks both like a word (verb) and a phrase (VP). Lexically, it corresponds to one generalized sense (concept) and the internal combination is idiomatic, i.e. 分 (distract) must combine with 心 (heart) to mean 'get distracted'. But, structurally, the combination of 分 and 心 is not fundamentally different from a VP consisting of Vt and NP, as in the phrase 看电影 (see a film). In fact, there is no clear-cut boundary between Chinese morphology and syntax. This morphology-syntax isomorphic fact serves as a further argument to support the W‑CPSG design of integrating morphology and syntax in one grammar module. 
Although the boundary between Chinese morphology and syntax is fuzzy, and hence there is no universal definition of basic notions like word and phrase, the division can be easily defined system-internally in an integrated grammar. In W‑CPSG, 分心 is treated as a phrase (VP) instead of a word (verb). The lexical entry 分 (distract) is coded to obligatorily expect the literal 心 (heart) as its syntactic object, shown in the following chart by the notation V 心. This approach has the advantage of eliminating the doubling of the reduplication rule for diminutive use in both syntax and morphology, making the grammar more elegant. The verb reduplication rule is implemented as a lexical rule in W‑CPSG. This lexical rule creates a reduplicated verb with added diminutive sense, shown by edge 8 (a lexical edge). The whole parsing process is illustrated below.

REFERENCES

Carpenter, B. & Penn, G. (1994): ALE, The Attribute Logic Engine, User's Guide, Carnegie Mellon University
Chen, K-J. & S-H. Liu (1992): Word identification for mandarin Chinese sentences. Proceedings of the 15th International Conference on Computational Linguistics, Nantes, 101-107.
Feng, Z-W. (1996): COLIPS lecture series - Chinese natural language processing, Communications of COLIPS, Vol.6, No.1 1996, Singapore
Li, W. (1997a): Outline of an HPSG-style Chinese reversible grammar, Proceedings of The Northwest Linguistics Conference-97 (NWLC-97, forthcoming), UBC, Vancouver, Canada
Li, W. (1997b): W‑CPSG: A Lexicalized Chinese Unification Grammar And Its Application, Doctoral dissertation (on-going), Simon Fraser University, Canada
Liang, N. (1987): Shumian Hanyu Zidong Fenci Xitong - CDWS (Automatic word segmentation system for written Chinese - CDWS), Journal of Chinese Information Processing, No.2 1987, pp 44-52, Beijing
Pollard, C. & I. Sag (1994): Head-Driven Phrase Structure Grammar, Centre for the Study of Language and Information, Stanford University, CA
Sun, M-S. & C-N. Huang (1996): Word segmentation and part of speech tagging for unrestricted Chinese texts (Tutorial Notes for International Conference on Chinese Computing ICCC'96), Singapore

~~~~~~~~~~~~~~~~~~~

The author benefited from the insightful discussion with Dr. Dekang Lin on the feasibility of parsing Chinese character strings instead of word strings. Thanks also go to Paul McFetridge and Fred Popowich for their supervision and encouragement.

This table is adapted from the following table in Sun & Huang (1996).

case 1   The outputs of FMM and BMM are different, but both are incorrect     0.054%
case 2   The outputs of FMM and BMM are different, but only one is correct    9.24%
case 3   The outputs of FMM and BMM are identical, but incorrect              0.41%
case 4   The outputs of FMM and BMM are identical, and correct                90.30%

The 4 cases which they listed are not logically exhaustive in terms of sentence-based processing (i.e. when discourse is not involved in a system). In particular, there is another case, when the outputs of FMM and BMM are different and both are correct. We call this a case of genuine cross ambiguity.

Note that there is another S edge (1+5) in the chart. These two edges are structurally different, created via different PS rules. The NP edge (1+5) is formed through the morphological PS rule, combining the family name (edge 1) and its expected given name (edge 5). In the S edge (1+5), however, it is the subject rule (one of the complement PS rules) that decides the combination of the predicate (edge 5) and its expected subject NP (edge 1).

Lexical rules are favored by many linguists to capture redundancy in the lexicon instead of the conventional approach of syntactic transformation. Lexical rules are applied at compile time to form an expanded lexicon before parsing starts.
Interaction of syntax and semantics in parsing Chinese transitive verb patterns
Handling Chinese NP predicate in HPSG
Notes for An HPSG-style Chinese Reversible Grammar
Outline of an HPSG-style Chinese reversible grammar
PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)
PhD Thesis: Chapter I Introduction
PhD Thesis: Chapter VII Concluding Remarks
Overview of Natural Language Processing
Dr. Wei Li’s English Blog on NLP
Category: 立委科普 | 5148 views | 0 comments
Interaction of syntax and semantics in parsing Chinese
Popularity 1 liwei999 2016-9-20 09:10
Interaction of syntax and semantics in parsing Chinese transitive verb patterns *

(old paper in Proceedings of International Chinese Computing Conference, ICCC'96)

Wei LI
Department of Linguistics, Simon Fraser University
Burnaby, B.C. V5A 1S6 CANADA

Keywords: Chinese processing, transitive pattern, syntax, semantics, lexical rule, HPSG

Abstract

This paper addresses the problem of parsing Chinese transitive verb patterns (including the BA construction and the BEI construction) and handling the related phenomena of semantic deviation (i.e. the violation of the semantic constraint). We designed a syntax-semantics combined model of Chinese grammar in the framework of Head-driven Phrase Structure Grammar. Lexical rules are formulated to handle both the transitive patterns which allow for semantic deviation and the patterns which disallow it. The lexical rules ensure the effective interaction between the syntactic constraint and the semantic constraint in analysis. The contribution of our research can be summarized as: (1) the insight into the interaction of syntax and semantics in analysis; (2) a proposed lexical rule approach to semantic deviation based on (1); (3) the application of (2) to the study of the Chinese transitive patterns; (4) the implementation of (3) in a unification-based Chinese HPSG prototype.

Background

When Chomsky proposed his Syntactic Structures in the Fifties, he seemed to indicate that syntax should be addressed independently of semantics. As a convincing example, he presented a famous sentence:

1) Colorless green ideas sleep furiously.

Weird as it sounds, the grammaticality of this sentence is intuitively acknowledged: (1) it follows the English syntax; (2) it can be interpreted. In fact, there is only one possible interpretation, solely decided by its syntactic structure.
In other words, without semantic interference, our linguistic knowledge of English syntax is sufficient to assign a role to each constituent and produce a reading, although the reading does not seem to make sense.

However, things are not always this simple. Compare the following Chinese sentences of the same form NP NP V:

2a) dianxin wo chi le.
    Dim-Sum I eat LE.
    The Dim Sum I have eaten.
    Note: LE is a particle for perfect aspect.
2b) wo dianxin chi le.
    I have eaten the Dim Sum.

Who eats what? There is no formal way to reach the correct interpretation other than resorting to the semantic constraint imposed by the notion eat.

Of course, if we want to maintain the purity of syntax, it could be argued that syntax will only render possible interpretations and not the interpretation. It is up to other components (semantic filter and/or other filters) of grammar to decide which interpretation holds in a certain context or discourse. The power of syntax lies in the ability to identify structural ambiguities and to render possible corresponding interpretations. We call this type of linguistic design a syntax-before-semantics model. While this is one way to organize a grammar, we found it unsatisfactory for two reasons. First, it does not seem to simulate the linguistic process of human comprehension closely. For human listeners, there are no ambiguities involved in sentences 2a) and 2b). Secondly, there is considerable cost on processing efficiency in terms of computer implementation. This efficiency problem can be very serious in the analysis of languages like Chinese with virtually no inflection.

Head-driven Phrase Structure Grammar (HPSG) assumes a lexicalist approach to linguistic analysis and advocates an integrated model of syntax and the other components of grammar. It serves as a desirable framework for the integration of the semantic constraint in establishing syntactic structures and interpretations.
Therefore, we proposed to enforce the semantic constraint that an animate being eats food directly in the lexical entry chi (eat): chi (eat) requires an animate NP subject and a food NP object. This correctly addresses the who-eats-what problem for sentences like 2a) and 2b). In fact, this type of semantic constraint (selection restriction) has been widely used for disambiguation in NLP systems. The problem is that the constraint should not always be enforced. In the practice of communication, deviation from the constraint is common, and deviation is often deliberately applied to help render rhetorical expressions.

3) xiang chi yueliang, ni gou de3 zhao me?
   want eat moon, you reach DE3 -able ME?
   Wanting to eat the moon, but can you reach it?
   Note: DE3 is a particle, introducing a postverbal adjunct of result or capability. ME is a sentence final particle for yes-no question.
4) dajia dou chi shehui zhuyi, neng bu qiong me?
   people all eat social -ism, can not poor ME
   Everyone is eating socialism, can it not be poor?

yueliang (moon) is not food, of course. It is still some physical object, though. But in 4), shehui zhuyi (socialism) is a purely abstract notion. If a parser enforces the rigid semantic constraint, there are many such sentences that will be rejected without getting a chance to be interpreted. The fact is, we do have interpretations for 3) and 4). Hence an adequate grammar should be able to accommodate those interpretations.

To capture such deviation, Wilks came up with his Preference Semantics. A sophisticated mechanism is designed to calculate the semantic weight for each possible interpretation, i.e. how much it deviates from the preference semantic constraint. The final choice will be given to the interpretation with the most semantic weight in total. His preference model simulates the process of how humans comprehend language more closely than most previous approaches.
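As a toy illustration of the preference idea just described (and not Wilks's actual mechanism), each candidate reading can be scored by how many semantic preferences it satisfies, with the highest-scoring reading chosen. The category labels, the PREFERENCES table and both function names are assumptions introduced here for illustration.

```python
# Hypothetical semantic categories for a few words from the examples.
CATEGORY = {"wo": "animate", "dianxin": "food", "yueliang": "physical_object"}

# Preferences of the notion "eat": role -> preferred category.
PREFERENCES = {"subject": "animate", "object": "food"}


def weight(reading):
    """Count how many role fillers satisfy the preference of 'eat'."""
    return sum(
        1 for role, word in reading.items()
        if CATEGORY.get(word) == PREFERENCES[role]
    )


def best_reading(np1, np2):
    """Choose between the SOV and OSV readings of the string 'NP NP chi'."""
    candidates = {
        "SOV": {"subject": np1, "object": np2},
        "OSV": {"subject": np2, "object": np1},
    }
    return max(candidates, key=lambda name: weight(candidates[name]))


print(best_reading("dianxin", "wo"))   # 2a) dianxin wo chi le -> OSV
print(best_reading("wo", "dianxin"))   # 2b) wo dianxin chi le -> SOV
print(weight({"subject": "wo", "object": "yueliang"}))  # deviant object -> 1
```

A reading that violates a preference, such as eating the moon in 3), is not rejected in this sketch; it simply receives a lower weight than a fully conforming reading would.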
The problem with this design is the serious computational complexity involved in the model. In order to calculate the semantic weight, the preference semantic constraint is loosened step by step. Each possible substructure has to be re-tried with each step of loosening. It may well lead to combinatorial explosion.

What we are proposing here is to look at semantic deviation in the light of the interaction of the syntactic constraint and the semantic constraint. In concrete terms, the loosening of the semantic constraint is conditioned by syntactic patterns. A syntactic pattern is defined as the representation of an argument structure in surface form. A pattern consists of 2 parts: a structure's syntactic constraint (in terms of the syntactic categories and configuration, word order, function words and/or inflections) and its interpretation (role assignment). For example, for Chinese transitive structure, NP V NP: SVO is one pattern, NP NP V: SOV is another pattern, and NP [ba NP] V: SOV (the BA construction) is still another. The expressive power of a language is indicated by the variety of patterns used in that language. Our design will account for some semantic deviation or rhetorical phenomena seen in everyday Chinese without the overhead of computational complexity. We will focus on Chinese transitive verb patterns for illustration of this approach.

Chinese transitive patterns

Assuming three notional signs wo (I), chi (eat) and dianxin (Dim Sum), there are maximally 6 possible combinations in surface word order, out of which 3 are grammatical in Chinese.

5a) wo chi le dianxin. SVO
5b) wo dianxin chi le. SOV
5c) dianxin wo chi le. OSV

SVO is the canonical word order for Chinese transitive structure. When a string of signs matches the order NP V NP, the semantic constraint has to yield to syntax for interpretation.

NP V NP: SVO

6) daodi shi ni zai du shu ne, haishi shu zai du ni ne?
   on-earth be you ZAI read book NE, or book ZAI read you NE?
   Are you reading the book, or is the book reading you, anyway?
   Note: ZAI is a particle for continuous aspect. NE is a sentence final particle for or-question.

As in the English equivalent, the interpretation of 6) can only be SVO, no matter how contradictory it might be to our common sense. In other words, in the form NP V NP, syntax plays a decisive role.

In contrast, to interpret the form NP NP V as SOV in 2b), the semantic constraint is critical. Without the enforcement of the semantic constraint, the interpretation of SOV does not hold.

In fact, this SOV pattern (NP1 NP2 V: SOV) has been regarded as ungrammatical in a Case Theory account for Chinese transitive structure in the framework of GB. According to that analysis, something similar to this pattern constitutes the D-Structure for the transitive pattern, and Chinese is an underlying SOV language (called the SOV Hypothesis: see the survey in Gao 1993). In the surface structure, NP2 is without case on the assumption that V assigns its CASE only to the right. One has to either insert the case-marker ba to assign CASE to it (the BA construction) or move it to the right of V to get its CASE (the SVO pattern). This analysis suffers from not being able to account for the grammaticality of sentences like 2b). However, by distinguishing the deep pattern SOV from the 2 surface patterns (the SVO and the BA construction), the theory has the merit of alerting us that the SOV pattern seems to be syntactically problematic (crippled, so to speak). This is an insightful point, but it goes one step too far in totally rejecting the SOV pattern in surface structure. If we modify this idea, we can claim that SOV is a syntactically unstable pattern and that SOV tends to (not must) transform to the SVO or the BA construction unless it is reinforced by semantic coherence (i.e. the enforcement of the semantic constraint). This argument in the light of syntax-semantics interaction is better supported by the Chinese data.
In essence, our account is close to this reformulated argument, but in our theory, we do not assume a deep structure and transformation. All patterns are surface constructions. If no sentences can match a construction, it is not considered a pattern by our definition.

This type of unstable pattern which depends on the semantic constraint is not limited to the transitive phenomena. For example, the type of Chinese NP predicate defined in Li & McFetridge (1995) is also a semantics dependent pattern. Compare:

7a) zhe zhang zhuozi san tiao tui.
    this Cl. table(furniture) three Cl. leg
    This table is three-legged.
    Note: Cl for classifier.
7b) * zhe zhang ditu san tiao tui.
    this Cl. map(non-furniture) three Cl. leg

There is clearly a semantic constraint of the NP predicate on its subject: it should be furniture (or animate). Without this semantic agreement, a Chinese NP is normally not capable of functioning as a predicate, as shown in 7b).

Between semantics dependent and semantics independent patterns, we may have partially dependent patterns. For example, in NP NP V: OSV, it seems that the semantic constraint on the initial object is less important than the semantic constraint on the subject.

8) shitou wo ye xiang chi, kexi yao bu dong.
   stone(non-food) I(animate) also want eat, pity chew not -able
   Even stones I also want to eat, but it's such a pity that I am not able to chew them.

If the constraint on the object matches well, is the subject allowed to be semantically deviant?

9) ? dianxin zhuozi chi le.
   Dim-Sum(food) table(non-animate) eat LE.

These are marginal cases: a grammar may choose to be more tolerant and accept them, or more restrained and reject them.

Unlike SOV, but similar to its English counterpart, OSV is one type of Chinese topic construction, and the relationship between the initial O and V is one of long distance dependency.

10a) dianxin wo xiangxin ni yiwei Lisi chi le.
     Dim-Sum I believe you think Lisi eat LE
     The Dim Sum I believe you think that Lisi ate.
10b) * Lisi wo xiangxin ni yiwei dianxin chi le.

10b) will not be accepted in our model because (1) it cannot be interpreted as OSV since it violates the semantic constraint on S: dianxin is not animate; (2) nor can it be interpreted as SOV, since it violates the configurational constraint: SOV is simply not a long distance pattern. In fact, NP NP V: SOV is such a restricted pattern in Chinese that it not only excludes any long distance dependency but even disallows some adjuncts. Compare 11a) in the OSV pattern and 11b) and 11c) in the SOV pattern:

11a) dianxin wo jinjinyouwei de2 chi le.
     Dim-Sum I with-relish DE2 eat LE
     The Dim Sum I ate with relish.
     Note: DE2 is a particle introducing a preverbal adjunct of manner.
11b) * wo dianxin jinjinyouwei de2 chi le.
11c) * wo jinjinyouwei de2 dianxin chi le.

There is another pattern of the linear order SOV, the notorious Chinese BA construction. ba is usually regarded as a preposition which introduces a preverbal object for transitive verbs.

NP [ba NP] V: SOV

12a) wo ba dianxin jinjinyouwei de2 chi le.
     I BA Dim-Sum with-relish DE2 eat LE
     I ate the Dim Sum with relish.
12b) wo jinjinyouwei de2 ba dianxin chi le.
     With relish, I ate the Dim Sum.
12c) dianxin ba wo jinjinyouwei de2 chi le.
     The Dim Sum ate me with relish.
12d) dianxin jinjinyouwei de2 ba wo chi le.
     With relish, the Dim Sum ate me.

For the OSV order, there is another so-called BEI construction. The BEI construction is usually regarded as an explicit passive pattern in Chinese.

NP [bei NP] V: OSV

13a) dianxin bei wo chi le.
     Dim-Sum BEI I eat LE
     The Dim Sum was eaten by me.
13b) wo bei dianxin chi le.
     I was eaten by the Dim Sum.

The BEI construction and the BA construction are both semantics independent. In fact, any pattern resorting to the means of function words in Chinese seems to be sufficiently independent of the semantic constraint.

To conclude, semantic deviation often occurs in some more independent patterns, as seen in 5d2), 6), 8), 12c), 12d), 13b).
Close study reveals that different patterns result in different reliance on the semantic constraint, as summarized in the following table.

syntactic pattern        semantic dependence
NP V NP: SVO             no dependence
NP [ba NP] V: SOV        no dependence
NP [bei NP] V: OSV       no dependence
NP NP V: OSV             partial dependence
NP NP V: SOV             full dependence
............

It should be emphasized that this observation constitutes the rationale behind our approach.

Formulation of lexical rules

Based on the above observation, we have designed a syntax-semantics combined model. In this model, we take a lexical rule approach to Chinese patterns and the related problem of semantic deviation.

A lexical rule takes as its input a lexical entry which satisfies its condition and generates another entry. Lexical rules are usually used to cover lexical redundancy between related patterns. The design of lexical rules is preferred by many grammarians over the more conventional use of syntactic transformation, especially for lexicalist theories.

Our general design is as follows, still using chi (eat) for illustration:

(1) Syntactically, chi (eat) as a transitive verb subcategorizes for a left NP as its subject and a right NP as its object.
(2) Semantically, the corresponding notion eat expects an entity of category animate as its logical subject and an entity of category food as its logical object. Therefore the common sense (knowledge) that an animate being eats food is represented.
(3) The interaction of syntax and semantics is implemented by lexical rules. The lexical rules embody the linguistic generalizations about the transitive patterns. They will decide to enforce or waive the semantic constraint based on different patterns.

As seen, syntax only stipulates the requirement of two NPs as complements for chi and does not care about the NPs' semantic constraint. Semantics sets its own expectation of an animate entity and a food entity as arguments for eat and does not care what syntactic forms these entities assume on the surface.
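The division of labour in (1) to (3) can be sketched roughly as follows. This is only an illustration in Python, not the ALE implementation: the dictionary layout, the ENFORCED table and the function name are assumptions, the table simply restates the dependence summary above, and treating "partial dependence" as enforcing the subject constraint only is one possible reading of it.

```python
# Lexical entry for chi (eat): syntax and semantics are stated separately.
CHI = {
    "subcat": ("NP", "NP"),            # (1) two NP complements
    "knowledge": ("animate", "food"),  # (2) expected subject / object
}

# (3) Which semantic constraints each pattern enforces:
# no dependence -> none; partial -> subject only; full -> both.
ENFORCED = {
    "SVO": set(),
    "BA": set(),
    "BEI": set(),
    "OSV": {"subject"},
    "SOV": {"subject", "object"},
}


def licensed(entry, pattern, subject_cat, object_cat):
    """Return True if the pattern accepts these semantic categories."""
    want_subj, want_obj = entry["knowledge"]
    checks = {
        "subject": subject_cat == want_subj,
        "object": object_cat == want_obj,
    }
    return all(checks[role] for role in ENFORCED[pattern])


# 5d2) "The Dim Sum ate me": SVO waives the constraint.
print(licensed(CHI, "SVO", "food", "animate"))
# 2b) needs both constraints to hold for the SOV reading.
print(licensed(CHI, "SOV", "animate", "food"))
# 8) a stone as topic object is tolerated in OSV.
print(licensed(CHI, "OSV", "animate", "stone"))
```

Because the entry itself never mentions patterns, the decision to enforce or waive a constraint lives entirely in the pattern table, which mirrors the role the lexical rules play in the design described above.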
It is up to lexical rules to coordinate the two. In our model, the information in (1) and (2) is encoded in the corresponding lexical entry, and the lexical rules in (3) will then be applied to expand the lexicon before parsing begins. Driven by the expanded lexicon, analysis is implemented by a lexicalist parser to build the interpretation structure for the input sentence.

Following this design, there will be sufficient interaction between syntax and semantics as desired, while syntax still remains a self-contained component, separate from semantics, in the lexicon. More importantly, this design does not add any computational complexity to parsing, because similar lexical rules are required to handle the different patterns even in a pure syntax model.

Before we proceed to formulate lexical rules for transitive patterns, we should make sure what a transitive pattern is. As we defined before, a pattern consists of 2 parts: a structure's syntactic constraint and the corresponding interpretation. Word order is an important constraint for Chinese syntax. In addition to word order, we have categories and function words (preposition, particle, etc.). As for interpretation, transitive structure involves 3 elements: V (predicate) and its arguments S (logical subject) and O (logical object).

There is a further factor to take into account: Chinese complements are often optional. In many cases, subject and/or object can be omitted either because they can be recovered in the discourse or because they are unknown. We call those patterns elliptical patterns (with some complement(s) omitted), in contrast to full patterns.

With these in mind, we can define 10 patterns for Chinese transitive structure: 5 full patterns and 5 elliptical patterns. We now investigate these transitive patterns one by one and try to informally formulate the corresponding lexical rules to capture them. Please note that the basic input condition is the same for all the lexical rules.
This is because they share one and the same argument structure: transitive structure.

Lexical rule 1: V ((NP1, NP2), (constr1, constr2)) --> NP1 V NP2: SVO

The above notation for the lexical rule should be quite obvious. The input of the rule is a transitive verb which subcategorizes for two NPs, NP1 and NP2, and whose corresponding notion expects two arguments of constr1 and constr2. NP is a syntactic category, and constr is a semantic category (human, animate, food, etc.). The output pattern is in a defined word order SVO and waives the semantic constraint.

Lexical rule 2: V ((NP1, NP2), (constr1, constr2)) --> [NP1, constr1] [NP2, constr2] V: SOV

Please note that the semantic constraint is enforced for this SOV pattern. Since this pattern shares the form NP NP V with the OSV pattern, it would be interesting to see what happens if a transitive verb has the same semantic constraint on both its subject and object. For example, qingjiao (consult) expects a human subject and a human object.

14) ta ni qingjiao guo me?
    he(human) you(human) consult GUO ME
    Him, have you ever consulted?
    Note: GUO is a particle for experience aspect.
15) ni ta qingjiao guo me?
    You, has he ever consulted?

In both cases, the interpretation is OSV instead of SOV. Therefore, we need to reformulate Lexical rule 2 to exclude the case when the subject constraint is the same as the object constraint.

Lexical rule 2' (refined version): V ((NP1, NP2), (constr1, constr2), (constr1 not = constr2)) --> [NP1, constr1] [NP2, constr2] V: SOV

Lexical rule 3: V ((NP1, NP2), (constr1, constr2)) --> NP1 [ba NP2] V: SOV

This is the typical BA construction. But not every transitive verb can assume the BA pattern. In fact, ba is one of a set of prepositions that introduce the logical object. There are other more idiosyncratic prepositions (xiang, dao, dui, etc.) required by different verbs to do the same job.

16a) ni qingjiao guo ta me?
     you consult GUO he ME
     Have you ever consulted him?
16b) ni xiang ta qingjiao guo me?
     you XIANG he consult GUO ME
     Have you ever consulted him?
16c) * ni ba ta qingjiao guo me?
     you BA he consult GUO ME
17a) ta qu guo Beijing.
     he go-to GUO Beijing
     He has been to Beijing.
17b) ta dao Beijing qu guo.
     he DAO Beijing go-to GUO
     He has been to Beijing.
17c) * ta ba Beijing qu guo.
     he BA Beijing go-to GUO
18a) ta hen titie zhangfu.
     she very tenderly-care-for husband
     She cares for her husband very tenderly.
18b) ta dui zhangfu hen titie.
     she DUI husband very tenderly-care-for
     She cares for her husband very tenderly.
18c) * ta ba zhangfu hen titie.
     she BA husband very tenderly-care-for

This originates from the different theta-roles assumed by different verb notions on their object argument: patient, theme, destination, to name only a few. These theta-roles are a further classification of the more general semantic role logical object. We can rely on the subcategorization property of the verb for the choice of the preposition literal (so-called valency preposition). With the valency information in place, we now reformulate Lexical rule 3 to make it more general:

Lexical rule 3' (refined version): V ((NP1, NP2), (constr1, constr2), (valency_preposition=P), (P not = null)) --> NP1 [P NP2] V: SOV

Lexical rule 4: V ((NP1, NP2), (constr1, constr2)) --> NP2 ... [NP1, constr1] V: OSV

This is a topic pattern of long distance dependency. It is up to different formalisms to provide different approaches to long-distance phenomena. In our present implementation, NP2 is placed in a feature called BIND to indicate the nature of long distance dependency. One phrase structure rule, the Topic Rule, is designed to use this information and handle the unification of the long distance complement properly.

Following the topic pattern, the passive BEI construction is formulated in Lexical rule 5.

Lexical rule 5: V ((NP1, NP2), (constr1, constr2)) --> NP2 [bei NP1] V: OSV

We now turn to elliptical patterns.

Lexical rule 6: V ((NP1, NP2), (constr1, constr2)) --> V NP2: VO

19) chi guo jiaozi me?
    eat GUO dumpling ME
    Have (you) ever eaten dumpling?
Lexical rule 7: V ((NP1, NP2), (constr1, constr2)) --> [NP1, constr1] V: SV

20) wo chi le.
    I eat LE
    I have eaten (it).
21) ji chi le.
    chicken1(animate) eat LE
    The chicken has eaten (it).

Like its English counterpart, ji (chicken) has two senses: (1) chicken1 as animate; (2) chicken2 as food. We code this difference in two lexical entries. Only the first entry matches the semantic constraint on the subject in the pattern and reaches the above SV interpretation in 21). Interestingly enough, the same sentence will get another parse with a different interpretation OV in 23), because the second entry also satisfies the semantic constraint on the object in the OV pattern in Lexical rule 8.

22) ni qingjiao guo me?
    you consult GUO ME
    Have you consulted (someone)?

22) indicates that the SV interpretation is preferred over the OV interpretation when the semantic constraint on the subject and the semantic constraint on the object happen to be the same. Hence the added condition in Lexical rule 8.

Lexical rule 8: V ((NP1, NP2), (constr1, constr2), (constr1 not = constr2)) --> [NP2, constr2] V: OV

23) ji chi le.
    chicken2(food) eat LE
    The chicken has been eaten.

Lexical rule 9: V ((NP1, NP2), (constr1, constr2)) --> NP2 [bei V]: OV

24) dianxin bei chi le.
    Dim-Sum BEI eat LE
    The Dim Sum has been eaten.

Lexical rule 10: V ((NP1, NP2), (constr1, constr2)) --> V: V

25) chi le me?
    eat LE ME?
    (Have you) eaten (it)?

Implementation

We begin with a discussion of some major feature structures in HPSG related to handling the transitive patterns. Then, we will show how our proposal works and discuss some related implementation issues.

HPSG is a highly lexicalist theory. Most information is housed in the lexicon. The general grammar is kept to a minimum: only a few phrase structure rules (called ID Schemata) associated with a couple of principles. The data structure is the typed feature structure. The necessary part of a typed feature structure is the type information.
A simple feature structure contains only the type information, but a complex feature structure can introduce a set of feature/value pairs in addition to the type information. In a feature/value pair, the value is itself a feature structure (simple or complex).

The following is a sample implementation of the lexical entry chi for our Chinese HPSG grammar using the ALE formalism.

Note: (1) Uppercase notation for feature; (2) Lowercase notation for type; (3) Number indices in square brackets for unification.

Leaving the notational details aside, what this roughly says is: (1) for the semantic constraint, the arguments of the notion eat are an animate entity and a food entity; (2) for the syntactic constraint, the complements of the verb chi are 2 NPs: one on the left and the other on the right; (3) the interpretation of the structure is a transitive predicate with a subject and an object. The three corresponding features are: (1) KNOWLEDGE; (2) SUBCAT; (3) CONTENT.

KNOWLEDGE stores some of our common sense by capturing the internal relation between concepts. Such common sense knowledge is represented in linguistic ways, i.e. it is represented as a semantic expectation feature, which parallels the syntactic expectation feature SUBCAT. KNOWLEDGE defines the semantic constraint on the expected arguments no matter what syntactic forms the arguments will take. In contrast, SUBCAT only defines the syntactic constraint on the expected complements. The syntactic constraint includes word order (LEFT feature), syntactic category (CATEGORY feature) and configurational information (LEX feature). Finally, the CONTENT feature assigns the roles SUBJECT and OBJECT for the represented structure.

A more important issue is the interaction of the three feature structures. Among the three features, only KNOWLEDGE is our add-on. The relationship between SUBCAT and CONTENT has been established in all HPSG versions: SUBCAT resorts to CONTENT for interpretation.
This interaction corresponds to our definition of pattern. Everything goes fine as long as the syntactic constraint alone can decide interpretation. When the semantic constraint (in KNOWLEDGE) has to be involved in the interpretation process, we need a way to access this information. In unification based theories, information flow is realized by unification (i.e. structure sharing, which is represented by the co-index of feature values). In general, we have two ways to ensure structure sharing in the lexicon. It is either directly co-indexed in the lexical entries, or it resorts to lexical rules. The former is unconditional, and the latter is conditional. As argued before, we cannot directly enforce the semantic constraint for every transitive pattern in Chinese, for otherwise our grammar will not allow for any semantic deviation. We are left with lexical rules, which we have informally formulated in Section 3 and implemented in the ALE formalism.

CATEGORY is another major feature for a sign. The CATEGORY feature in our implementation includes functional category, which can specify a functional literal (function word) as its value. Function words belong to closed categories. Therefore, they can be classified by enumeration of literals. Like word order, function words are an important form of syntactic constraint in Chinese. Grammars for other languages also resort to some functional literals for constraint. In most HPSG grammars for English, for example, a preposition literal is specified in a feature called P_FORM. There are two problems involved there. First, at the representation level, there is redundancy: P_FORM:x --> CATEGORY:p (where x is not null). In other words, there exists a feature dependency between P_FORM and CATEGORY which is not captured in the formalism. Second, if P_FORM is designed to stipulate a preposition literal, we will ultimately need to add features like CL_FORM for classifier specification, CO_FORM for conjunction specification, etc.
In fact, for each functional category, literal specification may be required for constraint in a non-toy grammar. That will make the feature system of the grammar too cumbersome.

These problems are solved in our grammar implementation in ALE. One significant mechanism in ALE is its type inheritance and appropriateness specifications for feature structures. (A similar design is found in the new software paradigm of Object Oriented Programming.) Thanks to ALE, we can now use literals (ba, xiang, dao, dui, etc.) as well as major categories (n, v, a, p, etc.) to define the CATEGORY feature. In fact, any intermediate level of subclassification between these two extremes, major categories and literals, can be represented in CATEGORY just as handily. Together they constitute a type hierarchy of CATEGORY. The same mechanism can also be applied to semantic categories (human, animate, food, etc.) to capture thesaurus inferences like human --> animate. This makes our knowledge representation much more powerful than in those formalisms without this mechanism. We will address this issue in depth in another paper, Typology for syntactic category and semantic category in Chinese grammar.

In the following, we give a brief description of how our grammar works. The grammar consists of several phrase structure rules and a lexicon with lexical entries and lexical rules. First, ALE compiles the grammar into a Prolog parser. During this process (at compile time), lexical rules are applied to lexical entries. In the case of transitive patterns, this means that one entry of chi will evolve into 10 entries. Please note that it is this expanded lexicon that is used for parsing (at run time).

At the level of implementation, we do not need to presuppose an abstract transitive structure as input of the lexical rules and from there generate 10 new entries for each transitive verb.
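Two of the mechanisms just described, type inheritance over categories and compile-time application of lexical rules, can be imitated in a few lines of Python. This is a hedged sketch rather than ALE code: the PARENT table, the pattern labels and the function names are all hypothetical, and the expansion only tags each entry with a pattern name instead of building real feature structures.

```python
# Hypothetical fragment of a type hierarchy (child -> parent).
PARENT = {
    "human": "animate", "animate": "entity", "food": "entity",
    "ba": "p", "xiang": "p", "dao": "p", "dui": "p",
}


def subsumes(general, specific):
    """True if `general` is `specific` or one of its ancestors."""
    while specific is not None:
        if specific == general:
            return True
        specific = PARENT.get(specific)
    return False


# Labels standing in for the 5 full and 5 elliptical transitive patterns.
PATTERNS = ["SVO", "SOV", "BA", "OSV", "BEI",
            "VO", "SV", "OV", "BEI-OV", "V"]


def expand(lexicon):
    """Apply every 'lexical rule' to every base entry before parsing."""
    return [dict(entry, pattern=p) for entry in lexicon for p in PATTERNS]


base = [{"form": "chi", "knowledge": ("animate", "food")}]
expanded = expand(base)

print(len(expanded))                 # 10: one entry of chi becomes ten
print(subsumes("animate", "human"))  # True: human inherits from animate
print(subsumes("p", "ba"))           # True: the literal ba is a kind of p
```

The point of the single hierarchy is that a constraint stated at any level, from a major category down to one literal, is checked by the same subsumption test, so no extra feature such as P_FORM is needed.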
What is needed is one pattern to serve as the basic pattern for transitive structure, from which the other patterns are derived. In fact, we only need 4 lexical rules to derive the other 4 full patterns from 1 basic full pattern. Elliptical patterns can be handled more elegantly by other means than lexical rules.

The basic pattern constitutes the common condition for lexical rules. Although in theory any one of the 5 full patterns can be seen as the basic pattern, the choice is not arbitrarily made. The pattern we chose is the valency preposition pattern (the BA-type construction) NP1 [P NP2] V: SOV (see Lexical rule 3'). This is justified as follows. The valency preposition P (ba, xiang, dao, dui, etc.) is idiosyncratically associated with the individual verb. To derive a more general pattern from a specific pattern is easier than the other way round; for example, NP1 [P NP2] V: SOV --> NP1 V NP2: SVO is easier than NP1 V NP2: SVO --> NP1 [P NP2] V: SOV. This is because we can then directly code the valency preposition under CATEGORY in the SUBCAT feature and do not have to design a specific feature to store this valency information.

Summary

The ultimate aim for natural language analysis is to reach interpretation, i.e. to assign roles to the constituents. An old question is how syntax (form) and semantics (meaning) interact in this interpretation process. More specifically, which is a more important factor in Chinese analysis, the syntactic constraint or the semantic constraint? For the linguistic data we have investigated, it seems that sometimes syntax plays a decisive role and other times semantics has the final say. The essence is how to adequately handle the interface between syntax and semantics.

In our proposal, the syntactic constraint is seen as a more fundamental factor. It serves as the frame of reference for the semantic constraint. The involvement of the semantic constraint seems to be most naturally conditioned by syntactic patterns.
In order to ensure their effective interaction, we accommodate syntax and semantics in one model. The model is designed to be based on syntax and resorts to semantic information only when necessary. In concrete terms, the system will selectively enforce or waive the semantic constraint, depending on syntactic patterns.

It should be noted that there are other factors involved in reaching a correct interpretation. For example, in order to recover the omitted complements in elliptical patterns, information from discourse and pragmatics may be vital. We leave this for future research.

References

Carpenter, B. & Penn, G. (1994): ALE, The Attribute Logic Engine, User's Guide, Version 2.0
Gao, Qian (1993): “Chinese BA-Construction: Its Syntax and Semantics”, OSU Working Papers in Linguistics 1993, Kathol A. & Pollard C. (eds.)
Huang, Xiuming (1987): “XTRA: The Design and Implementation of A Fully Automatic Machine Translation System”, Ph.D. dissertation.
Li, Audrey (1990): Chapter 6 “Passive, BA, and topic constructions”, Order and Constituency in Mandarin Chinese. Kluwer Academic Publishers
Li, Wei & McFetridge, Paul (1995): “Handling Chinese NP predicate in HPSG”, Proceedings of PACLING-II, Brisbane, Australia
Pollard, Carl & Sag, Ivan A. (1994): Head-Driven Phrase Structure Grammar, Centre for the Study of Language and Information, Stanford University, CA
Pollard, Carl & Sag, Ivan A. (1987): Information-based Syntax and Semantics. Vol. 1: Fundamentals. Centre for the Study of Language and Information, Stanford University, CA
Wilks, Y.A. (1978): “Making Preferences More Active”, Artificial Intelligence, Vol. 11
Wilks, Y.A. (1975): “A Preferential Pattern-Seeking Semantics for Natural Language Inference”, Artificial Intelligence, Vol. 6

~~~~~~~~~~~~

* This research is part of my Ph.D. project on a Chinese HPSG-style grammar, supported by the Science Council of British Columbia, Canada under G.R.E.A.T. award (code: 61). I thank my supervisor Dr.
Paul McFetridge for his supervision. He introduced me to HPSG theory and provided me with his sample grammars. Without his help, I would not have been able to implement the Chinese grammar in a relatively short time. Thanks also go to Prof. Dong Zhen Dong and Dr. Ping Xue for their comments and encouragement.

The other combinations are:

5d1) * dianxin chi le wo. OVS
5d2) dianxin chi le wo.
     The Dim Sum ate me.
Note: It is OK with the 5d2) reading in the pattern NP V NP: SVO.

5e1) * chi le wo dianxin. VSO
5e2) chi le wo dianxin.
     (Somebody) ate my Dim Sum.
Note: It is OK with the 5e2) reading in the pattern V [NP1 NP2]: VO, where NP1 modifies NP2.

5f1) * chi le dianxin wo. VOS
5f2) chi le dianxin, wo.
     Eaten the Dim Sum, I have.
Note: It is OK in spoken Chinese, with a short pause before wo, in a pattern like V NP, NP: VOS.

The conventional configurational approach is based on the assumption that complements are obligatory and should be saturated. If saturation of complements were not taken as a precondition for a phrase, serious problems of structural overgeneration might arise. On the other hand, optionality of complements is a fact of real-life language. Elliptical patterns are seen in many languages and are especially commonplace in Chinese. In order to ensure obligatoriness of complements, the lexical rule approach can be applied to elliptical patterns, as shown in Section 3. This approach maintains the configurational constraint in tree building to block structural overgeneration, but the cost is great: each possible elliptical pattern for a head has to be accommodated by a new lexical entry. With the type mechanism provided by ALE, we have developed a technique that allows for optionality of complements and still maintains a proper configurational constraint. We will address this issue in another paper, Configurational constraint in Chinese grammar.

This choice coincides with the base-generated account of the BA construction in , but that does not mean much.
First, our so-called basic pattern is not their D-Structure. Second, our choice is based on more practical considerations. Their claim involves more theoretical arguments in the context of generative grammar.
Handling Chinese NP predicate in HPSG
liwei999 2016-9-16 09:59
Handling Chinese NP predicate in HPSG (old paper in Proceedings of the Second Conference of the Pacific Association for Computational Linguistics, Brisbane, 1995)

Wei Li & Paul McFetridge
Department of Linguistics
Simon Fraser University
Burnaby, B.C. CANADA V5A 1S6

Key words: HPSG; knowledge representation; Chinese processing

Abstract

This paper addresses a type of Chinese NP predicate in the framework of HPSG 1994 (Pollard & Sag 1994). Special emphasis is laid on knowledge representation and the interaction of syntax and semantics in natural language processing. A knowledge-based HPSG model is designed. This design not only lays a foundation for handling the Chinese NP predicate problem effectively, but also has theoretical and methodological significance for NLP in general.

In Section 1, the data are analyzed. Both structural and semantic constraints for this pattern are defined. Section 2 discusses the semantic constraints in the wider context of the conceived knowledge-based model. The aim of natural language analysis is to reach interpretations, i.e. to assign semantic roles to the constituents correctly. We show that some structures cannot be interpreted without resorting to some common sense knowledge. We present a way to organize and utilize knowledge in the HPSG lexicon. In Section 3, a lexical rule for this pattern is proposed in our HPSG model for Chinese, whose prototype is being implemented.

1. Problem

We will show the data of the Chinese NP predicate first. Then we will investigate what makes it possible for an NP to behave like a predicate. We will do this by defining both the syntactic and semantic constraints for this Chinese pattern.

1.1. Data: one type of Chinese NP predicate

1) 他好身体。
   ta hao shenti.
   he good body
   He is of good health.

2) 张三高个子。
   Zhangsan gao gezi.
   Zhangsan tall figure
   Zhangsan is tall.

3) 李四圆圆的脸。
   Lisi yuanyuan de lian.
   Lisi round-round DE face
   Lisi has a quite round face.

4) 这件大衣红颜色。
   zhe jian dayi hong yanse.
   this (cl.) coat red colour
   This coat is of red colour.

5) 明天小雨。
   mingtian xiao yu.
   tomorrow little rain
   Tomorrow it will drizzle.

6) 那张桌子三条腿。
   na zhang zhuozi san tiao tui.
   that (cl.) table three (cl.) leg
   That table is three-legged.

Note: (cl.) for classifier. DE for Chinese attribute particle.

The relation between the subject NP and the predicate NP is not identity. The NP predicate in Chinese usually describes a property the subject NP has, corresponding to English be-of/have NP. In identity constructions, the linking verb SHI (be) cannot normally be omitted.

7a) 他是学者。
    ta shi xuezhe.
    he be scholar
    He is a scholar.

7b) ? 他学者。
    ta xuezhe.
    he scholar

1.2. Problem analysis

1.2.1. We first investigate the structural characteristics of the Chinese NP predicate pattern.

A single noun cannot act as predicate. More restrictively, not every NP can become a predicate. It seems that only an NP with the following configuration has this potential: NP [XP N]. In other words, a predicate NP consists of a lexical N with a modifying sister. Structures of this sort should not be further modified. Thus, the following patterns are predicted.

8a) 那张桌子三条腿。
    na zhang zhuozi san tiao tui.
    that (cl.) table three (cl.) leg
    That table is three-legged.

8b) 那张桌子塑料腿。
    na zhang zhuozi suliao tui.
    that (cl.) table plastic leg
    That table is of plastic legs.

8c) * 那张桌子三条塑料腿。
    * na zhang zhuozi san tiao suliao tui.

8d) * 那张桌子腿。
    * na zhang zhuozi tui.

1.2.2. What is the semantic constraint for the Chinese NP predicate pattern?

Although there is no syntactic agreement between subject and predicate in Chinese, there is an obvious semantic agreement between the two: hao shenti (good body) requires a HUMAN as its subject; san tiao tui (three legs) demands that the subject be FURNITURE or ANIMATE. Therefore, the following are unacceptable:

9) * 这杯茶好身体。
   * zhe bei cha hao shenti.
   this cup tea good body

10) * 空气三条腿。
    * kongqi san tiao tui.
    air three (cl.) leg

Obviously, it is not hao (good) or san tiao (three) which imposes this semantic selection on the subject. The semantic restriction comes from the noun shenti (body) or tui (leg). There is an internal POSSESS relationship between them: shenti (body) belongs to human beings and tui (leg) is one part of an animal or of some furniture. This common sense relation is a crucial condition for the successful interpretation of Chinese NP predicate sentences.

There are a number of issues involved here. First, what is the relationship of this type of knowledge to the syntactic structures and semantic interpretations? Second, where and how should this knowledge be represented? Third, how will the system use the knowledge when it is needed? More specifically, how will the introduction of this knowledge coordinate with the other parts of the well-established HPSG formalism? Those are the questions we attempt to answer before we proceed to provide a solution to the Chinese NP predicate.

Let us look at some more examples:

11a) 桌子坏了。
     zhuozi huai le.
     table bad LE
     The table went wrong.

11b) 腿坏了。
     tui huai le.
     leg bad LE
     The leg went wrong.

11c) 桌子的腿坏了。
     zhuozi de tui huai le.
     table DE leg bad LE
     The table's leg went wrong.

12a) 他好。
     ta hao.
     he good
     He is good.

12b) 身体好。
     shenti hao.
     body good
     The health is good.

12c) 他的身体好。
     ta de shenti hao.
     he DE body good
     His health is good.

Note: LE for Chinese perfect aspect particle.

When people say 11b) tui huai le (leg went wrong), we know something (the possessor) is omitted. For 11a), however, we have no such feeling of incompleteness. Although we may also ask whose table, this possessive relation between who and table is by no means innate. Similarly, ta (he) in 12a) is a complete notion denoting someone while shenti (body) in 12b) is not. In 11c) and 12c), the possessor appears in the possessive DE-construction, and the expectation of tui (leg) and shenti (body) is realized.
These examples show that some words (concepts) carry a conceptual expectation for some other words (concepts), although the expected words do not necessarily show up in a sentence and the expectation might not be satisfied. In fact, this type of expectation forms part of our knowledge (common sense). One way to represent this knowledge is to encode it with the related word in the lexicon.

Therefore we propose an underlying SYNSEM feature KNOWLEDGE to store some of our common sense knowledge by capturing the internal relation between concepts. KNOWLEDGE parallels syntactic SUBCAT and semantic RELATION. KNOWLEDGE imposes semantic constraints on the expected arguments no matter what syntactic form the arguments take (they may take null form, i.e. the underlying arguments are not realized). In contrast, SUBCAT only defines the syntactic requirement for the complements and gets interpreted in RELATION. Following this design, syntactic form and semantic constraints are kept apart. When necessary, the interaction between them can be implemented by lexical rules, or directly coindexed in the lexicon. For example, the following KNOWLEDGE information will be enforced as the necessary semantic constraint when we handle Chinese NP predicates by a lexical rule (see 3.3).

PHON shenti
SYNSEM | KNOWLEDGE | PRED possess
SYNSEM | KNOWLEDGE | POSSESSOR human
SYNSEM | KNOWLEDGE | POSSESSED
SYNSEM | LOCAL | CONTENT | INDEX
SYNSEM | LOCAL | CONTENT | RESTRICTION { RELATION body }
SYNSEM | LOCAL | CONTENT | RESTRICTION { INSTANCE }

2. Agreement revisited

This section relates the semantic constraints which embody common sense to the conventional linguistic notion of agreement. We will show that they are essentially the same thing seen from different perspectives. We only need a slight expansion of the definition of agreement to accommodate some of our basic knowledge. This is important as it accounts for the feasibility of coding knowledge in linguistic ways.
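As a concrete illustration of coding such knowledge in the lexicon, the sketch below attaches a KNOWLEDGE record to each noun and checks the semantic agreement behind examples 1), 6), 9) and 10). It is a minimal Python sketch, not the ALE implementation: the dictionary layout, the `ISA` table and the names `subsumes` and `possessor_ok` are assumptions made for illustration, and the category labels (human, animate, furniture, organ) are borrowed from the discussion above.

```python
# Hypothetical semantic hierarchy (child -> parent); the labels follow the text.
ISA = {"human": "animate"}


def subsumes(general, specific):
    """True if `specific` equals `general` or lies below it in ISA."""
    while specific is not None:
        if specific == general:
            return True
        specific = ISA.get(specific)
    return False


# Illustrative lexicon: every noun has a semantic category ("roget");
# nouns such as shenti and tui also carry an underlying KNOWLEDGE record.
LEXICON = {
    "ta": {"roget": "human"},
    "zhuozi": {"roget": "furniture"},
    "cha": {"roget": "liquid"},
    "kongqi": {"roget": "gas"},
    "shenti": {"roget": "organ",
               "knowledge": {"pred": "possess", "possessor": ["human"]}},
    "tui": {"roget": "organ",
            "knowledge": {"pred": "possess",
                          "possessor": ["animate", "furniture"]}},
}


def possessor_ok(subject, noun):
    """Check whether `subject` satisfies the possessor expectation of `noun`."""
    knowledge = LEXICON[noun].get("knowledge")
    if knowledge is None:
        return False  # the noun expects no possessor at all
    category = LEXICON[subject]["roget"]
    return any(subsumes(wanted, category) for wanted in knowledge["possessor"])
```

With these entries, `possessor_ok("ta", "shenti")` holds, while `possessor_ok("cha", "shenti")` and `possessor_ok("kongqi", "tui")` fail, mirroring the contrast between 1) and 9), 10). The check looks only at the head noun of the subject; the structural constraint on the predicate NP is a separate matter.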
The linguistic lexicon seems to be good enough to house some general knowledge in addition to linguistic knowledge. Some possible problems with this knowledge-based approach are also discussed.

Let's first consider the following two parallel agreement problems in English:

13) * The boy drink.
14) ? The air drinks.

13) is ungrammatical because it violates the syntactic agreement between the subject and the predicate. 14) is conventionally considered grammatical although it violates the semantic agreement between the agent and the action. Since the approach taken in this paper is motivated by semantic agreement, some elaboration and comment on agreement are needed.

Agreement in person, gender and number is included in the CONTENT | INDEX features (Pollard & Sag 1994, Chapter 2). It follows that any two co-indexed signs naturally agree with each other. That is desirable because co-indexed signs refer to the same entity. However, person, gender and number seem to be only part of the story of agreement. We may expand the INDEX feature to cope with semantic agreement for handling Chinese, and for in-depth semantic analysis of other languages as well. Note that to accommodate semantic agreement in HPSG, we first need features to represent the result of the semantic classification of lexical meanings, like HUMAN, FOOD, FURNITURE, etc. We therefore propose a ROGET feature (named after the thesaurus) and put it into the INDEX feature.

Semantic agreement, sometimes termed semantic constraint or semantic selection restriction in the literature, is not a new conception in natural language processing. Hardly any in-depth language analysis can go smoothly without incorporating it to a certain extent. For languages like Chinese, with virtually no inflection, it is even more important. We can hardly imagine how the roles can be correctly assigned without the involvement of semantic agreement in the following sentences of the form NP1 NP2 Vt:

15a) 点心我吃了。
     dianxin wo chi le.
     Dim-Sum I eat LE
     The Dim Sum I have eaten.

15b) 我点心吃了。
     wo dianxin chi le.
     I Dim-Sum eat LE
     I have eaten the Dim Sum.

Who eats what? There is no formal way to assign the roles correctly other than resorting to the semantic agreement enforced by eat. In HPSG 1994, it was pointed out (Pollard & Sag 1994, p. 81): "... there is ample independent evidence that verbs specify information about the indices of their subject NPs. Unless verbs 'had their hands on' (so to speak) their subjects' indices, they would be unable to assign semantic roles to their subjects." The Chinese data show that sometimes verbs need to have their hands on the semantic categories (ROGET) of both their external argument (subject) and their internal arguments to be able to assign roles correctly. Now that we have expanded the INDEX feature to cover both ROGET and the conventional agreement features number, person and gender, the above claim of Pollard and Sag becomes more general.

It is widely agreed that knowledge is bound to play an important role in natural language analysis and disambiguation. The question is how to build a knowledge-based system which is manageable. Knowledge consists of linguistic knowledge (phonology, morphology, syntax, semantics, etc.) and extra-linguistic knowledge (common sense, professional knowledge, etc.). Since semantics is based on lexical meanings, lexical meanings represent concepts, and concepts are linked to each other in a way that forms knowledge, we can well regard semantics as a link between linguistics and beyond-linguistics in terms of knowledge. In other words, some extra-linguistic knowledge may be represented in linguistic ways. In fact, the lexicon, if properly designed, can be a rich source of knowledge, both linguistic and extra-linguistic. A typical example of how concepts are linked in a network (a sophisticated concept lexicon) is seen in the representation of drink in Wilks 1975b:

((*ANI SUBJ) (((FLOW STUFF) OBJE) ((SELF IN) (((*ANI (THRU PART)) TO) (BE CAUSE)))))
While for various reasons we will not go as far as Wilks, we can gain enlightenment from this type of AI approach to knowledge. Lexicon-driven systems like the one in HPSG can, of course, make use of this possibility. Take the Chinese role-assignment problem, for example: the common sense that an ANIMATE being eats FOOD can be seamlessly incorporated in the lexical entry chi (eat) as a semantic agreement requirement.

PHON chi
SYNSEM | KNOWLEDGE | PRED eat
SYNSEM | KNOWLEDGE | AGENT animate
SYNSEM | KNOWLEDGE | PATIENT food
SYNSEM | LOCAL | CATEGORY | SUBCAT | EXTERNAL_ARGUMENT ]
SYNSEM | LOCAL | CATEGORY | SUBCAT | INTERNAL_ARGUMENTS ]
SYNSEM | LOCAL | CONTENT | RELATION
SYNSEM | LOCAL | CONTENT | EATER | INDEX | ROGET
SYNSEM | LOCAL | CONTENT | EATEN | INDEX | ROGET

Note: Following the convention, the part after the colon is SYNSEM | LOCAL | CONTENT information.

One last point we would like to make in this context is that semantic agreement, like syntactic agreement, should be able to loosen its restriction. In other words, agreement is just a canonical requirement, in Wilks's term a preference (Wilks 1975a). In actual communication, deviation in different degrees is often seen, and people often relax the preference restriction in order to understand. With semantic agreement, deliberate deviation is one of the handy means of rhetorical expression. In a certain domain, Chomsky's famous sentence Colorless green ideas sleep furiously is well imaginable. On the other hand, a deviation in syntactic agreement will not affect the meaning if no confusion is caused, which may or may not happen depending on the context and the structure of the language. In English, lack of syntactic agreement for the present third person singular between subject and predicate usually causes no problem. Sentence 13) The boy drink can therefore be accepted and correctly interpreted.
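The two points above, that chi (eat) carries its semantic agreement in the lexicon and that agreement acts as a preference rather than a hard filter, can be sketched together. The Python below is only an illustration under stated assumptions: the `ISA` hierarchy, the `ROGET` and `FRAMES` tables and the function names are hypothetical, the liquid patient for drink is a guess suggested by the Wilks formula, and counting violations is a loose stand-in for preference semantics, not Wilks's algorithm.

```python
from itertools import permutations

# Hypothetical semantic hierarchy (child -> parent).
ISA = {"human": "animate"}


def subsumes(general, specific):
    """True if `specific` equals `general` or lies below it in ISA."""
    while specific is not None:
        if specific == general:
            return True
        specific = ISA.get(specific)
    return False


# Illustrative semantic categories of a few words.
ROGET = {"wo": "human", "dianxin": "food", "boy": "human", "air": "gas"}

# Preferred categories per role, standing in for the KNOWLEDGE feature.
FRAMES = {
    "chi": {"agent": "animate", "patient": "food"},
    "drink": {"agent": "animate", "patient": "liquid"},  # patient is assumed
}


def violations(verb, reading):
    """Count the role fillers that break the verb's semantic preferences."""
    frame = FRAMES[verb]
    return sum(1 for role, word in reading.items()
               if not subsumes(frame[role], ROGET[word]))


def assign_roles(verb, np1, np2):
    """Try both role assignments and keep the one with the fewest violations."""
    readings = [{"agent": a, "patient": p}
                for a, p in permutations((np1, np2))]
    return min(readings, key=lambda r: violations(verb, r))
```

For both 15a) and 15b), `assign_roles("chi", ...)` returns wo as agent and dianxin as patient regardless of word order. For 14), `violations("drink", {"agent": "air"})` is 1 rather than a rejection, so a deviant reading can still be kept when nothing better is available.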
There is much more to say on the interaction of the two types of agreement deviation, how a preference model might be conceived, what computational complexities it may cause and how to handle them effectively. We plan to address these questions in another paper. The interested reader is referred to one famous approach in this direction (Wilks 1975a, 1978).

3. Solution

We will set some requirements first and then present a lexical rule to see how well it meets our requirements.

3.1. Based on the discussion in Section 1, the solution to the Chinese predicate NP problem should meet the following 4 requirements:

(1) It should enforce the syntactic constraints for this pattern: one and only one modifier XP in the form of NP1 XP NP2.
(2) It should enforce the semantic constraints for this pattern: N2 must expect NP1 as its POSSESSOR with semantic agreement.
(3) It should correctly assign roles to the constituents of the pattern: NP1 POSSESS NP2 (where NP2 consists of XP N2).
(4) It should be implementable in the HPSG formalism.

3.2. What mechanisms can we use to tackle a problem in the HPSG formalism?

An HPSG grammar consists of two components: a general grammar (ID schemata and principles) and a lexical grammar (in the lexicon). The lexicon houses lexical entries with their linguistic description and knowledge representation in feature structures. The lexicon also contains generalizations captured by inheritance in the lexical hierarchy and by a set of lexical rules. Roughly speaking, lexical hierarchy covers static redundancy between related potential structures. Just because the lexicon can reflect different degrees of lexical redundancy in addition to idiosyncrasy, the general grammar can desirably be kept to a minimum.

The Chinese NP predicate pattern should be treated in the lexicon. There are two arguments for that. First, this pattern covers only restricted phenomena (see 3.4). Second, it relies heavily on semantic agreement, which in our model is specified in the lexicon by KNOWLEDGE.
We need somehow to link the semantic expectation KNOWLEDGE and the syntactic expectation SUBCAT or MOD. The general mechanism to achieve that is structure sharing by coindexing the features either directly in the lexical entries (see the representation of the entry chi in Section 2) or through lexical rules (see 3.3). 3.3. Lexical Rule Lexical rules are applied to lexical signs (words, not phrases) which satisfy the condition. The result of the application is an expanded lexicon to be used during parsing. Since the pattern is of the form NP1 XP N2, the only possible target is N2, i.e. shenti (body) or tui (leg). This is due to the fact that among the three necessary signs in this form, the first two are phrases and only the final N2 is a lexical sign. We assume the following structure for our proposed lexical rule: NP ] hao ] , XP shenti ]] NP Predicate Lexical Rule SYNSEM | KNOWLEDGE | PRED possess SYNSEM | KNOWLEDGE | POSSESSOR SYNSEM | LOCAL | CATEGORY | HEAD | MAJ n SYNSEM | LOCAL | CATEGORY | PREDICATE - SYNSEM | LOCAL | CONTENT | INDEX SYNSEM | LOCAL | CONTENT | RESTRICTION { } ...| CATEGORY | PREDICATE + ...| CATEGORY | SUBCAT | EXTERNAL_ARGUMENT ] ...| CATEGORY | SUBCAT | INTERNAL_ARGUMENTS ] ...| CATEGORY | SUBCAT | INTERNAL_ARGUMENTS ] == ...| CATEGORY | SUBCAT | INTERNAL_ARGUMENTS } ] ...| CATEGORY | SUBCAT | INTERNAL_ARGUMENTS ...| CONTENT | RELATION possess ...| CONTENT | POSSESSOR | INDEX | ROGET ...| CONTENT | POSSESSED | INDEX ...| CONTENT | POSSESSED | RESTRICTION { | } For complicated information flow like this, it is best to explain the indices one by one with regards to the example ta hao shenti (he is of good body) in the form of NP1 XP N2. The index links the underlying PRED feature of N2 to the semantic RELATION feature; in other words, the predicate in the underlying KNOWLEDGE of shenti (body) now surfaces as the relation for the whole sentence. The index enforces the semantic constraint for this pattern, i.e. 
shenti (body) expects a human (ROGET) possessor as the subject (EXTERNAL_ARGUMENT) for this sentence. The index is the restriction relation of N2. links the INDEX features of XP and N2, and indicates that the internal argument is a de-facto modifier of N2, i.e. XP mods-for N2. Note that the part of speech of the internal argument (INTERNAL_ARGUMENT | SYNSEM | LOCAL | CATEGORY | HEAD | MAJ) is deliberately not specified in the rule because Chinese modifiers (XP) are not confined to one class, as can be seen in our linguistic data. Finally, defines the restriction relation of the XP to the INDEX of N2. The indices , and all contribute to artificially creating a semantic interpretation for . As is interpreted, XP is, in fact, a modifier of N2 and they would form an NP2, or constituent. In normal circumstances, the building of NP2 interpretation is taken care of by HPSG Semantics Principle . But in this special pattern, we have treated XP as a complement of N2, yet semantically they are still understood as one instance: hao shenti (good body) is an instance of good and body . This interpretation of NP2 serves as POSSESSED of the sentence predicate, indicated by the structure-sharing of , and . Finally, is the interpretation of NP1 and is assigned the role of POSSESSOR for the sentence predicate. Let's see how well this lexical rule meets the 4 requirements set in 3.1. (1) It enforces the syntactic constraints by treating XP as the internal argument and NP1 as the external argument. (2) It enforces the semantic constraints through structure sharing by the index . (3) It correctly assigns roles to the constituents of the pattern. The following interpretation will be established for ta hao shenti (he is of good body) by the parser. 
CONTENT | RELATION possess
CONTENT | POSSESSOR | INDEX | PERSON 3
CONTENT | POSSESSOR | INDEX | NUMBER singular
CONTENT | POSSESSOR | INDEX | GENDER male
CONTENT | POSSESSOR | INDEX | ROGET human
CONTENT | POSSESSOR | RESTRICTION { }
CONTENT | POSSESSED | INDEX | PERSON 3
CONTENT | POSSESSED | INDEX | NUMBER singular
CONTENT | POSSESSED | INDEX | GENDER nil
CONTENT | POSSESSED | INDEX | ROGET organ
CONTENT | POSSESSED | RESTRICTION { , }
CONTENT | POSSESSED | RESTRICTION { ], ] }

In prose, it says roughly that a third person male human he possesses something which is an instance of good body. We believe that this is the adequate interpretation for the original sentence.

(4) Last, this rule has been implemented in our Chinese HPSG-style grammar using ALE and Prolog. The results meet our objective.

But there is one issue we have not touched yet: word order. At first sight, Chinese seems to have LP constraints similar to those in English. For example, the internal argument(s) of a Chinese transitive verb by default appear on the right side of the head. It seems that our formulation contradicts this constraint in grammar. But in fact, there are many other examples with the internal argument(s), especially PP argument(s), appearing on the left side of the head.

服务 fuwu (serve): NP, PP(wei)

16a) 为人民服务
     wei renmin fuwu
     for people serve
     Serve the people.

16b) ? 服务为人民。
     fuwu wei renmin.
     serve for people

有益 youyi (of benefit): NP, PP(dui yu)

17a) 这对我有益。
     zhe dui wo youyi
     this to I have-benefit
     This is of benefit to me.

17b) * 这有益对我。
     zhe youyi dui wo
     this have-benefit to I

18a) 这于我有益。
     zhe yu wo youyi
     this to I have-benefit
     This is of benefit to me.

18b) 这有益于我。
     zhe youyi yu wo
     this have-benefit to I
     This is of benefit to me.

Word order and its place in grammar are important issues in formulating Chinese grammar.
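The contrasts in 16) to 18) can be checked mechanically once each head lists, per argument, the side or sides on which that argument may appear. The sketch below is illustrative only: the table `ARG_SIDES` and the function `order_ok` are hypothetical names, sentences are given as pre-segmented pinyin lists, and the marginal 16b) is simply treated as not licensed.

```python
# Side(s) of the head on which a PP argument may appear,
# keyed by (head, preposition). Values are read off examples 16) to 18).
ARG_SIDES = {
    ("fuwu", "wei"): {"left"},
    ("youyi", "dui"): {"left"},
    ("youyi", "yu"): {"left", "right"},
}


def order_ok(words, head):
    """Check that every listed preposition of `head` sits on a licensed side."""
    head_pos = words.index(head)
    for pos, word in enumerate(words):
        sides = ARG_SIDES.get((head, word))
        if sides is None:
            continue  # not a preposition this head says anything about
        side = "left" if pos < head_pos else "right"
        if side not in sides:
            return False
    return True
```

For instance, `order_ok(["zhe", "dui", "wo", "youyi"], "youyi")` holds for 17a), while `order_ok(["zhe", "youyi", "dui", "wo"], "youyi")` fails for 17b); both orders with yu in 18) pass. The direction is stored with the individual head, so no general ordering rule is needed.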
To play safe and avoid generalizing too soon, we assume a lexicalized view of the Chinese LP constraint, encoding word order information in the LEXICON through the SUBCAT and MOD features. This proves to be a realistic and precise approach to Chinese word order phenomena.

3.4. As a final note, we will briefly compare the NP Predicate Pattern with one of the Chinese Topic Constructions:

NP1 NP2 Vi/A (topic + (subject + predicate))

In Chinese, this is a closely related but much more productive form than the NP Predicate Pattern, and the two structures are different.

19) 他身体好。
    ta shenti hao
    he body good
    He is good in health.

For topic constructions, we propose a new feature CONTEXT | TOPIC, whose index in this case is token identical to the INDEX value of ta. Note that in the above structure, the CONTEXT | TOPIC ta is considered a sentential adjunct instead of a complement subcategorized for by shenti. Why? First, ta is highly optional: a topic-less sentence is still a sentence. Second, and more convincingly, ta cannot always be predicted by its following noun. Compare:

20a) 他身体好。
     ta shenti hao
     he body good
     He is good in health.

20b) 他好身体。
     ta hao shenti
     he good body
     He is of good health.

21a) 他脾气好。
     ta piqi hao
     he disposition good
     He is good in disposition.

21b) 他好脾气。
     ta hao piqi
     he good disposition
     He is of good disposition.

but:

22a) 他学习好。
     ta xuexi hao.
     he study good
     He is good in study.

22b) * 他好学习。
     ta hao xuexi
     he good study

What this shows is that for topic sentences like ta shenti hao (He is good in health), ta xuexi hao (He is good in study), etc., there is no requirement to regard the topic ta (he) as a necessary semantic possessor of shenti / xuexi. The relation is rather in-aspect: something (NP1) is good (A) in some aspect (NP2), or for something (NP1), some aspect (NP2) is good (A).

Finally, it needs to be mentioned that our proposed lexical rule requires modification to accommodate sentence 6).
That is already beyond what we can reach in this paper because it is integrated with the way we handle Chinese classifiers in the HPSG framework.

References

Pollard, Carl & Sag, Ivan A. (1994): Head-Driven Phrase Structure Grammar, Centre for the Study of Language and Information, Stanford University, CA
Pollard, Carl & Sag, Ivan A. (1987): Information-based Syntax and Semantics. Vol. 1: Fundamentals. Centre for the Study of Language and Information, Stanford University, CA
Wilks, Y.A. (1975a): A Preferential Pattern-Seeking Semantics for Natural Language Inference. Artificial Intelligence, Vol. 6, pp. 53-74
Wilks, Y.A. (1975b): An Intelligent Analyzer and Understander of English, in Communications of the ACM, Vol. 18, No. 5, pp. 264-274
Wilks, Y.A. (1978): Making Preferences More Active. Artificial Intelligence, Vol. 11, pp. 197-223

~~~~~~~~~~~~~~~ footnotes ~~~~~~~~~~~~~~~~

This is not absolute; we do have the following examples:

Ia) 约翰是纽约人。
    Yuehan shi Niuyue ren
    John be New-York person
    John is a New Yorker.

Ib) 约翰纽约人。
    Yuehan Niuyue ren.
    John New-York person
    John is a New Yorker.

IIa) 今天是星期天。
     jintian shi xingqi-tian.
     today be Sun-day
     Today is Sunday.

IIb) 今天星期天。
     jintian xingqi-tian.
     today Sun-day
     Today is Sunday.

It seems that the subject NP stands for some individual element(s), and the predicate NP describes a set (property) to which the subject belongs. But it is not clear how to capture Ib) and IIb) while excluding 7b). We leave this question open.

We realize that the syntactic constraint defined here is only a rough approximation to the data from the syntactic angle. It seems to match most data, but there are exceptions when yi (one) appears in a numeral-classifier phrase:

IIIa) 他一副好身体。
      ta yi fu hao shenti.
      he one (cl.) good body
      He is of good health. (He is of a good body.)

IIIb) * 他三副好身体。
      ta san fu hao shenti
      he three (cl.) good body

IIIc) 他好身体。
      ta hao shenti.

IVa) 李四一张圆圆的脸。
     Lisi yi zhang yuanyuan de lian.
     Lisi one (cl.) round-round DE face
     Lisi has a quite round face.

IVb) * 李四两张圆圆的脸。
     Lisi liang zhang yuanyuan de lian.
     Lisi two (cl.) round-round DE face

IVc) 李四圆圆的脸。
     Lisi yuanyuan de lian.

Another reading for 22a) is [[ta xuexi] hao], where ta xuexi is a subject clause: That he studies is good. This is another issue.
Outline of An HPSG-style Chinese Reversible Grammar
liwei999 2016-9-14 12:35
ABSTRACT

Key words: Chinese parsing, Chinese generation, reversible grammar, HPSG

This paper presents a reversible Chinese unification grammar named CPSG. The lexicalized and integrated design of CPSG embodies the general spirit of the modern linguistic theory Head-driven Phrase Structure Grammar (HPSG, Pollard & Sag 1987, 1994). Using the ALE formalism in Prolog (Carpenter & Penn 1994), we have implemented a prototype of CPSG. CPSG covers Chinese morphology, Chinese syntax and semantics in a novel integrated language model (Figure 1; for the interface between morphology and syntax, see Li 1997; for the interface between syntax and semantics, see Li 1996). The CPSG model is in sharp contrast to the conventional clear-cut successive design of grammar components (Figure 2, see survey in Feng 1996). We will show that our model is much better suited to, and more efficient for, Chinese analysis (or generation). Grammar reversibility is a highly desired feature for multi-lingual machine translation applications (Hutchins & Somers 1992, Huang 1986, 1987). To test its reversible features, we have applied the CPSG prototype to an experiment in bi-directional machine translation between English and Chinese. The machine translation engine developed in our Natural Language Lab is based on the shake-and-bake design, a novel approach to machine translation suited to unification grammars (Whitelock 1992, 1994, Beaven 1992, Brew 1992). The experimental results meet our design objective and verify the feasibility of the CPSG approach.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Outline of an HPSG-style Chinese reversible grammar *

Wei LI
Simon Fraser University
(NLWC97)

This paper presents the outline and the design philosophy of a lexicalized Chinese unification grammar named W-CPSG. W-CPSG covers Chinese morphology, Chinese syntax and semantics in a novel integrated language model. The grammar works reversibly, suited for both parsing and generation.
This work is developed in the general spirit of the linguistic theory Head-driven Phrase Structure Grammar (Pollard & Sag 1994). We identify the following two problems as major obstacles in formulating a precise and efficient Chinese grammar. First, serious study of the Chinese lexical base is lacking, and linguistic generalizations are often made too soon. Second, there is a lack of effective interaction and an adequate interface between morphology, syntax and semantics. We address these problems in depth with the lexicalized and integrated design of W-CPSG. We will also illustrate how W-CPSG is formalized and how it works.

1. Background

Unification grammars have been extensively studied in the last decade (Shieber 1986). Implementations of such grammars for English are being used in a wide variety of applications. Attempts have also been made to write Chinese unification grammars (Huang 1986, among others). W-CPSG (for Wei's Chinese Phrase Structure Grammar, Li, W. 1997b) is a new endeavor in this direction, with its unique design and characteristics.

1.1. Design philosophy

We identify the following two problems as major obstacles in formulating a precise and efficient Chinese grammar. First, serious study of the Chinese lexical base is lacking, and linguistic generalizations are often made too soon. Second, there is a lack of effective interaction and an adequate interface between morphology, syntax and semantics. We address these problems in depth with the lexicalized and integrated design of W-CPSG.

1.1.1. Lexicalized design

It has been widely accepted that a well-designed lexicon is crucial for a successful grammar, especially for a natural language computational system. But Chinese linguistics in general, and Chinese computational grammars in particular, have generally been lacking in in-depth research on the Chinese lexical base.
For many years, most dictionaries published in China did not even contain information for grammatical categories in the lexical entries (except for a few dictionaries intended for foreign readers learning Chinese). Compared with the sophisticated design and rich linguistic information embodied in English dictionaries like Oxford Advanced Learners' Dictionary and Longman Dictionary of Contemporary English , Chinese linguistics is hampered by the lack of such reliable lexical resources. In the last decade, however, Chinese linguists have achieved significant progress in this field. The publication of 800 Words in Contemporary Mandarin (Lü et al., 1980) marked a milestone for Chinese lexical research. This book is full of detailed linguistic description of the most frequently used Chinese words and their collocations. Since then, Chinese linguists have made fruitful efforts, marked by the publication of a series of valency dictionaries (e.g. Meng et al., 1987) and books (e.g. Li, L. 1986, 1990). But almost all such work was done by linguists with little knowledge of computational linguistics. Their description lacks formalization and consistency. Therefore, Chinese computational linguists require patience in adapting and formalizing these results, making them implementable. 1.1.2. Integrated design Most conventional grammars assume a successive model of morphology, syntax and semantics. We argue that this design is not adequate for Chinese natural language processing. Instead, an integrated grammar of morphology, syntax and semantics is adopted in W‑CPSG. Let us first discuss the rationale of integrating morphology and syntax in Chinese grammar. As it stands, a written Chinese sentence is a string of characters (morphemes) with no blanks to mark word boundaries. In conventional systems, there is a procedure-based Chinese morphology preprocessor (so-called segmenter ). The major purpose for the segmenter is to identify a string of words to feed syntax. 
This is not an easy task because segmentation ambiguity may be involved. For example, given a string of 4 Chinese characters da xue sheng huo, the segmentation ambiguity is shown in (1a) and (1b) below.

(1) da xue sheng huo
(a) da-xue | sheng-huo
    university | life
(b) da-xue-sheng | huo
    university-student | live

Resolving the above ambiguity in the morphology preprocessor is a hopeless job because such structural ambiguity is syntactically conditioned. For sentences like da xue sheng huo you qu (university life is interesting), (1a) is the right identification. For sentences like da xue sheng huo bu xia qu le (university students cannot make a living), (1b) is right. So far there are no segmenters which can handle this properly and guarantee correct word segmentation (Feng 1996). In fact, there can never be such segmenters as long as syntax is not brought in. This is a theoretical defect of all Chinese analysis systems in the morphology-before-syntax architecture (Li, W. 1997a). I have solved this problem in the morphology-syntax integrated W‑CPSG (see 2.2. below).

Now we examine the motivation for integrating syntax and semantics in Chinese grammar. It has been observed that, compared with the analysis of Indo-European languages, proper Chinese analysis relies more heavily on semantic information (see, e.g. Chen 1996, Feng 1996). Chinese syntax is not as rigid as that of languages with inflections. Semantic constraints are called for in both structural and lexical disambiguation as well as in solving the problem of computational complexity. The integration of syntax and semantics helps establish flexible ways for them to interact in analysis (see 2.3. below).

1.2. Major theoretical foundation: HPSG

The work on W‑CPSG is developed in the spirit of the linguistic theory Head-driven Phrase Structure Grammar (HPSG, proposed by Pollard & Sag 1987). HPSG is a highly lexicalist theory, which encourages the integration of different components.
This matches our design philosophy for implementing our Chinese computational grammar. HPSG serves as a desired framework to start this research with. We benefit most from the general linguistic ideas in HPSG. However, W‑CPSG is not confined to the theory-internal formulations of principles and rules and other details in HPSG versions (e.g. Pollard Sag 1987, 1994 or later developments). We borrow freely from other theoretical sources or form our own theories in W‑CPSG to meet our goal of Natural Language Processing in general and Chinese computing in particular. For example, treating morphology as an integrated part of parsing and placing it right into grammar is our deliberate choice. In syntax, we formulate our own theory for configuration and word order. Our semantics differs most from any standard version of situation-semantics-based theory in HPSG. It is based on insights from Tesnière's Dependency Grammar (Tesnière 1959), Fillmore's Case Grammar (Fillmore 1968) and Wilks' Preference Semantics (Wilks 1975, 1978) as well as our own semantic view for knowledge representation and better coordination of syntax-semantics interaction (Li, W. 1996). For these differences and other modifications, it is more accurate to regard W‑CPSG as an HPSG-style Chinese grammar, rather than an (adapted) version of Chinese HPSG. Integrated language model 2.1. W‑CPSG versus conventional Chinese grammar The lexicalized design sets the common basis for the organization of the grammar in W‑CPSG. This involves the interfaces of morphology, syntax and semantics. W‑CPSG assumes an integrated language model of its components (see Figure 1). The W‑CPSG model is in sharp contrast to the conventional clear-cut successive design of grammar components (see Figure 2). Figure 2. conventional language model (non-reversible) 2.2. 
Interfacing morphology and syntax As shown in Figure 2 above, conventional systems take a two-step approach: a procedure-based preprocessor for word identification (without discovering the internal structure) and a grammar for word-based parsing. W‑CPSG takes an alternative one-step approach and the parsing is character- (i.e. morpheme-) based. A morphological PS (phrase structure) rule is designed not only to identify candidate words but to build word‑internal structures as well. In other words, W‑CPSG is a self-contained model, directly accepting the input of a character string for parsing. The parse tree embodies both the morphological analysis and the syntactic analysis, as illustrated by the following sample parsing chart. Note: DET for determiner; CLA for classifier; N for noun; DE for particle de ; AF for affix; V for verb; A for adjective; CLAP for classifier phrase; NP for noun phrase; DEP for DE-phrase This is so-called bottom-up parsing . It starts with lexicon look-up. Simple edges 1 through 7 are lexical edges. Combined edges are phrasal edges. Each edge represents a sign, i.e. a character (morpheme), a word, a phrase or a sentence. Lexical edges result from a successful match between the signs in the input string and the entries in the lexicon during lexicon look-up. After looking up the lexicon, the lexical information for the signs are made available to the parser. For the sake of concise illustration, we only show two crucial pieces of information for each edge in the chart, namely category and interpretation with a delimiting colon (some function words are only labeled for category). The parser attempts to combine the edges according to PS rules in the grammar until a parse is found. A parse is an edge which ranges over the whole string. The parse ((((1+2)+3)+4)+((5+6)+7)) represents the following binary structural tree embodying both the morphological and syntactic analysis of this NP phrase. 
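The bottom-up procedure described above can be sketched in a few lines of Python. This is only a toy illustration, not the W‑CPSG implementation (which is written in ALE on top of Prolog): pinyin syllables stand in for the seven characters, and the lexicon entries, the category pairs in RULES and the function name chart_parse are assumptions chosen so that the sketch reproduces the bracketing ((((1+2)+3)+4)+((5+6)+7)).

```python
from collections import defaultdict

# Hypothetical toy lexicon: one category per character (pinyin stand-ins).
LEXICON = {
    "zhe": ["DET"], "ben": ["CLA"], "shu": ["N"], "de": ["DE"],
    "ke": ["AF"], "du": ["V"], "xing": ["AF"],
}

# Hypothetical binary rules: (left category, right category) -> mother.
# The last two syntactic rules and the two morphological rules sit side by side.
RULES = {
    ("DET", "CLA"): "CLAP",   # zhe + ben
    ("CLAP", "N"): "NP",      # zhe-ben + shu
    ("NP", "DE"): "DEP",      # zhe-ben-shu + de
    ("AF", "V"): "A",         # ke + du   (word-internal: prefixation)
    ("A", "AF"): "N",         # ke-du + xing (word-internal: suffixation)
    ("DEP", "N"): "NP",       # DE-phrase + head noun
}

def chart_parse(chars):
    """Return all edges spanning the whole input as (category, bracketing)."""
    n = len(chars)
    chart = defaultdict(list)  # (start, end) -> list of (category, bracketing)
    # Lexical edges come from lexicon look-up, numbered 1..n as in the text.
    for i, ch in enumerate(chars):
        for cat in LEXICON.get(ch, []):
            chart[(i, i + 1)].append((cat, str(i + 1)))
    # Phrasal edges: combine two adjacent edges whenever a rule licenses it.
    for width in range(2, n + 1):
        for start in range(0, n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for lcat, ltree in chart[(start, mid)]:
                    for rcat, rtree in chart[(mid, end)]:
                        mother = RULES.get((lcat, rcat))
                        if mother is not None:
                            chart[(start, end)].append(
                                (mother, f"({ltree}+{rtree})"))
    return chart[(0, n)]

if __name__ == "__main__":
    print(chart_parse(["zhe", "ben", "shu", "de", "ke", "du", "xing"]))
    # prints [('NP', '((((1+2)+3)+4)+((5+6)+7))')]
```

Because the morphological rules live in the same rule table as the syntactic ones, the word ke-du-xing is built inside the same chart as the noun phrase, with no separate segmentation pass.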
As seen, word identification is no longer a pre-condition for parsing. It becomes a natural by-product of parsing in this integrated grammar of morphology and syntax: a successful parse always embodies the right word identification. For example, the parse ((((1+2)+3)+4)+((5+6)+7)) includes the identification of a word-string zhe (DET) ben (CLA) shu (N) de (DE) ke-du-xing (N). An argument against the conventional separation model is that there exists in the two-step approach a theoretical threshold beyond which the precision for the correct word identification is not possible. This is because proper word identification in Chinese is to a considerable extent syntactically conditioned due to possible structural ambiguity involved. Our strategy has advantages over the conventional approach in resolving word identification ambiguities and in handling the productive word formation. It has solved the problems inherent in the morphology-before-syntax architecture (for detailed argumentation, see Li, W. 1997a). 2.3. Interaction of syntax and semantics The interface and interaction of syntax and semantics are of vital importance in a Chinese grammar. We are of the same opinion as Chen (1996) and many others that it is more effective to analyze Chinese in an environment where semantic constraints are enforced during the parsing, not after . The argument is based on the linguistic characteristics of Chinese. Chinese has no inflection (like English ‑'s, ‑s, ‑ing, ‑ed , etc.), no such formatives as article (like English a, the ), infinitivizer (like English to ) and complementizer (like English that ). Instead, function words and word order are used as major syntactic devices. But Chinese function words (prepositions, aspect particles, passive particle, plural suffix, conjunctions, etc.) can often be omitted (Lü et al. 1980, p.2). 
Moreover, fixed word order in order to mark syntactic functions which is usually assumed for isolating languages, is to a considerable extent untrue for Chinese. In fact, there is remarkable freedom or flexibility in Chinese word order. One typical example is demonstrated in the numerous word order variations (although the default order is S‑V‑O subject-verb-object) for the Chinese transitive patterns (Li, W. 1996). All these added up project a picture of Chinese as a language of loose syntactic constraint. A weak syntax requires some support beyond syntax to enhance grammaticality. Semantic constraints are therefore called for. I believe that an effective way to model this interaction between syntax and semantics is to integrate the two in one grammar. One strong piece of evidence for this syntax-semantics integration argument is that Chinese has what I call syntactically crippled structures . These are structures which can hardly be understood on purely formal grounds and are usually judged as ungrammatical unless accompanied with the support from the semantic constraints (i.e. the match of semantic selection restrictions). Some Chinese NP predicate (Li, W. McFetridge 1995) and transitive patterns like S‑O‑V (Li, W. 1996), among others, are such structures. The NP Predicate is a typical instance of semantic dependence. It is highly undesirable if we assume a general rule like S -- NP1 NP2 in a Chinese grammar to capture such phenomena. This is because there is a semantic condition for NP2 to function as predicate, which makes the Chinese NP predicate a very restricted pattern. For example, in the sentence This table is three-legged : zhe (this) zhang (classifier) zhuo-zi (desk) san (three) tiao (classifier) tui (leg), the subject must be of the semantic type animate or furniture (which can have legs). The general rule with no recourse to semantic constraints is simply too productive and may cause severe computational complexity. 
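The semantic condition on the NP predicate can be sketched as a check against a small type hierarchy. The following Python fragment is a minimal illustration under stated assumptions: the hierarchy ISA, the table NP_PREDICATE_EXPECTS and the function names are hypothetical, and in W‑CPSG itself such knowledge is encoded lexically in typed feature structures rather than in tables like these.

```python
# Hypothetical semantic type hierarchy: child type -> parent type.
ISA = {
    "human": "animate",
    "dog": "animate",
    "table": "furniture",
    "animate": "entity",
    "furniture": "entity",
    "idea": "abstract",
    "abstract": "entity",
}

def is_a(sem_type, ancestor):
    """True if sem_type equals ancestor or inherits from it."""
    while sem_type is not None:
        if sem_type == ancestor:
            return True
        sem_type = ISA.get(sem_type)
    return False

# Hypothetical lexical entries: the head noun of NP2 lists the semantic
# types it accepts for its subject when used as a predicate.
NP_PREDICATE_EXPECTS = {
    "tui": ("animate", "furniture"),  # leg: only things that can have legs
}

def licenses_np_predicate(subject_type, predicate_head):
    """Allow NP1 NP2 as a sentence only if NP2's head accepts NP1's type."""
    expected = NP_PREDICATE_EXPECTS.get(predicate_head)
    if expected is None:
        return False  # this noun does not license the NP predicate pattern
    return any(is_a(subject_type, t) for t in expected)

if __name__ == "__main__":
    print(licenses_np_predicate("table", "tui"))  # prints True
    print(licenses_np_predicate("idea", "tui"))   # prints False
```

Keeping the constraint in the lexical entry of the predicate noun means that no general rule of the form "S is NP1 followed by NP2" has to fire for arbitrary noun pairs.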
In the case of Chinese transitive patterns, formal means are decisive for some variations in their interpretation (i.e. role assignment) process. But others are heavily dependent on semantic constraints. Take chi (eat) as an example. There is no difference in syntactic form in sentences like wo (I) dianxin (Dim-Sum) chi (eat) le (perfect-aspect) and dianxin (Dim-Sum) wo (I) chi (eat) le (perfect-aspect). Who eats what? To properly assign roles to NP1 NP2 V as S-O-V versus O-S-V, the semantic constraint animate eats food needs to be enforced.

The conventional syntax-before-semantics model has now lost popularity in the Chinese computing community. Researchers have been exploring various ways of integrating syntax and semantics in Chinese grammar (Chen 1996). In W‑CPSG, the Chinese syntax is enhanced by the incorporation of a semantic constraint mechanism. This mechanism embodies a lexicalized knowledge representation, which parallels the syntactic representation in the lexicon. I have developed a way to dynamically coordinate the syntactic constraint and the semantic constraint in one model. This technique proves to be effective in handling rhetorical expressions and in making the grammar both precise and robust (Li, W. 1996).

Lexicalized formal grammar

3.1. Formalized grammar

The application nature of this research requires that we pay equal attention to practical issues of computational systems as well as to a sound theoretical design. All theories and rule formulations in W‑CPSG are implementable. In fact, most of them have been implemented in our prototype W‑CPSG. W‑CPSG is a strictly formalized grammar that does not rely on undefined notions. The whole grammar is represented by typed feature structures (TFS), as defined below based on Carpenter & Penn (1994).

(3) Definition: typed feature structure
A typed feature structure is a data structure adopted to model a certain object of a grammar. The necessary part of a typed feature structure is its type.
Type represents the classification of the feature structure. A simple feature structure contains only the type information, but a complex feature structure can introduce a set of feature-value pairs in addition to the type. A feature-value pair consists of a feature and a value. A feature reflects one aspect of an object. The value describes that aspect. A value is itself a feature structure (simple or complex). A feature determines which type of feature structures it takes as its value. Typed feature structures are finite in a grammar. Their definition constitutes the typology of the grammar.

With this formal device of typed feature structures, we formulate W‑CPSG by defining from the very basic notions (e.g. sign, morpheme, word, phrase, S, NP, VP, etc.) to rules (PS rules and lexical rules), lexical items, lexical hierarchy and typology (hierarchy embodied in feature structures) (Li, W. 1997b). The following sample definitions of some basic notions illustrate the formal nature of W‑CPSG. Please note that they are system-internal definitions and are used in W‑CPSG to serve the purpose of configurational constraints (see Chapter VI of Li, W. 1997b).

(4) Definition: sign
a_sign
    KANJI kanji
    MORPH expected
    CATEGORY category
    COMP0 expected
    COMP1 expected
    COMP2 expected
    MOD expected
    KNOWLEDGE knowledge
    CONTENT content
    DTR dtr

A sign is the most fundamental concept of grammar. A sign is a dynamic unit of grammatical analysis. It can be a morpheme, a word, a phrase or a sentence. Formally, a sign is defined by the TFS a_sign, which introduces a set of linguistic features for its description, as shown above. These features include the orthographic feature KANJI; morphological feature MORPH; syntactic features CATEGORY, COMP0, COMP1, COMP2, and MOD; structural feature (for both morphology and syntax) DTR; semantic features KNOWLEDGE and CONTENT.

(5) Definition: morpheme
a_sign
    MORPH ~saturated

A morpheme is a sign whose morphological expectation has not been saturated.
In W‑CPSG, ~saturated is equivalent to obligatory/optional/null. For example, the suffix -xing (-ness) is such a morpheme whose morphological expectation for a preceding adjective is obligatory. In W‑CPSG, a morpheme like -xing (-ness) ceases to be a morpheme when its obligatory expectation, say the adjective ke-du (readable), is saturated. Therefore, the sign ke-du-xing (readability) is not a morpheme, but becomes a word per se.

(6) Definition: word
a_sign
    MORPH ~obligatory
    DTR no_syn_dtr

In W‑CPSG, ~obligatory is equivalent to saturated/optional/null. The specification [MORPH ~obligatory] defines a syntactic sign, i.e. a sign whose obligatory morphological expectation has been saturated. A word is a syntactic sign with no syntactic daughters, i.e. [DTR no_syn_dtr]. Obviously, word with [MORPH ~obligatory] overlaps morpheme with [MORPH ~saturated] in cases when the morphological expectation is optional or null. Just like the overlapping of morpheme and word, there is also an intersection between word and phrase. Compare the following definition of phrase with the above definition of word.

(7) Definition: phrase
a_sign
    MORPH ~obligatory
    COMP0 ~obligatory
    COMP1 ~obligatory
    COMP2 ~obligatory

A phrase is a syntactic sign whose obligatory complement expectations have all been saturated, i.e. [COMP0 ~obligatory, COMP1 ~obligatory, COMP2 ~obligatory]. When a word has only optional complement expectations or no complement expectation, it is also a phrase. The overlapping relationship among morpheme, word and phrase can be shown by the following illustration of the three sets.

S is a syntactic sign satisfying the following 3 conditions: (1) its category is pred (which includes V and A); (2) its comp0 is saturated; (3) its obligatory comp1 and comp2 are saturated.

3.2. Lexicalized grammar

W‑CPSG takes a radical lexicalist approach. We started with individual words in the lexicon and have gradually built up a lexical hierarchy and the grammar prototype. W‑CPSG consists of two parts: a minimized general grammar and an information-enriched lexicon.
The general grammar contains only 11 PS rules, covering complement structure, modifier structure, conjunctive structure and morphological structure. We formulate a PS rule for illustration. This comp0 PS rule is similar to the rule S == NP VP in the conventional phrase structure grammar. The feature COMP0 represents the expectation of the head daughter for its external complement (subject or specifier) on its left side, i.e. . The nature of its expected comp0, NP or other types of sign, is lexically decided by the individual head (hence head-driven or lexicon-driven). It will always be warranted by the general grammar, here via the index . This is the nature of lexicalized grammars. PS rules in such grammars are very abstract. Essentially, they say one thing, namely, 2 signs can combine so long as the lexicon so indicates. The indices and represent configurational constraint. They ensure that internal obligatory complements COMP1 and COMP2 must be saturated before this rule can be applied. Finally, Head Feature Principle (defined elsewhere in the grammar based on the adaptation of the Head Feature Principle in HPSG, Pollard Sag, 1994) ensures that head features are percolated up from the head daughter to the mother sign. The lexicon houses lexical entries with their linguistic description and knowledge representation. Potential morphological structures, as well as potential syntactic structures, are lexically encoded (in the feature MORPH for the former and in the features COMP0, COMP1, COMP2, MOD for the latter). Our knowledge representation is also embodied in the lexicon (in the feature KNOWLEDGE). I believe that this is an effective and realistic way of handling natural language phenomena and their disambiguation without having to resort to an encyclopedia-like knowledge base. The following sample formulation of the lexical entry chi (eat) projects a rough picture of what the W‑CPSG lexicon looks like. The lexicon also contains lexical generalizations. 
The generalizations are captured by the inheritance of the lexical hierarchy and by a set of lexical rules. Due to space limitations, I will not show them in this paper.

Implementation and application of W‑CPSG

A substantial Chinese computational grammar has been implemented in the W‑CPSG prototype. It covers all basic Chinese constructions. Particular attention is paid to the handling of function words and verb patterns. On the basis of the information-enriched lexicon and the general grammar, the system adequately handles the relationship between linguistic individuality and generality.

The grammar formalism which I use to code W‑CPSG is ALE, a grammar compiler on top of Prolog, developed by Carpenter & Penn (1994). ALE is equipped with an inheritance mechanism on typed feature structures, a powerful tool in grammar modeling. I have made extensive use of the mechanism in the description of lexical categories as well as in knowledge representation. This seems to be an adequate way of capturing the inherent relationship between features in a grammar. Prolog is a programming environment particularly suitable for the development of unification and reversible grammars (Huang 1986, 1987). ALE compiles W‑CPSG into a Chinese parser, a Prolog program ready to accept a string of characters for analysis. In the first experiment, W‑CPSG parsed a corpus of 200 Chinese sentences of various types.

An important benefit of a unification-based grammar is that the same grammar can be used both for parsing and generation. Grammar reversibility is a highly desired feature for multi-lingual machine translation applications. Following this line, I have successfully applied W‑CPSG to the experiment of bi-directional machine translation between English and Chinese. The machine translation system developed in our Natural Language Lab is based on the shake-and-bake design (Whitelock 1992, 1994).
I used the same three grammar modules (W‑CPSG, an English grammar and a bilingual transfer lexicon) and the same corpus for the experiment. As part of machine translation output, W‑CPSG has successfully generated the 200 Chinese sentences. The experimental results meet our design objective and verify the feasibility of our approach.

References

Carpenter, B. & G. Penn (1994): ALE, The Attribute Logic Engine, User's Guide
Chen, K-J. (1996): Chinese sentence parsing, Tutorial Notes for International Conference on Chinese Computing ICCC'96, Singapore
Feng, Z-W. (1996): COLIPS lecture series - Chinese natural language processing, Communications of COLIPS, Vol. 6, No. 1 1996, Singapore
Fillmore, C. J. (1968): The case for case. Bach and Harms (eds.), Universals in Linguistic Theory. Holt, Rinehart and Winston, pp. 1-88
Huang, X-M. (1986): A bidirectional grammar for parsing and generating Chinese. Proceedings of the International Conference on Chinese Computing, Singapore, pp. 46-54
Huang, X-M. (1987): XTRA: The Design and Implementation of A Fully Automatic Machine Translation System, Doctoral dissertation, University of Essex
Li, L-D. (1986): Xiandai Hanyu Juxing (Sentence Patterns in Contemporary Mandarin), Shangwu Yinshuguan, Beijing
Li, L-D. (1990): Xiandai Hanyu Dongci (Verbs in Contemporary Mandarin), Zhongguo Shehui Kexue Chubanshe, Beijing
Li, W. & P. McFetridge (1995): Handling Chinese NP predicate in HPSG, Proceedings of PACLING-II, Brisbane, Australia
Li, W. (1996): Interaction of syntax and semantics in parsing Chinese transitive patterns, Proceedings of International Conference on Chinese Computing (ICCC'96), Singapore
Li, W. (1997a): Chart parsing Chinese character strings, Proceedings of The Ninth North American Conference on Chinese Linguistics (NACCL-9, to be available), Victoria, Canada
Li, W. (1997b): W‑CPSG: A Lexicalized Chinese Unification Grammar, Doctoral dissertation, Simon Fraser University (on-going)
Lü, S-X. et al. (ed.) (1980): Xiandai Hanyu Babai Ci (800 Words in Contemporary Mandarin), Shangwu Yinshuguan, Beijing
Meng, Z., H-D. Zheng, Q-H. Meng & W-L. Cai (1987): Dongci Yongfa Cidian (Dictionary of Verb Usages), Shanghai Cishu Chubanshe, Shanghai
Pollard, C. & I. Sag (1987): Information based Syntax and Semantics Vol. 1: Fundamentals. Centre for the Study of Language and Information, Stanford University, CA
Pollard, C. & I. Sag (1994): Head-Driven Phrase Structure Grammar, Centre for the Study of Language and Information, Stanford University, CA
Shieber, S. (1986): An Introduction to Unification-Based Approaches to Grammar. Centre for the Study of Language and Information, Stanford University, CA
Tesnière, L. (1959): Éléments de Syntaxe Structurale, Paris: Klincksieck
Whitelock, P. (1992): Shake and bake translation, Proceedings of the 14th International Conference on Computational Linguistics, pp. 784-790, Nantes, France
Whitelock, P. (1994): Shake and bake translation, C.J. Rupp, M.A. Rosner and R.L. Johnson (eds.), Constraints, Language and Computation, pp. 339-359, London, Academic Press
Wilks, Y.A. (1975): A preferential pattern-seeking semantics for natural language inference. Artificial Intelligence, Vol. 6, pp. 53-74
Wilks, Y.A. (1978): Making preferences more active. Artificial Intelligence, Vol. 11, pp. 197-223

-------------------------------------

* This project was supported by the Science Council of British Columbia, Canada under G.R.E.A.T. Award (code: 61) and by my industry partner TCC Communications Corporation, British Columbia, Canada. I thank my academic advisors Paul McFetridge and Fred Popowich and my industry advisor John Grayson for their supervision and encouragement. Thanks also go to my colleagues Davide Turcato, James Devlan Nicholson and Olivier Laurens for their help during the implementation of this grammar in our Natural Language Lab. I am also grateful to the editors of the NWLC'97 Proceedings for their comments and corrections.
We leave aside the other components such as discourse, pragmatics, etc. They are an important part of a grammar for a full analysis of language phenomena, but they are beyond what can be addressed in this research.

In formulating W‑CPSG, we use uppercase for feature and lowercase for type; ~ for logical not and / for logical or; number in square brackets for unification.
PhD Thesis: Chapter VII Concluding Remarks
liwei999 2016-8-30 07:08
This chapter summarizes the research conducted in this dissertation, including its contributions as well as its limitations.

7.0. Summary

The goal of this dissertation is to explore effective ways of formally approaching the Chinese morpho-syntactic interface in a phrase structure grammar. This research has led to the following results: (i) the design of a Chinese grammar, namely CPSG95, which enables flexible coordination and interaction of morphology and syntax; (ii) the solutions proposed in CPSG95 to a series of long-standing problems at the Chinese morpho-syntactic interface.

CPSG95 was designed in the general framework of HPSG (Pollard and Sag 1987, 1994). The sign-based mono-stratal design from HPSG has the advantage of being able to accommodate and access information from different components of a grammar. One crucial feature of CPSG95 is its introduction of morphology expectation feature structures and the corresponding morphological PS rules into HPSG. As a result, CPSG95 has been demonstrated to provide a favorable environment for solving morpho-syntactic interface problems.

Three types of morpho-syntactic interface problems have been studied extensively: (i) the segmentation ambiguity in Chinese word identification; (ii) Chinese separable verbs, a borderline problem between compounding and syntax; and (iii) borderline phenomena between derivation morphology and syntax.

In the context of the CPSG95 design, the segmentation ambiguity is no longer a problem, as morphology and syntax are designed system-internally in the grammar to support morpho-syntactic parsing based on non-deterministic tokenization (W. Li 1997, 2000). In other words, the design of CPSG95 itself entails an adequate solution to this long-standing problem, a problem which has been a central topic in Chinese NLP for the last two decades.
This is made possible because the access to a full grammar including both morphology and syntax is available in the integrated process of Chinese parsing and word identification while traditional word segmenters can at best access partial grammar knowledge. The second problem involves an interesting case between compounding and syntax: different types of Chinese separable verbs demonstrate various degrees of separability in syntax while all these verbs, when used contiguously, are part of Chinese verb vocabulary. For each type of separable verbs, arguments were presented for the proposed linguistic analysis and a solution to the problem was then formulated in CPSG95 based on the analysis. All the proposed solutions provide a way of capturing the link between the separated use and the contiguous use of the separable verb phenomena. They are shown to be better solutions than previous approaches in the literature which either cannot link the separated use and the contiguous use in the analysis or suffer from being not formal. The third problem at the interface of derivation and syntax involves two issues: (i) a considerable amount of ‘quasi-affix’ data, and (ii) the intriguing case of zhe -suffixation which demonstrates an unusual combination of a phrase with a bound morpheme. A generic analysis of Chinese derivation has been proposed in CPSG95. This analysis has been demonstrated to be also effective in handling both quasi-affixation and zhe -affixation. 7.1. Contributions The specific contributions are reflected in the study of the following five topics, each constituting a chapter. On the topic of the Role of Grammar , the investigation leads to the central argument that knowledge from both morphology and syntax is required to properly handle the major types of morpho-syntactic interface problems. This establishes the foundation for the general design of CPSG95 as consisting of morphology and syntax in one grammar formalism. 
An in-depth study has been conducted in the area of the segmentation ambiguity in Chinese word identification. The most important discovery from the study is that the disambiguation involves the analysis of the entire input string. This means that the availability of a grammar is key to the solution of this problem. A natural solution to this problem is the use of grammatical analysis to resolve, and/or prepare the basis for resolving, the segmentation ambiguity. On the topic of the Design of CPSG95 , a mono-stratal Chinese phrase structure grammar has been established in the spirit of the HPSG theory. Components of a grammar such as morphology, syntax and semantics are all accommodated in distinct features of a sign. CPSG95 is designed to provide a framework and means for formalizing the analysis of the linguistic problems at the morpho-syntactic interface. The essential part of this work is the design of expectation feature structures . Expectation feature structures are generalized from the HPSG feature structures for syntactic subcategorization and modification. One characteristic of the CPSG95 structural expectation is the design of morphological expectation features to incorporate Chinese productive derivation, which covers a wide range of linguistic phenomena in Chinese word formation. In order to meet the requirements induced by introducing morphology into the general grammar and by accommodating linguistic characteristics of Chinese, modifications from the standard HPSG are proposed in CPSG95. The rationale and arguments for these modifications have been presented. The design of CPSG95 is demonstrated to be a successful application of HPSG in the study of Chinese morpho-syntactic phenomena. On the topic of Defining the Chinese Word , efforts have been made to reach a better understanding of Chinese wordhood in theory, methodology and formalization. The theoretical inquiry follows the insight from Di Sciullo and Williams (1987) and Lü (1989). 
Two notions of word, namely grammar word and vocabulary word, have been examined and distinguished. While vocabulary word is easy to define once a lexicon is given, the object for linguistic study and generalization is actually grammar word. Unfortunately, as there is a considerable number of borderline phenomena between Chinese morphology and syntax, no precise definition of Chinese grammar word has been available across systems. Therefore, an argument in favor of the system-internal wordhood definition and interface coordination within a grammar has been made. This leads to a case-by-case approach to the analysis of specific Chinese morpho-syntactic interface problems. On the other hand, three useful wordhood judgment methods have also been proposed as a complementary means to the case-by-case analysis. These methods are (i) a syntactic process test involving passivization and topicalization; (ii) keyword-based judgment patterns for verbs; and (iii) a general expansion test named X-insertion. These methods are demonstrated to be fairly operational and easy to apply. In terms of formalization, a system-internal representation of word has been defined in CPSG95 feature structures. This definition distinguishes a grammar word from both bound morphemes and syntactic constructions. The formalization effort is necessary for the rigorous study of Chinese morpho-syntactic problems and ensures the implementability of the solutions to these problems as proposed in the dissertation.

On the topic of Chinese Separable Verbs, the task is to coordinate the idiomatic nature of separable verbs and their separated uses in various syntactic patterns. Since there are different degrees of ‘separability’ for different types of Chinese separable verbs, there is no uniform analysis which can handle all separable verbs properly. A case-by-case study for each type of separable verb has been conducted. An essential part of this study is the arguments for the wordhood judgment for each type.
In the light of this judgment, CPSG95 provides formalized analyses of separable verbs which satisfy two criteria: (i) they all capture both structural and semantic aspects of the constructions at issue; (ii) they all provide a way of capturing the link between the separated use and the contiguous use.

Finally, on the topic of Morpho-syntactic Interface Involving Derivation, a general approach to Chinese derivation has been proposed. This approach not only enables us to handle quasi-affix phenomena, but is also flexible enough to provide an adequate treatment of the special problem of zhe-suffixation. In the CPSG95 analysis, the affix serves as head of a derivative and can impose various constraints in the lexicon on its expected stem sign for the morphological expectation. Coupled with only two PS rules formulated in the general grammar (the Prefix PS Rule and the Suffix PS Rule), this lexical encoding has been shown to capture various Chinese affixation phenomena equally well. The PS rules ensure that all the lexical constraints be observed before the affix and the stem combine and that the output of derivation be a word. As for the quasi-affixation problem, based on the observation that there is no fundamental structural difference between quasi-affixation and other affixation, a proper treatment of 'quasi-affixes' can be established in the same way as other affixes are handled in CPSG95; the individual difference in semantics is shown to be capturable in the lexicon. The study of zhe-suffixation started with arguments for its analysis as VP + -zhe. This is an unsolvable problem in any system which enforces sequential processing of morphology before syntax. The solution which CPSG95 offers demonstrates the power of designing derivation morphology and syntax in a mono-stratal grammar. With this novel design in modeling Chinese grammar, the CPSG95 general approach to derivation readily applies to the tough case of zhe-suffixation.
This is possible because of the ability of an affix to place any lexicalized constraints, VP in this case, on the expected stem for morphological expectation. In addition, the proposed lexicalized solution also captures the building of the semantic content for this morpho-syntactic borderline phenomenon.

7.2. Limitation

The major limitation of the work reported in this thesis lies in the following two aspects.

Limited by space, the thesis has only presented some sample formulations of typical affixes and quasi-affixes to demonstrate the proposed general approach to Chinese derivation morphology. As many affixes/quasi-affixes have their distinctive semantic properties, a reader who would like to experiment with this proposal in an implementation still has to work out the technical details for each affix. However, it is believed that the general strategy has been presented in sufficient detail to allow for easy accommodation of individual aspects of an affix which have not been specifically addressed in the thesis.

Limited by the focus on a handful of major morpho-syntactic interface problems, the treatment of reduplication and unlisted proper names has not been listed as a special topic for in-depth exploration. They are only briefly discussed in Chapter II (Section 2.2) as cases of productive word formation which require syntax when segmentation ambiguity arises at their boundaries. However, they are also long-standing word identification problems which affect the morpho-syntactic interface when segmentation ambiguity is involved. In particular, it is felt that the treatment of transliterated foreign names requires further research before a satisfactory solution can be found in the framework of CPSG95.

7.3. Final Notes

This last section places the research reported in this thesis in a larger context. Chinese NLP has reached a new stage marked by the publication of Guo’s series of papers on Chinese tokenization (Guo 1997a,b,c,d, Guo 1998).
There are signs that the major research focus is shifting from word segmentation to grammar design and development. In this process, the morpho-syntactic interface will remain a hot topic for quite some time to come. The work on CPSG95 can be seen as one of the efforts in this direction.

The design of CPSG95, a formal grammar capable of representing both morphology and syntax in a uniform formalism, is one successful application of the modern linguistic theory HPSG in the area of Chinese morpho-syntactic interface research. However, this is by no means to claim that CPSG95 is the only or best framework to capture the morpho-syntactic problems. This is only one approach which has been shown to be feasible and effective. Other equally good or better approaches may exist.

In terms of future directions, constraints from semantics and discourse should be made available in the grammatical analysis. In Chapter II (Section 2.4), we have seen problems whose ultimate solutions depend on access to semantic or discourse constraints. It is believed that the sign-based mono-stratal design of CPSG95 will be extensible to accommodate these constraints. However, this will require years of future research before they can be formally modeled and properly introduced into the grammar.

--------------------------

As a matter of fact, the CPSG95 experiment shows that most segmentation ambiguity is resolved automatically as a by-product of morpho-syntactic parsing and the remaining ambiguity is embodied in the multiple syntactic trees as the results of the analysis.

However, in the CPSG95 implementation, the problem of handling Chinese person names, a special case of compounding, has been solved fairly satisfactorily. The proposal is to use the surname as the head sign to expect the given name (of one or two characters) on its right to form potential full names.
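The surname-driven proposal can be illustrated with a minimal Python sketch. It is not the CPSG95 implementation: the `SURNAMES` set and the `propose_names` helper are hypothetical names introduced here for illustration, and the sketch only enumerates candidate name spans, leaving the choice among them to a parser.

```python
from typing import List, Tuple

# Hypothetical toy surname list (an assumption); a real lexicon would be far larger.
SURNAMES = {"李", "王", "张"}


def propose_names(chars: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, candidate) spans for potential full names.

    Each surname acts as a head that expects a given name of one or two
    characters immediately to its right. No candidate is filtered here:
    whether a candidate survives is left to sentence-level parsing.
    """
    candidates = []
    for start, ch in enumerate(chars):
        if ch not in SURNAMES:
            continue
        for given_len in (1, 2):
            end = start + 1 + given_len
            if end <= len(chars):
                candidates.append((start, end, chars[start:end]))
    return candidates


if __name__ == "__main__":
    for span in propose_names("李明说话"):
        print(span)
```

For the string 李明说话, the sketch proposes both 李明 and 李明说 as potential names; a candidate would be kept only if it contributes to a valid parse of the whole sentence.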
As the right boundary of a person name is difficult to define without the support of sentential analysis, the conventional word segmenter frequently makes wrong segmentation in such cases. In contrast, the approach implemented in CPSG95 is free from this problem because whether a potential name proposed by the surname ultimately survives as a proper name is decided by whether it contributes to a valid parse for the processed sentence. In the last few years, there has been rapid progress on proper name identification in the area of information extraction, called named entity tagging (MUC-7 1998; Chen et al 1997).

BIBLIOGRAPHY

Bauer, Laurie (1988). Introducing Linguistic Morphology. Edinburgh: Edinburgh University Press.
Bloomfield, Leonard (1933). Language, New York: Henry Holt Co.
Borsley, Robert (1987). Subjects and Complements in HPSG. Technical report no. CSLI-107-87. Stanford: Center for the Study of Language and Information.
Carpenter, B. and G. Penn (1994). ALE, The Attribute Logic Engine, User's Guide. From http://www.sfs.nphil.uni-tuebingen.de/~gpenn/ale.html (accessed January 30, 2001).
Chao, Yuen-Ren (1968). A Grammar of Spoken Chinese. Berkeley: University of California Press.
Chen, H.-H et al (1997). Description of the NTU System used for MET-2. Proceedings of MUC-7. From http://perso.enst.fr/~monnier/lectures/IE/MUC7/muc_7_toc.html (accessed January 30, 2001).
Chen, K. and S. Liu (1992). Word Identification for Mandarin Chinese Sentences. Proceedings of 14th International Conference on Computational Linguistics (COLING’92). Nantes, France, 101-107.
Chen, M.Y. and W. S-Y. Wang (1975). Sound Change: Actuation and Implementation. Language 51:2, 255-281.
Chen, Ping (1994). “Shilun Hanyu zhong San Zhong Juzi Chengfen yu Yuyi Cheng Fen de Peiwei Yuanze” (On Mapping Principles of Relationship between Chinese Three Syntactic Constituents and Semantic Roles). Zhongguo Yuwen (Chinese Linguistics), No.3.
Chomsky, Noam (1970). Remarks on Nominalization. Readings in English Transformational Grammar, eds. by R. Jacobs and P. Rosenbaum, Waltham, Massachusetts: Ginn and Company, 184-221.
Dai, John Xiang-ling (1993). Chinese Morphology and its Interface with Syntax. Ph.D. Dissertation, Ohio State University.
DeFrancis, John (1984). The Chinese Language: Fact and Fantasy. Honolulu: University of Hawaii Press.
Di Sciullo, A.M. and E. Williams (1987). On The Definition of Word. The MIT Press, Cambridge, Massachusetts.
Ding, Shengshu (1953). “Hanyu Yufa Jianghua” (Lectures of Chinese Grammar), Zhongguo Yuwen (Chinese Linguistics), No. 3 and No. 4.
Dowty, D. (1982). More on the Categorial Analysis of Grammatical Relations. In A. Zaenen (Ed.), Subjects and Other Subjects: Proceedings of the Harvard Conference on Grammatical Relations. Bloomington: Indiana University Linguistics Club.
Feng, Zhiwei (1996). COLIPS Lecture Series - Chinese Natural Language Processing, Communications of COLIPS, Vol.6, No.1, Singapore.
Gan, Kok Wee (1995). Integrating Word Boundary Disambiguation with Sentence Understanding, Ph.D. Dissertation, National University of Singapore.
Gazdar, G., E. Klein, G.K. Pullum, and I.A. Sag (1985). Generalized Phrase Structure Grammar. Cambridge: Blackwell, and Cambridge, Mass.: Harvard University Press.
Guo, Jin (1997a). Critical tokenization and its properties. Computational Linguistics, Vol. 23, No.4, 569-596.
Guo, Jin (1997b). Chinese Language Modeling for Speech Recognition. Ph.D. dissertation, Institute of Systems Science, National University of Singapore.
Guo, Jin (1997c). A Comparative Study on Sentence Tokenization Generation Schemes. In review for journal publication from http://sunzi.iss.nus.sg:1996/guojin/papers/ (accessed March 25, 1999).
Guo, Jin (1998). One tokenization per source. Proceedings of the 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics (COLING-ACL ’98), Montreal, Canada, 457-463.
He, K., H. Xu and B. Sun (1991). Design Principles of an Expert System for Automatic Word Segmentation of Written Chinese Texts, Journal of Chinese Information Processing, Vol. 5, No. 2, 1-14.
Hockett, C.F. (1958). A Course in Modern Linguistics. New York: Macmillan.
Hu, F. and L. Wen (1954). “Ci de fanwei, xingtai, gongneng” (Scope, form and function of word). Zhongguo Yuwen (Chinese Linguistics), August issue.
Jackendoff, Ray (1972). Semantic Interpretation In Generative Grammar, Cambridge, Massachusetts: MIT Press.
Jensen, John T. (1990). Morphology: Word Structure in Generative Grammar. Amsterdam/Philadelphia: John Benjamins Publishing Company.
Kathol, Andreas (1999). Agreement and the Syntax-Morphology Interface in HPSG. In Robert Levine and Georgia Green (eds.) Studies in Current Phrase Structure Grammar. Cambridge University Press, 223-274.
Kolman, B. and R.C. Busby (1987). Discrete Mathematical Structures for Computer Science, 2nd edition. Prentice-Hall, Inc.
Krieger, Hans-Ulrich (1994). Derivation without Lexical Rules, in C.J Rupp, M. Rosner and R. Johnson (eds), Constraints, Language, and Computation. Academic Press, 277-313.
Li, C.N. and S.A. Thompson (1981). Mandarin Chinese: A Functional Grammar. Berkeley: University of California Press.
Li, Linding (1986). Xiandai Hanyu Juxing (Sentence Patterns in Contemporary Mandarin), Shangwu Yinshuguan (Commercial Press), Beijing.
Li, Linding (1990). Xiandai Hanyu Dongci (Verbs in Contemporary Mandarin), Zhongguo Shehui Kexue Chubanshe, Beijing.
Li, Qinghua (1983). “Tan liheci de tedian he yongfa” (On the characteristics and usages of separable words). Yuyan Jiaoxue He Yan Jiu (Language Instruction and Research), No.3.
Li, Wei (1996). Interaction of Syntax and Semantics in Parsing Chinese Transitive Patterns. Proceedings of International Conference on Chinese Computing (ICCC'96), Singapore.
Li, Wei (1997). Chart Parsing Chinese Character Strings. Proceedings of the Ninth North American Conference on Chinese Linguistics (NACCL-9), Victoria, Canada.
Li, Wei (2000). On Chinese parsing without using a separate word segmenter. Communication of COLIPS 10 (1): 19-68.
Liang, Nanyuan (1987). CDWS -- A Written Chinese Automatic Word Segmentation System. Journal of Chinese Information Processing, 1(2): 44-52.
Lieber, R. (1992). Deconstructing Morphology. Chicago: University of Chicago Press.
Lin, Handa (1983). “Shime shi ci – xiaoyu ci de bu shi ci” (What is a word – a unit smaller than a word is not a word). Zhongguo Yuwen (Chinese Linguistics), No.34.
Lu, Jianming (1988). “Mingci-xing ‘laixin’ shi ci haishi cizu” (Nominal laixin: word or word group). Zhongguo Yuwen (Chinese Linguistics), No. 5.
Lu, Zhiwei (1957). Hanyu de Goucifa (Chinese Word Formation), Kexue Chubanshe (Science Publishing House).
Lü, Shuxiang (1946). “Cong Zhuyu, Binyu de Fenbie Tan Guoyu Juzi de Fenxi” (On Sentence Analysis of Mandarin Chinese from the Angle of the Distinction between Subject and Object), Kaiming Shudian Er Shi Zhounian Jiannian Wenji (Selected Works to Celebrate the 20th Anniversary of Kaiming Bookstore).
Lü, Shuxiang et al (ed.) (1980). Xiandai Hanyu Babai Ci (800 Words in Contemporary Mandarin), Shangwu Yinshuguan (Commercial Press), Beijing.
Lü, Shuxiang (1989). “Hanyu Yufa Fenxi Wenti” (Issues on Chinese grammatical analysis), Lü Shuxiang Zixuanji (Self-selected Works of Shuxiang Lü), Shang Hai Jiaoyu Chubanshe (Shanghai Education Publishing House), Shanghai, 93-180.
Lua, Kim Teng (1994). Application of Information Theory Binding in Word Segmentation. Computer Processing of Chinese and Oriental Languages 8(1): 115-124.
Lyons, John (1968). Introduction to Theoretical Linguistics. Cambridge: Cambridge University Press.
MUC-7 (1998). Proceedings of the Seventh Message Understanding Conference (MUC-7). From http://perso.enst.fr/~monnier/lectures/IE/MUC7/muc_7_toc.html (accessed January 30, 2001).
Pollard, C. and I. Sag (1987). Information based Syntax and Semantics Vol. 1: Fundamentals. Centre for the Study of Language and Information, Stanford University, CA.
Pollard, C. and I. Sag (1994). Head-Driven Phrase Structure Grammar. The University of Chicago Press.
Riehemann, Susanne (1993). Word Formation in Lexical Type Hierarchies – A Case Study of bar-Adjectives in German. SfS-Report-02-93, University of Tübingen.
Riehemann, Susanne (1998). Type-based derivational morphology. Journal of Comparative Germanic Linguistics 2. 49-77.
Sapir, Edward (1921). Language: Introduction to the Study of Speech. New York: Harcourt, Brace, and World.
Selkirk, E. (1982). The Syntax of Words. Cambridge: MIT Press.
Shi, Youwei (1992). Huhuan Rouxing – Hanyu Yufa Tanyi (A Call for Flexibility – Peculiarities of Chinese Grammar), Hunan Publishing House.
Shieber, S. (1986). An Introduction to Unification-Based Approaches to Grammar. Centre for the Study of Language and Information, Stanford University, CA.
Sproat, R., C. Shih, V. Gale, and N. Chang (1996). A Stochastic Finite-State Word-Segmentation Algorithm for Chinese. Computational Linguistics. Vol. 22, No. 3.
Sun, L. and P. Cole (1991). The effect of morphology on long-distance reflexives. Journal of Chinese Linguistics 19:1, 42-62.
Sun, M. and B. T’sou (1995). Ambiguity resolution in Chinese word segmentation. Proceedings of the 10th Pacific Asia Conference on Language, Information and Computation (PACLIC-95), Hong Kong, 121-126.
Sun, M. and C. Huang (1996). Word Segmentation and Part-of-Speech Tagging for Unrestricted Chinese Texts, A Tutorial at the 1996 International Conference on Chinese Computing (ICCC96), Singapore.
Thompson, S.A. (1973). Resultative Verb Compounds in Mandarin Chinese: A Case of Lexical Rules. Language 49:2, 361-379.
Wang, Li (1955). Zhongguo Yufa Lilun (Chinese Grammatical Theory), Zhonghua Shuju, Shanghai.
Wang, Xiaolong (1989). Automatic Chinese Word Segmentation, in Word Separating and Mutual Translation of Syllable and Character Strings, Ph.D. Dissertation, Dept. of Computer Science and Engineering, Harbin Institute of Technology.
Webster, J. J. and C-Y Kit. (1992). Tokenization as the Initial Phase in NLP. Proceedings of the 14th International Conference on Computational Linguistics (COLING-92). Nantes, France, 1106-1110.
Wu, A. and Z. Jiang (1998). Word Segmentation in Sentence Analysis. Proceedings of the 1998 International Conference on Chinese Information Processing. Beijing, China, 169-180.
Wu, Dekai (1998). A Position Statement on Chinese Segmentation. Presented at the Chinese Language Processing Workshop, University of Pennsylvania. (Current draft at http://www.cs.ust.hk/~dekai/papers/segmentation.html, accessed January 30, 2001).
Wu, M. and K. Su (1993). Corpus-Based Automatic Compound Extraction with Mutual Information and Relative Frequency Count. Proceedings of R.O.C. Computational Linguistics Conference (ROCLING) VI, Taiwan, 207-216.
Xue, Ping (1991). Syntactic Dependencies in Chinese and their Theoretical Implications. Ph.D. dissertation, University of Victoria, Canada.
Yao, T., G. Zhang, and Y. Wu (1990). A Rule-Based Chinese Automatic Segmentation System. Journal of Chinese Information Processing 4(1): 37-43.
Yeh, C-L. and H-J. Lee (1991). Rule-Based Word Identification For Mandarin Chinese Sentences -- A Unification Approach. Computer Processing of Chinese and Oriental Languages. Vol. 5, No. 2, 97-118.
Yu, Shihong et al (1997). Description of the Kent Ridge Digital Labs System Used for MUC-7. Proceedings of MUC-7. From http://perso.enst.fr/~monnier/lectures/IE/MUC7/muc_7_toc.html (accessed January 30, 2001).
Zhang, J., Z. Chen and S. Chen (1991). A Method of Word Identification for Chinese by Constraint Satisfaction and Statistical Optimization Techniques. Proceedings of R.O.C. Computational Linguistics Conference (ROCLING) IV, Taiwan, 147-165.
Zhang, Shoukang (1957). “Lüetan hanyu goucifa” (A brief discussion on Chinese word formation) Xiandai Hanyu Cankao Ziliao (Reference for Contemporary Chinese), ed. by Yushu Hu (1981), Shanghai: Shanghai Jiaoyu Chubanshe (Shanghai Education Publishing Company), 241-256.
Zhao, S. and B. Zhang (1996). “Liheci de queding yu liheci de xingzhi” (Determination and characteristics of separable words). Yuyan Jiaoxue he Yanjiu (Language Instruction and Research), No.1, 40-51.
Zhu, Dexi (1985). Yufa Wenda (Questions and Answers on Chinese Grammar). Shangwu Yinshuguan (Commercial Press), Beijing.
Zwicky, A.M. (1987). Slashes in the Passive. Linguistics 25, 639-669.
Zwicky, A.M. (1989). Idioms and Constructions. Eastern States Conference on Linguistics 5, 547-558.

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)
PhD Thesis: Chapter I Introduction
PhD Thesis: Chapter II Role of Grammar
PhD Thesis: Chapter III Design of CPSG95
PhD Thesis: Chapter IV Defining the Chinese Word
PhD Thesis: Chapter V Chinese Separable Verbs
PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation
PhD Thesis: Chapter VII Concluding Remarks
Overview of Natural Language Processing
Dr. Wei Li’s English Blog on NLP
PhD Thesis: Chapter VI Interface Involving Derivation
liwei999 2016-8-30 02:34
6.0. Introduction

This chapter studies some challenging problems of Chinese derivation and its interface with syntax. These problems have been a challenge to existing word segmenters; they are also long-standing problems for Chinese grammar research.

It is observed that a good number of signs have become more and more like affixes as the Chinese language develops. Typical, indisputable examples include signs like the nominalizer 性 -xing (-ness) and the prefix 第 di- (-th). While few people doubt the existence of affixes in Contemporary Chinese, there is no general agreement on the exact number of Chinese affixes, due to a considerable number of borderline cases often referred to as ‘quasi-affixes’ (类语缀 lei yu-zhui).

It will be argued that the quasi-affixes belong to morphology and are structurally not different from other affixes. The major difference between ‘quasi-affixes’ and the few generally accepted (‘genuine’) affixes is that the former retain some ‘solid’ meaning while the latter are more functionalized. However, this does not prevent CPSG95 from providing a proper treatment of quasi-affixes in the same way as it handles other affixes. It will be shown that the difference in semantics between affixes and quasi-affixes can be accommodated fairly easily in the CPSG95 lexicon.

Based on the examination of the common property of Chinese affixes and quasi-affixes, a general approach to Chinese derivation is proposed. This approach not only enables us to handle quasi-affix phenomena, but is also flexible enough to provide an adequate treatment of a special problem in Chinese derivation, namely zhe-suffixation. The affix status of 者 -zhe (-er) is generally acknowledged (it is classified as a suffix in authoritative works such as Lü et al 1980): it attaches to a verb sign and produces a word. The peculiar aspect of this suffix is that the verb stem which it attaches to can be syntactically expanded.
In fact, there is a significant amount of evidence for the argument that this suffix expects a VP as its stem (see 6.5 for evidence). Since a VP is only formed in syntax and derivation is within the domain of morphology, this phenomenon presents a highly challenging case of how morphology should be interfaced properly to syntax. The solution which is offered in CPSG95 demonstrates the power of designing morphology and syntax in an integrated grammar formalism. In contrast, this is an unsolvable problem in any system which enforces sequential processing of derivation morphology before syntax, as most traditional systems do. There does not seem to be a way of enabling partial output of syntactic analysis (i.e. VP) to feed back to some derivation rule in the preprocessing stage.

In Section 6.1, the general approach to Chinese derivation is proposed first. Following this proposal, prefixation is illustrated in 6.2 and suffixation in 6.3. Section 6.4 shows that this general approach to derivation applies equally well to the 'quasi-affix' phenomena. Section 6.5 investigates the suffixation of -zhe (-er). The analysis is based on the argument that this suffixation involves the combination VP + -zhe. The specific solution following the CPSG95 general approach will be presented based on this analysis.

6.1. General Approach to Derivation

This section examines the property of Chinese affixes and proposes a corresponding general approach to Chinese derivation. This serves as the basis for the specific solutions to be presented in the remaining sections to various problems in Chinese derivation.

It is fairly easy to observe that in Chinese derivation it is the affix which selects the stem, not the other way round. For example, the suffix 性 -xing (-ness) expects an adjective to produce an (abstract) noun. Based on the examination of the behavior of a variety of Chinese affixes or quasi-affixes, the following generalization has been reached.
That is, an affix lexically expects a sign of category x, with possible additional constraints, to form a derived word of category y. This generalization is believed to capture the common property shared by Chinese affixes/quasi-affixes. It seems to account for all Chinese derivational data, including typical affixation, quasi-affixation (see 6.4) and the special case of zhe-suffixation (see 6.5). So far no counter-evidence has been found to challenge this generalization.

The observation and the generalization above support the argument that in a grammar which relies on lexicalized expectation feature structures to drive the building of structures, affixes, not the stems, should be the selecting heads of the morphological structures.

Leaving aside non-productive affixation, the general strategy for Chinese productive derivation is proposed as follows. In the lexicon, the affix as head of the derivative is encoded with the following derivation information: (i) what type of stem (constraints) it expects; (ii) where to look for the expected stem, on its right or left; (iii) what type of (derived) word it leads to (category, semantics, etc.). Based on this lexical information, CPSG95 has two PS rules in the general grammar for derivation: one for prefixation, one for suffixation. These rules ensure that all the constraints be observed before an affix and a stem are combined. They also determine that the output of derivation, i.e. the mother sign, be a word.

Along this line, the key to a lexicalized treatment of Chinese derivation is to determine the structural and semantic property of the derivative and to impose proper constraints on the expected stem. The constraints on the expected stem can be lexically specified in the morphological expectation feature of the affix. The property (category, syntactic expectation, semantics, etc.) of the derivative can also be encoded directly in the lexical entry of the affix, seen as the head of a derivational structure in the CPSG95 analysis. This property information, as part of head features, will be percolated up when the derivation rules are applied.

In the remaining part of this chapter, it will be demonstrated how this proposed general approach is applied to each specific derivation problem.

6.2. Prefixation

The purpose of this section is to present the CPSG95 solution to Chinese prefixation. This is done by formulating a sample lexical entry for the ordinal prefix 第 di- (-th) in CPSG95. It will be shown how the lexical information drives the prefix rule in the general grammar for the derivational combination.

Thanks to the productivity of the prefix 第 di- (-th), the ordinal numeral is always a derived word from the cardinal numeral via the following rule, informally formulated in (6-1).

(6-1) 第 di- + cardinal numeral → ordinal numeral

    第22条军规
    di- 22 tiao jun-gui
    -th 22 CLA military-rule
    the 22nd military rule (Catch-22)

    第八个是铜像
    di- ba ge shi tong-xiang
    -th eight CLA be bronze-statue
    The eighth is the bronze statue.

The basic function of the Chinese numeral, whether cardinal or ordinal, is to combine with a classifier, as shown in the sample sentences above. To capture this phenomenon, CPSG95 defines two subtypes for the category numeral, namely cardinal and ordinal. The lexical entries of the prefix 第 di- (-th) and the cardinal numeral 五 wu (five) are formulated in (6-2) and (6-3).

The prefix encodes the lexical expectation for the derivation 第 di- + cardinal → ordinal, plus the semantic composition of the combination. Note that the constraint @numeral inherits all the common properties specified for the numeral macro.

As indicated before, prefixation in CPSG95 is handled by the Prefix PS Rule based on the lexical specification. More specifically, it is driven by the lexical expectation encoded in the morphological expectation feature of the prefix. The prefix rule is formulated in (6-4).
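The division of labor between the lexical entries and the two derivation rules can be sketched in Python. This is only an illustration under simplifying assumptions, not the CPSG95 formalism: the `Sign` and `Affix` classes, their field names and the category labels are hypothetical, categories are plain strings rather than typed feature structures, and semantics is left out entirely.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Sign:
    form: str             # hanzi string of the sign
    category: str         # e.g. "cardinal", "ordinal", "A", "N"
    is_word: bool = True  # the output of derivation is always a word


@dataclass(frozen=True)
class Affix:
    form: str
    stem_categories: FrozenSet[str]  # (i) what type of stem is expected
    stem_side: str                   # (ii) "right" for a prefix, "left" for a suffix
    result_category: str             # (iii) category of the derived word


def prefix_rule(affix: Affix, right: Sign) -> Optional[Sign]:
    """Combine a prefix with the adjacent sign on its right, if licensed."""
    if affix.stem_side != "right" or right.category not in affix.stem_categories:
        return None
    return Sign(affix.form + right.form, affix.result_category)


def suffix_rule(left: Sign, affix: Affix) -> Optional[Sign]:
    """Combine a suffix with the adjacent sign on its left, if licensed."""
    if affix.stem_side != "left" or left.category not in affix.stem_categories:
        return None
    return Sign(left.form + affix.form, affix.result_category)


# Toy lexical entries (assumed encodings of examples from the text).
DI = Affix("第", frozenset({"cardinal"}), "right", "ordinal")
XING = Affix("性", frozenset({"A"}), "left", "N")
WU = Sign("五", "cardinal")
SHIYONG = Sign("实用", "A")

if __name__ == "__main__":
    print(prefix_rule(DI, WU))         # derived ordinal 第五
    print(suffix_rule(SHIYONG, XING))  # derived noun 实用性
    print(prefix_rule(DI, SHIYONG))    # None: the stem constraint is violated
```

The point of the design is that the two rules stay generic: everything specific to an affix lives in its lexical entry, so adding a new affix requires a new entry rather than a new rule.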
Like all PS rules in CPSG95, this rule takes effect whenever two adjacent signs satisfy all the constraints, combining them into a higher-level sign in parsing. For example, the prefix 第 di- (-th) and the sign 五 wu (five) will be combined into the sign shown in (6-5). The combination of 第五 di+wu in (6-5) demonstrates how the morphological structure is built in the CPSG95 approach to Chinese prefixation.

6.3. Suffixation

As with prefixation, suffixation is handled by the Suffix PS Rule, which is driven by the expectation lexically encoded in the morphological expectation feature of the suffix. Parallel to the Prefix PS Rule, the suffix rule is formulated in (6-6).

With this PS rule in hand, all that is needed is to capture the individual derivational constraint in the lexical entries of the suffixes at issue. For example, the suffix 性 -xing (-ness) changes an adjective or verb into an abstract noun: A/V + -xing → N. This information is contained in the formulation of the suffix 性 -xing (-ness) in the CPSG95 lexicon, as shown in (6-7). Note that abstract nouns are uncountable, hence the call to the uncountable_noun macro to inherit the common property of uncountable nouns.

If the suffix 性 -xing (-ness) appears immediately after the adjective 实用 shi-yong (practical) formulated in (6-8), the Suffix PS Rule will combine them into a noun, as shown in (6-9). The combination of 实用性 shi-yong+xing in (6-9) demonstrates how the morphological structure is built in the CPSG95 approach to Chinese suffixation.

6.4. Quasi-affixes

The purpose of this section is to propose an adequate treatment of the quasi-affix phenomena in Chinese. This is an area which has not received enough investigation in the field of Chinese NLP. Few Chinese NLP systems demonstrate where and how to handle these quasi-affixes.

To achieve this purpose, typical examples of ‘quasi-affixes’ are presented and compared with some ‘genuine’ affixes.
The comparison highlights the general property shared by both 'quasi-affixes' and other affixes and also shows their differences. Based on this study, it is found to be feasible to treat quasi-affixes within the derivation morphology of CPSG95. The proposed solution will be presented by demonstrating how a typical quasi-affix is represented in CPSG95 and how the general affix rules can work with the lexical entries of 'quasi-affixes' as well.

The tables in (6-10) and (6-11) list some representative quasi-affixes in Chinese.

(6-10) Table of sample quasi-prefixes

Prefixation | Examples
lei (quasi-) + N → N | 类前缀 lei-qian-zhui: quasi-prefix; 前缀: qian (before, pre-, former-), zhui (...)
ban (semi-) + N → N | 半文盲 ban-wen-mang: semi-illiterate; 文盲: wen (written-language), mang (blind)
dan (mono-) + N → N | 单音节 dan-yin-jie: mono-syllable; 音节: yin (sound), jie (segment)
shuang (bi-) + N → N | 双音节 shuang-yin-jie: bi-syllable
duo (multi-) + N → N | 多音节 duo-yin-jie: multi-syllable
fei (non-) + N/A → A | 非谓 fei-wei: non-predicate; 非正式 fei-zheng-shi: non-official
xiang (each other) + Vt (mono-syllabic) → Vi | 相爱 xiang-ai: love each other
zi (self-) + Vt → Vi | 自爱 zi-ai: self-love; zi-xue-xi: self-learning
qian (former, ex-) + N → N | 前夫人 qian-fu-ren: ex-wife; 前总统 qian-zong-tong: former president

(6-11) Table of sample quasi-suffixes

Suffixation | Examples
N + shi (style) → N | 美国式 mei-guo-shi: American-style
NUM/N + xing (model) → N | 1980型 1980-xing: 1980 model; IV型 IV-xing: Model IV
A/V + lü (rate) → N | 准确率 zhun-que-lü: (percentage of) precision
NUM + liu (class) → A | 一流 yi-liu: first class; 三流 san-liu: third class
N + mang ('blind', person who has little knowledge of) → N | 法盲 fa-mang: person who has no knowledge of law; 计算机盲 ji-suan-ji-mang: computer-layman

Comparing the above quasi-affixes with the few widely acknowledged affixes like 性 -xing (-ness) and 第 di- (-th), it is fairly easy to observe that the property as generalized in Section 6.1 is shared by both affixes and quasi-affixes.
That is, in all cases of the combination, the affix or quasi-affix expects a sign of category x, with possible additional constraints, either on the right or on the left to form a derived word of category y (y may be equal to x). For example, the quasi-prefix 自 zi- (self-) expects a transitive verb to produce an intransitive verb, etc. This property supports the following two points of view: (i) the affix or quasi-affix is the selecting head of the combination; (ii) both types of combination (affixation) should be properly contained in morphology since the output is always a word (derivative). In terms of difference, it is observed that there are different degrees of the functionalization of the meaning between quasi-affixes and other affixes. For example, the nominalizer 性 -xing (‑ness) seems to be semantically more functionalized than the quasi-suffix 盲 -mang (blind-man, person who has little knowledge of). In the case of 性 -xing (-ness), there is believed to be little semantic contribution from the affix. But in cases of affixation by quasi-affixes, the semantic contribution of the affixes is non-trivial, and it must be ensured that proper semantics be built based on semantic compositionality of both the stem and the affix. Except for the different degrees of semantic abstractness, there is no essential grammatical difference observed between quasi-affixes and the few widely accepted affixes. As the semantic variation can be easily accommodated in the lexicon, nothing needs to be changed in the general approach to Chinese derivation as described before. The text below demonstrates how the quasi-affix phenomena are handled in CPSG95, using a sample quasi-affix to show the derivation. The quasi-prefix to examine is 相 xiang- (each other). It is used before a mono-syllabic transitive verb, making it an intransitive verb: 相 xiang- + Vt (monosyllabic) ‑‑ Vi. 
More precisely, the syntactic object of the transitive verb is morphologically satisfied so that the derivative becomes an intransitive verb. Unlike the original verb, the verb derived via xiang- prefixation requires a plural subject, as shown in (6-12). This is a linguistically interesting phenomenon. In a sense, it is a version of subject-predicate agreement in Chinese.

(6-12.) (a) 他们相爱过。
ta-men xiang- ai guo
they each-other love GUO
They used to love each other.

(b) 他爱过。
ta ai guo
he love GUO
He used to love (someone).

(c) * 他相爱过。
ta xiang- ai guo
he each-other love GUO

This number agreement can help decode the plural semantics of the subject noun, as shown in the first sentence (6-13a) of the following group. Sentence (6-13a) illustrates a common, number-underspecified case where the NP has no plural marker. This contrasts with (6-13b), which includes the plural marker 们 men (-s), and with (6-13c), which resorts to a numeral-classifier construction.

(6-13.) (a) 孩子相爱了。
hai-zi xiang- ai le
child each-other love LE
The children have fallen in love with each other.

(b) 孩子们相爱了。
hai-zi men xiang- ai le
child PLU each-other love LE
The children have fallen in love with each other.

(c) 两个孩子相爱了。
liang ge hai-zi xiang- ai le
two CLA child each-other love LE
The two children have fallen in love with each other.

Following the practice for number agreement in HPSG, the agreement can be captured by enforcing an additional plural constraint on the subject expectation , as shown in the formulation of the lexical entry for 相 xiang- (each other) in (6-14) below. As shown above, the affixation also necessitates a corresponding modification of the semantics in the argument structure: the first argument is equal to the second via index . Note that the notation , or more accurately, the most general feature structure, is used as a place holder. For example, HANZI stands for the constraint of a mono-hanzi sign.
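The constraints just described for 相 xiang- can be pictured with a small Python sketch. It is only an illustration under simplifying assumptions, not the CPSG95 feature-structure formalism: the class `Sign`, its fields, and the functions `xiang_prefix` and `subject_agrees` are hypothetical names, and number agreement is reduced to a string comparison in which an underspecified number is compatible with anything.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sign:
    hanzi: str                            # written form
    category: str                         # "Vt", "Vi", "N", ...
    subject_number: Optional[str] = None  # number required of the subject; None = no constraint


def xiang_prefix(stem: Sign) -> Optional[Sign]:
    """相 xiang- + Vt (monosyllabic) -- Vi, with a plural subject required."""
    if stem.category != "Vt" or len(stem.hanzi) != 1:
        return None  # the lexical expectation of the prefix is not met
    return Sign("相" + stem.hanzi, "Vi", subject_number="plural")


def subject_agrees(verb: Sign, noun_number: Optional[str]) -> bool:
    """noun_number=None models a number-underspecified NP such as 孩子."""
    if verb.subject_number is None or noun_number is None:
        return True
    return verb.subject_number == noun_number


xiang_ai = xiang_prefix(Sign("爱", "Vt"))
print(xiang_ai)
print(subject_agrees(xiang_ai, "plural"))    # cf. (6-12a) 他们相爱过
print(subject_agrees(xiang_ai, "singular"))  # cf. the starred 他相爱过
print(subject_agrees(xiang_ai, None))        # cf. (6-13a) 孩子相爱了
```

In this sketch a bare noun passes the check because its number is left open, which mirrors the observation that the agreement helps decode the plural reading in (6-13a).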
Another point worth noting is that the derivative requires a subject to appear before it. In other words, the subject expectation becomes obligatory. This is based on the fact that this derived verb cannot stand by itself in syntax, unlike most original verbs in Chinese, say 爱 ai (love), whose subject expectation is optional. With the lexical entries for the quasi-affixes taking care of the differences in the building of semantics, there is no need for any modification of the CPSG95 PS rules. For example, the prefix 相 xiang- (each other) and the verb 爱 ai (love) formulated in (6-15) will be combined into the derivative 相爱 xiang-ai (love each other) shown in (6-16) via the Prefix PS Rule. In summary, the proposed approach to Chinese derivation is effective in handling quasi-affixes as well. The general grammar rules for derivation remain unchanged while lexical constraints are accommodated in the lexicon. This demonstrates the advantages of the lexicalized design for grammar development.

6.5. Suffix 者 zhe (-er)

This section analyzes zhe- suffixation, a highly challenging case at the interface between morphology and syntax. This is believed to be an unsolvable problem as long as a system is based on the sequential processing of derivation morphology and syntax. The solution to be proposed in this section is based on the argument that this suffixation is a combination of VP + zhe. The suffix 者 zhe (-er, person) is a very productive bound morpheme. It is often compared to the English suffix -er or -or, as seen in the pairs in (6-17).

(6-17.) 工作 gong-zuo (work) 工作者 -zhe (work-er)
劳动 lao-dong (labor) 劳动者 -zhe (labor-er)
学习 xue-xi (learn) 学习者 -zhe (learn-er)

But 者 -zhe is not an ordinary suffix; it belongs to the category of so-called ‘phrasal affix’, with characteristics very different from those of its English counterpart. Although the output of the zhe- suffixation is a word, the input is a VP, not a lexical V.
In other words, it combines with a VP and produces a lexical N: VP+ zhe -- N. The arguments to be presented below support this analysis. The first thing is to demonstrate the word status of zhe‑ suffixation. This is fairly straightforward: there are no observed facts to show that the zhe- derivative is different from other lexical nouns in the syntactic distribution. For example, like other lexical nouns, the derivative can combine with an optional classifier construction to form a noun phrase. Compare the following pairs of examples in (6-18) and (6-19). (6-18.) (a) 两名违反这项规定者 liang ming -zhe] two CLA violate this CLA regulation -er two persons who have violated this regulation (b) 两名学生 liang ming xue-sheng two CLA student two students (6-19.) (a) 他是一位优秀工作者 ta shi yi wei you-xiu -zhe] he be one CLA excellent work -er He is an excellent worker. (b) 他是一位优秀工人。 ta shi yi wei you-xiu gong-ren he be one CLA excellent worker He is an excellent worker. The next thing is to demonstrate the phrasal nature of the ‘stem’. The stem is judged as a VP because it can be freely expanded by syntactical complements or modifiers without changing the morphological relationship between the stem and the suffix, as shown in (6‑20) below. (6-20a) involves a modifier (努力 nu-li ) before the head verb. The verb stem in (6-20b) and (6-20c) is a transitive VP consisting of a verb and an NP object. (6-20.) (a) 努力工作者 -zhe hard work ‑er hard-worker, person who works hard (b) 学习鲁迅者 -zhe learn Lu Xun -er person who is learning from Lu Xun (c) 违反这项规定者 -zhe violate this CLA regulation -er person who violates this rule More examples with the head verb 雇 gu (employ) are given in (6-21), with the last two expressions involving passivized VP. 
(6-21.) (a) 雇者
gu-zhe
employ-er

(b) 雇人者
-zhe
employ person -er
those who employ people, employer/recruiter

(c) 被雇者
-zhe
-er
employee

(d) 被人雇者
-zhe
by person employ -er
those who are employed by (other) people

In fact, the stem VP is semantically equivalent to a relative clause. A Chinese relative clause is normally expressed in the form of a DE-phrase: VP + de + N (Xue 1991). In other words, 者 -zhe embodies functions of two signs, an N (‘person’, by default) and a relative clause introducer de, something like English one that + VP (or person who + VP). Compare the two examples in (6-22) and (6-23), which have the same meaning; the expression in (6-23) is more colloquial than the one in (6-22), which uses the suffix 者 -zhe.

(6-22.) 违反规定者,处以罚款。
wei-fan gui-ding zhe, chu-yi fa-kuan
violate regulation one that punish-by fine
Those who violate the regulations will be punished by fines.

(6-23.) 违反规定的人,处以罚款。
wei-fan gui-ding de ren, chu-yi fa-kuan
violate regulation DE person punish-by fine
Those who violate the regulations will be punished by fines.

On further examination, it is found that VPs with attached aspect markers combine with the suffix 者 -zhe with difficulty, as seen in the following examples.

(6-24.) (a) 违反规定者
wei-fan gui-ding zhe
violate regulation -er
Those who violate the regulations

(b) ? 违反了规定者
wei-fan le gui-ding zhe
violate LE regulation one that

This means that some further constraint may be necessary in order to prevent the grammar from producing strings like (6-24b). If CPSG95 is only used for parsing, such a constraint is not absolutely necessary because, in normal Chinese text, such input is almost never seen. Since CPSG95 is intended to be procedure-neutral, for use in both parsing and generation, the further constraint is desirable. This constraint is in fact not an isolated phenomenon in Chinese grammar. In syntax, the constraint is commonly required when the VP is not in the predicate position.
For example, when a verb, say 喜欢 xi-huan (like), or a preposition, say 为了 wei-le (in order to), subcategorizes for a VP as a complement, it actually expects a VP with no aspect markers attached. The following pair of sentences demonstrates this point. (6-25.) (a) 我喜欢打篮球。 wo xi-huan da lan-qiu. I like play basket-ball I like playing basket-ball. (b) * 我喜欢打了篮球。 wo xi-huan da le lan-qiu I like play LE basket-ball To accommodate such common constraint requirement in both Chinese morphology and syntax, a binary feature is designed for Chinese verbs in CPSG95. In the lexicon, this feature is under-specified for each Chinese verb, i.e. . When an aspect marker 了着过 le/zhe/guo combines with the verb, this feature is unified to be . We can then enforce the required constraint in the morphological expectation or syntactic expectation to prevent aspected VP from appearing in a position expecting a non-predicate un-aspected VP. Based on the above analysis, the lexical entry of the suffix 者 –zhe is formulated in (6-26). Note the notation for the macro with parameter (placed in parentheses) @common_noun(名|位|个). This macro represents the following information. The derivative is like any other common noun, it inherits the common property; it can combine with an optional classifier construction using the classifier 名 ming or 位 wei or 个 ge . As seen, the VP expectation is realized by using the macro constraint @vp. The semantics of the derivative is , an instance of -er with restriction from the event of VP, represented by . The index ensures that whatever is expected as a subject by the VP, which has no chances to be satisfied syntactically in this case, is semantically identical to this noun. In other words, this derived noun semantically fills an argument slot held by the subject in the VP semantics . In the active case, say, 雇人者 –zhe (‘person who employs people’), the subject is the first argument, i.e. the index of this noun is the logical subject of employ . 
However, when the VP is in passive, say, 被人雇者 ‑zhe (‘person who is employed by other people’), the subject expected by the VP fills the second argument, i.e. the noun in this case is the logical object of the VP. It is believed that this is the desired result for the semantic composition of zhe- derivation. With the lexical expectation of the suffix as the basis, the general Suffix PS Rule is ready to work. Remember that there is nothing restricting the input stem to the derivation in either of the derivation rules, formulated in (6-4) and (6-6) before. In CPSG95, this is not considered part of the general grammar but rather a lexical property of the head affix. It is up to the affix to decide what constraints such as category, wordhood status, semantic constraint, etc., to impose on the expected stem to produce a derivative. In most cases of derivation, the input status of the stem is a word, but now we have an intricate case where the suffix zhe (-er) expects a verb phrase for derivation. The general property for all cases of derivation is that regardless of the input, the output of derivation (as well as any other types of morphology) is always a word. Before demonstrating by examples how zhe- derivation is implemented, there is a need to address the configurational constraints of CPSG95. This is an important factor in realizing the flexible interaction between morphology and syntax as required in this case. In all HPSG-style grammars, some type of configurational constraint is in place to ensure the proper order of rule application. A typical constraint is that the subject rule should apply after the object rule. This is implemented in CPSG95 by imposing the constraint in the subject PS rule that the head daughter must be a phrase and by imposing the constraint in the object PS rule that the subject of the head daughter may not be satisfied. 
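The ordering constraint between the subject rule and the object rule can be sketched in a few lines of Python. This is a toy model, not the CPSG95 rules themselves: the class `Verb`, the functions `is_phrase`, `object_rule` and `subject_rule`, and the choice to treat 违反 as taking an obligatory object are all assumptions made for illustration. A verb counts as a phrase here unless an obligatory object is still unsatisfied.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Verb:
    form: str
    object_obligatory: bool
    object_satisfied: bool = False
    subject_satisfied: bool = False


def is_phrase(v: Verb) -> bool:
    # Assumption: a verb still waiting for an obligatory object is not a phrase.
    return v.object_satisfied or not v.object_obligatory


def object_rule(head: Verb, obj: str) -> Optional[Verb]:
    # Constraint: the subject of the head daughter may not be satisfied yet.
    if head.subject_satisfied or head.object_satisfied:
        return None
    return replace(head, form=f"[{head.form} {obj}]", object_satisfied=True)


def subject_rule(subj: str, head: Verb) -> Optional[Verb]:
    # Constraint: the head daughter must be a phrase.
    if not is_phrase(head) or head.subject_satisfied:
        return None
    return replace(head, form=f"[{subj} {head.form}]", subject_satisfied=True)


wei_fan = Verb("违反", object_obligatory=True)   # hypothetical lexical setting
vp = object_rule(wei_fan, "规定")
sentence = subject_rule("他", vp)
print(subject_rule("他", wei_fan))  # blocked: head is not yet a phrase
print(sentence.form)
```

With these two checks in place, the only complete analysis groups the verb with its object before the subject is attached, which is the ordering the text describes.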
Since derivation morphology and syntax are designed in the same framework in CPSG95, constraints are called for to ensure the ordering of rule application between morphological PS rules and syntactic PS rules as well. In general, morphological rules apply before syntactic rules. However, if this constraint is made absolute, to the extent that all morphological rules must apply before all syntactic rules, we in effect make morphology and syntax two independent, successive modules, as in traditional systems. The grammar will then lose the power of flexible interaction between morphology and syntax and cannot handle cases like zhe- derivation. However, this is not a problem in CPSG95. The proposed constraint regulating the rule application order between morphological PS rules and syntactic PS rules is as follows. Only when a sign has both an obligatory morphological expectation and a syntactic expectation will CPSG95 have constraints ensuring that the morphological rule applies first. For example, as formulated in (6-14) before, the sign 相 xiang- (each other) has both morphological expectation in as a bound morpheme and syntactic expectation for the subject in as (head of) derivative. If the input string is 他们相爱 ta-men (they) xiang- (each other) ai (love), the prefix rule will first combine 相 xiang- (each other) and the stem 爱 ai (love) before the subject rule can apply. The result is the expected structure embodying the results of both morphological analysis and syntactic analysis, ]. This constraint is implemented by specifying in all syntactic PS rules that the head daughter cannot have an obligatory morphological expectation yet to be satisfied. It effectively prevents a bound morpheme from being used as a constituent in syntax. It should be emphasized that this constraint in the general grammar does not prohibit a bound morpheme from combining with any type of sign; such constraints are only lexically decided in the expectation feature of the affix.
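The interaction between the prefix rule and the subject rule for 他们相爱 can be sketched as follows. This is a minimal sketch with hypothetical names (`Sign`, `morph_expect`, `prefix_rule`, `subject_rule`) and plain strings in place of feature structures; it only shows how a check on the head daughter keeps a bound morpheme out of syntax until its morphological expectation has been satisfied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sign:
    form: str
    category: str
    morph_expect: Optional[str] = None  # obligatory stem category still expected, if any


def prefix_rule(prefix: Sign, stem: Sign) -> Optional[Sign]:
    if prefix.morph_expect is None or stem.category != prefix.morph_expect:
        return None
    # The affix is the head: its category becomes that of the derived word,
    # and the morphological expectation is now satisfied.
    return Sign(f"[{prefix.form} {stem.form}]", prefix.category)


def subject_rule(subj: Sign, head: Sign) -> Optional[Sign]:
    if head.morph_expect is not None:
        return None  # a bound morpheme cannot serve as head daughter in syntax
    if subj.category != "N" or not head.category.startswith("V"):
        return None
    return Sign(f"[{subj.form} {head.form}]", "S")


ta_men = Sign("他们", "N")
xiang = Sign("相", "Vi", morph_expect="Vt")  # simplified stand-in for the entry in (6-14)
ai = Sign("爱", "Vt")

print(subject_rule(ta_men, xiang))            # blocked before prefixation
derived = prefix_rule(xiang, ai)
print(subject_rule(ta_men, derived).form)
```

Note that the ordering is not stated globally: nothing in `prefix_rule` refers to syntax, and the only restriction sits in the syntactic rule's condition on its head daughter.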
The following text shows step by step the CPSG95 solution to the problem of zhe- derivation. The chosen example is the derivation for the derived noun 违反规定者 -zhe] ‘persons violating (the) regulation’. The lexical sign of the suffix 者 -zhe (-er) has already been formulated in (6-26) before. The words 违反 wei-fan (violate) and 规定 gui-ding (regulation) in the CPSG95 lexicon are shown in (6-27) and (6-28) respectively. Note that all common nouns, specified as @common_noun, in the lexicon have the following INDEX features , i.e. third person with unspecified number. As for the feature , it is encoded in the noun itself with one of the following , , , or unspecified as . The corresponding sort hierarchy is: consists of sub-sorts and ; and is sub-typed into and . Of course, 规定 gui-ding (regulation) is lexically specified as . The following is the VP built by the object PS rule in the CPSG95 syntax. As seen, the building of the semantics follows the practice in HPSG, with the argument slots filled by the feature of the subject and object. In this VP case, has been realized. The VP result in (6-29) and the suffix 者 -zhe will combine into the expected derived noun via the Suffix PS Rule, as shown in (6-30). To summarize, it is the integrated model of derivational morphology and syntax in CPSG95 that makes the above analysis implementable. Without the integration, there is no way that a suffix is allowed to expect a phrasal stem. The lexicalist approach adopted in CPSG95 facilitates the capturing of the individual feature of the phrase expectation for the few individual affixes like 者 -zhe. This enables the general PS rules for derivation in CPSG95 to be applicable to both typical cases of affixation and special cases of affixation.

6.6. Summary

This chapter has investigated some representative phenomena of Chinese derivation and their interface to syntax. The solutions to these problems have been presented based on the arguments for the analysis.
The key to a lexicalized treatment of Chinese derivation is to determine the structural and semantic property of the derivative and to impose proper constraints on the expected stem. The constraints on the expected stem are lexically specified in the corresponding morphological expectation feature structure of the affix. The property of the derivative is also lexically encoded in the affix, seen as head of derivational structure in the CPSG95 analysis. This property information will be percolated up when the derivation rules are applied. These rules ensure that the output of derivation is a word. It has been shown that this approach applies equally well to derivation via ‘quasi-affixes’ and the tough case of zhe- suffixation as well. ------------------------------------ Some linguists (e.g. Li and Thompson 1981) hold the view that Chinese has only a few affixes; others (e.g. Chao 1968) believe that the inventory of Chinese affixes should be extended to include quasi-affixes. Interestingly, the sign lei (quasi-, original sense ‘class’) itself is a quasi-prefix in Chinese. Phenomena similar to Chinese quasi-affixes, called ‘semi-affixes’ or ‘Affixoide’, also exist in German morphology (Riehemann 1998). This is similar to the practice in many grammars, including HPSG, that a functional sign preposition is the selecting head of the corresponding syntactic structure, namely Prepositional Phrase. Those affixes which are not or no longer productive, e.g. lao‑ (original meaning ‘old’) in lao‑hu (tiger) and lao‑shu (mouse), are not a problem. The corresponding derived words are simply listed in the CPSG95 lexicon. The CPSG95 phrase-structural approach to Chinese productive derivation was inspired by the implementation in HPSG of a word-syntactic approach in Krieger (1994). Similar practice is also seen in Selkirk (1982), Riehemann (1993) and Kathol (1999) in an effort to explore alternative approaches than the lexical rule approach to morphology. 
The major common property is reflected in two aspects, formulated in the macro definition of uncountable_noun in CPSG95. First, there is value setting for the feature, i.e. . The CPSG95 sort hierarchy for the type is defined as {a_number, no_number} where a_number is further sub-typed into {singular, plural}. no_number applies to uncountable nouns while a_number is used for countable nouns where the plurality is yet to be decided (i.e. under-specified for plurality). Second, based on the syntactic difference between Chinese countable nouns and uncountable nouns, the classifier expected by uncountable nouns is exclusively zhong (kind/sort of). That is, uncountable nouns may only combine with a preceding classifier construction using the classifier zhong.

For the time being, the subtle difference in semantics between pairs like We love ourselves and We love each other is not represented in the content. It requires a more elaborate system of semantics to reflect the nuance. The elaboration of semantics is left for future research.

Some linguists (e.g. Z. Lu 1957; Lü et al 1980; Lü 1989; Dai 1993) have briefly introduced the notion of ‘phrasal affix’ in Chinese. Lü further indicates that these ‘phrasal affixes’ are a distinctive characteristic of the Chinese grammar. The English possessive morpheme 's is arguably a suffix which expects an NP instead of a lexical noun as its stem: NP + -'s. Unlike VP + -zhe, the result of this NP + -'s combination is generally regarded as a phrase, not a word. In this sense, 's seems to be closer to a functional word, similar to a preposition or postposition, than to a suffix.

Chinese zhe- suffixation is somewhat like the English phenomenon of what-clause (in ‘what he likes is not what interests her’). ‘What’ in this use also embodies functions of two signs that which. However, the English what-clause functions as an NP, whereas VP + zhe forms a lexical N.
It is generally agreed in the circle of Chinese grammar research that Chinese predicate (or finite) verbs have aspect distinction, using or not using aspect markers. This is in contrast to English where both finite and non-finite verbs have aspect distinction but only finite verbs are tensed.

It is generally agreed that each Chinese common noun may only combine with a classifier construction using a specific set of classifiers. This classifier specification is generally regarded as lexical, idiosyncratic information of nouns (Lü et al 1980). Using the macro with the classifier parameter follows this general idea. It is worth noticing that the lexical formulation for -zhe (-er) in CPSG95 does not rely on any specific NP analysis chosen in syntax, except that the classifier specification should be placed under the entry for nouns (or derived nouns).

The proposal in building the semantics for the zhe- derivative is based on ideas similar to the assumption adopted for the complement control in HPSG that ‘the fundamental mechanism of control was coindexing between the unexpressed subject of an unsaturated complement and its controller’ (Pollard and Sag 1994:282).

If the object expectation is obligatory, this constraint ensures the priority of the object rule over the subject rule in application, building the desirable structure ] instead of O]. This is because a verb with an obligatory object yet to be satisfied is by definition not a phrase. If the object expectation is optional, the order of rule application is still in effect although the lexical V in this scenario does not violate the phrase definition. There are two cases for this situation. In case one, the object O happens to occur in the input string. The subject PS rule will tentatively combine S and V, but it can go no further. This is because the object rule cannot apply after the subject rule, due to the constraint in the object rule that the head cannot have a satisfied subject.
The successful parse will only build the expected structure ]. In case two, the object O does not appear in the input string. Then the tentative combination built by the subject rule becomes the final parse.

For example, if the lexical rule approach were adopted for derivation, this problem could not be solved.
PhD Thesis: Chapter IV Defining the Chinese Word
liwei999 2016-8-27 00:23
4.0. Introduction

This chapter examines the linguistic definition of the Chinese word and establishes its formal representation in CPSG95. This lays a foundation for the treatment of Chinese morpho-syntactic interface problems in later chapters. To address issues on interfacing morphology and syntax in Chinese NLP, the fundamental question is: what is a Chinese word? A proper answer to this question defines the boundaries between morphology, the study of how morphemes combine into words, and syntax, the study of how words combine into phrases. However, there is no easy answer to this question. In fact, how to define Chinese words has been a central topic among Chinese grammarians for decades (Hu and Wen 1954; L. Wang 1955; Z. Lu 1957; Lin 1983; Lü 1989; Shi 1992; Dai 1993; Zhao and Zhang 1996). In the late 1950s, there was a heated discussion in China on the definition of the Chinese word. This discussion was induced by the campaign for the Chinese writing system reform (wenzi gaige yundong). At that time, the government policy was to ultimately replace the Chinese characters (hanzi) by a Romanized writing system. The system of pinyin, based on the Latin alphabet, was designed to represent the pronunciation of the characters in Contemporary Mandarin. The simplest approach would be to use pinyin as the writing system and transcribe each Chinese character as a pinyin syllable. But this was soon found impractical due to the many-to-one correspondence from hanzi to syllable. Text in pinyin with no explicit word boundary delimiters is hardly comprehensible. Linguists agree that the key issue for the feasibility of a pinyin-based writing system is to establish a standard or definition for Chinese words (Z. Lu 1957). Once words can be identified by a common standard, the pinyin system can in principle be adopted for recording the Chinese language by using space and punctuation marks to separate words.
This is because the number of homophones at the word level is dramatically reduced when compared with the number of homophones at the hanzi (morpheme or mono-syllabic) level. But the definition of a Chinese word is a very complicated issue due to the existence of a considerable amount of borderline cases. It has never been possible to reach a precise definition which can be applied to all circumstances and which can be accepted by linguists from different schools. There have been many papers addressing the Chinese wordhood issue (e.g. Z. Lu 1957; Lin 1983; Lü 1989; Dai 1993). Although there are still many problems in defining Chinese words for borderline cases and more debate will continue for many years to come, the understanding of Chinese wordhood has been deepened in the general acknowledgement of the following key aspects: (i) the distinct status of Chinese morphology; (ii) the distinction of different notions of word; and (iii) the lack of absolute definition across systems or theories. Almost all Chinese grammarians agree that unlike Classical Chinese, Contemporary Chinese is not based on single-morpheme words. In other words, the word and the morpheme are no longer co-extensive in Contemporary Chinese. In fact, that is the reason why we need to define Chinese morphology. If the word and the morpheme stand for the same linguistic object in a language, like Classical Chinese, the definition of morpheme will entail the definition of word and there is no role of morphology. As it stands, there is little debate on the definition of morpheme in Chinese. It is generally acknowledged that each syllable (or its corresponding written form hanzi ) corresponds to (at least) one morpheme. In a characteristic ‘isolating language’ - Classical Chinese is close to this, there is no or very poor morphology. However, Contemporary Chinese contains a significant number of bound morphemes in word formation (Dai 1993). 
In particular, it is observed that many affixes are highly productive (Lü et al 1980). It is widely acknowledged that the grammar of Contemporary Chinese is not complete without the component of morphology (Z. Lu 1957; Chao 1968; Li and Thompson 1981; Dai 1993; etc.). Based on this widely accepted assumption, one major task for this thesis is to argue for the proper place to cut the line between morphology and syntax, and to explore effective ways of interleaving the two for analysis. A significant development concerning the Chinese wordhood study is the distinction between two different notions of word: grammar word versus vocabulary word . It is now clear that in terms of grammar analysis, a vocabulary word is not an appropriate notion (Lü 1989; more discussion to come in 4.1). Decades of debate and discussion on the definition of a Chinese word have also shown that an operational definition for a grammar word precise enough to apply to all cases can hardly be established across systems or theories. But a computational grammar of Chinese cannot be developed without precise definitions. This leads to an argument in favor of the system internal wordhood definition and interface coordination within a grammar. The remaining sections of this chapter are organized like this. Section 4.1 examines two notions of word. Making sure that we use the right notion based on some appropriate guideline, some operational methods for judging a Chinese grammar word will be developed in 4.2. Section 4.3 demonstrates the formal representation of a word in CPSG95. This formalization is based on the design of expectation feature structures and the structural feature structure presented in Chapter III. 4.1. Two Notions of Word This section examines the two notions of word which have caused confusion. The first notion, namely vocabulary word , is easy to define. However, for the second notion, namely, grammar word , unfortunately, no operational definition has been available. 
It will be argued that a feasible alternative is to system internally define a grammar word and the labor division between Chinese morphology and syntax. A grammar word stands for the grammatical unit which fits in the hierarchy of morpheme, word and phrase in linguistic analysis. This gives the general concept of this notion but it is by no means an operational definition. Vocabulary word, on the other hand, refers to the listed entry in the lexicon. This definition is simple and unambiguous once a lexicon is given. The lexical lookup will generate vocabulary words as potential building blocks for analysis. On one hand, vocabulary words come from the lexicon; they are basic building blocks for linguistic analysis. On the other hand, as the ‘resulting’ unit for morphological analysis as well as the ‘starting’ or ‘atomic’ unit for syntactic analysis, the grammar word is the notion for linguistic generalization. But it is observed that a vocabulary word is not necessarily a grammar word and vice versa. It is this possible mismatch between vocabulary word and grammar word that has caused a problem in both Chinese grammar research and Chinese NLP system development. Lü (1989) indicates that not making a distinction between these two notions of word has caused considerable confusion on the definition of Chinese word in the literature. He further points out that only the former notion should be used in the grammar research. Di Sciullo and Williams (1987) have similar ideas on these two notions of word. They indicate that a sign listable in the lexicon corresponds to no certain grammatical unit. It can be a morpheme, a (grammar) word, or a phrase including sentence. Some examples of different kinds of Chinese vocabulary words are given below to demonstrate this insight. (4-1.) 
sample Chinese vocabulary words (a) 性 bound morpheme, noun suffix, ‘-ness’ (b) 洗 free morpheme or word, V: ‘wash’ (c) 澡 word (only used in idioms), N: ‘bath’ (d) 澡盆 compound word, N: ‘bath-tub’ (e) 洗澡 idiom phrase, VP: ‘take a bath’ (f) 他们 pronoun as noun phrase, NP: ‘they’ (g) 城门失火,殃及池鱼 idiomatic sentence, S: ‘When the gate of a city is on fire, the fish in the canal around the gate is also endangered.’ The above signs are all Chinese vocabulary words. But grammatically, they do not necessarily function as a grammar word. For example, (4-1a) functions as a suffix, smaller than a word. (4-1e) behaves like a transitive VP (see 5.1 for more evidence), and (4-1g) acts as a sentence, both larger than a word. The consequence of mixing up these different units in a grammar is the loss of power for a grammar to capture the linguistic generality for each level of grammatical unit. The definition of grammar word has been a contentious issue in general linguistics (Di Sciullo and Williams 1987). Its precise definition is particularly difficult in Chinese linguistics as there is a considerable amount of phenomena marginal between Chinese morphology and syntax (Zhu 1985; L. Li 1990; Sun and Huang 1996). The morpheme-word-phrase transition is a continuous band in the linguistic reality. Different grammars may well cut the division differently. As long as there is no contradiction in coordinating these objects within the grammar, there does not seem to exist absolute judgment on which definition is right and which is wrong. It is generally agreed that a grammar word is a smallest unit in syntax (Lü 1989), as also emphasized by Di Sciullo and Williams (1987) on the 'syntactic atomicity' of word. But this statement only serves as a guideline in theory, it is not an operational definition for the following reason. It is logically circular to define word, smallest unit in syntax, and syntax, study of how words combine into phrases, one upon the other. 
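The mismatch between vocabulary words and grammar words in (4-1) can be made concrete with a small Python sketch. The enum `Unit`, the dictionary `LEXICON` and the helper functions are hypothetical names introduced only for illustration; the unit assigned to each entry simply follows the labels given in (4-1), not any actual CPSG95 lexicon format.

```python
from enum import Enum
from typing import List, Optional


class Unit(Enum):
    MORPHEME = "bound morpheme"
    WORD = "grammar word"
    PHRASE = "phrase"
    SENTENCE = "sentence"


# Every key is a vocabulary word, i.e. a listed entry; the value records
# the grammatical unit the entry functions as, following (4-1).
LEXICON = {
    "性": Unit.MORPHEME,              # (4-1a) noun suffix
    "洗": Unit.WORD,                  # (4-1b)
    "澡": Unit.WORD,                  # (4-1c)
    "澡盆": Unit.WORD,                # (4-1d) compound word
    "洗澡": Unit.PHRASE,              # (4-1e) idiom phrase, VP
    "他们": Unit.PHRASE,              # (4-1f) pronoun as NP
    "城门失火,殃及池鱼": Unit.SENTENCE,  # (4-1g) idiomatic sentence
}


def unit_of(entry: str) -> Optional[Unit]:
    """Grammatical unit of a listed entry, or None if the string is unlisted."""
    return LEXICON.get(entry)


def listed_grammar_words() -> List[str]:
    """Vocabulary words that also function as grammar words."""
    return [entry for entry, unit in LEXICON.items() if unit is Unit.WORD]


print(listed_grammar_words())
print(unit_of("洗澡"))
```

Lexical lookup returns all seven entries as building blocks, but only three of them are of the unit that syntax treats as atomic, which is why the two notions have to be kept apart.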
To avoid this 'circular definition' problem, a feasible alternative is to system internally define grammar word and the labor division between Chinese morphology and syntax, as in the case of CPSG95. Of course, the system internal definition still needs to be justified based on the proposed morphological or syntactic analysis of borderline phenomena in terms of capturing the linguistic generality. More specifically, three things need to be done: (i) argue for the analysis case by case, e.g. why a certain construction should be treated as a morphological or syntactic phenomenon, what linguistic generality is captured by such a treatment, etc.; (ii) establish some operational methods for wordhood judgment to cover similar cases; (iii) use formalized data structures to represent the linguistic units after the wordhood judgment is made. Section 4.2 will handle task (ii) and Section 4.3 is devoted to the formal definition of word required by task (iii). The task in (i) will be pursued in the remaining chapters. Another important notion related to grammar word is unlisted word . Conceptually, an unlisted word is a novel construction formed via morphological rules, e.g. a derived word like ke-du-xing (-able-read-ness: readability), foolish-ness , a compound person name (given name + family name) such as John Smith, mao-ze-dong (Mao Zedong). Unlisted words are often rule based. This is where productive word formation sets in. However, unlisted word is not a crystal clear notion, just like the underlying concept grammar word . Many grammarians have observed that phrases and unlisted words in Chinese are formed under similar rules (e.g. Zhu 1985; J. Lu 1988). As both syntactic constructions and unlisted words are rule based, it can be difficult to judge a significant amount of borderline constructions as morphological or syntactic. There are fuzzy cases where a construction is regarded as a grammar word by one and judged as a syntactic construction by another. 
For example, while san (three) ge (CLA) is regarded as a syntactic construction, namely numeral-classifier phrase, in many grammars including CPSG95, such constructions are treated as compound words by others (e.g. Chen and Liu 1992). ‘Quasi-affixation’ presents another outstanding ‘gray area’ (see 6.2). The difficulty in handling the borderline phenomena leads back to the argument that the labor division between Chinese morphology and syntax should be pursued system-internally and argued case by case in terms of capturing the linguistic generality. To implement the required system internal definition, it is desirable to investigate practical wordhood judgment methods in addition to case-by-case arguments. Some judgment methods will be developed in 4.2. Case-by-case arguments and analysis for specific phenomena will be presented in later chapters. After the wordhood judgment is made, there is a need for the formal representation. Section 4.3 defines the formal representation of word with illustrations. 4.2. Judgment Methods This section proposes some operational wordhood judgment methods based on the notion of ‘syntactic atomicity’ (Di Sciullo and Williams 1987). These methods should be applied in combination with arguments of the associated grammatical analysis. In fact, whether a sign is judged as a morpheme, a grammar word or a phrase ultimately depends on the related grammatical analysis. However, the operationality of these methods will help facilitate the later analysis for some individual problems and avoid unnecessary repetition of similar arguments. Most methods proposed for Chinese wordhood judgment in the literature are not fully operational. For example, Chao (1968) agrees with Z. Lu (1957) that a word can fill the functional frame of a typical syntactic structure. 
Dai (1993) points out that this method may effectively separate bound morphemes from free words, it cannot differentiate between words and phrases, as phrases may also be positioned in a syntactic frame. In fact, whether this method can indeed separate bound morphemes from free words is still a problem. This method cannot be made operational unless the definition of ‘frame of a typical syntactic structure’ is given. The judgment methods proposed in this section try to avoid this ‘lack of operationality’ problem. Dai (1993) made a serious effort in proposing a series of methods for cutting the line between morphemes and syntactic units in Chinese. These methods have significantly advanced the study of this topic. However, Dai admits that there is limitation associated with these proposals. While each proposed method provides a sufficient (but not necessary) condition for judging whether a unit is a morpheme, none of the methods can further determine whether this unit is a word or a phrase. For example, the method of syntactic independence tests whether a unit in a question can be used as a short answer to the question. If yes, the syntactic independence is confirmed and this unit is not a morpheme inside a word. Obviously, such a method tells nothing about the syntactic rank of the tested unit because a word, a phrase or clause can all serve as an answer to a question. In order to achieve that, other methods and/or analyses need to be brought in. The first judgment method proposed below involves passivization and topicalization tests. In essence, this is to see whether a string involves syntactic processes. As an atomic unit, the internal structure of a word is transparent to syntax. It follows that no syntactic processes are allowed to exert effects on the internal structure of a word. 
As passivization and topicalization are generally acknowledged to be typical syntactic processes, if a potential combination A+B is subject to passivization B+ bei +A and topicalization B+…+NP+A, it can be concluded that A+B is not a word: the relation between A and B must be syntactic. The second method is to define an unambiguous pattern for the wordhood judgment, namely, judgment patterns. Judgment patterns are by no means a new concept. In particular, keyword based judgment patterns have been frequently used in the literature of Chinese linguistics as a handy way for deterministic word category detection (e.g. L. Wang 1955; Zhu 1985; Lü 1989). The following keyword (i.e. aspect markers) based patterns are proposed for judging a verb sign.

(4-2.)
(a) V(X)+着/过 → word(X)
(b) V(X)+着/过/了+NP → word(X)

The pattern (4-2a) states that if X is a sign of verb, no matter transitive or intransitive, appearing immediately before zhe / guo, then X is a word. This proposal is backed by the following argument. It is an important and widely acknowledged grammatical generalization in Chinese syntax that the aspect markers appear immediately after lexical verbs (Lü et al 1980). Note that the aspect marker le (LE) is excluded from the pattern in (4-2a) because the same key word le corresponds to two distinctive morphemes in Chinese: the aspect le (LE) attaches to a lexical V while the sentence-final le (LEs) attaches to a VP (Lü et al 1980). Therefore, judgment cannot be reliably made when a sentence ends in X+ le, for example, when X is an intransitive verb or a transitive verb with the optional object omitted. However, le in pattern (4-2b) has no problem since le is not in the ambiguous sentence final position. This pattern says that if any of the three aspect markers appears between a sign X of verb and NP, X must be a word: in fact, it is a lexical transitive verb. There are two ways to use the judgment patterns.
If a sub-string of the input sentence matches a judgment pattern, one reaches the conclusion promptly. If the input string does not match a pattern directly, one can still make indirect use of the patterns for judgment. The idiomatic combination xi (wash) zao (bath) is a representative example. Assume that the vocabulary word xi zao is a grammar word. It follows that it should be able to fill in the lexical verb position in the judgment pattern (4-2a). We then make a sentence which contains a sub-string matching the pattern to see whether it is grammatical. The result is ungrammatical: *ta (he) xi-zao (V) zhe (ZHE); *ta (he) xi-zao (V) guo (GUO). Therefore, our assumption must be wrong: xi zao is not a grammar word. We then change the assumption and try to insert aspect markers inside it (this is in fact an expansion test, to be discussed shortly). The new assumption is that the verb xi alone is a grammar word. What we get are perfectly grammatical sentences and they match the pattern (4-2b): ta (he) xi (V) zhe (ZHE) zao (bath): ‘He is taking a bath’; ta (he) xi (V) guo (GUO) zao (bath): ‘He has taken the bath’. Therefore the assumption is proven to be correct. This way, all V+X combinations can be judged based on the judgment patterns (4-2a) or (4-2b). The third method proposed below involves a more general expansion test. As an atomic unit in syntax, the internal parts of a word are in principle not separable. Lü (1989) emphasized inseparability as a criterion for judging grammar words. But he did not give instructions on how this criterion should be applied. Nevertheless, many linguists (e.g. Bloomfield 1933; Z. Lu 1957; Lyons 1968; Dai 1993) have discussed expansion tests one way or another in assisting the wordhood judgment. The method of expansion to be presented below for wordhood judgment is called X-insertion. X-insertion is based on Di Sciullo and Williams’ thesis of the syntactic atomicity of word.
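Returning to the judgment patterns (4-2a) and (4-2b): their direct use can be sketched as a small matcher over a sequence of category-tagged signs. This is only an illustrative sketch, not part of CPSG95. The token representation, the category labels "V" and "NP", and the function name are assumptions made for this example; the input is assumed to be a grammatical string, since the patterns say nothing about ungrammatical ones.

```python
from typing import List, Set, Tuple

# A token is (form, category). The labels "V", "NP", "ASP" are
# hypothetical tags chosen for this sketch.
Token = Tuple[str, str]

ASPECT_4_2A = {"着", "过"}   # zhe / guo; le is excluded from (4-2a)
ASPECT_LE = "了"             # le only counts when an NP follows, as in (4-2b)


def words_by_judgment_patterns(tokens: List[Token]) -> Set[int]:
    """Return the indices of verb signs judged to be words.

    (4-2a): V(X) + 着/过        -> word(X)
    (4-2b): V(X) + 着/过/了 + NP -> word(X)
    """
    judged: Set[int] = set()
    for i, (_, category) in enumerate(tokens):
        if category != "V" or i + 1 >= len(tokens):
            continue
        marker = tokens[i + 1][0]
        if marker in ASPECT_4_2A:
            # Covers (4-2a), and (4-2b) with 着/过, which it subsumes.
            judged.add(i)
        elif marker == ASPECT_LE:
            # Sentence-final le is ambiguous, so require a following NP.
            if i + 2 < len(tokens) and tokens[i + 2][1] == "NP":
                judged.add(i)
    return judged


# ta xi zhe zao: xi is judged a word via (4-2b).
bath = [("他", "NP"), ("洗", "V"), ("着", "ASP"), ("澡", "NP")]
# A sentence ending in X + le: no judgment is made.
final_le = [("他", "NP"), ("来", "V"), ("了", "ASP")]
```

The indirect use described above (assuming xi zao is a word and testing the result for grammaticality) additionally needs a grammaticality judgment from a speaker or a grammar, which this matcher does not supply.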
The rationale is that the internal parts of a word cannot be separated by syntactic constituents. As a method, how to perform X-insertion is defined as follows. Suppose that one needs to judge whether the combination A+B is a word. If a sign X can be found to satisfy the following condition, then A+B is not a word, but a syntactic combination: (i) A+X+B is a grammatical string, (ii) X is not a bound morpheme, and (iii) the sub-structure is headed by A or the sub-string is headed by B. The first constraint is self-evident: a syntactic combination is necessarily a grammatical string. The second constraint aims at eliminating the danger of wrongly applying an infix here. In fact, if X is a morphological infix, the conclusion would be just opposite: A+B is a word. The last constraint states that X must be a dependant of the head A (or B). Otherwise it results in a different structure. There is no direct structural relation between A and B when A (or B) is a dependant of the head X in the structure. Therefore, the question of whether A+B is a phrase or a word does not apply in the first place. After the wordhood judgment is made on strings of signs based on the above judgment methods and/or the arguments for the analysis involved, the next step is to have them properly represented (coded) in the grammar formalism used. This is the topic to be presented in 4.3 below. 4.3. Formal Representation of Word The expectation feature structure and structural phrase structure in the mono-stratal design of CPSG95 presented in Chapter III provide means for the formal definition of the basic unit word in CPSG95. Once the wordhood judgment for a unit is made based on arguments for a structural analysis and/or using the methods presented in Section 4.2., the formal representation is required for coding it in CPSG95. This type of formalization is required to ensure its implementability in enforcing a required configurational constraint. 
For example, the suffix -xing expects an adjective word to form an abstract noun; such constraints and @word can be placed in the morphological expectation feature . These constraints will permit, for example, the legitimate derived word -xing] (serious-ness), but will block the following combination * -xing] (very-serious-ness). This is because violates the formal constraint as given in the word definition: it is not an atomic unit in syntax. In CPSG95, word is defined as a syntactically atomic unit without obligatory morphological expectations, formally represented in the following macro.

word macro
a_sign
PREFIXING saturated | optional
SUFFIXING saturated | optional
STRUCT no_syn_dtr

Note that the above formal definition uses the sorted hierarchy for the structural feature structure and the sorted hierarchy for the expectation feature structure. The definitions of these feature structures have been given in the preceding Chapter III. Based on the sorted hierarchy struct: {syn_dtr, no_syn_dtr}, the constraint STRUCT no_syn_dtr ensures that the word sign does not contain any syntactic daughter. This prevents syntactic constructions from being treated as words. On the other hand, since , and are three subtypes of , the constraint prevents a bound morpheme, say a prefix or suffix which has obligatory expectation in or , from being treated as a word. This macro definition covers the representation of mono-morpheme words, e.g. e ‘goose’, du ‘read’, etc., or multi-morpheme words, e.g. xiao-kan ‘look down upon’, tian-e ‘swan’, etc., as well as unlisted words such as derived words whose internal morphological structures have already been formed. Some typical examples of word are shown below. For a derived word, note that the specification of and , or and , assigned by the corresponding PS rule is compatible with the macro word definition. The above word definition is an extension of the corresponding representation features from HPSG (Pollard and Sag 1987).
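The word macro above amounts to a three-way check on a sign, which can be sketched as a plain predicate. This is a minimal illustration rather than the ALE macro itself: the `Sign` class, the string encoding of the sorts, and the feature values given to the sample signs are assumptions made for this example. Only the three constraints come from the macro.

```python
from dataclasses import dataclass

# Sorts admitted by the word macro for the two morphological
# expectation features.
WORD_EXPECTATION = {"saturated", "optional"}


@dataclass
class Sign:
    form: str
    prefixing: str = "saturated"   # sort of the PREFIXING expectation
    suffixing: str = "saturated"   # sort of the SUFFIXING expectation
    struct: str = "no_syn_dtr"     # "syn_dtr" or "no_syn_dtr"


def is_word(sign: Sign) -> bool:
    """Check the three constraints of the word macro."""
    return (
        sign.prefixing in WORD_EXPECTATION
        and sign.suffixing in WORD_EXPECTATION
        and sign.struct == "no_syn_dtr"
    )


# Hypothetical sample signs; their feature values are assumed here.
du = Sign("du")                                 # mono-morpheme word
ke = Sign("ke-", prefixing="obligatory")        # bound morpheme (which feature
                                                # holds the expectation is assumed)
phrase = Sign("hen yan-su", struct="syn_dtr")   # syntactic construction
```

A bound morpheme fails on the expectation constraints and a syntactic construction fails on the structural constraint, which is the two-sided separation the macro is meant to enforce.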
HPSG uses a binary structural feature to distinguish lexical signs, , and non-lexical signs, . In addition, is divided into and . Except for the one-to-one correspondence between and in terms of rank (which stands for non-atomic syntactic constructs including phrases), neither of these HPSG binary divisions account for the distinction between a bound morpheme and a free morpheme. Such a distinction is not necessary in HPSG because bound morphemes are assumed to be processed in the preprocessing stage (e.g. lexical rules for English inflection, Pollard and Sag 1987) and do not show themselves as independent input to the parser. As CPSG95 involves both derivation morphology and syntax in an integrated general grammar, the HPSG binary divisions are no longer sufficient for formalizing the word definition. ‘Word’ in CPSG95 needs to be distinguished with proper constraints from not only syntactic constructs, but also from affixes (bound morphemes). In CPSG95, as productive derivation is designed to be an integrated component of the grammar, the word definition is both specified in the lexicon for some free morpheme words and assigned by the rules in morphological analysis. This practice in essence follows one suggestion in the original HPSG book: we might divide rules of grammar into two classes: rules of word formation, including compounding rules, which introduce the specification on the mother, and other rules, which introduce on the mother. (Pollard and Sag 1987:73). It is worth noticing that words thus defined can fill either a morphological position or a syntactic position. This reflects the interface nature of word: word is an eligible unit in both morphology and syntax. This is in contrast to bound morphemes which can only be internal parts of morphology. In morphology, derivation combines a word and an affix into a derived word. These derivatives are eligible to feed morphology again. This is shown above by the examples in (4-5) and (4-6). 
The adjective word ke‑du (read-able) is derived from the prefix morpheme ke- (-able) and the word du (read). Like other adjective words, this derived word can further combine with the suffix –xing (-ness) in morphology. It can also directly enter syntax, as all words do. To syntax, all words are atomic units. If a lexical position is specified, via the macro constraint @word in CPSG95, in a syntactic pattern, it makes no difference whether a filler of this position is a listed grammar word, or an unlisted word such as a derivative. Such distinction is transparent to the syntactic structure.

4.4. Summary

Efforts have been made to reach a better understanding of Chinese wordhood in theory, methodology and formalization. The main spirit of the HPSG theory and Di Sciullo and Williams' ‘syntactic atomicity’ theory has been applied to the study of Chinese wordhood and its formal representation. Some effective wordhood judgment methods have also been proposed, based on theoretical guidelines. The above work in the area of Chinese wordhood study provides a sound foundation for the analysis of the specific Chinese morpho-syntactic interface problems in Chapter V and Chapter VI.

-------------------------------------------------------

For Classical Chinese, word, morpheme, syllable and hanzi are presumably all co-extensive. This is the so-called Monosyllabic Myth of Chinese (DeFrancis 1984: ch.8). The development of large numbers of homophones, mainly due to the loss of coda stops, has led to the development of large quantities of bi-syllabic and poly-syllabic word-like expressions (Chen and Wang 1975). Classical Chinese arguably allows for a certain degree of compounding. In the linguistic literature, some linguists (e.g. Sapir 1921; Zhang 1957; Jensen 1990) did not strictly distinguish Contemporary/Modern Chinese from Classical Chinese and they held the general view that Chinese has little morphology except for limited compounding.
But this view of Contemporary Chinese has been criticized as misconception (Dai 1993) and is no longer accepted by the community of Chinese grammarians. Di Sciullo and Williams call a sign listable in the lexicon listeme , equivalent to the notion vocabulary word . In the literature, variations of this view include the Lexicalist position (Chomsky 1970), the Lexical Integrity Hypothesis (Jackendoff 1972), the Principle of Morphology-Free Syntax (Zwicky 1987), etc. This type of ‘atomicity’ constraint (Di Sciullo and Williams 1987) is generally known as Lexical Integrity Hypothesis (LIH, Jackendoff 1972), which states that syntactic rules or operations cannot refer to part of a word. A more elaborate version of LIH is proposed by Zwicky (1987) as a Principle of Morphology-Free Syntax. This principle states that syntactic rules cannot make reference to the internal morphological composition of words. The only lexical properties accessible to syntax, according to Zwicky, are syntactic category, subcategory, and features like gender, case, person, etc. Of course, in theory a word may be separated by morphological infix . But except for the two modal signs de3 (can) and bu (cannot) (see Section 5.3 in Chapter V), there does not seem to exist infixation in Mandarin Chinese. In terms of rank, in CPSG95 corresponds to the type in HPSG (Pollard and Sag 1987). A binary division between and is enough in HPSG to distinguish the atomic unit word from syntactic construction. But, as CPSG95 incorporates derivation in the general grammar, covers for both free morphemes and bound morphemes. That is why the constraint on alone cannot define word in CPSG95; it needs to involve constraints on morphological expectation structures as well, as shown in the macro definition. Note that there are signs which are not of the type . 
PhD Thesis: Chapter III Design of CPSG95
liwei999 2016-8-26 09:04
3.0. Introduction CPSG95 is the grammar designed to formalize the morpho-syntactic analysis presented in this dissertation. This chapter presents the general design of CPSG95 with emphasis on three essential aspects related to the morpho-syntactic interface: (i) the overall mono-stratal design of the sign; (ii) the design of expectation feature structures; (iii) the design of structural feature structures. The HPSG-style mono-stratal design of the sign in CPSG95 provides a general framework for the information flow between different components of a grammar via unification. Morphology, syntax and semantics are all accommodated in distinct features of a sign. An example will be shown to illustrate the information flow between these components. Expectation feature structures are designed to accommodate lexical information for the structural combination. Expectation feature structures are vital to a lexicalized grammar like CPSG95. The formal definition for the sort hierarchy for the expectation features will be given. It will be demonstrated that the defined sort hierarchy provides means for imposing a proper structural hierarchy as defined by the general grammar. One characteristic of the CPSG95 structural expectation is the unique design of morphological expectation features to incorporate Chinese productive derivation. This design is believed to be a feasible and natural way of modeling Chinese derivation, as shall be presented shortly below and elaborated in section 3.2.1. How this design benefits the interface coordination between derivation and syntax will be further demonstrated in Chapter VI. The type for the expectation features is similar to the HPSG definition of and . They both accommodate lexical expectation information to drive the analysis conducted via the general grammar. 
In order to meet some requirements induced by introducing morphology into the general grammar and by accommodating linguistic characteristics of Chinese, three major modifications from the standard HPSG are proposed in CPSG95. They are: (i) the CPSG95 type is more generalized as to cover productive derivation in addition to syntactic subcategorization and modification; (ii) unlike HPSG which tries to capture word order phenomena as independent constraints, Chinese word order in CPSG95 is integrated in the definition of the expectation features and the corresponding morphological/syntactic relations; (iii) in terms of handling the syntactic subcategorization, CPSG95 pursues a non-list alternative to the standard practice of HPSG relying on the list design of obliqueness hierarchy. The rationale and arguments for these modifications are presented in the corresponding sections, with a brief summary given below. The first modification is necessitated by meeting the needs of introducing Chinese productive derivation into the grammar. It is observed that a Chinese affix acts as the head daughter of the derivative in terms of expectation (Dai 1993). The expectation information that drives the analysis of a Chinese productive derivation is found to be capturable lexically by the affix sign; this is very similar to how the information for the head-driven syntactic analysis is captured in HPSG. The expansion of the expectation notion to include productive morphology can account for a wider range of linguistic phenomena. The feasibility of this modification has been verified by the implementation of CPSG95 based on the generalized expectation feature structures. One outstanding characteristic of all the expectation features designed in CPSG95 is that the word order information is implied in the definition of these features. Word order constraints in CPSG95 are captured by individual PS rules for the structural relationship between the constituents. 
In other words, Chinese word order constraints are not treated as phenomena which have sufficient generalizations of themselves independent of the individual morphological or syntactic relations. This is very different from the word order treatment in theories like HPSG (Pollard and Sag 1987) and GPSG (Gazdar, Klein, Pullum and Sag 1985). However, a similar treatment can be found in the work from the school of ‘categorial grammar’ (e.g. Dowty 1982). The word order theory in HPSG and GPSG is based on the assumption that structural relations and syntactic roles can be defined without involving the factor of word order. In other words, it is assumed that the structural nature of a constituent (subject, object, etc.) and its linear position in the related structures can be studied separately. This assumption is found to be inappropriate in capturing Chinese structural relations. So far, no one has been able to propose an operational definition for Chinese structural relations and morphological/syntactic roles without bringing in word order. As Ding (1953) points out, without the means of inflections and case markers, word order is a primary constraint for defining and distinguishing Chinese structural relations. In terms of expectation, it can always be lexically decided where for the head sign to look for its expected daughter(s). It is thus natural to design the expectation features directly on their expected word order. The reason for the non-list design in capturing Chinese subcategorization can be summarized as follows: (i) there has been no successful attempt by anyone, including the initial effort involved in the CPSG95 experiment, which demonstrates that the obliqueness design can be applied to Chinese grammar with sufficient linguistic generalizations; (ii) it is found that the atomic approach with separate features for each complement is a feasible and flexible proposal in representing the relevant linguistic phenomena. 
Finally, the design of the structural feature originates from in HPSG (Pollard and Sag 1987). Unlike the binary type for , the type for forms an elaborate sort hierarchy. This is designed to meet the configurational requirements of introducing morphology into CPSG95. This feature structure, together with the design of expectation feature structures, will help create a favorable framework for handling Chinese morpho-syntactic interface. The proposed structural feature structure and the expectation feature structures contribute to the formal definition of linguistic units in CPSG95. Such definitions enable proper lexical configurational constraints to be imposed on the expected signs when required.

3.1. Mono-stratal Design of Sign

This section presents the data structure involving the interface between morphology, syntax and semantics in CPSG95. This is done by defining the mono-stratal design of the fundamental notion sign and by illustrating how different components, represented by the distinct features for the sign, interact. As a dynamic unit of grammatical analysis, a sign can be a morpheme, a word, a phrase or a sentence. It is the most fundamental object of HPSG-style grammars. Formally, a sign is defined in CPSG95 by the type a_sign, as shown below.

(3-1.) Definition: a_sign
a_sign
HANZI hanzi_list
CONTENT content
CATEGORY category
SUBJ expected
COMP0_LEFT expected
COMP1_RIGHT expected
COMP2_RIGHT expected
MOD_LEFT expected
MOD_RIGHT expected
PREFIXING expected
SUFFIXING expected
STRUCT struct

The type a_sign introduces a set of linguistic features for the description of a sign. These are features for orthography, morphology, syntax and semantics, etc. The types, which are eligible to be the values of these features, have their own definitions in the sort hierarchy. An introduction of these features follows. The orthographic feature HANZI contains a list of Chinese characters (hanzi or kanji). The feature CONTENT embodies the semantic representation of the sign.
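For readers more at home with program data structures, definition (3-1) can be pictured as a record with one field per feature. The sketch below is an analogy under stated assumptions, not the typed feature structure used in CPSG95: the class names `ASign` and `Expected`, the lowercase field names, and the placeholder contents of `Expected` are all hypothetical, and unification and the sort hierarchy are not modeled.

```python
from dataclasses import dataclass, field, fields
from typing import Any, List, Optional


@dataclass
class Expected:
    """Stand-in for the type `expected`; its sorts are not modeled here."""
    status: str = "unspecified"         # hypothetical placeholder value
    constraint: Optional[dict] = None   # hypothetical: constraints on the daughter


@dataclass
class ASign:
    hanzi: List[str]                    # HANZI: list of Chinese characters
    content: Any = None                 # CONTENT: semantic representation
    category: Optional[str] = None      # CATEGORY
    subj: Expected = field(default_factory=Expected)
    comp0_left: Expected = field(default_factory=Expected)
    comp1_right: Expected = field(default_factory=Expected)
    comp2_right: Expected = field(default_factory=Expected)
    mod_left: Expected = field(default_factory=Expected)
    mod_right: Expected = field(default_factory=Expected)
    prefixing: Expected = field(default_factory=Expected)
    suffixing: Expected = field(default_factory=Expected)
    struct: str = "struct"              # STRUCT: value drawn from the struct sorts


MORPHOLOGICAL = ("prefixing", "suffixing")


def expectation_features(sign: ASign) -> List[str]:
    """Names of the features whose value is of type Expected."""
    return [f.name for f in fields(sign)
            if isinstance(getattr(sign, f.name), Expected)]


def syntactic_expectation_features(sign: ASign) -> List[str]:
    """Expectation features other than the two morphological ones."""
    return [n for n in expectation_features(sign) if n not in MORPHOLOGICAL]
```

Listing the fields this way makes the split visible: eight features take an `expected` value, two of them morphological and six syntactic, and every one of them encodes a linear direction or position in its name.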
carries values like for noun, for verb, for adjective, for preposition, etc. The structural feature contains information on the relation of the structure to its sub-constituents, to be presented in detail in section 3.3. The features whose appropriate value must be the type are called expectation features . They are the essential part of a lexicalist grammar as these features contain information about various types of potential structures in both syntax and morphology. They specify various constraints on the expected daughter(s) of a sign for structural analysis. The design of these expectation features and their appropriate type will be presented shortly in section 3.2. The definition of illustrates the HPSG philosophy of mono-stratal analysis interleaving different components. As seen, different components of Chinese grammar are contained in different feature structures for the general linguistic unit sign . Their interaction is effected via the unification of relevant feature structures during various stages of analysis. This will unfold as the solutions to the morpho-syntactic interface problems are presented in Chapter V and Chapter VI. For illustration, the prefix 可 ke (-able) is used as an example in the following discussion. As is known, the prefix ke – (-able) makes an adjective out of a transitive verb: ke- + Vt – A. This lexicalized rule is contained in the CPSG95 entry for the prefix ke- , shown in (3-2). Following the ALE notation, @ is used for macro , a shorthand mechanism for a pre-defined feature structure. As seen, the prefix ke- morphologically expects a sign with . An affix is analyzed as the head of a derivational structure in CPSG95 (see section 6.1 for discussion) and is a representative head feature to be percolated up to the mother sign via the corresponding morphological PS rule as formulated in (6-4) of section 6.2, this expectation eventually leads to a derived word with . 
Like most Chinese adjectives, the derived adjective has an optional expectation for a subject NP to account for sentences like 这本书很可读 zhe (this) ben (CLA) shu (book) hen (very) ke-du (read-able): ‘This book is very readable’. This syntactic optional expectation for the derivative is accommodated in the head feature . Note that before any structural combination of ke- with other expected signs, ke- is a bound morpheme, a sign which has obligatory morphological expectation in . As a head for both the morphological combination ke +Vt and the potential syntactic combination NP+ , the interface between morphology and syntax in this case lies in the hierarchical structures which should be imposed. That is, the morphological structure (derivation) should be established before its syntactic expected structure can be realized. Such a configurational constraint is specified in the corresponding PS rules, i.e. the Subject PS Rule and The Prefix PS Rule. It guarantees that the obligatory morphological expectation of ke- has to be saturated before the sign can be legitimately used in syntactic combination. The interaction between morphology/syntax and semantics in this case is encoded by the information flow, i.e. structure-sharing indicated by the number index in square brackets, between the corresponding feature structures inside this sign. The semantic compositionality involved in the morphological and syntactic grouping is represented like this. There is a semantic predicate marked as (for worthiness) in the content feature ; this predicate has an argument which is co-indexed by with the semantics of the expected Vt. Note that the syntactic subject of the derived adjective, say ke-du (read-able) or ke-chi (eat-able), is the semantic (or logical) object of the stem verb, co-indexed by in the sample entry above. 
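The configurational constraint in the ke- example (derivation must be completed before the subject expectation can be realized) can be illustrated with a toy version of the two rules involved. This is a rough sketch under several assumptions, not the Prefix PS Rule or Subject PS Rule of CPSG95: signs are plain dictionaries, the category labels "a", "vt" and "np", the predicate name "worthiness", and the `status` encoding are hypothetical, and copying stands in for unification and structure-sharing. The sample sentence is simplified to shu ke-du, without the classifier phrase and hen.

```python
import copy

# Hypothetical toy entries; labels and encodings are assumptions.
KE = {"HANZI": ["可"], "CATEGORY": "a", "CONTENT": None, "STRUCT": "no_syn_dtr",
      "PREFIXING": {"status": "obligatory", "category": "vt"},
      "SUBJ": {"status": "optional", "category": "np"}}
DU = {"HANZI": ["读"], "CATEGORY": "vt", "CONTENT": "read"}
BOOK = {"HANZI": ["书"], "CATEGORY": "np", "CONTENT": "book"}


def prefix_rule(prefix, stem):
    """Combine a prefix (the head) with the stem it lexically expects."""
    exp = prefix["PREFIXING"]
    if exp["status"] == "saturated":
        raise ValueError("no open morphological expectation")
    if stem["CATEGORY"] != exp["category"]:
        raise ValueError("stem does not satisfy the expectation")
    mother = copy.deepcopy(prefix)           # head features percolate up
    mother["HANZI"] = prefix["HANZI"] + stem["HANZI"]
    mother["PREFIXING"] = {"status": "saturated"}
    # The worthiness predicate takes the stem verb's semantics as argument.
    mother["CONTENT"] = {"pred": "worthiness",
                         "arg": {"pred": stem["CONTENT"], "object": None}}
    return mother


def subject_rule(subject, head):
    """Combine a subject with a head whose morphology is already complete."""
    if head["PREFIXING"]["status"] == "obligatory":
        raise ValueError("morphological expectation must be saturated first")
    exp = head["SUBJ"]
    if exp["status"] == "saturated" or subject["CATEGORY"] != exp["category"]:
        raise ValueError("subject expectation cannot be satisfied")
    mother = copy.deepcopy(head)
    mother["HANZI"] = subject["HANZI"] + head["HANZI"]
    mother["SUBJ"] = {"status": "saturated"}
    mother["STRUCT"] = "syn_dtr"
    # The syntactic subject is the logical object of the stem verb.
    mother["CONTENT"]["arg"]["object"] = subject["CONTENT"]
    return mother
```

Calling `subject_rule(BOOK, KE)` raises an error because ke- is still a bound morpheme, while `subject_rule(BOOK, prefix_rule(KE, DU))` succeeds and fills the object slot of the stem verb with the subject's semantics.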
The head feature which reflects the semantic compositionality will be percolated up to the mother sign when applicable morphological and syntactic PS rules take effect in structure building. In summary, embodied in CPSG95 is a mono-stratal grammar of morphology and syntax within the same formalism. Both morphology and syntax use same data structure (typed feature structure) and mechanisms (unification, sort hierarchy, PS rules, lexical rules, macros). This design for Chinese grammar is original and is shown to be feasible in the CPSG95 experiments on various Chinese constructions. The advantages of handling morpho-syntactic interface problems under this design will be demonstrated throughout this dissertation. 3.2. Expectation Feature Structures This section presents the design of the expectation features in CPSG95. In general, the expectation features contain information about various types of potential structures of the sign. In CPSG95, various constraints on the expected daughter(s) of a sign are specified in the lexicon to drive both morphological and syntactic structural analysis. This provides a favorable basis for interleaving Chinese morphology and syntax in analysis. The expected daughter in CPSG95 is defined as one of the following grammatical constituents: (i) subject in the feature ; (ii) first complement in the feature or ; (iii) second complement in ; (iv) head of a modifier in the feature or ; (v) stem of an affix in the feature or . The first four are syntactic daughters which will be investigated in sections 3.2.2 and 3.2.3. The last one is the morphological daughter for affixation, to be presented in section 3.2.1. All these features are defined on the basis of the relative word order of the constituents in the structure. The hierarchy for the structure at issue resorts to the configurational constraints which will be presented in section 3.2.4. 3.2.1. 
Morphological Expectation One key characteristic of the CPSG95 expectation features is the design of morphological expectation features to incorporate Chinese productive derivation. It is observed that a Chinese affix acts as the head daughter of the derivative in terms of expectation (see section 6.1 for more discussion). An affix can lexically define what stem to expect and can predict the derivation structure to be built. For example, the suffix 性 –xing demands that it combine with a preceding adjective to make an abstract noun, i.e. A+ -xing – N. This type of information can be easily captured by the expectation feature structure in the lexicon, following the practice of the HPSG treatment of the syntactic expectation such as subcategorization and modification. In the CPSG95 lexicon, each affix entry is encoded to provide the following derivation information: (i) what type of stem it expects; (ii) whether it is a prefix or suffix to decide where to look for the expected stem; (iii) what type of (derived) word it produces. Based on this lexical information, the general grammar only needs to include two PS rules for Chinese derivation: one for prefixation, one for suffixation. These rules will be formulated in Chapter VI (sections 6.2 and 6.3). It will also be demonstrated that this lexicalist design for Chinese derivation works for both typical cases of affixation and for some difficult cases such as ‘quasi-affixation’ and zhe- suffixation. In summary, the morphological combination for productive derivation in CPSG95 is designed to be handled by only two PS rules in the general grammar, based on the lexical specification in and . Essentially, in CPSG95, productive derivation is treated like a ‘mini-syntax’; it becomes an integrated part of Chinese structural analysis. 3.2.2. Syntactic Expectation This section presents the design of the expectation features to represent Chinese syntactic relations. 
It will be demonstrated that constraints like word order and function words are crucial to the formalization of syntactic relations. Based on them, four types of syntactic relations can be defined, which are accommodated in six syntactic expectation feature structures for each head word. There is no general agreement on how to define Chinese syntactic relations. In particular, the distinction between Chinese subject and object has been a long debated topic (e.g. Ding 1953; L. Li 1986, 1990; Zhu 1985; Lü 1989). The major difficulty lies in the fact that Chinese does not have inflection to indicate subject-verb agreement and nominative case or accusative case, etc. Theory-internally, there have been various proposals that Chinese syntactic relations be defined on the basis of one or more of the following factors: (i) word order (more precisely, constituent order); (ii) the function words associated with the constituents; (iii) the semantic relations or roles. The first two factors are linguistic forms while the third factor belongs to linguistic content. L. Li (1986, 1990) relies mainly on the third factor to study Chinese verb patterns. The constituents in his proposal are named as NP-agent ( ming-shi ), NP-patient ( ming-shou ), etc. This practice amounts to placing an equal sign between the syntactic relation and semantic relation. It implies that the syntactic relation is not an independent feature. This makes syntactic generalization difficult. Other Chinese grammarians (e.g. Ding 1953; Zhu 1985) emphasize the factor of word order in defining syntactic relations. This school insists that syntactic relations be differentiated from semantic relations. More precisely, semantic relations should be the result of the analysis of syntactic relations. That is also the rationale behind the CPSG95 practice of using word order and other constraints (including function words) in the definition of Chinese relations. 
In CPSG95, the expected syntactic daughter in CPSG95 is defined as one of the following grammatical constituents: (i) subject in the feature , typically an NP which is on the leftmost position relative to the head; (ii) complements closer to the head in the feature or , in the form of an NP or a specific PP; (iii) the second complement in : this complement is defined to be an XP (NP, a specific PP, VP, AP, etc.) farther away from the head than in word order; (iv) head of a modifier in the feature or . In this defined framework of four types of possible syntactic relations, for each head word, the lexicon is expected to specify the specific constraints in its corresponding expectation feature structures and map the syntactic constituents to the corresponding semantic roles in . This is a secure way of linking syntactic structures and their semantic composition for the following reason. Given a specific head word and a syntactic structure with its various constraints specified in the expectation feature structures, the decoding of semantics is guaranteed. A Chinese syntactic pattern can usually be defined by constraints from category, word order, and/or function words (W. Li 1996). For example, NP+V, NP+V+NP, NP+PP(x)+NP, NP+V+NP+NP, NP+V+NP+VP, etc. are all such patterns. With the design of the expectation features presented above, these patterns can be easily formulated in the lexicon under the relevant head entry, as demonstrated by the sample formulations given in (3-3) and (3-4). The structure in (3-3) is a Chinese transitive pattern in its default word order, namely NP1+Vt+NP2. The representation in (3-4) is another transitive pattern NP+PP(x)+Vt. This pattern requires a particular preposition x to introduce its object before the head verb. The sample entry in (3-5) is an example of how modification is represented in CPSG95. 
Following the HPSG semantics principle, the semantic content from the modifier will be percolated up to the mother sign from the head-modifier structure via the corresponding PS rule. The added semantic contribution of the adverb chang-chang (often) is its specification of the feature for the event at issue. 3.2.3. Chinese Subcategorization This section presents the rationale behind the CPSG95 design for subcategorization. Instead of a SUBCAT-list, a keyword approach with separate features for each complement is chosen for representing the subcategorization information, as shown in the corresponding expectation features in section 3.2.2. This design has been found to be a feasible alternative to the standard practice of HPSG relying on the list design of obliqueness hierarchy and SUBCAT Principle when handling subject and complements. The CPSG95 design for representing subcategorization follows one proposal from Pollard and Sag (1987:121), who point out: “It may be possible to develop a hybrid theory that uses the keyword approach to subjects, objects and other complements, but which uses other means to impose a hierarchical structure on syntactic elements, including optional modifiers not subcategorized for in the same sense.” There are two issues for such a hybrid theory: the keyword approach to representing subject and complements and the means for imposing a hierarchical structure. The former is discussed below while the latter will be addressed in the subsequent section 3.2.4. The basic reason for abandoning the list design is due to the lack of an operational definition of obliqueness which captures generalizations of Chinese subcategorization. In the English version of HPSG (Pollard and Sag 1987, 1994), the obliqueness ordering is established between the syntactic notions of subject, direct object and second object (or oblique object ). But these syntactic relations themselves are by no means universal. 
In order to apply this concept to the Chinese language, there is a need for an operational definition of obliqueness which can be applied to Chinese syntactic relations. Such a definition has not been available. In fact, how to define Chinese subject, object and other complements has been one of the central debated topics among Chinese grammarians for decades (Lü 1946, 1989; Ding 1953; L. Li 1986, 1990; Zhu 1985; P. Chen 1994). No general agreement for an operational, cross-theory definition of Chinese subcategorization has been reached. It is often the case that formal or informal definitions of Chinese subcategorization are given within a theory or grammar. But so far no Chinese syntactic relations defined in a theory are found to demonstrate convincing advantages of a possible obliqueness ordering, i.e. capturing the various syntactic generalizations for Chinese. Technically, however, as long as subject and complements are formally defined in a theory, one can impose an ordering of them in a SUBCAT list. But if such a list does not capture significant generalizations, there is no point in doing so. It has turned out that the keyword approach is a promising alternative once proper means are developed for the required configurational constraint on structure building. The keyword approach is realized in CPSG95 as follows. Syntactic constituents for subcategorization, namely subject and complements, are directly accommodated in four parallel features , , and . The feasibility of the keyword approach proposed here has been tested during the implementation of CPSG95 in representing a variety of structures. Particular attention has been given to the constructions or patterns related to Chinese subcategorization. They include various transitive structures, di-transitive structures, pivotal construction ( jianyu-shi ), ba- construction ( ba-zi ju ), various passive constructions ( bei-dong shi ), etc. 
It is found to be easy to accommodate all these structures in the defined framework consisting of the four features. We give a couple of typical examples below, in addition to the ones in (3-3) and (3-4) formulated before, to show how various subcategorization phenomena are accommodated in the CPSG95 lexicon within the defined feature structures for subcategorization. The expected structure and example are shown before each sample formulation in (3‑6) through (3-8) (with irrelevant implementation details left out). Based on such lexical information, the desirable hierarchical structure on the related syntactic elements, e.g. ] instead of O], can be imposed via the configurational constraint based on the design of the expectation type. This is presented in section 3.2.4 below. 3.2.4. Configurational Constraint The means for the configurational constraint to impose a desirable hierarchical morpho-syntactic structure defined by a grammar is the key to the success of a keyword approach to structural constituents, including subject and complements from the subcategorization. This section defines the sort hierarchy of the expectation type . The use of this design for flexible configurational constraint both in the general grammar and in the lexicon will be demonstrated. As presented before, whether a sign has structural expectation, and what type of expectation a sign has, can be lexically decided: they form the basis for a lexicalized grammar. Four basic cases for expectation are distinguished in the expectation type of CPSG95: (i) obligatory: the expected sign must occur; (ii) optional: the expected sign may occur; (iii) null: no expectation; (iv) satisfied: the expected sign has occurred. Note that case (i), case (ii) and case (iii) are static information while (iv) is dynamic information, updated at the time when the daughters are combined into a mother sign. In other words, case (iv) is only possible when the expected structure has actually been built. 
In HPSG-style grammars, only the general grammar, i.e. the set of PS rules, has the power of building structures. For each structure being built, the general grammar assigns the value satisfied to the corresponding expectation feature of the mother sign. Of the four types, case (i) and case (ii) form a natural class, named a_expected; case (iii) and case (iv) form another class, named saturated. The formal definition of the type expected is given in (3-9).

(3-9.) Definition: sorted hierarchy for expected
expected: {a_expected, saturated}
a_expected: {obligatory, optional}
    ROLE role
    SIGN a_sign
saturated: {null, satisfied}

The type a_expected introduces two features: ROLE and SIGN. ROLE specifies the semantic role which the expected sign plays in the structure. SIGN houses various types of constraints on the expected sign. The type expected is designed to meet the requirement of the configurational constraint. For example, in order to guarantee that syntactic structures for an expecting sign are built on top of its morphological structures if the sign has obligatory morphological expectation, the following configurational constraint is enforced in the general grammar. (The notation | is used for logical OR.)

(3-10.) configurational constraint in syntactic PS rules
PREFIXING saturated | optional
SUFFIXING saturated | optional

The constraint means that syntactic rules are permitted to apply if a sign has no morphological expectation or after the morphological expectation has been satisfied. The optional case does not block the application of syntactic rules for the following reason: optional expectation entails that the expected sign may or may not appear, so it does not have to be satisfied. Similarly, within syntax, the constraints can be specified in the Subject PS Rule:

(3-11.) configurational constraint in Subject PS rule
COMP0_LEFT saturated | optional
COMP1_RIGHT saturated | optional
COMP2 saturated | optional

This ensures that complement rules apply before the subject rule does.
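The sort hierarchy in (3-9) and the constraints in (3-10) and (3-11) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CPSG95 implementation: the `Sign` class, the `satisfy` helper and the function names are hypothetical, a missing feature is treated as null, and a PS rule is reduced to flipping one expectation to satisfied.

```python
from dataclasses import dataclass, field
from enum import Enum


class Expected(Enum):
    """Leaf types of the sort hierarchy in (3-9)."""
    OBLIGATORY = "obligatory"
    OPTIONAL = "optional"
    NULL = "null"
    SATISFIED = "satisfied"


# The two natural classes of (3-9).
A_EXPECTED = frozenset({Expected.OBLIGATORY, Expected.OPTIONAL})
SATURATED = frozenset({Expected.NULL, Expected.SATISFIED})
# The disjunction "saturated | optional" used in (3-10) and (3-11).
NON_BLOCKING = SATURATED | {Expected.OPTIONAL}


@dataclass(frozen=True)
class Sign:
    """Hypothetical, drastically simplified sign: a form plus expectation statuses."""
    form: str
    expect: dict = field(default_factory=dict)

    def status(self, feature):
        # Assumption: a feature that is not listed counts as null (no expectation).
        return self.expect.get(feature, Expected.NULL)


def satisfy(head, feature, form):
    """Build a mother sign the way a PS rule would: mark `feature` as satisfied."""
    if head.status(feature) not in A_EXPECTED:
        raise ValueError(f"{head.form} has no open expectation in {feature}")
    return Sign(form, {**head.expect, feature: Expected.SATISFIED})


def syntactic_rules_allowed(sign):
    """(3-10): no pending obligatory morphological expectation."""
    return all(sign.status(f) in NON_BLOCKING for f in ("PREFIXING", "SUFFIXING"))


def subject_rule_allowed(sign):
    """(3-10) plus (3-11): complements must not be obligatory and still open."""
    complements = ("COMP0_LEFT", "COMP1_RIGHT", "COMP2")
    return syntactic_rules_allowed(sign) and all(
        sign.status(f) in NON_BLOCKING for f in complements
    )


# Toy entries (assumed values, for illustration only).
ke = Sign("ke-", {"PREFIXING": Expected.OBLIGATORY, "SUBJ": Expected.OPTIONAL})
print(syntactic_rules_allowed(ke))                      # bound morpheme: blocked
ke_du = satisfy(ke, "PREFIXING", "ke-du")
print(subject_rule_allowed(ke_du))                      # derivative: may take a subject
```

With these toy entries, the prefix ke- cannot enter any syntactic combination until its stem has been found, which is the ordering effect the configurational constraint is meant to guarantee.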
This way of imposing a hierarchical structure between subcategorized elements corresponds to the use of the SUBCAT Principle in HPSG based on the notion of obliqueness. The configurational constraint is also used in CPSG95 for the formal definition of phrase, as formulated below.

phrase macro
a_sign
    PREFIXING saturated | optional
    SUFFIXING saturated | optional
    COMP0_LEFT saturated | optional
    COMP1_RIGHT saturated | optional
    COMP2 saturated | optional

Despite the notational difference, this definition follows the spirit reflected in the phrase definition given in Pollard and Sag (1987:69) in terms of the saturation status of the subcategorized complements. In essence, the above definition says that a phrase is a sign whose morphological expectation and syntactic complement expectation (except for subject) are both saturated. The reason for including optional in the definition is to cover phrases whose head daughter has optional expectation, for example, a verb phrase consisting of just a verb with its optional object omitted in the text. Together with the design of the structural feature STRUCT (section 3.3), the sort hierarchy of the type expected will also enable the formal definition for the representation of the fundamental notion word (see Section 4.3 in Chapter IV). Definitions such as @word and @phrase are the basis for lexical configurational constraints to be imposed on the expected signs when required. For example, -xing (-ness) will expect an adjective stem with the word constraint and -zhe (-er) can impose the phrase constraint on the expected verb sign based on the analysis proposed in section 6.5.

3.3. Structural Feature Structure
The design of the feature STRUCT serves important structural purposes in the formalization of the CPSG95 interface between morphology and syntax. It is necessary to present the rationale of this design and the sort hierarchy of the type struct used in this feature.
The design of STRUCT originates from the binary structural feature structure in the original HPSG theory (Pollard and Sag 1987). However, in the CPSG95 definition, the type struct forms an elaborate sort hierarchy. It is divided into two types at the top level: syn_dtr and no_syn_dtr. A sub-type of no_syn_dtr is no_dtr. The CPSG95 lexicon encodes the feature [STRUCT no_dtr] for all single morphemes. Another sub-type of no_syn_dtr is affix (for units formed via affixation), which is further sub-typed into prefix and suffix, assigned by the Prefix PS Rule and the Suffix PS Rule. In syntax, syn_dtr includes sub-types like subj, comp and mod. Despite the hierarchical depth of the type, it is organized to follow the natural classification of the structural relation involved. The formal definition is given below.

(3-12.) Definition: sorted hierarchy for struct
struct: {syn_dtr, no_syn_dtr}
syn_dtr: {subj, comp, mod}
comp: {comp0_left, comp1_right, comp2_right}
mod: {mod_left, mod_right}
no_syn_dtr: {no_dtr, affix}
affix: {prefix, suffix}

In CPSG95, STRUCT is not a (head) feature which percolates up to the mother sign; its value is solely decided by the structure being built. Each PS rule, whether syntactic or morphological, assigns the value of the feature STRUCT for the mother sign, according to the nature of the combination. When morpheme daughters are combined into a mother sign that is a word, the value of STRUCT for the mother sign remains a sub-type of no_syn_dtr. But when syntactic rules are applied, the rules will assign to the mother sign a value which is a sub-type of syn_dtr, to show that the structure being built is a syntactic construction. The design of the feature structure STRUCT is motivated by the new requirement caused by introducing morphology into the general grammar of CPSG95. In HPSG, a simple binary value for the corresponding feature is sufficient to distinguish lexical signs from signs created via syntactic rules. But in CPSG95, as presented in section 3.2.1 before, productive derivation is also accommodated in the general grammar.
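The hierarchy in (3-12) is small enough to write down as a parent table, which makes the morphology/syntax distinction a simple subtype test. The sketch below is an assumption-laden illustration rather than the CPSG95 code: the names `STRUCT_PARENT`, `is_subtype` and `domain` are hypothetical, and the root is simply called struct.

```python
# Child -> parent table transcribed from definition (3-12).
STRUCT_PARENT = {
    "syn_dtr": "struct", "no_syn_dtr": "struct",
    "subj": "syn_dtr", "comp": "syn_dtr", "mod": "syn_dtr",
    "comp0_left": "comp", "comp1_right": "comp", "comp2_right": "comp",
    "mod_left": "mod", "mod_right": "mod",
    "no_dtr": "no_syn_dtr", "affix": "no_syn_dtr",
    "prefix": "affix", "suffix": "affix",
}


def is_subtype(sort, ancestor):
    """True if `sort` equals `ancestor` or lies below it in the hierarchy."""
    while sort is not None:
        if sort == ancestor:
            return True
        sort = STRUCT_PARENT.get(sort)
    return False


def domain(struct_value):
    """Classify a STRUCT value as belonging to morphology or to syntax."""
    if struct_value not in STRUCT_PARENT:
        raise ValueError(f"unknown struct sort: {struct_value}")
    return "syntax" if is_subtype(struct_value, "syn_dtr") else "morphology"


print(domain("no_dtr"))   # a single morpheme from the lexicon
print(domain("prefix"))   # a unit built by the Prefix PS Rule
print(domain("subj"))     # a unit built by the Subj PS Rule
```

A two-valued feature could not separate the second case from the third, which is the point the surrounding text makes about derivation being part of the general grammar.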
A simple distinction between a lexical sign and a syntactic sign cannot capture the difference between signs created via morphological rules and signs created via syntactic rules. This difference plays an essential role in formalizing the morpho-syntactic interface, as shown below.

The following examples demonstrate the structural representation through the design of the feature STRUCT. In the CPSG95 lexicon, single Chinese characters like the prefix ke- (-able) and the free morphemes du (read) and bao (newspaper) are all coded as [STRUCT no_dtr]. When the Prefix PS Rule combines the prefix ke- and the verb du into an adjective ke-du, the rule assigns [STRUCT prefix] to the newly built derivative. The structure remains in the domain of morphology, as the value prefix is a sub-type of no_syn_dtr. However, when this structure is further combined with a subject, say, bao (newspaper), by the syntactic Subj PS Rule, the resulting structure [bao [ke-du]] (‘Newspapers are readable’) is syntactic, having [STRUCT subj] assigned by the Subj PS Rule; in fact, this is a simple sentence. Similarly, the syntactic Comp1_right PS Rule can combine the transitive verb du (read) and the object bao (newspaper) and assign the value comp1_right to the feature STRUCT of the unit du bao (read newspapers). In general, when signs whose STRUCT value is a sub-type of no_syn_dtr combine into a unit whose STRUCT is assigned a sub-type of syn_dtr, this marks the jump from the domain of morphology to syntax. This is the way the interface of Chinese morphology and syntax is formalized in the present formalism. The use of this feature structure in the definition of Chinese word will be presented in Chapter IV. Further advantages and flexibility of the design of this structural feature structure and the expectation feature structures will be demonstrated in later chapters in presenting solutions to some long-standing problems at the morpho-syntactic interface.

3.4. Summary
The major design issues for the proposed mono-stratal Chinese grammar CPSG95 are addressed.
This provides a framework and means for formalizing the analysis of the linguistic problems at the morpho-syntactic interface. It has been shown that the design of the CPSG95 expectation structures enables configuration constraints to be imposed on the structure hierarchy defined by the grammar. This makes the keyword approach to Chinese subcategorization a feasible alternative to the list design based on the obliqueness hierarchy of subject and complements. Within this defined framework of CPSG95, the subsequent Chapter IV will be able to formulate the system-internal, but strictly formalized definition of Chinese word. Formal definitions such as @word and @phrase enable proper configurational constraints to be imposed on the expected signs when required. This lays a foundation for implementing the proposed solutions to the morpho-syntactic interface problems to be explored in the remaining chapters. ——————————————————————————— More precisely, it is not ‘word’ order, it is constituent order, or linear precedence (LP) constraint between constituents. L. Li (1986, 1990)’s definition on structural constituents does not involve word order. However, his proposed definition is not an operational one from the angle of natural language processing. He relies on the decoding of the semantic roles for the definitions of the proposed constituents like NP-agent (ming-shi), NP-patient (ming-shou), etc. Nevertheless, his proposal has been reported to produce good results in the field of Chinese language teaching. This seems to be understandable because the process of decoding semantic roles is naturally and subconsciously conducted in the mind of the language instructors/learners. Most linguists agree that Chinese has no inflectional morphology (e.g. Hockett 1958; Li and Thompson 1981; Zwicky 1987; Sun and Cole 1991). The few linguists who believe that Chinese has developed or is developing inflection morphology include Bauer (1988) and Dai (1993). 
Typical examples cited as Chinese inflection morphemes are aspect markers le, zhe, guo and the plural marker men . A note for the notation: uppercase is used for feature and lowercase, for type. Phonology and discourse are not yet included in the definition. The latter is a complicated area which requires further research before it can be properly integrated in the grammar analysis. The former is not necessary because the object for CPSG95 is Written Chinese. In the few cases where phonology affects structural analysis, e.g. some structural expectation needs to check the match of number of syllables, one can place such a constraint indirectly by checking the number of Chinese characters instead (as we know, a syllable roughly corresponds to a Chinese character or hanzi ). The macro constraint @np in (3-2) is defined to be and a call to another macro constraint @phrase to be defined shortly in Section 3.2.4. These expectation features defined for are a maximum set of possible expected daughters; any specific sign may only activate a subset of them, represented by non-null value. This is similar to viewing morphology as ‘the syntax of words’ (Selkirk 1982; Lieber 1992; Krieger 1994). It seems that at least affixation shares with syntax similar structural constraints on constituency and linear ordering in Chinese. The same type of mechanisms (PS rules, typed feature structure for expectation, etc) can be used to capture both Chinese affixation and syntax (see Chapter VI). More precisely, the decoding of possible ways of semantic composition is guaranteed. Syntactically ambiguous structures with the same constraints correspond to multiple ways of semantic compositionality. These are expressed as different entries in the lexicon and the link between these entries is via corresponding lexical rules, following the HPSG practice. (W. Li 1996) Borsley (1987) has proposed an HPSG framework where subject is posited as a distinct feature than other complements. 
Pollard and Sag (1994:345) point out that “the overwhelming weight of evidence favors Borsley’s view of this matter”. The only possible benefit of such an arrangement is that one can continue using the SUBCAT Principle for building complement structure via list cancellation.

It also includes idioms whose internal morphological structure is unknown or has no grammatical relevance.

The reader might have noticed that the assigned value is the same as the name of the PS rule which applies. This is because there is a correspondence between the type of structure being built and the PS rule building it. Thus, the feature STRUCT actually records the rule application information. For example, [STRUCT subj] reflects the fact that the Subj PS Rule is the most recently applied rule to the structure in point; a structure built via the Prefix PS Rule has [STRUCT prefix] in place; etc. This practice gives the extra benefit of ‘tracing’ which rules have been applied when debugging the grammar. If no rule has ever been applied to a sign, it must be a morpheme carrying [STRUCT no_dtr] from the lexicon.
PhD Thesis: Chapter I Introduction
liwei999 2016-8-25 02:50
1.0. Foreword This thesis addresses the issue of the Chinese morpho-syntactic interface. This study is motivated by the need for a solution to a series of long-standing problems at the interface. These problems pose challenges to an independent morphology system or a separate word segmenter as there is a need to bring in syntactic information in handling these problems. The key is to develop a Chinese grammar which is capable of representing sufficient information from both morphology and syntax. On the basis of the theory of Head-Driven Phrase Structure Grammar (Pollard and Sag 1987, 1994), the thesis will present the design of a Chinese grammar, named CPSG95 (for Chinese Phrase Structure Grammar ). The interface between morphology and syntax is defined system internally in CPSG95. For each problem, arguments will be presented for the linguistic analysis involved. A solution to the problem will then be formulated based on the analysis. The proposed solutions are formalized and implementable; most of the proposals have been tested in the implementation of CPSG95. In what follows, Section 1.1 reviews some important developments in the field of Chinese NLP (Natural Language Processing). This serves as the background for this study. Section 1.2 presents a series of long-standing problems related to the Chinese morpho-syntactic interface. These problems are the focus of this thesis. Section 1.3 introduces CPSG95 and sketches its morpho-syntactic interface by illustrating an example of the proposed morpho-syntactic analysis. 1.1. Background This section presents the background for the work on the interface between morphology and syntax in CPSG95. Major development on Chinese tokenization and parsing, the two areas which are related to this study, will be reviewed. 1.1.1. Principle of Maximum Tokenization and Critical Tokenization This section reviews the influential Theory of Critical Tokenization (Guo 1997a) and its implications. 
The point to be made is that the results of Guo’s study can help us to select the tokenization scheme used in the lexical lookup phase in order to create the basis for morpho-syntactic parsing. Guo (1997a,b,c) has conducted a comprehensive formal study on tokenization schemes in the framework of formal languages, including deterministic tokenization such as FT (Forward Maximum Tokenization) and BT (Backward Maximum Tokenization), and non-deterministic tokenization such as CT (Critical Tokenization), ST (Shortest Tokenization) and ET (Exhaustive Tokenization). In particular, Guo has focused on the study of the rich family of tokenization strategies following the general Principle of Maximum Tokenization, or “PMT”. Except for ET, all the tokenization schemes mentioned above are PMT-based. In terms of lexical lookup, PMT can be understood as a heuristic by which a longer match overrides all shorter matches. PMT has been widely adopted (e.g. Webster and Kit 1992; Guo 1997b) and is believed to be “the most powerful and commonly used disambiguation rule” (Chen and Liu 1992:104). Shortest Tokenization, or “ST”, first proposed by X. Wang (1989), is a non-deterministic tokenization scheme following the Principle of Maximum Tokenization. A segmented token string is shortest if it contains the minimum number of vocabulary words possible - “short” in the sense of the shortest word string length. Exhaustive Tokenization, or “ET”, does not follow PMT. As its name suggests, the ET set is the universe of all possible segmentations consisting of all candidate vocabulary words. The mathematical definition of ET is contained in Definition 4 for “the character string tokenization operation” in Guo (1997a). The most important concept in Guo’s theory is Critical Tokenization, or “CT”. Guo’s definition is based on the partially ordered set, or ‘poset’, theory in discrete mathematics (Kolman and Busby 1987). 
Guo has found that different segmentations can be linked by the cover relationship to form a poset. For example, abc|d and ab|cd both cover ab|c|d, but they do not cover each other. Critical tokenization is defined as the set of minimal elements, i.e. tokenizations which are not covered by other tokenizations, in the tokenization poset. Guo has given proof for a number of mathematical properties involving critical tokenization. The major ones are listed below.

(i) Every tokenization is a subtokenization of (i.e. covered by) a critical tokenization, but no critical tokenization has a true supertokenization;
(ii) The tokenization variations following the Principle of Maximum Tokenization proposed in the literature, such as FT, BT, FT+BT and ST, are all true sub-classes of CT.

Based on these properties, Guo concludes that CT is the precise mathematical description of the widely adopted Principle of Maximum Tokenization. Guo (1997c) further reports his experimental studies on relative merits of these tokenization schemes in terms of three quality indicators, namely, perplexity, precision and recall. The perplexity of a tokenization scheme gives the expected number of tokenized strings generated for average ambiguous fragments. The precision score is the percentage of correctly tokenized strings among all possible tokenized strings while the recall rate is the percentage of correctly tokenized strings generated by the system among all correctly tokenized strings. The main results are:

(i) Both FT and BT can achieve perfect unity perplexity but have the worst precision and recall;
(ii) ET achieves perfect recall but has the lowest precision and highest perplexity;
(iii) ST and CT are simple with good computational properties. Between the two, ST has lower perplexity but CT has better recall.
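The relationship between these schemes is easier to see on a toy input. The sketch below is not Guo's algorithms, only a naive illustration under stated assumptions: the function names and the toy vocabulary are hypothetical, the cover relation is taken to mean "has a proper subset of the other tokenization's word boundaries" (which reproduces the abc|d, ab|cd, ab|c|d example), and the greedy schemes fall back to a single character when no vocabulary word matches.

```python
def exhaustive(s, vocab):
    """ET: every segmentation of s into vocabulary words."""
    if not s:
        return [()]
    results = []
    for i in range(1, len(s) + 1):
        if s[:i] in vocab:
            results.extend((s[:i],) + rest for rest in exhaustive(s[i:], vocab))
    return results


def boundaries(tokens):
    """Internal cut positions of a tokenization, e.g. ('ab', 'cd') -> {2}."""
    cuts, pos = set(), 0
    for token in tokens[:-1]:
        pos += len(token)
        cuts.add(pos)
    return frozenset(cuts)


def covers(x, y):
    """x covers y if y can be obtained from x by adding cuts (assumed reading)."""
    return boundaries(x) < boundaries(y)


def critical(s, vocab):
    """CT: tokenizations in ET that no other tokenization covers."""
    et = exhaustive(s, vocab)
    return [t for t in et if not any(covers(other, t) for other in et)]


def shortest(s, vocab):
    """ST: tokenizations in ET with the minimum number of words."""
    et = exhaustive(s, vocab)
    fewest = min(len(t) for t in et)
    return [t for t in et if len(t) == fewest]


def forward_max(s, vocab):
    """FT: greedy longest match, scanning left to right."""
    tokens, i = [], 0
    while i < len(s):
        for j in range(len(s), i, -1):
            if s[i:j] in vocab or j == i + 1:  # single-character fallback
                tokens.append(s[i:j])
                i = j
                break
    return tuple(tokens)


def backward_max(s, vocab):
    """BT: greedy longest match, scanning right to left."""
    tokens, j = [], len(s)
    while j > 0:
        for i in range(0, j):
            if s[i:j] in vocab or i == j - 1:  # single-character fallback
                tokens.append(s[i:j])
                j = i
                break
    return tuple(reversed(tokens))


VOCAB = {"a", "b", "c", "d", "ab", "cd", "abc"}  # toy vocabulary (assumption)
print(len(exhaustive("abcd", VOCAB)))            # 5
print(sorted(critical("abcd", VOCAB)))           # [('ab', 'cd'), ('abc', 'd')]
print(forward_max("abcd", VOCAB), backward_max("abcd", VOCAB))
```

On this input ET yields five candidates, CT and ST both keep only ab|cd and abc|d, and FT and BT each commit to one of those two, which matches the ordering of perplexity described above. The recursive enumeration is exponential in the worst case and is meant for illustration only.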
Guo (1997c) concludes, “for applications with moderate performance requirement, ST is the choice; otherwise, CT is the solution.” In addition to the above theoretical and experimental study, Guo (1997b) also develops a series of optimized algorithms for the implementation of these generation schemes. The relevance and significance of Guo’s achievement to the research in this thesis lie in the following aspect. The research on Chinese morpho-syntactic interface is conducted with the goal of supporting Chinese morpho-syntactic parsing. The input to a Chinese morpho-syntactic parser comes directly from the lexical lookup of the input string based on some non-deterministic tokenization scheme (W. Li 1997, 2000; Wu and Jiang 1998). Guo’s research and algorithm development can help us to decide which tokenization schemes to use depending on the tradeoff between precision, recall and perplexity or the balance between reducing the search space and minimizing premature commitment. 1.1.2. Monotonicity Principle and Task-driven Segmentation This section reviews the recent development on Chinese analysis systems involving the interface between morphology and syntax. The research on the Chinese morpho-syntactic interface in this thesis echoes this new development in the field of Chinese NLP. In the last few years, projects have been proposed for implementing a Chinese analysis system which integrates word identification and parsing. Both rule-based systems and statistical models have been attempted with good results. Wu (1998) has addressed the drawbacks of the conventional practice on the development of Chinese word segmenters, in particular, the problem of premature commitment in handling segmentation ambiguity. In his A Position Statement on Chinese Segmentation, Wu proposed a general principle: Monotonicity Principle for segmentation : A valid basic segmentation unit (segment or token) is a substring that no processing stage after the segmenter needs to decompose. 
The rationale behind this principle is to prevent premature commitment and to avoid repetition of work between modules. In fact, traditional word segmenters are modules independent of subsequent applications (e.g. parsing). Due to the lack of means for accessing sufficient grammar knowledge, they suffer from premature commitment and repetition of work, hence violating this principle. Wu’s proposal of the monotonicity principle is a challenge to the Principle of Maximum Tokenization. These two principles are not always compatible. Due to the existence of hidden ambiguity (see 1.2.1), the PMT-based segmenters by definition are susceptible to premature commitment leading to “too-long segments”. If the target application is designed to solve the hidden ambiguity problem in the segments, “decomposition” of some segments is unavoidable. In line with the Monotonicity Principle, Wu (1998) proposes an alternative approach which he claims “eliminates the danger of premature commitment”, namely task-driven segmentation . Wu (1998) points out, “Task-driven segmentation is performed in tandem with the application (parsing, translating, named-entity labeling, etc.) rather than as a preprocessing stage. To optimize accuracy, modern systems make use of integrated statistically-based scores to make simultaneous decisions about segmentation and parsing/translation.” The HKUST parser, developed by Wu’s group, is such a statistical system employing the task-driven segmentation. As for rule-based systems, similar practice of integrating word identification and parsing has also been explored. W. Li (1997, 2000) proposed that the results of an ET-based lexical lookup directly feed the parser for the hanzi-based parsing . More concretely, morphological rules are designed to build word internal structure for productive morphology and non-productive morphology is lexicalized via entry enumeration. 
This approach is the background for conducting the research on Chinese morpho-syntactic interface for CPSG95 in this dissertation.

The Chinese parser on the platform of multilingual NLPWin developed by Microsoft Research also integrates word identification and parsing (Wu and Jiang 1998). They also use a hand-coded grammar for word identification as well as for sentential parsing. The unique part of this system is the use of a certain lexical constraint on ET in the lexical lookup phase. This effectively reduces the parsing search space as well as the number of syntactic trees produced by the parser, with minimal sacrifice in the recall of tokenization. This tokenization strategy provides a viable alternative to the PMT-based tokenization schemes like CT or ST in terms of the overall balance between precision, recall and perplexity.

The practice of simultaneous word identification and parsing in implementing a Chinese analysis system calls for the support of a grammar (or statistical model) which contains sufficient information from both morphology and syntax. The research on Chinese morpho-syntactic interface in this dissertation aims at providing this support.

1.2. Morpho-syntactic Interface Problems

This section presents a series of outstanding problems in Chinese NLP which are related to the morpho-syntactic interface. One major goal of this dissertation is to argue for the proposed analyses of the problems and to provide solutions to them based on the analyses.

Sun and Huang (1996) have reviewed numerous cases which challenge the existing word segmenters. As many of these cases call for an exchange of information between morphology and syntax, an appropriate solution can hardly be reached within the module of a separate word segmenter. Three major problems at issue are presented below.

1.2.1. Segmentation ambiguity

This section presents the long-standing problem in Chinese tokenization, i.e. the resolution of the segmentation ambiguity.
Within a separate word segmenter, resolving the segmentation ambiguity is a difficult, sometimes hopeless job. However, the majority of ambiguity can be resolved when a grammar is available.

Segmentation ambiguity has been the focus of extensive study in Chinese NLP for the last decade (e.g. Chen and Liu 1992; Liang 1987; Sproat, Shih, Gale and Chang 1996; Sun and Huang 1996; Guo 1997b). There are two types of segmentation ambiguities (Liang 1987; Guo 1997b): (i) overlapping ambiguity: e.g. da-xue | sheng-huo vs. da-xue-sheng | huo, as shown in (1-1) and (1-2); and (ii) hidden ambiguity: ge-ren vs. ge | ren, as shown in (1-3) and (1-4).

(1-1.) 大学生活很有趣
da-xue | sheng-huo | hen | you-qu
university | life | very | interesting
The university life is very interesting.

(1-2.) 大学生活不下去了
da-xue-sheng | huo | bu | xia-qu | le
university student | live | not | down | LEs
University students can no longer make a living.

(1-3.) 个人的力量
ge-ren | de | li-liang
individual | DE | power
the power of an individual

(1-4.) 三个人的力量
san | ge | ren | de | li-liang
three | CLA | person | DE | power
the power of three persons

These examples show that the resolution of segmentation ambiguity requires larger syntactic context and grammatical analysis. There will be further arguments and evidence in Chapter II (2.1) for the following conclusion: both types of segmentation ambiguity are structural by nature and require sentential analysis for the resolution. Without access to a grammar, no matter how sophisticated a tokenization algorithm is designed, a word segmenter is bound to face an upper bound for the precision of word identification. However, in an integrated system, word identification becomes a natural by-product of parsing (W. Li 1997, 2000; Wu and Jiang 1998). More precisely, the majority of ambiguity can be resolved automatically during morpho-syntactic parsing; the remaining ambiguity can be made explicit in the form of multiple syntactic trees.
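The two ambiguity types can be detected mechanically at the lexical lookup stage, before any grammar is consulted. The sketch below assumes a toy lexicon that covers only examples (1-2) and (1-4); the function names and the span representation are hypothetical and serve only to illustrate the definitions above.

```python
def word_spans(chars, lexicon, max_len=4):
    """All (start, end) spans of chars that are lexicon entries."""
    spans = []
    for i in range(len(chars)):
        for j in range(i + 1, min(len(chars), i + max_len) + 1):
            if chars[i:j] in lexicon:
                spans.append((i, j))
    return spans


def overlapping_pairs(spans):
    """Pairs of spans that share characters without one containing the other."""
    return [(a, b) for a in spans for b in spans
            if a[0] < b[0] < a[1] < b[1]]


def splittable(span, spans):
    """True if span can also be tiled by two or more shorter lexicon words."""
    start, end = span
    reachable = {start}
    for pos in range(start, end):
        if pos in reachable:
            for (i, j) in spans:
                if i == pos and j <= end and (i, j) != span:
                    reachable.add(j)
    return end in reachable


def hidden_spans(spans):
    """Multi-character words that also have a finer segmentation."""
    return [s for s in spans if s[1] - s[0] > 1 and splittable(s, spans)]


# Toy lexicon (an assumption): just enough entries for (1-2) and (1-4).
LEXICON = {"大学", "大学生", "生活", "活", "不", "下去", "了",
           "三", "个", "人", "个人", "的", "力量"}

s1 = "大学生活不下去了"
for a, b in overlapping_pairs(word_spans(s1, LEXICON)):
    print("overlapping:", s1[a[0]:a[1]], "/", s1[b[0]:b[1]])

s2 = "三个人的力量"
for i, j in hidden_spans(word_spans(s2, LEXICON)):
    print("hidden:", s2[i:j])
```

The lookup only exposes the competing readings (大学生 against 生活, and 个人 against 个 | 人). Which reading is right is not decided here; that choice is left to the parser.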
But in order to make this happen, the parser requires reliable support from a grammar which contains both morphology and syntax.

1.2.2. Productive Word Formation

Non-listable words created via productive morphology pose another challenge (Sun and Huang 1996). There are two major problems involved in this issue: (i) problem in identifying lexicon-unlisted words; (ii) problem of possible segmentation ambiguity.

One important method of productive word formation is derivation. For example, the derived word 可读性 ke-du-xing (-able-read-ness: readability) is created via morphology rules, informally formulated below.

(1-5.) derivation rules
ke + X (transitive verb) --> ke-X (adjective, semantics: X-able)
Y (adjective or verb) + xing --> Y-xing (abstract noun, semantics: Y-ness)

Rules like the above have to be incorporated properly in order to correctly identify such non-listable words. However, there has been little research in the literature on what formalism should be adopted for Chinese morphology and how it should be interfaced to syntax.

To make the case more complicated, ambiguity may also be involved in productive word formation. When the segmentation ambiguity is involved in word formation, there is always a danger of wrongly applying morphological rules. For example, 吃头 chi-tou (worth of eating) is a derived word (transitive verb + suffix tou); however, it can also be segmented as two separate tokens chi (eat) | tou (CLA), as shown in (1-6) and (1-7) below.

(1-6.) 这道菜没有吃头
zhe | dao | cai | mei-you | chi-tou
this | CLA | dish | not have | worth-of-eating
This dish is not worth eating.

(1-7.) 他饿得能吃头牛
ta | e | de | neng | chi | tou | niu
he | hungry | DE3 | can | eat | CLA | ox
He is so hungry that he can eat an ox.

To resolve this segmentation ambiguity, as indicated before in 1.2.1, the structural analysis of the complete sentences is required.
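The derivation rules in (1-5) and the chi-tou ambiguity in (1-6) and (1-7) can be sketched in a few lines of Python. This is only an illustration, not the CPSG95 implementation (which is written in ALE): the Sign class, the category labels, the toy lexicon and the category assigned to chi-tou are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sign:
    form: str  # hyphenated pinyin
    cat: str   # "Vt" transitive verb, "V" verb, "A" adjective, "N" noun, "CLA" classifier
    sem: str   # rough English gloss


# Toy lexicon (assumption): one reading per morpheme.
LEXICON = {
    "du": Sign("du", "Vt", "read"),
    "chi": Sign("chi", "Vt", "eat"),
    "tou": Sign("tou", "CLA", "classifier"),
}


def prefix_ke(x):
    """ke + X (transitive verb) --> ke-X (adjective, X-able)."""
    if x.cat == "Vt":
        return Sign("ke-" + x.form, "A", x.sem + "-able")
    return None


def suffix_xing(y):
    """Y (adjective or verb) + xing --> Y-xing (abstract noun, Y-ness)."""
    if y.cat in ("A", "V", "Vt"):
        return Sign(y.form + "-xing", "N", y.sem + "-ness")
    return None


def suffix_tou(x):
    """Transitive verb + suffix tou; the category "N" is an assumption."""
    if x.cat == "Vt":
        return Sign(x.form + "-tou", "N", "worth-of-" + x.sem + "ing")
    return None


def readings_of_chi_tou():
    """Both analyses of the string chi tou: one derived word, or two tokens."""
    chi, tou = LEXICON["chi"], LEXICON["tou"]
    return [[suffix_tou(chi)], [chi, tou]]


print(suffix_xing(prefix_ke(LEXICON["du"])))
for reading in readings_of_chi_tou():
    print(" | ".join(sign.form for sign in reading))
```

Morphology alone licenses both readings of chi tou. Nothing in these rules can tell (1-6) from (1-7); only the surrounding sentence structure can.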
An independent morphology system or a separate word segmenter cannot handle this problem without accessing syntactic knowledge.

1.2.3. Borderline Cases between Morphology and Syntax

It is widely acknowledged that there is a remarkable gray area between Chinese morphology and Chinese syntax (L. Li 1990; Sun and Huang 1996). Two typical cases are described below. The first is the phenomenon of Chinese separable verbs. The second case involves interfacing derivation and syntax.

Chinese separable verbs are usually in the form of V+N and V+V or V+A. These idiomatic combinations are long-standing problems at the interface between compounding and syntax in Chinese grammar (L. Wang 1955; Z. Lu 1957; Lü 1989; Lin 1983; Q. Li 1983; L. Li 1990; Shi 1992; Zhao and Zhang 1996). The separable verb 洗澡 xi zao (wash-bath: take a bath) is a typical example. Many native speakers regard xi zao as one word (verb), but the two morphemes are separable. In fact, xi+zao shares the syntactic behavior and the pattern variations with the syntactic transitive combination V+NP: not only can aspect markers appear between xi and zao, but this structure can be passivized and topicalized as well. The following is an example of topicalization (of long distance dependency) for xi zao.

(1-8.)
(a) 我认为他应该洗澡
wo ren-wei ta ying-gai xi zao.
I think he should wash-bath
I think that he should take a bath.

(b) 澡我认为他应该洗
zao wo ren-wei ta ying-gai xi.
bath I think he should wash
The bath I think that he should take.

Although xi zao behaves like a syntactic phrase, it is a vocabulary word in the lexicon due to its idiomatic nature. As a result, almost all word segmenters output xi-zao in (1-8a) as one word while treating the two signs in (1-8b) as two words. Thus the relationship between the separated use of the idiom and the non-separated use is lost.

The second case represents a considerable number of borderline cases often referred to as ‘quasi-affixes’.
These are morphemes like 前 qian (former, ex-) in words like 前夫 qian-fu (ex-husband), 前领导 qian-ling-dao (former boss) and -盲 mang (person who has little knowledge of) in words like 计算机盲 ji-suan-ji-mang (computer layman), 法盲 fa-mang (person who has no knowledge of laws).

It is observed that 'quasi-affixes' are structurally not different from other affixes. The major difference between 'quasi-affixes' and the few generally honored ('genuine') affixes like the nominalizer 性 -xing (-ness) lies mainly in the following aspect. The former retain some 'solid' meaning while the latter are more functionalized. Therefore, the key to this problem seems to lie in the appropriate way of coordinating the semantic contribution of the derived words using 'quasi-affixes' to the building of the semantics for the entire sentence. This is an area which has not received enough investigation in the field of Chinese NLP. While many word segmenters have included some type of derivational processing for a few typical affixes, few systems demonstrate where and how to handle these 'quasi-affixes'.

1.3. CPSG95: HPSG-style Chinese Grammar in ALE

To investigate the interaction between morphological and syntactic information, it is important to develop a Chinese grammar which incorporates morphology and syntax in the same formalism. This section gives a brief presentation on the design and background of CPSG95 (including lexicon).

1.3.1. Background and Overview of CPSG95

Shieber (1986) distinguishes two types of grammar formalism: (i) theory-oriented formalism; (ii) tool-oriented formalism. In general, a language-specific grammar turns to a theory-oriented formalism for its foundation and a tool-oriented formalism for its implementation. The work on CPSG95 is developed in the spirit of the theory-oriented formalism Head-driven Phrase Structure Grammar (HPSG, proposed by Pollard and Sag 1987).
The tool-oriented formalism used to implement CPSG95 is the Attribute Logic Engine (ALE, developed by Carpenter and Penn 1994).

The unique feature of CPSG95 is its incorporation of Chinese morphology in the HPSG framework. Like other HPSG grammars, CPSG95 is a heavily lexicalized unification grammar. It consists of two parts: a minimized general grammar and an information-enriched lexicon. The general grammar contains a small number of Phrase Structure (PS) rules, roughly corresponding to the HPSG schemata tuned to the Chinese language. The syntactic PS rules capture the subject-predicate structure, complement structure, modifier structure, conjunctive structure and long-distance dependency. The morphological PS rules cover morphological structures for productive word formation. In one version of CPSG95 (its source code is shown in APPENDIX I), there are nine PS rules: seven syntactic rules and two morphological rules.

In CPSG95, potential morphological structures and potential syntactic structures are both lexically encoded. In syntax, a word can expect (subcat-for or mod in HPSG terms) another sign to form a phrase. Likewise, in Chinese morphology, a morpheme can expect another sign to form a word.

One important modification of HPSG in designing CPSG95 is to use an atomic approach with separate features for each complement to replace the list design of obliqueness hierarchy among complements. The rationale and arguments for this modification are presented in Section 3.2.3 in Chapter III.

1.3.2. Illustration

The example shown in (1-9) demonstrates the morpho-syntactic analysis in CPSG95.

(1-9.) 这本书的可读性
zhe ben shu de ke du xing
this CLA book DE AF:-able read AF:-ness
this book’s readability
(Note: CLA for classifier; DE for particle de; AF for affix.)

Figure 1 illustrates the tree structure built by the morphological PS rules and the syntactic PS rules in CPSG95.

[Figure 1. Sample Tree Structure for CPSG95 Analysis]

As shown, the tree embodies both morphological analysis (the sub-tree for ke-du-xing) and syntactic analysis (the NP structure). The results of the morphological analysis (the category change from V to A and to N and the building of semantics, etc.) are readily accessible in building syntactic structures.

1.4. Organization of the Dissertation

The remainder of this dissertation is divided into six chapters.

Chapter II presents arguments for the need to involve syntactic analysis for a proper solution to the targeted morpho-syntactic problems. This establishes the foundation on which CPSG95 is based.

Chapter III presents the design of CPSG95. In particular, the expectation feature structures will be defined. They are used to encode the lexical expectation of both morphological and syntactic structures. This design provides the necessary means for formally defining the Chinese word and the interface of morphology, syntax and semantics.

Chapter IV is on defining the Chinese word. This is generally recognized as a basic issue in discussing the Chinese morpho-syntactic interface. The investigation leads to a formalization of wordhood and a coherent, system-internal definition of the work division between morphology and syntax.

Chapter V studies Chinese separable verbs. It discusses wordhood judgment for each type of separable verbs based on their distribution. The corresponding morphological or syntactic solutions will then be presented.

Chapter VI investigates some outstanding problems of Chinese derivation and its interface with syntax. It will be demonstrated that the general approach to Chinese derivation in CPSG95 works both for typical cases of derivation and the two special problems, namely 'quasi-affix' phenomena and zhe-affixation.

The last chapter, Chapter VII, concludes this dissertation.
In addition to a concise retrospect of what has been achieved, it also gives an account of the limitations of the present research and future research directions.

Finally, the three appendices give the source code of one version of the implemented CPSG95 and some tested results.

--------------------------------------------------

In line with the requirements by Chinese NLP, this thesis places emphasis on the analysis of productive morphology: phenomena which are listable in the lexicon are not the major concern. This is different from many previous works on Chinese morphology (e.g. Z. Lu 1957; Dai 1993) where the bulk of discussions is on unproductive morphemes (affixes or ‘bound stems’).

Ambiguity which remains after sentential parsing may be resolved by using further semantic, discourse or pragmatic knowledge, or ‘filters’.

In CPSG95 and other HPSG-style grammars, a ‘sign’ usually stands for the generalized notion of grammatical units such as morpheme, word, phrase, etc.

Researchers have looked at the incorporation of morphology of other natural languages in the HPSG framework (e.g. Type-based Derivation Morphology by Riehemann 1998). Arguments for the inclusion of morphological features in the definition of sign will be presented in detail in Chapter III.

Note that ‘phrase structure’ in terms like Phrase Structure Grammar (PSG) or Phrase Structure rules (PS rules) does not necessarily refer to structures of (syntactic) phrases. It stands for surface-based constituency structure, in contrast to, say, dependency structure in Dependency Grammar. In CPSG95, some productive morphological structures are also captured by PS rules.

Note that in this dissertation, the term expect is used as a more generalized notion than the terms subcat-for (subcategorize for) and mod (modify). ‘Expect’ is intended to be applied to morphology as well as to syntax.

There are differences in technical details between the proposed grammar in this dissertation and the implemented version. This is because any implemented version was tested at a given time while this thesis evolved over a long period of time. It is the author’s belief that readers (including those who want to follow the CPSG practice) benefit most when a version that was actually tested is given as it was.

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)
Overview of Natural Language Processing
Dr. Wei Li’s English Blog on NLP
PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)
liwei999 2016-8-24 22:15
The Morpho-syntactic Interface in a Chinese Phrase Structure Grammar

by Wei Li
B.A., Anqing Normal College, China, 1982
M.A., The Graduate School of Chinese Academy of Social Sciences, China, 1986

Thesis submitted in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY in the Department of Linguistics

Wei Li 2000
SIMON FRASER UNIVERSITY
November 2000

All rights reserved. This work may not be reproduced in whole or in part, by photocopy or other means, without permission of the author.

Approval

Name: Wei Li
Degree: Ph.D.
Title of thesis: THE MORPHO-SYNTACTIC INTERFACE IN A CHINESE PHRASE STRUCTURE GRAMMAR
(Approved January 12, 2001)

Abstract

This dissertation examines issues related to the morpho-syntactic interface in Chinese, specifically those issues related to the following long-standing problems in Chinese Natural Language Processing (NLP): (i) disambiguation in Chinese word identification; (ii) Chinese productive word formation; (iii) borderline phenomena between morphology and syntax, such as Chinese separable verbs and ‘quasi-affixation’.

All these problems pose challenges to an independent Chinese morphology system or separate word segmenter. It is argued that there is a need to bring in the syntactic analysis in handling these problems.

To enable syntactic analysis in addition to morphological analysis in an integrated system, it is necessary to develop a Chinese grammar that is capable of representing sufficient information from both morphology and syntax. The dissertation presents the design of such a Chinese phrase structure grammar, named CPSG95 (for Chinese Phrase Structure Grammar). The unique feature of CPSG95 is its incorporation of Chinese morphology in the framework of Head-Driven Phrase Structure Grammar. The interface between morphology and syntax is then defined system-internally in CPSG95 and uniformly represented using the underlying grammar formalism used by the Attribute Logic Engine. For each problem, arguments are presented for the proposed analysis to capture the linguistic generality; morphological or syntactic solutions are formulated based on the analysis. This provides a secure approach to solving problems at the interface of Chinese morphology and syntax.

Dedication

To my daughter Tian Tian whose babbling accompanied and inspired the writing of this work

And to my most devoted friend Dr. Jianjun Wang whose help and advice encouraged me to complete this work

Acknowledgments

First and foremost, I feel profoundly grateful to Dr. Paul McFetridge, my senior supervisor. It was his support that brought me to SFU and the beautiful city Vancouver, which changed my life. Over the years, he introduced me into the HPSG study, and provided me with his own parser for testing grammar writing. His mentorship and guidance have influenced my research fundamentally. He critiqued my research experiments and thesis writing in many facets, from the development of key ideas, selection of topics, methodology, implementation details to writing and presentation style. I feel guilty for not being able to promptly understand and follow his guidance at times.

I would like to thank Dr. Fred Popowich, my second advisor. He has given me both general academic guidance on research methodology and numerous specific comments for the thesis revision which have helped shape the present version of the thesis as it is today.

I am also grateful to Dr. Nancy Hedberg from whom I have taken four graduate courses, including the course of HPSG. I have not only learned a lot from her lectures in the classroom, but have benefited greatly from our numerous discussions on general linguistic topics as well as issues in Chinese linguistics.
Thanks to Davide Turkato, my friend and colleague in the Natural Language Lab. He is always there whenever I need help. We have also shared many happy hours in our common circle of Esperanto club in Vancouver. I would like to thank Dr. Ping Xue, Dr. Zita McRobbie, Dr. Thomas Perry, Dr. Donna Gerdts and Dr. Richard DeArmond for the courses I have taken from them. These courses were an important part of my linguistic training at SFU. For various help and encouragement I have got during my graduate studies, I should also thank all the faculty, staff and colleagues of the linguistics department and the Natural Language Lab of SFU, in particular, Rita, Sheilagh, Dr. Ross Saunders, Dr. Wyn Roberts, Dr. Murray Munro and Dr. Olivier Laurens. I am particularly thankful to Carol Jackson, our Graduate Secretary for her years of help. She is remarkable, very caring and responsive. I would like to extend my thanks to all my fellow students and friends in the linguistics department of SFU, in particular, Dr. Trude Heift, Dr. Janine Toole, Susan Russel, Dr. Baoning Fu, Zhongying Lu, Dr. Shuicai Zhou, Jianyi Yu, Jean Wang, Cliff Burgess and Kyoung-Ja Lee. We have had so much fun together and have had many interesting discussions, both academic and non-academic. Today, most of us have graduated, some are professors or professionals in different universities or institutions. Our linguistics department is not big, but it is such a nice department where faculty, staff and the graduate student body form a very sociable community. I have truly enjoyed my graduate life in this department. Beyond SFU, I would like to thank Dr. De-Kang Lin for the insightful discussion on the possibilities of integrated Chinese parsing back in 1995. Thanks to Gerald Penn, one of the authors of ALE, for providing the powerful tool ALE and for giving me instructions on modifying some functions in ALE to accommodate some needs for Chinese parsing during my experiment in implementing a Chinese grammar. 
I am also grateful to Dr. Rohini Srihari, my current industry supervisor, for giving me an opportunity to manage NLP projects for real world applications at Cymfony. This industrial experience has helped me to broaden my NLP knowledge, especially in the area of statistical NLP and the area of shallow parsing using Finite State Transducers. Thanks to Carrie Pine and Walter Gadz from US Air Force Research Laboratory who have been project managers for the Small Business Innovation Research (SBIR) efforts ‘A Domain Independent Event Extraction Toolkit’ (Phase II), ‘Flexible Information Extraction Learning Algorithm’ (Phase I and Phase II) and ‘Intermediate-Level Event Extraction for Temporal and Spatial Analysis and Visualization’ (Phase I and Phase II). I have been Principal Investigator for these government funded efforts at Cymfony Inc. and have had frequent and extremely beneficial contact with them. With these projects, I have had an opportunity to apply the skills and knowledge I have acquired from my Ph.D. program at SFU. My professional training at SFU was made possible by a grant that Dr. Paul McFetridge and Dr. Nick Cercone applied for. The work reported in this thesis was supported in the later stage by a Science Council of B.C. (CANADA) G.R.E.A.T. award. I am grateful to both my academic advisor Paul McFetridge and my industry advisor John Grayson, CEO of TCC Communications Corporation of Victoria, for assisting me in obtaining this prestigious grant. I would not have been able to start and continue my research career without many previous helps I got from various sources, agencies and people in the last 15 years, for which I owe a big prayer of thanks. I owe a great deal to Prof. Zhuo Liu and Prof. Yongquan Liu for leading me into the NLP area and supervising my master program in computational linguistics at CASS (Chinese Academy of Social Sciences, 1983-1986). Their guidance in both research ideas and implementation details benefited me for life. 
I am grateful to my former colleagues Prof. Aiping Fu, Prof. Zhenghui Xiong and Prof. Linding Li at the Institute of Linguistics of CASS for many insightful discussions on issues involving NLP and Chinese grammars. Thanks also go to Ms. Fang Yang and the machine translation team at Gaoli Software Co. in Beijing for the very constructive and fruitful collaborative research and development work. Our collaboration ultimately resulted in the commercialization of the GLMT English-to-Chinese machine translation system. Thanks to Dr. Klaus Schubert, Dr. Dan Maxwell and Dr. Victor Sadler from BSO (Utrecht, The Netherlands) for giving me the project of writing a computational grammar of Chinese dependency syntax in 1988. They gave me a lot of encouragement and guidance in the course of writing the grammar. This work enabled me to study Chinese grammar in a formal and systematic way. I have carried over this formal study of Chinese grammar to the work reported in this thesis. I am also thankful to the Education Ministry of China, Sir Pao Foundation and British Council for providing me with the prestigious Sino-British Friendship Scholarship. This scholarship enabled me to study computational linguistics at Centre for Computational Linguistics, UMIST, England (1992). During my stay in UMIST, I had opportunities to attend lectures given by Prof. Jun-ichi Tsujii, Prof. Harold Somers and Dr. Paul Bennett. I feel grateful to all of them for their guidance in and beyond the classroom. In particular, I must thank Dr. Paul Bennett for his supervision, help and care. I would like to thank Prof. Dong Zhen Dong and Dr. Lua Kim Teng for inviting and sponsoring me for a presentation at ICCC'96 in Singapore. They are the leading researchers in the area of Chinese NLP. I have benefited greatly from the academic contact and communication with them. 
Thanks to anonymous reviewers of the international journals of Communications of COLIPS, Journal of Chinese Information Processing, World Science and Technology and grkg/Humankybernetik. Thanks also to reviewers of the International Conference on Chinese Computing (ICCC’96), North American Conference on Chinese Linguistics (NACCL‑9), Applied Natural Language Conference (ANLP’2000), Text Retrieval Conference (TREC-8), Machine Translation SUMMIT II, Conference of the Pacific Association for Computational Linguistics (PACLING-II) and North West Linguistics Conferences (NWLC). These journals and conferences have provided a forum for publishing the NLP-related research work I and my colleagues have undertaken at different times of my research career. Thanks to Dr. Jin Guo who has developed his influential theory of tokenization. I have benefited enormously from exchanging ideas with him on tokenization and Chinese NLP. In terms of research methodology and personal advice, I owe a great deal to my most devoted friend Dr. Jianjun Wang, Associate Professor at California State University, Bakersfield, and Fellow of the National Center for Education Statistics in US. Although in a totally different discipline, there has never been an obstacle for him to understand the basic problem I was facing and to offer me professional advice. At times when I was puzzled and confused, his guidance often helped me to quickly sort things out. Without his advice and encouragement, I would not have been able to complete this thesis. Finally, I wish to thank my family for their support. All my family members, including my parents, brothers and sisters in China, have been so supportive and understanding. In particular, my father has been encouraging me all the time. When I went through hardships in my pursuit, he shared the same burden; when I had some achievement, he was as happy as I was. I am especially grateful to my wife, Chunxi. 
Without her love, understanding and support, it is impossible for me to complete this thesis. I wish I had done a better job to have kept her less worried and frustrated.

I should thank my four-year-old daughter, Tian Tian. I feel sorry for not being able to spend more time with her. What has supported me all these years is the idea that some day she will understand that as a first-generation immigrant, her dad has managed to overcome various challenges in order to create a better environment for her to grow.

Approval
Abstract
Dedication
Acknowledgments

Chapter I Introduction
1.0. Foreword
1.1. Background
1.1.1. Principle of Maximum Tokenization and Critical Tokenization
1.1.2. Monotonicity Principle and Task-driven Segmentation
1.2. Morpho-syntactic Interface Problems
1.2.1. Segmentation ambiguity
1.2.2. Productive Word Formation
1.2.3. Borderline Cases between Morphology and Syntax
1.3. CPSG95: HPSG-style Chinese Grammar in ALE
1.3.1. Background and Overview of CPSG95
1.3.2. Illustration
1.4. Organization of the Dissertation

Chapter II Role of Grammar
2.0. Introduction
2.1. Segmentation Ambiguity and Syntax
2.1.1. Resolution of Hidden Ambiguity
2.1.2. Resolution of Overlapping Ambiguity
2.2. Productive Word Formation and Syntax
2.3. Borderline Cases and Grammar
2.4. Knowledge beyond Syntax
2.5. Summary

Chapter III Design of CPSG95
3.0. Introduction
3.1. Mono-stratal Design of Sign
3.2. Expectation Feature Structures
3.2.1. Morphological Expectation
3.2.2. Syntactic Expectation
3.2.3. Chinese Subcategorization
3.2.4. Configurational Constraint
3.3. Structural Feature Structure
3.4. Summary

Chapter IV Defining the Chinese Word
4.0. Introduction
4.1. Two Notions of Word
4.2. Judgment Methods
4.3. Formal Representation of Word
4.4. Summary

Chapter V Chinese Separable Verbs
5.0. Introduction
5.1. Verb-object Idioms: V+N I
5.2. Verb-object Idioms: V+N II
5.3. Verb-modifier Idioms: V+A/V
5.4. Summary

Chapter VI Morpho-syntactic Interface Involving Derivation
6.0. Introduction
6.1. General Approach to Derivation
6.2. Prefixation
6.3. Suffixation
6.4. Quasi-affixes
6.5. Suffix zhe (-er)
6.6. Summary

Chapter VII Concluding Remarks
7.0. Summary
7.1. Contributions
7.2. Limitation
7.3. Final Notes

BIBLIOGRAPHY
APPENDIX I Source Code of Implemented CPSG95
APPENDIX II Source Code of Implemented CPSG95 Lexicon
APPENDIX III Tested Results in Three Experiments Using CPSG95
Methodological Foundations of HPSG: The Nature of Linguistic Theory and Its Research Methods
saif 2016-2-12 06:02
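HPSG, that is, Head-driven Phrase Structure Grammar, is in essence an inheritance and development of the ideas in Syntactic Structures, and it is also the second major rethinking of Chomsky's theory after the Generative Semantics school. Its outcome was to abandon completely the notion of "transformation" and the derivational theory built on it and, while accepting the basic premise that linguistic research is the study of linguistic knowledge, to concentrate on making linguistic theory formal, precise and rigorous. If Syntactic Structures only raised, in theory, the question of linguistics becoming a more exact science, HPSG is a great practical undertaking in that spirit. Interested readers can consult my reading notes on Syntactic Structures (《我的句法结构读书笔记》).

HPSG fully inherits this basic principle of Syntactic Structures, which later mainstream theory abandoned, and develops it further, making it theoretical and systematic. Having adopted tools and methods from other formal grammar theories, mathematical linguistics, mathematics, logic, theoretical computer science and artificial intelligence, it builds for linguistic theory a new foundational framework comparable to that of any natural science, giving the framework the first requirement of scientific research: falsifiability.

The idea that linguistic theory should become formal and precise, and that part of it, especially core areas such as syntax and semantics, should become part of the formal sciences, was first put forward in Syntactic Structures:

"Precisely constructed models for linguistic structure can play an important role, both negative and positive, in the process of discovery itself. By pushing a precise but inadequate formulation to an unacceptable conclusion, we can often expose the exact source of this inadequacy and, consequently, gain a deeper understanding of the linguistic data. More positively, a formalized theory may automatically provide solutions for many problems other than those for which it was explicitly designed. Obscure and intuition-bound notions can neither lead to absurd conclusions nor provide new and correct ones, and hence they fail to be useful in two important respects. I think that some of those linguists who have questioned the value of precise and technical development of linguistic theory have failed to recognize the productive potential in the method of rigorously stating a proposed theory and applying it strictly to linguistic material with no attempt to avoid unacceptable conclusions by ad hoc adjustments or loose formulation."

This passage can be called the general program, the basic line, of generative grammar. Unfortunately, Chomsky himself did not follow it strictly. Beginning with the Standard Theory, the theory's focus shifted from a theory of linguistic structure to the study of the internal grammar in the human mind and of linguistic competence, and linguistic research was portrayed as a "cognitive revolution". From the late 1970s through the 1980s, as more scholars joined research on generative grammar and its depth and breadth grew, more fundamental questions were raised. First, what is a transformation? What is generative capacity? What justifies these notions? As a theory that calls itself an "exact science", where is generative grammar's own theoretical foundation? With the success in computer science of formal language theory, to which Chomsky had contributed, some researchers began to study transformational grammar by the same methods, trying to find for it a mathematical model like the one phrase structure grammar has. In the Chomsky hierarchy, finite-state grammars and context-free grammars have both been studied thoroughly in formal language theory, yet nobody had given a corresponding mathematical model for Chomsky's new model of linguistic structure, transformational grammar. By rights this was Chomsky's own task, just as with his earlier work on the first two models. Unfortunately his interests had by then shifted: he was more concerned with the relation between the human brain and language, elevated generative grammar to an external model of how humans acquire, produce and understand language, and regarded transformational generative grammar in this sense as a "new awakening" of the seventeenth-century Port-Royal school of rationalist grammar.

The foundations of transformational grammar were thus no longer on Chomsky's agenda. Only in the 1970s did anyone begin to study the question seriously: Stanley Peters and R. W. Ritchie. Their method was the same as Chomsky's: to find a mathematical model for the new grammar model. In their 1973 paper "On the Generative Power of Transformational Grammars", they gave the first formal proof that every recursively enumerable set of strings is the language generated by some transformational grammar.

From then on transformational grammar fell from its pedestal as a scientific theory. In the 1980s a group of independent-minded generative grammarians began to abandon it, first of all by returning to phrase structure grammar, because its mathematical model, context-free grammar (CFG), had been mature in theoretical computer science since the 1950s. Algol 60, a programming language still widely known in the 1970s and 1980s, was the first to be designed on the basis of a CFG, and it was a great success; from then on CFG was firmly established in theoretical computer science. Under these circumstances, the call within linguistics to return to phrase structure grammar naturally resonated with many scholars. The best known result is Generalized Phrase Structure Grammar (GPSG), proposed by Gazdar, Klein, Pullum and Sag. Its most salient feature is that, while upholding the basic principles of generative grammar, it abandons the notion of transformation altogether and reaffirms the manifesto of Syntactic Structures: the task of linguistics is to build a general theory of the linguistic structure of natural languages and, through that theory, to obtain explicit descriptions of particular languages. The basic framework of such a theory should be falsifiable, and its method is that of model theory in mathematical logic: first define the set of objects, then recursively define the operations on that set and the well-formedness of expressions, and finally assign a semantic interpretation to the well-formed expressions.

The anti-mainstream generative school represented by GPSG declared explicitly for the first time that the essence of generative grammar lies in the goals set by Syntactic Structures: formalization, explicitness and precision, moving linguistics toward the natural sciences. Generative grammar should therefore return to Syntactic Structures in spirit, to phrase structure grammar in its theoretical framework, and to Montague grammar in its methodology. Conversely, GPSG held that Chomsky's GB theory of the time did not count as "generative grammar" as the GPSG authors defined it, on the grounds that: there are little signs of any commitment to the explicit specification of grammars or theoretical principles in this genre of linguistics.

HPSG, which continues the GPSG tradition, fully accepts this view. From the start it made precision, explicitness and rigor the first requirement of linguistic theory, and between its first proposal in 1987 and the completion of its systematic construction in 1994 it set out a full account of the basic nature of linguistic theory and of scientific methodology.

1. Like other natural sciences, physics in particular, linguistic theory must rigorously define its empirical domain, its objects of study and the corresponding methods. Specifically:

1) Fix the phenomenon to be studied, which yields the empirical domain, and populate that domain with objects of study.
2) Use mathematics (especially the notion of structure from modern algebra) to model the objects in the domain.
3) Obtain primitive data about the model and study the model to form a theory. If the language that represents the theory is a natural language, the theory is informal; if the theory is represented in a rigorous formal language, it is formal. The most extreme formalization is an axiom system stated in first-order predicate logic.
4) The role of the theory is to make predictions about new data from the empirical domain and to check whether the results match expectations.

In short, the basic elements of any scientific theory are: objects of study (empirical domain), model, and formalized theory. Once a model has been built, the theory no longer studies the phenomena directly but the modeling structures; put differently, the modeling structures interpret and elucidate the theory.

In Head-Driven Phrase Structure Grammar (Pollard & Sag 1994, hereafter PS-94), the authors illustrate these principles with an example from celestial mechanics, bodies moving on orbits in the solar system. In their diagram, the empirical domain is the motion of a many-body (N-body) system; the model obtained by modelling is a Hamiltonian vector field; the objects of study are the point masses of the system and their motion attributes such as mass, speed and direction. The theory consists of (1) the differential equations of the many-body system, (2) axiomatic set theory, and (3) a first-order formal language. The diagram shows the overall method: fix the empirical domain and the objects of study, build a theory, and model the objects with it, using the model-theoretic method of mathematical logic, which yields a vector space; from the initial study of the objects in the domain comes the ability to predict new facts. The theory is concerned with the model, yet it yields predictions about the regularities of the objects themselves. PS-94 concludes that, methodologically, linguistic theory should have the same research paradigm as this many-body mechanics.

With this analogy, HPSG proposes that the methodology of science in general, linguistic theory included, rests on three elements: theory, model, and empirical domain. For language specifically, the empirical domain of HPSG is limited to syntax and semantics, and within it the linguistic objects are particular utterance fragments. The result of modelling, which corresponds to the vector in many-body mechanics, is called the sign in HPSG. The sign is the most basic unit for studying all linguistic objects. The tool for describing signs is the feature structure, whose foundation is feature logic, a new logical system developed in the 1980s and 1990s. On this description, the HPSG research paradigm has a structure and relations parallel to those of the celestial-mechanics paradigm above. In this picture, what describes linguistic objects directly is HPSG's grammatical theory proper, recursion theory and feature logic. The grammatical theory comprises a number of universal principles and rule schemata; recursion theory is used to build multi-level complex feature structures; and feature logic is the tool for describing signs. For now we are concerned with methodology and research tools rather than these details.

2. Like other sciences, HPSG has another methodological principle, called ontological parsimony. Its basic idea is simplicity: a theory should account for as many facts as possible (facts that appear in the model) with as few concepts as possible, and should avoid positing concepts that cannot be observed or have no prospect of empirical confirmation. This is not an absolute prohibition: if such a concept must be posited to state the theory, it should be used as sparingly as possible. Turning to generative grammar, the early theory posited a large number of phrase structure rules and transformational rules, each of which relies on intermediate concepts. Under Chomsky normal form, if a derivation uses the two rule types

(1) A → BC
(2) A → α

then the resources the derivation consumes are:

1. A string of length n needs n applications of A → α, and therefore n grammar variables.
2. n variables need n-1 applications of A → BC (starting from S, each application adds one variable, n-1 times).
3. From 1 and 2, a string of length n in a grammar in Chomsky normal form needs 2n-1 derivation steps.

(This account of Chomsky normal form draws on Wikipedia.)

Here each "grammar variable" is in effect an intermediate concept used to derive the string, that is, a concept not observable within the model, which plainly conflicts with ontological parsimony.

3. Linguistic theory should separate the theory of grammar from the theory of language processing, treating them as related but entirely different theoretical systems. The grammar supplied by linguistic theory should be neutral with respect to language processing (process-neutral). Why? Because language processing is in essence language use, and grammar, which exists as linguistic knowledge, is only one of the factors in it. Language processing includes utterance production (the speaker), comprehension (the hearer), translation (paraphrasers and translators), and manipulation (language games, rhetoric and so on), but its basic aspects are production and comprehension. If a grammar were processing-oriented, favoring either production or comprehension would be uneconomical, so grammar design should not take the processes of production and comprehension into account. In other words, a grammar only describes statically the linguistic information of linguistic objects; this information should be highly descriptive and independent of the constituent structure of the particular utterances in the domain. A grammar should be a model of linguistic knowledge, a high-level abstraction over the structure of concrete utterances. This closely resembles the object-oriented programming paradigm: faced with endlessly varied real-world problems, the paradigm posits only a few basic concepts such as class, object and interface, together with definitions of the relations among them: inheritance, association, containment, injection and so on. The result of such formalization and abstraction is that one programming language, plus design patterns and software engineering methods, can handle problems as different as telecommunications and finance, satellite control and student grade management.

Another reason for separating grammar from processing is, as said above, that grammar should be a static description of the linguistic information of linguistic objects, whereas language processing is the dynamic operation of language use, in which many kinds of information are interwoven: linguistic knowledge, encyclopedic knowledge, contextual knowledge, knowledge of topic and focus, and so on. If grammar were bound to processing, one would face a hard choice: should the grammar describe production or comprehension? The two processes run in exactly opposite orders. And from the linguist's standpoint, far less is understood about language processing mechanisms than about linguistic structure.

For these reasons the basic elements of a grammar design must be process-neutral (unbiased). The best known way to achieve process neutrality is to design the grammar as a declarative system of constraints. "Declarative" means only stating that a linguistic object has a certain property, while ignoring entirely what role that property plays in syntactic operations. The three mainstays of modern web programming, HTML, CSS and JavaScript, are quite similar to such a theory of language: HTML is a language that defines the basic elements of a web application, corresponding to our grammar; CSS specifies the outward presentation of those elements, corresponding to the rule schemata in our grammar; and JavaScript defines the behavior of those elements, like the language processing theory just discussed.

Early generative grammar failed precisely to sort out these relations (as did early computer programming, of course). Transformational grammar, for example, is a typical waterfall-style processing theory: one structure is mapped to another by transformational operations, and the transformational rules must be applied in a strictly specified order. Moreover, the model is entirely biased toward production; to handle comprehension with it, the transformational rules would have to be applied in reverse, which is essentially impossible.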
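The 2n-1 count cited in the parsimony discussion above can be checked mechanically. The following is a minimal Python sketch, not part of the original post. It assumes a derivation tree written as nested pairs, where a string leaf stands for one A → α step and a pair stands for one A → BC step; the function names are illustrative only.

```python
def derivation_steps(tree):
    """Count rule applications in a Chomsky-normal-form derivation tree.

    A leaf (a string) is one A -> a step; a pair (left, right) is one
    A -> B C step plus the steps needed for its two subtrees.
    """
    if isinstance(tree, str):
        return 1
    left, right = tree
    return 1 + derivation_steps(left) + derivation_steps(right)


def right_branching(tokens):
    """Build one possible binary tree over the tokens (any shape would do)."""
    if len(tokens) == 1:
        return tokens[0]
    return (tokens[0], right_branching(tokens[1:]))


# The count depends only on the string length, not on the tree shape.
for n in range(1, 7):
    tree = right_branching(["a"] * n)
    assert derivation_steps(tree) == 2 * n - 1
```

A balanced tree such as `(("a", "b"), ("c", "d"))` gives the same count for n = 4, which is why the number of intermediate variables grows with every string regardless of how it is bracketed.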
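Declarative grammar will therefore provide a solid and flexible framework of linguistic knowledge for later theories of language processing. HPSG, at least in its 1994 version, does not directly provide a basic framework for language processing, but the theory as a whole must satisfy the decidability condition on linguistic theory. That is, whether a given candidate expression of a particular language can be assigned a well-formed structure by the grammar (whether it is grammatical) must be decidable by an algorithm, in other words by a finite number of defined steps. Since a strict algorithm works only on precisely defined linguistic objects, whether a theory can exhibit decidability depends entirely on the formal language chosen; HPSG's formal language, for example, is the language of feature logic. It is this decidability condition that separates the linguistic knowledge part of linguistic theory from the language processing part.

HPSG defines linguistic knowledge as a recursively definable system of linguistic types. The exact meaning of this definition will be discussed in detail later; here just one point: linguistic knowledge is not about a particular linguistic event or token but about the types abstracted from individual events and tokens. We make no comment here on the philosophical status of linguistic types. Historically the debate has divided into two camps, rationalism and realism, and it centers on whether linguistic types are mental objects or extra-mental objects. Saussure and Chomsky belong to the former camp; Bloomfield, Katz and Barwise to the latter.

References
Head-Driven Phrase Structure Grammar. C. Pollard, I. Sag. 1994.
Information-Based Syntax and Semantics, Volume 1. C. Pollard, I. Sag. 1987.
An Informal Introduction to HPSG. Wolf Paprotté.
Wikipedia, 《乔姆斯基范式》.
Wikipedia, "Chomsky normal form".
(End)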
From the Archives: Chinese HPSG Research from My PhD Years
liwei999 2015-11-2 17:09
[Author's note] During my PhD I waded into the muddy waters of the unification grammar HPSG (Head-driven Phrase Structure Grammar). The approach was all the rage at the time, and unification does look beautiful, but in the end it never took off, for a good reason. I dug this paper out of my old files; consider it a footprint.

W. Li. 1996. Interaction of Syntax and Semantics in Parsing Chinese Transitive Patterns. In Proceedings of International Chinese Computing Conference (ICCC'96). Singapore.

Keywords: Chinese processing, transitive pattern, syntax, semantics, lexical rule, HPSG

Abstract

This paper addresses the problem of parsing Chinese transitive verb patterns (including the BA construction and the BEI construction) and handling the related phenomena of semantic deviation (i.e. the violation of the semantic constraint). We designed a syntax-semantics combined model of Chinese grammar in the framework of Head-driven Phrase Structure Grammar. Lexical rules are formulated to handle both the transitive patterns which allow for semantic deviation and the patterns which disallow it. The lexical rules ensure the effective interaction between the syntactic constraint and the semantic constraint in analysis. The contribution of our research can be summarized as: (1) the insight on the interaction of syntax and semantics in analysis; (2) a proposed lexical rule approach to semantic deviation based on (1); (3) the application of (2) to the study of the Chinese transitive patterns; (4) the implementation of (3) in a unification based Chinese HPSG prototype.

Interaction of syntax and semantics in parsing Chinese transitive verb patterns

1. Background

When Chomsky proposed his Syntactic Structures in the fifties, he seemed to indicate that syntax should be addressed independently of semantics. As a convincing example, he presented a famous sentence:

1) Colorless green ideas sleep furiously.

Weird as it sounds, the grammaticality of this sentence is intuitively acknowledged: (1) it follows the English syntax; (2) it can be interpreted. In fact, there is only one possible interpretation, solely decided by its syntactic structure.
In other words, without the semantic interference, our linguistic knowledge about the English syntax is sufficient to assign roles to each constituent to produce a reading although the reading does not seem to make sense. However, things are not always this simple. Compare the following Chinese sentences of the same form:

2a) dianxin wo chi le.
Dim-Sum I eat LE.
The Dim Sum I have eaten.
Note: LE is a particle for perfect aspect.

2b) wo dianxin chi le.
I have eaten the Dim Sum.

Who eats what? There is no formal way but to resort to the semantic constraint imposed by the notion eat to reach the correct interpretation. Of course, if we want to maintain the purity of syntax, it could be argued that syntax will only render possible interpretations and not the interpretation. It is up to other components (semantic filter and/or other filters) of grammar to decide which interpretation holds in a certain context or discourse. The power of syntax lies in the ability to identify structural ambiguities and to render possible corresponding interpretations. We call this type of linguistic design a syntax-before-semantics model.

While this is one way to organize a grammar, we found it unsatisfactory for two reasons. First, it does not seem to simulate the linguistic process of human comprehension closely. For human listeners, there are no ambiguities involved in sentences 2a) and 2b). Secondly, there is considerable cost on processing efficiency in terms of computer implementation. This efficiency problem can be very serious in the analysis of languages like Chinese with virtually no inflection.

Head-driven Phrase Structure Grammar (HPSG) assumes a lexicalist approach to linguistic analysis and advocates an integrated model of syntax and the other components of grammar. It serves as a desirable framework for the integration of the semantic constraint in establishing syntactic structures and interpretations.
Therefore, we proposed to enforce the semantic constraint that animate being eats food directly in the lexical entry chi (eat): chi (eat) requires an animate NP subject and a food NP object. This correctly addresses the who-eats-what problem for sentences like 2a) and 2b). In fact, this type of semantic constraint (selection restriction) has been widely used for disambiguation in NLP systems. The problem is that the constraint should not always be enforced. In the practice of communication, deviation from the constraint is common, and deviation is often applied deliberately for rhetorical effect:

3) xiang chi yueliang, ni gou de3 zhao me?
want eat moon, you reach DE3 -able ME?
Wanting to eat the moon, but can you reach it?
Note: DE3 is a particle, introducing a postverbal adjunct of result or capability. ME is a sentence final particle for yes-no question.

4) dajia dou chi shehui zhuyi, neng bu qiong me?
people all eat social -ism, can not poor ME?
Everyone is eating socialism, can it not be poor?

yueliang (moon) is not food, of course. It is still some physical object, though. But in 4), shehui zhuyi (socialism) is a purely abstract notion. If a parser enforces the rigid semantic constraint, there are many such sentences that will be rejected without getting a chance to be interpreted. The fact is, we do have interpretations for 3) and 4). Hence an adequate grammar should be able to accommodate them.

To capture such deviation, Wilks came up with his Preference Semantics. A sophisticated mechanism is designed to calculate the semantic weight for each possible interpretation, i.e. how much it deviates from the preference semantic constraint. The final choice will be given to the interpretation with the most semantic weight in total. His preference model simulates the process of how humans comprehend language more closely than most previous approaches. The problem with this design is the serious computational complexity involved in the model.
In order to calculate the semantic weight, the preference semantic constraint is loosened step by step. Each possible substructure has to be re-tried with each step of loosening. It may well lead to combinatorial explosion.

What we are proposing here is to look at semantic deviation in the light of the interaction of the syntactic constraint and the semantic constraint. In concrete terms, the loosening of the semantic constraint is conditioned by syntactic patterns. Syntactic pattern is defined as the representation of an argument structure in surface form. A pattern consists of 2 parts: a structure's syntactic constraint (in terms of the syntactic categories and configuration, word order, function words and/or inflections) and its interpretation (role assignment). For example, for Chinese transitive structure, NP V NP: SVO is one pattern, NP NP V: SOV is another pattern, and NP [ba NP] V: SOV (the BA construction) is still another. The expressive power of a language is indicated by the variety of patterns used in that language. Our design will account for some semantic deviation or rhetorical phenomena seen in everyday Chinese without the overhead of computational complexities. We will focus on Chinese transitive verb patterns for illustration.

2. Chinese transitive patterns

Assuming three notional signs wo (I), chi (eat) and dianxin (Dim Sum), there are maximally 6 possible combinations in surface word order, out of which 3 are grammatical in Chinese (see footnote 2).

5a) wo chi le dianxin. SVO
5b) wo dianxin chi le. SOV
5c) dianxin wo chi le. OSV

SVO is the canonical word order for Chinese transitive structure. When a string of signs matches the order NP V NP, the semantic constraint has to yield to syntax for interpretation.

6) daodi shi ni zai du shu ne, haishi shu zai du ni ne?
on-earth be you ZAI read book NE, or book ZAI read you NE
Are you reading the book, or is the book reading you, anyway?
Note: ZAI is a particle for continuous aspect.
NE is a sentence final particle for or-question.

As in the English equivalent, the interpretation of 6) can only be SVO, no matter how contradictory it might be to our common sense. In other words, in the form NP V NP, syntax plays the decisive role.

In contrast, to interpret the form NP NP V as SOV in 2b), the semantic constraint is critical. Without the enforcement of the semantic constraint, the interpretation of SOV does not hold. In fact, this SOV pattern (NP1 NP2 V: SOV) has been regarded as ungrammatical in a Case Theory account for Chinese transitive structure in the framework of GB. According to that analysis, something similar to this pattern constitutes the D-Structure for the transitive pattern, and Chinese is an underlying SOV language (called the SOV Hypothesis: see the survey in Gao 1993). In the surface structure, NP2 is without case on the assumption that V assigns its CASE only to the right. One has to either insert the case-marker ba to assign CASE to it (the BA construction) or move it to the right of V to get its CASE (the SVO pattern). This analysis suffers from not being able to account for the grammaticality of sentences like 2b). However, by distinguishing the deep pattern SOV from the 2 surface patterns (the SVO and the BA construction), the theory has the merit of alerting us that the SOV pattern seems to be syntactically problematic (crippled, so to speak). This is an insightful point, but it goes one step too far in totally rejecting the SOV pattern in surface structure. If we modify this idea, we can claim that SOV is a syntactically unstable pattern, and that SOV tends to (not must) transform to the SVO or the BA construction unless it is reinforced by semantic coherence (i.e. the enforcement of the semantic constraint). This argument in the light of syntax-semantics interaction is better supported by the Chinese data.
In essence, our account is close to this reformulated argument, but in our theory, we do not assume a deep structure and transformation. All patterns are surface constructions. If no sentences can match a construction, it is not considered as a pattern by our definition. This type of unstable pattern which depends on the semantic constraint is not limited to the transitive phenomena. For example, the Chinese NP predicate is also a semantics dependent pattern. Compare:

7a) zhe zhang zhuozi san tiao tui.
this Cl. table(furniture) three Cl. leg
This table is three-legged.
Note: Cl for classifier.

7b) * zhe zhang ditu san tiao tui.
this Cl. map(non-furniture) three Cl. leg

There is clearly a semantic constraint of the NP predicate on its subject: it should be furniture (or animate). Without this semantic agreement, Chinese NP is normally not capable of functioning as a predicate, as shown in 7b).

Footnote 2. The other combinations are:
5d1) * dianxin chi le wo. OVS
5d2) dianxin chi le wo. The Dim Sum ate me.
Note: It is OK with the 5d2) reading in the pattern NP V NP: SVO.
5e1) * chi le wo dianxin. VSO
5e2) chi le wo dianxin. (Somebody) ate my Dim Sum.
Note: It is OK with the 5e2) reading in the pattern V [NP1 NP2]: VO where NP1 modifies NP2.
5f1) * chi le dianxin wo. VOS
5f2) chi le dianxin, wo. Eaten the Dim Sum, I have.
Note: It is OK in Spoken Chinese, with a short pause before wo, in a pattern like V NP, NP: VOS.

Between semantics dependent and semantics independent patterns, we may have partially dependent patterns. For example, in NP NP V: OSV, it seems that the semantic constraint on the initial object is less important than the semantic constraint on the subject.

8) shitou wo ye xiang chi, kexi yao bu dong.
stone(non-food) I(animate) also want eat, pity chew not -able
Even stones I also want to eat, but it's such a pity that I am not able to chew them.

If the constraint on the object matches well, is the subject allowed to be semantically deviant?

9) ? dianxin zhuozi chi le.
Dim-Sum(food) table(non-animate) eat LE.

These are marginal cases: a grammar may choose to be more tolerant and accept them, or more restrictive and reject them.

Unlike SOV, but similar to its English counterpart, OSV is one type of Chinese topic construction, and the relationship between the initial O and V is one of long distance dependency.

10a) dianxin wo xiangxin ni yiwei Lisi chi le.
Dim-Sum I believe you think Lisi eat LE
The Dim Sum I believe you think that Lisi ate.

10b) * Lisi wo xiangxin ni yiwei dianxin chi le.

10b) will not be accepted in our model because (1) it cannot be interpreted as OSV since it violates the semantic constraint on S: dianxin is not animate; (2) it cannot be interpreted as SOV either, since it violates the configurational constraint: SOV is simply not a long distance pattern. In fact, NP NP V: SOV is such a restricted pattern in Chinese that it not only excludes any long distance dependency but even disallows some adjuncts. Compare 11a) in the OSV pattern and 11b) and 11c) in the SOV pattern:

11a) dianxin wo jinjinyouwei de2 chi le.
Dim-Sum I with-relish DE2 eat LE
The Dim Sum I ate with relish.
Note: DE2 is a particle introducing a preverbal adjunct of manner.

11b) * wo dianxin jinjinyouwei de2 chi le.
11c) * wo jinjinyouwei de2 dianxin chi le.

There is another pattern of the linear order SOV, the notorious Chinese BA construction. ba is usually regarded as a preposition which introduces a preverbal object for transitive verbs.

12a) wo ba dianxin jinjinyouwei de2 chi le.
I BA Dim-Sum with-relish DE2 eat LE
I ate the Dim Sum with relish.

12b) wo jinjinyouwei de2 ba dianxin chi le.
With relish, I ate the Dim Sum.

12c) dianxin ba wo jinjinyouwei de2 chi le.
The Dim Sum ate me with relish.

12d) dianxin jinjinyouwei de2 ba wo chi le.
With relish, the Dim Sum ate me.

For the OSV order, there is another so-called BEI construction. The BEI construction is usually regarded as an explicit passive pattern in Chinese.
NP [bei NP] V: OSV

13a) dianxin bei wo chi le.
Dim-Sum BEI I eat LE
The Dim Sum was eaten by me.

13b) wo bei dianxin chi le.
I was eaten by the Dim Sum.

The BEI construction and the BA construction are both semantics independent. In fact, any pattern resorting to the means of function words in Chinese seems to be sufficiently independent of the semantic constraint.

To conclude, semantic deviation often occurs in some more independent patterns, as seen in 5d2), 6), 8), 12c), 12d), 13b). Close study reveals that different patterns result in different reliance on the semantic constraint, as summarized in the following table. It should be emphasized that this observation constitutes the rationale behind our approach.

3. Formulation of lexical rules

Based on the above observation, we have designed a syntax-semantics combined model. In this model, we take a lexical rule approach to Chinese patterns and the related problem of semantic deviation. A lexical rule takes as its input a lexical entry which satisfies its condition and generates another entry. Lexical rules are usually used to cover lexical redundancy between related patterns. The design of lexical rules is preferred by many grammarians over the more conventional use of syntactic transformation, especially for lexicalist theories. Our general design is as follows, still using chi (eat) for illustration:

(1) Syntactically, chi (eat) as a transitive verb subcategorizes for a left NP as its subject and a right NP as its object.
(2) Semantically, the corresponding notion eat expects an entity of category animate as its logical subject and an entity of category food as its logical object. Therefore the common sense (knowledge) that animate being eats food is represented.
(3) The interaction of syntax and semantics is implemented by lexical rules. The lexical rules embody the linguistic generalizations about the transitive patterns. They will decide to enforce or waive the semantic constraint based on different patterns.
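The three-part design above can be sketched in a few lines of Python. This is only an illustration, not the paper's ALE implementation: the dictionary layout, the toy noun categories in `NOUN_SEM`, and the function names are all assumptions made here. It shows one entry carrying both a syntactic and a semantic expectation, with the SVO pattern waiving the semantic constraint and the SOV pattern enforcing it.

```python
# Toy semantic categories for a few nouns (illustrative assumptions).
NOUN_SEM = {"wo": "animate", "ni": "animate", "dianxin": "food", "shu": "object"}

# Entry for chi (eat): syntax expects two NPs, semantics expects animate + food.
CHI = {
    "form": "chi",
    "subcat": ("NP", "NP"),
    "knowledge": ("animate", "food"),
}


def meets_knowledge(entry, subject, obj):
    """True if both fillers match the entry's semantic expectation."""
    want_subject, want_object = entry["knowledge"]
    return NOUN_SEM.get(subject) == want_subject and NOUN_SEM.get(obj) == want_object


def svo(entry, np1, np2):
    """NP1 V NP2: word order alone assigns the roles; semantics is waived."""
    return {"pattern": "SVO", "subject": np1, "object": np2}


def sov(entry, np1, np2):
    """NP1 NP2 V: accepted as SOV only when the semantic constraint holds."""
    if meets_knowledge(entry, np1, np2):
        return {"pattern": "SOV", "subject": np1, "object": np2}
    return None


print(svo(CHI, "dianxin", "wo"))   # deviant reading is still assigned (cf. 5d2)
print(sov(CHI, "wo", "dianxin"))   # SOV reading licensed by semantics (cf. 2b)
print(sov(CHI, "dianxin", "wo"))   # None: SOV does not hold for this order
```

In this sketch a rejected SOV reading simply returns `None`; a fuller model would go on to try the OSV pattern for the same string.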
As seen, syntax only stipulates the requirement of two NPs as complements for chi and does not care about the NPs' semantic constraint. Semantics sets its own expectation of animate entity and food entity as arguments for eat and does not care what syntactic forms these entities assume on the surface. It is up to lexical rules to coordinate the two. In our model, the information in (1) and (2) is encoded in the corresponding lexical entry and the lexical rules in (3) will then be applied to expand the lexicon before parsing begins. Driven by the expanded lexicon, analysis is implemented by a lexicalist parser to build the interpretation structure for the input sentence. Following this design, there will be sufficient interaction between syntax and semantics as desired while syntax still remains to be a self-contained component from semantics in the lexicon. More importantly, this design does not add any computational complexities to parsing because in order to handle different patterns, the similar lexical rules are also required even for a pure syntax model. Before we proceed to formulate lexical rules for transitive patterns, we should make sure what a transitive pattern is. As we defined before, a pattern consists of 2 parts: a structure's syntactic constraint and the corresponding interpretation. Word order is important constraint for Chinese syntax. In addition to word order, we have categories and function words (preposition, particle, etc.). As for interpretation, transitive structure involves 3 elements: V (predicate) and its arguments S (logical subject) and O (logical object). There is a further factor to take into account: Chinese complements are often optional. In many cases, subject and/or object can be omitted either because they can be recovered in the discourse or they are unknown. We call those patterns elliptical patterns (with some complement(s) omitted), in contrast to full patterns. 
With these in mind, we can define 10 patterns for Chinese transitive structure: 5 full patterns and 5 elliptical patterns. We now investigate these transitive patterns one by one and try to informally formulate the corresponding lexical rules to capture them. Please note that the basic input condition is the same for all the lexical rules. This is because they share the same argument structure, the transitive structure.

Lexical rule 1:
V ((NP1, NP2), (constr1, constr2)) --> NP1 V NP2: SVO

The above notation for the lexical rule should be quite obvious. The input of the rule is a transitive verb which subcategorizes for two NPs, NP1 and NP2, and whose corresponding notion expects two arguments of constr1 and constr2. NP is syntactic category, and constr is semantic category (human, animate, food, etc.). The output pattern is in a defined word order SVO and waives the semantic constraint.

Lexical rule 2:
V ((NP1, NP2), (constr1, constr2)) --> NP1 NP2 V: SOV

Please note that the semantic constraint is enforced for this SOV pattern. Since this pattern shares the form NP NP V with the OSV pattern, it would be interesting to see what happens if a transitive verb has the same semantic constraint on both its subject and object. For example, qingjiao (consult) expects a human subject and a human object.

14) ta ni qingjiao guo me?
he(human) you(human) consult GUO ME
Him, have you ever consulted?
Note: GUO is a particle for experience aspect.

15) ni ta qingjiao guo me?
You, has he ever consulted?

In both cases, the interpretation is OSV instead of SOV. Therefore, we need to reformulate Lexical rule 2 to exclude the case when the subject constraint is the same as the object constraint.

Lexical rule 2' (refined version):
V ((NP1, NP2), (constr1, constr2), (constr1 not = constr2)) --> NP1 NP2 V: SOV

Lexical rule 3:
V ((NP1, NP2), (constr1, constr2)) --> NP1 [ba NP2] V: SOV

This is the typical BA construction. But not every transitive verb can assume the BA pattern. In fact, ba is one of a set of prepositions to introduce the logical object.
There are other more idiosyncratic prepositions (xiang, dao, dui, etc.) required by different verbs to do the same job.

16a) ni qingjiao guo ta me?
you consult GUO he ME
Have you ever consulted him?

16b) ni xiang ta qingjiao guo me?
you XIANG he consult GUO ME
Have you ever consulted him?

16c) * ni ba ta qingjiao guo me?
you BA he consult GUO ME

17a) ta qu guo Beijing.
he go-to GUO Beijing
He has been to Beijing.

17b) ta dao Beijing qu guo.
he DAO Beijing go-to GUO
He has been to Beijing.

17c) * ta ba Beijing qu guo.
he BA Beijing go-to GUO

18a) ta hen titie zhangfu.
she very tenderly-care-for husband
She cares for her husband very tenderly.

18b) ta dui zhangfu hen titie.
she DUI husband very tenderly-care-for
She cares for her husband very tenderly.

18c) * ta ba zhangfu hen titie.
she BA husband very tenderly-care-for

This originates from different theta-roles assumed by different verb notions on their object argument: patient, theme, destination, to name only a few. These theta-roles are a further classification of the more general semantic role logical object. We can rely on the subcategorization property of the verb for the choice of the preposition literal (the so-called valency preposition). With the valency information in place, we now reformulate Lexical rule 3 to make it more general:

Lexical rule 3' (refined version):
V ((NP1, NP2), (constr1, constr2), (valency_preposition=P), (P not = null)) --> NP1 [P NP2] V: SOV

Lexical rule 4:
V ((NP1, NP2), (constr1, constr2)) --> NP2 ... NP1 V: OSV

This is a topic pattern of long distance dependency. It is up to different formalisms to provide different approaches to long distance phenomena. In our present implementation, NP2 is placed in a feature called BIND to indicate the nature of long distance dependency. One phrase structure rule, the Topic Rule, is designed to use this information and handle the unification of the long distance complement properly.

Following the topic pattern, the passive BEI construction is formulated in Lexical rule 5.
Lexical rule 5:
V ((NP1, NP2), (constr1, constr2)) --> NP2 [bei NP1] V: OSV

We now turn to elliptical patterns.

Lexical rule 6:
V ((NP1, NP2), (constr1, constr2)) --> V NP2: VO

19) chi guo jiaozi me?
eat GUO dumpling ME
Have (you) ever eaten dumpling?

Lexical rule 7:
V ((NP1, NP2), (constr1, constr2)) --> NP1 V: SV

21) ji chi le.
chicken1(animate) eat LE
The chicken has eaten (it).

Like its English counterpart, ji (chicken) has two senses: (1) chicken1 as animate; (2) chicken2 as food. We code this difference in two lexical entries. Only the first entry matches the semantic constraint on the subject in the pattern and reaches the above SV interpretation in 21). Interestingly enough, the same sentence will get another parse with a different interpretation OV in 23) because the second entry also satisfies the semantic constraint on the object in the OV pattern in Lexical rule 8.

22) ni qingjiao guo me?
you consult GUO ME
Have you consulted (someone)?

22) indicates that the SV interpretation is preferred over the OV interpretation when the semantic constraint on the subject and the semantic constraint on the object happen to be the same. Hence the added condition in Lexical rule 8.

Lexical rule 8:
V ((NP1, NP2), (constr1, constr2), (constr1 not = constr2)) --> NP2 V: OV

23) ji chi le.
chicken2(food) eat LE
The chicken has been eaten.

Lexical rule 9:
V ((NP1, NP2), (constr1, constr2)) --> NP2 [bei] V: OV

24) dianxin bei chi le.
Dim-Sum BEI eat LE
The Dim Sum has been eaten.

Lexical rule 10:
V ((NP1, NP2), (constr1, constr2)) --> V: V

(Have you) eaten (it)?

4. Implementation

We begin with a discussion of some major feature structures in HPSG related to handling the transitive patterns. Then, we will show how our proposal works and discuss some related implementation issues.

HPSG is a highly lexicalist theory. Most information is housed in the lexicon. The general grammar is kept to a minimum: only a few phrase structure rules (called ID Schemata) associated with a couple of principles. The data structure is the typed feature structure.
The necessary part for a typed feature structure is the type information. A simple feature structure contains only the type information, but a complex feature structure can introduce a set of feature/value pairs in addition to the type information. In a feature/value pair, the value is itself a feature structure (simple or complex).

The following is a sample implementation of the lexical entry chi for our Chinese HPSG grammar using the ALE formalism. Note: uppercase notation is used for features.

Leaving the notational details aside, what this roughly says is: (1) for the semantic constraint, the arguments of the notion eat are an animate entity and a food entity; (2) for the syntactic constraint, the complements of the verb chi are 2 NPs: one on the left and the other on the right; (3) the interpretation of the structure is a transitive predicate with a subject and an object. The three corresponding features are: (1) KNOWLEDGE; (2) SUBCAT; (3) CONTENT.

KNOWLEDGE stores some of our common sense by capturing the internal relation between concepts. Such common sense knowledge is represented in linguistic ways, i.e. it is represented as a semantic expectation feature, which parallels the syntactic expectation feature SUBCAT. KNOWLEDGE defines the semantic constraint on the expected arguments no matter what syntactic forms the arguments will take. In contrast, SUBCAT only defines the syntactic constraint on the expected complements. The syntactic constraint includes word order (LEFT feature), syntactic category (CATEGORY feature) and configurational information (LEX feature). Finally, the CONTENT feature assigns the roles SUBJECT and OBJECT for the represented structure.

A more important issue is the interaction of the three feature structures. Among the three features, only KNOWLEDGE is our add-on. The relationship between SUBCAT and CONTENT has been established in all HPSG versions: SUBCAT resorts to CONTENT for interpretation.
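To make the three features concrete, here is a minimal Python sketch of how an entry like chi could be laid out as nested feature/value pairs, together with a bare-bones unification routine. It is an assumption-laden illustration, not the ALE entry from the paper: the feature names KNOWLEDGE, SUBCAT, CONTENT, LEFT, CATEGORY, LEX, SUBJECT and OBJECT come from the text, while ARG1, ARG2, RELATION and every value shown are invented for the example. The sketch also ignores types and structure sharing, both of which a real typed feature structure system provides.

```python
# Hypothetical layout of the entry chi; values are placeholders for illustration.
CHI = {
    "KNOWLEDGE": {"ARG1": "animate", "ARG2": "food"},      # semantic expectation
    "SUBCAT": [                                             # syntactic expectation
        {"CATEGORY": "n", "LEFT": "plus", "LEX": "minus"},   # NP on the left
        {"CATEGORY": "n", "LEFT": "minus", "LEX": "minus"},  # NP on the right
    ],
    "CONTENT": {"RELATION": "eat", "SUBJECT": "arg1", "OBJECT": "arg2"},
}


def unify(a, b):
    """Unify two untyped feature structures (nested dicts with atomic leaves).

    Returns the merged structure, or None when two atomic values clash.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for feature, value in b.items():
            if feature in merged:
                sub = unify(merged[feature], value)
                if sub is None:
                    return None
                merged[feature] = sub
            else:
                merged[feature] = value
        return merged
    return a if a == b else None


left_complement = CHI["SUBCAT"][0]
print(unify(left_complement, {"CATEGORY": "n"}))  # compatible: structure is kept
print(unify(left_complement, {"CATEGORY": "p"}))  # clash on CATEGORY: None
```

Using `None` as the failure value keeps the sketch short; it works here only because no feature in the example takes `None` as a legitimate value.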
The interaction between SUBCAT and CONTENT corresponds to our definition of pattern. Everything works as long as the syntactic constraint alone can decide the interpretation. When the semantic constraint (in KNOWLEDGE) has to be involved in the interpretation process, we need a way to access this information. In unification-based theories, information flow is realized by unification (i.e. structure sharing, which is represented by co-indexing feature values). In general, we have two ways to ensure structure sharing in the lexicon: the values are either directly co-indexed in the lexical entries, or co-indexed by lexical rules. The former is unconditional, and the latter is conditional. As argued before, we cannot directly enforce the semantic constraint for every transitive pattern in Chinese, for otherwise our grammar would not allow for any semantic deviation. We are left with lexical rules, which we have informally formulated in Section 3 and implemented in the ALE formalism.

CATEGORY is another major feature of a sign. The CATEGORY feature in our implementation includes functional categories, which can specify a functional literal (function word) as their value. Function words belong to closed categories. Therefore, they can be classified by enumeration of literals. Like word order, function words are an important formal means of syntactic constraint in Chinese. Grammars for other languages also resort to some functional literals for constraint. In most HPSG grammars for English, for example, a preposition literal is specified in a feature called P_FORM. There are two problems involved there. First, at the representation level, there is redundancy: P_FORM:x --> CATEGORY:p (where x is not null). In other words, there is a feature dependency between P_FORM and CATEGORY which is not captured in the formalism. Second, if P_FORM is designed to stipulate a preposition literal, we will ultimately need to add features like CL_FORM for classifier specification, CO_FORM for conjunction specification, etc.
In fact, for each functional category, literal specification may be required for constraint in a non-toy grammar. That will make the feature system of the grammar too cumbersome.

These problems are solved in our grammar implementation in ALE. One significant mechanism in ALE is its type inheritance and appropriateness specifications for feature structures. (A similar design is found in the software paradigm of Object Oriented Programming.) Thanks to ALE, we can now use literals (ba, xiang, dao, dui, etc.) as well as major categories (n, v, a, p, etc.) to define the CATEGORY feature. In fact, any intermediate level of subclassification between these two extremes, major categories and literals, can be represented in CATEGORY just as handily. Together they constitute a type hierarchy of CATEGORY. The same mechanism can also be applied to semantic categories (human, animate, food, etc.) to capture thesaurus inferences like human --> animate. This makes our knowledge representation much more powerful than in formalisms without this mechanism. We will address this issue in depth in another paper, “Typology for syntactic category and semantic category in Chinese grammar”.

In the following, we give a brief description of how our grammar works. The grammar consists of several phrase structure rules and a lexicon with lexical entries and lexical rules. First, ALE compiles the grammar into a Prolog parser. During this process (at compile time), lexical rules are applied to lexical entries. In the case of transitive patterns, this means that one entry of chi will evolve into 10 entries. Please note that it is this expanded lexicon that is used for parsing (at run time).

At the level of implementation, we do not need to presuppose an abstract transitive structure as the input of the lexical rules and generate 10 new entries from it for each transitive verb. What is needed is to take one pattern as the basic pattern for the transitive structure and derive the other patterns from it.
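The compile-time expansion and the semantic type hierarchy can be illustrated with a small Python sketch, which is not the ALE grammar itself. It covers only the two elliptical readings of an NP V string (SV, and OV with the added condition constr1 not = constr2) and replays examples 21) to 23). The tree shape in `PARENT`, the names `Verb`, `Entry`, `expand` and `readings`, and the human/human constraints assumed for qingjiao are illustrative assumptions; the constraints on chi (animate, food) and the two senses of ji follow the text.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple

# Toy semantic type hierarchy (child -> parent). The type names come from
# the text; the exact tree is an assumption.
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "animate": "entity",
    "food": "entity",
    "human": "animate",
}


def subsumes(general: str, specific: str) -> bool:
    """True if `specific` equals `general` or inherits from it."""
    node: Optional[str] = specific
    while node is not None:
        if node == general:
            return True
        node = PARENT[node]
    return False


class Verb(NamedTuple):
    word: str
    subj_constr: str  # constr1: semantic constraint on NP1
    obj_constr: str   # constr2: semantic constraint on NP2


class Entry(NamedTuple):
    word: str
    reading: str      # role assigned to the single overt NP: "SV" or "OV"
    constr: str       # semantic constraint enforced on that NP


def expand(verb: Verb) -> List[Entry]:
    """Apply two elliptical-pattern rules once, before any parsing."""
    entries = [Entry(verb.word, "SV", verb.subj_constr)]
    # OV rule, with the added condition constr1 not = constr2.
    if verb.subj_constr != verb.obj_constr:
        entries.append(Entry(verb.word, "OV", verb.obj_constr))
    return entries


def readings(noun_senses: List[Tuple[str, str]],
             lexicon: List[Entry]) -> List[Tuple[str, str]]:
    """Return the (noun sense, reading) pairs licensed for an 'NP V' string."""
    return [(sense, entry.reading)
            for entry in lexicon
            for sense, sem_type in noun_senses
            if subsumes(entry.constr, sem_type)]


CHI = Verb("chi", "animate", "food")
QINGJIAO = Verb("qingjiao", "human", "human")  # constraints assumed
JI = [("chicken1", "animate"), ("chicken2", "food")]
NI = [("you", "human")]

print(readings(JI, expand(CHI)))       # ji chi le: two parses
print(readings(NI, expand(QINGJIAO)))  # ni qingjiao guo me: SV only
```

Because `expand` runs once before parsing, the parser only ever sees the expanded entries, mirroring the split between compile time and run time described above. The `subsumes` check is also what lets a human subject such as ni satisfy the animate constraint of chi.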
In fact, only 4 lexical rules are needed to derive the other 4 full patterns from 1 basic full pattern. Elliptical patterns can be handled more elegantly by means other than lexical rules. [3]

The basic pattern constitutes the common condition for the lexical rules. Although in theory any one of the 5 full patterns can be seen as the basic pattern, the choice is not arbitrary. The pattern we chose is the valency preposition pattern (the BA-type construction) NP1 V: SOV (see Lexical rule 3'). [4] This is justified as follows. The valency preposition P (ba, xiang, dao, dui, etc.) is idiosyncratically associated with the individual verb. Deriving a more general pattern from a specific pattern is easier than the other way round; for example, NP1 V: SOV --> NP1 V NP2: SVO is easier than NP1 V NP2: SVO --> NP1 V: SOV. This is because we can then directly code the valency preposition under CATEGORY in the SUBCAT feature and do not have to design a specific feature to store this valency preposition.

The ultimate aim of natural language analysis is to reach an interpretation, i.e. to assign roles to the constituents. An old question is how syntax (form) and semantics (meaning) interact in this interpretation process. More specifically, which is the more important factor in Chinese analysis, the syntactic constraint or the semantic constraint? For the linguistic data we have investigated, it seems that sometimes syntax plays the decisive role and at other times semantics has the final say. The essential question is how to handle the interface between syntax and semantics adequately.

In our proposal, the syntactic constraint is seen as the more fundamental factor. It serves as the frame of reference for the semantic constraint. The involvement of the semantic constraint seems to be most naturally conditioned by syntactic patterns. In order to ensure their effective interaction, we accommodate syntax and semantics in one model.
The model is based on syntax and resorts to semantic information only when necessary. In concrete terms, the system selectively enforces or waives the semantic constraint, depending on the syntactic pattern.

It should be noted that other factors are involved in reaching a correct interpretation. For example, in order to recover the omitted complements in elliptical patterns, information from discourse and pragmatics may be vital. We leave this for future research.

[3] The conventional configurational approach is based on the assumption that complements are obligatory and should be saturated. If saturation of complements were not taken as a precondition for a phrase, serious problems of structural overgeneration might arise. On the other hand, optionality of complement(s) is a real life fact. Elliptical patterns are seen in many languages and are especially commonplace in Chinese. In order to ensure obligatoriness of complements, the lexical rule approach can be applied to elliptical patterns, as shown in Section 3. This approach maintains the configurational constraint in tree building to block structural overgeneration, but the cost is great: each possible elliptical pattern for a head has to be accommodated by a new lexical entry. With the type mechanism provided by ALE, we have developed a technique that allows for optionality of complement(s) and still maintains a proper configurational constraint. We will address this issue in another paper, “Configurational constraint in Chinese grammar”.

[4] This choice coincides with the base-generated account of the BA construction in [citation missing], but that does not mean much. First, our so-called basic pattern is not their D-Structure. Second, our choice is based on more practical considerations. Their claim involves more theoretical arguments in the context of generative grammar.

References

Carpenter, B. & Penn, G. (1994): ALE, The Attribute Logic Engine, User's Guide, Version 2.0

Gao, Qian (1993): “Chinese BA-Construction: Its Syntax and Semantics”, OSU Working Papers in Linguistics 1993, Kathol A. & Pollard C. (eds.)

Huang, Xiuming (1987): “XTRA: The Design and Implementation of A Fully Automatic Machine Translation System”, Ph.D. dissertation.

Li, Audrey (1990): Chapter 6 “Passive, BA, and topic constructions”, Order and Constituency in Mandarin Chinese. Kluwer Academic Publishers

Li, Wei & McFetridge, Paul (1995): “Handling Chinese NP predicate in HPSG”, Proceedings of PACLING-II, Brisbane, Australia

Pollard, Carl & Sag, Ivan A. (1994): Head-Driven Phrase Structure Grammar, Centre for the Study of Language and Information, Stanford University, CA

Pollard, Carl & Sag, Ivan A. (1987): Information-based Syntax and Semantics. Vol. 1: Fundamentals. Centre for the Study of Language and Information, Stanford University, CA

Wilks, Y.A. (1978): “Making Preferences More Active”, Artificial Intelligence, Vol. 11

Wilks, Y.A. (1975): “A Preferential Pattern-Seeking Semantics for Natural Language Inference”, Artificial Intelligence, Vol. 6

* This research is part of my Ph.D. project on a Chinese HPSG-style grammar, supported by the Science Council of British Columbia, Canada under a G.R.E.A.T. award (code: 61). I thank my supervisor, Dr. Paul McFetridge, for his supervision. He introduced me to HPSG theory and provided me with his sample grammars. Without his help, I would not have been able to implement the Chinese grammar in a relatively short time. Thanks also go to Prof. Dong Zhen Dong and Dr. Ping Xue for their comments.

[Pinned: Overview of NLP posts on Liwei's ScienceNet blog (regularly updated)]
