ScienceNet.cn

Tag: text
Related blog posts

Antitrust Analysis: Problems, Text, Cases (Vol. 2)
黄安年 2019-1-30 12:37
Antitrust Analysis: Problems, Text, Cases (Vol. 2) [by Phillip Areeda; 《反托拉斯分析:问题、文本、案例》] [黄安年's personal book catalog (English-language books on American studies, no. 104)] Compiled by 黄安年 and posted on his blog, January 30, 2019 (post no. 20773). Starting in 2019, I will gradually publish the complete catalog of my personal book collection on this blog, beginning with the English-language titles on American studies. Each book is numbered individually, with no ordering by publication date or category. Presented here is Phillip Areeda's Antitrust Analysis: Problems, Text, Cases (Vol. 2) (《反托拉斯分析:问题、文本、案例》), Little, Brown & Company, 2nd edition, 1974, 1034 pages. Twenty-six photographs were taken from the book.
Category: Personal book catalog | 1470 reads | 0 comments
[MATLAB] Adjusting the level step displayed by contour
JerryYe 2018-12-20 16:49
Category: Matlab | 3964 reads | 0 comments
Linux: opening text files from the terminal with Geany or Sublime Text
haibaraxx 2017-8-30 15:49
Method 1: type in the terminal:

# Geany
$ open -a Geany
$ open -a /Applications/Geany.app

# Sublime Text
$ open -a /Applications/Sublime\ Text.app

Note: a space in a path is written as "\ " (a backslash followed by the space).

Method 2:

1. Add the following paths to your .bashrc file and save it.

# Geany
export geany=/Applications/Geany.app/Contents/MacOS
export PATH=$geany:$PATH
# Sublime Text
export subl=/Applications/Sublime\ Text.app/Contents/SharedSupport/bin
export PATH=$subl:$PATH

2. Open a new terminal; typing the following commands will then open a text file with the corresponding editor.

$ geany
$ subl
Category: Linux | 5538 reads | 0 comments
[Repost] Perl: adding a Perl build system to Sublime Text 3 on Mac
haibaraxx 2016-12-23 01:15
Reposted from: http://www.zhimengzhe.com/mac/78542.html

1. Tools - Build System - New Build System;
2. Add the code shown in the attachment: Perl.sublime-build.txt
3. Save the file, naming it Perl.sublime-build;
4. Tools - Build System - Perl
5. Run a Perl script with Command+B
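The attachment itself is not reproduced in the post. For orientation, a typical Perl build definition along these lines (an illustrative sketch, not necessarily the author's attached file) would be:

{
    "cmd": ["perl", "$file"],
    "file_regex": ".* at (.*?) line ([0-9]+)",
    "selector": "source.perl"
}

Here $file is a variable Sublime Text substitutes with the path of the active file, and file_regex lets the editor jump to the file and line reported in a Perl error message.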
Category: Perl | 2630 reads | 0 comments
An encounter with Facebook's Deep Text
Popularity 1 liwei999 2016-6-16 08:29
A few days ago Facebook announced Deep Text, which set off heated discussion in the AI and natural language understanding communities and made quite a stir in the media. Yesterday I had my first first-hand encounter with Facebook's Deep Text, and it confirmed its shallow, structure-free nature, no matter how many layers it was trained with. I always chat with my daughter on Facebook; her circle all use Facebook and hardly use WeChat. She had run into something annoying and was a bit stressed, so I told her to take a deep breath. To my surprise, Facebook immediately popped up an Uber link: one tap and a cab would come. Heavens, is this what they call deep? Very likely it is just an ngram-based classification system; where is any trace of deep NLP or structure? Presumably the training set had plenty of "take a ride" and "take a cab", so "take a deep breath" also got classified as a "transportation" event. If this kind of information extraction were built on parsed structure, such a joke would never happen. The reports claim Deep Text understands language at near-human level; that boast knows no bounds. The gap between it and the extraction capability and precision our parsing supports is measured in miles. This is really not a surprising finding, because the machine learning community has always done NLP at a shallow level: no depth, no structure, no understanding, lacking fine-grained analysis (parsing) capability; it is mostly coarse-grained classification work. Classification systems are effective only when the input text is large. On short messages they are basically guessing: keyword density loses its advantage in short messages, and there are too few data points as evidence. In fact, nearly all NLP applications to date are essentially confined to the structure-free level, and machine learning, deep or not, has not changed that. This may well be the crux of why deep learning (DL) seems underpowered on text. Prof. Song remarked the other day that the benefit of going deep in learning is being able to digest more training data; but growth in data is always linear, while the structural nature of text entails combinatorial explosion in language, so deep learning will not fundamentally change the picture by adding data: sparse data remains the challenge. As long as the ngram and BOW (bag-of-words) models stay unchanged, however deep the training, it is still struggling at the shallow level of language; it can do coarse-grained NLP but can hardly handle fine-grained NLP tasks. Ngrams are only a crude approximation of linguistic structure; the lack of structure is the Achilles' heel to date. Event extraction on top of parsing beats event classification on top of ngrams by more than a head: one is fine-grained, the other coarse; one precise, the other mediocre.
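A toy illustration of the failure mode described above (the data and labels are entirely made up; this is not Facebook's system, just a bag-of-words classifier of the kind the post suspects):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# tiny invented training set: ride-hailing requests vs. everything else
texts  = ["take a ride", "take a cab", "call me a taxi",
          "I feel sad today", "what a lovely day"]
labels = ["transport", "transport", "transport", "chitchat", "chitchat"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)

# 'deep' and 'breath' are out of vocabulary, so the lone surface cue 'take'
# drags the sentence into the transport class: no structure, no understanding
print(clf.predict(["take a deep breath"]))   # -> ['transport']

With no notion of syntactic structure, the classifier can only vote on the words it has seen, which is exactly how "take a deep breath" ends up looking like "take a cab".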
Category: Liwei's popular science | 3823 reads | 1 comment
Social media mining on the credit industry in China
liwei999 2014-9-21 02:56
The purpose of this investigation is to collect public opinions from Chinese social media on one of the most important industries in the financing world of China: credit cards and their associated issues. Brand names such as Ali Pay and Citi Bank are analyzed in this context. We all know China has seen continuous economic growth in the last three decades, unprecedented in human history. Just about 15 years ago, the Chinese people had rarely heard of credit cards, online payment and personal credit: everything looked so remote, and almost all transactions were in cash only. Look at today. Look at the incredible IPO of Alibaba, which (among others) helped build the concept and practice of credit and online payment for a client base involving over a billion people. The Chinese market is important for international banks too. Hence an accurate Chinese social media analysis on the critical financial topic of credit cards will help them in their Chinese business as well. This study, using real-life automatic mining of social media big data, shows that we have state-of-the-art Chinese NLP (Natural Language Processing) technology that really works; in fact, it is the only real-life, fully automatic Chinese deep analysis system in industry scaled up to the entire social media. Despite the anarchy and all kinds of jargon and ungrammaticality in Chinese social media, our system is able to make the best sense of the massive data and uncover the true intelligence behind it, including public opinions and sentiments and, more importantly, the underlying motivations behind those opinions and sentiments. The exercise below demonstrates a flavor of that. We have defined two related category topics for study: credit card (信用卡) and credit card fraud (信用卡欺诈, including all types of security issues). It is believed that these topics are of general interest to people in the financing world. The above summary represents Chinese social media data over the past year, from 9/15/2013 up to 9/15/2014. It has very limited Weibo data due to data cost constraints, but includes almost all other Chinese social media sources such as 天涯, 豆瓣, 百度贴吧, 淘宝, etc., excluding WeChat (微信) because it is largely private data, not open to anyone for public mining and analysis (fortunately or unfortunately). As it shows, the topic "credit card" is mentioned 1.4 million times and "credit card fraud" 139k times, about one tenth of the former. This shows that fraud is indeed a significant credit card subtopic that people are concerned about. Also noticeable in the summary are the associated net-sentiment measures (a metric representing the ratio of positive comments versus negative comments, an indicator of the public image of a brand or topic in people's minds as represented by social media): 28% for "credit card" and -41% for "credit card fraud". Based on our past metrics on different brands and topics, 28% is fair for a neutral category topic, and it shows that people still like and adopt credit cards despite some concerns related to them. -41% is a very negative net sentiment for a topic, which is natural in this case because the fraud topic we are investigating is itself a negative thing. In the Timeline trends graph, we can see the topic's ups and downs over the year in Chinese social media. It looks like the topic was hot near the end of 2013 and around March and April 2014. We can drill down to show what events caused the spikes of the topic in social media at those times, if needed.
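The post does not spell out how net sentiment is computed; a common definition that is consistent with the reported values (an assumption on my part, not necessarily the system's exact metric) is

    net sentiment = (P - N) / (P + N)

where P and N are the numbers of positive and negative mentions. Under this reading, -41% for "credit card fraud" means negative mentions outnumber positive ones by roughly (1 + 0.41) / (1 - 0.41), i.e. about 2.4 to 1.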
The next graph, on Crosstab, shows our association analysis of the category topics with some known brands we chose to investigate: 支付宝 (Ali Pay, Alibaba's famous payment system, China's Paypal), 建行 (China Construction Bank), Citi (花旗), HSBC (汇丰), and Deutsche Bank (德银). The category association analysis gives a quick view of how seriously an issue is associated with a brand and how one brand compares with other brands on that issue. If an issue is serious, we should drill down to analyze what is going on behind the numbers. There are tools and widgets handy in our system to help with all kinds of drill-downs into the relevant data at will, and a variety of ways of looking at the data from different perspectives with different constraints to reveal cause-effect or other insightful relationships. For the credit fraud topic, the table below shows that the two Chinese brands are deeply involved in the issue, with more concerns around Alibaba's system. More specifically, we have 5k mentions related to some type of fraud out of 67.7k topic data points for Alibaba's payment system Ali Pay, and 1.8k mentions out of over 100k topic data points for China Construction Bank; in relative terms, that is roughly 7.4% versus 1.8% of each brand's topic mentions. This makes sense, as Alibaba's system handles online payment exclusively, with so many transactions by so many online stores that it seems more subject to fraud events. As for Citi, the situation is not bad: 166 mentions out of 6990 credit topic data points; this is very comparable to HSBC, with 114 mentions out of 5629. That is the overall picture of the issue when comparing brands. The next two graphs are word clouds of the themes and emotions related to the topics. The major sentiments on credit cards are positive: many Chinese consumers talk about "support" (支持), "like" (喜欢), "use" (用), "trust" (信赖) and "enjoy" (享受) with regard to credit cards, and generally regard them as "good stuff" (好东西); the negative sentiments are far fewer, including "NOT support" (不支持), "NOT accept" (不接受) and "does not work" (不行). The subtopic "credit card fraud" is associated, quite naturally, with lots of "worry" (担心), "not well" (不善), "high risks" (危险) and "issues" (问题). Different from many other teams who claim to do sentiment analysis, our system does not just mine emotional sentiments; we can reveal the reasons behind sentiments as well: why people like or dislike something. This type of insight is far more complicated, as there are thousands of reasons, although there are only a couple of major sentiment types such as positive or negative (or neutral) and maybe a dozen sub-types such as hate, anger, disappointment, love, like, thankfulness, or mixed feelings. However, the uncovered reasons and motivations behind the sentiments are far more valuable and actionable for business decision making. This is shown in our Likes and Dislikes clouds and pie charts below. There are lots of interesting insights here, and some may be worth drilling down into for further analysis using our tool. Let us focus on the top insights. From the pie charts, we see the top reasons why people like credit cards are: 方便 (convenience), 优惠 (promotions), 行 (works). The top dislikes are: 被盗 (stolen), 逾期 (past deadline), 诈骗 (fraud), 费 (fees), 伪造 (fake). These all seem to be common sense. The point is that these factors can change in time and order, reflecting the social sentiments and consumers' opinions and concerns at the time.
For example, "promotions" (优惠) are almost as important as "convenience" in consumers' social talk as top reasons for using credit cards; this confirms that the incentives in credit card promotion campaigns must have worked, and that there are good reasons to keep promoting. On the negative side, we see that almost 50% of the top 10 dislikes are related to some type of fraud, and about 30% to concerns over fines and fees. This type of insight and comparison is exactly what credit card companies are looking for; they need to address such concerns in order of priority. In general, the results look really impressive and the quality is good. We can drill down to details interactively in our live demo if interested. [Pinned: an overview of 立委's NLP posts on his ScienceNet blog (periodically updated)]
Category: Social media mining | 5531 reads | 0 comments
Usage of the text function in MATLAB
zrl780567443 2014-8-16 20:22
text(x,y,'string') displays the string at the specified position (x,y) in the figure. text(x,y,z,'string','PropertyName',PropertyValue,...) places the string at the position given in axis coordinates and sets the specified properties.

Example:

x = -4:0.2:4;
y = sin(x);
hp = line(x,y,'LineWidth',3);
thand = text(2,0,'Sin(\pi)\rightarrow')

The result is shown in Figure 1. After setting properties of the label added to the figure, e.g.

set(thand,'BackgroundColor',[...],'EdgeColor',[...])   % the color values were lost from the original post

the result is shown in Figure 2. A more detailed introduction to the text function is available at: http://wenku.baidu.com/link?url=zV4QstKM20Pi_OJIgStIDAf19Hm8Fl9d5k2Tm6Wr1v3U3mRUxN-mDF4Kn_iYfd_3RuTUBZI7M5i66aIjl9ls07l5gbETlRaqbaLFSr1Urpa

Additional material:

rh = rectangle('Position',[0.2 0.2 0.5 0.8],'Curvature',[...]);   % Position restored from the comment below; the Curvature values were lost
axis([...])                                                       % the axis limits were lost
set(rh,'Linewidth',3,'LineStyle',':')
% In the code above, 0.2, 0.2 is the starting (lower-left) position of the shape and 0.5, 0.8 are its widths along the x and y axes; Curvature sets how rounded the shape is.

The result is shown in the figure below.

Draw three points in 3-D coordinates and use them to make a tetrahedron:

x = [...]; y = [...]; z = [...];   % the coordinate values were lost from the original post
plot3(x,y,z,'ko')
polyhedron.vertices = [...];   % vertex positions: row 1 is the position of point 1, and so on
polyhedron.faces = [...];      % connections between points: row 1 connects points 1, 2 and 3
pobj = patch(polyhedron, ...
    'FaceColor',[...], ...
    'EdgeColor','black');
% patch is a low-level graphics function for creating patch objects. A patch object is one or more polygons determined by the coordinates of its vertices; you can specify the color and lighting of the patch.
% patch('PropertyName',propertyvalue,...) specifies all properties of the patch object via property/value pairs. Unless FaceColor and EdgeColor are given explicitly, MATLAB uses the default property values. This calling form lets you define the patch via the Faces and Vertices properties.

A detailed introduction to patch is available at: http://blog.sina.com.cn/s/blog_707b64550100z1nz.html
Category: matlab | 83830 reads | 0 comments
How to easily remove commas from txt data columns?
Popularity 2 Shifengyu 2014-4-29 10:59
In experimental work you often run into text data columns containing commas (,), such as XRD data. Some students, with no better idea, delete the commas one by one before processing the data. If you import the data with Excel instead, the nuisance of commas in the text columns is easily removed. It is very simple, but if you have never done it, it can still feel like a chore. The steps: open Excel, click the Data menu, pull down and choose Import External Data, then Import Data. Open the target text file, choose Delimited, click Next, check Comma, then Next, General, Finish, OK. You will then see the data columns with the commas gone. I hope this helps.
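If a scripting environment is at hand, the same cleanup can be done without Excel. A minimal Python sketch (assuming the file is plain comma-separated numeric data; the file name xrd.txt is hypothetical):

import numpy as np

data = np.loadtxt("xrd.txt", delimiter=",")        # the commas are consumed while parsing
np.savetxt("xrd_clean.txt", data, delimiter="\t")  # write the columns back tab-separated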
Category: Diabetes | 8294 reads | 3 comments
MATLAB: text labels hidden behind the image when plotting with pcolor
sinogyang 2012-6-7 09:08
Today I found that in some cases (I honestly have not figured out the pattern of exactly when) text added on top of a pcolor plot is hidden behind the image itself and does not show. I searched the web and found the same problem with a solution; sharing it here.

Symptom: if you type the following commands

pcolor(peaks)
shading interp
text(30,30,'hello world');

and you cannot see 'hello world' in the figure, then congratulations, there is a fix:

change text(30,30,'hello world'); to text(30,30,1,'hello world');

Reason: pcolor is really a surf plot viewed directly from above, therefore you need to give your text a vertical offset so that it appears to be in front of the image. In other words, you need to specify which layer the text content lives on, just like the "Bring to Front" command in PowerPoint. Bingo!

Taken from: http://www.mathworks.com/matlabcentral/newsreader/view_thread/298118
Category: Matlab | 10683 reads | 0 comments
Text typology: register, genre and text type
carldy 2012-4-10 11:32
Excerpts from: Anna Trosborg. 1997. Text Typology: Register, Genre and Text Type. In Text Typology and Translation: 3-23. Amsterdam/Philadelphia: John Benjamins. Can be retrieved from the following website: http://paginaspersonales.deusto.es/abaitua/konzeptu/nlp/trosborg97.htm

Text typology: register, genre and text type

Which categories can be used to classify and explain the ways in which types of discourse may be accounted for? Terminological problems concerning the distinction of text, discourse, register, genre, text type, discourse purpose, communicative purpose, rhetorical purpose and communicative function will be dealt with. A framework comprising a classification into registers and genres, with communicative function and text type as crucial categories within a discourse framework of field, tenor and mode, will be suggested.

Text types

Virtanen (1990) has studied the difference in terminology use: 'discourse' is preferred by Grimes 1975, Sinclair & Coulthard 1975 and Longacre 1982; 'text' by Halliday & Hasan 1976, Quirk et al. 1985 and Biber 1989. Trosborg (1997) proposes to use them interchangeably.

Text types cutting across registers and genres

Genre distinctions do not adequately represent the underlying text functions of English. Genres and text types must be distinguished. Texts within particular genres can differ greatly in their linguistic characteristics (texts in newspaper articles can range from narrative and colloquial to informational and elaborated). On the other hand, different genres can be similar linguistically (newspaper and magazine articles). Linguistically distinct texts within a genre may represent different text types, while linguistically similar texts from different genres may represent a single text type (Biber 1989:6). While genres form an open-ended set (Schauber and Spolsky 1986), text types constitute a closed set with only a limited number of categories (also Chafe 1982, who proposes a four-way classification of texts along 'involvement-detachment' and 'integration-fragmentation'). Kinneavy (1971, 1980) classifies texts in terms of modes of how reality can be viewed. His text types are cognitive categories offering ways of conceptualizing, perceiving and portraying the world: narration: our dynamic view of reality looks at change; evaluation: our dynamic view focuses on the potential of reality to be different; description: our static view focuses on individual existence; classification: focuses on groups. Based on cognitive properties, Werlich (1976) includes five idealized text types or modes (adopted by Hatim and Mason 1990, Albrecht 1995, and Biber 1989 - the latter based on linguistic criteria): description: differentiation and interrelation of perceptions in space; narration: differentiation and interrelation of perceptions in time; exposition: comprehension of general concepts through differentiation by analysis or synthesis; argumentation: evaluation of relations between concepts through the extraction of similarities, contrasts, and transformations; instruction: planning of future behaviour, with option (advertisements, manuals, recipes) or without option (legislation, contracts). The relationship between text types and genres is not straightforward. Genres reflect differences in external format and situations of use, and are defined on the basis of systematic non-linguistic criteria. Text types may be defined on the basis of cognitive categories or linguistic criteria. Biber captures the salient linguistic differences among texts in English (see also Longacre 1976, 1982, and Smith 1985).
For 2,400 years there have been two traditions of classifying texts, deriving from Aristotle's Rhetoric. Rhetoric often refers to the uses of language. More specifically, it refers to modes of discourse realized through text types (narration, description, exposition, argumentation, etc.), i.e. the classification of texts by type (Kinneavy 1980:3-4). For others, it refers to communicative functions as rhetorical strategies (Trimble 1985).

1) Classification according to purpose. In terms of communicative functions, is the discourse intended to: inform; express an attitude; persuade; create a debate? Whereas genre refers to completed texts, communicative functions and text types, being properties of a text, cut across genres. Informative texts: newspaper reports, TV news, textbooks. Argumentative texts: debates, political speeches, newspaper articles.

2) Classification according to type or mode: descriptive, narrative, expository, argumentative, instrumental (Kinneavy 1980, Faigley & Meyer 1983). The focus is on functional categories or rhetorical strategies, which is not normative but abstract knowledge. Longacre (1976, 1982), Smith (1985) and Biber (1989) refer to text types as "underlying shared communicative functions". Trosborg reserves these functions for a classification of speech acts according to a typology by Kinneavy, restricting text types to modes of discourse. Other authors (but not Trosborg) take function as: "the kind of reality referred to" (Cassirer 1944, Urban 1939); the level of social formality of a given discourse (Kenyon 1952); non-morphological classes of words in grammar (Fries 1952). While communicative purpose is the aim of a text, rhetorical purpose is made up of strategies which constitute the mode of discourse realized through text types. Text types are "a conceptual framework which enables us to classify texts in terms of communicative intentions serving an overall rhetorical purpose" (Hatim and Mason 1990:140).

Communicative functions

The purpose of discourse may depend on four factors of the linguistic process: speaker; listener; message (thing referred to); linguistic material (text). Aristotle proposed "a language concerned with things" and "a language directed to the hearer". A three-dimensional model of communication (triangle) was proposed by Bühler (1933). Based on Aristotle and Bühler, a text can be classified into a particular type according to which component in the communication process receives the primary focus (Jakobson 1960, Kinneavy 1971): speaker: expressive; listener: persuasive; world (thing referred to): referential; language (linguistic code): literary. Roman Jakobson (1960) added two other uses: metalanguage, and phatic communication (to keep the channel open). Jakobson's model was adopted by Dell Hymes (1974). Kinneavy (1980:65) acknowledges Aristotle and Aquinas, Cassirer, Morris, Miller, Russell, Reichenbach, Richards, Bühler, and Jakobson. Reiss (1976) makes a typology of texts based on communicative functions for translation, similar to Nord (1997).

Speech acts

Speech act theory views language as action made up of communicative acts (Austin 1962, Searle 1969, 1976). Searle (1976) distinguishes six major classes of speech act: assertives, e.g. the speaker states the door is open and believes that it is; directives, e.g. the speaker gives the command open the door and wants the door to be opened; commissives, e.g. the speaker says I will open the door and intends to do it; expressives (and evaluatives), e.g. the speaker exclaims I like your coat and means it; declarations, e.g. in saying I resign or You're fired the speaker must have the role of employee or boss respectively - here the state described by the propositional content is realized by the very act of saying it, and no psychological state is needed; representatives, e.g. the judge declaring I find you guilty as charged - again, the state described by the propositional content is realized by the very act of saying it, and in addition there is the sincerity condition that the speaker must believe the proposition expressed.

The distinction of these general classes is based on four dimensions: illocutionary point (assertive, directive, commissive...); direction of fit (word-to-world WWL, or world-to-word WLW); psychological state (believes B, wants W, intends I); propositional content. Each class divides into a number of different speech acts (e.g. depending on whether the speaker is begging, asking, ordering or threatening; the illocutionary point is the same - influencing the hearer - but different illocutionary forces are expressed). Acknowledging Traugott and Pratt (1980), Hatim and Mason (1990) adopted this framework for translation. An advertisement may be predominantly referential, consisting of expressive (informative) statements, but the aim is persuading the consumer to buy, i.e. they are directive. This is why speech acts interrelate with sequences and conform to the notion of the illocutionary (or communicative) structure of a text. Austin (1962) declared that speakers do not simply produce sentences that are true or false, but rather perform speech actions such as requests, warnings, assertions, etc. Searle (1969) adopted Grice's (1957) recognition of intention in his effort to specify the necessary conditions on the performance of speech acts. Text pragmatics studies how sequences of speech acts are evaluated on the basis of higher order expectations about the text, and how these sequences of coherent microtexts contribute to the global coherence of a larger text (Ferrara 1985).

Text act

Text act: the predominant illocutionary force of sequences of speech acts must be recognized (Hatim and Mason 1990, Horner 1975).

Context focus

No theory of modes of discourse is rigid in its categorization; there are multiple views of reality and multiple types (Kinneavy 1980). Pure narration, description, exposition and argumentation hardly occur. A particular genre may make use of several modes of presentation. Text type focus or contextual focus refers to text type at the macro level, the dominant function of a text type in a text (Morris 1946, Werlich 1976, Hatim & Mason 1990, Virtanen 1992 'discourse type'). A two-level typology of text types and communicative functions: at the macrolevel of discourse, text type may be assumed to precede the level of text-strategic choices, thus affecting the whole strategy of the text; the choice of microlevel text type has to do with the textualization process, which is determined by the text producer's text strategy. Text types employed in a particular text (or genre) need not agree with its contextual focus. An argumentative text-type focus may be realized through narration, instructions may take the form of description, etc. There is interaction between communicative purpose and rhetorical purpose (text type); e.g. to persuade it is possible to narrate, describe, argue.

Genre

Genres reflect differences in external format and situations of use, and are defined on the basis of systematic non-linguistic criteria.
Examples: guidebook, nursery rhyme, poem, business letter, newspaper article, radio play, advertisement. Registers are divided into genres reflecting the way social purposes are accomplished in and through them in the settings in which they are used. Bhatia (1993:17) points out that a science research article is an instance of scientific language, as is an extract from a chemistry lab report. Academic language shows in: casual chats, lectures, conversations, class email, memos, scholarly papers, books. Legal register (language of law): legislative texts, contracts, deeds, wills, the judge declaring the law, judge/counsel interchanges, counsel/witness interchanges, textbooks, lawyers' communications. In the case of restricted registers there is a close connection between register and genre, e.g. weather forecasts. Genre is a macrolevel concept, a communicative act within a discursive network: "... repertoires of typified social responses in recurrent situations - from greetings to thank yous to acceptance speeches and full-blown, written expositions of scientific investigations - genres are used to package speech and make it recognizable to the exigencies of the situation" (Berkenkotter & Huckin 1995). Swales (1990) analyses the development of the concept of genre in the fields of: folklore studies, literature, linguistics, rhetoric. Aristotle: genres as classes of texts, "a distinctive type or category of literary composition" (Webster). Today genre refers to a distinctive category of discourse of any type, spoken or written, with or without literary aspirations. The study of texts as genres, "how texts are perceived, categorized and used by members of a community" (Swales 1990:42), attracted little attention from linguistics (e.g. Frow 1980) until the Systemic school took it up. Rhetorical scholars have given genre a more central place, recently focused on the social constitution of nonliterary forms of writing and speaking. Ethnographers are concerned with which labels are used to type communications, in order to reveal elements of verbal communication which are sociolinguistically salient (Saville-Troike 1982). In the field of LSP there has been growing interest in the sociocultural functions of disciplinary genres, e.g. legal and scientific communication: medical English (Maher 1986), legal English (Bhatia 1987). Genres are not simply assemblies of similar textual objects, but coded and keyed events set within a social communicative process (Todorov 1976, Fowler 1982, Swales 1990). "A rhetorically sound definition of genre must be centred not on the substance or form of the discourse but on the action it is used to accomplish" (Miller 1984:151). Genres embrace each of the linguistically realized activity types which comprise so much of our culture (Martin 1985:250). Genre is a system for accomplishing social purposes by verbal means. It "refers to the staged purposeful social processes through which a culture is realized in a language" (Martin and Rothery 1986:243).

Communicative purpose as the defining criterion of genre

For some scholars genres are defined on the basis of external criteria: newspaper articles in newspapers, etc. (Biber 1989:6); for others, on the basis of communicative purpose or linguistic content and form (Swales 1990, Bhatia 1993, Berkenkotter & Huckin 1995, Bhatia 1995). Swales emphasizes the socio-rhetorical context of genre; the categories are those of the community, and communicative purpose is the defining criterion. Genre as a social action operates as a mechanism for clarifying what the communicative goals are.
According to Swales (1990:10): instances of genres vary in their prototypicality; the community's nomenclature for genres matters; discourse community, genre and task are bound by communicative purpose, which drives the language activities of a discourse community, is the prototypical criterion for genre identity, and operates as the primary determinant of tasks.

A multi-dimensional approach to genre

The relation between genre and register is unclear (Ventola 1984). Is genre a system underlying register? For Trosborg (1997) it is not. Register: in the narrow sense of occupational field, genres such as contracts will be part of the legal register; a sermon will involve the religious register. Genre: but a particular genre may cut across a number of registers; a research article in chemistry may be similar to a research article in sociology (Swales 1981). One register may be realized through various genres; in this sense genres are subordinated to registers. Conversely, one genre may be realized through a number of registers, just as a genre constrains the ways in which the register variables of field, tenor and mode can be combined. Registers impose constraints at the linguistic level of vocabulary and syntax. Genre constraints operate at the level of discourse structure. Genre specifies conditions for beginning, structuring and ending a text. Genres can only be realized as completed texts (Couture 1986). Trosborg (1997) sees genres as having complementary registers. The communicative success of a text may require appropriate combinations of genre and register (Couture 1986). In agreement with the stand taken by Swales (1990), Bhatia (1993) takes genre analysis from linguistic description to explanation: why do members of a specialist community write the way they do? Berkenkotter & Huckin (1995) develop a sociocognitive theory of genre, which Trosborg (1997) applies as an explanatory approach to hybrid political texts from the EU. Genres cannot always be identified by communicative purpose; e.g. poetic genres aimed at giving verbal pleasure defy ascription of communicative purpose. The medium of communication may also be decisive: memos, emails, faxes. Model: texts form part of communicative situations. Halliday's (1971) functional approach has a three-fold division (used by Vermeer, Nord, Hatim & Mason 1990, and Baker 1992 for translation, and by Bhatia 1993 for LSP): field: ideational component covering linguistic content; tenor: interpersonal component covering communicative functions in relation to sender/receiver roles; mode: textual component involving medium, channel and nature of participation. A genre can only be accounted for through a specification of field, tenor and mode and a description of the linguistic features realized in the ideational, interpersonal and textual components of particular texts (Eggins 1994). Kussmaul (1997) shows how a change of a single parameter may result in a change of genre.

Register

Varieties of language use have been referred to as registers (Reid 1956, Halliday et al. 1964). Halliday et al. (1964) divided language into user-related varieties or dialects (Corder 1973): geographical, temporal, social, non-standard dialects, idiolects; and use-related varieties or registers of occupational fields: religion, legal documents, newspaper reporting, medicine, technical reporting. These are some approaches to the study of register: register as a functional language variation is a "contextual category correlating groupings of linguistic features with recurrent situational features" (Gregory & Carroll 1978:4).
The frequency of lexico-grammatical features shows sub-codes of particular text varieties (Crystal & Davy 1969, Gregory & Carroll 1978). Frequencies of syntactic properties provide evidence of divergences of language variation (Barber 1962, Crystal & Davy 1969, Gustafsson 1975). Some specific linguistic features have restricted values in scientific communication; e.g. pre-modifying en-participles textualize two different aspects of a chemistry text depending upon whether the author is exemplifying or generalizing (Swales 1990:41). This shows a relation between grammatical choices and rhetorical functions, i.e. communicative functions (Lackstrom, Selinker & Trimble 1973, Swales 1981, Trimble 1985). The collocation of two or more lexical items, rather than the occurrence of isolated items, determines the identity of a given register. Conscious stylistic choices are made by language users in different situation types. A situation type includes any number of situations (tokens) of the general type; e.g. making your next appointment with the dentist is a particular token of a recognized type of situation (Halliday et al. 1964, Hatim & Mason 1990). Trosborg (1997) concludes that register is too broad a notion. Focusing on the language of a field, register analysis disregards differences of genres within the field. Labels such as legal English are misleading and overprivilege a homogeneity of content at the expense of variation in communicative purpose, addresser-addressee relationships and genre conventions (Swales 1990:3). In order to understand how texts organize informationally, rhetorically and stylistically, there is a need for other considerations on top of pure textual evidence. Inappropriate usages of the term register: 'employer register', focusing on tenor (Werlich 1976); 'written register', adjusted to mode (Schleppegrell 1996).

Main references

Vijay K. Bhatia. 1993. Analysing Genre: Language Use in Professional Settings. Longman.
Douglas Biber. 1989. A Typology of English Texts. Linguistics 27: 3-43.
James L. Kinneavy. 1980. A Theory of Discourse. Norton.
M.A.K. Halliday & R. Hasan. 1976. Cohesion in English. Longman.
Basil Hatim & Ian Mason. 1990. Discourse and the Translator. Longman.
John M. Swales. 1990. Genre Analysis: English in Academic and Research Settings. Cambridge University Press.
John M. Swales. 2000. Further Reflections on Genre and ESL Academic Writing. Symposium on Second Language Writing, West Lafayette, Indiana, USA, September 15-16, 2000.

Although many would likely concur with Bakhtin's dictum that "The better our command of genres, the more freely we employ them," operationally genre remains a disputed framework for ESL writing courses and approaches. Controversies polarize around repression versus expression, individual voice versus conventionalized pattern, imitative play versus contextual realpolitik, specific guidelines versus general principles, and cultural subordination versus cultural resistance. Recent work confirms the contested nature of the theoretical ground. On the one hand, Johns (2000) offers several examples of genre-based approaches in effective action; on the other hand, Freedman (2000) questions whether EAP instructors can sufficiently escape their own classroom contexts to offer real assistance with the genres of the wider academy.
In this presentation, I discuss these controversies through the lens of new advanced materials for NNS graduate students (Swales & Feak, 2000) premised on cross-disciplinary "difference," participant disciplinary analysis, genre systems, and a task taxonomy privileging rhetorical reflection. I argue that while border crossings may be hazardous with undergraduate "school genres," and certainly in preparing students for writing at work (Freedman, 1993), they are less so in research genres. Reasons for this include the public nature of many research genres, the established evaluative processes that adjudicate them, and student capacity to assess the appropriacy of any advice offered. http://icdweb.cc.purdue.edu/~silvat/symposium/2000/keynote.html#swales
Category: Translation Practice & Teaching | 6468 reads | 0 comments
[Repost] Text Linguistic Models for the Study of SI
carldy 2012-4-10 11:22
http://www.geocities.com/~tolk/lic/licproj_w2wtoc.htm

Text Linguistic Models for the Study of Simultaneous Interpreting
Helge Niska, Stockholm University, 1999

Table of Contents

1 Introduction: 1.1 The relevance of text linguistics to research on simultaneous interpreting; 1.2 Scope and purpose of the study; 1.3 Research material and methods; 1.3.1 The speakers and the interpreters; 1.3.2 Transcription; 1.3.3 Research method; 1.3.4 Units of analysis
2 Psycholinguistic theories and studies of the interpreting process: 2.1 Interpretation studies in the West; 2.2 The "Paris school"; 2.3 Deverbalisation or not?; 2.3.1 Semantic structure and metalanguage; 2.3.2 Testing "deverbalisation"; 2.3.3 The "sentence" as research unit; 2.4 The simultaneity of simultaneous interpreting; 2.5 An early Swedish study on simultaneous interpreting: Vamling; 2.5.1 Coherence; 2.5.2 Resegmentation; 2.5.3 Other findings
3 Textuality in written and oral texts: 3.1 Standards of textuality; 3.2 Cohesion; 3.2.1 Substitution and ellipsis; 3.2.2 Conjunction; 3.2.2.1 Junction; 3.2.3 Reference; 3.2.4 Lexical cohesion; 3.2.5 Lexical cohesion in interpreting; 3.2.6 Cohesion vs. coherence; 3.2.7 The importance of text structure; 3.3 Textuality and simultaneous interpreting; 3.3.1 Cohesion in spoken texts; 3.3.2 Intonation in interpreting; 3.4 Oral and written language - Halliday's critique; 3.4.1 The oral-literate continuum in simultaneous interpreting; 3.5 Impromptu speech
4 Text typology: 4.1 Intertextuality and text types; 4.1.1 Text types and translation quality; 4.2 Katharina Reiss' text typology; 4.3 Typologies of texts in simultaneous interpreting situations; 4.3.1 Kopczynski (1980); 4.3.2 Niedzielski (1988); 4.4 The relevance of text typology to interpreting; 4.4.1 Alexieva's semantic model as a tool for establishing text types; 4.4.2 Suggestions for a hierarchical text typology for interpreting
5 Text linguistic models: 5.1 The Kintsch and van Dijk model for discourse processing; 5.2 Text linguistic methods in translation and interpretation research; 5.2.1 Description of argumentative texts (Tirkkonen-Condit); 5.2.1.1 Illocutionary and interactional structure; 5.2.1.1.1 Speech acts and illocutions; 5.2.1.2 Problem-solution analysis; 5.2.1.3 Macrostructure analysis; 5.2.1.4 Textual analysis of an interpreted event: a dialogic approach; 5.2.2 Inadequacy in translation (Vehmas-Lehto); 5.2.3 Shifts in cohesive elements in simultaneous interpretation (Shlesinger); 5.2.4 Component processes in simultaneous interpreting (Dillinger); 5.3 Applications of the Kintsch and van Dijk model in simultaneous interpreting research; 5.3.1 Mackintosh (1985); 5.3.2 Lambert (1988)
6 Textual structures in simultaneous interpreting: 6.1.1 The application of the Kintsch and van Dijk model in our study; 6.1.2 The interpreter as editor; 6.1.2.1 "Proof-reading"; 6.1.2.2 Explicitation; 6.1.2.3 Cohesion; 6.1.2.4 Other textual considerations: "Interpreter's edition"
7 Cognitive models: 7.1 Chernov's "probability prediction mechanism"; 7.1.1 Redundancy; 7.1.2 Distribution of redundancy in texts; 7.1.3 Probability prediction model; 7.2 Cognitive problems: an example; 7.3 Alexieva's model; 7.3.1 Pronunciation errors
8 The relationship between interpreting and translator-variants: 8.1 Interpreting strategies: "Translatorese"; 8.2 Two interpreter types
9 Conclusions for future research: 9.1 The relevance of the models; 9.2 Text typology for interpreting; 9.3 References, understanding and production strategies in the interpretation of specialist discourse; 9.3.1 Lexical problems; 9.3.2 Interpretation strategies
References
Footnotes

Preface

The basis of this paper is a report from a pilot study on simultaneous interpreting, carried out at the Department of Finnish, Stockholm University, and financed by a grant from the Swedish Council for Research in the Humanities and Social Sciences (HSFR). The aim of the project was to test the applicability of some text linguistic and text-linguistically-oriented models to the study of simultaneous interpreting. The material that forms the basis for this analysis is a number of transcribed printouts of audio tape recordings of simultaneous interpreting situations using Finnish and Swedish, in both directions. The transcripts have been analysed at several linguistic levels, and the results of the analyses and the acquired empirical data are continuously being used as a resource and theoretical basis for new research. The project supervisor was Professor Erling Wande. The interpretation corpus was recorded by Birgitta Romppanen, who also participated in the initial stage of the project, and the transcription of the tapes was done by Miriam Sissala. The final analyses and the compilation of the report were made by Helge Niska. Comments, criticism and suggestions for improvement can be sent to Helge.Niska@tolk.su.se. Address: Institute for Interpretation and Translation Studies, Stockholm University, S-106 91 Stockholm. Tel. +46 8 162000. Fax: +46 8 161396. Helge Niska.

1 Introduction

1.1 The relevance of text linguistics to research on simultaneous interpreting

In the activity of interpreting, two aspects are especially significant: orality and interaction. The interpreter translates oral discourse (which may or may not have been prepared as written text) in various communicative situations, where messages are exchanged, through the interpreter, between people. Text linguistics has emerged from the 1970s and onwards as the study of the properties of texts (written or oral) and their uses in communicative interaction. The modern conception of text linguistics is a broad one, encompassing discourse analysis and pragmatics, as well as influences from cognitive sciences, communication studies and artificial intelligence. Written text is still the usual object of study within text linguistics, and monographs and scholarly papers abound with examples of written texts. But underneath, or behind, the written text, there are cognitive and other kinds of processes that can take the form of either written or spoken or signed discourse. In working on this study, the contemporary text linguistic approaches of de Beaugrande & Dressler (1981) and van Dijk & Kintsch (1978, 1983) have been especially inspiring. Two other researchers, who are not text linguists, but who take utterances and whole texts as their point of departure in describing the interpreting process, have likewise been very influential, namely Chernov (1978, 1979, 1985, 1994) and Alexieva (1985, 1988). Of these two, Alexieva is probably the most text-linguistically oriented, while Chernov relies more heavily on psychological, especially cognitive research. Nevertheless, we feel that their models, apart from being highly relevant as descriptions of the interpreting process, in this context serve both as valuable complements and correctives of the more "explicitly" text linguistic model proposed by Kintsch and van Dijk.
1.2 Scope and purpose of the study

Internationally speaking, simultaneous interpreting is a relatively new research field, and in Sweden there is virtually no empirical or theoretical research in this area. The aim of this project was to make a preliminary study of the process of simultaneous interpreting, as a pilot study. The objectives were both to assess some of the text linguistic models for the description of the process of simultaneous interpreting that had been presented in previous research, and to test a hypothesis as to the existence, in the simultaneous interpreting situation, of a special variant of translation, for which we coined the term "translatorese". In order to do this, analyses of the interpretations had to be done on several linguistic levels. As a result of this, we also expected to get a preliminary inventory of linguistically and theoretically interesting topics for future research. The languages involved are Finnish and Swedish. This is interesting from a typological point of view, as the bulk of all research on simultaneous interpreting so far has been conducted on Indo-European languages. The pilot project nature of this project implies that another of our primary aims was, on the basis of specified theoretically-founded studies of samples from the empirical material, to develop a better point of departure for larger future projects in the area.

1.3 Research material and methods

1.3.1 The speakers and the interpreters

The material for this study was collected at two conferences in Finland in the autumn of 1990. Altogether, it consists of about 15 hours of audio recordings from two conferences. Within the time frame of the pilot study, we were only able to transcribe approximately five hours from one of the conferences, the participants of which were female authors from Sweden and Finland. From this conference we have analysed speeches delivered by government officials and professional authors, and the interpreting conducted by three, likewise female, professional conference interpreters. The audio recordings were copied to four-channel audio tapes where the original speaker input occupies one channel pair, and the interpretation the other pair. In this way it is possible to listen to the speaker and the interpreter either separately or concurrently. The analyses have been made mainly on the basis of the transcribed material, and the original recordings have been used for control purposes.

1.3.2 Transcription

The transcription of the material was done by PhD student Miriam Sissala. In the transcriptions, only lower case letters have been used, and there is no punctuation. The transcription is orthographic, and spoken language forms have been used extensively, if not exclusively; e.g. 'ja' for Swedish 'jag', 'å' for 'och', 'me' for 'med', 'hitta' for 'hittade', 'e' for 'är'. Because of technical limitations, there is no measurement of time, e.g., length of utterances, duration of pauses or lags between original speech and interpretation. The transcriptions have, however, been marked for pauses within the respective utterances, where a single slash '/' denotes a short pause, and a double slash '//' denotes a long pause. False starts and mispronunciations are recorded in the transcripts. The transcriptions of both originals and interpretations are printed in two (roughly) synchronised parallel columns. For the sake of clarity, in the samples printed in this report a translation into English of the original and the interpretation has been added.
We have aimed at using spoken language forms in the translations as well. The code within parentheses at the bottom of the first column of the samples (T 20) shows the location of the extract in the transcribed corpus.

Swedish original: men de e så många frågor som dyker upp när ja skriver som ja inte har svar på // de e så mycket som händer / man borde ha tie liv // (T 20)
Translation of original: but there are so many questions popping up when I'm writing and which I don't have answers for // there is so much happening / you should have ten lives //
Finnish interpretation: mutta on niin paljon kysymyksiä jotka tulevat mieleen kirjoittaessani enkä pysty vastaamaan niihin / on niin paljon mikä tapahtuu / pitäisi olla kymmenen elämää itsellä /
Translation of interpretation: but there are so many questions that come into my mind when I'm writing and I am not able to answer them / there is so much happening / you should have ten lives for yourself /

Figure 1-1 Sample transcription with English translation

1.3.3 Research method

The basic method in our study was to compare, on the basis of the transcripts, the output of the original speaker to that of the interpreters. Although the text-linguistic approach was, obviously, the most important one, it was natural to make analyses on both "micro" and "macro" levels. On the "micro level" we took note of anomalous intonation and mispronunciations, grammatical "errors" and possible interference from the source language in the target language, like changes in word order and other syntactic changes, changes in use of pronouns, lexical changes and mistranslation. The analyses on this level were made on short discourse segments, the equivalents of "sentences" in written texts. This is also the kind of analysis that we are used to doing when assessing the performance of community interpreters in university examinations and state accreditation tests. Already at this stage it was possible to match the findings of our analyses to the models that we were about to assess. Both Alexieva's and Chernov's models are well suited to cater for the phenomena that we noticed at this level (cf. section 6.3 for Alexieva's four-stage simultaneous interpreting model). The analyses on the "macro" level had to do with issues like the interpreters' "editing" of the texts, e.g. changes in the order of subtopics, their handling of special terminology, and strategies for coping with cognitive problems. Since the aim of the study was to assess the applicability of text-linguistic models to the study of simultaneous interpreting, the main task for us was to try to apply the models to samples from the transcribed material. The results of these "tests" will be reported in the respective sections below.

1.3.4 Units of analysis

Traditional linguistics sees the sentence, or possibly clauses within sentences, as its basic unit of research. Even in text linguistics, which supposedly works with larger units, the sentence, or "the orthographic unit that is contained between full stops" (Halliday 1985:193), is often the unit quoted in examples etc. This is natural when dealing with written texts, but when working with a spoken corpus like ours, things are more complicated. In a forthcoming paper, Robert de Beaugrande proposes a new way of defining the sentence (de Beaugrande forthc.): Our prime question would then be: which sets of criteria might be relevant for making (or not making) a given stretch of discourse into a sentence, or for recognising it to be a sentence?
At least the following sets of criteria might be considered:
5.2.1. structural: a "sentence" consists of an array of relations ("structures") among units, e.g., the "Subject" and the "Predicate" in an "independent clause";
5.2.2. formal: a "sentence" matches an array of formal symbols stipulated by a "formal grammar";
5.2.3. logical: a "sentence" is an "expression" derived via "rules" from a "logical system" of basic "axioms";
5.2.4. philosophical: a "sentence" is a "true or false statement" about a "state of affairs" in a "real or possible world";
5.2.5. cognitive: a "sentence" is a "proposition" with a "predicate" and one or more "arguments";
5.2.6. thematic: a sentence is a pattern for relating the "theme" (or "topic") conveying known or predictable information with the "rheme" (or "comment") conveying new or unpredictable information;
5.2.7. intonational: a "sentence" corresponds to a "tone group" uttered as a "prosodic" unit with a characteristic pitch, e.g., rising or falling;
5.2.8. pragmatic: a "sentence" is the expression of a "constative" or "performative speech act";
5.2.9. behavioural: a "sentence" is a separate segment within the "stream of speech";
5.2.10. orthographic: a "sentence" is an orthographic unit of written language whose outer boundaries and at least some of its inner patterns are indicated in many writing systems by punctuation;
5.2.11. stylistic: the sentence is one of the key units for working out the style of an individual or group, especially in literary discourse;
5.2.12. rhetorical: the sentence is one of the key units for achieving rhetorical effects such as being expansive or terse, brisk or relaxed, excited or calm, and so on;
5.2.13. registerial: the sentence is a unit whose form and organisation adapt to differing "registers", e.g., in an official business letter as compared to a casual memo;
5.2.14. social: the sentence is a unit of higher importance for persons in some social roles or positions than for others, e.g., for a BBC radio announcer as compared to a barman in a rural pub.
Among these criteria, in the work with a corpus like ours of oral interpreting, the tone group (5.2.7) and information group (5.2.6) seem to be the most valid ones (de Beaugrande, personal communication). Since we have worked only on the written transcripts, we have rather used a combination of cognitive and thematic criteria (5.2.5 and 5.2.6). Cf. also section 2.3.3 in the present paper.
Category: Translation Practice & Teaching | 2811 reads | 0 comments
Four tutorial-style articles on text analysis that deserve a careful read
Popularity 1 xiaohai2008 2012-2-9 15:24
These four are certainly not the only detailed, in-depth introductions to text analysis; they are simply the ones I have read so far. Recommendations of other good tutorial-style articles are welcome.

First: a detailed introduction to parameter estimation for discrete data, rather than using the Gaussian distribution as the running example the way most textbooks do. To me the part most worth reading is its use of Gibbs sampling for inference in LDA, with very detailed derivations of the relevant formulas; it is required reading for anyone who wants to understand LDA and related topic models.

@TECHREPORT{Hei09,
  author      = {Heinrich, Gregor},
  title       = {Parameter Estimation for Text Analysis},
  institution = {vsonix GmbH and University of Leipzig},
  year        = {2009},
  type        = {Technical Report Version 2.9},
  abstract    = {Presents parameter estimation methods common with discrete probability distributions, which is of particular interest in text modeling. Starting with maximum likelihood, a posteriori and Bayesian estimation, central concepts like conjugate distributions and Bayesian networks are reviewed. As an application, the model of latent Dirichlet allocation (LDA) is explained in detail with a full derivation of an approximate inference algorithm based on Gibbs sampling, including a discussion of Dirichlet hyperparameter estimation.},
}

Second: just as its abstract states, it is especially suitable for computer scientists. It involves relatively little mathematics and suits colleagues who do not care much about mathematical details. "Uninitiated" roughly means layman; it is not hard to see why Resnik and Hardisty wrote it.

@TECHREPORT{RH10,
  author      = {Resnik, Philip and Hardisty, Eric},
  title       = {Gibbs Sampling for the Uninitiated},
  institution = {University of Maryland},
  year        = {2010},
  type        = {Technical Report CS-TR-4956, UMIACS-TR-2010-04, LAMP-153},
  abstract    = {This document is intended for computer scientists who would like to try out a Markov Chain Monte Carlo (MCMC) technique, particularly in order to do inference with Bayesian models on problems related to text processing. We try to keep theory to the absolute minimum needed, though we work through the details much more explicitly than you usually see even in "introductory" explanations. That means we've attempted to be ridiculously explicit in our exposition and notation. After providing the reasons and reasoning behind Gibbs sampling (and at least nodding our heads in the direction of theory), we work through an example application in detail---the derivation of a Gibbs sampler for a Na\"{i}ve Bayes model. Along with the example, we discuss some practical implementation issues, including the integrating out of continuous parameters when possible. We conclude with some pointers to literature that we've found to be somewhat more friendly to uninitiated readers. Note: as of June 3, 2010 we have corrected some small errors in the original April 2010 report.},
  keywords    = {Gibbs Sampling; Markov Chain Monte Carlo; Na\"{i}ve Bayes; Bayesian Inference; Tutorial},
  url         = {http://drum.lib.umd.edu/bitstream/1903/10058/3/gsfu.pdf}
}

Third: Knight is a big name very familiar to those of us doing NLP, so no further introduction is needed.

@ELECTRONIC{Kni09,
  author = {Knight, Kevin},
  title  = {Bayesian Inference with Tears: A Tutorial Workbook for Natural Language Researchers},
  url    = {http://www.isi.edu/natural-language/people/bayes-with-tears.pdf},
}

Fourth: co-written by Blei, the father of LDA, and his student Gershman, it gives a detailed introduction to Bayesian nonparametric models and discusses the Chinese Restaurant Process (CRP) and the Indian Buffet Process in a particularly intuitive way.

@ARTICLE{GB11,
  author   = {Gershman, Samuel J. and Blei, David M.},
  title    = {A Tutorial on Bayesian Nonparametric Models},
  journal  = {Journal of Mathematical Psychology},
  year     = {2011},
  abstract = {A key problem in statistical modeling is model selection, that is, how to choose a model at an appropriate level of complexity. This problem appears in many settings, most prominently in choosing the number of clusters in mixture models or the number of factors in factor analysis. In this tutorial, we describe Bayesian nonparametric methods, a class of methods that side-steps this issue by allowing the data to determine the complexity of the model. This tutorial is a high-level introduction to Bayesian nonparametric methods and contains several examples of their application.},
  keywords = {Bayesian Methods; Chinese Restaurant Process; Indian Buffet Process},
}
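As a companion to Heinrich's derivation, here is a minimal collapsed Gibbs sampler for LDA in Python (my own illustrative sketch following the standard update formula, not code from any of the four papers; the variable names are arbitrary):

import numpy as np

def lda_gibbs(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    # docs: list of lists of word ids in [0, V)
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))   # document-topic counts
    nkw = np.zeros((K, V))   # topic-word counts
    nk = np.zeros(K)         # tokens assigned to each topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]   # random initial assignments
    for d, doc in enumerate(docs):
        for w, t in zip(doc, z[d]):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                       # remove the current assignment
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # full conditional: p(z=k | rest) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                t = rng.choice(K, p=p / p.sum())  # resample the topic
                z[d][i] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    phi = (nkw + beta) / (nk[:, None] + V * beta)                    # topic-word dists
    theta = (ndk + alpha) / (ndk.sum(1, keepdims=True) + K * alpha)  # doc-topic dists
    return theta, phi

The counts-only update is why the collapsed sampler is so compact: the multinomial parameters theta and phi are integrated out and only recovered at the end as posterior means.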
Category: Machine learning | 12214 reads | 2 comments
Popular science essay: Machine Gossip
liwei999 2011-10-14 17:09
Machine gossip: Text Mining and Intelligence Discovery (13219)
Posted by: liwei999 Date: June 10, 2006 10:07PM

犀角 suggested we simply mine it by machine. I don't want to scare anyone, but in theory, unless you never post, the more you say the more you give away: machine gossip can reveal more of your characteristics than manual digging can. Fortunately the technology is not yet mature.

Text mining has been one of my research focuses in recent years, so here is a brief introduction. The posts we write are in natural language (English, Chinese); they are just strings of characters, so-called unstructured text. It is not that they truly lack structure, but that the structure (syntactic and semantic) is implicit, and the parsing part of NLU (Natural Language Understanding) technology is needed to make it explicit. Why structure it? Think about it: endlessly varied string combinations express all kinds of meanings; without structure, how could you effectively extract information from them (IE: information extraction) and mine valuable intelligence (so-called intelligence discovery)?

Of course, some people extract and mine without structure: so-called keyword-based information extraction and text mining. Some shallow information and intelligence can indeed be extracted and mined this way. It is like searching with Google: Google does not understand your query; in Google's eyes it is just strings of mutually unrelated, structureless words (search terms). But because the web holds massive amounts of highly redundant information, when one source fails another succeeds, and query results are often quite good. Nevertheless, for search, as for IE and text mining, the ultimate breakthrough lies in NLU.

The term text mining derives from data mining, which usually means mining regularities (hidden correlations and patterns) from structured data in databases. Data mining is a fairly mature technology in practical use; it can dig out intelligence that is very valuable for target marketing. Comparing data mining with text mining, the maturity of the former is built on the structuredness of its data (databases are generally built and populated manually). So the key to making text mining more usable is still converting unstructured text into a structured representation. That is a research topic we could not exhaust in a lifetime.

Decoding subject-verb-object relations and their modifiers (SVO) is the main task of Natural Language Parsing. Once SVO parsing is done well, the foundation for language understanding is laid, and information extraction (IE) and text mining built on top of it get twice the result for half the effort.

The difference between information extraction and text mining is that the former extracts "facts", things expressed explicitly in the text (for instance, I once said that I am from Anhui, that I am an Esperantist, that I like Dream of the Red Chamber, that I love music, and so on), while the latter mines hidden relationships, patterns and trends that the text does not state outright. So information extraction can serve as the foundation for text mining: from known facts, mine the implicit connections, regularities and trends - gossip indeed, but gossip on a scientific basis. Some day the machine may well dig up an explosive piece of intelligence such as: forum member so-and-so shows homosexual tendencies. That beats online hearsay - it would be a "scientifically" grounded prediction. You could not clear your name even by jumping into the Yellow River.

To sum up: Natural Language Parsing -- Information Extraction -- Text Mining

-------- 立委's motto: if I could live life over again, I would go into news reporting and editing.

The point of drilling into this (13320)
Posted by: seeit Date: June 11, 2006 07:25PM

One more question for 立委: the credibility of information on the web is very low; how does text mining account for credibility? And how is the credibility of a conclusion combined from information of varying credibility controlled? My head hurts just thinking about it.

---------------------------------------------------------------------

This is a problem that makes everyone's head hurt. (13322)
Posted by: liwei999 Date: June 11, 2006 07:59PM

When the sample is too small there is no good solution. Suppose 李四 has posted only 100 times in total, and deliberately conceals, distorts, and mixes truth with falsehood. No reliable intelligence can be mined from such a sample; whatever is mined can only serve as a reference.

But if the sample is large, noise and false information can be filtered out (deconflicting), on the premise that people are not born to lie at all times about all things (statistically, this premise holds).

Because intelligence mining is domain dependent, samples may be limited: the Old Friends forum, for instance, has fewer than 20,000 posts in total; even adding the Reading forum next door, the archive is under 200,000 posts. For domain-independent knowledge acquisition, however, massive archives provide an excellent basis for filtering. Last year or the year before, a fellow at Google published a paper on acquiring entity ontology knowledge from massive archives. The method is very simple yet extremely effective; he used only two patterns:

1. E1, E2, ..., and other C
   e.g. desks, tables, chairs and other furniture
2. C such as E1, E2, ..., and En
   e.g. furniture such as desks, tables and chairs

E is supposed to be an entity noun, and C a category noun for the entities. These two simple language patterns are just common English ways of expressing hypernym-hyponym relations between entities; there are more expressions not covered, so the recall is insufficient. The precision of the two patterns also has problems: the error rate introduces roughly 3-5% noise/false information. But Google's data volume is huge. As long as the computation keeps up, the massive data compensates for the low recall (thanks to redundancy) and also filters out the noise (just set the threshold a bit higher). The results are surprisingly good:

furniture: desk, table, chair, bench, bookshelf, ...
US States: California, Washington, Texas, New York, ...
dictators: Saddam Hussein, Castro, Jiang Zemin, Kim Jong Il, ...
etc. etc.

And the so-called semantic web solves the problem at the source, (13238)
Posted by: liwei999 Date: June 11, 2006 01:23AM

structuring pages with human participation at the time they are authored and published - simply put, putting us natural language people out of work. But the main purpose of the semantic web is not to express domain-independent linguistic analysis (subjects, predicates, objects and the like), but to express domain-dependent semantic ontology, directly capturing the core content of a page. An ontology is a knowledge representation system; the goal of our NLP (natural language processing) / NLU (natural language understanding) / IE (information extraction) is to decode unstructured text and map the content onto a predefined ontology. That is the goal side. The decoding process also draws on knowledge bases, usually a lexicalized thesaurus or the like (such as WordNet); the knowledge in them is also systematic, and also contains ontologies.
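Returning to the two patterns quoted above, a rough sketch shows how little machinery they need (the Python regexes are my own illustrative approximations, not the paper's actual implementation):

import re

# pattern 1: "E1, E2, ... and other C"
P1 = re.compile(r'(\w+(?:, \w+)*),? and other (\w+)')
# pattern 2: "C such as E1, E2, ... and En"
P2 = re.compile(r'(\w+) such as (\w+(?:, \w+)*(?:,? and \w+)?)')

def extract(text):
    pairs = set()
    for m in P1.finditer(text):
        for e in re.split(r',\s*', m.group(1)):
            pairs.add((e, m.group(2)))   # (entity, category)
    for m in P2.finditer(text):
        for e in re.split(r',\s*|\s+and\s+', m.group(2)):
            pairs.add((e, m.group(1)))
    return pairs

print(extract("desks, tables, chairs and other furniture"))
print(extract("furniture such as desks, tables and chairs"))
# both yield pairs like ('desks', 'furniture'), ('tables', 'furniture'), ('chairs', 'furniture')

On any single sentence such regexes are noisy, which is exactly the 3-5% error rate mentioned above; the redundancy of a web-scale corpus plus a frequency threshold is what filters the noise out.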
Those of us working on NLP/NLU do not analyze for analysis's sake: the main goal of parsing unstructured text is to serve information extraction and text mining. Subjects, verbs and objects are means, not ends. The ideal (or fantasy) of the semantic web is to reach the end directly at the source. In that sense, it is after our rice bowl. Of course, humans use natural language everywhere; it is hardly likely that everyone will be willing to take the trouble, or have the means, to go the route the semantic web prescribes. So there is no need to worry about running out of work.

[Pinned: index of Liwei's NLP posts on his ScienceNet blog (regularly updated)]
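The two hypernym patterns described in the post are easy to prototype. Below is a minimal regex sketch, assuming simple comma-separated noun lists; the pattern strings, the toy corpus, and the thresholding idea are illustrative stand-ins, not the actual method of the Google paper.

import re
from collections import Counter

PATTERN_1 = re.compile(r"((?:\w+, )*\w+) and other (\w+)")             # E1, E2, ... and other C
PATTERN_2 = re.compile(r"(\w+) such as ((?:\w+, )*\w+(?: and \w+)?)")  # C such as E1, ..., En

def mine_hypernyms(sentences):
    # Count (entity, category) pairs found by the two patterns; over a
    # massive archive one would keep only pairs above a count threshold.
    pairs = Counter()
    for s in sentences:
        for m in PATTERN_1.finditer(s):
            for e in re.split(r", | and ", m.group(1)):
                pairs[(e, m.group(2))] += 1
        for m in PATTERN_2.finditer(s):
            for e in re.split(r", | and ", m.group(2)):
                pairs[(e, m.group(1))] += 1
    return pairs

corpus = ["we sell desks, tables, chairs and other furniture",
          "furniture such as desks, tables and chairs"]
print(mine_hypernyms(corpus))
# each of desks/tables/chairs pairs with 'furniture', count 2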
Category: Liwei Popular Science | 3808 reads | 0 comments
[Repost] Cross-Lingual Text Mining
timy 2011-2-14 23:34
Encyclopedia of Machine Learning, Springer Science+Business Media, LLC 2011. 10.1007/978-0-387-30164-8_189. Edited by Claude Sammut and Geoffrey I. Webb.

Cross-Lingual Text Mining

Nicola Cancedda and Jean-Michel Renders, Xerox Research Centre Europe, Meylan, France

Definition

Cross-lingual text mining is a general category denoting tasks and methods for accessing the information in sets of documents written in several languages, or whenever the language used to express an information need is different from the language of the documents. A distinguishing feature of cross-lingual text mining is the necessity to overcome some language translation barrier.

Motivation and Background

Advances in mass storage and network connectivity make enormous amounts of information easily accessible to an increasingly large fraction of the world population. Such information is mostly encoded in the form of running text which, in most cases, is written in a language different from the native language of the user. This state of affairs creates many situations in which the main barrier to the fulfillment of an information need is not technological but linguistic. For example, in some cases the user has some knowledge of the language in which the text containing a relevant piece of information is written, but does not have a sufficient control of this language to express his/her information needs. In other cases, documents in many different languages must be categorized in a same categorization schema, but manually categorized examples are available for only one language. While the automatic translation of text from a natural language into another (machine translation) is one of the oldest problems on which computers have been used, a palette of other tasks has become relevant only more recently, due to the technological advances mentioned above. Most of them were originally motivated by needs of government Intelligence communities, but received a strong impulse from the diffusion of the World-Wide Web and of the Internet in general.

Tasks and Methods

A number of specific tasks fall under the term of Cross-lingual text mining (CLTM), including:

• Cross-language information retrieval
• Cross-language document categorization
• Cross-language document clustering
• Cross-language question answering

These tasks can in principle be performed using methods which do not involve any Text Mining, but as a matter of fact all of them have been successfully approached relying on the statistical analysis of multilingual document collections, especially parallel corpora. While CLTM tasks differ in many respects, they are all characterized by the fact that they require to reliably measure the similarity of two text spans written in different languages. There are essentially two families of approaches for doing this:

1. In translation-based approaches one of the two text spans is first translated into the language of the other. Similarity is then computed based on any measure used in mono-lingual cases. As a variant, both text spans can be translated into a third pivot language.

2. In latent semantics approaches, an abstract vector space is defined based on the statistical properties of a parallel corpus (or, more rarely, of a comparable corpus). Both text spans are then represented as vectors in such a latent semantic space, where any similarity measure for vector spaces can be used.

The rest of this entry is organized as follows: first translation-based approaches will be introduced, followed by latent-semantic approaches. Finally, each of the specific CLTM tasks will be discussed in turn.
Translation-Based Approaches

The simplest approach consists in using a manually-written machine-readable bilingual dictionary: words from the first span are looked up and replaced with words in the second language (see e.g., Zhang & Vines, 2005). Since typically dictionaries contain entries for "citation forms" only (e.g., the singular for nouns, the infinitive for verbs, etc.), words in both spans are preliminarily lemmatized, i.e., replaced with the corresponding citation form. In all cases when the lexica and morphological analyzers required to perform lemmatization are not available, a frequently adopted crude alternative consists in stemming (i.e., truncating by taking away a suffix) both the words in the span to be translated and in the corresponding side of the lexicon. Some languages (e.g., Germanic languages) are characterized by a very productive compounding: simpler words are connected together to form complex words. Compound words are rarely in dictionaries as such: in order to find them it is first necessary to break compounds into their elements. This can be done based on additional linguistic resources or by means of heuristics, but in all cases it is a challenging operation in itself. If the method used afterward to compare the two spans in the target language can take weights into account, translations are "normalized" in such a way that the cumulative weight of all translations of a word is the same regardless of the number of alternative translations. Most often, the weight is simply distributed uniformly among all alternative translations. Sometimes, only the first translation for each word is kept, or the first two or three.

A second approach consists in extracting a bilingual lexicon from a parallel corpus instead of using a manually-written one. Methods for extracting probabilistic lexica look at the frequencies with which a word s in one language was translated with a word t to estimate the translation probability p(t|s). In order to determine which word is the translation of which other word in the available examples, these examples are preliminarily aligned, first at the sentence level (to know what sentence is the translation of what other sentence) and then at the word level. Several methods for aligning sentences at the word level have been proposed, and this problem is a lively research topic in itself (see Brown, Della Pietra, Della Pietra, & Mercer, 1993 for a seminal paper). Once a probabilistic bilingual dictionary is available, it can be used much in the same way as human-written dictionaries, with the notable difference that the estimated conditional probabilities provide a natural way to distribute weight across translations. When the example documents used for extracting the bilingual dictionaries are of the same style and domain as the text spans to be translated, this can result in a significant increase in accuracy for the final task, whatever this is.
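As an illustration of how such a probabilistic lexicon is used, here is a minimal sketch; the toy lexicon and its probabilities are invented for the example, whereas a real p(t|s) table would be estimated from a word-aligned parallel corpus.

from collections import defaultdict

# p(t|s): for each source word, a toy distribution over target translations
prob_lexicon = {
    "bank": {"banque": 0.7, "rive": 0.3},   # hypothetical entries
    "loan": {"pret": 1.0},
}

def translate_bow(bow):
    # Translate a weighted bag of words, distributing each source word's
    # weight across its alternative translations according to p(t|s), so
    # the cumulative weight per source word is preserved.
    translated = defaultdict(float)
    for word, weight in bow.items():
        for target, p in prob_lexicon.get(word, {}).items():
            translated[target] += weight * p
    return dict(translated)

print(translate_bow({"bank": 2.0, "loan": 1.0}))
# -> {'banque': 1.4, 'rive': 0.6, 'pret': 1.0}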
It is often the case that a parallel corpus sufficiently similar in topic and style to the spans to be translated is unavailable, or it is too small to be used for reliably estimating translation probabilities. In such cases, it can be possible to replace or complement the parallel corpus with a "comparable" corpus. A comparable corpus is a pair of collections of documents, one in each of the languages of interest, which are known to be similar in content, although not the translation of one another. A typical case might be two sets of articles from corresponding sections of different newspapers collected during a same period of time. If some additional bilingual seed dictionary (human-written or extracted from a parallel corpus) is also available, then the comparable corpus can be leveraged as well: a word t is likely to be the translation of a word s if it turns out that the words often appearing near s are translations of the words often appearing near t. Using this observation it is thus possible to estimate the probability that t is a valid translation of s even though they are not contained in the original dictionary.

Most approaches proceed by associating with s a context vector. This vector, with one component for each word in the source language, can simply be formed by summing together the count histograms of the words occurring within a fixed window centered on all occurrences of s in the corpus, but is often constructed using statistically more robust association measures, such as mutual information. After a possible normalization step, the context vector CV(s) is translated using the seed dictionary into the target language. A context vector is also extracted from the corpus for all target words t. Eventually, a translation score between s and t is computed as the inner product <Tr(CV(s)), CV(t)>, whose components are the association scores used to construct the context vectors. While effective in many cases, this approach can provide inaccurate similarity values when polysemous words and synonyms appear in the corpus. To deal with this problem, Gaussier, Renders, Matveeva, Goutte, and Déjean (2004) propose an extension which is more robust in cases when the entries in the seed bilingual dictionary do not cover all senses actually present in the two sides of the comparable corpus. Although these methods for building bilingual dictionaries can be (and often are) used in isolation, it can be more effective to combine them.
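A minimal sketch of this context-vector scoring, under toy assumptions: raw co-occurrence counts instead of mutual information, a two-sentence "corpus", and an invented seed dictionary.

from collections import Counter

def context_vector(corpus_sentences, word, window=2):
    # Co-occurrence counts of words within a fixed window around `word`;
    # real systems replace raw counts with association measures.
    cv = Counter()
    for sent in corpus_sentences:
        for i, w in enumerate(sent):
            if w == word:
                lo, hi = max(0, i - window), i + window + 1
                cv.update(x for j, x in enumerate(sent[lo:hi], lo) if j != i)
    return cv

def translation_score(cv_s, cv_t, seed_dict):
    # <Tr(CV(s)), CV(t)>: translate the source context vector entry by
    # entry with the seed dictionary, then take the inner product.
    score = 0.0
    for src_word, a in cv_s.items():
        for tgt_word in seed_dict.get(src_word, []):
            score += a * cv_t.get(tgt_word, 0)
    return score

# Toy usage: is French "banque" a plausible translation of English "bank"?
en = [["the", "bank", "approved", "the", "loan"]]
fr = [["la", "banque", "a", "approuve", "le", "pret"]]
seed = {"the": ["la", "le"], "approved": ["approuve"], "loan": ["pret"]}
print(translation_score(context_vector(en, "bank"),
                        context_vector(fr, "banque", window=3), seed))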
Using a bilingual dictionary directly is not the only way of translating a span from one language into another. A second alternative consists in using a machine translation (MT) system. While the MT system, in turn, relies on a bilingual dictionary of some sort, it is in general in the position of leveraging contextual clues to select the correct words and put them in the right order in the translation. This can be more or less useful depending on the specific task. MT systems fall, broadly speaking, into two classes: rule-based and statistical. Systems in the first class rely on sets of hand-written rules describing how words and syntactic structures should be translated. Statistical machine translation (SMT) systems learn this mapping by performing a statistical analysis of a parallel corpus. Some authors (e.g., Savoy & Berger, 2005) also experimented with combining translations from multiple machine translation systems.

Latent Semantic Approaches

In CLTM, latent semantic approaches rely on some interlingua (language-independent) representation. Most of the time, this interlingua representation is obtained by linear or non-linear statistical analysis techniques, more specifically dimensionality reduction methods with ad-hoc optimization criteria and constraints. Others, however, adopt a more manual approach by exploiting multilingual thesauri or even multilingual ontologies in order to map textual objects to a list, possibly weighted, of interlingua concepts.

For any textual object (typically a document or a section of a document), the interlingua concept representation is derived from a sequence of operations that encompass:

1. Linguistic preprocessing (as explained in previous sections, this step amounts to extracting the relevant, normalized "terms" of the textual objects, by tokenization, word segmentation/decompounding, lemmatization/stemming, part-of-speech tagging, stopword removal, corpus-based term filtering, noun-phrase extraction, etc.).

2. Semantic enrichment and/or monolingual dimensionality reduction.

3. Interlingua semantic projection.

A typical semantic enrichment method is the generalized vector space model, which adds related terms, or neighbour terms, to each term of the textual object, neighbour terms being defined by some co-occurrence measure (for instance, mutual information). Semantic enrichment can alternatively be achieved by using a (monolingual) thesaurus, exploiting relationships such as synonymy, hyperonymy and hyponymy. Monolingual dimensionality reduction typically consists in performing some latent semantic analysis (LSA), a form of principal component analysis on the textual object/term matrix. Dimensionality reduction techniques such as LSA, or their discrete/probabilistic variants such as probabilistic latent semantic analysis (PLSA) and latent dirichlet allocation (LDA), offer to some extent a semantic robustness to deal with the effects of polysemy/synonymy, adopting a language-dependent concept representation in a space of dimension much smaller than the size of the vocabulary in a language.
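Step (1), for instance, might look like the following minimal sketch, where the stopword list and the crude suffix-stripping "stemmer" are toy stand-ins for real linguistic resources:

import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "or", "in", "to", "is"}  # toy list

def crude_stem(word):
    # Naive suffix stripping; a real system would use a proper stemmer
    # or lemmatizer.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Tokenize, drop stopwords, stem: yields the normalized bag of terms
    # on which the later enrichment/projection steps operate.
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOPWORDS)

print(preprocess("The banks approved the loans in a hurry"))
# Counter({'bank': 1, 'approved': 1, 'loan': 1, 'hurry': 1})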
Of course, steps (1) and (2) are highly language-dependent. Textual objects written in different languages will not follow the same linguistic processing or semantic enrichment/dimensionality reduction. The last step (3), however, aims at projecting textual objects into the same language-independent concept space, for any source language. This is done by first extracting these common concepts, typically from a parallel corpus that offers a natural multiple-view representation of the same objects. Starting from these multiple-view observations, common factors are extracted through the use of canonical correlation analysis (CCA), cross-language latent semantic analysis, their kernelized variants (e.g., Kernel-CCA) or their discrete, probabilistic extensions (cross-language latent dirichlet allocation, multinomial CCA, ...). All these methods try to discover latent factors that simultaneously explain as much as possible of the "intra-language" variance and the "inter-language" correlation. They differ in the choice of the underlying distributions and in how they precisely define and combine these two criteria. The following subsections describe them in more detail.

As already emphasized, CLTM mainly relies on defining appropriate similarities between textual objects expressed in different languages. Numerous categorization, clustering and retrieval algorithms focus on defining efficient and powerful measures of similarity between objects, as strengthened recently by the development of kernel methods for textual information access. We will see that the (linear) statistical algorithms used for performing steps (2) and (3) can most of the time be embedded into one valid (Mercer) kernel, so that we can very easily obtain non-linear variants of these algorithms, just by adopting some standard non-linear kernels.

Cross-Language Semantic Analysis

This approach amounts to concatenating the vectorial representations of each view of the objects of the parallel collection (typically, objects are aligned sentences), and then performing a standard singular value decomposition of the global object/term matrix. Equivalently, defining the kernel similarity matrix between all pairs of multi-view objects as the sum of the mono-lingual textual similarity matrices, this amounts to performing the eigenvalue decomposition of the corresponding kernel Gram matrix, if a dual formulation is adopted. The number of eigenvalues/eigenvectors retained to define the latent factors and the corresponding projections typically ranges from several hundred to several thousand components, still much fewer than the original sizes of the vocabularies. Note that this process does not really control the formation of interlingua concepts: nothing prevents the method from extracting factors that are linear combinations of terms in one language only.
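A minimal numpy sketch of this cross-language LSA, assuming random placeholder matrices in place of real term counts from aligned documents:

import numpy as np

rng = np.random.default_rng(0)
n1, n2, N = 50, 60, 40            # vocab sizes of L1/L2, number of aligned pairs
X1 = rng.poisson(0.3, (n1, N))    # L1 term-by-document counts
X2 = rng.poisson(0.3, (n2, N))    # L2 term-by-document counts (same docs)

X = np.vstack([X1, X2]).astype(float)   # concatenate the two views
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 10
U_k = U[:, :k]                    # latent "interlingua" directions

def project(doc_vec, lang):
    # Embed a monolingual document vector into the shared k-dim space by
    # zero-padding the other language's block before projecting on U_k.
    full = np.zeros(n1 + n2)
    if lang == "L1":
        full[:n1] = doc_vec
    else:
        full[n1:] = doc_vec
    return U_k.T @ full

d1 = project(rng.poisson(0.3, n1), "L1")
d2 = project(rng.poisson(0.3, n2), "L2")
cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2) + 1e-12)
print(f"cross-language cosine in latent space: {cos:.3f}")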
Cross-Language Latent Dirichlet Allocation

The extraction of interlingua components is realised by using LDA to model the set of parallel objects, imposing the same proportions of components (topics) for all views of the same object. This is represented in Fig. 1. [Figure 1: Latent dirichlet allocation of a parallel corpus.] LDA here performs a form of clustering, with a predefined number of components (K) and with the constraint that the two views of the same object belong to the clusters with the same membership values. This results in 2K component profiles that are then used for "folding in" (projecting) new documents, by launching some form of EM to derive their posterior probabilities of belonging to each language-independent component. The similarity between two documents written in different languages is obtained by comparing their posterior distributions over these latent classes. Note that this approach could easily integrate supervised topic information and provides a nice framework for semi-supervised interlingua concept extraction.

Cross-Language Canonical Correlation Analysis

The Primal Formulation

CCA is a standard statistical method for multi-block multivariate analysis, the goal being to find linear combinations of variables for each block (i.e., each language) that are maximally correlated. In other words, CCA is able to enforce the commonality of latent concept formation by extracting maximally correlated projections. Starting from a set of paired views of the same objects (typically, aligned sentences of a parallel corpus) in languages L1 and L2, the algebraic formulation of this optimization problem leads to a generalized eigenvalue problem of size (n1 + n2), where n1 and n2 are the sizes of the vocabularies in L1 and L2 respectively. For obvious scalability reasons, the dual, or kernel, formulation (of size N, the number of paired objects in the training set) is often preferred.

Kernel Canonical Correlation Analysis

Basically, kernel canonical correlation analysis amounts to doing CCA in some implicit, but more complex, feature space and expressing the projection coefficients as linear combinations of the training paired objects. This results in the dual formulation, a generalized eigenvalue/vector problem of size 2N that involves only the monolingual kernel Gram matrices K1 and K2 (the matrices of monolingual textual similarities between all pairs of training objects in languages L1 and L2 respectively). Note that it is easy to show that the eigenvalues go in pairs: we always have two symmetrical eigenvalues +λ and −λ. This kernel formulation has the advantage of allowing any text-specific prior properties to be included in the kernel (e.g., use of N-gram kernels, word-sequence kernels, and any semantically-smoothed kernel). After extraction of the first k generalized eigenvalues/eigenvectors, the similarity between any pair of test objects in languages L1 and L2 can be computed by using projection matrices composed of the extracted eigenvectors, as well as the (monolingual) kernels of the test objects with the training objects.

Regularization and Partial Least Squares Solution

When the number of training examples (N) is less than n1 and n2 (the dimensions of the monolingual feature spaces), the eigenvalue spectrum of the KCCA problem generally has two null eigenvalues (due to data centering), (N − 1) eigenvalues at +1 and (N − 1) eigenvalues at −1, so that, as such, the KCCA problem only yields trivial solutions and is useless. When using kernel methods, the case N < n1, n2 is frequent, so some regularization scheme is needed. One way of realizing this regularization is to resort to finding the directions of maximum covariance (instead of correlation): this can be considered as a partial least squares (PLS) problem, whose formulation is very similar to the CCA problem. Adopting a mixed CCA/PLS criterion (trying to maximize a combination of covariance and correlation between projections) turns out both to avoid over-fitting (or spurious solutions) and to enhance numerical stability.

Approximate Solutions

Both CCA and KCCA suffer from a lack of scalability, due to the fact that the complexity of generalized eigenvalue/vector decomposition is O(N^3) for KCCA or O(min(n1, n2)^3) for CCA. As it can be shown that performing a complete KCCA (or KPLS) analysis amounts to first doing complete KPCAs and then a linear CCA (or PLS) on the resulting projections, the complexity can obviously be reduced by working on a reduced-rank approximation (incomplete KPCA) of the kernel matrices. However, the implicit projections derived from incomplete KPCA may not be optimal with respect to the cross-correlation or covariance criteria. Another idea for decreasing the complexity is to perform an incomplete Cholesky decomposition of the (monolingual) kernel matrices K1 and K2 (equivalent to partial Gram-Schmidt orthogonalisation in the feature space): K1 = G1 G1^T and K2 = G2 G2^T, with Gi of rank k << N. Considering Gi as the new representation of the training data, KCCA then reduces to solving a generalized eigenvalue problem of size 2k.
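A hedged numpy/scipy sketch of regularized linear CCA on paired document vectors follows; the synthetic data is a toy stand-in for real paired term vectors, and the ridge term tau stands in for the regularization/PLS mixing discussed above.

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
N, n1, n2 = 100, 20, 30
Z = rng.normal(size=(N, 5))                 # shared latent signal
X = Z @ rng.normal(size=(5, n1)) + 0.1 * rng.normal(size=(N, n1))
Y = Z @ rng.normal(size=(5, n2)) + 0.1 * rng.normal(size=(N, n2))
X -= X.mean(0)
Y -= Y.mean(0)

tau = 0.1                                   # regularizer strength
Cxx = X.T @ X / N + tau * np.eye(n1)
Cyy = Y.T @ Y / N + tau * np.eye(n2)
Cxy = X.T @ Y / N

# Solve Cxy Cyy^{-1} Cyx u = rho^2 Cxx u  (generalized symmetric problem)
M = Cxy @ np.linalg.solve(Cyy, Cxy.T)
rho2, U = eigh(M, Cxx)
u1 = U[:, np.argmax(rho2)]                  # top canonical direction for X
v1 = np.linalg.solve(Cyy, Cxy.T @ u1)       # matching direction for Y

a, b = X @ u1, Y @ v1
print(f"top canonical correlation (regularized): {np.corrcoef(a, b)[0, 1]:.3f}")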
Specific Applications

The previous sections illustrated a number of different ways of solving the core problem of cross-language text mining: quantifying the similarity between two spans of text in different languages. In this section we turn to describing some actual applications relying on these methods.

Cross-Language Information Retrieval (CLIR)

Given a collection of documents in several languages and a single query, the CLIR problem consists in producing a single ranking of all documents according to their relevance to the query. CLIR is particularly useful whenever a user has some knowledge of the languages in which documents are written, but not enough to express his/her information needs in those languages by means of a precise query. Sometimes CLIR engines are coupled with translation tools to help the user access the content of relevant documents written in languages unknown to him/her. In this case document collections in an even larger number of languages can be effectively queried. It is probably fair to say that the vast majority of CLIR systems use a translation-based approach. In most cases it is the query which is translated into all languages before being sent to monolingual search engines. While this limits the amount of translation work that needs to be done, it requires doing it on-line at query time. Moreover, when queries are short it can be difficult to translate them correctly, since there is little context to help identify the correct sense in which words are used. For these reasons several groups have also proposed translating all documents at indexing time instead. Regardless of whether queries or documents are translated, whenever similarity scores between (possibly translated) queries and (possibly translated) documents are not directly comparable, all methods face the problem of merging multiple monolingual rankings into a single multilingual ranking.

Research in CLIR and cross-language question answering (see below) has been significantly stimulated by at least three government-sponsored evaluation campaigns:

• The NII Test Collection for IR Systems (NTCIR) (http://research.nii.ac.jp/ntcir/), running yearly since 1999, focusing on Asian languages (Japanese, Chinese, Korean) and English.
• The Cross-Language Evaluation Forum (CLEF) (http://www.clef-campaign.org), running yearly since 2000, focusing on European languages.
• A cross-language track at the Text Retrieval Conference (TREC) (http://trec.nist.gov/), which was run until 2002, focused on querying documents in Arabic using queries in English.

The respective websites are ideal starting points for any further exploration of the subject.

Cross-Language Question Answering (CLQA)

Question answering is the task of automatically finding the answer to a specific question in a document collection. While in practice this vague description can be instantiated in many different ways, the sense in which the term is mostly understood is strongly influenced by the task specification formulated by the National Institute of Standards and Technology (NIST) of the United States for its TREC evaluation conferences (see above). In this sense, the task consists in identifying a text snippet, i.e., a substring of a predefined maximal length (e.g., 50 characters, or 200 characters), within a document in the collection containing the answer. Different classes of questions are considered:

• Questions about facts and events.
• Questions requiring the definition of people, things and organizations.
• Questions requiring as answer lists of people, objects or data.

Most proposals for solving the QA problem proceed by first identifying promising documents (or document segments) using information retrieval techniques, treating the question as a query, and then performing some finer-grained analysis to converge on a sufficiently short snippet. Questions are classified in a hierarchy of possible "question types." Also, documents are preliminarily indexed to identify elements (e.g., person names) that are potential answers to questions of relevant types (e.g., "Who" questions). Cross-language question answering (CLQA) is the extension of this task to the case where the collection contains documents in a language different from the language of the question.
In this task a CLIR step replaces the monolingual IR step to shortlist promising documents. The classification of the question is generally done in the source language. Both CLEF and NTCIR (see above) organize cross-language question answering comparative evaluations on an annual basis.

Cross-Language Categorization (CLCat) and Clustering (CLClu)

Cross-language categorization tackles the problem of categorizing documents in different languages in a same categorization scheme. The vast majority of document categorization systems rely on machine learning techniques to automatically acquire the necessary knowledge (often referred to as a model) from a possibly large collection of manually categorized documents. Most often the model is based on frequency counts of words, and is thus intrinsically language-dependent. The most direct way to perform categorization in different languages would consist in manually categorizing a sufficient amount of documents in all languages of interest and then training a set of independent categorizers. In some cases, however, it is impractical to manually categorize a sufficient number of documents to ensure accurate categorization in all languages, while it can be easier to find bilingual dictionaries or parallel (or comparable) corpora for the language pairs and application domain of interest. In such cases it is preferable to obtain manually categorized documents for a single language A only and use them to train a monolingual categorizer. Any of the translation-based approaches described above can then be used to translate a document originally in language B, or most often its representation as a bag of words, into language A. Once the document is translated, it can be categorized using the monolingual A system (a sketch of this variant follows below). As an alternative, latent-semantics approaches can be used as well. An existing parallel corpus can be used to identify an abstract vector space common to A and B. The manually categorized documents in A can then be represented in this space, and a model can be learned which operates directly on this latent-semantic representation. Whenever a document in B needs to be categorized, it is first projected into the common semantic space and then categorized using the same model.

All these considerations carry over unchanged to the cross-language clustering task, which consists in identifying subsets of documents in a multilingual document collection which are mutually similar to one another according to some criterion. Again, this task can be effectively solved either by translating all documents into a single language or by learning a common semantic space and performing the clustering there. While CLCat and clustering are relevant tasks in many real-world situations, it is probably fair to say that less effort has been devoted to them by the research community than to CLIR and CLQA.
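Here is that sketch: a toy translation-based cross-language categorizer, assuming scikit-learn and an invented two-document training set and B-to-A lexicon (all names and numbers are illustrative, not a real system).

from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

# Manually categorized training documents in language A (toy bags of words)
train_A = [({"loan": 2, "bank": 1}, "finance"),
           ({"match": 2, "goal": 1}, "sports")]
vec = DictVectorizer()
X = vec.fit_transform([bow for bow, _ in train_A])
clf = MultinomialNB().fit(X, [label for _, label in train_A])

# Toy probabilistic B->A lexicon (invented entries)
lexicon_BA = {"pret": {"loan": 1.0}, "banque": {"bank": 0.8}}

def translate_bow(bow):
    # Map a language-B bag of words into language A, spreading weights
    # according to the lexicon probabilities.
    out = {}
    for w, c in bow.items():
        for t, p in lexicon_BA.get(w, {}).items():
            out[t] = out.get(t, 0.0) + c * p
    return out

doc_B = {"pret": 3, "banque": 1}
print(clf.predict(vec.transform([translate_bow(doc_B)])))  # ['finance']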
Recommended Reading

Brown, P. F., Della Pietra, V. J., Della Pietra, S. A., & Mercer, R. L. (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2), 263–311.

Gaussier, E., Renders, J.-M., Matveeva, I., Goutte, C., & Déjean, H. (2004). A geometric view on bilingual lexicon extraction from comparable corpora. In Proceedings of the 42nd annual meeting of the Association for Computational Linguistics, Barcelona, Spain. Morristown, NJ: Association for Computational Linguistics.

Savoy, J., & Berger, P. Y. (2005). Report on CLEF-2005 evaluation campaign: Monolingual, bilingual and GIRT information retrieval. In Proceedings of the cross-language evaluation forum (CLEF) (pp. 131–140). Heidelberg: Springer.

Zhang, Y., & Vines, P. (2005). Using the web for translation disambiguation. In Proceedings of the NTCIR-5 workshop meeting, Tokyo, Japan.
Category: Text Mining | 0 comments
Google Scholar imports into EndNote come out as Ancient Text
dingyuan0314 2010-3-9 22:24
I search for papers with Google Scholar and use its built-in Import into EndNote feature to save the records in a format EndNote can import directly. A few days ago I suddenly found that everything imported came in wrong, mainly as follows: the reference type, which should be Journal Article, was changed to Ancient Text or Generic, so the imported references had no journal name; even changing the reference type from Ancient Text back to Journal Article inside EndNote afterwards did not fix the problem. I later found that records saved via the Import into EndNote option on other journals' home pages had the same problem. This is a bug in EndNote X, and it is fairly common in the X2 version. The fix is to update the program: go to the menu Help - EndNote Program Updates..., follow the prompts through, restart, and the problem is solved.
Category: Life Notes | 3677 reads | 0 comments
