Splitting a paragraph into its sentences is not a trivial task. Sentence beginnings and endings are irregular, and periods can appear inside a sentence (abbreviations, decimals, domain names), so a single regular expression cannot separate sentences reliably: sometimes it works, but most of the time it gets something wrong. Here we also use the nltk module.

Part 1: Using regular expressions

    import re

    paragraph = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, "
                 "i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks "
                 "he didn't. In any case, this isn't true... Well, with a "
                 "probability of .9 it isn't. I say. What's wrong with you? "
                 "I am confused by your activity.")

    # Match the special space that ends a sentence, so that afterwards
    # we only need to split on that space.
    rule = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s")
    result = re.split(rule, paragraph)
    for sentence in result:
        print(sentence)

If the paragraph itself contains double quotes, wrapping it in double quotes raises an error; in that case switch to triple double quotes or triple single quotes instead (tested, it works). The regular expression would of course need to change accordingly. Below is code that uses the regular expression to extract the sentences from a text file.

    import re

    # Open the txt file, which must be in ANSI format;
    # a txt file in Unicode format doesn't work. I don't know why.
    input_file = open('test.txt')
    input_result = input_file.read()
    rule = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s")
    result = re.split(rule, input_result)
    #for sentence in result:
    #    print(sentence)
    input_file.close()

    # This will create the output.txt file for you.
    output_file = open('output.txt', 'a+')
    for sentence in result:
        output_file.write(sentence)
        output_file.write('\n')
    output_file.close()

Part 2: Extracting sentences from a string

    from nltk import tokenize

    paragraph = "Good morning Dr. Adams. The patient is waiting for you in room number 3."
    print(tokenize.sent_tokenize(paragraph))

Part 3: Extracting sentences from a text file

    import nltk.data

    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    fp = open('test.txt')
    data = fp.read()
    print('\n-----\n'.join(tokenizer.tokenize(data)))

Note: for now I have not been able to install the nltk module; the installer complains about a missing DLL file!

References:
http://stackoverflow.com/questions/9474395/how-to-break-up-a-paragraph-by-sentences-in-python
http://stackoverflow.com/questions/4576077/python-split-text-on-sentences
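As a minimal, self-contained sketch, the regex heuristic from Part 1 can be wrapped in a small helper function (the name `split_sentences` is mine, not from the original post). It needs only the standard library, so it runs even when nltk cannot be installed:

```python
import re

# Sentence-boundary heuristic from Part 1: split on a space that
# follows "." or "?", unless the dot belongs to an abbreviation like
# "i.e." or "Jr.", or is an internal dot as in "cheapsite.com".
SENT_BOUNDARY = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s")

def split_sentences(text):
    # Return the list of sentence strings.
    return SENT_BOUNDARY.split(text)

sentences = split_sentences(
    "Mr. Smith bought cheapsite.com for 1.5 million dollars, "
    "i.e. he paid a lot for it. Did he mind? "
    "Adam Jones Jr. thinks he didn't."
)
for s in sentences:
    print(s)
```

Note the limits of the heuristic: it only recognizes "." and "?" as sentence enders (not "!"), and it misreads any abbreviation shaped like a capital letter plus a lowercase letter plus a dot. Those corner cases are exactly why the post falls back to nltk's trained Punkt tokenizer in Parts 2 and 3.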
Complete Genome Sequence of a Street Rabies Virus Isolated from a Rabid Dog in China

A rabies virus (RABV) was isolated from a dog in Anhui Province, China, in 2008. The virus was designated DRV-AH08. Its entire genome was sequenced and found to be closely related to RABV recently isolated in China and other Asian countries (homology of 87 to 98%) but distantly related to RABV in the "cosmopolitan" group (homology of 84 to 85%) in clade I of RABV.

Authors: Fulai Yu,a Guoqing Zhang,a,b Shaobo Xiao,a Liurong Fang,a Gelin Xu,c Jiaxing Yan,c Huanchun Chen,a and Zhen F. Fu a,b

Affiliations:
a) State-Key Laboratory of Agricultural Microbiology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, China
b) Department of Pathology, College of Veterinary Medicine, University of Georgia, Athens, Georgia, USA
c) Wuhan Institute of Biological Products, Sino Pharma, Wuhan, China

Source: J. Virology. 2012, 86(19):10890. DOI: 10.1128/JVI.01775-12.
Full text: J. Virol.-2012-Yu-10890-1.pdf
Brown sugar contains some colored substances. To turn it into white sugar, dissolve the brown sugar in water, add an appropriate amount of activated carbon to adsorb the colored substances, then filter, concentrate, and cool the solution to obtain white sugar. Weigh out 5~10 g of brown sugar into a small beaker, add 40 mL of water, and heat until it dissolves. Add 0.5~1 g of activated carbon, stirring constantly while heating, then filter the suspension while hot to obtain a colorless liquid. If the filtrate is yellow, add more activated carbon as needed until it is colorless. Transfer the filtrate to a small beaker and concentrate it by evaporation over a water bath; white sugar will then crystallize out.