GBrowse and GFF

The purpose of this document is to explore how the tables in the MySQL database used by GBrowse relate to the GFF from which they are populated. The conclusions summarize how I currently think it all works, and are the part most likely to be of interest to others.

Methods
Results
    Table Structure
    Fate of typical GFF fields
    Population of fattribute and fattribute_to_feature tables
    Alignments
Conclusions

Methods

I will be using a system set up as described in my previous document. In addition to the work described on that page, I have loaded WormBase release 130 (ws130). It is that dataset that I will be exploring in the following examples.

Results

Table Structure

The table structure of GBrowse comprises seven tables: fdata, ftype, fgroup, fdna, fmeta, fattribute_to_feature and fattribute.

Fate of typical GFF fields

First, I will explore where exactly the information for a particular line of GFF ends up. Here is an example line from the ws130 GFF file:

I Genefinder CDS 252119 253587 . + . CDS "Y48G1BM.gc6"

or, split into traditional GFF fields by tabs,

Example line from ws130 GFF
reference sequence   I
source               Genefinder
type                 CDS
start position       252119
end position         253587
score                .
strand               +
phase                .
group                CDS "Y48G1BM.gc6"

I will now go through the seven tables listed above to determine the fate of this information.

fdata

Table fdata (1 row)
fref  fstart  fstop   fbin          ftypeid  fscore  fstrand  fphase  gid  ftarget_start  ftarget_stop
I     252119  253587  10000.000025  6        null    +        null    121  null           null

ftype

Table ftype (1 row)
ftypeid  fmethod  fsource
6        CDS      Genefinder

fgroup

Table fgroup (1 row)
gid  gclass  gname
121  CDS     Y48G1BM.gc6

fdna

Table fdna contained sequence as referenced by the fref field of fdata.

fmeta

There were no relevant rows in fmeta.

fattribute_to_feature

There were no relevant rows in fattribute_to_feature.

fattribute

There were no relevant rows in fattribute.

Comparison with the GBrowse Tutorial reveals that the ftype table is used to determine which track to display this GFF span in, and the fgroup table is used when the user searches for a feature. This all makes intuitive sense. The fields that are null in fdata may pose a problem. Two of the fields that are null, fscore and fphase, are analogous to the similarly named fields in the GFF file. There are at least two unresolved issues:

How can fattribute get populated?
How can ftarget_start and ftarget_stop get populated?

Population of fattribute and fattribute_to_feature tables

Consider the fate of the following lines of GFF,

I Coding_transcript intron 11690 14950 . + . Transcript "Y74C9A.2.4" ; Confirmed_EST yk1139h01.3
I Coding_transcript intron 11690 14950 . + . Transcript "Y74C9A.2.3" ; Confirmed_EST yk1139h01.3
I Coding_transcript intron 11690 14950 . + . Transcript "Y74C9A.2.2" ; Confirmed_EST yk1139h01.3
I curated intron 11690 14950 . + . CDS "Y74C9A.2" ; Confirmed_EST yk1139h01.3
I Coding_transcript intron 11690 14950 . + . Transcript "Y74C9A.2.1" ; Confirmed_EST yk1139h01.3
I Genefinder intron 11690 14950 . + . CDS "Y74C9A.gc2" ; Confirmed_EST yk1139h01.3

or, taking the first of these lines field by field,

fattribute example line from ws130 GFF
reference sequence   I
source               Coding_transcript
type                 intron
start position       11690
end position         14950
score                .
strand               +
phase                .
group                Transcript "Y74C9A.2.4" ; Confirmed_EST yk1139h01.3

Yields the following table contents:

fdata

Table fdata (6 rows)
fid      fref  fstart  fstop  fbin          ftypeid  fscore  fstrand  fphase  gid  ftarget_start  ftarget_stop
9406655  I     11690   14950  10000.000001  20       null    +        null    8    null           null
9406656  I     11690   14950  10000.000001  20       null    +        null    6    null           null
9406657  I     11690   14950  10000.000001  20       null    +        null    9    null           null
9406658  I     11690   14950  10000.000001  21       null    +        null    10   null           null
9406659  I     11690   14950  10000.000001  20       null    +        null    7    null           null
9406660  I     11690   14950  10000.000001  22       null    +        null    11   null           null

ftype

Table ftype (3 rows)
ftypeid  fmethod  fsource
20       intron   Coding_transcript
21       intron   curated
22       intron   Genefinder

fgroup

Table fgroup (6 rows)
gid  gclass      gname
6    Transcript  Y74C9A.2.3
7    Transcript  Y74C9A.2.1
8    Transcript  Y74C9A.2.4
9    Transcript  Y74C9A.2.2
10   CDS         Y74C9A.2
11   CDS         Y74C9A.gc2

fdna

Table fdna contained sequence as referenced by the fref field of fdata.

fmeta

There were no relevant rows in fmeta.

fattribute_to_feature

Table fattribute_to_feature (6 rows)
fid      fattribute_id  fattribute_value
9406655  2              yk1139h01.3
9406656  2              yk1139h01.3
9406657  2              yk1139h01.3
9406658  2              yk1139h01.3
9406659  2              yk1139h01.3
9406660  2              yk1139h01.3

fattribute

Table fattribute (1 row)
fattribute_id  fattribute_name
2              Confirmed_EST

It should be mentioned that in searching for yk1139h01.3 in fattribute_to_feature, I also found the following:

fid    fattribute_id  fattribute_value
12236  2              yk1139h01.3
12237  2              yk1139h01.3
12238  2              yk1139h01.3
12239  2              yk1139h01.3
12240  2              yk1139h01.3
12241  2              yk1139h01.3

However, there are no entries in fdata that correspond to those fids, and thus they are in principle useless. I am not entirely sure where these rows are coming from, but I note that when grepping the ws130 GFF files for yk1139h01.3, all the entries appeared to be duplicated. It is possible that there is a great deal of redundancy in my ws130 database because of the recent change in reference sequence names from CHROMOSOME_I to I.
These naming issues appear to be resolved by bp_process_wormbase.pl, but the duplicate lines are not collapsed. Perhaps it would be worth running a second-pass script after bp_process_wormbase.pl that removes redundant lines; this is a non-trivial prospect because of the enormous size of the complete GFF for WormBase.

Alignments

The following lines of GFF likely give rise to alignment entries in the GBrowse tables.

I BLAT_EST_BEST EST_match 11539 11561 99.7 - . Target "Sequence:yk1139h01.3" 658 636
I BLAT_EST_BEST EST_match 11618 11632 99.7 - . Target "Sequence:yk1139h01.3" 635 621
I BLAT_EST_BEST EST_match 11633 11689 99.7 - . Target "Sequence:yk1139h01.3" 619 563
I BLAT_EST_BEST EST_match 14951 15160 99.7 - . Target "Sequence:yk1139h01.3" 562 353
I BLAT_EST_BEST EST_match 16473 16781 99.7 - . Target "Sequence:yk1139h01.3" 352 44
I BLAT_EST_BEST EST_match 16783 16800 99.7 - . Target "Sequence:yk1139h01.3" 43 26
I BLAT_EST_BEST EST_match 16802 16817 99.7 - . Target "Sequence:yk1139h01.3" 25 10
I BLAT_EST_BEST EST_match 16820 16827 99.7 - . Target "Sequence:yk1139h01.3" 8 1

Searching fdata for just the start and stop of the first one, I get

fdata
fid      fref  fstart  fstop  fbin         ftypeid  fscore  fstrand  fphase   gid    ftarget_start  ftarget_stop
9627963  I     11539   11561  1000.000011  55       99.7    -        (empty)  12476  636            658

fgroup
gid    gclass    gname
12476  Sequence  yk1139h01.3

ftype
ftypeid  fmethod    fsource
55       EST_match  BLAT_EST_BEST

fattribute_to_feature

There were no rows in fattribute_to_feature with fid 9627963.

This doesn't make a whole lot of sense: the alignment itself contributes nothing to fattribute_to_feature, so I don't understand where those other entries in fattribute_to_feature are coming from or what role they serve.

Conclusions

Fate of GFF lines

The typical line of GFF results in entries in the fdata, ftype and fgroup tables. The fdata table holds most of the information from the GFF file, except for the contents of the source, type and group fields.
The source and type fields form a unique pair in the ftype table, and are referenced by the ftypeid in the fdata table. The first semicolon-separated pair of terms in the group field is placed as a unique pair in the fgroup table, and referenced by the gid in the fdata table.

Special case: Alignments

If the line of GFF represents an alignment, the group field will have a special structure similar to

Target "Sequence:yk1139h01.3" 658 636

The Target group class is recognized by bp_load_gff.pl as signifying an alignment, and the next token is split on a colon to generate the real group class and name. The two tokens after that are taken as the start and stop of the alignment on the target sequence. In the example above they were stored in ascending order (ftarget_start 636, ftarget_stop 658) even though the GFF lists them as 658 636. I am not sure if the class of the target must always be Sequence, but it would make sense.

Special case: Additional attributes

If there are one or more semicolons in the group field, such as

Transcript "Y74C9A.2.4" ; Confirmed_EST yk1139h01.3

it is split on them: the first pair of terms is used as the group class and name, and the later pairs of terms form attribute names and values. Perhaps for performance reasons, instead of storing both the name and value in the fattribute_to_feature table, the attribute name is stored once in the fattribute table and referenced by the fattribute_id.

Enduring Mysteries

The only thing I haven't been able to figure out about the GBrowse tables is where the seemingly useless entries in the fattribute_to_feature table come from, and what they could conceivably be used for. My best explanation is that they are an artifact caused by the repetition of the GFF in the bp_process_wormbase.pl script output.

This web page was written by Alok Saldanha (alok at caltech dot edu).
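
The handling of the GFF group field summarized in the GBrowse conclusions above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the code of bp_load_gff.pl: the function name parse_group and the returned dictionary keys are hypothetical, splitting on every semicolon is a simplification that would break on a quoted name containing one, and storing the target coordinates in ascending order is an assumption based on the single fdata row observed above.

```python
import shlex


def parse_group(group_field):
    """Split a GFF group field into class, name, target span and attributes.

    Hypothetical helper that mirrors the behaviour described in the text.
    """
    # Semicolons separate the leading class/name pair from later attributes.
    parts = [p.strip() for p in group_field.split(";") if p.strip()]

    # shlex.split tokenizes on whitespace and strips the double quotes.
    head = shlex.split(parts[0])
    gclass, gname = head[0], head[1]

    target_start = target_stop = None
    if gclass == "Target":
        # Alignment: the quoted token holds "Class:name", then two coordinates.
        gclass, gname = gname.split(":", 1)
        first, second = int(head[2]), int(head[3])
        # Assumption: coordinates are stored in ascending order.
        target_start, target_stop = min(first, second), max(first, second)

    # Every later semicolon-separated pair is an attribute name and value.
    attributes = []
    for part in parts[1:]:
        tokens = shlex.split(part)
        attributes.append((tokens[0], " ".join(tokens[1:])))

    return {
        "gclass": gclass,
        "gname": gname,
        "target_start": target_start,
        "target_stop": target_stop,
        "attributes": attributes,
    }


alignment = parse_group('Target "Sequence:yk1139h01.3" 658 636')
intron = parse_group('Transcript "Y74C9A.2.4" ; Confirmed_EST yk1139h01.3')
```

Under these assumptions, alignment carries the class Sequence, the name yk1139h01.3 and the span 636 to 658, matching the fgroup and fdata rows shown earlier, while intron carries one attribute pair, (Confirmed_EST, yk1139h01.3), which corresponds to the fattribute and fattribute_to_feature rows.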
A range of policy options are available for driving green growth. This document outlines these options and summarises many of the issues that need to be taken into account when embarking on a green growth strategy.

Diagnose key constraints to green growth

As discussed in Towards Green Growth, there are a range of constraints which can prevent the emergence of greener growth. These will vary from country to country and depending on the particular environmental issues at stake. Figure 1 develops a diagnostic framework for identifying key constraints to greening growth. It characterises constraints to green growth as factors which limit returns to “green” investment and innovation, i.e. those activities which can foster economic growth and development while ensuring that natural assets continue to provide the resources and environmental services on which our well-being relies. These constraints are divided into two categories:

The first is low overall economic returns, encapsulating factors which create inertia in economic systems (i.e. fundamental barriers to change and innovation) and capacity constraints, or “low social returns”.

The second is low appropriability of returns. This is where market and government failures prevent people from capturing the full value of improved environmental outcomes and efficiency of resource use. Examples include fossil fuel subsidies (government failure) or a lack of incentives for constructing energy efficient buildings (split incentives) or reducing air pollution (negative externalities).

Low economic returns which are a function of inertia constrain the expansion of new or innovative production techniques, technologies and patterns of consumption. These constraints to green innovation are a mixture of market failure and market imperfection. Low returns to R&D are a market failure. Network effects (e.g.
barriers to entry that arise from increasing returns to scale in networks) and the bias in the market towards existing technologies are examples of market imperfection. The exception to this is that government failure can arise from attempts to deal with these market failures (e.g. regulatory barriers to competition and government monopolies in network industries).

“Low social returns” implies the absence of enabling conditions for increasing returns to low environmental impact activities. These constraints reduce the choices of consumers and producers to pursue “green” activities. For example, inadequate electricity or water sanitation infrastructure may lead to water pollution, the use of high emission fuels or inefficient production of electricity. They can also include insufficient human capital, such that people are not aware of alternative sources of energy or there is insufficient technical know-how to deploy them. In addition, at low levels of development, a mixture of poor infrastructure with low human capital and institutional quality can mean heavy reliance on natural resource extraction and little incentive for improved natural resource use such as sustainable forest management. These constraints reflect a mixture of government failure, market failures and market imperfections.

Original source: http://www.oecd.org/dataoecd/32/48/48012326.pdf
R functions for “Regression-type estimation of the parameters of symmetric α-stable laws”

1 Introduction

Type: Regression-type estimation of the parameters of symmetric α-stable laws
Version: 1.0
Date: 2011-11-23
Author: Shibin Zhang
Maintainer: Shibin Zhang sbzhang@shmtu.edu.cn
Description: This document provides R functions for regression-type estimation of the parameters of symmetric α-stable laws.
Usage: To use the software, you will need to download the file regressiontypeestsyS.R into a suitable directory on your computer. This contains the functions listed below and various supporting functions. You should not need to look at the R code in this file unless you want to see the details of what’s going on.

2 The functions

regressiontypestable(x,rp=1,error=10)

The function regressiontypeestsyS is used to estimate the parameters of a symmetric α-stable law. It employs the method in Koutrouvelis (1980) and Akgiray and Lamoureux (1989). In the function regressiontypeestsyS, x is the sample. rp is the maximum number of recursions. error also controls the number of recursions, but it specifies the maximal difference between two consecutive estimates. See Koutrouvelis (1980) and Akgiray and Lamoureux (1989) for more details.

References

V. Akgiray, C.G. Lamoureux, “Estimation of stable-law parameters: a comparative study,” J. Bus. Econ. Statist., vol. 7, no. 1, pp. 85-93, 1989.
I. Koutrouvelis, “Regression-type estimation of the parameters of stable laws,” J. Amer. Statist. Assoc., vol. 75, pp. 918-928, 1980.

regressiontypeestsyS.R
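
The regression idea behind the estimator documented above can be sketched as follows. For a symmetric α-stable law with scale c, the characteristic function satisfies |φ(t)|² = exp(-2|ct|^α), so log(-log|φ(t)|²) is linear in log|t| with slope α and intercept log(2c^α). The Python sketch below fits that line to the empirical characteristic function. It is a single-pass illustration only and not a port of regressiontypeestsyS.R: the function names, the grid t_k = πk/25 with ten points, and the absence of any iteration (so nothing corresponding to rp or error) are all assumptions made for the example.

```python
import math
import random


def ecf_squared_modulus(sample, t):
    """Squared modulus of the empirical characteristic function at t."""
    n = len(sample)
    real = sum(math.cos(t * x) for x in sample) / n
    imag = sum(math.sin(t * x) for x in sample) / n
    return real * real + imag * imag


def regression_estimate(sample, num_points=10):
    """One-pass regression estimate of (alpha, scale); hypothetical helper."""
    # Assumed grid of evaluation points for the characteristic function.
    ts = [math.pi * k / 25 for k in range(1, num_points + 1)]
    w = [math.log(t) for t in ts]
    y = [math.log(-math.log(ecf_squared_modulus(sample, t))) for t in ts]

    # Ordinary least squares for y = intercept + slope * w.
    w_mean = sum(w) / len(w)
    y_mean = sum(y) / len(y)
    sxy = sum((wi - w_mean) * (yi - y_mean) for wi, yi in zip(w, y))
    sxx = sum((wi - w_mean) ** 2 for wi in w)
    slope = sxy / sxx
    intercept = y_mean - slope * w_mean

    alpha = slope
    # intercept = log(2 * scale**alpha), solved for scale.
    scale = (math.exp(intercept) / 2) ** (1 / alpha)
    return alpha, scale


# Standard Cauchy draws (alpha = 1, scale = 1) via the inverse CDF.
rng = random.Random(0)
cauchy_sample = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(20000)]
alpha_hat, scale_hat = regression_estimate(cauchy_sample)
```

On a standard Cauchy sample both estimates should land near 1. The published method also rescales the data and repeats the regression, which this sketch leaves out.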
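1) In the \Shells\Standard LaTeX directory under the SWP installation directory, edit the file Blank - Standard LaTeX Article.shl and add the following lines before \begin{document}:

\RequirePackage{CJK}
\AtBeginDocument{\begin{CJK*}{GBK}{song}\CJKtilde}
\AtEndDocument{\end{CJK*}}

Then save it under the new name Blank - Standard CJK-LaTeX Article.shl.

2) In SWP, under the Expert Setting item of the typeset menu, set DVI Format Setting to MikTeX LaTeX (point it to the latex.exe file in the installation directory of your CTeX suite; if CTeX is installed on c:, the default is C:/CTeX/texmf/miktex/bin/latex.exe), point DVI Preview Setting to MikTeX Yap, and point DVI Print Setting to dvips.

In SWP 5.5 there is no need to modify latex2.dat. Only one simple but crucial step is required: in step 2) above, change normal to Simplified Chinese in the character drop-down list.

One last point to note: when saving the tex file, be sure to use save as and store it as Portable Latex, with the character set set to Simplified Chinese.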
My first successful LaTeX document. So exciting!

\documentclass{article}
\usepackage{amsmath}
\begin{document}
\title{my paper}
\maketitle
\tableofcontents
\section{WinEdt}
Game Theoretic Analysis of Voting in Committees, Cambridge University Press
\end{document}
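From Wikipedia, the free encyclopedia

TF-IDF (term frequency–inverse document frequency) is a weighting technique commonly used in information retrieval and text mining. It is a statistical measure of how important a word is to one document within a document collection or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to how frequently it appears in the corpus. Various forms of TF-IDF weighting are often used by search engines as a measure or ranking of the relevance between a document and a user query. Besides TF-IDF, Internet search engines also use ranking methods based on link analysis to determine the order in which documents appear in search results.

Contents
1 Principle
2 Example
3 Application in the vector space model
4 References
5 External links

Principle

Within a given document, the term frequency (TF) is the number of times a given term appears in that document. This count is usually normalized to prevent a bias towards long documents. (The same term is likely to have a higher raw count in a long document than in a short one, regardless of whether the term is important.) For a term t_i in a particular document d_j, its importance can be expressed as:

\[ \mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \]

where n_{i,j} is the number of occurrences of the term t_i in document d_j, and the denominator is the sum of the occurrences of all terms in document d_j.

The inverse document frequency (IDF) is a measure of the general importance of a term. The IDF of a particular term is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of the quotient:

\[ \mathrm{idf}_i = \log \frac{|D|}{|\{ j : t_i \in d_j \}|} \]

where

|D| is the total number of documents in the corpus;
|\{ j : t_i \in d_j \}| is the number of documents containing the term t_i (that is, the number of documents with n_{i,j} \neq 0). If the term does not appear in the corpus, the divisor is zero, so in general 1 + |\{ j : t_i \in d_j \}| is used instead.

Then

\[ \mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i \]

A high term frequency within a particular document, combined with a low document frequency of that term across the whole collection, produces a high TF-IDF weight. TF-IDF therefore tends to filter out common terms and keep the important ones.

Example

Many different mathematical formulas can be used to compute TF-IDF. The example here uses the formulas above, with the logarithm taken to base 10. The term frequency (TF) is the number of times a term appears divided by the total number of terms in the document. If a document contains 100 terms in total and the term “cow” appears 3 times, then the term frequency of “cow” in that document is 3/100 = 0.03. One way to compute the document frequency (DF) is to count how many documents contain the term “cow” and divide by the total number of documents in the collection. So if “cow” appears in 1,000 documents and the total number of documents is 10,000,000, the inverse document frequency is \log_{10}(10,000,000 / 1,000) = 4. The final TF-IDF score is 0.03 × 4 = 0.12.

Application in the vector space model

TF-IDF weighting is often used together with cosine similarity in the vector space model to judge the similarity between two documents.

References

Salton, G. and McGill, M. J. 1983 Introduction to modern information retrieval. McGraw-Hill, ISBN 0-07-054484-0.
Salton, G., Fox, E. A. and Wu, H. 1983 Extended Boolean information retrieval. Commun. ACM 26, 1022–1036.
Salton, G. and Buckley, C. 1988 Term-weighting approaches in automatic text retrieval. Information Processing & Management 24(5): 513–523.

External links

Term Weighting Approaches in Automatic Text Retrieval
Robust Hyperlinking: An application of tf–idf for stable document addressability.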
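
The “cow” example from the TF-IDF article above can be reproduced with a short Python sketch. The function names and the smooth flag are illustrative choices, not part of any particular library, and the base-10 logarithm is assumed because it is the base that yields the value 4 in the worked example.

```python
import math
from collections import Counter


def term_frequency(term, document_tokens):
    """Occurrences of term divided by the total number of tokens."""
    counts = Counter(document_tokens)
    return counts[term] / len(document_tokens)


def inverse_document_frequency(num_documents, num_containing, smooth=False):
    """Base-10 log of total documents over documents containing the term.

    With smooth=True, 1 is added to the divisor to avoid division by zero
    for terms that appear in no document.
    """
    divisor = num_containing + 1 if smooth else num_containing
    return math.log10(num_documents / divisor)


# A 100-token document in which "cow" appears 3 times.
document = ["cow"] * 3 + ["other"] * 97
tf = term_frequency("cow", document)

# "cow" appears in 1,000 of 10,000,000 documents.
idf = inverse_document_frequency(10_000_000, 1_000)
tfidf = tf * idf
```

Here tf is 0.03 and idf is 4, so tfidf is 0.12 up to floating-point rounding. A different logarithm base rescales every score by the same constant factor, so it does not change how terms rank against each other.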