博文

circos学习笔记（二）

已有 3528 次阅读 2017-10-12 10:38 |系统分类:科研笔记|关键词:学者| 生物信息, 基因组圈图, circos

今天用自己的数据做了一下circos圈图，当然，是在生信媛那三篇教程的基础之上（http://mp.weixin.qq.com/s/qXDlKf7ut8QF2c5_0hiKpA http://www.biotrainee.com/thread-755-1-1.html），修改了部分参数。其实，主要是数据格式的处理过程，我采用python脚本处理的，处理起来感觉还好，没有多少难度，因为大部分的数据是提取两列而已，不需要我进行更复杂的处理，更复杂的处理估计也有软件解决，不需要我造轮子了。

http://www.zd200572.com/2017/10/12/circos_learnning-2/

下面是我的过程记录：

1.contig大小、起始位点的配置文件

我的原始数据来自美吉生物的结果，有一个assemble文件夹里边有一个fasta_name.txt，有这些统计信息，想起来了，我个我是用一个shell命令获得的，但是这些信息都在那个组装好的基因组文件里。

cat d-1.Contigs.fna | grep ^'>' > fasta_name.txt

把数据处理成circos格式，一个python脚本解决（虽然python水平蹩脚，但是能解决小问题了）：

fin = open('fasta_name.txt')contig = ''start = 0end = 0dict = {}for line in fin:        li = []        contig = line.strip().split(' ')[0].split('>')[1]        dict[contig] = ''        li.append(start)        end += int(line.strip().split('=')[1].split('   ')[0])        li.append(end)        dict[contig] = li        start = end  #print(dict)fout = open('tRNA.gff', 'w')fin2 = open('G:\\Users\\shenyou\\Desktop\\ddm-1\\tRNA\\tRNA.gff')for line in fin2:        #print(line)        contig = line.strip().split('   ')[0].strip()        print(contig)        start1 = int(line.strip().split('       ')[2])        print(start1)        end1 = int(line.strip().split(' ')[3])        print(end1)        #break        start1 += dict[contig][0]        end1 += dict[contig][0]        fout.write(contig + ' ' + str(start1) + ' ' + str(end1) + '\n')fout.close()

2.基因注释数据的处理

上面获得了，contig的数据，基本的环就形成了，然后再按生信媛的教程加上刻度不表。下面我选择了测序数据报告里的tRNA、rRNA和基因的预测数据，进行了处理，处理方式都是一样的，可以用一个脚本改改参数解决。主要处理方式就是，把那个在每个contig上的相对刻度转化为基因组的绝对刻度，我是字典解决问题的。脚本如下：

fin = open('fasta_name.txt')contig = ''start = 0end = 0dict = {}for line in fin:        li = []        contig = line.strip().split(' ')[0].split('>')[1]        dict[contig] = ''        li.append(start)        end += int(line.strip().split('=')[1].split('   ')[0])        li.append(end)        dict[contig] = li        start = end  #print(dict)fout = open('predict.gff', 'w')fin1 = open('G:\\Users\\shenyou\\Desktop\\ddm-1\\predict\\ddm-1.ffn')for line in fin1:        if line.startswith('>'):                contig = line.strip().split(' ')[3]                start1 = int(line.strip().split(' ')[4])                end1 = int(line.strip().split(' ')[5])                start1 += dict[contig][0]                end1 += dict[contig][0]                fout.write(contig + ' ' + str(start1) + ' ' + str(end1) + '\n')fout.close()

3.gc含量的获得、

其实，我个步骤对于我来说是最难的，我试图找一个软件来解决，但是碍于自己知识不足，都搜不到相关信息，只有自己造轮子了，写了一个小脚本竟然解决问题了，有点成就感。主要思路就是按生信媛教程里的数据格式，每1000bp计算一次gc含量，对了，还有偏移可以用，由于不清楚正负链的情况，就一锅端了。脚本如下：

fin1 = open('fasta_name.txt')contig = ''start = 0end = 0dict = {}for line in fin1:        li = []        contig = line.strip().split(' ')[0].split('>')[1]        dict[contig] = ''        li.append(start)        end += int(line.strip().split('=')[1].split('   ')[0])        li.append(end)        dict[contig] = li        start = end  #print(dict)i = 1fin = open('G:\\Users\\shenyou\\Desktop\\ddm-1\\assembly\\ddm-1.Contigs.fna')seq = ''di = {}flag = 0#sequence = ''for line in fin:        if line.startswith('>'):                if seq != '':                        di[contig] = seq                        #print(contig, di[contig][:10])                contig = line.strip().split(' ')[0].split('>')[1]                #print(contig)                i += 1                #break                di[contig] = ''                seq= ''        else:                seq += line.strip()                #sequence += line.strip()seq_gc = 0.6537235194562051#print(di.keys())fout = open('gc.txt', 'w')for a in di.keys():        #print(a)        j = 0        for j in range(int(len(di[a]) / 1000) - 1):                if j == 0:                        if len(di[a]) >= 1000:                                #print(str(di[a]))                                seqi_gc = (di[a][0:1000].count('G') + di[a][0:1000].count('C')) / 1000                                start = dict[a][0] + 1                                end = 1000 + dict[a][0]                        elif len(di[a]) < 1000:                                #print(str(di[a]))                                seqi_gc = (di[a][0:len(di[a])].count('G') + di[a][0:len(di[a])].count('C')) / len(di[a])                                start = dict[a][0] + 1                                end = len(di[a]) + dict[a][0]                elif 1 <= (j + 1) < int(len(di[a]) / 1000):                        #print(str(di[a]))                        seqi_gc = (di[a][j*1000:(j + 1)*1000].count('G') + di[a][j*1000:(j + 1)*1000].count('C')) / 1000                        start = j*1000 + dict[a][0] + 1                        end = (j + 1)*1000 + dict[a][0]                elif (j + 1) == int(len(di[a]) / 1000):                        #print(str(di[a]))                        seqi_gc = (di[a][j*1000:len(di[a])].count('G') + di[a][j*1000:len(di[a])].count('C')) / (len(di[a]) - j*1000)                        start = j*1000 + dict[a][0] + 1                        end = len(di[a]) + dict[a][0]                print(start, end)                fout.write(str(a) + ' ' + str(start) + ' ' + str(end) + ' ' + str(seqi_gc) + '\n')                j += 1        #breakfout.close()

4.我的配置文件总览

根据教程，我的配置文件如下，没有出现报错的情况，但是想要掌握这个画图，还是要下很多很多功夫的。

<<include etc/colors_fonts_patterns.conf>><<include etc/image.conf>>karyotype = karyotype.txtchromosomes_units = 100000chromosomes_display_default = yesdefault = 0.005rradius = 0.80rthickness = 6pfill = yesfill_color = deepskybluestroke_color = blackstroke_thickness = 1pshow_label = yeslabel_font = lightlabel_radius = 1r + 110plabel_size = 30label_parallel = yes<<include etc/housekeeping.conf>>show_ticks = yesshow_tick_labels = yesskip_first_label = noskip_last_label = noradius = dims(ideogram,radius_outer)color = blackthickness = 2psize = 30pmultiplier = 1e-6format = %.2fspacing = 20000bsize = 10pshow_label = nothickness = 3pspacing = 100000bsize = 20pshow_label = yeslabel_size = 25plabel_offset = 10pformat = %.2fz = 0file = predict.gffr0 = 0.90rr1 = 0.99rfill_color = 169,169,169file = tRNA.gffr0 = 0.81rr1 = 0.90rfill_color = 169,169,169file = rRNA.gffr0 = 0.27rr1 = 0.36rtype = linethickness = 1pz = 2max_gap = 0ufile = gc.txtcolor = redfill_color = redr0 = 0.69rr1 = 0.78rorientation = outz = 2max_gap = 0ufile = gc_a_m.txtcolor = aquamarinefill_color = aquamariner0 = 0.48rr1 = 0.57rorientation = out

6.我的图

最基础的图画好了，凑活着看了，没想到第一次画就能画成这样，感谢教程。

circos学习笔记（二）

转载本文请联系原作者获取授权，同时请注明本文来自赵加栋科学网博客。
链接地址：https://m.sciencenet.cn/blog-623545-1080319.html

上一篇：circos学习笔记（一）
下一篇：四十不惑：DNA测序技术前世今生和未来

收藏分享

当前推荐数：0

该博文允许注册用户评论请点击登录评论 (0 个评论)

数据加载中...

返回顶部

赵加栋

扫一扫，分享此博文

zd200572的个人博客分享 http://blog.sciencenet.cn/u/zd200572

博文

circos学习笔记（二）

1.contig大小、起始位点的配置文件

2.基因注释数据的处理

3.gc含量的获得、

4.我的配置文件总览

6.我的图

当前推荐数：0

该博文允许注册用户评论请点击登录评论 (0 个评论)

赵加栋

全部作者的其他最新博文

全部精选博文导读

相关博文

zd200572的个人博客分享 http://blog.sciencenet.cn/u/zd200572

博文

circos学习笔记（二）

1.contig大小、起始位点的配置文件

2.基因注释数据的处理

3.gc含量的获得、

4.我的配置文件总览

6.我的图

当前推荐数：0

该博文允许注册用户评论 请点击登录 评论 (0 个评论)

赵加栋

全部作者的其他最新博文

全部精选博文导读

相关博文

该博文允许注册用户评论请点击登录评论 (0 个评论)