Xiaoke Robot (小柯机器人)

Scientists develop a method for deep embedding and alignment of protein sequences
2022-12-16 16:59

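The group led by Jean-Philippe Vert at Google Research in France has reported new progress: their latest study presents a method for deep embedding and alignment of protein sequences. The paper was published in Nature Methods on December 15, 2022.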

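Drawing on recent advances in deep learning for language modeling and differentiable programming, the researchers developed DEDAL (deep embedding and differentiable alignment), a flexible model for aligning protein sequences and detecting homologs. DEDAL is a machine learning-based model that learns to align sequences by observing large datasets of raw protein sequences and of correct alignments. Once trained, DEDAL improves alignment correctness on remote homologs by up to two- or threefold over existing methods and better discriminates remote homologs from evolutionarily unrelated sequences. This paves the way for improvements on many downstream tasks that rely on sequence alignment in structural and functional genomics.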
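To make the idea of a "differentiable alignment" more concrete, here is a minimal sketch of a Smith-Waterman style local alignment score in plain Python. It is not DEDAL's implementation: the fixed match, mismatch, and gap values, the function names, and the use of a log-sum-exp relaxation of the maximum are all assumptions chosen for illustration. With `temperature=0` it computes the classic hard score; with a positive temperature every `max` becomes a smooth function of the scores, which is one common way to let gradients flow through an alignment.

```python
import math


def smooth_max(values, temperature):
    """Return max(values), or its log-sum-exp relaxation if temperature > 0."""
    hard = max(values)
    if temperature == 0.0:
        return hard
    # Subtract the hard maximum first so that exp() never overflows.
    total = sum(math.exp((v - hard) / temperature) for v in values)
    return hard + temperature * math.log(total)


def local_alignment_score(x, y, match=2.0, mismatch=-1.0, gap=-2.0,
                          temperature=0.0):
    """Smith-Waterman local alignment score with a linear gap penalty.

    The scoring values are illustrative constants, not learned parameters.
    """
    n, m = len(x), len(y)
    # table[i][j] scores the best local alignment ending at x[i-1], y[j-1].
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    cells = [0.0]  # the empty alignment always scores 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            table[i][j] = smooth_max(
                [
                    0.0,                        # start a new local alignment
                    table[i - 1][j - 1] + sub,  # align x[i-1] with y[j-1]
                    table[i - 1][j] + gap,      # gap in y
                    table[i][j - 1] + gap,      # gap in x
                ],
                temperature,
            )
            cells.append(table[i][j])
    # A local alignment may end anywhere, so take the (smooth) max over cells.
    return smooth_max(cells, temperature)


if __name__ == "__main__":
    print(local_alignment_score("XXACGTYY", "ZZACGTWW"))
    print(local_alignment_score("XXACGTYY", "ZZACGTWW", temperature=0.01))
```

In a learned model, the constant scores would presumably be replaced by values predicted from sequence embeddings; that part is omitted here.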

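By way of background, protein sequence alignment is a key component of most bioinformatics pipelines used to study the structures and functions of proteins. Aligning highly divergent sequences remains a difficult task, however, and current algorithms often fail to perform it accurately, leaving many proteins or open reading frames poorly annotated.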

Appendix: original English text

Title: Deep embedding and alignment of protein sequences

Author: Llinares-López, Felipe, Berthet, Quentin, Blondel, Mathieu, Teboul, Olivier, Vert, Jean-Philippe

Issue&Volume: 2022-12-15

Abstract: Protein sequence alignment is a key component of most bioinformatics pipelines to study the structures and functions of proteins. Aligning highly divergent sequences remains, however, a difficult task that current algorithms often fail to perform accurately, leaving many proteins or open reading frames poorly annotated. Here we leverage recent advances in deep learning for language modeling and differentiable programming to propose DEDAL (deep embedding and differentiable alignment), a flexible model to align protein sequences and detect homologs. DEDAL is a machine learning-based model that learns to align sequences by observing large datasets of raw protein sequences and of correct alignments. Once trained, we show that DEDAL improves by up to two- or threefold the alignment correctness over existing methods on remote homologs and better discriminates remote homologs from evolutionarily unrelated sequences, paving the way to improvements on many downstream tasks relying on sequence alignment in structural and functional genomics.

DOI: 10.1038/s41592-022-01700-2

Source: https://www.nature.com/articles/s41592-022-01700-2

Nature Methods: launched in 2004 and published by Springer Nature. Latest impact factor: 47.99
Official website: https://www.nature.com/nmeth/
Submission link: https://mts-nmeth.nature.com/cgi-bin/main.plex


Article in this issue: Nature Methods, published online
