Pairwise Sequence Alignment Algorithms: Learning to Choose the Right Multiple Sequence Alignment Software (coding sequences)

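A few days ago, junior labmates used local BLAST to retrieve protein-coding sequences from species that have no genome annotation. They expected to move straight on to selection pressure analysis, but ran into a thorny problem at the multiple sequence alignment step. Until now, I had always aligned with PRANK and then filtered the alignment with Gblocks. This time, however, some of the CDS sequences they assembled did not have lengths that are multiples of 3, so many coding sequences (several hundred genes) contained frameshifts, and some even looked like pseudogenes. Many programs can perform multiple sequence alignment, such as Muscle, MAFFT, and PRANK, but none of them handles frameshifts. So I turned to Google to look for a solution. The effort paid off: I eventually found an approach that solved the problem and learned a lot along the way, so I am recording the process here.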
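Before aligning, it can help to see which CDS sequences are affected. The sketch below is only a rough screen, not part of any tool mentioned here: it assumes the standard genetic code, assumes each sequence is supposed to start in frame 0, and the function name `check_cds` is my own illustration.

```python
# Stop codons of the standard genetic code (assumption: nuclear genes).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_cds(seq):
    """Return a list of frame problems found in one unaligned CDS string."""
    seq = seq.upper().replace("-", "")
    problems = []

    # A CDS whose length is not a multiple of 3 cannot be read codon by codon.
    if len(seq) % 3 != 0:
        problems.append("length_not_multiple_of_3")

    # Split into complete codons in frame 0; a trailing partial codon is dropped.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]

    # Skip the last complete codon, since a terminal stop codon is normal.
    for idx, codon in enumerate(codons[:-1]):
        if codon in STOP_CODONS:
            problems.append(f"internal_stop_at_codon_{idx + 1}")
            break  # report only the first internal stop

    return problems


if __name__ == "__main__":
    for name, cds in {"ok": "ATGAAATAA", "shifted": "ATGAAATA"}.items():
        print(name, check_cds(cds))
```

An internal stop codon does not prove a frameshift, and a frameshift does not always create one, so this only flags candidates for closer inspection.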

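In 2013, Blackburne and Whelan published a study in Molecular Biology and Evolution showing that the MSA method plays a key role in downstream analyses, and highlighting that the choice of MSA software has a marked effect on the results.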

Löytynoja A. 2014

Ranwez 2018

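In addition, there are several multiple sequence alignment tools based on profile hidden Markov models (profile HMM methods). I am not yet familiar with their specific strengths and weaknesses.

SEPP (SATé-enabled Phylogenetic Placement): addresses phylogenetic placement, i.e., placing short reads into a reference alignment and tree;

TIPP (Taxonomic Identification and Phylogenetic Profiling): addresses taxonomic identification and abundance profiling of metagenomic data;

UPP (Ultra-large alignments using Phylogeny-aware Profiles): addresses the alignment of very large datasets, which may contain fragmentary sequences; it can align datasets of up to 1,000,000 sequences;

HIPPI (Highly Accurate Protein Family Classification with Ensembles of HMMs): addresses protein family classification.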

02. Trimming

FasParser
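To make the idea of trimming concrete, here is a minimal sketch of the simplest possible filter: dropping alignment columns that are mostly gaps. This is not how FasParser or Gblocks works internally; the function name `trim_gappy_columns` and the 0.5 threshold are assumptions chosen for illustration.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    """
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all aligned sequences must have the same length")

    n_seqs = len(seqs)
    # Keep the indices of columns that are not dominated by gaps.
    keep = [
        col for col in range(len(seqs[0]))
        if sum(s[col] == "-" for s in seqs) / n_seqs <= max_gap_fraction
    ]
    return {name: "".join(alignment[name][c] for c in keep) for name in names}


if __name__ == "__main__":
    aln = {"a": "ATG-CC", "b": "ATG-C-", "c": "ATGTC-"}
    print(trim_gappy_columns(aln))
```

For coding sequences, filtering single columns like this can break the reading frame, so in practice one would filter whole codons (three columns at a time) rather than individual sites.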

03. Know Your Limits

Data type: DNA vs. RNA, coding vs. non-coding nucleotides (wobble bp), AAs, proteins, etc.

Data properties: substitution (≠ mutation) rate strength (↑ vs. ↓), indel size and rate (% gap & gap length), pairwise sequence identity (PID), etc.

Data matrix properties: # of tips, # of sequences, (alignment length ∝) data matrix weight, e.g., light (K, M) vs. heavy (G, T), etc.

Computing resources: CPU time and RAM
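Two of the data properties listed above, PID and % gap, are easy to measure on an existing alignment. The sketch below is one way to do it. PID has several definitions in the literature; here I assume the denominator is the number of columns where neither sequence has a gap, and both function names are my own.

```python
def pairwise_identity(a, b):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # gapped columns are excluded from the denominator
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def gap_fraction(seqs):
    """Fraction of all alignment cells that are gap characters."""
    total = sum(len(s) for s in seqs)
    gaps = sum(s.count("-") for s in seqs)
    return gaps / total if total else 0.0


if __name__ == "__main__":
    s1, s2 = "ATG-CC", "ATGTCA"
    print(pairwise_identity(s1, s2))  # 4 matches over 5 ungapped columns
    print(gap_fraction([s1, s2]))     # 1 gap over 12 cells
```

Low PID or a high gap fraction suggests a divergent or fragmentary dataset, which is exactly the situation where the choice of aligner matters most.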

04. Summary

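Everyone's data can differ substantially from everyone else's, so it is essential to learn how to choose the type of software that suits your own data. Only then will the results be accurate, reliable, and reproducible.

Bioinformatics, like computing, moves quickly: classic software that was in use a few years ago may already be outdated, so keeping up to date matters a great deal for bioinformaticians.

Finally, I thank Sun Yanbo of the Kunming Institute of Zoology, Chinese Academy of Sciences, for guidance and help.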

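05. Here is a sequence alignment workflow for handling sequences that contain frameshifts. I have tested it myself (feedback welcome)

Problem solved, lots learned, and I'm happy!!!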

References:

Blackburne B P, Whelan S. Class of multiple sequence alignment algorithm affects genomic analysis[J]. Molecular biology and evolution, 2013, 30(3): 642-653.

Löytynoja A. Phylogeny-aware alignment with PRANK[M]//Multiple sequence alignment methods. Humana Press, Totowa, NJ, 2014: 155-170.

Löytynoja A, Goldman N. Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis[J]. Science, 2008, 320(5883): 1632-1635.

Fletcher W, Yang Z. The effect of insertions, deletions, and alignment errors on the branch-site test of positive selection[J]. Molecular biology and evolution, 2010, 27(10): 2257-2267.

Ranwez V, Douzery E J P, Cambon C, et al. MACSE v2: toolkit for the alignment of coding sequences accounting for frameshifts and stop codons[J]. Molecular biology and evolution, 2018, 35(10): 2582-2584.

Sun Y B. FasParser2: a graphical platform for batch manipulation of tremendous amount of sequence data[J]. Bioinformatics, 2018, 34(14): 2493-2495.

Whelan S, Irisarri I, Burki F. PREQUAL: detecting non-homologous characters in sets of unaligned homologous sequences[J]. Bioinformatics, 2018, 34(22): 3929-3930.

Di Franco A, Poujol R, Baurain D, et al. Evaluating the usefulness of alignment filtering methods to reduce the impact of errors on evolutionary inferences[J]. BMC evolutionary biology, 2019, 19(1): 21.
