Analysis of differential gene expression and alternative splicing is significantly influenced by choice of reference genome
Abstract
RNA-Seq analysis has enabled the evaluation of transcriptional changes in many species, including non-model organisms. However, in most species only a single reference genome is available, and RNA-Seq reads from highly divergent varieties are typically aligned to this reference. Here, we quantify the impact of the choice of mapping genome in rice, where three high-quality reference genomes are available. We aligned RNA-Seq data from a popular productive rice variety to three different reference genomes and found that the set of genes identified as differentially expressed differed depending on which reference genome was used for mapping. Furthermore, the ability to detect differentially used transcript isoforms was profoundly affected by the choice of reference genome: only 30% of the differentially used splicing features were detected when reads were mapped to the more commonly used, but more distantly related, reference genome. This demonstrates that gene expression and splicing analyses vary considerably depending on the mapping reference genome, and that analysis of individuals that are distantly related to an available reference genome may be improved by the acquisition of new genomic reference material. We observed that these differences in transcriptome analysis are due, in part, to single nucleotide polymorphisms between the sequenced individual and each respective reference genome, as well as to annotation differences between the reference genomes that exist even between syntenic orthologs. We conclude that, even between two closely related genomes of similar quality, using the reference genome that is most closely related to the species being sampled significantly improves transcriptome analysis.