EMSAR: Estimation of transcript abundance from RNA-seq data by mappability-based segmentation and reclustering

Soohyun Lee, Chae Hwa Seo, Burak Han Alver, Sanghyuk Lee, Peter J. Park

Research output: Contribution to journal › Article › peer-review

11 Scopus citations


Background: RNA-seq has been widely used for genome-wide expression profiling. RNA-seq data typically consists of tens of millions of short sequenced reads from different transcripts. However, due to sequence similarity among genes and among isoforms, the source of a given read is often ambiguous. Existing approaches for estimating expression levels from RNA-seq reads tend to compromise between accuracy and computational cost.

Results: We introduce a new approach for quantifying transcript abundance from RNA-seq data. EMSAR (Estimation by Mappability-based Segmentation And Reclustering) groups reads according to the set of transcripts to which they are mapped and finds maximum likelihood estimates using a joint Poisson model for each optimal set of segments of transcripts. The method uses nearly all mapped reads, including those mapped to multiple genes. With an efficient transcriptome indexing based on modified suffix arrays, EMSAR minimizes the use of CPU time and memory while achieving accuracy comparable to the best existing methods.

Conclusions: EMSAR is a method for quantifying transcripts from RNA-seq data with high accuracy and low computational cost. EMSAR is available at https://github.com/parklab/emsar
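To make the grouping and estimation idea in the abstract concrete, the following is a minimal Python sketch, not EMSAR's actual code. It assumes each read is already annotated with the transcripts it maps to, and that every segment (a region shared by exactly one set of transcripts) has a known length. Read counts in a segment are modeled as Poisson with mean equal to the segment length times the summed per-position rates of its transcripts, and the rates are fitted with a plain EM iteration, which is a stand-in for whatever optimizer EMSAR uses. The names `group_reads`, `estimate_rates`, and the toy inputs are hypothetical.

```python
from collections import Counter


def group_reads(read_alignments):
    """Count reads by the exact set of transcripts each read maps to."""
    counts = Counter()
    for transcripts in read_alignments:
        counts[frozenset(transcripts)] += 1
    return counts


def estimate_rates(segment_counts, segment_lengths, n_iter=200):
    """Fit per-position read rates under a joint Poisson model via EM.

    segment_counts:  {frozenset of transcript ids: observed read count}
    segment_lengths: {frozenset of transcript ids: segment length}
    Assumed model (illustrative): count_s ~ Poisson(length_s * sum of rates
    of the transcripts sharing segment s).
    """
    transcripts = sorted(set().union(*segment_lengths))
    # Total length of all segments that each transcript takes part in.
    total_len = {
        t: sum(length for seg, length in segment_lengths.items() if t in seg)
        for t in transcripts
    }
    theta = {t: 1.0 for t in transcripts}
    for _ in range(n_iter):
        assigned = {t: 0.0 for t in transcripts}
        # E-step: split each segment's reads in proportion to current rates.
        for seg, n in segment_counts.items():
            denom = sum(theta[t] for t in seg)
            if denom == 0:
                continue
            for t in seg:
                assigned[t] += n * theta[t] / denom
        # M-step: rate = expected reads / total segment length.
        theta = {t: assigned[t] / total_len[t] for t in transcripts}
    return theta


# Toy example: two transcripts, A and B, sharing one segment.
lengths = {
    frozenset({"A"}): 100,
    frozenset({"A", "B"}): 100,
    frozenset({"B"}): 100,
}
reads = [["A"]] * 10 + [["A", "B"]] * 30 + [["B"]] * 20
rates = estimate_rates(group_reads(reads), lengths)
```

In this toy case the reads in the shared segment are not discarded; they are divided between A and B according to the rates implied by the uniquely mapped segments, which is the sense in which multi-mapped reads contribute to the estimate.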

Original language: English
Article number: 278
Journal: BMC Bioinformatics
Issue number: 1
State: Published - 3 Sep 2015

Bibliographical note

Funding Information:
Sanghyuk Lee was supported by a grant from the National Research Foundation of Korea (NRF-2014M3C9A3065221). We thank Daniel S. Day, Lovelace J. Luquette, Lucy Jung and Niklas Smedemark-Margulies in the Park lab for their helpful comments and/or for testing EMSAR. We also thank Lior Pachter at U.C. Berkeley for his feedback during his visit in March 2014.

Publisher Copyright:
© 2015 Lee et al.


  • Expression quantification
  • Isoforms
  • Multi-reads
  • Optimization
  • Suffix array

