Identification of coding elements in the genome is a fundamental step toward understanding the building blocks of living systems. Previous genome annotation pipelines have mainly focused on proteins longer than 100 amino acids. However, recent studies have shown that many proteins shorter than 100 amino acids (small proteins) also play important roles in diverse processes such as development, muscle contraction, and DNA repair. These previously neglected small proteins may contribute in important ways to cellular and organismal biology, emphasizing the need for an unbiased and comprehensive strategy to evaluate translation empirically. In recent years, comparative genomics, proteomics, and the combination of evolutionary conservation with ribosome profiling data have shown that the number of small proteins is probably much larger than previously suspected.
Deep transcriptome sequencing has revealed the existence of many transcripts that lack long or conserved open reading frames; these have been termed long non-coding RNAs (lncRNAs). Although several lncRNAs have regulatory functions, the vast majority have no known function. While their existence is undisputed, their coding potential and functionality have remained controversial. Ribosome profiling, a technique that measures ribosome occupancy and translation genome-wide, has indicated that translation is far more pervasive than anticipated and takes place on many transcripts previously assumed to be non-coding RNAs (ncRNAs). In addition, several small proteins encoded by ncRNAs have been shown to be functional, with diverse regulatory roles. A small protein database will offer new avenues of research into lncRNA regulatory mechanisms.