1.Why create the SmProt database?

Identifying the coding elements in a genome is a fundamental step toward understanding the building blocks of living systems. Previous genome annotation pipelines have mainly focused on proteins longer than 100 amino acids. However, recent work has shown that many proteins shorter than 100 amino acids (small proteins) also play important roles in diverse processes such as development, muscle contraction, and DNA repair. Previously neglected small proteins may therefore contribute in important ways to cellular and organismal biology, which emphasizes the need for an unbiased and comprehensive strategy to evaluate translation empirically. In recent years, comparative genomics, proteomics, and the combination of evolutionary conservation with ribosome profiling data have shown that the number of small proteins is probably much larger than previously suspected.

Deep transcriptome sequencing has revealed many transcripts that lack long or conserved open reading frames; these have been termed long non-coding RNAs (lncRNAs). Although several lncRNAs have regulatory functions, the vast majority have no known function. While their existence is undisputed, their coding potential and functionality remain controversial. Ribosome profiling, a technique that measures ribosome occupancy and translation genome-wide, has indicated that translation is far more pervasive than anticipated and takes place on many transcripts previously assumed to be non-coding RNAs (ncRNAs). In addition, several small proteins encoded by ncRNAs have been shown to be functional, with diverse regulatory roles. A small protein database will therefore offer new avenues of research into lncRNA regulatory mechanisms.


2.How to use SmProt database?

Click here to download the user's guide.

3.Pipeline to construct SmProt database.

4.The data sources of SmProt database.

The small proteins are mainly collected from four different sources:
(1) Literature mining.
(2) Ribosome profiling.
(3) Mass spectrometry (MS).
(4) Known databases.
Because these sources differ in confidence, we reorganized the small proteins into five data source categories, described in the table below, and annotated each small protein with its data source.

Data source | Description
Low-throughput Literature Mining | Literature was obtained from PubMed. We used a set of keywords to retrieve papers from PubMed, and then manually extracted detailed information from papers that focused on a specific small protein.
High-throughput Literature Mining | Literature was obtained from PubMed. We used a set of keywords to retrieve papers from PubMed, and then manually extracted detailed information from papers that focused on batch discovery of small proteins.
MS data | MS data sets were collected from the ENCODE project and analyzed to obtain small proteins encoded by ncRNAs.
Databases | We also collected small proteins from existing databases. We kept only the reliable small proteins (such as those with a manual test) and reprocessed them according to the flow chart.
Ribosome profiling | Ribosome profiling data sets were collected from the GEO database, and the RiboTaper software was used to identify small proteins.

5.The software tools used in SmProt database.

Software | Usage | URL
LiftOver | to convert genome coordinates | http://genome.ucsc.edu/cgi-bin/hgLiftOver
BedTools | to process BED format files | http://bedtools.readthedocs.org/en/latest/
RiboTaper | to process Ribo-seq data | https://ohlerlab.mdc-berlin.de/software/
SRA Toolkit | for file format conversion | http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software
Trimmomatic | to remove adapters from sequencing data | http://www.usadellab.org/cms/?page=trimmomatic
Bowtie2 | to remove rRNA from sequencing data | http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
FASTX-Toolkit | for FASTQ file preprocessing | http://hannonlab.cshl.edu/fastx_toolkit/
FastQC | for quality control checks on sequencing data | http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
STAR | for mapping reads to the genome | https://github.com/alexdobin/STAR
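
To illustrate how several of the tools above could be chained for one Ribo-seq run, here is a minimal Python sketch that only builds the command lines and does not execute them. It is not the SmProt pipeline itself. The order of the steps, the file names, the `trimmomatic` wrapper name, and the parameter values (`ILLUMINACLIP:...:2:30:10`, `MINLEN:20`) are all assumptions made for illustration. The RiboTaper step is left out because its invocation is not described here.

```python
from typing import List


def riboseq_preprocessing_commands(
    run_id: str,
    adapters_fasta: str,
    rrna_index: str,
    star_genome_dir: str,
) -> List[List[str]]:
    """Build (but do not run) command lines for one Ribo-seq run.

    All file names and parameter values are illustrative assumptions.
    """
    raw = f"{run_id}.fastq"
    trimmed = f"{run_id}.trimmed.fastq"
    no_rrna = f"{run_id}.norrna.fastq"
    return [
        # SRA Toolkit: convert the SRA run to FASTQ (file format conversion).
        ["fastq-dump", run_id],
        # FastQC: quality control report on the raw reads.
        ["fastqc", raw],
        # Trimmomatic, single-end mode: clip adapters, drop very short reads.
        # Assumes a `trimmomatic` wrapper script is on PATH.
        ["trimmomatic", "SE", raw, trimmed,
         f"ILLUMINACLIP:{adapters_fasta}:2:30:10", "MINLEN:20"],
        # Bowtie2: align to an rRNA index and keep only the unaligned reads.
        ["bowtie2", "-x", rrna_index, "-U", trimmed,
         "--un", no_rrna, "-S", "/dev/null"],
        # STAR: map the remaining reads to the genome.
        ["STAR", "--genomeDir", star_genome_dir,
         "--readFilesIn", no_rrna,
         "--outSAMtype", "BAM", "SortedByCoordinate",
         "--outFileNamePrefix", f"{run_id}."],
    ]


if __name__ == "__main__":
    # Hypothetical accession and paths, for demonstration only.
    for cmd in riboseq_preprocessing_commands(
        "SRR000001", "adapters.fa", "rRNA_index", "star_index"
    ):
        print(" ".join(cmd))
```

Each command is returned as an argument list, so it could be passed to `subprocess.run` once the paths and parameters have been checked against the documentation of each tool.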

6.The ribosome profiling datasets used in SmProt.

Click here to download the information about ribosome profiling datasets used in SmProt.

7.The explanation of ORF types in SmProt database.

sORF: Short open reading frame, an ORF that encodes a protein of 100 amino acids or fewer.
uORF: Upstream open reading frame, an ORF located within the 5'UTR (5' untranslated region) of an mRNA. SmProt includes only uORFs whose translated protein is shorter than 100 amino acids.
dORF: Downstream open reading frame, an ORF located within the 3'UTR (3' untranslated region) of an mRNA. SmProt includes only dORFs whose translated protein is shorter than 100 amino acids.
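
The three definitions above can be read as a simple decision rule. The following Python sketch is one possible reading, not SmProt's actual implementation. The `Transcript` type, the 0-based half-open transcript coordinates, and the choice to label any other ORF of 100 amino acids or fewer as sORF (including ORFs on non-coding transcripts) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

MAX_SMALL_PROTEIN_AA = 100  # length limit taken from the definitions above


@dataclass
class Transcript:
    # Annotated main CDS in 0-based, half-open transcript coordinates.
    # Both are None for a transcript without an annotated CDS (e.g. an ncRNA).
    cds_start: Optional[int]
    cds_end: Optional[int]


def classify_orf(
    orf_start: int,
    orf_end: int,
    protein_length: int,
    transcript: Transcript,
) -> Optional[str]:
    """Return "uORF", "dORF", "sORF", or None if the ORF is too long."""
    if transcript.cds_start is not None and transcript.cds_end is not None:
        # Entirely inside the 5'UTR: upstream ORF (strictly shorter than 100 aa).
        if orf_end <= transcript.cds_start:
            return "uORF" if protein_length < MAX_SMALL_PROTEIN_AA else None
        # Entirely inside the 3'UTR: downstream ORF (strictly shorter than 100 aa).
        if orf_start >= transcript.cds_end:
            return "dORF" if protein_length < MAX_SMALL_PROTEIN_AA else None
    # Assumption: everything else is a generic sORF if it is 100 aa or fewer.
    return "sORF" if protein_length <= MAX_SMALL_PROTEIN_AA else None


if __name__ == "__main__":
    mrna = Transcript(cds_start=100, cds_end=400)   # hypothetical mRNA
    ncrna = Transcript(cds_start=None, cds_end=None)  # hypothetical ncRNA
    print(classify_orf(10, 70, 19, mrna))     # uORF
    print(classify_orf(450, 600, 49, mrna))   # dORF
    print(classify_orf(0, 303, 100, ncrna))   # sORF
```

Note that the sketch keeps the asymmetry in the definitions: a sORF may be exactly 100 amino acids long, while uORFs and dORFs must be shorter than 100.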