In the RNA world, RNA is assumed to be the dominant macromolecule performing most, if not all, core "house-keeping" functions. The ribo-cell hypothesis suggests that the genetic code and the translation machinery may both be born of the RNA world, and that DNA, once introduced into ribo-cells, may gradually take over the informational role of RNA, for example by providing a mature genetic code and a mechanism enabling stable inheritance of sequence and its variation. In this context, we modeled the genetic code in two content variables--GC and purine contents--of protein-coding sequences and measured the purine content sensitivity of each codon, where the sensitivity (% usage) is plotted as a function of GC content variation. The analysis leads to a new pattern--the symmetric pattern--where the sensitivity of purine content variation shows diagonal symmetry in the codon table, more significantly in the two GC-content-invariable quarters, in addition to the two existing patterns where the table is divided into either four GC content sensitivity quarters or two amino acid diversity halves. The most insensitive codon sets are GUN (valine) and CAN (CAR for asparagine and CAY for aspartic acid) and the most biased amino acid is valine (always over-estimated) followed by alanine (always under-estimated). The unique position of valine and its codons suggests its key roles in the final recruitment of the complete codon set of the canonical table. The distinct choice may only be attributable to sequence signatures or signals of splice sites for spliceosomal introns shared by all extant eukaryotes.
The genetic code serves as one of the natural links for life's two conceptual frameworks--the informational and operational tracks--bridging the nucleotide sequence of DNA and RNA to the amino acid sequence of protein and thus its structure and function. On the informational track, DNA and its four building blocks have four basic variables: order, length, GC and purine contents; the latter two exhibit unique characteristics in prokaryotic genomes where protein-coding sequences dominate. Bridging the two tracks, tRNAs and their aminoacyl-tRNA synthetases, which interpret each codon (a nucleotide triplet), together with ribosomes, form a complex machinery that translates genetic information encoded on the messenger RNAs into proteins. On the operational track, proteins are constantly selected in the context of cellular and organismal functions. The principle of such functional selection is to minimize the damage caused by seemingly random sequence alteration at the nucleotide level and its function-altering consequences at the protein level; the principle also suggests that, in addition to both immediate or short-term elimination and long-term selection, there must be complex yet sophisticated mechanisms that protect the molecular interactions and cellular processes of cells and organisms from such damage. The two-century study of selection at species and population levels has been leading the way to understanding rules of inheritance and evolution at molecular levels along the informational track, while ribogenomics, epigenomics and other operationally-defined omics (such as the metabolite-centric metabolomics) have been ushering biologists into the new millennium along the operational track.
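As a minimal sketch of the two content variables named above, GC content and purine content can be computed for a nucleotide sequence as the fraction of G+C and of A+G bases, respectively. The function names and the choice to ignore ambiguous bases (such as N) are illustrative assumptions, not part of the original work.

```python
def base_counts(seq):
    """Count unambiguous bases; U is treated as T so RNA input also works."""
    seq = seq.upper().replace("U", "T")
    return {base: seq.count(base) for base in "ACGT"}


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (assumed definition)."""
    counts = base_counts(seq)
    total = sum(counts.values())
    return (counts["G"] + counts["C"]) / total if total else 0.0


def purine_content(seq):
    """Fraction of purines (A and G) among the unambiguous bases."""
    counts = base_counts(seq)
    total = sum(counts.values())
    return (counts["A"] + counts["G"]) / total if total else 0.0
```

For a protein-coding sequence, both values lie between 0 and 1; an empty or fully ambiguous sequence returns 0.0 here rather than raising an error, which is a design choice of this sketch.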
Although strand-biased gene distribution (SGD) was described some two decades ago, the underlying molecular mechanisms and their relationship remain elusive. Its facets include, but are not limited to, the degree of biases, the strand-preference of genes, and the influence of background nucleotide composition variations. Using a dataset composed of 364 non-redundant bacterial genomes, we sought to illustrate our current understanding of SGD. First, when we divided the collection of bacterial genomes into non-polC and polC groups according to their possession of DnaE isoforms that correlate closely with taxonomy, the SGD of the polC group stood out more significantly than that of the non-polC group. Second, when examining horizontal gene transfer, coupled with gene functional conservation (essentiality) and expressivity (level of expression), we realized that they all contributed to SGD. Third, we further demonstrated a weaker G-dominance on the leading strand of the non-polC group but strong purine dominance (both G and A) on the leading strand of the polC group. We propose that strand-biased nucleotide composition plays a decisive role for SGD since the polC-bearing genomes are not only AT-rich but also have pronounced purine-rich leading strands, and we believe that a special mutation spectrum that leads to a strong purine asymmetry and a strong strand-biased nucleotide composition coupled with functional selections for genes and their functions are both at work.
Mammalian testis development is a complex and highly sophisticated process. To study the dynamic change of normal testis development at the transcriptional level, we investigated mouse testes at three postnatal ages: 6 days postnatal, 4 weeks old, and 10 weeks old, representing infant (PN1), juvenile (PN2), and adult (PN3) stages, respectively. Using ultra high-throughput RNA sequencing (RNA-seq) technology, we obtained 211 million reads with a length of 35 bp. We identified 18837 genes that were expressed in mouse testes, and found that genes expressed at the highest level were involved in spermatogenesis. The gene expression pattern in PN1 was distinct from that in PN2 and PN3, which indicates that spermatogenesis has commenced in PN2. We analyzed a large number of genes related to spermatogenesis and somatic development of the testis, which play important roles at different developmental stages. We also found that the MAPK, Hedgehog, and Wnt signaling pathways were significantly involved at different developmental stages. These findings further our understanding of the molecular mechanisms that regulate testis development. Our study also demonstrates significant advantages of RNA-seq technology for studying the transcriptome during development.
GONG Wei, PAN LinLin, LIN Qiang, ZHOU YuanYuan, XIN ChengQi, YU XiaoMin, CUI Peng, HU SongNian, YU Jun
Histone H3 lysine 4 trimethylation (H3K4me3) is well known to occur in the promoter region of genes for transcription activation. However, when investigating the H3K4me3 profiles in the mouse cerebrum and testis, we discovered that H3K4me3 also has a significant enrichment at the 3' end of actively transcribed (sense) genes, which we name 3'-H3K4me3. 3'-H3K4me3 is associated with ~15% of protein-coding genes in both tissues. In addition, we examined the transcriptional initiation signals including RNA polymerase II (RNAPII) binding sites and the 5'-CAGE-tag that marks transcriptional start sites. Interestingly, we found that 3'-H3K4me3 is associated with the initiation of antisense transcription. Furthermore, 3'-H3K4me3 modification levels correlate positively with the antisense expression levels of the associated sense genes, implying that 3'-H3K4me3 is involved in the activation of antisense transcription. Taken together, our findings suggest that H3K4me3 may be involved in the regulation of antisense transcription that initiates from the 3' end of sense genes. In addition, a positive correlation was also observed between the expression of antisense and the associated sense genes with 3'-H3K4me3 modification. More importantly, we observed the 3'-H3K4me3 enrichment among genes in human, fruit fly and Arabidopsis, and found that the sequences of 3'-H3K4me3-marked regions are highly conserved and essentially indistinguishable from known promoters in vertebrates. Therefore, we speculate that these 3'-H3K4me3-marked regions may serve as potential promoters for antisense transcription and that 3'-H3K4me3 appears to be a universal epigenetic feature in eukaryotes. Our results provide a novel insight into the epigenetic roles of H3K4me3 and the regulatory mechanism of antisense transcription.
Immortality and tumorigenicity are two distinct characteristics of cancers. Immortalization has been suggested to precede tumorigenesis. To understand the molecular mechanisms of tumorigenicity and cancer progression in mammary epithelium, we established a tumorigenic cell model by means of heavy-ion radiation of an immortal cell model, which was created by overexpressing the human telomerase reverse transcriptase (hTERT) in normal human mammary epithelial cells. We examined the expression profile of this tumorigenic cell line (ThMEC) using the hTERT-overexpressing immortal cell line (IhMEC) as a control. In-depth RNA-seq data were generated by using the next-generation sequencing (NGS) platform (Life Technologies SOLiD 3). We found that house-keeping (HK) and tissue-specific (TS) genes were differentially regulated during the tumorigenic process. HK genes tended to be activated while TS genes tended to be repressed. In addition, the HK genes and TS genes tended to contribute differentially to the variation of gene expression at different RPKM (gene expression in reads per exon kilobase per million mapped sequence reads) levels. Based on transcriptome analysis of the two cell lines, we defined 7053 differentially-expressed genes (DEGs) between immortality and tumorigenicity. Differential expression of 20 manually-selected genes was further validated using qRT-PCR. Our observations may help to further our understanding of cellular mechanism(s) in the transition from immortalization to tumorigenesis.
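The RPKM unit defined above (reads per exon kilobase per million mapped sequence reads) can be illustrated with a short sketch. The function name, parameter names, and example numbers below are hypothetical and are not taken from the study; the formula itself follows directly from the definition of the unit.

```python
def rpkm(gene_reads, exon_length_bp, total_mapped_reads):
    """Reads per exon kilobase per million mapped reads.

    gene_reads: reads mapped to the gene's exons
    exon_length_bp: summed exon length of the gene, in base pairs
    total_mapped_reads: all mapped reads in the sample
    """
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    # Dividing by (length / 1e3) and by (total / 1e6) gives the 1e9 factor.
    return gene_reads * 1e9 / (exon_length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a gene with 2 kb of exons,
# in a sample with 10 million mapped reads.
example_value = rpkm(500, 2000, 10_000_000)
```

Because the value is normalized by both exon length and sequencing depth, it allows expression levels to be compared across genes of different lengths and across samples of different sizes.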
Ribonucleic acid (RNA) deserves not only a dedicated field of biological research--a discipline or branch of knowledge--but also explicit definitions of its roles in cellular processes and molecular mechanisms. Ribogenomics is the study of the biology of cellular RNAs, including their origin, biogenesis, structure and function. On the informational track, messenger RNAs (mRNAs) are the major component of ribogenomes, which encode proteins and serve as one of the four major components of the translation machinery and whose expression is regulated at multiple levels by other operational RNAs. On the operational track, there are several diverse types of RNA [...] embryonic development, circadian and seasonal rhythms, defined life-span stages, pathological conditions and anatomy-driven tissue/organ/cell types.
Since the human genome is mostly transcribed, genetic variations must exhibit sequence signatures reflecting the relationship between transcription processes and chromosomal structures as we have observed in unicellular organisms. In this study, a set of 646 ubiquitous expression-invariable genes (EIGs) which are present in germline cells were defined and examined based on RNA-sequencing data from multiple high-throughput transcriptomic datasets. We demonstrated a relationship between gene expression level and transcript-centric mutations in the human genome based on single nucleotide polymorphism (SNP) data. A significant positive correlation was shown between gene expression and mutation, where highly-expressed genes accumulate more mutations than lowly-expressed genes. Furthermore, we found four major types of transcript-centric mutations: C→T, A→G, C→[...] and G→T in human genomes and identified a negative gradient of the sequence variations aligning from the 5' end to the 3' end of the transcription units (TUs). The periodic occurrence of these genetic variations across TUs is associated with nucleosome phasing. We propose that transcript-centric mutations are one of the major driving forces for gene and genome evolution along with creation of new genes, gene/genome duplication, and horizontal gene transfer.