Codondex: kmer


Tuesday, June 2, 2026

The Hidden Topography of Gene Regulation

A gene is usually read as a linear instruction, a sequence running from promoter to exon, intron, splice junction, UTR and termination site, but Codondex suggests that a gene should also be read as chromosomal geography. Beneath the annotated map of exons and introns there is another terrain: a repeat-density topography formed by short DNA words that recur, overlap, nest inside longer words, cluster into local fields and rise into summits. These summits are not defined by conventional gene annotation. They are not necessarily exons, splice sites, enhancers or promoters. They are sequence-density formations inherent in the DNA itself. Codondex calls these nested formations High-Density Repeat Fields ("HDRFs" or "HDRNFs").

An HDRF is not simply a repeated sequence. It is discovered as a local field in which many short, non-trivial motifs recur through adjacent, overlapping and nested k-mer relationships. A repeated 8-mer may sit inside a repeated 9-mer, which sits inside a repeated 10-mer, which is carried through a population of longer 13–28-mers. The importance of the field is not that one short motif repeats many times in isolation. The importance is that the motif is embedded in a dense neighborhood of related repeating sequence words. Local DNA is therefore not merely repetitive; it is architecturally loaded. It carries a concentrated burden of repetitive sequence possibilities that can be read by chromatin, transcription factors, polymerase, splice machinery, RNA-binding proteins and, after transcription, by the nascent RNA environment, or that may inherently affect biological concentrations.

In this model, the genome is not flat text. It is landscape, and some parts of that landscape are loaded with encoded densities. For example, introns are not empty space. They may contain ridges, basins and summits of repeat-density potential. The highest HDRF is the mountain in that landscape: the point where nested repeat architecture is most concentrated, where the gene’s internal sequence burden reaches its maximum, and where encoded DNA density may be most readily converted into biological concentration through chromatin exposure, transcription, RNA processing or synthetic mimicry. Codondex begins at that summit because the summit is where the gene has already concentrated its own sequence logic.

This is why HDRFs are best understood as chromosomal geography. A gene has valleys where nested repeat burden is low, ridges where motifs begin to cluster, plateaus where repeat families spread across local sequence, and peaks where the density of nested, overlapping, non-trivial motifs reaches a maximum. The highest peak in that landscape is the HDRF Summit: the local sequence region, represented computationally by Codondex as a synthesis-length 28-mer, that carries the maximum nested-repeat burden within the gene or transcript region being analyzed.

The mountain analogy is useful because it does not overstate function. A mountain is real whether or not anyone climbs it. Likewise, an HDRF is real as sequence architecture whether or not the gene is actively transcribed at a given moment. The DNA contains the topography before transcription. Transcription does not create the field; transcription reads through it and may convert its encoded DNA density into RNA motif density. When chromatin opens, when polymerase traverses the region, when an intron is copied into pre-mRNA, when splice factors scan the nascent transcript, or when RNA-binding proteins engage the sequence, the latent geography may become regulatory opportunity.

This distinction is central. A high-frequency k-mer in a table does not automatically prove biological function. K-mer density is not itself biochemical concentration. But k-mer analysis can reveal a real feature of the genome geography: inherent sequence-density concentration. In DNA, this means an increased local density of potential interaction sites. In RNA, after transcription, the same encoded field may become a repeated motif substrate available for folding, binding, splicing, retention, decay or compartmental interaction. The biological question is therefore not whether every repeated word is functional. The stronger question is whether a gene’s highest-density nested repeat fields mark regions where regulatory potential is unusually concentrated.

This is especially important in first introns. First introns are often regulatory-rich, promoter-proximal and involved in early transcriptional architecture, chromatin accessibility, elongation and co-transcriptional processing. For example, in TP53 and MEN1, the intron 1 repeat landscapes suggest that transcript variants do not merely differ in length. They preserve different repeat-density fields. Even when transcript lengths are normalized, variant-specific clustering can remain because normalization rescales the sequence but does not erase its internal motif architecture. The gene’s repeat geography survives the scaling.

In introns of one TP53 transcript, for example, the short motif 'CCCAGCTA' emerges as a dominant repeat core. Its significance is not simply that this 8-mer appears frequently. The deeper signal is that CCCAGCTA is repeatedly nested inside adjacent and overlapping longer sequence contexts. It is surrounded by neighboring motifs that also recur. A 28-mer containing that core may therefore represent a compact summit of a broader HDRF: a local sequence unit carrying the densest accessible sample of the gene’s nested repeat architecture. The 28-mer is not chosen because 28 has mystical biological status; it is chosen because it is a practical synthetic length that can capture a local field of internal 8–12, 8–18 and 8–28 motif burden.

The computational task is therefore not merely to find the most frequent k-mer. That would overvalue trivial homopolymers and low-complexity tracts. The task is to compute the nested burden of each candidate window. For each 28-mer, Codondex sums the recurrence frequencies of all internal k-mers from length 8 to 28, with optional weighting for entropy, GC content, CpG content, palindromic potential, stem-capability, transcript conservation and non-triviality. Adjacent high-scoring 28-mers merge into a peak. The highest-scoring local maximum becomes the HDRF Summit.
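The scoring step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not Codondex's implementation: the function names (`kmer_counts`, `nested_burden`, `hdrf_summit`) are hypothetical, only k-mers that occur more than once contribute to the burden (an assumption made here so that unique words add nothing), and the optional weights and the merging of adjacent windows into peaks are left out.

```python
from collections import Counter

def kmer_counts(seq, k_min=8, k_max=28):
    """Count every contiguous k-mer of length k_min..k_max in seq."""
    counts = Counter()
    for k in range(k_min, k_max + 1):
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def nested_burden(window, counts, k_min=8):
    """Sum the recurrence counts of all internal k-mers of one window.

    Assumption for this sketch: a k-mer seen only once is not a repeat,
    so it contributes zero.
    """
    total = 0
    for k in range(k_min, len(window) + 1):
        for i in range(len(window) - k + 1):
            c = counts[window[i:i + k]]
            if c > 1:
                total += c
    return total

def hdrf_summit(seq, window=28, k_min=8):
    """Return (start, window_sequence, score) for the top-scoring window."""
    counts = kmer_counts(seq, k_min, window)
    scores = [
        nested_burden(seq[i:i + window], counts, k_min)
        for i in range(len(seq) - window + 1)
    ]
    if not scores:
        return None  # sequence shorter than one window
    best = max(range(len(scores)), key=scores.__getitem__)  # first maximum
    return best, seq[best:best + window], scores[best]

# Toy input (not real TP53 sequence): one 28-mer placed twice around a spacer.
unit = "CCCAGCTAGGATTCAGCTTAGGCATCGA"
start, summit, score = hdrf_summit(unit + "TGTGACCA" + unit)
```

In the toy input, every internal k-mer of the duplicated 28-mer occurs twice and every word that crosses the spacer occurs once, so the summit falls on the duplicated unit. On real introns, low-complexity tracts would dominate an unweighted sum, which is presumably why the text lists entropy and non-triviality among the optional weights.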

This produces a different kind of gene map. Instead of asking only where the exons are, where the promoter is, or where the canonical splice junctions sit, Codondex asks: where is the gene’s highest encoded motif-density burden? Where are the repeat summits? Which short motifs form the summit core? Which adjacent motifs amplify the field? Which transcript variants carry the summit, and which exclude it? Does the summit sit in intron 1, in a UTR, near a splice boundary, inside a retained intron, in a GC-rich regulatory compartment, or in a low-complexity region that may influence chromatin rather than sequence-specific binding?

The biological implications are broad but must be stated precisely. HDRFs may contribute to regulation at the DNA level by increasing the effective local density of potential binding sites, altering DNA shape, influencing nucleosome preference, supporting chromatin-factor recruitment, contributing to methylation-associated architecture or affecting the probability of transcription-factor rebinding. They may contribute during transcription by shaping polymerase pausing, elongation or co-transcriptional splice recognition. They may contribute at the RNA level when the same density field is copied into pre-mRNA, creating repeated substrates for RNA-binding proteins, splice enhancers, splice silencers, intronic structure, R-loop tendency or RNA compartmental behavior.

The aggregate burden may also matter. A local HDRF is not isolated from the rest of the gene. A gene may contain multiple HDRF peaks, some sharing the same core motif family, some distributed across introns, some concentrated near the 5′ region, some sitting in transcript-specific compartments. The gene-level HDRF burden may shape the background geography within which the local summit operates. The summit is the highest mountain, but the surrounding range may affect its biological visibility. Context score estimates whether the summit is likely to be biologically exposed, transcribed, accessible or regulatory.

This framework also clarifies the possible role of synthetic DNA or RNA candidates. A synthetic 28-mer derived from an HDRF Summit does not reproduce the entire gene. It does not automatically carry the whole biological meaning of the chromosomal field. But it may act as a compact concentration mimic of the summit architecture. If introduced at sufficient copy number, in the correct chemical form and cellular compartment, it may present a dense version of a sequence field that the gene already carries internally. Its potential mechanism could be decoy-like, scaffold-like, guide-like, competitive, structural or binding-mediated. The hypothesis is not that any high-frequency motif will function. The hypothesis is that a summit-derived 28-mer is a rational candidate because it is selected from the strongest encoded motif-concentration point in the gene’s own geography.

HDRF geography therefore moves gene analysis away from the idea that regulation is only a list of known motifs at known annotations. It proposes that each gene carries an internal terrain of motif density. Some of that terrain may be silent, some structural, some regulatory, some transcript-specific, some disease-contextual. But the terrain exists. It can be measured. It can be ranked. It can be compared between transcript variants, genes, tissues and disease states.

In this model, the genome is not flat text. It is landscape. Introns are not empty space. They may contain ridges, basins and summits of encoded regulatory potential. The highest HDRF is the mountain in that landscape: the place where nested repeat architecture is most concentrated, where the gene’s internal sequence burden rises to its maximum, and where Codondex begins looking for the most compact representation of that hidden regulatory geography.

Thursday, March 26, 2026

When Processing, Not Presence, Determines Visibility

It is easy to assume that if a protein accumulates in a diseased cell, the immune system will eventually see it. In the case of p53, that assumption has always had an intuitive appeal. p53 is one of the central stress-response proteins in biology, frequently altered in cancer, often stabilized, and deeply woven into the molecular logic of cell fate. If any intracellular protein should become immunologically visible, it ought to be p53.

But the deeper one looks at antigen presentation, the less that simple view holds. What matters is not merely whether p53 is present. What matters is whether peptide fragments derived from p53 are generated in the right form, survive intracellular trimming, fit the preferences of a particular HLA groove, and remain stable enough on the cell surface to be interrogated by either a T cell or an NK-cell receptor system. The 2022 Codondex article, Expanding Treatment Horizons, was already moving in that direction by highlighting an underappreciated observation from the HLA-C ligandome literature: a TP53-derived peptide, TAKSVTCTY, was identified as a naturally presented ligand of HLA-C*02:02. That observation comes from Moreno Di Marco and colleagues’ immuno-peptidomics study, which also listed MAGEA3-derived peptides among ligands presented by the same allotype.

That point remains important, but it also needs sharpening. The HLA-C paper tells us that a TP53-derived peptide can be naturally presented by HLA-C*02:02. It does not tell us that HLA-C*02:02 is already a dominant or clinically validated p53 presentation route in the way that HLA-A*02:01 has become. For that, the literature is far stronger on the HLA-A side. A substantial body of work has shown that wild-type p53 peptides presented by HLA-A*02:01, especially the well-known p53(264–272) epitope LLGRNSFEV, can stimulate cytotoxic T-cell responses and can be recognized on tumor cells. This was shown in studies such as Chikamatsu et al., Hoffmann et al., Gnjatic et al., and later vaccine-oriented work including Svane et al., and in the broader review literature on p53-targeting vaccines. In other words, for HLA-A*02:01, p53 is not just a theoretical ligand source; it is already part of a fairly mature immunotherapeutic story.

The most useful contribution of the recent Nature paper, The DNA virome varies with human genes and environments, is that it sharpens the mechanistic frame through which both HLA-C*02:02 and HLA-A*02:01 should now be viewed. The paper is not a p53 paper. It does not center tumor antigens, and it does not establish anything directly about TP53 peptide presentation. What it does show, at population scale, is that viral DNA load is shaped not only by HLA variation but also by the antigen-processing machinery, especially ERAP1 and ERAP2. That matters because it shifts the center of gravity away from a simplistic “does the peptide bind?” model and toward a more realistic “does the peptide survive the whole processing pipeline?” model.

That shift is especially important for p53. The HLA-A*02:01 literature had already hinted that presentation of the classic p53(264–272) epitope depends on more than sequence alone. The work by Kuckelkorn et al. showed that generation of this epitope is influenced by the interferon-γ-inducible processing machinery and that a hotspot mutation at residue 273 can prevent proper generation of the epitope. This is a reminder that even for the most familiar p53/HLA-A*02:01 peptide, presentation is a processing problem before it becomes a recognition problem. The Nature virome study widens that principle: inherited variation in antigen processing can have measurable biological consequences at human scale. Read together, these papers suggest that p53 visibility is governed not simply by the existence of a fitting sequence, but by whether intracellular processing delivers that sequence intact to the appropriate HLA molecule.

This is where the contrast between HLA-A*02:01 and HLA-C*02:02 becomes genuinely interesting. HLA-A*02:01 has a long experimental trail behind it: peptides were mapped, CTLs were induced, tumors were shown to present certain epitopes, and vaccine studies were built on top of that scaffold. HLA-C*02:02, by contrast, remains more conditional. The ligandome study establishes that TAKSVTCTY from TP53 can indeed appear on HLA-C*02:02, and it also gives a broader view of the peptide preferences of that allotype. In that same work, HLA-C*02:02 is described as favoring small aliphatic or hydrophilic residues at position 2, with additional motif features helping define its ligand space. That does not diminish the importance of the TP53 observation; it means the TP53 peptide should be treated as a real but selective presentation event rather than assumed to be broadly immunodominant.

The biology becomes even more layered because HLA-C is not simply a lower-profile version of HLA-A. HLA-C occupies a distinct place in immune regulation. Compared with HLA-A and HLA-B, HLA-C is generally expressed at lower surface levels and is more tightly integrated with KIR-mediated NK-cell regulation. That broader point is well summarized in the Nature Communications paper Structural and regulatory diversity shape HLA-C protein expression levels, which notes both the lower surface expression of HLA-C and its extensive functional relationship with KIRs. This makes HLA-C particularly interesting for p53 because a peptide displayed by HLA-C is not only a possible T-cell target; it is also part of a signaling surface read by NK cells.

That NK dimension turns out not to be merely background context. More recent work has shown that KIR recognition of HLA-C is often peptide-dependent. The point is made clearly in studies such as Sim et al. 2017 and Sim et al. 2023: the HLA-C molecule is not being read in a peptide-blind way. Inhibitory and activating KIRs can be strongly shaped by the identity of the peptide bound in the groove. That has profound implications for any discussion of TP53 peptides on HLA-C*02:02. A TP53-derived peptide on HLA-C*02:02 may not simply mark a cell for CD8 T-cell inspection; it may also alter the threshold for NK inhibition or activation. This is one of the most important places where the older Codondex article and the newer immunogenetic literature genuinely converge.

So the corrected reading is not that the 2026 Nature paper newly proves something specific about HLA-C*02:02 presenting p53. It does not. What it does is make the older HLA-C*02:02 observation more meaningful by placing it inside a stronger mechanistic framework. The question is no longer only whether TAKSVTCTY can bind HLA-C*02:02; the question is whether an individual’s processing machinery, inflammatory state, and HLA context allow that peptide to be generated, preserved, loaded, displayed, and then interpreted by either T cells or NK cells in a biologically consequential way. That is a more demanding question, but it is also a more interesting one.

This also helps explain why HLA-A*02:01 remains the more established p53 route. The A*02:01 pathway has yielded peptides that are repeatedly recoverable in experimental systems, repeatedly recognized by CTLs, and repeatedly leveraged in translational work. The HLA-C*02:02 pathway looks more contingent: real, but likely more dependent on peptide selection pressure, trimming, and the NK-facing consequences of peptide-loaded HLA-C. Seen this way, HLA-A*02:01 is the clearer adaptive pathway, while HLA-C*02:02 may be a narrower but potentially more intriguing bridge between tumor antigen presentation and innate immune tuning.

That may be the most useful lesson from putting these papers together. p53 is not simply “presented” or “not presented.” It passes through a filter. In HLA-A*02:01, that filter has already produced a clinically legible signal. In HLA-C*02:02, the signal is fainter, but perhaps more information-rich, because it may be read simultaneously by T cells and NK-cell receptor systems. If that is right, then the next real step is not more speculation about binding motifs alone. It is experimental work that directly compares TP53 peptide generation, ERAP dependence, surface abundance, and KIR/TCR consequences across HLA-A*02:01 and HLA-C*02:02 backgrounds. That is where the overlap becomes testable rather than merely suggestive.

Saturday, January 17, 2026

Genome Balance: Repeats, Immunity, and Cancer

Cancer is usually described as a disease of mutations. Genes break, pathways fail, and cells escape control. That framing has been powerful, but it misses a deeper layer that may reveal how the disease begins.

The human genome is not primarily a coding genome. It is a repeat genome. More than half of our DNA consists of repetitive elements, with Alu retroelements alone numbering over a million copies. These sequences are a defining feature of primate genomes and they create a unique biological problem that human cells must continuously manage. Recent work suggests that cancer may emerge, in part, when this management system loses balance.

Alu elements are short retrotransposons that readily form double‑stranded RNA stem‑loop structures when transcribed, particularly in antisense orientation within introns and untranslated regions. To the innate immune system, these structures resemble viral RNA. This means that normal gene expression in human cells constantly risks triggering antiviral immune responses against self‑derived RNA.

A striking recent study shows that human cells rely on active suppression to avoid this outcome. In Ku suppresses RNA‑mediated innate immune responses in human cells to accommodate primate‑specific Alu expansion, the authors demonstrate that the DNA repair protein Ku (Ku70/Ku80) plays an essential second role: binding Alu‑derived dsRNA stem‑loops and preventing activation of innate immune sensors such as MDA5, RIG‑I, PKR, and OAS/RNase L.

When Ku is depleted, interferon and NF‑κB signaling are strongly activated, translation is suppressed, and cells undergo growth arrest or death. Notably, Ku levels scale tightly with Alu expansion across primates, and Ku is essential in human cells but not in mice. The implication is clear:

Human cell viability depends on continuous suppression of Alu‑derived innate immune activation.

Alu expression is not harmless noise; it is actively tolerated. Ku functions as a finite buffer that allows primate cells to tolerate structurally immunogenic RNA produced by repeat‑rich genomes. When structured RNA load increases simultaneously from endogenous repeat transcription and exogenous viral RNA infection, Ku becomes functionally saturated and redistributed, weakening nuclear retention and cytoplasmic buffering. This strains the cell’s capacity to contain dsRNA stress, promoting escape of repeat‑derived RNA, activation of innate sensors, and eventual selection for immune‑tolerant states.

A second line of evidence connects this tolerance to cancer evolution. A 2025 bioRxiv preprint, p53 loss promotes chronic viral mimicry and immune tolerance, shows that loss of p53 permits transcription of immunogenic repetitive elements, generating signals that resemble viral infection. Rather than leading to effective immune clearance, this state becomes chronic. Tumor cells adapt by dampening innate immune responses and tolerating persistent repeat‑derived nucleic acids.

In this view, “viral mimicry” is not a one‑time immune alarm. It is a conditioning process: repeat RNAs accumulate, immune pathways are activated, and those pathways are then progressively suppressed or rewired to allow survival. Cancer cells do not simply evade immunity; they learn to live with endogenous viral‑like signals.

These immune findings align with earlier evidence that repeat control begins at the level of genome structure itself. A 2022 Nature Communications study demonstrated that retroelements embedded within the first intron of TP53 act as cis‑repressive genomic architecture. Removing this intron increases TP53 expression, indicating that long‑embedded repeats contribute directly to regulating a core tumor suppressor gene.

Importantly, this repression is architectural rather than motif‑driven. The repeats do not act through a single conserved sequence, but through repeat‑dense structure.

Together, these findings suggest a layered system of control:

Structural repression of repeats within introns.
Immune suppression of repeat‑derived dsRNA.
p53‑dependent governance of both genome stability and immune signaling.

One long‑standing challenge in repeat biology is inconsistency. Different tumors show different repeat fragments. Even different regions of the same tumor can look unrelated at the sequence level.

From a traditional biomarker perspective, this appears discouraging. From a structural perspective, it is expected. Codondex analyses of repeat‑dense introns, including TP53 intron 1, show that cancer does not preserve specific Alu sequences. Instead, it perturbs repeat topology:

dominance and skew within intronic scaffolds,
stem‑loop‑prone architectures,
context‑specific fragmentation patterns.

The sequences vary. The instability regime does not. This is characteristic of a state change, not a discrete genetic event. Repeat‑dense introns behave like stress recorders. They integrate replication stress, chromatin relaxation, repair pathway bias, and immune tolerance history.

Unlike coding mutations, these signals are heterogeneous, region‑specific, and reflective of ongoing cellular state.

They are difficult to interpret with gene‑centric tools, but powerful when viewed architecturally.

Most cancer diagnostics ask:

What mutation is present?

A repeat‑aware framework asks:

Has this tissue entered a stable state of repeat derepression coupled with immune tolerance?

That state may precede aggressive behavior, accompany treatment resistance, or mark transitions in disease evolution. Future prognostic approaches may therefore combine repeat‑topology instability metrics, repeat RNA burden, and evidence of immune decoupling from dsRNA load. Not to identify a single driver, but to detect loss of containment.

Alu repeats do not cause cancer on their own, but human cells must continuously restrain them, structurally and immunologically. Cancer appears, at least in part, when that restraint erodes and tolerance replaces control. Introns, long treated as background, may be one of the clearest places to see this shift, not because they encode instructions, but because they actively record genomic history and project it into a measure of present state.

Wednesday, August 13, 2025

Repeats as Signatures of Regulatory Potential

In the vast landscape of AI genomics, emerging analyses reveal non-coding DNA (ncDNA) as a treasure trove of regulatory information. At Codondex, our innovative k-mer-based approach uncovers how repetitive subsequences (short DNA fragments known as k-mers) serve as powerful signatures of regulatory potential. By viewing these repeats through a topological lens, we transform linear sequences into dynamic networks that highlight subtle distinctions in gene transcripts, offering new insights into gene regulation, isoform diversity, and disease mechanisms.

The Codondex Method: From Sequences to Topology

Codondex begins by "amplifying" ncDNA sequences associated with gene transcripts, generating all contiguous k-mers of length 8 or greater. For a gene like TP53, with its multiple isoforms (variants), we associate these k-mers with transcript-specific signatures derived from cDNA, mRNA or protein constants. The result? A rich dataset of subsequences, where repeats—identical k-mers appearing multiple times—emerge as key players.
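As a rough illustration of this first step, the sketch below enumerates every contiguous k-mer of length 8 or more in a sequence and keeps those that occur at more than one position. It is a minimal example under assumptions, not the Codondex pipeline: the function name `repeated_kmers` is hypothetical, the toy sequence is invented, and the association with cDNA, mRNA or protein signatures is not modeled.

```python
from collections import defaultdict

def repeated_kmers(seq, k_min=8):
    """Map each k-mer (length >= k_min) that occurs more than once
    to the list of its 0-based start positions in seq."""
    positions = defaultdict(list)
    n = len(seq)
    for k in range(k_min, n + 1):
        for i in range(n - k + 1):
            positions[seq[i:i + k]].append(i)
    # Keep only true repeats: identical k-mers at two or more positions.
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}

# Toy sequence: a 9-base unit duplicated around a 2-base spacer.
repeats = repeated_kmers("GATTACAGG" + "TC" + "GATTACAGG")
for kmer, pos in sorted(repeats.items(), key=lambda item: (len(item[0]), item[1])):
    print(kmer, pos)
```

Enumerating every length up to the full sequence grows quadratically with sequence length, so for long introns a capped maximum k or a suffix-based index would be the more practical choice.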

Rather than treating DNA as a flat string, we interpret it topologically: k-mers as nodes in a graph, with repeats forming edges that indicate connections, clusters, and symmetries. Metrics like i-Score (normalizing contained k-mers by length) and inclusiveness (repeat frequency) rank these patterns, while cDNA or protein vectors capture fine distinctions. In our analyses of genes such as MEN1 and TP53, symmetries in repeat length and frequency stand out, unrelated to obvious features like reverse complements. These non-random patterns suggest repeats are not artifacts but deliberate signatures encoded for regulation.

Repeats as Regulatory Hotspots

How do these repeats signal regulatory potential? First, they often manifest as binding sites for proteins. Repetitive motifs can amplify affinity for transcription factors or splicing regulators. In TP53 introns, high-frequency k-mers align with p53-binding elements, potentially modulating tumor-suppressive isoforms. Variants with asymmetric repeats might weaken these interactions, leading to dysregulation in cancer.

Second, repeats influence secondary structures. Topologically, frequent repeats create "hubs" in the network, fostering DNA/RNA folds like hairpins that affect chromatin accessibility or mRNA stability. Our MEN1 intron 1 study, analyzing 15 variants, revealed length-biased repeat clusters in scatter graphs, despite length-agnostic algorithms, indicating structured motifs that differentiate stable from unstable transcripts. Disruptions from low-length repeats, as seen in TP53 vectors, act like regulatory "switches," fine-tuning expression in response to cellular stress.

Third, symmetries in repeats point to evolutionary conservation. Equal-length k-mers recurring with balanced frequencies form symmetric graphs, preserving robust modules across species. In MEN1, linked to endocrine tumors, these patterns suggest intron-driven adaptations for hormone regulation. Disruptions in variants could flag pathogenicity, enabling predictive modeling without coding-sequence reliance.

Real-World Implications and Validation

Our deep k-mer analysis, first detailed in a 2018 blog post, showcased MEN1 intron symmetries predicting protein outcomes, later validated through lab tests at Tel Aviv University. For TP53, stable vector positions disrupted by specific repeats correlated with isoform-specific roles, highlighting ncDNA's influence on cancer hallmarks.

This topological view empowers genomics: identifying regulatory elements for drug targeting, differentiating disease variants, and advancing precision medicine. At Codondex, we're excited to explore how these repeat signatures unlock ncDNA's secrets—join us in redefining genomic potential.

Monday, June 28, 2021

Immunity keeping p53 in check!

In a 2012 study on the topology of the human and mouse m6A RNA methylomes, Gene Ontology (GO) analysis of differentially expressed genes (DEGs) indicated a noteworthy enrichment of the p53 signaling pathway: 22/23 genes had differentially expressed splice variants, of which 18 were methylated. Moreover, 15 other members of the signaling pathway, which were not significant DEGs, exhibited significant differential isoform expression. For example, isoforms of MDM4, needed for p53 inactivation, were downregulated. Similar pro-apoptotic effects were observed in other pathway genes including MDM2, FAS and BAX. A higher apoptosis rate in HaCaT-T cells resulted from knockdown of the m6A methyltransferase subunit METTL3, which also reversed a significant decrease in p53 activity. Modulation of p53 signaling through splicing may be relevant to induction of apoptosis by silencing of METTL3.

Then, in 2019, a study of arsenite-induced human keratinocyte transformation demonstrated that knockdown of METTL3 significantly decreased m6A levels, restored p53 activation and inhibited cellular transformation phenotypes in the arsenite-transformed cells. Further, m6A downregulated the expression of the positive p53 regulator PRDM2 through YTHDF2-promoted decay of PRDM2 mRNAs. m6A also upregulated expression of the negative p53 regulators YY1 and MDM2 through YTHDF1-stimulated translation of YY1 and MDM2 mRNA. Taken together, the study revealed a novel role of m6A in mediating human keratinocyte transformation by suppressing p53 activation, and it sheds light on the mechanisms of arsenic carcinogenesis via RNA epigenetics.

Finally, in 2021, came the discovery that YTHDF2 is upregulated in NK cells upon activation by cytokines, tumors, and cytomegalovirus infection. Ythdf2 deficiency in NK cells impaired their anti-tumor and anti-viral activity in vivo. YTHDF2 maintains NK cell homeostasis and terminal maturation, correlating with modulating NK cell trafficking and regulating Eomes, respectively. It promotes NK cell effector function and is required for IL-15-mediated NK cell survival and proliferation by forming a STAT5-YTHDF2 positive feedback loop. Analysis showed significant enrichment in cell cycle, division, and division-related processes, including mitotic cytokinesis, chromosome segregation, spindle, nucleosome, midbody, and chromosome. These data support roles for YTHDF2 in regulating NK proliferation, survival, and effector functions. Transcriptome-wide screening identified Tardbp (TDP-43), which is involved in cell proliferation or survival, as a YTHDF2-binding target in NK cells.

Downregulation of METTL3, which in the spinal cord contributes with YTHDF2 to modulating inflammatory pain, may upregulate differentially expressed p53 network splice variants that oppose YTHDF2-induced downregulation of p53 via PRDM2, leading to apoptotic or diseased cells. In diseased environments, cytokines may upregulate YTHDF2 in NK cells, leading to downregulation of p53 and cytoskeletal transformation that may be sufficient, at an immune synapse, to advance cytolysis.

p53 signals may inform selections of cells and tissue that prime NK cells for advanced, personalized immune therapy.

Sunday, June 20, 2021

First Intron DNA - Site for a Genetic Brain?

DNA Methylation

The first intron of a gene, regardless of tissue or species, is conserved as a site of downstream methylation with an inverse relationship to transcription and gene expression. It is therefore an informative gene feature regarding the relationship between DNA methylation and gene expression. However, expression in induced pluripotent stem cells (iPSCs) has been a major challenge to the stem cell industry because, by comparison, these cells have not yet reached the state of natural pluripotent or embryonic stem cells (ESCs).

In mice, two X chromosomes (XC) are active in the epiblasts of blastocysts as well as in pluripotent stem cells. One XC is inactivated when Xist (non-coding) RNA transcripts coat it, rendering it silent. Designer transcription factor (dTF) repressors binding the Xist intron 1 enhancer region caused higher H3K9me3 methylation and led to XC opening and X-linked gene repression in MEFs. This substantially improved iPSC production and somatic cell nuclear transfer (SCNT) preimplantation embryonic development. It also correlated with far fewer of the abnormally expressed genes frequently associated with SCNT, even though it did not affect Xist expression. In stark contrast, the dTF activator targeting the same enhancer region drastically decreased both iPSC generation and SCNT efficiencies and induced ESC differentiation.

A genome-wide, tissue-independent, quasi-linear inverse relationship exists between DNA methylation of the first intron and gene expression. More tissue-specific, differentially methylated regions exist in the first intron than in any other gene feature. These have positive or negative correlation with gene expression, indicative of distinct mechanisms of tissue-specific regulation. CpGs in transcription factor binding motifs are enriched in the first intron, and methylation tends to increase with distance from the first exon–first intron boundary, with a concomitant decrease in gene expression.

Since the relationship between sequence, methylation, repression and transcription is determinative in ESC differentiation, it may also suggest a broader link to differential translation. Translation is required for miRNA-dependent transcript destabilization that alters levels of coding and noncoding transcripts. However, steady-state abundance and decay rates of cytosolic long non-coding RNAs (lncRNAs) are insensitive to miRNA loss. Instead, lncRNAs fused to protein-coding reporter sequences become susceptible to miRNA-mediated decay.

In this model, first intron DNA sequences that are differentially methylated bind transcription factors that affect transcription and impact splicing, the expression of coding or non-coding transcripts and transcript destabilization, resulting in differential rates of, and possible variations in, translation. This bottom-up, dynamic view of the classical process may elevate the first intron from 'junk' to a DNA 'brain' because it plays a more extensive role, heading the process toward translation of any gene or switching it off entirely.

For this reason, among others, Codondex uses first intron k-mers relative to the transcript's mRNA as the basis for comparing same-gene transcripts in diseased cells or tissue samples. Further, p53 and BRCA1 miRNA key sequences, discovered using the Codondex iScore algorithm, significantly reduced proliferation when transfected into HeLa cells, which may result from this accelerated, transfected-miRNA-dependent decay.

Tuesday, June 1, 2021

Short Sequences of Proximally Disordered DNA

Oxford Nanopore Device Reducing Sequencing Cost

Relationships exist between short sequences of proximal DNA (SSPD) of a gene that, when transcribed into RNA, present stronger or weaker binding attractions to RNA-binding proteins (RBPs) that settle, edit, splice and resolve messenger RNA (mRNA). Responsive to epigenetic stimuli on histones and DNA, mRNAs are constantly transcribed in different quantities and at different times, such that different mRNA strands are transported from the nucleus to the cytoplasm, where they are translated into any of more than 30,000 different proteins.

Single nucleotide polymorphisms and DNA mutations can alter SSPD combinations in different diseased cells, thus altering sequence proximity and ordering, which affects the transcribed RNA's attraction to, and optimal binding of, RBPs. This may result in modified splicing of RNA, modified assembly of mRNA and slight or major variations in some or all of the translated protein derived from that gene.
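To make the effect of a single substitution on neighbouring short sequences concrete, here is a minimal sketch. It assumes that an SSPD can be approximated by fixed-length overlapping windows (k-mers), which is a simplification of ours, and the sequence and SNP below are made up for illustration. It shows that one changed base alters every window that overlaps it, up to k windows.

```python
def kmers(seq, k):
    """Return (start, kmer) for every window of length k in seq."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def apply_snp(seq, pos, base):
    """Return seq with the base at 0-based position pos replaced by base."""
    return seq[:pos] + base + seq[pos + 1:]


def windows_changed(reference, variant, k):
    """Start positions whose k-mer differs between two equal-length sequences."""
    pairs = zip(kmers(reference, k), kmers(variant, k))
    return [i for (i, a), (_, b) in pairs if a != b]


reference = "ACGTTGCAAGCTTAGGCTAA"        # made-up 20-base sequence, not real data
variant = apply_snp(reference, 10, "T")   # hypothetical SNP: C -> T at position 10
changed = windows_changed(reference, variant, 4)
print(changed)  # start positions of the 4-mers that overlap the SNP
```

With k = 4, the four windows starting at positions 7 to 10 all contain the substituted base, so a single SNP changes four overlapping short sequences at once.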

The specific effects of these DNA variations on the multitude of proteins produced are generally unknown. However, it remains important to understand their effects in disease, diagnosis and therapy. Historically, these have been researched by large-scale analysis of RBPs on RNA, as opposed to the more fundamental, yet underrepresented, massive array of diseased variant DNA-to-mRNA transitions.

Most pharmaceutical research is directed at molecular interference that targets an aberrant protein to cure widely represented or highly impactful disease conditions. Economic assessments generally influence government decisions to support research based on the loss of GDP contribution caused by a specific disease in a patient cohort. However, in the modern multi-omics era, top-down research into protein-RNA activity is descending deeper into the cell to include RNA-mRNA and mRNA-DNA customizable therapies that will eventually resolve individually assessed diseases at a price that addresses a much larger array of patient needs.

SNPs and other mutations can vary considerably between cells. These variations can cause instability during division and lead to translated differences that can ultimately drive cancerous cell growth to escape patient immunity. Like a 'whack-a-mole' game, pattern variation and mechanistic persistence eventually beat the player. Without effective immune clearance, these cells can replicate into tumors and contribute to microenvironments that support their existence.

Link to video on tumor microenvironment https://youtu.be/Z9H2utcnBic

We set out to analyze DNA and mRNA transcripts from cells in tumors and their microenvironments to see if we could expose the disordered SSPD combinations that may have promoted sub-optimal RBP attractions and led to sustained immune escape. Given the complexity of DNA-to-mRNA transcription, many distortions in the data sets for any given gene have to be filtered. To do that we focused on p53, the most mutated gene in cancer. We designed a method to compare sequence arrays of DNA and mRNA Ensembl transcripts, from the consensus of healthy patients to multiple cell samples extracted from different sections of a patient's tumor and tumor microenvironment.

We previously identified and measured different levels of Natural Killer (NK) cell cytotoxicity produced from cocultures with the samples extracted from each of the multiple sites of a biopsy. We will measure the different p53 transcript SSPD combinations associated with each sample and determine whether disordered SSPDs correlate with NK cytotoxicity from each coculture. We expect to identify whether biopsied tumor cells, ranked by SSPDs, predict the cytotoxicity resulting from NK cell cocultures. We will narrow our research to identify the varied expressions of receptor combinations associated with degrees of cytotoxicity. We will test immune efficacy to lyse and destroy tumor cells. Finally, we will test for adaptive immune response.
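One way such a ranking could be checked against measured cytotoxicity is a rank correlation. The sketch below is only an illustration under our own assumptions: it uses Spearman's coefficient, which the post does not name, and the per-sample numbers are placeholders rather than measurements.

```python
def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for m in range(i, j + 1):
            result[order[m]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant number lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder values for illustration only, one entry per biopsy sample.
sspd_disorder_score = [0.12, 0.40, 0.33, 0.75, 0.58]   # hypothetical score
nk_cytotoxicity = [14.0, 31.0, 25.0, 52.0, 49.0]       # hypothetical readout
print(spearman(sspd_disorder_score, nk_cytotoxicity))
```

A coefficient near 1 or -1 would suggest that the SSPD-based ranking orders the samples much as the cytotoxicity readout does; a value near 0 would suggest no monotonic relationship.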

Our vision is for per-patient, predictable cell co-culture pairings, for innate immune cell education based on ranking DNA-mRNA combinations to lead to multiple effective therapies. The falling cost of sequencing and sophistication of GMP laboratories presently servicing oncologists may support a successful use of this analytical approach to laboratory assisted disease management.

Thursday, May 13, 2021

Non-Coding DNA Key Sequences

DNA Structural Inherency

Wind two strands of elastic: eventually they will knot, and ultimately they will double up on themselves. Separate the strands. From the point of unwinding, forces will be directed to different regions and the separation will approximately return to the wound state of the band. Do the same with each of 10 different bands or strings of any type, and they will all behave in much the same way. For a given section of DNA being transcribed, the effect of separation will be much the same. For a given gene, there will be sequences that can tolerate force to greater or lesser degrees. For different transcripts of a gene, variation at those sequences may be crucial to the integrity of the transcription machinery that separates DNA strands to initiate copying into RNA, and to the outcome.

Cellular biology is enormously complex in all regards. The physics of molecular interaction, fluid dynamics, and chemistry combine in a system where cause and effect are near impossible to predict. At the most elementary level, we hypothesize that some non-coding DNA (ncDNA) possesses structural inherencies that can be deployed to direct gene proteins and cell function for diagnosis or therapy.

Coding DNA and its regulatory, non-coding gene complement are transcribed together and then spliced from the transcribed gene. Transcription to RNA, edited mRNA, spliced non-coding RNA and ultimately mRNA translation to protein can produce wide-ranging, variable outcomes that may not be recaptured experimentally.

A single nucleotide polymorphism (SNP) or a combination of SNPs within a gene may affect the finely tuned balance that results. Under different environmental conditions this could be material to the protein produced. Additionally, other mutations of the gene could add complexity to the environment and/or the resulting protein translation.

At this level of cellular biology, genetic DNA stores instruction for protein assemblies to produce new protein required for the fully functional cell. However, DNA's stored mutations can lead to different functional or non-functional versions of protein depending on many different factors. Relationships between ncDNA, including mutations and the transcripts' edited, protein coding mRNA may represent unexplored inherencies that can regulate the gene's mRNA or translated protein.

We built an algorithm to elaborately compare ncDNA sequences of multiple protein-coding transcripts of the same gene. For each transcript it steps through every variable-length ncDNA sequence (kmer), specifically in intron 1, computes a signature for each and indexes it to the constant of the transcript's mRNA signature. At each step these signatures order the kmers of each of the transcripts. The order is represented in a vector of all the transcripts being compared.
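The stepping and ordering described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Codondex implementation: the post does not define the signature, so `signature` here is a placeholder (a smoothed GC fraction), the step scheme (every offset for every kmer length, limited to the shortest intron) is our guess, and the toy sequences are made up.

```python
def signature(seq):
    """Placeholder signature: smoothed GC fraction (our assumption only)."""
    gc = sum(1 for base in seq if base in "GC")
    return (gc + 1) / (len(seq) + 2)


def ordering_vectors(transcripts, k_min, k_max):
    """Yield (k, offset, order) for every step shared by all transcripts.

    transcripts maps a transcript id to an (intron1, mrna) pair of strings.
    order lists the transcript ids sorted by kmer signature divided by the
    transcript's constant mRNA signature, with ties broken by id.
    """
    mrna_sig = {tid: signature(mrna) for tid, (_, mrna) in transcripts.items()}
    shortest = min(len(intron) for intron, _ in transcripts.values())
    for k in range(k_min, k_max + 1):
        for offset in range(shortest - k + 1):
            indexed = {
                tid: signature(intron[offset:offset + k]) / mrna_sig[tid]
                for tid, (intron, _) in transcripts.items()
            }
            order = sorted(indexed, key=lambda tid: (indexed[tid], tid))
            yield k, offset, order


# Made-up toy sequences for illustration only, not real transcripts.
toy = {
    "T1": ("ACGTGGCCAT", "ATGGCGTAA"),
    "T2": ("ACGTAATTAT", "ATGGCGTAA"),
    "T3": ("GGGCGGCCAT", "ATGAAATAA"),
}
steps = list(ordering_vectors(toy, 3, 4))
print(len(steps), steps[0])
```

Each yielded `order` is one ordering vector. On real intron 1 sequences the number of steps grows with intron length and the range of kmer lengths, which is consistent with the millions of steps mentioned in the post.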

Over millions of successive steps (depending on total intron 1 lengths), transcripts mostly retain their vector ordering except, as expected, at a kmer length change. Mostly the transcript order in the vector does not change; occasionally a few positions change; very rarely do all positions change. Position changes that are caused by another, like a domino effect, are filtered out. For the rarest position changes at a step, we look to the root causes in the kmer (sequence). We call this a Key Sequence because it is identified by the significance of changes to transcript positions in the vector compared to the vector at the next step.

Therefore, Key Sequences cause the most position changes between the transcripts being compared by the algorithm. This relative measure is step dependent, and Key Sequences are discovered by comparing transcript positions in the vector with those at the next step location. Logically, this implies a gene's structural inherency, discovered through ncDNA Key Sequence relationships to mRNA and to other transcripts, or else an error in gene alignments, sequenced reads or the algorithm.
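The comparison of transcript positions between one step and the next can be illustrated with a short sketch. It reflects our reading of the description, not the published method: domino shifts are discounted by counting the minimum number of transcripts that must move (n minus the longest increasing subsequence of old positions), and comparisons across a kmer length change are skipped. The function names and the example vectors are hypothetical.

```python
from bisect import bisect_left


def moved_count(before, after):
    """Minimum number of transcripts that must move to turn before into after.

    Computed as n minus the longest increasing subsequence of the positions
    the items of after held in before. Transcripts that only shift because
    another one jumped past them (the domino effect) are not counted.
    """
    position = {tid: i for i, tid in enumerate(before)}
    tails = []
    for tid in after:
        p = position[tid]
        slot = bisect_left(tails, p)
        if slot == len(tails):
            tails.append(p)
        else:
            tails[slot] = p
    return len(after) - len(tails)


def candidate_key_steps(steps):
    """Return the (k, offset) steps whose ordering differs most from the next.

    steps is a list of (k, offset, order) tuples in stepping order.
    """
    scored = []
    for (k, offset, order), (next_k, _, next_order) in zip(steps, steps[1:]):
        if k != next_k:
            continue  # reordering is expected at a kmer length change
        scored.append((moved_count(order, next_order), k, offset))
    top = max((score for score, _, _ in scored), default=0)
    return [(k, offset) for score, k, offset in scored if score == top and score > 0]


# Made-up ordering vectors for four hypothetical transcripts A to D.
example = [
    (3, 0, ["A", "B", "C", "D"]),
    (3, 1, ["A", "B", "C", "D"]),
    (3, 2, ["D", "C", "B", "A"]),
    (4, 0, ["A", "B", "C", "D"]),
    (4, 1, ["B", "A", "C", "D"]),
]
print(candidate_key_steps(example))
```

In this example the step at kmer length 3, offset 1 is flagged, because the ordering at the next step reverses completely. The kmer at a flagged step would then be the candidate Key Sequence to inspect.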

In assay testing we were able to predict and synthesize non-coding RNA Key Sequences that significantly reduced proliferation of HeLa cells. In our pre-clinical work, based on comparisons to TP53 transcripts, we will be predicting the efficacy of cell and tissue selections that educate and activate Natural Killer cells.

If Key Sequences are inherent they could open a new frontier for diagnosis and therapy.

Codondex