Codondex: k-mer


Saturday, January 17, 2026

Genome Balance: Repeats, Immunity, and Cancer

Cancer is usually described as a disease of mutations. Genes break, pathways fail, and cells escape control. That framing has been powerful, but it misses a deeper layer that may reveal how it begins.

The human genome is not primarily a coding genome. It is a repeat genome. More than half of our DNA consists of repetitive elements, with Alu retroelements alone numbering over a million copies. These sequences are a defining feature of primate genomes and they create a unique biological problem that human cells must continuously manage. Recent work suggests that cancer may emerge, in part, when this management system loses balance.

Alu elements are short retrotransposons that readily form double‑stranded RNA stem‑loop structures when transcribed, particularly in antisense orientation within introns and untranslated regions. To the innate immune system, these structures resemble viral RNA. This means that normal gene expression in human cells constantly risks triggering antiviral immune responses against self‑derived RNA.

A striking recent study shows that human cells rely on active suppression to avoid this outcome. In “Ku suppresses RNA‑mediated innate immune responses in human cells to accommodate primate‑specific Alu expansion”, the authors demonstrate that the DNA repair protein Ku (Ku70/Ku80) plays an essential second role: binding Alu‑derived dsRNA stem‑loops and preventing activation of innate immune sensors such as MDA5, RIG‑I, PKR, and OAS/RNase L.

When Ku is depleted, interferon and NF‑κB signaling are strongly activated, translation is suppressed, and cells undergo growth arrest or death. Notably, Ku levels scale tightly with Alu expansion across primates, and Ku is essential in human cells but not in mice. The implication is clear:

Human cell viability depends on continuous suppression of Alu‑derived innate immune activation.

Alu expression is not harmless noise; it is actively tolerated! Ku functions as a finite buffer that allows primate cells to tolerate structurally immunogenic RNA produced by repeat‑rich genomes. When structured RNA load increases simultaneously from endogenous repeat transcription and exogenous viral RNA infection, Ku becomes functionally saturated and redistributed, weakening nuclear retention and cytoplasmic buffering. This strains the cell’s capacity to contain dsRNA stress, promoting escape of repeat‑derived RNA, activation of innate sensors, and eventual selection for immune‑tolerant states.

A second line of evidence connects this tolerance to cancer evolution. A 2025 bioRxiv preprint, “p53 loss promotes chronic viral mimicry and immune tolerance”, shows that loss of p53 permits transcription of immunogenic repetitive elements, generating signals that resemble viral infection. Rather than leading to effective immune clearance, this state becomes chronic. Tumor cells adapt by dampening innate immune responses and tolerating persistent repeat‑derived nucleic acids.

In this view, “viral mimicry” is not a one‑time immune alarm. It is a conditioning process: repeat RNAs accumulate, immune pathways are activated, and those pathways are progressively suppressed or rewired to allow survival. Cancer cells do not simply evade immunity; they learn to live with endogenous viral‑like signals.

These immune findings align with earlier evidence that repeat control begins at the level of genome structure itself. A 2022 Nature Communications study demonstrated that retroelements embedded within the first intron of TP53 act as cis‑repressive genomic architecture. Removing this intron increases TP53 expression, indicating that long‑embedded repeats contribute directly to regulating a core tumor suppressor gene.

Importantly, this repression is architectural rather than motif‑driven. The repeats do not act through a single conserved sequence, but through repeat‑dense structure.

Together, these findings suggest a layered system of control:

Structural repression of repeats within introns.
Immune suppression of repeat‑derived dsRNA.
p53‑dependent governance of both genome stability and immune signaling.

One long‑standing challenge in repeat biology is inconsistency. Different tumors show different repeat fragments. Even different regions of the same tumor can look unrelated at the sequence level.

From a traditional biomarker perspective, this appears discouraging. From a structural perspective, it is expected. Codondex analyses of repeat‑dense introns, including TP53 intron 1, show that cancer does not preserve specific Alu sequences. Instead, it perturbs repeat topology:

dominance and skew within intronic scaffolds,
stem‑loop‑prone architectures,
context‑specific fragmentation patterns.

The sequences vary. The instability regime does not. This is characteristic of a state change, not a discrete genetic event. Repeat‑dense introns behave like stress recorders. They integrate replication stress, chromatin relaxation, repair pathway bias, and immune tolerance history.

Unlike coding mutations, these signals are heterogeneous, region‑specific, and reflective of ongoing cellular state.

They are difficult to interpret with gene‑centric tools, but powerful when viewed architecturally.

Most cancer diagnostics ask:

What mutation is present?

A repeat‑aware framework asks:

Has this tissue entered a stable state of repeat derepression coupled with immune tolerance?

That state may precede aggressive behavior, accompany treatment resistance, or mark transitions in disease evolution. Future prognostic approaches may therefore combine repeat‑topology instability metrics, repeat RNA burden, and evidence of immune decoupling from dsRNA load. Not to identify a single driver, but to detect loss of containment.

Alu repeats do not cause cancer on their own, but human cells must continuously restrain them, structurally and immunologically. Cancer appears, at least in part, when that restraint erodes and tolerance replaces control. Introns, long treated as background, may be one of the clearest places to see this shift, not because they encode instructions, but because they actively record genomic history and project it into a measure of present state.

Wednesday, August 13, 2025

Repeats as Signatures of Regulatory Potential

In the vast landscape of AI genomics, emerging analyses reveal non-coding DNA (ncDNA) as a treasure trove of regulatory information. At Codondex, our innovative k-mer-based approach uncovers how repetitive subsequences (short DNA fragments known as k-mers) serve as powerful signatures of regulatory potential. By viewing these repeats through a topological lens, we transform linear sequences into dynamic networks that highlight subtle distinctions in gene transcripts, offering new insights into gene regulation, isoform diversity, and disease mechanisms.

The Codondex Method: From Sequences to Topology

Codondex begins by "amplifying" ncDNA sequences associated with gene transcripts, generating all contiguous k-mers of length 8 or greater. For a gene like TP53, with its multiple isoforms (variants), we associate these k-mers with transcript-specific signatures derived from cDNA, mRNA or protein constants. The result? A rich dataset of subsequences, where repeats—identical k-mers appearing multiple times—emerge as key players.
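The enumeration step described above can be sketched in a few lines of Python. This is a minimal illustration and not the Codondex implementation: the function names `enumerate_kmers` and `find_repeats` and the toy sequence are assumptions made for the example, and the association with cDNA, mRNA or protein signatures is left out.

```python
from collections import defaultdict

def enumerate_kmers(seq, min_k=8):
    """Yield (start, kmer) for every contiguous k-mer with k >= min_k."""
    n = len(seq)
    for k in range(min_k, n + 1):
        for start in range(n - k + 1):
            yield start, seq[start:start + k]

def find_repeats(seq, min_k=8):
    """Map each k-mer found at more than one position to its start offsets."""
    positions = defaultdict(list)
    for start, kmer in enumerate_kmers(seq, min_k):
        positions[kmer].append(start)
    return {kmer: starts for kmer, starts in positions.items() if len(starts) > 1}

# Toy sequence (hypothetical, not TP53 data) containing one repeated 8-mer.
toy = "ACGTTGCAAGGACGTTGCAT"
repeats = find_repeats(toy)
for kmer, starts in repeats.items():
    print(kmer, starts)
```

Storing start offsets rather than bare counts keeps the positional information that a later topological view of the repeats would need.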

Rather than treating DNA as a flat string, we interpret it topologically: k-mers as nodes in a graph, with repeats forming edges that indicate connections, clusters, and symmetries. Metrics like i-Score (normalizing contained k-mers by length) and inclusiveness (repeat frequency) rank these patterns, while cDNA or protein vectors capture fine distinctions. In our analyses of genes such as MEN1 and TP53, symmetries in repeat length and frequency stand out, unrelated to obvious features like reverse complements. These non-random patterns suggest repeats are not artifacts but deliberate signatures encoded for regulation.

Repeats as Regulatory Hotspots

How do these repeats signal regulatory potential? First, they often manifest as binding sites for proteins. Repetitive motifs can amplify affinity for transcription factors or splicing regulators. In TP53 introns, high-frequency k-mers align with p53-binding elements, potentially modulating tumor-suppressive isoforms. Variants with asymmetric repeats might weaken these interactions, leading to dysregulation in cancer.

Second, repeats influence secondary structures. Topologically, frequent repeats create "hubs" in the network, fostering DNA/RNA folds like hairpins that affect chromatin accessibility or mRNA stability. Our MEN1 intron1 study, analyzing 15 variants, revealed length-biased repeat clusters in scatter-graphs—despite length-agnostic algorithms—indicating structured motifs that differentiate stable from unstable transcripts. Disruptions from low-length repeats, as seen in TP53 vectors, act like regulatory "switches," fine-tuning expression in response to cellular stress.

Third, symmetries in repeats point to evolutionary conservation. Equal-length k-mers recurring with balanced frequencies form symmetric graphs, preserving robust modules across species. In MEN1, linked to endocrine tumors, these patterns suggest intron-driven adaptations for hormone regulation. Disruptions in variants could flag pathogenicity, enabling predictive modeling without coding-sequence reliance.

Real-World Implications and Validation

Our deep k-mer analysis, first detailed in a 2018 blog post, showcased MEN1 intron symmetries predicting protein outcomes, later validated through lab tests at Tel Aviv University. For TP53, stable vector positions disrupted by specific repeats correlated with isoform-specific roles, highlighting ncDNA's influence on cancer hallmarks.

This topological view empowers genomics: identifying regulatory elements for drug targeting, differentiating disease variants, and advancing precision medicine. At Codondex, we're excited to explore how these repeat signatures unlock ncDNA's secrets—join us in redefining genomic potential.

Wednesday, February 19, 2025

P53 - Stability and Life Or Disorder and Death!

Chromosomal stability is central to good health, but the push-and-shove war of genesis, division, transcription, replication and restraint can promote disorder. Disruption can also be retained, resulting in ageing, reduced organ function or the diseases that often follow. Recently a man escaped his genetic predisposition to Alzheimer's disease, illustrating how far we are from understanding even the most well-studied conditions.

Active or passive, mobile Transposable Elements (TEs) represent around 40-50% of the human genome, and around 30% are found in the non-coding introns of genes. The first intron is conserved as a site of downstream methylation with an inverse relationship to transcription and gene expression. Our understanding of non-coding RNA (ncRNA) suggests one of its primary functions is the restraint of mobile TEs. Several species of ncRNA are associated with this restraint and with genomic stability; most contain p53 binding sites that are also known to be involved in tumor suppression.

Of the short ncRNA species, LINE-1 (L1) siRNAs are typically 21-23 nucleotides long and play a role in silencing L1 transcripts, thus preventing retrotransposition. p53 binds the L1 promoter to restrict autonomous copies of these mobile elements in human cells. Alu elements are the most abundant transposable elements (capable of shifting their positions), with over one million copies dispersed throughout the human genome. As little as 0.7% sequence divergence resulted in a significant reduction in recombination after double-stranded breaks. piRNAs, usually 26-31 nucleotides, derived from Alu repeats restrain transposable elements. Endogenous Retroviruses (ERVs) can give rise to microRNAs (miRNAs) of 22 nucleotides that can regulate the expression of ERV sequences and other cellular genes.

TEs serve as templates for the generation of p53-binding sites on a genome-wide scale. The formation of the p53 binding motifs was facilitated via methylation and deamination, which distributes p53-binding sites and recruits new target genes to its regulatory network in a species-specific manner. This p53 mechanism conducts genomic restraint, and instability and loss or mutation of p53 are commonly associated with hallmarks of cancer.

Through a novel piRNA of the KIR3DL1 gene, antisense transcripts mediate Killer Ig-like receptor (KIR) transcriptional silencing in Natural Killer (NK) cell lineage that may be broadly used in orchestrating immune development. Silencing individual KIR genes is strongly correlated with the presence of CpG dinucleotide methylation within the promoter.

The emergence of recombination-activating genes (RAGs) in jawed vertebrates endowed adaptive immune cells with the ability to assemble a diverse set of antigen receptor genes. Innate NK cells are unable to express RAGs or RAG endonuclease activity during ontogeny. However, RAG expression in uncommitted hematopoietic progenitors and NK cell precursors marks functionally distinct subsets of NK cells in the periphery, a surprising and novel role for RAG in the functional specialization of the NK cell lineage.

The p53 C-terminus, including amino acids 360-393 of the full-length protein, localizes to the mitochondrial permeability transition pore and facilitates apoptosis. However, fragments of p53 at amino acids 1-186 and 22-186 drive the most mitochondrial depolarization. Crystal structures demonstrate that amino acid 239 binds 106 and 241 binds 105 for one p53 unit, and 243 binds 103-264-265 for a second unit, both of which are required to bind BCL-xL for apoptosis.

p53 regulates the expression of major histocompatibility complex (MHC) class I on cell surfaces. p53 peptides presented on HLA/MHC-I could attract immune surveillance as in the target-specific antitumor effects of p53 amino acids at positions 264-272, epitope 264scTCR with IL-2 on p53+/HLA-A2.1+ tumors that are primarily mediated by NK cells.

Initially, NK cells might be activated due to the combined effect of reduced inhibition (due to decreased KIR3DL1) and increased activation signals from p53 epitopes. This NK cell activation could lead to the release of cytokines that not only enhance further NK activity but also attract and activate T cells.

To summarize, p53 can influence both the presentation of its antigens through MHC-I and the regulation of NK cell inhibitory receptors like KIR3DL1 via piRNA. This could lead to a more effective immune response against cells with compromised p53 function, although the exact dynamics would depend on the specific context of cancer development, immune cell status, and individual genetic variations.

Wednesday, May 17, 2023

Immune Synchronization

Stem Cell

Navigating the regulatory regimes that govern drug safety can be challenging, but the rigorous standards are more relaxed in the lesser-used track for autologous and/or minimally manipulated cell treatments. Toward meeting the challenges of this minimal regulation track, the wide spectrum of NK cells of the innate immune system are compelling candidates to address complex cellular and tissue personalizations or conditions of disease. One effect of cell function on NK cell potency occurs via aryl hydrocarbon receptor (AhR) dietary ligands, potentially explaining numerous associations that have been observed in the past.

The AhR was first identified as binding the xenobiotic compound dioxin, environmental contaminants and toxins, in addition to a variety of natural exogenous (e.g., dietary) or endogenous ligands. Expression of AhR is also induced by cytokine stimulation. Activation with an endogenous tryptophan derivative potentiates NK cell IFN-γ production and cytolytic activity, which, in vivo, enhances NK cell control of tumors in an NK cell and AhR-dependent manner.

A combination of ex vivo and in vivo studies revealed that Acute Myeloid Leukemia (AML) skewed Innate Lymphoid Cell (ILC) progenitors towards ILC1s and away from NK cells as a major mechanism of ILC1 generation. This process was driven by AML-mediated activation of AhR, a key transcription factor in ILCs, as inhibition of AhR led to decreased numbers of ILC1s and increased NK cells in the presence of AML.

Activation of AhR also induces chemoresistance and facilitates the growth, maintenance, and production of long-lived secondary mammospheres from primary progenitor cells. AhR supports the proliferation, invasion, metastasis, and survival of Cancer Stem Cells (CSCs) in choriocarcinoma, hepatocellular carcinoma, oral squamous carcinoma, and breast cancers, leading to therapy failure and tumor recurrence.

Loss of AhR increases tumorigenesis in p53-deficient mice and activation of p53 in human and murine cells, by DNA-damaging agents, differentially regulates AhR levels. Activation of the AhR/CYP1A1 pathway induces epigenetic repression of many tumor suppressor and tumor activating genes, through modulation of their DNA methylation, histone acetylation/deacetylation, and the expression of several miRNAs.

p53 is barely detectable under normal conditions, but levels begin to elevate and locations change, particularly in cells undergoing DNA damage. The significant network effect of p53 availability and its mutational status in cancer makes it the world's most widely studied gene.

From 48 sequenced samples of two different tumors, Codondex identified 316 unique Key Sequences (KS) of the TP53 Consensus. Nine of these contained the core AhR 5′-GCGTG-3′ binding sequence, and some overlapped p53 quarter binding sites, as illustrated below:

Key Sequence

GGATAGGAGTTCCAGACCAGCGTGGCCA (intron1) AhR [1699,1726], p53 [1706,1710]

AAAAATTAGCTGGGCGTGGTGGGTGCCT (intron1) AhR [1760,1787], p53 [1783,1787]

AAAAAAAATTAGCCGGGCGTGGTGCTGG (intron6) AhR [12143,12170]

GAGGCTGAGGAAGGAGAATGGCGTGAAC (intron6) AhR [12195,12222]
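As a quick check, the four Key Sequences listed above can be scanned for the core AhR motif with a short Python sketch. The helper name `motif_hits` is an assumption made for this example. Offsets are 0-based positions within each 28-nucleotide Key Sequence, not the TP53 coordinates shown in brackets, and only the strand as written is searched.

```python
AHR_CORE = "GCGTG"  # core AhR binding sequence, 5'-GCGTG-3'

# (region, Key Sequence) pairs copied from the list above.
KEY_SEQUENCES = [
    ("intron1", "GGATAGGAGTTCCAGACCAGCGTGGCCA"),
    ("intron1", "AAAAATTAGCTGGGCGTGGTGGGTGCCT"),
    ("intron6", "AAAAAAAATTAGCCGGGCGTGGTGCTGG"),
    ("intron6", "GAGGCTGAGGAAGGAGAATGGCGTGAAC"),
]

def motif_hits(seq, motif=AHR_CORE):
    """Return the 0-based start offset of every occurrence of motif in seq."""
    hits = []
    i = seq.find(motif)
    while i != -1:
        hits.append(i)
        i = seq.find(motif, i + 1)  # step by one so overlapping hits are kept
    return hits

offsets = [motif_hits(seq) for _, seq in KEY_SEQUENCES]
for (region, seq), hits in zip(KEY_SEQUENCES, offsets):
    print(region, seq, hits)
```

Each of the four sequences contains exactly one copy of the motif on the strand shown.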

We propose that DNA damage liberates transposable DNA elements that are normally repressed by p53 and other suppressor genes. The p53 repair/response also includes increased cooperation between p53 and AhR, which further influences transcription, mRNA splicing or post-translation events. Repeated damage, at multi-cellular scale, may proximally bias ILCs toward NK cells capable of specific non-self detection, through localized ligand-receptor relationships that trigger cytolysis and immune cascades.

KSs are a retrospective view of a transcript's ncDNA elements, ranked by cDNA, that may reflect inherent bias that can be used to direct NK cell education. One way to accomplish minimal manipulation may be to leverage patient immunity by educating autologous NK cells with computationally selected tumor cells, identified by KS alignments to the index of past experiments that expanded and triggered a more desirable immune response. The primary objective is customizable immune cascades capable of managing disease or preventatively supporting a desired heterogeneity.

Wednesday, November 17, 2021

Retroviral Defense And Mitochondrial Offense

Chromosomal DNA has played host to the long game of viral insertions that repeat and continue as a genetic and epigenetic symbiosis along its phosphate and pentose sugar backbone. But, the bacterial origin of mitochondria and its hosted DNA also promotes its offense.

Research suggests that retrovirus insertions evolved from a type of transposon called a retrotransposon. The evolutionary time scales of inherited, endogenous retroviruses (ERV) and the appearance of the zinc finger gene that binds its unique sequences occur over the same time scales of primate evolution. Additionally, the zinc-finger genes that inactivate transposable elements are commonly located on chromosome 19. The recurrence of independent ERV invasions can be countered by a reservoir of zinc-finger repressors that are continuously generated on copy number variant (CNV) formation hotspots.

One of the more intriguing aspects of prevalent CNV hotspots on chromosome 19 is their proximity to killer immunoglobulin receptor genes (KIRs) and other critical genes of the innate immune system.

Frequently occurring DNA breaks can cause genomic instability, which is a hallmark of cancer. These breaks are over-represented at G4 DNA quadruplexes within hominid-specific SVA retrotransposons and generally occur in tumors with mutations in tumor suppressor genes, such as TP53. Cancer mutational burden is shaped by G4 DNA, replication stress and mitochondrial dysfunction, which in lung adenocarcinoma downregulates SPATA18, a mitochondria-eating protein (MIEAP) that contributes to mitophagy.

Genetic variations in non-coding regions can control the activity of conserved protein-coding genes, resulting in the establishment of species-specific transcriptional networks. A chromosome 19 zinc finger, ZNF558, evolved as a suppressor of LINE-1 transposons, but has since been co-opted to singly regulate SPATA18. These variations are evident from a panel of 409 human lymphoblastoid cell lines where the lengths of the ZNF558 variable number tandem repeats (VNTR) negatively correlated with its expression.

Colon cancer cells with p53 deletion were used to analyze deregulated p53 target genes in HCT116 p53 null cells compared to HCT116-p53 +/+ cells. SPATA18 was the most upregulated gene in the differential expression, providing further insight into p53 and mitophagy via SPATA18-MIEAP.

p53 response elements (p53RE) can be shaped by long terminal repeats from endogenous retroviruses, long interspersed nuclear repeats, and Alu repeats in humans and fuzzy tandem repeats in mice. Further, p53 pervasively binds to p53REs derived from retrotransposons or other mobile genetic elements and can suppress transcription of retroelements. The p53-mediated mechanisms conferring protection from retroelements are also conserved through evolution. Certainly, p53 has been shown to have other roles in DNA context, such as playing an important role in replication restart and replication fork progression. The absence of these p53-dependent processes can lead to further genomic instability.

The frequency of variable length, long or short nucleotide repeats and their locations within a gene may be key to the repression of DNA sequences that would otherwise cause genomic instability or protein expressions that would eat bacterial mitochondria or destroy its cell host.

The complexity of variable length insertions is made evident when exhaustively analyzing a simple length 12 sequence for the potential frequency of each of its variable length repeats, starting from a minimum variable length of 8.

Then, for TGTGGGCCCACA(12)

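All possible internal variable-length substrings, from length 8 up to the full length of 12, are considered (15 in total).

For example, reviewing length 8 only, each count below is the number of those substrings that contain the repeat: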

TGTGGGCC (8) occurs 5 times

GTGGGCCC (8) occurs 8 times

TGGGCCCA (8) occurs 9 times

GGGCCCAC (8) occurs 8 times

GGCCCACA (8) occurs 5 times

Any repeat can be ranked based on its occurrence within all possible combinations of a given sequence, known as the repeat's iScore rank. This illustrates a potentially useful statistical ranking that, subject to biology, may describe a repeat's inherency to be more or less effective, in increments of the gene sequence.
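The occurrence counts listed above for TGTGGGCCCACA can be reproduced with a short Python sketch. The assumption here is that "occurrence" means the number of times a length-8 repeat appears across every substring of length 8 to 12, with substrings taken by position. The function name `containment_counts` is made up for the example, and this is only one reading of the counts, not the full iScore algorithm.

```python
from collections import Counter

def containment_counts(seq, k=8, min_len=8):
    """Count each k-mer's occurrences across every substring of seq whose
    length is at least min_len (substrings are taken by position)."""
    counts = Counter()
    n = len(seq)
    for length in range(max(k, min_len), n + 1):
        for start in range(n - length + 1):
            window = seq[start:start + length]
            for i in range(length - k + 1):
                counts[window[i:i + k]] += 1
    return counts

counts = containment_counts("TGTGGGCCCACA")
# Rank the 8-mers from most to least contained (ties broken alphabetically).
ranking = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
for kmer, c in ranking:
    print(kmer, c)
```

Under this reading the central repeat TGGGCCCA ranks highest because more of the longer substrings span it, which is why the counts are symmetric around the middle of the sequence.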

Repression of the most active sequences, especially in context of repeats may result in genetic variation.

Sunday, June 20, 2021

First Intron DNA - Site for a Genetic Brain?

DNA Methylation

The first intron of a gene, regardless of tissue or species, is conserved as a site of downstream methylation with an inverse relationship to transcription and gene expression. Therefore, it is an informative gene feature regarding the relationship between DNA methylation and gene expression. But expression in induced pluripotent stem cells (iPSCs) has been a major challenge to the stem cell industry, because by comparison these cells have not yet reached the state of natural pluripotent or embryonic stem cells (ESCs).

In mice, two X chromosomes (XC) are active in the epiblasts of blastocysts as well as in pluripotent stem cells. One XC is inactivated, triggered by Xist (non-coding) RNA transcripts that coat it until it becomes silent. Designer transcription factor (dTF) repressors binding the Xist intron 1 enhancer region caused higher H3K9me3 methylation and led to XC opening and X-linked gene repression in MEFs. This substantially improved iPSC production and somatic cell nuclear transfer (SCNT) preimplantation embryonic development. This also correlated with much fewer abnormally expressed genes frequently associated with SCNT, even though it did not affect Xist expression. In stark contrast, the dTF activator targeting the same enhancer region drastically decreased both iPSC generation and SCNT efficiencies and induced ESC differentiation.

A genome-wide, tissue-independent quasi-linear, inverse relationship exists between DNA methylation of the first intron and gene expression. More tissue-specific, differentially methylated regions exist in the first intron than in any other gene feature. These have positive or negative correlation with gene expression, indicative of distinct mechanisms of tissue-specific regulation. CpGs in transcription factor binding motifs are enriched in the first intron and methylation tends to increase with distance from the first exon–first intron boundary, with a concomitant decrease in gene expression.

Since the relationship between sequence, methylation, repression and transcription is determinative in ESC differentiation, it may also suggest a broader link to differential translation. Translation is required for miRNA-dependent transcript destabilization that alters levels of coding and noncoding transcripts. But steady-state abundance and decay rates of cytosolic long non-coding RNAs (lncRNAs) are insensitive to miRNA loss. Instead, lncRNAs fused to protein-coding reporter sequences become susceptible to miRNA-mediated decay.

In this model, first intron DNA sequences that are differentially methylated bind transcription factors that affect transcription, impacting splicing, expression of coding or non-coding transcripts and transcript destabilization, resulting in differential rates and possible variations in translation. This bottom-up, dynamic view of the classical process may elevate the first intron from 'junk' to a DNA 'brain' because it plays a more extensive role, heading the process toward translation of any gene or switching it off entirely.

For this reason, among others, Codondex uses first intron k-mers relative to the transcript's mRNA as the basis for comparing same-gene transcripts in diseased cells or tissue samples. Further, p53 and BRCA1 miRNA key sequences, discovered using the Codondex iScore algorithm, when transfected into HeLa cells resulted in significantly reduced proliferation that may result from this accelerated, transfected miRNA-dependent decay.

Tuesday, June 1, 2021

Short Sequences of Proximally Disordered DNA

Oxford Nanopore Device Reducing Sequencing Cost

Relationships exist between short sequences of proximal DNA (SSPD) of a gene that, when transcribed into RNA, present stronger or weaker binding attractions to RNA binding proteins (RBPs) that settle, edit, splice and resolve messenger RNA (mRNA). Responsive to epigenetic stimuli on histones and DNA, mRNA are constantly transcribed in different quantities at different times, such that different mRNA strands are transported from the nucleus to the cytoplasm, where they are translated into any of more than 30,000 different proteins.

Single nucleotide polymorphisms and DNA mutations can alter SSPD combinations in different diseased cells, thus altering sequence proximity and ordering, which affects the transcribed RNA's attraction and optimal binding of RBPs. This may result in modified splicing of RNA, assembly of mRNA and slight or major variations in some or all translated protein derived from that gene.

The specific effects of these DNA variations on the multitude of proteins produced are generally unknown. However, it remains important to understand their effects in disease, diagnosis and therapy. Historically these have been researched by large-scale analysis of RBP on RNA, as opposed to the more fundamental, yet underrepresented, massive array of diseased variant DNA to mRNA transitions.

Most pharmaceutical research is directed to a molecular interference targeting an aberrant protein to cure widely represented or highly impactful disease conditions of society. Economic assessments generally influence government decisions to support research based on loss of GDP contribution by a specific disease in a patient cohort. However, in the modern multi-omics era, top-down research into protein-RNA activity is descending deeper into the cell to include RNA-mRNA and mRNA-DNA customizable therapies that will eventually resolve individually assessed diseases at a price that addresses a much larger array of patient needs.

SNPs and other mutations can vary considerably in cells. These variations can cause instability during division and lead to translated differences that can ultimately drive cancerous cell growth to escape patient immunity. Like a 'whack-a-mole' game, pattern variation and mechanistic persistence eventually beat the player. Without effective immune clearance these cells can replicate into tumors and contribute to microenvironments that support their existence.

Link to video on tumor microenvironment https://youtu.be/Z9H2utcnBic

We thought to analyze DNA and mRNA transcripts from cells in tumors and their microenvironments to see if we could expose the SSPD disordered combinations that may have promoted sub-optimal RBP attractions and led to sustained immune escape. Given the complexity of DNA to mRNA transcription, for any given gene many distortions in gene data sets have to be filtered. To do that we focused on p53, the most mutated gene in cancer. We designed a method to compare sequence arrays of DNA and mRNA Ensembl transcripts, from the consensus of healthy patients to multiple cell samples extracted from different sections of a patient's tumor and tumor microenvironment.

We previously identified and measured different levels of Natural Killer (NK) cell cytotoxicity, produced from cocultures with the extracted samples of each of the multiple sites of a biopsy. We will measure the different p53 transcript SSPD combinations associated with each sample and determine whether disordered SSPDs correlate with NK cytotoxicity from each coculture. We expect to identify whether biopsied tumor cells, ranked by SSPDs, predict the cytotoxicity resulting from NK cell cocultures. We will narrow our research to identify the varied expressions of receptor combinations associated with degrees of cytotoxicity. We will test immune efficacy to lyse and destroy tumor cells. Finally, we will test for adaptive immune response.

Our vision is for per-patient, predictable cell co-culture pairings, for innate immune cell education based on ranking DNA-mRNA combinations to lead to multiple effective therapies. The falling cost of sequencing and sophistication of GMP laboratories presently servicing oncologists may support a successful use of this analytical approach to laboratory assisted disease management.

Thursday, May 13, 2021

Non-Coding DNA Key Sequences

DNA Structural Inherency

Wind two strands of elastic: eventually it will knot, and ultimately it will double up on itself. Separate the strands. From the point of unwinding, forces will be directed to different regions and the separation will approximately return to the wound state of the band. Do the same with each of 10 different bands or strings of any type, and they will all behave in much the same way. For a given section of DNA being transcribed, the effect of separation will be much the same. For a given gene, there will be sequences that can tolerate force to greater or lesser degrees. For different transcripts of a gene, variation at those sequences may be crucial to the integrity of the transcription machinery that separates DNA strands to initiate transcription to RNA, and to the outcome.

Cellular biology is enormously complex in all regards. The physics of molecular interaction, fluid dynamics, and chemistry combine in a system where cause and effect are nearly impossible to predict. At the most elementary level we hypothesize that some non-coding DNA (ncDNA) possesses structural inherencies that can be deployed to direct gene proteins and cell function for diagnosis or therapy.

Coding DNA and its regulatory, non-coding gene complement are transcribed from a gene and spliced. Transcription to RNA, edited mRNA, spliced non-coding RNA and ultimately mRNA translation to protein can produce wide-ranging, variable outcomes that may not be re-captured experimentally.

A single nucleotide polymorphism (SNP) or SNP combinations within a gene may affect the finely tuned balance that results. Under different environmental conditions this could be material to the protein produced. Additionally, other mutations of the gene could add complexity to the environment and/or the resulting protein translation.

At this level of cellular biology, genetic DNA stores instructions for protein assemblies to produce the new protein required for the fully functional cell. However, DNA's stored mutations can lead to different functional or non-functional versions of protein depending on many different factors. Relationships between ncDNA, including mutations, and the transcript's edited, protein-coding mRNA may represent unexplored inherencies that can regulate the gene's mRNA or translated protein.

We built an algorithm to compare, in detail, the ncDNA sequences of multiple protein-coding transcripts of the same gene. For each transcript it steps through every variable-length ncDNA sequence (kmer), specifically within intron1, computes a signature for each and indexes it to the constant of the transcript's mRNA signature. At each step these signatures order the kmers across the transcripts. The order is represented in a vector of all the transcripts being compared.

Over millions of successive steps (depending on total intron1 lengths) transcripts mostly retain their vector ordering except, as expected, at a kmer length change. Mostly transcript order in the vector does not change, occasionally a few positions change, and very rarely do all positions change. Position changes that cause another, like a domino effect, are filtered out. For the rarest position changes at a step, we look to the root causes in the kmer (sequence). We call this a Key Sequence because it is identified by the significance of changes to transcript positions in the vector compared to the vector at the next step.

Therefore, Key Sequences cause the most position changes between transcripts being compared by the algorithm. This relative measure is step dependent, and Key Sequences are discovered by comparing transcript positions in the vector at the next step location. Logically, this implies either a gene's structural inherency, discovered through ncDNA Key Sequence relationships to mRNA and to other transcripts, or an error in gene alignments, sequenced reads or the algorithm.
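The step-to-step comparison described above can be sketched in a few lines. This is only an illustration, not the Codondex implementation: it assumes the vector at each step is already available as a list of transcript names, the function names position_changes and rank_steps are hypothetical, and the domino filter is left out because its exact rule is not spelled out in this post.

```python
from typing import List, Tuple


def position_changes(prev: List[str], nxt: List[str]) -> int:
    """Count transcripts whose index differs between two orderings."""
    index_next = {name: i for i, name in enumerate(nxt)}
    return sum(1 for i, name in enumerate(prev) if index_next[name] != i)


def rank_steps(vectors: List[List[str]]) -> List[Tuple[int, int]]:
    """Return (step, changes) pairs, most changes first.

    Step i compares vectors[i] with vectors[i + 1]. Domino filtering
    is NOT applied here (assumption: raw position changes only).
    """
    changes = [
        (i, position_changes(vectors[i], vectors[i + 1]))
        for i in range(len(vectors) - 1)
    ]
    return sorted(changes, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    # Toy vectors for three hypothetical transcripts over four steps.
    toy_vectors = [
        ["A", "B", "C"],
        ["A", "B", "C"],
        ["B", "A", "C"],
        ["C", "B", "A"],
    ]
    for step, moved in rank_steps(toy_vectors):
        print(f"step {step} -> {step + 1}: {moved} positions changed")
```

In this reading, the steps at the top of the ranking are the candidates whose kmers would be inspected as possible Key Sequences.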

In assay testing we were able to predict and synthesize non-coding RNA Key Sequences that significantly reduced proliferation of HeLa cells. In our pre-clinical work, based on comparisons to transcripts of the TP53 consensus, we will be predicting the efficacy of cell and tissue selections that educate and activate Natural Killer cells.

If Key Sequences are inherent they could open a new frontier for diagnosis and therapy.

Saturday, February 13, 2021

Cells with an Index like Google?

It's been a while since I last wrote about DNA repeats or their RNA descendants. In that time advanced research has emerged relating repeats to increasing numbers of viral or other diseases. Generally the repeats of interest here can be either long or short sequences of nucleotides that form part of an unspliced gene. Logically, counts of long sequences that repeat would be lower than counts of short sequences, but when normalized to their respective nucleotide lengths the indexed results can shift the relative order of repeating sequences quite dramatically.
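To make the point about normalization concrete, here is a small sketch. The weighting used (count multiplied by repeat length, divided by sequence length) is an assumption chosen for illustration, and the function names are hypothetical; the post does not say which normalization Codondex applies.

```python
from collections import Counter
from typing import List, Tuple


def kmer_counts(seq: str, k: int) -> Counter:
    """Count overlapping occurrences of every k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def ranked_repeats(seq: str, lengths: List[int],
                   normalize: bool) -> List[Tuple[str, int, float]]:
    """Rank repeated k-mers by raw count or by a length-weighted score.

    Assumed normalization: count * k / len(seq), a coverage-style weight
    (overlapping repeats can push it above 1).
    """
    rows = []
    for k in lengths:
        for kmer, count in kmer_counts(seq, k).items():
            if count > 1:  # keep only sequences that actually repeat
                score = count * k / len(seq) if normalize else float(count)
                rows.append((kmer, count, score))
    return sorted(rows, key=lambda row: (-row[2], row[0]))


if __name__ == "__main__":
    toy = "GATTACAGATTACAGA"
    print("raw:       ", ranked_repeats(toy, [2, 7], normalize=False)[:3])
    print("normalized:", ranked_repeats(toy, [2, 7], normalize=True)[:3])
```

On this toy string the short repeat GA leads the raw ranking, while the 7-nucleotide repeats move to the top once length is taken into account.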

In most knowledge systems, repeats in low-level data present redundancy and an opportunity to improve efficacy in local or global upstream processes acting on that data. We see this in the structure of efficient alphabets, which had a significant impact on whether or not a language survived continuous use. Why use ten words when precise meaning, including abstracts, can be derived from three? Or why alpha when, at least for some period in the language's history, alphanumeric made it more effective?

Search engines reduce their primary index to the least redundant data set used to drive efficient data access by upstream requests and processes to satisfy any query. However, at the storage level, data redundancy is permitted because energy efficiency is gained. Similarly, genetic DNA is massively redundant. Redundant data stores can make highly indexed systems more efficient because frequently accessed data elements are more accessible at multiple locations and parallel processes can more efficiently satisfy upstream requests.

Repetitive sequences constitute 50%–70% of the human genome. Some of these can transpose positions; these transposable elements (TEs) are DNA transposons and retrotransposons. The latter are predominant in most mammals and can be further divided into long terminal repeat (LTR)-containing endogenous retrovirus transposons and non-LTR transposons, including short interspersed nuclear elements (SINEs) and long interspersed nuclear elements (LINEs). The most abundant subclass of SINEs comprises the primate-specific Alu elements, which in humans are more abundant in GC-rich DNA. Humans have up to 1.4 million copies of these repeats, which constitute about 10.6% of the genomic DNA. Long interspersed element-1 (LINE1 or L1) sequences are abundant in AT-rich DNA, constitute 19% of the human genome and make up the largest proportion of transposable element-derived sequences.

Most TE classes are primarily involved in reduced gene expression, but Alu elements are associated with upregulated gene expression. Intronic Alu elements are capable of generating alternative splice variants in protein-coding genes, which illustrates how Alu elements can alter protein function or gene expression levels. Non-coding regions were found to have a great density of TEs within regulatory sequences, most notably in repressors. TEs have a global impact on gene regulation, which indicates a significant association between repetitive elements and gene regulation.

In liquid systems, phase separation is one of the most fundamental phase transition phenomena and ubiquitous in nature. De-mixing of oil and water in salad dressing is a typical example. The discovery of biological phase separation in living cells led to the identification that phase-separation dynamics are controlled by mechanical relaxation of the network-forming dense phase, where the limiting process is permeation flow of the solvent for colloidal suspensions and heat transport for pure fluids. The application of this derived governing universal law is a step to understanding and defining the liquid biological indexing equivalence of data-processing systems and inherent genetic redundancy.

Repeats have been widely implicated. In plant immunity a TE has been domesticated through histone marks and generation of alternative mRNA isoforms that were both directly linked to the immune response to a particular pathogen. p53 transcription sites evolved through epigenetic methylation, deamination and histone regulation that constituted a universal mechanism found to generate various transcription-factor binding sites in short TEs or Alu repeats. In disease, cytoplasmic synthesis of Alu cDNA was implicated in age-related macular degeneration, and there is a transient increase of nearly 20-fold in the levels of Alu RNA during stress, viral infection and cancer.

In chromosomal DNA, each sequence, relative to its length, may conveniently describe a phase-separated indexed location and method for discovery. Repeats within genetic DNA may present precisely sensitive phase-separated guidance to drive histone, epigenetic and transcription factors to specific genetic locations at the cell's 'end-of-line', from where the genetic response to upstream membrane-bound changes begins.

Sunday, December 13, 2020

Natural Killers Linked to Overall Survival in Cancer

A meta-analysis of tumor samples, collected between 1973 and 2016 in 53 studies, confirmed that overall survival (OS) correlated with Natural Killer cell infiltration into solid tumors. The number of NK cells infiltrating solid tumors, including those considered “highly” infiltrated, was relatively low compared with other immune populations. Notwithstanding, the presence of a single NK cell within a high-powered microscopic field was associated with significantly improved OS and disease-free survival in colorectal cancer, HER2+ breast cancer and hepatocellular carcinoma.

The finding supports the prospect that single tumor-infiltrating NK cells in a sampled tissue can be determinative for OS. By inference, a single tumor-infiltrating NK cell or cells possess characteristics that are relative to OS and beneficial to the patient.

NK cell surface receptors are densely varied, defining at least 30,000 unique NK cell populations within each individual. NK cell classification relative to tumor infiltration and OS is enormously complex, especially at this scale, and present definitions of activating and inhibiting receptor combinations underwhelm. To identify NK cells that have infiltrated or may be capable of infiltrating a patient tumor to improve OS, we focused on biopsied tumor tissue selections whether or not they include NK cells.

Our work is with two tumor types in humanized mice. Multiple sections of each tumor were resected and divided into multiple parts for coculture with allogeneic naïve, IL2 and probiotic enhanced NK cells and for DNA sequencing. After coculture, NK cell cytotoxicity and other detailed measures resulting from each resected section and from single cells were assessed. Presently, sequencing of DNA from each resected, divided section (pre-coculture) is focused on comparisons derived from TP53.

In the final stage NK cells will be cocultured with resected tumor tissue and will be made to challenge new tumor tissue and single cells from the resected tumor from which the NK coculture was derived. The objective will be to determine whether Codondex analysis of TP53 DNA sequencing can predict the most successful tumor tissue candidates based upon the most effective cocultured NK cell challenge to the tumor-derived tissue or cells.

If the Codondex algorithm is found to identify a direct or indirect logic for tissue or cell selection that is effective in vitro, our work will continue to next-stage in vivo testing and analysis on similar grounds.

Thursday, September 24, 2020

$100,000 Biohunt

Some of the past research on neoantigen and p53 antibodies in immunity has been encouraging. The data is enormously complex, but keeps pointing to TP53's great potential. To this end, we were anxious to start our mega-experiment, but were delayed by C19. Now I'm glad to report we are well underway. In co-operation with researchers at UCLA we aim to determine whether Codondex transcript analysis of TP53 can predict the best tumor tissue selection for most effective Natural Killer (NK) cell priming, activation and cell killing, including in autologous tumor microenvironments.

We're hoping to achieve a result along the path toward our ambitious clinical goal. We aim to prove that a specifically selected section from biopsied tissue can be used to effectively prime autologous NK cells for patient reapplication and disease treatment.

This co-culture vs. sequencing challenge uses sections (T1-T8) taken from each of two tumors. Each section is co-cultured with two treated and one naive NK cell line, and the efficacy of NK cell cytotoxicity against tumor cells and tumor tissue is tested in killing assays. Separately, by sequencing TP53 of each selection and computing the Codondex iScore(TM) algorithm, we hope to identify specific features of each tissue selection that point computed results to research outcomes.

Co-culture vs. Sequencing Challenge

To better understand the analysis and encourage research contributions we are inviting applicants for first grants directed toward this objective.

Codondex tools analyse genetic sequences at an arbitrary number of nucleotides. The tool provides an easy way to observe fine repetitive details of small subsequences contained within a gene. We compute various metrics for each subsequence including 'Inclusiveness', which measures the total number of times every computed smaller subsequence is found within the subsequence of interest.

Our primary interest is intronic, non-coding DNA in multi-transcript genes. In these systems we create a transcript list, which we call the Vector, that is sorted by Codondex i-Score. This metric looks at Inclusiveness scaled by the length of the subsequence, to better account for the intrinsic probability of finding smaller subsequences within progressively longer ones. Using this we look at the way the order of this vector changes from subsequence to subsequence. Large changes in these vectors then prompt us to tag them for further investigation, as they represent a large deviation from transcript similarity, with this subsequence being labelled a Key Sequence.
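For readers who want something concrete to experiment with, here is one possible reading of Inclusiveness and the length-scaled score, written as a rough sketch. It is not the Codondex i-Score: the choice to count only smaller subsequences that occur more than once, the division by length, the min_len parameter and all function names are assumptions made for this illustration.

```python
from collections import Counter
from typing import Dict, List


def inclusiveness(subseq: str, min_len: int = 2) -> int:
    """Assumed reading: total occurrences of every smaller subsequence
    that appears more than once inside subseq (overlaps counted)."""
    counts = Counter(
        subseq[i:i + k]
        for k in range(min_len, len(subseq))
        for i in range(len(subseq) - k + 1)
    )
    return sum(c for c in counts.values() if c > 1)


def length_scaled_score(subseq: str, min_len: int = 2) -> float:
    """Inclusiveness divided by subsequence length (assumed scaling)."""
    if not subseq:
        return 0.0
    return inclusiveness(subseq, min_len) / len(subseq)


def vector(subseqs: Dict[str, str], min_len: int = 2) -> List[str]:
    """Order transcript names by descending score, ties by name."""
    return sorted(
        subseqs,
        key=lambda name: (-length_scaled_score(subseqs[name], min_len), name),
    )


if __name__ == "__main__":
    # Hypothetical subsequences taken at the same step from two transcripts.
    step = {"t1": "ACGTGA", "t2": "ACACAC"}
    print(vector(step))
```

Running the same ordering at each successive subsequence and watching where the list reshuffles is the kind of comparison the paragraph above describes.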

Codondex is proposing 3 grants for open problems to aid in our journey towards a more biologically useful platform. These 3 problems span statistical analysis, data acquisition and biological relevance of various aspects that are integral to our platform.

Applicants should inquire further and sign up here.

Monday, June 8, 2020

Oil and Water and Cellular Function

Genetic DNA is a string of single nucleotides along a sugar-phosphate spine that winds around proteins, called histones, and collapses into a chromosome assembly. At specific 'gene' locations DNA is often unwound and transcribed into smaller, related RNA strings that can be incorporated by clustered proteins to attract and assemble amino acid combinations that may fold into functional proteins. Aqueous proteins aggregate in complex units and interact with DNA, RNA, amino acids and other proteins to build life on planet Earth.

Entropy can disrupt the order of liquid-liquid phase separation (LLP) and other density-based separations that govern events affecting DNA and are central to cellular biophysics. Since the discovery of DNA in 1869 and its double helix structure in 1953, research has been directed to decipher the vast string assemblies of billions of these ordered acid combinations that govern cells of different species. Recently, research has more beautifully described how orders of short repeating DNA sequences govern cellular mechanics and provide insight into the delicate balance in aqueous separations.

Chromosomes of cells that divide and replicate are tethered via a centromere that includes concentrated short, ordered DNA combinations repeated at extending distances along the sugar-phosphate spine. They attract proteins and other epigenetic factors that may direct the cell's centrosome - a protein tube geared to a vast cytoskeleton spindle to move chromosomes and the cell's skeletal structure in response to activity on its centromere and distant regions.

Intron regions of genes are considered regulatory since exons, or DNA coding regions, when transcribed into RNA exclusively translate combinations of amino acids for protein. The intron regions of yeast centromeres were found to promote formation of centromeric heterochromatin - DNA wound around histones and methylated to repress regions and maintain lineage during replication.

A study of centromere heterochromatin surprisingly showed that distant euchromatic regions enriched in repressed methylated genes also interacted with the hierarchical organization of centromeric DNA. These 3D spatial interactions are likely mediated by LLP (similar to how oil and vinegar separate in salad dressing), resulting in liquid-like fusion events, and can influence the fitness of individuals. Repressed genes were identified as Transposable Elements (TEs), sequences often associated with pathogenic DNA insertions that have been persistently retained.

A study found 96.3% of TEs enriched in 156 gene bodies overlapped introns, in line with the normally observed distribution of introns and exons in the human genome. Across cells in different tissues, genes that are consistently replicated are less likely to be associated with TEs. Multiple TEs in tissue-specific, active regulatory regions are enriched in intron enhancer sequences to attract and bind protein transcription factors as master replication regulators.

TEs have mostly been analyzed by the frequency of short identical repeating sequences, but methods have not revealed the full extent of the TE repeat hierarchy. When any part of a TE is replicated and released from its sugar-phosphate spine, the hierarchy of repeats may affect dissociation. Codondex built a uniform analytic to tease out the inherent hierarchy of repeating sequences that may expose separation potential whether or not the DNA is classified as a TE.

As outlined, repressed DNA regions with more frequent repeats are less actively transcribed into RNA. Therefore, actively transcribed regions yield more RNA for coding proteins, and edited intron RNA can accumulate to concentrate in the liquid nucleus, be transported to the cytoplasm or be degraded. A cell's machinery must be finely tuned to process the RNA remnant of DNA transcription, but mutations and aberrant separations can disrupt the order of these finely tuned micro-organisms.

If repeats define a universal separation hierarchy that is heavily weighted toward regulatory introns, then de novo chromosome and gene repeat analysis may identify distant and centromeric influences on the centrosome. The iScore(TM) algorithm repeatedly explodes any DNA or RNA string into its ordered, theoretical hierarchy of repeats until the smallest required string length, and may provide a structural basis for liquid separations. A repeat hierarchy for any gene would also have to relate to its chromosome repeats for inherent, universal influence over 3D spatial interaction and potentially cell function.

The complete record of repeats for an average length gene explodes to 100,000,000+ ordered strings representing its iScore signature. If a repeat hierarchy does exist for aqueous aggregations, a gene transcript's intron iScore should be sufficient to measure and compare its inherent repeat potential to other transcripts. Significant consecutive iScore variations with any of the 100,000,000+ strings could be used to expose systemic, structural separation differences for that transcript in the context of other transcripts in their aqueous environments.
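A quick back-of-envelope check shows how quickly the number of strings grows. The sketch below assumes that "ordered strings" means every contiguous substring down to a chosen minimum length, which is an interpretation rather than a documented iScore detail; the function name is hypothetical. Under that assumption a sequence of n nucleotides yields (n - m + 1)(n - m + 2) / 2 substrings of length m or more.

```python
def ordered_string_count(n: int, min_len: int = 1) -> int:
    """Number of contiguous substrings of length >= min_len in a
    sequence of n nucleotides (assumed meaning of 'ordered strings')."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    if n < min_len:
        return 0
    m = n - min_len + 1  # number of substrings at the minimum length
    return m * (m + 1) // 2


if __name__ == "__main__":
    for length in (1_000, 14_142, 15_000):
        print(f"{length:>6} nt -> {ordered_string_count(length):,} strings")
```

Under this counting, a sequence of a little over 14,000 nucleotides is already enough to pass the 100,000,000 mark.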

Thursday, December 19, 2019

Therapeutic Coding and non-Coding DNA Relationships

Relationships of coding and non-coding intra-gene DNA are good cause for intense research and scientific debate. Many cellular functions of non-coding DNA have been discovered in the past 30 years, but prior to that these genomic regions were mostly considered 'junk'.

Probing relationships between a gene's protein-coding cDNA and at least one non-coding DNA section of the transcript, which in our work is intron1, can yield important data about genomic features in the combination. Over the past 7 years we focused on interrogating combination relationships across multiple transcripts to construct intra-gene DNA signatures from apparently disparate DNA elements that are known to perform vastly different biological functions, yet are proximal and often adjacent.

First we considered that codon to amino acid coding may operate a little differently from the classical view if reading a first and second nucleotide made the third deterministic. This method would not alter the outcome of known protein coding, but it may alter the way we consider combination relationships between nucleotides. For a transcript, any given length of cDNA and its respective intron1 sequence could possess undiscovered intrinsic order. In a model where order was tightly honored, transcript relativity may identify cDNA sequences that caused significant change in the order at each next nucleotide step.

To investigate transcripts, we computed every length of cDNA k-mer starting from the first nucleotide. We associated k-mers of every possible length with the cDNA transcript's intron1 signature. Then, for a set of multiple transcripts of the same gene, our algorithm ordered the transcripts, in nucleotide order, into a vector based on their respective cDNA-kmer:intron1-signatures. Stepping through from one k-mer to the next, we observed whether the next k-mer significantly changed the order of transcripts in the vector. After filtering domino effects we ranked k-mers with the most significantly changed transcript order from the previous k-mer.

Size of circle 'K' in the example indicates k-mer length, but we only compare same length K

In the above example, it is evident that k-mer2 vs k-mer3 was the most changed because all three transcript positions moved without a domino effect. From the vector we identify intra:inter transcript conditions in next nucleotide relationships as represented in the k-mers.

As an example, in our work with 15 viable consensus transcripts for p53, occasionally all 15 transcripts in the vector changed positions at the next k-mer. These intra transcript k-mer relationships govern the transcripts' order in the vector, but when, at the next k-mer, transcript order is relaxed and positions move, particularly where the significant majority of positions move, it indicates that the intra transcript k-mer condition is relative to other transcript k-mers in the vector. The more and the further transcripts move positions in the vector, the more relevant their intra transcript k-mer relationships are likely to be to the gene.
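The mechanics of building a vector at every k-mer step can be sketched as follows. This is a toy under stated assumptions, not our production algorithm: the signature used here (GC fraction of the cDNA k-mer minus GC fraction of intron1) is a placeholder invented for the example, the function names are hypothetical, k-mers are taken as prefixes growing from the first nucleotide, and no domino filtering is applied.

```python
from typing import Callable, Dict, List, Tuple


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def toy_signature(kmer: str, intron1: str) -> float:
    """Placeholder for a cDNA-kmer:intron1 signature (assumption)."""
    return gc_fraction(kmer) - gc_fraction(intron1)


def vectors_by_step(
    transcripts: Dict[str, Tuple[str, str]],
    signature: Callable[[str, str], float] = toy_signature,
) -> List[List[str]]:
    """Order transcript names at every prefix k-mer length.

    transcripts maps a name to (cDNA, intron1). Stepping stops at the
    shortest cDNA so every transcript has a k-mer at each step.
    """
    if not transcripts:
        return []
    steps = min(len(cdna) for cdna, _ in transcripts.values())
    vectors = []
    for k in range(1, steps + 1):
        ordered = sorted(
            transcripts,
            key=lambda name: (
                signature(transcripts[name][0][:k], transcripts[name][1]),
                name,  # tie-break on name for a stable order
            ),
        )
        vectors.append(ordered)
    return vectors


if __name__ == "__main__":
    toy = {"t1": ("GGAT", "AT"), "t2": ("ATGC", "AT")}
    for k, order in enumerate(vectors_by_step(toy), start=1):
        print(k, order)
```

Swapping in a different signature function leaves the stepping logic unchanged, which makes it easy to test how sensitive the ordering is to the signature chosen.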

This transcript comparative presents a new method for diagnosis and therapy because each new transcript, when compared to the consensus set has the capacity to disrupt order in the vector and yield k-mers that are specifically relevant to the gene. In our assay testing we were able to predict and synthesize ncRNA sequences that significantly reduced proliferation of HeLa cells. In our pre-clinical work, based on comparisons to transcripts of the TP53 consensus we will be predicting the efficacy of cell and tissue selections that educate and activate Natural Killer cells.

Pre-clinical flow chart to educate NK cells with tumor tissue/cell co-cultures and prove prediction

Codondex