Wednesday, August 13, 2025

Repeats as Signatures of Regulatory Potential


In the vast landscape of AI genomics, emerging analyses reveal non-coding DNA (ncDNA) as a treasure trove of regulatory information. At Codondex, our k-mer-based approach uncovers how repeated k-mers (short DNA subsequences of length k) serve as powerful signatures of regulatory potential. By viewing these repeats through a topological lens, we transform linear sequences into networks that highlight subtle distinctions between gene transcripts, offering new insights into gene regulation, isoform diversity, and disease mechanisms.

The Codondex Method: From Sequences to Topology

Codondex begins by "amplifying" ncDNA sequences associated with gene transcripts, generating all contiguous k-mers of length 8 or greater. For a gene like TP53, with its multiple isoforms (variants), we associate these k-mers with transcript-specific signatures derived from cDNA, mRNA or protein constants. The result? A rich dataset of subsequences, where repeats—identical k-mers appearing multiple times—emerge as key players.
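To make the enumeration step concrete, here is a minimal sketch of generating every contiguous k-mer of length 8 or greater and keeping the ones that repeat. It is an illustration only, not the Codondex pipeline: the function name `repeated_kmers`, the `max_k` parameter, and the toy sequence are all assumptions introduced for this example.

```python
from collections import Counter

def repeated_kmers(seq, min_k=8, max_k=None):
    """Count all contiguous k-mers with min_k <= k <= max_k; keep those seen more than once.

    Hypothetical helper for illustration; not the Codondex implementation.
    """
    seq = seq.upper()
    max_k = max_k or len(seq)  # default: allow k-mers up to the full sequence length
    counts = Counter()
    for k in range(min_k, max_k + 1):
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    # A "repeat" here is an identical k-mer that occurs at least twice.
    return {kmer: n for kmer, n in counts.items() if n > 1}

# Toy sequence (invented): a 10-base motif, a 4-base spacer, then the motif again.
toy = "ACGTACGTAGGCTTACGTACGTAG"
repeats = repeated_kmers(toy)
for kmer, n in sorted(repeats.items(), key=lambda item: (-len(item[0]), item[0])):
    print(len(kmer), kmer, n)
```

Enumerating every length is quadratic in sequence length, which is fine for a single intron but would likely need a suffix-array or hashing strategy at genome scale.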

Rather than treating DNA as a flat string, we interpret it topologically: k-mers become nodes in a graph, and repeats form edges that indicate connections, clusters, and symmetries. Metrics like i-Score (normalizing contained k-mers by length) and inclusiveness (repeat frequency) rank these patterns, while cDNA or protein vectors capture fine distinctions. In our analyses of genes such as MEN1 and TP53, symmetries in repeat length and frequency stand out, and they are not explained by obvious features such as reverse complements. These non-random patterns suggest that repeats are not artifacts but signatures that encode regulatory information.
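One way to picture the graph view is sketched below. The exact edge rule and the i-Score and inclusiveness formulas are not spelled out in this post, so everything here is an assumed stand-in: nodes are repeated k-mers, an edge links a k-mer to each shorter repeated k-mer it contains, `frequency` is the raw repeat count, and `contained_per_base` is a hypothetical length-normalized count loosely inspired by the i-Score description.

```python
from collections import Counter, defaultdict

def count_kmers(seq, min_k=8):
    """Count every contiguous k-mer of length >= min_k."""
    counts = Counter()
    for k in range(min_k, len(seq) + 1):
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def repeat_graph(seq, min_k=8):
    """Build a toy containment graph over repeated k-mers (assumed edge rule)."""
    counts = count_kmers(seq.upper(), min_k)
    nodes = {kmer: n for kmer, n in counts.items() if n > 1}
    edges = defaultdict(set)
    for longer in nodes:
        for shorter in nodes:
            if shorter != longer and shorter in longer:
                edges[longer].add(shorter)  # longer k-mer contains the shorter one
    return nodes, edges

def toy_scores(nodes, edges):
    """Hypothetical metrics; NOT the published i-Score or inclusiveness formulas."""
    return {
        kmer: {
            "frequency": nodes[kmer],
            "contained_per_base": len(edges[kmer]) / len(kmer),
        }
        for kmer in nodes
    }

nodes, edges = repeat_graph("ACGTACGTAGGCTTACGTACGTAG")
scores = toy_scores(nodes, edges)
ranked = sorted(scores, key=lambda k: scores[k]["contained_per_base"], reverse=True)
print(ranked[0], scores[ranked[0]])
```

In this toy ranking, the longest repeat scores highest because it contains the most shorter repeats relative to its length; a real scoring scheme may weight things quite differently.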

Repeats as Regulatory Hotspots

How do these repeats signal regulatory potential? First, they often manifest as binding sites for proteins. Repetitive motifs can amplify affinity for transcription factors or splicing regulators. In TP53 introns, high-frequency k-mers align with p53-binding elements, potentially modulating tumor-suppressive isoforms. Variants with asymmetric repeats might weaken these interactions, leading to dysregulation in cancer.

Second, repeats influence secondary structures. Topologically, frequent repeats create "hubs" in the network, fostering DNA/RNA folds like hairpins that affect chromatin accessibility or mRNA stability. Our study of MEN1 intron 1, which analyzed 15 variants, revealed length-biased repeat clusters in scatter plots, even though the algorithms are length-agnostic. This indicates structured motifs that differentiate stable from unstable transcripts. Disruptions from short repeats, as seen in TP53 vectors, act like regulatory "switches," fine-tuning expression in response to cellular stress.

Third, symmetries in repeats point to evolutionary conservation. Equal-length k-mers recurring with balanced frequencies form symmetric graphs, preserving robust modules across species. In MEN1, which is linked to endocrine tumors, these patterns suggest intron-driven adaptations for hormone regulation. Disruptions in variants could flag pathogenicity, enabling predictive modeling that does not rely on coding sequence.
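As a rough illustration of what "equal-length k-mers recurring with balanced frequencies" could look like in data, the sketch below groups repeated k-mers by length and checks whether each length class shares a single frequency. The names `frequency_profile` and `is_balanced`, and the definition of "balanced" as identical counts, are assumptions made for this example rather than the criteria used in our analyses.

```python
from collections import Counter, defaultdict

def frequency_profile(seq, min_k=8):
    """Map each repeat length to the sorted list of repeat frequencies at that length."""
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for k in range(min_k, len(seq) + 1)
        for i in range(len(seq) - k + 1)
    )
    profile = defaultdict(list)
    for kmer, n in counts.items():
        if n > 1:  # keep repeats only
            profile[len(kmer)].append(n)
    return {length: sorted(freqs) for length, freqs in profile.items()}

def is_balanced(freqs):
    """Assumed criterion: every repeat of this length occurs the same number of times."""
    return len(set(freqs)) == 1

profile = frequency_profile("ACGTACGTAGGCTTACGTACGTAG")
for length in sorted(profile):
    print(length, profile[length], is_balanced(profile[length]))
```

Comparing such profiles across transcript variants is one simple way to spot a variant whose repeat classes have drifted out of balance.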

Real-World Implications and Validation

Our deep k-mer analysis, first detailed in a 2018 blog post, showed MEN1 intron symmetries predicting protein outcomes, a result later validated through lab tests at Tel Aviv University. For TP53, stable vector positions disrupted by specific repeats correlated with isoform-specific roles, highlighting ncDNA's influence on cancer hallmarks.

This topological view empowers genomics: identifying regulatory elements for drug targeting, differentiating disease variants, and advancing precision medicine. At Codondex, we're excited to explore how these repeat signatures unlock ncDNA's secrets—join us in redefining genomic potential.